Evan Warfel asks a question:

Let’s say that a researcher is collecting data on people for an experiment. Furthermore, it just so happens that due to the data collection procedure, data is gathered and recorded in 100-person increments. (Making it so that the researcher effectively has a time series, and at some point t, they decide to stop collecting data.)

Now, let’s assume that the researcher is not following best practices and wants to compute summary statistics at each timestep. How might they tell if the data they are collecting has or is likely to have finite vs. non-finite variance? E.g. How might they tell that what they are studying isn’t best described by, say, a Cauchy distribution (where they can’t be sure that estimates of the standard deviation will stabilize with more data)? Is the only solution to run goodness-of-fit tests at each time t and just hope for finite variance?

I ask because my understanding is that many statistical calculations relating to NHST rely on the central limit theorem. Given cognitive biases like scope neglect, I wonder if people might be systematically bad at reasoning about the true amount of variation in human beings, and if this might cause them to accept the CLT as applicable to all situations.

He elaborates:

Clearly, if one’s data comes from a source generating approximately normal data, one can have a decent indication that the variance is finite via standard issue tests of normality. Maybe at large samples (like those implied by 100-person increments), this question is easy enough to resolve? But what if one only has the time/resources to gather fewer data points? Or, to flip it around—Is it definitionally possible to have diabolical distributions where the first N points sampled are likely to look normal or have finite variance, and after N points things devolve? Or does this push the definition of finite variance too far?

I’ll answer the questions in reverse order:

1. Yes, you can definitely see long-tailed distributions where the tail behavior is unclear even from what might be considered a large sample. For an example, see section 7.6 of BDA3, which is an elaboration of an example from Rubin (1983).

2. You can also see this by drawing some samples from a Cauchy distribution. In the R console, type sort(round(rcauchy(100))) or sort(round(rcauchy(1000))).

3. By the way, if you want insight into the Cauchy and related distributions, try thinking of them as ratios of estimated regression coefficients. If u and v have normal distributions, then u/v has a distribution with Cauchy tails. (If v is something like normal(4,1) then you won’t typically see those tails, but they’re there if you go out far enough.)

4. One reason why point #3 is important is that it can make you ask why you’re interested in the distribution of whatever you’re looking at in the first place.

5. To return to point #1 above, that example in BDA3, one way to get stable inferences for a long-tailed distribution is to put a constraint on the tail. This could be a hard constraint (saying that there are no values in the distribution greater than 10^7 or whatever) or a soft constraint bounding the tail behavior. In a real-world problem you should be able to supply such information.

6. To get to the original question: I don’t really care if the underlying distribution has finite or infinite variance. I don’t see this mapping to any ultimate question of interest. So my recommendation is to decide what you’re really trying to figure out, and then go from there.
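For readers without R at hand, here is a minimal Python sketch of points 2 and 3. It is not from the post: the seed, sample size, and helper names (cauchy_draw, running_sd) are assumptions chosen for illustration. It draws Cauchy values as ratios of two independent standard normals, then recomputes the sample SD after every 100-observation increment, as in Evan's setup.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed (an arbitrary choice) for reproducibility


def cauchy_draw(rng):
    """One standard Cauchy draw as the ratio of two independent standard normals."""
    return rng.gauss(0, 1) / rng.gauss(0, 1)


def running_sd(values, step=100):
    """Sample SD of the first 100, 200, ... observations."""
    return [statistics.stdev(values[:n]) for n in range(step, len(values) + 1, step)]


n = 2000
normal_sample = [rng.gauss(0, 1) for _ in range(n)]
cauchy_sample = [cauchy_draw(rng) for _ in range(n)]

# Rough analogue of sort(round(rcauchy(100))): inspect both ends of the sorted draws.
first_batch = sorted(round(x) for x in cauchy_sample[:100])
print("smallest:", first_batch[:5], "largest:", first_batch[-5:])

# Running SD at every fourth increment (n = 100, 500, 900, ...).
for label, sample in [("normal", normal_sample), ("cauchy", cauchy_sample)]:
    print(label, [round(s, 2) for s in running_sd(sample)[::4]])
```

The normal running SD should settle near 1, while the Cauchy one tends to jump whenever a new extreme value arrives.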

I think Evan makes a valid point that “people might be systematically bad at reasoning about the true amount of variation” because they are most familiar with methods that assume finite means and variances. With sufficiently robust statistical methods, it may not matter much whether the variance is infinite, but I think it more likely that people may go astray in their analysis if their mental default assumptions are more Gaussian than Cauchy.

Cue Nassim Taleb ;)

Andrew, I’m concerned your answer doesn’t directly address issues with the questions in the first place.

Hypothesis tests aren’t impacted by looking at the data. An ultimate hypothesis test would be impacted by doing hypothesis tests but not by just looking at variance to see if the data you’re collecting are reasonable for such a test. Checking such things *is* best practice. You’re confusing making sure that your data is being collected well, which is a great idea, with the bad practice of doing repeated hypothesis tests. Calculate all of the summary stats you want.

An equal variance test cannot tell you if you have equal variance. It can give you a good idea it’s not but has nothing to do with whether variances are equal. This is a very common mistake.

Also, Evan is focusing too much on what can be derived from data. Many data-generating mechanisms are reasonably well understood. The best way to figure out the answers to your questions is often from the nature of the data-generating process.

Psyoskeptic said,

“Hypothesis tests aren’t impacted by looking at the data.”

This might depend on the interpretation. For example, looking at the data can influence the “choice” of hypothesis tests to perform.

I suspect that we statisticians are, as a rule, so focused on finding practical solutions to statistical problems–on pursuing the shortest path between question and answer–that we forget to invest the work with imagination, even fancy. Not to say that we aren’t creative, despite stereotypes. Every statistical problem is unique and therefore requires a creative approach. A creative approach leads to good, even original, ideas, but an imaginative approach leads to great ones. And yes, “theoretical statistics is the theory of applying statistics.” That’s how you can tell we have a deficit.

Richard Hamming seems to have made a big effort near the end of his career to push other, younger smart people to do great things in their careers. Hamming’s argument was that, in his experience, all the “great minds” we recognize in hindsight, who saw connections among ideas that no one else did, didn’t necessarily have unique insight. They were just the only ones to ask whether a connection existed, and to keep turning that question over in their thoughts until they had an answer. Statisticians are content to leave that kind of wonder to the mathematicians, who mostly have neither the experience nor the incentives to think imaginatively about statistics. I would even say that statistics is almost uniquely barren of (living) great minds, in proportion to our abundance of great intellects. With the possible exception of Evan.

Rather than “testing” whether you have infinite variance, first rely on the following logic:

1) If you are talking about data, which is to say measurements, then all quantities in the world are bounded by some (possibly very large) number, and hence have finite variance. For example lengths are bounded by the radius of the observable universe. Particle counts are bounded by the number of particles in the universe, dollars are bounded by the total money supply.

2) Ratios can have infinite variance when the denominator has probability density in the vicinity of zero. Can your quantity actually be zero? If so, what about the numerator? Maybe compute the reciprocal? Often we have a simple model for something and it has some support near zero, but only because we’re lazy. Take the height of a person: you might have a normal(m,s) model, with m around 150 cm and s around 30 or 40 cm, but we know for damn sure that no people are anywhere near a millimeter high. Yet the normal model is convenient. In fact, though, you could use normal(150,30) truncated below at 10 cm and you’d be doing just fine, and the infinite variance of a ratio with this in the denominator would go away.

3) Regression coefficients etc: if the regression parameter translates to a prediction of a quantity and when the parameter goes towards infinity, so does the prediction of a physical quantity, then see (1). This is again a sign of a model that’s ill specified.

Basically infinite variance doesn’t really exist for most applied concepts except when we accept poor models because we’re ok with their lack of fit to the tails. Ask yourself if a quantity in the real world can exceed the maximum possible floating point number in your computer. If so, you should fix that first before worrying about infinite variance.
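The truncation argument in point (2) can be sketched in a few lines of Python. To make the effect visible in a modest sample, this sketch assumes a denominator of normal(1,1) truncated below at 0.1 rather than the height example; those numbers, the seed, and the helper truncated_normal are illustrative assumptions. Once the denominator is bounded away from zero, |u/v| can never exceed |u|/0.1.

```python
import random

rng = random.Random(7)  # arbitrary fixed seed


def truncated_normal(rng, mean, sd, lower):
    """Rejection sampler: redraw until the value is at least `lower`."""
    while True:
        v = rng.gauss(mean, sd)
        if v >= lower:
            return v


n = 20000
lower = 0.1  # assumed hard lower bound on the denominator
numerators = [rng.gauss(0, 1) for _ in range(n)]
raw_denoms = [rng.gauss(1, 1) for _ in range(n)]  # has density near zero
trunc_denoms = [truncated_normal(rng, 1, 1, lower) for _ in range(n)]

raw_ratios = [u / v for u, v in zip(numerators, raw_denoms)]
trunc_ratios = [u / v for u, v in zip(numerators, trunc_denoms)]

print("largest |u/v|, untruncated denominator:", max(abs(r) for r in raw_ratios))
print("largest |u/v|, truncated denominator:  ", max(abs(r) for r in trunc_ratios))
```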

Because of all that, let’s re-phrase the question. Instead of “testing whether something has infinite variance” let’s change it to “estimate how fat the tails are”. As soon as you change to that question you’re off to the races, because even finite variance distributions can be so fat tailed that they pose problems. A t distribution with 2.04 degrees of freedom, for example, is not as fat tailed as a Cauchy, and has a finite variance, but it’s still pretty damn fat.
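To put a number on the t(2.04) example: the variance of a standard Student t with df > 2 is df/(df-2), which is about 51 at df = 2.04. The Python sketch below (helper names, seed, and sample sizes are assumptions, not part of the comment) computes that value and then prints sample variances from a few simulated samples of 1000. These will typically be far from the theoretical figure and from each other.

```python
import math
import random
import statistics


def t_variance(df):
    """Variance of a standard Student t; finite only for df > 2."""
    return df / (df - 2) if df > 2 else math.inf


def t_draw(rng, df):
    """Student t draw as normal / sqrt(chi-square / df)."""
    chi2 = rng.gammavariate(df / 2, 2)  # chi-square(df) is Gamma(shape=df/2, scale=2)
    return rng.gauss(0, 1) / math.sqrt(chi2 / df)


rng = random.Random(11)  # arbitrary fixed seed
df = 2.04
print("theoretical variance:", t_variance(df))
for rep in range(5):
    sample = [t_draw(rng, df) for _ in range(1000)]
    print("sample variance, n=1000:", round(statistics.variance(sample), 1))
```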

This is a good analysis. Real-world data fit to power-law distributions like Pareto (think wealth, earthquake magnitudes, etc) may well result in estimates for the tail exponent alpha that, in fact, imply “infinite variance”. Heck, sometimes even the first moment is “infinite”! But this usually is because the distribution being used is, as Daniel says, not truncated from above. Any finite real-world sample will of course have finite variance, mean, and so on. But what the parameter estimate tells us is that no amount of continued sampling will stabilize whatever moment is “infinite”, so all intuition based on CLT for instance goes out the window.

So, I prefer to think of this as an *undefined* moment.
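As a small illustration of what such a parameter estimate looks like, here is a Python sketch under simplifying assumptions: the data are exactly Pareto with known minimum 1, the true alpha is 1.5, and alpha is estimated by the standard maximum-likelihood formula n / sum(log(x/x_min)). The helper names and the seed are made up for the example. For a Pareto, the mean is finite only if alpha > 1 and the variance only if alpha > 2.

```python
import math
import random


def pareto_alpha_mle(xs, x_min=1.0):
    """Maximum-likelihood tail exponent for Pareto data with known minimum x_min."""
    return len(xs) / sum(math.log(x / x_min) for x in xs)


def implied_moments(alpha):
    """Which moments of an untruncated Pareto exist for a given tail exponent."""
    return {"mean_finite": alpha > 1, "variance_finite": alpha > 2}


rng = random.Random(3)  # arbitrary fixed seed
sample = [rng.paretovariate(1.5) for _ in range(5000)]  # true alpha = 1.5 (assumed)
alpha_hat = pareto_alpha_mle(sample)
print(round(alpha_hat, 2), implied_moments(alpha_hat))
```

Every finite sample here has a finite sample variance; the estimated exponent is what signals that the sample variance will not settle down with more data.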

Rather than “testing”, I would prefer to just fit a potentially fat-tailed distribution and estimate whatever parameter yields “fat-tailedness”. For instance, rather than normal(mu, sigma), estimate student_t(df,mu,sigma) with some kind of prior on df.
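A real analysis would fit student_t(df,mu,sigma) in a tool such as Stan. The Python sketch below is only a crude stand-in to show the idea, and every modeling choice in it is an assumption: mu is fixed at the sample median, sigma gets a flat prior on the log scale over a grid around the MAD, df gets a gamma(2, 0.1) prior on a grid from 0.5 to 40, and the posterior is approximated by brute-force grid evaluation.

```python
import math
import random
import statistics


def t_loglik(xs, df, mu, sigma):
    """Log-likelihood of data xs under student_t(df, mu, sigma)."""
    const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi) - math.log(sigma))
    half = (df + 1) / 2
    return sum(const - half * math.log1p(((x - mu) / sigma) ** 2 / df) for x in xs)


def df_posterior_mean(xs):
    """Grid approximation to the posterior mean of df (mu fixed at the median)."""
    mu = statistics.median(xs)
    mad = statistics.median([abs(x - mu) for x in xs])
    df_grid = [0.5 * k for k in range(1, 81)]  # 0.5, 1.0, ..., 40
    # Log-spaced grid from 0.3*MAD to 3*MAD: equal weights give a flat prior on log(sigma).
    sigma_grid = [mad * 0.3 * 10 ** (j / 24) for j in range(25)]
    log_post = {}
    for df in df_grid:
        log_prior = math.log(df) - 0.1 * df  # gamma(shape=2, rate=0.1), up to a constant
        for sigma in sigma_grid:
            log_post[(df, sigma)] = log_prior + t_loglik(xs, df, mu, sigma)
    top = max(log_post.values())
    weights = {key: math.exp(val - top) for key, val in log_post.items()}
    total = sum(weights.values())
    return sum(df * w for (df, _), w in weights.items()) / total


rng = random.Random(1)  # arbitrary fixed seed
normal_data = [rng.gauss(0, 1) for _ in range(200)]
cauchy_data = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(200)]

light = df_posterior_mean(normal_data)
heavy = df_posterior_mean(cauchy_data)
print("posterior mean of df, normal data:", round(light, 1))
print("posterior mean of df, Cauchy data:", round(heavy, 1))
```

The point is the contrast: heavy-tailed data should pull the estimate of df down toward 1, while normal-looking data leave it wherever the prior and the grid's upper limit allow.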

Hear, hear.

Don’t say fat, say heavy. What is halfway between 1 and oo in KL distance? A. ~ 2.3853 for Students.

In fact, given that all observed quantities are bounded (leaving the ratio issue aside), not only does everything have finite variance, but distributions with infinite value range, such as the t(2.04) or even the normal, will not occur either. This, of course, doesn’t contradict Daniel’s major message, namely that the theory regarding infinite variance, although not directly applicable to real data, carries the implicit message that tails may be fat enough to cause trouble. The trouble may go away if we can observe 10^10^10^10 observations, but with a sample size of, say, 1000, real data can very easily be troublesome enough that the message from infinite variance theory applies better to them than asymptotic normality.

This is exactly right: normal models are just convenient models, as all data are really on some finite truncated interval. The thing is that truncating a normal at, say, 10 SD produces a density that is within epsilon of the untruncated one. It’s best to think of normal and other infinite-tailed models as just convenient mathematical approximations to truncated distributions. When it comes to modeling things that must be positive, though, I do stick to distributions like gamma or I explicitly truncate other distributions at 0.
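How small is that epsilon? Truncating a standard normal at plus or minus k SD removes probability mass P(|Z| > k) = erfc(k/sqrt(2)), and inside the interval the truncated density is the original one divided by one minus that mass. A short Python check (the function name is just for illustration):

```python
import math


def two_sided_tail_mass(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))


for k in (3, 5, 10):
    mass = two_sided_tail_mass(k)
    # Relative change in the density inside [-k, k] caused by renormalizing.
    print(k, mass, mass / (1 - mass))
```

At k = 10 the removed mass is on the order of 1e-23, so the renormalization is invisible at any realistic sample size.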

Actually there’s even more subtlety in the relation between “realistic” variance assumptions and asymptotic theory (“realistic” is in quotes here because in reality everything depends on everything else, so even i.i.d. or, for that matter, exchangeability won’t hold, but I leave that aside for the moment).

Consider a distribution of the form Q=(1-eps)N(a,sigma^2)+eps*delta_x, delta being the one point/Dirac distribution. This model is often used in robust statistics, the delta_x being the simplest possible model for a potential outlier occurring somewhere. In fact, this obviously has all moments existing, so is theoretically a fine and simple distribution, and the CLT applies without problems. Let’s say eps is very very small. If we have a real sample size n, surely eps could be so small that, with a probability as large as you like (99% or whatever you want), no observation from delta_x actually occurs in the sample, and the sample looks as normal as anything can be. The mean of such a sample will very nicely estimate the mean a of the normal distribution and all seems fine. Except all is not well at all. The true mean of Q is in fact (1-eps)a+eps*x, which depending on x can be arbitrarily far away from a, meaning that the sample mean from a sample that has all n observations from N(a,sigma^2) will be a very bad estimate of it (how bad it is will depend on x, which of course cannot be known if no such x is observed).
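A quick numerical version of this, as a Python sketch with made-up values (a=0, sigma=1, eps=1e-6, x=1e9, n=1000 are assumptions chosen only to make the point): the chance that a sample of size n contains no draw from delta_x is (1-eps)^n, while the true mean of Q is (1-eps)a+eps*x.

```python
import random


def prob_no_outlier(eps, n):
    """Probability that none of n i.i.d. draws from Q comes from delta_x."""
    return (1 - eps) ** n


def mean_of_q(a, x, eps):
    """True mean of Q = (1-eps) N(a, sigma^2) + eps delta_x."""
    return (1 - eps) * a + eps * x


a, sigma, eps, x, n = 0.0, 1.0, 1e-6, 1e9, 1000  # illustrative values only

rng = random.Random(5)  # arbitrary fixed seed
sample = [x if rng.random() < eps else rng.gauss(a, sigma) for _ in range(n)]
n_outliers = sum(1 for v in sample if v == x)

print("P(no outlier in sample):", prob_no_outlier(eps, n))
print("true mean of Q:", mean_of_q(a, x, eps))
print("outliers drawn:", n_outliers, "sample mean:", sum(sample) / n)
```

With these numbers the sample is outlier-free about 99.9% of the time and its mean sits near a=0, even though the mean of Q is about 1000.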

Now one could argue that in fact the x is an outlier and that the aim may well be to estimate the a unaffected by outliers, rather than the true mean of Q, EQ. In this case it’s fine if there is no observation from delta_x in the sample – disaster may strike if there is (unless robust estimators are used).

In this situation, there is some kind of “heavy tail” problem in case an x is observed, however estimating EQ properly requires that one or more of these are in the sample. In that case, given enough observations (potentially *very* many if eps is small), the CLT will make sure that we’re doing fine. The issue is that we do not *want* to do fine in case the interest is in estimating a, not EQ (which in reality is probably more common, as given away by the branding of x as “outlier”).

In that case, we’re all good if we *don’t* have enough observations to catch an outlier (as we automatically would for n large enough), and actually don’t have enough observations to even know they would exist. The CLT does *not* help us because it is based on estimating EQ (which we’re not interested in) rather than a. The asymptotics hold without issue, but in practice we want to ignore what they say! We want an estimator that doesn’t estimate what the CLT suggests should be estimated, and we will do well as long as we have a sufficiently low number of observations to not get a true impression of what the true distribution is going to be.

A similar game can be played replacing delta_x by a Cauchy (imagining that there actually can be infinite variance). In that situation, the CLT tells us that we’re going to explode, but as long as in practice nothing from the Cauchy is observed (so to say, as long as the distribution doesn’t reveal its true nastiness), we’re all fine.

By the way, this also demonstrates that if infinite variances or even dangerously heavy tails are caused by something with too small a probability, the probability that there is *any* information in a sample that allows us to diagnose this is very small.

We may feel good about our sample and naively compute a mean; that mean may estimate well what we want to estimate (say, the a, not the EQ), and all will be good. Yet from a theoretical point of view we are wrong in two respects: we use an estimator that doesn’t work for what theoretically should be estimated, and that theoretical target is something we are not aware of and wouldn’t want even if we were. Somehow the two errors cancel out. Nice, huh?

All this is another illustration of the fact that the job of models is not being “true” – in a given situation where we observe an uncontaminated sample it may well be that there would have been a certain probability of contamination, had we observed more data, but a “true” model with a so small contamination probability that we actually didn’t see any of it could only have confused, not helped.

The contamination idea (where the contaminants are very far from the regular distribution) is less helpful if we assume that users would screen their data for such aberrations, e.g. 10-foot tall 3 year olds, or whatever.

Should they do this screening – which of course is standard in applications – then the corresponding analysis tells us about the population that meet the inclusion criteria. This might not be (quite) the population everyone wants to learn about, but the resulting inference is nevertheless often useful.

As long as we’re talking about what to do in practice, this is fair enough (although its quality ultimately depends on how exactly the screening is done), however what I was going on about is how what we do relates to the theory, and this is often not as it seems to be. (Of course if you can specify a fixed data-independent bound for discarding outliers, this is “theoretically nice”, but it may be very hard to justify such a bound in many real situations, and if you involve the data, things become more complicated.)