As long as we’re talking about what to do in practice, this is fair enough (although its quality ultimately depends on how exactly the screening is done). However, what I was getting at is how what we do relates to the theory, and that is often not what it seems to be. (Of course, if you can specify a fixed, data-independent bound for discarding outliers, this is “theoretically nice”, but such a bound may be very hard to justify in many real situations, and if you involve the data, things become more complicated.)

]]>The contamination idea (where the contaminants are very far from the regular distribution) is less helpful if we assume that users would screen their data for such aberrations, e.g. 10-foot-tall 3-year-olds, or whatever.

Should they do this screening – which of course is standard in applications – then the corresponding analysis tells us about the population that meets the inclusion criteria. This might not be (quite) the population everyone wants to learn about, but the resulting inference is nevertheless often useful.

]]>Consider a distribution of the form Q = (1-eps)N(a,sigma^2) + eps*delta_x, delta_x being the one-point (Dirac) distribution at x. This model is often used in robust statistics, delta_x being the simplest possible model for a potential outlier occurring somewhere. This Q obviously has all moments existing, so it is theoretically a fine and simple distribution, and the CLT applies without problems. Let’s say eps is very, very small. For any real sample size n, eps could be so small that with arbitrarily large probability, 99% or whatever you want, no observation from delta_x actually occurs in the sample, and the sample looks as normal as anything can be. The mean of such a sample will very nicely estimate the mean a of the normal distribution, and all seems fine. Except all is not well at all. The true mean of Q is in fact (1-eps)a + eps*x, which depending on x can be arbitrarily far away from a, meaning that the sample mean from a sample with all n observations from N(a,sigma^2) will be a very bad estimator of it (how bad depends on x, which of course cannot be known if no such x is observed).
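A quick simulation makes the gap concrete. The numbers below (a = 0, sigma = 1, x = 10^6, eps = 10^-5, n = 1000) are my own illustrative choices, not from the comment:

```python
import random

random.seed(1)

# Illustrative parameters (arbitrary choices for demonstration):
a, sigma, x, eps, n = 0.0, 1.0, 1e6, 1e-5, 1000

# True mean of Q = (1 - eps) * N(a, sigma^2) + eps * delta_x:
true_mean_Q = (1 - eps) * a + eps * x  # = 10.0, far from a = 0

# Probability that no observation from delta_x occurs in a sample of size n:
p_no_outlier = (1 - eps) ** n  # about 0.99

# The typical (99%) case: all n observations come from N(a, sigma^2).
sample = [random.gauss(a, sigma) for _ in range(n)]
sample_mean = sum(sample) / n  # estimates a nicely, misses true_mean_Q badly

print(true_mean_Q, p_no_outlier, sample_mean)
```

With these numbers the typical sample mean sits within a few hundredths of a = 0, while the true mean of Q is 10.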

Now one could argue that x is in fact an outlier, and that the aim may well be to estimate a unaffected by outliers, rather than the true mean EQ of Q. In this case it’s fine if there is no observation from delta_x in the sample – disaster may strike if there is (unless robust estimators are used).

In this situation there is some kind of “heavy tail” problem in case an x is observed; however, estimating EQ properly requires that one or more of these be in the sample. In that case, given enough observations (potentially *very* many if eps is small), the CLT will make sure that we’re doing fine. The issue is that we do not *want* to do fine in case the interest is in estimating a, not EQ (which in reality is probably the more common aim, as given away by the branding of x as an “outlier”).

In that case, we’re all good if we *don’t* have enough observations to catch an outlier (as we automatically would for n large enough), and in fact don’t have enough observations to even know such outliers exist. The CLT does *not* help us, because it is based on estimating EQ (which we’re not interested in) rather than a. The asymptotics hold without issue, but in practice we want to ignore what they say! We want an estimator that doesn’t estimate what the CLT suggests should be estimated, and we will do well as long as we have few enough observations not to get a true impression of what the true distribution is.

A similar game can be played replacing delta_x by a Cauchy distribution (imagining that there actually can be infinite variance). In that situation the CLT tells us that we’re going to explode, but as long as nothing from the Cauchy is actually observed in practice (so to speak, as long as the distribution doesn’t reveal its true nastiness), we’re all fine.
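The Cauchy variant can be sketched the same way; the mixture weight eps = 10^-5 and the sample sizes below are again arbitrary choices for illustration:

```python
import math
import random

random.seed(0)
eps, n = 1e-5, 1000

# Probability that a size-n sample from (1 - eps)*N(0,1) + eps*Cauchy
# contains no Cauchy draw at all:
p_pure = (1 - eps) ** n  # about 0.99

# The typical case: the sample is pure N(0,1) and behaves perfectly.
sample = [random.gauss(0, 1) for _ in range(n)]
mean = sum(sample) / n

# For contrast, standard Cauchy draws (inverse-CDF method) now and then
# produce enormous values -- the "true nastiness" that usually stays hidden.
cauchy = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(10_000)]
print(p_pure, mean, max(abs(v) for v in cauchy))
```

So roughly 99% of such samples carry no trace whatsoever of the infinite-variance component.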

By the way, this also demonstrates that if infinite variances or even dangerously heavy tails are caused by something with too small a probability, the probability that there is *any* information in a sample that allows one to diagnose this is very small.

We may feel good about our sample, may naively compute a mean, that mean may just estimate well what we want to estimate (say, the a, not the EQ), and all will be good, except that from a theoretical point of view we are led astray in two respects: we use an estimator that doesn’t work for what theoretically should be estimated, and that theoretical target is one we are not aware of and wouldn’t want even if we were. Somehow the two errors cancel out. Nice, huh?

All this is another illustration of the fact that the job of models is not to be “true” – in a given situation where we observe an uncontaminated sample, it may well be that there would have been a certain probability of contamination had we observed more data, but a “true” model with a contamination probability so small that we actually didn’t see any of it could only have confused, not helped.

]]>This is exactly right; normal models are just convenient models, as all data really lie on some finite truncated interval. The thing is that truncating a normal at, say, 10 SD produces a density that is within epsilon of the untruncated one. It’s best to think of normal and other infinite-tailed models as just convenient mathematical approximations to truncated distributions. When it comes to modeling things that must be positive, though, I do stick to distributions like the gamma, or I explicitly truncate other distributions at 0.
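For concreteness (my calculation, not the commenter’s), the tail mass of N(0,1) beyond 10 SD can be computed with the complementary error function:

```python
import math

# Two-sided mass of N(0,1) outside +/- 10 standard deviations:
tail = math.erfc(10 / math.sqrt(2))  # roughly 1.5e-23

# Truncating at +/- 10 SD and renormalizing multiplies the density by
# 1 / (1 - tail), a factor indistinguishable from 1 in double precision.
factor = 1 / (1 - tail)
print(tail, factor)
```

The renormalization factor is within about 1.5e-23 of 1, so the truncated and untruncated densities really are within epsilon of each other.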

]]>Don’t say fat, say heavy. What is halfway between 1 and oo in KL distance? A: df ≈ 2.3853 for Student’s t.

]]>Hear, hear.

]]>Psyoskeptic said,

“Hypothesis tests aren’t impacted by looking at the data. “

This might depend on the interpretation. For example, looking at the data can influence the “choice” of hypothesis tests to perform.

]]>This is a good analysis. Real-world data fit to power-law distributions like the Pareto (think wealth, earthquake magnitudes, etc.) may well yield estimates for the tail exponent alpha that in fact imply “infinite variance”. Heck, sometimes even the first moment is “infinite”! But this is usually because the distribution being used is, as Daniel says, not truncated from above. Any finite real-world sample will of course have finite variance, mean, and so on. But what the parameter estimate tells us is that no amount of continued sampling will stabilize whatever moment is “infinite”, so all intuition based on the CLT, for instance, goes out the window.
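A small Pareto sketch shows the non-stabilizing behavior (alpha = 1.5 and the sample sizes are my illustrative choices):

```python
import random

random.seed(4)

def pareto_draw(alpha):
    # Inverse-CDF draw from Pareto(alpha) with x_m = 1;
    # using 1 - random.random() keeps the base strictly positive.
    return (1.0 - random.random()) ** (-1.0 / alpha)

alpha = 1.5  # mean exists (alpha > 1), variance does not (alpha < 2)
for n in (1_000, 100_000):
    xs = [pareto_draw(alpha) for _ in range(n)]
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    print(n, m, v)  # the "variance" keeps jumping as larger n uncovers larger draws
```

However large n gets, the sample variance is dominated by the few biggest observations, so it never settles down.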

So, I prefer to think of this as an *undefined* moment.

Rather than “testing”, I would prefer to just fit a potentially fat-tailed distribution and estimate whatever parameter yields “fat-tailedness”. For instance, rather than normal(mu, sigma), estimate student_t(df, mu, sigma) with some kind of prior on df.
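As a simple stand-in for that suggestion (plain maximum likelihood via scipy rather than the Bayesian version with a prior on df, and with fake data generated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Fake data from a genuinely fat-tailed student_t(df=3), for illustration:
data = stats.t.rvs(df=3, loc=0.0, scale=1.0, size=2000, random_state=rng)

# Fit student_t(df, mu, sigma) by maximum likelihood; a small estimated df
# signals fat tails, while a very large one says the data look essentially normal.
df_hat, mu_hat, sigma_hat = stats.t.fit(data)
print(df_hat, mu_hat, sigma_hat)
```

The point is that df is estimated on a continuum, so the “is the variance infinite?” dichotomy never has to be posed.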

]]>1) If you are talking about data, which is to say measurements, then all quantities in the world are bounded by some (possibly very large) number, and hence have finite variance. For example, lengths are bounded by the radius of the observable universe, particle counts by the number of particles in the universe, and dollars by the total money supply.

2) Ratios can have infinite variance when the denominator has probability density in the vicinity of zero. Can your quantity actually be zero? If so, what about the numerator? Maybe compute the reciprocal? Often we have a simple model for something and it has some support near zero, but only because we’re lazy. Take the height of a person: you might have a normal(m, s) model, with m around 150 cm and s around 30 or 40 cm, but we know for damn sure that no people are anywhere near a millimeter high. Yet the normal model is convenient. In fact, you could use a normal(150, 30) truncated below at 10 cm and you’d be doing just fine, and the infinite variance of a ratio with this in the denominator would go away.
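A sketch of that fix (the rejection-sampling helper and the sample size are my own illustrative choices): draw the denominator from normal(150, 30) truncated below at 10 and watch the reciprocal’s moments behave:

```python
import random

random.seed(3)

def truncated_gauss(mu, sd, lo):
    # Rejection sampling: redraw until the value clears the lower bound.
    while True:
        v = random.gauss(mu, sd)
        if v >= lo:
            return v

n = 50_000
# Denominator: height model normal(150, 30) truncated below at 10 cm,
# so it is bounded away from zero and the ratio's variance is finite.
ratios = [1.0 / truncated_gauss(150.0, 30.0, 10.0) for _ in range(n)]
m = sum(ratios) / n
var = sum((r - m) ** 2 for r in ratios) / n
print(m, var)  # small and stable; the pathology lived entirely near zero
```

Since the truncated denominator is at least 10, the ratio is capped at 0.1, and both moments are tame.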

3) Regression coefficients etc.: if the regression parameter translates to a prediction of a quantity, and when the parameter goes towards infinity so does the prediction of a physical quantity, then see (1). This is again a sign of a model that’s ill-specified.

Basically, infinite variance doesn’t really exist for most applied concepts, except when we accept poor models because we’re OK with their lack of fit in the tails. Ask yourself whether a quantity in the real world can exceed the maximum possible floating-point number in your computer. If so, you should fix that first before worrying about infinite variance.

Because of all that, let’s rephrase the question. Instead of “testing whether something has infinite variance”, let’s change it to “estimate how fat the tails are”. As soon as you change to that question, you’re off to the races, because even finite-variance distributions can be so fat-tailed that they pose problems. A t distribution with 2.04 degrees of freedom is not as fat-tailed as a Cauchy, and has finite variance, but it’s still pretty damn fat.
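That last claim is easy to check from the closed form Var = df / (df - 2), valid for Student’s t with df > 2:

```python
def t_variance(df):
    # Variance of Student's t: df / (df - 2) for df > 2; it is infinite
    # for 1 < df <= 2 and undefined for df <= 1 (the Cauchy is df = 1).
    assert df > 2
    return df / (df - 2)

print(t_variance(2.04))  # 51.0 -- finite, but huge for such a mild-looking df
print(t_variance(30))    # about 1.07 -- nearly the normal's variance of 1
```

So df = 2.04 gives a variance of 51: technically finite, practically explosive.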

]]>Richard Hamming seems to have made a big effort near the end of his career to push other, younger smart people to do great things in their careers. Hamming’s argument was that, in his experience, all the “great minds” we recognize in hindsight, who saw connections among ideas that no one else did, didn’t necessarily have unique insight. They were just the only ones to ask whether a connection existed, and to keep turning that question over in their thoughts until they had an answer. Statisticians are content to leave that kind of wonder to the mathematicians, who mostly have neither the experience nor the incentives to think imaginatively about statistics. I would even say that statistics is almost uniquely barren of (living) great minds, in proportion to our abundance of great intellects. With the possible exception of Evan.

]]>Hypothesis tests aren’t impacted by looking at the data. An ultimate hypothesis test would be impacted by doing hypothesis tests but not by just looking at variance to see if the data you’re collecting are reasonable for such a test. Checking such things *is* best practice. You’re confusing making sure that your data is being collected well, which is a great idea, with the bad practice of doing repeated hypothesis tests. Calculate all of the summary stats you want.

An equal-variance test cannot tell you whether you have equal variance. It can give you a good idea that the variances are *not* equal, but it cannot establish that they are. This is a very common mistake.

Also, Evan is focusing too much on what can be derived from data. Many data-generating mechanisms are reasonably well understood. The best way to figure out the answers to your questions is often from the nature of the data-generating process.

]]>