John Cook considers how people justify probability distribution assumptions:

Sometimes distribution assumptions are not justified.

Sometimes distributions can be derived from fundamental principles [or] . . . on theoretical grounds. For example, large samples and the central limit theorem together may justify assuming that something is normally distributed.

Often the choice of distribution is somewhat arbitrary, chosen by intuition or for convenience, and then empirically shown to work well enough.

Sometimes a distribution can be a bad fit and still work well, depending on what you’re asking of it.

Cook continues:

The last point is particularly interesting. It’s not hard to imagine that a poor fit would produce poor results. It’s surprising when a poor fit produces good results.

And then he gives an example of an effective but inaccurate model used to model survival times in a clinical trial. Cook explains:

The [poorly-fitting] method works well because of the question being asked. The method is not being asked to accurately model the distribution of survival times for patients in the trial. It is only being asked to determine whether a trial should continue or stop, and it does a good job of doing so. As the simulations in this paper show, the method makes the right decision with high probability, even when the actual survival times are not exponentially distributed.

This is an excellent point, and I’d like to elaborate by considering a different way in which a bad model can work well.

**An example where a bad model works well because of its implicit assumptions**

In Section 9.3 of Bayesian Data Analysis (second edition), we compare several different methods for estimating a population total from a random sample in an artificial problem in which the population is the set of all cities and towns in a state. The data are skewed—some cities have much more population than others—but if you use standard survey-sampling estimates and standard errors, you get OK inferences. The inferences are not perfect—in particular, the confidence interval can include negative values because the brute-force approach doesn’t “know” that the data (city populations) are all positive—but the intervals make sense and have reasonable coverage properties. In contrast, as Don Rubin showed when he first considered this example, comparable analyses applying the normal distribution to log or power-transformed data can give horrible answers.

What’s going on? How come the interval estimates based on these skewed data have reasonable coverage when we use the normal distribution, while inferences based on the much more sensible lognormal or power-transformed models are so disastrous?

A quick answer is that the normal-theory method makes implicit use of the central limit theorem, but then this just pushes the question back one step: Why should the central limit theorem apply here? Why indeed. The theorem applies for this finite sample (n=100, in this case) because, although the underlying distribution is skewed, there are no extreme outliers. By using the normal-based interval, we are implicitly assuming a reasonable upper bound in the population. And, in fact, if we put an upper bound into the power-transformed model, it works even better.
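To make the coverage claim concrete, here is a small simulation sketch. The population below is a made-up lognormal stand-in for the city populations (the sizes N, n, and all parameters are hypothetical, not the actual data from the BDA example); it is skewed but has no extreme outliers, and the standard survey estimate of the total, with a normal-theory interval, has reasonable coverage:

```python
# Sketch (assumed setup, not the actual BDA example): coverage of the
# normal-theory interval for a population total from a skewed population.
import random
import math

random.seed(1)

# Hypothetical "city populations": skewed, but no extreme outliers.
N = 800
population = [int(math.exp(random.gauss(8, 1.2))) for _ in range(N)]
total = sum(population)

n = 100
reps = 2000
covered = 0
for _ in range(reps):
    sample = random.sample(population, n)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    est = N * ybar                                   # expansion estimate of the total
    se = N * math.sqrt(s2 / n) * math.sqrt(1 - n / N)  # with finite-population correction
    if est - 1.96 * se <= total <= est + 1.96 * se:
        covered += 1

print(covered / reps)  # coverage not far from the nominal 0.95
```

The interval can still dip below zero on unlucky samples, which is exactly the imperfection described above; the point is that coverage stays reasonable because the skewed population has a de facto upper bound.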

The moral of the story? Sometimes an ill-fitting model works well because, although it doesn’t fit much of the data, it includes some assumption that is relevant to inferences, some aspect of the model that would be difficult to ascertain from the data alone. And, once we identify what that assumption is, we can put it directly into an otherwise better-fitting model and improve performance.

If anything, you’re understating things. Every model ever written down neglects almost everything in the universe. In that sense every model is an extremely poor replica of the thing it’s modeling and is horribly wrong. And yet many, many models do give good answers.

Note also that there seems to be an Ontological-Epistemological reciprocity with models. Specifically, the more stable a phenomenon is in the real world, the easier it is to find a bad model which predicts it well. Thinking on that, and on the point made in the last paragraph of the post, may shed some light on why models ever get a right answer.

It also helps explain why model assumptions are rarely checked, while model predictions are checked all the time. Out of all the times I’ve seen a Scientist average their measurements to get an accurate number, I’ve never once seen anyone verify that the measurement device generated errors with a Normal frequency distribution. This is true even though Gauss showed Normality was the implicit assumption in taking the arithmetic average. Scientists do this every day without checking the error/frequency assumptions, and nothing bad happens! http://en.wikipedia.org/wiki/Normal_distribution#History

This recent paper by Stigler might be of relevance:

https://files.nyu.edu/ts43/public/research/.svn/text-base/Stigler.pdf.svn-base

“But the magic available when it does work [not too bad] overwhelms the very real shortcomings [when it actually is too bad].”

Andrew: you seem to be en route to reinventing an argument made by Jaynes – I recommend chapter 7 of his book, which discusses these issues at some length. The key idea is that a good model is not one that, after the fact, is seen to reproduce the frequency distribution of the system being described, but rather one that matches the data and other information (this includes assumptions of the type you discuss) available at the time of the modeling procedure while avoiding making any assumptions not based on available information. Also: calling a model that fails to reproduce unobserved frequency distributions “bad” is an example of the sort of loaded terminology you recently posted about.

Entsophy: Gauss did indeed show that normality is the implicit assumption, but this should not be taken to be an assumption about the frequency distribution of errors (which is often a false assumption, making the success of normality assumptions appear mysterious). Instead, the normality assumption is a description of the available information. Specifically (though I might be mangling it here), it encodes the idea that the first and second order moments of the frequency distribution are estimable on realistic data sets, while higher order moments are not. Again, see chapter 7 of Jaynes (I think you are already familiar with this, but it didn’t come across in your comment).

Konrad:

I should take a look. Just to be clear, it is originally Rubin’s example, and indeed Rubin has often made that point, that for many purposes it is better to have an adjustment that fits the data than an adjustment that fits the underlying distribution. Rubin makes that point, for example, when discussing why he prefers the fitted propensity score to the “true” (hypothetical superpopulation) propensity score. Even if the true were available he’d still prefer the fitted.

I do think the case of fitted propensity scores is logically different. Ideally we want exactly equal comparison groups but _have to_ settle for randomized groups which are only equal in distribution. With randomized groups, fitted propensity scores (in this case where the true propensity score is known to be equal to say .5) can form even more equal groups. (Here, I am just remembering what Don said a few years back.)

In the case I thought was being discussed here, the model was known to be wrong, ways of making it less wrong were fallible (and cost in multiplicities), and hedging against all possible important model wrongnesses (robustness) was either not possible or did not have a desirable cost/benefit.

And I think Stigler put this better than I can in the paper I referred to.

Konrad,

I agree completely and can probably recite chapter 7 from memory. One of Jaynes’s PhD students had a paper which illustrates much of this explicitly with simulated data: http://bayes.wustl.edu/glb/near-irrelevance.pdf

But actually I think the point can be illustrated even more simply. Suppose I assume that my errors are IID Normal with standard deviation equal to 1. Further suppose my actual series of errors is 1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,.5. Will the resulting 95% interval estimate correctly identify the value of the thing you’re trying to measure? Why yes it will, even though the actual errors are nowhere near having a normal histogram, or independence, or anything else that is supposedly required to get this right.
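This claim is easy to check directly. A minimal sketch, using a hypothetical true value of 10 (the errors are the fixed sequence from the comment, not draws from any distribution):

```python
# Check: assume errors ~ IID Normal(0, 1), even though the actual errors
# are the fixed, highly non-Normal sequence below.
import math

true_value = 10.0  # hypothetical quantity being measured
errors = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 0.5]
measurements = [true_value + e for e in errors]

n = len(measurements)
xbar = sum(measurements) / n
half_width = 1.96 * 1 / math.sqrt(n)  # 95% interval under the assumed sd = 1

lower, upper = xbar - half_width, xbar + half_width
print(lower <= true_value <= upper)  # True: the interval covers the true value
```

The sample mean sits only 0.5/13 ≈ 0.04 above the true value, well inside the half-width of about 0.54, so the nominal interval succeeds despite the wildly non-Normal errors.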

So none of this is a problem for me. It’s only a problem for Frequentists who believe Normality assumptions about errors are a physical statement about the frequency of errors in real experiments. I’ve never seen someone verify this in a real example, and in the few cases I’ve heard of where people tried it for a real physical measuring device, the errors weren’t normally distributed.

Failing to believe (or verify) Normality assumptions doesn’t hold back frequentists from making inference based on clearly non-Normal data.

Why? Because the Central Limit Theorem tells frequentists – and Bayesians, and anyone else who cares to listen – that the sample mean, under ONLY plausible sampling mechanisms and often-plausible regularity conditions, is going to have as close to a Normal distribution as you’ll need to make useful statements about the mean of the population from a modest sample size. Similar results hold for much more general parameters, and their corresponding estimates.

Modeling-based approaches are great, but they’re very far from the only way to interpret e.g. the sample mean. Why does that matter? Well, saying that Normality is an “implicit assumption” of using the sample mean – like many textbooks essentially do – encourages users to focus on unhelpful areas, e.g. exact Normality of the data, rather than on independence of observations or perhaps (for linear regression) non-constant variance.

Fred,

Can you give a single example of anyone anywhere under any circumstances verifying that the assumptions of the Central Limit Theorem apply to a real physical measuring device used in a laboratory?

Entsophy, yes.

The CLT applies to independent binary observations e.g. dead/alive, which are “real” and measured (very well!) in many experiments in laboratories; showing that the CLT holds in this case is a trivial exercise.

Fred: for the errors thrown off by a measuring device in a laboratory. I’m pretty sure the measurement errors for every measuring device I’ve ever used were not independent. Just out of sheer curiosity, I’d love to see if anyone has ever had a different (verified!) experience.

Entsophy; I appreciate that not all sampling (or measuring) procedures are going to give data points that are anything close to independent, and that treating that data as independent can give garbage results. I’m sure you know this too. But obsessing over independence when something pretty close to it holds is also not helpful; not only can one not verify exact independence in practice (as Christian notes) but CLTs hold in situations other than exact independence, so one may not need to.

The bigger point is that in many practical situations, frequentist inference does *not* actually require the sort of strict assumptions you claim. In these situations, whether or not scientists verify assumptions made for ease of textbook derivations is irrelevant, at best.

Assuming that “verifying” here means checking, rather than demonstrating to be true — and also assuming that “in the laboratory” is not meant to imply that no measurements take place outside one:

Harold Jeffreys in his “Theory of probability” (Oxford University Press, 1939/1948/1961).

Entsophy: Independence and dependence strictly cannot be told apart by observations. The only way to diagnose dependence is to look for patterns produced by very specific models for dependence, such as high autocorrelation. Independence cannot be verified by any means.

I think that the terminology to say that a model “works” or not is confusing. What does the model actually do? A model is not a method. Neither does a model predict anything. Only a prediction rule, possibly derived from a model but maybe not, predicts something.

Sure, the arithmetic mean can be derived as the ML estimator from the normal distribution, but this doesn’t mean in any way that the normal distribution has to be true or even approximately true in order to use the mean. The mean can be computed for data from any distribution. Whether it is reasonable or not to compute a mean doesn’t only depend on the distributional shape. Example: I may run a simulation study comparing several estimators of a 1-d parameter. Is the mean or the median a better summary statistic of the estimation error? Well, the median is crappy even if the distributional shape of the errors matches a double exponential distribution almost perfectly, because it is a bad idea to have a quality measure for an estimator that is invariant against replacing the upper half of the results by something arbitrarily large. So I’d rather use the mean. This doesn’t say anything about whether the “normal distribution works” because using the mean is not motivated by distributional shape at all here.

Thinking about this a little bit more, the median would in fact be good if we *knew* that *all* underlying distributions are double exponential. But we don’t, even if this distribution *fits* very well what we see. Still, because we are interested in *quality measurement* and not *fitting*, I prefer the mean because it gives the most extreme distributions a weight that I find appropriate faced with the possibility that my compared estimators may work very badly with a small but nonzero probability.
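The invariance point is easy to see in miniature (the error values below are made up for illustration):

```python
# The sample median ignores arbitrarily large values in the upper half of
# the simulated estimation errors; the mean does not.
import statistics

errors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
blown_up = errors[:5] + [1e6] * 4  # the estimator occasionally fails badly

print(statistics.median(errors), statistics.median(blown_up))  # 0.5 both times
print(statistics.mean(errors), statistics.mean(blown_up))      # 0.5 vs. ~444444.6
```

The median reports the same quality for both scenarios even though one estimator sometimes fails catastrophically, which is exactly why the mean is preferable as a quality measure here.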

I think that it is a misguided statistician’s obsession to always think of problems in terms of probability models. Probability models can be very useful, no doubt about that, but we should not ignore that we are interested, in many if not most cases, in other aspects, too, that are not related to the “true underlying model” idea.

Christian:

My point in the example is not that the standard sample survey estimate *should* be interpreted as coming from a normal distribution. Rather, the standard estimate *can* be thus interpreted, at which point it seems reasonable to instead fit a better model. This is something that scientists (not just statisticians) do all the time: try to come up with models that better fit the data and are closer to some underlying truth. But, in this example, a more reasonable-seeming and better-fitting model actually gives poor inferences. In our book chapter, we explore how this happens, and we ultimately conclude that the original procedure, or model, implicitly encoded a boundedness assumption that keeps the estimate to reasonable values. Having realized this, we can go back to the better-fitting model and improve it, thus ending up with a procedure that is superior to what came before.

Nowhere in the above do I see anything “misguided,” nor do I see any “obsession.”

Andrew: My last paragraph was rather general and not meant to apply to your posting in particular. However, I do believe that the implicit mix-up of models and methods in sentences such as “an ill-fitting model works well” is a symptom of a more general bias of statisticians (and maybe scientists in general) toward identifying all (or too many) kinds of data analytic problem solving with (probability) modeling. J. Cook’s example is probably a better illustration of this than yours because it is about decision making rather than inference (although he appropriately switches his talk from distributions to methods at some point).

I realise by the way that I may sound too critical, so I should add that I like the fact that you brought up the topic. I probably agree with the data analytic conclusions despite having criticised the way the word “model” is used here, and I cannot agree more with making people aware that methodology can be good and useful even when the “official model assumptions” do not hold.