## Oscar Wilde, George Orwell, Clark Kent, and posterior predictive checks

It was Oscar Wilde, was it not, who said he would sooner believe a falsehood told well than a truth told falsely? And George Orwell who wrote that good prose is like a windowpane, but sometimes it needs a bit of Windex and a clean rag to fully do its job.

Along those lines, Don Rubin long ago convinced me of the importance of clean statistical notation. One example that’s been important to me is model checking–residual plots, p-values, and all the rest. The key, to me, is the Tukeyesque idea of comparing observed data to what could’ve occurred if the model were true. The way this was traditionally done in statistics books was to talk about data y and a random variable Y. If the test statistic is T(y), then the p-value is Pr (T(Y)>T(y)) or, more generally, Pr (T(Y)>T(y) | theta). (I’m assuming continuous models here so as to avoid having to use the “greater than or equal” symbol.)
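To make the classical version concrete, here’s a minimal Monte Carlo sketch of Pr (T(Y)>T(y) | theta); the model Y ~ N(theta, 1), the identity test statistic, and all names and numbers are my own illustration, not from any particular textbook.

```python
# Classical p-value by simulation: Pr(T(Y) > T(y) | theta) for the
# model Y ~ N(theta, 1) with T the identity. Illustrative sketch only.
import random

def classical_p_value(y_obs, theta, n_sims=100_000, seed=1):
    rng = random.Random(seed)
    # Simulate hypothetical data Y with theta held fixed, and count
    # how often T(Y) = Y exceeds the observed T(y) = y.
    exceed = sum(rng.gauss(theta, 1.0) > y_obs for _ in range(n_sims))
    return exceed / n_sims
```

With theta = 0 and observed y = 1.96, this comes out near the familiar 0.025.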

But this notation starts to break down once you start thinking about uncertainty in theta. If theta can be well estimated from data, then maybe you’re ok with Pr (T(Y)>T(y) | theta.hat). But once we go beyond point estimation, we’re in trouble, and the trouble is that y is said to be a “realization” of Y. Just as Clark Kent is a particular realization of Superman.

Can we talk about a posterior probability, Pr (T(Y)>T(y) | y)? Not really, because y and Y have no joint distribution. What could p(y,Y) possibly mean? It would involve defining things such as Pr(y=2,Y=4), which sounds sort of funny given that we’re usually talking about Pr(Y=y). If you’re careful you can probably get the right answer, but you have to be really careful when working with Y and y, sometimes treating them as two separate creatures and sometimes not. It’s notation that’s a hindrance to understanding, rather than being the sort that actually moves our logic along.

That’s why Xiao-Li, Hal, and I switched to a more formal notation, defining y (the data that have been observed or will be observed in the current experiment) and y.rep (future data; data that could be observed if the model were true). The graphical model goes like this:

theta -> (y, y.rep).

It’s tricky for me to draw the branching in Ascii so that it shows up correctly on the blog, but the idea is for there to be two arrows forking from theta, one arrow to y and the other to y.rep. y and y.rep are conditionally independent given theta. And so there is a mathematically defined joint distribution:

p (y, y.rep, theta) = p(theta) p(y|theta) p(y.rep|theta),

and thus there is also a posterior predictive distribution, p(y.rep|y), formed by conditioning on y and averaging over theta (which we typically do using simulation rather than analytic integration), and we can do graphical predictive checks, compute p-values and all the rest.
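As a sketch of that simulation recipe (using the conjugate normal model that appears in Example 1 below; the settings are my own choices): draw theta from p(theta|y), draw y.rep from p(y.rep|theta), and estimate the posterior predictive p-value as the fraction of draws with T(y.rep) > T(y).

```python
# Posterior predictive p-value by simulation for the conjugate model
# theta ~ N(0, A^2), y ~ N(theta, 1), test statistic T(y) = y.
import random

def posterior_predictive_p(y, A=100.0, n_sims=100_000, seed=2):
    rng = random.Random(seed)
    # Conjugate posterior: theta | y ~ N(A^2/(A^2+1) * y, A^2/(A^2+1))
    post_mean = A**2 / (A**2 + 1) * y
    post_sd = (A**2 / (A**2 + 1)) ** 0.5
    exceed = 0
    for _ in range(n_sims):
        theta = rng.gauss(post_mean, post_sd)  # theta ~ p(theta | y)
        y_rep = rng.gauss(theta, 1.0)          # y.rep ~ p(y.rep | theta)
        exceed += y_rep > y
    return exceed / n_sims
```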

All of this is by way of preface to my discussion with Judith Rousseau about her work on posterior predictive checks, which she refers to as “double use of the data.”

I think a big stumbling block for Rousseau is her use of the (Y,y) notation, which limits what she can do because Y and y are not treated symmetrically and it’s awkward in her notation to work with the all-important joint distribution of parameters, data, and replicated data.

There is no “double use of the data” in posterior predictive checks. We’re conditioning on the data exactly once, as always in Bayesian inference.

But Rousseau (following Berger and others) does raise a practical concern: the distribution of a posterior predictive p-value (that is, the distribution of values it could attain, if the model, including the prior distribution, were true) is not uniform, except in special cases when the test statistic is pivotal, so that its distribution does not depend on theta. As Xiao-Li, Hal, and I showed, the distribution of the posterior predictive p-value is stochastically less variable than uniform, which typically means that, if the model is true, you’re less likely to get extreme p-values than their nominal values would suggest. For example, if the model is true, you’ll typically have less than a 5% chance that a particular p-value will be less than 0.05. (What we proved in our paper was that the probability is less than 2p, or 10% in this case, but in just about all the examples we’ve seen, the probability is less than p itself.)
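A quick simulation (my own construction, using the same conjugate normal model as Example 1 below, with A=5) shows how far below the nominal rate this can go: drawing many datasets from the model itself and computing each one’s posterior predictive p-value in closed form, essentially none fall below 0.05, where a uniform p-value would land there 5% of the time.

```python
# How often does the posterior predictive p-value fall below 0.05 when
# the model (theta ~ N(0, A^2), y ~ N(theta, 1)) is in fact true?
import random
from math import erf, sqrt

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tail_rate(A=5.0, n_sims=20_000, seed=3):
    rng = random.Random(seed)
    below = 0
    for _ in range(n_sims):
        theta = rng.gauss(0.0, A)   # theta from the prior
        y = rng.gauss(theta, 1.0)   # data given theta
        # closed-form posterior predictive p-value for T(y) = y
        p = Phi(-(1.0 / (A**2 + 1)) * y / sqrt((2 * A**2 + 1) / (A**2 + 1)))
        below += p < 0.05
    return below / n_sims
```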

This behavior has led posterior predictive p-values to be called “conservative” or “uncalibrated,” but I don’t think that’s a correct description. I’d prefer to say that, in the presence of uncertainty about parameters, the p-value–the probability, if the model is true, of seeing something more extreme than the observed data–is not in general a u-value, where the latter is defined as a data summary that has a uniform distribution if the model is true. As Xiao-Li, Hal, and I discussed, prior predictive p-values are u-values, but there are lots of reasons for preferring posterior predictive checks instead, the principal reason being that it’s the posterior distribution that’s being used as an inferential summary going forward from the data analysis.

Differences in notation and terminology aside, though, Rousseau is wrestling with real issues:

1. How useful are posterior predictive checks, really, if they’re so conservative that they don’t reject false models often enough? (If a test does not reject the true model at the nominal level, it’s a safe bet that it doesn’t reject false models very often either; in statistical jargon, it will have “low power.”)

2. If we consider predictive checks not as a way of obtaining accept-reject rules but rather as summaries of model fit (as in chapter 6 of BDA), then how exactly can they be interpreted? To be understood at all, do p-values first need to be “calibrated” and transformed to a uniform distribution?

I will address both these concerns in turn.

1. My goal in model checking is not to reject false models (or, for that matter, true models). All the models I work with are false, and a simple hypothetical (sometimes called “pre-posterior”) analysis reveals that, with enough data, I would be able to reject any model I might have. The chi-squared test is a measure of sample size, and all that. So I don’t see “statistical power” as an issue in this setting.

If that’s true, you might ask, why do I test models at all? Or, to take a slightly weaker position, would I be happy if someone were to add noise to all my model checks? Sure, this would hurt my statistical power, but I just said I didn’t care about that, right? This point is a good one, and I’ll get back to it in a bit.

To return to my main thread: I check models not to “reject” (or to “accept” or “not-reject”) them but to understand the ways in which they do not match available data and our current understanding. See my discussion of a recent paper by Bayarri and Castillo for more on this point. There are different sorts of model checks corresponding to comparisons of the posterior distribution to different sorts of knowledge: prior-posterior comparisons, cross-validation, external validation, and comparisons of predictions to observed data. It is this last that people typically are talking about when they talk about posterior predictive checks, and which I view as a generalization of classical hypothesis testing and exploratory data analysis.

For an example of a model check from my recent applied work, see here.

Any particular model check is an evaluation of some particular use of a model. The key reason I’m not concerned with the power of a posterior predictive test is that I don’t treat non-rejection as acceptance. If the data fit the model under a particular test (that is, if T(y) is well within the distribution of T(y.rep)), that’s fine–but it doesn’t stop the model from having problems in other ways. A test with low power simply reflects a data summary–a test statistic–that the model happens to be able to automatically (with high probability) fit well from data. That’s what models are often supposed to do–the normal model, for example, does a good job of fitting the sample mean–and it’s fine to be able to check this.

I still care about statistical power, though–see chapter 20 of ARM, for example, or my recent article with David Weakliem. The version of statistical power that I care about is the ability of model-and-data to estimate some parameter of interest, or to distinguish between models. But I would generally do this using Bayesian inference on the parameters, not using model checking.

2. Various statisticians have argued that posterior predictive p-values are not calibrated because they do not have a uniform distribution under the null hypothesis. My answer to this criticism is: Just interpret them directly as probabilities! The sample space for a posterior predictive check–the set of all possible events whose probabilities sum to 1–comes from the posterior distribution of y.rep. If a posterior predictive p-value is 0.4, say, that means that, if we believe the model, we think there’s a 40% chance that tomorrow’s value of T(y.rep) will exceed today’s T(y). If we were able to observe such replications in many settings, we could collect them and check that, indeed, this happens 40% of the time when the p-value is 0.4, that it happens 30% of the time when the p-value is 0.3, and so forth. (Of course, this won’t really happen–our models aren’t actually true–but that’s how probability always goes.) These things are as calibrated as any other model-based probability, for example a statement such as, “From a roll of this particular pair of loaded dice, the probability of getting double-sixes is 0.11,” or, “There is a 50% probability that Barack Obama won more than 52% of the white vote in Michigan.”

Two examples, one where calibration makes sense and one where it doesn’t

As I wrote at the beginning (several screens gone, at this point), I think Rousseau is considering some important issues, and I’d like to explore them further via a couple of simple theoretical examples in which the distribution of the posterior predictive p-value is far from uniform.

In the first example, I think the nonuniform, low-powered posterior predictive check is fine, and I don’t see Rousseau’s recalibration idea as making much sense at all; rather, it destroys the useful meaning of the test. In the second example, however, the noise in the predictive distribution makes the raw test (and its p-value) essentially unusable, and recalibration fixes the problem.

The research question, then, is to understand the difference between these examples and come up with a generally useful way of understanding calibration procedures in predictive testing.

Example 1: Testing the sample mean as fit by a normal distribution. Consider the data model, y ~ N (theta, 1), prior distribution theta ~ N (0, A^2), with A set to some high value such as 100 (that is, a noninformative prior distribution). We will use as the test statistic the sample mean, y. (In this case, y is just a single data point, but that’s just for mathematical convenience. The point is that we’re using the sample mean or regression coefficient to test the fit of a normal linear model.)

The short answer is that this test will essentially never reject. In fact, the posterior predictive p-value, Pr (y.rep > y | y), is going to be near 0.5 for just about any y that might reasonably come from this model.

Here’s the math:
theta|y ~ N (A^2/(A^2+1) y, A^2/(A^2+1))
y.rep|y ~ N (A^2/(A^2+1) y, 1 + A^2/(A^2+1))
posterior predictive p-value is Phi(-(y - E(y.rep|y))/sd(y.rep|y)) = Phi(-(1/(A^2+1))*y/sqrt((2A^2+1)/(A^2+1)))
marginal (prior predictive) distribution of y: N(0,A^2+1).

Plugging in A=100, the posterior predictive p-value for data y is Phi (-y/14000), so to get a (one-sided) p-value of 0.025, you need y>28000, an extremely unlikely event if the marginal distribution of y is N(0,100^2). Even in the extremely unlikely event of y=500 (that is, five standard deviations away from the mean, under the prior predictive distribution), the p-value is still only Phi (-500/14000) = .486. Thus, in this example, we can be virtually certain that the p-value will fall between .48 and .52.
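These numbers are easy to verify; here is a quick check (my own, using the exact constant 14143 rather than the rounded 14000):

```python
# Closed-form posterior predictive p-value from Example 1, with A = 100.
from math import erf, sqrt

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

A = 100.0

def ppp(y):
    # Phi(-(1/(A^2+1)) * y / sqrt((2A^2+1)/(A^2+1)))
    return Phi(-(1.0 / (A**2 + 1)) * y / sqrt((2 * A**2 + 1) / (A**2 + 1)))
```

ppp(500) is about 0.486, and ppp(28000) is just under 0.025, as claimed above.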

But the p-value can be calibrated to have a uniform distribution under the prior predictive distribution. The recalibrated p-value is simply the normal distribution function evaluated at -y/sqrt(A^2+1). For example, with A=100 and y=500, this recalibrated p-value will be 3*10^-7.

But this is not what I want! Let’s continue with this example in which A=100 and the observed data are y=500. Sure, this is an unexpected value, clearly inconsistent with the prior distribution, and worthy of ringing the alarm in a prior predictive test. But what about the posterior distribution? Here, the model is working just fine: p(theta|y) = N (theta | 499.95, 0.99995^2). In this case, even though it is contradicted by the data, the prior distribution is acting noninformatively. The posterior distribution is fine, and an extreme p-value would be inappropriate.

To put it another way, consider this example, with data y=500, and consider sliding the value of A–the prior scale–from 50 to 5000. As this prior scale changes, there is essentially no change in the posterior distribution: under all these values, p(theta|y) is extremely close to N(theta|500,1). And, correspondingly, the posterior predictive p-value for the test statistic T(y)=y is close to 0.5. The u-value, however–the p-value recalibrated to have a uniform prior predictive distribution–changes a lot. I don’t like the u-value as a measure of model fit, as it is sensitive to aspects of the model that have essentially no influence on the posterior distribution. The point is that the sample mean really is well estimated by a normal linear model. This is not “conservatism” or “low power”; it’s just the way it is. I have no particular need for p-values to have a uniform distribution or anything like that. I interpret the p-value directly as a probability statement on y.rep.
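The sensitivity comparison is easy to see numerically (a sketch of my own, fixing y=500 and sliding A as in the text): the posterior predictive p-value barely moves, while the u-value swings from essentially zero to nearly one-half.

```python
# Fixed data y = 500; slide the prior scale A and compare the posterior
# predictive p-value with the prior-predictive-recalibrated u-value.
from math import erf, sqrt

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ppp(y, A):
    # posterior predictive p-value for T(y) = y
    return Phi(-(1.0 / (A**2 + 1)) * y / sqrt((2 * A**2 + 1) / (A**2 + 1)))

def u_value(y, A):
    # recalibrated against the prior predictive distribution, y ~ N(0, A^2+1)
    return Phi(-y / sqrt(A**2 + 1))

y = 500.0
p_vals = {A: ppp(y, A) for A in (50.0, 500.0, 5000.0)}
u_vals = {A: u_value(y, A) for A in (50.0, 500.0, 5000.0)}
```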

Example 2: Testing skewness in the presence of huge amounts of missing data.

OK, now, as promised, an example where my posterior predictive p-value gives a bad answer but Rousseau’s recalibration succeeds. Consider the normal model with n independent data points, y.i ~ N (theta, 1), a flat prior distribution on theta (OK, theta ~ N(0,A^2) with a very high value of A), and using the sample skewness as a test statistic: T(y) = (1/n)*sum(((y.i – y.bar)/s.y)^3). This test statistic is pivotal, thus its p-value (which can be calculated using the t distribution function) is also a u-value.

So far so good. Now we muddy the waters by supposing that n=1000 but with only 10 values observed; the other 990 cases are missing. And we still want to use the sample skewness as a test statistic.

Of course we could simply compute the sample skewness of the 10 observations; this will yield a p-value (and a u-value) that is just fine. But that would be cheating. We want to do our test on the full, completed data–all 1000 observations–using the Bayes posterior distribution to average over the 990 missing values.

This should work fine under recalibration, I think–I haven’t checked the details–all the noise introduced by the missing data gets averaged over in the calibration step. But the straight posterior predictive check won’t work so well. The problem is that the test statistic will be dominated by the 990 imputed values. These will be imputed under the model–that’s what “averaging over the posterior distribution” is–and so it will be virtually impossible for the test to reveal any problem.

What the posterior predictive test is saying here is that the sample skewness of a new set of 1000 observations will look similar to that of the 1000 completed observations so far (the 10 observed values plus the 990 imputations), if the model is in fact true. This is fine as a mathematical statement but not so relevant for practical evaluation of model fit.
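To see the failure directly, here is a simulation sketch of my own following the setup above: n=1000 with only 10 values observed, normal model with (effectively) flat prior on theta and known unit variance, and the sample skewness of the completed data as the test statistic. Even when the 10 observed values are grossly right-skewed (a list I made up for illustration), the 990 model-based imputations swamp the signal and the check raises no alarm.

```python
# Posterior predictive check on completed-data skewness with 99% missingness.
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

def ppp_skew(y_obs, n_total=1000, n_sims=1000, seed=4):
    rng = random.Random(seed)
    n_obs = len(y_obs)
    ybar = sum(y_obs) / n_obs
    exceed = 0
    for _ in range(n_sims):
        # flat prior, known sd 1: theta | y_obs ~ N(ybar, 1/n_obs)
        theta = rng.gauss(ybar, (1.0 / n_obs) ** 0.5)
        # complete the data by imputing the missing values under the model
        completed = list(y_obs) + [rng.gauss(theta, 1.0)
                                   for _ in range(n_total - n_obs)]
        # replicate a full data set of n_total under the same theta
        y_rep = [rng.gauss(theta, 1.0) for _ in range(n_total)]
        exceed += skewness(y_rep) > skewness(completed)
    return exceed / n_sims

# ten obviously right-skewed "observations" (made up for illustration)
y_obs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 3.0, 6.0]
```

The 10 observed values alone have sample skewness around 1.8, yet the completed-data p-value stays well away from any rejection threshold.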

OK, so here we have two examples where, if the model is true or close to true, the posterior predictive p-value will almost certainly be very close to 0.5. In the first example (checking the sample mean under a normal model), this is just fine; in the second example (with 99% missing data), it doesn’t work so well. The challenge now is to isolate what’s going on and express it as a general principle which can guide applied model checking.

1. Corey says:

I don't like the u-value as a measure of model fit, as it is sensitive to aspects of the model that have essentially no influence on the posterior distribution.

The same criticism applies to Bayes Factors and posterior probabilities of models. In fact, it seems to me that this is the main reason one ought to be leery of them (rather than on account of the discreteness of the set of possible models, which can be reasonable in settings other than social science).

2. Andrew Gelman says:

Corey: Yes, I agree with this criticism of Bayes factors. In fact, we discuss this issue in section 6.7 of Bayesian Data Analysis.

3. David says:

you mean:

y.rep
^
|
theta -> y

4. Richard D. Morey says:

Bayes factors have their uses, but if a genie popped out of a bottle and asked whether I could have knowledge of the best single number summary of model fit, or the best method of discovering how a complex model misfits, I'd pick the latter any day of the week.

5. Megan Pledger says:

I am still trying to come to grips with all this stuff but…

It seems to me that it is all fine and dandy to use posterior predictive checking and come to a conclusion that the model doesn't fit.

But aren't you using the data twice if, when you see a misfit with the model (by looking at plots of y.rep and y), you use that information to *change* the model?

Doesn't that just lead to the overfitting problems that you get in frequentist regression?

6. perceval says:

"For an example of a model check from my recent applied work, see here.