Confusions about posterior predictive checks

Posted on February 7, 2009 2:56 PM by Andrew

I recently reviewed a report that used posterior predictive checks (that is, taking the fitted model and using it to simulate replicated data, which are then compared to the observed dataset). One of the other reviewers wrote (in response to the report, not to me):

The model goodness-of-fit statistics that the authors present on this page are biased, and should be interpreted with at least some caution. They give an over-optimistic evaluation of the fit of the hierarchical Bayes model. This is because the data are used twice: once to fit the model, and once again to assess the fit of the model. In fact, the posterior p-values are not asymptotically uniform, as they should be.

I completely disagree! I’ve discussed this point before. But the attitude expressed in the above quote is held strongly enough, and commonly enough, that I’m willing to spend some time trying to clear things up.

Let’s unpack things.

1. The reviewer wrote that posterior p-values “should be” asymptotically uniform. As I’ve written before, the classical p-value of a pivotal test statistic T (that is, a test statistic whose distribution does not depend on any unknown parameters) has two properties:
– The p-value is the probability that the observed test statistic would be exceeded in replicated data: Pr(T(y.rep) > T(y)). (I’ll assume for simplicity that the test statistic is continuous so we don’t have to worry about ties.)
– The p-value has a uniform(0,1) distribution if the model is true.

2. Unfortunately, in general it is not possible to satisfy both these properties. I find it helpful to give them different names:
– I use “p-value” to describe any expression of the form Pr(T(y.rep) > T(y)), with y.rep representing replications under some fitted model.
– I use “u-value” to describe any function of the data that has a U(0,1) distribution under an assumed model. (The “u” in u-value stands for “uniform.”)

I am distinguishing between p-values (posterior probabilities that specified antisymmetric discrepancy measures will exceed 0) and u-values (data summaries with uniform sampling distributions). As discussed in the above-linked article, p-values, unlike u-values, are Bayesian probability statements in that they condition on observed data.

I’m not (yet) saying that you shouldn’t look at u-values. But I certainly think it’s a misconception to say that p-values “should” have a uniform distribution.
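
To make the distinction concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration and not taken from the report under review: the model (y_i ~ N(theta, 1) with a flat prior on theta), the two test statistics (sample mean and sample maximum), and the function names. It simulates many datasets from the true model, computes the posterior predictive p-value Pr(T(y.rep) > T(y) | y) for each, and compares the spread of those p-values with that of a uniform(0,1) variable.

```python
import numpy as np

def ppc_pvalue(y, stat, n_draws=1000, rng=None):
    """Monte Carlo estimate of Pr(T(y.rep) > T(y) | y).

    Assumed model (illustrative only): y_i ~ N(theta, 1), flat prior on theta,
    so the posterior is theta | y ~ N(mean(y), 1/n).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    # Draw theta from its posterior.
    theta = rng.normal(np.mean(y), 1.0 / np.sqrt(n), size=n_draws)
    # Simulate one replicated dataset per posterior draw.
    y_rep = rng.normal(theta[:, None], 1.0, size=(n_draws, n))
    # Fraction of replications whose statistic exceeds the observed one.
    return float(np.mean(stat(y_rep, axis=1) > stat(y)))

def pvalues_under_true_model(stat, n=20, n_datasets=300, seed=0):
    """P-values for many datasets generated from the model itself (theta = 0)."""
    rng = np.random.default_rng(seed)
    return np.array([
        ppc_pvalue(rng.normal(0.0, 1.0, size=n), stat, rng=rng)
        for _ in range(n_datasets)
    ])

p_mean = pvalues_under_true_model(np.mean)
p_max = pvalues_under_true_model(np.max)

print("T = mean: sd of p-values =", round(p_mean.std(), 3))
print("T = max:  sd of p-values =", round(p_max.std(), 3))
print("uniform(0,1) reference sd =", round(1 / np.sqrt(12), 3))
```

In this toy setup the p-values for T = mean cluster tightly around 0.5 even though the model is true, because the replicated mean is centered at the observed mean. The p-values for T = max are spread much more widely. Each one is still a valid posterior probability statement; it is just not a u-value.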

3. Why might you want to work with u-values? One reason might be that, if nothing is happening in your dataset, you want an exactly 5% chance, and no less, of rejection–of saying your model is false. But in the example at hand–and in almost every statistical example I’ve seen–we know ahead of time that the model is false.

All models are wrong, and the purpose of model checking (as I see it) is not to reject a model but rather to understand the ways in which it does not fit the data. From a Bayesian point of view, the posterior distribution is what is being used to summarize inferences, so this is what we want to check. Other people might want to check the model implied by their point estimate, and I have no problem with that.

4. I would not say that the posterior predictive p-value is “over-optimistic.” A p-value of 0.2 (say) is not, and should not be, interpreted as a claim that the model is “true”; rather, it might be interpreted as a statement that the model (probabilistically speaking) fits one particular aspect of the data.

5. I’m actually not a big fan of p-values. In the years since writing my paper with Meng and Stern, I’ve moved toward graphical checks. (Compare chapter 6 in the first and second editions of Bayesian Data Analysis.)

6. Regarding the claim that “the data are used twice” in posterior predictive checks: No! These are straight posterior probabilities (conditional on the model, as is appropriate given that it is the model we’re trying to check).
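
Written out, the posterior predictive p-value is an ordinary posterior expectation. The notation below (theta for the model parameters, an indicator function written as 1[...]) is my own shorthand for this sketch:

```latex
p \;=\; \Pr\bigl(T(y^{\mathrm{rep}}) > T(y) \,\big|\, y\bigr)
  \;=\; \iint \mathbf{1}\bigl[T(y^{\mathrm{rep}}) > T(y)\bigr]\,
        p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\;
        dy^{\mathrm{rep}}\, d\theta
```

The observed data y enter only through conditioning, exactly as in any other posterior probability.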

Further discussion here.

4 thoughts on “Confusions about posterior predictive checks”

Dana on February 7, 2009 at 8:50 pm said:

Andrew,
I'm curious as to what report you were reviewing that used these checks, and where the criticism came from, if this is publicly available (or disclosable) information. We've used them in our work for NASA and NRC; some of this was the reason for our earlier discussion. I had hoped this issue was tucked safely into bed.

Could you perhaps go on national TV or something to help get the message out that a posterior predictive probability is NOT the probability that the assumed model is wrong? This is a concept that just doesn't sink in. People even try to come up with numerical decision criteria for this, or worse yet, just apply the 0.05 value for "statistical significance." As to graphical checks, I try to use them, but it seems that once you get far enough up the food chain, the decision makers would rather have a single number.

Richard D. Morey on February 8, 2009 at 3:26 am said:

I like graphical model fit checks too. They are more intuitively interpretable most of the time. For some models though, like models for binary responses, it can be difficult to know how to represent the model fit graphically.

Andrew Gelman on February 8, 2009 at 11:53 am said:

Richard,

I agree but there are some methods. See, for example, the cover of Bayesian Data Analysis, or the dogs example in ARM.

Andrew Gelman on February 8, 2009 at 11:56 am said:

Dana,

1. For confidentiality reasons I don't think I should reveal more about the review.

2. I appreciate your suggestion. I think the closest thing I have to national TV is this blog.

3. In the long run, I'm hoping the p-value/u-value distinction will help. Deeper theoreticians than I have bemoaned the mixing of the Fisher and Neyman-Pearson theories that has led people to consider p=0.05 as a decision cutoff. But there still is confusion.
