Discussion with Sander Greenland on posterior predictive checks

Sander Greenland is a leading epidemiologist and educator who’s strongly influenced my thinking on hierarchical models by pointing out that often the data do not supply much information for estimating the group-level variance, a problem that can be particularly severe when the number of groups is low. (And, in some sense, the number of groups is always low or always should be low, in that if there are many groups they can be subdivided into different categories.) As a result, Greenland has generally recommended that researchers use pre-determined values for hierarchical variance parameters rather than trying to estimate them from data. For many years I thought that attitude was eccentric—having been weaned on the examples of Lindley, Novick, Dempster, Rubin, etc. from the 1970s, I’d just assumed that the right way to go was hierarchical Bayes, estimating the hierarchical parameters from the data—and in my books and research articles I’ve done it that way, fitting hierarchical models with flat or (more recently) weakly informative hyperpriors. More and more, though, I’ve been seeing the point of strongly informative priors, in general and for group-level variance parameters in particular. I’ve come around to Greenland’s attitude that there is typically a lot of external (“prior”) information available on group-level variances and not so much local data information, and our analyses should reflect this. For too many years I’ve been a “cringing Bayesian,” trying to minimize the use of prior information in my analyses, but I’m thinking that this has been a mistake. I’m not all the way there yet—in particular, BDA3 remains full of flat or weak priors—but this is the direction I’m going.

Here’s an example of Greenland’s thinking on this, from a few years ago. I read it at the time, but I really wasn’t paying enough attention.

I’ve also been positively disposed to Sander Greenland (if not to Greenland itself), ever since 1997 when he published a very positive review of Bayesian Data Analysis for the American Journal of Epidemiology. This was at a time when I was feeling very paranoid after having my work attacked in an ignorant fashion by a pack of theoretical statisticians at my former workplace. It was a true relief to learn that there was a world outside that closed environment, where people were interested in fitting and learning from statistical models without ideological constraints.

So this is all just to say that I have a lot of respect for what Sander has to say, because (a) he has a track record of recommending something different than what I do and being right, and (b) he is not a methodological ideologue.

With all this as background, I’ll convey a recent email discussion I’ve had with Sander in which we disagree and I think he’s mistaken. I’ll give you the whole exchange and you can judge for yourself.

Controversy over posterior predictive checks

In classical goodness-of-fit testing, you simulate fake data from a model and compare to the data at hand. To the extent the data don’t fit the model, this indicates a problem.

From my perspective, goodness-of-fit testing (“model checking”) is different from null hypothesis significance testing (“hypothesis testing”) in that the goal of model checking is not to reject a model—we already know that essentially all our models are false—but rather to understand the ways in which our model does not fit the data. One way to put it is that, in hypothesis testing, the null hyp is typically something that the researcher does not like, and the goal is to reject it and thus prove something interesting. But in model checking, the model being checked is something that the researcher likes, the model’s falsity is taken for granted, and the goal is to find problems with it, with the ultimate goal of improving or replacing the model. My paper with Cosma Shalizi (see also here) presents this Lakatosian perspective in detail.

In practice, goodness-of-fit testing has the problem that the model is typically being compared to data that are used to fit the model itself. For linear models, a “degrees of freedom correction” is needed to adjust for this overfitting issue. For nonlinear models, or models with constraints, or models with informative prior distributions, it’s not so clear how to make this correction. This was a problem that concerned me back in the late 1980s when I was writing my Ph.D. thesis. The topic was image reconstruction; the model was linear but with positivity constraints (the intensity of an image is a nonnegative function of space), so lots of parameters were being fit but not all the constraints were active for any particular fit. For this problem, I came up with the idea of averaging over the posterior distribution to get a comparison distribution for the chi-squared statistic, a computation that could be performed using posterior simulation. I talked about the idea with my thesis advisor, Don Rubin, and he pointed out that this could be viewed as a model-checking application of his idea of multiple imputation for summarizing uncertainty, and also as a posterior predictive check of the sort discussed in his classic 1984 paper. I kept thinking about these ideas, and Xiao-Li Meng, Hal Stern, and I ultimately wrote a paper on posterior predictive checking, a paper that got rejected by various journals and eventually appeared, with many discussions, in Statistica Sinica in 1996 (see here for the article, many discussions by others, and our rejoinder).
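To give a sense of what that computation looks like, here is a minimal sketch in Python, using a toy conjugate normal model rather than the image-reconstruction model from the thesis (the model, numbers, and discrepancy below are illustrative assumptions only): draw the parameter from its posterior, simulate replicated data, and compare a chi-squared-type discrepancy for the replicated and observed data, averaging over posterior draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical, for illustration only): y_i ~ Normal(theta, sigma^2)
# with sigma known, theta ~ Normal(0, 10^2), so the posterior for theta is conjugate normal.
y = rng.normal(1.0, 1.0, size=50)
n, sigma = len(y), 1.0
prior_mean, prior_sd = 0.0, 10.0
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mean = post_var * (prior_mean / prior_sd**2 + y.sum() / sigma**2)

def discrepancy(data, theta):
    # Chi-squared-type discrepancy: sum_i (y_i - theta)^2 / sigma^2
    return np.sum((data - theta) ** 2) / sigma**2

# Posterior predictive check: average the comparison over posterior draws of theta.
n_sims = 4000
exceed = 0
for _ in range(n_sims):
    theta = rng.normal(post_mean, np.sqrt(post_var))  # draw theta from its posterior
    y_rep = rng.normal(theta, sigma, size=n)          # replicate the data given theta
    exceed += discrepancy(y_rep, theta) >= discrepancy(y, theta)

ppp = exceed / n_sims  # estimate of Pr(T(y_rep, theta) >= T(y, theta) | y)
print(f"posterior predictive p-value: {ppp:.3f}")
```

The resulting tail probability is exactly the kind of posterior predictive p-value that the rest of this post argues about.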

Since that article has come out, Xiao-Li and I have had recurring discussions with various people about the so-called miscalibration of posterior predictive checks. Here’s an example from 2007, and here’s a blog discussion on the topic from a few years ago. Also this.

The quick summary is that some people (including Sander Greenland) are unhappy that posterior predictive p-values do not in general have uniform distributions when averaging over the prior distribution, whereas other people (including me) see this as a feature not a bug. I think it all comes back to the question of what these p-values are for: is the goal to reject with a specified probability or to explore aspects of model misfit?
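To see what that non-uniformity looks like numerically, here is a minimal simulation sketch, using a hypothetical normal-mean model with test statistic T(y) = mean(y) (none of this is taken from the papers under discussion): when datasets are drawn repeatedly from the model itself, the posterior predictive p-value piles up near 0.5 rather than spreading uniformly over (0, 1).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model for illustration: theta ~ Normal(0, 1), y_1..y_n ~ Normal(theta, 1).
# Repeatedly draw a dataset from this model, compute the posterior predictive p-value
# with test statistic T(y) = mean(y), and look at how those p-values are distributed.
n, n_rep, n_datasets = 20, 1000, 2000
ppp_values = []
for _ in range(n_datasets):
    theta_true = rng.normal(0.0, 1.0)
    y = rng.normal(theta_true, 1.0, size=n)
    post_var = 1.0 / (1.0 + n)          # conjugate posterior for theta
    post_mean = post_var * y.sum()
    theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=n_rep)
    ybar_rep = rng.normal(theta_draws, 1.0 / np.sqrt(n))  # mean of n replicated observations
    ppp_values.append(np.mean(ybar_rep >= y.mean()))

# A uniform p-value would have standard deviation about 0.29; the posterior predictive
# p-value here is much more tightly concentrated around 0.5.
print(np.mean(ppp_values), np.std(ppp_values))
```

Whether that concentration near 0.5 is a defect (a loss of power) or a sensible statement (the model's predictions really are consistent with these data) is, in essence, what the exchange below is about.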

The discussion

OK, now here’s my conversation with Sander, which unfolded over email over a few days.

Gelman:

Hi, Sander. John Carlin told me that you had a problem with calibration of posterior predictive p-values. First I just wanted to emphasize that I’m not such a fan of p-values and prefer to do these checks graphically (we discuss this in chapter 6 of BDA). But second I don’t think there is any calibration problem at all: any posterior p-value can be interpreted directly as the probability that replicated data will be more extreme (in some dimension) than observed data. This paper might help clarify:
http://www.stat.columbia.edu/~gelman/research/published/ppc_understand3.pdf

Greenland:

Hi Andrew,

Thanks for sending this final version. I think you sent a preprint of it in March of last year, which seems the same.
I don’t know what to say except I still can’t follow your reasoning, and am left as unsatisfied as in the e-mail exchange among you, me and Jim Berger in March of last year.

We can set one minor issue aside: That of what to call a tail area for an observed statistic, versus a uniform statistic. For our debate I am content to call the first a P-value and the second a U-value, as in your sec. 3.

However (like Robins, Ventura and Van der Vaart, as well as Bayarri and Berger) I can’t make any sense of your argument for using a P-value that is not also a U-value (which unfortunately would be called either a UP value or a PU value). At least, I can’t make any frequentist sense of it. That is important because the aforementioned critics are making a frequentist argument, one which I would call Neyman-Pearsonian (NP).

Let me quote directly from your paper some passages that I find problematic:
“The nonuniformity of the posterior distribution has been attributed to a double use of the data (Bayarri and Castellanos, 2007), although this latter claim has been disputed based on the argument that the predictive p-value is a valid posterior probability whether or not its marginal distribution is uniform (Gelman, 2007).”
– this seems like a complete non sequitur. Bayarri and all the other critics of PPP are arguing about its poor (indeed, unacceptable) NP frequency behavior. Whether or not it is a valid posterior probability is irrelevant. To understand their complaint you might pretend for a moment you are trapped in the mindset of (say) Finney in 1950 in which Bayesian probabilities are like alchemy and astrology, exiled to the realm of pseudoscience, and that only frequency behavior matters. That the PPP fails to be uniform is a mechanical consequence of using the data twice, once to construct the predictive distribution, then again to find the point on it at which to start the area. This is just a variation on the well-known fact that a classifier constructed from data will, when used to classify the same data, grossly overestimate the sensitivity and specificity of the classifier. Likewise the data gets used in the PPP to construct a predictor and then the predictor gets evaluated against the data used in its construction. Again, whether that is a valid posterior probability never enters your critics’ arguments one way or the other, so I have no idea why you invoke that fact.

The next sentence seems to continue the non sequitur, now for Robins et al.:
“From an applied direction, some researchers have proposed adjusted predictive checks that are calibrated to have asymptotic uniform null distributions (Robins, Vaart, and Ventura, 2000); others have argued that, in applied examples, posterior predictive checks are directly interpretable without the need for comparison to a reference uniform distribution (Gelman et al., 2003).”
– whether or not PPP are “directly interpretable” in some sense is again irrelevant to the issue at hand. The whole point is that, however they are interpreted, they should not be confused with a U-value such as (say) a chi-squared test of fit. And again the reason for this is that the entire justification for using a P-value in your critics’ world (which adopts the framework of NP and Wald, even for “objective Bayes” treatments) revolves around these data-frequency properties:
For any alpha, we require
Validity: rejection of the tested hypothesis occurs no more than a fraction alpha of the time when that hypothesis is true,
Unbiasedness: rejection is minimized when the tested hypothesis is true, and
Uniform power: rejection of the tested hypothesis occurs at the maximum rate possible among valid unbiased tests, whatever the alternative is.

Setting aside technical issues about when exact UMPU tests exist, these properties can be jointly satisfied asymptotically under our standard regression models, and it turns out that the P-values that provide these tests (including the usual Wald, likelihood-ratio, and score test P-values) have to be asymptotically uniform under the test hypothesis, concentrating more and more toward 0 as one moves away from that hypothesis (and of course as the sample size increases, given the test hypothesis is false).
None of this theory invokes posterior probability, and tail area only arises as a consequence of seeking power, not as a core definition as in Fisher’s system.

Now in section 2 you lay out a series of circumscriptions that you seem to imply free you of power concerns. In fact you state you are not interested in power.
You also make statements that I can’t make sense of:
“in a classical setting we want to be assured that the data are coherent with predictive inferences given point or interval estimates.”
What is a “classical setting”? Some classify NP as classical (even though they are really latecomers by a century and a half relative to P-values and posterior probabilities).
If you reject power and NP, what are your criteria for judging a diagnostic? If you talk about repeated-sampling criteria you will be led right back to power, so by coherence you must be assigning a Bayesian meaning, in which case I think your first example exhibits the discrepancy between the formal way you are using “coherence” and what I would consider the practical, commonsensical meaning of coherence (not unlike the difference between formal and informal but commonsensical meanings of “significance”).

You say:
“We are working within a world in which the purpose of a p-value or diagnostic of fit is to reveal systematic differences between the model and some aspects of the data; if the model is false but its predictions fit the data, we do not want our test to reject.”
If the model is false in some identifiable way, that falsity will harm the accuracy of out-of-data predictions and we should want the test to warn us about this.
Whether its predictions fit the analysis data depends entirely on our criterion for fit, so to claim PPP is indicative of fit seems to me circular. The question is, does it warn us as well as it should about discrepancies among the entire set of constraints we are imposing on the problem? (which, as Box took pains to describe, is the data and the data model and the prior).

You invoke the Jeffreys-Lindley paradox, a discrepancy between NP testing and a very particular Bayesian test, which (as Lindley noted) arises from placing a prior spike at the test value, something which (I thought) we agreed made no sense in practice.

Then in section 3 you say
“This property has led some to characterize posterior predictive checks as conservative or uncalibrated. We do not think such labeling is helpful; rather, we interpret p-values directly as probabilities.”
-this seems to me to dismiss the NP perspective of your critics with no justification at all. Labeling PPP as uncalibrated is very helpful: it says they should be ignored when doing an NP analysis. Your writing here also seems to mix up Fisherian and Bayesian P-values. Now I am as ecumenical as anyone, but I view these systems as distinct languages that have to be applied separately and their meaning translated into one another. To mix up Fisherian and Bayesian quantities strikes me as akin to speaking in sentences that are a mix of English and French words and grammar. Sometimes the words are the same or translate easily, but subtle differences can change the interpretation drastically. E.g., even Fisherian tests of fit (like chi-squared) are calibrated, and can also be viewed as 0-1 rescalings of the distance between the data and the data-model manifold in observation space, so why would a Fisherian want a PPP?

Finally, your interpretation of the first example makes no sense to me at all: With y=500 you have direct evidence that something is terribly wrong with at least one of your prior expectations or your current data. If I saw the prior P-value result with some (or any) context I would be immediately suspicious that my whole view of the background or the study generating the data is seriously flawed, e.g., previous or current data has been mis-recorded by orders of magnitude; and I would definitely not want to mix the data and the prior, however little influence the prior had – instead I’d want to report the practical incoherence or inconsistency between the prior and the data!

On the whole then, your paper seems to me to fall far short of justification for using PPP or for ignoring prior checks or Fisherian tests of fit, but perhaps you have better examples (preferably real) or arguments – last year you mentioned you were trying to do something with Jim Berger. Did anything result from that?

All the Best,
Sander

From here on, I’ll remove the hellos and goodbyes and other incidental materials. The conversation continues:

Gelman:

The short answer is that I treat Bayesian p-values as posterior probabilities, and I find them to be helpful in many settings. As discussed in that article of mine, I am not particularly interested in constructing tests that reject 5% of the time if the null hypothesis is true. In almost every problem I work on, I know the model is false. To me the point of goodness-of-fit testing is not to have the goal of rejection but rather to reveal aspects of misfit of model to data, which I operationalize as settings where the predicted data from the model differ consistently from the observed data.

I think all the problems you discuss below arise based on the framework in which the goal is to reject false hypotheses as often as possible while keeping a 5% chance of rejecting true hypotheses. But this is not a problem that interests me.

On another matter, I guess we could argue all night about the expression “using the data twice,” as ultimately this phrase has no mathematical definition. From a Bayesian perspective, the posterior predictive p-value is a conditional probability given the data, and as such it uses the data exactly once.

Finally, you write, “perhaps you have better examples (preferably real).” I have many real examples of posterior predictive checks. Chapter 6 of Bayesian Data Analysis has several, also I’ve written many many papers with real examples of posterior predictive checks from my research and that of others. One thing I don’t lack is real examples!

Greenland:

Perhaps we might agree that PPP is not admissible in the precise frequentist sense I described, and thus not admissible to a conjunctive (“and”) ecumenist, by which I mean someone who regards such admissibility as a necessary criterion for adoption of a method, along with Bayesian coherence of the method for priors derived from the context. This is how I would interpret Cox’s eclecticism, as he laid out in many places such as his comment on Lindley (2000) on p. 321-324 of the attached. In the present case I think his philosophy (to which mine is very close) says that if I have a Bayesian method like PPP that falls short of admissibility, the frequentist can offer me a recalibration (as did Efron and Morris for parametric EB intervals, and BB and RVV 2000 did for PPP). This recalibration makes sure I don’t overlook potentially fatal yet easily avoidable errors or model problems, like those I described for possible scenarios that would give rise to your example 1 (where the prior check screams that there is an identifiable mistake somewhere in our formulation, such as data miscoding or fraud, a fact I want to know about but which the posterior check pastes over), and which you may have overlooked in your real examples if you did not also do calibrated checks.

I note that Rod Little summarizes the superficially similar-sounding ecumenicism of “calibrated Bayes” in the attached Am Stat article. I think this article is superb and agree with almost all of it. But toward the end of p. 7 he offers in brief the same defense of PPP as you do, which I find objectionable in the sense I described below for your example 1. So we have evidence that there are at least two types of ecumenicism afoot: One strict or conjunctive that strives for both strong repeated-sampling calibration (as Cox and Hinkley laid out in Ch. 2 of their 1974 book) along with coherency with specified prior information; the other weak or disjunctive (“or”) ecumenicism outlined by Little but apparently attributable to Don (1984), requiring only one or the other criterion be met. (I think that were Jack Good still with us he could offer at least 46,000 more kinds of ecumenicism/eclecticism.)

Finally, I think PPP does indeed use the data twice in a very straightforward arithmetic way: On p. 2597 of your EJS article, you define the PPP as Pr(T(y_rep)>T(y)|y), although I think a close-paren is missing in your expression. The data y appear twice in this expression: once after the conditioning bar “|” and once before. This is not mere notation, as the data are used twice and in this order to compute the expression: once to construct the posterior measure Pr and then again to define the set it is applied to. Why deny this fact? Rod seems to accept it in his wording at the end of his p. 7, and it has no bearing that I can see on the core argument for PPP that you and he share. After all, some superb statistical methodologies count the data twice in essentially the same sense, such as empirical Bayes (something which Neyman himself praised as a brilliant innovation, albeit its developers paid close attention to NP calibration). In the end our divergence here comes down to a matter of philosophy of applied statistics (a very good topic to enrich, I think) and what one means by “calibrated Bayes”, not whether the data are counted twice. And in practical terms, the entire dispute might be resolved if one advocated always presenting and comparing the prior or recalibrated diagnostic alongside the PPP, so that one would be alerted to the problems caught by the prior or recalibrated checks, but missed by the PPP.

Gelman:

I wouldn’t be surprised if most of my readers (at least those who are informed on these issues) will agree with you. But I still think that your view is ultimately a holdover from the “null hypothesis significance testing” approach which I think is ill-founded.

Just to respond briefly: I don’t know what is meant by “not admissible in the precise frequentist sense [you] described.” I have never seen a definition of admissibility in this sense. So I really don’t know what you are saying here.

A posterior predictive p-value is a probability statement regarding replications, and, like any Bayesian statement, it is a correct (i.e., calibrated and fully informative) statement to the extent that the model is true. Of course the model isn’t true, so the probability has this funny interpretation as being a predictive probability, conditional on the model that we don’t believe—but, again, that’s true of every Bayesian probability statement (with some special-case exceptions). Nonetheless, many researchers (including you and me) find Bayesian inference to be useful sometimes.

Regarding my article in EJS, the point is that, in my first example, I have a ppp-value and also a “calibrated” ppp-value (in my terminology, a “u-value” using the same test statistic). And, as I explain in that paper, for the purpose of model evaluation, I prefer my p-value (whose distribution under the prior is highly concentrated near 0.5) to the u-value. Again, I’m doing predictive checks to evaluate model fit, not to maximize the probability of rejecting a false model while having a fixed probability of rejecting the model if it is true.

In my example, you say that the prior check is screaming at me—but my point is that, for the purposes of future inferences, the posterior model is just fine. Meng (cc-ed here), Stern (cc-ed here), and I discuss this in our 1996 paper: prior predictive replications are different from posterior predictive replications. It is possible for a model to work well for posterior predictive replications but not to work with prior predictive replications. Indeed, this happens all the time. A model can work well for predictions of new data from existing groups but not for new data from new groups. To see this, consider a hierarchical model with a super-weak prior on the group parameters and lots of data for each group. Lots of data on each group implies that the group-level parameters will be estimated fine for existing groups; the weak prior is no problem because there’s lots of data. But for new groups, the weak prior kills you: it’s not allowing you to learn about the distribution of the group-level parameters. Or, for another example, suppose the group-level parameters are bimodal but you fit them to a hierarchical normal model. Again, with lots of data within each group you will estimate the group-level parameters fine. But when making predictions for new groups, you’ll just keep drawing from this normal model and this will be wrong. So it makes sense that the predictive check will show a problem when predicting new groups (the prior predictive check) but not when predicting new observations from existing groups (the posterior predictive check).

If you want to say that “Pr(T(y_rep)>T(y)|y)” uses the data twice, then you could just as well say that the expression “y^2” uses the data twice because it multiplies y by itself!

Finally, I agree with your last suggestion which is that users can look at prior predictive checks as well as posterior predictive checks. That’s an excellent point. Indeed, as XL, Hal, and I wrote, prior predictive checks are a special case of posterior predictive checks in which the predictive replication involves re-drawing the parameters as well as the data. In my writing I’ve tended to say that people should look at replications of interest, but there’s no reason not to look at lots of different replications, if that will help.

Greenland:

I am opposed to blanket or mindless null hypothesis testing (NHT) as much as anyone and for two decades have been lecturing on “Overthrowing the Tyranny of the Null Hypothesis”. But that NHT is a grossly overused tool does not negate the fact that it sometimes makes sense to use it. I note that in a number of articles (e.g., Gelman and Weakliem, 2009) you have described the “significance” and “nonsignificance” of null tests quite prominently, so I had the impression that you have used NHT yourself.

In my view, testing the difference between the prior and the likelihood function is an example of useful null testing, not because we believe some null (or that our prior or data model is perfect) but for the Fisherian reason that we want to be alerted to outright statistical contradictions between the prior and the data model (where outright statistical contradiction could mean a 5 sigma difference, as in your example 1 and in detecting the Higgs boson).

Regarding “admissibility,” I was carelessly using “admissible” as a shorthand for the test being approximately uniformly most powerful among unbiased tests (UMPU). So the passage at issue could be rewritten: Perhaps we might agree that PPP is not UMPU, and thus not acceptable to a conjunctive (“and”) ecumenist, by which I mean someone who regards such UMPU as a necessary or at least desirable property for a test.

All probability statements (including so-called nonparametric P-values) are conditional on models, e.g., that somehow selection was randomized within observed covariate levels. Thus I don’t believe any of the models we use in observational health, medical and social sciences are more than rough guides at best. Using ‘model’ in the Boxian sense of prior+data model, I do not want to make inferences from a model (or a posterior computed from it) when the data indicate that at least one of the prior or data model are way off of reality.

To me, example 1 in your Electronic Journal of Statistics paper is demonstrating exactly the opposite of what you claim in your paper: That the PPP can miss important evidence of model problems which are easily seen with other diagnostics. So your basis for ‘preferring’ PPP over supplementing it continues to elude me.

Now if one rigidly insists that the data model is correct (which neither of us wants to do) and then has a situation like example 1 in which the prior and data model conflict statistically, one can claim as you do that the high PPP indicates that the prior (which must be bad) did not harm your inferences much. It seems to me however that this defense of PPP corresponds to nothing more than the old observation that, in finite-parameter problems, the data eventually swamp the prior (assuming the prior support includes the true value, as it must in your example). This defense never appealed to me, because why would one want to use inferences contaminated by a discredited prior?

When you say “for the purposes of future inferences, the posterior model is just fine,” this is true only if the data model is correct. If the divergence between the prior and the likelihood function arose from undetected data error or fraud, or from data-model mis-specification, I don’t think the posterior model will be just fine for all future inferences, and those possibilities are things to worry about in practice.

I am concerned to know if the analyst has clearly failed to predict the new groups with any kind of accuracy. It could call into question the analyst’s specification competence as well as the data integrity, so I could see why the prior check would be unpopular.

Regarding “using the data twice,” I think you missed my point completely, which is that y appears on both sides of the conditioning bar “|”. In other words, the way that y is used both to construct the sampling (data) measure and to define the sampling event being measured leads to the need for recalibration. This observation applies to EB inferences, which are respectable frequentist tools, as well as to PPP. It seems to me that you are treating this “double-counting” observation defensively as if it is some sort of insult of PPP, rather than the matter of fact that it is. It is just a math property that alerts a frequentist to the need for (and is addressed straightforwardly by) recalibration. Bayesian computations will analogously yield incoherently precise posteriors if the data are double counted in similar ways and this shortcut is not accounted for, e.g., if nuisance parameters are replaced by their MLEs and subsequently treated as fixed quantities in posterior computation, instead of being integrated out.

Finally, is it safe to say we agree for practical purposes that one should perform and report the prior vs. likelihood check, or a recalibrated PPP, even if one also wants a raw PPP? Does anybody who has read this far disagree?

Gelman:

In my paper with Weakliem, we show some null hypothesis significance testing results not because we think null hypothesis significance testing is a good idea in that example but rather to communicate with that aspect of the scientific community that uses these tests. In fact, in that paper we state that, whether or not the observed comparison were statistically significant, we would not believe the claim. Further discussion of this point is in my recent paper with John Carlin for Perspectives on Psychological Science:
http://www.stat.columbia.edu/~gelman/research/unpublished/retropower20.pdf

Regarding UMPU: As I stated earlier, UMPU is based on the framework in which the goal is to reject false hypotheses as often as possible while keeping a 5% chance of rejecting true hypotheses. But this is not a problem that interests me. I just don’t care about UMPU nor do I think it is a useful general principle for model checking. I don’t see why wanting UMPU makes someone an ecumenist.

Finally, you write, “is it safe to say we agree for practical purposes that one should perform and report the prior vs. likelihood check, or a recalibrated PPP, even if one also wants a raw PPP?” No, I don’t agree with this. If someone wants to report posterior predictive checks for different replications, that’s fine, I agree that it’s a good idea but I wouldn’t call it a “should.” I think it’s best for people to realize that model checking depends on the purposes to which a model will be used. Part of this dependence-on-purposes comes in the choice of test variable (for example, if you test a model based on the skewness, you might miss a problem with the kurtosis), and this is well understood. Another part of this dependence-on-purposes comes in the choice of replication. In a hierarchical model with data y, local parameters alpha, and hyperparameters phi, one can consider three possible replications:
(i) same phi, same alpha, new y
(ii) same phi, new alpha, new y
(iii) new phi, new alpha, new y
You seem to want to privilege (iii), and that’s fine, it might make a lot of sense in your practice. But in other cases, for other people’s models, (i) or (ii) might make more sense. For example if someone has a flat prior on phi, they can’t do (iii) at all. But that doesn’t mean the model can’t be checked.

Greenland:

Well, no convergence between us here, so I guess the controversy over PPP will remain unresolved once again (as with the issue of what constitutes ‘valid’ as opposed to ‘proper’ imputation). Maybe that makes it all the more worth blogging about. If there is one thing our new debate revealed to me, however, it is that I am much more frequentist than I initially thought! Hence I’m going to forward our exchange to Mayo, since with her I spent most of my time defending Bayesian perspectives.

Just a footnote re UMPU: As UMPU is just one of many possible formalizations of how to optimize a test, and unbiasedness is a particularly dubious criterion, I don’t want to push UMPU. But in frequentist evaluation of methods, I do believe it is important to deploy some kind of optimization criteria to minimize error or loss (whether in testing or estimation) and I was just using UMPU as a well-known example. The problem for me is that I still don’t know what frequentist error-minimization criteria are satisfied by PPP. Thus, for now, due to calibration objections, PPP will remain inadmissible (in the informal English sense) in my methodology, which as I said tries to satisfy both frequentist and Bayesian desiderata; I don’t want to ignore or dismiss objections from either frequentist or Bayesian perspectives, even though satisfying both completely may often be impractical if not impossible.

Gelman:

There are multiple frequentist desiderata; indeed, I have written that one strength of the frequentist approach is that the frequentist can use commonsensical and subject-matter concerns to choose the appropriate desiderata for any particular problem. I’m sure there are some problems for which it makes sense to have the goal of rejecting false hypotheses as often as possible while keeping a 5% chance of rejecting true hypotheses. It’s just not something that’s ever come up in the hundreds of problems I’ve worked on over the years. To me, null hypothesis significance testing is a tool that can be useful but I don’t see it as being based on underlying principles that make sense, at least not in the applications I’ve seen. The idea of model checking, though: that seems important to me. I am not interested in demonstrating that a model is false—that’s something I already know—but I am interested in understanding the ways in which a model does not fit the data.

And Greenland ends it with:

I too am most interested in understanding ways in which the model (which again I take as prior+data model) fits poorly, and not in demonstrating the model is false, which I already know (as you do). Thus it is interesting how we can agree in principle when that principle is phrased in broad and nebulous English, but diverge about devilish details.

Indeed.
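For readers who want a concrete version of the existing-groups-versus-new-groups point from the exchange, here is a minimal simulation sketch. The bimodal setup and the crude plug-in fit are purely illustrative assumptions (a stand-in for a full hierarchical fit, not anyone’s real example); it corresponds roughly to comparing replication (i) with replications (ii) and (iii) above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup, not from any real example: group effects alpha_j are bimodal
# (half near -3, half near +3), but the analyst models them as Normal(mu, tau^2).
# With lots of data per group, each alpha_j is estimated well, so predictions for
# new observations in EXISTING groups look fine; effects drawn for NEW groups from
# the fitted normal put appreciable mass where no observed group actually lives.
J, n_per_group, sigma = 40, 200, 1.0
alpha = np.where(rng.random(J) < 0.5, -3.0, 3.0) + rng.normal(0, 0.2, J)
y = rng.normal(alpha[:, None], sigma, size=(J, n_per_group))

# Crude plug-in fit (a stand-in for a full hierarchical fit): group means, then
# a normal model for their distribution.
alpha_hat = y.mean(axis=1)
mu_hat, tau_hat = alpha_hat.mean(), alpha_hat.std(ddof=1)

# Check 1: new observations from EXISTING groups (replication (i) style).
y_rep_existing = rng.normal(alpha_hat[:, None], sigma, size=(J, n_per_group))
print("existing groups, mean |difference| of group means:",
      np.abs(y_rep_existing.mean(1) - y.mean(1)).mean())

# Check 2: effects for NEW groups, drawn from the fitted normal (replication (ii)/(iii) style).
alpha_new = rng.normal(mu_hat, tau_hat, size=10000)
print("fraction of NEW-group effects in (-1, 1):", np.mean(np.abs(alpha_new) < 1))
print("fraction of OBSERVED group effects in (-1, 1):", np.mean(np.abs(alpha_hat) < 1))
```

With lots of data per group, the replicated data for existing groups look just like the observed data, while the fitted normal for new-group effects puts substantial mass in a region where no observed group sits; that is the sort of discrepancy a prior or new-group predictive check would flag but a same-groups posterior predictive check would not.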

88 thoughts on “Discussion with Sander Greenland on posterior predictive checks”

  1. Given a string of data points x1,…,xn and a model posterior P(X1,…,Xn), simulating data and comparing it to the real data is essentially checking that x1,…,xn is in the high probability region of P(X1,…,Xn).

    If future (or otherwise unknown) data looks similar enough to past data to also be in the high probability area, then the posterior will provide good inferences.

    The essential problem is that being in the high probability region is NOT the same as calibrating the probabilities such that P(X_i) = percentage of x’s equal to X_i. They’re not even close to being the same in most cases and P(X_i) is usually numerically very different from the frequency for a good (useful) probability distribution.

    Nor is there any requirement that P(X1,…,Xn) be the frequency that x1,…,xn’s occur. It’s sufficient that the new data we care about is similar enough to the old data so that it too is in the high probability region of P(X1,…,Xn).

    As long as that happens, the inferences from the posterior will be good. There needn’t even exist in principle, or in reality, enough data streams to check the frequency properties!

    There seems to be some weird impenetrable mental block among Frequentists which makes them unable to see this. Until they can get over that handicap there is NO possibility of progress. Either they come to understand this or waste another century of everyone’s time.

    • To me, the only time Bayesian probability statements become frequency statements is when sampling from the Bayesian probability using a random number generator. Here we are using frequency properties as an alternative method of computing, and we really DO have mathematical guarantees. Frequency guarantees in frequentist inference are about potential future replications of the physical experiment; this is a statement about the external world (often wrong!). Frequency guarantees in sampling from a posterior are about guaranteeing your mathematical RNG will put a certain number of points in a certain region of an abstract space, a purely mathematical statement which can be proven.

      And, even this only needs to be approximate in many cases (for example, variational bayes or other approximation methods). One of the things to understand about those approximate methods is that they typically approximately preserve the shape and size of the high probability region, it’s the tails, or the skewness or the “unevenness” or other often not very important measures of the target distribution that are typically most altered by the approximation.

      • To be fair, I guess you could argue about frequentist “guarantees” being only conditional on the model being true. Since neither frequentist nor bayesian models are ever true… I guess it shouldn’t surprise anyone that frequentist coverage is typically way off from its “guaranteed” level of coverage. But then I see frequentists saying things like “these intervals will have their advertised coverage” as if they were true, when in fact they are only true “provided our model is perfect” which it isn’t.

  2. I have a sense of how to summarize Gelman’s point:

    If I pretend that my model really does describe the “generating process” of the data, and I take parameters in the high probability region of my posterior, how does a newly generated dataset differ from the actually observed dataset? (typically a qualitative question, and hence the preference for graphical analysis).

    p values in this context tell you how often your generating model will create more extreme data points than the observed data set, and *that’s all* as far as I can see. (am I misunderstanding something?).

    I don’t even know what it means for a p value to reject a true model exactly 5% of the time. It’s like asking what color a platonic number 5 is. Any model of reality is a false model.

  3. Helpful to get to read this.

    > It is possible for a model to work well for posterior predictive replications but not to work with prior predictive replications.

    > dependence-on-purposes

    > and the goal is to find problems with it, with the ultimate goal of improving or replacing the model.

    I agree purpose is primary and for instance in medicine one should avoid doing tests on patients that will not impact their treatment management so it might seem sensible to skip prior vs. likelihood check as it won’t have much impact on the posterior you settle on (so you are focussing on finding important wrongness not just any wrongness) – but unlike medical tests there aren’t side effects or large costs?

    Also, when you move to more informative priors won’t the impact of changing the prior increase?

    Also, also, if I was reviewing a Bayesian analysis, I would do/ask for a prior vs. likelihood check if anything just as a check on the analyst’s understanding of the analysis they did.

    p.s. doing away with K? partial anonymity

  4. Amazing post.

    It seems that Greenland is bothered by the fact that PPP isn’t optimized in any way. Even if UMPU optimizes based on a false assumption (that the null is true), at least it is an optimization and, hopefully, this optimization will be correlated with something good. But I’m not sure this is the case, due to assertions by Greenland like this one: “testing the difference between the prior and the likelihood function is an example of useful null testing, not because we believe some null (or that our prior or data model is perfect) but for the Fisherian reason that we want to be alerted to outright statistical contradictions between the prior and the data model (where outright statistical contradiction could mean a 5 sigma difference, as in your example 1 and in detecting the Higgs boson).” I may need to reread some of the articles mentioned, but can someone explain to me how the “Fisherian reason” doesn’t mean that we assume the null is true (in the sense that a p-value is the probability of observing a statistic equal to or bigger than the data at hand, conditional on the model or null being true)?

  5. There is a lot of talk here about “every model is false”. What about Coulomb’s inverse-square rule (I don’t like the term “law”)? The people running these experiments use the model 1/r^(2+q) and report that the data are consistent with q < 6E-17. The theoretical model of 1/r^2 seems to be literally true, or at least it is plausible that this is so.

    "Finally, the results of experiments to test Coulomb’s
    inverse square law are listed in table 2 as a comparison.
    Coulomb’s Law is a fundamental law of electromagnetism,
    and it seems pertinent to inquire to what extent this law is
    experimentally known to hold, in particular its inverse square
    nature. The above experimental results reveal that the validity
    of its inverse square nature can be unassailable almost to a
    certainty at the macroscopic level, the length scale of which
    has been shown to be of the order of 1013 cm by laboratory
    and geophysical tests reviewed above. As for the microcosmic
    scale, the well-knownRutherford experiments on the scattering
    of alpha particles by a thin metal foil gave an indication that
    Coulomb’s Law would be valid at least down to distances
    of about 10−11 cm, which is roughly the size of the nucleus.
    Modern high energy experiments on the scattering of electrons
    and protons proved that Coulomb’s inverse square law was
    successful even down to the fermi range [36]. Thus, the
    evidence from experimental results reveals that the inverse
    square Coulomb’s Law is valid not only over the classical
    range, but deep into the quantum domain also, a total length
    scale spanning 26 orders of magnitude or more: this range is
    impressive but still finite."
    https://www.princeton.edu/~romalis/PHYS312/Coulomb%20Ref/TuCoulomb.pdf

    • Question:

      I agree that we occasionally work with models that are true, or essentially true. But this sort of example describes very few of the models that I ever fit or that arise in political science or environmental science or epidemiology (which are the areas that Sander Greenland and I typically work in).

    • Connecting a model to data requires more than just one part of the model being correct. So for example:

      1) knowing which particles are involved in any interaction
      2) knowing the position of the particles in any interaction (so you can calculate distances for coulomb’s law), already quantum mechanics makes this imprecise.
      3) the stability of the particles and whether they interact with virtual particles or not: the value of many physical constants has a “bare” value and an effective value given a certain observational scale (this is related to renormalization group ideas).
      4) The value of the coefficients in the model, 1/r^(2+q) is only valid in units of force, charge, and distance in which unit charges a unit distance apart apply unit force. This is a dimensionless analysis, and knowing what the actual physical units are has its own level of uncertainty.

      But I take your point, sure, some models are so precise that their “wrongness” doesn’t matter. There are relatively few though.

    • A few more pedantic points, just for yucks:

      Are distance or time quantized? Is there a smallest possible distance between two particles? Perhaps the universe is really a tiny grid of points? Perhaps time flies not like an arrow but by jumps between the smallest possible increments of time? I don’t think physicists have an answer for this, but I think they have an upper bound on how small such things have to be to be consistent with data. There are things like the Planck length (the length of the smallest possible distance measurement, apparently a photon with enough energy to have this wavelength collapses into a black hole)

      If you call string theorists physicists, apparently we aren’t even sure how many spatial dimensions there are ;-)

        • I don’t mean to denigrate String Theorists. String theorists are clearly studying something, but most physicists I know are skeptical that it’s the actual universe we live in. It’s not my area of expertise, but it’s relevant to the question of whether coulomb interactions are exactly correct I guess. If there are 10 spatial dimensions then r^2 = r . r where each r is a 10 vector instead of a 3 vector… The fact that the extent of 7 of those dimensions is so small that it’s not normally relevant is useful for approximate reasoning but it nevertheless makes r.r for 3 vectors logically incorrect in that framework.

          I basically stick to: “every model is an approximation, some are extremely precise approximations”

        • I was going to suggest “numerologists”, but that’s a bit insulting to numerologists since they do make checkable predictions.

          I wonder how much string theory cost taxpayers all total?

        • The mantra “all models are wrong…” is great for bringing the NHST crowd back to reality, but it serves no greater purpose in the philosophy of science.

        • Note that frequentist claims about coverage of frequentist intervals for physical constants fail even though some aspects of physical models like Coulomb’s law are “essentially true”.

          see: http://radfordneal.wordpress.com/2009/03/07/does-coverage-matter/

          We tend to forget that when dealing with real world data, it’s not just the theoretical model that is being tested, but also all the incidental stuff along the way, like the measurement instruments and so forth.

        • Daniel,

          Do you disagree that Coulomb’s equation is treated as if it were true? Sure it may be wrong, but we have no evidence for that despite many careful tests. Of course, when applying the model to real life observations there will be various unaccounted for influences that render the model an approximation, but if conditions are carefully controlled the model appears to be literally true.

        • The reasoning used by physicists here is a distinct improvement over that used by classical statisticians. The question isn’t:

          “is Coulomb’s law true?”

          rather it’s:

          “is there any point in assuming Coulomb’s law is false?”

          For most purposes (perhaps all we currently know of) the answer is no.

          Your Coulomb’s Law discussion at the beginning is a perfect example of why interval estimates are so much more useful than rejecting or accepting the null.

          Regardless of what you or I or anyone else thinks the ultimate reality is, the fact that the interval estimate for q is vanishingly small means we can both happily get along just using the Coulomb equation.

        • Sure, Coulomb’s law may as well be true for all the cases you or I will ever need to know. For those who want to know what happened in the first nanosecond of the big bang, it could be wildly false, and yet the difference would have no measurable effect on any observation you or I would ever want to make. But as I point out, as soon as you have real world measurements involved, you have models of measurement instruments, models of experimental apparatus, models of computer data collection errors (flipped bits and the like), models of this and that. Overall they make a single model of how the data arose, and every one of these models is a simplification (as pointed out by Anonymous, all models simplify in order to be useful). A perfect example is the faster-than-light neutrinos which turned out to be something like a faulty fiber-optic cable together with an out-of-spec oscillator on some computer board.

          To me the “all models are wrong” mantra helps keep us grounded in the reality that we can’t and shouldn’t model reality in all its details. Our best bet is to get more and more precise and accurate until it doesn’t matter anymore for some meaning of “matter”.

        • The greatest philosopher of science of the 20th century C. Truesdell (who made Kuhn and Popper look like the fakers they were) had this to say,

          “A theory is not a gospel to be believed and sworn upon as an article of faith, nor must two different and seemingly contradictory theories battle each other to the death. A theory is a mathematical model for an aspect of nature. One good theory extracts and exaggerates some facets of the truth. Another good theory may idealize other facets. A theory cannot duplicate nature, for if it did so in all respects, it would be isomorphic to nature itself and hence useless, a mere repetition of all the complexity which nature presents to us, that very complexity we frame theories to penetrate and set aside.

          If the theory were not simpler than the phenomena it was designed to model, it would serve no purpose. Like a portrait, it can represent only a part of the subject it pictures. This exaggerates, if only because it leaves out the rest. Its simplicity is its virtue, provided the aspect it portrays be that which we wish to study. If, on the other hand, our concern is an aspect of nature which a particular theory leaves out of account, then that theory is for us not wrong but simply irrelevant.”

          So models are “wrong” in that they’re not isomorphic to nature, but for the part they cover, we should still be able to take measurements and use them to accurately predict other things the model was intended to predict.

          I get the feeling those who aggressively bleat the “all models are wrong” stuff (beyond pointing out the silliness of NHST) use models which don’t accurately predict what they’re supposed to predict.

        • Anonymous,

          So how would you suggest that people studying complex, poorly-understood systems proceed? I was taught NHST, when I looked closer at it I determined that (in practice) it is essentially an obfuscated (by math) method of disproving strawmen. This is clearly faulty.

          If you have data from two groups of animals, one got a drug while the other didn’t, how would you attempt to extract useful info? Usually there is no theory on what the effect size due to the drug should be.

        • Extracting info is easy. Any interval estimate for the effect size should include any value reasonably consistent with everything we know (prior good theories/knowledge + data).

          The bigger problem is that this interval is often too wide to be useful and statisticians want to shrink it using assumptions they fooled themselves into thinking are true, but are typically false. Hence oodles of non-reproducible research.

          Personally, I’d try to find the missing theory. While everyone else is publishing irreproducible papers, I’d be working my way to a nobel prize. No one said being the next Einstein was easy.

        • Anon,

          “The bigger problem is that this interval is often too wide to be useful and statisticians want to shrink it using assumptions they fooled themselves into thinking are true, but are typically false. Hence oodles of non-reproducible research.”

          This is an apt assessment.

          “Personally, I’d try to find the missing theory. While everyone else is publishing irreproducible papers, I’d be working my way to a nobel prize. No one said being the next Einstein was easy.”

          It may be easier than you would think. Looking back at what research was being done before the NHST meme infected my field, I find researchers were attempting to build rational mathematical theories (eg Thurstone’s learning function, Cajal’s laws of neuronal branching). These efforts are just widely (but not completely) ignored in favor of lists of increasing/decreasing influences. I have been becoming more and more convinced that it would be better for me to ignore all research since ~1940 and start anew.

        • I don’t know about your field, but I’m convinced the greatest single advantage physics has over more recent fields is that it had several centuries of stunning success under its belt before anyone had ever heard of NHST.

          If history had been crueler and produced NHST around 1600, Physics would be the dismal science today, and economics would require a new disparaging nickname.

        • Anon
          I get the feeling you’re a person I both agree and disagree with on a lot of things. +1 for quoting Truesdell. He was blinded by his own polemic at times but had some good things to say.

        • Wow. I never met anyone who had heard of Truesdell. I thought he had been lost to history. How did you run across him?

        • Well…there are still places in the world where you can learn about the principle of material frame-indifference ;-) Whether this means Truesdell has avoided obscurity I’m not so sure! But I’ve read quite a bit of his work, mostly a while ago, and it certainly influenced me. How did you come across him?

        • Ok, now I’m seriously impressed.

          Initially, it was from a deep dissatisfaction with the way physicists were treating part of their bread and butter subject matter, and wanting something better. I can’t tell you how shocking it was for someone who thought Physics was the king of the sciences to discover mathematicians and engineers were doing this work dramatically better.

          His books have gotten incredibly expensive on amazon, but I bought almost all of them back when they were around the $100 mark.

          The above quote was taken from the introduction to “Fundamentals of Maxwell’s Kinetic Theory of a Simple Monatomic Gas” which as far as I know was completely ignored by physicists.

        • I wrote Truesdell a letter a couple of years before he died. It must have confirmed all his worst suspicions about the modern university since it was barely literate.

          Despite his reputation for being an elitist hard-ass however, he was encouraging and generous. He even sent me some books and papers with his reply.

        • Interesting! I don’t want to hijack this thread, so I should keep it brief. It would be fascinating sometime to read his response; maybe you could post it online somewhere…

          I always meant to get a copy of the book you mention…

        • I was excited by it at the time, but in retrospect it’s depressing reading. He was clearly aware that he’d lost energy and entered his dotage. He lived in Baltimore too which became second only to Detroit as the murder capital of the US.

          For such a cultured man (and wife – they were quite the duo) seeing his adopted home town descend into barbarism seemed to have a deep effect on him: a bit like those Europeans who came of age before 1914 watching 1914-1945 happen.

          He was a Victorian living in the age of the Brady Bunch.

  6. I think the nice thing about using posterior predictive checks to generate ‘p-values’ is that you can test whether literally any aspect of the data is consistent with the model. I don’t know if there are other approaches that are as flexible and also ‘admissible’. Maybe the recalibrated posterior checks?

  7. My interpretation of this illuminating dialog (most of which was well over my head, BTW):
    Prof. Gelman: I have a model and some data, let’s see whether the model generates something close to it.
    Critic: But you already used the data to construct your model!
    Prof. Gelman: I know that! I just look whether they agree to figure out if I missed something.
    Critic: If you are going to do it this way all the time, you’ll end up gaming the system and constructing models so that your checks give you an OK, but you’ll be just deluding yourself.
    Prof. Gelman: I will do no such thing! Promise.

    • D.O.:

      No, I don’t think anyone is saying that I’m gaming the system. If my checks give me an OK, that is just saying that these models work for these particular purposes, not that they are correct. I already know my models are incorrect, and if my only goal is to reject a model, I can just gather a few zillion data points and demonstrate that.

      • I am not saying that anybody accuses you personally (if my comment smacks of that, it is just for playfulness). I am saying that if you hold PPP as a gold standard, it will encourage people to overfit the data.

        Second, I am not discussing the model being correct or incorrect in absolute terms (or zillion new data points terms). I (or rather my imagined Critic) am asking whether you will overestimate the applicability of your model, limited as it is. In other words, whether you have a good chance to capture some important features of the future data even if that future data will refute your model.

        Third, not being a statistician by either education or vocation, just an interested person (I read your BDA book, for example), I think that as long as you use, for PPP purposes, an aspect of the data which you didn’t try to model, it should be OK. Otherwise, overfitting might be a problem. You can probably try to use the model itself to figure out whether this is going to be a problem, but surely you (and your real critics) already explored all this and didn’t come to an agreement.

        • D.O.:

          My experience is the opposite, that before predictive checking became popular, Bayesians were not checking their models at all. Indeed, it was worse than that: Bayesians were not simply “not checking their models,” they were out-and-out refusing to check their models, on the grounds that (a) model checking is frequentist and not Bayesian, and (b) Bayesian models are subjective and thus inherently uncheckable. These issues are discussed in my paper with Shalizi. So once people agree with me that it’s important and useful to check models, I feel like we’ve already made a huge amount of progress.

          Finally, there can be many different aspects of data to be fit by a model. So just because a model fits a particular aspect of the data, that should not be taken to imply that the model is fine for other purposes.

        • Are you sure we aren’t attacking a strawman? Would any reasonable person really dispute that “it’s important and useful to check models”?

          I find that so hard to believe. The fact that any model ought to be validated seems so fundamental to me as a critical part of the scientific enterprise.

        • Rahul:

          It’s your choice whether to believe it or not. But I saw it: I was at a conference of Bayesians in 1991 where just about nobody was interested in checking their models. Also you can see this attitude in lots of applied work. Just read some journal articles.

        • No, I believe your report. I only thought maybe I was confusing the meaning of your words.

          I guess I’m attending the wrong conferences & reading the wrong articles. I thought it would be a ridiculously indefensible position for anyone to insist that models need not be checked. It’s like someone saying, let’s design a drug molecule on the computer, use biochemistry to argue why it should work, and then skip the drug trial entirely.

        • Andrew:

          To further get the flavor of this, can you point to an applied article that illustrates someone “out-and-out refusing to check their models”? Or where someone explicitly denies that it is “important and useful to check models”?

        • Rahul: To begin to understand, you have to ask: what exactly is being checked? (there isn’t agreement on this, particularly if the model includes priors). Second, consider that checking one’s model requires using data other than the data observed, thereby violating (some think) the Likelihood Principle. If sampling distributions play a key role there, why block their use in general in inference?

        • You’ve nailed the exact three reasons I’m not a Bayesian.

          (1) If Bayesians disagree among themselves, that’s because they’re all wrong.

          (2) Gelman’s comparing simulated data to the actual data violates the Likelihood principle.

          (3) If sampling distributions play a key role in posteriors then you don’t need the posteriors.

        • Mayo:

          Checking one’s model can use data other than the data observed but this is not required. For example if you have 1000 data points that are purportedly independent draws from a single normal distribution, but the data are highly skewed, this is evidence from the data alone that there is a problem with the model.
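
          Here, purely as an illustration, is a minimal sketch of that kind of within-data check: simulate replicated datasets from the fitted normal model and see whether they ever reproduce the observed skewness. The data, the statistic, and the plug-in (ML-type) fit below are all hypothetical choices made for the sketch, not anyone’s prescribed procedure.

```python
# Sketch: are 1000 purportedly normal observations more skewed than the
# fitted normal model can plausibly produce? (Hypothetical data; a plug-in
# check using the fitted mean and sd rather than full posterior simulation.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=1.0, size=1000)       # skewed "observed" data

mu_hat, sigma_hat = y.mean(), y.std(ddof=1)          # fit of the normal model

T_obs = stats.skew(y)                                # check statistic: sample skewness
T_rep = np.array([stats.skew(rng.normal(mu_hat, sigma_hat, size=y.size))
                  for _ in range(2000)])             # replicated datasets from the fit
p_check = np.mean(T_rep >= T_obs)                    # tail probability of the check
print(f"observed skewness {T_obs:.2f}, check p-value {p_check:.3f}")
```

          With data this skewed the check p-value is essentially zero: evidence from the data alone that the normal model is wrong, with no held-out data needed.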

        • Of course you are right that models have to be checked. I am not speaking from any well-reasoned mathematical position, but just from the common-sense point of view. People make all sorts of mistakes and refusing even to look whether one might have been made is a strange attitude.

          The problem, of course, is how to work it out correctly. Frequentists’ attitude (I am not sure about the methodology) seems to be a requirement to define a universal framework for everything a statistician is about to do, check its distributional properties under some assumptions, and thus gain some (maybe more than healthy) amount of confidence in the method. Bayesians counter that it is better to maintain some flexibility, because you are never sure about the assumptions around which the distributional properties are defined and it is better to proceed on a common-sense, case-by-case basis.

          I thought that this tension more or less sums up your disagreement with Sander Greenland.

  8. I read the two-short-examples paper a while ago, and my take-away was that choosing a statistic for posterior checks (whether p-value or graphical) requires careful thought.

    …So, naturally, I’m no longer interested in them.

    (j/k!)

  9. ‘More and more, though, I’ve been seeing the point of strongly informative priors…For too many years I’ve been a “cringing Bayesian,” trying to minimize the use of prior information in my analyses, but I’m thinking that this has been a mistake. I’m not all the way there yet—in particular, BDA3 remains full of flat or weak priors—but this is the direction I’m going.’

    I’m no bayesian but I agree. If you’re gonna regularize, then regularise.

    • +1

      I never liked the weak priors business. If you want to go Bayesian then you might as well give your priors some respect and learn to trust them. If priors aren’t giving you good results, you might as well ditch the approach.

      • Rahul:

        It’s a cost-benefit thing, for the data model (the “likelihood”) as well as the prior. In either case, including all relevant information takes effort, work that might not be worth it if the data are highly informative for questions of interest.

    • Is this casual use of “regularize” the classical statistician’s way of saying “I’m sorry for dismissing your methods as metaphysical nonsense and trying to suppress them for so long”?

      If so, apology accepted.

      The whole thing is a case study in the dangers of “it doesn’t make sense to me, so it must not make sense to anyone” reasoning.

      • A little from column A, a little from column B :-) I certainly agree that many important topics that were controversial ‘metaphysically’ but useful practically later admitted clearer interpretations, usually from another (mathematical) point of view – zero, infinity, complex numbers, generalized functions…etc. These often ended up quite different to what they were initially, and I’m sure criticism helped with this. Even Andrew doesn’t buy the old bayes story (e.g. his discussion of model checking, his paper with Shalizi etc). I don’t think you do either?

        Personally, I’ve never been a ‘classical statistician’, nor have I ‘suppressed’ bayes. It doesn’t float my boat (in its usual forms, anyway) but it’s a shame people like Andrew (though he seems to be doing OK!) faced such strong opposition.

        PS your last sentence sounds awfully like how some people around here discuss ‘classical statistics’!

        • Bayesians are in the supremely self satisfying position of never having been in a position to suppress Frequentists in a big way, so they didn’t.

          We’ll gladly remind Frequentists of their past impropriety and quietly leave unmentioned any speculations as to what would have happened if the tables were turned.

        • I do think the reluctance of many Bayesians (at least a few years ago) to check their models is conflated with trying to suppress Frequentist ideas, especially as Box was advocating these for model checking (1980, Sampling and Bayes’ Inference in Scientific Modelling and Robustness).

          And, as hjk termed it, the “old Bayes story” was marketed as a way to suppress any concern that anything could go wrong with a Bayesian analysis (except perhaps MCMC non-convergence).

          I believe both were/are unfortunate for those who want to get less wrong about the world.
          (But as Andrew put it in one of his comments, economy of research – cost/benefit – should always be kept in mind.)

        • That’s odd because when I read Laplace 1749-1827 I find no hint of a “don’t check your model doctrine”.

          When I read Keynes (published 1921) I find no hint of a “don’t check your model doctrine”.

          When I read Jeffreys (published 1939) I find no hint of a “don’t check your model doctrine”.

          When I read Jaynes (publishing from the 1950’s onward) I find no hint of a “don’t check your model doctrine”.

          Clearly “old Bayes story” has a somewhat peculiar meaning.

        • I would be much more convinced by a random sample of Bayesian authors and yours does not appear very random.

          And I was construing the “old Bayes story” as the Lindleyesque axiom system that represents past knowledge and updates it in a fully coherent manner into definitive probability statements.

  10. Sander kindly mentioned my 2006 Am Stat article; I confess that my comments on posterior predictive checks (PPC) in that article were superficial, since I was sitting on the fence with respect to this controversy and did not want to jump to either side. But the dialog between Sander and Andy is stimulating, if not decisive, so I’ll take the liberty of giving my sense of these methods, bearing in mind that I am not an expert or fully on top of the technical nuances. Hopefully Andy will correct any excessively wooly thinking.

    I find PPC’s more relevant than prior predictive checks, since it is ultimately the posterior model predictions, not the “truth” of the prior, that matter to me. The ideal check would compare the observed data with predictions under the model, given the truth of the model and the true values of unknown parameters. If the check statistic has a distribution that does not depend on the unknown parameters, there is no problem. If it does, however, then the posterior predictive check (PPC) uses a predictive distribution that integrates over the posterior uncertainty in the parameter estimate, instead of using the (unattainable) true values. As a result, the PPC tends to be over-optimistic about the fit of the model – thus a small P value is evidence of a problem, but a large P value is not necessarily evidence that the model is OK (as gauged by this check statistic). One might say “so what, since we know all models are wrong?” But I find it a weakness of the PPC that the large P value is not easily interpretable, since the impact of integrating over the PPD of the unknown parameters is uncertain and variable. So I would conclude that there is no harm in trying PPCs, but interpret large P values “with a pinch of salt”.

    Other approaches to checking models also appear less than ideal. I sometimes wonder if just plugging in a good estimate of the parameters, e.g. the ML estimate, might not have some advantages over the PPC, since it avoids the increased dispersion of the checking distribution in PPC arising from uncertainty in estimating the parameters. This increased dispersion seems to me not so beneficial in this setting, unlike the situation where we are carrying out inference under the model. However the ML estimate is not the true value, and tends to favor the model because of the “double dipping” problem. Cross-validation approaches try to side-step this issue by using (to varying degrees) independent samples for estimation and checking. However, these also lose some power in the model check, by not using all the data to estimate the parameters.
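
    As a purely illustrative aside, the difference between the plug-in and posterior predictive reference distributions can be seen in a toy normal model with known unit variance and a flat prior on the mean; everything below (data, check statistic, replication counts) is invented for the sketch, and the only difference between the two checks is whether the replications condition on a point estimate or average over posterior draws.

```python
# Sketch (all choices hypothetical): plug-in versus posterior predictive
# reference distributions for a check statistic, under y_i ~ Normal(mu, 1)
# with a flat prior on mu, so mu | y ~ Normal(mean(y), 1/n).
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.3, 1.0, size=50)        # toy observed data
n, mu_hat = y.size, y.mean()
T = lambda x: np.abs(x).max()            # check statistic: largest |y_i|
T_obs = T(y)

n_rep = 4000
# Plug-in check: replicate data from Normal(mu_hat, 1).
T_plug = np.array([T(rng.normal(mu_hat, 1.0, size=n)) for _ in range(n_rep)])

# Posterior predictive check: draw mu from Normal(mu_hat, 1/n), then replicate.
mu_draws = rng.normal(mu_hat, 1.0 / np.sqrt(n), size=n_rep)
T_ppc = np.array([T(rng.normal(m, 1.0, size=n)) for m in mu_draws])

print("plug-in p:", np.mean(T_plug >= T_obs))
print("posterior predictive p:", np.mean(T_ppc >= T_obs))
```

    The posterior predictive distribution of the check statistic is slightly more dispersed than the plug-in one, which is the source of the extra conservatism described above.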

    In conclusion, I think model-checking is at this stage an art as well as a science, and as my Amstat article indicates, I favor pragmatism over dogmatism.

    • My favoring pragmatism here leads me to side with Sander over Andrew, largely because I believe my cost and benefit assessments are much closer to Sander’s.

      Perhaps because we have been more focused on utilising (which requires extensively appraising) studies carried out and analysed by others in meta-analysis. Here any wrongness can be important – not just the wrongness that matters locally, this time, in this analysis. It is more like an audit (which I have sometimes defined a meta-analysis as being), and even unimportant discrepancies can be worth noting or at least thoroughly checking for.

      Also, for me, future inferences are not just about the question addressed in the study that will use the posterior as the future prior, but also about other related investigations that will re-use the same approach, including the same prior or at least the same way of coming up with the prior.

      In my earlier experience in undertaking Bayesian analyses I was burned this way. I started with a published example of how to analyse correlated proportions which had nicely provided the Bugs code. When I ran it, the posterior probability of negative proportions was over 40%! The prior put a lot of (implied) probability on one proportion being negative and the little data I had did little to change that. When I contacted the author suggesting it was inappropriate of him to publish this _as a method to use_ without a warning about this, he simply replied – “It was not a problem for me, given the data I had.”

      So though I do not doubt you on this (“not the ‘truth’ of the prior, that matter to me”), I am not convinced it should not matter to others, in this particular case and especially in future cases. Not that you _should_, but reviewers _should_ ask you to. And those less experienced in Bayesian analysis most definitely _should_.

        • My problem with posterior predictive P-values (PPP) has nothing to do with ease of doing, but is rather unease in interpreting. John Carlin hit one of my objections to PPPs dead on with this remark: “Do I get concerned if PPP is less than 0.3, 0.2 or what, and why?” Exactly! Rod Little echoed this quandary.

          Certainly a very small PPP would be of concern, but how do I interpret PPP = 0.25 = 1/4 if I don’t know the PPP operating characteristics over various possible hypotheses, only that it concentrates near 0.50 = 1/2 under the null? Does PPP=1/4 bode well or ill for the model in the application? This is a posterior probability but not a U-value, so (as Stephen Senn has astutely observed) there is no basis for carrying over “significance levels” that have become entrenched for calibrated if hypothetical frequency statistics. With the PPP unmoored (and indeed far out to sea) from a uniform reference distribution under the tested hypothesis, I feel at a loss for making sense of it. It is like having a room-temperature display given in log degrees Kelvin where the base of the logs is unknown.

          To repeat my early comments, the other problem I have with PPP values is displayed in the first example (p. 2598) of Andrew’s paper in the Electronic Journal of Statistics:
          http://projecteuclid.org/euclid.ejs/1382448225
          In the example the PPP = 0.49, but the prior predictive P = 0.0000003, indicating terrible conflict between the assumed prior information and the likelihood. On the face of it, the PPP says it is OK to proceed as if nothing is wrong, but the discrepancy is a danger signal that is being glossed over by PPP – after all, maybe the observation was mis-recorded, maybe it was faked (both phenomena happen frequently enough in health and medical science). I can’t imagine a setting in which I would not want to be alerted when data are extremely far from prior expectations, as displayed in that example. This objection does not exclude giving PPP, but contrary to the impression the paper gives, the example does not excuse skipping a prior check or a recalibrated PPP.
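
          For readers who want to see the general shape of the divergence being described, here is a toy normal-normal sketch; the prior scale and the observation below are invented for illustration and are not the numbers from the paper.

```python
# Toy sketch of a prior predictive check versus a PPP: one observation
# y ~ Normal(theta, 1) with prior theta ~ Normal(0, A^2), test statistic T(y) = y.
# The values of A and y are hypothetical, chosen only to show the contrast.
import numpy as np
from scipy.stats import norm

A, y = 100.0, 500.0                       # hypothetical prior sd and observation

# Prior predictive check: marginally y ~ Normal(0, A^2 + 1).
prior_pred_p = norm.sf(y, loc=0.0, scale=np.sqrt(A**2 + 1))

# Posterior predictive check: theta | y ~ Normal(y*A^2/(A^2+1), A^2/(A^2+1)),
# so y_rep | y ~ Normal(post_mean, post_var + 1).
post_mean = y * A**2 / (A**2 + 1)
post_var = A**2 / (A**2 + 1)
ppp = norm.sf(y, loc=post_mean, scale=np.sqrt(post_var + 1))

print(f"prior predictive P ~ {prior_pred_p:.1e}, PPP ~ {ppp:.2f}")
```

          Because the posterior predictive distribution is centered essentially at the observation itself when the prior is this diffuse, the PPP stays near 1/2 no matter how surprising the observation was under the prior.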

          So I’d like to get straight, specific answers to:
          1) how we are supposed to judge specific PPP values, like PPP=0.25? and
          2) where would we not want to know that the data we saw are at severe odds with what we expected to see a priori?

          I think identifying my problems with PPP as a holdover from null hypothesis testing is dodging these hard questions. Neither of my concerns has to do with believing the null model is true or probably true. At least in some applications, Neyman, Pearson and Fisher were all clear that they were trying to distance their methods from such prior beliefs, so the null models they were testing were tenable working models, and the purpose of their tests was to see if the model remained tenable (meaning reasonable to continue using) in light of new data. For Neyman in particular this was explicated as a decision to either continue to “behave as if” the model were true, or reject it and move on, not to believe in the model. That subsequent writers started off in a fairy-tale world of true models is a tragedy (encouraged by traditional math-stat formalisms), much like the tragedy of those who think a model is true because P exceeds 0.05; but neither of these tragedies is a core part of the original methodologies.

          In modern frequency theory the null model is just a reference point traditionally used for calibration purposes, but calibration at other points is not only possible but important. In the NP framework that’s addressed by confidence intervals and power analysis, major steps that NP took ahead of Fisher. Fisherian theory can do even better by introducing multipoint testing and P-value functions (as Birnbaum did in 1961). As far as I can see, PPP is even less interpretable in calibration terms off the null than on.

          I have the impression Andrew and I agree about general philosophy of applied stats and data analysis, and about most details as well. For example, unlike committed Bayesians we don’t reject all frequency evaluations; unlike committed frequentists, we both want to move beyond flat priors to use real prior information; and unlike a lot of applied Bayesians, we both reject spiked priors for our applications (for more on that see our exchange in Epidemiology 2013; vol. 24, p. 62-78). And (like Rod) I think both of us prefer pragmatism to dogmatism. So I am most interested in getting to the bottom of our divergence here.

        • Regarding my first question, one might respond that PPP=0.25 is a posterior prediction that, in 25% of future data replications, the test statistic will be as or more extreme than what was observed. But that does not tell me how to judge this value as a model diagnostic (a problem which I think John Carlin and Rod Little also mentioned). So, let me rephrase my question as:
          1) Is PPP=25% reassuring about the model or not, for whatever purpose? If not reassuring, why? (or, why is 25% small?) If reassuring, why? (or, why is 25% large?). Same question for 5%, 10%, 15%, 20%, 30%, 35%, 40% etc.

          Regarding my second question, I don’t think we disagree that prior predictive misfit is extremely important if the model is being used to make predictions for new groups. Where I work (epidemiologic research), the job of the model is always, ultimately, to predict observations in a new setting under different potential actions (like whether mortality will increase or decrease if a drug gets pulled off the market, compared to no status change) – the future is always the target, bringing in new settings with new groups. Even when the stated goal is to predict what would have happened to an existing group under a counterfactual (as in compensation cases), that counterfactual involves new conditions which effectively define a new (unobserved) group for prediction (or postdiction, if you prefer). Therefore, whatever the uses for PPP, I always want to see prior vs. likelihood checks for a Bayesian analysis unless it is obvious that there could be no serious conflict (as when, in approximately normal cases, the MLE is within a few SEs or a few prior SDs of the prior mean).

        • Sander:

          1. I just interpret the probabilities as probabilities; I don’t try to transform them to a verbal scale such as “reassuring” etc. I have a similar discomfort when Bayesians try to set up rules such as “Bayes factor greater than 30 is ‘strong evidence'” etc. I’d rather have the probabilities stand for themselves.

          2. We have many many examples of posterior predictive checks in our books and applied research articles. There are settings where it can make sense to do a pure prior predictive check (which is a special case of posterior predictive check under a particular replication in which all parameters are re-sampled from the model), but in your setting where a drug is being studied, I don’t think you’d want to do this. From your description above it sounds to me like you’re interested in new patients and new conditions but not a new drug; i.e., the “theta” representing the drug will not be resampled.

          As a side note, I agree with you on the emphasis on being interested in new people. One thing that frustrates me with some presentations of causal inference (including Rubin’s) is the focus on causal inference for the people who happen to be in the experiment. Almost always the people in the experiment are taken as a sample from some larger population of interest, and I find it very frustrating when researchers ignore this step of generalization and smugly think that they’re causally kosher just because they’ve randomized within their group. (See the freshman fallacy.)

          1. I really have no idea how any diagnostic functions without verbal explanation of exactly what it is diagnosing and when, which in practice comes down to what it means in decision terms, e.g., that P=0.0000003 (as for the prior PP in your first example) means something must be wrong somewhere. Same for med diagnostics – a blood pressure of 180/110 has no meaning until we know it signifies dangerous hypertension with high risk of stroke. If the PPP is a diagnostic, I need this guidance to use it; if it is only a raw probability, it’s not yet a diagnostic and I have no idea why I should be computing it.

          2. The situation I had in mind was precisely one drug only. In your first example where you dismiss the prior PP, I would take it as evidence that either the data are badly in error (e.g., fraudulent or miscoded), or else I really had no idea what the parameter meant before going into the problem. Either way, I had better stop and investigate (think outside the prior and data model). The posterior PP misses all this. Plus, prior tests are so easy to compute that I can see no reason not to do them, e.g., in our usual normal-prior+approximate-normal-MLE regression models we can test the difference between the prior-mean coefficient vector and the MLE vector.

          3. Causal inference within a single randomized group is interesting indeed, and there is a sizable (if unsettled) literature on it traceable back at least to one of the fights between Neyman and Fisher in the 1930s. Here’s but one article on the topic, connecting it to the old margin-conditioning dispute: Greenland, S. (1991). On the logical justification of conditional tests for two-by-two contingency tables. The American Statistician, 45, 248-251.

        • Andrew:
          On your point #1, sometimes a decision has to be made and decision makers need to be informed what a “Bayes factor greater than 30” means.

        • CK:

          I completely disagree with you. When a decision needs to be made using posterior probabilities, it can be made directly by defining the utility function and so forth. We provide 3 detailed examples in the Decision Analysis chapter of BDA. Decisions should be made based on costs, benefits, and probabilities, not on probabilities alone. This is a fundamental principle of Bayesian decision theory.

        • Sander:

          1. As noted above (I believe) and as demonstrated in chapter 6 of BDA and in my statistical practice, I find graphical model checks to be much more valuable than p-values. But, to the extent we use p-values, I interpret them as probabilities. If you want a calibration scale, just recall that p=1/2 corresponds to the probability of a coin flip landing “heads,” p=1/4 corresponds to 2 heads in a row, etc. We discuss this sort of thing further in chapter 1 of BDA. Probabilities have their own natural scale. That is not the same thing as blood pressure measurements whose scale is external.

          2. If there’s only one drug, and you have parameters “theta” characterizing the drug, parameters “alpha” characterizing the patients and the conditions in the study, and data “y,” then it sounds to me like you’d want a predictive check in which theta is unchanged, alpha.rep is re-drawn from the model given theta, and y.rep is re-drawn from the model given theta and alpha.rep. In these sorts of interesting real-world situations I find it helpful to think of the graph of the model and consider what is being replicated.

          3. I’ve linked to this one on the blog before, but, in any case, my thoughts on conditioning for tests on contingency tables are in section 3.3 of this paper from 2003, a paper which ultimately derives from a talk I gave in 1997. I wish I’d been aware of your 1991 paper when I did all this.

        • Andrew:
          I never said that the decision should involve probability alone, but if probability plays a role in my decision, I definitely would be interested to be informed whether I can trust the derived probability value. If the prior and the likelihood are substantially incompatible with each other, to me that is a sign that something might not be right somewhere.

        • CK:

          There are two issues. First, you wrote, “sometimes a decision has to be made and decision makers need to be informed what a ‘Bayes factor greater than 30’ means.” My answer to that is that, if you want to make a decision, you should plug these probabilities into a utility analysis and there is no need to interpret Bayes factors as being “strong evidence” or “weak evidence” or whatever.

          The second point is that we do not trust our models. Again, though, it’s not so simple. We can make lots of progress with flat priors (not always, but often), even though if you have a flat prior on an unbounded space, the data will automatically be in complete conflict with the prior, if you look at a test statistic such as T(y)=|y|. This sort of conflict with the prior will completely doom you if you are interested in replications of new parameters from the model but not so much if you are interested in replications with the same parameters. In Sander Greenland’s example from the comments, a flat prior will doom you if you’re interested in using the model to make predictions about a new drug by drawing from the prior, but not so much if you’re using the model to make predictions about an existing drug.

        • Hi, one more response on my part (to an email that Sander sent, repeating some of the above points):

          As I wrote on blog (although perhaps this was in response to a different commenter, not you; I don’t remember now), different posterior predictive checks answer different questions. In a model with parameter theta describing a drug, parameters alpha describing patients, and data y, you can consider replications such as:
          (1) theta, alpha, y.rep (that’s new data from the same drug and same patients)
          (2) theta, alpha.rep, y.rep (new data from the same drug and new patients)
          (3) theta.rep, alpha.rep, y.rep (new data from new drug and new patients)
          All these can be thought of as posterior predictive checks under different replications. #3 is also a prior predictive check. It’s my impression that #2 would serve your purposes the best. But there’s no reason not to do #1 and #3 as well.
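
          For concreteness, here is a schematic sketch of these three replication schemes for a toy hierarchical model; the normal form, the scales, and the “posterior draws” below are all hypothetical stand-ins for whatever the actual fitted model would produce.

```python
# Sketch of the three replication schemes: theta = drug effect,
# alpha[j] = patient effects, y[j] = data.  Everything here is hypothetical;
# in practice theta_s and alpha_s would be draws from the fitted posterior.
import numpy as np

rng = np.random.default_rng(3)
J, S = 20, 1000                               # patients, posterior draws
sigma_alpha, sigma_y = 1.0, 1.0               # scales assumed known for the sketch

theta_s = rng.normal(2.0, 0.1, size=S)                # stand-in posterior draws of theta
alpha_s = rng.normal(0.0, sigma_alpha, size=(S, J))   # stand-in posterior draws of alpha

def replicate(scheme, s):
    theta = theta_s[s]
    if scheme == 3:                            # (3) new drug: re-draw theta from its prior
        theta = rng.normal(0.0, 10.0)          #     (hypothetical prior for theta)
    alpha = alpha_s[s]
    if scheme in (2, 3):                       # (2),(3) new patients: re-draw alpha
        alpha = rng.normal(0.0, sigma_alpha, size=J)
    return rng.normal(theta + alpha, sigma_y)  # y_rep given theta and alpha

y_rep_same_patients = np.array([replicate(1, s) for s in range(S)])   # replication (1)
y_rep_new_patients  = np.array([replicate(2, s) for s in range(S)])   # replication (2)
y_rep_new_drug      = np.array([replicate(3, s) for s in range(S)])   # replication (3)
```

          Scheme 2 is the “same drug, new patients” replication, and scheme 3, which re-draws theta as well, reduces to a prior predictive check.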

          With regard to test #3 (the prior predictive check), again my point is that researchers often do just fine in the posterior with nonsensical priors. Look at the Box and Tiao book or, for that matter, most of Bayesian Data Analysis. All over the place we’re in the position of estimating some parameter theta such as a logistic regression coefficient and we give it a uniform prior density on (-infinity, infinity). This (a) makes no sense and (b) is trivially violated, for any data, by a prior predictive check using a test statistic such as T(y) = |y| (or, in the logistic case, something like |mean(y) – 0.5|), since under the prior the probability is 1 that theta exceeds any finite value in absolute value. This is not a trick; flat priors really don’t make sense, either in theory or in reality. Indeed I’ve been motivated by several sources (not least including your own work) to move away from noninformative priors. That all said, in settings with strong data, noninformative priors can be ok, and in such settings I’d like to have both checks: the posterior predictive check showing the predictive behavior of the posterior distribution (under whatever replications are deemed important) and the prior predictive check reminding me that the model indeed makes no sense.

        • Thanks Andrew for the further explanations. Here are edited excerpts from what I wrote you in response to your latest comments:

          You said: “if you have a flat prior on an unbounded space, the data will automatically be in complete conflict with the prior, if you look at a test statistic such as T(y)=|y|. This sort of conflict with the prior will completely doom you if you are interested in replications of new parameters from the model but not so much if you are interested in replications with the same parameters. In Sander Greenland’s example from the comments, a flat prior will doom you if you’re interested in using the model to make predictions about a new drug by drawing from the prior, but not so much if you’re using the model to make predictions about an existing drug.”

          – From the classical frequentist perspective (which I am adopting for the purposes of this debate), the statement about being ‘doomed when drawing from the prior’ is meaningless because the parameter is fixed and that puts us in the same-parameter case automatically.

          Now, the frequentist model can be extended to allow parameter draws via random-parameter models, but those must have proper distributions. Improper priors won’t do because they have no sampling interpretation, as can be seen by noting that there is no finite computer program that randomly samples uniformly across the real line, or even the integers – a so-called improper prior is not even a probability measure and so does not support probability statements about sampling from it.

          So, sticking with the normal case for simplicity (as in your first example) with y ~ normal(mu,1), we can avoid these frequentist objections by considering a normal(0,tau^2) random-parameter distribution p(mu) for mu with known finite tau (here I’ve relabeled your A by tau). I then want a valid frequentist P-value (U-value) to evaluate compatibility among the assumed fully-specified unconditional random-parameter distribution p(mu) with the data model f(y_rep|mu) and the data y. The P value I want reduces to a test of whether y came from the marginal normal(0,tau^2+1) distribution under the entire model {p(mu), f(y_rep|mu)} (which is identical to comparing two normal draws, a ‘prior draw’ equal to 0 = E(mu) with SD tau and a ‘current draw’ equal to y with SD 1). That test is the one given by the prior predictive P in your paper, derived from the chi-square y^2/(tau^2+1), which goes to zero for any fixed observation y as tau increases. (In the logistic case the asymptotic generalization is straightforward – although there I would also want valid P-values testing the fit between the data model f(y_rep|mu) and the data y, e.g., the Pearson chi-squared fit test when the data are nonsparse and discrete.) If as in your example this P is tiny (0.0000003), it is telling me that y is wildly unexpected under the total model {p(mu), f(y_rep|mu)} and so I had better see if one or both model elements are wrong, or if y is itself misrecorded.
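
          A small sketch of this closed-form check (the y and tau values passed in at the end are hypothetical, chosen only to show how the P-value behaves):

```python
# Sketch of the prior-data compatibility check for y ~ Normal(mu, 1) with
# mu ~ Normal(0, tau^2): refer y^2 / (tau^2 + 1) to a chi-square(1).
import numpy as np
from scipy.stats import chi2

def prior_data_check(y, tau):
    """Two-sided prior predictive P-value for a single observation y."""
    stat = y**2 / (tau**2 + 1)
    return chi2.sf(stat, df=1)

# Hypothetical numbers, just to show the behavior: the P-value shrinks as |y|
# grows relative to tau, and rises toward 1 as tau grows for fixed y.
print(prior_data_check(y=5.0, tau=1.0))    # strong prior-data conflict
print(prior_data_check(y=5.0, tau=10.0))   # conflict dissolves as tau grows
```

          The chi-square form is just the two-sided tail area of y under the marginal normal(0, tau^2 + 1) distribution implied by the full model.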

          The bottom line is, even in this simplest case I have 3 inputs (p(mu), f(y_rep|mu), y) that I want to check for mutual compatibility (stochastic consistency) before merging – even if I decide to merge them despite revealed incompatibilities. I want to know of any danger signs, and as a frequentist that means I want U-values for these checks. The checks I want don’t require exclusion of anything else (like PPP) and are easy computationally. The prior checks may not be essential if the prior is obviously swamped by the data. This swamping will always occur for noninformative priors, but should not be taken for granted otherwise, even for the proper default or reference priors which I think even frequentists should adopt (since they improve frequency properties where we should be most concerned about them).

          It sounds to me that you have moderated your initial stance (which seemed to be rejecting prior vs likelihood checking as a routine diagnostic); at least I hope so, as it would resolve our debate in practical terms (although still leave me at a loss about how to make use of PPP).

        • Sander:

          I’m flattered to have my name brought into this august debate [pun not intended but noticed after the fact!]. However, on thinking about it I’m afraid my question about how to judge a particular PPP-value was essentially rhetorical and I don’t really expect there to be an answer, beyond the limits of interpretation of any probability (as discussed by Andrew). From a conditional (Bayesian, if you like) point of view, it is not clear to me why, for a particular applied problem, knowing that your probability could be embedded in a well-behaving hypothetical repeated sampling framework (5% chance of making “bad” rejections etc) really helps. This seems like the age-old conundrum of the relevance of frequentist evaluations — to me they seem to provide reassurance that families of procedures (penalised estimation, for example) behave well enough to be widely recommended, but I don’t see how they really help with specific datasets, i.e. given the data I’m analysing how does the presence of a defined frequency evaluation help me think about the conclusions I might draw? This seems particularly pertinent to model checking, where I think we want to know about aspects of *these particular data* that do or do not fit the model framework that we’ve proposed. [Note that I’m leaving aside the question of which particular predictive distributions might be more or less useful, which has been covered extensively by you and Andrew.]

          On the other hand, I fear that in the real world of data analysis, mostly done following standard procedures and with somewhat formulaic interpretation of results, people may be a little too willing to buy into a “method” because it’s been recommended on the basis of its good behaviour. So it might actually be a little dangerous to present an officially “calibrated” procedure for performing something that essentially requires thought and judgement conditional on the data… Final thought is that the “P-value” aspect of posterior predictive checking seems to have caused a lot of grief; I guess largely because the terminology brings too many connotations from those other P-values (which are surely even harder to interpret usefully despite the fact that Nature felt it OK to refer to them as “the ‘gold standard’ of scientific validity” in the headline of their otherwise useful article (Nuzzo, Nature, 13-Feb-2014)).

        • Thanks John for posting back…

          One rationale for NP calibration is that we are seeing how reliable our procedure is in simple thought experiments. If it can’t give us a reliable guide in these simple situations, why should we trust it to magically do well in the messy reality we face? Without these calibrations, in any application we are left at the mercy of often opaque, mysteriously varying and (frankly) likely biased judgments about what constitutes acceptable and unacceptable procedures or fits.

          Of course we already are always partially at such mercy, but the point of demanding relevant calibration is to lessen capricious aspects of statistical analyses, to make them at least a little better than biased opinions or damn lies shrouded in a dazzling cloak of Markov Chains. There is no objectivity in any absolute sense (any more than absolute motion), but there is a practical meaning behind the overblown (and often propagandistic) claims to “objectivity” in frequentist calibrations: they provide guidelines about how to proceed grounded on performance of methods in simpler, transparent situations called sampling models. Single-case judgments cannot be avoided, but that should not make them immune to rational criticism and insights such as those offered by calibration studies. In particular, we should want to see how what we are doing would fare under models that are tailored to resemble the current setting.

          As Box explained, these sampling models are priors about the current setting, and thus worthy of Bayesian criticism for their failure to account for contextual information (coherence); conversely, Bayesian analyses are worthy of criticism by seeing how they behave under contextually sound sampling models. We thus have a feedback loop for model development which we follow until we see insufficient value added by continuing.

          Without a frequentist/sampling criticism and calibration stage we never even start this conceptual cross-validation loop. How then are we to assure our readers that our Bayesian results can be trusted under any circumstance, including the current one? To paraphrase a criticism of fiducial inference back before we were born, are we supposed to send every analysis to the authors of BDA for determination of whether the various model specifications and the data appear mutually consistent enough to safely rely on the generated posteriors?

          For more on the deficiencies of Bayesian analyses unrestrained by frequentist perspectives, I urge everyone who has made it this far to read Cox’s comment on Lindley in The Statistician, 2000; 49(3): 321-324, if they have not done so already. I’m sure there are many other excellent commentaries to this effect which could be profitably cited here, and I would value recommendations.

    • “committed Bayesians we don’t reject all frequency evaluations”

      If by “committed Bayesian” you mean “one of the more die-hard polemical Bayesians of the 20th century”, then E. T. Jaynes was certainly committed. Yet when the goal of the analysis was to infer or predict real frequencies, he didn’t think twice about comparing his predictions to actual frequencies (a famous example of this supposedly influenced Gelman).

      Hence there is nothing un-Bayesian about “frequency evaluations” if your goal is to predict/infer a frequency.

      Nor does sometimes equating probabilities with frequencies numerically make you a Frequentist. When the analysis showed prob ~ freq, Jaynes again didn’t hesitate to compare them for that problem.

      It’s not the use of frequencies that makes Frequentists what they are. The key to being a frequentist is ALWAYS interpreting probabilities as frequencies.

      Just as not every use of Bayes theorem makes you a Bayesian, not every mention of frequencies makes you a Frequentist.

  11. Doing a Wikipedia search on Oona suggests that it is in fact more or less bog-standard Irish, but that it’s also popular in Finland, which I guess explains why I thought of it as Scandinavian.

  12. Pingback: Shared Stories from This Week: Aug 15, 2014

  13. Pingback: "The Firth bias correction, penalization, and weakly informative priors: A case for log-F priors in logistic and related regressions" - Statistical Modeling, Causal Inference, and Social Science

  14. Pingback: Bayesian model checking via posterior predictive simulations (Bayesian p-values) with the DHARMa package | theoretical ecology

  15. Pingback: The house is stronger than the foundations - Statistical Modeling, Causal Inference, and Social Science
