One rationale for NP calibration is that we are seeing how reliable our procedure is in simple thought experiments. If it can’t give us a reliable guide in these simple situations, why should we trust it to magically do well in the messy reality we face? Without these calibrations, in any application we are left at the mercy of often opaque, mysteriously varying and (frankly) likely biased judgments about what constitutes acceptable and unacceptable procedures or fits.

Of course we already are always partially at such mercy, but the point of demanding relevant calibration is to lessen capricious aspects of statistical analyses, to make them at least a little better than biased opinions or damn lies shrouded in a dazzling cloak of Markov Chains. There is no objectivity in any absolute sense (any more than absolute motion), but there is a practical meaning behind the overblown (and often propagandistic) claims to “objectivity” in frequentist calibrations: they provide guidelines about how to proceed grounded on performance of methods in simpler, transparent situations called sampling models. Single-case judgments cannot be avoided, but that should not make them immune to rational criticism and insights such as those offered by calibration studies. In particular, we should want to see how what we are doing would fare under models that are tailored to resemble the current setting.

As Box explained, these sampling models are priors about the current setting, and thus worthy of Bayesian criticism for their failure to account for contextual information (coherence); conversely, Bayesian analyses are worthy of criticism by seeing how they behave under contextually sound sampling models. We thus have a feedback loop for model development which we follow until we see insufficient value added by continuing.

Without a frequentist/sampling criticism and calibration stage we never even start this conceptual cross-validation loop. How then are we to assure our readers that our Bayesian results can be trusted under any circumstance, including the current one? To paraphrase a criticism of fiducial inference back before we were born, are we supposed to send every analysis to the authors of BDA for determination of whether the various model specifications and the data appear mutually consistent enough to safely rely on the generated posteriors?

For more on the deficiencies of Bayesian analyses unrestrained by frequentist perspectives, I urge everyone who has made it this far to read Cox’s comment on Lindley in The Statistician, 2000; 49(3): 321-324, if they have not done so already. I’m sure there are many other excellent commentaries to this effect which could be profitably cited here, and I would value recommendations.

]]>I’m flattered to have my name brought into this august debate [pun not intended but noticed after the fact!]. However, on thinking about it I’m afraid my question about how to judge a particular PPP-value was essentially rhetorical and I don’t really expect there to be an answer, beyond the limits of interpretation of any probability (as discussed by Andrew). From a conditional (Bayesian, if you like) point of view, it is not clear to me why, for a particular applied problem, knowing that your probability could be embedded in a well-behaving hypothetical repeated sampling framework (5% chance of making “bad” rejections etc) really helps. This seems like the age-old conundrum of the relevance of frequentist evaluations — to me they seem to provide reassurance that families of procedures (penalised estimation, for example) behave well enough to be widely recommended, but I don’t see how they really help with specific datasets, i.e. given the data I’m analysing how does the presence of a defined frequency evaluation help me think about the conclusions I might draw? This seems particularly pertinent to model checking, where I think we want to know about aspects of *these particular data* that do or do not fit the model framework that we’ve proposed. [Note that I’m leaving aside the question of which particular predictive distributions might be more or less useful, which has been covered extensively by you and Andrew.]

On the other hand, I fear that in the real world of data analysis, mostly done following standard procedures and with somewhat formulaic interpretation of results, people may be a little too willing to buy into a “method” because it’s been recommended on the basis of its good behaviour. So it might actually be a little dangerous to present an officially “calibrated” procedure for performing something that essentially requires thought and judgement conditional on the data… Final thought is that the “P-value” aspect of posterior predictive checking seems to have caused a lot of grief; I guess largely because the terminology brings too many connotations from those other P-values (which are surely even harder to interpret usefully despite the fact that Nature felt it OK to refer to them as “the ‘gold standard’ of scientific validity” in the headline of their otherwise useful article (Nuzzo, Nature, 13-Feb-2014)).

]]>You said: “if you have a flat prior on an unbounded space, the data will automatically be in complete conflict with the prior, if you look at a test statistic such as T(y)=|y|. This sort of conflict with the prior will completely doom you if you are interested in replications of new parameters from the model but not so much if you are interested in replications with the same parameters. In Sander Greenland’s example from the comments, a flat prior will doom you if you’re interested in using the model to make predictions about a new drug by drawing from the prior, but not so much if you’re using the model to make predictions about an existing drug.”

– From the classical frequentist perspective (which I am adopting for the purposes of this debate), the statement about being ‘doomed when drawing from the prior’ is meaningless because the parameter is fixed and that puts us in the same-parameter case automatically.

Now, the frequentist model can be extended to allow parameter draws via random-parameter models, but those must have proper distributions. Improper priors won’t do because they have no sampling interpretation, as can be seen by noting that there is no finite computer program that randomly samples uniformly across the real line, or even the integers – a so-called improper prior is not even a probability measure and so does not support probability statements about sampling from it.

So, sticking with the normal case for simplicity (as in your first example) with y ~ normal(mu,1), we can avoid these frequentist objections by considering a normal(0,tau^2) random-parameter distribution p(mu) for mu with known finite tau (here I’ve relabeled your A by tau). I then want a valid frequentist P-value (U-value) to evaluate compatibility among the assumed fully-specified unconditional random-parameter distribution p(mu) with the data model f(y_rep|mu) and the data y. The P value I want reduces to a test of whether y came from the marginal normal(0,tau^2+1) distribution under the entire model {p(mu), f(y_rep|mu)} (which is identical to comparing two normal draws, a ‘prior draw’ equal to 0 = E(mu) with SD tau and a ‘current draw’ equal to y with SD 1). That test is the one given by the prior predictive P in your paper, derived from the chi-square y^2/(tau^2+1), which goes to zero for any fixed observation y as tau increases. (In the logistic case the asymptotic generalization is straightforward – although there I would also want valid P-values testing the fit between the data model f(y_rep|mu) and the data y, e.g., the Pearson chi-squared fit test when the data are nonsparse and discrete.) If as in your example this P is tiny (0.0000003), it is telling me that y is wildly unexpected under the total model {p(mu), f(y_rep|mu)} and so I had better see if one or both model elements are wrong, or if y is itself misrecorded.
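To spell out the reduction described above, here is a short worked version in the same notation. It is a sketch under the stated normal assumptions; the symbol \Phi (the standard normal CDF) is the only notation added here.

```latex
\begin{aligned}
\mu &\sim \mathrm{N}(0,\tau^2), \qquad y \mid \mu \sim \mathrm{N}(\mu, 1)
  \;\Longrightarrow\; y \sim \mathrm{N}(0,\ \tau^2 + 1),\\
X^2 &= \frac{y^2}{\tau^2 + 1} \sim \chi^2_1
  \quad \text{under the total model } \{p(\mu),\ f(y_{\mathrm{rep}} \mid \mu)\},\\
P &= \Pr\!\left(\chi^2_1 \ge \frac{y^2}{\tau^2 + 1}\right)
   = 2\left[1 - \Phi\!\left(\frac{|y|}{\sqrt{\tau^2 + 1}}\right)\right].
\end{aligned}
```

For any fixed y, X^2 shrinks toward zero as tau grows, so P approaches 1; a tiny P arises only when |y| is large relative to sqrt(tau^2+1).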

The bottom line is, even in this simplest case I have 3 inputs (p(mu), f(y_rep|mu), y) that I want to check for mutual compatibility (stochastic consistency) before merging – even if I decide to merge them despite revealed incompatibilities. I want to know of any danger signs, and as a frequentist that means I want U-values for these checks. The checks I want don’t require exclusion of anything else (like PPP) and are easy computationally. The prior checks may not be essential if the prior is obviously swamped by the data. This swamping will always occur for noninformative priors, but should not be taken for granted otherwise, even for the proper default or reference priors which I think even frequentists should adopt (since they improve frequency properties where we should be most concerned about them).

It sounds to me that you have moderated your initial stance (which seemed to be rejecting prior vs likelihood checking as a routine diagnostic); at least I hope so, as it would resolve our debate in practical terms (although still leave me at a loss about how to make use of PPP).

]]>As I wrote on the blog (although perhaps this was in response to a different commenter, not you; I don’t remember now), different posterior predictive checks answer different questions. In a model with parameter theta describing a drug, parameters alpha describing patients, and data y, you can consider replications such as:

(1) theta, alpha, y.rep (that’s new data from the same drug and same patients)

(2) theta, alpha.rep, y.rep (new data from the same drug and new patients)

(3) theta.rep, alpha.rep, y.rep (new data from new drug and new patients)

All these can be thought of as posterior predictive checks under different replications. #3 is also a prior predictive check. It’s my impression that #2 would serve your purposes the best. But there’s no reason not to do #1 and #3 as well.
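The three replications can be made concrete with a small sketch. Everything in it is an assumption for illustration: the toy normal model, the constants DRUG_SD and PATIENT_SD, and the function name replicate are hypothetical and not taken from the discussion. In a real check the inputs theta and alpha would be posterior draws.

```python
import random

# Hypothetical toy model, chosen only to make the three replications concrete:
#   theta   ~ normal(0, DRUG_SD)          drug-level parameter
#   alpha_j ~ normal(0, PATIENT_SD)       one parameter per patient
#   y_j     ~ normal(theta + alpha_j, 1)  data
DRUG_SD = 2.0
PATIENT_SD = 1.0


def replicate(level, theta, alpha, rng):
    """Draw y.rep starting from one posterior draw (theta, alpha).

    level 1: keep theta and alpha      (same drug, same patients)
    level 2: keep theta, redraw alpha  (same drug, new patients)
    level 3: redraw theta and alpha    (new drug, new patients)
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    if level == 3:
        # New drug: theta comes from the model (the prior), not the posterior.
        theta = rng.gauss(0.0, DRUG_SD)
    if level >= 2:
        # New patients: alpha is redrawn from the patient-level distribution.
        alpha = [rng.gauss(0.0, PATIENT_SD) for _ in alpha]
    return [rng.gauss(theta + a, 1.0) for a in alpha]


rng = random.Random(1)
theta_draw, alpha_draw = 0.5, [0.3, -1.2, 0.8]  # stand-ins for one posterior draw
y_rep = {level: replicate(level, theta_draw, alpha_draw, rng) for level in (1, 2, 3)}
```

At level 3 the posterior draw is ignored entirely, which is why that replication coincides with a prior predictive check.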

With regard to test #3 (the prior predictive check), again my point is that researchers often do just fine in the posterior with nonsensical priors. Look at the Box and Tiao book or, for that matter, most of Bayesian Data Analysis. All over the place we’re in the position of estimating some parameter theta such as a logistic regression coefficient and we give it a uniform prior density on (-infinity, infinity). This (a) makes no sense and (b) is trivially violated, for any data, by a prior predictive check using a test statistic such as T(y) = |y| (or, in the logistic case, something like |mean(y) – 0.5|), since under the prior the probability is 1 that |theta| exceeds any finite value. This is not a trick; flat priors really don’t make sense, either in theory or in reality. Indeed I’ve been motivated by several sources (not least including your own work) to move away from noninformative priors. That all said, in settings with strong data, noninformative priors can be ok, and in such settings I’d like to have both checks: the posterior predictive check showing the predictive behavior of the posterior distribution (under whatever replications are deemed important) and the prior predictive check reminding me that the model indeed makes no sense.

]]>There are two issues. First, you wrote, “sometimes a decision has to be made and decision makers need to be informed what a ‘Bayes factor greater than 30’ means.” My answer to that is that, if you want to make a decision, you should plug these probabilities into a utility analysis and there is no need to interpret Bayes factors as being “strong evidence” or “weak evidence” or whatever.

The second point is that we do not trust our models. Again, though, it’s not so simple. We can make lots of progress with flat priors (not always, but often), even though if you have a flat prior on an unbounded space, the data will automatically be in complete conflict with the prior, if you look at a test statistic such as T(y)=|y|. This sort of conflict with the prior will completely doom you if you are interested in replications of new parameters from the model but not so much if you are interested in replications with the same parameters. In Sander Greenland’s example from the comments, a flat prior will doom you if you’re interested in using the model to make predictions about a new drug by drawing from the prior, but not so much if you’re using the model to make predictions about an existing drug.

]]>I never said that the decision should involve probability alone but if probability plays a role in my decision, I definitely would be interested to be informed if I can trust the derived probability value. If the prior and the likelihood are substantially incompatible with each other, to me that is a sign that something might not be right somewhere. ]]>

1. As noted above (I believe) and as demonstrated in chapter 6 of BDA and in my statistical practice, I find graphical model checks to be much more valuable than p-values. But, to the extent we use p-values, I interpret them as probabilities. If you want a calibration scale, just recall that p=1/2 corresponds to the probability of a coin flip landing “heads,” p=1/4 corresponds to 2 heads in a row, etc. We discuss this sort of thing further in chapter 1 of BDA. Probabilities have their own natural scale. That is not the same thing as blood pressure measurements whose scale is external.

2. If there’s only one drug, and you have parameters “theta” characterizing the drug, parameters “alpha” characterizing the patients and the conditions in the study, and data “y,” then it sounds to me like you’d want a predictive check in which theta is unchanged, alpha.rep is re-drawn from the model given theta, and y.rep is re-drawn from the model given theta and alpha.rep. In these sorts of interesting real-world situations I find it helpful to think of the graph of the model and consider what is being replicated.

3. I’ve linked to this one on the blog before, but, in any case, my thoughts on conditioning for tests on contingency tables are in section 3.3 of this paper from 2003, a paper which ultimately derives from a talk I gave in 1997. I wish I’d been aware of your 1991 paper when I did all this.

]]>I completely completely disagree with you. When a decision needs to be made using posterior probabilities, it can be made directly by defining the utility function and so forth. We provide 3 detailed examples in the Decision Analysis chapter of BDA. Decisions should be made based on costs, benefits, and probabilities, not on probabilities alone. This is a fundamental principle of Bayesian decision theory.
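A minimal sketch of what plugging probabilities into a utility analysis can look like. The actions, the utility numbers, and the function names are hypothetical and chosen only for illustration; they are not from BDA.

```python
def expected_utilities(p_effective, utilities):
    """Expected utility of each action.

    utilities maps action -> (utility if the drug works, utility if it does not).
    """
    return {
        action: p_effective * u_works + (1.0 - p_effective) * u_fails
        for action, (u_works, u_fails) in utilities.items()
    }


def best_action(p_effective, utilities):
    """Pick the action with the highest expected utility."""
    eu = expected_utilities(p_effective, utilities)
    return max(eu, key=eu.get)


# Hypothetical costs and benefits: approving a useless drug is costly.
UTILITIES = {"approve": (10.0, -50.0), "hold": (0.0, 0.0)}

choice_at_90 = best_action(0.90, UTILITIES)
choice_at_80 = best_action(0.80, UTILITIES)
```

With these made-up utilities the break-even posterior probability is 50/60, so a probability of 0.90 favors approval and 0.80 does not. The same probability could lead to the opposite decision under different costs, which is why no fixed verbal scale for the probability is needed.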

]]>On your point #1, sometimes a decision has to be made and decision makers need to be informed what a “Bayes factor greater than 30” means. ]]>

2. The situation I had in mind was precisely one drug only. In your first example where you dismiss the prior PP, I would take it as evidence that either the data are very in error (e.g., fraudulent or miscoded), or else I really had no idea what the parameter meant before going into the problem. Either way, I had better stop and investigate (think outside the prior and data model). The posterior PP misses all this. Plus, prior tests are so easy to compute that I can see no reason to not do them, e.g., in our usual normal-prior+approximate-normal-MLE regression models we can test the difference between the prior-mean coefficient vector and the MLE vector.

3. Causal inference within a single randomized group is interesting indeed, and there is a sizable (if unsettled) literature on it traceable back at least to one of the fights between Neyman and Fisher in the 1930s. Here’s but one article on the topic, connecting it to the old margin-conditioning dispute: Greenland, S. (1991). On the logical justification of conditional tests for two-by-two contingency tables. The American Statistician, 45, 248-251.
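The prior-versus-MLE test mentioned in point 2 can be written out as follows. This is one standard form, assumed here rather than quoted from the comment: \beta_0 and V_0 denote the prior mean and covariance, \hat\beta and \hat V the MLE and its estimated covariance, k the number of coefficients, and the prior is taken to be independent of the sampling error.

```latex
\begin{aligned}
\hat\beta - \beta_0 &\;\overset{\text{approx.}}{\sim}\; \mathrm{N}_k\!\left(0,\ V_0 + \hat V\right),\\
X^2 &= (\hat\beta - \beta_0)^{\top} \left(V_0 + \hat V\right)^{-1} (\hat\beta - \beta_0)
  \;\overset{\text{approx.}}{\sim}\; \chi^2_k,\\
P &= \Pr\!\left(\chi^2_k \ge X^2\right).
\end{aligned}
```

With k = 1, \beta_0 = 0, V_0 = \tau^2 and \hat V = 1 this reduces to the statistic y^2/(\tau^2+1) from the single-observation normal example discussed earlier in the thread.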

]]>The big tell though was linking to a site that was all about curing herpes, losing weight fast, working from home, getting rich quick etc.

]]>1. I just interpret the probabilities as probabilities, I don’t try to transform them to a verbal scale such as “reassuring” etc. I have a similar discomfort when Bayesians try to set up rules such as “Bayes factor greater than 30 is ‘strong evidence’” etc. I’d rather have the probabilities stand for themselves.

2. We have many many examples of posterior predictive checks in our books and applied research articles. There are settings where it can make sense to do a pure prior predictive check (which is a special case of posterior predictive check under a particular replication in which all parameters are re-sampled from the model), but in your setting where a drug is being studied, I don’t think you’d want to do this. From your description above it sounds to me like you’re interested in new patients and new conditions but not a new drug; i.e., the “theta” representing the drug will not be resampled.

As a side note, I agree with you on the emphasis on being interested in new people. One thing that frustrates me with some presentations of causal inference (including Rubin’s) is the focus on causal inference for the people who happen to be in the experiment. Almost always the people in the experiment are taken as a sample from some larger population of interest, and I find it very frustrating when researchers ignore this step of generalization and smugly think that they’re causally kosher just because they’ve randomized within their group. (See the freshman fallacy.)

]]>1) Is PPP=25% reassuring about the model or not, for whatever purpose? If not reassuring, why? (or, why is 25% small?) If reassuring, why? (or, why is 25% large?). Same question for 5%, 10%, 15%, 20%, 30%, 35%, 40% etc.

Regarding my second question, I don’t think we disagree that prior predictive misfit is extremely important if the model is being used to make predictions for new groups. Where I work (epidemiologic research), the job of the model is always, ultimately, to predict observations in a new setting under different potential actions (like whether mortality will increase or decrease if a drug gets pulled off the market, compared to no status change) – the future is always the target, bringing in new settings with new groups. Even when the stated goal is to predict what would have happened to an existing group under a counterfactual (as in compensation cases), that counterfactual involves new conditions which effectively define a new (unobserved) group for prediction (or postdiction, if you prefer). Therefore, whatever the uses for PPP, I always want to see prior vs. likelihood checks for a Bayesian analysis unless it is obvious that there could be no serious conflict (as when, in approximately normal cases, the MLE is within a few SEs or a few prior SDs of the prior mean).

]]>I almost wonder if the algorithm was searching for similar text/comments using keywords from the post and stringing some search results together to make it seem like a real comment.

]]>Thanks. I put that comment in the spam folder. I’ll have to be more careful with these. It is a sad comment on the world economy if someone is being paid to write this sort of spam.

]]>Yes, that’s right. It’s from Jaynes that I took the idea of taking a model seriously, riding it hard until it breaks, then thinking about how to improve it.

]]>When I talk about a model, I’m not talking about whether swans are white. I’m talking about statistical models which have assumptions such as additivity, linearity, specific correlation structures, etc., all of which are precise assumptions which are false in the sense that with sufficient data they can be clearly refuted.

]]>Checking one’s model *can* use data other than the data observed but this is not *required*. For example if you have 1000 data points that are purportedly independent draws from a single normal distribution, but the data are highly skewed, this is evidence from the data alone that there is a problem with the model.
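A minimal sketch of such a data-only check, assuming sample skewness as the test statistic. The function names and the exponential example data are hypothetical. Because skewness is unchanged by shifting and rescaling, its distribution under any normal model can be simulated from the standard normal without estimating the mean or variance.

```python
import random
import statistics


def sample_skewness(xs):
    """Standardized third moment of the sample."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def skewness_check(ys, n_rep=200, seed=0):
    """Compare |skewness| of ys with replicated normal samples of the same size.

    Returns the observed statistic and the share of replications whose
    |skewness| is at least as large (a Monte Carlo p-value).
    """
    rng = random.Random(seed)
    n = len(ys)
    t_obs = abs(sample_skewness(ys))
    exceed = 0
    for _ in range(n_rep):
        rep = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(sample_skewness(rep)) >= t_obs:
            exceed += 1
    return t_obs, exceed / n_rep


# Hypothetical data: 1000 draws that are purportedly normal but are in fact
# exponential, hence strongly right-skewed.
data_rng = random.Random(42)
skewed = [data_rng.expovariate(1.0) for _ in range(1000)]
t_obs, p_value = skewness_check(skewed)
```

No outside data enter the check: the observed skewness is far beyond anything a normal sample of 1000 points typically produces, and that alone flags the model.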

(1) If Bayesians disagree among themselves, that’s because they’re all wrong.

(2) Gelman’s comparing simulated data to the actual data violates the Likelihood principle.

(3) If sampling distributions play a key role in posteriors then you don’t need the posteriors.

]]>If by “committed Bayesian” you mean “one of the more die hard polemical Bayesians of the 20th century”, then E. T. Jaynes was certainly committed. Yet when the goal of the analysis was to infer or predict real frequencies he didn’t think twice about comparing his predictions to actual frequencies (a famous example of this supposedly influenced Gelman).

Hence there is nothing un-Bayesian about “frequency evaluations” if your goal is to predict/infer a frequency.

Nor does sometimes equating probabilities with frequencies numerically make you a Frequentist. When the analysis showed prob ~ freq, Jaynes again didn’t hesitate to compare them for that problem.

It’s not the use of frequencies that makes Frequentists what they are. The key to being a frequentist is ALWAYS interpreting probabilities as frequencies.

Just as not every use of Bayes theorem makes you a Bayesian, not every mention of frequencies makes you a Frequentist.

]]>Certainly a very small PPP would be of concern, but how do I interpret PPP = 0.25 = 1/4 if I don’t know the PPP operating characteristics over various possible hypotheses, only that it concentrates near 0.50 = 1/2 under the null? Does PPP=1/4 bode well or ill for the model in the application? This is a posterior probability but not a U-value, so (as Stephen Senn has astutely observed) there is no basis for carrying over “significance levels” that have become entrenched for calibrated if hypothetical frequency statistics. With the PPP unmoored (and indeed far out to sea) from a uniform reference distribution under the tested hypothesis, I feel at a loss for making sense of it. It is like having a room-temperature display given in log degrees Kelvin where the base of the logs is unknown.

To repeat my early comments, the other problem I have with PPP values is displayed in the first example (p. 2598) of Andrew’s paper in the Electronic Journal of Statistics:

http://projecteuclid.org/euclid.ejs/1382448225

In the example the PPP = 0.49, but the prior predictive P = 0.0000003, indicating terrible conflict between the assumed prior information and the likelihood. On the face of it, the PPP says it is OK to proceed as if nothing is wrong, but the discrepancy is a danger signal that is being glossed over by PPP – after all, maybe the observation was mis-recorded, maybe it was faked (both phenomena happen frequently enough in health and medical science). I can’t imagine a setting in which I would not want to be alerted when data are extremely far from prior expectations, as displayed in that example. This objection does not exclude giving PPP, but contrary to the impression the paper gives, the example does not excuse skipping a prior check or a recalibrated PPP.
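The contrast between the two P-values can be reproduced in a small sketch of the conjugate normal model, mu ~ normal(0, tau^2) and y ~ normal(mu, 1). The values y = 1000 and tau = 100 are illustrative assumptions, not the numbers from the paper, and the PPP here uses the test statistic T(y_rep) = y_rep, which is also an assumption.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def prior_predictive_p(y, tau):
    """Two-sided P for y against its marginal normal(0, tau^2 + 1)."""
    # 2 * (1 - Phi(a)) equals erfc(a / sqrt(2)), with a = |y| / sqrt(tau^2 + 1).
    return math.erfc(abs(y) / math.sqrt(2.0 * (tau ** 2 + 1.0)))


def posterior_predictive_p(y, tau):
    """Pr(y_rep >= y | y), where y_rep is new data with the same mu."""
    shrink = tau ** 2 / (tau ** 2 + 1.0)
    post_mean = shrink * y   # E(mu | y)
    post_var = shrink        # Var(mu | y)
    z = (y - post_mean) / math.sqrt(post_var + 1.0)
    return 1.0 - normal_cdf(z)


y_obs, tau = 1000.0, 100.0  # illustrative values only
p_prior = prior_predictive_p(y_obs, tau)
p_post = posterior_predictive_p(y_obs, tau)
```

With these inputs the observation sits about ten marginal standard deviations from the prior mean, so the prior predictive P is minuscule, while the PPP stays a little below one half. The posterior has already absorbed y, so a replication with the same mu looks unremarkable.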

So I’d like to get straight, specific answers to:

1) how we are supposed to judge specific PPP values, like PPP=0.25? and

2) where would we not want to know that the data we saw are at severe odds with what we expected to see a priori?

I think identifying my problems with PPP as a holdover from null hypothesis testing is dodging these hard questions. Neither of my concerns has to do with believing the null model is true or probably true. At least in some applications, Neyman, Pearson and Fisher were all clear that they were trying to distance their methods from such prior beliefs, so the null models they were testing were tenable working models, and the purpose of their tests was to see if the model remained tenable (meaning reasonable to continue using) in light of new data. For Neyman in particular this was explicated as a decision to either continue to “behave as if” the model were true, or reject it and move on, not to believe in the model. That subsequent writers started off in a fairy-tale world of true models is a tragedy (encouraged by traditional math-stat formalisms), much like the tragedy of those who think a model is true because P exceeds 0.05; but neither of these tragedies is a core part of the original methodologies.

In modern frequency theory the null model is just a reference point traditionally used for calibration purposes, but calibration at other points is not only possible but important. In the NP framework that’s addressed by confidence intervals and power analysis, major steps that NP took ahead of Fisher. Fisherian theory can do even better by introducing multipoint testing and P-value functions (as Birnbaum did in 1961). As far as I can see, PPP is even less interpretable in calibration terms off the null than on.

I have the impression Andrew and I agree about general philosophy of applied stats and data analysis, and about most details as well. For example, unlike committed Bayesians we don’t reject all frequency evaluations; unlike committed frequentists, we both want to move beyond flat priors to use real prior information; and unlike a lot of applied Bayesians, we both reject spiked priors for our applications (for more on that see our exchange in Epidemiology 2013; vol. 24, p. 62-78). And (like Rod) I think both of us prefer pragmatism to dogmatism. So I am most interested in getting to the bottom of our divergence here.

]]>When it comes to pragmatism, it is perhaps relevant that posterior predictive checks are easy to do!

]]>And I was construing the “old Bayes story” as the Lindleyesque axiom system that represents past knowledge and updates it in a fully coherent manner into definitive probability statements.

]]>Perhaps because we have been more focused on utilising (which requires extensive appraising of) studies carried out and analysed by others in meta-analysis. Here any wrongness can be important, not just wrongness that matters locally for this particular analysis. It is more like an audit (which I have sometimes defined a meta-analysis as being), and even unimportant discrepancies can be worth noting or at least thoroughly checking for.

Also, for me future inferences are not just about the question addressed in the study that will use the posterior as the future prior but other related investigations that will re-use the same approach including the same prior or at least the same way of coming up with the prior.

In my earlier experience in undertaking Bayesian analyses I was burned this way. I started with a published example of how to analyse correlated proportions which had nicely provided the Bugs code. When I ran it, the posterior probability of negative proportions was over 40%! The prior put a lot of (implied) probability on one proportion being negative and the little data I had did little to change that. When I contacted the author suggesting it was inappropriate of him to publish this _as a method to use_ without a warning about this, he simply replied – “It was not a problem for me, given the data I had.”

So though I do not doubt you on this “not the “truth” of the prior, that matter to me” I am not convinced it should not matter to others in this particular and especially future cases. Not that you _should_ but reviewers _should_ ask you to. And those less experienced in Bayesian analysis most definitely _should_.

]]>When I read Keynes (published 1921) I find no hint of a “don’t check your model doctrine”.

When I read Jeffreys (published 1939) I find no hint of a “don’t check your model doctrine”.

When I read Jaynes (publishing from the 1950’s onward) I find no hint of a “don’t check your model doctrine”.

Clearly “old Bayes story” has a somewhat peculiar meaning.

]]>And as hjk termed it the “old bayes story” was marketed as a way to suppress any concern that anything could go wrong with a Bayesian analysis (except perhaps MCMC non-convergence.)

I believe both were/are unfortunate for those who want to get less wrong about the world.

(But as Andrew put it in one of his comments, economy of research – cost/benefit – should always be kept in mind.)

To get a better flavor of this, can you point to an applied article that illustrates someone “out-and-out refusing to check their models”? Or one where someone explicitly denies that it is “important and useful to check models”?

]]>I guess I’m attending the wrong conferences & reading the wrong articles. I thought it would be a ridiculously indefensible position for anyone to insist that models need not be checked. It’s like someone saying let’s design a drug molecule on the computer and use Biochem as reasoning for why it should work and then let’s skip the Drug Trial entirely.

]]>The problem, of course, is how to work it out correctly. Frequentists’ attitude (I am not sure about the methodology) seems to be a requirement to define a universal framework for everything a statistician is about to do, check its distributional properties under some assumptions and thus gain some (maybe more than healthy) amount of confidence in the method. Bayesians counter that it is better to maintain some flexibility because you are never sure about the assumptions around which the distributional properties are defined and it is better to proceed on a common-sense, case-by-case basis.

I thought that this tension more or less sums up your disagreement with Sander Greenland.

]]>It’s your choice whether to believe it or not. But I saw it, I was at a conference of Bayesians in 1991 where just about nobody was interested in checking their models. Also you can see this attitude in lots of applied work. Just read some journal articles.

]]>It’s a cost-benefit thing, for the data model (the “likelihood”) as well as the prior. In either case, including all relevant information takes effort, work that might not be worth it if the data are highly informative for questions of interest.

]]>http://pages.jh.edu/~gazette/2009/06apr09/06truesdell.html

By all accounts, his (second) wife was every bit as big a character as he was. Personally, I take this as firm proof that there’s someone for everyone.

]]>For such a cultured man (and wife – they were quite the duo) seeing his adopted home town descend into barbarism seemed to have a deep effect on him: a bit like those Europeans who came of age before 1914 watching 1914-1945 happen.

He was a Victorian living in the age of the Brady Bunch.

]]>I always meant to get a copy of the book you mention…

]]>We’ll gladly remind Frequentists of their past impropriety and quietly leave unmentioned any speculations as to what would have happened if the tables were turned.

]]>Personally, I’ve never been a ‘classical statistician’, nor have I ‘suppressed’ bayes. It doesn’t float my boat (in its usual forms, anyway) but it’s a shame people like Andrew (though he seems to be doing OK!) faced such strong opposition.

PS your last sentence sounds awfully like how some people around here discuss ‘classical statistics’!

]]>Despite his reputation for being an elitist hard-ass however, he was encouraging and generous. He even sent me some books and papers with his reply.

]]>Initially, it was from a deep dissatisfaction with the way physicists were treating part of their bread and butter subject matter, and wanting something better. I can’t tell you how shocking it was for someone who thought Physics was the king of the sciences to discover mathematicians and engineers were doing this work dramatically better.

His books have gotten incredibly expensive on amazon, but I bought almost all of them back when they were around the $100 mark.

The above quote was taken from the introduction to “Fundamentals of Maxwell’s Kinetic Theory of a Simple Monatomic Gas” which as far as I know was completely ignored by physicists.

]]>I find PPCs more relevant than prior predictive checks, since it is ultimately the posterior model predictions, not the “truth” of the prior, that matter to me. The ideal check would compare the observed data with predictions under the model, given the truth of the model and the true values of unknown parameters. If the check statistic has a distribution that does not depend on the unknown parameters, there is no problem. If it does, however, then the posterior predictive check (PPC) uses a predictive distribution that integrates over the posterior uncertainty in the parameter estimate, instead of using the (unattainable) true values. As a result, the PPC tends to be over-optimistic about the fit of the model – thus a small P value is evidence of a problem, but a large P value is not necessarily evidence that the model is OK (as gauged by this check statistic). One might say “so what, since we know all models are wrong?” But I find it a weakness of the PPC that the large P value is not easily interpretable, since the impact of integrating over the PPD of the unknown parameters is uncertain and variable. So I would conclude that there is no harm in trying PPCs, but interpret large P values “with a pinch of salt”.
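The over-optimism can be seen in a small simulation, sketched here under assumptions of my own choosing: the conjugate model mu ~ normal(0, tau^2), y ~ normal(mu, 1), the test statistic T(y_rep) = y_rep, and the hypothetical function names below. Data are generated from the model itself, so a calibrated (uniform) P-value would fall below 0.025 or above 0.975 about 5% of the time.

```python
import math
import random


def posterior_predictive_p(y, tau):
    """Pr(y_rep >= y | y) for mu ~ normal(0, tau^2), y ~ normal(mu, 1)."""
    shrink = tau ** 2 / (tau ** 2 + 1.0)
    z = (y - shrink * y) / math.sqrt(shrink + 1.0)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - Phi(z)


def extreme_rate(tau, alpha=0.05, n_sim=20000, seed=0):
    """Share of simulated data sets with PPP in either alpha/2 tail,
    when the data really do come from the assumed model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        mu = rng.gauss(0.0, tau)  # parameter drawn from the prior
        y = rng.gauss(mu, 1.0)    # data drawn from the data model
        p = posterior_predictive_p(y, tau)
        if min(p, 1.0 - p) < alpha / 2.0:
            hits += 1
    return hits / n_sim


rate_diffuse_prior = extreme_rate(tau=3.0)
```

In this setup the standardized discrepancy z has variance 1/(2 tau^2 + 1) across repetitions, so with tau = 3 the PPP essentially never reaches the nominal 5% tails. That is the sense in which a large PPP carries little reassurance here.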

Other approaches to checking models also appear less than ideal. I sometimes wonder if just plugging in a good estimate of the parameters, e.g. the ML estimate, might not have some advantages over the PPC, since it avoids the increased dispersion of the checking distribution in PPC arising from uncertainty in estimating the parameters. This increased dispersion seems to me not so beneficial in this setting, unlike the situation where we are carrying out inference under the model. However the ML estimate is not the true value, and tends to favor the model because of the “double dipping” problem. Cross-validation approaches try to side-step this issue by using (to varying degrees) independent samples for estimation and checking. However, these also lose some power in the model check, by not using all the data to estimate the parameters.

In conclusion, I think model-checking is at this stage an art as well as a science, and as my Amstat article indicates, I favor pragmatism over dogmatism.

]]>If so, apology accepted.

The whole thing is a case study in the dangers of “it doesn’t make sense to me, so it must not make sense to anyone” reasoning.

]]>