As regular readers of this space should be aware, Bayesian model checking is very important to me:
1. Bayesian inference can make strong claims, and, without the safety valve of model checking, many of these claims will be ridiculous. To put it another way, particular Bayesian inferences are often clearly wrong, and I want a mechanism for identifying and dealing with these problems. I certainly don’t want to return to the circa-1990 status quo in Bayesian statistics, in which it was considered virtually illegal to check your model’s fit to data.
2. Looking at it from the other direction, model checking can become much more effective in the context of complex Bayesian models (see here and here, two papers that I just love, even though, at least as measured by citations, they haven’t influenced many others).
On occasion, direct Bayesian model checking has been criticized from a misguided “don’t use the data twice” perspective (which I won’t discuss here beyond referring to this blog entry and this article of mine arguing the point).
Here I want to talk about something different: a particular attempted refutation of Bayesian model checking that I’ve come across now and then, most recently in a blog comment by Ajg:
The example [of the proportion of heads in a number of “fair tosses”] is the most deeply damning example for any straightforward proposal that probability assertions are falsifiable.
The probabilistic claim “T” that “p(heads) = 1/2, tosses are independent” is very special in that it, in itself, gives no grounds for preferring any one sequence of N predictions over another: HHHHHH…, HTHTHT…, etc: all have identical probability .5^N and indeed this equality-of-all-possibilities is the very content of “T”. There is simply nothing inherent in theory “T” that could justify saying that HHHHHH… ‘falsifies’ T in some way that some other observed sequence HTHTHT… doesn’t, because T gives no (and in fact, explicitly denies that it could give any) basis for differentiating them.
Among all possible tests – and note that we can’t apply them all – some will have HHHHHH as disconfirming and some will not. So “HHHHHH disconfirms T” is not a remotely self-contained statement let alone being true: it requires context about the tests that were run (and perhaps about why these ones were chosen).
As noted above, I’ve seen this error before, and perhaps it’s worth a blog entry to shoot it down.
The mistake in the above quote comes in ignoring the choice required in any model checking. The commenter thinks there’s no reason ahead of time to consider #heads (that is, unordered sequences) as a test summary, but there’s equally no reason ahead of time to consider ordered sequences as a test summary.
To put it another way, the equal probability of each sequence under the coin-flipping model does not make testing impossible: it’s as kosher to group the sequences in terms of #heads as it is to treat them as symmetric atoms for decision making. Either way you’re making a choice about what to look at. (For example, you could imagine a setting in which someone flipped the coins, reported the total #heads and #tails, allowed you to test the model, and then, at further request, gave you the ordered sequences. Or, to take it in another direction, you could imagine having even more information, for example some data regarding the coin’s path though the air during each flip, in which case the Heads and Tails sequences would themselves represent only partial information.)
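To make the choice-of-test-summary point concrete, here is a small Python sketch (my own illustration, not from the post or the comment thread). Under the model “p(heads) = 1/2, tosses independent,” every ordered sequence of length N has probability 0.5^N, so if you take the full ordered sequence as your test statistic, no outcome is more surprising than any other. But if you group sequences by #heads, the usual tail-probability calculation does discriminate: all-heads is extreme, while an even split is not.

```python
from math import comb

def pvalue_num_heads(seq: str) -> float:
    """Two-sided p-value using #heads as the test statistic,
    under the model p(heads) = 1/2 with independent tosses:
    sum the probability of all counts no more likely than the observed one."""
    n = len(seq)
    k = seq.count("H")
    # probability of each possible #heads count under the fair-coin model
    probs = [comb(n, j) * 0.5**n for j in range(n + 1)]
    observed = probs[k]
    return sum(p for p in probs if p <= observed + 1e-12)

def pvalue_exact_sequence(seq: str) -> float:
    """If the 'test statistic' is the full ordered sequence, every one of
    the 2**n sequences has identical probability 0.5**n, so every outcome
    is equally (un)surprising and the p-value is always 1."""
    n = len(seq)
    p_obs = 0.5**n
    # every sequence's probability is <= p_obs (they are all equal)
    return sum(p_obs for _ in range(2**n))

print(pvalue_num_heads("HHHHHHHHHH"))       # ≈ 0.00195: extreme under the #heads grouping
print(pvalue_num_heads("HTHTHTHTHT"))       # 1.0: five heads in ten is as typical as it gets
print(pvalue_exact_sequence("HHHHHHHHHH"))  # 1.0: the ordered-sequence statistic can never reject
```

Either statistic is a legitimate choice; the point is simply that the equal probability of ordered sequences does not prevent the #heads grouping from serving as a perfectly good model check.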
A similar difficulty arises when considering the posterior mode, which depends on the parameterization. So, yes, in a sense Ajg is right that the count-the-number-of-heads test is not a “self-contained statement” and requires “context about the tests”, but this is true of all model checks. If you want to abandon falsification in the coin-flipping example, I think you have to abandon it in all statistical examples, which might be a coherent philosophical position but in my opinion leads to huge practical problems. Part of falsification or refutation is knowing where to look, and that is true in non-Bayesian statistics as well.
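To illustrate the parameterization-dependence of the posterior mode with a hypothetical example of my own (not from the post): take a Beta(3, 2) posterior for p. Its mode in p-space is at 2/3, but if you reparameterize to the log-odds theta = logit(p), the Jacobian dp/dtheta = p(1-p) reshapes the density, and the theta-space mode maps back to p = 0.6. A crude grid search makes this visible with nothing but the standard library:

```python
import math

def beta_logpdf(p: float, a: float, b: float) -> float:
    """Unnormalized log-density of Beta(a, b) in the p parameterization."""
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)

# fine grid over (0, 1), excluding the endpoints where log blows up
grid = [i / 100000 for i in range(1, 100000)]

# mode in p-space: maximize the Beta(3, 2) density over p directly
p_mode = max(grid, key=lambda p: beta_logpdf(p, 3, 2))  # ≈ 2/3

# mode in logit-space: the Jacobian dp/dtheta = p(1-p) multiplies the
# density, so we maximize logpdf + log(p(1-p)) and read off the p it
# corresponds to
p_of_theta_mode = max(
    grid, key=lambda p: beta_logpdf(p, 3, 2) + math.log(p * (1 - p))
)  # ≈ 0.6, not 2/3
```

Same posterior distribution, two different “most probable values,” depending only on which parameterization you wrote it in: exactly the kind of choice-dependence at issue in the test-statistic discussion above.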
P.S. At a technical level, commenter Sebastian pointed out that HHHHHHHHHHHH could be considered either as a rejection of the hypothesis that p=1/2 or as a rejection of the hypothesis of independence. I agree with Sebastian that in general you will be checking the entire model at once; it takes more work to separately test different hypotheses within an assumed model.