As regular readers of this space should be aware, Bayesian model checking is very important to me:

1. Bayesian inference can make strong claims, and, without the safety valve of model checking, many of these claims will be ridiculous. To put it another way, particular Bayesian inferences are often clearly wrong, and I want a mechanism for identifying and dealing with these problems. **I certainly don’t want to return to the circa-1990 status quo in Bayesian statistics, in which it was considered virtually illegal to check your model’s fit to data.**

2. Looking at it from the other direction, model checking can become much more effective in the context of complex Bayesian models (see here and here, two papers that I just love, even though, at least as measured by citations, they haven’t influenced many others).

On occasion, direct Bayesian model checking has been criticized from a misguided “don’t use the data twice” perspective (which I won’t discuss here beyond referring to this blog entry and this article of mine arguing the point).

Here I want to talk about something different: a particular attempted refutation of Bayesian model checking that I’ve come across now and then, most recently in a blog comment by Ajg:

The example [of the proportion of heads in a number of “fair tosses”] is the most deeply damning example for any straightforward proposal that probability assertions are falsifiable.

The probabilistic claim “T” that “p(heads) = 1/2, tosses are independent” is very special in that it, in itself, gives no grounds for preferring any one sequence of N predictions over another: HHHHHH…, HTHTHT…, etc: all have identical probability .5^N and indeed this equality-of-all-possibilities is the very content of “T”. There is simply nothing inherent in theory “T” that could justify saying that HHHHHH… ‘falsifies’ T in some way that some other observed sequence HTHTHT… doesn’t, because T gives no (and in fact, explicitly denies that it could give any) basis for differentiating them.

Ajg continued:

Among all possible tests – and note that we can’t apply them all – some will have HHHHHH as disconfirming and some will not. So “HHHHHH disconfirms T” is not a remotely self-contained statement let alone being true: it requires context about the tests that were run (and perhaps about why these ones were chosen).

As noted above, I’ve seen this error before, and perhaps it’s worth a blog entry to shoot it down.

The mistake in the above quote comes in ignoring the choice required in any model checking. The commenter thinks there’s no reason ahead of time to consider #heads (that is, unordered sequences) as a test summary, but there’s also no reason ahead of time to consider ordered sequences as a test summary either.

To put it another way, the equal probability of each sequence under the coin-flipping model does not make testing impossible: it’s as kosher to group the sequences in terms of #heads as it is to treat them as symmetric atoms for decision making. Either way you’re making a choice about what to look at. (For example, you could imagine a setting in which someone flipped the coins, reported the total #heads and #tails, allowed you to test the model, and then, at further request, gave you the ordered sequences. Or, to take it in another direction, you could imagine having even more information, for example some data regarding the coin’s path through the air during each flip, in which case the Heads and Tails sequences would themselves represent only partial information.)
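To make the point concrete, here is a minimal sketch in Python. It is an illustration under assumptions of mine, not code from the original post: the function names are hypothetical, and the two-sided tail probability is just one reasonable way to turn #heads into a check. Every ordered sequence of 12 fair tosses has the same probability, yet grouping sequences by #heads gives a summary that clearly separates HHHHHHHHHHHH from HTHTHTHTHTHT.

```python
from math import comb


def sequence_probability(seq, p=0.5):
    """Probability of one particular ordered sequence of independent tosses."""
    heads = seq.count("H")
    tails = len(seq) - heads
    return p**heads * (1 - p) ** tails


def head_count_tail_probability(n_heads, n, p=0.5):
    """Probability of a head count at least as far from n*p as the observed one.

    This two-sided tail area is an illustrative choice of check, not the
    only possible one.
    """
    expected = n * p
    distance = abs(n_heads - expected)
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k - expected) >= distance
    )


all_heads = "H" * 12
alternating = "HT" * 6

# As atoms, the two sequences are indistinguishable under the model.
print(sequence_probability(all_heads), sequence_probability(alternating))

# Grouped by #heads, they look very different.
print(head_count_tail_probability(all_heads.count("H"), 12))
print(head_count_tail_probability(alternating.count("H"), 12))
```

Neither way of slicing the sample space is forced by the model; the code simply shows that choosing #heads as the summary is what makes twelve heads in a row stand out.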

A similar difficulty arises when considering the posterior mode, which depends on the parameterization. So, yes, in a sense Ajg is right that the count-the-number-of-heads test is not a “self-contained statement” and requires “context about the tests”–but this is true of *all* model checks. If you want to abandon falsification in the coin-flipping example, I think you have to abandon it in *all* statistical examples, which might be a coherent philosophical position but in my opinion leads to huge practical problems. Part of falsification or refutation is knowing where to look, and that’s true in non-Bayesian statistic

P.S. At a technical level, commenter Sebastian pointed out that HHHHHHHHHHHH could be considered either as a rejection of the hypothesis that p=1/2 or a rejection of the hypothesis of independence. I agree with Sebastian that in general you will be checking the entire model at once; it takes more work to separately test different hypotheses within an assumed model.
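As a rough illustration of why the choice of summary matters here, the sketch below uses the number of switches between adjacent tosses as a second test summary. This statistic is my own choice for illustration and is not mentioned in the discussion above; the function names are hypothetical. Under the full model (p=1/2 and independence), each adjacent pair differs with probability 1/2 independently, so the number of switches in n tosses is Binomial(n-1, 1/2).

```python
from math import comb


def n_switches(seq):
    """Count positions where a toss differs from the one before it."""
    return sum(a != b for a, b in zip(seq, seq[1:]))


def switch_tail_probability(seq):
    """Two-sided tail area for the switch count under a fair, independent coin.

    Under that model the switch count follows Binomial(n - 1, 1/2).
    """
    m = len(seq) - 1
    observed = n_switches(seq)
    distance = abs(observed - m / 2)
    extreme = sum(comb(m, k) for k in range(m + 1) if abs(k - m / 2) >= distance)
    return extreme / 2**m


for seq in ("H" * 12, "HT" * 6, "HHTHTTTHHTHT"):
    print(seq, seq.count("H"), n_switches(seq), switch_tail_probability(seq))
```

The alternating sequence has exactly six heads, so a head-count check finds nothing unusual, but its switch count is as extreme as possible. Twelve heads in a row is flagged by both summaries, which fits Sebastian's point: a single check of this kind tests the whole model and does not by itself say which assumption failed.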

Sure, there are observations (number of heads) that are in closer agreement with the hypothesis of p(heads) = 1/2 than others. But wasn't (part of) the point that all possible observations have a non-zero probability, making the hypothesis unfalsifiable in a strict sense? HHHHHHHHHHHH might be extremely unlikely under the hypothesis, but strictly does not reject it, since it is not impossible. I know Popper realized this problem, but I have no idea how he talked himself out of it. Others have tried, though (e.g., A Falsifying Rule for Probability Statements, Gillies 1971), but, as far as I know, there seems to be no widely accepted solution.

Professor Gelman,

Thank you for taking the time to write this.

I sincerely wish I had taken more time to (and had the ability to) be clearer on the purpose of my comment! I definitely wasn't trying to refute Bayesian model checking then and having now read quite a bit more of your papers and books – to the extent I understand things – would even less want to try something so futile and at odds with my own limited understanding of statistics.

I was attempting to make a vastly less interesting point about the "objective" philosophies of probability and Popperian falsification. It was, I suppose, directed at no one in particular, but most definitely not at you!

If I might try one last time? Very simple Popper: A non-probabilistic hypothesis H has an opinion on possible observations ("consistent with me", "not consistent with me"). We see an observation O that H declares impossible. H is thus falsified. IMPORTANT POINT: The last statement is objectively true in a very strong sense: No two people witnessing O can disagree unless they actually are disagreeing about the contents of H.

Now again, but with probabilities. H's attitudes towards observations are now probabilistic judgements. We make an observation "O". Since O may be possible even if unlikely, there's a bit of a problem (cf vvv's comment). But on the face of it, even though this may be difficult, we can certainly _imagine_ finding a more nuanced definition of "falsify" that addresses this particular issue.

The fair coin example shows there's another problem. (MY POINT:) It _proves_ there can be no generalization of "falsifies" with the same super-clear objective standing as we found in the non-probabilistic cases. The observers MUST bring something else to the table – something subjective, if you will, though this is a laden term – to reach a falsification (or whatever replacement concept you introduce) judgment. Even if we solve the concern of the previous paragraph, two observers can legitimately reach different conclusions after seeing the same O, and both fully agreeing as to the content of H. I'd be surprised if you don't find this obviously so! Nevertheless, in my experience, I don't think everyone believes this is obviously so, and I believe the fair coin case is useful because it effectively _proves_ that this is the case.

I don't believe this says anything interesting about statistical practice – one always has biases, purposes, and interests that come into play one way or another in hypothesis testing and this is as it should (and must) be. I would not read you as disputing this, indeed likely the opposite, am I wrong?

-ajg

N.b. I do think your response muddles things up a bit unfairly by implying everything is merely one or another choice of "test summary". I am thinking about _observations_ – what I see – the sequence of coin tosses, the sound they make, the location they landed, and all the rest. This surely has a privileged status; to say in effect "well, I might have seen less or reason as if I did see less" isn't a complete equivalence: I saw what I saw.

Vvv:

I think you're pointing out that continuous probabilities are never zero; thus, for example, a rejection at p=.001 is not the same thing as a rejection at p=0. That's an interesting point too, but it's something separate from my note above.

Ajg:

No. Go to your last paragraph. My point is that you have choice in your definition of "observation," just as there's choice in my definition of "test statistic." You want to call HHHTTHH and HTHTHHH as two different atomic "observations," but you could just as well treat them as two sub-observations of the single "observation" (5H, 2T). See my third-to-last paragraph above. Talking about "test variables" makes the choice aspect more clear, but it's really already there when you talk about how to parameterize your sample space. Because these particular events happen to have equal probability in your model, it might seem natural to count them as atomic observations, but that's really just one particular choice.

Thanks again.

I have to think I'm (trying to) say something entirely different from what you think I'm saying. You're convinced I'm saying something that's simply and unequivocally wrong, but I can't begin to usefully connect your responses (which otherwise I have no disagreement with) to my actual point – and I know you are a careful and clear writer, and I think I normally understand you, so something's just wrong.

But my point – even if it is correct (I still believe it is) – isn't _that_ interesting and for sure is "merely" philosophy and not practically relevant to anything. Anyway, I do thank you for your patience, but we are either talking past each other or you are explaining something at a level I'm just not equipped to appreciate. Either way, not worth pursuing, but thanks again.

-ajg

N.b. I _knew_ as I wrote it that my final paragraph would muddle my point and be a dangerous distractor; I should have trusted my first impulses and omitted it :-(

AJG, here's a question for you: suppose your hypothesis is that p=0.35. How can you distinguish this case from p=0.5 using your sequences as the test observations? Ultimately, to make a statement about the probability of heads you will have to count the number of heads. p=0.5 just happens to be the symmetry point where H and T have the same probability and therefore all sequences of H and T have the same probability. If p(H) = 0.35, sequences with more H are less likely than sequences with more T.

Ultimately, counting H and T is the only meaningful way to evaluate claims about p(H), not looking at which particular sequence you got.

Daniel Lakeland:

IMO you've changed the picture entirely (and simplified it, in a good way, but maybe making it less relevant to the discussion) by bringing another hypothesis into the picture; a single simple hypothesis no less. Let's call it H_0.35 (can I assume that it encompasses a claim of independence as well?).

If I can interpret/paraphrase your question of distinction as: "How could an observation O confirm H_0.5 relative to H_0.35", my answer is simple: it does so through one number, the likelihood ratio.

The likelihood ratio is philosophically clean, practically speaking tells you just what you want (at least in my experience), and is "objective" in a sense I won't try again to define but which I still doubt can be achieved in non-comparative claims about confirmation.

-ajg

N.b. [I hope this footnote does not get me into trouble again but…] It's just not true that I need to count heads and tails to get the likelihood ratio. I could compute the ratio without ever thinking about or explicitly tabulating these counts, but rather just looking at sequences, and I'll get exactly the same value as if I used counts. It may take me a bit longer though!
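A minimal sketch of this equivalence, assuming the two simple hypotheses from the exchange above (independent tosses with p=0.5 versus p=0.35). The function names are hypothetical, and the example sequences are the two seven-toss sequences mentioned earlier in the thread. The first function walks the sequence toss by toss and never tabulates counts; the second uses only the counts.

```python
from math import isclose, prod


def likelihood_ratio_from_sequence(seq, p_a=0.5, p_b=0.35):
    """Likelihood ratio of hypothesis A to B, multiplied up one toss at a time."""
    return prod(
        (p_a if toss == "H" else 1 - p_a) / (p_b if toss == "H" else 1 - p_b)
        for toss in seq
    )


def likelihood_ratio_from_counts(n_heads, n_tails, p_a=0.5, p_b=0.35):
    """The same ratio, computed from the head and tail counts alone."""
    return (p_a / p_b) ** n_heads * ((1 - p_a) / (1 - p_b)) ** n_tails


for seq in ("HHHTTHH", "HTHTHHH"):
    by_sequence = likelihood_ratio_from_sequence(seq)
    by_counts = likelihood_ratio_from_counts(seq.count("H"), seq.count("T"))
    print(seq, by_sequence, by_counts, isclose(by_sequence, by_counts))
```

The two routes agree (up to floating-point rounding) because, under independence, the likelihood depends on the sequence only through the counts. That is consistent with both sides of the exchange: the ratio can be computed without ever counting, and the counts still carry all the information the ratio uses.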

You may not have the same confidence in the appropriateness of the likelihood ratio that I have, but whatever you see in my other posts (!) _this_ position has plenty of respectable support :-) So if you are convinced that "ultimately, you'll need to count the number of heads" it's more than just I who would disagree.