There was a lot of fascinating discussion on this entry from a few days ago. I feel privileged to be able to get feedback from scientists with different perspectives than my own. Anyway, I’d like to comment on some things that Danielle Navarro wrote in this discussion. Not to pick on Danielle but because I think those comments, and my responses, may highlight some different views about what is meant by “Bayesian inference” (or, as I would prefer to say, “Bayesian data analysis,” to include model building and model checking as well as inference).

So here goes . . .

Danielle writes:

I think that falsification is a great thing to be able to do, but it doesn’t happen much in practice, and doesn’t really make sense in theory. When a model doesn’t work, we tweak it so it does and then claim that we were right all along. No-one ever really thinks their ideas have been falsified, and so the theories survive precisely as long as their authors can convince others to believe in them.

I disagree! I falsify models all the time (see, for example, the cover illustration in the 2nd edition of Bayesian Data Analysis, along with other examples from chapter 6 of that book). I certainly have found my ideas to be falsified–or more precisely, I’ve found reality to not fit my models in interesting ways that have forced me to reassess my ideas and attitudes.

I can’t honestly say that I’m the harshest critic of my theories, but I try my best to shoot my own ideas down–and I’ve been successful at it more than once.

Danielle also says,

I think the probability that the “truth” is expressible in the language of probability theory (or any other language humans can use) is vanishingly small, so we should conclude a priori that all theories are falsified.

This I agree with. Then Danielle continues:

So both in principle and in practice I don’t find falsificationism to be helpful.

I think I’ve picked up on a point of confusion. In my philosophy of statistics (derived, in part, from my own readings/interpretations/extrapolations of Jaynes and Popper), the point of falsifying a model is not to learn that the model is false—certainly, all models that I’ve ever considered are false, which is why the chi-squared test is sometimes described as “a measure of sample size”—but rather to learn the ways in which a model is false. Just as with exploratory data analysis, the goal is to learn about aspects of reality not captured by the model (see my recent paper in the International Statistical Review for more on this).

So, yes, the goal of “falsification” (as I see it), is not to demonstrate falsity but rather to learn particular aspects of falsity.
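The "measure of sample size" remark can be made concrete with a small sketch. The numbers here are hypothetical (a model that says 50/50 when reality is 52/48), and the observed counts are set to their expected values so that no sampling noise gets in the way. The point is only that a fixed, small misfit produces a chi-squared statistic proportional to n.

```python
# Illustrative sketch: a fixed, small misfit yields a chi-squared
# statistic that grows in proportion to the sample size n.
# All numbers below are hypothetical, chosen only for illustration.

def chi_squared_statistic(n, true_props, model_props):
    """Pearson chi-squared statistic, with the observed counts set to
    their expected values under the true proportions (no noise)."""
    total = 0.0
    for p_true, p_model in zip(true_props, model_props):
        observed = n * p_true
        expected = n * p_model
        total += (observed - expected) ** 2 / expected
    return total

MODEL = (0.5, 0.5)    # what the (false) model claims
TRUTH = (0.52, 0.48)  # the assumed reality, slightly different
CRITICAL_5PCT_DF1 = 3.84  # approximate 95th percentile, chi-squared with 1 d.f.

for n in (100, 1_000, 10_000):
    stat = chi_squared_statistic(n, TRUTH, MODEL)
    verdict = "reject" if stat > CRITICAL_5PCT_DF1 else "do not reject"
    print(n, round(stat, 2), verdict)
```

In this setup the statistic is 0.0016 times n, so the same misfit goes unnoticed at n = 1,000 and is rejected at n = 10,000. Rejection by itself says more about how much data there is than about the way the model fails.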

The other point of confusion relates to falsification in a Bayesian context. Danielle refers to “what happens if you compare the posterior odds of a great model you thought of yesterday against an old one that’s got a lot of holes in it.” That’s fine, I guess, but as a purer falsificationist, I’m happy to find flaws in a model without needing a new model to compare it to. That’s what posterior predictive checking is all about. So, I completely agree with Danielle’s comment that “it’s not clear to me how falsifying Newtonian physics would naturally and inevitably lead to one positing relativistic physics.” Falsification tells us where there’s something wrong, then a different process is needed to come up with a new theory that is consistent with the facts and also makes sense. I thought Popper was clear on this.
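For readers who have not seen a posterior predictive check, here is a minimal sketch of the idea, not a reproduction of any analysis from the book. The data are a made-up binary sequence, the model assumes independent draws with a common probability theta (with a uniform prior, which is an assumption of this sketch), and the test statistic is the number of switches between 0 and 1. No competing model appears anywhere.

```python
import random

def count_switches(seq):
    """Number of adjacent positions where the sequence changes value."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def posterior_predictive_pvalue(y, n_rep=5000, seed=0):
    """Compare the observed switch count with switch counts in replicated
    data drawn from the fitted iid-Bernoulli model.

    Assumes a uniform prior on theta, so the posterior is
    Beta(1 + successes, 1 + failures).
    """
    rng = random.Random(seed)
    n, successes = len(y), sum(y)
    t_obs = count_switches(y)
    at_most_observed = 0
    for _ in range(n_rep):
        theta = rng.betavariate(1 + successes, 1 + n - successes)
        y_rep = [1 if rng.random() < theta else 0 for _ in range(n)]
        if count_switches(y_rep) <= t_obs:
            at_most_observed += 1
    return t_obs, at_most_observed / n_rep

# Hypothetical data: long runs of identical values, i.e. few switches.
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
t_obs, p_value = posterior_predictive_pvalue(y)
print("observed switches:", t_obs)
print("share of replications with as few switches:", p_value)
```

A small share means the model rarely produces data with runs as long as those observed. That points to a specific failure (the independence assumption) without saying what the replacement model should be.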

Finally, Danielle refers to Bayesian model comparison: “you just throw a new model into your class of explanations, and see what comes out having the best posterior odds.” This doesn’t really work for me, at least not in the problems I’ve worked on. See pages 184-185 of BDA2 for more on this.

My stories of model rejection and new models are more on the lines of: we fitted a model comparing treated and control units (in the particular example I’m thinking of, these are state legislatures, immediately after redistrictings or not), and assumed a constant treatment effect (i.e., parallel regression lines in “after” vs. “before” plots, with the treatment effect representing the difference between the lines). We made some graphs and realized that this model made no sense. The control units had a much steeper slope than the treated units. We fit a new model, and it had a completely different story about what the treatment effects meant. The graph falsified the first model and motivated us to think of something better. (This is from my 1994 paper with Gary King in the American Political Science Review.)
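The kind of check described here can be sketched numerically as well as graphically. The example below uses invented numbers, not the data from the paper: it fits a separate least-squares line to "after" versus "before" values in each group and compares the slopes, which a constant-treatment-effect model requires to be equal.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic values, invented purely for illustration.
control_before = [0.30, 0.40, 0.50, 0.60, 0.70]
control_after = [0.28, 0.41, 0.49, 0.62, 0.70]
treated_before = [0.30, 0.40, 0.50, 0.60, 0.70]
treated_after = [0.44, 0.46, 0.51, 0.53, 0.56]

slope_control, _ = least_squares_line(control_before, control_after)
slope_treated, _ = least_squares_line(treated_before, treated_after)

# Under a constant treatment effect the two lines are parallel,
# so this difference should be close to zero.
print("control slope:", round(slope_control, 2))
print("treated slope:", round(slope_treated, 2))
print("difference:", round(slope_control - slope_treated, 2))
```

With these made-up numbers the control line is much steeper than the treated line, so "the treatment effect" is not a single number: it depends on where a unit started.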

So . . . to me, falsification is about plots and predictive checks. Not about so-called Bayes factors or posterior probabilities of candidate models.

P.S. Above post altered to change Dan to Danielle.

It looks to me as though the confusion arises because falsifiability means different things to different people. I think it's clear that you can't be using falsification in the sense Popper meant it (or at least meant it in 1933), as he wrote (§66 of The Logic of Scientific Discovery):

"Probability estimates are not falsifiable. Neither, are they, of course, verifiable and this for the same reasons as hold for any other hypotheses, seeing that no experiment result however numerous and favorable, can ever finally establish that the relative frequency of 'heads' is 1/2 and will always be 1/2."

I also think that you're arguing that you use falsification to check the ceteris paribus conditions of your model: i.e. not the main hypothesis being tested, but the satellite conditions. I'm not sure if this has been discussed much by philosophers – certainly it doesn't seem to appear in many of the classic arguments.

My interpretation is that you're a falsificationist in a very weak sense, and in a sense that everyone is (or should be) a falsificationist: we practice criticism, and are prepared to accept being shown wrong.

Bob

Yes, Bob has a point here.

I think we would all agree that what we want to do is check the plausibility of our hypotheses, or models, and look for evidence that would cause us to reassess them.

In a previous post I described this as trying to tell the right story about how something happened – that is, identifying the correct causal mechanisms that are at work.

What is much more contentious is the argument that falsificationism is the hallmark of proper science – the "demarcation criterion" that distinguishes science from metaphysics.

Here we run into deep philosophical waters, which we have covered previously.

A major question, as Bob notes, is how to isolate a single hypothesis from the satellite hypotheses. In the literature this is known as the Quine-Duhem problem. Wikipedia has a nice short entry with a little background.

I also greatly appreciate the discussion, by the way – I know too little of Bayesian inference (mea culpa) – so the focus on the link to statistical methods is of great interest to me.

I don't think I disagree with much in this post, really. I guess the point where I might part company relates to some of the pragmatics. As an example, there's an ongoing theoretical dispute in psychology about whether errors in recognition memory are primarily due to "item noise" (in which the representations of different stimuli interfere with each other) or primarily due to "context noise" (in which the representations of prior situations overlap with the representations of the current one). Historically, item noise theories have been predominant, and most memory models assume item noise. The process by which things changed seems to have been very Kuhnian in spirit. People went about their business, tinkering with item noise models and doing all that good stuff. Along the way, lots of models were proposed and falsified, leading to all sorts of new refinements and better models. After a while though, things cropped up that really don't fit very well with any of the item noise models (like the "null list length effect", in which the length of a list doesn't seem to affect recognition at all). The only way they ever really accounted for it was by suggesting two theoretically-distinct parameters that just happen to trade-off against each other. And that's about the time that a few people suggested that the whole "item noise framework" needed to be scrapped and new things built from scratch around the idea of context noise. And there's a real chance that context noise might be a bit of a revolutionary idea (as a disclaimer, one of the main "context noise" people is in my department, so I'm probably a bit biased).

Anyway, the analogy that I would make with Bayesian inference is that posteriors over variables are always conditional on the model at hand. So we would always write p(θ|X,M), where M is our model. Given two models belonging to some larger class C, we can marginalize over their parameters and find p(M|C). But p(θ|X,M) says very little about p(M|C), nor does p(M|C) say much about p(C). So when an entirely new class of models is proposed, it could very quickly turn out to be the case that the new one "overthrows" the old one quite fast (a sort of "Outside Context Problem"). The only reason that it hadn't done so earlier is that no-one had suggested it. So the process of careful testing and refinement doesn't even remotely guarantee that you won't have to throw out the whole framework tomorrow if something better comes along, and my feeling is that this is how it works a lot of the time. But that might just be a consequence of being in a psych department.
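A toy calculation may make the dependence on the model class more concrete. Everything in it is an assumption made for illustration: the data are 18 heads in 20 coin flips, the "old" class holds two models with a fixed probability of heads (M1 and M2), the newcomer M3 puts a uniform prior on that probability, and models within a class get equal prior weight.

```python
from math import factorial

def likelihood_fixed(theta, heads, tails):
    """Probability of one particular sequence when theta is fixed."""
    return theta ** heads * (1 - theta) ** tails

def marginal_likelihood_uniform(heads, tails):
    """Integral of theta^heads * (1 - theta)^tails over a uniform prior,
    which equals heads! * tails! / (heads + tails + 1)!."""
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)

def posterior_over_models(marginals):
    """Posterior model probabilities within a class, assuming equal
    prior weight for every model in that class."""
    total = sum(marginals.values())
    return {name: value / total for name, value in marginals.items()}

HEADS, TAILS = 18, 2  # hypothetical data

old_class = {
    "M1": likelihood_fixed(0.5, HEADS, TAILS),  # theta fixed at 0.5
    "M2": likelihood_fixed(0.6, HEADS, TAILS),  # theta fixed at 0.6
}
# The same class after someone proposes a model with theta ~ Uniform(0, 1).
new_class = {**old_class, "M3": marginal_likelihood_uniform(HEADS, TAILS)}

post_old = posterior_over_models(old_class)
post_new = posterior_over_models(new_class)
print({name: round(p, 3) for name, p in post_old.items()})
print({name: round(p, 3) for name, p in post_new.items()})
```

Within the old class, M2 takes nearly all of the posterior probability. Once M3 is added, M2 drops to a small share even though the data and M2 itself are unchanged. The posterior over models is only ever a statement relative to the class that happens to be on the table.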

So when I say "Bayesian inference reminds me of Kuhn", I mean it in the sense that this kind of phenomenon fits nicely in both. While you can think about falsification & revolutions in Bayesian terms, and you can think about falsification as a part of a Kuhnian revolution, it's hard to think about revolutionary changes as a part of falsification. Why should the introduction of a new model cause you to change your opinion about whether the old one has been falsified? For that reason, I find Bayes and Kuhn to be a better pairing, with Popper as a subset of both. So my issue with falsificationism is not that it's bad or wrong, just that it's missing something. Like you say, what matters is finding the particular ways in which a model is wrong, and where we might want to go in finding the next false model. A simple "reject/no reject" version of falsificationism (which is really the target of my complaint) like that used in null hypothesis testing won't get you that. Since the decision is just "no" or "not no", it provides no help with the question of "what should I do next?" And that's sort of what I want to know.

ps. I should certainly retract my remark about falsification not happening in practice. That was a bit of an exaggeration. What I should have said is that "there are a lot of people out there who don't try very hard to falsify their theories". But it's not even remotely a universal. It's possible that I was being just a little too cynical there.