I hate Bayes factors (when they’re used for null hypothesis significance testing)

Oliver Schultheiss writes:

I am a regular reader of your blog. I am also one of those psychology researchers who was trained in the NHST tradition and who is now struggling hard to retrain himself to properly understand and use the Bayes approach (I am working on my first paper based on JASP and its Bayesian analysis options). And then tonight I came across this recent blog post by Uri Simonsohn, “If you think p-values are problematic, wait until you understand Bayes Factors.”

I assume that I am not the only one who is rattled by this (or I am the only one, and this just reveals my lingering deeper ignorance about the Bayes approach) and I was wondering whether you could comment on Uri’s criticism of Bayes Factors on your own blog.

My reply: I don’t like Bayes factors; see here. I think Bayesian inference is very useful, but Bayes factors are based on a model of point hypotheses that typically does not make sense.
To put it another way, I think that null hypothesis significance testing typically does not make sense, whether it is done with p-values or with Bayes factors. Using Bayes factors for null hypothesis significance testing is generally a bad idea, because I don’t think it typically makes sense to talk about the probability that a scientific hypothesis is true.

More discussion here: Incorporating Bayes factor into my understanding of scientific information and the replication crisis. The problem is not so much with the Bayes factor as with the idea of null hypothesis significance testing.

11 thoughts on “I hate Bayes factors (when they’re used for null hypothesis significance testing)”

  1. I’m not necessarily a fan of Bayes factors, but even as a non-expert I feel like I can spot some issues in the discussion of them in that post. The way Bayes factors are used in the post has a lot of the same issues as NHST, where instead of proposing a specific alternative hypothesis about the effect size, you basically just have a hypothesis “X makes Y increase”, with no real specificity about how big that increase is. If Milton’s theory is so vague that the best you can do is use a uniform prior that gives the same weight to modest increases of 1% and absolutely huge increases of 10% in unemployment, maybe the problem is with the theory and not the analysis method.
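A quick numerical sketch of this point (all numbers are made up, and the normal-likelihood setup is an assumption for illustration, not anything from Simonsohn’s post): with the same data, a diffuse uniform prior for the predicted increase spreads the theory’s predictions so thin that the Bayes factor can lean toward the null, while a sharper prior favors the theory.

```python
from scipy.stats import norm

# Hypothetical data: observed increase of 1 percentage point, standard error 0.5.
x, se = 1.0, 0.5

# Marginal likelihood under H0: the effect is exactly zero.
m0 = norm.pdf(x, loc=0.0, scale=se)

def m1_uniform(a, b):
    """Marginal likelihood of x under H1: effect ~ uniform(a, b),
    assuming a normal likelihood with known standard error se."""
    return (norm.cdf((b - x) / se) - norm.cdf((a - x) / se)) / (b - a)

bf_vague = m1_uniform(1.0, 10.0) / m0   # vague theory: "an increase of 1 to 10 points"
bf_sharp = m1_uniform(0.5, 1.5) / m0    # sharper theory: "an increase of about 1 point"

print(bf_vague)  # ~0.5: the data mildly favor the null
print(bf_sharp)  # ~6:   the same data favor the theory
```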

  2. Oh for God’s sake. No more, please. It’s just a model. No models are true; none are not even false. The point null is a super important Platonic abstraction. It says, “there is a conservation or an invariance here.” What could be so insulting about that?

    • Jeff:

      There’s nothing insulting here. The problem is that Bayes factors for these null hypothesis tests can easily give really bad answers (a numerical sketch of one such problem follows below). See chapter 7 of BDA3 or the above links for discussions and details.

      More generally (going beyond Bayes factors to my problem with null hypothesis significance testing in general), I think the null hypothesis of zero effect and zero systematic error is very rarely interesting. I’m not interested in rejecting it, given that I know ahead of time I could reject it by gathering enough data. See also here.

      • P.S. I feel passionate about statistical methods because ultimately I care about the applications (as in our discussions of the way that classical statistical methods can lead to drastic overestimates of effect sizes in policy analysis, as discussed in section 2.1 of this article), or because I hate to see scientists waste their efforts and I’d like them to be able to do better (hence my annoyance at dead-on-arrival studies), or because I’m bothered by logical/mathematical/scientific errors (as with the hot hand fallacy fallacy). What I “hate” is not Bayes factors or p-values or whatever; it’s the way that these methods can lead us astray.
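One standard illustration of how these tests can give bad answers is the Bayes factor’s sensitivity to the width of the prior under the alternative. The sketch below uses made-up numbers and a convenient normal-normal setup; it is not a worked example from BDA3, just the general kind of problem being pointed to.

```python
from scipy.stats import norm

# Hypothetical data: estimated effect 2 standard errors away from zero.
x, se = 2.0, 1.0

# H0: effect = 0.  H1: effect ~ Normal(0, tau^2), for various prior widths tau.
# With a normal likelihood, the marginal of x under H1 is Normal(0, se^2 + tau^2).
m0 = norm.pdf(x, 0.0, se)
for tau in [0.5, 1.0, 5.0, 20.0, 100.0]:
    m1 = norm.pdf(x, 0.0, (se**2 + tau**2) ** 0.5)
    print(f"tau = {tau:6.1f}   BF01 (null over alternative) = {m0 / m1:7.2f}")

# The verdict swings from favoring the alternative to strongly favoring the null,
# for the same data, purely as a function of how wide the alternative's prior is.
```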

  3. I definitely do not want to argue in favor of using Bayes factors, but I do want to point out that some scientific hypotheses are possibly exactly true. The rest mass of an electron could exactly equal the rest mass of a positron. The speed of light in a vacuum could be perfectly independent of the direction of motion relative to the source. Energy could be perfectly conserved. I realize these are not the kinds of hypotheses that are usually discussed on this social-science-oriented blog, and I agree that any model of human behavior is only going to be approximate.

  4. So I’m a big fan of Bayes factors (and of Harold Jeffreys’s work in general), and I’ve even managed to get Andrew to co-author papers reporting Bayes factors (albeit with a footnote stating that he hates them :-)). Here’s my brief take, for balance:
    1. The issue of the point null hypothesis and its plausibility is really orthogonal to the Bayes factor, as Andrew suggests. The Bayes factor can be used to test *any* two models, as long as they make predictions. If you prefer a normal prior with variance epsilon instead of the point, nothing stops you from using that instead.
    2. For discrete parameter spaces, the update from prior to posterior distribution *is* a Bayes factor. Bayes factors are part of Bayes rule; this is why Jack Good termed them “Bayes” factors. See https://www.bayesianspectacles.org/bayes-factors-for-those-who-hate-bayes-factors/ (a minimal numerical check of this identity appears after this comment).
    3. To me, in my line of work, Bayes factors seem to address the question that researchers care about: “Is there some signal in this noise, or am I just reading tea leaves?” Harold Jeffreys claimed that (from memory): “variation should be considered random until evidence to the contrary is presented”. I think this is a nice statistical interpretation of what it means to be skeptical, and organized skepticism is an important part of science.
    4. A recent paper outlining the philosophy of Jeffreys and its practical implementation is: Ly, A., Stefan, A., van Doorn, J., Dablander, F., van den Bergh, D., Sarafoglou, A., Kucharsky, S., Derks, K., Gronau, Q. F., Raj, A., Boehm, U., van Kesteren, E.-J., Hinne, M., Matzke, D., Marsman, M., & Wagenmakers, E.-J. (in press). The Bayesian methodology of Sir Harold Jeffreys as a practical alternative to the p-value hypothesis test. Computational Brain & Behavior. Preprint: https://psyarxiv.com/dhb7x

    Cheers,
    E.J.
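A minimal numerical check of point 2 above, using a toy setup with two point hypotheses for a coin’s bias (the data and prior probabilities here are invented for illustration): the factor that turns prior model odds into posterior model odds is exactly the ratio of the two models’ likelihoods for the observed data, i.e., the Bayes factor.

```python
from scipy.stats import binom

# Toy data: 13 heads in 20 flips.
n, k = 20, 13

lik_fair   = binom.pmf(k, n, 0.5)   # M0: theta = 0.5
lik_biased = binom.pmf(k, n, 0.7)   # M1: theta = 0.7

bayes_factor = lik_biased / lik_fair   # BF10

# Bayes' rule over the two models, with prior P(M1) = 0.25.
prior_m1 = 0.25
post_m1 = (prior_m1 * lik_biased) / (prior_m1 * lik_biased + (1 - prior_m1) * lik_fair)

prior_odds = prior_m1 / (1 - prior_m1)
posterior_odds = post_m1 / (1 - post_m1)

print(bayes_factor, posterior_odds / prior_odds)   # the two numbers agree
```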

  5. I’m deeply ambivalent about Bayes factors, but I find Simonsohn’s critique very flawed. He offers two vignettes under the titles “Thing 1” and “Thing 2”. Here’s the conclusion of the “Thing 1” section:

    All I have falsified is the irrelevant hypothesis that anchoring for rivers is like anchoring for doors.

    All I have falsified is a hypothesis I considered only so that I could run the Bayes factor, one I was never interested in.

    This critique is basically, “I used a tool to do a stupid thing, therefore the tool is stupid.” There’s not much to say about this sort of argument beyond pointing out its basic structure.

    In the “Thing 2” section we have this line:

    If Milton had said “the effect is anywhere between -50% and +50%”, a prediction that is *never* false, the Bayes factor would *always* deem it false, because observed values would always be, on average, unlikely, under ‘the alternative.’

    This is clearly wrong, and it shows me that Simonsohn hasn’t carried out the math of Bayes factors (or at least, hasn’t carried it out correctly) and thus doesn’t understand them. Earlier in the post he posits a 1% observed effect; if that effect had been measured very precisely then it would fall well outside of the neighbourhood in which the prior predictive mass under the no-effect hypothesis is found, and then the alternative hypothesis of “some effect between -50% and +50%” would indeed be favoured.

    How can we be having a discussion of Bayes factors — or any discussion about statistics — in which we’re talking about an observed effect and not giving consideration to the uncertainty with which the effect is measured? We wouldn’t be talking about p-values in this example without at least mentioning the standard error…
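A quick numerical sketch of this point. The 1% effect and the uniform(-50%, +50%) alternative are taken from the vignette; the normal likelihood and the two standard errors below are illustrative assumptions. Whether the Bayes factor favors the wide alternative or the point null depends entirely on how precisely that 1% effect is measured.

```python
from scipy.stats import norm

x = 1.0  # observed effect of 1 percentage point

def bf10(se, a=-50.0, b=50.0):
    """Bayes factor for H1: effect ~ uniform(a, b) against H0: effect = 0,
    assuming a normal likelihood with standard error se."""
    m1 = (norm.cdf((b - x) / se) - norm.cdf((a - x) / se)) / (b - a)
    m0 = norm.pdf(x, 0.0, se)
    return m1 / m0

print(bf10(se=5.0))   # imprecise estimate: BF10 ~ 0.13, the data favor the null
print(bf10(se=0.1))   # precise estimate:   BF10 ~ 1e19, the data crush the null
```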

  6. My understanding was that Bayes factors didn’t take into account the fact that the alternative hypothesis is made up of a bunch of different hypotheses, and that the probability of each individual hypothesis rises or falls relative to the others as more data is collected. Which, if I understand things correctly, changes the probability distribution for future data under the alternative hypothesis as more data is collected. Am I incorrect about this?
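For what it’s worth, here is a small sketch of the mechanism being asked about, using a toy coin-flipping setup with a uniform Beta(1, 1) prior under the alternative (all details invented for illustration). The marginal likelihood that enters the Bayes factor can be written as a product of one-step-ahead predictions in which the sub-hypotheses under the alternative are reweighted after every observation, so the alternative’s predictive distribution does change as data accumulate.

```python
from math import comb

# Toy data: a sequence of coin flips (1 = heads).
flips = [1, 1, 0, 1, 1, 1, 0, 1]
n, k = len(flips), sum(flips)

# H0: theta = 0.5, so every flip has predictive probability 0.5.
m0 = 0.5 ** n

# H1: theta ~ Beta(1, 1).  Marginal likelihood computed two equivalent ways.

# (a) One-step-ahead predictions, reweighting the Beta posterior after each flip.
a, b = 1.0, 1.0
m1_sequential = 1.0
for y in flips:
    p_heads = a / (a + b)                       # posterior predictive so far
    m1_sequential *= p_heads if y == 1 else 1 - p_heads
    a, b = a + y, b + (1 - y)

# (b) Closed-form marginal likelihood of this sequence under the uniform prior.
m1_closed_form = 1.0 / ((n + 1) * comb(n, k))

print(m1_sequential, m1_closed_form)   # identical
print("BF10 =", m1_sequential / m0)
```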
