p-values blah blah blah

Karl Ove Hufthammer points me to this paper by Raymond Hubbard and R. Murray Lindsay, “Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing.”

I agree that p-values are a problem, but not quite for the same reasons as Hubbard and Lindsay. I was thinking about this a couple of days ago when talking with Jeronimo about fMRI experiments and other sorts of elaborate ways of making scientific connections. I have a skepticism about such studies that I think many scientists share: the idea that a questionable idea can suddenly become scientific by being thrown in the same room with gene sequencing, MRIs, power-law fits, or other high-tech gimmicks. I’m not completely skeptical–after all, I did my Ph.D. thesis on medical imaging–but I do have this generalized discomfort with these approaches.

Consider, for example, the notorious implicit association test, famous for being able to “assess your conscious and unconscious preferences” and tell if you’re a racist. Or consider the notorious “baby-faced politicians lose” study.

From a statistical point of view, I think the problem is with the idea that science is all about rejecting the null hypothesis. This is what researchers in psychology learn, and I think it can hinder scientific understanding. In the “implicit association test” scenario, the null hypothesis is that people perceive blacks and whites identically; differences from the null hypothesis can be interpreted as racial bias. The problem, though, is that the null hypothesis can be wrong in so many different ways.

To return to the main subject, an alarm went off in my head when I read the following sentence in the abstract to Hubbard and Lindsay’s paper: “p values exaggerate the evidence against [the null hypothesis].” We’re only on page 1 (actually, page 69 of the journal, but you get the idea) and already I’m upset. In just about any problem I’ve studied, the null hypothesis is false; we already know that! They describe various authoritative-seeming Bayesian articles from the past several decades, but all of them seem to be hung up on this “null hypothesis” idea. For example, they include the notorious Jeffreys (1939) quote: “What the use of P implies … is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure.” OK, sure, but I don’t believe that the hypothesis “may be true.” The question is whether the data are informative enough to reject the model.
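To make the point about "the null hypothesis is false" concrete, here is a minimal sketch. The effect size of 0.02 standard deviations is a made-up number, and the z-score used is the expected one (sampling noise is ignored), so this illustrates the idea and does not analyze any real study. It shows that whether a point null gets rejected depends mostly on how much data you have:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical small but nonzero true effect, in standard-deviation units.
true_effect = 0.02

results = []
for n in (100, 10_000, 1_000_000):
    # Expected z-score for a sample mean from n observations (noise ignored).
    z = true_effect * math.sqrt(n)
    results.append((n, z, two_sided_p(z)))

for n, z, p in results:
    print(f"n={n}: expected z={z:.1f}, two-sided p={p:.3g}")
```

Under these assumptions the same tiny effect goes from "not significant" to overwhelmingly "significant" purely as a function of sample size, which is why the rejection by itself says little about the science.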

Any friend of the secret weapon is a friend of mine

OK, now the positive part. I agree with just about all the substance of Hubbard and Lindsay’s recommendations and follow them in practice: interval estimates, not hypothesis tests; and comparing intervals of replications (the “secret weapon”). More generally, I applaud and agree with their effort to place repeated studies in a larger context; ultimately, I think this leads to multilevel modeling (also called meta-analysis in the medical literature).
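As a rough illustration of comparing interval estimates across replications, here is a minimal sketch. The three "studies" and their numbers are invented, `interval_estimate` is a hypothetical helper, and the normal-approximation 95% interval is an assumption (with samples this small a t-based interval would be wider):

```python
import math
import statistics

def interval_estimate(sample, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, mean - z * se, mean + z * se

# Hypothetical measurements from three replications (made-up numbers).
replications = {
    "study A": [0.8, 1.4, 0.3, 1.1, 0.9, 1.6],
    "study B": [0.2, 0.9, -0.1, 0.7, 0.5, 0.4],
    "study C": [1.0, 0.6, 1.3, 0.8, 1.2, 0.5],
}

# Apply the same estimate to each replication and list the results side by side.
summary = {name: interval_estimate(xs) for name, xs in replications.items()}
for name, (mean, lo, hi) in summary.items():
    print(f"{name}: {mean:.2f} [{lo:.2f}, {hi:.2f}]")
```

The point of lining the intervals up is to look at the pattern across studies instead of asking, one study at a time, whether each clears a significance threshold. A multilevel model would go further and partially pool the estimates.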

P.S. This is minor, but I’m vaguely offended by referring to Ronald Fisher as “Sir” Ronald Fisher in an American journal. We don’t have titles here! I guess it’s better than calling him Lord Fisher-upon-Tyne or whatever.

P.P.S. I don’t know if I agree that “An ounce of replication is worth a ton of inferential statistics.” More data are fine, but sometimes it’s worth putting in a little effort to analyze what you have. Or, to put it more constructively, the best inferential tools are those that allow you to analyze more data that have already been collected.


  1. Richard D. Morey says:

    I agree with your central point. With respect to the IAT, the power comes from the fact that some response combinations (black+good) are more difficult to make than others (black+bad). This requires a more complex explanation than simply perceiving blacks and whites differently. It may not be implicit race bias, but there is a healthy discussion in the psychological literature about the meaning of the IAT effect. I've found that it is typically the science press that jumps to unwarranted conclusions about the IAT; psychologists doing the research are aware of many possible explanations for the effect. Among the reasonable explanations is implicit race bias. I don't find the possibility far-fetched.

  2. Frank says:

    On the use of titles: Should a Chinese scientist refer to you as Comrade Gellman?

    If the guy has a title in the UK and that is the tradition, then use that out of courtesy and leave the Jacobinism out…

  3. Seth Roberts says:

    I'd like to hear more about why you don't think an ounce of replication is worth a ton of inferential statistics. That has been my experience. The value of inferential statistics is that they predict what will happen. Plainly another way to figure out what will happen is to do it again.

  4. Alex says:

    On this topic, what is your general opinion of empirical Bayes methods for larger data sets, i.e., false discovery rate and other such approaches?

  5. Nick Cox says:

    Greetings, Comrade Gelman:

    The scope for minor and even major irritation over things like titles is enormous.

    It works in reverse too. J.M. Keynes, who was very English and who among other things wrote on both probability and statistics, never got a Ph.D. (he wrote books instead) and was never a Professor. He was often irritated by those, usually from the US, who wrongly assumed that an academic so eminent must be a Professor. There was much inverted snobbery in that, I imagine.

    A British friend of mine prunes "Jr", "III", "IV" and the like from bibliographic references as a minute silent protest at what he sees as an irrelevant convention, asking why should we care that (say) somebody's grandfather had the same name too?

    More seriously, there seems to be some growing interest in the statistical literature in giving a little detail on who people were. Even introductory statistics textbooks are now decorated not only with gratuitous colour photos of attractive people laughing a lot at something or other but also with extra boxes on "Who was Student?" or the like. A jolly good thing, too, so long as the facts are right. A recent book on ecological statistics refers to the American mathematician Abraham de Moivre and Sir William of Ockham.

    P.S. It's Sir Harold Jeffreys to you. (Incidentally, I can't think of any other Bayesian who was knighted. The frequentists are way ahead on that score.)

  6. Andrew says:


    That makes sense. I haven't read the psychology literature on this.


    It's not Jacobinism, it's Americanism, or republicanism, or something. Whatever George Washington said. But, yes, as far as I'm concerned, the Chinese can call me whatever they want in their journals (as long as they spell my name right), and similarly for the Brits. I'd just prefer U.S.-style in American journals.


    I'm not sure how to put replication and inferential statistics on the same scale . . . but a ton is 32,000 times an ounce. To put it in dollar terms, for example, I think that in many contexts, $32,000 of data analysis will tell me more than $1 worth of additional data. Often the additional data are already out there but haven't been analyzed.


    In answer to your questions:

    (1) I don't like the term "empirical Bayes" or the concept. I prefer hierarchical Bayes. (Look up "empirical Bayes" in the index to BDA.)

    (2) Regarding false discovery rate etc., I'll post the latest version of our multiple comparisons paper soon.

  7. Manoel Neto says:


    Have you read the original articles? They appeared to me (a non-specialist) to be very good ones, especially Berger & Sellke. Besides, it is one thing to know from your practice that the p-value exaggerates evidence against null hypotheses, and another to prove it, as Berger et al. did.

    So, I would like to know these things and would appreciate it if you could answer.

  8. Andrew says:


    The math in those articles is fine; I just don't think it makes any sense to assign 50% of prior probability to the null hypothesis, or to do anything like that. I can see the virtues of seeing what would happen if the null hyp were true, and I can see just ignoring the point null hyp and modeling things directly, but I don't think it makes much sense to put the two approaches on a common probability scale. And I certainly don't see such an analysis as demonstrating that the p-value exaggerates evidence.
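For readers who want to see the kind of calculation being debated in this last exchange, here is a minimal sketch. It assumes a normal mean with known standard deviation, a point null at zero, and a normal prior centered at zero under the alternative. The function names, the choice of prior scale, and the alternative prior weight of 0.1 are illustrative assumptions, not taken from the papers:

```python
import math

def bayes_factor_null(z, n, tau_over_sigma=1.0):
    """Bayes factor for H0: theta = 0 versus H1: theta ~ N(0, tau^2).

    z is the usual z-score, sqrt(n) * xbar / sigma.
    """
    r = n * tau_over_sigma ** 2
    return math.sqrt(1 + r) * math.exp(-0.5 * z * z * r / (1 + r))

def posterior_prob_null(z, n, prior_null=0.5, tau_over_sigma=1.0):
    """Posterior probability of the point null given z and its prior weight."""
    b01 = bayes_factor_null(z, n, tau_over_sigma)
    return prior_null * b01 / (prior_null * b01 + (1 - prior_null))

# z = 1.96 corresponds to a two-sided p-value of about 0.05.
for prior_null in (0.5, 0.1):
    post = posterior_prob_null(1.96, n=1, prior_null=prior_null)
    print(f"prior P(H0)={prior_null}: posterior P(H0)={post:.3f}")
```

With half the prior mass on the null, a result at p = 0.05 leaves the null with a posterior probability well above 0.05, which is the sense in which such analyses say the p-value "exaggerates" evidence. Changing the prior weight changes the answer substantially, so the conclusion rests on the choice of prior mass for the point null, which is exactly the assumption questioned above.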