The p-value is one of the most common, and one of the most confusing, tools in applied statistics. Seasoned educators are well aware of all the things the p-value is not. Most notably, it’s not “the probability that the null hypothesis is true.” McShane and Gal find that even top researchers routinely misinterpret p-values.
But let’s forget for a moment about what p-values are not and instead ask what they are. It turns out that there are different meanings of the term. At first I was going to say that these are different “definitions,” but Sander Greenland pointed out that not all are definitions:
Definition 1. p-value(y) = Pr(T(y_rep) >= T(y) | H), where H is a “hypothesis,” a generative probability model, y is the observed data, y_rep are future data under the model, and T is a “test statistic,” some pre-specified function of the data. I find it clearest to define this sort of p-value relative to potential future data; it can also be done mathematically and conceptually without any reference to repeated or future sampling, as in this 2019 paper by Vos and Holbert.
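To make Definition 1 concrete, here is a minimal simulation sketch in Python (the function names and the toy normal model are my own, just for illustration): simulate replicate datasets from H and count how often the replicated test statistic is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

def pvalue_def1(y, simulate_rep, T, n_rep=10_000):
    """Definition 1: Pr(T(y_rep) >= T(y) | H), estimated by simulating
    replicate datasets y_rep from the null model H."""
    t_obs = T(y)
    t_rep = np.array([T(simulate_rep()) for _ in range(n_rep)])
    return float(np.mean(t_rep >= t_obs))

# Toy example: H says y ~ Normal(0, 1) with n = 20; T is the sample mean.
y = rng.normal(0.3, 1.0, size=20)  # data drawn slightly off-null
p = pvalue_def1(y, lambda: rng.normal(0.0, 1.0, size=20), np.mean)
```

Nothing here is specific to the normal model; any generative H you can simulate from will do, which is the point of thinking of this sort of p-value in terms of potential future data.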
Definition 2. Start with a set of hypothesis tests of level alpha, for all values of alpha between 0 and 1. p-value(y) is the smallest alpha of all the tests that reject y. This definition starts with a family of hypothesis tests rather than a test statistic, and it does not necessarily have a tail-area interpretation, although in particular cases it can also satisfy Definition 1.
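As a sketch of Definition 2, here is a nested family of two-sided z-tests of H: mean = 0 with known sd = 1 (the setup and names are mine, for illustration). The p-value is the smallest level alpha at which some test in the family rejects, and for this particular family it works out to the usual tail-area probability, one of the simple cases where the meanings coincide.

```python
import numpy as np
from scipy import stats

def rejects(y, alpha):
    """One member of a family of level-alpha two-sided z-tests of
    H: mean = 0 with known sd = 1."""
    z = np.sqrt(len(y)) * np.mean(y)
    return abs(z) > stats.norm.ppf(1 - alpha / 2)

def pvalue_def2(y, grid=np.linspace(1e-4, 0.9999, 9999)):
    """Definition 2: the smallest alpha among all tests that reject y
    (computed here by brute force over a grid of levels)."""
    rejecting = [alpha for alpha in grid if rejects(y, alpha)]
    return min(rejecting) if rejecting else 1.0

# For this nested family, pvalue_def2(y) ~= 2 * (1 - Phi(|z|)),
# the tail-area p-value of Definition 1.
```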
Property 3. p-value(y) is some function of y that is uniformly distributed under H. I’m not saying that the term “p-value” is taken as a synonym for “uniform variate” but rather that this conditional uniform distribution is sometimes taken to be a required property of a p-value. It’s not a definition because in practice no one would define a p-value without some reference to a tail-area probability (Definition 1) or a rejection region (Definition 2)—but it is sometimes taken as a property that is required for something to be a true p-value. The relevant point here is that a p-value can satisfy Property 3 without satisfying Definition 1 (there are methods of constructing uniformly-distributed p-values that are not themselves tail-area probabilities), and a p-value can satisfy Definition 1 without satisfying Property 3 (when there is a composite null hypothesis and the distribution of the test statistic is not invariant to parameter values; see Xiao-Li Meng’s paper from 1994).
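One of these gaps is easy to see in simulation. Here is a sketch of a Definition 1 p-value that fails Property 3 (the toy models are my own choices; discreteness of the data is an even simpler route to this failure than the composite-null case just mentioned):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 5000

# Continuous point null: H says y ~ Normal(0, 1) with n = 30. The two-sided
# z-test p-value satisfies Property 3: it is uniform under H.
z = rng.normal(0.0, 1.0, size=(n_sims, 30)).mean(axis=1) * np.sqrt(30)
p_cont = 2 * stats.norm.sf(np.abs(z))

# Discrete point null: H says y ~ Binomial(10, 0.5). The tail-area p-value
# Pr(Y >= y | H) satisfies Definition 1 but is not uniform under H.
y = rng.binomial(10, 0.5, size=n_sims)
p_disc = stats.binom.sf(y - 1, 10, 0.5)

print(stats.kstest(p_cont, "uniform").pvalue)  # large: consistent with uniform
print(stats.kstest(p_disc, "uniform").pvalue)  # essentially zero: not uniform
```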
Description 4. p-value(y) is the result of some calculations applied to data that are conventionally labeled as a p-value. Typically, this will be a p-value under Definition 1 or 2 above, but perhaps defined under a hypothesis H that is not actually the model being fit to the data at hand, or a hypothesis H(y) that is itself a function of the data, for example from p-hacking or forking paths. I’m labeling this as a “description” rather than a “definition” to clarify that this sort of p-value is used all the time without always having a clear definition of the hypothesis; for example, if you have a regression coefficient with estimate beta_hat and standard error s, you might compute 2 times the tail-area probability of |beta_hat|/s under the normal or t distribution, without ever defining a null hypothesis relative to all the parameters in your model. Sander Greenland calls this sort of thing a “descriptive” p-value, capturing the idea that the p-value can be understood as a summary of the discrepancy or divergence of the data from H according to some measure, ranging from 0 = completely incompatible to 1 = completely compatible. For example, the p-value from a linear regression z-score can be understood as a data summary without reference to a full model for all the coefficients.
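For the regression example just described, the conventional calculation is short enough to write out (a sketch; the function name is mine):

```python
from scipy import stats

def nominal_pvalue(beta_hat, se, df=None):
    """Description 4: twice the tail area of |beta_hat| / se under the
    normal reference distribution (or the t, if df is supplied), with no
    null model for the rest of the regression ever being specified."""
    z = abs(beta_hat) / se
    return 2 * (stats.t.sf(z, df) if df is not None else stats.norm.sf(z))

nominal_pvalue(0.8, 0.3)         # ~0.0077 under the normal reference
nominal_pvalue(0.8, 0.3, df=10)  # ~0.024 under the heavier-tailed t
```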
These are not four definitions/properties/descriptions of the same thing. They are four different things. Not completely different, as they coincide in certain simple examples, but different, and they serve different purposes. They have different practical uses and implications, and you can make mistakes when you use one sort to answer a different question. Just as, for example, posterior intervals and confidence intervals coincide in some simple examples but in general are different: lots of real-world posterior intervals don’t have classical confidence coverage, even in theory, and lots of real-world confidence intervals don’t have Bayesian posterior coverage, even in theory.
A single term with many meanings—that’s a recipe for confusion! Hence this post, which does not claim to solve any technical problems but is just an attempt to clarify.
In all the meanings above, H is a “generative probability model,” that is, a class of probability models for the modeled data, y. If H is a simple null hypothesis, H represents a specified probability distribution, p(y|H). If H is a composite null hypothesis, there is some vector of unknown parameters theta indexing a family of probability distributions, p(y|theta,H). As Daniel Lakeland so evocatively put it, a null hypothesis is a specific random number generator.
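In Lakeland’s spirit, here is what the distinction between simple and composite null hypotheses looks like as random number generators (a toy sketch; the models are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def simple_null(n=50):
    """A simple null hypothesis: one fully specified generator, p(y | H)."""
    return rng.normal(0.0, 1.0, size=n)

def composite_null(theta, n=50):
    """A composite null hypothesis: a family of generators indexed by an
    unknown parameter theta, p(y | theta, H); here theta is a scale."""
    return rng.normal(0.0, theta, size=n)
```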
Under any of the above meanings, the p-value is a number, a function of the data y, and it can also be considered as a random variable, with a probability distribution induced by the distribution of y under H. For a composite null hypothesis, that distribution will in general depend on theta, but that complexity is not our focus here.
So, back to p-values. How can one term have four meanings? Pretty weird, huh?
The answer is that under certain ideal conditions, the four meanings coincide. In a model with continuous data and a continuous test statistic and a point null hypothesis, all four of the above meanings give the same answer. Also there are some models with unknown parameters where the test statistic can be defined to have a distribution under H that is invariant to parameters. And this can also be the case asymptotically.
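The invariant-to-parameters case can be checked directly. Here is a sketch using the one-sample t statistic, which is pivotal under the normal null: its distribution, and hence the distribution of its p-value, is the same for every value of the unknown scale (the simulation details are mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Composite null H: y ~ Normal(0, sigma^2) with sigma unknown. The t
# statistic is pivotal, so its p-value is uniform no matter what sigma is.
for sigma in (0.1, 1.0, 10.0):
    p = [stats.ttest_1samp(rng.normal(0.0, sigma, size=20), 0.0).pvalue
         for _ in range(2000)]
    print(sigma, stats.kstest(p, "uniform").pvalue)  # consistent with uniform
```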
More generally, though, the four meanings are different. None of them are “better” or “worse”; they’re just different. Each has some practical value:
– A p-value under Definition 1 can be directly interpreted as a probability statement about future data conditional on the null hypothesis (as discussed here).
– A p-value under Definition 2 can be viewed as a summary of a class of well-defined hypothesis tests (as discussed in footnote 4 of this article by Philip Stark).
– A p-value with Property 3 has a known distribution under the null hypothesis, so the distribution of a collection of p-values can be compared to uniform (as discussed here).
– A p-value from Description 4 is unambiguously defined from existing formulas, so it is a clear data summary even if it can’t easily be interpreted as a probability in the context of the problem at hand.
As an example, in this article from 1989, Besag and Clifford come up with a Monte Carlo procedure that yields p-values that satisfy Property 3 but not Definition 1 or Description 4. And in 1996, Meng, Stern, and I discussed Bayesian p-values that satisfied Definition 1 but not Property 3.
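A standard Monte Carlo test in the same spirit shows how Property 3 can hold without the reported number being the tail-area probability of Definition 1 (this sketch is the classic construction with simulated replicates, not Besag and Clifford’s cleverer serial procedure; the details are mine):

```python
import numpy as np

rng = np.random.default_rng(4)

def mc_pvalue(y, simulate_rep, T, m=99):
    """Monte Carlo p-value (1 + #{T(y_rep) >= T(y)}) / (m + 1). Under H,
    with a continuous test statistic, the rank of T(y) among the m + 1
    values is uniform, so this p-value is exactly uniform on
    {1/(m+1), ..., 1}: Property 3 in discrete form, even though for
    finite m the number is only a noisy estimate of the tail area."""
    t_obs = T(y)
    exceed = sum(T(simulate_rep()) >= t_obs for _ in range(m))
    return (1 + exceed) / (m + 1)

# Same toy normal model as before:
y = rng.normal(0.0, 1.0, size=20)
p = mc_pvalue(y, lambda: rng.normal(0.0, 1.0, size=20), np.mean)
```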
The natural way to proceed is to give different names to the different p-values. The trouble is that different authors choose different naming conventions!
I’ve used the term “p-value” for Definition 1 and “u-value” for Property 3; see section 2.3 of this article from 2003. And in this article from 2014 we attempted to untangle the difference between Definition 1 and Property 3. I haven’t thought much about Definition 2, and I’ve used the term “nominal p-value” for Description 4.
My writing about p-values has taken Definition 1 as a starting point. My goal has been to examine misfit of the null hypothesis with respect to some data summary or test statistic, not to design a procedure that rejects with a fixed probability conditional on a null hypothesis or to construct a measure of evidence that is uniformly distributed under the null. Others including Bernardo, Bayarri, and Robins are less interested in a particular test statistic and more interested in creating a testing procedure or a calibrated measure of evidence, and they have taken Definition 2 or Property 3 as their baseline, referring to p-values with Property 3 as “calibrated” or “valid” p-values. This terminology is as valid as mine; it’s just taking a different perspective on the same problem.
In an article from 2023 with follow-up here, Sander Greenland distinguishes between “divergence p-values” and “decision p-values,” addressing similar issues of overloading of the term “p-value.” The former corresponds to Definition 1 above, using the same sort of non-repeated-sampling view of p-values favored by Vos and Holbert, and addresses the issues raised by Description 4; the latter corresponds to Definition 2 and addresses the issues raised by Property 3. As Greenland emphasizes, a p-value doesn’t exist in a vacuum; it should be understood in the context in which it will be used.
My thinking has changed.
My own thinking about p-values and significance testing has changed over the years. It all started when I was working on my Ph.D. thesis, fitting a big model to medical imaging data and finding that the model didn’t fit the data. I could see this because the chi-squared statistic was too large! We had something like 20,000 pieces of count data, and the statistic was about 30,000, which would be compared to a chi-squared distribution with 20,000 minus k degrees of freedom, where k is the number of “effective degrees of freedom” in the model. The “effective degrees of freedom” thing was interesting, and it led me into the research project that culminated in the 1996 paper with Meng and Stern.
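To see how decisive that was: with about 20,000 degrees of freedom, the chi-squared reference distribution has mean about 20,000 and standard deviation about sqrt(2 x 20,000), or roughly 200, so a statistic near 30,000 is about 50 standard deviations above the null mean. A quick check (using the rounded numbers from the story):

```python
import math
from scipy import stats

df = 20_000
stat = 30_000
print((stat - df) / math.sqrt(2 * df))  # ~ 50 sd above the null mean
print(stats.chi2.sf(stat, df))          # tail area underflows to 0.0
```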
The relevant point here was that I was not coming into that project with the goal of creating “Bayesian p-values.” Rather, I wanted to be able to check the fit of model to data, and this was a way for me to deal with the fact that existing degrees-of-freedom adjustments did not work in my problem.
The other thing I learned when working on that project was that a lot of Bayesians didn’t like the idea of model checking at all! They had this bizarre (to me) attitude that, because their models were “subjective,” they didn’t need to be checked. So I leaned hard into the idea that model checking is a core part of Bayesian data analysis. This one example, and the fallout from it, gave me a much clearer sense of data analysis as a Popperian or Lakatosian process, leading to this 2013 article with Shalizi.
In the meantime, though, I started to lose interest in p-values. Model checking was and remains important to me, but I found myself doing it using graphs. Actually, the only examples I can think of where I used hypothesis testing for data analysis were the aforementioned tomography model from the late 1980s (where the null hypothesis was strongly rejected) and the “55,000 residents desperately need your help!” example from 2004 (where we learned from a non-rejection of the null). Over the years, I remained aware of issues regarding p-values, and I wrote some articles on the topic, but this was more from theoretical interest or with the goal of better understanding common practice, not with any goal to develop better methods for my own use. This discussion from 2013 of a paper by Greenland and Poole gives a sense of my general thinking.
P.S. Remember, the problems with p-values are not just with p-values.
P.P.S. I thank Sander Greenland for his help with this, even if he does not agree with everything written here.
P.P.P.S. Sander also reminds me that all the above disagreements are trivial compared to the big issues of people acting as if “not statistically significant” results are confirmations of the null hypothesis and as if “statistically significant” results are confirmations of their preferred alternative. Agreed. Those are the top two big problems related to model checking and hypothesis testing, and then I’d say the third most important problem is people not being willing to check their models or consider what might happen if their assumptions are wrong (both Greenland and Stark have written a lot about that problem, and, as noted above, that was one of my main motivations for doing research on the topic of hypothesis testing).
Compared to those three big issues, the different meanings of p-values are less of a big deal. But they do come up, as all four of the above sorts of p-values are used in serious statistical practice, so I think it’s good to be aware of their differences. Otherwise it’s easy to slip into the attitude that other methods are “wrong,” when they’re just different.