4 different meanings of p-value (and how my thinking has changed)

Given the discussion of yesterday’s post on p-values, I thought it could help to rerun a related post from a year ago, “4 different meanings of p-value (and how my thinking has changed),” which begins:

The p-value is one of the most common, and one of the most confusing, tools in applied statistics. Seasoned educators are well aware of all the things the p-value is not. Most notably, it’s not “the probability that the null hypothesis is true.” McShane and Gal find that even top researchers routinely misinterpret p-values.

But let’s forget for a moment about what p-values are not and instead ask what they are. It turns out that there are different meanings of the term. . . .

Definition 1. p-value(y) = Pr(T(y_rep) >= T(y) | H), where H is a “hypothesis,” a generative probability model, y is the observed data, y_rep are future data under the model, and T is a “test statistic,” some pre-specified function of data. I find it clearest to define this sort of p-value relative to potential future data; it can also be done mathematically and conceptually without any reference to repeated or future sampling, as in this 2019 paper by Vos and Holbert.
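To make Definition 1 concrete, here is a minimal Python sketch that estimates the tail-area probability by simulating replicated data under H. The fair-coin hypothesis, the data (16 heads in 20 flips), and all function names are illustrative assumptions, not part of the original post.

```python
import random

def tail_area_p_value(y, test_stat, simulate, n_rep=10_000, seed=0):
    """Monte Carlo estimate of Pr(T(y_rep) >= T(y) | H)."""
    rng = random.Random(seed)
    t_obs = test_stat(y)
    # Count replications whose test statistic is at least the observed one.
    hits = sum(test_stat(simulate(rng, len(y))) >= t_obs for _ in range(n_rep))
    return hits / n_rep

def simulate_fair_coin(rng, n):
    """Hypothetical H: n independent flips of a fair coin (1 = heads)."""
    return [1 if rng.random() < 0.5 else 0 for _ in range(n)]

# Assumed data: 16 heads in 20 flips; T(y) is the number of heads.
y = [1] * 16 + [0] * 4
p = tail_area_p_value(y, sum, simulate_fair_coin)
print(f"estimated p-value: {p:.4f}")
```

The estimate should land near the exact binomial tail area, 6196/2^20, or about 0.006. Swapping in a different `simulate` function changes H without touching the rest of the calculation.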

Definition 2. Start with a set of hypothesis tests of level alpha, for all values alpha between 0 and 1. p-value(y) is the smallest alpha of all the tests that reject y. This definition starts with a family of hypothesis tests rather than a test statistic, and it does not necessarily have a Bayesian interpretation, although in particular cases, it can also satisfy Definition 1.
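As a sketch of Definition 2 in one simple case, take the family of one-sided z-tests (an assumption chosen for illustration), where the level-alpha test rejects when z is at least the 1 - alpha quantile of the standard normal. Bisection over alpha finds the smallest level that rejects, and in this particular case it matches the Definition 1 tail area. The function names and the observed z are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def rejects(z, alpha):
    """Level-alpha one-sided z-test: reject when z >= the (1 - alpha) quantile."""
    return z >= STD_NORMAL.inv_cdf(1 - alpha)

def smallest_rejecting_alpha(z, tol=1e-10):
    """Bisect on alpha; the tests are nested, so rejection is monotone in alpha."""
    lo, hi = 0.0, 1.0  # the test at level hi rejects; the test at level lo does not
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rejects(z, mid):
            hi = mid
        else:
            lo = mid
    return hi

z_obs = 1.645  # assumed observed z-score
p_def2 = smallest_rejecting_alpha(z_obs)
p_def1 = 1 - STD_NORMAL.cdf(z_obs)  # tail area, as in Definition 1
print(f"Definition 2: {p_def2:.6f}   Definition 1: {p_def1:.6f}")
```

The agreement here depends on the tests being built from a tail-area rejection region; a family of tests constructed some other way need not reproduce a tail-area probability.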

Property 3. p-value(y) is some function of y that is uniformly distributed under H. I’m not saying that the term “p-value” is taken as a synonym for “uniform variate” but rather that this conditional uniform distribution is sometimes taken to be a required property of a p-value. It’s not a definition because in practice no one would define a p-value without some reference to a tail-area probability (Definition 1) or a rejection region (Definition 2)—but it is sometimes taken as a property that is required for something to be a true p-value. The relevant point here is that a p-value can satisfy Property 3 without satisfying Definition 1 (there are methods of constructing uniformly-distributed p-values that are not themselves tail-area probabilities), and a p-value can satisfy Definition 1 without satisfying Property 3 (when there is a composite null hypothesis and the distribution of the test statistic is not invariant to parameter values; see Xiao-Li Meng’s paper from 1994).
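Property 3 can be checked by simulation or by direct calculation. The sketch below, under assumed setups that are not from the original post, does both: it simulates one-sided z-test p-values when H is true (these should look uniform), and it computes the exact distribution of a tail-area p-value for the number of heads in 10 fair coin flips. The second is a Definition 1 p-value whose distribution is not uniform, simply because the test statistic is discrete.

```python
import random
from math import comb
from statistics import NormalDist

def continuous_p_values(n_sims=20_000, n=10, seed=1):
    """Simulate one-sided z-test p-values when H (standard normal data) is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n)]
        z = sum(y) / n ** 0.5  # z ~ N(0, 1) under H
        p_values.append(1 - std_normal.cdf(z))
    return p_values

def fraction_at_most(p_values, cutoff):
    """Empirical Pr(p <= cutoff)."""
    return sum(p <= cutoff for p in p_values) / len(p_values)

def discrete_prob_p_at_most(cutoff, n=10):
    """Exact Pr(p <= cutoff) for the tail-area p-value of heads in n fair flips."""
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    total = 0.0
    for k in range(n + 1):
        p_value = sum(pmf[k:])  # Pr(X >= k) under H
        if p_value <= cutoff:
            total += pmf[k]
    return total

p_values = continuous_p_values()
for cutoff in (0.05, 0.25, 0.5):
    print(cutoff, fraction_at_most(p_values, cutoff), discrete_prob_p_at_most(cutoff))
```

A uniform p-value would give Pr(p <= c) = c at every cutoff c. The simulated z-test fractions should come close to that, while the coin-flip p-value falls at or below 0.05 with probability 11/1024, about 0.011.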

Description 4. p-value(y) is the result of some calculations applied to data that are conventionally labeled as a p-value. Typically, this will be a p-value under Definition 1 or 2 above, but perhaps defined under a hypothesis H that is not actually the model being fit to the data at hand, or a hypothesis H(y) that itself is a function of data, for example from p-hacking or forking paths. I’m labeling this as a “description” rather than a “definition” to clarify that this sort of p-value is used all the time without always a clear definition of the hypothesis, for example if you have a regression coefficient with estimate beta_hat and standard error s, and you compute 2 times the tail-area probability of |beta_hat|/s under the normal or t distribution, without ever defining a null hypothesis relative to all the parameters in your model. Sander Greenland calls this sort of thing a “descriptive” p-value, capturing the idea that the p-value can be understood as a summary of the discrepancy or divergence of the data from H according to some measure, ranging from 0 = completely incompatible to 1 = completely compatible. For example, the p-value from a linear regression z-score can be understood as a data summary without reference to a full model for all the coefficients.
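The regression calculation described above takes only a few lines. This sketch uses the normal approximation rather than the t distribution; the function name and the numbers in the usage line are hypothetical.

```python
from statistics import NormalDist

def two_sided_normal_p(beta_hat, se):
    """2 times the upper tail area of |beta_hat| / se under a standard normal."""
    z = abs(beta_hat) / se
    return 2 * (1 - NormalDist().cdf(z))

# Assumed estimate and standard error, for illustration only.
p = two_sided_normal_p(beta_hat=0.42, se=0.20)
print(f"p = {p:.3f}")
```

Nothing in this calculation specifies a null hypothesis for the other parameters in the model, which is why the result is better called a description than a definition.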

These are not four definitions/properties/descriptions of the same thing. They are four different things. Not completely different, as they coincide in certain simple examples, but different, and they serve different purposes. They have different practical uses and implications, and you can make mistakes when you use one sort to answer a different question. . . .

The great thing about this post is that it’s purely descriptive (see this comment for elaboration on this particular point). I’m not telling anyone what to do. So this is a rare article about p-values where there’s nothing to argue about!

26 thoughts on “4 different meanings of p-value (and how my thinking has changed)”

  1. Property 3 will not apply to tests where the possible outcomes are discrete, nor to tests that are approximate (because of asymptotics, say) or conservative; the latter cases will also have issues with the other definitions. Just saying. Just descriptive as well – of course one might think that in these cases these are not true p-values, but then somebody else may disagree.

  2. My preference is to say that, given a subset of the parameter space H0, and a family of testing procedures, {t_alpha}, where the supremum (over H0) probability of t_alpha rejecting is alpha, and where for any alpha1 <= alpha2, if t_alpha1 rejects then t_alpha2 rejects, the p-value is the smallest alpha where t_alpha rejects. Then the other definitions you gave apply to instances of my definition under specific conditions. But that’s just my preference.

  3. I must remark that I never wrote and do not endorse the idea in def. 4 that “the p-value from a linear regression z-score can be understood as a data summary without reference to a full model for all the coefficients” or anything like that. On the contrary, all my writings about P-values from 2016 onward take pains to emphasize that every tail-area P-value (def. 1) refers to all aspects of the model used to compute it, in the sense of conditioning on that model and calling that model into doubt if the P-value is “near” zero, but never confirming the model even if the P-value equals 1. And this descriptive usage applies only to tail-area P-values.
    BTW the use of “compatibility” to describe tail-area P-values, and with the same cautions, can be found in writings of Karl Pearson and RA Fisher; Kempthorne used “consonance” and DR Cox used “consistency” for the same descriptive approach to tail-area P-values. For more details see
    Greenland, S. (2023). Divergence vs. decision P-values: A distinction worth making in theory and keeping in practice (with discussion). Scandinavian Journal of Statistics, 50(1), 1-35, https://arxiv.org/ftp/arxiv/papers/2301/2301.02478.pdf, https://onlinelibrary.wiley.com/doi/10.1111/sjos.12625, discussion 50(3), 899-933, corrigendum 51(1), 425.
    Greenland, S. (2023). Connecting simple and precise p-values to complex and ambiguous realities (includes rejoinder to comments on “Divergence vs. decision P-values”). Scandinavian Journal of Statistics, 50(3), 899-914, https://arxiv.org/abs/2304.01392, https://onlinelibrary.wiley.com/doi/10.1111/sjos.12645

  4. Definition 2 is an abomination. In my opinion the greatest problem with the use of p-values is the all-or-none Neymanian “decision” and that definition would lead any student to think that the role of p-values is bound up with that “decision”. Most unfortunate.

      • The underlying problem here is really not finding an adequate definition, and I suspect that that is why Andrew lists some of them as property or description. The standard definition of a p-value, as the probability of randomly obtaining a test statistic value at least as extreme as that observed when the null hypothesis is true, according to the statistical model, is correct. But note that I have included the often implied but unstated “according to the statistical model”.

        Definition 2 is based on thresholding that degrades the relationship between p-values and their evidential meaning. It might be a good fit with Neyman’s all-or-none decision framework, but it is ill-suited to most circumstances where a p-value is calculated.

  5. I do not see a definition of the p-value among these that I would be happy to teach. As was discussed the last time this post was made, the y_rep thing in definition 1 makes that definition dependent on notional repetitions of the experiment rather than (more correctly) the potential datasets or test statistic values entailed by the chosen statistical model.

    I know that you (Andrew) do not care for p-values, but surely it would be OK for you to modify this post before re-using it.

    • Michael:

      I think that “notional replications” and “potential datasets” are the same thing! So feel free to replace “replication” with “potential datasets” everywhere the word appears in the above post.

      • I agree that “notional repetitions” and “potential datasets” are the same thing. (The argument in the Vos and Holbert reference cited is essentially that discussing “fair coin flips” can be done considering an infinite sequence of flips with the same frequency of heads and tails but it can also be done considering an infinite ensemble of flips with the same fraction of heads and tails. I don’t disagree, even though I’m not sure if there is a problem with the first interpretation that the second interpretation solves.)

        The main problem with your first definition in my opinion is that it looks like the classical one but it isn’t: the sampling distribution used to calculate the p-value is not independent of y here.

        • Carlos:

          In some settings, the classical p-value corresponds to all four definitions above! That’s the source of much of the confusion, that people associate “p-value” with one of these properties, without realizing that in general the four properties won’t coincide.

        • The classical p-value p = Pr(T > t | H_0) – where t is the observed value of the statistic and T the sampling distribution of the statistic conditional on the null hypothesis – is included in the first definition.

          However, the first definition doesn’t correspond to the classical p-value.

          I’d say that mixing different things in definition 1 doesn’t help to reduce the confusion.

      • OK, I see that they can mean pretty much the same thing. Nonetheless I would say that “notional replications” puts the emphasis on the experiment (the thing that is replicated) whereas “potential datasets entailed by the statistical model” puts the emphasis on the statistical model, where it belongs.

  6. Andrew, This may confirm that you cannot post anything about p-values that will not raise arguments.
    I very much like definition 1 because it’s relatively easy to understand, and “probability” is front and center. If we don’t have a probability distribution in mind, if we have not gone through the trouble of understanding what is random in our data, we should not be quoting p-values. Maybe my kind of p-value is a special kind and should really be called a probability-value.
    All the other definitions are fine for statisticians to use and discuss, but maybe not for the general population. For the gen-pop I would argue for one definition which emphasizes that there is a mathematical concept of a probability distribution underlying a p-value, and if we cannot identify what this distribution is, or why it is, we should not be quoting p-values. This may be a radical position, but I am not in academia. I work in an industry where proofs are made by p-values, where p-values are quoted when we know the population entirely, where reports are filled with p-values computed based on exploratory data analysis results, and where at the end of a long talk describing the efforts made to find an association between this and that, the Chief Data Science officer will ask for a p-value to be computed. And these are very smart people – if I am in a meeting with 2 or more colleagues, I know I’m in the lower quartile of intelligence. Yet there is so much confusion about this concept that I feel very strongly that it needs to be simplified and de-mystified.

    -Fz

    • Francis:

      I agree with you—I like definition 1 also! To me, a p-value is a tail-area probability, a statement about some distribution of hypothetical replications under a model. That’s why, in my 2003 paper, I defined p-values that way, and I used “u-value” to refer to statistics with a uniform distribution under the null hypothesis. To me, definition 1 was core, and the other definitions were properties that p-values happened to have under some simple scenarios.

      But, over the years, I’ve learned that many people take definition 2, 3, or 4 as their core property. I’ve had endless discussions with people who have said that my “p-values” are “uncalibrated” because they don’t satisfy definition 3. This annoys me to no end, and my response has always been to say, Yeah, p-values are not in general u-values, that’s just the way things are. I even published a paper, Two simple examples for understanding posterior p-values whose distributions are far from uniform to elaborate on this point.

      But ultimately I realized that I’m not going to change the terminology. As with fixed and random effects, we just have to accept that the term “p-value” is overloaded; different people use the expression to mean different things.

      I just wish I’d realized this back in the late 1980s when I was first working in the area. If, the very first time I’d written about p-values, I’d said clearly that the classical p-value generalizes in (at least) two different ways, and that I was talking about the tail-area property, not the uniformity property, I think that would’ve reduced the confusion level in this area.

    • Note that his definition 1 is not equivalent to the easy to understand one that you like (if I understand you correctly).

      The sampling distribution used to calculate the p-value in definition 1 is in principle conditional on the observed data. It’s “predictive”, it’s about “future” replications.

      (The “classic” p-value definition uses a sampling distribution for the statistic which doesn’t depend on the observed data. It depends only on the null hypothesis which is independent from the observed data.)

      • Carlos:

        My definition 1 is a generalization of the definition you’re talking about. It reduces to the definition you’re talking about when you have a pivotal quantity with an invariant distribution of the test statistic. This fits my general theme that different definitions of the p-value coincide in various important special cases, which is one reason that people have often been slow to realize that the term “p-value” can have multiple meanings more generally.

        • Yes, the point is that they are not equal (and it’s not obvious so it’s worth pointing it out).

          I agree that the classical definition may be covered if in the context of a model of unknown parameters we identify the submodel for some pre-specified value of the parameters H0 (independent of observed data) with the “generative probability model” mentioned in definition 1.

          The difference between the “classical” case and the extension where the model depends on the data is not that the distribution of the test statistic doesn’t depend on the parameters (that’s convenient but not necessary). The difference is that the distribution of the test statistic doesn’t depend on the observed data. We recover the “classical” definition when, using your notation, the sampling distribution of y_rep is independent of y.

    • There are really only two categories of p-values:

      1) p-values calculated using what your theory predicts
      2) p-values calculated using something besides what your theory predicts

      All of the troubles you describe correspond to use case #2. Blaming p-values for that is a correlation == causation fallacy. I really think the path forward is to just recognize the problem for what it is (a sociological one) and do your best to work around it.
