Good, mediocre, and bad p-values

From my 2012 article in Epidemiology:

In theory the p-value is a continuous measure of evidence, but in practice it is typically trichotomized approximately into strong evidence, weak evidence, and no evidence (these can also be labeled highly significant, marginally significant, and not statistically significant at conventional levels), with cutoffs roughly at p=0.01 and 0.10.

One big practical problem with p-values is that they cannot easily be compared. The difference between a highly significant p-value and a clearly non-significant p-value is itself not necessarily statistically significant. . . . Consider a simple example of two independent experiments with estimates ± standard error of 25 ± 10 and 10 ± 10. The first experiment is highly statistically significant (two and a half standard errors away from zero, corresponding to a (normal-theory) p-value of about 0.01) while the second is not significant at all. Most disturbingly here, the difference is 15 ± 14, which is not close to significant . . .

In short, the p-value is itself a statistic and can be a noisy measure of evidence. This is a problem not just with p-values but with any mathematically equivalent procedure, such as summarizing results by whether the 95% confidence interval includes zero.
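Here is a minimal sketch of that two-experiment comparison, assuming normal-theory (z-based) two-sided p-values; the numbers are the ones from the passage above, and the Python/scipy framing is just for illustration:

```python
# Minimal sketch of the 25 +/- 10 vs. 10 +/- 10 example above,
# using normal-theory (z-based) two-sided p-values.
from math import sqrt
from scipy.stats import norm

def two_sided_p(estimate, se):
    """Two-sided normal-theory p-value for a null hypothesis of zero."""
    return 2 * norm.sf(abs(estimate / se))

est1, se1 = 25, 10          # "highly significant" experiment
est2, se2 = 10, 10          # clearly non-significant experiment

p1 = two_sided_p(est1, se1)             # about 0.01
p2 = two_sided_p(est2, se2)             # about 0.32
diff = est1 - est2                      # 15
se_diff = sqrt(se1**2 + se2**2)         # about 14
p_diff = two_sided_p(diff, se_diff)     # about 0.29: not close to significant

print(p1, p2, p_diff)
```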

Good, mediocre, and bad p-values

For all their problems, p-values sometimes “work” to convey an important aspect of the relation of data to model. Other times a p-value sends a reasonable message but does not add anything beyond a simple confidence interval. In yet other situations, a p-value can actively mislead. Before going on, I will give examples of each of these three scenarios.

A p-value that worked. Several years ago I was contacted by a person who suspected fraud in a local election (Gelman, 2004). Partial counts had been released throughout the voting process and he thought the proportions for the different candidates looked suspiciously stable, as if they had been rigged ahead of time to aim for a particular result. Excited to possibly be at the center of an explosive news story, I took a look at the data right away. After some preliminary graphs—which indeed showed stability of the vote proportions as they evolved during election day—I set up a hypothesis test comparing the variation in the data to what would be expected from independent binomial sampling. When applied to the entire dataset (27 candidates running for six offices), the result was not statistically significant: there was no less (and, in fact, no more) variance than would be expected by chance. In addition, an analysis of the 27 separate chi-squared statistics revealed no particular patterns. I was left to conclude that the election results were consistent with random voting (even though, in reality, voting was certainly not random—for example, married couples are likely to vote at the same time, and the sorts of people who vote in the middle of the day will differ from those who cast their ballots in the early morning or evening), and I regretfully told my correspondent that he had no case.

In this example, we did not interpret a non-significant result as a claim that the null hypothesis was true or even as a claimed probability of its truth. Rather, non-significance revealed the data to be compatible with the null hypothesis; thus, my correspondent could not argue that the data indicated fraud.
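For readers who want to see the mechanics of that kind of check, here is a rough sketch with made-up batch counts for a single candidate (not the actual election data); it compares the batch-to-batch variation in the counts to what independent binomial sampling with a constant proportion would produce:

```python
# Rough sketch of a stability check for partial vote counts, on
# hypothetical numbers for one candidate (not the actual election data).
import numpy as np
from scipy.stats import chi2

votes_for = np.array([112, 205, 181, 240, 162])     # hypothetical votes for the candidate, by batch
batch_totals = np.array([300, 550, 480, 640, 430])  # hypothetical total ballots per batch

p_hat = votes_for.sum() / batch_totals.sum()        # pooled proportion under the null
expected = batch_totals * p_hat
variances = batch_totals * p_hat * (1 - p_hat)      # binomial variance of each batch count

# Pearson-type statistic: under independent binomial sampling it is
# approximately chi-squared with (number of batches - 1) degrees of freedom.
x2 = np.sum((votes_for - expected) ** 2 / variances)
df = len(votes_for) - 1
p_value = chi2.sf(x2, df)

# A suspiciously *low* x2 (p_value near 1) would mean the proportions are
# more stable than binomial sampling allows; a high x2 means more variable.
print(x2, p_value)
```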

A p-value that was reasonable but unnecessary. It is common for a research project to culminate in the estimation of one or two parameters, with publication turning on a p-value being less than a conventional level of significance. For example, in our study of the effects of redistricting in state legislatures (Gelman and King, 1994), the key parameters were interactions in regression models for partisan bias and electoral responsiveness. Although we did not actually report p-values, we could have: what made our paper complete was that our findings of interest were more than two standard errors from zero, thus reaching the p<0.05 level. Had our significance level been much greater (for example, estimates that were four or more standard errors from zero), we would doubtless have broken up our analysis (for example, separately studying Democrats and Republicans) in order to broaden the set of claims that we could confidently assert. Conversely, had our regressions not reached statistical significance at the conventional level, we would have performed some sort of pooling or constraining of our model in order to arrive at some weaker assertion that reached the 5% level. (Just to be clear: we are not saying that we would have performed data dredging, fishing for significance; rather, we accept that sample size dictates how much we can learn with confidence; when data are weaker, it can be possible to find reliable patterns by averaging.) In any case, my point here is that it would have been just fine to summarize our results in this example via p-values even though we did not happen to use that formulation.

A misleading p-value. Finally, in many scenarios p-values can distract or even mislead, either when a non-significant result is wrongly interpreted as a confidence statement in support of the null hypothesis, or when a significant p-value is taken as proof of an effect. A notorious example of the latter is the recent paper of Bem (2011), which reported statistically significant results from several experiments on ESP. At first glance, it seems impressive to see multiple independent findings that are statistically significant (and combining the p-values using classical rules would yield an even stronger result), but with enough effort it is possible to find statistical significance anywhere (see Simmons, Nelson, and Simonsohn, 2011).

The focus on p-values seems both to have weakened the study (by encouraging the researcher to present only some of his data so as to draw attention away from non-significant results) and to have led reviewers to inappropriately view a low p-value (indicating a misfit of the null hypothesis to the data) as strong evidence in favor of a specific alternative hypothesis (ESP) rather than other, perhaps more scientifically plausible, alternatives such as measurement error and selection bias.
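As an aside on the "combining the p-values using classical rules" point: Fisher's method is the standard such rule, and a quick sketch with illustrative p-values (not Bem's actual results) shows how several marginal findings can pool into an apparently overwhelming one:

```python
# Sketch of Fisher's method for combining independent p-values.
# The p-values below are illustrative, not Bem's actual results.
import numpy as np
from scipy.stats import chi2

p_values = np.array([0.03, 0.04, 0.02, 0.05])   # several "just significant" studies

statistic = -2 * np.sum(np.log(p_values))       # ~ chi-squared with 2k df under the null
df = 2 * len(p_values)
combined_p = chi2.sf(statistic, df)

# Four marginal results in the 0.02-0.05 range combine to roughly 6e-4,
# far smaller than any single study's p-value -- which is exactly why
# selective reporting is so dangerous here.
print(combined_p)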

I’ve written on these issues in many other places, but the questions keep coming up, so I thought it was worth reposting.

Tomorrow I’ll highlight another part of this article, this time dealing with Bayesian inference.

4 thoughts on “Good, mediocre, and bad p-values”

  1. If you change your analysis strategy due to features of the data, wouldn’t you be facing a Garden of Forking Paths issue, even though you are not after statistically significant p-values? After all, the criterion “estimate is X sd away from zero” is also sensitive to model definition, and if we change the model based on data, how can we rely on it?
    In other words: how can you avoid overfitting the model if you keep tinkering with its definition and parameters based on what the data allow you to do? Of course, we can “learn” a lot more from the data this way, but how can we make sure those patterns are not due only to random variation without some sort of cross-validation?

    • My intuition is that one fairly safe way to change a model is to expand to a larger model in which the original model was nested. Such a nested model is equivalent to the expanded model accompanied by a degenerate prior that places all of its mass in the region of parameter space corresponding to the nested model. Expanding the model is equivalent to using a less dogmatic prior.

      • Corey, could you elaborate on your intuition about why it is fairly safe to expand a model this way? I understand that we are already doing some sort of averaging when we restrict an interaction term to zero. But how can we define a prior that is a good compromise between too restrictive and too liberal? I guess an informative prior may act as a regularizer without being too restrictive, as Andrew points out in the remainder of the article. This way, or with a multilevel model and partial pooling, we can be a little less worried about HARKing. Or can we? If we do not validate the model with some kind of independent test set, how can we “learn with confidence”?

  2. Alternatively, a strict separation between an exploratory study and a confirmatory one?

    i.e., model refinement & model validation must be kept strictly separate.
