What is a p-value in practice? The p-value is a measure of discrepancy of the fit of a model or “null hypothesis” H to data y. In theory the p-value is a continuous measure of evidence, but in practice it is typically trichotomized into strong evidence, weak evidence, and no evidence (these can also be labeled highly significant, marginally significant, and not statistically significant at conventional levels), with cutoffs roughly at p=0.01 and 0.10.
One big practical problem with p-values is that they cannot easily be compared. The p-value is itself a statistic and can be a noisy measure of evidence. This is a problem not just with p-values but with any mathematically equivalent procedure, such as summarizing results by whether the 95% confidence interval includes zero.
I’ve discussed this paper before (it’s a discussion from 2013 in the journal Epidemiology) but with all the recent controversy about p-values and statistical significance, I think it’s worth reposting:
The casual view of the p-value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings . . . The formal view of the p-value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence the popularity of alternative if wrong interpretations). A Bayesian interpretation based on a spike-and-slab model makes little sense in applied contexts in epidemiology, political science, and other fields in which true effects are typically nonzero and bounded (thus violating both the “spike” and the “slab” parts of the model).
Greenland and Poole [in the article in Epidemiology that I was discussing] make two points. First, they describe how p-values approximate posterior probabilities under prior distributions that contain little information relative to the data:
This misuse [of p-values] may be lessened by recognizing correct Bayesian interpretations. For example, under weak priors, 95% confidence intervals approximate 95% posterior probability intervals, one-sided P-values approximate directional posterior probabilities, and point estimates approximate posterior medians.
I used to think this way too (see many examples in our books) but in recent years have moved to the position that I don’t trust such direct posterior probabilities. Unfortunately, I think we cannot avoid informative priors if we wish to make reasonable unconditional probability statements. To put it another way, I agree with the mathematical truth of the quotation above but I think it can mislead in practice because of serious problems with apparently noninformative or weak priors. . . .
When sample sizes are moderate or small (as is common in epidemiology and social science), posterior probabilities will depend strongly on the prior distribution.
I’ll continue to quote from this article but for readability will remove the indentation.
Good, mediocre, and bad p-values
For all their problems, p-values sometimes “work” to convey an important aspect of the relation of data to model. Other times a p-value sends a reasonable message but does not add anything beyond a simple confidence interval. In yet other situations, a p-value can actively mislead. Before going on, I will give examples of each of these three scenarios.
A p-value that worked. Several years ago I was contacted by a person who suspected fraud in a local election. Partial counts had been released throughout the voting process and he thought the proportions for the different candidates looked suspiciously stable, as if they had been rigged to aim for a particular result. Excited to possibly be at the center of an explosive news story, I took a look at the data right away. After some preliminary graphs—which indeed showed stability of the vote proportions as they evolved during election day—I set up a hypothesis test comparing the variation in the data to what would be expected from independent binomial sampling. When applied to the entire dataset (27 candidates running for six offices), the result was not statistically significant: there was no less (and, in fact, no more) variance than would be expected by chance alone. In addition, an analysis of the 27 separate chi-squared statistics revealed no particular patterns. I was left to conclude that the election results were consistent with random voting (even though, in reality, voting was certainly not random—for example, married couples are likely to vote at the same time, and the sorts of people who vote in the middle of the day will differ from those who cast their ballots in the early morning or evening), and I regretfully told my correspondent that he had no case.
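The test described in this story can be sketched in a few lines. This is not the original analysis; the single two-way race, the batch sizes, and the 55% proportion are made-up illustrative numbers. Under binomial voting, the Pearson chi-squared statistic computed over the released batches should land near its degrees of freedom; a value far below that would indicate the suspiciously stable proportions my correspondent was worried about, and a value far above would indicate excess swings.

```python
import random

random.seed(1)

# Hypothetical partial counts for one two-candidate race, released in 10 batches.
# Under the null of "random voting," each batch's tally for candidate A is an
# independent Binomial(n, p) draw with a common p.
true_p = 0.55
batch_sizes = [200, 150, 300, 250, 180, 220, 170, 260, 190, 210]
votes_a = [sum(random.random() < true_p for _ in range(n)) for n in batch_sizes]

# Pooled estimate of p under the null of a common proportion
p_hat = sum(votes_a) / sum(batch_sizes)

# Pearson chi-squared: sum over batches and candidates of (obs - exp)^2 / exp
chi2 = 0.0
for n, a in zip(batch_sizes, votes_a):
    exp_a, exp_b = n * p_hat, n * (1 - p_hat)
    chi2 += (a - exp_a) ** 2 / exp_a + ((n - a) - exp_b) ** 2 / exp_b

df = len(batch_sizes) - 1  # one parameter (p_hat) estimated from the data
print(f"chi-squared = {chi2:.1f} on {df} df")
```

In the actual election one would compute one such statistic per race (27 candidates across six offices) and then look at the collection of statistics, as described above.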
In this example, we did not interpret a non-significant result as a claim that the null hypothesis was true or even as a claimed probability of its truth. Rather, non-significance revealed the data to be compatible with the null hypothesis; thus, my correspondent could not argue that the data indicated fraud.
A p-value that was reasonable but unnecessary. It is common for a research project to culminate in the estimation of one or two parameters, with publication turning on a p-value being less than a conventional level of significance. For example, in our study of the effects of redistricting in state legislatures, the key parameters were interactions in regression models for partisan bias and electoral responsiveness. Although we did not actually report p-values, we could have: what made our paper complete was that our findings of interest were more than two standard errors from zero, thus reaching the p<0.05 level. Had our significance level been much greater (for example, estimates that were four or more standard errors from zero), we would doubtless have broken up our analysis (for example, studying Democrats and Republicans separately) to broaden the set of claims that we could confidently assert. Conversely, had our regressions not reached statistical significance at the conventional level, we would have performed some sort of pooling or constraining of our model in order to arrive at some weaker assertion that reached the 5% level. (Just to be clear: we are not saying that we would have performed data dredging, fishing for significance; rather, we accept that sample size dictates how much we can learn with confidence; when data are weaker, it can be possible to find reliable patterns by averaging.) In any case, my point here is that in this example it would have been just fine to summarize our results via p-values even though we did not happen to use that formulation.
A misleading p-value. Finally, in many scenarios p-values can distract or even mislead, with a non-significant result wrongly interpreted as a confidence statement in support of the null hypothesis, or a significant p-value taken as proof of an effect. A notorious example of the latter is the recent paper of Bem, which reported statistically significant results from several experiments on ESP. At first glance, it seems impressive to see multiple independent findings that are statistically significant (and combining the p-values using classical rules would yield an even stronger result), but with enough effort it is possible to find statistical significance anywhere.
The focus on p-values seems both to have weakened that study (by encouraging the researcher to present only some of his data so as to draw attention away from non-significant results) and to have led reviewers to inappropriately view a low p-value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis (ESP) rather than other, perhaps more scientifically plausible alternatives such as measurement error and selection bias.
So-called noninformative priors (and, thus, the usual Bayesian interpretation of classical confidence intervals) can be way too strong
The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong. At first this may sound paradoxical, that a noninformative or weakly informative prior yields posteriors that are too forceful—and let me deepen the paradox by stating that a stronger, more informative prior will tend to yield weaker, more plausible posterior statements.
How can it be that adding prior information weakens the posterior? It has to do with the sort of probability statements we are often interested in making. Here is an example. A sociologist examining a publicly available survey discovered a pattern relating attractiveness of parents to the sexes of their children. He found that 56% of the children of the most attractive parents were girls, compared to 48% of the children of the other parents, and the difference was statistically significant at p<0.02. The assessments of attractiveness had been performed many years before these people had children, so the researcher felt he had support for a claim of an underlying biological connection between attractiveness and sex ratio. The original analysis by Kanazawa had multiple comparisons issues, and after performing a regression rather than selecting the most significant comparison, we get a p-value closer to 0.2 than the stated 0.02. For the purposes of our present discussion, though, in which we are evaluating the connection between p-values and posterior probabilities, it will not matter much which number we use. We shall go with p=0.2 because it seems like a more reasonable analysis given the data. Let θ be the true (population) difference in sex ratios of attractive and less attractive parents. Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yields a 90% posterior probability that θ is positive. Do I believe this? No. Do I even consider this a reasonable data summary? No again. We can derive these No responses in three different ways, first by looking directly at the evidence, second by considering the prior, and third by considering the implications for statistical practice if this sort of probability statement were computed routinely. First off, a claimed 90% probability that θ>0 seems too strong.
Given that the p-value (adjusted for multiple comparisons) was only 0.2—that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference—it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance alone, nor for that matter would I offer 99 to 1 odds based on the original claim of the 2% significance level.
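The arithmetic behind these numbers is simple under a normal approximation: with a flat prior, the posterior probability that θ>0 is one minus the one-sided p-value. A quick check using only Python's standard library (the p-values are the ones discussed above):

```python
from statistics import NormalDist

nd = NormalDist()

# Two-sided p = 0.2, as in the regression-adjusted analysis above
p_two_sided = 0.2
z = nd.inv_cdf(1 - p_two_sided / 2)  # estimate is about 1.28 standard errors from zero
post_prob = nd.cdf(z)                # flat-prior posterior P(theta > 0) = 1 - p/2
print(f"z = {z:.2f}, P(theta > 0) = {post_prob:.2f}")  # z = 1.28, P(theta > 0) = 0.90

# The originally claimed p = 0.02 would give the 99-to-1 version of the claim
post_prob_02 = nd.cdf(nd.inv_cdf(1 - 0.02 / 2))  # 0.99
```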
Second, the prior uniform distribution on θ seems much too weak. There is a large literature on sex ratios, with factors such as ethnicity, maternal age, and season of birth corresponding to difference in probability of girl birth of less than 0.5 percentage points. It is a priori implausible that sex-ratio differences corresponding to attractiveness are larger than for these other factors. Assigning an informative prior centered on zero shrinks the posterior toward zero, and the resulting posterior probability that θ>0 moves to a more plausible value in the range of 60%, corresponding to the idea that the result is suggestive but not close to convincing.
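The shrinkage just described can be sketched with a conjugate normal-normal update. The numbers here are illustrative choices of mine, not from the original analysis: an estimated difference of 4.5 percentage points with standard error 3.5 (which gives a two-sided p of about 0.2, matching the discussion above), and a zero-centered prior with standard deviation 0.5 percentage points, in line with the sizes of the known sex-ratio factors mentioned above.

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

# Illustrative numbers (not the original analysis): estimate and standard error
# chosen so that the two-sided p-value is about 0.2
est, se = 4.5, 3.5   # percentage points
tau = 0.5            # prior sd, matching known sex-ratio factors (< 0.5 points)

# Normal-normal conjugate update with prior mean zero:
# posterior precision is the sum of prior and data precisions
post_var = 1 / (1 / tau**2 + 1 / se**2)
post_mean = post_var * (est / se**2)

# Posterior probability that the true difference is positive
post_prob_positive = nd.cdf(post_mean / sqrt(post_var))
print(f"P(theta > 0 | data) = {post_prob_positive:.2f}")
```

With these inputs the posterior probability comes out in the high 50s, in the ballpark of the "range of 60%" figure above: suggestive but not close to convincing.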
Third, consider what would happen if we routinely interpreted one-sided p-values as posterior probabilities. In that case, an experimental result that is 1 standard error from zero—that is, exactly what one might expect from chance alone—would imply an 83% posterior probability that the true effect in the population has the same direction as the observed pattern in the data at hand. It does not make sense to me to claim 83% certainty—5 to 1 odds—based on data that not only could occur by chance alone but in fact represent an expected level of discrepancy. This system-level analysis accords with my criticism of the flat prior: as Greenland and Poole note in their article, the effects being studied in epidemiology typically range from -1 to 1 on the logit scale, hence analyses assuming broader priors will systematically overstate the probabilities of very large effects and will overstate the probability that an estimate from a small sample will agree in sign with the corresponding population quantity.
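The 83% figure in this third point is the same flat-prior tail calculation at one standard error; the exact normal value is Φ(1) ≈ 0.84, and the quoted 83% corresponds to round 5-to-1 odds:

```python
from statistics import NormalDist

# One estimate exactly 1 standard error from zero, flat prior:
# posterior P(true effect has the observed sign) = Phi(1)
prob_same_sign = NormalDist().cdf(1.0)
print(f"{prob_same_sign:.3f}")  # 0.841, i.e. roughly 5-to-1 odds
```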
How I have changed
Like many Bayesians, I have often represented classical confidence intervals as posterior probability intervals and interpreted one-sided p-values as the posterior probability of a positive effect. These are valid conditional on the assumed noninformative prior but typically do not make sense as unconditional probability statements. As Sander Greenland has discussed in much of his work over the years, epidemiologists and applied scientists in general have knowledge of the sizes of plausible effects and biases. . . .
The default conclusion from a noninformative prior analysis will almost invariably put too much probability on extreme values. A vague prior distribution assigns a lot of its probability to values that are never going to be plausible, and this distorts the posterior probabilities more than we tend to expect, something that we probably don't think about enough in our routine applications of standard statistical methods.