# P-values and statistical practice

What is a p-value in practice? The p-value is a measure of discrepancy of the fit of a model or “null hypothesis” H to data y. In theory the p-value is a continuous measure of evidence, but in practice it is typically trichotomized approximately into strong evidence, weak evidence, and no evidence (these can also be labeled highly significant, marginally significant, and not statistically significant at conventional levels), with cutoffs roughly at p=0.01 and 0.10.

One big practical problem with p-values is that they cannot easily be compared. The p-value is itself a statistic and can be a noisy measure of evidence. This is a problem not just with p-values but with any mathematically equivalent procedure, such as summarizing results by whether the 95% confidence interval includes zero.
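As a rough illustration of why p-values are hard to compare, take the two cutoffs mentioned above, p=0.01 and p=0.10, and ask whether two studies landing on them actually differ from each other. The sketch below assumes two independent estimates with equal standard errors and a normal sampling distribution; the function names are made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()

def z_from_two_sided_p(p):
    """Magnitude of the z-score that gives a two-sided p-value of p."""
    return std_normal.inv_cdf(1 - p / 2)

def two_sided_p_from_z(z):
    """Two-sided p-value for a z-score."""
    return 2 * (1 - std_normal.cdf(abs(z)))

# Two hypothetical studies, assumed independent with equal standard errors.
z_strong = z_from_two_sided_p(0.01)  # "strong evidence"
z_weak = z_from_two_sided_p(0.10)    # "weak evidence"

# On the z scale each estimate has standard error 1, so the difference
# of the two has standard error sqrt(2).
z_diff = (z_strong - z_weak) / 2 ** 0.5
p_diff = two_sided_p_from_z(z_diff)

print(f"z for the difference: {z_diff:.2f}; p for the difference: {p_diff:.2f}")
```

Under these assumptions the difference between the "strong" and the "weak" result is itself well within one standard error of zero.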

I’ve discussed this paper before (it’s a discussion from 2013 in the journal Epidemiology) but with all the recent controversy about p-values and statistical significance, I think it’s worth reposting:

The casual view of the p-value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings . . . The formal view of the p-value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence the popularity of alternative if wrong interpretations). A Bayesian interpretation based on a spike-and-slab model makes little sense in applied contexts in epidemiology, political science, and other fields in which true effects are typically nonzero and bounded (thus violating both the “spike” and the “slab” parts of the model).

Greenland and Poole [in the article in Epidemiology that I was discussing] make two points. First, they describe how p-values approximate posterior probabilities under prior distributions that contain little information relative to the data:

This misuse [of p-values] may be lessened by recognizing correct Bayesian interpretations. For example, under weak priors, 95% confidence intervals approximate 95% posterior probability intervals, one-sided P-values approximate directional posterior probabilities, and point estimates approximate posterior medians.

I used to think this way too (see many examples in our books) but in recent years have moved to the position that I don’t trust such direct posterior probabilities. Unfortunately, I think we cannot avoid informative priors if we wish to make reasonable unconditional probability statements. To put it another way, I agree with the mathematical truth of the quotation above but I think it can mislead in practice because of serious problems with apparently noninformative or weak priors. . . .

When sample sizes are moderate or small (as is common in epidemiology and social science), posterior probabilities will depend strongly on the prior distribution.

I’ll continue to quote from this article but for readability will remove the indentation.

## Good, mediocre, and bad p-values

For all their problems, p-values sometimes “work” to convey an important aspect of the relation of data to model. Other times a p-value sends a reasonable message but does not add anything beyond a simple confidence interval. In yet other situations, a p-value can actively mislead. Before going on, I will give examples of each of these three scenarios.

A p-value that worked. Several years ago I was contacted by a person who suspected fraud in a local election. Partial counts had been released throughout the voting process and he thought the proportions for the different candidates looked suspiciously stable, as if they had been rigged to aim for a particular result. Excited to possibly be at the center of an explosive news story, I took a look at the data right away. After some preliminary graphs—which indeed showed stability of the vote proportions as they evolved during election day—I set up a hypothesis test comparing the variation in the data to what would be expected from independent binomial sampling. When applied to the entire dataset (27 candidates running for six offices), the result was not statistically significant: there was no less (and, in fact, no more) variance than would be expected by chance alone. In addition, an analysis of the 27 separate chi-squared statistics revealed no particular patterns. I was left to conclude that the election results were consistent with random voting (even though, in reality, voting was certainly not random—for example, married couples are likely to vote at the same time, and the sorts of people who vote in the middle of the day will differ from those who cast their ballots in the early morning or evening), and I regretfully told my correspondent that he had no case.

In this example, we did not interpret a non-significant result as a claim that the null hypothesis was true or even as a claimed probability of its truth. Rather, non-significance revealed the data to be compatible with the null hypothesis; thus, my correspondent could not argue that the data indicated fraud.

A p-value that was reasonable but unnecessary. It is common for a research project to culminate in the estimation of one or two parameters, with publication turning on a p-value being less than a conventional level of significance. For example, in our study of the effects of redistricting in state legislatures, the key parameters were interactions in regression models for partisan bias and electoral responsiveness. Although we did not actually report p-values, we could have: what made our paper complete was that our findings of interest were more than two standard errors from zero, thus reaching the p<0.05 level. Had our significance level been much greater (for example, estimates that were four or more standard errors from zero), we would doubtless have broken up our analysis (for example, studying Democrats and Republicans separately) to broaden the set of claims that we could confidently assert. Conversely, had our regressions not reached statistical significance at the conventional level, we would have performed some sort of pooling or constraining of our model in order to arrive at some weaker assertion that reached the 5% level. (Just to be clear: we are not saying that we would have performed data dredging, fishing for significance; rather, we accept that sample size dictates how much we can learn with confidence; when data are weaker, it can be possible to find reliable patterns by averaging.) In any case, my point here is that it would have been just fine to summarize our results in this example via p-values even though we did not happen to use that formulation.

A misleading p-value. Finally, in many scenarios p-values can distract or even mislead, either a non-significant result wrongly interpreted as a confidence statement in support of the null hypothesis, or a significant p-value that is taken as proof of an effect. A notorious example of the latter is the recent paper of Bem, which reported statistically significant results from several experiments on ESP. At brief glance, it seems impressive to see multiple independent findings that are statistically significant (and combining the p-values using classical rules would yield an even stronger result), but with enough effort it is possible to find statistical significance anywhere.

The focus on p-values seems to have both weakened that study (by encouraging the researcher to present only some of his data so as to draw attention away from non-significant results) and to have led reviewers to inappropriately view a low p-value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis (ESP) rather than other, perhaps more scientifically plausible alternatives such as measurement error and selection bias.

## So-called noninformative priors (and, thus, the usual Bayesian interpretation of classical confidence intervals) can be way too strong

The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong. At first this may sound paradoxical, that a noninformative or weakly informative prior yields posteriors that are too forceful—and let me deepen the paradox by stating that a stronger, more informative prior will tend to yield weaker, more plausible posterior statements.

How can it be that adding prior information weakens the posterior? It has to do with the sort of probability statements we are often interested in making. Here is an example. A sociologist examining a publicly available survey discovered a pattern relating attractiveness of parents to the sexes of their children. He found that 56% of the children of the most attractive parents were girls, compared to 48% of the children of the other parents, and the difference was statistically significant at p<0.02. The assessments of attractiveness had been performed many years before these people had children, so the researcher felt he had support for a claim of an underlying biological connection between attractiveness and sex ratio.

The original analysis by Kanazawa had multiple comparisons issues, and after performing a regression rather than selecting the most significant comparison, we get a p-value closer to 0.2 rather than the stated 0.02. For the purposes of our present discussion, though, in which we are evaluating the connection between p-values and posterior probabilities, it will not matter much which number we use. We shall go with p=0.2 because it seems like a more reasonable analysis given the data.

Let θ be the true (population) difference in sex ratios of attractive and less attractive parents. Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yield a 90% posterior probability that θ is positive. Do I believe this? No. Do I even consider this a reasonable data summary? No again. We can derive these No responses in three different ways, first by looking directly at the evidence, second by considering the prior, and third by considering the implications for statistical practice if this sort of probability statement were computed routinely.

First off, a claimed 90% probability that θ>0 seems too strong. Given that the p-value (adjusted for multiple comparisons) was only 0.2 (that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference), it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance alone, nor for that matter would I offer 99 to 1 odds based on the original claim of the 2% significance level.
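The 90% and 99% figures can be reproduced with a short calculation. This is only a sketch: it assumes a normal sampling distribution with known standard error and an estimate in the positive direction, so that a flat prior gives a posterior centered on the estimate. The function name is invented for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

def prob_positive_flat_prior(two_sided_p):
    """Posterior P(theta > 0) under a flat prior, for a positive estimate
    whose two-sided p-value is two_sided_p (normal approximation)."""
    z = std_normal.inv_cdf(1 - two_sided_p / 2)  # estimate / standard error
    return std_normal.cdf(z)                     # works out to 1 - p/2

for p in (0.2, 0.02):
    prob = prob_positive_flat_prior(p)
    print(f"p = {p}: P(theta > 0) = {prob:.2f}, odds {prob / (1 - prob):.0f} to 1")
```

Under these assumptions the flat-prior posterior probability is simply one minus half the two-sided p-value, which is where the 9 to 1 and 99 to 1 odds come from.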

Second, the prior uniform distribution on θ seems much too weak. There is a large literature on sex ratios, with factors such as ethnicity, maternal age, and season of birth corresponding to difference in probability of girl birth of less than 0.5 percentage points. It is a priori implausible that sex-ratio differences corresponding to attractiveness are larger than for these other factors. Assigning an informative prior centered on zero shrinks the posterior toward zero, and the resulting posterior probability that θ>0 moves to a more plausible value in the range of 60%, corresponding to the idea that the result is suggestive but not close to convincing.
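The shrinkage can be sketched with the standard normal-normal updating formula. The inputs here are assumptions chosen for illustration, not the numbers from the original analysis: an estimate of 8 percentage points (the raw 56% vs. 48% comparison), a standard error backed out from a two-sided p-value of 0.2, and zero-centered normal priors with hypothetical standard deviations of 0.5 and 1 percentage point.

```python
from statistics import NormalDist

def posterior_prob_positive(estimate, se, prior_sd):
    """P(theta > 0) for a normal likelihood and a normal(0, prior_sd) prior."""
    data_precision = 1 / se ** 2
    prior_precision = 1 / prior_sd ** 2
    post_var = 1 / (data_precision + prior_precision)
    post_mean = post_var * data_precision * estimate  # prior mean is zero
    return 1 - NormalDist(post_mean, post_var ** 0.5).cdf(0)

# Illustrative inputs (assumptions, see the paragraph above).
estimate = 8.0                             # percentage points
se = estimate / NormalDist().inv_cdf(0.9)  # implied by two-sided p = 0.2

for prior_sd in (0.5, 1.0, 1e6):           # 1e6 stands in for a flat prior
    prob = posterior_prob_positive(estimate, se, prior_sd)
    print(f"prior sd {prior_sd:g}: P(theta > 0) = {prob:.2f}")
```

With a very wide prior the calculation returns the 90% figure; with priors on the scale of known sex-ratio effects it falls to somewhere between 50% and 60%, with the exact value depending on the assumed prior scale and standard error.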

Third, consider what would happen if we routinely interpreted one-sided p-values as posterior probabilities. In that case, an experimental result that is 1 standard error from zero (that is, exactly what one might expect from chance alone) would imply an 83% posterior probability that the true effect in the population has the same direction as the observed pattern in the data at hand. It does not make sense to me to claim 83% certainty (5 to 1 odds) based on data that not only could occur by chance alone but in fact represent an expected level of discrepancy. This system-level analysis accords with my criticism of the flat prior: as Greenland and Poole note in their article, the effects being studied in epidemiology typically range from -1 to 1 on the logit scale, hence analyses assuming broader priors will systematically overstate the probabilities of very large effects and will overstate the probability that an estimate from a small sample will agree in sign with the corresponding population quantity.
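To make the sign-agreement point concrete, the sketch below compares the flat-prior probability for an estimate 1 standard error from zero with what a bounded prior would give. The prior used here is an assumption: a normal distribution with standard deviation 0.5, so that roughly 95% of effects fall between -1 and 1 on the logit scale. The standard errors are hypothetical study sizes.

```python
from statistics import NormalDist

Phi = NormalDist().cdf

def sign_prob(z, se, prior_sd=None):
    """Posterior probability that the true effect has the observed sign,
    for an estimate z standard errors from zero.

    prior_sd=None means a flat prior; otherwise a normal(0, prior_sd) prior.
    """
    if prior_sd is None:
        return Phi(z)
    shrink = prior_sd / (prior_sd ** 2 + se ** 2) ** 0.5
    return Phi(z * shrink)

flat = sign_prob(1.0, se=1.0)
print(f"flat prior: {flat:.3f}, odds {flat / (1 - flat):.1f} to 1")

# Hypothetical standard errors on the logit scale; larger means a smaller study.
for se in (0.25, 0.5, 1.0):
    print(f"se = {se}: {sign_prob(1.0, se, prior_sd=0.5):.3f}")
```

The exact normal value for 1 standard error is about 84%, close to the 83% quoted above. Under the assumed bounded prior the probability drops as the standard error grows, which is the sense in which a flat prior overstates sign agreement for small samples.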

## How I have changed

Like many Bayesians, I have often represented classical confidence intervals as posterior probability intervals and interpreted one-sided p-values as the posterior probability of a positive effect. These are valid conditional on the assumed noninformative prior but typically do not make sense as unconditional probability statements. As Sander Greenland has discussed in much of his work over the years, epidemiologists and applied scientists in general have knowledge of the sizes of plausible effects and biases. . . .

The default conclusion from a noninformative prior analysis will almost invariably put too much probability on extreme values. A vague prior distribution assigns a lot of its probability on values that are never going to be plausible, and this messes up the posterior probabilities more than we tend to expect, something that we probably don’t think about enough in our routine applications of standard statistical methods.

## 49 thoughts on “P-values and statistical practice”

1. > The default conclusion from a noninformative prior analysis will almost invariably put too much probability on extreme values

+1 The whole point of a prior is to regularize – ie remove bad/pathological solutions from consideration. This is actually a conservative procedure.

• It’s too easy to overfit with complex models but it’s also too easy to underfit with simple models. NHST tends to fall under the latter category, as it’s often used for comparing strawman (simple) models to noisy (complex) data. Priors are one way to attempt to redress this imbalance.

• The frequentist fix is to think more carefully about the discrepancies of interest as null hypotheses, which effectively amounts to the same thing. Both regularize by throwing away solutions from unrealistic models that arise from choosing too big a class to begin with.

• So: if you don’t like to do tests, you need a good class (use a prior); if you don’t like your class, don’t hold them to a high standard (relax your passing mark).

Train of thought fin.

2. There has been some recent noise about doing more replication studies, and some recent noise about the historical abuse of p-values; I’ve not seen it noted that a replication study actually fits the whole classical inference framework better than most “original research” papers. Not to say that p-values elsewhere are useless, nor that they can’t be misinterpreted in the context of a replication study, but if you’re looking for a context in which the p-value-as-single-summary-statistic is reasonably well-motivated, that’s where it is.

In other contexts, a low p-value, especially where there aren’t multiple tests issues but where the result is ex ante unexpected, usually makes me think “you’ve justified another research grant to study this further”. This leaves problems of publication selection against “negative results”.

I’ve said for years that I’d like to start a Journal of Replications and Negative Results, but I’m still not in a professional place to do that. Feel free to steal the idea.

3. This very readable paper might be helpful for those to whom it’s not clear how wonky priors mess up the posterior probabilities more than we tend to expect. Hidden Dangers of Specifying Noninformative Priors http://www.tandfonline.com/doi/abs/10.1080/00031305.2012.695938

I think attitudes have changed a lot since for instance I wrote Two cheers for Bayes 20 years ago http://www.sciencedirect.com/science/article/pii/S0197245696907539 .

But somehow I think Kadane’s response would be the same ;-)

For instance “I [Kadane] agree with the author [me] that “when a posterior is presented, I believe it should be
clearly and primarily stressed as being a ‘function’ of the prior probabilities and not
the probability ‘of treatment effects’.” So, I think, do most Bayesians [in 1996????].”

But the “something that we probably don’t think about enough in our routine applications of standard statistical methods” did not apply to his strict subjective approach where priors should not be checked nor questioned but simply noted?

4. “What is a p-value in practice?”

A function of the data which necessarily reflects some of the information in the data. How much information?

From that replication study: overall “Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results;”, yet for very low p-values they got “Twenty of the 32 [i.e., sixty-three percent] original studies with a P value of less than 0.001 could be replicated”.

In other words, p-values reflect practically none of the information in the data. The real question though is “what price do we pay for ignoring p-values completely?” The answer is “none”.

• P.S. to the first knucklehead who says those aren’t “real” p-values, they’re “nominal” or something equivalent: I, like everyone else, have only ever seen “nominal” p-values, since they are the only ones calculable.

You have to decide what you believe. You can believe either (A) p-values are a function of the data with an extremely weak correlation to the truth (as the evidence of all kinds overwhelmingly suggests), or you can believe (B) for every calculated p-value there is a mystical and magical “real” p-value that’s nearly perfectly correlated with the truth.

• I don’t see how you can say that the nominal p value contains “practically none of the information in the data”. Take a two-sample t-test. If you know the sample sizes and the p-value, then you can compute: t, d, confidence interval of d, log likelihood ratio, difference in AIC, difference in BIC, and the Bayes Factor (for a prior based on standardized effect sizes), among others. I set up a web site to do the transformations:

http://psych.purdue.edu/~gfrancis/EquivalentStatistics/

In some situations that are common to social science, the p-value contains just as much information about the data set as many alternative statistics. The differences between these statistical inferences are not about what is in the data set, but how it is interpreted. Thus, scientists have to figure out what they are interested in and then use the statistic appropriate for that interest. I do agree, though, that there are situations where none of the above statistics reflect what scientists are interested in.
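A rough sketch of this kind of back-calculation, using only the p-value and the two sample sizes. It is not the method used by the linked site: it swaps in a large-sample normal approximation for the t distribution and a simplified standard error for d, and the function name is made up. Note that the p-value alone gives the magnitude of the effect, not its sign.

```python
from statistics import NormalDist

def stats_from_p(two_sided_p, n1, n2):
    """Approximate test statistic, standardized effect size d, and a rough
    95% interval for d, recovered from a two-sample p-value.

    Assumes a large-sample normal approximation to the t-test.
    """
    norm = NormalDist()
    z = norm.inv_cdf(1 - two_sided_p / 2)   # |test statistic|
    scale = (1 / n1 + 1 / n2) ** 0.5        # approximate standard error of d
    d = z * scale                           # |standardized mean difference|
    half_width = norm.inv_cdf(0.975) * scale
    return {"z": z, "d": d, "ci_95": (d - half_width, d + half_width)}

result = stats_from_p(0.05, n1=50, n2=50)
print(result)
```

With p=0.05 the lower end of the interval sits at zero, which is just the equivalence between the p-value and the confidence interval restated.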

• the p value is a function both of the data, and of the choice of test and null hypothesis. You have specified both a test (t-test) and null hypothesis (no difference in means).

If you grep through the text of a random journal article for the pattern /p ?[] ?[0-9.]+[0-9]+/

you find out practically nothing.

• ack. blog hates me because I do math not html coding

/p ?[<>] ?[0-9.]+[0-9]+/
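For anyone who wants to try the search, here is a minimal sketch of the corrected pattern in Python's `re` module. The sample sentence is invented for illustration.

```python
import re

# The corrected pattern from the comment above, without the slash delimiters.
P_VALUE_PATTERN = re.compile(r"p ?[<>] ?[0-9.]+[0-9]+")

# Hypothetical sentence standing in for journal text.
sample = "The effect was significant (p < 0.05) in study 1 but not in study 2 (p > .10)."

matches = P_VALUE_PATTERN.findall(sample)
print(matches)
```

The pattern only captures the inequality and its threshold, so the output carries no effect size, no test, and no null hypothesis.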

• You might argue “well of course” but if they just reported average effect sizes without any pretense of statistics you’d find out a lot more.

• “the p value is a function both of the data, and of the choice of test and null hypothesis.”
Yes, and the “null hypothesis” is the entire null model, not just the null parameter that you’re interested in. I.e., in the two sample t-test, the null model specifies not only d=0, but normal errors, no omitted variables, etc. A small p-value indicates poor fit of data to the model, but not necessarily to the part of the model that encodes the scientific null hypothesis (such as “no mean difference”). I guess this is obvious, and just means that you must assume your model is “correct” in order to make inferences, but in typical practice p-values are related to the scientific null hypothesis without much regard for model checking.

• I’m sorry, but “no omitted variables” is as necessary for effect sizes or any other quantity of interest I can imagine, as it is for p-values

• Of course. No need to be sorry :). My point was just that the “null hypothesis” is always embedded in a larger model, and it’s the entire model that determines the test statistic and p-value and effect size and whatever; rejecting the model does not necessarily reject the null hypothesis, which is generally just one parameter in the model.

• You are close to the ultimate realization. Just make sure the “null” hypothesis corresponds to the actual scientific hypothesis (ie what is predicted by the theory). Even then, you can always blame bad data or auxiliary assumptions.

It is impossible for science to prove something true, it is impossible for science to prove something false. There will always be other explanations, you can only rule out the conjunction of your theory plus a number of assumptions, the most mundane being “measuring device not malfunctioning”.

• Tom M
Yes, and that is why it is so obvious that if it’s a non-randomized study there will be confounding (when addressing causal questions) and the distribution of the p-values will be very non-uniform given the (unconfounded) effect is null.

• Keith wrote: “when addressing causal questions”

I’ve begun thinking that causality is a problematic concept, and I’m not even sure there is a need for it. If B very often follows A and rarely occurs in the absence of a preceding A, we can use detection of A to accurately predict B. Is it actually of any relevance whether A *causes* B?

• Anoneuoid,

Wow. Um, causality is important if you want to take effective action… right?

• Cory wrote: “Wow. Um, causality is important if you want to take effective action… right?”

I see what you mean, but on reflection I don’t think so. What difference is there between knowing that flipping the switch is highly correlated with lights turning on/off under a wide range of conditions vs. causing it? In practice I don’t think the distinction ever matters.

• Depends whose practice you’re talking about.

Suppose we’re investigating the effectiveness of a flu vaccine. Consider a group of study subjects and two study designs, a prospective study (i.e., study subjects decide for themselves whether to get the vaccine) and a randomized study (for simplicity we’ll imagine that compliance with the assigned treatment status is 100%). Imagine the two counterfactual histories that would result for the two different studies; would you expect the randomized trial would show more, less, or the same effectiveness in preventing flu as the prospective study?

• @Anoneuoid

The lightbulb (non LED) turning hot is also strongly correlated with the light turning on under a wide variety of conditions.

The causality problem arises if someone tries turning on the bulb by heating it with an air dryer.

• @Anoneuoid

“What difference is there between knowing that flipping the switch is highly correlated with lights turning on/off under a wide range of conditions vs. causing it?”

Because if you knew that flipping the switch doesn’t actually *cause* the light to go on, then you would know not to bother pressing it if you want light?

• To all,

A better way of putting my thoughts is that the term “cause” is a convenient shorthand for a certain type of relationship. It is used when A precedes B, and A is also highly correlated with B. Correlations are measurable, but I don’t see how you can measure, or even observe, cause.

• Anoneuoid, time-ordering and correlation aren’t enough: you can easily have a cause with two effects, one of which occurs before the other, but with no direct causal relation between them. The two effects will be highly correlated, and yet intervening on the earlier effect will fail to, um, affect the later effect.

For example, a person’s general conscientiousness about health may prompt them to seek out a flu vaccine during flu season and also prompt them to wash their hands often. If so, a prospective study with a naïve estimate for the effectiveness of the vaccine would tend to over-estimate its effectiveness relative to a randomized study.
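The flu example can be worked through with expected values. Every number below is hypothetical: two equally sized groups, a vaccine that halves flu risk for everyone, and conscientious people who both have lower baseline risk (say, from hand washing) and are more likely to choose the vaccine.

```python
# All numbers are hypothetical, chosen only to illustrate confounding.
TRUE_RISK_RATIO = 0.5  # the vaccine halves flu risk for everyone

# group: (population share, baseline flu risk, probability of choosing the vaccine)
groups = {
    "conscientious": (0.5, 0.10, 0.8),
    "less conscientious": (0.5, 0.20, 0.2),
}

def effectiveness(uptake_by_group):
    """1 - risk(vaccinated) / risk(unvaccinated), given vaccine uptake per group."""
    vax_people = vax_cases = unvax_people = unvax_cases = 0.0
    for name, (share, baseline, _) in groups.items():
        uptake = uptake_by_group[name]
        vax_people += share * uptake
        vax_cases += share * uptake * baseline * TRUE_RISK_RATIO
        unvax_people += share * (1 - uptake)
        unvax_cases += share * (1 - uptake) * baseline
    return 1 - (vax_cases / vax_people) / (unvax_cases / unvax_people)

# Prospective study: people choose for themselves.
prospective = effectiveness({name: g[2] for name, g in groups.items()})
# Randomized study: a coin flip assigns the vaccine, with full compliance.
randomized = effectiveness({name: 0.5 for name in groups})

print(f"prospective: {prospective:.2f}, randomized: {randomized:.2f}")
```

In this setup the randomized comparison recovers the true 50% effectiveness, while the prospective comparison reports a larger number because the vaccinated group is mostly people who were at lower risk anyway.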

• “the p value is a function both of the data, and of the choice of test and null hypothesis. You have specified both a test (t-test) and null hypothesis (no difference in means).”

Yes, the same specification is required to compute (or understand) the other statistics. It is still the case that if the p-value contains none of the information in the data, then neither do the other statistics.

Your grep search is about the decision (statistical significance) based on a p-value. I agree that such a decision is not as informative as the p-value itself or the other statistics (but decisions based on other statistics might not be as informative either).

I am not arguing that decisions based on p-values are generally useful (although as Andrew notes, they can be in some situations). I think the conditions necessary for a p-value to be useful are difficult to satisfy; but in terms of information in the data set (at least for a two-sample t-test) it’s not true that a p-value contains no information.

5. I would like to compliment you on actually including relevant parts of an article in your discussion rather than referencing it by section in a paper or book and requiring the reader to invest the effort to retrieve the information. We’ve had this discussion before (and there were additional comments both ways) but the “cartoon” example including all relevant points is often the best way to get one’s point across (and yes, I own a copy of BDA3 so I’m not trying to be cheap), and more to the point, give it the widest dissemination.

6. > At first this may sound paradoxical, that a noninformative or weakly informative prior yields posteriors that are too forceful—and let me deepen the paradox by stating that a stronger, more informative prior will tend to yield weaker, more plausible posterior statements.

If “plausible” means “consistent with prior knowledge” it doesn’t sound very paradoxical. If not, what does “plausible” mean?

• Carlos:

It is paradoxical because we have been trained to think of noninformative priors as safe, and we have been trained to be suspicious of informative priors, and we have been trained to think of Bayesian inference as scarily dependent on assumptions.

• It’s obviously not the issue in this posting, but I’m curious about how, where, to what extent people and by whom are “trained” to think this way. Where I am, subjective Bayes certainly isn’t extinct and Bayesians (and others anyway) tend to be sceptical about noninformative priors. At the same time I do realise (as you know) that many Bayesians when publishing are at pains to avoid any impression of subjectivity. If anybody has something to say about a “cultural mapping” of Bayesians, please do!

• The beginning was meant to say: “It’s obviously not the issue in this posting, but I’m curious about how, where, to what extent and by whom people are “trained” to think this way.”

7. I understand that this language might help to present the argument to the intended audience, but it seems to me that referring to something obvious as paradoxical helps perpetuate the misconceptions. It’s not that you’re talking about something surprising and unexpected. Of course the results won’t be biased towards our preferred, more plausible, results if we don’t include our assumptions somewhere in the model (and what better place to include that information than the prior).

• > not that you’re talking about something surprising and unexpected
It is (was) surprising even to many practitioners of Bayesian analysis – Peter Thall once in public admitted that he hit his head against the wall for a month trying to figure what was going wrong in one of his sequential trials (he had assumed a non-informative prior on the log-odds scale that was terribly informative on the proportions scale that mattered and had very small data).

Just because something is implied by a model does not mean people are aware of it or find it intuitive.

So something implied cannot be paradoxical but it may not be obvious.

(I always thought people should be forced to plot the implied marginal priors but I now realize most would rather be wrong than stoop to such a simple way of detecting problems)
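A quick way to see the implied prior without plotting is to compute how much prior mass lands near 0 and 1 on the proportion scale. The prior below is hypothetical (the comment above does not say which prior was actually used): a zero-centered normal on the log-odds with a few candidate standard deviations.

```python
from math import log
from statistics import NormalDist

def logit(p):
    """Log-odds of a proportion p."""
    return log(p / (1 - p))

def prior_mass_outside(lo, hi, prior_sd):
    """Prior probability that a proportion falls outside (lo, hi)
    when its log-odds have a normal(0, prior_sd) prior."""
    prior = NormalDist(0, prior_sd)
    return prior.cdf(logit(lo)) + (1 - prior.cdf(logit(hi)))

# Hypothetical prior scales on the log-odds.
for prior_sd in (1.5, 10, 100):
    mass = prior_mass_outside(0.01, 0.99, prior_sd)
    print(f"sd {prior_sd:g} on log-odds: {mass:.3f} of prior mass outside (0.01, 0.99)")
```

The wider and supposedly vaguer the prior on the log-odds, the more of its mass piles up on proportions that are essentially 0 or 1, which is a strong statement on the scale that matters.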

• I’d say that’s a different kind of surprise. You may be surprised to learn that decaffeinated coffee does contain caffeine (maybe 5-10% relative to regular coffee). You should not be surprised if decaffeinated coffee fails to keep you awake.

I agree there are hidden dangers in “non-informative” priors. Some thought about the model is required to choose a non-informative prior (for example, we could be using lengths or areas, differences or ratios). In some cases, invariance arguments will help you to choose the right one. In other cases, you will be happy to get one which is reasonably indifferent within some reasonable region. And if you’re doing reparametrizations I agree that the transformations of the probability densities are not obvious and it’s worth checking what the prior actually looks like.

However, being surprised because a “non-informative” prior is indeed non-informative (it won’t favour the regions that we find more plausible a priori) seems unwarranted.

• What was the purported non-informative prior on the log-odds scale? The Haldane prior is flat on the log-odds and shouldn’t give a too-crazy posterior for the proportion, provided the posterior manages to be proper, i.e., if there’s at least one observation of each possible outcome.

8. “One big practical problem with p-values is that they cannot easily be compared” Tell that to the statistical geneticists.

9. I just read your original manuscript and saw that you cited Gosset’s t-test paper as “The probable error of the man”. Pun intended?

• Philip:

That error was introduced by the copy editors. But, sure, maybe it was on purpose, I have no idea.

10. @Anoneuoid and “cause” question above (started over due to nesting limit):

Yes, if A causes B then A precedes B and is correlated with B (at least, in “distance correlation” thanks to Corey for that reference a while back).

But it is definitely possible to observe that although in many cases A precedes B and is correlated with it, it is not causal.

For example: Anne walks into a room, reaches out her hand, and the lights come on. Walking into the room precedes the lights coming on, and is highly correlated with it, but if the lights are controlled by a switch and not a motion detector, then walking into the room does not cause the lights to turn on. Another person who walks into the room and waves hands (assuming a motion sensor) will find that sure enough, it’s not sufficient to cause the lights to come on, you have to reach out and flip the switch.

The vast majority of good experimental science is about ruling out possible causes until the one that can’t be ruled out anymore is assumed true.

What you CAN do is create situations which can be explained by other methods only in the most baroque of ways (the motion sensor that detects the motion that causes the light to turn on is only sensitive to motion in the vicinity of the switch itself, so it’s really the hypothesized motion sensor that causes the lights to come on, but you need to move the switch or something nearby the switch for the sensor to detect it…)

In the context of, say, medicine, causality is observable and critical: consider the difference between “cholera is spread through the air” vs “cholera is spread through water”:

https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak

• Flipping the switch is not sufficient either (eg power is out), that is a distal cause, just less distal than walking into the room. Why not say electricity flowing through the bulb causes the light to turn on? It quickly becomes a game of every preceding event is a cause of every later event. What we can observe is the correlation and time elapsed between various events. These observables are deduced from an a priori idea of “cause” that we feel must exist.

Perhaps this is a variation on the old modus tollens vs affirming the consequent issue. You cannot empirically establish that something is a cause, but it can be ruled out (or at least say “Either it is not a cause or some assumption is wrong”).

• Anon:

I don’t have the energy to explain this now, but suffice it to say you are reconstructing a couple of centuries of thought on this topic. To help you along, I recommend you take a look at the Imbens and Rubin book or at this short article I wrote with Imbens.

11. “Just to be clear: we are not saying that we would have performed data dredging, fishing for significance; rather, we accept that sample size dictates how much we can learn with confidence; when data are weaker, it can be possible to find reliable patterns by averaging.”

Which brings up the point that the primary purpose of statistics is to explore data. You give a good example here of how and why that would be done. The problems with p-values arise when people take it to mean that the job of statistics is to prove (or disprove).