John Carlin and I write:
It is well known that even experienced scientists routinely misinterpret p-values in all sorts of ways, including confusion of statistical and practical significance, treating non-rejection as acceptance of the null hypothesis, and interpreting the p-value as some sort of replication probability or as the posterior probability that the null hypothesis is true.
A common conceptual error is that researchers take the rejection of a straw-man null as evidence in favor of their preferred alternative. A standard mode of operation goes like this: p < 0.05 is taken as strong evidence against the null hypothesis, p > 0.15 is taken as evidence in favor of the null, and p near 0.10 is taken either as weak evidence for an effect or as evidence of a weak effect.
Unfortunately, none of those inferences is generally appropriate: a low p-value is not necessarily strong evidence against the null, a high p-value does not necessarily favor the null (the strength and even the direction of the evidence depends on the alternative hypotheses), and p-values are in general not measures of the size of any underlying effect. But these errors persist, reflecting (a) inherent difficulties in the mathematics and logic of p-values, and (b) the desire of researchers to draw strong conclusions from their data.
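To see why a low p-value need not be strong evidence against the null, here is a small simulation sketch (the effect size and standard error are hypothetical numbers chosen for illustration, not from the article): when the true effect is small relative to the noise, estimates that reach p &lt; 0.05 frequently have the wrong sign and are greatly exaggerated in magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: small true effect, noisy measurement.
true_effect = 0.1   # assumed true effect size
se = 1.0            # standard error of each estimate
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
z = estimates / se
significant = np.abs(z) > 1.96               # "p < 0.05", two-sided

sig_est = estimates[significant]
type_s = np.mean(sig_est < 0)                # significant but wrong sign
type_m = np.mean(np.abs(sig_est)) / true_effect  # exaggeration ratio

print(f"share significant: {significant.mean():.3f}")
print(f"wrong sign among significant: {type_s:.2f}")
print(f"exaggeration ratio: {type_m:.1f}")
```

In this setup roughly a third of the "significant" results point in the wrong direction, and on average the significant estimates overstate the true effect by an order of magnitude.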
Continued evidence of these and other misconceptions and their dire consequences for science . . . motivated the American Statistical Association to release a Statement on Statistical Significance and p-values in an attempt to highlight the magnitude and importance of problems with current standard practice . . .
At this point it would be natural for statisticians to think that this is a problem of education and communication. If we could just add a few more paragraphs to the relevant sections of our textbooks, and persuade applied practitioners to consult more with statisticians, then all would be well, or so goes this logic.
Nope. It won’t be so easy.
We consider some natural solutions to the p-value communication problem that won’t, on their own, work:
Listen to the statisticians, or clarity in exposition
. . . it’s not that we’re teaching the right thing poorly; unfortunately, we’ve been teaching the wrong thing all too well. . . . The statistics profession has been spending decades selling people on the idea of statistics as a tool for extracting signal from noise, and our journals and textbooks are full of triumphant examples of learning through statistical significance; so it’s not clear why we as a profession should be trusted going forward, at least not until we take some responsibility for the mess we’ve helped to create.
Confidence intervals instead of hypothesis tests
A standard use of a confidence interval is to check whether it excludes zero. In this case it’s a hypothesis test under another name. Another use is to consider the interval as a statement about uncertainty in a parameter estimate. But this can give nonsensical answers, not just in weird trick problems but for real applications. . . . So, although confidence intervals contain some information beyond that in p-values, they do not resolve the larger problems that arise from attempting to get near-certainty out of noisy estimates.
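The point that checking a confidence interval against zero is just a hypothesis test in disguise can be made concrete. A minimal sketch, using the standard normal approximation (the particular estimates and standard errors are made up for illustration):

```python
import math

def two_sided_p(est, se):
    """Normal-theory two-sided p-value for H0: effect = 0."""
    z = abs(est / se)
    return math.erfc(z / math.sqrt(2))

def ci_excludes_zero(est, se, z_crit=1.96):
    """Does the 95% interval est +/- 1.96*se exclude zero?"""
    lo, hi = est - z_crit * se, est + z_crit * se
    return lo > 0 or hi < 0

# The two criteria give the same answer (away from the boundary):
for est, se in [(0.5, 0.2), (0.3, 0.2), (1.0, 0.6), (-0.4, 0.15)]:
    print(est, se, two_sided_p(est, se) < 0.05, ci_excludes_zero(est, se))
```

Whenever the interval excludes zero, the two-sided p-value falls below 0.05, and vice versa; nothing about the decision has changed, only the packaging.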
Bayesian interpretation of one-sided p-values
. . . The problem comes with the uniform prior distribution. We tend to be most concerned with overinterpretation of statistical significance in problems where underlying effects are small and variation is high . . . We do not consider it reasonable in general to interpret a z-statistic of 1.96 as implying a 97.5% chance that the corresponding estimate is in the right direction.
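The role of the uniform prior here can be made explicit with a quick calculation. Under a flat prior, a z-statistic of 1.96 indeed gives a 97.5% posterior probability that the effect is positive; under an informative prior concentrated near zero (the prior scale of 0.5 below is a hypothetical choice, encoding the belief that underlying effects tend to be small), that probability drops considerably:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

y, sigma = 1.96, 1.0        # estimate with z = 1.96

# Flat prior: posterior is N(y, sigma^2), so
p_flat = phi(y / sigma)     # = 0.975

# Informative prior N(0, tau^2), tau = 0.5 (hypothetical):
tau = 0.5
post_mean = y * tau**2 / (tau**2 + sigma**2)
post_sd = math.sqrt(tau**2 * sigma**2 / (tau**2 + sigma**2))
p_informative = phi(post_mean / post_sd)

print(f"P(effect > 0), flat prior:        {p_flat:.3f}")
print(f"P(effect > 0), informative prior: {p_informative:.3f}")
```

With this prior the probability of the estimate being in the right direction is closer to 80% than 97.5%; the one-sided p-value calibration hinges entirely on the flat prior.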
Focusing on “practical significance” instead of “statistical significance”
. . . in a huge study, comparisons can be statistically significant without having any practical importance. Or, as we would prefer to put it, effects can vary: a +0.3 for one group in one scenario might become −0.2 for a different group in a different situation. Tiny effects are not only possibly trivial, they can also be unstable, so that for future purposes an estimate of 0.3±0.1 might not even be so likely to remain positive. . . . That said, the distinction between practical and statistical significance does not resolve the difficulties with p-values. The problem is not so much with large samples and tiny but precisely-measured effects but rather with the opposite: large effect-size estimates that are hopelessly contaminated with noise. . . . This problem is central to the recent replication crisis in science . . . but is not at all touched by concerns of practical significance.
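The claim that an estimate of 0.3±0.1 "might not even be so likely to remain positive" can be sketched numerically. Suppose, hypothetically, that the effect varies across scenarios with a between-scenario standard deviation of 0.4 (a made-up number for illustration); then, treating the estimate as the mean and ignoring shrinkage, the effect in a new scenario is roughly normal with both sources of uncertainty combined:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

est, se = 0.3, 0.1    # estimate from one group in one scenario
omega = 0.4           # hypothetical between-scenario sd of the effect

# Effect in a new scenario: roughly N(est, se^2 + omega^2)
p_positive = phi(est / math.sqrt(se**2 + omega**2))
print(f"P(effect positive in a new scenario): {p_positive:.2f}")
```

Under these assumed numbers the chance of the effect staying positive in a new setting is only about three in four, despite the apparently clean 0.3±0.1 estimate.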
Another direction for reform is to preserve the idea of hypothesis testing but to abandon tail-area probabilities (p-values) and instead summarize inference by the posterior probabilities of the null and alternative models . . . The difficulty of this approach is that the marginal likelihoods of the separate models (and thus the Bayes factor and the corresponding posterior probabilities) depend crucially on aspects of the prior distribution that are typically assigned in a completely arbitrary manner by users. . . . Beyond this technical criticism . . . the use of Bayes factors for hypothesis testing is also subject to many of the problems of p-values when used for that same purpose . . .
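The prior-sensitivity of Bayes factors is easy to demonstrate with the simplest normal example (the numbers below are illustrative, not from the article): test a point null theta = 0 against an alternative theta ~ N(0, tau^2), observing a single "just significant" estimate. The Bayes factor swings from favoring the alternative to favoring the null purely as a function of the arbitrary prior scale tau:

```python
import math

def bf01(y, tau, sigma=1.0):
    """Bayes factor for H0: theta = 0 vs H1: theta ~ N(0, tau^2),
    given one observation y ~ N(theta, sigma^2).
    BF01 = N(y | 0, sigma^2) / N(y | 0, sigma^2 + tau^2)."""
    v0, v1 = sigma**2, sigma**2 + tau**2
    return math.sqrt(v1 / v0) * math.exp(-0.5 * y**2 * (1/v0 - 1/v1))

y = 1.96  # a "just significant" estimate with sigma = 1
for tau in (0.5, 2.0, 10.0):
    print(f"tau = {tau:5}: BF(null vs alt) = {bf01(y, tau):.2f}")
```

With a tight prior the same data count modestly against the null, while with a very diffuse prior they count in its favor (the Jeffreys–Lindley phenomenon), even though nothing about the data has changed.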
What to do instead? We give some suggestions:

Our own preferred replacement for hypothesis testing and p-values is model expansion and Bayesian inference, addressing concerns of multiple comparisons using hierarchical modeling . . . or through non-Bayesian regularization techniques such as lasso . . . The general idea is to use Bayesian or regularized inference as a replacement of hypothesis tests but . . . through estimation of continuous parameters rather than by trying to assess the probability of a point null hypothesis. And . . . informative priors can be crucial in getting this to work.
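A toy sketch of the hierarchical-modeling idea (the numbers of groups and the variance components below are hypothetical, and the pooling factor is the posterior mean under the simplest normal hierarchical model with known variances): instead of testing each of many comparisons, estimate all the effects at once and shrink the noisy estimates toward each other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: many group effects, mostly small, measured noisily.
J, tau, sigma = 50, 0.2, 0.5
theta = rng.normal(0, tau, J)      # true group effects
y = rng.normal(theta, sigma)       # noisy estimate for each group

# Partial pooling: shrink each estimate toward 0 by tau^2/(tau^2+sigma^2)
# (posterior mean under the normal hierarchical model, variances known).
shrink = tau**2 / (tau**2 + sigma**2)
theta_hat = shrink * y

rmse_raw = np.sqrt(np.mean((y - theta) ** 2))
rmse_pooled = np.sqrt(np.mean((theta_hat - theta) ** 2))
print(f"RMSE of raw estimates:       {rmse_raw:.3f}")
print(f"RMSE of pooled estimates:    {rmse_pooled:.3f}")
```

The pooled estimates are much closer to the truth on average, and no comparison needs to be singled out as "significant"; in real problems tau and sigma would themselves be estimated, e.g. by fitting the hierarchical model in Stan.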
It’s not all about the Bayes:
Indeed, in many contexts it is the prior information rather than the Bayesian machinery that is the most important. Non-Bayesian methods can also incorporate prior information in the form of postulated effect sizes in post-data design calculations . . . In short, we’d prefer to avoid hypothesis testing entirely and just perform inference using larger, more informative models.
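Such a post-data design calculation can be written in closed form under the normal approximation. A hedged sketch (the postulated effect size and standard error are illustrative choices, not prescriptions): given a study's standard error and an externally motivated guess at the true effect, compute the power and the probability that a statistically significant result has the wrong sign.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def design_calc(postulated_effect, se, z_crit=1.96):
    """Power and sign-error rate for a postulated true effect,
    given the study's standard error (normal approximation)."""
    lam = postulated_effect / se
    p_hi = 1 - phi(z_crit - lam)    # significant with the correct sign
    p_lo = phi(-z_crit - lam)       # significant with the wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power           # P(wrong sign | significant)
    return power, type_s

power, type_s = design_calc(postulated_effect=0.1, se=1.0)
print(f"power = {power:.3f}, P(wrong sign | significant) = {type_s:.2f}")
```

With a small postulated effect and a large standard error, the study has power barely above the 5% significance level, and a significant result is nearly as likely to have the wrong sign as the right one; this is the kind of calculation that uses prior information without any Bayesian machinery.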
But, we continue:
To stop there, though, would be to deny one of the central goals of statistical science. . . . there is a demand for hypothesis testing. We can shout till our throats are sore that rejection of the null should not imply the acceptance of the alternative, but acceptance of the alternative is what many people want to hear. . . . we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation . . . we recommend saying No to binary conclusions in our collaboration and consulting projects: resist giving clean answers when that is not warranted by the data. Instead, do the work to present statistical conclusions with uncertainty rather than as dichotomies. Also, remember that most effects can’t be zero (at least in social science and public health), and that an “effect” is usually a mean in a population (or something similar such as a regression coefficient)—a fact that seems to be lost from consciousness when researchers slip into binary statements about there being “an effect” or “no effect” as if they are writing about constants of nature. Again, it will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.
This article will be published in the Journal of the American Statistical Association, as a comment on the article, “Statistical significance and the dichotomization of evidence,” by Blakeley McShane and David Gal.
P.S. Above cat picture is from Diana Senechal. If anyone wants to send me non-copyrighted cat pictures that would be appropriate for posting, feel free to do so.