Abandoning statistical significance is both sensible and practical

Valentin Amrhein​, Sander Greenland, Blakeley McShane, and I write:

Dr Ioannidis writes against our proposals [here and here] to abandon statistical significance in scientific reasoning and publication, as endorsed in the editorial of a recent special issue of an American Statistical Association journal devoted to moving to a “post p<0.05 world.” We appreciate that he echoes our calls for “embracing uncertainty, avoiding hyped claims…and recognizing ‘statistical significance’ is often poorly understood.” We also welcome his agreement that the “interpretation of any result is far more complicated than just significance testing” and that “clinical, monetary, and other considerations may often have more importance than statistical findings.”

Nonetheless, we disagree that a statistical significance-based “filtering process is useful to avoid drowning in noise” in science and instead view such filtering as harmful. First, the implicit rule to not publish nonsignificant results biases the literature with overestimated effect sizes and encourages “hacking” to get significance. Second, nonsignificant results are often wrongly treated as zero. Third, significant results are often wrongly treated as truth rather than as the noisy estimates they are, thereby creating unrealistic expectations of replicability. Fourth, filtering on statistical significance provides no guarantee against noise. Instead, it amplifies noise because the quantity on which the filtering is based (the p-value) is itself extremely noisy and is made more so by dichotomizing it.

We also disagree that abandoning statistical significance will reduce science to “a state of statistical anarchy.” Indeed, the journal Epidemiology banned statistical significance in 1990 and is today recognized as a leader in the field.

Valid synthesis requires accounting for all relevant evidence—not just the subset that attained statistical significance. Thus, researchers should report more, not less, providing estimates and uncertainty statements for all quantities, justifying any exceptions, and considering ways the results are wrong. Publication criteria should be based on evaluating study design, data quality, and scientific content—not statistical significance.

Decisions are seldom necessary in scientific reporting. However, when they are required (as in clinical practice), they should be made based on the costs, benefits, and likelihoods of all possible outcomes, not via arbitrary cutoffs applied to statistical summaries such as p-values which capture little of this picture.

The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

I especially like our title and our last paragraph!

Let me also emphasize that we have a lot of positive advice on how researchers can design studies and collect and analyze data (see for example here, here, and here). “Abandon statistical significance” is not the main thing we have to say. We’re writing about statistical significance to do our best to clear up some points of confusion, but our ultimate message in most of our writing and practice is to offer positive alternatives.

P.S. Also to clarify: “Abandon statistical significance” does not mean “Abandon statistical methods.” I do think it’s generally a good idea to produce estimates accompanied by uncertainty statements. There’s lots and lots to be done.
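
To make the first and fourth points of the letter concrete, here is a minimal simulation sketch in Python. It is not taken from the letter or from any of the linked papers; the true effect (0.2), the standard error (0.15), and all function names are assumptions chosen purely for illustration. It repeats the same hypothetical study many times, then looks at how much the p-value bounces around and at what a significance filter does to the estimates that survive it.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, used for z-test p-values


def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate(true_effect=0.2, se=0.15, n_studies=20000, alpha=0.05, seed=1):
    """Repeat one hypothetical study many times.

    Each replication yields an estimate drawn from Normal(true_effect, se).
    All numbers here are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    p_values = sorted(two_sided_p(est / se) for est in estimates)
    # Keep only the replications that would pass a p < alpha filter.
    significant = [est for est in estimates if two_sided_p(est / se) < alpha]
    return {
        "share_significant": len(significant) / n_studies,
        "mean_estimate_all": mean(estimates),
        "mean_estimate_significant": mean(significant),
        # Average magnitude of the filtered estimates relative to the truth.
        "exaggeration_ratio": mean(abs(e) for e in significant) / true_effect,
        # Spread of p-values across identical replications.
        "p_value_10th_percentile": p_values[n_studies // 10],
        "p_value_90th_percentile": p_values[(9 * n_studies) // 10],
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumed values, only about a quarter of the replications reach p < 0.05, and the ones that do overstate the true effect by a factor of nearly two on average, even though the full set of estimates is unbiased. The p-values from these identical studies also range from well under 0.05 to well over 0.5, which is the sense in which the filtering quantity is itself noisy.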

42 Comments

  1. I have visited each of John Ioannidis’s YouTube presentations three or four times.

    What has stood out in John’s explanations of ‘vibration of effects’ and the ‘Janus phenomenon’ is that John, in particular, is very pessimistic about the prospect of more studies/trials, however well conducted. I would be pessimistic in light of both, given those explanations.

    What really puzzles me is why John thinks a “state of statistical anarchy” will ensue with the abandonment of statistical significance. It seems to me that John has implied and stated that chaos is in the making already. Why some audiences don’t acknowledge this is also curious. Maybe it will be valuable to spell out, in more detail, where the agreements and disagreements actually are.

    John may suggest that there is far more agreement between Sander Greenland specifically and himself. We can evaluate that claim.

  2. Jacob says:

    I just don’t get it sometimes. Were we ever engaging in dichotomous thinking? If I submit a study to PNAS and p < .05, I don't get automatic acceptance, right? Assuming I am right, we have always been bringing multiple criteria to the table when we evaluate research results. We look at statistical results partly with our own prior — p = .04 for a finding that seems like it must be logically true is more compelling than p = .04 for a result that is counterintuitive. An experiment is more compelling than a quasi-experiment, except when other aspects of the design make the latter more generalizable.

    Nobody would say that experiments are automatically valid, other designs automatically invalid, etc. So why impose such a threshold on statistical results?

    I like to just say here's my motivation, design, model, results. Make your own decision after I try to highlight the things that are important. I won't sell you my p = .07 finding as God's given truth, but I'd prefer to tell you what exactly it is and you can decide if we should utterly ignore the finding or consider it something tentative. And you might also see my p < .005 result and think the model is inappropriate or whatever. You mix the strength of the statistical result with everything else.

    • Andrew says:

      Jacob:

      We discuss your concern here. The short answer is that we are opposing the status quo, which is a lexicographic decision rule in which findings are first evaluated based on whether the p-value is statistically significant, and then decisions are made. Beyond this, a key concern is that in a research project there are typically lots of results, and the common procedure of dichotomizing them or trichotomizing them based on statistical significance is a way to add lots of noise to one’s results.

      • I do concur with some academics who draw attention to the question of what effects some methodologic, conceptual, or terminological changes will have on how science is conducted & improved: perhaps introducing different risks and outcomes. That is a prudent concern.

      • Mayo says:

        Andrew: I don’t see such a qualification in the most recent response to Ioannidis that you’ve joined in on (and perhaps strengthened). Did I miss it? But I don’t really get the point of Ioannidis battling Greenland et al on this. I would venture to say that he is far more negative on error statistical tests than Greenland. Most important, it’s too late. At the very least some of the more misleading sentences in the Wasserstein editorial can be modified.

        • Andrew says:

          Deborah:

          We were limited to something like 500 words so not everything could fit.

        • Hi Deborah,

          Which sentences in the Wasserstein editorial should be modified?

        • To add, as an editor, I might have held off on publishing the Comments simultaneously with the TAS19 Special Edition. I am now speaking as an ad/public relations person. Apologies. I have been immersing myself in marketing ideas. I know, I know, just craven. But if you can’t lick ’em then join ’em, hehe. Seriously, the conversation has been reduced to a ripple due to the acrimony on Twitter and other media. This has occurred on Facebook as well. The argument culture dominates many domains. Not sure that it leads to entropy change, the kind of change we need now.

    • Ben Prytherch says:

      You don’t get automatic acceptance for p < 0.05. There are plenty of meta-analyses showing a large drop off in the frequency of published p-values just above 0.05, as well as too large a frequency of p-values just below 0.05.

  3. Zad Chow says:

    Has JAMA accepted this response or is it still unknown whether they’re going to publish it or not?

  4. Anoneuoid says:

    It is so sad to see all the people saying stuff like “We rely on statistical significance in genetics/medicine/astronomy/etc, nothing else can achieve what it does for us”. A field can survive use of NHST, but saying it is relied upon is basically an admission that decades of work will need to be redone.

    What a mess that’s been created… I always return to this prophecy:

    “We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.”

    http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf

    • Martha (Smith) says:

      Interesting quote (and link).

      • Garnett says:

        I liked the paper as well. Writing styles have certainly changed a lot since 1957!

        On a side note, I’m always stricken by how clearly written many statistics papers are, at least in comparison to the mainstream scientific work that I read on a daily basis. I suspect it has to do with the importance of logic within the statistics discipline.

  5. Fritz Strack says:

    It is often a tradeoff between innovation and reliability. And it is the editors’ task to decide if a submission is worth being published. To arrive at such a decision and enable the readers to come up with their own judgments, the p-values must be reported in their full beauty.

  6. Peter Gerdes says:

    Why is the threshold of publish/not-publish in the absence of statistical significance expected to be any less distorting than the current system? It will still favor results that show stronger effects and disfavor insignificant results.

    Is the move from a sharp cutoff to a more flexible standard an advantage or does it make the problems posed by forking paths and file drawer effects harder to estimate?

  7. A ubiquitous problem in much writing is that it does not contain specific enough examples demonstrating where the uses of stat sig and p-values make specific sense, a view amplified by Raymond Hubbard even though several have offered definitional clarifications.

    ‘The ASA statement on p-values (Wasserstein and Lazar 2016; Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133), of course, had to be of a general nature. Subsequent publications on the topic of the appropriate and inappropriate uses/interpretations of p-values, whether from the ASA or elsewhere, must be specific; the more specific the better. They must amount to a list of Do’s and Don’ts concerning p-values. A good place to start would be for the ASA to articulate those circumstances if they exist, in which the use of NHST clearly is beneficial. At the same time, this will serve to illustrate that its rank and file usage is little more than scientist window dressing.’

    https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1497540?af=R

    To make progress in this effort to improve endeavors billed as science related, I speculate that Raymond Hubbard’s broader framework points to the inadequacies of specific proposals. Here are a couple of observations, taken from a review of Hubbard’s Corrupt Research, that seem relevant.

    ‘Less familiar to readers may be his [Hubbard, my emphasis] discussion of how the widespread tacit acceptance of a philosophically naive form of Hypothetico-Deductivism (HD) as the Scientific Method permits and exacerbates the other complementary causes. According to Hubbard, HD is erroneously thought to legitimize the inappropriate use of NHST and other methodological flaws.’

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5136553/

    I attribute the prevalence of HD to the natural inclination of most Westerners to think in binary constructs. They pop up somewhere in thought processes. So while there is now greater awareness of potential cognitive biases, it is difficult for most to avoid those identified when a particular context may demand their avoidance. Nearly everyone is subject to binary constructs and cognitive biases.

    The use of HD is naive as it may also indicate a violation of Occam’s Razor. But HD is what western thought is immersed in. A similar observation was made by Robert Nozick. Unfortunately, I no longer possess Nozick’s books for references. I donated my library a few years ago.

    Science is a messy prospect. John refers to ‘Science’ as one of the best things to happen to the species. Well, that may be so. But it has had risks and consequences associated with it that require oversight.

    Lastly, I think that Steven Goodman wants us to be more circumspect as to what will transpire after a decision to adopt a specific proposal. We just don’t know really.

    That is why specificity at this time should be a priority: that is to give enough examples as I started out suggesting earlier.

  8. Tom says:

    Andrew, I am probably missing something but one can be opposed to statistical significance, a binary measure, without being opposed to p-values, a continuous one. I believe that the p-values should be reported without emphasizing the binary measure. A p-value of 0.11 may be informative for somebody with a very different prior. In this sense everybody would be allowed to make up their mind, rather than being nudged, more or less forcefully, to pay attention only to the results that clear the p<0.05 threshold.

    I am sure that your critique runs much deeper. My point, I guess, is that even for somebody who is still a good old frequentist, there are many ways to improve the current system.

    • Mayo says:

      Tom: I agree. I propose one way (in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)) by reformulating tests so that one reports discrepancies that are (and are not) indicated at different levels. One is reporting, not on confidence or belief, but on how well or poorly tested claims are. The claims are, like CIs, generally of inequalities (e.g., mu > mu’). People prefer CIs to tests, but there is a clear duality between the two. The lower CI bound is the value that x is statistically significantly greater than at the given level; the upper, the value that x is statistically significantly lower than, at the given level. To emphasize what has been poorly indicated, at least 1 intermediate benchmark is useful. As Ioannidis points out, the “enthusiasts” are concerned with fallacies of non-significance: taking a modest p-value as indicating no effect. But this is easily remedied in the manner I’m describing (even power analysis enables going beyond the dichotomy to setting an upper bound that a non-significant result rules out statistically).

      To test statistical assumptions or check if a result replicates, however, statistical tests and minimal thresholds are required. As Fisher always insisted, isolated small p-values don’t suffice; but when you can bring about a few statistically significant results, a genuine anomaly is indicated. Without that, there’s no falsification in science (with few exceptions, building up a genuine anomalous effect, in order to falsify a theory, is strictly statistical). The choice isn’t a rigid, automatic dichotomy, or a lack of thresholds to distinguish fairly good from terrible evidence.
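
The duality between confidence intervals and tests described in the comment above can be checked numerically. The following is only a sketch under simplifying assumptions: a normal approximation with a known standard error, a made-up estimate (1.0) and standard error (0.4), and hypothetical function names. It is not a reproduction of the severity calculations in the book mentioned above. It reports interval bounds at several levels and the one-sided p-value obtained when each bound is used as the null value.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def confidence_interval(estimate, se, level=0.95):
    """Two-sided normal-approximation confidence interval."""
    z_crit = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    return estimate - z_crit * se, estimate + z_crit * se


def p_greater(estimate, se, mu0):
    """One-sided p-value for testing H0: mu <= mu0 against mu > mu0."""
    return 1 - STD_NORMAL.cdf((estimate - mu0) / se)


def p_less(estimate, se, mu0):
    """One-sided p-value for testing H0: mu >= mu0 against mu < mu0."""
    return STD_NORMAL.cdf((estimate - mu0) / se)


if __name__ == "__main__":
    estimate, se = 1.0, 0.4  # hypothetical estimate and standard error
    for level in (0.50, 0.80, 0.95, 0.99):
        lower, upper = confidence_interval(estimate, se, level)
        print(
            f"{level:.0%} interval: ({lower:.2f}, {upper:.2f})  "
            f"p(mu > lower) = {p_greater(estimate, se, lower):.3f}  "
            f"p(mu < upper) = {p_less(estimate, se, upper):.3f}"
        )
```

At each level, both one-sided p-values equal half of one minus the confidence level (0.025 for a 95% interval), so the interval bounds are exactly the null values at which the estimate sits on the edge of one-sided significance. Printing several levels side by side is one way to report more than a single dichotomous verdict.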

  9. jd says:

    I really like the last sentence of the last paragraph in “Abandoning statistical significance is both sensible and practical.”

    Ignoring statistical significance and p-values for a second – In regards to “noise amplification” and “unrealistic expectations of replicability” – It always strikes me how many of Andrew Gelman’s examples that I remember seeing (himmicanes, beauty and sex ratio, ESP, etc.) would seem to an outsider rather implausible at face value, without any data or statistics or studies.

    Maybe that’s just because I’m not an expert in the field, but still I wonder if ‘statistical significance’ has gotten in the way of common sense at times.

    Ioannidis’s last sentence is interesting – “Without clear rules for the analyses, science and policy may rely less on data and evidence and more on subjective opinions and interpretations.”

    Well, at present, there does seem to be evidence that with “clear rules” (statistical significance) for the analyses, science and policy rely less on common sense and more on statistical filters.
    Also, I think the “noise amplification” of statistical significance can actually foster subjective opinion and interpretation, not the reverse.

    • Perhaps one way to put this is that Ioannidis is arguing for the need to block bad information processing without realizing that it can also block good information processing. The latter seems to be mostly what’s happening currently in many areas of science.

      The bigger picture to me is the preference for censorship (blocking bad information processing at least among the elite) over enabling/encouraging better information processing widely. My preference has been for the latter.

    • Beep beep, said Roadrunner says:

      I hope I don’t remember the details horribly wrong, but here we go…

      There was a study about how using p-values as a red herring affects how people reason (students and researchers in this case, again, if I remember correctly). There was–shudder–a worded question about some treatment, how in one group people on average got better in, say, 3 months and in the other group in 4 months. The task was, for the participants, to tell if these two numbers are different from each other. In this case, if indeed the number four is larger than the number three.

      When they were given the unnecessary information that the difference wasn’t statistically significant, these highly educated people failed this simple question that most elementary school students would ace. Somehow in their brains the fact that the difference wasn’t statistically significant seemed to mean that, yes, the numbers 3 and 4 are different on the surface level, but the significance test has revealed a much deeper level of the universe to us, a level in which these numbers are actually NOT different at all! Oh those uneducated plebeians who in their foolish minds think that 3 and 4 are different numbers… HAH HAH HAH (slowed down laugh of a ghost in a cheesy film)

    • Ben Prytherch says:

      I completely agree regarding the often counter-productive use of statistical significance. Ioannidis and many others see significance as this kind of protection against being fooled by randomness, and they worry that if we get rid of that protection then all observed patterns, even those highly probable under the null of “no effect”, will be treated as evidence for some claim.

      I don’t buy this, for the reason you give: we have too many examples of sketchy results that on their face look like noise, but get that "p < 0.05" seal of approval and thus become publishable. In this sense, "statistical significance" may be achieving the exact opposite of what it is intended to do. It's supposed to make patterns in noise less likely to be taken seriously; it very well may be making them more likely to be taken seriously.

      Correlation statistics are a great example. You don't need that large of a sample size for a small correlation to be significant. And so we see so many studies highlighting "statistically significant" correlations that are so small they can't be detected by the naked eye – in the rare instances that plots are published. I have a hard time believing that a correlation which, when plotted, looks like noise, would get highlighted as good evidence for a theory if it weren't for a "p < 0.05" seal of approval.
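
To put rough numbers on the point about small correlations, here is a short Python sketch. It relies on the Fisher z approximation (treating atanh(r) * sqrt(n - 3) as standard normal when the true correlation is zero) rather than the exact t-test, and the sample sizes and the function name are illustrative assumptions, not values from any study discussed here.

```python
from math import sqrt, tanh
from statistics import NormalDist


def smallest_significant_r(n, alpha=0.05):
    """Approximate smallest |r| giving two-sided p < alpha with n pairs.

    Uses the Fisher z approximation: atanh(r) * sqrt(n - 3) is treated as
    standard normal when the true correlation is zero. Requires n > 3.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))


if __name__ == "__main__":
    for n in (30, 100, 1000, 10000):  # hypothetical sample sizes
        r = smallest_significant_r(n)
        print(f"n = {n:>5}: |r| > {r:.3f}, shared variance r^2 = {r * r:.4f}")
```

Under this approximation, a correlation of roughly 0.2 is enough at n = 100, and by n = 1000 a correlation of about 0.06, which accounts for well under 1% of the variance, already clears the 0.05 threshold. A scatterplot of such data would be hard to tell apart from noise by eye.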

      • In all fairness, we tend to overgeneralize and misrepresent what some experts recommend. This is a key problem with many critiques and commentaries. Then I find myself having to intervene to correct both. I do this for individuals with whom I disagree as well. I welcome reciprocal criticisms.

        If you follow John Ioannidis’ comments, John offers a wider range of cautions on practices than on simply ‘statistical significance’. John Ioannidis has consistently suggested that stat sig and p-values should not constitute the default practice. Their utility is relevant in a minority of queries.

        John’s objection to the ‘retire significance’ comment/article was perhaps a bit too narrowly focused. In previous articles & talks, John Ioannidis has emphasized the differentiated conceptual & methodologic needs of different fields and subfields. It is in that frame that John has expressed his views of the utility of stat sig and p values. In particular, he seems to have focused on ‘discoveries’. I believe that even the Benjamin et al article left open the question of practices in non-discovery related studies. I’m being a bit lazy probably by not furnishing the exact quote from the Benjamin article.

      • Martha (Smith) says:

        Ben said,
        “I don’t buy this, for the reason you give: we have too many examples of sketchy results that on their face look like noise, but get that "p < 0.05" seal of approval and thus become publishable. In this sense, "statistical significance" may be achieving the exact opposite of what it is intended to do. It's supposed to make patterns in noise less likely to be taken seriously; it very well may be making them more likely to be taken seriously.”

        Also worth mentioning: Making decisions based on statistical significance leads to "The Winner's Curse", since the "false positives" are likely to have inflated estimates (Type M Error).

  10. There has been an abiding trust in allopathic medicine that more recently has translated into mistrust. Ironically, some within the medical establishment have written about the latter. Of course, Richard Harris’ Rigor Mortis ranks as one of the best narratives of the current malaise being visited in medicine and statistics.

    I’m not sure why some people question and others do not, among the highly educated. As I have speculated now and then, some people are natural diagnosticians, regardless of their educational background.

    I think we need to do even more in-depth analyses of clinical trial procedures, design, and methods, a view that I shared with some academics in the ’90s.
    But whether I endorse larger and larger trials is a question. John Ioannidis also has questioned whether even more studies will yield information gain, thus requiring exploration of new concepts, methods, etc.

  11. A particularly gross example of abuse of p values occurred in Science not long ago. Memory is improved by TMS, p=0.043. The results were very unconvincing IMO, but they were tweeted by Science and rapidly got a high altmetric score.
    http://www.dcscience.net/2014/11/02/two-more-cases-of-hype-in-glamour-journals-magnets-cocoa-and-memory/

    • Garnett says:

      Thank you for posting this! Our center does some work with rTMS, so it’s especially interesting to me.

      From the hyped-up press release:

      “They remembered more face-word pairings after the stimulation than before, which means their learning ability improved,” Voss said. “That didn’t happen for the placebo condition or in another control experiment with additional subjects.”

      This conclusion was drawn because the p-value for percentage change from baseline during active treatment was less than 0.05, while the p-value during sham treatment was bigger than 0.05. N=16 subjects.

      All of that notwithstanding, they observe about +25% average change from baseline in face-word pairings after active treatment compared to about +5% average change from baseline after sham treatment (contrast p=0.043, as you wrote).

      “+25% improvement”? That really seems like a lot, but it’s hard to relate change in performance on a face-word pairing task with day-to-day quality of life or clinical significance. How do we interpret the magnitude of this claim? Interestingly, the press release claims that rTMS may help stroke survivors or people with Alzheimer’s apparently because of the statistically significant result that you note. Maybe because the therapy is (presumably) non-invasive, _any_ benefit is considered worthy of recommendation?
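
The reasoning pattern described above (one condition has p < 0.05, the other does not, therefore the conditions differ) can be examined with a small Python sketch. The estimates and standard errors below are hypothetical numbers chosen for illustration and are NOT the data from the TMS study, which did report a direct contrast (p=0.043). The sketch also assumes the two estimates are independent, which would not hold for a within-subject design.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for H0: true value is zero (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(estimate / se)))


def compare(est_a, se_a, est_b, se_b):
    """P-values for each estimate and for their difference.

    Assumes the two estimates are independent, so the variances add.
    """
    diff = est_a - est_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)
    return {
        "p_a": two_sided_p(est_a, se_a),
        "p_b": two_sided_p(est_b, se_b),
        "p_difference": two_sided_p(diff, se_diff),
    }


if __name__ == "__main__":
    # Hypothetical estimates (with standard errors), NOT the TMS study's data.
    for name, p in compare(25, 10, 10, 10).items():
        print(f"{name}: {p:.3f}")
```

With these made-up inputs, the first estimate is "significant" and the second is not, yet the difference between them is far from significant. Comparing each arm to a threshold separately is therefore a different (and weaker) argument than estimating the contrast directly with its uncertainty.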

  12. Much of the problem is surely with review processes that took shape pre-internet, in times when funding and reward systems were not so pernicious in their effects. As has been suggested, all studies that meet reasonable “severe check” criteria should appear somewhere, but they do not have to appear (and maybe should not) on a printed page. Pre-registration, at least to the extent of posting details of the proposed study design and execution, should be required. There should be provision for continuing critical re-evaluation, with reporting on replication attempts encouraged.

    This strikes me as the only way to move reasonably seamlessly from present inadequate (relative to what really matters) refereeing to refereeing that will look more incisively at design, execution, and modeling — those who attempt independent replication of the work will not commonly repeat the same mistakes. Such independent checking has the potential to focus scrutiny, not just on the study, but also on the quality of the refereeing process that gave the paper a tick. “Publication” processes of this type have the potential to generate useful and interesting data.

    I have no settled view on the role of p-values. I am attracted to David Colquhoun’s proposal that they should be supplemented by a properly documented rough assessment of what might be the false positive risk, if only in order to make it clear that the p-value does not directly measure the false positive risk.

    In those areas that retain traditional pre-internet (and pre the modern era) publication processes, statistical analysis (with or without p-values) is being asked to do a job for which it is not fitted.
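
As a rough illustration of why a p-value does not directly measure the false positive risk mentioned above, here is a small Python sketch of the simplest screening-style calculation. It treats every result with p < alpha alike, and the prior probabilities and the power of 0.8 are assumptions picked for illustration. It should not be read as the exact calculation in David Colquhoun’s proposal, which conditions on the observed p-value and is more detailed.

```python
def false_positive_risk(prior, power, alpha=0.05):
    """Share of 'significant' results that are false positives.

    prior: assumed fraction of tested hypotheses with a real effect
    power: assumed probability of p < alpha when the effect is real
    alpha: significance threshold (false positive rate under the null)
    """
    false_positives = alpha * (1 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Hypothetical priors; the power of 0.8 is also an assumption.
    for prior in (0.5, 0.1, 0.01):
        risk = false_positive_risk(prior, power=0.8)
        print(f"prior = {prior:.2f}: false positive risk = {risk:.2f}")
```

With a prior of 0.1 and power of 0.8, about 36% of the results passing p < 0.05 would be false positives under this simple model, far from the 5% that the threshold is often taken to imply. The answer depends heavily on the assumed prior, which is why documenting that assumption matters.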

    • John, Great set of comments. Transparency is most certainly critical to identifying & developing better approaches to the scientific enterprise. That is why I support Open Science Initiatives. In particular, pre-registration should be mandatory. But that missive appears not to be a robust expectation among some stakeholders that have vested interests. Nevertheless, I think the state of several sub-fields in medicine necessitates pre-registration anyway, including the protocol, the data analysis plan, code, etc.

      What I am less sure of is whether it makes sense to specify a statistical plan at the outset. That just does not compute with me, as a general rule. I would think it depends on the context. In any case, broader science & social sciences related questions arise concomitantly from the outset. The tendency is to proceed in the hypo-deductivist mode, which Raymond Hubbard, rightfully, critiques in his ‘deemed’ landmark book Corrupt Research. He also discusses ‘abduction’. I haven’t yet read the book, but it is next.
