“Retire Statistical Significance”: The discussion.

So, the paper by Valentin Amrhein, Sander Greenland, and Blake McShane that we discussed a few weeks ago has just appeared online as a comment piece in Nature, along with a letter with hundreds (or is it thousands?) of supporting signatures.

Following the first circulation of that article, its authors and some others of us had an email discussion that I thought might be of general interest.

I won’t copy out all the emails, but I’ll share enough to try to convey the sense of the conversation, and any readers are welcome to continue the discussion in the comments.

1. Is it appropriate to get hundreds of people to sign a letter of support for a scientific editorial?

John Ioannidis wrote:

Brilliant Comment! I am extremely happy that you are publishing it and that it will certainly attract a lot of attention.

He had some specific disagreements (see below for more on this). Also, he was bothered by the group-signed letter and wrote:

I am afraid that what you are doing at this point is not science, but campaigning. Leaving the scientific merits and drawbacks of your Comment aside, I am afraid that a campaign to collect signatures for what is a scientific method and statistical inference question sets a bad precedent. It is one thing to ask for people to work on co-drafting a scientific article or comment. This takes effort, real debate, multiple painful iterations among co-authors, responsibility, undiluted attention to detailed arguments, and full commitment. Lists of signatories have a very different role. They do make sense for issues of politics, ethics, and injustice. However, I think that they have no place on choosing and endorsing scientific methods. Otherwise scientific methodology would be validated, endorsed and prioritized based on who has the most popular Tweeter, Facebook or Instagram account. I dread to imagine who will prevail.

To this, Sander Greenland replied:

YES we are campaigning and it’s long overdue . . . because YES this is an issue of politics, ethics, and injustice! . . .

My own view is that this significance issue has been a massive problem in the sociology of science, hidden and often hijacked by those pundits under the guise of methodology or “statistical science” (a nearly oxymoronic term). Our commentary is an early step toward revealing that sad reality. Not one point in our commentary is new, and our central complaints (like ending the nonsense we document) have been in the literature for generations, to little or no avail – e.g., see Rothman 1986 and Altman & Bland 1995, attached, and then the travesty of recent JAMA articles like the attached Brown et al. 2017 paper (our original example, which Nature nixed over sociopolitical fears). Single commentaries even with 80 authors have had zero impact on curbing such harmful and destructive nonsense. This is why we have felt compelled to turn to a social movement: Soft-peddled academic debate has simply not worked. If we fail, we will have done no worse than our predecessors (including you) in cutting off the harmful practices that plague about half of scientific publications, and affect the health and safety of entire populations.

And I replied:

I signed the form because I feel that this would do more good than harm, but as I wrote here, I fully respect the position of not signing any petitions. Just to be clear, I don’t think that my signing of the form is an act of campaigning or politics. I just think it’s a shorthand way of saying that I agree with the general points of the published article and that I agree with most of its recommendations.

Zad Chow replied more agnostically:

Whether political or not, signing a piece as a form of endorsement seems far more appropriate than having papers with mass authorships of 50+ authors, where it is unlikely that every single one of those authors contributed enough to actually be an author and where their placement as an author is also a political message.

I also wonder if such pieces, whether they be mass authorships or endorsements by signing, actually lead to notable change. My guess is that they really don’t, but whether or not such endorsements are “popularity contests” via social media, I think I’d prefer that people who participate in science have some voice in the matter, rather than having the views of a few influential individuals, whether they be methodologists or journal editors, constantly repeated and executed in different outlets.

2. Is “retiring statistical significance” really a good idea?

Now on to problems with the Amrhein et al. article. I mostly liked it, although I did have a couple places where I suggested changes of emphasis, as noted in my post linked above. The authors made some of my suggested changes; in other places I respect their decisions even if I might have written things slightly differently.

Ioannidis had more concerns, as he wrote in an email listing a bunch of specific disagreements with points in the article:

1. Statement: Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exist
Why it is misleading: Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important. It will also facilitate claiming that there are no conflicts between studies when conflicts do exist.

2. Statement: Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P-value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.
Why it is misleading: In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim. In many cases using sufficiently stringent p-value thresholds, e.g. p=0.005 for many disciplines (or properly multiplicity-adjusted p=0.05, e.g. 10^-9 for genetics or FDR or Bayes factor thresholds or any thresholds) makes perfect sense. We need to make some careful choices and move on. Saying that any and all associations cannot be 100% dismissed is correct strictly speaking, but practically it is nonsense. We will get paralyzed because we cannot exclude that everything may be causing everything.

3. Statement: statistically non-significant results were interpreted as indicating ‘no difference’ in XX% of articles
Why it is misleading: this may have been entirely appropriate in many/most/all cases; one has to examine each of them carefully. It is probably at least as inappropriate, or even more so, that some/many of the remaining 100-XX% were not indicated as “no difference”.

4. Statement: The editors introduce the collection (2) with the caution “don’t say ‘statistically significant’.” Another article (3) with dozens of signatories calls upon authors and journal editors to disavow the words. We agree and call for the entire concept of statistical significance to be abandoned. We don’t mean to drop P-values, but rather to stop using them dichotomously to decide whether a result refutes or supports a hypothesis.
Why it is misleading: please see my e-mail about what I think regarding the inappropriateness of having “signatories” when we are discussing scientific methods. We do need to reach conclusions dichotomously most of the time: is this genetic variant causing depression, yes or no? Should I spend 1 billion dollars to develop a treatment based on this pathway, yes or no? Is this treatment effective enough to warrant taking it, yes or no? Is this pollutant causing cancer, yes or no?

5. Statement: whole paragraph beginning with “Tragically…”
Why it is misleading: we have no evidence that if people did not have to defend their data as statistically significant, publication bias would go away and people would not be reporting whatever results look nicer, stronger, more desirable and more fit to their biases. Statistical significance or any other preset threshold (e.g. Bayesian or FDR) sets an obstacle to making unfounded claims. People may play tricks to pass the obstacle, but setting no obstacle is worse.

6. Statement: For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss (8).
Why it is misleading: this example is factually wrong; it is true only if we are certain that the effect being addressed is truly non-null.

7. Statement: One way to do this is to rename confidence intervals ‘compatibility intervals,’ …
Why it is misleading: Probably the last thing we need in the current confusing situation is to add yet another new, idiosyncratic term. “Compatibility” is even a poor choice, probably worse than “confidence”. Results may be entirely off due to bias and the X% CI (whatever C stands for) may not even include the truth much of the time if bias is present.

8. Statement: We recommend that authors describe the practical implications of all values inside the interval, especially the observed effect or point estimate (that is, the value most compatible with the data) and the limits.
Why it is misleading: I think it is far more important to consider what biases may exist and which may lead the entire interval, no matter what we call it, to be off and thus incompatible with the truth.

9. Statement: We’re frankly sick of seeing nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews, and instructional materials.
Why it is misleading: I (and many others) are frankly sick of seeing nonsensical “proofs of the non-null”, people making strong statements about associations and even causality with (or even without) formal statistical significance (or other statistical inference tool) plus tons of spin and bias. Removing the statistical significance obstacle entirely will just give a free-lunch, all-is-allowed bonus to make any desirable claim. All science will become like nutritional epidemiology.

10. Statement: That means you can and should say “our results indicate a 20% increase in risk” even if you found a large P-value or a wide interval, as long as you also report and discuss the limits of that interval.
Why it is misleading: yes, indeed. But then, welcome to the world where everything is important, noteworthy, must be licensed, must be sold, must be bought, must lead to public health policy, must change our world.

11. Statement: Paragraph starting with “Third, the default 95% used”
Why it is misleading: indeed, but this means that more appropriate P-value thresholds and, respectively, X% confidence intervals are preferable, and these need to be decided carefully in advance. Otherwise, everything is done post hoc and any pre-conceived bias of the investigator can be “supported”.

12. Statement: Factors such as background evidence, study design, data quality, and mechanistic understanding are often more important than statistical measures like P-values or intervals (10).
Why it is misleading: while it sounds reasonable that all these other factors are important, most of them are often substantially subjective. Conversely, statistical analysis at least has some objectivity and if the rules are carefully set before the data are collected and the analysis is run, then statistical guidance based on some thresholds (p-values, Bayes factors, FDR, or other) can be useful. Otherwise statistical inference also becomes entirely post hoc and subjective.

13. Statement: The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy, and business environments, decisions based on the costs, benefits, and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to further pursue a research idea, there is no simple connection between a P-value and the probable results of subsequent studies.
Why it is misleading: This argument is equivalent to hand waving. Indeed, most of the time yes/no decisions need to be made and this is why removing statistical significance and making it all too fluid does not help. It leads to an “anything goes” situation. Study designs for questions that require decisions need to take all these other parameters into account ideally in advance (whenever possible) and set some pre-specified rules on what will be considered “success”/actionable result and what not. This could be based on p-values, Bayes factors, FDR, or other thresholds or other functions, e.g. effect distribution. But some rule is needed for the game to be fair. Otherwise we will get into more chaos than we have now, where subjective interpretations already abound. E.g. any company will be able to claim that any results of any trial on its product do support its application for licensing.

14. Statement: People will spend less time with statistical software and more time thinking.
Why it is misleading: I think it is unlikely that people will spend less time with statistical software but it is likely that they will spend more time mumbling, trying to sell their pre-conceived biases with nice-looking narratives. There will be no statistical obstacle on their way.

15. Statement: the approach we advocate will help halt overconfident claims, unwarranted declarations of ‘no difference,’ and absurd statements about ‘replication failure’ when results from original and the replication studies are highly compatible.
Why it is misleading: the proposed approach will probably paralyze efforts to refute the millions of nonsense statements that have been propagated by biased research, mostly observational, but also many subpar randomized trials.

Overall assessment: the Comment is written with an undercurrent belief that there are zillions of true, important effects out there that we erroneously dismiss. The main problem is quite the opposite: there are zillions of nonsense claims of associations and effects that once they are published, they are very difficult to get rid of. The proposed approach will make people who have tried to cheat with massaging statistics very happy, since now they would not have to worry at all about statistics. Any results can be spun to fit their narrative. Getting entirely rid of statistical significance and preset, carefully considered thresholds has the potential of making nonsense irrefutable and invincible.

That said, despite these various specific points of disagreement, Ioannidis emphasized that Amrhein et al. raise important points that “need to be given an opportunity to be heard loud and clear and in their totality.”
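As an aside on point 6: one plausible way to unpack the coin-toss analogy (this is a gloss, not something spelled out in the quoted statement) is through the “surprisal” or S-value transformation of a P-value, -log2(P), which measures information against the null in bits. On that reading, P = 0.03 carries exactly one more bit than P = 0.06, the information in a single fair coin toss; the snippet below is just that arithmetic.

import math

# A worked check of the one-bit gloss above (P-values chosen to match point 6).
p1, p2 = 0.03, 0.06
s1 = -math.log2(p1)   # about 5.06 bits of "surprisal" against the null
s2 = -math.log2(p2)   # about 4.06 bits
print(s1 - s2)        # = log2(0.06 / 0.03) = 1.0 bit, one fair coin toss

Whether that gloss makes the analogy fair is, of course, part of what point 6 disputes.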

In reply to Ioannidis’s points above, I replied:

1. You write, “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important.” I completely disagree. Or, maybe I should say, anyone is already allowed to make any overstated claim about any result being important. That’s what PNAS is, much of the time. To put it another way: I believe that embracing uncertainty and avoiding overstated claims are important. I don’t think statistical significance has much to do with that.

2. You write, “In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim.” Again, this is already the case that people can conclude what they want. One concern is what is done by scientists who are honestly trying to do their best. I think those scientists are often misled by statistical significance, all the time, ALL THE TIME, taking patterns that are “statistically significant” and calling them real, and taking patterns that are “not statistically significant” and treating them as zero. Entire scientific papers are, through this mechanism, data in, random numbers out. And this doesn’t even address the incentives problem, by which statistical significance can create an actual disincentive to gather high-quality data.

I disagree with many other items on your list, but two is enough for now. I think the big picture is that you’re pointing out that scientists and consumers of science want to make reliable decisions, and statistical significance, for all its flaws, delivers some version of reliable decisions. And my reaction is that whatever benefit statistical significance provides by sometimes yielding reliable decisions is outweighed by (a) all the times that statistical significance adds noise and provides unreliable decisions, and (b) the false sense of security that statistical significance gives so many researchers.
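To make the “data in, random numbers out” concern concrete, here is a minimal simulation sketch with made-up numbers: a small true effect measured with a large standard error, replicated across many studies and then filtered on p < 0.05. The surviving estimates are badly exaggerated and sometimes have the wrong sign.

import numpy as np
from scipy import stats

# Hypothetical numbers: small true effect, noisy measurement, significance filter.
rng = np.random.default_rng(0)
true_effect = 0.1     # assumed (small) true effect
se = 0.5              # assumed standard error of each study's estimate
n_studies = 100_000

estimates = rng.normal(true_effect, se, size=n_studies)
p_values = 2 * stats.norm.sf(np.abs(estimates) / se)
significant = estimates[p_values < 0.05]

print(f"share reaching p < 0.05:            {significant.size / n_studies:.3f}")
print(f"mean |estimate| among those:        {np.abs(significant).mean():.2f} (true effect = {true_effect})")
print(f"share of those with the wrong sign: {(significant < 0).mean():.2f}")

With these numbers, the estimates that clear the threshold overstate the true effect by roughly a factor of ten, and a nontrivial fraction of them point in the wrong direction, which is the sense in which the filter turns data into noise.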

One reason this is all relevant, and interesting, is that we all agree on so much—yet we disagree so strongly here. I’d love to push this discussion toward the real tradeoffs that arise when considering alternative statistical recommendations, and I think what Ioannidis wrote, along with the Amrhein/Greenland/McShane article, would be a great starting point.

Ioannidis then responded to me:

On whether removal of statistical significance will increase or decrease the chances that overstated claims will be made and authors will be more or less likely to conclude according to their whim, the truth is that we have no randomized trial to tell whether you are right or I am right. I fully agree that people are often confused about what statistical significance means, but does this mean we should ban it? Should we also ban FDR thresholds? Should we also ban Bayes factor thresholds? Also probably we have different scientific fields in mind. I am afraid that if we ban thresholds and other (ideally pre-specified) rules, we are just telling people to just describe their data as best as they can and unavoidably make strength-of-evidence statements as they wish, kind of impromptu and post-hoc. I don’t think this will work. The notion that someone can just describe the data without making any inferences seems unrealistic and it also defies the purpose of why we do science: we do want to make inferences eventually and many inferences are unavoidably binary/dichotomous. Also actions based on inferences are binary/dichotomous in their vast majority.

I replied:

I agree that the effects of any interventions are unknown. We’re offering, or trying to offer, suggestions for good statistical practice in the hope that this will lead to better outcomes. This uncertainty is a key reason why this discussion is worth having, I think.

3. Mob rule, or rule of the elites, or gatekeepers, consensus, or what?

One issue that came up is, what’s the point of that letter with all those signatories? Is it mob rule, the idea that scientific positions should be determined by those people who are loudest and most willing to express strong opinions (“the mob” != “the silent majority”)? Or does it represent an attempt by well-connected elites (such as Greenland and myself!) to tell people what to think? Is the letter attempting to serve a gatekeeping function by restricting how researchers can analyze their data? Or can this all be seen as a crude attempt to establish a consensus of the scientific community?

None of these seem so great! Science should be determined by truth, accuracy, reproducibility, strength of theory, real-world applicability, moral values, etc. All sorts of things, but these should not be the property of the mob, or the elites, or gatekeepers, or a consensus.

That said, the mob, the elites, gatekeepers, and the consensus aren’t going anywhere. Like it or not, people do pay attention to online mobs. I hate it, but it’s there. And elites will always be with us, sometimes for good reasons. I don’t think it’s such a bad idea that people listen to what I say, in part on the strength of my carefully-written books—and I say that even though, at the beginning of my career, I had to spend a huge amount of time and effort struggling against the efforts of elites (my colleagues in the statistics department at the University of California, and their friends elsewhere) who did their best to use their elite status to try to put me down. And gatekeepers . . . hmmm, I don’t know if we’d be better off without anyone in charge of scientific publishing and the news media—but, again, the gatekeepers are out there: NPR, PNAS, etc. are real, and the gatekeepers feed off of each other: the news media bow down before papers published in top journals, and the top journals jockey for media exposure. Finally, the scientific consensus is what it is. Of course people mostly do what’s in textbooks, and published articles, and what they see other people do.

So, for my part, I see that letter of support as Amrhein, Greenland, and McShane being in the arena, recognizing that mob, elites, gatekeepers, and consensus are real, and trying their best to influence these influencers and to counter negative influences from all those sources. I agree with the technical message being sent by Amrhein et al., as well as with their open way of expressing it, so I’m fine with them making use of all these channels, including getting lots of signatories, enlisting the support of authority figures, working with the gatekeepers (their comment is being published in Nature, after all; that’s one of the tabloids), and openly attempting to shift the consensus.

Amrhein et al. don’t have to do it that way. It would also be fine with me if they were to just publish a quiet paper in a technical journal and wait for people to get the point. But I’m fine with the big push.

4. And now to all of you . . .

As noted above, I accept the continued existence and influence of mob, elites, gatekeepers, and consensus. But I’m also bothered by these, and I like to go around them when I can.

Hence, I’m posting this on the blog, where we have the habit of reasoned discussion rather than mob-like rhetorical violence, where the comments have no gatekeeping (in 15 years of blogging, I’ve had to delete less than 5 out of 100,000 comments—that’s 0.005%!—because they were too obnoxious), and where any consensus is formed from discussion that might just lead to the pluralistic conclusion that sometimes no consensus is possible. And by opening up our email discussion to all of you, I’m trying to demystify (to some extent) the elite discourse and make this a more general conversation.

P.S. There’s some discussion in comments about what to do in situations like the FDA testing a new drug. I have a response to this point, and it’s what Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I wrote in section 4.4 of our article, Abandon Statistical Significance:

While our focus has been on statistical significance thresholds in scientific publication, similar issues arise in other areas of statistical decision making, including, for example, neuroimaging where researchers use voxelwise NHSTs to decide which results to report or take seriously; medicine where regulatory agencies such as the Food and Drug Administration use NHSTs to decide whether or not to approve new drugs; policy analysis where non-governmental and other organizations use NHSTs to determine whether interventions are beneficial or not; and business where managers use NHSTs to make binary decisions via A/B tests. In addition, thresholds arise not just around scientific publication but also within research projects, for example, when researchers use NHSTs to decide which avenues to pursue further based on preliminary findings.

While considerations around taking a more holistic view of the evidence and consequences of decisions are rather different across each of these settings and different from those in scientific publication, we nonetheless believe our proposal to demote the p-value from its threshold screening role and emphasize the currently subordinate factors applies in these settings. For example, in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Plotting images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.

For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.

That said, we acknowledge that thresholds—of a non-statistical variety—may sometimes be useful in these settings. For example, consider a firm contemplating sending a costly offer to customers. Suppose the firm has a customer-level model of the revenue expected in response to the offer. In this setting, it could make sense for the firm to send the offer only to customers that yield an expected profit greater than some threshold, say, zero.
Even in pure research scenarios where there is no obvious cost-benefit calculation—for example a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.

While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.

We would also like to see—when possible in these and other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature (Gelman, 2015, 2017; McShane and Bockenholt, 2017, 2018).
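To make a couple of the points in that excerpt concrete, here is a minimal sketch (not from the article; the names and numbers are made up, and a simple normal-normal shrinkage calculation stands in for the fuller hierarchical or meta-analytic modeling described above). The first part is the firm’s expected-profit rule; the second ranks hypothetical research leads by a shrunken effect estimate rather than by P-value.

import numpy as np
from scipy import stats

# 1) The firm example: threshold on expected profit, not on a p-value.
offer_cost = 2.0                            # assumed cost of sending one offer
revenue_if_response = 30.0                  # assumed revenue if the customer responds
p_response = np.array([0.02, 0.05, 0.12])   # stand-in for a customer-level model
expected_profit = p_response * revenue_if_response - offer_cost
print("send offer?", expected_profit > 0)   # send only where expected profit > 0

# 2) Ranking research leads: shrink each noisy estimate toward an assumed
#    distribution of true effect sizes, then rank by the shrunken estimate.
prior_mean, prior_sd = 0.0, 0.2             # assumed distribution of true effects
leads = {"A": (0.80, 0.50),                 # (estimate, standard error), hypothetical
         "B": (0.25, 0.08),
         "C": (0.40, 0.30)}
for name, (est, se) in leads.items():
    post_var = 1 / (1 / prior_sd**2 + 1 / se**2)              # normal-normal update
    post_mean = post_var * (prior_mean / prior_sd**2 + est / se**2)
    p_value = 2 * stats.norm.sf(abs(est) / se)
    print(f"lead {name}: p = {p_value:.3f}, raw estimate = {est:+.2f}, shrunken = {post_mean:+.2f}")

In the second part, the noisiest lead has the largest raw estimate and the second-smallest P-value, yet it falls to the bottom of the ranking once its estimate is shrunk toward the assumed distribution of effects.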

P.P.S. Regarding the petition thing, I like what Peter Dorman had to say:

A statistical decision rule is a coordination equilibrium in a very large game with thousands of researchers, journal editors and data users. Perhaps once upon a time such a rule might have been proposed on scientific grounds alone (rightly or wrongly), but now the rule is firmly in place with each use providing an incentive for additional use. That’s why my students (see comment above) set aside what I taught in my stats class and embraced NHST. The research they rely on uses it, and the research they hope to produce will be judged by it. That matters a lot more to them than what I think.

That’s why mass signatures make sense. It is not mob rule in the sociological sense; we signers are not swept up in a wave of transient hysterical solidarity. Rather, we are trying to dent the self-fulfilling power of expectations that locks NHST in place. 800 is too few to do this, alas, but it’s worth a try to get this going.

451 Comments

  1. Brent Hutto says:

    Put me down on the side of not feeling at all comfortable with “science” being protected somehow by long lists of signatories on an advocacy document. But I’ll admit to often suffering an excess of cynicism when it comes to both “mob rule” and “elites”. That said, anything that Sander Greenland and Andrew Gelman both advocate is, for me, worthy of serious contemplation.

    Now with that out of the way, the thing I want to say is I agree with (at least) the one underlying common theme in most of Ioannidis’ complaints. Any effort to ban usage or mention of a specific, long-standing decision rule like “p-value” without specifying a drop-in replacement decision rule (or at least a decision technique that admits formulation of context-specific rules) is a bad idea, doomed to at best failure and at worst the kind of success that leads to the situation being worse than before.

    To paraphrase what Ioannidis keeps repeating in various forms, if you can’t say prior to conducting an analysis exactly what criterion will constitute support for the presence of an effect then you are describing the data. Not analyzing, describing. And in many endeavors descriptive statistics, no matter how elaborate, are not the point of the research.

    • Christian Hennig says:

      As so often, I’m coming late to this discussion, but the first thing that strikes me is that the first two postings state that the article under discussion asks for abandoning p-values, which it quite explicitly doesn’t.

  2. Anonymous says:

    I think Ioannidis is coming at this from the medical field, and I think that he has a great point. It is hard for me to understand how the FDA will be able to do its job if p-values were dispensed with. The FDA just doesn’t have the resources to rerun all the basic science and pharmacological research upon which a drug application relies. At the end of the day, being able to say that some specific threshold was not met is essential for the regulator. Regulators just can’t go around and evaluate the evidence for each claim on an individualized basis. They have to be able to go into a court and explain their decision to a non-scientist who is more concerned with (and equipped to understand) the question of “Did the regulator apply the same standard here it applied in other contexts?” than “Does the evidence really justify the claim?” Without some scientific consensus on thresholds, industry will be able to push around regulators. Our society will get decidedly less scientific in a more important way than just having a lot of nonsense psychological and nutritional research being released in the press.

    • Anoneuoid says:

      It sounds like you would advocate for arbitrary obstacles to drug approval over none at all.

      What exactly do you think statistical significance is telling the regulators?

      • steve says:

        Of course, I would. There has to be a decision rule. If a better decision rule can be set up, that is preferable, but it will always be arbitrary in some sense. The FDA, for instance, insists on two randomized controlled trials to get a drug indication approved. Why two. Three would be better. Five is even more than three. The FDA has to pick something that it can explain to Congress. If an industry comes to Congress and says we have all these studies that show that the risky dangerous product that we want to make money off of is not going to be risky or dangerous at all, but these mean regulators won’t let us give it to Americans, the regulators are going to need clear scientifically based, but arbitrary, rules to tell Congress that the studies are bogus. That is not a scientific reason to have significance tests. It is just the practical reality. If they are abandoned because of the obvious problems, what decision rule will replace them? How will the regulator push back and say these results are indistinguishable from chance? There may be a better decision rule, but we have to recognize the risks of not having one at all.

        • Anoneuoid says:

          If you are going to be arbitrary then do the cheapest thing possible, eg flip a coin.

          the regulators are going to need clear scientifically based, but arbitrary, rules to tell Congress that the studies are bogus.

          The way statistical significance is used in these trials is not “scientifically based”. It amounts to disproving a strawman.

          Do you mean the regulators just want some expensive and elaborate ritual performed so it seems like they are doing something?

          How will the regulator push back and say these results are indistinguishable from chance?

          1) Statistical significance does not tell you that.
          2) That isn’t a question the regulators, or congress, or anyone, should care about.

          If an industry comes to Congress and says we have all these studies that show that the risky dangerous product that we want to make money off of is not going to be risky or dangerous at all

          Then there should be some cost-benefit assessment performed that takes into account the benefits and compares to the risk and cost. Statistical significance has nothing to do with this.

          • Steve says:

            Are you saying that requiring the p-value to be under .05 has no value in distinguishing results that may be the result of chance and those that are not? That seems strong. The FDA does weight the risk to patients against the benefits. That too will rely on arbitrary decision rules. Also, I, very clearly, am not advocating significance tests. I am just saying that some decision rule has to be in place for regulators to discount studies where the results are insufficiently distinguishable from chance. Any such rule will be arbitrary.

            Also, your rhetorical technique is a bit obnoxious. How about I do it to you? Are you saying that you don’t want drug companies to be regulated at all?!? (Not what you said, right? Annoying.)

            • Anoneuoid says:

              Are you saying that requiring the p-value to be under .05 has no value in distinguishing results that may be the result of chance and those that are not?

              Yes. With sufficient sample size you will always detect that your strawman null model is wrong (50% chance it is wrong in the “right” direction). It basically measures the prevailing opinion: Is society as a whole willing to devote enough resources to get significance for this new treatment or not?

              Are you saying that you don’t want drug companies to be regulated at all?!?

              I don’t trust them anyway since I have “seen how the sausage is made”, so it makes little difference to me honestly. There should be some organization that ensures what they say is in the pill (and only that, up to some limit) is actually in the pill though.

              • Patrick says:

                > With sufficient sample size you will always detect that your strawman null model is wrong (50% chance it is wrong in the “right” direction).

                In the infinite case, sure. Practically, to be approved, drugs first need to show efficacy in a finite sample size of (on the order of) hundreds of individuals. Given a particular and finite range of sample sizes, the chance of getting a significant p-value from a very small true effect should on average be smaller than the chance of getting a significant p-value from a large true effect.

                Of course, for p-values to be valuable in decision-making, it is not actually necessary that they be valuable in isolation from all other information, and it is very typical for a clinical trial to report effect sizes, sample sizes, and p-values together.

                > I don’t trust them anyway since I have “seen how the sausage is made”, so it makes little difference to me honestly.

                I don’t understand the logic here; surely if an industry is untrustworthy, more rather than less independent oversight is called for. That aside, drug trials don’t just assess efficacy, but also toxicity. Not collecting information about adverse outcomes in smaller trials before making a drug widely available seems unwise.

              • Anonymous says:

                It’s unclear what you’re actually blathering about. Clinical trials are not conducted with an infinite sample size, and the FDA absolutely takes effect size and real-world utility of the drug into account before making a decision. You’re the one positing the strawman that the FDA only looks at the pvalue at the bottom of the report.

    • This is a valid concern and this is how the world works today. However, I’d love some more work to plug parameter uncertainties into a decision model where the costs and benefits of different dubious claims may be very different. I’m not sure how I would think about evaluating claims (“this supplement makes you healthier”) although if they want to avoid BS marketing they should ban all vague claims (which won’t happen). On the other hand, when evaluating new medical techniques I think it *would* be helpful to plug in the full decision model. For example if an experimental drug has negligible side effects but trials suggest only weak (ie, not significant) improvement in outcomes, it might be OK to allow as a last resort. (People in the medical field will doubtless push back on that claim so perhaps it is a poor example — I just want to illustrate why the full decision analysis could help us move beyond having the FDA check p-values).
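      A bare-bones sketch of that kind of decision model (everything below is hypothetical): carry the uncertainty in the estimated effect through to an expected net benefit and decide on that, instead of asking whether p crosses 0.05.

import numpy as np

# Hypothetical trial summary: effect estimate 0.8 with standard error 0.5,
# i.e. z = 1.6 and a two-sided p of about 0.11 ("weak, not significant").
rng = np.random.default_rng(1)
effect_draws = rng.normal(0.8, 0.5, 100_000)   # crude stand-in for a posterior

value_per_unit_effect = 1_000.0   # assumed value of one unit of benefit per patient
side_effect_cost = 200.0          # assumed (small) per-patient cost of side effects

net_benefit = value_per_unit_effect * effect_draws - side_effect_cost
print(f"P(effect > 0):        {(effect_draws > 0).mean():.2f}")
print(f"expected net benefit: {net_benefit.mean():,.0f} per patient")
print(f"approve (E[net] > 0)? {net_benefit.mean() > 0}")

      With these made-up numbers the trial is “not significant” at 0.05 yet the expected net benefit is clearly positive; with a nastier side-effect profile the same arithmetic could easily go the other way.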

      • Anoneuoid says:

        if an experimental drug has negligible side effects but trials suggest only weak (ie, not significant) improvement in outcomes, it might be OK to allow as a last resort.

        Why would this be a last resort? Many people prefer that to developing new issues they need to deal with, that is why you see such a movement towards “natural” treatments.

      • Patrick says:

        There are actually already different guidelines for how drugs get approved: see, e.g., orphan diseases. The FDA isn’t just checking p-values in a vacuum; they’re using them as a usually-necessary-but-not-sufficient part of a decision-making process, with different criteria for different risk-benefit situations. See, for example: https://catalyst.harvard.edu/pdf/biostatsseminar/O'Neill_Slides.pdf

        • Thanks!!!

          I did not raise drug regulation in the email discussions because (as Andrew pointed out) the focus was scientific publication.

          Having worked in drug regulation and even jointly with the FDA (sometimes with Bob O’Neill), I do find comments on blogs rather poorly informed about what happens in drug regulation (Bob’s slides may help with that). On the other hand, I do need to be very careful about what I say publicly.

          My private take is that there need to be lines in the sand but always accompanied by notwithstanding clauses (openness and ability to change for good reasons). Additionally, in agreement with Don Berry, I believe there needs to be an overarching concern with how often things are approved when they shouldn’t be and vice versa, as many approvals are made or not each and every year.

          Having said that, from my experience, when people are on the ball and rise to the occasion the decision making is well informed, maybe even up to Andrew’s ideals. Now, in some countries, there may be legal limitations on what can matter in deciding approval, for instance, that the size of the effect, other than it being positive, cannot explicitly be considered. Laws aren’t perfect.

          Now, I have no idea how often people are not on the ball and fail to rise to the occasion in various regulatory agencies but it certainly happens.

    • Anoneuoid says:

      Patrick wrote:

      In the infinite case, sure. Practically, to be approved, drugs first need to show efficacy in a finite sample size of (on the order of) hundreds of individuals. Given a particular and finite range of sample sizes, the chance of getting a significant p-value from a very small true effect should on average be smaller than the chance of getting a significant p-value from a large true effect.

      The point is there is always a “true effect” (deviation from the strawman null model). It is just a matter of spending enough to detect it or not. Ie, the results are only ever true positives and false negatives (no false positives).

      Of course, as you say, the bigger the deviation from the model, the cheaper it will be to detect the deviation. What we see in practice is that the threshold moves down from 0.1 ->.05 -> 0.01 -> 3e-7 -> 5e-8 depending on how cheap it is to collect the data required for the “right” amount of significant results to be yielded on average (“alpha is the expected value of p”). If it is too easy or too hard to “get significance” the community rejects the procedure and adjust the threshold.

      Of course, for p-values to be valuable in decision-making, it is not actually necessary that they be valuable in isolation from all other information, and it is very typical for a clinical trial to report effect sizes, sample sizes, and p-values together.

      I never pay any attention to the p-values that clutter up medical journal papers and seem to understand them better than most. It is just pollution that obscures the important stuff like what was actually measured, what predictions were tested (if any), what was the relationship between the variables under study, what alternative explanations could there be, etc. It wouldn’t be so bad if we could just ignore them… but they determine what gets published since people think (significance = “real discovery”)…

      > I don’t trust them anyway since I have “seen how the sausage is made”, so it makes little difference to me honestly.

      I don’t understand the logic here; surely if an industry is untrustworthy, more rather than less independent oversight is called for. That aside, drug trials don’t just assess efficacy, but also toxicity. Not collecting information about adverse outcomes in smaller trials before making a drug widely available seems unwise.

      River of radioactive waste = current medical literature[1]
      Goggles = NHST
      https://www.youtube.com/watch?v=juFZh92MUOY

      [1] While disagreeing with his conclusion, liked this characterization of the cancer literature as “augean stables”:
      https://www.nature.com/news/cancer-reproducibility-project-scales-back-ambitions-1.18938
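      Returning to the point above that there is always some deviation to detect: a quick power calculation with made-up numbers shows how the chance of rejecting the point null climbs toward 1 as the sample grows, even when the true difference is fixed at something practically negligible.

import numpy as np
from scipy import stats

# Assumed tiny but nonzero difference in means between two groups, sd = 1.
tiny_true_diff = 0.02
sd = 1.0
for n in [1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    se = sd * np.sqrt(2 / n)        # standard error of the difference in means
    z = tiny_true_diff / se         # expected z-statistic at this sample size
    power = stats.norm.sf(1.96 - z) + stats.norm.cdf(-1.96 - z)
    print(f"n per group = {n:>10,}   chance of p < 0.05 = {power:.3f}")

      So whether the point null gets “detected” ends up being a question about the data-collection budget rather than about whether the difference matters.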

      • Martha (Smith) says:

        Anoneuoid said:

        “… [p-values are] just pollution that obscures the important stuff like what was actually measured, what predictions were tested (if any), what was the relationship between the variables under study, what alternative explanations could there be, etc. It wouldn’t be so bad if we could just ignore them… but they determine what gets published since people think (significance = “real discovery”)…”

        I pretty much agree.

    • Anoneuoid says:

      It’s unclear what you’re actually blathering about. Clinical trials are not conducted with an infinite sample size, and the FDA absolutely takes effect size and real-world utility of the drug into account before making a decision. You’re the one positing the strawman that the FDA only looks at the pvalue at the bottom of the report.

      Pick a real life clinical trial you think is done correctly, and it will become clear what I am “blathering about”.

  3. Anoneuoid says:

    Ioannidis wrote:

    In many cases using sufficiently stringent p-value thresholds, e.g. p=0.005 for many disciplines (or properly multiplicity-adjusted p=0.05, e.g. 10^-9 for genetics or FDR or Bayes factor thresholds or any thresholds) makes perfect sense.

    Here we go again… Real life example?

    I can tell you now all this does is make a community choose a threshold + sample size to typically get around the “correct” number of genes (or whatever) they want to (or, in practice, can) study further. Just skip the pseudoscience and rank the genes by whatever statistic you want (use p-values if that makes sense to you), then pick the top X for further study. But do not add in this step of finding “real” signals and “not real” signals.

    • Matt says:

      It seems that the significance threshold of 5*10^-8 has worked quite well in GWAS research. The false positive problem of the candidate gene paradigm has been largely eliminated and thousands of significant SNP effects that can be reproduced at least in ancestrally similar samples have been identified. If p-values were retired, how would GWAS research or the equivalent be conducted?

      • Anoneuoid says:

        thousands of significant SNP effects that can be reproduced at least in ancestrally similar samples have been identified.

        1) Like what? Please link to one along with the replication(s).
        2) Just counting the “success” is not informative. How many thousands cannot be reproduced?

        If p-values were retired, how would GWAS research or the equivalent be conducted?

        1) I don’t assume research like that should be conducted.
        2) If it should be conducted, I gave an example: sort your p-values and pick the top 10 to investigate further.

        • Matt says:

          Here’s some replication results from last year’s big GWAS of educational attainment:

          We conducted a replication analysis of the 162 lead SNPs identified at genome-wide significance in a previous combined-stage (discovery and replication) meta-analysis (n = 405,073). Of the 162 SNPs, 158 passed quality-control filters in our updated meta-analysis. To examine their out-of-sample replicability, we calculated Z-statistics from the subsample of our data (n = 726,808) that was not included in the previous study… Of the 158 SNPs, we found that 154 have matching signs in the new data (for the remaining four SNPs, the estimated effect is never statistically distinguishable from zero at P < 0.10). Of the 154 SNPs with matching signs, 143 are significant at P < 0.01, 119 are significant at P < 10−5 and 97 are significant at P < 5 × 10−8.

          It seems to me that significance testing working here just like it should be working. Replication in holdout samples is routine in GWAS research these days, take a look at the catalogue for more.

          I don’t assume research like that should be conducted.

          Why not?

          If it should be conducted, I gave an example: sort your p-values and pick the top 10 to investigate further.

          What would those further investigations be about? Individual differences in most human traits are influenced by thousands of genetic variants. The effect of any one genetic difference is tiny, so examining the top 10 effects would usually be pointless. Understanding genetic effects on complex trait variation in functional/mechanistic terms in the foreseeable future is not a realistic prospect in my opinion. Quite likely, it will never be realistic. Rather, the utility of GWAS results is that they can be combined into polygenic scores and used for prediction and intervention.

          • Anoneuoid says:

            This is the replication result (Fig S3)?
            https://i.ibb.co/Cz6S2xC/gwasrep.png

            Maybe I am missing something but it looks like almost no correlation at all to me. If the results were reproducible, shouldn’t the effect sizes for the two studies be scattered around the diagonal line? If there is little correlation in effect sizes but the “significance” still corresponds, then this would be showing that the SNP-specific sample sizes (does this vary across genes according to susceptibility to quality control issues, etc?) or variances were similar across the studies.

            Also according to the (usual) “significance” definition of replication these results were not very good:

            Of the 154 SNPs with matching signs, 143 are significant at P < 0.01, 119 are significant at P < 10−5 and 97 are significant at P < 5 × 10−8.

            The original threshold was 5e-8, so 97/162 ~ 60% replicated according to the statistical significance definition. If nothing was going on and sample size was sufficient, we would expect ~50% to be significant in the same direction, they got 60%…

            And they say:

            The reference allele is chosen to be the allele estimated to increase EA in the previous study; therefore, all points above the dotted line have matching signs in the replication sample.

            Why wouldn’t they also look at alleles that could “decrease EA”?

            I’ll leave the other stuff for later to focus on this for now. I have to believe I am misunderstanding that.

            • Anoneuoid says:

              Also, isn’t it strange that almost all (except ~5) the SNPs that met p < 5e-8 for the current study (all must have in the original to be included here) are above the diagonal and it looks like only one with a “lesser” p-value is above?

              Very bizarre figure to me, hopefully someone can point out my misunderstanding.

            • Patrick says:

              I haven’t read the whole paper, but I can answer this:

              The reference allele is chosen to be the allele estimated to increase EA in the previous study…

              Why wouldn’t they also look at alleles that could “decrease EA”?

              They do. When you calculate the effect size of having an allele “…AGTC…”, it is in contrast to having a different allele at that location, e.g., “…AGTA_…”, “…AGTT…”, etc. What they are saying here is that if in the previous study, they found a positive effect for “AGTC” at one locus over other possible alleles, they then also plotted the effect size for “AGTC” in the replication cohort (and not some other allele like “AGTA” or “AGTT”). In other words, all they’re saying there is that for it to be a real replication the same actual DNA sequence has to show an effect in the same direction.

              The original threshold was 5e-8, so 97/162 ~ 60% replicated according to the statistical significance definition. If nothing was going on and sample size was sufficient, we would expect ~50% to be significant in the same direction, they got 60%…

              I’m not sure where you’re getting 50% from. Given a completely random sample of 162 (out of around 150000 unlinked SNPs), I definitely wouldn’t expect half to be found by chance in a GWAS study that identified around 1100 significant hits at p < 5e-8.

              • Anoneuoid says:

                What they are saying here is that if in the previous study, they found a positive effect for “AGTC” at one locus over other possible alleles, they then also plotted the effect size for “AGTC” in the replication cohort (and not some other allele like “AGTA” or “AGTT”).

                Yes, at first I thought it made sense since there would always be a positive and negative allele, but there are four possibilities at each nucleotide like you say.

                Can’t we get results like the following for correlation with edutainment:
                AGTC = positive
                AGTA = neutral
                AGTT = neutral
                AGTG = negative

                Now there is both a “positive” and “negative” allele.

                In other words, all they’re saying there is that for it to be a real replication the same actual DNA sequence has to show an effect in the same direction.

                So statistical significance is not required to declare the replication a success now, it is only important for the initial screening?

                I am really more curious as to what process would lead to results like this. We see a consistent “effect” in sign only with little to no correlation in magnitude of effect.

                It is something that needs explaining to me, personally I suspect some form of p-hacking (did they do it with and without this “winners curse” adjustment, etc). Or perhaps it is from playing around with specifying the regression model they mention. But maybe there is some biological explanation everyone uses that I am unaware of.

                I’m not sure where you’re getting 50% from. Given a completely random sample of 162 (out of around 150000 unlinked SNPs), I definitely wouldn’t expect half to be found by chance in a GWAS study that identified around 1100 significant hits at p < 5e-8.

                The 50% only applies to studies with sufficient power to detect all the real differences at the threshold used. Once you have billions or trillions of samples the procedure will lead to the conclusion that every single gene will correlate with every single behavior. Of course at that point they will move the threshold to 5e-16 or whatever to get the right number of “real” correlations…

              • Patrick says:

                The 50% only applies to studies with sufficient power to detect all the real differences at the threshold used. Once you have billions or trillions of samples the procedure will lead to the conclusion that every single gene will correlate with every single behavior.

                There are indeed people who now believe this (the “omnigenic model” of complex traits). But they arrived at this hypothesis by looking at the results of GWAS studies with large sample sizes, not by assuming a priori that there was no such thing as a null effect in genetics. On the other extreme, for example, Mendelian traits also exist; there is a lot of room between “one gene” and “every gene” that could in theory explain most of a trait’s heritability. Even assuming that the omnigenic model is a good description of reality, it is still not clear that literally all alleles will have a truly non-zero effect on any trait: for example, there are synonymous mutations in protein-coding regions.

                More importantly for this specific example, though, there were a finite number of samples in this experiment, and not every SNP was actually significant at a threshold of p < 5e-8 in either study. If this set of ~150 SNPs were simply an arbitrary random sample of all true SNPs (still assuming here that every SNP has a "true" effect), then you would expect that SNPs would replicate at the rate of around ~1100 discoveries / ~75000 unlinked SNPs (with effects in the same direction) ~= 0.015, not 0.5 or 0.6.

                What we see in practice is that the threshold moves down from 0.1 ->.05 -> 0.01 -> 3e-7 -> 5e-8 depending on how cheap it is to collect the data required for the “right” amount of significant results to be yielded on average (“alpha is the expected value of p”).

                In the case of GWAS, 5e-8 was originally chosen to account for multiple testing, and has been used pretty consistently in studies of common variation. I haven’t seen lots of groups choosing their own ad hoc thresholds based on prior expectations in the way you’re describing.

                there are four possibilities at each nucleotide like you say

                SNP markers in genotyping arrays are typically chosen to be biallelic, at least in the population used to design the array.

                So statistical significance is not required to declare the replication a success now, it is only important for the initial screening?

                I think you’ve misunderstood me. I didn’t say that statistical significance *wasn’t* a criterion for replication success. I just said that which allele had a “positive” effect needed to be the same in both studies, as a necessary but not sufficient condition. Indeed, the authors looked at both statistical significance and effect sign.

              • Anoneuoid says:

                There are indeed people who now believe this (the “omnigenic model” of complex traits). But they arrived at this hypothesis by looking at the results of GWAS studies with large sample sizes, not by assuming a priori that there was no such thing as a null effect in genetics.

                Yes, I take that as an a priori principle about everything, not just genotype/phenotype. Literally everything correlates with everything else (to be sure, most of these correlations are unimportant/uninteresting) and stuff that does not is exceptional and interesting. If you use NHST, you are assuming an opposite principle.

                there are synonymous mutations in protein-coding regions

                There will still be slightly different affinities for various enzymes like polymerases, melting temps, etc. Plenty of reasons for some difference to arise.

                not every SNP was actually significant at a threshold of p < 5e-8 in either study.

                My understanding is all the ones checked for “replication” did meet that criterion in the first study.

                If this set of ~150 SNPs were simply an arbitrary random sample of all true SNPs (still assuming here that every SNP has a “true” effect), then you would expect that SNPs would replicate at the rate of around ~1100 discoveries / ~75000 unlinked SNPs (with effects in the same direction) ~= 0.015, not 0.5 or 0.6.

                Yes, there will be many false negatives if the study is underpowered. The sample size and variability determine the expected number of significant vs not results.

                In the case of GWAS, 5e-8 was originally chosen to account for multiple testing, and has been used pretty consistently in studies of common variation. I haven’t seen lots of groups choosing their own ad hoc thresholds based on prior expectations in the way you’re describing.

                It isn’t possible for individual groups to set the standard; the community for a given field sets it collectively, based on how many “discoveries” need to be published per year to keep getting funding or whatever. This isn’t something that is discussed openly; it just needs to happen, or that line of research will get shut down for being too productive (they look like fakers) or unproductive (they never learn anything new). The first person who looked at GWAS data saw all the “significance” at 0.05, decided this was unacceptable, and made the threshold more stringent.

                It seems to be about 1 over the average sample size (+/- 1-2 orders of magnitude), but it of course depends on how noisy the type of data tends to be and how wrong the null model usually is.

                Here is a great example of the thought process:

                I am wondering under what circumstances is it more appropriate to use an alpha value of .01 instead of the standard .05 (using a T-test for equal or unequal variances). I have some data in which almost every group is significantly different from almost every other group when using the alpha value of .05, but not .01. I am not well educated in statistics so any help would be greatly appreciated.

                https://www.researchgate.net/post/Should_my_alpha_be_set_to_05_or_01

                I am not saying there is anything wrong with the above thinking, except that it is based on the premise that statistical significance discriminates between “real” and “chance” correlations to begin with.

                SNP markers in genotyping arrays are typically chosen to be biallelic, at least in the population used to design the array.

                Thanks, so that explains one strange aspect of the chart? Any comment on the rest, i.e., what type of process would result in only a directional correlation while the magnitude is largely irrelevant?

              • Anoneuoid says:

                Here it is in black and white, “too many significant GWAS results” -> “more stringent threshold”:

                Sequencing studies lead to an increased number of low-frequency (0.5%<MAF<5%) and rare (MAF<0.5%) variants, arguing for a more stringent statistical threshold for association testing in studies utilizing sequence data.

                https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4970684/#bib1

            • Matt says:

              Lee et al. discuss the logic of their replication procedure in section 1.10 of the supplement. The table in that section demonstrates that the results are consistent with true positive effects.

              I would think the p-value in an exact replication when the hypothesis is true will be larger (less significant) than the original p-value 50% of the time, so your suggestion that all p-values in the replication should be expected to be below the 5 * 10^-8 threshold is surely erroneous. Only a bit over half should be that low. (The replication by Lee et al. of course isn’t exact, e.g. the sample size is larger.)

              I’m not sure what’s going on in Fig. S3.

              Whether you choose the “plus” or “minus” alleles as reference alleles makes no difference to the results of a GWAS. For example, in Lee et al. the effect size for having T rather than C in the rs7623659 locus was 0.02899. Someone with two T alleles is expected to get 2*0.02899=0.05798 units more education than someone with two C alleles. You could choose to model the effect in terms of C instead, in which case those with two C alleles would be expected to get 0.05798 units less education than someone with two T alleles. The two ways of modeling the effect are completely equivalent. (I think the unit in this analysis is years of education, so 0.05798 units is about 21 days.)
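
              To see the reference-allele point concretely, here is a toy sketch with a made-up allele frequency and effect size (not the Lee et al. estimates): recoding genotypes from “count of T” to “count of C” flips the sign of the estimated effect and changes nothing else.

                import numpy as np

                rng = np.random.default_rng(0)
                n = 1_000_000
                freq_T = 0.4                                   # hypothetical allele frequency
                t_count = rng.binomial(2, freq_T, n)           # 0, 1 or 2 copies of the T allele
                c_count = 2 - t_count                          # the same genotypes coded as copies of C
                y = 0.029 * t_count + rng.normal(0.0, 1.0, n)  # phenotype with a small additive effect of T

                slope_T = np.polyfit(t_count, y, 1)[0]
                slope_C = np.polyfit(c_count, y, 1)[0]
                print(slope_T, slope_C)                        # same magnitude, opposite sign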

              • Anoneuoid says:

                I would think the p-value in an exact replication when the hypothesis is true will be larger (less significant) than the original p-value 50% of the time, so your suggestion that all p-values in the replication should be expected to be below the 5 * 10^-8 threshold is surely erroneous

                Can you expand on this? I can think of one consequence if you are correct:

                There are many people/projects (the cancer replication project, the psych replication project, just in general) who have been defining “successful replication” as “statistically significant in the same direction”. While I don’t think that is a good definition for other reasons, you seem to be claiming they should only expect a ~50% success rate using that definition.

              • Matt says:

                Here’s Senn (2002) on expected p-values in replications:

                Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.

                So, if the p-value in the first study is 5*10^-8 and the effect size is correct (i.e. the true population value), the p-value in an exact replication will be less than 5*10^-8 in 50% of replication attempts. Note that this is about how well a particular p-value replicates, NOT whether the effect replicates at some conventional level such as 0.05, so your comment about a 50% success rate is mistaken.
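
                For anyone who wants to check that 50% figure numerically, here is a minimal simulation sketch, assuming a simple one-sided z-test and that the true effect equals the one observed in the first study, so the first p-value sits exactly at alpha:

                  import numpy as np
                  from scipy import stats

                  alpha = 5e-8
                  z_alpha = stats.norm.isf(alpha)        # z-value whose one-sided p is exactly alpha

                  rng = np.random.default_rng(0)
                  # Exact replications: the replication z-statistic is centred on the original one.
                  z_rep = rng.normal(loc=z_alpha, scale=1.0, size=1_000_000)
                  print((z_rep > z_alpha).mean())        # about 0.5: only half reach p < alpha again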

                The real replication test is not about p-values but about how the polygenic effects replicate across contexts and education polygenic scores pass that test just fine, e.g. https://www.pnas.org/content/115/31/E7275

              • Anoneuoid says:

                So, if the p-value in the first study is 5*10^-8 and the effect size is correct (i.e. the true population value), the p-value in an exact replication will be less than 5*10^-8 in 50% of replication attempts.

                I haven’t worked this out for myself but it seems like an interesting point. However, in this study the “lead SNP” p-values were all less than 5e-8, not equal to it.

                Note that this is about how well a particular p-value replicates, NOT whether the effect replicates at some conventional level such as 0.05, so your comment about a 50% success rate is mistaken.

                In the paper they are concerned with passing the threshold, as was I. I didn’t realize you changed the subject. Rereading I do see you did compare new vs old p-values, but then switch to saying “a bit over half should be below the threshold”:

                I would think the p-value in an exact replication when the hypothesis is true will be larger (less significant) than the original p-value 50% of the time, so your suggestion that all p-values in the replication should be expected to be below the 5 * 10^-8 threshold is surely erroneous. Only a bit over half should be that low.

                Where does “bit over half should be that low” (below the 5e-8 threshold) come from?

                I’m not sure what’s going on in Fig. S3.

                Can anyone explain this? This is the type of stuff that drove me nuts when I did medical research: they would just present the most bizarre-looking data as if it was totally normal, all the time.

              • Matt says:

                In the paper they are concerned with passing the threshold, as was I. I didn’t realize you changed the subject. Rereading I do see you did compare new vs old p-values, but then switch to saying “a bit over half should be below the threshold”:

                I didn’t change the subject. You mistakenly thought that I was claiming that the criterion “statistically significant in the same direction” is expected to be met in replications of true effects only 50% of the time. My actual claim is that the p-value in such a replication is expected to be smaller than the original p-value 50% of the time, and larger the other 50%. Therefore, replications will be statistically significant at the 5% level half the time only if the original p-value was 0.05.

                Lee et al. compare the p-value distribution in their replication study to a theoretically expected distribution of replication p-values. For example, the theoretical expectation was that 79.4 (SD=3) p-values would be less than 5*10^-8 whereas the observed frequency was 97. They offer some explanations as to why the observed p-value distribution doesn’t exactly match the theoretical one, but in any case the observed distribution is entirely incompatible with the idea that many of the genome-wide significant SNPs found in Okbay et al. (2016) were false positives.

                Where does “bit over half should be that low” (below the 5e-8 threshold) come from?

                As you said, the p-values in the replication were all below the threshold rather than equal to it, so a bit over half of the p-values in an exact replication (which Lee’s wasn’t) should be below the threshold, again assuming that the effect sizes were estimated without bias in the original study. (Lee et al. do not assume that the original effect sizes are unbiased. Rather, they shrink them to adjust for the winner’s curse, or regression toward the mean. The fact that they use these shrunken effect sizes may, or may not, explain something about Fig. S3 as well.)
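
                A rough sketch of the winner’s-curse point, with invented effect sizes and standard errors (not the Okbay or Lee et al. numbers): SNPs selected for significance in a noisier discovery study have inflated estimates on average, and not all of them clear the same threshold again in a replication.

                  import numpy as np
                  from scipy import stats

                  rng = np.random.default_rng(0)
                  n_snps = 200_000
                  true_beta = rng.normal(0.0, 0.01, n_snps)  # invented small true effects for every SNP
                  se1, se2 = 0.004, 0.003                    # invented discovery / replication standard errors

                  beta_hat1 = true_beta + rng.normal(0.0, se1, n_snps)
                  p1 = 2 * stats.norm.sf(np.abs(beta_hat1) / se1)
                  hits = p1 < 5e-8                           # "genome-wide significant" in the first study

                  # Winner's curse: among the hits, estimates overshoot the true effects on average.
                  print(np.abs(beta_hat1[hits]).mean(), np.abs(true_beta[hits]).mean())

                  # Even a replication with a smaller standard error does not push every hit past 5e-8 again.
                  beta_hat2 = true_beta[hits] + rng.normal(0.0, se2, hits.sum())
                  p2 = 2 * stats.norm.sf(np.abs(beta_hat2) / se2)
                  print((p2 < 5e-8).mean())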

              • Matt says:

                “the p-values in the replication were all below the threshold rather than equal to it”

                This should read: the p-values in the ORIGINAL STUDY were all below the threshold rather than equal to it

          • Anoneuoid says:

            in any case the observed distribution is entirely incompatible with the idea that many of the genome-wide significant SNPs found in Okbay et al. (2016) were false positives.

            I agree, they were not false positives. They are “true positives”. The problem is all the “non-significant” SNPs were false negatives due to insufficient sample size for the chosen threshold.

        • Nicola says:

          Actually, if you bother reading any serious GWAS study from, say, the last 10 years, you will find that they all contain replication; in fact, no one will accept a GWAS paper without it.

  4. “For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30. Whether a P value is small or large, caution is warranted.”

    What is understood by “not be very surprising”? A little simulation study suggests that this situation would happen less than 1 in 20 times…

    Am I missing something here?
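
    For reference, here is a minimal version of that simulation, assuming two independent two-sided z-tests, each with 80% power at alpha = 0.05 (numbers chosen only to match the quoted scenario). It gives roughly 4-5%, i.e. a bit under 1 in 20, consistent with the figure above.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      n_sim = 1_000_000

      # Effect size (in z units) giving 80% power at the two-sided 0.05 level.
      delta = stats.norm.ppf(0.975) + stats.norm.ppf(0.80)   # about 2.80

      z1 = rng.normal(delta, 1.0, n_sim)
      z2 = rng.normal(delta, 1.0, n_sim)
      p1 = 2 * stats.norm.sf(np.abs(z1))
      p2 = 2 * stats.norm.sf(np.abs(z2))

      split = ((p1 < 0.01) & (p2 > 0.30)) | ((p1 > 0.30) & (p2 < 0.01))
      print(split.mean())   # roughly 0.045, i.e. a bit under 1 in 20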

  5. Thomas Passin says:

    There’s the p-value, and then there are other ways to figure significance. They are not necessarily equivalent. The p-value, in particular, is probably one of the noisiest statistics you could find. With a standard deviation of around 0.24, it’s hard to know what to conclude just because your sample estimate is, say, p = 0.035. Does that tell us that the “true” p-value would have been in the range [0, .48]? Hard to say exactly…
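
    As a rough illustration of how jumpy the p-value is as a statistic, here is a small simulation sketch using an arbitrary two-sample design with a true standardized effect of 0.55 (numbers chosen only for illustration):

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n, delta, n_sim = 30, 0.55, 100_000   # per-group sample size and true standardized effect

      a = rng.normal(0.0, 1.0, (n_sim, n))
      b = rng.normal(delta, 1.0, (n_sim, n))
      pvals = stats.ttest_ind(a, b, axis=1).pvalue

      # Same design, same true effect, yet the p-value ranges over orders of magnitude.
      print(np.percentile(pvals, [10, 25, 50, 75, 90]))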

    Confidence bands, whatever you want to call them, based on e.g. 2-sigma, are at least more stable, less subject to noise. I would say that almost all of us, looking at a set of data, would say that a sample result within 0.3 standard errors of another has poor support for the claim that the two are much different. Conversely, results 5 S.E. apart would be convincing to almost everyone, or at least would set off a hunt for systematic bias and error. As they should.

    I conclude that the problem is not exactly with “statistical significance” per se, but with 1) the use of a very noisy way to estimate it, and 2) a desire to force hard decisions out of data that really aren’t able to support them.

    A corollary is that if a decision has to be made based on data that is less well supported by its statistical qualities, then revisiting it from time to time is essential, to see if the consequences have held up. Of course, for politically charged issues, this is often nearly impossible.

    And let me end with my own pet little peeve – papers where you can’t tell if they are using the standard deviation or the standard error. Grrr!

    • Corey says:

      What do you mean by “true” p-value? I’m seeking a mathematical definition here. My confusion as to what you mean by the phrase is due to the fact that the p-value is a random variable — pre-data it has a distribution and post-data it is a single realized value. (Do you mean the expectation of the p-value? This would indeed be a function of the unknown parameter(s), but it would also depend on the sample size.)

      • Thomas Passin says:

        @Corey: “What do you mean by “true” p-value?”

        I’m talking about estimating the value of a statistic vs getting the population-wide value. The p-value is used to talk about how likely it is that observed differences might or might not be found by random variations alone. But you can only ever get an estimate of the p-value because it is ultimately based on the sample standard deviation (or something more or less equivalent), which is only an estimate for the population standard deviation.

        • A p value is the frequency of getting certain kinds of data from a particular RNG. The RNG you choose is an arbitrary choice; it’s up to you what hypothesis you want to test, so it is not random. The data you test it on is observed. There is nothing random about the p value itself. The only randomness involved is the hypothetical repetitions of data collection you imagine you might perform.

          • Daniel, WOW, if you had rendered that definition to me in my statistics class, I would be scratching my head. It bolsters the need to standardize definitions. If I compare your definition and the one I had in my 1st-year statistics class, I’d be like ummmmm come again?

          • Carlos Ungil says:

            The point is that the p-value (a statistic calculated from the data) is random in the same sense that the average height of individuals from a population is random (you could say that there is nothing random about the average itself).

            And Thomas seems to think that, in the same way that the average height is an estimator of the true average height in the population, the p-value is an estimator of a true “something”.

            • The average height of a sample of individuals is frequentist random in the sense that repeat samples give different values. But the p value as calculated is a definite number, and it is the only number of relevance for the sample. Just as if I asked you what is the sample average, and you started talking about a mystical true sample average…

              The sample average has a connection to a population average, but it exactly answers the question “what is the average of these data points?”

              Similarly the p value exactly answers the question “how often would this chosen random number generator produce data more extreme than these data points”
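
              A toy version of that reading, with made-up data: compute the p-value directly as the frequency with which the chosen RNG (here, a standard normal) produces a test statistic at least as extreme as the observed one, and compare it with the textbook t-test formula.

                import numpy as np
                from scipy import stats

                rng = np.random.default_rng(0)
                observed = np.array([0.8, 1.1, -0.2, 1.5, 0.3, 0.9, 1.2, 0.1])   # made-up data
                t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(len(observed)))

                # The chosen RNG: standard-normal samples of the same size as the data.
                sims = rng.normal(0.0, 1.0, (1_000_000, len(observed)))
                t_sim = sims.mean(axis=1) / (sims.std(axis=1, ddof=1) / np.sqrt(len(observed)))

                print((np.abs(t_sim) >= np.abs(t_obs)).mean())            # p as a frequency under the RNG
                print(2 * stats.t.sf(abs(t_obs), df=len(observed) - 1))   # textbook t-test p, about the same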

              • I guess you might say that the PROCEDURE to observe some data, and *choose a hypothesis to test* based on an observed SD or skewness or whatever, results in you testing a randomly selected hypothesis, and that the hypothesis you “really” care about is one where the SD or the skewness or whatever is exactly known…

                the relevance of that problem can be simulated, I guess… I doubt it will turn out to be terribly important. In some ways that was the purpose of the t-test: to average over the sampling distribution of the SD. It makes a big difference for very small samples (like 1 or 2 or even 5) but becomes much less relevant for 10, and by 25 the discrepancy between a known-SD test and the t-test is usually thought of as irrelevant.

                Is there a “t-test”-style test you can use for other distributions? Yes. None of this addresses actually important questions in my mind. The sampling distribution of the chosen hypothesis is hardly the big problem with p values.

              • Anonymous says:

                > But the p value as calculated is a definite number

                The average height of a sample of individuals is a definite number. They are not different in this regard.

                I agree that the difference is that the sample average has a clear connection to a characteristic of the population (it’s an estimator of the population average) while the p-value doesn’t (it’s some complex thing related to the model… or to an RNG if you’re so inclined).

          • Christian Hennig says:

            Of course if you understand probability in a frequentist way the p-value is random. You assume some underlying model and generate some data randomly, then the p-value is just a statistic of the data, a random variable that has a certain distribution and takes a fixed value once the data are observed like every statistic/RV.
            (As always, my take on frequentist probability is *not* that the model is necessarily true but rather that it is designated to be the basis of probability calculations because we are interested in how the world would be *if* the model was true, just to nail down something that allows us to do calculations.)

            What complicates matters is that you can look at two underlying distributions, (a) the distribution P0 that you want to test and (b) the distribution Q that you take in order to compute the distribution of the p-value.
            You can be interested in the case P0=Q, i.e., what happens if the H0 is true, in which case the p-value is often but not always uniformly [0,1]-distributed; but you can also be interested in the distribution of the p-value, computed for checking P0 against the data, if in fact the underlying distribution is something other than P0, although the p-value is still defined as quantifying the relation between the data and P0.

            Chances are that people who talk about the “true p-value” mean the situation that Q is the true underlying distribution, usually not precisely P0. However, if you believe like me and most Bayesians that such a “true” distribution doesn’t exist, talk about a “true p-value” doesn’t make sense in my book.
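
            A small sketch of those two situations, using a one-sample t-test of “mean = 0” (so P0 is a zero-mean normal and Q is a normal shifted by an arbitrary 0.3): under P0 the p-value is uniform, while under Q it piles up near zero.

              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(0)
              n, n_sim = 50, 100_000

              x_p0 = rng.normal(0.0, 1.0, (n_sim, n))   # data actually generated under P0 (mean 0)
              x_q  = rng.normal(0.3, 1.0, (n_sim, n))   # data actually generated under Q (mean 0.3)

              p_under_p0 = stats.ttest_1samp(x_p0, 0.0, axis=1).pvalue
              p_under_q  = stats.ttest_1samp(x_q, 0.0, axis=1).pvalue

              print((p_under_p0 < 0.05).mean())   # about 0.05: under P0 the p-value is uniform
              print((p_under_q < 0.05).mean())    # much larger: under Q the p-value piles up near zero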

            • Corey says:

              I don’t think Passin means “the situation that Q is the true underlying distribution, usually not precisely P0” — or at least, not just that. He was pretty clear in expressing that he understands the p-value as a quantity with a sampled value and a distinct “true” population value, just as the notion of “standard deviation” covers both the concept of the population standard deviation as a particular functional of a distribution and the sample standard deviation as an estimator of the population standard deviation. I was trying to get down to brass tacks on exactly what that would mean in the simplest possible context, mostly because I think he’s confused and I’m trying to do the Socratic dialogue thing but also because it’s possible that I’m the confused one because I’m misinterpreting him.

  6. Harlan Campbell says:

    “We’re frankly sick of seeing such nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews and instructional materials. An interval that contains the null value will often also contain non-null values of high practical importance. That said, if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’.”

    It seems to me that “if you deem all of the values inside the interval to be practically unimportant” could be improved with something like: “if in a pre-specified analysis plan (i.e., before seeing the data) you identified a range of values considered to be practically unimportant, you might be able to say…”

    I share Ioannidis’ concerns with people describing the data “as they wish, kind of impromptu and post-hoc” and point you to a longer discussion on this:
    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0195145

  7. Jonathan (another one) says:

    Ioannidis’ argument is that if you take away a flawed decisionmaking tool, people will resort to making decisions with no tools at all, and further that they will be listened to. If true, this is the result of three separate problems: (a) the ultimate decisionmakers don’t understand the arguments being made by the advocates; (b) the advocates themselves can’t tell good evidence from bad evidence (or perhaps lack the incentives to inquire carefully into the difference); and (c) there is no trained surrogate for the decisionmaker whose job it is to play “null’s advocate.” There is literally nothing that can ever be done about (a): it is unrealistic to expect those who have the power to make decisions to have stopped along the path to power to put in the hard work to become scientific analysts. There is little that can be done about (b): as has been pointed out many times, it is very difficult to get someone to see something that it is not in his interest to see. But there is lots that can be done about (c). Imagine if, in getting a drug approved, the applicant had to spend a sum of money equal to the sum spent on the trial on someone else working on behalf of, say, the established drug he wished to supplant. Both of them make reports to the decisionmaker in language that the decisionmaker can (hopefully) understand, making as many salient points about the uncertainty of the applicant’s analysis as they can cogently express. What neither side can do in this case, however, is blithely say “I’m right. Look at the p value,” because that position has no intellectual responsibility; and this piece, even as the product of a mob, demands to have attention paid.

    So to answer Ioannidis: Yes, decisions must be made, and if decisionmakers are going to depend on a smooth-talking PNAS-level purveyor, that’s on them. No amount of p-values is going to stop the ignorant from falling into error. And sure, people argue for what they want to argue, but the counter for sloppy argument is focused counterargument. If decisionmakers don’t want to hear counterargument (and to hear it they’d have to pay for it), they deserve what they get. But cogent counterarguments, arguments that can be understood, oughtn’t just blithely use p-values either, because the arguments from faulty p-values are no better for the null than they are for the alternative.

    • Phil says:

      Yes, it’s striking that Ioannidis seems to think that a decision-making rule that doesn’t rely on a p-value is no rule at all: you’ll have people just making up claims and nobody will have any idea of what’s true or not. It just seems like a ridiculous argument.

    • Steve says:

      The problem with this is that regulators need decision rules that they can apply uniformly. There are smart people at the FDA and other regulators, who understand science and statistics, but they have to answer to elected representatives and courts. They have to be able to explain their decisions not just in terms of the science but in terms of why they are applying the same standards to everyone. You cannot have completely different decision rules for each individual case. We are a country of laws. Right now p-values and significance testing play a role in those decisions. Any decision rule is going to provide an arbitrary cut point for what studies get rejected as evidence, but the regulators have to have a decision rule. What can replace statistical significance that will provide a better decision rule?

      • Jonathan (another one) says:

        This is a strawman. First, while bad rules are probably better than no rules, maybe they are not. But no one is saying that everyone isn’t subject to the same criteria: (a) do good work; (b) explain it well; (c) make all your data and programs freely available to anyone who wants to dispute it, who can then do their own good work and make their own better explanation.

      • Martha (Smith) says:

        Steve said,
        “The problem with this is that regulators need decision rules that they can apply uniformly. There are smart people at the FDA and other regulators, who understand science and statistics, but they have to answer to elected representatives and courts. They have to be able to explain their decisions not just in terms of the science but in terms of why they are applying the same standards to everyone. You cannot have completely different decision rules for each individual case. We are a country of laws. Right now p-values and significance testing play a role in those decisions. Any decision rule is going to provide an arbitrary cut point for what studies get rejected as evidence, but the regulators have to have a decision rule. What can replace statistical significance that will provide a better decision rule?”

        What you are describing is often called a “bright line rule”. See https://en.wikipedia.org/wiki/Bright-line_rule for discussion of the controversy (at least in the U.S.) about requiring “bright line rules”.

      • Andrew says:

        Steve:

        As I wrote in another comment on this thread: If we need rules, let’s have rules. But why should the rules be defined based on the tail-area probability with respect to a meaningless null hypothesis? That’s just weird. They could have rules based on minimum sample size, minimum accuracy of measurement, maximum standard error of estimation, things that are more relevant to the measurement and inference process.

        See also my P.S. in the above post.

  8. Anonymous says:

    “I don’t think it’s such a bad idea that people listen to what I say, in part on the strength of my carefully-written books”

    You know what would be at least 100x more effective than publishing an opinion piece in Nature? Putting all the early warnings and recommendations by, e.g., Meehl, Cohen, Ioannidis, and Tukey, and later stuff by Simonsohn, Simmons, Gelman, Greenland, etc., into all your stats TEXTBOOKS and SYLLABI! Why? Because students mostly learn from textbooks and assigned readings, and your students will hold your feet to the fire. Mine certainly do. Virtually no grad students want to go deeply in debt and waste years of their lives pursuing a fraud. Who busted Hauser? His students.

    Review your colleagues’ stats textbooks. If they don’t have a thorough discussion of the replication crisis and all the factors that contribute to it, give them a bad review. Request your colleagues’ syllabi. If they don’t contain a thorough intro to the replication crisis, ask them why. Make it impossible for students not to know this stuff.

    (And by “you”, I mean everyone teaching a stats course or writing a stats textbook.)

    • Ben Prytherch says:

      +1

      My students probably get sick of hearing me bang on about this. But I know they’re not gonna hear it from anyone else, so I bang on.

    • Andrew says:

      Anon:

      We put some of this stuff in Regression and Other Stories, which is the first book I’ve written since becoming fully aware of these issues.

      • Ben Prytherch says:

        Andrew, you’re probably sick of being asked this, but any idea on when that will come out? I’m teaching a 2nd semester applied methods course and I would love to have a textbook that covers all the standard regression models and also includes “grown up” treatments of significance.

    • Bob says:

      This times 1 billion. I’m sick of hearing about students ‘not understanding p values’. Who is teaching them? How do you pass a stats course and not understand p values? Why don’t these students fail? Lazy and incompetent lecturers, that’s the problem.

      I’ve just finished watching McElreath’s most recent lectures and he does this continually: “I hate this term,” “students don’t understand that term,” blah blah blah. He never actually tries to teach the damn things correctly.

      Easy solution.
      1. Teach the damn stuff correctly.
      2. Give the stats students a short, 5-10 min oral examination. You don’t know, you fail.

      Have some standards.

      • Ben Prytherch says:

        Problem is, you can teach it 100% correctly and ask them loads of questions that all fall within the framework as you’ve taught it, and good students will:

        a) Answer the questions correctly

        b) Think that the p-value is the probability of the null, that non-significant results “were likely due to chance”, etc.

        The problem is that this whole method of analysis is counter-intuitive. I’m going to take my research hypothesis, come up with a specific version of “not my hypothesis”, and then calculate the probability of getting my data (or data more “extreme”) if that version of “not my hypothesis” is true. If this probability is small enough, I reject “not my hypothesis” and conclude that I have evidence for my hypothesis.

        Are we surprised that students take this method and rework it in their heads as “I’m calculating the probability that my hypothesis is wrong”?

        • Martha (Smith) says:

          “The problem is that this whole method of analysis is counter-intuitive.”

          Yes, but since it’s so widely used, we need to teach it *and* emphasize that it is indeed counterintuitive, and that it’s often misused, etc., etc.
          (e.g., if a student asks, “But if it’s so counterintuitive, why do people use it?”, we need to reply honestly that sometimes people do things just because “That’s the way we’ve always done it,” which makes life more difficult for everyone involved.)

        • Bob says:

          Then examine them to see if that is what they are doing; if so, they fail. This isn’t hard.

          Stop allowing ‘scientists’ to treat stats 101 as a bare minimum requirement they have to fulfill before they go and ‘do science’.
          These people are ill educated and are being let loose on problems that really matter. Who is accrediting them?

        • Ben Prytherch says:

          Martha: I fully agree. Sometimes I feel conflicted about simultaneously teaching and criticizing a method. I don’t want to breed cynicism, and I also don’t want to breed credulity. It’s a tough line to walk.

          Bob: I admire your faith in the power of STAT 101 exams to ensure that future scientists thoroughly understand NHST.

    • Peter Dorman says:

      When I was teaching stats in our masters program a few years back I definitely included a number of these arguments and stressed the dangers of relying on p<.05. When students handed in statistical work, or just reviewed other studies in their lit reviews, everything I said was lost. I'll take that up again in a later comment.

      • Jeff Walker says:

        I share Peter’s experience. Everything I teach is untaught by 1) advisor training and 2) nearly the entire literature that students read.

        • Martha (Smith) says:

          So we need to work on
          1) educating the advisors
          2) educating students to *critically* examine the literature in their fields (in particular, teach them to question “That’s the way we’ve always done it”).

          (I don’t claim these tasks are easy, but they’re important.)

          • I speculate that the critical thinking needed to rectify some of these developments in statistics reflects a much larger educational shortfall that has not been articulated all that well. Some tech thought leaders seem to be aware of it. So that is something that some of us are looking at. It starts with a hunch sometimes.

            • Brent Hutto says:

              I completely agree with your point although I can’t seem to talk about this critical thinking “shortfall” without sounding like Grandpa Simpson.

              • Cute. You may have read or heard that most transformative ideas are honed through very messy thought processes. Epiphanies, hunches, etc. Fluid and crystallized intelligence. Some scientists at a conference in Boston emphasized that the dearth of creativity in universities was considered a large risk.

        • In the courses I taught at Duke, it was the tutors who tried to undo it, insisting that the students definitely had to decide whether to reject or not. I was able to change their minds quickly, maybe only because they were informed they had to (and maybe they reverted once I was gone). But it underlines the evils of inertia and culture.

          I also discussed in the class the likely extreme contrast with what they saw being done in all the literature and even by the professors in their other courses. It seemed to just make them uncomfortable. And perhaps just coincidentally, the next year one of the larger disciplines started to teach their own stats course.

      • Bob says:

        Did you fail these students? If not, then you are the problem.

        • Jeff Walker says:

          Wow Bob, I hope you aren’t a professor!

          1. My experience is with students coming to me after they’ve taken my class, and their questions entirely focused on p-values and “which hypothesis test”. How can I fail these students after they’ve finished the class?
          2. Even if I had this experience while they were a student in my class, I cannot fail them on material outside of class
          3. Even if I had this experience on assignments in the class, this isn’t cause for failure (this would be weighting this concept ginormously compared to all the other concepts, the class isn’t called “Use and misuse of p-values”)

          • Anonymous says:

            So where is the mystery here? A student can get through your class while lacking a basic understanding of the most fundamental of concepts? Is it any wonder these things are misapplied?

            The subject is poorly taught; that’s why you have this problem. Yet there’s a lot of chin-scratching about what to do.

            No, I’m not a professor, but I interview a lot of people who claim to have training in statistics, sometimes to master’s level. It takes < 60 seconds to establish when they don’t. I’m not mystified by it; I’ve sat in enough lecture halls and passed enough exams to know why. What baffles me is the reluctance of the teaching profession to examine their own role in all of this.

            • Jeff Walker says:

              Anonymous: what about “this would be weighting this concept ginormously compared to all the other concepts, the class isn’t called “Use and misuse of p-values”” did you not understand? The class is not about hypothesis testing but statistical modeling using lm, glm, and linear mixed models. Apparently your ignorance of my class doesn’t inhibit you from anonsplaining my problems to me.

              Were we all Anonymous, we’d have no need for this discussion, or even this blog.

              • Bob says:

                If you make a mistake on your driving test on something basic that’s considered to be fundamental to being a safe driver, you fail the test. You’re not safe to drive.

                What’s the point of teaching glm and linear mixed models, if students don’t understand something basic like p-values? How can you certify that these students understand statistics but not p-values? They’re not some niche concept for crying out loud.

                Were the academic profession to teach the damn things correctly, there’d be no need for this discussion either. But I don’t hold out much hope that they’ll take any responsibility.

              • Andrew says:

                Bob:

                You can take a look at Regression and Other Stories when it comes out. But, the short story is that I don’t think p-values are “basic” or at all important, except for what one might call sociological reasons, that there are researchers who use those methods. When I teach statistics, I spend very little time on p-values, really only enough to explain why I think the way they’re usually used is mistaken. If someone takes my class, which covers lots of things, and happens not to catch exactly what p-values are . . . ok, I’d be a little disappointed, as I want them to understand everything. But it’s not such a big deal. If they took my class and ended up analyzing data by looking at which results are statistically significant and which are not, then I’d be super-sad, because that would imply that, whether or not they understand the mathematical definition of a p-value, they’re not “safe to drive,” as you put it.

              • You can’t teach p-values correctly because they don’t make sense. Andrew has the right approach: Spend a little time on them to explain why you shouldn’t use them. Then teach things that do make sense and work.

              • Chris Wilson says:

                + 1 to Andrew and Jeff. The trouble is to change the culture of practice in the whole scientific community, hence the Nature letter, which I happily co-signed. There is only so much you can do in the classroom alone when students have advisors to report to, papers to publish, etc. Funnily enough, this whole “they must understand p-values the way I want or fail” is itself an instance of a dubious dichotomous decision based on p-values ;) Jokes aside, I’ve seen plenty of card-carrying statisticians abuse the heck out of NHST, have no awareness of type M/S errors, flagrantly misrepresent Bayesian statistics, etc. Change is sometimes frustratingly slow; we shouldn’t mete out excessive punishment to our students because of it.

              • Christian Hennig says:

                What I honestly think is that the whole concept of probability is not trivial at all, but will always be misunderstood by most and give headaches to the few who understand it a bit better. This applies to p-values, confidence intervals, Bayesian inference, likelihood, imprecise probabilities, you name it.

                Yes, I agree that p-values are systematically misunderstood. I agree that they are often taught badly, but I also agree that it is very hard if not impossible to teach them in such a way that it all makes smooth sense. Except that I think that alternative approaches are no better in this respect. Our discipline has fundamental disagreements even about what the most basic term “probability” means, so how can we just say “let’s just teach the students stuff correctly, and then if they don’t get it let’s fail them”? Statistics is hard, hard even for those who teach it.

                Actually I think that the concept of a statistical test and a p-value is rather brilliant as a challenge of our thinking and dealing with uncertainty. I love to teach it. However, part of its brilliance is to appreciate what’s problematic about it and for what (good) reasons people write papers titled “Abandon Statistical Significance”.

                Major issues are that the concept of probability always refers to the question “if we hadn’t observed what we actually have observed, what else could it have been”, i.e., it refers to something fundamentally unobservable, and that statistics, when done with appropriate appreciation of uncertainty (be it Bayesian, frequentist or whatever), doesn’t give the vast majority of non-statisticians what they want, which in the real world unfortunately may put our jobs at risk.

            • Jeff Walker says:

              Andrew said “But it’s not such a big deal. If they took my class and ended up analyzing data by looking at which results are statistically significant and which are not, then I’d be super-sad”

              This is what I meant by “1. My experience is with students coming to me after they’ve taken my class, and their questions entirely focused on p-values and ‘which hypothesis test’.” As I said, my class is on statistical modeling, not on NHST. I emphasize effect estimation and uncertainty. I emphasize the need to understand the biological consequences of the effect size and whether these consequences are consistent or differ radically across the CI. Some ecological theory may predict interactions of the same magnitude as a main effect, but if the CI includes small values of the interaction, then we should embrace that uncertainty.

              I tend toward Greenland’s take on this. I think this https://www.biorxiv.org/content/10.1101/458182v2 evolved out of my class and is a pretty good summary of some goals of my class. If you have criticism, criticize that pre-print (constructive criticism is always welcome).

              • In some discussions, I am skeptical or, should I say, wary of the emphasis on ‘uncertainty’ in the contexts that some of you raise. The underlying question? What would have been the case had researchers exercised the hefty recommendations that you all circulate? There are what? 100 million research papers out there.

                Cautioning students about ‘uncertainty’ also seems to imply that even with the best of our research efforts, we can’t really deliver very good to stellar findings. I think an outsider to your enterprises may see this. Certainly one that has few or no conflicts of interest. And Sander’s comments on Twitter and in discussions also reflect this as an underlying theme. Then he lists the reasons for this.

                This is just astounding to me. Why not go into another field that may deliver some value? When I was back in Cambridge, at an MIT conference, I think I made the same or a similar case. And a few thought that was a very fair question. I mean to suggest that the concerns we express today were salient 25-30 years ago. I read through some of the ASA articles about Stat Significance. I believe that the overall speculation I have is that the research focus is still too narrow. I had hoped that the special edition was going to cover new ground.

  9. Phil says:

    > in 15 years of blogging, I’ve had to delete less than 5 out of 100,000 comments—that’s 0.005%!

    That sounds significant.