“Retire Statistical Significance”: The discussion.

So, the paper by Valentin Amrhein, Sander Greenland, and Blake McShane that we discussed a few weeks ago has just appeared online as a comment piece in Nature, along with a letter with hundreds (or is it thousands?) of supporting signatures.

Following the first circulation of that article, its authors and some others of us had an email discussion that I thought might be of general interest.

I won’t copy out all the emails, but I’ll share enough to try to convey the sense of the conversation, and any readers are welcome to continue the discussion in the comments.

1. Is it appropriate to get hundreds of people to sign a letter of support for a scientific editorial?

John Ioannidis wrote:

Brilliant Comment! I am extremely happy that you are publishing it and that it will certainly attract a lot of attention.

He had some specific disagreements (see below for more on this). Also, he was bothered by the group-signed letter and wrote:

I am afraid that what you are doing at this point is not science, but campaigning. Leaving the scientific merits and drawbacks of your Comment aside, I am afraid that a campaign to collect signatures for what is a scientific method and statistical inference question sets a bad precedent. It is one thing to ask for people to work on co-drafting a scientific article or comment. This takes effort, real debate, multiple painful iterations among co-authors, responsibility, undiluted attention to detailed arguments, and full commitment. Lists of signatories have a very different role. They do make sense for issues of politics, ethics, and injustice. However, I think that they have no place in choosing and endorsing scientific methods. Otherwise scientific methodology would be validated, endorsed and prioritized based on who has the most popular Twitter, Facebook or Instagram account. I dread to imagine who will prevail.

To this, Sander Greenland replied:

YES we are campaigning and it’s long overdue . . . because YES this is an issue of politics, ethics, and injustice! . . .

My own view is that this significance issue has been a massive problem in the sociology of science, hidden and often hijacked by those pundits under the guise of methodology or “statistical science” (a nearly oxymoronic term). Our commentary is an early step toward revealing that sad reality. Not one point in our commentary is new, and our central complaints (like ending the nonsense we document) have been in the literature for generations, to little or no avail – e.g., see Rothman 1986 and Altman & Bland 1995, attached, and then the travesty of recent JAMA articles like the attached Brown et al. 2017 paper (our original example, which Nature nixed over sociopolitical fears). Single commentaries even with 80 authors have had zero impact on curbing such harmful and destructive nonsense. This is why we have felt compelled to turn to a social movement: Soft-pedaled academic debate has simply not worked. If we fail, we will have done no worse than our predecessors (including you) in cutting off the harmful practices that plague about half of scientific publications, and affect the health and safety of entire populations.

And I replied:

I signed the form because I feel that this would do more good than harm, but as I wrote here, I fully respect the position of not signing any petitions. Just to be clear, I don’t think that my signing of the form is an act of campaigning or politics. I just think it’s a shorthand way of saying that I agree with the general points of the published article and that I agree with most of its recommendations.

Zad Chow replied more agnostically:

Whether political or not, signing a piece as a form of endorsement seems far more appropriate than having papers with mass authorships of 50+ authors where it is unlikely that every single one of those authors contributed enough to actually be an author, and their placement as an author is also a political message.

I also wonder if such pieces, whether they be mass authorships or endorsements by signing, actually lead to notable change. My guess is that they really don’t, but whether or not such endorsements are “popularity contests” via social media, I think I’d prefer that people who participate in science have some voice in the matter, rather than having the views of a few influential individuals, whether they be methodologists or journal editors, constantly repeated and executed in different outlets.

2. Is “retiring statistical significance” really a good idea?

Now on to problems with the Amrhein et al. article. I mostly liked it, although I did have a couple places where I suggested changes of emphasis, as noted in my post linked above. The authors made some of my suggested changes; in other places I respect their decisions even if I might have written things slightly differently.

Ioannidis had more concerns, as he wrote in an email listing a bunch of specific disagreements with points in the article:

1. Statement: Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exist
Why it is misleading: Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important. It will also facilitate claiming that there are no conflicts between studies when conflicts do exist.

2. Statement: Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P-value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.
Why it is misleading: In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim. In many cases using sufficiently stringent p-value thresholds, e.g. p=0.005 for many disciplines (or properly multiplicity-adjusted p=0.05, e.g. 10^-9 for genetics, or FDR or Bayes factor thresholds, or any thresholds) makes perfect sense. We need to make some careful choices and move on. Saying that any and all associations cannot be 100% dismissed is correct strictly speaking, but practically it is nonsense. We will get paralyzed because we cannot exclude that everything may be causing everything.

3. Statement: statistically non-significant results were interpreted as indicating ‘no difference’ in XX% of articles
Why it is misleading: this may have been entirely appropriate in many/most/all cases; one has to examine each of them carefully. It is probably at least as inappropriate, or even more so, that some/many of the remaining 100-XX% were not indicated as “no difference”.

4. Statement: The editors introduce the collection (2) with the caution “don’t say ‘statistically significant’.” Another article (3) with dozens of signatories calls upon authors and journal editors to disavow the words. We agree and call for the entire concept of statistical significance to be abandoned. We don’t mean to drop P-values, but rather to stop using them dichotomously to decide whether a result refutes or supports a hypothesis.
Why it is misleading: please see my e-mail about what I think regarding the inappropriateness of having “signatories” when we are discussing scientific methods. We do need to reach conclusions dichotomously most of the time: is this genetic variant causing depression, yes or no? Should I spend 1 billion dollars to develop a treatment based on this pathway, yes or no? Is this treatment effective enough to warrant taking it, yes or no? Is this pollutant causing cancer, yes or no?

5. Statement: whole paragraph beginning with “Tragically…”
Why it is misleading: we have no evidence that if people did not have to defend their data as statistically significant, publication bias would go away and people would not be reporting whatever results look nicer, stronger, more desirable and more fit to their biases. Statistical significance or any other preset threshold (e.g. Bayesian or FDR) sets an obstacle to making unfounded claims. People may play tricks to pass the obstacle, but setting no obstacle is worse.

6. Statement: For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss (8).
Why it is misleading: this example is factually wrong; it is true only if we are certain that the effect being addressed is truly non-null.

7. Statement: One way to do this is to rename confidence intervals ‘compatibility intervals,’ …
Why it is misleading: Probably the last thing we need in the current confusing situation is to add yet another new, idiosyncratic term. “Compatibility” is even a poor choice, probably worse than “confidence”. Results may be entirely off due to bias and the X% CI (whatever C stands for) may not even include the truth much of the time if bias is present.

8. Statement: We recommend that authors describe the practical implications of all values inside the interval, especially the observed effect or point estimate (that is, the value most compatible with the data) and the limits.
Why it is misleading: I think it is far more important to consider what biases may exist and which may lead the entire interval, no matter what we call it, to be off and thus incompatible with the truth.

9. Statement: We’re frankly sick of seeing nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews, and instructional materials.
Why it is misleading: I (and many others) are frankly sick of seeing nonsensical “proofs of the non-null”, people making strong statements about associations and even causality with (or even without) formal statistical significance (or other statistical inference tool) plus tons of spin and bias. Removing the statistical significance obstacle entirely will just give a free lunch, all-is-allowed bonus to make any desirable claim. All science will become like nutritional epidemiology.

10. Statement: That means you can and should say “our results indicate a 20% increase in risk” even if you found a large P-value or a wide interval, as long as you also report and discuss the limits of that interval.
Why it is misleading: yes, indeed. But then, welcome to the world where everything is important, noteworthy, must be licensed, must be sold, must be bought, must lead to public health policy, must change our world.

11. Statement: Paragraph starting with “Third, the default 95% used”
Why it is misleading: indeed, but this means that more appropriate P-value thresholds and, respectively X% CI intervals are preferable and these need to be decided carefully in advance. Otherwise, everything is done post hoc and any pre-conceived bias of the investigator can be “supported”.

12. Statement: Factors such as background evidence, study design, data quality, and mechanistic understanding are often more important than statistical measures like P-values or intervals (10).
Why it is misleading: while it sounds reasonable that all these other factors are important, most of them are often substantially subjective. Conversely, statistical analysis at least has some objectivity and if the rules are carefully set before the data are collected and the analysis is run, then statistical guidance based on some thresholds (p-values, Bayes factors, FDR, or other) can be useful. Otherwise statistical inference is becoming also entirely post hoc and subjective.

13. Statement: The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy, and business environments, decisions based on the costs, benefits, and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to further pursue a research idea, there is no simple connection between a P-value and the probable results of subsequent studies.
Why it is misleading: This argument is equivalent to hand waving. Indeed, most of the time yes/no decisions need to be made and this is why removing statistical significance and making it all too fluid does not help. It leads to an “anything goes” situation. Study designs for questions that require decisions need to take all these other parameters into account ideally in advance (whenever possible) and set some pre-specified rules on what will be considered “success”/actionable result and what not. This could be based on p-values, Bayes factors, FDR, or other thresholds or other functions, e.g. effect distribution. But some rule is needed for the game to be fair. Otherwise we will get into more chaos than we have now, where subjective interpretations already abound. E.g. any company will be able to claim that any results of any trial on its product do support its application for licensing.

14. Statement: People will spend less time with statistical software and more time thinking.
Why it is misleading: I think it is unlikely that people will spend less time with statistical software but it is likely that they will spend more time mumbling, trying to sell their pre-conceived biases with nice-looking narratives. There will be no statistical obstacle on their way.

15. Statement: the approach we advocate will help halt overconfident claims, unwarranted declarations of ‘no difference,’ and absurd statements about ‘replication failure’ when results from original and the replication studies are highly compatible.
Why it is misleading: the proposed approach will probably paralyze efforts to refute the millions of nonsense statements that have been propagated by biased research, mostly observational, but also many subpar randomized trials.

Overall assessment: the Comment is written with an undercurrent belief that there are zillions of true, important effects out there that we erroneously dismiss. The main problem is quite the opposite: there are zillions of nonsense claims of associations and effects that once they are published, they are very difficult to get rid of. The proposed approach will make people who have tried to cheat with massaging statistics very happy, since now they would not have to worry at all about statistics. Any results can be spun to fit their narrative. Getting entirely rid of statistical significance and preset, carefully considered thresholds has the potential of making nonsense irrefutable and invincible.

That said, despite these various specific points of disagreement, Ioannidis emphasized that Amrhein et al. raise important points that “need to be given an opportunity to be heard loud and clear and in their totality.”

In reply to Ioannidis’s points above, I replied:

1. You write, “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important.” I completely disagree. Or, maybe I should say, anyone is already allowed to make any overstated claim about any result being important. That’s what PNAS is, much of the time. To put it another way: I believe that embracing uncertainty and avoiding overstated claims are important. I don’t think statistical significance has much to do with that.

2. You write, “In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim.” Again, this is already the case that people can conclude what they want. One concern is what is done by scientists who are honestly trying to do their best. I think those scientists are often misled by statistical significance, all the time, ALL THE TIME, taking patterns that are “statistically significant” and calling them real, and taking patterns that are “not statistically significant” and treating them as zero. Entire scientific papers are, through this mechanism, data in, random numbers out. And this doesn’t even address the incentives problem, by which statistical significance can create an actual disincentive to gather high-quality data.

I disagree with many other items on your list, but two is enough for now. I think the overview is that you’re pointing out that scientists and consumers of science want to make reliable decisions, and statistical significance, for all its flaws, delivers some version of reliable decisions. And my reaction is that whatever plus it is that statistical significance sometimes provides reliable decisions is outweighed by (a) all the times that statistical significance adds noise and provides unreliable decisions, and (b) the false sense of security that statistical significance gives so many researchers.

One reason this is all relevant, and interesting, is that we all agree on so much—yet we disagree so strongly here. I’d love to push this discussion toward the real tradeoffs that arise when considering alternative statistical recommendations, and I think what Ioannidis wrote, along with the Amrhein/Greenland/McShane article, would be a great starting point.

Ioannidis then responded to me:

On whether removal of statistical significance will increase or decrease the chances that overstated claims will be made and authors will be more or less likely to conclude according to their whim, the truth is that we have no randomized trial to tell whether you are right or I am right. I fully agree that people are often confused about what statistical significance means, but does this mean we should ban it? Should we also ban FDR thresholds? Should we also ban Bayes factor thresholds? Also probably we have different scientific fields in mind. I am afraid that if we ban thresholds and other (ideally pre-specified) rules, we are just telling people to just describe their data as best as they can and unavoidably make strength-of-evidence statements as they wish, kind of impromptu and post-hoc. I don’t think this will work. The notion that someone can just describe the data without making any inferences seems unrealistic and it also defies the purpose of why we do science: we do want to make inferences eventually and many inferences are unavoidably binary/dichotomous. Also actions based on inferences are binary/dichotomous in their vast majority.

I replied:

I agree that the effects of any interventions are unknown. We’re offering, or trying to offer, suggestions for good statistical practice in the hope that this will lead to better outcomes. This uncertainty is a key reason why this discussion is worth having, I think.

3. Mob rule, or rule of the elites, or gatekeepers, consensus, or what?

One issue that came up is, what’s the point of that letter with all those signatories? Is it mob rule, the idea that scientific positions should be determined by those people who are loudest and most willing to express strong opinions (“the mob” != “the silent majority”)? Or does it represent an attempt by well-connected elites (such as Greenland and myself!) to tell people what to think? Is the letter attempting to serve a gatekeeping function by restricting how researchers can analyze their data? Or can this all be seen as a crude attempt to establish a consensus of the scientific community?

None of these seem so great! Science should be determined by truth, accuracy, reproducibility, strength of theory, real-world applicability, moral values, etc. All sorts of things, but these should not be the property of the mob, or the elites, or gatekeepers, or a consensus.

That said, the mob, the elites, gatekeepers, and the consensus aren’t going anywhere. Like it or not, people do pay attention to online mobs. I hate it, but it’s there. And elites will always be with us, sometimes for good reasons. I don’t think it’s such a bad idea that people listen to what I say, in part on the strength of my carefully-written books—and I say that even though, at the beginning of my career, I had to spend a huge amount of time and effort struggling against the efforts of elites (my colleagues in the statistics department at the University of California, and their friends elsewhere) who did their best to use their elite status to try to put me down. And gatekeepers . . . hmmm, I don’t know if we’d be better off without anyone in charge of scientific publishing and the news media—but, again, the gatekeepers are out there: NPR, PNAS, etc. are real, and the gatekeepers feed off of each other: the news media bow down before papers published in top journals, and the top journals jockey for media exposure. Finally, the scientific consensus is what it is. Of course people mostly do what’s in textbooks, and published articles, and what they see other people do.

So, for my part, I see that letter of support as Amrhein, Greenland, and McShane being in the arena, recognizing that mob, elites, gatekeepers, and consensus are real, and trying their best to influence these influencers and to counter negative influences from all those sources. I agree with the technical message being sent by Amrhein et al., as well as with their open way of expressing it, so I’m fine with them making use of all these channels, including getting lots of signatories, enlisting the support of authority figures, working with the gatekeepers (their comment is being published in Nature, after all; that’s one of the tabloids), and openly attempting to shift the consensus.

Amrhein et al. don’t have to do it that way. It would also be fine with me if they were to just publish a quiet paper in a technical journal and wait for people to get the point. But I’m fine with the big push.

4. And now to all of you . . .

As noted above, I accept the continued existence and influence of mob, elites, gatekeepers, and consensus. But I’m also bothered by these, and I like to go around them when I can.

Hence, I’m posting this on the blog, where we have the habit of reasoned discussion rather than mob-like rhetorical violence, where the comments have no gatekeeping (in 15 years of blogging, I’ve had to delete less than 5 out of 100,000 comments—that’s 0.005%!—because they were too obnoxious), and where any consensus is formed from discussion that might just lead to the pluralistic conclusion that sometimes no consensus is possible. And by opening up our email discussion to all of you, I’m trying to demystify (to some extent) the elite discourse and make this a more general conversation.

P.S. There’s some discussion in comments about what to do in situations like the FDA testing a new drug. I have a response to this point, and it’s what Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I wrote in section 4.4 of our article, Abandon Statistical Significance:

While our focus has been on statistical significance thresholds in scientific publication, similar issues arise in other areas of statistical decision making, including, for example, neuroimaging where researchers use voxelwise NHSTs to decide which results to report or take seriously; medicine where regulatory agencies such as the Food and Drug Administration use NHSTs to decide whether or not to approve new drugs; policy analysis where non-governmental and other organizations use NHSTs to determine whether interventions are beneficial or not; and business where managers use NHSTs to make binary decisions via A/B tests. In addition, thresholds arise not just around scientific publication but also within research projects, for example, when researchers use NHSTs to decide which avenues to pursue further based on preliminary findings.

While considerations around taking a more holistic view of the evidence and consequences of decisions are rather different across each of these settings and different from those in scientific publication, we nonetheless believe our proposal to demote the p-value from its threshold screening role and emphasize the currently subordinate factors applies in these settings. For example, in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Plotting images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.

For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.

That said, we acknowledge that thresholds—of a non-statistical variety—may sometimes be useful in these settings. For example, consider a firm contemplating sending a costly offer to customers. Suppose the firm has a customer-level model of the revenue expected in response to the offer. In this setting, it could make sense for the firm to send the offer only to customers that yield an expected profit greater than some threshold, say, zero.

Even in pure research scenarios where there is no obvious cost-benefit calculation—for example a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.

While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.

We would also like to see—when possible in these and other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature (Gelman, 2015, 2017; McShane and Bockenholt, 2017, 2018).
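
To make the cost-benefit point concrete, here is a minimal sketch of the kind of expected-profit rule described in the excerpt above; the numbers and the revenue model are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting, loosely following the customer-offer example above;
# every number here is made up for illustration.
offer_cost = 2.0                     # cost of sending one offer
n_customers = 10_000
expected_revenue = rng.gamma(2.0, 1.5, n_customers)  # model-based expected revenue per customer

# The threshold is on expected profit (a quantity with units and consequences),
# not on a p-value computed against a null model.
expected_profit = expected_revenue - offer_cost
send_offer = expected_profit > 0.0

print(f"send offer to {send_offer.sum()} of {n_customers} customers")
print(f"total expected profit: {expected_profit[send_offer].sum():.0f}")
```

The point of the sketch is only that the threshold lives on a decision-relevant scale derived from costs and benefits, rather than being an acontextual tail probability.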

P.P.S. Regarding the petition thing, I like what Peter Dorman had to say:

A statistical decision rule is a coordination equilibrium in a very large game with thousands of researchers, journal editors and data users. Perhaps once upon a time such a rule might have been proposed on scientific grounds alone (rightly or wrongly), but now the rule is firmly in place with each use providing an incentive for additional use. That’s why my students (see comment above) set aside what I taught in my stats class and embraced NHST. The research they rely on uses it, and the research they hope to produce will be judged by it. That matters a lot more to them than what I think.

That’s why mass signatures make sense. It is not mob rule in the sociological sense; we signers are not swept up in a wave of transient hysterical solidarity. Rather, we are trying to dent the self-fulfilling power of expectations that locks NHST in place. 800 is too few to do this, alas, but it’s worth a try to get this going.

451 thoughts on "“Retire Statistical Significance”: The discussion."

  1. Put me down on the side of not feeling at all comfortable with “science” being protected somehow by long lists of signatories on an advocacy document. But I’ll admit to often suffering an excess of cynicism when it comes to both “mob rule” and “elites”. That said, anything that Sander Greenland and Andrew Gelman both advocate is, for me, worthy of serious contemplation.

    Now with that out of the way, the thing I want to say is I agree with (at least) the one underlying common theme in most of Ioannidis’ complaints. Any effort to ban usage or mention of a specific, long-standing decision rule like “p-value” without specifying a drop-in replacement decision rule (or at least decision technique that admits formulation of context-specific rules) is a bad idea, doomed to at best failure and at worst the kind of success that leads to the situation being worse than before.

    To paraphrase what Ioannidis keeps repeating in various forms, if you can’t say prior to conducting an analysis exactly what criterion will constitute support for the presence of an effect then you are describing the data. Not analyzing, describing. And in many endeavors descriptive statistics, no matter how elaborate, are not the point of the research.

    • As often, I’m coming too late to this discussion, but the first thing that strikes me is that the first two postings state that the article that is discussed asks for abandoning p-values, which it quite explicitly doesn’t.

  2. I think Ioannidis is coming at this from the medical field, and I think that he has a great point. It is hard for me to understand how the FDA would be able to do its job if p-values were dispensed with. The FDA just doesn’t have the resources to rerun all the basic science and pharmacological research upon which a drug application relies. At the end of the day, being able to say that some specific threshold was not met is essential for the regulator to have. Regulators just can’t go around and evaluate the evidence for each claim on an individualized basis. They have to be able to go into a court and explain their decision to a non-scientist who is more concerned with (and equipped to understand) the question of “Did the regulator apply the same standard here it applied in other contexts?” than “Does the evidence really justify the claim?” Without some scientific consensus on thresholds, industry will be able to push around regulators. Our society will get decidedly less scientific in a more important way than just having a lot of nonsense psychological and nutritional research being released in the press.

    • It sounds like you would advocate for arbitrary obstacles to drug approval over none at all.

      What exactly do you think statistical significance is telling the regulators?

        • Of course, I would. There has to be a decision rule. If a better decision rule can be set up, that is preferable, but it will always be arbitrary in some sense. The FDA, for instance, insists on two randomized controlled trials to get a drug indication approved. Why two? Three would be better. Five is even more than three. The FDA has to pick something that it can explain to Congress. If an industry comes to Congress and says we have all these studies that show that the risky dangerous product that we want to make money off of is not going to be risky or dangerous at all, but these mean regulators won’t let us give it to Americans, the regulators are going to need clear scientifically based, but arbitrary, rules to tell Congress that the studies are bogus. That is not a scientific reason to have significance tests. It is just the practical reality. If they are abandoned because of the obvious problems, what decision rule will replace them? How will the regulator push back and say these results are indistinguishable from chance? There may be a better decision rule, but we have to recognize the risks of not having one at all.

        • If you are going to be arbitrary then do the cheapest thing possible, eg flip a coin.

          the regulators are going to need clear scientifically based, but arbitrary, rules to tell Congress that the studies are bogus.

          The way statistical significance is used in these trials is not “scientifically based”. It amounts to disproving a strawman.

          Do you mean the regulators just want some expensive and elaborate ritual performed so it seems like they are doing something?

          How will the regulator push back and say these results are indistinguishable from chance?

          1) Statistical significance does not tell you that.
          2) That isn’t a question the regulators, or congress, or anyone, should care about.

          If an industry comes to Congress and says we have all these studies that show that the risky dangerous product that we want to make money off of is not going to be risky or dangerous at all

          Then there should be some cost-benefit assessment performed that takes into account the benefits and compares to the risk and cost. Statistical significance has nothing to do with this.

        • Are you saying that requiring the p-value to be under .05 has no value in distinguishing results that may be the result of chance and those that are not? That seems strong. The FDA does weigh the risk to patients against the benefits. That too will rely on arbitrary decision rules. Also, I, very clearly, am not advocating significance tests. I am just saying that some decision rule has to be in place for regulators to discount studies where the results are insufficiently distinguishable from chance. Any such rule will be arbitrary.

          Also, your rhetorical technique is a bit obnoxious. How about I do it to you? Are you saying that you don’t want drug companies to be regulated at all?!? (Not what you said, right? Annoying.)

        • Are you saying that requiring the p-value to be under .05 has no value in distinguishing results that may be the result of chance and those that are not?

          Yes. With sufficient sample size you will always detect that your strawman null model is wrong (50% chance it is wrong in the “right” direction). It basically measures the prevailing opinion: Is society as a whole willing to devote enough resources to get significance for this new treatment or not?
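
          As a minimal sketch of the sample-size point (the 0.02 SD “true” difference is an arbitrary illustrative choice):

          ```python
          import numpy as np
          from scipy import stats

          true_diff = 0.02                   # tiny but nonzero true effect, in SD units (illustrative)
          z_crit = stats.norm.isf(0.05 / 2)  # two-sided 0.05 critical value

          # Normal-approximation power of a two-sample test of the point null (difference = 0):
          # rejection becomes essentially certain once n is large enough.
          for n in [1_000, 10_000, 100_000, 1_000_000]:
              se = np.sqrt(2 / n)            # SE of the difference in means, unit variances
              z = true_diff / se
              power = stats.norm.sf(z_crit - z) + stats.norm.cdf(-z_crit - z)
              print(f"n per group = {n:>9,d}   P(p < 0.05) = {power:.3f}")
          ```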

          Are you saying that you don’t want drug companies to be regulated at all?!?

          I don’t trust them anyway since I have “seen how the sausage is made”, so it makes little difference to me honestly. There should be some organization that ensures what they say is in the pill (and only that, up to some limit) is actually in the pill though.

        • > With sufficient sample size you will always detect that your strawman null model is wrong (50% chance it is wrong in the “right” direction).

          In the infinite case, sure. Practically, to be approved, drugs first need to show efficacy in a finite sample size of (on the order of) hundreds of individuals. Given a particular and finite range of sample sizes, the chance of getting a significant p-value from a very small true effect should on average be smaller than the chance of getting a significant p-value from a large true effect.
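
          A quick simulation of this finite-sample point, with purely illustrative numbers: at a fixed trial-sized sample, a large true effect crosses p < 0.05 far more often than a tiny one.

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          n = 200            # per arm, a rough stand-in for a modest trial (illustrative)
          reps = 2_000

          def rejection_rate(true_effect):
              """Fraction of simulated two-arm studies with two-sided p < 0.05."""
              hits = 0
              for _ in range(reps):
                  a = rng.normal(0.0, 1.0, n)
                  b = rng.normal(true_effect, 1.0, n)
                  hits += stats.ttest_ind(a, b).pvalue < 0.05
              return hits / reps

          for effect in [0.02, 0.5]:         # "very small" vs "large" true effect, in SD units
              print(f"true effect = {effect} SD: p < 0.05 in {rejection_rate(effect):.0%} of runs")
          ```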

          Of course, for p-values to be valuable in decision-making, it is not actually necessary that they be valuable in isolation from all other information, and it is very typical for a clinical trial to report effect sizes, sample sizes, and p-values together.

          > I don’t trust them anyway since I have “seen how the sausage is made”, so it makes little difference to me honestly.

          I don’t understand the logic here; surely if an industry is untrustworthy, more rather than less independent oversight is called for. That aside, drug trials don’t just assess efficacy, but also toxicity. Not collecting information about adverse outcomes in smaller trials before making a drug widely available seems unwise.

        • It’s unclear what you’re actually blathering about. Clinical trials are not conducted with an infinite sample size, and the FDA absolutely takes effect size and real-world utility of the drug into account before making a decision. You’re the one positing the strawman that the FDA only looks at the pvalue at the bottom of the report.

    • This is a valid concern and this is how the world works today. However, I’d love some more work to plug parameter uncertainties into a decision model where the costs and benefits of different dubious claims may be very different. I’m not sure how I would think about evaluating claims (“this supplement makes you healthier”) although if they want to avoid BS marketing they should ban all vague claims (which won’t happen). On the other hand, when evaluating new medical techniques I think it *would* be helpful to plug in the full decision model. For example if an experimental drug has negligible side effects but trials suggest only weak (ie, not significant) improvement in outcomes, it might be OK to allow as a last resort. (People in the medical field will doubtless push back on that claim so perhaps it is a poor example — I just want to illustrate why the full decision analysis could help us move beyond having the FDA check p-values).

      • if an experimental drug has negligible side effects but trials suggest only weak (ie, not significant) improvement in outcomes, it might be OK to allow as a last resort.

        Why would this be a last resort? Many people prefer that to developing new issues they need to deal with, that is why you see such a movement towards “natural” treatments.

        • Thanks!!!

          I did not raise drug regulation in the email discussions because (as Andrew pointed out) the focus was scientific publication.

          Having worked in drug regulation and even jointly with the FDA (sometimes with Bob O’Neil), I do find comments on blogs rather poorly informed about what happens in drug regulation (Bob’s slides may help with that). On the other hand, I do need to be very careful about what I say publicly.

          My private take is that there needs to be lines in the sand but always accompanied by notwithstanding clauses (openness and ability to change for good reasons). Additionally, in agreement with Don Berry, I believe there needs to be an overarching concern with how often things are approved when they shouldn’t be and vice versa, as many approvals are made or not each and every year.

          Having said that, from my experience, when people are on the ball and rise to the occasion the decision making is well informed, maybe even up to Andrew’s ideals. Now, in some countries, there may be legal limitations on what can matter in deciding approval, for instance, that the size of the effect, other than it being positive, cannot explicitly be considered. Laws aren’t perfect.

          Now, I have no idea how often people are not on the ball and fail to rise to the occasion in various regulatory agencies but it certainly happens.

    • Patrick wrote:

      In the infinite case, sure. Practically, to be approved, drugs first need to show efficacy in a finite sample size of (on the order of) hundreds of individuals. Given a particular and finite range of sample sizes, the chance of getting a significant p-value from a very small true effect should on average be smaller than the chance of getting a significant p-value from a large true effect.

      The point is there is always a “true effect” (deviation from the strawman null model). It is just a matter of spending enough to detect it or not. Ie, the results are only ever true positives and false negatives (no false positives).

      Of course, as you say, the bigger the deviation from the model, the cheaper it will be to detect the deviation. What we see in practice is that the threshold moves down from 0.1 ->.05 -> 0.01 -> 3e-7 -> 5e-8 depending on how cheap it is to collect the data required for the “right” number of significant results to be yielded on average (“alpha is the expected value of p”). If it is too easy or too hard to “get significance” the community rejects the procedure and adjusts the threshold.

      Of course, for p-values to be valuable in decision-making, it is not actually necessary that they be valuable in isolation from all other information, and it is very typical for a clinical trial to report effect sizes, sample sizes, and p-values together.

      I never pay any attention to the p-values that clutter up medical journal papers and seem to understand them better than most. It is just pollution that obscures the important stuff like what was actually measured, what predictions were tested (if any), what was the relationship between the variables under study, what alternative explanations could there be, etc. It wouldn’t be so bad if we could just ignore them… but they determine what gets published since people think (significance = “real discovery”)…

      > I don’t trust them anyway since I have “seen how the sausage is made”, so it makes little difference to me honestly.

      I don’t understand the logic here; surely if an industry is untrustworthy, more rather than less independent oversight is called for. That aside, drug trials don’t just assess efficacy, but also toxicity. Not collecting information about adverse outcomes in smaller trials before making a drug widely available seems unwise.

      River of radioactive waste = current medical literature[1]
      Goggles = NHST
      https://www.youtube.com/watch?v=juFZh92MUOY

      [1] While disagreeing with his conclusion, liked this characterization of the cancer literature as “augean stables”:
      https://www.nature.com/news/cancer-reproducibility-project-scales-back-ambitions-1.18938

      • Anoneuoid said:

        “… [p-values are] just pollution that obscures the important stuff like what was actually measured, what predictions were tested (if any), what was the relationship between the variables under study, what alternative explanations could there be, etc. It wouldn’t be so bad if we could just ignore them… but they determine what gets published since people think (significance = “real discovery”)…”

        I pretty much agree

    • It’s unclear what you’re actually blathering about. Clinical trials are not conducted with an infinite sample size, and the FDA absolutely takes effect size and real-world utility of the drug into account before making a decision. You’re the one positing the strawman that the FDA only looks at the pvalue at the bottom of the report.

      Pick a real life clinical trial you think is done correctly, and it will become clear what I am “blathering about”.

  3. Ioannidis wrote:

    In many cases using sufficiently stringent p-value thresholds, e.g. p=0.005 for many disciplines (or properly multiplicity-adjusted p=0.05, e.g. 10^-9 for genetics, or FDR or Bayes factor thresholds, or any thresholds) makes perfect sense.

    Here we go again… Real life example?

    I can tell you now all this does is make a community choose a threshold + sample size to typically get around the “correct” number of genes (or whatever) they want to (or, in practice, can) study further. Just skip the pseudoscience and rank the genes by whatever statistic you want (use p-values if that makes sense to you), then pick the top X for further study. But do not add in this step of finding “real” signals and “not real” signals.
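
    A minimal sketch of that ranking step on synthetic placeholder data (the column names and the cutoff of 10 are arbitrary):

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    n_genes = 1_000

    # Synthetic per-gene summary statistics, purely for illustration
    results = pd.DataFrame({
        "gene": [f"gene_{i:04d}" for i in range(n_genes)],
        "effect": rng.normal(0.0, 0.05, n_genes),
        "pvalue": rng.uniform(0.0, 1.0, n_genes),
    })

    # Rank by whatever statistic you prefer (p-values here) and take the top X
    # for follow-up, without labeling anything "real" or "not real".
    top_x = results.sort_values("pvalue").head(10)
    print(top_x)
    ```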

    • It seems that the significance threshold of 5*10^-8 has worked quite well in GWAS research. The false positive problem of the candidate gene paradigm has been largely eliminated and thousands of significant SNP effects that can be reproduced at least in ancestrally similar samples have been identified. If p-values were retired, how would GWAS research or the equivalent be conducted?

      • thousands of significant SNP effects that can be reproduced at least in ancestrally similar samples have been identified.

        1) Like what? Please link to one along with the replication(s).
        2) Just counting the “success” is not informative. How many thousands cannot be reproduced?

        If p-values were retired, how would GWAS research or the equivalent be conducted?

        1) I don’t assume research like that should be conducted.
        2) If it should be conducted, I gave an example: sort your p-values and pick the top 10 to investigate further.

        • Here’s some replication results from last year’s big GWAS of educational attainment:

          We conducted a replication analysis of the 162 lead SNPs identified at genome-wide significance in a previous combined-stage (discovery and replication) meta-analysis (n = 405,073). Of the 162 SNPs, 158 passed quality-control filters in our updated meta-analysis. To examine their out-of-sample replicability, we calculated Z-statistics from the subsample of our data (n = 726,808) that was not included in the previous study… Of the 158 SNPs, we found that 154 have matching signs in the new data (for the remaining four SNPs, the estimated effect is never statistically distinguishable from zero at P < 0.10). Of the 154 SNPs with matching signs, 143 are significant at P < 0.01, 119 are significant at P < 10^-5 and 97 are significant at P < 5 × 10^-8.

          It seems to me that significance testing is working here just like it should. Replication in holdout samples is routine in GWAS research these days, take a look at the catalogue for more.
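
          As a rough back-of-the-envelope check (not from the paper): if the original estimates carried no directional information, each sign match in the holdout sample would be a coin flip, and 154 or more matches out of 158 would be astronomically unlikely.

          ```python
          from scipy import stats

          # P(at least 154 of 158 signs match) if each match were a fair coin flip
          print(f"{stats.binom.sf(153, 158, 0.5):.1e}")   # on the order of 1e-40
          ```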

          I don’t assume research like that should be conducted.

          Why not?

          If it should be conducted, I gave an example: sort your p-values and pick the top 10 to investigate further.

          What would those further investigations be about? Individual differences in most human traits are influenced by thousands of genetic variants. The effect of any one genetic difference is tiny, so examining the top 10 effects would usually be pointless. Understanding genetic effects on complex trait variation in functional/mechanistic terms in the foreseeable future is not a realistic prospect in my opinion. Quite likely, it will never be realistic. Rather, the utility of GWAS results is that they can be combined into polygenic scores and used for prediction and intervention.

        • This is the replication result (Fig S3)?
          https://i.ibb.co/Cz6S2xC/gwasrep.png

          Maybe I am missing something but it looks like almost no correlation at all to me. If the results were reproducible, shouldn’t the effect sizes for the two studies be scattered around the diagonal line? If there is little correlation in effect sizes but the “significance” still corresponds, then this would be showing that the SNP-specific sample sizes (does this vary across genes according to susceptibility to quality control issues, etc?) or variances were similar across the studies.

          Also according to the (usual) “significance” definition of replication these results were not very good:

          Of the 154 SNPs with matching signs, 143 are significant at P < 0.01, 119 are significant at P < 10^-5 and 97 are significant at P < 5 × 10^-8.

          The original threshold was 5e-8, so 97/162 ~ 60% replicated according to the statistical significance definition. If nothing was going on and sample size was sufficient, we would expect ~50% to be significant in the same direction, they got 60%…

          And they say:

          The reference allele is chosen to be the allele estimated to increase EA in the previous study; therefore, all points above the dotted line have matching signs in the replication sample.

          Why wouldn’t they also look at alleles that could “decrease EA”?

          I’ll leave the other stuff for later to focus on this for now. I have to believe I am misunderstanding that.

        • Also, isn’t it strange that almost all (except ~5) of the SNPs that met p < 5e-8 for the current study (all must have in the original to be included here) are above the diagonal and it looks like only one with a “lesser” p-value is above?

          Very bizarre figure to me, hopefully someone can point out my misunderstanding.

        • I haven’t read the whole paper, but I can answer this:

          The reference allele is chosen to be the allele estimated to increase EA in the previous study…

          Why wouldn’t they also look at alleles that could “decrease EA”?

          They do. When you calculate the effect size of having an allele “…AGTC…”, it is in contrast to having a different allele at that location, e.g., “…AGTA…”, “…AGTT…”, etc. What they are saying here is that if in the previous study, they found a positive effect for “AGTC” at one locus over other possible alleles, they then also plotted the effect size for “AGTC” in the replication cohort (and not some other allele like “AGTA” or “AGTT”). In other words, all they’re saying there is that for it to be a real replication the same actual DNA sequence has to show an effect in the same direction.

          The original threshold was 5e-8, so 97/162 ~ 60% replicated according to the statistical significance definition. If nothing was going on and sample size was sufficient, we would expect ~50% to be significant in the same direction, they got 60%…

          I’m not sure where you’re getting 50% from. Given a completely random sample of 162 (out of around 150000 unlinked SNPs), I definitely wouldn’t expect half to be found by chance in a GWAS study that identified around 1100 significant hits at p < 5e-8.

        • What they are saying here is that if in the previous study, they found a positive effect for “AGTC” at one locus over other possible alleles, they then also plotted the effect size for “AGTC” in the replication cohort (and not some other allele like “AGTA” or “AGTT”).

          Yes, at first I thought it made sense since there would always be a positive and negative allele, but there are four possibilities at each nucleotide like you say.

          Can’t we get results like the following for correlation with edutainment:
          AGTC = positive
          AGTA = neutral
          AGTT = neutral
          AGTG = negative

          Now there is both a “positive” and “negative” allele.

          In other words, all they’re saying there is that for it to be a real replication the same actual DNA sequence has to show an effect in the same direction.

          So statistical significance is not required to declare the replication a success now, it is only important for the initial screening?

          I am really more curious as to what process would lead to results like this. We see a consistent “effect” in sign only with little to no correlation in magnitude of effect.

          It is something that needs explaining to me; personally, I suspect some form of p-hacking (did they do it with and without this “winner’s curse” adjustment, etc). Or perhaps it is from playing around with specifying the regression model they mention. But maybe there is some biological explanation everyone uses that I am unaware of.

          I’m not sure where you’re getting 50% from. Given a completely random sample of 162 (out of around 150000 unlinked SNPs), I definitely wouldn’t expect half to be found by chance in a GWAS study that identified around 1100 significant hits at p < 5e-8.

          The 50% only applies to studies with sufficient power to detect all the real differences at the threshold used. Once you have billions or trillions of samples the procedure will lead to the conclusion that every single gene will correlate with every single behavior. Of course at that point they will move the threshold to 5e-16 or whatever to get the right number of “real” correlations…

        • The 50% only applies to studies with sufficient power to detect all the real differences at the threshold used. Once you have billions or trillions of samples the procedure will lead to the conclusion that every single gene will correlate with every single behavior.

          There are indeed people who now believe this (the “omnigenic model” of complex traits). But they arrived at this hypothesis by looking at the results of GWAS studies with large sample sizes, not by assuming a priori that there was no such thing as a null effect in genetics. On the other extreme, for example, Mendelian traits also exist; there is a lot of room between “one gene” and “every gene” that could in theory explain most of a trait’s heritability. Even assuming that the omnigenic model is a good description of reality, it is still not clear that literally all alleles will have a truly non-zero effect on any trait: for example, there are synonymous mutations in protein-coding regions.

          More importantly for this specific example, though, there were a finite number of samples in this experiment, and not every SNP was actually significant at a threshold of p < 5e-8 in either study. If this set of ~150 SNPs were simply an arbitrary random sample of all true SNPs (still assuming here that every SNP has a "true" effect), then you would expect that SNPs would replicate at the rate of around ~1100 discoveries / ~75000 unlinked SNPs (with effects in the same direction) ~= 0.015, not 0.5 or 0.6.

          What we see in practice is that the threshold moves down from 0.1 ->.05 -> 0.01 -> 3e-7 -> 5e-8 depending on how cheap it is to collect the data required for the “right” amount of significant results to be yielded on average (“alpha is the expected value of p”).

          In the case of GWAS, 5e-8 was originally chosen to account for multiple testing, and has been used pretty consistently in studies of common variation. I haven’t seen lots of groups choosing their own ad hoc thresholds based on prior expectations in the way you’re describing.

          there are four possibilities at each nucleotide like you say

          SNP markers in genotyping arrays are typically chosen to be biallelic, at least in the population used to design the array.

          So statistical significance is not required to declare the replication a success now, it is only important for the initial screening?

          I think you’ve misunderstood me. I didn’t say that statistical significance *wasn’t* a criterion for replication success. I just said that which allele had a “positive” effect needed to be the same in both studies, as a necessary but not sufficient condition. Indeed, the authors looked at both statistical significance and effect sign.

        • There are indeed people who now believe this (the “omnigenic model” of complex traits). But they arrived at this hypothesis by looking at the results of GWAS studies with large sample sizes, not by assuming a priori that there was no such thing as a null effect in genetics.

          Yes, I take that as an a priori principle about everything, not just genotype/phenotype. Literally everything correlates with everything else (to be sure, most of these correlations are unimportant/uninteresting) and stuff that does not is exceptional and interesting. If you use NHST, you are assuming an opposite principle.

          there are synonymous mutations in protein-coding regions

          There will still be slightly different affinities for various enzymes like polymerases, melting temps, etc. Plenty of reasons for some difference to arise.

          not every SNP was actually significant at a threshold of p < 5e-8 in either study.

          My understanding is all the ones checked for “replication” did meet that criterion in the first study.

          If this set of ~150 SNPs were simply an arbitrary random sample of all true SNPs (still assuming here that every SNP has a “true” effect), then you would expect that SNPs would replicate at the rate of around ~1100 discoveries / ~75000 unlinked SNPs (with effects in the same direction) ~= 0.015, not 0.5 or 0.6.

          Yes, there will be many false negatives if the study is underpowered. The sample size and variability determine the expected number of significant vs not results.

          In the case of GWAS, 5e-8 was originally chosen to account for multiple testing, and has been used pretty consistently in studies of common variation. I haven’t seen lots of groups choosing their own ad hoc thresholds based on prior expectations in the way you’re describing.

          It isn’t possible for individual groups to set the standard; the community for a given field sets it collectively based on how many “discoveries” need to be published per year to keep getting funding or whatever. This isn’t something that is discussed openly, it just needs to happen or that research will get shut down for being too productive (look like fakers) or unproductive (never learn anything new). The first person who looked at GWAS data saw all the “significance” at 0.05 and then decided this was unacceptable, so they made it more stringent.

          It seems to be about 1 over the average sample size (+/- 1-2 orders of magnitude), but of course depends on how noisy the type of data tends to be and how wrong the null model usually is.

          Here is a great example of the thought process:

          I am wondering under what circumstances is it more appropriate to use an alpha value of .01 instead of the standard .05 (using a T-test for equal or unequal variances). I have some data in which almost every group is significantly different from almost every other group when using the alpha value of .05, but not .01. I am not well educated in statistics so any help would be greatly appreciated.

          https://www.researchgate.net/post/Should_my_alpha_be_set_to_05_or_01

          I am not saying there is anything wrong with the above thinking, except it is based on the premise that statistical significance is discriminating between “real” and “chance” correlations to begin with.

          SNP markers in genotyping arrays are typically chosen to be biallelic, at least in the population used to design the array.

          Thanks, so that explains one strange aspect of the chart? Any comment on the rest, ie what type of process would result in only a directional correlation while magnitude is largely irrelevant?

        • Lee et al. discuss the logic of their replication procedure in section 1.10 of the supplement. The table in that section demonstrates that the results are consistent with true positive effects.

          I would think the p-value in an exact replication when the hypothesis is true will be larger (less significant) than the original p-value 50% of the time, so your suggestion that all p-values in the replication should be expected to be below the 5 * 10^-8 threshold is surely erroneous. Only a bit over half should be that low. (The replication by Lee et al. of course isn’t exact, e.g. the sample size is larger.)

          I’m not sure what’s going on in Fig. S3.

          Whether you choose the “plus” or “minus” alleles as reference alleles makes no difference to the results of a GWAS. For example, in Lee et al. the effect size for having T rather than C in the rs7623659 locus was 0.02899. Someone with two T alleles is expected to get 2*0.02899=0.05798 units more education than someone with two C alleles. You could choose to model the effect in terms of C instead, in which case those with two C alleles would be expected to get 0.05798 units less education than someone with two T alleles. The two ways of modeling the effect are completely equivalent. (I think the unit in this analysis is years of education, so 0.05798 units is about 21 days.)

        • I would think the p-value in an exact replication when the hypothesis is true will be larger (less significant) than the original p-value 50% of the time, so your suggestion that all p-values in the replication should be expected to be below the 5 * 10^-8 threshold is surely erroneous

          Can you expand on this? I can think of one consequence if you are correct:

          There are many people/projects (cancer replication project, psych replication project, just in general) who have been defining “successful replication” as “statistically significant in the same direction”. While I don’t think that is a good definition for other reasons, you seem to be claiming they should only expect a ~50% success rate using that definition.

        • Here’s Senn (2002) on expected p-values in replications:

          Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.

          So, if the p-value in the first study is 5*10^-8 and the effect size is correct (i.e. the true population value), the p-value in an exact replication will be less than 5*10^-8 in 50% of replication attempts. Note that this is about how well a particular p-value replicates, NOT whether the effect replicates at some conventional level such as 0.05, so your comment about a 50% success rate is mistaken.
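
          A minimal simulation sketch of that point, assuming a simple two-sided z-test and taking the true effect to be exactly the original estimate (an idealization, not the Lee et al. procedure):

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(0)
            n_rep = 200_000

            p_orig = 5e-8
            z_orig = stats.norm.isf(p_orig / 2)   # z-score giving the original two-sided p

            # Exact replication: same design, true effect set equal to the original estimate,
            # so the replication z-statistic is Normal(z_orig, 1).
            z_rep = rng.normal(z_orig, 1.0, n_rep)
            p_rep = 2 * stats.norm.sf(np.abs(z_rep))

            print((p_rep < p_orig).mean())                # about 0.5: beats the original p only half the time
            print(((p_rep < 0.05) & (z_rep > 0)).mean())  # nearly 1: significant at 0.05 in the same direction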

          The real replication test is not about p-values but about how the polygenic effects replicate across contexts and education polygenic scores pass that test just fine, e.g. https://www.pnas.org/content/115/31/E7275

        • So, if the p-value in the first study is 5*10^-8 and the effect size is correct (i.e. the true population value), the p-value in an exact replication will be less than 5*10^-8 in 50% of replication attempts.

          I haven’t worked this out for myself but it seems like an interesting point. However, in this study the “lead SNP” p-values were all less than 5e-8, not equal to it.

          Note that this is about how well a particular p-value replicates, NOT whether the effect replicates at some conventional level such as 0.05, so your comment about a 50% success rate is mistaken.

          In the paper they are concerned with passing the threshold, as was I. I didn’t realize you changed the subject. Rereading I do see you did compare new vs old p-values, but then switch to saying “a bit over half should be below the threshold”:

          I would think the p-value in an exact replication when the hypothesis is true will be larger (less significant) than the original p-value 50% of the time, so your suggestion that all p-values in the replication should be expected to be below the 5 * 10^-8 threshold is surely erroneous. Only a bit over half should be that low.

          Where does “bit over half should be that low” (below the 5e-8 threshold) come from?

          I’m not sure what’s going on in Fig. S3.

          Can anyone explain this? This is the type of stuff that drove me nuts when I did medical research, they would just present the most bizarre looking data as if it was totally normal all the time.

        • In the paper they are concerned with passing the threshold, as was I. I didn’t realize you changed the subject. Rereading I do see you did compare new vs old p-values, but then switch to saying “a bit over half should be below the threshold”:

          I didn’t change the subject. You mistakenly thought that I was claiming that the criterion “statistically significant in the same direction” is expected to be met in replications of true effects only 50% of the time. My actual claim is that the p-value in such a replication is expected to be smaller (larger) than the original p-value 50% of the time. Therefore, replications will be statistically significant at 5% level half the time only if the original p-value was 0.05.

          Lee et al. compare the p-value distribution in their replication study to a theoretically expected distribution of replication p-values. For example, the theoretical expectation was that 79.4 (SD=3) p-values would be less than 5*10^-8 whereas the observed frequency was 97. They offer some explanations as to why the observed p-value distribution doesn’t exactly match the theoretical one, but in any case the observed distribution is entirely incompatible with the idea that many of the genome-wide significant SNPs found in Okbay et al. (2016) were false positives.

          Where does “bit over half should be that low” (below the 5e-8 threshold) come from?

          As you said, the p-values in the replication were all below the threshold rather than equal to it, so a bit over half of the p-values in an exact replication (which Lee’s wasn’t) should be below the threshold, again assuming that the effect sizes were estimated without bias in the original study. (Lee et al. do not assume that the original effect sizes are unbiased. Rather, they shrink them to adjust for the winner’s curse, or regression toward the mean. The fact that they use these shrunken effect sizes may, or may not, explain something about Fig. S3 as well.)

        • “the p-values in the replication were all below the threshold rather than equal to it”

          This should read: the p-values in the ORIGINAL STUDY were all below the threshold rather than equal to it

        • in any case the observed distribution is entirely incompatible with the idea that many of the genome-wide significant SNPs found in Okbay et al. (2016) were false positives.

          I agree, they were not false positives. They are “true positives”. The problem is all the “non-significant” SNPs were false negatives due to insufficient sample size for the chosen threshold.

        • Actually if you bother reading any serious GWAS study from, say, the last 10 years, you will find that they all contain replication; in fact no one will accept a GWAS paper without it.

  4. “For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30. Whether a P value is small or large, caution is warranted.”

    What is understood by “not be very surprising”? A little simulation study (sketched below) suggests that this situation would happen less than 1 in 20 times…

    Am I missing something here?
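
    For what it’s worth, here is a minimal version of that kind of simulation, assuming two independent two-sided z-tests, each with 80% power at the 0.05 level (so a true effect of roughly 2.8 standard errors):

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n_sim = 1_000_000

      # 80% power at alpha = 0.05 (two-sided) corresponds to a true effect of
      # about 1.96 + 0.84 = 2.80 standard errors.
      delta = stats.norm.ppf(0.975) + stats.norm.ppf(0.80)

      z1 = rng.normal(delta, 1.0, n_sim)
      z2 = rng.normal(delta, 1.0, n_sim)
      p1 = 2 * stats.norm.sf(np.abs(z1))
      p2 = 2 * stats.norm.sf(np.abs(z2))

      # One study below 0.01 while the other is above 0.30.
      event = ((p1 < 0.01) & (p2 > 0.30)) | ((p2 < 0.01) & (p1 > 0.30))
      print(event.mean())   # roughly 0.045, i.e. a bit under 1 in 20

    So “less than 1 in 20 times” looks about right; whether a 4-5% event counts as “not very surprising” is a matter of taste.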

  5. There’s the p-value, and then there are other ways to figure significance. They are not necessarily equivalent. The p-value, in particular, is probably one of the noisiest statistics you could find. With a standard deviation of around 0.24, it’s hard to know what to conclude just because your sample estimate is, say, p = 0.035. Does that tell us that the “true” p-value would have been in the range [0, .48]? Hard to say exactly…

    Confidence bands, whatever you want to call them, based on e.g. 2-sigma, are at least more stable, less subject to noise. I would say that almost all of us, looking at a set of data, would say that a sample result within 0.3 standard errors of another has poor support for the claim that the two are much different. Conversely, results 5 S.E. apart would be convincing to almost everyone, or at least would set off a hunt for systematic bias and error. As they should.

    I conclude that the problem is not exactly with “statistical significance” per se, but with 1) the use of a very noisy way to estimate it, and 2) a desire to force hard decisions out of data that really aren’t able to support them.

    A corollary is that if a decision has to be made based on data that is less well supported by its statistical qualities, then revisiting it from time to time is essential, to see if the consequences have held up. Of course, for politically charged issues, this is often nearly impossible.

    And let me end with my own pet little peeve – papers where you can’t tell if they are using the standard deviation or the standard error. Grrr!
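
    On the noisiness point, here is a quick sketch, assuming a two-sided z-test, of how variable the p-value is across repeated samples for a few true effect sizes. (Under the null the p-value is Uniform(0,1), so its standard deviation is 1/sqrt(12), about 0.29.)

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n_rep = 200_000

      # Empirical standard deviation of the two-sided p-value
      # for a few true effect sizes (in standard-error units).
      for delta in (0.0, 1.0, 2.0, 2.8):
          z = rng.normal(delta, 1.0, n_rep)
          p = 2 * stats.norm.sf(np.abs(z))
          print(delta, round(p.std(), 3))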

    • What do you mean by “true” p-value? I’m seeking a mathematical definition here. My confusion as to what you mean by the phrase is due to the fact that the p-value is a random variable — pre-data it has a distribution and post-data it is a single realized value. (Do you mean the expectation of the p-value? This would indeed be a function of the unknown parameter(s), but it would also depend on the sample size.)

      • @Corey: “What do you mean by “true” p-value?”

        I’m talking about estimating the value of a statistic vs getting the population-wide value. The p-value is used to talk about how likely it is that observed differences might or might not be found by random variations alone. But you can only ever get an estimate of the p-value because it is ultimately based on the sample standard deviation (or something more or less equivalent), which is only an estimate for the population standard deviation.

        • A p value is the frequency of getting certain kinds of data from a particular RNG. The RNG you choose is an arbitrary choice; it’s up to you what hypothesis you want to test, not random. The data you test it on is observed. There is nothing random about the p value itself. The only randomness involved is the hypothetical repetitions of data collection you imagine you might perform.

        • Daniel WOW if you had rendered that definition to me in my statistics class, I would be scratching my head. It bolsters the need to standardize definitions. If I compare your definition and the one I had in my 1st year statistics class, I’d be like ummmmm come again?

        • The point is that the p-value (a statistic calculated from the data) is random in the same sense that the average height of individuals from a population is random (you could say that there is nothing random about the average itself).

          And Thomas seems to think that, in the same way that the average height is an estimator of the true average height in the population, the p-value is an estimator of a true “something”.

        • The average height of a sample of individuals is frequentist random in the sense that repeat samples give different values. But the p value as calculated is a definite number, and it is the only number of relevance for the sample. Just as if I asked you what is the sample average, and you started talking about a mystical true sample average…

          The sample average has a connection to a population average, but it exactly answers the question “what is the average of these data points?”

          Similarly the p value exactly answers the question “how often would this chosen random number generator produce data more extreme than these data points”
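
          A minimal sketch of that reading, with a made-up data vector and the standard normal taken as the chosen RNG:

            import numpy as np

            rng = np.random.default_rng(1)

            observed = np.array([0.8, 1.1, -0.2, 1.6, 0.9, 1.3, 0.4, 1.0])   # made-up data
            t_obs = abs(observed.mean())   # test statistic: |sample mean|

            # The chosen RNG: standard normal draws of the same sample size.
            n_rep = 200_000
            fake = rng.standard_normal((n_rep, observed.size))
            t_fake = np.abs(fake.mean(axis=1))

            # Monte Carlo p-value: how often does this RNG produce data at least
            # as extreme (by this statistic) as the data in hand?
            print((t_fake >= t_obs).mean())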

        • I guess you might say that the PROCEDURE to observe some data, and *choose a hypothesis to test* basing it on an observed SD or skewness or whatever results in you testing a randomly selected hypothesis, and the hypothesis you “really” care about is one where the sd or the skewness or whatever is exactly known…

          the relevance of that problem can be simulated I guess… I doubt it will turn out to be terribly important. In some ways that was the purpose of the t-test: to average over the sampling distribution of the sd. It makes a big difference for very small samples (like 1 or 2 or even 5) but becomes much less relevant for 10, and by 25 the discrepancy between known sd and the t test is usually thought of as irrelevant.

          Is there a “t-test”-style test you can use for other distributions? Yes. None of this addresses actually important questions in my mind. The sampling distribution of the chosen hypothesis is hardly the big problem with p values.
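
          A quick way to see how fast that discrepancy shrinks is to compare two-sided 5% critical values of the t distribution against the known-sd (normal) value:

            from scipy import stats

            # Two-sided 5% critical values: t (sd estimated) vs. normal (sd known).
            for df in (1, 2, 5, 10, 25, 100):
                print(df, round(stats.t.ppf(0.975, df), 2))   # 12.71, 4.30, 2.57, 2.23, 2.06, 1.98
            print("normal", round(stats.norm.ppf(0.975), 2))  # 1.96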

        • > But the p value as calculated is a definite number

          The average height of a sample of individuals is a definite number. They are not different in this regard.

          I agree that the difference is that the sample average has a clear connection to a characteristic of the population (it’s an estimator of the population average) while the p-value doesn’t (it’s some complex thing related to the model… or to a RNG if you’re so inclined).

        • Of course if you understand probability in a frequentist way the p-value is random. You assume some underlying model and generate some data randomly, then the p-value is just a statistic of the data, a random variable that has a certain distribution and takes a fixed value once the data are observed like every statistic/RV.
          (As always, my take on frequentist probability is *not* that the model is necessarily true but rather that it is designated to be the basis of probability calculations because we are interested in how the world would be *if* the model was true, just to nail down something that allows us to do calculations.)

          What complicates matters is that you can look at two underlying distributions, (a) the distribution P0 that you want to test and (b) the distribution Q that you take in order to compute the distribution of the p-value.
          You can be interested in the case P0=Q, i.e., what happens if the H0 is true, in which case the p-value is often but not always uniformly [0,1]-distributed; but you can also be interested in the distribution of the p-value, computed for checking P0 against the data, if in fact the underlying distribution is something else than P0, although the p-value is still defined as quantifying the relation between the data and P0.

          Chances are that people who talk about the “true p-value” mean the situation that Q is the true underlying distribution, usually not precisely P0. However, if you believe like me and most Bayesians that such a “true” distribution doesn’t exist, talk about a “true p-value” doesn’t make sense in my book.

        • I don’t think Passin means “the situation that Q is the true underlying distribution, usually not precisely P0” — or at least, not just that. He was pretty clear in expressing that he understands the p-value as a quantity with a sampled value and a distinct “true” population value, just as the notion of “standard deviation” covers both the concept of the population standard deviation as a particular functional of a distribution and the sample standard deviation as an estimator of the population standard deviation. I was trying to get down to brass tacks on exactly what that would mean in the simplest possible context, mostly because I think he’s confused and I’m trying to do the Socratic dialogue thing but also because it’s possible that I’m the confused one because I’m misinterpreting him.

  6. “We’re frankly sick of seeing such nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews and instructional materials. An interval that contains the null value will often also contain non-null values of high practical importance. That said, if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’.”

    It seems to me that “if you deem all of the values inside the interval to be practically unimportant” could be improved with something like: “if in a pre-specified analysis plan (i.e., before seeing the data) you identified a range of values considered to be practically unimportant, you might be able to say…”

    I share Ioannidis’ concerns with people describing the data “as they wish, kind of impromptu and post-hoc” and point you to a longer discussion on this:
    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0195145

  7. Ioannidis’ argument is that if you take away a flawed decisionmaking tool, people will resort to making decisions with no tools at all, and further that they will be listened to. If true, this is the result of three separate problems: (a) the ultimate decisionmakers don’t understand the arguments being made by the advocates; (b) the advocates themselves can’t tell good evidence from bad evidence (or perhaps lack the incentives to inquire carefully into the difference); and (c) there is no trained surrogate for the decisionmaker whose job it is to play “null’s advocate.” There is literally nothing that can ever be done about (a) – it is unrealistic to expect those who have the power to make decisions to have stopped along the path to power to put in the hard work to become scientific analysts. There is little that can be done about (b) — as has been pointed out many times, it is very difficult to get someone to see something that it is not in his interest to see. But there is lots that can be done about (c). Imagine if in getting a drug approved, the applicant had to spend a sum of money equal to the sum spent on the trial on someone else working on behalf of, say, the established drug he wished to supplant. Both of them make reports to the decisionmaker in language that the decisionmaker can understand (hopefully) while making as many salient points about the uncertainty of the applicant’s analysis as they can cogently express. What neither side can do in this case, however, is blithely say: “I’m right. Look at the p value.” because that position has no intellectual responsibility, and this piece, even as the product of a mob, demands to have attention paid.

    So to answer Ioannidis: Yes, decisions must be made, and if decisionmakers are going to depend on a smooth-talking PNAS-level purveyor, that’s on him. No amount of p-values is going to stop the ignorant from falling into error. And sure, people argue for what they want to argue, but the counter for sloppy argument is focused counterargument. If decisionmakers don’t want to hear counterargument (and to hear it they’d have to pay for it) they deserve what they get. But cogent counterargument, arguments that can be understood, oughtn’t just blithely use p-values either, because the arguments from faulty p-values are no better for the null than they are for the alternative.

    • Yes, it’s striking that Ioannidis seems to think that a decision-making rule that doesn’t rely on a p-value is no rule at all: you’ll have people just making up claims and nobody will have any idea of what’s true or not. It just seems like a ridiculous argument.

    • The problem with this is that regulators need decision rules that they can apply uniformly. There are smart people at the FDA and other regulators, who understand science and statistics, but they have to answer to elected representatives and courts. They have to be able to explain their decisions not just in terms of the science but in terms of why they are applying the same standards to everyone. You cannot have completely different decision rules for each individual case. We are a country of laws. Right now p-values and significance testing play a role in those decisions. Any decision rule is going to provide an arbitrary cut point for what studies get rejected as evidence, but the regulators have to have a decision rule. What can replace statistical significance that will provide a better decision rule?

      • This is a strawman. First, while bad rules are probably better than no rules, maybe they’re not. But no one is saying that everyone isn’t subject to the same criteria: (a) do good work; (b) explain it well; (c) make all your data and programs freely available to anyone who wants to dispute it, who can then do their own good work and make their own better explanation.

      • Steve said,
        “The problem with this is that regulators need decision rules that they can apply uniformly. There are smart people at the FDA and other regulators, who understand science and statistics, but they have to answer to elected representatives and courts. They have to be able to explain their decisions not just in terms of the science but in terms of why they are applying the same standards to everyone. You cannot have completely different decision rules for each individual case. We are a country of laws. Right now p-values and significance testing play a role in those decisions. Any decision rule is going to provide an arbitrary cut point for what studies get rejected as evidence, but the regulators have to have a decision rule. What can replace statistical significance that will provide a better decision rule?”

        What you are describing is often called a “bright line rule”. See https://en.wikipedia.org/wiki/Bright-line_rule for discussion of the controversy (at least in the U.S.) about requiring “bright line rules”.

      • Steve:

        As I wrote in another comment on this thread: If we need rules, let’s have rules. But why should the rules be defined based on the tail-area probability with respect to a meaningless null hypothesis? That’s just weird. They could have rules based on minimum sample size, minimum accuracy of measurement, maximum standard error of estimation, things that are more relevant to the measurement and inference process.

        See also my P.S. in the above post.

  8. “I don’t think it’s such a bad idea that people listen to what I say, in part on the strength of my carefully-written books”

    You know what would be at least 100x more effective than publishing an opinion piece in Nature? Putting all the early warnings and recommendations by, e.g., Meehl, Cohen, Ioannidis, and Tukey, and later stuff by Simonsohn, Simmons, Gelman, Greenland, etc., into all your stats TEXTBOOKS and SYLLABI! Why? Because students mostly learn from textbooks and assigned readings, and your students will hold your feet to the fire. Mine certainly do. Virtually no grad students want to go deeply in debt and waste years of their lives pursuing a fraud. Who busted Hauser? His students.

    Review your colleagues’ stats textbooks. If they don’t have a thorough discussion of the replication crisis and all the factors that contribute to it, give them a bad review. Request your colleagues’ syllabi. If they don’t contain a thorough intro to the replication crisis, ask them why. Make it impossible for students to not know this stuff.

    (And by “you”, I mean everyone teaching a stats course or writing a stats textbook.)

      • Andrew, you’re probably sick of being asked this, but any idea on when that will come out? I’m teaching a 2nd semester applied methods course and I would love to have a textbook that covers all the standard regression models and also includes “grown up” treatments of significance.

    • This times 1 billion. I’m sick of hearing about students ‘not understanding p values’. Who is teaching them? How do you pass a stats course and not understand p values? Why don’t these students fail? Lazy and incompetent lecturers, that’s the problem.

      I’ve just finished watching McElreath’s most recent lectures and he does this continually. I hate this term, students don’t understand that term, blah blah blah. He never actually tries to teach the damn things correctly.

      Easy solution.
      1. Teach the damn stuff correctly.
      2. Give the stats students a short, 5-10 min oral examination. You don’t know, you fail.

      Have some standards.

      • Problem is, you can teach it 100% correctly and ask them loads of questions that all fall within the framework as you’ve taught it, and good students will:

        a) Answer the questions correctly

        b) Think that the p-value is the probability of the null, that non-significant results “were likely due to chance”, etc.

        The problem is that this whole method of analysis is counter-intuitive. I’m going to take my research hypothesis, come up with a specific version of “not my hypothesis”, and then calculate the probability of getting my data (or data more “extreme”) if that version of “not my hypothesis” is true. If this probability is small enough, I reject “not my hypothesis” and conclude that I have evidence for my hypothesis.

        Are we surprised that students take this method and rework it in their heads as “I’m calculating the probability that my hypothesis is wrong”?

        • “The problem is that this whole method of analysis is counter-intuitive.”

          Yes, but since it’s so widely used, we need to teach it *and* emphasize that it is indeed counterintuitive, and that it’s often misused, etc., etc.
          (e.g., if a student asks, “But if it’s so counterintuitive, why do people use it?”, we need to reply honestly that sometimes people do things just because “That’s the way we’ve always done it,” which makes life more difficult for everyone involved.)

        • Then examine them to see if that is what they are doing; if so, they fail. This isn’t hard.

          Stop allowing ‘scientists’ to treat stats 101 as a bare minimum requirement they have to fulfill before they go and ‘do science’.
          These people are ill educated and are being let loose on problems that really matter. Who is accrediting them?

        • Martha: I fully agree. Sometimes I feel conflicted about simultaneously teaching and criticizing a method. I don’t want to breed cynicism, and I also don’t want to breed credulity. It’s a tough line to walk.

          Bob: I admire your faith in the power of STAT 101 exams to ensure that future scientists thoroughly understand NHST.

    • When I was teaching stats in our masters program a few years back I definitely included a number of these arguments and stressed the dangers of relying on p<.05. When students handed in statistical work, or just reviewed other studies in their lit reviews, everything I said was lost. I'll take that up again in a later comment.

        • So we need to work on
          1) educating the advisors
          2) educating students to *critically* examine the literature in their fields (in particular, teach them to question, “That’s the way we’ve always done it”).

          (I don’t claim these tasks are easy, but they’re important.)

        • I speculate that the critical thinking needed to rectify some of these developments in statistics reflects a much larger educational shortfall that has not been articulated all that well. Some tech thought leaders seem to be aware of it. So that is something that some of us are looking at. It starts with a hunch sometimes.

        • I completely agree with your point although I can’t seem to talk about this critical thinking “shortfall” without sounding like Grandpa Simpson.

        • Cute. You may have read or heard that most transformative ideas are honed through very messy thought processes. Epiphanies, hunches, etc. Fluid and crystallized intelligence. Some scientists at a conference in Boston emphasized that the dearth of creativity in universities was considered a large risk.

        • In the courses I taught at Duke, it was the tutors who tried to undo it, insisting that the students definitely had to decide whether to reject or not. I was able to quickly change their minds, maybe only because they were informed they had to (maybe they reverted when I was gone). But it underlines the evils of inertia and culture.

          I also discussed in the class the likely extreme contrast to what they saw being done in all the literature and even by the professors in their other courses. Seemed to just make them uncomfortable. And perhaps just coincidentally, the next year one of the larger disciplines started to teach their own stats course.

        • Wow Bob, I hope you aren’t a professor!

          1. My experience is with students coming to me after they’ve taken my class, and their questions entirely focused on p-values and “which hypothesis test”. How can I fail these students after they’ve finished the class?
          2. Even if I had this experience while they were a student in my class, I cannot fail them on material outside of class
          3. Even if I had this experience on assignments in the class, this isn’t cause for failure (this would be weighting this concept ginormously compared to all the other concepts, the class isn’t called “Use and misuse of p-values”)

        • So where is the mystery here? A student can get through your class while lacking a basic understanding of the most fundamental of concepts? Is it any wonder these things are misapplied?

          The subject is poorly taught, that’s why you have this problem, yet there’s a lot of chin scratching about what to do.

          No I’m not a professor, but I interview a lot of people who claim to have training in statistics, sometimes to masters level. It takes < 60 seconds to establish when they don't. I'm not mystified by it, I've sat in enough lecture halls and passed enough exams to know why. What baffles me is the reluctance of the teaching profession to examine their own role in all of this.

        • Anonymous: what about “this would be weighting this concept ginormously compared to all the other concepts, the class isn’t called “Use and misuse of p-values”” did you not understand? The class is not about hypothesis testing but statistical modeling using lm, glm, and linear mixed models. Apparently your ignorance of my class doesn’t inhibit you from anonsplaining my problems to me.

          Were we all Anonymous, we’d have no need for this discussion, or even this blog.

        • If you make a mistake on your driving test on something basic that’s considered to be fundamental to being a safe driver, you fail the test. You’re not safe to drive.

          What’s the point of teaching glm and linear mixed models, if students don’t understand something basic like p-values? How can you certify that these students understand statistics but not p-values? They’re not some niche concept for crying out loud.

          Were the academic profession to teach the damn things correctly, there’d be no need for this discussion either. But I don’t hold out much hope that they’ll take any responsibility.

        • Bob:

          You can take a look at Regression and Other Stories when it comes out. But, the short story is that I don’t think p-values are “basic” or at all important, except for what one might call sociological reasons, that there are researchers who use those methods. When I teach statistics, I spend very little time on p-values, really only enough to explain why I think the way they’re usually used is mistaken. If someone takes my class, which covers lots of things, and happens not to catch exactly what p-values are . . . ok, I’d be a little disappointed, as I want them to understand everything. But it’s not such a big deal. If they took my class and ended up analyzing data by looking at which results are statistically significant and which are not, then I’d be super-sad, because that would imply that, whether or not they understand the mathematical definition of a p-value, they’re not “safe to drive,” as you put it.

        • You can’t teach p-values correctly because they don’t make sense. Andrew has the right approach: Spend a little time on them to explain why you shouldn’t use them. Then teach things that do make sense and work.

        • + 1 to Andrew and Jeff. The trouble is to change the culture of practice in the whole scientific community, hence the Nature letter which I happily co-signed. Only so much you can do in the classroom alone when students have advisors to report to, papers to publish, etc etc. Funnily enough, this whole “they must understand p-values the way I want or fail” is itself an instance of a dubious dichotomous decision based on p-values ;) Jokes aside, I’ve seen plenty of card-carrying statisticians abuse the heck out of NHST, have no awareness of type M/S errors, flagrantly misrepresent Bayesian statistics, etc. Change is sometimes frustratingly slow – we shouldn’t mete out excessive punishment to our students because of it.

        • What I honestly think is that the whole concept of probability is not trivial at all, but will always be misunderstood by most and give headaches to the few who understand it a bit better. This applies to p-values, confidence intervals, Bayesian inference, likelihood, imprecise probabilities, you name it.

          Yes I agree that p-values are systematically misunderstood. I agree that they are often taught badly, but I also agree that it is very hard if not impossible to teach them in such a way that it all makes smooth sense. Except that I think that alternative approaches are no better in this respect. Our discipline has fundamental disagreements even about what the most basic term “probability” means, so how can we just say “let’s just teach the students stuff correctly, and then if they don’t get it let’s fail them.” Statistics is hard, hard even for those who teach it.

          Actually I think that the concept of a statistical test and a p-value is rather brilliant as a challenge of our thinking and dealing with uncertainty. I love to teach it. However, part of its brilliance is to appreciate what’s problematic about it and for what (good) reasons people write papers titled “Abandon Statistical Significance”.

          Major issues are that the concept of probability always refers to the question “if we hadn’t observed what we actually have observed, what else could it have been”, i.e., it refers to something fundamentally unobservable, and that statistics, when done with appropriate appreciation of uncertainty (be it Bayesian, frequentist or whatever), doesn’t give the vast majority of non-statisticians what they want, which in the real world unfortunately may put our jobs at risk.

        • Andrew said “But it’s not such a big deal. If they took my class and ended up analyzing data by looking at which results are statistically significant and which are not, then I’d be super-sad”

          This is what I meant by “1. My experience is with students coming to me after they’ve taken my class, and their questions entirely focused on p-values and “which hypothesis test”. As I said, my class is on statistical modeling, not on NHST. I emphasize effect estimation and uncertainty. I emphasize the need to understand the biological consequences of the effect size and whether these consequences are consistent or radically differ across the CI. Some ecological theory may predict interactions of the same magnitude as a main effect, but if the CI includes small values of the interaction, then we should embrace that uncertainty.

          I tend toward Greenland’s take on this. I think this https://www.biorxiv.org/content/10.1101/458182v2 evolved out of my class and is a pretty good summary of some goals of my class. If you have criticism, criticize that pre-print (constructive criticism is always welcome).

        • In some discussions, I am skeptical or, should I say, wary of the emphasis on ‘uncertainty’ in the contexts that some of you raise. The underlying question? What would have been the case had researchers exercised the hefty recommendations that you all circulate? There are what? 100 million research papers out there.

          Cautioning students about ‘uncertainty’ also seems to imply that even with the best of our research efforts, we can’t really deliver very good to stellar findings. I think an outsider to your enterprises may see this. Certainly, one that has few or no conflicts of interest. And Sander’s comments on Twitter and discussions also reflect this as an underlying theme. Then he lists the reasons for this.

          This is just astounding to me. Why not go into another field that may deliver some value? When I was back in Cambridge, at an MIT conference, I think I made the same or similar case. And a few thought that was a very fair question. I mean to suggest that the concerns we express today were salient 25-30 years ago. I read through some of the ASA articles about Stat Significance. I believe that the overall speculation I have is that the research focus is still too narrow. I had hoped that the special edition was going to cover new ground.

  9. Regarding the concern about “popularity contests” in science, that’s what p < 0.05 is. How many people who use this actually understand it? In my experience, and according to the papers I've seen that attempt to estimate it, the answer is very few. Why does nearly everyone use it? Well, it's in all the textbooks, all the classes, everyone's colleagues use it, all the reviewers ask for it, and it's in all the articles we read. If everyone's doing it, it can't be that bad, can it?

    800 signatures is nothing compared to the implicit number of people who sign off on p < 0.05.

    • I think that the hundreds of signatories here (although a small fraction of the world’s researchers) could be helpful in persuading co-authors, reviewers and editors that concerns around current practice are shared by a broad community and not just a handful of statisticians who write on methodology (however well qualified they are to do so).

  10. A purist might argue the following: While those who use scientific results often must make dichotomous decisions by the very nature of their positions, there is nothing in the nature of science that compels scientists to draw dichotomous conclusions. This statement is 100% true but ignores the fact that the people and organizations who use scientific results are generally the ones who pay for the studies. Even if/when journals stop using .05 as a proxy for publication-worthy research, and universities stop using it as a proxy for tenure-worthy research, studies will continue to be funded only if researchers can make a case that there will be sufficient power to reject a false null, and only if the researcher promises to make a dichotomous recommendation based on statistical significance. But here’s the thing: as scientists, we can do both, we can make a recommendation based on an a priori decision rule AND we can draw a non-dichotomous conclusion that contributes to scientific knowledge by incrementally improving our estimation of an unknowable reality. Put the former in the report to your client or even in a paper for a journal aimed at industry or policymakers. Put the latter in your paper for an academic journal or scientific conference. Some may say that the two realms cannot be so cleanly divorced, which is also fine: feel free to cite and summarize each document in the other. Just make sure that you don’t conflate the two when communicating with either audience. Definitely be sure to avoid discussion of p-values and whether you’ve “proven” that an effect is “real” when you do interviews, or when your department puts out a press release, etc. That won’t fix the problem of journalists and others exaggerating or distorting the implications of research, but that’s another issue entirely. So, yeah: the choice between dichotomous or nuanced presentations of results is a false dichotomy. (I couldn’t resist!)

    • those who use scientific results often must make dichotomous decisions by the very nature of their positions

      Yes, and to do this they should perform a cost-benefit analysis that incorporates the uncertainties about the risks/costs and benefits. Statistical significance is not appropriate.

      • This is the argument that brought me around to Greenland’s position. For too long we’ve let the regulatory agencies (and the courts) shirk their duty to explicitly declare the grounds for their public policy decisions. They hate doing it, we all get that.

        Imagine if they were to declare: “We’ve decided to bet the lives of 3,000 people on the evidence before us. Saving 4,000 people would cost too much and saving 2,000 would benefit too few given our finances. Oh, and by the way, that’s 3,000 +/- 300.” It’s what we need to hear but dread to hear. Worse yet, it’s a prediction. So, once the results are in, the deciders are liable to be labeled witless as well as heartless and relieved of their incomes. Better then to let such decisions be made by the Magic 8 ball: “It is statistically significant!”, “Cannot predict now” and “Very doubtful.”

        Maybe the answer is that as a society we need to grow up and face harsh truths with our eyes wide open.

        P.S. The tl;dr on Ioannidis’ argument seems to be “Sure, the river of science is grossly polluted with bad science but if we give up statistical significance it’ll look like this: http://www.ohiohistorycentral.org/w/Cuyahoga_River_Fire”. If I’m correct, it’s a pretty damning indictment.

        • RE: ‘For too long we’ve let the regulatory agencies (and the courts) shirk their duty to explicitly declare the grounds for their public policy decisions. They hate doing it, we all get that.’
          ——–

          This is an interesting take, given Gerd Gigerenzer’s characterization of ‘Nudge’ as ‘libertarian paternalism’. Being explicit is not a strong suit of all that many people and schools of thought, which is why the contestation becomes laborious.

          So to whom must we turn to protect consumers/patients from harms? I think to inform the research enterprises consumers/patients can draw on their own experiences with treatments/products. Especially important when ‘seemingly’ a fraction of doctors can claim to understand the efficacies and accuracies of technologies and test results.

      • Aren’t we going to have the same problem or even more profound problems with cost/benefit analysis? What is the value of a human life? How costly are the consequences of certain diseases? How do we value the pain that a person is relieved of by a medication that may have some other rare but terrible side effect? Regulators do deal with these questions, but they are equally subjective and they ultimately fall to our democratic institutions. I am more fearful of turning what are truly questions about our values into technocratic “solutions.” The government still needs to answer the question “should I rely on these studies as evidence?” I am not saying p-values need to be part of the decision rule. But cost/benefit isn’t an answer to that question at all.

        • Steve:

          There are intermediate points between, on one hand, a full cost-benefit analysis laying out all options and, on the other, a decision made based on a tail-area probability with respect to a meaningless null hypothesis. One option, for example, could be a Bayesian analysis with an informative prior and some default assumptions about relative costs and benefits. For that matter, I don’t know that it would be so horrible for the FDA to approve a drug even if its benefit was not statistically significantly better than the alternative.

        • > approve a drug even if its benefit was not statistically significantly better than the alternative
          You are presuming that has not already been done at some agencies, even if only for orphan drugs.

        • Andrew:

          Thank you. That is an actual answer. I think that the FDA has approved drugs that are not statistically significantly better than the alternatives. (In some areas, they have refused to do so, but I do not think that has been their general practice.) I am not advocating significance testing, but just making the point that the industry is capable of generating mountains of research. Significance testing is a simple rule to dispense with some of the noise. Without it, the industry will generate more noise; what alternative simple (and hopefully better) filter can the FDA or other regulators use to combat that problem? At some point the FDA has to go into Congress or a court and say this research does not meet our standards. Congress says, but there are twenty papers that all say this drug is great. They are not going to listen to nuance. The FDA needs standards upon which a consensus exists to filter out the junk.

        • Steve:

          I accept the value of rules, even arbitrary rules. To start, I’d prefer if, instead of arbitrary statistical significance rules, the FDA or similar organizations were to use arbitrary rules on minimum sample size, measurement precision, standard errors, etc. If we’re gonna have rules, let’s have them make some sense.

        • By the way, from that document:

          “Can prior information or other data (e.g., studies of related drugs, pharmacologic effects) be considered statistically in choosing the NI margins or in deciding whether the NI study has demonstrated its objective?

          “Prior information can be incorporated into a statistical model or within a Bayesian framework to take into account such factors as evidence of effects in other related indications or on other endpoints. As discussed in section IV.B.2.b, a meta-analysis is often used to estimate the average effect of the active control for purposes of setting the NI margin, and in certain circumstances, trials from related indications or for other drugs in the same class may be included in the meta-analysis conducted for this purpose. Some methods of meta-analysis allow the down-weighting of less relevant studies or studies that are not randomized or controlled (e.g., observational studies), which can be particularly important if few placebo-controlled trials are available.

          “Bayesian methods that incorporate historical information from past active control studies through the use of prior distributions of model parameters provide an alternative approach to evaluating non-inferiority in the NI trial itself. Although discussed in the literature and used in other research settings, CDER and CBER have not had much experience to date in evaluating NI trials of new drugs or therapeutic biologics that make use of a Bayesian approach for design and analysis. If a sponsor is planning to conduct a Bayesian NI trial, early discussions with the Agency are advised.

          “If important covariates are distributed differently in the historical studies than in the current NI study, model-based approaches may be used to adjust for these covariates in the NI analysis. Such covariates should be identified prior to the NI trial, and the methods for covariate adjustment should be specified prospectively in the NI trial protocol. Applying post-hoc adjustments developed at the time of analyzing the NI trial would not be appropriate.”

          As other people have mentioned in the comments, the mythical FDA that simply checks if p<0.05 and readily approves whatever has been brought before them is a strawman.

        • Aren’t we going to have the same problem or even more profound problems with cost/benefit analysis? What is the value of a human life? How costly are the consequences of certain diseases? How do we value the pain that a person is relieved of by a medication that may have some other rare but terrible side effect? Regulators do deal with these questions, but they are equally subjective and they ultimately fall to our democratic institutions.

          If need be, perform a cost-benefit analysis and plug in a range of numbers people throw out there as plausible. Get some upper/lower bounds. Or like I said, have them flip a coin, make it a lottery. But to me it is apparent this type of stuff shouldn’t be the business of “regulators” to begin with, since they obviously do not have the information required to “regulate” wisely.

          Anything (or nothing) is better than requiring elaborate and expensive rituals be performed that mislead because they don’t answer the relevant questions anyway.

      • I didn’t intend to express cynicism about the value or attainability of the petition’s goals, only to point out that the debate around the petition has at times treated as interchangeable arguments about what we should do and what we can or can’t do. We should conduct and report out studies without reference to statistical significance; we can’t get funded without applying the bad decision rules policymakers impose as a condition of funding, but we can report our results in different ways for different purposes while pointing out to anyone who will listen that p-values are dumb and that nuanced interpretations are what matter. These two positions do not contradict each other and as I said both are true. This is not the case for another of Ioannidis’s arguments: where the petition asserts that we should abandon statistical significance because it’s dumb, Ioannidis retorts that we should continue to report statistical significance because it has inherent scientific value.

        So I also +1 your response: Yes, policymakers *should* perform a cost-benefit analysis; they *shouldn’t* make funding contingent on studies applying inappropriate decision rules. If the objective is to change how policymakers and funding agencies evaluate proposed studies for funding, I’m 100% on board, but the audience for this petition seems to be researchers and journal editors. Appealing to researchers is a very indirect way to go about impacting policymakers, although it could be done in principle. For example, one approach would be a petition calling on researchers to ignore proposal review criteria that emphasize statistical significance testing (e.g., requiring that design decisions be built around a power analysis), and to refuse to apply those criteria when serving on a review board. If a critical mass of top researchers/research institutions followed through, agencies might be brought to heel. I would sign that petition in a show of support, though no one in my organization would allow me on their proposal writing teams if I did that in practice.

        In any event, that hypothetical petition is not *this* petition. This petition asks researchers to change their practices, even if there is no change in the incentives that extend from funding agencies down to study collaborators. The implication is that we must change within this universe while changing the universe itself through other efforts. I am 100% on board for that, too–which is the basis for my original comment that, if we are to serve two masters–science and policymakers–then the most expedient strategy is to stop trying to satisfy them both with the same product. That moves us forward to a point where we can focus on changing the way science is funded. Is there some hypocrisy here? Not if we don’t pretend that our funding-directed papers and the media for them are more than a contractual obligation. It also gives us a pretty powerful argument for our cause: “Congressman Smith, policy should be based on research results that scientists are publishing in actual science publications, which follow actual science standards, not on results that are held to agency-defined standards.”

        • My wife regularly serves on grant review boards; most of these grants are approved based on scores given by everyday researchers who rate them on various rating scales provided by the funding agency. None of these reviewers are administrators at granting agencies. Affecting the researchers could affect the granting agencies through these reviews.

        • You and I are in complete agreement–affecting researchers can affect agencies–but something about your phrasing makes me infer that you believe we disagree. My point isn’t that appealing to researchers can’t change agency policies–your wife would seem to be an ideal target for what I described as “a petition calling on researchers to…refuse to apply those criteria when serving on a review board.” My point is that this petition doesn’t do that: it only asks us to do better analyses and write better papers. The petition/effort at hand fails to acknowledge that it’s currently not economically feasible (for a great many of us) to refuse to conduct and report the results of null hypothesis tests. My suggestion for reconciling this reality with the noble goals of the petition is that we do both things until efforts succeed at changing how requests for proposals are written and reviewed. We can and should do as the petition asks and write better scientific papers, but we must also write the reports our funding agencies are paying us for.

  11. I too was one of the signatories. A few thoughts:

    1. A statistical decision rule is a coordination equilibrium in a very large game with thousands of researchers, journal editors and data users. Perhaps once upon a time such a rule might have been proposed on scientific grounds alone (rightly or wrongly), but now the rule is firmly in place with each use providing an incentive for additional use. That’s why my students (see comment above) set aside what I taught in my stats class and embraced NHST. The research they rely on uses it, and the research they hope to produce will be judged by it. That matters a lot more to them than what I think.

    That’s why mass signatures make sense. It is not mob rule in the sociological sense; we signers are not swept up in a wave of transient hysterical solidarity. Rather, we are trying to dent the self-fulfilling power of expectations that locks NHST in place. 800 is too few to do this, alas, but it’s worth a try to get this going.

    2. Ioannidis is begging the question: he assumes we need formal, standardized decision rules in order to argue against their abandonment. He could be right, but it’s a case he needs to make, not just assert.

    3. There is a second-best case, sort of, for NHST that comes from Ted Porter, that this is what public agencies and other decision-making bodies that rely on research have to do to fend off attacks on their objectivity and competence. (That was the argument for cost-benefit analysis.) Hence the FDA, for instance, is under intense, constant pressure from the pharmaceutical industry. Their defense is to adopt the posture, “We did not choose to rule as we did. We were compelled by a formal decision algorithm that is set ex ante and over which we have no control.” To put it differently, the better decision processes many of us advocate are better only if they aren’t suborned. (That could be JI’s point.) But if that’s the case we should be honest about it, and recognize we are paying a serious price for our flawed political economy of regulation. Me, I’m an optimist and would prefer to fight for a less biased political economy and not stick with NHST as a shield against the one we’ve got.

    4. What I tried to teach in my classes is that the most important determinants of the dispositive power of statistical evidence *should be* its quality (research design, aptness of measurement) and diversity. “Significance” addresses neither of these. Its worst effect is that, like a magician, it distracts us from what we should be paying most attention to.

  12. RE: John’s comment

    ‘Why it is misleading: please see my e-mail about what I think regarding the inappropriateness of having “signatories” when we are discussing about scientific methods. We do need to reach conclusions dichotomously most of the time: is this genetic variant causing depression, yes or no? Should I spend 1 billion dollars to develop a treatment based on this pathway, yes or no? Is this treatment effective enough to warrant taking it, yes or no? Is this pollutant causing cancer, yes or no?

    Of course it makes sense to cast some questions as dichotomies. But the issue at hand is whether statistical significance is the means by which an empirically good or sound ‘yes or no’ can be obtained.

  13. This discussion is fascinating.
    However, with three posts on NHST in the past couple of weeks, as a doctor I’m worried that anoneuoid is going to blow an aneurysm.

    Jokes aside, I’d like to point out that in medicine most of the clinically useful studies take the simple form of comparing treatment A with treatment B. We really do need some sort of threshold we can agree on to say that (for instance) treatment A is better. Used properly NHST and P-values allow us to do this. And yes, we do take into account the effect size. And yes, we do realise that a statistically significant result may be misleading for any number of reasons.

    The GRADE working group defines level 1 (high grade) evidence as being derived from at least 2 high quality randomised controlled trials. Treatments supported by such level 1 evidence will almost always be used in preference to anything else. With lower levels of evidence we use a more nuanced approach considering for each individual patient the cost/benefit/risk.

    Bottom line: an evidential threshold is necessary for the practice of evidence-based medicine. I imagine it is not necessary for the stuff Andrew and a lot of other people contributing to this blog do.

    • Jokes aside, I’d like to point out that in medicine most of the clinically useful studies take the simple form of comparing treatment A with treatment B. We really do need some sort of threshold we can agree on to say that (for instance) treatment A is better. Used properly NHST and P-values allow us to do this.

      No, NHST does not allow you to do this, not even close. If you think it does that means you are confused.

      Also, the correct way to see which treatment/whatever to choose is well known and established: do a cost-benefit analysis which also takes into account the price/side-effects/etc. It has nothing to do with statistical significance. This seems to be a repeated “meme” amongst medical practitioners, perhaps see here:
      https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/#comment-998093

      • To make things clear:

        1. Usually when we compare treatment A and treatment B we do so because we have equipoise – that is, we are not sure whether the treatments are equivalent or, if they are different, which treatment is better. This is a kind of informal prior.

        2. If the study is a properly performed RCT then the p-value is valid and gives an indication of how likely it is that the two treatments are equivalent. I know you (and others) call the null of no difference a “straw man” but this is not correct. It is entirely possible that two treatments may be equally effective especially if they have similar pharmaco/physiologic mechanisms or if neither has a true effect beyond the placebo.

        3. If we have two or more good studies that agree which treatment is better, (that is we have replication), then we have level 1 evidence and we will generally always use the better treatment. If we only have one good study then the treatment that looks better usually stays in a kind of limbo, tentatively used by some people, until its effectiveness is confirmed by another good study. So we don’t accept one p<0.05, we want two.

        As to cost/benefit, this is always considered at an individual patient level. Every patient will place different values on the benefits and risks. It's not a statistical question, it's a personal one. For instance, the evidence from several trials shows that thrombolysis in stroke increases the risk of death within 7 days but reduces the risk of long term disability. A patient may forgo the long term benefit if he/she values this less than the increased probability of avoiding an early death.

        Ben:
        1. We don't think in flowcharts. We think like clinicians – we deal with human beings and are used to uncertainty.
        2. Despite what Andrew says, the best estimate of the true effect size is always the observed effect size.
        3. A confidence interval simply consists of two inverted P-values. They may help some people visualise the uncertainty around the effect size but they are certainly not essential.

        • 1. Usually when we compare treatment A and treatment B we do so because we have equipoise – that is, we are not sure whether the treatments are equivalent or, if they are different, which treatment is better. This is a kind of informal prior.

          They are different, and which treatment is better is determined by the health benefits, side effect profile, financial cost, etc. Statistical significance does not give you the information you are looking for.

          2. If the study is a properly performed RCT then the p-value is valid and gives an indication of how likely it is that the two treatments are equivalent.

          No, the p-values you mention are calculated based on the assumption the two treatments are equivalent. You are committing a logical fallacy called “transposing the conditional”: https://en.wikipedia.org/wiki/Confusion_of_the_inverse

          I know you (and others) call the null of no difference a “straw man” but this is not correct. It is entirely possible that two treatments may be equally effective especially if they have similar pharmaco/physiologic mechanisms or if neither has a true effect beyond the placebo.

          This is not backed up empirically. When large enough datasets are inspected everything correlates with everything else. Every treatment will be different from every other treatment, although this may be of no practical importance. Eg:

          These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan [1] and Nunnally [8]. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother’s education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or non-monotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10^-6).

          http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf

          But in general you can see how, across research fields, larger sample sizes go with stricter “significance” thresholds. (A small simulation sketch of this everything-correlates point follows this comment.)

          3. If we have two or more good studies that agree which treatment is better, (that is we have replication), then we have level 1 evidence and we will generally always use the better treatment.

          Statistical significance is used to determine what studies get published though, so this is looking at a biased sample of studies. The number of unpublished (negative) studies is unknown. Also, if the studies have sufficient power (and why in the world would you run an underpowered replication study?) you will have a 50% chance of two studies getting significance in the same direction if random noise is being measured; that isn’t very impressive. It is much more impressive if they get numerically similar results.

          As to cost/benefit, this is always considered at an individual patient level. Every patient will place different values on the benefits and risks. It’s not a statistical question, it’s a personal one.

          And the info they need is not whether a difference was significant or not, it is the expected benefit along with the uncertainties surrounding it.

          Essentially, the process you describe is constructed by stringing together a series of fallacies…
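          (A minimal simulation sketch of the Meehl point quoted above, with invented numbers: at a Minnesota-sized sample, even a trivially small true correlation comes out wildly “statistically significant”. The effect size, seed, and library choices are illustrative, not anything from the original comment.)

```python
# Minimal sketch: with very large samples, even trivially small true
# associations come out "statistically significant".
# The effect size (r = 0.03) and sample size (n = 55,000) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 55_000          # roughly the size of Meehl's Minnesota sample
true_r = 0.03       # a practically negligible correlation

x = rng.standard_normal(n)
y = true_r * x + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

r, p = stats.pearsonr(x, y)
print(f"observed r = {r:.3f}, p = {p:.2e}")   # p is typically far below 0.001
```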

        • Nick,

          1. I was being a bit colorful with the “flow chart” comment, my apologies. What I mean is that many people will only consider effect sizes to be of interest IF the p-value is less than 0.05. So when you say that you take into account effect size, my question is if you take into account all effect sizes regardless of their associated p-value, or if you just take into account the significant ones.

          2. Following up on 1., if there is any filtering for significance going on, then it is not true that “the best estimate of the true effect size is always the observed effect size”. This is because the filtering process gets rid of the smaller estimates and only retains the larger ones. I think of it this way: if you take an unbiased estimator and condition it on p < 0.05, you now have a biased estimator. (A small simulation sketch of this point follows this comment.)

          3. Using a CI is a much better idea than reporting an estimate by itself along with a p-value. Yes, the endpoints are "inverted" from the two sided p = 0.05, but they have the huge advantage of being numbers that are in the units of the variable of interest.

          As opposed to a p-value, which is a probability that most people misunderstand. You yourself just referred to it as "an indication of how likely it is that the two treatments are equal", which is false, and not trivially so. This is why I advocate that we stick to CIs – their values are much less likely to foster confusion and misinterpretations of data. And if we insist upon making a "reject" vs. "fail to reject" decision, we can still do it with the CI, which renders the p-value superfluous.
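          (A minimal simulation sketch of the point in (2) above, with invented numbers: an unbiased estimate of a small true effect, once filtered on p < 0.05, is badly inflated on average. The true effect, group sizes, and number of simulations are arbitrary illustrative choices.)

```python
# Sketch: an unbiased estimate of a small true effect, filtered on p < 0.05,
# becomes inflated on average. All numbers below are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, sd, n_per_group, n_sims = 0.2, 1.0, 25, 20_000

estimates, significant = [], []
for _ in range(n_sims):
    a = rng.normal(0.0, sd, n_per_group)            # control group
    b = rng.normal(true_effect, sd, n_per_group)    # treatment group
    estimates.append(b.mean() - a.mean())
    _, p = stats.ttest_ind(b, a)
    significant.append(p < 0.05)

estimates, significant = np.array(estimates), np.array(significant)
print("mean of all estimates:      ", estimates.mean())               # ~0.2, unbiased
print("mean of 'significant' ones: ", estimates[significant].mean())  # much larger
```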

    • “Used properly NHST and P-values allow us to do this. And yes, we do take into account the effect size.”

      Correct me if I’m wrong, but my impression is that the flowchart looks like this:

      1. Is p < 0.05? If not, discard. If so, move to 2.

      2. Report and interpret an effect size estimate.

      As Andrew constantly reminds us, this is a recipe for inflating effect sizes. That "evidential threshold" attached to a noisy estimator will allow only the big estimates through.

      I'm not familiar with GRADE. Does this standard insist on only looking at "registered report" style studies where it is guaranteed that the results will be published regardless of significance?

      Even in a "no publication bias" environment, I still see no case for preferring p-values to confidence intervals. CIs still let us use a threshold (check if it excludes zero), and they also give a measure of uncertainty in the estimate. The only possible scenario I can imagine in which a p-value would be used in place of a CI is when there is no interpretable statistic to put a CI around. The p-value typically just encourages confusion by shifting focus from the effect size to the "strength" of the significance ("Wow, p < 0.0001! This is great!").

  14. Re: Statement: For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss (8).
    Why it is misleading: this example is factually wrong; it is true only if we are certain that the effect being addressed is truly non-null.

    This was the one sentence in the Valentin et al. article that I was hoping would be nixed. I thought it was confusing and possibly inaccurate. Otherwise, I endorsed the call for signatures. This was my only real disagreement with John Ioannidis’ objection.

    • I didn’t understand JI’s complaint: a presumably fair coin is tossed and 5 heads in a row occur; the probability given the presumption (null true) is .5^5 * 2 ≈ 0.06, and if 6 heads in a row occur, .5^6 * 2 ≈ 0.03.

      I agree that without reading the reference it would be confusing, but to state that it was factually wrong?
      (Now I recall there being a reference, but it and the sentence itself seem to have been removed in the final version.)

      • Re: For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss (8).
        Why it is misleading: this example is factually wrong; it is true only if we are certain that the effect being addressed is truly non-null.

        I might have characterized it as a confusing comparison to draw. Then again, I have rarely found coin-tossing analogies all that much help in discerning the value of an inference.

      • Reference and sentence removed because (as John’s mistaken statement shows) it was confusing, especially given the space compression. Keith, you have it exactly right; John has it 180 degrees backwards: assume a fair coin toss (the null), and the difference between 0.06 and 0.03 is the difference in the probability of seeing 4 heads in a row vs 5 heads in a row. For a more elaborate explanation (and of course much more of other detail) see my contribution “Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values” to the TAS special issue at
        https://tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
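        (A small sketch of the arithmetic behind this correction, assuming nothing beyond the fair-coin null: the S-value s = -log2(p) converts a p-value into “heads in a row from a fair coin” worth of surprise, and the gap between P = 0.06 and P = 0.03 is about one extra head. The two-sided products 2 * 0.5^5 and 2 * 0.5^6 are the numbers in Keith’s comment.)

```python
# Sketch of the coin-toss arithmetic: s = -log2(p) measures surprise in
# "heads in a row from a fair coin"; p = 0.06 vs p = 0.03 differs by ~1 head.
import math

for p in (0.06, 0.05, 0.03):
    s = -math.log2(p)
    print(f"p = {p:.2f}  ->  s = {s:.2f} bits (about {s:.1f} heads in a row)")

# Two-sided version from Keith's comment: k heads (or k tails) in k fair tosses
for k in (5, 6):
    print(f"{k} in a row (two-sided): 2 * 0.5**{k} = {2 * 0.5**k:.4f}")
```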

  15. From my personal viewpoint, anybody(!) teaching stochastics should stand up and talk, in his own field, about the problems of statistical “cookbook science” and the ways to solve it! An excellent example is the presentation given by Kristin Lennox at the Lawrence Livermore National Laboratory:
    “Everything wrong with statistics (and how to fix it)”
    https://www.youtube.com/watch?v=be2wuOaglFY&t=904s

  16. I feel that a big problem in this discussion is that people rejecting this or that method for decision making don’t come up with a constructive suggestion on what else to do. It’s easy to say that one should learn to live with uncertainty, or learn to accept uncertainty etc., but it’s not easy to translate it into real-life situations in the reader’s own field.

    If you propose Bayes factor as an alternative, we are back into the dichotomizing universe, except that it is trichotomizing or quadrupalizing or whatever (Jeffreys’ categories), with the added complication that one has to display a range of BFs with different priors and make one’s call as to what it means.

    In one paper published in a top journal I managed to get away with saying that we don’t really know what to conclude because even if the data are “significant” or something analogous to that (BF>10 say), it remains to be seen what will happen if we replicate this experiment. But I suspect I can’t get away with it every time.

    In a paper, if you raise a research question, there is a (natural) expectation that you provide an answer.

    • No one is suggesting Bayes factors. In fact the Nature note explicitly mentions BF as a bad alternative.

      The real alternative to p-values is hiring a statistician.

      Because analyzing data is hard, and should be done by professional humans.

      • Hiring a statistician doesn’t really work in practice unless the statistician knows the domain area *and* the hirer knows enough statistics to understand what the statistician is saying. Hiring a statistician is an idea that can work if you just hand over the entire analysis+interpretation work to a domain expert statistician. However, I know plenty of statisticians who don’t understand how to use p-values correctly.

        • I clearly remember meeting Andrew in his office in 2007 or 2008; I had questions about his Gelman and Hill book, and a friend at Columbia who knew Andrew took me to him. Andrew answered my questions but my friend explained later that we (Andrew and I) were talking past each other. I didn’t know how to ask the question clearly because I didn’t have the right vocabulary, and probably Andrew was trying to guess what I was getting at.

          Today, as a de facto consultant for students, I get the same feeling as must have gone through Andrew’s mind: what is this person asking me? We don’t have enough common ground to even have them ask me the question clearly. E.g., if I see shrunk posterior estimates of individual groups, say the posterior distribution of by-subject intercept adjustments in a hierarchical linear model, can I say that one subject shows a significant effect if their posterior distribution is far from 0? Can I say that two subjects are significantly different if their distributions don’t overlap? If you tell them that the posterior distribution of each subject’s by-subject intercept adjustment is telling you the uncertainty of the estimate of that adjustment, they feel hard done by because they were expecting a yes/no answer and you are trying to talk about something unrelated (in their mind). I see their struggle because I have been in that kind of situation. What I want to say to them is: please take my 13-week course next semester and then let’s take up this conversation again. But they wanted an answer that doesn’t require such a commitment to understand the context behind my answer.
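          (A small sketch of how such questions can be answered with posterior probabilities rather than a significance verdict. The “posterior draws” below are simulated stand-ins for draws of by-subject intercept adjustments from a fitted hierarchical model; the means, spreads, and subject labels are invented.)

```python
# Sketch: answer "is this subject's effect reliably positive, and does
# subject 1 differ from subject 2?" with posterior probabilities.
# The arrays below are simulated stand-ins for posterior draws that would
# come from a fitted hierarchical model.
import numpy as np

rng = np.random.default_rng(2)
subj1 = rng.normal(0.8, 0.5, 4000)   # draws of subject 1's intercept adjustment
subj2 = rng.normal(0.2, 0.5, 4000)   # draws of subject 2's intercept adjustment

print("P(subject 1 adjustment > 0)    =", (subj1 > 0).mean())
print("P(subject 1 > subject 2)       =", (subj1 > subj2).mean())
print("80% interval for the difference =",
      np.quantile(subj1 - subj2, [0.1, 0.9]))
```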

        • +1 to all of this. Most people are willing to entertain different methods of statistics, but are confused fundamentally about what statistics is, so they want answers that end up ‘looking like’ whatever NHST mashup they were trained in. When Andrew or any of us say, ‘embrace variation and uncertainty’, eyes glaze over because you’ve just made it 10X harder to use statistics as an Oracle to build a ‘story’ out of. Plus, as you say, there were card-carrying statisticians drilling the NHST significance finding mashup approach into them before, so our arguments feel like we must be mistaken or exaggerating or something.

        • I think someone who had never heard of a p-value in his or her life might still ask whether a certain drug or treatment or program has at least such-and-such effect (for instance, at least 5kg of weight loss). That is a 100% rational, obvious type of thing to ask if a study is being done.

          So someone says, “This weight-loss treatment is only worth doing if people will lose at least 5kg” and we go out and gather some data to see if that’s true. If the data purports to show a treatment effect, it’s not an “NHST mashup” for that person to ask “How certain are we that the effect is more than 5kg”?

          The hang-up this group seems to have is when, god forbid, that person has a fixed amount of certainty in mind. Am I correct that it is considered OK to ask how big the effect is and how certain we are of the estimate, but NOT OK to say, “Well, if we can’t be at least 75% (or 90% or 95% or 99.9%) certain then the study is inconclusive”?

          I just can’t even…

        • No, you’re missing the point. The problem is the question you want an answer to is not answered by p values. “What’s the probability that the effect is more than 5kg” can only be answered by Bayesian probability calculations. To a Frequentist, the effect is whatever it is and there is no probability associated with it. The probability is associated with repeated experiments.

        • Also, it’s rare that a utility is zero up to 5kg and then jumps up to some value. Most people here want some kind of realistic utility used in decision making. For example, if you offer a near certainty of 15kg of weight loss, this might be considered conclusive evidence and then you decide the treatment “works”, but obviously if there’s a 35% chance of death, it doesn’t work.

          If you want to report “given a typical utility the treatment overall has an expected value of $75”, I’d be very happy to let you do that, provided you have justified why this utility is reasonable.
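          (A minimal sketch of this expected-utility point. The dollar values per kg lost and for death, and the outcome model, are invented purely for illustration; the point is only that the decision follows from expected utility, not from a significance threshold.)

```python
# Sketch of the expected-utility point above. The utility numbers (dollars per
# kg lost, dollar value assigned to death) are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(3)
n_draws = 10_000
weight_loss = rng.normal(15.0, 1.0, n_draws)   # near-certain ~15 kg loss
death = rng.random(n_draws) < 0.35             # 35% chance of death

value_per_kg = 10.0          # assumed $ per kg lost
value_of_death = -1_000_000  # assumed $ value assigned to death

utility = np.where(death, value_of_death, value_per_kg * weight_loss)
print("expected utility per patient: $", round(utility.mean(), 2))
# Dominated by the death risk: the treatment "doesn't work" despite the big effect.
```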

        • Yea, as Daniel is saying, that’s not what a p-value tells you! Interestingly, it’s a good example of a NHST mashup. Bayesian calculations give you the kind of answers you want there, of course as always conditional on the model and data…

        • Yes, there are lots of problems in working with people in other fields. It’s an ongoing problem — won’t ever stop, but we need to keep at it, because that’s better than the alternative.

          For example, I’ve worked with biologists who were open to using Bayesian methods, but “standard software” wasn’t helpful because it didn’t address the kinds of situations they were studying. I remember one case, where someone in the field had developed software to fit a particular situation, but it was poorly documented — and, in fact, the labels used were somewhat misleading. For example, a procedure labeled “hyperprior” was actually a procedure for choosing either one of a finite list of values of a parameter for a prior, or choosing to put a hyperprior on the parameter. The grad student trying to use it was (understandably) very confused, and I found it difficult to figure out what was going on in order to help her use it and explain what she was doing. There’s just lots of work to be done.

  17. We used to execute people based on trial by combat.

    But now you are suggesting it is a bad idea. What alternatives are you proposing? Throwing suspects into water to see if they drown? Trial by water is known to have its own drawbacks.

    And we do need to have a binary solution; we do need to know whether the suspect is guilty or not.

    • I understand your comment to mean: balance evidence and judgement to come up with a decision. I agree, but it’s definitely not clear to the novice how to do that in practice. Telling them to go read Gelman’s work would help if they knew enough to understand Andrew’s statements in his papers and books.

      And the situation is different because the jury+judge (the researcher) has to be a domain expert to make the judgement, and has to be a certain kind of expert, e.g., one who can see past p less than 0.05. I would say that there is hardly anyone who fits that description. So telling them to drop the one thing they can latch onto, without providing the proper education to see what else is possible, is asking for a lot. I guess I belong to the camp that says that education is the real problem. One thing I have noticed as a teacher is that education is difficult to impart; people get a lot of interference with p-values. If you teach them Bayes, they’ll import everything they think they know about p-values to try to apply that to posterior probs or even the posterior distribution (by looking to see if 0 is in it). I’ve done that too in my early papers using Bayes (2011 and a few years in from that time), but I was explicitly trying to avoid getting into a fight with reviewers during that period (now I feel established enough to not go there).

      BTW, I like the Freedman/Spiegelhalter/Kruschke ROPE approach a lot. I think it should be taught as a standard tool, and it’s something people can understand and use.
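      (A small sketch of the Kruschke-style ROPE rule mentioned above, using a central 95% posterior interval in place of an HDI for simplicity. The effect draws and the ROPE limits are invented stand-ins.)

```python
# Sketch of a Kruschke-style ROPE decision, using a central 95% posterior
# interval instead of an HDI for simplicity. Draws and ROPE limits are invented.
import numpy as np

rng = np.random.default_rng(4)
effect_draws = rng.normal(0.3, 0.12, 4000)   # stand-in posterior draws
rope = (-0.1, 0.1)                           # region of practical equivalence

lo, hi = np.quantile(effect_draws, [0.025, 0.975])
if hi < rope[0] or lo > rope[1]:
    decision = "reject practical equivalence"
elif lo >= rope[0] and hi <= rope[1]:
    decision = "accept practical equivalence"
else:
    decision = "withhold judgement"

print(f"95% interval = ({lo:.2f}, {hi:.2f}); decision: {decision}")
print("P(effect inside ROPE) =",
      ((effect_draws >= rope[0]) & (effect_draws <= rope[1])).mean())
```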

      • Even though I also like the ROPE procedure outlined by Kruschke, I still think it results in a form of dichotomous thinking. You’re no longer looking to see whether a point null (0 or 1) is inside the interval but rather whether any part of a predefined region of equivalence is. However, again, if a tiny portion of the region is inside the interval, you are not able to reject it. That doesn’t seem very realistic or practical to me. If the vast majority of the interval contains effects to the right of the region of practical equivalence, I’d say you should be able to reject it, not suspend your judgement.

        • It depends on how you use it; I like the original Freedman/Spiegelhalter way of thinking. In practice, in my own work, what I find is that there is some weak support for the theory under consideration, but often it’s equivocal. Really, there often isn’t much information in the data I have for theory development.

      • > So telling them to drop the one thing they can latch onto, without providing the proper education to see what else is possible, is asking for a lot.

        Yet, this is exactly what we should ask for.

        Let me carry on with my overexaggerated example.

        First, let’s establish that criminal justice is a fitting analogy. It is one case where we have to give a yes-or-no answer. We are not sending people to jail for 1% of the full sentence because we have 1% certainty that they are guilty. The jury and judge do not generally do a cost-benefit analysis. They are deciding guilty or not.

        Would it be wonderful if we had a simple press-a-button method to decide if a suspect is guilty or not? For example: if some numerical summary of evidence is lower than 0.05, the suspect is guilty. Wouldn’t it be nice? Sure, it would make criminal justice much easier and faster and more accessible for non-experts, say, mobs with pitchforks. But somehow we, as a society, understand that this is not the way to go. And that we need an army of lawyers and judges and juries and criminal experts to analyze each case individually.

        Why do we have p-values in, say, drug discovery and don’t have them in criminal justice? I see a few possible explanations:

        1) Drug discovery is much easier and more streamlined than criminal justice.

        2) Drug discovery is less important than criminal justice, so we are fine with making mistakes.

        and yeah, I don’t think these explanations are correct.

  18. MD clinician very appreciative of this long overdue discussion and comments. I believe there are at least a few examples of re-analysis of raw data (still using NHST) leading to different conclusions. Is this feasible prospectively?
    One thing we’re learning in medicine is that people game the system in ways that depend on various rules, requirements and oversight. We certainly see this in the finance sector as well. The current discussion is essential but even overwhelming consensus may have limited results unless the academic and publishing environment is substantially reformed.

  19. So in search of alternatives to conventional “p-value” based “NHST”, what is the specific alternative technique for answering a question of this type…

    I randomly split 100 overweight people into two groups. One gets a placebo, the other gets a supposed “fat burner” pill. I weigh them before randomization and after six months taking the treatment.

    Yes or no answer is required to the following question. Are we at least 95% confident that the “fat burner” group lost at least 5kg more than the placebo group.

    No “null hypothesis” of zero effect. A specific minimum clinically significant effect for which we want an actual estimate of probability. But if we can’t make that accessible enough that a researcher can do a simple two group, two time, randomized comparison of an interval-measured outcome without hiring a Bayesian statistician then we have NO chance of ever talking the world out of using p-values and such.

    • @Brent:”Yes or no answer is required to the following question. Are we at least 95% confident that the “fat burner” group lost at least 5kg more than the placebo group.

      No “null hypothesis” of zero effect. A specific minimum clinically significant effect for which we want an actual estimate of probability. But if we can’t make that accessible enough that a researcher can do a simple two group, two time, randomized comparison of an interval-measured outcome without hiring a Bayesian statistician then we have NO chance of ever talking the world out of using p-values and such.”

      Many real clinical issues are like this, aren’t they? We need to take some kind of action – treat a patient – so we try to formulate a rule to help us decide. In this particular example, formulating the rule like this isn’t very helpful. What if the actual weight loss difference was 4.8 kg, would you not do the treatment, but if it were 5.0 you would? What about 4.9? 5.2? What about 94% confidence vs 96%?

      No, a better approach for results in the gray zone is to acknowledge that “looks good, but we can’t be sure” – so we should be trying the treatment but monitoring the outcome to see if we need to make a change because the treatment isn’t getting the intended effect.

      We can all hope that clinical situations would be handled this way anyway – making sure the result comes out as intended – so what would be the difference in knowing the statistics vs not knowing? It lies in the practical areas of 1) deciding on whether to try the treatment at all, and 2) how stringently we check on the results.

      • @Tom: “What if the actual weight loss difference was 4.8 kg, would you not do the treatment, but if it were 5.0 you would? What about 4.9? 5.2? What about 94% confidence vs 96%?”

        “Actual” is the tricky concept here IMO. In a given sample, the actual difference in weight loss might be computed as 4.8kg but that’s an estimate with error. An actually measured difference of 4.8kg would imply *some* probability that the effect of treatment is greater than my 5.0kg criterion, no?

        I’d have no problem with reporting a result that says these things:

        1) In our sample the “fat burner” group lost 4.8kg more than the placebo group, with standard error of 1.6kg.
        2) We estimate a probability of 45% that the true effect of the “fat burner” treatment is greater than 5.0kg.
        3) Therefore we conclude that our study does not support a treatment effect of greater than 5.0kg with the pre-specified 95% probability.

        My problem is, I literally do not know how to calculate that “45% probability of weight loss greater than 5.0kg” type of number in real-world, moderately complicated modeling situations.

        I do know how to compute a p-value for a group-by-time interaction. I also know the p-values like that which I compute suffer from a whole host of shortcomings, foremost being that they don’t actually admit an interpretation framed in the way my research question is framed. But telling me to never again compute or report those p-values without telling me what I should be computing and reporting instead is troubling. In the meantime I have no choice but to give my colleagues the (troubling) p-values they request!
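        (A sketch of how the “45%” above can be computed. Under a normal approximation with a flat prior, the posterior for the true difference is roughly Normal(estimate, SE), so the probability of exceeding 5.0 kg is one line; with a fitted Bayesian model it would just be the fraction of posterior draws above the threshold. The only inputs are the 4.8 kg estimate and 1.6 kg standard error from the example above.)

```python
# Sketch: P(true effect > 5 kg) under a normal approximation with a flat prior,
# using the 4.8 kg estimate and 1.6 kg standard error from the example above.
from scipy import stats

estimate, se, threshold = 4.8, 1.6, 5.0
p_exceeds = 1 - stats.norm.cdf(threshold, loc=estimate, scale=se)
print(f"P(true effect > {threshold} kg) = {p_exceeds:.2f}")   # prints about 0.45

# With a fitted Bayesian model, the same quantity is just the fraction of
# posterior draws above the threshold: (draws > threshold).mean()
```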

        • @Brent: “But telling me to never again compute or report those p-values without telling me what I should be computing and reporting instead is troubling. In the meantime I have no choice but to give my colleagues the (troubling) p-values they request!”

          Since the p-value is a statistic, it seems to me that its uncertainty should be reported. Its standard deviation is a bit weird since the distribution of the p-value is so non-gaussian. But it’s a relevant number. If one is using a p-value threshold of say 0.05, and got 0.03 from the sample, is that number strongly supported or weakly supported? Maybe it could have come out 0.08 just as easily. In that case, your statistical significance would have gone up in smoke. Andrew has been fond of saying that the difference between statistically significant and insignificant is itself not significant. That’s my point here.

        • It would be nice if some measure of the variability in the p-value could be reported. Problem is that this depends entirely on the power of the test, which is unknown, and which cannot be estimated from p.

          I would hesitate to use the phrase “uncertainty” in the p-value. It is a statistic, but it is not an estimate of any population parameter. There is no uncertainty in the estimation sense. It is just the probability that some null model would produce a test statistic as big as the one that has been calculated.

        • > some measure of the variability in the p-value
          Just use Corey’s online R simulation code, changing the true mean value each time and making a histogram with that true mean in the title – maybe 7 +/- 2 stacked histograms.
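          (A stand-in for that kind of simulation, not Corey’s actual code, showing how widely the p-value bounces around across replications of the same two-group experiment at several true means. All sample sizes and means are illustrative.)

```python
# Stand-in simulation: for several true means, replicate the same two-group
# experiment many times and summarize how much the p-value varies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_per_group, n_reps = 25, 5000

for true_mean in (0.0, 0.3, 0.5, 0.8):
    pvals = []
    for _ in range(n_reps):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_mean, 1.0, n_per_group)
        pvals.append(stats.ttest_ind(b, a).pvalue)
    q10, q50, q90 = np.quantile(pvals, [0.1, 0.5, 0.9])
    print(f"true mean {true_mean:.1f}: p-value 10th/50th/90th pct = "
          f"{q10:.3f} / {q50:.3f} / {q90:.3f}")
```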

        • Ben said, “There is no uncertainty in the estimation sense. It is just the probability that some null model would produce a test statistic as big as the one that has been calculated.”

          Yes, it’s not expressing uncertainty in the estimation sense — but a probability other than 0 or 1 is a measure of uncertainty of some sort.

        • The question is not whether the p-value “expresses uncertainty” somehow, the question is whether the p-value itself “is uncertain”.

          > a probability other than 0 or 1 is a measure of uncertainty of some sort.

          How is it a measure of uncertainty? Why not in the case of 0 or 1?

        • Carlos wrote:
          “> a probability other than 0 or 1 is a measure of uncertainty of some sort.

          How is it a measure of uncertainty? Why not in the case of 0 or 1?”

          Perhaps it would have been a little clearer to have said “Probability is a measure of the *degree* of uncertainty of an event, with probability 1 used for an event that is certain to apply, and probability 0 for an event that is certain not to apply, and with Prob(A) > Prob(B) meaning that A is more likely to occur than B.”

        • Martha, I don’t know what I was thinking when I wrote that…

          I think my point was that it’s not a measure of uncertainty about the data or uncertainty about the model (it’s a probability for some hypothetical event conditional on the model and the observed data). But you didn’t say that it was a measure of uncertainty about the data or the model anyway.

    • I randomly split 100 overweight people into two groups. One gets a placebo, the other gets a supposed “fat burner” pill. I weigh them before randomization and after six months taking the treatment.

      The problem is you already messed up here, but want people to tell you how to deal with the mess you have created afterwards. You designed the study to test a strawman hypothesis rather than your hypothesis about how the treatment works (eg, how much weight loss do you expect?).

      Everything else is cascading errors due to this “mistake”.

    • “Yes or no answer is required to the following question. Are we at least 95% confident that the “fat burner” group lost at least 5kg more than the placebo group.”

      No real world researcher who is just looking for accessibility would ever ask to be “95% confident” in an outcome if she hadn’t been taught to ask this before the fact. In this sense the question is circular.

      The “95%” invokes a continuous scale. The demand for a yes or no answer contradicts this.

      Now, if the question is “how can we best use the data to assess the effect of the pill on weight loss?”, that’s easy to do without NHST.

  20. Fascinating discussion, and I agree with many – even conflicting – positions expressed. I did sign the statement because I thought it was a reasoned way to move forward. It did not say p values would be banned, but seemed to address the worst abuses of p-values and I think those involve how they are used in decision making. It is the decision-making aspect of the comments I want to address.

    Many people keep stressing that the world requires decisions and that they often must be made without the ability to conduct further studies (even when these are possible, decisions about what treatments to use are required while additional evidence is collected). This is obviously true, but it does not lead me to the conclusion that many people reach – that some type of hard decision rule is required. I don’t see why researchers need to make these decisions. In fact, I think it is an illusory grab for power they do not have. Other people are charged with making decisions. A clinician (along with the patient and others) must make a decision about treatment A or treatment B. That does not mean that the research into these two treatments must conclude that A or B is better (or that one is non-inferior). In fact, I would say it is not productive for the researcher to do so. I’d rather see the researcher more carefully describe what the evidence says, and what the costs of various decisions errors might be. I see no reason why this requires the researcher to declare which treatment should be used.

    The effects of having researchers reach conclusions is what I think causes the problems. It feeds the incentives to overstate what the evidence says. It invites decision-makers to hide behind their preferred “evidence” rather than taking responsibility for their own decisions. It can impede fruitful discussion about exactly what the evidence tells us and what the costs and benefits of different alternatives are. To be sure, these discussions take place today – but I think that is despite the use of p-values and decision rules rather than because of them. The way p values are used stands in the way of some of the discussion that should take place.

    For example, suppose a study (the SPRINT study is a case in point) finds that more aggressive treatment of blood pressure lowers heart attack risks but increases the risk of severe side effects. What should the research provide? I would say we want evidence about the potential sizes of the primary and adverse events, along with the uncertainty about each. I’d also like to see discussion of the relative costs and benefits of more aggressive blood pressure treatment. And, indeed, much of this discussion has taken place. But I don’t think that p values have contributed to this discussion. Effect sizes and measures of uncertainty are important. Discussion about how many people might be helped and hurt by more aggressive treatment are important. What role, exactly, does the p-value play in that? None, I would say. I would like to see confidence intervals – but, again, not in terms of whether they include zero or not. They are at least some indication of potential effects and the uncertainty that surrounds these.

    Why do we look to the researcher to recommend that we do, or do not, treat blood pressure more aggressively? Clinicians, patients, government agencies, insurers, etc. need to make these decisions. And, they should be held accountable for their decisions. I signed the statement because I don’t see how p values help bring these things about. Instead, I see them used to allow researchers to pretend they are the decision makers; they allow decision-makers to try to escape accountability; they permit evidence to become advocacy.

    If we abandon p-values, what I see is a great deal of confusion about how decisions are to be made. As many commenters point out, how will the FDA decide whether to approve drugs? How will a clinician decide which treatment to use? How will a city government decide whether to raise the minimum wage? All of these decisions will become messier, with plenty of room for hand-waving and all the other inefficiencies people have been pointing out. But I don’t see that as a bad thing. I see that as restoring some sense to the decision making process. It has become too mechanistic: do X if the p-value is less than y. Then we fight about whether the studies were done correctly or have flaws. And, publications mount on both sides. And, yes, evidence is generated that helps. But think how much cleaner this process might be if the p value were removed from the discussion. Studies would still be critiqued. Publications would still mount. Evidence would be debated. All of what needs to happen would still take place. The only thing missing would be the researchers recommendation about what to do. And, I would say, we have lost little, but perhaps gained some understanding that decisions are messy and fraught with uncertainty.

    • > I think it is an illusory grab for power they do not have.
      Very well put.

      > FDA decide whether to approve drugs?
      Quickly, the senior clinical evaluator I used to work with in drug regulation made it clear that published papers (without access to all the data and perhaps auditing of individual patient results) were to be considered as only supportive for drug approval. To me, I took “only supportive” as meaning hearsay and was very glad they had that position.

  21. Of course, all of the messy debate goes on. The FDA relies on expert panels. They review the clinical, pharmacological, as well as the statistical evidence. I think that the point that some of us are making is that with significance testing, some crap is not even allowed through the door. I am asking an honest question. How do the regulators say, “these studies don’t even make it past the random noise level.” What they cannot do is tell Congress or courts that they hired a really good statistician and that is the conclusion he reached because the industry will line up several that say the opposite. (Trust me, I used to line experts up to say the opposite.) I think Ioannidis, who knows this space well, is concerned that the industry (and other groups) are going to flood the regulator with junk, and you are removing a simple decision rule with which the regulator can now fight back and say those studies don’t meet our threshold. There may be better decision rules, but we cannot just have a messy scientific process. The alternative is to allow the industry to poison us or alternatively to allow other pressure groups (insurers) to keep necessary treatments out of patients’ hands. And, I disagree that this is a power grab. The power already exists. You may be right that eliminating significance testing will eliminate confusion without the negative side effects that Ioannidis sees, but you have to at least understand that you are effecting a change in a regulatory system that has a lot of experience working. That disruption could be very negative.

    • I also have experience with regulatory processes, though more with utilities than drugs. From my experience, “working” is not all it’s stacked up to be. I’m not sure there is that much to lose by changing things.

    • The alternative is to allow the industry to poison us

      Why would you take medicine from people you think would poison you if given the chance? And people are being poisoned, right now.

      I saw what they just did to my grandmother, an endless series of pills each causing a side effect treated by the next one as quality of life drops and drops. I’ve seen what they did to my uncle, put on pain killers until he doesn’t know what is going on, needs to have limbs amputated, etc. I’ve seen what they did to my friend, put on anti-anxiety medications that turn you into the equivalent of an alcoholic after a few weeks if you don’t get your fix. I’ve seen what they did to another friend, put on anti-depressants until he starts bedwetting, causing more depression and anxiety and general emotional instability.

      The FDA is not stopping the poisonings, it is ongoing. Just watch TV, it is a bunch of carb-loaded fast-food and candy commercials followed by diabetes medication commercials.

      Even to the point of mortality… When we hear about an epidemic of drug poisonings, we would think it means stuff like heroin, methadone, and cocaine. But that isn’t what it means. The numbers actually refer to “drugs” in the general sense (ICD-10 X40-44):

      X40 Accidental poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics
      Includes: 4-aminophenol derivatives
      nonsteroidal anti-inflammatory drugs [NSAID]
      pyrazolone derivatives
      salicylates

      X41 Accidental poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified
      Includes: antidepressants
      barbiturates
      hydantoin derivatives
      iminostilbenes
      methaqualone compounds
      neuroleptics
      psychostimulants
      succinimides and oxazolidinediones
      tranquillizers

      X42 Accidental poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified
      Includes: cannabis (derivatives)
      cocaine
      codeine
      heroin
      lysergide [LSD]
      mescaline
      methadone
      morphine
      opium (alkaloids)

      X43 Accidental poisoning by and exposure to other drugs acting on the autonomic nervous system
      Includes: parasympatholytics [anticholinergics and antimuscarinics] and spasmolytics
      parasympathomimetics [cholinergics]
      sympatholytics [antiadrenergics]
      sympathomimetics [adrenergics]

      X44 Accidental poisoning by and exposure to other and unspecified drugs, medicaments and biological substances
      Includes: agents primarily acting on smooth and skeletal muscles and the respiratory system
      anaesthetics (general)(local)
      drugs affecting the:
      · cardiovascular system
      · gastrointestinal system
      hormones and synthetic substitutes
      systemic and haematological agents
      systemic antibiotics and other anti-infectives
      therapeutic gases
      topical preparations
      vaccines
      water-balance agents and drugs affecting mineral and uric acid metabolism

      http://apps.who.int/classifications/apps/icd/icd10online2004/fr-icd.htm?gx40.htm+

      The “recreational” drugs are grouped in with the “medication-only” drugs so we can’t see how big a problem it is.

      • Why would you take medicine from people you think would poison you if given the chance?

        Have you ever eaten at a grocery store? We know from our own history (see e.g. the meatpacking industry, or any number of potions and tonics and cosmetics sold in turn-of-the-century America) that without adequate regulation of food safety, companies will gladly prioritize profit over the safety of their customers and workers. Or, if you prefer to look at current events, there’s the adulteration of baby formula and pet food in countries without adequate food-safety regulation. It is very clear what would happen if regulatory bodies were dissolved or hamstrung, and it would not be a return to some idyllic pre-lapsarian society, it would be a return to the 1880s.

        • The FDA’s original purpose was to make sure stuff was labeled correctly (no contaminants, etc.). That is essentially what you mean by “food safety” here. It is definitely helpful for some organization to give its stamp of approval on that; afaik the FDA is doing a decent job at that.

          I think current efforts to test for acute efficacy/toxicity are also working pretty well. I am thinking of local anesthetics as an undeniable example of something that “works” acutely; it would be interesting to see the role of NHST in the development of those. I suspect the process was more like “hey rub this on your lips, don’t they get numb?”

          What is not working at all is testing for and balancing chronic (more than a few hours later) efficacy/toxicity.

        • This is goalpost-moving; your original comment clearly said that we should not ingest anything made by an untrustworthy industry. My point was that history shows that the pharma industry is exactly as unscrupulous as any that produces products with a potential effect on human health, including the companies that produce the majority of our food, cosmetics, cleaning products, etc. I also completely disagree that we will somehow have safer and more effective drugs if only they were regulated like the supplement industry (a famous wasteland of therapies that are even less effective than drugs). But since you brought up contaminants, it is worth noting that a major part of the FDA’s job is to determine what actually counts as a “contaminant”, not just to check that products are labeled correctly.

        • This is goalpost-moving; your original comment clearly said that we should not ingest anything made by an untrustworthy industry.

          Where? I never said anything like that, I would say do your own cost-benefit. I have, and any interaction with the medical industry scares the crap out of me. It is people with a rudimentary understanding of the human body armed with sharp objects and concentrated chemicals, but still they are the ones with personal experience dealing with the ill…

          But my point was if you think an organization will poison you if given the chance, why would you trust them anyway? That makes no sense to me.

          I also completely disagree that we will somehow have safer and more effective drugs if only they were regulated like the supplement industry.

          I never said this either? I would guess the drugs would be about as safe and effective either way if NHST is going to be the filter.

          (a famous wasteland of therapies that are even less effective than drugs)

          “less effective than drugs” -> less dangerous than drugs

          The appeal of these supplements (etc) is that they are more likely to do “nothing” than actively harm you in some way.

      • Overtreatment and polypharmacy, side effects, etc are pervasive and serious problems. NHST used to support those is but a part of the issue.
    Just had a PhD quote a book co-authored by Sackett, to the effect that extrapolating RCT results to demographics/populations not studied should be our default.
        Causal inference at the bedside is often necessary and often cannot be very rigorous.

  22. Really interesting discussion. I was quite surprised by Ioannidis statements “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important” and “we need to conclude something and then convey our uncertainty about the conclusion”. I think all objections to the idea of removing the use of statistical significance as a term, and its use in decision making come from these two points. If one accepts the wisdom of the second point then it leads inevitably to some norm like the current usage of P values and/or confidence intervals. That being the case, changing the norm risks weakening evidentiary standards still further, so why do it?

    We need to conclude something… I feel this statement is the root of so many problems in research. Conclusions beget binary decisions, binary decisions beget thresholds and thresholds beget statistics such as P values. Do we really need to conclude something from every analysis? We could simply realise we need more data or different data, and this can be true of every study from pilot to phase 3 mega trial.

    In the drug trial example, regardless of whether we feel we can conclude that drug A is undoubtedly better than drug B or not, we can perform a decision analysis and find which has the better expected outcome given available evidence and beliefs. We can also assess the value of gathering more or different data to then reevaluate the decision. Neither of these are conclusions in the traditional sense but they address the key decisions we must make. To put it another way, as a patient I do not care if it is God’s truth that treatment A is better than treatment B, I only care that based on what we know I will probably do better with it than the alternative. Any regulator that denies access to that treatment because we are not certain enough (P<0.05) it is truly better is not acting in my best interest.

    Alternative proposals have to start by offering an escape from the paradigm of “conclude, then describe uncertainty” to an alternative paradigm of understanding the uncertainty and then maybe reaching conclusion(s). Then we can work through all the sociological issues around publication, regulation and gaming by interested parties. It does not take much imagination to think that we could do better than current practices.
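    (A minimal sketch of the decision analysis described above. The posterior draws of each drug’s benefit and the simple utility are invented stand-ins; in practice the draws would come from a model fit to trial data and the utility from clinical and patient input.)

```python
# Sketch of a decision analysis comparing two drugs. The posterior draws and
# the utility below are invented stand-ins for model output and clinical input.
import numpy as np

rng = np.random.default_rng(6)
benefit_A = rng.normal(1.2, 0.6, 4000)   # posterior draws of drug A's benefit
benefit_B = rng.normal(1.0, 0.3, 4000)   # posterior draws of drug B's benefit
cost_A, cost_B = 0.4, 0.1                # assumed harms/costs on the same scale

utility_A = benefit_A - cost_A
utility_B = benefit_B - cost_B

print("expected utility A:", utility_A.mean())
print("expected utility B:", utility_B.mean())
print("P(A is the better choice):", (utility_A > utility_B).mean())
# The decision goes to the larger expected utility; "is the difference
# significant?" never has to be asked.
```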

    • I was quite surprised by Ioannidis statements “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important” and “we need to conclude something and then convey our uncertainty about the conclusion”. I think all objections to the idea of removing the use of statistical significance as a term, and its use in decision making come from these two points. If one accepts the wisdom of the second point then it leads inevitably to some norm like the current usage of P values and/or confidence intervals. That being the case, changing the norm risks weakening evidentiary standards still further, so why do it?

      —-
      I have tremendous admiration for John Ioannidis. I think his contributions have been pivotal in resuscitating the Evidence-Based Medicine movement to its original aspirations. At least that is my observation. His Youtube talk on How Evidence-Based Medicine Has Been Hijacked is brilliant and on point. It even raised goose pimples. I say that in respect.

      However, I couldn’t quite catch the logic of the claim above. The reality is that folks have taken considerable latitude to make indefensible claims by using NHST. I inferred that his objection related more directly to the propriety of collecting signatures. I don’t know whether lowering the threshold is the answer either. Except to suggest that the prospect will allow for very large studies. And we know the limitations of that as well.
      Sander Greenland didn’t wince though. I was impressed.

      I believe that current thought leaders in statistics, biomedical endeavors, epidemiology etc can substantially improve the scientific environment. I am very much heartened.

  23. Reading this discussion, it came to my mind that the p-value discussion is embedded in a bigger discourse about whether the institution of science is broken altogether.

    It seems a lot of people arguing against the p-value would argue against other common practices as well (such as binary decisions, the sanctity of peer review, and publish-or-perish attitudes).

    So often the argument goes like this:

    — We should abandon statistical significance.

    — But statistical significance is nice, it allows us to do X.

    — Well, you should also stop doing X.

    — But X is nice, it allows us to do Y.

    — Well, you should stop doing Y as well.

    and so on

      • I think it’s more like moving the chains: progress toward the goalposts in increments. Although I think Mikhail is just describing the argument, not suggesting that we fix each problem in turn. As intelligent beings, we ought to be capable of following the justifications to the primary fault and fixing that one. In doing so, we remove the justifications for all the other bad practices. That’s a vast oversimplification of how these things work, of course, which is why one might advocate chucking the whole institution and building a new one on a more solid foundation. That’s really difficult, if not impossible, to do, though, and the alternative is to change something big within the existing institution.

    • This is going to sound cynical, but I think the argument is more like:

      – We should abandon statistical significance

      – But statistical significance is nice, it allows us to declare things significant

      – Well, you should stop declaring things significant.

      – But then how will I declare things significant?

      etc. “Significance” gets treated as an end goal in itself, not requiring justification.

  24. It’s an interesting, well-written paper. It doesn’t seem to me that the issue (at its core) is really that controversial. It’s basically arguing for two things that are pretty well-agreed on:

    a) Place more emphasis on the effect size, rather than the rejection of a null hypothesis

    b) Quantify uncertainty in the estimate of the effect size, and interpret that uncertainty in the paper

    And you could consider one more:

    c) Consider null hypotheses that are more realistic or practically significant given the problem, by avoiding straw man “parameter = 0” null hypotheses as the default.

    If Ioannidis wants a different quality-control decision rule, why not just have it be rejecting a null hypothesis set at a “clinically significant” effect size and also requiring a comparatively precise estimate (e.g., a sufficiently narrow confidence/credibility interval)? My reading of the paper (despite the provocative title) is not that we should abandon NHST, but rather that we should stop people from ending their interpretation at rejecting weak null hypotheses with p-values alone.
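
    For concreteness, a hedged sketch in Python of that kind of dual-criterion rule; the clinical null and the precision cutoff below are invented placeholders, not recommendations:

    # Toy rule for a ratio-scale estimate (e.g., a hazard ratio): "succeed" only if
    # the whole interval sits above a clinically relevant null AND is narrow enough
    # to count as precise. Both cutoffs are invented for illustration.
    def dual_criterion(lower, upper, clinical_null=1.10, max_width=0.60):
        exceeds_clinical_null = lower > clinical_null
        precise_enough = (upper - lower) < max_width
        return exceeds_clinical_null and precise_enough

    print(dual_criterion(1.15, 1.55))   # True: clearly past the clinical null, and precise
    print(dual_criterion(1.02, 2.60))   # False: fails on both counts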

  25. This has been a wonderful discussion, one of the best ever on this site. I sense a meta-argument here. There is a criticism of NHST that it’s a lousy decision rule for a number of reasons. Most of us agree up to some point. (Some agree completely, some partially.) A chief counterargument is that the worlds of science and regulation are a cesspool governed by self-interest, guile or sheer incompetence. If we don’t have a firm decision rule we will be swamped by nonsense and corruption. I’m familiar with this in economics: it’s the case for cost-benefit analysis. I’m a CBA abolitionist — I think it shouldn’t exist, at least in the form of a decision rule — so I’ve had to wrestle with these arguments in that context.

    I see two difficulties with it-ain’t-perfect-but-we-need-a-rule: First, it logically leads to dissimulation about the shortcomings of the rule, since we are using it as a defense against the bad stuff. You can’t simultaneously say “your arguments don’t have standing because we’re following this rule to the letter” and “we realize this rule has big problems”. Second, it effectively assumes the inevitability of the dark forces the rule is intended to combat. We know this because, when folks of good will (like just about all of us, I think) get together to discuss evidence honestly, we can do quite a lot better than p> or < .05. So the problem is that decision contexts are corrupted by economic interests, career interests, etc. This is a lot like the argument that we need CBA because the regulators are all captured. I suppose logically one could be a crusader for bringing public reason to social and scientific decision-making while also adhering to a flawed but ironclad decision rule as a defense against its lack, but in practice they conflict. I see the argument against NHST as part of a larger campaign *for* public reason, which for us means careful, comprehensive and honest evaluation of the evidence in front of us. And as part of this, I strongly agree with Dale Lehman: it is the job of the researcher to generate and evaluate evidence, not to make the final decision on the grounds that the larger system is too flawed.

    • Peter:

      I agree with much of what you say, but I’m not sure about this statement of yours: “when folks of good will (like just about all of us, I think) get together to discuss evidence honestly, we can do quite a lot better than p> or < .05.” Consider many of the researchers who publish in Psychological Science, PNAS, etc. They really do seem to use p-value thresholds as a way to interpret their evidence. See, for example, the example discussed in section 2.2 of this paper. Or that whole stents thing. Lots of people of good will out there using p-value thresholds and statistical significance, not because they think it’s a lousy rule but it’s all we’ve got, but because they think it’s the statistically right thing to do.

      This comment might well be worth its own post!

        • Brent:

          Sure, but I don’t think that’s the case here. I think these researchers are doing what they were told was correct, and that various authority figures keep assuring them is fine. One positive thing about the ASA statement, the paper discussed in the above post, etc., is that these are authority figures saying something different.

        • I’m with you, there’s benefit to recognized voices pointing out flaws in the conventional, received wisdom.

          And agreed that in most cases researchers are doing what they believe to be correct, assured of that correctness by various gatekeepers and authority figures. But the gatekeepers and authority figures are just as subject to Sinclair’s principle as the rest of us.

          But I’d argue that, in all likelihood, you can’t change this particular facet of the entire Research Industrial Complex without to a certain extent tearing the whole thing down. Too many interlocking sets of “salary depends on” type incentives, from top to bottom.

        • Agree, dark forces everywhere, including methodologists of all levels of prestige, but they are likely a distraction, or at least they are better dealt with by presuming well of them while you enlighten all the others.

          Now darkness is in the eye of the beholder. I remember commenting on JI’s YouTube talk on How Evidence-Based Medicine Has Been Hijacked somewhere on this blog, that I had worked with many of the same people but guessed that my sense of the bad actors may have had little overlap with his. Depends which day you walked into the sausage factory.

        • I think the ASA statement is great, except that 1. it is not reaching most of the people who need it, and 2. when it does, they don’t understand it. It is not written in a way that is accessible to those without solid statistical training.

          I know this because, when it came out, I gave a talk on the statement to MD researchers (I’m an epidemiologist, but have decent statistical training), and it required a lot of additional background material and practical examples, and many of them still felt it was too technical. I suspect the lack of an applied perspective is a problem with statistical teaching in general. We teach students the technical parts, but fail to connect it to the current paradigm of use or explain what the consequences of misuse look like.

        • Morgana:

          I have some problems with the ASA statement; see here. But the larger issue is that no statement can do everything. Individual articles and statements can be useful steps toward larger goals of general understanding.

        • I agree completely. No field can do everything either, and I think too often statistics is taught or discussed as a completely separate enterprise, but that is not how it is used in practice. As an epidemiologist, I think that statistics without some discussion of threats to validity (at least bias) or causal inference is pretty limited. That’s what I see with many of my research peers: They want to assess the causality of a relationship, but the only tool they’ve been taught is the p-value. They’ve never been introduced to the other pieces or even the assumptions underlying the p-value.

      • I agree with what you’re saying here, and I didn’t say quite what I meant in my comment. I’m arguing against the view that the dark forces that corrupt honest data evaluation are inevitable, which the counterexample of what we see on this blog and elsewhere disproves. Of course, there are multiple reasons for flawed analysis, and corruption is just one of them. But corruption of various sorts (political, economic, career interests) is invoked by JI and others as a reason for needing a dichotomous ex ante decision rule, on the apparent grounds that there is nothing else we can do about it. I think we concur that, even if we improve the decision environment so it will foster more honest treatment of evidence (and the issue of career incentives has come up repeatedly on this blog), we still have widespread misunderstanding to combat as well.

  26. In case anyone else was curious like me/didn’t know Greenland’s reference “…the travesty of recent JAMA articles like the attached Brown et al. 2017 paper…”, it’s in reference to this article (https://www.ncbi.nlm.nih.gov/pubmed/28418480) by HK Brown et al.

    Greenland writes elsewhere (https://search.proquest.com/openview/8860556f7bfd79f360145b165d673ec1):
    “As an example, consider a study by Brown et al. (2017), who reported that ‘in utero serotonergic antidepressant exposure compared with no exposure was not associated with autism spectrum disorder in the child,’ based on an estimated hazard-rate ratio (HR) of 1.61 (a 61% rate increase in the exposed relative to the unexposed) and a 95% confidence interval of 0.997 to 2.59. As is often the case, the authors misused the confidence interval as a hypothesis test, and they claim to have demonstrated no association because the lower limit of the interval was slightly below no association (which corresponds to a hazard-rate ratio of HR = 1), ignoring that the upper limit exceeded 2.50.”

  27. Every field and subfield requires extensive domain-specific knowledge, and statistical thresholds can’t replace the importance of having that knowledge. I hate to break it to Ioannidis, but decision rules set before a study, and the thresholds chosen, are just as subjective as other design choices. Perhaps for someone like Ioannidis, who from afar attempts to analyze all the literature without much domain-specific knowledge, this is a bad idea because it makes meta-research far more difficult.

  28. John Ioannidis is a physician. He must have domain-specific medical knowledge. The bigger problem is that specialization has led to siloing of knowledge. This is at least my experience across different domains. This is also pointed out in Steven Goodman’s current ASA article.

    https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1558111

    John Ioannidis, in fact, conducts meta-analyses across several disciplines. That is a rarity. I also don’t see where John has denied that thresholds are subjective. Quite the contrary, as indicated in his own article on P values.

    More broadly, different fields are at different points on the learning curve with respect to measurement.

    • Too much argument from authority there, Sameera. Medicine (like most areas) is too vast for anyone to have deep knowledge of more than a few domains, and the heterogeneity across those domains is daunting. Little resemblance between (say) orthopedic surgery and dermatology. At best you can only collaborate with and thus depend on specialists in a topic, and those may not be as reliable as you wished. Also, publishing across disciplines is mostly evidence of being good at the politics of publishing in the venues in which the works appeared. So no, I don’t buy what you said here. Arguments need to stand alone from the presenter (and of course I don’t buy his arguments even if I understand his concerns, but others can address all that as they see fit).

      • Sander, YOU MEAN I shouldn’t consider you MY FAVORITE AUTHORITY either? lol

        Sorry I didn’t see your post earlier. I am not sure which of my former comments in response to Zad Chow contradict your own response above. Specifically, I was responding to the following proposition:

        Zad Chow: ‘Every field and subfield requires extensive domain-specific knowledge and statistical thresholds can’t replace the importance of having that knowledge.’

        What I was probably suggesting is that John Ioannidis has been trained in allopathic medicine, yet has questioned its empirical bases. Based on countless viewings of John’s lectures, I am sure that John doesn’t think statistical thresholds should or can replace knowledge. That has never once been articulated. It’s a false dichotomy as framed anyway.

        Nor is it his argument now. His qualm was largely with the collection of signatories. Secondly, he has never suggested that lowering the threshold was the solution. He has consistently argued that the current threshold is too lenient for purposes of biomedical research, and that the lower threshold is a ‘temporizing measure’ that should not be blindly applied either.

        Otherwise, I agree with your main thesis. I also haven’t found major disagreements, more broadly, between you and John.

        Lastly, I did convey on Twitter that I disagreed with John’s argument. So there.

  29. Thank you for this interesting interchange.

    In my view, the main effect of changing views on the p-value would be fewer adverse events detected in already poorly powered clinical studies of drugs and biologics. I’ve explored this problem in an article on LinkedIn. https://www.linkedin.com/pulse/science-matters-expected-impacts-lowering-p-value-james-lyons-weiler/

    I’ve moved from NHST to prediction modeling: if we can develop prediction models that actually generalize (to new data), we can understand what we are studying well enough to manipulate/control/mediate/mitigate (influence) it. Machine learning that optimizes models (model selection, model evaluation) via internal cross-validation with feature selection (perhaps by significance testing, perhaps not), followed by evaluation on new data not used in the learning phase, is very powerful. We need a massive conference on robust inference of causality from observational studies anyway; epidemiologists seem to “correct for” useful co-predictors agnostic of interaction terms. There is, I think, an inverse relationship between the intrinsic importance of the p-value and the demonstrated utility of a prediction model derived from knowledge generated using (in part) NHST.
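
    A minimal sketch of that workflow in Python with scikit-learn, on synthetic data; the point is just that feature selection sits inside the cross-validated pipeline and the final check uses data never touched during model building (the data, model, and settings are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for a real study; most features are noise.
    X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                               random_state=0)

    # Hold out data that plays no part in model selection or tuning.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    # Feature selection lives inside the pipeline, so it is re-fit within each
    # cross-validation fold; doing it before the split would leak information.
    model = Pipeline([
        ("select", SelectKBest(f_classif, k=10)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    print("internal CV accuracy:", round(cv_scores.mean(), 3))

    # The claim that matters: prediction on data not used in the learning phase.
    model.fit(X_train, y_train)
    print("held-out accuracy:", round(model.score(X_test, y_test), 3))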

  30. In connection with this discussion, can Andrew or someone else comment on what constitutes statistical evidence? I’m thinking of Royall’s book: Statistical Evidence: A likelihood paradigm.

    When we say, “we have evidence that X is true”, when can we say that and what does that phrase mean?

    • For example, Gelman et al in BDA3 write:

      “The estimated posterior probabilities of a negative average change across the percentiles was 0.72 in the 750 mg/kg group, 0.99 in the 1500 mg/kg group, and 0.94 in the 3000 mg/kg group. Hence, there was *substantial evidence* of a stochastic decrease in the number of implants in the higher two dose groups relative to control.”

      source: Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B.. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science) (Page 557). CRC Press. Kindle Edition.

      Does it make any sense to say we have “substantial evidence” based on a posterior probability of a parameter being negative or positive?
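
      For what it’s worth, the quantity in that passage is just a tail probability of the posterior; with posterior draws in hand it is one line to compute. A toy illustration in Python (the draws below are made up, not the BDA3 analysis):

      import numpy as np

      rng = np.random.default_rng(0)

      # Pretend these are posterior draws of the average change in one dose group
      # (placeholder numbers only, not the actual BDA3 posterior).
      avg_change = rng.normal(-0.4, 0.3, 20_000)

      # The reported quantity: posterior probability of a negative average change.
      print("Pr(average change < 0) =", round((avg_change < 0).mean(), 2))

      Whether a value like 0.94 or 0.99 then counts as “substantial evidence” is a judgment about the problem and the prior, not a property of the number alone, which I take to be part of what the question is getting at.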

  31. This debate may benefit from a distinction Richard Royall makes in his book on likelihood, about 3 possible questions:

    q1. What do the data say
    q2. What should I believe (now that I’ve seen the data)
    q3. What should I do

    The 1st is strictly evidence in data. The 2nd requires incorporating prior knowledge/beliefs (Bayes). The 3rd is decision analysis. E.g., a patient tests positive for viral hepatitis. That is evidence in favor of him having the disease (q1). But I may believe he does not have the disease, depending e.g. on his risk factor profile (q2). Yet I may decide to treat him anyway, given the risks/benefits of treatment versus non-treatment (q3). It seems like we want statistical tests to do all three (when they were developed for q3 in repetitive situations).

    Royall advocates likelihood ratios, or Bayes factors, for q1. I find his reasoning compelling. At least the result is less prone to misinterpretation: “the data support H1 x times more strongly than H2”. Similar ideas are found in Ian Hacking, AWF Edwards, and others. (I see that others above disagree about the value of LRs.)

    Distinguishing the 3 questions allows different ways of answering each of them. That should get rid of the problem discussed above of binary decisions vs non-binary evidence.
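
    A small numerical sketch in Python of how the three questions can pull in different directions for the hepatitis example; the test characteristics, prior, and utilities are all invented for illustration:

    # q1: what do the data say?  Likelihood ratio of a positive test.
    sensitivity, specificity = 0.90, 0.95          # invented test characteristics
    lr_positive = sensitivity / (1 - specificity)  # = 18: data favor "disease" 18:1

    # q2: what should I believe?  Combine with a prior via Bayes, in odds form.
    prior_prob = 0.02                              # low-risk patient, invented
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * lr_positive
    posterior_prob = posterior_odds / (1 + posterior_odds)

    # q3: what should I do?  Compare expected utilities of treating vs. not treating.
    u_treat_sick, u_treat_well = 0.9, 0.8          # invented utilities
    u_skip_sick, u_skip_well = 0.2, 1.0
    eu_treat = posterior_prob * u_treat_sick + (1 - posterior_prob) * u_treat_well
    eu_skip = posterior_prob * u_skip_sick + (1 - posterior_prob) * u_skip_well

    print(f"q1: LR+ = {lr_positive:.0f} (the data favor disease)")
    print(f"q2: posterior Pr(disease) = {posterior_prob:.2f} (belief can stay below 50%)")
    print(f"q3: E[utility | treat] = {eu_treat:.2f} vs E[utility | don't] = {eu_skip:.2f}")

    With these made-up numbers the data favor disease (q1), the updated belief is still that the patient probably does not have it (q2), and the expected-utility comparison says to treat anyway (q3): each question gets its own tool, which is exactly the point of the distinction.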

    • I love Royall’s writing for his clarity in laying out the problems, as in that list of 3 questions. In my view, significance defenders like Ioannidis stumble completely by conflating q1 and q3 (and often confuse all 3 questions). Royall explains the confusions well. I say this even though in the end I reject his reliance on the likelihood principle (LP) to answer q1. The LP can have us throw away important information for the analysis; so I say everyone reading Royall deeply should also examine the arguments against the exclusive use of LP-based (likelihood and Bayesian) methods, and recognize their potential to mislead when (as is the rule in social, health and medical fields) there is non-negligible uncertainty about the underlying data-generating mechanism; in that situation frequentist tools become essential to deploy (along with other tools, like graphics, as Andrew often points out). See p. 635-636 of Ritov, Y., Bickel, P. J., Gamst, A. C., and Kleijn, B. J. K., “The Bayesian Analysis of Complex, High-Dimensional Models: Can It Be CODA?” (Statistical Science 2014, 29, 619–639) for a sobering discussion of some situations (which are common in my field) where LP-based methods break down.

      • Sander, Ritov et al write:
        “It is very difficult to build a prior for a very complicated model. Typically, one would assume a lot of independence. However, with many independent or nearly-independent components, the law of large numbers and central limit theorem will take effect, concentrating what was supposed to have been a vague prior in a small corner of the parameter space. The resulting estimator will be efficient for parameters in this small set, but not in general. It is safe to say that Bayes is not curse of dimensionality appropriate (or CODA, see Robins and Ritov, 1997).”

        Could you clarify this? I’m probably missing something here, but I don’t see why one would want a vague prior in a large-dimensional model. For complex, real-world problems, we want prior predictive distributions that make sense, based on our external information! Let’s say we have a series of measurements of some outcome (say, bird observations) in a set of spatial locations over time. We want to model the measurement process, and then link to an underlying (latent) model of bird population growth and movement, while accounting for the noisiness of human observations, occupancy, the impact of environmental covariates, etc. My approach would NOT be to attempt to construct a vague joint prior in order to deliver some kind of ‘unbiasedness’; once we push the predictive distribution out, that will inevitably put lots of mass on improbable or un-physical results. I want priors that incorporate everything else we know, and that regularize our inferences sufficiently. This is no small task! What frequentist tools help here?

        • What frequentist tools help here? The most basic example: Suppose a research team goes through the task of creating a prior they think incorporates “everything they know” (although the priors I see in my field are never close to that, often being full of absurd independencies because the researchers did not think of, let alone model, prior dependencies). I then ask: before they (con)fused their prior with their likelihood using Bayes’ theorem, how did they check for possible incompatibilities between the two, which might have put a brake on their fusion? I hope examples of such checking exist in my field, but I haven’t seen Bayes applications that reported it, even though they could have at least provided a P-value comparing the two (no, posterior predictive P-values don’t count, because among their problems they are not comparing the prior vs. the likelihood).

        • > how did they check for possible incompatibilities between the two, which might have put a brake on their fusion

          Can you explain in more detail what you mean? The likelihood is conditional on a given value of the parameter vector, so as long as it represents what you think is likely to happen when that parameter obtains in the world, it is doing its job.

          In general priors are hard to set up in high dimensions, which is why prior predictive checking is great: it lets you see what kind of data your prior expects you might get, and this is often extremely interpretable. An example here months back showed how a prior on some model predicted particulate air pollution densities on the order of magnitude of neutron star material… Very interpretably wrong.

        • 1) “The likelihood is conditional on a given value of the parameter vector” – apologies, I should have written “likelihood function” (LF), not “likelihood”, in all of that post (I’m jetlagged). To hopefully clarify what I meant, take the simple toy example with a N(0,1) prior and an LF proportional to N(5,1). I wouldn’t want to combine those using Bayes’ theorem to get inferences or predictions or whatever; I’d say something’s wrong with the prior or the data model or both for this app, better find out what. Now that’s kind of obvious from the difference of 5 between the prior mean and the LF location (about 3.5 standard deviations on the prior-predictive scale). But in real examples with complex models and multiple parameters, even severe incompatibilities may not be so obvious. A P-value comparing the prior and likelihood function is a first line of detection relatively easy to get out of canned software (a concrete version of this check for the toy example is sketched below). I’m all for more sophisticated checks, but as I said I don’t even see a basic check in papers, and I don’t expect to see more sophisticated checks without computational ease via the software the user has (in my field that’s usually just SAS, sometimes Stata). Suggestions welcome.
          2) “In general priors are hard to set up in high dimensions” – In the apps I have in mind, realistic priors are practically impossible to set up, as the dimensionality is just too high to imagine; one could spend years trying to figure out from the contextual literature how to specify all the prior dependencies. As a result users resort to highly artificial prior simplifications, like complete independencies, that just can’t be even nearly correct and which lead to the kind of problems that Ritov, Bickel, etc. (RB) warn about. At best that prior becomes a regularization (smoothing) tool that is not giving a well-informed posterior distribution. As RB note, in some of these situations one can still produce valid (well-calibrated) frequentist statistics for certain targets of interest given a sufficiently (even if only partially) identified selection (“sampling”) model; that may not be satisfying, but sometimes it may be the best we can do given our constraints.
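
          To make the toy example concrete, here is one way (a sketch, not a prescribed method) to compute that kind of prior-data conflict P-value in Python; it uses the fact that, under this prior and data model, the prior predictive distribution of a single observation is normal with mean 0 and sd sqrt(2):

          from math import log2, sqrt
          from scipy.stats import norm

          # Toy example: prior mu ~ N(0, 1), data model x ~ N(mu, 1), observed x = 5.
          prior_mean, prior_sd, data_sd, x_obs = 0.0, 1.0, 1.0, 5.0

          # Before applying Bayes' theorem, ask how surprising x = 5 is under the
          # prior predictive distribution, N(0, sqrt(prior_sd^2 + data_sd^2)).
          pred_sd = sqrt(prior_sd**2 + data_sd**2)
          z = (x_obs - prior_mean) / pred_sd
          p_conflict = 2 * norm.sf(abs(z))   # two-sided prior-data conflict P-value

          print(f"z = {z:.2f}, prior-data conflict P = {p_conflict:.5f}")
          print(f"bits of information against the model: {-log2(p_conflict):.1f}")

          Here P is about 0.0004 (roughly 11 bits against the prior/data-model combination), which flags the conflict before the prior and likelihood are ever fused.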

        • Let me try to take your example a little further, and I’ll use Stan-style pseudocode to be more formal, because I think we agree on some of it and not on other parts.

          You have a prior

          mu ~ normal(0,1);

          and a likelihood with one data point x=5:

          5 ~ normal(mu,1);

          After this, we have a posterior

          mu ~ normal(2.5,1/sqrt(2)); // 1/sqrt(2) ~ .707

          Now, as far as I’m concerned we could have any of these situations:

          1) everything is fine, we really do have strong evidence that mu is somewhere down near 0, and that data points are approximately normally distributed around mu with sd = 1, we got very unlucky

          2) we put more information into the prior than we really meant to, normal(0,1) should probably be something more like normal(0,10)

          3) Our prior is good, but our likelihood is probably too light-tailed; something like a Student-t with 5 degrees of freedom would be better, because outliers tend to occur in this application. Or possibly we should relax the constraint on the sd of the likelihood: instead of a delta function at sd=1 we might be more honest, make sd a parameter, and put a tight prior on it, like gamma(6,5), which has its peak at sd=1 but a high-probability range of roughly 0.4 to 2.3.

          I think *any* of these answers could be right. Remember, though, I’m not interested in matching frequency when I choose a likelihood function. To be honest, it seems likely that either 2 or 3 is the actual situation.

          Now, in the high dimensional case, you often can’t describe what you know very well because it requires describing a bizarre manifold in 36 dimensional space. What you *can* do is generate fake data from your prior and see if the data looks weird. Like we can generate from mu ~ normal(0,1) and then generate x from normal(mu,1) and then see if the data we get covers the range of what we expect.

          If we do this we’ll find that we are not likely to get x=5 much, and so *if 5 is something we think could happen* we should look carefully at our model and try to make it so that 5 could happen.

          On the other hand, if 5 is *not* something we think should happen, then when we get 5 in our dataset… we should rethink our model and try to figure out what assumption we’re making that prevents 5 from happening, and see if it’s something we really believe or something we did for convenience or because we didn’t realize what we were doing… Same thing in the high-dimensional case, it’s just harder.
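
          In code, that prior predictive check for the toy model might look like this in Python (a sketch only; with a real model one would do the same thing by simulating from the full program):

          import numpy as np

          rng = np.random.default_rng(123)

          # Simulate fake data from the prior: mu ~ normal(0, 1), then x ~ normal(mu, 1).
          mu_draws = rng.normal(0.0, 1.0, 50_000)
          x_fake = rng.normal(mu_draws, 1.0)

          # How often does the prior predictive produce something as extreme as x = 5?
          print("Pr(|x| >= 5 under the prior predictive):", (np.abs(x_fake) >= 5).mean())
          print("central 99% of fake data:", np.quantile(x_fake, [0.005, 0.995]).round(2))

          If values like 5 are something we genuinely think could happen, this tells us to revisit the prior or the data model before turning the crank; if they are not, then seeing 5 in the real data is the signal to rethink.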

        • Daniel: I just saw your response. I don’t see where we disagree; we just need to be careful about translating claims between our languages, because the information geometry I use to express things appears to differ from yours (which appears to focus on the model space): I see the model as defining a manifold as an implicit hypersurface in the expectation space for the data, which means the model manifold I discuss lives in N-dimensional space (N = number of observation df = N observations for independent observations). The data set is a point in this space, and unconstrained fit statistics are directed distances from the manifold to the point along specified axes, while constrained fit statistics are distances from the manifold to a larger embedding manifold; the axes are scaled by the distribution under the tested (nested) model. A P-value is simply the percentile location of the observed statistic in the distribution from the tested model along that axis, and its Shannon transform log2(1/p) = -log2(p) is then the bits of information against the model conveyed by the test statistic.

          Now I take your 1-3 to be examples from 3 broad classes of what could be going on in my example:
          1) Both prior model and sampling model are correct, and we just had exceptionally bad luck; the question is when we are no longer content to act as if that’s the case (i.e., how to choose alpha to decide we can’t just go ahead and use Bayes’ theorem).
          2) Wrong prior (in your response the prior was overconfident but it could have been that it was biased downward instead or in addition to that)
          3) Wrong data-generating model (in your response it was underdispersed but it could have been producing or passing on upward bias instead or in addition to that)
          Reflecting your subsequent comments about these possibilities, I would add:
          4) Some combination of 1-3; in my field we know the prior and data models are both wrong to some degree and want to disperse them both to some degree to account for that (e.g., using overdispersed priors and DGMs, “robust” variances, etc.); plus, by definition, half the time the random errors (1) will be adding to rather than canceling the systematic errors (from 3).
          My point was that a preliminary P-value screen contrasting the prior and likelihood function will catch the worst of these and is better than the current standard of care I see in Bayesian apps in my field, which is check nothing. Andrew has lodged similar complaints about what he sees for decades, protesting denials (based on bad philosophy, not reality) of the need to check; we simply have diverged (at least in the past) in that I want my checks to come with a calibration distribution (a frequency validation) under the given prior and DGM so I can tell if the observed check statistic should be of concern (as in my example, which is out in the tail when drawing parameters from the assumed prior and data from the assumed DGM).
          So did I miss some disagreement here?

        • Sander: Are you aware of Mike Evans’s work on checking for prior-data conflict? (Some links here.) If so, what do you think of it? Insofar as it involves assessments reached through prior predictive distributions it seems like the most Bayesian thing out there for the purpose.

        • Corey: Not familiar with the check of Evans you mention – can you point to one article or post you think is the best intro?

        • Sander. I think we are pretty close together. I see Bayes as rarely trying to fit frequencies, except where you explicitly are, like maybe in survey research where the goal is to find out the frequency with which certain people do such and such. Because of this, I don’t think p values are the way I would generally approach determining if I have a model problem. But I ABSOLUTELY agree that we do need to check for model misfit.

          In general, I see model misfit as another instance of decision theory, and I prefer to think about it in terms of utility: what is the consequence of the kind of mis-fit I see in my model/data and how much do I care about it, and under what situations would I choose to do something about it.

          I like to go back to my “orange juice” example.

          http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/

          Here we’re trying to estimate the total amount of orange juice in a pallet, and we don’t really expect to find out what the “true frequency” distribution of the orange juice is (because doing so would require collecting far too much data); we only want a decent extrapolation to the average (or equivalently the total).

          So, in this case, the data don’t look anything like my likelihood model (the data are uniform between 1.4 and 2.0, the likelihood is exponential with a given mean). Does this mean I “shouldn’t use Bayes’ theorem” because I have data where a p-value using some kind of goodness-of-fit statistic would easily show me that my likelihood is far off the frequency distribution? No, absolutely not. The information that went into the choice of likelihood is that the distribution is positive and has a mean… so the choice of a maximum entropy distribution for a given mean is justified by that information, not a desire to match frequencies.

          In any case, I still think I should check this model. For example, if I had some data point that truly was a crazy outlier, 2500 liters per jug or something, I should maybe start to consider whether the data collection process has a problem. The generative model might be: most people write down liters, but some people write down milliliters… so I should alter my model rather than let a factor of 1000 sneak into 10% of my data as if it were a real number of liters.

          But, nuances of how to do checking are less important than the ideas you’re expressing which are similar to mine: we need to build models, we need to check models, we need to make decisions based on utilities, we need to consider ways in which we might be wrong, and try out alternatives… and we need enough background knowledge to be able to understand why we do these things and have a hope of doing them in simple cases, or working somewhat knowledgeably with modeling experts like you or me in more complicated cases.
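
          A compact sketch in Python of the orange-juice point, with a conjugate Gamma prior on the exponential rate standing in for the model in the linked post (the prior and all the numbers here are illustrative stand-ins, my paraphrase of the example rather than the post itself):

          import numpy as np

          rng = np.random.default_rng(7)

          # "True" per-jug volumes: uniform between 1.4 and 2.0 liters, nothing like
          # an exponential distribution.
          n = 40
          x = rng.uniform(1.4, 2.0, n)

          # Exponential likelihood anyway (justified by "positive with some unknown
          # mean", not by matching frequencies), with a weak conjugate Gamma(shape a,
          # rate b) prior on the exponential rate.
          a, b = 1.0, 0.1
          post_shape, post_rate = a + n, b + x.sum()

          # Posterior draws of the rate, converted to draws of the mean volume per jug.
          rate_draws = rng.gamma(post_shape, 1.0 / post_rate, 20_000)  # numpy uses scale = 1/rate
          mean_draws = 1.0 / rate_draws

          print("sample mean per jug:", round(x.mean(), 2))
          print("posterior mean of the per-jug mean:", round(mean_draws.mean(), 2))
          print("90% interval for a 1000-jug pallet total:",
                np.quantile(1000 * mean_draws, [0.05, 0.95]).round(0))

          The interval is wide (the exponential model is far more dispersed than the data really are), but it is centered in the right place and, with these made-up numbers, easily brackets the true total; the likelihood encodes the information that went into it, not the data’s frequency distribution.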

        • FWIW, I totally agree that prior modeling is neglected! But, IMO, this is actually a side-effect of the whole vague/flat/non-informative priors vogue that I think existed in order to sell Bayes by making it superficially resemble max likelihood. It just doesn’t work very well in high-dimensional models, and often fails in relatively simple process-based/non-linear applications as well.

          I am definitely interested to learn more from your perspective about how to think about incompatibility between prior and likelihood, and the tools needed to go about checking that. Prior predictive simulation makes sense to me, but that seems orthogonal to what you have in mind. One thing I often do is overlay density plots of priors and posteriors for various parameters. Where the discrepancy is large, I take it that the current dataset I am conditioning on is telling me something substantially ‘new’ relative to my prior. I suppose in some cases it might argue for mis-specification. Now, if I compute a test statistic of some sort representing this discrepancy, what should I do with that? Would I abandon the modeling project altogether if it passed some sort of threshold (that seems to be implied in your comment), or just tinker with the likelihood? Certainly, post-hoc adjusting the prior based on something like this starts to feel a little fishy. I dunno. What do you recommend?

        • > totally agree that prior modeling is neglected!
          That’s part of the motivation for https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/#comment-1001538

          We have to start out with where many working in science, regulation and industry are right now, not where many presume Andrew is now.

          It would be nice to know what percentage of those now doing Bayesian analysis with, say, reasonable competence in MCMC currently neglect the prior, doing as one of my colleagues from Duke said: simply pulling the Bayesian crank, assuming _the_ posterior has all _the_ answers, and moving on. (That was actually their explanation for why the president of ISBA answered my question about getting some sense of the prior by simulating from it with the response “that’s not kosher in a Bayesian analysis”.)

          But no one does.

        • Hi Keith, interesting – that’s crazy to dismiss prior predictive simulation!
          I totally agree vis-à-vis the gap between communities here. Following Andrew and this blog and folks like Michael Betancourt, it sometimes feels like inhabiting a parallel universe. The kinds of things we are getting into now with prior predictive simulation, rigorous workflow case studies where we learn how to resolve things like divergent transitions in HMC, etc., are so far past the average scientist’s training in statistics that it may as well not have the same name…

  32. Compressing statistical data down to a binary yes-or-no output for the purposes of decision making should be done at the latest possible stage of the decision-making process, not during the research phase.
    Studies should be designed to be data in, data out, not data in, decision out; the data coming out of the study should be as high-resolution as possible.
    Deciding should be the sole responsibility of the one responsible for decision making, not delegated to the one responsible for data parsing.
    This is especially true when part of that decision is “should I pay the scientist more?”; passing that question on to the very scientist it concerns fundamentally introduces bias into the decision-making process.
    By holding the decision maker responsible for the threshold used when interpreting results, we can ensure that whatever threshold is used will be one chosen by the decision maker, and by those who hold the decision maker accountable, rather than one hard-coded into the research stage.

    This is, at its core, a separation of concerns.

  33. In the paper by Valentin Amrhein, Sander Greenland, and Blake McShane, I don’t understand the paragraph that starts with:

    “Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value”

    Can someone here explain why the idea that there is a 95% chance that the computed interval itself contains the true value is false?

    Thank you

    • See “The fallacy of placing confidence in confidence intervals” and “Robust misinterpretation of confidence intervals”.

      A 95% confidence interval means the following. If you repeat an experiment, say, 100 times and compute the 95% confidence interval each time, you will get a distribution of CIs, each varying from experiment to experiment. Let the vertical line below represent the true mu.

      1. […|..]
      2. [.|…]
      3. [….|]
         …
      100.      | […..]

      Of these 100 CIs, 95 will contain the true mean, i.e., the vertical line will fall inside (about) 95 of these intervals. Now, you don’t normally repeat an experiment 100 times; heck, you don’t even repeat it once. So these CIs are hypothetical.

      To say that this 95% confidence interval contains the true mean with probability 95% is wrong; what one has to say is that if one were to (counterfactually) repeatedly run the same experiment, 95% of the CIs generated would contain the true mean.

      Obviously, this implies that plotting a single CI from your one experiment leaves you neither here nor there. You don’t know which CI you are looking at in your unique experiment. It could be no. 1 above, it could be no. 100. This becomes a serious problem in low-power studies; as Gelman and Carlin 2014 and many others before them have pointed out, you will get what Andrew calls Type M errors: exaggerated estimates, which means that whenever you get a significant result, your CI may well not contain the true mean; in practice, when people run 10-20% power experiments, CIs from significant results are not even remotely near the true mean (a small simulation of this appears below). Andrew has said at some point (StanCon 2017, I think) that “the MLE can be super-duper biased”. I think that this is what he meant.

      The *one* confidence interval you are looking at either contains the true value, or it doesn’t. Just look at any of the first three in the example above.

      That said, when sample sizes are large, a confidence interval and a Bayesian credible interval will have very similar bounds. An example is in our paper Parsimonious Mixed Models on arXiv. For this reason, I have seen professional statisticians (even hardcore Bayesians) treat CIs as credible intervals. Baguley’s stats textbook in psychology even writes that we can treat the CI as a credible interval, IIRC.
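
      A quick simulation in Python of both of these points (the repeated-experiment reading, and the Type M problem in low-power settings); the numbers are arbitrary choices that give roughly 10% power, using known-sigma z-intervals for simplicity:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(42)
      true_mu, sigma, n, n_sims = 0.5, 5.0, 50, 20_000   # a low-power setting
      crit = stats.norm.ppf(0.975)

      se = sigma / np.sqrt(n)
      est = rng.normal(true_mu, se, n_sims)   # sampling distribution of the sample mean
      lo, hi = est - crit * se, est + crit * se
      covered = (lo <= true_mu) & (true_mu <= hi)
      significant = np.abs(est / se) > crit

      print("coverage over all replications:", round(covered.mean(), 3))    # about 0.95
      print("power:", round(significant.mean(), 3))                         # about 0.11
      print("coverage among significant results:",
            round(covered[significant].mean(), 3))                          # noticeably lower
      print("mean |estimate| when significant:",
            round(np.abs(est[significant]).mean(), 2), "vs a true effect of", true_mu)

      The single interval you happen to get either covers or it doesn’t; the 95% describes the whole collection of hypothetical replications, and conditioning on statistical significance both exaggerates the estimates and degrades the coverage.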

      • > For this reason, I have seen professional statisticians (even hardcore
        > Bayesians) treat CIs as credible intervals.

        There is a big difference between getting an exact answer to the wrong question and getting an approximate answer to the right question.

        • “when sample sizes are large, a confidence interval and a Bayesian credible interval will have very similar bounds.”
          That’s true in a lot of simple apps and useful to know as a limit theorem in finite dimensional models, but is not a good approximation when (as in many apps I encounter) the data are sparse so that even if N looks large the actual amount of information about the target is not large enough to swamp the prior. In the extreme it may be that no realistic N is large enough for their convergence to one another. Which leads into the topic of Bayesian collapse as in Ritov Bickel et al. mentioned above.

      • From my understanding, a bootstrapped CI is similar to a Bayesian credible interval (some have referred to bootstrapping as a “poor man’s Bayes”), but it is still usually called a confidence interval.

        • Chao:

          No, a bootstrapped CI is different, unless it uses a prior distribution somehow. It could be interpreted in some cases as a Bayesian posterior with a flat prior—but flat priors are often what got us into this mess!

  34. It’s important to understand that while we often have to make decisions that are consistent with dichotomy and zero uncertainty, the scientific reality is often quite different. For example, suppose I have to choose a car to buy, I can conclude:

    a) There are many good cars for me (and many, many bad ones) and no one car will perfectly fit all my needs. Further, there is much uncertainty regarding reliability, longevity etc. However, weighing all the choices and cost/benefits involved, I decide the new Ford is likely the best.

    b) The new Ford is perfect, and all other cars are crap (at alpha=0.05).

    Both conclusions lead me to buy the new Ford, but one should note that while I act in a manner that is consistent with (b), the muddy reality is in fact (a), which is something to keep in mind if the new Ford performs in a manner different than anticipated.

    Ioannidis mentions the question: “Is this pollutant causing cancer, yes or no?” Indeed this may be how a policy maker will look at the question. However, scientifically, pollutants interact with each other and with other environmental influences in myriad ways, and there may not be a single pollutant that causes cancer by itself. Further, different people are more adversely affected by elevated pollution exposures (e.g. tradespeople who work outside are more susceptible), and decisions regarding where to regulate must consider this. Also one must consider the cost of pollution regulation on the economy, which may lead to loss of jobs, which could adversely affect health more than the exposure reductions provide benefit. Distilling such a pollutant/cancer/regulation question down to a yes/no proposition is unscientific and dangerous.

    • I have no problem with a data analyst deferring or declining to assert an absolute answer to “Is this pollutant causing cancer, yes or no?”.

      But the self-styled purists in this discussion seem to also reject questions like “How likely is it that this pollutant causes cancer?”, which I do not understand at all.

      • Brent:

        1. You write of “self-styled purists in this discussion,” but there are no self-styled purists in this discussion! I searched the thread for “purist” and found nothing but your comment right here.

        2. I’ve worked on toxicology, in particular carcinogenic pollutants, and I don’t find it particularly useful to ask “How likely is it that this pollutant causes cancer?” I don’t think it’s a very well-posed question. I’d rather ask a quantitative question such as, “How many additional cancers might this pollutant cause?” or “What is the effect of this pollutant on the probability of cancer?” I’d rather do that than frame “causing cancer” as yes or no.

  35. “Fisher and Bartlett thus agreed on the proper role of the statistician. He was not anxious to leap to conclusions nor satisfied with mere rules for consistency. Instead, he was concerned to analyze the quality and relevance of the data and the assumptions of his statistical model.”

    “Each of Jeffreys and Fisher was too wedded to his interpretation to appreciate the aims and assumptions of the other man.”

    Interpreting probability, David Howie. Location 2077 on Kindle.

    • The problem with p values and NHST isn’t with the math, it’s with the illogical usage. But no amount of teaching the true meaning of p values is likely to improve things, as discussed above, because science has become infected with it very deeply, deep enough that the quote from Upton Sinclair: “It is difficult to get a man to understand something, when his salary depends on his not understanding it” is just far too true.

      It’s not quite true that you can’t be a scientist without purposefully misunderstanding p values, but in certain fields it’s close enough. In other fields, maybe not so stridently true, but it still helps a lot career wise.

      One of the biggest problems out there is the existence of huge bodies of research that study non-existent things. How are you going to make progress in a field when many if not most of your colleagues study questions that are the equivalent of angels on a pin? This includes questions like “how does gene x regulate y to cause z” when in fact the whole idea is maybe wrong in the first place and gene x doesn’t regulate y, or y doesn’t cause z, but flawed logic in the literature over a couple of decades, published by the most powerful people in your field, shows that it does… indeed people who are famous and powerful precisely because of this false “finding”.

      • Daniel — do you have a specific example of what you mean by “how does gene x regulate y to cause z” as flawed logic? There are certainly gene products (mRNA and proteins) that are components of feedback systems (regulation) that have consequences (mitochondrial biogenesis or the translocation of glucose transporters to the plasma membrane). I don’t necessarily disagree but it’s hard for me to evaluate what you mean (or where this might apply) without a specific example.

        • All I mean is that it may well turn out that someone who “proved” that X regulates Y and Y causes Z may have done nothing of the kind… It may turn out that X and Y don’t affect each other, or that Y isn’t what causes Z but rather Q causes both Y and Z… etc. The lack of good mechanistic quantitative models and a reliance on NHST for support can easily lead people to prematurely conclude that they have proven some sort of fact.

          Anoneuoid’s paper is a good example of how that sort of thing comes about.

          ” it may well turn out that someone who “proved” that X regulates Y and Y causes Z may have done nothing of the kind… It may turn out that X and Y don’t affect each other, or that Y isn’t what causes Z but rather Q causes both Y and Z… Etc. The lack of good mechanistic quantitative models and a reliance on NHST for support can easily lead people to prematurely conclude that they have proven some sort of fact.”

          +1

        • Daniel,

          Good luck in your teaching ventures. Being the daughter of a professor, I lived on campuses for much of my life. My ambivalence about joining the ranks was a consequence of witnessing the sociology of expertise. It affected my father’s health.

      • Daniel, I like your posts and this one raises serious issues, but I’ve heard the claim that “no amount of teaching the true meaning of p values is likely to improve things” for decades. What data is it based on? There are actual observations (even if not experiments) that bear on this question. The field of epidemiology and its journals have been under concerted pressure for decades from writers like Ken Rothman to improve use of existing statistics in their articles. Rothman even founded what is now a leading journal (called Epidemiology) to set an example. If you compare the statistical horrorshows that dominated that literature 40 years ago when I was a student to articles now, yes you still see some howlers, but on the whole there is obvious general improvement.

        I hold that science is as political and human as politics. We have elites who defend their interests and play on fears of mob psychology (a psychology which after all might be blamed for the testing epidemic in the first place!). The idea behind comments like the one you repeated seems to me to be that we must either have an instant, complete solution acceptable to everyone or else give up (an attitude seen in unfounded attacks on our project, like those from Harry Crane). Such a binary, extreme choice between radical revolution and nihilism would have ruled out the improvements I have seen over the last 40 years.

        In the real world of complex social problems (like racism, sexism, and significance testing) we must apply constant pressure to keep up improvement. This is especially so given that revolutionary change seems impossible (and perhaps undesirable, since it may replace the bad with worse). There is nothing radical about telling people to stop judging complex studies in isolation based on a magic P-threshold; even Fisher opposed that abuse. Such incremental change may be the best we can do in the face of fierce opposition from powerful figures who deploy well-crafted but fallacious rationales for the confusions and bad practices on which they built their careers and maintain their constituencies. If that sounds over the top, then explain to me using evidence how it is that scientists are morally and ethically superior to other people including politicians and voters (as opposed to having sold themselves and the public on an illusion of superiority, much as clergy have done for millennia).

        • Sander: it’s not that I think there is no way to get there from here, I just don’t think that “teaching the true meaning of p values” is an important part of the path. The path to a better future is teaching the basic concepts of mechanistic mathematical modeling, and teaching how to fit your mathematical models using Bayesian methods to determine which are the plausible values you could use in your model to make it consistent both with data and with the things you knew before you got the data; also teaching model-checking techniques, such as using prior predictive distributions for choosing priors as mentioned by Chris Wilson above, and using graphical methods of posterior predictive checking to see if your model predicts things that you think are implausible after fitting, etc.

          NHST as practiced has no place in science, in my opinion. P values have their uses, but their correct use is almost exclusively computational (i.e., checking stochastic simulations) and data-reduction (separating your data into “uninteresting vs. interesting”) so you can spend your time modeling the interesting stuff.

          Also I think it’s absolutely critical for people to begin thinking of decisions in terms of utilities.

        • Agree completely about utilities. And I agree that anyone who models should understand links to mechanisms, Bayesian methods, etc., as well as frequentist methods. But I just don’t see how you can justify such sweeping assertions about what every researcher should be taught in their packed schedule, or assertions that apply to every application in every field (unless you have been pursuing hundreds of parallel full-time careers in every branch of science). Some researchers only run basic comparative randomized experiments where modeling would be pointless and a simple comparison would do, while others use only canned models with no checks, so even a look at some test-of-fit P-values would be a practice improvement (e.g., in one study I was brought in on, they had been using models whose fit tests had P<0.0001; a simple transform of an adjustment variable in the regression brought that up to well above 0.1 and so, in my view, gave more credible estimates than what they might have published).

        • Sander: “Some researchers only run basic comparative randomized experiments where modeling would be pointless and a simple comparison would do…”

          I appreciate your pointing the discussion toward that type of setting.

          A portion of my own work is not far from “simple comparison will do” which is why I’m interested in finding (hopefully) straightforward things to suggest to investigators who will otherwise do the same rote NHST/p-value stuff they learned decades ago in graduate school.

          Not every research question requires (in my opinion) an elaborate, multilevel Bayesian model with nuanced interpretation taking into account utility functions, etc. Sometimes what is required is a parameter estimate, some measure of uncertainty and an interpretation of those results relative to some pre-agreed criterion.

        • Sometimes what is required is a parameter estimate, some measure of uncertainty and an interpretation of those results relative to some pre-agreed criterion.

          From my experience, people conclude far more than is warranted from such experiments. The vast majority should never be run because they do not answer the actual research question; people only think they do because “statistical significance = real”.

          In other cases it is just a descriptive study so no conclusion needs to be drawn at all, what is the threshold for?

        • Right or wrong, sometimes the starting point of a comparison is an accepted “clinically significant difference”. There is little interest in new or different treatments which are expected to produce effects that are small relative to that minimum “clinically significant difference”. Therefore, from the off, the study is designed to evaluate outcomes relative to that magnitude or larger effects.

          Nobody is saying anything remotely like, “Since a clinically significant difference is 5 units, any effect of 10 or 15 or 20 units is the same as a 5-unit effect”. Of course not. It’s just that effects of 0.1 units or 1 unit or 2 units will be considered a not-successful treatment.

          In my admittedly provincial and limited experience, in medicine and public health there always seem to be thresholds and cutpoints popping up when you start designing a study. It’s what we start from and work the design and then analysis accordingly.

        • Right or wrong, sometimes the starting point of a comparison is an accepted “clinically significant difference”. There is little interest in new or different treatments which are expected to produce effects that are small relative to that minimum “clinically significant difference”.

          Well it is wrong…

          My favorite example is chemotherapy drugs causing nausea -> caloric restriction -> slower tumor growth. Clinically significant difference, but if understood correctly no one would take a poison instead of just eating less.

        • I think it’s fine if some people aren’t doing science, like lots of doctors just treat patients… And it’s also fine if some people are pure experimentalists, they can set up experiments and physically carry them out and let others do the analysis… But if you are going to analyze data and come to conclusions about the data, and make decisions, you need to have the basic tools. No one would suggest an illiterate person should be a copy editor for a magazine… Why should someone without any mathematical background analyze quantitative data?

          Actually your joke about having hundreds of careers is not far off… I’ve done cryptography, finance, software development, civil engineering, forensic engineering, biomedical data analysis, fluid mechanics, bioinformatics, … I’ve never found a field where you could make progress on understanding mechanisms (which is my definition of doing science) without any mathematical background, and that includes simple things like calculating dilutions or just understanding logic… You can be a technician, carrying out reactions etc., and lots of people do just that, but designing experiments and figuring out whether you’ve got information that lets you conclude something… those are fundamentally mathematical, logic-based skills.

        • Also, it’s obvious that the degree of sophistication required is variable; I just don’t think teaching more about the logic of p values gets us closer to something good. In fact the mathematical and logical prerequisites to understand p values properly are higher than the sophistication required to understand other kinds of more useful skills, I think… I mean, the evidence in the journal literature is that the errors are made by people who should know better and have taken multiple high level stats courses, right?

        • Also, Sander, it seems like I should point out here that whatever disagreements we are hashing out here, we agree on a vast vast majority of other important bits. Your:

          If that sounds over the top, then explain to me using evidence how it is that scientists are morally and ethically superior to other people including politicians and voters (as opposed to having sold themselves and the public on an illusion of superiority, much as clergy have done for millennia).

          It didn’t sound over the top at all. It sounds just like something I would say.

          You and I agree on the disease diagnosis; we may have minor disagreements on which are the most important first steps to take to staunch the bleeding. In the end I suspect these differences may come from our different backgrounds of application. Epidemiology, for example, is much more difficult to build mechanistic models for than engineering failures or embryo development or biomechanics.

        • Daniel:

          I agree with your statement:

          I just don’t think that “teaching the true meaning of p values” is an important part of the path.

          Part of this is just that class time is precious so why waste it on a method that is not relevant to most applied questions (other than the question, “How can I get my noisy study accepted in Psychological Science?”).

          But part of it is deeper: it’s that p-values are presented as a measure of strength of evidence and they’re not; p-values from different studies are compared with each other, which is wrong; p-values are used to sort findings into groups, and that’s not right; etc.

          Pretty much the main point of teaching the true meaning of p-values would be to convince people not to use them. And that’s not really where I want to spend most of my time as a teacher, telling people what not to do. It’s just kind of demoralizing for all concerned. Maybe we should be teaching that way, but I think it’s a hard sell for teachers and for students alike.

        • Great, I agree with everything you said there, why teach what not to do? It’s pointless, give people a more intellectually satisfying, useful, and logically sophisticated way forward. Even just for measuring the difference between a single measure and two groups. Explain why randomization is important, and what it does and doesn’t do. Explain how to model something as simple as a randomized controlled trial done in two batches with a batch effect… Just leave out the null hypothesis testing entirely.

        • Daniel:

          People could argue with me that I don’t quite believe “why teach what not to do?”, given that I’ve published lots of articles and devoted lots of blog space on “what not to do.” Part of this can be explained by different audiences and different purposes.

          Annoyingly enough, I feel that Greenland and I and others who are making these points are getting slammed from both directions: On one hand, people are (incorrectly) saying that we’re only criticizing and not offering alternatives (a quick glance of the research and textbook output of both Greenland and me will show that we’ve offered clear alternatives, both in methods and in applied work). On the other hand, people are saying this is all hopeless: statistical significance is so popular that we have to teach it anyway. This kind of annoys me. Sander and I are busy people but we can’t do everything at once. I think that spending decades working on applied problems, writing textbooks, and offering real alternatives, and also pointing out problems in standard ways of thinking, are both important.

        • As someone “out in the field”, I can tell you that this strategy simply doesn’t work. The reason is that many of my students will go to their own PhD advisors and will have to do a frequentist analysis because that’s all that the advisor does/knows.

          I also have to work with collaborators who always ask me at the end of an analysis, “so, is it significant?” I can’t blow them off and I can’t give them a 26-week course (that’s what it takes to cover the whole story in my teaching program: one year of teaching over two semesters to get to Bayes).

          So all this talk of not even teaching all this stuff is for philosophical discussions over a beer, not relevant for real life, IMO.

          Andrew wrote below that “On one hand, people are (incorrectly) saying that we’re only criticizing and not offering alternatives (a quick glance of the research and textbook output of both Greenland and me will show that we’ve offered clear alternatives, both in methods and in applied work).”

          I agree with Andrew’s statement, he *has* offered many concrete alternatives. I know because I am a student of his writings and books and have spent many years closely studying his opinions and thoughts on these issues and try to apply them in practice in my and my students’ work (not hugely successful so far—my students often resist this move away from decisive statements, and reviewers+editors make my life difficult, leading to far fewer publications relative to the number I write and submit).

          However, the problem in my view is the following:

          It is often not clear *what* exactly Andrew and others are advocating, if you read his books and articles. E.g., sometimes he will use a posterior probability of the parameter being positive to argue for “substantial evidence”. The student can’t be blamed for taking this up in their work, but then other reviewers correctly rise up in arms against statements like that.

          Andrew can get away with saying “substantial evidence” in connection with a posterior prob. because it’s obvious he has domain knowledge and statistical credentials, but a lowly student can’t. To be fair, the reason it’s not possible to give a single solution as a replacement for NHST is that every situation is different.

          I don’t know what to do about it, I’m just identifying the problem with (correctly) raising all these issues with a single decision framework like NHST, but then effectively telling people, guys, do the right thing. What the right thing to do is depends on the situation at hand. And that requires training and education which the researcher lacks. People don’t know what to do instead of NHST *in their specific case*.

          It’s a stalemate.

          In the Nature comment, IIRC there was a suggestion to provide confidence intervals instead. People will still want to interpret them, and that’s where the problems begin, esp. when the power function is hovering in the nether regions near 6% and the researcher doesn’t even know that it is low. I know because I face this in my own work with my students.

          It’s hard for a newcomer to statistical analysis to sit down with a confidence interval (credible interval in my lab) and ask: what does this tell us?

          Student: Can we say we have evidence for such and such effect? Me: no.

          Student: what can we say then? Me: Something like “the posterior distribution is consistent with the theoretical prediction”.

          Student: but didn’t you just make a decision about the data? Me: No, I just said it was consistent with what was predicted, I didn’t say I found something out for sure (“reliable”, “significant”, etc.).

          So I usually add in the paper: “Replications are needed to establish the robustness of our findings.” Paper goes to journal, desk reject. Reason: paper doesn’t provide closure. Period, end of story. Then we try again.

        • Student: If replications establish the robustness of our findings, can we say that we have evidence for such and such effect?

        • Carlos Ungil wrote:

          “If replications establish the robustness of our findings, can we say that we have evidence for such and such effect?”

          I would say: it depends on how you define the word evidence. Royall defines it as a likelihood ratio comparing two models. If we accept that as a definition, I would just say that my confidence in the observed effect being consistent with theoretical predictions has increased because I can consistently replicate the effect (I’m not talking about significance, but a consistent sign of the effect with similar ranges of the credible interval in repeated samples). In a footnote in Gelman and Hill, the authors call this the “secret weapon”.

          I go one step further and follow the Roberts and Pashler 2000 (How persuasive is a good fit) criteria for determining what counts as a good fit to a theory/computational model.

          We have a forthcoming paper on evaluating data against model predictions that spells out the way we express ourselves: https://psyarxiv.com/w2ckt/

          A revision of this paper will appear in a few days on psyarxiv. It’s an important paper for me, it’s sort of a culmination of 17 years of pursuing the predictions of a particular process model of a cognitive process I’m interested in (sentence comprehension).

        • Shravan. Yes, hence the frowny face. I would love to feel like being in academia was a good place to be, and that you and I could both make an honest living doing good work in academia without the kind of bullshit you mention where no matter what you do your colleagues and reviewers etc demand a sacrifice to the gods of fake science…

          I applaud your efforts, would love to help make things better. I’m considering some options to do something useful for teaching, I value your opinion, email me and we can discuss.

        • Sameera:

          Fair enough. “Slammed” is too strong. More accurate to say that our view is being disputed. I have no problem with people disagreeing with me, but I do think it’s worth pointing out incoherence in these counterarguments.

        • But Andrew, you have also said that p-values can be useful. If so, it makes sense to teach them.

          I do agree however that I *never* need p-values in my research, regardless of the situation. I just don’t think people realize that that’s true for their work too. I talk to people in my field who are way more sophisticated than me in every direction, and really, it’s all about likelihood ratio tests. That’s unshakeable evidence for them. Royall seems to agree. If I can’t even convince the elite here that it doesn’t work to use significance, I’m pretty sure there’s no chance of convincing the rest of the population.

          I’m very curious to learn what our statistical significance paper in Journal of Memory and Language will change. If people actually read it, it should shock the hell out of them and make them fall off their chairs.

        • Shravan:

          Yes, p-values can be useful. But in those examples, a straight estimation approach will also do the job, even better I think. For people who are already using p-values, fine, it makes sense to get as much out of them as possible. But I see no good reason to learn p-values when starting from scratch—except, as discussed elsewhere on this thread, for the value of understanding what other people have been doing (for example, knowing the definition of a p-value can be helpful if your goal is to understand how it was that the entire scientific and news media was bamboozled by Brian Wansink etc.).

        • Andrew: “it’s that p-values are presented as a measure of strength of evidence and they’re not”

          While p-values are not a measure of the strength of the effect, why aren’t they a (very imperfect) measure of the strength of evidence against the null? The long-run average of a p-value goes to zero as t/F/whatever gets bigger and these get bigger as the null is less likely given the data — (hmm that’s kinda circular). t/F/whatever are measures of the strength of the signal relative to noise. In this blog, you frequently use a quick and dirty t (effect divided by SE) as a measure of strength (in the signal to noise sense, which I consider evidence against the null).

        • Maybe “p-values are not a measure of strength of evidence” has to be understood as a particular case of a more general “there are no (prior-free) measures of strength of evidence”. Andrew has discussed in the past his hate of Bayes factors.

        • So when Andrew divides an effect by its SE (which he frequently does here), what is the purpose? It’s a signal to noise ratio. Andrew seems to be using it as a back-of-the-envelope (instead of computing exact p from the ratio) way to weigh the evidence that there is an effect, or that something is going on. A frequentist would interpret this signal:noise as evidence against the null (generally by computing the p-value first).
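          As a rough illustration of that back-of-the-envelope habit, here is a minimal sketch with made-up numbers (the estimate and SE are hypothetical, not taken from the thread): dividing an estimate by its standard error gives a signal-to-noise ratio, and under an assumed normal approximation that ratio maps onto a two-sided p-value.

          ```python
          # Minimal sketch (assumed normal approximation, made-up numbers):
          # the back-of-the-envelope "effect divided by SE" screen and the
          # two-sided p-value it implies.
          from scipy.stats import norm

          estimate, se = 4.2, 1.5            # hypothetical estimate and standard error
          z = estimate / se                  # signal-to-noise ratio
          p_two_sided = 2 * norm.sf(abs(z))  # tail area under the normal approximation
          print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}")
          # z of about 2.8 corresponds to p of about 0.005; the ratio and the
          # p-value carry the same information here, which is the point at issue.
          ```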

        • If you treat the drag coefficient of a golf ball as equal to the drag coefficient of a pingpong ball, you will get an answer that isn’t too far off the truth for many flight regimes, but if the regime is in the region of turbulence onset you will get different results, the dimples matter. Still, if you’re trying to illustrate a basic principle, treating both of them as a constant equal to about 0.5 is a fine way to illustrate some other part of the calculation.

        • I think there is an even bigger issue with the question “Will improving statistical teaching improve the scientific literature?” Many of the researchers producing the literature have no statistical training at all, and many don’t collaborate with someone who does either. So better teaching of statistics might make some difference, but it won’t solve the problem.

          If the journals are the gatekeepers (and we can argue another time whether that is ideal or not), then what we need to solve the problem is reviewers and editors with appropriate statistical understanding. Otherwise anyone who succeeds in publishing a substandard paper gets asked to review and perpetuates the cycle. It’s probably too simplistic, but if every reviewer had to pass a short basic statistics test, my guess is that we could move the standard in the field rapidly.

        • I think that’s a key point. For the vast majority of research publications produced every year, the authors are not going to force the issue if it comes to flatly contradicting or defying the opinions of reviewers and editors. Relatively few researchers have the standing to publish over the objections of a “gatekeeper”.

          To the extent this article or other advocacy efforts are able to reach editors and reviewers, there is enormous room for improvement. But that’s the key leverage point, not the people writing the manuscripts.

        • Re: “Will improving statistical teaching improve the scientific literature?”

          That is the underlying question I have as well. Thank you. There seems to be an assumption, or an acceptance, that some subset of people has the right approach to, and understanding of, statistical theory and practice.

          So who is to teach what to whom? A central question as well, given the state of statistical teaching currently.

    • I agree with a lot in there, but think it misses the main problem (hypothesis to test needs to not be a strawman). But this caught my eye… I can’t imagine it could serve to do anything but cause more confusion:

      rescaling p to the Shannon information (S-value) s = −log2(p) to provide a better scale for measuring the amount of information the test supplies against the hypothesis;

      • I have to marvel how here and elsewhere those who decry the statistical sins of the unsophisticated will nevertheless produce strong statements about what will and won’t work for teaching based on no experiments, no data, or even just anecdotal experience or considerations of what trainees will face in the world beyond these blogs. Maybe you can’t imagine the utility of s = −log2(p) because you never read about its advantages or used Shannon information in teaching. Did you read the article in the new special issue of TAS explaining its conceptual advantages over P-values? You can toss out P-values in your practice if you want, but I have seen S-values help explain (as opposed to just assert) what P-values are and aren’t saying to those who don’t have the background or time to do a stat major along with their own field’s requirements. And even if everyone stopped using P-values and everything associated with them tomorrow (don’t hold your breath given comments like those of Ioannidis), those folks would still need to understand P-values well to correctly comprehend almost all “statistical inferences” in the past 100 years of scientific literature.

        • Sander:

          Indeed, maybe students need two courses: one on how to design and analyze statistical studies, and one on how to interpret the past 100 years of scientific literature. I do a bit of the latter in my courses and in Regression and Other Stories. For better or worse, it’s my impression that students are much more interested in how to do things right, than in understanding how to read existing published work. But both skills are important, and my own views on both have changed a lot in the past ten or fifteen years. It will help to have some examples showing the old and new approaches. I’m writing a paper now with one such example and plan to do others. I learn a lot from writing each of these case studies.

        • Completely agree with Andrew. I would have been pretty clueless if I hadn’t been exposed to the controversies in statistics. I am struck that even seasoned statisticians ask very basic questions that one might attribute to a 1st-year statistics student. It’s actually encouraging to observers like myself who are not experts to begin with. lol.

        • Let me put something here about not blocking ways to get less wrong along with accelerating the process of getting less wrong.

          Blocking getting less wrong is far worse than not accelerating it, and here we are like experimenters designing experiments – we don’t have the evidence but largely just fallible guesses at what _works in practice_.

          Borrowing the phrase that science is everyday inquiry with helps, perhaps reforming statistics means everyday frequentist calculations explained very differently and much more carefully, with helps that will help only with enough expertise.

          So make frequentist approaches a ladder that, once climbed, could easily be kicked aside for, say, a credible Bayesian analysis?

        • My thoughts are simply that:

          1) Many people are confused about the meaning of a p-value
          2) Adding another step of log2(p-value) will be even more confusing to these same people

        • Anon: 1: We all know that many are confused about the meaning of a P-value; I’d go much further and say that most are confused at least to some degree, including some who write authoritative articles and books about statistical testing and the meaning of P-values.
          2: The S-value (test “surprisal”) is -log2(p) (you dropped the minus sign, a sign you either haven’t read what I write or else what you write).
          And if you haven’t read my article about it neutrally, thoroughly and carefully, thought about it over some days, and taken a look at the general information-entropy idea of surprisal on which it is based*, you are not practicing scientific criticism, you are just being reactionary. So why then should I take your thoughts as anything more than negative noise to the theme of “it’s not in my playbook and I didn’t understand it instantly so I must attack it.” See also Keith O’Rourke’s more measured comment about blocking experimentation and thus impeding progress in getting less wrong.

          *due to Claude Shannon and IJ Good, not me; simple intro at Fraundorf, P. “Examples of Surprisal,” available at http://www.umsl.edu/~fraundorfp/egsurpri.html

        • I love logarithms, I used to play with slide rules as a kid, they were already antiquated, but nevertheless enjoyable to my sense of aesthetics. But I became a math major. My impression is that many of the people using p values don’t know what a logarithm is or what it does. Thinking of people like Doctors or Biologists or Psychologists or people studying Education effectiveness in elementary school or whatever…

          I get the complaint about not engaging directly with the proposal, but also I get the idea that maybe Anoneuoid has a pretty well informed idea of the level of mathematical sophistication of the people he’d like to see do better stats, and asking them to understand logarithms might be problematic.

          So, I think there’s a little of both things going on.

          But honestly, I don’t think the problem with p values and NHST is math, and I don’t think improving the mathematical understanding of p values is the best way forward, but we’ve been through that a little somewhere in the comments above already. I’d rather teach people about say building algebraic expressions that express the “word problem” of their simple experiment, and then have them think about the fact that with one equation and several unknowns… they can’t solve the problem exactly, but might get closer to solving the problem by collecting data and trying to measure different components of this “word problem”.

        • Your algebraic approach sounds intriguing, have you written it up?

          Asking people to understand probability as it is used in statistical methods is demanding an order of magnitude more sophistication than asking them to understand logarithms, especially base-2 logs, which are easily illustrated with binary coding (something more familiar to younger generations than to ours) and coin tossing. The obscurity of probability is a problem that’s been documented for 50 years in the experimental cognitive psych literature, a problem which Gigerenzer addressed using natural frequencies. One can teach binary surprisal using such frequencies, bypassing the log definition altogether by going straight to the coin-toss experiment that produces the P-value as the probability of N heads in N fair tosses. I believe Andrew mentioned doing that in his teaching (he doesn’t call the N of tosses the binary surprisal or S-value or Shannon information, but that’s what it is). The technical exposition I gave in TAS is for those with your kind of background, to explain the foundation for this approach and its connection to conventional theory. It’s illustrated with examples like how to show P=0.05 is weak evidence against a model without resorting to objectionable “Bayesian” machinery like unjustifiable prior spikes with 50% mass. But alas such an exposition explains nothing if it is not read.
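          A minimal sketch of the S-value arithmetic under discussion (my own illustration; the p-values are just example numbers, and the coin-toss reading follows the description above):

          ```python
          # Minimal sketch: the S-value rescales a p-value to bits of refutational
          # information, s = -log2(p).  Via the coin-toss device described above,
          # p is roughly as surprising under the test model as getting all heads
          # in about s fair coin tosses.
          import math

          for p in [0.5, 0.25, 0.05, 0.005]:
              s = -math.log2(p)
              print(f"p = {p:<6} -> S-value = {s:.2f} bits "
                    f"(~ all heads in {round(s)} fair tosses)")
          # p = 0.05 gives s of about 4.3 bits: roughly as surprising as 4 heads
          # in 4 tosses, which is why it is described above as weak evidence.
          ```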

        • Daniel said:

          “I love logarithms … My impression is that many of the people using p values don’t know what a logarithm is or what it does. Thinking of people like Doctors or Biologists or Psychologists or people studying Education effectiveness in elementary school or whatever…”

          and

          ” I’d rather teach people about say building algebraic expressions that express the “word problem” of their simple experiment, and then have them think about the fact that with one equation and several unknowns… they can’t solve the problem exactly, but might get closer to solving the problem by collecting data and trying to measure different components of this “word problem”.”

          Sadly, many math teachers don’t understand these ideas well enough to include anything in their teaching that would help students understand logarithms and how to build algebraic expressions describing a real world situation. (And, also sadly, the standard school textbooks don’t include any of the types of problems that would be helpful in practicing these skills.)

          I’ve been very fortunate to be able to teach some pre-service and current secondary math teachers some of the skills they need to do these things. If anyone is interested, see, for example, the following handouts at https://web.ma.utexas.edu/users/mks/ProbStatGradTeach/ProbStatGradTeachHome.html :

          Models and Measures
          Measures, Rates Ratios and Proportions
          What Do You Mean by Average
          Weighted Means and Means as Weighted Sums
          Logarithms and Means
          Lognormal Distributions 1
          Lognormal 2
          Basic Probability
          More Probability Problems
          Diagnosing Disorders

        • > Your algebraic approach sounds intriguing, have you written it up?

          No, but I’ve been toying with doing something like this for a while. I want to put together some example real-world problems and explain how to model them from first principles… assuming a background of nothing more than college algebra. I’d teach calculus as needed using an algebraic approach (nonstandard analysis).

          Actually if you are interested I’d be very happy to discuss the ideas with you, and maybe get ideas for real world example problems to use. I’m over near Pasadena, and could easily come meet with you at UCLA too. Email me?

          http://www.lakelandappliedsciences.com/contact-us/

        • And if you haven’t read my article about it neutrally, thoroughly and carefully, thought about it over some days, and taken a look at the general information-entropy idea of surprisal on which it is based*, you are not practicing scientific criticism, you are just being reactionary.

          You are right, I didn’t look closely at it. The reason is that it comes with all the same assumptions as a p-value, since that is an intermediate step. So if the p-value is calculated for a strawman, the “S-value” will be too.

          [The S-value is a measure of the information against H encoded in the test statistic (the refutational information supplied by the test given the model A)]

          It is all an exercise in futility since even if I know with 100% certainty that the strawman is false (“omniscient Jones” told me so), I still haven’t learned anything, since I knew that to begin with.

          Now if the null model is set to be something meaningful, perhaps the S-value is much better than the p-value. But getting people to check a meaningful null model is such a huge problem I do not care about the details of how exactly it gets “tested”. Anything (just eyeballing it) is so far superior to testing a strawman that there is no comparison. The gulf is as vast as the difference between science and religion.

        • Anon, this is my take so far: As with many I encounter (e.g., Trafimow) I think the strawman argument has blinded you to the purpose and function of statistical models as explained (for example) by George Box. As before I’d ask you to not respond by reacting to that remark, but instead respond after reading the details in the expository TAS article in light of the following summary narrative explanation for that remark:

          Box was a very practical and experienced modeler and a proponent of P-values for model checking even for nominal Bayesians like himself. “All models are wrong but some are useful” was his most famed quote so he knew the strawman argument well, and how it misleads. If we know all models must be wrong in some possibly important way, why are we using them? Because we have no choice! Nature does not deliver us the truth in dreams or voices in our head. A model is a manifold that we can shrink our jumble of observations toward in the hopes that the stabilization (variance reduction) so gained compensates for any bias introduced by the inevitable model mis-specification.

          That hope won’t be met by most models or just any model. Ideally we start with models that are as simple to write out and use as they can be without contradicting our external information (“prior knowledge”); it should be as simple as can be while staying compatible with the context. For example, a good modeler knows the application context well enough to recognize when a linear term alone is hopeless as an approximation because of effect reversals (e.g., health responses to vitamin intakes inevitably have reversals in an “optimal” intake region). That’s the model as a prior, as Box emphasized.

          Upon applying that starting model we should check the fit, because we know it’s wrong in some way and we want to not use it if the data are making clear how it is wrong. Otherwise we proceed with it as a tentative working model (working models rather than laws of nature is what stat modeling is about – note however how the high-caste math-stat literature nonetheless calls distributions “laws” as if otherwise, and Box was an early critic of that literature).

          Now I rarely see checking done in practice in my field even though it is easy to see or request basic checks from popular packages, such as “goodness-of-fit” statistics (quantities misnamed like most everything in stat, since they actually measure badness of fit). Naive eyeballing falls apart rapidly once you need more than a few variables in the model; sure, there are amazing visualization methods out there to get around that problem, but in my field only a tiny percent of researchers doing regressions are aware of them, let alone do them or have software that does. A fit statistic (and those include coefficient tests, which test the fit of a reduced model relative to an embedding model) is an immediately available warning (idiot) light against the possibility that the shrinkage tradeoff may be unfavorable for the model they are using. More specifically, a large value for it relative to its reference distribution says the data fall in the fringes of the noise cloud which (as Lakeland aptly describes it) we would see surrounding the model manifold upon simulating from the model.

          The badness of fit (warning-light) intensity could be measured by any strictly increasing function of the test statistic, so an important question is: Are there any functions that stand out for some desirable properties? A rationale for the P transform is that it automatically incorporates the reference distribution into the measure. The Shannon transform does that too and has several advantages over the traditional P-transform both abstractly and ergonomically (in terms of heading off naive misinterpretations such as inversion fallacies, and in providing an equal-interval intensity scale). See further details at the TAS exposition,
          https://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625?needAccess=true
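          A minimal sketch of the “warning light” idea, using a toy example of my own (an assumed Poisson working model checked against deliberately overdispersed simulated data); it is not taken from the TAS article:

          ```python
          # Minimal sketch: use a fit statistic as a warning light by simulating
          # replicated data from a tentative working model and seeing where the
          # observed discrepancy falls in the resulting noise cloud.
          import numpy as np

          rng = np.random.default_rng(1)
          y = rng.poisson(3, size=50) * rng.integers(1, 3, size=50)  # overdispersed "data"
          mu = y.mean()                                              # fitted Poisson mean

          def discrepancy(x, mu):
              return x.var() / mu    # variance-to-mean ratio; near 1 under a Poisson model

          reps = np.array([discrepancy(rng.poisson(mu, size=y.size), mu)
                           for _ in range(5000)])
          p_check = np.mean(reps >= discrepancy(y, mu))
          print(f"observed var/mean = {discrepancy(y, mu):.2f}, check p = {p_check:.3f}")
          # A tiny check p says the data sit in the fringes of the working model's
          # noise cloud: a warning light, not a verdict on any scientific hypothesis.
          ```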

  36. Somewhat important to this discussion: I think (based on a small sample) that many students in what is probably the most well funded area of science, cell & molecular biology (including neuroscience and microbiology), do not take a statistics course as a grad student and may not have had stats 101 as an undergrad. For many of these researchers, their entire statistical training was in the lab by a postdoc, who was trained by a postdoc, etc. The dominant software used is GraphPad Prism, which is entirely about analyzing experiments with NHST. It is the anti-modeling software. I have limited experience (only played with the 30-day trial copy) but I’m pretty sure one cannot, say, use a GLM (negative binomial or whatever) for count data. Or a linear mixed model for blocked or repeated measures data. The software was developed by Harvey Motulsky, whose books are very readable and try to guide/nudge researchers away from the most egregious errors of interpretation. There is effectively no anti-NHST literature in this field. I suspect many researchers in this field would be baffled by this conversation.

    A perspective on this, from a star in the field is here: http://ewanbirney.com/2011/06/five-statistical-things-i-wished-i-had.html

    • Thanks for your comments and the link. I think your comments, Ewan’s blog post to which you link, and the comments to his blog post are important “data” on how statistics education is inadequate in one subarea of science. There is no reason to suppose that there are not similar inadequacies in other subareas. It’s a big task; as I wrote in an earlier comment, there’s a lot of work still to be done in providing adequate education in statistics, so we all need to keep on doing what we can, and encouraging others to do so — to resist the inertia of “that’s the way we’ve always done it.”

    • The relationship between Pvalue, Effect size, and Sample size – this needs to be drilled into everyone – we’re far too trigger happy quoting Pvalues, when we should often be quoting Pvalues and Effect size. Once a Pvalue is significant, it’s higher significance is sort of meaningless (or rather it compounds Effect size things with Sample size things, the latter often being about relative frequency). So – if something is significantly correlated/different, then you want to know about how much of an effect this observation has.

      http://ewanbirney.com/2011/06/five-statistical-things-i-wished-i-had.html

      People come up with some amazing gibberish to justify their use of NHST. This is too muddled for me to figure out, but he definitely thinks “significance = real”.

      On top of that, somehow the influence of sample size on p-values is a problem for him, but only for significant p-values? And the reason he wants to know the effect size has nothing to do with any biological meaning, but because it (somehow) helps him interpret his p-values?

      And he includes “statistical methods” as a main area of his research on the about page…

      • Here’s the relevant “significance = real” quote:

        “So – if something is significantly correlated/different, then you want to know about how much of an effect this observation has.”

        This is what I think of when I hear it asserted that “oh, we don’t just look at p-values, we also look at effect sizes” (see Nick’s posts above). Sure, they look at effect sizes – after they look at the p-values. Because if p > 0.05, then the effect “isn’t real” or “was due to chance”, and should be ignored. Whereas if p < 0.05, the effect "is real" or "wasn't due to chance", and then it is appropriate to interpret effect size.

        Judging by the variety of disciplines in which I've seen this line of thinking, I suspect that it is rampant.

        • There exists a subset of doctors with an interest in statistics/epidemiology/math. At the risk of over-generalising, I would say that most of them would have enjoyed maths prior to university, still have an interest, and have not forgotten what a logarithm is. Several commenters in this thread have painted a grim picture of the statistical literacy of the medical profession which I do not think is entirely justified.
          With regard to assessing evidence we really/honestly/truly do not just look at the p-value. Instead we do this:
          1. Only look seriously at randomised controlled trials – observational studies are only hypothesis generating.
          2. Scrupulously check the method and conduct of the trial to ensure there has been allocation concealment, blinding etc.
          3. Only then will we look at the results, taking into account effect size (which needs to be clinically significant), confidence intervals, and yes, p-values. We really do understand the shortcomings of NHST and we deal with them because we have to, because that’s what we are given. The p-value conflates effect size and uncertainty, we get that.

          Although reading the medical literature (especially abstracts) you might get the idea that it is all about step 3 above, we are actually more concerned about steps 1 and 2.
          And perhaps to the disgust of some, and perhaps unreasonably, we have a deep mistrust of most sophisticated statistical techniques, including Bayesianism in its entirety (Stephen Senn has written about this).
          NHST may be an ugly baby, but it is our baby and it will not be given up without a fight. Petitions. Ha!

        • Nick, I think your 1, 2, and 3 are great, and I don’t object to looking at p-values. If the p-value is just one summary statistic reported among many, I don’t think there’s great harm in this (though I don’t think there’s much benefit either). My objection is to using the p-value as a filtering device. And one of the big problems is that you and everyone else reading published results may not have a choice in this matter, if the journals are selecting what to publish using statistical significance as one of the criteria. I don’t know how common this is in your field; in some fields it is damn near ubiquitous. As I said above, a lot of people who use and consume statistics genuinely believe that not significant means “ignore this; it isn’t real”.

          On the other hand, if I know that a study would have been published regardless of whether the result was statistically significant, then I am much more inclined to trust the reported effect sizes (this is the great advantage of registered reports, where studies are guaranteed to be published regardless of how the data turn out). But if I believe a study came to the point of getting published via a system that filters out non-significant results, then I do not trust the reported effect size. This is because it was arrived at using an estimator that, from the point of view of the person reading the paper, is biased. This isn’t meant as throwing shade at NHST; it is a mathematical certainty.

          This aspect of NHST is the one that I think most people here are objecting to. It isn’t the p-value itself. It’s how the focus on p-values biases research and publishing practices. When you read these anti-NHST pieces, it’s the effect of the methods on the practice of statistics that is the focus.

          I’ll also admit that I know very little about the medical literature or the statistical literacy of the medical profession. I work mostly with academics outside of strictly medical fields. From the people I talk to and the papers I read, I have developed a pessimistic view of popular statistical practice, because what I see is “statistical significance” acting as a gatekeeper. This gatekeeping has serious negative consequences; e.g. all of these replication crises we’re witnessing.

        • Ben:

          Yes. As we wrote in our Abandon Statistical Significance paper, the use of a lexicographic decision rule based on the p-value is problematic.

  37. Suppose we have a scalar real parameter theta, and we perform a test of the point null hypothesis that theta = 0. Assume the test provides for a two-sided significance level of alpha. Suppose we obtain statistical significance (i.e., we reject the null hypothesis), and the associated point estimate theta.hat is positive. What is the posterior probability that theta is in fact nonpositive?

    If we employ any proper prior on theta which is symmetric about zero, then we can obtain the following upper bound on this probability using Bayes’ theorem:

    P(theta nonpositive | significant, theta.hat positive) < 0.5 / (0.5 + ((1 – beta) / alpha)), where 1 – beta = P(significant, theta.hat positive | theta positive), the power of the corresponding upper-tailed test averaged over the prior.

    What’s notable about this bounding formula is that whenever 1 – beta >> alpha, the bound is small. E.g., when alpha = 0.05, and 1 – beta = 0.50, then the risk that theta is nonpositive is no more than 4.8%. Even when alpha = 0.10, and 1 – beta = 0.20, the risk is no more than 20%. While multiple comparisons problems inflate alpha, they also inflate 1 – beta, so it isn’t clear whether specification search, garden of forking paths, p-hacking, etc. drive up this risk in general.

    You can tell a parallel story about the risk that theta is nonnegative, given significance and theta.hat being negative. Indeed, you can define a “false sign rate”, or FSR (by analogy with the false discovery rate), as the maximum of these two error probabilities. The FSR is then a bound that applies whether the estimate is positive or negative, as long as we’ve obtained (two-sided) significance.

    Seeking consensus on 1 – beta and alpha is challenging, given that different individuals may adopt different priors, and may have different ideas about how to account for multiple comparisons. As long as the complete analysis plan is reported, however, it should be easy enough to forge consensus on whether 1 – beta >> alpha. And if we have agreement on that, then we have agreement that there is a low risk that we have the sign wrong. In that case, two-sided significance of the usual sort can be quite informative, even if it’s not the only thing we should care about.
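    A minimal simulation sketch of that bound, under one assumed setup that the comment itself does not specify (standard normal prior on theta, unit-standard-error normal estimate, alpha = 0.05):

    ```python
    # Minimal sketch: check the claimed bound
    #   P(theta nonpositive | significant, theta.hat positive) <= 0.5 / (0.5 + (1 - beta)/alpha)
    # under an assumed symmetric prior theta ~ N(0, 1) and theta.hat ~ N(theta, 1).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    alpha = 0.05
    z_crit = norm.ppf(1 - alpha / 2)

    theta = rng.normal(0, 1, size=2_000_000)   # draws from the symmetric proper prior
    theta_hat = rng.normal(theta, 1)           # unit-SE estimates

    sig_pos = theta_hat > z_crit               # two-sided significant with positive estimate
    fsr = np.mean(theta[sig_pos] <= 0)         # empirical "false sign rate"

    power = np.mean(sig_pos[theta > 0])        # 1 - beta, averaged over the prior on theta > 0
    bound = 0.5 / (0.5 + power / alpha)
    print(f"empirical P(theta <= 0 | sig, positive) = {fsr:.4f}, bound = {bound:.4f}")
    # The empirical rate should come in under the bound in this setup; the later
    # replies point out that the rate itself depends on these modeling assumptions.
    ```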


    • I explained previously that the p-value refers to the entire model, not just the value of a single parameter (“theta”):
      https://statmodeling.stat.columbia.edu/2019/03/16/statistical-significance-thinking-is-not-just-a-bad-way-to-publish-its-also-a-bad-way-to-think/#comment-996157

      Your “false sign rate” is a misnomer, it doesn’t tell you the rate at which you will draw incorrect conclusions about the sign of the parameter. So, it would be inappropriate to use it to “have agreement that there is a low risk that we have the sign wrong”.

      • There’s no p-value here. Just a decision rule that yields “significant” or “not significant” with some two-sided type I error rate, and some power averaged over the prior. The rest is Bayes theorem.

        • There’s no p-value here.

          Then where is this coming from:

          Suppose we obtain statistical significance

          And this:

          the corresponding upper-tailed test

          You aren’t doing the same thing as in that Greenland and Poole 2013 paper?

        • There’s no flat prior on theta here, and you can define tests and their frequentist properties without p-values entering the story in any way, including one-sided and two-sided tests. What I’m “doing” is just pointing out that Bayes theorem plus a prior symmetric about 0 means that a much higher probability of significance + positive estimate when theta is actually positive than the two-sided significance level implies a low probability that theta is nonpositive. Unfortunately I have to run now, but the details of the argument are all there in the original post.

        • The problem isn’t with the p-value per se, that is just what was used in the Greenland paper. The problem is with whatever assumptions are being made in addition to theta = 0 to get these numbers.

          If you work out an entire example I am sure you will see that the “false sign rate” depends on stuff other than the value of theta.

  38. When the number of comments on this post reaches the number of signers of the petition, I suggest we declare defeat. Or declare victory. Type I or Type II error – your pick.

    • No rather a kind of victory – we have discovered one of the real enemies – us!

      (Strictly speaking for “us” read – failing to communicate in a scientifically profitable manner that enables all of us to get less wrong to the maximal degree.)

  39. “This uncertainty is a key reason why this discussion is worth having, I think.”
    Of course, but surely you know what the “p” in p-value stands for – it stands for probability (of the data given that the null hypothesis is true, informally), and probability is a measure of uncertainty. It seems the p-value serves the purpose you advocate perfectly, unless you want to abandon probability theory entirely (a quick Google suggests there are alternatives, the so-called “possibility theory” and Dempster–Shafer theory, to name a few).

    • Chao:

      My full quote there is:

      I agree that the effects of any interventions are unknown. We’re offering, or trying to offer, suggestions for good statistical practice in the hope that this will lead to better outcome. This uncertainty is a key reason why this discussion is worth having, I think.

      Here I’m talking about the potential effects on real-world statistical practice of an intervention such as writing a paper or signing a form recommending that statistical practice be changed in some way (in this case, by “retiring statistical significance”). We can’t know the effects of such an intervention without trying it—or even then, really. This is just a basic statement about causal inference and the difficulty of prediction; it has nothing to do with p-values or anything else.

      • Not sure why my post failed, but here I try again:

        My bad. One of the key messages of the Nature article is ‘uncertainty’ (for data analysis and interpretation in a regular scientific paper), hence my confusion. What I tried to say is this: to “embrace uncertainty”, one can embrace the p-value (and its kin), as a p-value is a probability, a measure of uncertainty. This is in contrast to the tone of the article, although the authors did say they do not advocate banning the p-value (but why not embrace it? The p-value does exactly what it says it would do on the tin).

        With regards to thresholds, at the end of the day one has to make a decision. For example the large Hadron Collider shows the existence of a Higgs boson-like particle with a false positive rate of one in 3.5 million (search “rss time line of statistics”). Do we decide that the Higgs particle exists or not? Is it right to award Peter Higgs a Nobel Prize? We have to make a decision based on something, be it 0.05 or 1 in 3.5 million (or 5*10^-8 in GWAS research, as in the comments above).

        • >the large Hadron Collider shows the existence of a Higgs boson-like particle with a false positive rate of one in 3.5 million

          rewrite this as: the LHC shows that some chosen random number generator wouldn’t produce data like the LHC Higgs detector except one in 3.5 million times you ran the experiment.

          What “shows that it’s Higgs like” is the *theory* which suggests how a Higgs boson should act. The Stats just let you rule out some specific random process as the alternative thing that might have generated this data. But what makes us care about this specific random process? Why is it unique? Shouldn’t we also rule out many other random number generators? Such as ones that sometimes just happen to generate Higgs like data for reasons other than the existence of a Higgs particle, perhaps because God hates physicists and just likes to screw with their head? ;-)

          The stats are doing very little heavy lifting, it’s Peter Higgs giving a description of how the Higgs boson should behave that did all the work.

        • For example the large Hadron Collider shows the existence of a Higgs boson-like particle with a false positive rate of one in 3.5 million (search “rss time line of statistics”).

          That false positive rate (p-value) refers to how much the data deviated from that predicted by their model of the background for some reason. It does not “show the existence of a Higgs boson-like particle”.

          Do we decide that Higgs particle exists or not?

          Certainly not from that, you need to rule out all other reasons for such a deviation from the background model (other particles, modeled sources of detector noise, etc).

          When read carefully you see the authors of these LHC/LIGO type papers are headed down a dark path where they give less import to ruling out alternatives and testing theoretical predictions (eg, by putting these in an appendix) and more on looking for deviations from these background models. If this continues I have zero doubt those areas of research will be “in crises” soon.

          I’ve read some physicists say there has been no progress since the 1970s so perhaps it has already begun:

          “All of the theoretical work that’s been done since the 1970s has not produced a single successful prediction,” says Neil Turok, director of the Perimeter Institute for Theoretical Physics in Waterloo, Canada. “That’s a very shocking state of affairs.”

          https://www.nbcnews.com/mach/science/why-some-scientists-say-physics-has-gone-rails-ncna879346

        • Daniel, Anoneuoid:
          Thanks for your clarification. Well, I am not a physicist and just came across this a couple of days ago on RSS, so I thought it may be relevant to mention. I suppose if this false positive rate is high then the Higgs theory may be wrong (now it turns out he was correct). Having read another book by a physicist (“My Life as a Quant: Reflections on Physics and Finance”), I got the impression that physicists always make theories, some proved, some not.

          “Certainly not from that, you need to rule out all other reasons for such a deviation”
          This reminds me of the debate over whether smoking can cause lung cancer. I suppose similarly you have to rule out other reasons, for example some gene causing both smoking and lung cancer. Ronald A. Fisher apparently believed just this (search “Why the Father of Modern Statistics Didn’t Believe Smoking Caused Cancer”).

        • I suppose if this false positive rate is high then Higgs theory may be wrong

          No, it doesn’t mean that either. For example, when they first started collecting data the “false positive rate was high”, but no one took that to mean the Higgs boson didn’t exist.

        • Sure. I meant to say that if the data show a high false positive rate (i.e., are “insignificant”), then there is no or only weak evidence to confirm that something (the Higgs boson, an effect) exists, but we could not rule out that it does exist before seeing further evidence (due to the small sample size used, other factors, etc.). It is more difficult to prove that something does not exist than that it exists (for example, does God exist? It would be much easier if he could just show up!).

        • Sure. I meant to say that if the data show a high false positive rate (i.e., are “insignificant”), then there is no or only weak evidence to confirm that something (the Higgs boson, an effect) exists, but we could not rule out that it does exist before seeing further evidence (due to the small sample size used, other factors, etc.). It is more difficult to prove that something does not exist than that it exists (for example, does God exist? It would be much easier if he could just show up!).

          I know it seems like nitpicking, but that is because there are so many “nits” (i.e., misunderstandings) surrounding p-values and statistical significance. Once you pick away all the “nits” there is literally nothing left. You discover NHST has been constructed entirely from “nits”.

  40. Why not a compromise? Upon reflection, I rarely write something like “the effect of x is statistically significant”. I usually say, e.g.: x1 is statistically significant at the 95% confidence level, while x2 is significant at the 99% confidence level. I explicitly state the uncertainty, and readers can make up their own minds about whether they accept that the effect exists or not.

    • > The following 1,646 scientists, statisticians, doctors, librarians, plumbers, and YouTube stars from 117 countries and 2 planets are signatories for tar and feathering frequentists

      Agree, we almost never want to take ourselves too seriously.

      • Keith:

        I don’t think that criticizing statistical significance represents tarring and feathering of “frequentists.” I’m a frequentist! Our work on type M and type S errors is, in part, a frequentist demonstration of problems with statistical significance. One of the most important criticisms of statistical significance is that it adds noise to estimates, and we’ve done frequency analyses (that is, simulation studies) showing this. Studies don’t replicate: that’s a frequentist concern.

        • Andrew – it is a joke site by someone unknown.

          The tarring and feathering of “frequentists” seems mostly like a misdirection, but it is hard to tell.

          I like it; it’s funny (or I did after clicking around the site).

          p.s. maybe they should have waited until next Monday.

        • Keith:

          Sure, I can tell it’s a joke. My impression is that it’s supposed to be mocking the earnest statistics reformers, who in their misdirected enthusiasm, would like nothing more than to round up “frequentists” and send them to re-education camps. And my point in the above comment is that I think this misses the point, even as a joke. When people such as Amrhein, Greenland, McShane, and I criticize reasoning that is based on statistical significance, we’re not criticizing “frequentists.” Rather, we are frequentists in that our criticisms typically come from a frequentist direction: we’re saying that if you categorize your data based on statistical significance, you’ll on average make systematic errors in your inferences, you’ll make predictable mistakes. This sort of reasoning is frequentist analysis.

          Meanwhile there’s a lot of naivety by people who might call themselves “frequentists” but who might be better labeled as “conventionalists”: supporters of existing methods who naively think that a published regression coefficient is an unbiased estimate (and thus, for example, thinking that the estimated 42% effect on earnings of early childhood intervention, and the estimated 20% shift in women’s vote choice based on time of the month, are unbiased estimates of underlying or population average effects), or who naively think that p-values in general have a uniform distribution under the null hypothesis (not recognizing the complexity that arises with real-world null hypotheses that are typically full of nuisance parameters). That last bit about the non-uniformity of the p-value distribution doesn’t concern me so much, as I typically don’t care at all about the null hypothesis, but it’s relevant to the larger discussion because once we realize we can’t in general know the exact distribution of the p-value, it moves us to a more sophisticated frequentism, in the same way that our sophistication increases when we understand that published regression coefficients are nothing like unbiased estimates.
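          A minimal sketch of that last point (my own toy example): with a nuisance parameter in the null, here unequal variances and unequal sample sizes fed to a pooled-variance t-test, the p-value is no longer uniform under the null.

          ```python
          # Minimal sketch: simulate a null (equal means) with a nuisance parameter
          # the pooled-variance t-test ignores (unequal variances, unequal n), and
          # watch the p-value distribution drift away from uniform.
          import numpy as np
          from scipy.stats import ttest_ind

          rng = np.random.default_rng(7)
          pvals = []
          for _ in range(20_000):
              a = rng.normal(0, 1, size=10)   # small group, small variance
              b = rng.normal(0, 5, size=40)   # large group, large variance, same mean
              pvals.append(ttest_ind(a, b, equal_var=True).pvalue)

          pvals = np.array(pvals)
          print(f"P(p < 0.05) under the null = {np.mean(pvals < 0.05):.3f} "
                f"(a uniform p-value would give 0.050)")
          ```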

        • Thought I would re-visit your comment in light of Daniel’s later comment that seems to largely dismiss frequency considerations https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/#comment-1003958

          Your “we are frequentists in that our criticisms typically come from a frequentist direction … sort of reasoning is frequentist analysis” is something others may disagree with, but has it been made clearly enough to them?

          Yes, I know you have indicated it here zillions of times – maybe it isn’t registering, just like the dangers of smoking near “empty” gas cans (as “empty” suggests harmless).

          Does the Bayesian/frequentist contrast compellingly suggest that Bayesians should eschew any consideration of frequencies?
          (I was spared that misdirection by Don Rubin telling me really smart people don’t like being repeatedly wrong.)

          As a joke, an alternative to “conventionalists” could be “frequentest” (what’s most frequently done in statistics), but seriously, current terminology in statistics seems almost designed to cause confusion.

        • Keith, I wouldn’t ever dismiss frequency considerations when the stable frequency with which something happens is directly the subject of study.

          Like “how many white women over 50 commit suicide at each age each year” or “what is the histogram of number of frogs per acre across swampland in Florida” or whatnot.

          But very often we study things where establishing that there is a stable frequency distribution and what its shape is like isn’t even remotely possible. If you give a drug to 100 people and you measure a variety of severe side effects on some numerical 0-10 scale, asking what the frequency distribution of severe side effects greater than severity 7 is isn’t productive. You may get 1, 2, 3 events… to establish the shape of a distribution you will need 50 or 100 or 1000 if it’s a long-tailed distribution. Getting that many adverse events will take prescribing the drug to millions…

          In the absence of the millions of prescriptions, *pretending that a chosen model distribution is a frequency distribution* is harmful, it confuses things. You can generate p values saying that “severe side effects greater than 9 will only occur in at most 1 in 250000 patients” based on the frequency properties of some model, but this is a *recipe* for being *repeatedly wrong* rather than for avoiding being repeatedly wrong.

          I see frequency based calculations as appropriate when you have all of the following:

          1) A large data set: usually at least thousands of points
          2) Collected over multiple time points and multiple locations
          3) Every subset through time displays a similar histogram, every subset through space displays a similar histogram (can be checked with various p-value-based tests; see the sketch after this list)
          4) Events of interest occur on the order of 50-100 times at least
          5) Any parametric model you use passes multiple types of goodness of fit tests in the core and tails as needed.

          When you have all of that, then you have the bare requirements for treating your process as a random number generation process with a certain distribution, and at that point you can ask questions like “what would this random number generator do?”
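
          To make item 3 concrete, here is a minimal R sketch of a subset-stability check on simulated data (the gamma “measurements” and the monthly grouping are made up purely for illustration):

          # Do subsets through time show similar histograms? One of the bare
          # requirements listed above, checked with two-sample KS tests.
          set.seed(1)
          n_per_month <- 500
          months <- 12
          x <- rgamma(n_per_month * months, shape = 2, rate = 1)  # stand-in measurements
          month <- rep(seq_len(months), each = n_per_month)
          p_vals <- sapply(seq_len(months), function(m) {
            ks.test(x[month == m], x[month != m])$p.value  # each month vs the rest
          })
          round(p_vals, 3)
          # If several months give tiny p-values, the "stable frequency
          # distribution" assumption is in trouble before any model is fit.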

          But this is *not* the situation for a vast majority of cases where p values are used in psychology, economics, medicine, biology, etc.

          It *is* the case where p values are used in say manufacturing process control, designing telephone switching networks, or detecting fraudulent credit card transactions, and those are productive areas for frequentist statistics.

          My biggest complaint is that people have been trained to say “everything is automatically a stable random number generator” and “usually gaussian” and to step from that to doing frequency calculations, when in fact anyone who has looked into what it takes to test computational pseudo-random-number-generators can tell you that plenty of smart people have designed PRNGs that fail. In other words, in the real world, good randomness is hard to get.

        • Note that there’s one situation where you can dispense with some of that: when the data is generated by a computer PRNG. Why? Because in this case *the PRNG itself* has been tested heavily to ensure that its properties meet the requirements for being considered random under extreme circumstances (including billions of samples and often hundreds or thousands or unimaginable numbers of independent dimensions).

          So if you’re using a good PRNG you can generate 12 samples and talk about what distribution they came from, because they’re constructed to come from a particular distribution!

        • Daniel: A frequentist test doesn’t address the question whether something is a stable random generator, nor does it assume that anything is. It rather asks whether what was observed can be distinguished from what a stable random number generator would give.

        • > nor does it assume that anything is.

          Frequentist tests *always* assume that *something* is a quality random number generator.

          In the typical NHST test, it assumes the data could be from a strawman perfect PRNG with a given distribution, and asks whether that RNG could have generated the data. That’s why non-significant results are more interesting than significant ones. When the result is non-significant it suggests (but doesn’t prove) that a stupid simple random noise generator could have made this data…

          A bootstrap confidence interval for the mean treatment effect of a drug directly assumes the drug treatment data is output of a random number generator whose distribution is well characterized by repeat sampling with replacement from the data set.

          A rank test assumes the data is the output of a stable random number generator with unknown distribution, transforms that distribution to near uniform assuming the ECDF is a good approximation, and then performs tests on the resulting uniform.

          If my data collection process is just observing the clock, all of them fail the very first assumption: that there’s anything like a random sequence involved. The bootstrap confidence interval for the output of my clock-observing process will show that in the future, 20 samples from my clock are “guaranteed” to show an average value between the start and end of the original observation window with essentially probability 1… And yet the very first sample of 20 I take in the second experiment will, with probability 1, fall outside that interval.
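
          Here is a toy R version of the clock example (the “clock readings” are just the integers 1 to 40, purely for illustration):

          # Deterministic "data": successive clock readings, nothing random.
          first_batch  <- 1:20
          second_batch <- 21:40
          # A bootstrap treats the first batch as i.i.d. draws anyway.
          set.seed(2)
          boot_means <- replicate(10000, mean(sample(first_batch, replace = TRUE)))
          quantile(boot_means, c(0.025, 0.975))  # a "95% CI", roughly (8, 13)
          mean(second_batch)                     # 30.5, outside the interval for certain
          # The "coverage guarantee" was a statement about a fictional resampling
          # process, not about the clock.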

          If you set up a NULL hypothesis just to shoot it down, it tells you virtually nothing, except: a stupid RNG wouldn’t have made this data.

          If you set up a frequentist “real” hypothesis to attempt to get confidence bounds on means or standard deviations or ranks or things, it tells you “IF the world behaves as a RNG of the type I assumed, THEN the frequency with which future experiments will show mean values of such and such is F”. Unfortunately it is rarely even conceivable to test the conditional part after the IF for many, many problems. The default should be to *reject* that hypothesis until it’s demonstrated to be a good approximation to reality, but in fact in practice the default is to *accept it unconditionally*.

        • For doing the computations we need these assumptions, true. This doesn’t imply by any means that we have to believe that this is true in reality.

        • Of course the implication of not rejecting the H0 doesn’t mean the H0 is true. (That’s what people are banging on about all the time, aren’t they?) It only means that the data won’t serve you for arguing anything else.

        • I’m not by the way defending the misuses of test/p-values that you or the paper under discussion criticise. I’m just saying it’s possible to use them in a manner that makes sense, particularly if we say goodbye to the myth that the business of any model could be to be “true”.

        • > This doesn’t imply by any meaqns that we have to believe that this is true in reality.

          I agree with you, these tests tell us *true facts* about a *purely fictional* RNG simulator. But I think this point is subtle enough that even PhD statisticians often don’t remember it. People routinely talk as if their “coverage guarantees” for their favorite frequentist methods are true *in the world* instead of *in the theoretical computational model*. It’s a psychological problem that is closely related to the problem of misinterpreting what p values mean.

          an example discussion: https://radfordneal.wordpress.com/2009/03/07/does-coverage-matter/

          In other words, people routinely act and speak as if they believe that their typical small N psych or medical study *really is a high quality random number generator and they really do know the distribution exactly*. Statements we often quote about “misinterpretation of p values” are actually statements that would be correct if the world were a certain random number generator.

        • Daniel:

          Our disagreements I believe are mostly meta-statistical.

          To me “random number generation process” is just one possible re-representation of a probability model, and hence a Bayesian analysis is implemented (if not best thought of) jointly as a random number generation process for the unknowns (parameters) and a random number generation process for the knowns (data) given the parameters, which is then conditioned on the actual in-hand data (e.g. two-stage or ABC).

          You may think differently, but I agree that “these tests tell us *true facts* about a *purely fictional* RNG simulator” but also Bayesian outputs tell us *true facts* about a *purely fictional* _joint_ RNG simulator. That people confuse the representation with what it tries to represent is a problem everywhere in science and maybe more so in statistics.

          But overall, I am concerned, when people use the _joint_ RNG simulator (Bayes), about how often they will be led away from what they tried to represent – becoming more rather than less wrong about the reality beyond direct access.

        • Keith, I think I mostly disagree with that characterization of Bayesian models as joint random number generator processes.

          Using a pseudo-random-number generator to explore the posterior is a totally legitimate computational method to describe the calculation. Pushing that further to say that you believe that certain frequencies of occurrence will actually occur in the world is a step you can *optionally* take, but doing so should be backed up by enough data to confirm the existence of a stable distribution of outcomes in the world.

          Most of the time, it would be a mistake to interpret something like “generate a set of parameters and then using those parameters generate a large data set using the likelihood” as saying that you think future data will have a histogram very similar to the histogram of your fake data.

          My take on this is that the point of Bayes is to tell you what is compatible with your idealized model, and what isn’t, but there is no requirement that histograms match, this is an optional feature of a Bayesian calculation.

          Basically if you have something like data ~ Normal(m,1) and m has a prior to be in some range like -.5 to 1, you should be unhappy with your model when your data includes a data point 33 or 100. Your model is strongly ruling out those kinds of values, and so your model is wrong.

          But if you have 200 data points and their histogram looks more like a mixture .8 * Normal(0,.8) + .2 * Normal(.25,.5) this is a very easily detectable bad fit to frequency, but all the values you observe are individually values that would not be surprising to observe from some kind of Normal(m,1) model for some values of m around say .1 or something. Therefore, your model passes the basic requirements that it doesn’t rule out actually observed data.

          Another option is to decide that the utility function describing what makes your model “good” for your purposes is telling you “I need to do accurate inference on the frequency histogram”. In that case, something like a chi-squared goodness-of-fit test on the histogram of data vs the histogram of fake data would tell you “hey, I’m not fitting the histogram”.
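
          Here is a minimal R sketch of both checks on simulated data (the mixture above plays the role of the unknown truth; the bin edges are arbitrary):

          set.seed(3)
          n <- 200
          y <- ifelse(runif(n) < 0.8, rnorm(n, 0, 0.8), rnorm(n, 0.25, 0.5))
          m_hat <- mean(y)
          # Check 1: does Normal(m,1) essentially rule out any individual point?
          min(dnorm(y, m_hat, 1) / dnorm(m_hat, m_hat, 1))
          # typically around 0.01 or larger, nothing like the 1e-82 Cauchy conflict
          # Check 2 (only if fitting the histogram is part of the model's job):
          breaks <- c(-Inf, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, Inf)
          obs <- table(cut(y, breaks))
          p_bins <- diff(pnorm(breaks, m_hat, 1))
          chisq.test(obs, p = p_bins)  # usually a small p-value: the histogram is misfit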

          The point that I’m trying to make here is *whether to fit the histogram or not* is an optional decision that you should make based on the purposes you will use your model for.

          If, like in my orange juice example, you are going to use your model to predict the total quantity of orange juice on a pallet, all you need to do is get the average quantity per bottle close enough, and then multiply that number by N. Fitting the histogram of individual bottle volume is completely irrelevant to the task your model will be used for and therefore is not in any way required.

          Now, if you do a frequentist analysis to try to determine how often you are shipping bottles that are more than half empty… you will absolutely need to match the histograms, and you can do that by doing a Bayesian analysis on a model space where you intentionally have a flexible likelihood model that can represent all the various types of histogram shapes you expect to see… Your end result will be a distribution over the parameters describing a frequency histogram. In this case, checking your model with p values is absolutely justified, precisely because the *utility* you used to select your model space *cares about* the p value.

        • Keith, Andrew, now that we’re really deep in the comments section, things get interesting right? Hope someone is still reading.

          Thinking about what I just wrote, I think it brings back some ideas Keith and I discussed here, maybe 10 years ago… anyway the point is that for many Bayesian analyses of the type I’m discussing the relevant question for “goodness of fit” is something like p(Data|model)/pmax(Data|model), a likelihood ratio. If this ratio is extremely small for some data values regardless of which parameter vector sampled from the posterior you use, then you have a serious conflict between what your model thinks is reasonable and what actually occurs. This is an important Bayesian indicator that your model has problems.

          For example:

          > sapply(rcauchy(100),function(x) { return((dnorm(x)/dnorm(0)))})
          [1] 7.891992e-01 9.512069e-01 9.333288e-01 2.065716e-01 1.725975e-01
          [6] 9.611109e-01 5.994569e-01 9.993181e-01 8.209975e-01 8.757838e-01
          [11] 9.585353e-01 4.221307e-02 7.793029e-01 2.088760e-03 9.994152e-01
          [16] 2.910476e-12 1.616417e-09 5.855129e-71 8.779560e-01 6.747276e-02
          [21] 1.528774e-06 9.063797e-01 1.623805e-09 6.533939e-01 1.750493e-04
          [26] 4.269431e-01 4.204763e-61 4.128359e-02 8.005868e-01 8.590735e-01
          [31] 9.450481e-01 9.961029e-01 9.237048e-01 9.997488e-01 5.928207e-07
          [36] 3.250178e-06 4.363710e-01 2.424559e-03 8.800808e-01 8.414203e-01
          [41] 9.809371e-01 9.899293e-01 6.393970e-01 5.981492e-01 9.886327e-01
          [46] 6.716481e-01 9.907263e-01 2.499136e-01 9.938484e-01 6.005361e-01
          [51] 9.999992e-01 3.389235e-01 7.304050e-01 6.588093e-01 2.443490e-23
          [56] 3.626158e-01 2.188212e-01 7.216229e-06 9.926447e-01 9.862284e-07
          [61] 3.723818e-01 6.528378e-02 8.573225e-01 2.648962e-01 7.208475e-01
          [66] 3.940839e-03 6.500636e-02 9.963418e-01 9.855869e-01 8.444145e-05
          [71] 3.587700e-01 3.791043e-01 2.205225e-04 9.826397e-01 3.822761e-02
          [76] 7.936852e-05 9.882986e-01 1.667596e-01 9.027819e-01 9.759895e-01
          [81] 9.906979e-01 3.114632e-01 5.291793e-10 7.930160e-01 5.560888e-01
          [86] 8.715981e-01 9.080356e-01 7.417094e-01 9.906985e-01 1.355451e-06
          [91] 4.635497e-03 1.384732e-16 2.957543e-82 2.325108e-10 4.049838e-69
          [96] 9.997441e-01 9.751823e-01 1.696421e-03 1.661349e-01 4.913744e-01

          shows that this data generated by a Cauchy random number generator winds up containing points that are 10^82 times less likely than the most likely value under the standard normal distribution. So a normal model of this data is terrible, as we have actual data points that are completely ruled out by the normal model.

          Incorporating this kind of check into Bayesian workflow is probably a really important idea, and it’s compatible with models designed to fit frequencies, as well as models where frequencies are not necessarily the goal.

        • In summary, the fact that it is unlikely that a random number generator of the type specified in the model would have generated the data is an important indicator that your model has problems.

          However, the fact that it is unlikely that a random number generator of the type specified in the null hypothesis test would have generated the data is not very interesting (and non-significant results are more interesting than significant).

          If the null hypothesis is “the distribution is normal (all the higher moments are zero)”, isn’t the p-value the important indicator you mentioned above?

        • Carlos: Evidently I’m not communicating the idea well…

          It’s important to distinguish between how likely your model thinks something is vs how often that thing will happen in repetitions.

          Suppose the world, unbeknownst to you, generates numbers that are uniformly distributed between -1 and 1. You have 8 of these numbers to work with, and you have a model which specifies a likelihood as normal(0,1). The frequency is completely wrong. Given enough such numbers, it would be easy to find a frequentist goodness-of-fit test (a Kolmogorov-Smirnov test, say) that would completely reject this normal model. But the values that are actually generated between -1 and 1 are all values that have reasonably high probability in the normal model. None of them contradict the normal model’s assumption that “I give higher weight to stuff in the vicinity of -1 to 1 or -2 to 2 than I do to stuff in places like 38 or -100”
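
          A minimal R sketch of that contrast (with a larger made-up sample so the rejection is clear):

          set.seed(4)
          x <- runif(200, -1, 1)     # the world, unbeknownst to us
          ks.test(x, "pnorm", 0, 1)  # goodness-of-fit test rejects Normal(0,1)
          range(dnorm(x) / dnorm(0)) # yet every point has density ratio >= exp(-1/2)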

          Bayes measures a kind of plausibility or credence or willingness to “give weight” to something; it does not measure frequency unless you make the choice to do so.

          So, if your model is not surprised by uniform(-1,1) values, this can, if you want, be the end of the story. If you want, you can take it further and elaborate a more specific prediction, maybe like a uniform(-a,a) or whatever based on some additional information… but *the shape of the distribution expresses information*, and often the actual real-world frequency distribution *is not information that you have, nor is it information you are likely to actually acquire*

          Fortunately, the likelihood ratio I mention DOES NOT rely on frequency in repeated sampling to flag a problem. It relies on whether or not your actual data conflicts with the weight you were willing to give to outcomes in those locations.

          For example, if you test the normal model against the uniform data it’s easy to say “under the normal model, the frequency with which data should exceed 2 in absolute value is far higher than the frequency with which you actually observed such values (namely zero)”…

          So what? All the values you did observe were in the high-probability region of your likelihood, therefore they all were compatible with what you expected.

        • In summary, the fact that it is unlikely that a random number generator of the type specified in the model would have generated the data is an important indicator that your model has problems.

          Rewrite this as “the fact that the model gives near zero credence to the idea that the world will produce numbers in the range that the world actually does produce is an important indicator that your model has problems.”

          It just so happens that you can easily construct a random number generator such that the credence you give to certain outcomes is *numerically exactly equal* to the frequency of the RNG’s outcomes in that region, but that is irrelevant to the credence question because your credence measurement is not a measure of *how often you think the world will produce those numbers*, it’s just a measure of the credence you give to that prediction.

          Think of it this way: p(Data | Model) is a measure of “how well my model thinks it’s doing at predicting Data”. As that “goodness” goes to zero the model hates its predictions more. If you find this number very close to zero, your model is unhappy with its performance. Due to the lack of absolute scale, the only way to measure consistently is via a dimensionless ratio p(Data|Model)/pmax(Data|Model), and if that is very small, your model is complaining.

          A p-value using a tail area is far less interesting. For example, suppose you have a model that says data values should be near integers plus or minus small errors…

          1/3 * (Normal(0,.1) + Normal(1,.1) + Normal(2,.1)) is your likelihood… and you see a data value at 1.5. The tail area is huge, maybe .33 on one side and .66 on the other… but the model rules out this value; it has near-zero probability of being between 1.3 and 1.7

          Now suppose instead you see values 0,1,2,0,1,2,0,1,2,0,1,2

          it would be easy to construct a test statistic that tells you “there is no way you’d get data always in ascending order and exactly equal to integer values” in other words you could reject the RNG 1/3*(Normal(0,.1) + Normal(1,.1) + Normal(2,.1)) with extremely small p value p = 0.00000002 or something like that.

          Fortunately the Bayesian model isn’t saying “I think the world is like an RNG with this frequency distribution”; instead it’s saying “I think the values are highly likely to be close to integers, and very unlikely to be close to halfway points between integers.” So the fact that the data always arrives in ascending numerical order and reads out as exactly integer values is not cause to reject the Bayesian model, whereas it is overwhelming evidence for rejecting the RNG model.
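
          For concreteness, here is that mixture in R, showing how the tail areas and the density ratio disagree about a point at 1.5:

          dmix <- function(x) (dnorm(x, 0, 0.1) + dnorm(x, 1, 0.1) + dnorm(x, 2, 0.1)) / 3
          pmix <- function(q) (pnorm(q, 0, 0.1) + pnorm(q, 1, 0.1) + pnorm(q, 2, 0.1)) / 3
          c(pmix(1.5), 1 - pmix(1.5))  # tail areas: about 0.667 and 0.333, unremarkable
          dmix(1.5) / dmix(0)          # about 7e-6: the model gives 1.5 essentially no weight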

        • So if I understand correctly you’re talking about some kind of model checking that will tell you that your Normal(0,1) model has problems if your data is one single 42 measurement, but will be cool with a dataset of a thousand measurements which are all equal to 1. Ok, it may have its uses.

          Minor point: the “likelihood ratio” that you propose seems to involve densities at different places. Doesn’t it bother you that it changes with reparametrizations?

        • YES that’s exactly it.

          RE reparameterizations: No, I don’t think it bothers me, because it’s observed data that we’re interested in here. There is no reparameterization of data; data comes from measurement machines with a given fixed dimension and units. If you alter the machine, you’ll need to alter the likelihood, but you’re not free to alter the likelihood without altering the machine, so to speak. That’s not true for the parameters the model uses.

        • I imagine that you normalize the probability (density) of getting a particular measurement using the maximum from the distribution to handle very wide distributions where the density may be low everywhere. But this solution has its own issues: if the model has a spike, say 50% probability of being (in an epsilon environment of) zero and 50% probability of being uniformly distributed between -1 and 1, your ratio will be very small everywhere except at zero despite half of the mass being elsewhere.

          It seems you would like to check if the measurement is in the HDR of the model distribution, and the more straightforward way to do so seems to be to calculate a p-value (using the probability density as the statistic). I can’t see a satisfactory solution not involving tail areas.

        • Daniel: “Basically if you have something like data ~ Normal(m,1) and m has a prior to be in some range like -.5 to 1, you should be unhappy with your model when your data includes a data point 33 or 100. Your model is strongly ruling out those kinds of values, and so your model is wrong.

          But if you have 200 data points and their histogram looks more like a mixture .8 * Normal(0,.8) + .2 * Normal(.25,.5) this is a very easily detectable bad fit to frequency, but all the values you observe are individually values that would not be surprising to observe from some kind of Normal(m,1) model for some values of m around say .1 or something.”

          I don’t get this. The model includes probabilities not only for individual values but also for bigger sets of them, so if your model predicts that there is no dip of a certain depth in the histogram, say, but in fact there is, why is this a problem for the frequentist but not for the Bayesian? You may say the Bayesian may not be interested in this, but neither may the frequentist – when just estimating the location of the overall thing, the histogram doesn’t need to be fitted perfectly; asymptotic normality kicks in pretty early in most cases so it doesn’t really matter much if the underlying distribution is one normal, a mixture of two of them or a uniform… a Cauchy would be bad, but that’s the same for the Bayesian.

          You seem to apply quite proudly and openly higher standards to frequentist than to Bayesian modelling. Actually the Bayesian model has more “elements” so one could argue the other way round, too.

        • Christian: it comes down to whether the Bayesian model is intended to represent frequencies or not.

          If you in fact are specifying a likelihood based on frequency properties of the data, then it’s appropriate to hold the Bayesian and Frequentist models to the same standard, since they’re modeling the same thing: the behavior under repetition.

          If on the other hand, you are modeling purely probability (as a measure of plausibility or credence) then the failure to fit the histogram in a frequency repetition sense is not something to be concerned about. However this flexibility to be concerned about two different things, is not present in Frequentist models because they don’t have the concept of plausibility to work with.

          @Carlos: if you have a spike normal(0,.0001) mixed 50% with a slabish thing normal(0,100) and you get a data point at, say, 4, it really is the case that your model found this value to be .001993 / 1994.713 ≈ 1e-6 times as likely as 0 was.

          the fact that there are many other data points that didn’t happen that make up the bulk of the normal(0,100) isn’t relevant, since those values didn’t happen.
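
          Those numbers are easy to check in R:

          d_mix <- function(x) 0.5 * dnorm(x, 0, 1e-4) + 0.5 * dnorm(x, 0, 100)
          d_mix(4)             # ~0.001993
          d_mix(0)             # ~1994.713, dominated by the spike
          d_mix(4) / d_mix(0)  # ~1e-6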

        • >if your models predicts that there is no dip of a certain depth in the histogram, say, but in fact there is, why is this a problem for the frequentist but not for the Bayesian?

          Well, it could be a problem for the Bayesian depending on their utility that defines what “problem” means… Model choice is another instance of Bayesian Decision Theory. A Bayesian might choose a model that has such a “problem” because the consequences of the problem have no real utility downside. Or may reject the model entirely even though it fits well according to many Frequency based p values because this little wiggle is critical in terms of utility (perhaps you want to detect when something weird happens, and the world never lets you have x=1/2 or 1/3 or 2/3 precisely but your model says it’s fine, it has no dropout there).

          But Frequentism *defines* its models by frequency, and it relies on frequency to calculate with. If I can show that a dataset is inconsistent with a model using my test DansSpecialTest at p = 0.00001, what should a Frequentist think of the model? It can be rejected on the basis that the data is nearly impossible from this model, right?

          Often frequency-based models aren’t working directly with the frequency of occurrence of individual measurements, but with, say, the sampling distribution of some statistic of large batches (the mean or median or standard deviation), which converges to something like normal regardless. Almost the only way frequentist statistics work in many cases is that they’re really working with just one sample from the sampling distribution of the average, and we can calculate the shape of the sampling distribution of the average even if we barely know the shape of the data distribution at all…

          The assumptions required when you’re dealing with a strong asymptotic result are a lot less than when working with something like screening genes based on 3 measurements of the expression intensity of each gene (like an Affymetrix chip or something).
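
          A quick R sketch of that point, using a deliberately skewed (exponential) data distribution as a stand-in:

          set.seed(5)
          means <- replicate(10000, mean(rexp(50)))  # batches of n = 50
          c(mean(means), sd(means))     # close to 1 and 1/sqrt(50) ~ 0.14
          qqnorm(means); qqline(means)  # nearly straight: the asymptotics kick in
          # With 3 measurements per gene, as in the chip example, they do not.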

        • Daniel, do we agree that in that case, upon seeing a non-zero value (a likely occurrence if the model is correct), you will conclude that you have a serious conflict between what your model thinks is reasonable and what actually occurs?

        • You will conclude you have a conflict; how serious it is you will define separately. But if you put this spike here, you must think you know something very specific about the precision of your prediction. So importantly, if you can adjust your parameters to make ALL of your data fall in the spike, you will STRONGLY prefer those values of the parameters.

        • I’m lost, I thought that this probability distribution was in the space of measurements and not over parameters that I could change. Or are you suggesting that I modify the machine to always get similar outputs? In a particle physics experiment, for example, a detector producing two kinds of outcomes (spike when there is an interesting event, background noise otherwise) is a feature and not a bug to be fixed.

        • Sorry, I’m saying that when you fit the model, Stan will sample heavily in any region where the parameters make the data fall in the spike. You can’t move the data, but you can make the model flexible using parameters and these parameters may adjust the model until you place the spike right on the data…

        • Daniel, to keep it simple: let’s forget about Stan, sampling, fitting, and parameters. Let’s say that we have measurements of angles of dispersion.

          The model predicts the angle (in milliradians) to be the mixture that you wrote above: the beam is narrow when there is no interaction (50% probability) but wide when there is interaction (50% probability).

          Let’s say that the model does indeed correspond well to the real data generating process and we get the following data after four measurements: -0.00008, 4, 42, 0.00017.

          Your ratio will give very low results for the second and third data points. Does this indicate a serious conflict between what your model thinks is reasonable and what actually occurs?

        • Carlos, I think Andrew is going to post a whole post on this topic of finding model conflict, so can we continue the discussion there? I like your question, because I think it highlights the difference between usage when we really are modeling frequency distributions, vs when we’re modeling pure probability. The answers may be different depending on the model purpose and I find that intriguing but I don’t exactly have an answer. When we continue I’d like to also discuss the following issue:

          Y = f(x,a,b,c) + Noise

          Noise ~ our spike and slab.

          there exists a range of a,b,c values that make f(x,a,b,c) virtually equal to Y data so Noise is all in the spike… does this indicate a “severe conflict” with our model of Noise because it’s easy to show that the noise would never happen if sampled from the spike and slab?

          Please let’s come back to this when it’s not buried deep in the comments on a different topic!

          Or if you like, I’ll post it to the Stan discourse, and then we can discuss there, and maybe Andrew can link to that from his blog post?

        • Daniel: “But Frequentism *defines* its models by frequency, and it relies on frequency to calculate with. If I can show that a dataset is inconsistent with a model using my test DansSpecialTest, at the p = 0.00001 what should a Frequentist think of the model? It can be rejected on the basis that the data is nearly impossible from this model right?”

          I don’t agree, but in this respect I’m probably different from many frequentists. I think we have to acknowledge that all models are wrong and that this can be shown. Any continuous model could be rejected by a test that rejects if only rational numbers were observed; this doesn’t mean we shouldn’t ever use continuous models. Violations of model assumptions are inevitable, and the only ones that are a real problem are those that affect the later inference we want to make based on the model; e.g. if we want to say things about mean and variance, neither continuity nor most density gaps somewhere are a problem (but there are issues that are problematic, such as heavy tails – as for the Bayesians). So my philosophy of frequentist modelling has something in common with your Bayesian one. We need to know what we care about and on what this depends; other issues with assumptions are mostly harmless.

          And I agree, the frequentist cannot do much with one or three observations. I’m actually fine with Bayes if the prior is convincingly argued, particularly in situations in which it is clear how much it helps.

          OK, I should probably also stop discussing here… too hard to find stuff in this thread.

        • I have a colleague who calls himself a non-denominational statistician. That sounds like a good solution to the “labeling” problem to me.

  41. Anon, this is my take so far: As with many I encounter (e.g., Trafimow) I think the strawman argument has blinded you to the purpose and function of statistical models as explained (for example) by George Box.

    […]

    https://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625?needAccess=true

    After reading that paper, I see you do not seem to understand the “strawman argument”. You have a short section about “P-Values Do Not Force Users to Focus on Null Hypotheses—But Nullistic Jargon Does”, which is true. The problem is not with the p-value, it is with the null hypothesis (I prefer to refer to the “model” rather than hypothesis).

    But in this very paper you use an example of testing an irrelevant (strawman) null model about adverse event rates after “ibuprofen” vs “ibuprofen + acetaminophen”: “testing the null hypothesis of no association (rate ratio of 1, a 0% difference)”.[1] This is a prototypical example of testing a strawman model…

    Such null models are always wrong, and this may be for many, many reasons, most of which are not of any theoretical or practical interest. Eg, in the Walsh 2018 paper you get your ibuprofen results from,[2] the caregivers were not blinded. So perhaps something about the warning label made them more or less likely to report an adverse event. They even suggest this in the paper, but do not do anything with it:

    It is also possible that infants receiving a prescription for ibuprofen are judged sicker than those who do not, thereby increasing their risk of subsequent medical attendance. It is unclear that this would increase the incidence rate of the outcomes we specifically sought.

    In the end, whether or not the null hypothesis is rejected (via any means you want to use) we still cannot draw a valid conclusion about whether or not ibuprofen should be given. I simply do not care if the p-value, s-value, likelihood ratio, confidence interval, posterior distribution, or whatever is consistent with “no effect”.

    A non-strawman, relevant null model worth testing would be derived from a theory of interest (eg, what effects on the rates of certain adverse events would we expect based on the mechanism of action of ibuprofen). So I think it is you who is blinded to the bigger problem and is focusing on the red herrings in the procedure. It isn’t that the rest of the stuff you say about misinterpreted p-values is wrong, I just don’t care when there is a much bigger problem staring us in the face.

    [1] You have also forgotten that you are testing the entire null model, not just “no association”, which is only one parameter of the model. See my discussions with Ram here.

    [2] Walsh, P., Rothenberg, S. J., & Bang, H. (2018). Safety of ibuprofen in infants younger than six months: A retrospective cohort study. PLOS ONE, 13(6), e0199493. doi:10.1371/journal.pone.0199493

    • Anoneuoid: Thanks for more elaborate comments. But I still think you have not read my paper with due care, as you are missing my major points entirely and misrepresenting my example discussion and context. And you seem to have a hostile agenda, downplaying extensive agreement and exaggerating conflict, veering towards trolling (amplified by writing anonymously, which frees you from taking full responsibility for your errors). Consider that you say
      “In the end, whether or not the null hypothesis is rejected (via any means you want to use) we still cannot draw a valid conclusion about whether or not ibuprofen should be given. I simply do not care if the p-value, s-value, likelihood ratio, confidence interval, posterior distribution, or whatever is consistent with “no effect””
      -Your first sentence is just restating what I actually said and your second sentence is a statement about your cares unrelated to what I actually said. The Walsh example was requested by the editor to illustrate concepts numerically for the typical user who will be forced to supply a P for the null and interpret it conditionally as if the rest of the model is correct – at least if they expect to get their paper into most any medical journal today. Note that I did not draw any conclusion about ibuprofen but instead said no conclusion could be drawn, and made it an ongoing point in the paper that the P-value tests the whole model from which it is computed. At the same time I wanted to show how even with the conventional conditional interpretation (i.e., assuming the embedding model is correct), the authors could not have possibly justified their conclusions or indeed made any qualitative inference at all. Here’s what I said on p. 111:
      “an accurate report would have said ‘Our study lacked sufficient information to reach any useful inference about adverse renal events comparing ibuprofen to acetaminophen alone; much more data would be needed to address ibuprofen safety concerns’—albeit under current journal publication criteria such an honest conclusion would make publication difficult.”
      Do you really think that’s no great improvement over Walsh et al.’s claim that they observed no difference??

      Like all the critics of our Nature commentary you simply overlook the fact that we are fighting the practice of making claims about the truth or falsity of hypotheses based only on a single study. To make such qualitative claims based on one study is a form of incompetence that nonetheless is encouraged and enshrined in the promotions of those critics. All we should demand from a study report beyond extensive details of motivation, design, conduct, and resulting data (preferably with the database and analysis code available online) are some evidence summaries that can be interpreted and combined with summaries from other sources, in the example not only other studies like this one but also lab studies of the renal-cell toxicity of ibuprofen. Instead the authors are forced into making unsupportable inferential statements based on whether the null P crosses a threshold or the CI contains the null.

      You seem to indicate you don’t disagree with any of these specific points. So my impression is that you are simply among those who want some sort of radical change but haven’t offered any politically viable alternative. Again, you are starting to look to me like a classic anonymous troll: You seem to reflexively attack anything not radical enough, anything pragmatically geared toward accommodating the reality that P-value misuse is not about to stop. That sad reality exists because misuse is still ardently defended and promoted by the most powerful figures in medicine today like Ioannidis and all those whose entire research output can be called into question if “statistical significance” is discredited. Their sociopolitical blockade to reform is the core problem, far more important than the abstract notions you seem to be on about.

      As an endnote: We don’t yet know all the harms that might be caused by reform proposals, especially radical proposals for change (by which I mean those like forcing replacement of all frequentist methods by Bayesian methods). Fisher and Neyman utterly failed to envision the harms their innovations would bring, and it seems foolish to repeat their error of not anticipating, monitoring for, and protesting misuse of their methods.

      • Sander: Anoneuoid isn’t trolling nor verging on it; they have strongly held views, just like you. (And they’re pseudonymous, not anonymous — they don’t go by “Anonymous” specifically to distinguish themselves from those who do and they have established a distinct persona as a commenter here under that name.)

        • I disagree. “Strongly held views” does not excuse the behavior I’m criticizing. See my replies to their latest reply for why I think that. One can act like a troll even if one imagines and presents themselves otherwise – I’ve found that’s not unusual to see on these blogs when the commentator isn’t using their own name.

          Also, I don’t know who they are so they are anonymous to me even if they are not Anonymous. Use of another persona seems to disinhibit lazy remarks and destructive behavior (as I said before, I think because it disengages the person from taking full responsibility for what they write). If they have multiple personas out here that they can’t integrate, maybe they need professional help.

          Not that I’m always against destructive criticism, but I see that as equivalent to engaging in warfare and so not to be taken lightly. And even then it ought to be intellectually honest. Now I know those are my own values speaking and can see they are not universally shared in practice despite lip-service – there are some who make an art of dishonest representations of themselves and opponents (as in total war with no regard for even the Geneva conventions, and in ordinary everyday politics). Again I see Anoneuoid as veering close to that in their misrepresentation of my writing alongside NHST. Either that or they are being incredibly lazy intellectually in failing to sort out the distinction carefully before typing back. Either way it’s their choice to do that, and my choice to call them on it.

      • Consider that you say
        “In the end, whether or not the null hypothesis is rejected (via any means you want to use) we still cannot draw a valid conclusion about whether or not ibuprofen should be given. I simply do not care if the p-value, s-value, likelihood ratio, confidence interval, posterior distribution, or whatever is consistent with “no effect””
        -Your first sentence is just restating what I actually said

        We came to the same conclusion but (it seems to me) for totally different reasons.

        The Walsh example was requested by the editor to illustrate concepts numerically for the typical user who will be forced to supply a P for the null and interpret it conditionally as if the rest of the model is correct – at least if they expect to get their paper most any medical journal today.

        The editor requested a prototypical example of using your method to test a strawman null hypothesis. I am well aware (and honestly terrified in the case of medicine) that this is what they want to put in their journals, but that does not make it any less wrong. I totally understand your position, they want you to give them new ways to do the same wrong thing.

        Here’s what I said on p. 111:
        “an accurate report would have said ‘Our study lacked sufficient information to reach any useful inference about adverse renal events comparing ibuprofen to acetaminophen alone; much more data would be needed to address ibuprofen safety concerns’—albeit under current journal publication criteria such an honest conclusion would make publication difficult.”
        Do you really think that’s no great improvement over Walsh et al.’s claim that they observed no difference??

        It depends. If when you say “much more data would be needed” you mean the same type of data, then no. The “much more data” required is data estimating how much various things account for the observed difference (set limits on the influence of caregiver bias, etc). Even better (and much more practical to collect) is a completely different type of data corresponding to whatever predictions could be derived from current models of how ibuprofen works.

        You seem to indicate you don’t disagree with any of these specific points. So my impression is that you are simply among those who want some sort of radical change but haven’t offered any politically viable alternative.

        I just think these are like minor blemishes on a massive tumor, so focusing on them serves as a distraction (red herring). As far as I am concerned, fixing all the problems you mention will not help as long as people are testing strawman models. It just gets them bogged down in technicalities.

        Testing your own model while using no statistics at all is far superior (in terms of usable info generated) to the most perfect analysis of how well a strawman fits the data. And in my experience, once you test your own model you naturally gravitate towards legitimate use/interpretation of statistics. One reason is the user is naturally biased to not find significance.

        The “radical change” I want is just to go back to doing things how they were done pre-NHST (depends on the field) where they compare their own model of what generated the data to the observations rather than some default strawman model.

        I guess my main point is that: NHST is bizarro science, there is no “right” way to do the opposite of science and have it turn out ok. Inspecting the bestiary of fallacies people use to justify their actions is intellectually interesting but not helpful from what I have seen. One fallacy spawns another and the primal fallacy of NHST is testing a strawman model.

        • I now see another way you have misread or else misrepresent what I have been writing: Nowhere do I support NHST, in fact I am another critic of it. I keep repeating that better use of P-values requires getting them for more than one hypothesis, not just the null (my attacks on nullism) and that they should not be interpreted in terms of cutoffs or significance of anything (other than significant conflict between model and data when very small, in the ordinary English sense of “significant”). Yet you make a distorted representation in which my paper is treated as if it is some fix-up for NHST. And you ignore completely everything I wrote here about wrong models (Box).

          Corey says you are not trolling but it’s a good step towards that when you present criticism of NHST conflated with criticism of what I’ve written (which is another angle of attack on NHST), ignore most of what I actually said, and take the worst interpretation of sentences that seem ambiguous (especially when taken out of context).

          Anyway, P-values long predate anything resembling the bizarro science of NHST so if you want to go back to how things were done before NHST you have to tackle P-values separately – working backwards, before Fisher’s nullism and Neyman’s dichotomania there are the P-values (“significance levels”) from Pearson’s tests of fit, which was a culmination of developments stat historians trace back to the early 18th century. That tradition of use pretty much continues into the LHC Higgs experiments. What evidence in this history leads you to claim (as you seem to) that testing hypotheses with noisy data and no statistics is superior to that usage? Where is your alternative laid out in a way that typical researchers like Walsh (who were quite innocent in their mistake – it was their statistician who sent me their paper) can follow and use and expect to get published, at least in some journal more open than JAMA or NEJM?

        • Anyway, P-values long predate anything resembling the bizarro science of NHST so if you want to go back to how things were done before NHST you have to tackle P-values separately

          Yes, exactly! The problem is not p-values. In the past p-values or similar were used to determine how well the observations fit with the predictions of the scientist’s model. Perhaps they are not ideal but they are ok enough. P-values have also been misused for BS for a long time to do NHST, like the guy who disproved that male and female birthrates were equal and concluded god existed and hated the same thing as him.

          P-values are a tool that can be used for good or bad. Using them to test a strawman is always bad. In other cases, I reserve judgement but it seems ok.

          Where is your alternative laid out in a way that typical researchers like Walsh (who were quite innocent in their mistake – it was their statistician who sent me their paper)

          There is none, afaict. It will take a larger cultural change, like how the Reformation eventually led to people ignoring all the crap that took “the bible is correct” as a base premise. They will simply be replaced by an entirely different way of doing things that renders them and the journals largely irrelevant (but we see how the Catholic church still lives on with many followers).

        • “The problem is not p-values. In the past p-values or similar were used to determine how well the observations fit with the predictions of the scientist’s model. Perhaps they are not ideal but they are ok enough.”
          WTF. That’s practically a summary of several themes in my current TAS paper. I wonder: if you had been using your real name, would you have attacked my paper spuriously as you did, only to feed its themes back as if they were your own?

      • I thought of a more succinct way of putting my position:

        All the misinterpretations and fallacies of p-values and statistical significance exist to somehow justify to the researcher why they are testing a strawman model (as they were taught to do). Testing a strawman model is the disease, your paper attempts to treat the symptoms.

        • I explained days back why your strawman=disease analogy is just wrong: All models are wrong (strawmen), we still need to use some and want to weed out the more wrong ones wherever we have information to do so.
          Even if you want to hold onto your false analogy, consider that if you ever have a disease wracking you with pain you will welcome symptomatic relief, so that has its place too (in fact it’s the bulk of medical practice in some areas, e.g., neurology). I see much pain generated as researchers try to force fallacious conclusions out of statistics, so I seek relatively safe sources of pain reduction: S-values are like aspirin, of minor potency with some hazards but safe and effective enough for OTC use; Bayes as touted by some fanatics is like oxycontin, potent with profound abuse potential and too often deadlier to uncertainty assessment than the frequentist disease it was supposed to alleviate – yet they want to make it the new tyranny.

        • All models are wrong, but some models are not even related to the problem at hand: understanding how things work… This is the disease: thinking of things as “noise” vs “my favorite signal”. Many people have stopped doing science and instead are simply comparing measurements between “my default noise generating process that my stats software uses” and “something else”, and not looking deeply into the idea of “something else”; worse yet, even concluding that their favorite “something else” must be true because there is only “my favorite” and “noise” and they’ve “proven” it’s not “noise”.

          Anoneuoid has repeatedly stated here (over several years) that a p value used to test an actual scientific research hypothesis is far less damaging, something like “drug X phosphorylates thing Y which reduces its activity in process Z which reduces edema in organ Q according to a dose-response curve C, therefore we expect measurements of edema in Q to look like q_i(C(dose))+noise and measurements of phosphorylated Y to look like y_i(C(dose)) + noise and we can’t reject this model using the available data (p=0.22) therefore the model seems adequate at the moment”

          It’s unfortunate to see the two of you battling over this, because I honestly think you’re both thinking similar things. The problem seems to be entirely one of communication / cross talk rather than disagreement on fundamental ideas.

          For example, if up near the top of your Nature article you’d said something like “Testing a null model provides only information about whether a default noisemaking process might have caused the data at hand. We must move back to the concept of testing specific research hypotheses about how and why scientific phenomena come about” I suspect you’d both be patting each other on the back instead.

          But your paper starts off with one important symptom of the bigger NHST disease: concluding that there is no difference because a comparison wasn’t significant.

          However, reading your Nature commentary more fully, I honestly think that something like my comment above couldn’t have been far from your lips, but that you had to focus on just a few of the problems to make your point understood by people who are likely to battle you because they’ve built their careers on NHST. Saying merely “the whole things is a disaster” would be unhelpful compared to saying “look here’s a clearly specified problem that is obviously wrong, and there are others…”

          One bit of background you may find useful is that it’s come out through many discussions here that Anoneuoid was a biomedical researcher and left the field *precisely because of their dissatisfaction with the nonscientific nature of so much research they were involved in*. When you see your favorite chosen subject dying of cancer it is maybe understandable that you get upset when someone suggests treating a concomitant ear ache. Imagine if in order to get a PhD in Epidemiology you had to swear an oath to uphold certain mathematical “facts” which had long ago been proven by mathematicians to be not just false, but actually imply the opposite. I suspect you’d have moved on to some other field and been pretty unhappy about this fake guild-like oath process.

        • Daniel:

          Thanks for clarifying; that helps mitigate the failure to communicate in a scientifically profitable manner that would enable all of us to get less wrong.

          However, it might be worth remembering: authors write (and re-write over and over again), editors edit.

          And communicating while upset, though understandable, seldom communicates profitably.

        • All models are wrong (strawmen), we still need to use some and want to weed out the more wrong ones wherever we have information to so.

          A model being wrong does not make it a strawman. Perhaps an example of good practice will help.

          One I like is: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2007940/

          Here they have an explanation for how cancer develops (accumulation of some type of “error”), make some simplifying assumptions (eg, the rate of “errors” is very small), derive how the age-specific cancer incidence should look if their model was correct, and compare the prediction to the data available (age-specific cancer mortality was used as a proxy).
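
          For readers who haven’t seen it, here is a minimal R sketch of that kind of derivation under the standard simplifying assumptions (the particular k and u below are arbitrary illustrative values, not estimates from the paper):

          # If cancer requires k independent rare "errors", each arriving at a small
          # constant rate u per year, the hazard at age t is approximately
          # proportional to t^(k-1), so log incidence vs log age is nearly straight.
          k <- 6; u <- 0.002
          age <- 20:80
          hazard <- k * u * (u * age)^(k - 1)  # small-u approximation
          plot(log(age), log(hazard), type = "l", xlab = "log age", ylab = "log incidence")
          coef(lm(log(hazard) ~ log(age)))[2]  # slope = k - 1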

          No one expects this model to perfectly fit the data, but it does capture the overall pattern. So it seems like they are on to something, but further tweaking of the model and auxiliary assumptions is clearly necessary. Eg, we can allow the rate of “error” to be arbitrarily high but assume most cancer cells are cleared by the immune system. Then if we rederive the predicted curves, the result can fit the data even better since then cancer rates can peak. Still it will not be perfect, but the data available isn’t perfectly capturing the rate of carcinogenesis either…

          Then the next step is to look in other data for what types of “errors” are accumulating in a given type of tissue at around the rates required to make the model fit. The error rate has been assumed to correspond to the somatic mutation rate but other data shows that is much too low. So we can look around for other “errors” that may accumulate. Eg, the error rate could correspond to chromosomal missegregation (aneuploidy), which happens with much higher frequency. This would lead us to collect data on how often chromosomal missegregation occurs in various tissues to see if it is consistent with the error rates predicted by our model.

          I see this as a legitimate scientific research program (even a good one, since it makes the surprising prediction that much higher “error” rates are involved than can be explained by the somatic mutation rate). I don’t see anywhere that I am testing a strawman.

          Testing a strawman would be something like assuming “chromosomal missegregation does not increase with age”. When I (inevitably) reject it, I take the result as support for my theory that chromosomal missegregation accumulation leads to cancer. Then I would come up with the next strawman to test like “chromosomal missegregation rate has zero correlation with cancer rate”. Next I can move on to working out details by testing hypotheses like “chromosomal missegregation rate is exactly the same in males and females”.

        • > Testing a strawman would be something like assuming “chromosomal missegration does not increase with age”
          OK, how about an HIV vaccine, developed based on the way other vaccines were developed, “does not increase protection against currently circulating HIV?”

          General point: are we not agreed that we need to avoid claiming to be certain (about anything other than math), especially about what others do or do not understand? Especially when we may not have read their paper that is at the center of the discussion?

        • OK, how about an HIV vaccine, developed based on the way other vaccines were developed, “does not increase protection against currently circulating HIV?”

          No, I don’t believe anyone cares about the answer to that question. What is relevant is how much protection for what cost. But HIV seems to be really hard to transmit anyway (something like 1 in a thousand to ten thousand odds per “encounter”)[1]. Lots of strange stuff about HIV actually; that is another thing where I think NHST has led people down a very wrong path.

          1) It is really hard to transmit between humans[1], but supposedly has jumped to humans at least a dozen times[2]
          2) Infections seem to stem from a single (or at most a couple) “founder” variants of the virus[3], but HIV is infamous for mutating like crazy within an infected person[4]. So how is it that so few variants get transmitted?

          Due to these strange properties (for a virus), I suspect it is transferred primarily cell-to-cell[5] so a vaccine will not be very useful.

          [1] https://www.ncbi.nlm.nih.gov/pubmed/11323041
          [2] https://evolution.berkeley.edu/evolibrary/news/081101_hivorigins
          [3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2387184/
          [4] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2614444/
          [5] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5825902/

        • Keith, I’ve thought a lot recently about the difference between something that is more or less “pure measurement” vs doing science.

          If I measure something precisely, and nothing else, I’m not doing science. Sure, measurement is very important to science, it’s critical that we do it well, and designing measurement instruments may take a knowledge of science (i.e., an explanation for how the instrument works), but the data coming out of an instrument/experiment is not in and of itself science.

          Why? Because science is about distilling the essence of why things happen in the world. If you aren’t asking why and how, you aren’t yet doing science.

          So, when we have a question which I will reword as “What is the relative risk of contracting HIV with vs without having taken this vaccine?”, the question is one of measurement alone. The science came along when you decided how you’d design the vaccine, observed its effects in cell culture or animal models or whatever, and measured things to determine if they were affecting the chemicals that make up organisms in the way you thought they should.

          It’s ENTIRELY appropriate to do different things in a measurement or engineering task than you would in a scientific/explanatory task.

          But once we come down to “what is the relative risk of With/Without in the real world population of people randomized to get the vaccine or placebo?” we are asking a pure measurement question. Here are some thoughts I’ve had on that:

          1) Ask the most informative kinds of questions: “what is the relative risk” (continuous, offers us several bits worth of information), vs “does the vaccine decrease the risk” (binary yes/no)

          2) Design the study to answer as many measurement questions as possible simultaneously. “What is the dose response curve of antibody H vs vaccine dose” and “Which virus subtypes are most frequently encountered” and “How does risk depend on behavior” and so forth can all be assessed along with the primary outcome, and should be published as well, 100% not filtered by p values.

          3) Discuss the relevance of different sizes of outcomes/measurements both before and after the study. Make plain a utility that will be used and the thinking that went into its choice. All utilities will be oversimplified, but even just giving a basic argument for why plugging some simplified utility into a calculation would be OK is good information. Even just comparing the relative monetary cost of prescribing antiretroviral therapy under the widespread-vaccine vs non-vaccine conditions would be far, far better than a binary yes/no “is p less than 0.05, does the vaccine reduce risk” question.

          All of those kinds of considerations would be much much more welcome than testing the straw man of zero effect on risk and rejecting it.
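
          As a minimal sketch of the first point above (counts invented, not from any real trial), the measurement answer is an estimated relative risk with a range of compatible values rather than a yes/no verdict:

import numpy as np

# Made-up vaccine-trial counts, purely for illustration.
cases_vax, n_vax = 30, 10000
cases_plc, n_plc = 60, 10000

rr = (cases_vax / n_vax) / (cases_plc / n_plc)

# Approximate 95% interval for the relative risk via the usual standard error
# of log(RR) for two independent proportions.
se_log_rr = np.sqrt(1 / cases_vax - 1 / n_vax + 1 / cases_plc - 1 / n_plc)
lo, hi = np.exp(np.log(rr) + np.array([-1.96, 1.96]) * se_log_rr)

print(f"RR = {rr:.2f}, 95% interval {lo:.2f} to {hi:.2f}")
# A point estimate plus a range of compatible values carries far more
# information than "the vaccine reduces risk: yes/no".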

        • Daniel:

          Regarding the idea of pure measurement not being science, recall my definition of statistics as being the intersection of three things:

          – measurement
          – variation
          – comparison.

          Also this (“Statistics does not require randomness. The three essential elements of statistics are measurement, comparison, and variation. Randomness is one way to supply variation, and it’s one way to model variation, but it’s not necessary. . . .”).

          P.S. I hope people are continuing to read all this, so deep in the comment thread.

        • > are we not agreed that we need to avoid claiming to be certain (about anything other than math) especially about what others do or do not understand?
          OK, I agree this is asking too much, hopefully readers can still benefit from hanging in.

        • > are we not agreed that we need to avoid claiming to be certain (about anything other than math) especially about what others do or do not understand?
          > OK, I agree this is asking too much, hopefully readers can still benefit from hanging in.

          Sorry, I don’t know what you are getting at. I can tell you this though, literally the only thing I am certain about is that this practice of testing a strawman null hypothesis is at best worthless. I’ve searched far and wide for any example to the contrary and have never found one. Anything else is up in the air.

        • Anoneuoid:

          Stated as “the only thing I am certain about is that this practice of testing a strawman null hypothesis is at best worthless” I would not disagree nor would the authors of the commentary (short excerpt below to confirm). Especially since a strawman null hypothesis seems to be defined as something worthless. (I tried to raise some uncertainty in the worthless claim in Virology. My concern was the apparent projection of positions to people who had written otherwise and how this hampers profitable discussion).

          Excerpt: “One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, … Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.”

          Now, if zero should occur in an interval, I see no problem in assessing its compatibility with data and all assumptions (which could include a prior).

  42. RE:

    Sander Greenland: ‘My own view is that this significance issue has been a massive problem in the sociology of science, hidden and often hijacked by those pundits under the guise of methodology or “statistical science” (a nearly oxymoronic term).’

    ———–
    As I have expressed before, as a parallel, in the foreign policy/international relations arena, the sociology of expertise is critical toward understanding how and why ideas and assumptions prevail on the basis of small-sample perspectives and the behind-the-scenes interpersonal dynamics at play. This is the case in scientific circles too, based on my understanding from very prominent scientists at MIT and members of the National Academy of Sciences [NAS], some of whom took on the candidacy of Samuel Huntington to the NAS. A remarkable process which has many lessons for decision-making and uses of probability assignment in international relations.

    • I really enjoyed that article, too, especially the overall note of caution about conflating statistical and scientific inference, so common in soft sciences. But they insist too much on comparing statistical inference to sampling theory (well, most books of applied statistics do it, too), and their notion of scientific inference lacks philosophical depth – they argue in favor of abduction, but their concept of significant sameness reeks of induction – and it would be OK if they at least mentioned the century-old debate over this problem.

      • Erikson,
        Your observations are well taken. I had similar takeaways. I gather Hubbard has considerable regard for Meehl. I do welcome the emphasis on sampling theory. Most researchers are not applied statisticians, as Rex Kline of Concordia University has highlighted in his Hello Statistics video. The inferences they drew made sense to me.

        • Re: ‘their notion of scientific inference lacks philosophical depth’

          For purposes of the debate, any deeper elaboration would have distracted audiences even further from their main thesis.

          Could be that one or more is a proponent of Charles S. Peirce. Some scientists have suggested that ‘abduction’ doesn’t come easily to many in the research communities. I have no way of evaluating that claim.

          I thought that the authors’ point was that ‘significant sameness’ was largely a deductive process.

      • I mostly agree with Erikson – liked their caution about conflating statistical and scientific inference and the point about significant sameness underlying the history and theory of statistics, but the paper seems to lack philosophical depth (especially on abduction) and addresses a rather ill-informed view of statistical theory and practice.

        But authors need to start out from where they find themselves, and apparently that is in a community that has mostly encountered less-than-ideal theory and practice of statistics. It is good to hear about that.

        As for their sense of history, this may be a good contrast: How Large Are Your G-Values? Try Gosset’s Guinnessometrics When a Little “p” Is Not Enough https://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1514325

        • Abduction is extremely hard to explain. Several academics in Boston suggested that scientific fields need more ‘abduction’.

          I thought that Hubbard’s article pointed to a broader lens through which to view the measurement challenge and contextualizes the role of statistics in that challenge.

        • The discussion on Twitter is devolving into a circus for a variety of reasons. I don’t know why it turns into such a cantankerous interchange. Let’s have some fun with it.

          I think some just get too sensitive or angry with others. It distracts from a wonderful opportunity.

        • To expect unconditional loyalty to ONE position or another should not be a badge we honor. That would defeat the purpose of the petition too. Yet there is an undercurrent of it, distorted by the responses to it. What the heck?

      • Here is what Raymond Hubbard notes on ‘significant sameness’.

        ‘Of course, the significant sameness model is not without its own limitations. For example, it might be objected that this model could be potentially expensive to apply in practice. There is merit to this argument as things stand today. If, however, the editorial-reviewer biases favoring the publication of “novel,” p<0.05, results, and against the publication of replication research—which biases institutionalize bad science—could be successfully confronted, then substantial resources would be freed to promote significant sameness.’

    • As usual I agree with the vast majority about the series of fallacies people have been stringing together to interpret their data the way they want. However, I think there are distractions here.

      I don’t think “formal” is the right term. They define it as:

      formal methods of statistical inference—that is, frequentist methods for generalizing from sample to population in enumerative studies

      First, I don’t see what the problems they go on to mention have to do with being frequentist. For example, Fisher was not a frequentist.[1]

      Second, by “formal” they seem to mean “default” statistical model. All the complaints are about reality deviating from the assumptions of the models statisticians have found it convenient to derive. Exactly! There is one root problem that leads to all those mentioned there: people are testing models of something (a strawman) other than what they believe may have generated the data.

      It really is that simple. All these other problems flow from testing a strawman null hypothesis.

      On the other hand, RCTs with p > 0.05 findings, while failing to pass muster with the FDA, can nevertheless be of great help to many patients (Kent and Hayward 2007, p. 62). For instance, the FDA initially denied approval for the drug temozolomide as a treatment for glioblastoma brain tumors because of overall p > 0.05 results. This meant delayed access to a new medicine which subsequently became the treatment of choice for patients with brain cancer (Williams 2010, p. 113).

      This is missing some info. How has it been determined that temozolomide is a net help for glioblastoma patients? From the Williams 2010 source[2] it looks like it is because there were “significant” results in the phase II trial:

      Given the disparity between the Medical Research Council trial and previous anaplastic astrocytoma results, as well as various other discordant features of the results (such as the failure to find any effects of well established prognostic variables, such as age and Karnofsky score), the clinical issue was whether the null result in the randomized phase III trial should override the seemingly much more positive results in the previous phase II trials. This particular large phase III trial had little influence because of various shortcomings in the trial execution (Chamberlain and Jaeckle 2001), but while possible reasons for the null effect reported in this trial could be identified, it is important to recognize that all clinical trials with null effects face a similar problem of interpretation.

      So I don’t get what we are supposed to learn from this example. Are the “formal statistical methods” working then? We just need to wisely choose which to listen to?

      [1] “The Nature of Probability”. Centennial Review. 2: 261–274. 1958. http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf
      [2] https://www.ncbi.nlm.nih.gov/pubmed/20173299

  43. I want to emphasize the role of the Nature Comment, and particularly of all the signatories, as a resource for teaching students beyond the actual content of the commentary. It provides a new sort of evidence about the number of concerned individuals in the scientific community that can help current students who are often stuck being taught conflicting ideas. We have been teaching a course on statistical thinking (also described in one of the TAS papers) in hopes of contributing to a shift in thinking. Our students are faced with the challenge of digesting our ideas against a backdrop of what appears to be strong consensus and belief in a fairly rigid use of p-values. Many students today, for example, are taught to believe in p<0.05 in their first and formative exposure to statistics and often from trusted teachers in secondary school. Why should they then believe us? The ASA statement on p-values, this often-assigned blog post, the new special issue – they all help; however, a paper with 800+ signatories is a new form of evidence to convince students that it's OK to question what you have been taught and, further, that you will not be alone when you do so.

    • Ashley,

      All viewpoints should be appraised. As I mentioned to John Ioannidis, last year at a Yale symposium, I doubt that we can make much headway unless we tackle some broader dimensions of learning. You are calling attention to Pedagogy, rightfully.

      Aside from pedagogy, there is the reality that many biomedical fields are siloed as pointed out in Steven Goodman’s article

      https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1558111

      Thus making it difficult to capture the information/knowledge required to think contextually to begin with.

      For example, I’ve noticed in my non-statistician circles offline that a good portion of them may have read or will read some fraction of the 43 TAS articles and call it a day. Typically, they will read Ron Wasserstein, John Ioannidis, Steven Goodman, and Valentin et al. I am not sure whether they can distill important aspects of each of what they have read. That would take time to ascertain.

      Reading comprehension is a broader issue, it seems to me. Two readers can interpret & report the same passage differently.

      Reading comprehension is aligned with another surprising problem. See

      Massive citations to Misleading Methods and Research Tools: Matthew Effect, Quotation Error, Citation Copying.
      John Ioannidis

      https://link.springer.com/article/10.1007%2Fs10654-018-0449-x

      I’m kinda brainstorming here; and, in the interim, suggest that there is the consideration of the bias, among other biases, of ‘anchoring’. There are, what, 245 biases implicated in thinking through any theory and practice.

      https://en.wikipedia.org/wiki/Anchoring

      The extent of exposure to ‘measurement’ issues is daunting. Even now, much high school and undergraduate curricula do not prepare graduates to critically evaluate the integrity of the claims made. I have to speculate that perhaps even now the teaching is far too didactic to evaluate the merit/demerit of the views posited.

      I think it is worth everyone’s while to read through ALL the articles as well as this Comment and others that may surface.

    • Of course it is OK to question (and OK to question the signature gathering approach too), but like your example illustrates, the problem lies with misuse and incorrect teaching, not a problem with p-values or alphas or significance testing itself.

      I wonder how it came to be that the students were being taught

      -to believe 1 study means anything conclusive
      -to always use .05 for alpha
      -to only use a p-value when making a decision
      -to not question things
      -to, I’m guessing, divorce p-values from well-designed experiments

      Those do not reflect my personal experience with high school, undergrad, grad school, and professional work experience with significance testing concepts. In fact, it is almost the complete opposite. I was taught to

      -to design good experiments/surveys
      -set alpha based on cost of making a Type I error, and ideally sample size
      -to use p-value/alpha as just one of many tools for making a decision

      I’d worry that people are (to quote Mayo, from memory so I might get it wrong) throwing the error control baby out with the misuse bathwater if they outright reject significance testing. Also, if the teachers get significance testing logic so wrong, they will have no hope with proposed alternatives such as the intricacies of priors, MCMC settings, and math/programming in Bayesian approaches.

      I would also doubt the students, or even the teachers, are reading TAS (or Nature?). How do we reach populations that do not read the journals where these things are discussed? Are they reading these blogs?

      Justin
      http://www.statisticool.com

      • Re: ‘I would also doubt the students, or even the teachers, are reading TAS (or Nature?). How do we reach populations that do not read the journals where these things are discussed? Are they reading these blogs?’

        In the response of mine that wouldn’t post, I raised the prospect that most, in my own expert circles offline, have read a few of the 43. They have used statistics in their own work. But if you ask them what their takeaways are, they are not able to articulate them so well. This same problem manifests in informal conversations with statisticians as pointed out by Rex Kline in one of his books.

        I was cautioning readers about the bias of ‘anchoring’ that is visible in various stages of a query. Getting stuck on one anchor is quite common and hard to budge thereafter.

      • RE: ‘Of course it is OK to question (and OK to question the signature gathering approach too), but like your example illustrates, the problem lies with misuse and incorrect teaching, not a problem with p-values or alphas or significance testing itself.’
        ——
        The way you have framed it strikes me as a false dichotomy: misuse/incorrect teaching versus correct p-values or alphas or significance. Especially as some suggest that the very definitions are open to question. It begs the question, too.

        It’s a multicausal terrain that evokes the ‘garden of forking paths’ analogy. Standardization of definitions could have been addressed in a timely way.

        Statisticians alone can’t address this. It is going to take far better collaboration across and within fields, given the extent of siloes as Steven Goodman points out.

      • Fair enough to wonder how the ongoing statistical atrocities came to be, but I was taken aback by this:
        “Those do not reflect my personal experience with high school, undergrad, grad school, and professional work experience with significance testing concepts. In fact, it is almost the complete opposite. I was taught to
        -design good experiments/surveys
        -set alpha based on cost of making a Type I error and ideally sample size
        -use p-value/alpha as just one of many tools for making a decision”
        I had to wonder, were you educated on planet Earth? If so I’d like to know where so I can send my grandchildren to those educational institutions. In the medical literature, the rule not the exception is to misinterpret tests exactly as we describe. That is in line with various tutorials and editorials in those journals (in the 20th century at least) which routinely botched key verbal descriptions of statistics. No surprise then that the surveys we cited found such mistakes in over half the articles. More disheartening, surveys of professionals have seen over 90% misunderstanding at least one important aspect of significance/hypothesis testing.

        Now despite all that I am not for banning anything except (1) use of potent words like “significance” and “confidence” to refer to mere numerical outputs of the procedures so named, and (2) degradation of reported quantitative statistics into dichotomies. This does NOT preclude their comparison to cutoffs for decision rules; it simply helps deconfound reporting of statistical results from the much more difficult step of decision analysis.

        It seems few appreciate and most are in denial about the tyranny of language, as well as the editorial tyrannies that still force authors to describe everything in utterly misleading terms which confuse the inherently statistical aspects of information extraction and communication (which has hopes of achieving a modicum of consensus) with the deeper intricacies of decision formation (which I think will remain highly controversial because of conflicting values). That brings me to one major element missing from your list and every other I see: Before you set alpha, you must decide which hypothesis (in a very artificially circumscribed domain of test and alternative hypotheses) is the more costly to erroneously reject. As Neyman (Synthese 1977, p. 106) explained, that is as subjective (based on individual costs/utilities) as the choice of alpha, and is fundamental to his entire hypothesis-testing framework.

        You should thus wonder how it is almost the entire scientific community came to act as if and even defend the pernicious idea that the null hypothesis (of no effect) is the only legitimate choice for the test hypothesis, creating the plague of nullism in the form of mindless NHST. I see it arising from confusing Fisher and Neyman’s writing into a rigid doctrine neither of them endorsed, aggravated by adoption of Fisher’s use of “null hypothesis” to mean the test hypothesis (a mistake Neyman knew better than to perpetuate). Again, terminology affects practice more than most authorities ever realized, let alone now seem prepared to admit and reform.
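
        A minimal numerical sketch of that point about costs (the costs, prior weights, sample size, and effect size below are all invented): once the relative costs of the two errors are stated, the loss-minimizing alpha falls out of a small calculation and need not be anywhere near 0.05.

import numpy as np
from scipy import stats

# Toy one-sided two-sample z-test with n per arm; all numbers are invented.
n, delta = 100, 0.3                  # sample size per arm, standardized effect under H1
p_h0, p_h1 = 0.5, 0.5                # prior weights on the two hypotheses
cost_type1, cost_type2 = 1.0, 5.0    # cost of a false rejection vs a missed effect

def expected_loss(alpha):
    z_crit = stats.norm.ppf(1 - alpha)
    power = 1 - stats.norm.cdf(z_crit - delta * np.sqrt(n / 2))
    beta = 1 - power
    return p_h0 * alpha * cost_type1 + p_h1 * beta * cost_type2

alphas = np.linspace(0.001, 0.5, 500)
best = alphas[np.argmin([expected_loss(a) for a in alphas])]
print(f"loss-minimizing alpha ~ {best:.3f}")   # far from 0.05 for these particular costs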

        • Re: You should thus wonder how it is almost the entire scientific community came to act as if and even defend the pernicious idea that the null hypothesis (of no effect) is the only legitimate choice for the test hypothesis, creating the plague of nullism in the form of mindless NHST. I see it arising from confusing Fisher and Neyman’s writing into a rigid doctrine neither of them endorsed, aggravated by the adoption of Fisher’s use of “null hypothesis” to mean the test hypothesis (a mistake Neyman knew better than to perpetuate). Again, terminology affects practice more than most authorities ever realized, let alone now seem prepared to admit and reform.
          ——-
          I go back to the point I raise often: some small subset has better-quality fluid and crystallized intelligence, regardless of credentialing. Maybe contextual intelligence is a more appropriate characterization. One was a medic who was a better diagnostician than several doctors who knew him; they consulted him informally. There was a teacher who used to be consulted; she had been a nurse in Africa with wide experience treating the sick. Another was a dilettante with many hobbies, also sought out by scientists.

          I understand all three were not binary thinkers. No great explanation. Such consultations are not unusual.

          I doubt that those three would have blindly followed that choice for the test hypothesis.

      • Justin,

        My reaction regarding education is the same as Sander’s: I see p-values and hypothesis testing as mostly poorly taught, and even if well taught, poorly learned.

        If we’re talking “intro to stats” classes, students are usually taught a step-by-step, plug-the-numbers-into-the-formulas procedure which they can employ with little-to-no understanding and still get decent grades. Many students do not like being taught in a non step-by-step, plug-the-numbers-in-the-formulas way. I teach these classes and over time I’ve made them less formulaic and more conceptual, which has led to an increase in student complaints. I have a high tolerance for these kinds of complaints and I’m not going to go back to teaching these courses as they were taught when I inherited them. But plenty of teachers prefer the path of least resistance, which is teaching it the rote way. The rote way also makes writing and grading homework and exam problems much easier. I can grade “calculate the bounds of a 95% confidence interval” practically in my sleep. Grading “why do we make 95% confidence intervals in the first place?” takes more time and effort, and also lays bare the fact that students can learn how to implement procedures while still knowing almost nothing of value. This makes some teachers uncomfortable.

        Now, the sorts of students who are going to end up working on publishable research are usually more inquisitive and careful than most. They may learn the conceptual framework and the subtleties of p-values and significance testing well. They may appreciate being taught to think more critically, to not treat “p < 0.05" as some arbiter of truth. And then they'll start doing research and be told that, in order for a paper to be publishable, p < 0.05 is simply required. And the corollary, that results not attaining p < 0.05 have to be discarded. When the pressures of reality face up against the principles they might have learned in a well taught statistics class, reality wins.

        Also consider that many students learn these methods from people whose own understandings are very shallow. There are only so many people who have real statistical expertise teaching this stuff. One game I like to play is to go to the university bookstore and find textbooks from various sciences (biology, economics, psychology, chemistry, sociology, whatever you like…) and look in the index for "p-value" or "hypothesis testing", and read through the relevant sections. The textbooks usually give a naive treatment.

        As far as "if they can't figure out NHST, what hope do they have of figuring out more advanced methods?", I don't think they have to figure out more advanced methods. Trading in the p-values for confidence intervals or standard errors and dropping "significant" from our vocabulary would be easy to do and it would be an improvement. Reporting plots of the data rather than only statistical summaries would be great. If using regression type models, reporting predictions for various relevant values of the predictors is easy enough, and provides a kind of clarity that a table full of slopes often lacks.
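
        A small sketch of that last suggestion (simulated data and a hypothetical "age" predictor, just to show the reporting style): report model predictions at a few substantively relevant predictor values instead of only a coefficient table.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 200)                        # hypothetical predictor
y = 2.0 + 0.05 * age + rng.normal(0, 1, size=200)     # simulated outcome

fit = sm.OLS(y, sm.add_constant(age)).fit()

# Predictions (with uncertainty) at a few relevant ages, instead of only slopes.
new_ages = sm.add_constant(np.array([25.0, 50.0, 75.0]))
pred = fit.get_prediction(new_ages).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])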

        • I agree completely, and this matches my experience as well (except that my understanding is far less deep than yours – still I refuse to teach the subject with a formulaic approach, despite the pressure from students and colleagues to do it that way). But I’m afraid these problems are not unique to p-values or even statistics. They permeate much of higher education at the undergraduate level. Despite lip service paid to the importance of critical thinking and teaching, the reality is that it is not really the priority, and where it is (small liberal arts colleges, for the most part – at least some of them), there is still too much formulaic teaching, testing, and textbooks. While it is possible to design a “good” multiple choice question, I still believe they have no place in education. If they are to be used, then simply adding a paragraph that says “explain your choice” would make a far better question. Why is that done so rarely? Simply because it then takes far more time to grade the answers. And, as teaching loads increase (and many teachers are underpaid and overworked adjunct instructors), that time becomes precious. Students, too, are overworked with more activities than they have time for. So, careful and thoughtful reflection takes a back seat to the daily needs to get things done. I don’t see this as specific to p-values, nor do I believe the consequences are more severe than they are with myriad other subjects with the same pressures and problems.

        • Dale, thanks for that. You are so right. For at least 15 years, I perused critical thinking curricula. Largely out of curiosity. These problems pervade other fields. I see this in foreign policy fields in particular, which, as in the biomedical fields, is backed by special interests. Recently a research scientist at the National Science Foundation stated that ‘conflicts free’ research is necessary. This is also an educational necessity. There has been considerable research in this area. See David Perkins, Robert Ennis, Raymond Nickerson, Howard Gardner, Robert Sternberg.

          I haven’t yet come across studies that show whether critical thinking curricula have or have not yielded the hoped-for improvement. Philip Tetlock’s Good Judgment project seems to suggest that improvement is feasible. I participated in IARPA’s CREATE program last summer. It was tough. But it was a great learning experience.

        • I have to think about who I would regard as the best critical thinkers today. We need big thinkers too. We have many, many fine analytical talents. But as we can see from the TAS19 43 articles, or at least the 13 I read, most of the recommendations have been proposed for decades. Rather, my increasingly stronger hunch is that this is about creativity and collaboration. And this is also what many scientists have expressed. Lack of creativity is a concern in the sciences. Specialization has bred siloes of knowledge, another constraint.

        • Just to add one last thing. My friends who follow my Twitter Feed had been lampooning the Twitter threads on the Statistical Wars. I was a little embarrassed to hear that. Mostly they thought that the interchanges were not only super grumpy but childish, with tempers flaring. Not sure who they meant specifically. But I can guess.

          This turns off younger Twitterati. Those not in statistics circles. Discussions are too technical for them. It might provide some entertainment. But whether they are learning is a question.

          My view is that people shouldn’t get so sensitive when their perspectives are queried. I think it is great when someone can convince me that her position has merit. Who wants to hold on to inaccuracies anyway.

        • Sameera:

          I hate twitter. I wish the people who are tweeting would post comments on this blog (or write their own blogs) where they could explore their ideas more carefully and move away from the sniping.

        • Re: Twitter

          I am, though, an advocate of discussions on Twitter. It is a platform, IMO, that approximates or reflects the epistemic environment in a condensed form. It affords access to the purveyors/thought leaders and of course the wider public. A democratic ideal whereby people can evaluate claims made. I think it is concerning for any expert circle to regard itself as the defining authority. There should be checks; thus the roles of the citizen scientists and non-scientists are where it is at, particularly as a function of producing information entropy change.

          Sorry to take so much time.

        • My impression has been that some venues include a few who have fewer conflicts of interests. I know that Elliot Richardson and Kingman Brewster had emphasized its importance referring to it as the ‘independent observer’ status. Then again they had libertarian leanings or something like that.

          In short, I think some have to have the courage to distinguish evidence from proof when a situation commands it.

          I’m just saying the NSF has had “national defense/security” as one of its priorities from the beginning, with all that entails. Seems like a pretty huge conflict to me.

          One of the goals was to “develop and promote a national policy for scientific research and scientific education.” Since about that time people have been taught a totally wrong way to do research (NHST), so it looks like a failure to me.

          But another stated goal is “control of patents in accordance with the dictates of national security.”

          It would be in the interest of national security to do stuff that, eg, slows down nuclear proliferation. Classifying patents is one way. Another way to do this would be to fund improperly run projects that get misinfo inserted into the public literature. So perhaps it was a success.

          Regardless of whether something like that ever happened, I don’t think you can get more conflicted than that. So it is just pretty hilarious to me to hear about someone working at the NSF saying research must be “conflict free”.

        • What is so amazing for me is that I never heard the term CIA while I was growing up, even though I was around many heads of these agencies at the academic conferences. It was only when I read Andrew Bacevich’s history of the relations between CIA and Defense Dept, I recognized that I had seen & talked with Allen Dulles many times at conferences. There was Sherman Kent too. So I had to speculate that I had or have little idea sometimes who is affiliated with whom. I go with the flow. LOl

          My impression has been that some venues include a few who have fewer conflicts of interests. I know that Elliot Richardson and Kingman Brewster had emphasized its importance referring to it as the ‘independent observer’ status. Then again they had libertarian leanings or something like that.

          In short, I think some have to have the courage to distinguish evidence from proof when a situation commands it.

        • Anoneuoid,

          I fully comprehend your stated concerns. It is a very complicated discussion to pursue. The quantitative/qualitative debates have evolved from the 80s, when I accessed Tversky and Kahneman’s work. I thought then it would permeate decision making and perhaps lead to an epistemological and epistemic revolution in the decades ahead, a view I shared with some associates. The current measurement controversies and the replication crises have led to some greater caution than expressed publicly, notwithstanding some exceptions. The IARPA tournaments conducted by Philip Tetlock and his wife/colleague Barbara Mellors have provided lessons for the national security/intelligence communities. What they have demonstrated is that some decisions may hasten entropy or, in the alternative, poor decisions. Thus impacting any given society or many societies.

          The US is not about to abandon its military goals and objectives; in short to contest the emergence of any rival military power. It is implied and expressed in every national security strategy and report.

          But in that process, in an era of technology, automation, machine learning, etc., there are potential unanticipated risks as well. So the question does remain: who are the individuals that can predict and prevent them? What education would facilitate better thinking and produce better quality knowledge?

        • The Challenger Accident was a huge deal in the re-assessment of the criteria for recruitment to the national security community. My understanding is that some pulled into it were deemed to have very high IQs and very eclectic thinking styles. With IQs in the 160-180 range, a measure that I don’t particularly countenance myself for a variety of reasons. Primarily I don’t think it necessarily reflects wisdom. I don’t even know my IQ.

          I have favored learning & emotional disposition, which are a function of fluid and crystallized intelligence. These are just hypotheses based on my readings and interacting with many psychologists and philosophers since the age of 14.

        • Sameera:

          Do you have a link to this thing about conflicts-free research?

          “Conflicts free” can be good or bad. Conflict can make it harder to move forward, and it can be a distraction from science. I was mocked or criticized for saying that I don’t like interpersonal conflict, but I really don’t like interpersonal conflict. Some people seem to thrive on it, but I don’t. I don’t mind intellectual conflict, but that’s another story.

          Anyway, to continue with the main thread here . . . Yes, “conflicts free research” can be good, but it can also be bad, if “conflicts free” implies sticking with the status quo. If someone publishes a paper on embodied cognition and it gets thousands of citations, but then it doesn’t replicate, should we not publish the replication, or soft-pedal its implications, as a way to avoid conflict? Etc. I’m concerned if “conflicts free” is used as a way to afflict the afflicted and comfort the comfortable. I’m particularly concerned if Robert Sternberg is involved in this movement, as he has a track record of supporting the status quo and attacking critics.

        • Andrew,

          My apologies, I should have been clearer. In the domain of ethics. I meant ‘conflict-free’ as not having a conflict between private interest and the official responsibilities in a position of trust.

          Believe me, as my former colleagues can tell you, I voiced perspectives that so astounded the foreign policy establishment that few would talk to me in a friendly way. However, I had some very good backing. After all Samuel Huntington told me that I had every right to speak my mind. I grew up in these circles to begin with.

          I challenged some for some of the very reasons elaborated in Expert Political Judgment by Philip Tetlock and Irving Janis’ Crucial Decisions. I

        • Robert Sternberg was not as you describe when he was at Yale and Tufts. I know many who had worked with him, as I was in education circles in the Cambridge/Medford/Somerville area. I don’t know his career trajectory since I moved to DC 17 years ago. I read several of his books. I am a proponent of David Perkins’ and Howard Gardner’s work mostly.
          Sternberg’s work is very interesting. His recent book on Creativity is well regarded.

        • RE: TWITTER

          Andrew,

          My Twitter connections are mostly young. I was inducted into 120 Twitter lists, which is a bit unusual. It therefore leverages over 120,000-plus Tweeters or more who can see my tweets, above and beyond the 11,500 or so Twitter following. Plus I have to qualify that my friends are comedians, musicians, and actors. 40 plus or so. Hey, they used to tease me too. Good natured mostly. But they were kinda surprised at how academics behave on Twitter.

          Yes, I urge many to post on your blog.

        • Andrew,

          I see some similarities of your temperament with mine, especially when I came to DC. I was amenable, frank, and refrained from vicious behavior; much seemed to be the result of too much alcohol or ego. Not sure, but I think I’m right. I also thought naively everyone loved me. Smile. [Some did.] But many more get into a competitive mode. I held off getting into it. Fortunately, Zbigniew Brzezinski and others at CSIS came to my defense, even though we had some disagreements.

          Later on Linkedin fora, I just kicked the pants off the gangland that descended to test my cool and views. I hated doing it. But I had soooooo much fun. I chuckled all through it. I didn’t get vicious either or engage in name calling. I stuck to the merits and demerits of the argument. Many in DC saw me in a new light. They were nicer. So sometimes there is a value in a sustained kick your ass attitude. I had to adopt it with the six or seven dudes that were on me.

      • I actually figured out what the problem has been when I recently had to set up a website. The data for the site is being saved to different servers around the world, and they don’t update all at once:

        Server cache is often the culprit of confusion and frustration when it comes to performing website updates. Clients often request a change, we make them, tell them it’s fixed, only for them to tell us it’s not fixed. Why’s this happen? (Hint:It’s not because we didn’t fix it!) The answer is server cache.

        In many cases, if you wait patiently (in some cases, it could be a few days) the changes will eventually show up as the server will clear out the cached information during regular intervals. We understand that not everyone is that patient, so to review the change immediately, within the backend of most CMS’s (content management systems), there is the ability to flush, or clear the cache.

        https://www.commonplaces.com/blog/show-me-the-cache-server-cache-vs-browser-cache

        Sometimes clearing the browser cache (like the server cache but on your device) and resetting the page helps, but not always (when the issue is the server cache). For me that is ctrl-F5.

        But yea, Andrew needs to (or get someone to) set the cache duration lower, eg: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control

        It is much better since he moved to the new domain, but still has more issues than any other site that I’ve noticed.

        • Anon:

          Indeed, this is the problem we were having on the blog for a while, until we switched to a new host. People were posting comments that were not appearing until a day or two later, and that was making it difficult to have coherent and timely discussion threads.

        • Yea, it is far better now but the side bar is still not always synchronized with the rest of the page. I would bet they are saved as separate files with different cache settings. You could try turning off caching altogether since your blog gets updated (via comments) pretty frequently.

        • Most likely the cache should be set to something like 2x the peak of the average inter-comment arrival time as measured over say 10 minutes. Then when maybe 20 or 40 people load the page to read the comments in between comments, it doesn’t fall over, but when a new comment arrives, it doesn’t take long to get noticed.

          If I were going to put a prior on this peak inter-comment arrival time it’d be something like say 30 seconds to 1 minute (remember we’re averaging inter-comment arrival times for 10 minutes, and taking the peak of this average each day). So cache timeout should be something like 1 or 2 minutes.

        • Lower is better for interactivity but requires more CPU, higher is bad for interactivity but leaves the computer less loaded. I’d say given the interactivity here, higher than 1 or 2 minutes is a bad idea, so in some ways you might want to set it to say 30s cache timeout, and then scale up the CPU plan until it can handle it.
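
          To make the rule of thumb above concrete, here is a toy calculation (invented timestamps; one reading of the rule, taking the average gap during the busiest stretch and doubling it):

import numpy as np

# Hypothetical comment timestamps (seconds) over a ten-hour stretch.
rng = np.random.default_rng(0)
timestamps = np.sort(rng.uniform(0, 36000, 300))
gaps = np.diff(timestamps)                    # inter-comment arrival times

# Average the gaps over rolling windows of ~10 comments, take the busiest
# (smallest-average) window, then set the cache timeout to roughly 2x that.
window = 10
rolling_mean = np.convolve(gaps, np.ones(window) / window, mode="valid")
busiest_avg_gap = rolling_mean.min()
cache_timeout = 2 * busiest_avg_gap
print(f"busiest average gap ~ {busiest_avg_gap:.0f}s, suggested timeout ~ {cache_timeout:.0f}s")
# The page response would then carry something like "Cache-Control: max-age=<timeout>".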

        • Then when maybe 20 or 40 people load the page to read the comments in between comments, it doesn’t fall over, but when a new comment arrives, it doesn’t take long to get noticed.
          […]
          So cache timeout should be something like 1 or 2 minutes.

          […]

          Lower is better for interactivity but requires more CPU, higher is bad for interactivity but leaves the computer less loaded. I’d say given the interactivity here, higher than 1 or 2 minutes is a bad idea, so in some ways you might want to set it to say 30s cache timeout, and then scale up the CPU plan until it can handle it.

          This is fine for optimizing the reader experience, but what about the poster? People prefer (even expect) to see their comments appear more or less immediately. I get three different experiences though:

          1) Most often it shows up immediately (~75%).
          2) Sometimes a preview is shown and it says “waiting for moderation” (~25%). This can take a few seconds to minutes or even hours. This seems more likely if I include some links, so I assume it is just a spam filter but maybe Andrew has some others as well.
          3) If I enter no email (or the wrong one) I get no preview and the post just shows up later at some point (I assume this is the same “waiting for moderation” as #2).

          Seeing the post show up immediately is much preferred by me. Like Sameera said below, the delay causes some stress because it makes you wonder if the comment will ever appear, or if something might be offensive about it.

          Also take into account the costs of switching to no cache:

          1) Page load time will get a bit slower (it currently takes ~100 ms for me to load an entire page), but is probably negligible for a page like this. Most visitors will stay on a page “consuming the content” for quite awhile after each page is loaded, so as long as this is less than say 2 seconds it shouldn’t matter much to the experience.

          2) It will increase the number of server hits. This may cost Andrew money, but based on the domain it looks like the university is hosting it now so perhaps it is “free” to him.

          So if there is negligible cost, I’d say he should just turn off caching.

        • Let’s say that 10,000 readers read this blog, and they reload it throughout an 8 hour day on average 10 or so times. There are certain times of the day that are 10x as heavy as avg… so I do:

          10000 * 10 / 3600/8 = 3.47 hits/second is the avg rate, and 34.7 hits/second is the peak rate.

          If every one of those hits has to build the page from scratch I could easily see a cheap hosting plan (ie less than $1000/mo) falling over without proper caching. On the other hand if only once every 30 seconds the cache has to be rebuilt… this drops the cost from say $1000/mo to $10/mo ;-)

          But I think the biggest issue is we used to “stay logged in” using some cookies, and then it would know who you are, and you’d see your comment which would say “awaiting moderation” and you’d know that it’d be auto-flagged and Andrew or someone had to approve it.

          But at some point, this changed, I no longer stay logged in, and if I reload the page I don’t see my held comments. So if we can figure out how to stay logged in… that’d alleviate some of the stress I think.

        • Thank you So much! You are so sweet to take the trouble to explain. I thought at 1st I had said something unacceptable.

  44. In short, my response that won’t post suggested that the bias ‘anchoring’ has been a persistent problem in the debate over P values. And that it behooves us to read through all the 43 TAS articles and Comment to assess the epistemological integrity of the positions.

    I hope that Andrew will reconsider posting my more elaborate response.

    • I wanted to add that while basic logic is helpful, so too is exposure to the 245 cognitive biases, which constitutes a daunting task. However, a good percentage need to be reevaluated & redefined in the contexts in which they are raised. This is not simply my view; it is shared by those who are exceptionally versed in these biases.

  45. Responding to Keith’s post here:

    I would not disagree nor would the authors of the commentary (short excerpt below to confirm). Especially since a strawman null hypothesis seems to be defined as something worthless
    […]
    ‘Excerpt: “One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, … Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.”

    Now, if zero should occur in an interval, I see no problem in assessing its compatibility with data and all assumptions (which could include a prior).

    I really don’t think the true problem (the one that I harp on) with testing a strawman model was grasped. The paper included an example of testing a strawman (teaches the audience to test a strawman), and from comments here Sander was equating strawman with “wrong/imperfect” model. It isn’t a strawman because it is wrong; it is a strawman because it is not predicted by the research hypothesis.

    I tried to raise some uncertainty in the worthless claim in Virology.

    I do not hold virology or vaccine development to be paragons of good science; on the contrary, I believe NHST has allowed much mischief there.

    [In an attempt to avoid being dismissed as an “anti-vaxxer” I tried to be very thorough below, probably way more effort than a blog comment deserves and I bet something goes wrong with the formatting anyway…]

    For example regarding “previous vaccines”, in the case of measles 4 things happened at around the same time:

    1) Public health campaigns told people to stop having “measles parties”.[1]
    2) The definition of “confirmed case” became more strict and eventually required some form of blood test.[2, 3, 4, 5]
    3) Doctors are less likely to diagnose measles if they are told the person was vaccinated.[6]
    4) The vaccine was given.

    All the analysis I’ve seen amounts to “measles cases went down therefore the vaccine is very effective”, totally ignoring factors 1-3 which by my estimates could be very large. Eg, there were still ~20k cases of “measles-like illness” diagnosed every year in 2004 (but only 100 cases of “confirmed” measles)[7]. Presumably all of those cases would have been called “measles” before the blood tests. Another study reported only 7% of “clinically diagnosed” cases were “confirmed”.[8]

    Also, I never see anyone taking into account the risks and costs when discussing this vaccine.

    1) The original plan from the CDC was to “end measles”, ie eradicate it, by 1967.[9, 10] This didn’t work out because measles is much more contagious than they thought (at the time it was not believed it could survive long in the air).[11] So the plan morphed into the current one of vaccinating a high percentage of people for perpetuity.[12, 13]

    The population is now basically addicted to their vaccine, and if the supply is ever lost for an extended period of time (war, disaster) measles will be a worse problem than ever before. There are a huge number of adults that have accumulated who are not immune because the antibodies waned, or the vaccine didn’t take to begin with.[14, 15]

    It may not even take a disruption in supply lines either, since vaccination rates just below the eradication threshold are expected to cause a “honeymoon period” followed by a large rebound epidemic.[16] Basically the current strategy is the worst possible one in theory.

    2) When I compare the complication rates of measles just before vaccinations were introduced[17,18] to those of MMR[19], they are very similar.

    Rates of complications:
    Ear Infections
    Measles_1967 = 0.025
    MMR_2018 = 0.015

    Pneumonia
    Measles_1967 = 0.038
    MMR_2018 = 0.1

    Encephalitis
    Measles_1967 = 0.001
    MMR_2018 < 0.0007? (febrile convulsions = .002)

    Mortality
    Measles_1967 = 0.0002
    MMR_2018 < 0.0007

    Fever
    Measles_1967 ~ 1.0 (assumed)
    MMR_2018 = 0.3

    Rash
    Measles_1967 ~ 1.0 (assumed)
    MMR_2018 = 0.25

    Regarding fever and rash for measles, I assumed 100% since these are classic symptoms. However, I don't think anyone knows the rate since an "easy" case of measles would never have been reported.

    [1] https://www.ncbi.nlm.nih.gov/pubmed/15106083
    [2] http://www.cdc.gov/mmwr/preview/mmwrhtml/00001225.htm
    [3] http://www.cdc.gov/mmwr/preview/mmwrhtml/00025629.htm
    [4] http://www.cdc.gov/mmwr/preview/mmwrhtml/00047449.htm
    [5] http://www.ncbi.nlm.nih.gov/pubmed/6751071
    [6] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2134550/
    [7] https://www.ncbi.nlm.nih.gov/pubmed/15106109
    [8] http://www.ncbi.nlm.nih.gov/pubmed/17609829
    [9] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1919891/
    [10] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1522578/
    [11] https://www.ncbi.nlm.nih.gov/pubmed/6939399
    [12] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1228954/
    [13] https://www.ncbi.nlm.nih.gov/pubmed/15106085
    [14] https://www.ncbi.nlm.nih.gov/pubmed/15106101
    [15] https://www.ncbi.nlm.nih.gov/pubmed/22966129
    [16] http://www.ncbi.nlm.nih.gov/pubmed/12176860
    [17] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1815949/
    [18] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1815980/
    [19] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6343620/

    • Assuming your figures are correct:
      the mortality difference is 5/10000 in favour of MMR
      US population 300 million odd
      equals 150 000 excess deaths without MMR

      The other outcomes are less important.

      (PS: The pneumonia rate for MMR is implausible)

      • Assuming your figures are correct:
        the mortality difference is 5/10000 in favour of MMR
        US population 300 million odd
        equals 150 000 excess deaths without MMR

        The MMR value was an upper bound. The study only included ~1500 children and none died. Also, there should be uncertainty surrounding all those numbers in a final analysis. It is possible that with more data we would find MMR mortality rates to be far less.

        It was based on this:
        Measles:

        The most numerous were severe affections of the respiratory tract (38 per 1,000)

        https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1815949/

        MMR:

        A total of 51.4% (95% CI: 48.5%, 54.3%) and 48.4% (44.3%, 52.6%) of children in the MMR-RIT and MMR II groups, respectively, reported unsolicited adverse events (AEs) during the 43-day post-vaccination period; upper respiratory tract infection (9.5% and 12.8%) and diarrhea (8.2% and 8.0%) occurred most frequently.

        https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6343620/

        So you are correct, pneumonia would be lower respiratory tract. Also, I assumed “severe respiratory tract” meant pneumonia.

        I corrected the table. There was also a typo, 1967 should be 1963.

        Rates of complications:
        Ear Infections
        Measles_1963 = 0.025
        MMR_2018 = 0.015

        Respiratory tract afflictions
        Measles_1963 = 0.038 (“severe affections of the respiratory tract”)
        MMR_2018 = 0.1 (“upper respiratory tract infection”)

        Encephalitis
        Measles_1963 = 0.001
        MMR_2018 < 0.0007? (febrile convulsions = .002)

        Mortality
        Measles_1963 = 0.0002
        MMR_2018 < 0.0007

        Fever
        Measles_1963 ~ 1.0 (assumed)
        MMR_2018 = 0.3

        Rash
        Measles_1963 ~ 1.0 (assumed)
        MMR_2018 = 0.25

        Another thing to keep in mind is that for the MMR study the values reflect the rate of occurrence within 6 weeks of receiving the vaccine; there is no attempt to attribute it to MMR or not (they attempt to do this elsewhere in that paper but it is highly subjective).

        For the measles data, doctors were sent a letter a month after the measles notifications and given two weeks to respond. So again it was (conveniently) 6 weeks. In this case the doctors were expected to subjectively make an attribution of caused by measles or not, so it is possible that those values underestimate the true occurrence.

        But my problem is more that I am having to make a table like this from such messy info to begin with… This should have been the data collected in the first place to determine if the vaccine was desirable.

      • the mortality difference is 5/10000 in favour of MMR

        I have another pending comment but forgot to mention that you have this reversed. The upper limit on MMR mortality (from that study) is higher than the measles mortality rate.

  46. Of possible interest:

    I learned today that some sections of the US Food and Drug Administration do allow Bayesian techniques for analyzing clinical trials.

    However, even though Bayesian methods are used, the researchers are required to provide Type I error rates and power. (They do this by simulations, though.)
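
    As a minimal sketch of what that simulation exercise can look like (a toy single-arm design with invented numbers, not any actual FDA submission): simulate trials under the null and under an assumed effect, apply the Bayesian decision rule, and count how often it declares success.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy single-arm binary-outcome design with a Beta(1, 1) prior on the response rate.
# Declare success if Pr(rate > p0 | data) > 0.975. All numbers are invented.
n, p0, p1 = 60, 0.20, 0.35
n_sims = 20000

def trial_succeeds(true_p):
    y = rng.binomial(n, true_p)
    posterior = stats.beta(1 + y, 1 + n - y)      # conjugate posterior
    return posterior.sf(p0) > 0.975               # Pr(rate > p0 | y)

type1 = np.mean([trial_succeeds(p0) for _ in range(n_sims)])
power = np.mean([trial_succeeds(p1) for _ in range(n_sims)])
print(f"simulated type I error ~ {type1:.3f}, power ~ {power:.3f}")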

    • Yes, as from memory Don Berry argued that in a regulatory agency frequency concerns (how often you pass what you shouldn’t or don’t pass what you should) take precedence over any theory.

      The guidance has been publicly available since about 2010, and when it first came out I was very concerned the simulations would not be feasible. At the time, they were only reasonably feasible when assuming conjugate priors.

      • I was told (no personal experience) the primary consideration is how many drugs (or what percent of applications) were approved the previous year and the prevailing politics. Do you want to be “fostering innovation”, or “ensuring patient safety”? There are limits to how much innovation can be fostered and patient safety ensured, though; too large a deviation in either direction raises eyebrows in Congress. It looks like the acceptable range is 15-60 new drugs/year:

        http://fingfx.thomsonreuters.com/gfx/rngs/USA-DRUGS/01M0110L20G/USA-DRUGS-01.jpg

  47. I would say that it is important to have a threshold, such as an acceptable p-value. We all know it is not 100% correct, just as when we give a ‘normal’ blood pressure as SBP/DBP <120/80 mmHg (using the 2017 US updated criteria), we know someone may have a higher BP than this as her/his normal BP. "Alpha and beta errors" already suggest that we should draw our conclusions carefully.
