Valentin Amrhein, Sander Greenland, and Blake McShane write:

We have a forthcoming comment in Nature arguing that it is time to abandon statistical significance. The comment serves to introduce a new special issue of The American Statistician on “Statistical inference in the 21st century: A world beyond P < 0.05”. It is titled “Retire Statistical Significance”---a theme of many of the papers in the special issue, including the editorial introduction---and it focuses on the absurdities generated by so-called “proofs of the null”. Nature has asked us to recruit “co-signatories” for the comment (for an example, see here) and we think readers of your blog would be interested. If so, we would be delighted to send a draft to interested parties for signature. Please request a copy at retire.significance2019@gmail.com and we will send it (Nature has a very strict embargo policy so please explicitly indicate you will keep it to yourself) or, if you already agree with the message, please just sign here. The timeline is tight so we need endorsements by Mar 8, but the comment is short at ~1500 words.

I signed the form myself! I like their paper and agree with all of it, with just a few minor issues:

– They write, “For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss.” I’d remove this sentence, first because the connection to the coin toss does not seem clear—it’s a cute mathematical argument but I think just confusing in this context—second because I feel that the whole p=0.03 vs. p=0.06 thing (or, worse, p=0.49 vs. p=0.51) is misleading. The fundamental problem with “statistical significance” is not the arbitrariness of the bright-line rule, but rather the fact that even apparently large differences in p-values (for example, p=0.01 and p=0.30 mentioned later in that paragraph) can be easily explained by noise.
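This is the familiar point that the difference between “significant” and “not significant” is not itself statistically significant. A minimal sketch with invented numbers (two estimates of the same effect, 25 and 10, each with standard error 10):

```python
from math import erf, sqrt

def two_sided_p(est, se):
    """Two-sided p-value for a normal estimate with known standard error."""
    z = abs(est / se)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Two hypothetical studies of the same effect, each with standard error 10
# (these numbers are invented for illustration):
p1 = two_sided_p(25, 10)  # "significant"
p2 = two_sided_p(10, 10)  # "not significant"

# The difference between the two estimates, with its own standard error:
p_diff = two_sided_p(25 - 10, sqrt(10**2 + 10**2))
print(p1, p2, p_diff)  # roughly 0.012, 0.32, 0.29: the two studies differ only by noise
```

One study clears the 0.05 bar and the other misses it badly, yet the gap between the estimates themselves is entirely unremarkable.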

– Also in that paragraph they refer to two studies with 80% power. This too is a bit misleading, I think: People always think they have 80% power when they don’t (see here and here).

– I like that they say we must learn to embrace uncertainty!

– I’m somewhat bothered about this recommendation from their paper: “We recommend that authors describe the practical implications of all values inside the interval, especially the observed effect or point estimate (that is, the value most compatible with the data) and the limits. All the values between the interval’s limits are reasonably compatible with the data.” My problem is that in many cases of forking paths and selection, we have no good reason to think of *any* of the values within the confidence interval as reasonable. For example, that study of beauty and sex ratio which purportedly found an 8 percentage point difference with a 95% confidence interval of something like [2%, 14%]. Even 2%, even 1%, would be highly implausible here. In this example, I don’t think it’s accurate to even say that values in the range [2%, 14%] are “reasonably compatible with the data.”

I understand the point they’re trying to make, and I like the term “compatibility intervals,” but I think you have to be careful not to put too much of a burden on these intervals. There are lots of people out there who say, Let’s dump p-values and instead use confidence intervals. But confidence intervals have these selection problems too. I agree with the things they say in the paragraphs following the above quote.

– They write that in the future, “P-values will be reported precisely (e.g., P = 0.021 or P = 0.13) rather than as binary inequalities.” I don’t like this! I mean, sure, binary is terrible. But “P = 0.021” is, to my mind, ridiculous over-precision. I’d rather see the estimate and the standard error.

Anyway, I think their article is great; the above comments are minor.

Key point from Amrhein, Greenland, and McShane:

We don’t mean to drop P-values, but rather to stop using them dichotomously to decide whether a result refutes or supports a hypothesis.

Also this:

The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy, and business environments, decisions based on the costs, benefits, and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to further pursue a research idea, there is no simple connection between a P-value and the probable results of subsequent studies.

Yes yes yes yes yes. See this other paper of ours for further elaboration of these points.

**P.S.** As noted above, I signed the petition and I recommend you, the readers, consider doing so as well. That said, I fully respect people who don’t like to sign petitions. Feel free to use the comment thread both to discuss the general idea of retiring statistical significance, as well as questions of whether petitions are a good idea . . .

I’m curious about this:

“there is no simple connection between a P-value and the probable results of subsequent studies”???

If we were to look back at Phase 2 and Phase 3 clinical trials, would we really see no “connection” whatsoever (i.e., no significant association) between the p-value obtained in the phase 2 (primary endpoint) and the eventual outcome of the phase 3?

Harlan

Harlan:

No *simple* or direct connection. The probable results of subsequent studies depend on lots of things, not just the p-value. But many users seem to believe that the p-value alone can tell you something about how likely a study is to replicate.

I refer to it fondly as the P-etered Out Value. hee hee

Got a good chuckle out of me.

Thanks. I’ve got a few other choice labels. lol

I agree that there are many other factors (“on lots of things”), but I still disagree with the wording “simple connection.” If a Phase 2 study has a p-value of 0.87 (and a large enough sample size), I’m not so sure I want to invest in the Phase 3. Sure I’ll also look carefully at “lots of things” before making my investment decision, but I think I’d be much more likely to invest following a Phase 2 with a p-value of 0.01 (all else being equal). Does this make me a bad investor… statistically speaking?

Harlan

Applying a superior understanding to investing in pharmaceutical stocks is more complicated. Since the FDA and other investors make decisions based on statistical significance, you are really betting on the company’s ability to “get significance” rather than be correct or develop an actually useful drug. So the better thing to bet on is stuff like “discovering bad side effects”.

As a case study, look at the CRISPR hype that peaked in the middle of last year. Anyone who actually understood the method knew they must be selecting for mutant cells. It was only a matter of time before a paper was published on that and crashed all those stocks. But you also then missed the prior pump… E.g., https://www.tradingview.com/chart/?symbol=NASDAQ%3ACRSP

Do you have a link to a paper describing the issue you’re talking about? I mostly have ignored the CRISPR hype but would be interested in reading up on whatever it was that you think was published around that time.

The main one was here:

https://www.ncbi.nlm.nih.gov/pubmed/29892062

The pop-sci media covered it:

https://www.scientificamerican.com/article/crispr-edited-cells-linked-to-cancer-risk-in-2-studies/

Harlan:

“All else being equal,” sure. But all else is not equal. P-values from different studies are compared all the time. And, even within a study, there’s often nothing like “all else being equal” when comparing different results. If you want to play the “all else being equal” game, then a study with a p-value of 0.35 is better than a p-value of 0.38. The p-value is just a summary of data, and if all else is equal you can use a summary to rank. But this has nothing to do with statistical significance, which is all about taking p-values and using them to treat patterns as real or not.

I wonder if p-values of past research, especially seminal work, influence the prior belief of subsequent researchers. In experimental setups, this implies that it is possible that subsequent researchers design experiments that “look” to confirm previous findings. Now imagine a situation when some of this seminal work emanating from a particular researcher gets discredited (for example, a researcher from Cornell). I wonder how we should view subsequent work that confirm and build on the findings/effects of the earlier discredited work?

In particular, if one is conducting a meta-analysis of such a finding, I wonder how we can capture and control for this strong prior belief (that is proven to be false now).

I’m all for it if only because it will at last cause courts to do what they used to do but have avoided doing for 40+ years. As part of their public policy / common law generative process they’ll have to explicitly do cost/benefit analyses. For decades now they’ve shirked their duty by ruling “if p is less than .05 then it’s up to the jury to decide”. If a jury decides that a 1 in 10,000 risk is outrageous or enough to send a man to death row then so be it, so long as p-of-causation (the inverse probability trap which is what courts tend to believe or at least enforce) is greater than or equal to 1-p. When finally forced to decide in the face of uncertainty (rather than the faux certainty which p has provided) they’ll hopefully return to the wisdom and humility found in opinions written 100 years ago.

Good points.

Here’s the question: what proportion of the population of statisticians and researchers must sign the petition for us to conclude definitively that a majority of our community agrees with it? Without knowing the answer to that, we can’t possibly make the yes/no decision of whether to ban statistical significance!

Just get a non-random sample of statisticians and researchers, estimate the proportion, and calculate a Bayes Factor. Easy-peasy.

What – no pretest to ensure they actually read and understood the paper?

I completely agree that we should take a more thoughtful approach, and this is why: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212302. This paper was published last week and explains how we can replace P-values with frequentist posterior probabilities of replication—where the possible parameter values must have uniform marginal prior probabilities. We can then take into account other factors such as methodological issues and Bayesian distributions and use the probabilities in decision analysis if we wish.

The p-value should simply be used as a descriptive parameter that combines power and effect size. It should be used as one argument among others in the scientific debate. The claim that it serves as the silver bullet for inferring theoretical truth (or the “realness” of effects) directly from the data is obsolete.

This is great! I really like your comment on reporting “precise” p-values, though. It always drives me nuts that in most online homework systems, the authors require students to enter their calculated p-value estimates down to like 3 or 4 sig. figures. I understand that they do it to make sure the students follow the exact algorithm for finding the p-value estimate that is given in the book, including some arbitrary rules on using continuity correction and so on. But it sends a terrible message! Even most WeBWorK exercises are doing that.

I agree! We can always ask students for an estimate & standard error instead. Or for a confidence interval. Or for a test statistic. Arbitrary precision is always silly, but as you say it sends a particularly bad message when applied to p-values. And even for p-values, I don’t think we really need to demand 3-decimal precision to make sure students did the work correctly. Small mistakes will result in huge differences in p-values. Probably this level of precision is used so that auto-graded questions using random numbers can be given and students can’t copy their classmates’ numbers.

I think we have to teach p-values in these classes, as they are ubiquitous. But we don’t have to teach students to engage in reading of p-value tea leaves.

It would be helpful to have a one-sentence example of how results would be reported under these guidelines.

Take the following example: “Our study finds a statistically significant negative relationship between carrot consumption and ankle cancer (p = .00214).” How would this result be summarized in one sentence under the proposed guidelines?

“The odds of ankle cancer in carrot eaters is 2.1 times that in non carrot eaters (95% CI: 1.2 – 3.1)”

Adding comments on the practical implications of values within the interval is much, much harder, at least in many areas of science. I did this in a paper on performance trade-offs (negative correlations) between endurance and sprint events among Olympic decathletes and used the values in the CI to estimate the difference in place one would achieve in the Olympic 100 m final given a certain level of performance in the 1500 m.
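The one-sentence summary above can be produced mechanically. A sketch of a Wald-type interval for an odds ratio; the 2x2 counts below are invented for illustration (the thread gives no underlying data), so the interval only roughly resembles the one quoted:

```python
from math import exp, log, sqrt

# Hypothetical 2x2 table; these counts are invented for illustration (the
# thread does not give the underlying data), so the interval below only
# roughly resembles the one quoted in the comment.
#                 cancer   no cancer
a, b = 34, 1000   # carrot eaters
c, d = 16, 1000   # non carrot eaters

or_hat = (a / b) / (c / d)               # odds ratio
se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)  # Wald standard error on the log scale
lo = exp(log(or_hat) - 1.96 * se_log_or)
hi = exp(log(or_hat) + 1.96 * se_log_or)
print(f"OR {or_hat:.1f} (95% CI: {lo:.1f} - {hi:.1f})")
# prints: OR 2.1 (95% CI: 1.2 - 3.9)
```

The interval is computed symmetrically on the log-odds scale and then exponentiated, which is why it is asymmetric around the point estimate.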

Thanks.

I would word it as

“The odds of ankle cancer in carrot eaters is 1.2 – 3.1 times (95% CI) that in non carrot eaters”

There is [usually] nothing special about the middle of the interval and there is no need to highlight it.

I like your re-wording because it nudges the author (and the reader) to think about the range. That said, *conditional on the data*, isn’t the point estimate “special” in the sense that it maximizes the likelihood? Or as Andrew quoted above “the value most compatible with the data”.

> isn’t the point estimate “special”

How special it is depends highly on how it (point estimate and interval) is constructed, but once that is worked through:

1. It’s like reporting too many decimal places, in that its higher compatibility might be only slightly greater than that of values in a large interval around it.

2. It might be very incompatible with background knowledge.

The downside is people ignoring other worrisome or encouraging parameter values that are reasonably compatible, or important background knowledge.

Andrew, I puzzle a bit about this statement:

I don’t think it’s accurate in that case to even say that values the range [2%, 14%] are “reasonably compatible with the data.”

How would you phrase it then? Cannot the CI be compatible with the data, yet provide an utterly unlikely range for the parameter in light of prior knowledge? I can see why one would object to “reasonable range for the parameter.”

In longer drafts without such word limitations, this “reasonably compatible with the data” would likely be expanded to “reasonably compatible with the data and all the assumptions brought to bear or implied in the analysis.”

Even data entry errors can make silly parameter values seem compatible. In Andrew’s example, the implied prior that would result in the [2%, 14%] compatible interval is not compatible with background knowledge.

It is a problem that many readers will likely interpret the compatible interval as being far more certain than they should, but the authors got to start somewhere and in this case with limited words.

What has puzzled me for 3 years are the subtle contradictions I have found in appraisals of statistical significance, reproducibility, and P-values. The mix of standard and non-standard definitions can confuse a reader. I was relieved to find this article, for example, for definitional clarity.

https://www.semanticscholar.org/paper/What-does-research-reproducibility-mean-Goodman-Fanelli/3728c289e3582f6329a735b3e9882b2a0cabad83

I am particularly sympathetic to the criticism that most journal articles are filled with jargon and uninteresting or dubious claims.

In part, this is due to the fact that much research is conducted in narrow environments: labs, the internet, offices, etc. This narrows the researchers’ lens. Research efforts appear to be siloed. Researchers are also constrained by the dictates of their institutions.

In my experience, physicians, due to their frequent contacts with patients, are able to communicate their viewpoints with greater clarity, which is why, for example, Steven Goodman and John Ioannidis can convey patients’ treatment outcomes so well, notwithstanding the fact that they are excellent writers.

We rarely hear patients’ perspectives who can offer some very useful insights, for obvious reasons. They are reaping the consequences of trials and treatments. They are not encumbered by the institutional constraints imposed on researcher communities.

Finally, to Keith’s point, I think refining some of the terms will lend greater clarity. But that is probably not an option we have.

Re: We rarely hear patients’ perspectives THAT [not who] may offer some very useful insights, for obvious reasons. Apologies!

Thomas, Keith:

That interval of [2%, 12%] represents values compatible with *some* of the data, but not all of the data or even most of the data.

Consider this thought experiment. You roll a fair die 100 times and look at all your data. You then decide to test whether the die is more likely to come up 1 than 2, and here is what comes up: your 95% confidence interval for the proportion of 1's, among 1's and 2's, is [0.57, 0.88]. Are these the numbers "reasonably compatible with the data"? I don't think so. I think 0.5 is also reasonably compatible with the data---all the data, that is.
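A simulation along the lines of this thought experiment shows the mechanism. This is a sketch with one assumed selection rule: the comparison is chosen after seeing the data, by taking the most and the least frequent faces (the way a 1-vs-2 test gets picked because 1 came up often and 2 rarely):

```python
import random

random.seed(1)

def post_hoc_excludes_half():
    """Roll a fair die 100 times, then (after looking!) compare the most
    frequent face with the least frequent face, mimicking how the 1-vs-2
    comparison in the thought experiment is chosen after seeing the data."""
    rolls = [random.randint(1, 6) for _ in range(100)]
    counts = [rolls.count(face) for face in range(1, 7)]
    hi, lo = max(counts), min(counts)
    n = hi + lo
    p = hi / n
    half_width = 1.96 * (p * (1 - p) / n) ** 0.5  # normal-approximation 95% CI
    return p - half_width > 0.5  # CI excludes the true proportion, 0.5

excluded = sum(post_hoc_excludes_half() for _ in range(10_000)) / 10_000
print(excluded)  # far above the nominal 2.5% one-sided error rate
```

The die is fair, so the true proportion for any pair of faces is 0.5; yet the post-hoc interval excludes 0.5 far more often than the nominal error rate suggests, because the comparison was selected using the data.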

Let me put it another way. A 95% confidence interval is a statistical procedure which, if applied repeatedly, will on average include the true parameter value 95% of the time. Fine (conditional on the model). But . . .

1. That probability statement is about the entire stream of intervals. It does not in general hold under selection.

2. The definition of confidence interval says nothing about "compatibility." That's fine too, but it suggests that if we really want a theory of compatibility intervals, we should move away from thinking about coverage and instead think more about what's in the interval. There's nothing at all in the theory of confidence intervals that requires compatibility.

So, if we want to define and work with compatibility intervals, I don't think we should take confidence intervals as a starting point. Instead we should start with estimates and standard errors. Or something like that.
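Point 1 above can be checked with a short simulation: unconditionally the intervals cover at the nominal rate, but conditional on "statistical significance" they do not. A sketch with an assumed true effect of 2 and standard error 10 (numbers invented for illustration):

```python
import random

random.seed(0)

# Assumed setup for illustration: true effect 2.0, standard error 10.0,
# standard 95% intervals (estimate +/- 1.96 SE).
TRUE_EFFECT, SE, N_SIMS = 2.0, 10.0, 100_000

covered_all = covered_sig = n_sig = 0
for _ in range(N_SIMS):
    est = random.gauss(TRUE_EFFECT, SE)
    lo, hi = est - 1.96 * SE, est + 1.96 * SE
    covers = lo < TRUE_EFFECT < hi
    covered_all += covers
    if lo > 0 or hi < 0:  # interval excludes zero: "statistically significant"
        n_sig += 1
        covered_sig += covers

print(covered_all / N_SIMS)  # about 0.95, as the procedure guarantees
print(covered_sig / n_sig)   # far below 0.95 once we condition on significance
```

The 95% guarantee is a property of the whole stream of intervals; the subset that survives a significance filter covers the truth much less often, which is exactly the selection problem.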

This is all closely related to why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests.

Thanks! Of course, if you cherry-pick an outlier result and then do the estimation/test, it’s absurd.

(I’m guessing that is the case of the sex ratio study).

Re CIs, I agree that “compatibility” is not implied by a given coverage – e.g. a one sided 95% CI that goes to infinity will include some very incompatible values. But for “well-behaved” situations, estimators that obey the CLT, symmetrical CIs, how do they differ from a point estimate and standard error, the solution you like? Not trying to argue, just interested in the distinction.

Like Thomas, I am unsure what to make of Andrew’s comment on my comment – starting with a known pseudo-random number generator, bringing in selection, and pointing out that confidence intervals obtained by some constructions in some situations can be silly.

My comment was about limited space making it challenging for the authors to be fully clear about all the nuances and providing enough qualifications to mostly rule them out.

But again, the authors got to start somewhere and in this case with limited words.

And starting with the “Sometimes you can get a reasonable confidence interval by inverting a hypothesis test. For example, the z or t test or, more generally, inference for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above.” [from Andrew’s link].

Now I am anticipating Bayesian credible intervals will be re-cast into [more informed] compatibility intervals and so I am taking the confidence intervals are a start not a finish.

Being less lazy this morning, I found a concise summary from a longer paper – Amrhein V, Trafimow D, Greenland S, “Inferential statistics as descriptive statistics”: https://peerj.com/preprints/26857/

Consider that a 95% confidence interval encompasses a range of hypotheses (effect sizes) that have a P-value exceeding 0.05. Instead of talking about hypothetical coverage of the true value by such intervals, which will fail under various assumption violations, we can think of the confidence interval as a “compatibility interval” (Greenland 2018a,b), showing effect sizes most compatible with the data according to their P-values, under the model used to compute the interval. Likewise, we can think of a posterior probability interval, or Bayesian “credible interval,” as a compatibility interval showing effect sizes most compatible with the data, under the model and prior distribution used to compute the interval (Greenland 2018a).

If we restrict ourselves to coverage-based interpretations, then, as far as I can tell, there’s no reason to report CI limits at all. They’re basically meaningless, and guarantees about the *procedure* are all that matter. Coverage ensures very general properties and no practical use at all.

Under the inverted-test interpretation, all values within a CI are compatible with the data (assuming no selection, and given design, measurement, and modeling choices) in the sense that they would not be rejected at the given alpha level under the assumption that the point estimate of the test statistic is the true population value. As argued in that linked post, this approach has more limited generality. But when it *is* valid, it enables sample-specific probability statements, which seems pretty useful to me.

> Coverage ensures very general properties and no practical use at all.

Agree and formal definitions of what constitutes a CI do make that clear.

In some of the authors’ longer papers, it is explained how compatibility intervals can sometimes have nominal confidence coverage given a long list of caveats, but that is certainly not something to rely on.

(Also see quote I added above)

Discussing the practical implications of a point estimate and all values within the CI is maybe the biggest reason to abandon reporting results as “statistically significant” (or not) in biology. In general, we don’t have a clue what the biological consequences of different effect sizes are, and I think a big reason why is that we are trained to do simple experiments with inferential tests instead of thinking about mechanistic models of the system that make quantitative predictions that could then be evaluated by, among other things, experiments that estimate effects and uncertainty. Or quantitative models of how variation in the outcome affects different aspects of the biology.

A dichotomous “significance/non-significance” may also constrain the way we think about the design of experiments. By this I mean that we are trained (implicitly by reading the literature that dichotomizes into “effect” or “no effect”) to ask “does X have an effect on Y” instead of “how does the effect vary?” For example, many experiments in biology dichotomize a continuous factor such as Temperature or pH (low T vs. high T treatments) instead of using a regression design with multiple T levels that allows the estimation of how the response changes over a range of T. We know T changes everything in biology — I don’t need an experiment to tell me that. What we don’t know is how the magnitude of the effect varies as a function of T.

Yes — dichotomies really constrain thinking (as well as seeing the real world). Yet so many people seem to think in terms of them. I think education in general needs to focus more on continuous rather than dichotomous thinking.

Thank you, Martha,

I know you do not mean that we should, therefore, think exclusively in a ‘continuous’ state.

I do not think ‘continuous’ sufficiently captures the thinking frame that might need to be cultivated, in most queries I’ve come across.

Admittedly, we are far from comprehending how the mind works, more generally. And here, I speculate that one’s fluid and crystallized intelligence matter, regardless of training. Some people may have better problem-solving abilities, therefore. And who’s to say that it was because of a focus on ‘continuous’ modeling?

‘Continuous’ in the ordinary English meaning of the term: forming an unbroken whole, without interruption. I don’t think that can apply to every context.

di·chot·o·my /dīˈkädəmē/

noun (plural: dichotomies)

a division or contrast between two things that are or are represented as being opposed or entirely different: “a rigid dichotomy between science and mysticism”

synonyms: division, separation, divorce, split, gulf, chasm; difference, contrast, disjunction, polarity, lack of consistency, contradiction, antagonism, conflict; rare: contrariety. “there is a great dichotomy between social theory and practice”

Botany: repeated branching into two equal parts.

Maybe we need to cultivate new descriptives.

Just saw this after writing a Quora comment on it. Was happy to see that Andrew highlighted the same points. This binarization into “true” and (worse) “false” has got to start with journal editors proscribing language along those lines and insisting on a more nuanced discussion of the evidence, along with the model that generated the inferences. Confidence intervals are a small step in the right direction, but I’m concerned people might think they are *more precise*, when they are anything but.

https://www.quora.com/A-forthcoming-comment-in-Nature-argues-that-it-is-time-to-abandon-retire-statistical-significance-see-link-Do-you-agree-with-this-argument/answer/Fred-Feinberg

Andrew,

I was also going to suggest the removal of the comparison to a ‘coin toss’. There are several others that may capture the insight intended, which I can propose after the article is published so as to not give away any substantial portion of the paper beforehand.

My only qualm is with your appeal to ‘embrace uncertainty’. Of course, that is a given. But my view is that uncertainty could have been reduced had there been a robust appreciation of the conflicts of interest undergirding some efforts. And that was a core premise of the evidence-based medicine movement. And here we are, in 2019, with evidence-based medicine serving as a very effective marketing tool for discoveries. I don’t see how anyone can miss this. And that is why we have to rethink several current approaches, diligently and with untold creativity.

Fred,

I appreciate your caution about the use of ‘confidence intervals’.

My introduction to p-values included the observation that the value 0.05 originated from R. A. Fisher’s instinct. People have gone to jail because of Ronald Fisher’s gut. Sign!

The legal implications are one thing that convinced me that you can’t ignore Bayes. See http://www.dcscience.net/2016/03/22/statistics-and-the-law-the-prosecutors-fallacy/

Is there any statistic which:

(1) enjoys nominal frequentist properties that survive multiple (potential as well as actual) comparisons?

(2) cannot be arbitrarily dichotomized?

(3) by itself tells us everything we want to know about the data?

(4) is relevant to every research question?

(5) is not commonly misinterpreted by non-statisticians?

(6) cannot be portrayed as evidence against straw man hypotheses?

(7) comprehensively measures reproducibility?

(8) appropriately incorporates all prior information?

If not, then perhaps the problem isn’t a statistic (e.g., the p-value), or even the way that statistic is commonly used (e.g., evaluating “statistical significance”), but the incentives researchers face in many disciplines, enforced by editorial boards and academic committees, originating in the need to distinguish productive researchers from unproductive researchers, and promising research programs from dead-end research programs.

You can’t improve on an equilibrium by directly changing some aspect of it. You instead have to change the incentives that induce the equilibrium. If this effort succeeds, I will expect researchers to simply switch from reporting p-values and evaluating statistical significance to reporting r-values and evaluating statistical schmignificance.

Would the following statistic satisfy your 8 requirements? It would be the frequentist probability of replication for a range of parameters bounded by null hypotheses and conditional on the study data alone. This simple initial analysis would correspond approximately to P, 1−P, and confidence intervals. It would be a non-dichotomised posterior probability that reflected uncertainty. In order to complete the statistical analysis, the conditional evidence for the posterior probability can be expanded (e.g. to include Bayesian distributions and methodological issues) and the range of parameters changed (e.g. by narrowing or widening them). The details are given in an open-access paper published last week [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212302#references].

The generalized likelihood ratio.

The statement clearly does not go far enough. But it is a serious step in the right direction, and probably the largest leap that the research community as a whole would be receptive to at this point. We should not let the perfect be the enemy of the good.

I signed.

I’m afraid that this petition is meaningless. It just constitutes a summary of the perspective of a biased subpopulation of the statistics/research community. It does not constitute consensus, nor does it constitute a representative vote.

Publishing the survey would be equivalent to (for instance) having Fox News publish an online petition (announced during an episode of Tucker Carlson Tonight) supporting teaching religion in public schools. It would only represent the viewers of that show, and only those viewers motivated to participate in the survey. It would say nothing about the attitudes of the broader population.

If we did want to present a representative survey, what community should be surveyed? Statisticians? Biostatisticians? Psychologists? Physicians? Whose opinion counts?

I would certainly agree that publications should have a more nuanced discussion of the uncertainty of the conclusions. This is not the way to make it happen.

Clark:

I don’t think anyone is presenting the letter of support, or petition, or whatever we call it, as a representative sample or consensus of any community. I fully respect people who don’t believe in signing petitions, but I think it’s silly to say a petition is “meaningless” because it is not a representative sample, which is something it never claimed to be.

I wish I could sign on to the opposite of this petition. All this will accomplish is to sow confusion, and it would likely make science worse due to the lack of guidelines for researchers to follow. Significance is not a problem; it is a feature. Some domains use 2 sigma (e.g. medicine), others use 6 (e.g. physics), but in almost all cases there needs to be a binary decision made. How much expensive Large Hadron Collider data do you need to collect before you can declare that you have found a particle?

Even the authors don’t explain how to reconcile their position with the need for binary decisions; after all, if you are giving up significance, you should get rid of intervals and show the entire confidence (or posterior) distribution.

Ian Fellows, where have you seen that “in almost all cases there needs to be a binary decision made”? Sorry, but this is nonsense to me. Have you read Amrhein et al.’s comment fully? They precisely make the case that, when publishing a paper, a binary decision very rarely has to be made.

Off the top of my head:

1. At a minimum someone has to make the binary decision of whether the strength of the evidence is sufficient to publish the paper in the journal under consideration.

2. Authors typically need to make the binary decision of whether to claim “there exists strong evidence of an effect” or “there exists an effect of at least size ___”.

3. Funders need to make the binary decision of whether it is scientifically necessary to run another study to determine the veracity of a reported effect.

4. As a reader I need to make the determination of whether my sick daughter will respond better to a drug treatment vs. doing nothing.

Should raw p-values be reported? Of course, always. Should we have community standards for degree of evidence? I’d also argue: Yes absolutely.

And no I have not read the comment fully. I can’t find it anywhere. Do you have a link?

1) The journal should balance the benefit they get from publishing a possibly interesting correct paper vs the cost they have of publishing a potentially incorrect paper. p values don’t help with this.

2) p values do not establish “there exists strong evidence of an effect”; there are huge numbers of examples this blog has gone over where small p values were shown for effects that don’t exist (power posing, ovulation and voting, etc.)

3) Funders should balance the cost of doing more research vs the benefits of improving knowledge of this topic expected to be acquired from that research. p values don’t do this

4) You need someone to publish a predictive model of treatment outcomes and to show that it is effective out-of-sample and provides useful posterior distributions over the predictive quantities of interest, including primary outcomes and risks of side effects, etc. A p value for “is this treatment the same as placebo” will not do that.

I disagree on all points. p-values help evaluate evidence in all of these cases.

Regarding (2) the problem with all of those studies had nothing to do with hypothesis testing or p-values. The garden of forking paths != NHST, and I think that the focus on NHST has blinded many to the real (and more difficult to address) problems at the heart of the replication crisis.

I would think you could “fork” your way to a better-looking posterior distribution but perhaps not quite so easily and (in many cases) unintentionally as with harvesting p-values. But good point.

Exactly. In fact, unless you specifically try to use methods that will guard your inferences against forking paths, it is pretty much just as easy to get bad bayesian inferences as it is to get bad frequentist inferences.

The Bayesian CLT tells us that asymptotically Bayesian inference yields a posterior equal to the frequentist (ML) confidence distribution, with posterior intervals ~= confidence intervals. So if you make bad frequentist inference, you’ll probably make bad Bayesian inference. The way around this from a Bayesian perspective is to take into account all of the potential inferences you’d do during the forking and put strongly informative priors on them. This is especially convenient to do if your inferences have hierarchical structure (e.g. estimating state effects) where you can infer the appropriate amount of shrinkage from the data. From a frequentist perspective, you can similarly enforce shrinkage on your estimates. Multiple comparison corrections or FDR corrections are another approach. Both of these solutions require careful accounting of all comparisons attempted, which is not how many researchers in some fields of study operate.
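The shrinkage idea above can be sketched with a toy method-of-moments (empirical-Bayes-flavored) calculation; the group labels, numbers, and equal-variance assumption are all invented for illustration:

```python
import statistics

# Hypothetical noisy per-group estimates (e.g. state effects from small
# samples), each with known and equal sampling variance s2 (an assumption).
estimates = {"A": 12.0, "B": -3.0, "C": 7.0, "D": 1.0, "E": -8.0}
s2 = 25.0  # assumed sampling variance of each raw estimate

grand_mean = statistics.mean(estimates.values())

# Method-of-moments estimate of the between-group variance tau^2:
# observed spread minus the part attributable to sampling noise.
total_var = statistics.variance(estimates.values())
tau2 = max(total_var - s2, 0.0)

# Shrinkage factor: how far each estimate is pulled toward the grand mean.
# tau2 = 0 gives complete pooling; tau2 >> s2 gives almost no pooling.
shrink = tau2 / (tau2 + s2)
pooled = {k: grand_mean + shrink * (v - grand_mean) for k, v in estimates.items()}
print({k: round(v, 2) for k, v in pooled.items()})
```

The point of the sketch is that the amount of shrinkage is inferred from the data rather than chosen by hand, which is what guards the extreme-looking estimates against being taken at face value.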

Ian:

Not trying to be a hobgoblin here, but have you obtained and read the commentary?

Also, a longer paper might address many of your criticisms https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/#comment-986215

Thanks very much for the link! Unfortunately I don’t feel that it addresses many of my issues.

(1) It spends a lot of time linking the replication crisis to a significance cut-off, with much ink spilled on the file drawer effect. I think that this is a fundamental misdiagnosis of the drivers of the crisis (at least in the social sciences). They do mention the garden of forking paths, but more in passing.

(2) That said, their actual recommendations on page 12 are pretty solid and do address the forking paths ((a) and (b)). And in fact, with the exception of (e) and perhaps (f), they are just the standard run-of-the-mill rules for doing research.

(3) p. 12 They report a classic error of a researcher accepting the null rather than failing to reject it, and somehow this is a problem with significance testing? This is exactly what you are supposed to NOT do when doing hypothesis testing, and most professional researchers do not fall for this stats 101 trap.

(4) They seem to be stuck on confidence intervals of 95%. Why not 99% or 80%? 95% confidence intervals only really make sense with respect to alpha <= .05 significance testing.

(5) I agree with them that more caveats should be present in papers regarding external validity and potential biases.

(6) I support their recommendation to describe both the existence AND the plausible magnitude of effects reported.

(7) They hate the word "significance," thinking that it connotes too much certainty when reporting effects. Well, whatever; I'm not going to argue the semantics. The real issue facing authors is when they can report that they have found an effect (e.g. a relationship, or treatment effect) that can't be explained by mere noise. When can a researcher write

"Compared with no exposure, the treatment improved on the outcome (p=XXX), with … description of effect sizes and such…"

and when should they say

"Compared with no exposure, we did not detect treatment improvement on the outcome (p=XXX), with … description of effect sizes and such…"

At least from my reading, they suggest using alpha <= .05 to determine which to use, just replacing the word "significant" with "possible effect sizes compatible with our data" (aka significance). A rose by any other name is still significantly similar to significance testing (p=.033).

Ian:

Fair enough, but I think you are expecting too much in a single paper. For instance, they did not mention or reference extending confidence intervals to p value functions.

But, for your point 4 this was in the appendix “Without further elements (such as an α-level cutoff) this observed P-value implies absolutely no decision, inference, bet, or behavior. What people make of p=0.04 in these or other practical terms requires additional contextual detail such as a loss function, acceptable error rates, or whatever else the analysis team can bring to bear (although usually it is just social conventions like α=0.05 and program defaults that determine what gets claimed).”

Keith:

I don’t know if I’m expecting too much. Maybe I am. I’d love for something smaller, tighter, and more focused, but for me this just ended up all muddled. They come out against dichotomizing and alpha=0.05, but then recommend dichotomizing the effect space into compatible and non-compatible. They come out against “nullism” (i.e. privileging the testing of a single H0 over all the other possible H0s), but then they strongly recommend reporting p-values since they are continuous and non-dichotomous.

I’m not sure what my takeaway is as to how I should change my analysis/reporting (according to them) other than a substitution of nomenclature.

re: “(3) p. 12 They report a classic error of a researcher accepting the null rather than failing to reject it, and somehow this is a problem with significance testing? This is exactly what you are supposed to NOT do when doing hypothesis testing, and most professional researchers do not fall for this stats 101 trap.”

Loads of professional researchers do this. Professional statisticians don’t do it. But researchers with limited statistical training but who nonetheless perform and report analyses do it all the time.

This is the main complaint of the paper: dichotomization is very dangerous in the hands of people who have limited understanding of how the methods they’re using work. We could say “that’s their problem”, but it’s our problem because it affects so much research that is of practical consequence.

It’s often said that other statistics can be used mindlessly too, and so “p < 0.05" is a scapegoat. I don't entirely agree. I think "p < 0.05" is particularly dangerous because so few people who use it as a standard of evidence understand the logic that underlies it. P-values are confusing; it's no surprise that they're so commonly misinterpreted.

It's also often said that if we get rid of significance we should get rid of confidence intervals too; after all they are simply inverted hypothesis tests. But again I disagree – at least with a CI the values are in terms of a variable or statistic of interest. Even if researchers go straight to "does this interval exclude zero?", they're at the same time forced to look at the endpoints of the interval. "p < 0.05", on the other hand, is mostly treated like a form of magic.

(To qualify my above statement, I work with researchers and grad students at a public university, who often talk to me precisely because they lack expertise. So my subjective experience is biased by that. At the same time I can’t count the number of papers I’ve read or presentations I’ve viewed that treat statistical significance as magic, and that interpret p > 0.05 as “no effect”).

Ben:

I think you’ve got some good perspectives there. I agree that many might fall for the trap even though you’d get marked wrong in a Stats 101 class for it. However, I do think it generally gets corrected along the way. It is pretty rare to see an error like this slip all the way to a published article.

Ben:

Good responses to Ian.

Replying to Ian Fellows comment “I agree that many might fall for the trap even though you’d get marked wrong in a Stats 101 class for it. However, I do think it generally gets corrected along the way. It is pretty rare to see an error like this slip all the way to a published article.”:

Ian, have you done a survey of the literature to check your claim? And if so, in what fields? Please give a cite.

Do you think the answer is uniform across journals and fields?

Our commentary cites such surveys in soft sciences and guess what? This error has been found in OVER HALF of the articles in some fields and journals.

Statistics is supposed to be about what can be said about realities based on real, collected data. Interesting then how many statisticians make claims of how wonderful this or that method is or how rare abuse is based on zero data about actual usage – in fact in matters like these it seems the absence of any cited data leads to more adamantly certain claims.

Sander:

Thank you for the response. I know it can be tough to have criticism of your work thrown at you. I did just get a copy of the comment. Thank you for sharing it. My previous comments were with regard to the paper that Keith linked to. I hope they are helpful. I did find the comment much tighter and more compelling.

It is shocking to me that >50% of papers report a non-significant test as the non-existence of an effect. I mean, I’ve taught low-level stats, and it is literally the first or second thing that is taught about testing, and it is on the exam. It certainly doesn’t comport with my experience reading the medical/epi literature. That said, I don’t see the cites in your commentary. The version that was sent to me today has:

“These and similar errors are widespread. Surveys of XXX articles across five journals in psychology, neuropsychology and conservation biology found that statistically non-significant results were interpreted as indicating ‘no difference’ in XXX% of articles.”

Ian: It can be tough to get criticisms of work where it is clear the critic did not read the criticized article, or made criticisms based on assertions that themselves lack supporting evidence.

In contrast, I solicit and appreciate genuine corrections. I even welcome opposing views when they are based on careful scholarship and logical argument from evidence (especially of the empirical sort, with properly interpreted statistics).

For survey citations see this open-access article, which we cite:

Amrhein et al., “The earth is flat (p > 0 . 05): significance thresholds and the crisis of unreplicable research”

https://peerj.com/articles/3544.pdf

especially

https://peerj.com/articles/3544/#p-47

It’s a long article, I think exceptionally researched and well worth reading in its entirety (much like the Hurlbert & Lombardi 2009 article they cite). If you haven’t the time for that, do a search on “survey” and you will find several examples. There is also an unpublished survey by them which I hope to see made available as supplementary material.

Thanks for the references.

(1) I did read your article, and your comment was not public. I requested it and read it, and the supporting information was not there. How nice of you to contrast my criticism with those stemming from logical argument! Not the greatest of looks, IMO.

(2) I will definitely have to dig into those references. It appears that I was mistaken on point 4 of the 7 points in my previous comment. If it is not possible to teach the first lesson of hypothesis testing to professional researchers, I do wonder if we are in a hopeless situation. Is there anything so foolproof that ignoring (or failing to learn at a basic level) how to use it will yield good results?

Ian: I was chiefly responding to your “However, I do think it generally gets corrected along the way. It is pretty rare to see an error like this slip all the way to a published article.” These kinds of assertions (stated as if factual, with no factual support) plague all fields and endeavors, but especially plague debates in statistics education and misuse. That seems rather ironic for a field that is supposed to be about what one can infer from data. Then too there were errors of fact (as you admitted for 4) and the problem (not your fault) that the version you saw did not include in the text the information about surveys of the problem (Nature decided to summarize that info in a figure, which unfortunately has not been available for distribution – they have a huge say in exactly how the presentation ends up).

The problems we discuss have been lamented for generations – we’ve found complaints about statistical significance as far back as 1919, before NHST dichotomania made it even worse. It’s interesting to speculate on why those problems grew worse and now continue a century later. Well one reason may be the excuses I’ve heard forever, like: ‘we teach it correctly, it’s the user’s fault’; and ‘abuse is an exception by a few bad players’. Both are false excuses – many books and instructors teach many of the misconceptions by (bad) example, and abuse is widespread. This is why we’ve opted to take a more aggressive approach, since thus far statistics reform has been proceeding at a geologic pace – showing that polite academic pleas by honored authorities have proven ineffectual (e.g., both Cox and Lehmann in their books advised reporting precise P-values, not inequalities or asterisks).

Resistance to our call is already forming among those who have built their papers, books, and entire careers on declarations of significance and nonsignificance and promotion of the same using the same fallacies we decry (“we must make decisions!” – sure, but on the basis of a p-value alone? please!). So expect to be entertained thoroughly as the battle is joined.

P.S. More surveys showing high prevalences of elementary misconceptions of tests and P-values are in the literature than I can cite – I’m sure Andrew and colleagues have done or at least can cite several. Gerd Gigerenzer has also done and cites several surveys, most recently (that I know of) “Statistical Rituals: The Replication Delusion and How We Got There”, Advances in Methods and Practices in Psychological Science 2018 vol. 1(2), 198–218, DOI: 10.1177/2515245918771329

A key point here is that the problem is not with P-values but with statisticians and statistics users, and cognitive biases that lead to the distorted interpretations, as many have discussed in print (e.g., Greenland “The need for cognitive science in methodology,” American Journal of Epidemiology vol. 186, 639-645, open access at https://doi.org/10.1093/aje/kwx259).

And now, unsurprisingly, we face intense resistance to reform by those who have built their careers around these distortions, as some of the early responses to our essay reveal.

Great comments. I wonder, have you come across this 2008 paper? https://pdfs.semanticscholar.org/dab3/f6246beb6e42e29dab81a0428e2b058d905d.pdf

Had not seen! Thanks much, the title looks dead on, will have to examine closely. Too bad it was in such an out-of-the-way location.

1) is a recipe for publication bias. Besides, the relevant criterion would more often be the width of the CI

Helene:

> relevant criterion would more often be the width of the CI

I ended up being an author on a paper that (unfortunately) made that argument. I tried to get it removed but was unable to and should have withdrawn as an author but was too worried about supporting my young family…

Any selection other than on research quality (which is extremely difficult to assess) will cause problems. I do understand journals and academia cannot work without selection but the further forward we can push selection and decision making in the process the better off science will be. But not journals and many senior members of academia.

As Sander points out, we should expect push back on pushing selection and decision making to later on in the process.

Even if it were true that you had to make a decision after nearly every study, why would statistical significance (with all of its clearly demonstrated drawbacks) be the way you’d choose to make it? It doesn’t help you make good decisions.

“When making a decision, I think it is necessary to consider effect sizes (not merely the possible existence of a nonzero effect) as well as costs.” Gelman (2013) Interrogating p-values. http://stat.columbia.edu/~gelman/research/published/gregfrancis2.pdf

Yes!

My god, how can anyone be THAT confused about decisions and p-values. Yet you are the one criticizing others for their lack of understanding?

We are doomed.

Any scientist who is not able to make informed decisions without p-values or arbitrary cut-off points holding their hand needs to quit their job right away. Any scientist who thinks those two things are needed for decision making should not only quit their job but burn everything they’ve published.

Oh noes!! A hobgoblin is coming for my papers!

Ian said,

“Even the authors don’t understand how to rectify the need for binary decisions, after all, if you are giving up significance, you should get rid of intervals and show the entire confidence (or posterior) distribution.”

Showing the entire posterior distribution would be a good idea!

I agree with that. However, at some point you have to collapse this into a conclusion, e.g., is this drug sufficiently effective compared to placebo? If so, then approve it for use. If not, then don’t.

Unpopular opinion (I guess): Having some community standards for the amount of evidence that must be present to make an assertive claim is a good thing.

Ian,

That’s what it comes to for me. I cannot disagree with the logic of asking if one trial or the combined evidence from multiple trials meets some (binary) pre-defined threshold of effectiveness.

I totally agree that “p less than point-oh-five” is a poor choice of pre-defined binary thresholds, in almost every conceivable situation.

But if someone wants to set the clinically significant threshold as “10% greater” or “20 pounds less” or “Relative Risk>2.0” that’s reasonable. The only question is what (arbitrary) amount of evidence for an effect size beyond the threshold we choose.

The thing I’m still not decided on is whether to endorse a double-binary decision rule. It’s one thing to say, “Tell me how certain we are that the effect is greater than a 10% improvement.” Not quite the same as saying, “Yes or no, are we 95% certain that the effect is greater than a 10% improvement?”

>I can not disagree with the logic of asking if one trial or the combined evidence from multiple trials meets some (binary) pre-defined threshold of effectiveness.

That’s OK, I can do it for you…

Decisions should be made by considering the costs and benefits of all the different possible states of the world that are compatible with what we know (ie. the posterior distribution of some reasonable Bayesian statistical model) and choosing the action that maximizes the total benefit-cost.

A “predefined threshold of effectiveness” is not useful. Imagine for example that I show that doing X will with 95% certainty cure cancer of type Y… You might think “heck 95% certainty is past any predefined threshold… we should do it”

Now I tell you that “by the way doing X requires dismembering every baby born in the next month in your county and grinding up their bones into a powder…” There is no predefined threshold of effectiveness that would ever lead us to decide to do this treatment. Sorry.

Making treatment decisions on thresholds of effectiveness is *at its core the wrong way to think about things*
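As a minimal sketch of that cost-benefit decision rule (all numbers are invented; in practice the draws would come from a fitted model, not a faked normal):

```python
import random

random.seed(0)

# Pretend these are posterior draws for a treatment effect from a Bayesian
# model; here we just fake 10,000 draws from a normal for illustration.
posterior_effect = [random.gauss(0.08, 0.05) for _ in range(10_000)]

treatment_cost = 0.03  # assumed cost of treating, on the same scale as the effect

def expected_net_benefit(action):
    # Utility of "treat" is effect minus cost; utility of "dont" is zero.
    if action == "treat":
        return sum(e - treatment_cost for e in posterior_effect) / len(posterior_effect)
    return 0.0

# Choose the action that maximizes expected net benefit over the posterior.
best = max(["treat", "dont"], key=expected_net_benefit)
p_positive = sum(e > 0 for e in posterior_effect) / len(posterior_effect)
print(best)
```

Note that in this toy setup the posterior probability of a positive effect comes out around 0.94–0.95, i.e., the effect would not clear a conventional 95% bar, yet the expected net benefit still favors treating: the significance threshold and the decision can come apart.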

Rather gruesome, but makes the point.

There’s nothing like taking an error to its extreme form to highlight the problem. It’s no use complaining about a subtle form, people won’t get it. When you’re doing research on some drugs that are supposed to help people, there’s already the idea that they should have exactly zero probability of melting your face off or exploding in a 35 meter wide fireball or whatever. Basically underlying assumptions about costs and benefits are implicit in people’s whole world-view and so they don’t even realize they are there, like I don’t think about the air I breathe (until someone pollutes it).

There is an “old-school” method scientists figured out for doing this: test predictions derived from a theory developed to explain one dataset against new data collected later. You can approximate this by using training/test datasets like machine learning users do, but there is a lot more room for (purposeful or accidental) bias to be introduced that way. Who is it? Solutions along these lines have been something I’ve been thinking about (https://statmodeling.stat.columbia.edu/2019/01/27/jpsp-done-bem-paper-back-2010-click-find-surprisingly-simple-answer/#comment-956058), so I’d love to see what folks are up to.
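A minimal sketch of that check, fit on old data, evaluate on new data (the data-generating process and all numbers are invented):

```python
import math
import random
import statistics

random.seed(7)

# Fake "old" and "new" data from the same process: y = 2x + noise(sd = 1).
def simulate(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

x_train, y_train = simulate(200)  # data the theory was developed to explain
x_test, y_test = simulate(200)    # new data "collected later"

# Fit a slope-only least-squares model on the old data only.
slope = sum(x * y for x, y in zip(x_train, y_train)) / sum(x * x for x in x_train)

# The real check: how well does the fitted model predict the NEW data?
errors = [y - slope * x for x, y in zip(x_test, y_test)]
rmse = math.sqrt(statistics.fmean(e * e for e in errors))
print(round(slope, 2), round(rmse, 2))
```

If the out-of-sample error is close to the known noise level, the model's prediction survived contact with data it never saw; forking paths on the training set can't manufacture that.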

Who is who? People using training and validation/test datasets? AFAIK you can’t get away with not doing that. It is just standard practice:

https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Oh, sorry. I misread your comment. I thought you meant that there was an old-school scientist advocating for test/train splits as a solution to the replication crisis.

My impression is that type of thing was considered so obvious to the “old school” there was no reason to ever mention it.

Ian,

Now you’ve gone and done it!

Is it helpful to think about when we have to evaluate multiple studies on the same subject? Is it easier or harder to summarize multiple results when each result is stated in binary form rather than continuous form?

Honestly asking. I don’t know the answer.

Continuous form gives more information than binary form for a study, so seems better for evaluating multiple studies on the same subject.

Do you mean “continuous form” of p-value versus an asterisk?

Most papers I’ve seen give (continuous) point estimates and standard errors, even if they summarize the p-values by asterisks or the like.

I was thinking of “continuous form” as giving relevant statistics such as point estimates and standard errors (or confidence intervals) rather than p-values or asterisks or statements about “statistical significance”.

Until seeing a couple of examples discussed on this blog, it never even occurred to me that someone would publish a paper whose results were presented as a list of significant/not-significant p-values rather than actual (to my thinking) results. But apparently there are a few of those out there.

I have occasionally seen drafts of (poorly) student-written papers leaning somewhat in that direction so I understand it is for some reason a temptation. Just amazing something like that would make it into print. I guess the authors figure the p-values are all anyone is going to read anyway so just cut to the chase…or something…

There is no link to the comment; you have to request it. This is Nature’s policy. They have an email set up that you can send a message to and you’ll get a copy back within a short amount of time: retire.significance2019@gmail.com

Who are they petitioning? The king of science? What is the purpose of this?

Just because a bunch of people agree something is bad doesn’t mean it should be banned. Just stop trusting conclusions arrived at via NHST and stop publishing in journals that require it.

“Who are they petitioning? The king of science?”

Gotta admit it. This is funny.

Behold! The splendid garments of my rival, Emperor NHST, are revealed as nothing more than illusions spun from misunderstandings.

PAGE: Sire! The anti-dichotomous peasants are revolting!

THE KING OF SCIENCE: You can say that again!

Sure if you want to petition King of Science you should not write to Nature. You should write to Science.

No, I am not going to sign this. Of course I agree that the over-reliance on p-values in social sciences and medicine is a problem, but there are a number of serious flaws in this argument:

*A sensational headline, prone to mis/over-interpretation* Of course we shouldn’t retire statistical significance. Such a headline is just clickbait. I am already seeing the hypercorrection coming: people use this “p-value controversy” as an excuse for drawing conclusions from point estimates without use of any statistical uncertainty. As such, this “controversy” argument just makes the situation worse.

*What’s the alternative, anyway?* It’s easy to ridicule the status quo when you don’t have a workable alternative to compare it to. By arguing that the status quo is not perfect, you are just making an armchair argument, implicitly comparing it to some Utopian situation. As it is, there are a lot of uses of significance tests (non-parametric statistics, sample size calculations, model validation etc) where the p-value is clearly not the optimal way to look at the data, but where no good alternative is available to ordinary users. Before we retire significance, we would need to have a better alternative in place, implemented in widely-used software and supported by textbooks aimed at social science and med students.

*Med/Soc is not everything* There are plenty of scientific fields in which hypothesis testing is sound. Particle physics and software validation, for example.

*What’s the root cause of the problem?* As a statistical reviewer, I often run into manuscripts that try to address this “p-value controversy” by reporting confidence intervals rather than p-values in the abstract. The problem is that it is often difficult to explain in a few words how the effect size scale is to be interpreted, and difficult to compare results from studies that report the effect size in different ways. “We estimated the effect size to beta1= 0.76, 95% CI = [0.22; 1.30]” is no more helpful than a p-value, just more difficult to read.

Helene Thygesen, PhD Biostatistics (University of Amsterdam), Principal Science Adviser at the Department of Conservation, New Zealand

Progress in particle physics seems to have pretty much shut down. AFAIK, all they have done for 50 years is verify what they already thought was true. I am not too familiar with it but I would assume (based on experience elsewhere) this happened rather soon after adopting NHST:

https://www.nbcnews.com/mach/science/why-some-scientists-say-physics-has-gone-rails-ncna879346

In support of that, here is a paper that seems to indicate “statistical significance” became a thing in particle physics in the 1970-1980s: https://arxiv.org/abs/hep-ex/0208005

As to software validation, I am not sure exactly what you mean. Is it AB testing? Almost all the software/websites I use have been becoming more and more annoying to use, and I blame AB testing for that.

Like here is the site for google earth: https://www.google.com/earth/

How many clicks does it take you to download the software you want? (it took me like four, when it should take one)

Sorry, I should have been more specific about software testing.

When output of an algorithm can be tested against a fact sheet I suppose one shouldn’t use a statistical approach at all since you simply require 100% concordance.

I was thinking of situations in which you have unpaired data and/or non-deterministic output. For example when validating stochastic simulation software.

Validating stochastic simulation software is literally one of about 2 or 3 possible legitimate uses of P values. Here the *definition* of working correctly is that various stochastic outputs agree with mathematical predictions to within a tolerance *defined logically* by a p value.

But there are very very few other uses. All the other uses are essentially the same thing only using a “big dataset” instead of a p value. Like, we have a dataset of 10 million non-fraudulent credit card transactions and we want to determine whether some test statistic applied to the next transaction produces a t value that would be extremely rare if we chose it uniformly at random from our dataset of non-fraudulent transactions… after testing stochastic algorithms, some version of this “filtering data” is basically the only legit usage of a p value.

sorry I should have said: ‘using a “big dataset” instead of a random number generator’

Sure, reminds me of Daniel Lakeland’s thing about NHST is good for testing a RNG.

The difference there is that the “null hypothesis” is what you actually think could be happening.

The entire problem with NHST is that it is used for testing something other than what someone expects to happen. I wouldn’t limit the usefulness to RNGs personally (at least not at this point when there is such a huge problem with the current paradigm). I would say the exact same procedure is fine anytime a prediction is tested that is more precise than the measurement uncertainty. The key is to test a prediction derived from a model you believe in though.

Helene said,

“Before we retire significance, we would need to have a better alternative in place, implemented in widely-used software and supported by textbooks aimed at social science and med students.”

This misses the important point that “a better alternative” may be context dependent.

Re: This misses the important point that “a better alternative” may be context dependent.

Very good point.

Helene Hoegsbro Thygesen

I signed it solely because the magic threshold of 0.05 is clearly absurd.

Furthermore, p values close to 0.05 provide very weak evidence – they correspond to a false positive risk of at least 20-30% under reasonable assumptions (and a lot higher for hypotheses that are implausible).
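That 20-30% figure can be checked with a quick simulation (the 50:50 prior on a real effect and the 80%-power effect size are assumptions, following the usual setup in this literature):

```python
import math
import random

random.seed(1)

def two_sided_p(z):
    # Two-sided p-value for a z statistic, via the error function.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Effect size (in standard-error units) giving ~80% power at alpha = 0.05:
# 1.96 + 0.84 = 2.8.
effect = 2.8
null_hits = alt_hits = 0
for _ in range(200_000):
    is_null = random.random() < 0.5  # assumed prior P(H0) = 0.5
    z = random.gauss(0.0 if is_null else effect, 1.0)
    if 0.04 < two_sided_p(z) < 0.05:  # experiments landing just under 0.05
        if is_null:
            null_hits += 1
        else:
            alt_hits += 1

# Among results with p just below 0.05, what fraction were true nulls?
false_positive_risk = null_hits / (null_hits + alt_hits)
print(round(false_positive_risk, 2))
```

Under these assumptions the fraction comes out around 0.25-0.30, consistent with the "at least 20-30%" claim; with a lower prior probability that the effect is real, it is higher still.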

I agree that it’s a huge problem that statisticians have failed to agree on alternatives to p values. Perhaps that’s why their advice has been almost universally ignored by journal editors. I have made concrete suggestions for alternatives (though needless to say, not everyone agrees with them).

See https://arxiv.org/ftp/arxiv/papers/1802/1802.04888.pdf

and

https://www.youtube.com/watch?v=jZWgijUnIxI

I signed as a supporter, despite the fact that it doesn’t really explain the deficiencies of p values as evidence. The main thrust is to stop dichotomisation, and that is probably the only thing that all 43 authors agree on.

I’ve come late to this party. Reading through all these comments, I am left with a few thoughts. First, we all (or at least most of us) can agree that dichotomous thinking is a great part of the problem. So is the garden of forking paths. Second, there is real dispute as to whether p values and/or confidence intervals are worth reporting at all – but it looks to me like most of the opposition to these regards their misuse rather than the potential information. Perhaps the misuse is so great that we have no choice but to get rid of them altogether. But I’m not convinced.

Third, it strikes me that some of the problem (perhaps much of it) stems from researchers believing that they should make binary recommendations. Decision makers often must make binary choices – but researchers have the luxury (and the obligation!) not to do this. I’d like to see more papers actually seriously discuss the costs and benefits of different courses of action. Let the decision makers do what they are being paid to do. Researchers are supposed to collect and evaluate evidence – since when is it their job to make decisions? Perhaps this is a reaction to feeling that the decision makers are incompetent, and a desire for power they don’t have.

If my child/friend/spouse/etc. has a diagnosis and a decision must be made regarding treatment A or treatment B, I don’t think a research study should be telling me which choice to make. I don’t think a clinician should tell me either. Yes, a decision must be made, but it will be a decision fraught with uncertainty. I’d rather see the scientific literature provide extensive discussion of the uncertainties and the clinician explain these uncertainties. Why does a binary choice imply that either of these people/entities must make that binary choice?

It isn’t the p-values or CIs, it is using them to check if there is a difference between groups or not (test a strawman null hypothesis). That is the “misuse”, and it is what everyone is being trained to do in their stats classes.

This would do nothing since people will just misuse likelihood ratios or Bayes’ factors to do the same thing.

Haven’t statisticians been beating this drum for a long, long time?

Applied scientists aren’t stupid people, but they apparently don’t see any valuable alternative.

Some have, but other statisticians have been teaching them to do it. Apparently because that is what the researchers want to hear:

https://www.sciencedirect.com/science/article/abs/pii/S1053535704000927

I’ve only glanced over that paper, but this caught my attention:

“Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is significant (t = 2.7, d.f. = 18, p = 0.01). […]

“6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

“Which statements are in fact true? Recall that a p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis H0 is true, defined in symbols as p(D|H0). […]

“Statement 6 amounts to the replication fallacy (Gigerenzer, 1993, 2000). Here, p = 1% is taken to imply that such significant data would reappear in 99% of the repetitions. Statement 6 could be made only if one knew that the null hypothesis was true. In formal terms, p(D|H0) is confused with 1 − p(D).”

If the null hypothesis is true, one definitely cannot say that a significant result will be obtained on 99% of occasions! (It would be obtained in 5% of the replications.)
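
A quick simulation (my own sketch, not from the quoted paper) makes the point: when the null is true, significance recurs at roughly the alpha rate, not 99% of the time:

```python
import numpy as np

# Simulate repeated two-sample t-tests (n = 20 per group) when the
# null hypothesis is exactly true, and count how often |t| exceeds
# the two-sided 5% critical value (about 2.024 for df = 38).
rng = np.random.default_rng(0)
n, reps = 20, 10_000
t_crit = 2.024

rejections = 0
for _ in range(reps):
    a = rng.standard_normal(n)
    b = rng.standard_normal(n)
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    t = (a.mean() - b.mean()) / se
    rejections += abs(t) > t_crit

print(rejections / reps)  # close to 0.05, nowhere near 0.99
```

The original p-value, however small, has no bearing on this long-run rate under the null.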

Garnett:

People do see valuable alternatives! Lots of researchers use alternative statistical approaches to learn from and make conclusions from data: methods including Bayesian inference, machine learning, and statistical graphics. These are valuable alternatives.

I agree that there are lots of researchers who think statistical significance is just fine, but they’re mistaken! Yes we’ve been beating this drum for a long long time, but maybe there’s been some progress. At least, there’s a lot more noise about the replication crisis coming from a lot more people than before.

Andrew and Anoneuoid

Thanks for your comments. I want to ask a somewhat reversed question (apologies for the ridiculously large brush strokes):

Bayesian methods are a hot topic in many scientific disciplines. With the availability of software such as WinBUGS and Stan, and the clarity of Bayesian inference, many applied researchers have adopted, or at least come to appreciate the value of, Bayesian approaches.

Yet, as Anoneuoid points out, statistical significance, whether through p-values, Bayes factors, or posterior probabilities, is *the* dominant paradigm of statistical practice in the applied sciences.

The latter has been recognized by statisticians as a major problem for a long time, yet it is the former that is proliferating through the applied sciences.

Why?

Because applied people largely find value in claiming that they can take a Bayesian approach. Abandoning statistical significance is seen as nothing but a barrier to their view of scientific practice. Statistical significance has enormous value to them.

Not to answer for Andrew or Anoneuoid,

but I would have liked to have seen something like this added to the comment – “We anticipate that common approaches based on likelihood or Bayesian methods could also be similarly revised to help halt overconfident and unwarranted claims”

A few years back, I got to review work by research reviewers when Bayesian methods had been used (with access to their notes and in person discussions) and all I can say is it was scary (and the reviews all had to be redone using better critical use of Bayesian methods).

Perhaps more worryingly I also actually had to deal with statisticians who think that doing Bayes with default priors automatically vanquishes all the problems that arise from using frequentist statistics, again using default priors.

So I believe a similar revision to what applied scientists may have learned about Bayes from various sources of various quality is needed first.

One start to that is here https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/#comment-986215

And there is a lot of material from this blog and its authors to draw on.

It is far easier to disprove a strawman and then conclude your favorite theory is true vs. actually thinking hard about what your theory entails and then trying to disprove it. Yet both of these types of study are given the same nominal value in terms of publication.

In a type of Gresham’s law, the “debased” publications are generated instead of the good publications and eventually the good publications are driven out of circulation. It just takes so much more effort to produce the good publications (for the same nominal value career-wise) that people stop doing it.

Thanks.

Adherence to NHST is not something that statisticians can do anything about.

They’ve already made their case, repeatedly, with apparently little traction.

As you say, abandoning statistical significance will have to come from scientists who place greater premium on “actually thinking hard about what your theory entails and then trying to disprove it.”

Sorry, but this is also wrong. I am not too far out from taking stats classes where they taught me to do NHST. Obviously, what statisticians can do is stop teaching this stuff, stop writing textbooks that teach it, and even write negative reviews of the ones that do.

Here is the thing, the researchers have been trained to think that statisticians can tell them whether a result is “real” or not. If you don’t give it to them they will go to someone else who will sell them that illusion.

+1

+1

I don’t disagree with your view at all, except if applied scientists find value in NHST and statisticians refuse to talk about it, then the applied scientists will just get their own people to teach their view of statistics. I see that all the time.

Garnett:

I don’t know that applied scientists always find value in NHST. I think often they’re doing it because they feel they have to, because they can’t get their papers published if they don’t have statistical significance. I’d rather let them publish their results without statistical significance. For example, if JPSP wants to publish an ESP paper, let them do it. Daryl Bem’s paper would’ve been better, not worse, had he just presented his data, if formal statistics had never been invented. See discussion here.

Thanks for your comment.

“I think often they’re doing it because they feel they have to, because they can’t get their papers published if they don’t have statistical significance.”

That’s what I meant by finding value in NHST.

My whole point, such as it is, is that educating scientists about NHST seems insufficient to change things at anything close to the scale that we’d all like.

Garnett:

Sure. But I think a lot of scientists and statisticians think that NHST is fundamental to inference. Even now, I think it’s a standard take in statistical theory that confidence intervals are inverted hypothesis tests (see here for the problem with this idea).

I do think it’s possible to change practice to some extent by publishing articles and books.

For example, back in 1991 I learned that many Bayesians did not think it appropriate to check model fit. For many years I wrote about this—in theoretical articles, applied articles, and books—and after awhile practice changed. I really think that my efforts made a difference. My theoretical work gave people permission to check their models, my applied work gave people tools to do it. The combination of permission + tools can be effective, I think, and that’s what I’m trying to do in this area of replacing hypothesis tests with Bayesian inference.

Andrew said:

“I do think it’s possible to change practice to some extent by publishing articles and books.

For example, back in 1991 I learned that many Bayesians did not think it appropriate to check model fit. For many years I wrote about this—in theoretical articles, applied articles, and books—and after awhile practice changed. I really think that my efforts made a difference. My theoretical work gave people permission to check their models, my applied work gave people tools to do it. The combination of permission + tools can be effective, I think, and that’s what I’m trying to do in this area of replacing hypothesis tests with Bayesian inference.”

+1. It ain’t easy, and it takes persistence. So it’s worth keeping on making the effort, rather than giving up easily.

Somebody needs to write up a style manual that succinctly summarizes all these points that come up on this blog. For instance there was a previous discussion about how to say “controls for”.

Not everyone reads this blog and commits its recommendations to memory. If there were a style sheet with a catchy acronym of a name, there is a chance usage might achieve critical mass. Keep it really concise, and maybe provide links to more in-depth discussions.

Good point. If you, or someone else, or some group of people, made up a draft of such a document (although I’m not sure “style sheet” would be the best thing to call it), and posted a link to it on this blog, I’ll bet a lot of people would be willing to contribute to revising, honing, and publicizing it.

Good suggestion

This! At my school, three other departments besides us teach an intro statistics course. I have no idea what they are teaching there. Psychology is one of the three. The rest of the departments send their students to our intro stats classes. I am pretty sure that if we suddenly refused to teach NHST, soon all of the departments would have their own statistics courses, and we would have no students left.

But it is even worse: our department has two statisticians. The rest of us are all math. Intro statistics is our largest class, even bigger than remedial math, with lots of sections, so most of these sections are taught by mathematicians, not statisticians. All my formal training in statistics is one semester in college, which I almost failed, and a few mentions of random walks and Monte Carlo methods in several graduate courses. When I got my first teaching job, one semester I was handed a copy of Triola and a copy of the previous semester’s syllabus, and they told me “here, teach this”. That was some 18 years ago, and I have taught many semesters of intro statistics at two different schools since then.

I know of many math faculty at many schools all over the place with the same experience. If the schools I have experience with are any indication, there are many more mathematicians teaching introductory statistics than statisticians, and as far as I can tell, they love their Triola. I understand why; I have been there. It is easy to teach from, it makes statistics look superficially like math, and we mathematicians have this fixed idea that once you understand whatever topic it is that you did your dissertation on, you can understand and teach anything.

What I am trying to say is that, in addition to applied scientists, here is another group that may need to be “enlightened”. And it is not going to be easy.

Thanks. One thing that recently startled me a bit: I commented to new employees, who had recently taken multiple stats courses, that it was a common misconception to think the p value was the probability of the null being true. All four, who had studied at different universities, replied: that is exactly what we were taught it was.

Now, there was an informal survey in 2017 of intro stats courses taught by those who do research in teaching intro stats; all were teaching one or more of the ASA statement’s don’ts as what should be done.

A more formal survey of what actually is being taught in intro stats courses would be very helpful – I would suggest surveying faculty and their students.

What Jan and Keith say are sad to hear. I am a mathematician, but fortunately, my turn to statistics was not in the circumstances that Jan describes. I was a full professor, and had taken on as one of my tasks the oversight of the courses required by future math teachers. As AP statistics became popular, I thought that future math teachers needed a college statistics course that was at a more advanced level (having probability as a prerequisite) than our intro stat, but that the math stat was not applied enough. Since statistics was then underrepresented at my university, I realized that I would need to develop the course myself. Fortunately, the statisticians were very helpful, always willing to answer my questions and point to possible references. I also sat in on some graduate statistics courses. Also, there were NSF summer workshops aimed at mathematicians teaching statistics.

But as more people started teaching the course I developed, I did experience the type of thing Jan describes – they liked the more “cookbook” texts like Triola, that I consider as neither good mathematics nor good statistics.

Keith, this is a well known problem. I remember about 5 or 6 years ago there was a formal survey that concluded that a large percentage of researchers, students, and, most disturbingly, statistics instructors were unable to recognize a correct definition of p-value among several clearly incorrect alternatives. I cannot find the study now, but when googling for it, a whole bunch of other studies from different countries come up saying basically the same thing.

One thing I am completely confident about, though, is that none of our instructors teach that the p value is the probability of the null being true. I am pretty sure that we do not teach anything actually incorrect. The problem is that we put way too much stress on cookbook-style applications of NHST. Although some of the “word problem” style examples and exercises I have seen in some of the textbooks we use are beyond wrong.

Re possible ways to help alleviate misunderstandings of p-values:

For several years, I taught a continuing education course titled “Common Mistakes in Using Statistics: Spotting Them and Avoiding Them”. (The course assumed that the students had at least an introductory course in statistics.)

The Day 2 notes (on hypothesis tests and confidence intervals) can be downloaded (in 3 parts) from https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

The presentation is designed to attempt to forestall many of the common misunderstandings of p-values and confidence intervals. In particular, the “quizzes” on pp. 43 and 61 of Part 3 confront the students with several misinterpretations and misunderstandings. I typically have students read each statement and discuss it with their neighbors; then I ask for a show of hands for each of the three possible answers (Doesn’t get it, Gets it partly, but misses some details, Gets it!), then ask for volunteers to defend their choices.

I typically also used such “quizzes” both when teaching introductory courses and as part of an introduction/review at the beginning of courses in regression and analysis of variance. They don’t provide perfect understanding for all students, but I think they do help many students avoid misunderstandings — or at the very least, convince the students that the concepts are non-trivial.

I want to add another counter-point that I’ve already made above. The use and admiration for Bayesian approaches is becoming more and more popular in the applied sciences even though it is not taught in most standard textbooks.

How can that be if textbooks and coursework dictate the practice of statistics in applied sciences?

That is because scientists see value, for one reason or another, in Bayesian approaches. There is no value in abandoning NHST.

Well, I started using JAGS because I was trying to fit a curve and R’s nls was spitting out errors and had confusing docs. JAGS at least returned something usable relatively quickly so I started using that.

I think as soon as you go outside “check if there is a difference” and start coming up with actual models of what may be happening, you start running monte carlo simulations with a few parameters. MCMC is then an obvious next step for tuning the parameters and all of a sudden you are doing bayesian stats.
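
To make that concrete, here is a minimal random-walk Metropolis sketch for a toy curve fit (the data, model, and tuning choices below are all made up for illustration, not anyone’s actual analysis):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data from y = a * exp(-b * x) + noise, with true a = 2.0, b = 0.5
x = np.linspace(0, 5, 50)
y = 2.0 * np.exp(-0.5 * x) + rng.normal(0, 0.1, x.size)

def log_lik(a, b, sigma=0.1):
    """Gaussian log-likelihood of the data under the decay model."""
    resid = y - a * np.exp(-b * x)
    return -0.5 * np.sum(resid**2) / sigma**2

# Random-walk Metropolis with a flat prior on (a, b)
a, b = 1.0, 1.0
ll = log_lik(a, b)
samples = []
for _ in range(20_000):
    a_new, b_new = a + rng.normal(0, 0.05), b + rng.normal(0, 0.05)
    ll_new = log_lik(a_new, b_new)
    if np.log(rng.uniform()) < ll_new - ll:  # accept/reject step
        a, b, ll = a_new, b_new, ll_new
    samples.append((a, b))

post = np.array(samples[10_000:])  # discard burn-in
print(post.mean(axis=0))  # posterior means near (2.0, 0.5)
```

Once you have a simulation of the data-generating process, the sampler is just a loop around it; that is the “obvious next step” feeling.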

Dale said,

“If my child/friend/spouse/etc. has a diagnosis and a decision must be made regarding treatment A or treatment B, I don’t think a research study should be telling me which choice to make. I don’t think a clinician should tell me either. Yes, a decision must be made, but it will be a decision fraught with uncertainty. I’d rather see the scientific literature provide extensive discussion of the uncertainties and the clinician explain these uncertainties. Why does a binary choice imply that either of these people/entities must make that binary choice?”

Indeed, “shared decision making” is becoming more common in medical practice. It is described as

“a process in which clinicians and patients work together to make decisions and select tests, treatments and care plans based on clinical evidence that balances risks and expected outcomes with patient preferences and values.” (https://www.healthit.gov/sites/default/files/nlc_shared_decision_making_fact_sheet.pdf)

Letting researchers and clinicians make binary decisions goes against the idea of shared decision making.

Perhaps it’s just my innate cynicism about human nature (esp. responses to perceived incentives) but I truly don’t believe we’ll see the end in my lifetime of the p-value idea, of NHST in general and more importantly of filtering out so-called null findings. My idea is everything that exists and endures must be serving some purpose. Maybe not the purpose everyone thinks but the existence of a ubiquitous practice means it is filling a niche.

In this case the niche is created by the desire to compute lots and lots and lots of numbers, any of which might be hailed as a “finding”. If that’s what is going to be done (and it is) then some method or methods will arise that allow expedient winnowing down of a dozen or a hundred proto-findings to identify the ones that will be treated as significant (in the colloquial sense, not the p<0.05 sense).

Perhaps the replication crises and this heightened pressure to eliminate the p-value will eventually cause "significance testing" to be socially unacceptable in the wider research community. But I think we're kidding ourselves if we think it won't be replaced by some other easily implemented filter for sifting through the results of all those forking paths. Or in many cases what amounts to outright fishing expeditions.

Which isn't to say we shouldn't shout down the whole NHST framework. But success there will, in my humble opinion, be a Pyrrhic victory at best if we're concerned about replication and reproducibilty.

This is my view as well. I repeatedly explain the problems of NHST to applied scientists (largely borrowed from this blog, its commenters, and the usual books), they nod in understanding and appreciate the insights, and continue doing things exactly as before.

They just don’t see any alternative, and these are some very smart people.

I think your comment just before this one was spot on. An alternative statistical procedure in service of the existing way of framing research questions is typically welcomed, if it offers real advantages.

But just watch those smiles turn to frowns when you suggest an alternative paradigm for thinking about applied research. A person who has been professionally successful under the existing shared set of assumptions and practices has less than zero incentive to reject the fundamental underpinnings that have led to that success.

Give them a better way to do what they’re always done? You’re a genius! Tell them what they’ve always done is invalid? You’re a threat.

Garnett said

“I repeatedly explain the problems of NHST to applied scientists (largely borrowed from this blog, its commenters, and the usual books)”

Sorry, but I can’t resist: You really borrow the scientists from this blog, its commenters, and the usual books? ;~)

Yes, but I’m really bad about returning them.

Luckily for you, it’s not an official library, so you don’t have to pay overdue fines.

A lot of the criticisms shown in these discussions apply to Bayesian and other approaches as well (misunderstanding, misuse, hacking of various forms, arbitrary journal standards, god loves a value of X surely as much as one of Y, paradoxes, etc.). We’d see a lot more of those examples, and articles discussing them, if Bayesian or other approaches were as widely used and successful as frequentist approaches.

For a non-Bayesian example, I just read an article from Briggs on a “Replacement for Hypothesis Testing” (http://wmbriggs.com/public/Briggs.ReplacementForHypothesisTesting.pdf). In it he eventually refers to a delta. The size of the delta indicates if a result is…well…something like significant or important or worth looking at, etc. Is this supposed to replace p-values?

Justin

http://www.statisticool.com

Justin:

Yes. As I wrote a few years ago, the problems with p-values are not just with p-values. This was my discussion/criticism of that famous ASA statement on p-values, which I thought missed some important points.

Andrew

I apologise for making a similar point again. The main concern seems to be the imposing of a threshold on the P value; added to which, the P-value is the likelihood of an observation (or of more extreme hypothetical observations) conditional on a hypothesis. It is thus not a proper conditional probability statement, and this inevitably induces thoughtlessness, especially in non-statisticians. If a statistician asked a physician the probability that his test results would be replicated if repeated, but was told instead that, if the patient had some single diagnosis, then the likelihood of having developed these symptoms or something more extreme was 0.04, he would think the answer equally meaningless.

Physicians should be intuitively familiar with (a) the probability of being able to repeat some diagnostic test properly, (b) the probability of the diagnostic test giving a repeat result within some specified range (based on uniform priors) and (c) the probability of a diagnosis conditional on the test result (based on non-uniform priors). Why does the statistical community not do something similar? Instead of a P-value, we would have the frequentist posterior probability of the ‘true’ result being above and below the null hypothesis, which would be approximately equal to P and 1-P [ https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212302#references ].

This would only be a provisional starting point that would set an approximate upper bound for a more detailed probability statement of the true result being in a narrower range than that bounded by a null hypothesis, based on other evidence (e.g. a Bayesian prior distribution) and methodological soundness (e.g. absence of forking paths, and pre-registration). This would represent non-dichotomous uncertainty which doctors and biologists would have a better chance of understanding.

Huw said,

“Physicians should be intuitively familiar with (a) the probability of being able to repeat some diagnostic test properly, (b) the probability of the diagnostic test giving a repeat result within some specified range (based on uniform priors) and (c) the probability of a diagnosis conditional on the test result (based on non-uniform priors). ”

This would be true in an ideal world. However, https://www.stat.berkeley.edu/~aldous/157/Papers/health_stats.pdf documents that physicians (as well as patients, journalists and politicians) have a high degree of statistical illiteracy. (And I sincerely doubt that the situation has changed much, if at all, since the paper appeared in 2006.)

Ben P says:

“It’s also often said that if we get rid of significance we should get rid of confidence intervals too; after all they are simply inverted hypothesis tests. But again I disagree – at least with a CI the values are in terms of a variable or statistic of interest…”

I have long preferred confidence intervals to p-values for much the same reason as yourself. But recently I’ve come to worry over the fact that the usual construction of a Frequentist confidence interval does not *quite* admit the interpretation suggested by this formulation.

Yes, I agree there is still a larger problem of interpretability within the framework of constructing 95% CIs. It is nice that the endpoints are in terms of interpretable values, which is why I like them so much more than p-values. But then we ask “what does this range refer to?”, and the answer is “it is a range that was constructed using a method that, given the assumptions of the model, will successfully capture the unknown parameter 95% of the time under repeated sampling.” And that doesn’t exactly drip off the page.

“Indeed, under the Law of Diffusion of Idiocy, every foolish application of significance testing will beget a corresponding foolish practice for confidence limits.” (Abelson, 1997)

The problems that need to be fixed are poor understanding of statistics and lacking honesty to not oversell findings. Neither can be cured by signing a petition.

Soren:

I don’t think anyone’s claiming that the problems of science can be cured by signing a petition. I’ve written several books about statistics. I don’t think that’s going to cure the problems of science either. I’ve even written blog posts pointing out the statistical errors of New York Times columns of the Brooks brothers. I don’t think that’s going to cure the problems of journalism. We do what we can, trying to make people aware of problems in the current system, and exploring possibly better alternatives.

People seem to be missing the point here. The value of this isn’t that the king of science will listen and ban the practice of NHST, the value of this is that when you argue with people that they should stop doing NHST and they act like you are a raving lunatic who is easily ignored like the time cube guy… You can point to this and say, no really there are many professional statisticians who agree with me you can’t just blow this issue off…

Stopping NHST to many researchers is as insane as saying you should just burn some incense and meditate over your data and see what comes of it…

+1

Funny, because that is very similar to what I consider using NHST to be like. Just did it on here 2 days ago actually: https://statmodeling.stat.columbia.edu/2019/03/05/back-to-basics-how-is-statistics-relevant-to-scientific-discovery/#comment-983992

I prefer the “monk/monastery” one since it implies institutionalization and collective/prevailing opinion though. The “incense/meditate” suggests a more decentralized/independent practice.

But more on topic. I think enough has been published on the topic (and can even be found on the wikipedia NHST page) for anyone who is the least bit interested and open to new ideas on the topic to see something must be going on. Pages like this would be more helpful than a petition imo:

http://www.indiana.edu/~stigtsts/quotsagn.html

Like this quote, perfectly true from 1963:

Maybe people should get some sort of “gold star” for reading a page of negative quotes about NHST.

Sure, I agree using NHST is very much like asking the monks to pray for your data… but here’s the thing: telling people that using NHST is like getting monks to pray for your data, when they’ve been *actually taught to do it by actual statisticians at actual universities* is going to get you looked at like a raving schizophrenic talking about “evil magnetism coming from the internet” or whatever.

You need to be able to point to *actually published opinions from clearly not-raving lunatics* or they WILL ignore you without a further thought… In the end they might ignore you, but they’ll have to engage at least *a little* with this Nature petition.

Yea, I realize that. I meant both “sides” will think the other side is crazy because there are fundamental differences in assumptions about the world at play. I have identified at least two:

1) It assumes it is somehow special to discover the existence of a difference between two groups or the presence of a correlation.

– In reality, everything is correlated with everything else to some extent (most relationships are negligible or unimportant, but non-zero)

2) It assumes what has been published in the “peer reviewed” literature is at least roughly correct.

– In reality, most of the observations don’t replicate (or no one even knows how to replicate them), and I have low expectations for how many reliable observations have been interpreted correctly.

This is just like the monks assuming what is written in the religious text (and highly respected commentaries) is pretty much true. Their job is just to identify any apparent contradictions (from their wrong “null model”), figure out the right way of thinking about it to resolve the contradictions, and perhaps work out minor details (scholasticism).

That is how you get elaborate “theories” developed over decades or centuries involving hundreds or thousands of people to explain stuff that didn’t need explaining to begin with.

Anon,

Nice list of quotes. Since one important driver of the popularity of significance tests is the desire for certainty, I suggest as a companion to your list the following page of quotes about uncertainty: https://web.ma.utexas.edu/users/mks/statmistakes/uncertaintyquotes.html

Agree; in fact I have already used the ASA statement to do exactly that.

Most of the criticisms of NHST relate to keeping the significance level fixed. Allowing the significance level to decrease with the sample size at an appropriate rate results in an almost sure hypothesis test. In any sufficiently large sample, an almost sure hypothesis test rejects the null when it is false with probability one, and it fails to reject the null when it is true with probability one. In fact, I used this to resolve the Jeffreys-Lindley paradox by showing there is no paradox if the significance level is set to alpha = n^(-p) for p > 1/2 (EJS 2016, “Almost sure hypothesis testing and a resolution of the Jeffreys-Lindley paradox”).

Almost sure hypothesis tests are robust to multiple comparisons, publication bias, and optional stopping. They can also be used in model selection to find the correct model with probability one. In my most recent paper (on ResearchGate, “An application of almost sure hypothesis testing to experiment replication and the inconsistency of consistent priors”), I prove that a.s. confidence intervals can be used to test experiment replication and that the posterior does not generally converge in probability to the cdf of the true parameter, even with normal data! I also provide a simple algorithm for the dishonest researcher to beat the Bayes Factor with large probability.

A.s. hypothesis testing can be used to confirm a null hypothesis, and it treats the null and alternative in a symmetric fashion. One may estimate both the type I and type II error in almost sure hypothesis testing; this follows from the Edgeworth expansions derived in the EJS paper. I can also provide a referee report from the Annals of Statistics that a.s. hypothesis testing resolves the Berger-Selke paradox.

What are you basing this on? That is a relatively minor red herring (amongst many larger red herrings) when it comes to problems with NHST.

The Jeffreys-Lindley paradox and the Berger-Sellke paradox are two examples of fixed significance levels causing the problem that Bayesians were criticizing. Lindley criticized frequentists for a lack of robustness to optional stopping. Multiple comparisons were mentioned in the ASA statement on p-values. If you give me a specific criticism of NHST, I’ll try to explain how AS hypothesis testing solves the problem, if I can, at least approximately.

Sure. If Omniscient Jones told me the null model is false (I’ll even allow all auxiliary assumptions like i.i.d. samples to be true, only mu1 != mu2), it doesn’t offer much support for my research hypothesis/theory anyway. The reason is that there are many explanations for why there could be a difference between groups, and no reason to favor the one that motivated the study.

Further, I actually do know the null hypothesis is false before I run the study, because everything literally correlates with everything else to some extent for reasons that could be interesting or not.

So I never care if the null hypothesis is false, because I don’t learn anything from that. I already knew it before collecting data. For “directional” use of NHST, this goes from 100% to 50%, only a slight improvement.

The “alternative” is to instead set the “hypothesis to be nullified” as a prediction derived from your research hypothesis, which must be precise enough to surprise people (all the terms referring to other models in the denominator of Bayes’ rule are small). That indicates there are not many explanations for why such a value would be observed, thus the observations support the research hypothesis.

It was all explained clearly here:

Paul E. Meehl, “Theory-Testing in Psychology and Physics: A Methodological Paradox,” Philosophy of Science 34, no. 2 (Jun., 1967): 103-115. https://doi.org/10.1086/288135

Fair enough, I take the assumption to be that the data are i.i.d. within each group. It follows by assumption that there are at least two uncorrelated random variables in a group. The proposition “everything literally correlates with everything else to some extent” is now false, which follows from the i.i.d. assumption. Next.

Hmm, I don’t know but I can kind of see how those two things contradict.

But fine, I take away that assumption (which is false in reality) then I guess. I was just trying to simplify the situation for you.

I’ll meet you halfway and take it to be true that all random variables are correlated, as long as I can use the standard statistical machinery for dependent sequences. The zero-correlation proposition is but one proposition. One could still test the proposition that the data have a standard normal distribution, or test for unequal variances. I even have a specification test for MLE estimators, which is another kind of proposition. So I can test a variety of propositions in the AS hypothesis framework while still maintaining the assumption that all random variables are correlated. Next.

How about this: the data do not come from an algorithmically random sequence at all, they come from actual natural processes such as the growth of trees or the weather or the choices of humans or the action of predators and prey…

If there is some natural process that cannot be adequately described by a probability measure, then statistics as a whole is not very useful. Both the Bayesian and the Frequentist typically require the existence of probability measures. It seems like an odd argument that statistical significance should be abandoned because probability measures are inadequate for describing the real world. Next.

The Bayesian probability measure is not a measure of the frequency properties of repeated experiments, as such it does not have any difficulties with the fact that real world data sequences do not actually have any of the properties of high complexity algorithmic randomness. Frequency measures on the other hand are the defining characteristic of algorithmic randomness. So, one of these things is not like the other. Sorry you can’t just ignore this issue.

While the Bayesian may have a different interpretation of probability, this doesn’t change the fact that both approaches typically assume some probability measures accurately model the underlying natural process. See the first page of the Bayesian section in van der Vaart, Asymptotic Statistics. If process X cannot be described by a probability measure, then the Bayesian probability measure also fails to describe it. This is also a general criticism of the calculus of probabilities, not a specific criticism about NHST. Next.

The Bayesian probability measure needs to satisfactorily describe a “state of information” about the relative plausibility of what might happen given our model; the Frequentist probability measure needs to accurately describe how often the world will arrange for a certain outcome to occur. They are totally different requirements. The Bayesian one doesn’t require anything of the physical process the world carries out; the Frequentist one makes strong assumptions about the world that are demonstrably false in most cases. But since you are basically just dismissing me, have a nice day.

Daniel:

I believe Michael Naaman does have a point: if a satisfactory description of a “state of information” about relative plausibility has no connection to reality, you are just doing math. (There would, for instance, be no reason to do prior predictive simulations and compare them to actual observations.)

Of course a Bayesian model needs a connection to reality, it’s the KIND of connection that is in question. Frequentist statistics inherently requires the world to satisfy tests of randomness in order for the conclusions to hold. Anyone who has worked with random number generator algorithms knows that these tests are very stringent and plenty of smart people who tried hard to produce good RNGs failed. Why is it that we should think that just any old psychology survey (or tree ring data or economic development experiment etc) will automatically act like a random sequence?

On the other hand, the kind of match that Bayesian models need to have is that actual observed data should be found in places with nontrivial density for p(data|model); that’s all. If significant quantities of data are found in regions of low relative density (D(x)/Dmax), then the model claims that what should happen is something other than what does happen.

There is zero frequency requirement there, and the result is that the world failing to match the Bayesian probability in terms of frequency under repeated sampling (as it inevitably must, since the world isn’t an RNG) has no bearing on the inference from a Bayesian model… The questions are different and the adequacy is measured differently.

Daniel: > adequacy is measured differently.

Some people may wish to measure adequacy differently but everyone will experience the same consequences of inadequacy.

Perhaps a discussion for another time.

The consequences of a Frequentist model being inadequate are that it will give you frequency “guarantees” that are nothing of the kind… They describe an idealized random sequence rather than, say, physics or economics. However, the requirement to check this is that you collect dramatically more data than anyone actually collects in real applied settings; instead of 35 patients you should have 100,000 …

On the other hand the consequences of a Bayesian model being inadequate depend strongly on your description of adequate, which is generally problem specific and involves a utility function. In some cases even large errors in predictive ability have perfectly fine utility so long as say your average behavior is correct or you don’t make certain kinds of errors etc.

Sander:

I don’t think Michael has convinced many people either, but I do appreciate that he’s engaging in discussion here. We wouldn’t want the comment section to be an echo chamber.

I wouldn’t want it to be an echo chamber either, but a dismissive “next” at the end of each post is basically a troll move, so I don’t think Michael Naaman wants to be seen as a troll, and he should give that up.

Once again, I have learned to be sorry for trying to discuss this outside a real life example. You keep avoiding the point by nitpicking tangential stuff meant to simplify the discussion.

Why would you want to do this?

“Once again, I have learned to be sorry for trying to discuss this outside a real life example. You keep avoiding the point by nitpicking tangential stuff meant to simplify the discussion.”

+1

Reminds me of: Theoretical statistics is the theory of *applied* statistics :)

In a regression model, the standard errors will be wrong in the presence of heteroskedasticity (unequal variances), which is why one might be interested in testing the assumption of homoskedasticity. One can also test propositions about the mean of the distribution, which has nothing to do with correlation. As for normality, if one is able to demonstrate that a vector of random variables is normally distributed, then the conditional expectation would be a linear function, which in turn provides a solid foundation for the linear regression model. Next.

Michael, I’m sure you know this, but the relevant question is not whether the data are normally distributed but whether the regression residuals are. You mistakenly said the former though, which is what Anoneuoid was responding to, I believe. Honestly, even there, as Andrew likes to say, that’s one of the most overrated assumptions of linear regression. It’s mainly important if you are constructing test statistics that presuppose specific distributional forms…

One can have a nonlinear conditional expectation with normal residuals

Y = A(x - b)^2 + z

If the random vector (y,x) is normal with nonsingular covariance matrix, then the conditional expectation of y given x is linear. Next.

And you care about this why? What would be the consequences either way?

Michael: If one enters into a data analysis uncertain about a normality assumption, how does one “demonstrate normality”? What exactly do you mean by “demonstrating normality” given that uncertain data set?

In my ResearchGate paper, I provide an explicit formula for an AS hypothesis test of normality using estimated parameters. I also discuss the extension to the multivariate case. This includes alpha-mixing dependent sequences. On a side note, if one is very certain that a parameter is non-zero with probability one, then just do a one-sided almost sure hypothesis test and this will tell you the sign of the parameter, which is often sufficient for the social sciences. Next.

Sander,

Can you at least post a picture of you skiing? lol

Michael said:

“In my ResearchGate paper, I provide an explicit formula for an AS hypothesis test of normality using estimated parameters.”

But another hypothesis test just increases the problem of multiple testing.

Michael: A frequentist test statistic (or a rescaling of it such as a P-value or S-value) does not “demonstrate a hypothesis”. It only measures evidence against the model used for its construction along the axis of variation of the test statistic.

Your “Next” comments make no sense except to show you are demonstrating things to your satisfaction, that’s all. Why that should matter to the rest of us remains unclear (at least to me, but I don’t see where you’ve convinced anyone else here either).

Perhaps relevant to this discussion: “In the era before “big data” became a household name, the small sizes of the datasets most researchers worked with necessitated great care in their analysis and made it possible to manually verify the results received.” https://www.forbes.com/sites/kalevleetaru/2019/03/07/how-data-scientists-turned-against-statistics/#4ca53dc7257c

See? The low powered, noise-mining, pencil-wielding verificationists were winning until modern stats software came along.

If you give me your word that you won’t sign the pledge (or, in the event that you already signed the pledge, that you will make an honest effort to have your name removed), then within 7 days I will change my Facebook profile picture to a picture of me skiing for a period of no less than 7 days.

Dear Martha,

Can you please provide me a little more information about your criticism?

You seem to just love doing these tests for their own sake or something. I attempted to bring you back to the real world but you ignored my post. But here you peek your head out of the weeds so let’s see:

In what sense is this sufficient? You realize everyone is saying it is not “sufficient”, right? The questions being answered by all your “tests” are not of interest to anyone. When “social scientists” (et al.) misinterpret them to mean something else so they think they are of interest, then you will blame them for “misusing” your useless tests. Like this:

In a regression model, the (coefficients and) standard errors will be wrong unless you include all and only the relevant predictors (the model is correctly specified).[1] Since we know this isn’t the case to begin with, adding more tests for the various (also known to be literally incorrect) assumptions serves no purpose.

[1] Some discussion here: https://statmodeling.stat.columbia.edu/2017/01/04/30805/
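The misspecification point in [1] can be illustrated with a quick simulation (made-up coefficients; both true effects equal 1.0). Omitting a correlated predictor shifts the fitted coefficient away from the effect it is supposed to estimate:

```python
import numpy as np

# Illustrative example: y depends on x1 and x2 with true coefficients 1.0 each,
# and x2 is correlated with x1. Regressing y on x1 alone makes the x1
# coefficient absorb x2's effect (here it lands near 1.8, not 1.0).
rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)        # omitted predictor, correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
b_full, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)
b_short, *_ = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)
print(b_full[1], b_short[1])  # about 1.0 vs about 1.8
```

No amount of testing the error distribution would flag this: the short model’s residuals look perfectly fine, yet the coefficient answers a different question.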

Michael said, “Can you please provide me a little more information about your criticism?”

Am I correct in assuming that the criticism you are referring to is, “But another hypothesis test just increases the problem of multiple testing.”?

I’m not sure just what your background is. I was assuming that you were well aware of the problem of multiple testing. If you are aware of this problem, I’m not sure what more information you need. In case you are not familiar with this problem, here are some references:

https://en.wikipedia.org/wiki/Multiple_comparisons_problem

http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/#comment-990037, pp. 3 – 16

Oops — my last link was in error. I intended:

https://web.ma.utexas.edu/users/mks/CommonMistakes2016/SSISlidesDay4_2016.pdf , pp. 3-16

Apologies for not double checking (still getting used to new computer).

Almost sure hypothesis tests are robust to finitely many multiple comparisons, but one can still use the Bonferroni correction to control for multiple comparisons; I discuss this in the EJS paper. Next.
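For reference, the arithmetic behind the multiple-comparisons problem and the Bonferroni fix can be checked in a few lines (illustrative numbers: 50 true-null tests at alpha = 0.05):

```python
import numpy as np

# With m independent tests of true nulls at a fixed alpha, the chance of at
# least one false positive (the familywise error rate) approaches 1; testing
# each at level alpha/m (Bonferroni) holds the familywise rate near alpha.
rng = np.random.default_rng(2)
m, reps, alpha = 50, 2000, 0.05
pvals = rng.uniform(size=(reps, m))              # p-values are uniform under true nulls
fwer_fixed = np.mean((pvals < alpha).any(axis=1))
fwer_bonf = np.mean((pvals < alpha / m).any(axis=1))
print(fwer_fixed, fwer_bonf)  # roughly 0.92 vs roughly 0.05
```

The fixed-alpha familywise rate here is 1 − 0.95^50 ≈ 0.92, which is why running many tests at a fixed level is so prone to false positives.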

Michael, Anon:

Just to try to short-circuit this discussion: Just about any statistical problem can be solved in any statistical framework. The advantage of one framework over another is ultimately in flexibility and ease of use. Which in turn depends on context and resources. Which is one reason why different people find different frameworks to be useful.

If that is true, then why do we need to abandon statistical significance?

Consider a double-blind study of a new drug where a positive coefficient indicates the drug is more effective than a placebo.

In this case, it seems to me that a one-sided test would be sufficient. I have a feeling a fair number of people at the FDA would agree with me on that.

Regardless, I am a person and I disagree with you, so all the “everyone” claims are false. I need a little more on point 1 please.

Sufficient for what?

Yes, people who don’t know what a p-value and/or statistical significance means agree with you… Not good company to share.

You are getting close to a good criticism, don’t get off track. Please provide more explanation on point 1.

I have no idea what “point 1” refers to.

What is it you think the one-sided test would be “sufficient” to accomplish though? You have said something like that twice now, and I suspect you have given very little thought to what comes after the “significance” is determined…

Perhaps point 1 here https://statmodeling.stat.columbia.edu/2017/08/31/make-reported-statistical-analysis-summaries-hear-no-distinction-see-no-ensembles-speak-no-non-random-error/ might help clarify ANoneuoid’s reply here.

p.s. I am not trying to be your private consultant, tutor or debating partner.

There is an issue I’m having with this site that is preventing it from saving the name field, which caused that typoed “N”. It has no significance.

But anyway, can you quote the “point 1” you are referring to? That is different from the “point 1” that Michael Naaman is referring to, right?

1. failing to distinguish what something is versus what to make of it…

What is meant by “what to make of” a reported statistical analysis summary is its upshot, or how it should affect our future actions and thinking, as opposed to simply what it is. CS Peirce called this the pragmatic grade of clarity of a concept. To him it was the third grade, which needed to be preceded by two other grades: the ability to recognise instances of a concept and the ability to define it. For instance, with regard to p-values: the ability to recognise what is or is not a p-value, the ability to define a p-value, and the ability to know what to make of a p-value in a given study. It’s the third that is primary and paramount to “enabling researchers to be less misled by the observations” and thereby discern what to make, for instance, of a p-value. Importantly, it also always remains open-ended.

Yes, I believe this is approximately what I was getting at, but I do not think it is stated as clearly as it could be.

The 1 refers to the link that you gave to the other discussion board. Are you arguing that the form of the conditional expectation can change depending on the set of variables that are being conditioned upon?

Ok…

1) Why are your comments all starting new threads? Are you hitting “reply to comment”?

2) How was I supposed to know what you meant by “point 1”?

3) Please respond to my questions regarding what you mean for the one-sided test to be “sufficient” before focusing on another tangent.

In the drug example, the one-sided test is sufficient to show that the drug has some positive effect that cannot be explained by the placebo, under appropriate regularity conditions. You can also do AS tests for intervals or more general sets as long as they aren’t too crazy. I basically took this one-sided example from a Valen Johnson paper (a Bayesian who also wrote “Redefine statistical significance”) in the Annals of Statistics on uniformly most powerful Bayesian tests. I believe Dr. Gelman criticized Johnson’s approach. Next.

Thanks, but that conclusion is more than is allowed even under the most ideal of circumstances (even beyond the undoubtedly optimistic “appropriate regularity conditions” you had in mind).

All you know is the outcome was “better” (higher/lower) in the treatment group for some reason, something that had 50/50 chance of happening a priori.

Perhaps the drug makes the person twitch a bit when injected but the “placebo” doesn’t, so the doctors/nurses unconsciously tend to treat/score the patients slightly differently. Such a drug would not be having a “positive effect”, and should not be approved as a treatment based on that.

Valen Johnson advocated for a one-sided Bayesian test in a hypothetical drug-testing scenario in that Annals of Statistics paper. Is it acceptable to do a one-sided uniformly most powerful Bayesian test?

Or can we add uniformly most powerful Bayesian tests to the petition?

No, Bayesian vs Frequentist vs Fisher vs “Eyeball” is totally orthogonal to this issue. You can use any method to test a strawman “hypothesis”, and all are a waste of time. The Frequentist vs Bayesian etc debate is only a distraction from the BIG problem with the procedure.

Please check the Meehl paper I referenced earlier:

Paul E. Meehl, “Theory-Testing in Psychology and Physics: A Methodological Paradox,” Philosophy of Science 34, no. 2 (Jun., 1967): 103-115. https://doi.org/10.1086/288135

+1

Ah, I see now. I wrote it as a reference [1], which you took as “point 1”. Sorry about that.

Are the physicists wrong too?

If they are testing a model that may actually be correct, then I don’t think so. There is nothing wrong* that I see with testing a model you think may actually be correct using the exact same methods.

I am no expert in physics, but see here upthread:

https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/#comment-984498

Basically, my impression is physics is “on the path to hell” with increasing reliance on statistical significance. They do still test their own models as well, but the next step we have seen in other fields (medicine, psychology, etc.) is to eventually drop that altogether.

* However, perhaps the exactness of the test is causing them to reject too many nascent theories that produce approximately correct predictions that could be fixed with a bit more attention… I leave this as a future problem.

This should be “increasing reliance on testing for statistically significant differences from a model of the ‘background’”. That is, as opposed to checking the predictions of whatever theory directly, i.e., they are allowing ever more vague “predictions” to count as support for a theory.

But the physicists have done quite well when it comes to experiment replication, while psychology and some areas of medicine have struggled mightily with experiment replication.

Reproducibility is another important topic but not what I am talking about here.

You seem to keep trying to drag distractions into the discussion. First nitpicking what was obviously a simplifying assumption, then Bayesian vs Frequentist, now reproducibility…

It really isn’t a complicated point: there are always many possible explanations (some interesting, some not) for a statistically significant difference so it is not of interest to “discover” it.

A fair number of Nobel prizes have been awarded based on experiments that use statistical significance (LIGO and the Higgs boson come to mind).

Do you think those prizes should be rescinded? Should the winners pay the money back?

What about climate change deniers or anti-Vaxers? They will be very happy.

They get to point at your petition and say, “Fake statistics”. This petition is reckless with little theoretical foundation and I encourage everyone to take their name off the petition. I don’t like the Bayesian approach, but I can’t deny that it is perfectly acceptable in some situations. I just think my way is better.

Once again you change the subject. At least you stopped saying “Next” after doing it though, so I think I got through a little bit.

But yea, this is a BIG problem. Most people cannot handle it, their mind recoils at the implications and they turn to blind hope that somehow it must have worked out anyway because so many highly educated people can’t be so wrong…

But only a few hundred years ago all the most highly educated people spent their time worrying about theology, something most researchers consider largely a waste of their lives today. I think academia is in a very similar state to the Catholic Church ~1500 AD:

https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/#comment-987745

Anyway, please try to stay on topic and justify NHST in the face of the problem raised by Meehl that I pointed out in this thread. I do not believe this is possible; I am confident NHST is just pseudoscience based on a common logical fallacy (strawman) that (at best) has led to a nearly unbelievable squandering of time and resources. But if you still have hope, please do try.

Recall the topic: “Most of the criticisms of NHST relate to keeping the significance level fixed.” I think I have proved my point, as you have come up with one or two coherent criticisms while I have shown that keeping the significance level fixed leads to problems with optional stopping, multiple comparisons, model selection, the Berger-Sellke paradox, and the Jeffreys-Lindley paradox. Next.

Michael: > keeping the significance level fixed leads to problems

OK, but that was pointed out in Lehmann’s Testing Statistical Hypotheses many years ago.

(I recall thinking because it was in a later chapter that many statisticians just never got to it).

As Andrew pointed out in one of the many comments here, “Just about any statistical problem can be solved in any statistical framework. The advantage of one framework over another is ultimately in flexibility and ease of use. Which in turn depends on context and resources.”

So you may have a way of rectifying most of the problems associated with NHST and that would be a start not a finish. Likely you will have to convince enough in the statistical community that you have rectified them and then – this may be the most important – they can be successfully taught to users so that they use them sensibly. I don’t believe further comments here are likely to do much for either.

Also, here: https://statmodeling.stat.columbia.edu/2016/02/15/the-recent-black-hole-ligo-experiment-used-pystan/#comment-263404

Dear Keith O’Rourke,

Yes, Lehmann pointed it out, but I did the work and constructed a theory to solve the problem. Lindley specifically said that there is no justification for keeping the significance level fixed. I have incorporated that into NHST. Isn’t this how the academic process is supposed to work? Changing a theory in response to criticisms.

I agree that I have not addressed all of the criticisms of NHST, but I need time to prove my point to a larger audience, but you guys are just shouting down dissent with petitions. Nobody is writing any academic articles about why AS hypothesis testing is insufficient, but rather there is a rash of papers complaining about NHST without even discussing the literature that solves the problems that are being complained about.

Michael: the problem with NHST is that the logic of it is flawed from the beginning, no amount of adjusting the p threshold helps.

The logic falls down because an NHST type test answers the following type of question:

If the world were a random number generator and my treatment and control groups were the same random number generator, then would I see test statistics more extreme than my actual data results more than p fraction of the time?

let’s suppose that the answer is no (ie. p is smaller than a threshold you think is reasonable, like p = 0.002)

Now what can we logically conclude? The answer is:

“Either the world is not a random number generator, or the world is a random number generator but not of the type I was testing, or the world is a random number generator of the type I was testing, but the treatment and control random number generators are different in some way for some unknown reason.”

Since “the world *isn’t* a random number generator” is the correct answer almost all of the time, and we knew this before we started, the NHST did nothing for us. We learned *literally nothing*.

But even if we accept “the world is as-if a random number generator in this case” then we’re left with

1) It’s not the type I chose

2) The treatment and control are different for some unknown reason

Testing the goodness of fit of a random number model to data requires *massive* amounts of data. For example, the Dieharder tests use literally *billions* of data points to determine with precision whether a pseudo-random-number generator is “as if” a uniform [0,1] random sequence. You can just about count on your hands the number of cases where we have billions of data points (particle colliders, credit card transactions, and a few other things, almost zero science). Even if you’re OK with testing the goodness of fit using just, say, 100,000 data points, it’s still massively more data than the vast majority of science has, and furthermore it usually must be collected over long times to test stationarity assumptions.
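The order of magnitude is easy to verify with a back-of-the-envelope calculation. To see a defect as small as a coin landing heads 50.1% of the time instead of 50.0%, the standard error of the observed frequency (roughly 0.5/sqrt(n)) must shrink well below the 0.001 bias. These are illustrative numbers, not the actual Dieharder criteria:

```python
# How many flips does it take to see that a coin has P(heads) = 0.501
# rather than 0.500? The standard error of the observed frequency is about
# 0.5/sqrt(n), so the 0.001 bias only stands out once n is in the millions.
bias = 0.001
sigma_ratio = {}
for n in (1_000, 1_000_000, 100_000_000):
    se = 0.5 / n ** 0.5          # standard error of the observed frequency
    sigma_ratio[n] = bias / se   # the bias, measured in standard errors
print(sigma_ratio)  # the bias is ~0.06, 2, and 20 standard errors, respectively
```

At a thousand flips the defect is invisible; at a million it is a mere two standard errors; you need on the order of a hundred million to nail it down, which is the regime RNG test suites actually operate in.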

Neither (1) or (2) get us closer to scientific understanding such as “when you give drug A it interacts with certain cytokine pathways which causes the rate of reaction X to decrease, which reduces PDQ production in cells M which decreases swelling and pain for H hours… but only in patients who have problems associated with this pathway”

Because (2) just means “there was something different” which we already knew, there’s *always* something different. Extraordinary care needs to be used to confirm that the “something different” isn’t “something meaningless we don’t care about” like the difference in lighting between one side of your biological bench and the other, or the difference between having men change the mouse cage vs women (and the resulting changes in pheromone or perfume or strength and size of the technician’s hand or whatever)

Anoneuoid’s point is: until you develop a scientific model that you believe might actually be real with quantitative predictions from that model which can be corroborated in data, you don’t have science, you have rolling dice and picking the outcomes you like.

Making the tests “almost sure” under the mathematical assumptions about the random properties of the data just solves mathematical problems perhaps of use to the people who design pseudo-random number generators. None of it helps actual scientists get less wrong about how the world works.

Is the 737 MAX crashing too often? It has had two fatal crashes in less than two years, after around half-a-million flights. The previous models (737 600/700/800/900) have accumulated 60 million flights, with only 9 fatal events.

The observed fatal crash rate of the old models is 0.15 per million flights. The probability of getting two or more fatal crashes after 500k flights at that rate is less than 0.3%. It’s true that two crashes is only one crash away from a single crash (which wouldn’t be so rare: it would happen with probability 7%). It’s true that there is selection bias, because the only reason I’m doing this calculation is that there were two major crashes in the first place. It’s true that this calculation doesn’t tell me what is the cause of the accidents.
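These figures follow from a Poisson model with the stated rate (0.15 fatal crashes per million flights, over 0.5 million flights):

```python
import math

# Expected number of fatal crashes in 500k flights at the old models' rate.
lam = 0.15 * 0.5                            # 0.075 expected crashes
p_ge_2 = 1 - math.exp(-lam) * (1 + lam)     # Poisson P(2 or more crashes)
p_ge_1 = 1 - math.exp(-lam)                 # Poisson P(at least one crash)
print(round(p_ge_2, 4), round(p_ge_1, 3))   # prints 0.0027 0.072
```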

Nevertheless, I wouldn’t say that we learned *literally nothing* from this calculation and I don’t think that billions of flights are required to be able to learn something.

Carlos:

This is a good example where the p-value, if carefully interpreted, can be a good data summary. I’d prefer a Bayesian analysis as it plugs more directly into decision calculus, but I agree that the p-value tells us something. One thing I wouldn’t want to do in this case is to make a decision based on “statistical significance,” that is, whether the p-value reaches some threshold. For one thing, I’m not interested in the null hypothesis that the crash rate of the new plane is exactly equal to the crash rate of the old plane (a hypothesis I know is false: not only are the planes different, also the conditions under which they’re being flown have changed in some ways). What’s relevant here is the probability of crashes in future flights.

I fully agree.

Here all you care about is literally whether a frequency is close to equal in two groups where millions of observations are available; it rather makes my point than contradicts it. But the truth is, most of the reason we are concerned about these crashes is that we have a mechanistic model of the failure, in which a single non-redundant sensor may be failing and causing the computer to point the nose down to avoid a nonexistent stall. The Bayesian analysis of this mechanism suggests that the observed behavior is exactly what would be expected if the sensor or computer software misbehaved, making this explanation very likely, and the predictions are that in the future it should increase in frequency as these sensors continue to age.

Also, rather than being some “random” type of event over which we have little control, given the assumed mechanism it may be fairly easy to reduce the chance of it happening in the future, long before we have sufficient, very expensive (lives lost) evidence of a “statistically significant difference in rates”:

1) Training pilots how to disable the system; apparently there’s a switch right in the center of the cockpit you can use to turn off this system if it’s misbehaving

2) Rewriting the software (Boeing is already doing this) to use additional types of inputs to make a better decision about whether a stall is really occurring rather than relying solely on a single stall sensor. Bayesian inspired decision making would help here.

3) Inspecting or replacing stall sensors on a better schedule to reduce the chance of failure of the sensor itself

etc

Basically by using a mechanistic Bayesian style predictive decision making scheme we can react to a situation *long before* we have frequentist “statistically significant” evidence that the pure observed rates are different.

Science is about understanding mechanisms in the world. Frequencies are about ignoring mechanism, replacing mechanism with a Random Number Generator, and purely measuring how often one vs another thing occurs in repeated measuring.

Measurement is useful, and flagging atypical events is useful, but usually not *for understanding mechanism* which is what I think science is about. If your goal is just “Detect when something happens that’s abnormal from a mass of usual type of data” then this is actually pretty much the actual legit usage of Frequentist statistics.

I use frequentist analysis myself in various places. For example I used a database of seismograph readings to find a threshold where most “non earthquake” events failed to reach this threshold, and then filtered out all events that exceeded this threshold so that those filtered events were much more likely to be small earthquakes…. and then analyzed this smaller intentionally biased dataset using bayesian mechanistic models.

I’ve used frequentist models to determine whether filtering genes out of a genome based on some certain biological criteria would produce similar or different distributions of certain gene characteristics compared to random selection, as a way to decide whether that criterion was “informative” in the sense of causing a change in the distribution of certain characteristics of the selected population, etc. In each case, the clear question *really is* whether a computational process of selection is practically equivalent to a pseudo-random number generation process. It is essentially testing two kinds of mathematical constructs.

Unfortunately, when the process is not a computational/mathematical construct, such as when measuring outcomes of a small number of surgeries with many variables which can change systematically from surgeon to surgeon or place to place or population to population or the like, or checking something about an assumed relationship between GDP vs longevity over the non-repeatable historic period 1960 to 2000… it’s not appropriate.

“Basically by using a mechanistic Bayesian style predictive decision making scheme we can react to a situation *long before* we have frequentist “statistically significant” evidence that the pure observed rates are different.

Science is about understanding mechanisms in the world. Frequencies are about ignoring mechanism, replacing mechanism with a Random Number Generator, and purely measuring how often one vs another thing occurs in repeated measuring.”

Nicely put (and with a good example!)

Martha:

Here the Frequencies are just focused on testing model fit.

They also could be focused on how often analysis methods (including Bayes) would take you more towards or away from reality (assessed of course by assuming different realities are true.) One way to frame this in Bayesian analyses is as quality assessment of the study design and analysis protocol (use this prior and data model). Bayesian justified only before the study but still relevant after the study.

The pain of getting reality wrong when it has consequences is not affected by how you abstractly justified your analysis methods.

This requires some sort of cost-benefit analysis on the part of the airlines, passengers, etc. We can get an idea by looking at what rates have been acceptable in the past.

The Concorde had a crash rate of ~11 per million flights: http://www.airsafe.com/events/models/rate_mod.htm

Assuming the odds of a crash on each flight are iid, you would see two crashes after 500k flights about ~6% of the time (at least two crashes 97% of the time):

> 100*dbinom(0:4, 500e3, prob = 11/1e6)
[1] 0.4086648 2.2476810 6.1811784 11.3322398 15.5819076

The average crash rate across airliners is an order of magnitude lower though, closer to once per million flights (I divided Events/Flights; no idea what is going on with their rates). In which case we would see 2 crashes after 500k flights ~7% of the time (at least two crashes 9% of the time):

> 100*dbinom(0:4, 500e3, prob = 1/1e6)
[1] 60.6530508 30.3265557 7.5816314 1.2636014 0.1579494

Overall I get 528 events from 908 million flights (crash rate of ~.6 per million flights), which according to the iid model would yield at least two crashes ~3.5% of the time:

> 100*dbinom(0:4, 500e3, prob = 528/908e6)
[1] 74.77033406 21.73940492 3.16034954 0.30628834 0.02226313

But of course we can do much better: the odds of a crash on each flight are not going to be iid… the rate probably drops with time as everyone gets used to the idiosyncrasies of the new model, depends on where they tend to fly, etc.

Really, I am not sure what purpose the iid assumption has here. Why not just compare crash rates directly?

The 737 MAX’s 4 crashes per million is higher than the overall .6 crashes per million, but the crash rate is approximately inversely proportional to the number of flights per model: https://i.ibb.co/7rVmGmN/planecrash.png

So to be more comparable we should do crashes per first 500k flights. I don’t know where to find that data but would now be interested in seeing it.

> The Concorde had a crash rate of ~11 per million flights: http://www.airsafe.com/events/models/rate_mod.htm

Note that the rate for the Concorde is calculated from one single crash in only 90,000 flights.

Yes, that is our only other data point for less than a million flights. I fixed the curves and added the 737 MAX:

https://i.ibb.co/bBVhF0Z/planecrash.png

It looks like it has had about the expected number of crashes to me.

There are a couple of notable absences from the chart: the A380 (over half-a-million flights) and the 787 (over one million flights).

What would be the expected number of crashes to you?

It sounds like your model is that airplanes have 1-10 crashes during their early years as the bugs get worked out, and then never crash again. That’s the only thing that would give you k/N as the rate where N is number of flights.

I’d suggest more like k/N + r where r is a long term rate, and is probably something like 0.2 to 0.5 per million
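As a sketch of what that k/N + r model implies (Python; the values k = 2 and r = 0.35 per million are illustrative placeholders, not estimates):

```python
def expected_crashes(n_flights, k=2.0, r_per_million=0.35):
    # Expected total crashes after n_flights: k "teething" crashes early on,
    # plus a long-run rate of r_per_million thereafter.  k and r are
    # hypothetical values chosen only for illustration.
    return k + r_per_million * n_flights / 1e6

def average_rate_per_million(n_flights, k=2.0, r_per_million=0.35):
    # The observed average rate is then ~ k/N + r: roughly inversely
    # proportional to N early on, falling toward r as N grows.
    return expected_crashes(n_flights, k, r_per_million) / (n_flights / 1e6)

for n in (500_000, 5_000_000, 50_000_000):
    print(n, round(average_rate_per_million(n), 2))  # 4.35, 0.75, 0.39
```

This reproduces the near-inverse-proportionality in the chart at small N while still leaving a nonzero long-run rate.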

But the mechanism of this is that we learn from the first k. If we just say “hey, no biggie these crashes are not statistically significantly different from the typical rate” then eventually after doing nothing, we’ll detect that “you know what, there was some crappy design all along, sorry we killed all your friends waiting for a sufficient sample size”

Carlos Ungil wrote:

Where are you getting this info, also that the 737 Max had ~500k flights? Anyway, like Daniel said it looks like we should expect 0-10 crashes for the first million or so flights, the uncertainty around those curves should get thinner as more flights accumulate. I’m sure the airlines have better data on how many “flights-in” each crash occurred though, would be interesting to see. Maybe it is public?

Daniel Lakeland wrote:

Sure that is definitely the case, but I think playing with another parameter is just complicating things at this point so leave k = 0 for now. Perhaps the planes all get replaced by new models before reaching very near to 0 anyway.

Daniel Lakeland wrote:

But how many people got to their destination in superior comfort/speed/whatever? Is it worth a 4 in a million chance of crashing over 1 in a million to get there twice as fast?

Either way, every crash should be investigated to attempt figuring out what went wrong rather than letting it be summed up as “random chance” like this model treats it.

Typo: “so leave r = 0 for now”

Compare the logical results of doing an NHST with the logical results of fitting a mechanistic model of a process using a Bayesian model. The Bayesian model answers the following question:

Assuming my mechanistic model of the world is true, which values of the unknown quantities in my model are both consistent with my expectations (prior) and produce predictions which are consistent with my expectations for the predictive accuracy of this model (actual data is in the high probability region of the likelihood)?

At the end of a Bayesian calculation you’re left with a sample of vectors of unknown parameters which balance your prior expectations of their plausible values with your prior expectations for predictive accuracy of a working model. This then allows you to decide whether the model “works” which is a judgement made based on whether the sample of parameter values is in the high probability region of your prior and produces predictions for which the data is in the high probability region of the likelihood. Of course it’s possible that the high probability region of the posterior doesn’t match your prior and also doesn’t produce good predictions. This is an indication that your model is wrong.

But there are further model checks you can do to confirm the scientific logic of your model, like having the model predict additional quantities and then observing whether the predictions match new data, or deciding that it’s fine for some of the predictions to fail to match as long as certain properties of the predictions match (such as, the median or mean or some function F(q) of the predictions matches in some useful way according to a utility function you decide on).

None of this relies on any *frequency properties* as the Bayesian calculation is entirely about *plausibility under certain assumptions*.

So the reason NHST needs to go isn’t really about dichotomization, and isn’t really about p thresholds, and isn’t really about any of the math of probability on infinite sequences of numbers…. it’s about the fact that applying NHST to *scientific inquiry* is for almost every real-world case based on a logical fallacy of what it will allow us to conclude.

If it’s not too off-topic, I’m curious to hear what the community here thinks of the use of significance in this paper:

Fox CW, Paine CET. Gender differences in peer review outcomes and manuscript impact at six journals of ecology and evolution. Ecol Evol. 2019;00:1–21. https://doi.org/10.1002/ece3.4993

They investigate the question with a relatively large dataset, at least in terms of number of manuscripts (> 23k), so I’m guessing it shouldn’t run into issues of detecting small differences in small samples that have plagued some of the examples raised in recent posts on this blog. The authors make claims along the same lines in their discussion: “The large sample size of the current study, >23,000 papers submitted to six journals, provides the statistical power necessary to detect gender differences in the range of 5%–10%.” Although, for some of their tests, which were done on single-author papers, the sample size was (only?) 1121 manuscripts.

Yr:

Rather than pulling out various statistically significant comparisons (as you note, when N is large you’ll find a lot of statistical significance), I think it would make sense for them to display a grid of all their findings.

Reminds me of this:

https://statmodeling.stat.columbia.edu/2019/02/11/global-warming-blame-the-democrats/

Also:

Reminds me of this: https://en.wikipedia.org/wiki/Simpson%27s_paradox#UC_Berkeley_gender_bias

The averaged yearly “success ratio” is going to be different from the total “success ratio”. Instead of calculating a ratio for each year and taking the average of them, why not calculate a single total ratio?

A small R simulation showing what I mean: https://pastebin.com/bgVY9VfN

Mean of yearly ratio: 1.293038

Total ratio: 1.153061

Which is correct?
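For anyone who doesn’t want to follow the pastebin link, here is the same point in a self-contained sketch (Python, with made-up accept/submit counts that differ from the pastebin’s simulated data):

```python
# Hypothetical counts for groups A and B over two years:
# (A_accepted, A_submitted, B_accepted, B_submitted)
data = [(10, 100, 5, 50),    # year 1: both groups accepted at 10%
        (90, 300, 20, 100)]  # year 2: A at 30%, B at 20%

# Average of the per-year success ratios: (1.0 + 1.5) / 2
yearly = [(a / an) / (b / bn) for a, an, b, bn in data]
mean_of_yearly = sum(yearly) / len(yearly)

# Single ratio computed from the pooled totals: (100/400) / (25/150)
rate_a = sum(r[0] for r in data) / sum(r[1] for r in data)
rate_b = sum(r[2] for r in data) / sum(r[3] for r in data)
total_ratio = rate_a / rate_b

print(mean_of_yearly, total_ratio)  # 1.25 vs 1.5
```

The two disagree because the yearly average weights each year equally while the pooled ratio weights years by submission volume; which one is “correct” depends on the question being asked.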

Thanks for your thoughts.

Andrew: I believe they showed non-significant comparisons too — their single author findings were not statistically significant — but yes, a table of all their findings would have been more useful.

Anoneuoid: I’m not sure what they’re trying to achieve with this. My guess is that they were attempting to control for potential shifts in attitudes across the years, but I would think that such trends might be better uncovered by comparing ratios between years than using a mean.

Petition: You should stop living your life according to dogmas.

Audience: But if we drop our current dogmas, what dogmas should we use? They look as ugly as our current ones.

“You are all individuals!”

“WE ARE ALL INDIVIDUALS.”

“I’m not!”

“SHHHHH!”

Michael,

I’ll have a go. But before I do let me say that I like the Almost Sure Hypothesis Testing idea. I can see advantages over the standard approach. But.

The example I want to use is the one you mentioned above:

“In a regression model, the standard errors will be wrong in the presence of heteroskedasicity (unequal variances), which is why one might be interested in testing the assumption of homoskedasticity….”

Say you do this, and the ASHT approach says you should reject homoskedasticity. OK … but does it tell you how big the size distortion is likely to be? As I understand it, the answer is no. It looks like you have evidence your SEs are “wrong”, but you don’t know how wrong. If nominal size is 5% and actual is 20%, you should probably be worried. If nominal size is 5% and actual is 5.1% … no biggie.
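To see how large the distortion can get, here is a hedged Monte Carlo sketch (pure Python; the design, with error sd proportional to x in a no-intercept regression, is made up purely for illustration):

```python
import math
import random

def reject_rate(n=50, reps=4000, seed=1):
    # Monte Carlo size of a nominal-5% t-test of slope = 0 when errors are
    # heteroskedastic (sd_i = x_i) but classical, homoskedasticity-assuming
    # standard errors are used.  No-intercept model y = b*x + e, true b = 0.
    rng = random.Random(seed)
    x = [1.0] * (n // 2) + [10.0] * (n - n // 2)  # high-leverage design
    sxx = sum(v * v for v in x)
    rejections = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, v) for v in x]  # error sd proportional to x
        b = sum(vx * vy for vx, vy in zip(x, y)) / sxx  # OLS slope
        resid = [vy - b * vx for vx, vy in zip(x, y)]
        s2 = sum(r * r for r in resid) / (n - 1)
        se = math.sqrt(s2 / sxx)  # classical SE, wrong under heteroskedasticity
        if abs(b / se) > 1.96:
            rejections += 1
    return rejections / reps

print(reject_rate())  # well above the nominal 0.05 in this design
```

Of course the distortion depends heavily on the design, which is exactly the problem: a rejection alone doesn’t tell you whether you are in the 20% case or the 5.1% case.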

Of course, the standard textbook approach to testing for heteroskedasticity has the same problem. I’m saying only that the ASHT approach (as I understand it, at any rate) doesn’t help. If I’m wrong, I’d like to hear about it, because this weakness of some specification tests bothers me a lot.

I’m talking only about the test for heteroskedasticity, because that was your example. Not about the “statistical significance” of the coefficient of interest etc.

Mark:

When you are ready and if you are willing it would be nice to get your take on Michael’s ASHT for the readers here (including me).

We can’t all run off and read his paper but that does not mean some of us are not interested.

Sorry, saw your comment but I’ve been swamped. Will try to return to this at some point. But if you want to check it out yourself, the paper isn’t very long and has some nice examples, and there’s even a Wikipedia entry for Almost Sure Hypothesis Testing that has the gist of it plus some examples.

This is a great discussion & I think it’s useful, however it may be missing the point.

At the end of the day, the problems with NHST & the over-dependence on statistical significance all stem from the more original problem of publish-or-perish. This is what justifies the sense that “all’s fair in love & academic publishing”, so long as you get enough papers [ie get a good p-value]. Whatever we replace NHST with, ie whatever the criteria are for getting a publication, the existing incentive structure will continue to motivate a glut of papers with little-to-no replication potential.

In this context, advocating that we judge research on its merits or do away with statistical significance smacks of naïveté… I mean, I get this question a lot, but I haven’t heard a good response: If we do away with NHST, what criteria should we use instead? What this is really asking is, “what are the rules of the game if not NHST (p<0.05)?”

I get that we've (the research community has) become so reliant on this tool and so driven to publish that we have largely forgotten how to judge a body of research by any other means. But, I think we have to assume that folks will continue to try to game the system if the incentives haven't changed.

It really is ridiculous to keep hearing this argument. Just go look at what people were doing before NHST was a thing in your field.[1, 2] Things scientists do:

1) Measure things carefully and figure out what needs to be controlled to get reproducible results.[3]

2) Come up with possible explanations for reliable phenomena found in #1, turn them into a specific, testable model. Then derive a prediction that distinguishes your explanation from the others (usually this means more precise than “A will be higher than B”).

3) Compare the prediction of #2 to new observations.

People can be judged on:

1) How easily can their results be reproduced by others?

2) How good are they at coming up with explanations and/or deriving useful predictions (in the sense of they differ from the predictions of other proposed explanations) from these explanations?

3) How accurate do their predictions tend to be?

Outside of something like testing a model of light cones, none of this has anything to do with checking whether two variables have exactly zero correlation.

[1] Here is a great example (discovery of insulin) that someone tried to pass off as successful NHST: https://statmodeling.stat.columbia.edu/2019/03/05/back-to-basics-how-is-statistics-relevant-to-scientific-discovery/#comment-984175

[2] Here is a small list of examples of people in the “soft sciences” coming up with quantitative laws/models I gathered awhile back: https://statmodeling.stat.columbia.edu/2017/07/20/nobel-prize-winning-economist-become-victim-bog-standard-selection-bias/#comment-530272

[3] There is the issue of choosing what to measure/observe to begin with though, I have no doubt that people would game this one by measuring only the easiest things no matter if they are of any theoretical importance.

@Anoneuoid: This is great, thank you so much. I will reference your post when I get this question – because I get it a lot! The community particularly needs good examples of published research to emulate.

Just to clarify, I wasn’t making this argument, merely repeating it. I do not dispute the need to move away from significance testing. Rather, I want to emphasize that this tendency towards binary decision making on the basis of statistical thresholds or criteria is a consequence of the publish-or-perish paradigm, in which statistics is seen as the gatekeeper or referee limiting access to publication. Hence why we keep hearing the same question over & over again (re: the rules of the game).

I would love to see more focus on the criteria for judging a research career as you describe here, since I believe this is the crux of the issue.

Sorry, I have just heard it so many times. It goes along with “biology/etc is more complicated than physics that is why we need to do it this way”.

To me it is clear the obstacle sits with the research community, at least in addition to (if not rather than) the complexity of the phenomena under study. And really, the examples abound if you just look at how research was done pre-1940 or so.

There are many examples, but classic ones are Huxley’s model of the action potential and subsequent experiments, Huxley’s 1966 measurements of muscle tension at different sarcomere lengths to test predictions of the sliding filament model (13-14 years after the sliding filament model was proposed), and Luria and Delbrück’s 1943 “fluctuation test” to test quantitative predictions that differed between two models of mutation.

Jacki said,

“Rather, I want to emphasize that this tendency towards binary decision making on the basis of statistical thresholds or criteria is a consequence of the publish-or-perish paradigm, in which statistics is seen as the gatekeeper or referee limiting access to publication. ”

To add to Anoneuoid’s response:

I don’t agree that the tendency towards binary decision making “is a consequence of the publish-or-perish paradigm”. The publish-or-perish paradigm is itself an instance of binary decision making — so I would say that the tendency toward binary decision making is a driver of *both* the publish-or-perish paradigm and the dependence on p-values or other criteria. So this is where change needs to start: putting less emphasis on binary thinking, and focusing educational efforts on the more complex thinking that is needed in science.

Jacki also said,

“If we do away with NHST, what criteria should we use instead? What this is really asking is, “what are the rules of the game if not NHST (p<0.05)?""

As I see it, thinking in terms of "the rules of the game" is another oversimplification that often leads to poor science. To the extent that I could list any "rule of the game" it might be, "Keep your mind open to reality, including uncertainty and other types of complexity". Anoneuoid's suggestions are good examples of paying attention to reality.

Anoneuoid said: “that someone tried to pass off as successful NHST:”

Just a correction, because no one likes to have their views mistakenly passed on to others. I specifically used the insulin paper as an example of the way experimental biology was done before NHST. Somewhere around 2000, experimental biologists simply did the same thing and added p-values (generally lots and lots of them).

reflecting on that conversation, I think we largely agree on everything (your 1-2-3 points above for both how to do science and how to judge science) except that I have more faith that modern experimental biology (not clinical medicine, not nutrition, not much of epidemiology) is discovering stuff despite how they do science (quantitative phenomenological models are uncommon, the researchers think that NHST is justifying their conclusions, and the researchers never consider at all the biological consequences of variation in effect size).

It is not the same thing. Now what they care about is “was the difference significant,” which in their confused minds means “was there a real difference.”

Where do the authors show concern about this in the insulin paper? I see people reporting quantitative observations, not looking for differences.

Stop placing any bar on publication at all. Create an official government sponsored archive of public research. Make it free to submit, and automatic to be published. It’s just an enormous geographically distributed redundant write-only hard drive.

Now, publication itself is automatic, it signals nothing, there’s no point in publishing bullshit just to say “Hey I have a Nature Paper” and we can go back to actually reading stuff we care about and thinking about it and deciding if it’s right or wrong.

As for promotion and tenure and etc. let people do their usual monkey-brained social climbing bullshit, but decouple it from polluting the truth with crap.

Further, if it receives federal funding, or if it’s conducted at a university that uses federal overhead from federal grants, *require by law* that it be submitted to this archive.

> Stop placing any bar on publication at all. Create an official government sponsored archive of public research. Make it free to submit, and automatic to be published. It’s just an enormous geographically distributed redundant write-only hard drive.

The hard drive is already implemented; it is called “the Internet”. The actual problem is how to actually fund the correct people, ie how to organize a class of professional scientists.

No, the Internet Archive might be more like it, but *lots of stuff drops off the internet on a regular basis*. The difference between what I’m suggesting and “the internet” is all about permanence, discoverability, citeability, and the *require by law* that you publish there in order to get public funding

I see your point, but I think the technology and practical implementation of “the thing” are less important than the culture and institutions surrounding “the thing”. Credibility is in the eye of the funding committee.

Look, people are managing to discuss science on twitter, which is like purposefully designed against these kind of discussions!

PS: “permanence, discoverability, citeability” look very much like Project Xanadu.

I’m visualizing the result as ending up somewhat like Stack Exchange. On Stack Exchange the vast majority of queries never attract any meaningful reply. What Daniel is proposing (a concept which I applaud in principle) could easily end up with millions of items “published”, few of which are ever read.

The crucial issue would be what sort of community (or “institution” in your formulation) would arise to curate the useful bits and incorporate them into a web of genuinely peer reviewed, quality-assured research results that are permanent, discoverable and citeable. I very much doubt that such a review and curation function would spontaneously arise on its own.

>could easily end up with millions of items “published”, few of which are ever read.

How would that be any different than what we have today? :-)

> genuinely peer reviewed, quality-assured research results that are permanent, discoverable and citeable. I very much doubt that such a review and curation function would spontaneously arise on its own.

The only thing I think that would be necessary is some kind of avoidance of people abusing the system to “publish” illegal content or similar. I’m skeptical that peer review is important, but I KNOW *anonymous* peer review as a hurdle to publication is actually a bad idea. Post publication peer review is fine, it’s also known as “discussing a paper”. Lots of stuff won’t be worth discussing, but then, that’s already *definitely* the case.

According to my wife the NSF stopped having deadlines for submission for grants. Now you submit whenever you want, and they just batch stuff up and evaluate it when they have enough submissions to warrant their time.

The result was submission rate for grants *dramatically plummeted*. Why would you fill up the literature with crap if you weren’t getting anything much for it? Make publication in and of itself a meaningless metric, and people will start to think about whether they should waste their time writing crap just to get “one more paper before tenure” or whatever. Pretty much overnight I think people would be much more careful about what kind of thing they chose to publish.

Daniel:

Interesting – “According to my wife the NSF stopped having deadlines for submission for grants. Now you submit whenever you want, and they just batch stuff up and evaluate it when they have enough submissions to warrant their time.”

The world is changing. My wife works as a research facilitator and if Canadian funders follow suit she might buy a bottle of very expensive single malt to celebrate – or maybe 3!

My concern here is that resources will actually flow somehow to a community that will curate the useful bits, purposefully review them and ensure they are permanent, discoverable and citeable.

>My concern here is that resources will actually flow somehow to a community

I too agree with this. It’s the reason I donate to Wikipedia and The Internet Archive and the OpenWrt project and the Julia project and the Stan project … and maybe some others. I don’t have a ton of funds to do this kind of donation, but I recognize that without those donations these worthy projects will suffer.

I also donate my time to some of those too…

A part of me wants to suggest a distributed “permanent archive” so that universities and other organizations can each host a portion of it, thereby easily donating tangible resources to the permanence at least.

As for curation, it’s a difficult problem. Perhaps making academic promotion in part dependent on the quality of a curated list of high quality papers within their field produced by the faculty member. Make some of the time be part of what you do as a faculty member. Forget this anonymous peer review stuff, we want our reputations to be based on how well we advance scientific discourse by discovering and promoting good ideas.

Jacki:

I have two answers for you:

First, yes there are careerists etc. but there are also lots of scientists doing their best who have the impression that null hypothesis significance testing is the right thing to do. Even practitioners who don’t do much NHST, who spend their time running regressions and bootstrapping etc., often seem to think that NHST is fundamental, and this can cause all sorts of problems in interpretation of results. Think of all the people who do good solid work that would be publishable under any statistical paradigm—but they use NHST to categorize results as significant or not, leading to a jumble of reporting and a noisification of all results. Abandoning statistical significance could help these people a lot, in part by giving them permission to look at the data more directly.

Second, yes, careerists gonna careerist. But I’d rather see them do their careerist thing using paradigms that involve more data exploration and display. I’d rather see careerists display all their results. If, for example, it became the standard “careerist” thing to display all data and relevant comparisons using ggplot, I think that would be a step forward: I think it would make papers easier to read and more useful for subsequent researchers. And, even careerists would rather get closer to the truth, all else equal.

Agree here.

There are real issues of academic culture (this morning I was reminded by a query of one of my takes on that, here, and noted it seemed old news and hopeless: https://statmodeling.stat.columbia.edu/2017/08/31/make-reported-statistical-analysis-summaries-hear-no-distinction-see-no-ensembles-speak-no-non-random-error/ )

But we should try to make it easier for non-careerists to do better research and harder for careerists to be careerists, regardless of, or perhaps especially given, the resilience of academic culture to change.

Also found helpful material from the link here https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/#comment-991517

“One aspect is how people acquire and use statistical knowledge and how they think about statistical concepts—this is the descriptive facet of statistical cognition. The study of how people should think about statistical concepts—the normative—is also an important aspect of statistical cognition as this is often what we are exposed to (e.g., in school) and it is also the standard to which our performance is usually compared. Finally, the question of closing the gap between the descriptive (the “is”) and the normative (the “should”)—the prescriptive—is a critical issue in statistical cognition.”

Most of the conversation here seems to have been on normative (the “should”) and less so the descriptive (the “is”) and not enough on the prescriptive (the “how to enable others”).

“The prescriptive facet seeks to improve statistical practice, and statistics learning…How to help people to make good decisions … How to train people to make better decisions”

Also new link here https://thenewstatistics.com/itns/2019/02/19/statistical-cognition-an-invitation/

I would love to see more discussions here between Deborah Mayo and Sander Greenland. I just posted Sander’s invitation on Twitter. As a complete novice, [only one course in statistics], I was able to comprehend several of Sander Greenland’s papers whereas I was befuddled by the mix of non-standard and standard definitions of p-values & other terms in the same article.

I would also like to see John Ioannidis join in b/c I follow his logic just as easily. Then again I was also influenced by David Sackett’s work.

These blogs offer a valuable opportunity to have these discussions evaluated by the lay public as well.

Apologies, I meant that I was befuddled…in other articles I came across. Sander’s definitions were clear.