Robert Matthews writes:

Your post on the design and analysis of trials really highlights how now more than ever it’s vital the research community takes seriously all that “nit-picking stuff” from statisticians about the dangers of faulty inferences based on null hypothesis significance testing.

These dangers aren’t restricted to the search for new therapies. I’m currently conducting a literature review of existing prophylactics for upper respiratory tract infections which may reduce the risk of SARS-CoV-2 infection. I’ve found a number of studies with point estimates indicating substantial risk reduction that have nevertheless been dismissed as failures because they did not achieve statistical significance.

Maybe the time has finally come to make the big move into that “post p < 0.05 world” we’ve all been talking about for years to no obvious effect. Right now, to paraphrase that old World War Two saying, “Careless inference costs lives”.

My reply:

I agree that the idea of statistical significance continues to create all sorts of problems, both theoretical and practical, and we (the scientific establishment) should move past the practice of using statistical significance to summarize experiments. At best, statistical significance provides some rough guidance into the question, “Are more data needed to make any sort of useful conclusion here?”—but even for that specialized question, there are better tools.

That said, I doubt we’ll see any revolution right now. I expect we’ll muddle through using existing practices, partly because people are in too much of a hurry to change, and partly because of the dominance of classical statistical training. I was just talking with a medical researcher the other day who wanted to do a classical power analysis. The result of the power analysis wasn’t useless; it just had to be interpreted carefully. Interpreted in the conventional way, the power analysis could be worse than useless. I do think we should move beyond statistical significance and that lives could be saved by doing things right; unfortunately I don’t see this happening in general practice in the short term.

I think the current situation cuts both ways. While I am in agreement about getting rid of NHST (and I did sign the petition), I think we can also see the dangers of having little or no guidelines as to whether findings can be distinguished from random noise. In the present environment, I can see all sorts of treatments/diagnostic tools being touted based on small and poorly conducted studies that would not have passed the NHST filter. So, while I can agree that the NHST filter is counterproductive, I can also see the dangers of having no filter at all.

It is fine to say that all findings and data are worth being made available, but I am already feeling overwhelmed by the quantity of what is out there, and I can't keep up with trying to figure out which threads are worth pursuing and which are not. I think it points to the insufficiency of our current institutions to deal with a real-time crisis that requires good analysis.

We will muddle through and NHST does more damage at present (in my opinion) than it helps. But the alternative is far from ideal as well.

How would these tools get by an interval-estimation filter (the most commonly proposed alternative) when they don't get by a significance filter? Very noisy estimates, and effects that fail to be removed, or far removed, from 0, would show up either way.

Also, if you believe that NHST does more damage than help, then you don't need an alternative that's ideal. You just need one that actually helps.

Alternatives that actually help and can be widely adopted are more difficult to find than many assume. For example ‘does the confidence interval contain 0’ is doing essentially the same thing as a significance test, but in a *more* confusing and obfuscated way.
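To make that equivalence concrete, here's a minimal sketch (pure Python, normal approximation, made-up estimates): for any point estimate and standard error, the 95% interval excludes 0 exactly when the two-sided z-test gives p < 0.05 (up to the rounding of 1.96), so the "does the CI contain 0" check is the significance test in disguise.

```python
import math

def z_test_p(estimate, se):
    """Two-sided p-value for H0: effect = 0, normal sampling distribution."""
    z = estimate / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def ci_95(estimate, se):
    """Normal-approximation 95% confidence interval."""
    return (estimate - 1.96 * se, estimate + 1.96 * se)

# Three hypothetical results (estimate, standard error):
for est, se in [(2.5, 1.0), (1.5, 1.0), (0.3, 1.0)]:
    lo, hi = ci_95(est, se)
    excludes_zero = lo > 0 or hi < 0
    significant = z_test_p(est, se) < 0.05
    print(est, excludes_zero, significant)  # the two checks agree every time
```

The interval at least also shows width and location, which the bare p-value hides.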

There’s the occasional talk in the stats conferences I go to about creating an ‘automated statistician’. The very concept terrifies me.

This is an odd time for Bayesian/frequentist point-scoring, IMO. Right now we should strive to run better experiments.

The critique of “costing lives” can apply as well to Bayesian priors that were not the best choice, or decisions based solely on Bayes factors, stopping too soon, Bayesian QRPs, etc. Of course, no mention of “saving lives” or “doing good science” is ever granted to NHST or frequentism.

It sounds like the original poster has an issue with arbitrary journal standards (which we’d have too with other approaches if they were as popular or as useful as hypothesis testing), not with p-values or the modus tollens logic.

Justin

Justin:

See my linked post. It’s not about point scoring nor is it about journals. It’s about learning the most from our data and making the best decisions. Researchers can and have saved lives and done good science using inefficient statistical methods; the point is we can do better, and that the use of bad statistical methods can lead to avoidable mistakes in data interpretation and decision making, even when well-intentioned researchers are involved.

+1

…ah yes, the many uses of NHST, like telling us probabilities of hypotheses, null or otherwise (oh wait, no), whether the estimated sign of the effect is even directionally correct (oh wait, that's Type S error), OK then whether the magnitude of the estimate is reliable (oh wait, that's Type M error), well maybe it's useful for comparing interventions, where we might favor the significant ones over the not-significant ones (oh wait, the difference between significant and not significant is not itself significant), any others? Oh, let's not forget the greatest use of them all: to have a binary lexicographic real/not-real filter to help us weave our tales of science. What could go wrong?
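That "difference between significant and not significant" point can be illustrated with a toy calculation (hypothetical numbers, normal approximation, in the spirit of the Gelman and Stern example): one study clears p < 0.05, the other doesn't, yet the difference between the two estimates is nowhere near significant.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical estimates (effect +/- standard error) from two studies:
a, se_a = 25.0, 10.0            # z = 2.5 -> "significant"
b, se_b = 10.0, 10.0            # z = 1.0 -> "not significant"
p_a = two_sided_p(a / se_a)     # ~0.012
p_b = two_sided_p(b / se_b)     # ~0.32

# But the *difference* between the two estimates is itself not significant:
diff = a - b                            # 15.0
se_diff = math.sqrt(se_a**2 + se_b**2)  # ~14.1, assuming independent studies
p_diff = two_sided_p(diff / se_diff)    # ~0.29
print(p_a, p_b, p_diff)
```

So sorting interventions into "real" and "not real" by which side of 0.05 they land on treats two statistically indistinguishable results as qualitatively different.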

Justin, where in the post did Andrew mention Bayes or Frequentism, let alone attempt to make a “Bayesian/Frequentist point score” as you so blithely accuse?

I am just not seeing it, and as has been noted by many others on this blog, it appears that you have yet again failed to read the post or are simply trolling. For example, see here:

https://statmodeling.stat.columbia.edu/2020/02/21/whats-the-american-statistical-association-gonna-say-in-their-task-force-on-statistical-significance-and-replicability/#comment-1246846

https://statmodeling.stat.columbia.edu/2020/02/21/whats-the-american-statistical-association-gonna-say-in-their-task-force-on-statistical-significance-and-replicability/#comment-1246902

https://statmodeling.stat.columbia.edu/2020/02/21/whats-the-american-statistical-association-gonna-say-in-their-task-force-on-statistical-significance-and-replicability/#comment-1247707

https://statmodeling.stat.columbia.edu/2020/02/21/whats-the-american-statistical-association-gonna-say-in-their-task-force-on-statistical-significance-and-replicability/#comment-1246933

https://statmodeling.stat.columbia.edu/2020/02/21/whats-the-american-statistical-association-gonna-say-in-their-task-force-on-statistical-significance-and-replicability/#comment-1247728

It would be nice if data sharing actually happened.

Though, even there some retooling will be necessary – explain what you did and how the data was coded and entered.

“explain what you did and how the data was coded and entered.”

Yes!

Stuart Ritchie on twitter has drawn attention to this preprint

“Vitamin D supplementation could prevent and treat influenza, coronavirus, and pneumonia infections”

https://www.preprints.org/manuscript/202003.0235/v1

It contains the following actual sentences:

The trial also reported significantly reduced overall cancer mortality rates for all participants if the first 1 or 2 years of deaths were omitted. We consider p = 0.06 to indicate statistical significance on the basis of an opinion piece in Nature, titled

“Scientists rise up against statistical significance”.

https://twitter.com/StuartJRitchie/status/1242806561183064064

Bob:

They clearly didn’t get the memo. The new standard isn’t p=0.06, it’s p=0.07!

I hope that people recognize that taking K2 with Vitamin D is highly recommended. The safest option is to get Vitamin D from the sun.

Re: ‘ At best, statistical significance provides some rough guidance into the question, “Are more data needed to make any sort of useful conclusion here?”—but even for that specialized question, there are better tools.’

—-

I have not understood how stat sig can provide any guidance into the question posed above.

Assuming say, a normal distribution, the p value is a function of the effect size, estimated variability, and the sample size. On the assumption that the former two are sufficiently correct, any p-value significance threshold equates to a sample size threshold.
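That last point can be checked numerically. A small stdlib-only sketch (one-sample z-test, made-up effect and sd): once the effect size and variability are held fixed, crossing p < 0.05 is purely a matter of n crossing a sample-size threshold.

```python
import math

def p_value(effect, sd, n):
    """Two-sided p for a one-sample z-test of mean 0, given effect, sd, n."""
    z = effect / (sd / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Fix the effect size and variability; only n varies.
effect, sd = 0.5, 2.0

# p falls below 0.05 exactly when z exceeds ~1.96, i.e. when
# n > (1.96 * sd / effect)**2 -- a pure sample-size threshold.
n_threshold = (1.96 * sd / effect) ** 2   # ~61.5
for n in (50, 62, 80):
    print(n, p_value(effect, sd, n) < 0.05)  # 50 False, 62 True, 80 True
```

In other words, "significant" here just means "collected enough data for this assumed effect," which is not evidence about the effect itself.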

Dang! If only I had signed that petition!!!

“if the first 1 or 2 years of deaths were omitted.” Aargh!

I have not kept up with standards for clinical trials. I worry now that trials of vaccines and anti-viral drugs are going to be hampered by requirements for significance testing and p<.05. My understanding is that the use of Bayesian methods will generally make trials take less time. I would think that this is crucial.

But, folks don’t have much time to spare to learn new technologies such as dynamic trials that often use Bayesian methods.

Better if they share their data so those who already have the technology can make better use of it.

Yes, exactly. Right now we need division of labor. People who can run small trials should do so and publish relatively raw data. People who know how to analyze and aggregate data into multilevel analyses etc. should grab that data and figure out what we can learn from it. We really shouldn't be "running trials" of the pre-designed, power-analyzed, etc. variety at all: do basic randomization into a few groups and try things in small batches; once a bigger picture emerges, we can do something more intense on the most promising ideas.

Sharing their data isn’t enough — they also need to share their data-collection protocol — and may need guidance in choosing a good/useable protocol to begin with.

Jon:

I think the real benefit will come not so much from improved analysis of individual experiments, but rather from incorporating all these experiments in the big picture in which many drugs are being tried, and the populations of patients vary.

All of these comments make sense. But I’m worrying about the process of formal approval. That isn’t a problem for drugs that are already available and can be used “off label”, but it may be a problem for new ones, slowing down their use if they should work. I really don’t know how this works, but at one time it was a very serious problem.

With adaptive trials and new types of analysis, clinical statistics in pharma is definitely moving forward.

Was the power analysis requested to check off a box in a grant submission, or was it a truly good-faith effort to decide if a particular trial with a particular design is worth doing?

In my experience, the power analysis is one of the last tasks done in the grant-submission process, usually with the trial design as a foregone conclusion.

The stage at which I am usually asked to help with power analysis is after someone knows the design they intend to use but before they decide whether to go forward with the study. They assume that they will have to “check off a box” saying they’ve done the power analysis in order to get funded.

So they want me to tell them whether checking that box will be possible using a manageable sample size. Sometimes it’s pretty much a pro forma part of the process but it’s not unusual for results of the power analysis to force rethinking the approach, the aims or the nature of the study.

I’m not saying power analysis (with <0.05 as a target) is the PROPER way to assess feasibility of a given study. But that box must be checked so power analysis is the form that decision takes. It leaves plenty of room for studies which turn out not to be all that valuable to pass the "80% power for alpha=0.05" requirement anyway or for potentially valuable studies to fall at that hurdle. But that is part of the standard proposal review formula in most fields I'm familiar with, whether we agree or not.

So many people want to keep a strict guideline that is easy to cheat, and is well known to be poorly understood, instead of demanding consensus and reasonable estimates of effects from reasonable experts.

I was thinking a bit about the results of interval and probability estimation in two cases made dichotomous by the typical significance recommendation.

p > .05:

The interval could reveal a small, tight range of effects close to 0, suggesting that there really is absolutely nothing there in the population that one wants to make inferences about. Or there could be an interval that just dips its toe over the 0 point. It is little different from one of the p<.05 intervals that is just barely not touching 0. Seeing such intervals side by side allows one to easily avoid the logically ridiculous conclusion that such things are different. Or perhaps you get a very wide interval well centred on 0. In that case the study doesn't really tell you much at all. Only in that last case is the final conclusion similar to that from "not significant," except that it's based on much better information. So in this case, again, the interval estimate is better.
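Those three scenarios can be put side by side with hypothetical numbers; all three would be lumped together as "not significant," but the 95% intervals tell very different stories.

```python
# Three hypothetical "not significant" results (estimate, standard error):
cases = {
    "tight around zero":  (0.05, 0.10),  # precisely estimated near-null
    "toe over zero":      (1.80, 1.00),  # barely includes 0
    "wide, centred on 0": (0.10, 5.00),  # study tells us little
}
for label, (est, se) in cases.items():
    lo, hi = est - 1.96 * se, est + 1.96 * se  # normal-approximation 95% CI
    print(f"{label}: [{lo:.2f}, {hi:.2f}]")
```

The first interval supports "nothing much there," the second looks a lot like a barely-significant result, and only the third resembles the agnosticism that "not significant" is usually taken to imply.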

One might want to argue some points above, such as that there is no rule about what counts as a wide interval, a large effect, or a negligible effect. But focusing on significance alone and ignoring them doesn't make them go away. It's a proverbial ostrich move. I would much prefer to be reading a blog like this one, in my subject area, talking about just such decisions. I imagine it would start out with intense discussion for about a month, and then maybe 2-3 times a year there would be a post suggesting some adjustment of relatively agreed-upon norms of behaviour or of thinking about amounts (not agreed-upon cutoffs for qualitative statements about sizes of effects!).

I think Robert Matthews’ question underscores the need to be clear about the intent of a statistical analysis. Although decision problems and scientific research often share the same apparatus, the p-value in a decision problem should be set having considered the constraints imposed by scarce resources and the consequences of the decision. The need to act immediately benefits from the bright line. But, in a scenario like this, there is not even the pretense of doing science.

Hello, it is clear that statisticians don’t talk to many computer scientists. What you do is publish an article with this exact title: “Statistical Significance Considered Harmful”.

Your community’s use of p values will clear up within a couple months. (It’s not so important who actually publishes the article, nor what the article itself actually says.)

Computer science, huh? How bout this: “Null hypothesis significance testing is the ‘vi’ of statistical methods.”

It’s not really a problem with statisticians. It’s a problem with a vast network of fields that don’t actually listen to what statisticians have to say on the matter.

+1