It’s rare to see researchers say flat-out that an experimental result leaves them uncertain. There seems to be such a temptation either to declare victory with statistical significance (setting the significance level to 0.1 if necessary to clear the bar), to claim that weak and noisy results are “suggestive,” or, conversely, to declare non-significance to be evidence of no effect.

But . . . hey! . . . check this out:

Under the heading, “The one med paper in existence that was somewhat ok with uncertainty?,” Zad points to this article, “Randomized Trial of Nocturnal Oxygen in Chronic Obstructive Pulmonary Disease,” by Yves Lacasse in the New England Journal of Medicine:

Abstract Conclusions: “Our underpowered trial provides no indication that nocturnal oxygen has a positive or negative effect on survival or progression to long-term oxygen therapy.”

Full-text Discussion: “Our trial did not show evidence of an effect of nocturnal oxygen therapy on survival or progression to long-term oxygen therapy in patients with COPD with isolated nocturnal oxygen desaturation. Because enrollment in the trial was stopped before we had reached our proposed sample size, the trial was underpowered, with the consequence of a wide confidence interval around the point estimate of the absolute difference in risk between the trial groups at 3 years of follow-up. The data that were accrued could not rule out benefit or harm from nocturnal oxygen and included the minimal clinically important difference determined before the trial. However, nocturnal oxygen had no observed effect on secondary outcomes, including exacerbation and hospitalization rates and quality of life. Furthermore, the duration of exposure to nocturnal oxygen did not modify the overall effect of therapy. Because our trial did not reach full enrollment, it makes sense to view our results in the context of other results in the field. A systematic review of the effect of home oxygen therapy in patients with COPD with isolated nocturnal desaturation identified two published trials that examined the effect of nocturnal oxygen on survival and progression to long-term oxygen therapy.”

Progress!

Progress: No. Imagine the word “Covid” were not found in this paper: it would have been rejected by the editor without detailed review. Currently, all Covid-related papers are accepted. Once (when???) this is over, we will be back to the old no-significance-no-publication attitude. Hoping for the next Corona.

No need to tax your imagination: the word COVID is not found in this paper.

Sorry, mea culpa. Too many Covid papers around.

Given how many times I’ve questioned whether scientific communication will ever be able to really embrace uncertainty, I love this data point. If science is supposed to give us knowledge and saying ‘we don’t know’ is perceived as non-information, then it takes some courage to be this blunt. Makes me wonder how the authors perceive what they’ve added to the scientific record with this paper. In that paragraph from the discussion, there’s this part which reads more like a typical claim: “However, nocturnal oxygen had no observed effect on secondary outcomes, including exacerbation and hospitalization rates and quality of life. Furthermore, the duration of exposure to nocturnal oxygen did not modify the overall effect of therapy.” Or maybe in medicine it’s more standard to publish on these trials even if they didn’t work out than it would be in, say, psychology?

Nah. This study demonstrated that nocturnal oxygen therapy had no benefit, suggesting clear lack of efficacy. The study yielded null findings, providing further reassurance that the early trial termination did not conceal detection of a true benefit.

https://www.cidrap.umn.edu/news-perspective/2021/02/zinc-vitamin-c-show-no-effect-covid-19-small-study

This seems to suggest that an adequately powered study with a non-significant result can be interpreted as evidence of absence.

With a sufficiently powered study, one can conclude the mean effect is probably below some clinically useful value. The matter of heterogeneity in the population remains.

No study adequately powered to rule out the null hypothesis would have non-significant results though.

That’s not how that works… If the power is x, the probability of having a non-significant result is 1 − x. If by “adequately powered” you mean x = 1, that’s impossible.
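That relationship is easy to sketch numerically. Here is a minimal illustration for a generic two-sided z-test at alpha = 0.05 (nothing here is tied to the COPD trial’s actual design); note that power never reaches 1 for any finite effect:

```python
# If power is x, the chance of a non-significant result when the
# effect is real is 1 - x. Stdlib-only normal CDF via erf.
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_sided(effect_in_se, z_crit=1.96):
    """Power of a two-sided z-test when the true effect sits
    `effect_in_se` standard errors from zero."""
    return norm_cdf(effect_in_se - z_crit) + norm_cdf(-effect_in_se - z_crit)

for d in (0.0, 1.0, 2.0, 2.8):
    p = power_two_sided(d)
    print(f"effect = {d:.1f} se: power = {p:.2f}, "
          f"P(non-significant | real effect) = {1 - p:.2f}")
```

At an effect of 0 se the “power” is just the 5% false-positive rate; at 2.8 se the power is about 80%, so a real effect of that size still yields a non-significant result about 20% of the time.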

Everything is correlated to everything else, even if this may be by a negligible amount.

That means the null hypothesis of no effect/correlation is always false. There is no such thing as a type I error when it comes to NHST.

But you can still get non-significant results when the null hypothesis is false. Unless there is really no uncertainty in your experiment, you cannot attain 100% power. In that case, we can agree that there is no need for statistical analysis.

That would be due to inadequate power, i.e., the sample size is too small for the precision of the measurements to allow detection of the difference/correlation.

I guess your definition of “adequately powered” is “power = 100%” then.

That is what adequately powered to rule out the null hypothesis means. This is the statement I responded to:

A study can be powered to rule out some magnitude of deviation from the hypothesis.

This is what we settled on in 2002 after much badgering by journal editors and reviewers to find something positive (not sure which journals).

The CI for the primary outcome measure (–5.6% to 17.0%) was too wide to allow for a definitive statement of treatment equivalence and allows for the possibility of a clinically important effect that we did not have sufficient power to detect.

Interdisciplinary inpatient care for elderly people with hip fracture: https://www.cmaj.ca/content/cmaj/167/1/25.full.pdf

As a flip-side to this, when the confidence interval is relatively small, I’ve been toying with the idea of ex-post power calculations to rule out parameter values. Like – if there had been a 0.1sd treatment effect, I would have gotten a significant coefficient with a probability P.

But in general I really wish we had a better developed and understood set of tools for turning “my confidence interval includes 0” into “we provide evidence that any effect is smaller than Delta.”

You could use the lower bound of a one-sided confidence interval.
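One concrete version of that suggestion: compare a one-sided upper confidence bound against a smallest effect of interest, Delta. A minimal sketch, where the estimate, standard error, and Delta are all invented for illustration:

```python
# Turning "my CI includes 0" into "any effect is smaller than Delta"
# via a one-sided upper confidence bound (normal approximation).
Z_95_ONE_SIDED = 1.645  # 95th percentile of the standard normal

def upper_bound_95(estimate, se):
    """One-sided 95% upper confidence bound for a normal estimate."""
    return estimate + Z_95_ONE_SIDED * se

est, se = 0.02, 0.05   # hypothetical point estimate and standard error
delta = 0.15           # hypothetical smallest effect of interest
ub = upper_bound_95(est, se)
print(f"one-sided 95% upper bound: {ub:.3f}")
print("effect < Delta at the 95% level?", ub < delta)
```

If the upper bound falls below Delta, one can report “any effect is smaller than Delta” at that confidence level, even though the two-sided interval includes 0.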

Yeah, someone could do that. But they are asking different questions, and neither of them is exactly the one any of us want to answer.

Confidence Interval Approach: Given my estimate and this outcome, what is the probability that this interval covers a real effect delta?

Power: Given my experimental design and this outcome, what is the probability that I would get a statistically significant effect size estimate for a real effect delta?

The first is asking about the properties of an estimator – the point estimate and confidence interval. The second is asking about properties of an experimental design – if the real thing in the world is this big, what is the probability my resulting confidence interval would not cover zero.

Using power throws out information – it ignores the point estimate. But in this case the estimate is already believed to be very close to zero relative to noise. Maybe here I worry more about “believing the point estimate” given how little information is actually in it. Maybe that’s not smart of me.

I also like the fact that with power you define and fix the alternative effect size, and let the probability vary; with a confidence interval, it’s the opposite (even if there are transformations to make more like for like comparisons with power).

I guess that in these cases I think we want to know about particular potential effects in the world and whether, if they were a certain size, a particular statistical setup would detect them. At its core, that’s a power question right, not a confidence interval question?

tl;dr – this reply reaffirms my original point that I think we would benefit from a better developed and understood set of tools here.

Given your question about turning “my confidence interval includes 0” into something about delta, the obvious answer is to check whether “my confidence interval includes delta”. But I agree that this may be the right answer to the wrong question.

I’m not sure I agree with your characterisation of the “confidence level approach”. It’s not about a probability regarding *this* interval. We cannot use confidence intervals to answer that kind of question.

I’m also confused by your comments on “power”.

> Given my experimental design and this outcome, what is the probability that I would get a statistically significant effect size estimate for a real effect delta?

How is it related to the outcome? Maybe you’re doing some kind of post-hoc power calculation where you set the “real effect delta” assumed in the power calculation to the observed outcome. If you’re really conditioning on the outcome there is no probability to calculate: given the experimental design and the outcome, either it corresponds to a statistically significant result or it doesn’t.

> [Power] is asking about properties of an experimental design – if the real thing in the world is this big, what is the probability my resulting confidence interval would not cover zero.

I could agree with that.

> Using power throws out information – it ignores the point estimate.

If you understand power as “given this outcome”, how would it be ignoring the point estimate? The point estimate and the outcome contain the same information. Or maybe “this outcome” was “a hypothetical outcome” and not “the actual outcome”.

I can see the confusion, but I think it all makes sense if by “outcome” you read “outcome variable” or “Y” (and specifically the SD of Y). And by “experimental design” you read “sample size and randomization/sampling strategy”.

Basically I mean that with power we completely ignore the point estimate and ask “given the standard errors I actually realize here, what is the power for a study like this if I pretend I hadn’t done it yet?” So that’s like 2.8 se (2 + .8) from 0 for power at 80%, right? (Because at 2 se we have power of about 50%.) With that kind of setup we could also ask “if the real effect in the world was Delta, what would have been our power with standard errors this size?” That’s what I had in mind with power.
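The “(2 + .8)” arithmetic can be checked directly: for a two-sided z-test, the effect detectable with a given power sits roughly z_crit + z_power standard errors from zero. A quick sketch (generic z-test, not tied to any particular study):

```python
# Effect size (in SE units) detectable with a given power,
# for a two-sided z-test at significance level alpha.
from statistics import NormalDist

def detectable_effect_in_se(power, alpha=0.05):
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # ~1.96
    z_power = nd.inv_cdf(power)          # ~0.84 for 80% power
    return z_crit + z_power

print(detectable_effect_in_se(0.80))  # ~2.8 se
print(detectable_effect_in_se(0.50))  # ~1.96 se: "at 2 se, power ~50%"
```

So 80% power corresponds to a true effect about 2.8 standard errors from zero, and an effect right at the 2 se significance threshold is detected only about half the time.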

So one difference is the power question ignores the point-estimate whereas the CI is centered on the point estimate. That’s what I was trying to point out, and to discuss the philosophical implications of the different framings.

But maybe I failed again. Probably because much of this was just baiting Daniel to point out the Bayesian solution, and then I could ignore that because the “conditional on the model” part doesn’t make any sense to my guts in the way that “it covers the real parameter alpha percent of the time” is at least empirically intuitive to me, even if not what I really want.

The proper statement of the Frequentist result is “It covers the real parameter alpha percent of the time conditional on the model”

where “conditional on the model” means “the world really is a high quality random number generator”

The behavior of high complexity sequences is “special” in the sense that even many computer programs which people intended to be RNGs turn out not to be good ones and theorems about what we’ll get when we repeatedly sample from them fail to be true because those RNGs don’t meet the requirements for the theorem. Why we should just automatically accept that scientists running experiments in laboratories are able to make better high complexity sequences of numbers than computer scientists who try to design RNGs do, I don’t know. But everyone just completely accepts the premise.

In reality, most people are probably just completely ignorant of the premise they’re accepting.

This all seems like torturing yourself to not have to realize that frequentist statistics don’t answer the questions you’re interested in and that Bayesian stats do…

“Given what I know, how much credence should I give to the idea that a true effect size is in the range [delta – epsilon , delta + epsilon]” is exactly answered by the Bayesian posterior probability.

I am a particle physicist.

In my field it is not only standard to publish null results, but we “quantify” them. In this case we would conclude the paper with a statement like “the absolute difference in risk at 3 years of followup is < XX at the 95% CL” (or something to that effect). In fact, we would never be able to publish without this last statement!

It seems to me that if studying “something” was deemed important, then a result that rules out that “something” at a given level is also important. Moreover, it sets a target for any further study, i.e., to be useful a followup study should aim for a sensitivity well below XX.

Am I making sense?
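Under a normal approximation, that kind of “< XX at the 95% CL” statement can be backed out of a published two-sided confidence interval. Purely as an illustration, here is the calculation applied to the interval quoted earlier in the thread (–5.6% to 17.0%):

```python
# Recover the point estimate and SE from a two-sided 95% CI, then
# convert to a one-sided 95% upper limit (normal approximation).
from statistics import NormalDist

def one_sided_upper_limit(lo, hi, cl=0.95, two_sided_level=0.95):
    nd = NormalDist()
    z_two = nd.inv_cdf(1 - (1 - two_sided_level) / 2)  # ~1.96
    est = (lo + hi) / 2            # midpoint = point estimate
    se = (hi - lo) / (2 * z_two)   # half-width / 1.96 = standard error
    return est + nd.inv_cdf(cl) * se

xx = one_sided_upper_limit(-5.6, 17.0)
print(f"absolute risk difference < {xx:.1f}% at the 95% CL")
```

This gives an upper limit of about 15%, which is exactly the sort of quantified target a followup study would need to beat.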

Although the experiment itself is not highly powered, couldn’t it, or shouldn’t it, be combined with more highly powered studies that have the same materials and methods? Shouldn’t all such similar experiments be pooled together in that way? It seems that this might lend a marginal increase or decrease to the significance of the larger pooled results.
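The pooling suggested here is, in its simplest form, a fixed-effect inverse-variance meta-analysis: each trial is weighted by the inverse of its variance. A sketch with invented trial results (the estimates and standard errors below are made up):

```python
# Fixed-effect inverse-variance pooling of several trial estimates.
from math import sqrt

def pool_fixed_effect(estimates, ses):
    """Inverse-variance weighted pooled estimate and its SE."""
    weights = [1.0 / se**2 for se in ses]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)

# Three hypothetical trials of the same intervention:
est, se = pool_fixed_effect([0.10, -0.02, 0.05], [0.20, 0.08, 0.10])
print(f"pooled estimate = {est:.3f} +/- {se:.3f}")
```

The pooled standard error is smaller than any individual trial’s, which is exactly how an underpowered study can still contribute: it tightens the combined interval even when it cannot settle the question alone. (A random-effects model would be needed if the trials’ true effects are heterogeneous.)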