Although the experiment itself is not highly powered, couldn’t it, or shouldn’t it, be combined with more highly powered studies that use the same materials and methods? Shouldn’t all such similar experiments be pooled together in that way? It seems this might lend a marginal increase or decrease to the significance of the larger pooled results.

]]>The proper statement of the Frequentist result is “It covers the real parameter alpha percent of the time conditional on the model”

where “conditional on the model” means “the world really is a high quality random number generator”

The behavior of high-complexity sequences is “special” in the sense that even many computer programs that people intended to be RNGs turn out not to be good ones, and theorems about what we’ll get when we repeatedly sample from them fail to hold because those RNGs don’t meet the theorems’ requirements. Why we should just automatically accept that scientists running experiments in laboratories can produce better high-complexity sequences of numbers than computer scientists who try to design RNGs do, I don’t know. But everyone just completely accepts the premise.

In reality, most people are probably just completely ignorant of the premise they’re accepting.

]]>I can see the confusion, but I think it all makes sense if by “outcome” you read “outcome variable” or “Y” (and specifically the SD of Y). And by “experimental design” you read “sample size and randomization/sampling strategy”.

Basically I mean that with power we completely ignore the point estimate and ask “given the standard errors I actually realized here, what is the power for a study like this if I pretend I hadn’t done it yet?” So that’s something like 2.8 SE (2 + 0.8) from 0 for power at 80%, right? (Because at 2 SE we have power of about 50%.) With that kind of setup we could also ask “if the real effect in the world were Delta, what would our power have been with standard errors this size?” That’s what I had in mind with power.
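As a sanity check on those numbers, here’s a minimal sketch (assuming a normal sampling distribution and a two-sided 5% test; nothing here comes from any particular study):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def power(delta_over_se, alpha=0.05):
    # probability a two-sided test at level alpha rejects, given the true
    # effect sits delta_over_se standard errors away from zero
    z = N.inv_cdf(1 - alpha / 2)  # ~1.96
    return 1 - N.cdf(z - delta_over_se) + N.cdf(-z - delta_over_se)

print(round(power(2.8), 2))  # 2 + 0.8 SE from zero -> roughly 80% power
print(round(power(2.0), 2))  # ~2 SE from zero -> roughly 50% power
```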

So one difference is the power question ignores the point-estimate whereas the CI is centered on the point estimate. That’s what I was trying to point out, and to discuss the philosophical implications of the different framings.

But maybe I failed again. Probably because much of this was just baiting Daniel to point out the Bayesian solution, and then I could ignore that because the “conditional on the model” part doesn’t make any sense to my guts in the way that “it covers the real parameter alpha percent of the time” is at least empirically intuitive to me, even if not what I really want.

]]>This all seems like torturing yourself to not have to realize that frequentist statistics don’t answer the questions you’re interested in and that Bayesian stats do…

“Given what I know, how much credence should I give to the idea that a true effect size is in the range [delta – epsilon , delta + epsilon]” is exactly answered by the Bayesian posterior probability.
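For concreteness, here’s a minimal sketch of that calculation under conjugate normal assumptions (the prior, estimate, and SE values are purely hypothetical):

```python
from math import sqrt
from statistics import NormalDist

def posterior_credence(y, se, prior_sd, delta, eps):
    # conjugate normal-normal update: prior N(0, prior_sd^2) on the effect,
    # estimate y with standard error se; the posterior is again normal
    prec = 1 / prior_sd**2 + 1 / se**2
    post = NormalDist(mu=(y / se**2) / prec, sigma=sqrt(1 / prec))
    # posterior probability that the effect lies in [delta - eps, delta + eps]
    return post.cdf(delta + eps) - post.cdf(delta - eps)

# illustrative numbers (all hypothetical): estimate 0.1 with SE 0.2,
# weakly informative prior SD of 1, asking about effects within 0.1 of zero
print(posterior_credence(y=0.1, se=0.2, prior_sd=1.0, delta=0.0, eps=0.1))
```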

]]>Given your question about turning “my confidence interval includes 0” into something about delta the obvious answer is to check whether “my confidence interval includes delta”. But I agree that this may be the right answer to the wrong question.

I’m not sure I agree with your characterisation of the “confidence level approach”. It’s not about a probability regarding *this* interval. We cannot use confidence intervals to answer that kind of question.

I’m also confused by your comments on “power”.

> Given my experimental design and this outcome, what is the probability that I would get a statistically significant effect size estimate for a real effect delta?

How is it related to the outcome? Maybe you’re doing some kind of post-hoc power calculation where you set the “real effect delta” assumed in the power calculation to the observed outcome. If you’re really conditioning on the outcome there is no probability to calculate: given the experimental design and the outcome, either it corresponds to a statistically significant result or it doesn’t.

> [Power] is asking about properties of an experimental design – if the real thing in the world is this big, what is the probability my resulting confidence interval would not cover zero.

I could agree with that.

> Using power throws out information – it ignores the point estimate.

If you understand power as “given this outcome”, how would it be ignoring the point estimate? The point estimate and the outcome contain the same information. Or maybe “this outcome” meant “a hypothetical outcome” and not “the actual outcome”.

]]>Yeah, someone could do that. But they are asking different questions, and neither of them is exactly the one any of us want to answer.

Confidence Interval Approach: Given my estimate and this outcome, what is the probability that this interval covers a real effect delta?

Power: Given my experimental design and this outcome, what is the probability that I would get a statistically significant effect size estimate for a real effect delta?

The first is asking about the properties of an estimator – the point estimate and confidence interval. The second is asking about properties of an experimental design – if the real thing in the world is this big, what is the probability my resulting confidence interval would not cover zero.
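The contrast can be made concrete with a small sketch (illustrative numbers, normal approximation throughout): the confidence interval is a function of the point estimate, while power depends only on the design, summarized here by the SE, and a hypothesized delta.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def ci95(estimate, se):
    # property of the estimator: the interval is centered on the point estimate
    z = N.inv_cdf(0.975)
    return (estimate - z * se, estimate + z * se)

def power(delta, se, alpha=0.05):
    # property of the design: the realized point estimate never appears
    z = N.inv_cdf(1 - alpha / 2)
    return 1 - N.cdf(z - delta / se) + N.cdf(-z - delta / se)

est, se = 0.05, 0.10            # hypothetical estimate and standard error
lo, hi = ci95(est, se)          # interval covers zero -> "non-significant"
print(lo, hi)
print(power(delta=0.3, se=se))  # chance a 0.3 effect would have been detected
```

Changing `est` moves the interval but leaves `power` untouched, which is exactly the sense in which power ignores the point estimate.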

Using power throws out information – it ignores the point estimate. But in this case the estimate is already believed to be very close to zero relative to noise. Maybe here I worry more about “believing the point estimate” given how little information is actually in it. Maybe that’s not smart of me.

I also like the fact that with power you define and fix the alternative effect size and let the probability vary; with a confidence interval, it’s the opposite (even if there are transformations that allow more like-for-like comparisons with power).

I guess that in these cases I think we want to know about particular potential effects in the world and whether, if they were a certain size, a particular statistical setup would detect them. At its core, that’s a power question, right, not a confidence interval question?

tl;dr – this reply reaffirms my original point that I think we would benefit from a better developed and understood set of tools here.

]]>In my field it is not only standard to publish null results; we also “quantify” them.

In this case we would conclude the paper with a statement like “the absolute difference in risk at 3 years of follow-up is < XX at the 95% CL” (or something to that effect).

In fact, we would never be able to publish without this last statement!

It seems to me that if studying “something” was deemed important, then a result that rules out that “something” at a given level is also important. Moreover, it sets a target for any further study, i.e., to be useful a follow-up study should aim for a sensitivity well below XX.

Am I making sense? ]]>

You could use the lower bound of a one-sided confidence interval.

]]>As a flip-side to this, when the confidence interval is relatively small, I’ve been toying with the idea of ex-post power calculations to rule out parameter values. Like – if there had been a 0.1sd treatment effect, I would have gotten a significant coefficient with a probability P.

But in general I really wish we had a better developed and understood set of tools for turning “my confidence interval includes 0” into “we provide evidence that any effect is smaller than Delta.”
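One way to sketch that ex-post calculation (normal approximation, two-sided 5% test; the SE and effect sizes are hypothetical):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def expost_power(delta, se, alpha=0.05):
    # probability of a significant coefficient had the true effect been delta,
    # given the standard error actually realized in the study
    z = N.inv_cdf(1 - alpha / 2)
    return 1 - N.cdf(z - delta / se) + N.cdf(-z - delta / se)

def detectable_effect(power_target, se, alpha=0.05):
    # smallest effect detected with probability power_target; a non-significant
    # result then argues against effects at least this large
    return (N.inv_cdf(1 - alpha / 2) + N.inv_cdf(power_target)) * se

se = 0.08  # hypothetical realized standard error, in SD units of the outcome
print(expost_power(0.1, se))        # P of significance under a 0.1 SD effect
print(detectable_effect(0.95, se))  # effects this large are "ruled out"
```

Read `detectable_effect(0.95, se)` as the Delta in “we provide evidence that any effect is smaller than Delta”: an effect at least that large would very probably have produced a significant coefficient, and it didn’t.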

]]>I guess your definition of “adequately powered” is “power = 100%” then.

That is what adequately powered to rule out the null hypothesis means. This is the statement I responded to:

> an adequately powered study with a non-significant result can be interpreted as evidence of absence.

A study can be powered to rule out some magnitude of deviation from the hypothesis.

]]>I guess your definition of “adequately powered” is “power = 100%” then.

]]>Sorry, mea culpa. Too many Covid papers around.

]]>But you can still get non-significant results when the null hypothesis is false.

That would be due to inadequate power, ie sample size is too small for the precision of the measurements to allow detection of the difference/correlation.

]]>But you can still get non-significant results when the null hypothesis is false. Unless there is really no uncertainty in your experiment, you cannot attain 100% power. In that case, we can agree that there is no need for statistical analysis.

]]>Everything is correlated to everything else, even if this may be by a negligible amount.

That means the null hypothesis of no effect/correlation is always false. There is no such thing as a type I error when it comes to NHST.

]]>The CI for the primary outcome measure (–5.6% to 17.0%) was too wide to allow for a definitive statement of treatment equivalence and allows for the possibility of a clinically important effect that we did not have sufficient power to detect.

Interdisciplinary inpatient care for elderly people with hip fracture: https://www.cmaj.ca/content/cmaj/167/1/25.full.pdf

]]>That’s not how that works… If the power is x, the probability of getting a non-significant result is 1 − x. If by “adequately powered” you mean x = 1, that’s impossible.

]]>No study adequately powered to rule out the null hypothesis would have non-significant results though.

]]>With a sufficiently powered study, one can conclude the mean effect is probably below some clinically useful value. The matter of heterogeneity in the population remains.

]]>No need to tax your imagination, the word COVID is not found in this paper.

]]>