The 80% power lie

This came up in class today and I wanted to repost it:

To get NIH funding, you need to demonstrate (that is, convincingly claim) that your study has 80% power.

I hate the term “power” as it’s all tied into the idea of the goal of a study that results in statistical significance. But let’s set that aside for now, and just do the math, which is that with a normal distribution, if you want an 80% probability of your 95% interval excluding zero, then the true effect size has to be at least 2.8 standard errors from zero.
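
For reference, the 2.8 can be recovered from two normal quantiles: roughly 1.96 for the two-sided 95% interval plus roughly 0.84 for the 80% probability. A minimal sketch in R; the variable names are only illustrative:

```r
# Critical value for a two-sided test at alpha = 0.05
z_crit <- qnorm(0.975)     # about 1.96

# Extra distance needed so that 80% of the sampling distribution
# of the z-score lies beyond the critical value
z_power <- qnorm(0.80)     # about 0.84

# Required true effect, in standard-error units
snr <- z_crit + z_power
round(snr, 2)              # 2.8

# Check: probability that z lands outside +/- 1.96 when its mean is snr
power_check <- pnorm(snr - z_crit) + pnorm(-snr - z_crit)
round(power_check, 3)      # 0.8
```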

All right, then. Suppose we really were running studies with 80% power (by which I mean 80% power conditional on the true effect; that is, suppose that the true signal-to-noise ratio is 2.8, so that there is an 80% probability that any given study reaches statistical significance). In that case, the expected z-score is 2.8, and 95% of the time we’d see z-scores between 0.8 and 4.8.

Let’s open up the R:

> 2*pnorm(-0.8)
[1] 0.42
> 2*pnorm(-4.8)
[1] 1.6e-06

So we should expect to routinely see p-values ranging from 0.42 to . . . ummmm, 0.0000016. And those would be clean, pre-registered p-values, no funny business, no researcher degrees of freedom, no forking paths.

Let’s explore further . . . the 75th percentile of the normal distribution is 0.67, so if we’re really running studies with 80% power, then one-quarter of the time we’d see z-scores above 2.8 + 0.67 = 3.47.

> 2*pnorm(-3.47)
[1] 0.00052

Dayum. We’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time, if we were running studies with minimum 80% power, as we routinely claim we’re doing, if we ever want any of that sweet, sweet NIH funding.

And, yes, that’s 0.0005, not 0.005. There’s a bunch of zeroes there.
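
The one-quarter figure can be checked directly, under the post's assumption that the z-score is normal with mean 2.8 and standard deviation 1. This is a sketch rather than part of the original calculation; the function name and the simulation size are arbitrary choices:

```r
snr <- 2.8                       # assumed true effect in standard-error units

# Analytic: probability that the two-sided p-value falls below p_cut
prob_p_below <- function(p_cut, snr) {
  z_cut <- qnorm(1 - p_cut / 2)  # |z| needed to reach p_cut
  pnorm(snr - z_cut) + pnorm(-snr - z_cut)
}

prob_p_below(0.05, snr)          # roughly 0.80: the nominal power
prob_p_below(0.00052, snr)       # roughly 0.25: the "quarter of the time"

# Simulation check of the same two quantities
set.seed(1)
z <- rnorm(1e5, mean = snr, sd = 1)
p <- 2 * pnorm(-abs(z))
sim_sig <- mean(p < 0.05)
sim_tiny <- mean(p < 0.00052)
c(sim_sig, sim_tiny)
```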

And, no, this ain’t happening. We don’t have 80% power. Heck, we’re lucky if we have 6% power.

Remember that wonderful passage from the Nosek, Spies, and Motyl “50 shades of gray” paper:

We conducted a direct replication while we prepared the manuscript. We ran 1,300 participants, giving us .995 power to detect an effect of the original effect size at alpha = .05.

Followed by:

The effect vanished (p = .59).

None of this should be a surprise

When I say, “None of this should be a surprise,” I don’t just mean that, in response to the replication crisis and the work of Ioannidis, Button et al., etc., we should realize that statistically-based science is not what it’s claimed to be. And I don’t just mean that, given the real world of type M errors and the statistical significance filter, we should expect claims of statistical power (which are based on optimistic interpretations of a biased literature) to be wildly inflated. Issues of failed replications, type M errors, etc., are huge, and they contribute to the conditions that allow the erroneous power estimates.

But what I’m saying right here is that, even knowing nothing about any replication crisis, without any real-world experience or cynicism or sociology or documentation or whatever you want to call it . . . it just comes down to the math. With 80% power, we’d expect to see tons and tons of p-values like 0.0005, 0.0001, 0.00005, etc. This would just be happening all the time. But it doesn’t.

I should’ve realized this the first time I was asked to demonstrate 80% power for a grant proposal. And certainly I should’ve realized this when writing the section on sample size and power analysis in my book with Jennifer, over ten years ago, well before I’d thought about all the problems in statistical practice of which we are now so painfully aware. All the math in that section is correct—but the implications of the math reveal the absurdity of the assumptions.

P.S. In response to a couple of comments on the earlier post: Yes, the power is conditioned on the assumed effect size. But the point of the power calculation for the NIH grant is to say that the power really is at least 80%, at least much of the time. The assumptions are supposed to be reasonable. Given that we’re not routinely seeing tons and tons of p-values like 0.0005, 0.0001, 0.00005, etc., this suggests that the assumptions are not so reasonable. See here for further discussion of this point.

P.P.S. The funny thing is, when you design a study it seems like it should be so damn easy to get 80% power. It goes something like this: Assume a small effect size, say 0.1 standard deviations; then to get 2.8 standard errors from zero you just need 0.1 = 2.8/sqrt(N), thus N = (2.8/0.1)^2 = 784. Voila! OK, 784 seems like a lot of people, so let’s assume an effect size of 0.2 standard deviations, then we just need N = 196, that’s not so bad. NIH, here we come!
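
That back-of-the-envelope calculation is easy to script. A sketch, assuming (as in the P.P.S.) a single mean with unit standard deviation, so that the standard error is 1/sqrt(N); the function name is made up for illustration:

```r
# N needed so the assumed effect (in SD units) sits `snr` standard errors from zero
n_for_power <- function(effect_sd, snr = 2.8) {
  (snr / effect_sd)^2
}

n_for_power(0.1)   # 784
n_for_power(0.2)   # 196
```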

What went wrong? Here’s what’s happening: (a) effects are typically much smaller than people want to believe, (b) effect size estimates from the literature are massively biased, (c) systematic error is a thing, (d) so is variation across experimental conditions. Put it all together, and even that N = 784 study is not going to do the job—and even if you do turn up a statistically significant difference in your particular experimental conditions, there’s no particular reason to expect it will generalize. So, designing a study with 80% power is not so easy after all.
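
Point (a) alone is enough to sink the design. Here is a sketch of how the probability of significance falls off when the true effect is smaller than the one assumed. The true-effect values are hypothetical, chosen only for illustration, and the same unit-SD, single-mean setup is assumed:

```r
# Probability of a two-sided p < 0.05 when the true effect is `true_effect` SDs
# and the study has n observations (standard error = 1 / sqrt(n))
prob_sig <- function(true_effect, n, alpha = 0.05) {
  snr <- true_effect * sqrt(n)
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(snr - z_crit) + pnorm(-snr - z_crit)
}

n <- 784                                   # designed for an assumed effect of 0.1 SD
true_effects <- c(0.1, 0.05, 0.02, 0.01)   # hypothetical true effects
power_table <- data.frame(
  true_effect = true_effects,
  prob_sig = round(prob_sig(true_effects, n), 2)
)
power_table
```

Under these assumptions the probability drops from about 0.80 at the assumed effect to under 0.10 once the true effect is a fifth of that size.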

P.P.P.S. To clarify one more thing: I do not think the goal of an experiment should be to get “statistical significance” or any other sense of certainty. The paradigm of routine discovery is over, and it can’t be recovered with N = 784 or even N = 7840.

42 thoughts on “The 80% power lie”

    • Student:

      I think that it is appropriate for the design of a study to require the researcher to give a serious hypothesized effect size and to then come up with some estimate of the standard error that would be obtained from the proposed study. I don’t think it should be required that this ratio exceed 2.8; also, I think it’s important that the effect size be justified in a reasonable way, not just using the estimate from some earlier noisy study.

      So, my two recommendations are:
      1. Be more serious about your hypothesized effect size and variation.
      2. Recognize that most purportedly 80% power studies do not have anything like 80% power relative to the true effect size.

  1. I always find it funny that textbooks will mention Fisher as a kind of “father of statistics”, but fail to mention he considered the majority of what follows to be the result of “mental confusion”:

    The phrase “Errors of the second kind”, although apparently only a harmless piece of technical jargon, is useful as indicating the type of mental confusion in which it was coined.

    https://academic.oup.com/jrsssb/article/17/1/69/7026709

    It’s worth reading the full paper.

  2. > Suppose we really were running studies with 80% power. In that case, the expected z-score is 2.8, and 95% of the time we’d see z-scores between 0.8 and 4.8.

    When we are “really” running studies with 80% power (which is a property of the study and depends on the reference effect size used to calculate the power, not on the actual effect size), how often we’d see z-scores between 0.8 and 4.8 depends on the distribution of actual effect sizes.

    For what it’s worth, Wikipedia says that “many statisticians have argued that post-hoc power calculations are misleading and essentially meaningless.” (I’ve now noticed the P.S., which seems to acknowledge that going from what NIH means by power to something else and calling that “real” power is misleading.)

  3. Andrew,

    You wrote:

    > So we should expect to routinely see p-values ranging from 0.42 to . . . ummmm, 0.0000016.

    I would object to the word routinely above.

    The distribution of the p-values is such that the probability is 20% that you will get a p-value above 0.05. So 0.42 is not a routine outcome, nor is any value above 0.05. Unless I don’t understand the meaning of routine.

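    # Analytic power for a one-sample, two-sided t-test (effect 30, sd 150, n 200)
    power.t.test(delta = 30,
                 sd = 150,
                 n = 200,
                 type = "one.sample",
                 alternative = "two.sided",
                 strict = TRUE)

    # Simulate 100,000 studies under that same effect and record each p-value
    nsim <- 100000
    pvals <- rep(NA, nsim)
    for (i in 1:nsim) {
      y <- rnorm(200, 30, 150)
      pvals[i] <- t.test(y)$p.value
    }

    hist(pvals)

    # Proportion significant at 0.05; should be close to the analytic power
    mean(pvals < 0.05)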

    • I have another question about the claim: “With 80% power, we’d expect to see tons and tons of p-values like 0.0005, 0.0001, 0.00005, etc. This would just be happening all the time. But it doesn’t.”

      Can you point to evidence to support that? I looked at the last month of issues of the NEJM and saw quite a few p values like <.001, <.0001. I didn't try to convert the many hazard ratios into p values, but many of these showed 95% confidence intervals far away from 1 (while some overlapped 1). I'm sure quite a few of these would convert to p values for the difference between treated and control groups on the order of <.001. So, I'm just wondering where I can see evidence relating to what you say – I'm not necessarily expressing doubt, I just don't know what we are routinely seeing.

    • Maybe you would agree that we “routinely” see values in the range from 0.05 to 0.0003 (it happens 60% of the time). Then they are also “routinely” in the 0.42 to 0.0000016 range! (And in the 1 to 0 range or in the real line, for that matter.)

      There was a similar situation in a recent statement about how the “z-curve” method can give misleading results when the number of significant z-values is, say, less than 100 or 200. A specific example was given where the coverage was 90% instead of 95% when the number of significant values was 75 – and 75 is clearly less than 100 or 200 so it did indeed support the claim.

      • My point was that it’s pragmatically odd to say that we will routinely see values between 0 and 1 (or in the 0.42 to 0.0000016 range or whatever huge range you see). The important point for frequentists is that you will see values above 0.05 only 20% of the time. It’s pragmatically odd because it violates Grice’s maxim of relevance, and also the maxim of quantity (what Andrew wrote is underinformative in my view).

        You will see values below 0.05 80% of the time. I am more or less OK calling that routinely.

        • You’ll see values less than 0.05 80% or more of the time if the real effect size is greater than or equal to the one assumed in the power calculation.

          So you could have tons of 80% powered studies it’s just that the real effect sizes are substantially smaller than the one assumed in the 80% power calculation… and then you’d see very few p values below 0.05

          I think that’s the case of what’s really going on. As Anoneuoid often says, there’s virtually no such thing as effect size 0 exactly. So with large enough sample size you’ll get a p value below 0.05 with probability 1
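
          A quick sketch of that last claim, using a normal approximation and a hypothetical tiny effect of 0.01 standard deviations (both the effect and the sample sizes are made up for illustration):

          ```r
          # Probability of two-sided p < 0.05 for a true effect of `effect` SDs with n observations
          prob_sig <- function(effect, n, alpha = 0.05) {
            snr <- effect * sqrt(n)
            z_crit <- qnorm(1 - alpha / 2)
            pnorm(snr - z_crit) + pnorm(-snr - z_crit)
          }

          n_values <- c(1e2, 1e4, 1e5, 1e6)
          probs <- prob_sig(0.01, n_values)
          round(probs, 3)   # climbs from near 0.05 toward 1 as n grows
          ```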

        • I agree with your point! Saying that we would routinely see values which are somewhere in the range that contains 95% of the values is not saying much. (And that’s assuming that the effect that we would like to find in an experiment is always there – but if that was the case we wouldn’t be doing any experiments to start with.)

    • Daniel Lakeland wrote:

      > So you could have tons of 80% powered studies it’s just that the real effect sizes are substantially smaller than the one assumed in the 80% power calculation… and then you’d see very few p values below 0.05

      Agreed. But that’s a different point than the one I was discussing. We are working with the fiction that the true effect size is what we think it is.

      Incidentally, I don’t understand why we don’t run adaptive trials a la Freedman and Spiegelhalter (popularized in recent years by Kruschke)? Just run till you reach a (clinically) meaningful 95% credible interval. Or if you insist on frequentist analyses you have to keep down-adjusting the alpha value (obviously a poor man’s solution). I guess the problem with that is that one can’t get funding in advance for running a study.

      What I do in practice when applying for funding is that I use a meta-analytic estimate to figure out what the *range* of power would be given a posterior distribution of the presumed effect size (knowing it’s an overestimate because of heavy publication bias). Then I get the money from the funding agency and use it up. If things still look too uncertain, I use my own annual budget to run more and more subjects. This means one has to do fewer studies that take longer (in one case, 4 years; but I gave up early there because data collection was too slow; this was eyetracking while reading, a very noisy method).
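
      A rough sketch of that kind of calculation, with entirely hypothetical numbers: the "posterior" for the effect is taken to be normal with mean 20 and sd 10 on the raw scale, and the design assumes sd = 150 and n = 200. A normal approximation stands in for the t-test, and none of these values come from a real meta-analysis.

      ```r
      set.seed(1)

      # Hypothetical posterior draws for the effect (raw units); not real meta-analysis output
      effect_draws <- rnorm(4000, mean = 20, sd = 10)

      # Planned design (assumed values)
      n <- 200
      sd_y <- 150
      se <- sd_y / sqrt(n)

      # Probability of two-sided p < 0.05 for each drawn effect (normal approximation)
      z_crit <- qnorm(0.975)
      power_draws <- pnorm(effect_draws / se - z_crit) + pnorm(-effect_draws / se - z_crit)

      # A range of plausible power values rather than a single number
      power_range <- quantile(power_draws, c(0.025, 0.5, 0.975))
      round(power_range, 2)
      ```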

  4. Andrew: You wrote: “In response to a couple of comments on the earlier post: Yes, the p-value is conditioned on the assumed effect size.” I think you mean that the power is conditioned on the assumed effect size.

    The term “true power” seems to be confusing, so I’ve been using PoS (the probability of statistical significance) instead. A study can be designed to have 80% power (against the effect the researchers would not want to miss) while the PoS is only 20%.

  5. > We’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time, if we were running studies with minimum 80% power, as we routinely claim we’re doing, if we ever want any of that sweet, sweet NIH funding.

    Do you routinely claim that you have 80% probability of getting statistically significant results to get NIH funding? I don’t think anyone does.

    Let’s say for simplicity that either there is the precise effect to be found with 80% power or there isn’t any effect at all. If the prior probability of having an effect is 1/5 we expect to see clean, un-hacked p-values less than 0.0005 only one fifth of a quarter of the time. (And we expect a statistically significant effect one fifth of the time – and we expect one fifth of those significant results to be false positives.)
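
    That arithmetic can be laid out explicitly. A sketch under the simplifying assumptions just stated (the effect is either exactly the one powered for or exactly zero, with prior probability 1/5 of a real effect); the variable names are illustrative:

    ```r
    prior_effect <- 1 / 5   # assumed prior probability that the effect is real
    power <- 0.80           # P(p < 0.05 | effect as assumed)
    alpha <- 0.05           # P(p < 0.05 | no effect)

    # Unconditional probability of statistical significance
    p_sig <- prior_effect * power + (1 - prior_effect) * alpha
    p_sig                   # 0.2

    # Share of significant results that are false positives
    false_share <- (1 - prior_effect) * alpha / p_sig
    false_share             # 0.2

    # Unconditional probability of p < 0.0005: about a quarter of the time
    # under the effect, and 0.0005 of the time under the null
    p_tiny <- prior_effect * 0.25 + (1 - prior_effect) * 0.0005
    p_tiny                  # 0.0504
    ```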

    • When I was doing this, the process was: run a pilot study using extra money from previous grants. Then don’t bother to submit unless that shows you can get 80% power with the sample size you are applying to get funded.

      • If I understand you correctly you are saying that people routinely claim that there is, with 100% certainty, an effect of a certain size. I don’t think that’s what they do but I could be wrong.

        Anyway, if they do claim that, the “lie” would be there and not really in the 80% power calculation, which is conditional on a hypothetical effect but doesn’t assume that the effect is certain.

        • Carlos:

          No, I’m not saying that people are routinely claiming anything with 100% certainty. I’m saying that people routinely overestimate effect sizes, by a lot, thus thinking they have 80% power (in Erik’s phrase, 80% probability of significance (PoS)) when their PoS is really much less.

        • Thanks for the clarifications.

          You said first that “people routinely claim that they have 80% probability of getting statistically significant results to get NIH funding. Indeed, sometimes it’s practically a requirement.”

          That was wrong unless it meant “people routinely claim that they [would] have 80% probability of getting statistically significant results [if there actually was an effect of a certain magnitude] to get NIH funding. Indeed, sometimes it’s practically a requirement.”

          And in that case it wouldn’t be true that “we’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time”, because the conditional 80% power is not an unconditional probability of significance.

          You wrote “we’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time if we were running studies with minimum 80% power, as we routinely claim we’re doing” but if “power” was properly used (for some hypothesized effect size, as you wrote) the “quarter of the time” was false and if you (mis)used power as probability of significance the “we routinely claim we’re doing” part was false. I really don’t see how that statement can be saved.

          Now you say that “people routinely think they have 80% probability of significance” and I don’t know what people think but that’s definitely not a requirement to get NIH funding. Even if people routinely thought that, it would still be wrong to say that “we’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time”. We would only expect that if we also thought ourselves that they have 80% probability of significance.

          In any case, given a study with 80% power (probability of getting a statistically significant result if the effect size were equal to some value, say 42) thinking that we have 80% probability of significance could mean that we think that the effect size is certainly 42 or that we think that it’s either zero (21% probability) or infinite (79% probability) or something in between: it necessarily implies that we assign to the uncertain effect size a probability distribution which includes large enough sizes with high enough probability.

        • Carlos:

          I don’t think my statement is wrong at all. I think it’s accurate. When I wrote that people need to “demonstrate (that is, convincingly claim),” I was not talking about 100% certainty. I’m saying that in their NIH proposal they have to make a convincing argument. Or maybe I should say a potentially convincing argument. Unfortunately, the way the world works is that an earlier study achieving statistical significance is often taken as convincing evidence that the effect of interest is equal to or larger than the point estimate.

        • > When I wrote that people need to “demonstrate (that is, convincingly claim),” I was not talking about 100% certainty.

          You didn’t write about having 80% probability of getting statistically significant result either, at least not explicitly.

          You wrote “To get NIH funding, you need to demonstrate (that is, convincingly claim) that your study has 80% power.”

          The obvious reading “To get NIH funding, you need to demonstrate (that is, convincingly claim) that your study has 80% [what the NIH calls] power.” is correct.

          However, the rest of the post and the comments suggest that you meant “To get NIH funding, you need to demonstrate (that is, convincingly claim) that your study has 80% [what Andrew Gelman for some reason insists on also calling] power [in Erik’s phrase, 80% probability of significance (PoS)].” which doesn’t seem correct or accurate.

        • Carlos:

          I think that sometimes when people present a claim of statistical power it is based on an effect size that could be important, but many times it is based on an effect size that is claimed to be plausible or reasonable or even approximately true.

          What I call “statistical power” is the same as what everybody else calls “statistical power.” It’s the probability of attaining statistical significance in a future study, as computed based on some assumptions about that study.

          The usual thing in an NIH proposal is to claim 80% power conditional on some assumptions, and then to provide evidence that these assumptions are plausible, that they could be true. I agree that it’s ok in a proposal to say that you don’t know, that the true effect size could be zero, but you’re also supposed to provide a good argument that the effect size you’re using for the power analysis is scientifically plausible. Unfortunately, when people do this, I think they typically give overestimates of the effect size.

          If NIH funds a study that is powered for an effect size of X, and the proposal justifies X based on some previous studies, it’s my impression that the true effect size is probably quite a bit lower, that the preregistered analysis has a much less than 80% chance of attaining statistical significance, and that this result was not the intention of the researcher and the NIH. From everything I’ve seen, researchers really do think of X as a good prior estimate of the effect size, and funders really do think that, if the experiment goes as planned, there is a very high probability of statistical significance.

          But, sure, I agree that Erik’s term, “Probability of significance,” is useful, because “power” or “true power” is ambiguous.

        • > researchers really do think of X as a good prior estimate of the effect size, and funders really do think that, if the experiment goes as planned, there is a very high probability of statistical significance.

          As high as 80%? Then _they_ think that the effect is large enough with high enough probability. They could be wrong about that and still be running experiments with what “everybody else calls” 80% power (computed based on some assumptions about the effect size that may or may not be true).

          We (everybody else) don’t have to think that the effect is large enough with high enough probability to understand whether a study has 80% power! (i.e. 80% probability of attaining statistical significance conditional on some hypothetical effect size that doesn’t have to be accepted as true to accept that the study has 80% power.)

          > Suppose we really were running studies with 80% power […] In that case, the expected z-score is 2.8, and 95% of the time we’d see z-scores between 0.8 and 4.8.

          I don’t expect that. You don’t expect that either. It’s not a logical consequence of running studies with 80% power. The reason why nobody should expect that is simply that it’s not true! Suggesting that it should be expected won’t help people understand what power is and is not.

  6. > To clarify one more thing: I do not think the goal of an experiment should be to get “statistical significance” or any other sense of certainty.

    Indeed.

    It seems like you buried the lede.

    The first time I was exposed to ‘power analysis’ was in my coursework as an undergraduate statistics major and at that time we were taught that the effect size to use in the calculation was the smallest effect size that would still be practically meaningful. This made sense then, and still seems more defensible than trying to plug in an estimate of the true effect size without any reflection on whether ‘detecting’ a practically insignificant effect provides any scientific value.

    So if it happens that there are a lot of experiments that fail to reject H0 because the practically meaningful effect size is much larger than the true effect size, so be it. I wouldn’t view this as there being a crisis of ‘underpowered experiments’. Or that experimenters were lying to themselves about having 80% power. They may very well have had 80% power for detecting something that would be scientifically meaningful.

    If there is scientific value in just estimating an effect size with high precision without any reflection on practical significance, a power analysis is obviously the wrong tool for the job of determining sample size.

    If the NIH is using 80% power as a criterion for funding an experiment, hopefully they are also scrutinizing the effect size used in the calculation to ensure that it is large enough to be practically relevant.

  7. > But the point of the power calculation for the NIH grant is to say that the power really is at least 80%, at least much of the time.

    Like a lot of meta-science I think there’s an implicit economic tradeoff here.

    How much low-hanging fruit (substantial effects) is there? What “much of the time” is ideal?

    It’s not obvious to me that 10% of studies with true power >80-90% is far-from-optimal. Would we be better off having more studies with higher sensitivity (power) for the effects under study, or investigating more questions with weak designs knowing that we’ll still likely catch the truly huge effects?

    This post and the various Van Zwet et al papers (which are great!) reason based on the SNR, which combines truth with study design/sample size/budget.

    Another way to put my question is: what would the ideal joint distribution of standardized effect size (dividing by SD, not SE) and N look like? what is the ideal proportion of effects under investigation that are greater than the MCID used for power calculations?

    • Instead NIH should require methods that can be replicated and predictions that allow distinguishing the research hypothesis from other explanations (e.g., you messed up the experiment somehow).

      How does statistical significance (and the related power analysis) contribute to these goals? Afaict, they do not at all.

  8. > But the point of the power calculation for the NIH grant is to say that the power really is at least 80%, at least much of the time.

    Is it, though? I’ll look again at their guidance on proposals, but I don’t recall ever seeing anything like this, nor does the statement make sense to me, your overloading of the term ‘power’ taken into account.

    > The assumptions are supposed to be reasonable.

    Sure. But the assumption about the effect size that is being used in the power calculation is ‘this effect size is one that would be scientifically interesting if we detect it’ so the claim that it is scientifically interesting is what should be a reasonable one.

    Might a referee also scrutinize this proposed minimally scientifically meaningful effect size if they are not convinced the true effect size is anywhere close? Sure. But isn’t that a different concern with the experiment proposal? That they already think they know the effect size is small and uninteresting?

    • Jyd:

      The proposals I’ve seen, when they hypothesize an effect size, they make the argument that this hypothesized effect size is plausible. There’s no point in a power calculation based on an effect that would be scientifically interesting, if it’s implausible. Just for example, suppose I were to propose a study on sex ratios with a sufficient power to detect a 5 percentage point difference in the proportion of girl births. Such an effect would indeed be scientifically interesting–as the expression goes, it would be “big if true”–but it’s entirely implausible, and it would be inappropriate for the NIH to throw money at it.

      • > The proposals I’ve seen, when they hypothesize an effect size, they make the argument that this hypothesized effect size is plausible.

        When you say that “people routinely claim that they have 80% probability of getting statistically significant results” do you mean that there are also many proposals that you have not seen where they claim that the hypothesized effect size is certain?

      • > There’s no point in a power calculation based on an effect that would be scientifically interesting, if it’s implausible. […] it would be inappropriate for the NIH to throw money at it.

        Conversely, wouldn’t it be pointless to run a study to determine whether a treatment is essentially useless or nearly so? Why would the NIH throw money at it if the best effect you’d get is entirely uninteresting?

        If the proposal is written from this perspective, perhaps it is assumed that the power calculation is done on a minimally interesting effect size, while plausibility is explicitly justified.

        I guess I’m not too interested in whether this charitable reading applies. Is the “minimally scientifically meaningful effect size” mentioned by JYD a valid approach to power analysis? That is certainly how I think about it.

        • @Anonanon +1. Discussions about power very often conflate “true” power, power based on a “hypothesized” effect size, and power based on a “minimally interesting” effect size. The latter is the only version of power that makes sense to me. It seems like all of the debate should be about what constitutes “minimally interesting” and how likely we’ll be to detect such an effect. Better yet, we could reason about an experiment design with respect to the full power curve. (I really like Richard Morey’s post on this: https://richarddmorey.medium.com/power-and-precision-47f644ddea5e.)

          All that said, I’d much rather ditch NHST and use model-based calibration. Michael Betancourt’s approach is as good a resource as I can think of for this: https://arxiv.org/abs/1803.08393. I think science would be much better off if we aimed for this kind of transparency and domain-expertise consistency, all of which can be reasoned about and validated prior to data collection.

  9. “(a) effects are typically much smaller than people want to believe”

    I think a contribution to this in regression and correlation analysis is the belief that the explained variance gives a meaningful quantification of the strength of the relationship: “With our sample size we can detect an explained variance as low as 1%, surely anything smaller is unimportant!” But then it is missed that this means there is about 9 times as much unexplained as explained variation on the scale of the variable (not 99 times), amongst other things.

    I think when seeing a 1% explained variance, people think of this in terms of regular proportions: If we were able to detect a difference between two proportions of 50% and 51% with 80% power, then we have surely done all we can to detect any meaningful effects. But as explained variance is about squared variation, not the variation on the original scale of the variable, a 1% explained variance is (roughly) more like having two proportions of 50% and 60% – an optimistically large difference in many (social science) studies. And, of course, many studies expect larger explained variances than 1%.

  10. I think that paper you refer to as “50 shades of gray”,
    may have contributed to leading lots of people astray
    When it is proposed that opening data, materials, and workflow are “the ultimate solution”
    I think it’s fair to wonder whether that has been an optimal, or even appropriate, contribution

    And, in light of all the problems and explanations, can someone please explain to me again,
    about “the incentives” and why tenured professors would even want to publish as much as they can
    How exactly should I understand the possible role of “publish or perish” for those tenured?
    I mean, doesn’t being tenured imply they can decide, and adhere to, their own standard?

    And, if it would be “trivial” to “publish” using things like a repository and a pre-print
    How and why exactly would that impact things like an evaluation by a committee, or a CV’s imprint?
    Is posting a manuscript really “publishing” in a way that is relevant concerning the discussions had?
    Is not talking about how and why proposed “solutions” might actually be “rewarded” pretty sad?

    And now we’re here, and the same issues are still there
    They simply got snowed under by the distraction, and by the glare
    They just waited for the right time, and for the noise to stop
    And just like a Snowdrop in Winter, they pierce through, and once again pop up

  11. The right approach would be to say “The power would be 80% if the effect size is X and we don’t care about finding differences smaller than X since they are practically insignificant”. This would put an upper bound on the sample size (instead of a lower bound) and that is relevant information for the funding agency. Of course, one can argue that there are other issues (some of which you have alluded to) that would bring down the power; so, perhaps other things should be taken into account too. But it is still better than just trusting the researcher on the effect size they expect, a process which you rightly pointed out has a lot of issues. I have always found the researchers’ prior belief about effect sizes sketchy. There should be a rigorous justification of why the researcher believes that their effect size is > X, and this should be independent of the power analysis.
