No, average statistical power is not as high as you think: Tracing a statistical error as it spreads through the literature

I was reading this recently published article by Sakaluk et al. and came across a striking claim:

Despite recommendations that studies be conducted with 80% power for the expected effect size, recent reviews have found that the average social science study possesses only a 44% chance of detecting an existing medium-sized true effect (Szucs & Ioannidis, 2017).

I noticed this not because the claimed 44% was so low but because it was so high! I strongly doubt that the average social science study possesses a power of anything close to 44%. Why? Because 44% is close to 50%, and a study will have power of 50% if the true effect is 2 standard errors away from zero. I doubt that typical studies have such large effects.
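
To see where that arithmetic comes from, here is a minimal sketch in Python (my own illustration, using the usual two-sided z-test approximation; the function name is just a placeholder):

    from scipy.stats import norm

    def power(effect_in_se, alpha=0.05):
        # Power of a two-sided z-test when the true effect sits
        # effect_in_se standard errors away from zero.
        z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
        return norm.cdf(effect_in_se - z_crit) + norm.cdf(-effect_in_se - z_crit)

    print(power(1.96))  # about 0.50
    print(power(2.80))  # about 0.80, the conventional target

An effect 2 standard errors from zero gives power of essentially one-half, and you need an effect nearly 3 standard errors from zero before you reach the conventional 80%.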

I was curious where the 44% came from, so I looked up the Szucs and Ioannidis article, “Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature.” Here’s the relevant passage from the abstract:

We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically significant results and D = 0.24 (0.11–0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century.

This is fine—but I don’t think most effect sizes are so large. To put it another way, what they call a “medium effect,” I would call a huge effect. So, realistically, power will be much much less than 44%.

This is important, because if researchers come into a study with the seemingly humble expectation of 44% power, then they’ll expect that they’ll get “p less than 0.05” about half the time, and if they don’t they’ll think that something went wrong. Actually, though, the only way that researchers have been having such a high apparent success rate in the past is from forking paths. The expectation of 44% power has bad consequences.

50 thoughts on “No, average statistical power is not as high as you think: Tracing a statistical error as it spreads through the literature”

    • I suspect suspiciously high power in studies relates more to experiments designed to investigate the bleeding obvious without considering all those inconvenient confounders.

    • “Highest power field, Intelligence – Group Differences, in line with Sesardić’s conjecture.”

      This post by Andrew smells fine to me, not sure why it would attract flies. But now I need a shower.

      For the usual blog denizens, I will save you a trip to a dark corner of the web. “Intelligence – Group Differences” is a euphemism for racism masked as science.

      • Indeed, the linked site by Emil Kirkegaard has a number of dubious claims about IQ that all seem to add up to a lame defense of some kind of Alt Rightism.

      • Well, Kirkegaard’s links make that inference, but if you go to the source paper, “group differences” actually refers to methodology – specifically:

        “group differences compare existing (non-manipulated) groups and typically use Cohen’s d or raw mean IQ differences as the key effect size.”

        In other words, any study based on a t-test or ANOVA would typically give a high estimated effect size and hence high power. Examples of such studies include estimating IQ in the cases of people with schizophrenia, and gender differences amongst university students (without adjustment for covariates).

        This is in stark contrast to “behaviour genetics” studies that attempt to link intelligence to genetic variations or estimate heritability, for which Nuijten estimates an average power of 9%.

        One may speculate why this result was omitted from discussion.

        • FWIW, the high effect size in such group difference studies might also be interpreted as “these studies are the easiest to do and kinda sketchy, so they’ll have to find a really big effect to be publishable”. Anything involving an experiment or an intervention has a much smaller power (26.5%).

        • I did attempt to get a clarification on this some time ago (from Michèle). I speculate it is because they included the candidate gene studies in this, and these studies are super underpowered and generally completely useless. I think the regular behavioral genetics twin studies will have fairly decent power. E.g., tons of studies done using the TEDS sample, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3817931/, which has something like 10k twin pairs.

        • The more important point is that this rather establishes that the original analysis has very little to do with your inference, and in fact on the face of it contradicts it, unless you “speculate” it away…

        • Here’s her reply (posting with permission). So I was right.

          Hi Emil,

          We included 5 meta-analyses that we labelled as behavior genetics.

          Three of these are candidate gene studies:

          Barnett, J. H., Scoriels, L., & Munafo, M. R. (2008). Meta-analysis of the cognitive effects of the catechol-O-methyltransferase gene val158/108Met polymorphism. Biological Psychiatry, 64(2), 137-144. doi:10.1016/j.biopsych.2008.01.005

          Yang, L., Zhan, G.-d., Ding, J.-j., Wang, H.-j., Ma, D., Huang, G.-y., & Zhou, W.-h. (2013). Psychiatric Illness and Intellectual Disability in the Prader-Willi Syndrome with Different Molecular Defects – A Meta Analysis. Plos One, 8(8). doi:10.1371/journal.pone.0072640

          Zhang, J.-P., Burdick, K. E., Lencz, T., & Malhotra, A. K. (2010). Meta-analysis of genetic variation in DTNBP1 and general cognitive ability. Biological Psychiatry, 68(12), 1126-1133. doi:10.1016/j.biopsych.2010.09.016

          One is a candidate gene study involving twins:

          Luciano, M., Lind, P. A., Deary, I. J., Payton, A., Posthuma, D., Butcher, L. M., . . . Plomin, R. (2008). Testing replication of a 5-SNP set for general cognitive ability in six population samples. European Journal of Human Genetics, 16(11), 1388-1395. doi:10.1038/ejhg.2008.100

          The fifth one studies heritability with twins:

          Beaujean, A. A. (2005). Heritability of cognitive abilities as measured by mental chronometric tasks: A meta-analysis. Intelligence, 33(2), 187-201. doi:10.1016/j.intell.2004.08.001

          Hope this helps!

          Best,

          Michèle

      • The blog post “debunking” Taleb’s claims also commits some major statistical sins like plotting a fit without data points to prove the validity of a fit, confusing “monotonicity” with “linearity”, and mistaking what is essentially a measurement-theoretic argument about construct validity for one about predictive power.

        Emil, if you’re listening, if you really are interested in psychometrics rather than what you can claim using bad psychometric arguments, you should read Meehl, Luce, and Falmagne. Something being predictive is not the same as it being “real”, or else you’ll end up arguing that convolutional layers are capturing something fundamental about perception.

        • > Finally, let’s consider Taleb’s standard of monotonicity. This is getting back to the idea that IQ’s relationship with an outcome, say job performance, needs to be the same at all levels of job performance.

          Lol, what is it with people and preferring to guess definitions instead of looking them up?

        • You mean this Meehl?
          “Verbal definitions of the intelligence concept have never been adequate or commanded consensus. Carroll’s (1993) Human Cognitive Abilities and Jensen’s (1998) The g Factor (books which will be the definitive treatises on the subject for many years to come) essentially solve the problem. Development of more sophisticated factor analytic methods than Spearman or Thurstone had makes it clear that there is a g factor, that it is manifested in either omnibus IQ tests or elementary cognitive tasks, that it is strongly hereditary, and that its influence permeates all areas of competence in human life. What remains is to find out what microanatomic or biochemical features of the brain are involved in the hereditable component of g. A century of research—more than that if we start with Galton—has resulted in a triumph of scientific psychology, the footdraggers being either uninformed, deficient in quantitative reasoning, or impaired by political correctness.”
          A Paul Meehl Reader: Essays on the Practice of Scientific Psychology p. 435

          Or maybe Meehl th

    • Emil:

      1. How is the criterion problem handled in these studies? Is it roundly ignored or addressed?

      2. If bias in measurement in the predictor is similar to the bias in measurement in the outcome — does this produce an overestimate or an underestimate of the correlation?

      3. If a correlation of 0.5 (R^2 = 0.25) is produced for an outcome that is theoretically grounded in expected variation due to intelligence — should explained variance of 25% be considered evidence of validity or evidence of a lack of validity?

      While I do not always agree with Nassim Taleb’s arguments (though his math is usually right on), in this case he’s absolutely correct in his analysis.

    • Sean Last is a liar. For every field he used the mean/median for a small effect size, except intelligence, where he used the value for a medium effect size. What the paper actually says is: “Furthermore, across primary studies, we found a median power of 11.9% to detect a small effect, 54.5% to detect a medium effect, and 93.9% to detect a large effect.” He reported 57%, so he even made a (probably intentional) mistake reporting the medium effect size.

  1. > if researchers come into a study with the seemingly humble expectation of 44% power, then they’ll expect that they’ll get “p less than 0.05” about half the time

    Only if they are pretty sure that the medium-sized true effect is indeed there to be found.

  2. “recent reviews have found that the average social science study possesses only a 44% chance of detecting an existing medium-sized true effect (Szucs & Ioannidis, 2017).” I don’t know what Szucs and Ioannidis would find if they looked at political science, sociology, demography, economics, etc., but I found the claim about “the average social science study” an untrue description of the cited paper.

  3. I know you know this, but the size of an effect is only one factor in determining power. On its face, the statement “a study will have power of 50% if the true effect is 2 standard errors away from zero” is as much of a “striking claim” as the one that led to this post. Someone could argue that tools like blocking, covariates, repeated measures, multiple measures with minimally-correlated errors, matching, etc., when used appropriately, achieve a “practical” effect size much larger than the “true effect.” That Someone would have to assume that people actually use those methods, and use them correctly, which we know is often untrue–maybe that’s your point?

    Like I said, I know you know this, but the precision of your statement implies a more precise claim, either that effect size dictates power or that the tools we use to boost power are exactly cancelled out by measurement error, small sample sizes and general incompetence.

    • Michael:

      To put it another way, when I say, “a study will have power of 50% if the true effect is 2 standard errors away from zero,” there’s a lot packed into that “standard error”: good design and measurement can give you a low standard error. Also, I’m assuming the estimate is unbiased, which again comes from good design and measurement (including concern with validity of measurement, which goes beyond the usual statistics-textbook discussions of randomization, blindness, and sample size).
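
      As a minimal numerical sketch of that point (my illustration; the numbers and the two-group setup are assumptions, not anything from the papers discussed here): if each subject’s score is the average of m noisy measurements, the same raw effect sits further from zero in standard-error units as m grows, so power rises.

          import numpy as np
          from scipy.stats import norm

          def power_two_group(effect, sd_true, sd_meas, n_per_group, m_measures):
              # Two-sided z-test power for a two-group mean comparison where
              # each subject's score averages m_measures noisy measurements.
              var_subject = sd_true**2 + sd_meas**2 / m_measures
              se = np.sqrt(2 * var_subject / n_per_group)
              z_crit = norm.ppf(0.975)
              return norm.cdf(effect / se - z_crit) + norm.cdf(-effect / se - z_crit)

          # Assumed for illustration: true effect 0.3, true-score SD 1,
          # measurement SD 1, and 50 subjects per group.
          for m in (1, 4, 16):
              print(m, round(power_two_group(0.3, 1.0, 1.0, 50, m), 2))
          # Power climbs (roughly 0.19, 0.27, 0.31) even though the
          # "true effect" never changes; only the standard error does.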

      • Well, I think it’s fair to say, Andrew, that your original statement is a bit confusing, in that the writeup emphasizes the effect size part, making readers think that the main issue is that effects aren’t big. But really we are talking about the ratio of effect to SE.

        Frankly I find it very hard to interpret all this power discussion. More power is good, but if we seek power too hard we stop investigating more subtle effects that are hard to detect, or end up doing many fewer studies. And we will still end up making those mistakes that have little to do with statistical power (like inability to generalise, bad treatment of confounding, forking paths, bad sampling, bad methods….)

        • In (partial) defense of prioritizing power, the purpose of some studies isn’t to establish whether an effect exists, but to determine the effect size itself. Most education interventions, for example, are known a priori to increase learning, but if the effect size compared with business as usual isn’t big enough, it’s not worth changing education practices. This is especially true in terms of effects on subgroups, since n < N by definition. I just wish we could move from the standard of sufficient power = CI does not contain zero 80% of the time to something like sufficient power = CI will be narrower than x 80% of the time.
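
          A rough simulation sketch of that precision-based standard (every number here is a made-up placeholder): pick the smallest n at which the 95% CI half-width comes in under the target width in at least 80% of simulated studies.

              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(0)
              sigma = 1.0              # assumed SD of the outcome
              target_halfwidth = 0.2   # want the 95% CI no wider than +/- 0.2
              n_sims = 5000

              def prob_ci_narrow_enough(n):
                  # Fraction of simulated studies whose 95% CI half-width
                  # falls below the target.
                  samples = rng.normal(0.0, sigma, size=(n_sims, n))
                  s = samples.std(axis=1, ddof=1)
                  halfwidth = stats.t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
                  return np.mean(halfwidth < target_halfwidth)

              for n in (80, 90, 100, 110, 120):
                  print(n, round(prob_ci_narrow_enough(n), 3))
              # Take the smallest n whose printed probability exceeds 0.80.

          The same loop works for any design in which you can simulate the interval width.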

  4. Do statistics ever tell us anything? :) Statistics is the modern alchemy. Some people do seem to understand how to use statistics to analyze and explore natural phenomena, but it seems to be a small fraction of the people who are claiming that they understand. It’s not hard to see why students wind up going down the wrong road and perpetuating the errors and fallacies.

    Recently I bought a book with a cool sounding title. I was on about page three when the author referenced Wansink’s bottomless soup bowls as among the most compelling evidence of the basic thesis of the book. I was shocked that I had been taken in so easily. I hadn’t recognized the relationship between the thesis of the book and Wansink’s bottomless bowls of bullshit. I had thought of Wansink’s bullshit as quack science about nutrition, but now I see that it “supports” a broad range of quack psychology. Nothing is safe. Now I can’t read the book. Who do I sue to get my $25 back?

      • “Subliminal – how your unconscious mind rules your behavior” – Leonard Mlodinow

        I’m embarrassed because now the connection seems obvious but it just sounded interesting when I bought it!

        I misrecollected the details above. The offending paragraph is paragraph 1, page 20; and it wasn’t Wansink’s bottomless soup bowls that were referenced, it was the stale popcorn experiment. I remember reading it and having that sense that it was vaguely familiar, then thinking: what moron theatre owner or chain would allow a social scientist to give his patrons stale popcorn? Then I checked the reference.

        I don’t recall if the popcorn experiment specifically has been debunked, but it sounds like bullshit to me.

        • jim, don’t feel bad, I’m sure they used some sort of subliminal advertising on the cover that left you helpless to resist.

      • My point with my original comment is that so much BS is being peddled that nothing can be taken at face value. Even researchers who are trying to do things right can be so badly misinformed, poorly educated and/or misled that their work has no merit. And even highly skeptical people with moderate statistical and/or scientific literacy like myself will get duped sooner or later just because of the sheer volume of trash.

        • Yeah, but we are now *realizing* this. That is the thing that fascinates me. We are at a point in history where we just noticed that a lot of standard statistical practices, if applied uncritically, can lead us seriously astray.

          I don’t know if anybody else feels this, but the psychologist in me just keeps repeating: “errors are an important opportunity to learn”…

        • A problem for a layperson is that if you understand this and try to tell your friends, you sound like a real crank.

        • I won’t dispute that, but the replication projects are new, aren’t they? The technology and code that we apply to this problem are different now.

          I believe the game has changed.

        • Not really, there is just more widespread awareness these days.

          For instance, anyone who wandered into meta-analysis of clinical trials in the 1980s noticed the large variance among results from very similar trials, as did astronomers in the 1800s.

  5. Makes sense, SE’s, not SD’s. In fact, with everything packed in, SE is likely an underestimate: there’s no reason to think that noise from sources like measurement error decreases with sample size by a factor of sqrt(n). Larger samples can actually mean increasing heterogeneity (e.g., DIF) that may not be accounted for fully.

  6. Sorry if I have missed a discussion buried in the comments.

    Andrew Gelman’s original post seems to equate an effect size of two standard errors with one of one standard deviation. Is there really an invariant relation between the two?

    And what is a “standard error”?

    Yours Sincerely

  7. Hi,

    if only we had a method that could estimate the average power of studies. Oh wait, we do.

    First, a representative sample of honest replication studies provides an estimate of average power, because the replication success rate (including the 5% rate contributed by true null hypotheses) estimates the long-run probability of obtaining a significant result. And what did the much celebrated and cited OSC replication project for psychology find? 37% significant results. This is a power estimate of roughly 40%. And if we allow for some real null effects, the power to detect true effects is greater than that. So, power is at least 37%.
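
    A minimal simulation sketch of that logic (not the z-curve method itself; the mix of effect sizes below is an arbitrary assumption): across a heterogeneous batch of studies, the expected share of significant results equals the average of the individual studies’ power.

        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(1)
        z_crit = norm.ppf(0.975)

        # Hypothetical effect-to-SE ratios for 10,000 studies (some true nulls).
        ratios = rng.choice([0.0, 0.5, 1.0, 2.0, 3.0], size=10_000,
                            p=[0.2, 0.3, 0.2, 0.2, 0.1])

        # Analytic power of each study (two-sided test, alpha = 0.05).
        power = norm.cdf(ratios - z_crit) + norm.cdf(-ratios - z_crit)

        # One simulated outcome per study: is the observed z significant?
        z_obs = rng.normal(ratios, 1.0)
        significant = np.abs(z_obs) > z_crit

        print("average power:     ", round(float(power.mean()), 3))
        print("share significant: ", round(float(significant.mean()), 3))
        # The two numbers agree up to simulation noise.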

    We also have to allow for failed replications that messed up experimental procedures. Fortunately, we can even estimate power based on published test-statistics using a method called z-curve. This gives power estimates around 50% for social psychology. Maybe time to update the biasian prior about power based on the data that we have. :)

    You can even compute your own estimates of power using the R package for z-curve, now on CRAN.

    Cheers, Uli

    https://replicationindex.com/2020/01/10/z-curve-2-0/

      • No. Ideally the replication studies would have used the same sample size as the original studies. This was not the case because the replication team made an unfortunate attempt to power replication studies at 90%, but used the published effect sizes to do so. Ouch.
        As a result, some studies had larger sample sizes, which makes it possible to power down the replication result. This doesn’t really change the replication outcomes because the increase in power is often minimal. Other studies were powered down (double ouch). Here failures to replicate (in one case with p = .06) occurred simply because the replication study had lower power. However, overall the rate is what we would expect for a set of studies with the same sample sizes as in the original studies.

        One argument one could make is that the set of studies also included cognitive psychology which is not social science. If we focus on social psychology alone, the success rate is lower and when we focus on experimental social psychology it is much lower.

        Mainly I am saying, we have better evidence than the evidence you cite and question.

        Best, Uli

  8. Who cares what *average* power is anyway? Once you take all sources of variance into account, the uncertainty of the power estimate can stretch from 0.06 all the way to 0.90, leaving you scratching your head and asking, why did I even bother computing this?
