Getting all negative about so-called average power

Blake McShane writes:

The idea of retrospectively estimating the average power of a set of studies via meta-analysis has recently been gaining a ton of traction in psychology and medicine. This seems really bad for two reasons:

1. Proponents claim average power is a “replicability estimate” and that it estimates the rate of replicability “if the same studies were run again”. Estimation issues aside, average power obviously says nothing about replicability in any real sense that is meaningful for actual prospective replication studies. It perhaps only says something about replicability if we were able to replicate in the purely hypothetical repeated sampling sense and if we defined success in terms of statistical significance.

2. For the reason you point out in your Bababekov et al. Annals of Surgery letter, the power of a single study is not estimated well:
taking a noisy estimate and plugging it into a formula does not give us “the power”; it gives us a very noisy estimate of the power
Having more than one study in the average power case helps, but not much. For example, in the ideal case of k studies all with the same power, z.bar.hat ~ N(z.true, 1/k) and mapping this estimate to power results in a very noisy distribution except for k large (roughly 60 in this ideal case). If you also try to adjust for publication bias as average power proponents do, the resulting distribution is much noisier and requires hundreds of studies for a precise estimate.

In sum, people are left with a noisy estimate that doesn’t mean what they think it means and that they do not realize is noisy!

With all this talk of negativity and bearing in mind the bullshit asymmetry principle, I wonder whether you would consider posting something on this or having a blog discussion or guest post or something along those lines. As Sander and Zad have discussed, it would be good to stop this one in its tracks fairly early on before it becomes more entrenched so as to avoid the dreaded bullshit asymmetry.

He also links to an article, “Average Power: A Cautionary Note,” with Ulf Böckenholt and Karsten Hansen, where they find that “point estimates of average power are too variable and inaccurate for use in application” and that “the width of interval estimates of average power depends on the corresponding point estimates; consequently, the width of an interval estimate of average power cannot serve as an independent measure of the precision of the point estimate.”
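Just to illustrate how noisy these estimates are, here is a minimal simulation sketch of the ideal case McShane describes (my own toy code, not code from the paper; the assumed true power of 0.5 and the values of k are arbitrary choices):

```python
# Ideal case: k studies with identical true power; the mean of their z-statistics
# is distributed N(z.true, 1/k). We map that noisy mean back to "average power."
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # critical value of a two-sided z-test, ~1.96

def power_from_z(z):
    """Power of a two-sided z-test when the true standardized effect is z."""
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)

true_power = 0.5                                   # arbitrary assumed truth
z_true = brentq(lambda z: power_from_z(z) - true_power, 0, 10)

rng = np.random.default_rng(1)
for k in (10, 60, 500):
    z_bar_hat = rng.normal(z_true, np.sqrt(1 / k), size=100_000)  # variance 1/k
    power_hat = power_from_z(z_bar_hat)
    lo, hi = np.percentile(power_hat, [2.5, 97.5])
    print(f"k = {k:3d}: 95% of 'average power' estimates fall in [{lo:.2f}, {hi:.2f}]")
```

With these made-up numbers, the 95% range of the estimate is roughly 0.27 to 0.73 at k = 10 and still about 0.40 to 0.60 at k = 60, consistent with the point that even the ideal case needs dozens of studies.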

39 thoughts on “Getting all negative about so-called average power”

  1. Mmm, I must be wrong about this. But in principle, it seems to me like the cumulative power would be more important than the average power. After all, a test with high power is really a sequence of repeated tests with low power.

    • The first point the paper makes–that estimation issues aside, average power just isn’t an interesting or useful quantity–seems pretty obvious for the reasons Andrew, Anoneuoid, Fisher, and others point out below. What may be interesting is that the authors seem to have had to belabor this point to their audience, presumably because it is not so obvious to them.

      What is remarkable though is the second point. Sure the post-hoc power estimate of a single study is going to be noisy, but that it takes 60 studies to get a vaguely precise estimate of post-hoc retrospective meta-analytic average power surprised me even though it falls out so simply from the math. And that’s 60 in the most ideal conditions possible (all studies have the same power, no heterogeneity, no publication bias, no moderators, etc.). The authors show it gets much much worse in the sense that you need many many more studies when you move away from this ideal.

      The upshot is that even if we were for some strange reason interested in estimating average power, we could never do so in practice because (a) we almost never have 60 studies and (b) in the rare cases we do, there will be heterogeneity, publication bias, moderators, etc. And all of this applies to median power and all other measures of central tendency too (i.e., because all studies have the same power in the ideal condition, the average, median, etc. are all the same).

      Andrew really was right: post-hoc power calculations, for a single study or a meta-analytic average of many studies, really are like a shit sandwich!

    • There is never any response to this. It is just Bayes rule:

      p(H[0])*p(H[0]|D) = p(H[0])*p(H[0]|D) / [ p(H[0])*p(H[0]|D) + p(H[1:n])*p(H[1:n]|D) ]

      Where H[0:n] are the plausible hypotheses for the data (D). You cannot meaningfully calculate the power of your test of H[0] without considering the alternative explanations H[1:n]. It is that simple.

        • Let’s try once more, even though everyone knows what I meant:

          p(H[0]|D) = p(H[0])*p(D|H[0]) / [ p(H[0])*p(D|H[0]) + p(H[1:n])*p(D|H[1:n]) ]
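          As a toy numeric sketch of that point (the priors and likelihoods below are made-up numbers, purely for illustration): the posterior for H[0] changes whenever you add or drop a rival hypothesis, so it cannot be computed from H[0] alone.

          ```python
          # Toy Bayes-rule sketch: the posterior of H0 depends on which rival hypotheses are included.
          # All priors p(H) and likelihoods p(D|H) below are hypothetical numbers.
          priors = {"H0": 0.5, "H1": 0.3, "H2": 0.2}
          likelihoods = {"H0": 0.10, "H1": 0.40, "H2": 0.05}

          evidence = sum(priors[h] * likelihoods[h] for h in priors)        # p(D)
          posterior = {h: priors[h] * likelihoods[h] / evidence for h in priors}
          print(posterior)   # drop H1 from the dictionaries and p(H0|D) changes a lot
          ```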

      • > You cannot meaningfully calculate power of your test of H[0] without considering the alternative explanations H[1:n]. It is that simple.

        That’s why the power of the test is calculated against a specific alternative hypothesis.
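        For example, here is a minimal sketch of a textbook calculation of this kind (all numbers hypothetical): the power of a two-sided z-test of H0: mu = 0 against the specific alternative mu = 1, with sigma = 2 and n = 25.

        ```python
        # Power of a two-sided z-test of H0: mu = 0 against the specific alternative mu = 1.
        # sigma, n, and alpha are hypothetical values chosen only for illustration.
        import numpy as np
        from scipy.stats import norm

        mu_alt, sigma, n, alpha = 1.0, 2.0, 25, 0.05
        z_crit = norm.ppf(1 - alpha / 2)
        shift = mu_alt * np.sqrt(n) / sigma            # standardized effect under mu = 1
        power = norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
        print(f"power against mu = {mu_alt}: {power:.2f}")   # about 0.71 for these numbers
        ```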

        • Usually it is tested against the alternative hypothesis that, assuming the model is correct, the value of the parameter is not exactly x (whatever value they set as the “null hypothesis”). This ignores all other explanations (i.e., models/likelihoods). Read the Fisher paper.

        • If you don’t agree that power calculations are done considering a specific alternative hypothesis, maybe you give an example of a power calculation which is based only on the test and null hypothesis (and leaves the alternative hypothesis unspecified).

        • Not sure what you have me agreeing to or not, but it seems irrelevant. I’ll just quote Fisher from the paper above:

          The phrase “Errors of the second kind”, although apparently only a harmless piece of technical jargon, is useful as indicating the type of mental confusion in which it was coined. In an acceptance procedure lots will sometimes be accepted which would have been rejected had they been examined fully, and other lots will have been rejected when, in this sense, they ought to have been accepted. A well-designed acceptance procedure is one which attempts to minimize the losses entailed by such events. To do this one must take account of the costliness of each type of error, if errors they should be called, and in similar terms of the costliness of the testing process; it must take account also of the frequencies of each type of event. For this reason probability a priori, or rather knowledge based on past experience, of the frequencies with which lots of different quality are offered, is of great importance; whereas, in scientific research, or in the process of “learning by experience”, such knowledge a priori is almost always absent or negligible.

          Simply from the point of view of an acceptance procedure, though we may by analogy think of these two kinds of events as “errors” and recognize that they are errors in opposite directions, I doubt if anyone would have thought of distinguishing them as of two kinds, for in this milieu they are essentially of one kind only and of equal theoretical importance. It was only when the relation between a test of significance and its corresponding null hypothesis was confused with an acceptance procedure that it seemed suitable to distinguish errors in which the hypothesis is rejected wrongly, from errors in which it is “accepted wrongly” as the phrase does. The frequency of the first class, relative to the frequency with which the hypothesis is true, is calculable, and therefore controllable simply from the specification of the null hypothesis. The frequency of the second kind must depend not only on the frequency with which rival hypotheses are in fact true, but also greatly on how closely they resemble the null hypothesis. Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis, and would never have come into consideration in the theory only of tests of significance, had the logic of such tests not been confused with that of acceptance procedures.

          It may be added that in the theory of estimation we consider a continuum of hypotheses each eligible as null hypothesis, and it is the aggregate of frequencies calculated from each possibility in turn as true - including frequencies of error, therefore only of the “first kind”, without any assumptions of knowledge a priori - which supply the likelihood function, fiducial limits, and other indications of the amount of information available. The introduction of allusions to errors of the second kind in such arguments is entirely formal and ineffectual.

          The fashion of speaking of a null hypothesis as “accepted when false”, whenever a test of significance gives us no strong reason for rejecting it, and when in fact it is in some way imperfect, shows real ignorance of the research workers’ attitude, by suggesting that in such a case he has come to an irreversible decision.

          The worker’s real attitude in such a case might be, according to the circumstances:

          (a) “The possible deviation from truth of my working hypothesis, to examine which the test is appropriate, seems not to be of sufficient magnitude to warrant any immediate modification.”

          Or it might be:

          (b) “The deviation is in the direction expected for certain influences which seemed to me not improbable, and to this extent my suspicion has been confirmed; but the body of data available so far is not by itself sufficient to demonstrate their reality.”

          These examples show how badly the word “error” is used in describing such a situation. Moreover, it is a fallacy, so well known as to be a standard example, to conclude from a test of significance that the null hypothesis is thereby established; at most it may be said to be confirmed or strengthened.

          In an acceptance procedure, on the other hand, acceptance is irreversible, whether the evidence for it was strong or weak. It is the result of applying mechanically rules laid down in advance; no thought is given to the particular case, and the tester’s state of mind, or his capacity for learning, is inoperative.

          By contrast, the conclusions drawn by a scientific worker from a test of significance are provisional, and involve an intelligent attempt to understand the experimental situation.

        • I thought that your reply manifested some sort of disagreement with my remark “That’s why the power of the test is calculated against a specific alternative hypothesis.”

          I of course agree that “[Errors of the second kind are] incalculable both in frequency and in magnitude merely from the specification of the null hypothesis”. Which is why, as I said, the power of the test is calculated against a specific alternative hypothesis.

        • I think the problem is statisticians often assume a model is true then think all the different combinations of parameter values for that model are the different possible hypotheses.

          The real universe of hypotheses also includes the other possible models. You can get a decent power calculation if you specify the top 10 or so plausible models, but not from specifying one model then saying “anything except that one model”.

        • > You can get a decent power calculation if you specify the top 10 or so plausible models, but not from specifying one model then saying “anything except that one model”.

          If you specify one model (the null hypothesis, if I understand correctly) and then say “anything except that one model” you cannot get a power calculation at all. Hopefully we agree on that.

        • Please disregard my previous comment, I may have misread what you wrote and I’m not sure I understand what you’re talking about now.

        • “Hypothesis” is not being used here in the sense of, e.g., “assume this sample came from a normal distribution, now what is the mean.” There, all the “hypotheses” are just different values for the mean.

          But the sample could also come from some other distribution. You cannot calculate the type II error rate without specifying these other distributions.

          Think more of competing scientific theories derived from assumptions as your model.

        • > Which is why, as I said, the power of the test is calculated against a specific alternative hypothesis.

          Yes, and I already wrote:

          > Usually it is tested against the alternative hypothesis that, assuming the model is correct, the value of the parameter is not exactly x (whatever value they set as the “null hypothesis”). This ignores all other explanations (i.e., models/likelihoods). Read the Fisher paper.

          You cannot calculate power based on specifying the “alternative hypothesis” that the model is correct but the parameter value is not what we set as the null hypothesis. That is the same thing as specifying the null hypothesis; there is no additional info.

          You need to specify the other plausible hypotheses/models you are comparing to. If the data does not fit them well, but fits your null model well, then you have high power. If the data fits the other models equally well as the null model, then you have low power.

        • Just to be clear. If you have a model and the null hypothesis is “mu is 0” you can calculate the power of a test against a specific alternative hypothesis as “mu is 1”. You cannot calculate the power of a test against the non-specific alternative hypothesis “mu is not 0”. We strongly agree.

        • > Just to be clear. If you have a model and the null hypothesis is “mu is 0” you can calculate the power of a test against a specific alternative hypothesis as “mu is 1”. You cannot calculate the power of a test against the non-specific alternative hypothesis “mu is not 0”. We strongly agree.

          > type II error is the non-rejection of a false null hypothesis (also known as a “false negative” finding or conclusion).

          https://en.wikipedia.org/wiki/Type_I_and_type_II_errors

          You can calculate something but you can’t determine this by only specifying those two “hypotheses”.

        • Not to put words in your mouth, but I read that as saying you don’t think you need to consider alternative explanations when judging whether a given explanation is correct?

          Anything else seems to be sidestepping or disregarding Bayes rule. Perhaps you mean in the sense of only checking the probability of a parameter assuming a given likelihood [explanation] is correct?

          I’m assuming you refer to this paper: http://www.stat.columbia.edu/~gelman/research/published/philosophy.pdf

        • Beyond the philosophical difficulties, there are technical problems with methods that purport to determine the posterior probability of models, most notably that in models with continuous parameters, aspects of the model that have essentially no effect on posterior inferences within a model can have huge effects on the comparison of posterior probability among models. Bayesian inference is good for deductive inference within a model; we prefer to evaluate a model by comparing it to data.

          I mean, if when using Bayes rule you assume the likelihood is correct to begin with and only some parameters of it are misspecified, then it’s game over and you are doing religion. But the whole point of science is to find the right likelihood function (natural law).

  2. We’ve still got a long way to go in educating users of statistics. The human tendency to believe what you want to believe is very strong. “Keep it simple, Stupid” might be good advice in some specific instances, but “It ain’t simple, Stupid” fits a lot of situations.

  3. I stumbled late on this post and read the linked warning piece. I found myself wondering: are there theorems concerning the reduction or factoring of uncertainty? It looks like a bunch of studies that are then grouped. But shouldn’t there be some factoring available by which you identify a noise pattern and apply that? There should then be some general form or rule or modular definition by which some ‘remainder’ is determined (complete with sign potential, I would think). I’m just wondering aloud.

  4. Blake McShane is an asshole. If you want to criticize something, you should at least cite the work you are criticizing.
    And in 2002 I made Ulf Bockenholt a co-author on a paper he commented on for five minutes. Now he doesn’t even cite my work.
    Academia is full of assholes.

    If you want to see what average power can do for you, look at Roy F. (I don’t know what the F stands for) Baumeister’s z-curve, which can be used to estimate the average power of his famous ego-depletion studies.
    You see that he has a 92% success rate (not counting marginal significance) in his reported studies, but he only has 20% power to do so.
    If this doesn’t tell you something of interest, you are not very interested in replicability.

    https://replicationindex.com/2018/11/16/replicability-audit-of-roy-f-baumeister/
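    As a back-of-the-envelope check of why that combination is so implausible (my own toy numbers, not figures from the audit: suppose, hypothetically, 25 reported studies):

    ```python
    # If each of 25 studies really had 20% power, the chance of 23+ significant
    # results (roughly a 92% success rate) is vanishingly small. All numbers hypothetical.
    from scipy.stats import binom

    k, power, successes = 25, 0.20, 23
    print(binom.sf(successes - 1, k, power))   # P(at least 23 of 25 significant)
    ```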

    If you want to learn more about the replication crisis in social psychology, you can check out this preprint/blog version of an in-press article.

    https://replicationindex.com/2020/01/05/replication-crisis-review/

    • The McShane paper cites you 20 times beginning in the second paragraph, but never mind that. I have a question about your z-curve method that you link to.

      In the ideal case the McShane paper first considers, the authors show the MLE based on k studies has sampling distribution N(truth, 1/k). They then do a simple change of variables to obtain the sampling distribution of average power (their Equation 4). The distribution is, as claimed, wide unless k is very large. That’s the math, and it is pretty trivial.
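      For concreteness, here is a generic version of that change-of-variables step, written in the notation above (my own sketch of the standard calculation, not necessarily the paper’s exact Equation 4):

      $$
      \hat{\bar z} \sim \mathrm{N}\!\left(\bar z, \tfrac{1}{k}\right), \qquad
      \widehat{\mathrm{power}} = g(\hat{\bar z}), \qquad
      g(z) = \Phi(z - z_{\alpha/2}) + \Phi(-z - z_{\alpha/2}),
      $$

      and, on the range where $g$ is increasing (that is, for $\bar z > 0$),

      $$
      f_{\widehat{\mathrm{power}}}(p) = \frac{\sqrt{k}\,\phi\!\left(\sqrt{k}\,\big[g^{-1}(p) - \bar z\big]\right)}{g'\!\left(g^{-1}(p)\right)},
      $$

      a density that remains spread out over much of $(0,1)$ unless $k$ is large.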

      Are you saying your z-curve method can beat the MLE in this one-dimensional problem? If so, that would be of interest not only to the blog but the statistics community at large. It would be helpful if you could clarify.

    • I am struggling to come up with a charitable explanation for this comment. After all, McShane, Böckenholt, & Hansen’s *4th sentence* cites Brunner & Schimmack (2016) as well as Schimmack & Brunner (2017)!

      Were you not able to make it that deep into the paper?

        • Ulrich:

          Yeah, you should definitely regret posting that comment. Beyond the unnecessary rudeness, you wrote that McShane and Bockenholt don’t cite your work, but they directly cite your work in the linked paper. That’s just ridiculous, the kind of thing we might see on twitter, maybe, but we expect better in the blog comments here.

  5. McShane is an asshole because he criticizes average power without even mentioning my work on it and why it is useful to estimate it.
    Bockenholt is an asshole because I helped him with data when I was a post-doc in Illinois and even made him a co-author on a paper, and now he doesn’t even acknowledge my work in this area. Not citing work that doesn’t fit your purposes is just so low and prevents our science from making progress.

    Regarding the uselessness of average power, I present to you the z-curve of Roy F. (don’t know what the F stands for) Baumeister. He has a 92% success rate (not counting p = .06 successes) with just 20% power. If you think this is not useful to know to evaluate his work, please let me know why not.

    https://replicationindex.com/…/replicability-audit-of…/

    • I am struggling to come up with a charitable explanation for this comment. After all, McShane, Böckenholt, & Hansen’s *4th sentence* cites Brunner & Schimmack (2016) as well as Schimmack & Brunner (2017)!

      Were you not able to make it that deep into the paper?

      • I misremembered. They cite us, but only to trash our approach. The main criticism is apparently that power is useless because significance testing is useless. Oh my god. Do we need a full paper that criticizes average power if you want to say significance testing is wrong?
        The reality is that 99% of psychology uses this statistical approach. We all know that rejecting the nil-hypothesis is not the end-all of science, but why criticize a method that shows even this small goal is often achieved only by fudging the data in low-power studies? Anyhow, the whole article is just junk, and the precision of our power estimates depends, of course, on sample size.

        • Ulrich:

          I think an apology is in order. You called them “assholes” for doing something they didn’t even do. If you have specific criticisms of their paper, fine, that’s another story.

          Your actual criticisms here are pretty empty. You write, “the main criticism is apparently that power is useless because significance testing is useless.” Actually, this is what they say:

          In this article, we have two aims. First, we clarify the nature of average power and its implications for replicability. We explain that average power is not relevant to the replicability of actual prospective replication studies. Instead, it relates to efforts in the history of science to catalogue the power of prior studies. Second, we evaluate the statistical properties of point estimates and interval estimates of average power obtained via the meta-analytic approach. We find that point estimates of average power are too variable and inaccurate for use in application. We also find that the width of interval estimates of average power depends on the corresponding point estimates; consequently, the width of an interval estimate of average power cannot serve as an independent measure of the precision of the point estimate.

          This is right in the abstract.

          It’s fine to express disagreement about methodology. Not so fine to misrepresent what’s in a paper and attack the authors for something they didn’t do.

        • First you make the incorrect citation claim, and now you make this incorrect claim about significance testing. Criticism of significance testing represents two paragraphs of the paper (page 2), and these two paragraphs deal not so much with criticism of significance testing as with its consequences.

          The bulk of the paper is an exercise in mathematical statistics. It shows that, among other things, the MLE of average power in a one-dimensional, ideal setting is very noisy unless the number of studies is very large.

          I repeat what I asked you above and which has gone unanswered: are you claiming, as you seem to imply, that your z-curve method dominates the MLE in the one-dimensional, ideal setting of their Scenario 1? It would be very helpful to clarify this, or what precisely you are claiming.
