Blake McShane writes:

The idea of retrospectively estimating the average power of a set of studies via meta-analysis has recently been gaining a ton of traction in psychology and medicine. This seems really bad for two reasons:

1. Proponents claim average power is a “replicability estimate” and that it estimates the rate of replicability “if the same studies were run again”. Estimation issues aside, average power obviously says nothing about replicability in any real sense that is meaningful for actual prospective replication studies. It perhaps only says something about replicability if we were able to replicate in the purely hypothetical repeated sampling sense and if we defined success in terms of statistical significance.

2. For the reason you point out in your Bababekov et al. Annals of Surgery letter, the power of a single study is not estimated well:

taking a noisy estimate and plugging it into a formula does not give us “the power”; it gives us a very noisy estimate of the power

Having more than one study in the average power case helps, but not much. For example, in the ideal case of k studies all with the same power, z.bar.hat ~ N(z.true, 1/k) and mapping this estimate to power results in a very noisy distribution except for k large (roughly 60 in this ideal case). If you also try to adjust for publication bias as average power proponents do, the resulting distribution is much noisier and requires hundreds of studies for a precise estimate. In sum, people are left with a noisy estimate that doesn’t mean what they think it means and that they do not realize is noisy!

With all this talk of negativity and bearing in mind the bullshit asymmetry principle, I wonder whether you would consider posting something on this or having a blog discussion or guest post or something along those lines. As Sander and Zad have discussed, it would be good to stop this one in its tracks fairly early on before it becomes more entrenched so as to avoid the dreaded bullshit asymmetry.

He also links to an article, “Average Power: A Cautionary Note,” with Ulf Böckenholt and Karsten Hansen, where they find that “point estimates of average power are too variable and inaccurate for use in application” and that “the width of interval estimates of average power depends on the corresponding point estimates; consequently, the width of an interval estimate of average power cannot serve as an independent measure of the precision of the point estimate.”
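
To get a feel for the ideal case McShane describes, here is a rough sketch in Python. It assumes a two-sided z-test at alpha = 0.05 and a true z chosen so that true power is about 0.5; the function names and the values of k are illustrative choices, not taken from the paper. It pushes the central 95% of z.bar.hat ~ N(z.true, 1/k) through the power function to show how wide the implied range of power estimates is for a given k.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # assumed: two-sided test at alpha = 0.05


def power(z):
    """Power of a two-sided z-test when the true standardized effect is z."""
    return STD_NORMAL.cdf(z - Z_CRIT) + STD_NORMAL.cdf(-z - Z_CRIT)


def power_estimate_range(z_true, k):
    """Map the central 95% of z.bar.hat ~ N(z_true, 1/k) to the power scale.

    This is a 95% range for the power estimate only while the lower
    endpoint of the z interval is >= 0, where power() is monotone.
    """
    half_width = Z_CRIT / sqrt(k)
    return power(z_true - half_width), power(z_true + half_width)


if __name__ == "__main__":
    z_true = Z_CRIT  # assumed true effect; true power is then about 0.5
    for k in (1, 5, 20, 60, 200):  # illustrative numbers of studies
        lo, hi = power_estimate_range(z_true, k)
        print(f"k={k:3d}  power estimate range: {lo:.2f} to {hi:.2f}")
```

The range narrows only at rate 1/sqrt(k), and this sketch ignores heterogeneity and publication bias entirely, so it is the best case rather than a realistic one.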

Mmm, I must be wrong about this. But in principle, it seems to me like the cumulative power would be more important than the average power. After all, a test with high power is really a sequence of repeated tests with low power.

As the saying goes, post-hoc power calculations are like a shit sandwich!

https://statmodeling.stat.columbia.edu/2019/01/13/post-hoc-power-calculation-like-shit-sandwich/

The first point the paper makes–that estimation issues aside, average power just isn’t an interesting or useful quantity–seems pretty obvious for the reasons Andrew, Anoneuoid, Fisher, and others point out below. What may be interesting is that the authors seem to have had to belabor this point to their audience, presumably because it is not so obvious to them.

What is remarkable though is the second point. Sure the post-hoc power estimate of a single study is going to be noisy, but that it takes 60 studies to get a vaguely precise estimate of post-hoc retrospective meta-analytic average power surprised me even though it falls out so simply from the math. And that’s 60 in the most ideal conditions possible (all studies have the same power, no heterogeneity, no publication bias, no moderators, etc.). The authors show it gets much much worse in the sense that you need many many more studies when you move away from this ideal.

The upshot is that even if we were for some strange reason interested in estimating average power, we could never do so in practice because (a) we almost never have 60 studies and (b) in the rare cases we do, there will be heterogeneity, publication bias, moderators, etc. And all of this applies to median power and all other measures of central tendency too (i.e., because all studies have the same power in the ideal condition, the average, median, etc. are all the same).

Andrew really was right: post-hoc power calculations, for a single study or a meta-analytic average of many studies, really are like a shit sandwich!

Good points.

As Fisher said, type II error and thus statistical power is the product of “mental confusion”.

https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1955.tb00180.x

There is never any response to this. It is just Bayes rule:

p(H[0]|D) = p(H[0])*p(D|H[0]) / [ p(H[0])*p(D|H[0]) + p(H[1:n])*p(D|H[1:n]) ]

Where H[0:n] are the plausible hypotheses for the data (D). You cannot meaningfully calculate power of your test of H[0] without considering the alternative explanations H[1:n]. It is that simple.

Let’s try once more, even though everyone knows what I meant:

> You cannot meaningfully calculate power of your test of H[0] without considering the alternative explanations H[1:n]. It is that simple.

That’s why the power of the test is calculated against a specific alternative hypothesis.

Usually it is tested against the alternative hypothesis that assuming the model is correct the value of the parameter is not exactly x (whatever value they set as the “null hypothesis”). This ignores all other explanations (ie, models/likelihoods). Read the Fisher paper.

If you don’t agree that power calculations are done considering a specific alternative hypothesis, maybe you can give an example of a power calculation which is based only on the test and null hypothesis (and leaves the alternative hypothesis unspecified).

Not sure what you have me agreeing to or not, but it seems irrelevant. I’ll just quote Fisher from the paper above:

I thought that your reply manifested some sort of disagreement with my remark “That’s why the power of the test is calculated against a specific alternative hypothesis.”

I of course agree that “[Errors of the second kind are] incalculable both in frequency and in magnitude merely from the specification of the null hypothesis”. Which is why, as I said, the power of the test is calculated against a specific alternative hypothesis.

I think the problem is statisticians often assume a model is true then think all the different combinations of parameter values for that model are the different possible hypotheses.

The real universe of hypotheses also includes the other possible models. You can get a decent power calculation if you specify the top 10 or so plausible models, but not from specifying one model then saying “anything except that one model”.

> You can get a decent power calculation if you specify the top 10 or so plausible models, but not from specifying one model then saying “anything except that one model”.

If you specify one model (the null hypothesis, if I understand correctly) and then say “anything except that one model” you cannot get a power calculation at all. Hopefully we agree on that.

Please disregard my previous comment, I may have misread what you wrote and I’m not sure I understand what you’re talking about now.

Hypothesis is not being used in the sense of eg “assume this sample came from a normal distribution, now what is the mean.” There, all the “hypotheses” are just different values for the mean.

But, the sample could also come from some other distribution. You cannot calculate type II error rate without specifying these other distributions.

Think more of competing scientific theories derived from assumptions as your model.

Yes, and I already wrote:

You cannot calculate power based on specifying the “alternative hypothesis” that the model is correct but the parameter value is not what we set as the null hypothesis. That is the same thing as specifying the null hypothesis, there is no additional info.

You need to specify the other plausible hypotheses/models you are comparing to. If the data does not fit them well, but fits your null model well, then you have high power. If the data fits the other models equally well as the null model, then you have low power.

Just to be clear. If you have a model and the null hypothesis is “mu is 0” you can calculate the power of a test against a specific alternative hypothesis as “mu is 1”. You cannot calculate the power of a test against the non-specific alternative hypothesis “mu is not 0”. We strongly agree.

https://en.wikipedia.org/wiki/Type_I_and_type_II_errors
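
To make the point about specific alternatives concrete, here is a minimal sketch. It assumes a two-sided z-test of “mu is 0” with known sigma = 1 and a hypothetical sample size of n = 10; the function name `power_against` and all the numbers are made up for illustration. Power comes out as a function of the alternative you plug in, so “mu is not 0” alone does not pin down a single number.

```python
from math import sqrt
from statistics import NormalDist


def power_against(mu_alt, n, sigma=1.0, mu_null=0.0, alpha=0.05):
    """Power of a two-sided z-test of H0: mu = mu_null when the true mean is mu_alt.

    Assumes normally distributed data with known sigma (an illustrative setup).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standardized distance between the alternative and the null.
    shift = (mu_alt - mu_null) / (sigma / sqrt(n))
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    n = 10  # hypothetical sample size
    for mu_alt in (0.0, 0.25, 0.5, 1.0):  # each is one specific alternative
        print(f"mu_alt={mu_alt:4.2f}  power={power_against(mu_alt, n):.3f}")
```

At mu_alt equal to the null value the function returns alpha, and it rises toward 1 as the alternative moves away from the null. None of this addresses the separate question raised in the thread about alternatives that are different models altogether.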

You can calculate something but you can’t determine this by only specifying those two “hypotheses”.

Anon:

I have very little interest in the probability of a hypothesis. See my paper with Shalizi for more on this point.

Hallelujah

Not to put words in your mouth, but I read that as you don’t think you need to consider alternative explanations when judging whether a given explanation is correct?

Anything else seems to be sidestepping or disregarding Bayes rule. Perhaps you mean in the sense of only checking the probability of a parameter assuming a given likelihood [explanation] is correct?

I’m assuming you refer to this paper: http://www.stat.columbia.edu/~gelman/research/published/philosophy.pdf

I mean, if when using Bayes rule you assume the likelihood is correct to begin with and only some of its parameters are misspecified, then it’s game over and you are doing religion. But the whole point of science is to find the right likelihood function (natural law).

We’ve still got a long way to go in educating users of statistics. The human tendency to believe what you want to believe is very strong. “Keep it simple, Stupid” might be good advice in some specific instances, but “It ain’t simple, Stupid” fits a lot of situations.

I stumbled late on this post and read the linked warning piece. I found myself wondering: are there theorems concerning the reduction or factoring of uncertainty? It looks like a bunch of studies that are then grouped. But shouldn’t there be some factoring available by which you identify a noise pattern and apply that? There should then be some general form or rule or modular definition by which some ‘remainder’ is determined (complete with sign potential, I would think). I’m just wondering aloud.

Blake McShane is an asshole. If you want to criticize something, you should at least cite the work you are criticizing.

And in 2002 I made Ulf Bockenholt a co-author on a paper he commented on for five minutes. Now he doesn’t even cite my work.

Academia is full of assholes.

If you want to see what average power can do for you, look at Roy F. (I don’t know what the F stands for) Baumeister’s z-curve, which can be used to estimate the average power of his famous ego-depletion studies.

You see that he has a 92% success rate (not counting marginal significance) in his reported studies, but he only has 20% power to do so.

If this doesn’t tell you something of interest, you are not very interested in replicability.

https://replicationindex.com/2018/11/16/replicability-audit-of-roy-f-baumeister/

If you want to learn more about the replication crisis in social psychology, you can check out this preprint/blog of an in-press article.

https://replicationindex.com/2020/01/05/replication-crisis-review/

The McShane paper cites you 20 times beginning in the second paragraph, but never mind that. I have a question about your z-curve method that you link to.

In the ideal case the McShane paper first considers, the authors show the MLE based on k studies has sampling distribution N(truth, 1/k). They then do a simple change of variables to show the sampling distribution of average power (Equation 4). The distribution is as claimed wide unless k is very large. That’s the math and it is pretty trivial.

Are you saying your z-curve method can beat the MLE in this one-dimensional problem? If so, that would be of interest not only to the blog but the statistics community at large. It would be helpful if you could clarify.

I am struggling to come up with a charitable explanation for this comment. After all, McShane, Böckenholt, & Hansen’s *4th sentence* cites Brunner & Schimmack (2016) as well as Schimmack & Brunner (2017)!

Were you not able to make it that deep into the paper?

Did somebody delete my post?

Ulrich:

We have an active spam filter and sometimes it takes a couple of days before I clean it out. Other times (as now) I happen to be checking it and cleaning it out right away.

I guess impolite posts are treated as spam. Probably regret posting it tomorrow, but what is done is done.

Cheers, Uli

Ulrich:

I have no idea what algorithm the spam filter uses. My guess is that your post got flagged because it had links.

Ulrich:

Yeah, you should definitely regret posting that comment. Beyond the unnecessary rudeness, you wrote that McShane and Bockenholt don’t cite your work, but they directly cite your work in the linked paper. That’s just ridiculous, the kind of thing we might see on twitter, maybe, but we expect better in the blog comments here.

McShane is an asshole because he criticizes average power without even mentioning my work on it and why it is useful to estimate it.

Bockenholt is an asshole because I helped him with data when I was a post-doc in Illinois and even made him a co-author on a paper and now he doesn’t even acknowledge my work in this area. Not citing work that doesn’t fit your purposes is just so low and prevents our science from making progress.

Regarding the uselessness of average power, I present to you the z-curve of Roy F. (don’t know what the F stands for) Baumeister. He has a 92% success rate (not counting p = .06 successes) with just 20% power. If you think this is not useful to know to evaluate his work, please let me know why not.

https://replicationindex.com/…/replicability-audit-of…/

I misremembered. They cite us but only to trash our approach. The main criticism is apparently that power is useless because significance testing is useless. Oh my god. Do we need a full paper that criticizes average power if you want to say significance testing is wrong?

The reality is that 99% of psychology uses this statistical approach. We all know that rejecting the nil-hypothesis is not the end all of science, but why criticize a method that shows even this small goal is often achieved only by fudging the data in low power studies. Anyhow, the whole article is just junk and the precision of our power estimates depends of course on sample size.

Ulrich:

I think an apology is in order. You called them “assholes” for doing something they didn’t even do. If you have specific criticisms of their paper, fine, that’s another story.

Your actual criticisms here are pretty empty. You write, “the main criticism is apparently that power is useless because significance testing is useless.” Actually, this is what they say:

This is right in the abstract.

It’s fine to express disagreement about methodology. Not so fine to misrepresent what’s in a paper and attack the authors for something they didn’t do.

First you make the incorrect citation claim, and now you make this incorrect claim about significance testing. Criticism of significance testing represents two paragraphs of the paper (page 2), and these two paragraphs deal not so much with criticism of significance testing but consequences of it.

The bulk of the paper is an exercise in mathematical statistics. It shows that, among other things, the MLE of average power in a one-dimensional, ideal setting is very noisy unless the number of studies is very large.

I repeat what I asked you above and which has gone unanswered: are you claiming, as you seem to imply, that your z-curve method dominates the MLE in the one-dimensional, ideal setting of their Scenario 1? It would be very helpful to clarify this, or what precisely you are claiming.