How post-hoc power calculation is like a shit sandwich

Posted on January 13, 2019 9:30 AM by Andrew

Damn. This story makes me so frustrated I can’t even laugh. I can only cry.

Here’s the background. A few months ago, Aleksi Reito (who sent me the adorable picture above) pointed me to a short article by Yanik Bababekov, Sahael Stapleton, Jessica Mueller, Zhi Fong, and David Chang in Annals of Surgery, “A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science,” which contained some reasonable ideas but also made a common and important statistical mistake.

I was bothered to see this mistake in an influential publication. Instead of blogging it, this time I decided to write a letter to the journal, which they pretty much published as is.

My letter went like this:

An article recently published in the Annals of Surgery states: “as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study”. This would be a bad idea. The problem is that the (estimated) effect size observed in a study is noisy, especially so in the sorts of studies discussed by the authors. Using estimated effect size can give a terrible estimate of power, and in many cases can lead to drastic overestimates of power . . . The problem is well known in the statistical and medical literatures . . . That said, I agree with much of the content of [Bababekov et al.] . . . I appreciate the concerns of [Bababekov et al.] and I agree with their goals and general recommendations, including their conclusion that “we need to begin to convey the uncertainty associated with our studies so that patients and providers can be empowered to make appropriate decisions.” There is just a problem with their recommendation to calculate power using observed effect sizes.

I was surgically precise, focusing on the specific technical error in their paper and separating this from their other recommendations.

And the letter was published, with no hassle! Not at all like my frustrating experience with the American Sociological Review.

So I thought the story was over.

But then my blissful slumber was interrupted when I received another email from Reito, pointing to a response in that same journal by Bababekov and Chang to my letter and others. Bababekov and Chang write:

We are greatly appreciative of the commentaries regarding our recent editorial . . .

So far, so good! But then:

We respectfully disagree that it is wrong to report post hoc power in the surgical literature. We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. . . . We also respectfully disagree that knowing the power after the fact is not useful in surgical science.

No! My problem is not that their recommended post-hoc power calculations are “mathematically redundant”; my problem is that their recommended calculations will give wrong answers because they are based on extremely noisy estimates of effect size. To put it in statistical terms, their recommended method has bad frequency properties.

I completely agree with the authors that “knowing the power after the fact” can be useful, both in designing future studies and in interpreting existing results. John Carlin and I discuss this in our paper. But the authors’ recommended procedure of taking a noisy estimate and plugging it into a formula does not give us “the power”; it gives us a very noisy estimate of the power. Not the same thing at all.
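As a rough illustration of what "bad frequency properties" means here, the following minimal simulation sketch repeats a hypothetical study many times and records the plug-in post-hoc power each time. The setup is an assumption made only for this example: the estimate is normally distributed, the true effect sits exactly one standard error from zero, and the test is two-sided at the 5% level. The function names are invented for the sketch.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()            # standard normal distribution
Z_CRIT = STD_NORMAL.inv_cdf(0.975)   # two-sided 5% threshold, about 1.96


def power_at(effect_in_se):
    """Probability that |estimate| exceeds Z_CRIT standard errors when the
    effect is `effect_in_se` standard errors from zero."""
    upper = 1 - STD_NORMAL.cdf(Z_CRIT - effect_in_se)
    lower = STD_NORMAL.cdf(-Z_CRIT - effect_in_se)
    return upper + lower


def simulate_post_hoc_power(true_effect_in_se, n_studies=10_000, seed=1):
    """Return the sorted plug-in power estimates from repeated studies.

    Each simulated study observes z ~ Normal(true effect, 1) and then
    plugs |z| back into the power formula, as the post-hoc recipe does.
    """
    rng = random.Random(seed)
    observed = (rng.gauss(true_effect_in_se, 1.0) for _ in range(n_studies))
    return sorted(power_at(abs(z)) for z in observed)


if __name__ == "__main__":
    true_effect = 1.0  # assumed for illustration, in standard-error units
    estimates = simulate_post_hoc_power(true_effect)
    n = len(estimates)
    print(f"power at the assumed true effect: {power_at(true_effect):.2f}")
    print("post-hoc estimates, 5th / 50th / 95th percentile: "
          f"{estimates[int(0.05 * n)]:.2f} / {estimates[n // 2]:.2f} / "
          f"{estimates[int(0.95 * n)]:.2f}")
```

Under these assumptions the power at the assumed effect is one fixed number, while the plug-in values scatter widely from one simulated study to the next.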

Here’s an example. Suppose you have 200 patients: 100 treated and 100 control, and post-operative survival is 94% for the treated group and 90% for the controls. Then the raw estimated treatment effect is 0.04 with standard error sqrt(0.94*0.06/100 + 0.90*0.10/100) = 0.04. The estimate is just one s.e. away from zero, hence not statistically significant. And the crudely estimated post-hoc power, using the normal distribution, is approximately 16% (the probability of observing an estimate at least 2 standard errors away from zero, conditional on the true parameter value being 1 standard error away from zero). But that’s a noisy, noisy estimate! Consider that effect sizes consistent with these data could be anywhere from -0.04 to +0.12 (roughly), hence absolute effect sizes could be roughly between 0 and 3 standard errors away from zero, corresponding to power being somewhere between 5% (if the true population effect size happened to be zero) and roughly 85% (if the true effect size were three standard errors from zero). That’s what I call noisy.
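The arithmetic in this example can be checked with a short sketch. It uses the same crude normal approximation and the same "2 standard errors" threshold as the text; the helper name `two_sided_power` is made up for the illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def two_sided_power(effect, se, z_crit=2.0):
    """Probability of an estimate at least `z_crit` standard errors from
    zero (in either direction) when the true effect equals `effect`."""
    shift = effect / se
    upper = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower = STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower


# Numbers from the example: 100 patients per arm, 94% vs 90% survival.
n_treated = n_control = 100
p_treated, p_control = 0.94, 0.90

estimate = p_treated - p_control
se = sqrt(p_treated * (1 - p_treated) / n_treated
          + p_control * (1 - p_control) / n_control)

print(f"estimate = {estimate:.3f}, s.e. = {se:.3f}, z = {estimate / se:.2f}")

# Power if the true effect were 0, 1, or 3 standard errors from zero,
# which is roughly the range of effects consistent with the data.
for shift_in_se in (0, 1, 3):
    print(f"true effect {shift_in_se} s.e. from zero: "
          f"power = {two_sided_power(shift_in_se * se, se):.2f}")
```

The point of the loop is that the same study yields very different "power" depending on which plausible effect size is plugged in.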

Here’s an analogy that might help. Suppose someone offers me a shit sandwich. I’m not gonna want to eat it. My problem is not that it’s a sandwich, it’s that it’s filled with shit. Give me a sandwich with something edible inside; then we can talk.

I’m not saying that the approach that Carlin and I recommend—performing design analysis using substantively-based effect size estimates—is trivial to implement. As Bababekov and Chang write in their letter, “it would be difficult to adapt previously reported effect sizes to comparative research involving a surgical innovation that has never been tested.”

Fair enough. It’s not easy, and it requires assumptions. But that’s the way it works: if you want to make a statement about power of a study, you need to make some assumption about effect size. Make your assumption clearly, and go from there. Bababekov and Chang write: “As such, if we want to encourage the reporting of power, then we are obliged to use observed effect size in a post hoc fashion.” No, no, and no. You are not obliged to use a super-noisy estimate. You were allowed to use scientific judgment when performing that power analysis you wrote for your grant proposal, before doing the study, and you’re allowed to use scientific judgment when doing your design analysis, after doing the study.
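For readers who want to see what a design analysis with an assumed effect size might look like, here is a minimal sketch in the spirit of the Gelman and Carlin approach. It is not their code: the function name, the normal approximation, and especially the assumed true effect of 0.02 are illustrative assumptions, and in practice the assumed effect should come from substantive knowledge rather than from the noisy estimate.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def design_analysis(assumed_effect, se, alpha=0.05, n_sims=10_000, seed=1):
    """Power, sign-error rate, and exaggeration ratio for a positive
    `assumed_effect`, given the standard error `se` of the estimate."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = assumed_effect / se
    p_hi = 1 - STD_NORMAL.cdf(z_crit - shift)   # significant, correct sign
    p_lo = STD_NORMAL.cdf(-z_crit - shift)      # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power                       # P(wrong sign | significant)

    # Simulate replications to see how much significant estimates
    # overstate the assumed effect on average.
    rng = random.Random(seed)
    draws = (rng.gauss(assumed_effect, se) for _ in range(n_sims))
    significant = [abs(d) for d in draws if abs(d) > z_crit * se]
    exaggeration = (sum(significant) / len(significant)) / assumed_effect
    return {"power": power, "type_s": type_s, "exaggeration": exaggeration}


if __name__ == "__main__":
    # Hypothetical inputs: s.e. of 0.04 as in the survival example, and an
    # assumed true effect of 0.02 chosen only for illustration.
    result = design_analysis(assumed_effect=0.02, se=0.04)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

The key design choice is that `assumed_effect` is an explicit input that the analyst has to defend, not something read off the data at hand.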

The whole thing is so frustrating.

Look. I can’t get mad at the authors of this article. They’re doing their best, and they have some good points to make. They’re completely right that authors and researchers should not “misinterpret P > 0.05 to mean comparison groups are equivalent or ‘not different.’” This is an important point that’s not well understood; indeed my colleagues and I recently wrote a whole paper on the topic, actually in the context of a surgical example. Statistics is hard. The authors of this paper are surgeons and health policy researchers, not statisticians. I’m a statistician and I don’t know anything about surgery; no reason to expect these two surgeons to know anything about statistics. But, it’s still frustrating.

P.S. After writing the above post a few months ago, I submitted it (without some features such as the “shit sandwich” line) as a letter to the editor of the journal. To its credit, the journal is publishing the letter. So that’s good.

54 thoughts on “How post-hoc power calculation is like a shit sandwich”

Kyle C on January 13, 2019 10:25 AM at 10:25 am said:

Andrew, the internet has developed a clean version of the shit sandwich metaphor, which is, “dinner of tire rims and anthrax.” Origin discussed here. https://washingtonmonthly.com/2011/07/23/tire-rims-and-anthrax/

David Bailey on January 13, 2019 10:54 AM at 10:54 am said:

If the surgeons really want to report the power estimate, would it be okay as long as they explicitly report the uncertainty, e.g. 0.16 (0.05, 0.975) in your toy example? Being explicit should force both authors and readers to recognize when the power estimate is of negligible utility. One could argue that this might actually forestall readers calculating their own power estimate without realizing what a lousy estimate it is.

- Andrew on January 13, 2019 11:04 AM at 11:04 am said:
  
  David:
  
  They could do that. I don’t in general think the 95% confidence interval is such a good expression of uncertainty, especially with a noisy experiment, but, sure, if this would be a way for people to realize how ridiculous such calculations are, then sure.
  
- Degenerate Prior on January 13, 2019 12:12 PM at 12:12 pm said:
  
  If the null value wasn’t rejected at the two-sided alpha level, then the corresponding two-sided 1 – alpha CI must include the null value. The probability of rejecting the null hypothesis assuming the null value is alpha by definition. Necessarily, then, the lower bound of the two-sided CI for the rejection probability is alpha. More interesting, perhaps, would be to determine the upper-tailed CI for the rejection probability, and consider its lower bound to be a non-trivial, conservative estimate of the power, as well as reflecting uncertainty in relation to the point estimate.
  
Carlos Ungil on January 13, 2019 12:23 PM at 12:23 pm said:

> my problem is that their recommended calculations will give wrong answers

Wrong answers to what question?

In principle, a similar thing could be said about p-values or confidence intervals:

“Here’s an example. … The estimate is just one s.e. away from zero, hence not statistically significant: the crudely estimated (one-tailed) p-value, using the normal distribution, is approximately 0.16. But that’s a noisy, noisy estimate! Consider that effect sizes consistent with these data could be anywhere from -0.04 to +0.12 (roughly), hence absolute effect sizes could be roughly between 0 and 3 standard errors away from zero, corresponding to p-value being somewhere between 0.5 (if the true population effect size happened to be zero) and 0.001 (if the true effect size were three standard errors from zero). That’s what I call noisy.”

“Here’s an example. … The estimate is just one s.e. away from zero, hence not statistically significant: the crudely estimated (two-tailed) confidence interval, using the normal distribution, is approximately [-0.04 0.12]. But that’s a noisy, noisy estimate! Consider that effect sizes consistent with these data could be anywhere from -0.04 to +0.12 (roughly), hence absolute effect sizes could be roughly between 0 and 3 standard errors away from zero, corresponding to the CI being somewhere between [-0.08 0.08] (if the true population effect size happened to be zero) and [0.04 0.20] (if the true effect size were three standard errors from zero). That’s what I call noisy.”

P-values and confidence intervals give the “wrong answer” to the question many people have in mind, but they can still be useful and worth reporting (as long as it’s clear what “question” is being answered with those statistics). I don’t say that post-hoc power calculations based on the estimated effect size are useful; I have no idea. Maybe this “redundant” statistic can give additional insight to some people, just as the use of both p-values and confidence intervals can make the picture clearer.

Dave on January 13, 2019 2:14 PM at 2:14 pm said:

Seems like they should plot the power curve over a reasonable range of estimates so the readers could make their own judgements. If the reader is worried about noisy estimates, he or she can get a feel for how quickly the power diminishes away from the estimate.
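A power curve of this kind might be tabulated as in the sketch below. It assumes the crude normal approximation and the standard error of 0.04 from the post's survival example; the grid of effect sizes and the function name are chosen only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_curve(effects, se, alpha=0.05):
    """Return (effect, power) pairs for a two-sided test at level `alpha`,
    treating the estimate as normal with standard error `se`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    curve = []
    for effect in effects:
        shift = effect / se
        power = (1 - STD_NORMAL.cdf(z_crit - shift)
                 + STD_NORMAL.cdf(-z_crit - shift))
        curve.append((effect, power))
    return curve


if __name__ == "__main__":
    # Effect sizes from -0.04 to +0.12 in steps of 0.02 (illustrative grid).
    effects = [round(-0.04 + 0.02 * i, 2) for i in range(9)]
    for effect, power in power_curve(effects, se=0.04):
        bar = "#" * round(power * 40)  # crude text plot of the curve
        print(f"{effect:+.2f}  {power:.2f}  {bar}")
```

Each row answers "what is the power to detect an effect of this size", which leaves the choice of a relevant effect size to the reader.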

Phil on January 13, 2019 2:14 PM at 2:14 pm said:

Andrew: my problem is that their recommended calculations will give wrong answers

Carlos: Wrong answers to what question?

Their recommended calculation will give wrong answers to the questions “what is the power of the study?”, and “approximately what is the power of the study?”

- Carlos Ungil on January 13, 2019 2:46 PM at 2:46 pm said:
  
  “what is the power of the study?”
  
- Carlos Ungil on January 13, 2019 2:50 PM at 2:50 pm said:
  
  There is no correct answer to the question “what is the power of the study?”
  
  The question doesn’t make sense if no alternative hypothesis is specified. In the same way that the p-value calculations depend on a model and a null hypothesis, power calculations depend on a model, a null hypothesis and an alternative hypothesis.
  
  - Carlos Ungil on January 13, 2019 2:57 PM at 2:57 pm said:
    
    To be clear, p-value calculations depend also on the data. Power calculations depend only on the model and the null and alternative hypotheses and not on the data… except that for “post-hoc” we use the data to define the alternative hypothesis. (I’m not defending that this kind of calculation makes much sense, but I don’t think that saying it’s a noisy estimate of the “power of the study” makes much sense either…)
    
    - Phil on January 15, 2019 12:02 PM at 12:02 pm said:
      
      Carlos,
      Yes, ‘statistical power’ depends on the model and the hypotheses…including the hypothesized effect size. The proposal Andrew is criticizing would take the hypothesized effect size to be the estimate that comes out of the data analysis. Since that estimate is very noisy, the estimated effect size will, generally, differ greatly from the true effect size. Thus the statistical power of the study will, in general, be badly mis-estimated.
      
      I think you understand this, so I can’t figure out what your complaint is. I know you are not a pedant, quibbling about word choice. What part of the paragraph above do you disagree with? Or maybe none of it, but you want something explicit about the fact that the ‘statistical power’ depends on the significance criterion, and other such caveats?
    - Carlos Ungil on January 15, 2019 12:39 PM at 12:39 pm said:
      
      > the statistical power of the study will, in general, be badly mis-estimated.
      
      What would be a better “estimate” of “the statistical power of the study”? Do you think that the “true” “statistical power of the study” is a well-defined and potentially useful concept?
      
      The statistical power is a function of the hypothesized effect size. It’s not a statistic (a function of the data). I agree that using the estimated effect size instead of an actual hypothesized effect size is misguided. But I don’t think that arguing that there is a “true” statistical power (corresponding to the unknown “true” effect instead of an actual hypothesized effect) that is being mis-estimated by the procedure does much to clear up the confusion.
      
      For what it’s worth, I like Andrew Althouse’s explanation much better.
    - Daniel Lakeland on January 15, 2019 2:44 PM at 2:44 pm said:
      
      I agree with Carlos that I don’t think it’s a good idea to propagate the idea that there exists a number x such that x is the true power of a given study.
      
      This is like saying there exists a temperature T such that T is the temperature of the air in Los Angeles in June. You can maybe argue correctly that there is a T such that T is the temperature of the atmosphere at your backyard thermometer at 9:33:00 AM on June 13th 2018, but “the temperature in June” is a function of both time and place. Even if you just say “at the thermometer in my backyard”, it varies between say 60F and 115F throughout the day, and it’s a mistake to claim that there is a correct number “for all of June”.
      
      The “power of a study” is a function of another number you might call the “detection threshold”. If you would like to detect if there’s an effect of x=3.3 then you can put that effect in and find out what the power of your study is to detect it. If you would like to detect an effect of 18.6 then you put that number in and figure out the power to detect *it*. x=0.22 gives yet another power. It’s a category error to think that “the power” is a number; “the power… to detect an effect of size x” is a number, and x is a free variable.
    - Phil on January 15, 2019 4:30 PM at 4:30 pm said:
      
      Maybe it would help to have a specific example in mind. A thought experiment.
      
      I have a drug that may or may not cure a specific disease some fraction of the time. Let’s suppose that if we gave this drug to everyone who suffers from the disease, it would cure c% of them. Let’s suppose the spontaneous cure rate is 5% in a month. Let’s also suppose I have the ability to take a simple random sample of all of the people who suffer from the disease. Finally, suppose people are independent: if person A is cured of the disease, this does not affect person B. I take a sample of size N = 20 and give the drug to all of them. I take another sample of size M = 40 and give them a placebo.
      
      The probability that I will conclude that p > 0 under these circumstances depends on c, and also on the p-value I set in order to ‘conclude’ that c > 0. This probability is called the ‘statistical power’ of the study.
      
      Daniel, I think your objection to saying there is a ‘true’ answer to this question is based on practical issues: you can’t really take a simple random sample of people, the probability of curing people isn’t really independent, etc., etc. To me, that’s like saying that there’s no ‘true’ answer to a Newtonian physics question because we can’t really make a frictionless plane, can’t really put two objects in orbit around each other without other gravitational fields being present, and so on. Abstractions are necessary to solve any problem — the real world is just wavefunctions all the way down.
      
      I, too, object to the ‘statistical power’ paradigm, but not because there’s no ‘correct answer’; I think that in a mathematical sense there is a correct answer. My big problem is that I think the focus should be ‘what is the magnitude of c’, not ‘is c > 0, and how sure am I?’.
      
      So I don’t want to get cornered into defending the concept of ‘statistical power’ for this kind of study. What I will say is that, EVEN IF one accepts that statistical power is a useful concept, and agrees that there is a right answer to the statistical power for a given study, one should still recognize that you will not get the right answer for ‘what is the statistical power of this study’ if one assumes the true value of c is the raw estimated value of c that comes out of the study, if the study is small and therefore noisy.
      
      It’s hard for me to believe that either Carlos or Daniel disagrees with this, really. So I’m not sure what we are arguing about.
    - Carlos Ungil on January 15, 2019 4:55 PM at 4:55 pm said:
      
      > The probability that I will conclude that p > 0 [N.B. I think there is a typo and you mean c>0] under these circumstances depends on c, and also on the p-value I set in order to ‘conclude’ that c > 0. This probability is called the ‘statistical power’ of the study.
      
      By whom?
      
      The standard definition of the power of a test is that it is the probability that the test rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true.
      
      In your example, the null hypothesis is H0: “c=0” and there is an infinity of “statistical powers” of the study.
      
      The power of the test for the alternative hypothesis H1: “c=1” is Pr(reject H0 | c=1)
      
      The power of the test for the alternative hypothesis H1: “c=0.666” is Pr(reject H0 | c=0.666)
      
      Etc. None of them depends on the true value of c and none of them is “true”.
    - pnprice on January 15, 2019 5:12 PM at 5:12 pm said:
      
      Carlos,
      You disagree that the statistical power of a test depends on the magnitude of the effect you are trying to quantify? Well, I take it back, we do have a fundamental disagreement that is the source of our larger disagreement. (And yes, I did mean c > 0, thank you).
      
      If you think a study of a given size and given significance criterion has the same power whether the effect is large or small, then I think you’re wrong. If you’re right and the statistical power is independent of the effect size, then of course it doesn’t matter what estimate of effect size you use when calculating the statistical power of the study, and if that’s the case then your position makes sense. But I don’t think that is the case.
      
      Phil
    - Daniel Lakeland on January 15, 2019 7:20 PM at 7:20 pm said:
      
      No Phil, he thinks that the power of the study is a different number for every different possible *size of the effect*.
      
      Rather than saying “the power of the study” we should say “the power of the study to discern an effect of size c=c*” and there are any number of c*s we could plug in.
      
      You might argue “well, all we care about is the power of the study to discern an effect of size c equal to the actual size of c” but since the actual size of c is unknown and is *certainly not* the sample average or whatever… we’re left with either a Bayesian statement about the “probability of the power of the study given our prior over c” or we’re left with a function of the hypothesized c value. We aren’t left with a calculable number.
    - Phil on January 16, 2019 1:41 AM at 1:41 am said:
      
      Daniel,
      For crying out loud, you and Carlos seem to be in extremely strong _agreement_ with the point Andrew is making and that I am trying to make. The statistical power of the study is only defined if you specify an effect size. The effect size is unknown. If you assume it is equal to the estimated effect size that you get from a noisy study then you are almost certain to get it wrong. Thus your post hoc estimate of the statistical power is almost certain to be wrong. This is all I am saying! If neither you nor Carlos disagree with it then wtf are we arguing about?
      
      But Carlos says that the probability that the test rejects the null hypothesis does not depend on the true value of c, so I think he does disagree.
      
      But I dunno and I no longer care. I _think_ we all understand the situation and are simply caught up in an argument about language. I’m dropping it.
    - Carlos Ungil on January 16, 2019 3:35 AM at 3:35 am said:
      
      > The statistical power of the study is only defined if you specify an effect size.
      
      Agreed (but noting that you can specify any effect size to do the calculation for that alternative; it’s not about specifying the “true” effect size).
      
      > The effect size is unknown. If you assume it is equal to the estimated effect size that you get from a noisy study then you are almost certain to get it wrong.
      
      Agreed (but irrelevant).
      
      > Thus your post hoc estimate of the statistical power is almost certain to be wrong.
      
      No agreement here: I don’t know what the “post hoc estimate of the statistical power” means. The statistical power is not an unknown quantity that has to be estimated.
      
      The statistical power of a test (as originally defined, at least) is a mathematical property of the mathematical model of the test, which can be calculated for any pair of parameter values mu_null and mu_alternative. It’s unrelated to the “true” value of the parameter. We can calculate the statistical power of a test for the difference in mean height between martians and venusians.
    - Daniel Lakeland on January 16, 2019 9:33 AM at 9:33 am said:
      
      The problem, Phil, is that there is no such thing as “the power of the study”, and so claiming you will incorrectly calculate the power of the study is already propagating a misunderstanding.
      
      When you plug in the point estimate of the effect size and calculate the power, you get a correct answer to an irrelevant question (what is the power of the study to detect the estimated effect size), not the wrong answer to a relevant question (a relevant question would be something like: what is the power of the study to detect an effect of the size most patients think would make it worth enduring the downsides of the procedure).
      
      That the researchers are asking the wrong question is the point we need to get across, not that they are somehow doing the calculation wrong.
    - Phil on January 16, 2019 12:19 PM at 12:19 pm said:
      
      OK, as I suspected this is all about language, not substance.
      
      If we agree that there is a concept of the ‘true’ effect size — if everybody got the drug, what fraction of them would be cured, or similar — then there is a correct answer to the question “What is the statistical power of the study to detect the effect of the drug, given the true effect size and a statistical significance criterion.” When I say “what is the right answer for the statistical power of the study” that is what I am talking about.
      
      You guys (Carlos and Daniel) point out that you can do a statistical power calculation for _any_ effect size, including effect sizes that are wildly wrong — meaning, they differ greatly from the true effect size. I certainly agree with that.
      
      I think our entire disagreement comes down to terminology. I would say that if you use the wrong value for the effect size — by which I mean one that differs greatly from the true value — then you get the wrong answer. You guys say no, you are getting the right answer, it just happens to be the right answer to the wrong question. To me, that’s like saying “I’m trying to calculate how long it will take this object to fall to the ground, and I’m assuming g = 1 m/s^2. Don’t try to tell me I’m getting the wrong answer, since I can get a right answer for any value of g that I choose.” Yes, you are doing the right calculation but you have an incorrect input value so you are getting the wrong answer. Similarly, you can do a statistical power calculation for any effect size that you want, but if you use the wrong effect size you will get the wrong answer.
      
      Since you are both smart people there must be merit to your view on this. Perhaps more merit than there is to my view! I am just relieved to understand the source of our disagreement. At least it doesn’t come down to what your definition of ‘is’ is.
    - Carlos Ungil on January 16, 2019 1:12 PM at 1:12 pm said:
      
      Phil, does the following dialogue seem reasonable to you?
      
      – Doctor, how powerful is that blood test? What is the probability that the test will identify the tumor as cancerous?
      
      – Well, if the tumor is indeed malignant and in late stage the test will detect the cancer with 95% probability. If it’s early stage, it’s 50/50. And there is also a 5% probability that it comes out as malignant when it’s actually benign.
      
      – Come on, doctor, if you use the wrong hypothesis you get the wrong answer. The question here is what’s the probability that the tumor is malignant in my case. That’s what I call the power of the test!
    - Carlos Ungil on January 16, 2019 1:18 PM at 1:18 pm said:
      
      Minor correction:
      
      “… The question here is what’s the probability that the tumor is __ identified as __ malignant in my case. That’s what I call the power of the test!”
    - Phil on January 16, 2019 2:19 PM at 2:19 pm said:
      
      Carlos,
      I do see your point, and it is similar to a point I have often made myself in regard to the concept of ‘unbiased’ estimates: what is the point of defining ‘bias’ relative to a quantity that I do not know? Surely it only makes sense to condition on the information that is actually available?
      
      And for an, um, pre-hoc power calculation, I completely agree with you. You propose to do a study, someone asks what statistical power you will have, I don’t think you can do anything better than say “If the true effect is c, we’ll have such-and-such power, and if it’s 2*c then we’ll have so-and-so.” I have no problem with that at all.
      
      But the whole point of this post (meaning Andrew’s post) is that these researchers don’t leave it there. They now go off, collect a bunch of really noisy data, estimate an effect size, and say (or imply) “now that we know approximately the size of the true effect, we can do a post-hoc calculation of the power of the study, and here’s the answer.” The only point of doing this is to make the claim that the post-hoc estimate of the power of the study is relevant to the real world. They’re not saying, as you are, “there’s a whole range of possible answers to ‘what is the statistical power of the study’, depending on the effect size”; they’re saying (or at least implying) “now that we know the effect size, we know the statistical power of the study and here it is.”
    - Carlos Ungil on January 16, 2019 3:19 PM at 3:19 pm said:
      
      > The only point of doing this is to make the claim that the post-hoc estimate of the power of the study is relevant to the real world.
      
      Saying “post-hoc estimate of the power of the study” is conceptually wrong (at least in the framework of N-P hypothesis testing).
      
      > they’re saying (or at least implying) “now that we know the effect size, we know the statistical power of the study and here it is.”
      
      You’re saying, or at least implying, that there is a thing called “the statistical power of the study” that can be estimated (precisely if the true effect is estimated precisely) and that is somehow relevant or useful.
      
      I don’t know why you are so attached to that idea or where you got it from. I don’t think it’s just a question of terminology, but I guess it depends on how we understand the term “terminology” :-)
    - Daniel Lakeland on January 16, 2019 5:23 PM at 5:23 pm said:
      
      The only logical resolution to this question “what is the power of the study to detect an effect whose size is the true size” is to provide a posterior distribution over the true size, then provide a probability distribution over the power based on that posterior over the size.
      
      Then you can say “there’s a 50% probability that the power of the test is greater than 50%” and everyone’s eyes will roll back into their heads just before their brain explodes.
      
      Seriously though, Phil, the reason to care about this is that if you tell people “plugging in the observed effect size is the wrong way to calculate the power of the test” they’ll go out and try to “find the right way” to calculate it and come up with some other equally wrong-headed answer.
      
      Whereas saying “the power of the test is a function of an assumed effect size, and since even after the study there is a range of plausible effect sizes there is equally a range of plausible powers, so there’s no one answer to the question ‘what’s the power’ since it’s really not a single number” stops that wrong-headed follow-up.
    - Phil on January 16, 2019 6:21 PM at 6:21 pm said:
      
      Carlos, you say:
      ====
      You [Phil] are saying, or at least implying, that there is a thing called “the statistical power of the study” that can be estimated (precisely if the true effect is estimated precisely) and that is somehow relevant or useful.
      
      I don’t know why you are so attached to that idea or where you got it from. I don’t think it’s just a question of terminology, but I guess it depends on how we understand the term “terminology” :-)
      ====
      
      I feel like I am in a house of mirrors or something. WTF. Look, here’s a quote from Andrew’s post that this whole comment string is about: “An article recently published in the Annals of Surgery states: “as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study”.”
      
      So there you have it. The authors of that piece in Annals of Surgery believe there is such a thing as 'disclosing' the power of the study with the sample size and effect size observed in that study. What's more, I agree with them! (I thought you did too, indeed you said it at 4:55 pm yesterday): Once you choose a sample size and an effect size, you can quantify the probability that a 'statistically significant' result will be obtained, i.e. you can determine 'the statistical power of the study.' As you know, probably at least as well as I, the recipe for doing the calculation is pretty simple. Although simple, it is possible to do the calculation incorrectly. I hope we all agree that if you mess up the calculation you will get the wrong answer for the statistical power of the study. The question is, what happens if you do the calculation _correctly_, but you use an effect size that is wrong in the sense that it is far from the true value of the effect size.
      
      I claim that if you use the wrong value of the effect size, you get the wrong answer for the statistical power of the study even if you do the calculation correctly. In contrast, you claim….I am not sure. You seem to strongly object to my claim that if you use the wrong effect size you get the wrong answer. Does this mean you think that if you use the wrong effect size you get the right answer? I feel certain that is not the case, I'm sure you agree that if you use the wrong effect size you get the wrong answer for the statistical power of the study. But, in an attempt to clarify things, you now say there is no such thing as 'the statistical power of the study.' I find this confusing because just yesterday I thought we were in agreement that if you give an effect size and a sample size and a significance criterion it is possible to determine the 'statistical power of the study.' It is certainly common practice to do so: both the concept and the term 'power' have been used this way for a long time.
      
      I believe there is a calculation, commonly called 'the power of the study' or 'the statistical power of the study', that can be determined through a standard calculation. I believe it has a 'right answer' for any given study, to wit, the answer you get if you plug in the correct values of the input parameters (which include effect size, sample size, and significance criterion). I believe if you plug in incorrect values of one or more of those input parameters — notably including effect size — you will get the wrong answer for statistical power.
      
      You disagree with something in the paragraph immediately above. I don't know what that could possibly be, but I have to say I am so weary of this discussion that I don't much care. Believe what you want to believe, including that I'm an idiot; I won't try to convince you otherwise.
      
      Daniel, I agree that you cannot know the true statistical power of the study unless you know the true effect size. (I wonder if Carlos would agree with that statement? Never mind, don't tell me.) This is, indeed, one of the reasons the post hoc calculation of statistical power based on a noisy estimate of effect size is a stupid thing to do. If you have a really large study so you get a really good estimate of the effect size then you can get a good estimate of the statistical power, enabling you to say things like "hey, it's a good thing we didn't fund that study with only 100 cases because we now know that such a study would have been seriously under-powered". Don't forget, I am not arguing in favor of post hoc power calculations, I am arguing _against_ them!
    - Daniel Lakeland on January 16, 2019 6:39 PM at 6:39 pm said:
      
      Phil, and Carlos, I’m pretty sure you will agree that if we describe some parameters of the study we can answer the question “what is the statistical power of this study to detect an effect size of x=3.14”, and we could derive the correct formula. Let’s call it Pow(study,x=3.14)
      
      Everyone agrees about that. Now, suppose that in truth if you did a lot of experiments you’d find that your intervention had an effect of a 3.14 unit improvement in outcome measure. I think Phil would say that “the power you calculated at x=3.14 is the power of the study”, in other words he defines Pow(study) as Pow(study,x close to 3.14 the real value) and Carlos would say “the power you calculated at x=3.14 is the power of the study when the effect really is 3.14” in other words “there is no such thing as Pow(study) only Pow(study,x=3.14)”.
      
      I don’t think there’s a lot of disagreement here except that Phil says that x=3.14 is special because it “really is” the effect size so he *defines* a new term “the power of the study” and Carlos doesn’t find that definition helpful.
      
      What I think is the issue is that if you calculate the power for x=4.25 Carlos would say “This really is the power of the study when x=4.25” and Phil would say “That’s the wrong power because x isn’t 4.25”, in other words Phil agrees Pow(study,x=4.25) is a correctly calculated number, but it’s not equal to Pow(study) as he defined it, and Carlos says “hey there is no such thing as Pow(study)”
      
      I tend to side with Carlos here, because I think it’s a useful thing to do calculations like “the minimum x anyone would pay to achieve is x=41 and so the power of the study to detect an x=41, Pow(study,x=41) is 99.9994% and the fact that we didn’t detect it is strong evidence that this intervention isn’t worth pursuing” but if you preference “the true power” somehow a counter argument would be “but that’s not the actual power of the study the actual power of the study is what you get when you plug in something like 3.14 and so the power to detect that was only 50% and so you need to get us $100M more to continue our study” and I would say somehow preferencing “the true power” rather than “the power to detect something specific of interest to us” is a mistake that will, with misinterpretation both intentional and unintentional, lead to scientific resource waste and arguments about whether people should do interventions, just like calling p less than 0.05 “significant” will lead to Dentists telling you “but there’s a significant decrease in cavities if you do our expensive procedure” meaning after testing 10000 people they found that the risk of cavities went from 2.2 percent to 2.1 percent over 5 years with p = 0.02.
      
      Poor patient pays money, dentists get rich, side effects are painful, consumer winds up worse off than if they really understood what kind of flimflam they were being sold… etc
      
      Same thing with power.
    - Carlos Ungil on January 16, 2019 7:36 PM at 7:36 pm said:
      
      I don’t know how can I make my position more clear, but clearly I’m not succeeding (at least Daniel understands what I’m trying to say).
      
      As I said yesterday at 4:55pm: “The standard definition of the power of a test is that it is the probability that the test rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true”
      
      To calculate the power you have to give an alternative hypothesis, in addition to specifying the rest of the model and the test. The power is calculated for *this* test, using *this* model, *this* sample size, *this* null hypothesis, *this* value of alpha, and, crucially, *this* alternative hypothesis. Having an alternative hypothesis, which is a mathematical construct unrelated to the real world, you can get the sampling distribution of the data conditional on the alternative hypothesis and therefore for the rejection of the test.
      
      The power is a calculation that depends on an arbitrarily defined alternative, there is not such a thing as “the” power of the study or the “true” power of the study where the calculation is done with the “true” alternative. Yes, people report “post-hoc” or “observed” power setting the alternative to the observed value of the parameter. I don’t think it makes much sense, at least they put some adjective so maybe they understand it’s not really the same thing. Claiming that they are estimating (badly or not) the “true” power of the study doesn’t make sense either.
      
      > I claim that if you use the wrong value of the effect size, you get the wrong answer for the statistical power of the study even if you do the calculation correctly.
      
      There is no way to use a “wrong value of the effect size” in the power calculation, in the same way that there is no “wrong value of the null hypothesis”. The power is calculated for, say mu_null = 0 and mu_alternative = 1 (i.e. what you call effect size equal to 1) but neither of these values is true or false. They are just the inputs to the calculation. The interest of the calculation is what it says about the sampling distribution of the test decisions, conditional on those hypothetical values.
      
      > In contrast, you claim….I am not sure. You seem to strongly object to my claim that if you use the wrong effect size you get the wrong answer. Does this mean you think that if you use the wrong effect size you get the right answer?
      
      There is no wrong effect size. There is no wrong answer. Every effect size gives the right answer to the question that power is designed to answer, which is about the sampling properties of the test under some hypothetical conditions. The power construct doesn’t try to say anything at all about the real world. It doesn’t need data or “true” values of the parameters for the result to be “right”.
      
      > I feel certain that is not the case, I’m sure you agree that if you use the wrong effect size you get the wrong answer for the statistical power of the study.
      
      Well, I don’t.
      
      > But, in an attempt to clarify things, you now say there is no such thing as ‘the statistical power of the study.’
      
      As I said yesterday at 4:55: “In your example, the null hypothesis is H0:”c=0″ and there is an infinity of “statistical powers” of the study. […] None of them depends on the true value of c and none of them is “true”.”
      
      > I find this confusing because just yesterday I thought we were in agreement that if you give an effect size and a sample size and a significance criterion it is possible to determine the ‘statistical power of the study.’
      
      Yes, if you give an effect size S1, etc. you determine the statistical power of the study to reject the null hypothesis of zero effect when the actual effect size S is larger than the hypothesized effect size S1.
      
      And if you give another effect size S2, etc. you determine the statistical power of the study to reject the null hypothesis of zero effect when the actual effect size S is larger than the hypothesized effect size S2.
      
      Both of the previous calculations are correct. The value of the actual effect size S is irrelevant for the power calculations.
      
      If you knew the true actual effect S you could give another effect size S3 = S and determine the statistical power of the study to reject the null hypothesis of zero effect when the actual effect S is larger than the hypothesized effect size S3 = S.
      
      But I don’t see what’s the interest in doing so (of course if you knew the true actual effect S you wouldn’t be doing any tests anyway) and I strongly object to saying that this is THE power of the study and somehow those calculated above for S1 and S2 are wrong.
    - Jeff Walker on January 17, 2019 6:07 PM at 6:07 pm said:
      
      Carlos — your comment starting “I don’t know how can I make my position more clear, …” is extremely clear and brilliant. Thanks.
    - Phil on January 16, 2019 7:43 PM at 7:43 pm said:
      
      Daniel,
      If you (or Carlos) disagree that the answer you get when you plug in the correct values of the input parameters is the “right answer” then there’s not much to talk about, we have identified the source of our dispute!
      Whether or not it’s useful, I don’t know and make no claim. What I do claim is that, whatever the merit of preferencing the correct parameter value as some sort of hoity-toity ‘right’ answer just because it happens to be correct, it is even MORE meritless to preference a parameter value that is almost sure to be _incorrect_. Doing post hoc power analysis under the assumption that the true effect size is equal to the observed effect size makes no sense when the observed effect size is subject to a lot of error. I know you know that. I know Carlos knows that. I don’t know why we have just wasted three pages talking about it.
    - Daniel Lakeland on January 16, 2019 9:47 PM at 9:47 pm said:
      
      Actually Phil, I disagree with this:
      “whatever the merit of preferencing the correct parameter value as some sort of hoity-toity ‘right’ answer just because it happens to be correct, it is even MORE meritless to preference a parameter value that is almost sure to be _incorrect_”
      
      Let’s all agree that we personally are more Bayesian and wouldn’t use power at all but rather a more Bayesian thing… But *if we assume we’re in a situation where power is being discussed* then the *main reason* to use Power is to make decisions about whether a study should go forward, or be continued, or further funded, or followed up on, etc. And the question is usually *whether there’s some effect size of practical interest that we could detect*. So plugging in the *practically important* effect size is usually the right thing to do, and it really doesn’t matter what our best guess at the actual effect size is. If we have very high power to reject the practical effect size so far, and we did reject it, then we shouldn’t continue the study, regardless of what the *real* effect size is.
      
      So I think plugging the estimate of the actual effect size is almost always the wrong thing to do, hence we just spent 3 pages talking about it.
    - Phil on January 17, 2019 12:55 PM at 12:55 pm said:
      
      As I said yesterday, “If you disagree that the answer you get when you plug in the correct values of the input parameters is the “right answer” then there’s not much to talk about, we have identified the source of our dispute!” And there it is.
    - Daniel Lakeland on January 17, 2019 4:15 PM at 4:15 pm said:
      
      Yes, you say Pow(Study, x = wrong number) is a “wrong estimate of the power Pow(Study, x = right number)” whereas I say Pow(Study, x = how big would a useful treatment be) is the only thing we should really be calculating… and how big a useful treatment is is a completely independent idea from how big your actual treatment effect really is.
      
      Either way we both agree that putting in the sample average isn’t much help.
      
      I think it’s wrong headed to try to “get the right number to put in” so you could “get the true power” because what matters isn’t whether you detect whatever the actual size of the effect is (remember effect could be 0.002 on a scale where everyone only cares about say effects of size 1). What matters is whether we can detect effects that we care about (like size 1 in my hypothetical).
Deborah G. Mayo on January 13, 2019 3:20 PM at 3:20 pm said:

There’s a confusion between ordinary power analysis post data, and performing that same power analysis only substituting the observed effect size. Because power-analytic reasoning no longer serves its purpose in the latter case, I call it “shpower”. Shpower computes the probability of rejecting the test hypothesis (at the specified significance level) under the assumption that the population discrepancy = the observed effect size. I discuss “howlers of shpower analysis” in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (pp. 354-5), with an example of a 1-sided test of the mean of a Normal distribution. In the case of a non-significant or negative result–the very case for which ordinary power analysis generally arises– the shpower ≤ .5. Thus it doesn’t do the work of power analysis. So I agree with Gelman in criticizing the substitution of the observed effect size. However, even where one doesn’t know from background the pop effect size, one can do an ordinary power analysis for several alternative discrepancies. The non-significant result is evidence against the discrepancies that the test had high power to detect. However, it’s more informative to use the actual non-significant result, d(x), rather than the largest difference that just misses the cut-off for rejection: Pr(d(X) ≥ d(x), mu’). To the extent that this is high, d(x) indicates mu ≤ mu’ (supposing that model assumptions hold).
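
The shpower ≤ .5 claim can be seen in two lines. This is only a sketch, assuming a one-sided z-test of mu ≤ mu_0 with known sigma; the symbols z_obs, z_alpha and Φ (the standard Normal CDF) are notation introduced here for illustration and are not taken from the book.

```latex
\begin{align*}
z_{\mathrm{obs}} &= \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}},
  \qquad \text{reject when } d(X) \ge z_\alpha, \\
\text{shpower} &= \Pr\bigl(d(X) \ge z_\alpha ;\ \mu = \bar{x}\bigr)
  = \Pr\bigl(Z + z_{\mathrm{obs}} \ge z_\alpha\bigr)
  = 1 - \Phi\bigl(z_\alpha - z_{\mathrm{obs}}\bigr),
  \qquad Z \sim N(0,1).
\end{align*}
```

A non-significant result means z_obs < z_alpha, so the argument of Φ is positive, Φ exceeds 1/2, and shpower falls below .5, with equality only when the result sits exactly on the cut-off.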

Thanatos Savehn on January 13, 2019 3:25 PM at 3:25 pm said:

When our hypotheses once tested prove disappointing we can blame inscrutable Nature. When our assumptions once stated prove disappointing we blame ourselves.

Andrew Althouse on January 14, 2019 7:51 AM at 7:51 am said:

I’m glad to hear that you replied again, Andrew. I was unsure that you would see their re-response, so I also went to the trouble of responding to their response because I, too, was saddened / enraged by the absurdity of: “We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect…” among other things. Since you have been kind enough to post the full text of your response, I will also post what I sent to the journal – as yet, I have not heard whether they intend to publish this, but would generally feel that your response “supersedes” this since you were part of the original exchange:

———————————————

This letter is meant to continue the discussion started by Bababekov et al (1), responded to by Gelman (2), Plate et al (3), with a subsequent response from Bababekov and Chang (4) regarding the performance of post-hoc power calculations and their utility in study interpretation.

Bababekov and Chang are to be commended for their concern about Type 2 error and are onto something in that regard, but in their most recent response to prior letters, remain steadfastly incorrect about the way in which this should be carried out. In several places the authors seem to confuse the utility of power analyses in general with the specific arguments against their proposal of performing post-hoc power calculations using the observed effect size. For example:

“We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. As such, we believe that post hoc power is not wrong, but instead a necessary first assistant in interpreting results.”

The authors still appear to misunderstand a stone-cold reality: if the study result was not statistically significant at the observed sample size and effect size, a post hoc power calculation based on the observed effect size will always make the study appear underpowered. This will be true 100 percent of the time.

Perhaps the prior letters were too theoretical or abstract, and an extreme example is required to illustrate this point more clearly. Consider a parallel-group randomized trial with a very large sample size of 100,000 participants (exactly 50,000 patients per treatment arm). Suppose that 25,000 participants (exactly 50%) experience the primary outcome in the arm receiving standard of care. Suppose that 24,995 participants (exactly 49.99%) experience the primary outcome in the arm receiving experimental therapy. This primary comparison will return a non-significant result (p=0.98) suggesting that there is no difference between the treatment groups. Before looking ahead, we can all agree that 100,000 participants with an event rate of 50% seems like it would be very well-powered, yes?

We perform a post-hoc power calculation as prescribed, using the study sample size and observed effect size and see that the study has only 5% statistical power. Ah-ha! Maybe the null result isn’t evidence of no effect – after all, the study was simply “underpowered” to detect an effect…or was it?

Was this study actually “underpowered” to detect a clinically relevant effect? Of course not! A sample size of 100,000 participants would have 88.5% power if the true effect reduced the event rate from 50% with standard of care to 49% with experimental therapy (a mere 1% absolute reduction).
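
The numbers in this example can be roughly reproduced with a normal-approximation z-test for two proportions. This is a minimal sketch and not necessarily the calculation used in the letter: the function names `two_prop_p_value` and `two_prop_power` are hypothetical, and a different test (for example one with a continuity correction) would shift the p-value slightly.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test (assumed test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def two_prop_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test if the true rates were p1 and p2."""
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    shift = abs(p1 - p2) / se
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region under the alternative.
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


n = 50_000
p_value = two_prop_p_value(25_000, n, 24_995, n)
observed_power = two_prop_power(0.50, 24_995 / n, n)  # plugs in the observed effect
relevant_power = two_prop_power(0.50, 0.49, n)        # plugs in a 1-point reduction

print(f"p-value:                         {p_value:.3f}")
print(f"power at the observed effect:    {observed_power:.3f}")
print(f"power at a 50% -> 49% reduction: {relevant_power:.3f}")
```

The only thing that changes between the last two calls is the assumed effect size, which is the whole point: the same trial looks hopeless or well-powered depending on what is plugged in.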

The problem is the authors’ mistaken (and now doubled-down-upon) belief that using the observed effect from the study gives the information the reader needs to determine whether the study was underpowered. By using the observed effect size, the authors’ proposal would result in literally every non-significant result being accompanied with a post-hoc power calculation showing that the study was “underpowered” to detect an effect. In such a world, there are no negative results – only underpowered studies!

There is a more sensible alternative, of course. Instead of conducting a post-hoc power calculation based on the observed effect size, the authors could show what the power calculation would have looked like with their chosen sample size and some small-to-moderate effect size that would be notable enough to adopt the therapy in clinical practice (essentially, the smallest effect that we would not want to miss).

How would that look in practice? Let’s now consider a trial with 1,000 patients (exactly 500 per treatment arm). In the standard of care, suppose that 250 participants (exactly 50%) experience the primary outcome. In the experimental arm, suppose that 249 participants (49.8%) experience the primary outcome. Again, this comparison will return a non-significant result (p=0.95) suggesting that there is no difference between the treatment groups. Again, a post-hoc power calculation filling in the observed effect size will show that the study had extremely low statistical power (about 5%). As noted above, this information is virtually useless; a more useful alternative would be computing the statistical power that the sample size would have for some clinically relevant effect size. Suppose that we would want to adopt this therapy if it reduced the true event rate from 50% to 40%. A hypothetical trial with 1,000 participants (500 per arm) would have 89% power to detect such a difference, and therefore we may conclude that this initial trial was not “underpowered” to detect a clinically relevant effect – the trial was in fact reasonably well-powered to detect such an effect.
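
One way to present this is as a small table of power against a range of assumed event rates in the experimental arm, rather than a single number. The sketch below assumes a two-sided normal-approximation z-test at alpha = 0.05; the function name `power_two_proportions` and the list of candidate rates are illustrative choices, not part of the original letter.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for assumed true event rates."""
    std_normal = NormalDist()
    se = sqrt((p_control * (1 - p_control) + p_treated * (1 - p_treated)) / n_per_arm)
    shift = abs(p_control - p_treated) / se
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


N_PER_ARM = 500
CONTROL_RATE = 0.50

# 0.498 is the observed rate; the others are hypothetical effects of interest.
for treated_rate in (0.498, 0.48, 0.45, 0.42, 0.40):
    power = power_two_proportions(CONTROL_RATE, treated_rate, N_PER_ARM)
    print(f"assumed experimental-arm rate {treated_rate:.3f}: power = {power:.2f}")
```

Read this way, the observed-effect row is just one arbitrary point on the curve, and the row for the smallest clinically relevant effect is the one that says whether the trial was underpowered.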

There is one other comment from the prior letter which merits a response:

“Lastly, a priori power calculations are not routinely feasible in surgical science primarily because there is no accepted standard for effect size for most outcomes in surgery.”

Frankly, this is a ridiculous statement. There is no “accepted standard for effect size” that is used for all studies in any discipline. The effect size for any power calculation should be tailored to the specific study, with careful attention to the smallest effect size of clinical interest for that question (5). Many other medical disciplines (cardiology, critical care, etc) carry out scores of clinical trials using varying detectable effect sizes for their power calculations. The effect size used for a primary prevention trial with statins (6) may be very different from the effect size for transcatheter valve replacement (7), yet both studies managed to carry out a power calculation ahead of the trial. The belief that there must be some “accepted standard” effect size to perform a priori power calculations holds no water, and allows this backwards understanding of the utility of post-hoc power to perpetuate.

This problem is not new (8). In fact, one of Bababekov’s very own citations (9) explains this in detail. Paying attention to the possibility of Type 2 error is good; if you must, however, please try to understand and implement the underlying principles correctly.

References

1. Bababekov YJ, Stapleton S, Mueller JL, Fong Z, Chang DC. A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science. Ann Surg 2018; 267(4): 621-622.

2. Gelman A. Don’t Calculate Post-hoc Power Using Observed Estimate of Effect Size. Ann Surg 2018 (epub ahead of print)

3. Plate JDJ, Borggreve AS, van Hillegersberg R, Peelen LM. Post Hoc Power Calculation: Observing the Expected. Ann Surg 2018 (epub ahead of print)

4. Bababekov YJ, Chang DC. Post Hoc Power: A Surgeon’s First Assistant in Interpreting “Negative” Studies. Ann Surg 2018 (epub ahead of print)

5. Albers C, Lakens D. When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. J Exp Soc Psychol. 2018; 74: 187-195.

6. Ridker PM, Danielson E, Fonseca FA, Genest J, Gotto AM, Jr, Kastelein JJ, Koenig W, Libby P, Lorenzatti AJ, MacFadyen JG, Nordestgaard BG, Shepherd J, Willerson JT, Glynn RJ. Rosuvastatin to prevent vascular events in men and women with elevated C-reactive protein. N Engl J Med. 2008; 359: 2195–2207.

7. Leon MB, Smith CR, Mack M, Miller DC, Moses JW, Svensson LG, Tuzcu EM, Webb JG, Fontana GP, Makkar RR, Brown DL, Block PC, Guyton RA, Pichard AD, Bavaria JE, Herrmann HC, Douglas PS, Petersen JL, Akin JJ, Anderson WN, Wang D, Pocock S; PARTNER Trial Investigators. Transcatheter aortic-valve implantation for aortic stenosis in patients who cannot undergo surgery. N Engl J Med 2010; 363(17): 1597-1607.

8. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician 2001; 55(1): 19-24.

9. O’Keefe DJ. Brief Report: Post Hoc Power, Observed Power, A Priori Power, Retrospective Power, Prospective Power, Achieved Power: Sorting Out Appropriate Uses of Statistical Power Analyses. Commun Meth Measures 2007; 1:291-299.

- Andrew on January 14, 2019 9:14 AM at 9:14 am said:
  
  Andrew A.:
  
  Yes. What also interests/bothers me is the social problem here. We have researchers who are, I assume, well intentioned and have no benefit to be gained by giving people bad advice. They make a mistake that is published and are lucky to see their mistake corrected—but they don’t recognize their luck. Instead they double down on their bad advice. (And, sure, calling something a “shit sandwich” is not a way to get someone on your side—but recall that I only did this after the published repetition of the published mistake, so their stubbornness can’t be attributed to my impoliteness.)
  
  In situations such as Wansink etc. we can say that there’s a problem with the incentives, in that researchers can take a big hit to their career by recognizing that they’ve been working in a failed paradigm: it’s not just the cost of admitting a mistake, it’s also the need to retool one’s entire research agenda. But in this example with people writing an expository note, they could admit their mistake, learn something from the episode, and move on. It’s my feeling that the reason these people haven’t taken this step is not about incentives, or even stubbornness, but rather the expectation that scientists never admit error. It’s a problem.
  
  - Andrew Althouse on January 14, 2019 12:15 PM at 12:15 pm said:
    
    Very much agreed, Andrew G.
    
    I dare not comment too far outside my own experience or expertise, but in general, statistics and probability are challenging for many folks to grasp (as physics, chemistry, and biology all were/are challenging for me). I think many folks that work in specialties where they have passing contact with statistics, such as medicine, are very self-impressed when they learn a new statistics thingamajig and decide it’s the greatest thing since sliced bread whether they really understand it or not (I worked with a surgeon on a project where we used a Cox model with a time-varying covariate; he described this as our team using “novel statistical methods” and proudly told all of his colleagues about this great new thing he’d come up with). I surmise that these folks, having sufficiently impressed themselves by learning about “post hoc power” and how they can use it to explain so-called negative (p>0.05) studies, are simply reluctant to admit that they don’t really understand the pet stats thingy that they just learned all that well, so rather than thank you and the other fine stats folks who are trying to help them understand, it’s better to just deflect and talk about how they are still right anyway.
    
    Anyways, thank you for taking the time to try and set them straight on this issue. Hopefully even if they don’t get it, some folks will read and understand.
    
    - Andrew Althouse on February 15, 2019 8:37 AM at 8:37 am said:
      
      Semi-humorous update: the letter I submitted (text above) sat in the editorial queue for over four months (submitted October 4; listed as “Decision In Process” effective October 14; no updates since then). I Tweeted at the Annals of Surgery earlier this week to ask about the status, and *today* received notice that they will publish the letter, although they will first request a reply from the authors of the work (NOW? Four months later? You couldn’t have asked them for a reply before? although it is possible/likely that this is just form letter text that they include with all acceptance notices of Letter to the Editor, and this has already been sent to the original authors for evaluation).
    - Andrew on February 15, 2019 10:10 AM at 10:10 am said:
      
      Andrew A.:
      
      That’s still much much better than journals such as Psychological Science or Perspectives on Psychological Science that go to extra effort to avoid correcting the errors they publish.
Sean Mackinnon on January 14, 2019 12:21 PM at 12:21 pm said:

Putting aside the obvious truth you’ve pointed out that post-hoc power analysis conducted using the effect size generated from the experiment is useless (I agree 100% there), it’s worth engaging with alternate solutions to the problem at hand. I have done stats consulting with medical doctors for a few years, and almost universally the issues are:

a) They have no idea about what the population effect size is / should be
b) The sample sizes are often small, due to difficulty of data collection and ethics
c) They’re using a NHST framework, and want to have some sense about their ability to detect effects with the data they have

When working with this situation, I usually recommend a sensitivity power analysis. That is, just re-arrange the algebra to solve for population effect size given N, alpha, power, and the analysis being conducted. That way, it skirts around the issue of having no good educated guess for the population effect size. Then you can say something like, “If the effect size in the population is bigger than X, my study probably would have detected it. If it’s smaller than that, my study is uninformative.”

To keep with the analogy … the population effect size is the shit in the sandwich. So let’s take that out. I guess now we just have a bread sandwich, but hey, at least that’s edible.

Obviously, a power analysis using pre-existing data and a good estimate for the population effect size is preferable. But honestly, that is quite frequently out of reach in a lot of studies. If they had a good estimate of the population effect size, they wouldn’t need to do the study!
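
The "re-arrange the algebra" step can be sketched in a few lines. This is only an illustration under stated assumptions: a two-sided, two-sample comparison of means with equal group sizes, treated with a normal (z) approximation instead of the exact t-test, and ignoring the negligible far rejection tail. The function name `minimum_detectable_d` and the sample sizes in the loop are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized mean difference detectable with the given power.

    Assumes a two-sided, two-sample z-test with equal group sizes and a
    known common variance (a normal approximation to the usual t-test).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return (z_crit + z_power) * sqrt(2 / n_per_group)


# Hypothetical sample sizes, chosen only to show how the threshold shrinks.
for n in (20, 50, 200):
    d = minimum_detectable_d(n)
    print(f"n per group = {n:>3}: minimum detectable d is about {d:.2f}")
```

The output is the "X" in the sentence above: effects larger than it would probably have been detected, and for effects smaller than it the study is uninformative. Nothing from the observed data enters the calculation.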

- Andrew Althouse on January 14, 2019 12:27 PM at 12:27 pm said:
  
  Agreed, Sean. That’s what I proposed in the letter I wrote (unaware that Andrew G. would be responding to the response), and that’s the most actionable way to try to approach this. If they “must” use some sort of post-hoc power analysis, there is a way to do it that (sort of) makes sense. The original authors’ mistake is that they (still!) do not understand why using the observed effect size to gin up the post-hoc power number is a problem.
  
Adan Becerra on January 14, 2019 1:01 PM at 1:01 pm said:

Andrew, just FYI, David Chang is not a surgeon. He has a PhD in health policy from Hopkins.

https://advances.massgeneral.org/contributors/contributor.aspx?id=1165

- Andrew on January 14, 2019 1:06 PM at 1:06 pm said:
  
  Fixed; thanks.
  
  - Adan Becerra on January 14, 2019 1:28 PM at 1:28 pm said:
    
    I guess I would expect a PhD in health policy from a top program to know more about statistics than what is implied by their response. I agree with you that they make some good points, but was disappointed to see that Chang, who I believe is a solid researcher and data analyst in general, missed the point about getting wrong answers when using noisy point estimates.
    
    - Andrew Althouse on January 14, 2019 1:48 PM at 1:48 pm said:
      
      It’s also possible (?) that he counseled against some of this language but the surgeons plowed ahead anyway. I say this as a veteran of watching surgeons disappear with tables, figures, or analyses that I produced, only for a paper to appear in the literature several months later (sometimes with my name on it, sometimes not). Since this is an editorial, that scenario is unlikely, but they may well have used some language that he did not agree with, either waiting for him to relent or making the change after showing him but before the submission.
    - Andrew on January 14, 2019 2:01 PM at 2:01 pm said:
      
      Andrew A.:
      
      Sure, but then this raises the question of why the surgeons would care so much about the wording of a letter on a topic that they clearly don’t understand. I can’t imagine that this letter, or the original article it supports, is important to their careers, as it’s tutorial material, not research. (It doesn’t matter to my career either; I’m just doing this as a public service. And I can see that the authors of this article think they’re doing a public service—not only do they not understand the statistical issues, they don’t even understand their own level of ignorance—but in that case you’d think they’d respect pushback from a coauthor even if they can’t handle disagreement from the outside.)
    - Daniel Lakeland on January 15, 2019 3:23 PM at 3:23 pm said:
      
      Here’s my take Andrew, based on conversations I’ve had with academic doctors. In medicine there is pressure on doctors to do research and publish in addition to long hours being a doctor doing surgeries or clinical rounds or office visits or emergency on-calls etc. There is however zero time allotted for this research, however it affects your salary and promotions to be able to list published articles, and there are largish pools of money set aside *only for people with MDs* and it really doesn’t matter what those articles say. So, there’s a whole industry of “turn the crank” medical research designed to bring this “just for MDs” pool of money into University coffers. NHST is a gold-mine for this industry since you can generate an endless stream of “amazing new discoveries” by essentially noisy random number generation. Furthermore, post-hoc power calculations of this sort are a further gold-mine because it proves that “if you didn’t find something, it’s because your power wasn’t big enough” and so you need to be given more money to increase your power.
      
      Now, some physicians do *real* research, but many others do this fake “turn the crank” kind. It’s an industry of government-sponsored rent seeking encouraged by universities whose enormous budgets are padded by literally hundreds of billions per year country-wide from this kind of thing. Many physicians burn out on it and wind up leaving academic positions for Kaiser or whatever because it’s identifiably BS. They’ve told me as much. So, as an outsider stumbling into this field you’re thinking innocently “look, we can make this research better if we understand statistics better” and the response you get is essentially “stay the f*** out, and stop pissing in our funding pool.”
    - Keith O'Rourke on January 15, 2019 3:40 PM at 3:40 pm said:
      
      Daniel:
      
      I have worked alongside (rather than with) people doing this “however zero time allotted for this research, however it affects your salary and promotions to be able to list published articles, and there are largish pools of money set aside *only for people with MDs* and it really doesn’t matter what those articles say. So, there’s a whole industry of “turn the crank” medical research designed to bring this “just for MDs” pool of money into University coffers”.
      
      One called it providing a research boutique for clinicians – “You have a seat and your study is designed to your liking, sample size calculations done, grant application drafted for you to review and send off, etc.” They were serious, and very soon afterwards became the editor of a major clinical journal.
      
      But there is, I believe, a large enough percentage of clinical researchers who do care more about patient welfare than their careers, and we need to enable them to criticize and defund such careerist groups. They, I believe, can make this research better if we help them understand statistics better.
    - Andrew on January 15, 2019 3:42 PM at 3:42 pm said:
      
      Daniel:
      
      Sure, but (a) the journal’s not telling me to go away, they’re publishing my letter; and (b) I assume the authors of the original paper are sincerely trying to help out. If they were pure rent-seekers, why bother writing that sort of tutorial argument in the first place?
      
      I agree with you on the second-order position that the rent-seeking environment creates niches for bad science, and niches for people to promote and encourage bad science—but, within all that, there’s still a lot of misunderstanding. I feel the same way about the psychologists who publish junk science in PNAS etc.: Yes, I blame them for not putting in the effort to understand what they are doing wrong; but, in addition, they seem to be true believers in their method.
    - Daniel Lakeland on January 15, 2019 8:33 PM at 8:33 pm said:
      
      Like most things, it’s complicated, and as I said, “some physicians do *real* research,” but there’s all these other guys too, and the statistical paradigm of “turn the crank and discover the truth!” is a huge boon to that industry of fake studies. The worst part is that some people in that field *don’t know* they’re rent seekers doing BS because for all they know *this is how science is done*.
Adan Becerra on January 14, 2019 2:01 PM at 2:01 pm said:

Andrew A,

Yes, I do think it is a possibility that Chang didn’t agree with some of the language. While I am not a veteran, I did work with surgeons for 3 years while in grad school, and I agree that I sometimes had challenges in writing papers with technical (but appropriate) sounding language. I do often see some language in my first papers that I wish I had pushed harder to change, because it is not the way I would write them these days. As the only non-surgeon on the papers, I often found it difficult to get certain points across, since the surgeons didn’t care about the technicalities and instead focused just on the clinical implications, conclusions, etc. But I did get better over the years and they began to respect me as time went by.

In this case, however, I would expect Chang not to fall into this trap anymore. Unlike me, he is not a young, up-and-coming researcher. On the contrary, Chang is considered a respected and veteran researcher in the surgical outcomes research world. He is also the Director of Healthcare Research and Policy Development, Codman Center, so perhaps this is why I expected higher standards.

systematic reviewer on November 17, 2025 1:05 PM at 1:05 pm said:

“Post hoc power” is not wrong because the estimate from the study is noisy. It is wrong because it is just another way of expressing the p-value associated with the estimated effect size. If p=0.05 then the study had 50% power to detect the difference it did. If p<0.05 then power >50%. If p>0.05 then power <50%.

If a retrospective power calculation is thought useful, you'd have to mentally go back to the design stages, before any data had been collected or effect size observed, and do a power calculation based on the smallest effect that would be sufficient to change practice (or whatever other gauge you use to decide what effect size to power your study for).
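
The one-to-one link between the p-value and "observed power" can be checked numerically. The sketch below assumes a two-sided z-test and treats the observed effect as if it were the true effect (the function name `observed_power` is illustrative); under those assumptions the power depends on the data only through the p-value:

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """'Post hoc power' of a two-sided z-test when the observed effect
    is plugged in as though it were the true effect."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # rejection threshold
    z_obs = z.inv_cdf(1 - p_value / 2)  # |z| statistic implied by the p-value
    # Probability that a replicate z statistic centered at z_obs lands
    # beyond either critical value.
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


# A few illustrative p-values on either side of the 0.05 threshold.
for p in (0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 3))
```

Because the function takes only the p-value and alpha as inputs, reporting observed power alongside a p-value adds no new information about the study.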


Statistical Modeling, Causal Inference, and Social Science


54 thoughts on “How post-hoc power calculation is like a shit sandwich”
