Don’t calculate post-hoc power using observed estimate of effect size

Posted on September 24, 2018 9:16 AM by Andrew

Aleksi Reito writes:

The statement below was included in a recent issue of Annals of Surgery:

But, as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if less than 80%—with the given sample size and effect size observed in that study.

It is the highest ranking journal in the field of surgery. I find it worrying that they suggest calculating post-hoc power.

I agree. This is a well known error; see references here, where we write:

The idea that published effect-size estimates tend to be too large, essentially because of publication bias, is not new (Hedges, 1984; Lane & Dunlap, 1978; for a more recent example, also see Button et al., 2013). . . .

After data have been collected, and a result is in hand, statistical authorities commonly recommend against performing power calculations (see, e.g., Goodman & Berlin, 1994; Lenth, 2007; Senn, 2002).

It’s fine to estimate power (or, more generally, statistical properties of estimates) after the data have come in—but only only only only only if you do this based on a scientifically grounded assumed effect size. One should not not not not not estimate the power (or other statistical properties) of a study based on the “effect size observed in that study.” That’s just terrible, and it’s too bad that the Annals of Surgery is ignoring a literature that goes back at least to 1994 (and I’m sure earlier) that warns against this.
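To see how badly this can go, here is a minimal simulation sketch. It assumes a two-sided z-test with known standard error, and the function names and the particular numbers (true effect 1, standard error 1) are mine, chosen only for illustration. Each simulated study plugs its own noisy estimate into the power formula, which is the calculation the Annals of Surgery article recommends.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def power_two_sided_z(effect, se, alpha=0.05):
    """Power of a two-sided z-test if the true effect were `effect` (standard error `se`)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


def simulate_post_hoc_power(true_effect, se, n_sims=20000, alpha=0.05, seed=1):
    """Compare true power with 'post-hoc power' computed from each study's own estimate."""
    rng = random.Random(seed)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    all_studies, significant_only = [], []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)               # one noisy study
        post_hoc = power_two_sided_z(estimate, se, alpha)   # plug in the estimate
        all_studies.append(post_hoc)
        if abs(estimate) / se > z_crit:                     # study reached significance
            significant_only.append(post_hoc)
    return {
        "true_power": power_two_sided_z(true_effect, se, alpha),
        "mean_post_hoc_power": mean(all_studies),
        "mean_post_hoc_power_if_significant": mean(significant_only),
    }


result = simulate_post_hoc_power(true_effect=1.0, se=1.0)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the true power is roughly 17%, the average post-hoc power is noticeably higher, and among the studies that happened to reach significance the post-hoc power is always above 50%.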

Reito continues:

I still can’t understand how it is possible that authors suggest a revision to CONSORT and STROBE guidelines by including an assessment of post-hoc power and this gets published in the highest ranking surgical journal. They try to tackle the issues with reproducibility but show a complete lack of understanding in the basic statistical concepts. I look forward to the discussion on this matter.

I too look forward to this discussion. Hey, Annals of Surgery, whassup?

P.S. I guess I could write a letter to the editor of the journal but I doubt they’d publish it, as I don’t speak the language of medical journals.

But, hey, let’s give it a try! I’ll go over to the webpage of Annals of Surgery, set up an account, write a letter . . .

Here it is:

Don’t calculate post-hoc power using observed estimate of effect size

Andrew Gelman

28 Mar 2018

In an article recently published in the Annals of Surgery, Bababekov et al. (2018) write: “as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study.”

This would be a bad idea. The problem is that the (estimated) effect size observed in a study is noisy, especially so in the sorts of studies discussed by the authors. Using estimated effect size can give a terrible estimate of power, and in many cases can lead to drastic overestimates of power (thus, extreme overconfidence of the sort that is rightly deplored by Bababekov et al. in their article), with the problem becoming even worse for studies that happen to achieve statistical significance. The problem is well known in the statistical and medical literatures; see, e.g., Lane and Dunlap (1978), Hedges (1984), Goodman and Berlin (1994), Senn (2002), and Lenth (2007). For some discussion of the systemic consequences of biased power calculations based on noisy estimates of effect size, see Button et al. (2013), and for an alternative approach to design and power analysis, see Gelman and Carlin (2014).

That said, I agree with much of what Bababekov et al. (2018) say. I agree that the routine assumption of 80% power is a mistake, and that requirements of 80% power encourage researchers to exaggerate effect sizes in their experimental designs and to cheat in their analyses in order to attain the statistical significance that they were supposedly so nearly assured of (Gelman, 2017b). More generally, demands for near-certainty, along with the availability of statistical analysis tools that can yield statistical significance even in the absence of real effects (Simmons et al., 2011), have led to a replication crisis and general corruption in many areas of science (Ioannidis, 2016), a problem which I believe is structural and persists even in the presence of honest intentions of many or most participants in the process (Gelman, 2017a).

I appreciate the concerns of Bababekov et al. (2018) and I agree with their goals and general recommendations, including their conclusion that “we need to begin to convey the uncertainty associated with our studies so that patients and providers can be empowered to make appropriate decisions.” There is just a problem with their recommendation to calculate power using observed effect sizes.

References

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B., Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14, 1-12.

Gelman, A. (2017a). Honesty and transparency are not enough. Chance 30 (1), 37-39.

Gelman, A. (2017b). The “80% power” lie. Statistical Modeling, Causal Inference, and Social Science blog, 4 Dec. https://statmodeling.stat.columbia.edu/2017/12/04/80-power-lie/

Gelman, A., and Carlin, J. B. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 641-651.

Goodman, S. N., and Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine 121, 200-206.

Hedges, L. V. (1984). Estimation of effect size under non-random sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational Statistics 9, 61-85.

Ioannidis, J. (2016). Evidence-based medicine has been hijacked: a report to David Sackett. Journal of Clinical Epidemiology 73, 82-86.

Lane, D. M., and Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology 31, 107-112.

Lenth, R. V. (2007). Statistical power calculations. Journal of Animal Science 85, E24-E29.

Senn, S. J. (2002). Power is indeed irrelevant in interpreting completed studies. British Medical Journal 325, Article 1304.

Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22, 1359-1366.

. . . upload it to the journal’s submission website. Done!

That took an hour. An hour worth spending? Who knows. I doubt the journal will accept the letter, but we’ll see. I assume their editorial review system is faster than this blog. Submission is on 28 Mar 2018, blog is scheduled for posting 24 Sept 2018.
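For readers who want to try the alternative mentioned in the letter, here is a rough sketch in the spirit of the design analysis in Gelman and Carlin (2014). It is not a copy of their code: it assumes a normal sampling distribution, a positive assumed effect, and a function name and example numbers that I made up for illustration. The key point is that the effect size fed in is an externally justified assumption, not the estimate from the study itself.

```python
import random
from statistics import NormalDist, mean


def retrodesign(assumed_effect, se, alpha=0.05, n_sims=20000, seed=1):
    """Power, Type S error rate, and exaggeration ratio for an ASSUMED true effect.

    `assumed_effect` should be positive and come from outside information,
    not from the noisy estimate of the study being evaluated.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    p_hi = 1 - std_normal.cdf(z_crit - assumed_effect / se)   # significant, correct sign
    p_lo = std_normal.cdf(-z_crit - assumed_effect / se)      # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # Exaggeration ratio (Type M): average size of significant estimates
    # relative to the assumed true effect, approximated by simulation.
    rng = random.Random(seed)
    estimates = (rng.gauss(assumed_effect, se) for _ in range(n_sims))
    significant = [abs(e) for e in estimates if abs(e) > z_crit * se]
    exaggeration = mean(significant) / assumed_effect
    return power, type_s, exaggeration


power, type_s, exaggeration = retrodesign(assumed_effect=0.5, se=1.0)
print(f"power = {power:.2f}, Type S = {type_s:.2f}, exaggeration ratio = {exaggeration:.1f}")
```

With a small assumed effect relative to the standard error, power is low and any estimate that reaches significance must be several times larger than the assumed effect.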

65 thoughts on “Don’t calculate post-hoc power using observed estimate of effect size”

Shravan on September 24, 2018 9:31 AM at 9:31 am said:

The world’s first preregistration of a letter submission.

It is common in psych to argue that after you get a p larger than 0.05, observed power was large enough. Apparently in psych programs they don’t teach that observed power is a function of the p-value (Hoenig and Heisey, 2001). I guess that psych is generally a sophisticated early adopter of stats, so they stopped going to primary sources and developed a garbled version of stats theory.

- psyched out on September 24, 2018 5:44 PM at 5:44 pm said:
  
  Beyond the stab at psychology, most of my training involved understanding the relationship between ES, sample size, and our chosen alpha level. I read the H&H paper you referenced. I was never encouraged to look at observed power, but I was taught to explore detectable effect sizes in the event of a non-significant effect. I disagree with H&H on the use of detectable effect sizes;
  
  “The closer the detectable effect size is to the null hypothesis of 0, the stronger the evidence is taken to be for the null.”
  
  I was never taught that it was evidence for the null. Rather, we were taught that it provides a sense of how underpowered we are to detect the effect given our observed ES and sample size.
  
  - Shravan on September 25, 2018 2:34 AM at 2:34 am said:
    
    I am referring to things like this:
    
    “To show that our finding of no interaction between
    the RC type and the definite/indefinite conditions was
    not due to a lack of statistical power, we conducted a
    power analysis using the results of the first critical word
    reading time. Using the error terms of the main effects
    and interaction of the ANOVA on this word, we found
    that we had power above .8 to detect an interaction of
    the size found in experiment 3 of Gordon et al. (2001)
    (this interaction was found when names replaced the
    embedded NP of the RC). Thus, we concluded that our
    lack of detection of an interaction between the two
    factors of our ANOVA was not due to a lack of statistical
    power.”
    
    p 103 of https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.2923&rep=rep1&type=pdf
    
Joe L on September 24, 2018 9:38 AM at 9:38 am said:

Good on them!

https://www.ncbi.nlm.nih.gov/pubmed/29994928

Zad Chow on September 24, 2018 10:31 AM at 10:31 am said:

I submitted something similar to JAMA last month regarding a trial that investigated the effects of an antidepressant for cardiovascular disease outcomes. The authors made power calculations via the effect sizes observed in the study and concluded that they had achieved more power than they had planned for and therefore put more confidence in their results. Unfortunately, it got rejected, as do most letters to the editor in JAMA about methodology and statistics.

Daniel Lakeland on September 24, 2018 11:00 AM at 11:00 am said:

Power is a broken idea, the Bayesian version makes more sense: how much data do you need to make the high probability posterior interval of your parameter estimate smaller than some desirable precision / width? This can be estimated by first drawing a parameter value from your prior, then generating data of a certain size N, then running the data through your Stan/inference engine to get a posterior sample. Repeat with different values of the parameters and different N. It only makes sense if you’re using moderately informed priors. It is somewhat expensive in computing time, particularly if a Stan run takes more than a few seconds or minutes.

Post hoc power analysis in this context makes sense if you ask the question “how many more data points do I need to get my posterior interval down to an even smaller interval”? You can use the posterior as the prior in re-doing the same analysis. If you want to decide how small your interval should be: Bayesian decision theory. How much value (measured in dollars or something like it) does a given dataset size gain you on average (you now have to quantify how valuable information is, in a real-world sense).
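The simulate-from-the-prior recipe above can be sketched without Stan. What follows is only an illustration under loud assumptions: I swap in a conjugate Beta-binomial model so the posterior is available directly, the Beta(2, 2) prior is arbitrary, and the helper names `posterior_interval_width` and `expected_width` are hypothetical. A real application would replace the posterior step with a fit of the actual model.

```python
import random


def posterior_interval_width(successes, failures, prior_a, prior_b, rng,
                             draws=1000, mass=0.95):
    """Monte Carlo width of the central posterior interval, Beta-binomial model."""
    samples = sorted(
        rng.betavariate(prior_a + successes, prior_b + failures)
        for _ in range(draws)
    )
    lo = samples[int((1 - mass) / 2 * draws)]
    hi = samples[int((1 + mass) / 2 * draws) - 1]
    return hi - lo


def expected_width(n, prior_a=2.0, prior_b=2.0, n_sims=100, seed=0):
    """Average posterior interval width for datasets of size n, averaged over the prior."""
    rng = random.Random(seed)
    widths = []
    for _ in range(n_sims):
        theta = rng.betavariate(prior_a, prior_b)                 # 1. draw parameter from the prior
        successes = sum(rng.random() < theta for _ in range(n))   # 2. simulate data of size n
        widths.append(                                            # 3. summarize the posterior
            posterior_interval_width(successes, n - successes, prior_a, prior_b, rng)
        )
    return sum(widths) / len(widths)


for n in (20, 80, 320):
    print(n, round(expected_width(n), 3))
```

One would then pick the smallest N whose expected width falls below the precision needed for the decision at hand.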

- Shravan on September 24, 2018 12:15 PM at 12:15 pm said:
  
  Am about to submit second paper doing just this, will report if rejected.
  
  - Daniel Lakeland on September 24, 2018 2:39 PM at 2:39 pm said:
    
    Please, also if accepted!
    
    - Shravan on September 27, 2018 2:51 AM at 2:51 am said:
      
      First paper was accepted w/o complaints: https://www.sciencedirect.com/science/article/pii/S0749596X18300640?via%3Dihub
      
      PS I made this paper open access for 1700 euros. Elsevier’s sole contribution was to add at least one typo that wasn’t there in the original submitted latex source. Someone on Elsevier’s production staff looked at the paper and said, no, we cannot allow a paper to be published without mistakes in it, let me edit the source .tex file and add a mistake.
- Richard D. Morey on October 4, 2019 5:14 AM at 5:14 am said:
  
  “Power is a broken idea, the Bayesian version makes more sense: how much data do you need to make the high probability posterior interval of your parameter estimate smaller than some desirable precision / width?”
  
  It seems more that the folk statistical idea of what power is is broken; that’s what causes people to do silly things with it. Suppose you were interested in precision. What is precision for? It isn’t a fundamental statistical idea; it is a heuristic (which credible interval is the precision? central 95%? standard deviation? particular HPD? or is it the curvature of the posterior at the mode?). It is a fine heuristic, but we have to understand that that’s all it is and we need more basic formal principles to define it. We can’t *replace* those ideas with precision.
  
  But let’s examine the heuristic from a classical perspective to see if power is “broken” and whether we should consider precision instead (as, say, CI advocates suggest). The basic idea behind precision, stripped down, is that we want to be able to differentiate true values in one range from true values in another, and the bounds of these ranges are separated by some amount X. When X is smaller, we have more precision.
  
  Suppose a classical test has a max Type I error rate of alpha and this is acceptably low to us. If theta<=0, we would not often claim theta>0. Now we construct a test with high power (1-alpha): if theta>=theta_1, we would very often claim that theta>0 (and hence, theta!<0). The test will work in the other way, too; a failure to reject indicates that theta<theta_1. As N→infty, we can make theta_1 closer and closer to 0 and get the same high power (1-alpha). In what sense is this not precision? Talking about precision as the steepness of the power curve seems totally reasonable.
  
  Granted, precision-as-power is precision *targeted at a particular region of the parameter space*, but that’s important, because not all models will give you the same precision at all points in the parameter space. And it is pretty typical for a particular region of the parameter space to be of interest (help/harm, loss/gain, etc) so this makes sense. We care more about regions of the parameter space where the qualitative interpretation of the parameter changes (and if you wanted to examine a different part of the parameter space, you could anyway).
  
  So I don’t get why power is “broken”, unless one thinks you “estimate” it from data or some weirdness like that. Of course strange folk-statistical ideas about power are going to seem broken.
  
  - Richard D. Morey on October 4, 2019 5:20 AM at 5:20 am said:
    
    Oops, “If theta>0, we would not often claim theta>0” should be “If theta<=0,” of course. And the filter seems to have eaten part of my sentence: “a failure to reject indicates that theta<theta_1. As N→infty”. That will teach me to use less than and greater than signs in posts…
    
Tim Hofer on September 24, 2018 12:31 PM at 12:31 pm said:

Well the good news is that they published it July 10 (ePub)
Don’t Calculate Post-hoc Power Using Observed Estimate of Effect Size
Gelman, Andrew, PhD
Annals of Surgery: July 9, 2018 – Volume Publish Ahead of Print – Issue – p
doi: 10.1097/SLA.0000000000002908

The bad news is that the authors doubled down
https://journals.lww.com/annalsofsurgery/Citation/publishahead/Post_Hoc_Power__A_Surgeon_s_First_Assistant_in.95535.aspx

- george on September 24, 2018 4:32 PM at 4:32 pm said:
  
  Really bad: citation 9 in Bababekov and Chang’s rebuttal (available here) actually goes along with Andrew’s point; it doesn’t support their argument at all. The Annals shouldn’t have let that through.
  
Anonymous on September 24, 2018 1:07 PM at 1:07 pm said:

Accepted and published, DOI: 10.1097/SLA.0000000000002908

Keith O'Rourke on September 24, 2018 2:27 PM at 2:27 pm said:

> complete lack of understanding in the basic statistical concepts
I would have thought that would not be surprising in a non-statistical journal, or even in some statistical journals (remember the Bible code thing).

Daniel:

I think the prior bias analysis idea is more sensible – essentially do the simulations you suggest but with a set parameter value and calculate the percentage of times the posterior down-weights the prior probability of that set parameter value. (Hey posteriors based on small noisy data and weak priors can be far from the truth).

Statistical Reasoning: Choosing and Checking the Ingredients Entropy 2018 https://www.mdpi.com/1099-4300/20/4/289

- Daniel Lakeland on September 24, 2018 2:49 PM at 2:49 pm said:
  
  That’s an interesting analysis as well, I haven’t thought about which one is “better” but it seems that they do answer somewhat different questions. Yours is about whether or not noisy data with N data points is going to cause you to concentrate around a wrong estimate, mine is about whether a given dataset size is going to get you enough concentration for some practical purpose. Ideally you’d maybe answer both: can I get a small enough posterior interval and is it going to frequently contain the correct value.
  
  Your point about “posteriors based on small noisy data and weak priors can be far from the truth” is also very important, especially the part about “weak priors”. We had a discussion recently about picking priors where Dan Simpson showed some prior they were using caused the typical prior estimate of air pollution density to be far denser than neutron stars. It goes a long way to simply work on specifying priors that are actually sensible. If your prior excludes the nonsense regions of space you have less chance of concentrating your posterior around noisy nonsense answers.
  
  All techniques available to you, working together, are the ideal of course. Thanks for the reference.
  
  - Keith O'Rourke on September 25, 2018 8:02 AM at 8:02 am said:
    
    Definitely both.
    
    The reference argues for both but assesses the “strength” of the posterior e.g. using the posterior probability of parameters with probability increases larger than a favored parameter value. (Meeting with the author later this week if anyone has questions).
    
Nick Adams on September 24, 2018 5:47 PM at 5:47 pm said:

Actually, observed power is a useful concept, just not in the way the Annals of Surgery use it.
The true power of a study depends upon the true (i.e. actual) effect size, the sample size and alpha (the nominated level of significance). Clearly the first of these is unknown but the best estimate of the true effect size is the observed effect size and hence the best estimate of the true power is the observed power. I take here a pure likelihood approach, ignoring any prior information about the true effect size and using only the data at hand.
The observed power is known to be a monotonic function of the P-value alone (e.g. Hoenig and Heisey 2001). As the P-value gets smaller the observed power gets bigger. To get an idea of the calibration, the observed power always equals 50% when the P-value equals alpha (which a moment’s thought will show makes perfect sense). The upshot of this is that a large P-value always implies low power, whether due to a small effect size or a small sample size or both. This low power means that it is not reasonable to accept that the null-hypothesis is true based on a large P-value. The converse also holds: a small P-value implies high power and thus confident rejection of the null.
This of course is exactly what R. A. Fisher originally wrote – one can reject the null or fail to reject the null but not accept the null.

The interesting thing about all this is that the strength of evidence against the null is simply a function of the P-value and the sample size is irrelevant. So two studies, say one with p=0.01, n=50 and another with p=0.01, n=1000, provide the same strength of evidence against the null (notwithstanding that the observed effect size/power will be a better estimate of the true effect size/power for the larger study).
And yes I know that the null itself is seldom of interest but this does help sort out a lot of misconceptions about power (e.g. on Deborah Mayo’s blog and Daniel’s comment above).
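The Hoenig and Heisey relationship cited above is easy to check numerically. This is a minimal sketch, assuming a two-sided z-test (a simplification of mine, not something taken from the paper) and a hypothetical helper named `observed_power`. The only input is the p-value; neither the sample size nor the effect estimate enters separately.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """'Observed power' of a two-sided z-test, computed from the p-value alone."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)    # |z| implied by the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)     # rejection threshold
    # Power if the true standardized effect were exactly the observed z.
    return (1 - STD_NORMAL.cdf(z_crit - z_obs)) + STD_NORMAL.cdf(-z_crit - z_obs)


for p in (0.50, 0.20, 0.05, 0.01, 0.001):
    print(f"p = {p:<5}  observed power = {observed_power(p):.3f}")
```

Under these assumptions observed power is essentially 50% when p equals alpha and rises as p falls, so reporting it adds nothing that the p-value did not already say.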

- Andrew on September 24, 2018 6:10 PM at 6:10 pm said:
  
  Nick,
  
  You write, “the best estimate of the true effect size is the observed effect size.” First, strictly speaking, there is no “observed effect size”; all that you can get is an estimate. Second, the usual point estimate may be “best” to you, but it’s not “best” to me! Using this estimate results in systematic overestimates of effect size, replicability rates, etc., leading to the famous replication crisis in which people have been stunned and surprised by failed replications—they maybe wouldn’t have been so stunned and surprised had they been aware of type M error. And, no, a small p-value does not imply high power: this is a key lesson that we’ve learned over the past 10 years or so. Some discussion is here, here, and here.
  
- Jackson Monroe on September 24, 2018 7:53 PM at 7:53 pm said:
  
  Revering p-values to the extent that two studies are considered equivalent when they have the same p-value but one has 20 times the observations of the other nicely encapsulates the problems this blog looks to remedy, in my opinion.
  
  - Nick Adams on September 24, 2018 9:58 PM at 9:58 pm said:
    
    Jackson,
    
    I don’t revere p-values, I’m just pointing out that logically from a likelihood perspective they do a superb job doing exactly what Fisher designed them to do – represent strength of evidence against the null. If the small study and the large study have the same p-value then the large study must have a much smaller observed effect size that is closer to the null. This smaller effect size is exactly offset by the lower level of uncertainty associated with the larger sample size. So yes, the 2 studies represent equivalent evidence against the null.
    
    Andrew,
    
    A low p-value implies a high observed power. It doesn’t necessarily imply ‘high power’ in the sense that the effect size estimate has low uncertainty attached to it, which is the sense I think you are using.
    
    On a slightly tangential point, I wonder why we don’t use confidence intervals for P-values. Results are often given as a point estimate, its p-value and then the 95% confidence interval of the point estimate. The two end points of the confidence interval could be treated as point estimates and have their own p-values calculated. This might bring home the point that the p-value is a random variable and that an exact replication study is unlikely to result in the same p-value as the original study.
    
    - Michael Nelson on September 25, 2018 11:46 AM at 11:46 am said:
      
      Nick, if I understand you, I think it’s not right to say that p-values quantify the degree of “evidence against the null” as opposed to just whether we reject the null. If I decide a priori that I’m willing to reject the null while accepting that I’ll be wrong 5% of the time, then I am implicitly declaring that a p-value of .051 and a p-value of .51 have the same interpretation with respect to the null. This is widely misunderstood, I think, largely due to the never-ending stream of articles that say things like “this result was significant at p = .001” when there’s no way they would have rejected at p = .01, or “this result was nearly significant (p = .051)” as though their *choice* of significance level were a random variable. Because reject/fail-to-reject is a binary choice, it’s not particularly meaningful to interpret the distance from the critical value as a magnitude of “evidence against the null.” Of course, the significance level used is arbitrary and the researcher might have chosen .001 or .051 in an alternate universe, but that’s one of the reasons NHST’s are criticized. Besides, we have other, better tools for assessing the quality of evidence, one of them being sample size. I believe part of the point of above comments was that a small sample size is much more likely to grossly underestimate standard error than a larger study, so the quality of the p-value itself is more questionable in a small study.
    - Corey on September 25, 2018 12:02 PM at 12:02 pm said:
      
      “The two end points of the confidence interval could be treated as point estimates and have their own p-values calculated.”
      
      I think I know why this isn’t common practice. Here are the one-sided p-values for the endpoints of the usual 95% central confidence interval:
      lower limit: 0.025
      upper limit: 0.975
      These values are independent of the data by the definition of the confidence interval.
    - Daniel Lakeland on September 25, 2018 12:46 PM at 12:46 pm said:
      
      I think he meant to calculate the power to reject given your N and observed variation if the real value of the parameter was the lower and the upper limit of the current CI. It’s not clear to me that this is so trivial. The frequency in question is not the frequency with which you can reject each confidence interval you’ll get from repeated runs, but rather the frequency with which you’ll reject the constant values Lower_of_my_current_CI and Upper_of_my_current_CI in repeated trials.
      
      It still misses the point, which is that we really should care about what the parameter values are, not whether we can accept or reject null hypotheses…
    - Corey on September 26, 2018 10:50 AM at 10:50 am said:
      
      I don’t think so — the paragraph opens by describing its point as “slightly tangential” to the previous discussion of power and doesn’t mention power at all.
    - Nick Adams on September 25, 2018 6:41 PM at 6:41 pm said:
      
      No. A p-value of 0.01 for instance cannot have these limits.
    - Daniel Lakeland on September 26, 2018 10:50 AM at 10:50 am said:
      
      I think you’re confused. p values do not have confidence limits, a parameter has confidence limits.
      
      And the way a confidence limit is defined usually is that it’s the parameter value at which the p value for that parameter is 0.025 and the parameter value for which the p value is 0.975
    - Daniel Lakeland on September 26, 2018 11:08 AM at 11:08 am said:
      
      I had to pause here to think a little harder. When we say “the p value” we need to also state “to reject the hypothesis H” and which H do we mean? Specifically the usual way a confidence interval is calculated is for rejecting the hypothesis that the true parameter value is different from the estimated value (often the sample mean). That’s a different hypothesis than the “null” where usually the null is “the parameter value is zero”
      
      so when you say p = 0.01 you are probably referring to the p value to reject “a null hypothesis” and when Corey is saying the limits are 0.025 and 0.975 he’s referring to a hypothesis “the true value is equal to the observed mean” in the usual case.
    - Corey on September 26, 2018 7:44 PM at 7:44 pm said:
      
      ‘Corey is saying the limits are 0.025 and 0.975 he’s referring to a hypothesis “the true value is equal to the observed mean” in the usual case.’
      
      No I’m not — in the usual case the confidence limits are based on pivotal quantities which have distributions invariant to the true value of the mean.
    - Daniel Lakeland on September 27, 2018 9:59 PM at 9:59 pm said:
      
      The hypothesis “is the pivotal quantity equal to zero” is mathematically equivalent to the hypothesis “is the quantity we are estimating equal to the estimated value”; the first arises as an invertible mathematical transformation of the second (usually shifting and rescaling).
      
      The most important point though is that the p values you mention are not for “is the *effect* equal to zero” which I think is the source of some confusion about the p=0.01 stuff
    - Patrick on September 26, 2018 2:54 PM at 2:54 pm said:
      
      Hey Nick, obviously the fact that Fisher believed the so-called alpha-postulate (or p postulate; the idea that p-values present equivalent amounts of evidence against the null regardless of sample size) should not persuade us to do so as well. However, I’m surprised you believe it based on a likelihood justification. Royall (1986) himself demonstrates that precise p-values in fact provide greater evidence against a point null when yielded by *smaller* sample sizes. The same is demonstrated by Wagenmakers (2007) and generalized to evidential p-value bounds by Held & Ott (2016).
      
      See:
      
      Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. Am Stat, 40(4), 313-315.
      Held, L., Ott, M. (2016). How the maximal evidence of p-values against point null hypotheses depends on sample size. Am Stat, 70(4), 335-341.
      Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychon Bull Rev, 14(5), 779-804.
    - Nick Adams on September 26, 2018 6:06 PM at 6:06 pm said:
      
      Patrick,
      Crikey, thanks for the references. I am on holiday so don’t have access to the full texts. I have read Royall’s monograph (Statistical Evidence: A Likelihood Paradigm) several times but I don’t remember that bit. I know Held and Ott define the minimum Bayes factor several different ways – I think only Goodman’s definition is the same as the likelihood ratio. The Wagenmakers approach requires a number of assumptions including a uniform prior and is a rehash of the “p-value overstates the evidence against the null” argument. No room to argue all this here – I’ll put something up on arXiv soon.
      
      Finally, never underestimate Fisher.
Martha (Smith) on September 24, 2018 8:15 PM at 8:15 pm said:

The purported 8 comments are not showing up.

A. Tasso on September 24, 2018 10:19 PM at 10:19 pm said:

Not bad. They beat you by 2 months.
https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

Robert Grant on September 25, 2018 4:36 AM at 4:36 am said:

I agree with all of this. Hopefully the post-hoc power message is strong enough in medics’ awareness to win through.

But researchers often want to know whether their study was a real no-hoper in terms of type 2 errors and that seems a reasonable question. Whaddya think of this approach: https://www.robertgrantstats.co.uk/papers/false_nonsig_rate.pdf

Simon Gates on September 25, 2018 6:06 AM at 6:06 am said:

Did the letter get published?

This episode doesn’t surprise me much – there is still lots of statistical illiteracy in medical journals, even after decades of efforts by people like Doug Altman. As an example that I’ve been concerned with lately, the New England Journal of Medicine, no less, likes to insist on having significance tests for baseline characteristics in randomised clinical trials – where you know that any differences are just chance.

- Simon Gates on September 25, 2018 6:09 AM at 6:09 am said:
  
  Can see it did get published – for some reason couldn’t see the comments earlier.
  
Sean Mackinnon on September 25, 2018 7:08 AM at 7:08 am said:

I’ve done statistical consulting for medical folks for about 4 years, and they’d sometimes get questions from reviewers about whether or not they had enough power, if they didn’t actually do a power analysis. They also usually had no idea what a good estimate of the population effect size was, and neither did I since it was out of field for me. In cases like that, I find sensitivity power analyses more useful. That is, solve for effect size given N, power, and alpha. Then you can say, if the effect size is X or larger in the population, you probably had enough power. If it’s smaller, your study probably can’t detect it.

In truth, the population effect size is mostly unknowable for many studies, and massively inflated when based on prior work, so this makes more sense to me.
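
For anyone who wants to try the sensitivity analysis described above, here is a minimal sketch. It assumes a two-group comparison of means with equal group sizes, a two-sided test, and a normal approximation; the function name `detectable_effect` and its defaults are my own choices for illustration, not something from the comment.

```python
from math import sqrt
from statistics import NormalDist


def detectable_effect(n_per_group, power=0.80, alpha=0.05):
    """Smallest standardized mean difference (Cohen's d) detectable
    with the given power, under a normal approximation.

    Assumes two equal-sized groups and a two-sided test at level alpha.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    # The standard error of a standardized difference is sqrt(2 / n_per_group).
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


# Solve for the effect size given N, power, and alpha.
print(round(detectable_effect(50), 2))
```

With 50 per group this comes out near d = 0.56, so the statement to a reviewer would be along the lines of "if the population effect is about 0.56 SD or larger, the study probably had enough power." For small samples an exact t-based calculation would give a somewhat larger number.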

Reply ↓
- Daniel Lakeland on September 25, 2018 9:51 AM at 9:51 am said:
  
  The biggest issue is that power is all about rejecting the null, which is a questionable thing in the first place. Power or something like it should be used to help you pick an N. You don’t need to know what the real effect size is at all, just what effect size would be considered practically useful or of interest. If your surgery technique reducing incidence of *bad stuff* by 10% would be enough to recommend it, use the 10% reduction. Later if it turns out that the estimate is 30% reduction then great.
  
  The better way to choose N is Bayesian decision theory anyway. Choose N that minimizes expected societal cost considering cost of the research and reduction in societal cost from whatever benefit the research gives, averaged over the real informative prior for effect size…
  
  We can drop power in the wastebasket if we use decision theory.
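
A toy sketch of what choosing N by decision theory might look like, under strong simplifying assumptions that are mine and not from the comment: a normal prior on the effect, a normally distributed estimate with known outcome SD, a benefit that is linear in the true effect if the treatment is adopted, adoption whenever the posterior mean is positive, and a fixed cost per subject. Maximizing expected net gain here is the same thing as minimizing expected net cost. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_net_gain(n, prior_mean, prior_sd, outcome_sd,
                      value_per_unit, cost_per_subject):
    """Expected benefit of deciding after a study of size n, minus study cost.

    Before the data arrive, the posterior mean is itself normally
    distributed around prior_mean with standard deviation s (below).
    We adopt the treatment only if the posterior mean is positive, so
    the expected benefit is value_per_unit * E[max(posterior mean, 0)].
    """
    data_var = outcome_sd ** 2 / n
    s = prior_sd ** 2 / sqrt(prior_sd ** 2 + data_var)
    z = prior_mean / s
    expected_benefit = prior_mean * STD_NORMAL.cdf(z) + s * STD_NORMAL.pdf(z)
    return value_per_unit * expected_benefit - cost_per_subject * n


def choose_n(candidates, **assumptions):
    """Pick the candidate sample size with the largest expected net gain."""
    return max(candidates, key=lambda n: expected_net_gain(n, **assumptions))


# Hypothetical inputs, for illustration only.
best_n = choose_n(
    range(10, 1001, 10),
    prior_mean=0.1, prior_sd=0.2, outcome_sd=1.0,
    value_per_unit=1000.0, cost_per_subject=0.05,
)
print(best_n)
```

No significance threshold appears anywhere in this calculation; N falls out of the trade-off between what the information is worth and what it costs to collect.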
  
  Reply ↓
  - Daniel Lakeland on September 25, 2018 12:40 PM at 12:40 pm said:
    
    Put another way, the goal of research is to provide useful true information about the world at reasonable cost, not to frequently be able to reject null hypotheses.
    
    Reply ↓
    - Sean Mackinnon on September 26, 2018 10:56 AM at 10:56 am said:
      
      Point well taken! That said, in that particular context (i.e., consulting for medical doctors who have done a null hypothesis test in a frequentist framework), telling them that the entire enterprise of NHST is wrong would be an impossible sell. Broader systemic things need to change before that particular audience would be at all sympathetic to that line of thinking.
      
      Like, if Andrew instead just wrote a letter that said “The whole NHST enterprise is wrong-headed, and since the idea of statistical significance is wrong-headed, power analyses are not useful.” it would probably fall on deaf ears, even though it’s probably closer to the mark on his actual thoughts about it, based on my reading of the blog over a few years. It’s important to meet some people where they’re at, and work through changes gradually, IMO.
    - Daniel Lakeland on September 28, 2018 9:58 AM at 9:58 am said:
      
      If you come in after the fact it’s one thing, but when doing power analysis to choose design parameters prior to the study I think it’s fine to try to convince people to use Bayesian decision theory instead. Doctors don’t want to die on the hill of defending p-values; they want to convince people their study is a good one. A bunch of estimates showing a cost-benefit analysis in favor of the chosen experimental design isn’t going to hurt if well explained.
      
      Even post hoc, if you’re trying to argue for follow-up… Again it’s a good idea to compare the preliminary study to the proposed follow up using Bayesian decision theory.
Nicholas Erskine on September 25, 2018 7:13 AM at 7:13 am said:

Hey look, they accepted it:
https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

Reply ↓
Steven Johnson on September 25, 2018 8:09 AM at 8:09 am said:

Time for a post-hoc update to priors? ;-)

Don’t Calculate Post-hoc Power Using Observed Estimate of Effect Size
Gelman, Andrew, PhD
Annals of Surgery: July 9, 2018 – Volume Publish Ahead of Print – Issue – p
doi: 10.1097/SLA.0000000000002908
Letter to the Editor: PDF Only
https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

Reply ↓
J. Norway on September 25, 2018 7:06 PM at 7:06 pm said:

Maybe a stupid question… I know Cohen always decried the idea of post-hoc power analysis, and the “illogic” of that is what I was taught in grad school.

That being said, D. Mayo’s “severity” concept has shown up on this site a few times, and I’ve never understood how that (severity) is not just a form of post-hoc power analysis. Am I right, or am I missing something deeper there? I’ve read her blog and her first book several times, and frankly, I still don’t grok any distinction. Can anyone elucidate? Thanks.

Reply ↓
- Andrew on September 25, 2018 7:45 PM at 7:45 pm said:
  
  J
  
  I don’t think “power analysis” is so useful because “power” is all about statistical significance, which I think is a generally useless idea (see for example here: https://www.stat.columbia.edu/~gelman/research/published/abandon_final.pdf). I do, however, think that post-hoc design analysis can be very useful, as long as it is based on reasonable assumptions of effect sizes rather than being computed by plugging in noisy estimates from the data. I write about post-hoc design analysis in this paper: https://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf and this one: https://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf
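
A rough sketch of such a design analysis, assuming a normally distributed estimate with known standard error and a positive assumed effect that comes from outside the data. It reports power together with the probability that a significant estimate has the wrong sign (Type S error) and the expected factor by which a significant estimate overstates the assumed effect (Type M error, or exaggeration ratio). The function name and the example inputs are placeholders for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def design_analysis(assumed_effect, std_error, alpha=0.05):
    """Return (power, type_s, exaggeration) for a positive assumed effect.

    assumed_effect must come from prior or external information,
    not from the estimate observed in the study being evaluated.
    """
    if assumed_effect <= 0 or std_error <= 0:
        raise ValueError("assumed_effect and std_error must be positive")
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = assumed_effect / std_error
    p_hi = 1 - STD_NORMAL.cdf(z_crit - shift)  # significant, correct sign
    p_lo = STD_NORMAL.cdf(-z_crit - shift)     # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # Mean of |estimate| given significance, from truncated-normal means.
    mean_abs_significant = (
        assumed_effect * (p_hi - p_lo)
        + std_error * (STD_NORMAL.pdf(z_crit - shift)
                       + STD_NORMAL.pdf(-z_crit - shift))
    ) / power
    exaggeration = mean_abs_significant / assumed_effect
    return power, type_s, exaggeration


# Hypothetical inputs: a small assumed effect measured with a large standard error.
power, type_s, exaggeration = design_analysis(assumed_effect=0.3, std_error=3.3)
print(power, type_s, exaggeration)
```

When the assumed effect is small relative to the standard error, power sits barely above alpha, and any estimate that does reach significance has a substantial chance of the wrong sign and is many times too large.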
  
  Reply ↓
  - Anonymous on November 8, 2018 11:33 AM at 11:33 am said:
    
    Clear enough that we should not use the observed effect for post-hoc power analysis, but is it any more justified to use the observed variance to perform post-hoc design analysis, since good a priori information about variability seems to be even more challenging to come by?
    
    Reply ↓
    - Andrew on November 8, 2018 12:03 PM at 12:03 pm said:
      
      Anon:
      
      It depends. Sometimes you can get a stable estimate of variance using formulas such as 0.5/sqrt(n); other times, sure, it can make sense to perform a range of design analyses using different values for the variance.
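
One reading of the 0.5/sqrt(n) formula, and this is an assumption on my part, is that it is the largest possible standard error of an estimated proportion, attained at p = 0.5. A minimal sketch of that bound, with hypothetical function names:

```python
from math import sqrt


def se_proportion(p, n):
    """Standard error of a sample proportion with true value p and n observations."""
    return sqrt(p * (1 - p) / n)


def se_upper_bound(n):
    """Largest standard error a proportion can have: the value at p = 0.5."""
    return 0.5 / sqrt(n)


# The bound needs nothing from the data beyond the sample size.
n = 400
print(se_upper_bound(n))
print([round(se_proportion(p, n), 4) for p in (0.1, 0.3, 0.5)])
```

For proportions between roughly 0.3 and 0.7 the bound is within about 10% of the exact value, which is why it can stand in for a variance estimate taken from the observed data.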
- Keith O'Rourke on September 26, 2018 9:07 AM at 9:07 am said:
  
  J. Norway.
  People often take “illogic” labels too literally/at face value rather than trying to clearly discern what the logic should be (which I think Cohen didn’t).
  
  For instance, this from my past https://statmodeling.stat.columbia.edu/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/#comment-434462
  
  Reply ↓
- Kyle C on September 26, 2018 11:28 AM at 11:28 am said:
  
  So I am not alone in my reaction to Mayo’s work! I know our host considers it “rude” to keep saying this here, and I am genuinely sorry about that. But Mayo is a public intellectual whose quirky, allusive, academically philosophical prose makes it impossible for many educated people to grasp her points.
  
  Reply ↓
  - Andrew on September 26, 2018 12:08 PM at 12:08 pm said:
    
    Kyle:
    
    I don’t think it’s rude to say that you find Mayo’s writing to be confusing or that you find her ideas to be useless. I disagree; I find her ideas useful (not directly useful in my data analysis, but indirectly in helping me think about my own philosophy of statistics). But I don’t think it’s rude to express your views. The thing I thought was rude was when people were flinging personal insults. If you just want to say that someone’s work is confusing, wrong, useless, whatever, that’s fine: go for it, and explain your reasons.
    
    Reply ↓
- Kyle C on November 8, 2018 12:27 PM at 12:27 pm said:
  
  “I’ve read her blog and her first book several times, and frankly, I still don’t grok any distinction.” +++1
  
  Reply ↓
Shravan on September 27, 2018 3:33 AM at 3:33 am said:

Seems I can no longer post comments on this blog except from my phone, or using Tor. Others also facing this issue? What has happened?

Reply ↓
- Eric Vlach on September 27, 2018 9:29 AM at 9:29 am said:
  
  Hi Shravan,
  
  Can you help me understand your issues on this blog? What browsers are you using when you can’t post? What error message do you get?
  
  Reply ↓
  - Dale Lehman on September 27, 2018 10:12 AM at 10:12 am said:
    
    For around 2 weeks I have had recurring issues both posting and reading this blog. Sometimes my posted comments don’t show up for a long time. Sometimes, when I click to read comments on a post, there are no comments there (but there really are comments, I just can’t see them). Sometimes I see the comments only to return a few minutes later and they are gone – but they then reappear. Sometimes the new posts don’t appear for a long time, sometimes they appear earlier in the day (when I am used to seeing them). In short, the blog posting/reading has become quite erratic in the past 2 weeks or so. Something clearly seems to work differently, but I have no clue what it is.
    
    Reply ↓
    - Kyle C on September 27, 2018 11:18 AM at 11:18 am said:
      
      Same here. Using Chrome.
    - Eric Vlach on September 27, 2018 1:05 PM at 1:05 pm said:
      
      It appears our web host has implemented some aggressive caching. When one visitor looks at a page, the host caches it for ~10-20 minutes. For example the homepage comment count is rarely accurate since within those 20 minutes, new comments can roll in on posts. But then if you click on a post, you may be the first to view it within that 20 minute span, so it rebuilds the page in the cache. This leads to discrepancies in comments on the post, comment counts, sidebar comments, and comments not appearing immediately.
      
      I’ve attempted to turn off caching on both the host and on the CDN, to see if it improves. I will continue monitoring this thread in case there is additional information you or other commenters can provide. Hopefully the situation improves.
    - Thanatos Savehn on September 27, 2018 1:52 PM at 1:52 pm said:
      
      It worked for me. After Andrew’s suggestion that it was a caching problem on our end I scrubbed browsers on desktop and phone yet was still getting time shifted into the past; and only on this blog. I just refreshed the homepage and for the first time in a good while it appears to be current (if the “Postdoc position …” post is indeed current).
    - Thanatos Savehn on September 27, 2018 8:18 PM at 8:18 pm said:
      
      FWIW I take it back. currently time warped back to yesterday on one device and two days ago on another.
    - Greg Francis on September 27, 2018 1:37 PM at 1:37 pm said:
      
      I am having similar problems. It may be several issues, but one is that this site uses javascript to handle various aspects of comments. Some of the scripts seem to be hosted on a different site and are downloaded as needed. A conflict occurs because this site is run through secure http (https:) while the call to the off-site scripts is made by a non-secure http (http:) call. Some web browsers do not allow such calls (because it breaks security).
      
      This may be only part of the problem because I am not sure why I only sometimes have trouble viewing comments.
    - Eric Vlach on September 27, 2018 3:33 PM at 3:33 pm said:
      
      Thanks for the heads up about the mixed content! Will look into this.
Shravan on September 27, 2018 3:35 AM at 3:35 am said:

Test.

Reply ↓
- Martha (Smith) on September 28, 2018 8:12 PM at 8:12 pm said:
  
  Test
  
  Reply ↓
Tanu Kumar on September 28, 2018 8:47 AM at 8:47 am said:

How about calculating a minimum detectable effect?

Reply ↓
Peter on September 28, 2018 11:09 AM at 11:09 am said:

Why not ignore the observed effect size and use the observed standard errors to construct the ex post MDE for conventional parameters? Then the reader can decide if it’s well powered given the theory/existing evidence.
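
A minimal sketch of that ex post MDE calculation, assuming a normally distributed estimate and a two-sided test; "conventional parameters" is taken here to mean alpha = 0.05 and 80% power, and the function name is hypothetical.

```python
from statistics import NormalDist


def minimum_detectable_effect(std_error, power=0.80, alpha=0.05):
    """Smallest true effect that would be detected with the given power.

    Uses only the standard error of the estimate; the observed
    effect size plays no role.
    """
    std_normal = NormalDist()
    multiplier = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return multiplier * std_error


# With the conventional parameters the multiplier is about 2.8.
print(round(minimum_detectable_effect(1.0), 2))
```

The MDE is therefore about 2.8 observed standard errors, and the reader can compare that with what theory or prior evidence suggests is a plausible effect.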

Reply ↓

65 thoughts on “Don’t calculate post-hoc power using observed estimate of effect size”
