OK, so this is nothing new. Greg Francis said it, and Uri Simonsohn said it, Ulrich Schimmack said it, lots of people have said it. But it’s worth saying again.

To get NIH funding, you need to demonstrate (that is, convincingly claim) that your study has 80% power.

I hate the term “power” as it’s all tied into the idea of the goal of a study that results in statistical significance. But let’s set that aside for now, and just do the math, which is that with a normal distribution, if you want an 80% probability of your 95% interval excluding zero, then the true effect size has to be at least 2.8 standard errors from zero.
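The 2.8 figure can be checked directly. Here is a minimal sketch, assuming a two-sided test at the 5% level and a normally distributed estimate; the variable names are mine and not part of any standard power calculation.

```r
# Two-sided 5% test: the estimate must be at least this many standard errors from zero
z_crit <- qnorm(0.975)             # about 1.96

# For 80% power, the true effect must sit this far beyond the cutoff
z_power <- qnorm(0.80)             # about 0.84

# Required true effect, in units of standard errors
effect_in_se <- z_crit + z_power   # about 2.8

# Power at that effect size, ignoring the negligible chance of
# reaching significance in the wrong direction
power_check <- pnorm(effect_in_se - z_crit)

print(round(c(effect_in_se = effect_in_se, power = power_check), 2))
```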

All right, then. Suppose we really were running studies with 80% power. In that case, the expected z-score is 2.8, and 95% of the time we’d see z-scores between 0.8 and 4.8.

Let’s open up the R:

    > 2*pnorm(-0.8)
    [1] 0.42
    > 2*pnorm(-4.8)
    [1] 1.6e-06

So we should expect to routinely see p-values ranging from 0.42 to . . . ummmm, 0.0000016. And those would be clean, pre-registered p-values, no funny business, no researcher degrees of freedom, no forking paths.

Let’s explore further . . . the 75th percentile of the normal distribution is 0.67, so if we’re really running studies with 80% power, then one-quarter of the time we’d see z-scores above 2.8 + 0.67 = 3.47.

    > 2*pnorm(-3.47)
    [1] 0.00052

Dayum. We’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time, if we were running studies with minimum 80% power, as we routinely claim we’re doing, if we ever want any of that sweet, sweet NIH funding.

And, yes, that’s 0.0005, not 0.005. There’s a bunch of zeroes there.

And, no, this ain’t happening. We don’t have 80% power. Heck, we’re lucky if we have 6% power.

Remember that wonderful passage from the Nosek, Spies, and Motyl “50 shades of gray” paper:

We conducted a direct replication while we prepared the manuscript. We ran 1,300 participants, giving us .995 power to detect an effect of the original effect size at alpha = .05.

Followed by:

The effect vanished (p = .59).

**None of this should be a surprise**

When I say, “None of this should be a surprise,” I don’t just mean that, in response to the replication crisis and the work of Ioannidis, Button et al., etc., we should realize that statistically-based science is not what it’s claimed to be. And I don’t just mean that, given the real world of type M errors and the statistical significance filter, we should expect claims of statistical power (which are based on optimistic interpretations of a biased literature) to be wildly inflated. Issues of failed replications, type M errors, etc., are huge, and they contribute to the conditions that allow the erroneous power estimates.

But what I’m saying right here is that, even knowing nothing about any replication crisis, without any real-world experience or cynicism or sociology or documentation or whatever you want to call it . . . it just comes down to the math. With 80% power, we’d expect to see tons and tons of p-values like 0.0005, 0.0001, 0.00005, etc. This would just be happening all the time. But it doesn’t.
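To put rough numbers on "tons and tons," here is a sketch that computes the share of studies expected to fall below various p-value thresholds. It assumes the z-score is exactly normal with mean 2.8 and standard deviation 1 (that is, power is exactly 80% in every study); the threshold list is just an illustration.

```r
# Assumption: every study has a true effect 2.8 standard errors from zero
true_z <- 2.8
thresholds <- c(0.05, 0.005, 0.0005, 0.0001, 0.00005)

# |z| cutoff that corresponds to each two-sided p-value threshold
z_cut <- qnorm(1 - thresholds / 2)

# Probability that |z| exceeds the cutoff, counting both tails
frac_below <- pnorm(true_z - z_cut) + pnorm(-true_z - z_cut)

print(data.frame(p_threshold = thresholds,
                 fraction_below = round(frac_below, 3)))
```

The first row should recover the 80% power itself, and the later rows show how slowly the share drops off as the threshold shrinks.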

I should’ve realized this the first time I was asked to demonstrate 80% power for a grant proposal. And certainly I should’ve realized this when writing the section on sample size and power analysis in my book with Jennifer, over ten years ago, well before I’d thought about all the problems in statistical practice of which we are now so painfully aware. All the math in that section is correct—but the implications of the math reveal the absurdity of the assumptions.

**P.S.** In response to a couple of comments below: Yes, the p-value is conditioned on the assumed effect size. But the point of the power calculation for the NIH grant is to say that the power really is at least 80%, at least much of the time. The assumptions are supposed to be reasonable. Given that we’re not routinely seeing tons and tons of p-values like 0.0005, 0.0001, 0.00005, etc., this suggests that the assumptions are *not* so reasonable.

**P.P.S.** The funny thing is, when you design a study it seems like it should be so damn *easy* to get 80% power. It goes something like this: Assume a small effect size, say 0.1 standard deviations; then to get 2.8 standard errors from zero you just need 0.1 = 2.8/sqrt(N), thus N = (2.8/0.1)^2 = 784. Voila! OK, 784 seems like a lot of people, so let’s assume an effect size of 0.2 standard deviations, then we just need N = 196, that’s not so bad. NIH, here we come!
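The back-of-envelope arithmetic is short enough to write as a function. This is only a sketch of the calculation in the paragraph above, with a standard error of 1/sqrt(N) and a helper name (`n_for_80_power`) that I made up.

```r
# Hypothetical helper: sample size at which an effect of `d` standard deviations
# is `z` standard errors from zero, when the standard error is 1 / sqrt(N)
n_for_80_power <- function(d, z = 2.8) (z / d)^2

n_small  <- n_for_80_power(0.1)   # the "small effect" case
n_medium <- n_for_80_power(0.2)   # the more optimistic case

print(c(d_0.1 = n_small, d_0.2 = n_medium))
```

Because N scales with 1/d^2, doubling the assumed effect size cuts the required sample by a factor of four, which is exactly why optimistic effect sizes are so tempting.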

What went wrong? Here’s what’s happening: (a) effects are typically much smaller than people want to believe, (b) effect size estimates from the literature are massively biased, (c) systematic error is a thing, (d) so is variation across experimental conditions. Put it all together, and even that N = 784 study is not going to do the job—and even if you do turn up a statistically significant difference in your particular experimental conditions, there’s no particular reason to expect it will generalize. So, designing a study with 80% power is not so easy after all.

**P.P.P.S.** To clarify one more thing: I do *not* think the goal of an experiment should be to get “statistical significance” or any other sense of certainty. The paradigm of routine discovery is over, and it can’t be recovered with N = 784 or even N = 7840.

How do people normally justify their power-calculations? Is there a bit of “we had 30 subjects per condition ergo we have power = .8”?

In my statistics course we hardly covered power because in the opinion of the lecturer you basically had to know the truth already to do a proper power calculation anyway. Shouldn’t power calculations be given as an interval?

1. Pick a previous paper that showed an effect

2. Ignore all others who did not show an effect

3. Claim that you are operating under regime 1, but because you will “innovate” you can achieve 10x effect size in first study

Done

Power calculations are subjunctive: you don’t need to know the truth. For a given p-value (e.g., p < 0.05), a given effect size (e.g., 2.8 standard deviations from zero), and a given probability of achieving the p-value (e.g., 80%), we can deduce the minimum sample size needed to achieve that probability of getting the desired p-value. Any one of these items can be determined from the others. For example, if you know the number of subjects, p-value and probability, you can determine what the effect size has to be.

The typical solutions are minimum subjects to guarantee percentage chance of bounding p-value given effect size—those are points, not intervals. But if you have uncertainty in effect size, it propagates to uncertainty in necessary sample size.

I’ve always had a gripe with power calculations based on “effect size” considered as “so many standard deviations from zero”. My gripe is that measuring effect size in terms of “so many standard deviations” is an artifice. Any meaningful power calculation would need to measure “practical significance” in the units (centimeters, seconds, or whatever) of the actual measure, then consider standard deviation as another parameter to include in the “power” formula. It also needs to take into account that you have to estimate the standard deviation to plug it into the formula, which gives uncertainty in the output of the “power” formula.

I’ve heard it said that the request for power calculations isn’t really to ensure that the study has the necessary power, but rather to know what effect would be required for the study to have power = 0.8.

Of course, if you’re also completely making up the size of the error terms, then it’s nearly impossible to get anything meaningful out of the power calculation. But for some studies, it may be more reasonable to think carefully about what the typical error for this type of data is, rather than to think carefully about some new treatment effect no one has properly observed before.

Martha:

In educational testing settings it can make sense to present effects on the scale of standard deviations, where 1 standard deviation represents variation in some specified population such as all fourth-graders in the United States.

OK, I can see it in the context of standardized educational testing. But that is a pretty specialized situation.

Keep in mind that the effect size is just the difference divided by the standard deviation. So you can think of the power analysis in terms of raw units of measure if you just include the effect size calculation as the first step in calculating the power. If you’re looking for, say, a change in duration of 0.5 seconds or more as your measure of “practical significance”, and the standard deviation of the events is 1.0 seconds, then you can use those values to calculate an effect size of 0.5 and go from there. But if you prefer, there are also power calculators that use the raw values. For example: http://powerandsamplesize.com/Calculators/
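For what it's worth, base R's `power.t.test` already takes the raw difference and the raw standard deviation as separate arguments, so the standardization step happens implicitly. A sketch using the numbers above, assuming (my assumption, not stated in the comment) a two-group comparison with a two-sided t-test:

```r
# Assumption: two independent groups, two-sided t-test at the 5% level.
# delta and sd are in raw units (seconds), as in the example above.
res <- power.t.test(delta = 0.5, sd = 1.0, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(res$n)

# The same standardized effect in different raw units (milliseconds)
# leads to the same sample size
res_ms <- power.t.test(delta = 500, sd = 1000, sig.level = 0.05, power = 0.80)

print(c(seconds = res$n, milliseconds = res_ms$n))
print(n_per_group)
```

Only the ratio of `delta` to `sd` matters for the answer, which is the sense in which the standardized effect size is "just the difference divided by the standard deviation."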

Did he cover null hypothesis testing? You also need unrealistic assumptions to make it work…

The idea of power is the following:

I measure something with error sigma, and look for differences from 0 using null hypothesis testing (if the true value is 0, I will get a significant result with probability 0.05).

I am interested in particular in the alternative where the true value is 1 (maybe it is the theoretical prediction or the minimal effect that would be of practical importance).

I want to design the experiment so that in that case (true effect size = 1) the probability of getting a significant result is 0.8.

This allows me to calculate what sigma I need in my measurement (i.e. how many subjects in the sample) to achieve that kind of reliability.
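As a sketch of that last step, assuming a normally distributed estimate and a two-sided 5% test, the required sigma follows from the 2.8 rule in the post. The per-subject standard deviation of 5 below is a made-up number, only there to show how sigma turns into a sample size.

```r
# Alternative of interest: true value = 1
true_value <- 1

# Standard error (sigma of the estimate) needed for 80% power
se_needed <- true_value / (qnorm(0.975) + qnorm(0.80))   # about 0.36

# Check: power at that standard error, ignoring the far tail
power_at_se <- pnorm(true_value / se_needed - qnorm(0.975))

# Hypothetical per-subject standard deviation, used only for illustration
per_subject_sd <- 5
n_needed <- ceiling((per_subject_sd / se_needed)^2)

print(c(se_needed = round(se_needed, 3), power = round(power_at_se, 2), n = n_needed))
```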

we’d see p-values above 2.8 + 0.67 = 3.47.

I think it should be

we’d see z-values above 2.8 + 0.67 = 3.47.

Darrel:

Fixed; thanks.

Also, shouldn’t this

“(a) effects are typically much larger than people want to believe”

rather be

“(a) effects are typically much smaller than people want to believe”

People claim 80% power to detect pre-specified effect sizes of interest, not 80% power to detect whatever the actual effect size might happen to be.

> We’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time, if we were running studies with minimum 80% power, as we routinely claim we’re doing, if we ever want any of that sweet, sweet NIH funding.

Only if the true effect size happens to be equal to the precise alternative hypothesis used to calculate the power when the experiment was designed.

Are you assuming that the alternative hypothesis is always true?

Z, Carlos:

See P.S. above. The point is that in an NIH proposal you have to make the case that the proposed effect sizes are realistic. They won’t be right every time, but they’re supposed to be close enough that one would routinely see results that are 4 or 5 standard errors from zero. Indeed, the effect sizes assumed for the 80% power are typically stated as conservative assumptions, as the idea is that power should be *at least* 80%.

When calculating power, you don’t choose the effect size by guessing what the real effect size is likely to be. You choose the smallest effect size that would be (clinically) significant if true.

I’m sure sample sizes are generally too small since all the incentives for researchers point in that direction. But I don’t get much from your calculations because:

1. For a significant number of studies that the NIH funds, there is a sizable prior probability of a near zero or very small effect.

2. Powers are all calculated for effects large enough to be clinically meaningful.

“When calculating power, you don’t choose the effect size by guessing what the real effect size is likely to be. You choose the smallest effect size that would be (clinically) significant if true.”

Came in to say this. Power is not an actual quantity. It is not something to be estimated. It is a hypothetical that guides experimental design. The “true” effect size is irrelevant. We should not “[s]uppose we really were running studies with 80% power” because studies don’t have power; tests+designs have power *curves*. When you look at it that way (the intended way, per Neyman) the issue disappears.

It’s not a problem with the concept of power; the problem is the thing people *think* power is, and often critiques reinforce the same wrong idea of power that leads to bad designs.

(on preview, what Mayo said below)

Some good points here, but I think some clarification is still warranted. In some sense, it is fair to say that “power is not an actual quantity”. However, the phrase “The power of this hypothesis test to detect a difference of at least this much against this particular alternate hypothesis under the conditions of this particular study” is well-defined, and can be estimated if one has enough information.

I agree that it is better to think in terms of a power curve than just “power of this study”, since the power curve does show some (although not all) of the dependence of “power” on the conditions under which it is being considered.

The concept of power can be useful (if used carefully) in getting an estimate of what size sample is needed to detect a desired difference.

For a little more detail, see pp. 14ff of the link “Slides Day 3” and references to the link “Appendix Day 3” at https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

Maybe this is obvious, but doesn’t this just provide more evidence that effect sizes are routinely exaggerated in grant proposals and thus presumably the literature they are based on? Would fit nicely with the whole type M error analysis….

I see that you wrote “And I don’t just mean that, given the real world of type M errors and the statistical significance filter, that we should expect claims of statistical power (which are based on optimistic interpretations of a biased literatures) will be wildly inflated.” But I think this point is really important and shouldn’t be buried. It is still sometimes difficult to convince people that type M is real, pervasive, and practically quite significant – I think your analysis of the distribution of p-values you would expect with power routinely around 0.8 would be pretty convincing for researchers brought up with NHST.

Chris:

I’ve rewritten to clarify.

I don’t know what it means “to say that the power really is at least 80%, at least much of the time.”

I have no experience with NIH grants. Is there an expectation that the true effect will, most of the time, be larger than assumed in the proposal?

Carlos:

With NIH proposals, the assumptions underlying the power analysis are supposed to be reasonable, and typically they’ll be supported based on some reference to the literature. It’s not enough just to say that this study has 80% power to estimate effect X, and we’d really like X to be true. You’re supposed to supply some evidence that X is plausible, and when I’ve seen these things, these are typically presented as conservative assumptions, implying that the true power is higher than 80%.

Ok, thanks. Then it’s not so “damn easy” to get 80% power… your P.P.S. makes it look as if the assumed effect size didn’t have to be justified at all.

Carlos:

I’ve rewritten to clarify.

Your point seems to be that people use unrealistic assumptions for the effect size and disappointment ensues.

Requiring power calculations in grant proposals seems useful. If investigators have to specify what effect size would be required for the study to be worthwhile, it makes it easier for reviewers, as they just have to judge whether that effect size is reasonable or not.

On the other hand, it seems reviewers don’t really care about your hypothesis, if you can just use whatever effect size is needed to make the power of your study 80%…

Carlos:

The trouble is that the funding agency is asking people to demonstrate something that generally isn’t close to the truth, and then is rewarding people for delivering the lies that are demanded. NIH should either stop funding these studies, or remove the goal of statistical significance and the expectation that statistically significant results represent scientific truth. Remember the Lance Armstrong principle: If you push people to promise more than they can deliver, they’re motivated to cheat.

> If you push people to promise more than they can deliver, they’re motivated to cheat.

Agree, and a lot of this comes from an implicit assumption that there needs to be essentially one definitive trial which is usually very bad economy of research. For instance, the cost can become so high that no matter how good the design and sound the reasoning it would not get funded for fiscal reasons and researchers learn this.

Instead one could fund well-designed trials that test the design with a reasonable number of patients, where these well-tested designs are then franchised to other researchers in other teaching hospitals and the data shared. If desirable, the designs could be sequentially improved.

“After all, the ultimate objective is to be able to answer the question or in some cases go on to another question we may be better able to answer. To do otherwise, is to ensure inefficient use of the scarce resources available for research.” https://www.ncbi.nlm.nih.gov/pubmed/2809651

Unfortunately the business model of most funding agencies seems to be more about seeding and building prestige of academics and academic groups.

What Keith describes sounds very much like a quality approach that is used in some industries. As his last sentence suggests, introducing this into “the business model of most funding agencies” would likely be an improvement in the ultimate results of (i.e., information gained from) clinical trials.

> Remember the Lance Armstrong principle

And the Wells Fargo principle, and the AIG principle, and… the examples keep piling up.

And even more perniciously, you select for cheaters and cast out the honest.

Andrew:

While I didn’t get to read all 55 comments yet, I want to make two points. First, I agree with Z, that what is wanted is to show the test has a good probability of revealing (by dint of a stat sig result) a discrepancy of a magnitude of interest, not a magnitude expected or known true. For example, if a risk increase is such and such, you want to have a high probability of detecting this. That’s what power lets you do. (Post data, it lets you set an upper bound to the discrepancies ruled out by negative results.) Power is too coarse post data (I prefer severity) but is useful for planning.

My second, and main point is this: it’s precisely because having .8 power to detect population effects that “we wouldn’t want to miss” (Senn) does NOT mean we consider those effect sizes actual or even approximately true, that some of us question the supposition that alternatives corresponding to .8 power are implicitly being assigned .5 prior probability in statistical testing. You will see this supposition, for example, in the paper on lowering p to .005 (Benjamin et al., “Redefine Statistical Significance,” p. 7), where a 0 null is given .5 prior.

This was already visible in Geoff Cumming’s dance of p-values, back in 2009: https://www.youtube.com/watch?v=ez4DgdurRPg. At 7:30 in the video he sets the power to 0.8, and you can see on the histogram that 30% of the p-values are less than 0.001.
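You don't need the video to see this; a few lines of simulation reproduce the histogram. This is only a sketch, assuming z-scores drawn from a normal distribution with mean 2.8 and standard deviation 1 (the seed and number of simulated studies are arbitrary choices of mine).

```r
set.seed(1)
n_sims <- 100000

# Each simulated study has a true effect 2.8 standard errors from zero (80% power)
z <- rnorm(n_sims, mean = 2.8, sd = 1)
p <- 2 * pnorm(-abs(z))            # two-sided p-values

share_sig  <- mean(p < 0.05)       # should be close to the nominal power
share_tiny <- mean(p < 0.001)      # share of very small p-values

print(round(c(p_below_0.05 = share_sig, p_below_0.001 = share_tiny), 3))
```

Calling `hist(p, breaks = 50)` on the result gives a rough version of the picture in the video.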

Sadly, Cumming’s video is not widely enough used — ought to be shown in every intro stats course.

Hm. I feel like I do see that kind of P-values routinely, with e.g. clinical trials. But I guess those are an outlier in that effect sizes etc are much better known in advance than for most observational/epidemiological/psychological studies.

I am seeing this as a (global rather than specific) check of the data generating model and point effect priors that needs to be done over many repeated studies. Being a global check, many things could be going wrong, e.g. point effect prior being wrong, effect not being constant but varying, randomization was not done correctly, the variance not being constant, constraints on the individual improvements possible (ceiling effects), non-compliance (some in control group getting the active), distribution not being Normal, etc.

Some of these would be context specific, ranging from exaggerating the point effect prior in order to get to do a small manageable trial, to demanding that it be the smallest possible effect worth learning (MCID). It was the latter that concerned me most in clinical research, in that promising effects (i.e. where credible background reasoning suggested it was more than small) were not getting funded: even small effects would be worth learning about, but the funding required for a trial large enough to have 80% power for that would be larger than any one funding agency would agree to.

Yes, my critiques of articles that have lots of p-values just below .05, are essentially model checks. I don’t see it as being a global check, but as being a check of the multiple studies that are related to the authors’ theoretical claims. A global check may be a good way of thinking about general problems in the field, but it does not necessarily challenge any particular set of findings; and I think being more specific is usually better in cases like these.

Greg:

> I don’t see it as being a global check

By global I just meant not specific to a particular part of the model being too wrong – more specific are almost always better.

Additionally these checks can only be done over multiple studies when as it was put in Mosteller & Tukey, Data Analysis and Regression: A Second Course – there is access to the real uncertainty.

It is APA style to round all p-values lower than .001 to p 8 sigmas. (Yes, we did power analyses before the studies.)

I was surprised—the impression of “moderately low p-values only” is partly based on censored reporting.

Comment weirdly truncated. Missing “I looked at my most recent APA article, 7 tests of main hypothesis, p8 sigmas.”

Did you use a <? That opens an HTML tag…

You need to write &lt; if you want a < to appear (and you need to write &amp;lt; if you want &lt; to appear, and so on…)

1. Jacob Cohen (1988), who is famous in psychology for his contribution to power analysis, developed rough guidelines for power calculations. You just need to make a decision whether you think the effect will be large (e.g., a study of the startle response in response to an unexpected gun shot), medium, or small (e.g., the effect of a subtle priming manipulation on behavior or the effect of income on a noisy single-item happiness item). Once you have made this rough judgment, Cohen provides effect sizes that can be used for an a priori power calculation (e.g., with free software like GPower).

2. Perverting the idea of effect sizes, psychologists ignored them for a priori power analysis (unless they had to provide one for a grant proposal), but they used Cohen to interpret the OBSERVED effect sizes with confidence intervals so wide you could drive a truck through them, which is why psychologists do not report them (Cohen, 1994). With the help of low power and publication bias, these observed effect sizes are inflated and papers used Cohen to claim medium to large effects (which do not replicate in replication studies).

3. One problem was that true power is unknown a priori and a posteriori for a single study and observed power in a study with a significant result has to be greater than 50%. So, researchers never realized how low the true power of their studies was. Jerry Brunner and I developed a method that can estimate the true power of a set of studies (e.g., in a journal, on a topic, by a specific researcher) based on the observed test statistics (converted into z-scores). The results often show very low power.
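The claim that observed power is above 50% for any significant result is easy to check. A sketch, assuming a two-sided z-test and the usual definition of observed power (treat the observed z-score as if it were the true effect); the function name is made up for this example.

```r
# Hypothetical helper: "observed power" treats the observed z-score
# as if it were the true effect, for a two-sided test at level alpha
observed_power <- function(z_obs, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(abs(z_obs) - z_crit) + pnorm(-abs(z_obs) - z_crit)
}

# A result that is only just significant, and some stronger ones
z_values <- c(qnorm(0.975), 2.5, 3, 4)
obs_pow <- observed_power(z_values)

print(round(obs_pow, 3))
```

A just-significant result sits right at the 50% floor, so observed power computed only from published significant results can never reveal power below that, however low the true power is.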

Link to the z-curve paper.

https://replicationindex.wordpress.com/2017/11/16/preprint-z-curve-a-method-for-the-estimating-replicability-based-on-test-statistics-in-original-studies-schimmack-brunner-2017/

Link to a z-curve analysis of John A. Bargh’s book on unconscious processes.

https://replicationindex.wordpress.com/2017/11/28/before-you-know-it-by-john-a-bargh-a-quantitative-book-review/

The investigator degrees of freedom in developing a power analysis rival the degrees of freedom in the analytic garden of forking paths.

In addition to the parameters mentioned, if the study involves a multilevel model (as is increasingly often the case) you can also play with the estimates of the numerous variance components. Those are actually the best ones to manipulate since we usually know next to nothing about them at the time we plan the study and almost any claim is at least defensible against the charge of intentional manipulation.

There is also, in human subjects research, the question of attrition from the study. One typically projects low attrition rates that are seldom achieved in reality.

Another issue in time-to-event studies is the expected incidence rate in the control group. The claim in the power analysis is almost always much higher than seen in the final study. This is typically not the result of deceit so much as the fact that for many conditions, the prognosis is indeed getting better over time, so an incidence rate for a bad outcome grabbed from a study that is even just 10 years old is likely to be a substantial overestimate.

Given also that the power function has some steep sections, several tiny pushes in an optimistic direction can add up to a huge overestimate of power.

If you do sensitivity analyses on your power calculations, of course you will see all of this. I have on more than one occasion done that and found that with my proposed design the power is plausibly somewhere between 5 and 99.5%! Given the limitations on the length of the proposal, and all the other things that need to be in there, you can’t put all of that in there. So you pick a set of assumptions that you think are a reasonable likely-case analysis. But humans being humans, that set of assumptions is likely to be biased in the optimistic direction.
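A toy version of such a sensitivity analysis, to show how wide the range gets. Everything here is hypothetical: a two-group t-test design with 64 per group, planned around an effect of 0.5 and a standard deviation of 1, then re-evaluated over a grid of less and more favorable values.

```r
# Hypothetical planning values: 64 per group, nominal effect 0.5, nominal sd 1.
# The grid covers pessimistic through optimistic alternatives.
grid <- expand.grid(delta = c(0.1, 0.3, 0.5, 0.7),
                    sd    = c(0.8, 1.0, 1.2))

# Power of a two-sided, two-sample t-test at each combination
grid$power <- mapply(
  function(d, s) power.t.test(n = 64, delta = d, sd = s, sig.level = 0.05)$power,
  grid$delta, grid$sd
)

nominal_power <- grid$power[grid$delta == 0.5 & grid$sd == 1.0]

print(transform(grid, power = round(power, 3)))
print(round(range(grid$power), 3))
```

Even this small grid, which leaves out attrition, variance components, and control-group incidence, runs from near the significance level to near certainty.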

When I review grants for NIH, I don’t even bother to read the power analysis sections. That is, unless I’ve had a down day and I need a good laugh.

Lots of good points here.

I don’t see the requirements for power calculations going away — when resources (lives, materials, money) are at stake, it is reassuring to have some sense of what the Ns should look like to do a good study without wasting those resources. I’m disturbed when I see investigators low-balling the N — just trying to get the funding to do a study where they’re unlikely to see a real effect. There is room for improvement.

I’m bothered by power analyses failing to take the uncertainty of the preliminary estimates of effect size and variation into account. One thing I’ve begun doing recently is to use the lower bound of the 95% confidence interval of the estimate of the mean effect size, and use this for the effect size in the power calculation — this should inflate the N appropriately considering the uncertainty of the mean, though it ignores the uncertainty in the estimate of the standard deviation. I’ve given some thought to bootstrapping the available data to obtain a confidence interval for the power analysis’ N, and reporting the upper bound of this. I’d like to have a better handle on the theory behind these sorts of adjustments.

Yes, a point typically ignored is that power based on a point estimate of effect size, even an unbiased one, will *always* be biased if the standard error of the estimate is ignored. For a 2-group mean comparison with all the gods on your side (assumptions of t-test met), an effect size of .5 sd requires 64 per group for 80% power. Say the estimate has a standard error of .1. An effect size of .6 would require only 45 per group, but an effect of .4 requires 100 per group. The error of the effect size estimate may be symmetrical about the mean, but the power function is not.
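The asymmetry can be checked with base R's `power.t.test`. A sketch, assuming a two-sided, two-sample t-test at the 5% level with effects expressed in standard deviation units:

```r
# Effect sizes one standard error (0.1) below, at, and above the point estimate
d <- c(0.4, 0.5, 0.6)

# Per-group sample size for 80% power at each effect size, rounded up
n_per_group <- sapply(d, function(x) {
  ceiling(power.t.test(delta = x, sd = 1, sig.level = 0.05, power = 0.80)$n)
})
names(n_per_group) <- paste0("d_", d)

print(n_per_group)

# Extra subjects needed if the effect is smaller vs. subjects saved if it is larger
extra_if_smaller <- n_per_group[1] - n_per_group[2]
saved_if_larger  <- n_per_group[2] - n_per_group[3]
print(c(extra = unname(extra_if_smaller), saved = unname(saved_if_larger)))
```

The penalty for an overestimated effect is larger than the savings from an underestimated one, so averaging over a symmetric uncertainty in the effect size pushes the required sample size up, not sideways.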

I do remember the OH S**t! reaction of some other statisticians in Toronto when I pointed out I was using the upper confidence limit for the sd in power calculations.

The bigger problems here are not clearly understanding that it’s just economics (economy of research) and implicitly assuming it all has to be done in one or at the very most two studies. More often than not it should be a series of studies, with the initial ones targeted at learning how many studies, modified how, are needed to get an economic answer (i.e. cost benefit of adopting the treatment versus continued or discontinued research on this research question).

“More often than not it should be a series of studies, with the initial ones targeted at learning how many studies, modified how, are needed to get an economic answer (i.e. cost benefit of adopting the treatment versus continued or discontinued research on this research question).”

This sounds like the view of experimentation often used in industrial quality assurance: Design experiments so each one leads to a better design of the next one.

However, the problems pointed out by Clyde are important.

> Clyde are important.

Agree and I have been there but I believe the only feasible way forward is to get beyond funding single trials for single academic groups. Also see my recent comment to Andrew.

The problem is that the low-hanging fruit in health research are long gone. There is nothing around like penicillin reducing the mortality of pneumococcal pneumonia from close to 100% to close to 0%. The incremental effects that new interventions can bring about are typically quite modest, and detecting them requires studies that are very large or very long in duration. If we had realistic power analyses, grant applications would need budgets that are beyond the limits of the funding mechanisms NIH uses. If we want that kind of research done, then we have to pay for it, and that almost certainly entails doing fewer research projects in order to better resource the ones we do. The incentives in the system, however, all work in the opposite direction. Moreover, it isn’t clear that the scientific and administrative review systems used by NIH are up to the task of properly identifying this narrower set of more important studies to fund.

There are still new therapies that can be shown to work with small trials: https://goo.gl/EUAVAQ

(I’m sorry for the shortened link, but for some reason the system doesn’t like the full URL)

Or the NHST paradigm took over medical research because it made it so easy to “discover” something, leading to a literature filled with misinformation. This wastes 90% of the time and energy of people in the prime of their life who might otherwise get something done. Besides just figuring that out and what to do about it personally, there is then the obstacle of dealing with all the people who haven’t yet.

It seems more plausible to me that our understanding of health and the human body is extremely rudimentary… I mean you can read Herodotus and find arguments that eating a primarily grain diet is like “eating dirt” (ie leads to early mortality vs other diets).

Do we really know whether or not (or more likely, under what conditions) that is the case yet? Not afaict, yet the opposite claim was embedded in the US food pyramid until not too long ago, but now has gone out of fashion. People need to stop with this overconfidence on health matters.

As both Z and Andrew point out, this concerns an assumed effect size.

So what does this tell us, aside from the fact that people can’t estimate a prior alternate effect size, or that they engage in wish-casting? If there is no penalty for being wrong, there is no incentive to be accurate. If there is a trade-off between getting funded and engaging in unpenalized wishful thinking what would you expect to happen?

Attach a real penalty to not delivering the promised effect size and see what happens. This occurs in some commercial environments (product development and QA). Shipping a defective product can bring a visit from, say, the FDA or measurable market volume loss.

+1

I concur with Z and Andrew too

If you attach a “real” penalty (as though not getting funded is not a real penalty for a junior academic in particular) you will get the same response, just like you do when you say you will close schools and fire teachers if they don’t achieve “real” goals on test scores. What you get is the normal range of human responses from cheating on the tests to suicide of the fired teachers to people leaving the job.

I think the impetus needs to lie with the funding source. The beauty of the power calculation is that it gives transparency to the assumptions used by researchers who would like to be funded. The NIH can, at any time, decline to fund a study because the power calculation is based on unrealistic or impractical effect sizes and variation estimates. There can and should be policing on this by a knowledgeable funding committee, so that the real penalty is simply not having your study funded.

Also, some scales have validation studies that show how the scale corresponds to other measures, i.e. quality of life vs. disability. Researchers should be citing these studies as well to help facilitate the decision by the funding committee.

You said, “(a) effects are typically much larger than people want to believe”. Don’t you mean, “(a) effects are typically much *smaller* than people want to believe”?

Fixed; thanks.

Regarding needing to know the truth, my opinion is that most of the time you need to know at least some part of the truth. You can say you just need an effect size, which doesn’t need to be the truth. I understand this in the case of specifying a meaningful effect to detect, but real measures have distributions. If you want to detect an increase in blood pressure or whatever of at least X units, you still have to specify the variance of the underlying distribution, and if you get that wrong, the power calculation for the meaningful effect isn’t of much use. For things like blood pressure, which have been studied for a very long time, this might not be a problem, but in most of the stuff I work on, and in most of the papers I read, the uncertainty in these estimates is so large that power calculations seem more like a formality to get grants than something that is actually useful. Please correct me if I’m wrong.
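To make the variance point concrete, here is a minimal sketch in R. All of the numbers are hypothetical planning values (a 5-unit effect, a guessed SD of 10, 64 people per arm), not figures from any real study. It holds the design fixed and asks what the power would be if the true SD turned out larger than the guess:

```r
# Hypothetical planning values, chosen only for illustration
delta     <- 5    # smallest increase worth detecting (e.g. mmHg)
n_per_arm <- 64   # sample size sized for roughly 80% power if the SD really is 10

# Candidate values for the true SD; the first one is the guess used in planning
sd_true <- c(10, 12, 15, 20)

# Power of a two-sample, two-sided t-test at alpha = 0.05 for each candidate SD
power <- sapply(sd_true, function(s) {
  power.t.test(n = n_per_arm, delta = delta, sd = s, sig.level = 0.05)$power
})

data.frame(sd_true = sd_true, power = round(power, 2))
```

The design only delivers its nominal power in the first row, where the guessed SD is right; power falls steadily as the true SD grows, even though the "meaningful effect" never changed.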

The low power problem bugged me so much in the semiconductor industry that I wrote 2 papers about it around 1995. Variability estimates come naturally from routine manufacturing statistics, which in semicon were tracked carefully because they are economically important. The sample size is determined by how many production lots (e.g. 24 wafers each) you are willing to run in the experiment – each lot adds to the cost.

What I found was that small process improvements were almost impossible to detect, using the then-standard experimental methods. For example, if an experiment has a genuine yield impact of 0.2 percent, that can be worth a few million dollars. (A semiconductor fabrication facility produced at that time roughly $1 to $5 billion of output per year.) But a change of that size was lost in the noise. Only when the true effect rose into the 1% or higher range was there much hope of detecting it. (And a 1% yield change, from a single experiment, would be spectacular.)
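A rough sketch of why the small effects get lost, using the normal approximation and the 2.8-standard-error rule from the post. The lot-to-lot standard deviation below is a made-up number for illustration (it is not taken from the papers), and `lots_needed` is a hypothetical helper name:

```r
# Lots per arm needed for a given power in a two-arm comparison of mean yield,
# normal approximation: n = 2 * ((z_alpha + z_power) * sigma / delta)^2
lots_needed <- function(delta, sigma, alpha = 0.05, power = 0.80) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)  # about 2.8 for 80% power at alpha = 0.05
  ceiling(2 * (z * sigma / delta)^2)
}

sigma_lot <- 2                 # HYPOTHETICAL lot-to-lot SD of yield, in percentage points
effects   <- c(0.2, 0.5, 1, 2) # true yield changes, in percentage points

data.frame(
  effect       = effects,
  lots_per_arm = sapply(effects, lots_needed, sigma = sigma_lot)
)
```

Whatever the real noise level is, the required number of lots scales as 1/effect^2, so a 0.2% effect needs about 25 times as many lots as a 1% effect.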

Yet semicon engineers were running these experiments all the time, and often acting on the results. What was going on? One conclusion was that most good experiments were “short loop” trials, meaning that the wafers did not go all the way through the process. For example, you could run an experiment on a single mask layer, and then measure the effect on manufacturing tolerances. (Not the right terminology in semicon, but that is what they are called elsewhere.) In this way, the only noise was from the single mask layer. Such an experiment would not tell you the impact on yields, but an engineering model could estimate the relationship between tolerances ===> yields. Now, small changes were detectable with reasonable sample sizes.

– “Noise and Learning in Semiconductor Manufacturing”, Bohn 1995: https://www.dropbox.com/s/ggzjilo437pfhku/1995-bohn.pdf?dl=0

– “The impact of process noise on VLSI process improvement”, Bohn 1995: https://www.dropbox.com/s/prdbzkdq8l1yj01/1995-bohn-2.pdf?dl=0

– “Learning and Process Improvement during Production Ramp-Up”, Terwiesch & Bohn 1998: https://cloudfront.escholarship.org/dist/prd/content/qt5zf5q453/qt5zf5q453.pdf

For grant purposes in clinical trials, it would probably be better to focus on the width of the confidence interval (or credible interval) for the difference between two treatments as opposed to statistical significance. What you want is for the study to be large enough so that the difference between treatment groups is reasonably estimated – whether that interval contains zero shouldn’t be relevant for deciding whether or not to fund.
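One way this could look in practice is to size the study for a target interval width rather than for power. This is a sketch under simple assumptions (two equal arms, a common known SD, normal approximation); `n_for_halfwidth` and the example numbers are hypothetical:

```r
# Per-arm sample size so that the confidence interval for the difference in
# means has (approximately) the requested half-width.
# Half-width = z * sigma * sqrt(2 / n), so n = 2 * (z * sigma / halfwidth)^2.
n_for_halfwidth <- function(halfwidth, sigma, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(2 * (z * sigma / halfwidth)^2)
}

# Hypothetical example: outcome SD of 10 units, and we want the 95% interval
# for the treatment difference to be about +/- 2.5 units
n_for_halfwidth(halfwidth = 2.5, sigma = 10)
```

Note that no assumed effect size enters the calculation, only the precision you want and the outcome's variability, so the wishful-thinking problem moves from the effect size to the SD.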

I’ve wanted to ask this for a while:

Are there any examples out there of a good “sample size” section in an NIH grant that proposes a Bayesian analysis (or at least an analysis that’s not all based on p-values)? I’ve been in some situations where I want to write up something like this, but I don’t know what grant reviewers respond to.