Some good points here, but I think some clarification is still warranted. In some sense, it is fair to say that “power is not an actual quantity”. However, the phrase “The power of this hypothesis test to detect a difference of at least this much against this particular alternate hypothesis under the conditions of this particular study” is well-defined, and can be estimated if one has enough information.

I agree that it is better to think in terms of a power curve than just “power of this study”, since the power curve does show some (although not all) of the dependence of “power” on the conditions under which it is being considered.
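For readers who want to see what a power curve looks like in numbers, here is a minimal Python sketch. The design (a two-group comparison of means with 64 per group, two-sided alpha of .05) and the function name `power_two_sample` are assumptions chosen for illustration, and the normal approximation stands in for the exact noncentral t calculation.

```python
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    d is the standardized difference (difference / SD); the normal
    approximation is used in place of the noncentral t distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected z statistic under effect d
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical design: 64 per group, evaluated over a grid of alternatives.
effects = [i / 10 for i in range(0, 11)]
curve = [(d, power_two_sample(d, n_per_group=64)) for d in effects]

for d, p in curve:
    print(f"effect {d:.1f} SD -> power {p:.3f}")
```

The output is a whole column of powers, one per alternative, which is the point: the same test and design have a different power against each effect size, and the curve still leaves out other conditions such as attrition or a misjudged standard deviation.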

The concept of power can be useful (if used carefully) in getting an estimate of what size sample is needed to detect a desired difference.

For a little more detail, see pp. 14ff of the link “Slides Day 3” and references to the link “Appendix Day 3” at https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

]]>“When calculating power, you don’t choose the effect size by guessing what the real effect size is likely to be. You choose the smallest effect size that would be (clinically) significant if true.”

Came in to say this. Power is not an actual quantity. It is not something to be estimated. It is a hypothetical that guides experimental design. The “true” effect size is irrelevant. We should not “[s]uppose we really were running studies with 80% power” because studies don’t have power; tests+designs have power *curves*. When you look at it that way (the intended way, per Neyman) the issue disappears.

It’s not a problem with the concept of power; the problem is the thing people *think* power is, and often critiques reinforce the same wrong idea of power that leads to bad designs.

(on preview, what Mayo said below)

]]>I think the impetus needs to lie with the funding source. The beauty of the power calculation is that it gives transparency to the assumptions used by researchers who would like to be funded. The NIH can, at any time, decline to fund a study because the power calculation is based on unrealistic or impractical effect sizes and variation estimates. There can and should be policing of this by a knowledgeable funding committee, so that the real penalty is simply not having your study funded.

Also, some scales have validation studies that show how the scale corresponds to other measures, i.e. quality of life vs. disability. Researchers should be citing these studies as well to help facilitate the decision by the funding committee.

]]>– “Noise and Learning in Semiconductor Manufacturing”, Bohn 1995: https://www.dropbox.com/s/ggzjilo437pfhku/1995-bohn.pdf?dl=0

– “The impact of process noise on VLSI process improvement”, Bohn 1995: https://www.dropbox.com/s/prdbzkdq8l1yj01/1995-bohn-2.pdf?dl=0

– “Learning and Process Improvement during Production Ramp-Up”, Terwiesch & Bohn 1998: https://cloudfront.escholarship.org/dist/prd/content/qt5zf5q453/qt5zf5q453.pdf

If you attach a “real” penalty (as though not getting funded is not a real penalty for a junior academic in particular) you will get the same response, just like you do when you say you will close schools and fire teachers if they don’t achieve “real” goals on test scores. What you get is the normal range of human responses from cheating on the tests to suicide of the fired teachers to people leaving the job.

]]>What Keith describes sounds very much like a quality approach that is used in some industries. As his last sentence suggests, introducing this into “the business model of most funding agencies” would likely be an improvement in the ultimate results of (i.e., information gained from) clinical trials.

]]>I concur with Z and Andrew too

]]>And even more perniciously, you select for cheaters and cast out the honest.

]]>> Remember the Lance Armstrong principle

And the Wells Fargo principle, and the AIG principle, and… the examples keep piling up.

]]>Andrew:

While I didn’t get to read all 55 comments yet, I want to make two points. First, I agree with Z that what is wanted is to show the test has a good probability of revealing (by dint of a stat sig result) a discrepancy of a magnitude of interest–not a magnitude expected or known true. For example, if a risk increase is such and such, you want to have a high probability of detecting this. That’s what power lets you do. (Post data, it lets you set an upper bound to the discrepancies ruled out by negative results.) Power is too coarse post data (I prefer severity) but is useful for planning.

My second, and main point is this: it’s precisely because having .8 power to detect population effects that “we wouldn’t want to miss” (Senn) does NOT mean we consider those effect sizes actual or even approximately true, that some of us question the supposition that alternatives corresponding to .8 power are implicitly being assigned .5 prior probability in statistical testing. You will see this supposition, for example, in the paper on lowering p to .005. (p. 7). A 0 null is given .5 prior. file:///Users/deborahmayo/Downloads/BenjaminEtAlRedefineStatisticalSignificance(1)%20(5).pdf

]]>Are there any examples out there of a good “sample size” section in an NIH grant that proposes a Bayesian analysis (or at least an analysis that’s not all based on p-values)? I’ve been in some situations where I want to write up something like this, but I don’t know what grant reviewers respond to. ]]>

Keep in mind that the effect size is just the difference divided by the standard deviation. So you can think of the power analysis in terms of raw units of measure if you just include the effect size calculation as the first step in calculating the power. If you’re looking for, say, a change in duration of 0.5 seconds or more as your measure of “practical significance”, and the standard deviation of the events is 1.0 seconds, then you can use those values to calculate an effect size of 0.5 and go from there. But if you prefer, there are also power calculators that use the raw values. For example: http://powerandsamplesize.com/Calculators/
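A rough Python sketch of those two steps (raw units to effect size, then effect size to sample size) might look like the following. The two-group design, the 80% power target, the two-sided alpha of .05, and the function name `n_per_group` are my assumptions, not part of the comment; the formula is the usual normal approximation plus a small correction term often added to approximate the t-test, so exact software may differ slightly.

```python
import math
from statistics import NormalDist

def n_per_group(diff, sd, power=0.80, alpha=0.05):
    """Return (effect size, approximate n per group) for a two-group comparison.

    diff and sd are in raw units (e.g. seconds). Uses the normal
    approximation with a small-sample correction term (z_a**2 / 4).
    """
    d = diff / sd  # step 1: standardized effect size
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    n = 2 * (z_a + z_b) ** 2 / d ** 2 + z_a ** 2 / 4  # step 2: sample size
    return d, math.ceil(n)

# The example from the text: a 0.5 second change with an SD of 1.0 seconds.
d, n = n_per_group(diff=0.5, sd=1.0)
print(f"effect size {d:.2f}, about {n} per group")
```

Because only the ratio matters, a 1.0 second change with a 2.0 second standard deviation gives the same answer.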

]]>> Clyde are important.

Agree, and I have been there, but I believe the only feasible way forward is to get beyond funding single trials for single academic groups – also see my recent comment to Andrew.

> If you push people to promise more than they can deliver, they’re motivated to cheat.

Agree, and a lot of this comes from an implicit assumption that there needs to be essentially one definitive trial which is usually very bad economy of research. For instance, the cost can become so high that no matter how good the design and sound the reasoning it would not get funded for fiscal reasons and researchers learn this.

Instead one could fund well designed trials that test the design with a reasonable number of patients, where these well tested designs are then franchised to other researchers in other teaching hospitals and the data shared. If desirable, the designs could be sequentially improved.

“After all, the ultimate objective is to be able to answer the question or in some cases go on to another question we may be better able to answer. To do otherwise, is to ensure inefficient use of the scarce resources available for research.” https://www.ncbi.nlm.nih.gov/pubmed/2809651

Unfortunately the business model of most funding agencies seems to be more about seeding and building the prestige of academics and academic groups.

]]>Greg:

> I don’t see it as being a global check

By global I just meant not specific to a particular part of the model being too wrong – more specific checks are almost always better.

Additionally, these checks can only be done over multiple studies, when (as it was put in Mosteller & Tukey, Data Analysis and Regression: A Second Course) there is access to the real uncertainty.

]]>The problem is that the low-hanging fruit in health research are long gone. There is nothing around like penicillin reducing the mortality of pneumococcal pneumonia from close to 100% to close to 0%. The incremental effects that new interventions can bring about are typically quite modest, and detecting them requires studies that are very large or very long in duration.

Or the NHST paradigm took over medical research because it made it so easy to “discover” something, leading to a literature filled with misinformation. This wastes 90% of the time and energy of people during the prime of their lives who might otherwise get something done. Besides figuring that out and deciding what to do about it personally, there is the obstacle of dealing with all the people who haven’t yet.

It seems more plausible to me that our understanding of health and the human body is extremely rudimentary… I mean you can read Herodotus and find arguments that eating a primarily grain diet is like “eating dirt” (ie leads to early mortality vs other diets).

Do we really know whether or not (or more likely, under what conditions) that is the case yet? Not afaict, yet the opposite claim was embedded in the US food pyramid until not too long ago, but now has gone out of fashion. People need to stop with this overconfidence on health matters.

]]>What I found was that small process improvements were almost impossible to detect, using the then-standard experimental methods. For example, if an experiment has a genuine yield impact of 0.2 percent, that can be worth a few million dollars. (A semiconductor fabrication facility produced at that time roughly $1 to $5 billion of output per year.) But a change of that size was lost in the noise. Only when the true effect rose into the 1% or higher range was there much hope of detecting it. (And a 1% yield change, from a single experiment, would be spectacular.)

Yet semicon engineers were running these experiments all the time, and often acting on the results. What was going on? One conclusion was that most good experiments were “short loop” trials, meaning that the wafers did not go all the way through the process. For example, you could run an experiment on a single mask layer, and then measure the effect on manufacturing tolerances. (Not the right terminology in semicon, but that is what they are called elsewhere.) In this way, the only noise was from the single mask layer. Such an experiment would not tell you the impact on yields, but an engineering model could estimate the relationship between tolerances ===> yields. Now, small changes were detectable with reasonable sample sizes.

]]>“More often than not it should be a series of studies, with the initial ones targeted at learning how many studies, modified how, are needed to get an economic answer (i.e. cost benefit of adopting the treatment versus continued or discontinued research on this research question).”

This sounds like the view of experimentation often used in industrial quality assurance: Design experiments so each one leads to a better design of the next one.

However, the problems pointed out by Clyde are important.

]]>(I’m sorry for the shortened link, but for some reason the system doesn’t like the full URL)

]]>There are still new therapies that can be shown to work with small trials: https://goo.gl/EUAVAQ

]]>Lots of good points here.

]]>Sadly, Cumming’s video is not widely enough used — ought to be shown in every intro stats course.

]]>OK, I can see it in the context of standardized educational testing. But that is a pretty specialized situation.

]]>The problem is that the low-hanging fruit in health research are long gone. There is nothing around like penicillin reducing the mortality of pneumococcal pneumonia from close to 100% to close to 0%. The incremental effects that new interventions can bring about are typically quite modest, and detecting them requires studies that are very large or very long in duration. If we had realistic power analyses, grant applications would need budgets that are beyond the limits of the funding mechanisms NIH uses. If we want that kind of research done, then we have to pay for it, and that almost certainly entails doing fewer research projects in order to better resource the ones we do. The incentives in the system, however, all work in the opposite direction. Moreover, it isn’t clear that the scientific and administrative review systems used by NIH are up to the task of properly identifying this narrower set of more important studies to fund.

]]>Yes, my critiques of articles that have lots of p-values just below .05, are essentially model checks. I don’t see it as being a global check, but as being a check of the multiple studies that are related to the authors’ theoretical claims. A global check may be a good way of thinking about general problems in the field, but it does not necessarily challenge any particular set of findings; and I think being more specific is usually better in cases like these.

]]>Carlos:

I’ve rewritten to clarify.

]]>Chris:

I’ve rewritten to clarify.

]]>I see that you wrote “And I don’t just mean that, given the real world of type M errors and the statistical significance filter, that we should expect claims of statistical power (which are based on optimistic interpretations of a biased literatures) will be wildly inflated.” But I think this point is really important and shouldn’t be buried. It is still sometimes difficult to convince people that type M is real, pervasive, and practically quite significant – I think your analysis of the distribution of p-values you would expect with power routinely around 0.8 would be pretty convincing for researchers brought up with NHST.
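To make the p-value point concrete, here is a small Python sketch of what the distribution of p-values would look like if a two-sided test really did have 80% power. It assumes the test statistic is approximately normal with unit variance; the function names are mine and this is only an illustration, not the analysis from the post.

```python
from statistics import NormalDist

Z = NormalDist()

def expected_z(power, alpha=0.05):
    """Mean of the z statistic when a two-sided test has the given power
    (ignoring the negligible chance of rejecting in the wrong direction)."""
    return Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)

def prob_p_below(threshold, mean_z):
    """P(two-sided p-value < threshold) when z ~ Normal(mean_z, 1)."""
    z_cut = Z.inv_cdf(1 - threshold / 2)
    return Z.cdf(mean_z - z_cut) + Z.cdf(-mean_z - z_cut)

mean_z = expected_z(power=0.80)
p_sig = prob_p_below(0.05, mean_z)     # chance of p < .05 (the power)
p_strong = prob_p_below(0.01, mean_z)  # chance of p < .01
share_between = (p_sig - p_strong) / p_sig

print(f"P(p < .05) = {p_sig:.3f}")
print(f"P(p < .01) = {p_strong:.3f}")
print(f"share of significant results with .01 < p < .05: {share_between:.2f}")
```

Under these assumptions only about a quarter of the significant results should land between .01 and .05, so a literature where most reported p-values sit just under .05 is hard to square with power routinely near 0.8.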

]]>Ok, thanks. Then it’s not so “damn easy” to get 80% power… your P.P.S. makes it look as if the assumed effect size didn’t have to be justified at all.

]]>Carlos:

The trouble is that the funding agency is asking people to demonstrate something that generally isn’t close to the truth, and then is rewarding people for delivering the lies that are demanded. NIH should either stop funding these studies, or remove the goal of statistical significance and the expectation that statistically significant results represent scientific truth. Remember the Lance Armstrong principle: If you push people to promise more than they can deliver, they’re motivated to cheat.

]]>Your point seems to be that people use unrealistic assumptions for the effect size and disappointment ensues.

Requiring power calculations in grant proposals seems useful. If investigators have to specify what effect size would be required for the study to be worthwhile, it makes it easier for reviewers, as they just have to judge whether that effect size is reasonable or not.

On the other hand, it seems reviewers don’t really care about your hypothesis, if you can just use whatever effect size is needed to make the power of your study 80%…

]]>Carlos:

With NIH proposals, the assumptions underlying the power analysis are supposed to be reasonable, and typically they’ll be supported based on some reference to the literature. It’s not enough just to say that this study has 80% power to estimate effect X, and we’d really like X to be true. You’re supposed to supply some evidence that X is plausible, and when I’ve seen these things, these are typically presented as conservative assumptions, implying that the true power is higher than 80%.

]]>Fixed; thanks.

]]>Also, shouldn’t this

“(a) effects are typically much larger than people want to believe”

rather be

“(a) effects are typically much smaller than people want to believe”

]]>I don’t know what it means “to say that the power really is at least 80%, at least much of the time.”

I have no experience with NHS grants. Is there an expectation that the true effect will be most of the time larger than assumed in the proposal?

]]>I do remember the OH S**t! reaction of some other statisticians in Toronto when I pointed out I was using the upper confidence limit for the sd in power calculations.

The bigger problems here are not clearly understanding that it’s just economics (economy of research) and implicitly assuming it all has to be done in one or at the very most two studies. More often than not it should be a series of studies, with the initial ones targeted at learning how many studies, modified how, are needed to get an economic answer (i.e. cost benefit of adopting the treatment versus continued or discontinued research on this research question).

]]>Did you use a <? That opens an HTML tag…

You need to write &lt; if you want a < to appear (and you need to write &amp;lt; if you want &lt; to appear, and so on…)
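If it helps, the same escaping rule can be seen with Python's standard `html` module. This is only an illustration of the general HTML rule; how this blog's comment system actually processes input is not something I know.

```python
import html

# What to type so that a reader sees a literal "<".
once = html.escape("<")

# What to type so that a reader sees the literal text of that entity.
twice = html.escape(once)

print(once)
print(twice)

# Decoding one level takes you back to what the reader will see.
print(html.unescape(once))
```

Each level of "I want the reader to see this literally" adds one more round of escaping.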

]]>Yes, a point typically ignored is that power based on a point estimate of effect size, even an unbiased one, will *always* be biased if the standard error of the estimate is ignored. For a 2-group mean comparison with all the gods on your side (assumptions of t-test met), an effect size of .5 sd requires 64 per group for 80% power. Say the estimate has a standard error of .1. An effect size of .6 would require only 45 per group, but an effect of .4 requires 100 per group. The error of the effect size estimate may be symmetrical about the mean, but the power function is not.
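The asymmetry is easy to check numerically. Below is a minimal Python sketch, assuming a two-sided alpha of .05 and 80% power; the function name `n_per_group` is mine, and the formula is the normal approximation with a small correction term often added to approximate the t-test, so exact noncentral-t software can differ by a subject or so.

```python
import math
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-group mean comparison at effect size d."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    # Normal approximation plus a small-sample correction term.
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2 + z_a ** 2 / 4)

n_low, n_mid, n_high = (n_per_group(d) for d in (0.4, 0.5, 0.6))

print(f"d = 0.4 -> {n_low} per group")
print(f"d = 0.5 -> {n_mid} per group")
print(f"d = 0.6 -> {n_high} per group")
print(f"cost of being 0.1 too optimistic:    +{n_low - n_mid}")
print(f"saving from being 0.1 too pessimistic: -{n_mid - n_high}")
```

Because required n scales with 1/d², an equal-sized error in the effect size costs more subjects on the low side than it saves on the high side.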

]]>+1

]]>So what does this tell us, aside from the fact that people can’t estimate a prior alternate effect size, or that they engage in wish-casting? If there is no penalty for being wrong, there is no incentive to be accurate. If there is a trade-off between getting funded and engaging in unpenalized wishful thinking what would you expect to happen?

Attach a real penalty to not delivering the promised effect size and see what happens. This occurs in some commercial environments (product development and QA). Shipping a defective product can bring a visit from, say, the FDA or measurable market volume loss.

]]>Maybe this is obvious, but doesn’t this just provide more evidence that effect sizes are routinely exaggerated in grant proposals and thus presumably the literature they are based on? Would fit nicely with the whole type M error analysis….

]]>I’m bothered by power analyses failing to take the uncertainty of the preliminary estimates of effect size and variation into account. One thing I’ve begun doing recently is to use the lower bound of the 95% confidence interval of the estimate of the mean effect size, and use this for the effect size in the power calculation — this should inflate the N appropriately considering the uncertainty of the mean, though it ignores the uncertainty in the estimate of the standard deviation. I’ve given some thought to bootstrapping the available data to obtain a confidence interval for the power analysis’ N, and reporting the upper bound of this. I’d like to have a better handle on the theory behind these sorts of adjustments.
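For what it is worth, the lower-bound adjustment described here can be sketched in a few lines of Python. Everything numeric below is hypothetical (a pilot with a mean difference of 0.5, SD of 1.0, and 50 per group), the normal quantile is used for the confidence interval where a small pilot would really call for a t quantile, and, as the comment notes, the uncertainty in the SD is ignored.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-group mean comparison at effect size d."""
    if d <= 0:
        raise ValueError("effect size must be positive for a finite sample size")
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2 + z_a ** 2 / 4)

# Hypothetical pilot summary statistics (not from any real study).
pilot_diff = 0.5   # estimated mean difference, raw units
pilot_sd = 1.0     # estimated SD, treated as known
pilot_n = 50       # per group in the pilot

se_diff = pilot_sd * math.sqrt(2 / pilot_n)
lower_diff = pilot_diff - Z.inv_cdf(0.975) * se_diff  # lower 95% bound

d_point = pilot_diff / pilot_sd
d_lower = lower_diff / pilot_sd

n_point = n_per_group(d_point)
n_lower = n_per_group(d_lower)

print(f"point estimate d = {d_point:.3f} -> {n_point} per group")
print(f"lower bound    d = {d_lower:.3f} -> {n_lower} per group")
```

With a pilot this size the lower bound is close to zero and the required N balloons, which shows how much the point-estimate calculation leans on the pilot being right. If the lower bound is at or below zero there is no finite answer, and the function raises an error instead of returning one.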

]]>In addition to the parameters mentioned, if the study involves a multilevel model (as is increasingly often the case) you can also play with the estimates of the numerous variance components. Those are actually the best ones to manipulate since we usually know next to nothing about them at the time we plan the study and almost any claim is at least defensible against the charge of intentional manipulation.

There is also, in human subjects research, the question of attrition from the study. One typically projects low attrition rates that are seldom achieved in reality.

Another issue in time-to-event studies is the expected incidence rate in the control group. The claim in the power analysis is almost always much higher than seen in the final study. This is typically not the result of deceit so much as the fact that for many conditions, the prognosis is indeed getting better over time, so an incidence rate for a bad outcome grabbed from a study that is even just 10 years old is likely to be a substantial overestimate.

Given also that the power function has some steep sections, several tiny pushes in an optimistic direction can add up to a huge overestimate of power.

If you do sensitivity analyses on your power calculations, of course you will see all of this. I have on more than one occasion done that and found that with my proposed design the power is plausibly somewhere between 5 and 99.5%! Given the limitations on the length of the proposal, and all the other things that need to be in there, you can’t put all of that in there. So you pick a set of assumptions that you think are a reasonable likely-case analysis. But humans being humans, that set of assumptions is likely to be biased in the optimistic direction.
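A toy version of such a sensitivity analysis, in Python, is sketched below. The enrolment figure, the grid of effect sizes and attrition rates, and the function name are all hypothetical, and the normal approximation stands in for an exact calculation; a real analysis would also vary variance components and control-group incidence.

```python
from itertools import product
from statistics import NormalDist

Z = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)

# Hypothetical planning inputs: one fixed design, several plausible worlds.
enrolled_per_group = 64
effect_sizes = (0.2, 0.35, 0.5, 0.65, 0.8)
attrition_rates = (0.05, 0.20, 0.35)

results = []
for d, attrition in product(effect_sizes, attrition_rates):
    completers = enrolled_per_group * (1 - attrition)
    results.append((power_two_sample(d, completers), d, attrition))

lowest = min(results)
highest = max(results)

for power, d, attrition in sorted(results):
    print(f"d = {d:.2f}, attrition = {attrition:.0%} -> power {power:.2f}")
print(f"range: {lowest[0]:.2f} to {highest[0]:.2f}")
```

Even this two-parameter grid spans a very wide range of powers for a single design, so quoting one number from the middle of it hides most of the story.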

When I review grants for NIH, I don’t even bother to read the power analysis sections. That is, unless I’ve had a down day and I need a good laugh.

]]>When calculating power, you don’t choose the effect size by guessing what the real effect size is likely to be. You choose the smallest effect size that would be (clinically) significant if true.

I’m sure sample sizes are generally too small since all the incentives for researchers point in that direction. But I don’t get much from your calculations because:

1. For a significant number of studies that the NIH funds, there is a sizable prior probability of a near zero or very small effect.

2. Powers are all calculated for effects large enough to be clinically meaningful.

Comment weirdly truncated. Missing “I looked at my most recent APA article, 7 tests of main hypothesis, p8 sigmas.”

]]>2. Perverting the idea of effect sizes, psychologists ignored them for a priori power analysis (unless they had to provide one for a grant proposal), but they used Cohen to interpret the OBSERVED effect sizes with confidence intervals so wide you could drive a truck through them, which is why psychologists do not report them (Cohen, 1994). With the help of low power and publication bias, these observed effect sizes are inflated and papers used Cohen to claim medium to large effects (which do not replicate in replication studies).

3. One problem was that true power is unknown a priori and a posteriori for a single study and observed power in a study with a significant result has to be greater than 50%. So, researchers never realized how low the true power of their studies was. Jerry Brunner and I developed a method that can estimate the true power of a set of studies (e.g., in a journal, on a topic, by a specific researcher) based on the observed test statistics (converted into z-scores). The results often show very low power.
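The claim that observed power must exceed 50% for a significant result can be checked with a short Python sketch. It treats the observed z-score as if it were the true expected z (which is what "observed power" does) and assumes a two-sided test at alpha of .05. It illustrates only that one claim and is not an implementation of the z-curve method.

```python
from statistics import NormalDist

Z = NormalDist()

def observed_power(z_obs, alpha=0.05):
    """Power computed by plugging the observed z in as the true expected z."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_abs = abs(z_obs)
    return Z.cdf(z_abs - z_crit) + Z.cdf(-z_abs - z_crit)

z_crit = Z.inv_cdf(0.975)  # the smallest |z| that is significant at .05

for z_obs in (1.0, z_crit, 2.5, 3.0, 4.0):
    label = "significant" if abs(z_obs) >= z_crit else "not significant"
    print(f"z = {z_obs:.2f} ({label}) -> observed power {observed_power(z_obs):.3f}")
```

A result sitting exactly on the significance threshold has observed power of essentially one half, and anything more extreme has more, so conditioning on significance guarantees an observed power above 50% whatever the true power was.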

Link to the z-curve paper.

https://replicationindex.wordpress.com/2017/11/16/preprint-z-curve-a-method-for-the-estimating-replicability-based-on-test-statistics-in-original-studies-schimmack-brunner-2017/

Link to a z-curve analysis of John A. Bargh’s book on unconscious processes.

https://replicationindex.wordpress.com/2017/11/28/before-you-know-it-by-john-a-bargh-a-quantitative-book-review/

I was surprised—the impression of “moderately low p-values only” is partly based on censored reporting.

]]>