When anyone claims 80% power, I’m skeptical.

A policy analyst writes:

I saw you speak at ** on Bayesian methods. . . . I had been asked to consult on a large national evaluation of . . . [details removed to preserve anonymity] . . . and had suggested treading carefully around the use of Bayesian statistics in this study (basing it on all my expertise taken from a single graduate course taught ten years ago by a professor who didn’t even like Bayesian statistics). I left that meeting determined to learn more myself as well as to encourage Bayesian experts to help us bring this approach to my field of research. . . .

I’ve been asked to review an evaluation proposal and provide feedback on what could help improve the evaluation. The study team is evaluating the effectiveness of a training . . . The study team suggests up front that a great deal of variance exists among these case workers (who come from a number of different agencies, so are clustered under [two factors]). The problem with all of this is that the number is still small [less than 100 people] . . . To recap, we have a believed heterogenous population . . . whose variation is expected to cluster under a number of factors . . . and all will get this training at the same time . . . The study team has proposed a GEE model and their power analysis was based on multiple regression. Their claim on power is that [less than 100] individuals is sufficient to achieve 80% power using multiple regression with no covariates. That is as much information as they give me, so their assumptions about the sample are unclear to me.

I want to suggest they go back and actually run the power analysis on the study they are doing, based on the analysis they intend to run. I’m also suggesting they consider GLMM instead of GEE, since they seem to feel that the variance explained by these clusters is meaningful and could inform future trainings. I recall learning that in this scenario of clustered variance, actual power is less than if you assume homogeneity across your population. In other words, if you estimate power based on a multiple regression and a standard proxy for the expected variance of your sample, you will far overestimate how much power you have. The problem is I can’t remember whether this is true, why it is true, what words to google to remind myself, or whether I totally made this up and pulled it out of my butt. What I recall is that in my MLM courses they suggested that MLM would actually achieve 80% power with a smaller sample when the cluster variable explained significant variance.

I modeled all of this for another study where the baseline variance is already known and where a number of related past studies exist. What I found there was that if you just plug your N into G*Power and use its preset values for variance, the sample it says is needed to achieve 80% power is far lower than what you get if you put in accurate values for the variance. And if you move from G*Power to an R script where you can explicitly model power for a GLMM and put in good estimates for all of this, you get a number somewhere in between the two. This is what I think they should do, but I don’t want to send them on a wild goose chase.

1. When anyone claims 80% power, I’m skeptical. (See also here.) Power estimates are just about always set up in an optimistic way, with the primary goal of getting the study approved. I’d throw out any claim of 80% power. Instead, start with the inputs: the assumptions about effect size and variation that were used to get that 80% power claim. I recommend you interrogate those assumptions carefully. In particular, guessed effect sizes are typically drawn from the existing literature, which overestimates effect sizes (Type M errors arising from selection on statistical significance), and these overestimates can be huge. That’s perhaps no surprise, given that, in designing their studies, researchers have every incentive to use high estimates for the effectiveness of their interventions.
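The Type M point can be made concrete with a short simulation. This is an illustrative sketch, not an analysis of the study in question: the true effect and standard error below are made-up numbers chosen to represent a noisy study, and the code just computes how often the estimate is statistically significant and how much significant estimates overstate the truth on average.

```python
# Sketch of a Type M ("exaggeration") calculation: when the true effect
# is small relative to the standard error, the average statistically
# significant estimate can greatly overstate it. Numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1   # hypothetical true effect
se = 0.15           # hypothetical standard error of the estimate

est = rng.normal(true_effect, se, size=100_000)   # sampling distribution
signif = np.abs(est) > 1.96 * se                  # two-sided 5% test
power = signif.mean()                             # share reaching significance
exaggeration = np.abs(est[signif]).mean() / true_effect

print(f"power: {power:.2f}")
print(f"average |significant estimate| / true effect: {exaggeration:.1f}x")
```

With these made-up inputs the power is around 10%, and the significant estimates overstate the true effect severalfold, which is the sense in which literature-derived effect sizes can be badly inflated.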

2. More generally, I’m not so interested in “power,” as the term is associated with the view that the goal of a study is to attain statistical significance and then make claims with certainty. I think it’s better to accept ahead of time that, even after the study is done, residual uncertainty will remain. If statistical significance isn’t the goal, there’s less pressure to cheat to get statistical significance, less overconfidence if the study happens to result in statistical significance, and less disappointment if the results end up not statistically significant.

3. To continue with the theme: I don’t like talking about “power,” but I agree with the general point that it’s a bad idea to do “low-power studies” (or, as I’d say, studies where the standard error of the parameter estimate is as high as, or higher than, the true underlying effect size). The trouble is that a low-power study gives a noisy estimate, which, if it does happen to be statistically significant, will overestimate the treatment effect.

Sometimes people think a low-power study isn’t so bad, in a no-harm, no-foul sort of way: if a study has low power, it’ll probably fail anyway. But in some sense the big risk with a low-power study is that it apparently succeeds, leading people into a false sense of confidence about the effect size. That’s why we sometimes talk about statistical significance as a “winner’s curse” or a “deal with the devil.”

4. To return to your technical question: Yes, in general a multilevel analysis (or, for that matter, a correctly done GEE) will give a lower power estimate than a corresponding observation-level analysis that does not account for clustering. There are some examples of power analysis for multilevel data structures in chapter 20 of my book with Jennifer Hill.
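A standard back-of-envelope way to see why clustering reduces power is the design effect: under a simple equal-cluster-size model, the variance of an estimate based on N clustered observations is inflated by a factor of 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The numbers below are hypothetical, chosen only to show the size of the effect for a sample under 100.

```python
# Design-effect sketch: clustering inflates variance, so the "effective"
# sample size is smaller than the nominal N. All inputs are hypothetical.
def design_effect(cluster_size, icc):
    """Variance inflation for equal-size clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

N = 90       # total participants (e.g., "less than 100")
m = 15       # hypothetical average cluster (agency) size
icc = 0.10   # hypothetical intraclass correlation

deff = design_effect(m, icc)
n_eff = N / deff
print(f"design effect: {deff:.2f}, effective N: {n_eff:.1f}")
```

With these assumed values the 90 participants behave, for estimating a treatment effect, more like an independent sample of about 38, which is why a power analysis that ignores clustering can look far too optimistic.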

5. When in doubt, I recommend fake-data simulation. This can take some work: you need to simulate predictors as well as outcomes, and you need to make assumptions about variance components, correlations, etc. But in my experience the effort to make those assumptions is well worth it in clarifying one’s thinking.
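Here is a minimal version of what such a fake-data power simulation can look like. Everything in it is an assumption for illustration: the number of clusters, cluster size, effect size, and variance components are invented, and the analysis is a simple t-test on cluster means rather than the GEE or GLMM the study team would actually fit.

```python
# Minimal fake-data power simulation (numpy only): simulate a two-group
# comparison with cluster-level noise, analyze the cluster means, and
# count how often the test reaches two-sided p < 0.05.
import numpy as np

def simulate_once(rng, n_clusters=20, m=9, effect=0.4,
                  sigma_cluster=0.3, sigma_indiv=1.0):
    # half the clusters treated, half control
    treat = np.repeat([0, 1], n_clusters // 2)
    cluster_fx = rng.normal(0, sigma_cluster, n_clusters)
    # each cluster mean averages m individual-level errors
    means = (effect * treat + cluster_fx
             + rng.normal(0, sigma_indiv / np.sqrt(m), n_clusters))
    a, b = means[treat == 1], means[treat == 0]
    n = n_clusters // 2
    sp2 = (a.var(ddof=1) + b.var(ddof=1)) / 2        # pooled variance
    t = (a.mean() - b.mean()) / np.sqrt(2 * sp2 / n)
    return abs(t) > 2.101   # two-sided 5% critical value, t with 18 df

rng = np.random.default_rng(1)
power = np.mean([simulate_once(rng) for _ in range(4000)])
print(f"estimated power: {power:.2f}")
```

The point of the exercise is less the particular number than the discipline it imposes: to run it at all, you must write down explicit guesses for the effect size and each variance component, which is exactly where optimistic power claims hide.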

Which elicited the following response:

You raise concerns about the broader funding system itself. The grants require that awardees demonstrate in their proposal, through a power analysis, that their analytic method and sample size are sufficient to detect the expected effects of their intervention. Unfortunately, it is very common for researchers to present something along these lines:

– We are looking to improve parenting practices through intervention X.
– Past interventions have seen effect-size estimates ranging from 0.1 to 0.6 (in fact, the literature would show effect sizes ranging from -0.6 to 0.6, but the negative effect sizes are left out of the proposal).
– The researcher will conclude that their sample is sufficient to detect an effect size of 0.6 with 80% power and that this is reasonable.
– The study is good to go.

I would look at that and generally conclude the study is underpowered and whatever effects they do or don’t find are going to be questionable. The uncertainty would be too great.
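As a rough illustration of how much that choice of effect size drives the sample-size claim, here is the standard normal-approximation formula for a two-sample comparison at 80% power and 5% two-sided alpha. The effect sizes are taken from the range quoted above; the calculation itself is generic, not anything from the proposal.

```python
# Back-of-envelope per-group sample size for a two-sample comparison,
# 80% power, 5% two-sided alpha, via the normal-approximation formula
# n = 2 * (z_alpha/2 + z_beta)^2 / d^2. Effect sizes are illustrative.
import math

def n_per_group(d, z_alpha=1.96, z_beta=0.8416):
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

for d in (0.6, 0.3, 0.1):
    print(f"effect size {d}: ~{n_per_group(d)} per group")
```

Assuming d = 0.6 makes a sample of under 100 look ample (about 44 per group), while d = 0.1 would require on the order of 1,570 per group, which is why cherry-picking the top of the literature’s effect-size range effectively decides the power analysis in advance.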

I like your idea of doing simulations. While I can’t recommend that awardees do that (it would be considered too onerous), I can suggest evaluators consider the option. I’ve done these for education studies. My experience is that if it is a well-studied field and the assumptions can be readily inferred, they are not that time-consuming to do.

I assume a problem with making these assumptions about effect size and variance would be the file-drawer effect: the lack of published null or negative studies, and the inadvertent or intentional ignoring of those that are published. [Actually, I think that forking paths—the ability of researchers to extract statistical significance from the data of any individual study—is a much bigger deal than the file drawer. — ed.] I do think that some researchers take an erroneous view of these statistics: if no effect or a negative effect is found, it’s not meaningful; if a positive effect is found, it is meaningful and interpretable.

P.S. I have been highly privileged, first in receiving a world-class general education in college, then in receiving a world-class education in statistics in graduate school, then in having jobs for the past three decades in which I’ve been paid to explore the truth however I’ve seen fit, and during this period to have found hundreds of wonderful collaborators in both theory and application. As a result, I take very seriously the “service” aspect of my job, and I’m happy to give back to the community by sharing tips in this way.

1. Vikram says:

As a person interested in policy and the application of quantitative methods to social science, I want to thank you for all the work you do in your blog, and particularly the attitude expressed in your postscript. I learn almost as much from your discussions with people who write in with their particular problems as I do from the theoretical discussions.
I’ve not yet had the opportunity to apply what I have learnt from this blog in a professional setting, but I have certainly learnt enough to be intelligently skeptical of common flaws in the literature, and even to convince friends and colleagues of the value of not adopting a “throw all the data into Stata and let the stars sort it out” attitude toward the analysis of complex questions.

2. Thanatos Savehn says:

GLMM, GLE, GEE-WHIZ! I read today that a paper pertinent perhaps to the discussion has now been published: https://psyarxiv.com/qkwst/ and found this bit especially revealing:

“And yet the process of translating this question from natural language to statistical models gave rise to many different assumptions and choices that influenced the conclusions. This raises the possibility of hidden uncertainty due to the wide range of analytic choices available to the researchers across a wide variety of research applications.”

Is there some emerging consensus about which models to use and when or is it all just going to collapse again into some software that allows you to find the model that will give you the answer you want? Or will DAGs save the day?

• In my opinion, there shouldn’t be any “consensus about which models to use,” because applications are heterogeneous and sensible models depend very critically on *subject matter specifics*.

At least, this is true when your goal is understanding the world. When your goal is pure engineering (i.e., how can I make my self-driving car safe?), there’s plenty of room for “we used generic method X and it seems to work in extensive lab and real-world testing.”

Little of what is published in academia is about pure engineering; that’s done at commercial R&D labs. So we should expect scientific questions to lack a consensus model, and model choice based on subject-matter considerations, along with model comparison between plausible subject-matter-based models, should be a very important part of science.

• Note, sometimes there is a kind of “pure engineering” to certain aspects of your scientific question. For example, if you want to understand how FOO affects BAR and there’s a big dataset on how BAZ affects FOO, you might well go ahead and use the model FOO = F(BAZ) + error as a kind of “engineered way” to get a noisy version of FOO for your more important model BAR = B(FOO) + error. This comes down to basically *measurement instrument construction*, and not enough emphasis is placed in engineering on understanding what a measurement really is. Rarely is a measurement anything like authoritative; measurement error abounds in everything, and that includes *model error* in your measurement instrument. Even electronic multimeters are built, underneath their nice digital output, on not-quite-correct models of how electrons behave in certain circuits. In many ways things like “income” or “race” or “party identification” or even “flood stage of the Mekong river” are the same.

• Thanatos Savehn says:

But if 29 expert teams can look at the same question and decide it’s best modeled 29 different ways, thereafter producing a rainbow of significant/non-significant results, effects that range from negative to positive, and effect sizes that run from -x to +2x, then why not just give up and cast lots? Surely we could save a lot of money if we just said instead, “Come! Let us cast lots that we may decide who/what is to blame for this calamity.”

• A nice aspect of Bayes is that it gives a principled way to compare these models. If in the end none dominates the other, then we can rightly say that we can’t determine which model is best.

Then we can try to see in what regimes the models predict differences and find out what kind of experiments might distinguish them.

Significant and non-significant effects just don’t work like that…

• Keith O'Rourke says:

Apparently the lots that were cast were largely cast in the bar or at grad school ;-)

It’s an area that needs more attention: http://www.stat.columbia.edu/~gelman/research/published/authorship2.pdf

3. A claim that a test has 80% power is at best highly incomplete. Tests only have power in relation to a discrepancy from a null or an alternative value. Some people erroneously think a test’s power is its probability of correctly rejecting the null: that’s the first mistake. The second mistake is to infer that, if they do reject, there’s an 80% chance it’s a correct rejection. Both are wrong. If you reject a null with a test that has high power against m1 and low power against m2 (m1 and m2 being values of the mean parameter), the result is good evidence that m > m2 and lousy evidence that m > m1.
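A quick numerical illustration of this point, with all values hypothetical: for a one-sided z-test of H0: m = 0 with standard error 1, the same test has very different power against two different alternatives, which is why an unqualified “80% power” claim says little by itself.

```python
# Power of a one-sided 5% z-test depends entirely on which alternative
# you compute it against. se, m1, and m2 below are hypothetical.
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(m, se=1.0, z_crit=1.645):   # one-sided 5% critical value
    return 1 - norm_cdf(z_crit - m / se)

m1, m2 = 3.0, 0.5
print(f"power against m1={m1}: {power(m1):.2f}")   # high
print(f"power against m2={m2}: {power(m2):.2f}")   # low
```

Here a rejection is much stronger evidence that m exceeds the small discrepancy m2 than that it exceeds the large discrepancy m1, even though the test’s power is far higher against m1.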