“Giving less power to statistical power”

Megan Higgs and Valentin Amrhein write:

Researchers often need to justify their choice of sample size, particularly in fields such as animal and clinical research, where there are obvious ethical concerns about relying on too many or too few study subjects. The common approach is still to depend on statistical power calculations, typically carried out using simple formulas and default values. Over-reliance on power, however, not only carries the baggage of statistical hypothesis tests that have been criticized for decades, but also blocks an opportunity to strengthen the research in the design phase by learning about challenges in interpretation before the study is carried out. We recommend constructing a ‘quantitative backdrop’ in the planning stage of a study, which means explicitly connecting ranges of possible research outcomes to their expected real-life implications. Such a backdrop can facilitate a priori considerations of how potential results, for example represented by intervals, will ultimately be interpreted. It can also serve, in principle, to help select single values of interest for use in traditional power analyses, or, better, inform sample size investigations based on the goal of achieving an interval width narrow enough to distinguish values deemed practically or clinically important from those not representing practically meaningful effects. The latter bases calculations on a desired precision, rather than desired power.
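To make the contrast between "desired precision" and "desired power" concrete, here is a minimal sketch in Python. It is not from the Higgs and Amrhein paper. It assumes a comparison of two equal-sized groups, a known common standard deviation, and a normal approximation; the function names and the numbers plugged in are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def n_per_group_for_half_width(sigma, half_width, level=0.95):
    """Per-group n so that the interval for a difference in means
    has roughly the requested half-width (normal approximation,
    known common sd; both are assumptions of this sketch)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Half-width of the interval is z * sigma * sqrt(2 / n); solve for n.
    return math.ceil(2 * (z * sigma / half_width) ** 2)


def n_per_group_for_power(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n from the textbook power formula for a two-sided
    test, given a single assumed true difference `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Illustrative inputs only: sd of 1, and 0.5 as the smallest
# difference considered practically important.
print("precision-based n per group:", n_per_group_for_half_width(1.0, 0.5))
print("power-based n per group:    ", n_per_group_for_power(1.0, 0.5))
```

The precision version asks how wide an interval you can live with, given the values you consider practically important. The power version needs you to commit to one assumed true effect. Neither formula replaces thinking about measurement and the rest of the design.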

I agree completely, and I’ve been trying to use the term “design analysis” instead of the usual “power calculation” or “sample size calculation” for three reasons:

1. It’s about design, and the sample size is only one aspect of the design of a study. Design also encompasses measurement.

2. Power is the wrong thing to be targeting, once you leave the horrible world of null hypothesis significance testing. I’ll still talk about power, because statistical significance affects how results are reported.

3. I prefer to call it an analysis rather than a calculation, because “analysis” implies some thought, whereas “calculation” sounds like you’re just plugging numbers into a formula (which indeed is what people do).

When we discuss design analysis in Chapter 16 of Regression and Other Stories, we recommend simulating a study with n=100 and then adjusting the sample size or other aspects of the design up or down to get the desired accuracy of estimation. I very much prefer this to a formula for n.
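Here is a rough sketch of what that kind of simulation can look like. It is not the code from the book, which uses R. This is a simplified Python version that assumes a two-arm comparison with normally distributed outcomes, and the effect size, standard deviation, and function names are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_study(n, effect, sigma, rng):
    """Simulate one two-arm study with n subjects in total.
    Returns (estimated difference in means, its standard error)."""
    half = n // 2
    control = [rng.gauss(0.0, sigma) for _ in range(half)]
    treated = [rng.gauss(effect, sigma) for _ in range(n - half)]
    est = statistics.fmean(treated) - statistics.fmean(control)
    se = (statistics.variance(treated) / len(treated)
          + statistics.variance(control) / len(control)) ** 0.5
    return est, se


def design_analysis(n, effect, sigma, n_sims=500, seed=1):
    """Repeat the fake study many times and summarize what we would see."""
    rng = random.Random(seed)
    sims = [simulate_study(n, effect, sigma, rng) for _ in range(n_sims)]
    n_signif = sum(1 for est, se in sims if abs(est) > 1.96 * se)
    return {
        "n": n,
        # Typical uncertainty in the estimate at this sample size.
        "mean_se": statistics.fmean(se for _, se in sims),
        # Share of simulated studies reaching conventional significance.
        "share_significant": n_signif / n_sims,
    }


# Start at n=100, then adjust up until the standard error is acceptable.
# effect=0.2 and sigma=1.0 are assumed values, not estimates from data.
for n in (100, 200, 400, 800):
    print(design_analysis(n, effect=0.2, sigma=1.0))
```

The point of doing it this way is that once the simulation exists, you can change anything about the design, not just n: add a pre-treatment measurement, make the outcome noisier, change the allocation between arms, and see what happens to the estimate and its uncertainty.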

P.S. The Higgs and Amrhein paper, which came out of a conference that Megan and I organized with Pamela Reinagel a few years ago, appeared in the journal Laboratory Animals, so I’m guessing you wouldn’t have heard about it if we hadn’t blogged it here! Another thing that came up in that conference was this discussion of the different roles of replication in different scientific fields.

10 thoughts on ““Giving less power to statistical power””

  1. Ah, I just saved this paper for possible future use concerning a manuscript I am currently working on. I think I came across it when looking up some more papers by Higgs after coming across Higgs & Gelman (2021) “Research on registered report research” which I also saved for possible future use concerning the manuscript.

    • I ended up using Higgs and Gelman (2021) for my latest manuscript. It was useful to repeat the point that the “quality” of some research might be something people view differently and have different criteria for. I could further use some sentences from Higgs and Gelman as a jumping-off point to provide some more detailed examples from the discussed paper as well.

      It also made me think about a recent blog post here where, if I remember correctly, Mr. Gelman mentioned some of his research that may not have been that impactful or something like that. I think I commented there that it may be hard to determine what is impactful or successful research or something like that.

      The Higgs and Gelman (2021) paper has been cited 10 times according to Google Scholar. I don’t know whether that is impactful or successful from a just-looking-at-the-citation-numbers-aspect. I do know that the paper has been useful concerning my manuscript, which might be something worthwhile to mention (possibly again). I think it might be the case that you never really know what the impact of a paper truly is. Even just reading a short section or a sentence in a paper somewhere might result in thinking about something which might result in a paper which might be read by someone else who might then think about something and write about something, etc. etc. etc.

      • Here’s an excerpt of my manuscript (which has now been posted, approved, and distributed on SSRN) to provide an example of what I wrote above. Higgs and Gelman (2021) has been very useful concerning my manuscript. I hope I have interpreted and understood things correctly regarding the two papers involved here. It might provide a nice example of how a paper might influence something or someone, perhaps in a way that might not necessarily be “captured” looking at the number of citations or where the paper has been published or such things. Here’s the excerpt:

        “Higgs and Gelman (2021) note that “It’s interesting, and probably important, to consider prior beliefs about the benefits and detriments of a registered reports model.” (p. 978). The authors wonder what the takeaway would be if Soderberg et al. (2021) found that mean scores for research quality were lower for RRs compared to the standard model. They further note that an unexpected outcome might raise certain questions, and result in spending time addressing these questions, which might not happen with expected outcomes. It might be useful to wonder how some might interpret Soderberg et al. (2021) reporting that RRs were statistically indistinguishable from control articles in the importance of the research regardless of what results will be observed (p. 991) when RRs are often described as a format where “Decisions to publish are therefore based on the perceived importance of the research question and the quality of the methodology, not whether the findings are positive, novel and clean.” (Soderberg et al., 2021, p. 990). This all might tie in nicely with what has been mentioned earlier concerning the possibility of a sub-optimal monitoring and evaluation process.”

      • Another excerpt to provide an example how previously published work might provide a useful jumping-off point to provide some more detailed examples that were not necessarily present in this previous work. This previous work however might be inspiring, and fitting, concerning later work by someone else in a way that might not necessarily be captured by numbers or such things. The previous work might provide an ingredient, might provide information, might play a specific role, or might have a certain influence concerning another person’s work that may be impossible to even describe, let alone, capture or measure. It’s one of the things I like most about writing and reading scientific manuscripts, lyrics, or poetry.

        “It might be noteworthy, if I am understanding things correctly, that a substantial part of the RRs used in the study by Soderberg et al. (2021) do not include a link to the pre-registration (see Table 1, p. 993). If this is correct, one could wonder whether the reviewers looked for such a link when reading words like “preregistered” in the excerpts. Or one could wonder whether the reviewers assumed such a link is present somewhere else in the actual paper, but just not in the specific excerpt they were reading. Or one could wonder whether the reviewers read a word like “preregistered” and did not even think about, or look for, a link to pre-registration information. The availability of a link to pre-registration information might be a factor that some take into account in judging whether a study is “rigorous” or “high quality” or “preregistered”. And some might wonder whether reported “preregistered analyses” in such studies without a link to publicly accessible pre-registrations even deserve to be called “preregistered”. For some, this might all be crucial in judging the rigour and quality of a study, which might tie in nicely with Higgs and Gelman (2021) who write: “Individual researchers have different standards and tend to focus on different aspects of quality, not to mention the differences that exist among disciplines. This is just something we should have in mind.” (p. 979). This all might provide a useful example of how attention might be directed to one thing, and not to another thing.”

      • Another excerpt:

        “Soderberg et al. (2021) further note that “However, it is not a definitive comparative investigation of RRs versus the standard publishing model.” (p. 994), and go on and mention some limitations of their study, such as the naturalistic design, and that they used peer review evaluations of quality rather than other assessments of quality like replicability (see p. 994). But, what about the replicability of their own findings? And in general, what about the replicability of this kind of monitoring and evaluation research? A focus on direct replication around the time of “a crisis of confidence” regularly made clear that a single study is not enough to warrant strong conclusions, and that performing and publishing only conceptual replications is problematic (e.g. see Pashler & Harris, 2012, p. 533). Perhaps such a view should also be applied to evaluation or meta-scientific research, where it seems just as important to be aware of the possible problems with single studies and concluding things from them.
        From that perspective, and to again use Soderberg et al. (2021), perhaps one could wonder whether it is appropriate or optimal to write: “Despite these limitations, this study provides a basis of evidence to expand the use of RRs in research, and should spur follow-up research to examine the generality of these findings.” (p. 994). Perhaps there is a real risk that sub-optimal, severely flawed, or bad research is nonetheless used to introduce, underline, or expand certain projects or initiatives. Perhaps this ties in with what Higgs and Gelman (2021) mention: “Soderberg et al. will likely be cited often, as support for various arguments regarding the effectiveness of registered reports in research, and most notably for the increase they found in research quality.” (p. 978). Even just a single, perhaps flawed, study can heavily influence things, and can facilitate taking a certain step. That step might be a step in the wrong direction, and might not easily be corrected.”

        (More can be read in “Pre-registration, grocery lists, and particular pre-registration issues” which has recently been posted on SSRN. I have been inspired by several authors and papers and blog posts I came across on this blog in the last few years when writing the manuscript, and this writing and information has been very useful concerning my manuscript as possibly shown via the sections I quoted here. I hope I view things correctly, and understand things well enough, and have phrased things appropriately in this all.)

  2. On this topic, I also recommend Chapter 13 in Kruschke’s “Doing Bayesian Data Analysis” as well as his 2018 paper with Torrin Liddell, “The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective”. These sources also address the broader issue you point out, namely, that “power” is only one of many possible inferential goals. Once you break out of the NHST mold, but particularly in a Bayesian context, it becomes possible to conceive and plan studies to address more meaningful concerns.

  3. Everybody is entitled to opinions and freedom of speech but to (a) be against power calculations and (b) criticize psychology for the use of small samples is a bit of a contradiction (which is fine for opinions, but not for serious scientific commentary).

    As many serious researchers like Kahneman and Tversky have pointed out, it is insanity to conduct studies with less than 50% power, but that is what many psychologists have done for decades. Would proper power calculations have avoided the replication crisis? Yes.

    So, what is the real criticism of power calculations? They make hypothesis testing useful. This may not be desirable for people who never conduct studies with resource constraints (e.g. animal studies) and can just dream about perfect studies with large samples. But for real researchers, small samples and large sampling error are a reality, and having at least 50% power to make directional claims with low rates of sign errors is all they can do.

    Ok, sorry for the scientific interruption. Back to opinionated entertainment.

    • Ulrich:

      I can’t speak for Higgs and Amrhein, but, as far as I’m concerned, I think design analysis is very important.

      I’m not “against power calculations.” I do them all the time because people are often characterizing results based on statistical significance, so it’s good to have a sense of what could happen under that measure. But, moving forward, I’m less interested in statistical significance and more interested in effect sizes, which is why I prefer design analysis on future estimates and uncertainties rather than on p-values.

      I agree with you that low power studies are a problem. Indeed, I wrote a post, “This is what ‘power = .06’ looks like. Get used to it.” I really like that post!

      Perhaps the above post, entitled “Giving less power to statistical power,” should instead be titled, “Power analysis is important; also, there are other aspects of design analysis that have been neglected.”

      P.S. Regarding your final paragraph: There’s no need for you to apologize! We appreciate scientific comments here. The title of our blog is Statistical Modeling, Causal Inference, and Social Science, which covers all sorts of things, including the material in your comment.

      • Dear Andrew,

        I believe there is in fact much for Dr. Schimmack to apologize for, beginning with the presumption of being able to judge so readily who is a “serious” researcher and who is not, as well as labeling as “opinionated entertainment” anything that does not align with his own opinions (which he presents as the only ones that are “scientific” here).

        My advice to Dr. Schimmack would be to make a greater effort to ensure that his comments are not dominated by personal biases, which includes undertaking a broader analysis of the factors underlying the so-called replication crisis (e.g., https://doi.org/10.1080/00031305.2018.1543137, https://doi.org/10.1177/10755470241239947, https://doi.org/10.1177/0146167217729162). Of course, the presence of differing views is fundamental to research practice, provided that they are supported by technical arguments rather than driven by emotion or framed in an offensive manner.

      • Andrew:

        I’m also not against power calculations. I think an alternative title for a slightly different paper could have been “Power analysis is important if you are doing traditional hypothesis testing; if you think you can test hypotheses relatively convincingly with your p-values, but don’t do pre-study and pre-registered power calculations, you’re wrong.” But this paper has been written many times by many people since about 1933!

        Sadly, from what I observe in my field, evolutionary biology, people using p-values don’t do power calculations, or at least they don’t publish them (as we described here: https://doi.org/10.1111/jeb.14009). I think there were some real researchers in this field, such as Charles Darwin or Ronald Fisher, but neither of them was known as a serious advocate of statistical power.

        To get serious again: As was often explained, for example by Gerd Gigerenzer and Steve Goodman, one cannot be both a Fisherian and a Neyman-Pearsonian within the same analysis. It seems that researchers doing NHST pick the most convenient components from both worlds and claim to be testing their hypotheses but skip the tedious part that would be necessary to do so (power calculations, or in fact any design analysis on future estimates and uncertainties).
