Deborah Mayo asked me some questions about that paper (“Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors”), and here’s how I responded:
I am not happy with the concepts of “power,” “Type 1 error,” and “Type 2 error,” because all of these are defined in terms of statistical significance, which I am more and more convinced is a bad way to summarize data. The concepts of Type S and Type M errors (roughly: getting the sign of an effect wrong, and overestimating its magnitude, conditional on statistical significance) are not perfect, but I think they are a step forward.
Now, one odd thing about my paper with Carlin is that it gives some tools that I recommend others use when designing and evaluating their research, but I would not typically use these tools directly myself, because I don’t want to summarize inference by statistical significance.
But let’s set that aside, recognizing that my paper with Carlin is intended to improve current practice, which remains focused on statistical significance.
One key point of our paper is that “design analysis,” considered broadly (that is, calculation or estimation of the frequency properties of statistical methods), can be useful and relevant even _after_ the data have been gathered. This runs against the usual expert advice from top statisticians. The problem is that there’s a long ugly history of researchers doing crappy “post hoc power analysis,” in which they perform a power calculation using the point estimate from the data as their assumed true parameter value. This procedure can be very misleading, either by getting researchers off the hook (“sure, I didn’t get statistical significance, but that’s because I had low power”) or by encouraging overconfidence. So there’s lots of good advice in the statistics literature telling people not to do those post hoc power analyses. What Carlin and I recommend is different: we use real prior information, rather than the noisy point estimate, to posit the true parameter value.
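To make this concrete, here is a minimal sketch in Python of the sort of calculation we have in mind, assuming a normally distributed estimate. The function name retrodesign and the particular numbers (true effect 2, standard error 8) are just for illustration; the key point is that the assumed true effect comes from prior information, not from the data:

```python
# Minimal sketch of a retrospective design analysis, assuming the estimate
# is normally distributed around the true effect. The true effect is set
# from prior information, NOT from the observed point estimate.
import numpy as np
from scipy import stats

def retrodesign(true_effect, se, alpha=0.05, n_sims=1_000_000, seed=0):
    # assumes true_effect > 0
    z = stats.norm.ppf(1 - alpha / 2)                  # critical value
    p_hi = 1 - stats.norm.cdf(z - true_effect / se)    # P(estimate/se >  z)
    p_lo = stats.norm.cdf(-z - true_effect / se)       # P(estimate/se < -z)
    power = p_hi + p_lo                                # P(statistical significance)
    type_s = p_lo / power                              # P(wrong sign | significant)
    rng = np.random.default_rng(seed)
    est = true_effect + se * rng.standard_normal(n_sims)
    sig = np.abs(est) > se * z                         # the significance filter
    exaggeration = np.mean(np.abs(est[sig])) / true_effect  # Type M error
    return power, type_s, exaggeration

# A hypothetical noisy study: true effect 2 units, standard error 8.
power, type_s, exaggeration = retrodesign(2, 8)
print(f"power={power:.2f}, Type S={type_s:.2f}, exaggeration={exaggeration:.1f}")
```

With a true effect only a quarter of a standard error, this gives power of about .06, a roughly 24% chance that a statistically significant estimate has the wrong sign, and significant estimates that overstate the true effect by a factor of around 9.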
The other key point of our paper is the statistical significance filter: conditional on statistical significance, estimates systematically overstate the magnitudes of the underlying effects. We quantify this with the exaggeration factor, which is always greater than 1 but can be huge if the signal is much smaller than the noise.
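Here is a quick simulation of that blow-up, again under an assumed normal model, with made-up signal-to-noise ratios (the true effect measured in standard-error units):

```python
# Sketch: how the exaggeration factor grows as the signal-to-noise ratio
# shrinks, under a normal model with standard error 1. The particular
# ratios below are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z = stats.norm.ppf(0.975)  # two-sided alpha = 0.05

for snr in [2.8, 1.0, 0.5, 0.25, 0.1]:           # true effect, in se units
    est = snr + rng.standard_normal(1_000_000)    # simulated estimates
    sig = np.abs(est) > z                         # the significance filter
    exaggeration = np.mean(np.abs(est[sig])) / snr
    print(f"snr={snr:.2f}  power={sig.mean():.2f}  exaggeration={exaggeration:.1f}")
```

At snr = 2.8 (the textbook 80%-power setting) the exaggeration is mild, around 1.1; at snr = 0.1 it is over 20. Significance filtering always inflates, but it inflates catastrophically only when power is low.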
Finally, this all fits in with the garden of forking paths. If researchers were doing preregistered experiments, then in crappy “power = .06” studies they’d get statistical significance only 6% of the time. And, sure, those 6% of cases would be disasters, but at least in the other 94% of cases, researchers would give up. But with the garden of forking paths, researchers can just about always get statistical significance, and hence they come face to face with the problems that Carlin and I discuss in our paper.
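As a stylized illustration (and only stylized: forking paths involve data-dependent analysis choices, not literally running many preregistered tests), here is what happens when pure noise meets an analyst who can choose among, say, 20 comparisons after seeing the data; the number 20 is arbitrary:

```python
# Sketch: why forking paths defeat the nominal 5% error rate. The data are
# pure noise, but the analyst implicitly reports whichever of k comparisons
# comes out significant. Independent tests are a stylized stand-in for
# data-dependent analysis choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, k, n = 10_000, 20, 50
hits = 0
for _ in range(n_sims):
    data = rng.standard_normal((k, n))            # no true effects anywhere
    t = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))
    p = 2 * stats.t.sf(np.abs(t), df=n - 1)       # two-sided one-sample t-tests
    hits += p.min() < 0.05                        # report the "best" comparison
print(f"P(at least one significant comparison) = {hits / n_sims:.2f}")  # about 0.64
```

With 20 independent looks at noise, something comes up statistically significant about 64% of the time (1 − .95^20 ≈ .64), which is the sense in which researchers can just about always find significance somewhere.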
I hope this background is helpful. Published papers get revised so many times that their original motivation can become obscured.
P.S. Here’s the first version of that paper. It’s from May 2011. I didn’t realize I’d been thinking about this for such a long time!
P.P.S. In comments, Art Owen points to a recent paper of his, “Confidence intervals with control of the sign error in low power settings,” following up on some of the above ideas.