No, I don’t think it’s the file drawer effect

Someone named Andrew Certain writes:

I’ve been reading your blog since your appearance on Econtalk . . . explaining the ways in which statistics are misused/misinterpreted in low-sample/high-noise studies. . . .

I recently came across a meta-analysis on stereotype threat [a reanalysis by Emil Kirkegaard] that identified a clear relationship between smaller sample sizes and higher effect estimates. However, their conclusion [that is, the conclusion of the original paper, The influence of stereotype threat on immigrants: review and meta-analysis, by Markus Appel, Silvana Weber, and Nicole Kronberger] seems to be the following: In order for the effects estimated by the small samples to be spurious, there would have to be a large number of small-sample studies that showed no effect. Since that number is so large, even accounting for the file-drawer effect, and because they can’t find those null-effect studies, the effect size must be large.

Am I misinterpreting their argument? Is it as crazy as it sounds to me?

My reply:

I’m not sure. I didn’t see where in the paper they said that the effect size must be large. But I do agree that something seems odd in their discussion, in that first they say that there are suspiciously few small-n studies showing small effect size estimates, but then they don’t really do much with that conclusion.

Here’s the relevant bit from the Appel et al. paper:

Taken together, our sampling analysis pointed out a remarkable lack of null effects in small sample studies. If such studies were conducted, they were unavailable to us. A file-drawer analysis showed that the number of studies in support of the null hypothesis that were needed to change the average effect size to small or even to insubstantial is rather large. Thus, we conclude that the average effect size in support of a stereotype threat effect among people with an immigrant background is not severely challenged by potentially existing but unaccounted for studies.

I’m not quite ready to call this “crazy” because maybe I’m just missing something here.

I will say, though, that I expect that “file drawer” is much less of an issue here than “forking paths.” That is, I don’t think there are zillions of completed studies that yielded non-statistically-significant results and then were put away in a file. Rather, I think that researchers manage to find statistically significant comparisons each time, or most of the time, using the data they have. And in that case the whole “count how many hidden studies would have to be in the file drawer” thing is irrelevant.
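To make that last point concrete, here is a minimal simulation sketch. It assumes the paper's file-drawer analysis works along the lines of Rosenthal's (1979) fail-safe N, as a commenter below suggests; I have not checked the exact procedure. The setup is hypothetical: 20 small studies, a true effect of exactly zero, nothing in the file drawer, and each study reports the largest of 10 candidate comparisons. The comparisons are treated as independent, which is a simplification; real forking paths run through the same data. The function names and all the numbers are illustrative only.

```python
import math
import random
from statistics import NormalDist


def fail_safe_n(z_scores, alpha=0.05):
    """Rosenthal-style fail-safe N.

    Number of additional studies averaging Z = 0 that would pull the
    combined (Stouffer) Z of the observed studies down to the one-sided
    critical value at level alpha.
    """
    k = len(z_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return sum(z_scores) ** 2 / z_crit ** 2 - k


def forked_study_z(rng, n_per_group, n_comparisons):
    """Reported z-score from one small study with a true effect of zero.

    Hypothetical setup: the analyst has n_comparisons candidate comparisons
    (treated as independent, a simplification) and reports the largest one.
    """
    se = math.sqrt(2 / n_per_group)  # SE of a mean difference, unit variance
    best = -math.inf
    for _ in range(n_comparisons):
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        best = max(best, diff / se)
    return best


rng = random.Random(1)
z_reported = [forked_study_z(rng, n_per_group=20, n_comparisons=10)
              for _ in range(20)]
print(f"mean reported z: {sum(z_reported) / len(z_reported):.2f}")
print(f"fail-safe N:     {fail_safe_n(z_reported):.0f}")
```

In this toy setup every study gets reported, so there is no file drawer at all, yet the fail-safe N comes out far larger than the number of studies. A large fail-safe N therefore can't tell you whether the effect is real; it only says how many hidden null studies would be needed under the assumption that the published z-scores are unbiased.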

P.S. “Andrew Certain” . . . what a great name! I’d like to be called Andrew Uncertain; that would be a good name for a Bayesian.


  1. Keith O'Rourke says:

    Yup, study quality* is almost always conflated with possible file drawer effects (as well as treatment and population variation).

    For instance, those who know how to do and manage to carry out high quality studies also tend to know they should be published (regardless of the results) and are best equipped to get them by obstructive reviewers and editors.

    * By quality I simply mean whatever leads to more valid results, and not necessarily how well the authors carried out, say, an inherently biased study.

  2. Michael Johnson says:

    This makes me think of a question that I have been wondering about for some time. When doing a meta-analysis, is it better to have lots of small N studies or just a few large N studies? Obviously large N studies are better on their own than small N studies, but is that also true for a meta-analysis? Let’s assume that we can have the same overall sample size, but one meta-analysis has 100 small studies and another has 10 large studies. Which result would you trust more?

  3. Martha (Smith) says:

    Andrew Certain:

    For a discussion of the many ways in which research on stereotype threat in particular can go wrong, see and the five blog entries following it.

  4. Guive says:

    Statistical superhero Andrew Uncertain fights data analysis crime while maintaining a secret identity as mild-mannered Columbia professor.

  5. Ignazio Ziano says:

    Hi, the approach described in the paper seems to be “fail-safe N” originally proposed by Rosenthal (1979). It’s useless though. See this criticism by Joe Hilgard:

    • Keith O'Rourke says:

      But all the _better_ alternatives given by Hilgard (almost always) fail for reasons I gave above.

      Just a matter of too many unknown unknowns when all one has access to are the published papers (i.e. career promotion materials).

  6. Jeff Valentine says:

    If I had to bet I’d say that it’s both publication bias on the one hand and some combination of QRPs and forking paths on the other. My intuition is based in part on the belief that most population effect sizes are not large. Coupled with study sample sizes that are not large, this implies a distribution of observed effects that includes zero and even negative effects a non-trivial portion of the time. I suspect that on average researchers are more likely to “give up” on these, rather than try to QRP their way to a result, and I think this is classic publication bias. That said, publication bias tests are agnostic as to the cause, and could be renamed “tests for small study effects” with no changes to the underlying procedures (and as far as I can tell, there are literally no experts who have faith in fail-safe N estimates — authors love that test though because it almost always yields a number that is unbelievably large).
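Jeff Valentine's point about small true effects combined with small samples can be checked with a quick calculation. The sketch below uses a normal approximation to the sampling distribution of a standardized mean difference, with standard error roughly sqrt(2 / n) for n people per group. The true effect of 0.2 and the sample sizes are hypothetical numbers chosen for illustration, not values from the meta-analysis, and the function name is made up for this example.

```python
import math
from statistics import NormalDist


def prob_estimate_at_or_below_zero(true_d, n_per_group):
    """Approximate chance that a two-group study's estimate is <= 0.

    Uses a normal approximation with SE(d) ~ sqrt(2 / n_per_group), which
    ignores the small extra term that depends on d itself.
    """
    se = math.sqrt(2 / n_per_group)
    return NormalDist(mu=true_d, sigma=se).cdf(0.0)


# Hypothetical numbers, not taken from the meta-analysis.
for n in (20, 50, 200):
    p = prob_estimate_at_or_below_zero(true_d=0.2, n_per_group=n)
    print(f"n per group = {n:3d}: P(estimate <= 0) ~ {p:.2f}")
```

Under these assumptions, with 20 people per group roughly a quarter of studies would produce an estimate at or below zero even though the true effect is positive. If almost none of those show up in a literature of small studies, something is filtering or reshaping the results, whether that is a file drawer, forking paths, or both.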
