Dark Angel

Chris Kavanagh writes:

I know you are all too frequently coming across defensive, special pleading-laced responses to failed replications, so I thought I would point out a recent and very admirable response from Will Gervais, posted on his blog.

He not only commends the replicators but acknowledges that the original finding was likely a false positive.

Cool! He’s following in the footsteps of 50 shades of gray. Which in turn reminds me of the monochrome Stroop chart:


  1. Marcus says:

    It reads a little like the Cuddy defense when he writes “The sample size is (by 2017 standards) embarrassingly small. 31 participants in one condition, 26 in the other. Between subjects design.” The study was conducted around 2009-2010 but I guess I missed how we only became aware of statistical power issues after 2009. My vague recollection is that concern about power dates at least back to Cohen’s power review published in 1962. This stuff was drilled into me in graduate school on a regular basis.

    • Andrew says:


      Cohen and Meehl have been aware of these things for a long time, but many of the rest of us were pretty clueless as recently as 10 or 15 years ago, as I discussed in this historical overview.

      I think one problem is that, yes, everyone knows about statistical power, but it’s natural to think that if your result is statistically significant, that retrospectively your sample size must’ve been large enough. The (mistaken, but superficially persuasive) reasoning goes like this:
      (1) We need large sample size, accurate measurement, and high power so that our estimates will be precise enough to learn something reliable from data.
      (2) If the estimate is statistically significant, then the experiment was precise enough: Design and data collection did their job.
      (3) Thus, no questions about sample size and power, if the experiment appears to be a success.
      The (mistaken, but superficially persuasive) idea is that power is a part of experimental design, not of analysis. I can see how, until recently, good researchers could make this mistake.
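To illustrate why step (2) fails, here is a minimal simulation sketch. It assumes a true standardized effect of 0.2 (a hypothetical value, not taken from Gervais's study), the group sizes of 31 and 26 quoted above, and a normal approximation to the sampling distribution. The function name `significance_filter` is made up for this example.

```python
import random
from statistics import NormalDist, mean

def significance_filter(true_d, n1, n2, sims=20000, seed=1):
    """Simulate standardized mean differences and keep only the 'significant' ones."""
    rng = random.Random(seed)
    # Approximate standard error of a standardized difference (unit-variance outcome).
    se = (1 / n1 + 1 / n2) ** 0.5
    # Two-sided 5% threshold under the normal approximation.
    z_crit = NormalDist().inv_cdf(0.975)
    estimates = [rng.gauss(true_d, se) for _ in range(sims)]
    significant = [d for d in estimates if abs(d) > z_crit * se]
    power = len(significant) / sims
    # Average size of the significant estimates relative to the assumed true effect.
    exaggeration = mean(abs(d) for d in significant) / true_d
    return power, exaggeration

# true_d=0.2 is an assumption chosen purely for illustration.
power, exaggeration = significance_filter(true_d=0.2, n1=31, n2=26)
print(f"power ~ {power:.2f}; significant estimates overstate the effect ~{exaggeration:.1f}x")
```

Under these assumptions only a small fraction of simulated studies reach significance, and those that do must overstate the assumed effect by a wide margin, because any estimate below the significance threshold is filtered out. Statistical significance on its own therefore says little about whether the design was precise enough.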

      • John Blankenbaker says:

        I don’t disagree with your first “(mistaken, but superficially persuasive)” (that a statistically significant effect means no one worries about power), but I do disagree with your second “(mistaken, but superficially persuasive)”. I’d argue that there are other, more important reasons to consider power as part of experimental design and not analysis, namely what happens in the case of non-significant results.

        See for instance Hoenig, John M. and Heisey, Dennis M. (2001), “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis,” The American Statistician, 55, 19-24 who argue against “observed power” and point out the fallacy of the “power approach paradox” — more observed power does not imply stronger evidence for a non-rejected null hypothesis.

        The message that was drilled into us was “If you didn’t do a power calculation up front, shame on you, but don’t do one retrospectively, because your pre-experiment estimate of the effect size you want to detect is gone; any estimate you come up with now will be tainted by the fact that you saw the results. Use the results from this time to inform the power calculation for the next experiment.” (I’ll be the first to acknowledge that much of what I was taught had a kind of purity unsullied by practical concerns)

        I’ll give Hoenig and Heisey the last word: “The reader with Bayesian inclinations would probably think ‘what foolishness—the whole issue would be moot if people just focused on the sensible task of obtaining posterior distributions.'”
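The point about “observed power” can be sketched numerically. The snippet below assumes a two-sided z-test (a simplification; the function name `observed_power` is hypothetical) and plugs the observed test statistic back in as if it were the true effect. The result depends on nothing but the p-value.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Power computed by treating the observed z statistic as the true effect.

    Assumes a two-sided z-test; this is an illustrative simplification.
    """
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)   # |z| implied by the two-sided p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)    # rejection threshold
    # Probability that a new z ~ N(z_obs, 1) lands in either rejection region.
    return (1 - nd.cdf(z_crit - z_obs)) + nd.cdf(-z_crit - z_obs)

for p in (0.5, 0.2, 0.05, 0.01):
    print(p, round(observed_power(p), 3))
```

Because observed power is a decreasing one-to-one function of p in this setup, it carries no information beyond the p-value itself. Among non-significant results, higher observed power simply means a smaller p-value, so it cannot count as stronger support for the null.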

    • Gervais goes on to say, “So, all in all, this looks like a cute effect emerging from a small-sample experiment using novel manipulations on a presumably noisy single-item DV. This is a recipe for trouble.” This sounds more Carneyesque than Cuddyesque.

      He also refers to the great “False Positive Psychology” (2011), which proposes six requirements for authors of studies, including: “Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.” Although Gervais’s study preceded the paper by about two years, it’s possible he and others believed that a sample size of 25-30 per condition was just fine. (Clearly that wasn’t what Simmons et al. were saying; that’s why they listed six requirements.)

      It’s possible that people have tended to treat the six requirements (and other criteria) as a kind of checklist, instead of considering the underlying principles. A commenter on Gervais’s piece wrote: “Nitpicking Reviewer #3 stuff: the False Positive Psychology authors recommended an absolutely bare minimum of 20, not 25, per cell. They seem to be regretting this, though, because as is the inflationary way of the world, it seems that people have started citing this as justification for using only 20 per cell. See

      In any case, it seems that Gervais has not only acknowledged the weakness of the original study but actually supported the failed replication. That is commendable.

    • zbicyclist says:

      In grad school (1970s) we read Meehl and Cohen and parroted back their arguments on the exams.

      But it didn’t sink in.

      It was like that advice that you shouldn’t assume 5 pt Likert scales (definitely, probably, might, probably won’t, definitely won’t) are interval, but then we went and did t-tests.

      Or maybe seeing that “road closed 5 miles ahead” sign but then continuing that way, anyway, thinking that there’s bound to be some way through.

      • Martha (Smith) says:

        At least you’re honest about it in retrospect. I think what you did is, regrettably, human nature — which points out that, to make things better, we have to figure out how to get past human nature.

  2. Paul Alper says:


    “a very admiral response from Will Gervais”

    Never mind what the issue is, note that Gervais ended his “admiral response” with the following:

    “PS…I usually include an f-bomb or two in my blogs. I saved them for the postscript this time.”


    Blog standards are caricatures of themselves.

  3. psyoskeptic says:

    I don’t think he should be harsh at all on the initial study and I find it a bit problematic in the response. There’s no need for condemning it because it’s not an unreasonable study at all. He has accepted the replication as pretty good evidence that the lone initial finding was spurious and that’s all that really needs to be said.

    A perfectly designed experiment with a reasonable amount of power could also find an effect that’s spurious. There doesn’t have to be anything wrong with the initial study.

    That said, very glad that he at least handled it the way that he did. I recently congratulated Eli Finkel for an excellent acceptance of a failed RRR of one of his studies and now I see he has posted that Cuddy should be congratulated for standing up for herself… ‘sigh’.

    • Greg Francis says:

      I tend to agree that a single study gone wrong might be excusable (but Will is in the best position to judge whether his study was very good). On the other hand, the study was part of a set of 5 studies, and they all produced results that were just barely significant (p=0.04, 0.03, 0.04, 0.03 and 0.04). If the effects were real and the studies were run and reported properly, this should be a very unusual set of outcomes. Using a method that is favorable to the original studies, we estimated the probability to be 0.051. Details are in

      Figure 1 in that paper shows a graphical representation of the 95% confidence intervals for those studies. The lower bound of each CI is always very close to zero, but the interval never quite includes it.

      So, yes, one study could be a fluke, but 5 such studies suggests a pattern; and I think scientists should be skeptical about the results in the original paper. Maybe there really is an effect of religion on rational thinking, but this set of studies does not make a good case.
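As a rough illustration of the kind of calculation involved (this is not the method used in the paper mentioned above, which is not reproduced here), one can take each reported p-value at face value, convert it to “observed power” under a two-sided z-test approximation, and multiply across studies assuming independence. The function names are hypothetical.

```python
from math import prod
from statistics import NormalDist

def power_from_p(p_value, alpha=0.05):
    """Power of an exact replication if the observed z were the true effect (z-test assumption)."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return (1 - nd.cdf(z_crit - z_obs)) + nd.cdf(-z_crit - z_obs)

def prob_all_significant(p_values, alpha=0.05):
    """Probability that every study succeeds, assuming independent studies."""
    return prod(power_from_p(p, alpha) for p in p_values)

# The five p-values reported in the comment above.
p_values = [0.04, 0.03, 0.04, 0.03, 0.04]
prob = prob_all_significant(p_values)
print(f"P(all five significant) ~ {prob:.3f}")
```

With these five p-values the product comes out on the order of 0.05. Even when each result is taken at face value, a clean sweep of five significant results would be uncommon.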

      • Marcus says:

        It seems like a bit of a pattern. I just peeked at two other papers (Gervais & Norenzayan, 2012, Psych Science; Gervais & Norenzayan, 2012, JESP) and there are a lot of p-values just below (or at) .05 and some desperately tiny sample sizes. My favorite is an N=38 for a between-subjects priming study in which an interaction is hypothesized.

        • Greg Francis says:

          To be fair, in their discussion they write about Experiment 2 (in comparison to other experiments, which did find the interaction), “…we are hesitant to offer too much speculation regarding this single result. Instead, we highlight that the inconsistent moderation observed in this paper reflects the current state of the religious priming literature.” Still I agree with your main point, that when a set of 277 subjects in Experiment 1 produced p=.02 for an interaction, it was unlikely that a set of 38 subjects (with a weaker prime) would produce a significant interaction. (Of course, maybe the actual order of the experiments differed from their order in the paper.)

    • Dzhaughn says:

      Well, Cuddy’s work is totally based on posturing, after all.

  4. Markus says:

    “My vague recollection is that concern about power dates at least back to Cohen’s power review published in 1962. This stuff was drilled into me in graduate school on a regular basis.”
    Me too. But back then we reasoned that (a) the experiments being done in a lab context are artificial anyway, so there was at the time no real-world effect size to judge them against, i.e. the effect size might truly be huge for this particular setup even if that doesn’t relate to anything much outside the lab. (b) When trying to derive sample sizes analytically we had to make lots of assumptions we weren’t confident in, so it seemed reasonable to use the sample sizes everyone else was using and getting results with. The faulty logic was that apparently these sample sizes were sufficient for getting ‘effects of interesting size’. Weaker effects, for which we’d need larger samples, wouldn’t be as interesting. We assumed the effect size and power considerations were baked into the standard setup of task + sample size, based on the expert judgement of those in the field.

    In our defense, we (c) usually ran experiments with undergraduate experimenters first until we had a good handle on the effect and then did the real studies for the papers running the experiments ourselves, so the effects were replicable locally, and (d) we compared the experimental results to point predictions from process models to choose among models.

    Where it went wrong IMHO is when that approach was applied to the new implicit social measures with corresponding weaker effect sizes, theory was a lot weaker (i.e. more researcher dfs) and the check against process models was replaced with sweeping claims about the real world. Plus, of course, it was far sexier and easier to publish.
