
When a study fails to replicate: let’s be fair and open-minded

In a recent discussion of replication in science (particularly psychology experiments), the question came up of how to interpret things when a preregistered replication reaches a conclusion different from the original study. Typically the original, published result is large and statistically significant, and the estimate from the replication is small and not statistically significant.

One person in the discussion wrote, “As Simone Schnall suggests, this may not call into question the existence of the phenomenon; but it does raise concerns about boundary conditions, robustness, etc. It also opens up doors for examining exceptions, new factors (e.g., cultural factors outside US / North America), etc.” All this indeed is possible, but let’s also keep in mind the very real possibility that what we are seeing is simple sampling variation.

That is, suppose study 1 is performed under conditions A and is published with p less than .05, and then replication study 2 is performed under conditions B (which are intended to reproduce conditions A but in practice no replication is perfect), and replication study 2 is not statistically significant.

(i) One story (perhaps the preferred story of the researcher who published study 1) is that study 1 discovered a real effect and that study 2 is flawed, either because of poor data collection or analysis or because the replication wasn’t done right.

(ii) Another story (perhaps the back-up) is that study 2 did not reach statistical significance because it was a poorly done study with low power.

(iii) Yet another story (the back-up back-up) is that study 2 differed from study 1 because the effect is variable and occurs in setting A but not in setting B.

(iiii) But I’d like to advance another story (not mentioned at all as a possibility by Schnall in her post that got this recent discussion started) which is that any real effect is so small as to be essentially undetectable (as in the power=.06 example here, and, yes, power=.06 is no joke, it’s a real possibility), and so the statistically significant pattern in study 1 is actually just happening within that particular sample and doesn’t reflect any general story even under setting A.
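Option (iiii) is easy to see in simulation. The sketch below (a minimal Python example; the effect size, noise level, and sample size are numbers I've assumed for illustration) shows a two-group design whose power against a tiny true effect is about .06, and shows that, conditional on reaching statistical significance, the estimate must wildly overstate the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.2   # tiny true effect (assumed for illustration)
sigma = 5.0         # noisy measurements
n = 100             # per-group sample size
n_sims = 10_000

# simulate many two-group studies; each estimate is approx. normal
se = sigma * np.sqrt(2 / n)
estimates = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(estimates / se) > 1.96

print(f"power: {significant.mean():.2f}")
print(f"mean |estimate| among significant results: "
      f"{np.abs(estimates[significant]).mean():.2f} vs true effect {true_effect}")
```

With these numbers the power comes out near .06, and the "significant" estimates average several times the true effect: the statistically significant pattern in any one such study tells us almost nothing general.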

Again, let me emphasize that I’m not speaking of Schnall’s research in particular, which I’ve barely looked at; rather, I’m speaking more generally about how to think about the results of replications.

I think we should be fair and open-minded—and part of being fair and open-minded is to consider option (iiii) above as a real possibility.


  1. Martha says:

    Yes, it does not make sense to me to leave out option (iiii). I think of including this option not as “being fair and open-minded” but as intellectual honesty.

  2. Keith O'Rourke says:

    Thinking a bit about how one should measure replication …

    It should measure how similar the evidence was in say two studies. If two single group studies were conducted and we observed 4/20 and 2/10 most would think the evidence was similar.

    Defining evidence as how prior probabilities are changed by the data suggests measuring it with respect to the same prior (one that isn’t itself current evidence), and thinking about what is meant by “similar” suggests comparing the ranks of how the probabilities were changed rather than the absolute amounts, say the ratio of posterior to prior.
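    As a rough sketch of this idea (the flat grid prior below is my own choice for illustration, not Keith's), one can compute how each study's data update a common prior and then compare the ranks of those updates:

```python
import numpy as np
from scipy import stats

# grid of candidate proportions with a flat prior (assumed for illustration)
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

def posterior(successes, n):
    """Posterior over theta after observing successes out of n."""
    post = prior * stats.binom.pmf(successes, n, theta)
    return post / post.sum()

# the two single-group studies from the comment: 4/20 and 2/10
post1 = posterior(4, 20)
post2 = posterior(2, 10)

# compare the *ranks* of how the prior was changed, not the absolute ratios
rho, _ = stats.spearmanr(post1 / prior, post2 / prior)
print(f"rank correlation of prior-to-posterior updates: {rho:.2f}")
```

    Here the two likelihoods happen to be monotone transforms of one another (θ⁴(1−θ)¹⁶ is the square of θ²(1−θ)⁸ up to a constant), so the rank correlation is exactly 1: by this measure the evidence in 4/20 and 2/10 is maximally similar.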

    And a fancy plot to show this –

  3. george says:

    Do (i) and (ii) overlap? Could “data collection” being flawed in (i) be interpreted as “not enough data hence poor power” in (ii)?

    Perhaps more generally, if the aim is being fair and open-minded – and appearing to be so – then it’ll help to not call study conduct or analysis “poor”. Sure, studies have weaknesses, but these can be due to fundamental difficulties, i.e. a question that is hard to answer, rather than poor-quality work by investigators and/or analysts. Doing the best possible job with finite resources limits what your study can say, but need not make it “poor”.

  4. Chance says:

    There’s some overlap and misunderstandings here.

    #1 should be stated as: the second trial is simply too small and by chance alone did not replicate the first
    #2 should be stated as: the second trial was poorly conducted or simply could not match some of the original conditions

    For #3: as stated, it reflects a misunderstanding of sampling. In statistics the underlying effect is always assumed to be fixed and unknown, but the observed effect in a given sample always varies. So this explanation is better described as “bad luck”: the particular sample in the second study didn’t show the effect. However, the likelihood of this is characterized as power, the Type 2 error.

    For #4 it’s not clear if you’re referring to the power of study 1 or 2, but either way power had nothing to do with study failure if there is no true effect and you think study 1 was significant due to chance alone. The latter is Type 1 error and should have been characterized and controlled in study 1 to be very unlikely. That said, it’s true that one could posit that study 1 did suffer a false positive outcome (though it’s far more likely that one of the other explanations is true if study 1 was designed and executed well).

    • Chance says:

      Forgive typos and grammar errors (iPhone).

      And of course power = 1 – (Type 2 error), I’m sure that was obvious.

    • Andrew says:


      Let me clarify two points:

      Regarding item 1, it’s not just an issue of a study being too small. Sample size is fine but we’ve been seeing a lot of studies with poor measurement. If someone is doing a study with biased and highly variable measurements (that is, low reliability and validity), then increasing the sample size is not necessarily going to help much.
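      As a quick numerical sketch of this point (the effect, bias, and noise figures below are assumptions of mine, purely for illustration): if the measurement carries a systematic bias, the estimate converges to the wrong value no matter how large the sample gets.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1
bias = 0.5    # systematic measurement bias (assumed for illustration)
sigma = 2.0   # high measurement noise

for n in (100, 10_000, 1_000_000):
    estimate = rng.normal(true_effect + bias, sigma, size=n).mean()
    print(f"n = {n:>9,}: estimate = {estimate:.3f} (true effect = {true_effect})")
```

      A bigger n shrinks the sampling noise, but the estimate settles on true_effect + bias, not on the true effect: more data cannot fix a biased measurement.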

      Regarding items 3 and 4: No, I think the framework of type 1 and type 2 errors and false positives and false negatives is unhelpful here. I’m not talking about things being “due to chance alone” or there being “no true effect”; I’m talking about effects that vary and that cannot be estimated well using biased and noisy experiments. See this paper for further discussion of the point, also this paper with Carlin.

  5. Radford Neal says:

    Chance says: “…power had nothing to do with study failure if there is no true effect and you think study 1 was significant due to chance alone. The latter is Type 1 error and should have been characterized and controlled in study 1 to be very unlikely. That said, it’s true that one could posit that study 1 did suffer a false positive outcome (though it’s far more likely that one of the other explanations is true if study 1 was designed and executed well)”

    This is a fallacy. In a well-designed and executed study, if the null hypothesis is true, Type 1 error is unlikely *unconditionally*. Conditional on having obtained p<0.05, however, the probability of Type 1 error depends on both the prior probability of the null hypothesis being false, and the power of the test. Of course, frequentists don't use probabilities in this context.

    But even from a frequentist perspective, the logic is flawed. The basic frequentist argument is that if p<0.05, then either something rather unlikely happened, or the null hypothesis is false, and you're expected to usually opt for the second possibility. But why go for the null hypothesis being false rather than thinking you got p<0.05 by chance when p has a uniform(0,1) distribution? It makes sense to abandon the null hypothesis only if the null hypothesis being false is associated with a distribution for p that is far from uniform(0,1), and in particular gives a probability of p<0.05 that's a lot bigger than 0.05. But if the power of the test is low, this isn't so! (That it's not so is the very definition of "low power".) So the standard frequentist argument for rejecting the null hypothesis when p<0.05 has no force when the test has low power.
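    This argument can be checked with a quick simulation (the prior probability of a true null and the power figures below are assumptions for illustration): conditional on p < 0.05, the chance that the null was actually true depends heavily on the power of the test.

```python
import numpy as np

rng = np.random.default_rng(0)

n_studies = 200_000
prior_null = 0.5   # assumed: half of tested hypotheses are truly null
alpha = 0.05

fdrs = {}
for power in (0.06, 0.80):
    null_true = rng.random(n_studies) < prior_null
    # null studies reach p < alpha with probability alpha; others with prob = power
    significant = rng.random(n_studies) < np.where(null_true, alpha, power)
    fdrs[power] = null_true[significant].mean()
    print(f"power = {power:.2f}: P(null true | p < 0.05) = {fdrs[power]:.2f}")
```

    With power .06, nearly half of the "significant" findings come from true nulls (0.05 / (0.05 + 0.06) ≈ 0.45); with power .80 only about 6% do. Rejecting the null at p < 0.05 carries force only when power is high.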

  6. Mayo says:

    My view is that much more attention should be given to the glaring methodological flaws that prevent many of these investigations from counting as even remotely probing the alleged phenomenon of interest. If one is studying whether “situated cognition” of cleanliness may influence moral judgments, one should demonstrate a rigorous and stringent critique of each aspect of the artificial experiment and proxy variables before giving us things like: unscrambling words having to do with soap causes less harsh judgments of questionable acts such as eating your dog after he’s been run over, as measured by so and so’s psych scale.

    Serious sciences or even part way serious inquiries demand some kind of independent checks of measurements, not just reification based on another study within the same paradigm (assuming the reality of the phenomenon). When I see absolutely no such self scrutiny in these studies, but rather the opposite: the constant presumption that observations are caused by the “treatment”, accompanied by story-telling, for which there is huge latitude, I’m inclined to place the study under the umbrella of non-science/chump effects before even looking at the statistics.

    The replicators, in failing to address these methodological concerns, and making it appear that it’s a matter of pure statistics, getting the power of the replication right, etc., are in danger of merely replicating the deepest fallacies of the original research. Perhaps a certain suspension of serious criticism is prerequisite for some fields, but fortunately, not for all readers.

    • Andrew says:


      Yes. My take on this is that measurement is an underrated area of statistics, even within psychology research where the terms “validity” and “reliability” are well known. It’s all too common to take measurements as given, without seriously considering the connection between the measurement and what is purportedly measured.

    • Fernando says:


      You might find the curious case of Claude Bernard interesting. See annex to

      • Keith O'Rourke says:

        Fernando: From the paper “The field of replication studies remains notoriously ill-defined, poorly understood, rarely taught, seldom done, spurned by most journal editors and reviewers, and rarely published.”

        Fully agree (and think one could replace “replication studies” with meta-analysis for a more general statement).

        Perhaps something about it essentially being a critical audit of study design, conduct, reporting, analysis, and interpretation makes it an often skipped-over or poorly discussed/published topic.

      • Mayo says:

        Thanks for the link to your paper. From a quick scan, I think you might find my Error and the Growth of Experimental Knowledge (Chicago, 1996) interesting. I have said essentially, and nearly in those same words, what you write in your recent paper:

        “The objective of a procedural replication is not to come up with new estimates, or to recommend an ideal scientific standard, but to highlight problems with the actual standard used, and to compile checklists of possible sources of errors to consider in other similar studies (either prospectively, or retrospectively). Put differently, the goal is to know whether and why the original study may be flawed, and to learn from these mistakes, not to make a substantive contribution with regards to the original research hypothesis.”

        However, I wouldn’t force it into a Bayesian mold with beliefs of operating properties instead of ordinary information about operating properties. The rest is deductive logic and error statistical inference.

        • Keith O'Rourke says:


          This “wouldn’t force it into a Bayesian mold” does seem a bit _undemocratic_, whereas being democratic better enables inquiry aimed at getting things less wrong.

          I would agree many Bayesian approaches and priors block noticing and learning from mistakes (with this unquestionable prior, conditioning on what you omnipotently take as the true data, the posterior is self-evidently the full answer) but there are other approaches where the prior is more purposefully used to better discern how one is wrong and enable getting less wrong faster. These may be far less popular and not well promoted – but Box did start the ball rolling.

  7. Fernando says:


    Ha! A few months back I tried to borrow your book from my local university library but it appears someone has stolen it. Congratulations. I think that is a form of compliment! I’ve been meaning to read it for some time.

    As for the Bayesian wrapper, I actually think it is useful. For one, you can still have base rate fallacies even if control is perfect. For another, it seems to describe how some scientists actually think. My sense is Bernard was doing this without being explicit. Also, game theorists treat Bayesian thinking as the baseline. To some of us it seems natural. But I think Bayesians could do a lot better in easing the math, the computations, and elicitation.
