A completely reasonable-sounding statement with which I strongly disagree

From a couple years ago:

In the context of a listserv discussion about replication in psychology experiments, someone wrote:

The current best estimate of the effect size is somewhere in between the original study and the replication’s reported value.

This conciliatory, split-the-difference statement sounds reasonable, and it might well represent good politics in the context of a war over replications—but from a statistical perspective I strongly disagree with it, for the following reason.

The original study’s estimate typically has a huge bias (due to the statistical significance filter). The estimate from the replication, assuming it was preregistered, is unbiased. I think in such a setting the safest course is to use the replication’s reported value as our current best estimate. That doesn’t mean that the original study is “wrong,” but it is wrong to report a biased estimate as if it’s unbiased.

And this doesn’t even bring in the possibility of an informative prior distribution, which in these sorts of examples could bring the estimate even closer to zero.
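To see what the significance filter does to an estimate, here is a minimal simulation sketch (the true effect, standard error, and number of studies are hypothetical, chosen only for illustration): every study measures the same small effect with noise, but only the estimates that reach p < .05 pass the filter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers, for illustration only: a small true effect measured
# with a standard error of comparable size.
true_effect = 0.1
se = 0.1
n_studies = 100_000

estimates = rng.normal(true_effect, se, n_studies)

# The significance filter: only estimates with |z| > 1.96 get reported.
published = estimates[np.abs(estimates / se) > 1.96]

print(f"true effect:                {true_effect:.2f}")
print(f"mean of all estimates:      {estimates.mean():.2f}")   # ~0.10, unbiased, like a clean replication
print(f"mean of filtered estimates: {published.mean():.2f}")   # ~0.24, inflated by a factor of ~2.4
```

Splitting the difference between the filtered estimate and an unbiased replication would still leave you well above the true value.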

14 thoughts on “A completely reasonable-sounding statement with which I strongly disagree”

  1. I agree – as long as the data-analytic specification (not just the research methodology) is pre-registered as well. I wonder whether some authors involved in replication attempts have a conscious or unconscious interest in either replicating or not replicating the original effect, and so choose a data-analytic specification that supports whatever bias they hold. The garden of forking paths can be abused in both directions.

  2. I wonder if publication bias still operates with replication studies? Perhaps a result that contradicts a recently published (and perhaps highly publicised) study is more likely to be published than one that simply confirms it?

    • Regarding Marcus’s concern above, direct replications in psychology are very often pre-registered (the Venn diagram of people doing replicability research overlaps a lot with pre-registration fans). They are also closely scrutinized for adherence to the original protocol and analysis plan, which is an additional constraint on researcher degrees of freedom.

      As far as the additional concern about publication bias, I’ve heard suspicions to that effect but I haven’t seen evidence either way.

      To address both kinds of concerns, psychology journals are beginning to adopt registered report submission tracks, including for replications. That means that a pre-registration is peer reviewed and conditionally accepted before the data are collected (the “conditional” part is mostly based on whether the researchers end up following the approved protocol). Chris Chambers has implemented such a policy for replications at Royal Society Open Science. And Steve Lindsay, the editor in chief of Psychological Science, has said that they will be adopting a similar submission track soon.

      http://neurochambers.blogspot.co.uk/2016/11/an-accountable-replication-policy-at.html
      https://twitter.com/dstephenlindsay/status/860264875188080640

      • But let’s say that you want to do a replication of the power pose effect. Should you be constrained to the same analytic specification (and design) as the original authors, who may have done strange things in both the design (e.g., no control group) and the analysis (e.g., outlier identification strategy, use of covariates) that you don’t agree with?

    • Based on my own N=1 experience, I feel that this is possible. I submitted a replication attempt (a registered report, meaning no data collection at the time of submission), but the editor desk-rejected it. One of the reasons was that the editor felt that the probability of successful replication was too high. I got the impression that the editor would have been more inclined to entertain the submission if the replication had been likely to fail. In this particular example, the prior evidence consisted of a single Psych Science paper with 4 conceptual replications in the same paper, p-values of .04, .04, .05, .005, and observed power at around 50% – which I personally considered weak evidence.
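      A rough normal-approximation sketch of the arithmetic behind that judgment (computing “observed power” by treating each reported effect as if it were the true effect, and ignoring the negligible opposite tail):

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)            # ~1.96 for a two-sided test

# The four p-values quoted above, converted to z-scores, and the power each
# study would have if its estimate were exactly the true effect.
p_values = [0.04, 0.04, 0.05, 0.005]
z_obs = [norm.ppf(1 - p / 2) for p in p_values]
obs_power = [1 - norm.cdf(z_crit - z) for z in z_obs]
print([round(pw, 2) for pw in obs_power])   # roughly [0.54, 0.54, 0.50, 0.80]

# If four independent studies each really had about 50% power, the chance of
# all four coming out significant would be only about 6%.
print(0.5 ** 4)                             # 0.0625
```

      Four significant results out of four attempts at roughly 50% power is itself a hint that something has been selected.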

  3. Hey, that was me who said that. For context, it was back in November 2014 (before the Reproducibility Project results were in), and the discussion arose from Simone Schnall’s now infamous blog post criticizing the replication movement:

    http://www.psychol.cam.ac.uk/cece/blog

    So, at the time I was musing about ways we might discuss failed replications in venues like social media that are less sensationalistic than “the original scientist is a fraud; the new replication falsifies previous research,” because such statements seemed to create strident opposition to replication in psychology from the replication targets (i.e., Schnall’s blog post).

    I’ve read lots since then and feel like I’ve learned a lot. I probably align more closely to Andrew’s position now than I did then. Still, there’s a part of this logic I’m skeptical about. In this post, Andrew’s argument goes like this:

    1. Published effects are positively biased due to the significance filter and garden of forking paths
    2. Pre-registered replications have less positive bias because they minimize the influence of the significance filter / forking paths
    3. Therefore, the safest course is to use the replication’s reported value as our current best estimate.

    Points 1 & 2 are ironclad, as far as I’m concerned. However, I think that the conclusion does not necessarily follow from the first two points, because it relies on one more unstated assumption:

    2b. Effects from pre-registered replications are not negatively biased (i.e., not smaller than their true value)

    Even as a strong proponent of replication studies, I worry a bit that this auxiliary assumption doesn’t hold because of a potential for pre-existing bias among replicators (e.g., Marcus’ post above). For replication studies, the reward system has become inverted: falsifying a “big name” study is the way to getting published and accolades on social media. Replicators often choose a very small subset of an experiment to replicate, and I think that cherry-picking might arise at the design stage, wherein replicators target the weakest components of an experiment or choose analyses that are overly stringent. Moreover, if (as critics claim) replication studies are more frequently done by non-experts or are otherwise more “sloppy”, this would imply more measurement error in replications, which in turn would tend to deflate the effect sizes for true effects. I don’t know that any of this is true because I haven’t seen evidence one way or the other. But because I can imagine a logical mechanism by which replication studies could be negatively biased, I’m left with a small seed of skepticism.
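    To make the measurement-error point concrete, here is a small sketch (the correlation and reliabilities are invented purely for illustration): classical measurement error attenuates an observed correlation toward zero, so a noisier study of a true effect will tend to report a smaller one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_r = 0.3   # hypothetical true correlation between two constructs

x = rng.normal(size=n)
y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)

def add_measurement_error(v, reliability):
    """Add noise so that the observed variable has the given reliability."""
    noise_sd = np.sqrt(1 / reliability - 1)
    return v + noise_sd * rng.normal(size=n)

for rel in (0.9, 0.7, 0.5):   # a careful lab vs. hypothetically sloppier ones
    x_obs = add_measurement_error(x, rel)
    y_obs = add_measurement_error(y, rel)
    print(f"reliability {rel:.1f}: observed r = {np.corrcoef(x_obs, y_obs)[0, 1]:.2f}")

# Classical attenuation: observed r ~ true_r * reliability, i.e. about 0.27, 0.21, 0.15.
```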

    Right now, I think it’s safest to be skeptical of all results until a relatively large number of replications are performed. I’d contend that replicators could positively or negatively bias their results based on their pre-conceived notions, and that those biases would only wash out after you get numerous different research groups to replicate an effect.

    • Hi Sean,
      You write: “falsifying a “big name” study is the way to getting published and accolades on social media.” I’ve heard this before but I don’t know what the evidence for it is. I still don’t see many replications published in top journals, and I haven’t seen evidence that those publications were easier to get published than an original paper, nor that they get cited more. I do believe that they get some attention on Twitter, but I would guess it’s still a fraction of the attention that “big name” people (or even perhaps ordinary people) get for their original research.
      More generally, the people I know who have published replications in top journals did not have an easy time and were held to quite high standards. I wrote a blog post about my perception of how closely we scrutinize replications versus original studies here: http://sometimesimwrong.typepad.com/wrong/2016/08/solution-is-us.html
      Of course this is just my impression, and I’d be curious to hear more from people whose impression is different (e.g., examples of replications that were published in top journals despite signs of a lack of competence on the part of the replicators, etc.).

      • Simine and Sean:

        I do share Sean’s concern – the _replicators_ are working in (and many cut their teeth in) the same ecosystem that made careerism the way to go, rather than trying to get at the truth no matter what (true science).

        My previous experience with some who now present themselves as part of the vanguard of replication would suggest that, unless they have changed, they are putting their careers first.

        Some evidence might be discerned from how they respond to criticism of their work, but they are playing on a higher field and are at higher risk of being corrected. Let’s hope so.

      • Good points. It’s an impression I have (and a statement often made by people), but I don’t have evidence to support it. It may or may not be true. I would imagine that getting evidence to prove that point would be difficult without some pretty intensive database trawling. The only reason I make these points at all is because I’m pressing on the conjecture that replication studies are unbiased, trying to falsify it as a means of strengthening the position. The weakness in the argument, as I see it, is that there isn’t evidence one way or the other that replication studies have no negative bias.

        In terms of your blog post, my impression is certainly that replication studies are published in less prestigious outlets on average, because top journals prioritize novelty. Also, the outlets that do publish replications tend to have more stringent requirements before you can submit at all (e.g., requiring pre-registration). I don’t have a sense of whether the actual review process is more or less stringent, though – I haven’t published enough replications to tell.

        I do suspect that it is a quite different set of factors that determines success on social media vs. journals, but that’s just conjecture. Social media traction is hard to gauge from anecdotal evidence, since it is tailored to individuals by design (so I see lots more positive material on replication studies on social media as a result).

  4. My argument against “2b. Effects from pre-registered replications are not negatively biased” is that so far I have failed to see pre-registered replications by the original authors that support their initial findings. I’m willing to believe in some ‘special sauce’ or tacit experimenter knowledge, but for most replication failures so far it wouldn’t be overly onerous for the original authors to replicate themselves, maybe with some additional blinding to avoid experimenter effects, maybe with some added manipulation checks. I’m not seeing that. I see a lot of HARKing, cherry-picking and obfuscation. To my knowledge, nobody has yet said ‘I’m surprised this doesn’t replicate, please give me a moment to have another go myself, after which we’ll have a better idea whether it was an experimenter effect, some subtle stuff we overlooked, or a false positive.’
    The argument for 2b needs to be based on facts, that is, on an ability to replicate one’s own experiments, not on insinuations about the motives or quality of the replicators. After someone can reliably replicate his or her experiments, the quality of the replications (and experimenter effects in the original lab) are fair game for discussion, but not before.

    • I don’t disagree with these points. Original authors don’t seem willing to put their ideas to the test with their own pre-registered replication. If they were, the contention I’m making with 2b above would be partially testable.

      I think that the evidence for 2b is non-existent right now. There is no good evidence for the presence OR absence of negative bias, which can be a sticking point when dealing with critics. I do find the “special sauce” argument distasteful, but it is a falsifiable statement – if you could get people to do pre-registered replications of their own studies and compare them with those of “less-experienced” researchers.

    • Markus: “insinuations about the motives or quality of the replicators”

      It’s not insinuation but rather just that in science one should avoid trusting anyone when things can be assessed.

      Furthermore, there are reasons to worry here even with uniformly high motives and quality among the replicators – for which Felix provides some suggestive evidence [poor editorial decisions].

      http://statmodeling.stat.columbia.edu/2017/05/05/completely-reasonable-sounding-statement-strongly-disagree-2/#comment-483303
