Pointing to this recent article, “Nonreplicable publications are cited more than replicable ones” and associated press release, Carol Ting writes:
There seems to be something slightly ironic here. The study would be very useful for educational purposes if the analysis was on solid grounds, but the classification seems to imply that studies with p>0.05 in the OSC study are all false positives, which the OSC warned people against in the press release. The finding is cute, but does it make sense to base the analysis on this dichotomous variable? It also raises the bigger issue of communicating findings of replication projects with the public. Even authors have good intentions, the message often gets distorted and all the nuances lost after going through the media pipelines. I guess this one might not be really damaging, but as I read through the press coverage I’ve definitely seen articles taking a very cynical view about scientists and jumping to conclusions based on the 36% replication rate. I wonder what you think about this communication problem.
My reply: What caught my eye, and not in a good way, was the very first sentence in the press release:
Papers that cannot be replicated are cited 153 times more because their findings are interesting . . .
I went into the paper and it turns out that the claim is an additive 153, not a multiplicative 153. That is, the total number of citations of the so-called non-replicated papers was 153 more, on average, than that of the so-called replicated papers. Or something like that. They were fitting some regressions too.
I’d rather report this by saying that some types of papers are cited 1.5 times more than others, or 2 times more than others, or whatever. But I guess that “153” (a number which is actually buried pretty deep in the paper) looks better in the press release. Can’t blame the authors for that!
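To make the additive-versus-multiplicative distinction concrete, here is a minimal sketch in Python. The only number taken from the paper is the additive difference of 153 citations; the baseline citation counts are hypothetical values chosen for illustration, and the function name is mine. The point is that the same additive gap implies very different "times more" figures depending on the baseline.

```python
def ratio_from_additive(baseline, extra):
    """Ratio of citation counts implied by an additive gap over a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline + extra) / baseline


EXTRA = 153  # additive difference in total citations, as reported in the paper

# Hypothetical average citation counts for the "replicated" papers.
# These baselines are assumptions for illustration, not numbers from the paper.
for baseline in (50, 150, 300):
    ratio = ratio_from_additive(baseline, EXTRA)
    print(f"baseline {baseline}: {ratio:.2f} times as many citations")
```

Under any plausible baseline the implied ratio is a small single-digit number, nowhere near a factor of 153.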
Here’s the relevant graph from the published article, which was reproduced in the press release:

Later in the article, it says,
On average, papers that failed to replicate are cited almost 16 times more per year. . . . This difference of 16 citations more per year can be benchmarked against the 5-year impact factor of the journal in which the original studies were published, which measures the citations of papers published in the previous 5 years. In 2016, the 5-year impact factor of Nature and Science was 44 and 38, respectively, meaning the papers they published in the same time period as the original studies were cited, on average, 38 to 44 times per year. . . .
OK, 153/16 = 9.6, so maybe the papers they’re looking at are, on average, 9.6 years old? I’m not sure, but I get the general point.
Getting to the explanations, the paper offers a plausible story:
When the paper is more interesting, the review team may apply lower standards regarding its reproducibility.
I agree with Ting, though, that it is a mistake to characterize a paper as “replicable” or “nonreplicable” based on whether a replication study exceeded some p-value threshold.
Seems like they should correct it to read ‘153 more times’ and ‘16 more times’, which are very different claims from ‘153 times more’!
There must be a typo in what is attributed to Carol Ting:
“Even authors have good intentions, the message often gets distorted and all the nuances lost after going through the media pipelines.”
I suppose Ting wrote/meant
“Even if authors have good intentions, the message often gets distorted and all the nuances lost after going through the media pipelines.”
Another way of correcting the alleged typo:
“Even authors have good intentions; the message, nevertheless, often gets distorted and all the nuances lost after going through the media pipelines.”
Yes, the error is mine. Thank you for pointing it out, and I appreciate the suggested corrections.
Also, apologies for the confusion. I meant “even if”.
The authors of the article note that “A possible answer is that the review team faces a trade-off. When the results are more ‘interesting,’ they apply lower standards regarding their reproducibility.” That squares with my experience with a cover article in Science about regulating flows to increase fishery yield. A friend who reviewed a Nature paper claiming that river deltas were growing rather than shrinking had a similar experience; the editor did not want to accept a negative review.
A decade ago, I was involved in one of the replications for the big psychology paper that replicated 100 studies in Science. We sort of bucked the trend, insomuch as (a) a successful replication required 3 main effects and absence of interactions with gender and (b) multilevel models that made standardized effect sizes difficult. The former was a problem for their design, because you could only pick one effect to preregister.
Our study was an unqualified successful replication, and the original authors were very helpful.
However, in almost every secondary analysis of that dataset (including this paper and ones they cite in the method), they use a subsample of 96 papers (excluding ours and a few others), mostly because the original p-value in the target study wasn’t less than .05.
It’s been a bit vexing over the years; I feel like we’re selectively omitted from reanalysis because our story and replication don’t fit the general narrative of “science is broken”.
Sorry if that’s a bit tangential to your post, but this keeps happening, and it’s weird to be retroactively removed from the historical record of data on this topic in all these later studies like this one.
This is interesting. It highlights the contrasting biases in evaluating research: while the publication bias favors studies with p-values below .05 in judging original results, judgement on replication results tends to focus on positive results turning negative and neglect originally negative findings. Both tendencies are intensified by overconfidence in single studies. This is no surprise to the readers of this blog, but it is rarely addressed in discussions aimed at the general public (as evidenced by this paper’s analysis of the media coverage surrounding the “Replication Project: Psychology” https://doi.org/10.1177/10755470241239947, full text available here https://t.co/Iehe7jCpQ1).
It seems a shame that, instead of raising awareness about the more fundamental problem of overconfidence in single studies, replication efforts often get reduced to narrower questions, such as, in this case, what can we say about “non-replicable” studies based on one replication attempt? Such a narrow approach has led to media headlines like “Research findings that are probably wrong cited far more than robust ones, study finds” and “Studies that are exciting but less likely to be true are cited more often in academia” (for details see the paper mentioned above). But what if the next replication is positive? Does the study suddenly become replicable? It seems that if we want the public to take scientific evidence seriously, focusing on single replications probably isn’t the best strategy. Please don’t misunderstand me: replication projects are valuable, and I understand the inherent challenge of balancing impact and accuracy. However, I also believe that there is a communication issue here, and the problem cannot be resolved without first convincing the audience to accept a greater degree of uncertainty in scientific findings.
Is the conclusion of the article valid? It depends on how studies were accrued into the three replication projects that form their data set. I am not familiar with those studies, nor do I work in replication myself. So I don’t know what the general practices in this field are.
But I think that if I were going to undertake a replication project, I would be likely to focus on studies that are highly cited. After all, what is the point of replicating a study that the world is ignoring anyway? And, if I were interested in making my replication findings interesting, perhaps to attract favorable action from editors, it would be in my interest to try to select a corpus of studies that are especially likely to fail to replicate. Since it is understood that studies with extreme or unexpected findings are likely to fail to replicate, it is feasible to do this.
If both conditions of the preceding paragraph are true, then using the replication project data corpus constitutes selection on a collider of the relationship between high citation and replication failure (aka Berkson’s bias).
Does anybody reading this know how studies were selected for inclusion in the replication projects?
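The collider concern in this comment can be illustrated with a small simulation. This is only a sketch under stated assumptions: citation impact and "fragility" (propensity to fail replication) are drawn as independent standard normal scores, and a study enters the hypothetical replication corpus when the sum of the two exceeds a threshold. The variable names, the additive selection rule, and the threshold are all invented for illustration and are not how any actual replication project chose its studies.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n=20000, threshold=1.0, seed=0):
    """Return (correlation in all studies, correlation in selected studies)."""
    rng = random.Random(seed)
    # Assumption: the two traits are independent in the full population.
    citations = [rng.gauss(0, 1) for _ in range(n)]
    fragility = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption: replicators pick studies that are highly cited and/or
    # look likely to fail, modeled as citations + fragility > threshold.
    selected = [(c, f) for c, f in zip(citations, fragility) if c + f > threshold]
    sel_c = [c for c, _ in selected]
    sel_f = [f for _, f in selected]
    return pearson(citations, fragility), pearson(sel_c, sel_f)


corr_all, corr_selected = simulate()
print(f"all studies:      r = {corr_all:+.2f}")
print(f"selected studies: r = {corr_selected:+.2f}")
```

In the full population the correlation is near zero by construction, while in the selected corpus it is clearly nonzero. Under this particular additive rule the induced correlation is negative, which would work against the paper's finding rather than create it; other selection rules could distort the association differently, which is why the answer to the question above matters.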
Shouldn’t these patterns be expected without any biases or trade-offs from the reviewers?
It seems likely that the more extreme or weird or counterintuitive the result, the more likely it is to be due to an error or a random fluke. Therefore less reproducible. At the same time, papers with such results gain more interest, coverage, hype, and citations.
To counteract this mechanism, wouldn’t the reviewers need to have a bias against such papers, a.k.a. “extraordinary claims require extraordinary evidence”? Maybe we’d like to think that we routinely work along this maxim, but the reviewers may just follow a “fair” standard practice.