
Jeff Lax writes:

I’m probably not the only one telling you about this Science story, but just in case.

The link points to a new research article reporting a failed replication of a study from 2008. The journal that published that now-questionable result refuses to consider publishing the replication attempt.

My reply:

I agree it’s a problem but it doesn’t surprise me. It’s pretty random what these tabloids publish, as they get so many submissions. Sure, they could’ve published this particular paper, but maybe there was something more exciting submitted to Science that week, maybe a new manuscript by Michael Lacour?


  1. jim says:

    Seeing the abstract on PubMed, it’s interesting that the only number in the abstract is the sample size. The degree or intensity of the effect isn’t there, nor are any other data relating to the outcome of the experiment. That’s a bad sign.

    When I see something like this, my first question is: if this were recognizable in a group of only 46 individuals, wouldn’t it have been widely observed already?

    The abstract doesn’t identify any subpopulation. This is a sample from the entire world population, right? So if that’s the case, if we were just modelling the population and grabbed a random sample of 46, it seems likely most would be in the center of the spectrum (25th–75th percentile) of responsiveness. That would leave, in the modern world, billions of people with more extreme reactions than the sample median, which should make the behavior readily observable to any normal human.

    Humans are superb at recognizing these kinds of patterns. IMO it would have to be a *very* small effect to not be already visible to people.

  2. Michael Nelson says:

    Multiple principles of good science practice and communication come into play here, which is interesting and important because these are the kinds of things we’ll need to start reconciling and codifying as we proceed through the ongoing paradigm shift in what “good science” means:

    Principle 1: A journal is obligated to publish high-quality replications of high-impact studies it previously published.

    Principle 2: Authors should not use arbitrary cut points (p-values) to draw meaningful conclusions from statistical analyses.

    Science obviously should publish the replication, and not doing so is harming both science and Science. Separately, and without implying a false equivalence, the authors should not characterize their study as a replication “failure” on the basis of p-values. Most of us, I think, associate the term “failed replication” with an original study’s outcomes being contradicted by those of a second study. Given that nearly every measure in the various replications (see the graph in their Twitter thread) showed positive impacts, albeit very noisy (p > .1) ones, their work is better characterized as a “weak” replication, a “trivial” replication, or maybe even an “inconclusive” replication. The authors could accurately say that their results failed to achieve the same arbitrary standards used to assess the results of the original study, and therefore failed to replicate according to the original authors’ own rules; yet, if we all agree with Principle 2 above, then that’s not saying much. It’s like saying, “Even if you believe that throwing a coin into a well will lead to your wish coming true, we show that the coin in this case landed nowhere near the well…”

    Of course, the authors can and should (and maybe do) make the points that a “real” effect should be less noisy with larger samples, and that no effect is truly zero, so what really matters is whether the effect is large enough and generalizable enough to be meaningful. And maybe on the basis of those points alone, setting aside the question of p-values, there’s an argument to be made for replication failure, but I still think that “failure” is awfully strong a term for this case.

    Am I being pedantic? Nitpicking? Of course I am! Precision and care in terminology are also part of science:

    Principle 3: Intelligent and respectful communication of results, and the debate of those results, requires language that is precise and consistent, particularly in social science fields where technical words often have very different meanings or connotations in common speech.

    I’m surely not being more precious here than when Andrew (agreeing with Martha) recommended that we stop using the term “control for” when interpreting regression models and instead use “adjust for” because people get the wrong idea about what it means to “control” a variable. In general, the fact that the term “failure” is so loaded with negative connotations means we should use it only with great care and with a strict, explicit definition. Although I know of no “official” technical definition of a replication failure (one may well exist), I feel confident that the use of “failure” in the context of this particular replication is technically wrong in that this label ignores the actual pattern in the data, in favor of an arbitrary cut point.

    And just to be clear: What Science is doing is foolish, unjustifiable, and very plainly wrong, and in no way analogous to the authors’ imprecise use of terminology and implicit endorsement of binary interpretation of results. I just think this movement has progressed to the point where we actually can recognize multiple errors at once and ensure that we are not sending mixed messages. (Also, I found it funny that the very next post after this one criticizes a paper for both using arbitrary p-values and for being unlikely to replicate!)

    • Martha (Smith) says:

      “Andrew (agreeing with Martha) recommended that we stop using the term “control for” when interpreting regression models and instead use “adjust for” because people get the wrong idea about what it means to “control” a variable.”

      Actually, my position is a bit more extreme than that: I advocate saying “attempt to adjust for”, to take into account that procedures for “adjusting for” still leave some uncertainty in the results.
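Michael Nelson’s point above — that several individually noisy but directionally positive estimates can still carry real collective evidence — can be sketched with a quick calculation. This is a hedged illustration using invented z-scores, not the actual replication data: under Stouffer’s method, five estimates that each miss a two-sided p < .1 cutoff can nonetheless combine to a clearly small p-value.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score, via the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

# Five hypothetical replication measures, all positive but noisy.
# These z-scores are invented for illustration only.
zs = [1.0, 1.2, 1.3, 1.1, 1.4]

individual_ps = [two_sided_p(z) for z in zs]
# Each estimate on its own is "nonsignificant": every p is above .1.

# Stouffer's method: combined z = sum(z_i) / sqrt(k).
combined_z = sum(zs) / math.sqrt(len(zs))
combined_p = two_sided_p(combined_z)
# Combined across the five measures, the evidence is much stronger
# (the combined p-value lands well below .05).

print("individual p-values:", [round(p, 3) for p in individual_ps])
print("combined p-value:", round(combined_p, 4))
```

The design choice here (equal weights, one-directional estimates) is the simplest possible combination rule; the point is only that a binary "failed / didn’t fail" reading of each noisy estimate throws away the pattern that the pooled evidence retains.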

  3. Jordan Anaya says:

    This story continues:

    Despite not being mentioned by name, the editor saw the thread and blocked the author.

    His stated reason for this is that his personal communication was shared without his permission:

    As far as I can tell it was just a couple of quotes, and again, his name wasn’t even attached to them. The premise that everything would have been hunky-dory had the author just paraphrased the correspondence doesn’t hold water. Had that happened, the editor could simply have said the correspondence was misconstrued and used that as his reason for blocking them. You really can’t win with these people: they will say whatever they have to say to paint themselves as the victim and you as the bad guy.

    Anyways, the editor has been on the defensive all day and has been steadily deleting his tweets; I wasn’t following along, so I’m not sure what all of his excuses were. He seems to have eventually settled on the position that he can do whatever he pleases, which I’m pretty sure is the official policy of the tabloids:

    So in short, the tabloid didn’t want to clean up its own mess, which is fine, but when the authors dared to complain about it on social media, they got blocked and accused of an ethics violation. If I were these authors I would take this as a sign that my future papers are not welcome in the tabloid, and it signals to everyone else that if you publicly complain about the tabloid, your papers won’t be welcome either.
