Some researchers retrospect on their mistakes

Roy Mendelssohn points to this article by Julia Rohrer, Warren Tierney, Erik Uhlmann, et al., who write:

Science is often perceived to be a self-correcting enterprise. In principle, the assessment of scientific claims is supposed to proceed in a cumulative fashion, with the reigning theories of the day progressively approximating truth more accurately over time. In practice, however, cumulative self-correction tends to proceed less efficiently than one might naively suppose. Far from evaluating new evidence dispassionately and infallibly, individual scientists often cling stubbornly to prior findings. Here we explore the dynamics of scientific self-correction at an individual rather than collective level. In 13 written statements, researchers from diverse branches of psychology share why and how they have lost confidence in one of their own published findings. We qualitatively characterize these disclosures and explore their implications. A cross-disciplinary survey suggests that such loss-of-confidence sentiments are surprisingly common among members of the broader scientific population yet rarely become part of the public record. We argue that removing barriers to self-correction at the individual level is imperative if the scientific community as a whole is to achieve the ideal of efficient self-correction.

They have an interesting set of stories. I wonder what the people would say who made the ridiculous claim that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” I’m still wondering why they didn’t claim 200%, just to be super-sure about it.

This also reminds me of the “I Can’t Believe It’s Not Better” session.

10 thoughts on “Some researchers retrospect on their mistakes”

      • Jonathan:

        I’m reminded of this story from the collection, “A Random Walk in Science”:

        Once in Russia, in a physics exam, the professor wrote the equation

        E = hv

        and asked a student:

        “What is v?”
        “Planck’s constant.”
        “And h?”
        “The length of the plank.”

        [Astonishingly, this is translated directly from the Russian.]

        “Astonishingly” is maybe a bit strong, but it’s a good story in any case.

  1. Very interesting article. It’s kind of disappointing that even these top psychologists don’t seem to really get it. E.g., the reason for retracting one of the papers mentioned in this article is that a result flipped from significant to non-significant when random slopes were added to the linear mixed model (Fisher et al 2015). That is not a very convincing reason for retracting a paper. The authors say that the scientific record was self-corrected because of the retraction. I would say that I didn’t really learn a lot from the flip from sig to non-sig.

    The presupposition here is that we can simply bin findings into “supports belief X” vs “does not support belief X”. But one can maintain some degree of belief in X without necessarily committing 100% to it. In both linguistics and psychology, there seems to be this implicit pressure, which we learn in grad school, to have a position. E.g., in sentence processing (psycholinguistics) if you are a connectionist modeler, then symbolicists are evil and wrong, and vice versa; the idea that both kinds of representations of linguistic knowledge in the mind/brain have upsides and downsides and that we don’t really know what “the” right representational assumptions should be, and that we don’t need to commit to either, is never taken seriously. Why do I have to believe in anything with 100% certainty?

    If the Fisher et al 2015 paper is retracted because a result flipped from significant to non-significant, then practically every second paper in psycholinguistics will have to be retracted. In many of these papers, which are published in journals seen as “top journals” (Journal of Memory and Language and Cognition, for example), even the published p-values are not significant, but clever wording and obfuscation (such as coyly avoiding discussing minF’ values that are clearly not significant even in the published tables of F- and p-values), coupled with hurried reviewing and lack of editorial oversight, lets these papers enter the canon of truth.

    Those that do report actually significant p-values often involve things like using repeated measures anova (to knock out some of the variance components from the game—basically the same thing that led to the Fisher et al retraction). Another common approach (which I will blog on soon with concrete examples) is to do, say, a 3×2 experiment and then slice up the data into two subsets of a 2×2 design and then analyze those separately; this can reduce the residual variance, leading to significant results. If the analysis had been done with the full data set (say, using nested contrast coding), there would be nothing to report.

    I myself published a paper in 2018 with Andrew Gelman, in which we report 7 failed attempts to replicate a published claim in a Journal of Memory and Language article. The author of that original article asked me if he should retract the paper. My answer was no: the claim could be true, we just don’t have any convincing evidence for it, and it’s fine for it to be out there in print. You just don’t need to necessarily commit 100% to that claim, or indeed any other claim.

    Basically, what I wanted to say is that in this paper, I would have questioned even the need to have confidence in an idea in order to have it out there in the published literature.

    PS I have to add that often when we submit papers to top journals with a limitations section, in which we openly talk about all the ways in which the claims we made might be wrong, the paper gets rejected. This is because of the stated policy of said journal to only publish important findings. Entertaining doubt is actively discouraged.

    • I am biased, but I do think that if meta-analysis were a topic in introductory statistics, many more people would avoid these misconceptions.

      I did that in my intro course at Duke in 2007/8 using Fisher’s simple method of combining p-values. In one example, the “popular” study was significant, but when the other studies were considered, the combination was not significant. In another, the first study was non-significant, but when the subsequent studies (all non-significant) were combined, the combination was significant. So, at the least, the value of significance in a single study was clearly deflated.

      The students had less difficulty with this material than much of the other material, though they did realize the larger lesson was disconcerting – beware the results of a single study.

      I don’t know if anyone else has done this in an intro course. Not sure why but maybe pressure to stick with the usual curriculum and avoid risking poor student evaluations.
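
For readers who have not seen Fisher’s method of combining p-values, mentioned in the comment above, here is a minimal sketch in Python. It assumes the studies are independent, in which case the statistic X = -2 Σ ln(p_i) follows a chi-square distribution with 2k degrees of freedom under the null. The function name `fisher_combined_p` and the p-values below are made up for illustration; they are not the examples used in the Duke course.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) is chi-square with 2k degrees of
    freedom under the null, where k is the number of p-values.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("each p-value must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # For even degrees of freedom 2k, the chi-square upper tail has a
    # closed form: exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values, chosen only to illustrate the two patterns.
all_nonsignificant = [0.08, 0.09, 0.10, 0.07, 0.11]
one_significant = [0.03, 0.6, 0.7, 0.8]

print(fisher_combined_p(all_nonsignificant))  # falls below 0.05
print(fisher_combined_p(one_significant))     # falls above 0.05
```

With these made-up inputs, five individually non-significant results combine to a significant one, while a single significant result is diluted by three unremarkable follow-ups. Note that the method uses only the p-values, so it ignores the direction and size of the effects.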

    • I had similar thoughts when reading of the Fisher et al 2015 paper retraction. Their retraction puts too strong an emphasis on p-values rather than on actually learning something from the random-slopes-disappearing-significance effect. Also, it serves to highlight how good their more recent work is (where they have emphasised the maximal random effects), so I am not sure it is really in keeping with the spirit of this paper overall (which has a confessional tone).
