Unreplicable

Leonid Schneider writes:

I am a cell biologist turned science journalist after 13 years in academia. Despite my many years of experience as a scientist, I shamefully admit to being largely incompetent in statistics.

My request to you is as follows:

A soon-to-be-published psychology study set out to reproduce 100 randomly picked earlier studies and succeeded with only 39. This suggests that the psychology literature is only 39% reliable.

I tried to comment on the website Retraction Watch that one needs to take into account that the replication studies may not be reliable themselves, given the incidence of poor reproducibility in psychology. Ivan Oransky of Retraction Watch disagreed with me, citing Zeno’s paradox.

Basically, this was my comment:

39% of studies are claimed to have been reproduced by other psychologists. But if this is indeed the rate of reproducibility of psychological research, then only 39% of the reproduction studies are potentially reproducible themselves. My statistics skills are embarrassingly poor, but wouldn’t this mean that only 15% (0.39 × 0.39) of the originally tested psychology studies can be considered successfully reproduced?

This 15% figure is surely wrong, a product of my incompetence, but can the 39% be considered solid unless we can fully trust the results of the reproducibility project itself? I then wrote to Ivan:

Only if a third study confirmed the same 39 studies as reproduced could we trust the previous result. Otherwise, we know that at least 61% of what psychologists publish is wrong or fake, so if we ask these or other psychologists to perform any study, we should be aware of their reliability.

We never agreed. I still have the feeling that the proper number must be lower than 39%, unless the psychologists who obtained it are 100% honest and reliable.

Thus, may I ask for your professional view on the true reliability of psychology studies, and how to approximate it in this context?

Hey, I know about that replication project! Indeed, I was part of it and was originally going to be one of the many, many authors of the paper. But I ended up not doing anything beyond commenting in some of the email exchanges among the psychologists on the project, so it didn’t make sense for me to be included as an author.

Anyway, I think there are a few things going on. The first is that the probability that a study can be replicated will vary: Stroop will get replicated over and over, whereas ESP and power pose don’t have such a good shot. So you can’t simply multiply the probabilities. The second issue is that replicability depends on measurement error and sample size, and these will not necessarily be the same in a replication as in the original study.
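
Here’s a minimal simulation sketch of that first point (my illustration, not anything from the replication project; the Beta(1.2, 1.9) distribution is an arbitrary assumption chosen only so that the replication probabilities average roughly 0.39). Once the probability of replicating varies across studies, a first successful replication selects for the studies that were likely to replicate anyway, and the naive 0.39 × 0.39 ≈ 0.15 multiplication no longer describes anything:

```python
# Minimal simulation sketch: heterogeneous replication probabilities.
# The Beta(1.2, 1.9) distribution is an arbitrary assumption whose mean is ~0.39.

import numpy as np

rng = np.random.default_rng(0)
n_studies = 100_000

p = rng.beta(1.2, 1.9, size=n_studies)   # each study's own replication probability
first = rng.random(n_studies) < p        # outcome of a first replication attempt
second = rng.random(n_studies) < p       # outcome of an independent second attempt

print("average replication probability: ", p.mean())                             # ~0.39
print("naive product 0.39 * 0.39:       ", first.mean() ** 2)                     # ~0.15
print("replicated in both attempts:     ", (first & second).mean())               # ~0.21
print("second succeeds given first did: ", (first & second).sum() / first.sum())  # ~0.54
```

With these made-up numbers, roughly 21% of findings replicate in both attempts rather than 15%, and a finding that has replicated once has better than even odds of replicating again, so no simple multiplication of the headline 39% gives the right answer.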

But, really, the big thing is to move beyond the idea of a study as being true or false, or being replicated or non-replicated. I think the authors of this replication paper have been pretty careful to avoid any claims of the form, “the psychology literature is only 39% reliable.” Rather, it’s my impression that the purpose of this exercise is more to demonstrate that attempted replications can be done. The idea is to normalize the practice of replication, to try to move various subfields of psychology onto firmer foundations by providing guidelines for establishing findings with more confidence, following the debacles of embodied cognition and all the rest.

18 thoughts on “Unreplicable”

  1. > move beyond the idea of a study as being true or false, or being replicated or non-replicated.
    Agreed. I would look at the rank correlations of how the studies up- or down-weight prior probabilities.

    If that’s good, then the studies are essentially saying the same thing before any (Frequentist) dichotomisation or (Bayesian) weighted averaging is applied.

    • Gwern:

      I don’t like Bayes factors here because I really really don’t like the model in which claims are “true” or “false.” I’ve written about this occasionally but it’s probably worth a fuller discussion.

      • Andrew:

        Maybe not the rigid true / false dichotomy, but I hope there’s some way to quantify this?

        Are you saying there’s no good way to distill the “badness” of Psych studies versus other fields to any metric of comparison at all?

        • Studies can be “bad” in so many ways that it doesn’t make sense to try to distill their “badness” to a “metric of comparison.”

        • @Martha:

          That is true about so many things in life but it hardly stops statisticians from rating those things on a unidimensional scale.

          e.g. Colleges, departments, cities, etc.

          Why resist a quantitative measure of research study quality? Even if not a single metric, a few metrics?

        • Rahul:

          Many or most of the studies we’ve been discussing can be summarized as estimating some parameter, maybe a causal effect of some treatment or maybe the difference between two parameters. In this case, the study being “true” is often framed as the parameter being nonzero, or as having a specified sign (positive or negative). Instead I think it is better to think of the parameter theta as being variable (varying across people, across scenarios, and in the person-by-scenario interaction) and uncertain. Type S and Type M errors are two (crude) ways to get at this; there is a small numerical sketch after this thread.

        • Why not use the closeness of the estimated parameter between the original and the replication study as a measure of the “goodness” of the set of studies?
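
Tying together the suggestions in this thread, here is a minimal sketch of comparing original and replication estimates on continuous scales rather than as a replicated/not-replicated verdict: a rank correlation, the share of sign agreements (a crude Type S summary), and the ratio of magnitudes (a crude Type M summary). The paired effect estimates below are made-up illustrative numbers, not data from the reproducibility project.

```python
# Minimal sketch of continuous comparisons between original and replication
# estimates. The paired effect estimates are made-up illustrative numbers.

import numpy as np
from scipy.stats import spearmanr

original    = np.array([0.45, 0.30, 0.62, 0.15, 0.50, 0.08, 0.70, 0.25])
replication = np.array([0.20, 0.12, 0.55, -0.05, 0.31, 0.02, 0.48, 0.10])

rho, _ = spearmanr(original, replication)                       # rank agreement
same_sign = np.mean(np.sign(original) == np.sign(replication))  # Type S flavor
exaggeration = np.abs(original) / np.abs(replication)           # Type M flavor

print("Spearman rank correlation:", round(float(rho), 2))
print("share with the same sign: ", same_sign)
print("median exaggeration ratio:", round(float(np.median(exaggeration)), 2))
```

None of these is the single quality metric asked for above, but together they keep the comparison quantitative without collapsing it into a true/false verdict.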

  2. >”The second issue is that replicability depends on measurement error and sample size, and these will not necessarily be the same in a replication as in the original study.”

    Is it really a replication if you can’t even control conditions enough to get an equal sample size? I noticed that about this psych project. Some of these “replications” didn’t really seem to be replications, which is strange when the entire point is to assess how reliable the findings are.

  3. I think this recent paper by McElreath and Smaldino is useful for understanding this question, as they model how replication aids, or does not aid, the scientific discovery process.

    McElreath R, Smaldino PE (2015) Replication, communication, and the population dynamics of scientific discovery. PLOS ONE, 10(8), e0136088.

    http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0136088

    Also, Smaldino blogs about their model in relation to the reproducibility project here: http://smaldino.com/wp/?p=434

  4. Suppose that in the original 100 experiments there were real differences of d*sigma (real, constant, but unknown) between two treatments, and that the power of each experiment was 0.6. Then in the original project 60 out of 100 experiments would be expected to give a significant result even though the difference was real in all cases, and in the replication of those 60 experiments only about 36 would be expected to give a significant result (see the simulation sketch after this comment). The problem here is lack of power in the original studies. The reality of the 100 studies reported in the paper is much more complex, of course, with the real differences varying from small to large and power varying from poor to good, but I suspect the fundamental problem is still low power in many of the studies.

    Another issue is probably unconscious bias introduced by those conducting studies. For 40 years I worked alongside biologists, chemists, ecologists, agricultural scientists, environmental scientists and so on, and I would never underestimate the ability of an experimenter to foul up an experiment – because of a poor grasp of sources of variation and the fundamentals underlying experimental design.
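
The arithmetic in the first paragraph of that comment is easy to check with a minimal simulation sketch. The effect size (0.5 sd) and per-arm sample size (39) below are assumptions chosen only to give power of roughly 0.60 for a two-sided two-sample t-test at alpha = 0.05; every simulated effect is real.

```python
# Minimal simulation sketch of the low-power arithmetic above. Every effect is
# real and power is ~0.60 in both the original and the replication, yet only
# about 60% of the "significant" originals replicate.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_experiments, n_per_arm, effect = 100, 39, 0.5   # assumptions giving power ~0.60

def run_once() -> np.ndarray:
    """Run all experiments once; return a boolean array of p < 0.05 results."""
    results = np.empty(n_experiments, dtype=bool)
    for i in range(n_experiments):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(effect, 1.0, n_per_arm)
        results[i] = ttest_ind(treated, control).pvalue < 0.05
    return results

original = run_once()      # the 100 original experiments
replication = run_once()   # one replication attempt of each

print("significant original results:         ", original.sum())                  # about 60
print("of those, significant in replication: ", (original & replication).sum())  # about 36
```

On a typical run, about 60 of the 100 original experiments come out significant and about 36 of those are also significant in the replication, matching the 60-out-of-100 and 36-out-of-60 figures above even though every underlying effect is real.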
