
“How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions” . . . and still stays around even after it’s been retracted

Chuck Jackson points to two items of possible interest:

Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions, by Richard Harris. Review here by Leonard Freedman.

Retractions do not work very well, by Ken Cor and Gaurav Sood. This post by Tyler Cowen brought this paper to my attention.

Here’s a quote from Freedman’s review:

Harris shows both sides of the reproducibility debate, noting that many eminent members of the research establishment would like to see this new practice of airing the scientific community’s dirty laundry quietly disappear. He describes how, for example, in the aftermath of their 2012 paper demonstrating that only 6 of 53 landmark studies in cancer biology could be reproduced, Glenn Begley and Lee Ellis were immediately attacked by some in the biomedical research aristocracy for their “naïveté,” their “lack of competence” and their “disservice” to the scientific community.

“The biomedical research aristocracy” . . . I like that.

From Cor and Sood’s abstract:

Using data from over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31.2% of the citations to retracted articles happen a year after they have been retracted. And that 91.4% of the post-retraction citations are approving—note no concern with the cited article.

I’m reminded of this story: “A study fails to replicate, but it continues to get referenced as if it had no problems. Communication channels are blocked.”

This is believable—and disturbing. But . . . do you really have to say “31.2%” and “91.4%”? Meaningless precision alert! Even if you could estimate those percentages to this sort of precision, you can’t take these numbers seriously, as the percentages are varying over time etc. Saying 30% and 90% would be just fine, indeed more appropriate and scientific, for the same reason that we don’t say that Steph Curry is 6’2.84378″ tall.


  1. zbicyclist says:

    I’m guessing that most researchers don’t recheck references.

    It’s the Woody Hayes principle: Woody, asked why his teams seldom used the forward pass, explained that three things could happen, and two of them were bad (interception, incompletion, and completion).

    Similarly, you could find a retraction (better science), not find anything (which means you wasted time), or find a cloud of controversy, which requires thinking about an issue that may be on the fringes of your research question.

    How might this be made better, by making it easier? Here’s a naive suggestion: with student term papers, you can use TurnItIn, which checks for plagiarism. What about a similar service where you could send in an automated reference list (from something like Zotero) and it would tell you about retractions, nonreplications, and similar issues? Basically, this could be a filter on Google Scholar, which will tell you who’s referenced an article.
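    A rough sketch of what the core of such a checker could look like, assuming a locally downloaded CSV of retracted DOIs (the file path and the "doi" column name here are made up for illustration; a real export, e.g. from the Retraction Watch database, would use its own column names):

```python
import csv

def load_retracted_dois(path):
    # Load a set of retracted DOIs from a CSV export of a retraction
    # database. Column name "doi" is a placeholder assumption.
    with open(path, newline="") as f:
        return {row["doi"].strip().lower() for row in csv.DictReader(f)}

def flag_retracted(reference_dois, retracted):
    # Return the references (in their original order and spelling)
    # whose normalized DOI appears in the retracted set.
    return [doi for doi in reference_dois if doi.strip().lower() in retracted]
```

    A Zotero export is just a list of DOIs, so the whole service reduces to set membership plus a regularly refreshed retraction file.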

  2. Ted Dibble says:

    I recall that when the report came out that “6 of 53 landmark studies in cancer biology could be reproduced” some biomedical scientists noted that there are real challenges in getting these experiments to work. But in reading the article by Begley and Ellis, I see the following lines that will not surprise readers of this blog:
    “In studies for which findings could be reproduced, authors had paid close attention to controls, reagents, investigator bias and describing the complete data set. For results that could not be reproduced, however, data were not routinely analysed by investigators blinded to the experimental versus control groups. Investigators frequently presented the results of one experiment, such as a single Western-blot analysis. They sometimes said they presented specific experiments that supported their underlying hypothesis, but that were not reflective of the entire data set.”

  3. Brent Hutto says:

    In the first months of my career I was given a dataset that had obvious measurement issues with one of the scales. A literature review turned up numerous (maybe a dozen or so? this was decades ago) articles citing the same validation study for the scale in question. The validation study, to its credit, quite clearly concluded that the scale WAS NOT VALID.

    Yet every one of those articles stated that they were using the scale and that it had been validated, accompanied by a reference to the validation study. I’ve always wondered if any of those citing and then using that scale had actually read the (in)validation study or if they were just happy to cite some reference titled “Validity of the Such-and-Such Instrument”.

    On the one hand, it’s cool that the study showing non-validity actually made it past the File Drawer Effect and into print. On the other hand, it appears to have made little difference that it was published. On the gripping hand, at least it wasn’t part of a drug trial or other life-or-death line of research.

  4. Mark Samuel Tuttle says:

    I think this has something to do with the power of stories. Evolution has programmed us to yearn for and remember stories. Sometimes good stories are “wishful thinking” (a moniker recently applied to AI by a computer science cynic).

    So, as usual, I think there needs to be some deeper analysis here to ask why some proven-wrong “stories” hold sway; it must reflect yearning, somehow.

    So, somehow, we need to make replication fulfill yearning.

    I’m reminded of earlier efforts in cognitive science, which had this same problem, only worse. Eventually, cognitive science presentations at meetings would be met with searing criticism. The ones that survived proved to be important. Needless to say, the cost of the effort was high; e.g., everyone’s feelings got hurt.

    A classics professor once explained to me that one reason the Iliad and the Odyssey survived when most everything else – stories and plays – did not was that they were compelling stories. Of course, Homer may have been performing in front of fat merchants yearning for the old days …

    Anyway, I think the question implied by your post (why is refuted work so hard to extinguish?) is worthy of study, in and of itself.

  5. Al says:

    “This is believable—and disturbing. But . . . do you really have to say “31.2%” and “91.4%”? Meaningless precision alert! Even if you could estimate those percentages to this sort of precision, you can’t take these numbers seriously, as the percentages are varying over time etc. Saying 30% and 90% would be just fine…”

    This seems wrong to me. I fully agree with the general point, so I’d round to 31% and 91%. Or say 3/10 and 9/10. But if you provide a particular precision, the numbers should be correct, no? Unless you prefix it with “roughly” or whatever. I now see that in fact the 31% is “at least” but the 91% doesn’t seem to be.

  6. Jordan Anaya says:

    I think some criticism of Glenn Begley and Lee Ellis is fair. They announced their company couldn’t reproduce all of these papers, but didn’t tell us what papers or what didn’t reproduce.

    • Anoneuoid says:

      Begley was part of the cancer reproducibility project, so it isn’t like he did nothing about it. However, it looks like he thought that turned out to be a waste of time:

      “Early on, Begley, who had raised some of the initial objections about irreproducible papers, became disenchanted. He says some of the papers chosen have such serious flaws, such as a lack of appropriate controls, that attempting to replicate them is ‘a complete waste of time.’ He stepped down from the project’s advisory board last year.”

  7. Sean Mackinnon says:

    I mean, if we’re being pedantic about rounding, nobody rounds to the nearest foot either; if we did, there’d just be 5-, 6-, and 7-foot-tall people. Maybe 6’2″ is needlessly precise. I don’t see how rounding to 1/12 of a foot is scientifically correct, but rounding to 1/10 of a percent is needlessly precise.

    If you don’t extend out to 1 decimal place with percentages, there’s a high chance your percentages won’t add up to 100% due to rounding, which will anger a separate set of people who will claim you are innumerate for failing to have them add up to 100%. So maybe you cheat on the rounding a little bit so that it adds up to 100% to appease these folks, then someone else will claim you’re falsifying data because your reported stat doesn’t match the output 100%. So you put it back to 1 decimal place, and Andrew will say you’re unnecessarily precise.

    In summary, no matter how you choose to round, somebody on the Internet will think you’re innumerate.
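    The percentages-not-summing-to-100% problem above is easy to reproduce; a minimal illustration with three equal shares:

```python
# Three equal shares: each rounds to 33%, so the rounded
# percentages sum to 99%, not 100%.
counts = [1, 1, 1]
total = sum(counts)
pcts = [round(100 * c / total) for c in counts]
print(pcts, sum(pcts))   # [33, 33, 33] 99
```

    Extra decimal places shrink the discrepancy but don’t eliminate it: 33.3 + 33.3 + 33.3 is still 99.9.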

    • Thanks for raising this subject. Nothing is more annoying than a condescending attitude. Even more annoying is the ‘my way or the highway’ crew.

    • Andrew says:


      I don’t care about “somebody on the Internet.” What I care about is expressing data informatively. The reason why it makes sense to report someone as 6’2″ rather than 6′ is that heights are stable to the nearest inch. Someone will be 6’2″ today, 6’2″ tomorrow, and 6’2″ next year. Depending on the purpose and how carefully the measurement is done, it could make sense to report someone as 6’2.3″. They might be 6’2.3″ today, 6’2.4″ tomorrow, 6’2.2″ in a year, and this is all different from 6’1.8″. So, again, there might be purposes for which we’d want that level of precision. But to measure someone as 6’2.34″ . . . it’s hard for me to imagine when we’d want that.

      Getting to the example at hand, there are three reasons why I think that “31.2%” should be reported as “31%” or “30%.” In no particular order:

      1. Sampling variation. We’re generally interested in the population, not the sample, and the sample proportion is a noisy estimate of the population proportion.

      2. Variation. We have data from one particular setting in one particular time period. Things change, and we can expect the variation across scenario and over time to be more than 1 percentage point. There’d just be no reason to suspect stability at the level of the fractional percentage point; too many things are changing.

      3. Relevance. There’s no reason to care about that fractional percentage point.

      If I were writing up the above example, I would present the number at 30%. However, I can see the rationale for writing 31%: perhaps for some purposes the distinction between 26% and 34% is a big deal; also I can see that some readers might be put off by a lot of rounding, as they’d wonder what happened to that rounded digit. I can’t see any good reason for reporting 31.2%. The bit about “cheating on the rounding” doesn’t come up here because it was just a number reported in the abstract.

      But, in any case, no matter how many digits are presented, the rounding problem can occur when you’re presenting a set of numbers that add up to 100%. There is no general solution to this other than an asterisk pointing to a note that numbers may not add up to 100% because of rounding; this has nothing to do with the number of decimal places. When the numbers add up to 108%, though, then there’s a problem!

      I did not go through all these points in the above post because they’ve come up many many times on this blog. Anyway, I think it’s important. It’s not the most important thing in statistics, but I think that unnecessary decimal points—or, more generally, the presentation of noise—can get in the way of statistical communication. I think the noise can be a distraction from the signal. Not the biggest deal in the world but it’s one little bit that can make a difference.

      • jd says:

        Well, my weather app on my phone is stating there’s a 0% chance of 0.07″ of rain on Friday and a 100% chance of 0.57″. So I’m going to lose faith in this thing if we get 0.07″ on Friday and 0.58″ on Saturday.

        • Andrew says:


          I think there are three things going on in your example.

          – I’m not sure what the app is trying to say, if it reports that there’s a 0% chance of 0.07″ of rain. If there’s a 0% chance of rain, what’s the 0.07″ thing for?

          – I’m not quite sure what to do with a prediction of 0.58″ of rain, but I’m not really a weather expert. For my own weather info, I go to the National Weather Service page, which gives me information such as “This Afternoon: Showers likely and possibly a thunderstorm after 4pm. Mostly cloudy, with a high near 61. Southeast wind 7 to 11 mph. Chance of precipitation is 60%. New rainfall amounts between a tenth and quarter of an inch, except higher amounts possible in thunderstorms.” I’m not really sure who would care about 0.58″ (rather than, simply, “a half inch,” which is how the NWS might report it to me)—but the app isn’t just for reading; it can also save the results. If I’m going to save the results for further analysis, then I’d prefer to have the extra digits. Similarly, in the example of the above-cited research paper, I want the authors to keep all their digits internally; it’s just in the writeup that I’d want rounding.

          – The weather varies a lot from day to day and people are interested in that. In contrast, we’re not typically interested in day-to-day variation in people’s height, and day-to-day variation in reported height can just be uninteresting measurement error. In the example of the above-cited paper, we may very well want to know about variation from year to year, but I don’t see how variation at the level of the fractional percentage point would be interesting.

          • jd says:

            I agree. I was being a bit facetious. It was just an amusing example of your point about meaningless precision. This app is actually great, but every time I see the rain total prediction, it makes me laugh and think of this topic of discussion on your blog.

            However, I have a question about your second point –
            “If I’m going to save the results for further analysis, then I’d prefer to have the extra digits.”
            what if your instrument can output a measure of the data point to high precision, but you know that it actually can’t measure it that precisely: do you still want those decimal places for analysis? Say, for example, you had a digital rain gauge that measured rain to 0.001 inch and logged it at a specific time, but you knew that if the wind was blowing, the reading could vary by at least 0.05 inches at any given time. Then would you want to store that data point as 0.584 for analysis?

          • I think it’s reporting a CDF… 0% chance of 0.07 inches or less, 100% chance of 0.58 inches or less

      • markus says:

        “The bit about “cheating on the rounding” doesn’t come up here because it was just a number reported in the abstract.”

        The numbers in the abstract should match the numbers in the body of the paper. Strictly speaking they need not, but matching them is a legitimate choice in how to communicate results, in that it enables readers to more easily connect the dots. Or to more easily read the paper, if, e.g., they end up searching for the number reported in the abstract to see where it comes from.
        Your points about noise and variation can partly be solved by confidence intervals around the percentage; partly they’re issues of design (i.e., how strong is our belief that the number will stay around the same), which can hardly be addressed at the level of the numbers. In this case, with N = 74,000 I get a 95% CI from around 30.9 to 31.5. So at the level of this study it seems fine to say the uncertainty is in the decimal point. (I mean, what do you expect them to do if asked for a CI when the reported percentage is ‘30%’, as you suggest? 30 [30; 30]? Won’t that cause an unnecessary double take by readers?)
        That said, I agree with the criticism that decimal places shouldn’t be added mindlessly. I just think your ‘I can’t see any good reason for reporting 31.2%’ is limited by the effort you put into trying to find reasons, and I think that effort is usually not sufficient.
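        For what it’s worth, the interval above matches the standard normal approximation for a binomial proportion (a sketch, taking the point estimate and N from the abstract and assuming independent citations; clustering would widen it):

```python
import math

# Normal-approximation 95% CI for a proportion, assuming independent
# observations (citations are likely clustered, which would widen this).
p, n = 0.312, 74000
se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"95% CI: {100 * lo:.1f}% to {100 * hi:.1f}%")   # 30.9% to 31.5%
```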

        • Andrew says:


          1. I agree that if the number is reported at 31.2% in the paper it should be the same 31.2% in the abstract. My comment about “just a number reported in the abstract” was in response to the claim that authors should show more digits to avoid the problem of percentages not adding up to exactly 100%. In any case, as I noted in my comment above, the issue of percentages not adding up to exactly 100% can arise no matter how many digits are included, so if that’s a concern there’s no general alternative beyond noting the issue to the reader.

          2. The confidence interval you create is based on the assumption of independent data. But the data are clustered so I expect the correct confidence interval will be wider.

          3. In any case, even if N = 10 billion and the confidence interval is [31.324%, 31.328%] or whatever, I still would report it as 31%, not 31.326%, for the other reasons discussed above. Sampling variability is only one of the many reasons not to report fractional percentages.

          4. Regarding “limited by the effort you put into trying to find reasons”: I did not put in any effort to find reasons. These reasons were already there. I put in some effort to express and type up these reasons, just as part of the public service of this blog. When there’s a point that’s familiar to me but unfamiliar to some commenters, I’ll often put in the effort to carefully explain my motivation. When I’m just doing data analysis or reading a paper, these issues take no time at all for me, as I’ve seen them so often.

      • Sean Mackinnon says:

        Thanks for taking the time to respond. Sorry for being a bit snarky, I guess it just rubbed me the wrong way, because I often report things to 1 or 2 decimal places for (what I believe to be) rational reasons.

        One issue is that there is a subset of folks who want more precision in the numbers reported in the paper for two different reasons. There’s the error-detection folks, who often use algorithms on the presented data to determine if there are typos (e.g., using the point estimate and the SE, is the test statistic and/or p-value correct?). Rounding to the nearest integer can interfere with this.

        There’s another set of folks who might want to use data from a published paper in a meta-analysis. Here too, the extra precision is liable to be useful in the calculation of the effect sizes that are aggregated together. This would also be true if I, say, wanted to compare the means in my sample to a prior sample. Basically, I would argue that more precision in general does allow for greater secondary usage of the data. I know you’re not into calculating p-values or confidence intervals, but rounding estimates to the nearest integer is liable to interfere with this process in some circumstances, for the same reason that rounding too early in any calculation can introduce bias. I try to think about reporting things in a way that maximizes secondary use by third parties; this is also why I prefer tables to figures a lot, even though I know you prefer figures to tables (and despite knowing that figures communicate the info better at a glance!). Having had to make secondary use of published results (e.g., to inform effect sizes for power analyses), tables help a lot.

        I suppose if you have open data, these two points don’t matter so much! Then you can be freed up to report clearly, without bogging down in precise numbers. But in my consulting and collaboration, it’s often not the case, so reporting to greater precision with descriptive statistics enables more secondary use of data.

        You’ve raised this before, so I actually have changed my view on this a bit over the past few years of reading the blog in the sense that I am cued to avoid reporting things with needless precision now, and agree that it’s often a bad thing. That said, I don’t think 1/10 of a percent is too precise, but 1/100 of a percent sure is. So I guess I also just disagree about the threshold of irrelevance.
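        The rounding-too-early concern can be made concrete with a toy example (all numbers made up): an effect size recomputed from integer-rounded descriptives can vanish entirely.

```python
# Made-up descriptives: two group means and a pooled SD.
m1, m2, sd = 10.437, 9.852, 2.218

d_full = (m1 - m2) / sd                           # ~0.26, a small-to-medium effect
d_rounded = (round(m1) - round(m2)) / round(sd)   # 10, 10, 2 -> 0.0
print(d_full, d_rounded)
```

        A meta-analyst working only from the integer-rounded table would record an effect of exactly zero.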

    • DavidP says:

      I can’t help remarking – this line in Rod Stewart’s “Mandolin Wind” has always caught my attention:

      “It was the coldest winter in almost 14 years.”

      What does that mean? If it’s 13, why not say 13? If it’s 13.5, did he move to England from Australia? Apart from hemispherical moves, don’t winters come in integer spacings?

  8. Rigor Mortis was very well written. Read it when it first came out. It’s not given sufficient play in these statistical controversies.

  9. Adede says:

    Apparently science is fundamentally broken….let’s argue about arbitrary rounding thresholds.

    • We need to at least double the size of the bikeshed on this blog.

    • Andrew says:


      I agree that rounding is not the most pressing problem in statistics, but it’s not nothing. It’s about communication, which I think is central to statistics. I would not want to write about rounding and graphical display all the time, but when it comes up, I don’t mind discussing it. On this blog we also talk sometimes about art, literature, and sports, and we’re not fixing science there either. I think intellectual exploration and discussion has value for its own sake.

      • Adede says:

        I’m not saying never discuss it, but by ending your post with it you’ve distracted from the main point and given a minor detail disproportionate weight. I agree too many decimals is silly, but it’s not the worst thing (better than too few). Maybe give it a rest on the days when widespread epistemological dysfunction is the topic of discussion.

    • Anoneuoid says:

      “Apparently science is fundamentally broken….let’s argue about arbitrary rounding thresholds.”

      Science is just fine, there are just a lot of people calling what they do science when it is something else. I prefer the term “research”.

      It is easy to tell. In science you see people:

      1) Figuring out how to repeatedly get the same results from their experiments (not assuming that “published in a peer-reviewed journal” implies “fact,” and that it would be a waste of money for anyone else to redo the experiment)
      2) Testing quantitative predictions derived from their theories (not a strawman null hypothesis)

      Vast swaths of the research community are not doing either of those things.

  10. Andrew,

    You and my Dad would have gotten along famously. Renaissance minds. A rich intellectual childhood makes a difference.

  11. Dave says:

    Published research should come with REST API calls for references, which have possible error codes for retractions that could alert downstream research dynamically. In other words, whenever someone goes to read a published piece, it dynamically updates the references via REST APIs, traversing upstream references and alerting the reader to any retracted research in the ancestry.
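    A sketch of the traversal Dave describes, with `get_references` and `is_retracted` standing in for the hypothetical API calls (both names are made up):

```python
def retracted_ancestry(doi, get_references, is_retracted, seen=None):
    # Depth-first walk of a paper's upstream references, returning any
    # retracted ancestors. The two lookup callables stand in for the
    # hypothetical REST calls; the `seen` set keeps the walk finite
    # even if the citation graph contains cycles.
    if seen is None:
        seen = set()
    flagged = []
    for ref in get_references(doi):
        if ref in seen:
            continue
        seen.add(ref)
        if is_retracted(ref):
            flagged.append(ref)
        flagged.extend(retracted_ancestry(ref, get_references, is_retracted, seen))
    return flagged
```

    In a real deployment the lookups would be cached, since the same upstream ancestry gets traversed every time any descendant paper is opened.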

  12. Scott Grindy says:

    re: citations of retracted articles – shouldn’t the journals/publishing entities have some responsibility here? Many publishers (most that I’m familiar with anyway) do the work to link a formatted citation to the DOI to its source already, and I can imagine some software tools (a la Dave’s suggestion of REST APIs) could quickly identify retracted articles. Perhaps the benefit to journals isn’t worth it; I would guess that the average paper cites << 1 retracted article, so the actual benefit from the tool would be small.
