Something we’ve seen with depressing regularity is that researchers do something sloppy—perhaps even deceitful or fraudulent, but oftentimes just sloppy—and then when the error is pointed out, they reply that the main conclusions of the study have not changed.
It often looks ridiculous, and when we post on these things we put them in the Zombies category—but, sometimes, sure it must be the case. Someone claims some big result, but it still seems to make sense that they got the direction right, so maybe the magnitude doesn’t matter?
How to think about this?
My suggestion: try the Error-Reversal Heuristic. Imagine how the promoter of the idea would’ve reacted had the mistake gone in the opposite direction.
Here are some examples.
1. Published paper from an organization called Toxic-Free Future claims that a toxin is at 80% of the legal limit. They screwed up their calculation—it’s actually only 8%—and here’s their response: “it is important to note that this does not impact our results . . . and our recommendations remain the same.”
The Error-Reversal Heuristic: Suppose someone else had done a study and found that the level of exposure was “8% of the reference dose, thus, a potential concern,” but they’d done the calculation wrong, and the level was really 80% of the reference dose. Then I assume that the folks at Toxic-Free Future wouldn’t say that the recommendations remain the same, right? They’d say the exposure had been underestimated by a factor of 10 and that’s a big deal!
2. A paper published in Lancet (uh oh) claimed that hydroxychloroquine/chloroquine was killing people. It turns out the work was fraudulent, which perhaps should not surprise us, given the strong criticism by James “not the racist dude” Watson, who wrote at the time, “The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people. . . . The most obvious confounder is disease severity . . . The authors say that they adjust for disease severity but actually they use just two binary variables: oxygen saturation and qSOFA score. The second one has actually been reported to be quite bad for stratifying disease severity in COVID. The biggest problem is that they include patients who received HCQ/CQ treatment up to 48 hours post admission. . . . This temporal aspect cannot be picked up a single severity measurement. In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for. . . .”
Five days after the problems with this paper came out, a press officer for Lancet wrote that “The results and conclusions reported in the study remain unchanged.”
Ummm . . . time for the Error-Reversal Heuristic: Suppose the results had originally been reported as kinda small, then it turned out a mistake had been made, and the actual effect of the drug was to double the mortality rate. How would the promoters have reacted? I’m pretty sure they’d say that such an effect is a big deal!
3. A published paper, “Attractive Names Sustain Increased Vegetable Intake in Schools” (guess who’s the author? Hint: “Pizzagate”) made big claims. It turned out that the data in the paper were incoherent, and a correction was written that was longer than the original paper. According to Retraction Watch: “Some of the changes include explaining the children studied were preschoolers (3-5 years old), not preteens (8-11), as originally claimed.” The author’s response to all of this? You got it: “These mistakes and omissions do not change the general conclusion of the paper.”
Time for the Error-Reversal Heuristic! What if things had gone the other way? Someone published a null result on the effects of attractive names on vegetable intake in schools, but it turned out that the data had been entirely garbled, and in fact the study was on preschoolers, not preteens. Would Mister Cornell Food Researcher then reply that the general conclusions did not change? Hell no! He would’ve said that any claims of a null finding were invalidated by the sloppiness of the study.
4. A notorious member of the National Academy of Sciences published a paper with t-statistics reported as 5.03 and 11.14. But those were in error! The actual t-statistics were 1.8 and 3.3. How did the author reply? You’ll never guess: this “does not change the conclusion of the paper.” As I wrote at the time:
This is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.
When the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.
But . . . let’s try the Error-Reversal Heuristic. Suppose the published t statistics had been 1.8 and 3.3, but that had been an error, and they really were 5.03 and 11.14. How would the author have responded then? Probably something about how strong the evidence is, right?
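To see how much the correction in example 4 changes the evidence, here is a rough sketch that converts the reported and corrected t-statistics into two-sided p-values. The paper's degrees of freedom are not given here, so the sketch assumes a standard normal reference distribution; with a t distribution and few degrees of freedom the p-values would be somewhat larger. The function name is made up for illustration.

```python
import math

def two_sided_p_normal(t):
    """Two-sided p-value for a test statistic under a standard normal reference.

    Assumption: the degrees of freedom are unknown here, so the normal
    distribution stands in for the t distribution.
    """
    # P(|Z| >= |t|) for standard normal Z equals erfc(|t| / sqrt(2)).
    return math.erfc(abs(t) / math.sqrt(2))

reported = [5.03, 11.14]   # t-statistics as originally published
corrected = [1.8, 3.3]     # t-statistics after the correction

for label, values in [("reported", reported), ("corrected", corrected)]:
    for t in values:
        print(f"{label:9s} t = {t:5.2f}   p ~ {two_sided_p_normal(t):.2g}")
```

Under this approximation the corrected value of 1.8 gives a p-value of about 0.07, which no longer clears the conventional .05 threshold, while the reported 5.03 would have cleared it by several orders of magnitude.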
5. We’ve also seen papers where the result goes in the opposite direction of the pre-registration. Had it gone in the same direction as the pre-registration, it would be hailed as a success, so when it goes in the opposite direction . . . maybe not so much of a success? There was the notorious case of the paper about ovulation and clothing with a finding that failed to replicate in a new study by the same authors. They refused to let go of the original, fatally-flawed claim and instead argued that they’d discovered an interaction. And then there was the “gremlins” article that approached the Platonic ideal of having more errors than data points. The only thing that remained constant amid all the wreckage was . . . the conclusion.
6. And, most consequentially, there was the notorious “Excel error” paper, where fatal flaws were discovered and the authors dismissed this as an “academic kerfuffle,” which isn’t quite “the conclusions are unchanged,” but close enough. Again, imagine if someone had published a null result and then, once the data had been fixed, a big estimate in their preferred direction had shown up. I think they would’ve said this was a big deal.
I’m happy to retell the above stories as often as might be needed (recall Paul Alper’s horse); my point in this post is to give examples of the error-reversal heuristic.
P.S. Sometimes people do it right. Here’s an example where fatal flaws were found in a published paper, and the authors concluded, “A reanalysis of the data leads to revised findings that do not replicate the results in the original paper.” So it is possible.
I think you could also apply this heuristic to ongoing policies of firing government workers or retaining/deporting potential illegal immigrants or cutting off lifesaving aid. We see mistakes being made, but the Administration says they will correct these but never admits there is a problem with the process. Suppose the process was shown to be missing illegal immigrants or retaining government workers who are lazy/incompetent or preserving aid that is not lifesaving: would the Administration stand by their processes?
Another example. A widely-cited key paper on the Covid origins issue, “The molecular epidemiology of multiple zoonotic origins of SARS-CoV-2” (https://www.science.org/doi/10.1126/science.abp8337) had multiple coding errors contributing to its Bayes factor of 60 favoring multiple spillovers. Under pressure from an intrepid pubpeer contributor, the main coding errors were corrected, reducing BF to ~4. Since the original paper had defined a significance threshold as BF=10, that sounds like a problem. The solution was simple. The threshold was lowered to BF=3.2 in the corrected version. So no changes in the conclusion were needed.
Although BF=4 doesn’t sound impressive for a subtle post-hoc selected feature chosen to support a narrative, the situation is actually worse. The Bayes calculation used a broader definition of the observed outcome for the favored hypothesis than for the unfavored. Correction of that basic error appears to reduce BF to less than 1. (https://michaelweissman.substack.com/p/explanation-of-and-comments-on-mccowans) Presumably if that correction is ever acknowledged it will still not change the conclusions.
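As a rough illustration of what these Bayes factors mean, the sketch below converts each one into a posterior probability for the favored hypothesis. The even prior odds (1:1) are purely an assumption for illustration and are not taken from the paper; the function name is also hypothetical.

```python
def posterior_probability(bayes_factor, prior_odds=1.0):
    """Posterior probability of the favored hypothesis.

    Uses posterior odds = prior odds * Bayes factor.
    Assumption: prior_odds defaults to 1 (even odds), chosen only for illustration.
    """
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

cases = [
    ("original BF", 60),
    ("original threshold", 10),
    ("corrected BF", 4),
    ("revised threshold", 3.2),
]

for label, bf in cases:
    print(f"{label:20s} BF = {bf:5.1f}   posterior ~ {posterior_probability(bf):.3f}")
```

With even prior odds, BF=60 corresponds to a posterior probability of about 0.98 and BF=4 to 0.80, and any BF below 1 puts the posterior below 0.5, so the evidence would then point the other way.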
Very insightful, as always!
NB: There’s a minor error/typo: In the sentence “Suppose the published p-values had been 1.8 and 3.3, but that had been an error, and they really were 5.03 and 11.14.”, it should say “t-statistics” instead of “p-values” (or, if you actually wanted to talk about the corresponding p-values, a bit more would need to be added/changed).
Fixed; thank you.
@Andrew:
That Reinhart&Rogoff paper still makes me blow out an artery, for three reasons:
a) not even Margaret Thatcher was this stupid
b) it didn’t tally with any gatherable evidence since the collapse of the Bretton Woods system,
and c) how readily its conclusions were adopted, causing misery left, right, and center – with the exception of one country: Belgium! Why? Because at the time Belgium (for two years, BTW) didn’t have a federal government (you should spend some time there – their fries are unbeatable, and they gave us the only surrealist who’d make you giggle), so it was impossible to impose austerity.
Olaf:
I did spend some time in Belgium–I took a sabbatical semester there in 1997.
Regarding Reinhart and Rogoff: yeah, I’m annoyed at them for calling it a “kerfuffle” (if the whole thing is such a joke, they could’ve informed the world back when they were being celebrated for their apparently important findings); still, let’s not forget that other researchers have behaved much worse in similar settings.
R&R’s results weren’t “null” when the Excel error was fixed, and as you note, the actual magnitude matters. The big differences between them and other researchers resulted from deliberate choices they made (as I have to keep commenting here), and they presumably think their choices were correct (even though they couldn’t possibly establish causality with their approach). You’ve said it’s still worth beating a dead horse because the horse isn’t really dead, and that’s the case here.
The glibness with which they rebuked criticism was mind-boggling to me at the time. In my simple world, intellectual integrity is a necessary, though not sufficient, pre-requisite for moral integrity. They failed at the first hurdle.
PS Why don’t you write a book about “How [not!] to do research – from A[riely] to Z[imbardo]”.
> Watson, who wrote at the time, “The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people
That was wrong. The claimed effect size was much lower – but in agreement with the spirit of your post I guess that’s just a minor detail and it doesn’t matter.
Carlos:
From the abstract of the (now-retracted) paper:
The mortality result reported for the control group is 9.3% and for the treatment groups is 18.0%, 23.8%, 16.4%, and 22.2%. So in what way was Watson’s statement wrong? It looks to me like he accurately characterized the claim made in the paper. Why do you say that the claimed effect size was much lower? I feel like I’m missing something here.
https://statmodeling.stat.columbia.edu/2020/05/24/doubts-about-that-article-claiming-that-hydroxychloroquine-chloroquine-is-killing-people/#comment-1344045
Oh, that’s confusing! It seems that the abstract was mixing raw rates (9.3%, 18.0%, etc.) and adjusted ratios (1.335, etc.) in the same sentence! I got faked out because the sentence began, “After controlling for multiple confounding factors,” so I’d assumed in my quick reading that all the numbers were adjusted.
Great post. I use this heuristic when I plan statistical tests with marketers, who have the prior that whatever they spend money on is sure to generate positive results. When I discuss the design, we agree on what actions to take contingent on what results come out of the test. There’s always the person who knows enough statistics to bring up the state of not statistically significant but “directionally positive” (think significant at 80% confidence in the desired direction). They will argue that in such a case, they should be allowed to alter the treatment a little to improve the results in the next test, meanwhile continuing to spend the money. I’d then set up the reversal heuristic: I’d ask them what they would like to do if the result is not significant in the negative direction? For sure, they say that because it is not significant, it is not evidence of harm, and they should keep spending the money.
Of course, if “directional” results can be accepted in one direction, they should also be accepted in the other direction, thus, using their own standard, they should stop spending the money!
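The symmetry argument can be written down as a decision rule agreed on before the test is run. The sketch below is hypothetical: it assumes the test result is summarized as a z-score, reads "80% confidence" as a two-sided 80% interval (a cutoff of about 1.28), and uses made-up action labels.

```python
from statistics import NormalDist

def decide(z, confidence=0.80):
    """Symmetric decision rule for a marketing test summarized by a z-score.

    Assumption: "directional" evidence means |z| exceeds the two-sided
    cutoff for the given confidence level (about 1.28 for 80%).
    """
    cutoff = NormalDist().inv_cdf(0.5 + confidence / 2)
    if z >= cutoff:
        return "keep spending"   # directional evidence of benefit
    if z <= -cutoff:
        return "stop spending"   # the same standard, applied to harm
    return "no action"           # neither direction meets the standard

for z in (1.5, 0.5, -0.5, -1.5):
    print(f"z = {z:+.1f} -> {decide(z)}")
```

The point of the design is that the same cutoff is used for both signs of z, so a result that would count as good news at +1.5 counts as bad news at -1.5.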
Hmm, something can still qualify as very bad even across an order of magnitude difference. Consider this somewhat parody illustration:
X: “Let’s play Russian Roulette to show our bravery.”
Y: “That’s a really dumb idea. The statistical analysis of a revolver with one bullet and six chambers shows a 1/6 = 16.67% chance of death.”
X: “You’ve made an error. The gun to be used isn’t a six-chamber revolver. It’s got a high-capacity extension with 30 chambers. A re-analysis shows the chance of death with 1 bullet is actually 1/30 = 3.33%, five times less than you concluded.”
Y: “Even with a correction of a factor of five, my recommendation that it is a really dumb idea remains unchanged.”
X: “Let’s apply the Error-Reversal Heuristic. If you originally thought the chance of death was 3.33%, what would you have said if you then found it was 16.67%?”
Y: “That it was even worse than I thought, which doesn’t change my view that it’s a really dumb idea in both situations.”
I think this is particularly relevant to the toxin case. While 8% of the legal limit might not cause as much harm as 80%, it seems to me that amount might still be high enough to make mitigation a good idea.