So. I opened the newspaper today and saw this article by Roni Caryn Rabin, “Two Retractions Hurt Credibility of Peer Review.” It was about the Surgisphere scandal, which we’ve discussed a few times in this space, going from “Doubts about that article claiming that hydroxychloroquine/chloroquine is killing people” to “How should those Lancet/Surgisphere/Harvard data have been analyzed?”
The news article had this revealing bit:
In interviews with The New York Times, Dr. Richard Horton, the editor in chief of The Lancet, and Dr. Eric Rubin, editor in chief of the N.E.J.M., said that the studies should never have appeared in their journals but insisted that the review process was still working. . . .
Dr. Horton called the paper retracted by his journal a “fabrication” and “a monumental fraud.” But peer review was never intended to detect outright deceit, he said, and anyone who thinks otherwise has “a fundamental misunderstanding of what peer review is.”
“If you have an author who deliberately tries to mislead, it’s surprisingly easy for them to do so,” he said.
I hate hate hate hate hate this attitude, trying to use fraud to get himself off the hook.
It’s not about the fraud
The fraud is the least of it.
As regular readers of this blog will recall, the original criticism of the recent Lancet/Surgisphere/Harvard paper on hydro-oxy-whatever was not that the data came from a Theranos-like company that employs more adult-content models than statisticians, but rather that the data, being observational, required some adjustment to yield strong causal conclusions—and the causal adjustment reported in that article did not seem to be enough.
As James “not the racist dude who assured us that cancer would be cured by 2000” Watson wrote:
This is a retrospective study using data from 600+ hospitals in the US and elsewhere with over 96,000 patients, of whom about 15,000 received hydroxychloroquine/chloroquine (HCQ/CQ) with or without an antibiotic. The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people. . . .
The most obvious confounder is disease severity . . . The authors say that they adjust for disease severity but actually they use just two binary variables: oxygen saturation and qSOFA score. The second one has actually been reported to be quite bad for stratifying disease severity in COVID. The biggest problem is that they include patients who received HCQ/CQ treatment up to 48 hours post admission. . . . This temporal aspect cannot be picked up by a single severity measurement.
In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for. . . .
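Watson’s point is easy to demonstrate with a toy simulation. This is a hedged sketch with entirely made-up numbers (none of them are from the actual Surgisphere data): if sicker patients are both more likely to get the drug and more likely to die, the treated arm can show roughly double the mortality even when the drug does nothing at all.

```python
# Illustrative simulation (all numbers hypothetical): confounding by
# indication alone can roughly double apparent mortality in the treated
# group, even when the drug has zero effect.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
severe = rng.random(n) < 0.2                  # 20% of patients severely ill
# Sicker patients are more likely to receive the drug...
treated = rng.random(n) < np.where(severe, 0.5, 0.1)
# ...and more likely to die, independent of treatment (a true null effect).
died = rng.random(n) < np.where(severe, 0.30, 0.05)

mort_treated = died[treated].mean()
mort_control = died[~treated].mean()
print(f"treated: {mort_treated:.1%}, control: {mort_control:.1%}")
```

With these made-up rates the treated arm’s raw mortality comes out a bit more than twice the control arm’s, purely from who got selected into treatment; adjusting with a crude binary severity flag would only partially close that gap.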
I’m not saying that the editor of Lancet should’ve caught this. After all, he’s not a statistician, and indeed his journal has a track record of falling for statistically innumerate but politically convenient policy claims (see here for a discussion of one example).
Before giving up, Lancet doubled down
Here’s a quote from Team Lancet, five days after the problems about that hydroxychloroquine paper came out:
On May 29, Jessica Kleyn, a press officer at The Lancet journals, informed The Scientist in an emailed statement that the authors had corrected the Australian data in their paper and redone one of the tables in the supplementary information with raw data rather than the adjusted data Desai said had been shown before.
“The results and conclusions reported in the study remain unchanged,” Kleyn adds in the email. “The original full-text article will be updated on our website. The Lancet encourages scientific debate and will publish responses to the study, along with a response from the authors, in the journal in due course.”
As I wrote at the time, the real scandal is that the respected medical journal Lancet aids and abets in poor research practices by serving as a kind of shield for the authors of a questionable paper, by acting as if secret pre-publication review has more validity than open post-publication review.
Give credit where due
Unfortunately, the NYT article did not quote or even mention James Watson, Peter Ellis, or other researchers who exposed the problems with the Surgisphere study.
Watson pointed out the statistical problems and the data irregularities; Ellis did some investigation and found out that Surgisphere had no there there.
Why give credit where due? Not just out of fairness to Watson and Ellis. Also because post-publication review is what did the job. Unlike alcohol (in Homer Simpson’s famous phrase), post-publication review is the solution to, not the cause of, these particular problems.
The peer review system did not catch the clear statistical problems with the paper. Also, the peer review system did not catch the fraud.
Post-publication review caught the statistical problems and the fraud.
I’m frustrated with the NYT article because it had about a hundred paragraphs on peer review and just about nothing on post-publication review.
There was this quote from a former editor of the New England Journal of Medicine:
If outside scientists detected problems that weren’t identified by the peer reviewers, then the journals failed.
But that’s not quite right. There’s no way that peer review can even come close to post-publication review. Peer review is done by 3 or 4 insiders; post-publication review is done by thousands of outsiders. There’s no contest.
Here’s a good line
I was amused by this bit from the Times article:
Shot: Dr. Mehra is well respected in scientific circles.
Chaser: Both editors pointed out that Dr. Mehra had signed statements indicating he had access to all of the data and took responsibility for the work, as did other co-authors.
Ouch! It looks like he traded in all that respect for the chance to be first author on a Lancet paper. Maybe not the best way to spend your scientific reputation.
P.S. The key point here is not that it’s some horrible failure of peer review that the Lancet editors published this paper, not noticing some obvious red flags. Mistakes happen. The key point is that open post-publication review is better than secret pre-publication peer review, and I think that this news article missed that big story. Also the Lancet editors and the author of that paper behaved badly by doubling down rather than seriously engaging with the open review.
The New York Times has a long history of not giving credit to people who first spot anything. It’s not news to them until they print it.
I still don’t get it either.
>> In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for.
> I’m not saying that the editor of Lancet should’ve caught this.
Do you think the effect they report (95% CI: [1.2, 1.5]) deserves to be “caught”? It’s not “incompatible” with previously reported effects like [0.6, 1.9], [0.8, 2.4], [0.80, 2.2] and [1.2, 2.9]. Maybe all those studies should have been caught? They all have issues, without doubt. Should peer review aim to reject every paper where the analysis could have been done better?
This came up in one of the earlier posts. This sort of statistical adjustment is hard even when done by experts, and there was no evidence that any of the authors had any expertise in statistical analysis of observational studies.
The point is not that the analysis “could have been done better,” it’s that they were claiming to have effortlessly done something that’s really hard to do. And they provided no data and code. It’s kind of like if I published a paper in the Physical Review on experimental particle physics, based on data from the homemade cyclotron that I set up in my living room.
Yeah… those papers have to go to arXiv.
I don’t know, one of the authors discusses here https://www.onlinejacc.org/content/45/3/388.abstract how observational studies can suffer from the limitation of “confounding by indication,” where the propensity score removes overt biases but does not account for hidden biases due to unrecorded differences between treated and control patients.
If your point is that most observational studies are badly done I may agree. But I’m not sure this paper was special in that sense, let alone in the not-providing-data-and-code issue, the fraud bit is what makes it outstanding.
It’s not even so simple to correct for measured biases. In this case James Watson pointed out problems in the purported adjustments for pre-treatment health status. In any case, my point is that this sort of statistical adjustment is hard even when done by experts, and there was no evidence that any of the authors had any expertise in statistical analysis of observational studies. Writing an editorial comment in a medical journal does not count as expertise.
And, yes, I agree with you that the fraud bit is what makes the story special. Without the fraud, this is just one more crappy paper that got published, perhaps helped out by its political implications and the author’s Harvard affiliation. That’s why I don’t want Lancet to use fraud as an excuse. Even if the paper had no fraud, it would’ve been a bad paper (like a lot of other bad papers that get published, make it through peer review, then get treated as if they’re truth).
> In any case, my point is that this sort of statistical adjustment is hard even when done by experts, and there was no evidence that any of the authors had any expertise in statistical analysis of observational studies. Writing an editorial comment in a medical journal does not count as expertise.
I agree that having published peer-reviewed papers with statistical analysis of observational studies over the years doesn’t count either (it may be a co-author who had the expertise, and even in the single-authorship case medical journals often won’t require that papers are written by the listed authors). But what could have counted as evidence of expertise then?
> That’s why I don’t want Lancet to use fraud as an excuse. Even if the paper had no fraud, it would’ve been a bad paper
An excuse for having published one paper as bad as many others? It seems to me that you’re beating a retracted horse here when you could be targeting other crappy (and crappier) papers.
1. This is not a tough call. The authors are 4 M.D.s. There’s no requirement that a statistician be a coauthor of a paper, but there’s also no reason to trust a complicated statistical analysis when there’s no code, no data, and no statistician involved.
2. I’m writing about this case because I think it’s a mistake for it to be presented as a problem of peer review. The real message I think people should be hearing is not, “Peer review suddenly has problems. It needs to be fixed!” but, rather, “Peer review doesn’t catch bad science. Post-publication review does. Here’s an example.”
It was a lost opportunity for that NYT article to focus on journal editors rather than on the post-publication reviews that were so effective.
You are right on this, Andrew. It’s generally done by independent citizen-scientists who are often retired and so can do full-time citizen advocacy.
> This is not a tough call. The authors are 4 M.D.s. There’s no requirement that a statistician be a coauthor of a paper, but there’s also no reason to trust a complicated statistical analysis when there’s no code, no data, and no statistician involved.
It’s a tangential point, but one shouldn’t assume that no statistician is involved in an analysis published in a medical journal just because the authors are all MDs. At least when the paper has been written to sell something: https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0040019
“We identified 44 industry-initiated trials. We did not find any trial protocol or publication that stated explicitly that the clinical study report or the manuscript was to be written or was written by the clinical investigators, and none of the protocols stated that clinical investigators were to be involved with data analysis. We found evidence of ghost authorship for 33 trials (75%; 95% confidence interval 60%–87%). The prevalence of ghost authorship was increased to 91% (40 of 44 articles; 95% confidence interval 78%–98%) when we included cases where a person qualifying for authorship was acknowledged rather than appearing as an author. In 31 trials, the ghost authors we identified were statisticians.”
(In this case someone from the company wanting to sell something was acknowledged for their statistical review; it could have been a real company marketing a real product with a paper ghostwritten by an actual statistician… but it wasn’t because of the fraudulent aspect of it all.)
When a person who was instrumental in the work is not listed as an author, I think that should constitute fraud. If the person doesn’t want to be an author, then the paper should be required to state somewhere that “xyz work was done by an unnamed individual who declined to be listed as an author of this study.”
Otherwise it “misrepresents the research”
The Lancet and its associated journals do not ask for code or data and do not seem interested in checking any numbers at all.
When I was reviewing a manuscript that used a large public database that applied propensity score analysis I asked that the authors supply a figure of distribution of propensity scores for the two arms. In their resubmission the authors abandoned the propensity score analysis altogether.
My recounting this anecdote is to say that the HCQ Lancet/NEJM saga is a failure of the peer review system.
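The diagnostic the reviewer asked for — a figure of the propensity score distributions in the two arms — is easy to sketch. Here is a minimal illustration on simulated data (all variable names and numbers are hypothetical); in place of a plot it computes a crude overlap index, where a value near 1 means the arms look alike and a value near 0 means they barely overlap:

```python
# Hedged sketch (simulated, hypothetical data): comparing propensity-score
# distributions between treatment arms, the diagnostic a reviewer might ask for.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
severity = rng.normal(0, 1, n)                # hypothetical confounder
true_ps = 1 / (1 + np.exp(-(severity - 1)))   # sicker patients more likely treated
treated = rng.random(n) < true_ps

# Histogram the (here, known) propensity scores by arm.
bins = np.linspace(0, 1, 21)
h_treat, _ = np.histogram(true_ps[treated], bins=bins, density=True)
h_ctrl, _ = np.histogram(true_ps[~treated], bins=bins, density=True)

# Crude overlap index: shared density mass across bins (1 = identical arms).
overlap = np.minimum(h_treat, h_ctrl).sum() * np.diff(bins)[0]
print(f"overlap index: {overlap:.2f}")
```

In real use the propensity scores would come from a fitted model rather than being known, but the point of the figure is the same: if the two histograms barely overlap, the adjusted comparison rests almost entirely on extrapolation.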
If MDs are publishing trash statistical work I point the finger at rampant use of push-button statistical packages. I suspect statistics educators are also promoting this culture.
@MD and an engineer but not a Statistician has a very good point, and it is probably even worse than MD’s conclusion.
There are also good computer scientists out there using non-point-and-click software who have no interest in understanding the data but learn from things like ‘Kaggle competitions’.
The original or processed data might often be normalized or standardized (often unnecessarily) and lose much physical meaning – if they had any interest in or understanding of it in the first place.
James Watson and Peter Ellis were not the first to debunk the Lancet paper, other people did it earlier.
To give credit where due, I prepared a timeline of the NEJM/Lancet Hydroxychloroquine scandal:
Thanks. I edited the above post. The NYT should publish your timeline!
We should organize a prize for Twitter whistleblowers, at least something symbolic to acknowledge their contributions
Get some MSFTies, AMZNians, APPLetions & GOOGoyles to pull together a $20M endowment for annual $100K prizes in ten disciplines. That could generate a lot of post-publication review.
To me, they can do and say what they want. I’ll wait for them to share the (raw) data and the code they used. Until then, I won’t give a … nothing to what they say.
Here’s what I don’t understand: why don’t the Lancet and NEJM employ statisticians to review articles? These journals make a lot of money. They could continue to do peer review, but also hire statisticians to go over articles. Medical experts think they know enough stats to review science articles, but they don’t. They’ve been fooled enough to know that they can be fooled. The FDA has statisticians review all submissions. Why do these journals keep pretending that review by subject matter experts is sufficient?
Because the paper didn’t include the raw data. There’s precious little for a statistician to work with.
Actually, the NEJM sometimes does have statisticians go over articles. In particular, this one I was involved in: https://pubmed.ncbi.nlm.nih.gov/9420342/. After we submitted it we realized we had mucked up the sections on statistical methods, and what we wrote was obviously invalid. The statistical reviewer missed this and reviewed our paper favorably. Fortunately, we were able to correct things before publication.
But also, someone I knew reviewed for one of the top clinical journals and discussed some of the reviews after they did them. They usually made errors and I would point them out and they would agree. Then they stopped talking to me about them ;-)
It’s simply a myth that most statisticians can do good work and get hired by non-statisticians.
I don’t think that any old statistician is enough.
I have a history in physical sciences and had to read many medical journal articles on the safety of nano zinc oxide for example; they usually confuse the argument by either using acronyms that aren’t usual in physical science and further confuse things by pointing out that ‘nano’ ZnO is dangerous to human cells in isolation rather than give a baseline of ‘non-nano’ ZnO.
Same with potential treatments for Covid. Zn2+ does kill Covid, other viruses, and even human cells in isolation, but no real application of nano ZnO on the skin from a sunscreen has ever been proven to be a danger.
Also, too much or too little water, salt, etc. will probably kill an isolated Covid sample AND a human cell sample too.
Any old statistician can make what they want out of data! That is the big problem here. Either the dodgy Surgisphere mob (and related authors) or the ones trying to defend it.
It sounds like you disregard the Surgisphere/Lancet/Harvard study’s sensitivity analysis for hidden confounders (their “tipping point” analysis). Why?
I’m referring to the sensitivity analysis where they posit an unobserved binary confounder with 50 percent prevalence in the treatment group and examine how large the HR for this confounder would have to be to remove the observed association between the treatment and the outcome. I mean, their sensitivity analysis is p-value based, and it would be much better if it weren’t. But the strength of your disagreement with their methods makes me think that you disagree with the sensitivity analysis for other, more fundamental reasons than its unnecessary use of p-values.
Sorry if you’ve already addressed this somewhere.
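For readers unfamiliar with this kind of “tipping point” calculation, here is a hedged sketch of the general technique (not the paper’s actual method or inputs — the observed HR and control-arm prevalence below are illustrative assumptions). It uses the standard bias factor for a binary unmeasured confounder on the risk-ratio scale and scans for the confounder strength that would explain away the observed association:

```python
# Hedged sketch of a "tipping point" sensitivity analysis for an
# unmeasured binary confounder. All inputs are illustrative assumptions,
# not the Surgisphere paper's actual numbers.
import numpy as np

observed_hr = 1.33   # assumed observed hazard ratio (hypothetical)
p_treated = 0.50     # confounder prevalence in treated arm (as in the comment)
p_control = 0.10     # assumed prevalence in control arm (hypothetical)

def bias_factor(rr_u, p1, p0):
    # Standard bias factor for a binary unmeasured confounder with
    # confounder-outcome risk ratio rr_u and arm prevalences p1, p0.
    return (p1 * (rr_u - 1) + 1) / (p0 * (rr_u - 1) + 1)

# Scan confounder-outcome risk ratios until the adjusted HR drops to 1.
grid = np.linspace(1.0, 10.0, 100_001)
adjusted = observed_hr / bias_factor(grid, p_treated, p_control)
tip = grid[np.argmax(adjusted <= 1.0)]
print(f"confounder risk ratio needed to explain away the HR: ~{tip:.2f}")
```

Under these made-up inputs, a confounder would need a risk ratio of roughly 1.9 to fully account for the observed association — which is exactly why the choice of assumed prevalences, and whether severity was measured well enough in the first place, does so much of the work in such an analysis.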
Post-publication review worked here, but it rarely has a chance. NEJM gives 3 weeks (it used to be 2 weeks) to submit a letter to the editor regarding a manuscript. How are methods to be routinely proofed if code and data are not provided?
The most basic form of external post-publication review is people screaming, “Hey—the data and code aren’t available!”
I mentioned in twitter that peer reviews should be considered a process. This means looking at what is done and how. The checklists suggested in https://statmodeling.stat.columbia.edu/2020/05/02/statistics-controversies-from-the-perspective-of-industrial-statistics/ are all about this. Obviously, you cannot “industrialise” peer reviews but some sort of framing would help ensure consistency and thoroughness.
These days, the main challenge is to find good reviewers to look at specific submissions. In this context, where demand exceeds supply, the peer review process is not discussed. Editors, concerned about their journal’s reputation, should join in to try to improve it.
The checklists in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3591808 are a first step.
The paper at this link is another: https://content.iospress.com/articles/statistical-journal-of-the-iaos/sji967
NYT & journalism in general might not be too excited about attacking the peer review system. It works in their interests.
First, it’s a rubber stamp that gives them the green light to turn brains off and headline and press machines on. “Peer Review” in journalism school is the golden stamp of approval. What’s peer reviewed is Official Science, and what isn’t peer reviewed Certified Quackery. It’s a nice bright line that any journalist, even one from NYT, can understand.
Second, because it lets a lot of garbage slip through, it gives them lots of great click-bait and headlines.
What I don’t see addressed anywhere is what was the motivation to publish fraudulent studies? Was Desai directly or indirectly compensated by the drug companies whose drugs benefited from the fraud?
I think that focusing on “motivation to publish fraudulent studies” may be missing the real problem — namely, that a lot of people have a very shallow idea of what constitutes a good study.
Martha – while what you say is certainly true, I am not sure that is the real problem in this case. I am very concerned that the Lancet and NEJM may just think this is over – and may even declare it as an example of how publication works well. We don’t know the real story here – was the data made up, who really did that and/or knew about it, and perhaps most importantly, why? Your explanation potentially leaves them off the hook by just pleading incompetence. I fear it may be more sinister than that – and that worries me more than just poorly done studies.
You may be right. I may be naive.
Good science treats fraud the same as incompetence. It was impossible for anyone to replicate this study based on the methods provided.