“The good news about this episode is that it’s kinda shut up those people who were criticizing that Stanford antibody study because it was an un-peer-reviewed preprint. . . .” and a P.P.P.S. with Paul Alper’s line about the dead horse

People keep emailing me about this recently published paper, but I already said I’m not going to write about it. So I’ll mask the details.

Philippe Lemoine writes:

So far it seems you haven’t taken a close look at the paper yourself and I’m hoping that you will, because I’m curious to know what you think and I know I’m not alone.

I really think there are serious problems with this paper and that it shouldn’t have been published without doing something about them. In my opinion, the most obvious issue is that, just looking at table **, one can see that people in the treatment groups were almost ** times as likely to be placed on mechanical ventilation as people in the control group, even though the covariates they used seem balanced across groups. As tables ** in the supplementary materials show, even after matching on propensity score, there are more than twice as many people who ended up on mechanical ventilation in the treatment groups as in the control groups.

If the control and treatment groups were really comparable at the beginning, it seems very unlikely there would be such a difference between them in the proportion of people who ended up being placed on mechanical ventilation, so I think it was a huge red flag that the covariates they used weren’t sufficient to adequately control for disease severity at baseline. (Another study with a similar design published recently in the NEJM used more covariates to control for baseline disease severity and didn’t find any effect.) They should at least have tried other specifications to see if that affected the results. But they didn’t and only used propensity score matching with exactly the same covariates in a secondary analysis.

In the discussion section, when they talk about the limitations of the study, they write this extraordinary sentence: “Due to the observational study design, we cannot exclude the possibility of unmeasured confounding factors, although we have reassuringly [emphasis mine] noted consistency between the primary analysis and the propensity score matched analyses.” But propensity score matching is just a non-parametric alternative to regression, it still assumes that treatment assignment is strongly ignorable, so how could it be reassuring that unmeasured confounding factors didn’t bias the results?

I actually have an answer to that one! In causal inference from observational data, you start with the raw-data comparison, then you adjust for basic demographics, then you adjust for other available pre-treatment predictors such as pre-existing medical conditions, smoking history, etc. And then you have to worry about adjustments for the relevant pre-treatment predictors you haven’t measured. At each step you should show what your adjustment did. If adjusting for demographics doesn’t change your answer much, and adjusting for available pre-treatment predictors doesn’t change your answer much, then it’s not so unreasonable to suppose that adjustment for other, harder-to-measure, variables won’t do much either. This is standard reasoning in observational studies (see our 1990 paper, for example). I think Paul Rosenbaum has written some more formal arguments along those lines.
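
To make that logic concrete, here is a minimal sketch of the sequence of adjustments (simulated data and hypothetical variable names; this is not the disputed paper’s analysis). The thing to watch is how much the estimated treatment effect moves at each step.

```python
# Sketch of the stepwise-adjustment logic described above. Simulated data
# and hypothetical variable names; this is NOT the disputed paper's analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 15, n)
male = rng.binomial(1, 0.5, n)
comorbidity = rng.binomial(1, 0.3, n)
# Treatment assignment depends (weakly) on the measured covariates.
p_treat = 1 / (1 + np.exp(-(-1 + 0.02 * (age - 60) + 0.3 * comorbidity)))
treated = rng.binomial(1, p_treat)
# Outcome depends on the covariates; the true treatment effect here is null.
p_die = 1 / (1 + np.exp(-(-3 + 0.05 * (age - 60) + 0.4 * male + 0.8 * comorbidity)))
died = rng.binomial(1, p_die)
df = pd.DataFrame(dict(age=age, male=male, comorbidity=comorbidity,
                       treated=treated, died=died))

# Step 1: raw comparison; step 2: adjust for demographics;
# step 3: adjust for the other measured pre-treatment predictors.
for label, formula in [
    ("raw",            "died ~ treated"),
    ("+ demographics", "died ~ treated + age + male"),
    ("+ clinical",     "died ~ treated + age + male + comorbidity"),
]:
    fit = smf.logit(formula, data=df).fit(disp=False)
    print(f"{label:15s} OR = {np.exp(fit.params['treated']):.2f}")
# The thing to look at is how much the estimate moves at each step. If it
# stabilizes once the measured covariates are in, one hopes (but cannot
# prove) that unmeasured confounders would not move it much either.
```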

Lemoine continues:

Of course, those are hardly the only issues with this paper, as many commenters have noted on your blog. In particular, I think **’s analysis of table ** is pretty convincing, but as I noted in response to his comment, if he is right that it’s what the authors of the study did, what they say in the paper is extremely misleading. They should clarify what they did and, if the data in that table were indeed processed, which seems very likely, a correction should be made to explain what they did.

Frankly, I don’t expect ** to have anything other than a small effect, whether positive or negative, so I don’t really care about that issue. But I fear that, since the issue has become politicized (which is kind of crazy when you think about it), many people are unwilling to criticize this study because the conclusions are politically convenient and they don’t want to appear to side with **. I think this is very bad for science and that it’s important that post-publication peer review proceeds as it normally would.

I just wanted to encourage you to dig into the study yourself because I’m curious to know what you think. Moreover, if you agree there are serious issues with it and say that on your blog, the authors will be more likely to respond instead of ignoring those criticisms, as they have been doing so far.

My reply:

By now, enough has been said about this study that I don’t need to look into it in detail! At this point, it seems that nobody believes the published analysis or conclusion, and the main questions revolve around what the data actually are and where they came from. It’s become a pizzagate kind of thing. It’s possible that the authors will be able to pull a rabbit out of the hat and explain everything, but given their responses so far, I’m doubtful. As we’ve discussed, ** (and journals in general) have a poor record of responding to criticisms of the papers they publish: at best, the most you’ll usually get is a letter published months after the original article, along with a bag of words by the original authors explaining how, surprise! none of their conclusions have changed in any way.

The good news about this episode is that it’s kinda shut up those people who were criticizing that Stanford antibody study because it was an un-peer-reviewed preprint. The problem with the Stanford antibody study is not that it was an un-peer-reviewed preprint; it’s that it had bad statistical analyses and the authors supplied no data or code. It easily could’ve been published in JAMA or NEJM or Lancet or whatever and had the same problems. Indeed, “Stanford” played a similar role as “Lancet” in giving the paper instant credibility. As did “Cornell” with the pizzagate papers.

As Kelsey Piper puts it, “the new, fast scientific process (and even the old, slow scientific process) can produce errors — sometimes significant ones — that make it through peer review.”

P.S. Keep sending me cat pictures, people! They make these posts soooo much more appealing.

P.P.S. As usual, I’m open to the possibility that the conclusions in the disputed paper are correct. Just because they haven’t made a convincing case and they haven’t shared their data and code and people have already found problems with their data, that doesn’t mean that their substantive conclusions are wrong. It just means they haven’t supplied strong evidence for their claims. Remember evidence and truth.

P.P.P.S. I better explain something that comes up sometimes with these Zombies posts. Why beat a dead horse? Remember Paul Alper’s dictum, “One should always beat a dead horse because the horse is never really dead.” Is it obsessive to post multiple takes on the same topic? Remember the Javert paradox. It’s still not too late for the authors to release their code and some version of their data and to respond in good faith on the pubpeer thread, and it’s not too late for the journal to do something either.

What could the journal do? For one, they could call on the authors to release their code and some version of their data and to respond in good faith on the pubpeer thread. That’s not a statement that the published paper is wrong; it’s a statement that the topic is important enough to engage the hivemind. Nobody’s perfect in design of a study or in data analysis, and it seems absolutely ludicrous for data and code to be hidden so that, out of all the 8 billion people in the world, only 4 people have access to this information from which such big conclusions are drawn. It’s kind of like how in World War 2, so much was done in such absolute secrecy that nobody but the U.S. Army and Joseph Stalin knew what was going on. Except here the enemy can’t spy on us, so secrecy serves no social benefit.

37 thoughts on ““The good news about this episode is that it’s kinda shut up those people who were criticizing that Stanford antibody study because it was an un-peer-reviewed preprint. . . .” and a P.P.P.S. with Paul Alper’s line about the dead horse”

  1. On your response to my argument that they probably didn’t adjust for baseline severity adequately, the problem is that we don’t actually know how much adjusting for various factors changed the effect on mechanical ventilation, because they didn’t use mechanical ventilation as the response in any of the Cox models they estimated. They only used mortality and, in the supplementary materials (table S6), the composite end point “mortality or mechanical ventilation”. The hazard ratios for the various treatments they looked at were ~1.5 for this composite end point, so given that the ratio for mortality was ~1.3, it would presumably be higher for mechanical ventilation alone, but probably not 2.6-2.8, which is the difference between the groups in table 2. In tables S7A-D, where they report the results of the propensity score matching analysis, the difference is 2-2.4. Thus, adjusting for the various covariates they included in the model probably did reduce the effect on mechanical ventilation, but it was still huge even after statistical adjustment. Even if you don’t think this reduction was huge, it was definitely large on mortality, for which they found hazard ratios of ~1.3 for all treatments, when table 2 shows between 1.8 and 2.5 times more dead people in the treatment groups than in the control group. So I don’t think your response to my argument really works in this case.
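
    As a rough back-of-the-envelope check of that arithmetic (all baseline rates below are hypothetical, hazard ratios are treated as approximate risk ratios, and the overlap between death and ventilation is ignored), a composite ratio of ~1.5 combined with a mortality ratio of ~1.3 implies a ventilation-only ratio somewhat above 1.5, nowhere near 2.6-2.8:

```python
# Back-of-the-envelope check only: the baseline rates are hypothetical,
# hazard ratios are treated as approximate risk ratios, and the overlap
# between death and ventilation is ignored.
r_mort_control = 0.09   # assumed control-group mortality rate
r_vent_control = 0.08   # assumed control-group ventilation rate
hr_mort = 1.3           # reported hazard ratio for mortality (approx.)
hr_composite = 1.5      # reported hazard ratio for the composite endpoint (approx.)

comp_control = r_mort_control + r_vent_control
comp_treated = hr_composite * comp_control
mort_treated = hr_mort * r_mort_control
vent_treated = comp_treated - mort_treated
print(f"implied ventilation-only ratio = {vent_treated / r_vent_control:.2f}")
# Roughly 1.7 under these assumed rates: above the composite ratio of 1.5,
# but nowhere near the crude 2.6-2.8 ratio seen in table 2.
```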

    I agree with the rest of your post, but note that a correction was recently issued (https://twitter.com/TheLancet/status/1266396646809767936), which confirms Jacob Steinhardt’s analysis on your blog (https://statmodeling.stat.columbia.edu/2020/05/25/hydroxychloroquine-update/#comment-1344852), though it leaves unanswered most of the concerns about the data that people noted here and elsewhere. However, the Lancet said it would publish responses to the paper, along with replies by the author, so hopefully we’ll soon know whether they can explain the weird things people have noticed.

    • A huge problem (and this is not the authors’ fault) is that the qSOFA assessment tool is most probably not relevant at all for Covid patients and does not capture the real severity of their condition. Watson has already cited this article on that subject: https://annalsofintensivecare.springeropen.com/articles/10.1186/s13613-020-00664-w.

      This could explain why the groups given HCQ (with or without a macrolide) are so different from the control group: it is quite plausible that the drugs were given to patients who were unstable and whose condition worsened rapidly after inclusion. We do not have any information on the treatment strategies used in the units (“routine” or compassionate administration of these drugs). Therefore, in this case, these patients died despite having received HCQ, not because of it. But we will never know, because we don’t have that information.

      (sorry for my terrible English, I am French speaking and I use deepl translate)

  2. Re: As usual, I’m open to the possibility that the conclusions in the disputed paper are correct. Just because they haven’t made a convincing case and they haven’t shared their data and code and people have already found problems with their data, that doesn’t mean that their substantive conclusions are wrong. It just means they haven’t supplied strong evidence for their claims. Remember evidence and truth.
    —-

    This resonates with me.

  3. Seeing a lot of people say that they think the paper will be pulled, etc., even referencing the posts here. I’m almost jealous of how optimistic they can be, but I think it’ll be a few years before we see any major moves happening (retraction, if it ever goes to that). Who knows, maybe the pandemic will change the speed at which this stuff happens, but I’m not expecting anything.

    It’s clear that if authors publish a paper saying 1 + 1 = 3, and others point out that it’s wrong, it still won’t be retracted, at least not for a long time. Editors will publish some letters to the editor and a bunch of other stuff to make it seem like discourse is happening, but will almost never pull the paper.

    That paper on calculating post hoc power using the published effect estimates from a study STILL hasn’t been pulled, and that was just plain wrong and attracted a ton of attention.

    https://retractionwatch.com/2019/06/19/statisticians-clamor-for-retraction-of-paper-by-harvard-researchers-they-say-uses-a-nonsense-statistic/

    • Zad:

      Yes, we discussed this general issue of the journal letting bad results stand a few days ago, considering two past examples of bad papers that Lancet published:

      1. The Andrew Wakefield vaccines paper, which was finally retracted 12 years after publication.

      2. The gun-control paper, which was never retracted. The journal ran a couple of letters and a scientifically incorrect response, and that was it. Never even an apology, of the form, “Yeah, we don’t feel comfortable actually retracting this terrible paper, but at least we’ll let you, the reader, know not to trust it.” If you go to the Lancet’s webpage for that article, it doesn’t even link to the letters page.

      My impression is that, to the journal editors, scientific accuracy is not as important as career building and political position taking.

      So maybe the journal editors will retract this recent paper, if they think it’s good for their careers to do so, and if they think it’s a good political position to take.

      • Andrew:

        My impression is that, to the journal editors, scientific accuracy is not as important as career building and political position taking.

        I want to believe that the spectrum is “a bit” wider than that.
        IMHO the journal often doesn’t take any position because, ideally, bad and good results will average out in the long run, with correct models prevailing over wrong ones.
        The main issue is the sensationalism associated with some study results. Relying on a single study to make definitive claims is the opposite of what empirical science is; newspapers, the public (and often scientists too) should be reminded of this. Scepticism should be proportional to the importance and novelty of a study. The short circuit between media, academia, and scientific journals has produced a sort of “click-baiting” bias, with the results we all see. We should all be more relaxed and enjoy research and the understanding of new things, small or big. Big money and fame are not supposed to be part of any meta-science model of science itself.

        • Paolo:

          Sure, but what about that gun-control paper? It was obviously a bad idea. Somehow it slipped through the editorial process, and then the paper is there forever. In Lancet. Canonical. Journals don’t have a mechanism for saying they’re sorry that they published something that’s clearly wrong. Also see all those Psychological Science papers from 2010 through 2015 or so.

        • Andrew:

          You are absolutely right. But science is also what you are doing here on your blog.
          Science is also the fact that you are saying that these papers are de facto bad (I agree with you). A journal is just a container. Giving blind credit to journals is part of the big problem of the bureaucratization of science. But I also know a lot of people who believe in science and use their critical thinking, despite what the Lancet or others say.

          People should be better educated about the fact that “facts” are the sum of several years of confirmation and new knowledge.
          IMHO, science, as done by scientists, is in good shape. This blog proves that. But, unfortunately, science is failing at one of its main purposes: teaching that facts should be evaluated, tested, and then accepted as “temporarily” true.

        • “But, unfortunately, science is failing at one of its main purposes: teaching that facts should be evaluated, tested, and then accepted as “temporarily” true.”

          I meant teaching the general public.

  4. Because Andrew mentioned the Santa Clara study:

    This is hilarious:

    –snip–

    LOS ANGELES (CBSLA) – While a new round of antibody testing appeared to indicate that Los Angeles County has done a good job limiting the spread of the coronavirus, it also showed that the region is still not close to achieving herd immunity even as the number of coronavirus cases countywide crossed the 40,000-mark Wednesday.

    There were 1,324 new L.A. County coronavirus cases and 57 deaths reported Wednesday. It brings the total number of cases to 40,857, and the death toll to 1,970.

    Officials also Tuesday released the results from the second phase of an ongoing antibody study being conducted by USC and the L.A. County Department of Public Health.

    –snip–

    Ok, that’s not the hilarious part. Here’s the hilarious part:

    –snip–

    1,014 Angelenos were tested from May 8-12 in a drive-thru and in-home format. 2.1% of them tested positive for coronavirus antibodies, officials announced.

    This was significantly down from the 4.65% who tested positive in the first phase of testing, which was conducted April 10-14, the results of which were published Monday in the Journal of the American Medical Association.

    –snip–

    But wait. It gets even more hilarious.

    –snip–

    The second phase was conducted at a completely different site than the first phase. There was also more of an effort made to ensure Latinos, Asians and African-Americans took part in the second phase, L.A. County Public Health Director Dr. Barbara Ferrer disclosed.

    –snip–

    So they included a more representative sample, and the number *went down*, which is in complete contradiction to their rationalization for why the numbers in their testing with MLB employees were low. But wait, it gets even funnier still:

    –snip–

    “If you pooled the results across the two waves…about three percent tested positive,” lead investigator Dr. Neeraj Sood, a USC professor of public policy, told reporters at a news briefing Wednesday.

    –snip–

    What? If you pooled the results? He’s saying that you should basically set aside the first findings as they stand (the findings they used to stage a national publicity campaign, to weigh in on public health policy options, and to say that the policies in place were “draconian”) and instead pool them with the second findings.

    Because that would make the numbers more to their liking?

    Remarkable!

    https://losangeles.cbslocal.com/2020/05/20/la-county-still-far-away-from-herd-immunity-new-antibody-numbers-show/

      • It says that
        a) lateral flow assays are not very sensitive and can’t detect low levels of antibodies months after the infection
        b) cheap lateral flow assays have cross-reactions with HCoV antibodies, and people have had fewer of these for a few months now as spring arrived
        c) lateral flow assay surveys in areas with single digit prevalence percentages are unreliable (surprise!)

        Take your pick.
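
        To put some numbers on (c): here is a minimal sketch, with assumed (hypothetical) sensitivity and specificity for a cheap lateral flow assay, of how much false positives distort the measured prevalence when the true prevalence is in the low single digits:

```python
# How imperfect sensitivity/specificity distort measured prevalence at low
# true prevalence. The test characteristics below are hypothetical.
def apparent_prevalence(true_prev, sens, spec):
    """Expected fraction testing positive: true positives + false positives."""
    return true_prev * sens + (1 - true_prev) * (1 - spec)

def corrected_prevalence(obs_prev, sens, spec):
    """Rogan-Gladen correction: invert the formula above."""
    return (obs_prev + spec - 1) / (sens + spec - 1)

sens, spec = 0.85, 0.98   # assumed lateral-flow characteristics
for true_prev in (0.01, 0.02, 0.04):
    obs = apparent_prevalence(true_prev, sens, spec)
    print(f"true {true_prev:.0%} -> observed {obs:.1%}")
# At 1% true prevalence, the false positives (2% of the 99% uninfected)
# outnumber the true positives by more than two to one.
```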

      • Yeah. Yeah. I…yeah. There are plenty of legitimate possibilities, yeah. I don’t understand the secrecy. It can be a lot of things, legitimate things.

    • Hey, at least they didn’t bury the real data, fabricate some other data, and lie about it. I mean, these days that’s something, right?

      In any case, with all of these surveys they need to be running them *frequently* and in various populations. It’s totally expected that you’d get plenty of variation under different recruitment strategies.

      • Daniel –

        > Hey, at least they didn’t bury the real data, fabricate some other data, and lie about it. I mean, these days that’s something, right?

        I’m still trying to hold on to my prior that the bar doesn’t need to be set that low. I want to think that the motivated reasoning on display with that group (I didn’t want to go there at first, but their rationalizations of their findings have landed me there) is not typical.

      • Yeah. I saw that. Another convenience sample. Worse yet, of people who went into grocery stores? Really?

        I must be missing something, ’cause I don’t get why researchers think they can confidently and meaningfully extrapolate from that kind of sampling.

        • I think it’s better than no data. Ideally you’d do one at grocery stores, one at take-out restaurants, one with a church congregation, etc etc and get a sense of the variation among groups.

        • Daniel –

          > I think it’s better than no data. Ideally you’d do one at grocery stores, one at take-out restaurants, one with a church congregation, etc etc and get a sense of the variation among groups.

          I agree. I don’t criticize the studies, per se. Information/data are good. My issue is with what people do with the data – specifically when they extrapolate from convenience sampling.

    • You seem to be objecting to Sood not pooling his data with the data from the Bendavid study, which was done on another population with very different selection criteria.

      There’s
      a) Bendavid’s Santa Clara study
      b) Sood’s first wave in LA County
      c) Sood’s second wave in LA County

      The Sood study was announced from the start as planning several waves, so b) and c) are actually part of the same study, while a) is not. Not including the data from a), but pooling b) and c) looks reasonable to me.

      • Mendel –

        > Not including the data from a), but pooling b) and c) looks reasonable to me.

        I don’t understand how it’s OK to pool data collected a month apart and then average them to infer a pooled infection rate now. That’s what the author stated. Seems to me that at best, you could only infer that the infection rate midway through that time period was about midway between the two rates they found. Well, except that it would mean that people are getting uninfected.

        Are you suggesting that it’s valid to lower the rate they just found, by pooling with the data from a month earlier?
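
        For reference, the “about three percent” figure is just a sample-size-weighted average of the two waves. A quick sketch of that arithmetic (the wave-2 n of 1,014 and both percentages are from the article quoted above; the wave-1 n is an assumed round number, used here only for illustration):

```python
# Sample-size-weighted pooling of the two waves. The wave-2 n (1,014) and
# both percentages come from the article quoted above; the wave-1 n is an
# assumed round number, for illustration only.
n1, p1 = 850, 0.0465    # wave 1 (April 10-14); n1 is hypothetical
n2, p2 = 1014, 0.021    # wave 2 (May 8-12)
pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
print(f"pooled prevalence = {pooled:.1%}")   # about 3.3%, i.e. "about three percent"
# This is a weighted average of an April estimate and a May estimate; it is
# not an estimate of the prevalence at either date.
```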

  5. > At this point, it seems that nobody believes the published analysis or conclusion,

    The analysis may be biased because they don’t fully adjust for severity but the conclusion doesn’t seem *that* unbelievable.

    In his first message here Watson wrote that “The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ group”. In fact the increase in mortality they report is quite lower, but still huge.

    He wrote that the huge effect found suggested the analysis was wrong and pointed to a NEJM study with a better analysis, noting that “they saw no effect on mortality”. Actually, they don’t say much about mortality, their composite endpoint being mortality or intubation. And their estimate and 95% confidence interval for the hazard ratio is 1.04 [0.82 1.32].

    Would you say that [0.82 1.32] shows no effect while [1.22 1.46] claims a huge effect? Do you find those results contradictory?
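
    One rough way to put those two intervals side by side (a sketch only: it treats the published 95% intervals as independent normal estimates on the log-hazard-ratio scale and ignores all the other differences between the studies) is to back out the standard errors and compare the estimates directly:

```python
# Rough comparison of the two published hazard-ratio intervals on the log
# scale. Treats them as independent normal estimates and ignores every other
# difference between the two studies, so it is only a plausibility check.
import math

def from_ci(lo, hi):
    """Point estimate (geometric midpoint) and SE of log(HR) from a 95% CI."""
    return math.log(math.sqrt(lo * hi)), (math.log(hi) - math.log(lo)) / (2 * 1.96)

log_a, se_a = from_ci(0.82, 1.32)   # the NEJM-style result, HR ~ 1.04
log_b, se_b = from_ci(1.22, 1.46)   # the disputed paper's result, HR ~ 1.3

z = (log_b - log_a) / math.sqrt(se_a**2 + se_b**2)
print(f"z for the difference between the two estimates = {z:.1f}")   # about 1.9
# The two results are not obviously contradictory: "significant" vs. "not
# significant" is not itself a significant difference.
```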

    > and the main questions revolve around what the data actually are and where they came from.

    The main issues with the data (patient characteristics too similar across continents, too many deaths in Australia) have been addressed. The “very large Japanese hospital” objection was based on flawed reasoning.

    I think the dosing question remains open, but I don’t know how much of a problem it is. I have not investigated the issue and know little about it, but I noticed that while Watson wrote that “Nowhere in the world recommends higher doses than this, with the exception of the RECOVERY trial in the UK”, there was at least one other exception where higher doses were given: the Brazilian trial that resulted in death threats to investigators, which I mentioned the other day.

    Would it be nice to have the data? At least more detailed information about it? Could a better analysis yield a different result? Closer to the truth? The answer to those questions is yes, but I’m not convinced those are problems that call for a retraction of the paper.

    • I haven’t been following this particular kerfuffle very closely, but isn’t the main issue that it seems totally implausible that anyone would have access to this vast array of data across the globe, and it’s kind of like someone telling you they found the Ark of the Covenant and read the clay tablets that tell of the precise characteristics of the apocalypse that will occur in 2020 AD or something?

      If the data doesn’t exist in the first place, the method of analysis is moot.

      • Indeed. Aside from the statistical questions that were raised and discussed in the blog posts, one of the major criticisms of the paper concerns the actual data and how the authors got access to them. This is very sensitive data, and normally you’d need lots of informed consent, ethics review authorizations, etc. The group I work in had to do a lot just to gain access to something far smaller (as in, two orders of magnitude smaller) than this data set.

        However, the authors claimed no ethics review was necessary. This raises some eyebrows and warrants at least some clarifications.

        • If Surgisphere has access to digital health records and billing information for research under existing legislation, and has had this access already in place because this is their business model, it doesn’t seem implausible, and would probably exempt them from ethics review: the data is observational only, and consent would be governed by legislation and the way the data collection for the patient records is set up, probably including a blanket science clause. What they would have been guilty of might just be representing a single source in Australia as a “hospital” when it could encompass records from many hospitals and some private practices or care homes.
          (Unfortunately I haven’t been able to access the clarification that Surgisphere posted about this.)

          I believe few people understand that there is a difference between a targeted HCQ treatment with a well-defined indication and dosage and an “all treatments” overview that would include compassionate use. HCQ may well be a health hazard if applied with no understanding of its effect, and beneficial for specific patients in specific circumstances with a specific dosage. (This is probably true for most effective drugs. My grandmother overdosed on aspirin because she had heard it helped against thrombosis, and had to be hospitalized for internal bleeding.) This study doesn’t say “HCQ is bad”, it says “the way we have been applying it is bad”. It says “just take it” is bad advice because you might have a lot to lose here. It’s a responsible decision for any ongoing trial to sit back and answer the question, “how safe is our own treatment plan”, because this study shows that HCQ can be very unsafe.

          I would love to hear more about the ACE2-inhibitor data that the study also shows: why is that apparently associated with a survival benefit, and how can that be harnessed for treatment? It stands to reason that blocking ACE2 receptors would make infecting cells more difficult for the virus, and that people taking these would have been taking them chronically, i.e. also during the early stages of the Covid-19-infection.

          So my takeaway “truth” from this study is “don’t take HCQ willy-nilly” (which reflects current regulations in many countries) and “expect a study showing benefits of ACE2 inhibitors”. Who’s going to argue with that?

        • The problem is also the reactions to such a study: a lot of trials were, IMO, stopped without checking whether they observed the same results or not (precautionary principle, rather than hard data). OTOH, the RECOVERY trial did an immediate check on their records and, as the situation was not the same, it continued.

          UMN’s COVID-PEP study also reported no cardiotoxicity (without macrolide usage). So there is a lot that is not addressed by this observational study (dosage in particular, as you mention), and that’s why proper trials are important, so that the debate can be put to an end.

          As an aside, I know institutions (at least mine) that would not have allowed Surgisphere to keep the data proprietary (or would have put conditions on it so that it could be shared at least in aggregated form); or rather, they wouldn’t have enrolled in such a program.

      • The main issue seems to be that the data (and code) has not been published, and it’s not real anyway, and if it is real it has been obtained irregularly, and it’s bad data anyway, and even if it is good data the analysis is bad (and we don’t like the result, because if we did we wouldn’t be asking all these questions).

        Some of these allegations could be made about 99% of published research, others are grave but wouldn’t affect the result, and others invalidate everything, as you say. Personally, I’m more interested in understanding how the data and the analysis are biased or otherwise flawed.

        I don’t find it likely that the data doesn’t exist in the first place. If your take is that it’s totally implausible that anyone would have access to this vast array of data, the critics may have a communication problem. How would your perception of the issues change if the data does indeed exist?

        • This is not just the usual case of wishing the code and data would be published.

          The company Surgisphere, which allegedly runs this network of 671 hospitals on six continents, has a handful of employees, none of them IT or data developers; the ‘collaborative network’ has been open only since September 2019; the QuartzClinical software that allegedly integrates with EHR, finance, and supply systems was only launched last year; the process for signing up and starting to provide this sensitive data over the cloud is described as taking only an hour or two; and the company’s websites are frankly weird and in no way conducive to believing they run this global data management operation. It is nearly impossible that this software is operating in all these hospitals and sending data back to the USA.

          See my blog post on how implausible this is at http://freerangestats.info/blog/2020/05/30/implausible-health-data-firm.

        • > It is nearly impossible that this software is operating in all these hospitals and sending data back to the USA.

          > I believe with very high probability the data behind that high profile, high consequence Lancet study are completely fabricated.

          One running theme in this blog is that a low p-value may be evidence against the null hypothesis but it doesn’t mean that your preferred alternative hypothesis is correct.

          Other alternative hypotheses do exist: https://statmodeling.stat.columbia.edu/2020/05/25/this-controversial-hydroxychloroquine-paper-whats-lancet-gonna-do-about-it/#comment-1345799

        • Thanks, that vqi possibility is intriguing. However, it’s sufficiently different from the claimed origin of the data that it would still count as fabrication, and certainly as career-ending research fraud, for refusing to list the sources and doing so for deceptive reasons. It’s just that the fabrication would be building on an existing real dataset.

  6. Derek Lowe has posted an entry about Surgisphere on his blog: https://blogs.sciencemag.org/pipeline/archives/2020/06/02/surgisphere-and-their-data

    Among other links (including one to this page), there is one to an Expression of Concern published in the NEJM and one to a twitter thread that seems interesting and links to other interesting twitter threads (I find long texts on twitter unreadable, but I thought I would share anyway).
