This is not a post about remdesivir.

Someone pointed me to this post by a doctor named Daniel Hopkins on a site called KevinMD.com, expressing skepticism about a new study of remdesivir. I guess some work has been done following up on that trial on 18 monkeys. From the KevinMD post:

On April 29th Anthony Fauci announced the National Institute of Allergy and Infectious Diseases, an institute he runs, had completed a study of the antiviral remdesivir for COVID-19. The drug reduced time to recovery from 15 to 11 days, he said, a breakthrough proving “a drug can block this virus.” . . .

While the results were preliminary, unpublished, and unconfirmed by peer review, Fauci felt an obligation, he said, to announce them immediately. Indeed, he explained, remdesivir trials “now have a new standard,” a call for researchers everywhere to consider halting any studies, and simply use the drug as routine care.

Hopkins has some specific criticisms of how the results of the study were reported:

Let us focus on something Fauci stressed: “The primary endpoint was the time to recovery.” . . . Unfortunately, the trial registry information, data which must be entered before and during the trial’s actual execution, shows Fauci’s briefing was more than just misleading. On April 16th, just days before halting the trial, the researchers changed their listed primary outcome. This is a red flag in research. . . . In other words they shot an arrow and then, after it landed, painted their bullseye. . . .

OK, this might be a fair description, or maybe not. You can click through and follow the links and judge for yourself.

Here I want to talk about two concerns that came up in this discussion which arise more generally when considering this sort of wide-open problem where many possible treatments are being considered.

I think these issues are important in many settings, so I’d like to talk about them without thinking too much about remdesivir or that particular study or the criticisms on that website. The criticisms could all be valid, or they could all be misguided, and it would not really affect the points I will make below.

Here are the 2 issues:

1. How to report and analyze data with multiple outcomes.

2. How to make decisions about when to stop a trial and use a drug as routine care.

1. In the above-linked post, Hopkins writes:

This choice [of primary endpoint], made in the planning stages, was the project’s defining step—the trial’s entry criteria, size, data collection, and dozens of other elements, were tailored to it. This is the nature of primary outcomes: they are pivotal, studies are built around them. . . .

Choosing any primary outcome means potentially missing other effects. Research is hard. You set a goal and design your trial to reach for it. This is the beating heart of the scientific method. You can’t move the goalposts. That’s not science.

I disagree. Yes, setting a goal and designing your trial to reach for it is one way to do science, but it’s not the only way. It’s not “the beating heart of the scientific method.” Science is not a game. It’s not about “goalposts”; it’s about learning how the world works.

2. Lots is going on with coronavirus, and doctors will be trying all sorts of different treatments in different situations. If there are treatments that people will be trying anyway, I don’t see why they shouldn’t be used as part of experimental protocols. My point is that, based on the evidence available, even if remdesivir should be used as routine care, it’s not clear that all the studies should be halted. More needs to be learned, and any study is just a formalization of the general idea that different people will be given different treatments.

Again, this is not a post about remdesivir. I’m talking about more general issues of experimentation and learning from data.

56 thoughts on “This is not a post about remdesivir.”

  1. > Yes, setting a goal and designing your trial to reach for it is one way to do science, but it’s not the only way. It’s not “the beating heart of the scientific method.” Science is not a game. It’s not about “goalposts”; it’s about learning how the world works.

    While we could debate whether this is “doing science” or not (and I know that’s his characterization, not yours), pivotal clinical trials are not “about learning how the world works”. They are about getting a piece of information that will be used by other people to decide whether a drug should be approved and what is going to be written in the prospectus in terms of indications, dosing, side effects, etc. Regulators and advisory panels do not like moving goalposts.

    • Interesting.

      So would you say that clinical trials are more about politics (i.e., about the ways we organize our lives) than about science?

      • I’m not sure what the question means. My point is that a registrational clinical trial is not about science for the sake of knowledge; it’s about the kind of science that produces a description of the efficacy and safety of a treatment that will be found reliable and convincing by the third party who has to make a regulatory decision. That involves a lot of guidelines, protocols, statistical analysis plans, reviews, audits and other things, including business considerations, that may fall under that “politics” umbrella.

        • This brings up a longstanding puzzle for me since I’ve been regularly reading this blog. It often seems the prevailing opinion here is inclined against any “statistical” procedure that involves bright-line, a priori decision rules.

          But sometimes a decision is going to be made and we’re better off if the parameters of decision-making are specified ahead of time and then adhered to. Otherwise, it’s all open ended scientific discovery to which presumably everyone in the world is entitled to his or her own opinion. That would be pure politics, then.

        • Brent:

          Decision making is important, and I recommend that it be done using assessments of costs, benefits, and probabilities. Not by using p-values or Bayes factors or other null hypothesis significance testing approaches. I don’t think there’s a good map from “the tail-area probability of a data summary, conditional on a null hypothesis that nobody believes” to good decisions.

        • I have no time at all for the NHST p-value stuff. But the gist of previous discussions here (correct me if I misunderstood) has been that even more sophisticated and principled decision rules are still somehow unacceptable.

          I realize the importance of trading off costs and benefits, but surely it is legitimate to undertake a program of clinical trials with the intent of approving a drug only if the treatment effect is of some clinically meaningful magnitude, estimated with a certain degree of certainty. Of course that doesn’t mean approving it regardless of demonstrated side effects, etc. But I continue to believe there is a place for saying something like “We must establish with 95% certainty that the drug reduces illness duration by more than four days,” no matter what additional conditions or caveats are attached.

          Am I wrong in thinking that you disagree with any a priori criterion of that type?

        • why is 4 days ok, but 3 days isn’t?

          If you want to do drug approval based on utility, you should go ahead and approve anything where the expected utility is larger than 0 (where I assume 0 means whatever exists at the moment). This inevitably HAS to include the cost of the drug, which is at the moment illegal to consider if I understand correctly (and besides there’s no regulatory power to set limits on drug cost).

          I’d be very much in favor of this kind of more principled decision making. But I fear that it’d destroy lots of possibilities for selling essentially ineffective drugs at enormous prices, so I assume drug companies would fight it tooth and nail, except maybe smaller companies.
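To make the expected-utility framing in this thread concrete, here is a minimal sketch of such an approval rule. Everything in it is hypothetical: the posterior draws are faked with a normal distribution, and the dollar values for benefits, harms, and drug price are placeholders, not estimates from any trial.

```python
# Minimal sketch of an expected-utility approval rule, as discussed in this thread.
# Every number here is hypothetical: the posterior draws are faked with a normal
# distribution, and the dollar values are placeholders, not real costs.
import numpy as np

rng = np.random.default_rng(0)

# Pretend posterior draws for days of illness avoided per patient (from some fitted model).
effect_draws = rng.normal(loc=3.0, scale=1.5, size=10_000)

value_per_day_avoided = 2_000.0   # hypothetical value of one fewer day of illness
expected_harm_cost = 500.0        # hypothetical cost of side effects per patient
drug_cost = 3_000.0               # hypothetical price per course

net_benefit = effect_draws * value_per_day_avoided - expected_harm_cost - drug_cost
print(f"Expected net benefit per patient: {net_benefit.mean():,.0f}")
print("Decision:", "approve" if net_benefit.mean() > 0 else "do not approve")
```

The point of the sketch is only that the decision depends on the whole posterior for the effect plus explicit costs, including the drug price, rather than on whether some threshold like four days clears a significance test.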

        • Andrew –

          > Decision making is important, and I recommend that it be done using assessments of costs, benefits, and probabilities. Not by using p-values or Bayes factors or other null hypothesis significance testing approaches.

          Sounds like you’re framing it as an either/or.

          Why shouldn’t p-values, Bayes factors, and/or NHSTs be a component of cost/benefit analyses and assessing probabilities?

        • Joshua said,
          “Why shouldn’t p-values, Bayes factors, and/or NHSTs be a component of cost/benefit analyses and assessing probabilities?”

          And just how would you propose using p-values, Bayes factors, and/or NHSTs in cost/benefit analyses and assessing probabilities?

        • Martha –

          Seems to me it’s information. You use the information to inform the probabilities. To inform the costs and benefits. You don’t look at it as definitive, or dispositive. You gather more information. You share perspectives with others. You check for your biases.

          And, you’ll like this part, you respect the uncertainties.

        • The purpose of cost-benefit analysis in statistical analysis is to use available information as *input* in designing the process for statistical analysis. So what you’re saying sounds bass-ackwards.

        • I read Andrew’s comment as meaning we should take more aspects into account, and not make decisions on arbitrary cutoffs in isolation. I don’t think he means p-values etc should be excluded from the overall assessment but interpreted alongside other relevant information such as costs, size of the benefit, etc.

        • Vegard –

          > I read Andrew’s comment as…

          Thanks.

          If your reading is correct, then I completely agree with Andrew. Your description is exactly what I was attempting to describe to Martha.

        • > Why shouldn’t p-values, Bayes factors, and/or NHSTs be a component of cost/benefit analyses and assessing probabilities?

          Not that we should be treating Andrew’s posts like a golden tablet, but I think he does give his answer in the post:

          > I don’t think there’s a good map from “the tail-area probability of a data summary, conditional on a null hypothesis that nobody believes” to good decisions.

          Regarding:

          > I don’t think he means p-values etc should be excluded from the overall assessment but interpreted alongside other relevant information such as costs, size of the benefit, etc.

          p-values come from some sort of analysis, and so I assume whatever is being done to get that p-value could be replaced by a different analysis.

          In the golden tablet theme, from the Alexey sleep post a few days ago: https://statmodeling.stat.columbia.edu/2020/05/26/alexey-guzeys-sleep-deprivation-self-experiment/

          > The goal should be to learn, not to test hypotheses, and the false positive probability has nothing to do with anything relevant. It would arise if your plan were to perform a bunch of hypothesis tests and then record the minimum p-value, but it would make no sense to do this, as p-values are super-noisy.
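A quick way to see the “p-values are super-noisy” point quoted above is to simulate many exact replications of a single modest-effect experiment and watch how much the p-value varies. The effect size and sample size below are arbitrary choices for the sketch.

```python
# Quick simulation of how much the p-value bounces around across exact replications
# of the same experiment. The effect size and sample size are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_effect = 50, 0.3  # per-group sample size, standardized mean difference

pvals = np.array([
    stats.ttest_ind(rng.normal(true_effect, 1.0, n), rng.normal(0.0, 1.0, n)).pvalue
    for _ in range(1000)
])

print("5th / 50th / 95th percentile p-values:", np.round(np.percentile(pvals, [5, 50, 95]), 3))
print("Share of replications with p < 0.05:", (pvals < 0.05).mean())
```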

        • Ben –

          Thanks.

          I guess this is the part I have the most trouble with:

          > The goal should be to learn, not to test hypotheses, and the false positive probability has nothing to do with anything relevant.

          I’m having trouble wrapping my mind around the “nothing to do with anything relevant” part. But maybe I’m just stuck, unable to move on from an existing mindset.

          I agree with the learning goal, but I feel like I learn something from hypothesis testing. Not something conclusive, but I have learned something about probabilities with a caveat that there are uncertainties.

          Say we have hypothesis testing from a study you agree is well designed, employing a careful use of statistical analysis conducted by knowledgeable statisticians, that shows an extremely low probability of false positives. And it is replicated. And careful meta-analyses of similar studies show consistent results.

          In that case, do you think there is NO evidence relevant to weighing probabilities, to cost-benefit analysis, to decision-making, that arises from hypothesis testing? I’m thinking in particular of application to situations where there are potentially low probability but high damage function risks. Like, for example, the use of a drug to treat a disease.

          I’m not trying to create an absurdum argument – but to understand if there are limiting parameters for the argument being presented.

        • I think it’s that the p-value, the Bayes factor, and whatever NHST stuff doesn’t add anything.

          > Say we have hypothesis testing

          > from a study you agree is well designed, employing a careful use of statistical analysis conducted by knowledgeable statisticians, that shows an extremely low probability of false positives. And it is replicated. And careful meta-analyses of similar studies show consistent results.

          Like the first thing is separate from the second thing (excluding the false positives thing, cause I guess that’s p-value sorta stuff). If you have all the second things, then what does computing a p-value do? That’s my read on this at least.

        • Granted, I’m probably only slightly more knowledgeable on this stuff than your average end-table. I had to Google p-values to clarify my thinking. So take what I said with a grain of salt, but the comment on the Alexey post stood out to me.

        • Ben –

          > If you have all the second things, then what does computing a p-value do?

          For me, it can help to inform about probabilities in a broad sense. It’s important to remember, however, that it doesn’t prove or even *necessarily* test whether your actual hypothesis is true. It might, or it might not.

          That is why I think that such hypothesis testing should be accompanied by components such as a clear speculation about a plausible mechanism before causality is inferred. And then you work further to test that plausibility (necessarily with the help of others since you are the easiest person for you to fool).

          It must be taken with a grain of salt. And you should keep in mind the key caveat that an experimental paradigm is not (typically) the real world.

        • Posted too soon. Meant to add… but it’s information. I generally think the more information the better (even if not always).

        • Pre-specifying the decision rule has some appeal – it will nullify the emotional factors that might come into play later. On the other hand, some of these emotional factors are necessary. There are just too many variables (just how good is the data? how have the researchers responded to criticism? what alternative explanations have people offered after seeing the “evidence?”) to think that prespecification is better than weighing everything and then deciding. I think being explicit about what the decision is based on is about the best you can hope for. Requiring the decision rule to be specified in advance is just too confining, in my opinion.

  2. > even if remdesivir should be used as routine care, it’s not clear that all the studies should be halted

    I think that meant randomized studies. Observational studies, based on the fact that different people will be given different drugs, can still be done but I’ve read somewhere that learning from them is not so straightforward.

  3. the results were preliminary, unpublished, and unconfirmed by peer review

    Meanwhile the FBI is raiding people giving IV vitamin C and the FTC is sending threatening letters to people offering hyperbaric oxygen therapy.

    There is better evidence for both those treatments than for remdesivir, which became the standard of care based on claims about one p-hacked study that wasn’t even published. And the claims about that study were in conflict with studies that were published.

    This isn’t even close to being based on science anymore.

  4. Outcome switching can be a red flag when it is not justified and is done in a highly suspect way, but that really wasn’t the case here. A few days after the press release, the NIAID explained that the outcome switch was a decision by the trial’s statisticians, who were blinded, and this excerpt is from the paper for that trial, which was published last week:

    “The primary outcome was initially defined as the difference in clinical status, defined by the eight-category ordinal scale, among patients treated with remdesivir as compared with placebo at day 15. This initial primary outcome became the key secondary outcome after the change in primary outcome. The change was proposed on March 22, 2020, by trial statisticians who were unaware of treatment assignments and had no knowledge of outcome data. When this change was proposed, 72 patients had been enrolled and no interim data were available. The amendment was finalized on April 2, 2020, without any knowledge of outcome data from the trial and before any interim data were available. This change in primary outcome was made in response to evolving information, external to the trial, indicating that Covid-19 may have a more protracted course than previously appreciated.”

    https://www.nejm.org/doi/full/10.1056/NEJMoa2007764

    • That’s true (I didn’t mention that in my other comment, as it was more general and not about remdesivir). Still, changing your endpoints because external data suggest that the drug is not going to work as you expected and you want to save the day somehow doesn’t look too good :-)

    • The change was proposed on March 22, 2020, by trial statisticians who were unaware of treatment assignments and had no knowledge of outcome data.

      What does being blinded have to do with it? Did they assume the outcome would be changed in the null model used to calculate the p-value? If not, then the null model was rendered false by their actions.

      It doesn’t sound to me like they understand how statistical significance works. With sufficient data you will get a low p-value when at least one assumption that went into deriving the null model is false (ie, it predicts the wrong thing).

      Forcing one of the assumptions used to be false by design is p-hacking.

        • It was an adaptive trial, so a lot of stuff was subject to change as the data started rolling in. They were also using group sequential methods to penalize alpha for the primary outcome of interest, while the secondary outcomes were reported as is. See pages 36-40 of the trial protocol (https://www.nejm.org/doi/suppl/10.1056/NEJMoa2007764/suppl_file/nejmoa2007764_protocol.pdf)

        “9.4.6.1 Interim Safety Analyses: Interim safety analyses will occur at approximately 25%, 50%, and 75% of total enrollment. Safety analyses will evaluate serious AEs by treatment arm and test for differences using a Pocock spending function approach with a one-sided type I error rate of 0.025. This approach is less conservative than what will be used to test for early efficacy results because proving definitive harm of the experimental agents is not the focus of this study. Pocock stopping boundaries at the looks described correspond to z-scores of (2.37, 2.37, 2.36, & 2.35). This contrasts with the z-score stopping boundaries for the Lan-DeMets spending function that mimics O’Brien-Fleming boundaries: (4.33, 2.96, 2.36 & 2.01). The unblinded statistical team will prepare these reports for review by the DSMB.

        9.4.6.2 Interim Efficacy Review: The Lan-DeMets spending function analog of the O’Brien-Fleming boundaries will be used to monitor the primary endpoint as a guide for the DSMB for an overall two-sided type-I error rate of 0.05. Interim efficacy analyses will be conducted after the BEEC has selected the primary efficacy endpoint at approximately 50%, 75% and 100% of total information.”

        And this from the supplementary appendix of the published paper (https://www.nejm.org/doi/suppl/10.1056/NEJMoa2007764/suppl_file/nejmoa2007764_appendix.pdf):

        “Because the SAP did not include a provision for correcting for multiplicity when conducting tests for secondary outcomes, results are reported as point estimates and 95% confidence intervals. The widths of the confidence intervals have not been adjusted for multiplicity, so the intervals should not be used to infer definitive treatment effects for secondary outcomes. More details can be found in the statistical analysis plan. Analyses were conducted using SAS version 9.4 and R version 3.5.1.”

        We may or may not agree with the analysis approach, but my point here was that they were pretty transparent with everything.
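To make the quoted boundaries a little more concrete, here is a sketch of what an interim look does with them: the interim z-statistic for the monitored endpoint is compared against the boundary for that look. The boundary z-scores are the ones quoted from the protocol above; the interim z-statistic itself is invented for illustration.

```python
# Sketch of how group sequential stopping boundaries are used at interim looks.
# The boundary z-scores are the ones quoted from the protocol above; the interim
# z-statistic is invented for illustration.
pocock_safety_boundaries = [2.37, 2.37, 2.36, 2.35]   # Pocock spending function (safety looks)
obf_efficacy_boundaries = [4.33, 2.96, 2.36, 2.01]    # Lan-DeMets, O'Brien-Fleming-like (efficacy)

def crosses_boundary(z_interim, look, boundaries):
    """True if the interim z-statistic meets or exceeds the boundary at this look (1-indexed)."""
    return abs(z_interim) >= boundaries[look - 1]

z = 3.1  # hypothetical interim z-statistic at the second look
print("Crosses the O'Brien-Fleming-type efficacy boundary?",
      crosses_boundary(z, 2, obf_efficacy_boundaries))   # 3.1 >= 2.96 -> True
print("Would cross the Pocock safety boundary?",
      crosses_boundary(z, 2, pocock_safety_boundaries))  # 3.1 >= 2.37 -> True
```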

  5. I am just finishing up a book that I’d like to recommend. It’s “Malignant” by Vinayak Prasad. It deals with the process of drug approval and, more importantly, drug usage in oncology. It details multiple cases of bulls-eye painting leading to wide usage of marginally effective treatments at enormous costs. Remdesivir is going to get a tremendous boost by being the first one out of the starting blocks. Since drug makers call the shots with respect to clinical trials (even when taxpayers foot most of the bill), it is quite unlikely that we will see clinical trials comparing several antivirals. The book also gives some examples where doctors persisted with using a novel compound even when follow-up studies overturned the initial results. Further, there are lots of examples of really good results in early studies fading later on, when the meds are used by lots of people. Deviation toward the norm is the norm. For these reasons, I think we should hold remdesivir to very strict standards.
    The book is not too thick, and it can be read in two or three days. In the spirit of the book, let me disclose that I have no personal relationship with the author but have met people whom he holds in high esteem.

  6. Buried deep in the supplementary information, it said there was a largish gap between the treated and control groups in the number of patients who started the trial on invasive ventilation.

    My back-of-the-envelope calculation was that the odds ratio of being on a ventilator and being assigned to the treatment group was about 0.81, with a p-value of 0.12 (from memory – you will have to check the data yourself). But given that the odds ratio on survival was 0.7, it is a bit concerning that such an important correlate of mortality was so heavily weighted toward the placebo group (obviously not to the level of statistical significance), especially as such weight is being placed on this moderately sized study.

    I wonder if the randomisation process should be modified to permit such important mortality-correlated baseline variables to be more evenly distributed between the two groups?

    (Disclaimer: non-statistician, apologies if I have misused any technical terms)
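For anyone who wants to redo that back-of-the-envelope check, here is a sketch of the calculation. The 2x2 counts are placeholders rather than the actual numbers from the NEJM supplement; substitute the real baseline ventilation counts to reproduce (or correct) the figures above.

```python
# Back-of-the-envelope check of baseline imbalance in invasive ventilation by arm.
# The 2x2 counts are placeholders; substitute the real counts from the NEJM supplement.
from scipy.stats import fisher_exact

# Rows: remdesivir arm, placebo arm. Columns: on invasive ventilation, not on it.
table = [[120, 420],   # remdesivir (hypothetical counts)
         [155, 385]]   # placebo (hypothetical counts)

odds_ratio, p_value = fisher_exact(table)
print(f"Odds ratio: {odds_ratio:.2f}   Fisher exact p-value: {p_value:.3f}")
```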

  7. Simple question:

    Did they also show an effect on their original outcome variable?

    Maybe this was already covered in the 8 screens of material on this thread, but I didn’t see it.

    • Yes: “The primary outcome of the current trial was changed with protocol version 3 on April 2, 2020, from a comparison of the eight-category ordinal scale scores on day 15 to a comparison of time to recovery up to day 29. Little was known about the natural clinical course of Covid-19 when the trial was designed in February 2020. Emerging data suggested that Covid-19 had a more protracted course than was previously known, which aroused concern that a difference in outcome after day 15 would have been missed by a single assessment at day 15. […] The original primary outcome became the key secondary end point. In the end, findings for both primary and key secondary end points were significantly different between the remdesivir and placebo groups.”

      https://www.nejm.org/doi/full/10.1056/NEJMoa2007764

      • Thanks for the NEJM link. It is comforting to know that both the original and the revised primary endpoints achieved significance in the final data. I do find the change in primary endpoint more than a little unusual. The definition of recovery is itself based on the 8-point ordinal scale (the first time point at which a patient achieved 1, 2, or 3). But why choose that definition? Could it have been just 1 or 2, or could it have been 1, 2, 3, or 4? Generally, it isn’t a good idea to dichotomize an ordinal scale. What external data were they seeing that would have prompted this change? In hindsight, it probably would have been better to just let things play out without the change.

      • I’m not bothered too much by changing the scale from 7 to 8 ordinal categories. But having argued many, many times against dichotomizing a continuous or ordinal scale, I am at a loss as to what would persuade me to make such a change in a primary endpoint after the study started.
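A small illustration of the dichotomization worry raised in this thread: on the same hypothetical ordinal data, the apparent treatment effect shifts depending on where an 8-category scale is cut. All counts below are invented.

```python
# Illustration of how the apparent effect depends on where an ordinal scale is dichotomized.
# All counts are invented; the 8 categories loosely mimic an ordinal clinical status scale.
import numpy as np

categories = np.arange(1, 9)  # 1 = best outcome ... 8 = worst outcome (ordering is hypothetical)
treated = np.array([30, 40, 50, 30, 20, 15, 10, 5])  # hypothetical patients per category
placebo = np.array([20, 30, 45, 35, 30, 20, 12, 8])

def risk_above(counts, cutpoint):
    """Proportion of patients at or above the cutpoint ('bad' outcome under that split)."""
    return counts[categories >= cutpoint].sum() / counts.sum()

for cut in (4, 5, 6):
    rr = risk_above(treated, cut) / risk_above(placebo, cut)
    print(f"Dichotomizing at >= {cut}: risk ratio (treated/placebo) = {rr:.2f}")
```

The different cutpoints give noticeably different risk ratios here, which is one reason an analysis that uses the full ordinal scale is usually preferred.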

  8. As Tyler Cowen says, “Our regulatory state is failing us.” It will remain for some enterprising PhD student 20 years from now to figure out why the bureaucracy and media backed a patented drug (remdesivir) and pooh-poohed an unpatented one (hydroxychloroquine). (I hasten to add that maybe neither works or maybe they both do, but then I’m not sitting 20 years in the future.)

    There is an unfortunate answer to your actual question, Andrew. We have created this enormous superstructure of statistical procedure for when we are operating outside of the hot media spotlight. (And I’m going to read Malignant, linked above, because I suspect that that superstructure is pretty rickety even then.) But when there is a crisis, that all goes out the window, and what you’re left with is cherry picking, since nobody is given access to the data to make their own decisions. Statistical calculations, when done, are the final bludgeon to suppress dissent. The careful retrospective analyses won’t matter two years from now. Your “statistics as search for truth” is quaint. It’s really a search for things that will shut people up. You are Bill James arguing against the sportswriters. The fact that you’re right counts for little, for now.

    • Your “statistics as search for truth” is quaint. It’s really a search for things that will shut people up.

      This is a tragedy of statistics in medicine. Besides those of us who love statistics, it seems like few people believe statistics is a search for truth. When around us, others lightly pretend to view statistics as a search for truth, because they understand we are true believers, but they are only looking for support from statistics, not truth.

  9. This is not a comment about remdesivir.

    I’ve read the NEJM article. The study looks good. But it doesn’t show much. The media (and the stock market) are acting as though it does. A finding of 4 days less hospitalization would be really interesting if the hospital stay were about 4 days, not 15 days. I think people are adding in the mortality data, but that part of the study is far more problematic.

    So my comment is that people are, it seems, desperate for good news. I noted that when this drug’s early report came out, the media started talking about ‘powerful’ treatments using the multi-drug cocktail model developed with HIV. They were of course leaping over the development of AZT, as if this drug were that, which it is not.

    • Well, a reduction of 15 day average stays to 11 day average stays could be huge, if one of the major drivers of the problem was overwhelming the hospital system.

      That doesn’t – as of right now – seem to be the case in the US; only a few places (maybe just NY; I haven’t seen good data for Detroit, New Jersey, etc.) even got close, and even they weren’t turning people away.

      But it may be the case in other places, or in the future. Italy definitely had issues with it.

      I’m pretty optimistic about the US – I doubt we’ll see another local outbreak as bad as NYC in the last week of March / first half of April, because NYC is very much a US outlier, and is more like European cities in some ways – density and mass transit use/less car-dependence. (There are news reports of the local ICU system in Montgomery, AL being full – but I’m not sure how much that means; another news article claimed that sending ICU patients over to Birmingham isn’t unusual in non-pandemic situations.)

      I wouldn’t rule out the mortality effect either. Sure, it’s not technically significant at p < .05, but it's close, and if it reduces hospital stays (and therefore has a real effect on the disease course) I would think that means there's significant prior plausibility for a mortality effect. Though this isn't really my field. (environmental, not infectious disease…)
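On the hospital-capacity point earlier in this comment, a rough steady-state calculation (occupied beds roughly equal admissions per day times average length of stay) shows why four fewer days on a 15-day average stay could matter. The admission rate below is made up; only the 15-day and 11-day figures come from the post.

```python
# Rough capacity arithmetic: steady-state occupied beds ~ admissions/day * average length of stay.
# The admission rate is made up; the 15 vs 11 day stays are the figures quoted in the post.
admissions_per_day = 100  # hypothetical
beds_before = admissions_per_day * 15
beds_after = admissions_per_day * 11

print(f"Beds occupied at steady state: {beds_before} -> {beds_after}")
print(f"Relative reduction: {(beds_before - beds_after) / beds_before:.0%}")
```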
