Some recommendations for design and analysis of clinical trials, with application to coronavirus

Various people have been contacting me lately about recommendations for design and analysis of clinical trials, with application to coronavirus. Below are some quick thoughts, or you can scroll down to the Summary Recommendations at the end. I’m sure there’s lots more to say on this topic but I’ll get my quick thoughts down here.

Goals of this post

I’ll first talk about a recently-published study that someone pointed me to. But the main aim of this post is not to discuss this particular study, but rather to consider general issues regarding sample size, uncertainty, decision making, multilevel modeling, and policy. These are issues that come up in clinical trials all the time, and which we typically dodge in various ways. The current crisis provides us with some extra motivation to try to do things right.

Basic analysis of one experiment

Jonathan Falk points us to this WSJ op-ed by Jeff Colyer and Daniel Hinthorn that reports:

But researchers in France treated a small number of patients with both hydroxychloroquine and a Z-Pak, and 100% of them were cured by day six of treatment. Compare that with 57.1% of patients treated with hydroxychloroquine alone, and 12.5% of patients who received neither. What’s more, most patients cleared the virus in three to six days rather than the 20 days observed in China.

Hmmm, 57.1% . . . that reminds me of that famous number 142857. 57.1% is exactly 4/7!
And 12.5%, of course that’s 1 out of 8.

Falk writes:

First off, for 57.1% of patients treated by hydroxychloroquine alone to be helped, the numbers given hydroxychloroquine alone must be a multiple of 7, and those getting neither must be a multiple of 8. With 100% success for both, we don’t know how many were given both. But let’s say that the trial had three arms of 8 (with one of the hydroxychloroquine-alone arm patients having dropped out for some reason).

That seems like a reasonable guess, so let’s go with it. In evaluating the new treatment (hydroxychloroquine and a Z-Pak), it seems that the relevant comparison is to hydroxychloroquine alone. The comparison is simple enough to do. In R:

library("rstanarm")
library("arm")
x <- rep(c(0, 1), c(7, 8))              #  0 if hydroxychloroquine alone, 1 if hydroxychloroquine and Z-Pak
y <- rep(c(1, 0, 1, 0), c(4, 3, 8, 0))  #  0 if not cured, 1 if cured
drugs <- data.frame(x, y)
fit <- stan_glm(y ~ x, family=binomial(link="logit"), data=drugs, refresh=0)
print(fit)

Here's what we get:

            Median MAD_SD
(Intercept) 0.7    0.8   
x           2.7    1.5   

As you might expect, you get a big fat uncertainty interval, with the data roughly consistent with no effect on one extreme, or an effect of 5 on the other. Not too many treatments have an effect of 5 on the logit scale (that will take you from 50% to 99% on the probability scale), so I wouldn't take that high end too seriously.
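As a quick sanity check on that logit-to-probability claim, here's the inverse-logit arithmetic (in Python, just for illustration):

```python
# Check the claim that an effect of 5 on the logit scale takes you
# from 50% to 99% on the probability scale.
from math import exp

def inv_logit(x):
    """Inverse logit: maps the log-odds scale to the probability scale."""
    return 1 / (1 + exp(-x))

print(inv_logit(0))      # baseline: 0.5
print(inv_logit(0 + 5))  # after an effect of 5: ~0.993
```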

You can't do the analysis using regular glm as there's complete separation. If you want to use classical, non-Bayesian, inference, you can do the Agresti and Coull approach and add 2 successes and 2 failures to each group, thus comparing 6/11 to 10/12:

y <- c(4, 8)   #  number of successes in each group
n <- c(7, 8)   #  number of trials in each group
y_star <- y + 2   # Agresti-Coull adjustment
n_star <- n + 4   # Agresti-Coull adjustment
p_star <- y_star/n_star
diff <- p_star[2] - p_star[1]
se_diff <- sqrt(sum(p_star*(1-p_star)/n_star))
print(c(diff, se_diff), digits=2)

The result:

0.29 0.18

OK, this is on the probability scale: the estimated increase in cure rate could be anywhere from slightly negative to huge.

Yet another possible analysis is to do a hypothesis test, but I'm not particularly interested in a hypothesis test because what I really want to know is the increase in cure rate.

Alternatively we could pool the two active treatments, then we have 12/15 cures in the treated group, compared to 1/8 in the control group:

x <- rep(c(0, 1), c(8, 15))              #  0 if no drug, 1 if hydroxychloroquine with or without Z-Pak
y <- rep(c(1, 0, 1, 0), c(1, 7, 12, 3))  #  0 if not cured, 1 if cured
drugs_vs_nothing <- data.frame(x, y)
fit2 <- stan_glm(y ~ x, family=binomial(link="logit"), data=drugs_vs_nothing, refresh=0)
print(fit2)

Here's what we get:

            Median MAD_SD
(Intercept) -1.7    0.9  
x            3.0    1.0  

Or the Agresti-Coull:

y <- c(1, 12)   #  number of successes in each group
n <- c(8, 15)   #  number of trials in each group
y_star <- y + 2   # Agresti-Coull adjustment
n_star <- n + 4   # Agresti-Coull adjustment
p_star <- y_star/n_star
diff <- p_star[2] - p_star[1]
se_diff <- sqrt(sum(p_star*(1-p_star)/n_star))
print(c(diff, se_diff), digits=2)

The result:

0.49 0.16

Still a lot of uncertainty but it seems like a clear improvement. Or maybe it's some other aspect of the treatment; I don't know if the study was blinded. (No, it wasn't even randomized! See P.S. at the end of this post.)

In any case, no matter what analysis is done, the obvious recommendation here is to test the treatment on more than 24 people!

Doing more with the data

You're potentially throwing away a lot of information by summarizing each person's result by a binary, cured-or-not-cured-after-6-days variable:

- "Cured" is measured by some biomarkers? You could have a continuous measure, no?

- Why just 6 days? What's happening after 2 days, 4 days, 8 days, 10 days, etc?

Getting more granular data will also help resolve the difficulty of that 8-out-of-8 thing, moving the data off the boundary.

Analyzing many experiments

There's not just one therapy being tested! Lots of things are being tried. I think it makes sense to embed all these studies in a hierarchical model with treatment-level predictors, partial pooling, the whole deal.
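To make the partial-pooling idea concrete, here's a toy sketch (in Python, with invented numbers) of how study-level estimates get shrunk toward a common mean. A real hierarchical analysis would estimate the between-study variation rather than fixing it, but the precision-weighted shrinkage is the core of the idea:

```python
# Toy partial pooling: given log-odds-ratio estimates and standard
# errors from several hypothetical trials, shrink each one toward a
# precision-weighted common mean. All numbers are made up.
est = [2.7, 1.1, 0.4, 1.8]   # hypothetical study-level log-odds-ratios
se  = [1.5, 0.8, 0.6, 1.0]   # their standard errors
tau = 0.5                    # assumed between-study sd (fixed here for simplicity)

# precision-weighted estimate of the common mean
w = [1 / (s**2 + tau**2) for s in se]
mu = sum(wi * ei for wi, ei in zip(w, est)) / sum(w)

# each study's partially pooled estimate: a precision-weighted
# compromise between its own estimate and the common mean
pooled = [(e / s**2 + mu / tau**2) / (1 / s**2 + 1 / tau**2)
          for e, s in zip(est, se)]
```

Noisy studies (large se) get pulled strongly toward the common mean; precise ones barely move. Treatment-level predictors would enter by replacing the single mu with a regression.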

Let's get real here. We're not trying to get a paper published, we're trying to save lives and save time.

Stepping back

When considering design for a clinical trial I'd recommend assigning cost and benefits and balancing the following:

- Benefit (or cost) of possible reduced (or increased) mortality and morbidity from COVID in the trial itself.
- Cost of toxicity or side effects in the trial itself.
- Public health benefits of learning that the therapy works, as soon as possible.
- Economic / public confidence benefits of learning that the therapy works, as soon as possible.
- Benefits of learning that the therapy doesn't work, as soon as possible, if it really doesn't work.
- Scientific insights gained from intermediate measurements or secondary data analysis.
- $ cost of the study itself, as well as opportunity cost if it reduces your effort to test something else.
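As a sketch of what "assigning costs and benefits" might look like, here is a toy expected-value comparison (every number is an invented placeholder; the point is the structure, not the values):

```python
# Toy decision analysis: adopt the therapy now, or run a trial first?
# All numbers are invented placeholders in arbitrary units.
p_works = 0.3             # prior probability the therapy works
benefit_if_works = 1000   # benefit of adoption if it works
cost_if_not = 200         # side effects + misdirected resources if it doesn't
trial_cost = 50           # cost of running the trial

# Adopt now: gamble on the prior.
ev_adopt_now = p_works * benefit_if_works - (1 - p_works) * cost_if_not

# Trial first (idealized: the trial fully resolves the uncertainty,
# and we adopt only if the therapy turns out to work).
ev_trial_first = p_works * benefit_if_works - trial_cost

print(ev_adopt_now, ev_trial_first)
```

In this toy setup the trial is worth running whenever its cost is less than the expected loss from adopting an ineffective therapy; with real numbers you'd also charge the trial arm for the delay it imposes during an epidemic.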

This may look like a mess---but if you're not addressing these issues explicitly, you're addressing them implicitly. The problem's important, so you want your sample size to be as large as possible. So first test everyone in the U.S., then take all the people with coronavirus and divide them into 2 groups, etc. OK, we can't do this because we can't test everyone, so we don't have infinite resources . . . Also, maybe don't do it on 100,000 people because maybe the regimen has some side effects . . . etc.

And, as always, I don't think "statistically significant" should be the goal. Suppose that the treatment increases recovery rate from, say, 80% to 85%. That's pretty good. But you'd like to know who those other 15% are. Maybe the treatment helps among some groups and not others, etc.

That said, if the goal is a quick statistics answer, then, sure you can do some simulations, for example if the recovery rate after 3 days is X without the therapy and Y with the therapy, and you do a study with N people in each group, etc etc.
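That kind of simulation might look like the following sketch (in Python; the recovery rates, sample size, and decision rule are all invented for illustration):

```python
# Fake-data simulation for design: recovery rate after 3 days is 0.5
# without therapy and 0.65 with it (invented numbers), n patients per
# arm. How often does the trial show a clear difference?
import random

def trial_detects(n, p_control=0.5, p_treat=0.65):
    """Simulate one two-arm trial with n patients per arm; report
    whether the observed difference exceeds twice its standard error
    (a crude 'clear difference' criterion)."""
    c = sum(random.random() < p_control for _ in range(n))
    t = sum(random.random() < p_treat for _ in range(n))
    pc, pt = c / n, t / n
    se = (pc * (1 - pc) / n + pt * (1 - pt) / n) ** 0.5
    return (pt - pc) > 2 * se

random.seed(1)
n_sims = 2000
power = sum(trial_detects(200) for _ in range(n_sims)) / n_sims
print(power)  # proportion of simulated trials showing a clear difference
```

Rerunning this over a grid of n and assumed effect sizes tells you what the design can and can't distinguish before a single patient is enrolled.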

Looking forward

Whatever therapies are being tried should be monitored. Doctors should have some freedom to experiment, and they should be recording what happens. To put it another way, they're trying different therapies anyway, so let's try to get something useful out of all that.

It's also not just about "what works" or "does a particular drug work," but how to do it. For example, Colyer and Hinthorn write:

On March 9 a team of researchers in China published results showing hydroxychloroquine was effective against the 2019 coronavirus in a test tube. The authors suggested a five-day, 12-pill treatment for Covid-19: two 200-milligram tablets twice a day on the first day followed by one tablet twice a day for four more days.

You want to get something like optimal dosing, which could depend on individuals. But you're not gonna get good discrimination on this from a standard clinical trial or set of clinical trials. So we have to go beyond the learning-from-clinical-trial paradigm, designing large studies that mix experiment and observation to get insight into dosing etc.

Also, lots of the relevant decisions will be made at the system level, not the individual level. For example, Colyer and Hinthorn write:

Emergency rooms run the risk of one patient exposing a dozen nurses and doctors. Instead of exposed health workers getting placed on 14-day quarantine, they could receive hydroxychloroquine for five days, then test for the virus. That would allow health-care workers to return to work sooner if they test negative.

These sorts of issues are super important and go beyond the standard clinical-trial paradigm.

Summary recommendations

- Bayesian inference for treatment effect, not hypothesis test.
- Include more information from each patient, not just cured or not.
- Design and analyze multiple studies together using multilevel model.
- Use fake-data simulation when designing a study.
- Formal decision analysis using numbers for costs and benefits.
- Relevant decisions and outcomes are at the system level, not just the individual level.
- Continue gathering data after the treatment is released into the wild.
- Analyze clinical trial and subsequent data to get recommendations for dosing, drug combinations, etc., beyond simple yes/no on a single treatment plan.

P.S. A commenter points to this review of the above-linked study. The review, by Darren Dahly, Simon Gates, and Tim Morris, reports that the study was non-randomized and there was no adjustment for pre-treatment differences between the groups, and that several patients were dropped from the study. So, many potential sources of bias. Like me, Dahly et al. criticized the decision to summarize each person's outcome by a binary cured-or-not at day 6: "is not clear (nor justified by the authors) that the outcome on day-6 is the best measure to conclude that a negative result indicates “virologically cured”, especially in light of the observation that two patients who were positive on day-6 but negative by day-9, and another that was negative on day-6 but positive on day-8." So, yeah, lots of problems.

45 thoughts on “Some recommendations for design and analysis of clinical trials, with application to coronavirus”

    • Thanks, Simon. It’s always good to get the actual details rather than the Op-Ed details. But I think you’re a little too harsh on this ‘study.’ I agree that the evidence on the combo is really weak… only 6 subjects got the combo. But 14 got hydroxychloroquine, not 7. And given that the controls were all healthier people (we think!) that makes it even better.

      On the down side, the dropouts from the study complicate matters immensely, and Andrew’s basic thought — let’s try this a little more systematically, is obviously right. And the endpoints are… sure, ad hoc. Ignoring the Z-Pack for the moment, this really does look promising, right?

      And it’s just data. Call it that instead of a ‘study’ and it’s ‘good data’ rather than a ‘terrible study.’

      • Maybe, but… no randomisation, inappropriate control group, no statistical adjustment for confounders, patients omitted without justification. It’s not good.

        I’m intolerant because whether it’s through ignorance, incompetence, malice or anything else, the result is the same; once it is published people will believe it and start basing decisions on it. That can cost lives. So I do think people doing medical research have some obligations – the costs of doing it badly can be severe.

        • no statistical adjustment for confounders

          I didn’t think it was a particularly convincing study, but thank god for that though. That statistical adjustment renders all the numbers meaningless.

        • Statistical “adjustment” is just arbitrarily changing the meaning of the number so that the numerical value changes too.

          It doesn’t actually fix any problems with confounding, etc. I’d much rather just know the raw value than one adjusted for whatever other data was available.

        • Anon:

          I think it’s best to show the raw-data comparison, and then show the pre-treatment differences between treatment and control units, and then also show the adjusted difference.

        • I think it’s best to show the raw-data comparison, and then show the pre-treatment differences between treatment and control units, and then also show the adjusted difference.

          The pre-treatment comparison is definitely useful, but the adjusted stats just don’t add anything for me. I’ve started ignoring it just like I ignore all p-values.

          I mean stuff like instead of adjusting for age, I’d rather see a plot of effect by age. No person exists who is an average of all ages, so what is the possible point of that?

        • Anon:

          If you have a small-N study, you won’t be able to get good estimates by subgroups, but you can still do some adjustment for differences between treatment and control groups.

        • Frankly, Andrew, it’s not worth responding to Anoneuoid. He does not believe in confounding variables.

          No I do, I just don’t believe you can mathematically make them go away.

          Honestly I think he should be blocked because of the amount of misinformation on coronavirus he constantly spews out.

          The moment you dropped to zero credibility was when you cited a study of 50% children to make claims about smoking and the flu… You are just mad about that public display of epic incompetence.

          Meanwhile, smokers are still missing in the new papers coming out. Vitamin C is in the official guidelines for Shanghai and now being used in NYC, the number of cases in the US exploded exactly when they started rolling out testing, the estimates for the number of asymptomatic/mild (hidden) cases is growing. Every single thing I’ve said about this pandemic is correct.

        • Anon:

          If you have a small-N study, you won’t be able to get good estimates by subgroups, but you can still do some adjustment for differences between treatment and control groups.

          But what does that adjusted value tell you?

        • And I’ll give another prediction too. No version of Koch’s postulates has really been fulfilled for either SARS or nCov-19. Ie, afaik no animals have had severe enough illness to kill them, so the claims that the virus was the (sole) causative agent of severe illness were premature.

          It is going to come out eventually that coinfection with something else (maybe a wide variety of other viruses/bacteria) is very important to the course of the illness.

          Some research has been done on this:

          Respiratory agents, such as human metapneumovirus or chlamydia, have been isolated from SARS patients (1, 16), and were initially suspected to be the causative agents of SARS. However, SARS-CoV was finally identified as the agent of SARS, since it fulfilled Koch’s postulate (7). Nevertheless, when animals were infected with SARS-CoV alone, most failed to develop SARS-like severe pneumonia (12). These results may imply that the respiratory agents found in some SARS cases could work in combination with SARS-CoV in order to induce a severe form of pneumonia.

          […]

          The mortality rate in SARS victims is reported to be approximately 10%, and aged people suffering from chronic heart or renal diseases or diabetes have been shown to be extremely prone to this infection (38). Such individuals are supposed to be susceptible to a variety of infectious agents that normally fail to affect ordinary, healthy individuals. It is possible that exacerbation of SARS in these people could be attributed to co-infection by non- or low-pathogenic agents, such as mycoplasma, chlamydia and the like, although these agents were not often isolated from the patients’ lungs (1, 35). However, this does not imply that these agents did not intensify the effects of SARS, since there is a possibility that they may induce a mild inflammation that triggers SARS-CoV’s high replication, while failing to themselves grow in the lungs. This may be inferred from the finding in the present study that Pp infection exacerbated SARS-CoV infection, but it did not multiply efficiently in the lungs of mice.

          https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1348-0421.2008.00011.x

          Then again, I haven’t seen anyone report what happens when you expose aged animals to the virus either.

        • I did see a report that in mice aged animals are ravaged and young animals are not. If you’re interested in chasing that down, it should be something that you can find either googling through the news for articles on age related patterns, or maybe PUBMED. I think it was the original SARS virus that was studied.

        • I did see a report that in mice aged animals are ravaged and young animals are not. If you’re interested in chasing that down, it should be something that you can find either googling through the news for articles on age related patterns, or maybe PUBMED. I think it was the original SARS virus that was studied.

          I found this:

          We administered 10^5 50% tissue culture infective doses (TCID50) of SARS-CoV (Urbani isolate [13]) intranasally to 12- to 14-month-old BALB/c mice as previously described (29). SARS-CoV-infected aged mice demonstrated signs of clinical illness characterized by significant weight loss, hunching, ruffled fur, and slight dehydration measured by skin turgor. Weight loss began 3 days postinfection (p.i.), with a nadir of 8% loss on day 4 p.i., and was noted through day 6 (P < 0.04) (Fig. 1A). Clinical signs of illness resolved by day 7 p.i., and inactivity, changes in gait, and mortality were not observed.

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1082763/

          I guess “ravaged” is relative. The aged mice lost a bit of water weight. There is a “mouse adapted strain” that was selected to kill mice though:

          SARS-CoV replicates in the lungs of young mice, but they do not show signs of illness. Adaptation of SARS-CoV by serial passage in the lungs of mice resulted in a virus (MA15) that is lethal for young mice following intranasal inoculation. Lethality is preceded by rapid and high titer viral replication in lungs, viremia, and dissemination of virus to extrapulmonary sites accompanied by hematological changes and pathological changes in the lungs. Mice infected with MA15 virus die from an overwhelming viral infection with extensive, virally mediated destruction of pneumocytes, and ciliated epithelial cells. The MA15 virus has six coding mutations in its genome, which, when introduced into a recombinant SARS-CoV, confer lethality.

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1769406/

        • Was that person being denied her medication a necessary consequence of that study? Was the study a necessary cause for that person being denied her medication?

          That study is not the only reason why people consider using chloroquine for covid-19. It was already included in the treatment guidelines from several countries before that trial was even started. For example, the “Dutch CDC” being praised in another blog entry suggested using it to treat severe infections.

          The study may be poor (it says “open-label non-randomized” right there in the title, to be fair). We don’t have solid evidence from clinical trials for the efficacy of anything. Not offering any treatment options would also be a bad thing.

  1. > That seems like a reasonable guess, so let’s go with it.

    There is no need for guessing: “A total of 26 patients received hydroxychloroquine and 16 were control patients. Six hydroxychloroquine-treated patients were lost in follow-up during the survey because of early cessation of treatment. (…) The results presented here are therefore those of 36 patients (20 hydroxychloroquine-treated patients and 16 control patients). None of the control patients was lost in follow-up. (..) Among hydroxychloroquine-treated patients six patients received azithromycin (500mg on day1 followed by 250mg per day, the next four days) to prevent bacterial super-infection under daily electrocardiogram control.“

    > In evaluating the new treatment (hydroxychloroquine and a Z-Pak), it seems that the relevant comparison is to hydroxychloroquine alone.

    That’s not what the study is about: “Chloroquine and hydroxychloroquine have been found to be efficient on SARS-CoV-2, and reported to be efficient in Chinese COV-19 patients. We evaluate the role of hydroxychloroquine on respiratory viral loads.”

    https://www.mediterranee-infection.com/wp-content/uploads/2020/03/Hydroxychloroquine_final_DOI_IJAA.pdf

    • Thank you very much for the link to the paper. Based upon it I am editing below the WSJ quote, which itself is almost a direct quote from the paper, using integers rather than percentages or verbal descriptions. I am enclosing my edits with brackets. I think the edit is an improvement. It sounds less impressive.

      “But researchers in France treated {6} patients with both hydroxychloroquine and a Z-Pak, and {all 6} were cured by day six of treatment. Compare that with {8 out of 14} patients treated with hydroxychloroquine alone, and {2 out of 12} patients who received neither. What’s more, {14 out of 20} patients {treated with hydroxychloroquine} cleared the virus in three to six days rather than the 20 days observed in China.”

  2. My fear is that some compound will show a bit of activity, and then this will be hailed and promoted widely as a “cure” resulting in focusing our efforts on something of very modest activity to the detriment of other candidates. Oseltamivir improves influenza outcomes, but the difference is not dramatic. https://www.ncbi.nlm.nih.gov/pubmed/31839279
    Our current system is likely to bet everything on the first thing that comes along. Skeptics could get the fate of Darwinists in Lysenko land.
    The initial trial of azidothymidine in AIDS was actually quite impressive. https://www.ncbi.nlm.nih.gov/pubmed/3299089. Further improvements came as a result of the work of uptight lab-coat wearers who followed all the rules and got mocked in popular culture.
    Clear-eyed upfront design with clearly defined clinical endpoints are needed. Randomize the first patient. It is unlikely that any compound will be a homerun. Remain skeptical. Don’t trust cheery reports from investigators; they are emotionally invested. Don’t trust a politician’s opinion on a medicine; too many reasons to outline on this one.

  3. “- Bayesian inference for treatment effect, not hypothesis test.”

    Choosing to do a hypothesis test doesn’t mean one wouldn’t also choose to estimate an effect. Typically you’d do both.

    Justin

  4. Instead of arguing about the merits of using any drug or drug combination, or the flaws of any particular study, I wish the ASA or some organization would focus more on developing a very basic online data capture system that would record patient demographics, severity of illness, drugs used, and response over time and work with other organizations to implement the data capture. We need both randomized trials and real world data to make decisions in the future.
