
In Bayesian priors, why do we use soft rather than hard constraints?

Luiz Max Carvalho has a question about the prior distributions for hyperparameters in our paper, Bayesian analysis of tests with unknown specificity and sensitivity:

My reply:

1. We recommend soft rather than hard constraints when we have soft rather than hard knowledge. In this case, we don’t absolutely know that spec and sens are greater than 50%. There could be tests that are worse than that. Conversely, to the extent that we believe spec and sens to be greater than 50% we don’t think they’re 51% either.

2. I typically use normal rather than beta because normal is easier to work with, and it plays well with hierarchical models.
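To make the soft-versus-hard distinction concrete, here is a minimal Python sketch. The prior mean and scale on the logit scale (2.0 and 1.0) are hypothetical values chosen for illustration, not the ones used in the paper, and the helper names are mine.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def inv_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical soft prior: logit(specificity) ~ normal(MU, SIGMA).
MU, SIGMA = 2.0, 1.0

# Specificity below 50% corresponds to logit(specificity) < 0.
p_below_half = normal_cdf(0.0, MU, SIGMA)

# Central 95% prior interval for specificity, mapped back from the logit scale.
lo = inv_logit(MU - 1.96 * SIGMA)
hi = inv_logit(MU + 1.96 * SIGMA)

print(f"prior median specificity: {inv_logit(MU):.2f}")
print(f"prior P(specificity < 0.5): {p_below_half:.3f}")
print(f"central 95% prior interval: ({lo:.2f}, {hi:.2f})")
```

A hard constraint such as uniform(0.5, 1) would set that probability to exactly zero, so no amount of data could pull the estimate below 50%. The soft prior makes such values unlikely but not impossible, and because it is a normal on the logit scale it drops straight into a hierarchical model.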

“The Moral Economy of Science”

In our discussion of Lorraine Daston’s “How Probabilities Came to Be Objective and Subjective,” commenter John Richters points to Daston’s 1995 article, “The Moral Economy of Science,” which is super-interesting and also something I’d never heard of before. I should really read the whole damn article and comment on everything in it, but for now I’ll just issue this pointer and let you do some of the work of reading and commenting.

An open letter expressing concerns regarding the statistical analysis and data integrity of a recently published and publicized paper

James Watson prepared this open letter to **, **, **, and **, authors of ** and to ** (editor of **). The letter has approximately 96,032 signatures from approximately 6 continents. And I heard a rumor that they have contacts at the Antarctic Polar Station who are going to sign the thing once they can get their damn fur gloves off.

I like the letter. This kind of thing should be a generic letter that applies to all research papers!

I’ve obscured the details of the letter here because I don’t want to single out the authors of this particular paper or the editor of this particular journal.

If the paper really does have all the problems that some people are concerned about, then maybe the journal in question will follow the “Wakefield rule” and retract in 2032. You thought journal review was slow? Retraction’s even slower!

A journalist who’s writing a story about this controversy asked me what I thought, and I said I didn’t know. The authors have no obligation to share their data or code, and I have no obligation to believe anything they say. Similarly, the journal has no obligation to try to get the authors to respond in a serious way to criticisms and concerns, and I have no obligation to take seriously the papers they publish. This doesn’t mean all or even most of the papers they publish are bad; it just means that we need to judge them on their merits.

Blast from the past

Lizzie told me about this paper, “Bidirectionality, Mediation, and Moderation of Metaphorical Effects: The Embodiment of Social Suspicion and Fishy Smells,” which reports:

As expected (see Figure 1), participants who were exposed to incidental fishy smells invested less money (M = $2.53, SD = $0.93) than those who were exposed to odorless water (M = $3.34, SD = $1.02), planned contrast t(42) = 2.07, p = .05, Cohen’s d = 0.83, or fart spray (M = $3.38, SD = $1.23), t(42) = 2.22, p = .03, d = 0.78.

Fart spray!

The paper faithfully follows Swann’s 18 rules for success in social priming research.

I was surprised to see that people were still doing this sort of thing . . . but then I looked at the paper more carefully. Journal of Personality and Social Psychology, 2012. Just one year after they’d published that ESP paper. To criticize a psychology journal for publishing this sort of thing in 2012 would be like mocking someone for sporting a mullet in the 1980s.

Of course, just cos a paper is on a funny topic and just cos it follows the cargo-cult-science template, it doesn’t mean that it is wrong. I guess I’ll believe it when I see a preregistered replication, not before. In the meantime, just recall that experimental results can be statistically significant and look super-clean but still not replicate. The garden of forking paths is not just a slogan, it’s a real thing that can easily lead researchers to fool themselves; hence the need to be careful.

P.S. All that said, it’s still not as bad as “Low glucose relates to greater aggression in married couples”: that’s the study where they had people blasting their spouses with loud noises and sticking pins into voodoo dolls.

These issues also arise with published research on more important topics.

This is not a post about remdesivir.

Someone pointed me to this post by a doctor named Daniel Hopkins on a site called KevinMD, expressing skepticism about a new study of remdesivir. I guess some work has been done following up on that trial on 18 monkeys. From the KevinMD post:

On April 29th Anthony Fauci announced the National Institute of Allergy and Infectious Diseases, an institute he runs, had completed a study of the antiviral remdesivir for COVID-19. The drug reduced time to recovery from 15 to 11 days, he said, a breakthrough proving “a drug can block this virus.” . . .

While the results were preliminary, unpublished, and unconfirmed by peer review, Fauci felt an obligation, he said, to announce them immediately. Indeed, he explained, remdesivir trials “now have a new standard,” a call for researchers everywhere to consider halting any studies, and simply use the drug as routine care.

Hopkins has some specific criticisms of how the results of the study were reported:

Let us focus on something Fauci stressed: “The primary endpoint was the time to recovery.” . . . Unfortunately, the trial registry information, data which must be entered before and during the trial’s actual execution, shows Fauci’s briefing was more than just misleading. On April 16th, just days before halting the trial, the researchers changed their listed primary outcome. This is a red flag in research. . . . In other words they shot an arrow and then, after it landed, painted their bullseye. . . .

OK, this might be a fair description, or maybe not. You can click through and follow the links and judge for yourself.

Here I want to talk about two concerns that came up in this discussion which arise more generally when considering this sort of wide-open problem where many possible treatments are being considered.

I think these issues are important in many settings, so I’d like to talk about them without thinking too much about remdesivir or that particular study or the criticisms on that website. The criticisms could all be valid, or they could all be misguided, and it would not really affect the points I will make below.

Here are the 2 issues:

1. How to report and analyze data with multiple outcomes.

2. How to make decisions about when to stop a trial and use a drug as routine care.

1. In the above-linked post, Hopkins writes:

This choice [of primary endpoint], made in the planning stages, was the project’s defining step—the trial’s entry criteria, size, data collection, and dozens of other elements, were tailored to it. This is the nature of primary outcomes: they are pivotal, studies are built around them. . . .

Choosing any primary outcome means potentially missing other effects. Research is hard. You set a goal and design your trial to reach for it. This is the beating heart of the scientific method. You can’t move the goalposts. That’s not science.

I disagree. Yes, setting a goal and designing your trial to reach for it is one way to do science, but it’s not the only way. It’s not “the beating heart of the scientific method.” Science is not a game. It’s not about “goalposts”; it’s about learning how the world works.

2. Lots is going on with coronavirus, and doctors will be trying all sorts of different treatments in different situations. If there are treatments that people will be trying anyway, I don’t see why they shouldn’t be used as part of experimental protocols. My point is that, based on the evidence available, even if remdesivir should be used as routine care, it’s not clear that all the studies should be halted. More needs to be learned, and any study is just a formalization of the general idea that different people will be given different treatments.

Again, this is not a post about remdesivir. I’m talking about more general issues of experimentation and learning from data.

Age-period-cohort analysis.

Chris Winship and Ethan Fosse write with a challenge:

Since its beginnings nearly a century ago, Age-Period-Cohort analysis has been stymied by the lack of identification of parameter estimates resulting from the linear dependence between age, period, and cohort (age = period – cohort). In a series of articles, we [Winship and Fosse] have developed a set of methods that allow APC analysis to move forward despite the identification problem. We believe that our work provides a solid methodological foundation for APC analysis, one that has not existed previously. By a solid methodological foundation, we mean a set of methods that can produce substantively important results where the assumptions involved are both explicit and likely to be plausible.

After nearly a century of effort this is a big claim. How might we test it? In mathematics, if someone claims to have proved a theorem, the proof is not considered valid until others have rigorously analyzed it. Our request and hope is that researchers will interrogate our claim with similar rigor. Have we in fact succeeded after so many years of efforts by others?

Full Challenge Document

APC-R Software Download

My own articles on age-period-cohort analysis are here, here, and here. The first of these was an invited discussion for the American Journal of Sociology that they decided not to publish; the second (with Jonathan Auerbach) is our summary of what went wrong with that notorious claim a few years ago about the increasing death rate of middle-aged white Americans, and the third (with Yair Ghitza and Jonathan Auerbach) is our very own age-period-cohort analysis of presidential voting.

I have not looked at Winship and Fosse’s work in detail, but I agree with their general point that the right way forward with this problem is to think about nonlinear models.
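The identification problem mentioned in the challenge can be seen in a few lines of Python. This is a sketch with made-up slopes and a made-up grid of survey years and birth years, not an implementation of Winship and Fosse’s methods: because age = period – cohort, adding any constant to the age and cohort slopes and subtracting it from the period slope leaves every fitted value unchanged.

```python
def linear_predictor(age_slope, period_slope, cohort_slope, period, cohort):
    """Linear age-period-cohort predictor, with age defined as period - cohort."""
    age = period - cohort
    return age_slope * age + period_slope * period + cohort_slope * cohort

# Hypothetical grid of survey years (periods) and birth years (cohorts).
cells = [(period, cohort)
         for period in range(1980, 2021, 10)
         for cohort in range(1920, 1971, 10)]

original = (3, -2, 5)   # hypothetical (age, period, cohort) slopes
shift = 7               # any number works here
shifted = (original[0] + shift, original[1] - shift, original[2] + shift)

fits_original = [linear_predictor(*original, p, c) for p, c in cells]
fits_shifted = [linear_predictor(*shifted, p, c) for p, c in cells]

# Identical fitted values from different slopes: the data cannot separate them.
print(fits_original == fits_shifted)
```

The linear parts of the three effects are therefore not separately identified from data alone, which is why the nonlinear components (or explicit outside assumptions) have to carry the analysis.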

Last post on hydroxychloroquine (perhaps)

James “not this guy” Watson writes:

The Lancet study has already been consequential, for example, the WHO have decided to remove the hydroxychloroquine arm from their flagship SOLIDARITY trial.

Thanks in part to the crowdsourcing of data sleuthing on your blog, I have an updated version of doubts concerning the data reliability/veracity.

1/ Ozzy numbers:
This Australian government report (Table 5) says that as of 10th May, only 866 patients in total had been hospitalized in Australia, of whom 7.9% died (68 patients)… whereas 73 Australian patients in the Lancet paper were reported as having died. The mean age reported in the Lancet paper for Australian patients is 55.8 years. The median age for all Australian patients in the attached is 47 years, and for those hospitalized it’s 61 years. (Note the Lancet paper only included hospitalized people, up to April 14th).

2/ A very large Japanese hospital:
The Mehra et al. paper in the NEJM (Cardiovascular disease, drug therapy, and mortality in Covid-19, same data provenance, time period: Dec 20th to March 15th) gave the number of hospitals broken down by country. They had 9 hospitals in Asia (7 in China, 1 in Japan and 1 in South Korea) and 1,507 patients. Their follow-up paper in The Lancet presumably used the same data plus extra data up until April the 14th. The Lancet paper had 7,555 participants in Asia and also 9 hospitals. The assumption would be that these hospitals are the same (why would you exclude the hospitals from the first analysis in the second analysis?). Therefore, we assume that they had an extra 6048 patients in that time period.
Cases in China went from 80,860 on March the 15th to 82,295 by April the 14th (difference is 1435). South Korea: increase from 8,192 to 10,564 (difference is 2372); Japan: from 833 to 7,885 in this time (7052). This is a total increase of 10,859. If all cases in China and South Korea in the intervening period were seen in these 8 hospitals, then it would imply that 2241 patients were seen in 1 hospital in Japan in the space of a month!

3/ High dosing:
Almost 2 thirds of the data come from North America (66%, 559 hospitals). In the previous NEJM publication, the majority of the hospitals were in USA (121 versus 4 in Canada). Assuming that the same pattern holds for the extra 434 hospitals in this Lancet paper, the majority of the patients will have received doses of HCQ according to FDA recommendations: 800mg on day 1, followed by 400mg (salt weights) for 4-7 days. This is not a weight-based dosing recommendation.
The mean daily doses and durations of dosing for HCQ are given as: 596 mg (SD: 126) for an average of 4.2 days (SD: 1.9); HCQ with a macrolide: 597 mg (SD 128) and 4.3 days (SD 2). The FDA dosing for 4 days would give an average of 500mg daily, i.e. (800 + 3×400) / 4. Nowhere in the world recommends higher doses than this, with the exception of the RECOVERY trial in the UK.
So are these average daily doses possible?

4/ Disclaimer/background
It may be worth mentioning that I (or the research unit for which I work) could be seen as having a “vested interest” in chloroquine because we are running the COPCOV study (I am not an investigator on that trial). COPCOV is a COVID19 prevention trial in health workers. Participants will take low dose chloroquine as prophylaxis for 3 months (they are not sick and the doses are about 3x lower than given for treatment – so different population and dose than the Lancet study). The Lancet study will inevitably damage this trial due to the media attention. Understanding whether the underlying data are reliable or not is of extreme importance to our research group. Because our unit has been thinking/reading about (hydroxy)chloroquine a lot recently (and some people in the group have been studying chloroquine pharmacology for 40 years) we rapidly picked up on the “oddness” of this recent paper.

My conclusion from this is that post-publication review is a vital component of science. Medical journals need to embrace it and stop pretending that peer/editorial review will solve all problems.
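The arithmetic in points 1/ through 3/ is easy to recheck. The sketch below uses only the figures quoted in the letter; the variable and function names are mine, and the dosing function assumes the reading used in the letter (800 mg on day 1, then 400 mg on each remaining day of the course).

```python
# 1/ Australia: deaths implied by the government report vs. the Lancet paper.
hospitalized_au = 866
death_rate_au = 0.079
implied_deaths_au = round(hospitalized_au * death_rate_au)
lancet_deaths_au = 73

# 2/ Asia: extra patients between the NEJM and Lancet papers.
extra_patients_asia = 7555 - 1507
new_cases = {
    "China": 82295 - 80860,
    "South Korea": 10564 - 8192,
    "Japan": 7885 - 833,
}
total_new_cases = sum(new_cases.values())
# Even if every new case in China and South Korea went to the 8 hospitals there,
# this many patients would be left for the single Japanese hospital.
japan_hospital_patients = (extra_patients_asia
                           - new_cases["China"] - new_cases["South Korea"])

# 3/ Dosing: 800 mg on day 1, then 400 mg per day, averaged over the course.
def mean_daily_dose(days, loading=800, maintenance=400):
    return (loading + (days - 1) * maintenance) / days

fda_means = {days: mean_daily_dose(days) for days in range(4, 8)}

print(implied_deaths_au, lancet_deaths_au)
print(extra_patients_asia, total_new_cases, japan_hospital_patients)
print(fda_means)
```

Under this reading of the schedule, the course-average dose is at most 500 mg for courses of 4 to 7 days, below the reported means of 596 and 597 mg. Shorter courses would average higher (a 2-day course gives 600 mg), so the comparison also depends on how course length is distributed, which the paper summarizes only by a mean and SD.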

Perhaps the authors of that Lancet study will respond in the comments here? They haven’t yet responded on pubpeer.

P.S. The authors have this followup post which has some general discussion of their data sources but no engagement with the criticisms of the paper. On the ladder of responses to criticism, I’d put them at #4 (“Avoid looking into the question”). The good news is that they’re nowhere near #6 (“Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim”) or #7 (“Attack the messenger”). As I’ve said before, I have an open mind on this, and it’s possible the paper has no mistakes at all: maybe the criticisms are misguided. I’d feel better if the authors acknowledged the criticisms and responded in some way.

Alexey Guzey’s sleep deprivation self-experiment

Alexey “Matthew Walker’s ‘Why We Sleep’ Is Riddled with Scientific and Factual Errors” Guzey writes:

I [Guzey] recently finished my 14-day sleep deprivation self experiment and I ended up analyzing the data I have only in the standard p < 0.05 way and then interpreting it by writing explicitly about how much I believe I should update based on this data. I honestly have absolutely no knowledge of Bayesian data analysis, so I'd be curious if you think the data I have is worth analyzing in some more sophisticated manner or if you have general pointers to resources that would help me figure this out (unless the answer to this is that I should just google something like "bayesian data analysis"..) Here’s the experiment.

One concern that I have is that my Psychomotor Vigilance Task data (as an example) is just not very good (which I note explicitly in the post), and I would be worried that if I try doing any fancy analysis on it, people would be led to believe that the data is more trustworthy than it really is, based on the fancy methods (when in reality it’s garbage in garbage out type of a situation).

Here’s the background (from the linked post):

I [Guzey] slept 4 hours a night for 14 days and didn’t find any effects on cognition (assessed via Psychomotor Vigilance Task, a custom first-person shooter scenario, and SAT). I’m a 22-year-old male and normally I sleep 7-8 hours. . . .

I did not measure my sleepiness. However, for the entire duration of the experiment I had to resist regular urges to sleep . . . This sleep schedule was extremely difficult to maintain.

Lack of effect on cognitive ability is surprising and may reflect true lack of cognitive impairment, my desire to demonstrate lack of cognitive impairment due to chronic sleep deprivation and lack of blinding biasing the measurements, lack of statistical power, and/or other factors.

I believe that this experiment provides strong evidence that I experienced no major cognitive impairment as a result of sleeping 4 hours per day for 12-14 days and that it provides weak suggestive evidence that there was no cognitive impairment at all.

I [Guzey] plan to follow this experiment up with an acute sleep deprivation experiment (75 hours without sleep) and longer partial sleep deprivation experiments (4 hours of sleep per day for (potentially) 30 and more days). . . .

His main finding is a null effect, in comparison with Van Dongen et al., 2003, who reported large and consistent declines in performance after sleep deprivation.

My quick answer to Guzey’s question (“I’d be curious if you think the data I have is worth analyzing in some more sophisticated manner”) is, No, I don’t think any fancy statistical analysis is needed here. Not given the data we see here. An essentially null effect is an essentially null effect, no matter how you look at it. Looking forward, yes, I think a multilevel Bayesian approach (as described here and here) would make sense. One reason I say this is because I noticed this bit of confusion from Guzey’s description:

The more hypotheses I have, the more samples I need to collect for each hypothesis, in order to maintain the same false positive probability. This is an n=1 study and I’m barely collecting enough samples to measure medium-to-large effects and will spend 10 hours performing PVT. I’m not in a position to test many hypotheses at once.

This is misguided. The goal should be to learn, not to test hypotheses, and the false positive probability has nothing to do with anything relevant. It would arise if your plan were to perform a bunch of hypothesis tests and then record the minimum p-value, but it would make no sense to do this, as p-values are super-noisy.

Guzey has a whole bunch of this alpha-level test stuff, and I can see why he’d do this, because that’s what it says to do in some textbooks and online tutorials, and it seems like a rigorous thing to do, but this sort of hypothesis testing is not actually rigorous, it’s just a way to add noise to your data.
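Here is a small Python sketch of both points. The first function gives the false positive rate you would get from recording the minimum p-value over k independent null tests, which is the only scenario where that calculation matters. The second part illustrates how noisy a p-value is, assuming (hypothetically) that the estimate’s z-score is normal with mean 2 and standard deviation 1; the names and the value 2.0 are my own choices for illustration.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def min_p_false_positive_rate(k, alpha=0.05):
    """P(at least one of k independent null tests has p < alpha)."""
    return 1.0 - (1.0 - alpha) ** k

# Recording the minimum p-value over k tests inflates the false positive rate.
for k in (1, 5, 10):
    print(k, round(min_p_false_positive_rate(k), 3))

# Noise in p-values: suppose the estimate's z-score is normal(TRUE_Z, 1).
TRUE_Z = 2.0            # hypothetical expected z-score
Z_90 = 1.2816           # 90th percentile of the standard normal
p_unlucky = two_sided_p(TRUE_Z - Z_90)   # 10th-percentile replication
p_lucky = two_sided_p(TRUE_Z + Z_90)     # 90th-percentile replication
print(round(p_unlucky, 3), round(p_lucky, 4))
```

With the same underlying effect, an unlucky and a lucky replication land on opposite sides of 0.05 by a wide margin, which is the sense in which thresholding p-values adds noise rather than rigor.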

Anyway, none of this is really an issue here because he’s sharing his raw data. That’s really all the preregistration you need. For his next study, I recommend that Guzey just preregister exactly what measurements to take, then commit to posting the data and making some graphs.

There’s not much to say about the data analysis because Guzey’s data don’t show much. It could be, though, that, as Guzey says, he’s particularly motivated to perform well so he can find that sleep deprivation isn’t so bad.

Why do we go short on sleep and why do we care?

God is in every leaf of every tree.

As is so often the case, we can think better about this problem by thinking harder about the details and losing a layer or two of abstraction. In this case, the abstraction we can lose is the idea of “the effect of sleep deprivation on performance.”

To unpack “the effect of sleep deprivation on performance,” we have to ask: What sleep deprivation? What performance?

There are lots of reasons for sleep deprivation. For example, maybe you work 2 jobs, or maybe you’re up all night caring for a child or some other family member, or maybe you have some medical condition so you keep waking up in the middle of the night, or maybe you stay up all night sometimes to finish your homework.

Similarly, there are different performances you might care about. If you’re short on sleep because you’re working 2 jobs, maybe you don’t want to crash your car driving home one morning. Or maybe you’re operating heavy machinery and would like to avoid cutting your arm off. Or, if you’re staying up all night for work, maybe you want to do a good job on that assignment.

Given all this, it’s hard for me to make sense of general claims about the impact, or lack of impact, of lack of sleep on performance. I have the same concerns about measuring cognitive ability, as ability depends a lot on motivation.

These concerns are not unique to Guzey’s experiment; they also arise in other research, such as the cited paper by Van Dongen et al.

This controversial hydroxychloroquine paper: What’s Lancet gonna do about it?

Peer review is not a form of quality control

In the past month there’s been a lot of discussion of the flawed Stanford study of coronavirus prevalence—it’s even hit the news—and one thing that came up was that the article under discussion was just a preprint—it wasn’t even peer reviewed!

For example, in a NYT op-ed:

This paper, and thousands more like it, are the result of a publishing phenomenon called the “preprint” — articles published long before the traditional form of academic quality control, peer review, takes place. . . . They generally carry a warning label: “This research has yet to be peer reviewed.” To a scientist, this means it’s provisional knowledge — maybe true, maybe not. . . .

That’s fine, as long as you recognize that “peer-reviewed research” is also “provisional knowledge — maybe true, maybe not.” As we’ve learned in recent years, lots of peer-reviewed research is really bad. Not just wrong, as in, hey, the data looked good but it was just one of those things, but wrong, as in, we could’ve or should’ve realized the problems with this paper before anyone even tried to replicate it.

The beauty-and-sex-ratio research, the ovulation-and-voting research, embodied cognition, himmicanes, ESP, air rage, Bible Code, the celebrated work of Andrew Wakefield, the Evilicious guy, the gremlins dude—all peer-reviewed.

I’m not saying that all peer-reviewed work is bad—I’ve published a few hundred peer-reviewed papers myself, and I’ve only had to issue major corrections for 4 of them—but to consider peer review as “academic quality control” . . . no, that’s not right. The quality of the paper has been, and remains, the responsibility of the author, not the journal.


So, a new one came in. A recent paper published in the famous/notorious medical journal Lancet reports that hydroxychloroquine and chloroquine increased the risk of in-hospital death by 30% to 40% and increased arrhythmia by a factor of 2 to 5. The study hit the news with the headline, “Antimalarial drug touted by President Trump is linked to increased risk of death in coronavirus patients, study says.” (Meanwhile, Trump says that Columbia is “a liberal, disgraceful institution.” Good thing we still employ Dr. Oz!)

All this politics . . . in the meantime, this Lancet study has been criticized; see here and here. I have not read the article in detail so I’m not quite sure what to make of the criticisms; I linked to them on Pubpeer in the hope that some experts can join in.

Now we have open review. That’s much better than peer review.

What’s gonna happen next?

I can see three possible outcomes:

1. The criticisms are mistaken. Actually the research in question adjusted just fine for pre-treatment covariates, and the apparent data anomalies are just misunderstandings. Or maybe there are some minor errors requiring minor corrections.

2. The criticisms are valid and the authors and journal publicly acknowledge their mistakes. I doubt this will happen. Retractions and corrections are rare. Even the most extreme cases are difficult to retract or correct. Consider the most notorious Lancet paper of all, the vaccines paper by Andrew Wakefield, which appeared in 1998, and was finally retracted . . . in 2010. If the worst paper ever took 12 years to be retracted, what can we expect for just run-of-the-mill bad papers?

3. The criticisms are valid, the authors dodge and do not fully grapple with the criticism, and the journal stays clear of the fray, content to rack up the citations and the publicity.

That last outcome seems very possible. Consider what happened a few years ago when Lancet published a ridiculous article purporting to explain variation in state-level gun deaths using 25 state-level predictors representing different gun control policies. A regression with 50 data points and 25 predictors and no regularization . . . wait! This was a paper that was so fishy that, even though it was published in a top journal and even though its conclusions were simpatico with the views of gun-control experts, those experts still blasted the paper with “I don’t believe that . . . this is not a credible study and no cause and effect inferences should be made from it . . . very flawed piece of research.” A couple of researchers at Rand (full disclosure: I’ve worked with these two people) followed up with a report concluding:

We identified a number of serious analytical errors that we suspected could undermine the article’s conclusions. . . . appeared likely to support bad gun policies and to hurt future research efforts . . . overfitting . . . clear evidence that its substantive conclusions were invalid . . . factual errors and inconsistencies in the text and tables of the article.

They published a letter in Lancet with their criticisms, and the authors responded with a bunch of words, not giving an inch on any of their conclusions or reflecting on the problems of using multiple regression the way they did. And, as far as Lancet is concerned . . . that’s it! Indeed, if you go to the original paper on the Lancet website, you’ll see no link to this correspondence. Meanwhile, according to Google, the article has been cited 74 times. OK, sure, 74 is not a lot of citations, but still. It’s included in a meta-analysis published in JAMA—and one of the authors of that meta-analysis is the person who said he did not believe the Lancet paper when it came out! The point is, it’s in the literature now and it’s not going away.
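To see why 50 data points and 25 predictors with no regularization is a problem, here is a simulation sketch in plain Python. It is not a reanalysis of the gun-policy data: the outcome and all 25 predictors are pure noise, and the least-squares fit is done by solving the normal equations directly. Even so, the in-sample R-squared comes out around one half, close to the theoretical expectation of p/(n-1) = 25/49 for noise predictors.

```python
import random

def r_squared_pure_noise(n=50, p=25, seed=0):
    """Fit OLS of a noise outcome on p noise predictors; return in-sample R^2."""
    rng = random.Random(seed)
    # Design matrix with an intercept column followed by p noise predictors.
    X = [[1.0] + [rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    k = p + 1
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    beta = [A[a][k] / A[a][a] for a in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Repeat over 20 simulated datasets and average the in-sample R^2.
r2 = [r_squared_pure_noise(seed=s) for s in range(20)]
mean_r2 = sum(r2) / len(r2)
print(round(mean_r2, 2))
```

A regression that “explains” half the variance of random numbers cannot tell you which of its 25 coefficients mean anything, which is the overfitting problem the Rand report points to.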

A few years ago I wrote, in response to a different controversy regarding Lancet, that journal reputation is a two-way street:

Lancet (and other high-profile journals such as PPNAS) plays a role in science publishing that is similar to the Ivy League in universities: It’s hard to get in, but once you’re in, you have that Ivy League credential, and you have to really screw up to lose that badge of distinction.

Or, to bring up another analogy I’ve used in the past, the current system of science publication and publicity is like someone who has a high fence around his property but then keeps the doors of his house unlocked. Any burglar who manages to get inside the estate then has free run of the house. . . .

As Dan Kahan might say, what do you call a flawed paper that was published in a journal with impact factor 50 after endless rounds of peer review? A flawed paper. . . .

My concern is that Lancet papers are inappropriately taken more seriously than they should be. Publishing a paper in Lancet is fine. But then if the paper has problems, it has problems. At that point it shouldn’t try to hide behind the Lancet reputation, which seems to be what is happening. And, yes, if that happens enough, it should degrade the journal’s reputation. If a journal is not willing to rectify errors, that’s a problem no matter what the journal is.

Remember Newton’s third law? It works with reputations too. The Lancet editor is using his journal’s reputation to defend the controversial study. But, as the study becomes more and more disparaged, the sharing of reputation goes the other way.

I can imagine the conversations that will occur:

Scientist A: My new paper was published in the Lancet!

Scientist B: The Lancet, eh? Isn’t that the journal that published the discredited Iraq survey, the Andrew Wakefield paper, and that weird PACE study?

A: Ummm, yeah, but my article isn’t one of those Lancet papers. It’s published in the serious, non-politicized section of the magazine.

B: Oh, I get it: The Lancet is like the Wall Street Journal—trust the articles, not the opinion pages?

A: Not quite like that, but, yeah: If you read between the lines, you can figure out which Lancet papers are worth reading.

B: Ahhh, I get it.

Now we just have to explain this to journalists and policymakers and we’ll be in great shape. Maybe the Lancet could use some sort of tagging system, so that outsiders can know which of its articles can be trusted and which are just, y’know, there?

Long run, reputation should catch up to reality. . . .

I don’t think the long run has arrived yet. Almost all the press coverage of this study seemed to be taking the Lancet label as a sign of quality.

Speaking of reputations . . . the first author of the Lancet paper is from Harvard Medical School, which sounds pretty impressive, but then again we saw that seriously flawed paper that came out from Stanford Medical School, and a few months ago we heard about a bungled job from the University of California medical school. These major institutions are big places, and you can’t necessarily trust a paper just because it comes from a generally respected medical center.

Again, I haven’t looked at the article in detail, nor am I any kind of expert on hydro-oxy-chloro-whatever-it-is, so let me say one more time that outcome 1 above is still a real possibility to me. Just cos someone sends me some convincing-looking criticisms, and there are data availability problems, that doesn’t mean the paper is no good. There could be reasonable explanations for all of this.

Be careful when estimating years of life lost: quick-and-dirty estimates of attributable risk are, well, quick and dirty.

Peter Morfeld writes:

Global burden of disease (GBD) studies and environmental burden of disease (EBD) studies are supported by hundreds of scientifically well-respected co-authors, are published in high level journals, are cited worldwide and have a large impact on health institutions’ reports and related political discussions.

The main metrics used to calculate the impact of exposures on the health of populations are “numbers of premature deaths”, DALYs (“disability adjusted life years”) and YLLs (“Years of Life Lost”). This large and influential branch of science overlooks seminal papers published by Robins and Greenland in the 1980s. These papers have shown that “etiologic deaths” (premature deaths due to exposure) cannot be identified from epidemiological data alone, which entails that YLLs and DALYs cannot be broken down by age or endpoints (diseases). DALYs due to exposure are problematic when interpreted in a counterfactual setting. Thus, most of this influential GBD and EBD mainstream work is scientifically unjustified.

We published a paper on this issue (open access):

Hammitt JK, Morfeld P, Tuomisto JT, Erren TC. Premature Deaths, Statistical Lives, and Years of Life Lost: Identification, Quantification, and Valuation of Mortality Risks. Risk Anal. 2019 Dec 10. doi: 10.1111/risa.13427.

Just for some additional background when you like to comment on the issue: Here is a letter exchange in Lancet with the leader of the largest GBD (global burden of disease) project world wide (Christopher Murray, Seattle).

This exchange is not covered in our paper. It may give an indication how the arguments and bias calculations are received.

My only comment is that I still think Qalys (or Dalys or whatever) are a good unit of measurement. The problems above are not with Qalys, but with intuitively appealing but problematic statistical estimates of them.

P.S. That above-linked discussion also involves Ty Beal, whose name rang a bell . . . here it is!

Hydroxychloroquine update

Following up on our earlier post, James “not the cancer cure guy” Watson writes:

I [Watson] wanted to relay a few extra bits of information that have come to light over the weekend.

The study only has 4 authors which is weird for a global study in 96,000 patients (and no acknowledgements at the end of the paper). Studies like this in medicine usually would have 50-100 authors (often in some kind of collaborative group). The data come from the “Surgical Outcomes Collaborative”, which is in fact a company. The CEO (Sapan Desai) is the second author. One of the comments on the blog post is “I was surprised to see that the data have not been analyzed using a hierarchical model”. But not only do they not use hierarchical modelling or appear to adjust by hospital/country, they also give almost no information about the different hospitals: which countries (just continent level), how the treated vs not treated are distributed across hospitals etc. A previous paper by the same group in NEJM says that they use data from UK hospitals (no private hospitals are treating COVID so must be from the NHS). Who is allowing some random company to use NHS data and publish with no acknowledgments? Another interesting sentence is about patient consent and ethical approval:

The data collection and analyses are deemed exempt from ethics review.

We emailed them to ask for the data, in particular to look at the dose effect which I think is key in understanding the results. They got back to us very quickly and said

Thanks for your email inquiry. Our data sharing agreements with the various governments, countries and hospitals do not allow us to share data unfortunately. I do wish you all the very best as you continue to perform trials since that is the stance we advocate. All we have said is to cease and desist the off label and unmonitored and uncontrolled use of such therapy in hospitalized patients.

So unavailable data from unknown origins . . .

Another rather remarkable aspect is how beautifully uniform the aggregated data are across continents:

For example, smoking is almost between 9.4-10% in 6 continents. As they don’t tell us which countries are involved, hard to see how this matches known smoking prevalences. Antiviral use is 40.5, 40.4, 40.7, 40.2, 40.8, 38.4%. Remarkable! I didn’t realise that treatment was so well coordinated across the world. Diabetes and other co-morbidities don’t vary much either.

I [Watson] am not accusing the authors/data company of anything dodgy, but as they give almost no details about the study and “cannot share the data”, one has to look at things from a skeptical perspective.
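
To put a number on the uniformity Watson describes, here is a minimal sketch that summarizes the six antiviral-use percentages quoted in his note. The figures are copied from the text above; the variable names are mine, and the sketch makes no claim about how much variation one should expect across continents.

```python
from statistics import mean, pstdev

# Antiviral use (%) in the six continents, as quoted in Watson's note.
antiviral_pct = [40.5, 40.4, 40.7, 40.2, 40.8, 38.4]

avg = mean(antiviral_pct)                         # average across continents
spread = max(antiviral_pct) - min(antiviral_pct)  # largest minus smallest
sd = pstdev(antiviral_pct)                        # population standard deviation

print(f"mean = {avg:.1f}%, range = {spread:.1f} points, sd = {sd:.2f} points")
```

Whether a spread that small is plausible depends on which countries and hospitals are in each continent's sample, which is exactly the information that is not available.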

Again, I have not looked into this at all. I’m sharing this because open data is a big deal. Right now, hydroxychloroquine is a big deal too. And we know from experience that Lancet can make mistakes. Peer review is nothing at all compared to open review.

The authors of the paper in question, or anyone else who knows more, should feel free to share information in the comments.

Doubts about that article claiming that hydroxychloroquine/chloroquine is killing people

James Watson (no, not the one who said that cancer would be cured by 2000, and not this guy either) writes:

You may have seen the paper that came out on Friday in the Lancet on hydroxychloroquine/chloroquine in COVID19 hospitalised patients. It’s got quite a lot of media attention already.

This is a retrospective study using data from 600+ hospitals in the US and elsewhere with over 96,000 patients, of whom about 15,000 received hydroxychloroquine/chloroquine (HCQ/CQ) with or without an antibiotic. The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people.

This caught my eye, as an effect size that big should have been picked up pretty quickly in the interim analyses of randomized trials that are currently happening. For example, the RECOVERY trial has a hydroxychloroquine arm and they have probably enrolled ~1500 patients into that arm (~10,000 + total already). They will have had multiple interim analyses so far and the trial hasn’t been stopped yet.

The most obvious confounder is disease severity: this is a drug that is not recommended in Europe and the USA, so doctors give it as “compassionate use”. I.e. very sick patient, so why not try just in case. Therefore the disease severity of the patients in the HCQ/CQ groups will be greater than the controls. The authors say that they adjust for disease severity but actually they use just two binary variables: oxygen saturation and qSOFA score. The second one has actually been reported to be quite bad for stratifying disease severity in COVID. The biggest problem is that they include patients who received HCQ/CQ treatment up to 48 hours post admission. This means that someone who comes in OKish and then deteriorates rapidly could be much more likely to get given the drug as compared to someone as bad but stable. This temporal aspect cannot be picked up by a single severity measurement.

In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for. What’s interesting is that the New England Journal of Medicine published a very similar study a few weeks ago where they saw no effect on mortality. Guess what, they had much more detailed data on patient severity.

One thing that the authors of the Lancet paper didn’t do, which they could have done: If HCQ/CQ is killing people, you would expect a dose (mg/kg) effect. There is very large variation in the doses that the hospitals are giving (e.g. for CQ the mean daily dose is 750 but standard deviation is 300). Our group has already shown that in chloroquine self-poisoning, death is highly predictable from dose (we used stan btw, very useful!). No dose effect would suggest it’s mostly confounding.

In short, it’s a pretty poor dataset and the results, if interpreted literally, could massively damage ongoing randomized trials of HCQ/CQ.

I have not read all these papers in detail, but in general terms I am sympathetic to Watson’s point that statistical adjustment (or, as is misleadingly stated in the cited article, “controlling for” confounding factors) is only as good as what you’re adjusting for.

Again speaking generally, there are many settings where we want to learn from observational data, and so we need to adjust for differences between treated and control groups. I’d rather see researchers try their best to do such adjustments, rather than naively relying on pseudo-rigorous “identification strategies” (as, notoriously, here). So I applaud the authors for trying. I guess the next step is to look more carefully at pre-treatment differences between the two groups.

Are the (de-identified) data publicly available? That would help.

Also, when I see a paper published in Lancet, I get concerned, as they have a bit of a reputation for chasing headlines. I’m not saying that it is for political reasons that they published a paper on the dangers of hydroxychloroquine, but this sort of thing is always a concern when Lancet is involved.

P.S. More here.

“Banishing ‘Black/White Thinking’: A Trio of Teaching Tricks”

Richard Born writes:

The practice of arbitrarily thresholding p values is not only deeply embedded in statistical practice, it is also congenial to the human mind. It is thus not sufficient to tell our students, “Don’t do this.” We must vividly show them why the practice is wrong and its effects detrimental to scientific progress. I [Born] offer three teaching examples I have found to be useful in prompting students to think more deeply about the problem and to begin to interpret the results of statistical procedures as measures of how evidence should change our beliefs, and not as bright lines separating truth from falsehood.

He continues:

Humans are natural born categorizers. We instinctively take continuous variables and draw (often) arbitrary boundaries that allow us to put names to groups. For example, we divide the continuous visible spectrum up into discrete colors like “red,” “yellow,” and “blue.” And the body mass index (BMI) is a continuous measure of a person’s weight-to-height ratio, yet a brief scan of the Internet turns up repeated examples of the classification [into three discrete categories].

In some cases, such as for color, certain categories appear to be “natural,” as if they were baked into our brains (Rosch, 1973). In other cases, categorization is related to the need to make decisions, as is the case for many medical classifications. And the fact that we communicate our ideas using language—words being discrete entities—surely contributes to this tendency.

Nowhere is the tendency more dramatic—and more pernicious—than in the practice of null hypothesis significance testing (NHST), based on p values, where an arbitrary cutoff of 0.05 is used to separate “truth” from “falsehood.” Let us set aside the first obvious problem that in NHST we never accept the null (i.e., proclaim falsehood) but rather only fail to reject it. And let us also ignore the debate about whether we should change the cutoff to something more stringent, say 0.005 (Benjamin et al., 2018), and instead focus on what I consider to be the real problem: the cutoff itself. This is the problem I refer to as “black/white thinking.”

Because this tendency to categorize using p values is (1) natural and (2) abundantly reinforced in many statistics courses, we must do more than simply tell our students that it is wrong. We must show them why it is wrong and offer better ways of thinking about statistics. What follows are some practical methods I have found useful in classroom discussions with graduate students and postdoctoral fellows in neuroscience. . . .

Create your own community (if you need to)

Back in 1991 I went to a conference of Bayesians and I was disappointed that the vast majority seemed not to be interested in checking their statistical models. The attitude seemed to be, first, that model checking was not possible in a Bayesian context, and, second, that model checking was illegitimate because models were subjective. No wonder Bayesianism was analogized to a religion.

This all frustrated me, as I’d found model checking to be highly relevant in my Bayesian research in two different research problems, one involving inference for emission tomography (which had various challenges arising from spatial models and positivity constraints), the other involving models for district-level election results.

The good news is that, in the years since our book Bayesian Data Analysis came out, a Bayesian community has developed that is more accepting of checking models by looking at their fit to data. Many challenges remain.

The point of this story is that sometimes you can work with an existing community, sometimes you have to create your own community, and sometimes it’s a mix. In this case, my colleagues and I did not try to create a community on our own; we very clearly piggybacked off the existing Bayesian community, which indeed included lots of people who were interested in checking model fit, once it became clear that this was a theoretically valid step.

P.S. For more on the theoretical status of model checking in Bayesian inference, see this 2003 paper, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing and this 2018 paper, Visualization in Bayesian workflow.

P.P.S. Zad’s cat, pictured above, is doing just fine. He doesn’t need to create his own community.

New report on coronavirus trends: “the epidemic is not under control in much of the US . . . factors modulating transmission such as rapid testing, contact tracing and behavioural precautions are crucial to offset the rise of transmission associated with loosening of social distancing . . .”

Juliette Unwin et al. write:

We model the epidemics in the US at the state-level, using publicly available death data within a Bayesian hierarchical semi-mechanistic framework. For each state, we estimate the time-varying reproduction number (the average number of secondary infections caused by an infected person), the number of individuals that have been infected and the number of individuals that are currently infectious. We use changes in mobility as a proxy for the impact that NPIs and other behaviour changes have on the rate of transmission of SARS-CoV-2. We project the impact of future increases in mobility, assuming that the relationship between mobility and disease transmission remains constant. We do not address the potential effect of additional behavioural changes or interventions, such as increased mask-wearing or testing and tracing strategies.

Nationally, our estimates show that the percentage of individuals that have been infected is 4.1% [3.7%-4.5%], with wide variation between states. For all states, even for the worst affected states, we estimate that less than a quarter of the population has been infected; in New York, for example, we estimate that 16.6% [12.8%-21.6%] of individuals have been infected to date. Our attack rates for New York are in line with those from recent serological studies [1] broadly supporting our modelling choices.

There is variation in the initial reproduction number, which is likely due to a range of factors; we find a strong association between the initial reproduction number with both population density (measured at the state level) and the chronological date when 10 cumulative deaths occurred (a crude estimate of the date of locally sustained transmission).

Our estimates suggest that the epidemic is not under control in much of the US: as of 17 May 2020, the reproduction number is above the critical threshold (1.0) in 24 [95% CI: 20-30] states. Higher reproduction numbers are geographically clustered in the South and Midwest, where epidemics are still developing, while we estimate lower reproduction numbers in states that have already suffered high COVID-19 mortality (such as the Northeast). These estimates suggest that caution must be taken in loosening current restrictions if effective additional measures are not put in place.

We predict that increased mobility following relaxation of social distancing will lead to resurgence of transmission, keeping all else constant. We predict that deaths over the next two-month period could exceed current cumulative deaths by greater than two-fold, if the relationship between mobility and transmission remains unchanged. Our results suggest that factors modulating transmission such as rapid testing, contact tracing and behavioural precautions are crucial to offset the rise of transmission associated with loosening of social distancing.

Overall, we show that while all US states have substantially reduced their reproduction numbers, we find no evidence that any state is approaching herd immunity or that its epidemic is close to over.

One question I have is about the assumptions underlying “increased mobility following relaxation of social distancing.” Even if formal social distancing rules are relaxed, if the death rate continues, won’t enough people be scared enough that they’ll limit their exposure, thus reducing the rate of transmission? This is not to suggest that the epidemic will go away, just that maybe people’s behavior will keep the infections spreading at something like the current rate? Or maybe I’m missing something here.

The report and other information is at their website.

Unwin writes:

Below is our usual three panel plot showing our results for the five states we used as a case study in the report – we chose them because we felt they showed different responses across the US. New in this report, we estimate the number of people who are currently infectious over time – the difference in this and those getting newly infected each day is quite stark.

We have also put the report on open review, which is an online platform enabling open reviews of scientific papers. It’s usually used for computer science conferences but Seth Flaxman has been in touch to partner with them to try it for a pre-print. If you’d like, click on the link and you can leave us a comment or recommend a reviewer.

Lots and lots of graphs (they even followed some of my suggestions, but I’m still concerned about the way that the upper ends of the uncertainty bounds are so visually prominent), and they fit a multilevel model in Stan, which I really think is the right way to go, as it allows a flexible workflow for model building, checking, and improvement.

You can make of the conclusions what you will: the model is transparent, so you should be able to map back from inferences to assumptions.

But the top graph looked like such strong evidence!

I just posted this a few hours ago, but it’s such an important message that I’d like to post it again.

Actually, maybe we should just post nothing but the above graph every day, over and over again, for the next 20 years.

This is hugely important, one of the most important things we need to understand about statistics.

The top graph is what got published, the bottom graph is the preregistered replication from Joe Simmons and Leif Nelson.

Remember: Just cos you have statistical significance and a graph that shows a clear and consistent pattern, it doesn’t mean this pattern is real, in the sense of generalizing beyond your sample.

But the top graph looked like such strong evidence!


It’s so easy to get fooled. Think of all the studies with results that look like that top graph.

Some thoughts on another failed replication in psychology

Joe Simmons and Leif Nelson write:

We report our attempt to replicate a study in a recently published Journal of Marketing Research (JMR) article entitled, “Having Control Over and Above Situations: The Influence of Elevated Viewpoints on Risk Taking”. The article’s abstract summarizes the key result: “consumers’ views of scenery from a high physical elevation induce an illusory source of control, which in turn intensifies risk taking.”

Lots of details, of which the most important is that their replication is very close to the original study, but with three times the sample size.

And now the findings, first the exciting statistically significant results from the original published study, then the blah, noisy results from the preregistered replication:

Yup, the usual story. It’s 50 shades of gray all over again. Or embodied cognition. Or power pose. Or a zillion other examples pushed by the happy-talk crowd.

The above graphs pretty much tell the whole story, but I have one point I’d like to pick up on.

But the top graph looked like such strong evidence! Let’s be very very aware and very very afraid of this. It’s soooo easy to get fooled by graphs such as Figure 1 that just seem to slam a point home.

So let’s say this again: Just cos you have statistical significance and a graph that shows a clear and consistent pattern, it doesn’t mean this pattern is real, in the sense of generalizing beyond your sample. This is a big deal.

P.S. I wrote this post last year but it’s appearing now, so I’ll add this special message just for coronavirus studies:

Just cos you have statistical significance and a graph that shows a clear and consistent pattern, it doesn’t mean this pattern is real, in the sense of generalizing beyond your sample. This is a big deal.

Also, thanks to Zad for the above foto which he captions, “When you get arrested by the feds for not social distancing.”

OK, here’s a hierarchical Bayesian analysis for the Santa Clara study (and other prevalence studies in the presence of uncertainty in the specificity and sensitivity of the test)

After writing some Stan programs to analyze that Santa Clara coronavirus antibody study, I thought it could be useful to write up what we did more formally so that future researchers could use these methods more easily.

So Bob Carpenter and I wrote an article, Bayesian analysis of tests with unknown specificity and sensitivity:

When testing for a rare disease, prevalence estimates can be highly sensitive to uncertainty in the specificity and sensitivity of the test. Bayesian inference is a natural way to propagate these uncertainties, with hierarchical modeling capturing variation in these parameters across experiments. Another concern is the people in the sample not being representative of the general population. Statistical adjustment cannot without strong assumptions correct for selection bias in an opt-in sample, but multilevel regression and poststratification can at least adjust for known differences between sample and population. We demonstrate these models with code in R and Stan and discuss their application to a controversial recent study of COVID-19 antibodies in a sample of people from the Stanford University area. Wide posterior intervals make it impossible to evaluate the quantitative claims of that study regarding the number of unreported infections. For future studies, the methods described here should facilitate more accurate estimates of disease prevalence from imperfect tests performed on nonrepresentative samples.

The article includes a full description of our models along with R and Stan code. (We access Stan using cmdstanR.) And it’s all on this pretty Github page that Bob set up!

The paper and code are subject to change. I don’t anticipate any major differences from the current version, but Bob is planning to clean up the code and add some graphs showing dependence of the inferences to prior distributions on the hyperparameters. Then we’ll post it on Arxiv or Medrxiv or ResearchersOne or whatever.

Also, if we get raw data for any studies, we could do more analyses and add them to the paper. Really, though, the point is to have the method out there so that other people can use it, criticize it, and improve upon it.

Above I’ve quoted the abstract of our paper. Here’s how we end it:

Limitations of the statistical analysis

Epidemiology in general, and disease testing in particular, features latent parameters with high levels of uncertainty, difficulty in measurement, and uncertainty about the measurement process as well. This is the sort of setting where it makes sense to combine information from multiple studies, using Bayesian inference and hierarchical models, and where inferences can be sensitive to assumptions.

The biggest assumptions in this analysis are, first, that the historical specificity and sensitivity data are relevant to the current experiment; and, second, that the people in the study are a representative sample of the general population. We addressed the first concern with a hierarchical model of varying sensitivities and specificities, and we addressed the second concern with multilevel regression and poststratification on demographics and geography. But this modeling can take us only so far. If there is hope or concern that the current experiment has unusual measurement properties, or that the sample is unrepresentative in ways not accounted for in the regression, then more information or assumptions need to be included in the model, as in Campbell et al. (2020).

The other issue is that there are choices of models, and tuning parameters within each model. Sensitivity to the model is apparent in Bayesian inference, but it would arise with any other statistical method as well. For example, Bendavid et al. (2020a) used an (incorrectly applied) delta method to propagate uncertainty, but this is problematic when sample size is low and probabilities are near 0 or 1. Bendavid et al. (2020b) completely pooled their specificity and sensitivity experiments, which is equivalent to setting sigma_{gamma} and sigma_{delta} to zero. And their weighting adjustment has many arbitrary choices. We note these not to single out these particular authors but rather to emphasize that, at least for this problem, all statistical inferences involve user-defined settings.

For the models in the present article, the most important user choices are: (a) what data to include in the analysis, (b) prior distributions for the hyperparameters, and (c) the structure and interactions to include in the MRP model. For these reasons, it would be difficult to set up the model as a plug-and-play system where users can just enter their data, push a button, and get inferences. Some active participation in the modeling process is required, which makes sense given the sparseness of the data. When studying populations with higher prevalences and with data that are closer to random samples, more automatic approaches might be possible.

Santa Clara study

Section 3 shows our inferences given the summary data in Bendavid et al. (2020b). The inference depends strongly on the priors on the distributions of sensitivity and specificity, but that is unavoidable: the only way to avoid this influence of the prior would be to sweep it under the rug, for example by just assuming a zero variation in the test parameters.

What about the claims regarding the rate of coronavirus exposure and implications for the infection fatality rate? It’s hard to say from this one study: the numbers in the data are consistent with zero infection rate and a wide variation in specificity and sensitivity across tests, and the numbers are also consistent with the claims made in Bendavid et al. (2020a,b). That does not mean anyone thinks the true infection rate is zero. It just means that more data, assumptions, and subject-matter knowledge are required. That’s ok–people usually make lots of assumptions in this sort of laboratory assay. It’s common practice to use the manufacturer’s numbers on specificity, sensitivity, detection limit, and so forth, and not worry about that level of variation. It’s only when you are estimating a very low underlying rate that the statistical challenges become so severe.

One way to go beyond the ideas of this paper would be to include additional information on patients, for example from self-reported symptoms. Some such data are reported in Bendavid et al. (2020b), although not at the individual level. With individual-level symptom and test data, a model with multiple outcomes could yield substantial gains in efficiency compared to the existing analysis using only the positive/negative test result.

For now, we do not think the data support the claim that the number of infections in Santa Clara County was between 50 and 85 times the count of cases reported at the time, or the implied interval for the IFR of 0.12-0.2%. These numbers are consistent with the data, but the data are also consistent with a near-zero infection rate in the county. The data of Bendavid et al. (2020a,b) do not provide strong evidence about the number of people infected or the infection fatality ratio; the number of positive tests in the data is just too small, given uncertainty in the specificity of the test.

Going forward, the analyses in this article suggest that future studies should be conducted with full awareness of the challenges of measuring specificity and sensitivity, that relevant variables be collected on study participants to facilitate inference for the general population, and that (de-identified) data be made accessible to external researchers.
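
The claim above that a low observed positive rate is consistent with both a zero infection rate and a nonzero one can be sketched with the standard relation between prevalence and the probability of a positive test. This is only an illustration, not the hierarchical model from the paper: the function name and every number below are hypothetical values I picked for the example, not figures from the study.

```python
def expected_positive_rate(prevalence, sensitivity, specificity):
    """Probability that a randomly chosen person tests positive."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives + false_positives

# Hypothetical scenarios (prevalence, sensitivity, specificity),
# chosen for illustration only.
scenarios = {
    "no infections, specificity 98.5%": (0.000, 0.80, 0.985),
    "1.3% infected, specificity 99.5%": (0.013, 0.80, 0.995),
}

for label, (prev, sens, spec) in scenarios.items():
    rate = expected_positive_rate(prev, sens, spec)
    print(f"{label}: expected positive rate = {rate:.4f}")
```

Both scenarios give an expected positive rate of roughly 1.5%, so a raw positive rate alone cannot tell them apart unless the specificity is known very precisely.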

P.S. I’ve updated the article, fixing some typos and other things, and adding references and discussion based on comments we’ve received here and elsewhere.

A COVID-19 collaboration platform

Following up on today’s post on design of studies for coronavirus treatments and vaccines, Z points to this site, which states:

In the U.S. only a few COVID-19 randomized clinical trials (RCTs) have been centrally organized, e.g. by NIAID, PCORI and individual PIs. Over 400 such trials have been registered on with dozens being added each day. Many of them are designed to answer similar questions and combining data or aggregating evidence could dramatically increase their efficiency and precision, getting answers to doctors faster and more reliably. Many of the studies getting off the ground right now are being run outside of the research environment by hospitals in need of decision-making information for their doctors. Like these studies, our primary goal is to get high-quality evidence to doctors quickly.

Furthermore, local outbreaks may taper off before institutions are able to enroll their target sample size. Notably, this has happened in China, where many trials are effectively suspended with incomplete enrollment and inconclusive results. If the US response to COVID-19 proves successful, hospitals will be lucky to find themselves in a similar position with a sudden drop off in COVID cases that could affect enrollment. If protocols are public and open for collaboration, an RCT can be picked up in different regions as the outbreak moves across the country. Unfortunately, no platform currently exists for such collaboration on RCTs.

CovidCP fills this gap by publicizing protocols whose PIs are open to various levels of collaboration: joining forces with other research teams to create a core protocol; admitting new sites under the existing PI and IRB; sharing anonymized interim and/or final data through our partner Vivli with other sites that choose to independently operate a trial under a similar protocol.

There’s more at the faq. I have no idea how this will go, but it seems worth a try, to help the bazaar function better.

This one’s important: Designing clinical trials for coronavirus treatments and vaccines

I’ve had various thoughts regarding clinical trials for coronavirus treatments and vaccines, and then I came across thoughtful posts by Thomas Lumley and Joseph Delaney on vaccines.

So let’s talk, first about treatments, then about vaccines.

Clinical trials for treatments

The first thing I want to say is that designing clinical trials is not just about power calculations and all that. It’s also about what you’re gonna do with the results once they come in. The usual ideas of design (including in our own books, unfortunately) focus on what can be learned from a single study. But that’s not what we have here.

Hospitals have lots of coronavirus patients right now, and they can try out whatever treatments are on the agenda, starting with the patients that are at the highest risk of dying. This should be done in a coordinated fashion, by which I don’t mean a bunch of randomized trials, each aiming for that statistical-significance jackpot, followed by a series of headlines and maybe an eventual meta-analysis. When I say “coordinated,” I mean that all the studies should put their patient-level information into an open repository using some shared format, everything gets registered, all the treatments, all the background variables, all the outcomes. This shouldn’t be a burden on experimenters. Indeed, a shared, open-source spreadsheet should be easier to use, compared to the default approach of each group doing their own thing.

Ok, now that I wrote that paragraph, I wish I’d written it a couple months ago. Not that it would’ve made any difference. It would take a lot to change the medical-industrial complex. Sander Greenland et al. have been screaming for years, and the changes have been incremental at best.

Let me tell you a story. A doctor was designing a trial for an existing drug that he thought could be effective for high-risk coronavirus patients. He contacted me to check his sample size calculation: under the assumption that the drug increased survival rate by 25 percentage points, a sample size of N = 126 would assure 80% power. (With 126 people divided evenly in two groups, the standard error of the difference in proportions is bounded above by √(0.5*0.5/63 + 0.5*0.5/63) = 0.089, so an effect of 0.25 is at least 2.8 standard errors from zero, which is the condition for 80% power for the z-test.) When I asked the doctor how confident he was in his guessed effect size, he replied that he thought the effect on these patients would be higher and that 25 percentage points was a conservative estimate. At the same time, he recognized that the drug might not work. I asked the doctor if he would be interested in increasing his sample size so he could detect a 10 percentage point increase in survival, for example, but he said that this would not be necessary.
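Here is a minimal sketch of the arithmetic in that parenthetical, in Python. It assumes a two-sided z-test at the 5% level and uses the worst-case variance (both proportions set to 0.5), as in the bound above. The function name and its defaults are mine, for illustration only; this is not the doctor's actual calculation.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(effect, n_total, p_control=0.5, p_treated=0.5, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in proportions.

    Assumes equal allocation to the two arms. The default proportions of 0.5
    give the upper bound on the standard error (an assumption, not an estimate).
    """
    n_per_arm = n_total / 2
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treated * (1 - p_treated) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the tiny probability of rejecting in the wrong direction.
    power = 1 - NormalDist().cdf(z_crit - effect / se)
    return se, power


# The doctor's design: N = 126, assumed effect of 25 percentage points.
se_25, power_25 = power_two_proportions(0.25, 126)

# Same N, but a true effect of only 10 percentage points.
se_10, power_10 = power_two_proportions(0.10, 126)

print(f"se bound = {se_25:.3f}, power at 0.25 = {power_25:.2f}, power at 0.10 = {power_10:.2f}")
```

The standard-error bound comes out near 0.089 and the power at 0.25 near 80%, matching the numbers above. With the same 126 patients and a true effect of 10 percentage points, the power is much lower, which is why the question about effect size matters.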

It might seem reasonable to suppose that a drug might not be effective but would have a large effect if it did happen to work. But this vision of uncertainty has problems. Suppose, for example, that the survival rate was 30% among the patients who do not receive this new drug and 55% among the treatment group. Then in a population of 1000 people, it could be that the drug has no effect on the 300 people who would live either way, no effect on the 450 who would die either way, and it would save the lives of the remaining 250 patients. There are other possibilities consistent with a 25 percentage point benefit (for example, the drug could save 350 people while killing 100), but I’ll stick with the simple scenario for now. In any case, the point is that the posited benefit of the drug is not “a 25 percentage point benefit” for each patient; rather, it’s a benefit on 25% of the patients. And, from that perspective, of course the drug could work but only on 10% of the patients. Once we’ve accepted the idea that the drug works on some people and not others, or in some comorbidity scenarios and not others, we realize that “the treatment effect” in any given study will depend entirely on the patient mix. There is no underlying number representing “the effect of the drug.” Ideally one would like to know what sorts of patients the treatment would help, but in a clinical trial it is enough to show that there is some clear average effect. My point is that if we consider the treatment effect in the context of variation between patients, this can be the first step in a more grounded understanding of effect size.
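To make the bookkeeping concrete, here is a small Python sketch of the four patient types implied by this argument. The function name and the type labels are mine. In the "save 350, kill 100" case, the split of the remaining patients (200 who live either way, 350 who die either way) is my own fill-in, chosen so that the survival rates stay at 30% and 55%. The third patient mix is entirely hypothetical.

```python
def trial_summary(live_either_way, die_either_way, saved, harmed=0):
    """Survival rates implied by a mix of patient types.

    live_either_way: survive with or without the drug
    die_either_way:  die with or without the drug
    saved:           survive only if treated
    harmed:          survive only if untreated
    """
    total = live_either_way + die_either_way + saved + harmed
    control_rate = (live_either_way + harmed) / total
    treated_rate = (live_either_way + saved) / total
    return control_rate, treated_rate, treated_rate - control_rate


# The simple scenario from the text: 300 / 450 / 250.
simple = trial_summary(300, 450, 250)

# Saving 350 while killing 100 (remaining counts are an assumption).
mixed = trial_summary(200, 350, 350, harmed=100)

# A hypothetical study with a different patient mix: only 10% can be helped.
other_mix = trial_summary(300, 600, 100)

for label, (p0, p1, effect) in [("simple", simple), ("mixed", mixed), ("other mix", other_mix)]:
    print(f"{label}: control {p0:.2f}, treated {p1:.2f}, average effect {effect:.2f}")
```

The first two populations give the same 30% and 55% survival rates, so a trial cannot tell them apart from the averages alone. The third has the same kind of drug acting on a different mix of patients, and the average effect drops to 10 percentage points.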

I also shared some thoughts last month on costs and benefits, in particular:

When considering design for a clinical trial I’d recommend assigning cost and benefits and balancing the following:

– Benefit (or cost) of possible reduced (or increased) mortality and morbidity from COVID in the trial itself.
– Cost of toxicity or side effects in the trial itself.
– Public health benefits of learning that the therapy works, as soon as possible.
– Economic / public confidence benefits of learning that the therapy works, as soon as possible.
– Benefits of learning that the therapy doesn’t work, as soon as possible, if it really doesn’t work.
– Scientific insights gained from intermediate measurements or secondary data analysis.
– $ cost of the study itself, as well as opportunity cost if it reduces your effort to test something else.

This may look like a mess—but if you’re not addressing these issues explicitly, you’re addressing them implicitly. . . .

Whatever therapies are being tried, should be monitored. Doctors should have some freedom to experiment, and they should be recording what happens. To put it another way, they’re trying different therapies anyway, so let’s try to get something useful out of all that.

It’s also not just about “what works” or “does a particular drug work,” but how to do it. . . . You want to get something like optimal dosing, which could depend on individuals. But you’re not gonna get good discrimination on this from a standard clinical trial or set of clinical trials. So we have to go beyond the learning-from-clinical-trial paradigm, designing large studies that mix experiment and observation to get insight into dosing etc.

Also, lots of the relevant decisions will be made at the system level, not the individual level. . . . These sorts of issues are super important and go beyond the standard clinical-trial paradigm.

Clinical trials for vaccines

I haven’t thought about this at all so I’ll outsource the discussion to others.

Lumley:

There are over 100 potential vaccines being developed, and several are already in preliminary testing in humans. There are three steps to testing a vaccine: showing that it doesn’t have any common, nasty side effects; showing that it raises antibodies; showing that vaccinated people don’t get COVID-19.

The last step is the big one, especially if you want it fast. . . . We don’t expect perfection, and if a vaccine truly reduces the infection rate by 50% it would be a serious mistake to discard it as useless. But if the control-group infection rate over a couple of months is a high-but-maybe-plausible 0.2% that means 600,000 people in the trial — one of the largest clinical trials in history.

How can that be reduced? If the trial was done somewhere with out-of-control disease transmission, the rate of infection in controls might be 5% and a moderately large trial would be sufficient. But doing a randomised trial in setting like that is hard — and ethically dubious if it’s a developing-world population that won’t be getting a successful vaccine any time soon. If the trial took a couple of years, rather than a couple of months, the infection rate could be 3-4 times lower — but we can’t afford to wait a couple of years.

The other possibility is deliberate infection. If you deliberately exposed trial participants to the coronavirus, you could run a trial with only hundreds of participants, and no more COVID deaths, in total, than a larger trial. But signing people up for deliberate exposure to a potentially deadly infection when half of them are getting placebo is something you don’t want to do without very careful consideration and widespread consultation. . . .


Delaney:

One major barrier is manufacturing the doses, especially since we decided to off-shore a lot of our biomedical capacity in the name of efficiency (at the cost of robustness). . . . We want an effective vaccine and it may be the case that candidates vary in their effectiveness. There are successful vaccines that do not grant 100% immunity. The original polio vaccines were only 60-70% effective versus one of the strains, but that still led to a vast decrease in the number of infections in the United States once vaccination became standard.

So, clearly we want trials. . . . Now we get to the point about medical ethics. A phase III trial takes a long time to conduct and there is some political pressure for a fast solution. . . . if the virus is mostly under control, you need a lot of people and a long time to evaluate the effectiveness of a vaccine. People are rarely exposed so it takes a long time for differences in cases between the arms to show up. . . .

Another option is the challenge trial. Likely only taking a few hundred participants, it would have no more deaths than a regular trial. But it would involve infecting people, treated with a placebo(!!), with a potentially fatal infectious disease. There are greater good arguments here, but the longer I think about them the more dubious they get to me. Informed consent for things that are so dangerous really does suggest coercion. . . .

Combining these ideas

Organizing clinical trials for treatments . . . I just don’t think this is gonna happen.

But organizing clinical trials for vaccines? Maybe this is possible. Based on the above discussion, it seems likely that we’ll soon be seeing vaccine trials based on infecting healthy people with the virus and then seeing if they fight it off. If so, I have a few thoughts:

1. I don’t see why you need to give anyone placebos. If we have several legitimate vaccine ideas, let’s give everyone some vaccine or another. If they all work, and nobody gets sick, that’s great. If we’re testing 100 vaccine ideas, then we can guess that most of them won’t be so effective, so we’ll get placebos automatically.

2. As discussed above, coordinate all of these. Certainly no need for 100 different placebo groups.

3. Multilevel modeling all the way. Bayesian inference. Decision making based on costs and benefits, not statistical significance.
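To give a flavor of what partial pooling across vaccine arms does, here is a crude Python sketch. It is an empirical-Bayes shortcut for a normal-normal model, not full Bayesian inference, and a real analysis would fit the multilevel model properly (in Stan, say). The function name is mine, and the estimates and standard errors are made-up numbers for illustration, not results from any trial.

```python
from statistics import mean, variance


def partial_pool(estimates, std_errors):
    """Shrink each arm's estimate toward the overall mean.

    Crude method-of-moments version of a normal-normal hierarchical model:
    the between-arm variance tau2 is the observed variance of the estimates
    minus the average sampling variance, floored at zero.
    """
    mu = mean(estimates)
    tau2 = max(0.0, variance(estimates) - mean([se ** 2 for se in std_errors]))
    pooled = []
    for y, se in zip(estimates, std_errors):
        weight = tau2 / (tau2 + se ** 2)  # 0 = complete pooling, near 1 = no pooling
        pooled.append(mu + weight * (y - mu))
    return mu, tau2, pooled


# Hypothetical estimated reductions in infection rate for five vaccine arms.
estimates = [0.02, 0.10, -0.01, 0.30, 0.05]
std_errors = [0.08, 0.08, 0.08, 0.08, 0.08]

mu, tau2, pooled = partial_pool(estimates, std_errors)
for raw, shrunk in zip(estimates, pooled):
    print(f"raw {raw:+.2f} -> partially pooled {shrunk:+.2f}")
```

Each arm's estimate gets pulled toward the overall mean, and noisier arms get pulled more. The arm that looks best on the raw numbers still looks best, just less dramatically so, which is the right starting point for a cost-benefit decision that isn't chasing the statistical-significance jackpot.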

Can we make this happen?

P.S. Zad informs us that the above cat is exhausted from quarantine and wants a vaccine immediately if not sooner.