I originally gave this post the title, “New England Journal of Medicine makes the classic error of labeling a non-significant difference as zero,” but as I was writing it I thought of a more general point.

First I’ll give the story, then the general point.

**1. Story**

Dale Lehman writes:

Here are an article and editorial in this week’s New England Journal of Medicine about hydroxychloroquine. The study has many selection issues, but what I wanted to point out was the major conclusion. It was an RCT (sort of) and the main result was “After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19….” This was the conclusion when the result was “The incidence of new illness compatible with Covid-19 did not differ significantly between participants receiving hydroxychloroquine (49 of 414 [11.8%]) and those receiving placebo (58 of 407 [14.3%]); the absolute difference was -2.4 percentage points (95% confidence interval, -7.0 to 2.2; P=0.35).”

The editorial based on the study said it correctly: “The incidence of a new illness compatible with Covid-19 did not differ significantly between participants receiving hydroxychloroquine ….” The article had 25 authors, academics and medical researchers, MDs and PhDs; I did not check their backgrounds to see whether or how many statisticians were involved. But this is Stat 101 stuff: the absence of a significant difference should not be interpreted as evidence of no difference. I believe the authors, peer reviewers, and editors know this. Yet they published it with the glaring result ready for journalists to use.

To add to this, the study of course does not provide the data. And the editorial makes no mention of their recent publication (and retraction) of the Surgisphere paper. It would seem that that whole episode has had little effect on their processes and policies. I don’t know if you are up for another post on the subject, but I don’t think they should be let off the hook so easily.

Agreed. This reminds me of the stents story. It’s hard to avoid binary thinking: the effect is real or it’s not, the result is statistically significant or it’s not, etc.
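As an aside, the numbers quoted above can be roughly reproduced with a quick normal-approximation sketch. The only assumption here is the standard Wald formula for the standard error of a difference in proportions; the article’s P=0.35 evidently comes from a different test, so the p-value below only comes out close, while the difference and interval match the reported values:

```python
from math import sqrt, erf

# Reported counts from the quoted NEJM result
x1, n1 = 49, 414   # illness events, hydroxychloroquine arm
x2, n2 = 58, 407   # illness events, placebo arm

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2  # absolute risk difference

# Wald (normal-approximation) standard error for a difference in proportions
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se, diff + 1.96 * se  # 95% confidence interval

# Two-sided p-value under the normal approximation
z = diff / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(f"difference: {diff * 100:.1f} pp")          # -2.4 pp, as reported
print(f"95% CI: ({lo * 100:.1f}, {hi * 100:.1f})") # (-7.0, 2.2), as reported
print(f"z = {z:.2f}, p ≈ {p_value:.2f}")
```

Note that the interval (-7.0, 2.2) is wide enough to include both a meaningful protective effect and a small harm, which is exactly why “did not prevent illness” overstates what the data show.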

**2. The general point**

Indeed, the standard way that statistical hypothesis testing is taught is a 2-way binary grid, where the underlying truth is “No Effect” or “Effect” (equivalently, Null or Alternative hypothesis) and the measured outcome is “Not statistically significant” or “Statistically significant.”

*Both these dichotomies are inappropriate.* First, the underlying reality is not a simple Yes or No; in real life, effects vary. Second, it’s a crime to take all the information from an experiment and compress it into a single bit of information.

Yes, I understand that sometimes in life you need to make binary decisions: you have to decide whether to get on the bus or not. But. This. Is. Not. One. Of. Those. Times. The results of a medical experiment get published and then can inform many decisions in different ways.

> “it’s a crime to take all the information from an experiment and compress it into a single bit of information”

Don’t we do that sort of thing all the time? We have summary statistics rather than only communicating large numbers of datapoints, we use maps even though they don’t entirely match the territory, we create formal models that deliberately abstract away details we know exist in reality.

Summary stats are measures of the distribution of the real data. The yes/no result of a null-hypothesis significance test gives no information about the data.

Indeed, you could say it’s the whole point of science. By finding underlying rules and patterns, we generalize to understand the world.

Wonks, Anon:

Data compression is fine. Compressing the data into a single yes/no bit is going too far.

+1 to this. Most of the stuff we talk about on this blog would be well summarized with a handful of floats. Call it fewer than 100 floats of 32 bits each, so 3200 bits, compared to 1.

280 words.

Data compression is a big issue for me in economics. For instance, in cost-benefit analysis, in order to arrive at a single metric, a discount rate is applied to all the time-varying effects of an action. This erases the time pattern entirely: it doesn’t matter any more whether you are dealing with a smallish cost today or a very large one 100 years from now. It seems to me the question is not “should we compress” but to what degree? Eliminating the time structure of costs and benefits erases useful, sometimes crucial, information. But a usable representation of the time structure will be a compression too. So the matter is context-specific and requires judgment.

This is related to AG’s point about the variation in effects. Condensing all evidence of effects into a single summary metric is usually going too far. But leaving variation at the individual (or unit of observation) level is going too far in the other direction. In my opinion, it all comes down to the questions one is asking (or should be) and compressing the data to the level that matches.

In fairness, there were lots of “bits” of information quoted in Dale Lehman’s summary of the article in question. Yes, one of them was the significant/not-significant dichotomy, the objectionable “bit.” Those quoted sections would be improved if, instead of “did not prevent illness,” the wording were something like “there is only inconclusive evidence of a weak protective effect.” But surely anyone remotely numerate is going to look at “49 of 414” versus “58 of 407” and make their own dichotomous decision. This particular trial per se fails to convince us of the efficacy of hydroxychloroquine.

It is truly scary to listen to top virologists, immunologists, epidemiologists, and others at the forefront of this pandemic using the word ‘significance’ as they see fit and explaining it in a dichotomous manner.

> …and the measured outcome is “Not statistically significant” or “Statistically significant.”

I don’t understand. I get that “no effect” isn’t a binary comparison to a statistically significant effect. But what’s wrong with the binary of “statistically significant” vs. “not statistically significant?”

Or are you just saying that this isn’t a binary outcome measure of whether there’s an effect?

Statisticians should look on the bright side here. Had these authors not been properly schooled in the challenges of randomization, they could easily have concluded that “pre-symptomatic use of hydroxychloroquine reduces COVID-19 induction by 17% (2.4/14.3).”

I don’t see much of a problem here. From a statistical point of view, Andrew is of course correct.

But from the perspective of medical decision making, the editorial statement reflects the prior that all drugs are guilty until proven innocent.

Unless someone brings the evidence, a drug doesn’t work.

The phrasing is loose but it reflects how the trial result will be interpreted in practice.

Isn’t the fact that a dichotomized interpretation will inevitably be applied in practice the crux of the problem, though?

Surely a more productive way of thinking about treatment justification is how much effect the treatment will have, how much variation there is around that effect, and how uncertain those estimates are.

Perhaps it appears safe to adopt the attitude that nothing works – until there is evidence that it does. The problem is that when there is some evidence that it might work, the conclusion becomes it *does* work. That is, it leads directly to dichotomous thinking. Also, it leads to ignoring all evidence unless it passes the magical .05 threshold.

I hate arbitrary cutoffs and thresholds as much as anyone, but I still get the impression that in the discussion here, even if the numbers had come out as “59 of 414” versus “58 of 407” (i.e., as close to identical proportions as possible), it would still be unacceptable to state a conclusion that in any way implied no benefit was shown. Maybe I overstate the dogmatism, but sometimes there’s a difference between “dichotomous thinking” and simply stating the obvious.

Name:

I recommend reporting the confidence interval (or uncertainty interval) for the difference. Or just an estimate and standard error. In this example you’d get an estimate of 0.000 +/- 0.024. The 0.000 doesn’t really mean anything here—it’s just luck—and if the interval were something like 0.010 +/- 0.024 it should still get the same interpretation. Not that the effect is zero but that there’s no evidence it’s nonzero.

I feel like clinicians would interpret something like 0.010 +/- 0.024 as “0.01 and some small error,” and we should instead report it as [-0.014, 0.034], which would better convey the uncertainty, since lazy thinkers wouldn’t bother with the calculation of the central value.
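For what it’s worth, here is a minimal sketch of the two reporting styles, using the commenter’s hypothetical numbers (0.010 taken as the estimate, 0.024 as the half-width of the interval):

```python
# Hypothetical numbers from the comment above: estimate and half-width
est, half_width = 0.010, 0.024

# "estimate +/- error" style: the reader must do arithmetic to see the range
plus_minus = f"{est:.3f} +/- {half_width:.3f}"

# Explicit-interval style: the endpoints are spelled out directly
lo, hi = est - half_width, est + half_width
interval = f"[{lo:.3f}, {hi:.3f}]"

print(plus_minus)  # 0.010 +/- 0.024
print(interval)    # [-0.014, 0.034]
```

The point of the second style is that the interval’s overlap with zero (and with practically meaningful effect sizes) is visible at a glance, with no mental arithmetic required.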

I think you are being kinda uncharitable here. Sure, in an intro stats textbook that would be an error but this is in an academic journal.

So when it comes time to interpret the sentence

“After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19” we have two options.

1) Read in the (commonly unstated) standard qualification for any negative empirical result: “we didn’t find that…”

2) Assume they literally intended the mistake discussed in the OP.

Given that it’s in an academic journal, I lean toward them (admittedly mistakenly) simply assuming that 1 was the obvious way it would be read.

Is NEJM really an “academic” journal? Many (most?) of the authors are medical professionals, not academic researchers. And the intended readership is usually medical professionals as well (plus the press, of course).

I think you are way too generous to the NEJM’s readership. These misunderstandings are incredibly widespread.

Peter:

I agree with Simon. See this article of ours for an example where a non-rejection (in that case, a p-value of 0.20 that ended up being a p-value of 0.09 after appropriate analysis) was taken as conclusive evidence of no effect.

If we interpret the Hippocratic “first do no harm” as “only do harm with probability alpha” then wouldn’t treatment be precluded if the posterior interval of size 1-alpha includes zero? That doesn’t seem like such a bad decision criterion.

Rick:

I don’t think so. It all depends on the costs and benefits. Also, the p-value is not the probability of the drug doing harm. For example, consider a drug whose effects are highly variable, which helps 55% of the people but hurts 45%. The drug has a 45% chance of doing harm, but you still might prescribe it.
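That last point can be illustrated with a quick simulation sketch. The 55%/45% split is from the comment; the effect magnitudes (+2 health units when the drug helps, -1 when it hurts) are illustrative assumptions of mine, chosen so the expected effect is positive even though the chance of harm is 45%:

```python
import random

random.seed(1)  # fixed seed so the simulation is reproducible

# Hypothetical drug from the comment: helps 55% of patients, hurts 45%.
# Assumed (illustrative) magnitudes: +2.0 units of benefit, -1.0 of harm.
def patient_outcome():
    return 2.0 if random.random() < 0.55 else -1.0

draws = [patient_outcome() for _ in range(100_000)]

p_harm = sum(d < 0 for d in draws) / len(draws)  # chance the drug does harm
mean_effect = sum(draws) / len(draws)            # expected effect per patient

# True values: P(harm) = 0.45, E[effect] = 0.55*2.0 - 0.45*1.0 = 0.65
print(f"P(harm) ≈ {p_harm:.2f}, mean effect ≈ {mean_effect:.2f}")
```

Under these assumed magnitudes the drug harms 45% of patients yet has a clearly positive expected effect, so a decision rule based only on “probability of harm” would point the opposite way from one based on expected costs and benefits.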