This is Erik. Stephen Senn, Frank Harrell and I wrote a paper about the malpractice of dichotomizing numerical outcomes in clinical trials. The paper is pretty short, but there’s an even shorter summary (with many interesting comments) on Frank’s Datamethods Discussion Forum. We write in the abstract:
We have studied 21 435 unique randomized controlled trials (RCTs) from the Cochrane Database of Systematic Reviews (CDSR). Of these trials, 7224 (34%) have a continuous (numerical) outcome and 14 211 (66%) have a binary outcome. We find that trials with a binary outcome have larger sample sizes on average, but also larger standard errors and fewer statistically significant results. We conclude that researchers tend to increase the sample size to compensate for the low information content of binary outcomes, but not sufficiently.
We continue:
In many cases, the binary outcome is the result of dichotomization of a continuous outcome, which is sometimes referred to as “responder analysis”. In those cases, the loss of information is avoidable. Burdening more participants than necessary is wasteful, costly, and unethical.
Stephen wrote a short post on LinkedIn where he doesn’t mince words:
Year in, year out, for a length of time which is only awarded to statistical survivors (no, this is not about immortal time bias), I have been banging on about the stupidity, the criminal vandalism, the wanton destruction of information involved in dichotomisation. It not only inflates standard errors and increases necessary sample sizes, thereby blurring inferences, while bloating budgets, delaying development, and obliterating other opportunities but it also rots brains, causing causal confusion via the number needed to trick.
In the paper, we use a simple method to get an approximate sense of the loss of information across the clinical trials in the Cochrane Database of Systematic Reviews (CDSR). We also made a shiny app which does two things. First, it calculates the loss of information after a “responder analysis” has been perpetrated. Second, the app can take a sample size calculation for a two-group parallel comparison of proportions and (assuming the proportions result from dichotomizing a continuous outcome) calculate the required sample size if one would not dichotomize. We hope that this will discourage would-be dichotomizers.
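To give a rough feel for the size of the loss, here is a minimal sketch. It is not the method of the paper or of the shiny app; it uses the textbook large-sample approximation for a normal outcome with equal variances in both groups and a small shift in the mean. The function name `relative_efficiency` and the chosen cutoffs are illustrative assumptions.

```python
from math import pi
from statistics import NormalDist


def relative_efficiency(cutoff_z: float) -> float:
    """Approximate efficiency of comparing proportions beyond a cutoff,
    relative to comparing means, for a normal outcome with equal variances
    and a small mean shift. cutoff_z is the cutoff in SD units from the mean.
    """
    std = NormalDist()          # standard normal
    p = std.cdf(cutoff_z)       # proportion below the cutoff
    return std.pdf(cutoff_z) ** 2 / (p * (1 - p))


if __name__ == "__main__":
    for z in (0.0, 0.5, 1.0, 1.5):
        eff = relative_efficiency(z)
        # 1 / eff is the factor by which the sample size must grow
        # to match the precision of the undichotomized analysis.
        print(f"cutoff {z:+.1f} SD: efficiency {eff:.3f}, "
              f"sample size factor {1 / eff:.2f}")
```

Under these assumptions a median split keeps 2/pi (about 64%) of the information, and cutoffs further from the centre keep less.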
Erik:
Agreed, and excellent use of the Zombies category!
Just for you, here are two more examples where people way overestimated the amount of information in binary data:
– The research paper from 1985 that purported to debunk the hot hand in basketball. The debunking was in error–there really is a hot hand, and it was there all along. There are various reasons those researchers got it wrong, and one part of the story is sample size. They were predicting success based on success in previous shots. Success is a noisy outcome–it’s binary!–and success in previous shots is a noisy predictor. Put these together and you get attenuation (an underestimate of effect size) and more noise. A noisy estimate of a smaller estimand: that’s a recipe for apparent null findings, even in the presence of a large underlying effect.
– That paper, hyped by the Freakonomics team, claiming that attractive parents were more likely to have girls. That was based on data from 3000 births . . . a pretty big sample size, it would seem? But, no, to detect any plausible difference, they would’ve needed a sample size 100 times larger; see here. Binary data be noisy!
Whether a shot goes in, and the sex of a baby, really are binary variables, so the problem in these cases was not dichotomization but rather the related problem of overrating the amount of information from binary data.
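A quick way to see how noisy binary data are is to look at the standard error of a difference in proportions. The sketch below assumes, purely for illustration, that the births are split equally between two groups and that the proportion of girls is close to 0.5; the actual group sizes in that study may have been different.

```python
from math import sqrt


def se_diff_proportions(n_total: int, p: float = 0.5) -> float:
    """Standard error of the difference between two sample proportions,
    assuming two equal-sized groups and a common true proportion p."""
    n_per_group = n_total / 2          # assumption: equal split
    return sqrt(2 * p * (1 - p) / n_per_group)


if __name__ == "__main__":
    for n in (3_000, 300_000):
        se = se_diff_proportions(n)
        print(f"n = {n:>7}: standard error = {100 * se:.2f} percentage points")
```

With 3000 births the standard error is nearly 2 percentage points, so only a difference of several percentage points could be detected reliably. A sample 100 times larger shrinks the standard error by a factor of 10.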
I think this is part of a more general phenomenon which has always bothered me: categorization of continuous variables (e.g. age), for which dichotomisation is the extreme case. My suspicions about the practice are a combination of the loss of information (which the article in this post usefully quantifies) and the potential to find “significant” (itself a dangerous dichotomisation) effects through artificial grouping of continuous variables.
But it is often pointed out that much of medicine relies on such things. Diagnoses are often dichotomised. Treatments often rely on grouped age or diagnostic values (e.g. blood pressure). I’ve never been convinced that these were good excuses, but the practice is common and it avoids the need to understand incremental effects that continuous variables (whether they are the response variables or the features) would require. I’d rather see people educated to understand incremental effects (and particularly incremental probabilities), but in the absence of that perhaps these categorizations serve a purpose.
I’ve heard many MDs say that. The fallacy is that the fact that you want to ultimately make a binary decision (many decisions in medicine only seem to be binary but that’s for another day) has nothing to do with making the data used to inform the decision into binary variables. Example: a track coach doesn’t change her stopwatch to say “fast” and “slow” even though a runner either will or will not be selected for the Olympics.
Don’t these estimates of the loss of information based on the probits also assume equal variances in the groups, though (beyond assuming normal distributions)? Say that the dependent variable is normally distributed with expected value zero in both groups and standard deviation 1 in the first group, but standard deviation 2 in the second group. Then the probability to be below -.5 is about .31 for the first group, but .40 for the second group. So dichotomizing can also produce spurious relationships, amongst other things. And in this case we have an apparent increase in power (actually an increase in Type I error), which may be another reason for the incorrect belief that dichotomization removes measurement error (as noted in the summary).
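The two probabilities in this example can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
from statistics import NormalDist

cutoff = -0.5

# Same mean (0) in both groups, but different standard deviations.
p_group1 = NormalDist(mu=0.0, sigma=1.0).cdf(cutoff)
p_group2 = NormalDist(mu=0.0, sigma=2.0).cdf(cutoff)

if __name__ == "__main__":
    print(f"P(X < {cutoff}) with SD 1: {p_group1:.2f}")
    print(f"P(X < {cutoff}) with SD 2: {p_group2:.2f}")
```

The means are identical, so the standardized mean difference is zero, yet the two proportions below the cutoff differ by about nine percentage points.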
Mathias: In the paper, we write about the assumption of normality and equal variance:
“(…) not every continuous outcome has a normal distribution. However, our approach only assumes that there exists a transformation that achieves normality and equal variance, and that such a transformation (or one nearly like it) is applied. The common practice of taking logarithms of a skewed, non-negative outcome is a good example of such a transformation. If such a transformation is not applied, then it is well known that the relative efficiency of the Mann–Whitney test relative to the t-test is 3/pi=0.95, which is still much more efficient than dichotomizing.”
You’re right that a treatment which doesn’t affect the mean but does affect the variance can cause a difference in proportions. Whether that should be considered spurious or a type I error is to some extent a matter of opinion.
Erik,
You write: “our approach only assumes that there exists a transformation that achieves normality and equal variance”.
But is this realistic? That is, is it such an “only” assumption? I’m no expert on medical research, but I would assume that if they measure blood pressure, or BMI, or the amount of virus in the blood, or something like that, and then dichotomize, then they would do so directly on the scale of the original continuous variable. They might dichotomize based on BMI = 30, or systolic blood pressure = 141 mmHg, and not “systolic blood pressure transformed to have equal standard deviation between groups” = ???, for example. And if the dichotomization is based on the original scale, then variations in distribution beyond the mean will play into the result. Also, if the dichotomization was done after such a transform, I would think this would be highly problematic in itself, no? Because now we can’t even be sure that 0-1 on the dichotomized variable means the same thing in both groups. If so, that seems like the first thing that should be criticized.
I would also guess that in the situations where there is some more substantial difference in distributions between groups (beyond the mean), researchers are more likely to dichotomize. That is, the goal could be to not have to do some other transform to yield normal distributions with equal variances.
I agree that it may be a matter of perspective whether the probability difference in my example is a spurious result or not. But I called it spurious based on your calculations where the probit difference equaled Cohen’s d in the continuous variable. So, under the assumption that a standardized difference in means between groups on the continuous variable is the real effect size we want to estimate. In that example, Cohen’s d equals 0, but there is a difference in probabilities, and therefore also a difference in probits. However, if we don’t care about such a standardized difference and are actually interested in the difference in probabilities, then the power calculations don’t seem as informative. And I would think that the reason why medical practitioners think that dichotomization can remove error variation is because of cases where the standardized mean difference and the difference in probabilities give quite different results. For example, a treatment may only yield a small standardized mean improvement, but could also decrease the variance, and would therefore make it comparably more probable that you land above some cutoff under which you’re considered unhealthy than what is suggested by the Cohen’s d – probit comparison for calculating relative power in the article.
This may all seem like I’m defending dichotomization, which is not my intent (although as others have commented, it could sometimes be the case that it is really a dichotomous version that is of real interest). I actually currently have a paper under review where I show the modelling that needs to be done to be able to account for such collapsing of scores in linear models (whether that collapsing is done by researchers, or as a result of the measure in the first place), and I’m planning future papers to show the errors that can arise if this is not modelled, or modelled incorrectly. One point in those papers is that dichotomization can lead to quite different results depending on the manner in which variable scores are collapsed in the measurements (beyond using cutoffs), and on the underlying distribution, so I’m more thinking that it does not illustrate the issue in its full light.
Mathias: Sure, I agree with what you’re saying.
Our main goal was to get some sense of the loss of information in actual practice, i.e. across thousands of trials of the Cochrane database. This involved quite some simplification, and the quote in my earlier response to you comes from a whole list of caveats in section 6.
Basically, we’re comparing the power of a chi-squared test of two proportions to the power of a t-test of the continuous variable. For the chi-squared test, it makes no difference if we cut off the BMI at 30, or log(BMI) at log(30). But for the t-test it does make a difference if we’re working with BMI or log(BMI). To have a sensible comparison of the chi-squared test to the t-test, we’re assuming that our hypothetical researcher would do the t-test after a suitable transformation. Instead of doing such a transformation, they could also do a Wilcoxon test, but that wouldn’t make much of a difference in terms of power.
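The invariance point can be illustrated with a small sketch. The BMI values below are made up for illustration only, and the t-statistic is a plain Welch-type statistic computed by hand rather than a call to any particular testing library.

```python
from math import log, sqrt
from statistics import mean, variance

# Hypothetical BMI values, invented purely for this illustration.
bmi_treated = [24.1, 27.8, 31.5, 29.9, 35.2, 26.4]
bmi_control = [28.3, 32.7, 30.4, 45.0, 33.4, 25.9]


def count_above(values, cutoff):
    """Number of observations strictly above the cutoff."""
    return sum(v > cutoff for v in values)


def welch_t(a, b):
    """Welch-type t-statistic for a difference in means."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


log_treated = [log(v) for v in bmi_treated]
log_control = [log(v) for v in bmi_control]

# The 2x2 table is the same on either scale ...
counts_raw = (count_above(bmi_treated, 30), count_above(bmi_control, 30))
counts_log = (count_above(log_treated, log(30)), count_above(log_control, log(30)))

# ... but the t-statistic is not.
t_raw = welch_t(bmi_treated, bmi_control)
t_log = welch_t(log_treated, log_control)

if __name__ == "__main__":
    print("counts above cutoff, raw scale:", counts_raw)
    print("counts above cutoff, log scale:", counts_log)
    print(f"t on raw scale: {t_raw:.3f}, t on log scale: {t_log:.3f}")
```

Because the logarithm is strictly increasing, the counts on either side of the cutoff cannot change, so any test based only on those counts gives the same answer on both scales.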
This seems to assume that you can do the same strictly increasing transform in both groups (e.g. log(X)) to get normally distributed variables with equal variance between the groups though? In my normally distributed example above, you would need to divide the values in the second group by 2, but not divide anything (or divide by 1) in the first group for them to have equal variances while still being normally distributed. I’m not sure it is possible to use an equivalent transform across groups to fix this.
Mathias: There are many tests for a location shift such as Student’s t-test (equal variances), Welch’s t-test (unequal variances) and the Mann–Whitney test. The t-tests are already quite forgiving, and all 3 tests have very similar power. So when we’re comparing the power of the chi-squared test to the t-test (or z-test, actually), we’re really just assuming that the researcher does something sensible to test for a location shift of the numerical (continuous) variable.
In the absence of a shift, these location tests have no power to detect a difference in the variance. The chi-squared test can have power in that case. It depends on the context if that’s a good or bad thing. Of course, there are also dedicated tests for a change in the variance of a numerical variable.
Yes, the power for these tests will be similar (albeit smaller when unequal variances need to be taken into account), but the power for the dichotomized test can be larger. Take my example with the normally distributed variables, but now say that they have N(0.001,1) and N(0,2) distributions [N(mean,standard deviation)]. If you cut at -0.5, then you will suddenly have larger power in the chi-squared test than when variances are equal, but that will not be true for the Student’s t-test or Welch’s t-test (or Mann-Whitney, I assume in this case, although it tests for differences in distribution, and so can be more and more powerful as the sample size increases even if there is no difference in mean, or no difference in median, which further complicates this comparison for different distributions). To have 80% power at a 0.05 alpha level to detect a true difference in proportions of 0.31 and 0.40 (roughly the probabilities you get in this case), you need 443 participants in each group. You need many more to have the same power with a t-test for that difference in means on the original variables (about 39 million per group; these numbers are taken from an online power calculator). However, this dichotomization is at the cost of also providing a higher chance of Type I error, or chance of error in the wrong direction if the direction of that difference in means is what is of interest. As noted by Alfred below, dichotomization can also go hand in hand with p-hacking, and differences like the ones above can make dichotomization an apparently appealing method to reach significance (consider the researcher who thinks “but we know that dichotomization can remove error, because results can get stronger [read ‘more significant’] after dichotomization”). 
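The two sample sizes quoted here can be reproduced approximately with standard normal-approximation formulas. The sketch below is not the online calculator mentioned in the comment; the function names are illustrative, and other calculators (for example with a continuity correction) may give slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test of two proportions
    (normal approximation, pooled variance under the null)."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((num / (p1 - p2)) ** 2)


def n_two_means(delta, sd1, sd2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided z-test of a mean difference delta
    with (possibly unequal) standard deviations sd1 and sd2."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return ceil((z_a + z_b) ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2)


if __name__ == "__main__":
    print("per group, proportions 0.31 vs 0.40:", n_two_proportions(0.31, 0.40))
    print("per group, mean difference 0.001:   ", n_two_means(0.001, 1.0, 2.0))
```

Under these formulas the proportions comparison needs a few hundred participants per group, while the comparison of means needs tens of millions, in line with the figures in the comment.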
I think the point about the risk of lowered power is worthwhile to point out, but I would have wished for some more recognition of these other issues, which do not really come forward when it reads like the distributional assumption is not that consequential.
This seems like a good argument to me, but some caution is advised.
Back when Tamiflu was available to the public, I read about a clinical trial in which the conclusion was that Tamiflu reduced the duration of flu symptoms by 12 hours on average or something like that. But that is not how Tamiflu works. If you took it at the onset of symptoms – the first sniffle – there was a good chance that you would not get sick. If you took it even a few hours later, it did little or nothing to change the course of the disease. The results were treated as a continuous variable – hours of being sick – rather than the sick/not sick dichotomy that actually resulted from usage of the drug.
Matt: I assume you’re right that “hours of being sick” is not a good way to quantify the effect of Tamiflu. However, I’d be surprised if there aren’t any better ways than the binary sick/not sick. I do agree that there are cases where a binary outcome makes sense.
It’s strange to me that this needs to be said. What is usually the reason given for dichotomisation?
9P: That’s why it’s in the zombies category!
I’m not sure why dichotomisation remains popular. It’s often said that dichotomies make sense to medical researchers because in clinical practice, their diagnoses and treatment decisions are often based on cut-offs. I also think they have a sense that dichotomisation is a good way to clean up messy, noisy numerical data which may have skewed distributions and outliers. Of course, dichotomisation actually just adds noise!
Hi Erik
The behavioural driver for dichotomization of continuous endpoints seems to be widespread misunderstanding about the criteria needed to establish causality at the level of individual clinical trial subjects.
Misguided researchers dichotomize continuous outcome measures with the intent to classify individual patients as “responders” or “non-responders” to the therapy being tested. By applying these labels, drug sponsors give regulators the false impression that they can somehow anticipate the “proportion of patients exposed to a treatment who will experience more versus less marked improvement.” This piece of information is the holy grail of drug development. Everyone wants it, but many don’t understand that it’s usually unattainable. And understanding *why* it’s usually unattainable requires deep understanding of the type of information we need to infer causality for individual patients.
Those who initially proposed “responder analysis” (a group that seems to overlap with proponents of the concept of “number needed to treat”) seem not to have understood that the group-level causality identifiable through a trial usually *can’t* be mapped to causality at the level of individual trial subjects (given the way most trials are designed). For many, this will be a *very* counter-intuitive concept.
In short, proponents of “responder analysis” think that observing outcomes among individual trial subjects is a valid way to identify heterogeneity in treatment effect, not realizing that this is usually not a causally-valid exercise.
Hi ES: Yes, I agree that that’s another reason for dichotomisation. In the paper we focus mainly on the loss of information (and statistical power) but causal confusion is also a bad consequence.
Ultimately it’s because such studies keep getting funded. No argument against it will matter unless it leads to funding drying up.
The dominant perception is that medical research is largely doing ok, and problems like this are just minor inefficiencies. So it’s not worth changing.
The reality is replication rates worse than we would get from coming up with an idea and flipping a coin; there is essentially no quality control at all. But that is not the perception.
Errors in the measurement of a predictor get reflected directly into noise in the response variable. Errors in the predictor caused by quantization, which amount to measurement errors, are going to be close to a uniform distribution so the result in the response will be very non-normal. The result can have a larger variance than might have been expected.
Here’s another example of criminal dichotomisation from the land of diagnostic tests:
It is standard practice to construct an empirical ROC curve of a continuous test result and then dichotomise it at the point of the curve closest to the upper left corner – test results greater than this are then lumped together and given a positive likelihood ratio, while lesser test results are given a negative likelihood ratio.
For example, the ROC curve cutoff for the diagnosis of acute pancreatitis using a blood test for lipase is 300 IU, and levels higher than this are said to have a positive likelihood ratio of about 9. A novice would be forgiven for thinking that a lipase level of say 301 substantially increases the diagnostic probability of pancreatitis.
But here’s the kicker: the likelihood ratio of an individual lipase level is actually the slope of the ROC curve at that individual point. A moment’s reflection will make it clear that the slope at 301 will be very close to 1, and provides very little diagnostic information. The slope for a level of say 3000 is much higher but both these values are lumped together.
TL;DR – diagnostic likelihood ratios are only useful for binary test outcomes
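The gap between the lumped likelihood ratio and the likelihood ratio of an individual value can be sketched with a toy model. Everything here is an assumption for illustration: test results are taken to be normal with equal variance in the healthy and diseased groups, on an arbitrary standardized scale (not lipase units), and the cutoff is placed midway between the two means, where sensitivity equals specificity.

```python
from statistics import NormalDist

# Hypothetical binormal model on a standardized scale.
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.5, sigma=1.0)
cutoff = 1.25  # midpoint between the two means


def lr_positive(c):
    """Likelihood ratio of the dichotomized result 'value above c'."""
    return (1 - diseased.cdf(c)) / (1 - healthy.cdf(c))


def lr_at_value(x):
    """Likelihood ratio of one specific value x: the ratio of the
    two densities, which is the slope of the ROC curve at that point."""
    return diseased.pdf(x) / healthy.pdf(x)


if __name__ == "__main__":
    print(f"LR+ for 'above cutoff':      {lr_positive(cutoff):.1f}")
    print(f"LR just above the cutoff:    {lr_at_value(cutoff + 0.01):.2f}")
    print(f"LR far above the cutoff (4): {lr_at_value(4.0):.0f}")
```

In this toy model the dichotomized test has a positive likelihood ratio between 8 and 9, while a value just above the cutoff has a likelihood ratio barely above 1 and a value far above it has one in the hundreds.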
The use and misuse of ROC curves in biomedical research is one of the banes of my entire career.
Is this an alternative to dichotomization?
An ROC-type measure of diagnostic accuracy when the gold standard is continuous-scale
https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.2228
Dichotomization is a sure-fire tip-off for p-hacking. I know of an article that constructed an index of something as the first principal component of three variables. They then turned this into a binary variable, slicing it at the median. The binary variable has the result they want in their regressions; the continuous variable (the actual p.c.) does not. The authors provide no defense of their procedure.
Erik:
This will be a useful paper for statistical consultants to use in rebuttals to medical colleagues, regulators, and journal referees who request some sort of “responder analysis” in protocols, drug approval summaries, or publications. I plead guilty to agreeing to do this simply because I didn’t want to argue and lacked published reinforcement.
Out of curiosity, have you ever examined the effects of dichotomization when the outcome is a mixture of a continuous variable and a point mass at zero? In a number of trials that I’ve been a part of (seizures, migraines, panic attacks) the endpoint is a mixture with a non-zero percentage of zeros. Is it still bad to look at the percentages of zeros as an endpoint? These sometimes go under the name “remission analyses”.
RoyT: No, I haven’t done that. I guess people use zero-inflated models and I would think that those are much more powerful than dichotomising. However, when there are lots of zeros, there may be two separate processes at play. In that case, it can make sense to do two separate analyses: One to compare the proportion of zeros, and one to compare the number of seizures (migraines, panic attacks,…) among the patients who have them.
In my experience clinicians are low bandwidth processors. They must collapse continuous information in order to ingest it. Try handing them a 2-way interaction of continuous variables or a corresponding lossy 2×2 table. You’ll get a lot more traction with the latter.
Increasing the bandwidth of the downstream consumers needs more work (AI assisted inference might be useful?)
Erik: I’m happy to see that you take this issue on. Frank is probably already aware that there is a substantial literature on what is called distributional regression which estimates models of the form I(y > a) = x’b for various values of a as an alternative to quantile regression. I wouldn’t say that this is always unsound, but often it is.
I might also mention the blog: Live Free or Dichotomize by Lucy D’Agostino McGowan. I especially appreciated the 2016 post called “Hill for the data scientist” https://livefreeordichotomize.com/posts/2016-12-15-hill-for-the-data-scientist/index.html which collects some priceless xkcd numbers on causal inference.
I remember having this discussion with a development economist and he agreed that the continuous outcome was more informative, but said that effects on arbitrary cut-offs (e.g., common low birth weight or malnutrition thresholds) were what was required to get the attention of policymakers who don’t understand the continuous measure. I wonder if something similar happens in medicine with classifications such as obesity and high blood pressure.