Debate over categorizing continuous variables

In a comment to an entry linking to my paper on splitting a predictor at the upper quarter or third and the lower quarter or third, MV links to this article by Frank Harrell on problems caused by categorizing continuous variables:

1. Loss of power and loss of precision of estimated means, odds, hazards, etc.

2. Categorization assumes that the relationship between the predictor and the response is flat within intervals; this assumption is far less reasonable than a linearity assumption in most cases

. . .

12. A better approach that maximizes power and that only assumes a smooth relationship is to use a restricted cubic spline . . .

My reply:

I agree that it is typically more statistically efficient to use continuous predictors. But, if you are discretizing, our paper shows why it can be much more efficient to use three groups (thus, comparing “high” vs. “low”, excluding “middle”), rather than simply dichotomizing into high/low.

As discussed in the paper, we specify the cutpoints based on the proportion of data in each category of the predictor, x. We’re not estimating the cutpoints based on the outcome, y. (This handles points 7, 8, 9, and 10 of the Harrell article.)
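
To make the cutpoint rule concrete, here is a minimal sketch of a high-versus-low comparison in which the groups are defined only by the ranks of x. It is an illustration under stated assumptions, not the code from the paper: the function name `high_low_difference` and the parameter `frac` are hypothetical, and ties in x are broken by sort order.

```python
import statistics


def high_low_difference(x, y, frac=1/3):
    """Mean of y in the top `frac` of x minus mean of y in the bottom `frac` of x.

    The groups are chosen from the ordering of x alone; y is never used
    to pick the cutpoints. `frac` is a hypothetical parameter name.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not 0 < frac <= 0.5:
        raise ValueError("frac must be in (0, 0.5]")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")

    # Size of each extreme group, capped so the two groups never overlap.
    k = min(n // 2, max(1, round(n * frac)))

    # Indices of the observations, ordered from smallest x to largest x.
    order = sorted(range(n), key=lambda i: x[i])
    low = [y[i] for i in order[:k]]     # bottom group ("low")
    high = [y[i] for i in order[-k:]]   # top group ("high")

    # The middle group, order[k:n-k], is simply left out of the comparison.
    return statistics.fmean(high) - statistics.fmean(low)


if __name__ == "__main__":
    x = [3.1, 0.2, 5.4, 1.7, 4.0, 2.5]
    y = [6.0, 1.0, 11.0, 3.5, 8.2, 5.1]
    print(high_low_difference(x, y, frac=1/3))
```

Because the split depends only on the proportions of x, changing y cannot move the cutpoints, which is the property the paragraph above relies on.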

We’re not assuming that the regression function is flat within intervals or discontinuous between intervals. We’re just making direct summaries and comparisons. That’s actually the point of our paper, that there are settings where these direct comparisons can be more easily interpretable.

Just to be clear: I’m not recommending that discretized predictors be used for articles in the New England Journal of Medicine or whatever, in an area where regression is a well understood technique. I completely agree with Harrell that it’s generally better to keep variables as continuous rather than try to get cute with discretization. On the other hand, when you have your results, it can be helpful to explain them with direct comparisons. The point of our paper is that, if you’re going to do such direct comparisons, it’s typically more efficient to compare the upper and lower third or quarter, rather than the upper and lower half.
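
The efficiency claim can be checked with a small simulation. The sketch below is only an illustration under assumptions that do not come from the paper's code: x is standard normal, y is linear in x with unit-variance normal noise, and each high-versus-low comparison is divided by the corresponding difference in mean x so that all splits estimate the same slope. The names `slope_from_extremes` and `simulate` are hypothetical.

```python
import random
import statistics


def slope_from_extremes(x, y, frac):
    """Slope estimate from comparing the top and bottom `frac` of x.

    Computed as (difference in mean y) / (difference in mean x) between
    the two extreme groups; the middle group is discarded.
    """
    n = len(x)
    k = min(n // 2, max(1, round(n * frac)))  # group size, no overlap
    order = sorted(range(n), key=lambda i: x[i])
    low, high = order[:k], order[-k:]
    dy = statistics.fmean([y[i] for i in high]) - statistics.fmean([y[i] for i in low])
    dx = statistics.fmean([x[i] for i in high]) - statistics.fmean([x[i] for i in low])
    return dy / dx


def simulate(fracs=(1/2, 1/3, 1/4), n=200, reps=2000, beta=1.0, seed=0):
    """Return {frac: standard deviation of the slope estimate across reps}.

    Assumed data-generating model (for illustration only):
    x ~ Normal(0, 1), y = beta * x + Normal(0, 1) noise.
    """
    rng = random.Random(seed)
    estimates = {f: [] for f in fracs}
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [beta * xi + rng.gauss(0, 1) for xi in x]
        # Apply every split to the same simulated data set.
        for f in fracs:
            estimates[f].append(slope_from_extremes(x, y, f))
    return {f: statistics.stdev(v) for f, v in estimates.items()}


if __name__ == "__main__":
    for frac, sd in simulate().items():
        print(f"fraction {frac:.2f}: sd of slope estimate = {sd:.4f}")
```

Under these assumptions, the upper-versus-lower-third comparison should show a smaller spread than the median split, even though it throws away the middle third of the data. With other distributions of x the best fraction can differ.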

2 thoughts on “Debate over categorizing continuous variables”

  1. I've seen a similar conclusion for medical stats, but can't remember the reference. I did find this: Effect of Categorizing a Continuous Covariate on the Comparison of Survival Time
    Timothy M. Morgan; Robert M. Elashoff
    Journal of the American Statistical Association, Vol. 81, No. 396. (Dec., 1986), pp. 917-921

  2. In Gelman and Park (2007), I think you may have inadvertently reinvented the "twenty-seven percent rule," an old shortcut in applied statistics. Google this phrase and you'll see what I mean. It goes back to the 1920s.
