Models, assumptions, and data summaries

I saw an analysis recently that I didn’t like. I won’t go into the details, but basically it was a dose-response inference, where a continuous exposure was binned into three broad categories (terciles of the data) and the probability of an adverse event was computed for each tercile. The effect and the sample size were large enough that the terciles were statistically significantly different from each other in probability of adverse event, with the probabilities increasing from low to mid to high exposure, as one would predict.

I didn’t like this analysis because it is equivalent to fitting a step function. There is a tendency for people to interpret the (arbitrary) tercile boundaries as being meaningful thresholds even though the underlying dose-response relation has to be continuous. I’d prefer to start with a linear model and then add nonlinearity from there with a spline or whatever.
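To make the step-function point concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration, not the analysis I saw: the exposure range, the true logistic curve, the sample size, and the helper names `fit_logistic`, `step_pred`, and `smooth_pred`. It compares the tercile summary with a simple model that is linear on the logit scale, and looks at what each one predicts just below and just above the first cutpoint A.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit logit(p) = a + b*x by Newton's method (hypothetical helper)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated data: a smooth, monotone dose-response curve (assumed).
rng = np.random.default_rng(1)
n = 3000
x = rng.uniform(0, 10, n)
true_p = 1 / (1 + np.exp(-(-2.0 + 0.3 * x)))
y = rng.binomial(1, true_p)

# Tercile summary: one event rate per bin, i.e. a step function.
cut_a, cut_b = np.quantile(x, [1 / 3, 2 / 3])
bins = np.digitize(x, [cut_a, cut_b])  # 0, 1, or 2
bin_means = np.array([y[bins == k].mean() for k in range(3)])

def step_pred(x_new):
    """Predicted event probability from the tercile summary."""
    return bin_means[np.digitize(x_new, [cut_a, cut_b])]

# Continuous alternative: linear on the logit scale.
beta = fit_logistic(x, y)

def smooth_pred(x_new):
    """Predicted event probability from the fitted logistic curve."""
    x_new = np.asarray(x_new, dtype=float)
    return 1 / (1 + np.exp(-(beta[0] + beta[1] * x_new)))

# Predictions an instant before and after the first cutpoint.
eps = 1e-6
around_a = np.array([cut_a - eps, cut_a + eps])
step_jump = np.diff(step_pred(around_a))[0]
smooth_jump = np.diff(smooth_pred(around_a))[0]
print("tercile event rates:", bin_means)
print("jump at A, step summary:", step_jump)
print("jump at A, logistic fit:", smooth_jump)
```

The tercile summary changes abruptly at A even though nothing in the simulated data does, which is exactly the feature that invites a threshold interpretation. A spline term could be added to the continuous fit in the same spirit; the linear-logit model is only the starting point.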

At this point I stepped back and thought: Hey, the divide-into-three analysis does not literally assume a step function. It doesn’t assume anything at all; it’s just a data summary! People discretize input variables all the time! So why am I complaining?

I justify my complaints on two levels. First on the grounds of interpretation: my applied colleagues really were interpreting the three-category model in terms of thresholds. The three categories were: “0 to A”, “A to B”, and “B to infinity”. And somebody really was saying something about the effect of exposure A or exposure B. Which just ain’t right.

My second issue is statistical efficiency. You can say that the categorical-input model is nothing but a summary, an estimate of averages—but by binning like this, you lose statistical efficiency. And you become the slave to “statistical significance”; there’s the temptation to butcher your analysis and throw away tons of information, just so you can get a single clean, statistically significant result.
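Here is a rough simulation sketch of the efficiency loss. The setup is assumed for illustration only: a uniform exposure, a modest logistic dose-response, 600 observations, and a hypothetical function `simulate_power`. On each simulated dataset it runs two tests of the same question: a z-test on the slope from a least-squares fit of the binary outcome on the continuous exposure, and a two-proportion z-test comparing the top and bottom terciles. This is one illustrative comparison, not the only way to test either version.

```python
import numpy as np

def simulate_power(n=600, slope=0.8, n_sims=2000, seed=0):
    """Share of simulations in which each test rejects at the 5% level."""
    rng = np.random.default_rng(seed)
    z_crit = 1.96
    hits_cont = 0
    hits_bin = 0
    for _ in range(n_sims):
        x = rng.uniform(0, 1, n)
        p = 1 / (1 + np.exp(-(-1.0 + slope * x)))
        y = rng.binomial(1, p)

        # Continuous exposure: least-squares slope of y on x, z-test.
        xc = x - x.mean()
        b = (xc @ y) / (xc @ xc)
        resid = y - y.mean() - b * xc
        se_b = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
        hits_cont += int(abs(b / se_b) > z_crit)

        # Binned exposure: top vs. bottom tercile, two-proportion z-test.
        lo, hi = np.quantile(x, [1 / 3, 2 / 3])
        y_lo, y_hi = y[x <= lo], y[x > hi]
        p_pool = np.concatenate([y_lo, y_hi]).mean()
        se_d = np.sqrt(p_pool * (1 - p_pool) * (1 / len(y_lo) + 1 / len(y_hi)))
        if se_d > 0:
            hits_bin += int(abs((y_hi.mean() - y_lo.mean()) / se_d) > z_crit)
    return hits_cont / n_sims, hits_bin / n_sims

power_cont, power_bin = simulate_power()
print("power, continuous exposure:", power_cont)
print("power, top vs. bottom tercile:", power_bin)
```

In this setup the binned comparison discards the middle third of the data and the ordering within each bin, so it should reject less often than the continuous analysis on the same data. The size of the gap depends on the assumed exposure distribution and effect size.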

P.S. The more categories you have, the less of a concern it is to discretize. And sometimes your data come in discrete form (see here, for example).


  1. Ivon Fergus says:

    Can’t read column at right: blue print over black background.

  2. revo11 says:

    I was under the impression that approximating non-linearities by categorization like this was pretty typical. It probably reflects sociological concerns – interpreting confidence intervals or p-values on categorical factors is probably a lot more widely understood than the interpretation of spline regressions.

    I’m not sure what “The three categories were: “0 to A”, “A to B”, and “B to infinity”. And somebody really was saying something about the effect of exposure A or exposure B.” means – if a category is defined “0 to A”, I would think that they would interpret this as “effect of exposure 0 to A”.

    • K? O'Rourke says:

      I tried to encourage dual analyses many years ago, where the categorized ones were called understudy analyses (that could stand in for those who cannot understand the more elaborate techniques).

      Now if they qualitatively gave different interpretations, I speculated that most often there would be something too wrong with the elaborate analysis and it likely needed tweaking.

      Recall the statistical reviewer called it dangerous thinking (in that it might distract people from making sure to only use valid methods).

    • Fred says:

      I agree that binning is sometimes useful for helping non-statisticians understand and interpret the category estimates. Such practice is inevitable in empirical research. If the sample size is large enough, from my perspective, an even larger number of arbitrary bins can be chosen to understand the behaviour of the estimates, and the estimates can be plotted with their confidence intervals. We can then decide whether binning is necessary or what kind of nonlinearities should be incorporated.

  3. Phil says:

    I just had a physical, including a blood test. The report gives the measurement of various parameters about my blood, and gives a “reference range” that my doctor says summarizes where most people fall, not a target range. (The measurements include the average volume of a blood cell! They can measure that! Mine is 86 femtoliters, the middle of the reference range. I have no idea why one would care, but there it is).

    The only exception to the “reference range” that summarizes typical numbers is for LDL/HDL ratio; that’s the ratio of low-density lipoprotein to high-density lipoprotein. For that one, they give quartiles:
    7.13, high risk

    My number is 2.6, in the “average risk” range. Only average, that’s too bad, I’d like to be in the low-risk group. But it’s gotta be good that I’m in the lower 15% of the “average risk” range, doesn’t it? I mean, it’s gotta be better than being, say, 4.8.

    In this case I think it’s OK to give categories with boundaries. What would be the alternative, and why would more precision help? But for _analyzing_ data, I’m with Andrew, I’d be disappointed if they bin everything coarsely like this. Unless, of course, there’s a reason to believe the bin boundaries mean something. If you’re looking at crop survival as a function of temperature it probably makes sense to put “over 0 C” and “under 0 C” in different bins. But most things aren’t nonlinear like that, and even if they were, binning only helps if the bin boundaries correspond to the places where the function is changing rapidly.

    • Phil says:

      Weird, the other numbers in the quartile ranges didn’t show up…ah, I see, funny, it interpreted less than and greater than signs as tags!
      In case anyone cares, the numbers (for males) that they give for risk categories of LDL/HDL ratio are:
      low risk, less than 2.28
      average risk, 2.29-4.90
      moderate risk, 4.90-7.13
      high risk, above 7.13

    • Eric says:

      Just for information, the volume of a red blood cell (RBC) is of interest when trying to determine the reason for anemia. Certain forms of anemia will result in smaller RBCs while others will result in larger RBCs.

      At least in the medical field, most laboratory tests break down results into a normal range and flag those results outside of the normal range. I can think of three reasons for that. First, different laboratories may give slightly different results or have different reference ranges (perhaps because of different testing methodologies or different demographics of their “normal” population). Second, most physicians are not scientists and definitely not mathematicians or statisticians. In medical school, physicians are taught how to interpret results in terms of positive predictive values, sensitivities, and specificities. I haven’t yet come across a non-pathologist who can coherently explain what an odds ratio is. Finally, a single routine test can return a number of different results, many of which are noise and not needed, or could just distract from a diagnosis or therapy. As a result, lab tests are summarized to highlight possibly abnormal results that should be looked at more closely.

  4. Eric says:

    Can you recommend a place where we as a community could see the recent “best practices” for cases like this? I assume that one reason sub-optimal analyses such as this happen is that there are a lot of people who would like to do the most efficient analysis but we’re not statisticians and so we just use what we know (or can easily find.) I know that I’m one of these people.
