## Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable

This is an echo of yesterday’s post, Basketball Stats: Don’t model the probability of win, model the expected score differential.

As with basketball, so with baseball: as the great Bill James wrote, if you want to predict a pitcher’s win-loss record, it’s better to use last year’s ERA than last year’s W-L.

As with basketball and baseball, so with epidemiology: as Joseph Delaney points out in my favorite blog that nobody reads, you will see much better prediction if you first model change in the parameter (e.g. blood pressure) and then convert that to the binary disease state (e.g. hypertension) than if you just develop a logistic model for prob(hypertension).

As with basketball, baseball, and epidemiology, so with political science: instead of modeling election winners, better to model vote differential, a point that I made back in 1993 (see page 120 here) but which seems to continually need repeating. A forecasting method should get essentially no credit for correctly predicting the winner in 1960, 1968, or 2000, and very little for predicting the winner in 1964, but there's information in vote differential, all the same.

As with basketball, baseball, epidemiology, and political science, so with econometrics: Even in recent years, with all the sophistication in economic statistics, you'll still see people fitting logistic models for binary outcomes even when the continuous variable is readily available. (See, for example, the second-to-last paragraph here, which is actually an economist doing political science, but I'm pretty sure there are lots of examples of this sort of thing in econ too.)

OK, ok, if this is all so obvious, why do people do the other thing? Why do people keep modeling the discrete variable? Some of the answer is statistical naivety, a simple “like goes with like” attitude that it makes sense to predict W-L from W-L rather than ERA.

More generally there's the attitude that we should be modeling what we ultimately care about. If the objective is to learn about wins, we should study wins directly. To which I reply: sure, study wins, but it will be more statistically efficient to do this in a two-stage process: first study vote differential given X, then study wins given vote differential and X. The key is that vote differential is available, and simply performing a logit model for wins alone is implicitly treating this differential as latent or missing data, thus throwing away information.

Finally, from the econometrics direction, I see a bias or robustness argument. The idea is that it’s safer, in some way, to model the outcome of interest, as this model will not be sensitive to assumptions about the distribution of the intermediate variable. For example, a linear model for score differentials could be inappropriate for games where one team runs up the score (or, conversely, for those games where the team that’s winning sends in the subs so that the score is less lopsided than it would be if both teams were playing their hardest). In response to this, I would make my usual argument that your models already have bias and robustness issues in that, to do your regression at all, you’re already pooling data from many years, many places, many different situations, etc. If the use of continuous data can increase your statistical efficiency—and it will—this in turn will allow you to do less pooling of data to construct estimates that are reliable enough for you to work with.
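The efficiency claim is easy to check with a small simulation (made-up effect size and sample size, standard library only): estimate a group difference two ways, once from the latent continuous outcome and once from the dichotomized "win" indicator, and compare each estimator's signal-to-noise ratio across replications.

```python
import random
import statistics

random.seed(1)

def one_rep(n=200, effect=0.3):
    # Latent continuous outcomes for a baseline and a shifted group
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(effect, 1.0) for _ in range(n)]
    cont = statistics.mean(b) - statistics.mean(a)               # continuous
    disc = (sum(y > 0 for y in b) - sum(y > 0 for y in a)) / n   # dichotomized
    return cont, disc

reps = [one_rep() for _ in range(2000)]
cont_ests = [c for c, _ in reps]
disc_ests = [d for _, d in reps]

# Each estimator's signal-to-noise ratio; dividing by its own sampling sd
# makes the comparison fair even though the two are in different units.
snr_cont = statistics.mean(cont_ests) / statistics.stdev(cont_ests)
snr_disc = statistics.mean(disc_ests) / statistics.stdev(disc_ests)
print(round(snr_cont, 2), round(snr_disc, 2))  # the continuous estimator wins
```

The gap here is roughly consistent with the classical result that a median split of normal data retains only about 2/π of the information.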

1. Jay Ulfelder says:

From personal experience, I can offer another reason political scientists often model discrete representations of underlying continuous variables: lack of data.

I do a lot of work on political violence, including episodes of civil war and mass killing. We conventionally identify those episodes through Boolean expressions that include body counts and descriptive features of the context and parties to the violence. Following your logic—which I don’t dispute—we would do better to model those death counts directly.

The problem is that we simply don’t have reliable death counts for most of the episodes we think we can identify when we use a simple threshold. That’s partly because of the poor bureaucratic capacity of the societies in which these things tend to happen, but it’s also because the parties to these conflicts are motivated to conceal or misrepresent their activities. And then there’s the general fog of war.

So we can aspire to follow your advice, but practicalities keep us from doing so, and I don’t expect that problem to disappear any time soon.

2. Kaiser says:

I generally agree with the message. Here’s one example of a business problem in which I’d choose to model discrete rather than continuous variables: let’s say you are a catalog retailer. You can model the discrete event (Buy or not buy) or the continuous event (dollar amount of purchase). In reality, you’d do a two-stage model similar to what Andrew described above. Model the propensity to buy anything at all, and then model the amount of purchases given the customer buys. But if you only build one model, the discrete one is more useful. Directly building a model for the continuous variable is difficult because the distribution of that variable looks like a huge spike at zero plus a long tail for the positives, with a scattering of negative values (credits).
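Kaiser's two-stage structure can be sketched in a few lines (simulated, made-up catalog data): model the propensity to buy, model the amount among buyers, and note that the two stages multiply back to the unconditional expected spend, which is why nothing is lost by splitting the model this way.

```python
import random
import statistics

random.seed(7)

# Simulated customers: most spend nothing; buyers have a long right tail.
spend = [random.lognormvariate(3.0, 1.0) if random.random() < 0.2 else 0.0
         for _ in range(10_000)]

buyers = [s for s in spend if s > 0]
p_buy = len(buyers) / len(spend)            # stage 1: propensity to buy
amount_given_buy = statistics.mean(buyers)  # stage 2: amount, given a purchase

# The two stages combine exactly into the unconditional expectation
expected_spend = p_buy * amount_given_buy
print(round(p_buy, 3), round(expected_spend, 2))
```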

• BrendanH says:

Sounds like the motivation for a Tobit regression, or a Heckman selection model — not two models, but model the continuous conditional on observing the outcome.

• Anonymous says:

why not model the continuous variable as a mixture model?

3. Rahul says:

I agree with Andrew’s point. Makes sense intuitively: why throw away information?

To me the converse question is more interesting: Are there any weird situations where, in spite of data being available, modelling the final discrete outcome makes more sense?

Examples such as @Jay Ulfelder's comment above don't count because there the data are simply nonexistent. That's a different situation.

Even more generally, within any given class of models, are there situations where using aggregate data leads to better models than using a more granular model?

• george says:

Are there any weird situations where, in spite of data being available, modelling the final discrete outcome makes more sense?

Two come to mind. First, extremely heavy-tailed distributions, where a few way-far-out observations heavily influence the analysis; dichotomizing variables leads to much lower influence. Second, when measurement error is really bad, much of the variability in the underlying continuous variable is simply noise; compared to this a dichotomized variable is much cleaner.

But these are unusual. Much more often, analysis using the continuous variable is a better plan.
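george's first case is easy to see with a toy example (invented numbers): one way-far-out observation drags the sample mean a long way, while a dichotomized summary of the same data barely moves.

```python
import statistics

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
dirty = clean + [1000.0]          # one way-far-out observation

# The mean shifts enormously...
mean_shift = statistics.mean(dirty) - statistics.mean(clean)

# ...but the fraction above a threshold barely changes
frac_clean = sum(x > 3.0 for x in clean) / len(clean)
frac_dirty = sum(x > 3.0 for x in dirty) / len(dirty)
frac_shift = frac_dirty - frac_clean

print(round(mean_shift, 1), round(frac_shift, 2))
```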

• Robert Grant says:

Janet Peacock and colleagues wrote this paper a couple of years ago: “Dichotomising continuous data while retaining statistical power using a distributional approach”. It’s a great approach, acknowledging that you want to analyse with the continuous data but sometimes there is a real reason for communicating with the discrete version.

In medical stats we spend a lot of time messing around with meta-analysis and one big problem is where some studies report mean change in the outcome, while others report the odds ratio or risk ratio of achieving some threshold. To make matters even worse, sometimes the threshold is absolute, sometimes relative to previous measurements. You can fix this if you ditch the old meta-analysis formulas and view it as a Bayesian latent variable or coarsened data model. Hopefully I’ll be presenting this at JSM this year, assuming they like the idea…

• K? O'Rourke says:

Robert: > ditch the old meta-analysis formulas and view it as a Bayesian latent variable or coarsened data model

You might wish to read my thesis on what I called the observed summary likelihood http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf

Certainly room for improvement, and as far as I know no one has implemented anything practical yet – I would suggest ABC to get a posterior and then posterior/prior for an approximate likelihood.

• Robert Grant says:

Keith – Wow, thanks for that, I will read it with interest. Perhaps I can steal some of your ideas, I mean stand on your gigantic shoulders!

• Ian Fellows says:

In the case of heavy-tailed distributions, I would consider a transformation (e.g. log) rather than dichotomizing. Since you don't care about the scale in the first place, there is no downside to doing this.

• george says:

Robert, thanks for the reference.

Ian, I'd consider log transforms too, but they don't always help. E.g. you can't log-transform negative values. And I'm reluctant to say we "don't care about the scale", ever. But when the scale is at least up for some discussion, interpretability also matters – other scientists should be able to understand what your results mean. Log-transformation isn't bad in this regard; Box-Cox and inverse-Z transformations are considerably more challenging. Dichotomization (itself just another transformation) gives results that are very easy to interpret.

• Ian Fellows says:

I just meant that you don’t care about scale of the outcome if you are going to dichotomize the result of the fit model when reporting your results.

4. Clark says:

A related problem that I see in medical research is the use of arbitrarily discretized predictors. They'll (usually MDs) take something like age and convert it to a categorical variable by breaking it into arbitrary ranges. This has always struck me as a needless loss of information by effectively averaging the response over each of these ranges, and I'll generally try to persuade them to provide the continuous data.

I have encountered difficulties with NOT discretizing the predictors in cases where the treatment interacts with that predictor, as in ANCOVA models (or their GAM equivalents). The people I work with are generally not terribly interested in the intercepts or slopes associated with an ANCOVA; they're interested in where, over the domain of the predictor, the outcome differs between treatment categories (for instance, a drug may have a large effect for the first week following trauma, then tail off to nothing over the next few weeks). Intuitively, this is like a running contrast over the predictor's range. My best solution thus far has been to estimate the confidence interval for the difference by finding the difference's standard error as the square root of the sum of the squared standard errors of the lines/curves associated with the individual treatment conditions. The problem I have with this approach is that it appears to be too conservative: in cases where the interaction lacks significance, you would like the result to be roughly equivalent to the treatment contrast in isolation, but it generally results in a rather broader confidence interval. I think the running-contrasts notion is a good one, but what is needed is a better methodology to implement it.

Another difficulty is that MDs have difficulty interpreting these comparisons, and are more comfortable with comparisons involving discrete levels of a predictor.

Due to these difficulties, I am for the moment leaning towards using discrete predictors in cases where there is an interaction between treatment levels and predictors, accompanied by adjustment for multiple comparisons. I would prefer a better solution. On the plus side, using more conservative methods helps to reduce problems with reproducibility.

• jrc says:

I think using binned covariates (RHS vars, predictors, whatever) is a much different thing than using discretized outcome variables. Here’s one example: age.

Suppose we have people aged 18-65, with age reported in years. First, it's worth noting that we have already discretized a continuous variable – even if we didn't use round years, we'd still be binning the ages (we can't have age-in-nanoseconds or whatever approaches continuous). So, now, given that, what can we do? We could, depending on sample size, include parametric controls for age (age, age-squared, cubed, etc.). This would fit some smooth polynomial function to the age profile, through implicitly-binned ages-in-years. Or, we could use dummy variables for age-in-years, which would be a fully flexible non-parametric control for age (and hope T was orthogonal to age-within-year-bins). With sufficient data and weak priors, I'd say that the non-parametric controls are better. Wider bins are just tweaking the trade-off between parametric and non-parametric adjustments.
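The parametric-vs-dummies contrast above can be sketched on simulated data (the age profile and noise level are invented): a single linear term in age versus a separate mean per age-in-years bin. In-sample, the fully flexible bins can never fit worse, which is the sense in which they are the "fully flexible non-parametric control."

```python
import random
import statistics

random.seed(3)
ages = [random.randint(18, 65) for _ in range(5000)]
# Outcome with a curved age profile plus noise (invented for illustration)
ys = [0.02 * a + 0.001 * (a - 40) ** 2 + random.gauss(0, 1) for a in ages]

# Parametric control: a single linear term in age (misses the curvature)
mean_a, mean_y = statistics.mean(ages), statistics.mean(ys)
slope = (sum((a - mean_a) * (y - mean_y) for a, y in zip(ages, ys))
         / sum((a - mean_a) ** 2 for a in ages))
intercept = mean_y - slope * mean_a

# "Dummy variable" control: a separate mean for each age-in-years bin
by_age = {}
for a, y in zip(ages, ys):
    by_age.setdefault(a, []).append(y)
bin_means = {a: statistics.mean(v) for a, v in by_age.items()}

# In-sample fit: the bin means minimize squared error among all predictors
# that are constant within age bins, so they can never do worse than linear
mse_lin = statistics.mean((y - (intercept + slope * a)) ** 2
                          for a, y in zip(ages, ys))
mse_bin = statistics.mean((y - bin_means[a]) ** 2 for a, y in zip(ages, ys))
print(round(mse_lin, 3), round(mse_bin, 3))
```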

I tend to agree regarding interactions with treatment, but that is mostly for two reasons: improved power (sparseness tends to be a problem when looking at sub-groups of treated individuals) and interpretation (an interaction between a continuous variable and treatment is often hard to interpret, especially with a main treatment effect also present, and possibly some other interactions).

One other option for continuous predictors across which the treatment effect varies: estimate a local linear (lowess, something like that) regression on the T group and C group separately. Subtract the two estimates at a bunch of values of X, and then bootstrap the whole procedure. Then you get a nice, non-parametric estimate of the treatment effect across X, along with standard error bars at each point, which produces a totally readable, very convincing graph with good visualization of estimated precision.
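jrc's recipe can be sketched as follows, simplified to local constant (kernel-weighted mean) fits rather than local linear, with invented data in which the treatment effect fades over x; the grid values, bandwidth, and bootstrap size are all arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(5)
n = 400
xs = [random.uniform(0, 10) for _ in range(n)]
ts = [random.random() < 0.5 for _ in range(n)]
# Invented truth: the treatment effect starts at 2 and fades to 0 by x = 5
ys = [0.5 * x + (max(0.0, 2.0 - 0.4 * x) if t else 0.0) + random.gauss(0, 0.5)
      for x, t in zip(xs, ts)]
data = list(zip(xs, ts, ys))

def local_mean(x0, pts, h=1.0):
    # Kernel-weighted mean of y around x0 (Gaussian kernel, bandwidth h)
    w = [(math.exp(-0.5 * ((x - x0) / h) ** 2), y) for x, y in pts]
    return sum(wi * yi for wi, yi in w) / sum(wi for wi, _ in w)

def effect_curve(data, grid):
    treated = [(x, y) for x, t, y in data if t]
    control = [(x, y) for x, t, y in data if not t]
    return [local_mean(g, treated) - local_mean(g, control) for g in grid]

grid = [1.0, 5.0, 9.0]
point = effect_curve(data, grid)

# Bootstrap the whole procedure for pointwise standard errors
boots = [effect_curve([random.choice(data) for _ in range(n)], grid)
         for _ in range(200)]
ses = [statistics.stdev(b[i] for b in boots) for i in range(len(grid))]
print([round(p, 2) for p in point], [round(s, 3) for s in ses])
```

Plotting `point` with ±2 `ses` bands over a finer grid gives the "totally readable" graph jrc describes.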

5. K? O'Rourke says:

> why throw away information?

When the bias in it can’t adequately be dealt with (i.e. traded in safely to reduce variance).

For instance when you actually suspect that a team that could have won, purposely lost by a large amount to encourage large bets against them next time (i.e. the old pool shark trick).

In combining RCTs and observational studies, I once used the fruit salad metaphor of combining sweet (unbiased) and sour (biased) apples, and put the challenge as: when is there enough knowledge about the bias to adequately sweeten the sour apples so that you should not just throw them away.

6. Gray says:

I haven't read the original econ/poli-sci paper you mention, but from your linked description, it doesn't seem like an appropriate comparison. Regression discontinuity studies are explicitly not trying to model the entire probability distribution, but are looking at a particular part of it where there's a plausible natural experiment.

Another example might be more helpful, and I agree with your point about baseball. If we want to forecast election outcomes, I agree completely, model vote shares. But for the example you cite, it’s almost like saying “don’t use matching”.

• Drew D says:

I believe what Andrew is referring to with the Lee RDD is that it might make more sense to use a continuous OUTCOME (not the running variable). Thus, you can identify the effect on vote share, rather than an effect on probability of winning.

7. Brian says:

What about the case of modeling mortality?

Here the discrete outcome is of obvious interest. It is different from the wins/ERA example, because there is no prior discrete outcome (most people can only die once). And the continuous related variables (e.g. BMI, cholesterol, blood pressure, etc.) are generally what you might use as explanatory variables. I'm not interested in who might have high BP – I want to know, of those that do, who is going to die.

Does modeling a risk of death under a logistic approach represent the only approach here?

• Rahul says:

Yes, but it’s easy to imagine that predictive power may be improved by having access to these other variables as intermediate variables?

8. jrc says:

I come up against this a lot with public health type people. And often their reason, as unsatisfying as it is to me with my “applied statistician” hat on, does make some sense if you put a “policy informing researcher” hat on: sometimes people understand the dichotomous variable better.

So for instance – I could tell you that a nutrition intervention reduced “childhood stunting” by 10pp, or I could tell you it increased mean child height-for-age z-score among children under five by .12sd. The first one sounds like I’m curing malnutrition. The second one sounds like I’m a statistician who got some stars next to a number that doesn’t mean anything. Now me – I get a lot more info from the .12, but that’s because I know something about the distribution of child height-for-age z-score (I know, it sounds N(0,1), but it isn’t in Sub-Saharan Africa or South Asia).

This is much like the hypertension/blood-pressure example. But in that case, since I know very little about blood pressure, the incidence of the clinical condition is the unit that makes sense to me (as an information consumer, or more maybe like an infographic consumer).

All said – I still do my analysis on the continuous underlying variable – that’s what makes more sense and contains more information. But I might also report results on the categorical variable, or at least find some way to translate my marginal effect estimates into some more readily interpretable relation to a thing in the world people understand (such as a clinical condition, or a poverty rate, or something).
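jrc's stunting example can be made concrete with a back-of-the-envelope conversion. The numbers below are purely hypothetical (HAZ approximated as normal with an assumed mean and sd, which, as jrc notes, it isn't in practice), but they show how a mean shift in the continuous z-score translates into percentage points of stunting.

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Hypothetical pre-intervention HAZ distribution and the stunting cutoff
mu, sigma, cutoff = -1.5, 1.1, -2.0

before = norm_cdf(cutoff, mu, sigma)        # stunting rate before
after = norm_cdf(cutoff, mu + 0.12, sigma)  # after a +0.12 sd mean shift
print(round(before, 3), round(after, 3), round(before - after, 3))
```

The same effect can thus be reported either way: a 0.12 sd improvement in mean HAZ, or a few percentage points off the stunting rate, depending on where the distribution sits relative to the cutoff.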

9. Rahul says:

(1) What does “statistical efficiency” in this context mean? Same as predictive power?

(2) Is the statistical efficiency of the two step process guaranteed to be always better than ignoring the latent variable? If not always, mostly? It sounds intuitive, yes, but are there stronger, more rigorous reasons?

• K? O'Rourke says:

Rahul:

Think of the likelihood as the probability of what you observe, and the probability of a function of what you observe as a pseudo-likelihood – you can only lose information (as long as the function is non-invertible).

Now if you are only losing very sparse information about nuisance parameters in the full likelihood, while the pseudo-likelihood does not involve those parameters, there can be a _gain_ – the canonical example being REML estimation.

From a Bayesian perspective you are mixing posteriors (not conditioning on the true posterior – same as in ABC), but then you avoid needing sensible priors for the nuisance parameters in the full model.

10. jonathan says:

Thanks for the link to the other blog.

11. jimmy says:

joseph and mark have a great blog. more should read it.

12. Mike says:

I think Brian’s point about mortality is an interesting example here since it also extends to a whole slew of social processes where you have a terminal event and no obvious underlying continuous variable. To follow Andrew’s example in the opening post, what would the underlying continuous variable of a government termination be, for instance? Under most democratic constitutions, a change in vote-share is neither necessary nor sufficient for a sitting government to terminate, so clearly it cannot be vote differentials (just ask the Italians). Of course, you could choose to model vote differentials instead of actual terminations, but then you would simply be changing the question rather than modeling the “underlying continuous variable.” To me there seems to be plenty of discrete social events such as these where the continuous variables would more sensibly be included as predictors rather than as substitutes for the actual events.

• Andrew says:

Mike:

Yes, I agree that you can’t always work with a continuous variable. Indeed, my book with Jennifer has a whole chapter on logistic regression. My advice is to work with continuous variables where possible and to avoid discretizing. Sometimes people discretize just so they can run a chi-squared test or fit a logistic regression, other times people discretize out of a naive view that an analysis on the ultimate discrete outcome is more safe or robust. My post was arguing against those attitudes.

13. Patrick McCabe says:

I'm impressed by your credentials, Dr. Gelman, and your Erdös Number. But could you please stop referring to "the great" Bill James? Though I'm even more mathematically & statistically challenged than James, I've studied his work, on and off, since 1982, plus the work of virtually everyone in the baseball analytical community, where James's stock began to plummet around 2000. His Win Shares system was thoroughly dismantled and isn't used by anyone but James's acolytes. James is a good writer, and a good man, but he hasn't been devoted to the objective study of baseball statistics since the 1986 Baseball Abstract.

• Andrew says:

Patrick:

Bill James may have declined from his peak, but he's still great. James's contributions to statistics and sabermetrics are great, whether measured by total value or by peak value. I agree that Bill James started falling for his own hype around 1987 or so. But to say that James isn't great because he's gone downhill is like saying that Pete Rose wasn't a great player because he didn't hit so well during his last few seasons.