Comments on: Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable

By: Andrew

Andrew — Sun, 09 Oct 2016 15:50:39 +0000

Patrick:

Bill James may have declined from his peak, but he’s still great. James’s contributions to statistics and sabermetrics are great, whether measured by total value or by peak value. I agree that Bill James started falling for his own hype around 1987 or so. But to say that James isn’t great because he’s gone downhill, is like saying that Pete Rose wasn’t a great player because he didn’t hit so well during his last few seasons.

By: Patrick McCabe

Patrick McCabe — Sun, 09 Oct 2016 06:43:20 +0000

I’m impressed by your credentials, Dr. Gelman, and your Erdös Number. But could you please stop referring to “the great” Bill James? Though I’m even more mathematically & statistically challenged than James, I’ve studied his work, on and off, since 1982. Plus the work of virtually everyone in the baseball analytical community. Where James’s stock began to plummet around 2000. His Win Shares system was thoroughly dismantled and isn’t used by anyone but James’s acolytes. James is a good writer, and a good man, but he hasn’t been devoted to the objective study of baseball statistics since the 1986 Baseball Abstract.

By: Andrew

Andrew — Mon, 03 Mar 2014 17:36:03 +0000

In reply to Mike. Mike: Yes, I agree that you can't always work with a continuous variable. Indeed, my book with Jennifer has a whole chapter on logistic regression. My advice is to work with continuous variables where possible and to avoid discretizing. Sometimes people discretize just so they can run a chi-squared test or fit a logistic regression, other times people discretize out of a naive view that an analysis on the ultimate discrete outcome is more safe or robust. My post was arguing against those attitudes.

By: Mike

Mike — Mon, 03 Mar 2014 17:32:07 +0000

I think Brian’s point about mortality is an interesting example here since it also extends to a whole slew of social processes where you have a terminal event and no obvious underlying continuous variable. To follow Andrew’s example in the opening post, what would the underlying continuous variable of a government termination be, for instance? Under most democratic constitutions, a change in vote-share is neither necessary nor sufficient for a sitting government to terminate, so clearly it cannot be vote differentials (just ask the Italians). Of course, you could choose to model vote differentials instead of actual terminations, but then you would simply be changing the question rather than modeling the “underlying continuous variable.” To me there seems to be plenty of discrete social events such as these where the continuous variables would more sensibly be included as predictors rather than as substitutes for the actual events.

By: jimmy

jimmy — Thu, 27 Feb 2014 22:49:40 +0000

joseph and mark have a great blog. more should read it.

By: Robert Grant

Robert Grant — Thu, 27 Feb 2014 17:04:33 +0000

In reply to K? O'Rourke. Keith - Wow, thanks for that, I will read it with interest. Perhaps I can steal some of your ideas, I mean stand on your gigantic shoulders!

By: Ian Fellows

Ian Fellows — Thu, 27 Feb 2014 00:44:44 +0000

In reply to george. I just meant that you don't care about scale of the outcome if you are going to dichotomize the result of the fit model when reporting your results.

By: Anonymous

Anonymous — Thu, 27 Feb 2014 00:10:39 +0000

In reply to Kaiser. why not model the continuous variable as a mixture model?

By: george

george — Wed, 26 Feb 2014 22:25:59 +0000

In reply to Ian Fellows.

Robert, thanks for the reference.

Ian, I’d consider log transforms too, but they don’t always help. E.g. you can’t log-transform negative values. And I’m reluctant to say we “don’t care about the scale”, ever. But when the scale is at least up for some discussion, interpretability also matters – other scientists should be able to understand what your results mean. Log-transformation isn’t bad in this regard, Box-Cox and inverse-Z transformations are considerably more challenging. Dichotomization (itself just another transformation) gives results that are very easy to interpret.

By: BrendanH

BrendanH — Wed, 26 Feb 2014 22:03:33 +0000

In reply to Kaiser. Sounds like the motivation for a Tobit regression, or a Heckman selection model -- not two models, but model the continuous conditional on observing the outcome.

By: K? O'Rourke

K? O'Rourke — Wed, 26 Feb 2014 21:16:13 +0000

In reply to Robert Grant.

Robert: > ditch the old meta-analysis formulas and view it as a Bayesian latent variable or coarsened data model

You might wish to read my thesis on what I called the observed summary likelihood https://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf

Certainly room for improvement and as far as I know know one has implemented anything practical yet – I would suggest ABC to get posterior and then posterior/prior for an approximate likelihood.

By: jonathan

jonathan — Wed, 26 Feb 2014 21:00:34 +0000

Thanks for the link to the other blog.

By: Ian Fellows

Ian Fellows — Wed, 26 Feb 2014 20:28:30 +0000

In reply to george. In the case of heavy tailed distributions, I would consider a transformation (i.e. log) rather than dichotomizing it. Since you don't care about the scale in the first place, there is no downside to doing this.

By: jrc

jrc — Wed, 26 Feb 2014 20:00:48 +0000

In reply to Clark.

I think using binned covariates (RHS vars, predictors, whatever) is a much different thing than using discretized outcome variables. Here’s one example: age.

Suppose we have people aged 18-65, with age reported in years. First, its worth noting that we have already discretized a continuous variable – even if we didn’t use round years, we’d still be binning the ages (we can’t have age-in-nanoseconds or whatever approaches continuous). So, now, given that, what can we do? We could, depending on sample size, including parametric controls for age (age, age-squared, cubed, etc.). This would fit some smooth polynomial function to the age profile, through implicitly-binned-ages-in-years. Or, we could use dummy variables for age-in-years, which would be a fully flexible non-parametric control for age (and hope T was orthogonal to age-within-year-bins). With sufficient data and weak priors, I’d say that the non-parametric controls are better. Wider bins are just tweaking the trade-off between parametric and non-parametric adjustments.

I tend to agree regarding interactions with treatment, but that is mostly for two reasons: improved power (sparseness tends to be a problem when looking at sub-groups of treated individuals) and interpretation (an interaction between a continuous variable and treatment is often hard to interpret, especially with a main treatment effect also present, and possibly some other interactions).

One other option for continuous predictors that heterogeneously impact treatment: estimate a local linear (lowess, something like that) regression on the T group and C group separately. Subtract the two estimates at a bunch of values of X, and then bootstrap the whole procedure. Then you get a nice, non-parametric estimate of the treatment effect across X, along with standard error bars at each point, which produces a totally readable, very convincing graph with good visualization of estimated precision.

By: Robert Grant

Robert Grant — Wed, 26 Feb 2014 19:46:02 +0000

In reply to george. Janet Peacock and colleagues wrote this paper a couple of years ago: "Dichotomising continuous data while retaining statistical power using a distributional approach". It's a great approach, acknowledging that you want to analyse with the continuous data but sometimes there is a real reason for communicating with the discrete version. In medical stats we spend a lot of time messing around with meta-analysis and one big problem is where some studies report mean change in the outcome, while others report the odds ratio or risk ratio of achieving some threshold. To make matters even worse, sometimes the threshold is absolute, sometimes relative to previous measurements. You can fix this if you ditch the old meta-analysis formulas and view it as a Bayesian latent variable or coarsened data model. Hopefully I'll be presenting this at JSM this year, assuming they like the idea...

By: K? O'Rourke

K? O'Rourke — Wed, 26 Feb 2014 19:00:50 +0000

In reply to Rahul.

Rahul:

Think of the likelihood as the probability of what you observe and then the probability of a function of what you observe as a psuedo likelihood – you can only lose information (as long as function is non-invertible).

Now if you are only losing very sparse information about nuisance parameters in the full likelihood while the psuedo likelihood does not involve those parameters – there can be a _gain_ – the canonical example being REML estimation.

In a Bayesian perspective you are mixing posteriors (not conditioning on the true posterior – same as in ABC) but then you avoid sensible priors for the nuisance parameters in the full model.

By: Drew D

Drew D — Wed, 26 Feb 2014 18:54:03 +0000

In reply to Gray. I believe what Andrew is referring to with the Lee RDD is that it might make more sense to use a continuous OUTCOME (not the running variable). Thus, you can identify the effect on vote share, rather than an effect on probability of winning.

By: Rahul

Rahul — Wed, 26 Feb 2014 18:26:36 +0000

(1) What does “statistical efficiency” in this context mean? Same as predictive power?

(2) Is the statistical efficiency of the two step process guaranteed to be always better than ignoring the latent variable? If not always, mostly? It sounds intuitive, yes, but are there stronger, more rigorous reasons?

By: jrc

jrc — Wed, 26 Feb 2014 18:05:16 +0000

I come up against this a lot with public health type people. And often their reason, as unsatisfying as it is to me with my “applied statistician” hat on, does make some sense if you put a “policy informing researcher” hat on: sometimes people understand the dichotomous variable better.

So for instance – I could tell you that a nutrition intervention reduced “childhood stunting” by 10pp, or I could tell you it increased mean child height-for-age z-score among children under five by .12sd. The first one sounds like I’m curing malnutrition. The second one sounds like I’m a statistician who got some stars next to a number that doesn’t mean anything. Now me – I get a lot more info from the .12, but that’s because I know something about the distribution of child height-for-age z-score (I know, it sounds N(0,1), but it isn’t in Sub-Saharan Africa or South Asia).

This is much like the hypertension/blood-pressure example. But in that case, since I know very little about blood pressure, the incidence of the clinical condition is the unit that makes sense to me (as an information consumer, or more maybe like an infographic consumer).

All said – I still do my analysis on the continuous underlying variable – that’s what makes more sense and contains more information. But I might also report results on the categorical variable, or at least find some way to translate my marginal effect estimates into some more readily interpretable relation to a thing in the world people understand (such as a clinical condition, or a poverty rate, or something).

By: Rahul

Rahul — Wed, 26 Feb 2014 17:54:52 +0000

In reply to Brian. Yes, but it's easy to imagine that predictive power may be improved by having access to these other variables as intermediate variables?

By: Brian

Brian — Wed, 26 Feb 2014 17:44:09 +0000

What about the case of modeling mortality?

Here the discrete is of obvious interest. It is different from the wins/ERA example, because there is no prior discrete (most people can only die once). And the continuous related variables (e.g. BMI, cholestorol, blood pressure, etc) are generally what you might use as explanatory variables. I’m not interested in who might have high BP – I want to know of those that do, who is going to die.

Does modeling a risk of death under a logistic approach represent the only approach here?

By: george

george — Wed, 26 Feb 2014 17:38:02 +0000

In reply to Rahul. Are there any weird situations where, in spite of data being available, modelling the final discrete outcome makes more sense? Two come to mind. First, extremely heavy-tailed distributions, where a few way-far-out observations heavily influence the analysis; dichotomizing variables leads to much lower influence. Second, when measurement error is really bad, much of the variability in the underlying continuous variable is simply noise; compared to this a dichotomized variable is much cleaner. But these are unusual. Much more often, analysis using the continuous variable is a better plan.

By: Gray

Gray — Wed, 26 Feb 2014 16:45:42 +0000

I haven’t read the original econ/poli-sci paper you mention, but from the your linked description, it doesn’t seem like an appropriate comparison. Regression discontinuity studies are explicitly not trying to model the entire probability distribution, but are looking at particular part of it where there’s a plausible natural experiment.

Another example might be more helpful, and I agree with your point about baseball. If we want to forecast election outcomes, I agree completely, model vote shares. But for the example you cite, it’s almost like saying “don’t use matching”.

By: K? O'Rourke

K? O'Rourke — Wed, 26 Feb 2014 15:31:00 +0000

> why throw away information?

When the bias in it can’t adequately be dealt with (i.e. traded in safely to reduce variance).

For instance when you actually suspect that a team that could have won, purposely lost by a large amount to encourage large bets against them next time (i.e. the old pool shark trick).

In combining RCTs and observational studies, I once used the fruit salad metaphor of combining sweet (unbiased) and sour (biased) apples and put the challenge as – when is there enough knowledge about the bias to adequately sweet the sour apples so that you should not just throw them away.

(Chapter 11 in https://www.amazon.com/Empirical-Likelihood-Inference-Lecture-Statistics/dp/0387950184#_ )

By: Clark

Clark — Wed, 26 Feb 2014 15:17:40 +0000

A related problem that I see in medical research is the use of arbitrarily discrete predictors. They’ll (usually MDs) take something like age and convert it to a categorical variable by breaking it into arbitrary ranges. This has always struck me as a needless loss of information by effectively averaging the response over each of these ranges, and I’ll generally try to persuade them to provide the continuous data.

I have encountered difficulties with NOT discretizing the predictors in cases where the treatment interacts with that predictor, like ANCOVA models (or their GAM equivalents). The people I work with are generally not terribly interested in the intercepts or slopes associated with an ANCOVA, they’re interested in where, over the domain of the predictor, the outcome differs between treatment categories (for instance, a drug may have a large effect for the first week following trauma, then tail-off to nothing over the next few weeks). Intuitively, this is like a running contrast over the predictor’s range. My best solution thus far has been to estimate the confidence interval for the difference by finding the difference’s standard error as the square root of the sums of squares of the lines/curves associated with the individual treatment conditions. The problem I have with this approach is that it appears to be too conservative — in cases where the interaction lacks significance, you would like the result to be roughly equivalent to the treatment contrast in isolation, but it generally results in a rather broader confidence interval. I think the running contrasts notion is a good one, but what is needed is a better methodology to implement it. Another difficulty is that MDs have difficulty interpreting these comparisons, and are more comfortable with comparisons involving discrete levels of a predictor.

Due to these difficulties, I am for the moment leaning towards using discrete predictors in cases where there is an interaction between treatment levels and predictors, accompanied by adjustment for multiple comparisons. I would prefer a better solution. On the plus side, using more conservative methods help to reduce problems with reproducibility.

By: Rahul

Rahul — Wed, 26 Feb 2014 14:50:34 +0000

I agree with Andrew’s point. Makes sense intuitively: why throw away information?

To me the converse question is more interesting: Are there any weird situations where, in spite of data being available, modelling the final discrete outcome makes more sense?

Examples such as @Jay Ulfelder’s comment above don’t count because there data is simply non-existent. That’s a different situation.

Even more generally, within any given class of models, are there situations where using aggregate data leads to better models than using a more granular model?

By: Kaiser

Kaiser — Wed, 26 Feb 2014 14:36:02 +0000

I generally agree with the message. Here’s one example of a business problem in which I’d choose to model discrete rather than continuous variables: let’s say you are a catalog retailer. You can model the discrete event (Buy or not buy) or the continuous event (dollar amount of purchase). In reality, you’d do a two-stage model similar to what Andrew described above. Model the propensity to buy anything at all, and then model the amount of purchases given the customer buys. But if you only build one model, the discrete one is more useful. Directly building a model for the continuous variable is difficult because the distribution of that variable looks like a huge spike at zero plus a long tail for the positives, with a scattering of negative values (credits).

By: Jay Ulfelder

Jay Ulfelder — Wed, 26 Feb 2014 14:18:17 +0000

From personal experience, I can offer another reason political scientists often model discrete representations of underlying continuous variables: lack of data.

I do a lot of work on political violence, including episodes of civil war and mass killing. We conventionally identify those episodes through Boolean expressions that include body counts and descriptive features of the context and parties to the violence. Following your logic—which I don’t dispute—we would do better to model those death counts directly.

The problem is that we simply don’t have reliable death counts for most of the episodes we think we can identify when we use a simple threshold. That’s partly because of the poor bureaucratic capacity of the societies in which these things tend to happen, but it’s also because the parties to these conflicts are motivated to conceal or misrepresent their activities. And then there’s the general fog of war.

So we can aspire to follow your advice, but practicalities keep us from doing so, and I don’t expect that problem to disappear any time soon.