Patrick:

Bill James may have declined from his peak, but he’s still great. James’s contributions to statistics and sabermetrics are great, whether measured by total value or by peak value. I agree that Bill James started falling for his own hype around 1987 or so. But to say that James isn’t great because he’s gone downhill is like saying that Pete Rose wasn’t a great player because he didn’t hit so well during his last few seasons.

]]>Mike:

Yes, I agree that you can’t always work with a continuous variable. Indeed, my book with Jennifer has a whole chapter on logistic regression. My advice is to work with continuous variables where possible and to avoid discretizing. Sometimes people discretize just so they can run a chi-squared test or fit a logistic regression; other times people discretize out of a naive view that an analysis on the ultimate discrete outcome is safer or more robust. My post was arguing against those attitudes.

]]>Keith – Wow, thanks for that, I will read it with interest. Perhaps I can steal some of your ideas, I mean stand on your gigantic shoulders!

]]>I just meant that you don’t care about scale of the outcome if you are going to dichotomize the result of the fit model when reporting your results.

]]>why not model the continuous variable as a mixture model?

]]>Robert, thanks for the reference.

Ian, I’d consider log transforms too, but they don’t always help. E.g. you can’t log-transform negative values. And I’m reluctant to say we “don’t care about the scale”, ever. But when the scale is at least up for some discussion, interpretability also matters – other scientists should be able to understand what your results mean. Log-transformation isn’t bad in this regard; Box-Cox and inverse-Z transformations are considerably more challenging. Dichotomization (itself just another transformation) gives results that are very easy to interpret.

]]>Sounds like the motivation for a Tobit regression, or a Heckman selection model — not two models, but model the continuous conditional on observing the outcome.

]]>Robert: > ditch the old meta-analysis formulas and view it as a Bayesian latent variable or coarsened data model

You might wish to read my thesis on what I called the observed summary likelihood http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf

Certainly room for improvement and as far as I know no one has implemented anything practical yet – I would suggest ABC to get the posterior and then posterior/prior for an approximate likelihood.

]]>In the case of heavy tailed distributions, I would consider a transformation (i.e. log) rather than dichotomizing it. Since you don’t care about the scale in the first place, there is no downside to doing this.

]]>I think using binned covariates (RHS vars, predictors, whatever) is a much different thing than using discretized outcome variables. Here’s one example: age.

Suppose we have people aged 18-65, with age reported in years. First, it’s worth noting that we have already discretized a continuous variable – even if we didn’t use round years, we’d still be binning the ages (we can’t have age-in-nanoseconds or whatever approaches continuous). So, now, given that, what can we do? We could, depending on sample size, include parametric controls for age (age, age-squared, cubed, etc.). This would fit some smooth polynomial function to the age profile, through implicitly-binned-ages-in-years. Or, we could use dummy variables for age-in-years, which would be a fully flexible non-parametric control for age (and hope T was orthogonal to age-within-year-bins). With sufficient data and weak priors, I’d say that the non-parametric controls are better. Wider bins are just tweaking the trade-off between parametric and non-parametric adjustments.
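
To make the two options concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration: the simulated data, the kinked age profile, and the helper names `polynomial_design`, `dummy_design` and `fitted_values` are hypothetical and not taken from any analysis discussed in this thread.

```python
import numpy as np

def polynomial_design(age, degree=2):
    """Parametric control: columns are 1, age, age^2, ..., age^degree."""
    age = np.asarray(age, dtype=float)
    return np.column_stack([age ** d for d in range(degree + 1)])

def dummy_design(age):
    """Fully flexible control: one indicator column per observed age-in-years."""
    age = np.asarray(age)
    levels = np.unique(age)
    return (age[:, None] == levels[None, :]).astype(float), levels

def fitted_values(X, y):
    """Least-squares fitted values for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Simulated (hypothetical) data: ages 18-65 and a kinked age profile
# that a quadratic can only approximate.
rng = np.random.default_rng(0)
age = rng.integers(18, 66, size=2000)
y = np.where(age < 40, 0.05 * age, 2.0) + rng.normal(0, 0.5, size=age.size)

fit_poly = fitted_values(polynomial_design(age, degree=2), y)
D, levels = dummy_design(age)
fit_dummy = fitted_values(D, y)   # equals the mean of y within each age

rss_poly = float(np.sum((y - fit_poly) ** 2))
rss_dummy = float(np.sum((y - fit_dummy) ** 2))
```

Because any polynomial in age is constant within each one-year bin, the dummy fit can never have a larger in-sample residual sum of squares than the polynomial fit. The price is one parameter per age, which is the parametric versus non-parametric trade-off described above.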

I tend to agree regarding interactions with treatment, but that is mostly for two reasons: improved power (sparseness tends to be a problem when looking at sub-groups of treated individuals) and interpretation (an interaction between a continuous variable and treatment is often hard to interpret, especially with a main treatment effect also present, and possibly some other interactions).

One other option for continuous predictors that heterogeneously impact treatment: estimate a local linear (lowess, something like that) regression on the T group and C group separately. Subtract the two estimates at a bunch of values of X, and then bootstrap the whole procedure. Then you get a nice, non-parametric estimate of the treatment effect across X, along with standard error bars at each point, which produces a totally readable, very convincing graph with good visualization of estimated precision.
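
A rough sketch of that procedure, assuming a Gaussian kernel, a fixed bandwidth, and resampling within each arm (none of which is specified above). The function names and the simulated data are hypothetical; this is one way to do it, not a definitive implementation.

```python
import numpy as np

def local_linear(x, y, grid, bandwidth):
    """Local linear fit of y on x at each grid point, using Gaussian kernel weights."""
    est = np.empty(len(grid))
    for i, x0 in enumerate(grid):
        w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        sw = np.sqrt(w)                                   # sqrt weights give weighted least squares
        X = np.column_stack([np.ones_like(x), x - x0])    # local intercept and slope
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        est[i] = beta[0]                                  # the intercept is the fit at x0
    return est

def effect_curve(x, y, treated, grid, bandwidth):
    """Separate fits for the T and C groups, subtracted along the grid."""
    fit_t = local_linear(x[treated], y[treated], grid, bandwidth)
    fit_c = local_linear(x[~treated], y[~treated], grid, bandwidth)
    return fit_t - fit_c

def bootstrap_effect(x, y, treated, grid, bandwidth, n_boot=200, seed=0):
    """Point estimate of the effect curve plus pointwise bootstrap standard errors."""
    rng = np.random.default_rng(seed)
    point = effect_curve(x, y, treated, grid, bandwidth)
    t_idx, c_idx = np.flatnonzero(treated), np.flatnonzero(~treated)
    draws = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        # Resample with replacement within each arm so group sizes stay fixed.
        idx = np.concatenate([rng.choice(t_idx, t_idx.size),
                              rng.choice(c_idx, c_idx.size)])
        draws[b] = effect_curve(x[idx], y[idx], treated[idx], grid, bandwidth)
    return point, draws.std(axis=0, ddof=1)

# Hypothetical data: the treatment effect fades linearly from 1 to 0 as x goes 0 -> 1.
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(0, 1, n)
treated = rng.random(n) < 0.5
y = treated * (1.0 - x) + rng.normal(0, 0.3, n)

grid = np.linspace(0.1, 0.9, 9)
effect, se = bootstrap_effect(x, y, treated, grid, bandwidth=0.15)
```

Plotting `effect` against `grid` with a band of about two `se` on either side gives the kind of graph described above. The bandwidth is a tuning choice and would normally be checked for sensitivity.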

]]>Janet Peacock and colleagues wrote this paper a couple of years ago: “Dichotomising continuous data while retaining statistical power using a distributional approach”. It’s a great approach, acknowledging that you want to analyse with the continuous data but sometimes there is a real reason for communicating with the discrete version.

In medical stats we spend a lot of time messing around with meta-analysis and one big problem is where some studies report mean change in the outcome, while others report the odds ratio or risk ratio of achieving some threshold. To make matters even worse, sometimes the threshold is absolute, sometimes relative to previous measurements. You can fix this if you ditch the old meta-analysis formulas and view it as a Bayesian latent variable or coarsened data model. Hopefully I’ll be presenting this at JSM this year, assuming they like the idea…

]]>Rahul:

Think of the likelihood as the probability of what you observe and then the probability of a function of what you observe as a pseudo likelihood – you can only lose information (as long as the function is non-invertible).

Now if you are only losing very sparse information about nuisance parameters in the full likelihood while the pseudo likelihood does not involve those parameters – there can be a _gain_ – the canonical example being REML estimation.

In a Bayesian perspective you are mixing posteriors (not conditioning on the true posterior – same as in ABC) but then you avoid sensible priors for the nuisance parameters in the full model.

]]>I believe what Andrew is referring to with the Lee RDD is that it might make more sense to use a continuous OUTCOME (not the running variable). Thus, you can identify the effect on vote share, rather than an effect on probability of winning.

]]>(2) Is the statistical efficiency of the two step process guaranteed to be always better than ignoring the latent variable? If not always, mostly? It sounds intuitive, yes, but are there stronger, more rigorous reasons?

]]>So for instance – I could tell you that a nutrition intervention reduced “childhood stunting” by 10pp, or I could tell you it increased mean child height-for-age z-score among children under five by .12sd. The first one sounds like I’m curing malnutrition. The second one sounds like I’m a statistician who got some stars next to a number that doesn’t mean anything. Now me – I get a lot more info from the .12, but that’s because I know something about the distribution of child height-for-age z-score (I know, it sounds N(0,1), but it isn’t in Sub-Saharan Africa or South Asia).

This is much like the hypertension/blood-pressure example. But in that case, since I know very little about blood pressure, the incidence of the clinical condition is the unit that makes sense to me (as an information consumer, or maybe more like an infographic consumer).

All said – I still do my analysis on the continuous underlying variable – that’s what makes more sense and contains more information. But I might also report results on the categorical variable, or at least find some way to translate my marginal effect estimates into some more readily interpretable relation to a thing in the world people understand (such as a clinical condition, or a poverty rate, or something).
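
One way to do that translation can be sketched as follows, under loudly stated assumptions: the outcome is treated as normally distributed, the intervention is treated as a pure shift in the mean with the spread unchanged, and the baseline mean and SD below are made-up numbers for illustration only. The threshold of -2 is the usual cutoff for stunting in height-for-age z-scores; the function names are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def share_below(threshold, mean, sd):
    """Share of a Normal(mean, sd) population falling below the threshold."""
    return normal_cdf((threshold - mean) / sd)

def prevalence_change(threshold, mean, sd, shift):
    """Change in the share below the threshold when the mean moves by `shift` (sd held fixed)."""
    return share_below(threshold, mean + shift, sd) - share_below(threshold, mean, sd)

# Hypothetical baseline distribution of height-for-age z-scores (assumed, not real data).
baseline_mean, baseline_sd = -1.5, 1.3

# Effect estimated on the continuous scale, re-expressed as a change in stunting prevalence.
change = prevalence_change(-2.0, baseline_mean, baseline_sd, shift=0.12)
```

Under these assumed numbers the 0.12 shift corresponds to a drop in prevalence of a few percentage points. The same shift maps to a different change at a different baseline, which is one reason to estimate on the continuous scale and translate only for reporting.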

]]>Yes, but it’s easy to imagine that predictive power may be improved by having access to these other variables as intermediate variables?

]]>Here the discrete is of obvious interest. It is different from the wins/ERA example, because there is no prior discrete (most people can only die once). And the continuous related variables (e.g. BMI, cholesterol, blood pressure, etc) are generally what you might use as explanatory variables. I’m not interested in who might have high BP – I want to know of those that do, who is going to die.

Does modeling a risk of death under a logistic approach represent the only approach here?

]]>*Are there any weird situations where, in spite of data being available, modelling the final discrete outcome makes more sense? *

Two come to mind. First, extremely heavy-tailed distributions, where a few way-far-out observations heavily influence the analysis; dichotomizing variables leads to much lower influence. Second, when measurement error is really bad, much of the variability in the underlying continuous variable is simply noise; compared to this a dichotomized variable is much cleaner.

But these are unusual. Much more often, analysis using the continuous variable is a better plan.

]]>Another example might be more helpful, and I agree with your point about baseball. If we want to forecast election outcomes, I agree completely, model vote shares. But for the example you cite, it’s almost like saying “don’t use matching”.

]]>When the bias in it can’t adequately be dealt with (i.e. traded in safely to reduce variance).

For instance when you actually suspect that a team that could have won, purposely lost by a large amount to encourage large bets against them next time (i.e. the old pool shark trick).

In combining RCTs and observational studies, I once used the fruit salad metaphor of combining sweet (unbiased) and sour (biased) apples and put the challenge as – when is there enough knowledge about the bias to adequately sweeten the sour apples so that you should not just throw them away.

(Chapter 11 in http://www.amazon.com/Empirical-Likelihood-Inference-Lecture-Statistics/dp/0387950184#_ )

]]>I have encountered difficulties with NOT discretizing the predictors in cases where the treatment interacts with that predictor, like ANCOVA models (or their GAM equivalents). The people I work with are generally not terribly interested in the intercepts or slopes associated with an ANCOVA; they’re interested in where, over the domain of the predictor, the outcome differs between treatment categories (for instance, a drug may have a large effect for the first week following trauma, then tail off to nothing over the next few weeks). Intuitively, this is like a running contrast over the predictor’s range. My best solution thus far has been to estimate the confidence interval for the difference by finding the difference’s standard error as the square root of the sum of the squared standard errors of the lines/curves associated with the individual treatment conditions. The problem I have with this approach is that it appears to be too conservative: in cases where the interaction lacks significance, you would like the result to be roughly equivalent to the treatment contrast in isolation, but it generally results in a rather broader confidence interval. I think the running contrasts notion is a good one, but what is needed is a better methodology to implement it. Another difficulty is that MDs have difficulty interpreting these comparisons, and are more comfortable with comparisons involving discrete levels of a predictor.

Due to these difficulties, I am for the moment leaning towards using discrete predictors in cases where there is an interaction between treatment levels and predictors, accompanied by adjustment for multiple comparisons. I would prefer a better solution. On the plus side, using more conservative methods helps to reduce problems with reproducibility.
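
For reference, the running-contrast interval described above can be written out as follows. The notation is introduced here for illustration: $\hat f_T$ and $\hat f_C$ are the fitted lines or curves for the two treatment conditions. The formula assumes the two fits are estimated independently; in a joint model a covariance term $-2\,\mathrm{Cov}[\hat f_T(x), \hat f_C(x)]$ would enter under the square root.

```latex
\widehat{\Delta}(x) = \hat f_T(x) - \hat f_C(x),
\qquad
\operatorname{SE}\bigl[\widehat{\Delta}(x)\bigr]
  = \sqrt{\operatorname{SE}\bigl[\hat f_T(x)\bigr]^{2} + \operatorname{SE}\bigl[\hat f_C(x)\bigr]^{2}},
\qquad
\widehat{\Delta}(x) \pm z_{1-\alpha/2}\,\operatorname{SE}\bigl[\widehat{\Delta}(x)\bigr]
```

This is a pointwise interval at each $x$; it makes no adjustment for looking at many values of $x$ at once.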

]]>To me the converse question is more interesting: Are there any weird situations where, in spite of data being available, modelling the final discrete outcome makes more sense?

Examples such as @Jay Ulfelder’s comment above don’t count because there the data are simply non-existent. That’s a different situation.

Even more generally, within any given class of models, are there situations where using aggregate data leads to better models than using a more granular model?

]]>I do a lot of work on political violence, including episodes of civil war and mass killing. We conventionally identify those episodes through Boolean expressions that include body counts and descriptive features of the context and parties to the violence. Following your logic—which I don’t dispute—we would do better to model those death counts directly.

The problem is that we simply don’t have reliable death counts for most of the episodes we think we can identify when we use a simple threshold. That’s partly because of the poor bureaucratic capacity of the societies in which these things tend to happen, but it’s also because the parties to these conflicts are motivated to conceal or misrepresent their activities. And then there’s the general fog of war.

So we can aspire to follow your advice, but practicalities keep us from doing so, and I don’t expect that problem to disappear any time soon.

]]>