Modeling heterogeneous treatment effects

Don Green and Holger Kern write on one of my favorite topics, treatment interactions (see also here):

We [Green and Kern] present a methodology that largely automates the search for systematic treatment effect heterogeneity in large-scale experiments. We introduce a nonparametric estimator developed in statistical learning, Bayesian Additive Regression Trees (BART), to model treatment effects that vary as a function of covariates. BART has several advantages over commonly employed parametric modeling strategies, in particular its ability to automatically detect and model relevant treatment-covariate interactions in a flexible manner.

To increase the reliability and credibility of the resulting conditional treatment effect estimates, we suggest the use of a split sample design. The data are randomly divided into two equally sized parts, with the first part used to explore treatment effect heterogeneity and the second part used to confirm the results. This approach permits a relatively unstructured data-driven exploration of treatment effect heterogeneity while avoiding charges of data dredging and mitigating multiple comparison problems. We illustrate the value of our approach by offering two empirical examples, a survey experiment on Americans' support for social welfare spending and a voter mobilization field experiment. In both applications, BART provides robust insights into the nature of systematic treatment effect heterogeneity.
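The split-sample workflow can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses scikit-learn's `RandomForestRegressor` as a stand-in for BART, simulated data, and a simple two-model ("T-learner") estimate of conditional treatment effects; all variable names and the simulated effect are assumptions for the sketch.

```python
# Sketch of the split-sample design: explore heterogeneity on one random
# half of the data, confirm on the other. RandomForestRegressor stands in
# for BART here; data are simulated, not from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated experiment: one covariate x, randomized treatment t, and an
# outcome whose treatment effect (0.5 + x) grows with x -- the
# heterogeneity the procedure is supposed to find.
n = 2000
x = rng.normal(size=(n, 1))
t = rng.integers(0, 2, size=n)
y = x[:, 0] + t * (0.5 + x[:, 0]) + rng.normal(size=n)

# Randomly divide the data into equally sized exploration and
# confirmation halves.
half = n // 2
idx = rng.permutation(n)
explore, confirm = idx[:half], idx[half:]

def cate_estimates(train_idx, eval_idx):
    """Fit separate response surfaces for treated and control units,
    then difference the predictions to estimate conditional average
    treatment effects (CATEs) on the evaluation set."""
    m1 = RandomForestRegressor(n_estimators=200, random_state=0)
    m0 = RandomForestRegressor(n_estimators=200, random_state=0)
    m1.fit(x[train_idx[t[train_idx] == 1]], y[train_idx[t[train_idx] == 1]])
    m0.fit(x[train_idx[t[train_idx] == 0]], y[train_idx[t[train_idx] == 0]])
    return m1.predict(x[eval_idx]) - m0.predict(x[eval_idx])

# Explore heterogeneity on the first half...
tau_explore = cate_estimates(explore, explore)
# ...then check whether the discovered pattern (effect rising in x)
# replicates on the held-out confirmation half.
tau_confirm = cate_estimates(confirm, confirm)
corr = np.corrcoef(x[confirm, 0], tau_confirm)[0, 1]
print(f"confirmation-half corr(x, estimated CATE): {corr:.2f}")
```

The point of the design is the last step: any interaction "discovered" on the exploration half must also show up in the confirmation half before it is taken seriously.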

I don’t have the time to give comments right now, but it looks both important and useful. And it’s great to see quantitatively-minded political scientists thinking seriously about statistical inference.

Pretty pictures, too (except for ugly Table 1, but, hey, nobody’s perfect).

4 thoughts on “Modeling heterogeneous treatment effects”

  1. I really like these types of approaches but, unfortunately, in pharmacoepidemiology, the high quality databases tend to be somewhat underpowered. The databases with a lot of participants tend to lack covariates of interest (including the obvious candidates for effect modification).

    It's a tricky problem. But, one day, I'd love to find just the right problem where dividing the sample into two parts is a reasonable approach, as I like (theoretically) the idea of doing selection and fitting on different samples (to reduce the effect of the winner's curse).

  2. Worry that automated pattern searches with the split-sample _bandaid_ are a bit like shooting people with a machine gun and then offering them bandaids… with so many false-positive possibilities (bullets), the bandaid just isn't going to be able to stop much of the false directions (bleeding).

    BART does seem to work really well in practice, and it's nice to see it being used for this – I just worry people will stop thinking about and checking models lest it delay rapid publication of now much-easier-to-get _results_.


  3. Great paper.

    We've built algorithms to apply multi-tree models (and other analogous methods) to do automated searches for heterogeneity in the results of randomized field experiments at my company, and applied them cumulatively to thousands of RFTs in commercial settings. A few quick observations:

    1. The train / test split (50/50 in the paper) is surprisingly critical, and the optimal split depends a lot on the number of experimental units. As you get to smaller samples, the share of cases that ought to be in the train group grows. When you get to test groups below something like 100 units, it becomes necessary to use a round-robin all-but-one approach. Otherwise, which specific units fall into the test versus train group can materially influence the analysis, even when using multiple trees.

    2. Even when you do all this, if you then test some segmentation you discover in a future randomized experiment, it sometimes holds and sometimes doesn't (at least in a relatively small sample size commercial environment, and obviously unless the first experiment had randomized cells that were designed purposely to expose the experimentation). In effect, this is a sophisticated hypothesis builder.

    3. If you then have a large number of randomized experiments in these kinds of predictor-result pairs, you can then apply pattern-finding on that (laboriously constructed!) data to help figure out when the kinds of methods described in the paper are more or less reliable in finding reliable predictive rules. Obviously this kind of recursive process can go on up through an arbitrary number of levels, but this is as far as we've ever taken it.
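    The round-robin all-but-one approach in point 1 is essentially leave-one-out evaluation: rather than risking one arbitrary split dominating the analysis, every unit takes a turn as the held-out case. A minimal sketch, with an illustrative model and simulated data (nothing here is the commenter's actual implementation):

    ```python
    # Round-robin all-but-one (leave-one-out) evaluation for a small
    # experiment, where a single fixed train/test split would be fragile.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import LeaveOneOut

    rng = np.random.default_rng(1)
    n = 60  # small sample, below the ~100-unit threshold mentioned above
    X = rng.normal(size=(n, 2))
    y = X[:, 0] + rng.normal(scale=0.5, size=n)

    # Each unit is held out once; the model is refit on the other n-1 units.
    preds = np.empty(n)
    for train, test in LeaveOneOut().split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[train], y[train])
        preds[test] = model.predict(X[test])

    # Every unit now has an out-of-sample prediction, so no single
    # arbitrary split can dominate the assessment.
    rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
    print(f"leave-one-out RMSE: {rmse:.2f}")
    ```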

    Jim Manzi

  4. Jim: this is a [_blind_ or thoughtless] sophisticated hypothesis builder.

    And that's where the concern arises (Peirce argued that we evolved as beings better able to hypothesize than thoughtless pattern searches – but not very convincingly).

    There will be some spurious patterns in any study, and randomly splitting that study into two won't fix that.

    Thanks for sharing your experience – and your point 3 sounds very interesting and could possibly address some of these thoughtlessness concerns.

