Matching for preprocessing data for causal inference

Chris Blattman writes:

Matching is not an identification strategy or a solution to your endogeneity problem; it is a weighting scheme. Saying matching will reduce endogeneity bias is like saying that the best way to get thin is to weigh yourself in kilos. The statement makes no sense. It confuses technique with substance. . . . When you run a regression, you control for the X you can observe. When you match, you are simply matching based on those same X. . . .

I see what Chris is getting at (matching, like regression, won’t help for the variables you’re not controlling for), but I disagree with his characterization of matching as a weighting scheme. I see matching as a way to restrict your analysis to comparable cases. The statistical motivation: robustness. If you had a good enough model, you wouldn’t need to match; you’d just fit the model to the data. But in common practice we often use simple regression models, and so it can be helpful to do some matching first, before regression. It’s not so difficult to match on dozens of variables, but it’s not so easy to include dozens of variables in your least squares regression. So in practice it’s not always the case that “you are simply matching based on those same X.” To put it another way: yes, you’ll often need to worry about potential X variables that you don’t have, but that shouldn’t stop you from controlling for everything that you do have, and matching can be a helpful tool in that effort.

Beyond this, I think it’s useful to distinguish between two different problems: imbalance and lack of complete overlap. See chapter 10 of ARM for further discussion. Also some discussion here.
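
To make the "restrict to comparable cases" idea concrete, here is a minimal sketch in Python with NumPy. Everything in it is made up for illustration: the data are noise-free, the controls cover a much wider range of x than the treated units, the true effect is assumed to be a constant 2, and the outcome is quadratic in x while the regression assumes linearity. The matching step is plain one-to-one nearest-neighbor matching with replacement on a single covariate, which is only one of many ways to match.

```python
import numpy as np

# Hypothetical, noise-free data: controls span x in [0, 4], treated only [0, 1].
x_c = np.linspace(0.0, 4.0, 401)
x_t = np.linspace(0.0, 1.0, 101)
true_effect = 2.0  # assumed constant treatment effect

def outcome(x, t):
    # The outcome is quadratic in x; the regression below wrongly assumes linearity.
    return x**2 + true_effect * t

def ols_effect(x, t, y):
    # Coefficient on t from least squares of y on [1, t, x].
    design = np.column_stack([np.ones_like(x), t, x])
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

# 1) The simple regression fit to all the data.
x_all = np.concatenate([x_t, x_c])
t_all = np.concatenate([np.ones_like(x_t), np.zeros_like(x_c)])
full_estimate = ols_effect(x_all, t_all, outcome(x_all, t_all))

# 2) Match each treated unit to its nearest control (with replacement),
#    then fit the same simple regression to the matched sample only.
nearest = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
x_m = np.concatenate([x_t, x_c[nearest]])
t_m = np.concatenate([np.ones_like(x_t), np.zeros_like(x_t)])
matched_estimate = ols_effect(x_m, t_m, outcome(x_m, t_m))

print(f"all data: {full_estimate:.2f}  matched: {matched_estimate:.2f}  truth: {true_effect}")
```

In this contrived setup the same misspecified regression recovers the assumed effect once the controls far from any treated unit are dropped, and misses it when they are kept. With a correctly specified model (say, adding x squared) the matching step would not be needed, which is the robustness point above.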

6 thoughts on “Matching for preprocessing data for causal inference”

1. Chris' commenters make similar points, especially with respect to comparable cases and robustness.

"It's not so difficult to match on dozens of variables, but it's not so easy to include dozens of variables in your least squares regression."

Is this essentially the "garbage can" critique, or what else are you referring to when you say that many variables in LS regressions is "not so easy"?

2. I think the two views are compatible if you allow weights to be zero. But I also think of matching as a way to create valid comparison groups – and that's why I prefer it to methods that keep all the data.

3. The flip side to this is to do something like surrogate variable analysis for variance that makes no sense whatsoever, and adjust for the surrogate variable; this is a common strategy in bioinformatics. I personally disagree with it since you're just as likely to blow away interesting variation as you are to mop up technical artifacts, and I've always argued that the answer was better sampling and experimental design (blocking).

Jeff Leek has a page with an explanation and implementation of SVA at http://www.biostat.jhsph.edu/~jleek/sva/ for comparison.

4. "It's not so difficult to match on dozens of variables, but it's not so easy to include dozens of variables in your least squares regression."

Why is this true, given that matching on dozens of variables typically involves estimating propensity scores from some kind of regression model? Doesn't that raise exactly the same problems (unless, for some reason, we have really good reason to believe a simple model is appropriate for selection but not the outcome)?

5. The model used to match (whether it's a logit regression or a nonparametric algorithm) is not important: it's not meant to duplicate the selection process. All that matters is that you can use it to get good balance between the treatment and control groups. (This also means that you can data mine all you want: try as many models as you can until you get the "best" balance!) Rubin makes this point repeatedly in his work.

6. There's a sense in which matching and regression have been shown to estimate different weighted average treatment effects, although this is kind of orthogonal to Blattman's bigger issue. The result was derived by Angrist (1998, Econometrica) and is discussed in Angrist & Pischke's book Mostly Harmless Econometrics (pp. 75-77).

When Y is regressed on T and a full set of dummies for the possible values of X, OLS estimates a weighted average of covariate-specific treatment effects. The cells are weighted not just by overall sample size, but by the balance between treatment and control group sample sizes. E.g., suppose the X=1 cell has 100 people with 50 treated and 50 controls, and the X=2 cell has 100 people with 90 treated and 10 controls. Then OLS puts more weight on the treatment effect for X=1. One way to see this is that OLS is designed to minimize variance when the model's true, and this particular regression model assumes homogeneous treatment effects.

In contrast, the estimator that Angrist calls "matching" (which, when X is discrete, could also be implemented by subclassification and weighting, or by OLS with an expanded model that includes interactions between T and the X dummies) is designed to estimate the average effect of treatment on the treated, so in this example, it puts more weight on the treatment effect for X=2.

Some people use this result to defend regression, but another take is that when treatment effects are heterogeneous, matching makes the estimand transparent, whereas it's too easy to use regression without being aware of its implicit weighting. A third take is that if we use regression, we should try to use flexible functional forms and interact treatment with the covariates.
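
The two-cell example in the last comment can be checked numerically. The sketch below (Python with NumPy) uses the cell counts from that example; the cell-specific effects (1 and 3) and the cell intercepts are hypothetical, and the outcomes are noise-free so that the algebra is exact. It assumes the standard form of the result: with a full set of X dummies, OLS weights each cell in proportion to n_x * p_x * (1 - p_x), where p_x is the share treated in the cell, while the effect on the treated weights each cell by its number of treated units.

```python
import numpy as np

# Cells from the example: x -> (treated, controls). The effects are hypothetical.
counts = {1: (50, 50), 2: (90, 10)}
effects = {1: 1.0, 2: 3.0}

# Build noise-free unit-level data: cell intercept plus cell-specific effect.
rows = []
for cell, (n_t, n_c) in counts.items():
    rows += [(cell, 1, 10.0 * cell + effects[cell])] * n_t
    rows += [(cell, 0, 10.0 * cell)] * n_c
x, t, y = np.array(rows, dtype=float).T

# OLS of Y on T and a full set of dummies for X (no separate intercept).
design = np.column_stack([t, x == 1, x == 2]).astype(float)
ols_estimate = np.linalg.lstsq(design, y, rcond=None)[0][0]

# Implicit cell weights under each estimator.
n = np.array([sum(counts[c]) for c in counts], dtype=float)
p = np.array([counts[c][0] for c in counts]) / n
tau = np.array([effects[c] for c in counts])
ols_weights = n * p * (1 - p) / np.sum(n * p * (1 - p))
att_weights = n * p / np.sum(n * p)

print("OLS weights on cells X=1, X=2:", ols_weights.round(3))
print("ATT weights on cells X=1, X=2:", att_weights.round(3))
print("OLS coefficient on T:", ols_estimate)
print("OLS-weighted average of cell effects:", ols_weights @ tau)
print("ATT-weighted average of cell effects:", att_weights @ tau)
```

The OLS coefficient on T coincides with the variance-weighted average, which leans toward the balanced X=1 cell, while the treated-weighted average leans toward X=2. If the cell effects were equal the two would agree, which is why the distinction only matters under heterogeneous effects.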