Matthew Bogard writes:

Regarding the book Mostly Harmless Econometrics, you state:

A casual reader of the book might be left with the unfortunate impression that matching is a competitor to regression rather than a tool for making regression more effective.

But in fact isn’t that what they are arguing, that, in a ‘mostly harmless way’ regression is in fact a matching estimator itself?

“Our view is that regression can be motivated as a particular sort of weighted matching estimator, and therefore the differences between regression and matching estimates are unlikely to be of major empirical importance” (Chapter 3 p. 70)

They seem to be distinguishing regression (without prior matching) from all other types of matching techniques, and therefore implying that regression can be a ‘mostly harmless’ substitute or competitor to matching. My previous understanding, before starting this book, was, as you say, that matching is a tool that makes regression more effective.

I have not finished their book, and have been working at it for a while, but if they do not mean to propose OLS itself as a matching estimator, then I agree that they definitely need some clarification.

I actually found your particular post searching for some article that discussed this more formally, as I found my interpretation (misinterpretation) difficult to accept.

What say you?

My reply:

I don’t know what Angrist and Pischke actually do in their applied analysis. I’m sorry to report that many users of matching do seem to think of it as a pure substitute for regression: once they decide to use matching, they try to do it perfectly and they often don’t realize they can use regression on the matched data to do even better. In my book with Jennifer, we try to clarify that the primary role of matching is to correct for lack of complete overlap between control and treatment groups.

But I think in their comment you quoted above, Angrist and Pischke are just giving a conceptual perspective rather than detailed methodological advice. They’re saying that regression, like matching, is a way of comparing-like-with-like in estimating a comparison. This point seems commonplace from a statistical standpoint but may be news to some economists who might think that regression relies on the linear model being true.

Gary King and I discuss this general idea in our 1990 paper on estimating incumbency advantage. Basically, a regression model works if either of two assumptions is satisfied: if the linear model is true, or if the two groups are balanced so that you’re getting an average treatment effect. More recently this idea (of there being two bases for an inference) has been given the name “double robustness”; in any case, it’s a fundamental aspect of regression modeling, and I think that, by equating regression with matching, Angrist and Pischke are just trying to emphasize that these are two different ways of ensuring balance in a comparison.
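To make the "either of two assumptions" point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the data-generating process, the constant true effect of 1, and the helper names `ols_treatment_coef` and `simulate`); none of it comes from the 1990 paper. It fits a linear regression of the outcome on treatment and one predictor in three settings: linear outcome with unbalanced groups, nonlinear outcome with balanced (randomized) groups, and nonlinear outcome with unbalanced groups.

```python
import numpy as np

def ols_treatment_coef(y, d, x):
    """Coefficient on d from a least-squares fit of y on [1, d, x]."""
    X = np.column_stack([np.ones_like(x), d, x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

def simulate(linear_outcome, balanced, n=200_000, tau=1.0, seed=0):
    """Simulate one dataset (hypothetical setup) and return the fitted effect."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    if balanced:
        # Treatment assigned by coin flip, independent of x.
        d = (rng.random(n) < 0.5).astype(float)
    else:
        # Units with larger x are more likely to be treated.
        d = (x + rng.normal(size=n) > 0).astype(float)
    # Baseline outcome is either linear in x or strongly nonlinear in x.
    baseline = 2.0 * x if linear_outcome else np.exp(x)
    y = baseline + tau * d + rng.normal(size=n)
    return ols_treatment_coef(y, d, x)

est_linear_unbalanced = simulate(linear_outcome=True, balanced=False)
est_nonlinear_balanced = simulate(linear_outcome=False, balanced=True)
est_nonlinear_unbalanced = simulate(linear_outcome=False, balanced=False)

print("linear model true, groups unbalanced:", est_linear_unbalanced)
print("linear model false, groups balanced: ", est_nonlinear_balanced)
print("linear model false, groups unbalanced:", est_nonlinear_unbalanced)
```

In this setup the first two estimates should land near the true effect of 1, while the third, where neither assumption holds, should be visibly off.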

In many examples, neither regression nor matching works perfectly, which is why it can be better to do both (as Don Rubin discussed in his Ph.D. thesis in 1970 and subsequently in some published articles with his advisor, William Cochran).
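Here is a rough sketch of what "doing both" can look like when the problem is lack of overlap. The setup is entirely hypothetical: controls spread over a wide range of a single predictor, treated units confined to a narrow range, a nonlinear outcome, a constant true effect of 1, and one-to-one nearest-neighbor matching with replacement. It is meant as an illustration under those assumptions, not as the procedure from any of the papers or books mentioned here.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 1.0                      # assumed constant treatment effect
n_c, n_t = 20_000, 2_000

# Controls cover a wide range of x; treated units sit in a narrow band.
x_c = rng.uniform(-3.0, 3.0, size=n_c)
x_t = rng.uniform(0.0, 1.0, size=n_t)
y_c = np.exp(x_c) + rng.normal(size=n_c)
y_t = np.exp(x_t) + tau + rng.normal(size=n_t)

def ols_treatment_coef(y, d, x):
    """Coefficient on d from a least-squares fit of y on [1, d, x]."""
    X = np.column_stack([np.ones_like(x), d, x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

# 1) Linear regression on the full, poorly overlapping data.
d_all = np.concatenate([np.ones(n_t), np.zeros(n_c)])
full_reg = ols_treatment_coef(
    np.concatenate([y_t, y_c]), d_all, np.concatenate([x_t, x_c])
)

# 2) Match each treated unit to its nearest control on x (with replacement).
order = np.argsort(x_c)
xc_sorted = x_c[order]
pos = np.searchsorted(xc_sorted, x_t)
left = np.clip(pos - 1, 0, n_c - 1)
right = np.clip(pos, 0, n_c - 1)
use_right = np.abs(xc_sorted[right] - x_t) < np.abs(xc_sorted[left] - x_t)
match = order[np.where(use_right, right, left)]

# 3) Compare matched groups directly, then run the regression on matched data.
matched_diff = y_t.mean() - y_c[match].mean()
d_m = np.concatenate([np.ones(n_t), np.zeros(n_t)])
matched_reg = ols_treatment_coef(
    np.concatenate([y_t, y_c[match]]), d_m, np.concatenate([x_t, x_c[match]])
)

print("regression, full data:   ", full_reg)
print("matched difference:      ", matched_diff)
print("regression, matched data:", matched_reg)
```

In this sketch the full-data regression extrapolates a line across the region with no treated units and misses badly, while both matched estimates stay close to the assumed effect. The regression on the matched data additionally adjusts for whatever imbalance in x the matching leaves behind.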

This morning, your questioner got a blog-reply out of Joshua Angrist, co-author of Mostly Harmless Econometrics, too: http://www.mostlyharmlesseconometrics.com/2011/07…

He seems to say the same thing with more methodological detail, summarizing with “i can’t imagine a situation where matching makes sense but regression does not.”

Frank:

I took a look. Oddly enough, Angrist seems to think he's disagreeing with me even though he's not. He didn't link to this blog, so I suspect he's responding to his questioner's excerpt of my blog rather than to the whole thing.

Why do we want to weight by the variance of X? When there is essential heterogeneity in treatment effects, don’t we want an average treatment effect for a particular goal (say, the treated)? [Cross-commented to MHE blog post.]

Dean:

Don't ask me; I wouldn't do that sort of weighting! In chapter 10 of ARM we discuss matching as a way to handle the problem of lack of complete overlap.

Dr. Gelman, thanks for this explanation. I actually did inquire at the MHE blog in reference to your earlier post and related article in the Stata Journal. Not trying to be dubious or start a debate between you guys, but I figured the odds of getting a response from either here or there were stacked against me so I inquired both places. I'm flattered to get quick responses from each of you and I have certainly learned a lot from the discussion. I'm sure others have as well.

Good stuff.

I have learned quite a bit from this blog (and there is a wealth of information unexplored in the archive of old posts).

Have you (Prof. Gelman) ever considered turning this into a book? I can imagine that you have plenty of projects on the shelf, but I think that there is enough insightful material on "practical statistics" to make a nice book. I believe that Terence Tao is doing something similar for mathematics with his blog.

I figured you wouldn't as I've read Gelman & Hill!

This didn't make any sense to me or ring any bells… Seems it was actually a mistake (http://www.mostlyharmlesseconometrics.com/2011/07/regression-what/#comment-582).

The weighting is by the variance of D given X. Thus with binary D, the weights are P(D = 1 | X)(1 – P(D = 1 | X)). As Angrist says, this is efficient when the treatment effect is constant.

This further confirms the intuition that when treatment effects are heterogeneous (always?), we likely want to do some other weighting scheme in order to estimate something we care about.
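A small numerical check of the weighting described above, under assumptions made up for illustration (three strata of X, treatment probabilities of 0.1, 0.5, and 0.9, and stratum-specific effects of 1, 2, and 4). With a dummy for each stratum, the regression coefficient on D equals the average of the within-stratum treated-minus-control differences, weighted by stratum size times the in-sample variance of D in that stratum. That generally differs from the weights that would give the average effect over everyone or over the treated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30_000
n_strata = 3

# Hypothetical strata with different treatment rates and different effects.
stratum = rng.integers(0, n_strata, size=n)
p_treat = np.array([0.1, 0.5, 0.9])[stratum]
d = (rng.random(n) < p_treat).astype(float)
effect = np.array([1.0, 2.0, 4.0])[stratum]
y = 0.5 * stratum + effect * d + rng.normal(size=n)

# Regression of y on d plus a full set of stratum dummies (no intercept).
dummies = [(stratum == s).astype(float) for s in range(n_strata)]
X = np.column_stack([d] + dummies)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Within-stratum differences in means and three candidate weightings.
diffs, w_var, w_all, w_treated = [], [], [], []
for s in range(n_strata):
    in_s = stratum == s
    n_s = in_s.sum()
    p_hat = d[in_s].mean()                      # share treated in stratum s
    diffs.append(y[in_s & (d == 1)].mean() - y[in_s & (d == 0)].mean())
    w_var.append(n_s * p_hat * (1.0 - p_hat))   # variance-of-D weights
    w_all.append(n_s)                           # weights for everyone
    w_treated.append(n_s * p_hat)               # weights for the treated

variance_weighted = np.average(diffs, weights=w_var)
ate_hat = np.average(diffs, weights=w_all)
att_hat = np.average(diffs, weights=w_treated)

print("OLS coefficient on D:      ", beta_ols)
print("variance-weighted average: ", variance_weighted)
print("average over everyone:     ", ate_hat)
print("average over the treated:  ", att_hat)
```

The first two numbers agree up to floating-point error, which is just algebra. The last two differ from them here because the effects were assumed to vary across strata; with a constant effect all four would estimate the same quantity.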

Louis:

I've published two books on practical statistics already!

Fair enough. I got a bit excited about some good stuff I found on the blog.

Dean:

I generally think weighting is a scam. It's occasionally a useful trick but I wouldn't elevate it to a principle. I'd rather estimate the treatment effect as a function of the predictors (and maybe even of unobservables, as in a measurement error model) and then average later if so desired. Jennifer wrote a paper on this using BART to estimate the treatment effect function.
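A minimal sketch of the "estimate the effect as a function of the predictors, then average" idea. To keep it short it swaps in a linear model with a treatment-by-predictor interaction where the paper mentioned above uses a flexible fit; the simulated data, the true effect function 1 + 0.5x, and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical data: treatment is more likely at high x, and the
# treatment effect itself grows with x.
x = rng.normal(size=n)
d = (x + rng.normal(size=n) > 0).astype(float)
tau_true = 1.0 + 0.5 * x
y = 1.0 + 2.0 * x + tau_true * d + rng.normal(size=n)

# Fit y on [1, x, d, d*x]; the last two coefficients define the
# estimated treatment effect function tau_hat(x) = b_d + b_dx * x.
X = np.column_stack([np.ones(n), x, d, d * x])
b_0, b_x, b_d, b_dx = np.linalg.lstsq(X, y, rcond=None)[0]
tau_hat = b_d + b_dx * x

# Average the estimated function over whichever population is of interest.
ate_hat = tau_hat.mean()            # everyone in the sample
att_hat = tau_hat[d == 1].mean()    # treated units only

print("estimated effect function: %.3f + %.3f * x" % (b_d, b_dx))
print("average over everyone:   ", ate_hat)
print("average over the treated:", att_hat)
```

The averaging step comes last and is cheap to redo: the same fitted function yields an average over all units, over the treated, or over any other subgroup, without refitting or reweighting.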