
A propensity for bias?

Teryn Mattox writes:

I was reminded by your recent post on propensity score matching of a nagging doubt I have about this methodology. It seems as though propensity score matching actually exacerbates selection bias. I do research on childhood interventions, and am considering using a matched design to compare the outcomes of children who did and did not receive the “treatment” of high-quality preschool. But…if there are two families that are very similar in every observable way, but the parents still elected to put their kids in different preschool programs, doesn’t that mean that there must be that much more difference in unobservable characteristics between the two of them? For example, we have very few highly educated families putting their kids in poor-quality care – those few families must be doing something very different, no? So won’t we dramatically overestimate the treatment effect, even more than if we just did a simple OLS regression?

My response: To start, this isn’t anything specific to propensity scores, or even to matching; it also arises in regression or in any other situation where you’re controlling for pre-treatment variables. I’ll give two quick answers. First, it is generally recommended that you control for as many pre-treatment variables as you can. In their classic article on matching for causal inference, Dehejia and Wahba emphasize the importance of controlling for enough variables (and they discuss what “enough” can mean in practice). Second, the hope when controlling for things (whether by regression modeling, matching, or other methods) is to reduce the selection bias that you’re referring to. I imagine there’ve been quite a few papers in statistics and econometrics discussing the conditions under which controlling for a pre-treatment variable reduces bias in estimated treatment effects.
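To make the comparison concrete, here is a minimal simulation sketch in Python (NumPy only). Everything in it is an assumption chosen for illustration, not something taken from the question: one observed pre-treatment variable `x`, one unobserved variable `u`, a logistic selection rule, a true effect of 1.0, and 50 strata. Because the propensity score given `x` alone is monotone in `x` in this setup, stratifying on `x` stands in for stratifying on an estimated propensity score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_effect = 1.0  # assumed value, for illustration only

x = rng.normal(size=n)  # observed pre-treatment variable
u = rng.normal(size=n)  # unobserved variable; the estimators never see it

# Selection into treatment depends on both x and u (assumed logistic form).
p_treat = 1.0 / (1.0 + np.exp(-(x + u)))
t = rng.binomial(1, p_treat)

# Outcome depends on treatment, on x, and on the unobserved u.
y = true_effect * t + x + u + rng.normal(size=n)

# 1. Raw comparison of treated and untreated means, no adjustment.
naive = y[t == 1].mean() - y[t == 0].mean()

# 2. Linear regression of y on treatment and x.
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
ols = coef[1]

# 3. Stratify on x (equivalently, on the propensity score given x) and
#    average the within-stratum differences, weighted by stratum size.
n_strata = 50
cuts = np.quantile(x, np.linspace(0, 1, n_strata + 1)[1:-1])
stratum = np.digitize(x, cuts)  # integer labels 0 .. n_strata - 1
diffs = np.empty(n_strata)
weights = np.empty(n_strata)
for s in range(n_strata):
    in_s = stratum == s
    diffs[s] = y[in_s & (t == 1)].mean() - y[in_s & (t == 0)].mean()
    weights[s] = in_s.mean()
stratified = np.sum(weights * diffs)

print(f"true effect        {true_effect:.2f}")
print(f"raw difference     {naive:.2f}")
print(f"regression on x    {ols:.2f}")
print(f"stratified on x    {stratified:.2f}")
```

In this particular setup, both adjusted estimates land closer to the truth than the raw difference, and they land close to each other, but neither removes the bias coming from `u`. That is one simulated scenario under assumed parameters; other data-generating processes can behave differently.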


  1. Keith O'Rourke says:

    Teryn's comment seems very much in line with Rubin's and others' suggestion to simply omit those with extreme propensity scores (close to 0 and 1) from the analysis.

    I'm even guessing there will be a discussion of this by Jennifer in your book ;-)


  2. sylvain says:

    There are not many studies on how to choose the control variables for PS matching. Heckman and Navarro (2004, REStat) show that adding control variables or proxies may exacerbate bias in some cases. In my own work, I show that in some settings controlling for past outcomes may exacerbate bias, and that using difference-in-differences matching can be a better solution… The literature is not rich on this topic, but there are good reasons to believe that the simple motto "control for as many pretreatment variables as you can" is not always correct.

  3. Andrew Gelman says:

    Again, I'd like to separate three issues:

    1. "Controlling for" pre-treatment variables

    2. Linear regression, matching, a combination of the two, or some other method of doing the "controlling"

    3. Propensity score as a particular tool used (in various ways) in matching.

    To me, the most interesting issue is issue 1, but I certainly agree that, in practice, the details of 2 and 3 are crucial.

    On the question of what should be controlled for, I strongly recommend the work of Rajeev Dehejia. You might want to start with the 2005 article entitled "My final thoughts" (you can search for it on the linked page) and then go backward from there.

  4. Matt says:

    I've read fairly widely in this literature (including the classic Dehejia and Wahba paper), and I've never found anything that satisfyingly addresses Teryn's main concern. I don't think what Andrew's implying does either.

    Suppose some new policy, in which "treatment" is not imposed on the population, is implemented in a non-random fashion. In the case of propensity score matching, our estimate of the treatment effect is unbiased only if the variables used to estimate the propensity score either are themselves, or are proxies for, the only variables on which the propensity to seek "treatment" depends. Of course we know that this is never true.

    We can simulate from a distribution we know we won't be using for our model, but the results from these types of tests don't answer the most basic question. Is it plausible, in a social science setting, that the sources of propensity not measured by the variables we used to estimate propensity are not themselves associated with the outcome, likely even causally? If that is not plausible, then there is no way to make an unbiased estimate of the treatment effect. I don't see how we can make a claim that MAR (missing at random), or even something reasonably close to MAR, holds when evaluating a large majority of public policies, in the absence of randomized, imposed implementation.

    I realize I'm just restating Teryn's question, but I'm interested and don't feel like I've seen a good answer.
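
The suggestion in the first comment, omitting units with propensity scores close to 0 or 1, can be sketched roughly as follows. This is an illustration under stated assumptions rather than anyone's recommended procedure: the helper name `trim_extreme_scores`, the 0.1 and 0.9 cutoffs, and the simulated selection rule are all hypothetical choices, and the propensity score is known here only because the data are simulated.

```python
import numpy as np


def trim_extreme_scores(scores, lower=0.1, upper=0.9):
    """Return a boolean mask keeping units with lower <= score <= upper.

    The default cutoffs are arbitrary illustrative values.
    """
    scores = np.asarray(scores, dtype=float)
    return (scores >= lower) & (scores <= upper)


rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)  # observed pre-treatment variable

# Assumed strong selection on x, so many scores fall near 0 or 1.
score = 1.0 / (1.0 + np.exp(-2.5 * x))
t = rng.binomial(1, score)

keep = trim_extreme_scores(score)
low_tail = score < 0.1

print(f"share of units kept:                 {keep.mean():.2f}")
print(f"treated share where score < 0.1:     {t[low_tail].mean():.3f}")
print(f"treated share among the kept units:  {t[keep].mean():.3f}")
```

In the low tail almost nobody is treated, so the few treated units there are exactly the unusual cases Teryn describes. Dropping them avoids leaning on those comparisons, at the cost of changing the population the estimate applies to, and it does nothing about unobserved variables among the units that remain.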