Fan Li and Peng Ding write:
Difference-in-differences is a widely-used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trend, which is scale dependent and may be questionable in some applications. A common alternative method is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that difference-in-differences and the lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming the parallel trend will overestimate the effect; in contrast, if the parallel trend is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings without assuming either ignorability or parallel trend. We also extend the result to semiparametric estimation based on inverse probability weighting.
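To make the bracketing claim concrete, here is a minimal simulation sketch, not taken from the paper. It assumes a linear data-generating process in which ignorability given the lagged outcome holds, the lagged-outcome coefficient `beta` is below 1, and the treated group starts from a lower baseline. The function names and all parameter values (`tau`, `beta`, `gap`) are hypothetical choices for illustration. Under these assumptions the difference-in-differences estimate is biased by (beta - 1) * gap, which is positive here, so it should sit above the lagged-regression estimate.

```python
import numpy as np


def simulate(n=20000, tau=1.0, beta=0.5, gap=-1.0, seed=0):
    """Two-period data where ignorability given the lagged outcome holds.

    Assumed setup (illustrative only): treated units (g == 1) start `gap`
    lower on average, and the post-period outcome depends on the
    pre-period outcome with coefficient beta < 1.
    """
    rng = np.random.default_rng(seed)
    g = rng.integers(0, 2, size=n)               # treatment-group indicator
    y0 = gap * g + rng.normal(size=n)            # pre-period outcome
    y1 = 0.3 + beta * y0 + tau * g + rng.normal(size=n)  # post-period outcome
    return g, y0, y1


def did_estimate(g, y0, y1):
    """Difference in mean gains between treated and control units."""
    gain = y1 - y0
    return gain[g == 1].mean() - gain[g == 0].mean()


def ldv_estimate(g, y0, y1):
    """Coefficient on g from least squares of y1 on (1, g, y0)."""
    X = np.column_stack([np.ones_like(y0), g, y0])
    coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
    return coef[1]


if __name__ == "__main__":
    g, y0, y1 = simulate()
    print("DiD estimate:", did_estimate(g, y0, y1))
    print("Lagged-regression estimate:", ldv_estimate(g, y0, y1))
```

With these particular settings the true effect is 1.0 and the difference-in-differences bias is 0.5 in expectation. The direction of the gap flips if the treated group instead starts from a higher baseline, so this sketch illustrates one configuration rather than the general result.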
Li and Ding sent the paper to me because I wrote something on the topic a few years ago, under the title, Difference-in-difference estimators are a special case of lagged regression.
P.S. Li and Ding’s paper has been updated, so I updated the link above.
From that 2007 post, a comment: https://statmodeling.stat.columbia.edu/2007/02/15/differenceindif/#comment-42233
Apparently taking a statistics course makes you lose all notion of algebra:
setting gamma_1 to 1 and solving for “Y_1i – Y_0i” and inspecting coefficients yields that the models are *exactly the same* when beta_0 = gamma_0 and beta_1 = gamma_3
So… you’re saying they aren’t the same in general. Anyway, causal identification (which, if you get it right, lets you figure out what an intervention will do) doesn’t just depend on the equations but also on conditional independence assumptions (or equivalently, the causal graph backing the equations).
I see that, in response to the comment Jens posted on Gelman’s blog from 2007, Gelman said “Thanks for the comments. I’ll take a look more carefully and get back to you all.” But I don’t see that he posted a follow-up response.
Andrew — You may want to change the title of your post because difference-in-difference is not a special case of lagged regression.
Li and Ding state this explicitly in the very nice article that you link to. I quote below:
“Gelman (2007) pointed out that restricting beta to equal 1 in (6) gives identical least squares estimators for tau from models (5) and (6). This suggests that, under these two linear models, the difference-in-difference estimator is a special case of the lagged dependent-variable regression estimator. However, the nonparametric identification Assumptions 1 and 2 are not nested, and the difference-in-differences estimator is not a special case of the lagged-dependent variable adjustment estimator in general.”
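The linear-model part of that statement can be written out in a few lines. The notation below is an assumption for illustration and need not match the paper's models (5) and (6): G_i is the treatment-group indicator, Y_{i0} and Y_{i1} are the pre- and post-period outcomes, and superscripts T and C denote treated and control group means. The expression for the lagged-regression estimate uses the fact that least-squares residuals average to zero within each group when the regression includes an intercept and the group indicator.

```latex
% Assumed notation, not necessarily the paper's.
\begin{align*}
Y_{i1} &= \alpha + \beta Y_{i0} + \tau G_i + \varepsilon_i
  && \text{(lagged regression)} \\
Y_{i1} - Y_{i0} &= \alpha + \tau G_i + \varepsilon_i
  && \text{(restriction } \beta = 1\text{)} \\
\hat\tau_{\mathrm{DID}} &= (\bar Y_{1}^{T} - \bar Y_{1}^{C}) - (\bar Y_{0}^{T} - \bar Y_{0}^{C}) \\
\hat\tau_{\mathrm{LDV}} &= (\bar Y_{1}^{T} - \bar Y_{1}^{C}) - \hat\beta\,(\bar Y_{0}^{T} - \bar Y_{0}^{C}) \\
\hat\tau_{\mathrm{LDV}} - \hat\tau_{\mathrm{DID}} &= (1 - \hat\beta)\,(\bar Y_{0}^{T} - \bar Y_{0}^{C})
\end{align*}
```

So the two least-squares estimates coincide when the estimated coefficient on the lagged outcome is 1 or when the groups have equal baseline means. This is a statement about the estimators under the linear models only; it says nothing about whether the identifying assumptions behind them are nested.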
Their reference to “Gelman (2007)” is one of your blog posts. Li and Ding are politely pointing out that you are mistaken in your statements about differences-in-differences because you have forgotten about more general cases.
Thank you for linking to Li and Ding’s paper. It was a useful read.
I think the real problem here is one of cultural confusion. Bayesian models will look like:
Outcome[i] = Fb(EarlierOutcome, Covariates, Parameters) + Error[i]
The quantity of interest will be the posterior distribution over Parameters.
Whereas typical econometric “unbiased estimator” methods will want
Outcome[i] – EarlierOutcome[i] = Fe(TreatmentIndicator, Covariates, Parameters) + Error[i]
and the quantity of interest is the “unbiased point estimate” of the Parameters, usually a linear coefficient of the treatment indicator.
If you restrict the Bayesian model to use EarlierOutcome in a strictly *linear* way, and restrict the EarlierOutcome coefficient to be 1, and restrict the usage of Covariates, and eliminate various structural equation assumptions in Fb, etc., then you can convert the first model into the second form.
In this sense, the second form is a special case of the first one.
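A quick numerical check of that conversion, as a sketch under simplifying assumptions: Fb is taken to be linear, there are no covariates, and the data are simulated with made-up parameter values. The function name `compare_forms` and everything inside it are hypothetical. The code fits the second form (gain score regressed on the treatment indicator) and the unrestricted first form by least squares, and compares the treatment coefficients.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def compare_forms(n=500, seed=1):
    """Compare treatment coefficients from the two model forms.

    All data-generating values below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    treat = rng.integers(0, 2, size=n)                  # TreatmentIndicator
    earlier = 0.5 * treat + rng.normal(size=n)          # EarlierOutcome
    outcome = 0.2 + 0.7 * earlier + 1.0 * treat + rng.normal(size=n)
    ones = np.ones(n)

    # Second form: (Outcome - EarlierOutcome) on the treatment indicator,
    # i.e. the first form with the EarlierOutcome coefficient fixed at 1.
    gain = outcome - earlier
    gain_score = ols(np.column_stack([ones, treat]), gain)[1]

    # Plain difference in mean gains, for comparison.
    diff_in_mean_gains = gain[treat == 1].mean() - gain[treat == 0].mean()

    # First form with the EarlierOutcome coefficient left free.
    unrestricted = ols(np.column_stack([ones, treat, earlier]), outcome)[1]

    return {
        "gain_score": gain_score,
        "diff_in_mean_gains": diff_in_mean_gains,
        "unrestricted": unrestricted,
    }


if __name__ == "__main__":
    for name, value in compare_forms().items():
        print(name, value)
```

The gain-score coefficient matches the difference in mean gains up to floating-point error, while the unrestricted fit generally gives a different number whenever the groups differ at baseline and the fitted EarlierOutcome coefficient is not 1.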
My impression is that the attraction of the second form is that with appropriate assumptions you can maybe get unbiased estimates of the TreatmentIndicator coefficient without having to make all the structural mechanistic assumptions that go into the Fb Bayesian model.
I personally don’t find that to be a convincing argument. It’s like saying that if you randomly tweak certain screws under the hood of your car you can get the fastest lap time without even knowing what a fuel injector is or whether the car is even gasoline, diesel, or electric. Maybe so, but I doubt it in practice, and besides, the main thing I want to know is exactly what all the knobs do.
+1