A common approach Sadish could examine is to define the treatment group T not by actual program participation, but as a group that is eligible for the program and is not vulnerable to selection. The research on how the Earned Income Tax Credit affects labor supply is a classic Difference-in-Differences analysis that uses this approach. The EITC is a tax credit that acts as a wage subsidy if you work and have children. We want to know the effects of the EITC on encouraging work, but expanding the EITC will automatically increase the number of people eligible for the program (and allow higher-income people to receive the credit). So if you ran a regression with an indicator EITC for receiving the credit, an indicator Y for whether the person works, and an indicator P for whether the observation is before or after the credit is made more generous,

Y = b0 + b1 EITC + b2 P + b3 P * EITC + e (i)

you would get effects biased by selection, just as Sadish pointed out (since everyone who selects into the EITC will be working). Instead, what researchers like Eissa and Liebman (1996) did is limit the sample to single women and compare women with and without children (with repeated cross-sections). The idea is that the status of having children is roughly fixed over time (so there is limited selection between the two groups based on the policy), but only women with children benefit from the expanded EITC. Then you run

Y = b0 + b1 child + b2 P + b3 P * child + e (ii)

and attribute the effect b3 to the EITC expansion. This has obvious limitations (what if something else happened at the same time as the EITC expansion that affected women with children differently? What if people have children in response to the EITC? etc.), which many researchers have done a lot of work to rule out. But that is the fundamental method: picking your “treatment” group so that it is not subject to selection. That means you don’t want to use actual measured program participation as the T variable; you want a variable that predicts program participation well in both the pre and post periods. If you see outcomes improve more for the group that is more tied to the program (regardless of whether each individual actually participates or not), that suggests a causal effect of the program.
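As a toy illustration of this design (not from the paper; every number here is invented), one can simulate repeated cross-sections in which the expansion raises employment only for women with children, and recover the effect from the child-by-period difference in differences:

```python
# Hypothetical sketch of the Eissa-Liebman style design: "child" (not actual
# EITC receipt) defines the treatment group, so selection into the credit does
# not contaminate the comparison. All probabilities below are made up.
import random

random.seed(0)

def employment_rate(has_child, post, n=20_000):
    """Simulated share of single women who work; the expansion (post=1)
    raises work probability only for women with children."""
    base = 0.55 if has_child else 0.70
    effect = 0.06 if (has_child and post) else 0.0
    return sum(random.random() < base + effect for _ in range(n)) / n

# Difference in differences:
# (mothers post - mothers pre) - (non-mothers post - non-mothers pre)
did = ((employment_rate(True, 1) - employment_rate(True, 0))
       - (employment_rate(False, 1) - employment_rate(False, 0)))
print(round(did, 3))  # close to the true simulated effect of 0.06
```

With binary group and period indicators, the coefficient on the P * child interaction in the regression above is exactly this difference of cell means, which is why no regression machinery is needed for the sketch.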

There are other options to deal with selection (find an instrumental variable, find a cutoff in how the policy was applied to get a regression discontinuity, etc.), but this is the most directly applicable to a Diff-in-Diff approach.

Today I saw in the current issue of the Notices of the American Mathematical Society an article “Progressions in Reasoning in K-12 Mathematics”, by Dev Sinha. There’s a preprint at https://arxiv.org/pdf/1812.11947.pdf

Since the article mentions several examples of reasoning based on a visual display, this reminded me of some books I’ve used ideas from in teaching (especially prospective teachers): Proofs without Words (I, II and III) by Roger Nelsen.

The hypothesis is that, before P, Q–>Y and Q–>T, so that the T–>Y association is spurious. This takes two equations:

T = b0 + b1 Q + e (i) [H: b1>0]

Y = b0 + b1 Q + b2 T + e (ii) [H: b1>0, b2:???]

Maybe the program has a causal effect, maybe not.

You could even have an alternate:

Y = b0 + b1 Q + b2 T + b3 T x Q + e (ii_a) [H: same as above + b3>0 {the treatment is more effective for high-Q} or b3>0 {worse selection}]

After P, re-estimating the same two equations on the post-period sample:

T = b0 + b1 Q + e (iii) [H: b1>0 {selection into T on Q persists or intensifies}]

Y = b0 + b1 Q + b2 T + e (iv) [H: b1>>0 {P changes how Q relates to Y, presumably through an omitted variable, uh oh!}, b2>>0 {P improves the effectiveness of the treatment}]

You could even have an alternate:

Y = b0 + b1 Q + b2 T + b3 T x Q + e (iv_a) [H: same as above + b3>0 {P improves the treatment for high-Q} or b3>0 {selection worsens}]

T = b0 + b1 Q + b2 P + b3 Q x P + e (v) [H: b3>0 {P induces selection into T on Q}]

Y = b0 + b1 Q + b2 P + b3 Q x P + b4 T + b5 T x Q + b6 T x P + b7 T x Q x P + e (vi) [H: b1>0, b3~0 {P does not directly change how Q relates to Y}, b4=0 {T has no effect on Y}, b5??? {is there T effect heterogeneity by Q?}, b6~0 {P does not improve T’s effectiveness}, b7??? {does P change how T effect heterogeneity by Q manifests?}]

If you specify all this, you can test your hypotheses and also interpret the other coefficients in light of how they would reflect on alternative models of the process (such as, P makes T better). Notice, it’s actually the Q x P interaction that carries the theoretical load for “inducing selection”! You can simplify a lot if testing for T effect heterogeneity by Q isn’t important:

T = b0 + b1 Q + b2 P + b3 Q x P + e (vii)

Y = b0 + b1 Q + b2 P + b3 Q x P + b4 T + b5 T x P + e (viii)

I’m also assuming you don’t have endogeneity issues (like when other commenters say to watch out for omitted variables: a big worry would be finding b3_vi or b3_viii =/= 0, since it would suggest P changes how Q relates to Y, and to me that screams omitted variable in this case), and that you know how to handle the subscripts. And you can add whatever else Andrew wants you to add, for instance other pre-treatment indicators that allow you to get a better grip on Q’s actual impact on selection into T.
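To make the mechanics concrete, here is a minimal pure-Python sketch (simulated data with invented parameters, not anyone’s actual dataset) that fits specification (viii) by ordinary least squares: T is generated with selection on Q that worsens after P, while T has no true effect on Y, so b4 and b5 should come out near zero while b1 recovers Q’s effect.

```python
# Rough simulation check of specification (viii). Invented data-generating
# process: T has no true effect on Y, but P induces selection into T on Q.
# OLS is solved via the normal equations in pure Python, no libraries assumed.
import random

random.seed(1)

def ols(X, y):
    """Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

n = 20_000
rows, y = [], []
for i in range(n):
    q = random.gauss(0, 1)
    p = i % 2                                          # half pre, half post
    t = int(q + 0.7 * q * p + random.gauss(0, 1) > 1)  # selection on Q, worse after P
    rows.append([1.0, q, p, q * p, t, t * p])          # design matrix for (viii)
    y.append(1.0 + 0.5 * q + 0.2 * p + random.gauss(0, 1))  # T truly has no effect

b0, b1, b2, b3, b4, b5 = ols(rows, y)
print(round(b1, 2), round(b4, 2), round(b5, 2))  # b1 near 0.5; b4, b5 near 0
```

The point of the sketch: even with selection into T baked into the simulation, conditioning on Q (and Q x P) drives the T coefficients to zero, matching the hypothesis that the T–>Y association is spurious.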

From what I can tell, DAGs offer a calculus for reducing a large set of predictors to a smaller (potentially much smaller) set from which causal inference can proceed via standard regression techniques.

I can see the appeal. In epidemiology, at least as it was practiced during my time with the WHO, one is confronted with huge datasets containing a vast number of predictors. The usual practice was to try various predictor sets, talking yourself into and out of including or excluding certain variables, until you got something that seemed interpretable. This is an incredibly disorganized, tedious, and often irritating enterprise, as you regularly find counterintuitive adjusted effects, e.g., smoking appearing protective against chronic disease.

Faced with such challenges, I can see the appeal of any approach that offers a seemingly principled way of fixing attention on a limited set of predictors. Whether or not this logic is valid is a different question.

Q) Do you have a real life example?

A1) I’ve gotta go; I’m done here

A2) …

A3) Here is a pedagogical example

The issue I disagree with you on is the practicality of thinking about potential confounders. Thinking about potential confounders, the data generating mechanisms, and the temporal relationships between variables is fundamental to causal inference, so your dismissal of thinking critically about potential confounders is enough for me to stop spending more time on this specific issue.

I have some DAGs to attend to.

it is hard for you and I to judge whether there really exists another Z that could be a common… Hard to really know without knowing all of the study details, time periods, policies, population of interest etc.

I don’t think it is *ever* practical, and there will always be dozens or more possible confounds someone who thinks hard could come up with. Do you have a real life counter example of using a linear regression where the variables included were not a matter of convenience?

I would do a regression of T ~ Q + newselectionvar1 + newselectionvar2, see which var is strongest predictor of T

Why not (using the R formula notation): T ~ Q*newselectionvar1*newselectionvar2? And doesn’t this assume you have data for the newselectionvars?

I mean my ultimate point is I don’t believe it is ever practical to interpret regression coefficients in the way people want to.

re: multiple vars that could predict selection into T: I agree with you that Sadish needs to think about all the possible vars that could be leading to folks selecting into T after the policy. If that’s the case, the DiD design that I specified would be difficult to do with multiple vars. In that case, I would do a regression of T ~ Q + newselectionvar1 + newselectionvar2, see which var is the strongest predictor of T, and use that as the exposure variable in the DiD that I specified above. If the main hypothesis is that there is some selection going on, then maybe just showing that this exists with respect to one var such as Q could be worth it.

Hard to really know without knowing all of the study details, time periods, policies, population of interest etc.

you need to make sure that there are no common causes Z that could be leading to a spurious association between Q and T. If so, then these common causes need to be adjusted for.

Is this practical? How many plausible Z are there vs what data is going to be available?

And if you look at my example above, it is not just common causes you need to worry about. You need to adjust for any other reason that people who select to be in the program would have better outcomes, otherwise your model is misspecified. At some point you can be satisfied with an approximation and ignore factors that have only negligible influence on the outcome of course.

However, I am very confused (probably not reading something correctly) and have some questions for Sadish and would be curious to know what Andrew thinks. Sadish is concerned about adjusting for post-treatment variables AKA mediators for the regression Y = b0 + b1 T + b2 P + b3 T x P + e (i).

I am assuming that he is worried about adjusting for T because it happens after P? i.e., T is a mediator of the association between P and Y?

If so, then I am not sure I follow. How is the variable P operationalized? Is it a yes/no variable? Did the policy that introduced the changes to the treatment program happen once? Or did new changes to the treatment program keep happening over and over across cross-sections? Is there data on T, P, and Y at all cross-sections? From your description, it sounds to me (especially if the policy that implemented changes to the treatment program happened only once) like the variable P, which you describe as “period after policy change (P)”, is not a covariate but rather a date that you can use to partition the cross-sections into pre and post. If this is the case, and if you observe Q, then you have the perfect data to conduct a difference-in-differences (DiD) quasi-experimental study:

As jrc mentioned, let T be your outcome, and let Q be the new real treatment group (assuming he is still interested in estimating the causal effect of quality on treatment program participation). The DiD design could be used to estimate the differential change in the proportion of people participating in T between those who are high Q versus low Q from before until after the policy change.

Let’s suppose there are 3 cross-sections before the policy and 3 after, and that at each cross-section you know whether people were in the treatment program (T) and you know the value of their Q (high versus low, to keep it simple). This is how you would set it up. You would have 6 rows per person (one for each of the 6 waves). You would create a new variable called post that is 0 if the cross-section was before the policy change and 1 if after. The DiD regression would be T = b0 + b1 Q + b2 post + b3 Q x post (add a random effect for person, since there will be multiple rows per person). b1 represents the difference in T (proportion/odds/probability) between those who are high versus low Q prior to the policy change (this is equivalent to jrc’s statement “suppose that the BEST people select into T in the first period”). b2 represents the change in the proportion in T from before until after the policy change among those of low Q. And b3 is the difference-in-differences estimator and is the causal effect that you’re interested in; if it is positive, this would show Q–>T after the policy change. If there are confounders, include them in the regression.
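A toy version of this setup (simulated data with invented probabilities, and ignoring the person-level random effect for simplicity) uses the fact that with binary Q and post, the saturated DiD coefficients are just contrasts of cell means:

```python
# Toy version of the layout described above: 6 waves (3 pre, 3 post), one row
# per person per wave, T as the outcome. All probabilities are invented.
import random

random.seed(2)

rows = []
for person in range(4000):
    q = person % 2                      # high (1) vs low (0) Q
    for wave in range(6):
        post = 1 if wave >= 3 else 0    # waves 0-2 pre-policy, 3-5 post-policy
        pr_t = 0.25 + 0.10 * q + 0.15 * q * post  # Q->T strengthens after policy
        rows.append((q, post, int(random.random() < pr_t)))

def mean_t(q, post):
    vals = [t for (qq, pp, t) in rows if qq == q and pp == post]
    return sum(vals) / len(vals)

b1 = mean_t(1, 0) - mean_t(0, 0)         # pre-policy Q gap in T
b2 = mean_t(0, 1) - mean_t(0, 0)         # pre/post change among low Q
b3 = (mean_t(1, 1) - mean_t(1, 0)) - b2  # difference-in-differences estimator
print(round(b1, 3), round(b2, 3), round(b3, 3))
```

Under the simulated truth (a 0.10 pre-existing Q gap and an extra 0.15 induced by the policy), b3 should land near 0.15, showing Q–>T strengthening after the policy change.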

If you do not observe Q, then this is a hard problem. But the main point is that you don’t need Y to answer this question, and thus the regressions, at least as you have specified them, are misleading.

Balance tables and summary statistics are just about the only time I use tables anymore. That said, I’d be happy to switch to a good graphical format, instead of just a couple columns of means and a t-test on the difference. I guess I should find the book and flip over to page 202 to see how Andrew and Jennifer visually represent balance tests.

A graph, please. Never a table. See page 202 of my book with Jennifer for an example.

Supposing that you are balanced on observables and your argument is about unobservables only, then things get much trickier. You could maybe find some variables correlated with your “quality” measure but not with the outcome (call those Z) and regress Z on T.

If you are in a world of “selection on observables”, that opens avenues for all kinds of correction procedures (Heckman-type selection models and the subsequent improvements) and hence claims about what is selection and what is treatment effect. But in general, I think that if your argument is that ALL of the selection is based on unobservables, you are gonna have a big problem. In a panel setting you could look at something like “residuals in the pre-period” from a regression on pre-period data only and see if the particular people that select into treatment had high residuals before the intervention, but in a repeated cross-section you’d have to proxy those people with observables (like a matching of some sort), and you’d be back to a method that only works when there is selection on observables.

Re-reading your description (and supposing only selection on unobservables): suppose that the BEST people select into T in the first period, and from there on out you get “decreasingly high quality people” selecting in. Then looking at treatment effects across time might reveal something too. Is this what you were thinking with interacting T with “P” (which I interpret as a different treatment effect for those who entered in the later periods)? If so, then I suspect you could add a theoretical model of selection into treatment, where the first movers are the “best” and something like “seeing other people benefit” subsequently increases other people’s probability of selecting in; those who were near the margin of selecting in during the first period, but needed a little push utility/information-wise to get there. It wouldn’t be perfectly clean, but it might be something you could show and argue for.
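A rough sketch of that “residuals in the pre-period” check, with a simulated panel in which an unobserved quality u drives both the pre-period outcome and later selection into T (all names and numbers here are invented):

```python
# If unobserved quality raises both the pre-period outcome and later selection
# into T, people who select in should show high pre-period residuals from a
# regression on observables alone. Invented data-generating process.
import random

random.seed(3)

n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]   # observable
u = [random.gauss(0, 1) for _ in range(n)]   # unobserved quality
y_pre = [1 + 0.8 * xi + 0.5 * ui + random.gauss(0, 0.5)
         for xi, ui in zip(x, u)]
select_t = [int(ui + random.gauss(0, 1) > 1) for ui in u]  # u drives selection

# Univariate OLS of the pre-period outcome on the observable
mx, my = sum(x) / n, sum(y_pre) / n
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y_pre))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx
resid = [yi - a - b * xi for xi, yi in zip(x, y_pre)]

n_sel = sum(select_t)
gap = (sum(r for r, s in zip(resid, select_t) if s) / n_sel
       - sum(r for r, s in zip(resid, select_t) if not s) / (n - n_sel))
print(round(gap, 2))  # positive: selectors had high residuals pre-intervention
```

A clearly positive gap is the panel-setting evidence of selection on unobservables that the comment describes; in a repeated cross-section you would have to proxy the selectors with observables instead.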

p(S | GeneralPopulation & P & Choose_T), which relates through some model the different densities for the group with Choose_T=1 vs Choose_T=0, and then you do a Bayesian fit and you find out something like:

“either the treatment T is very effective, and susceptibility is uniform among the population, or there is a wide range of susceptibility and the treatment is only effective on the susceptible people, and the susceptible people are dramatically over-represented in the treatment group vs the regular population”

That kind of statement itself is sometimes enough to do what you need, namely perhaps convince people that you need to study the issue of whether the treatment will really be effective in the broader population, and how to measure susceptibility and etc etc.

then Y = …. + F(T,S,P) + …

Now, what is the probability of S being a certain value, if you have no other information (i.e., the probability in the general population)?

p(S | GeneralPopulation)

you should provide some prior here.

Now, what is the probability that S is a given value, given a subset of the population that chooses T?

p(S | GeneralPopulation & P & Choose_T)

This is a different distribution. This is the model you need: a way to predict the distribution of susceptibility to treatment given the knowledge that the person chose T. However, unless you have an independent way to assess S, or a specific kind of model for F(S,T,P), you won’t be able to separate uncertainty in S from whatever uncertainty you have in F(S,T,P).

If F is a very specific model, you can then estimate S, but if F is a fuzzy model that has lots of fitting parameters, then the fitting parameters will be probabilistically dependent on the S parameter…

So, what does S (susceptibility to T) also predict that’s different from or additional to the Y outcome? For example, people who are highly susceptible to losing weight by exercising might also be people who are good at certain sports, so you could measure both weight loss and, say, score when playing tennis or whatever, thereby allowing you to identify both S and F.
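As a toy version of the identification problem (all probabilities invented): if susceptible people are over-represented among those who choose T, the naive treated-vs-untreated gap can greatly overstate the population-average effect, which is exactly why an independent handle on S matters.

```python
# Invented susceptibility story: the treatment only moves the outcome for
# susceptible people (S=1), and susceptible people disproportionately choose T,
# so the naive gap overstates the population-average effect.
import random

random.seed(4)

n = 50_000
pop = []
for _ in range(n):
    s = int(random.random() < 0.2)                  # 20% susceptible overall
    choose_t = int(random.random() < (0.7 if s else 0.1))
    effect = 1.0 if s else 0.0                      # T only works on the susceptible
    y = random.gauss(0, 1) + effect * choose_t
    pop.append((s, choose_t, y))

treated = [y for s, t, y in pop if t]
untreated = [y for s, t, y in pop if not t]
naive_gap = sum(treated) / len(treated) - sum(untreated) / len(untreated)
avg_effect = 0.2 * 1.0   # true population-average effect in this simulation
print(round(naive_gap, 2), avg_effect)  # naive gap far exceeds 0.2
```

The naive gap lands near the susceptible share among the treated (about 0.64 here) rather than the 0.2 population-average effect, which is the “either the treatment is very effective, or susceptibility is over-represented in the treatment group” ambiguity in one picture.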

My hypothesis is that high-quality people select into the program. I expect that people selecting into T will have better outcomes (Y) because they are of higher quality.

I feel like we have discussed this “high quality people selecting into a program” question before. Is this a dupe?

Anyway, you should think of all the reasons people who select into the program may have better outcomes, including that they may be “high quality”. Then you need to find a way to distinguish between or account for all those possibilities.

I’ll start:

1) People selecting into the program are more enthusiastic about it, so will have better outcomes.