Sadish Dhakal writes:

I am struggling with the problem of conditioning on post-treatment variables. I was hoping you could provide some guidance. Note that I have repeated cross sections, not panel data. Here is the problem simplified:

There are two programs. A policy introduced some changes in one of the programs, which I call the treatment group (T). People can select into T. In fact there’s strong evidence that T programs become more popular in the period after policy change (P). But this is entirely consistent with my hypothesis. My hypothesis is that high-quality people select into the program. I expect that people selecting into T will have better outcomes (Y) because they are of higher quality. Consider the specification (avoiding indices):

Y = b0 + b1 T + b2 P + b3 T X P + e (i)

I expect that b3 will be positive (which it is). Again, my hypothesis is that b3 is positive only because higher quality people select into T after the policy change. Let me reframe the problem slightly (And please correct me if I’m reframing it wrong). If I could observe and control for quality Q, I could write the error term e = Q + u, and b3 in the below specification would be zero.

Y = b0 + b1 T + b2 P + b3 T X P + Q + u (ii)

My thesis is not that the policy “caused” better outcomes, but that it induced selection. How worried should I be about conditioning on T? How should I go about avoiding bogus conclusions?

My reply:

There are two ways I can see to attack this problem, and I guess you’d want to do both. First is to control for lots of pre-treatment predictors, including whatever individual characteristics you can measure which you think would predict the decision to select into T. Second is to include in your model a latent variable representing this information, if you don’t think you can measure it directly. You can then do a Bayesian analysis averaging over your prior distribution on this latent variable, or a sensitivity analysis assessing the bias in your regression coefficient as a function of characteristics of the latent variable and its correlations with your outcome of interest.

I’ve not done this sort of analysis myself; perhaps you could look at a textbook on causal inference such as Tyler VanderWeele’s Explanation in Causal Inference: Methods for Mediation and Interaction, or Miguel Hernan and Jamie Robins’s Causal Inference.
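A crude version of the sensitivity analysis described above can be sketched by simulation. Everything here is an assumption made for illustration: the function name `simulate_b3`, the logistic choice model, and the parameter values are hypothetical, and the outcome is built so that neither T nor P has any causal effect. The sketch only shows how the T X P coefficient in specification (i) moves as selection on an unobserved Q gets stronger.

```python
import math
import random

def simulate_b3(selection_strength, n=20000, seed=0):
    """Return b3 from Y ~ T + P + T*P on simulated repeated cross-sections.

    Hypothetical data-generating process: Y depends only on latent quality Q,
    never on T or P. After the policy (P = 1), the chance of choosing T
    rises with Q; before the policy, choice is unrelated to Q.
    """
    rng = random.Random(seed)
    sums = {(t, p): [0.0, 0] for t in (0, 1) for p in (0, 1)}
    for _ in range(n):
        p = rng.randint(0, 1)            # period of this cross-section
        q = rng.gauss(0.0, 1.0)          # latent quality, unobserved
        # Logistic choice model: selection on Q only operates when P = 1.
        prob_t = 1.0 / (1.0 + math.exp(-selection_strength * q * p))
        t = 1 if rng.random() < prob_t else 0
        y = q + rng.gauss(0.0, 1.0)      # no causal effect of T or P
        cell = sums[(t, p)]
        cell[0] += y
        cell[1] += 1
    mean = {k: total / count for k, (total, count) in sums.items()}
    # With binary T and P the model is saturated, so the OLS estimate of b3
    # equals this difference of differences in cell means.
    return (mean[(1, 1)] - mean[(0, 1)]) - (mean[(1, 0)] - mean[(0, 0)])

for strength in (0.0, 0.5, 1.0, 2.0):
    print(f"selection strength {strength}: b3 = {simulate_b3(strength):.3f}")
```

In this setup b3 should sit near zero when there is no selection and grow with the selection strength, even though the program does nothing. A Bayesian version would put a prior on the selection strength instead of scanning a grid.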

Synthetic panel?

I feel like we have discussed this “high quality people selecting into a program” question before. Is this a dupe?

Anyway, you should think of all the reasons people selecting into the program may have better outcomes, including that they may be “high quality”. Then you need to find a way to distinguish between or account for all those possibilities.

I’ll start:

1) People selecting into the program are more enthusiastic about it, so will have better outcomes.

The assumption is Treatment interacts with some underlying factor called “quality” but I’d rather call it something like “susceptibility” (to the treatment). So let’s call it S.

then Y = …. + F(T,S,P) + …

Now, what is the probability of S being a certain value, if you have no other information (i.e., the probability in the general population)?

p(S | GeneralPopulation)

you should provide some prior here.

Now, what is the probability that S is a given value, given the subset of the population that chooses T?

p(S | GeneralPopulation & P & Choose_T)

This is a different distribution…. This is the model you need, a way to predict the distribution of susceptibility to treatment given the knowledge that the person chose T. However, unless you have an independent way to assess S, or a specific kind of model for F(S,T,P) you won’t be able to identify uncertainty in S vs whatever uncertainty you have in F(S,T,P)

If F is a very specific model, you can then estimate S, but if F is a fuzzy model that has lots of fitting parameters, then the fitting parameters will be probabilistically dependent on the S parameter…

So, what does S (susceptibility to T) also predict that’s different or additional to the Y outcome? Like for example, people who are highly susceptible to losing weight by exercising might also be people who are good at certain sports… so you could measure both weight loss and say score when playing tennis or whatever, thereby allowing you to identify both S and F

On the other hand, sometimes identifiability isn’t so important, sometimes it’s just useful enough to quantify how much ambiguity you have in your system. So for example you could provide a model

p(S | GeneralPopulation & P & Choose_T) which relates through some model the different densities for the group that Choose_T=1 vs Choose_T=0, and then you do a Bayesian fit and you find out something like:

“either the treatment T is very effective, and susceptibility is uniform among the population, or there is a wide range of susceptibility and the treatment is only effective on the susceptible people, and the susceptible people are dramatically over-represented in the treatment group vs the regular population”

That kind of statement itself is sometimes enough to do what you need, namely perhaps convince people that you need to study the issue of whether the treatment will really be effective in the broader population, and how to measure susceptibility and etc etc.
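The ambiguity in that quoted statement can be made concrete with a bit of arithmetic. The numbers below are invented purely for illustration, and `observed_gain` is a hypothetical helper: it assumes only susceptible people respond to T, and that they all respond by the same amount.

```python
def observed_gain(effect_if_susceptible, share_susceptible_in_group):
    """Average gain in a group when only its susceptible members respond."""
    return effect_if_susceptible * share_susceptible_in_group

# Story A (hypothetical numbers): everyone is susceptible, moderate effect.
story_a = {"effect": 2.0, "share_in_T": 1.0, "share_in_population": 1.0}
# Story B: larger effect, but only for the susceptible, who crowd into T.
story_b = {"effect": 2.5, "share_in_T": 0.8, "share_in_population": 0.2}

for name, s in (("A", story_a), ("B", story_b)):
    in_t = observed_gain(s["effect"], s["share_in_T"])
    in_pop = observed_gain(s["effect"], s["share_in_population"])
    print(f"story {name}: gain among choosers = {in_t:.2f}, "
          f"gain if rolled out to everyone = {in_pop:.2f}")
```

Both stories imply the same average gain among people who chose T, so data on choosers alone cannot separate them, yet they imply very different gains for the general population. That is the sense in which you need a second measurement that depends on S.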

As a first step, wouldn’t you want to just make a “balance table” on observables, comparing T/C group means/variances on observable characteristics? Or, similarly, regress T on X (instead of Y on X). You know, as you would do if you were running an experiment and wanted to see if you had good balance between groups on observables.
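A minimal sketch of such a balance check, on made-up data. The covariate names `age` and `prior_score`, the selection rule, and the use of a standardized mean difference as the summary are all assumptions for illustration, not anything from Sadish's data.

```python
import random
import statistics

def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(treated)
                  + statistics.variance(control)) / 2) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

rng = random.Random(1)
people = []
for _ in range(4000):
    age = rng.gauss(40, 10)
    prior_score = rng.gauss(0, 1)
    # Hypothetical selection rule: choice of T depends on prior_score, not age.
    chose_t = rng.random() < (0.7 if prior_score > 0 else 0.3)
    people.append({"age": age, "prior_score": prior_score, "T": chose_t})

balance = {}
for name in ("age", "prior_score"):
    treated = [p[name] for p in people if p["T"]]
    control = [p[name] for p in people if not p["T"]]
    balance[name] = standardized_mean_difference(treated, control)
    print(f"{name:12s} standardized difference = {balance[name]:+.2f}")
```

With repeated cross-sections you would run this separately for the pre and post periods; a covariate that is balanced before the policy and unbalanced after is the observable footprint of induced selection.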

Supposing that you are balanced on observables and your argument is about unobservables only, then things get much trickier. You could maybe find some outcomes correlated with your “quality” measure but not correlated with the outcome (call those Z) and regress Z on T.

If you are in a world of “selection on observables” that opens avenues for all kinds of correction procedures (Heckman type selection models and the subsequent improvements) and hence claims about what is selection and what is treatment effect. But in general, I think that if your argument is that ALL of the selection is based on unobservables, you are gonna have a big problem. In a panel setting you could look at something like “residuals in the pre-period” from a regression on pre-period only and see if the particular people who select into treatment had high residuals before the intervention, but in a repeated cross-section you’d have to proxy those people with observables (like a matching of some sort), and you’d be back to a method that only works when there is selection on observables.

Re-reading your description (and supposing only selection on unobservables) – suppose that the BEST people select into T in the first period, and from there on out you get “decreasingly high quality people” selecting in. Then looking at treatment effects across time might reveal something too. Is this what you were thinking with interacting T with “P” (which I interpret as a different treatment effect for those who entered in the later periods)? If that is the case, then I suspect that adding a theoretical model of selection into treatment would help: one where the first movers are the “best” and something like “seeing other people benefit” subsequently increases the probability of selecting in for those who were near the margin in the first period but needed a little push, utility- or information-wise, to get there. It wouldn’t be perfectly clean, but it might be something you could show and argue for.

Jrc:

A graph, please. Never a table. See page 202 of my book with Jennifer for an example.

I also strongly prefer graphs to tables. But I wonder how much this is “just me” — in the sense that I am a very visually/spatially oriented person. So I have to admit the possibility that people who are not strongly visually oriented might find tables more user-friendly.

I totally agree that figures are generally better than tables. In fact, once I’ve found a good way to visually present some set of results, I have never had someone tell me they’d rather see it in a table.

Balance tables and summary statistics are just about the only time I use tables anymore. That said, I’d be happy to switch to a good graphical format, instead of just a couple of columns of means and a t-test on the difference. I guess I should find the book and flip over to page 202 and see how Andrew and Jennifer visually represent balance tests.

Slightly tangential, but perhaps of interest to some reading this (especially those who are concerned about students who haven’t been well educated in reasoning skills):

Today I saw in the current issue of the Notices of the American Mathematical Society an article “Progressions in Reasoning in K-12 Mathematics”, by Dev Sinha. There’s a preprint at https://arxiv.org/pdf/1812.11947.pdf

Since the article mentions several examples of reasoning based on a visual display, this reminded me of some books I’ve used ideas from in teaching (especially prospective teachers): Proofs without Words (I, II and III) by Roger Nelsen.

Agree with you here: if you want to estimate the causal effect of Q–>T (whether quality caused people to participate in T), then you don’t need to worry about Y at all. Make T the outcome and regress it on Q. This is assuming Q is observed. But even then, in order to estimate the causal effect of Q–>T you need to make sure that there are no common causes Z that could be leading to a spurious association between Q and T. If there are, these common causes need to be adjusted for.

However, I am very confused (probably not reading something correctly) and have some questions for Sadish and would be curious to know what Andrew thinks. Sadish is concerned about adjusting for post-treatment variables, AKA mediators, in the regression Y = b0 + b1 T + b2 P + b3 T X P + e (i).

I am assuming that he is worried about adjusting for T because it happens after P, i.e., that T is a mediator of the association between P and Y?

If so then I am not sure I follow. How is the variable P operationalized? Is it a yes/no variable? Did the policy that introduced the changes to the treatment program happen once? Or did new changes to the treatment program keep on happening over and over across cross sections? Is there data on T, P, and Y at all cross sections? From your description it sounds to me (especially if the policy that implemented changes to the treatment program happened only once) like the variable P, which you describe as “period after policy change (P)”, is not a covariate but rather a date that you can use to partition the cross sections into pre and post. If this is the case, and if you observe Q, then you have the perfect data to conduct a difference-in-differences (DiD) quasi-experimental study:

As jrc mentioned, let T be your outcome, and let Q be the new real treatment group (assuming he is still interested in estimating the causal effect of quality on treatment group participation). The DiD design could be used to estimate the differential change in the proportion of people participating in T between those who are high Q versus low Q from before until after the policy change.

Let’s suppose there are 3 cross sections that happened before the policy and 3 that happened after, and let’s suppose that at each of these cross sections you know whether people were in the treatment program (T) and you know the value of their Q (high versus low, to keep it simple). This is how you would set it up. You would have 6 rows per person (one for each of the 6 waves). You would create a new variable called post that is 0 if the cross section was before the policy change and 1 if it was after. The DiD regression would be T = b0 + b1 Q + b2 post + b3 Q * post (add a random effect for person, since there will be multiple rows per person). b1 represents the difference in T (proportion/odds/probability) between those who are high versus low Q prior to the policy change (this is equivalent to jrc’s statement “suppose that the BEST people select into T in the first period”). b2 represents the change in the proportion in T from before until after the policy change AMONG those of low Q. And b3 is the difference-in-differences estimator, the causal effect that you’re interested in: if it is positive, this would show Q–>T after the policy change. If there are confounders, include them in the regression.
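Here is a minimal sketch of that regression as a linear probability model on simulated data. It departs from the setup above in one labeled way: since Sadish has repeated cross-sections, each row is treated as a different respondent, so there is no person-level random effect. The participation rates in `true_rate` and the helper name `did_on_participation` are hypothetical.

```python
import random

def did_on_participation(rows):
    """Fit T = b0 + b1 Q + b2 post + b3 Q*post from (t, q, post) rows of 0/1s.

    With binary Q and post the model is saturated, so the OLS coefficients
    are simple contrasts of the four cell participation rates.
    """
    joined = {(q, post): 0 for q in (0, 1) for post in (0, 1)}
    total = {(q, post): 0 for q in (0, 1) for post in (0, 1)}
    for t, q, post in rows:
        joined[(q, post)] += t
        total[(q, post)] += 1
    rate = {cell: joined[cell] / total[cell] for cell in total}
    b0 = rate[(0, 0)]                        # low-Q participation before
    b1 = rate[(1, 0)] - rate[(0, 0)]         # high-vs-low gap before
    b2 = rate[(0, 1)] - rate[(0, 0)]         # change among low Q
    b3 = (rate[(1, 1)] - rate[(0, 1)]) - b1  # change in the gap
    return b0, b1, b2, b3

rng = random.Random(2)
# Hypothetical participation rates by (Q, post); implied b3 is 0.20.
true_rate = {(0, 0): 0.30, (1, 0): 0.35, (0, 1): 0.35, (1, 1): 0.60}
rows = []
for _ in range(40000):
    q = rng.randint(0, 1)
    post = rng.randint(0, 1)
    t = 1 if rng.random() < true_rate[(q, post)] else 0
    rows.append((t, q, post))

b0, b1, b2, b3 = did_on_participation(rows)
print(f"b0={b0:.3f}  b1={b1:.3f}  b2={b2:.3f}  b3={b3:.3f}")
```

With true panel data and repeated rows per person, you would instead fit a mixed model, and with confounders you would need a regression rather than cell means.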

If you do not observe Q, then this is a hard problem. But the main point is that you don’t need Y to answer this question, and thus the regressions, at least as you have specified them, are misleading.

+1

Is this practical? How many plausible Z are there vs what data is going to be available?

And if you look at my example above, it is not just common causes you need to worry about. You need to adjust for any other reason that people who select to be in the program would have better outcomes, otherwise your model is misspecified. At some point you can be satisfied with an approximation and ignore factors that have only negligible influence on the outcome of course.

Anoneuoid: without really knowing what Q is and what the policies were etc., it is hard for you and me to judge whether there really exists another Z that could be a common cause. I’m just trying to be as comprehensive as possible. Sadish can decide whether it’s practical to worry about Z. “When in doubt, DAG it out”:

https://journals.lww.com/epidem/Fulltext/2019/07000/Analyzing_Selection_Bias_for_Credible_Causal.8.aspx

Re: multiple vars that could predict selection into T: I agree with you that Sadish needs to think about all the possible vars that could be leading folks to select into T after the policy. If that’s the case, the DiD design that I specified would be difficult to do with multiple vars. In that case, I would do a regression of T~Q+ newslectionvar1 +newselecionvar2, see which var is the strongest predictor of T, and use that as the exposure variable in the DiD that I specified above. If the main hypothesis is that there is some selection going on, then maybe just showing that this exists with respect to one var such as Q could be worth it.

Hard to really know without knowing all of the study details, time periods, policies, population of interest etc.

I don’t think it is *ever* practical, and there will always be dozens or more possible confounds someone who thinks hard could come up with. Do you have a real life counter example of using a linear regression where the variables included were not a matter of convenience?

Why not (using the R formula notation): T ~ Q*newslectionvar1*newselecionvar2? And doesn’t this assume you have data for the newselectionvars?

I mean my ultimate point is I don’t believe it is ever practical to interpret regression coefficients in the way people want to.

I agree with you that it is never practical to interpret coefficients the way people want to. People shouldn’t interpret anything the way they want to. People should interpret coefficients the way coefficients are supposed to be interpreted (a lot of the time people don’t really know how to interpret the coefficients, so they come up with their own way).

The issue I disagree with you on is the practicality of thinking about potential confounders. Thinking about potential confounders, the data generating mechanisms, and the temporal relationships between variables is fundamental to causal inference, so your dismissal of thinking critically about potential confounders is enough for me to stop spending more time on this specific issue.

I have some DAGs to attend to.

There is something about these DAGs…

Q) Do you have a real life example?

A1) I’ve gotta go; I’m done here

A2) …

A3) Here is a pedagogical example

Several of my colleagues find DAGs very appealing, so I spent a little time on some intro papers on the subject.

From what I can tell, DAGs offer a calculus for reducing a large set of predictors to a smaller (potentially much smaller) set from which causal inference can proceed via standard regression techniques.

I can see the appeal. In epidemiology, at least as it was practiced during my time with the WHO, one is confronted with huge datasets containing a vast number of predictors. The usual practice was to try various predictor sets, talking yourself into and out of including or excluding certain variables, until you got something that seemed interpretable. This is an incredibly disorganized, tedious, and often irritating enterprise, as you regularly find counterintuitive adjusted effects, e.g., smoking is protective against chronic disease.

Faced with such challenges, I can see the appeal of any approach that offers a seemingly principled way of fixing attention on a limited set of predictors. Whether or not this logic is valid is a different question.

If higher Q people select into T, then perhaps the link between T and Y is due to Q. The interaction with P could indicate a better program or it could indicate the induction of a worse selection problem. The trouble is distinguishing between these. Luckily, laying out two model diagrams (https://imgur.com/EN0tHiy , eq. vii and viii) will help.

The hypothesis is that before P, Q–>Y and Q–>T, and that T–>Y is spurious. This takes two equations:

T = b0 + b1 Q + e (i) [H: b1>0]

Y = b0 + b1 Q + b2 T + e (ii) [H: b1>0, b2:???]

Maybe the program has a causal effect, maybe not.

You could even have an alternate:

Y = b0 + b1 Q + b2 T + b3 T x Q + e (ii_a) [H: same as above + b3>0 {the treatment is more effective for high-Q} or b3<0 {…}]

After P:

T = b0 + b1 Q + e (iii) [H: b1>>0 {worse selection}]

Y = b0 + b1 Q + b2 T + e (iv) [H: b1>>0 {P changes how Q relates to Y, presumably through an omitted variable, uh oh!}, b2>>0 {P improves the effectiveness of the treatment}]

You could even have an alternate:

Y = b0 + b1 Q + b2 T + b3 T x Q + e (iv_a) [H: same as above + b3>0 {P improves the treatment for high-Q} or b3<0 {…}]

T = b0 + b1 Q + b2 P + b3 Q x P + e (v) [H: b1>0 {selection}, b3>0 {selection worsens}]

Y = b0 + b1 Q + b2 P + b3 Q x P + b4 T + b5 T x Q + b6 T x P + b7 T x Q x P + e (vi) [H: b1>0, b3~0 {P does not directly change how Q relates to Y}, b4=0 {T has no effect on Y}, b5??? {is there T effect heterogeneity by Q?}, b6~0 {P does not improve T’s effectiveness}, b7??? {does P change how T effect heterogeneity by Q manifests?}]

If you specify all this, you can test your hypotheses and also interpret the other coefficients in light of how they would reflect on alternative models of the process (such as, P makes T better). Notice, it’s actually the Q x P interaction that carries the theoretical load for “inducing selection”! You can simplify a lot if testing for T effect heterogeneity by Q isn’t important:

T = b0 + b1 Q + b2 P + b3 Q x P + e (vii)

Y = b0 + b1 Q + b2 P + b3 Q x P + b4 T + b5 T x P + e (viii)

I’m also assuming you don’t have endogeneity issues (like when other commenters say to watch out for omitted variables- a big worry would be finding b3_vi or b3_viii =/= 0 since it would suggest P changes how Q relates to Y, and to me that screams omitted variable in this case), and that you know how to handle the subscripts. And you can add whatever else Andrew wants you to add, for instance other pre-treatment indicators that allow you to get a better grip on Q’s actual impact on selection into T.
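To see how equations (vii) and (viii) behave when Q really is observed, here is a rough simulation sketch. The data-generating process is an assumption chosen to match the hypothesis (Q drives Y, P strengthens selection on Q, T does nothing), Q is coded high/low, and `ols` is a small hand-rolled least-squares helper rather than a library routine.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(3)
data = []
for _ in range(40000):
    q = rng.randint(0, 1)                    # observed quality, high/low
    p = rng.randint(0, 1)                    # period: before/after policy
    # Hypothetical selection model: the Q x P term (0.20) is induced selection.
    prob_t = 0.30 + 0.05 * q + 0.05 * p + 0.20 * q * p
    t = 1 if rng.random() < prob_t else 0
    y = 1.0 * q + rng.gauss(0.0, 1.0)        # T and P have no effect on Y
    data.append((q, p, t, y))

T = [t for _, _, t, _ in data]
Y = [y for *_, y in data]
eq_vii = ols([[1, q, p, q * p] for q, p, _, _ in data], T)
eq_viii = ols([[1, q, p, q * p, t, t * p] for q, p, t, _ in data], Y)
naive = ols([[1, t, p, t * p] for _, p, t, _ in data], Y)  # omits Q

print("(vii)  b3 on Q x P:", round(eq_vii[3], 3))
print("(viii) b4 on T:", round(eq_viii[4], 3), " b5 on T x P:", round(eq_viii[5], 3))
print("naive  b3 on T x P without Q:", round(naive[3], 3))
```

In this setup the Q x P coefficient in (vii) should pick up the induced selection, the T terms in (viii) should hover near zero, and the regression that omits Q should show a positive T x P coefficient that is purely selection.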

This kind of problem is the bread and butter of applied economics research. The key is that you can’t observe Q (so any models based on including Q in the equation won’t work). And I would be very worried about calling a coefficient on T*P a causal effect of the program when T is vulnerable to selection.

A common approach Sadish could examine is to define the treatment group T to not reflect actual program participation, but to be a group eligible for the program that is not vulnerable to selection. The research on how the Earned Income Tax Credit affects labor supply is a classic Difference-in-Difference analysis that uses this approach. The EITC is a tax credit that acts as a wage subsidy if you work and have children. We want to know the effects of the EITC on encouraging work, but expanding the EITC will automatically increase the number of people eligible for the program (and allow higher-income people to receive the credit). So if you ran a regression with an indicator EITC for receiving the credit, an indicator Y for whether the person works, and an indicator P for before and after the credit is made more generous,

Y = b0 + b1 EITC + b2 P + b3 P * EITC + e (i)

you would get effects biased by selection, just as Sadish pointed out (since everyone who selects into the EITC will be working). Instead, what researchers like Eissa and Liebman (1996) did is limit the sample to single women, and compare women with and without children (with repeated cross-sections). The idea is that the status of having children is roughly fixed over time (so there is limited selection between the two groups based on the policy), but only women with children benefit from the expanded EITC. Then you run

Y = b0 + b1 child + b2 P + b3 P * child + e (i)

and attribute the effect b3 to the EITC expansion. This has obvious limitations (what if something else happened at the same time as the EITC expansion that affected women with children differently? What if people have children in response to the EITC? etc.), which many researchers have done a lot of work to rule out. But that is the fundamental method: picking your “treatment” group so that it is not subject to selection. That means you don’t want to use actual measured program participation as the T variable; you want a variable that predicts program participation well in both the pre and post periods. If you see outcomes improve more for the group that is more tied to the program (regardless of whether each individual actually participates or not), that suggests a causal effect of the program.

There are other options to deal with selection (find an instrumental variable, find a cutoff in how the policy was applied to get a regression discontinuity, etc.), but this is the most directly applicable to a Diff-in-Diff approach.
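A toy simulation can illustrate the difference between defining the treatment group by eligibility (having children) and by actual receipt of the credit. All numbers here are invented, including the employment rates and the take-up rates; nothing is taken from Eissa and Liebman's data, and `did` and `simulate` are hypothetical helper names.

```python
import random

def did(rows):
    """rows of (group, post, y) with 0/1 entries.

    Returns (change in group 1) minus (change in group 0), which equals the
    interaction coefficient b3 in Y = b0 + b1 group + b2 P + b3 P*group.
    """
    s = {(g, p): [0, 0] for g in (0, 1) for p in (0, 1)}
    for g, p, y in rows:
        s[(g, p)][0] += y
        s[(g, p)][1] += 1
    m = {k: total / n for k, (total, n) in s.items()}
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])

def simulate(true_effect, n=80000, seed=4):
    rng = random.Random(seed)
    by_eligibility, by_receipt = [], []
    for _ in range(n):
        child = rng.randint(0, 1)            # fixed trait, not chosen
        post = rng.randint(0, 1)             # before/after the expansion
        # Hypothetical employment rates; the expansion helps only mothers.
        p_work = 0.60 if child == 0 else 0.50 + true_effect * post
        works = 1 if rng.random() < p_work else 0
        # Only working mothers can receive the credit; take-up rises after.
        take_up = 0.9 if post else 0.3
        receives = 1 if (child and works and rng.random() < take_up) else 0
        by_eligibility.append((child, post, works))
        by_receipt.append((receives, post, works))
    return did(by_eligibility), did(by_receipt)

for effect in (0.0, 0.10):
    eligible_did, receipt_did = simulate(effect)
    print(f"true effect {effect:.2f}: eligibility DiD = {eligible_did:.3f}, "
          f"receipt DiD = {receipt_did:.3f}")
```

In this setup the eligibility-based estimate should track the true effect, while the receipt-based estimate is driven by who ends up in each group and can be clearly positive even when the true effect is zero.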