Why not look at Y?

In some versions of a “design-based” perspective on causal inference, the idea is to focus on how units are assigned to different treatments (i.e. exposures, actions), rather than focusing on a model for the outcomes. We may even want to prohibit loading, looking at, etc., anything about the outcome (Y) until we have settled on an estimator, which is often something simple like a difference-in-means or a weighted difference-in-means.

Taking a design-based perspective on a natural experiment, then, one would think about how Nature (or some other haphazard process) has caused units to be assigned to (or at least nudged, pushed, or encouraged into) treatments. Taking this seriously, identification, estimation, and inference shouldn’t be based on detailed features of the outcome or the researcher’s preference for, e.g., some parametric model for the outcome. (It is worth noting that common approaches to natural experiments, such as regression discontinuity designs, do in fact make central use of quantitative assumptions about the smoothness of the outcome. For a different approach, see this working paper.)

Taking a design-based perspective on an observational study (without a particular, observed source of random selection into treatments), one then considers whether it is plausible that, conditional on some observed covariates X, units are (at least as-if) randomized into treatments. Say, thinking of the Infant Health and Development Program (IHDP) example used in Regression and Other Stories, if we consider infants with identical zip code, sex, age, mother’s education, and birth weight, perhaps these infants are effectively randomized to treatment. We would assess the plausibility of this assumption — and our ability to employ estimators based on it (by, e.g., checking whether we have a large enough sample size and sufficient overlap to match on all these variables exactly) — without considering the outcome.
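To make that concrete, here is a minimal sketch of what such an outcome-free feasibility check might look like, assuming a hypothetical file and column names (this is not the actual IHDP data): we load only the covariates and the treatment indicator, and ask how many exact-X cells contain both treated and control units.

```python
import pandas as pd

# Hypothetical covariate names; the point is that the outcome column is never read.
covariates = ["zip_code", "sex", "age", "mother_educ", "birth_weight"]
df = pd.read_csv("ihdp_covariates_and_treatment.csv",
                 usecols=covariates + ["treated"])  # no Y here

# Within an exact-X cell, a treated-vs-control comparison is only possible
# if both groups are present.
cells = df.groupby(covariates)["treated"].agg(["size", "mean"])
usable = cells[(cells["mean"] > 0) & (cells["mean"] < 1)]

print(f"{len(usable)} of {len(cells)} exact-X cells contain both treated and "
      f"control units ({int(usable['size'].sum())} units usable for matching)")
```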

This general idea is expressed forcefully in Rubin (2008) “For objective causal inference, design trumps analysis”:

“observational studies have to be carefully designed to approximate randomized experiments, in particular, without examining any final outcome data”

Randomized experiments “are automatically designed without access to any outcome data of any kind; again, a feature not entirely distinct from the previous reasons. In this sense, randomized experiments are ‘prospective.’ When implemented according to a proper protocol, there is no way to obtain an answer that systematically favors treatment over control, or vice versa.”

But why exactly? I think there are multiple somewhat distinct ideas here.

(1) If we are trying to think by analogy to a randomized experiment, we should be able to assess the plausibility of our as-if random assumptions (i.e. selection on observables, conditional unconfoundedness, conditional exogeneity). Supposedly our approach is justified by these assumptions, so we shouldn’t sneak in, e.g., parametric assumptions about the outcome.

(2) We want to bind ourselves to an objective approach that doesn’t choose modeling assumptions to get a preferred result. Even if we aren’t trying to do so (as one might in a somewhat adversarial setting, like statisticians doing expert witness work), we know that once we enter the Garden of Forking Paths, we can’t know (or simply model) how we will adjust our analyses based on what we see from some initial results. (And, even if we only end up doing one analysis, frequentist inference needs to account for all the analyses we might have done had we gotten different results.) Perhaps there is really nothing special about causal inference or a design-based perspective here. Rather, we hope that as long as we don’t condition our choice of estimator on Y, we avoid a bunch of generic problems in data analysis and ensure that our statistical inference is straightforward (e.g., we do a z-test and believe in it).

So if (2) is not special to causal inference, then we just have to particularly watch out for (1).

But we often find we can’t match exactly on X. In one simple case, X might include some continuous variables. Also, we might find conditional unconfoundedness more plausible if we have a high-dimensional X, but this typically makes it unrealistic that we’ll find exact matches, even with a giant data set. So typical approaches relax things a bit. We don’t match exactly on all variables individually. We might match only on propensity scores, maybe splitting strata for many-to-many matching until we reach a stratification where there is no detectable imbalance. Or match after some coarsening, which often starts to look like a way to smuggle in outcome-modeling (even if some methodologists don’t want to call it that).
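As a rough illustration of the propensity-score route, here is a sketch that estimates propensity scores from X and treatment alone, stratifies on them, and checks within-stratum imbalance, still without touching Y. The column names, the five strata, and the 0.1 imbalance threshold are arbitrary placeholder choices, and the covariates are assumed to be numeric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def standardized_diff(x_treated, x_control):
    """Standardized mean difference for one covariate (treated vs. control)."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd if pooled_sd > 0 else 0.0

def propensity_strata_balanced(df, covariates, n_strata=5, threshold=0.1):
    """Stratify on an estimated propensity score and report the worst
    within-stratum covariate imbalance. Uses only X and the treatment indicator."""
    df = df.copy()
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
    df["pscore"] = model.predict_proba(df[covariates])[:, 1]
    df["stratum"] = pd.qcut(df["pscore"], n_strata, labels=False, duplicates="drop")

    worst = 0.0
    for _, s in df.groupby("stratum"):
        treated, control = s[s["treated"] == 1], s[s["treated"] == 0]
        if len(treated) == 0 or len(control) == 0:
            return False, np.inf  # a stratum with no comparison group
        worst = max(worst, max(abs(standardized_diff(treated[x], control[x]))
                               for x in covariates))
    return worst <= threshold, worst
```

In practice one would keep splitting strata (or revisiting the specification) until a check like this passes, all before looking at the outcome.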

Thus, sometimes — perhaps in the cases where conditional unconfoundedness is most plausible because we can theoretically condition on a high-dimensional X — we could really use some information about what covariates actually matter for the outcome. (This is because we need to deal with having finite, even if big, data.)

One solution is to use some sample splitting (perhaps with quite-specific pre-analysis plans). We could decide (ex ante) to use 1% of the outcome data to do feature selection, using this to prioritize which covariates to match on exactly (or close to it). For example, MALTS uses a split sample to learn a distance metric for subsequent matching. It seems like this can avoid the problems raised by (2). But nonetheless it involves bringing in quantitative information about the outcome.
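Here is an illustrative sketch of the split-sample idea (not the actual MALTS algorithm; the file name, the 1% split, and the random-forest feature-importance step are just placeholder choices): a small, pre-committed split sees the outcome and is used only to decide which covariates to prioritize when matching on the remaining data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("study_data.csv")  # hypothetical file with covariates, treated, y
covariates = [c for c in df.columns if c not in ("treated", "y")]

# Decided ex ante: only this 1% split ever gets to see the outcome.
learn, estimate = train_test_split(df, test_size=0.99, random_state=0)

# Use the small split to learn which covariates matter most for Y...
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(learn[covariates], learn["y"])
importance = pd.Series(forest.feature_importances_, index=covariates)
priority = importance.sort_values(ascending=False).head(5).index.tolist()

# ...then carry only this covariate ranking into the matching step on the
# remaining 99%, where no outcome model is ever fit.
print("Covariates to prioritize in matching:", priority)
```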

Thus, while I like MALTS-style solutions (and we used MALTS in one of three studies of prosocial incentives in fitness tracking), it does seem like an important departure from a fully design-based “don’t make assumptions about the outcomes” perspective. But perhaps such a perspective is often misplaced in observational studies anyway — if we don’t have knowledge of what specific information was used by decision-makers in selection into treatments. And practically, with finite data, we have to make some kind of bias–variance tradeoff — and looking at Y can help us a bit with that.

[This post is by Dean Eckles.]

7 thoughts on “Why not look at Y?”

  1. One of the things I have never been able to formalize: I’ve always thought that showing (a priori plausible) dose-response relationships makes an observational finding seem much stronger to me. But this does not fit at all into the usual econ-style hierarchy of research designs. I mean, it is possible for confounding to produce similar dose-response patterns sometimes, but testing a dose-response curve (e.g., you believe effects fade after a point) vs. Beta != 0 is a much stricter type of hypothesis test.

  2. It’s not exactly the same, but this post reminds me of some critiques of research on algorithmic auditing and counterfactual fairness, where many approaches assume you can identify racial bias in allocation decisions by matching units on all Xs (e.g., educational qualifications) and imagining that only race has been flipped, sort of like the famous Bertrand and Mullainathan paper on names on resumes does. The critiques (from people like Lily Hu, Issa Kohler-Hausmann, and others) argue that it’s not coherent to conceive of the effect of race as something that could be estimated directly by randomly assigning race while holding all else constant, because there will be various paths by which race impacts the Xs.

  3. I think matching does “model” y, or at least its expectation, in the same way that KNN is also a supervised learning method. Perhaps the question is then why “probabilistic modeling” is still less popular in design-based causal inference despite its overwhelming success in almost all other branches of statistics?

    • Roughly, in causal inference reducing systematic error trumps reducing random error, whereas in other areas reducing random error trumps reducing systematic error, so the amount of trade-off between the two is different.

      Rubin’s idea was that by hiding the y from the analyst, they could not increase systematic error in their modelling of it.

    • In some sense, yes, particularly as soon as you move away from being able to match exactly. This is kind of what I was referring to with “even if some methodologists don’t want to call it that” with respect to coarsened exact matching, where the coarsening is often driven by theory about Y.
