Philip Dawid (a longtime Bayesian researcher who’s done work on graphical models, decision theory, and predictive inference) saw our discussion on causality and sends in some interesting thoughts, which I’ll post here and then very briefly comment on:
Having just read through this fascinating interchange, I [Dawid] confess to finding Shrier and Pearl’s examples and arguments more convincing than Rubin’s. At the risk of adding to the confusion, but also in hope of helping at least some others, let me briefly describe yet another way (related to Pearl’s, but with significant differences) of formulating and thinking about the problem. For those who, like me, may be concerned about the need to consider the probabilistic behaviour of counterfactual variables, on the one hand, or deterministic relationships encoded graphically, on the other, this provides an observable-focused, fully stochastic, alternative. A full presentation of the essential ideas can be found in Chapters 9 (Confounding and Sufficient Covariates) and 10 (Reduction of Sufficient Covariate) of my online document “Principles of Statistical Causality”.
Like Pearl, I like to think of “causal inference” as the task of inferring what would happen under a hypothetical intervention, say F_E = e, that sets the value of the exposure E at e, when the data available are collected, not under the target “interventional regime”, but under some different “observational regime”. We could code this regime as F_E = idle. We can think of the non-stochastic variable F_E as a parameter, indexing the joint distribution of all the variables in the problem, under the regime indicated by its value.
It should be obvious that, even to begin to think about the task of using data collected under one regime to infer about the properties of another, we need to make (and should attempt to justify!) assumptions as to how the regimes are related. Suppose the response of interest is Y, and we also measure additional variables X (all symbols may represent collections of variables). We call X a sufficient covariate when we can assume that the following two conditions hold:
1. X ind F_E
2. Y ind F_E | X, E
Here (in the absence of the independence symbol) A ind B | C denotes that A is independent of B given C, or, equivalently, that p(a | b, c) does not depend on the value b of B (for given a,c). Note that this makes sense even if (as in 1 and 2) B is a parameter variable rather than a random variable. We can handle such “extended conditional independence” (ECI) properties using exactly the same algebraic rules as for regular probabilistic conditional independence (CI). And, if desired, we can use graphical representations (which explicitly include parameter variables along with random variables) to represent and manipulate ECI properties, exactly as for CI. The graph representing 1 and 2 would have arrows from F_E to E, from X to E and to Y, and from E to Y.
Assumption 1 says that the distribution of X is the same in all regimes, be they interventional or observational: this may well be reasonable if X is a “pre-treatment” variable. More important, Assumption 2 says that the distribution of Y, given both E and X, is the same in all regimes: that is to say, we do not need to know whether the value of E arose by intervention or “naturally”: this conditional distribution is a stable “modular component” that can be learned from the observational regime, and transferred to the interventional regime. Even if we restrict to pre-treatment variables, this is a strong additional condition, that may hold for some (non-unique) choices, and fail for others. In particular, if we do have a sufficient covariate, there is no reason that this property should be preserved when we add or subtract components of X.
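To spell out the transfer step, here is a short derivation of the adjustment formula that Assumptions 1 and 2 appear to license, written for discrete X. It relies on one extra condition not stated above, namely positivity: p(x, E = e | F_E = idle) > 0 for every x with p(x) > 0, so that the observational conditional is defined.

```latex
\begin{align*}
p(y \mid F_E = e)
  &= \sum_x p(y \mid x, E = e, F_E = e)\, p(x \mid F_E = e) \\
  &= \sum_x p(y \mid x, E = e, F_E = \text{idle})\, p(x \mid F_E = \text{idle}).
\end{align*}
```

The first line uses the fact that E = e with probability one in the regime F_E = e. The second replaces the first factor using Assumption 2 and the second factor using Assumption 1, so every term on the right can be estimated from observational data.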
When — and to a large extent only when — X is a sufficient covariate in the above sense does it make causal sense to “adjust for” X (e.g. by applying Pearl’s “back-door” formula). An interesting question is “When can we reduce X?”, i.e. find a non-trivial function V of X that is itself a sufficient covariate, so simplifying the adjustment task. One easily verified case is when V is the propensity score based on X, in which case E ind X | V, F_E (of course, if X is NOT itself initially sufficient, then nor typically will be V). Another is when the (modular) distribution of Y given X and E in fact only depends on V and E.
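As a minimal sketch of these two points, the following exact calculation uses a hypothetical three-level covariate X in which two levels share the same propensity score. All probabilities and function names are made up for illustration. Assumptions 1 and 2 are built into the toy model, so the check is only that adjusting for X, or for the coarser propensity score V, reproduces the interventional distribution from the observational joint, while the unadjusted conditional does not.

```python
from itertools import product

# Hypothetical toy model; every number here is an illustrative assumption.
P_X = {0: 0.3, 1: 0.3, 2: 0.4}          # p(x), the same in every regime (Assumption 1)
PROPENSITY = {0: 0.2, 1: 0.2, 2: 0.7}   # p(E=1 | x) in the observational regime
P_Y1 = {                                # p(Y=1 | x, e), the same in every regime (Assumption 2)
    (0, 0): 0.1, (1, 0): 0.3, (2, 0): 0.5,
    (0, 1): 0.4, (1, 1): 0.8, (2, 1): 0.9,
}


def observational_joint():
    """Joint p(x, e, y) in the regime F_E = idle."""
    joint = {}
    for x, e, y in product(P_X, (0, 1), (0, 1)):
        p_e = PROPENSITY[x] if e == 1 else 1 - PROPENSITY[x]
        p_y = P_Y1[(x, e)] if y == 1 else 1 - P_Y1[(x, e)]
        joint[(x, e, y)] = P_X[x] * p_e * p_y
    return joint


def interventional(e):
    """Target quantity p(Y=1 | F_E = e), computed from the model itself."""
    return sum(P_X[x] * P_Y1[(x, e)] for x in P_X)


def naive(joint, e):
    """Unadjusted observational conditional p(Y=1 | E=e, F_E = idle)."""
    num = sum(p for (_, e2, y), p in joint.items() if e2 == e and y == 1)
    den = sum(p for (_, e2, _), p in joint.items() if e2 == e)
    return num / den


def adjusted(joint, e, covariate):
    """Adjust for covariate(x), using only the observational joint."""
    total = 0.0
    for v in {covariate(x) for x in P_X}:
        p_v = sum(p for (x, _, _), p in joint.items() if covariate(x) == v)
        p_ve = sum(p for (x, e2, _), p in joint.items()
                   if covariate(x) == v and e2 == e)
        p_vey = sum(p for (x, e2, y), p in joint.items()
                    if covariate(x) == v and e2 == e and y == 1)
        total += p_v * p_vey / p_ve
    return total


joint = observational_joint()
for e in (0, 1):
    print("e =", e,
          "| target:", round(interventional(e), 4),
          "| naive:", round(naive(joint, e), 4),
          "| adjust for X:", round(adjusted(joint, e, lambda x: x), 4),
          "| adjust for V:", round(adjusted(joint, e, lambda x: PROPENSITY[x]), 4))
```

The reduction works here because, within a level of V, the observational distribution of X does not depend on E, which is the property E ind X | V, F_E quoted above.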
There are some parallels between the concept of covariate sufficiency and Fisher’s concept of a sufficient statistic, but also important differences. In particular, if we have identified two different sufficient covariates, V and W, there need be no way to combine them: neither their union (V,W), nor the information Z common to both of them, need be a sufficient covariate. The required properties simply do not follow from Assumptions 1 and 2, and counterexamples are readily provided.
To turn to Shrier’s “M-bias” example, we can turn Figure 1 of his original letter (doi:10.1002/sim.3172) into a graphical representation of ECI properties simply by adding an additional parameter node F_E, with an arrow from F_E to E. The graph then encodes, by d-separation, Assumptions 1 and 2, where Y is “outcome”, and X is, alternatively, either U_1 or U_2. Thus each of U_1 and U_2 is a sufficient covariate (as, in this special case, is the information common to them both—which is null). But although Assumption 1 holds for X = C, Assumption 2 for X = C is NOT a consequence of d-separation, and does not follow from the assumptions made: so there is no reason to expect C to be a sufficient covariate — and it typically will not be. In the absence of sufficiency, we can expect adjustment to lead to a mismatch between the quantity estimated in the observational regime and the target causal quantity of the interventional regime — which is my interpretation of the term “bias”.
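A small simulation can make the mismatch concrete. The sketch below assumes the usual M-structure (U_1 affects E and C, U_2 affects C and Y, and E affects Y) with a linear Gaussian model; the unit coefficients, the effect size BETA, and the function names are hypothetical choices, not taken from Shrier's letter. Under these choices the population regression coefficient of Y on E works out to BETA when adjusting for nothing, U_1, or U_2, and to BETA - 0.2 when adjusting for C.

```python
import random

BETA = 1.0  # hypothetical causal effect of E on Y


def simulate(n, beta, seed=0):
    """Draw n observational records (u1, u2, c, e, y) from the assumed M-structure."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
        c = u1 + u2 + rng.gauss(0, 1)        # collider: child of U_1 and U_2
        e = u1 + rng.gauss(0, 1)             # exposure depends on U_1 only
        y = beta * e + u2 + rng.gauss(0, 1)  # outcome depends on E and U_2
        rows.append((u1, u2, c, e, y))
    return rows


def cov(a, b):
    """Sample covariance (divisor n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def slope_on_e(e, y, z=None):
    """Least-squares coefficient of e when regressing y on e (and optionally z)."""
    if z is None:
        return cov(e, y) / cov(e, e)
    see, szz, sez = cov(e, e), cov(z, z), cov(e, z)
    sey, szy = cov(e, y), cov(z, y)
    # Solve the 2x2 normal equations for the coefficient of e.
    return (szz * sey - sez * szy) / (see * szz - sez ** 2)


rows = simulate(50_000, BETA)
u1, u2, c, e, y = (list(col) for col in zip(*rows))
estimates = {
    "none": slope_on_e(e, y),
    "U1": slope_on_e(e, y, u1),
    "U2": slope_on_e(e, y, u2),
    "C": slope_on_e(e, y, c),
}
for name, value in estimates.items():
    print("adjusting for", name, "->", round(value, 3))
```

In this sketch only the C-adjusted estimate drifts away from BETA, which matches the claim that U_1 and U_2 are each sufficient while C typically is not.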
Now, my brief comment:
I haven’t had a chance to read through this carefully–in fact, I’m almost finished with my long response to the earlier blog discussion; I’ll post something soon–but what struck me was Dawid’s phrase, “the task of using data collected under one regime to infer about the properties of another.”
This reminds me of two different ways that we, as statisticians, deal with nonrandom sampling. (Ultimately, that’s what this is all about. If we had random sampling–in this case, random treatment assignment, so that observed data were a completely random sample of potential data–then we wouldn’t have to worry about causal inference at all; we could just do straight descriptive statistics and interpret all our results causally.) The two approaches are:
1. Modeling both the observed-data distribution and the target distribution. From this perspective, it’s natural to use weighting to adjust the sample from the first distribution to be representative of the second distribution.
2. Filling in the gaps with missing data. From this perspective, it’s natural to model and impute enough missing data, either the entire population or enough to look like a sample from the target distribution.
Each of these approaches has practical problems–not insurmountable problems, but real difficulties nonetheless. The weighting approach can run into difficulty with empty cells or, if these are smoothed, one has to figure out a good way to do the smoothing. The imputation approach can be tough when there are a lot of variables to impute on and you don’t really trust your model.
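Here is a toy sketch of the two approaches on a single categorical variable. The data, the cell names, and the target shares are all hypothetical. Weighting is done by poststratification, and imputation fills in each target cell with the sample cell mean, which is the simplest possible "model."

```python
from collections import defaultdict

# Hypothetical sample of (cell, outcome) pairs; "old" is overrepresented.
sample = [("young", 1.0), ("young", 3.0),
          ("old", 4.0), ("old", 6.0), ("old", 5.0), ("old", 5.0)]
# Hypothetical cell shares in the target distribution.
target_share = {"young": 0.5, "old": 0.5}


def weighted_estimate(sample, target_share):
    """Approach 1: reweight each observation by target share / sample share."""
    counts = defaultdict(int)
    for cell, _ in sample:
        counts[cell] += 1
    n = len(sample)
    weights = [target_share[cell] / (counts[cell] / n) for cell, _ in sample]
    return sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)


def imputed_estimate(sample, target_share):
    """Approach 2: impute each target cell with the sample cell mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cell, y in sample:
        sums[cell] += y
        counts[cell] += 1
    cell_mean = {cell: sums[cell] / counts[cell] for cell in counts}
    return sum(share * cell_mean[cell] for cell, share in target_share.items())


raw_mean = sum(y for _, y in sample) / len(sample)
print("raw sample mean:", raw_mean)
print("weighted estimate:", weighted_estimate(sample, target_share))
print("imputed estimate:", imputed_estimate(sample, target_share))
```

With fully observed cells and cell means as the imputation model, the two estimates coincide. They come apart once a target cell is empty in the sample: this code simply fails there, and any fix amounts to choosing a smoothing rule for the weights or a model for the imputations.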
When it comes to causal inference, I associate method 1 above with Greenland and Robins, and I associate method 2 with Rubin. I don’t really know where I’d put Pearl, Rosenbaum, Heckman, and some others here. I guess, based on Dawid’s above discussion, that Pearl would be most comfortable with approach 1. My point here is not to make a recommendation but just to air this particular distinction.