1. Pearl has mathematically proved the equivalence of Pearl’s and Rubin’s frameworks. At the same time, Pearl and Rubin recommend completely different approaches. For example, Rubin conditions on all information, whereas Pearl does not do so. In practice, the two approaches are very different. Accepting Pearl’s mathematics (which I have no reason to doubt), this implies to me that Pearl’s axioms do not quite apply to many of the settings that I’m interested in.
I think we’ve reached a stable point in this part of the discussion: we can all agree that Pearl’s theorem is correct, and we can disagree as to whether its axioms and conditions apply to statistical modeling in the social and environmental sciences. I’d claim some authority on this latter point, given my extensive experience in this area–and of course, Rubin, Rosenbaum, etc., have further experience–but of course I have no problem with Pearl’s methods being used on political science problems, and we can evaluate such applications one at a time.
2. Pearl and I have many interests in common, and we’ve each written two books that are relevant to this discussion. Unfortunately, I have not studied Pearl’s books in detail and I doubt he’s had the time to read my books in detail either. It takes a lot of work to understand someone else’s framework, work that we don’t necessarily want to do if we’re already spending a lot of time and effort developing our own research programmes. It will probably be the job of future researchers to make the synthesis. (Yes, yes, I know that Pearl feels that he already has the synthesis, and that he’s proved this to be the case, but Pearl’s synthesis doesn’t yet take me all the way to where I want to go, which is to do my applied work in social and environmental sciences.) I truly am open to the possibility that everything I do can be usefully folded into Pearl’s framework someday.
That said, I think Pearl is on shaky ground when he tries to say that Don Rubin or Paul Rosenbaum is making a major mistake in causal inference. If Pearl’s mathematics implies that Rubin and Rosenbaum are making a mistake, then my first step would be to apply the syllogism the other way and see whether Pearl’s assumptions are appropriate for the problem at hand.
3. I’ve discussed a poststratification example. As I discussed yesterday (see the first item here), a standard idea, both in survey sampling and causal inference, is to perform estimates conditional on background variables, and then average over the population distribution of the background variables to estimate the population average. Mathematically, p(theta) = sum_x p(theta|x)p(x). Or, if x is discrete and takes on only two values, p(theta) = (N_1 p(theta|x=1) + N_2 p(theta|x=2)) / (N_1 + N_2).
This has nothing at all to do with causal inference: it’s straight Bayes.
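As a minimal sketch of the two-cell formula above, here it is applied to cell-level point estimates rather than full distributions (by linearity, the same weights give the posterior mean of the population quantity). The function name `poststratify` and all the numbers are made up for illustration.

```python
def poststratify(cell_estimates, cell_counts):
    """Population estimate: sum_x E(theta | x) * N_x / sum_x N_x."""
    if len(cell_estimates) != len(cell_counts):
        raise ValueError("need one population count per cell")
    total = sum(cell_counts)
    return sum(est * n for est, n in zip(cell_estimates, cell_counts)) / total

# Hypothetical numbers for two cells, x=1 and x=2.
estimates = [0.40, 0.70]   # estimate of theta within each cell
counts = [3000, 1000]      # N_1, N_2: population sizes of the cells

# Weighted toward the x=1 estimate because N_1 is three times N_2.
print(poststratify(estimates, counts))
```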
Pearl thinks that if the separate components p(theta|x) are nonidentifiable, you can’t do this, and you should not include x in the analysis. He writes:
I [Pearl] would really like to see how a Bayesian method estimates the treatment effect in two subgroups where it is not identifiable, and then, by averaging the two results (with two huge posterior uncertainties) gets the correct average treatment effect, which is identifiable, hence has a narrow posterior uncertainly. . . . I have no doubt that it can be done by fine-tuned tweaking . . . But I am talking about doing it the honest way, as you described it: “the uncertainties in the two separate groups should cancel out when they’re being combined to get the average treatment effect.” If I recall my happy days as a Bayesian, the only operation allowed in combining uncertainties from two subgroups is taking a linear combination of the two, weighted by the (given) relative frequencies of the groups. But, I am willing to learn new methods.
I’m glad that Pearl is willing to learn new methods–so am I–but, no new methods are needed here! This is straightforward, simple Bayes. Rod Little has written a lot about these ideas. I wrote some papers on it in 1997 and 2004. Jeff Lax and Justin Phillips do it in their multilevel modeling and poststratification papers where, for the first time, they get good state-by-state estimates of public opinion on gay rights issues. No “fine-tuned tweaking” required. You just set up the model and it all works out. If the likelihood provides little to no information on theta|x but it does provide good information on the marginal distribution of theta, then this will work out fine.
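To make the last sentence concrete, here is a toy Gaussian sketch. It is not taken from any of the papers mentioned, and every number in it is hypothetical. It assumes the data inform only the weighted average of two subgroup effects, with weak independent priors on each effect. The joint posterior then has wide marginals for each subgroup and a strong negative correlation between them, so the weighted average still comes out with a narrow posterior.

```python
import numpy as np

# Toy setup (all numbers hypothetical): two subgroups with known population shares.
w = np.array([0.5, 0.5])   # p(x=1), p(x=2)
tau = 10.0                 # weak prior sd: theta_j ~ N(0, tau^2), independent
s = 0.1                    # sd of the observed aggregate around w @ theta
y_bar = 1.3                # observed aggregate effect

# Conjugate Gaussian update; the likelihood informs only the combination w @ theta.
prior_precision = np.eye(2) / tau**2
lik_precision = np.outer(w, w) / s**2
post_cov = np.linalg.inv(prior_precision + lik_precision)
post_mean = post_cov @ (w * y_bar / s**2)

sd_subgroups = np.sqrt(np.diag(post_cov))       # marginal sd of each theta_j
sd_average = float(np.sqrt(w @ post_cov @ w))   # sd of the poststratified average
corr = post_cov[0, 1] / (sd_subgroups[0] * sd_subgroups[1])

print("posterior sd of each theta_j:", sd_subgroups)
print("posterior sd of the average: ", sd_average)
print("posterior correlation:       ", corr)
```

The variance of the average includes the covariance term, which is where the cancellation comes from: the two subgroup uncertainties are large but almost perfectly negatively correlated.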
In practice, of course, nobody is going to control for x if we have no information on it. Bayesian poststratification really becomes useful in that it can put together different sources of partial information, such as data with small sample sizes in some cells, along with census data on population cell totals.
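Here is a simplified sketch of combining sparse cell data with census totals. It assumes binary survey responses and a fixed Beta(2, 2) prior shared by all cells; a real multilevel model would estimate the pooling from the data instead. The cell counts and census totals are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey data per cell; some cells have very few respondents.
n = np.array([200, 5, 40, 2])                   # respondents per cell
y = np.array([90, 4, 22, 0])                    # "yes" responses per cell
N = np.array([50_000, 30_000, 15_000, 5_000])   # census population per cell

# Assumed common prior (an illustration choice, not estimated from the data).
a, b = 2.0, 2.0

# Conjugate posteriors: theta_x | y ~ Beta(a + y_x, b + n_x - y_x), one column per cell.
draws = rng.beta(a + y, b + n - y, size=(4000, len(n)))

# Poststratify each posterior draw using census shares.
shares = N / N.sum()
pop_draws = draws @ shares

print("cell posterior means:", draws.mean(axis=0))
print("population estimate: ", pop_draws.mean(), "+/-", pop_draws.std())
```

The small cells stay uncertain and are pulled toward the prior, but each one enters the population estimate only in proportion to its census share.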
Please, please don’t say “the correct thing to do is to ignore the subgroup identity.” If you want to ignore some information, that’s fine–in the context of the models you are using, it might even make sense. But Jeff and Justin and the rest of us use this additional information all the time, and we get a lot out of it. What we’re doing is not incorrect at all. It’s Bayesian inference. We set up a joint probability model and then work from it. If you want to criticize the probability model, that’s fine. If you want to criticize the entire Bayesian edifice, then you’ll have to go up against mountains of applied successes.
As I wrote earlier, you don’t have to be a Bayesian (or, I could say, you don’t have to be a Bayesian–I have great respect for the work of Hastie, Tibshirani, Robins, Rosenbaum, and many others who are developing methods outside the Bayesian framework)–but I think you’re on thin ice if you want to try to claim that Bayesian analysis is “incorrect.”
4. Jennifer and I and many others make the routine recommendation to exclude post-treatment variables from analysis. But, as both Pearl and Rubin have noted in different contexts, it can be a very good idea to include such variables–it’s just not a good idea to include them as regression predictors. If the only thing you’re allowed to do is regression (as in chapter 9 of ARM), then I think it’s a good idea to exclude post-treatment predictors. If you’re allowed more general models, then one can and should include them. I’m happy to have been corrected by both Pearl and Rubin on this one.
5. As I noted yesterday (see second-to-last item here), all statistical methods have holes. This is what motivates us to consider new conceptual frameworks as well as incremental improvements in the systems with which we are most familiar.
Summary . . . so far
I doubt this discussion is over yet, but I hope the above notes will settle some points. In particular:
– I accept (on authority of Pearl, Wasserman, etc.) that Pearl has proved the mathematical equivalence of his framework and Rubin’s. This, along with Pearl’s other claim that Rubin and Rosenbaum have made major blunders in applied causal inference (a claim that I doubt), leads me to believe that Pearl’s axioms are in some way not appropriate to the sorts of problems that Rubin, Rosenbaum, and I work on: social and environmental problems that don’t have clean mechanistic causation stories. Pearl believes his axioms do apply to these problems, but then again he doesn’t have the extensive experience that Rosenbaum and Rubin have. So I think it’s very reasonable to suppose that his axioms aren’t quite appropriate here.
– Poststratification works just fine. It’s straightforward Bayesian inference, nothing to do with causality at all.
– I have been sloppy when telling people not to include post-treatment variables. Both Rubin and Pearl, in their different ways, have been more precise about this.
– Much of this discussion is motivated by the fact that, in practice, none of these methods currently solves all our applied problems in the way that we would like. I’m still struggling with various problems in descriptive/predictive modeling, and causation is even harder!
– Along with this, taste–that is, working with methods we’re familiar with–matters. Any of these methods is only as good as the models we put into them, and we typically are better modelers when we use languages with which we’re more familiar. (But not always. Sometimes it helps to liberate oneself, try something new, and break out of the implicit constraints we’ve been working under.)