Rubinism: separating the causal model from the Bayesian data analysis

In the most recent round of our discussion, Judea Pearl wrote:

There is nothing in his theory of potential-outcome that forces one to “condition on all information” . . . Indiscriminate conditioning is a culturally-induced ritual that has survived, like the monarchy, only because it was erroneously supposed to do no harm.

I agree with the first part of Pearl’s statement but not the second part (except to the extent that everything we do, from Bayesian data analysis to typing in English, is a “culturally induced ritual”). And I think I’ve spotted a key point of confusion.

To put it simply, Donald Rubin’s approach to statistics has three parts:

1. The potential-outcomes model for causal inference: the so-called Neyman-Rubin model in which observed data are viewed as a sample from a hypothetical population that, in the simplest case of a binary treatment, includes y_i^1 and y_i^2 for each unit i.

2. Bayesian data analysis: the mode of statistical inference in which you set up a joint probability distribution for everything in your model, then condition on all observed information to get inferences, then evaluate the model by comparing predictive inferences to observed data and other information. (A small numeric sketch of this conditioning step appears just after this list.)

3. Questions of taste: a preference for models supplied from the outside rather than models inspired by data, a preference for models with relatively few parameters (for example, trends rather than splines), a general lack of interest in exploratory data analysis, a preference for writing models analytically rather than graphically, and an interest in causal rather than descriptive estimands.
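
To make the conditioning in item 2 concrete, here is a minimal numeric sketch in Python. The model and the data are made up for illustration; this is a toy, not Rubin's analysis.

    import numpy as np

    # Toy model: a success probability theta with a flat prior on a grid,
    # and observed data of 7 successes in 10 binary trials.
    theta = np.linspace(0.01, 0.99, 99)
    prior = np.full(theta.size, 1.0 / theta.size)

    y, n = 7, 10
    likelihood = theta**y * (1 - theta)**(n - y)

    # "Condition on all observed information" is just Bayes' rule:
    # multiply the prior by the likelihood and renormalize.
    posterior = prior * likelihood
    posterior /= posterior.sum()

    print("posterior mean of theta:", (theta * posterior).sum())

The mechanics are always the same: all the substantive choices live in the joint model, and inference is conditioning on what was observed.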

As that last list indicates, my own taste in statistical modeling differs in some ways from Rubin’s. But what I want to focus on here is the distinction between item 1 (the potential outcomes notation) and item 2 (Bayesian data analysis).

The potential outcome notation and Bayesian data analysis are logically distinct concepts!

Items 1 and 2 above can occur together or separately. All four combinations (yes/yes, yes/no, no/yes, no/no) are possible:

– Rubin uses Bayesian inference to fit models in the potential outcome framework.

– Rosenbaum (and, in a different way, Greenland and Robins) use the potential outcome framework but estimate using non-Bayesian methods.

– Most of the time I use Bayesian methods but am not particularly thinking about causal questions.

– And, of course, there’s lots of statistics and econometrics that’s non-Bayesian and does not use potential outcomes.

Bayesian inference and conditioning

In Bayesian inference, you set up a model and then you condition on everything that’s been observed. Pearl writes, “Indiscriminate conditioning is a culturally-induced ritual.” Culturally-induced it may be, but it’s just straight Bayes. I’m not saying that Pearl has to use Bayesian inference (lots of statisticians have done just fine without ever cracking open a prior distribution), but Bayes is certainly a well-recognized approach. As I think I wrote the other day, I use Bayesian inference not because I’m under the spell of a centuries-gone clergyman; I do it because I’ve seen it work, for me and for others.

Pearl’s mistake here, I think, is to confuse “conditioning” with “including on the right-hand side of a regression equation.” Conditioning depends on how the model is set up. For example, in their 1996 article, Angrist, Imbens, and Rubin showed how, under certain assumptions, conditioning on an intermediate outcome leads to an inference that is similar to an instrumental variables estimate. They don’t suggest including an intermediate variable as a regression predictor or as a predictor in a propensity score matching routine, and they don’t suggest including an instrument as a predictor in a propensity score model.
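
To illustrate that distinction with a simulation (my sketch, not the Angrist-Imbens-Rubin analysis itself): with an unobserved confounder, putting the treatment straight into a regression gives a biased answer, while using the instrument in its proper role, here via the simple Wald ratio, recovers the effect. All variable names and numbers below are invented.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    beta = 2.0                      # true causal effect of x on y

    z = rng.normal(size=n)          # instrument: affects x, not y directly
    u = rng.normal(size=n)          # unobserved confounder of x and y
    x = z + u + rng.normal(size=n)
    y = beta * x + 3 * u + rng.normal(size=n)

    # Naive regression slope of y on x: biased by the confounder u.
    naive = np.cov(x, y)[0, 1] / np.cov(x, y)[0, 0]
    # Instrumental-variables (Wald) ratio: uses z in its proper role.
    wald = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

    print(f"true effect {beta}, naive regression {naive:.2f}, IV {wald:.2f}")

The instrument never appears as just another predictor; it enters through the structure of the model.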

If a variable is “an intermediate outcome” or “an instrument,” this is information that must be encoded in the model, perhaps using words or algebra (as in econometrics or in Rubin’s notation) or perhaps using graphs (as in Pearl’s notation). I agree with Steve Morgan in his comment that Rubin’s notation and graphs can both be useful ways of formulating such models. To return to the discussion with Pearl: Rubin is using Bayesian inference and conditioning on all information, but “conditioning” is relative to a model and does not at all imply that all variables are put in as predictors in a regression.

Another example of Bayesian inference is the poststratification which I spoke of yesterday (see item 3 here). But, as I noted then, this really has nothing to do with causality; it’s just manipulation of probability distributions in a useful way that allows us to include multiple sources of information.
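
For concreteness, the poststratification step is just a weighted combination of within-cell inferences, with weights taken from the population. A minimal sketch with hypothetical numbers:

    import numpy as np

    # Hypothetical inputs: estimated mean outcome within each
    # poststratification cell (say, from a fitted regression), and each
    # cell's known share of the target population.
    cell_estimates = np.array([0.30, 0.45, 0.60, 0.75])
    population_shares = np.array([0.40, 0.30, 0.20, 0.10])  # sums to 1

    # The poststratified estimate reweights the cells to the population.
    print("poststratified estimate:", float(cell_estimates @ population_shares))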

P.S. We’re lucky to be living now rather than 500 years ago, or we’d probably all be sitting around in a village arguing about obscure passages from the Bible.

15 thoughts on “Rubinism: separating the causal model from the Bayesian data analysis”

  1. Certainly, 1 and 2 are distinct. But they do seem to get horribly confused by non-experts, and untangling them seems hard.

    For example, how does one go about convincing a non-expert using BDA that, to appropriately answer their causal question, they should *not* include covariates "on the right-hand side of a regression equation" – even when it makes the model fit better?

  2. Anon: The answer to your question is that the inference part of BDA depends on a model. An instrument comes into a model in a particular way. BDA means little in isolation; most of our book (including chapter 7, which covers causal inference) is about particular statistical models and how they relate to scientific models and data.

  3. Pearl also draws a sharp distinction between causal models and the choice of inference method (i.e., Bayesian or frequentist) — on that you agree.

    When Pearl writes "conditioning" he really does mean conditioning, and not just including on the right hand side of a regression equation. When he writes about what variables can be in the conditioning set for valid causal inference, his equations only involve terms of the form Pr(.|.) — he assumes no particular form for any probability term. So that can't be the point of disagreement.

  4. Corey,

    I can accept that Pearl does not want to condition on some variables. That's his choice.

    The point I was making was that Pearl describes conditioning (or, as he puts it, "indiscriminate conditioning") as a "culturally-induced ritual" that is seemingly without foundation. Actually, though, conditioning is what you do in Bayesian inference. And Bayesian inference is, as we all know, a highly successful mode of statistical reasoning. Pearl already said he's not a Bayesian, so, again, there's no reason why he should be required to condition on all available information. But he might want to think twice before dismissing those who do follow the Bayesian approach. Especially if he is dismissing researchers who have decades of experience solving problems in applied statistics.

    Also, not to be picky, but "frequentist" is not a method of inference. Frequentism is an approach to evaluating inferences; it is not a method that produces inferences. Any method (including a Bayes estimate) can be a "frequentist" method if it is evaluated in that way.

  5. Well, I've got my copies of BDA 2ed and Gelman and Hill, so I'm well sourced for Rubin-style inference, and I have access to a copy of Pearl's "Causality". If I have some free time in the near future, I'll take a stab at what will turn out to be either (a) a demonstration of Pearl's thesis in potential-outcome terms for the simplest causal model that demonstrates the thesis, or (b) a comparative study of the two approaches as applied to that causal model (the latter if your inference that Pearl makes an inapplicable assumption in his equivalence proof is correct). The proof of the pudding is in the eating.

  6. P.S. Regarding my comment just above: I'm not trying to deny Pearl the right to criticize Rosenbaum, Rubin, and others. I'm just pointing out that full conditioning is the essence of Bayesian statistics. So there's a lot behind this idea; it's not merely some sort of culturally-induced ritual or a simple misunderstanding. By not wanting to condition on all available information, Pearl is putting himself in the mainstream of non-Bayesian statistics. A fine place to be, but certainly not the only reasonable place to be, especially given the many successes of applied Bayesian statistics in recent decades.

  7. 1 != 2 as raised with question marks in my earlier post

    (the do operator operates on the specification??? and should be separate from the usual Bayes conditioning of the chosen joint probability model on "all" the data??? as well as from the choice of Larry's functional on the posterior?)

    so perhaps "stepping into it" further than I should

    1. specification of joint probability model(s) to adequately represent the causal questions of interest.

    2. the chosen joint model conditioned on all observed data (Bayes' theorem)

    3. extracting interesting marginal features from the posterior

    (the argument can't be about 2!)

    Keith

  8. First of all, I wanted to say that I've really enjoyed following this debate! Thanks for taking the time to go over these issues.

    In a previous post, Pearl presented a question regarding instrumental variables and how they would be handled from a Bayesian perspective:

    "Let X and Y be the outcomes of two fair coins,
    and let Z be a bell that rings if at least one
    of X and Y comes up head. We wish to estimate
    the causal effect of X on Y after collecting
    a huge number of samples, each in the form of
    a triplet (X, Y, Z). Should we include Z in the analysis?"

    I was wondering if you could discuss how the information from the bell would be used in a Bayesian analysis.

    I am asking because recently I learned about the problems that emerge when controlling for a collider in regression and I would like to learn if you can avoid the issue using Bayesian statistics.

    My apologies if you already replied to that question, but I did not see the answer in the previous posts.

  9. David:

    The way this is set up, it seems that the causal effect of x on y is zero. I'm not quite sure how the model would be set up; I'd have to know more about what information is known to the analyst.

    One extreme is that you have all the information described above. In this case, there are no unknown parameters in the model, and so there is no inference to do. The causal effect is zero as specified.

    At the other extreme, all the analyst has are the data (x,y,z), which would be 25% (0,0,0), 25% (0,1,1), 25% (1,0,1), and 25% (1,1,1). Clearly x and y are statistically independent here, but it would be impossible to talk about causation without further assumptions.

    So I imagine that Pearl is talking about some intermediate situation. This sort of problem is a little too context-free for me to really understand, I have to admit.
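
    A quick simulation makes the pattern in those numbers concrete (an illustrative sketch of the setup as described, not anyone's published analysis): x and y are marginally independent, but they become dependent once you condition on the bell z, which is a collider.

        import numpy as np

        rng = np.random.default_rng(1)
        n = 1_000_000
        x = rng.integers(0, 2, n)                    # fair coin 1
        y = rng.integers(0, 2, n)                    # fair coin 2
        z = ((x == 1) | (y == 1)).astype(int)        # bell: rings on any head

        print("P(y=1)          :", y.mean())                       # ~0.5
        print("P(y=1 | x=1)    :", y[x == 1].mean())               # ~0.5
        print("P(y=1 | x=1,z=1):", y[(x == 1) & (z == 1)].mean())  # ~0.5
        print("P(y=1 | x=0,z=1):", y[(x == 0) & (z == 1)].mean())  # = 1.0

    The last two probabilities differ, so x carries information about y once z is held fixed: exactly the collider problem David asked about.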

  10. Rubinism vs. Pearlism: Resolution Finally Reached
    By Judea Pearl on July 14, 2009
    Dear Andrew,
    In your last posting you have resolved the clash between Rubin and Pearl — many thanks.
    You have concluded: "Conditioning depends on how the model is set up."
    Which is exactly what I have been arguing in the last five postings.

    But I am not asking for credit. I would like only to repeat it to all the dozens of propensity score practitioners
    who are under the impression that Rubin, Rosenbaum and other leaders are strongly
    in favor of including as many variables as possible in the propensity score
    function, especially if they are good predictors of the treatment assignment.

    Let me quote you again, in case it did not reach some of these practitioners:
    "They (Rubin, Angrist, Imbens) don't suggest including an
    intermediate variable as a regression predictor or as a predictor in a
    propensity score matching routine, and they don't suggest including an
    instrument as a predictor in a propensity score model." (Gelman posting 2009)

    Our conclusion is illuminating and compelling:
    When Rosenbaum wrote: "there is little or no reason to avoid adjustment for a true covariate, a variable describing
    subjects before treatment" (Rosenbaum, 2002, p. 76), he really meant to exclude
    instrumental variables, colliders and perhaps other nasty variables from his statement. And when Rubin wrote (2009):
    “to avoid conditioning on some observed covariates, … is nonscientific ad hockery,”
    he really did not mean it in the context of propensity score matching (which was the
    topic of his article). And when Gelman wrote (in your first posting of this discussion):
    "For example, we [Gelman and Hill] recommend that your model should, if possible,
    include all variables that affect the treatment assignment" he (Gelman) really meant
    to exclude variables that affect the treatment assignment if they act like instruments.
    (which, if we look carefully at the reason for this exclusion, really means to exclude
    ALL variables that affect treatment assignment). And when Rubin changed the definition
    of "ignorability" (2009) "to be defined conditional on all observed covariates" he really
    meant to exclude colliders, instrumental variables and other trouble makers; he simply
    did not bother to tell us (1) that some variables are trouble makers, and (2) how to spot
    them.

    If you and I accept these qualifications, and if you help me get the word out to those poor practitioners,
    I don't mind it if you tell them
    that all these exceptions and qualifications are well known in the potential-outcome subculture
    and that these prove that Pearl's approach was wrong all along. But please get the word out to those poor propensity-
    score practitioners, because they are conditioning on everything they can get their hands on.
    I have spoken to many of them, and they are not even aware of the problem.
    They follow Rubin's advice, and they are scared to be called "unprincipled"; I am not.

  11. Judea:

    I agree with you that the term "unprincipled" is unfortunate, and I hope that all of us will try to use less negative terms when describing inferential approaches other than ours.

    Regarding your main point above, it does seem that we are close to agreement. Graphical modeling is a way to understand the posited relations between variables in a statistical model. The graphical modeling framework does not magically create the model (and I'm not claiming that you ever said it did), but it can be a way for the user to understand his or her model and to communicate it more easily to others.

    I think you're still confused on one point, though. It is not true that Rubin and Rosenbaum "really meant to exclude colliders, instrumental variables and other trouble makers." Rubin specifically wants to include instrumental variables and other "trouble makers" in his models; see his 1996 paper with Angrist and Imbens. He includes them, just not as regression predictors.

    I agree with you that Rubin would not include an instrument or an intermediate outcome in the propensity score function, and it is unfortunate if people are doing this. But he definitely recommends including instruments and intermediate outcomes in the model in an appropriate way (where "appropriate" is defined based on the model itself, whether set up graphically, as you prefer, or algebraically, as Rubin prefers).

  12. Resolution is Fast Approaching in the Discussion
    on Rubin's and Pearl's Approaches to Propensity Scores.

    Reply by Judea Pearl, July 15 2009
    Dear Andrew,
    We are indeed close to a resolution.
    Let us agree to separate the resolution into three parts:
    1. Propensity score matching
    2. Bayes analysis
    3. Other techniques

    1. Propensity score matching.
    Here we agree (practitioners, please listen) that one should
    screen variables before including them in the propensity-score function,
    because some of them can be trouble-makers, namely, capable of increasing
    bias over and above what it would be without their inclusion, and some are
    guaranteed to increase that bias.
    1.1 Who those trouble makers are, and how to spot them, is a separate question that is a matter of taste.
    Pearl prefers to identify them from the graph,
    and Rubin prefers to identify them from the probability distribution
    P(W|X, Y_0, Y_1), which he calls "the science".
    1.2 We agree that once those trouble makers are identified, they should be excluded (we repeat: excluded) from entering the propensity score function,
    regardless of how people interpreted previous statements by Rubin (2007,
    2009), Rosenbaum (2002) or other analysts.

    2. Bayes analysis.
    We agree that, if one manages to correctly formulate the "science" behind the problem
    in the form of constraints over the distribution P(W|X, Y, Y_0,Y_1) and load
    it with appropriate priors, then one need not exclude trouble makers in advance;
    sequential updating will properly narrow the posteriors to reflect both the science
    and the data. One such exercise is demonstrated in section 8.5 of Pearl's book
    Causality, which purposefully includes an instrumental variable to deal
    with Bayes estimates of causal effects in clinical trials with non-compliance. (Mentioned
    here to allay any fears that Pearl is "confused" about this point, or is unaware
    of what can be done with Bayesian methods.)
    2.1 Still, if the "science" proclaims certain covariates to be "irrelevant", there is
    no harm in excluding them EVEN FROM A BAYESIAN ANALYSIS,
    and this is true whether the "science" is expressed as a distribution over counterfactuals
    (as in the case of Rubin) or as a graph, based directly on the subjective judgments that are
    encoded in the "science". There might actually be benefits to excluding them, even when
    measurement cost is zero.
    2.2 Such irrelevant variables are, for example, colliders, and certain variables
    affected by the treatment, e.g., Cost Outcome.
    2.3 The status of intermediate variables (and M-Bias) is still open.
    We are waiting for detailed analysis of examples such as Smoking —>Tar—>Cancer
    with and without the Tar. There might be some computational advantages
    to including Tar in the analysis, although the target causal effect (of smoking
    on cancer) is insensitive to Tar measurements if Smoking is randomized.
    3. Other methods.
    Instrumental variables, intermediate variables and confounders
    can be identified and harnessed to facilitate effective causal inference using other methods,
    not involving propensity score matching or Bayes analysis. The measurement of Tar,
    for example (see the example above), can be shown to
    enable a consistent estimate of the causal effect (of Smoking on Cancer)
    even in the presence of confounding factors affecting both smoking and cancer
    (pages 81-84 of Causality).

    Shall we both sign on this resolution?
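
    To make item 3 concrete, here is a minimal numeric sketch of the front-door adjustment behind the Smoking -> Tar -> Cancer example (pages 81-84 of Causality). The probabilities below are invented for illustration, and the code is an editorial sketch rather than Pearl's own derivation; it recovers P(y=1 | do(x)) from the observational distribution of (x, t, y) alone, despite the unobserved confounder u.

        import itertools

        # Hypothetical structural model: u confounds x and y; x affects y
        # only through t (think Smoking -> Tar -> Cancer, u unobserved).
        p_u = {0: 0.5, 1: 0.5}
        p_x1_u = {0: 0.3, 1: 0.8}                  # P(x=1 | u)
        p_t1_x = {0: 0.2, 1: 0.9}                  # P(t=1 | x)
        p_y1_tu = {(0, 0): 0.1, (0, 1): 0.5,
                   (1, 0): 0.4, (1, 1): 0.8}       # P(y=1 | t, u)

        def bern(p1, v):                           # P(V = v) for Bernoulli(p1)
            return p1 if v == 1 else 1 - p1

        # Observational joint p(x, t, y), with u marginalized out.
        joint = {}
        for u, x, t, y in itertools.product([0, 1], repeat=4):
            pr = (p_u[u] * bern(p_x1_u[u], x) * bern(p_t1_x[x], t)
                  * bern(p_y1_tu[(t, u)], y))
            joint[(x, t, y)] = joint.get((x, t, y), 0.0) + pr

        def pm(x=None, t=None, y=None):            # marginals of the joint
            return sum(p for (xx, tt, yy), p in joint.items()
                       if (x is None or xx == x) and (t is None or tt == t)
                       and (y is None or yy == y))

        # Front-door formula:
        # P(y=1 | do(x)) = sum_t P(t|x) * sum_x' P(y=1|t,x') P(x')
        def front_door(xv):
            return sum((pm(x=xv, t=t) / pm(x=xv))
                       * sum(pm(x=xp, t=t, y=1) / pm(x=xp, t=t) * pm(x=xp)
                             for xp in [0, 1])
                       for t in [0, 1])

        # Ground truth, computable only because we built the model and see u.
        def truth(xv):
            return sum(p_u[u] * bern(p_t1_x[xv], t) * p_y1_tu[(t, u)]
                       for u in [0, 1] for t in [0, 1])

        for xv in [0, 1]:
            print(f"do(x={xv}): front-door {front_door(xv):.4f}, "
                  f"truth {truth(xv):.4f}")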

  13. Comments on David's questions.
    David referred to my coin-bell example:
    "Let X and Y be the outcomes of two fair coins,
    and let Z be a bell that rings if at least one
    of X and Y comes up head. We wish to estimate
    the causal effect of X on Y after collecting
    a huge number of samples, each in the form of
    a triplet (X, Y, Z). Should we include Z in the analysis?"
    and asked how a Bayesian analysis would handle this
    case.
    I should explain the example better.
    The story above describes what is going on in
    the real world. However, the scientist suspects
    that coin-1 has some effect on coin-2, and he/she
    is setting up a Bayesian experiment to find out
    the magnitude of this suspected causal
    effect.
    Nature knows that the answer is zero, but the
    scientist is not convinced because the data shows
    correlations on the days that the bell rings, as
    well as on the days that the bell does not ring.
    The question is how a Bayesian analysis would
    figure out that the bell ringing is irrelevant,
    and that the answer is "coin-1 has no effect on
    coin-2."
    ========Judea

  14. The Bayesian analysis figures out that the bell ringing is irrelevant the same way that the graphical analysis does: by judicious assumption! If the scientist believes it is possible ex ante that the bell mediates a causal relationship between the coins, he will also believe so ex post, because these data (taken as pure b,c1,c2 triples) cannot distinguish between the independent coins + dependent bell story and the independent bell + dependent coins story without additional assumptions. This is true regardless of whether one encodes the assumptions about the relationships between the bell ringing and the coin flips in a graphical model or in a likelihood function.

    One can decompose the joint distribution p(b,c1,c2) into a variety of expressions based on different conditional distributions that each look like different causal mechanisms, but are really just conditional probability distributions. To take the two in question, p(b,c1,c2) = p(b|c1,c2)p(c1,c2) = p(c1,c2|b)p(b). In this case, assumptions based on our knowledge of physical and temporal considerations make us very confident that the conditional probability in the latter decomposition cannot reflect a causal relationship. But since either decomposition can describe any possible distribution of data triples, the only way you know that estimating the conditional probability in p(b|c1,c2)p(c1,c2) might reveal a causal relationship while estimating the conditional probability in p(c1,c2|b)p(b) will not is that you know ex ante that the conditional probability in the latter is not causal.

    To get back to the question of whether one ought to "condition on everything", that advice needs some additional qualifiers. I am not sure this is exactly the right phrasing, but I think the appropriate guidance for Bayesian conditioning for causal inference is going to look something like the following: one should condition on all variables for which you can encode every plausible causal relationship between that variable and others as a conditional probability in an identified likelihood function. This is going to be a lot more restrictive than "condition on everything", because when you are not sure about causal directions, including many variables will generally make the likelihood unidentified.
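
    A numeric check of this indistinguishability (my sketch, with invented generator code): a "coins cause bell" generator and a "bell causes coins" generator, tuned to the matching conditionals, produce the same distribution over (c1, c2, b) triples, namely the four equally likely triples listed a few comments above.

        from collections import Counter
        import numpy as np

        rng = np.random.default_rng(2)
        n = 1_000_000

        # Story A: independent fair coins; bell rings if at least one head.
        c1 = rng.integers(0, 2, n)
        c2 = rng.integers(0, 2, n)
        b = ((c1 == 1) | (c2 == 1)).astype(int)

        # Story B: bell "first" with P(b=1) = 3/4; given b=0 both coins are
        # tails, given b=1 the pair is uniform on (0,1), (1,0), (1,1).
        b2 = (rng.random(n) < 0.75).astype(int)
        pairs = np.array([(0, 1), (1, 0), (1, 1)])
        pick = pairs[rng.integers(0, 3, n)]
        c1b = np.where(b2 == 1, pick[:, 0], 0)
        c2b = np.where(b2 == 1, pick[:, 1], 0)

        fA = Counter(zip(c1.tolist(), c2.tolist(), b.tolist()))
        fB = Counter(zip(c1b.tolist(), c2b.tolist(), b2.tolist()))
        for k in sorted(set(fA) | set(fB)):
            print(k, round(fA[k] / n, 3), round(fB[k] / n, 3))  # all ~0.25

    Since both generators reproduce the data exactly, only prior causal assumptions can separate the two stories.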

  15. graphs vs. likelihood functions
    Ben,
    I believe you need to take into account the major
    difference between graphs and likelihood functions. The former CAN encode causal relations, the latter CANNOT.
