Martin Lindquist and Michael Sobel published a fun little article in Neuroimage on models and assumptions for causal inference with intermediate outcomes. As their subtitle indicates (“A response to the comments on our comment”), this is a topic of some controversy. Lindquist and Sobel write:

Our original comment (Lindquist and Sobel, 2011) made explicit the types of assumptions neuroimaging researchers are making when directed graphical models (DGMs), which include certain types of structural equation models (SEMs), are used to estimate causal effects. When these assumptions, which many researchers are not aware of, are not met, parameters of these models should not be interpreted as effects. . . . [Judea] Pearl does not disagree with anything we stated. However, he takes exception to our use of potential outcomes notation, which is the standard notation used in the statistical literature on causal inference, and his comment is devoted to promoting his alternative conventions. [Clark] Glymour’s comment is based on three claims that he inappropriately attributes to us. Glymour is also more optimistic than us about the potential of using directed graphical models (DGMs) to discover causal relations in neuroimaging research . . .

Lindquist and Sobel’s arguments make sense to me, except on one point. They consider a causal setting z -> x -> y, where z is the treatment variable, x is the intermediate outcome, and y is the ultimate outcome, and much of their discussion centers on estimating the causal effect of x on y. I have two difficulties with their perspective:

1. If x is an observed variable that is not directly manipulated, I don’t know if it makes sense to talk about the effect of x on y, unconditional on the intervention that was used to change x. In their example, I’d talk about “the effect of x on y, if x is changed through z.” Different z’s can induce different effects of x on y.

2. Lindquist and Sobel talk about the effect of z on x. If z=0 or 1, they write x(z), so that the causal effect of z on x is x(1) – x(0) (or, more generally, x(1) compared to x(0), but we lose nothing by considering simple differences here). So far, so good.

But I get stuck at the next step, where they define the effect of x on y. If x can equal 0 or 1, they write y(z,x), so that the causal effect of x on y, conditional on z, is y(z,1) – y(z,0). At least, I think that’s what they’re saying.

The trouble is, I don’t see how the two parts of this model fit together. For any given item in the experiment, I think they’re following the rule that x(z) has a particular (although maybe unknown) value. But then I don’t see what it means to look at y(z,1) – y(z,0). For any particular value of z, it seems to me that only one of these two terms is possible. (For example, if x(z)=1, then y(z,1) is defined but y(z,0) seems meaningless.)
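To make the sticking point concrete, here’s a toy potential-outcomes table for a single unit in the chain z -> x -> y (all numbers invented for illustration, not from the paper): on paper we can write down all four values y(z, x), but once a z is assigned, the unit realizes only x(z), and only one of the two terms in y(z, 1) – y(z, 0) is ever observable for that unit.

```python
import random

random.seed(0)

# Hypothetical potential-outcomes table for ONE unit in the chain z -> x -> y.
# x_of_z[z] is the intermediate outcome this unit would show under treatment z;
# y_of_zx[(z, x)] is the final outcome under the pair (z, x).
x_of_z = {0: 0, 1: 1}                    # a "complier": x(0)=0, x(1)=1
y_of_zx = {(0, 0): 5, (0, 1): 7,
           (1, 0): 6, (1, 1): 9}         # all four potential outcomes, on paper

z = random.choice([0, 1])                # treatment actually assigned
x_observed = x_of_z[z]                   # the only x this unit can realize under z
y_observed = y_of_zx[(z, x_observed)]    # the only y we ever get to see

# The contrast y(z, 1) - y(z, 0) needs both y_of_zx[(z, 1)] and
# y_of_zx[(z, 0)], but under assignment z only (z, x_of_z[z]) occurs;
# the other term is counterfactual -- which is the puzzle in the post.
```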

I’m not saying that this framework is wrong, just that I don’t understand it.

That said, Lindquist and Sobel’s criticisms of Pearl and Glymour seem sound to me.

P.S. I wrote this last month and put it in the queue. Since then I’ve noticed that Pearl has responded to Lindquist and Sobel; see here. I don’t find Pearl’s response so convincing: I agree with Lindquist and Sobel’s statement that the graphical or structural equation modeling expression looks simple and appealing but that the underlying assumptions in those expressions are not so clear. But you can judge for yourself; as I wrote in my discussion of the book by Morgan and Winship, it’s good to have multiple expressions for a model, as different users are looking for different things.

To be specific, Pearl contrasts three expressions of a single model, the causal chain Z -> X -> Y. Here’s Pearl:

Pearl characterizes the third expression as the more meaningful and clear display.

In contrast, Lindquist and Sobel argue that the above graphical expression appears clear only because it sweeps the model’s assumptions under the rug. Lindquist and Sobel write:

None of this seems clear and simple to me! Speaking of clear and simple, I’m reminded of a scene, several decades ago, when a bunch of us on the county math team won some competition, and the prize was that we each got to choose one of several math books. One of the books was called Elementary Linear Algebra, and I remember making a disdainful remark to my friend that I didn’t want something elementary. My friend replied, “Linear algebra is not elementary.” Good point.

Which brings back another memory: our coach for the Mathematical Olympiad program was an unbelievably grumpy old man. At one point he interrupted one of his lectures to rant about how all the calculus books now are wasting their space with applications. At some point, he said, they’re gonna come up with a book called Applied Calculus with Applications. That all seemed natural to me at the time but in retrospect I’m amazed by how brainwashed we all were. There was one kid there who I recall was interested in engineering problems rather than number theory etc., but that was an unusual preference. (I just looked him up and, amazingly, he grew up to be an engineering researcher!) The other thing I remember about the grumpy coach dude, besides his personality (which, in retrospect, was perhaps necessary to keep a bunch of 15-year-old boys in line; even nerds can make trouble), was that he thought it was cheating to use calculus or analytic geometry. His favorite sorts of problems used elaborate arguments from classical geometry and he always felt we should be able to solve these without resorting to technical means.

As I’ve remarked more than once in this space, I feel lucky in retrospect to have been pretty unprepared for the Olympiad program, with the result that I didn’t do very well there, gradually lost interest in this sort of competitive event, and decided I didn’t want to be a pure mathematician. I think it must’ve been really hard on the kids who were top performers but didn’t happen to be Noam. It was easier for those of us in the bottom half of the group.

I find Pearl’s directed graph approach to causality so sound, I have very little patience for other approaches. That said, Pearl drives to the very cliff edge, then pins the accelerator to the floor, then plummets to the canyon floor.

[1] The cogs of the universe are not intrinsically causal-analysis-friendly. If you wait long enough (an exponential function of the age of the universe), mud will cause rain: the mud will evaporate into droplets hurtling upward, and those droplets will break and coalesce into a laden cloud. Impossible to express with Pearl’s notation: the graph must be acyclic.

Models can be of (1) discrete probability, (2) ad-hoc-histogram-based statistical, (3) continuous distribution statistical, (4) directed-acyclic-graph causal, possibly with arrows that behave stochastically as (1) (2) (3), or (5) computationally free-wheeling. Models tax limited resources, to different degrees.

[2] Models exist in a necessary multiplicity leading to situational competition and endless jockeying.

[3A] There are models that can be updated by new (actually-possible and reasonable, existing or potential) observations/experiments.

[3B] There are models that have utility in rational decision making about what actions to take.

[4] [3A] & [3B] may be different sets of models, and may have only an embarrassingly vague or even known-fallacious relation between them. (Our really-existing experimental protocol may let us “reject the null hypothesis”; we wish for that to allow us to assign a truth value to the hypothesis in our models of utility – we wish for a pony. A population of researchers, acting in good faith, who take on the discipline of “rejecting the null hypothesis” will tend to have fewer false reports based on samples that are too small than otherwise – but nothing else can be said.)

[5] Updating [3A] might require catastrophic changes in [3B], and vice versa.

[6] The universe doesn’t have to limit itself to that which can be reasonably modeled, obviously.

[7] There is a necessary multiplicity of models for making rankings and other decisions about models, leading to situational competition and endless jockeying. We flatter ourselves by assuming an infinite regress: the process terminates with a fast and frugal and thoughtless and careless heuristic.

Pearl assumes but a single ultimate directed-acyclic-graph causal model (allowing arrows that behave stochastically as (1) (2) (3), granted), in defiance of the above.

[Any quotes (or other punctuation) used above are not necessarily an accusation that Andrew Gelman ever said the emphasized phrase; I know I must be clear. This entire comment is not at all a criticism of AG; it is most definitely an invitation to have my own ignorance revealed, possibly mocked. I will continue to rehash stuff until I have the clarity I need, polluting comment boxes.]

Manuel: With regard to 6, the universe has to be representable for inquiring minds to exist – so there are limitations to the universe in which inquiry happens.

And I think we are in one of those :-)

If you wish for some challenging reading, you might look at Peirce and the Threat of Nominalism (Cambridge: Cambridge University Press, expected release in 2011) for more rigorous arguments about what is possible.

On the whole post: is this not like Charlie Chaplin and W.C. Fields [or someone else], both great comedians who could not stand each other’s work? I remember hearing that one went into the theater to watch the other’s work and became physically ill and had to leave.

Pearl’s approach is great when you actually know what the graph is. His approach grew out of the analysis of computer programs for debugging and compiling, where every change to the system is well-defined and enumerated. Want to know what caused x=0? Retrace the execution of the program until you see a line of code that touches x. It’s not so simple in social science applications, where there are possibly infinite ways of changing the system we haven’t even thought of yet.

Dear tc,

A few mild corrections to your posting.

1. My approach did not grow out of computer programs but, rather, from physics — the values of physical quantities are determined by Nature after She consults the values of other quantities in the neighborhood. This is the idea of structural equation modeling.

2. While it is true that the approach gives us maximum utility when we actually know the graph structure, it is not correct to say that the approach gives us nothing when the graph is only partially known, or even totally unknown. The latter case puts us in the good company of the Potential Outcome Society, and we can at least reason about the virtues of randomized experiments. In the former case, the approach instructs us to articulate what we know and distinguish it from what we do not know and, most importantly, gives us the guarantee that we have done the best with what is known — no other system can do better.

3. Regarding the complexities of social sciences — agreed. But such complexities do not exempt social scientists from the obligation to find out if the little they do know is sufficient to give them the answers they care about. Most social scientists just give up — to what? To bemoaning how complex causal analysis is.

It is not.

======Judea

I found this post surprising in your resistance to formalization of causal claims.

It is one thing to point out that particular assumptions are not plausible (e.g., that the exclusion restriction is not satisfied, so z -> x -> y is not the true model). It is another to resist a formalism that makes it possible to state these assumptions.

So I found it frustrating that you seemed to resist the notation, rather than stating problems with assumptions. When you write “Different z’s can induce different effects of x on y.”, I read this as the latter. But “I don’t know if it makes sense to talk about the effect of x on y, unconditional on the intervention that was used to change x” reads as resistance to even talking about what we likely care about. Now a single study, with a single intervention, may not help us learn a great deal about this quantity — but that’s still our long-term goal. Also, one reason that the resulting estimates change is (in the binary case) differences in the population of compliers. This is a different problem from the intervention violating the exclusion restriction, and by formalizing the problem we can see this.

Depending on the problem, potential outcomes, non-parametric structural equations, and DAGs are each great ways of expressing the assumptions. But I do agree with Pearl that potential outcomes can quickly get very unwieldy except in the simplest cases.

Dean:

You are bothered by my statement, “I don’t know if it makes sense to talk about the effect of x on y, unconditional on the intervention that was used to change x.” But perhaps you won’t be surprised to hear that I’m not bothered by my own statement! Sure, you can average the effects on y of different interventions that affect x, but that’s still an average treatment effect, and I’d like my model to reflect that.

Also, I don’t resist formalizing causal claims. The Lindquist and Sobel paper is full of formalization of causal claims. I just find some of these formalizations confusing. I wouldn’t call this resistance, exactly!

First, I want to note that I was only surprised / frustrated since I usually agree with you on matters of statistical and causal inference.

On the issue of averaging “the effects on y of different interventions that affect x”, we can treat this directly within the formalism by considering (a) heterogeneous effects of the instrument Z on the treatment X and (b) violations of the exclusion restriction.

(a) leads to thinking about who the compliers are, and highlights that the LATEs identified by each of several different instruments might differ primarily because the compliers are different.

(b) leads to considering other mechanisms of the effect of Z on Y besides X. In some cases, it may be possible to identify a critical second variable W. It may often be the case that this variable is of greater interest. So understanding how W interacts with X to produce Y is appealing, as is averaging over the observed distribution of W, rather than averaging over rare interventions/instruments.
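Point (a) can be illustrated with a toy finite population (all numbers invented, not from any study): two perfectly valid instruments for the same treatment can identify different LATEs simply because they move different units.

```python
# Hypothetical five-unit population. The exclusion restriction holds:
# y depends only on the treatment x, via potential outcomes (y0, y1).
# Each unit also has a compliance status under two different
# instruments, z1 and z2 (True = that instrument moves this unit's x).
population = [
    # (complier for z1?, complier for z2?, y0, y1)
    (True,  False, 0.0, 2.0),  # large effect; moved only by z1
    (True,  False, 0.0, 2.0),
    (False, True,  0.0, 0.5),  # small effect; moved only by z2
    (False, True,  0.0, 0.5),
    (True,  True,  0.0, 1.0),  # moved by either instrument
]

def late(instrument):
    """Average effect y1 - y0 among the given instrument's compliers:
    the LATE that a valid IV analysis with that instrument identifies."""
    effects = [y1 - y0 for c1, c2, y0, y1 in population
               if (c1 if instrument == 1 else c2)]
    return sum(effects) / len(effects)

print(late(1))  # 5/3: z1's compliers include the large-effect units
print(late(2))  # 2/3: z2's compliers are mostly small-effect units
```

Both numbers are correct causal quantities; they differ only because the complier populations differ, which is exactly the distinction between (a) and (b).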

I guess I felt that by not using formalism to express your objection, we lost clarity about whether your worries were (a) or (b) or both.

But maybe all you are saying is that they take x_i(z) to have a determinate, single value, and that then talking about y_i(z, 0) – y_i(z, 1) doesn’t make much sense. If that’s it, then I agree: x_i(z) is better thought of as a random variable. I find this a confusing aspect of the potential outcomes notation. Maybe using the do() operator helps, since then we care about E[y_i | do(X = 1), Z = z] – E[y_i | do(X = 0), Z = z].

Or of course we could write E[y_i(z_i, 1) – y_i(z_i, 0)]…

“Maybe using the do() operator helps, since then we care about E[y_i | do(X = 1), Z = z] – E[y_i | do(X = 0), Z = z].”

I am a little confused. I thought the whole issue in IV is we do not manipulate X directly but Z instead. Should it not be re-written with “do(Z=0)” and so on?

“z -> x -> y is not the true model”

No such thing as a true model, only useful codifications of our causal knowledge.

By “causal” I simply mean to say that if I do X, ceteris paribus, I expect Y will happen. By “useful” I mean, for example, using our codified causal knowledge in a DAG to help an AI robot get on with life.

I don’t see a DAG as capturing reality, and I do not care about “reality” or “truth”. As a pragmatist (see Rorty), I care how my causal understanding of the shadows in the cave, to use Plato’s analogy, helps me live my life in the cave.

For example, should I care if, every time I flick on the switch, it is not my action that turns on the light but God that does it only to trick me?

Constant conjunction, propinquity and all do not seem at all relevant, nor a careful ontology of causality or any notion of truth. Only experience would seem to matter. DAGs are useful for codifying this experience, not to discuss the ontology of causality.

If units are randomly assigned to levels of Z, then wherever we write Z = z we can write do(Z = z). If the exclusion restriction holds, then we also have that E[y_i | do(X = 1), Z = z] = E[y_i | do(X = 1)].

While I generally think that in social phenomena, the default expectation should be that everything affects everything else, there are also many cases where particular exclusion restrictions are very well motivated. Temporal ordering provides one such case (see the next blog post). As does mediation of effects of peer attitudes by peer behaviors.
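A minimal structural-equation sketch of this exchange (invented functional forms, deterministic for simplicity): do(X = x) means overriding X’s own equation and plugging the value into Y’s equation, and the exclusion restriction is exactly the statement that z has dropped out of that equation.

```python
# Two toy structural equations for y in the chain z -> x -> y.
# All functional forms and coefficients are made up for illustration.

def y_chain(z, x):
    return 3 * x          # exclusion restriction holds: z does not enter

def y_direct(z, x):
    return 3 * x + 2 * z  # exclusion violated: z affects y directly

def do_x(y_fn, x_value, z):
    """E[y | do(X = x_value), Z = z] in this deterministic toy model:
    ignore x's own structural equation and evaluate y's equation."""
    return y_fn(z, x_value)

# With the exclusion restriction, the interventional mean doesn't depend on z:
assert do_x(y_chain, 1, z=0) == do_x(y_chain, 1, z=1)

# Without it, conditioning on z still matters even after do(X = 1):
assert do_x(y_direct, 1, z=0) != do_x(y_direct, 1, z=1)
```

This also answers the earlier question about do(Z = z): when Z is randomized, conditioning on Z = z and do(Z = z) coincide, so the interesting intervention to write down is the one on X.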

I’ve read Rorty and was lucky enough to take his class and even, on one occasion, dine with him. I agree with his pragmatism on many points, especially in his forceful presentation of Donald Davidson’s philosophy. Nonetheless, I still have use for wondering whether Z’s effect on Y is exhausted by its effect on Y via X.

“potential outcomes can quickly get very unwieldy except in the simplest cases”

In my experience DAGs can get complicated pretty fast too, though I like them.

Maybe there is an R package to move from a non-parametric structural equation model to a DAG and vice versa; one that then applies the backdoor criterion for some pre-determined intervention within a given DAG and spits out the minimum set of controls needed. Any suggestions? It would make it easier to work with DAGs.

This is a good point about the difference between model specification and computation. Graphs and functional equations are meant to be a simple language for expressing assumptions, but inferring the consequences of those assumptions is a different matter (in the same sense that Bayesian inference can provide a clean language for expressing a model, but no one is expected to do MCMC by hand). I agree that truly complete and user-friendly software is still lacking. The Tetrad software does some of it, though I guess there are all sorts of other questions users would like to be able to answer, like Fernando’s request.
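As a sketch of the kind of computation such software automates (the R package dagitty, for instance, computes adjustment sets from a DAG; the numbers below are invented), here is the backdoor adjustment done by hand for the smallest interesting DAG, a confounded treatment:

```python
# Tiny DAG with a confounder:  U -> X,  U -> Y,  X -> Y.
# Structural equation (hypothetical): Y = X + 2*U, so the true
# causal effect of X on Y is 1. Adjusting for U blocks the
# backdoor path X <- U -> Y; conditioning on X alone does not.
p_u = {0: 0.5, 1: 0.5}          # P(U = u)
p_x_given_u = {0: 0.2, 1: 0.8}  # P(X = 1 | U = u)

def e_y(x, u):
    return x + 2 * u            # E[Y | X = x, U = u]

def naive(x):
    """E[Y | X = x]: U's distribution shifts with x, so the
    backdoor path stays open and the contrast is confounded."""
    joint = {u: p_u[u] * (p_x_given_u[u] if x == 1 else 1 - p_x_given_u[u])
             for u in p_u}
    total = sum(joint.values())
    return sum(joint[u] / total * e_y(x, u) for u in p_u)

def adjusted(x):
    """Backdoor formula: E[Y | do(X = x)] = sum_u P(u) * E[Y | x, u]."""
    return sum(p_u[u] * e_y(x, u) for u in p_u)

print(naive(1) - naive(0))        # 2.2: confounded contrast
print(adjusted(1) - adjusted(0))  # 1.0: the true effect of X
```

The adjusted contrast recovers the structural coefficient on X; the naive one mixes in the confounder.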

Incidentally, I personally found Lindquist and Sobel’s reading of Pearl’s example an unfortunate mess. I see nothing “illusory” about its (relative) simplicity. Moreover, Pearl is making the error terms explicit in the graph to draw a closer analogy to the potential outcome framework and SEM, but even that does not preclude specifying the model without the error terms, using the very same independence calculus.

Andrew wrote: “Different z’s can induce different effects of x on y.”

My impression is that this would imply a bad choice of x. Example: Let y be the car’s velocity, let x be the position of the accelerator, let z1 be the driver, and let z2 be the cruise control mechanism. If z1 causes the value of x to be x0, this should produce the same velocity as when z2 causes the value of x to be x0.

This is an engineering example. Maybe there are no good behavioral science choices for an intermediate outcome x?

Dear Andrew,

I was glad to see the paper by Lindquist and Sobel (L&S) being discussed on your blog.

I wrote a little reply to it, but NeuroImage won’t publish it, so I posted it on my blog:

http://www.mii.ucla.edu/causality/

The gist of it is the following:

L&S’s warning of the importance of scrutinizing assumptions is admirable. Yet readers of NeuroImage will have difficulty understanding why they are judged incapable of scrutinizing causal assumptions in the one language that makes these assumptions transparent, i.e., diagrams or SEM, and why they are threatened with “incorrect inferences” for not rushing to translate meaningful assumptions into a language where they can no longer be recognized, let alone justified.

I then formulate the causal-chain example in three different languages and ask readers:

1. In what language are these causal assumptions more meaningfully and clearly displayed?

2. How can members of the Arrow-Phobic Society be lured into a dispassionate examination of the comparison above?

Perhaps your readers will be able to answer question (2); I am still hoping.

======Judea

Judea:

Look carefully at the blog post above. I linked to, quoted from, and discussed your response!