Judea Pearl and Dana Mackenzie sent me a copy of their new book, “The book of why: The new science of cause and effect.”

There are some things I don’t like about their book, and I’ll get to that, but I want to start with a central point of theirs with which I agree strongly.

**Division of labor**

A point that Pearl and Mackenzie make several times, even if not quite in this language, is that there’s a division of labor between qualitative and quantitative modeling.

The models in their book are qualitative, all about the directions of causal arrows. Setting aside any problems I have with such models (I don’t actually think the “do operator” makes sense as a general construct, for reasons we’ve discussed in various places on this blog from time to time), the point is that these are qualitative, on/off statements. They’re “if-then” statements, not “how much” statements.

Statistical inference and machine learning focus on the quantitative: we model the relationship between measurements and the underlying constructs being measured; we model the relationships between different quantitative variables; we have time-series and spatial models; we model the causal effects of treatments and we model treatment interactions; and we model variation in all these things.

Both the qualitative and the quantitative are necessary, and I agree with Pearl and Mackenzie that typical presentations of statistics, econometrics, etc., can focus way too strongly on the quantitative without thinking at all seriously about the qualitative aspects of the problem. It’s usually all about how to get the answer given the assumptions, and not enough about where the assumptions come from. And even when statisticians write about assumptions, they tend to focus on the most technical and least important ones, for example in regression focusing on the relatively unimportant distribution of the error term rather than the much more important concerns of validity and additivity.

If all you do is set up probability models, without thinking seriously about their connections to reality, then you’ll be missing a lot, and indeed you can make major errors in causal reasoning, as James Heckman, Donald Rubin, Judea Pearl, and many others have pointed out. And indeed Heckman, Rubin, and Pearl have (each in their own way) advocated for substantive models, going beyond data description to latch on to underlying structures of interest.

Pearl and Mackenzie’s book is pretty much all about qualitative models; statistics textbooks such as my own have a bit on qualitative models but focus on the quantitative nuts and bolts. We need both.

Judea Pearl, like Jennifer Hill and Frank Sinatra, is right that “you can’t have one without the other”: If you think you’re working with a purely qualitative model, it turns out that, no, you’re actually making lots of data-based quantitative decisions about which effects and interactions you decide are real and which ones you decide are not there. And if you think you’re working with a purely quantitative model, no, you’re really making lots of assumptions (causal or otherwise) about how your data connect to reality.

**The Book of Why**

Pearl and Mackenzie’s book is really three books woven together:

**1.** An exposition of Pearl’s approach to causal inference based on graphs and the do-operator.

**2.** An intellectual history of this and other statistical approaches to causal inference.

**3.** A series of examples including some interesting discussions of smoking and cancer, going far beyond what you’ll generally see in a popular book or a textbook on statistics or causal inference.

**About the exposition of causal inference**, I have little to say. As regular readers of this blog know, I have difficulty understanding the point of Pearl’s writing on causal inference (see, for example, here). Just as Pearl finds it baffling that statisticians keep taking causal problems and talking about them in the language of statistical models, I find it baffling that Pearl and his colleagues keep taking statistical problems and, to my mind, complicating them by wrapping them in a causal structure (see, for example, here).

I’m *not* saying that I’m right and Pearl is wrong here—lots of thoughtful people find Pearl’s ideas valuable, and I accept that, for many people, Pearl’s approach is a good way—perhaps the best way—to frame causal inference. I’m just saying that I don’t really have anything more to say on the topic.

**About the intellectual history of causal inference**: this is interesting. I disagree with a lot of what Pearl says, but I guess that’s kinda the point, as Pearl is fighting against the statistics establishment, which I’m part of. For example, there’s this from the promotional material that came with the book:

Using a calculus of cause and effect developed by Pearl and others, scientists now have the ability to answer such questions as whether a drug cured an illness, when discrimination is to blame for disparate outcomes, and how much worse global warming can make a heat wave.

Ummm, I’m pretty sure that scientists could do all these without the help of Pearl! Actually, for that last one, I think the physicists don’t really need statisticians at all.

On page 66 of the book, Pearl and Mackenzie write that statistics “became a model-blind data reduction enterprise.” Hey! What the hell are you talking about?? I’m a statistician, I’ve been doing statistics for 30 years, working in areas ranging from politics to toxicology. “Model-blind data reduction”? That’s just bullshit. We use models all the time. There *are* some statisticians who avoid models, or at least there used to be—I used to work in the same department with some of them—but that’s really a minority view within the field. Statisticians use models all the time, statistics textbooks are full of models, and so forth.

The book has a lot more examples along these lines, and I’ll append them to the end of this post.

I think the key line in the Pearl and Mackenzie book comes on page 90, where they write, “Linguistic barriers are not surmounted so easily.” My many long and frustrating exchanges with Pearl have made me realize how difficult it is to have a conversation when you can’t even agree on the topic to be discussed!

Although I was bothered by a lot of Pearl and Mackenzie’s offhand remarks, I could well imagine that their book could be valuable to outsiders who want to get a general picture of causal reasoning and its importance. In some sense I’m too close to the details. The big picture, if you set aside disputes about who did what when, is important, no matter what notation or language is used to frame it.

And that brings me to **the examples** in the book. These are great. I find some of the reasoning hard to follow, and Pearl and Mackenzie’s style is different from mine, but that’s fine. The examples are interesting and they engage the reader—at least, they engage me—and I think they are a big part of what makes the book work.

**Mischaracterizations of statistics and statisticians**

As noted above, Pearl and Mackenzie have a habit of putting down statisticians in a way that seems to reflect ignorance of our field.

On page 356, they write, “Instead of seeing the difference between populations as a threat to the ‘external validity’ of a study, we now have a methodology for establishing validity in situations that would have appeared hopeless before.” No, they would not have appeared hopeless before, at least not to statisticians who knew about regression models with interactions, poststratification, and multilevel models.
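To make the poststratification point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the strata, the stratum-level effects, and the population shares are made-up numbers, and the function name `poststratify` is mine. It assumes the treatment effect varies only across one stratifying variable, that the stratum-level effects have already been estimated in the study, and that the composition of the target population is known.

```python
# Hypothetical stratum-level treatment effects estimated in the study sample.
effects = {"young": 2.0, "middle": 1.0, "old": 0.5}

# Made-up shares of each stratum in the study sample and in the target population.
study_share = {"young": 0.6, "middle": 0.3, "old": 0.1}
target_share = {"young": 0.2, "middle": 0.3, "old": 0.5}


def poststratify(effects, shares):
    """Weighted average of stratum-level effects, using the given shares as weights."""
    total = sum(shares.values())
    return sum(effects[g] * shares[g] for g in effects) / total


study_ate = poststratify(effects, study_share)
target_ate = poststratify(effects, target_share)

print(f"average effect in the study sample:      {study_ate:.2f}")
print(f"average effect in the target population: {target_ate:.2f}")
```

The two averages differ only because the populations are composed differently. In real problems the stratum-level effects would come from a regression with interactions or a multilevel model rather than being typed in by hand.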

Again, on page 357: “the culture of ‘external validity’ is totally preoccupied with listing and categorizing the threats to validity rather than fighting them.” No. You could start with Jennifer’s 2011 paper, for example.

On page 371, they thank Chris Winship, Stephen Morgan, and Felix Elwert “for ushering social science into the age of causation.” Look. I think Winship, Morgan, and Elwert are great. I positively reviewed Morgan and Winship’s book. But even they wouldn’t say “they ushered social science into the age of causation.” Social scientists have been doing good work on causation long before Winship, Morgan, and Elwert came along. By writing the above sentence, Pearl and Mackenzie are just gratuitously insulting all the social scientists who came before these people. It’s kind of like when Kingsley Amis sang the praises of Ian Fleming and Dick Francis: that was his poke in the eye to all the other authors out there.

I don’t know which lines of the book were written by Pearl, which by Mackenzie, and which by both. In any case, I find it unfortunate that they feel the need to keep putting down statisticians and social scientists. If they were accurate in their putdowns, I’d have no problem. But that’s not what’s happening here. Kevin Gray makes a similar point here, from the perspective of a non-academic statistician.

Look. I know about the pluralist’s dilemma. On one hand, Pearl believes that his methods are better than everything that came before. Fine. For him, and for many others, they *are* the best tools out there for studying causal inference. At the same time, as a pluralist, or a student of scientific history, we realize that there are many ways to bake a cake. It’s challenging to show respect to approaches that don’t really work for you, and at some point the only way to do it is to step back and realize that real people use these methods to solve real problems. For example, I think making decisions using p-values is a terrible and logically incoherent idea that’s led to lots of scientific disasters; at the same time, many scientists do manage to use p-values as tools for learning. I recognize that. Similarly, I’d recommend that Pearl recognize that the apparatus of statistics, hierarchical regression modeling, interactions, poststratification, machine learning, etc., solves real problems in causal inference. Our methods, like Pearl’s, can also mess up—GIGO!—and maybe Pearl’s right that we’d all be better off to switch to his approach. But I don’t think it’s helping when he gives out inaccurate statements about what we do.

**P.S.** I also noticed a couple of what seem to be technical errors. No big deal, we all make mistakes, and there’s plenty of time to correct them for the second edition.

– Figure 2.3 of the book reproduces Galton’s classic regression of adult children’s heights on the average of parents’ heights. But even though the graph is clearly labeled “mid-parents,” in the caption and the text, Pearl and Mackenzie refer to it as “father’s height.”

Here’s one lesson you can learn from statisticians: Look at the data. (I’m also suspicious of Figure 2.2, as the correlation it shows between fathers’ and sons’ heights looks too large to me. But I could be wrong here; I guess it would be good to know where the data for that graph came from.)

This is not a big deal, but it shows a lack of care. Someone had to write that passage, and there’s no way to get it wrong if you read Galton’s graph carefully. What’s the point of reproducing the graph in your book if you don’t even look at it?

– From the caption of Figure 4.3: “R. A. Fisher with one of his many innovations: a Latin square . . . Such designs are still used in practice, but Fisher would later argue convincingly that a randomized design is even more effective.”

Huh? A Latin square design *is* randomized. It’s a restricted randomization. Just read any textbook on experimental design. Or maybe I’m missing something here?

**P.P.S.** More from Pearl here.

**P.P.P.S.** And more from Pearl here. He also brings up “cause of effect” questions, a topic that Guido Imbens and I discuss in this paper. I’m not saying that Pearl’s framework is wrong and ours is right—I expect that different approaches will be useful in different problems—I’m just pointing out that these questions can be addressed in the potential outcome framework.

Pearl also writes, “Can one really make progress on a lot of applied problems in causal inference without dealing with identification? Evidently, potential outcome folks think so, at least those in Gelman’s circles.” No, I never said that, not at all! Indeed I stated very clearly in many places, including in this post and its comment thread, that causal identification is necessary for causal inference. The thing I wrote, that Pearl was responding to, was my statement, “The methods that I’ve learned have allowed my colleagues and I to make progress on a lot of applied problems in causal inference . . .” When talking about “the methods that I’ve learned,” of course that includes what I’ve learned about causal identification.

And Pearl writes, “Gelman wants to move identification to separate books . . .” No, I do not want to do this nor did I ever say such a thing. Indeed, my book with Jennifer has three chapters on causal inference, so it can be clearly seen from my actions that I do not want to move causal identification to separate books.

On the plus side, Pearl says he’ll no longer characterize me as being frightened or lacking courage. So that’s a plus.

On the subject of mischaracterizations in this useful book, contra Pearl, there are no stats models that lack a causal model.

If you’re not using Pearl-like causal structure diagrams, you are (often tacitly) presuming a “causal salad” structure.

Jumbled additive factors, which is a poor description in many many cases.

If you’ll forgive the link to my own work, “causal salad” errors are explained in this short post

https://bigthink.com/errors-we-live-by/judea-pearls-the-book-of-why-brings-news-of-a-new-science-of-causes

Good share.

Causal Salad is in strong contention for my phrase of 2019!

Awesome! A superhero versus superhero movie. I’m gonna hit Pause and go make popcorn. BRB.

Of late, I have enjoyed the discussions on Andrew’s, Frank Harrell’s, Sander Greenland’s Twitter, and Deborah Mayo’s blogs. Preferable by far to the political discourse. I may not understand the technical discussions. But I often delve into something I know zippo about. I pick stuff up along the way.

I was just surprised that there were so many interesting thinkers. I wish I had come across u all 15 years ago.

I am very curious what people think of this book. I’ve read Pearl’s other books, but have just been flipping through TBoW reading the historical bits. The chapter on smoking was the most interesting to me. I hadn’t read any of that before.

Re qual/quant division of labor: I see lots of value for communication in DAGs and d-separation etc. But computational hurdles are serious. Take for example a proper instrumental variable model. No one should have to use 2-stage-worst-squares. I was writing up some notes lately on how to do IV and front-door models in Stan. There are some conceptual hurdles there for scientists who don’t think inside covariance matrices.

I read every interesting book twice before I comment. This might be a good venue to discuss the book. Others on Twitter who wanted to start another Book Club about Why may join in.

honestly i assume statisticians who don’t like 2SLS just don’t really understand how to answer causal questions. You aren’t achieving any believable causal estimates with a complicated structural model, I’m sorry. Hasn’t happened yet in economics, not sure it ever will. What do you think of Mostly Harmless by Angrist and Pischke? When it comes to obtaining estimates that will actually be used/believed in making policy decisions, we won’t be turning to models fit in Stan.

Blip:

Here is my review of Mostly Harmless Econometrics. Regarding two-stage least squares, you have to distinguish:

(a) the statistical model;

(b) the procedure for estimating the model.

Even if you accept the model as is, there are times when a classical estimate has problems. Two-stage least squares, like other simple non-regularized estimates, can be very noisy. As you might have heard, there is a replication crisis in science, deriving in part from the use of noisy estimates with noisy data. For an example in economics, see section 2.1 of this paper. (That’s an example not of two-stage least squares but of a noisy use of ordinary least squares, but the same general point holds in many published examples of two-stage least squares as well.) Regularized approaches, Bayesian or otherwise, can help us do better.

Regarding your last sentence: That’s not correct. Stan is already being used in real problems in economics. Anyway, Stan’s just a tool, a force multiplier, which like any model-fitting tool can be used well or poorly. I’m sure there are people out there even more hard core than you, who say that Stata is useless because for any real problem you shouldn’t need fancy tools like regression, it should be enough to just compute averages.

2SLS is a flakey estimator. Angrist himself has published revisions of his own IV estimates because of bias problems. If you want to use instrumental variables, you could use a multivariate model and a modern sampler, like Stan. A multivariate outcome model is not a complicated structural model. It is just the model that justifies using instrumental variables.
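A small simulation may help readers follow this exchange. What follows is a sketch under stated assumptions, not anyone's actual analysis: the data-generating process, the coefficient values, and the function names (`simulate`, `ols_slope`, `iv_slope`, `iv_spread`) are all invented for illustration. It assumes a single valid instrument, a linear model, and an unobserved confounder. With one instrument and one treatment, the ratio estimator below coincides with two-stage least squares.

```python
import numpy as np


def simulate(n, instrument_strength, rng, true_effect=1.0):
    """Simulate an instrument z, a treatment x, and an outcome y.

    u is an unobserved confounder that affects both x and y.
    """
    z = rng.normal(size=n)
    u = rng.normal(size=n)
    x = instrument_strength * z + u + rng.normal(size=n)
    y = true_effect * x + u + rng.normal(size=n)
    return z, x, y


def ols_slope(x, y):
    """Slope from a least-squares regression of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def iv_slope(z, x, y):
    """Single-instrument IV estimate: cov(z, y) / cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]


rng = np.random.default_rng(0)

# Large sample, strong instrument: OLS is confounded, IV is close to the truth (1.0).
z, x, y = simulate(100_000, instrument_strength=1.0, rng=rng)
ols_big = ols_slope(x, y)
iv_big = iv_slope(z, x, y)
print(f"large n, strong instrument: OLS = {ols_big:.2f}, IV = {iv_big:.2f}")


def iv_spread(strength, n=200, reps=500):
    """Interquartile range of the IV estimate across repeated small samples.

    The IQR is used instead of the standard deviation because the
    weak-instrument estimate has very heavy tails.
    """
    estimates = [iv_slope(*simulate(n, strength, rng)) for _ in range(reps)]
    q25, q75 = np.percentile(estimates, [25, 75])
    return q75 - q25


spread_strong = iv_spread(1.0)
spread_weak = iv_spread(0.1)
print(f"IQR of IV estimate, n = 200: strong = {spread_strong:.2f}, weak = {spread_weak:.2f}")
```

In this setup the large-sample OLS slope sits near 4/3 rather than the true 1, while the IV estimate is close to 1. In small samples with a weak instrument, the same IV estimator becomes far noisier. That noise is the motivation for regularized or fully model-based alternatives, which this sketch does not implement.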

I just feel like I can’t relate to someone who is talking about causal questions and casually saying “oh ya just add a few more variables in there, why not?”. The whole premise of model fitting, checking, and then fitting again, does not really fly with causal work. It’s not about fit necessarily! I find it very hard to believe, or even understand, the identifying assumptions in more complicated models. These are fine for purely statistical exercises, but they are not fine for causal work. Have you ever tried to replicate a structural paper in economics? It’s not even possible. The model has been fit and re-fit so many times to make it “work” that the notion that we could somehow get believable causal estimates out of it is laughable. IV “works” in an identification sense for very specific reasons – you can’t just willy nilly add complexity to that model without completely changing the required assumptions. Of course identification isn’t everything and the replication crisis in psychology has demonstrated this nicely. But I think, at least in economics, we are so far away from getting believable causal estimates from complicated models. Those have the double whammy issue of terrible identification and being p-hacked to death. P-hacking is just harder to see in these settings, but we can be sure it’s there.

To Rachael below, why are IV’s assumptions untenable? Entire seminars are devoted to defending these assumptions, and it’s often very clear what is required. For example, if someone is using changes in schooling laws as an instrument for education level obtained, we are going to spend a lot of time thinking about why a person’s exposure to different schooling laws could be correlated with our outcome of interest (e.g. wages) other than through changes in education. Why can’t we defend the validity of this assumption? Of course we can’t ever know if it holds but that’s true of any identifying assumption. At least it’s clear what they are. With complex models the assumption for causal identification is typically “my model is truth”. And if it’s not the truth (which we know it’s not), god knows what those parameter estimates actually mean. At the end of the day the key to good causal work is being able to show very clearly what variation is being used to identify your parameter of interest. Even then, it’s so easy to hide a lack of robustness in more complicated model structures.

None of that applies to my position.

Instrumental variables and 2SLS are not the same thing. I am not arguing against IV. I am arguing against using 2SLS. There are better estimators.

Wow, this whole topic has really brought out a lot of anti-statistics/statistician sentiments!

Actually blip I’ve already had discussions with quite a few policymakers who want to use what I’ve been doing in Stan for making decisions in their organisations. I’ve had meetings with folks from USAID, the Nudge Unit and Givewell and they’ve all wanted to push forwards on this and collaborate in future.

Also, 2SLS like all IV relies on completely untenable assumptions in almost all practical applications, and I teach this issue when I teach econometrics classes. Maybe dial back the confidence a bit here.

I do think the claim of being an island of his own – qualitatively different from (almost?) all before and even now – gets in the way of others getting value from his work and maybe even pointing out ways he might become less wrong.

The only way in which I care is that it is almost always the case that earlier related ideas have informed me in useful ways and that is being frustrated…

Based on past exchanges here, this should be good. I too am getting the popcorn ready, which I have plenty of time to do as I am furloughed.

I tend to look past hype and see how ideas can be useful to my research. Richard McElreath in his course this year (that he records and posts) is making good use of DAGs in helping to figure out how to use and interpret analyses of interest. There may be other ways of doing this, and they may precede some of what Pearl and others have developed, but his examples show how useful they can be.

I will also note that this has already hit Pearl’s twitter feed, see https://twitter.com/yudapearl/status/1082670574541787136. And in defense of one thing Pearl mentions (how much worse global warming can make a heat wave) that is actually a reference to a paper “CAUSAL COUNTERFACTUAL THEORY FOR THE ATTRIBUTION OF WEATHER AND CLIMATE-RELATED EVENTS” (DOI:10.1175/BAMS-D-14-00034.1 – which can also be downloaded at https://ftp.cs.ucla.edu/pub/stat_ser/r451-reprint.pdf). I think there may have been a second more recent paper but I haven’t been able to dig it up.

I found it kind of interesting that he completely sidestepped any discussion of model selection or comparison, tacitly assuming that practitioners or experts already have some causal model completely formed in their head when they begin their analysis. Maybe it was meant to be more of a survey or narrative discussion and model selection was considered to be too advanced, but I felt like the book was weaker without *any* discussion of how his framework can account for this important component of statistical modelling. Without the ability to compare causal DAGs it seems like all the fancy do-calculus in the world is just going to be spinning its wheels.

My impression is that causal model selection (or “causal discovery”) is still relatively new and undeveloped, and that most of the interesting work being done in this area is coming out of CMU.

Maxwell:

Just to be a grump here, I’m not a big fan of causal discovery, at least not for the sorts of problems I’ve ever worked on, for reasons discussed in this article (or, for similar statistical points in a noncausal framework, see here). In short: In the problems I’ve seen, everything is connected to everything else, so there’s no underlying sparse structure to discover.

Do you avoid making causal claims in your work?

If you do make causal claims, isnʼt there some procedure by which you select between potential causal models, even if that procedure is informal and unspecified?

In our causal inference we set up models based on our scientific understanding; for one example, see here and here.

There is some underlying mental process by which you are interpreting the relevant science to select a causal model.

In selecting a model, isn’t your brain engaging in a poorly understood form of causal discovery?

Wouldn’t it be useful to understand that mental process, for the purpose of communicating its results, and replicating it?

Maxwell:

I don’t think that in our research we are doing “causal discovery” in the sense of learning true conditional independences in nature; see the discussion on pages 960-962 of this article. I agree that it could be beneficial to improve and formalize the process that we use to construct statistical models for causal inference; I’m just not convinced that it makes sense or is useful to do so using a causal discovery framework that is centered around the estimation of patterns of conditional independence.

I am about a third of the way into the book and am enjoying it (and learning from it). I appreciate this genuinely ambivalent review.

I had been echoing several of Judea’s viewpoints back in 2005-2009, typically in the foreign policy community, resulting in my having to press them even when others objected to my raising them. If you get causality wrong in some of these missions on which some history/narrative is assumed, you are up for costs and consequences for which you are not prepared. So I welcome any effort to elaborate. On the other hand, simply accepting one theoretical explanation is not enough.

Andrew,

Thanks for pointing out that causation is not new. I think that the law engages with it ubiquitously. It’s that the politicization of so much knowledge has led to superficial assessments of many issues. The drive for fast content trumps the need for accuracy and assessment of causality for a whole variety of reasons.

“[T]ypical presentations of statistics, econometrics, etc., can focus way too strongly on the quantitative without thinking at all seriously about the qualitative aspects of the problem. It’s usually all about how to get the answer given the assumptions, and not enough about where the assumptions come from” – Exactly. I hesitate (ok, only slightly) to plug my own work, but I just published an article which argues, in part, that a widely cited article which purports to show that the RTLM “hate radio” broadcasts increased participation in the Rwanda genocide by ten percent is questionable, because there is no way to know whether the key independent variable (radio reception) is a valid proxy for the actual variable of interest (radio consumption) unless you have a decent qualitative understanding of radio listening habits in 1990s Rwanda.

(As an added bonus, the author relies on p-values, etc., to make his argument, but an alternative analysis demonstrates that including radio reception in his model does not improve the model’s ability to predict genocide participation rates). The article is here: https://www.tandfonline.com/doi/full/10.1080/13698249.2018.1525677

I’ll surely follow the discussion that will take place here. Lots to learn.

Thanks, A, I was about to order it to try to make sense of the disparaging comments on statisticians’ “bad habits”, but will abstain!

One point of clarification on Pearl’s typical word usage. When he writes “social science,” he tends to mean “sociology” (and maybe “political science” and “psychology”). I am pretty sure that he does not mean to imply economics in that claim, even though most social scientists consider economics to be a social science.

Regardless, you are basically correct. Morgan would not claim any such ushering, not even narrowly for sociology. My PhD was earned in 2000, and I did not begin to read the causality literature in a deep way until about 1997. It took me until about 2004 before I felt ready enough to start writing the first edition of Morgan and Winship. By then, decades of sociologists had already been writing actively on causation, and many, many, many sociologists were trying to estimate causal effects over the same time period. There is a steady stream of methodological literature from Blalock and Duncan, through Berk, Bollen, Clogg, Hauser, Heise, Sobel, and others, lots of which Pearl cites and knows. So, it is an odd statement of his to make. I think what Pearl means is that we have given more attention to his work than most other sociologists, and a good amount of people get some of their first introduction to Pearl’s ideas from my book with Chris. But, part of that is because he did not really unite his ideas until 2000 in the first edition of his book. So, if he had written something like, “Winship, Morgan, and Elwert were the first sociologists to give my work sustained attention,” that would probably be justified. And, then if you think “causation” is the same thing as “my work,” you get his sort of claim. If he had written his book in 1968, I am sure that Duncan would have engaged it, and the methodological stream would have picked it up sooner.

By the way, Chris, Felix, and I disagree a lot on this stuff. If one reads our work and compares across pieces, you’ll see the differences. I stand by the claims in Morgan and Winship, as that is most strongly shaped by my views.

Steve,

I appreciate the clarification. By saying “ushering social science to the age of causation” I meant “ushering sociology to the age of modern causation”. Throughout my book (BOW) I am distinguishing between the old approaches to causation (which include Blalock and Duncan) and the modern approach, which I also called the “causal revolution”. The latter is characterized by three ingredients: (1) structural representation of scientific knowledge, (2) counterfactual logic, and (3) the use of graphs for identification and testability. So, given this understanding, you do deserve the “ushering” credit; Blalock and Duncan would not really qualify. Please note, the modern approach is not exclusively “my work”, and I have given other players due credit. More importantly, the statistics works that Andrew refers to would not qualify for this category; it is certainly missing ingredients (1) and (3).

> I don’t actually think the “do operator” makes sense as a general construct

I think you’ve mentioned that you’ve used Rubin’s potential outcome notation. do-calculus proponents argue that they’re equivalent, in the sense that any do() expression can be represented in Rubin’s notation, e.g. $P(Y_x = y)$ is equivalent to $P(Y=y | do(X=x))$.

Sidebar: Somewhat confusingly, $P(Y=y | do(X=x), Z=z)$ is defined as $P(Y=y, Z=z | do(X=x)) / P(Z=z | do(X=x))$, NOT $P(Y_x = y | Z=z)$, which is a counterfactual query that doesn’t have a direct representation in the do() notation. In this sense, the do notation is weaker than the potential outcomes notation, but there’s a complete algorithm by Shpitser that converts counterfactuals into do() notation, with respect to a given causal diagram.

(Which, as far as I can tell, no one has ever implemented in working code. So the complaint that do() isn’t the right tool for most working statisticians has some merit.)

There does appear to be a “philosophical” difference between researchers who favor the potential outcomes approach and the do-calculus approach. From what I’ve seen, potential outcome proponents tend to start with making conditional independence assumptions about potential outcome variables, e.g. the conditional ignorability assumption $Y_x \perp X | Z$, which are then used to justify estimating causal effects.

The do-calculus proponents tend to view such conditions as being logical consequences of a given model, e.g. with respect to [some causal diagram], it follows that $Y_x \perp X | Z$.
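For readers who want the intermediate step written out, here is how the ignorability condition is typically used. This is a sketch, and it relies on assumptions not stated above: that $Z$ is discrete, that consistency holds ($Y = Y_x$ whenever $X = x$), and that every $(x, z)$ combination has positive probability.

$$
\begin{aligned}
P(Y_x = y) &= \sum_z P(Y_x = y \mid Z = z)\, P(Z = z) && \text{(law of total probability)} \\
&= \sum_z P(Y_x = y \mid X = x, Z = z)\, P(Z = z) && \text{(since } Y_x \perp X \mid Z\text{)} \\
&= \sum_z P(Y = y \mid X = x, Z = z)\, P(Z = z) && \text{(consistency)}
\end{aligned}
$$

The last line involves only observable quantities. It is the same adjustment formula that the graphical approach gives for $P(Y = y \mid do(X = x))$ when $Z$ satisfies the back-door criterion in the assumed diagram.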

Full disclosure: I consider myself pretty firmly in the “do calculus camp”, mainly because I find it far easier to interpret causal diagrams than conditions like conditional ignorability; if you asked me to enumerate conditional independence assumptions, given a causal diagram, I could do so fairly easily – if you asked me to find a causal diagram that respected a particular set of conditional independence assumptions, I would find this very difficult.

But it seems to me that if you’re okay with using one of {potential outcomes, do calculus}, you should consider the other to be a valid construct.

Joshua:

When I say that I don’t think the do operator makes sense as a general construct, what I mean is that I think it makes sense for some variables but not in general. In a medical study, where x=1 if you take the drug and x=0 if you take the placebo, then do(x) or set x=1 or set x=0 is clear. In an observational study, where x is an observed variable, then do(x) is not so clear to me. Rubin would sometimes give the example of an education study in which x is the number of hours a student studies for an exam. In that case, do(x) doesn’t tell us that much, because much can depend on how x is set. Another example would be a health study in which x is a person’s weight. There are different ways of lowering weight, and some of these ways are healthier than others. So do(x) is clear to me for “treatments” or “instruments” but not in general for observed variables x.

Hi Andrew,

I think the do operator can handle and possibly even help to clarify this type of nuance. Let’s say that we think diet and exercise also affect health (perhaps measured in terms of calories consumed and minutes spent jogging per day). Then we might write down a DAG where diet and exercise are parents of both weight and health, with diet, exercise, and weight all being parents of health. We could differentiate between interventions lowering weight in different ways in terms of how the do operator intervenes on diet and exercise. If we think mental stress or another variable is also a parent of both weight and health then we would also add those variables to the DAG.

Dionissi, Andrew and Joshua,

FYI,

Whether the do-operator makes sense, and whether Rubin’s manipulability restriction is reasonable, are both discussed in this paper.

Pearl, “Does Obesity Shorten Life? Or is it the Soda? On Non-manipulable Causes,”

https://ftp.cs.ucla.edu/pub/stat_ser/r483-reprint.pdf

https://ucla.in/2EpxcNU

Journal of Causal Inference, 6(2), online, September 2018.

I both strongly agree and strongly disagree with (what I think is) your stance here.

As to the claim that the do() operator doesn’t always tell us what we want: absolutely! If we’re simply observing some evidence, then we absolutely want to use ‘ordinary’ conditioning. I think this ties into your claim that Bayesians should always want to update on all the available evidence.

Related: if we’re trying to express a counterfactual query, e.g. effect of treatment on the treated, then the do() operator isn’t what we want because it isn’t expressive enough! Analyzing ETT requires potential outcome notation to be expressed, i.e. $P(Y_{x'} \mid x)$, where $x$ and $x'$ are two different realizations of $X$.

However, unless I’ve misinterpreted your opinion, I think your suggestion that conditioning is a viable replacement for do() is simply incorrect. There are a few scenarios that I think are easily conflated:

If we are observing X, where X is a person’s weight, conditioning is correct.

If we want to talk about an _idealized_ intervention, where someone’s weight is changed and _nothing else_ is changed, do() is correct. P(y | x) may, coincidentally, be equal to P(y | do(x)), for certain models, but these represent entirely different queries.

If we want to talk about a _realistic_ intervention, where someone’s weight is changed, by a variety of different means, with the exact mechanisms varying from individual to individual, then we are looking at a complex expression involving many do()s and probability distributions over those interventions. If a doctor tells a patient “lower your weight”, perhaps there’s a 50% chance that the patient will eat less than normal and a 50% chance that they will exercise more, so our query would be something along the lines of 0.5 * P(y | do(calorie_intake = 1800)) + 0.5 * P(y | do(workouts_per_week = 4)). Of course, this is still far too simple; a realistic model would have to take into account many more factors.

Similar to the idealized intervention case, estimating the effect of a realistic intervention to change a patient’s weight may be, coincidentally, very close to conditioning on a patient having a different weight, but it’s a different type of query entirely.
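The three queries above can be told apart in a small simulation. The following is a minimal sketch, assuming a made-up linear model in which diet and exercise drive both weight and health. Every coefficient, the variable names, and the 50/50 mixture are assumptions chosen for illustration, not estimates from any study.

```python
import random

def simulate(n, seed, do_weight=None, do_diet=None, do_exercise=None):
    """Draw (weight, health) pairs from a hypothetical structural model.

    A do_* argument replaces that variable's own equation with a constant,
    which is what an intervention means in this framework.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        diet = do_diet if do_diet is not None else rng.gauss(0, 1)
        exercise = do_exercise if do_exercise is not None else rng.gauss(0, 1)
        if do_weight is not None:
            weight = do_weight
        else:
            weight = diet - exercise + rng.gauss(0, 1)
        # Assumed coefficients: weight, diet, and exercise all affect health.
        health = -0.5 * weight - 0.5 * diet + 1.5 * exercise + rng.gauss(0, 1)
        rows.append((weight, health))
    return rows

def mean_health(rows):
    return sum(h for _, h in rows) / len(rows)

def predict_health_at(rows, w):
    """Least-squares estimate of E[health | weight = w] from observed rows."""
    n = len(rows)
    mean_w = sum(rw for rw, _ in rows) / n
    mean_h = sum(rh for _, rh in rows) / n
    cov = sum((rw - mean_w) * (rh - mean_h) for rw, rh in rows) / n
    var = sum((rw - mean_w) ** 2 for rw, _ in rows) / n
    return mean_h + (cov / var) * (w - mean_w)

N = 100_000

# 1. Observing weight = -1: ordinary conditioning on passively collected data.
observational = predict_health_at(simulate(N, seed=0), -1.0)

# 2. Idealized intervention: set weight to -1 and change nothing else.
idealized = mean_health(simulate(N, seed=1, do_weight=-1.0))

# 3. Realistic intervention: half of patients lower weight by one unit through
#    diet, half through exercise (a mixture over two different do() queries).
via_diet = mean_health(simulate(N, seed=2, do_diet=-1.0))
via_exercise = mean_health(simulate(N, seed=3, do_exercise=1.0))
realistic = 0.5 * via_diet + 0.5 * via_exercise

print(f"E[health | weight=-1]      ~ {observational:.2f}")
print(f"E[health | do(weight=-1)]  ~ {idealized:.2f}")
print(f"50/50 mix of diet/exercise ~ {realistic:.2f}")
```

Under these assumed coefficients the three numbers come out near 1.17, 0.5, and 1.5, so the three queries give different answers even though each involves the same one-unit change in weight. With other coefficients two of them could coincide, which is the "coincidentally equal" case mentioned above.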

Joshua:

I don’t think that “conditioning is a viable replacement for do().” I just don’t think that “do()” is always defined.

This all becomes clearer for the Bayesian if you remember what Bayesian probability is doing: expressing a state of information.

p(Y | X) is a statement about the plausibility that Y would be measured if you know nothing other than that X has been measured.

p(Y | I gave someone some training in how to improve X and their previous X was x and their post training X was x2)

is a very different set of information than

p(Y | X=x2)

Somewhat off topic, but see also: ‘inverse problems as statistics’

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.6364&rep=rep1&type=pdf

bottom of pg 6. Distinguishing parameters from random variables also makes clear the difference between the physical theory (represented by parameters) and the observable consequences (represented by random variables).

In frequentist stats the consequences Y of a theory theta are represented by p(Y;theta) which is not the same as p(Y|theta), as discussed to no end around here.

Is p(Y;theta) the frequentist statistician’s analogue of p(Y|do(X))?

It seems to me that if you’re okay with the idea of ceteris paribus, you should be okay with the idea of do() for any variable, even if it’s modeling an intervention that you could never actually perform in practice. There doesn’t exist any way to just change someone’s temperature, for example. You can give them ibuprofen, you can put ice on their forehead, you can have them eat some soup, but I think the idea of saying, “What would happen if we somehow had a way of directly changing their temperature?” still makes sense as a concept. A more extreme visual: there’s no way to suddenly reduce the mass of the sun to zero, but this doesn’t change our expectation that if, somehow, it were, then the earth would stop orbiting the sun. This is entirely untestable, but I think almost all physicists would admit it as a valid question, and they’d all have the same answer.

There’s a subtlety in cyclic models: if we think of X as causing Y and Y as causing X, then (I would argue) this is ‘shorthand’ for X_t causing Y_{t+1} and vice versa. The notion of the ‘value’ of X and Y becomes under-defined: there may be a stable equilibrium prior to intervention, but post-intervention, there could be several possible equilibria, or perhaps no equilibria (there’s a proof that this is uncomputable in the general case, under some mild assumptions about the types of random variables permitted). But I don’t think this is a case of do() not making sense, I think it’s a case of the model itself being underspecified. do() may be irrelevant, but not invalid as a construct.

(I do hope that I’m not becoming annoying; I just think this is an interesting topic and I’m just curious where, exactly, we end up disagreeing.)

Joshua:

If by “if you’re okay with the idea of ceteris paribus,” you mean, “if I’m ok with routinely interpreting regression coefficients as causal effects,” my answer is No, I’m not! Jennifer and I talk about this a lot in our book, and it’s also discussed in other applied statistics books with a causal inference focus, such as in the book by Angrist and Pischke.

I have blown hot and cold as regards this book. I think that there is much to admire but I think, for example, the discussion of Lord’s paradox is glib (see my take here https://errorstatistics.com/2018/11/11/stephen-senn-rothamsted-statistics-meets-lords-paradox-guest-post/ ). In my frustrating exchanges with Pearl I have come to the conclusion that the theory cannot really deal with uncertainty appropriately. This, however, does not change my initial opinion that there is much of value for statisticians to learn; it is rather that I think there is also quite a lot in statistics that it might be valuable for Pearl and Mackenzie to learn: in particular that ‘random’ variation can be multilevel and generally complex.

Stephen:

Yes, I agree. It’s that division-of-labor thing.

Andrew,

The hardest thing for people to snap out of is the bubble of their own language. You say:

“I find it baffling that Pearl and his colleagues keep taking statistical problems and, to my mind, complicating them by wrapping them in a causal structure (see, for example, here).”

No way! and again: No way! There is no way to answer causal questions without snapping out of statistical vocabulary. I have tried to demonstrate it to you in the past several years, but was not able to get you to solve ONE toy problem from beginning to end.

This will remain a perennial stumbling block until one of your readers tries honestly to solve ONE toy problem from beginning to end. No links to books or articles, no naming of fancy statistical techniques, no global economics problems, just a simple causal question whose answer we know in advance. (e.g. take Simpson’s paradox: Which data should be consulted? The aggregated or the disaggregated?) Even this group of 73 Editors found it impossible, and have issued the following guidelines for reporting observational studies:

https://www.atsjournals.org/doi/pdf/10.1513/AnnalsATS.201808-564PS

To readers of your blog: Please try it. The late Dennis Lindley was the only statistician I met who had the courage to admit: “We need to enrich our language with a do-operator”. Try it, and you will see why he came to this conclusion, and perhaps you will also see why Andrew is unable to follow him.

JP

Judea:

I appreciate that you are commenting here, but for Christ’s sake. The fact that I don’t find your presentations convincing has nothing to do with “courage” or being “unable to follow” someone. I wish you could just present your methods without the insults. We disagree. That’s fine. Science is full of disagreements, and there’s lots of room for progress using different methods. (See here, for example, or footnote 1 here.) Disagreement on methods does not imply lack of courage.

For reasons discussed in the above post and earlier posts, I’m not convinced by statements such as “Instead of seeing the difference between populations as a threat to the ‘external validity’ of a study, we now have a methodology for establishing validity in situations that would have appeared hopeless before” and others quoted above that dismiss tons of successful work done by statisticians and others. I respect that many people find your ideas useful and that they can have an important role to play, but I don’t find them so useful for the problems that I’ve worked on, which include problems of causal inference.

That’s fine. Not every method has to be useful for every person. I myself have worked on lots of methods that are useful to me and others, but which various other people have not found useful. I understand that. For you, “There is no way to answer causal questions without snapping out of statistical vocabulary.” Fine. For me and many others, one can indeed answer causal questions within statistical vocabulary. Indeed, Jennifer and I have 3 chapters in our book on causal inference. But, in any case, by saying that I have not found your ideas clear or useful, this does not mean I’m saying your ideas will not be clear or useful to others. There are many ways to skin a cat.

Also, and separate from the “courage” point, I understand that many people will want to work within your framework with the do operator. That’s fine. Within that framework, I recommend that people use multilevel modeling to do partial pooling. Conversely, I have no problem with users of multilevel modeling also working within your causal framework. This is the division of labor discussed in my post above. Casual identification and statistical inference are related and are both relevant for solving real-world problems of causal inference.

Typo at beginning of last sentence.

Casual identification is pretty common. Even when I have to go get my REAL-ID when I renew my driver’s license in a week, they’re happy to admit that any person who arrives in the office carrying a particular embossed piece of paper known as a birth certificate, a W2 from their employer, and a print-out of an electricity bill off the internet *must be* the person named on the certificate, obviously!

psshh I wish we had a required 2 semester course in formal logic for anyone wishing to get a government job of any kind, even the janitors at the national park restrooms or whatever. Think how much better off we’d be if every person in congress had to be able to actually describe and answer impromptu questions about the logic behind the Diffie-Hellman key agreement protocol.

JP: There is no way to answer causal questions without snapping out of statistical vocabulary.

AG: We disagree. That’s fine. Science is full of disagreements, and there’s lots of room for progress using different methods.

I’m only an amateur, but from the outside it sure doesn’t feel “fine” for the two of you to disagree on what seems like such a fundamental issue. Instead, this seems like a case where two extremely smart individuals should be able to reach a common understanding instead of accepting disagreement as a final outcome.

JP: I have tried to demonstrate it to you in the past several years, but was not able to get you to solve ONE toy problem from beginning to end.

AG: For me and many others, one can indeed answer causal questions within statistical vocabulary.

Pearl obviously disagrees that standard statistical vocabulary is sufficient to answer all simple causal questions. You seem to think he’s wrong. I think you’d be doing a great service to encourage him to formulate such a “toy” question that he thinks is unanswerable without resorting to the do-calculus, which you then try to answer to the audience’s satisfaction using more standard techniques. Maybe the two of you turn out to be in agreement but using different terminology, maybe you are right that his tools are optional, or maybe he’s right that they are essential. Any of these outcomes would feel much more satisfactory and productive than agreement to disagree. Please consider offering him a platform with which to make his case.

Andrew is fine to use potential outcomes, differential equations etc, so it depends on what you count as ‘statistical vocabulary’…

Yeah, disagreement is fine, not being able to talk about the disagreement is quite a shocking revelation for me.

I guess this is where Philosophy of Statistics should walk in and build a bridge between the disciplines by using the generalized language…

What Matt says below.

do-calculus might be necessary to solve certain toy problems, and it offers a formal language for looking at causality, but we can think of other examples where scientific consensus on causation (whether by ‘inference to best explanation’ or whatever) was achieved without the do-calculus, e.g. smoking and lung cancer

Nathan:

My view is similar to many other statisticians and econometricians. I view Pearl’s formalism as a way to help people understand certain causal structures that can also be expressed using traditional statistical models. I recognize that many people find this structure to be helpful. I’ve also seen various claims, for example that causal structure can be discovered from data analysis alone. I don’t think those claims make sense, for various reasons including what I say on pages 960-962 of this article.

I’m happy to agree to disagree because I think that’s the only possibility. To say that “maybe Pearl is right that his tools are essential” . . . That just makes no sense to me. We solve causal inference problems all the time without Pearl’s tools. Beyond that, no tools are essential. I’ve done a lot of research on Bayesian data analysis, but I don’t think Bayesian data analysis is essential; I just think it is useful. The utility of any tool depends also on who is using it. I welcome Pearl’s book in part because I know that many people do find his tools compelling.

Regarding toy problems: Pearl and I have had such discussions in the past but they have gone in circles. See for example the discussion I refer to in this comment elsewhere in the thread. I find the whole thing exhausting. Again, though, I recognize that many people find Pearl’s ideas appealing, and I did think there was a lot of interesting stuff in his book. It’s common for someone to have a mix of good and bad ideas—this is not at all unique to Pearl and his collaborators!—and so “agreeing to disagree” often makes sense, I think. It’s not as if there are any realistic alternatives here!

What about solving non-toy problems? There was an amusing exchange between George Davey Smith and Pearl on Twitter. Smith goaded Pearl to give even a single example of his approach helping solve a concrete, real-life problem. Pearl was unable to provide a single example. If what Pearl does is as important and revolutionary as he claims, it’s certainly taking its sweet time to deliver the magic.

Rubin and statisticians and econometricians over-complicate causal inference by trying to make it quasi-mathematical. In that way, they ruin it. Pearl does something similar, though his approach is way more useful for working scientists. The most useful approach to accomplishing causal inference continues to be Campbell’s. Causal inference is all about ruling out alternative explanations. Design is always better for ruling out alternatives. Statistical methods are okay sometimes under some conditions. That’s what Campbell taught me; that’s all I need to know. The optimal way to think about alternative explanations is what I would call conceptual (what you call ‘qualitative’ here). Pearl is on the right track in promoting a more conceptual approach, cutting through all the esoteric statistical assumption mumbo jumbo. I’ll take a DAG over pages and pages of impenetrable statistical formulae any day.

Sentinel:

I’ll take just about anything over pages and pages of impenetrable statistical formulae any day. I don’t think the formulas in my book with Jennifer are impenetrable, but of course they’re penetrable to us, as we wrote them! In any case, I like to keep things simple too, but I recognize that sometimes a bit of math is necessary for hard problems. Regarding causal inference, you can take this one up with Michael Sobel and Paul Rosenbaum. Finally, regarding your statement that design is better than analysis: I agree too, as does Rubin, who once wrote an article, “For objective causal inference, design trumps analysis.” Design is part of statistics too!

Fair enough. The post and comments make it really clear (to me, at least) that the real issue here is about style and language. No one perspective has the truth, though I suspect that each perspective on causal inference ultimately gets to the same basic ‘truths’. It’s too bad we have such a hard time communicating across disciplines about this topic. It seems like what we really need is a common language of causation so we can all learn from each other.

> really need is a common language of causation

I strongly disagree, we need multiple languages with good translations between them.

No one has ever discerned the best representation for all time, all problems and all problem solvers.

The desire for a common language is simply tribal and I think a lot of what is going on here.

Languages can adapt – https://en.wikipedia.org/wiki/Linguistic_relativity

And a bit overboard – linguistic genocide as a central aspect of cultural genocide was discussed along with physical genocide as a serious crime against humanity – OK maybe way overboard.

I am aligned with Keith’s ‘we need multiple languages, with good translations between them’. Although ‘standardization’ of terms can facilitate good translations.

It would be jolly nice to have a Statistics to Statistics or Statistics to Machine Learning dictionary! Like

Variational Bayes === Importance sampling, but instead of sampling you look for an optimal importance function.

Yup. Since I am not a statistician, I pay strict attention to both the purported definition and the context in which it is made. I also pay attention to the sociology of expertise in any given discussion. It is becoming increasingly difficult to rely on any one expert exclusively.

Good points, Keith.

Sort of tangential to this discussion, are there out-in-the-wild examples of DAGs with arrows going in both directions (feedback, say) or with time lags? For example, the atmosphere, particularly pressure and wind, may be great confounders on things going on in the ocean, but there is also evidence that the ocean, particularly temperature (heat), affects the atmosphere. I have seen statements that DAGs can indeed handle such problems, but I have never actually seen a DAG where this is the case. Andrew sent me something where Jim Savage mentions such animals, but I haven’t seen one.

Also, the Book of Why as well as Causality emphasize the questions that can be asked, or if you want the questions that need to be asked to verify if the stated model structure is true. But often trying to estimate those relationships is extremely difficult, such as if you have large-scale multi-variate spatial-temporal data with many things going on at a variety of spatial and temporal scales. The estimation part often seems like it is trivial once the causal model has been analyzed, while in reality it is anything but, even if I know the proper causal questions.

Note that DAGs cannot have loops by definition (the A stands for “acyclic”).

Are you aware of Dynamic Bayesian Networks? https://en.wikipedia.org/wiki/Dynamic_Bayesian_network

Sometimes something “feels like” a loop, when in fact it’s an unrolled loop. This is the case for differential equations for example: pressure affects temperature, and temperature affects heat flow, and heat flow affects pressure….

But time goes *forward*: it’s the pressure now at t=0 that affects the temperature instantaneously *after* now at t=dt, and then temperature at t=dt affects heat flow at t=2dt, and then heat flow at 2dt affects pressure at 3dt … or something like that. Causality doesn’t go backwards in time. But the names “pressure” and “temperature” stay constant; it’s just the time index that changes, and hence it can feel like a loop when in fact it’s a chain.
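The "unrolled loop" idea can be checked mechanically. Below is a minimal sketch, assuming the three-variable feedback loop named above and a hypothetical `unroll` helper that indexes each variable by time step. It uses the standard-library `graphlib` module (Python 3.9+) only to test whether a graph has a cycle.

```python
from graphlib import TopologicalSorter, CycleError

# The feedback loop as it is usually drawn: a cycle among three named variables.
CYCLIC_EDGES = [
    ("pressure", "temperature"),
    ("temperature", "heat_flow"),
    ("heat_flow", "pressure"),
]

def is_acyclic(edges):
    """Return True if the directed graph given by (parent, child) edges has no cycle."""
    predecessors = {}
    for parent, child in edges:
        predecessors.setdefault(child, set()).add(parent)
        predecessors.setdefault(parent, set())
    try:
        # static_order() raises CycleError when no topological order exists.
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False

def unroll(edges, steps):
    """Replace each edge a -> b by (a, t) -> (b, t + 1) for t = 0 .. steps - 1."""
    return [((parent, t), (child, t + 1))
            for t in range(steps)
            for parent, child in edges]

unrolled = unroll(CYCLIC_EDGES, steps=3)

print("loop as drawn is acyclic:", is_acyclic(CYCLIC_EDGES))  # False
print("unrolled in time is acyclic:", is_acyclic(unrolled))   # True
```

Every unrolled edge points from time t to time t+1, so no path can return to its starting node. The cost is that the graph grows with the number of time steps, which is the practical difficulty raised elsewhere in this thread.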

Hi Daniel:

Agree 100%, and in some of the links through Jim Savage the same point is made. My question is can anyone point me to where someone has actually done this.

Hi Carlos:

Thanks. Yes I am aware of Dynamic Bayesian Networks. I am also very aware of the large literature on estimating large-scale spatio-temporal models (and by large I mean like close to a million series, daily over 10 years, where there are effects at different time and spatial scales). Ask Dan Simpson, who contributes to this blog often: identifying and estimating spatio-temporal problems properly is extremely non-trivial. And my point, not stated very well, was that even if I can build the DAG and identify all the proper causal questions, and what things I can and cannot control for, the next step is extremely hard to do. So to my mind it is somewhat glib to say that the problem has been solved using the causal techniques. Useful things may have been identified by creating and analyzing the DAG, but as the soccer announcers always say when someone gets open to score, “there is still a lot of work to do”.

Sure, causal analysis helps you to build the right model but you still need to make it work… Understanding and trusting the model can represent a valuable starting point nevertheless (even though in some cases a wrong model will work in practice better than the correct one!).

Similar in spirit to Andrew’s global warming comment…here’s a cartoonish but interesting example I’ve also used in my own teaching:

http://sprott.physics.wisc.edu/pubs/paper277.pdf

Now, are the models considered there causal models or statistical models?

Andrew is obviously familiar with combining differential equations and data (e.g. in toxicology), so I assume he’s fine to call them causal.

What would Pearl say?

If he agrees they are causal then how do we represent them in his language? Can we analyse oscillations in this framework (e.g. the never ending love-hate cycle)?

Can we discover bifurcations in nonlinear systems? How would you mathematically express the sudden buckling of a beam in the language of DAGs? Or the bistable dynamics of insect outbreaks?

In the language of computer science, isn’t

dy/dx = f(x,y)

just a kind of name for a class of “loop” constructs

y(x+dx) = y(x) + f(x,y(x)) dx

for all infinitesimal dx?

The DAG is an enormous and not that interesting “chain”; the mathematical expression is really just a shorthand for a vast family of iterative numerical solution computer programs.
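The "loop construct" reading of dy/dx = f(x,y) corresponds to the forward Euler method with a small but finite dx. Here is a minimal sketch, with the test equation dy/dx = y and the step sizes chosen purely for illustration:

```python
import math

def euler(f, x0, y0, dx, n_steps):
    """Iterate y(x + dx) = y(x) + f(x, y(x)) * dx for n_steps steps."""
    x, y = x0, y0
    for _ in range(n_steps):
        y = y + f(x, y) * dx  # each new y depends only on the previous (x, y)
        x = x + dx
    return y

def growth(x, y):
    """Right-hand side of dy/dx = y, whose exact solution from y(0)=1 is exp(x)."""
    return y

coarse = euler(growth, x0=0.0, y0=1.0, dx=0.1, n_steps=10)
fine = euler(growth, x0=0.0, y0=1.0, dx=0.001, n_steps=1000)

print("dx=0.1  :", coarse)
print("dx=0.001:", fine)
print("exact   :", math.e)
```

Each pass through the loop is one link in the chain: the value at x+dx is computed from the value at x and nothing later. Shrinking dx lengthens the chain and brings the result closer to the exact solution.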

So it’s not that you can’t use DAGs to describe models like Sprott’s (which is fabulous btw, thanks for that), it’s that applied mathematicians have *been using DAGs* to deal with models of dynamical systems for literally hundreds of years. Drawing the DAG doesn’t help much IMHO. What’s really needed is to *think mechanistically in the first place*. I think of this DAG stuff as a lot like formal theorem proving computer programs. It might be useful to run the specification of a computer architecture through it to validate that the microcode doesn’t create division bugs like in the Pentium, but short of that kind of formal checking of highly static formal devices, theorem proving has done basically nothing for mathematics. The theory and the practice are potentially two different things.

If someone can show me a group of people who attack a real world problem like say the public health problem of opioid overdoses in the US or whatnot, and they gain some major insight that others didn’t have, and they succeed in making some major accomplishment, and the reason they succeed is because suddenly the formal machinery of DAGs and DO calculus made clear what no one had really figured out before… then I’ll buy that it’s a very worthwhile thing.

In the meantime, I’ll continue thinking about differential equations like the one I wrote out here on the post about living to 122 yrs: http://statmodeling.stat.columbia.edu/2019/01/07/did-she-really-live-122-years/#comment-943089

Yes and no, I think – all the main theorems I’m aware of assume a finite number of discrete variables, unique solutions etc etc, I think. But yes you could think of a mechanistic dynamical system as a DAG – after all, that’s what a state -> state mapping is!

But, more importantly, I’ve hardly seen* any proper analyses of what happens to the DAG framework when more interesting structure like that in common scientific models is included.

Main point: to me this would temper the grander claims of a ‘causal revolution’ etc etc.

*There are some largely negative analyses of the standard framework, like:

https://arxiv.org/abs/1805.06539

where they state:

“we are left wondering whether the standard starting point in causal discovery—that the data-generating process can be accurately modeled with an SCM—is tenable in the context of biochemical systems, considering that even very simple biochemical systems (a single enzyme reaction) already violate this assumption.”

For anyone like me, who was trained on differential equations and physics before linear regression, I think Pearl’s stuff is initially pretty confusing (and still not without some issues imo).

Other than the Book of Why – which I do think is probably the best intro to causal DAGs I’ve read minus the other stuff – I found this other paper by the same group (same group as a previous paper I linked, not Pearl et al) really helpful:

https://arxiv.org/pdf/1304.7920.pdf

They show how you can think of DAGs as describing the equilibrium states of ODEs, under certain somewhat restrictive conditions.

The original ODE contains causal ordering info that is lost in just considering the equilibria, so they introduce the idea of ‘labelled’ equilibrium equations, which is very similar to the idea of nullclines in ODE theory – you know which variable had its derivative set to zero to get that particular equation.
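To make the equilibrium reading concrete, here is a minimal sketch using a toy two-variable system chosen for illustration (it is not taken from the linked paper): dx/dt = a - x and dy/dt = b*x - y. Setting each derivative to zero gives the labelled equilibrium equations x = a (from dx/dt) and y = b*x (from dy/dt), which read like structural equations for the graph x -> y. The function name, the constants, and the way an intervention clamps x are all assumptions.

```python
def equilibrium(a, b, do_x=None, dt=0.01, steps=5000):
    """Integrate dx/dt = a - x, dy/dt = b*x - y to (near) equilibrium.

    If do_x is given, x is clamped to that value and its own equation is
    dropped, mimicking an intervention on x.
    """
    x = 0.0 if do_x is None else do_x
    y = 0.0
    for _ in range(steps):
        dx = 0.0 if do_x is not None else (a - x)
        dy = b * x - y
        x, y = x + dt * dx, y + dt * dy
    return x, y

# Undisturbed system: settles at x = a, y = b * a.
x_free, y_free = equilibrium(a=2.0, b=3.0)

# Intervening on x: y follows its own equation and settles at b * do_x.
x_do, y_do = equilibrium(a=2.0, b=3.0, do_x=1.0)

print("free equilibrium:", round(x_free, 3), round(y_free, 3))
print("under do(x=1):   ", round(x_do, 3), round(y_do, 3))
```

In this toy case the equilibrium under the clamp matches what the structural equation y = b*x predicts for do(x=1). The example is stable and has a unique equilibrium, which is the kind of restrictive condition the comment refers to; systems with oscillations or several equilibria would not reduce this cleanly.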

In the other paper I linked they show how to extend these ideas to non-recursive causal systems, which is much more appropriate for real world systems with feedbacks like an enzyme reaction.

Alternatively you could try to unfold the graph in time but then you’d have to bite the bullet on analysing general dynamical systems, including oscillations or even chaotic behaviour, and that’s one I haven’t seen the DAG folk come close to doing (yet?)

I think of graphical analyses like DAGs as useful heuristic tools for organizing ideas and thinking about relationships between variables in a system. Interestingly, *qualitative* mathematical analysis of DAGs has also been around at least since Richard Levins developed Loop Analysis from signed digraphs. Anyway, I haven’t read much of Pearl’s thinking, but what I have seen in discussions here and summaries elsewhere does not make me think that I am missing out on a revolutionary new approach to science.

Having said the above, I do think it’s worthwhile reading the book and learning the methods – just ignore the insults and grandiose claims!

And I am a little surprised that Andrew doesn’t seem to give Pearl’s methods a proper go – I think that would make for an even stronger/clearer discussion of when and when not to DAG. But I suppose there is no obligation to use something that doesn’t ‘feel right’ to you…

> If someone can show me a group of people who attack a real world problem like say the public health problem of opioid overdoses in the US or whatnot, and they gain some major insight that others didn’t have, and they succeed in making some major accomplishment, and the reason they succeed is because suddenly the formal machinery of DAGs and DO calculus made clear what no one had really figured out before… then I’ll buy that it’s a very worthwhile thing.

Miguel Hernan’s course on DAGs (https://www.edx.org/course/causal-diagrams-draw-assumptions-harvardx-ph559x) has case studies in epidemiology (e.g. the birth-weight paradox) where experts came to different conclusions. Drawing a DAG and applying Judea Pearl’s backdoor criterion made it clear which experts were correct.
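For readers who have not seen the backdoor adjustment written out, here is a minimal sketch on a Simpson’s-paradox-style table. All counts are made up, and the sketch assumes a DAG in which severity is a common cause of both treatment and recovery (severity -> treatment, severity -> recovery, treatment -> recovery). Under that assumed DAG, severity satisfies the backdoor criterion and the adjustment is P(y | do(x)) = sum over z of P(y | x, z) P(z).

```python
# counts[(severity, treatment)] = (number recovered, number of patients)
# Hypothetical numbers, constructed so that the aggregated and stratified
# comparisons point in opposite directions.
counts = {
    ("mild", "treated"): (18, 20),
    ("mild", "untreated"): (64, 80),
    ("severe", "treated"): (40, 80),
    ("severe", "untreated"): (6, 20),
}

def naive_rate(treatment):
    """P(recovered | treatment), pooling over severity (the aggregated table)."""
    recovered = sum(r for (_, x), (r, _) in counts.items() if x == treatment)
    total = sum(n for (_, x), (_, n) in counts.items() if x == treatment)
    return recovered / total

def backdoor_rate(treatment):
    """P(recovered | do(treatment)), adjusting for severity."""
    grand_total = sum(n for _, n in counts.values())
    strata = sorted({z for z, _ in counts})
    rate = 0.0
    for z in strata:
        p_z = sum(n for (zz, _), (_, n) in counts.items() if zz == z) / grand_total
        recovered, n = counts[(z, treatment)]
        rate += p_z * recovered / n
    return rate

naive_effect = naive_rate("treated") - naive_rate("untreated")
adjusted_effect = backdoor_rate("treated") - backdoor_rate("untreated")

print("aggregated difference:", round(naive_effect, 2))
print("backdoor-adjusted difference:", round(adjusted_effect, 2))
```

With these numbers the aggregated comparison makes the treatment look harmful (0.58 vs 0.70) while the adjusted one makes it look helpful (0.70 vs 0.55). Which answer to trust is not settled by the table: if severity were instead a consequence of treatment, adjusting for it would be the wrong move. That dependence on the assumed diagram is what the toy-problem question about aggregated versus disaggregated data is getting at.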

Hopefully Pearl offers an explanation for the Latin square error or flags it in the current errata. Not sure how one mixes the two up. From Lehmann (2011) pg. 69

“To obtain an effective randomization, Fisher considers sets of Latin Squares that can be transformed into each other by various types of transformations. He then proposes to choose at random (i.e., with equal probabilities) one member of such a set.

He next turns to an analysis of variance similar to that used for randomized blocks. He points out that, “The 35 independent comparisons among 36 yields give 35 degrees of freedom. Of these, five are ascribable to differences between rows, and five to differences between columns… . Of the remaining 25 degrees of freedom, 5 represent differences between the treatments tested, and 20 represent components of error which have not been eliminated, but which have been carefully randomized so as to ensure that they shall contribute no more and no less than their share to the errors.””

Definitely an interesting discussion, once you take the heat down a little bit. :-) I noticed in Pearl’s response he said “was not able to get you to solve ONE toy problem from beginning to end”. I am not sure what he is referring to specifically, but I wonder if that would be useful – to see both methodologies on a small (toy) problem from beginning to end. Perhaps there isn’t as much disagreement as there appears at first or perhaps the differences can be more clearly seen. Jaynes liked to do this to highlight differences between Bayesian and Frequentist approaches – calling it the Galilean approach.

Brian:

Some discussion of toy problems came up in earlier threads a few years ago, for example here, here, and here. The conversation was, as usual, frustrating for all sides.

For example, here’s Pearl:

To which I replied:

Pearl:

And, after a few zillion more comments, Pearl wrote:

That was a few years ago; since then, our paper, Bayesian aggregation of average data: An application in drug development, has finally appeared in print.

I give the following excerpts not to present me as the good guy or the reasonable guy, but just to illustrate how this discussion can seem frustrating from many directions. Communication can be difficult! Hence this blog. But hashing things out in discussion doesn’t always work either. Sometimes we just have to do our work and accept that other methods are out there too.

As I wrote in the above post regarding division of labor, I think that Pearl and I are focusing on different problems. That’s ok: there’s lots of work to be done in this world.

I think Deming got it right. There simply is no statistical method for extrapolating outside the conditions that generated the sample. A lot of confusion has resulted from attempting to apply methods appropriate to “enumerative” studies to “analytic” studies:

William Deming. On probability as a basis for action. The American Statistician, Nov 1975, Vol 29, No.4 pg 145-152.

https://pdfs.semanticscholar.org/adec/f8c3cc38faec3e11370561e13d89e7499452.pdf

Hi Andrew, interesting comment about the book. I’m a follower of your posts and I also agree with you about the way that Pearl’s book somehow portrays statisticians as evil and blind. It’d be nice, though, if you shared with the readers here other sources or material about causal inference.

Odra:

There are lots and lots of statistics books that cover causal inference, including books by Rosenbaum, Hernan and Robins, VanderWeele, Angrist and Pischke, Imbens and Rubin, Morgan and Winship, and three chapters of my book with Jennifer Hill. Any of these would be a fine place to start, I think.

In my applied work, I sometimes find the language of DAGs useful for communicating with scientific collaborators regarding which covariates to adjust for in a regression analysis. Additionally, I sometimes find DAGs useful for evaluating alternative study designs. In many problems, however, I can get by just fine without a DAG ever crossing my mind. Even if formal causal modeling is required, the potential outcomes framework is often my preferred approach. My view is that Pearl (and the CMU gang) have enhanced statistical communication by formalizing a different way of expressing causal assumptions that is sometimes (but not always) more convenient and illuminating than the potential outcomes approach. It remains to be seen, however, whether any important statistical problems could not have been solved before, but can now be solved owing to their work in this area.

Hi Andrew,

I read the book. There are some interesting stories. But Pearl’s tone is so obnoxious and self-righteous throughout that it kind of ruins it for me.

However I’m not really sure I understand this debate and why DAGs are so different. For example, a common assumption in causal inference is that response and treatment are independent given other explanatory variables x. One can write this both in a math sense and a graph (x->treatment, x->response, treatment->response). I get that you don’t need the graph to do analysis but it may be easier to explain to others in this format.
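To make the "math sense" and the "graph sense" of that assumption concrete, here is one way the two statements could be written side by side. This is only a sketch in standard notation: the symbols T (treatment), Y(t) (potential outcome under treatment t), and X (the other explanatory variables) are labels chosen here for illustration, and the identities also rely on the usual consistency and positivity assumptions, which the comment above does not spell out.

```latex
% Conditional ignorability (potential-outcomes notation), with T the treatment,
% Y the response, and X the pre-treatment covariates:
\{Y(0),\, Y(1)\} \;\perp\!\!\!\perp\; T \mid X

% Under the graph X -> T, X -> Y, T -> Y, X satisfies the back-door criterion,
% so the interventional distribution is identified by adjusting for X:
P(y \mid \mathrm{do}(t)) \;=\; \sum_{x} P(y \mid t, x)\, P(x)

% Either route leads to the same adjustment estimand for the average outcome:
E[Y(t)] \;=\; \sum_{x} E[Y \mid T = t,\, X = x]\, P(x)
```

In this simple three-variable case the two notations give the same estimand, which is consistent with the point that the graph is mainly a different way of communicating the same assumption.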

Put another way – are there cases when you do causal inference where you cannot draw a corresponding DAG of the causal structure of variables?

Sam:

Yes, I agree that one can go back and forth between different notations. It can be helpful to write a model as a graph.

> Again, on page 357: “the culture of ‘external validity’ is totally preoccupied with listing and categorizing the threats to validity rather than fighting them.” No. You could start with Jennifer’s 2011 paper, for example.

I may be mistaken, but I think that paper assumes ignorability and it doesn’t “fight” the threats to external validity that Pearl is talking about.

Carlos:

One of the threats to validity is departure from ignorability arising from not adjusting for important pre-treatment predictors, a problem which in turn can arise if you’re restricting your analysis to noisy unregularized methods such as least squares (in which case, excluding a predictor from your model is a crude form of regularization). Jennifer’s paper fights that threat to validity by supplying a method that allows more predictors to be included in the model.

There are a lot of threats to validity. Another is measurement error, when the variables being measured are not the same as the variables we are interested in studying. This happens a lot.

It’s hard to fight all the threats to validity at once; we must work on all fronts. Again, division of labor. In a good statistical design and analysis, we have to think about lots of different things.

I agree, there are a lot of threats to “validity” – internal and external – and work has to be done on all fronts. But I think Pearl’s concerns are often on a different level than yours.

There doesn’t seem to be all that much controversy here. Rather, much of the “controversy” seems to come from the overblown rhetoric of Pearl and Mackenzie. For instance, the preface to their book says:

This rhetoric jars because, as Andrew points out, the idea of causation is not new to statistics. To the contrary, it permeates statistics.

But, statistical discussions of causality can feel ad hoc because they focus on specific problems with specific models.

Perhaps Pearl and Mackenzie’s contribution is similar to what Shannon did in his earliest work that analyzed circuits in terms of Boolean algebra. People already knew a lot about how circuits worked, but in an ad hoc way. Shannon brought rigor to the field by developing explicit, algebraic tools that simplified the analysis and made more powerful analysis possible.

I am not saying that Pearl and Mackenzie’s work is as foundational as Shannon’s early work, I am just saying it is similar in what it brings to the table.

Let me, for the one, welcome our new overlords and say that it seems to me that Uncle Judea’s framework of causal confounders—colliders—mediators is useful, perhaps not in helping those of you who think carefully do non-stupid statistics, but in helping those of us who do not think carefully do non-stupid statistics, and in providing a royal road to teaching people how to do not-stupid statistics…

This is my favorite comment. I would just add that none of us could do non-stupid statistics all the time. Which means every one of us could benefit from Pearl’s insights.

From one of Andrew’s comments:

“When I say that I don’t think the do operator makes sense as a general construct, what I mean is that I think it makes sense for some variables but not in general…In an observational study, where x is an observed variable, then do(x) is not so clear to me. Rubin would sometimes give the example of an education study in which x is the number of hours a student studies for an exam. In that case, do(x) doesn’t tell us that much, because much can depend on how x is set.”

Isn’t this logically equivalent to saying that you don’t think that the potential outcomes framework makes sense as a general construct, because in some cases much will depend on the mechanism used to assign treatment?

From Rubin:

“SUTVA is simply the a priori assumption that the value of Y for unit u when exposed to treatment t will be the same no matter what mechanism is used to assign treatment t to unit u…”

Also, Pearl has made the point that compound interventions* (i.e. where the intervention influences variables other than x) can be handled within his framework, particularly in a recent dispute with Nancy Cartwright (I’ll try and dig out the papers).

* my terminology may be wrong

Brian:

Yes, exactly. I think the potential outcomes framework only makes sense when it is clear what is meant by setting x to some specified value. This point has been discussed many times in the literature, and it’s the basis for ideas such as instrumental variables, in which one can imagine setting the instrument to some specified value.

Brian,

I could not resist a comment on your post. Rubin’s manipulability restriction is unnecessary.

This paper explains why:

R-483: Pearl, “Does Obesity Shorten Life? Or is it the Soda? On Non-manipulable Causes,” Journal of Causal Inference, 6(2), online, September 2018.

https://ftp.cs.ucla.edu/pub/stat_ser/r483-reprint.pdf

https://ucla.in/2EpxcNU

Moreover, SUTVA is needed only for orthodox PO folks who do not speak structure. Otherwise, it is automatically satisfied in the structural interpretation of counterfactuals, accessible here: https://ucla.in/2G2rWBv

A drastic revision of Stone Age PO is in order.

Hi Andrew,

As an econ student, I find that the debate in statistics about causal inference covers much more than what we were taught about empirical identification in econ, e.g., RCT, DD, RD, matching, IV. It seems that we use a mix of approaches, including both Rubin’s ideas (like RCT, DD) and DAG-style ideas (like IV). Where would you place causal inference in economics within this bigger picture? Thank you so much.

For an admirer of both Pearl and Gelman, it seems so sad that 50% of the “dispute” is based on not agreeing on the definition of “model” (and the unwillingness of both parties to explain their conception to each other).

1) A “model” for Gelman (grossly oversimplifying): a probabilistic hierarchical model, which can be described in a DAG, of observed and unobserved quantities. Something you can program in Stan, feed data, and obtain inferences on the unobserved

2) A “model” for Pearl: a probabilistic hierarchical model, which can be described in a DAG, of observed and unobserved quantities. Something you can program in Stan, feed data, and obtain inferences on the unobserved, BUT WHERE EVERY ARROW REPRESENTS A CAUSAL RELATIONSHIP.

After reading half of BDA3 and half of TBoW, the latter is the only material difference between Pearl’s and Gelman’s conception of a “model”!

Pearl actually addresses this matter explicitly (p. 94):

“Bayesian networks (…) are related to causal diagrams in a simple way: a causal diagram is a Bayesian network in which every arrow signifies a direct causal relation, or at least the possibility of one, in the direction of that arrow.”

Mike:

I don’t quite understand what you’re saying. Let’s take a simple example. We model the heights of adult men as normally distributed. The statistical model is y ~ normal(mu, sigma). If you want, you can draw arrows from mu and sigma to y. But it would not make sense to say that mu and sigma cause y in any scientific sense of causal inference. I guess what I’m saying is that a graphical model can have some arrows with causal interpretation, but most of its arrows will not have causal interpretations, as they just represent mathematical relationships. For a more complicated example, consider a model of an educational experiment, predicting post-test score, y, on pre-test score, x, and treatment indicator, z. You might have a model such as y ~ normal(a + b*x + theta*z, sigma). In that case, the arrow from z to y is causal, but the arrows from a, b, x, theta, and sigma to y are not causal. It does not make sense to say that the regression coefficients cause the post-test score, or that the pre-test score causes the post-test score.

I guess many educators would claim that a pretest score “causes” the post-test score in the sense that the pretest is an indicator for prior knowledge. I agree that it is hard to interpret the intercept, though.

That’s not causality. If you go into the computer database and change the entry in the score database for that person they won’t get better post test scores.

Intelligence/skill causes both pre-test score and post-test score. If you can somehow give someone vitamins that improve their intelligence, skill, or attention, then they will get both better pre-test and better post-test scores.
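A small simulation can make this distinction visible. The sketch below is a made-up data-generating process, not anything from the post: the effect size `THETA`, the noise levels, and all variable names are hypothetical choices for illustration. An unobserved `ability` drives both scores, the treatment `z` is randomized, and the recorded pre-test never enters the equation for the post-test.

```python
import numpy as np

THETA = 5.0  # assumed true treatment effect (hypothetical value)

def post_score(ability, z, noise):
    """Structural equation for the post-test; the recorded pre-test is not an input."""
    return ability + THETA * z + noise

rng = np.random.default_rng(0)
n = 20_000
ability = rng.normal(50, 10, n)        # unobserved common cause
z = rng.integers(0, 2, n)              # randomized treatment indicator
pre = ability + rng.normal(0, 5, n)    # pre-test: noisy measurement of ability
noise = rng.normal(0, 5, n)
post = post_score(ability, z, noise)

# Fit y ~ a + b*x + theta*z by least squares.
X = np.column_stack([np.ones(n), pre, z])
a_hat, b_hat, theta_hat = np.linalg.lstsq(X, post, rcond=None)[0]

# Edit the database: overwrite the recorded pre-test scores.
pre_edited = pre + 10
post_after_edit = post_score(ability, z, noise)   # pre_edited plays no role

# Give the "vitamins": raise ability itself, which moves both scores.
ability_boosted = ability + 10
pre_boosted = ability_boosted + (pre - ability)
post_boosted = post_score(ability_boosted, z, noise)

print(f"coefficient on pre-test b = {b_hat:.2f}, on treatment theta = {theta_hat:.2f}")
print("mean change in post after editing records:", (post_after_edit - post).mean())
print("mean change in post after boosting ability:", (post_boosted - post).mean())
```

In this setup the regression coefficient on the pre-test is clearly nonzero, because the pre-test predicts the post-test through ability, yet editing the recorded scores changes nothing, while changing ability shifts both tests. Only the coefficient on `z` corresponds to something one could intervene on.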

Dear Andrew,

Is it really a stretch of the imagination to say that mu causes y in your simple example?

Fancifully, mu causes location and sigma causes dispersion.

Can we think of mu as a “proximate cause” in a chain of causation?

An attempt at a scientific explanation would render mu as a conditional mean depending on putative causes.

The model parameters instantiate the nature of the connection.

If this line of reasoning is valid, the same DAG that is used to organize causal queries (following Pearl) can be augmented to organize the Bayesian analysis of the statistical model (following Gelman).

This approach is giving me very satisfying results in applications to semiconductor design, manufacturing, test and reliability.

I am uncomfortable including a variable in a model just because it is correlated to an effect of interest.

In my applications, I get the best results with putative causes organized as a DAG.

Best regards,

Jeff

I find these discussions fun because I have no dog in the fight. I do like Pearl’s distinction between causal and probabilistic conditioning (which may be clearer in Shalizi’s text) as this clarifies (greatly!) something that has bothered many who recognized that “confounders” only make sense with the former (See Gelman and Hill!). Anyway, I do think all of this is a bit like the big endians and little endians from Gulliver…that is, we are arguing about two different systems but our ability to estimate causal effects with observational designs is largely non-existent (sure, factors with big effects can be reasonably estimated but these have all been discovered). My evidence is the field of epidemiology (where we have some RCTs to give us reality checks and yes I know the problems with RCTs) and my own simulations modeling “what if” worlds and our ability to estimate the true ifs (https://onlinelibrary.wiley.com/doi/full/10.1111/evo.12406). Fire away in 3-2-1…

Just to clarify, I assume when you say “(which may be clearer in Shalizi’s text) ” you are referring to http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf.

yes – the text drops hints here and there but the explicit difference is given on p. 505-506 of that version.

wow, thx, great reference

The gist of this argument between Andrew and Judea bears some resemblance to the 20th-century debate between Bohr and Einstein. They couldn’t understand each other because Bohr was talking from an epistemological point of view and Einstein from an ontological one.

It was nicely described by E. T. Jaynes:

“The main suggestion we wish to make is that how we look at basic probability theory has deep implications for the Bohr-Einstein positions. Only since 1988 has it appeared to the writer that we might be able finally to resolve these matters in the happiest way imaginable: a reconciliation of the views of Bohr and Einstein in which we can see that they were both right in the essentials, but just thinking on different levels.

Einstein’s thinking is always on the ontological level traditional in physics; trying to describe the realities of Nature. Bohr’s thinking is always on the epistemological level, describing not reality but only our information about reality. The peculiar flavor of his language arises from the absence of all words with any ontological import. J. C. Polkinghorne came independently to this same conclusion about the reason why physicists have such difficulty in reading Bohr. He quotes Bohr as saying:

“There is no quantum world. There is only an abstract quantum physical description. It is wrong to think that the task of physics is to find out how nature is. Physics concerns what we can say about nature.”

So in Bohr’s writings the notion of a “real physical situation” was just not present and he gave evasive answers to questions of the form: “What is really happening when …?””

He then continues to disagree with Bohr:

“We disagree strongly with one aspect of Bohr’s quoted statement above; in our view, the existence of a real world that was not created in our imagination, and which continues to go about its business according to its own laws, independently of what humans think or do, is the primary experimental fact of all, without which there would be no point to physics or any other science. The whole purpose of science is to learn what that reality is and what its laws are.”

Interesting.

This is why I come here. Thanks!

For more on Bohr, see http://www.bohmian-mechanics.net/sokalhoax.html

For me it’s this. I love Pearl’s method. I think DAGs are a wonderful tool. But if you take his claims about statistics seriously, then there was no proof smoking caused lung cancer until after he came along and drew a DAG in The Book of Why.

Side note, I really like Krieger and Davey Smith’s work on triangulation of evidence. I think it’s a much better framework: https://academic.oup.com/ije/article/45/6/1787/2617188

That’s a very interesting link, thank you.

To celebrate that the system is alive again, I’ll point out that there are interesting commentaries on that paper in the same issue of the journal:

https://academic.oup.com/ije/issue/45/6#250304-2617148

Yup, thanks.

“In this essay, we suggest that in epidemiology no one causal approach should drive the questions asked or delimit what counts as useful evidence. Robust causal inference instead comprises a complex narrative, created by scientists appraising, from diverse perspectives, different strands of evidence produced by myriad methods. DAGs can of course be useful, but should not alone wag the causal tale.”

“framework of ‘inference to the best explanation’, an approach perhaps best developed by Peter Lipton, a philosopher of science who frequently employed epidemiologically relevant examples.”

I think Lipton misconstrued Peirce somewhat in his account of ‘inference to the best explanation’ http://www.commens.org/sites/default/files/working_papers/peirces_concept_of_abduction_hypothesis_formation_across_his_later_stages_of_scholarly_life.pdf

But interesting that he employed it in epidemiology – too many distractions.

I think I have some insight as to why Andrew Gelman and Judea Pearl seem to be talking past each other a lot of the time. (Of course, this should be taken with a grain of salt. Communication can be difficult, and I may not understand and/or end up misrepresenting their positions.)

Heckman’s “Scientific Model of Causality” outlines three distinct tasks:

1.) Definitions of counterfactuals

2.) Identifying parameters from population distributions

3.) Identifying parameters from real data

The first requires a scientific theory; the second is the problem of parameter identification; the third is in the domain of estimation / hypothesis testing theory.

In addition, Pearl (and Shpitser?) outline a “causal hierarchy” of successively more powerful types of queries:

1.) Associational / “statistical”, e.g. P(y | x)

2.) Causal / interventional, e.g. P(y | do(x))

3.) Counterfactual, e.g. P(Y_x’ | x)
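The gap between the first two levels can be illustrated with a tiny discrete model that is worked out exactly rather than simulated. Everything in this sketch is a hypothetical choice for illustration: a binary confounder Z that affects both X and Y, and the particular probability tables. The third (counterfactual) level is not computed here, since it would need assumptions about the structural equations beyond these tables.

```python
# Hypothetical model: Z -> X, Z -> Y, X -> Y, all variables binary.
P_Z = {0: 0.5, 1: 0.5}              # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}     # P(X = 1 | Z = z)

def p_y1(x, z):
    """P(Y = 1 | X = x, Z = z); x raises it by 0.2, z by 0.5."""
    return 0.1 + 0.2 * x + 0.5 * z

def p_x(x, z):
    """P(X = x | Z = z)."""
    return P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]

def p_y1_given_x(x):
    """Level 1, associational: P(Y = 1 | X = x), weighting z by P(z | x)."""
    num = sum(p_y1(x, z) * p_x(x, z) * P_Z[z] for z in P_Z)
    den = sum(p_x(x, z) * P_Z[z] for z in P_Z)
    return num / den

def p_y1_do_x(x):
    """Level 2, interventional: P(Y = 1 | do(X = x)), weighting z by P(z)."""
    return sum(p_y1(x, z) * P_Z[z] for z in P_Z)

assoc_diff = p_y1_given_x(1) - p_y1_given_x(0)
causal_diff = p_y1_do_x(1) - p_y1_do_x(0)
print(f"P(y|x=1) - P(y|x=0)         = {assoc_diff:.2f}")
print(f"P(y|do(x=1)) - P(y|do(x=0)) = {causal_diff:.2f}")
```

With these tables the associational contrast is 0.50 while the interventional contrast is 0.20, the effect built into `p_y1`. The only difference between the two functions is whether Z is averaged over P(z | x) or over P(z), and choosing the second requires knowing that Z is a common cause, which is a causal assumption and not something the joint distribution reveals.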

Much (most?) of the research from the causal diagram / structural causal model community (e.g. Pearl, Tian, Shpitser, Bareinboim, et al) has focused on providing *precise* conditions under which causal queries can be *identified*. In the Heckman hierarchy, they are working on the 1st and 2nd tasks, and they are, usually, computing causal (sometimes, counterfactual) queries, in terms of associational (sometimes, causal) expressions, with respect to the assumptions embodied in causal diagrams.

These results are precise / mathematically rigorous, in that the resulting formulas *exactly* compute the queries of interest, in the limit of infinite samples, assuming that the provided causal models are correct.

Statisticians are generally concerned with the 3rd task of estimation / hypothesis testing since the true population distribution is not available. Note that you don’t need a causal model to answer associational / “statistical” queries; a statistical model suffices (this is often a source of confusion regarding Bayesian networks: ‘ordinary’ Bayesian networks are statistical models, while causal Bayesian networks embody stronger assumptions).

Pearl’s claim is that you can’t rigorously express causality without using do notation (or similar). It may be possible to informally express causal knowledge, and associational queries may, for certain queries, in certain models, be decent approximations of causal queries, but the notation of probability theory and statistics literally does not allow expressing causal queries. (Rubin’s potential outcome notation does, but is very difficult to use correctly.) I think Pearl’s argument boils down to “This informality has the potential to be very bad, and causal assumptions/queries should be formalized in some kind of standard notation.”

However, most of the state-of-the-art methods with causal diagrams assume that a researcher is only interested in nonparametric model assumptions and recursive (acyclic) models. In addition, fully specified causal diagrams are often unavailable (although there is some active research on helping with this problem). Finally, very little of the causal diagram / SCM research focuses on the problems associated with finite-sample variation, which isn’t a detail to be brushed aside – in some sense, that’s the entire point of statistics. I think Gelman’s argument boils down to, “There are other methods that you might not consider to be fully rigorous, but they seem to be working alright, and it’s unclear how to use your methods for the problems we’re interested in.”

Very well put. There are different phases in the exercise of producing causal information that can, e.g., inform policy. In my teaching, I distinguish between causal inference (identification problems) and statistical inference.

Joshua,

Your characterization of the two efforts is accurate, up to the point where you say:

“Note that you don’t need a causal model to answer associational / “statistical” queries”

This is true, but you need a causal model to decide what you need to estimate. You can’t start the estimation process before receiving instructions from the identification process. So, how do statisticians survive? They estimate convenient quantities, and make believe they are engaged in “causal inference” because if anyone catches them cheating, they can always post-justify what they did by finding assumptions that will make it kosher.

You mention three limitations to SCM: (1) nonparametric, (2) acyclicity, (3) large sample.

(1) True, although graphical models are revolutionizing linear SEM as well.

(2) True, but you cannot manage cycles with PO or with statistical techniques.

(3) True, but you can’t do any finite-sample inference if you do not have an estimand to estimate.

As to “There are other methods that you might not consider to be fully rigorous,” I do not insist on rigor. I insist, however, on stating those other methods, however handwaving they are, and relating them to what rigor dictates, so that we know what approximations were made.

Finally: “it’s unclear how to use your methods for the problems we’re interested in.” Really? All it takes is to examine the estimand that comes out of our inference engine and try to estimate it with your powerful statistical methods, rather than pretend that you don’t care about the estimand, because whatever you estimate “seems to be working”.

The main issue is how Andrew and his team can estimate things without an estimand, namely without doing identification or borrowing an estimand from graphical models. Andrew answers it: “I find it baffling that Pearl and his colleagues keep taking statistical problems and, to my mind, complicating them by wrapping them in a causal structure”. In other words, he thinks he can do identification without causal structure, using statistical techniques, or wish away the need for identification, continue estimating what is easily estimated and then write: “identification is important, we need more books about it”. I don’t get the logic.

Judea:

Let me clarify. I have worked on problems that involve causal inference, and there we have identification strategies (some combination of assumptions, data collection, and modeling). I’ve worked on other problems that do not involve causal inference, and for those we do not require causal identification. I don’t think we can do identification using statistical techniques without causal structure or assumptions, and I would not want to leave the impression that I think that. Indeed, I’ve consistently been critical of statistical methods that have been advertised as being able to discover causal structure using data alone.

Andrew,

You said: Again, on page 357: “the culture of ‘external validity’ is totally preoccupied with listing and categorizing the threats to validity rather than fighting them.” No. You could start with Jennifer’s 2011 paper, for example.

I looked into Jennifer’s paper and, alas, I see no connection to “external validity”. It is a good place to demonstrate the virtue of toy examples. Do you or any of your readers truly believe that the method developed in that paper enables us to solve the three toy examples illustrated in Figure 3 of:

Pearl and E. Bareinboim “External validity: From do-calculus to

transportability across populations” Statistical Science May 2014

http://ftp.cs.ucla.edu/pub/stat_ser/r400-reprint.pdf

https://ucla.in/2N7S0K9

I doubt it. But I would really like to hear if there exists someone who thinks so. Having thought about this problem for quite some time, I am reiterating my belief (p. 237) that “the culture of ‘external validity’ is totally preoccupied with listing and categorizing the threats to validity rather than fighting them.” I would be open to change my belief as soon as someone shows me another method capable of deciding which of the three examples is transportable and how.

Judea:

There was this “The sensitivity, specificity, and positive and negative predictive values (PPV and NPV, respectively) of the BPQ for a diagnosis of dementia were calculated for both the memory clinic and primary care. The ability of the BPQ to detect dementia in a primary care setting was estimated using the prevalence of dementia in the Canadian population.” where PPV and NPV were transported to the Canadian population using Bayesian methods developed by Jim Berger. Ability of the “bergman-paris” question to detect dementia in community-dwelling older people. Caporuscio, Monette, Gold, Monette, O’Rourke.

I think I mentioned this to you in San Diego years ago and you did not seem to think of it as transportation. One thing to keep in mind is that clinical journals do not want to give space to formal arguments while getting such stuff published in statistical journals is very hard given the lack of any new technical development (recasting in different math is not quite that). So maybe you are looking for more formal discussions in print than there are?

Keith,

May I conclude that, in view of the papers you mentioned above, a technique exists today that enables ordinary mortals to examine the three stories in Fig. 3 of http://ftp.cs.ucla.edu/pub/stat_ser/r400-reprint.pdf and decide which of them is transportable and how?

Do you know of anyone who would be able to demonstrate this technique?

Would the resultant transport formula come out the same?

I am not looking for a formal discussion, I am just looking for a solution to a simple problem. Solutions do not stop with citations to other papers, but are reducible to: “show me the problem, step 1, step 2,… and the answer is =ANS”. Is the solution you have in mind reducible?

Judea:

All disagreements aside, I just want to thank you again for commenting here. We have a great comments section, this is a rare place for open and sustained intellectual discussion, and I appreciate your willingness to engage.

> not looking for a formal discussion

To me you are.

An example where someone credibly transported a parameter in an application does not count.

From your comments to Andrew below, you want to see “organizing these assumptions in any “structure””, “apparatus … [to have] representation for such assumptions” and “just making “causal assumptions” and leaving them hanging in the air is not enough. We need to do something with the assumptions, listen to them, and process them so as to properly guide us in the data analysis stage.”

I have read your paper with the three figures many times and did not discern any way I would have done anything differently in that paper above.

But I do agree that good formal representations are important and that is absent.

p.s. I am guessing you are aware of C. S. Peirce’s Existential Graphs, which do the same for logic – put it into a manipulable representation that preserves truth relationships.

Andrew,

I appreciate your kind invitation to comment on your blog.

Let me start with a Tweet that I posted on https://twitter.com/yudapearl (updated 1.10.19):

1.8.19 @11:59pm – Gelman’s review of #Bookofwhy should be of interest because it represents an attitude that paralyzes wide circles of statistical researchers. My initial reaction is now posted on https://bit.ly/2H3BH3b Related posts: https://ucla.in/2sgzkPZ and https://ucla.in/2v72QK5

These postings speak for themselves but I would like to respond here to your recommendation:

“Similarly, I’d recommend that Pearl recognize that the apparatus of statistics, hierarchical regression modeling, interactions, post-stratification, machine learning, etc etc solves real problems in causal inference.”

It sounds like a mild and friendly recommendation, and your readers would probably get upset at anyone who would be so stubborn as to refuse it.

But I must. Because, from everything I know about causation, the apparatus you mentioned does NOT, and CANNOT solve any problem known as “causal” by the causal-inference community (which includes your favorites Rubin, Angrist, Imbens, Rosenbaum, etc etc.).

Why?

Because the solution to any causal problem must rest on causal assumptions and the apparatus you mentioned has no representation for such assumptions.

1. Hierarchical models are based on set-subset relationships, not causal relationships.

2. “Interactions” is not an apparatus unless you represent them in some model, and act upon them.

3. “Post-stratification” is valid only after you decide what you stratify on, and this requires a causal structure (which you claim above to be an unnecessary “wrapping” and “complication”).

4. “Machine learning” is just fancy curve fitting of data; see https://ucla.in/2umzd65

Thus, what you call “statistical apparatus” is helpless in solving causal problems. We came to this juncture several times in the past and, invariably, you pointed me to books, articles, and elaborated works which, in your opinion, do solve “real life causal problems”. So, how are we going to resolve our disagreement on whether those “real life” problems are “causal” and, if they are, whether your solution of them is valid? I suggested applying your methods to toy problems whose causal character is beyond dispute. You did not like this solution, and I do not blame you, because solving ONE toy problem will turn your perception of causal analysis upside down. It is frightening.

So I would not press you. But I will add another Tweet before I depart:

1.9.19 @2:55pm – An ounce of advice to readers who comment on this “debate”: Solving one toy problem in causal inference tells us more about statistics and science than ten debates, no matter who the debaters are. #Bookofwhy

Addendum. Solving ONE toy problem will tell you more than a dozen books and articles and multi-cited reports. You can find many such toy problems (solved in R) here:

* https://ucla.in/2KYYviP

* sample of solution manual: https://ucla.in/2G11xUE

For your readers’ convenience, I have provided free access to chapter 4 here: https://ucla.in/2G2rWBv

It is about counterfactuals and, if I were not inhibited by modesty, I would confess that it is the best text on counterfactuals and their applications that you can find anywhere.

I hope you take advantage of my honesty.

Enjoy

Judea

Judea:

We are in agreement. I agree that data analysis alone cannot solve any causal problems. Substantive assumptions are necessary too. To take a familiar sort of example, there are people out there who just think that if you fit a regression of the form, y = a + bx + cz + error, that the coefficients b and c can be considered as causal effects. At the level of data analysis, there are lots of ways of fitting this regression model. In some settings with good data, least squares is just fine. In more noisy problems, you can do better with regularization. If there is bias in the measurements of x, z, and y, that can be incorporated into the model also. But none of this legitimately gives us a causal interpretation until we make some assumptions. There are various ways of expressing such assumptions, and these are talked about in various ways in your books, in the books by Angrist and Pischke, in the book by Imbens and Rubin, in my book with Hill, and in many places. Your view is that your way of expressing causal assumptions is better than the expositions of Angrist and Pischke, Imbens and Rubin, etc., that are more standard in statistics and econometrics. You may be right! Indeed, I think that for some readers your formulation of this material is the best thing out there.

Anyway, just to say it again: We agree on the fundamental point. This is what I call in the above post the division of labor, quoting Frank Sinatra etc. To do causal inference requires (a) assumptions about causal structure, and (b) models of data and measurement. Neither is enough. And, as I wrote above:

Where we disagree is just on terminology, I think. I wrote, “the apparatus of statistics, hierarchical regression modeling, interactions, poststratification, machine learning, etc etc., solves real problems in causal inference.” When I speak of this apparatus, I’m not just talking about probability models; I’m also talking about assumptions that map those probability models to causality. I’m talking about assumptions such as those discussed by Angrist and Pischke, Imbens and Rubin, etc.—and, quite possibly, mathematically equivalent in these examples to assumptions expressed by you.

So, to summarize: To do causal inference, we need (a) causal assumptions (assumptions of causal structure), and (b) models or data analysis. The statistics curriculum spends much more time on (b) than (a). Econometrics focuses on (a) as well as (b). You focus on (a). When Angrist, Pischke, Imbens, Rubin, Hill, me, and various others do causal inference, we do both (a) and (b). You argue that if we were to follow your approach on (a), we’d be doing better work for those problems that involve causal inference. You may be right, and in any case I’m glad you and Mackenzie wrote this book which so many people have found helpful, just as I’m glad that the aforementioned researchers wrote their books on causal inference which so many have found helpful. A framework for causal inference—whatever that framework may be—is complementary to, not in competition with, data-analysis tools such as hierarchical modeling, poststratification, machine learning, etc.

P.S. I’ll ignore the bit in your comment where you say you know what is “frightening” to me.

+1

Andrew,

I would love to believe that where we disagree is just on

terminology. Indeed, I see sparks of convergence in your

last post, where you enlighten me to understand that by

“the apparatus of statistics, …” you include

the assumptions that PO folks (Angrist and Pischke, Imbens and

Rubin etc.) are making, namely, assumptions of conditional

ignorability. This is a great relief, because I could not

see how the apparatus of regression, interaction,

post-stratification or machine learning alone, could elevate

you from rung-1 to rung-2 of the Ladder of Causation. Accordingly,

I will assume that whenever Gelman and Hill talk about

causal inference they tacitly or explicitly make the

ignorability assumptions that are needed to take them

from associations to causal conclusions. Nice.

Now we can proceed to your summary and see if we still have

differences beyond terminology.

I almost agree with your first two sentences:

“So, to summarize: To do causal inference, we need (a) causal

assumptions (assumptions of causal structure), and (b) models

or data analysis. The statistics curriculum spends much more

time on (b) than (a)”.

But we need to agree that just making “causal assumptions”

and leaving them hanging in the air is not enough. We need to

do something with the assumptions, listen to them, and

process them so as to properly guide us in the data

analysis stage.

I believe that by (a) and (b) you meant to distinguish

identification from estimation. Identification indeed

takes the assumptions and translates them into a recipe with which

we can operate on the data so as to produce a valid estimate of

the research question of interest.

If my interpretation of your (a) and (b) distinction is

correct, permit me to split (a) into (a1) and (a2)

where (a2) stands for identification.

With this refined taxonomy, I have strong reservations about your

third sentence: “Econometrics focuses on (a) as well as (b).”

Not all of econometrics. The economists you mentioned, while

commencing causal analysis with “assumptions” (a1), vehemently resist

organizing these assumptions in any “structure”, be it a

DAG or structural equations (some even pride themselves

on being “model-free”). Instead, they restrict their

assumptions to conditional ignorability statements

so as to justify familiar estimation routines.

[In https://ucla.in/2mhxKdO, I labeled them:

“experimentalists” or “structure-free economists”

to be distinguished from “structuralists” like Heckman,

Sims, or Matzkin.]

It is hard to agree therefore that these “experimentalists”

focus on (a2) — identification. They actually assume (a2) away

rather than use it to guide data analysis.

Continuing with your summary, I read:

“You focus on (a).” Agree. I interpret (a) to mean

(a) = (a1) + (a2) and I let (b) be handled by

smart statisticians, once they listen to the guidance of (a2).

Continuing, I read:

“When Angrist, Pischke, Imbens, Rubin, Hill, me, and various

others do causal inference, we do both (a) and (b).”

Not really. And it is not a matter of choosing “an

approach”. By resisting structure, these researchers

a priori deprive themselves of answering causal questions

that are identifiable by do-calculus and not by a single

conditional ignorability assumption. Each of those questions may

require a different estimand, which means that you cannot start

doing the “data analysis” phase before completing the identification

phase.

[Currently, even questions that are identifiable by

conditional ignorability assumption cannot be answered by

structure-free PO folks, because deciding on the

conditioning set of covariates is intractable without the

aid of DAGs, but this is a matter of efficiency not of

essence.]

But your last sentence is hopeful:

“A framework for causal inference — whatever that

framework may be — is complementary to, not in

competition with, data-analysis tools such as hierarchical

modeling, post-stratification, machine learning, etc.”

Totally agree, with one caveat: the framework has to be a genuine

“framework,” i.e., capable of leveraging identification to guide

data-analysis.

Let us look now at why a toy problem would be frightening;

not only to you, but to anyone who believes that the PO

folks are offering a viable framework for causal inference.

Let’s take the simplest causal problem possible, say

a Markov chain X → Z → Y with X standing for Education,

Z for Skill and Y for Salary. Let Salary be determined by

Skill only, regardless of Education. Our research problem is

to find the causal effect of Education on Salary given

observational data of (perfectly measured) X,Y,Z.

To appreciate the transformative power of a toy example,

please try to write down how Angrist, Pischke, Imbens, Rubin, Hill,

would go about doing (a) and (b) according to your understanding

of their framework. You are busy, I know, so let me ask any

of your readers to try and write down step by step how

the graph-less school would go about it.

Any reader who tries this exercise ONCE will never be the

same. It is hard to believe unless you actually go through

this frightening exercise; please try.

Repeating my sage-like advice:

Solving one toy problem in causal

inference tells us more about statistics and science than

ten debates, no matter who the debaters are.

Try it.
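
For readers who want to take up the toy problem numerically, here is a minimal simulation sketch of the Education → Skill → Salary chain. It assumes linear mechanisms with Gaussian noise, no unobserved confounders, and made-up coefficients `a` and `b`; none of these specifics come from the comment above, and the function names are hypothetical. Under those assumptions the total effect of X on Y is `a * b`, and the sketch compares two regressions against it:

```python
import random


def simulate_chain(n=20000, a=0.8, b=1.5, seed=0):
    """Simulate X -> Z -> Y with linear mechanisms (coefficients are assumptions)."""
    rng = random.Random(seed)
    xs, zs, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)             # Education
        z = a * x + rng.gauss(0, 1)     # Skill depends on Education
        y = b * z + rng.gauss(0, 1)     # Salary depends on Skill only
        xs.append(x)
        zs.append(z)
        ys.append(y)
    return xs, zs, ys


def cov(u, v):
    """Sample covariance (divisor n); cov(u, u) is the variance."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / n


def slope_simple(xs, ys):
    """Coefficient on x when regressing y on x alone."""
    return cov(xs, ys) / cov(xs, xs)


def slope_x_adjusting_for_z(xs, zs, ys):
    """Coefficient on x when regressing y on both x and z."""
    sxx, szz, sxz = cov(xs, xs), cov(zs, zs), cov(xs, zs)
    sxy, szy = cov(xs, ys), cov(zs, ys)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


xs, zs, ys = simulate_chain()
print("total effect a*b:         ", 0.8 * 1.5)
print("regress Y on X:           ", round(slope_simple(xs, ys), 3))
print("regress Y on X and Z (x): ", round(slope_x_adjusting_for_z(xs, zs, ys), 3))
```

With this graph the unadjusted regression recovers the total effect, while adding the mediator Z as a covariate drives the coefficient on X toward zero. Which regression to run is decided by the assumed structure, not by the data.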

Judea:

I think we agree on much of the substance. And I agree with you regarding “not all econometrics” (and, for that matter, not all of statistics, not all of sociology, etc.). As I wrote in my review of your book with Mackenzie, and in my review of Angrist and Pischke’s book, causal identification is an important topic and worth its own books.

In practice, our disagreement is, I think, that we focus on different sorts of problems and different sorts of methods. And that’s fine! Division of labor. You have toy problems that interest you, I have toy problems that interest me. You have applied problems that interest you, I have applied problems that interest me. I would not expect you to come up with methods of solving the causal inference problems that I work on, but that’s OK: your work is inspirational to many people and I can well believe it has been useful in certain applications as well as in developing conceptual understanding. I consider toy problems of my own for that same reason. I’m not particularly interested in your toy problems, but that’s fine; I doubt you’re particularly interested in the problems I focus on. It’s a big world out there.

In the meantime, you continue to characterize me as being frightened or lacking courage. I wish you’d stop doing that.

Andrew,

Convergence is in sight, modulo two corrections:

1. You say:

“You [Pearl] have toy problems that interest you, I [Andrew] have toy problems that interest me.

…I doubt you’re particularly interested in the problems I focus on.”

Wrong! I am very interested in your toy problems, especially those with causal flavor. Why?

Because I love to challenge the SCM framework with new tasks and new angles that other researchers found

to be important, and see if SCM can be enriched with expanded scope. So, by all means, if you have

a new twist, shoot. I have not been able to do it in the past, because your shots were not toy-like,

e.g., 3-4 variables, clear task, with correct answer known.

2. You say:

“you continue to characterize me as being frightened or lacking courage”

This was not my intention. My last remark on frightening toys was general, everyone is frightened by the honesty

and transparency of toys — the adequacy of one’s favorite method is undergoing a test of fire. Who wouldn’t be frightened?

But, since you prefer, I will stop using this metaphor.

3. Starting afresh, and for the sake of good spirit: How about attacking a toy problem? Just for fun, just for sport,

Judea:

I’ve attacked a lot of toy problems.

For an example of a toy problem in causality, see pages 962-963 of this article.

But most of the toy problems I’ve looked at do not involve causality; see for example this paper, item 4 in this post, and this paper. This article on experimental design is simple enough that I think it could count as a toy problem: it’s a simple example without data which allows us to compare different methods. And here’s a theoretical paper I wrote awhile ago that has three toy examples. Not involving causal inference, though.

I’ve written lots of papers with causal inference, but they’re almost all applied work. This may be because I consider myself much more of a practitioner of causal inference than a researcher on causal inference. To the extent I’ve done research on causal inference, it’s mostly been to resolve some confusions in my mind (as in this paper).

This gets back to the division-of-labor thing. I’m happy for you and Imbens and Hill and Robins and VanderWeele and others to do research on fundamental methods for causal inference, while I do research on statistical analysis. The methods that I’ve learned have allowed my colleagues and me to make progress on a lot of applied problems in causal inference, and have given me some clarity in understanding problems with some naive formulations of causal reasoning (as in the first reference above in this comment).

As I wrote in my above post, I think your book with Mackenzie has lots of great things in it; I just can’t go with a statement such as, “Using a calculus of cause and effect developed by Pearl and others, scientists now have the ability to answer such questions as whether a drug cured an illness, when discrimination is to blame for disparate outcomes, and how much worse global warming can make a heat wave”—because scientists have been answering such questions before Pearl came along, and scientists continue to answer such questions using methods other than Pearl’s. For what it’s worth, I don’t think the methods that my colleagues and I have developed are necessary for solving these or any problems. Our methods are helpful in some problems, some of the time, at least until something better comes along—I think that’s pretty much all that any of us can hope for! That, and we can hope that our writings inspire new researchers to come up with new methods that are useful in the future.

Andrew,

Agree to division of labor: causal inference on one side and statistical analysis on the other.

Assuming that you give me some credibility on the first, let me try and show you that even the publisher advertisement that you mock with disdain is actually true and carefully expressed. It reads: “Using a calculus of cause and effect developed by Pearl and others, scientists now have the ability to answer such questions as whether a drug cured an illness, when discrimination is to blame for disparate outcomes, and how much worse global warming can make a heat wave”.

First, note that it includes “Pearl and others”, which theoretically might include the people you have in mind. But it does not; it refers to those who developed mathematical formulation and mathematical tools to answer such questions. So let us examine the first question: “whether a drug cured an illness”. This is a counterfactual “cause of effect” type question. Do you know when it was first formulated mathematically? [Don Rubin declared it non-scientific]. Now let’s go to the second: “when discrimination is to blame for disparate outcomes.” This is a mediation problem. Care to guess when this problem was first formulated (see Book of Why chapter 9) and what the solution is?

Bottom line, Pearl is not as thoughtless as your review portrays him to be and, if you advise your readers to control their initial reaction (“Hey, statisticians have been doing it for centuries”), they would value learning how things were first formulated, first solved, and why statisticians were not always the first.

Judea:

I disagree with your implicit claim that, before your methods were developed, scientists were not able to answer such questions as whether a drug cured an illness, when discrimination is to blame for disparate outcomes, and how much worse global warming can make a heat wave. I doubt much will be gained by discussing this particular point further so I’m just clarifying that this is a point of disagreement.

Also, I don’t think in my review I portrayed you as thoughtless. My message was that your book with Mackenzie is valuable and interesting even though it has some mistakes. In my review I wrote about the positive part as well as the mistakes. Your book is full of thought!

They both fought for truth but when their methods clashed, “Kah-BLAMMO!” … Rhetorical CONFLICT! … /munch, munch, munch (add salt; sorry you pseudo-experimentalists), /munch, munch, munch. / PAUSE. / run to kitchen, … microwave. Pop, pop, pop. Crossing fingers hoping for more. /vaults over sofa back and settles back in.

Judea,

How do we assess whether X and Z interact to cause Y, and whether this interaction effect is identified from the observed data?

Somewhat related:

Does the standard frontdoor adjustment for unknown confounders assume we have a representative (eg random) sample from our target population?

Eg what assumptions are made about the connection between the distribution of the unmeasured vars in the target population and that in the data available?

ojm,

Yes, the standard frontdoor adjustment for unknown confounders assumes we have a representative (eg random) sample from our target population.
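
For reference, the frontdoor formula in its usual textbook form, written here for a treatment X, a mediator Z, and an outcome Y (the symbols are chosen to match the toy examples in this thread and are an assumption about the setup being asked about):

```latex
P(y \mid do(x)) \;=\; \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
```

Every term on the right-hand side is a distribution over observed variables, estimated from the sample at hand. That is where the sampling assumption enters: the estimate describes the target population only to the extent that the sample reproduces those distributions.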

If you suspect disparity between target and study population, express your suspicions in a Selection Diagram and turn the

transportability engine on. The answer will come back in seconds. See https://ucla.in/2N7S0K9 for details. It depends, of course, on HOW the two populations differ; some differences can be ignored and others may be detrimental.

[BTW, Andrew could not forgive me for stating that “the problem of external validity has not progressed an iota since Campbell and Stanley”. I hope you see how damn right I was.]

Thanks :-)

I find it interesting that a seemingly ‘causal’ notion relies on such a statistical notion as having a random sample from a population.

Do all of your usual ‘internal’ validity methods require assumptions on sampling mechanisms?

As I mentioned elsewhere, I’m more used to things like conservation of mass and energy, which usually don’t suffer from transportability problems (I suppose actually you can derive these laws from ‘transportability assumptions’ à la Noether!), so find this all quite foreign, but very interesting.

To me when I think ‘causal’ I think eg conservation equations. The closest ‘philosophical’ account I know of is Dowe’s ‘physical causation’

https://www.cambridge.org/core/books/physical-causation/D056895488F735AC513E455D3683497F

Is it fair to say that you are more interested in constructing estimators from empirical data than say building ‘mechanistic’ models like those found in something like mathematical biology?

By ‘mechanistic models’ and mathematical biology I mean eg

https://www.springer.com/gp/book/9780387952239

Is this sort of thing orthogonal to your goals? Which again, I take as constructing estimators from empirical data as guided by qualitative ‘causal’ info?

CK,

To identify the interaction, we need to identify the quantity

P(y|do(x,z)) for at least four values of x and z, and check whether

the difference P(y|do(x,z)) - P(y|do(x’,z)) depends on z.

The first identification is an exercise in do-calculus, for which we have a complete algorithm

once you write down the graph. If the backdoor condition holds, it becomes a difference between

two regression expressions.

For a gentle introduction, see https://ucla.in/2KYYviP
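
The recipe above can be sketched in a few lines for the case where the backdoor condition holds. This is only an illustration under loudly stated assumptions: binary X, Z, and Y, a single observed binary confounder W that satisfies the backdoor condition, and invented probabilities chosen so that the true interaction contrast is 0.3. The function names are hypothetical, and the adjustment used is the standard one, P(y|do(x,z)) = Σ_w P(y|x,z,w) P(w):

```python
import random
from collections import defaultdict


def simulate(n=200000, seed=1):
    """Hypothetical data: W confounds X, Z, and Y; X and Z interact in causing Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        x = int(rng.random() < 0.3 + 0.4 * w)
        z = int(rng.random() < 0.6 - 0.3 * w)
        p_y = 0.1 + 0.2 * x + 0.1 * z + 0.3 * x * z + 0.2 * w   # 0.3 is the interaction
        y = int(rng.random() < p_y)
        rows.append((w, x, z, y))
    return rows


def p_do(rows, x, z):
    """Backdoor estimate of P(Y=1 | do(x, z)): sum over w of P(Y=1 | x, z, w) * P(w)."""
    n = len(rows)
    count_w = defaultdict(int)   # how often each w occurs overall
    cell_n = defaultdict(int)    # rows with this (x, z), by w
    cell_y = defaultdict(int)    # of those, how many have y = 1
    for w, xi, zi, y in rows:
        count_w[w] += 1
        if xi == x and zi == z:
            cell_n[w] += 1
            cell_y[w] += y
    return sum((cell_y[w] / cell_n[w]) * (count_w[w] / n) for w in count_w)


def interaction_contrast(rows):
    """Does the effect of x on y depend on z? Compare the x-effect at z=1 and z=0."""
    effect_at_z1 = p_do(rows, 1, 1) - p_do(rows, 0, 1)
    effect_at_z0 = p_do(rows, 1, 0) - p_do(rows, 0, 0)
    return effect_at_z1 - effect_at_z0


rows = simulate()
print("estimated interaction contrast:", round(interaction_contrast(rows), 3))
```

The four calls to `p_do` are the "at least four values of x and z" mentioned above; a contrast clearly different from zero indicates that the effect of X depends on Z. When the backdoor condition does not hold, the estimand itself changes and this particular adjustment no longer applies.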

> Hierarchical models are based on set-subset

relationships, not causal relationships.

Judea:

While I agree that causal or physical assumptions are a necessary supplement to pure empirical analysis, I wanna mention that this seems like a weird interpretation of hierarchical models.

Hierarchical models are fundamentally about conditional independencies and Markovian assumptions, not set/subset relationships as far as I’m familiar with them. Do you have any concrete examples of a hierarchical model where this is the case?

While the implied conditional independencies might not be enough for you to directly model causal assumptions – since you seem to require probabilistic assumptions to be about observables and not latent or unobservable constructs – it is a convenient way to incorporate or combine physical and statistical assumptions.

Two references on this:

‘Physical‐statistical modeling in geophysics’

https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2002JD002865

‘Bayesian hierarchical time series models’

http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais:hierarquical_model_time_series.pdf

An important point is that while the probability calculus is symmetric etc etc, our epistemic status with respect to the various conditional distributions is different – thus we build a process model for future variables conditional on past variables that directly influence these (Markovian physical assumptions etc).

You of course would probably prefer to express this knowledge as a DAG, but I’m also unsure whether this formal representation and the theorems are sufficient to cover real world phenomena where models are not recursive and so on. Physics gets by with a mixture of mathematics and intuition, but has not been axiomatised to anyone’s satisfaction (I think this was even one of Hilbert’s problems – to axiomatise physics – perhaps you could try to claim the prize?)

(Similarly classical mechanics is time reversible but our epistemic access is different – we know the past, not the future and we are usually only interested in coarse-grained features etc – this is well-known to be enough to deal with reversibility/irreversibility ‘paradoxes’ and the second law)

This ‘hierarchical physical-statistical’ modelling point of view is also nicely discussed in a statistics book that has plenty of ‘physical’ or ‘causal’ modelling examples:

‘Statistics for spatio-temporal data’:

https://www.wiley.com/en-us/Statistics+for+Spatio+Temporal+Data-p-9780471692744

They even discuss the connections between the epistemically distinguished conditional distributions and DAGs in section 2.4. Some screenshots here:

https://twitter.com/omaclaren/status/1084250405884723206?s=21

Finally, here’s a recent example from my own work using hierarchical modelling of this sort to combine physical and statistical models:

https://arxiv.org/abs/1810.04350

I found the frameworks discussed by Gelman, Berliner, Cressie etc much easier to relate to such a setting, where we have a geophysical model based on PDEs, than discrete DAGs etc, but perhaps it would be possible to use your ideas to do similar things?

Do you have any pointers for doing this sort of thing (geophysical inverse problems) using DAGs etc?

Essentially, +1 to everything ojm wrote.

I think another valid way of expressing the aims of Bayesian hierarchical models (BHM) is that they enable us to wield conditional probability to build *generative models* of our data, that can readily embody substantive scientific assumptions/models/theories, thus naturally including “causal” models. In many applications at the cutting edge of science, we are not really interested in quantities like “average causal effects” – rather we want to fit, expand and/or compare generative models that provide greater scientific insight (i.e. the parameters have meaningful scientific interpretation), and/or in some cases forecast accuracy (i.e. we care a great deal about predictive ability).

Some twitter convos with the causal folk have led me to realise that they are actually far closer to data analysts than to generative modellers.

They basically have observed empirical data and some qualitative causal info and want to construct estimators based on the empirical data that are valid for some aspect of the causal pathway regardless of the unknown details.

I think they require samples to be representative of the unknown confounders etc, ie collected under the relevant regime, they just don’t have access to the values.

Meanwhile the hierarchical modellers I know are building explicit fully specified generative models that don’t require any data a priori – they can always be simulated. When data becomes available they crank the handle and get an updated generative model.

Because the model is generative and based on mechanisms any query can be directly simulated to represent the current state of knowledge about some pathway etc. On the other hand the causal folk directly use observed data to estimate eg P(Y|X).

So, weirdly, I think they are closer to traditional stats than many ‘generative’ modellers!

I think one of the most interesting and timely cases for causal inference has been the marijuana / schizophrenia debate. Unlike the aspirin/headache debate which never happened, for a long time statistics has shown a correlation between schizophrenia and marijuana use. Did marijuana cause schizophrenia (a disease named in the 19th century in Germany, not in the grips of a marijuana plague)?

So one group did a causal inference study. And they showed that, yes, marijuana did have a slight causal relation to schizophrenia.

Then another group did a similar study but looking at arrows going in *both* directions. And they found that the schizophrenia leading to marijuana use causal arrow was so strong (quantitatively) as to make them look at the arrow going the other way as a possible error.

So causal inference eventually showed the way to what most folks knew already, but that years of stats and ‘an association found’ hadn’t been able to pinpoint. :)

study 1 (biased arrow, but indeed a strong look at the causal link from marijuana to schizophrenia): https://www.nature.com/articles/mp2016252

study 2 (both arrows and mendelian randomization, showing what may be the use of marijuana to *relieve* issues with schizophrenia): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5341491/

I do agree that the book’s savaging of statistics is at cross-purposes with trying to get people to use the techniques. Thanks for the blog review.