Theiss Bendixen and Benjamin Grant Purzycki wrote this book. He writes:
The website holds:
– All data and code used in the book
– Free sample chapters
– Bonus material
These aren’t quite the same methods for causal inference that I’m inclined to use (for my own approach, see chapters 18-21 of Regression and Other Stories), but their presentation is clear and has code, and it’s always good to see another perspective.
Thanks a ton for spotlighting the book, it means a lot!
I guess it’s true that we do take a slightly different approach to many leading causal inference books — more epidemiology than econometrics could be a way to sum it up?
This is probably clearest in our emphasis on g-methods, which we take as the core of a causal inference workflow, which then ties into how we approach subgroup analysis, estimands, mediation, poststratification, imputation, etc.
We felt this perspective was not well covered in book-form from an applied angle.
Anyway, we obviously do owe a lot to your work and we think of the book as a launch pad to the more specialised texts in causal inference out there, including those of yours, et al.
I’m pretty sure Andrew meant your approach was different than his and Jennifer’s (and now Aki’s) presentation of causal inference in their regression books. They follow Rubin’s potential outcome framework (https://en.wikipedia.org/wiki/Rubin_causal_model), not Pearl’s directed acyclic graphical (DAG) modeling framework.
I thought the web site was the book, but it’s clearly billed as a “companion” rather than the book in a subtitle. I was confused, because I couldn’t make heads or tails of section 2 on the companion website because there’s no text, just code and section titles and the only citation is to the tidyverse for the plotting code. Maybe Andrew can explain the fork, the pipe, and the collider to me (those are the section titles), since he has a blurb describing the book itself as “clear and readable.” I’ve never understood any of the graphical model language around causal inference. I should probably work through Thomas Richardson’s codex for intertranslating the notations (https://www.stats.ox.ac.uk/~evans/uai13/Richardson.pdf).
Thanks! Yes, that link is the companion website.
The publisher’s page is here: https://collegepublishing.sagepub.com/products/the-data-analysts-guide-to-cause-and-effect-1-298063
And as for the DAG language, we do our best at introducing those in Chapter 2, which is one of the free chapters here: https://github.com/tbendixen/dag-book/releases/download/ch1-2/151702_book_item_151702.pdf
As for Pearl vs Rubin, we try to be pragmatic and draw from both, which might annoy some of the purists out there…
It’s not that I don’t understand how DAGs represent probability functions or don’t understand causal notions like confounding, it’s that I don’t know the lingo used in the causal DAG literature like “pipe”, “fork”, and “collider” and “do-operator” and so on. For example, I could have written down the graphical models for the fork, pipe, and collider based on the code. So it’s mostly just a language gap, like someone’s talking about stats in Greek or some other language I don’t speak.
Stan’s design was originally based on BUGS, which is a graphical modeling language. BUGS used the DAG to generate Markov blankets for variables to design conditional samplers for generalized Gibbs. We compiled the graphical model to a density function and never used the graphical model part. Then we realized we had a more general language that could allow recursive functions, local variables, reassignment, etc.
Chapter 2 says, “a DAG plots the assumed causal relationships between variables.” Isn’t the DAG just a representation of our conditional independence assumptions? Later in the paragraph, you write, “DAGs guide subsequent analytic strategies for recovering the causal effects of interest”. What’s a “causal effect”? Do you mean estimating the variables in the graphical model? Or do you mean estimating which variables causally influence other variables?
I see why Andrew likes it—you use the term “workflow” :-)
“So it’s mostly just a language gap, like someone’s talking about stats in Greek or some other language I don’t speak.”
Have you written anything to explain how it is mostly just a language issue? I really appreciate the perspective you take on stats. Maybe it’s because you spent all that time with the CMU philosophers back in the day. :-)
Good guess, but that wasn’t the influence. I co-taught philosophy of language with Teddy Seidenfeld (a Bayesian philosopher of science). Although I ate lunch with them most days, I never had any technical discussions with Clark Glymour, Peter Spirtes, or Richard Scheines when they were working on Tetrad.
I haven’t written about how it’s a language issue, but Thomas Richardson was a student of that group at Carnegie Mellon in our department’s logic and philosophy of science Ph.D. program. He’s now a stats professor at U. Washington and wrote a guide on how to inter-translate DAG-based causal inference and potential outcomes.
My own influence in thinking about everything as a language comes from the fact that I focused on logic and 20th century philosophy of mind/language as an undergraduate and then worked on both natural language semantics and programming language semantics as a grad student and professor. Edinburgh (where I was a student) and CMU both had great groups in programming language theory. Gordon Plotkin and Robin Milner were at Edinburgh (there’s a floor-to-ceiling photo of Milner at the Polytechnique—they love PL theory in France). The ML language that led to the language OCaml in which Stan’s parser was written was being developed while I was at Edinburgh by Bob Harper and crew; Dana Scott, John Reynolds, and Frank Pfenning were at CMU. And I built my first open-source language, the Attribute Logic Engine, to prove to everyone that the system defined in my monograph actually worked (I thought it was obvious from the math at the time!).
“Isn’t the DAG just a representation of our conditional independence assumptions?”
Yes, that’s a more technical way of saying it. The DAG is “causal” to the extent we’re willing to commit to those assumptions/the assumptions reflect the DGP.
“What’s a “causal effect”?”
That’s part of the research question/estimand and cannot be read off the DAG. The same DAG can lead to different analytical strategies (e.g. covariate adjustment sets) depending on the estimand in question. In Ch. 2, we give the example of mediation, where the “total” and “direct” effects require different adjustment sets and there are similar examples throughout.
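To make the total-versus-direct distinction concrete, here is a minimal simulation sketch in Python. It is not taken from the book: the linear model, the coefficients (0.5, 0.8, 0.3), and the helper names are assumptions chosen purely for illustration. The DAG is X -> M -> Y plus a direct X -> Y arrow, with no other confounding, so leaving M out of the regression targets the total effect while adjusting for M targets the direct effect.

```python
import random
from statistics import mean

def slope(x, y):
    """OLS slope of y on x (one predictor plus intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(y, x):
    """What is left of y after regressing it on x."""
    b, mx, my = slope(x, y), mean(x), mean(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]

def adjusted_slope(x, y, z):
    """Slope of y on x adjusting for z (residual-on-residual regression)."""
    return slope(residuals(x, z), residuals(y, z))

rng = random.Random(2024)
n = 20_000

# Assumed data-generating process (hypothetical coefficients).
X = [rng.gauss(0, 1) for _ in range(n)]
M = [0.5 * x + rng.gauss(0, 1) for x in X]                       # X -> M
Y = [0.3 * x + 0.8 * m + rng.gauss(0, 1) for x, m in zip(X, M)]  # X -> Y, M -> Y

total = slope(X, Y)               # adjust for nothing: targets 0.3 + 0.5 * 0.8 = 0.7
direct = adjusted_slope(X, Y, M)  # adjust for M: targets 0.3

print(f"total effect estimate:  {total:.2f}")
print(f"direct effect estimate: {direct:.2f}")
```

Same DAG, same data, two different adjustment sets, because the two estimands ask different questions.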
And yes, workflow is important!
Wow, small world. Teddy was my dissertation advisor. Prior to that I was working with people on the logic and type theory side of the department, which was basically under Dana at the time. I know of Thomas Richardson, but his time in the department predates my own time there. I’ll see if I can find that guide that you mentioned.
Without getting into arguments about what “causality” means or how to estimate it, some of the things I like about DAGs are that they force you to lay out your assumptions, and even without explicitly defining conditional probabilities you can derive a series of conditional independence statements that you can then test (and if not true that means the data you have and the DAG you have drawn are not consistent). If you want to know how x affects y when there are confounders, then the DAG, through various criteria, can tell you what set of variables (possibly non-unique) will properly adjust for confounders.
If you want an example, see https://dagitty.net/dags.html – it should pull up a default DAG with the stated properties.
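For readers who would rather see the mechanics than click through a tool, the two ideas above (reading implied conditional independencies off the graph, and checking whether an adjustment set is valid) can be sketched in a few lines of Python. This is an illustrative sketch, not DAGitty's implementation: the example DAG and all function names are hypothetical. It uses the standard moralization criterion for d-separation and Pearl's backdoor criterion.

```python
from itertools import combinations

# Hypothetical DAG, written as node -> set of parents.
# Z confounds X and Y; X affects Y only through M.
parents = {
    "Z": set(),
    "X": {"Z"},
    "M": {"X"},
    "Y": {"M", "Z"},
}

def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen

def d_separated(parents, x, y, given):
    """True if x and y are d-separated by `given` (x, y not in `given`).

    Moralization criterion: keep only ancestors of {x, y} and `given`,
    connect co-parents, drop arrow directions, delete `given`, and check
    whether x can still reach y.
    """
    keep = ancestors(parents, {x, y} | set(given))
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = parents[child]          # parents of a kept node are kept too
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    blocked = set(given)
    seen, stack = set(), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        if v in seen or v in blocked:
            continue
        seen.add(v)
        stack.extend(nbrs[v] - blocked)
    return True

def descendants(parents, node):
    """All nodes reachable from `node` by following arrows forward."""
    out, stack = set(), [node]
    while stack:
        v = stack.pop()
        for child, ps in parents.items():
            if v in ps and child not in out:
                out.add(child)
                stack.append(child)
    return out

def satisfies_backdoor(parents, x, y, adjust):
    """Backdoor criterion for the effect of x on y."""
    if set(adjust) & descendants(parents, x):
        return False
    # Remove every arrow leaving x, then ask whether `adjust` blocks the rest.
    cut = {child: ps - {x} for child, ps in parents.items()}
    return d_separated(cut, x, y, adjust)

# A testable implication of this DAG: Z is independent of M given X.
print(d_separated(parents, "Z", "M", {"X"}))
# Candidate adjustment sets for the effect of X on Y.
print(satisfies_backdoor(parents, "X", "Y", {"Z"}))
print(satisfies_backdoor(parents, "X", "Y", set()))
```

If the data contradict an implied independence such as "Z independent of M given X", the drawn DAG and the data are not consistent, which is the check described above.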
I would add that forks, pipes and colliders are worth learning because all DAGs, no matter how complicated, are made up of those three components, and things like the do-calculus, and backdoor and frontdoor adjustments all are based on that.
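As a rough illustration of the three building blocks, here is a small Python simulation. The linear-Gaussian structure, unit coefficients, and helper names are assumptions for the sake of the example and are not the companion site's code. In each case Z is the middle variable, and the question is how the X-Y association changes once Z is adjusted for.

```python
import math
import random
from statistics import mean

def corr(a, b):
    """Pearson correlation."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly adjusting both for c."""
    rab, rac, rbc = corr(a, b), corr(a, c), corr(b, c)
    return (rab - rac * rbc) / math.sqrt((1 - rac**2) * (1 - rbc**2))

rng = random.Random(7)
n = 20_000

def noise():
    return rng.gauss(0, 1)

def fork():      # X <- Z -> Y : Z is a common cause
    Z = [noise() for _ in range(n)]
    X = [z + noise() for z in Z]
    Y = [z + noise() for z in Z]
    return X, Y, Z

def pipe():      # X -> Z -> Y : Z is a mediator
    X = [noise() for _ in range(n)]
    Z = [x + noise() for x in X]
    Y = [z + noise() for z in Z]
    return X, Y, Z

def collider():  # X -> Z <- Y : Z is a common effect
    X = [noise() for _ in range(n)]
    Y = [noise() for _ in range(n)]
    Z = [x + y + noise() for x, y in zip(X, Y)]
    return X, Y, Z

results = {}
for name, simulate in [("fork", fork), ("pipe", pipe), ("collider", collider)]:
    X, Y, Z = simulate()
    results[name] = (corr(X, Y), partial_corr(X, Y, Z))
    print(f"{name:9s} corr(X,Y) = {results[name][0]:+.2f}   "
          f"corr(X,Y | Z) = {results[name][1]:+.2f}")
```

In the fork and the pipe, X and Y are associated until Z is adjusted for. In the collider it is the other way around: X and Y are independent, and adjusting for Z creates an association.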
If you want to see nicely worked examples using this in a Bayesian context, look at McElreath’s lectures on YouTube – he does a much better job of explaining this and how to use it in analysis than I ever could.
Thanks, this perspective aligns quite well with the approach we take in the book.
I’m never sure what people mean by “DAG.” Technically speaking, the directed-acyclic graph is just a collection of nodes representing variables and arrows representing conditional dependence. But by itself, the DAG doesn’t describe a statistical model. Just the conditional independence assumptions. You need to say what the conditional distributions are to get a complete description. Kruschke tries to merge all this notation in his book, but I find it very hard to read. I find just using ~ to define a DAG the way BUGS did to be much easier to understand in that it combines the DAG structure with the distributional assumptions in a single notation.
I wasn’t being very clear. I understand how DAGs work, I just don’t know the jargon used for them by people doing Pearl-style causal inference. For example, BUGS used the Markov blanket of nodes in a DAG to define generalized Gibbs samplers (and JAGS and presumably NIMBLE followed suit). Spiegelhalter et al. cited Pearl as the inspiration for their work on BUGS, which turned the DAG idea into a workable declarative programming language. But the cool part is that DAGs can be used even if you’re not doing causal inference—there’s nothing intrinsically causal about the DAGs in BUGS other than the conditional independence assumptions.
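For anyone unfamiliar with the term, the Markov blanket of a node is its parents, its children, and its children's other parents; a node's full conditional depends only on that set, which is why it is useful for building Gibbs updates. A minimal Python sketch follows. The toy hierarchical model and the function name are hypothetical, and this is not how BUGS or JAGS is actually implemented.

```python
# Hypothetical model, written as node -> set of parents, mirroring
# BUGS-style statements such as
#   theta ~ normal(mu, tau)
#   y     ~ normal(theta, sigma)
parents = {
    "mu": set(),
    "tau": set(),
    "sigma": set(),
    "theta": {"mu", "tau"},
    "y": {"theta", "sigma"},
}

def markov_blanket(parents, node):
    """Parents, children, and the children's other parents of `node`."""
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | co_parents) - {node}

blanket_theta = markov_blanket(parents, "theta")
blanket_mu = markov_blanket(parents, "mu")

print("theta:", sorted(blanket_theta))
print("mu:   ", sorted(blanket_mu))
```

Nothing in this computation is causal; it only uses the conditional independence structure encoded by the graph.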
Another thing I find unsatisfying about the DAG excitement is that it seems to steer people away from thinking about processes and dynamics.
Here’s an example. Twenty years ago I was working briefly with some biologists who were studying when fruit fly maggots attempt to escape a hypoxic environment (wild type, and various mutants). They were doing some things like t-tests on median time to escape or some such thing. I saw they had this full distribution of escape times under various conditions and recognized a chance to fit a dynamic model. In the model, the concentration of oxygen caused the production and build-up of some signal molecule, and the level of the signal molecule affected the probability per unit time that they would begin escape (and this probability per unit time was affected by their genotype).
We devised some experiments where we placed maggots in cages with normal air, and then began releasing I think it was CO2 or maybe N2 into the cage, while reading the O2 level through time, and then whenever a maggot began its escape procedure we recorded the time of that onset until all the maggots were doing escape, or 15 minutes was up, whichever came first (or something like that).
Now in this scenario O2 level causes signal molecule buildup rate, and signal molecule *level* causes probability per unit time to escape, and then time passage accumulates probability to initiate escape.
At every moment of time there’s a DAG that can move you from this moment in time to the following moment in time, with changes in levels… but there’s an infinite number of them, and this is just what a differential equation encodes. Anyway, I wrote some differential equations in Maxima, was able to solve them analytically, and got a closed-form expression for the *distribution* of time to escape initiation. We then fit the parameters of this distribution by maximum likelihood and had an inferred curve for the level of the signal. This signal level was something that they had a candidate molecule for, so they could potentially try to measure that molecule and confirm the mechanistic theory.
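To give a flavor of what such a closed form can look like, here is a deliberately simplified Python sketch. It is not the model from the project described above; every modeling choice is an assumption made for illustration. Suppose hypoxia starts at t = 0, the signal accumulates at a constant rate a (so signal(t) = a*t), and the escape hazard is b*signal(t). With c = a*b, the survival function is S(t) = exp(-c*t^2/2). With observation cut off at 15 minutes, the log-likelihood is d*log(c) + sum(log t_i over escapes) - (c/2)*sum(t_i^2 over all maggots), where d is the number of observed escapes, so the maximum likelihood estimate is c_hat = 2*d / sum(t_i^2).

```python
import math
import random

rng = random.Random(1)

true_c = 0.05   # assumed a*b, in 1/minute^2 (hypothetical value)
cutoff = 15.0   # minutes of observation before the experiment stops
n = 5_000       # simulated maggots (far more than a real experiment)

def draw_escape_time(c, rng):
    """Inverse-CDF draw from the survival function S(t) = exp(-c * t^2 / 2)."""
    u = 1.0 - rng.random()   # in (0, 1], avoids log(0)
    return math.sqrt(-2.0 * math.log(u) / c)

raw = [draw_escape_time(true_c, rng) for _ in range(n)]
times = [min(t, cutoff) for t in raw]     # censored at the cutoff
escaped = [t <= cutoff for t in raw]      # True if escape was observed

# Closed-form MLE under right censoring: c_hat = 2 * d / sum(t_i^2).
c_hat = 2.0 * sum(escaped) / sum(t * t for t in times)

print(f"true c = {true_c}, estimated c = {c_hat:.4f}")
# The fitted hazard c_hat * t is the inferred signal curve, up to the
# unknown scale b that links signal level to hazard.
```

A richer model would let the measured O2 trace drive the production rate and would let genotype scale the hazard, but the workflow is the same: solve for the escape-time distribution, then fit it to the full set of observed times.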
They didn’t have any money, and I was just starting my PhD and there wasn’t any Stan or anything, so we never completed the project, but it’s a project I wish I could revisit with all the great tools we have available today.
Anyway, I don’t find DAGs very helpful for this dynamic type of model, and it’s some of my favorite modeling, so I haven’t spent a ton of time learning the DAGgy type of model description language, but perhaps someone here can enlighten me on why I should?
You can drop anything you want into the DAG framework by just encapsulating it as a function. For example, if you have an ODE in your model, solutions play the same role as the linear predictor X * beta in a linear regression. So you need a multivariate node you can assign the solution of the ODE to deterministically. I agree this isn’t enlightening per se, but it does let you code this kind of stuff. There was a real tension in BUGS/JAGS, and there remains a real tension in PyMC and NumPyro, around graphical models and time series and, in general, anything multivariate. Unlike Stan, they don’t compose constraints on parameters plus a density; everything has to be a node in a graph. So you need all of your distributional stuff built into the system or easily added, because it’s hard in that setup. (Though you can always use the zeros trick from BUGS to hack things with a dummy Poisson.)
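A small Python sketch of that point: the likelihood code does not care whether the deterministic mean comes from a linear predictor or from an ODE solver. Everything here is illustrative. The decay ODE dy/dt = -k*y, the Euler solver, the parameter values, and the function names are assumptions, not any particular package's API.

```python
import math
import random

def linear_predictor(xs, alpha, beta):
    """mu = alpha + beta * x: the deterministic node in a linear regression."""
    return [alpha + beta * x for x in xs]

def ode_predictor(ts, y0, k, dt=1e-3):
    """mu = solution of dy/dt = -k * y at increasing times ts (Euler steps)."""
    mu, t, y = [], 0.0, y0
    for target in ts:
        for _ in range(round((target - t) / dt)):
            y -= k * y * dt
        t = target
        mu.append(y)
    return mu

def normal_log_lik(ys, mu, sigma):
    """Sum of log Normal(y | mu, sigma); indifferent to how mu was computed."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (y - m) ** 2 / (2 * sigma**2)
        for y, m in zip(ys, mu)
    )

# Simulated data from the assumed decay process (hypothetical values).
rng = random.Random(3)
ts = [1.0, 2.0, 3.0, 4.0, 5.0]
true_k, y0, sigma = 0.3, 10.0, 0.2
ys = [y0 * math.exp(-true_k * t) + rng.gauss(0, sigma) for t in ts]

ll_true = normal_log_lik(ys, ode_predictor(ts, y0, true_k), sigma)
ll_off = normal_log_lik(ys, ode_predictor(ts, y0, 0.6), sigma)
ll_line = normal_log_lik(ys, linear_predictor(ts, 8.0, -1.2), sigma)

print(f"ODE mean, k = 0.3: {ll_true:.1f}")
print(f"ODE mean, k = 0.6: {ll_off:.1f}")
print(f"straight-line mean: {ll_line:.1f}")
```

In graphical-model terms, `mu` is a deterministic multivariate node whose parents are the ODE parameters, and `ys` is the stochastic node hanging off it.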
Where you get utility out of DAGs is when something you care about can be defined in terms of DAG structure, like extracting the proper conditioning of each variable for Gibbs sampling. That’s something we cannot do with Stan’s density-based model. At least not easily the way we built it.
It’s funny you should mention fruit flies. I spent Monday in the Princeton biology department talking to Stas Shvartsman and his lab about a fruit fly experiment. I wish they hadn’t shown me the wet lab or told me how they pipette the flies. We’re looking at building a predictive model of when they molt. It’s really creepy watching them do it sped up on video.
Daniel, I have similar ideas to you. DAGs are not used in physics modeling.
This post was useful for me:
https://statmodeling.stat.columbia.edu/2022/05/14/causal-is-what-we-say-when-we-dont-know-what-were-doing/
I read it and was like “ok, I will never use DAGs. I want to model the process every time”.
Bob, at the high level yes you’re right, I could drop the ODE into a DAG based programming framework. But to me the interesting part is what causes what *during the dynamic process* and that to me looks a little like
State at time t -> DAG -> state at time t+dt
which means the ODE isn’t just “a function” in a DAG for the overall complete process, but is rather some way to help structure the dynamic process model itself.
But then this DAG is just encoded as a functional form for the ODE in terms of “structural equations”. Other times I might use an agent based model, in which case the “DAG” is just encoded in terms of a computer function that alters the agents state from one time step to another.
I have yet to see a way where I benefit from DAGs in these process models.
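The "state at time t -> DAG -> state at time t+dt" picture can be written down directly as a step function, as in the Python sketch below. The dynamics and rate constants are hypothetical stand-ins (O2 flushed out exponentially, signal produced in proportion to the O2 deficit, hazard proportional to signal); they are assumptions for illustration, not the model from the maggot project.

```python
import math

# Hypothetical rate constants: signal production, hazard scale, O2 flush rate.
A, B, R = 1.0, 0.02, 0.5

def step(state, dt):
    """One time slice: every new value depends only on the previous slice."""
    o2, signal, cum_hazard = state
    return (
        o2 - R * o2 * dt,               # O2(t)            -> O2(t+dt)
        signal + A * (1.0 - o2) * dt,   # O2(t), signal(t) -> signal(t+dt)
        cum_hazard + B * signal * dt,   # signal(t), H(t)  -> H(t+dt)
    )

def run(t_end, dt=1e-3):
    """Unroll the per-slice update from t = 0 to t_end."""
    state = (1.0, 0.0, 0.0)   # normal air, no signal, no accumulated hazard
    for _ in range(round(t_end / dt)):
        state = step(state, dt)
    return state

o2, signal, cum_hazard = run(10.0)

# Analytic solution of the same dynamics, for comparison:
# o2(t) = exp(-R t), signal(t) = A * (t - (1 - exp(-R t)) / R).
exact_signal = A * (10.0 - (1.0 - math.exp(-R * 10.0)) / R)
p_escaped = 1.0 - math.exp(-cum_hazard)

print(f"signal at t=10: stepped {signal:.3f}, analytic {exact_signal:.3f}")
print(f"probability of having started escape by t=10: {p_escaped:.2f}")
```

The arrows inside `step` are the per-slice "DAG"; letting dt shrink turns the unrolled graph into the differential equation, which is the sense in which the ODE already encodes the structure.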
I think it’s absolutely fair to say that there’s nothing inherently causal about a DAG in and of itself — committing to a given DAG as representing a set of causal relationships requires assumptions about the data-generating process. I also agree that a DAG doesn’t represent a fully generative or a statistical model (e.g. functional relationships among variables are not encoded in a DAG).
But DAGs are still a useful tool for making certain assumptions about the DGP explicit, which in turn can help guide/justify a set of analyses. At least that’s the spirit in which we use DAGs in the book.
In case you think I was saying DAGs were useless, I agree that’s the right spirit in which to use them.