Causal inference: I recommend the classical approach in which an observational study is understood in reference to a hypothetical controlled experiment

Amy Cohen asked me what I thought of this article, “Control of Confounding and Reporting of Results in Causal Inference Studies: Guidance for Authors from Editors of Respiratory, Sleep, and Critical Care Journals,” by David Lederer et al.

I replied that I liked some of their recommendations (downplaying p-values, graphing raw data, presenting results clearly) and I am supportive of their general goal to provide statistical advice for practitioners, but I was less happy about their recommendations for causal inference, which was focused on what taking observational data and drawing causal graphs. Also I don’t think their phrase “causal association” has any useful meaning. A statement such as “Causal inference is the examination of causal associations to estimate the causal effect of an exposure on an outcome” looks pretty circular to me.

When it comes to causal inference, I prefer a more classical approach in which an observational study is understood in reference to a hypothetical controlled experiment.

I also think that the discussion of causal inference in the paper is misguided in part because of the authors’ non-quantitative approach. For example, they consider a hypothetical study estimating the effect of exercise on lung cancer and they say that “Controlling for ‘smoking’ will close the back-door path.” First off, given the effects of smoking on lung cancer, “controlling for smoking” won’t do the job at all, unless this is some incredibly precise model with smoking very well measured. The trouble is that the effect of smoking on lung cancer is so large that any biases in this measurement could easily overwhelm the effect they’d be trying to estimate. And this sort of thing comes up a lot in public health studies. Second, you’d need to control for lots of things, not just smoking. This example illustrates how I don’t see the point of all their discussion of colliders. If we instead simply take the classical approach, we’d start with a hypothetical controlled study of exercise on lung cancer, a randomized prospective study in which the experimenter assigns exercise levels to patients, who are then followed up, etc., then we move to the observational study and consider pre-treatment differences between people with different exercise levels. This makes it clear that there’s no “back-door path”; there are just differences between the groups, differences that you’d like to adjust for in the design and analysis of the study.

Also I fear that this passage in the linked article could be misleading: “Causal inference studies require a clearly articulated hypothesis, careful attention to minimizing selection and information bias, and a deliberate and rigorous plan to control confounding. The latter is addressed in detail later in this document. Prediction models are fundamentally different than those used for causal inference. Prediction models use individual-level data (predictors) to estimate (predict) the value of an outcome. . . ” This seems misleading to me in that a good prediction study also requires a clearly articulated hypothesis, careful attention to minimizing selection and information bias, and a deliberate and rigorous plan to control confounding.

The point is that, once you’re concerned about out-of-sample (rather than within-sample) prediction, all these issues of measurement, selection, confounding, etc. arise. Also, a causal model is a special case of a predictive model where the prediction is conditional on some treatment being applied. So I think it’s a mistake to think of causal and predictive inference as being two different things.

P.S. Long comment thread below, and I think I need to clarify something. I’m not saying that researchers should not use graphical models when doing causal inference. Graphical models can be useful, and, in any case, many statistical methods that do not explicitly use graphical models, can be interpreted as using graphical models implicitly. If people want to argue in the comments about the utility or importance of graphical models for causal inference, that’s fine: just be clear that this is not the point of the above post. The above post is responding to a very specific article that gives what I see as some misleading advice. My problem with the article is not that it recommends the use of graphical models; rather, my problems are some specific issues stated above.

183 thoughts on “Causal inference: I recommend the classical approach in which an observational study is understood in reference to a hypothetical controlled experiment

    • Ck:

      What I wrote above was, “we’d start with a hypothetical controlled study of exercise on lung cancer, a randomized prospective study in which the experimenter assigns exercise levels to patients, who are then followed up, etc., then we move to the observational study and consider pre-treatment differences between people with different exercise levels.” This does not imply that you should throw instrumental variables in as predictors in a regression model.

      • Andrew,
        I think CK has very good point here. If you are not using DAGs explicitly, you can’t tell whether your
        covariate is an IV or one of those ” differences between the groups, differences that you’d like to adjust for in the design and analysis of the study.” There is nothing in your data that would warn you against “throwing instrumental variables in as predictors in a regression model.” (This is a theorem, not conjecture).

        This is an excellent opportunity to tell readers that EXPLICIT use of DAGs is sometimes necessary, for it conveys information that cannot be obtained from any of the statistical methods that do not explicitly use graphical models.

        How about making history by being the first traditionalist to admit it.

  1. When I read this paper, I understood the examples like the hypothetical exercise-lung cancer study as very stylized illustrations of the approach they promote, and in that sense I did not find them so misleading. But maybe other people may read them and come to the wrong conclusion that this is all you need to do to run a proper causal inference study. Also, my ideal prediction study looks very similar to what you describe, but there are may prediction studies around that strive for other ideals (and not just in the ML field), so maybe those prediction studies are what the authors had in mind.

  2. I recently went to a Causal Inference short course and got hit with a fire hose of this stuff. I’m still digesting what they taught, and am certainly not an expert, but for now I’ll say:

    – I agree that the logic is circular: they start out with assumptions of causality, and then proceed to do “causal inference.” A better name might be “statistical inference given causal assumptions” in the sense that they are estimating statistical parameters for models that require causal (and other) assumptions for validity.

    – And yet, I still find DAGS useful to reason through these sorts of issues explicitly. Statisticians and econometricians use this sort of reasoning often, and DAGs seem like a useful tool for doing so.

    – They treat the fact that we can (typically) only observe one outcome per subject (for whatever treatment they received) as a missing data problem for the potential outcomes, and propose methods that estimate effects under conditional exchangeability.

    Also my understanding is that they are trying to back into the equivalent data for a randomized trial, given their beliefs of the casual graph for the observational data. So I think at a high level, you have a similar goal even if your approach is different. They write: “We encourage authors to design observational studies that emulate the clinical trial they would have designed to answer the causal question of interest.”

    Compare that with your statement: “If we instead simply take the classical approach, we’d start with a hypothetical controlled study of exercise on lung cancer, a randomized prospective study in which the experimenter assigns exercise levels to patients, who are then followed up, etc., then we move to the observational study and consider pre-treatment differences between people with different exercise levels. This makes it clear that there’s no ‘back-door path’; there are just differences between the groups, differences that you’d like to adjust for in the design and analysis of the study.”

    I think the difference here is that the DAG approach gives the researcher a way to explicitly state their assumptions about what causes what within a system and which are confounding variables. Where a variable is placed in a causal system will affect how you deal with it. Also, including explicit assumptions about unmeasured confounding variables helps the researcher to choose an appropriate model for it, instead of just adjusting for the observed differences. And of course a researcher can alternate between DAGs and estimation as their understanding evolves.

    I find DAGs most useful in thinking through plausible scenarios for unmeasured confounding variables. You can see examples of using DAGs in videos from Hernan on Causal DAGs for an edx course:
    https://www.youtube.com/channel/UCqqDqIGqOWVsnnltMbUlxRw/playlists

    As usual, given their assumptions, the math works out just fine for their methods, as far as I can tell. And different methods come with stronger or weaker assumptions. It seems to me that Bayesian networks (which use DAG structures directly in estimation and are advocated by Pearl) are a more promising framework for estimating parameters in DAGs, rather than things like IPW or 2SLS, which don’t strike me as particularly robust. Unfortunately, estimation with mixed data types (discrete and continuous) in BNs is computationally difficult, and I can’t find any implementations that handle it.

    There are also ways to translate some DAGs into (non-BN) Bayesian models, which are also appealing because they are fit as a single coherent model.

    • Dave,
      There is nothing circular in addressing the task of going from qualitative to quantitative causal relationships and calling it “causal inference”. See http://bayes.cs.ucla.edu/BOOK-2K/jw.html for a somewhat entertaining exposition of this perceived circularity, as well as the difficulties that traditional statisticians have had in cutting the umbilical chord to mother-stat and speaking causally, naturally, w/o referencing RCT’s or “hypothetical controlled experiments”.
      Most of the progress in modern causal inference has been accomplished only after that umbilical chord was snapped. See https://ucla.in/2KYvzau
      judea

        • No. It means speaking about causal relations as causal relations w/o reference to RCT, and speaking about statistical relations as statistical relations whenever they can be expressed in the language of traditional statistics, namely, probability theory.

      • Thanks for the reply. Your first link makes my point exactly. The term “causal inference” implies (to me, at least) that we are inferring causality. But as your link states, first causality assumptions are made, then statistics are estimated, and then the same causal assumptions return in the interpretation of those statistics. Hence, statistical inference given causal assumptions. That doesn’t contradict the idea that the conclusions follow logically from the premises; it’s just trying to separate premises from inference.

        • Dave,
          The strong sense of “causal inference”, namely inferring causation from assumption-less data is a mathematical impossibility, like squaring the circle, so it hard to believe that people who used the label “causal inference” use it in this strong sense. For example, the title of this blog is “statistical modeling, causal inference, and more..” and no one would accuse Andrew for chasing after a mathematical impossibility. Conclusion: causal inference is and always has been inference from causal assumptions to non-trivial causal conclusions with the help of data.

          As to circularity, things are not as circular as you describe them. We start with (1) a causal quantity Q that we wish to estimate and (2) a set C of causal assumptions that we are willing to make. Neither one can be expressed in the language of statistics. We now ask what statistical quantity Es we should estimate, and how, such that the result will coincide with Q, provided the assumptions were valid.If we find such an “estimand” Es, we forget about the causal assumptions C and use Es as a recipe to interrogate the data and estimate Es. The set S of causal assumption does NOT return back once we have Es. It returns only when someone asks: How sure are you that Es equals Q? To which we answer: As sure as we are in the validity of C. This process of going from C to Es is what we call “causal inference”, or “inference engine” in https://ucla.in/2HI2yyx (fig. 2), where the notions of Q, Es and C are exemplified and illustrated.

          In short, I do not believe your suggested title ““statistical inference given causal assumptions” accurately describes this process, because this title overlooks the fact that the aim of the exercise is not a statistical quantity but a causal quantity Q
          that has not spelling in the statistical alpha-bet.

          Another interesting curiosa, only a tiny fraction of researchers who think that causal and predictive inferences are not two different things can actually derive an estimand Es from a given set C of assumptions. Science thrives on distinctions.

        • Judea, suppose a researcher has a “data generating process” model. A simple example, some force is applied to a bolt through a load cell, the sensor generates a voltage, then a noisy amplifier amplifies the voltage and a volt meter reads off the voltage.

          the causal assumptions are clear, the force causes the electrical events.

          There are some unknown quantities, such as the load cell sensitivity, amplifier gain, and the noise amplitude.

          A Bayesian model builder given a mathematical equation for their data generating process can infer statistical quantities such as prior predictive probability for the voltage unloaded and after placing 1000 lbs on the bolt, with some data including experimental data of placing known weights on the load cell, can also infer the posterior probability of the amplifier gain and the sensitivity and noise, and soforth. from that posterior they can get a posterior predictive distribution for the voltage if they place say 2000 lbs on the bolt…

          At no point will they ever be confused that maybe applying some voltage might cause a crushing force on the bolt… because it’s a very simple system and the causal assumptions are clear.

          on the other hand, besides algebra and basic logic, all the calculations are Bayesian probability. What in your opinion is missing from this example that would lead researchers to be misled as you suggest?

        • Daniel,

          From your description of the problem you simply have samples from the experimental distribution and are making inferences regarding parameters of the experimental distribution. This is not a causal inference problem.

          To be more precise: if you had knowledge of the true joint probability distribution, would you still have any inferential problems to solve? If you no, this is not a causal inference problem, that’s vanilla statistical inference.

          If yes, then something is missing on the description of your problem. How do decide the correct functional from the observed data that matches the target quantity of interest?

        • what would be the voltage on the voltmeter if I placed 2000 lbs on the apparatus… this isn’t causal inference? it’s identical to “how long would my headache last if I took aspirin” in general form

        • “what would be the voltage on the voltmeter if I placed 2000 lbs on the apparatus… this isn’t causal inference?”

          Daniel, yes that’s a causal question, but that’s trivially solved in the problem you described.

          That’s why I asked: if you had full knowledge of the true joint probability distribution (no sampling uncertainty), would you, as the Bayesian statistician, be of any help there?

        • Well, I’m not trying to complicate the discussion here with a lot of complicated description of generating processes, the point is I’m not a “Bayesian Statistician” I’m a mathematical modeler using Bayesian methods. I first sit down and describe some causal process, maybe it’s pharmacokinetics, or maybe it’s a force measurement instrument, or maybe it’s a biological interaction between immune cells and infected cells and healthy cells… whatever. then when I realize I don’t have knowledge of the correct numbers to plug in to get accurate causal predictions, like “what would be the voltage on the instrument if I put this load” or “what would be the viral load 1 day, and 1 week after administering this drug, or what would be the trajectory of the concentration of the drug in the liver from time t=0 to time t=12hr … so I do some Bayesian model fitting based on some experimental results… and I arrive at my model.

          What part of this is missing insights from Pearl’s books for example? Where would people like me go wrong that would require me to cut umbilical cords etc?

        • Put another way, model building uses formalisms. For example the formal ideas of real functions of n real variables… Or set theory, or formal languages such as say Julia, R, Stans modeling language, or algebra or what have you, of course all of these formalism are useful. If I have nothing but set theory i can theoretically construct all the rest, but if course practically speaking i get much farther programming in R than in Intel x86-64 machine language… So each formalism is useful for making certain things easy or possible or whatever. But as a model builder I’m already using algebra, calculus, ODEs, PDEs, set theory, function theory, functional analysis, approximation theory, functional programming, numerical analysis, and of course probability theory. What is missing that I should clamor to buy Pearls latest book. What does it enable me to do easily that algebra, non-standard analysis, Physics, Agent Based Modeling languages, Julia/R etc are not providing me? Honest question.

        • My view is that DAGs are a form of naive or ‘folk’ physics.

          Daniel is used to expressing causal assumptions in the usual language of post Aristotle science – conservation laws and constitutive equations.

          Pearl et al have developed a form of Aristotlean physics where variables ‘listen’ to each other to ‘set’ their values, rather than say spatiotemporal processes subject to conservation laws and constitutive assumptions.

          So instead of say modelling the *motion* of the planets based on structural assumptions like ‘momentum is conserved’ and constitutive assumptions like ‘gravitational force between two bodies is proportional to the inverse square of the distance’ etc we get stuff like ‘rain causes mud’.

          Now I think folk physics is probably useful on messy areas like sociology, it’s kind of ridiculous the extent to which people in this area ignore other modelling frameworks and think ‘variables listening to each other’ is a radical improvement on these.

        • Which reminds me of the time I was taking a fluid dynamics course with a friend majoring in computer science.

          Watching him try to solve a physics problem using computer science thinking was pretty fascinating: ‘this goes up so this goes down so…hmm I’ll assume A and try to deduce B…hmm’.

          The ‘correct’ approach to the problem was instead to ask ‘what is conserved?’ and ‘what constitutive assumptions can I make?’.

          I imagine him loving DAGs, but I can’t imagine them helping him solve a fluid dynamics problem.

        • ojm. I think you are on to something, and I think that when you combine it with my insight about processes and equilibrium you get a simplified kind of representation of many useful relationships. For example the gas law PV=NkT. if you inject fuel into a diesel cylinder you get a decidedly non equilibrium flame front… but within a few milliseconds the temperature of the interior of the cylinder equilibrates sufficiently that the gas law holds… so if you measure things at the right time scale you can pretend that there is an instantaneous relationship between variables and leave out the time dependence, but then you need to recover the causality direction… and this is what do calculus seems to accomplish… to my untrained eye.

          on the other hand if you model causality as a time dependent process… do calculus doesn’t seem to help you much also to my untrained eye (untrained in do calculus)

        • Ironically the do calculus can’t handle the ideal gas law either.

          Do calculus seems to exist in a sort of limbo between properly dynamic and fully static.

          The ideal gas law is problematic for do calc because it is an invariant relationship between multiple variables, ie f(P,V,T) = 0 say.

          DAGs require one of these to be solved for eg

          V = f(P,T).

          But in ideal gas experiments you can also consider varying volume and temperature to see how pressure responds ie

          P = f(V,T).

          One hack I’ve seen from causal inference folks to try to deal with this is to try to thus make three copies of the same relationship!

        • > Ironically the do calculus can’t handle the ideal gas law either.

          Neither can it handle Pythagoras’ theorem…

          Why would it be problematic that an experiment where you act in one variable is described differently than an experiment where you act in another variable?

          Of course it’s true that if you have state functions or conserved quantities (or Euclid’s postulates) you don’t need to bother with causal models and do-calculus. You don’t need statistics at all!

        • Re don’t need statistics at all.

          That’s true – but DAGs are a competing causal modelling formalism for describing physical systems under different experimental manipulations.

          The ideal gas law is a great achievement and summarisation of much experience eg Boyle’s law etc before. In comparison DAGs seem inelegant at best.

        • Would you rather go back to

          Boyle’s law, Charles’s law, Gay-Lussac’s law, Avogadro’s law etc ?

          These (first three in particular) are replaced by the single ideal gas law.

        • > DAGs are a competing causal modelling formalism for describing physical systems under different experimental manipulations

          Is there really a competition here? We agree that DAGs are not the source of all knowledge and using them to make up a causal model when a precise physical model for an experiment is available will produce disappointing results.

          Regarding the ideal gas law, I guess part of its elegance comes from the “ideal” part that makes it objectively worse than the less elegant (?) van der Waals equation. And statistical mechanics is an even greater achievement!

          I don’t see the point in comparing “the ideal gas law” with “causal models”. If you think that “the ideal gas law” is better than “causal models” I am not going to tell you that it’s the other way around. But if you claim that “the ideal gas law” is better than “statistical mechanics” I will. :-)

        • > Is there really a competition here? We agree that DAGs are not the source of all knowledge and using them to make up a causal model when a precise physical model for an experiment is available will produce disappointing results.

          I’m glad we agree – as I think would people like Andrew and Daniel.

          BUT – when Andrew makes comments to this effect eg that he uses a system of ODEs to capture a causal model and combines that with Bayesian inference he seems to get ‘YOU DON’T USE DAGS SO YOU DON’T DO CAUSAL MODELLING!’.

          I also get ‘but DAGs are nonparametric so they can express anything’ etc.

          Basically computer science types and some epidemiologists who seem to think causal modelling didn’t exist until DAGs or structural functional equations etc came along, but who also don’t want to, say, predict the motion of the planets or analyse an enzyme kinetics reaction.

          Despite all this, I still recommended the book of why to someone just yesterday! Hopefully I don’t create another one of these monsters ;-)

        • > when Andrew makes comments to this effect eg that he uses a system of ODEs to capture a causal model and combines that with Bayesian inference he seems to get ‘YOU DON’T USE DAGS SO YOU DON’T DO CAUSAL MODELLING!’.

          I think the comments he gets are more in the sense of “if you can do casual inference it’s because you have some, implicit or explicit, causal assumptions in your models”.

        • Daniel,
          Lets go back to your first sentence and ask: How is the “data generating process” represented mathematically? If it is represented by equations, then Andrew is right, the question “what if we place 2000 lb” can be treated the same as “what if we see 1000 lb placed” and WOOPS the answer can be produced by classical prediction tools, including Bayesian. But, and please do not knock it off hand, classical prediction tools, including Bayesian, will also conclude that tweaking the voltmeter needle will make the weight 2000 lb, instead of the current weight of 1000 lb. Try it in the mathematics. The equation of physics are algebraic, and algebraic equality signs are symmetric. No escape.

          This behooves us to ask: “So how should we represent the “data generating process” if we want to prevent such WOOPs from our inference methodology. Welcome to the modern age of causal inference, in which predictive tools are barred from managing certain interventional questions and a new calculus takes charge of the interventional level of the Ladder of Causation. (as in the Book of Why, or here https://ucla.in/2HI2yyx).

          Of course “thinking of” tweaking the voltmeter in terms of RCT would stop you from concluding that the weight will change. But we are talking mathematics, not “thinking”. In other words, once we are done “thinking” we wish to represent the result of our “thinking” in some equation and go ahead with our analysis, combine it with data, and ask new questions. So, back to your first sentence. How would you represent the “data generating process” mathematically? You can tell already that I am heading toward “structural causal equations”, and the logic that governs them. It is easy for anyone who wishes to acquire it.

          I have written about the helplessness of BAyesian inference here: https://ucla.in/2nZN7IH as well as on this blog. This helplessness is traumatic to Bayesian philosophers who are trained to believe that, given enough data, the posteriors are bound to peak around the correct answer. Not so in causal inference, They wont! Sorry. And readers on these blog will be able to recover from this trauma when they are prepared to accept the hard fact that we are operating in a new era, new logic, new calculus — the era of causal inference. It is this acceptance that Andrew resists so vehemently. Which is unfortunate, because postponing the transition would make it more difficult.

        • Judea, you write “Almost by definition, causal and statistical concepts do not mix. Statistics deals with behavior under uncertain, yet static conditions, while causal analysis deals with changing conditions.”

          But yet, I can have a model (or many models!) for how change is occurring. I can then fit those models to data (using Bayes, or approximate Bayes i.e. Max Likelihood, whatever), and make statistical inference about fixed, but unknown parameters. Sure the joint probability distro can be factored in many ways (a sort of symmetry I guess) but it is precisely our knowledge, and scientific model that selects one factorization as most useful to *represent our knowledge*…

        • When I place my weight on my apparatus, what actually occurs is that there is a transient oscillation in the voltage as the material compresses etc and then after some time t_a the voltage reaches a near constant value V_a. So the equations are not symmetric in time, but we often use sloppy notation. Careful notation might be something like V(t+t_a) = F(force(t)) but it’s often written V=F(force). Nevertheless the result only holds in an asymptotic or intermediate asymptotic way with time (hence my subscript a for asymptotic).

          In fact this is a complaint i OFTEN have with economics, they like to use algebraic equations and equilibrium ideas but they actually have a dynamic process and the time scale for equilibrium is typically not known well at all… I feel as if they often measure their outcomes well before asymptotic validity happens and then wonder why they have such trouble… But I digress.

          My personal preference is to think about *processes* and there is not an ambiguity when time asymmetry is properly understood. No one who is same will be confused that changing the voltage now will cause the weight 100ms ago to change and with good notation we avoid the apparent symmetry in the equations.

          So, I ask you, if I were pedantic about my symbolic representations, and accept the axiom that causality can only ever be forward in time, am I already committing the statistical “sin” of structural causal equation modeling? ;-)

        • > Bayesian Bayesian philosophers who are trained to believe that, given enough data, the posteriors are bound to peak around the correct answers who are trained to believe that, given enough data, the posteriors are bound to peak around the correct answer

          Rather naive for any philosopher to think there is a direct access to reality (correct answer).

          p.s. I did enjoy re-reading your Understanding Simpson’s paradox paper this morning.
          Did any write a simulation package for the paper?

        • Hi Daniel, see if I got this right.

          It seems to me you have the functional specification of a differential equation of your (physical/physiological) system or the like and your data is experimental (I assume there’s no confounding or other issues in your setup?). Thus, your remaining task is to obtain numerical estimates for the parameters of the model. This last task is the statistical inference part (Bayesian or Frequentist), the causal inference part was already “solved, assuming experts agree with your physiological model and experimental data was obtained correctly.

          Tools of non-parametric causal inference can also help you build and analyze causal models, and they are most helpful when you have a qualitative partial specification of the system and cannot obtain data directly from the domain you want to make inferences from. For instance you want to make inferences regarding the experimental distribution, but have only access to the observational distribution, or when experiments are imperfect (suffer from non-compliance, selection bias, data missing not at random), or when you want to transport the results to a different population and so on. These tools will allow you to build non-parametric causal models (you don’t need to know the functional form of the equations) to understand whether your domain knowledge of the system is sufficient for answering the causal query of interest, and if so, what is the functional of the observed data that answers your question. Once we have that functional, the causal inference part is “solved” and now we need to solve the statistical inference part and get numerical estimates, as in your example.

          In your specific example, I don’t know where non-parametric causal inference tools can help. For instance, do you usually have to extrapolate results to a population different from the experimental? If so, how do you currently do it?

        • Whenever I build scientific models (as opposed to some pure prediction model) i have a “process” in mind, as I mention above even when my notation is sloppy there is some notion of before and after… Weight is placed on a sensor, sensor briefly compresses, resistance a short time later is less, voltage across some terminals responds to resistance, amplifier receives altered input voltage, amplifier outputs alternate output voltage… Sure the whole process maybe takes 4 milliseconds and this timeframe is entirely down to the stiffness of the load cell as the voltage amplifiers equilibriate at nanosecond timescales…

          But there are plenty of examples where I’m not interested in the transient dynamics just the long time behavior and my sloppy notation ignores the dynamics… But there is still fundamentally a one way directionality implied in the model.

          Also plenty of cases where I don’t have explicit functional forms. I would normally handle this with inference on parameters in a functional expansion.

          And where I don’t have purely experimental data for example perhaps someone has patient data on whoever happens to show up at the local hospital… We can’t control say sex, age, or initial viral load, we can only see who shows up and then measure what happens after giving a medicine and see how our causally interpreted equations correspond to observed measurements.

          So it sounds to me like the goal of these causal inference calculus things is to help people who do stuff in simplistic ways very unlike what I do so that they can kind of put some ideas into a kind of theorem prover like machine and arrive at some more fully baked model of the kind I am already making… Which is why I ask, for a community of people already modeling “data generating processes” as a process (In time) what is the compelling feature?

        • Daniel,
          I will focus on your (good) question: “if I were pedantic about my symbolic representations, and accept the axiom that causality can only ever be forward in time, am I already committing the statistical “sin” of structural causal equation modeling? ;-)
          Ans. NO. So far you have not committed any sin, you have just constructed a structural equation model, in which each variable “listens” only to variables that precede it in time. Good. Now you need to find a representation for “acting”, or introducing an external change. You need to ensure that when you change the voltage, the change will be transmitted only to the needle of the voltmeter, not to the weight, because the weight was placed earlier in time. Sound easy? Yes, but we need to do it symbolically in the equations, so that when stupid robot receives the equations he/she would know what changes when the voltage is increased (externally). You need also to prevent contradictions for our poor robot. He sees an equation v=f(weight) and now you are doubling the voltage without changing the weight. Bingo! The robot goes crazy. He goes even crazier when some information should go backward in time, if it comes from passive observations, and some only forward, if it comes from interventions.

          You see where I am heading. You have just invented the do-operator. And you are safe if you use it properly. But, do you
          really want to use it that way?. Only if you are prepared to manage all the microprocesses that lead from weight to voltmeter, instead of taking the shortcuts the do-calculus offers you. To appreciate it, look at the agony that traditional statisticians go through when they try to handle Simpson’s Paradox, and they usually fumble, including top statisticians from Harvard. It is like resisting the rules of calculus and insisting on deriving the limit as eps–>0 each time you get a new function to differentiate.

          Such resistance can only emanate from metaphysical, not from technical considerations.

        • > These tools will allow you to build non-parametric causal models (you don’t need to know the functional form of the equations) to understand

          Unfortunately ‘non-parametric’ *DAGs* can’t capture the concept of a differential equation. I also don’t really see how they help express things like ‘energy is conserved’ and ‘entropy increases’, being based on an alternative semantics of ‘this variable listens to these variables’.

          Like I said, it feels very Aristotle, who expressed motion in the form of things like

          v = f(F)

          with semantics ‘velocity listens to force’

          not

          d(mv)/dt = F

          with semantics ‘momentum is conserved and transferred between systems via forces’.

          Different mathematics (eg plain functions vs differential/integral equations) and different semantics ‘listening’ vs ‘conservation equations + constitutive assumptions’.

        • Also worth pointing out that eg Newton’s 2nd law is actually also nonparametric is a very similar sense – it leaves the force laws completely unspecified.

          The semantics is merely the generic ‘momentum is a conserved quantity’

          Constitutive assumptions fill these in, but can be purely qualitative eg F = f(x,y) but not z, is compatible with the 2nd law of thermodynamics etc or specific eg F = GMm/r^2 etc.

        • Chris,
          If you know the two distributions, before and after the change, there is no problem indeed. However, Causal analysis deals with inferring the post-change distribution from the pre-change distribution.
          A typical example used in Book of Why is: what if I double the price of tooth paste, given sales-price records in the past. Fitting sales to price gives P(sales| price reported) which is different than P(sales| price fixed)

        • Judea, obviously if you fit a statistical relationship to past information when many different factors influenced the toothpaste price, you can’t necessarily get very far. Sales may be sensitive to things you aren’t measuring, like seasonality or the next big YouTube fad “Toothpaste Challenge” in which silly children try to film themselves eating a whole tube of toothpaste or whatever. But suppose magically you had a computer program sales(P) which would correctly give you the actual sales you will experience in the next say month if you set the price to a constant value P but it is contaminated with some noise of an approximately known size.

          Obviously this magical computer program reads the internet, and finds out huge numbers of things about demographics and trends in toothpaste usage and YouTube videos and etc. But it provides you a kind of simplified interface, put in time t in UNIX timestamp format, and price P and it predicts the total sales of toothpaste for the week following time t if the price were set to P.

          My impression is you call this *not the domain of statistics* and that people like Chris, OJM, Andrew and I would call this *within* the domain of statistics, that is we see not only the fitting part but also the model development part as part of what we should be doing.

          I don’t disagree with you that many people *do* see “statistics” as just stuff you do with your observed numbers and some push button software (this is regrettably very true in my wife’s field: molecular biology) but I suspect some of the gap between you and Andrew is that you see statistics as encompassing a small realm and Andrew and I and others here see Statistics as also encompassing the larger realm of model building, model checking, model choice, making causal vs noncausal assumptions etc.

        • Daniel your answer gave me the opposite impression, that you would actually benefit a lot from learning the tools. It seems to me you are not handling non-experimental data formally, but dealing with it in an ad-hoc manner. Also you didn’t understand the non-parametric modeling—it has nothing to do with expansions we are not even talking about fitting data for now, we are simply talking about deriving identification results that are valid regardless of the functional form.

        • This “it would be good for you” type thing is what I always hear when we have this conversation… But no one can really tell me in what way it would be good for me. I mean, it’s kind of like broccoli and tempeh salads or green tea extract or whatever. Please enlighten me, what new benefits do I achieve? So far what I’ve internalized is that there is a new symbolic notation with some semantics such that now theorem proving robots can prove facts about my model. I find this a dubious sounding benefit. I also hear from OJM for example that he doesn’t think ODEs are compatible with do calculus… so there’s that.

          In fact, every time I read about this stuff it’s always with reference to extremely under-specified models. ie. the kind of model “somehow smoking affects cancer”

          I’m sure if you’re the kind of person who enjoys using theorem proving robots to do stuff then this is the sort of thing that you find very useful.

          What I much prefer is to actually interrogate science until I have an understanding of as many of the *specific* ways in which say smoking affects cancer as I can imagine. Smoking affects cancer through chemical reactions between small particles and lung tissue, through modifying certain immune responses, through altering cellular DNA maintenance and error correction… etc etc etc. I’m not sure which of those is dominant but I suspect that perhaps several things working together are important…

          Then if I want to understand my model, I need to interrogate and measure those causal relationships… how much Interleukin is produced in lung tissue pre and post smoking… how much this and that is there in the cells and the bloodstream etc.

          Now, someone may come along and they just want to know “how much can you increase the average smoker’s lifespan by making them quit smoking” and that’s a fine thing to want to measure, but it’s not really science is it? I mean, you could maybe estimate this thing from some pseudo-experimental data from people who were prescribed Wellbutrin and nicotine gum back in the 80’s and 90’s and successfully quit after analyzing some approximate causal graph with a robot theorem prover and showing that the net effect of smoking is identifiable from the observational data given the exact model… But the results will always be *conditional on the causal graph being a good model* and without investigation of the mechanism and its myriad of predictions… you really are basically saying “if all this stuff I have nearly zero information about is true…. then this number is close to the correct number for the longevity increase” which to me is like saying “if this photograph is a picture of an Intel microprocessor then inputting these instructions will cause 2+2 to be added in register a and b and the answer placed in register c” but if it turns out it was an ARM microprocessor then … you’re screwed.

          In other words, I’ve never seen a “reduced form” causal inference problem that I like. It’s all about mechanism, literally that’s the purpose of science is to study mechanism.

          So, suppose I have an agent based model of predator and prey and vegetative food sources and invasive vegetative species… (say it’s wolves, deer, grass, and invasive star thistle that crowds out grass) and it’s written in say NetLogo https://ccl.northwestern.edu/netlogo/ and I have a decade of observational data on this process in some national park somewhere… what *specific* things will I gain from do calculus? Can you give some step by step things that I should do, and then what valuable results I will get from doing this? I want to predict whether the wolves or deer or both will go extinct if the climate changes in a certain way and the grass and thistle respond to climate in a certain way.

          So far the example of measuring force on a bolt was apparently too simple.

          Here are some specific concerns I have though:

          1) NetLogo is Turing complete and it’s a theorem that we can’t in general solve the halting problem… so any Do calculus automated theorem provers will not *in general* be able to prove anything at all about the model. For simple models this may not be so important, but the general case involving complicated models is a totally different story.

          2) If I build some surrogate DAG model of the NetLogo model, how do I know if one corresponds to the other really?

          3) Minor changes in the code of the NetLogo model can sometimes make *dramatic* changes in the emergent behavior. This is kind of a general theorem in dynamical systems: simple rules lead to complex behavior, so the overall behavior, that is, in some sense the existence of a nontrivial causal arrow in the DAG is potentially sensitively dependent on the particular initial conditions or parameter values…

          4) As long as I have this ABM model, and it actually is a formal model that *runs on a computer* why not just run the model to interrogate it, rather than building a (probably wrong/incomplete) model-of-a-model in DAG notation and interrogating a theorem prover about certain properties.

        • > Well, I’m not trying to complicate the discussion here with a lot of complicated description of generating processes, the point is I’m not a “Bayesian Statistician” I’m a mathematical modeler using Bayesian methods.

          > My impression is you call this *not the domain of statistics* and that people like Chris, OJM, Andrew and I would call this *within* the domain of statistics, that is we see not only the fitting part but also the model development part as part of what we should be doing.

          What is it, then? Can “mathematical modeling using Bayesian methods” be described as “doing statistics” or not?

        • > So, suppose I have an agent based model of predator and prey and vegetative food sources and invasive vegetative species… (say it’s wolves, deer, grass, and invasive star thistle that crowds out grass) [….] I want to predict whether the wolves or deer or both will go extinct if the climate changes in a certain way and the grass and thistle respond to climate in a certain way.

          That’s not really science is it? I mean, you could maybe estimate this thing from some observational data on this process in some national park somewhere… But the results will always be conditional on having a good causal model and without investigation of the mechanism and its myriad of predictions…

        • > the results will always be conditional on having a good causal model and without investigation of the mechanism and its myriad of predictions

          That’s exactly what causal ‘inference’ using DAGs does too!

          Like I said, it seems to be more about replacing things like a properly dynamic causal model based on eg conservation principles with a pseudo-dynamic graph based on variables ‘listening’ to each other.

          But regardless, all standard causal inference takes the causal assumptions as given and either deduces new causal implications of these or (and/or) combines these with (potentially observational) data.

          To me, a key point at stake is whether DAGs are the right language for causal modelling. In some situations quite possibly, because folk physics is all we’ve got there, but in many others we already have alternatives. This gets lost in part because people hear ‘nonparametric’ while forgetting the DAG/structural equation part.

          Simple challenges to analyse with competing formalisms –

          The motion of the planets
          A series of ideal gas experiments
          A two stroke engine
          Climate prediction
          An enzyme kinetics reaction

        • > To me, a key point at stake is whether DAGs are the right language for causal modelling. In some situations quite possibly, because folk physics is all we’ve got there, but in many others we already have alternatives.

          I do agree! In my experience you are never teached “statistics” when you study physics and it’s no accident that frequentist methods were developed by biologists, psychologists, economists and other poor souls.

        • Oliver, how to fully capture the equilibrium state of ODEs in a causal model (and other types of functional laws) is actually a current line of research. See the works by Mooji and his team or Peters and his team. It’s all literally very recent. Maybe you and Daniel can help moving this forward too. This doesn’t change the basically principles though, you still need to specify what it means to intervene on the system, distinguish algebraic from structural relations, distinguish modeling the joint distribution from modeling the structure, and so on.

        • Mooij is great!

          To be completely honest I didn’t *really* understand DAGs etc until I saw his work a while ago relating these to the equilibrium states of ODEs.

          I’m interested in these connections, but I think it needs to start from recognising that these other frameworks *already* make many of these distinctions, and are already very expressive. A key question is – why not use them as they are?

        • having a hard time following the conversation from my phone, but I wanted to address a couple issues I saw as best as I could.

          Carlos Until turns my words back on me… but seemed to miss the distinction I was making. suppose you can estimate a single number that describes what would happen to say toothpaste sales if you exactly doubled the price today… now you have a number… the number is estimated even though we have zero knowledge of the mechanistic details, and hence is generally unusable outside the particular situation in which it’s measured.

          compare with estimating a causal mechanistic model in which parameters describe processes that occur… like the rate at which people use toothpaste per week under certain conditions and how many people overuse toothpaste and can easily conserve it if the prices go up and people’s preferences for toothpaste vs say less convenient substitutes etc etc… in the end you have a description of some dynamic possibilities, a whole manifold of possibilities.

          Now both can be useful but one of these things seems like a pure measurement, similar to walking up to a jug of water and pouring it into a graduated cylinder, whereas the other is like having a numerical description of the water cycle on earth in terms of evaporation, transpiration, precipitation, river flow, groundwater flow etc… the difference is vast

          now maybe we should all celebrate whenever a middle school student measures some water in a graduated cylinder, but I can’t help but think that an accomplishment of describing the mechanisms of water flow through the environment is what science is about, and measuring the volume of some particular parcel of water is not…

          As for the distinction of what’s in stats vs what isn’t, I was referring to the narrow conception that was implied by the question of what do you do if you’re the Bayesian Statistician and someone hands you a model and a complete description of the joint probability of everything under that model… this implies statisticians are just devices for calculating standard errors. My opinion as that most of the people here are BUILDING models and designing experiments and checking the model suitability, not sitting around waiting for people to plop data in their lap and ask them if their results are significant or whatever. it’s well within the stats bailiwick to write descriptions of processes and ask about how well those descriptions match reality including causality.

        • Also sorry Carlos my phone insists that your last name is obviously Until not Ungil…

          But to further the distinction, ABMs are exactly entirely about mechanisms. Plants grow based on seasons affected by temperatures, reproductive efficiency, sunlight etc, prey animals eat, travel, mate, and make choices about migration, risk taking and soforth, prey animals hunt in packs, require prey for food, etc each rule you describe in an ABM is a lot more information than an arrow from variable A to variable B it’s a description of specific processes that occur in specific ways. this in my opinion is what the goal of science is, to have full if approximate descriptions of how things work at a quantitative and specific level.

        • Sorry for multi-posting, but now that I’m back on my laptop and can read and follow the thread easier…

          I think ultimately, ABMs are kind of the prototype of a general purpose causal modeling system. They can more or less represent everything from atoms up through chemical reactions, biology of cells, ecology and all the way to say economic interactions in markets for video game virtual goods. They work by deciding what actions each element of the system will carry out in the next time step based on the full state of the system in the current timestep including any memory of the past using a Turing complete computing system.

          As OJM says “I’m interested in these connections, but I think it needs to start from recognising that these other frameworks *already* make many of these distinctions, and are already very expressive. A key question is – why not use them as they are?”

          I have a hard time understanding what it is that DAGs provide me with that ABMs don’t have, whereas I can see a lot that DAGs seem not to have compared to ABMs (for example the ability to represent a simple ODE like the trajectory of a ball dropping through syrup.)

        • There is work connecting ABMs and DAGs, especially discrete-time ABMs, but

          – DAGs cannot representative continuous-time and/or continuous-space processes
          – Are based on a naive semantics of variables ‘listening’ to each other to ‘set’ their values
          – Don’t seem to allow for the possibility that simple unique equilibria aren’t the only option

          For example suppose I proposed the discrete model:

          x(n+1) = r*x(n)*(1-x(n))

          for r greater than 3.6 but less than 4.0 and subject to the *externally set* x(0) = x0.

          First of all, what’s the best way to understand this motivation for this model

          – x(n+1) ‘listens’ to x(n) to ‘set’ its value
          OR
          – it represents a growth/reproduction process with a carrying capacity

          Personally I prefer the second.

          Next, can I understand the ‘effect’ of ‘setting’ the initial condition to x(0) = 0.1? Should this be similar to setting x(0) = 0.100000000001? What if I vary r?

          All of these questions are ‘interventional’ and can be investigated by the concept of…changing a parameter value and/or changing an initial condition. No magic, done everyday by ‘math modellers’ in the sciences.

          Finally, can I understand and analyse the continuous analogue of this:

          dx/dt = r*x*(1-x)

          ?

          In the usual semantics this still represents a ‘growth’ process and can be motivated by conservation arguments. But there is no clear ‘variable listening to another to set its value’ interpretation.

        • Judea, thanks for responding. I’m not sure I understand what you mean yet though. You say “If you know the two distributions, before and after the change, there is no problem indeed. However, Causal analysis deals with inferring the post-change distribution from the pre-change distribution.”
          But if I fit a model to some data, I have a posterior distribution I can work with to do exactly that! For instance, if I have a dynamical model I can push out the posterior predictive distribution for new data for the next time step,or given some perturbation or something. Is that not casual analysis?
          You then say
          “A typical example used in Book of Why is: what if I double the price of tooth paste, given sales-price records in the past. Fitting sales to price gives P(sales| price reported) which is different than P(sales| price fixed)”
          So here I agree we need to be careful. You are asking for the impact of an intervention, but we have only observational data given. The way Id tackle this is fit a few different models, based on plausible theory and reasoning and go from there. Your predictions and inferences are of course conditional on the models- but its not like we aren’t going to interrogate and check those models in various ways. How do you use do-calculus here to do better? I am open to new ideas – but right now this feels a bit like we are uncomfortably adjacent to a something from nothing pitch.
          I do have to say that I find DAGs useful in the contexts where we are using linear models.

        • Carlos Cinelli writes:

          “Thus, your remaining task is to obtain numerical estimates for the parameters of the model. This last task is the statistical inference part”

          and:

          “Once we have that functional, the causal inference part is ‘solved’ and now we need to solve the statistical inference part and get numerical estimates, as in your example.”

          Thank you for helping me make my point, which is that estimating model parameters from data is, and should be called, “statistical inference.” Causal assumptions are just more assumptions that inform the specification of a statistical model, even if they are not yet found in traditional Statistics textbooks. Since the “causal inference part is ‘solved'” by assumptions, this would logically lead to the name, “statistical inference given causal assumptions.”

          But I guess some of this is about marketing, which CS people are very good at (see “Machine Learning” and “Artificial Intelligence”).

        • +1 for ‘statistical inference given causal assumptions.‘

          The benefit of this view is that statisticians can, you know, work with physicists on the Higgs boson experiments without requiring the standard model or whatever to be expressed as a DAG.

        • I don’t have time for addressing all the but when you say

          > each rule you describe in an ABM is a lot more information than an arrow from variable A to variable B it’s a description of specific processes that occur in specific ways.

          I think I can agree even though I don’t even know what ABM stands for. A perfect model is better than a not-so-good model. But the latter is better than having no model or a wrong model.

          > this in my opinion is what the goal of science is, to have full if approximate descriptions of how things work at a quantitative and specific level.

          That doesn’t mean that all the empirical work done without a perfect model. Saying that everything done from broad causal considerations is garbage because you would do it “correctly” so you can’t even fathom how causal reasoning may be better than no reasoning at all is not reasonable.

          Or maybe you’re just saying that you have no use for causal models because your models already have all the causality in them. In that case I think we all agree you are not missing anything. But that’s a bit like saying that numerical methods are useless when you know the analytic solution to every equation…

        • Anonymous: of course a “perfect” model would be great, but also of course we rarely if ever get a “perfect” model. ABM stands for Agent Based Model, essentially an ABM is a model in which objects move in a 2D or 3D environment, and interact according to rules specified in a turing complete programming language. This would encompass say Molecular Dynamics simulations, fluid mechanics using Lagrangian methods, and many many other “higher level” simulations such as wolf sheep predation, or economic models of distribution of goods, or queuing theory, or spread of infection in a population, or whatever.

          Obviously this isn’t the only modeling system in the world, but it’s one of the most general purpose systems I can think of. This kind of model builds causal assumptions in basically as follows:

          1) The system is dynamic: the future always comes from the past, whereas the past is never influenced by the future

          2) whatever the model predicts for the future is assumed to be the causal effect of all the things going on in the present… and the “goodness” of this model in accurately predicting effects from causes is what needs to be determined in order to decide whether the causal assumptions were accurate

          3) At any moment a “do” operator is implemented by simply altering some aspect of the state and seeing the predictions for the future. At any given time the “state” is usually quite complicated, it’d include things like the (x,y,z) or (x,y) location of each agent, the velocity of each agent, other aspects of the agent, such as hunger, or connections to other agents, or how far the agent can see in space, or which set of rules it’s currently following, or what its preferences are between options or what its infectious state is or etc etc. It’s easy to imagine say 10,000 agents in a simulation each with a few tens to hundreds of individual state variables, so that you might have a million state variables in a simulation.

          That doesn’t mean that all the empirical work done without a perfect model. Saying that everything done from broad causal considerations is garbage because you would do it “correctly” so you can’t even fathom how causal reasoning may be better than no reasoning at all is not reasonable.

          I’m not so arrogant to say that I would “do it correctly”, I’m just saying I think the correct way to attack many problems in causal inference is to *not give up at the very first opportunity* but rather to think about the dynamics from the start… there are many techniques for doing this, ABM being one of them, but I fear that too many people in epidemiology or economics or biomedical or criminology or whatever have been taught to give up before even trying. it’s like a mantra “in social sciences we don’t have rules like physics does”. It’s true that the rules aren’t like physics forces between particles rules, but that doesn’t mean there aren’t rules… for example if you go to buy fish at a market and there are no fish to buy, you will come home with zero fish regardless of the price you are willing to pay… this is actually a physics conservation law applied to economics… jrc who posts here has posted info on a study of how cell phone adoption affected the price of fish in rural fishing villages in third world countries. Obviously prices are based on information, and information travels over technological mechanisms at certain speeds between certain people…. model it! at least give it a try…

          I fear though, that in our current environment where getting a professorship in say Econ requires you to publish N papers and get K endorsements from peers etc or lose your job after a few years… That stepping outside the usual methods in your field guarantees you are the victim of a selection bias, and so we are stuck in an attractor: those who do NOT write ABMs or causal simulations with other techniques or whatever incorporating biology, physics, information theory, transportation, varied preferences, cultural phenomena, etc into their Econ models will advance, and those who do try to incorporate stuff like this willtake much longer than will be allowed for them, and so “die out” of the population of economists… (or epidemiologists, or sociologists or poly sci or whatever)

          —-

          ojm: your simple discrete time system example is great because its been well characterized as having chaotic strange attractors, the “effect” of a simple “cause” such as “set x[0] to .2” is an entire non-periodic orbit through chaotic state space… There is no simple DAG, it’s just a giant chain x[0] -> x[1] -> x[2] …. off to infinity. Take this to the next step and include 1000 agents, each of which moves around a 2D field and when it encounters other agents in its “neighborhood” they swap information and continue to move around, with random perturbations to their motion and communicate…. What is the causal effect of setting agent 33 to have a particular virus at time t=88? It will be a whole epidemic that spreads through the state space in a complex way.

          Claiming that until I express this in DAG form I am not doing causal inference is … meh

        • > Science thrives on distinctions.
          Well if these distinctions are personal differences which are being taken to be important, academia thrives on those big time, but science if persisted in adequately can be expected to eventually remove all personal differences.
          (Or so CS Peirce would argue).

        • Well, Keith, what do YOU think? Are these distinctions merely a matter of personal preferences? Or are they substantive differences that prevent traditionalists from solving even the most rudimentary problems in causal inference? Say Simpson’s Paradox https://ucla.in/2Jfl2VS. Do you know ANY traditionalist who can choose the correct answer when Simpson’s reversal shows up in the data?

          Do you believe Bayesian inference will help the traditionalists? Be it inductive or deductive? No, it will not, see https://ucla.in/2nZN7IH. And Lindley was the first statistician to admit that a new calculus is needed for causal inference. Unfortunately, he was also the last prominent statistician to do so, not counting the new generation of statisticians who have read Primer https://ucla.in/2KYYviP and have woken up to: “Why haven’t they told us?”

        • Chris,
          You say:
          “So here I agree we need to be careful. You are asking for the impact of an intervention, but we have only observational data given. The way Id tackle this is fit a few different models, based on plausible theory and reasoning and go from there.

          Here is where do-calculus comes in and says: (1) No need to be “careful”. “Careful” is a pre-causal word that statisticians used when they meant: “I havn’t the slightest idea of what to do”. (2) No need to fit “a few different models” the calculus tells you which ONE model to fit, and then you go from there. (3) “fitting” cannot be used in causal inference to decide among models. I have explained it here https://ucla.in/2nZN7IH and will explain here again.

          Try to use “fitting” to decide between Model-1: X—–>Y and Model-2 : X<—-Y. Assign priors on the two models and take 10 trillion samples, iid, from the joint distribution f(x,y) . The result is predictable, both models fit perfectly, yet one predicts that Y will respond to manipulation of X, and the other says NO Response.

          You asked:
          How do you use do-calculus here to do better?
          You start as you suggested, with a "model based on plausible theory and reasoning' But it is a causal model, not a statistical model, namely, a carrier of causal assumptions. Then you submit you question:
          Find P(Y|do(X)) to the calculus with a humble request: Can this question be estimated from the joint density f(x,y,z1, z2….) of my observed variables? The calculus will tell you yes or no and, if yes, how.

          I perfectly understand your hesitation and reluctance to believe that a calculus like that would be worth a dime, given that most of the statistical establishment is still hesitant and reluctant, (https://ucla.in/2v72QK5 ) and leaders like Gelman keep telling their followers: "We fit statistical models to changing conditions all the time" and " causal models are a special case of predictive models." So, I do not blame you for feeling: "a bit like we are uncomfortably adjacent to a something from nothing pitch." I am just amazed at how long it takes to scientific progress to propagate across disciplines.

          I visit this blog occasionally, to check on how aware Gelman's 21K followers are of the Causal Revolution that is raging around them and of all the excitement that they are missing by laboring to preserve traditional paradigms and saying: "Causal inference is just statistical inference given causal assumptions.‘

          I am sure though that among those 21K followers, a non-zero fraction wakes up daily to what is going on, reads some of the forbidden literature (eg, https://ucla.in/2KYYviP — highly recommended for heretics, chapters available) and says: "Gee, if Causal Inference is Statistical Inference given causal assumptions" then "Car making is car-painting given an engine and a body."
          Judea

        • Judea, you seem to miss the point of what Chris was saying entirely. He wouldn’t try to arrive at causal assumptions by fitting, he’d try to make some causal assumptions using mechanistic ideas, and then fit this model and see how it fit, then he’d add in another causal assumption he might be unsure of, and see what it would predict, and try to collect information that might help him distinguish between which assumption was correct vs wasn’t. then he might add additional assumptions, and see if they also made different predictions and maybe could be distinguished in the data…

          I don’t see how you or anyone could ever avoid this, it’s not like DAGs arrive from god with truth stamped on them, sometimes we will just not know which causal model is the best one to use. After all, at it’s heart, the “true” causal model seems to be that fundamental particles interact and conserve momentum and energy… but this doesn’t help us predict sales of toothpaste. So whatever our model is, it’s going to be a simplification, and we will have to decide how simple is complex enough and how simple is too simple.

        • Try to use “fitting” to decide between Model-1: X—–>Y and Model-2 : X<—-Y. Assign priors on the two models and take 10 trillion samples, iid, from the joint distribution f(x,y) . The result is predictable, both models fit perfectly, yet one predicts that Y will respond to manipulation of X, and the other says NO Response.

          So bit of a tangent here, but people are working on approaches where the space of possible models is neither fully free nor restricted to multivariate gaussian (in these two cases nothing can be done to distinguish the direction of the arrow). For example, if our models are

          M1: X -> Y, y = f(x) + ε
          M2: Y -> X, x = f(y) + ε

          where f(·) is an arbitrary function (let’s say reasonably smooth) and ε is additive noise (arbitrary other than zero mean). The assumption of additive noise makes these two cases distinguishable unless f(·) is in a very special subset of possible functions (the ones that satisfy a certain differential equation). Here’s an old paper about this sort of thing: http://www.machinelearning.org/archive/icml2009/papers/279.pdf

          Do you have an opinion on these sorts of approaches Prof. Pearl?

        • >>”Gee, if Causal Inference is Statistical Inference given causal assumptions” then “Car making is car-painting given an engine and a body.”

          Only if the engine and body were assumed to exist and then later concluded to exist as a consequence of that same assumption.

          And I say this as someone who finds causal DAGs to be useful (in some situations), as stated in my initial comment.

        • Dave:

          I think you might be confusing inference (about empirical matters) with deduction (that the conclusions follow logically from the premises) when it has to be induction it its to be about empirical matters. The phrase in Judea’s comment “provided the assumptions were valid” marks it as not deductive as in deduction assumptions are take as always true.

          I do think it is a risk with “mathematization” of helps for doing induction that it may make induction itself seem like deduction (e.g. that interview with Lindley by Tony O’Hagan where Lindley declares Bayesian inference as deductive.)

        • Daniel (and Chris).
          Sorry if I missed the point of what Chris was saying. But please examine the process you are describing:
          “.. he’d try to make some causal assumptions using mechanistic ideas, and then fit this model and see how it fit, then he’d add in another causal assumption he might be unsure of, and see what it would predict, ….”

          You are describing the process of specifying a causal model from a mental representation of mechanistic ideas, and then “fit this model and see how it fit”. The first part is still a mysterious mental exercise that, as far as I know, no one has attempted to formalize. But the second part requires that we write down a causal model and draw some conclusions from it, for example, does it have any testable implications, observational as well as experimental. This is what we do with DAGs. This is where DAGs excel with no competitor in sight. This is where I see the “We do not need DAGs” mantra to be harmful because it lulls people into believing that their informal methods cannot be improved by mathematics.

          For fun, just ask one of those “we do not need” folks to present their favorite “causal model” and to tell us if it has any testable implications. But do it diplomatically, so you won’t be perceived as the cause of their embarrassment.

          If their “causal model” is a “potential outcome” model, they will be facing the problems discussed here: https://ucla.in/2QpcGzS. And if their model is a parametrized statistical model, the statistical assumptions are testable, but not the causal assumptions. Thus, there is no methodology today, save for DAGs and its associated tools, to do the second step in your description of how Chris is about to construct a model.

          It is for this reason that I consider all those “we do not need” talks and “we are already doing it” rejectionism to
          be harmful to a community of smart researchers, who are currently lulled into outdatedness.

          a

        • Judea, so you are saying that you see no value in explicitly representing scientific, process-based models that capture key aspects of hypotheses or theories, fitting these to data, and comparing model fit, predictions, running sensitivity analyses, etc.etc.?

          Real example here: Let’s take two variables (temp T and water vapor W) measured in a closed chamber environment including plants over a short timescale. We have a joint density composed as f(T,W). You say that using statistical methods I can never distinguish T —> W, from W —> T? That may be true if we only have arbitrary factorizations of the joint to work with. But we don’t. I can use physiological reasoning here. Temperature drives transpiration (loss of water from plant stomata) through its impact on vapor pressure deficit (VPD). Over a short enough time scale (further aided by an experimental environment of course), this will overwhelm the effect of water vapor on temperature. So I can write down a motivated physiological model centered on transpiration response to VPD, fit it to my time series, and do causal analysis of the effect of T on W, which in turn allows me to learn about how my plants regulate transpiration etc. (in reality, this is just an intermediate step in a larger model with other goals).

        • Can you tell me then, how after I have specified an Agent Based Model of say infectious disease or invasive species or some similar thing, how do I use DAGs on it?

          If I wanted testable implications, normally I would specify some plausible range of parameter values and then generate randomly from those parameter values, and run my simulation. then I would measure quantities from the simulated results and see if they were different in different regimes. This would then let me decide whether measuring these measurements could help me distinguish between different parameter values… if I have multiple models I would compare the measures across multiple models and do the same type of thing…. once I have selected distinguishing measures I would attempt to collect them in the real world and use them to filter down the set of plausible parameters to those that match reality.

          if the simulated data is never different as I change certain parameters then I must either find alternative measures or accept that the parameter is not identifiable.

          How would I use a DAG to do a similar thing for my ABM?

    • Dave

      You said:

      “Unfortunately, estimation with mixed data types (discrete and continuous) in BNs is computationally difficult, and I can’t find any implementations that handle it.”

      In fact an accurate inference algorithm for hybrid (discrete and continuous) variable BNs was developed by Neil et al in this paper:

      Neil, M., Tailor, M., & Marquez, D. (2007). Inference in hybrid Bayesian networks using dynamic discretization. Statistics and Computing, 17(3), 219–233. Retrieved from http://dx.doi.org/10.1007/s11222-007-9018-y

      and – importantly – the algorithm (and extensions to it) is implemented in the AgenaRisk software which makes it simple to build and run hybrid BN models. https://www.agenarisk.com/

      Norman Fenton
      http://www.eecs.qmul.ac.uk/~norman/

      • Great, thanks for sharing. I suppose I meant I haven’t been able to find an open source implementation that handles mixed data types, but I will take a look.

    • The snapping sound I heard was E-flat major — same as Bethoven’s Eroica
      Beautiful sound, especially if you look at progress, before and after snapping.

        • Given the history of discussions on causality in this blog, I was ready to get the popcorn out and settle back. Nice to see some humor in these discussions, and hope that if it continues that the dialogue can be both light and informative, as well as knowing when a discussion is going nowhere, the points have been made and leave it at that.

  3. “a good prediction study also requires … a rigorous plan to control confounding.”

    Andrew, can you elaborate in what sense a predictive study needs a rigorous plan to control for confounding?

    What is your target quantity of inference in a prediction study (for example, E[Y|x]?), how do you define confounding in that setting and what would be a rigorous plan for controlling them?

    • Carlos:

      I’m talking about adjusting for differences between sample and population, for example if you’re studying the associations between religious attendance, income, and voting in a sample, but you’re interested in learning about these associations in a larger population.

      • Ah, I admit this point confused me as well. Coming from an epidemiology background, I disagree with calling that confounding. That’s an issue of external validity/generalizability.

        I suppose if your target is the general population and your sample gives you the wrong answer you could argue you’re needing to correct a form of selection bias (though I think many epis would chafe at calling it that, too). But confounding, I always learned and conceptualized, is strictly an issue of internal validity and is only relevant for causal questions. It’s mixing up the effects of two variables, which is totally nonsensical in prediction problems since if you get the right prediction who cares what the individual predictor effects are?

      • Hi Andrew,

        As Zach has also mentioned, we usually denote this by selection bias.

        But leaving terminology aside for now, could we say your problem statement is: (1) we have data from P(Vote, Religion, Income, S = 1) (where S denotes sample selection) but (2) we want to make inferences regarding parameters of P(Vote, Religion Income)—that is, the joint distribution of the population as whole?

  4. I think the final paragraph could use clarity. While causal models can be framed as models whose purpose is to predict unobserved potential outcomes, causal models differ from non-causal predictive models in at least 2 fundamental ways: 1) we can never observe the ground truth for our predictions; and 2) we do not care about the predictive power of models on the outcomes, as models with excellent predictive power on the outcomes can be biased (see Figure 1 in https://arxiv.org/pdf/1608.00060.pdf).

    • Luke:

      1. There are many non-causal models where we can never observe the ground truth, for example in psychometrics, pharmacology, and other fields where the variables we really care about are latent.

      2. In predictive problems, we care about predictions for new cases, hence the need for adjustment for differences between training and new data. This issue comes up in predictive models in general, which makes sense given that causal models are a special case of predictive models.

  5. I gotta agree with Daniel Lakeland above, I spent some time on this DAG stuff years ago and just never saw what I was supposed to do with it. It reminds me of Mayo’s stuff, it seems to generates endless paragraphs of discussion but I never see it used to accomplish anything. Like perform a feat or make a surprising prediction.

    Like smoking causes cancer. I’d first of all try to model tumorigenesis, then see how smoking may affect the parameters of that model. Then estimate the effect on each parameter and see if the resulting predictions of the model fit the cancer rates in smokers. What does any of this causal inference or DAG stuff add?

    Eg, for a single cell lineage with probability p of a cancer relevant error occurring during a given division d, and n of these errors required for tumorigenesis I’d get the probability of tumorigenesis after d divisions to be:

    (1 – (1 – p)^d)^n

    Smoking can affect p (increasing the error rate) and it can affect d (increasing the number of divisions in the tissue), or it can affect n (reduce the number of errors that must accumulate before the cell goes out of control). Or it can affect something after tumorigenesis but before cancer is detectable, perhaps it reduces the effectiveness of immune surveillance so more cancer cells survive, maybe history of smoking makes a cancer screening/diagnosis more likely (eg, lung cancer is very similar to tuberculosis), etc.

    Id want to model the process that generated the cancer incidence data and get plausible ranges for each parameter along with how smoking may affect them, the synthesize all this info into a prediction of eg age specific incidence curves that could be compared to reality.

    What do I use a DAG for?

    • +1

      Exactly my feelings too. I’ve even asked around & there seems to be no serious work (outside of the methodological expositions or toy examples) where DAGs have been used so we may read & learn from it.

      What have people used DAGs for to learn anything new & non-obvious?

        • Sorry, whatever game you are wanting to play seems to me it will just lead to more endless paragraphs of discussion without giving me what I want.

          But the most glaring problem is they don’t check the output of their example model against new data to demonstrate it can usefully predict anything.

        • Anoneuoid, you want to learn how DAGs can help you, so I’m giving you a real example to illustrate that. Can you tell us what would go wrong with the sensitivity analysis proposal of Imbens even when all the model assumptions are met and considering no sampling uncertainty?

        • Can you tell us what would go wrong with the sensitivity analysis proposal of Imbens even when all the model assumptions are met and considering no sampling uncertainty?

          No, I didn’t read closely enough to know anything like that. It is up to the people who come up with a method to demonstrate it is worth spending time on. Like I said, I spent a good amount of time a few years ago on this stuff and didn’t get anything out of it.

          Archimedes pulled a ship from the harbor using only his own manpower and ingenuity, then people accepted he knew what he was talking about.

          Haley predicted a comet would return at a certain date a lifetime in the future and was only off by a few years, then people accepted astronomers can predict astronomical events.

          Micro/molecular biologists can make arbitrary cells fluoresce green, thereby showing they can manipulate what is going on inside cells.

          I’m simply not going to devote any further than a trivial amount of time (eg writing these posts and skimming that paper) to causal inference using DAGs until someone performs some impressive feat using the technique. This is totally normal and rational behavior.

        • Anoneuoid, It’s one thing to quietly reserve judgment on something until you see strong evidence of its value. It’s another thing to write many paragraphs dismissing the value of ideas that you freely admit to not understanding and not being willing to put effort into learning.

          My experience was that I had to learn the basics of DAGs before I could appreciate their value. In comparison with the thoughts that usually ran through my brain when doing statistical analyses, the ideas of the DAG-based approach to causal inference were novel enough that I could not anticipate their value before I understood them. Understanding was a prerequisite to appreciation.

          There are other ideas in science that are like that too. In particular, I’ve heard Bayesian statistics discussed that way — that one has to understand at least the basic ideas in order to appreciate what it has to offer, above and beyond the frequentist approach that more people were trained in. Try reading your own comment with the term “causal inference using DAGS” replaced with “Bayesian statistics”:

          “I’m simply not going to devote any further than a trivial amount of time (eg writing these posts and skimming that paper) to Bayesian statistics until someone performs some impressive feat using the technique. This is totally normal and rational behavior.”

          Now you sound like a lot of staunch doubters of Bayesian methods that I’ve run into, who don’t understand why they should learn something other than the frequentist methods they have relied on so much :)

          Carlos, I admire your dedication to these blog points. Maybe you and Anoneuoid won’t end up agreeing, but others readers have surely benefited!

        • It is however pretty easy to find some nontrivial research on real world problems that has been done with Bayesian Inference.

          Furthermore it’s pretty easy to say what it does for you, Bayesian Inference allows you to calculate a number that describes the logical consequences of some assumptions you made in a certain way, it balances your theoretical understanding from a prior with your theoretical understanding of the limits of your models predictive ability combined with the actual data.

          Now, I’ve asked here what I could do with DAGs if I have an agent based model of a complex dynamical system involving say wolves, deer, grass, and an invasive plant… and have heard only silence, recently Judea posted that DAGs are for people without detailed mechanisms… I still don’t know what to think. If the DAG researchers are doing stuff to enhance research without mechanism then I wish they’d just say “hey if you aren’t modeling mechanism then we can help you at least straighten out your non mechanistic thinking” instead of things that read more like “we hold the torch to bring science out of the dark ages into the light of reality” that’s pretty obnoxious sounding to a fluid mechanics person or someone studying ecology through ABMs or a biopharma person studying pharmacokinetics with multi compartment ODEs

        • Now you sound like a lot of staunch doubters of Bayesian methods that I’ve run into, who don’t understand why they should learn something other than the frequentist methods they have relied on so much :)

          I don’t see how anyone could doubt the usefulness of mcmc. If you do research the way I recommend it’s almost something you would inevitably end up inventing yourself if it didn’t already exist.

          If you mean some other bayesian method like bayes factors than I’d probably agree with them.

      • I think DAG are very useful in epidemiology, especially if one tries to draw conclusions about causality from observational data. In this context, DAGs help making assumptions about causal relationships explicit. Analysing DAGs reveals conditional independencies implied by these assumptions. This is relevant, because these are testable implications of causal models that are otherwise harder to figure out. It is probably possible to do the same without a DAG, but a DAG makes this process relatively easier.

        An example for something interesting I have learned trough DAGs is that MRP (multilevel regression and post stratification) in some circumstances cannot reduce bias from self selection into observational studies when one estimates the association between a predictor and an outcome.

        More generally, it seems to me that DAGs are especially useful in fields like epidemiology or analysis of observational data (when one cannot or does not want to model biological or physical systems), whereas they are less needed in fields like physics (where mechanistic models are clearer) or analysis of experimental data.

        • Thanks Guido.

          But can you show some examples of articles in epidemiology that have used DAGs to do some good work? I’d be curious to read.

        • How about:

          https://www.ncbi.nlm.nih.gov/pubmed/30561628: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data

          https://www.ncbi.nlm.nih.gov/pubmed/15308962: A structural approach to selection bias.

          https://www.ncbi.nlm.nih.gov/pubmed/25589243: A cautionary note about estimating effects of secondary exposures in cohort studies

          Also: My impression is that DAGs are underused in epidemiology (but what do I know, I’m a psychologist), but this doesn’t mean they are not useful. (Bayesian inference was not used a lot until recently, that does not mean it is not useful)

        • Your first example uses a model that predicts high sodium diets leads to increased chronic blood pressure, the newest findings are that this is incorrect.

          In fact higher sodium intake seems to be associated with lower chronic blood pressure (although BP may rise transiently due to sodium intake): https://www.eurekalert.org/pub_releases/2017-04/eb2-ldm041217.php

          See… If a DAG had been used to predict this surprising (in light of conventional claims) new result then I would be interested. They had an opportunity to do something useful here but all they did with their method is “confirm” whatever the authors already believed.

        • I think there is a misunderstanding here.
          In my view this paper is not about about high sodium levels, but about collider bias. The paper can’t show us anything about the true effect of Sodium on high chronic blood pressure because it uses simulated data.

          I don’t think DAGs can make any interesting, theoretically substantial predictions. After all, they are just tools to encode our predictions (assumptions) about causal effects. [DAGs allow making predictions about conditional independencies, but this is-in my view–more relevant to check if a DAG is consistent with the observed data, not to gain important new theoretical insights].

          [DAG-explanations of paradoxes like Simpsons ]

          I’d like to add that I think there is a difference between mechanistic models (e.g. in cognitive science), which are tools for theory development and might thus lead to interesting predictions for some experiment, and DAGs, which for me are tools to reason about the possibility to estimate causal effects from observational data and choice of adjustment variables for a regression model.

          There areas of application can also overlap (I am not claiming to be an expert). But the important thing to me is that DAGs and mechanistic models are not competing approaches. I see them as tools which each serve a particular purpose especially well.

        • I don’t think DAGs can make any interesting, theoretically substantial predictions. After all, they are just tools to encode our predictions (assumptions) about causal effects.

          Is this not the same as any model? Eg, if you apply a more complete version of my cancer model above to epidemiological data, you see that it can fit really well for many types of cancer. However, there are certain aspects of the data that cannot be explained by a very small value for p, which has long been assumed to represent somatic mutation rate (ill give a very generous range of 10^-5 to 10^-10 mutations per bp per division). Either that or cell lineages that originate cancer are dividing multiple times per day normally, which would be surprising.

          Thus the model predicts that cancer primarily develops due to a much more common type of error (eg, chromosomal missegregation which appears to occur on the order of 10^-1 to 10^-3 per division). Or perhaps almost all cancer originates in quickly replicating invading immune cells that somehow acquire the properties of the cells in the host tissue).

          Of course maybe the model of accumulating errors is totally wrong to begin with, but if we later find that, eg, accumulation of chromosomal missegregations is the primary type of event leading to cancer it will tell us the model is on to something.

          If a DAG cannot lead insights like this I do not see what purpose they serve.

        • In my smoking-cancer example above I am doing epidemiology though (eventually predicting and understanding age specific incidence curves, from eg SEER).

          And I skipped over it for brevity, but the assumptions would be explicit when I derived my model of “tumorigenesis” (actually the correct term was probably “carcinogenesis” for what I showed there since it only models the creation of a single cancer cell). At it’s core the model is basically just deriving the geometric distribution and then assuming the identical process must occur n times in the same cell, obviously we can relax some assumptions like constant error rate, etc if we wanted.

          So I just see no need for a DAG, what do you see it adding to my example?

        • I don’t know. But I suspect that much like the principle that “Everything is correlated with everything else”, that also: “Everything that happens is influenced via some causal chain by everything that happened earlier”.

          So it isn’t really identifying whether causes exist that should be goal (the vast majority will be negligible), but instead figuring out how a system works and how we can most efficiently tweak it to do what we want.

        • Put another way, although everything is caused by everything else to some extent, there are often only a small number of factors that are needed to get usefully accurate predictions.

          on the other hand, there is also chaotic Dynamics… so that the effect of causing a certain variable to take on a certain value can be dependent on the specific state of the system at the time you perturb it… think of the orbit of a chaotic 3 part multi-pendulum… what happens if you fire an impulse of momentum into the third part? depends entirely on the full state of the system, and can bifurcate for even extremely small perturbations, like spitting a spitball at the device

        • I have no desire to stir the pot, but if one is intimidated by formal models of the underlying mechanisms relating phenomena, then the perceived lack of progress within a field like epidemiology is ostensibly “solved” by methodological innovations (such as DAG).

  6. > If we instead simply take the classical approach, we’d start with a hypothetical controlled study of exercise on lung cancer, a randomized prospective study in which the experimenter assigns exercise levels to patients, who are then followed up, etc., then we move to the observational study and consider pre-treatment differences between people with different exercise levels.

    So you would __simply__ take thousands of people and make them exercise consistenly the exact ammount of exercise that is indicated in the protocol for years? Self-reported smoking and exercise may be noisy, but I think data quality in such an experiment would be even worse. I would be expect people to be more likely to misreport what they do (unless you do something drastic like putting them in cages with running wheels). Let alone the problems to enroll participants (and the self-selection bias, unless you make participation mandatory).

      • Thanks for pointing that out. I guess that explains what the “hypotetical” word was doing in that sentence. Still, I don’t understand what kind of reasoning does he describe and how it makes clear which are the differences between groups that you want to adjust for.

        Say the different “exercise level” groups differ on smoking rates, testosterone levels, caloric intake, urban/rural living… You may need to adjust for some but not for others and I don’t see how knowing that you could have ran an hypothetical randomized where you wouldn’t care will help. It seems to me that you need to reason about each variable to see what the relationship to exercise and lung cancer may be.

        • I do think Judea is right, there is no explanation for the creative process of deciding what to investigate, which assumptions are ok to make, which make no causal sense etc. How do you decide to adjust for testosterone levels and not say antidiuretic hormone levels or insulin levels or frequency of exposure to indoor pool water, or shellfish on the diet…

  7. Not sure if this lands in right spot. This is a reply to Anoneuoid post that ends in “If a DAG cannot lead insights like this I do not see what purpose they serve.”

    If I understand you correctly, you describe a biological model of how cancer develops and ask what DAGs can contribute here. Maybe DAGs can contribute here, maybe they can’t, I am not sure. But cancer is maybe a good example to explain again why I think biological plausible /mechanistic models and DAGs serve different purposes.

    If you want to explain how cancer develops on a cellular level, I wouldn’t ask for a DAG and agree that a detailed model of cell replication makes more sense.

    However, if you want to know if some toxin might, by a mechanism you do not understand, cause cancer, and all you have is a sample from an observational study with data about exposure to the toxin, a bunch of covariates, and cancer diagnosis, then DAGs can help to answer questions like:
    Given the assumed causal relationships of observed and unobserved variables in my simplified model of cancer,
    Is this assumed causal model consistent with the observed data?
    Is it possible to estimate the effect of toxin on cancer in this study? (The answer might well be “No”.)
    For which covariates should I adjust, and for which shouldn’t I adjust in a regression?

    I think this are relevant questions, even if they don’t lead to interesting theoretical insights. And DAGs are a useful tool to provide answers to such questions.

    • However, if you want to know if some toxin might, by a mechanism you do not understand, cause cancer, and all you have is a sample from an observational study with data about exposure to the toxin, a bunch of covariates, and cancer diagnosis, t

      I guess I just reject the idea this is possible without a model of cancer to begin with. Otherwise you end up with an arbitrary (convenient in some way) model predicting A increases/decreases B and that is just too vague to differentiate it from tons of other explanations.

      • OK, but this seems also to imply that no experiment that is not informed by a detailed understanding of a causal (in this case biological) mechanism can establish causality.

        I am not aware of a framework for causal inference where knowledge of a detailed causal mechanism is required to establish causality. If such knowledge would be required, the potential outcomes framework would for example also be invalid.

        A mechanistic understanding certainly strengthens a causal claim, but I don’t think it is the only way to causality.

        • You mentioned “causal” 7 times in that post.

          Honestly, if you look at my typical post you will almost never see me mention causality (although maybe it is implied in ways I don’t pay attention to). It just doesn’t seem to be a fruitful area to focus on to me, perhaps for the reasons that motivated my parent post.

          Instead, I am interested when people model the process that they think generated the data and use the parameters of your model to put bounds on what you should see in other data if the model is good. I am interested in that because it has a long history of producing fruitful results.

        • I agree that modeling hypothesized mechanisms is a very important way to make scientific progress (lots of my own research does it, I’m not sure how successful).

          If you are not interested in causality, then we probably have talked past each other. But causal inference are what DAGs are for.

    • a detailed model of cell replication

      Also, I wouldn’t call that model “a detailed model of cell replication” in any way. It is a very vague and high level model that only considers division rates and “errors” that could correspond to pretty much anything. All sorts of different “submodels” could go into determining the parameters of it like p and d as a function of age.

  8. Guido,
    Once we have “knowledge of a detailed causal mechanism” we no longer need to “establish causality” because that
    knowledge IS sufficient for answering the two questions that Causal Inference is purporting to answer: (1) questions about the effects of pending interventions and (2) questions about the effects of undoing past events.

    Many bio-statisticians are busy building models of causal mechanism and do not ask themselves what questions such a model enables them to answers. They understand intuitively that once the parameters of their models are satisfactorily estimated, they gain an “understanding ” of the phenomenon of interest and they are perfectly satisfied. They do not ask what “understanding” is. The need to ask what “understanding” is arose in areas where a mechanistic models were deemed impossible apriori (eg smoking and cancer) and people asked: “Hey, perhaps we can get answers to urgent policy questions without detailed mechanisms, but with the help of some data”. It worked in the case of RA Fisher and his agricultural plots, and it worked in many other areas where “functional mechanisms” cannot be established; epidemiology and social science are perfect examples.

    I see that Anoneuoid prefers to purge the adjective “causal” from science and go back to pre-Fisherian era (and pre-Wright) of analysis. There is some wisdom in it. Bio-statisticians would then be able to continue their parameter-tweaking research without having to think whether their assumptions are causal or statistical, without having to sort out their assumptions into “testable” vs. “untestable” categories, and without having to think what the model enables them to do that they could not do without. Life would be much easier for them then, followed by fruitful results in bio-statistics. Continuing this movement, perhaps Andrew will agree to remove the words “causal inference” from the title of his blog because, after all, everything is just predictive models.

    I, for one, will not remove “causal” from the title of my blog because, for me and for many of my readers, the distinction between “causal” and “statistical” has been an eyes opener. The realization of what can and what cannot be inferred from observational and experimental data, and what information is needed to answer counterfactual questions has saved readers hundred of hours of futile explorations and endless debates. See for example the decades-long debates on what “confounding” is, or what “exogeneity” is, or what “external validity” is, or what “indirect effect” is … etc. etc. (See some remarks on the history of the distinction – https://ucla.in/2N9udy7) No, I can’t deprive my readers from the benefits of modern causal inference.
    Judea

    • Judea, can I take this post to mean that those of us doing mechanistic dynamic modeling using ODEs, Agent Based Models, PDEs, and even algebraic equations with causality already assumed in a certain direction and a certain way are free to continue doing so with your blessing, and that it’s the “mechanism free” causal inference that you have elevated to a science with your DAG system? because I just don’t see how a DAG would help an aerodynamicist reduce the drag on a submarine for example. The aerodynamicist knows that the drag is affected by the shape and the speed, and not the other way around with say adjusting the speed causing the shape to change or whatever. The aerodynamicist builds a CFD model, adjusts the shape, simulated the flow in a computer, and then determines the drag coefficient… then builds a physical model and runs it through a wind tunnel, and verifies the calculation. causal inference!

      Anoneuoid is a bit of a radical, he basically says if you don’t even try to build a mechanistic understanding… then just forget it… I’m less radical than that, I say try always to build some mechanism, but still measuring a causality from some minimal assumptions is occasionally a useful thing, if only to decide which things to include in your mechanistic model for example.

      Anoneuoids core assumption which he’s expressed before is “everything causes everything else, the real question is what is negligible small” or in other words, the only graph he’d accept without detailed experimental evidence otherwise is a completely connected undirected graph, where bidirectional connections would denote dynamic feedback in time, so “taking aspirin affects your headache at the next time step, but the headache change at the next timestep affects your brains neurons so that headache later indeed causes you to take aspirin” and “clouds cause headaches by adjusting the pupillary opening and changing your susceptibility to eyestrain, but eyestrain causes clouds because you close your blinds and turn on the electric lighting and AC and this alters the climate..etc I’m pretty sure you could say that he views science as primarily about discovering which are the key variables that dominate such relations. What he wants is more of what you have already said can’t be formalized, the creative discovery of the mechanistic ideas in the first place.

      what your methods seem to enable is this middle ground where people think they know which variables matter but they aren’t very sure HOW they matter, so they make some guesses and then because they are very vague, they need some help… a simulator which is not very vague at all but actually explicitly claims to predict the future causally but simply doesn’t have the right numwerical values plugged in is a typical situation for me and that of say Chris Wilson, or ojm, or maybe a few others… I think you can see that a person with a CFD simulator, or an explicit model of plant transpiration and an experimental apparatus to affect and measure it would find it annoying to be told to “go backwards” and use a tool that is designed for “does smoking cause cancer” rather than “plants modulate their evaporative water loss through stomata opening size which is dynamically affected by blablabla…”

      so while I think those people seeking to find out if subsidized student loans cause improved health outcomes for the subsidy reciever’s children … but they aren’t willing to actually describe the mechanism they imagine and to measure all the mechanistic consequences and compare the mechanistic predictions to the measured outcomes… those people probably should read your DAG literature today right now… But I’d also argue they should get ahold of someone who can help them think about mechanism too, because a science full of “in the social environment of 2010-2020 improved education caused improved health of children” is a pretty meager science… it can’t tell us about what to do in 2030 after myriad background variables have changed which could be assumed near constant in 2010-2020 for example changes in healthcare laws and in economic conditions and in technology for providing healthcare and etc

      I think it’s this which causes the majority of people here to fail the bandwagon leap. also I suspect there are plenty of less vocal people here reading the blog who do less mechanistic stuff and they are not motivated to push back in the way that ojm or Anoneuoid or I do.

    • I see that Anoneuoid prefers to purge the adjective “causal” from science and go back to pre-Fisherian era (and pre-Wright) of analysis. There is some wisdom in it. Bio-statisticians would then be able to continue their parameter-tweaking research

      I don’t want to “purge” the concept of causality, it just has failed to interest me. Like I keep repeating, I just don’t see what usefulness comes from worrying about it.

      I am definitely “old fashioned” and prefer pre-Fisher methods of research. NHST is another thing that has never shown itself useful (other than in getting papers published). But I am not, and never was, a bio-statistician. I don’t know what I am now… I make most of my money equity trading these days since its close to pure meritocracy. There is no boss and no clients so I don’t have to waste any more of my time convincing other people when they are obviously wrong about stuff. The rest comes from what I guess would be called software development.

      And deriving mathematical models to attempt explaining a phenomenon and then figuring out what the values are for any parameters is science of the highest order. It is hardly “parameter-tweaking research”.

      Btw, If you are interested in more of my thoughts about the smoking-cancer link I’ve previously commented about that on this blog. I suspect the narrative and recommendations that the causal/statistical inference method has generated are pretty far off base and possibly actively harmful (eg, quitting smoking can lead to cancer since the most cell divisions occur when the tissue heals):

      https://statmodeling.stat.columbia.edu/2015/08/12/reprint-of-observational-studies-by-william-cochran-followed-by-comments-by-current-researchers-in-observational-studies/#comment-232707
      https://statmodeling.stat.columbia.edu/2018/07/30/file-drawers-fire/#comment-823053

    • I knew I should have gotten out the popcorn!!! But seriously, I see value in what has been done in the causal literature, but what I don’t see, as a number of others have commented, is a lot of applications in complicated models, or even things that would guide how to do that. Instead when challenged, there is a retreat to toy models. Fine, I see that causal inference allows me to deal with toy models, unfortunately the problems I have aren’t toy models. So i have large scale multivariate spatial-temporal data, where the feedbacks probably occur in both directions at a time scale shorter than the data I can analyze, there are a lot of spurious structures due to the spatial and temporal correlation. Given the limits of DAGs, I don’t know how I would even begin to model this in a DAG, and even if there exists literature on how to do so (and if there is I would greatly appreciate being pointed to the references), we can not then simply estimate the results by simple models as so much of the causal literature implies.

      i think it would really help everyone that when challenged by people skeptical of your approach, instead of going back to the same toy models, you instead say here are papers that use DAGs to good effect in the class of problems that you are trying to understand, and here is how you would implement them. You may not agree with Andrew’s approach, but usually when he describes how he would approach such problems, there are paper(s) that show that approach on real, messy data and problems, which allows me to judge for myself how well I feel his solutions deal with the issues.

      • “i think it would really help everyone that when challenged by people skeptical of your approach, instead of going back to the same toy models, you instead say here are papers that use DAGs to good effect in the class of problems that you are trying to understand, and here is how you would implement them. You may not agree with Andrew’s approach, but usually when he describes how he would approach such problems, there are paper(s) that show that approach on real, messy data and problems, which allows me to judge for myself how well I feel his solutions deal with the issues.”

        Agreed

    • Fair enough!

      Can we have another post to argue about whether graphical models are the best/only way to represent causal assumptions?

      Here are some suggested questions

      – is a semantics of variables ‘listening’ to each other to ‘set’ their values (as Pearl emphasises) a good way to express causal assumptions? How does this compare with the semantics used in other scientific fields like physics or chemistry?

      – is a ‘nonparametric’ graphical model/structural equation model really capable of representing arbitrary causal systems and in particular the associated semantics? Why do so many areas of science instead use eg integral or differential equations?

      – are alternative modelling frameworks really ‘parametric’ and/or require fully specified mechanisms? Or are they alternative frameworks with different semantics that can also accomodate qualitative or partially specified assumptions?

      – can DAGs represent continuous time and/or continuous space processes?

      – would Newton/Kepler have benefited from or been hindered by DAGs?

      – suppose I’m interested in why a particular dynamic system exhibits oscillations in certain regimes eg rabbit and fox populations (as a toy example). How might I formulate and analyse a toy model of such a system?

      • Ojm:

        I have not found Pearl’s approach helpful in my own theoretical or applied work, but lots of thoughtful people do get something useful out of his approach, so I’m willing to believe it has value. There are all sorts of different applied problems out there, and a method might not help for the set of problems I work on, but it could be useful in other settings.

        • I’m not disputing that it can have value – I’ve even found it useful in a couple of real (data analysis rather than modelling) problems!

          I think the book of why is a pretty good intro if you just ignore pretty much everything except the examples (like I said above, I’ve even recommended it to multiple people).

          So…I’m not actually trolling! I think discussing the pros and cons of DAGs etc and comparisons with more standard causal modelling approaches in the sciences would a legitimately interesting topic of discussion for many folk!

        • I bought their book a while ago coz I wanted to learn more about how the social sciences dealt with causality.

          Chapter 10 is interesting, but mostly words. It also seems to count ‘mediation’ analysis as ‘mechanistic’??

          Most of all, I thought the quote from Gary King on the back was hilarious:

          > More has been learned about causal inference in the last few decades than the s total of everything that had been learned about it all all prior recorded history

          !!

          Tell that to Newton, Einstein even Carnot etc!

        • This article, discussed in detail in chapter 10, is pretty interesting:

          https://academic.oup.com/esr/article-abstract/17/1/1/502739

          (Causation, Statistics, and Sociology by Goldthorpe)

          But again, largely verbal. They compare

          – causation as robust dependence
          – causation as consequential manipulation
          – causation as generative process

          and argue for the last.

          This seems reasonable to me, though not unobjectionable, but the key questions to me then are

          – what are generative processes
          – how should they be represented mathematically
          – how should they be interpreted mathematically

          I’ve briefly given reasons here why I doubt DAGs as the math rep and variables listening to each other to set their values as the semantics of causality are obviously the correct way to go, and are at least somewhat at odds with other areas of science.

        • Andrew,
          It is not clear to me if your comments on Lederer etal article reside within the sphere of your
          own theoretical or applied work, or outside it?

          If within, I am curious to know if you would still recommend the “standard approach”, namely:
          ” .. we move to the observational study and consider pre-treatment differences between people
          with different exercise levels. This makes it clear that there’s no “back-door path”; there are
          just differences between the groups, differences that you’d like to adjust for in the design and
          analysis of the study.”
          Judea

        • Judea:

          As I wrote in the above post, I think that article has some value but also gives some confusing and misleading advice. That doesn’t mean I think your ideas are bad; I just think that particular article has problems. I think Bayesian methods are super-useful but I’ve read lots of articles that recommend Bayesian methods in ways that I think are confusing and misleading. Regarding your question: I believe what I wrote in the above post, and I also realize that statistics in general, and causal inference in particular, are complicated, which is why my colleagues and I have written many articles and book chapters on the general topic of causal inference, along with many research articles applying causal inference to problems of interest. These are big problems that can, and will be, attacked in many different ways, and I have no interest in a flame war of the emacs vs. vi or mac vs. pc variety. And I think it’s a mistake on your part to turn a blind eye to the failings of the above-inked article, just because it happens to recommend your methods. Lots of thoughtful researchers find your methods to be useful and even essential—that’s great, and it should be fine for you to recognize problems with literature that happens to promote your methods but also happens to have serious problems.

        • Andrew,
          I have not commented yet on Lederer etal article. I was preparing to do it shortly on Twitter
          with some critics (“causal associations” sounds empty to me).

          Apropos, I think it is wrong on your part to portray me as a blind promoter of ONE approach, to the exclusion of
          others, when you have seen me laboring hard (including on your blog) to enrich causal inference with
          new tools and new concepts, subject to only one requirement: “Help answer questions about interventions and counterfactuals
          with some guarantee of validity”
          JP

  9. You “don’t see the point of all their discussion of colliders” but do you agree with their conclusions? [control for smoking when studying the effect of exercise on lung cancer; do not control for respiratory crackles when studying the effect of beta-blockers on ARDS]

    If yes: Do you think that “the classical approach in which an observational study is understood in reference to a hypothetical controlled experiment” would lead to the same conclusions?

    If yes: Would the “classical approach” rely on the same kind of causal assumptions represented on those graphs? [smoking has an effect on exercise and cancer; both heart failure, which leads to beta-blocker use, and pneumonia, which can cause ARDS, may produce crackles]

    If no: How are smoking and respiratory crackles different when the hypothetical controlled experiments are considered?

      • Well, it’s a top-level comment including verbatim quotes from the post :-)

        But you’re right, I could have started the comment explicitly addressing Andrew. Thanks.

      • Andrew,
        The whole point of Lederer’s etal article was to warn researchers against the pitfalls of the “standard approach” which you describe as:
        ” .. we move to the observational study and consider pre-treatment differences between people with different exercise levels. This makes it clear that there’s no “back-door path”; there are just differences between the groups, differences that you’d like to adjust for in the design and analysis of the study.”

        The pitfalls are:
        (1) considering pre-treatment differences DOES NOT MAKES IT CLEAR that there’s no “back-door path”;
        (2) Differences between the groups ARE NOT differences that you’d like to adjust for.
        (3) Adjusting for those difference may result in BIAS AMPLIFICATION instead of bias reduction
        (4) Deciding between bias-amplifiers and bias-reducing difference requires DAG’s insight, which is missing from the
        the “standard approach” that you describe.
        Conclusion: What you describe as the “standard approach” is really a “standard habit” that Lederer et al article warns against.
        An “approach” comes with theoretical guarantees which DAG-based methodology provides, and habits do not.

        Roy and Martha (Smith)
        You say:
        ” there are paper(s) that show that [Andrew’s] approach on real, messy data and problems, which allows me to judge for myself how well I feel his solutions deal with the issues.”
        Do you really believe that running a procedure on messy data and large problems would allow you “to judge for yourself how well” the solution avoids such pitfalls? Those pitfalls were hidden from researchers for decades and were discovered by mathematical analysis and demonstrated by simulation on toy problems. Do you expect your unaided judgment to determine “how well his solutions deal with the issues”?

        I hate to question the capabilities of your unaided judgment but, when dealing with messy data and large problems, even the almighty seeks the advice of mathematical analysis.

        I wish I could reply with the post above each time a reader on this blog glorifies the wisdom obtained from running something on “messy data and large real life problems”
        Judea

        • Hi Judea:

          It is funny sad that in your response you make an uncalled for personal swipe about my ability to judge what I read, including your papers and books (of which I have read a large majority), but what you didn’t respond to was the very clear challenge to refer me to papers that use DAGs in the type of contexts that I mention. Therefore I have to assume that you are unable to do so, instead will refer me to the toy examples again.

          I think a proper response is very simple – here are a list of references that use DAGs in at least the setting of multivariate time series with feedback, better in a spatial-temporal setting. if you can not provide such a list, then there really isn’t much guidance for using your techniques in those settings, is there.

          I am sort of sadden by this response, because I have followed your work for many, many years. This is not the way to convince people that you have solutions to the problems they face.

        • Roy,
          I dont see any ” personal swipe about your ability to judge what you read” . On the contrary,
          I said the task you wish to undertake is beyond the capability of any mortal “when dealing with messy data and large problems, even the almighty seeks the advice of mathematical analysis.”

          Here is what I asked: Do you really believe that running a procedure on messy data and large problems would allow you “to judge for yourself how well his [Andrew’s] solutions deal with the issues”?

          This is not a swipe but an honest request to help me resolve the puzzle or “running”.

          This theme appears again and again on this blog. The ability to “run” something on messy problem seems to be so uplifting that some readers take it as a teaching exercise, as if something is learned from it beyond the sheer ability to “run it”
          I wish someone could explain to me what is so enticing about the ability to “run” something on messy data.

          I am not trying to avoid your other request for reference to a DAG-based exercise on large and messy problem. I do not collect those references because I have never thought we can learn something from just “running”. But I will keep my eyes open.

        • Hi Judea:

          I am getting old, and I know my eyesight is not as good as it use to be,. I seemed to have missed the list of references in your response. So I again pose my very simple question, please provide a list of references that deal with how to use your techniques in the contexts I mention. Otherwise, yes, you are either deflecting the conversation of making attacks on my ability to understand what you are saying.

          So it is really very simple, in your next response there either a list of references or there is not. if there is not, then you have answers my question, haven’t you.

        • I quickly skimmed the article linked by Judea.

          It seems like a fairly generic perspective article (fine as far these go) but no new results as far as I can tell. It also discusses things like attractor reconstruction and Takens’ theorem from dynamical systems theory as alternatives to SCM etc, and which I believe are outside the scope of any SCM/DAG stuff I’ve ever seen.

          Would be cool if this sort of stuff turns into something though (they have a website it seems).

      • Andrew, as in the “then a miracle occurs” cartoon, I think you could be more specific in what happens between

        “we move to the observational study and consider pre-treatment differences between people with different exercise levels.”

        and

        “This makes it clear that there’s no “back-door path”; there are just differences between the groups, differences that you’d like to adjust for in the design and analysis of the study.”

        I thought that your conclusion was that one should control for smoking in the observational study, but I didn’t understand precisely why. It seems that actually I understood even less than I had assumed.

        • Carlos:

          In that article, they wrote, “Controlling for ‘smoking’ will close the back-door path.” I don’t think this is correct for reasons I stated above: First their “controlling” step (I would call it “adjusting”) will be extremely sensitive to the model they use to adjust; second there will be other variables to adjust for, not just smoking.

          I wrote: “move to the observational study and consider pre-treatment differences between people with different exercise levels.” This is difficult too as it’s open-ended, but that’s real life with an observational study of this sort. There are no magic bullets. If the applied research team doing this study judges that enough pre-treatment variables are measured on each person in this study, they can do some sort of matching and regression study in order to find comparable groups and adjust for differences between them; otherwise it could make sense to include latent variables in the model and adjust for them too. A lot will depend on the details of what’s being studied, and the end result will necessarily depend on lots of assumptions.

        • Thinking about this, it seems to me that one of Andrew’s points is that it’s not like there is a latent DAG just waiting to be written down, and any DAG that *is* written down, will have to compete with potentially a variety of other DAGs for Andrew’s attention. In real world messy problems with tens or hundreds of possible causal variables, including some you may not have thought of, there’s no magic bullet that writing down a DAG will get you accurate real causal inference.

          What writing down a DAG does is formalize a certain set of assumptions. And then, I assume what Judea’s do calculus does is let you work out some consequences of those assumptions.

          How does this help you if you have say 7 possible DAGs and each one works out using do calculus a different set of consequences… one tells you to control for smoking, one tells you that smoking choice is an instrument variable, one tells you that in your population of older people, recently quitting smoking causes cancer, one tells you that quitting smoking causally reduces risk of cancer…

          As far as I can tell, having 7 entirely different competing possible models of how things work, each with its own consequences derivable from do calculus or whatever is *exactly* the kind of situation we are in routinely in the “messy real world problems” whereas having exactly one DAG that everyone agrees on is exactly the situation we’re in for the toy problems people here are complaining about.

          In the end, how does the DAG system help you when there is no one DAG a given researcher or group or even community can decide on as the correct one to represent the problem, and in fact it’s not even just 2 or 3 possibilities but perhaps a very large set including variables we haven’t thought of yet (for example all possible combinations of 4 different alleles in 37 gene loci, so maybe there are 66045 different genetic variables and each one interacts with 12 different lifestyle choice variables, to produce lung cancer risk, and the mechanism of interaction isn’t clear (does genetics cause lifestyle choice, or does lifestyle choice combined with genetics cause a third thing?).

          For each of the 12 lifestyle variables there could be a genetic complex to choice arrow, or not, so there are potentially 66045 * 12 = 792540 different arrows you could draw… which means that there are potentially

          2^792540 different DAGs you might want to consider, which is a number that is so large it’s just laughable.

          so, there’s that.

        • As far as I can tell, having 7 entirely different competing possible models of how things work, each with its own consequences derivable from do calculus or whatever

          […]

          In the end, how does the DAG system help you when there is no one DAG a given researcher or group or even community can decide on as the correct one to represent the problem

          If you can derive different consequences from the various DAGs that let you distinguish between them, then check those against new data it would be useful. Perhaps they are just too vague to make meaningful predictions though.

        • The problem is that as far as I can tell, the kinds of consequences you can derive are things like “if you control for exercise but not smoking you can estimate the causal effect of dental heigene on oral cancer accurately” for one DAG and then for the other “if you control for smoking but not exercise you can estimate….” and then for a third DAG ” if you control for smoking and exercise both…”

          each one will give you a different estimate…

          if you have mechanistic models then you can look at many consequences, it would let you distinguish… but simple but conflicting and competing nonparametric DAGs… I don’t know, seems hopeless without effectively moving towards mechanism

        • Daniel and Anonueoid,

          tl;dr: I discuss what causal inference says when many DAGs are consistent with prior knowledge (a lot), provide a loose overview of DAG-based causal inference (maybe it will connect with you), and describe what causal inference can bring to studies of carcinogenesis (it could be helpful).

          When I use causal inference, there are often multiple DAGs that are consistent with my prior knowledge. Sometimes incredibly many, as you point out. In these situations, a basic understanding of DAG-based causal inference tells me two things

          (A) Suppose there is a single target of my inference, like the average effect of treatment X on outcome Y among the individuals included in my study. Sometimes, I can set up an analysis that will let me estimate this parameter satisfactorily for all the many DAGs that are consistent with my prior knowledge. At least in my experience, this actually occurs in practice. The theory of DAG-based causal inference is what tells me how to set up the analysis just right. A theorem that is especially useful for this appears in VanderWeele and Shpitser (2011) A new criterion for confounder selection. Biometrics 67:1406-13.

          (B) If many DAGs are consistent with my prior knowledge, then the theory of DAG-based causal inference can reveal testable implications of these DAGs, which will allow me to rule some of them out based on the data available in my study. Anoneuoid, this is what you are pointing out!!! You can see some of these testable implications by looking at the examples on the daggity website (www.dagitty.net/development/dags.html). Click through the examples and look at the “testable implications” sidebar. I should note that in my own work, the DAGs that I use do not look like many of the daggity examples, especially the ones with many nodes. Instead, the DAGs I use tend to involve only a handful of nodes, but some of the nodes are composed of many subsidiary variables and receive a label like “all background variables”. I should also note a limitation: just because two DAGs have testably different implications in an ideal study with infinite sample size does not mean that they will have reliably different implications in a real study with a finite sample. However, this limitation also applies to the fully mechanistic models that you, Anoneuoid, use.

          Daniel and Anoneuoid, I recommend again that you learn this material! Read!!! The length of your comments is starting to rival the length of some introductions to causal inference. An excellent condensed introduction is Pearl, Glymour, and Jewell. “Causal Inference in Statistics: A Primer.” However, it has a tragic number of typos/writing glitches, which are listed on the books webpage. To profit from the book, you may need to go through and correct these by hand.

          Another, longer option is Winship and Morgan’s “Counterfactuals and causal inference,” which I recommended to you previously and is also recommended by Andrew above.

          Stepping back, here’s a very loose, nonrigorous overview of what DAG-based causal inference does. I’ve tried to put it in language that will connect with your statistical worldview. Maybe I’ve succeeded or maybe I’ve failed (again), but give it a try:

          Suppose you are working in a class of mechanistic models of the following restricted type: You model is composed of variables X, some observed and some not. The value of each element x of X is set by a mechanistic function fx(Yx), where Yx is a subset of X. Let us further restrict the class of mechanistic models to those involving discrete time (I know this is a major restriction). Now, we have arrived at the class of models for which DAG-based causal inference is strongest. Imagine these models as animals, each of which is made up of muscles and skeletons. In this analogy, the muscles are the mechanistic functions fx. The muscles determine how the skeleton moves — in this analogy, the skeleton is the DAG. When using a fully mechanistic approach to modelling, you are modelling the whole animal — both the muscles and the skeleton. If you are used to this fully mechanistic approach, you may think that nothing much could be learned from studing the skeleton alone. But you are wrong. A remarkable amount of information can be learned about the animal from studying the skeleton alone, and this is what DAG-based causal inference tells us.

          This analogy also speaks to the domain in which DAG-based causal inference is strongest: observational studies. In observational studies, the data are fixed. We cannot see the data move. It has been fossilized. So, we are stuck learning from the skeleton, which is what DAG-based causal inference is designed for.

          Now, let’s move from that loose analogy back to the specifics of what you, Daniel and Anoneuoid, want to learn.

          Daniel, I think you mentioned multiagent models above. I don’t know enough about those to distinguish good from bad work, so I can’t provide you with insight into how DAG-based causal inference might help. Or it could be inapplicable so far, I don’t know.

          Anoneuoid, you’ve mentioned mathematical models of carcinogenesis. I know more about this topic, and DAG-based insights are definitely applicable, though perhaps not in a way you like: The reality is that there are thousands mathematical models of carcinogenesis now, including originating work by Armitage and Doll, Knudson, and Nordling and continuing to the unreadably-many models that have been published since. A major limitation to the field as a whole is that many different models are consistent with the same data. There is nowhere near enough data to distinguish between the many underlying mechanistic models that have been proposed. What to do about this? Some informative evidence could come from comparing lung cancer rates in individuals who smoke different amounts (including none) and for different durations. The different models of carcinogenesis predict different effects of smoking amount and duration on cancer rates, through their structures and the rate constants that are affected by carcinogens. So we can compare the model predictions with actual cancer rates among smokers in order to rule out at least some of the models. Yet, heavier smokers and lighter smokers differ in more ways than smoking alone, and this could also affect the cancer rates in ways that none of the mathematical models of carcinogenesis account for. How do we know that, in comparing lung cancer rates in heavier and lighter smokers, we are not observing the effects of these other differences, rather than the effects of smoking alone? We don’t, except that we may be able to use DAG-based causal inference to control for these other differences, and thereby estimate the smoking-specific effect on cancer rates — which is the effect that can be used to test the models of carcinogenesis.

        • Thanks for this summary, it is perhaps the most useful comment I’ve seen so far from a proponent of the DAG based methodology in terms of explaining what the purpose is. I have actually read several introductory papers etc a few years back, and was basically unimpressed. I suspect that this is because relatively little of the stuff I do is “fossilized” as you say. That is, if my mechanistic model has some implications that I want to test, I can generally go either collect data or find data already collected, which would help to answer the question. In fact much of my interests comes from say predicting things output from my mechanistic models and then helping say a biologist or physical therapist, or engineer, or business person design an experiment to collect data to see if the thing works.

          Let me go a little further than that, and discuss the concept of a DAG as a computing machine. We can think of a DAG as a way of describing the dependencies of a calculation. For example I have 4 nodes, A, B, C, and D, the connectivity is A-C, B-C, C-D. Evidently I can’t compute D until I know C, and I can’t compute C until I know A and B. This encodes in essence the 1-way flow of time and/or information.

          But lots of other computing “machines” are capable of encoding this kind of information flow, such as recursive functions. Furthermore, they’re capable of encoding more complex relationships, such as mutual recursion, or what have you. As the computing system becomes more general, the things it can do become more general, and the things we can do *to* the computing machine in an automated way, such as determine if it will halt or if changing a variable will cause it to output a different output, are reduced.

          By all means, if you can encode some model in a DAG and then derive useful stuff from it, hey, more power to you. What I find annoying is the assertions that people working higher up the chain on more general computing models are failing to do causal inference, or can’t solve problems that DAG people can solve.

          Let me give you an example of how multi-agent models would be extremely complicated to encode into a DAG, if it’s possible at all… A wolf runs around after sheep, in time it can see sheep that are “nearby” within some radius, sheep walk randomly around grazing, and using rules to determine how to move. The rules have emergent properties that keep sheep in small herds that tend to break up into multiple herds when they get large. The wolves range around randomly looking for sheep herds, and then assess which are the weak sheep and then coordinate an attack on the weak sheep….

          I am going to run this model for 20,000 timesteps, representing say 20,000 hours = 2.28 years. Each variable will need 20,000 subscripted versions to represent its value at the given timestep. The connections between variables are determined by physical considerations like “how far can a wolf see” so whether there is a causal connection (an arrow between nodes in the DAG) between say the position of sheep 3 and time 88 and the behavior of wolf 5 at time 89 depends on the distance between the sheep and the wolf at that time, as well as things like communication between wolves, as well as whether the sheep 3 is still alive by time 88, and myriad other “state variables”.

          All of these state variables at time 88 depend on the outcomes / sample paths of random numbers generated at each node, such as wolf 5 sees sheep 3 at time 88 and randomly generates a number that encodes whether it will decide to chase this sheep in time 89 or not…

          So the real model encoded into a DAG is actually a distribution over DAGs each sample DAG having a few tens of thousands of nodes perhaps and a few million or so connections. However, what I really want is statistical properties of the behavior, such as “at time 20,000 what is the net energy transfer from grass to sheep and from sheep to wolves” or “what is the average population growth of sheep in the spring” or “what is the probability that a sheep born in the spring survives to reproduce if it is using rule 1 for its behavior compared to sheep that use rule 2” or questions like this.

          So, the way this would work is typically to generate random simulations according to some initial conditions rules, and then use billions of calculations encoded through a turing complete language, to compute the consequences of the model for many different realizations of the model, perhaps say 10,000 realizations, and then use the observed statistics of the simulations and the observed statistics of say real population or behavior studies, to find the model parameter ranges that best agree with observational data.

          Now that I’ve fit this model, I can then make predictions about what would happen if certain parameters were changed, like the rate of death of the wolves was increased due to say issuing hunting tags for wolves in the spring.

          As far as I can see, the main thing that trying to represent this in a DAG form would do is to change it from a few hundred lines of turing complete recursive/iterative code to a few hundred thousand DAG nodes with a few quadrillion quadrillion quadrillion quadrillion stochastically represented interconnections. No analysis would be possible in any practical way as the DAG wouldn’t even fit in the memory of the largest computer we’ve ever built. The kind of information you could find out would be relatively useless like “changing the death rate of wolves will affect the population of sheep”, information I either already knew, or could discover by running the hundred lines of code model a few tens or hundreds of times in maybe a half hour of computing…

          So, heck, maybe I’m wrong, but since you seem to be more expert than I am, and can’t tell me whether anything useful can be done by DAGs, but I *know* useful stuff can be done in existing ABM modeling systems… or using ODEs or using fluid mechanics simulations, or using various other methods… you’ll have to excuse me if I find it unimpressive *for my purposes*.

          Of course, not everything I do has this mechanistic nature, and so maybe at some point I’ll work on a problem where a DAG would be obviously useful, and I’ll take yet another look at the DAG stuff once again…

        • Anoneuoid, you’ve mentioned mathematical models of carcinogenesis. I know more about this topic, and DAG-based insights are definitely applicable, though perhaps not in a way you like: The reality is that there are thousands mathematical models of carcinogenesis now, including originating work by Armitage and Doll, Knudson, and Nordling and continuing to the unreadably-many models that have been published since. A major limitation to the field as a whole is that many different models are consistent with the same data. There is nowhere near enough data to distinguish between the many underlying mechanistic models that have been proposed. What to do about this?

          I haven’t worked on this for a few years, but eg here is the results of my model for bone/joint cancer (I chose that one because it has one of the most interesting looking age-specific incidence curves): https://i.ibb.co/qWVjDJV/bonejoint.png

          It is basically what I described in this thread (which is a corrected armitage/doll; the original goes to infinity) along with assuming cell division in that tissue decreases exponentially from birth. This is not arbitrary “curve fitting” and “parameter tweaking”, it is a model of a process derived from very simple and reasonable assumptions that has some parameters that need to be measured or guessed.

          I eventually dropped that project because I got frustrated when couldn’t find data to constrain my assumptions about number of cell divisions by age, etc. Essentially the model is impossible to really test due to lack of data.

          The problem is people are collecting the wrong type of data begin with like this:

          Suppose there is a single target of my inference, like the average effect of treatment X on outcome Y among the individuals included in my study.

          That is just a waste of resources. Instead collect the type of data that lets us distinguish between different models.

          It really upsets me to think about the waste of resources over the last 50 years on this NHST-type stuff instead of collecting useful information we could use to learn about phenomena like cancer. We don’t need more data on “average effects”, we need an entirely different type of data that can be used to constrain our models of what is going on.

        • Daniel:

          “By all means, if you can encode some model in a DAG and then derive useful stuff from it, hey, more power to you. What I find annoying is the assertions that people working higher up the chain on more general computing models are failing to do causal inference, or can’t solve problems that DAG people can solve.”

          Yes. Exactly.

          The most persuasive arguments I’ve seen for DAGs for causal inference are:
          – Lots of thoughtful people think DAGs are a good idea.
          – Various existing causal models can be expressed as DAGs. To put it another way, DAGs are a useful framework for many people to understand causal problems.
          – The claim has been made that, when people try to do causal inference without DAGs, they can easily make mistakes. In that way there’s an analogy of DAGs to Bayesian decision theory: both are internally coherent systems, and one advantage of an internally coherent system is that it can make you face up to inconsistencies in your behavior.

          All three of the above arguments seem reasonable to me. I have not found the need for DAGs in my own work, but I accept that other people find them to be a valuable way of structuring their causal model. One thing I do object to is naive overconfidence in DAGs—I object to this in the same way that I object to naive overconfidence in any statistical method, Bayesian inference included. The article discussed in the above post seemed to me to express naive overconfidence in a way that could be misleading to readers, hence my post.

        • I agree with the recommendation to learn the stuff – given your mechanistic modelling background I’d actually recommend starting from something like Mooij et al:

          https://arxiv.org/abs/1304.7920

          I do want to push back a bit against the idea that DAGs represent the skeleton while other approaches are necessarily more detailed.

          There is a *lot* of work in dynamical systems, physics etc on classifying models into equivalence classes such that any model with some generic features exhibit some generic behaviour. This work is just much harder imo because the class of models considered is less restricted.

          The ‘structural equation’ assumption is far from generic, even with ‘nonparametric’ put in front.

        • (As an analogy:

          Calling an arbitrary numerical quantity a ‘nonparametric number’ doesn’t help me decide if a numerical quantity is the right representation of something.)

        • Andrew and Daniel,

          tl;dr: We are agreeing more now. DAG-based causal inference can be useful for pointing out mistakes in many areas where it does not (yet) provide solutions. Toy models are useful for pointing out and avoiding mistakes.

          We seem closer to general agreement. Building off of Andrew’s comment above, I’d add that you can broadly describe what DAG-based causal inference has to offer in terms of two categories

          (A) Areas where DAG-based causal inference reveals mistakes
          (B) Areas where DAG-based causal inference provides solutions — meaning the final model fitting involves a DAG

          Currently, many research topics belong to category A without belonging to category B. As the theory of causal inference develops, additional topics will begin to fall into category B. I expect that very many topics will ultimately end up in category B. But others disagree, which seems fair. As statisticians know, “It’s tough to make predictions, especially about the future.”

          By that same argument, however, it seems terrible to dismiss or suppress DAG-based research, which absolutely happens. Sometimes, it seems like the dismissal happens on this blog, even. However, Andrew, several of your comments above go a long way toward removing that concern and I do appreciated that. Thank you.

          In category A, toy models are wonderful because they encapsulate the mistake in a setting that is simple enough to communicate. Using toy models is also important in category A because, otherwise, arguments tend to be derailed by all kinds of issues with the studies under consideration that are extraneous to the causal mistake itself. As Pearl puts it above, “pitfalls were hidden from researchers for decades and were discovered by mathematical analysis and demonstrated by simulation on toy problems.”

          Andrew, I am not really familiar with your own research but, based on your comments, I do worry that you could be falling into mistakes that are evident from DAGs, and might be avoided if you used DAGs. This is not a criticism of you as an individual or your work in particular, but expression of more general concerns of my own: When anybody suggests performing regression-based adjustments, standardizing/post-stratifying, or IV analyses without consulting DAGs… I think that they will eventually make important errors. I think that not because I think they are unskilled or imperceptive scientists, but because I do not see how any person could avoid the errors without consulting DAGs.

          Daniel, in the kinds of research that you have discussed above, understanding DAGs might help you avoid errors even if the final models that you use do not involve DAGs at all. Selection biases, for example, apply to a lot of sample-based research, presumably including samples of wolves. Selection biases are naturally expressed in the language of DAGs, which makes them easier to understand and avoid.

          Finally, moving on to category B. Above, I stated that DAG-based causal inference is at its strongest with “fossilized” observational data, in which the “skeleton” of causal relationships can be expressed as a DAG, but the “muscle” of mechanistic functions is unseen. I stated that the DAG-based approach to causal inference reveals the remarkable amount that we can learn from the skeleton alone, without seeing the animal’s muscles. However, please don’t take this to mean that DAG-based causal inference is only useful for “fossilized” data — many other topics also fall into category B, including experimental design, troubleshooting procedures for machines that are broken in multiple ways, and a lot more. DAGs are a flexible language that can be applied to solve a variety of problems, not only inference from observational data.

        • Anoneuoid, Re: Carcinogenesis. Thanks for the link — curve fit looks nice for a 5-parameter model.

          I think we’ve reached the heart of our disagreement. We have both seen that there is not enough data to constrain the parameters (or structures) of many mechanistic models, such as those of carcinogenesis. Your response is to conclude that people are collecting the wrong types of data. I agree with that in part, but my response has been different: I have tried to see which aspects of models -are- constrained by available data. Following that path led me to stumble into the field of DAG-based causal inference, which I have found rewarding.

          A couple things you may find interesting

          > Your statement about exponential fall-off in cell division rates reminds me of this paper, which looks at linear fall-off: Pompei et al. (2003) Cancer turnover at old age. Nat Rev Cancer 3:388. link: http://www.nature.com/articles/nrc1073-c2

          > Only the approximation of the Armitage Doll model goes to infinity. The exact solution does not, but this is usually ignored because the exact solution is cumbersome.

        • A couple things you may find interesting

          > Your statement about exponential fall-off in cell division rates reminds me of this paper, which looks at linear fall-off: Pompei et al. (2003) Cancer turnover at old age. Nat Rev Cancer 3:388. link: http://www.nature.com/articles/nrc1073-c2

          > Only the approximation of the Armitage Doll model goes to infinity. The exact solution does not, but this is usually ignored because the exact solution is cumbersome.

          There is no need to invoke a drop in cell division rates to explain the drop off in age-specific incidence at old age. It comes naturally from the correct Armitage-Doll model (I still call it that because it is derived from the assumptions they make in that paper), which is the one based on the geometric distribution I posted here.

          They originally did make the unnecessary assumption (and I suspect wrong too, if their premise that carcinogensis is due to accumulating errors is correct) that p is very small for mathematical convenience back in the 1950s:

          In this case and so long as the occurrence of each mutation is a relatively rare event-the incidence rate of cancer at age t will be proportional to the probabilities of occurrence of each mutation per unit time and the sixth power of the age

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2007940/

          But there is nothing cumbersome about using the full model today when we have computers, so it makes zero sense that people still use the approximation… but whatever. The age specific incidence is then proportional to the first derivative of the cumulative probability of carcinogenesis (1 – (1 – p)^d)^n, which is:

          Where q = 1- p,
          -n(q^d)*ln(q)(1 – q^d)^(n -1)

          You can then get the second derivative, set it equal to zero, and solve for d to get the number of divisions until peak incidence: log_q(1/n)

          You can see that using the high for somatic mutation rate p = 10^-5 and only 2 errors that need to accumulate, the model predicts a peak after ~70k divisions. It rises to ~700k for p = 10^-6 an n = 2, etc (most somatic mutation rate estimates are closer to 10^-8). This is way beyond the hayflick limit, a given cell should only be ~60 divisions separated from the zygote. Also, 80 years is ~30k days, so then the model would require these cells to be dividing multiple times per day when even breast cancer cells in culture only divide about once per day. The only human cells known to come close to those rates of division during normal function is proliferating lymphocytes.

          No doubt there are also biases against reporting and screening for cancer at old ages since the doctors figure the patient will die of something else first anyway. I never got to the point of attempting to model that aspect but that is a whole other degree of freedom I am sure is unconstrained, but that is definitely a next step that would be required.

          And these biases in diagnoses can have a huge effect, eg we can look at the effect on total diagnoses of (what is now widely thought to be worthless or even harmful) prostate cancer screening in the 1990s:

          Widespread use of PSA screening has lowered the median age at diagnosis and increased the number of men diagnosed with localized disease. There now exists a large population of survivors. Some of those with localized disease have indolent tumors that will never become symptomatic or cause death. Because of overdiagnosis, the true efficacy of prostate cancer treatment is uncertain. Prostate cancer–specific mortality rates are declining. It is however certain that we are curing some men who do not need to be cured.

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3540881/

          But I mean we do not have basic information like how many cells of a given type are typically in each tissue…, Here is a paper bemoaning that in 2013: https://www.ncbi.nlm.nih.gov/pubmed/23829164/

          It is a very sad state of affairs for the hundreds of billions of dollars that have been spent on cancer. If there had been little interest in studying cancer until now, I would be less discouraged than the reality. After spending all that time and money we have almost no cumulative knowledge gained, just a bunch of isolated “facts” and the conclusion “cancer is many diseases”. This is what NHST does to a field of study.

        • MoreAnon —

          Thanks for your explanations/examples. They are helpful in understanding how DAGs can be useful and important.

          However, I do have one quibble, with your muscles/skeleton example. You say,

          “The muscles determine how the skeleton moves — in this analogy, the skeleton is the DAG. When using a fully mechanistic approach to modelling, you are modelling the whole animal — both the muscles and the skeleton. If you are used to this fully mechanistic approach, you may think that nothing much could be learned from studying the skeleton alone. But you are wrong. A remarkable amount of information can be learned about the animal from studying the skeleton alone, and this is what DAG-based causal inference tells us.”

          I am willing to agree that in many situations “a remarkable amount of information can be learned about the animal from studying the skeleton alone,” but I’m not convinced that modeling based on “the skeleton is the DAG” is adequate in all cases — since I cannot accept that “The muscles *determine* how the skeleton moves;” I can, however, accept the weaker statement that the muscles usually strongly *influence* how the skeleton moves. And reciprocally, I believe that at least in some cases (and perhaps in many cases), the skeleton influences how the muscles move (e.g., by constraints that the skeleton and the location of muscle attachments have on the muscles). So, although I’m willing to grant that your model/description may be fine for some circumstances, I can’t accept it in the generality that you seem to describe.

        • Also, I note that paper suggests just adding (1 – Beta*t) to the approximate Armitage Doll model and figuring somehow that corresponds to senescense… That’s the type of stuff people rightly dismiss as “curve fitting” and “parameter tweaking”. Their final equation doesn’t even make sense.

          This is totally different from what I recommend, which is deriving the model from a set of explicit premises/assumptions and then examining the behaviour.

        • Daniel,
          You say:
          What writing down a DAG does is formalize a certain set of assumptions. And then, I assume what Judea’s do calculus does is let you work out some consequences of those assumptions.

          Well put. I would only modify it slightly, to read: What writing down a DAG does is formalize a certain set of assumptions and let you see some of their logical ramifications. Then, if you wish to work out additional consequences of those assumptions, you can use do-calculus.

          The reason for the suggested modification is that people tend to underestimate the computational power of DAGs which saves you hours of hard labor incurring when you refrain from using DAG. Even deciding back-door condition is too laborious for communities that “do not find DAGs useful for the kind of research they are doing”. Examples of laborious efforts can be seen here: https://ucla.in/2QpcGzS, https://ucla.in/2L8OCyl

        • Daniel, (et al, especially moreanonymous)
          I assume I am one cause of your annoyance when asserting
          that people working higher up the chain, say on mechanistic
          models, “can’t solve problems that DAG people can solve”

          Let me try to mitigate your annoyance somewhat by
          explaining what a person of my convictions would mean in
          making such assertions.

          Mechanism-seeking researchers aim to obtain functional
          relationships between variables v_i= f_i(v1, v2,…vn)
          and, once satisfied with the form of the functions
          and the estimated parameters, they are ready to answer
          causal questions, eg, what if we set v_i to 5 mg.
          or, what is the effect of V_5 on V_7.

          [There is still a question of whether they can do that,
          because “setting v_i to 5 mg may depend on HOW you set it,
          but lets leave this problem aside now.
          See firing squad example in book of why]

          I think it is still justified to say that, unless our
          mechanistic researcher took a class in DAGs, he/she will
          be unable to answer such questions BEFORE building
          the functional model, when the only information available
          is qualitative; “who the arguments are of each function”

          This partial knowledge is what we assume in SCM,
          represented in a graph, and leading to do-calculus
          which shows you, miraculously, how much you can still do
          with such partial knowledge and, even more miraculously,
          what you can and cannot infer in every state of partial knowledge.

          Mechanism-seeking researchers cannot tell you
          those miracles, not because they are not smart but
          because they are aiming toward functional descriptions
          and did not stop half way to serve the sciences
          that must rely on partial knowledge, such as
          epidemiology and economics.

          Aha! one would argue, so the mechanism-seeking researchers
          are engaged in a more general causal inference; they can
          answer all the causal questions in the world, and more.
          Not exactly. To build their functional models they need
          to invoke either sophisticated experiments or daring untestable causal assumptions.
          Given observational data only, the inferences
          produced by the do-calculus are the maximum possible.

          I hope I have clarified somewhat the relationship between
          the DAG-based and mechanistic-based sciences.
          Judea

        • > Mechanism-seeking researchers aim to obtain functional relationships between variables

          No…that’s one of my points here. That’s what you *assume* but something like an ODE defines a curve x(t) that satisfies the differential equation dx/dt = f(x) which in turn can express/derive from something like ‘momentum is conserved’. Or say integral equations.

          These *do not* appear to express functional relationships in the sense of say x1 = f(x2) etc.

          My claim is that DAGs/SCMs cannot express eg differential or integral equations and hence do not represent a vast majority of mechanistic models.

          Happy to be shown wrong.

        • Philosophically this is similar to the distinction between those who seek to express causality as relations between discrete events or ‘variable values’ and those who adopt a process oriented view.

          Eg

          > Substitute for the time honoured “chain of causation”, so often introduced into discussions upon this subject, the phrase a “rope of causation”, and see what a very different aspect the question will wear’. According to the process theory, any facts about causation as a relation between events obtain only on account of more basic facts about causal processes and interactions

          From https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199279739.001.0001/oxfordhb-9780199279739-e-0011

        • And

          > Causal processes are the world-lines of objects, exhibiting some characteristic essential for causation.

          Eg an ODE defines a temporal process x(t) which satisfies and ODE than might express the characteristic ‘momentum is conserved’ (etc).

          Very different to ‘the variable x listens to the variable y, as expressed by the structural equation x := f(y)’

        • Andrew,
          I think Carlos problem was/is that the advise you are giving to researchers as a “classical” alternative to what Lederer et al propose is laden with pitfalls that could easily be avoided by making DAGs explicit, and cannot be avoided otherwise. So why resist?

          Specifically, to judge “that enough pre-treatment variables are measured on each person in this study, they can do some sort of matching and regression study in order to find comparable groups and adjust for differences between them;” is just the WRONG thing to do. Because “pre-treatment variables” include IV’s and colliders and adjusting for them (or “matching” or “controlling”) means an invitation to bias amplification. [Suggested reading https://ucla.in/2N8mBMg%5D. Moreover, those dangerous pre-treatment variables CANNOT be distinguished from innocent confounders w/o a DAG. So why resist?

          You are advising researchers to remain faithful to the “classical approach”, which you and others have practiced successfully in the past while finding no need to use DAGs explicitly. Now we see that this advice is bias-prone, and can easily be rectified by learning to use DAGs. Why not advise those who follow your recommendation about matching on pre-treatment variables to use DAGs (explicitly) in order to protect themselves from introducing bias?

          Let me assure you that I am asking these questions not because I am wedded to DAGs, but because I know that, once your readers learn to leverage DAGs in their research, their whole attitude toward causal inference will change, and the science of cause and effect will gain a quantum leap forward.

        • Judea:

          You write, “I know that, once your readers learn to leverage DAGs in their research, their whole attitude toward causal inference will change, and the science of cause and effect will gain a quantum leap forward.”

          That sounds like some optimistic causal inference on your part! That points to a difference between us. You think you know what would happen, while I’m uncertain.

        • Andrew,
          I know what would happen because it happened to me (remember, I was a Bayesian) and what
          happened to literally thousands of readers that came out of the traditional paradigm to add
          another dimension to their landscape.

          But this is not the only difference between us. The more immediate difference is that I am trying to clear
          the mine-field toward which the traditional paradigm is leading its practitioners. At least I am trying.

        • ojm,
          The distinction between y=f(x) and g(y,x) = 0 deserves its own discussion but it does not alter my explanation. The topic was whether mechanism-seeking researchers can answer causal questions (like “what if we set Y to 0”) from partial information alone, given in the form of “who listens to whom”; the kind of information that is natural for certain scientist to elicit and to comprehend.

          On a different occasion we will debate what DAGs cannot do, promise. Right now, the question is “Can a mechanism-seeking researcher answer causal questions from partial information alone, given in the form of “who listens to whom”.

          The prisoner is dead. What if Rifleman-A refrained from shooting?
          More people died from inoculation than from smallpox. Should we ban inoculation?
          Is there a drug that is good for men, good for women, and bad for a typical person?

        • “who listens to whom”; the kind of information that is natural for certain scientist to elicit and to comprehend.

          I don’t think this is “natural” at all, this is something they get taught to do. As evidence, just look at what was being published pre-WWII. Ie, before the whole NHST method of generating cheap publications caught on.

        • > Can a mechanism-seeking researcher answer causal questions from partial information alone

          Yes, I think so. But ‘who listens to whom’ is not what or how a mechanistic researcher typically thinks (imo – others may disagree).

          This is *not* about

          > The distinction between y=f(x) and g(y,x) = 0

          More about y = f(x) vs x’ = f(x), i.e. about

          ‘variables causing variables’

          vs

          ‘processes that possess properties we call causal’.

          See again the Dowe link.

        • Or see Feynman quote on pg 19 (according to page numbering of book) here:

          https://github.com/omaclaren/open-learning-material/blob/master/physical-science-for-engineers/2018-S1/ENGGEN140%20Course%20Book%20Part%202%202018.pdf

          Which also suggests that ‘mechanistic’ modelling is something of a misnomer. Perhaps ‘physical’ or ‘processural’ would be better?

          No details are required to know that energy is conserved!

          See also the variety of problems solved in that rather elementary first-year course. Do DAGs or SCMs help with these problems?

        • ‘processes that possess properties we call causal’

          exactly, specifically the key to me is a mapping from current state to future state, basically a flow. The flow can have certain properties that are invariant, like conservation of energy, or it can have rules that are dynamic…. like wolves can only sense sheep that are “close enough” so that the variables that influence the flow at any given time change based on various properties of the state.

          In economic models for example we would look at policy changes altering the flow so that through time variables tend towards certain values for example.

          it’s an interesting fact that many systems have high level statistics which “act as if” they were functions of other high level statistics. PV=NkT is this kind of thing. It is an emergent property of the flow through time of gazillions of particles. It seems to be a time invariant, but in actual fact it’s an average property of the flow averaged over small but nonzero regions of time and space.

        • Daniel –

          Yup, when I teach this stuff I emphasise the distinction between conservation and constitutive equations.

          Things like the ideal gas law or eg Hooke’s law represent constitutive assumptions relating state variables to each other in a similar sense to Judea’s x1 = f(x2), but usually not requiring one to be solved for.

          In contrast the more fundamental causal relationships are given by the (one and two sided) conservation equations, which are the result of principles like ‘energy is conserved’ or ‘entropy is increasing’ etc.

          Both of these classes of relationships are difficult (at best) to express in DAGs, and further no clear distinction is made between these classes in DAG. Though the analogy is

          DAG ‘is like’ conservation principle
          Particular equation ‘is like’ constitutive equation

        • Andrew, regarding the excessive simplicity of the model in the example (too sensitive to noise, not including enough variables) you are right. After all, they only devote three sentences to discuss the smoking/exercise/cancer example. But this problem affects the “classical approach” just the same, when you consider that simple model, and you could include more variables and improve the model in either case.

          The point of those examples is to illustrate something that you don’t touch at all in your comments as far as I can see.

          > If the applied research team doing this study judges that enough pre-treatment variables are measured on each person in this study, they can do some sort of matching and regression study in order to find comparable groups and adjust for differences between them; otherwise it could make sense to include latent variables in the model and adjust for them too.

          The question is that sometimes it doesn’t make sense to adjust for some variables. In the first example one should adjust for the confounder, in the second example one may or may not adjust for the mediator (depending on whether or not we want to exclude the indirect effect), in the last examples one should not adjust for the colliders.

          > A lot will depend on the details of what’s being studied, and the end result will necessarily depend on lots of assumptions.

          Indeed, and I think the point of that paper is discussing some of those assumptions and how they affect the result. It’s not clear how the classical approach in which an observational study is understood in reference to a hypothetical controlled experiment addresses those particular issues.

        • Carlos,
          I agree with you wholeheartedly.
          However, what prevents the classical approach from doing things right is not its conceptual reliance of “hypothetical controlled experiment”, but its refusal to use a metal detector in its journey through the mine field.

          Epidemiologists at Harvard School of Public Health are also committed (religiously) to viewing every observational study as an imitation of RCT, but they are using DAGs as metal detectors and successfully so.

          The resistance of the “classical approach” to metal detectors cannot be explained by the data available to us today. It awaits future historians of 21st Century science to open archives and resolve the puzzle.

  10. Thanks! And as I said in my previous post way up the chain, the way you convince people of both the utility of your methods and how they can actually use them is by supplying references that do that in their field of interest. I have read a lot of your work, and criticism of your work, and one question has always been how to do feedback in acyclic graphs, and how to possibly do DAGs when there are so many physical interactions in the data – for example interactions in space-time data may well vary by location). Even saying that it hasn’t been worked out yet for those situations is fine, it advances the discussion and makes clear at least the present limits of the techniques. For example there is the reference I posted above about the equivalence in certain formulations of the do-calculus and Bayes methods. How can that insight be used to help people analyze their data. Steve Scott has an R package that does causality using Bayes methods for multiple times series Comparing and contrasting that with what you propose would be a really useful discussion. Even discussion of some older methods such as Granger causality or some of the causal methods that have come out of system dynamics (such as by Sugihara) would be useful and enlightening.

    All the other stuff IMO opinion just turns people off, even the assumption that they haven’t read your work or don’t understand your points.

  11. Great discussion thread. Let me add yet another perspective.

    Statistical process control was proposed by Walter Shewhart in 1926 to enable industry to control processes, instead of focusing on outcomes. This has obvious economic advantages and provides the opportunity to get more for less https://www.linkedin.com/pulse/evolution-quality-from-product-information-ron-s-kenett/

    The premise of all this is that we have a theoretical construct of a stable process. The control chart tells us if we have a process under control. If not, we can affect the data generation process (change the machine set up) to get back under control. This is not repeated hypothesis testing. we do not fit a model to the data, just the opposite.

    In the industry 4.0 context we have sensors and lots of data used for health monitoring of production systems
    https://www.amazon.com/gp/product/1119513898/ref=dbs_a_def_rwt_bibl_vppi_i18

    Now, can we enhance Shewhart control charts with a structure enabling diagnostic, prognostic and prescriptive analytics. One approach is to use a DAG explaining how process measurements are affected by sensor data. To get a control system, we do not need any causal understanding. To get diagnostic, prognostic and prescriptive capabilities we do. This analysis can involve causal structural models in the form of DAGs or in other forms such as ANN and deep learning black boxes.

    In industry the proof is in the pudding. In other words, if you are able to run you processes with less scrap, get faster diagnostic ability and conduct effective condition based maintenance, you are a winner. This is more pragmatic than the smoking and cancer debate. The application of the do calculus and the back door assessment are certainly useful in this context. among other things, engineers seem to understand it and, thereby, overcome their natural statistical stress syndrome that impedes their analytic capabilities.

Leave a Reply

Your email address will not be published. Required fields are marked *