“Cut them in half” . . . that’s the Edlin factor!

I do not blame you.

This is ridiculous:

“Admittedly, these estimates are subject to some uncertainty. So if you think those that are given here are too high, even though they are based on the best of contemporary research, then just cut them in half.”

That’s a bald-faced “argument to moderation.”

Just on first glance, lots to criticize in Hanushek’s article — so much that I have no motivation to go into detail.

http://hanushek.stanford.edu/publications/valuing-teachers-how-much-good-teacher-worth

This seems so faulty–specifically in terms of causal reasoning–that I looked on your blog for commentary. I saw a brief comment on Kahneman (who may have been responding to arguments like this) but nothing specifically about Hanushek’s argument.

It seems that “seriousness about causal reasoning” would help a lot. First, to what extent, and in what contexts, does high school achievement (measured via performance/growth on standardized tests) actually correlate with income? I imagine it would have a lot to do with the chosen field. There may be a *general* association between test-score achievement and later income, but I bet this breaks down when you look more closely.

But the idea of replacing the lowest-performing teachers with “average” teachers–as a way to boost students’ eventual income, and thus the national economy–seems even more far-fetched. Hanushek defines teacher effectiveness in terms of student test score growth. A teacher could end up with an “average” rating for all sorts of reasons. Maybe she focuses on test prep (which boosts the test scores just enough to put her in the “average” category). Maybe she teaches well but is stuck with a bad curriculum. Maybe she teaches poorly within a strong curriculum. Maybe the kids are already doing well on the tests, so they show little “growth.” There’s no reason to assume that a teacher’s particular “averageness” will help a given class of students achieve more on tests. In fact, it’s a bizarre concept: “Your averageness will turn these students’ lives around.”

So a subtler and more precise causal analysis is needed.

Judea, in your most recent comment you say “The problem happened to have a simple, closed form solution in terms of the observed probability of X, Y and M, regardless of the functional form of f and g, and regardless of the distribution of the error terms (assuming they are independent).”

and this hammers home for me our difficulty in communicating. In Cox/Jaynes theory, there is *no such thing as an observed probability*. There is only an observed *frequency*.

The probabilities come from the knowledge K. The distinction is very very real for us here in this land. Typically, when I do consulting, someone will come to me with their dataset, and ask me to analyze it, and the first thing I will do is have them talk to me for a long time and answer questions about what they think is physically going on and how all the measurement instruments work and etc etc. They sometimes get frustrated and ask me to quickly calculate some p values or whatever because they are used to some other form of “statistics” where stuff gets put into some canned software and a button is pressed and a stamp of approval of p < 0.05 is generated and they have magical permission to publish their paper or whatever. But I don't do that.

Until I gather enough background knowledge about what is going on, how the model will be used, and whether there are unmentioned variables that are known, etc., I cannot assign any probabilities, I cannot make any progress, and it does not matter how big the dataset is.

(I am perhaps exaggerating, sometimes it's clear from a very quick inspection what the background knowledge needed is but in the general case… no)

Judea, I think the confusion you have can be answered by the following. You assume it is possible, with the 3-variable binary problem and a large dataset, to answer questions about probability.

I on the other hand INSIST that *probability* is relative to the knowledge built into the model K.

and since we have no substantive model in the example problem, there is really no knowledge K by which I could calculate real probabilities. The best I can do is give bucketloads of examples with more details where I could then have some knowledge by which to calculate probabilities.

The probabilities you are thinking of are frequencies, and in this Bayesian camp, each of those frequencies has a plausibility over it (!!!) and that all depends on the knowledge K.

So, why can’t I just give the numerical answer? Because there is no knowledge K!

Prof Pearl: Thank you, that is very helpful.

Daniel,

Thank you for your patience and effort to teach me about the philosophy and practice of the Cox/Jaynes camp. You asked me above if I can see the connection to what people in my “causal inference camp” are doing, and I must confess that I am still oscillating between two plausible yet diametrically opposed hypotheses:

Hypothesis 1: The Cox/Jaynes camp is far more advanced than we in the causal inference community. What we take as major accomplishments of the past two decades they accomplished long ago, and they are now solving mediation and other causal problems on a routine basis. We need to learn how they are doing it.

Hypothesis 2: Researchers in the Cox/Jaynes camp do not grasp what they are missing by thinking that they are solving mediation and other causal problems. Sad, but understandable, given their devotion to the conditioning bar “|”.

Why can’t I decide between H1 and H2? Because we started this track with a concrete and simple example of mediation among 3 binary variables. The problem happened to have a simple, closed-form solution in terms of the observed probability of X, Y and M, regardless of the functional form of f and g, and regardless of the distribution of the error terms (assuming they are independent). Not having seen the solution, I remain undecided whether your comrades know how to solve this and other problems like it, or not. It would have taken just two lines to write down the solution and achieve global peace and understanding. So far I have seen illuminating discussions on K1 and K2, plausibility and frequency, Bayesian logic, and more, but no solution. This makes me uncertain whether your comrades consider the original problem too trivial, or too difficult, to solve.

Whatever the case, you have been very patient with me and I will not bother you more. Still, just in case some of this blog’s readers are interested, here are a few pointers for the curious.

1. Mediation problems like the 3-variable example are now solved for any number of variables, binary and continuous, in the sense that we now know what conditions are needed to make the NIE identifiable and, once it is, what the estimand for the NIE is.

2. My new (2016) book “Causal Inference in Statistics: A Primer” (co-authored with Glymour and Jewell) describes the theory and applications of counterfactual reasoning at the undergraduate level; see http://bayes.cs.ucla.edu/PRIMER/. You will find there numerically solved homework problems of the type:

(a) What would be the expected salary of workers who are now at skill level s had they had another year of education?

(b) What portion of the effect of education on salary is explainable by skill attainment?

(c) What is the probability that the plaintiff would be alive had he not taken this drug?

3. The book by VanderWeele (2014) and the writings of Kosuke Imai and his students, as well as Jamie Robins and Jay Kaufman in epidemiology, can give you further insight into current research in causal mediation.

Enjoy,

Judea

Ck, I don’t have any problem at the moment with Pearl’s formula for the extent of mediation. Note, however, that it’s not something that I’ve ever felt the need to define or calculate. Once I have my parameters I can answer boatloads of questions about the process at hand by plugging numbers into the equations. What will be the time course of drug concentration in the liver if the pill is made 6mm in diameter? How much of the drug is excreted in the kidney instead of delivered to the liver? How much passes unabsorbed through the stool? All of these are related to mediation but are generally more salient in a specific example such as pharmacokinetics.

Carlos, there’s a copy of Lindley’s review posted online at Pearl’s lab website.

Daniel,

I know we have been here before, but I’m still not clear on how you define the extent of mediation in functions.

For example, when you say “X causes Y by partially affecting M and partially causing Y directly according to a function y = f(X,g(X)+Merr, a) + Yerr…”, it is not clear to me how this function extracts “the extent to which X affects Y through M”. I know you talk about Ks, but it is not clear to me how these Ks define the extent of mediation. The formula that I’m familiar with is f(0,g(1),u)-f(0,g(0),u), the one presented by Judea, and it clearly shows where the direct effect is frozen.

So here are the questions to you:

Is there a problem with the way f(0,g(1),u)-f(0,g(0),u) presents the extent of mediation?

If not, is f(X,g(X)+Merr, a) + Yerr equivalent to f(0,g(1),u)-f(0,g(0),u)? How? (They should be equivalent if they measure the same thing.)

If there is a problem, where is it, and how does f(X,g(X)+Merr, a) + Yerr fix it?
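To make the frozen-direct-effect formula concrete, here is a small numerical sketch. The structural equations g and f below are hypothetical linear choices (not anything from the discussion); they only illustrate how f(0,g(1),u)-f(0,g(0),u) holds the direct arrow fixed while letting the mediator respond:

```python
# Hypothetical structural equations (chosen only to illustrate the formula):
#   mediator:  M = g(X) = 2*X
#   outcome:   Y = f(X, M, U) = 3*X + 4*M + U   (3*X is the direct path)

def g(x):
    return 2 * x

def f(x, m, u):
    return 3 * x + 4 * m + u

u = 0.5  # background noise held fixed for one unit

# Natural indirect effect: freeze the direct arrow at X=0 and let only
# the mediator respond to the change X: 0 -> 1.
nie = f(0, g(1), u) - f(0, g(0), u)   # 4*2 = 8

# For comparison, the direct effect (mediator frozen) and the total effect:
nde = f(1, g(0), u) - f(0, g(0), u)   # 3
te  = f(1, g(1), u) - f(0, g(0), u)   # 3 + 8 = 11
```

In this linear case te = nde + nie, but the point of the f(0,g(1),u)-f(0,g(0),u) construction is that it remains well defined for nonlinear f and g, where no such clean additive split exists.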

If this doesn’t do it then nothing will.

Also Judea, I should say that I hope you find all of this helpful. The goal initially was something like “for Judea to find out how people over at Gelman’s blog think about causality and statistics”, so I have tried to explain using, as much as possible, some unambiguous symbols and some hopefully helpful words, and also to discuss some ambiguity in symbols that might make you and the blog denizens talk past each other (Fr, Pl, p, p(|K), etc.)

So my attempt is to help you understand what we are up to over here, out of both respect for you and hope that you might also enlighten us in important ways, either by asking very cogent questions (which you have) or by discovering where exactly the connection lies between what you talk about in your Causal Hierarchy and DAGs and things, so that you can point at pieces of what we are doing here and say “here, if this thing here is of type FOO then it can never do BAR, but if this thing here is of type BAZ then we accomplish QUUX”, etc.

I am getting near the limits of my ability to explain it better, as I’ve tried so many ways… but here is a more formula-based version below…

So, let me go back to the very beginning, we have outcome Y, observed quantity X, and observed quantity M. Then, two different people think about the problem, and one says “X causes Y by partially affecting M and partially causing Y directly according to a function y = f(X,g(X)+Merr, a) + Yerr with Merr, Yerr, and “a” all unknown according to some plausibility values in appendix A” and we call this state of knowledge K1

and person 2 says “I know some family of functions Q parameterized by a,b,c,d,e,f is sufficient to fit any smooth function in 2 variables, so I will say y = Q(X,M,a,b,c,d,e,f) + yerr and get my large dataset of Y,X,M and find a,b,c,d,e,f by using some plausibility values in appendix A2” and this state of knowledge we will call K2

so person one writes down in Bayesian logic

p( y | X,M,a,K1) = normal(f(X,g(X)+Merr,a),ysigma);

p(Merr,ysigma,msigma,a | K1) = see appendix A

and then:

p(ysigma,msigma,a,Merr | Y,X,M, K1) = p(Y | X,M,a,Merr,K1) p(Merr,ysigma,msigma,a|K1)/Z1

where Z1 is a normalizing factor.

and person 2 writes down in Bayesian logic

p(y | X,M,a,b,c,d,e,f,K2) = normal(Q(X,M,a,b,c,d,e,f),yerrsigma)

p(yerrsigma,a,b,c,d,e,f | K2) = see appendix A2

and then

p(a,b,c,d,e,f | Y,X,M,K2) = p(y | X,M,a,b,c,d,e,f,yerrsigma,K2) p(yerrsigma,a,b,c,d,e,f|K2)/Z2

where Z2 is a normalizing factor…

and clearly K2 and K1 resulted in different p functions and because some “causal thinking” goes into K1 the first person will answer your query “what would have happened if X=X* instead of X” by plugging in X* confident in the fact that the structural equation models f and g represent the thoughts about counterfactuals, not just factuals.

and person 2, if they’re being honest, will say “I just fit this to what really happened, I don’t have a function for what ‘would have happened’ because my K2 didn’t tell me enough” though if they’re being lazy they might not realize this problem and might just stick in X* and see what happened.

In the sense that probability as logic was used in both the first case and the second case, it is “orthogonal” to the causal thinking that goes into K1 or the non-causal thinking that goes into K2. But in the sense that what you think about in K1 or K2 changes p the results are dependent on the thinking…

I believe you would say the K1 model is a “level 3” model, one that incorporates counterfactual ideas, whereas the K2 model is a “level 1” model, one that incorporates only associations observed in data… and so I think your hierarchy is a way to classify different statistical (probability as logic) models into the three categories it describes.

and…. importantly for the original goal of understanding what people here are doing…. whether a person at this blog is doing level 1, level 2, or level 3 thinking according to your hierarchy, they will probably use the same p( | ) symbols, and you will only be able to classify these things by asking them something to clarify what knowledge went into the construction of the model.
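A minimal sketch of the K1-versus-K2 point above, with hypothetical functions and numbers: person 1’s K1 says the mechanism is proportional (y = a·x), person 2’s K2 just fits a flexible polynomial. Both are fit to the same data, but they answer “what if X were set to a value far outside the data?” differently, i.e. p(y|x,K1) != p(y|x,K2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: the "true" mechanism (unknown to person 2) is y = 2*x + noise.
x = rng.uniform(0, 1, 50)
y = 2 * x + rng.normal(0, 0.1, 50)

# Person 1 (knowledge K1): mechanism says y is proportional to x,
# so estimate the single structural coefficient a by least squares.
a_hat = np.sum(x * y) / np.sum(x * x)

# Person 2 (knowledge K2): "some polynomial family can fit any smooth
# function", so fit a cubic to the same data.
coeffs = np.polyfit(x, y, 3)

x_star = 5.0  # an intervention far outside the observed range [0, 1]
pred_k1 = a_hat * x_star            # K1's answer tracks the mechanism
pred_k2 = np.polyval(coeffs, x_star)  # K2's answer is pure extrapolation

# Same data, different K, different answers to "what if X were 5?"
```

This is only point estimation, not the full Bayesian posterior p(a|Y,X,K1) versus p(a,b,c,d|Y,X,K2), but it shows the mechanics: the formula you write down is determined by K, not by the data alone.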

Pearl, elsewhere I’ve given you symbolic versions plus words, but here is the very simple concept stripped down from most of my talking (hopefully the talking was useful for some others at least)

if I know some stuff called K1, then I write down p(y | x, K1) which is a formula about as you say “how plausible it is for Y to achieve y given that X is made to be x by any means, and the mechanisms by which x causes y are truly the mechanisms known to background knowledge K1”

and the formula for p( y | x, K1) is “told to me by my K1” and when my K1 admits counterfactual knowledge, then it will provide me with plausibilities for counterfactuals when I plug in counterfactual values for x.

However, please note, I can use K2, some other set of knowledge, which might not express counterfactuals and mechanisms and science, it might express only things like “functions on this interval can be approximated by polynomials” and then we have a pure association. I will get p( y | x, K2) and **the formulas will usually be vastly different p(y | x, K1) != p(y | x, K2)**

So, the generalized logic of probability theory is *the method* by which different kinds of knowledge can extract information from data and it can do this extraction regardless of whether I use K1 with counterfactual thinking, or K2 with associational thinking.

If this is what you mean by “orthogonal” (ie. that the probability calculus applies whether or not something is causal) then I agree with you!!! Peace and prosperity as you say. Just as 2-value logic always applies, so real-valued logic of probability theory always applies.

If on the other hand, you meant p(y | x) is a fact about the world that doesn’t change whatever my K1 or K2 or other K is, and so it is orthogonal to K1 K2 etc… then I absolutely vehemently require that this kind of p should have the notation “Fr” before I will agree with you, and better yet even a subscript on the Fr denoting the dataset you are using to get the Fr. And then I will say my p( y | x, K1) is a plausibility “Pl” and *it does change with the K* and it is not equal to the Fr(y|x) in your dataset, and in that sense my p(y|x,K) is not orthogonal to the knowledge K it is instead completely determined by the knowledge K.

I will also mention that Fr and Pl have the same mathematical properties so it is not surprising that they both wind up being called “p( )”

Do we have our peace and prosperity yet?

1) I am happy if you acknowledge that your causal hierarchy is in essence a kind of “classification system for the different kinds of K that can be used in p( y | x, K)” describing to what uses it can be put, and that by “orthogonal” you meant the version that I agreed with above ie “that the math of probability can be used for either causal models or noncausal models”

and I am not happy if

2) you insist that p(y|x) can only be interpreted as a fact about the world, or about your dataset, and the content of a variable K1, or K2 can not alter p(y|x) and this is what you mean by orthogonal

Daniel,

Your proposal is very clear when it comes to (1) Associational Statistics, but when it comes to (2) Probability as Generalized Logic, you lost me. The reason you lost me is that even “Logic” cannot handle causes and counterfactuals (unless we go to modal logic or counterfactual logic, etc.), so what do you propose to generalize? The entirety of human reasoning, formal and informal, past and future, and call it Probability? Why not call it “everything else which is useful”?

Your proposal will be much clearer if you can continue in the style of (1), and use X, Y symbols, rather than shifting to verbal description. What more can (2) tell us about X, Y and Z? For example, should it tell us how plausible it is for Y to achieve level y had X been x’, given that X is in fact x?

You lose me when you leave our relevant variables X, Y and Z and shift to speaking about mechanisms like f, which are postulated to get answers to questions about X, Y and Z. Can we stay symbolic? If you do, I bet you will end up with the Causal Hierarchy and, then, I do not mind your calling it “Probability” or “Plausibility” or “Generalized Logic”, as long as we know what questions it is capable of answering that level (1) is not.

Judea

I don’t like to argue over the definition of words, but I agree with you that Granger Causality is not what I think of when I want to use the word causality.

Yet, I don’t want to relegate “statistics” to just “finding associations between variables A and B in a big dataset”, because I think there is something within what is commonly called “statistics”, namely Cox/Jaynes Bayesian logic, that also allows us to infer the unknowns embedded within our imperfect mechanistic models from data. I hope you will eventually agree that searching for associations in datasets, as Granger Causality does, and making some scientific assumptions and then inferring the unknowns under those assumptions, as I often do when I build models, are two different things. So even if they both fall under “Statistics”, we should separate them out.

So, for the sake of clarity perhaps we can make the following distinctions:

1) Associational Statistics: finding out how observing X can help us predict Y for whatever reason. Typical application: Granger Causality.

2) Probability as Generalized Logic: finding out what some data can tell us about an assumed mechanism f. Typical causal application: finding the rate constants of a 4-compartment ODE pharmacokinetic model involving the stomach and intestines, the bloodstream, the liver, and the kidneys. The goal is to figure out the ODE coefficients that describe how the drug transports between the compartments and is metabolized, so that after the posterior for the coefficients has peaked we can run simulations to help us predict how to adjust the manufacturing method of the pill to give the best possible time-release mechanism, keeping the drug concentration in the liver constant throughout the day.

I think you will agree that the example in (2) contains a bunch of pretty strong causal assumptions, and because of that it will in general produce different predictions for drug concentration DL under the full generality of conditions, and hence the “formulas” for p(DL| PillDiameter, Model) will be different than if the model were “fit this polynomial to a dataset of DL and PillDiameter”. And so, within the context of (2) “causality is *not* orthogonal to *the choice of plausibility values* p(DL|Model,PillDiameter)” is a true statement.
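For readers who want a concrete picture of the compartment model sketched in (2), here is a deliberately crude toy version (hypothetical rate constants, a reduced set of compartments, forward-Euler integration, and no inference step); a real analysis would put priors on the rate constants and fit them to concentration data:

```python
# Toy compartment model: gut -> blood, blood -> liver, blood -> urine.
# All rate constants are made-up illustrative numbers, not real pharmacology.
def simulate(k_gut_blood=0.5, k_blood_liver=0.3, k_blood_kidney=0.2,
             dose=100.0, dt=0.01, t_end=24.0):
    gut, blood, liver, excreted = dose, 0.0, 0.0, 0.0
    t = 0.0
    while t < t_end:
        # forward-Euler transfer amounts for this time step
        absorb   = k_gut_blood    * gut   * dt
        to_liver = k_blood_liver  * blood * dt
        to_urine = k_blood_kidney * blood * dt
        gut      -= absorb
        blood    += absorb - to_liver - to_urine
        liver    += to_liver
        excreted += to_urine
        t += dt
    return gut, blood, liver, excreted

gut, blood, liver, excreted = simulate()
# Mass is conserved: everything that started in the gut ends up somewhere.
```

Note that the fraction delivered to the liver versus excreted is fixed by k_blood_liver/(k_blood_liver+k_blood_kidney), exactly the kind of mediation-flavored quantity discussed above; once the rate constants are inferred, such questions are answered by plugging into the model rather than by defining a separate mediation estimand.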

> precludes the statistical model from answering questions about changes in the hypothesized process.

Whorfianism used to have a strong version that claims that language determines thought and that linguistic categories limit and determine cognitive categories – but that has gone out of style.

Carlos,

I am glad you mentioned the writings of Dennis Lindley, who was a true gentleman and became my role model. At the age of 85, he was as curious as a 3-year-old and, instead of trying to teach me how to do Bayesian analysis (about which he knew much more than me), he kept on asking: “And how would you solve a problem like this?” “And what if we do not know this or that?” Then he had the guts, at the end of a 145-message exchange, to go back to his compatriots and tell them: Hey guys, I think I learned something from this strange alien.

We need more thinkers like him in the sciences. I am glad you gave me an excuse to mention him on this blog, and I wish I could live up to his legacy.

Judea

Keith O’Rourke,

I wrote: “causal and statistical concepts do not mix. Statistics deals with behavior under uncertain, yet static conditions.” This does not preclude statistics from modeling dynamic processes. However, it precludes the statistical model from answering questions about changes in the hypothesized process.

A typical example is Granger causality in economics. It is used for prediction of time series, which is data generated by a dynamic process. Fine. But because it uses only the joint distribution of the temporal variables involved, it falls under “statistics”, not “causal” and, as Granger himself confessed, it has nothing to do with causality; e.g., it cannot tell us whether the price at time t1 caused the price at time t2, or a third variable caused both.

Judea

I acknowledge that it is useful to have a distinction between what would have happened if you had done something different in the past, and what will happen when you do something in the future. Since the past is unchangeable, the retrospection is inherently probabilistic (ie. uncertain) whereas, if you explain clearly enough what it is you will do in the future, we can then do it, and see if it does happen.

Either way, at this point in time, a model can only provide plausibilities over what the outcome would have been in the past, or what the outcome of the experiment will be in the future. But, at a later time after the experiment, we can at least know how good the future prediction was. We can never know how good the prediction of the alternative past was, except through some kind of situation where we set things up in the future similarly to how they were in the past, do what we would have done then, and see whether the outcome is what we predicted, together with an assumption that whatever differs in the initial conditions is irrelevant to the outcome.

Andrew,

You wrote: “Counterfactual” is, to me, an awkward term in that what is counterfactual and what is not, is only determined after the experiment has been performed.

It is precisely for this reason that I prefer “counterfactual” over “potential outcome”: the latter presumes a specific treatment and some experiment. Counterfactuals need neither treatments nor experiments. If you look again at the Causal Hierarchy, you will find that counterfactuals characterize the QUERY one is asking, not the result of any experiment. The answer to some counterfactual queries may be obtained from an experiment, but the character of the query is determined by the syntax of the query, whether or not an experiment is involved; e.g., there is no experiment involved in “If it were not for the aspirin, my headache would still be bothering me” or “Had Julius Caesar not crossed the Rubicon, he would have become an emperor.”

The common element here is retrospection, not experiment.

“But I would say that, in the context of the model, they are irrelevant for the probability calculations”

The point is that if you know K1 you write down p( | K1) and if you know K2 you write down p( | K2), and if K1 contains the information about how to run a computational fluid dynamics simulation and get an acoustical noise estimate whereas K2 contains some very basic guess that a 3 term polynomial will fit your database of engine test-runs, then you’re going to wind up with very different results.

Now, if you ask the following: Suppose K1 contains causal reasoning, and arrives at a particular model p( | K1) and K2 contains only some rough guesses and arrives at p( | K2) and p( | K1) = p( | K2) by some miracle, will there be different numbers that come out of the whole shenanigans?

Of course, no, because once you’ve reduced things to specific formulas, algebra takes over. But the set of cases where the exact causal formula comes out of a wild-ass guess is pretty small, so small as to be a pure distraction in this conversation. The point is, different K going in usually leads to different formulas for p coming out. And so my everyday experience is *when I think causally about a problem, it affects what p I put down* and so “causal thinking is orthogonal to p(y | x)” is a statement that I experience violations of every day on a practical level provided that you understand what *I* mean by p(y|x) which is a predictive part of the structural equation model and NOT an observed frequency. If you want to say “it happens before you start to run Stan, so it’s pre Bayesian-analysis” then I could also point to the idea that I might have 4 or 5 plausible causal models, put a prior over them, and then do some model selection via Bayesian analysis, so it actually CAN be part of the Bayesian analysis to choose from among several models. You can call this “fitting a single uber-model” if you like… but I don’t think that these fine grained distinctions get us closer to the goal.

My biggest point is, I’m trying to help bridge a gap in communication between Pearl and people like me or maybe Andrew or Corey or possibly you, or some others at the blog, and what we experience every day is that we think about how some stuff should work, and then what we dig up out of our knowledge affects what p( | K) we write down, and the reason that is possible is, it’s a plausibility given a model, not an observed frequency in a data set. Because if it were an observed frequency in a dataset there’d be no flexibility of choice, but if it’s a plausibility given a knowledge-base, there is flexibility of choice!

As for “I don’t think counterfactuals can be observed (by definition)” ok, fine, then maybe Andrew is right a better terminology is “potential outcomes”. If I can do an experiment where I take something where x=0 and y=Y0 and then go and change x=1 and observe y=y1 afterwards, you might say “technically y might have equalled y2 if you had changed it earlier or in a different way or whatever, and y2 is the true counterfactual”

so, fine, this gets to Andrew’s point about actually modeling the physics of do(x=1), which is what I take as his meaning when he says “we need to think about potential interventions or ‘instruments’ in different places in a system”: that is, have someone describe what they mean physically by “changing x to 1” and then figure out what the mechanistic results are, and if you describe an alternative method of “changing x to 1” you might wind up with an alternative mechanistic result.

The point I was trying to make was that if we want to talk about how doing a real live experiment on a particular case produces a measurable outcome which is different from the previously measured outcome, then we could do the experiment and observe the measured outcome. Then, we could learn from the data an associational formula, without any causal thinking, that nevertheless (sometimes) would successfully predict the outcome of specific experimental interventions.

Carlos:

Regarding your statement, “I don’t think counterfactuals can be observed (by definition)”:

This is one reason I prefer Rubin’s term “potential outcomes.” The potential outcomes are defined before any treatments have been assigned; you can then do an experiment to observe some of them. “Counterfactual” is, to me, an awkward term in that what is counterfactual and what is not, is only determined after the experiment has been performed.

If the model is not the same, then you didn’t reply to my question. And as far as I can see, the question of causality being or not being orthogonal to the conditional probability only makes sense in the context of a model. If you want to include in your term K the causal relationships and the existence of competitor products, so you can make use of these properties later to calculate causal effects from observational data or to make a decision about filing your drug for approval, please go ahead. But I would say that, in the context of the model, they are irrelevant for the probability calculations. Of course they will be relevant for the creation of the model and for the use you will make of the results, but this happens outside of the Bayesian framework.

Even accepting that you can put the causality information in K, it won’t be used at all when you combine your model, priors and data to get your posteriors. The Cox/Jaynes probability formalism doesn’t handle causal links, only logical links. Judea Pearl has proposed one way to formalise causal inference so you can actually use these causal assumptions to extract additional information from the experiment. There are other formulations, probably equivalent (even though some are more widely applicable than others). Maybe you’re doing it right using ‘ad hoc’ reasoning, though.

One comment to finish (at least for today): while I didn’t really understand your example at 5:02 pm I noticed you suggested doing experiments to observe counterfactuals. I don’t think counterfactuals can be observed (by definition).


> p(Y | X=1,a,b,c,K1) and this would be some mathematical expression…

> p(Y | X=1,a,b,c,K2) and because in this case K2 has different information, this would be a different formula,

I’m afraid I don’t follow you. If the model is the same and the data is the same, why would the formula be different?


But the model ISN’T the same. As an example, in the previous comment I say that in the first case maybe I have an ODE for which a, b, c are coefficients, and in the second case, because I don’t have the causal model I don’t have an ODE, so I just do some polynomial fitting or whatever and the a, b, c are coefficients in the polynomial.

This is what I mean above by “in a worked example, you’ll see the formulas or Stan code or whatever are different in different cases”

p(foo | K1) and p(foo | K2) can be UTTERLY DIFFERENT. If K1 is causal knowledge, it might be that the first p refers to a complex computational fluid dynamics problem with strange boundary conditions to predict, say, the acoustical noise produced by a jet turbine engine, while under K2 we’ve just got a 3-term polynomial regression against a dataset of tests of prototype engines with different blade lengths.

Carlos, another way to think about this very concretely. Suppose I want to be so honest it hurts with my modeling. In the case where K1 contains causal thinking, I can apply that causal structure to the observational data. When X=0 and when X=1, my causal model f(X,a,b,c) + err operates. So I can learn about both the case X=0 and the case X=1 even without experiments where I change the X=0 into an X=1, because I think the same mechanism is going to be at work when I make that change as when nature does it for me in the observations.

Now, suppose I’m fully associational, with no causal thinking, and just fitting curves and things. Now I can collect my data Yactual, and add to my data set Ycounterfactual, containing entirely missing data values “NA”. I can then say perhaps “at least I don’t think the Ycounterfactual will be outside the range of Y across all cases”. So I put a prior on Ycounterfactual that is very broad but not infinitely so. Now, with no data on Ycounterfactual, the posterior over Ycounterfactual will be the same as the prior over Ycounterfactual… So if you ask me to predict a counterfactual value, I’ll wind up just giving you the prior distribution.
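A minimal sketch of this bookkeeping, with invented numbers: because the likelihood of the observed data never mentions the all-NA Ycounterfactual column, the “posterior” over a counterfactual outcome is just the broad prior handed straight back.

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed outcomes for the cases as they actually occurred; the
# counterfactual outcomes are entirely missing ("NA"), as described above.
y_actual = rng.normal(5.0, 1.0, 200)

# A broad-but-not-infinite prior on a counterfactual outcome: uniform over
# a range somewhat wider than the observed data (an illustrative choice).
lo, hi = y_actual.min() - 2.0, y_actual.max() + 2.0
prior_draws = rng.uniform(lo, hi, 100_000)

# The likelihood of y_actual does not involve Y_counterfactual at all, so
# every importance weight is the same constant: the posterior IS the prior.
weights = np.ones_like(prior_draws)
post_mean = np.average(prior_draws, weights=weights)
prior_mean = prior_draws.mean()
print(prior_mean, post_mean)
```

Ask this model for a counterfactual prediction and it can only hand back the broad prior, exactly as described.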

Perhaps then I go in and do some experiments, so I actually eventually observe some Ycounterfactuals (under the experimental treatment). Now maybe I can learn the counterfactual relationship even without a causal model.

]]>>> given the model what relevance does it have whether or not you used a causality argument to formulate it?

> p(Y | X=1,a,b,c,K1) and this would be some mathematical expression…

> p(Y | X=1,a,b,c,K2) and because in this case K2 has different information, this would be a different formula,

I’m afraid I don’t follow you. If the model is the same and the data is the same, why would the formula be different?

There are many pieces of “knowledge” in K that might have influenced the design of the experiment and the model chosen. For example, if you are developing a drug, your clinical trial might depend on how close to market your competitors are. Would you say that P is not orthogonal to the existence of other drugs under development?

]]>K includes the color of my shoes, fine, but regardless of the color of my shoes I choose the same P, so at least P is insensitive to that part of K. I.e., the color of my shoes is irrelevant precisely because it doesn’t change P.

Now, what relevance does my choice of whether P is formulated by a causal argument contained in K vs whether P is formulated by a non-causal argument contained in K have? It has precisely the effect of usually altering my probabilities P and that includes over counterfactual predictions.

if you say: “in case 214 if we had given the treatment what would the outcome have been?” I’d say:

K1 contains causal information, so let me plug in X=1 to the equations, and then I’d give you p(Y | X=1,a,b,c,K1) and this would be some mathematical expression… and I could for example average over my uncertainty in a,b,c and give you a number.

or maybe I’d say:

K2 is not causal here, so I don’t know what would happen if you actually changed X=1 in that case, but to the extent that the observed information is sufficient to predict things, I’d calculate:

p(Y | X=1,a,b,c,K2) and because in this case K2 has different information, this would be a different formula, and the posteriors over the a,b,c would be different, and their use within the function would be different, and if I averaged over the a,b,c I’d get a different number. And, if I’m honest, and aware that I’m not doing a “structural equation model” with causal analysis, I should give the strong caveat here: we don’t even expect this to necessarily work.

Now, it’s POSSIBLE to accidentally stumble upon the right causal formulas even without a causal analysis in K2, so you’d ask me: what if the formulas were the same? And I’d basically have to say “What a happy accident!” But in the general case, they wouldn’t be the same. For example, under K2, the non-causal analysis, my a,b,c might be just some coefficients in a general polynomial, whereas in K1 perhaps they’re coefficients in an ODE that I have to integrate to get the answer.

It’s also possible to do the causal analysis and be utterly wrong, so that K1 predicts just wildly inaccurate stuff for counterfactuals. Caveats about how good your model is apply everywhere, not just in “associational” models or just in “causal” models.

So, I think it’s very useful to keep in mind what you’re doing, what does the K mean, and what are you requiring of your model. It’s particularly useful if you want to design experiments to test your model, where you will literally make X=1 and measure Y, and see if the model predicts the right outcomes.

]]>> P is conditional only on the background knowledge/causal model you have BEFORE the data, then of course it can change from one place to another, and it can do so based on the causal analysis, and so the P of a Cox/Jaynes Bayesian, conditioned on knowledge K, can NOT be orthogonal to the knowledge K which includes the causality model.

The knowledge K includes everything that is known, therefore P can NOT be orthogonal to the colour of your shoes. Maybe you will tell me now that the colour of your shoes is irrelevant, but given the model what relevance does it have whether or not you used a causality argument to formulate it?

]]>Also note, if your goal is to learn the Fr out there in the big data set, then you can put Plausibility Pl over a variety of possible Fr functions, plug in your data to the machinery and turn out posterior distributions over the Fr. This is sometimes done using say gaussian mixture models, or fancy models like Dirichlet processes, where you’re sort of putting a probability over possible histograms, and in the end a “peaked” posterior means all the histograms you plot from the posterior distribution look very close to the same.

In this case, there is no way to even discuss the idea of a “real frequency” of histograms. Suppose there are 10 million people out there who are of interest. There is the real frequency histogram we’d get if we did a sample of all 10 million subjects. There is the data we have from a large randomly selected data set of only 1 million subjects. And then there is what we know about the histogram in the 10 million subject case, given the background info, including the choice of, say, Dirichlet processes with certain priors, and the 1 million subject sample. This last thing is in terms of Cox/Jaynes plausibility of what the frequency histogram in the 10 million subject case is. There is only one real frequency histogram for the 10 million subject case, so there is no frequency over it, but there is still plausibility over it.
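A scaled-down sketch of this idea (100,000 people standing in for 10 million, and a plain Dirichlet prior over bin frequencies standing in for a full Dirichlet process; every choice here is illustrative): the posterior is a plausibility over candidate histograms, and with a large sample it peaks tightly around the one real frequency histogram.

```python
import numpy as np

rng = np.random.default_rng(2)

# A finite "population" with one fixed, real frequency histogram.
bins = np.linspace(-4, 4, 9)                  # 9 edges -> 8 histogram bins
population = rng.normal(0, 1, 100_000)
true_hist = np.histogram(population, bins)[0] / population.size

# A large random sample from that population (the data we actually have).
sample = rng.choice(population, 10_000, replace=False)
counts = np.histogram(sample, bins)[0]

# Dirichlet(1,...,1) prior over bin frequencies -> Dirichlet posterior.
# Each posterior draw is a whole candidate histogram; with this much data
# the draws all look nearly identical: a "peaked" posterior over histograms.
draws = rng.dirichlet(1 + counts, size=1000)
post_mean_hist = draws.mean(axis=0)
max_err = np.abs(post_mean_hist - true_hist).max()
print(max_err)
```

There is exactly one `true_hist`, so no frequency distribution over histograms exists; the Dirichlet posterior is a plausibility over them, and it concentrates.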

So, if you are simultaneously trying to learn the causal parameters a,b,c in your f(x,a,b,c) function and at the same time trying to learn the real frequency of the errors Fr(y-f(x,a,b,c)), as opposed to just sticking with the theoretical information in your causal model about the errors, you can do this by including the above kinds of tricks in your model, and then you’ll have a different model.

The vague symbolic versions of these things are sometimes poor substitutes for several examples of different assumptions leading to running Stan code with different code in it. Then you can point to different lines in the code and say “see, my causal assumption FOO, which is incorporated into the code for file 2, makes me change line 12 from what you see in file 1 to what you see in file 2” and the use of the “Knowledge K” becomes a concrete thing that is easier to point at.

]]>Keith, I think this is one reason people always ask for a “real world” example: in a real world example, background information must be used, and then the P will change with this information, and in a worked example you’ll see that the formulas or Stan code or whatever are different in different cases. Where does this difference come from? If you believe p(Y | x) is a fact out there in the world, equal to the Fr(Y|X), then there should be no changing of the P from one model to another. But if you acknowledge that the P is conditional only on the background knowledge/causal model you have BEFORE the data, then of course it can change from one place to another, and it can do so based on the causal analysis. And so the P of a Cox/Jaynes Bayesian, conditioned on knowledge K, can NOT be orthogonal to the knowledge K which includes the causality model.

]]>I don’t really remember if this is how I originally found the paper, but Andrew Gelman made a comment about it a few years ago: http://statmodeling.stat.columbia.edu/2012/01/21/judea-pearl-on-why-he-is-only-a-half-bayesian/

You have reminded me of Lindley’s review of your book. http://onlinelibrary.wiley.com/doi/10.1111/j.1751-5823.2002.tb00355.x/abstract

Unfortunately I can’t find a copy online now. There is also a section “Seeing and Doing” in his book “Understanding Uncertainty” but he doesn’t go into much detail there.

Daniel:

Nice point about the K.

If you write the prior and data generating models out chronologically, for instance in ABC format, you write out very different generation schemes when there is random assignment to groups versus selection into groups.

This is all lost (but should not be forgotten) in the standard Bayesian formulation of

P(theta | data) = P(theta) * P(data | theta) / P(data)

But, see my even further discussion of the conditioning in my reply to Judea above.

]]>Judea, as I said, Cox/Jaynes Bayesianism is not an *extension* of probability, since Cox’s theorem says that Pl is isomorphic to mathematical probability theory… so this is one reason the Bayesians want to continue using P just as you want to… And in a sense, this is absolutely correct.

But, of course information about what you’re doing is important in assigning the probabilities. So, typically, when trying to be very unambiguous, those of us in this camp want to add a further symbol, a symbol that stands for the state of knowledge used to assign the P.

P(y | x, a, b, c, K) = normal(f(x,a,b,c), s), where the K is some stand-in symbol for all the stuff we assumed in order to get f, to choose the normal distribution, to put priors on s, and so forth.

and when you say “I jump to P, because my P is enslaved to Fr when the sample is large”

then the K must be KLargeSamp = “I have a large sample and I accept that it is fully representative of the future and so I enslave P to Fr”

But, of course, in doing this, you have “used up” the information in the sample. So you have NO data with which to infer anything else! Put another way, your probability is conditional on your data already, so if you want to do better and peak your posterior distributions of parameters etc., you must collect MORE data. And then when you do, what if Fr2 is not the same as Fr1? Whoops, go back and re-calibrate P to Fr2… but then again, you have no data to peak your posteriors of the parameters… so collect more data… but whoops, around in a circle we go. Eventually your Fr may stabilize, and now you can use it in your next data set… Why are my posteriors always so FLAT, you ask? It’s because you’ve used up your data in enslaving P to Fr! And that should not be surprising, because if you want to learn Fr in all its details, it’s an infinite dimensional thing! It would take a lot of data!

Alternatively you might wish to go back to right before you collected your data… when you had only the causal model… When this is the case, you CAN NOT use the state of knowledge KLargeSamp, you can only use what the causal model gives you. Perhaps it tells you “things can not be too far from the prediction” and then you are in the case I typically am in, where you assign, say, a normal distribution because you believe it represents your knowledge K about how good the causal model is.

Enslaving the P to the Fr is altering your causal model and using up your data; do so at your own risk. I believe your “transportability” analysis is a method by which you undo this. The method by which I would avoid this problem is to try to be faithful to the full set of background knowledge that I have at the time I *build* the model, and then put the data in and see what I get. I don’t insist on Pl = Fr before I can get started. Note, I have a specific blog post on this topic that shows in an example how I do not need to do this enslavement, and for your refreshment, it’s about Orange Juice: http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/

As an example of how this works, my full set of background knowledge might be “I am taking a sample of college freshmen”, in which case, if I’m being fully honest, I must admit that this will not “transport” to the whole population. So if I want to extrapolate to the whole population, I must put into my model some information that I may have about how extreme the failures to transport might be. So my Pl must NOT be my Fr_college_freshmen() no matter how large my college freshman sample is; it must be my knowledge about how well my causal model works across all populations, which will typically be much less specific than Fr_college_freshmen(): much broader.

Put another way, Pl is not a fact about the world, Pl is always a fact *about my model*, which is why it is decorated with the K to remind me what information went into building the model. And so, by constantly asserting the need for some K, we can go back to using P instead of Pl and Fr and then we must remember that if we just put p(x) we really mean P(x | K) but are being lazy.

Now, as to the “do” operation. This distinction between p(y | do(x=1), a, b, c, K) = normal(f(1,a,b,c),s) and p(y | x=1,a,b,c,K) = normal(f(1,a,b,c),s) is already built into K if we’re being honest, but we often are not explicit enough, so I don’t think the “do” will cause any harm, and it can be helpful. But the bigger picture in my mind is to be more explicit about the knowledge we assume, all the baggage inside the K

and typically some of the baggage in the K would look like this:

“I have carried out a careful first-principles analysis of some physics/chemistry/biology and choose a function f based on this scientific analysis, and I believe there are causes in the world that enforce f to be true whether I observe x or whether I go in and set x to some value… and my analysis informs me that the accuracy with which f predicts y is such and such, and that there is a process of selection in doing my survey that makes it more likely to get x values near x1 than I would if my process of selection were more uniform across all the possibilities in the world, and the doctors doing the blood draws were not blinded to who they were drawing blood from, and my instruments were the ones available at hospital H1 but there is a whole different manufacturer whose instruments may be used at other hospitals………..”

What a lot of stuff to cram into such a small symbol K!!!

So, when the Bayesians on this blog go about being lazy with their notation, you must remember that those who know what’s going on admit this K under the hood, and so they have in their mind whether they are analyzing a causal connection or a non-causal connection but they are lazy in their notation. Still, they will be very confused by “P is orthogonal to causality” because they’re assuming:

p( stuff | K = “there is a massive causal model that says what my P should be”)

so since the causal model is telling them what the P should be, it seems so obvious that you are wrong in saying “P is orthogonal to Causality” but I believe this is a misunderstanding caused by 2 ideas:

1) Many or even MOST people enslave the P to the Fr *mistakenly*; there is no logical requirement to do so, and it is important to understand the implications of doing it!

2) Bayesians who don’t enslave the P to the Fr are not explicit enough in their notation to write out even the tiny symbol “K” much less all the stuff that the K stands for.

And so Judea Pearl comes to this blog and sees people writing p(y | x) and believes he knows what they mean, and he says “these people are fully confused!”

and these modelers at this blog see “p(y | x) is orthogonal to causality” and so they say “Pearl is eating too many bananas!”

And so, if we are honest, we will say:

p(y | x, a, b, c, K(first_principles_causal_analysis_only_without_any_data_to_enslave_P_to_Fr_see_appendix_A) ) = normal(f(x,a,b,c),s)

and our formulas will take up several lines, but now I think both you and I can see that with the K in place, the question of whether p is orthogonal to causality is a question of whether K is in fact a causal analysis with a first-principles analysis of the precision involved, or is a pure associational analysis into which we plan to stick a big dataset.

]]>Yes, p(y | x=1,a,b,c) = normal(f(x,a,b,c),s) where a,b,c may be facts about the world that are constant across all cases (such as an unknown but fixed speed of light, or diffusivity of a protein through a fixed medium or whatever), or facts about the individual case (such as for people: age, personality traits, race, adiposity, fitness levels, concentration of magnesium in the blood whatever)…

]]>Judea:

“causal and statistical concepts do not mix. Statistics deals with behavior under uncertain, yet static conditions”

Perhaps many discussions and concepts in statistics are static but that does not make all of statistics static!

There are diachronic concepts/models.

To me, being Bayesian means purposefully representing empirical phenomena (data in hand or in the future) jointly as arising from a data generating (probability) model, with that data generating model first being randomly set with some choice of a parameter (or, more generally, a distribution).

That is

1. A data generating model is randomly selected.

2. Given the data generating model selected – generate data.

With data already in hand, the joint distribution defined in 1 and 2 is to be conditioned on the data values (Bayes’ theorem).

(This provides a simple example/picture https://en.wikipedia.org/wiki/Approximate_Bayesian_computation#The_ABC_rejection_algorithm )
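The ABC rejection algorithm linked above is steps 1 and 2 made literal: repeatedly pick a generating model, generate data from it, and keep the picks whose generated data matches the data in hand. A toy binomial version (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Data in hand: say 14 successes in 20 binomial trials (invented numbers).
n_trials, observed = 20, 14

accepted = []
for _ in range(20_000):
    theta = rng.uniform(0.0, 1.0)          # 1. randomly select a data generating model
    fake = rng.binomial(n_trials, theta)   # 2. given that model, generate data
    if fake == observed:                   # condition on the data values in hand
        accepted.append(theta)

# The accepted thetas are draws from the posterior p(theta | data); with a
# flat prior this is Beta(15, 7), so the accepted draws average near 15/22.
accepted = np.array(accepted)
print(len(accepted), accepted.mean())
```

The whole causal/generative structure lives in how step 2 is written; change the generation scheme (random assignment vs. selection into groups) and the same conditioning recipe gives a different analysis.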

But because it needs to be a purposeful representation of the world, it needs to represent what one thinks is/was happening/going on, and there may be many steps within 1 and 2 to do this (e.g., in 2, selective reporting of the data that was generated). For the joint model defined (and what it represents), Bayes’ theorem gives what the analysis needs to be.

David Freedman has written out simple DAGs this way (using his infamous box models) and I did a set of standard simple ones for a webinar. You have to build the causal structure into the joint model (the random choice/setting of the “joint distribution and data generating models” model) but, except perhaps for the third level of your hierarchy, I don’t see why there would be a problem.

]]>Carlos Ungil,

Thanks for reminding me of the paper I wrote in 2001, “Bayesianism and Causality, or, Why I am only a half-Bayesian”: http://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf

I just read it again and, thank God for making me modest, otherwise I would have confessed in public that this is one of the best papers I have read on Bayes inference since Savage (1962). Strangely, it is cited by only 34 papers when, in contrast, my book on Bayesian networks (1988) has 22,447 citations (according to Google Scholar). How on earth did you discover it?

Let me cite a few paragraphs to tell readers what it is all about, and how it is connected to the discussion with Daniel:

Introduction

I turned Bayesian in 1971, as soon as I began reading Savage’s monograph *The Foundations of Statistical Inference* (Savage, 1962). The arguments were unassailable: (i) It is plain silly to ignore what we know, (ii) It is natural and useful to cast what we know in the language of probabilities, and (iii) If our subjective probabilities are erroneous, their impact will get washed out in due time, as the number of observations increases.

Thirty years later, I am still a devout Bayesian in the sense of (i), but I now doubt the wisdom of (ii) and I know that, in general, (iii) is false. Like most Bayesians, I believe that the knowledge we carry in our skulls, be its origin experience, schooling or hearsay, is an invaluable resource in all human activity, and that combining this knowledge with empirical data is the key to scientific enquiry and intelligent behavior. Thus, in this broad sense, I am still a Bayesian. However, in order to be combined with data, our knowledge must first be cast in some formal language, and what I have come to realize in the past ten years is that the language of probability is not suitable for the task; the bulk of human knowledge is organized around causal, not probabilistic relationships, and the grammar of probability calculus is insufficient for capturing those relationships. Specifically, the building blocks of our scientific and everyday knowledge are elementary facts such as “mud does not cause rain” and “symptoms do not cause disease” and those facts, strangely enough, cannot be expressed in the vocabulary of probability calculus. It is for this reason that I consider myself only a half-Bayesian.

In the rest of the paper, I plan to review the dichotomy between causal and statistical knowledge, to show the limitation of probability calculus in handling the former, to explain the impact that this limitation has had on various scientific disciplines and, finally, I will express my vision for future development in Bayesian philosophy: the enrichment of personal probabilities with causal vocabulary and causal calculus, so as to bring mathematical analysis closer to where knowledge resides.

The Demarcation Line

The demarcation line between causal and statistical concepts is thus clear and crisp. A statistical concept is any concept that can be defined in terms of a distribution (be it personal or frequency-based) of observed variables, and a causal concept is any concept concerning changes in variables that cannot be defined from the distribution alone.

Summary

This paper calls attention to a basic conflict between mission and practice in Bayesian methodology. The mission is to express prior knowledge mathematically and reliably so as to assist the interpretation of data, hence the acquisition of new knowledge. The practice has been to express prior knowledge as prior probabilities — too crude a vocabulary, given the grand mission. Considerations of reliability (of judgment) call for enriching the language of probabilities with causal vocabulary and for admitting causal judgments into the Bayesian repertoire. The mathematics for interpreting causal judgments has matured, and tools for using such judgments in the acquisition of new knowledge have been developed. The grounds are now ready for mission-oriented Bayesianism.

————–end of quotes —————-

As I said earlier: I am nominating it for a Best Paper Award. Unfortunately, my colleagues at the Society for Bayes inference think that, to be a Bayesian, you need a “theta” — No theta, no Bayes.

Anecdotally, of all my Bayesian colleagues, only the late Dennis Lindley (1923–2013) admitted that Bayes analysis should adopt the do(x) notation.

Thanks again for refreshing my memory,

Judea

Did you mean p(y | x=1,a,b,c) = normal(f(x,a,b,c), s)? No problem if you were keeping notation light, but I’m not sure if you’re assuming a,b,c are “unique” parameters (that you describe with posteriors that will eventually peak at some value). In fact a,b,c are stochastic and depend on the individual. To give a general solution for p(y | x=1) you would have to integrate over them, and the distribution can be anything.

]]>Daniel,

I can’t accept your distinction between Fr and Pl as described, for several reasons.

1. First, I do not restrict my P to frequencies, because I believe in the usefulness of judgmental knowledge and I would like to give it a symbol. So I call it P(E), where E is any event that can be defined in terms of observable variables, regardless of whether those variables are actually measured or not. So, P(E) is my personal belief that event E is true. If we are lucky, we sometimes get frequency to support our P(E), but I don’t want to switch to Pl just because my equipment is not good enough to measure all the variables supporting E. Everything is subjective under P, so we do not need Pl to remind us of that. Another way of saying it is: go ahead, put your favorite Pl in front of everything I write, and we achieve peace and prosperity. For example, if I write P(y|do(x)) = p1, you can add Pl{P(y|do(x)) = p1} = 0.9999. Peace and prosperity: you are happy with the Pl, and I just ignore it and focus on what’s behind it, namely P(y|do(x)) = p1.

2. Now, where does P(y|do(x)) = p1 come from? Two ways. Either I am conducting an RCT and observe that in individuals subjected to X=x the frequency of Y=y happened to be p1. (You might prefer to call it Fr(y|do(x)) = p1, but I jump to P, because my P is enslaved to Fr when the sample is large.) Or, I am conducting observational studies on x,y to get P(x,y) (or Fr(x,y)), and I have a theoretical model, written as a structural equation Y = f(x,a,b), and I use some features of f(x,a,b) to extract P(y|do(x)) from Fr(x,y). (Examples of “features” are “y does not affect x” or “b affects only y, not x”, etc.) I don’t need Pl in this exercise.

3. I may need Pl when I am not sure about f, and I have two competing models f1 and f2, and I look around with bewilderment: what shall I do? I open the dictionary of advanced Bayesian analysis, and it tells me: don’t panic! When you have two competing models, assign priors, fit, wait for the posterior to peak, and call the posterior Pl(y|do(x)). Here I know I am in trouble. As long as I decorate some formulas with Pl to pacify my Bayesian colleagues, it does no harm. But when Pl conflicts with P, pause, man, check the sources of the conflict, and proceed with caution (perhaps using transportability theory). But don’t even look at the posteriors because, even if they peak, they may peak into nonsense.

Conclusion: if you want me to decorate all my formulas with Pl — fine with me. But I would still like to analyze what is in the brackets behind the Pl{****}; this is what counts. And behind those brackets I find the Causal Hierarchy distinction that is so useful to anyone who tries to work out a concrete example under the microscope.

Judea

Success!!! Seriously, now I think we are about to discover a gulf between you and some of us on this blog, one which consists in a solvable misunderstanding (this Cox/Jaynes camp is aware that such misunderstandings exist). Unfortunately, while we can come up with some conventions here that will make it plain how to annotate our thoughts unambiguously, the term “probability” and the use of the letter “p” have historical baggage, and there is a fight between camps related to the use of this word; I can’t snap my fingers and make it go away any more than I can snap my fingers and make hatred for people with different religious beliefs go away. Still, I am willing, in this context, to do the following.

Suppose I take a large sample of x and y out in the world and I see that whenever x has value 1, y has various values. Let’s call the frequency (or frequency density) with which we see a value y in this large sample by the annotation

Fr(y | x=1)

Then, I agree with you wholeheartedly that this observed fact could be consistent with any sort of causal connection between y and x. It is orthogonal to causality! Do we agree here? I hope so. We could imagine this as “amount of rainfall” and “amount by which barometer dial changes”, and we can both agree that if I go in and push on the barometer dial, I can not make the rain fall. Or, it could be “amount of rainfall” and “change in temperature relative to the dew point”, and then if you can somehow force the temperature in a whole region to drop below the dew point, then rain would fall depending on how much water was in the air etc. Both are consistent with the same Fr(y | x=1); the first is non-causal, and the second is fully causal, so causality must be orthogonal to Fr(y|x=1).
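The barometer point can be simulated directly. In the sketch below (all numbers invented), structure A has x directly causing y, while structure B has a common cause z driving both x and y; the observational Fr(y | x near 1) is the same in both, but forcing x = 1 gives different answers.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Structure A: x directly causes y.
x_a = rng.normal(0, 1, n)
y_a = x_a + rng.normal(0, 0.5, n)

# Structure B: a common cause z drives both; x is a faithful readout of z
# (the barometer needle), and z drives y (the rain).
z = rng.normal(0, 1, n)
x_b = z.copy()
y_b = z + rng.normal(0, 0.5, n)

# Observationally the two are indistinguishable: Fr(y | x near 1) matches.
sel_a = np.abs(x_a - 1.0) < 0.05
sel_b = np.abs(x_b - 1.0) < 0.05
obs_a, obs_b = y_a[sel_a].mean(), y_b[sel_b].mean()

# Under intervention do(x = 1) they diverge: in A, y tracks the forced x;
# in B, pushing on the needle does nothing to z, hence nothing to the rain.
do_a = (1.0 + rng.normal(0, 0.5, n)).mean()   # near 1.0
do_b = (z + rng.normal(0, 0.5, n)).mean()     # near 0.0

print(obs_a, obs_b, do_a, do_b)
```

Same observational conditional, opposite interventional answers: the causal question simply is not a function of Fr(y | x).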

Now, let’s you and I sit down with our “thinking caps” on and discuss the physics of what goes on in the world related to y and x. Somehow we will both realize that x does cause y in some way, and we will come up with some explanation of how that happens, and I will write down my thinking using algebra, and you will teach me how to express it in terms of graphs and do-calculus. It will be a good time had by all.

Unfortunately, both you and I will agree in this case that even if we can set x=1 we can not predict y to 300 decimal places. The world is not so exact; it isn’t like C = 2 * pi * r, where pi = 3.1415926535… and we get exactly the correct answer every single time. Still, we do agree that setting x=1 will produce values of y near to y1, and we also agree that it doesn’t seem plausible to us that, if our model were correct, y could be farther away from y1 than about s units, and that values nearer to y1 seem more reasonable to us than other values. Now we ask ourselves: how do we assign some number that helps us say that our equation is imprecise, but that it predicts values near y1, denying values more and more as they depart from y1 until, by the time they get out to y1 ± 2 to 3 s, they seem totally implausible based on our causal analysis of the physical connection between x and y?

Let’s call this function that assigns numbers for how plausible things are according to our model of causality “Pl”, for “plausibility”. Now we have some logical requirements for any system of plausibility… and when we look at them, it turns out they are basically equivalent to the requirements that Jaynes lays out carefully in “Probability Theory: The Logic of Science” and elsewhere for degrees of plausibility. Then we read Jaynes’ version of Cox’s theorem, and we decide that the mathematics of plausibility is the same mathematics as probability; there is no generalization of probability, and the Cox axioms say plausibility values behave *exactly* as probability does…. but the *meaning* is different from our previous function “Fr”.

So now we have Pl(y | x=1) is *a part of our structural equation* that specifies how “tightly” our equation should be expected to predict y due to these causal considerations.

If you admit the idea that we can do such a thing, that is, include in our structural equation / causal analysis a measure of imprecision of our outcome, a measure of “for all we know this region is as good as our physical analysis gets us” and you admit to the need for some logical structure to calculations regarding this “Pl” function, then I believe you will wind up as a Cox/Jaynes Bayesian as well because Cox’s theorem says you have to be if you agree with the axioms, and the axioms are pretty mild.

So, now, here I am in this position, I’ve read Cox’s theorem, I’ve drunk the Kool Aid, and I know some physics and biology and soforth, and then I run out and start modeling y and x. And the first thing I do is I sit down with my physics and I decide … f(x,a,b,c) is a great model for y based on physics, if only I knew the values of a,b,c but… it doesn’t predict y exactly, it only predicts that y will be “very close” on the scale of some size “s”.

So I will wind up writing down the following *structural equation*.

y = f(x,a,b,c) + err

But I have more information than this, so by the way as part of my causal analysis, the err will be a number in the high-plausibility region of a function which I will assign to describe my degree of plausibility.

Pl(err) = normal(0,s) which says that the error, which is exactly equal to y – f(x,a,b,c) is a number close to zero on the scale of s. NOTE it DOES NOT say that if I repeatedly “do” x=1 over and over I will get y values distributed according to Fr(y | x=1,a,b,c) = Pl(y | x=1,a,b,c)… that is, Frequency and Plausibility are orthogonal, just like frequency and causality… and they must be, because the Pl function is really a part of my causal equation modeling…
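As a small worked sketch (invented linear f and made-up data) of how Pl(err) = normal(0,s) operates inside the machinery: the plausibility assigned to the residuals is exactly what converts data into a posterior over the unknown coefficient, without ever claiming the residuals have that frequency distribution.

```python
import numpy as np

rng = np.random.default_rng(5)

# Structural equation y = f(x, a) + err, with the illustrative choice
# f(x, a) = a * x, and the plausibility assignment Pl(err) = normal(0, s).
s = 0.5
x = rng.uniform(0, 2, 30)
y = 1.3 * x + rng.normal(0, s, 30)       # "nature" happened to use a = 1.3

# Grid posterior over the unknown coefficient a (flat prior over the grid).
a_grid = np.linspace(0, 3, 601)
# log Pl(data | a): sum of log normal(0, s) densities of the residuals
log_lik = np.array([-0.5 * np.sum((y - a * x) ** 2) / s**2 for a in a_grid])
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

a_mean = np.sum(a_grid * post)
print(a_mean)
```

The normal(0,s) here is part of the model (how tightly the causal equation is believed to predict), yet it is all that is needed to peak the posterior over a.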

So, now a Bayesian on this blog says

p(y | x=1) = normal(f(x,a,b,c), s)

Because they use p to mean, *ambiguously*, “plausibility” (a non-observable thing, unlike the frequency Fr). And then they read your statement “the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x” and they think THIS IS ABSURD, because when they read “conditional distribution” they interpret “Pl”, while when you write “conditional distribution” you intend “Fr”.

So, your objection is then “everything is a probability”, and I want to modify this to “everything that is not known with absolute certainty, the way the value of pi is known, can have some varying degree of plausibility associated with it, as a tool of a kind of generalized logic”.

So, is that so bad? Note, we’re not talking about “this drug reduces the frequency with which people get cancer” we’re talking about “taking this drug makes it more plausible that YOU will not have cancer any more”

I believe in the virtue of distinctions too, so I like the distinction between Fr and Pl. But as I say, I can’t wave a wand and force everyone to use unambiguous notations, and since Cox’s axioms say that Pl behaves exactly as “probability”, it seems unlikely that we will avoid notational ambiguity. However, can we now agree that you and I do not have *conceptual* ambiguities between Pl and Fr? If you would like further conceptual discussion of how Pl and Fr differ, or where one or the other comes into play in practice for people around this camp, please ask away!

]]>Daniel,

I wrote:

1. “We will see that the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x;

rather, it conveys causal information that is orthogonal to the statistical properties of x and y”

I stand behind it. The structural equation Y = a + bX + err does not constrain the conditional distribution of y given x.

Any values of a and b are consistent with any conditional distribution P(y|x) [at least in the linear case].
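A quick numerical check of this claim (all coefficients below are made up): a confounded model with structural b = 0 and an unconfounded model with b = 0.5 can be tuned to generate the identical joint, and hence conditional, distribution of the observed variables.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Process 1: structural b = 0, but a confounder u drives both x and y.
u = rng.normal(size=n)
x1 = u + rng.normal(size=n)                      # Var(x1) = 2
y1 = 0.0 * x1 + u + rng.normal(scale=np.sqrt(0.5), size=n)

# Process 2: structural b = 0.5, no confounding, variances matched.
x2 = rng.normal(scale=np.sqrt(2), size=n)        # Var(x2) = 2
y2 = 0.5 * x2 + rng.normal(scale=1.0, size=n)

# Both (x, y) pairs are N(0, [[2, 1], [1, 1.5]]), so P(y|x) is the same,
# yet the structural coefficients are 0 and 0.5 respectively.
slope1 = np.polyfit(x1, y1, 1)[0]
slope2 = np.polyfit(x2, y2, 1)[0]
print(slope1, slope2)                            # both regressions near 0.5
```

Observational data alone cannot tell these two processes apart, which is exactly the orthogonality at issue.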

I have also defined very carefully what I mean by statistical properties: “properties definable in terms of the joint distribution of OBSERVED VARIABLES”.

If you want to broaden the notion of probabilities and statistical properties, you have my blessing, but we will then need to distinguish between P in the statistical sense and P* in the broadened sense. I am not convinced of the wisdom of this distinction. And this brings me to your second complaint:

2. I also wrote

“I did not realize though that their philosophy has bloomed into a methodological movement that is more than just a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak.”

I stand behind it too. Note the word MORE. I confess to having believed that the Cox/Jaynes philosophy remained in the state I saw it in the 1980’s, namely a free license to assign priors to everything, fit to data, and wait for the posteriors to peak. But our conversations gave me hope that it has bloomed into something different, e.g. a more disciplined science that is more productive than this general license. I was hoping to learn from you what else it permits and forbids besides that license. Why from you? Because you are the first person from that camp who showed interest in discussing a concrete 3-variable example, as opposed to metaphysical discussions that hide assumptions, research questions, etc. under the cover of “practical messy problems”.

I am still hopeful to learn what the Cox/Jaynes philosophy has developed into, but from your interpretation of the structural equation example I infer (perhaps hastily) that EVERYTHING is probability: causality, counterfactuals, logic, metaphysics… just everything.

[For example, take the statement “were it not for the aspirin I would still have a headache”, which I classified as counterfactual, while Cox/Jaynes (according to my understanding) would argue: No way! It is probabilistic, because all I am saying is that I am attributing high probability to it. The same goes for “this drug reduces your chances of cancer” — the moment you believe in it, Boom-Boom! it turns probabilistic in the Cox/Jaynes sense.]

I believe in the virtue of distinctions, not in the blurring of distinctions. So I am inclined to give up on the Cox/Jaynes ideology/methodology/license, unless anyone can show me (using a concrete 3-variable example) that it is not just a license to “assign and fit.”

Judea

Judea, agreed!

]]>Daniel,

Once we define the unit-based NIE(u) = f(0,g(1),u) - f(0,g(0),u), we can play around with its expectation, its probability, its median, or even, as some people like, the expectation of the ratio f(0,g(1),u) / f(0,g(0),u). Why not?

Whatever one is concerned about in practice.

This does not deserve an ideological war.

Judea
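As a concrete sketch of the unit-based quantity above (the particular functional forms of f and g here are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented mediation model for the sketch:
def g(x, u):           # mediator as a function of treatment x and unit u
    return 0.8 * x + 0.1 * u

def f(x, m, u):        # outcome as a function of treatment, mediator, unit
    return 2.0 * m + 0.5 * x + 0.2 * u

u = rng.normal(size=10_000)    # unit-level background variables

# Unit-based natural indirect effect: hold treatment at 0, but move the
# mediator from its x=0 value to its x=1 value.
nie_u = f(0, g(1, u), u) - f(0, g(0, u), u)

# Any summary of NIE(u) is now available:
print("mean:  ", nie_u.mean())
print("median:", np.median(nie_u))
```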

Also, Judea, I question whether it really behooves me to compute NIE in the pre-data case. For example, a researcher works many years to develop a causal structural model for some process… Then, the researcher hires a computer scientist who knows how to fit models using software, perhaps a naive young graduate student RA who does not study the process in question.

Then, the RA says “gee, I need to put some priors on these parameters my boss gave me… but I know virtually nothing about this process… for now I’ll put these enormously wide priors… maybe priors with no mean value, such as Cauchy priors.”

Now the boss comes and says “we’re going to write a grant to get funds to do some surveys, to find out how to make things better in the world using our causal model; for the grant-writing process we’ll need this quantity NIE… here’s how it works, please compute it for me.” And the RA goes out and, using the enormously wide priors, computes the NIE. Perhaps, even though some of the underlying parameters HAVE NO MEAN VALUE, the NIE does actually have a mean value, and the RA reports it… it’s, say, 75, and everyone rejoices, because this is a big number in this field, and so we can make small changes to X and get big improvements in the world according to this model!

Now, the grant is approved, and we get our surveys and our data, and the posterior values concentrate, and we re-compute the NIE and we discover NIE is close to 0.

How can this be?

If, instead of the NIE expectation, we had looked at the distribution of the quantity over which the expectation is taken, we might have seen that, given the uselessly broad priors, the effect could have been anything from 0 to 1000 with a mean of 75, and that this was due to our lack of knowledge: as soon as some data became available, some parameters concentrated sharply and the NIE post-data was near 0.

In general, taking the expectation of a highly uncertain quantity can leave you with a false sense of what might happen. Only once the distribution has concentrated somewhat does the width around the expected value stop mattering so much. The quantity f(0,g(1),u)-f(0,g(0),u), like any other quantity in a Bayesian model, has a distribution, and it seems more useful to consider the entire distribution, with the expectation being of interest only when the distribution isn’t too wide.

So, I certainly think the quantity seems like a useful idea, but I am less convinced of it having a well defined useful single value through the expectation operator.
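The worry can be made concrete with a small Monte Carlo sketch (the priors and the product form of the NIE here are invented for illustration): under vague priors, the implied prior distribution of the NIE dwarfs its mean, so reporting the mean alone is nearly meaningless pre-data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical vague priors on two pathway coefficients whose product
# determines the NIE in this toy model:
alpha = rng.normal(0, 100, n)      # x -> mediator
beta = rng.normal(0, 100, n)       # mediator -> y
nie_prior = alpha * beta           # implied prior draws of the NIE

mean = nie_prior.mean()
lo, hi = np.percentile(nie_prior, [2.5, 97.5])
print(f"prior mean NIE: {mean:.1f}")
print(f"prior 95% interval: ({lo:.0f}, {hi:.0f})")
# The interval is vastly wider than the mean: the single expected value
# says almost nothing about what the post-data NIE will turn out to be.
```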

]]>No, in general I’m not aware of Pearl’s papers, I’m sure I probably should be. But statements like “We will see that the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x; rather, it conveys causal information that is orthogonal to the statistical properties of x and y” indicate at least imprecision in the conception of a conditional distribution, as does “I did not realize though that their philosophy has bloomed into a methodological movement that is more than just a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak.”

And people who call themselves Bayesian are, as is seen on this blog, not all necessarily in agreement about these topics either. So it seems possible that the explicit discussion will provoke some understanding, and if not between me and Pearl then perhaps for others such as James Savage or yourself or whoever else.

It seems very obvious to me that if a structural equation says Y = a + b*x + Err(c) and it is interpreted causally then one way to turn this into words is:

“If I knew the right values of a, b, c, and you set the value of X to X1, then when you measure Y, you will find that Y – a – b*X1 is a value within the high-probability region of the distribution I assumed for Err(c).” That is, the distribution for Err is itself a part of the causal assumptions, a part of the structural equations, a description of how precise those equations are. It is fundamentally the case that the SEM together with the observed X alters our state of knowledge about where Y will be, because we believe that the structural equation is causal even if not super-precise.

whereas another way to turn this into words is:

“If you observe an enormous number of samples, then Y – a – b*x will, in this observed set of samples, have the frequency distribution defined by Err(c).” This is a fundamentally different statement about the world: it does not apply at the individual level, does not require any causality in a particular instance, and is the “orthogonality” that I believe Pearl refers to in the quote above.

Both of these have been called “probability”, but the first one is ONLY admissible in a Cox/Jaynes sense.

]]>I don’t think there is a fundamental disagreement, given that you say that the causal reasoning happens outside of the Bayesian framework. This is of course consistent with Jaynes, who insists on the fact that inference is concerned with logical connections, which may or may not correspond to causal physical influences.

]]>Judea, thanks for your paper. Having read the initial 5 pages (I plan to continue through to the end), I see even further agreement between what you discuss in terms of SEM and my own personal opinion on modeling. I of course can’t speak for Andrew, nor for all others here. But discussions here indicate to me that many of the frequent commenters have similar views, and also that many have alternative views, so this is a blog where we hash out some ideas. Furthermore, in my discussions with people who are not professional mathematical modelers, but who are intellectually rigorous professional scientists (mostly biologists), I see an ability to design experiments that inherently implies an understanding of the important issues we’ve been discussing. I believe your SEM approach, my method of building mathematical models, and the practice of my scientific but non-mathematical colleagues (biologists, for example) all reflect similar views.

1) There is a difference between y = a + b*x + err (I see this in the data) and y = a + b*x + err (and if I can intervene and set the values of x I will observe the relationship continues to hold)

2) Some things confound each other in that they operate at the same time and have related effects on the outcomes, and it is then useful to design experiments where the different effects can be teased apart by changing the different variables independently or setting the various variables simultaneously to various random values or designing experiments where the effect of some particular “pathway” or “cause” has been neutralized, amplified, or otherwise altered through intervention. For example my Biologist colleagues will propose to add a drug which binds to a receptor to test whether without that receptor active, a different biological pathway will continue to induce an observed effect…

Based on what you say, here is where I think we have important differences in understanding that may hinder your communication with this blog’s readership.

In your comment you say:

“I did not realize though that their philosophy has bloomed into a methodological movement that is more than just a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak.”

And I think this is an important and HUGE misunderstanding on your part, because insights into what probability means to Cox or Jaynes (or me or several others on this blog such as Corey, or “Laplace”) help to distinguish between different kinds of “statistical” thinking. One kind (Cox/Jaynes Bayes) is strongly associated with people who start with what you call “Structural Equations” with causal interpretations, and who then need to find out the numerical value of certain quantities in that equation. The other kind of statistical thinking is associated with people who take the world to be “as if” a random number generator generates outcomes, and this thinking is associated with “testing” whether a distribution fits the data. There are other groups of people, many of whom simply continue along doing what they were taught without having strong methodological beliefs. For the moment, let’s just contrast the two strong methodologies.

Now, what I’d like to suggest is that Cox/Jaynes do not simply give “a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak” but rather, a free license to assign numbers that describe a degree of plausibility to *any numerical quantity*. In particular, that would include OUTCOMES.

Now, this differs from the alternative conception where distributions can *only* be assigned to the outcomes, and where inherently there is a replacement of causality with “as if from a random number generator”.

Why is this important? You in the past have said you prefer to avoid this discussion. I think you should not if you want to understand this blog, and I think you should welcome this point because I think it is strongly parallel to the things you say in your paper linked above about SEM.

If I tell you that when I measure something, say people’s pulse, it will be a random variable with a certain gamma distribution and this is the actual long term frequency of those different measurements… you and I should immediately ask “what is the cause of this thing having such a precisely defined long term frequency? Certainly that could not occur unless some physical laws were in place to enforce it!”

Now, if I tell you that “if you go and measure 500 people’s pulses, the most information I have about them is that each will individually be somewhere in the high-probability region of a given gamma distribution, and where the gamma has higher probability density is where I think things are more plausibly going to be found”, then you and I need not ask for some physical causes that make that gamma histogram happen in the real world. In fact, if we measure 500 pulses and they are all, say, 55-65 beats per minute, and the gamma distribution I specified has its high-probability region between 40 and 150, clearly our data do not have a gamma histogram (there were no measurements between 40 and 55 or between 65 and 150!), yet the assertion “they will all be within the high-probability region of the gamma distribution” DID hold. By this criterion, the Cox/Jaynes Bayesian says “see, my predictions held!” and the Frequentist says “see, your predictions were utterly false, p < 0.0000000000033!”
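This pulse example can be checked numerically; the particular gamma shape and scale and the fabricated data below are mine, chosen so the high-probability region comes out near the 40-150 range mentioned in the paragraph:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Fabricated data: 500 pulses, all between 55 and 65 beats per minute.
pulses = rng.uniform(55, 65, 500)

# Assigned plausibility distribution: a gamma whose central 99% region
# spans roughly 40-150 bpm (shape and scale invented for the sketch).
gam = stats.gamma(a=15, scale=85 / 15)    # mean 85, sd about 22
lo, hi = gam.ppf(0.005), gam.ppf(0.995)

# Cox/Jaynes check: every measurement lies in the high-plausibility region.
inside = np.all((pulses > lo) & (pulses < hi))
print(f"region: ({lo:.0f}, {hi:.0f}), all inside: {inside}")

# Frequentist check: does the histogram match the gamma? It does not.
ks = stats.kstest(pulses, gam.cdf)
print("KS p-value:", ks.pvalue)           # astronomically small
```

Both checks run on the same data and the same assigned distribution; they disagree because they ask different questions of it.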

It’s that disconnect between “the histogram of the actual data” and “the information I have about what the measurements are likely to be” which MAKES IT POSSIBLE to do causal modeling with probability to represent uncertainty. For, if I insist on the matching of my causal model to the histogram of the data…. then I must search for causal models that produce that histogram! But, if instead I *start* with a causal model, and I acknowledge its lack of precision, and I measure the imprecision allowable according to relatively more or less plausible alternatives in the region of the causal prediction… then I can be very happy with a causal model that fails to meet the histogram of outcomes, provided that it DOES put the predictions in the high probability region of the assigned distribution.

It is *this* rather than the “license to put a prior” which distinguishes, because the Cox/Jaynes version involving “plausibility” is perfectly compatible with a causal interpretation of the equations regardless of what the equations predict for long term frequencies, and the “Frequentist” interpretation of outcomes “as if from a random number generator, filling up the predicted histogram” is ONLY compatible with the subset of Structural Equations which have causal mechanisms that ENFORCE that frequency histogram.

And so, you will find a group of people who work with algebraic equations that describe CAUSES such as “the water gets warmer and then the plant life changes, and also the metabolism of fish changes, and these things cause some fish to die and other fish to grow more quickly” and they will inevitably NEED to wind up using probability in the manner that Cox’s axioms describe, and that will include *assigning* distributions over outcomes which will NOT be the observed histogram of anything, but instead whose shape itself describes a kind of plausibility for a causal outcome in a certain region given the imprecision of the model.

So, on page 4 of your paper when you say “We will see that the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x; rather, it conveys causal information that is orthogonal to the statistical properties of x and y”

I think the root of confusion about why people at this blog can talk about “the conditional distribution of y given x” and “causality” at the same time is that *the distribution they are talking about is NOT the observed histogram of Y when X is held constant* but rather *the precision of the information that the causal model for Y can give us after we’ve seen the actual value of X*

I hope you can tell me that you see this distinction, and that you agree that the distinction is useful and that it helps to separate *the observed statistical properties of many measurements* from *the plausible values for an outcome that our causal model would predict given that it can’t predict everything with perfect precision*.

And then, I hope you and I can come to an agreement that with this conception, it is possible to talk about an SEM providing us with “a conditional distribution (of plausibilities) for y given x” and also “x causes y” as being no longer orthogonal and there is less confusion between you and those of us on this blog who have this Cox/Jaynes view !

]]>Niel Girdhar,

Good question.

The answer is that if we can separate Gender from “Set Gender” we are OK: we can regard “set gender” as a new variable, and what I called “freezing” may be interpreted as “holding” — no problem.

The reason I said that we need a new operator, “freezing”, was to cover situations where we do not have this facility to change gender perception without changing gender itself. If we only have the three variables X, M and Y, and we are allowed to apply do-operators only to these three, we cannot identify the NIE (unless we have non-confounding assumptions).

Good question.

Judea