Thinking fast, slow, and not at all: System 3 jumps the shark

By now, we’re all familiar with the three modes of thought. From Wikipedia:

System 1 is fast, instinctive and emotional.

System 2 is slower, more deliberative, and more logical.

System 3 is when you say things that sound good but make no sense.

System 3 can get activated when you trust what someone tells you rather than figuring it out yourself.

I thought about this after someone pointed me to this post by Rachael Meager, who flagged this erroneous claim in the new book, Noise, by Daniel Kahneman, Olivier Sibony, and Cass Sunstein:

We must, however, remember that while correlation does not imply causation, causation does imply correlation. Where there is a causal link, we should find a correlation. If you find no correlation between age and shoe size among adults, then you can safely conclude that after the end of adolescence, age does not make feet grow larger and that you have to look elsewhere for the causes of differences in shoe size. In short, wherever there is causality, there is correlation.

As Rachael points out, “this is not a case of experts simplifying a claim for a lay audience. This claim is just outright incorrect.”

It’s an interesting formulation when someone says, “We must remember X,” where X is a false statement. What is it exactly that we’re supposed to remember??

Rachael gives an example where there is causation but no correlation: “Imagine driving a car, reaching a hill and pumping the gas as you begin to go up so that your speed is constant. The correlation between pressing on the gas and the speed of the car is zero but they’re obviously causally related, it’s that the agent is optimizing speed!”

Strictly speaking, if your speed is constant, the correlation is not zero, it’s undefined. But, once you allow the speed to vary, you can get the correlation between speed and the position of the accelerator pedal to be positive, negative, or zero, even though in all cases pushing the accelerator makes the car go faster.
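
To make this concrete, here’s a quick simulation sketch, a toy model of my own with made-up numbers (nothing from the book or from Rachael’s post). The pedal has a causal effect of +1 on speed in every scenario; all that changes is how aggressively the driver compensates for the hill:

# Toy model: speed = pedal - hill + noise, so pushing the pedal always
# makes the car go faster. The driver sets the pedal in response to the
# hill, with compensation factor k.
set.seed(1)
n <- 10000
hill <- rnorm(n)
sim <- function(k) {
  pedal <- k * hill + rnorm(n, sd = 0.1)     # driver's response to the hill
  speed <- pedal - hill + rnorm(n, sd = 0.1) # causal effect of pedal is +1
  cor(pedal, speed)
}
sim(1.5)  # overcompensating driver: positive correlation
sim(1.0)  # exactly compensating driver: correlation near zero
sim(0.5)  # undercompensating driver: negative correlation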

You can also get causation without correlation from a non-monotonic relationship or from plain old selection bias. So let me just emphasize that Rachael’s example is fine and there are a zillion others too. Causation and correlation are different things; it’s just not true to say that one implies the other.

“Why did we think they could get that one right?”

The question is, how could the authors of this book have made such a clear mistake?

To answer this question, we can turn to Cass Sunstein, one of the authors, who in an interview about the book says:

When a forecaster is wrong, we think, “Why did they make that mistake?” The better question is: “Why did we think they could get that one right?”

Well put.

The authors of this new book are a psychologist, a law professor, and some dude who describes himself as “a professor, writer and keynote speaker specializing in the quality of strategic thinking and the design of decision processes.” Between them, there’s no reason to think they’d have any particular expertise in correlation, causation, or statistics. You might as well ask me to have an opinion on the non-accelerating inflation rate of unemployment or the theory of operant conditioning. If I were to write a book and include categorical statements about such things, I’d check with the experts first. The relevant skill for Kahneman here was not to be an expert on statistics or econometrics but rather to realize that his coauthors are not experts either. A Washington Post reviewer called the authors an “all-star team,” but you wouldn’t want a baseball all-star team to play basketball (unless it included, I dunno, Jackie Robinson, Michael Jordan, and Danny Ainge), and I don’t know that you’d want a psychology/biz-school/law-school all-star team to be playing statistics. Again, though, maybe this is part of the problem. These guys get too much deference, more than is good for you. In sports, you ultimately have to face the music. In celebrity academia, once you’re high enough in the stratosphere, you can stay afloat forever.

Feynman famously said, “The first principle is that you must not fool yourself.” The second principle, I guess, is not to get fooled by others. Here’s what I’m guessing happened with this book:

1. One of the two other authors, Sunstein or Sibony, fooled himself into thinking he understood correlation and causation. I guess that if you’re a smooth enough writer you can use words to paper over any conceptual difficulties.

2. Kahneman was fooled into thinking his coauthors knew what they were talking about. [above lines crossed out because they were uninformed speculation about which authors wrote which parts of the book. — ed.]

So I think the answer to Sunstein’s question, “Why did we think they could get that one right?”, is that, like that famously well-dressed emperor, they were surrounded by yes-men. And remember that Sunstein’s earlier reaction to being questioned was to liken the skeptics to the former East German secret police. Take someone who gets too much positive feedback, and who actively resists negative feedback, and that’s a recipe for overconfidence, which is, ironically, one of the biases that Kahneman discussed in his earlier book.

The chain of trust

We discussed this general issue a few years ago in the context of the unstable mix of skepticism and trust that was characteristic of the Freakonomics franchise. The skepticism came because one of the main themes of Freakonomics was how everything you thought was right, was wrong: drunk walking is worse than drunk driving, global cooling rather than global warming, etc. The trust came because, after their first book, which was mostly based on author Levitt’s research, the Freakonomics franchise pretty much ran out of original research and was reduced to promoting the work of Levitt’s friends and various randos on the internet.

Something similar seems to have happened with Kahneman. His first book was all about his own research, which in turn was full of skepticism about simple models of human cognition and decision making. But he left it all on the table in that book, so now he’s writing about other people’s work, which requires trusting his coauthors. I think some of that trust was misplaced.

The question then arises, how is it that luminaries such as Philip Tetlock, Max Bazerman, Robert Cialdini, Rita McGrath, Annie Duke, Angela Duckworth, Adam Grant, Jonathan Haidt, Steven Levitt, and Esther Duflo thought this book was so brilliant, essential, masterful, eye-opening, important, etc.? (I took these from the blurbs on the book’s Amazon page.) The simplest answer to this question is that the book really is wonderful, it just has this one little mistake. Noise is indeed an important subject, and three authors who don’t understand correlation and causation can still write an excellent book on the topic. To return to our sports analogy, suppose a dream team of baseball experts were to write a general book about sports in which, as an aside, they say something like, “We must, however, remember that in football if it’s fourth down and you’re too far away to kick a field goal, you should always punt. Only fools go for it.” It could still be a brilliant, essential, masterful, eye-opening, important book that just happens to contain this one little mistake.

A new continent?

In the above-linked interview, Sunstein says a few other things that bother me. One bit is where he writes:

One [of the things] I learned in this [book] collaboration is not to think in terms of [for instance], “Will this stock go up”? “Is this the right investment strategy?” but instead to think: “What’s the probability that this stock will go up?” “What’s the probability that this is a good investment strategy?” So rather than asking, “Is it good to invest in international stocks [versus] domestic stocks?”, it’s better to ask, “What probability do you assign to the proposition that international stocks will outperform domestic stocks in 2022?”

I mean, sure, yeah, think probabilistically. But . . . he learned that just now, in the past five years writing this book? Jeez . . . he was pretty naive five years ago. This seems like a commonplace insight to me. Don’t financial advisers tell you this all the time? We can’t know the future, we can only guess?

I mean, really, what the hell?? I’m reminded of that scene in one of David Lodge’s books where the professors of English are sitting in a circle, playing a game where they take turns listing famous books that, embarrassingly, they’ve never read. And one of them lists Hamlet. A bit too embarrassing, it turns out! Similarly, it’s kind of admirable how open Sunstein is about his former cluelessness, but it makes you wonder whether he was really the most qualified person to write a book about a topic that lots of people know about, but which until five years ago he’d never thought about.

Also, just a minor point but I don’t think it’s quite right to ask questions like “What’s the probability that this stock will go up?” I mean, sure, you can ask the question just to check that your investment advisor is on the ball, but I don’t think investment advisors should be thinking of the stock price going up or down as a binary outcome. The investment advisor should be thinking of things like expectation and tail risk. Anyway, not a big deal but perhaps revealing of Sunstein’s continuing discomfort with the concept of noise.

What really bothered me, though, was when Sunstein said:

Unlike bias, noise isn’t intuitive, which is why we think we’ve discovered a new continent.

At first I thought this was weird because, who does this guy think he is, Christopher Columbus? Also everybody knows about noise. I can’t expect Sunstein et al. to have heard of W. E. Deming and the quality control revolution, but he’s heard of Fischer Black, right? Right?

What kind of new continent is this?

But then I realized that Sunstein kinda is like Columbus, in that he’s an ignorant guy who sails off to a faraway land, a country that’s already full of people, and then he goes back and declares he’s discovered the place.

I guess in response he might say that Deming, Black, and a few zillion other statisticians and economists might know all about noise, but we haven’t conveyed it well to policymakers or the general public. And that could well be. I could well believe that, despite all our best efforts, we haven’t educated the world on the importance of noise, and so Sunstein et al. are doing the world a great service by expressing these ideas in a readable way. Fischer Black published his paper in the Journal of Finance. A lot more people will read a popular book than will ever read anything in the Journal of Finance. Deming wrote lots and lots, but I guess he’s mostly forgotten by now. So, again, a new book could be worthwhile.

To be fair, I guess we should interpret Sunstein’s “new continent” remark in a direct analogy to Columbus. When Sunstein says he discovered something, he’s saying that he recently became aware of something that was already well known, just not known in his social circle.

Let me just conclude with the final word from Sunstein in his interview:

I would speculate that bull—-receptive people are likely to indulge their imaginations, and they might well be prone to being noisy. If someone is receptive to bull—- that’s in the form of seemingly profound or meaningful sentences that actually don’t mean anything at all, that’s a warning sign.

Well put. The guy’s got a way with words. The only thing I can’t figure out is why he says that “bull—- receptivity is not a positive thing.” It’s been good for him, no?

P.S. The question comes up in this sort of post: Why bother? Why should I care? I’m not completely sure, but one thing that bothers me about the nudgelords is that they’re going around telling everybody else what to do—or, more precisely, advertising their services to world leaders who can use their techniques to nudge us into doing stuff that they, the leaders, want us to do—but they don’t have their own house in order. It creeps me out that these people always seem to think they’re gonna be the nudgers and never the nudgees. Further discussion along these lines here.

I’m sure that my above post is unfair in the sense that these three people spent several years working hard on a book, and I’m basing my entire reaction on some combination of the title, a technical error that someone found, and an interview where one of the authors was maybe a bit too relaxed. These three pieces of information are in no way a summary of the actual book! Not even close. I’m bothered because I fear that these renowned authors may get a bit too much deference in book reviews (as in the above-noted blurbs) and I think we should be careful about that. I’m not sure we should take too seriously the musings on statistical noise of three authors who don’t have a firm grip on correlation and causation, and it seems that at least one of the authors had never thought about the topic until very recently. Hence this post. But the book could have great material. You can be the judge. If anything, this post might help sell a few copies!

P.P.S. The authors of Noise wrote a bit about their book in a recent op-ed. I think that article was mostly reasonable, but I’d prefer to use the term “variation” rather than “noise.” They say that it’s a bad thing that different judges, given the same facts, give different sentences. But different voters, given the same facts, choose different candidates, and we’re used to that! So I resist their implication that variation is a liability.

P.P.P.S. Some people emailed me about this post so let me briefly clarify. My concern about the Noise book is that, from the information I’ve seen so far, the authors don’t seem to know so much about the statistical aspects of what they’re writing about. There’s the above-mentioned error of correlation and causation (and let me emphasize that this is a general error, not dependent on the specifics of the particular example that Rachael happened to bring up), also one of the authors saying that the topic was pretty much entirely new to him. Statistics isn’t the only part of noise, but I’m a statistician so I’ll focus on that issue.

P.P.P.P.S. Kahneman replies in comments below, and there are several responses. His work has been very influential in my thinking, both about statistics and about science, and I appreciate his taking the time to comment. I think these sorts of comment threads are helpful in allowing us all to work out areas of disagreement or miscommunication.

P.P.P.P.P.S. I checked the index of the book. No mention of Fischer Black or W. E. Deming. If you’re going to explore a new continent, it can help to have a local guide who can show you the territory.

145 thoughts on “Thinking fast, slow, and not at all: System 3 jumps the shark”

  1. If you use “correlation” to mean “P(A,B) != P(A)P(B)”, which people sometimes do, and you make the common “faithfulness” assumption people use in causal inference, then isn’t the claim “causation implies correlation” true?

    These aren’t unreasonable things to do, and it seems like in introductory material, faithfulness is often handwaved away as a kind of technical assumption that doesn’t deserve too much attention. So maybe that’s how the authors got the wrong idea.

    The claims as written in the book seem to be very broad, confident, and incorrect, but perhaps “causation implies correlation” plus a footnote explaining what is meant by correlation, and that the circumstances when causation won’t result in correlation are somewhat special, could be reasonable.

    • Cs:

      But if you use “correlation” to mean “P(A,B) != P(A)P(B),” then in the real world pretty much everything is correlated with everything else, and the statement is empty. The point of examples such as Rachael’s is that you can have an observed correlation of +0.2, 0, or -0.2, say, while in each case there is causation. So, first, causation doesn’t imply correlation, and, second, sure, we can say that some correlation is always there, but in that case we can say that the sign of the correlation is not necessarily the same as the sign of the causation.

    • cs student: I find it quite awkward to talk about correlation/independence as they are all quantities averaged over a fixed population—in causal inference there are at least two populations. It seems your argument here does not fully rescue the claim: suppose the data are generated by y = z – x + other variables, in which z is the treatment and x is one pre-treatment variable, and it also happens that x = z. Then in the observed population, y and z are uncorrelated and also independent, despite the actual treatment effect being 1. Maybe a further rescue you implicitly suggest is to use conditional independence rather than independence or correlation. But it is still wrong: indeed in this example, y is independent of z given x.
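
      In R, a minimal version of this cancellation (toy numbers; the structural coefficient on z is +1 by construction):

      z <- rbinom(10000, 1, 0.5)
      x <- z                      # pre-treatment variable that happens to equal z
      y <- z - x + rnorm(10000)   # structural effect of z on y is +1
      cor(y, z)                   # ~0: the effect is perfectly cancelled by x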

      • If I understand you correctly, and if I understand faithfulness correctly, this situation is what the faithfulness assumption is meant to rule out — where you have variables or combinations of them that perfectly cancel each other.

        • I think I agree about the first point, you won’t actually see things cancel out like that (big exception: if there’s a control system or something like the example in the post). Which is why it seems to me that the faithfulness assumption is usually pretty safe.

          The second point seems correct though, I can see that asking for actual independence to say things “aren’t correlated” is too stringent in practice.

        • Not to criticize the distinctions being made, as that is clearly outside my expertise, but I want to point out that in practical (vs. academic/systemic) situations there are much more salient contributing factors to outcomes which need to be taken into account. Namely: there are multiple contributing factors (not all of them known), and the time sequence of these contributions can be crucial. The second point is the one I will elaborate on, and it is based on actual experiences a long time ago as a young engineer, experiences which left an enduring skepticism.
          The situations involved technical investigations of accidents or near accidents in industrial settings which presented inherent serious safety risks to life and property. In enough cases it was undeniable that had the sequence of events in the immediate (reasonably) timeframe been different, the outcome would have been totally different. If the timeframe were extended beyond the immediate, even more factors could be added, but not usefully. The exercise was basically a very detailed review of the process design and operating procedures. I found playing what-if scenarios very useful, and that often revealed the time dependence. Noise or chance, of course, was usually behind this.
          These events and studies carried real and immediate consequences: human injury, material loss, serious penalties (government safety regulations), career hits, etc. No attempt was made to find systemic causes. The usual adopted solution was increased automation safeguards, even though human error and chance often appeared to be dominant. The decisions were unavoidably subjective, as the “solvable” potential contributing factors were significantly (and I suspect knowingly) overestimated. As an example of time-sequence dependence in general, consider the development of any machine where any change would cause failure.

      • Here’s a worked example on this topic by Master Shalizi: http://bactra.org/weblog/1178.html

        In this instance of two effects cancelling each other out, where treatment is anti-correlated with symptom, only the injection of noise saves the above statement (oh, the irony). I.e. it is the “sloppiness” in the treatment, as Shalizi puts it, that allows one to recover the true relationship between the treatment and response variables naively.

    • Yes, you are correct cs student. The faithfulness assumption is commonly made and meant to rule out special cases of things perfectly canceling out either by chance or design. It can be shown that perfect canceling out by chance is an event of measure 0 (as Andrew says, it basically doesn’t happen). And canceling out by design will also rarely occur. So the statement in the book is really not that bad, just missing a minor caveat.

      • Z:

        I strongly disagree with your statement that “the statement in the book is really not that bad, just missing a minor caveat.” All correlations are nonzero, fine. But if the only point of “causation does imply correlation” is that correlations are not exactly zero, that’s an empty statement! For it to have any applicability, this statement has to imply something about the sign of the correlation or its magnitude. As illustrated by many many examples (including Rachael’s), the sign can go in the wrong direction and the magnitude can be way off, even in a setting where there are clear causal effects.

        In his comment below, Daniel Kahneman discusses a prediction problem where R-squared is low, which allows us to conclude that we have no great explanation for the phenomenon at hand for this subset of cases. That sort of predictive statement can be useful. It can be helpful to know that something can only be predicted to some level of accuracy. We can only predict NFL games to an accuracy of about two touchdowns (that is, the standard deviation of the observed score differential, conditional on our best predictions, is about 14 points). This tells us something about football. I don’t see that anything is added to these examples by appending a false (or, if true, vacuous) statement about causation and correlation.

        I don’t think Kahneman et al. were malicious in making this statement; I just think it’s kind of sloppy and just confuses an otherwise clear statement about prediction.

      • I think these exchanges can come to seem personal when someone like Sunstein is criticized for errors in reasoning, but over time fails to improve his practice (and instead attacks the critics), and continues to make similar kinds of errors. Frustrated critics subsequently express that frustration, and it seems like personal criticism or ad hominem. That may be a fair characterization of the New Republic book review. Or maybe not, I only read the first couple paragraphs.

    • Lori:

      I don’t think it helped for them to use an empty example like age and shoe size. It would be better to explore these issues in a policy example. I’m assuming they did so, elsewhere in the book, in which case the above-cited paragraph is just a bit of noise in the book that the reader has to ignore on the way to the real stuff. They just shouldn’t have said, “We must, however, remember…”, as this could mislead the reader into thinking they’re saying something true and important.

      • I agree they shouldn’t have said it with such confidence and that it’s misleading. In fact the statement would, hopefully, be marked as incorrect in a stats course. I haven’t read the book, but probably will at some point, so I’ll have to wait to see if it’s noise before getting on to their ideas behind noise. Seriously! Is causal inference really this difficult? I guess so.

        • I guess it’s easy to make mistakes in a long book. The toughest mistakes are the ones that you don’t realize are tricky points. I’m guessing the authors just wrote those paragraphs without thinking much about it, just thinking these were obvious statements that nobody could disagree with. Kinda like what would happen if I tried to write a book about macroeconomics or something.

  2. Their statement is clearly wrong. I wonder, though: if we randomly pick 100 causally related events, would their description fit 99 of them? I guess quantifying this is impossible, unless one somehow specifies what kind of events one has in mind.

    • Carsten:

      I’m not sure. A similar question came up a few years ago, when someone pointed me to a claim that “Causation is correlated with correlation,” which was a kind of ecological statement about correlations and causations in the wild. I replied with the dictum, “correlation does not even imply correlation”: that is, correlation in the data you happen to have (even if it happens to be “statistically significant”) does not necessarily imply correlation in the population of interest.

      So I guess it depends on what you look at. One difficulty with arguments about correlations and causations in the wild is that most examples in the wild are easy. Drop a coyote and an anvil off a cliff and they will both fall down. But the tough cases, the ones where you need to bring in the experts, they’re not a random sample of cases in the wild. They’re tough for a reason. So I’m just not sure. It could well be that the book in question discusses real-world topics in a reasonable way even while bungling the general question.

        • John:

          From what I’ve seen, the coyote typically runs off the cliff of its own accord, or else stands on the very edge on a protruding bit of stone that is too weak to support its weight.

      • That is a fair point, that most life is straightforward and doesn’t need statistics.

        I still wonder what they should have written, given the audience. When explaining how gravity works, one shouldn’t begin by explaining how things are different at the quantum level (if the audience doesn’t know about quantum mechanics). I’d rather have a reader who comes away with the information provided in that section of the book than a reader who knows nothing about the issue.

  3. Andrew,

    I worked on an early “noise” project with one of the authors shortly after I moved from academia to industry and I’m not surprised by this sort of gaffe. I don’t think you would see this stuff if Tversky were still around.

    For some additional context, I think we had you out to give a talk around that time, but I missed it because I was on paternity leave or something.

  4. What struck me about that paragraph (which I’ve read about, but not read the whole chapter or book) is that it immediately reminded me of the statements in undergraduate research methods books about causation which often include a formulation like this. E.g. https://www.sagepub.com/sites/default/files/upm-binaries/23639_Chapter_5___Causation_and_Experimental_Design.pdf
    Which is basically about experimental design. Obviously there does not have to be a bivariate correlation (consider Simpson’s Paradox, etc.). But unfortunately this is how it gets interpreted, and I always cringe when I see it because it’s a misleading way to state things.

  5. I also thought of the ‘faithfulness’ assumption, but
    (a) Pearl makes it clear that it’s an assumption (although I think it’s stronger than he does); and,
    (b) the faithfulness assumption is for the complete causal DAG, so in Rachael’s example it would involve conditioning on the hill, and it would plausibly be true

    • I would find it interesting if you could expand more on this second point. I don’t quite grasp what you mean by it and would like to understand this stuff better.

    • Thomas:

      The only trouble with this sort of sophisticated discussion is that it can give a casual reader the impression that the statement in Noise about correlation and causation is basically correct and we’re just picking at technicalities. But actually that statement is ridiculously wrong and it takes a lot of technical arguments to even try to come up with a way that it could possibly be interpreted as being correct.

  6. I think there is some confusion here.

    If there are two conditions:

    1. Pressing the pedal at increasing pressure to increase the flow of gasoline to *maintain* speed.
    2. Pressing the pedal at a constant pressure which will result in a reduction of speed.

    There will be a correlation. That said, simply because one does not observe a correlation in a set of observational data does not mean there is no causation producing that data.

    • Curious:

      The car example is interesting for its own sake, but for the point of discussing that paragraph from Noise it’s kind of overkill. That statement about correlation and causation is wrong for so many reasons.

    • If you change the accelerator pedal to a metered valve and plot speed, resistance, and flow in real time, there will be some sort of correlation. But the point is that a standard accelerator pedal and standard gauges render the causation invisible, and in the real world many causes are hidden by similar indirect linkages.

    • Dl:

      Not really, because (a) there’s still the problem of non-monotonic effects and interactions, and (b) the implicit ceteris paribus thing would only work in balanced experiments, but they’re talking about observational data, where selection is a key problem; indeed that’s a big reason why we say “correlation does not imply causation” in the first place!

  7. The residuals in a linear regression “cause” the observed values (by definition) but are uncorrelated with them (by construction).

    This seems much the same as the car example.

    • I thought this was a very clever example, but then I realized it’s not true.
      cov(Y, Y-Yhat) = var(Y) - cov(Y, Yhat), and since in least squares cov(Y, Yhat) = var(Yhat), this equals var(Y) - var(Yhat) = var(residuals), which is not generally 0 (it’s zero only when the fit is perfect).

      You can also try it in R, e.g.:

      > x <- rnorm(100)
      > y <- x + rnorm(100)
      > cor(y, resid(lm(y~x)))

      Maybe what you mean is that X causes the residuals, but is uncorrelated with them?
      Or am I missing something here?

  8. p.s. Telling the car story in terms of the residuals (the accelerations) from a regression makes it clearer that this isn’t actually a kosher example — we seek causal explanations in terms of a finite (or small relative to the amount of data) number of parameters, not something with as many degrees of freedom as the thing caused.

  9. I suspect that for a lot of people causation implicitly means, for lack of a better term, “monotonic folk causation”, where (the distribution of) some thingee being “caused” is an increasing function of each of several variables X1, X2, … Xn, thought of as its causes. There you will indeed always see a positive correlation between any of the causing variables, or monotone functions of the variables, and the thingee. This is broader than it sounds, because reasonable non-monotonic models can be cast in this form by adding variables.

    And the informal meaning of “correlation” is non-independence of random variables, not a nonzero correlation coefficient. The standard counterexamples like O- or X-shaped data don’t falsify that interpretation. Rachael’s car story does, but at the cost of allowing maximally non-explanatory (i.e., fully general nonparametric) “causes,” which again violates the implicit assumptions of the folk thinking.

    So unless there are more natural counterexamples, Kahneman and company can say that this is a game of gotcha played among academics rather than a serious problem. Of course Kahneman is one of the worst offenders at that game, with Linda the bank teller and the Hot Hand and similar fake errors that come from bogus imposition (by Kahneman et al.) of precise formal models that ignore the assumptions in the folk thinking.

    • Mr:

      I don’t think this works as a save, because these monotonic relations also can be destroyed by selection bias without any need for anything as complicated as Rachael’s feedback mechanism and time process. To put it another way: they do admit that correlation doesn’t imply causation, and a big reason for that is selection bias—or, to put it another way, unmodeled differences between groups. Indeed, “correlation does not imply causation” could in a certain way be taken as a result of “noise” in the sense that these spurious relationships can occur in data as a result of variation that has not been accounted for in one’s conceptual model. This same sort of thing can lead to causation not implying correlation! So I think that even if you want to represent their statement as a “folk” saying, it still doesn’t work! Just think of any example of the following form: “In observational data it looked like the treatment had no effect, but actually the people in the treatment group were older than the people in the control group.” Or, for a policy example, intervention X lowers crime. But, in the data, intervention X is often performed during periods of high crime, and it turns out there’s no relationship between the presence of the intervention and the crime rate a year later. This last example illustrates how fragile this correlation thing is: change how it’s measured (for example, change whether you measure the level, the first derivative, or the second derivative) and you can make the correlation go positive, negative, or zero, all with the same underlying causal effect. I think the naive statement that causation implies correlation is flat-out dangerous here, and the folk meaning doesn’t work either, if the correlation can be zero or go in the wrong direction.
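
      Here’s a minimal simulation sketch of that last crime example (toy numbers of my own): the intervention cuts crime by 10 points, but because it is deployed where crime is high, the raw correlation between intervention and later crime comes out positive:

      set.seed(2)
      n <- 10000
      crime_before <- rnorm(n, 50, 10)
      x <- as.numeric(crime_before > 55)  # intervene where crime is high
      crime_after <- crime_before - 10 * x + rnorm(n, 0, 5)  # true effect: -10
      cor(x, crime_after)  # positive, even though the intervention cuts crime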

      • But that’s precisely the thing: the informal “correlation” doesn’t actually imply any direction, or strength. It’s essentially shorthand for “there’s SOME relation between the two seemingly coincidental events”. And if you *know* that there’s a causal link between them, the fact that they’re related is already proven, satisfying the condition for informal “correlation” (“causation” is a relation, therefore… you can’t have one without the other).

        I will agree that the *informalisation of such technical terms* is dangerous on principle – and we already know this, because the Internet is full of expositions twisting the meaning of words like “theory” to bend reality to particular viewpoints (typically detrimental to scientific understanding).

        I also agree that the way it’s used as an imprecise, broad term, is dangerous in that it can be used to explain literally *anything* while sounding passably versed in it to not arouse suspicion (which also has been done before). But such is the nature of informal discourse: much of it relies on context and understanding of the participants, which – contrary to the (ideal?) scientific setting – can vary quite wildly.

        I don’t know if this is what they meant to say, and if so – why the authors believed it a good idea in the first place… so there’s a possibility that this discussion is moot. Still.

        • Kadigan:

          I see your point. I’m still concerned that the statement in the book is becoming empty, something like, “When there is causation there will be correlation somewhere; you might not observe it in data but there will be correlation in some latent variables.” At that point, I’d agree, as this loops back to the definition of causal effect as a difference between outcomes under treatment and control conditions, so if you consider y^T and y^C as latent variables and if you imagine applying treatment or control to multiple identical people, you’ll get a correlation. But then the statement is really empty! I guess you’re right that it all depends on where they’re going with that statement and what real-world implications they’re planning to draw from it. I’m not encouraged by the authors’ follow-up statement:

          It follows that where there is causality, we should be able to predict—and correlation, the accuracy of this prediction, is a measure of how much causality we understand.

          It kind of sounds like they’re talking about R-squared here, but, as we know, R-squared is all about prediction and need not have anything to do with causation, as in all those famous examples of spurious correlations. Also, they explicitly say “we should be able to predict,” which means they’re talking about observables, not latent variables.

          I just think what they’re doing is . . . not “word salad,” exactly, but an attempt to make a (false) statistical point using rhetorical arguments. The math is not on their side so they’re trying to use the English language to slide by. I don’t think they’re trying to cheat in their reasoning here; what it looks like to me is that they’re trying to explain something to the reader that they don’t understand themselves.

          The comment-thread discussions about correlation and causation are interesting, though, and I agree with you that certain intuitions can be made to work out by considering latent variables. I actually think they would’ve been better off saying, “Correlation does imply causation; it’s just not always the causation you’re thinking of.” Then it would be more clear that (a) you have to think about latent variables to make any sense of these statements (hence it’s flat-out false to say “Where there is a causal link, we should find a correlation” unless by “find” you mean “build a theoretical model so that correlation appears”) and (b) you can’t assume that causation implies correlation among any particular measurements (hence their follow-up statement about prediction being a measure of causality is wrong). Too bad none of the blurbers caught this one when they were reading the manuscript!

  10. I think the issue there is that the sciences use words that have very rigorous meanings, while the general population uses them in a much more relaxed manner — to the point of losing much of their value, and causing issues like the one described.

    Exhibit A: “theory”; Exhibit B: “experiment”. When Mr John Doe says these words, they imply a lot less than when Dr Anon E. Mous does – that’s how you arrive at “Gravity is just a theory.” among other things. And, I’m thinking, “correlation” can be considered Exhibit C.

    I had to actually sit down and read what “correlation” means, because for a solid 20 minutes I couldn’t see what was wrong about the paragraph you quoted. Only after I realised that the pedestrian use of the word “correlation” simply means “a relationship of some sort” (and that you meant the more rigorous meaning) did I figure out what you were having an issue with.

    So, in the end, I think they used the “pedestrian” meaning here, perhaps entirely by accident. And I’m left to wonder: why did that communications error happen in a book that should’ve gotten it right (by virtue of being written, as I understand it, by statisticians, for statisticians)?

    • Kadigan:

      I agree with your general point, but in the specific example I don’t think their statement is correct even if it means “a relationship of some sort”; see some of the above discussion in comments for detail.

  11. I think most of this post is fair criticism of some embarrassing errors, but the take in the p.p.s. seems very wrong to me. The fact that judges given the same facts give different verdicts is not of the same nature as the fact that voters vote for different candidates based on the same information. Voters have legitimately different preferences and voting is about aggregating those preferences. The legal system is about implementing and executing existing law and rules. A judge is supposed to uphold the letter and the spirit of the law, not act on personal views. If the penalty a defendant receives depends on the judge he faces, or if two defendants get different punishments for the same crime, that is a grave injustice. (Far graver than some famous professors writing a shitty book, which is why I didn’t want to let this particular rhetorical flourish slide.)

      • It seems that Andrew might need to rebrand this newly discovered continent as System 4 (or higher), since the person in the link below was claiming System 3 in 2019, although maybe Andrew’s conception of System 3 really subsumes this one? But yeah, the System 3 or 4 or 40 gag is gold. Discovering new “systems” is probably a research program at this point, like discovering new biases was a few weeks ago. Those nice posters with all the known(?) biases on them aren’t going to just create themselves like some sort of Von Neumann universal constructor designed in a secret lab underneath the University of Chicago’s business school.

        http://www.hcdi.net/back-to-the-future-system-3/

    • As a lawyer, I would say that one possible principle of justice is treating each case on its own merits. Never (or, let’s be careful and say, rarely) does the same thing happen twice, except in the most mechanical of minor legal cases. And to add information on a partly unrelated aspect of legal instances of noise: I do not know much about Sunstein’s personal history, but law professors often have very little experience with trials, and with the practice of law in general.

  12. I share Andrew’s upset at the claim about correlation and causation, as well as the larger question about trust in these authors. But I think the analogies about expertise – but not in statistics – are off the point (and largely incorrect). What I think this example shows is that marketing triumphs over content. Once again. The example of causation implying correlation is just like their use of “noise” rather than “variation.” They are masters at marketing ideas. And it sells, not just to the lay public but to well-known smart people as well. That is disturbing.

    Here’s another example. I’ve read countless attempts to explain the difference between data analysts and data scientists. They all boil down to some variation of one looking at past data and the other looking at the future (predictive modeling rather than descriptive modeling). All of the explanations get kudos from a variety of readers. That is despite the fact that the statement makes no sense – describing for its own sake may make good academic careers, but is useless if it stops there. Predicting without description is worse – I’d call it witchcraft. But the distinction sounds good, and we all want to think we’ve learned something.

    Similarly, the correlation/causation statement “feels” like we’ve learned something – and learned it from world experts. I felt that way from reading “Thinking Fast and Slow” and myriad other works. I think we all do that, and the process of reading something and learning something is a healthy thing. The problem isn’t that what is learned is wrong – it is inadequate. There is some truth in the statement that causation implies correlation, but that truth only works in limited circumstances (hence the general statement is wrong). But let’s not deny that when someone reads that, they may, in fact, be improving their knowledge from what it was.

    What is missing is the critical thinking that causes people to go beyond the first statement – to understand the limitations and to see the more important points that correlation is a poor indicator of anything, that myriad pieces of evidence are required to make sense out of data, and that noise (or variability) is easily mistaken for something systematic (“Fooled by Randomness” anyone?). What disturbs me is that this critical thinking seems to be missing, and that people seem happy to settle for the initial good feeling that something sounds good. I believe we (humans here) have become very good consumers – of ideas, in this case.

    You may think I have wandered astray with this comment. But, for me, the most disturbing thing is that we are being trained to be consumers – of material goods and of knowledge. And, if we are good consumers, then the superstar producers get elite status and our devotion. Those same experts are so blinded by their stardom that they don’t bother to think more deeply themselves. The problem is not that they are not statisticians – it is that they have turned their backs on critical thinking because it gets in the way of their elite status. (I might do the same if I could be as successful as they are.)

    • In all fairness, Cass Sunstein has penned several insightful books that examine group processes. In particular, Wiser is one of Sunstein’s best and worth the read. Constitution of Many Minds & Laws of Fear were also excellent.

      I really don’t worry too much about experts’ personality or self-image as a criterion of their work cuz I have known assholes who are great thinkers. To do so could constitute, in some contexts, the fallacy of attribution. Nor do I have devotion to any expert based on his status. I tend to respect the ones who offer insights that I am not schooled in. I learn best that way.

      RE: ‘correlation’ implies ‘causation’

      I hypothesize that each of us has cultivated mundane and unique ways of interpreting what’s going on around us. I have learned to accept that others don’t think as I; and I don’t necessarily want to think like others, as some possibilities of expression. Sometimes I will be wrong. And actually I’m relieved when someone points out when I am.

  13. Not only is their statement wrong, it is dangerous. If a drug cures an infectious disease, the correlation between the prevalence of infection and death will be insignificant where the drug is used.

    That does not mean that the infection does not cause death.

    Any case where there are multiple factors affecting an outcome, and those factors are correlated within the sample, can result in observing no correlation or correlations with the wrong sign.

    • So as I mentioned above, in many research methods textbooks it is stated that to make a statement about causation three things must be true: the temporal ordering must be clear; there must be an “empirical correlation” (e.g. from an experiment or the slope in a regression); and that correlation must not be spurious. So if you want to know whether a disease causes death and whether the drug reduces death, you need variation in the use of the drug, or else your non-association of the disease with death is spurious. I think this is where they are coming from, and it really does not reflect the more complex ways people think about this.

      • I don’t want to be mean about it, but what I’m trying to say is that this sounds like something that someone would take away incorrectly from a lecture on this in an introductory methods class. But also, and again this reflects the poor writing: of course the idea is not about “r,” nor is it saying that in every set of data, no matter what the design, there will be an r meaningfully different from 0.

        Here’s an example of this https://conjointly.com/kb/designing-research-designs/

      • You’ve gotta be careful about what you mean when you say “correlation” and “not spurious.”

        As pointed out elsewhere in this thread, correlation in the statistical sense presumes a linear model. If X causes Y to be exactly Sin(X), and X is on a bounded interval, the true Corr(X, Y) can be zero depending on the interval boundaries (too lazy to calculate them), even though X completely determines Y with probability 1; see the numerical check after this comment. So, using the literal statistical definition of correlation, your criterion eliminates true causal relationships.

        An empirical relationship can be spurious in that

        1. It’s a finite-sample artifact
        2. It’s the result of pre-treatment differences, confounders, selection bias, or the like

        There are empirical relationships that are non-spurious in the sense of (1) but spurious in the sense of (2), and therefore do not reflect a causal relationship.
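
        A quick numerical version of the bounded-interval point (a sketch using cos instead of sin, since on a symmetric interval the cancellation is then exact by symmetry):

        x <- runif(100000, -pi, pi)
        y <- cos(x)   # y is a deterministic function of x
        cor(x, y)     # ~0: E[x] = 0 and x*cos(x) is odd, so cov(x, y) = 0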

  14. I’ve been on a camping trip and just got to this, but two things:

    1. Selection bias is *huge* in many realms. I’ve worked for decades, off and on, on occupational safety and health, and the “healthy worker effect” is a big deal. You can’t just look at cross-sectional data to see how a job affects someone’s health because there’s the initial selection of healthy people for physically demanding jobs. This is also big in child labor research: if you have four kids and need one of them to work, which one will you pick? If you don’t get this you really can’t do serious work in this field (although a lot of work is still oblivious, oh well……).

    2. I agree strongly with AG’s point about nudgers although I would broaden it beyond this specific policy frame. The entire field of rationality vs heuristics that Sunstein operates within is epistemically arrogant. Yes, people are fallible, but we are all people. The standard of rationality that folks like Sunstein assume as the benchmark is universally unattainable in the real world, and all of us rely on heuristics to steer our way through. Maybe the difference is that some people are more open to learning from experience than others or pay more attention or fine-tune their heuristics. I much prefer Gigerenzer’s approach.

    • Hi Peter,

      Re: The entire field of rationality vs heuristics that Sunstein operates within is epistemically arrogant.
      —–
      Selection bias is huge. As is Base Rate Neglect.

      Actually I think Cass Sunstein, in particular, has done more to diversify the foreign policy establishment than people may appreciate, by virtue of his recommendation that less senior employees and independent thinkers be afforded a say in decision making. Cass Sunstein’s explorations spring, in part, from Irving Janis’s scholarship as well, namely Groupthink. The sociology of expertise shapes each and every discipline and endeavor. So I welcome continuous examination of group dynamics b/c the best thinking is not necessarily a function of credentialing and a string of degrees.

      Someone with an unpopular analysis is shunted around and marginalized very quickly. So in this sense Cass Sunstein’s attention to this dynamic has paid off in some quarters. If an unpopular analysis is on target, wouldn’t you think it would be appreciated? Nahhhh! It is a peculiar feature of some decision venues. On the other hand, the extent of commodification of every aspect of our lives distorts our priorities and values. Artificial intelligence can hack every one of us through algorithms, and thus shape our preferences. That spooks me out quite a bit.

      In any case, once I finish the book, I will have more to say. BTW, I don’t always agree with Cass Sunstein and Danny Kahneman.

      I too respect Gigerenzer, although I am more at home with Kahneman and Tversky’s work.

      As a venue, Andrew’s blog allows folks like myself to learn from such excellent thinkers.

      Sorry if I am not as structured and logical in my writing today.

      • True, Sunstein has written quite a bit about group decision processes, although this is mostly tangential to the OP. My take on the availability cascade and related stuff is that it is mostly if not entirely a recapitulation of what social psychologists, org behavior and management thinkers had been saying for decades. I’ll be happy to be corrected if I’m wrong. Meanwhile, it’s interesting that Sunstein deliberately chooses examples he sees as reflecting irrational fears of environmental harms; he’s sort of a warrior against militant environmentalism. This has led, IMO, to a tendency to repeat industry claims that health and safety regulation has gone too far without entertaining the legitimacy of other perspectives. See for instance his invocation of the “alar scare,” where the jury is still out on how serious a threat this chemical is. (Wikipedia has a relatively up-to-date summary.) In general, he seems to regard much if not most of the modern environmental movement as the product of availability heuristics augmented by cascade effects. There are issues around who has funded much of this work, but we should just stick with the work itself…..

        • Peter:

          I don’t know, but my guess is that Sunstein is a liberal (in the U.S. political sense) and that when pushing against environmentalism he sees himself as doing it from a position of moral or political solidity. Hmmm, I don’t think I’m expressing this clearly . . . what I’m trying to say is that if he believes that deep down he’s an environmentalist, he can feel that whatever anti-environmental positions he takes are just minor course corrections. I had the same impression when he was supporting the death penalty a few years ago: he was a liberal, and so he could feel that taking a conservative position on this particular issue was evidence of his non-doctrinaire nature. I don’t really know what more to say about all this, but maybe it helps to have some impression of where he’s coming from.

        • Sunstein certainly has a public reputation as a liberal, and perhaps he takes that stance on some issues, but where I’ve crossed paths with him he’s rather conservative (in a decidedly pre-Gingrich, pre-Trump sense). Much of his critique of environmental policy was financed by the Olin Foundation, and as I said, wherever there is a well developed industry stand that differs from the green one, he just assumes the industry is correct. (Can anyone provide a counterexample?) You may remember the gnashing of teeth on the part of the environmental community when Obama picked him to head up OIRA. As for figuring out the guy, I think he thinks he stands for reasonableness against all the crazy currents swirling in our society. I guess that could be OK, except that he seems to associate activism with unreasonableness, and in a society with a lot of background corporate power, that’s a severe bias.

          One thing I wonder about is how much influence Sunstein had over Obama, not just 09-17 but in the years before when BO was picking up courses at the U of Chicago. And what sort of influence. Having spent some time with him, I recognize Sunstein’s intense intelligence, and I can imagine how someone who seeks this could be swept along.

        • I have taken to rereading several of Cass Sunstein’s books. It’s a habit that has helped me keep track of influences on me. It was chance that I picked up Kahneman & Tversky’s Judgment Under Uncertainty: Heuristics and Biases back when it came out. But I had begun to note that academics did engage in selection biases. I had no label to give the biases until I read their work. It was a bit disconcerting b/c many of them were not aware that they indulged in them. Sunstein’s and Kahneman/Tversky’s scholarship helped me tackle some foreign policy questions.

          Since I was just a tag-along with my Dad to various conferences and random lunches, I just observed and listened quietly, and occasionally voiced my viewpoint.

          Had I gone through the traditional path of academics, I may have been ensconced in common cognitive errors. Because I developed a habit of examining the categorization engaged in by academics, I was content to leave it at that without voicing my thought processes and attendant questions. So I never had to prove much. LOL

          My disagreement with Cass Sunstein and his like-minded colleagues has been over the GMO foods question. I lean to the proponents of non-GMO foods.

          However, several of his books are extremely useful for wider audiences: Wiser and Constitution of Many Minds.

          I tend to write off the cuff and without correction. Apologies if there are mistakes. A bit lazy sometimes.

    • I agree with your second point and I think that some of that “epistemic arrogance” you mention was already clear in the 1996 exchange in Psychological Review between Gigerenzer on one side and Kahneman and Tversky on the other. I once discussed this exchange with Kahneman and he seemed to suggest that people who took the “humans are irrational” message as the central idea in their work had misunderstood their message, that their position was ultimately not so different from the views of Herb Simon and Gigerenzer. Personally, I think this sounds a little like revisionist history, but maybe there are some early papers that support that account.

  15. “What it looks like to me is that they’re trying to explain something to the reader that they don’t understand themselves.”

    I’m not a statistician, just an interested layman. So I am dependent on what I read, and dependent on the experts I read being as expert as they say they are. But that line of Andrew’s jumped out at me, because of an example I read from the physics world.

    In “The Herbert Dingle Affair,” Luca Signorelli tells the story of an accomplished scientist who wrote lauded, best-selling popularizations of Einstein’s Relativity theory that nonetheless contained mistakes showing he had completely misunderstood it! Signorelli shows that Dingle was even making basic mathematical errors. But Dingle persisted in his views, and when finally convinced that they contradicted the real theory, he spent a lot of time and words arguing *against* Relativity.

    “A famous scientist going nuts is nothing new. It has happened to a lot of Nobel Prize winners…. Here, though, we face a different matter, and an even more chilling one: someone who is a supposed expert in a sector in which ‘peer reviews’ exist with all the accolades and the respectability that entails, who shows that he hasn’t understood a word of things that he’s been left to discuss for years.”

    https://planetofstorms.wordpress.com/2021/02/13/the-herbert-dingle-affair/

    • John:

      At first from your comment I was thinking that “The Herbert Dingle Affair” was a short story and that Luca Signorelli was an author in the spirit of Italo Calvino or Umberto Eco or Vladimir Nabokov. But then I followed the link and it seems that, even though “Herbert Dingle” sounds like a made-up name, he really was a real person!

  16. Read E. T. Jaynes's paper titled "The Gibbs Paradox", especially sections 5 and 6: https://www.damtp.cam.ac.uk/user/tong/statphys/jaynes.pdf

    He addresses what he calls 2nd-Law trickery related to the choice of variables.

    Jaynes was quite critical of "nitpickers" and laid theoretical foundations for probability. He is the author of Probability Theory: The Logic of Science: http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf

    • If you read a microbiology text, e.g. Microbiology of the Cell (a star), you will not find an explanation of why Gibbs free energy and the Hamiltonian equations differ. Why is that? If you (as a student) don't know this, is it that maybe you don't belong here, or that we do not want to discuss this?
      Does Sean Carroll's bizarre cream-in-coffee entropy explanation (in a popular physics book) bear on this?

      • It reminds me of a philosophical question in chess (which I think is a joke, although I'm not sure): as many people know, if the same position occurs three times in a game of chess, either player may declare the game to be a draw. In this case, the "same position" includes not just the positions of the pieces, but a few pieces of additional information: is it possible to make an en passant capture, and, for each player, is castling still available? For example, suppose the pieces are all in specific locations on the board, with black's king and rook unmoved on their initial squares; and then later black castles and ends up moving her king back to its original square and her rook back to its original square. For purposes of the "three repetitions" rule, these are different positions: in the first one, black could still castle, and in the second one she cannot.

        Anyway, the joke/question is: suppose a position occurs that has a white rook on a1 and a white rook on h1. Over some sequence of moves, white moves the h1 rook, then moves the a1 rook over to h1, and then a few moves later they move the rook that had originally been on h1 over to a1. All other pieces on the board are in the same places they were at the start of this question, and all of the other conditions for a repetition are met. Is this a repetition? That is, are the rooks to be treated as distinguishable or indistinguishable?

  17. Andrew

    I get the point about causation and correlation, but I don't think Rachael Meager's example is correct.
    If the speed of the car is constant, then all the forces acting on the car cancel out and the net (causal) force is 0.
    Moreover, as I learned from reading Sean Carroll, physicists don't like to talk about cause and effect with respect to physical phenomena.

    PK

    • Suppose a tiny alien comes to earth for a visit and slips into a car just as it is about to head up a hill. The alien sees that the car has a pedal and that the human's foot is on the pedal. "What is the purpose of the pedal?" the alien wonders. "Perhaps the car goes faster if one pushes the pedal farther towards the floor." Just then the human pushes the pedal farther towards the floor, but the car does not go faster. A minute later, with the car at the top of the hill, the human lets the pedal come back up off the floor, but again the speed of the car does not change. The alien, having read Noise, concludes that there is no causal relationship between the position of the pedal and the speed of the car, because causation implies correlation and there is no correlation between the speed of the car and the position of the pedal.

      • If by "causal relationship" you mean an algebraic state function, then that's actually TRUE, right? The causal connection is between the pedal position and the force applied at the wheel-ground interface.

        • The alien, if he were an experimentalist, might defer judgement until either [a] he discovered that fluctuations at some characteristic frequency in the throttle gave rise to fluctuations in velocity around some operating point; or [b] it appeared that they never do so, under any circumstance at all.

    • In the Humean->Rubin sense, causation is about counterfactuals. What would the car do if you stopped pressing on the pedal? It would slow down. So not depressing the gas causes a decrease in speed, and the inverse. The trick is that the counterfactual is never observed, so no correlation is observed. Hence, the search strategy proposed by Kahneman of "no correlation => no causation" fails.

      Even the modern Pearl formalism of causality in structural causal models is about a hypothetical "ideal treatment." If you don't define a hypothetical, causality isn't well defined. This tends to be tough for physical scientists because in that realm the world under study is supposed to be, at least theoretically, reducible to physical laws that are context-free and just are; but the human world requires a richer concept of causality, or you end up in endless circular arguments about "who really killed Princess Di."

  18. As with many discussions of causality, the answer you get can depend critically on exactly what question you are answering.

    There is a causal relationship between the position of the gas pedal and the speed of the car, in this sense: with the pedal at position A, you get a different speed than you would if the pedal were at position B. And yet this causal relationship does not necessarily lead to a correlation between the position of the gas pedal and the speed of the car. All I have done is restate Meager's example in a bit more detail.

    • In something like a Pearl causal network or whatever, there’s a notion of time. Something happens first and then a second thing happens second. In Newtonian mechanics this is only valid instantaneously. If instantaneously you adjust the position of the pedal, then in the next instant you will see a different increment of speed than you would have if you had left the pedal alone. In fact the whole process unrolls indexed in time, and we can’t just talk about “pedal position” and “speed” but rather “pedal position at time t” and “speed at time t+dt”.

      We’re assuming a constant speed, so there is a specific pedal position *curve* that must occur in order for speed to be constant. If you observe several pedal position *curves* and their associated *speed curves* on the same track… you **will** find a correlation.

    • If there is a "causal" relationship, then fluctuations in A will be associated with fluctuations in B (possibly at a time lag). The devil is in the details: the time-scale of the fluctuations for which this association is observable depends upon many factors, not all of which may be controllable by the experimenter, nor even necessarily within the scope of the experimenter's imagination. So a causal relationship may go unnoticed. But surely, when and if it is finally *noticed*, someone must observe an association: between changes in two things. If it is never noticed, then it remains a hypothesis. Perhaps a strong one, inasmuch as it may be a consequence of some other theory, the predictions of which are beyond reproach.

    • Anon:

      I think of variation as being the more general term, and noise referring to variation that can’t be explained or understood. With the judges example, my point was that different judges will have different attitudes; it’s variation. Similarly, pro athletes vary in ability. I wouldn’t call that noise. On the other hand, a batter might go 0-for-4 one day and 2-for-4 the next. Most of that sort of variation can’t be usefully modeled, so it would be fair to label that as noise. At least, that’s my quick thought for now. Probably others have thought about these labels more carefully.

      • Noise when measuring quantum mechanical processes is perfectly explained and is also a platonic example of noise. Spin measurements from a Stern-Gerlach apparatus are an understandable consequence of the ambiguous mixture of spin eigenstates and follow the natural Binomial(n, 1/2) distribution.

        Personally, I consider noise to be variation that is irreducibly unpredictable, making it subjective. Dice rolls are noisy from my point of view, but from the point of view of some advanced machinery with a full dynamical simulation, they might be a deterministic source of variation. A pseudo RNG produces noise to me, but to someone with the seed it produces a varying but predictable sequence of numbers. All contingent on one’s state of knowledge, except for quantum collapses.

        • Somebody:

          Yes, this relates to discussions we’ve had on the blog regarding conditional probability. Forgetting about quantum physics for a moment, we can say that a random variable y is pure noise if p(y|x) = p(y) for any variable x that is measured earlier in time than y. So the coin flip is pure noise if we are not allowed to measure variables such as the initial state of the coin and how high and fast it is flipped. The judge’s decision is pure noise if you are not allowed to gather any information on the judge, but if you can look at the judge’s past record, you can learn about his or her attitudes and values, and then you can partially predict the outcome, hence not pure noise.
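
          To make that concrete, here is a minimal simulation sketch (the judge "attitudes" and all numbers are invented for illustration): the same sentencing data looks like a lot of noise marginally, but becomes partially predictable once you condition on each judge's past record.

          import numpy as np

          rng = np.random.default_rng(1)
          n_judges, n_cases = 500, 51
          severity = rng.normal(5.0, 1.0, n_judges)        # each judge's stable attitude
          sentences = severity[:, None] + rng.normal(0.0, 1.0, (n_judges, n_cases))

          past_record = sentences[:, :50].mean(axis=1)     # each judge's earlier cases
          new_case = sentences[:, 50]                      # the case to predict

          print(np.var(new_case))                # marginal variance: about 2
          print(np.var(new_case - past_record))  # about 1: the record removes half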

  19. Andrew: help me out here with your example. If, under the very conditions you describe, I jitter the accelerator, there will be a variation in speed about the equilibrium point. The variation in speed will be slower than the variation in the throttle; but if I plot the time-series of throttle depth versus time and the time-series of speed versus time, I am sure that the two time-series will resemble the input and output of some (probably nonlinear) filter; and that by any technical definition of "correlation" you wish (first-order, higher-order, Kullback mutual information, etc.) the output of that process *will* be correlated with the input.

    That is a mouthful of words, but the point is this: generally speaking, when we say X is correlated with Y we are thinking of a particular time-scale; we are thinking of the wiggles being synchronized and we are certainly indifferent to features longer than the scale of the experiment itself.

    • To be quite precise here, in the case of the jittering throttle and the fluctuation in speed, there will be a correlation at some time-lag. But the same characterization holds good: "when we say X is correlated with Y — possibly at lag tau — we are thinking of the wiggles being synchronized — possibly at lag tau — and we are certainly indifferent to features longer than the time-scale of the experiment itself." (How can an experiment that lasts a time T tell us anything about the relation between wiggles which are on the order of T/2? It cannot.)
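
      Here is a minimal sketch of that jitter experiment (the first-order dynamics and every constant are invented for illustration): white-noise wiggles injected into the throttle reappear, smoothed and delayed, as a clearly nonzero lagged correlation with speed.

      import numpy as np

      rng = np.random.default_rng(2)
      n = 5000
      throttle = 0.5 + 0.1 * rng.standard_normal(n)   # equilibrium setting plus jitter

      speed = np.empty(n)
      speed[0] = 0.5
      for t in range(1, n):                           # toy first-order lag on the input
          speed[t] = speed[t - 1] + 0.05 * (throttle[t - 1] - speed[t - 1])

      def corr_at_lag(x, y, lag):
          # correlation between x[t] and y[t + lag]
          return np.corrcoef(x[:n - lag], y[lag:])[0, 1]

      for lag in (1, 5, 20):
          print(lag, corr_at_lag(throttle, speed, lag))
      # the short-lag correlations come out clearly positive and then decay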

    • Rm:

      First off, I don't think that particular example is so important. It's just one of many, many examples where there is causation but no correlation, or more directly where the sign of the correlation can go in any direction, so that it provides no clue about the existence or sign of the causal effect. As noted in the above post, Rachael's example is kinda complicated, which leads us into these fun side directions. Second, regarding your particular question, I'd say that when there's a causal effect there will have to be statistical dependence at some level, perhaps with latent variables. Depending on the measurement structure, this can manifest itself in different ways with observational data, as you illustrate with your example.

      • My point is this: when we say we “see” correlations between X and Y we are implicitly or explicitly stipulating a feature scale (e.g. time scale). What we “see” is that *changes* in X and *changes* in Y are associated; and we are, of course, blind to changes of long characteristic scale (they cannot be seen in our experiment). When we say literally that “X causes Y” this is a short-hand for saying that “it is an iron law of nature that experimental manipulations of such-and-such a kind will show that *changes* in X invariably are followed by *changes* in Y”.

        The equilibrium state between the throttle and the constant velocity up a hill is not a good experimental setup for learning of the causal relation between throttle and velocity. But the fact that *this* setup is a poor experiment does not mean that there are no *good* experiments to seek it out. And in fact, the “good” experiment will elicit changes in Y which follow from changes in X; and on that basis will conclude: [1] Such changes are correlated (possibly at time lag) and [2] there is a causal connexion between the two.

  20. I am not sure if this is actually important for your argument or just a detail, but you are clearly wrong in the way you resolve the causation -> correlation argument by reverting to statistics. The pedal example is irrelevant.
    If you are the driver who presses the pedal, you don't need statistics; the problem is not statistical any more, it is analytical. You have a clear cause and a clear effect! The effect (speed, constant or not) is the result of the accelerator, full stop.
    Give me the mass of the car, the details of the engine, etc., and I can calculate EXACTLY the speed, uphill, downhill, whateverhill. Also, the reverse holds: give me the desired speed and I can calculate the exact amount of pressure on the pedal.
    Causation of this kind is exact and absolute correlation; mathematically this is a bijection, a tautology (the one IS the other, so perfect correlation).
    So, this is not correlation in the statistical sense: statistics is needed when you do not know the exact causes of phenomena, so all you can have is statistical correlation and models.
    But when you know a priori the exact causes of phenomena you do not need statistics; you just calculate and correlate a posteriori and exactly.

    (Of course, this is only valid in principle. In real life, good luck trying to calculate without statistics…)

    • I think you're not taking the gas pedal example in the spirit in which it's intended. Of course the driver of the car understands how the gas pedal works. But if you're looking from the "outside," you see the gas pedal moving back and forth and the speed of the car not changing, in spite of the fact that there is a causal relationship between the position of the pedal and the speed of the car. This is merely a thought experiment illustrating that just because there is a causal relationship between A and B (the position of the pedal and the speed of the car), there is not necessarily a correlation between A and B. It is indeed difficult to imagine a situation in which someone operating the car would fail to understand the causal relationship, but that's not the point.

      • From the outside, confronted with a state of affairs entirely unknown to us, we never draw conclusions about causation, unless either [a] we can manipulate the entities experimentally, attempting to “hold other factors fixed”; or [b] we can observe the relationship between the entities, attempting to restrict our attention to conditions under which “other factors are fixed”. And then: we only say we think we observe a causal relation if *changes* in one factor (e.g. the throttle) are followed (in time) by *changes* in the other factor (e.g. the velocity). And whether or not any such relation is observed depends upon the equilibrium point of the system and the intensity and rapidity of the changes injected into the system around that equilibrium point.

        • rm, you say “From the outside, confronted with a state of affairs entirely unknown to us, we never draw conclusions about causation” but this is not true, at least for some definitions of “we.” Indeed, this is what this post is about! People evidently do draw the conclusion that if two variables are uncorrelated then there must be no causal relationship between them. Or at least, that is what is suggested by the quote Andrew pulled out of the book, and readers of the book might be encouraged to draw such conclusions themselves.

        • replace “we never draw conclusions…” with “we cannot draw conclusions”.
          That does not mean that we do not draw them, in spite of this.
          Of course we must!

          But if we are confronted with new phenomena, X and Y, about which we know nothing at all, and we do *not* observe synchrony between them (in time or in space), then we will remain ignorant of any causal connection between them lurking beneath. Until we are able to intervene experimentally and discover that changes in one are followed by changes in the other (on some time scale and given some chosen set of fixed background variables).

          We may *suspect* some connection between the two but we cannot *see* it until it is manifested in “correlation”; in the general sense of ‘wiggles in one following wiggles in the other’ — but someone’s got to make them wiggle!

  21. Context matters, and a sentence (which I wrote) caused much agitation when taken out of context. The admittedly vague but generally correct claim that "wherever there is causality there is correlation" is not equivalent to the precise and obviously wrong claim that a causal relationship between two variables is always and unconditionally manifest in the correlation between them.

    In its context (p.152-3 in "Noise"), the sentence referred to a sociological study of a large sample of fragile families, in which each family was described by numerous variables collected over many years. The aim of the study was to assess the predictability of a few specific outcomes — for example, whether a family had been evicted in the preceding year. Over 100 teams of statisticians and other experts competed to construct a predictive model for each of six outcomes. The competing models were tested in a holdout sample, and the highest correlation was taken as a measure of the predictability of the outcome, given the information. The winning model accounted for less than 5% of the variance of the eviction outcome.

    What can be inferred from this finding? The authors of the study concluded: "Researchers must reconcile the idea that they understand life trajectories with the fact that none of the predictions were very accurate." We recast their concept of "understanding" in explicitly causal terms, and concluded that the large set of predictive variables did not contain the main causes of the outcome variable. We stand by this conclusion and believe it has some generality when applied with appropriate caution. It is often (not always) possible to test causal hypotheses by observing correlations, and a researcher can often (not always) infer that a causal hypothesis misses the mark if the relevant correlation is puny.

    • > The admittedly vague but generally correct claim that “wherever there is causality there is correlation”

      Consider an infinite ensemble of identical boxes, each fixed to a bounded track with positions [-1, 1]. We try to discover the causal impact of sliding a box to a particular x-position along the track on its height y. We do this by sliding a sample of boxes to an x-position randomized uniform(-1, 1). If every track is a parabola of the form y = x^2, the empirical correlation converges to 0 as we take more samples, because the true correlation is zero. This is the *ideal* case for causal inference–truly identical units, no pre-treatment or selection biases, true treatment randomization.
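
      A five-line version of that check (a sketch; the sample size is arbitrary):

      import numpy as np

      rng = np.random.default_rng(3)
      x = rng.uniform(-1, 1, 100_000)  # randomized "treatment": position along the track
      y = x ** 2                       # height is entirely caused by position

      print(np.corrcoef(x, y)[0, 1])   # near 0: cov(x, x^2) = E[x^3] = 0 by symmetry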

      The trouble with the statement isn’t that it’s too vague but generally correct, but that it’s not vague enough. When the discussion is already in the domain of statistics, as it is here, something less overloaded like “relationship” or “dependence” is a lot less confusing. I do think it’s true that causality will generally leave some kind of observational signature unless it’s being deliberately hidden, and that perfect catastrophic cancellations into noise don’t happen in the real world–there’s just no reason to believe that signature will be a linear model, especially not without coordinate transforms to a particular basis.

      > The winning model accounted for less than 5% of the variance of the eviction outcome. What can be inferred from this finding? The authors of the study concluded: “Researchers must reconcile the idea that they understand life trajectories with the fact that none of the predictions were very accurate.” We recast their concept of “understanding” in explicitly causal terms, and concluded that the large set of predictive variables did not contain the main causes of the outcome variable.

      Gelman posted a great paper on this blog that gives a similar framing to the question of "understanding why," in case you haven't read it:

      http://www.stat.columbia.edu/~gelman/research/unpublished/reversecausal_13oct05.pdf

      • > unless it’s being deliberately hidden

        Wanted to give a brief addendum to this; this isn’t as uncommon or sinister as my poor choice of words makes it sound. If, pre-treatment, one group of people, who know who they are, are headed for a predictable negative outcome, they will often opt in to treatments to avoid it, naturally creating a pre-treatment bias that hides correlations by design. Example from someone above: costly interventions for preventable deaths.

    • Daniel:

      Thanks for the reply. I agree that context is important, and your comment is very helpful.

      Let me unpack your statement that it is a "generally correct claim that wherever there is causality there is correlation." I agree with you on this one—as long as we're general enough in our interpretation of the word "wherever". But we have to be careful, as I discuss in this comment.

      Here’s a precise statement:

      When there is causation there will be correlation somewhere; you might not observe it in data but there will be correlation in some latent variables. This loops back to the definition of causal effect as a difference between outcomes under treatment and control conditions, so if you consider y^T and y^C as latent variables and if you imagine applying treatment or control to multiple identical people, you’ll get a correlation.

      The difficulty with this statement is that (a) we don’t in general have data on multiple identical people and (b) we don’t in general observe both of the latent outcomes on each person.

      Prediction is another story. Your example demonstrates that it’s hard to predict the eviction outcome. The researchers’ statement, “Researchers must reconcile the idea that they understand life trajectories with the fact that none of the predictions were very accurate,” seems reasonable to me. I don’t see how you can go beyond this to a statement, “the large set of predictive variables did not contain the main causes of the outcome variable.” I say this because I don’t think a phrase such as “the main causes of the outcome variable” has any meaning in general. In specific cases such as a mechanistic model in pharmacology, sure, we can say that the drug’s concentration is mostly determined by a certain function of physiological parameters and input conditions. But in an example such as your study of families, I don’t see that it’s so helpful to frame this in terms of the main causes of eviction.

      To put it another way, I don’t really see what is gained by taking a well-defined statement about prediction error and transforming it into a vague statement about main causes. I get that there’s interest in main causes and I fully support efforts to construct models of such causes and then fit such models from data.

      Perhaps it is most helpful to say where we agree! I agree that if, after all researchers' best efforts, the best they can do is an R-squared of 5%, then this is relevant information: they can't predict the outcome well. One challenge in working with R-squared is that it depends on the set of cases you are considering. If you look at all American families, maybe you can get a pretty good prediction model. For example, I've never been evicted, and that would not be hard to predict, given that I have a steady job that pays a lot more than my rent. By restricting to fragile families, you've made it into a harder problem.

      Does poor prediction represent a lack of causal understanding? It depends. Rachael gave an example of a car; we also see such cancellation of effects in competitive settings. I doubt we could predict the success of pro sports teams based on how much practice time the players are getting, but we shouldn’t interpret this to imply that practice doesn’t help. This can come up a lot in economics. Lower the price of a product and you can typically expect to sell more of it. This will not necessarily show up as a correlation between prices and sales across products.
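
      Here's a minimal sketch of that pricing case (a made-up toy market, not a real demand model): the causal effect of price on sales is negative by construction, but because sellers price into demand, the observed correlation across products comes out positive; randomizing the price recovers the causal sign.

      import numpy as np

      rng = np.random.default_rng(4)
      n = 10_000
      demand = rng.standard_normal(n)                 # unobserved demand shock per product
      price = 10 + 2.0 * demand                       # sellers charge more when demand is strong
      sales = 50 + 5.0 * demand - 1.0 * price + rng.standard_normal(n)
      print(np.corrcoef(price, sales)[0, 1])          # positive in observational data

      price_rct = 10 + 2.0 * rng.standard_normal(n)   # price randomized instead
      sales_rct = 50 + 5.0 * demand - 1.0 * price_rct + rng.standard_normal(n)
      print(np.corrcoef(price_rct, sales_rct)[0, 1])  # negative: the causal sign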

      In summary: I agree that this is an important topic and you're getting at something real. I also think that Rachael has a good point. Lots of economic phenomena have approximate equilibrium states, which in turn are related to statistical selection. Again, consider your families example. By restricting to fragile families, you're making the problem more difficult. This has some similarity to famous selection examples such as the negative correlation of GPA and SAT among students attending a moderately selective college, or control/feedback mechanisms such as Rachael's car.

      • Sociology is hard. There are thousands of unmeasured variables, plus, yes, randomness. The effect sizes aren't big, but small net effects can have meaningful consequences. And yes, just looking within the population of fragile families obviously ignores the variable of being a fragile family (whatever that means). The mean level of fragile-ness is possibly already explaining a lot. If I'm betting a million dollars, a 5% improvement in prediction over the sample proportion, especially if the mean is not close to .5, may make me take the bet. There are a million K-12 students in NYC. If I can identify a potential variable that explains 5% of the variation in grade retention, that's not nothing. If anything, I would be suspicious of big effect sizes.

      • There is a better way of saying what I was trying to say. Simply substitute the word “prediction” for “correlation”. You then get two uncontroversial statements: “prediction does not imply causation” and “causation implies prediction.” Unlike correlation, prediction is not bivariate, so this wording is immune to the climbing-car objection. Another way to state essentially the same two ideas is “you can predict an outcome without being able to explain it” and “a useful explanation of an outcome implies that it could have been predicted.” Not earth-shaking, but also not wrong — this is in a popular book.

        I wish I had thought of the prediction formulation when I wrote. I also wonder about the investment in this thread: how much mental energy went into trashing the original formulation, how much into improving it. Is this ratio optimal?

        • Daniel:

          1. I disagree that causation implies prediction. Or, to be precise, I agree that causation implies prediction of latent variables (the y^T and y^C of potential-outcome notation), but I disagree that causation will necessarily imply prediction of observed data. Consider Rachael's car example or various standard examples of selection bias. Similarly, I see the appeal of your formulation, "a useful explanation of an outcome implies that it could have been predicted"—you just have to be careful with it, again keeping in mind examples such as feedback and selection bias. Returning again to Rachael's car, you can have a good explanation of how the accelerator works but still not be able to use it to make good observational predictions of speed and acceleration. A good explanation (a good causal model) will allow you to make good experimental predictions, but that's a different thing from talking about patterns in observed data as in your fragile-families example.

          And, again, don’t forget the selection problem, as the fragile families are a group where eviction prediction is particularly difficult. For another example, someone told me once that it’s really hard to quit smoking: smoking cessation programs are full of people who’ve tried to quit a million times. And, sure, it can be really hard to quit! But there’s also some selection bias: people enter smoking cessation programs because simpler methods of quitting didn’t work. Lots of other people can quit with less effort but they’re not in that particular sample.

          2. Regarding the trashing/improvement formulation: You make a good point. I will say that trashing takes less energy than improving! It's complicated, though, because when trashing we have to be very careful. There's an asymmetry by which a criticism is often held to a higher standard than the original writing. Some of the reason for my passion on this particular item is explained in the first paragraph of the above P.S. Also, I assume you can understand that I'm not the only person who was annoyed by your collaborator's "discovered a new continent" framing or his statement that he'd never thought about some of these very important issues before writing the book.

          Speaking at a more abstract level, I think there's room for all sorts of positive and negative feedback and criticism. Your book has received strongly positive reviews in several major newspapers, and these reviews point out lots of reasons why your book is an interesting and relevant take on an important topic. It's also received some pushback on social media (including this blog). I think the balance works out in a fair way. My point here is not that the blog post was written in response to the positive reviews—it wasn't; it was written in response to Rachael's example and other sources cited in the post. I think negative commentary can be useful and constructive commentary can be useful too. I don't believe in empty "trashing"; in this case I think the trashing had substance, getting into questions of causation and prediction which were followed up in the comments, as well as the question of what it means for a book's coauthor to declare that he's discovered a new continent, when he's writing about a topic that's been considered in many different ways and for a long time by many statisticians and economists. I wouldn't want the negative comments on the blog to drown out the positive reactions from the New York Times, Washington Post, etc., or the constructive comments that have arisen in this thread.

          Also see the P.P.P.S. just added. Thanks again for your feedback.

        • There is also the issue of not being able to give an explanation that does not give rise to misunderstanding in many/some readers/listeners.

          When you stick closer to just the math, those with adequate math skills can self-verify whether what they are taking away is "correct".

          But when you stray from the math to allow for a grasp by a wider audience, more (most?) will think they got it but many incorrectly.

          Perhaps one way out of this is to equip potential readers/listeners with simulation skills so that they can self-verify more easily.

          Like providing a guide dog to a blind person trying to verify if they should follow the directions of someone with possibly only one eye.

        • RE:

          1. But when you stray from the math to allow for a grasp by a wider audience, more (most?) will think they got it but many incorrectly.
          ———
          The problem is that there is considerable contestation among experts. Who can be held to be correct, without following along assiduously? And it is also the case that some non-experts [those outside the field under discussion] can be correct. Philip Tetlock's forecasting tournaments have demonstrated this. Without reading Expert Political Judgment, one of several indispensable books, we may not reckon with the catalogue of errors made by experts in particular. Here the distinction between science and scientism is relevant, and it is amply addressed by Susan Haack.

          I believe that Noise and other books by Cass Sunstein and Daniel Kahneman are for wider audiences. I agree, Keith, that wider audiences can be incorrect. But they are in good company with subsets of prominent experts as well.

          ———–
          2. Perhaps one way out of this is to but equip potential readers/listeners with simulation skills so that they can self verify more easily. Like providing a guide dog to a blind person trying to verify if they should follow the directions of someone with possibly only one eye.
          —-
          This is an interesting suggestion. But hasn't this been happening in beginners' courses? A consistent theme from Sander Greenland and Andrew Gelman is that statistics is being taught incorrectly.

          Critical thinking should be encouraged. Critical thinking courses seem mediocre. I haven’t surveyed but a handful.

          I’m a proponent of John Dewey’s theories more broadly. That is my bias. And leads me to believe that radical change is needed in teaching starting in grade school.

        • > But they are in good company with subsets of prominent experts as well.
          Yes, but if it's math and if they have the math skills, it's their fault for not working that through. The same as if you gave someone a list of numbers and they miscalculated a quantity from them. Simulation is a style of math that converts the probability equations into numbers (draws of pseudo-random variables) so that calculations can be done on them to discern what would repeatedly happen. This reduces the math skills required but may actually increase the critical thinking skills required.

          More and more, simulation is being taught in introductory stats courses along with computer programming. But there is this generational drag of "I learned statistics analytically and I don't trust anyone can do otherwise." Also, as one undergraduate director said to me about 5 years ago, "I hate all this programming and simulation stuff."

        • I get Andrew's point, but I think the prediction formulation is useful for your audience.
          In particular, it's important that the public understand that prediction and causation are not the same thing. Too often we get policy proposals (let's encourage people to get married, since the people who select into marriage without encouragement have better outcomes) that are based on prediction with a causal gloss.

    • Perhaps a middle point between "causation could imply correlation" and Andrew's philosophy is that causal inference amounts to an individual-level prediction problem, and can generally be judged by the individual-level fit.

    • “We stand by this conclusion and believe it has some generality when applied with appropriate caution. It is often (not always) possible to test causal hypotheses by observing correlations, and a researcher can often (not always) infer that a causal hypothesis misses the mark if the relevant correlation is puny.”

      The claim of “some generality” is misleading for economics and, I think, other social sciences such as sociology. The textbook example is that if you found no correlation between prices of some good and quantities sold, you could conclude that raising the price would not result in an appreciable decline in the quantity sold. This is obviously wrong, and the reason is that in most economic and social systems many observed variables, in this case, prices and quantities, are (approximate) equilibrium outcomes of selection processes.

    • hey, how’s priming coming along? remember how sure you were of it being a real effect? how dismissive you were of criticisms initially?

      pepperidge farm remembers.

  22. I don’t think Meager’s example is nitpicking or unnatural at all. It’s just an example of a dynamic equilibrium

    https://en.wikipedia.org/wiki/Dynamic_equilibrium

    and this correlation argument can be used to falsely falsify causality in ANY situation with a dynamic equilibrium. In a social science context, this is dangerous for lots of real-world problems because people often DELIBERATELY seek equilibrium. This approach will fail on ANY outcome variable where people target some constant. Is humans targeting a constant metric really that unusual of an example?

  23. “By restricting to fragile families, you’re making the problem more difficult.”

    The pertinent data for the eviction problem would look like a bunch of random walks [family wealth], each with a unique absorbing boundary [that family’s rent payment].
    By only looking at fragile families, the dataset is biased towards those families that start near and stay near the boundary. If the problem is understood as attempting to predict how many of the random walks hit the absorbing boundary within an arbitrarily selected date range (“in the last year”), it is not hard to see why the scores were so low.

    It is hard to imagine a less desirable set-up for making generalizations about the relationship between causation and correlation.
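
    A minimal sketch of that picture (boundary normalized to zero, all parameters invented for illustration): even a predictor that observes each family's true starting cushion leaves most of the variance of the binary eviction outcome unexplained.

    import numpy as np

    rng = np.random.default_rng(5)
    n_families, n_weeks = 20_000, 52
    cushion = rng.uniform(0.5, 3.0, n_families)        # starting wealth above the rent line
    steps = 0.5 * rng.standard_normal((n_families, n_weeks))
    wealth = cushion[:, None] + np.cumsum(steps, axis=1)
    evicted = (wealth.min(axis=1) <= 0).astype(float)  # hit the boundary within the "year"

    r = np.corrcoef(cushion, evicted)[0, 1]            # even the true initial state
    print(evicted.mean(), r ** 2)                      # explains only a small R-squared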

    • And if what you want to do is reduce eviction, causation may not be the standard you need. Prediction may serve you better. That's why organizations look at credit scores for all kinds of things unrelated to giving someone credit.

  24. While I am casting stones at examples, I don't think Rachael's accelerator/velocity formulation works well in this context either. Velocity is just a component variable of "work," which is defined as what engines produce. With fuel and oxygen quality fixed, accelerator pedal position does indeed correlate with work performed, and pressing on the pedal causes more work to be performed.

    If you flip the situation logically, Rachael’s example would mean that you can disprove that there is a correlation between two complex variables by showing that there is no correlation between one of the complex variables and a component variable of the other one. I don’t think that is true.

    I guess I am coming around to Kahneman’s general point about correlation and causation, even if I don’t like any of the examples.

    • Rachael's example doesn't claim that the accelerator doesn't cause an increase in velocity; the point is that it does cause an increase in velocity but doesn't correlate with velocity upon observation. By Kahneman's search strategy:

      “If you find no correlation between age and shoe size among adults, then you can safely conclude that after the end of adolescence, age does not make feet grow larger and that you have to look elsewhere for the causes of differences in shoe size”

      Kahneman would conclude that the accelerator doesn't cause velocity, and he has to look elsewhere to explain velocity, which is false.

      • “Kahneman would conclude that the accelerator doesn’t cause velocity, and he has to look elsewhere to explain velocity, which is false.”

        Except that it is not false. The accelerator does not cause velocity, all it does is open the throttle. If it caused velocity – which it would in the special case of being on flat ground – then it would correlate to velocity, which it does on flat ground!

        He most certainly would have to dig deeper to explain velocity. And when he rolled up all the complexity of the engine output into a work function – the right thing to do – he would have a nice clean correlation back to throttle setting.

        • > If it caused velocity – which it would in the special case of being on flat ground – then it would correlate to velocity, which it does on flat ground!

          Incorrect. The definition of causality is a counterfactual dependency. What would happen if the accelerator was depressed more or less, hill or not? Velocity would change. It’s correlated in that hypothetical, you just never observe the hypothetical, so you can’t find that out from just observational data—you’d need to do an experiment.

        • If one digs a little deeper into what we really *know* (or think we know) when we say X causes Y we see it means this: “changes in X are followed by changes in Y when we think we have accounted for all other relevant factors and, ideally, held them fixed (in an experimental setting); or when we are able to partition our observations so that the phenomenon can be observed with them fixed (in an observational setting)”. The throttle/velocity example is misleading, like a magician’s trick. The evidence for throttle to velocity causation is only available in circumstances when we may observe *changes* in one invariably (*) followed by *changes* in the other.

          (*) “invariably” may mean: over the course of the experimental session; or over the course of many experimental sessions; or over the course of a lifetime; or many lifetimes; or the whole of human experience.

        • "changes in X are followed by changes in Y when we think we have accounted for all other relevant factors and, ideally, held them fixed (in an experimental setting); or when we are able to partition our observations so that the phenomenon can be observed with them fixed (in an observational setting) ... The throttle/velocity example is misleading, like a magician's trick."

          Exactly and well put. The fact that you can name a single aspect of the output, the velocity of the vehicle, and show that it does not correlate with a specific aspect of the input proves nothing at all. I can come up with a whole bunch of other component variables that don’t correlate throughout the operating envelope either.

          Rachael was basically arguing that:

          1. Input variable x does not always correlate with output variable y.

          2. An increase/decrease in input variable x causes output variable y to increase/decrease.

          3. Therefore, changes in input variable x cause but do not correlate with changes in output variable y.

          Now fix No. 2 so it is correct, substituting variable w (work) for variable y. Doesn’t add up to anything.

        • > Now fix No. 2 so it is correct

          It doesn't need to be fixed; it's correct. If you were to increase or decrease the depression of the accelerator, speed would increase or decrease compared to if you did something else. Therefore, it does cause increases or decreases in speed. You just can't observe both simultaneously. "The throttle/velocity example is misleading, like a magician's trick." Exactly: an observational setting can be misleading about causality; that's the whole point. It is a response to an eliminative search strategy proposed by Kahneman which is completely wrong and easily fooled:

          “Where there is a causal link, we should find a correlation. If you find no correlation between age and shoe size among adults, then you can safely conclude that after the end of adolescence, age does not make feet grow larger and that you have to look elsewhere for the causes of differences in shoe size.”

          This falsification strategy is articulated specifically and is completely wrong

        • The problem with your argument is that the correlation between "step on the accelerator" and "speed increases" has nothing to do with work. It's just a correlation between two behaviors. In effect, you're cheating by including your knowledge of other parameters. These parameters may not be known or measurable in an analogous situation in the real world.

    • At this juncture, I am coming around to Andrew's AND Danny's general point. But I do think that the example Rachael gives requires a technical background that most of us don't have.

      I would have liked to read many more examples across different endeavors.

  25. Sunstein basically wrote an op-ed about how the new novel coronavirus was nothing to worry about, like 3 days prior to the WHO declaring a pandemic and almost every world government simultaneously deciding it was dangerous enough to give their citizenry a stay-at-home order (https://www.bloomberg.com/opinion/articles/2020-02-28/coronavirus-panic-caused-by-probability-neglect).

    It's not the fact that he's a lawyer and doesn't have a stats background that's the problem – it's that he's a total moron.

    • It's that he is a first-class pontificator and panderer to those would-be immortals: college boards; middleweight panjandrums; non-combatant investors in grape-seed oil; scribblers to the moralists, busy with their plans to live to great years as pickled mummies, holding court from a cabinet, face buffed up nice and shiny, like Bentham's skeleton. They cannot believe that *they* are flesh-and-blood same as the servile classes.

  26. Several commenters have mentioned the situation of a car being driven at constant speed, but I don’t think that all of them have fully grasped the problem that all control systems pose for causal analysis.

    The claim (passim) that if you press harder on the accelerator, the car speeds up, is as untrue as the original claim that Gelman criticised, that causation implies correlation. The car will not speed up if you do this just as the car begins to go uphill, which is exactly what you will do if you are trying to maintain a constant speed. It is what the cruise control will do if it is engaged.

    The reason that it seems so reasonable, even obvious, to say “if you press harder on the accelerator, the car speeds up” is that you already know how the mechanism works, and you are imagining a scenario in which the car is being driven on a level road by a driver who is not trying to control its speed. These are circumstances almost never to be found in real driving situations. They are found in the imagination, when one contemplates the physical mechanism of the pedal changing the flow of fuel, which changes the power generated by the engine, which changes the speed, all other things imagined away. That is a model of the system that is already known to be correct. But if you have that, then statistics and causal analysis are irrelevant. You already know everything that it is the purpose of those techniques to discover.

    If you do not know the mechanism, and see only a time series of the pedal height and the speed, and have no data on the road gradient, then they will seem to have nothing to do with each other. Worse still, if you do also have a record of the gradient, then the gradient will seem to cause the pedal setting: they will be closely correlated, while neither will correlate with the speed. Other forces, such as the wind, will just make the data fuzzier without giving any better information about the real causal links. These links are from pedal to speed, gradient to speed, and speed to pedal (through the driver or the cruise control). All of the causal links involve the speed, but the speed has practically zero correlation with pedal and gradient. Pedal and gradient are tightly correlated, but there is no direct causal link between them. The better the cruise control works, the more hopeless the task of discerning its presence from external observations.

    Such is the nature of control systems. Their entire point is to remove the relationship between the disturbing forces on a variable and the variable itself. It is only when it works badly — that is, when it is not a control system — that correlations corresponding to the physical causal links can show up, and even then, the cyclic causation among the variables greatly increases the difficulty of the problem.
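
    Here is a minimal simulation sketch of that correlation pattern (toy physics and invented gains, with a simple proportional-integral controller standing in for the driver or cruise control):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 20_000
    grade = np.cumsum(0.001 * rng.standard_normal(n))  # slowly wandering road gradient

    target, speed, integral = 1.0, 1.0, 1.0
    P, S = np.empty(n), np.empty(n)
    for t in range(n):
        error = target - speed
        integral += 0.05 * error
        pedal = 2.0 * error + integral                 # PI cruise control
        speed += 0.1 * (pedal - grade[t] - speed)      # toy car dynamics
        P[t], S[t] = pedal, speed

    for name, a, b in [("pedal-speed", P, S), ("grade-speed", grade, S),
                       ("pedal-grade", P, grade)]:
        print(name, np.corrcoef(a, b)[0, 1])
    # pedal-grade comes out close to 1, while the two correlations involving
    # speed hover near zero as long as the controller is doing its job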

    Of course I am not the first to observe this, and some of it has already been mentioned by other commenters here. But the only paper I know of setting out the general problem, applicable to all situations where control systems may be present, and exhibiting the failure of all (then) existing methods of causal analysis to deal with it, is my own, which Alexander Kruel already linked to: https://arxiv.org/abs/1505.03118.

    MM linked a blog post of Cosma Shalizi's to the effect that all this is well known to everyone in the field. Everyone in the field of econometrics, perhaps. Everyone there mentions "Friedman's thermostat," although Friedman's original article in the WSJ was not actually making this point. It seems to have been a later blog post by Nick Rowe that brought out and disseminated this interpretation, and its antecedents go back a long way, but apparently only in econometrics. (I don't know what econometricians actually do about the problem beyond mentioning Friedman's thermostat from time to time.) From my reading there is little trace of the concept outside of that field, despite it being ubiquitously applicable. Almost the entire literature on causal analysis is based on DAGs, which cannot describe feedback systems, and the few attempts to extend it to cyclic graphs are based on hypotheses that still exclude control systems.

    At least, that was true when I wrote that paper. I would be interested to know if there has been any advance since then.

    • Disclaimer: Have not read your paper (yet)

      In the theoretical case, yes an ideal control system does seem to mask the causal relationships and does seem impenetrable to observational methods in causal inference. But as an applied problem, can’t this be got around by having measurements of a high enough precision and temporal resolution? Real control systems have latency, overshoot, and settling time, no?

      • The short answer is, for practical purposes, no. Yes, real control systems have these imperfections, but they don’t help unless you hit the thing with a hammer, or in more technical language, apply an impulse: a delta function or a step function. The impulse response of a linear system can tell you a great deal about it. But the impulse, which in the mathematical ideal takes no time at all, has to be much faster than the control system’s settling time — which means, fast enough for the control system to not be a control system on that timescale. Practical control systems, when functioning as such, are just as impenetrable to the causal analysis techniques I know of as ideal ones.

        • In fact, my paper includes some examples of the result of injecting noise into all the variables. All it does is to make some of the multivariate correlations that in the noise-free case were undefined (because of some linear identities among the variables) well-defined but unhelpful.

          > But the impulse, which in the mathematical ideal takes no time at all, has to be much faster than the control system's settling time — which means, fast enough for the control system to not be a control system on that timescale.

          What's confusing me here is that my understanding of control systems (limited to a week on PIDs during an instrumentation laboratory as an undergrad) is that they function by measuring changes in the outcome variable of interest as it deviates from a setpoint. So, if you measure at a fine enough temporal resolution, doesn't every control system fail, then course-correct, or as you put it become locally "not a control system" at a very zoomed-in timescale? I'd therefore expect correlations to converge sensibly in the limit of infinite temporal resolution, at least for digital systems with a fixed finite polling rate for measurements and responses.

    • Isn't part of the issue that you are excluding a relevant variable (incline)? Of course, if the outcome variable never varies you can't model it with a regression, but you could write an equation that has a non-zero slope. Which is what I think they mean by correlation, not a bivariate correlation.

      • As I said (and others have mentioned), observing the incline as well doesn’t help: when the speed is under control you’ll just see a correlation between pedal and incline, and none between either of those and speed. Of course, with a car you can look inside and see the mechanisms, take them apart, and vary all the variables in any way you like, and so discover how it works. But in real applications of statistics and causal analysis you generally can’t do that. (If you could, you wouldn’t need them.) The interventions you can do are limited, and sometimes you’re stuck with observational data only.

        One intervention you can do that is diagnostic of the presence of a control system, called the Test for the Controlled Variable, is to push on a variable that you suspect may be under control and see if something pushes back, leaving the variable unchanged.

        • If one has never observed fluctuations in Y associated with fluctuations in X and one is in no position to perform a controlled experiment to see if there are any; and furthermore if one hasn’t any “theoretical” reason to even suspect that X is a cause of Y, well then, it would take a lot of special pleading to still say one thinks that X is a cause of Y. Absent observed association (repeated, experimental or historical record of fluctuations in Y following fluctuations in X) how can there be any rational ground to suspect a causal link? No record of “correlation” (in a generalized sense) => no evidence for causal link.

        • And on the subject of the TCV sans le nom, I just found an excellent paragraph by Nick Rowe (https://worthwhile.typepad.com/worthwhile_canadian_initi/2011/06/no-you-cant-test-whether-core-is-useful-that-way.html):

          “You can’t test whether core inflation is a useful indicator for a central bank to look at just by seeing whether core inflation forecasts future total inflation (or whatever the bank is targeting). You can’t test whether anything is a useful indicator for central banks to look at that way. Everything ought to look useless by that test, if the bank is doing it right. What you are testing is whether the bank is doing it right.”

    • “DAGs…cannot describe feedback systems” – I see that all the time and regard it as incorrect in the general sense of “feedback”, as evidenced by the considerable literature on dynamic treatment regimes by Robins and colleagues (which deal with feedback systems in medical treatments). But to use DAGs in such schemes does require construction with resolution down to the shock-intervention scale so that response lags are clearly visible and variables are resolved down to that time scale (in fact Robins got his inspiration for his original g-methods to handle these cases from shock models in engineering). This means that DAGs can rapidly get unwieldy for more than illustrating basic points.

      What I could agree with is that ordinary coarse DAGs don’t capture crucial features of feedback-control systems because such systems are by definition trying to counterbalance (cancel out) effects and thus produce unfaithful causal systems, making effects invisible in the DAG (and in any coarse analysis, as this thread notes). A parallel limitation of DAGs in failing to capture study-selection balancing was described in
      Greenland S, Mansournia MA (2015). Limitations of individual causal models, causal graphs, and ignorability assumptions, as illustrated by random confounding and design unfaithfulness. European Journal of Epidemiology, 30, 1101-1110
      with related problems discussed in
      Mansournia MA, Greenland S (2015). The relation of collapsibility and confounding to faithfulness and stability. Epidemiology, 26(4), 466-472.

      • Pardon my basic question, but is there a formal definition somewhere of what densities can and cannot be represented by DAGs? I followed the paper link above but it just assumes the reader already knows this. I never know how to think of multivariate nodes in systems like BUGS and JAGS. In general, any old Bayesian model p(theta, y | x) = p(theta) * p(y | theta, x) corresponds to the simple three-node, two-edge DAG ({x, y, theta}, { y <- theta, y <- x }).

        • I don't think DAGs represent densities; DAGs represent the flow of time. In a DAG

          A -> B
          C -> B

          It means that manipulations of A or C will, after some sufficient amount of time, result in a change in the observed values of B.

          The typical DAG stuff seems to represent an "asymptotic for sufficiently large time" process. So, for example, if you turn on a switch there will be a flow of electrons into a capacitor, and at sufficiently large time the voltage across the capacitor equals the voltage across the power supply. (In a transient intermediate period, the voltage changes dynamically, and the DAG typically doesn't represent that.)

          My biggest problem with DAGs is that so many processes are actually dynamic and do not reach an equilibrium on the time scale of interest. I much prefer explicit dynamic models.
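          To put numbers on the capacitor example, here’s a minimal Python sketch (component values are mine, purely for illustration): the coarse DAG switch -> V_cap only asserts the large-time limit V_cap ≈ V_supply, while the simulation shows the transient the DAG ignores.

            import numpy as np

            # RC charging after the switch closes: V_cap(t) = V_supply * (1 - exp(-t/(R*C)))
            V_supply = 5.0          # volts
            R, C = 1e3, 1e-6        # ohms, farads; time constant tau = R*C = 1 ms
            tau = R * C

            for t in np.linspace(0.0, 5.0 * tau, 6):
                v = V_supply * (1.0 - np.exp(-t / tau))
                print(f"t = {t * 1e3:4.1f} ms   V_cap = {v:.3f} V")
            # Only the final (~5 V) row is what the equilibrium DAG describes;
            # the transient rows are invisible to it.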

        • “DAGs represent the flow of time” … hmmm I’d say um not exactly:

          DAGs are just directed topological network structures which have been around since long before we were born (some say since Euler or even before). They are just math objects comprising nodes and directed edges that define d-connectedness among nodes. What those abstract objects represent is up to the user. In graph probability they represent qualitative distributional constraints found by mapping d-separation to independence. I’m pretty sure there are theorems characterizing distributions for which all their independencies (unconditional and conditional) can be read off a given graph using d-separation, which is to say are faithful to the graph, and also giving criteria for determining whether a given distribution has a DAG it is faithful to. I’ve e-mailed some specialists for citations.
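          A concrete instance of that d-separation-to-independence mapping, as a quick simulation I’ll add (not a standard example from any of the cited papers): in the collider A -> B <- C, A and C are d-separated marginally but d-connected given B, and the sample correlations behave accordingly.

            import numpy as np

            rng = np.random.default_rng(1)
            n = 100_000
            A = rng.normal(size=n)
            C = rng.normal(size=n)
            B = A + C + 0.1 * rng.normal(size=n)   # collider: A -> B <- C

            print(np.corrcoef(A, C)[0, 1])         # ~0: A and C d-separated marginally
            s = np.abs(B) < 0.1                    # crude conditioning on B
            print(np.corrcoef(A[s], C[s])[0, 1])   # strongly negative: d-connected given B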

          Causal DAGs have no unique formalism; the formalisms range from the weak one in SGS (where a cDAG is little more than a time-oriented DAG) to the strong SEM one in Pearl (where each parent set determines its node via a functional relation), with others in between. Richardson and Robins review the varieties in ‘Alternative Graphical Causal Models and the Identification of Direct Effects’ (search the title online and you’ll find it free). And then there are many DAG extensions, like signed DAGs and SWIGs.

        • Sander:

          Yes, one example I like to give is the simple hierarchical 8-schools model:
          y_j ~ normal(theta_j, sigma_j) for j = 1,…,J, with the y_j’s observed and the sigma_j’s known
          theta_j ~ normal(mu, tau)
          prior p(mu, tau).
          This is a joint distribution expressed as a DAG: p(mu, tau) * p(theta | mu, tau) * p(y | theta, mu, tau). Or, as the DAG: (mu, tau) -> theta -> y.
          Is this DAG “causal”? Not in the Rubin sense. It’s just a mathematical model, with no formulation of potential outcomes. Does it make sense to think of mu and tau as “causing” theta? Maybe, I’m not sure.
          The term we’ll use here is that it’s a “generative” model. The variables (mu, tau) generate theta, and theta generates y. So it’s a generative graph, even if not a causal one.
          Mathematically, we could write the joint distribution in reverse order: p(y) * p(theta | y) * p(mu, tau | theta, y). That corresponds to a generative DAG model in the other direction. But it’s a weird sort of generative DAG, in that it doesn’t seem to make sense to generate data and parameters in this way.
          If we want to throw in more mystery, why do I say in the above model that y is “observed” and sigma is “known”? Somehow sigma feels like an assumption and y feels like data. Mathematically, the difference is that I have no generative model for sigma.

          In all these questions, there’s an uncomfortable back-and-forth between statistical model and the underlying science or engineering model. Also, causal inference is confusing. But I think that some of the confusion associated with causal DAGs corresponds to some unsettled issues in generative modeling, even setting aside causality.
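          Here’s that generative direction as a minimal Python sketch (the sigma_j are the usual 8-schools values; the particular prior draw for (mu, tau) is just a placeholder for p(mu, tau)):

            import numpy as np

            rng = np.random.default_rng(2)
            J = 8
            sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])  # "known", never generated

            mu, tau = rng.normal(0, 5), abs(rng.normal(0, 5))  # stand-in draw from p(mu, tau)
            theta = rng.normal(mu, tau, size=J)                # theta_j ~ normal(mu, tau)
            y = rng.normal(theta, sigma)                       # y_j ~ normal(theta_j, sigma_j)
            # There is no line generating sigma -- exactly the asymmetry that makes
            # sigma feel like an assumption while y feels like data.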

        • Well put, Andrew!
          Yes, a probability DAG can be viewed as a flow chart for generating data from any distribution compatible with the DAG, in the sense that every d-separation corresponds to an independency in the distribution (the converse of compatibility is faithfulness). This relation does not care about time order, and several different DAGs can lead to the same independencies. There is a remarkable theorem by Verma and Pearl (“Equivalence and synthesis of causal models,” 1990) showing how to identify and create probability-equivalent DAGs with differing edge orientations.
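          For instance (a toy Gaussian chain of my own construction): data generated as A -> B -> C carry the single conditional independency “A independent of C given B”, which the reversed chain C -> B -> A implies equally well, so the data alone cannot orient the edges.

            import numpy as np

            rng = np.random.default_rng(3)
            n = 100_000
            A = rng.normal(size=n)
            B = 0.8 * A + rng.normal(size=n)        # generated as A -> B -> C
            C = 0.8 * B + rng.normal(size=n)

            print(np.corrcoef(A, C)[0, 1])          # nonzero marginal association
            # Partial out B by linear regression and check A _||_ C | B:
            resA = A - np.polyval(np.polyfit(B, A, 1), B)
            resC = C - np.polyval(np.polyfit(B, C, 1), B)
            print(np.corrcoef(resA, resC)[0, 1])    # ~0, same as the reversed chain would give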

          The 2010 Robins-Richardson paper gives a lot of interesting results on the topic of what makes a DAG “causal”. Obviously any definition will allow only one orientation of edges: time-ordered. If the only other demand is a particular identification of responsiveness to well-defined interventions, then the model is “agnostic”. They then introduce a “minimal counterfactual model” (MCM) that is stronger than the agnostic model in being strong enough to support using potential-outcome models to estimate effects, but weaker than Pearl’s NPSEM causal model. They also give data-generating algorithms based on the models.
          In a subsequent paper (“A Unification of the Counterfactual and Graphical Approaches to Causality”) they went on to expand graphs to include potential outcomes explicitly, as single-world intervention graphs (SWIGs).

          As academic as it all sounds, R&R show it actually has some arguably important implications for those trying to define and measure direct and indirect effects using potential outcomes.

        • Sander, thanks for engaging that comment. Obviously a DAG by itself is a tool; it can represent lots of stuff. For example, it can represent the dependencies between different code files in a software project… or whatever.

          However, when it comes to modeling causality, the purpose of the arrow in the DAG is to indicate which is the cause and which is the effect. The arrow can be read as “causes” (or “causes in part”), so we have the simplest DAG A -> B.

          Suppose A is the position of a dial and B is a voltage on the output of some circuit. If you adjust the dial, then some short time later the voltage at B will be… such and such. Special relativity says that if B is a distance D from the dial, then the voltage can’t change until at least D/c time units later.

          The reason we have DAGs is that causality isn’t a two-way thing. We can’t “magically change the voltage at B” and then have the dial move a short time later.

          The role of time is essential in causality. To the extent that we model causality at all, it’s always a process in which first some stuff happens and then some other stuff happens “because” of it.

        • If the processes are “high-pass” then an experiment that wiggles the input rapidly will elicit wiggles in the output.
          If the processes are “low-pass” then an experiment that wiggles the input slowly will elicit wiggles in the output.
          If the processes are “band-pass” then an experiment that wiggles the input at a rate in a specific band (but not elsewhere) will elicit wiggles in the output.
          And so on.
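          A quick numerical sketch of the low-pass case (my own, with made-up filter constants): wiggle the input fast and the output barely tracks it, so the sample correlation is small even though the causal link is real; wiggle it slowly and the correlation shows up.

            import numpy as np

            rng = np.random.default_rng(4)
            n, alpha = 20_000, 0.01          # alpha sets the low-pass cutoff

            def lowpass(x, alpha):
                # one-pole filter: y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
                y = np.zeros_like(x)
                for t in range(1, len(x)):
                    y[t] = (1 - alpha) * y[t - 1] + alpha * x[t]
                return y

            fast = rng.normal(size=n)                          # rapid wiggles
            slow = np.sin(2 * np.pi * np.arange(n) / 5000.0)   # slow wiggles
            print(np.corrcoef(fast, lowpass(fast, alpha))[0, 1])  # small (~0.1)
            print(np.corrcoef(slow, lowpass(slow, alpha))[0, 1])  # near 1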

          If I already *know* the dynamics — which is to say, I have a model — that means I have already *observed* these input-output relations in some experimental (or observational) context; and therefore I am able to come up with a model which fits well to my experience. But if I do *not* already know the dynamics (and do not yet have a model) I may not have the slightest inkling (yet) that there *is* a relationship between the two quantities. That relationship may exist but may never have crossed the threshold of my attention; it may be exceedingly difficult to observe, and so no experimenter has yet proposed to go looking for it.

          Now some clever experimenter may look at some clever theoretical result and say: Aha! There are correlations to be found here, and if they are found we confirm that piece of theoretical work; and if they are not found, we cast doubt upon the whole of that theory; but in any event it’s an exceedingly difficult experiment to do; etc., etc.

          That is in essence the story of the Aspect Experiment and its lineage, and the “spooky” violations of the Bell Inequality.

      • “But to use DAGs in such schemes does require construction with resolution down to the shock-intervention scale so that response lags are clearly visible”

        I agree. Or as I put it, to avoid the feedback that invalidates a DAG description, hit it with a hammer and watch what happens before the feedback arrives.

    • I should add: The faithfulness assumption has been a point of contention among causal-DAG advocates for some 30 years. It is not a necessary part of cDAG theory, but it often gets portrayed as if it were. I never used or endorsed it, as it always seemed far too strong given the kind of data I see in practice, and I have given it a cold treatment in my expositions of cDAGs. It is, however, an essential assumption of the causal discovery algorithms for cDAGs developed by Spirtes, Glymour, and Scheines (SGS), which has led to the mistake of assuming all cDAG uses need it.

      Without faithfulness, however, DAGs become strictly limited in direction of inference, allowing data to refute certain simplifications of causal structures but not (as SGS wanted) to “discover” or confirm them (which is fine with me, given my refutationist leanings), at least not without considerably more constraints (e.g., directional monotonicities as in signed DAGs – see VanderWeele & Robins, “Signed directed acyclic graphs for causal inference”, JRSS B 2010).
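      A toy version of the unfaithfulness at issue (my construction, in the spirit of Nick Rowe’s point above): a controller cancels a disturbance so well that cause and effect are uncorrelated in observational data, even though intervening on the cause would plainly move the effect.

        import numpy as np

        rng = np.random.default_rng(5)
        n = 100_000
        hill = rng.normal(size=n)                         # disturbance the driver fights
        pedal = -hill                                     # driver exactly cancels the hill
        speed = pedal + hill + 0.05 * rng.normal(size=n)  # pedal causally raises speed

        print(np.corrcoef(pedal, speed)[0, 1])            # ~0 observationally
        # Under do(pedal = pedal + 1), speed would rise by 1: the distribution is
        # unfaithful to the causal graph hill -> pedal -> speed <- hill.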
