Bill Harris writes:

You’ve written about causality somewhat often, and you, along with perhaps everyone who has done anything with statistics, have written that “correlation is not causation.”

When you say that correlation is not causation, you seem to be pointing out cases where correlation exists but causality does not. While that’s important, there’s another case where causality exists but correlation does not, and it, too, comes up all the time in real life.

Think of pouring water into a cup (or make it an urn, since you have a statistics blog). A graph of the inflow of water might look like a long pulse or perhaps a boxcar: it’s zero until a magic point, whereupon it goes to x for a certain time, and then it goes back to 0. That inflow is the cause of the cup or urn filling up.

A graph of the volume (the stock) of water in the cup (urn) doesn’t look at all like that. It’s the integral of that inflow of water. Except in the restrictive case where the inflow is an exponential, the integral of the inflow has a different shape.
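
To make the cup example concrete, here is a minimal Python sketch. The step counts and flow rate are made-up values, and `pearson` is a helper written for this illustration rather than a library function. It accumulates a boxcar inflow into a stock and then computes the same-time correlation between the two series.

```python
from itertools import accumulate
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Assumed boxcar: no flow for 10 steps, 2 units per step for 10 steps, then off again.
inflow = [0.0] * 10 + [2.0] * 10 + [0.0] * 10

# The stock is the running sum (discrete integral) of the inflow, with a time step of 1.
stock = list(accumulate(inflow))

r = pearson(inflow, stock)
print(f"final volume: {stock[-1]}")
print(f"correlation between inflow and stock: {r:.3f}")
```

With these numbers the inflow fully determines the stock, yet the correlation between the two series comes out close to zero.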

Giving students a graph of a net inflow and asking them to sketch the stock (or giving them a graph of a stock and asking them to sketch the inflow) is a standard system dynamics exercise. It comes up in the real world, too. You put money in a bank account and withdraw some, and your bank balance is the integral of the net deposits. You buy inventory for a manufacturing company, and the level of inventory is the integral of the net of inventory purchases less inventory use. To the degree that political opinions are influenced by advertising and advertising effectiveness is proportional to advertising spending, I even imagine that a politician’s share of the vote is proportional to some function of the integral of advertising spending.

I don’t often see people talking about or recognizing that. Have I just missed where you’ve discussed this? If people really do miss that relationship, do you see places where it causes people to mis-assess causal effects?

My response: I agree that the distinction between stock and flow is important and not well understood. Perhaps we should put together a list of mathematical concepts that are relevant in statistics but are not actually part of statistics. Another such concept is the log transformation. It’s important but typically is not taught, and this is a gap for many students, and for many professionals. Many times I’ve seen papers by economists or political scientists where they run a regression on the untransformed scale even though the log scale is the obvious choice. They just don’t know any better. Same with stock and flow: an important concept that statisticians have to learn “on the street.”
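
As a rough illustration of the log-scale point, the Python sketch below fits a straight line to made-up multiplicative data on both scales. The data-generating rule (y = 3 * x**1.5, no noise) and the `ols` helper are assumptions invented for this example, not anything taken from the papers mentioned above.

```python
from math import exp, log

def ols(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up multiplicative relationship, kept noise-free so the shape is easy to see.
xs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
ys = [3.0 * x ** 1.5 for x in xs]

# Raw scale: a straight line is the wrong shape, so the residuals are large and patterned.
a_raw, b_raw = ols(xs, ys)
raw_resid = [y - (a_raw + b_raw * x) for x, y in zip(xs, ys)]

# Log scale: log y = log 3 + 1.5 * log x is exactly linear.
a_log, b_log = ols([log(x) for x in xs], [log(y) for y in ys])

print("raw-scale residuals:", [round(e, 1) for e in raw_resid])
print(f"log-scale slope: {b_log:.3f}, exp(intercept): {exp(a_log):.3f}")
```

On the log scale the slope is the exponent and the intercept is the log of the multiplier, so the coefficients have a direct interpretation that the raw-scale fit lacks.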

Harris adds:

I’d put “units” on the list. Early in my career, I was confused because revenue and assets were both denominated (incorrectly) in dollars.

At least when I do system dynamics models, I (almost) always check units. By the Laplace rule, I figure half the time I get an error (my usual simulator tracks units), it’s a simple typo, but the other half of the time I’ve made a modeling error. I think I do it with statistical models, but I’m sure I fail to do it as much as I should.

Harris continues:

This may be harder than I thought. I like the log transformation idea; I see opportunities for that a lot, but I rarely see it used. The next obvious one is a lack of curiosity: why don’t people plot the data (multiple ways)? Why don’t people plot the residuals? Why don’t people remain curious about what the data could be saying and skeptical about their models longer? I’m not sure those are in the same category, though.

Okay, here’s another potential one: why is it rare to see people ignore feedback effects? Those are often easy to represent with sets of ODEs, as you can do in MCSim and now Stan. I admit that I was a practicing engineer for years, designing and analyzing feedback circuits, before it really hit me how much the feedback interconnection can swamp the effect of the components. I knew and used the formulas, but, when I was troubleshooting a failing feedback system, all the components in the feedback loop seemed to meet their specs, and still it failed. I was stumped. Then I pulled out a then state-of-the-art analyzer that could show me transfer functions in an operating (or failing) feedback loop as it was running, and the answer was clear.
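
One standard way to see how the loop, rather than any single component, can dominate is the textbook negative-feedback formula: closed-loop gain = G / (1 + G*H). The sketch below is only an illustration with assumed numbers (forward gains of 1,000 and 10,000 and a feedback fraction of 0.1); it is not a model of the circuit in the story above.

```python
def closed_loop_gain(forward_gain, feedback_fraction):
    """Gain of the standard negative-feedback loop: G / (1 + G*H)."""
    return forward_gain / (1.0 + forward_gain * feedback_fraction)

H = 0.1  # assumed feedback fraction

# Assumed forward gains: a tenfold spread in the amplifying component.
gains = {G: closed_loop_gain(G, H) for G in (1_000.0, 10_000.0)}

for G, closed in gains.items():
    print(f"forward gain {G:>8.0f} -> closed-loop gain {closed:.3f}")
```

A tenfold change in the forward component moves the closed-loop gain by less than 1 percent; both values sit close to 1/H = 10, so the feedback path largely sets the behavior.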

I’m not sure I get Harris’ point on units. When I think about the difference between revenue and assets, my first thought is of stock and flow (assets are stock, revenue is flow). Perhaps he just means that revenue is dollars per unit time.

Here’s an egregious example of statistical stock-flow confusion that got published:

http://blog.metasd.com/2011/10/linear-regression-bathtub-fail/

Tom:

I followed your link. At the end of your post from 2011 you write:

But I followed your link to Pielke’s page, and in the comments of his own blog he gives this quote from the paper in question:

In a sense this is irrelevant if, as you argue, that paper is seriously flawed. Still, it does not seem that Pielke misread it.

Pielke’s headline is “Are US Floods Increasing? The Answer is Still No” and his first paragraph is “A new paper out today in the Hydrological Sciences Journal shows that flooding has not increased in the United States over records of 85 to 127 years.” Most people would conclude that his unqualified “increasing” implies “increasing over time.” The paper doesn’t address change over time at all, except coincidentally through the correlation with CO2. In any case, uncritically citing a fatally flawed paper as evidence for anything is sloppy.

I think you could take the quote as evidence that Pielke read correctly, in which case he was misled by the language, which seems to discuss absolute increases or decreases rather than correlations with respect to CO2, when it actually concerns the latter.

In any case, the quote is at the end of the paper, and for anyone vaguely aware of atmospheric stocks and flows (i.e. that temperature and precipitation depend on the integral of the radiative effects of CO2), as Pielke surely is, alarm bells should have been going off long before.

John: that was my point. :-)

I’d say the economists I know are all very conscious of the value of log transformations. Political scientists often seem to follow economics anyway.

But people in many branches of statistical science could certainly learn from the habits of natural scientists of keeping track of dimensions and units of measurement as utterly routine and essential. I’ve seen researchers not otherwise naive looking at regression coefficients and failing to realise that most of what they were seeing was a side-effect of differing units. Citing the variance as a univariate descriptive statistic is usually pointless, as its units of measurement often make little sense to researchers, even when they do know that it has different units. Naturally, that’s the reason for standard deviations.
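
A small sketch of the units point, using made-up heights and weights (the data and the helper functions are assumptions for illustration only): rescaling a predictor from meters to centimeters changes the regression coefficient by a factor of 100 and the variance by a factor of 10,000, although nothing about the data has changed.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def sample_variance(xs):
    """Sample variance with the usual n - 1 denominator."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

# Made-up data: height in meters and weight in kilograms.
height_m = [1.50, 1.60, 1.65, 1.70, 1.80, 1.90]
weight_kg = [52.0, 58.0, 63.0, 66.0, 77.0, 85.0]
height_cm = [100.0 * h for h in height_m]

slope_per_m = ols_slope(height_m, weight_kg)    # units: kg per meter
slope_per_cm = ols_slope(height_cm, weight_kg)  # units: kg per centimeter

var_m2 = sample_variance(height_m)    # units: square meters
var_cm2 = sample_variance(height_cm)  # units: square centimeters

print(f"slope: {slope_per_m:.2f} kg/m vs {slope_per_cm:.4f} kg/cm")
print(f"variance: {var_m2:.4f} m^2 vs {var_cm2:.1f} cm^2")
```

The two slopes describe the same fitted line; only the units attached to the number differ.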

The question of why people don’t plot much more is interesting, important, and intricate. The answers are usually tacit, but here are some I’ve encountered:

1. I’m fitting a model with several variables and/or a large dataset. No graphs could represent that adequately; therefore I show no graphs.

2. Graph interpretation is subjective and lacking in rigour.

3. People in my field expect tables, not graphs. If I submit tables as expected and instructed, graphs will be rejected as at best repeating the information given precisely in the tables.

4. I work with categorical data and categorical data can’t be graphed except trivially. You expect me to show pie charts?

5. Graphs are for showing the obvious to the ignorant (nod to Edward Tufte’s wording of a view he does not endorse), so would just raise suspicions that I am not doing cutting-edge research.

6. The effects I am detecting are too subtle to be shown graphically. The result is highly significant, but not important.

7. No graph I could devise supports the argument I am trying to convey. (#6 made more general.)

I should perhaps emphasise that I do know lots of really good reasons for graphing data. I am just trying to diagnose thinking on the other side.

Re. #6 : If the result isn’t important why advertise it?

Rahul: Quite so. You got the point. Some of the rules are ironically stated.

Wow. I read it again. That was a sweet list. Stupid me.

I see a typo in the last paragraph. When I wrote to Andrew, “…, why is it rare to see people ignore feedback effects?”, I meant “… why is it rare to see people address feedback effects?”

That’s good feedback, Bill.

More seriously, if how to deal with feedback effects is not typically taught, where is a good place to learn about it?

nah, you might check out either (feedback) control theory or system dynamics. Control theory is usually linear and often focused on physical systems; system dynamics often deals with nonlinear systems and with systems involving people.

John Sterman’s /Business Dynamics/ is a good if hefty and somewhat encyclopedic text on system dynamics. Click on Tom Fiddaman’s name to see some examples, or click on my name, too (he’s got a higher density of system dynamics per post than I do).

It seems like there’s value in the awareness of stocks and flow inasmuch as it influences our choices of how to represent data. For example, if I represent flow as the instantaneous flow over time (sorry, that’s probably not the right technical description), then Harris is correct that flow at time T won’t look like the volume of liquid in the urn at time T. That’s because flow is being represented as a moment in time, but the volume in the urn as the accumulation. But, if I either measured the change in volume of the urn from time T to time T+1 OR measured the cumulative flow into the cup, wouldn’t the correlation be restored?

I must be missing something important about his point, yes?

This works in simple cases, but not in general in the presence of noise. Then you need something like Kalman filtering.
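
A minimal sketch of why noise matters here, under assumed numbers (a boxcar inflow of 2 units per step and measurement noise with a standard deviation of 5 units on the stock). It does not implement a Kalman filter; it only shows that plain differencing recovers the inflow exactly when the stock is measured without error and poorly when it is not. The `pearson` and `diff` helpers are written for this illustration.

```python
import random
from itertools import accumulate
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def diff(series):
    """Step-to-step change, treating the level before the first step as 0."""
    return [series[0]] + [b - a for a, b in zip(series, series[1:])]

random.seed(0)  # arbitrary seed so the sketch is repeatable

# Assumed inflow: off for 100 steps, 2 units per step for 100 steps, off for 100 steps.
inflow = [0.0] * 100 + [2.0] * 100 + [0.0] * 100
stock = list(accumulate(inflow))

# Exact stock measurements: differencing gives back the inflow.
r_exact = pearson(inflow, diff(stock))

# Noisy stock measurements: differencing subtracts two noisy readings,
# so the measurement noise is amplified relative to the per-step flow.
noisy_stock = [s + random.gauss(0.0, 5.0) for s in stock]
r_noisy = pearson(inflow, diff(noisy_stock))

print(f"correlation with differenced exact stock: {r_exact:.3f}")
print(f"correlation with differenced noisy stock: {r_noisy:.3f}")
```

Each difference of noisy readings carries twice the measurement variance, which here swamps the per-step flow.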

The stock-flow distinction is essentially (mathematically) the same as the CDF-PDF distinction. We probably don’t talk about CDFs enough. When teaching calculus, I did give exercises of the sort “here’s the graph of the derivative; what does that say about the graph of the function?” and vice-versa. So perhaps I have been more inclined to emphasize the CDF/PDF connection in teaching statistics than many people.
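
The correspondence can be checked numerically. The sketch below uses the Exponential(1) distribution as an assumed example, with an arbitrary grid spacing: accumulating the density ("flow") approximates the CDF ("stock"), and differencing the CDF approximates the density.

```python
from math import exp

def pdf(x):
    """Exponential(1) density for x >= 0."""
    return exp(-x)

def cdf(x):
    """Exponential(1) cumulative distribution function for x >= 0."""
    return 1.0 - exp(-x)

dx = 0.001                              # assumed grid spacing
xs = [i * dx for i in range(5001)]      # grid from 0 to 5

# Flow to stock: accumulate the density (midpoint rule) to approximate the CDF.
running = 0.0
cdf_from_pdf = [0.0]
for x in xs[:-1]:
    running += pdf(x + dx / 2) * dx
    cdf_from_pdf.append(running)

# Stock to flow: difference the CDF to approximate the density at interval midpoints.
pdf_from_cdf = [(cdf(b) - cdf(a)) / dx for a, b in zip(xs, xs[1:])]

max_cdf_err = max(abs(c - cdf(x)) for c, x in zip(cdf_from_pdf, xs))
max_pdf_err = max(abs(p - pdf(a + dx / 2)) for p, a in zip(pdf_from_cdf, xs))

print(f"largest error, CDF rebuilt from the density: {max_cdf_err:.2e}")
print(f"largest error, density rebuilt from the CDF: {max_pdf_err:.2e}")
```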

Yes, logs are often not understood. One thing I do in teaching is to emphasize that although we tend to count additively (on our fingers; 1, 2, 3, ..), Nature often counts multiplicatively (1, 2, 4, 8, …). So taking logs is just translating Nature’s way to our more usual way (and using logs base 2 is often easier to interpret than natural logs or logs base ten!).
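
A tiny sketch of the base-2 point, using a made-up doubling sequence: on the log2 scale each step is one doubling, so multiplicative counting turns into ordinary additive counting, while base 10 gives less readable steps.

```python
from math import log2, log10

counts = [1, 2, 4, 8, 16, 32]  # made-up sequence that doubles at every step

doublings = [log2(c) for c in counts]   # one unit per doubling
decades = [log10(c) for c in counts]    # one unit per factor of ten

for c, d2, d10 in zip(counts, doublings, decades):
    print(f"count {c:>2}: log2 = {d2:.1f}, log10 = {d10:.3f}")
```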

Hmm — maybe I’d better post some of my exercises from the prob/stat course I taught for teachers — both these points are addressed in them.

OK, I’ve added some stuff that might be of interest. Click on my name above to go to the page they’re linked from.

Thanks Martha

Adam, MIT’s John Sterman has done work in this area. http://jsterman.scripts.mit.edu/~jsterman/docs/Sterman-2009-DoesFormalSystemDynamics.pdf is one article that gives examples where people struggle with what seems to be a simple task. Linda Booth Sweeney and Sterman have published http://web.mit.edu/jsterman/www/StermanSweeney.pdf, which shows a practical issue: climate change. A large number of people (in their samples) seem to struggle with understanding the effect of changes in CO2 emissions on atmospheric CO2. http://scripts.mit.edu/~jsterman/docs/Sterman-2007-UnderstandingPublicComplacency.pdf is one article from their research. I think I have more concise versions of both somewhere.

Thanks! I appreciate the links to additional examples! :)

Avoid parentheses. See Pinker:

http://www.amazon.com/Sense-Style-Thinking-Person%C2%92s-Writing/dp/0670025852/ref=asap_bc?ie=UTF8

Andrew:

Some ideas for your list:

It’s a good habit to drill undergrads never to calculate anything without explicitly writing all units & ensuring units cancel out. As an aside, it helps you handle messy mixed-units situations.

Another sanity check is to make sure nothing in a formula that’s inside an exp() or ln() etc. ever has a leftover unit.

Also, remember to check extreme conditions in any formula to see if it yields expected results, e.g. t=0, x=infinity, y=x etc. This has helped me catch bugs / typos / model errors so often.

Yet another is to always be wary & careful with any naked numerical constants in formulae, e.g. the 180 in this one: http://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation

PS. Maybe this is too basic & obvious for the readers here. But it might help beginners / students.

Checking whether a time series exhibits a unit root (stock) or is stationary (flow) is one of the first things to do in econometrics. I don’t know how insular those concepts and terminology are; they don’t really seem to come up in general statistical texts.

Let me just say that even in applied Microeconomics we don’t use those terms (we don’t really do time series in general). So I am guessing that since they haven’t moved from Macro to Micro, they are pretty likely not to be common outside of economics.

That said – only an economist could name a “flow” as “stationary”.

They are not explicitly called stock and flow, but since you go from one to the other by differentiating, they might as well be.
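
A minimal sketch of that relationship, with simulated data (the seed, series length, and `lag1_autocorr` helper are assumptions for illustration): a random walk is the running sum of stationary shocks, and first-differencing the walk gives the shocks back.

```python
import random
from itertools import accumulate

def lag1_autocorr(xs):
    """Lag-1 sample autocorrelation."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

random.seed(1)  # arbitrary seed so the sketch is repeatable

shocks = [random.gauss(0.0, 1.0) for _ in range(500)]  # stationary series ("flow")
walk = list(accumulate(shocks))                        # unit-root series ("stock")

# First differences of the walk, with the level before the first step taken as 0.
recovered = [walk[0]] + [b - a for a, b in zip(walk, walk[1:])]

print(f"lag-1 autocorrelation of the shocks: {lag1_autocorr(shocks):.3f}")
print(f"lag-1 autocorrelation of the walk:   {lag1_autocorr(walk):.3f}")
print(f"largest gap between shocks and differenced walk: "
      f"{max(abs(r - s) for r, s in zip(recovered, shocks)):.2e}")
```

The accumulated series is highly persistent while the shocks are not, which is the pattern a unit-root check looks for.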

Regarding units, I wish this were a more common concept in basic statistics. In high school, I took my first statistics class at the same time as my first chemistry class. The beginning of chemistry focused a lot on unit conversions. I had a big “a-ha” moment in statistics when I conceptualized z-scores (standardizing variables) as converting data to the units of “standard deviations”. This is covered nicely in ARM Chapter 4, but it’s a concept worth emphasizing again and again to new students.
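
The z-score-as-unit-conversion idea can be sketched in a few lines. The heights are made up and `z_scores` is a helper written for this illustration: the same data in centimeters and in inches give identical z-scores, because the result is expressed in units of standard deviations.

```python
from statistics import mean, stdev

def z_scores(xs):
    """Convert values to units of sample standard deviations from the mean."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0, 190.0]  # made-up heights
heights_in = [h / 2.54 for h in heights_cm]               # same heights in inches

z_cm = z_scores(heights_cm)
z_in = z_scores(heights_in)

for a, b in zip(z_cm, z_in):
    print(f"z from cm: {a:+.3f}   z from inches: {b:+.3f}")
```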

As an engineering undergraduate I took a (mandatory) “dimensional analysis” course, which was all about making sure that the units were sensible — and deriving things like Reynolds Number just by thinking about what variables are likely to be important, and then finding a combination of them that was dimensionless. Once it clicks, as others have stated, it’s a useful way of thinking.

But it’s not an obvious concept — when I tried to show that the difference between a probability and a probability density can be explained by thinking about the units, I got a lot of blank stares.

Oh yes. “The density estimation code must be wrong because the graphed results go above 1, which is impossible.”
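
A short sketch of why a density can legitimately exceed 1, using an assumed normal distribution with a standard deviation of 0.1 meters. The density is probability per unit of x, so its height depends on the units; only the area under it must equal 1.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Normal density; its units are 1 / (units of x)."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))

sd_m = 0.1  # assumed standard deviation, in meters

# Peak height in probability per meter: greater than 1, and that is fine.
peak_per_m = normal_pdf(0.0, 0.0, sd_m)

# The area under the curve (a simple Riemann sum over -1 m to 1 m) is what must equal 1.
dx = 0.001
grid = [-1.0 + i * dx for i in range(2001)]
total_prob = sum(normal_pdf(x, 0.0, sd_m) * dx for x in grid)

# Same distribution expressed in centimeters: the peak is 100 times lower.
peak_per_cm = normal_pdf(0.0, 0.0, sd_m * 100.0)

print(f"peak density: {peak_per_m:.3f} per meter, {peak_per_cm:.5f} per centimeter")
print(f"total probability: {total_prob:.6f}")
```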

Yes, these points are sadly often lost. (The book “Street-Fighting Mathematics” has a good introduction to this method). Personally, I find it helpful to write probability densities as differential forms rather than functions (e.g., the density form of the Exponential distribution is exp(-x)dx for x >= 0). This notation emphasizes that densities transform differently than cumulative distribution functions under changes of coordinates.

I take the general points, but do we really have to use the terms ‘stock’, ‘flow’ and ‘systems dynamics’? This is not the 60s/70s etc. The correct term is ‘dynamical systems’.