Freddy Garcia writes:

I read your post Vine regression?, and your phrase “I love descriptive data analysis!” makes me wonder: How do you do a descriptive analysis using regression models? Maybe my question could be misleading to a statistician, but I am an economics student. So we are accustomed to thinking in causal terms when we set up a regression model.

My reply: This is a funny question because I think of regression as a predictive tool that will only give causal inferences under strong assumptions. From a descriptive standpoint, regression is an estimate of the conditional distribution of the outcome, y, given the input variables, x. This can be seen most clearly, perhaps, with nonparametric methods such as Bart which operate as a black box: Give Bart the data and some assumptions, run it, and it produces a fitted model, where if you give it new values of x, it will give you probabilistic predictions of y. You can use this for individual data, or you can characterize your entire population in terms of x’s and then do Mister P. (That is, you can poststratify; here Bart is playing the role of the multilevel model.) It’s all descriptive.
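To make the poststratification step concrete, here is a minimal sketch. It uses ordinary least squares as a stand-in for Bart (a deliberate simplification, so it gives point predictions rather than probabilistic ones). The sample, the two binary inputs x_a and x_b, and the population cell counts are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: it over-represents people with x_a = 1 (70% of the sample).
n = 2000
x_a = rng.binomial(1, 0.7, size=n)
x_b = rng.binomial(1, 0.5, size=n)
y = 2.0 + 1.0 * x_a + 0.5 * x_b + rng.normal(scale=1.0, size=n)

# Fit the descriptive model: an estimate of E(y | x_a, x_b).
X = np.column_stack([np.ones(n), x_a, x_b])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hypothetical population table: one row per (x_a, x_b) cell, with its count.
# In this made-up population only 20% of people have x_a = 1.
cells = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
counts = np.array([400, 400, 100, 100])

# Predict y in every cell, then average the predictions with population weights.
cell_pred = np.column_stack([np.ones(len(cells)), cells]) @ beta
poststrat_mean = np.sum(counts * cell_pred) / counts.sum()
raw_mean = y.mean()

print(f"raw sample mean:      {raw_mean:.2f}")
print(f"poststratified mean:  {poststrat_mean:.2f}")
```

Nothing here is causal: the model describes how y varies with x in the sample, and the poststratification re-weights those descriptions to match the population.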

I train my students to summarize regression fits using descriptive terminology. So, don’t say “if you increase x_1 by 1 with all the other x’s held constant, then E(y) will change by 0.3.” Instead say, “Comparing two people that differ by 1 in x_1 and who are identical in all the other x’s, you’d predict y to differ by 0.3, on average.” It’s a mouthful but it’s good practice, cos that’s what the regression actually says. I’ve pretty much trained myself to talk that way.
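The comparison reading can be checked directly on a fitted model. The sketch below is only an illustration on simulated data (the coefficients, sample size, and the two hypothetical people are assumptions, not anything from a real analysis): it fits a linear regression and then compares the predictions for two people who differ by 1 in x_1 and are identical in x_2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data; the predictors are correlated, as they usually are in practice.
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 0.3 * x1 - 0.2 * x2 + rng.normal(scale=0.5, size=n)

# Least-squares fit of y on an intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x1_value, x2_value):
    """Predicted average y for a person with the given x1 and x2."""
    return beta @ np.array([1.0, x1_value, x2_value])

# Two hypothetical people: same x2, x1 differs by 1.
person_a = predict(1.0, 0.4)
person_b = predict(2.0, 0.4)
difference = person_b - person_a

print(f"coefficient on x1:          {beta[1]:.3f}")
print(f"predicted difference (b-a): {difference:.3f}")
```

The predicted difference equals the coefficient on x_1 by construction. The code never changes anyone's x_1; it only compares two rows.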

Of course, as Jennifer says, there’s a reason why people use causal language to describe regressions: causal inference is typically what people want. And that’s fine, but then you have to be more careful, you gotta worry about identification, you should control for as many pre-treatment variables as possible but not any post-treatment variables, you should consider treatment interactions, etc.

Ha. I just fit a (trivial) regression of Dissolved Oxygen to water salinity.

So, is it wrong to say something like “if salinity increases by x % the DO will go down by y %”?

Or is it because we have extraneous strong reasons to know the causal direction in the hard sciences so we are allowed to say such things? i.e. Typically salinity influences Dissolved Oxygen levels and not the other way around.

If you run the experiment where you have some water with a given salinity and a given DO and then you add salt and the DO goes down… then you’ve got a causal model and your only purpose for the regression is to determine the shape of the relationship. More or less, “causal” means “the response changes when I force a change in the controlled variable.”

On the other hand, you could imagine someone brings you 13 test tubes of water each of which has a given salinity and you measure it, as well as the DO… can you determine from this that the relationship is purely the causal effect of salinity? No. For example, the lab tech could be screwing around with you, bubbling pure O2 through some of them, and submitting others to vacuum degassing, etc.
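A rough simulation of the two situations, with every number invented for illustration (the true slope, the noise levels, the rule the hypothetical lab tech uses for bubbling O2, and far more than 13 tubes so the pattern is easy to see):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
true_slope = -0.05  # assumed change in DO per unit of salinity

# Case 1: we set the salinity ourselves and nothing else is tampered with.
sal_exp = rng.uniform(0, 40, size=n)
do_exp = 9.0 + true_slope * sal_exp + rng.normal(scale=0.3, size=n)

# Case 2: tubes are handed to us. Unknown to us, the lab tech tended to
# bubble pure O2 through the saltier tubes, which raises their DO.
sal_obs = rng.uniform(0, 40, size=n)
bubbled = (sal_obs + rng.normal(scale=5, size=n)) > 20
do_obs = 9.0 + true_slope * sal_obs + 3.0 * bubbled + rng.normal(scale=0.3, size=n)

def slope(x, y):
    """Slope of the least-squares line of y on x."""
    return np.polyfit(x, y, 1)[0]

slope_exp = slope(sal_exp, do_exp)
slope_obs = slope(sal_obs, do_obs)

print(f"slope when we control salinity: {slope_exp:.3f}")
print(f"slope from the handed-in tubes: {slope_obs:.3f}")
```

Both regressions are valid descriptions of their own data. Only the first also describes what happens when you add salt.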

Lots of social science is more like the latter situation, where you don’t know what else happened. Still, economists have causal models for people’s behavior. If you raise the price of fuel oil for 15 years, you expect more people who have to install new heaters to use electric heat pumps, or whatever. Tying those causal models to the observed data is often what economists spend their time doing.

I always feel like regression is descriptive. I like the way you are stressing the comparison of two observations rather than “change in” one observation. If we really knew the impact of changing a variable within one unit, that would start to seem causal to me, though then we are down the rabbit hole of predicting that change (and uncontrolled variables related to that) unless there is an RCT, which is in many cases not possible (e.g. you can’t change someone’s parents’ level of education or their long term party identification).

This is good; I think I’m going to rephrase a homework problem to be more wordy like that. It’s also a good distinction between the way students have learned to talk about lines in algebra and the way they need to do it in statistics.

What is your view on explanations such as: “an increase in x_1 by 1, with all the other x’s held constant, is on average associated with an increase in Y of 0.3”? I sometimes use this to avoid explicitly causal language – but your preferred formulation is quite different. Do you think this is ok?

To be clear I do like your formulation – but was wondering what you think about the above, which I think tries to explain the partial correlations in words.

I don’t think your formulation is ok because we don’t observe increases within individual units.

@JB:

Another problem with your proposed explanation is that in most real-world problems, the “predictors” are not independent, so “an increase in x_1 by 1, with all the other x’s held constant” is off in fairy-tale land.

In contrast, Andrew’s “Comparing two people that differ by 1 in x_1 and who are identical in all the other x’s” avoids any suggestion of a mechanism (e.g. “increase” as opposed to “differ by”) that suggests thinking causally.

+1

“causal inference is typically what people want”

I always told my students to consider the science of the situation for causality. For example, in the famous Motor Trend cars dataset, it is fair to consider the miles per gallon (MPG) of an automobile as being directly affected by the weight of a vehicle. On the other hand, I don’t know enough about automobiles to know if there is a physical connection between the number of cylinders and the MPG.

…but surely there’s no causal mechanism for MPG to affect the cylinder number of a car?

Put a big dumb blocky high drag SUV body on the car and everyone complains it doesn’t accelerate well enough, so you put a bigger engine in ;-)

With few exceptions, increasing or decreasing the number of cylinders in a car’s engine will bring its MPG to zero.

I wonder if MPG could “cause” the number of cylinders in a car, given the presence of certain regulations on the set of vehicles a manufacturer produces. If there is a regulation that 90% of each manufacturer’s cars must get better than 50 MPG (say), the other 10% are going to skew to being high margin luxury cars, meaning more cylinders.

Doesn’t it take more fuel to run an 8-cyl engine than a 4-cyl? It would seem so, unless the marginal power output increases with each cylinder, which I’m sure it doesn’t. Plus, an 8-cyl engine is heavier than a 4-cyl engine, so it takes more gas to move a car with an 8-cyl than the same car with a 4-cyl.

It’s complicated. Engine displacement, compression ratio, supercharging, and maximum RPM all probably influence fuel consumption more than does number of cylinders. The old Offy four cylinder engines used for racing could generate more than 1000 HP from four cylinders. They probably got very low mileage. See https://en.wikipedia.org/wiki/Offenhauser

Bob

Driving habits also influence fuel consumption. :~)

And all kinds of things influence choice of car.

“Predictive,” to me, implies causality to the casual reader. (For example, Wikipedia on “predictive modeling” starts with “Predictive modeling uses statistics to predict outcomes.” “Outcome” is an “effect,” suggesting that the other thing is a “cause” because it happened earlier in time. To its credit, Wikipedia corrects itself in the next sentence, but that suggests the first is confusing.)

How about calling it “correlation modeling?” Or what language has anyone else tried?

To me “predicted outcomes” doesn’t mean causal effect. It’s just some variable. The idea that I could predict one variable using a variable doesn’t mean I can actually understand what’s happening. I can predict someone will click on an ad based on its color, but can I explain why that is? I can predict that religious affiliation in the US will decline next year just as it has for many years, but that doesn’t explain why there is a decline.

@Dzhaughn: You are correct that “predictive” does imply “causality” to some (many?) people. So we need to be careful to specify just what we mean by the term.

Another nice way to highlight the puzzle of suppression effects: “Comparing two people that differ by 1 in x_1 and who are identical in all the other x’s, you’d predict y to differ by 0.3, on average………but if the two people differ by 1 in x_1 and are identical in all the other x’s AND identical in z, then you’d predict y to differ by 0.7, on average.”
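A small simulation can reproduce that pattern. This is only a sketch: the data are simulated, the other x’s are left out for brevity, and the coefficients were chosen (as an assumption) so that the comparison moves from about 0.3 to about 0.7 once z is added to the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: z is correlated with x1 and is associated with y
# in the opposite direction.
n = 5000
x1 = rng.normal(size=n)
z = 0.8 * x1 + rng.normal(size=n)
y = 0.7 * x1 - 0.5 * z + rng.normal(scale=0.5, size=n)

def fit(*columns):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(n), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

coef_without_z = fit(x1)[1]   # compare people who differ by 1 in x1
coef_with_z = fit(x1, z)[1]   # ...and who are also identical in z

print(f"predicted difference, z not in the model: {coef_without_z:.2f}")
print(f"predicted difference, also matched on z:  {coef_with_z:.2f}")
```

Both numbers are correct descriptions. They answer different comparison questions, because “identical in all the other x’s” refers to a different set of variables in each fit.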

The question here is when the results of an analysis can be interpreted as causal. The statistical methods report no more than associations, so the causality needs to be inferred from other things we know.

If we had a randomized control trial with perfectly balanced covariates, and complete adherence to the protocol, and we used regression to estimate the treatment effect, we would not object to causal inference. A bit of overkill, since a difference of means test might be sufficient, but suppose we wanted to assess whether the effect was larger in some segment of the study population than others. So, an interaction is added of treatment with the covariate, or the data are stratified by the covariate. Risk of wandering down the garden of forking paths, but still causality is a reasonable interpretation.
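As a sketch of how the interaction and the stratified analysis line up, here is a simulated trial with a binary treatment and one binary covariate (called segment here; the effect sizes and sample size are invented, and simple coin-flip randomization stands in for the perfectly balanced design). With a full interaction, the regression coefficients reproduce the within-segment differences in means exactly.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated trial: treatment is assigned by coin flip, independent of segment.
n = 4000
segment = rng.binomial(1, 0.5, size=n)
treat = rng.binomial(1, 0.5, size=n)
y = (1.0 + 0.5 * treat + 0.2 * segment + 0.4 * treat * segment
     + rng.normal(size=n))

# Regression with a treatment-by-segment interaction.
X = np.column_stack([np.ones(n), treat, segment, treat * segment])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def diff_in_means(seg):
    """Treated-minus-control mean of y within one segment (stratified analysis)."""
    in_seg = segment == seg
    return y[in_seg & (treat == 1)].mean() - y[in_seg & (treat == 0)].mean()

effect_seg0 = diff_in_means(0)
effect_seg1 = diff_in_means(1)

print(f"segment 0: regression {beta[1]:.3f}, difference in means {effect_seg0:.3f}")
print(f"segment 1: regression {beta[1] + beta[3]:.3f}, difference in means {effect_seg1:.3f}")
```

The regression adds nothing beyond the stratified differences here. What licenses the causal reading is the randomization, not the choice of estimator.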

It gets murkier when using observational data. There are statistical methods that are routinely used to control for selection and thus make inferences of causal relationship more likely — instrumental variables, selection models, propensity scoring — that have emerged in fields like economics, medicine and health care more generally, where there is substantial observational data available and limited opportunities for randomization. But these can be neither fully convincing nor fully effective.

Epidemiologists add some further tests – dose/response relationships, temporal relationship, etc. These are based on some basic models of the processes that are believed to exist in the world and are being tested with the analysis. The analysis is testing the consistency of the data not with a statistical model, but with a causal model, and A causes B shouldn’t be the case if B precedes A.

Andrew’s list “you gotta worry about identification, you should control for as many pre-treatment variables as possible but not any post-treatment variables, you should consider treatment interactions, etc.” feels right about extending an understanding of when and how to make causal inferences, but to me doesn’t get to the core of the issue, except indirectly in its comment that “you gotta worry about identification.” Causal inferences have to come first from consistency of observed associations with a separate causal/conceptual model of the processes that are believed to be producing the result. The models tested should be based upon the conceptual model. This is what economists, for example, claim to do when they provide a theoretical model of consumer or firm behavior and design their regressions to test the predictions of the models. Model specifications and analytic strategy should also allow ruling out other competing explanations/hypotheses. These are the purposes of worrying about identification, but the modeling should go beyond that. The RHS of a regression model, to the extent it reflects the conceptual model, can get busy, and modeling interactions analytically, or causal links through mediation analysis, is reasonable and often necessary to make the regression reflect the conceptual model being explored.

It is easy to draw questionable causal inferences from a regression. In Andrew’s field of voting behavior, if I observed an association of higher income with voting Republican, I would hesitate to draw the inference that if I increased the income of an individual, I would increase their likelihood of voting Republican. If I had data on change of income and found the association, I might feel more comfortable drawing causal inferences; and if I had baseline voting behavior at the initial income level, and observed a shift to Republicans when income increased, I would be more confident still in drawing a causal link from the observed association. Behavior as a result of change in income is what I want to understand, and cross sectional differences in income without data on change make testing this model harder.

While it is easy to draw questionable causal inferences from regressions, and I’ve trained my students well to routinely write into their papers that the observed association is not necessarily causal, the underlying motivation of analysis is often a causal model and the desire to see if the results are consistent with the model. Without a model, without choosing the sample and selecting the RHS variables to reflect the model, without worrying about jointly modeling what would be expected in a competing causal model (or rules such as those of the epidemiologists about when relationships look causal), we observe nothing but associations, and causal inferences are not appropriate. But if there is an underlying causal model and the analysis is designed to test whether the links in the model are observed, I would argue that drawing causal inferences is legitimate and language implying causality is not inappropriate.