## How to think about correlation? It’s the slope of the regression when x and y have been standardized.

Dave Balan writes:

I am an economist at the Federal Trade Commission with a very basic statistics question, one that I have put to several fairly high-powered econometricians, and to which no one has had a satisfying answer.

The question is this. Why are correlations meaningful? We know that they are ubiquitous, they get reported all the time in work across many disciplines. But for the life of me I cannot understand what the question is to which a correlation is the answer. I get that it’s sometimes useful to know whether or not the correlation is close to 0; if it is close to 0 then you know that it’s not too far from the truth to say that no (linear) relationship exists, and that might be all you need to know. By the same token, a correlation of, say, 0.9 tells you that it’s nowhere close to being true that no linear relationship exists, so you need to go further and investigate what that relationship is. What I can’t understand is why people interpret that 0.9 as a meaningful standalone number in its own right. A correlation of 0.9 means that the data lines up pretty nicely along some line with a positive slope, but that slope can be anywhere from just above 0 to just below infinity. What good does it do to know that a strong linear relationship exists when you have no idea what that relationship is?

To take the example of your recent (very interesting) election work, a finding that the correlation in the polling errors between State A and State B is 0 would clearly be important and relevant. And so a finding that the correlation is far from 0 is clearly important insofar as it tells you that it’s definitely not OK to assume that it’s zero. But what is its importance beyond that? What good does it do to know that the polling errors between State A and State B are highly correlated if you don’t know whether a 1 percentage point error in state A is associated with an error of 1 percentage point, or 0.1 points, or 2 points in State B?

I know that correlations have the advantage of being unit-free. And that’s nice, but it doesn’t seem to solve the problem.

Am I missing something fundamental here? If so, I hope you will share what it is. If not, is it a serious problem? Is there some other unit-free number that could be used instead? Maybe something like the elasticities that economists use?

I replied that the way I think about the correlation is that it’s the slope of the regression of y on x if the two variables have been standardized to have the same sd. And I pointed him to section 12.3 of Regression and Other Stories, which discusses this point.
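To make that concrete, here is a minimal sketch using only the Python standard library. The data are simulated (an assumption for illustration, not anything from the book), and the helper names `correlation`, `ols_slope`, and `standardize` are made up for this example. It checks that the least-squares slope on standardized variables matches the correlation in both directions, while the raw slope depends on the units.

```python
import math
import random
import statistics as st

def correlation(x, y):
    """Pearson correlation: Cov(x, y) / (sd(x) * sd(y))."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(x, y):
    """Least-squares slope from regressing y on x: Cov(x, y) / Var(x)."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(v):
    """Rescale to mean 0 and sd 1."""
    m, s = st.mean(v), st.stdev(v)
    return [(a - m) / s for a in v]

# Simulated data (hypothetical): x and y live on very different scales.
rng = random.Random(0)
x = [rng.gauss(50, 10) for _ in range(500)]
y = [3.0 * a + rng.gauss(0, 40) for a in x]

r = correlation(x, y)
raw_slope = ols_slope(x, y)           # depends on the units of x and y
zx, zy = standardize(x), standardize(y)
slope_y_on_x = ols_slope(zx, zy)      # matches r
slope_x_on_y = ols_slope(zy, zx)      # also matches r

print(f"r = {r:.3f}, raw slope = {raw_slope:.3f}")
print(f"standardized slopes: {slope_y_on_x:.3f} (y on x), {slope_x_on_y:.3f} (x on y)")
```
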

Balan followed up:

Below is my [Balan’s] attempt at some intuition:

A. Since the correlation is the common slope of the y-on-x regression line and the x-on-y regression line, the dots must be configured in such a way that they look pretty much the same if you flip the axes.

B. The only way that that can be true is if the dots lie around some line with a slope of 1.

C. Note that this does NOT mean that the regression line through those dots is 1, rather it has to be <= 1 (per your book).

D. Since the dots line up along a line with a slope of 1, they will still line up along a line with a slope of 1 when you flip the axes. The intercept might change, but the slope won’t.

E. And since the orientation of the dots does not change much (and in the limit doesn’t change at all), the regression line through them does not change either.

The part that I had a hard time understanding was why it is impossible for the dots to line up perfectly along a line with a slope other than 1, or to line up imperfectly along a line with a slope equal to 1. I think this is where the assumption of equal sd matters. If two variables have the same sd, then having a correlation of 1 means that they are basically the same variable (possibly shifted due to different means), which means that the only line that they can line up perfectly along has a slope of 1. Similarly, if they do not have a correlation of 1, then the regression to the mean described in your book kicks in so that the regression line must be less than 1 and the randomness means that the dots will not line up perfectly along that line.

To which I responded: Yes, corr is like a rescaled regression coefficient. Sometimes this makes sense, other times it does not. For example if you are computing elasticity, which is roughly speaking the regression of log(output) on log(input), then standardization would make no sense at all. But if x and y are two different standardized tests, it could make sense to renorm each to have mean 0 and sd 1.
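A rough sketch of the elasticity case, with simulated data and a hypothetical true elasticity of 0.7 (both are assumptions for illustration). The slope of log(output) on log(input) recovers something close to the elasticity, while the correlation of the same two logged variables is a different number that says nothing about it.

```python
import math
import random

def slope_and_corr(x, y):
    """Return (least-squares slope of y on x, Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
TRUE_ELASTICITY = 0.7  # hypothetical value chosen for this sketch

# Simulated input/output pairs: output = 2 * input^0.7 * multiplicative noise.
inputs = [rng.uniform(1, 100) for _ in range(1000)]
outputs = [2.0 * a ** TRUE_ELASTICITY * math.exp(rng.gauss(0, 0.1)) for a in inputs]

log_in = [math.log(a) for a in inputs]
log_out = [math.log(b) for b in outputs]

elasticity, r = slope_and_corr(log_in, log_out)
print(f"log-log slope (elasticity estimate) = {elasticity:.3f}")
print(f"correlation of the logged variables  = {r:.3f}")
```
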

1. Mike says:

Here’s a nice blog post making the same observation in a slightly different way http://composition.al/blog/2018/08/31/understanding-the-regression-line-with-standard-units

2. jonathan says:

I think correlation is fascinating. You have at least 2 things which relate to each other in variation to other things. Which means the complexity associated with both is treated in some fashion as (I hate to say) unitary but it can be, so the variation in how the imaginary component as the complex larger field reduces to or affects the smaller field that is the correlation, which then feeds into something that is more real, meaning a pin or other attachment to an existence at the level which is your baseline. That’s why I think of correlations in a non-statistics sense as complex reals: they are like middlemen that exist in complex reality, where what actually happens derives through and into them. Like a family: it’s a construct that has many layers which we can examine without ever being able to render the actuality of family except through perspectives.

I find myself thinking about absolute value with these because a physical existence or measurement is true in an absolute sense like an empty set is attached to the set that then fills (as opposed to the null set that underlies all empty sets, when you pursue the induction of null far enough that it becomes uncountable). Sorry for the diversion, but I’ve been playing around with trying to represent uncountability. So to represent null to empty, which eventually leads to correlations, where I’m at is thinking of functions and mapping them to 0 to 1 segments. The two best examples I’ve run into are the ! factorial and the exponent (above 1, because below is another story and not meaning e or some other specific exponential expression). So borrowing JvN’s extensible universe idea, then I run the functions back and forth, even and odd, add a line each time expanding from a vertex. They eventually become so large that they become infinite in both directions. That describes the continuum because you can then just divide them in half and that’s the power set of infinity. The ! function then means the construction of inner identity, because rather bluntly it is the count of all the relationships within whatever n you want. The exponent, meaning something to n over 1, like the powers of 10, magnifies identity, literally and bluntly making more of the same, 10 then 100 then gazillion of little 1’s all ideally the same. So that field of identical external identity, which is not a phrase I’ve ever used before typing it, meets identical internal identity.

This generates correlation. It has to because it matches internal and external at the abstract level. So, since any box is uncountably infinite, any box contains these identity matches. I use this to make a grid of squares, so there is a gs for a grid square with mappings to GS, which relates to but is not the same as measures of infinity because it ‘boxes’ these inner and outer identity statements, along with other functions, and it is the expandable, contractable mappings of these boxes which become correlations and what I call complex reals. (It’s interesting but you can construct not only layers but entire universes within complexity, which basically describes existence as a projection of complexity as it attaches to points which become ‘tangible’ across that GS box. That is judgement free; this reality is a projection as much as any other reality.)

So, I like to think of a gym. A correlation is like a barbell with weight plates on the ends: you put it on your shoulders or you pick it up, and the weight on either end could fall or your knee could crumble or your back give out or you could wobble like crazy but still hit it. But you can see a perfect lift and you can imagine that is your correlation, which it is, meaning that you can start at one end of the barbell and see the form remains consistent. Or go the other way. And you can go up and down looking at your form too. Plus other issues, like tilt, so the form looks perfect until you analyze the other angles and notice gaps from ideal. That’s complex but it’s real. Lots goes into it, and you can change your axis at will – and should – because the rotations through complexity reveal the differences, and those differences have slopes. And then when you identify an area – like for me, tilt on squats – then the area comes into view and you can then treat that as a segmented area, meaning as a gs:GS and GS:gs relationship. That means you can treat each end as 1 and 0 and that constructs the imaginary unit circle so you are doing complex rotations over the smaller, focused area – that smaller GS:gs box – and that is also a complex real because that translates into maybe one day I can lower the bar straighter. (Though I doubt it.)

It’s really cool to think about the barbell as apparent: it’s really dumbbells that you imagine are firmly connected with a big bar but which have all sorts of complexity within their movements in your hands. So you can be easily misled by correlations that are apparent if you think barbell but which go away when you realize ‘dumbbell’. (And yes, that was a joke.)

So when you can switch around the x and y, you’re complexly rotating over z, and that means you are counting internal identity and external identity across that rotation. Really shows the importance of how you normalize and standardize.

This has been a blast. I’ve mostly been working on the insides of grid squares, because those are what you’re manipulating when you do ! or ^n. So for null, if you segment that to empty, you generate the null identity over the same complex field, which generates the empty set over the same field, which attaches to kinds of emptiness because otherwise truly empty is null, which is the point of the segmentation. And then you generate sets and universes.

(I think, btw, that Cantor’s original conception of Aleph null is this: the null existence that is in binary relationship, so a power set with, existence. We spend our time examining one side of that. I’ve not found anything that says he could say that, but I believe he could see that concept. That is where I attach Cantor to his Jewish roots, because that idea articulates the opposite of the uncountable God of Judaism (which at least superficially resists incarnation in complex reals). It’s a guess but I mostly work with his interpolative method, just developed into a model so there is no person drawing the lines and inserting points.)

3. Carlos Ungil says:

> What good does it do to know that a strong linear relationship exists when you have no idea what that relationship is?

If you want to know what the relationship is, just calculate it. Or, if someone tells you that the correlation is high you may be inclined to ask for more details. Knowing that the correlation is large is exactly as useful (or useless) as knowing that R-squared is high.

4. Zhou Fang says:

> What good does it do to know that a strong linear relationship exists when you have no idea what that relationship is?

Well, what use is *any* single statistical value without a context and assumptions surrounding it? If you know a correlation *and* a regression coefficient, then you know a little bit more. If you know a correlation and regression coefficient and an interval estimate on the coefficients then you know a bit more than that. All statistics serve to distill and compress information. High correlation tells you, for instance, that you should be cautious about modelling with both variables in the model due to issues of multi-collinearity, and could be useful as a first pass indicator to go look for causal connections or perform more sophisticated modelling.

5. Ben S. says:

I remember being similarly puzzled when I was learning about heritability in grad school. You sometimes hear people say heritability is the proportion of offspring phenotypic variance explained by parental phenotypic variance, which implies heritability is a correlation. However that is incorrect: heritability isn’t the correlation r^2, it is the estimate of the slope of the regression. So I remember thinking, why should anyone care about the correlation ever? Ultimately what I tried to file away in my memory bank is that correlation is “symmetric”; it doesn’t matter which variable is X and which is Y in Cov(X,Y) / sqrt(Var(X) Var(Y)). When you have no supposition about causality, it makes sense to use correlation. When you have some reason to suppose causality (e.g. offspring phenotype is caused to some degree by parental phenotype), then it makes sense to look at the slope Cov(X,Y) / Var(X) which is “asymmetric” in that it matters which variable is which.
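The symmetric/asymmetric distinction can be sketched numerically. The parent and offspring values below are simulated with hypothetical numbers chosen so the two variances come out roughly equal; nothing here is real heritability data. The correlation is the same whichever variable is called X, the two regression slopes differ, and their product equals r^2.

```python
import math
import random

def moments(x, y):
    """Centered sums of squares and cross-products: (sxx, syy, sxy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

rng = random.Random(2)
# Hypothetical trait values: sd 7 for parents; the noise sd of 5.6 makes the
# offspring sd come out at about 7 as well (0.6^2 * 49 + 5.6^2 = 49).
parent = [rng.gauss(170, 7) for _ in range(5000)]
offspring = [170 + 0.6 * (p - 170) + rng.gauss(0, 5.6) for p in parent]

sxx, syy, sxy = moments(parent, offspring)
r = sxy / math.sqrt(sxx * syy)      # symmetric in the two variables
slope_off_on_par = sxy / sxx        # Cov(X, Y) / Var(X)
slope_par_on_off = sxy / syy        # Cov(X, Y) / Var(Y)

print(f"r = {r:.3f}")
print(f"offspring-on-parent slope = {slope_off_on_par:.3f}")
print(f"parent-on-offspring slope = {slope_par_on_off:.3f}")
print(f"product of slopes = {slope_off_on_par * slope_par_on_off:.3f}, r^2 = {r * r:.3f}")
```

Because the two variances are close to equal in this simulation, both slopes land near r itself.
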

• Ben S. says:

An addendum, just to muddy the waters, is that for all practical purposes in the case of heritability, r and the regression slope are just about the same since the parental variance and the offspring variance should be just about the same.

• somebody says:

> When you have some reason to suppose causality (e.g. offspring phenotype is caused to some degree by parental phenotype), then it makes sense to look at the slope Cov(X,Y) / Var(X) which is “asymmetric” in that it matters which variable is which.

I’m pretty confused by this line of reasoning — the additional factor 1 / Var(Y) doesn’t contain any additional information about causality. It’s just rescaling, effectively a change of units such that both variables are pure numbers with SD=1?

• Ben S. says:

Totally possible that the way I am thinking about it is flawed, but I just mean that when you choose to regress Y on X, rather than X on Y, you are implicitly saying something about which way you think the causality is going.

• Actually that may be how you think about it, but I don’t think it’s correct.

when you regress Y = f(x) + error the implicit assumption is that x is measured with only ignorable levels of error and that y is measured or at least predicted with some nontrivial error, that’s the main thing.

It could very easily be that y causes x, but you can measure x very well and then want to infer what the y was that must have been occurring to cause the x.

One of the great things about Bayesian analysis is that it makes it very simple to work with measurement error models where such considerations become less important:

y = f(x+errx) + erry

for example, with errx having entirely different distributions of error than erry

• somebody says:

I think that Ben is just talking about the framing. Assuming you have everything that’s required for a regression to give good causal inference (which is quite a lot to assume), you want to be able to make statements about effect sizes, and it’s more straightforward to say “1 change in x causes beta change in y” and comparatively circuitous to say that “1/beta change in y is caused by 1 change in x”.

• sure, but if y is measured poorly and x is a causal outcome that’s measured well, then if you do say least-squares you should still regress y vs x and then back out the effect of a change in y on x.

the assumption of least squares is that the predictor is well measured and that it’s the predicted value, where you’re minimizing the squared error, that is poorly measured.

If you have both poorly measured then you should either do principal components analysis or do a measurement error model.

• jim says:

I disagree with Andrew on this question. I’d say measure the temperature near the building instead of five miles away.

• Ben S. says:

Interesting. This seems related to the fact the residuals are the vertical distances to the regression line?

• Exactly. If instead the errors are the perpendicular distance to the regression line then you’re doing principal components analysis… If the errors have explicit distributions associated you’re doing measurement error models.
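A small sketch of the vertical-versus-perpendicular distinction, on simulated data where both variables are measured with equal noise (an assumption for illustration). The perpendicular-distance fit uses the closed-form slope of the first principal axis in two dimensions; this is orthogonal regression on the raw covariance, not a full measurement error model.

```python
import math
import random

rng = random.Random(3)
# Hypothetical setup: a latent quantity with true slope 2, and both observed
# variables carrying the same amount of measurement noise.
truth = [rng.gauss(0, 1) for _ in range(2000)]
x = [t + rng.gauss(0, 0.5) for t in truth]
y = [2.0 * t + rng.gauss(0, 0.5) for t in truth]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

# Vertical distances: ordinary least squares of y on x.
ols_y_on_x = sxy / sxx
# Horizontal distances: regress x on y, then express that line as dy/dx.
ols_x_on_y_inverted = syy / sxy
# Perpendicular distances: slope of the first principal axis.
orthogonal = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

print(f"y-on-x slope             = {ols_y_on_x:.3f}")
print(f"orthogonal (PCA) slope   = {orthogonal:.3f}")
print(f"x-on-y line, as dy/dx    = {ols_x_on_y_inverted:.3f}")
```

In this setup the y-on-x slope is pulled toward zero by the noise in x, and the perpendicular fit sits between the two least-squares lines.
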

6. David J. Balan says:

Thank you Andrew for taking the time to help me (a total stranger) understand this. I’m a fan!

My key takeaways from Andrew’s response (and please correct me if I’ve misunderstood) are as follows:

1. When x and y have the same standard deviation (regardless of whether or not they have the same mean), the correlation DOES have a clear interpretation. It is BOTH the slope of the regression line (which is the same whether you regress y on x or vice-versa) AND a measure of how well the dots line up along that line. So: (i) a correlation of 1 means that the dots line up perfectly along a line with a slope of 1; (ii) a correlation of 0.5 means that the dots line up OK-but-not-great along a line with a slope of 0.5; and (iii) a correlation of 0 means that the dots line up terribly along a line with a slope of 0 (i.e., they look like a shotgun blast).

2. When x and y do not have the same standard deviation, my original concern remains and there is NOT a clear interpretation of a correlation of, say, 0.9. Such a correlation does mean that it’s NOT true that no linear relationship exists. And it also means that the dots line up pretty well along SOME line with a positive slope. But that’s as far as correlation can take us. Sometimes that’s sufficient, but other times it’s not, in which case we would have to do some other analysis.

3. So the correlation has a MUCH more meaningful interpretation when the standard deviations are the same than when they are not. (I don’t think this is a knife-edge result, so if the standard deviations are close but not identical it’s probably still more-or-less OK. The problem arises when the standard deviations are meaningfully different.)

4. Because the interpretation of the correlation is much more meaningful when the standard deviations are the same, there is a benefit to rescaling one or both variables so that they have the same standard deviation. But of course you can only do that if the rescaling makes substantive sense in the specific application.

Thanks again Andrew!

• Carlos Ungil says:

> there is NOT a clear interpretation of a correlation of, say, 0.9

> means that the dots line up pretty well along SOME line with a positive slope.

Why is that not a clear interpretation? Would you say that there is not a clear interpretation of the slope because it doesn’t tell you how well the dots line up along the line? Do you think that there is no clear interpretation of R-squared?

You’re right: correlation doesn’t tell you the slope and the interpretation of correlation doesn’t involve the slope. Not having anything to do with the slope is not necessarily a bad thing, though. Correlation doesn’t change if you change the units. The correlation between temperature and rainfall won’t be affected if you change Fahrenheit to Celsius and inches to mm.
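A quick check of the unit-invariance point, using made-up temperature and rainfall numbers (synthetic, purely for illustration). Converting Fahrenheit to Celsius and inches to mm leaves the correlation alone but rescales the slope.

```python
import math
import random

def slope_and_corr(x, y):
    """Return (least-squares slope of y on x, Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

rng = random.Random(4)
# Synthetic data: temperature in Fahrenheit, rainfall in inches.
temp_f = [rng.gauss(60, 15) for _ in range(300)]
rain_in = [5.0 - 0.03 * t + rng.gauss(0, 0.5) for t in temp_f]

# Same observations in metric units.
temp_c = [(t - 32.0) * 5.0 / 9.0 for t in temp_f]
rain_mm = [r * 25.4 for r in rain_in]

slope_imperial, r_imperial = slope_and_corr(temp_f, rain_in)
slope_metric, r_metric = slope_and_corr(temp_c, rain_mm)

print(f"correlation: {r_imperial:.4f} (F, in) vs {r_metric:.4f} (C, mm)")
print(f"slope: {slope_imperial:.4f} in per deg F vs {slope_metric:.4f} mm per deg C")
```
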

• Michael Nelson says:

The correlation is also the square root of the proportion of variance explained in one variable by the other. So you can see that a correlation of .8 is pretty different from a correlation of .85, because they explain 64% vs. 72% of outcomes, respectively. But correlations of .3 and .35 explain 9% and 12%, respectively, so they’re pretty similar. The (ironically) non-linear scale of the correlation may be one reason why values of r not at the extremes lose meaning for you.

Also, thanks to the general linear model (GLM), any statistic comparing two variables can be transformed into a correlation. So if an author reports the t-value for a comparison of two group means, a fairly unintuitive number, you can compute r = t/sqrt(t^2 + n – 2), where n is the total sample size. So reading that a two-group study with n = 60 per group (120 in total) has t = 2.00 only tells you that the difference was significant at the .05 (2-tailed) level. Whereas a quick calculation gives r = .18, telling you the effect of the intervention was quite small even if significant, explaining only about 3% of the variation in outcomes. For some outcomes that’s actually huge, or may be worth it if the intervention’s really cheap, but otherwise…
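The arithmetic in that example can be reproduced in a few lines. The function name `r_from_t` is made up for this sketch, and the degrees of freedom are taken to be the total sample size minus 2.

```python
import math

def r_from_t(t, df):
    """Convert a two-group t statistic to a correlation: t / sqrt(t^2 + df)."""
    return t / math.sqrt(t * t + df)

n_per_group = 60
df = 2 * n_per_group - 2      # total n minus 2 = 118
r = r_from_t(2.00, df)

print(f"r = {r:.2f}")          # r = 0.18
print(f"r^2 = {r * r:.3f}")    # r^2 = 0.033, about 3% of the variance
```
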

• Lahvak says:

Imagine you are standing in front of one side of a square table and you have whole bunch of pennies. You align the pennies into a straight line that stretches from the left side of the table to the right side, and passes through the center. Then you shake the table in the “back-and-forth” direction, so each of the pennies moves by some random amount away from you or towards you, but not left and right. The more you shake it, the more noise you introduce. At some point of time some of the pennies start falling off the table. The steeper the original line was, the sooner this will happen: if the line has slope 1 or -1, the pennies near the corners will fall off right away. With smaller slope, you have more space for the noise.

If the standard deviations are not equal, your table is no longer square.

7. Jeff Gill says:

The thing that always amazes me is that people get away with publishing correlation coefficients without a standard error (SE(r) = \sqrt{\frac{1-r^2}{n-2}}), when it is functionally the same information as a regression slope, which you could never publish without a standard error.

• Michael Nelson says:

As the formula indicates, publishing the n at least allows the reader to compute the SE themselves, so at least it’s not as bad as failing to publish the SE of a mean estimate.

Also, are you sure about that formula? Isn’t the 95% CI for rho the inverse Fisher transform of fisherz(r) +/- 1.96*(1/sqrt(n-3))?
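Both formulas from this exchange are easy to compute side by side. The sketch below uses hypothetical values r = 0.5 and n = 30, and the function names are made up for illustration. The first quantity is the standard error that appears in the t-test of rho = 0; the second is an interval built on the Fisher z scale and mapped back, so it is not symmetric around r.

```python
import math

def se_r(r, n):
    """Standard error from the thread: sqrt((1 - r^2) / (n - 2))."""
    return math.sqrt((1.0 - r * r) / (n - 2))

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% interval for rho via the Fisher z transform."""
    z = math.atanh(r)                      # Fisher z of r
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

r, n = 0.5, 30   # hypothetical values for illustration
se = se_r(r, n)
lo, hi = fisher_ci(r, n)

print(f"SE(r) = {se:.3f}, so t = r / SE(r) = {r / se:.2f}")
print(f"Fisher-z 95% interval for rho: ({lo:.3f}, {hi:.3f})")
```
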

8. Dale Lehman says:

Why are we confining the discussion to measures of linear correlation? I think of correlation as a more general concept involving the relationship between/among two or more variables. The correlation coefficient from a linear regression is but one measure of this. If the relationship is nonlinear, then that particular measure may not be meaningful at all, but there may still be a meaningful relationship.

• jyd says:

I’m ok with this. Whenever I talk about nonlinear dependence… say in the context of copulas, I try to avoid using the word correlation (because people automatically think linear) and refer to measures of nonlinear dependence.

• Dan Bowman says:

FWIW, I’ve bowed to the prevailing winds and when I teach intro stats, I’ve begun using the word “association” for the general concept and “correlation” for the special case of “linear association.” The hope is that by separating the concepts, students will both be prepared to correctly interpret most instances of the word “correlation” in the wild (where it typically refers to a “correlation coefficient” and hence a linear association) and have a sense that linear associations aren’t all there is. Whether it works or not, I have no idea…. I’m just a physicist at a small college who often has to teach stats and is trying to do his best. Feedback appreciated.

9. Ian says:

The thing that always amazes me is the casual sloppiness of discussions of dependence, even among sophisticated statisticians. First of all, ‘correlation’ invariably refers to Pearson linear correlation. As such, this metric is describing the fit wrt a specific pattern in data. One has only to revisit Anscombe’s Quartet for a reminder of how easily this seemingly innocuous statistic can be falsified https://en.wikipedia.org/wiki/Anscombe%27s_quartet or https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html

Next, lack of a ‘correlation’ does not imply lack of dependence. Today, there is a wide set of dependence metrics that go way beyond the Pearson. They include Reshef’s MIC, Szekely’s distance correlation, and information-theoretic metrics such as Shannon’s entropy, mutual information, AIC, permutation entropy…and more.
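A tiny illustration of the point that zero correlation does not mean independence, using a constructed example rather than real data: y is a deterministic function of x, yet the Pearson correlation is zero up to rounding. This sketch does not implement any of the alternative metrics listed above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# x is a symmetric grid on [-1, 1]; y depends on x exactly, but not linearly.
x = [i / 50 for i in range(-50, 51)]
y = [a * a for a in x]

r = pearson(x, y)
print(f"Pearson r between x and x^2 = {r:.6f}")
```
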

These advances need wider recognition and use among the rank-and-file.

10. JS says:

Balan: “The part that I had a hard time understanding was why it is impossible for the dots to line up perfectly along a line with a slope other than 1, or to line up imperfectly along a line with a slope equal to 1. I think this is where the assumption of equal sd matters.”

I think this indicates a confusion of the different slopes and how they relate to the data. The slope of the (nonstandardized data) regression line is arbitrary and can be anything “essentially independent” of the correlation: any correlation can occur with any regression slope (they just have the same sign). But when we standardize the datasets, then the correlation is literally the slope of the (standardized data) regression line. I think he is still mixing up the two cases. When not standardizing, the line can have any slope and the data can either fall on the line or off of it. We could rescale the data in arbitrary ways to give us different slopes as well, but there is only one way to rescale the data (of course with arbitrary shifts) so as to make the slope equal to covariance divided by product of standard deviations.

In the case that the correlation is 1, we could “jiggle” the pairing of the data so as to make it no longer have correlation 1 and thus the re-paired dataset wouldn’t fall perfectly on the line (and of course the standardized slope would no longer be 1). I think this might be an important thing to mention. It isn’t just the individual datasets that matter, but how they are paired up.

It is not quite the case that equal sd is what matters though: X and Y can have the same standard deviation but any correlation. It’s just that in this case, the slope of the regression line will be equal to the correlation and thus standardizing doesn’t change the slope.
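The re-pairing point can be sketched with simulated data (hypothetical values, for illustration only). Shuffling which y goes with which x leaves both standard deviations untouched but changes the correlation, and in each case the raw slope still equals r times sd(y)/sd(x).

```python
import math
import random
import statistics as st

def slope_and_corr(x, y):
    """Return (least-squares slope of y on x, Pearson correlation)."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

rng = random.Random(5)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [2.0 * a + 1.0 for a in x]        # every point lies exactly on a line

# Same y values, different pairing with x.
y_shuffled = y[:]
rng.shuffle(y_shuffled)

slope_paired, r_paired = slope_and_corr(x, y)
slope_shuffled, r_shuffled = slope_and_corr(x, y_shuffled)
sd_ratio = st.stdev(y) / st.stdev(x)  # unchanged by re-pairing

print(f"original pairing: r = {r_paired:.3f}, slope = {slope_paired:.3f}")
print(f"shuffled pairing: r = {r_shuffled:.3f}, slope = {slope_shuffled:.3f}")
print(f"sd(y)/sd(x) = {sd_ratio:.3f} in both cases")
```
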

11. Richard Juster says:

My, what conniptions people are going through! Why has no one mentioned that a simple bivariate correlation coefficient between x and y is, at its most basic, simply a measure of how “tight” the data is around the linear projection of y on x (or x on y)? It is also useful to have a look at John Tukey’s view of the correlation coefficient as reported by David Brillinger: https://www.stat.berkeley.edu/~brill/Papers/jwtint4.pdf

• Robby says:

…where tightness is measured relative to the scale of the Y variable. You need this qualification.

12. Greg Baker says:

I’ve recently started thinking about correlation as being a road-sign. If you have a good Pearson correlation, then it means that you can go ahead and build a linear model. If you have a good Kendall correlation but not a good Pearson correlation, maybe you should look at nearest neighbour methods. A good Spearman correlation means… umm… some kind of model will work well… maybe some kind of multi-break linear monotonic model.
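A minimal sketch of how Pearson and Spearman can disagree, on a constructed monotone but strongly nonlinear example. Spearman is computed here as the Pearson correlation of the ranks, with a simple ranking helper that assumes no ties; Kendall's tau is left out to keep the example short.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values, assuming there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

# Constructed example: y increases with x, but exponentially rather than linearly.
x = [0.1 * i for i in range(101)]
y = [math.exp(a) for a in x]

pearson_r = pearson(x, y)
spearman_r = pearson(ranks(x), ranks(y))

print(f"Pearson  = {pearson_r:.3f}")
print(f"Spearman = {spearman_r:.3f}")
```
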

13. Ben says:

Hello Andrew
“I replied that the way I think about the correlation is that it’s the slope of the regression of y on x if the two variables have been standardized to have the same sd.”
I disagree, since the only thing that correlation describes is the certainty with which two variables are linearly dependent. The expected slope of a standardized pair of data-sets, x and y, will not change, it will always be 1. So for example, if you measure a hundred times a data-set with 20 measurement pairs of x and y, the average slope after standardizing this data will be around 1 (or -1) but depending on the tightness of relation of the data (high or low correlation), the 100 slopes might fluctuate widely or narrowly around 1.
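The claim in this comment can be checked with a small simulation. The sketch below follows the setup described (100 datasets of 20 pairs each) and assumes, for illustration, bivariate normal data with an underlying correlation of 0.5. Under those assumptions the standardized slopes average out near 0.5 rather than near 1, which is consistent with the reading in the post.

```python
import random
import statistics as st

RHO = 0.5          # assumed underlying correlation for this sketch
N_PAIRS = 20
N_DATASETS = 100

def standardize(v):
    """Rescale to mean 0 and sd 1."""
    m, s = st.mean(v), st.stdev(v)
    return [(a - m) / s for a in v]

def ols_slope(x, y):
    """Least-squares slope from regressing y on x."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(6)
slopes = []
for _ in range(N_DATASETS):
    x = [rng.gauss(0, 1) for _ in range(N_PAIRS)]
    # y has correlation RHO with x by construction.
    y = [RHO * a + (1 - RHO ** 2) ** 0.5 * rng.gauss(0, 1) for a in x]
    slopes.append(ols_slope(standardize(x), standardize(y)))

mean_slope = st.mean(slopes)
print(f"average standardized slope over {N_DATASETS} datasets = {mean_slope:.3f}")
print(f"range of slopes: {min(slopes):.3f} to {max(slopes):.3f}")
```
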