A little correlation puzzle, leading to a discussion of the connections between intuition and brute force

Shane Frederick wrote:

One of my favorite questions to ask people is: r(a,b) = x; r(b,c) = x; r(a,c) = 0. How big could x be?

The answer feels unintuitive to me. And it is unintuitive to almost all, I’ll add.

I took this as a challenge. It should feel intuitive to me, dammit! I’ll let you think about it too before going on.

When given this problem, I pondered for a few minutes but got hung up with the correlation formula. Then I realized I could solve it using brute force by just enforcing the constraint that the correlation matrix be positive definite.

The first step is to figure out det((1, x, 0), (x, 1, x), (0, x, 1)). I remember this from high school, how to compute the determinant of a 3×3 matrix. All the zeroes make this particularly easy to do: the result is 1 - 2x^2. (The smaller leading minors, 1 and 1 - x^2, are positive whenever |x| < 1, so the full determinant is the binding constraint.) OK, that's easy: the condition 1 - 2x^2 > 0 is equivalent to |x| < 1/sqrt(2), and at x = 1/sqrt(2) itself the matrix is singular but still a valid correlation matrix, so that value is attainable. So that's the answer. Oh yeah, then I checked the computation in R just to make sure I hadn't made any stupid mistakes. I emailed back to Shane:

1/sqrt(2), i.e. R-squared = 50%? Or am I missing something?

He replied:

Nope that’s it. Statisticians can just see it. Feels SO high to me. I’m surprised when I see it with r = 0.3.
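For the record, here's a minimal sketch of the kind of R check mentioned above (a reconstruction for illustration, not the exact code): build the correlation matrix as a function of x and look at the determinant and the smallest eigenvalue near the boundary.

    # correlation matrix with r(a,b) = r(b,c) = x and r(a,c) = 0
    corr_mat <- function(x) matrix(c(1, x, 0,  x, 1, x,  0, x, 1), nrow = 3)

    x <- 1/sqrt(2)
    det(corr_mat(x))                       # essentially 0: 1 - 2x^2 vanishes at the boundary
    min(eigen(corr_mat(x))$values)         # smallest eigenvalue is (numerically) 0
    min(eigen(corr_mat(x + 0.01))$values)  # negative: past 1/sqrt(2) it's no longer a valid correlation matrix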

To get the intuition, think of it like this. Start with a and c independent. Let b = a + c. Each of these explains 50% of the variance of b, hence if you regress b on a or b on c you’ll get R-squared = 50%. So corr is sqrt(1/2).
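A quick simulation sketch of that construction (an illustration, using standard normals for concreteness):

    set.seed(123)
    n <- 1e5
    a <- rnorm(n)
    c <- rnorm(n)                 # independent of a
    b <- a + c                    # b shares half its variance with each of a and c

    cor(a, b)                     # about 0.707 = 1/sqrt(2)
    summary(lm(b ~ a))$r.squared  # about 0.5
    cor(a, c)                     # about 0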

What about r = 0.3? Then a or c explains 9% of the variance of b, so you could write this as b = a + c + e, where the error term e explains the remaining 82% of the variance. Looked at this way, r = 0.3 doesn’t seem so exceptional at all.
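To put numbers on the r = 0.3 case (assuming unit-variance a and c):

    r <- 0.3
    var_b <- 1 / r^2        # var(b) needed so that cor(a, b) = cor(c, b) = 0.3: about 11.1
    1 / var_b               # share of var(b) explained by a alone (or by c alone): 0.09
    (var_b - 2) / var_b     # share left for the error term e: about 0.82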

So why was r = 0.3 so surprising to Shane? I’m guessing that the issue is a and c being uncorrelated. In observational data, variables don’t typically have correlations of 0, so when that does happen, you’re typically in the presence of enough noise that you won’t see correlations of 0.3 either. Or maybe not, I’m not so sure about this one.

But maybe the real lesson of this one is the connection between brute force and intuition. When Shane sent me this problem, I started with the intuition that I should have intuition for it. Unfortunately I had no intuition. But I did know some brute force tools, and once I figured out the answer the hard way, the R-squared of 50% jumped out at me, and from there the intuition was clear.

22 thoughts on “A little correlation puzzle, leading to a discussion of the connections between intuition and brute force”

  1. Intuitively, I thought about it more in geometric terms: a and c are perpendicular. The angles between a-b and b-c are the same, so the angle between a and b is at least 45°, i.e. an r-squared of at most 0.5.

  2. Another way to think about these problems is to use the idea behind Cauchy-Schwarz to connect them to angles between vectors.

    Suppose the question had been “You have unit vectors a, b, c in some Hilbert space. You know the angles between a, b and b, c are both equal to some angle theta, and you know that a and c are orthogonal. What’s the minimum value of theta possible?”

    In this case, it’s obvious the answer is theta = 45 degrees, and the corresponding correlation coefficient would then be cos(45 degrees) = 1/sqrt(2), same as the answer you get. Your proof with the determinant can then be recast as using the double angle identity 0 = cos(90 degrees) = 2 cos^2(45 degrees) – 1 to figure out what cos(45 degrees) is.

  3. The way the problem is posed, it looks like a transitive relation.

    > r(a,b) = x; r(b,c) = x; r(a,c) = 0.

    In layperson’s terms, “a is similar to b, b is similar to c, so a and c should be similar, too”. And if a and c are not similar at all, by that logic I’d expect the other similarities to be small if I’m not thinking too hard (or one-dimensionally).

    But your trick of seeing b as a composite of a and c works for the layperson, too:
    • a=red circle, b=blue circle, c=blue square
    • a=blue, b=green, c=yellow

    As soon as you get the idea of thinking multidimensionally, it’s obvious.

  4. It’s always careless to talk about ‘correlation’ when there are so many dependence metrics out there. Of course it can be assumed you’re applying the Pearson correlation. Does this rule also hold true for Spearman rank correlations? Distance correlations? and so on…

  5. I was curious to see how well this generalized to more variables, but I can’t math as well semi-early in the morning, so I wrote a short script to check it real quick. Anybody want to explain the shape of this curve, as well as hazard what value the variable on the vertical axis approaches as the one on the horizontal approaches +inf? https://i.imgur.com/Oz7YLQR.png (the jitters are just due to me using a grid of interval width 0.001 to approximate the maximum correlation of all a,b,c,d… except one).

    This also reminds me of a similar puzzle I’d worked on some years back for a grad school project involving Metropolis-Hastings proposal distributions for correlation matrices, where I made a similar graph (https://i.imgur.com/hudBi1i.png). I couldn’t math that one out either, though in that case not for lack of trying (though there I didn’t care to try too hard, since I was mostly trying to argue that making proposals to marginal correlation coefficients was a fool’s errand).
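    A minimal sketch of what such a grid search might look like in R (a reconstruction for illustration; the commenter's exact setup isn't specified, so this assumes all pairwise correlations equal x except one pair, which is forced to 0):

      # largest common x keeping the matrix positive semidefinite when one pair is forced to 0
      max_x <- function(n, grid = seq(0, 1, by = 0.001)) {
        ok <- sapply(grid, function(x) {
          m <- matrix(x, n, n)
          diag(m) <- 1
          m[1, n] <- m[n, 1] <- 0
          min(eigen(m, symmetric = TRUE, only.values = TRUE)$values) >= -1e-12
        })
        max(grid[ok])
      }
      sapply(3:10, max_x)  # for n = 3 this recovers roughly 0.707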

  6. I am wondering why you view the seemingly obvious math solution as “not intuitive”: it appears more tractable than coming up with a regression example, and perhaps is the only scalable approach when there are more variables. Hmm, maybe a “general solution” is typically less intuitive and less cute?

      • Indeed there is an extra math puzzle for me:
        How to determine when the matrix ((1, x, 0), (x, 1, x), (0, x, 1)) is P.S.D. in an intuitive way?

        The naively intuitive solution gives a loose bound via diagonal dominance: each diagonal element should be at least the sum of the off-diagonal elements in its row, so 1 >= 2x, or x <= 0.5, which is not tight. How does the regression example map onto linear algebra theory here?

        • I have found that in symmetric problems like this one, a useful trick is often to use the definition of the characteristic polynomial of the matrix together with the matrix determinant lemma and potentially the Sherman-Morrison formula, which lets you read off all the eigenvalues as functions of x.

          In this case, $p_{\mathbf{M}}\left(t\right) = \det\left(t\mathbf{I} - \mathbf{M}\right) = \det\left(\mathbf{A} + \mathbf{u}_{1}\mathbf{v}_{1}^\intercal + \mathbf{u}_{2}\mathbf{v}_{2}^\intercal\right)$, where $\mathbf{A} = (t-1)\mathbf{I}$, $\mathbf{u}_{1} = \left[-1, 0, -1\right]^\intercal$, $\mathbf{v}_{1} = \left[0, x, 0\right]^\intercal$, $\mathbf{u}_{2} = \left[0, -1, 0\right]^\intercal$ and $\mathbf{v}_{2} = \left[x, 0, x\right]^\intercal$. Applying the matrix determinant lemma twice and the Sherman-Morrison formula once, we get $p_{\mathbf{M}}\left(t\right) = (t-1)\left[t - \left(1+\sqrt{2}x\right)\right]\left[t - \left(1-\sqrt{2}x\right)\right]$. So the eigenvalues are $1$, $1+\sqrt{2}x$ and $1-\sqrt{2}x$, and the matrix is PSD iff $|x| \le \frac{1}{\sqrt{2}}$.

          I doubt that most would consider this intuitive, but it often generalizes well to higher dimensional problems where the geometric intuitions that others have provided might struggle a bit (at least for my limited imagination).
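          For anyone who wants to check that formula numerically, here is a quick R sketch (added as a numerical check, not part of the original comment):

            x <- 0.3
            M <- matrix(c(1, x, 0,  x, 1, x,  0, x, 1), nrow = 3)
            sort(eigen(M)$values)                     # numerical eigenvalues
            sort(c(1 - sqrt(2)*x, 1, 1 + sqrt(2)*x))  # closed-form eigenvalues, for comparison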

        • Aftab: Thanks. I like how this more intrinsic math is derived from the stone soup, as Andrew would probably say.
          Further question: because it seems that everything works on the squared scale in this example, maybe there could be a sufficient bound of the form “Id + noise is PSD if Frobenius-norm (noise) < ...”?

    • (I guess also the dimensionality of the correlation matrix would be better said to be choose(n,2), where n is the number of correlated variables considered)

  7. I will second other people: this feels very intuitive to me because I think of correlations as the angle between vectors. So immediately, I thought: “What is the smallest angle that, when composed twice, produces a 90 degree angle?” And a 45 degree angle then presents itself as the obvious answer.

    Because this answer seemed so obvious to me, I was panicked for a second trying to figure out where the trick was. Could you somehow compose two 30 degree angles into orthogonality? Is there some trick in the limit of large n dimension?

    But no. Goes to show that there’s large interpersonal variation in what constitutes intuition. I suspect that those who are more used to working with data in the wild would be more stumped by the problem, while those who view correlations through a linear algebra perspective would find this trivial.

  8. Interesting problem. Thank you!

    For an intuitive example (not a proof).

    Example 1.

    Let Y and Z be i.i.d. random variables with the same mean and variance.

    Let:
    a = Y
    b = Z+Y
    c = Z
    (i.e. Y is shared between a and b, Z is shared between b and c)

    Then r(a,b) = r(b,c) = 1/sqrt(2) and r(a,c) = 0. The same holds when Y and Z are each sums of N indep. variables.

    Example 2.
    Now suppose X, Y, Z, W are also i.i.d. random variables with the same mean and variance, and let:

    a = X+Y
    b = Z+Y
    c = W+Z
    (i.e. Y is shared between a and b, Z is shared between b and c, but X and W are not shared)

    Then r(a,b) = r(b,c) = 1/2 < 1/sqrt(2) and r(a,c) = 0. If X, Y, Z, W are sums of independent random variables, then x is always < 1/sqrt(2) (except in the cases covered in Example 1).
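    A quick simulation sketch of both examples in R (added for illustration, not part of the original comment; standard normals for concreteness):

      set.seed(1)
      n <- 1e5
      X <- rnorm(n); Y <- rnorm(n); Z <- rnorm(n); W <- rnorm(n)

      # Example 1: b shares Y with a and Z with c
      a1 <- Y; b1 <- Z + Y; c1 <- Z
      round(c(cor(a1, b1), cor(b1, c1), cor(a1, c1)), 2)  # about 0.71, 0.71, 0

      # Example 2: a and c also carry unshared components X and W
      a2 <- X + Y; b2 <- Z + Y; c2 <- W + Z
      round(c(cor(a2, b2), cor(b2, c2), cor(a2, c2)), 2)  # about 0.50, 0.50, 0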
