“Trivia question for you. I kept temperature records for 100 days one year in Boston, starting August 15th (day “0”). What would you guess is the correlation between day# and temp? r=???”

Shane Frederick writes:

Trivia question for you. I kept temperature records for 100 days one year in Boston, starting August 15th (day “0”).

What would you guess is the correlation between day# and temp?

r=???

Shane sends me this kind of thing from time to time, for example:

Boris and Natasha in America: How often is the wife taller than the husband?

A little correlation puzzle, leading to a discussion of the connections between intuition and brute force

Also this from 2016 (nearly ten years ago!): When are people gonna realize their studies are dead on arrival?

When he sends me these little probability puzzles, I can figure them out—no surprise, as they’re devised to work for the general population and probability is my special area of expertise! The flip side is that, when a new one comes, I feel some pressure not to mess it up.

So here was my reply to Shane:

uh, that’s a tough one . . . I don’t have such a great intuition about correlation. ok, let me try to guess . . . I guess you’re talking about the average temperature on each day . . . On 15 Aug the avg temperature might be 80 degrees. 100 days later, that’s approx 25 Nov, the average temp might be 45 degrees (jeez, this is cringeworthy, I’m afraid that without looking it up I will get it embarrassingly wrong). A range of 35 degrees, so that would be an explained sd of 35/sqrt(12) (using the fact that the uniform distribution has a sd of 1/sqrt(12)), so approx 10. As for the unexplained sd . . . suppose there’s a day where the expected avg temp is 60. This time of year it could easily be 50 or 70, so maybe 10 for the unexplained sd? Then the explained and unexplained variance are equal so R-squared is 0.5, so r=0.7?
I kinda feel like I must have done something stupid here . . .

It turns out that I got it right (except for forgetting to specify that the correlation is negative, not positive, as temperature is steadily declining from August through November). That’s a relief.
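
For anyone who wants to poke at the arithmetic, here is a minimal sketch in Python of that back-of-the-envelope calculation. The 35-degree range and the 10-degree unexplained sd are just the guesses from my reply above, not data.

import math

# Guessed numbers from the reply above, not measurements:
trend_range = 35                            # guessed drop in average temp from Aug 15 to ~Nov 25 (deg F)
explained_sd = trend_range / math.sqrt(12)  # sd of a roughly uniform spread over that range, about 10
unexplained_sd = 10                         # guessed day-to-day sd around the seasonal average

r_squared = explained_sd**2 / (explained_sd**2 + unexplained_sd**2)
r = -math.sqrt(r_squared)                   # negative, since temperature declines with day number
print(round(r_squared, 2), round(r, 2))     # roughly 0.5 and -0.7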

According to Shane, his MIT colleagues guessed a correlation of between 0.15 and 0.2. The day-to-day variability of Boston weather was just too salient to them. He writes:

The narrative of the book Noise is that people underappreciate noise. I don’t disagree. But there are some contexts where they clearly “over appreciate” it.

I’m guessing that people would give more accurate estimates if you asked them to specify R-squared rather than correlation. R-squared of 50% sounds like a safe guess, no? Conversely, a correlation of 0.15 or 0.2 corresponds to an R-squared in the 2%-4% range, which sounds kinda low.

Regarding Shane’s point about the narrative, this reminds me of the slow to update thing that Josh Miller and I have been thinking about. The usual lesson from those base-rate-fallacy problems is that people overweight local data and underweight base rates. But there are these examples (such as Leicester City) where people seem to be moving too slowly away from their baseline positions, not reacting fast enough to evidence.

As is often the case, there is no safe haven.


  1. I downloaded the data for 2015-2018 and found the correlation to be .06 (2015), .16 (2016), .19 (2017), and -.01 (2018). I used the average of the high and low temperatures each day. So,…?

    • I’ll correct a few mistakes shortly, but I think the reason the colleagues (and I) guessed wrong is that we interpreted the question differently. I was thinking of the day number of the month (1-31), while the question is about day numbers 0-99. The former, I suspect, has a low correlation, while the latter would be larger.

      • So, some mysteries remain. I calculated the correlation (not R2) two ways: one using a day counter from 0 (Aug 15) to 99 (Nov 22) for each year, and the other using the actual day number of the month (1-31). The correlations for the first method are much larger in absolute value (they are negative) than Andrew’s guess and the supposed correct answer, and the correlations for the second method are much smaller. Here is what I get for the 4 years:

        year    correlation using days 0-99    correlation using day of month 1-31
        2015    -0.84                           0.02
        2016    -0.88                           0.10
        2017    -0.80                           0.13
        2018    -0.88                          -0.06

        So, what now?
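
        In case anyone wants to replicate this, here is roughly how I computed the two versions for one of the years, as a sketch in Python with pandas. The file name and column names are placeholders for whatever daily-temperature download you use.

        import pandas as pd

        # Hypothetical daily file with columns "date" and "temp" (average of high and low)
        df = pd.read_csv("boston_daily_temps.csv", parse_dates=["date"])
        window = df[(df["date"] >= "2017-08-15") & (df["date"] <= "2017-11-22")].copy()

        window["day_0_99"] = range(len(window))          # day counter: 0 on Aug 15, ..., 99 on Nov 22
        window["day_of_month"] = window["date"].dt.day   # day number within each month: 1-31

        print(window["day_0_99"].corr(window["temp"]))       # strongly negative (seasonal trend)
        print(window["day_of_month"].corr(window["temp"]))   # near zero (the trend mostly washes out)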

    • Yours is the average of min/max temperature in the shade at the airport (or similar); it suffers from all the well-known issues.

      His could be the temperature of anything really (his dog’s body temperature?), but the phrasing suggests most likely looking at a thermometer on a wall in the backyard or similar. That would be more affected by clouds, solar zenith angle, time of day, and probably other stuff that doesn’t immediately come to mind.

      It’s important to realize that temperature varies up to 10 C over sub-second timescales due to stuff like clouds providing shade. Then a mercury thermometer shows the average over the last minute or so.

      • Could be. The example then underscores issues of measurement. The things you mention all point to the correlation being weaker than what I get from the “official” measurements, which is consistent with the official data showing larger correlations (in absolute value) than what this person might have measured themselves. But I’m not sure about the magnitude of that difference (it seems a bit large to me, but measurement is important and could easily be noisier than I think).

        Andrew’s reasoning is not that far off from the data I looked at (though a bit on the low side). I still think the primary reason for the difference between Shane’s colleagues and the “correct” answer is that the former were thinking about day numbers of the month and Shane was talking about days counting from 0-99. Speculation on my part. So, I think it is an interesting question for several reasons: measurement issues, probability intuition, and ambiguities of wording.

        • > Trivia question for you. I kept temperature records for 100 days one year in Boston, starting August 15th (day “0”).

          If people were looking at the day numbers of each month, why would day 0 start in the middle of August? I don’t think your initial interpretation holds up to a careful reading of the question.

        • Anon
          Yes, you are right. But this is like the invisible gorilla. I immediately skipped over that fact and started thinking about the first half and last half of the months during that time period and the general downward trend in temperature over the 100 days. When you (I) focus intently on something that requires effort, it is easy to miss what is right in front of you (me), like the gorilla.

      • Anon:

        Yeah, when Shane said, “I kept temperature records for 100 days one year in Boston,” I was assuming that (a) he was talking about average daily temperatures, and (b) his procedure to “keep temperature records” was to type in or download the official numbers for 100 straight days.

        • “I kept temperature records” implies a bit more agency in the generation of the data to me. Also, in some cultures (eg, British), it is quite common to even have your own weather station. So that isn’t necessarily an odd thing to do.

          Of course, you know him, so you are most likely guessing right.

  2. Looking back at that Leicester post…

    “Leicester was the 14th-best team in the league last year in terms of points (and they were better than that in terms of goal differential, which is probably a better indicator of underlying quality)”

    They were 13th in GD. So… their table finish was an accurate representation of their quality.

    Also, 5000:1 did seem overdone, but I think 100:1 at least would have been entirely reasonable if not optimistic. Leicester had the fewest points of any winner in the past 20 years and the lowest goal differential of any winner since the EPL went to 38 games. They were also only one of two teams to win the EPL having finished outside the top three the previous season. The other was Chelsea, who won in 2016/17 after finishing 10th in 2015/16 (and, perhaps ironically, whose comeback draw against Tottenham clinched the title for Leicester). But they had won the title in 2014/15 and were having an aberrantly bad season. It took an incredible confluence of circumstances for Leicester to win.

    Post-Bosman and in the era of the oligarchs and Gulf oil money, soccer doesn’t look anything like American professional sports. So Paul really shouldn’t have been taking his baseline from that.

  3. I studied climatology in grad school, so I’m used to looking at graphs of daily averages around the year. Below is the url for a graph of average daily min and max for Boston. In the fall, the curves are pretty straight. If you picked a different starting date, the answer would be different.

    https://www.google.com/imgres?q=time%20series%20average%20daily%20temperature%20boston&imgurl=https%3A%2F%2Fweather-and-climate.com%2Fuploads%2Faverage-temperature-united-states-of-america-boston.png&imgrefurl=https%3A%2F%2Fweather-and-climate.com%2Faverage-monthly-Rainfall-Temperature-Sunshine%2CBoston%2CUnited-States-of-America&docid=qDwVzz6R_0p9KM&tbnid=Rmy3dnUzwdZE_M&vet=12ahUKEwjm0-TtsruJAxUaHDQIHRqwKwcQM3oECBYQAA..i&w=702&h=232&hcb=2&ved=2ahUKEwjm0-TtsruJAxUaHDQIHRqwKwcQM3oECBYQAA

    • Of course the averages should show a strong negative correlation over the time period Aug 15 – Nov 22. For the 4 years I looked at, the negative correlation remains strong, but probably exhibits more variation than those averages. To me, the more interesting question was the one I originally was thinking of – and which I am speculating Shane’s colleagues were thinking of. Using the day-of-month numbers (1-31) involves a lot more difficult thinking (at least for me), and my intuition was that there would be very little correlation.

  4. This is why I prefer to think in terms of r_pi (standard deviation of the model over the summed standard deviations of model and error) as I referred to in this previous post about r-squared: https://statmodeling.stat.columbia.edu/2024/06/17/this-well-known-paradox-of-r-squared-is-still-buggin-me-can-you-help-me-out/

    Here, if there is an equal amount of variation in average temperature over the days as the variation in temperature that is unexplained, as in Andrew’s intuitive estimation, then r_pi = 0.5 (as for r-squared).

    If r_pi = 0.2 (rather than r = 0.2, as in the MIT colleagues’ guesses), then there would be four times as much unexplained variation in temperature as explained variation, which sounds like too large a difference to me.

    R-squared = 0.04 of course also helps one come to that conclusion, although such a low number seems “too extreme” to me, similar to when it was 0.01 in the previous post (r_pi = 0.17 when r = 0.2, so roughly the same interpretation as above).

    Conversely, if r-squared = 0.2 (if some MIT colleagues thought of it rather than r when answering), then r_pi = 1/3, so twice as much unexplained variation in temperature as explained. This still sounds like a large difference to me, although I am not sure how the temperature varies in Boston; since it is a coastal city closer to the equator (than where I am located), I could believe its temperature does not vary so much over the year that this would be a plausible number.
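
    Here is a small Python sketch of those conversions (a helper function of my own naming), with r_pi defined as sd(model) / (sd(model) + sd(error)) and the explained and unexplained variances normalized to sum to 1, just to make the arithmetic above checkable.

    import math

    def r_pi_from_r_squared(r_squared):
        # r_pi = sd(model) / (sd(model) + sd(error)), with explained + unexplained variance = 1
        sd_model = math.sqrt(r_squared)
        sd_error = math.sqrt(1 - r_squared)
        return sd_model / (sd_model + sd_error)

    print(r_pi_from_r_squared(0.5))      # 0.50: equal explained and unexplained variation
    print(r_pi_from_r_squared(0.2**2))   # ~0.17: r = 0.2, i.e. R-squared = 0.04
    print(r_pi_from_r_squared(0.2))      # ~0.33: R-squared = 0.2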

  5. I mentioned this last time: it shouldn’t take so much jargon to explain such a simple concept. That indicates a serious underlying problem.*

    People have been judging models via their predictive skill for millennia. Can these methods at least be backtested on past success stories (Halley’s comet, etc) so we can see what they add?

    *Like how students have trouble with (NHST) p-values because the correct understanding means they are utterly worthless. But then they can’t understand why they are being forced to learn something so worthless. The best way is to think of the “dark ages,” when all the most highly educated members of society focused on (what you likely consider utterly pointless) theology for 1000 years.

  6. “it shouldn’t take so much jargon to explain such a simple concept. That indicates a serious underlying problem”

    This comment thread is a bit demoralizing in that regard. At this point I am really hoping that what Shane actually said to his colleagues was not “watch how easy it is to get statisticians bickering over the simplest of problems.”

  7. It seems to me that I have never asked the question “what’s the correlation?” in solving any modeling problem. I’ve asked things like what’s the slope of a best fit line and how much variation do the residuals have and is a nonlinear shape needed and do we need to add in other predictors and so forth, but I literally never want to know the correlation coefficient.

    Is that because I took a lot of applied math classes but have never taken an official statistics class? Maybe. I think it’s probably a good thing. What value would the correlation bring to solving problems about predicting temperature between Aug and Nov or similar? What would be the correlation over 200 days? 365? 3650? One day of hourly temps? One week?
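
    A quick simulation makes the point that the answer depends entirely on the window. Here is a toy sketch with a made-up sinusoidal cycle plus noise, not real Boston data; the amplitude, phase, and noise sd are invented.

    import numpy as np

    rng = np.random.default_rng(0)

    def corr_over_window(n_days, start_day=227):   # day-of-year 227 is roughly Aug 15
        days = np.arange(n_days)
        seasonal = 52 + 25 * np.cos(2 * np.pi * (start_day + days - 200) / 365)   # made-up annual cycle, deg F
        temp = seasonal + rng.normal(0, 10, n_days)                               # plus day-to-day noise
        return np.corrcoef(days, temp)[0, 1]

    for n in (100, 200, 365, 3650):
        print(n, round(corr_over_window(n), 2))
    # roughly: strongly negative over 100-200 days, near zero over one or more full years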

    • Perhaps the correlation coefficient is of little use, but surely you’ve seen the concept of correlation as important. I’ve done a lot of simulation modeling, and with multiple uncertainties, correlation structures are often key – as they are with the election forecasting models often discussed on this blog (though I am not a fan of these). The particular example you give surely suggests autocorrelation as a topic, and I don’t believe you are saying that is irrelevant to modeling problems. So, I have to believe you are highlighting a very particular concept – the correlation coefficient in a linear model. As with most summary measures, uses are limited and the dangers may well outweigh the uses. Or do you have a more fundamental objection to the practical consequences of correlation?

      • +1 to Daniel. For looking at variations of temperature over time somewhere on earth, the required functional form is a sine curve that you’re only looking at in a certain time window. Calculating the correlation coefficient is useless.

        On the other hand, like Dale mentions, in physics and engineering, sometimes you need the concept of an autocorrelation function (for signal processing and other such problems). This is useful. Calculating the autocorrelation of two signals as a single number (average autocorrelation?) would definitely be useless though.

        Obligatory plug: we should all become members of the Society for the Suppression of the Correlation Coefficient.
        https://societytosupressthecorrelationcoefficient.wordpress.com/bibliography/

      • In time and space I prefer the variogram to correlation, for reasons discussed in these papers:

        [2006] Weighted classical variogram estimation for data with clustering. Technometrics 49, 184–194. (Cavan Reilly and Andrew Gelman)

        [2010] Adaptively scaling the Metropolis algorithm using expected squared jumped distance. Statistica Sinica 20, 343–364. (Cristian Pasarica and Andrew Gelman)

        [2021] Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC. Bayesian Analysis 16, 667–718. (Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner)
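
        For concreteness, here is a minimal sketch in Python of the classical (unweighted) empirical variogram for a regularly spaced series; the weighted estimators in the first paper handle clustered data, which this toy version ignores, and the series itself is made up.

        import numpy as np

        def empirical_variogram(y, lags):
            # semivariance at lag h: half the average squared difference between points h apart
            return {h: 0.5 * np.mean((y[h:] - y[:-h]) ** 2) for h in lags}

        # toy declining daily series, similar in spirit to the temperature example
        rng = np.random.default_rng(2)
        day = np.arange(100)
        temp = 80 - 0.35 * day + rng.normal(0, 10, day.size)
        print(empirical_variogram(temp, lags=[1, 5, 10, 30, 60]))
        # the semivariance grows with lag here because of the trend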

      • Dale, yes, I mean the correlation coefficient. Time series autocorrelation, spatial covariance functions, and the like are all much more informative and complex than a single-number correlation coefficient between bivariate data.

        I’m much more interested in the concept of functional forms for predictive purposes than “how much does this data spread out along a line”

        As Anonymous says, you’re much more interested in the Fourier spectrum of temperature records than in any kind of Pearson correlation coefficient.

        It turns out that a Fourier coefficient and a covariance are both projection operators onto members of an infinite basis set. Fourier-wise it’s sin/cos(x), while Pearson-wise it’s y=x. With the Fourier expansion you have a complete basis; with the Pearson you’re just projecting on a single element of an unspecified basis set. Lines are OK, I guess, but rarely the full story.
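
        In symbols (my notation, with an inner product over the data points or the function space), the analogy is roughly:

        c_k = \frac{\langle f, e_k \rangle}{\langle e_k, e_k \rangle}, \qquad e_k \in \{\sin kx, \cos kx\}

        r = \frac{\langle x - \bar{x}, \; y - \bar{y} \rangle}{\lVert x - \bar{x} \rVert \, \lVert y - \bar{y} \rVert}

        so the Fourier coefficients project onto a complete orthogonal family, while the Pearson coefficient projects onto a single centered, scaled direction.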

        • Daniel, Anon:

          But isn’t this more a criticism of modelling the temperature with a linear model, instead of a sin/cos-model, rather than a criticism of correlation (and the Pearson correlation coefficient) directly? (I think Dale’s comment also pertains to this question.)

          For the correlation coefficient (and related measures like r-squared) to reflect some comparison in variations of the model/error/outcome, it just requires that the model and error are orthogonal, and that the error is unbiased. That is: you could very well calculate a correlation coefficient (or r-squared etc.) between model and outcome after fitting the sin/cos model (preferably estimated in a new sample), to estimate how much variation there is in average daily temperature over the year, compared to the total variation in temperature over the year, no?
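
          Something like this, say (a sketch in Python with simulated data; the one-harmonic sin/cos fit and the noise level are invented, and ideally the r would be computed on a new sample, as mentioned above):

          import numpy as np

          rng = np.random.default_rng(1)
          day = np.arange(365)
          truth = 52 + 25 * np.cos(2 * np.pi * (day - 200) / 365)    # made-up annual cycle
          temp = truth + rng.normal(0, 8, day.size)                   # plus noise

          # Fit a first-harmonic sin/cos model by least squares
          X = np.column_stack([np.ones_like(day, dtype=float),
                               np.sin(2 * np.pi * day / 365),
                               np.cos(2 * np.pi * day / 365)])
          beta, *_ = np.linalg.lstsq(X, temp, rcond=None)
          fitted = X @ beta

          # Correlation between fitted values and outcome; its square is the usual R-squared
          r = np.corrcoef(fitted, temp)[0, 1]
          print(round(r, 2), round(r**2, 2))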

        • P.s.: Of course, here we have dependence between timepoints, which makes the usual estimate of the correlation coefficient problematic, although that seems like a separate problem from what has been forwarded.

        • Mathias, all I mean is that while lots of people seem to immediately reach for the correlation coefficient and often then stop there, I rarely if ever even consider it. When modeling a scenario involving say 2 variables my usual starting point is a graph, and then an internal discussion about how the relationship should look.

          Here are some examples:

          y is a function of x, x is contained in [0,inf) and the function should decline from a finite limit at x=0 like a power function to an asymptotic constant.

          y and x are not a function of each other, but the radius sqrt(x^2+y^2) is constrained to a narrow region around 1

          y is a function of x and the rate of change of y with x is an approximately known function of x (a differential equation)

          y is a function of x which is quasi-periodic on the range 0 to 365

          y is a complicated function of x which is not easily represented but the values of x are in [0,1] and by shifting and scaling them into [-1,1] we can represent them as a Chebyshev polynomial series with a small number of coefficients, perhaps 5. The function should have relatively small second derivative everywhere.

          y is a function of x and is adequately described by y = a*x + b

          After fitting, my interest in “explained variance” type calculations is almost nil, typically I think in terms of typical size of prediction error, and/or shape of the marginal distribution of that prediction error (perhaps skewed etc).

          The correlation coefficient just doesn’t enter into my modeling thought almost ever. I literally never calculate it. Off the top of my head I don’t even know the name of the function in Julia which calculates it. Turns out it’s “cor” which I found by guessing.

        • Daniel:

          I agree that people probably too often just consider the correlation coefficient, and too often believe it (and r-squared) say things they do not say. On my end, I just think of them as comparisons of variations – but I do think such comparisons can be useful.

          From your examples, it looks like you work with data where there is a clear pattern between variables. I understand that it is probably not so interesting to look at things beyond how to fit the model then, as the error will be small anyway.

          However, in those cases where you do look at the error, it seems to me that a correlation-like measure would be of interest to many (to you?). Say for example that someone fits a model and finds a “small” absolute error of 0.01 standard deviations (or mean absolute deviation or something similar) in the variation of whatever they measure (length in cm, income in $, etc.). If they then go and celebrate the usefulness of that model, but it turns out that the original variation (the variation that remains if you just predict with the mean) that they tried to capture is 0.010001 standard deviations (or mean absolute deviations etc.), then this seems to me to be useful information.

          That we can see the “non-impressiveness” of the model in that case is because we compare two sources of variation (error remaining / original in outcome), and such comparisons are exactly what correlation-like coefficients measure. I agree that the Pearson correlation coefficient is often not the most useful comparison between variations on its own (model / original in outcome), but it seems to me that measures of this type can contribute quite useful information, whether one reports them directly, or reports the absolute variations (e.g. of error and outcome) and then just compares them in one’s head.
