Low correlation of predictions and outcomes is no evidence against hot hand

Posted on December 19, 2016 9:40 AM by Andrew

Josh Miller (of Miller & Sanjurjo) writes:

On correlations, you know, the original Gilovich, Vallone, and Tversky paper found that the Cornell players’ “predictions” of their teammates’ shots correlated 0.04, on average. No evidence they can see the hot hand, right?

Here is an easy correlation question: suppose Bob shoots with probability ph=.55 when he is hot and pn=.45 when he is not hot. Suppose Lisa can perfectly detect when he is hot, and when he is not. If Lisa predicts based on her perfect ability to detect when Bob is hot, what correlation would you expect?

With that setup, I could only assume the correlation would be low.

I did the simulation:

> n <- 10000
> bob_probability <- rep(c(.55,.45),c(.13,.87)*n)
> lisa_guess <- round(bob_probability)
> bob_outcome <- rbinom(n,1,bob_probability)
> cor(lisa_guess, bob_outcome)
[1] 0.06

Of course, in this case I didn’t even need to compute lisa_guess as it’s 100% correlated with bob_probability.
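
As a rough check on the simulated value, the correlation in this two-state setup can also be computed directly. The sketch below is in Python rather than R, and the function name `expected_correlation` and the argument `q_hot` (the share of shots taken while hot) are illustrative names, not part of Miller's example.

```python
import math

def expected_correlation(p_hot, p_not, q_hot):
    """Population correlation between a perfect hot/not-hot indicator
    and the 0/1 shot outcome.

    q_hot is the assumed fraction of shots taken while hot.
    """
    p_bar = q_hot * p_hot + (1 - q_hot) * p_not    # overall hit rate
    cov = q_hot * (1 - q_hot) * (p_hot - p_not)    # cov(indicator, outcome)
    sd_indicator = math.sqrt(q_hot * (1 - q_hot))  # sd of a Bernoulli(q_hot)
    sd_outcome = math.sqrt(p_bar * (1 - p_bar))    # sd of a Bernoulli(p_bar)
    return cov / (sd_indicator * sd_outcome)

print(round(expected_correlation(0.55, 0.45, 0.13), 3))
```

With the values used above (0.55, 0.45, and hot 13% of the time) this works out to roughly 0.067, in line with the simulated 0.06 once sampling noise at n = 10,000 is taken into account.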

This is a great story, somewhat reminiscent of the famous R-squared = .01 example.

P.S. This happens to be closely related to the measurement error/attenuation bias issues that Miller told me about a couple years ago. And Jordan Ellenberg in comments points to a paper from Kevin Korb and Michael Stillwell, apparently from 2002, entitled “The Story of The Hot Hand: Powerful Myth or Powerless Critique,” that discusses related issues in more detail.

The point is counterintuitive (or, at least, counter to the intuitions of Gilovich, Vallone, Tversky, and a few zillion other people, including me before Josh Miller stepped into my office that day a couple years ago) and yet so simple to demonstrate. That’s cool.

Just to be clear, right here my point is not the small-sample bias of the lagged hot-hand estimate (the now-familiar point that there can be a real hot hand but it could appear as zero using Gilovich et al.’s procedure) but rather the attenuation of the estimate: the less-familiar point that even a large hot hand effect will show up as something tiny when estimated using 0/1 data. As Korb and Stillwell put it, “binomial data are relatively impoverished.”
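
One way to see this attenuation is to simulate a shooter whose hit probability follows a hidden two-state process and then look only at the 0/1 outcomes. The following Python sketch is an assumption-laden illustration, not the procedure from any of the papers mentioned: the probabilities 0.70 and 0.30, the 10% switching rate, and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n = 200_000                      # large n, so small-sample bias is not the issue
p_hot, p_not = 0.70, 0.30        # deliberately large gap (assumed values)
switch_prob = 0.10               # chance the hidden state flips before each shot (assumed)

# Hidden state: alternate between hot (1) and not hot (0),
# flipping with probability switch_prob at each shot.
flips = rng.random(n) < switch_prob
hot = np.cumsum(flips) % 2
prob = np.where(hot == 1, p_hot, p_not)
hit = rng.binomial(1, prob)

def lag1_corr(x):
    """Correlation between a series and itself shifted by one shot."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

prob_autocorr = lag1_corr(prob)
hit_autocorr = lag1_corr(hit)

# Difference in hit rate after a hit versus after a miss.
prev, curr = hit[:-1], hit[1:]
lagged_diff = curr[prev == 1].mean() - curr[prev == 0].mean()

print(f"true gap in hit probability:              {p_hot - p_not:.2f}")
print(f"lag-1 autocorrelation of hit probability: {prob_autocorr:.3f}")
print(f"lag-1 autocorrelation of hits:            {hit_autocorr:.3f}")
print(f"P(hit | prev hit) - P(hit | prev miss):   {lagged_diff:.3f}")
```

Under these assumed settings the gap in hit probability is 0.40, yet the difference in hit rate after a hit versus after a miss should land much closer to 0.1, even with 200,000 shots, so this is attenuation rather than small-sample bias.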

This finding (which is mathematically obvious, once you see it, and can be demonstrated in 5 lines of code) is related to other obvious-but-not-so-well-known examples of discrete data being inherently noisy. One example is the R-squared=.01 problem linked to at the end of the above post, and yet another is the beauty-and-sex-ratio problem, where a researcher published paper after paper of what was essentially pure noise, in part because he did not seem to realize how little information was contained in binary data.

Again, none of this was a secret. The problem was sitting in open sight, and people have been writing about this statistical power issue forever. Here, for example, is a footnote from one of Miller and Sanjurjo’s papers:

Funny how it took this long for it to become common knowledge. Almost.

P.P.S. I just noticed another quote from Korb and Stillwell (2002):

Kahneman and Tversky themselves, the intellectual progenitors of the Hot Hand study, denounced the neglect of power in null hypothesis significance testing, as a manifestation of a superstitious belief in the “Law of Small Numbers”. Notwithstanding all of that, Gilovich et al. base their conclusion that the hot hand phenomenon is illusory squarely upon a battery of significance tests, having conducted no power analysis whatsoever! This is perhaps the ultimate illustration of the intellectual grip of the significance test over the practice of experimental psychology.

I agree with the general sense of this rant, but I’d add that, at least informally, I think Gilovich et al., and their followers, came to their conclusion not just based on non-rejection of significance tests but also based on the low value of their point estimates. Hence the relevance of the issue discussed in my post above, regarding attenuation of estimates. It’s not just that Gilovich et al. found no statistically significant differences, it’s also that their estimates were biased in a negative direction (that was the key point of Miller and Sanjurjo) and pulled toward zero (the point being made above). Put all that together and it looked to Gilovich et al. like strong evidence for a null, or essentially null, effect.

P.P.P.S. Miller and Sanjurjo update: A Visible Hand? Betting on the Hot Hand in Gilovich, Vallone, and Tversky (1985).

59 thoughts on “Low correlation of predictions and outcomes is no evidence against hot hand”

JSE on December 19, 2016 9:52 AM at 9:52 am said:

I did a simulation like this when I was writing about the hot hand, with similar results. Korb and Stillwell did a similar test here:

https://www.researchgate.net/profile/Kevin_Korb/publication/228609377_2003The_Story_of_The_Hot_Hand_Powerful_Myth_or_Powerless_Critique/links/0a85e538edbbca9489000000.pdf

- Andrew on December 19, 2016 10:39 AM at 10:39 am said:
  
  Jordan:
  
  Thanks for the link which was indeed relevant. See P.S. added above.
  
  - Joshua B. Miller on December 19, 2016 3:35 PM at 3:35 pm said:
    
    Andrew, your P.S. is incorrect.
    
    Korb & Stillwell have some nice coverage of the statistical power issues, but they do not perform this calculation.
    
    The example you discuss in this post was one we made in reference to Gilovich, Vallone and Tversky’s betting task, p. 308-309 and Table 6 on p. 310, here.
    We created a version of this example, and re-analyzed GVT’s betting data, on p. 24-25 of our paper (here, current version: November 15, 2016).
    
    Korb & Stillwell do not correlate predictions of outcomes with outcomes, or run a hidden Markov chain with a predictor.
    
    - Joshua B. Miller on December 19, 2016 8:17 PM at 8:17 pm said:
      
      Andrew-
      
      thanks for the update on the P.S..
      
      last thing: the power issues which Korb & Stillwell cover are different from our point here. Our calculation is about how to evaluate beliefs/prediction/betting data, i.e. it is about the hot hand fallacy, not the hot hand effect.
      
      Of course our little example is related to the measurement error story you discuss, and we actually yoked this example to one of our simulation studies of measurement error (and power) from the appendix of our 2014 “Cold Shower” paper, where we used a hidden Markov model data generating process. (The first discussion of measurement error and the hot hand, using autocorrelation as the example, was in Dan Stone’s 2012 AmStat paper).
    - Joshua B. Miller on September 14, 2017 1:10 AM at 1:10 am said:
      
      Update…
      
      we have a new short paper describing this issue in more detail, with a more complete set of simulations to show how the correlation depends on how big the hot hand is, its frequency, and the predictor’s ability.
      
      In particular we write:
      
      The reason why this correlation measure is expected to lead to a surprisingly low underestimate of prediction ability is closely related to Stone (2012)’s work on measurement error and the hot hand.
      
      And we have a footnote:
      
      In particular, Stone (2012) showed that the serial correlation in hits is expected to be far lower than the serial correlation in the player’s hit probability, as a hit is only a noisy measure of a player’s underlying probability of success. In this case we have shown that the correlation between “bet hit” and “hit” is expected to be far lower than the correlation between “bet hit” and the player’s hit probability, for the same reason.
- Joshua B. Miller on December 19, 2016 1:44 PM at 1:44 pm said:
  
  thanks Jordan. We were aware of Korb and Stillwell (and your book), but somehow missed this.
  
- Joshua B. Miller on December 19, 2016 3:18 PM at 3:18 pm said:
  
  Jordan
  
  Just looked at Korb & Stillwell quickly and it looks like they are making the power point (not to be confused with MS), and the additional point that binomial data is typically underpowered. I do not see them correlating predictions of outcomes with outcomes, or running a hidden Markov chain with a predictor.
  
Guy on December 19, 2016 10:17 AM at 10:17 am said:

This is an obviously true observation. But how valuable is it to dissect 30-year-old statistical errors, when there is a mountain of more recent evidence suggesting the hot hand effect is too small to have any practical effect or usefully inform strategic decisions in sports? In the end, we’re still left with a vast gap between the popular perception of the hot hand (large) and its reality (small). Perhaps “fallacy” is too strong a word, but belief in the hot hand is certainly a significant cognitive error. Put it this way: a sports decision-maker would make far, far better decisions if they dismissed the hot hand as non-existent than if they accepted the popular belief in its size and importance. I fear that this suddenly fashionable meme (“there really is a hot hand after all!”) will have the net effect of reducing rather than enhancing public understanding.

- Andrew on December 19, 2016 10:44 AM at 10:44 am said:
  
  Guy:
  
  You write, “there is a mountain of more recent evidence suggesting the hot hand effect is too small to have any practical effect or usefully inform strategic decisions in sports . . . a sports decision-maker would make far, far better decisions if they dismissed the hot hand as non-existent than if they accepted the popular belief in its size and importance.”
  
  I’m not quite sure it makes sense to speak of “the popular belief in its size and importance” given that these beliefs vary; indeed one popular belief is that there is no hot hand at all!
  
  I agree with your larger point that what is relevant is the magnitude of the phenomenon and how it works, rather than its mere existence or nonexistence. The point of the above post is that you’re in for all sorts of trouble if you try to estimate this magnitude using sequences of 1’s and 0’s.
  
  - Rahul on December 19, 2016 11:00 AM at 11:00 am said:
    
    Naive question: So I guess I don’t understand the “hot hands” question in the first place: Obviously, no one is arguing that a basketball player is a random number generator, right?
    
    So, obviously some days I play better, other days I don’t. If we declare the hot-hands theory disproven, does that mean we can never predict the outcome of a player’s next shot with anything more than strictly random accuracy?
    
    What’s a compact, canonical statement of the hot hands hypothesis?
    
    - Andrew on December 19, 2016 11:08 AM at 11:08 am said:
      
      Rahul:
      
      Korb and Stillwell write, “It is not entirely clear what the Hot Hand phenomenon is generally supposed to be, nor just what GVT intend us to understand by it.”
      
      Roughly, I think the claim made by Gilovich et al. and their followers was not that players are random number generators, but that they are statistically indistinguishable from random number generators, so that belief in any evidence for the hot hand was a fallacy.
    - Rahul on December 19, 2016 11:16 AM at 11:16 am said:
      
      So why do we spend so much time on analyzing a problem that’s not even well defined?
      
      My feeling is, we keep going in circles: The effect is tiny to start with, & every time one analyst makes some progress the other side shifts the goalposts ever so slightly. And we keep arguing endlessly.
      
      Isn’t this a fool’s errand?
    - Andrew on December 19, 2016 11:25 AM at 11:25 am said:
      
      Rahul:
      
      I agree that if the goal here were to determine a winner or loser of the hot hand debate, this would be largely a waste of time. But that’s not my goal.
      
      There’s no winner or loser here. The “game” is to better understand how to use statistical analysis to learn things about the world, and the study of these sorts of simple yet real examples can be, for me and others, a good way to gain intuition and develop new principles. God is in every leaf of every tree.
    - Rahul on December 19, 2016 12:22 PM at 12:22 pm said:
      
      If one must study the frivolous, at least study the *well-defined* frivolous. :)
    - Realist Writer on December 19, 2016 11:26 AM at 11:26 am said:
      
      It is. The same argument for it being a fool’s errand could be made for other (possible) effects such as ESP and Power Pose.
    - Jonathan (another one) on December 19, 2016 11:11 AM at 11:11 am said:
      
      There are several versions of the hot hand hypothesis. The strongest version is that you don’t actually have a different p from day to day. You have one underlying p and your varying performance from day to day is just the effects of the underlying p.
      You are correct, though: I don’t know anybody who believes the strong version.
      The weaker version is simply that p(shot made|previous shot made)>p(shot made|previous shot missed). Fancier versions can adjust with lots of other independent variables p(shot made|previous shot made,X=X1)>p(shot made|previous shot made,X=X2) where X is some vector of independent variables at the time of the shot whose effects are previous-shot-independent.
    - Rahul on December 19, 2016 11:21 AM at 11:21 am said:
      
      Thanks. The strong version seems axiomatically stupid: Isn’t that like saying injuries etc. just don’t exist.
      
      About the weak version: The first formulation seems nice p(shot made|previous shot made)>p(shot made|previous shot missed). This ought to be eminently testable.
      
      The part that makes this “hot hands” debate farcical is the adjustment part. Obviously there cannot be universal objective agreement on a canonical X vector.
      
      When the effect is small to start with, isn’t any “conclusion” of the hot hands problem merely a specific statement of what you think subjectively is the “right” X vector?
    - Rahul on December 19, 2016 11:27 AM at 11:27 am said:
      
      I love your formalization:
      
      p(shot made|previous shot made)>p(shot made|previous shot missed)
      
      So what’s the empirical answer to this question, unadjusted for anything. Anyone know?
    - Jonathan (another one) on December 19, 2016 11:38 AM at 11:38 am said:
      
      I’m sure there’s a lot more data now, but the original Gilovich paper http://wexler.free.fr/library/files/gilovich%20(1985)%20the%20hot%20hand%20in%20basketball.%20on%20the%20misperception%20of%20random%20sequences.pdf has the original data in Table 1.
      p(shot made|previous shot miss)=.54 and p(shot made|previous shot made)=.52. This data is for 9 players in 1980-81.
      
      A variant of the hot hand hypothesis is that p(shot made|n previous shots made)>p(shot made|previous shot missed). It is this variant, which Gilovich tests, that is shown to be calculated in a biased fashion in Miller et al.
    - Rahul on December 19, 2016 12:18 PM at 12:18 pm said:
      
      Thanks!
      
      The other big issue is going to be external validity, e.g. does this finding generalize to 1990-91? Et cetera. Who knows!?
      
      Frankly, I think it’s a silly problem to study. When the system you are studying itself is so highly variable it is folly to try to answer questions depending on such small differences as 0.54 vs 0.52.
    - Andrew on December 19, 2016 12:36 PM at 12:36 pm said:
      
      Rahul:
      
      Grrrr…. No, the probabilities are not 0.54 vs. 0.52. They can vary by much more than that! The point is that conditioning on the previous shot is an extremely noisy way to measure anything. The result is to make the apparent differences look small and inconsequential, even if the underlying differences are huge.
    - Rahul on December 19, 2016 12:38 PM at 12:38 pm said:
      
      Andrew:
      
      OK. What’s your preferred statement of the problem?
    - Jonathan (another one) on December 19, 2016 1:49 PM at 1:49 pm said:
      
      Sure, P(make|miss) and P(make|make) are really noisy. But the hot hand would have to include that at a minimum, wouldn’t it? P(make|miss), P(make|hit), P(make|2 hits), P(make|3 hits) would be an increasing series under the hot hand hypothesis, and if you can’t even pass step one (which has more data than any other conditional series), what chance do you have later? If nothing else, wouldn’t the set of players with hot hands (assuming there were any such things) be MUCH more likely to have P(hit|make)>P(hit|miss)? Obviously figuring out whether or not there are any such players is the essence of the exercise, of course, but wouldn’t you be really suspicious of spurious fitting if you found that the hot hand only starts after you’d hit 4 in a row and that that fifth shot was really really likely, but that the first make was negatively correlated with the second?
    - Andrew on December 19, 2016 3:06 PM at 3:06 pm said:
      
      Jonathan:
      
      Regarding your statement, “you can’t even pass step one,” please look in the above post and in the link in the P.P.S. to get a sense of why it does not make sense to take this apparent failure as evidence that there’s no hot hand, or even as evidence that any hot hand phenomenon is small. What we’ve learned from looking at this problem is that even a large hot hand phenomenon can fail to show up when studied in this crude manner.
    - Jonathan (another one) on December 19, 2016 3:36 PM at 3:36 pm said:
      
      OK… I read the Korb and Stillwell paper and frankly, the result is completely inappropriate to the serial correlation analysis. What they test is a case where, in a sequence of 10 shots, the probability goes up only after the fifth shot and stays there. OF COURSE, that will be an incredibly weak test of whether making the previous shot increases the chance of making the next one: 9/10ths of the time it doesn’t! The fact that the serial correlation test gets nearly twice the “significant” results as the null suggests that this test is really quite good! I’m going to make a simulation that I think will show this all much better….
    - Joshua B. Miller on December 19, 2016 2:09 PM at 2:09 pm said:
      
      Rahul
      
      There are different levels of discussion. If you want to stick with the literature, the consensus view was the “cognitive illusion” view. The original paper concluded that shots were as-if randomly generated (as Andrew said), and this was the accepted consensus, e.g. here, and this is how papers cited this result, e.g. the first two paragraphs here. Researchers were committing a fallacy here.
      
      For your compact canonical statement, if we again want to tie our hands to this literature, well the original paper discussed hot hand (and streak shooting) in terms of patterns, rather than in terms of process. In that paper the hot hand is when the length and frequency of streaks are greater than one would expect by chance (see here, p. 296-297), or probability of success increases after recent success.
      
      If you look at how lay players and coaches discuss it, they use the words “zone”, “rhythm”, and “flow”, so clearly they are talking about the probability of success increasing, for whatever reason.
    - Rahul on December 19, 2016 2:59 PM at 2:59 pm said:
      
      But isn’t this vulnerable to the same sort of “p-hacking” that Andrew criticizes?
      
      i.e. First we look at the previous shot. Not much to see so we then switch to screening n-previous shots. Or tweak the streak definition.
      
      Having done this, we can try to control for some vector X. Try other variations of X. Next comes the huge degree of freedom offered by which year & which players to analyze. Et cetera.
      
      Sure you’ll discover some zone or rhythm somewhere. Now is this a phenomenon or an artifact?
    - Joshua B. Miller on December 19, 2016 3:28 PM at 3:28 pm said:
      
      Rahul-
      
      on p-hacking, there are a few ways to know that it is not:
      
      1. The 3-back measure is the one all previous studies use, and what fans define as a “streak,” so it isn’t chosen out of thin air (it’s significant with 2-back and 4-back, but too sparse with 5+ back).
      
      2. The 3-back measure works out-of-sample on all the data we collected: a. 3 point shooting contest, b. our own controlled shooting study, and c. a little-known study from the late 70s.
      
      3. One can control for the multiple comparisons issues involved with using other measures (length of longest streak, frequency of streaks) by constructing composite measures.
  - Guy on December 19, 2016 5:20 PM at 5:20 pm said:
    
    Agreed that there are many different beliefs about the existence/size of the hot hand. My sense just from talking to sports fans, and listening to almost any game broadcast, is that many fans, perhaps even most, believe in a strong hot hand effect. To be more concrete, I think they believe that a 50% NBA shooter becomes something like a 60% or 70% shooter when “hot,” and that a .280 hitter becomes a .350 or better hitter when hot. And my sense (but I offer this tentatively, as I don’t know this literature well) is that academic research on *belief* in the hot hand is consistent with my intuition. But if the research shows a much more limited public belief in hot hand effects, I will happily revise my view.
    
    - Joshua B. Miller on December 19, 2016 6:02 PM at 6:02 pm said:
      
      Guy-
      
      We were thinking more about the beliefs of players and coaches, i.e. the decision makers. Right or wrong, we were less concerned about the “public understanding” as you mention. Fans say all sorts of things, and it’s not even clear what they mean. For example, people don’t know the difference between a 60% chance of rain and an 80% chance of rain; in either case, if it doesn’t rain, the weather forecaster is wrong. For another example, look how people were interpreting betting odds in the previous election.
- Joshua B. Miller on December 19, 2016 2:50 PM at 2:50 pm said:
  
  Guy-
  
  If you define hot hand as the statistical estimate of the increase in field goal percentage after recent success for the *average* player in an unbalanced panel with partial controls, then, yes, if someone were to believe this effect to be large, they would be making a significant cognitive error. But in order to test if people’s beliefs are wrong, you have to first be clear on what their beliefs pertain to.
  
  Further, in all controlled (and semi-controlled) tests, the increase in field goal percentage after a recent streak of success is meaningfully large, 5-13 percentage points depending on the study (with the highest estimate in the original study).
  
  If one defines the hot hand as the change in a player’s probability of success when in the hot state, or more realistically, with continuous states, the range over which a player’s probability of success varies (controlling for difficulty), the “mountain of more recent evidence” can’t say much (I assume we are referring to the same mountain). We should have a little humility about what we can measure here. Where the data cannot speak, should we not defer to the practitioners?
  
  What you bring up in your final two sentences may be on point though. If players & coaches had to choose between two blanket dumb heuristics of (a) always believe in the hot hand, or (b) never believe in the hot hand, then the latter may be the safer bet, depending on the strength of the typical reaction. This ignores, of course, the possibility that the coach or player has more granular information than the zeros and ones we are looking at, and can respond more judiciously. We agree, the “hot hand exists” vs. the “hot hand doesn’t exist” is a terrible way to think of it; the binary formulation is what leads to blanket heuristic thinking.
  
  - Guy on December 19, 2016 5:28 PM at 5:28 pm said:
    
    Josh:
    While I do believe that ignoring the hot hand would be more accurate than accepting the typical fan’s beliefs, those obviously aren’t the only choices. So the more interesting question (to me) is whether the hot hand is ever large enough to be of any practical significance in sports decision-making? For example, should a team pitch differently to a “hot” hitter (or use a hot hitter as a pinch hitter when he would not otherwise be used)? Or should a team give more shots to a “hot” NBA shooter, and should their opponent assign a superior defensive player to guard that “hot” shooter? To me, this is what it would mean for the hot hand to be “real” (while acknowledging that weaker hot hand effects might exist). I have yet to see any compelling evidence for a hot hand that meets this test.
    
    If an NBA player did in fact improve his shooting accuracy by as much as 13 percentage points for a period of time, that would easily meet my test of “sports significance.” Or it would be IF this hotness could be detected in real time. I find it hard to believe that such large short-term changes in true ability actually occur. And it’s even harder to imagine how such talent changes could be measured with sufficient confidence to act on the information in real time.
    
    - Joshua B. Miller on December 19, 2016 6:22 PM at 6:22 pm said:
      
      Guy:
      
      You say: “I have yet to see any compelling evidence for a hot hand that meets this test.”
      
      -For game data, those are tough questions. We know of no evidence for or against using game data. But then again players and coaches make all sorts of decisions in the course of the game that cannot be backed up with data-driven evidence. Shouldn’t we defer to them until we have the data to address it? (assuming they aren’t using blanket one-size-fits-all heuristics)
      
      Side note: we do have evidence that bettors in GVT’s study do better predicting than they would if they were to guess randomly (and a dumb hot hand heuristic would also do better).
      
      You say: “I find it hard to believe that such large short-term changes in true ability actually occur.”
      
      -Well, 13pp was an average FG% effect in GVT’s Cornell study, so I’d bet some players *sometimes* get much more than that for their probability of success, in controlled settings. Also, remember we are talking about probability here, so you aren’t going to pick this up with FG% based on recent shot outcomes, which is all anyone ever measures. This means you can’t use existing data to inform your beliefs on the magnitude. The thing exists. Is it modest in all players? Huge in some players, tiny in most players? Well it could be 40pp big in some players and you would never see it in the data given the way it is currently analyzed.
      
      You say: “And it’s even harder to imagine how such talent changes could be measured with sufficient confidence to act on the information in real time.”
      
      -proprioception?
Clyde Schechter on December 19, 2016 10:28 AM at 10:28 am said:

Where did the .13 and .87 come from?

The result depends on the probability that Bob has a hot hand. If he has a hot hand half the time, the correlation peaks at around 0.1, and it declines symmetrically around 0.5.

By the way, I found that 10,000 replicates gives very noisy estimates. You really want to run this with 1,000,000 reps or even more.

None of this, of course, disputes the qualitative point that even if Lisa always knows exactly when Bob has a hot hand, the correlation between her prediction and his outcome will be pretty low.
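
A quick way to gauge both effects mentioned here, the dependence on how often Bob is hot and the noise at 10,000 shots, is to repeat the simulation many times. The Python sketch below assumes each shot is independently hot with probability `q_hot` (a simplification of the fixed 13/87 split in the post); the helper names and the replication counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed

def simulate_correlation(n, q_hot, p_hot=0.55, p_not=0.45):
    """One simulated correlation between a perfect hot indicator and the outcome."""
    hot = (rng.random(n) < q_hot).astype(float)
    outcome = rng.binomial(1, np.where(hot == 1, p_hot, p_not))
    return np.corrcoef(hot, outcome)[0, 1]

def summarize(n, q_hot, reps):
    """Mean and standard deviation of the correlation across replications."""
    draws = np.array([simulate_correlation(n, q_hot) for _ in range(reps)])
    return draws.mean(), draws.std()

# Fewer replications at the larger n, only to keep the run time short.
for n, reps in ((10_000, 100), (1_000_000, 20)):
    for q_hot in (0.13, 0.5, 0.87):
        mean, sd = summarize(n, q_hot, reps)
        print(f"n={n:>9,}  q_hot={q_hot:.2f}  mean corr={mean:.3f}  sd={sd:.4f}")
```

The spread across replications at n = 10,000 should be on the order of 0.01, which is not small next to a correlation below 0.1.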

- Andrew on December 19, 2016 10:40 AM at 10:40 am said:
  
  Clyde:
  
  It looks like I was assuming that the player was hot 13% of the time, but I don’t at all remember where that 13% came from!
  
  - Joshua B. Miller on December 19, 2016 4:17 PM at 4:17 pm said:
    
    we just picked 13% because that is the fraction of shots that would be preceded by a streak of 3 successes, for a Bernoulli p=.5 shooter. In the paper, we used 15% of the time, with the idea that it should be relatively rare.
    
    - Andrea on December 20, 2016 10:18 AM at 10:18 am said:
      
      Joshua, Clyde,
      
      in any case, 13%, 15% or even 50% doesn’t drastically affect the main point. Even with p_hot=0.5, a simulation still predicts extremely low correlation:
      
      set.seed(512)
      n <- 10000
      bob_probability <- rep(c(.55,.45),c(.5,.5)*n)
      lisa_guess <- round(bob_probability)
      bob_outcome <- rbinom(n,1,bob_probability)
      cor(lisa_guess, bob_outcome)
      [1] 0.09700044
    - Joshua B. Miller on December 20, 2016 2:52 PM at 2:52 pm said:
      
      nice observation Andrea.
      
      the reason for this is interesting, and it comes right out of what we do in our paper.
      
      Our example, which Andrew quoted (appears on p. 24-25 of present version: November 15, 2016), answers essentially the following question: What if there were zero measurement error? Lisa has no measurement error.
      
      Now an implication of what Andrew noted about correlation is that corr(Bob’s state, hit)=corr(Lisa’s prediction, hit).
      
      Note that if we set up a least squares estimate of hit=a+b*[Bob’s state], then b=cov(bob’s state, hit)/var(bob’s state), whereas corr(bob’s state, hit)=cov(bob’s state, hit)/[sd(bob’s state)*sd(hit)].
      
      If bob’s state is hot 50% of the time, sd(bob’s state) is expected to be close to sd(hit), b/c the variance for bernoulli R.V. is p(1-p), and therefore b is expected to be close to corr(Bob’s state, hit).
      
      Note that because these variables are binary b=proportion(hit| bob’s state= h)-proportion(hit| bob’s state= n), and E[b]=ph-pn=0.10, given the set up.
      
      So the closer sd(bob’s state) is to sd(hit), the closer corr(Bob’s state, hit) is expected to be to E[b]=ph-pn=0.10.
      
      So the issue is a correlation of .10 seems small, but really correlation is hard to interpret because it is a dimensionless constant. But in this context of binary R.V.s, correlation can be related to probabilities, i.e. a difference in probabilities. A difference in probability of 10 percentage points is a lot for a basketball shot.
    - Joshua B. Miller on December 20, 2016 3:11 PM at 3:11 pm said:
      
      too bad I can’t edit.
      
      Two clarifications/corrections:
      
      1. Let hot=1 if Bob’s state is h, and 0 otherwise. The equation should be hit=a+b*hot, instead of hit=a+b*[Bob’s state]. So sd(bob’s state) can be thought of as sd(hot).
      
      2. correlation is dimensionless, clearly it is not constant.
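
The relationship between the regression slope, the difference in proportions, and the correlation described in this thread can be checked numerically. This is a minimal Python sketch under the same 0.55/0.45 setup; the variable names, the independent draw of the hot state, and the choice of one million simulated shots are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed
n = 1_000_000
p_hot, p_not = 0.55, 0.45

results = {}
for q_hot in (0.13, 0.5):
    # hot = 1 when Bob is in the hot state, 0 otherwise
    hot = (rng.random(n) < q_hot).astype(float)
    hit = rng.binomial(1, np.where(hot == 1, p_hot, p_not)).astype(float)

    cov = np.cov(hot, hit)[0, 1]
    slope = cov / hot.var(ddof=1)                        # b in hit = a + b*hot
    diff = hit[hot == 1].mean() - hit[hot == 0].mean()   # difference in proportions
    corr = cov / (hot.std(ddof=1) * hit.std(ddof=1))     # slope * sd(hot) / sd(hit)

    results[q_hot] = (slope, diff, corr)
    print(f"q_hot={q_hot:.2f}  slope={slope:.4f}  diff={diff:.4f}  corr={corr:.4f}")
```

The slope and the difference in proportions should agree up to rounding, both near 0.10. The correlation should sit close to the slope when Bob is hot half the time and noticeably below it when he is hot 13% of the time, because sd(hot) is then smaller than sd(hit).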
anon on December 19, 2016 10:49 AM at 10:49 am said:

Why would we want to use the correlation coefficient to assess how binary predictions compare to binary outcomes?

- Andrew on December 19, 2016 10:56 AM at 10:56 am said:
  
  Anon:
  
  Information is information. In this case, it doesn’t really matter how you make the summary; the point is that there’s not much information there.
  
- Anoneuoid on December 19, 2016 12:35 PM at 12:35 pm said:
  
  I agree. When reading this I wondered how it was something I never noticed before, since I like knowing “gotchas” like this. Then I realized it is because I wouldn’t calculate a correlation in this situation (or if I ever did, it was quickly dismissed as unhelpful).
  
- Rahul on December 19, 2016 12:39 PM at 12:39 pm said:
  
  +1 It stumped me.
  
- Joshua B. Miller on December 19, 2016 2:18 PM at 2:18 pm said:
  
  Anon-
  
  for why we talked about these correlations, see p. 310 here, the original paper.
  
  Another reason: if predictions of success are made almost as often as success outcomes (doesn’t have to be that close), the correlation coefficient is close to an estimate of Pr(success|predict success)-Pr(success|predict failure).
  
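
Miller's point about the correlation and the difference in conditional probabilities can be checked directly. For 0/1 predictions and 0/1 outcomes, the sample correlation equals Pr(success|predict success)-Pr(success|predict failure) multiplied by sd(prediction)/sd(outcome), so the two are close whenever predictions of success are about as frequent as successes. Here is a minimal Python sketch; the shot counts are made up for illustration and the function names are hypothetical, not anything from the original paper:

```python
import math

def make_shots(n_pred_hit, hits_when_pred_hit, n_pred_miss, hits_when_pred_miss):
    """Build hypothetical 0/1 prediction and outcome lists from a 2x2 table of counts."""
    predictions = [1] * n_pred_hit + [0] * n_pred_miss
    outcomes = ([1] * hits_when_pred_hit + [0] * (n_pred_hit - hits_when_pred_hit)
                + [1] * hits_when_pred_miss + [0] * (n_pred_miss - hits_when_pred_miss))
    return predictions, outcomes

def correlation_and_gap(predictions, outcomes):
    """Return (correlation, difference in conditional hit rates, sd ratio)."""
    n = len(predictions)
    mean_x = sum(predictions) / n
    mean_y = sum(outcomes) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictions, outcomes)) / n
    # For a 0/1 variable the variance is mean * (1 - mean).
    sd_x = math.sqrt(mean_x * (1 - mean_x))
    sd_y = math.sqrt(mean_y * (1 - mean_y))
    corr = cov / (sd_x * sd_y)
    n_pred_hit = sum(predictions)
    rate_when_pred_hit = sum(y for x, y in zip(predictions, outcomes) if x == 1) / n_pred_hit
    rate_when_pred_miss = sum(y for x, y in zip(predictions, outcomes) if x == 0) / (n - n_pred_hit)
    return corr, rate_when_pred_hit - rate_when_pred_miss, sd_x / sd_y

# Made-up data: hit rate 60% after a "hit" prediction, 40% after a "miss" prediction.
balanced = correlation_and_gap(*make_shots(50, 30, 50, 20))   # "hit" predicted half the time
lopsided = correlation_and_gap(*make_shots(20, 12, 80, 32))   # "hit" predicted 20% of the time
print(balanced)
print(lopsided)
```

In the balanced case the correlation and the gap in hit rates coincide. In the lopsided case the gap is still 20 percentage points, but the correlation is smaller because sd(prediction)/sd(outcome) is below 1.
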
Chandrasekhar Ramakrishnan on December 19, 2016 10:52 AM at 10:52 am said:

Here is the simulation translated to Python in case someone finds it useful:

```
import numpy as np
import scipy.stats

n = 10000
# Bob is hot (p = .55) on 13% of shots and not hot (p = .45) on the other 87%
h_n_prob = np.array([.55, .45])
h_n_count = np.rint(np.array([.13, .87]) * n).astype(int)
bob_probability = np.repeat(h_n_prob, h_n_count)
# Lisa detects the state perfectly: her guess is 1 when Bob is hot, 0 when he is not
lisa_guess = bob_probability.round()
# One Bernoulli draw per shot, using that shot's probability
bob_outcome = np.random.binomial(1, bob_probability)
print(scipy.stats.pearsonr(lisa_guess, bob_outcome)[0])
```
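
The simulated value can also be compared with the correlation the two-state setup implies in the population. With hot = 1 when Bob is hot, Cov(hot, hit) = q(1-q)(ph-pn), where q is the fraction of hot shots, so the correlation is (ph-pn)*sd(hot)/sd(hit). A small sketch of that calculation, using only the standard library; the function name is hypothetical:

```python
import math

def expected_correlation(p_hot, p_not, frac_hot):
    """Population correlation between a perfect hot/not-hot indicator and hit/miss."""
    # Overall hit probability, averaging over the two states.
    p_hit = frac_hot * p_hot + (1 - frac_hot) * p_not
    # Standard deviations of the two 0/1 variables.
    sd_hot = math.sqrt(frac_hot * (1 - frac_hot))
    sd_hit = math.sqrt(p_hit * (1 - p_hit))
    # Cov(hot, hit) = frac_hot * (1 - frac_hot) * (p_hot - p_not),
    # so the correlation is the probability gap scaled by sd_hot / sd_hit.
    return (p_hot - p_not) * sd_hot / sd_hit

print(expected_correlation(0.55, 0.45, 0.13))  # the setup from the post
print(expected_correlation(0.55, 0.45, 0.50))  # same gap, hot half the time
```

For the numbers in the post this comes out a little under 0.07, in line with the simulated 0.06. Even if Bob were hot half the time, the correlation would only reach 0.10.
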

- Llewelyn on December 19, 2016 11:02 PM at 11:02 pm said:
  
  +1 Go fellow Pythonista!
  
Robert Grant on December 19, 2016 11:46 AM at 11:46 am said:

As a big simulation fan, I like the fact that your first reaction was to get R to run it 10,000 times rather than start messing about with algebra.

Daniel Lakeland on December 19, 2016 12:52 PM at 12:52 pm said:

One of the things I find interesting here is how this plays out in other areas, where the difficulty of getting evidence against something causes that thing to persist forever. And in these other cases, it’s clearly a lot more consequential. For example:

1) Acupuncture making people feel better based on the accurate placement of needles
2) Tamiflu reducing duration and severity of flu
3) Statins protecting against heart disease
4) Mammograms reducing breast cancer mortality

We’ve made a bit of progress on some of these perhaps. But the point is, you have a noisy problem where even if you can predict an underlying issue perfectly, the observed outcomes provide not much information about that fact… and you combine this with a default assumption along the lines of “X works” or “X doesn’t work” and it leads us to decades of confusion, wasted resources, and in the medical case potentially unneeded suffering.

Jonathan on December 19, 2016 4:27 PM at 4:27 pm said:

For amusement. I was watching Barcelona yesterday and Messi got the ball, put it through the 1st defender’s legs, then tapped it to the left as he hopped around the next defender, then tapped it to the right as he hopped to get around the next defender, then cut it back to the goal around a 4th defender, put a shot on the goalie that couldn’t be held and Suarez slammed the rebound home. My point: these little slices are clearly hot hand moments and Lionel gets into a zone of intelligent, instinctive athletic reaction which only a few players can reach and he does this relatively often, but it’s impossible to say when that will happen, which is one reason why we watch: when will Messi or Neymar or Suarez do something so powerfully beautiful? Or maybe they won’t at all today. Next point: we look at averages and totals and those include within them moments of good and bad play and maybe we go back and remember the high points and some low points – Ralph Branca – but those tend to be only visible in retrospect and our expectations matter a lot because we expect Alabama will crush a Div 2 opponent in football or that Peyton Manning would kill a team that blitzed all the time so in more complex situations though we can say completion/incomplete the context in which that occurs would, as noted, require a host of models matched to each situation and that has additional problems. As in, does Steph shoot better in the 4thQ when the game is within 5 points, but then does it matter if the other team has a guard who can defend the perimeter or whether the defenders played the night before and so on. Life is very complicated.

Vinnie Johnson was called the Microwave because he’d heat up fast but watching the games I saw a guy who was brought in as an offensive replacement with the team expectation he’d shoot and that many times he would do that and some of those times he was really good. But other times he’d come in not as an offensive replacement and he wouldn’t shoot or he’d shoot and miss and maybe not shoot three in a row. And it was obvious at times, as a Pistons fan, that the coaches would put Vinnie in to shoot sometimes and would run plays for him to shoot making it look like he was on fire if he hit the first 2 or 3 shots. Sensibly, they’d put him in because they thought those plays would work so maybe that would reflect the coach’s hot hand! The Celtics would intentionally run plays for Robert Parish early because, as they’ve described in print, they felt he’d put more effort into the game if he shot early. That made it look to fans like Chief was more of an offensive player, but really it was that they ran those few plays for him early and, crucially, we’d remember that he hit those few early jump shots in some games. They’d then run plays for him later in the game when those plays were the best option, which given basketball scoring, might be who was defending, who wasn’t on the floor for the Celtics, etc. I don’t know how anyone could separate actual hot hands from this kind of complexity given simple “it went in or it didn’t” stats.

Transient effects are really interesting. Would Pete Reiser have been a great player if he hadn’t crashed into outfield walls? Would Lady Catherine de Bourgh have been a great proficient if she had indeed learned to play? Is Messi better than Ronaldo? In most cases, you can’t even argue that totals matter: Babe Ruth played in a different era, was a pitcher for years, etc. Pelé never played in a European league and I don’t think anyone would argue that Josef Bican is actually the greatest soccer player though he tops the list of goals. Even if we model for competition, the arguments come down to “he hit the ball so incredibly far” or “he did the most spectacular things with a ball”.

- Corey on December 19, 2016 5:20 PM at 5:20 pm said:
  
  Would Lady Catherine de Bourgh have been a great proficient if she had indeed learned to play?
  
  +1… million
  
Llewelyn on December 19, 2016 10:54 PM at 10:54 pm said:

I find this truly confusing… but yet somehow it compels me to try. If I understand it all correctly, a simulation of successes following another success suggests that runs of actual baskets that match or exceed a player’s average are less likely to be observed unless something ‘special’, the hot hand, is happening.

The comments by Guy and Josh above were really useful, thank you. But, I struggle to understand how the same actor, Bob, in any real world sense can have two probabilities of shooting a basket. My guess is that these are the assumed hot hand and play as normal probabilities which as spectators we observe as a combined total probability. But then I wonder how two separate states exist in the same player…I get stuck on the practicality. So I assume all the cleverness around this is correct and so am left wondering how one identifies a hot hand state in an actual game? And if this can be done, how does one predict the likelihood of a hot-hand run in a game/season/career *retrospectively*? The example in my mind is the Australian cricketer, Donald Bradman, whose stats are just unusually high compared to anyone else. https://en.wikipedia.org/wiki/Don_Bradman

I also found this paper (from a course I did on Sport Analytics) on longest runs interesting (and baffling), https://www.maa.org/sites/default/files/pdf/upload_library/22/Polya/07468342.di020742.02p0021g.pdf. It seems to me that this is the same thing, but without a hot-hand factor operating. Or not?

- Andrew on December 19, 2016 11:00 PM at 11:00 pm said:
  
  Llewelyn:
  
  The model we believe is that the probability of success varies continuously from shot to shot, and that there are times when this probability is higher (when a player is “in the zone”) and times when it is lower. The model with two discrete probabilities is a ridiculous oversimplification that we use just to make the point about the estimate being super-noisy. A simulation using a more realistic model would show the same thing.
  
  - Rahul on December 20, 2016 12:53 AM at 12:53 am said:
    
    >>>probability of success varies continuously from shot to shot<<<
    
    This sounds intuitively so obvious that I'd be amazed to see data to the contrary.
    
    What's the alternative hypothesis? That a human player might be an ultra-consistent random number generator with no patterns, and no ups and downs?
    
    - Andrew on December 20, 2016 1:04 AM at 1:04 am said:
      
      Rahul:
      
      There is no “alternative hypothesis.” I don’t think that way. The point is to estimate the level of variation and estimate what predicts it. The point of the above post is that such estimation is difficult with 0/1 data, and naive estimates such as employed by Gilovich, Vallone, and Tversky in their celebrated paper are so biased and so noisy as to not be interpretable in the usual ways.
    - Rahul on December 21, 2016 12:48 AM at 12:48 am said:
      
      In my opinion it’d be much better if we didn’t frame this as “Evidence for (or against) the hot hand” but rather started thinking in terms of “How much does a player’s performance vary from day to day and how well can we predict performance?”
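
Andrew's remark earlier in this thread, that a more realistic model would show the same thing, can be sketched with a shooter whose success probability drifts smoothly from shot to shot. Everything specific here is an assumption made for illustration: the probability follows an AR(1) process on the logit scale with persistence 0.9 and shock sd 0.1, and the "predictor" knows the true probability exactly. The function names are hypothetical.

```python
import math
import random

def simulate_drifting_shooter(n_shots=50_000, persistence=0.9, shock_sd=0.1, seed=1):
    """Simulate shots whose success probability drifts as an AR(1) on the logit scale."""
    rng = random.Random(seed)
    z = 0.0  # logit of the current success probability; 0 means 50%
    probs, hits = [], []
    for _ in range(n_shots):
        z = persistence * z + rng.gauss(0.0, shock_sd)
        p = 1.0 / (1.0 + math.exp(-z))
        probs.append(p)
        hits.append(1 if rng.random() < p else 0)
    return probs, hits

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

probs, hits = simulate_drifting_shooter()
corr = pearson(probs, hits)
print(round(corr, 3))
```

Since Cov(p, hit) = Var(p), the population correlation between the true probability and the 0/1 outcome is sd(p)/sd(hit). With probabilities that wander only a few percentage points around 50%, that ratio stays small even though the predictor has perfect knowledge of p.
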
Matt on January 6, 2017 3:57 PM at 3:57 pm said:

With regard to the bias found in Miller’s paper – how does this extend to outcomes that are not binary? For example, if I want to know what the expected value of a die is following a roll of 6, will this be biased?

Reply ↓
- Andrew on January 6, 2017 4:00 PM at 4:00 pm said:
  
  Matt:
  
  The answer’s right in front of you, on your computer. Do a simulation in R or Python and find out!
  
  - Matt on January 6, 2017 4:54 PM at 4:54 pm said:
    
    Fair enough! So doing 100,000 reps of a sequence of N=100 of a random variable X that takes on 1,2,3,4,5,6 with prob=1/6, I get the following: E[X_t|X_t-1=1]=3.52 ; E[X_t|X_t-1=2]=3.515 ; E[X_t|X_t-1=3]=3.503 ; E[X_t|X_t-1=4]=3.492 ; E[X_t|X_t-1=5]=3.484 ; E[X_t|X_t-1=6]=3.473.
    
    Same mechanism obviously, but still it’s incredible that that is true. I’m going to use this hot-hand stuff for golf data, which is why the non-binary outcomes are relevant.
    
  - Matt on January 6, 2017 9:50 PM at 9:50 pm said:
    
    Yeah fair enough! With n=100 rep=100,000, X taking on 1,2,3,4,5,6 w prob 1/6 – you get E[X_t|X_t-1 = 6]=3.47, and then steadily increases up to E[X_t|X_t-1=1]=3.53. Cool!
    
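
For anyone who wants to try Matt's die experiment, here is one way it might be set up in Python. This is a sketch under an assumption: the conditional mean is computed within each 100-roll sequence and then averaged across sequences, which is the procedure that produces the bias (pooling all rolls across sequences before averaging would not). The function name and the default of 20,000 sequences are choices made here to keep the run short, not Matt's settings.

```python
import random

def mean_after_each_face(n_rolls=100, reps=20_000, seed=1):
    """Average, over sequences, of the within-sequence mean roll following each face."""
    rng = random.Random(seed)
    faces = [1, 2, 3, 4, 5, 6]
    totals = {f: 0.0 for f in faces}  # running sum of per-sequence means
    used = {f: 0 for f in faces}      # sequences in which face f had at least one follower
    for _ in range(reps):
        rolls = rng.choices(faces, k=n_rolls)
        sums = {f: 0 for f in faces}
        counts = {f: 0 for f in faces}
        # Pair each roll with the roll that immediately follows it.
        for prev, nxt in zip(rolls, rolls[1:]):
            sums[prev] += nxt
            counts[prev] += 1
        for f in faces:
            if counts[f] > 0:  # skip sequences where the mean is undefined
                totals[f] += sums[f] / counts[f]
                used[f] += 1
    return {f: totals[f] / used[f] for f in faces}

means = mean_after_each_face()
for face, value in means.items():
    print(face, round(value, 3))
```

The estimate following a 6 should land below 3.5 and the estimate following a 1 above it, the same pattern Matt reports.
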

Statistical Modeling, Causal Inference, and Social Science

