Update…

We have a new short paper describing this issue in more detail, with a more complete set of simulations showing how the correlation depends on the size of the hot hand, its frequency, and the predictor’s ability.

In particular we write:

The reason why this correlation measure is expected to lead to a surprisingly low underestimate of prediction ability is closely related to Stone (2012)’s work on measurement error and the hot hand.

And we have a footnote:

In particular, Stone (2012) showed that the serial correlation in hits is expected to be far lower than the serial correlation in the player’s hit probability, as a hit is only a noisy measure of a player’s underlying probability of success. In this case we have shown that the correlation between “bet hit” and “hit” is expected to be far lower than the correlation between “bet hit” and the player’s hit probability, for the same reason.

]]>Yeah fair enough! With n=100 rep=100,000, X taking on 1,2,3,4,5,6 w prob 1/6 – you get E[X_t|X_t-1 = 6]=3.47, and then steadily increases up to E[X_t|X_t-1=1]=3.53. Cool!

]]>Fair enough! So doing 100,000 reps of a sequence of N=100 of a random variable X that takes on 1,2,3,4,5,6 with prob=1/6, I get the following: E[X_t|X_t-1=1]=3.52 ; E[X_t|X_t-1=2]=3.515 ; E[X_t|X_t-1=3]=3.503 ; E[X_t|X_t-1=4]=3.492 ; E[X_t|X_t-1=5]=3.484 ; E[X_t|X_t-1=6]=3.473.

Same mechanism obviously, but still it’s incredible that that is true. I’m going to do this hot-hand stuff for golf data, which is why the non-binary outcomes are relevant.
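
The dice simulation described above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the commenter's actual code: the function name `conditional_means`, the seed, and the smaller default of 20,000 reps (chosen for speed) are all hypothetical. For each simulated sequence it averages the values that immediately follow each face k, and then averages those per-sequence means across sequences; it is this within-sequence averaging that produces the bias.

```python
import random

def conditional_means(n=100, reps=20_000, seed=0):
    """Average, over many sequences, the within-sequence mean of X_t given X_{t-1} = k."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in range(1, 7)}  # sum of per-sequence means for each face k
    counts = {k: 0 for k in range(1, 7)}    # number of sequences in which k had a successor
    for _ in range(reps):
        seq = [rng.randint(1, 6) for _ in range(n)]  # fair six-sided die
        sums = [0] * 7
        hits = [0] * 7
        for prev, cur in zip(seq, seq[1:]):
            sums[prev] += cur
            hits[prev] += 1
        for k in range(1, 7):
            if hits[k]:  # skip sequences where k never appears before the last roll
                totals[k] += sums[k] / hits[k]
                counts[k] += 1
    return {k: totals[k] / counts[k] for k in range(1, 7)}

est = conditional_means()
for k, value in est.items():
    print(k, round(value, 3))
```

With enough reps the estimate after a 1 should sit slightly above 3.5 and the estimate after a 6 slightly below it, in line with the numbers reported above.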

]]>Matt:

The answer’s right in front of you, on your computer. Do a simulation in R or Python and find out!

]]>In my opinion it’d be much better if we didn’t frame this as “Evidence for (or against) the hot hand” but rather started thinking in terms of “How much does a player’s performance vary from day to day and how well can we predict performance?”

]]>too bad I can’t edit.

Two clarifications/corrections:

1. Let hot=1 if Bob’s state is h, and 0 otherwise. The equation should be hit=a+b*hot, instead of hit=a+b*[Bob’s state]. So sd(bob’s state) can be thought of as sd(hot).

2. correlation is dimensionless, clearly it is not constant.

]]>nice observation Andrea.

the reason for this is interesting, and it comes right out of what we do in our paper.

Our example, which Andrew quoted (appears on p. 24-25 of present version:November 15, 2016), answers essentially the following question: What if there were zero measurement error? Lisa has no measurement error.

Now an implication of what Andrew noted about correlation is that corr(Bob’s state, hit)=corr(Lisa’s prediction, hit).

Note that if we set up a least squares estimate of hit=a+b*[Bob’s state], then b=cov(Bob’s state, hit)/var(Bob’s state), which equals corr(Bob’s state, hit)*sd(hit)/sd(Bob’s state).

If Bob’s state is hot 50% of the time, sd(Bob’s state) is expected to be close to sd(hit), because the variance of a Bernoulli R.V. is p(1-p), and therefore b is expected to be close to corr(Bob’s state, hit).

Note that because these variables are binary b=proportion(hit| bob’s state= h)-proportion(hit| bob’s state= n), and E[b]=ph-pn=0.10, given the set up.

So the closer sd(bob’s state) is to sd(hit), the closer corr(Bob’s state, hit) is expected to be to E[b]=ph-pn=0.10.

So the issue is a correlation of .10 seems small, but really correlation is hard to interpret because it is a dimensionless constant. But in this context of binary R.V.s, correlation can be related to probabilities, i.e. a difference in probabilities. A difference in probability of 10 percentage points is a lot for a basketball shot.
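
The relation between the slope b and the correlation can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `two_state_summary` is hypothetical, the 0.55/0.45 probabilities are borrowed from the R simulation elsewhere in this thread, and pairing them with a hot state that occurs 15% of the time is an assumption made purely for comparison. It uses the population identities cov(hot, hit) = q(1-q)(ph-pn) and var(hot) = q(1-q), where q is the fraction of time Bob is hot.

```python
import math

def two_state_summary(q, p_hot, p_normal):
    """Population slope and correlation between a 0/1 'hot' indicator and a 0/1 hit."""
    p_hit = q * p_hot + (1 - q) * p_normal   # overall hit probability
    sd_hot = math.sqrt(q * (1 - q))          # sd of the Bernoulli(q) state indicator
    sd_hit = math.sqrt(p_hit * (1 - p_hit))  # sd of the Bernoulli(p_hit) outcome
    slope = p_hot - p_normal                 # b = cov(hot, hit) / var(hot)
    corr = slope * sd_hot / sd_hit           # corr = b * sd(hot) / sd(hit)
    return slope, corr

# Hot half of the time: sd(hot) equals sd(hit), so corr equals the slope.
slope, corr = two_state_summary(q=0.5, p_hot=0.55, p_normal=0.45)
print(slope, corr)

# Hypothetical rarer hot state: same slope, smaller correlation.
slope_rare, corr_rare = two_state_summary(q=0.15, p_hot=0.55, p_normal=0.45)
print(slope_rare, corr_rare)
```

The slope stays at the 10 percentage point difference in both cases; only the correlation shrinks as sd(hot) moves away from sd(hit).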

]]>Joshua, Clyde,

in any case, 13%, 15% or even 50% doesn’t drastically affect the main point. Even with p_hot=0.5, a simulation still predicts extremely low correlation:

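set.seed(512)
n <- 10000
# Bob's hit probability: 0.55 for the first half of the shots, 0.45 for the rest
bob_probability <- rep(c(.55,.45),c(.5,.5)*n)
# Lisa guesses "hit" exactly when Bob's probability is above 0.5
lisa_guess <- round(bob_probability)
bob_outcome <- rbinom(n,1,bob_probability)
cor(lisa_guess, bob_outcome)
[1] 0.09700044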

Rahul:

There is no “alternative hypothesis.” I don’t think that way. The point is to estimate the level of variation and estimate what predicts it. The point of the above post is that such estimation is difficult with 0/1 data, and naive estimates such as those employed by Gilovich, Vallone, and Tversky in their celebrated paper are so biased and so noisy as to not be interpretable in the usual ways.

]]>>>>probability of success varies continuously from shot to shot<<<

This sounds intuitively so obvious that I'd be amazed to see data to the contrary.

What's the alternative hypothesis? That a human player might be an ultra-consistent random number generator with no patterns, and no ups and downs?

]]>+1 Go fellow Pythonista!

]]>Llewelyn:

The model we believe is that the probability of success varies continuously from shot to shot, and that there are times when this probability is higher (when a player is “in the zone”) and times when it is lower. The model with two discrete probabilities is a ridiculous oversimplification that we use just to make the point about the estimate being super-noisy. A simulation using a more realistic model would show the same thing.
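
One way to sketch such a simulation in Python is shown below. Everything specific here is an assumption for illustration rather than anyone's fitted model: the success probability drifts smoothly between 0.3 and 0.7 along a sine wave (a stand-in for continuous "ups and downs"), and the names `simulate` and `pearson`, the period, and the seed are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def simulate(n=200_000, period=200, amplitude=0.2, seed=1):
    """Shots whose success probability varies smoothly around 0.5 (assumed sine drift)."""
    rng = random.Random(seed)
    probs = [0.5 + amplitude * math.sin(2 * math.pi * t / period) for t in range(n)]
    hits = [1 if rng.random() < p else 0 for p in probs]
    return probs, hits

probs, hits = simulate()
serial_corr = pearson(hits[:-1], hits[1:])  # correlation of each shot with the previous one
corr_with_truth = pearson(probs, hits)      # correlation of outcomes with the true probability
print(round(serial_corr, 3), round(corr_with_truth, 3))
```

Even though the assumed probability swings by 40 percentage points, the shot-to-shot correlation in this sketch comes out well below 0.2, which is the same qualitative point the two-state example makes.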

]]>The comments by Guy and Josh above were really useful, thank you. But, I struggle to understand how the same actor, Bob, in any real world sense can have two probabilities of shooting a basket. My guess is that these are the assumed hot hand and play as normal probabilities which as spectators we observe as a combined total probability. But then I wonder how two separate states exist in the same player…I get stuck on the practicality. So I assume all the cleverness around this is correct and so am left wondering how one identifies a hot hand state in an actual game? And if this can be done, how does one predict the likelihood of a hot-hand run in a game/season/career *retrospectively*? The example in my mind is the Australian cricketer, Donald Bradman, whose stats are just unusually high compared to anyone else. https://en.wikipedia.org/wiki/Don_Bradman

I also found this paper (from a course I did on Sport Analytics) on longest runs interesting (and baffling), https://www.maa.org/sites/default/files/pdf/upload_library/22/Polya/07468342.di020742.02p0021g.pdf. It seems to me that this is the same thing, but without a hot-hand factor operating. Or not?

]]>Andrew-

thanks for the update on the P.S.

last thing: the power issues which Korb & Stillwell cover are different from our point here. Our calculation is about how to evaluate beliefs/prediction/betting data, i.e. it is about the hot hand fallacy, not the hot hand effect.

Of course our little example is related to the measurement error story you discuss, and we actually yoked this example to one of our simulation studies of measurement error (and power) from the appendix of our 2014 “Cold Shower” paper, where we used a hidden Markov model data generating process. (The first discussion of measurement error and the hot hand, using autocorrelation as the example, was in Dan Stone’s 2012 AmStat paper.)

]]>Guy:

You say: “I have yet to see any compelling evidence for a hot hand that meets this test.”

-For game data, those are tough questions. We know of no evidence for or against using game data. But then again players and coaches make all sorts of decisions in the course of the game that cannot be backed up with data-driven evidence. Shouldn’t we defer to them until we have the data to address it? (assuming they aren’t using blanket one-size-fits-all heuristics)

Side note: we do have evidence that bettors in GVT’s study do better predicting than they would if they were to guess randomly (and a dumb hot hand heuristic would also do better).

You say: “I find it hard to believe that such large short-term changes in true ability actually occur.”

-Well, 13pp was an average FG% effect in GVT’s Cornell study, so I’d bet some players *sometimes* get much more than that for their probability of success, in controlled settings. Also, remember we are talking about probability here, so you aren’t going to pick this up with FG% based on recent shot outcomes, which is all anyone ever measures. This means you can’t use existing data to inform your beliefs on the magnitude. The thing exists. Is it modest in all players? Huge in some players, tiny in most players? Well it could be 40pp big in some players and you would never see it in the data given the way it is currently analyzed.

You say: “And it’s even harder to imagine how such talent changes could be measured with sufficient confidence to act on the information in real time.”

-proprioception?

]]>Guy-

We were thinking more about the beliefs of players and coaches, i.e. the decision makers. Right or wrong, we were less concerned about the “public understanding” as you mention. Fans say all sorts of things; it’s not even clear what they mean. For example, people don’t know the difference between a 60% chance of rain and an 80% chance of rain—in either case, if it doesn’t rain, the weather forecaster is wrong. For another example, look how people were interpreting betting odds in the previous election.

]]>Josh:

While I do believe that ignoring the hot hand would be more accurate than accepting the typical fan’s beliefs, those obviously aren’t the only choices. So the more interesting question (to me) is whether the hot hand is ever large enough to be of any practical significance in sports decision-making? For example, should a team pitch differently to a “hot” hitter (or use a hot hitter as a pinch hitter when he would not otherwise be used)? Or should a team give more shots to a “hot” NBA shooter, and should their opponent assign a superior defensive player to guard that “hot” shooter? To me, this is what it would mean for the hot hand to be “real” (while acknowledging that weaker hot hand effects might exist). I have yet to see any compelling evidence for a hot hand that meets this test.

If an NBA player did in fact improve his shooting accuracy by as much as 13 percentage points for a period of time, that would easily meet my test of “sports significance.” Or it would be IF this hotness could be detected in real time. I find it hard to believe that such large short-term changes in true ability actually occur. And it’s even harder to imagine how such talent changes could be measured with sufficient confidence to act on the information in real time.

]]>Would Lady Catherine de Bourgh have been a great proficient if she had indeed learned to play?

+1… million

]]>Agreed that there are many different beliefs about the existence/size of the hot hand. My sense just from talking to sports fans, and listening to almost any game broadcast, is that many — perhaps even most — fans believe in a strong hot hand effect. To be more concrete, I think they believe that a 50% NBA shooter becomes something like a 60% or 70% shooter when “hot,” and that a .280 hitter becomes a .350 or better hitter when hot. And my sense (but I offer this tentatively, as I don’t know this literature well) is that academic research on *belief* in the hot hand is consistent with my intuition. But if the research shows a much more limited public belief in hot hand effects, I will happily revise my view.

]]>Vinnie Johnson was called the Microwave because he’d heat up fast, but watching the games I saw a guy who was brought in as an offensive replacement with the team expectation he’d shoot; many times he would do that, and some of those times he was really good. But other times he’d come in not as an offensive replacement and he wouldn’t shoot, or he’d shoot and miss and maybe not shoot three in a row. And it was obvious at times, as a Pistons fan, that the coaches would put Vinnie in to shoot sometimes and would run plays for him to shoot, making it look like he was on fire if he hit the first 2 or 3 shots. Sensibly, they’d put him in because they thought those plays would work, so maybe that would reflect the coach’s hot hand! The Celtics would intentionally run plays for Robert Parish early because, as they’ve described in print, they felt he’d put more effort into the game if he shot early. That made it look to fans like Chief was more of an offensive player, but really it was that they ran those few plays for him early and, crucially, we’d remember that he hit those few early jump shots in some games. They’d then run plays for him later in the game when those plays were the best option, which, given basketball scoring, might depend on who was defending, who wasn’t on the floor for the Celtics, etc. I don’t know how anyone could separate actual hot hands from this kind of complexity given simple it-went-in-or-it-didn’t stats.

Transient effects are really interesting. Would Pete Reiser have been a great player if he hadn’t crashed into outfield walls? Would Lady Catherine de Bourgh have been a great proficient if she had indeed learned to play? Is Messi better than Ronaldo? In most cases, you can’t even argue that totals matter: Babe Ruth played in a different era, was a pitcher for years, etc. Pelé never played in a European league and I don’t think anyone would argue that Josef Bican is actually the greatest soccer player though he tops the list of goals. Even if we model for competition, the arguments come down to “he hit the ball so incredibly far” or “he did the most spectacular things with a ball”.

]]>we just picked 13% because that is the fraction of shots that would be preceded by a streak of 3 successes, for a Bernoulli p=.5 shooter. In the paper, we used 15% of the time, with the idea that it should be relatively rare.
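
The 13% figure can be checked directly: for an independent p=.5 shooter, the chance that any three given shots are all makes is 0.5^3 = 0.125. The Python sketch below is illustrative only; the function name `share_after_streak`, the sample size, and the seed are hypothetical choices.

```python
import random

def share_after_streak(p=0.5, streak=3, n=200_000, seed=2):
    """Fraction of shots (from shot streak+1 onward) preceded by `streak` straight makes."""
    rng = random.Random(seed)
    shots = [1 if rng.random() < p else 0 for _ in range(n)]
    eligible = 0
    preceded = 0
    for t in range(streak, n):
        eligible += 1
        if all(shots[t - streak:t]):  # the previous `streak` shots were all makes
            preceded += 1
    return preceded / eligible

exact = 0.5 ** 3                  # probability that three independent shots all go in
simulated = share_after_streak()  # simulation check of the same quantity
print(exact, round(simulated, 3))
```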

]]>OK… I read the Korb and Stillwell paper and frankly, the result is completely inappropriate to the serial correlation analysis. What they test is a case where, in a sequence of 10 shots, the probability goes up only after the fifth shot and stays there. OF COURSE, that will be an incredibly weak test of whether making the previous shot increases the chance of making the next one: 9/10ths of the time it doesn’t! The fact that the serial correlation test gets nearly twice the “significant” results as the null suggests that this test is really quite good! I’m going to make a simulation that I think will show this all much better….

]]>Andrew, your P.S. is incorrect.

Korb & Stillwell have some nice coverage of the statistical power issues, but they do not perform this calculation.

The example you discuss in this post was one we made in reference to Gilovich, Vallone and Tversky’s betting task, p. 308-309 and Table 6 on p. 310, here.

We created a version of this example, and re-analyzed GVT’s betting data, on p. 24-25 of our paper (here, current version: November 15, 2016).

Korb & Stillwell do not correlate predictions of outcomes with outcomes, or run a hidden Markov chain with a predictor.

]]>Rahul-

on p-hacking, there are a few ways to know that it is not:

1. The 3-back measure is the one all previous studies use, and what fans define as a “streak,” so it isn’t chosen out of thin air (it’s significant with 2-back and 4-back, but too sparse with 5+ back).

2. The 3-back measure works out-of-sample on all the data we collected: a. 3 point shooting contest, b. our own controlled shooting study, and c. a little-known study from the late 70s

3. One can control for the multiple comparisons issues involved with using other measures (length of longest streak, frequency of streaks) by constructing composite measures.

]]>Jordan

Just looked at Korb & Stillwell quickly and it looks like they are making the power point (not to be confused with MS), and the additional point that binomial data is typically underpowered. I do not see them correlating predictions of outcomes with outcomes, or running a hidden Markov chain with predictions.

]]>Jonathan:

Regarding your statement, “you can’t even pass step one,” please look in the above post and in the link in the P.P.S. to get a sense of why it does not make sense to take this apparent failure as evidence that there’s no hot hand, or even as evidence that any hot hand phenomenon is small. What we’ve learned from looking at this problem is that even a large hot hand phenomenon can fail to show up when studied in this crude manner.

]]>But isn’t this vulnerable to the same sort of “p-hacking” that Andrew criticizes?

i.e. First we look at the previous shot. Not much to see so we then switch to screening n-previous shots. Or tweak the streak definition.

Having done this, we can try to control for some vector X. Try other variations of X. Next comes the huge degree of freedom offered by which year & which players to analyze. Et cetera.

Sure you’ll discover some zone or rhythm somewhere. Now is this a phenomenon or an artifact?

]]>Guy-

If you define hot hand as the statistical estimate of the increase in field goal percentage after recent success for the *average* player in an unbalanced panel with partial controls, then, yes, if someone were to believe this effect to be large, they would be making a significant cognitive error. But in order to test if people’s beliefs are wrong, you have to first be clear on what their beliefs pertain to.

Further, in all controlled (and semi-controlled) tests, the increase in field goal percentage after a recent streak of success is meaningfully large, 5-13 percentage points depending on the study (with the highest estimate in the original study).

If one defines the hot hand as the change in a player’s probability of success when in the hot state, or more realistically, with continuous states, as the range over which a player’s probability of success varies (controlling for difficulty), the “mountain of more recent evidence” can’t say much (I assume we are referring to the same mountain). We should have a little humility about what we can measure here. Where the data cannot speak, should we not defer to the practitioners?

What you bring up in your final two sentences may be on point though. If players & coaches had to choose between two blanket dumb heuristics of (a) always believe in the hot hand, or (b) never believe in the hot hand, then the latter may be the safer bet, depending on the strength of the typical reaction. This ignores, of course, the possibility that the coach or player has more granular information than the zeros and ones we are looking at, and can respond more judiciously. We agree, the “hot hand exists” vs. the “hot hand doesn’t exist” is a terrible way to think of it; the binary formulation is what leads to blanket heuristic thinking.

]]>Anon-

for why we talked about these correlations, see p. 310 here, the original paper.

Another reason: if predictions of success are made almost as often as success outcomes (doesn’t have to be that close), the correlation coefficient is close to an estimate of Pr(success|predict success)-Pr(success|predict failure).

]]>Rahul

There are different levels of discussion. If you want to stick with the literature, the consensus view was the “cognitive illusion” view. The original paper concluded that shots were as-if randomly generated (as Andrew said), and this was the accepted consensus, e.g. here, and this is how papers cited this result, e.g. the first two paragraphs here. Researchers were committing a fallacy here.

For your compact canonical statement, if we again want to tie our hands to this literature, well the original paper discussed hot hand (and streak shooting) in terms of patterns, rather than in terms of process. In that paper the hot hand is when the length and frequency of streaks are greater than one would expect by chance (see here, p. 296-297), or probability of success increases after recent success.

If you look at how lay players and coaches discuss it, they use the words “zone”, “rhythm”, and “flow”, so clearly they are talking about the probability of success increasing, for whatever reason.

]]>Sure P(make|miss) and P(make|make) are really noisy. But the hot hand would have to include that at a minimum, wouldn’t it? P(make|miss),P(make|hit),P(make|2 hits),P(make|3 hits) would be an increasing series under the hot hand hypothesis, and if you can’t even pass step one (which has more data than any other conditional series) what chance do you have later? If nothing else, wouldn’t the set of players with hot hands (assuming there were any such things) be MUCH more likely to have P(hit|make)>P(hit|miss)? Obviously figuring out whether or not there are any such players is the essence of the exercise, of course, but wouldn’t you be really suspicious of spurious fitting if you found that the hot hand only starts after you’d hit 4 in a row and that that fifth shot was really really likely, but that the first make was negatively correlated with the second?

]]>thanks Jordan. We were aware of Korb and Stillwell (and your book), but somehow missed this.

]]>1) Acupuncture making people feel better based on the accurate placement of needles

2) Tamiflu reducing duration and severity of flu

3) Statins protecting against heart disease

4) Mammograms reducing breast cancer mortality

We’ve made a bit of progress on some of these perhaps. But the point is, you have a noisy problem where even if you can predict an underlying issue perfectly, the observed outcomes provide not much information about that fact… and you combine this with a default assumption along the lines of “X works” or “X doesn’t work” and it leads us to decades of confusion, wasted resources, and in the medical case potentially unneeded suffering.

]]>+1 It stumped me.

]]>Andrew:

OK. What’s your preferred statement of the problem?

]]>Rahul:

Grrrr…. No, the probabilities are *not* 0.54 vs. 0.52. They can vary by much more than that! The point is that conditioning on the previous shot is an extremely noisy way to measure anything. The result is to make the apparent differences look small and inconsequential, even if the underlying differences are huge.

I agree. When reading this I wondered how it was something I never noticed before, since I like knowing “gotchas” like this. Then I realized it is because I wouldn’t calculate a correlation in this situation (or if I ever did, it was quickly dismissed as unhelpful).

]]>If one must study the frivolous, at least study the *well-defined* frivolous. :)

]]>Thanks!

The other big issue is going to be external validity, e.g. does this finding generalize to 1990-91? Et cetera. Who knows!?

Frankly, I think it’s a silly problem to study. When the system you are studying itself is so highly variable it is folly to try to answer questions depending on such small differences as 0.54 vs 0.52.

]]>I’m sure there’s a lot more data now, but the original Gilovich paper http://wexler.free.fr/library/files/gilovich%20(1985)%20the%20hot%20hand%20in%20basketball.%20on%20the%20misperception%20of%20random%20sequences.pdf has the original data in Table 1.

p(shot made|previous shot miss)=.54 p(shot made|previous shot made)=.52 This data is for 9 players in 1980-81.

A variant of the hot hand hypothesis is that p(shot made|n previous shots made)>p(shot made|previous shot missed) It is this variant that Gilovich tests that is shown to be calculated in a biased fashion in Miller et al.
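
For readers who want to compute these two conditional proportions from a single 0/1 shot record, a minimal Python sketch follows. The function name `conditional_hit_rates` and the ten-shot record are hypothetical, made up for illustration and not taken from Table 1. As discussed elsewhere in this thread, such proportions are very noisy for any one player, and averaging them across short sequences is biased.

```python
def conditional_hit_rates(shots):
    """Return (proportion made after a make, proportion made after a miss)."""
    made_after_make = made_after_miss = 0
    n_after_make = n_after_miss = 0
    for prev, cur in zip(shots, shots[1:]):
        if prev == 1:
            n_after_make += 1
            made_after_make += cur
        else:
            n_after_miss += 1
            made_after_miss += cur
    # None signals that the conditioning event never occurred in this record
    rate_after_make = made_after_make / n_after_make if n_after_make else None
    rate_after_miss = made_after_miss / n_after_miss if n_after_miss else None
    return rate_after_make, rate_after_miss

example = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]  # hypothetical shot record: 1 = make, 0 = miss
after_make, after_miss = conditional_hit_rates(example)
print(after_make, after_miss)
```

In this made-up record the shooter makes 3 of the 6 shots that follow a make and 2 of the 3 that follow a miss.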

]]>I love your formalization:

p(shot made|previous shot made)>p(shot made|previous shot missed)

So what’s the empirical answer to this question, unadjusted for anything. Anyone know?

]]>It is. The same argument for it being a fool’s errand could be made for other (possible) effects such as ESP and Power Pose.

]]>Rahul:

I agree that if the goal here were to determine a winner or loser of the hot hand debate, this would be largely a waste of time. But that’s not my goal.

There’s no winner or loser here. The “game” is to better understand how to use statistical analysis to learn things about the world, and the study of these sorts of simple yet real examples can be, for me and others, a good way to gain intuition and develop new principles. God is in every leaf of every tree.

]]>Thanks. The strong version seems axiomatically stupid: Isn’t that like saying injuries etc. just don’t exist.

About the weak version: The first formulation seems nice p(shot made|previous shot made)>p(shot made|previous shot missed). This ought to be eminently testable.

The part that makes this “hot hands” debate farcical is the adjustment part. Obviously there cannot be universal objective agreement on a canonical X vector.

When the effect is small to start with, isn’t any “conclusion” of the hot hands problem merely a specific statement of what you think subjectively is the “right” X vector?

]]>So why do we spend so much time on analyzing a problem that’s not even well defined?

My feeling is, we keep going in circles: The effect is tiny to start with, & every time one analyst makes some progress the other side shifts the goalposts ever so slightly. And we keep arguing endlessly.

Isn’t this a fool’s errand?

]]>There are several versions of the hot hand hypothesis. The strongest version is that you don’t actually have a different p from day to day. You have one underlying p and your varying performance from day to day is just the effects of the underlying p.

You are correct, though — I don’t know anybody who believes the strong version.

The weaker version is simply that p(shot made|previous shot made)>p(shot made|previous shot missed). Fancier versions can adjust with lots of other independent variables: p(shot made|previous shot made,X)>p(shot made|previous shot missed,X), where X is some vector of independent variables at the time of the shot whose effects are previous-shot-independent.

Rahul:

Korb and Stillwell write, “It is not entirely clear what the Hot Hand phenomenon is generally supposed to be, nor just what GVT intend us to understand by it.”

Roughly, I think the claim made by Gilovich et al. and their followers was not that players are random number generators, but that they are statistically indistinguishable from random number generators, so that belief in any *evidence* for the hot hand was a fallacy.

Naive question: So I guess I don’t understand the “hot hands” question in the first place: Obviously, no one is arguing that a basketball player is a random number generator, right?

So, obviously some days I play better, other days I don’t. If we declare the hot-hands theory disproven, does that mean we can never predict the outcome of a player’s next shot with anything more than strictly random accuracy?

What’s a compact, canonical statement of the hot hands hypothesis?

]]>