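n <- 5       # flips per sequence
rep <- 1e6   # number of simulated sequences
mydata <- array(0, c(rep, n))   # stores every simulated flip
score <- 0   # heads that immediately follow a head
total <- 0   # heads in positions 1..(n-1), i.e. heads that have a successor
for (k in 1:rep) {
  last <- 0
  for (i in 1:n) {
    now <- sample(c(0, 1), 1)
    # last is 0 at i == 1, so the first flip can never count as a follow-up
    if (now == 1 && last == 1) {
      score <- score + 1
    }
    # only a head before the final position can be followed by another flip
    if (now == 1 && i != n) {
      total <- total + 1
    }
    last <- now
    mydata[k, i] <- now
  }
}
ratio <- score / total   # pooled over all sequences, not averaged per sequence
head(mydata)             # show the first few sequences rather than all 1e6 rows
ratio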

————————————————

and the result is 0.5.

]]>“Any relation to Clark?” Daughter

]]>Thank you. My simulations (using Excel and Stata) got the same results as you did, which is also in line with statistical theory. If the trials are independent, the probability is 0.5 and the occurrence is 50%. The “hot hands” could still occur only if dependencies arise due to any of various temporary physical or emotional maladies. Thank you again.

]]>Michael:

1. GVT is not just about shooting in a game environment; they have lots of data from non-game environments.

2. GVT supply an estimate of the hot hand. Their estimate is near zero so they conclude there’s no hot hand. But, actually, if there is no hot hand, you’d expect they’d get negative estimates for the statistical reason explained by Miller and Sanjurjo.

]]>Maybe I am looking at this the wrong way, but how does it prove hot hands exist by creating a negative bias in the analysis? They mention, for example, a three-point contest, but GVT was not about sitting in a spot and hitting the same shot over and over. It was about the random nature of shooting in a game environment where you have a high number of variables that change on a millisecond timeframe.

I don’t see anything that disagrees with GVT, but instead a transformation of the data through underweighting the successes. If this underweight isn’t done, everything aligns with GVT. Unless I am missing something.

]]>I find this explanation very helpful, thanks! Any relation to Clark?

]]>Yea that’s the whole conclusion right? When we group things and underweight the groups that have massive success our average success rate of the groups goes down. I don’t see what’s important about this.

]]>Some quick observations, and I admit a lot of the discussions have addressed some of these issues. I also haven’t read GVT in its entirety, so forgive me for some naive observations.

As many people have pointed out, the R code computes a probability for each row and then takes the average of those probabilities. This inherently has a negative bias which discounts the actual counts. Take, for example, this set of numbers:

0 0 1 0

1 0 1 1

In the algorithm, heads1 and heads2 would hold the truth tables below for the first row.

heads1 <- false, false, true

heads2 <- false, true, false

This gives the first row a prob of 0.0: after a flip of 1, there were 0 times another head appeared.

For the second row, it would show a probability of 0.5 from the truth tables below.

heads1<- true, false, true

heads2<- false, true, true

This shows there was a single time after a head flip, there was another flip heads . . . so there is a 0.5 probability.

Now, when the code takes mean(prob), it reports a probability of 25% for flipping a head after a head:

mean(prob, na.rm=TRUE) = 0.25 (the mean of 0 and 0.5 is 0.25)

That isn't correct though, because there were three times a head came up in the first three flips, and only once did a head occur on the next flip. This means the probability of a head following a head is actually 33%, not 25%, in this data set. The code is normalizing each row, when it should be counting each occurrence individually. Because of this, there is a negative bias in the algorithm itself, and the error is not being accounted for. That is the reason for the approx 40% coming up, not because of hot hands.

If the code instead counted the occurrences directly rather than normalizing each row, it would find that a head follows a head 50% of the time, and thus no hot hands.
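The arithmetic for the two example rows can be checked with a short R sketch. The variable names follow the heads1/heads2 naming used in this comment, but the code itself is only an illustration written for this example, not the script from the post.

```r
# The two example rows from the comment above (1 = heads, 0 = tails).
flips <- rbind(c(0, 0, 1, 0),
               c(1, 0, 1, 1))
n <- ncol(flips)

heads1 <- flips[, 1:(n - 1), drop = FALSE] == 1   # flips that have a successor
heads2 <- flips[, 2:n, drop = FALSE] == 1         # the flips that follow them

follow <- rowSums(heads1 & heads2)   # heads followed by heads, per row
opps   <- rowSums(heads1)            # heads that could be followed, per row

per_row_mean <- mean(follow / opps)       # average of the row proportions
pooled       <- sum(follow) / sum(opps)   # every opportunity weighted equally

per_row_mean   # 0.25
pooled         # 0.3333333
```

Both numbers are computed from the same four counts; they differ only in whether the row or the individual flip is treated as the unit.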

]]>The probability of having a boy after a girl is 1/4. The probability of having a mixed pair is 1/2 (assuming the true prob of boy vs girl is .5 and not .48) The probability of having two boys, out of 3 children, in a row is 3/8. Note that the grouping changes the outcome. That seems to me what is happening here. I think a probability tree greatly helps in this case.

]]>Alex:

No, it’s different.

]]>Dear Jeff

thanks for the comment, I didn’t realize there wasn’t clarity on this point.

We do not assert that: “a way to determine the probability of a heads following a heads in a fixed sequence, you may calculate the proportion of times a head is followed by a head for each possible sequence and then compute the average proportion, giving each sequence an equal weighting on the grounds that each possible sequence is equally likely to occur.”

In fact we say in the July 6th paper that it is a mistaken intuition to treat this computation as an unbiased estimator of the true probability. It is certainly consistent, but biased, as we demonstrate. This mistake is *the* problem in the original hot hand study, and several of the subsequent studies.

In the introduction to the July 6th paper we discuss the weighting issue that you here describe. Weighting flips equally does eliminate the bias, but can only be used if you *know* that all coins are the same. This may be reasonable for coins, but is not reasonable for basketball players. If you weight all shots equally you will have another form of “Simpson’s Paradox”, which is a bias now towards finding the hot hand (selecting a shot that immediately follows 3 hits in a row creates a bias towards selecting a better player, and a better player is more likely to hit the next shot). If you put this in a regression context and add fixed player effects (to control for better players), you end up back where you started, with the finite sample bias.

]]>Let us assume, as Miller and Sanjurjo do, that we are considering the 14 possible sequences of four flips containing at least one head in the first three flips. A head is followed by another head in only one of the six sequences (see below) that contain only one head that could be followed by another, making the probability of a head being followed by another 1/6 for this set of six sequences.

TTHT Heads follows heads 0 times.

THTT Heads follows heads 0 times.

HTTT Heads follows heads 0 times.

TTHH Heads follows heads 1 time.

THTH Heads follows heads 0 times.

HTTH Heads follows heads 0 times.

A head is followed by another head six times in the six sequences (see below) that contain two heads that could be followed by another head, making the probability of a head being followed by another 6/12 = ½ for this set of six sequences.

THHT Heads follows heads 1 time.

HTHT Heads follows heads 0 times.

HHTT Heads follows heads 1 time.

THHH Heads follows heads 2 times.

HTHH Heads follows heads 1 time.

HHTH Heads follows heads 1 time.

A head is followed by another head five times in the two sequences (see below) that contain three heads that could be followed by another head, making the probability of a head being followed by another 5/6 for this set of two sequences.

HHHT Heads follows heads 2 times.

HHHH Heads follows heads 3 times.

An unweighted average of the 14 sequences = [(6 X 1/6) + (6 X ½) + (2 X 5/6)]/14 = [17/3]/14 = .405, which is what Miller and Sanjurjo report.

A weighted average of the 14 sequences = [(1)(6 X 1/6) + (2)(6 X ½) + (3)(2 X 5/6)]/[(1 X 6) + (2 X 6) + (3 X 2)] = [1 + 6 + 5]/[6 + 12 + 6] = 12/24 = .50.

Using an unweighted average instead of a weighted average is the pattern of reasoning underlying the statistical artifact known as Simpson’s paradox. And as is the case with Simpson’s paradox, it leads to faulty conclusions about how the world works.
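The two averages can be reproduced by enumerating all sixteen sequences. This is a minimal R sketch with illustrative variable names, not code from the paper:

```r
n <- 4
# All 2^4 = 16 sequences of four flips, one per row (1 = heads, 0 = tails).
seqs <- as.matrix(expand.grid(rep(list(0:1), n)))

first  <- seqs[, 1:(n - 1)]   # flips 1..3, which have a successor
second <- seqs[, 2:n]         # flips 2..4

opps   <- rowSums(first)            # heads that could be followed by a head
follow <- rowSums(first * second)   # heads actually followed by a head

keep <- opps > 0   # drops TTTT and TTTH, leaving 14 sequences

unweighted <- mean(follow[keep] / opps[keep])       # each sequence counts once
weighted   <- sum(follow[keep]) / sum(opps[keep])   # each head counts once

sum(keep)    # 14
unweighted   # 17/42, about 0.405
weighted     # 12/24 = 0.5
```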

Oops. I just copied and pasted the html generated at https://www.codecogs.com/latex/eqneditor.php. That didn’t work.

Perhaps the url will be better:

https://latex.codecogs.com/gif.latex?\frac{\sum_{i=1}^m&space;\sum_{j=1}^{n_j}&space;X_{ij}}{\sum_{i=1}^m&space;\sum_{j=1}^{n_j}&space;Y_{ij}}&space;>&space;\frac{1}{m}\sum_{i=1}^m&space;\frac{\sum_{j=1}^{n_j}&space;X_{ij}}{\sum_{j=1}^{n_j}&space;Y_{ij}}

]]>Hi Sid,

I think the point is that what you’ve coded is not the estimator that has been used in some previous studies of the hot hand.

Instead, those studies compute that ratio separately for each replication (player) and then average the per-player ratios.

And if I may try to express it in my own LaTeX, I think that the present authors point out that:

$\frac{\sum_{i=1}^m \sum_{j=1}^{n_i} X_{ij}}{\sum_{i=1}^m \sum_{j=1}^{n_i} Y_{ij}} > \frac{1}{m}\sum_{i=1}^m \frac{\sum_{j=1}^{n_i} X_{ij}}{\sum_{j=1}^{n_i} Y_{ij}}$

JD

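]]>use strict;
use warnings;

my $rep = 1e6;   # number of simulated sequences
my $n = 4;       # flips per sequence
my @result;
my $summary_result = 0;   # heads that immediately follow a head
my $denom = 0;            # heads that have a following flip

for (my $j = 0; $j < $rep; $j++) {
    # Simulate one sequence of $n fair coin flips.
    # The flip condition was garbled in the posted code; rand() >= 0.5 is assumed.
    for (my $i = 0; $i < $n; $i++) {
        $result[$i] = (rand() >= 0.5) ? 1 : 0;
    }
    # Pool the counts over all sequences instead of averaging per sequence.
    for (my $i = 0; $i < $n - 1; $i++) {
        if ($result[$i] == 1) {
            $summary_result++ if $result[$i + 1] == 1;
            $denom++;
        }
    }
}

print "Total count: " . $summary_result / $denom . "\n";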

]]>Here is a suggested heuristic to address the issue. Looking at equation (3), and approximating all the exponentials as 0, the measured frequency will be n/(n-1) * (p-1/n). So, rule of thumb: If you measure q, adjust to (1-1/n)*q + 1/n.
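A minimal R sketch of this rule of thumb follows. The function names are hypothetical, and the formulas are only the approximation stated in this comment (all exponential terms treated as 0), not the exact expression from the paper.

```r
# Approximate measured frequency for true probability p and sequence length n.
expected_measured <- function(p, n) n / (n - 1) * (p - 1 / n)

# Rule-of-thumb adjustment: the algebraic inverse of the approximation above.
adjust_measured <- function(q, n) (1 - 1 / n) * q + 1 / n

n <- 100
q <- expected_measured(0.5, n)   # 49/99, about 0.495
q
adjust_measured(q, n)            # recovers 0.5
```

Because the adjustment only inverts the approximation, it is no more accurate than the approximation itself.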

]]>Dear Guy

Thank you for the comments on our exposition and your challenges on how to interpret the evidence with respect to hot hand, they are helpful.

Dear all: The feedback in general in the comments section here has been great and very helpful, thank you for taking your time to look at our work.

I will address to everyone the three important points you bring up regarding: (1) the effect size, (2) the relevance of the bias as a function of the sample size, (3) the method of bias correction.

I. Effect Size

———————-

Should Bocskocsky, Ezekowitz & Stein (2014) be the study which informs our priors on how large the hot hand is likely to be? There are a few critical issues here, but before discussing these issues, I’d like to mention one thing. We now have different goal posts than those of the original studies. The original GVT study and the early challenges were about the question of whether or not some players sometimes get the hot hand; they were not about whether the average player is a streaky shooter or tends to get hot. The consensus came to be, after these few challenges, that the perception that some players get hot is a cognitive illusion. Much later Koehler and Conley (2003) looked at data from individual shooters in the NBA’s Three Point Contest and replicated GVT’s conclusions; this paper has often been cited as the study which should reinforce our belief that the hot hand is a fallacy. What can now be said, with the cleanest data set from the original study, is a reversal of the original conclusion: this perception is not a fallacy. Further, across all controlled shooting studies, which all differ, and also in the NBA’s Three Point Contest, we get the same reversal. The average effect size is substantial, and there is a great degree of heterogeneity in effect size, with some players having large effects. If we wanted to isolate the basketball shooting and study it in a scientific way, this is the type of data we would want to look at. GVT understood that; that is why they included a controlled study. I think everyone would agree that we cannot conclude anything about the existence of the hot hand or its effect size from the game data in the original study (Study 2, Dixit and Nalebuff’s point stands: http://bit.ly/1eXxdI3).

Now, while these goal posts are different, they are valid, and we are under no obligation to be tied to the line of argument in the academic literature. We all want to know: how big is this hot hand effect in games? If we look at Bocskocsky et al. (2014) we see what appears to be a small effect size, but in looking to this study for information on effect size, we are asking too much of it. Their study, which analyzes the richest in-game data set ever, is not a study of effect size; rather, it is a test of whether some players sometimes get the hot hand. If we are willing to accept some of the natural limitations that come with studying game data, we have to conclude that they find at least some evidence of players being streaky.

The reason why Bocskocsky et al. is not the study to look to if we want to get information on effect size is because their empirical strategy will vastly understate the hot hand effect for the following reasons:

1. Measurement error: We want to know a player’s shooting percentage in the “hot” state, but what is actually measured is a player’s shooting percentage in a *proxy* for the hot state. This is a problem with the approach in our papers as well; not every instance of a streak of three or more made shots is a “hot” streak, so when we measure a player’s field goal percentage after a streak of three or more made shots, we are pooling shots attempted in the hot state with shots attempted in a non-hot state. The degree of measurement error is substantial; in fact, under many plausible alternative models of the hot hand, it is more substantial than the bias we discuss (if you want a particularly clean example of this please look at the work of Dan Stone: http://bit.ly/1TJ0Qfo). Now there is every reason to believe that measurement error is far worse when you study game data. Take Bocskocsky et al.’s proxy for the hot hand, their complex heat index: they are looking at a player’s shooting percentage in his previous 4 shots relative to what is expected, and those previous 4 shots can be separated by tens of minutes. This is a very weak signal of a hot state; plenty of these shots will be from a normal state. In more controlled settings the shots are happening in shorter time intervals, so while hitting 3 or more shots in a row is not a perfect signal of being hot, it is a far stronger signal in these contexts. If you want to better proxy for the hot state, you will have to measure it in some other way. Perhaps teammates and coaches can pick up on the subtle cues in body language, facial expression or shooting mechanics that signal a hot hand? No one has done this research, and why would they if they didn’t believe the hot hand exists? (Note: if you want to get more of a sense of the measurement error issue, Jeremy Arkes has also done some interesting simulations. He found that if the hot hand is infrequent, your estimates (and power) will be diluted. I can only find a gated copy of the paper here: http://bit.ly/1TJ1KZn but if you email him, I am sure he would share it gratis.)

2. Omitted variable bias: while Bocskocsky et al. control for a lot, they do not control for many important features of the defense that make shots more difficult, features that you would expect to be more present when a player is in the hot state (due to strategic adjustment), thus reducing a player’s probability of making a shot relative to baseline. Just a few examples: (1) the quality and identity of the defender, (2) whether the defender placed a well-timed visual occlusion (Joan Vickers has illustrated the importance of the “quiet eye” in far aiming tasks: http://bit.ly/1gGEZXM ; also see the work of Oudejans, Oliveira and colleagues: http://bit.ly/1fdR99G), (3) whether the defender forced an off-balance shot.

3. Pooling heterogeneous responses: Not all players are great shooters. Not all players are streaky. Some players may even be anti-streaky, e.g. go off the rails after making a few in a row. Now mix all their shots together as done in Bocskocsky et al. Do you expect to see a big average effect? This is an important point. It is the fact that there is heterogeneity in response that gives players a reason to want to discriminate between the streaky and non-streaky guys, and to adjust play appropriately. We see that heterogeneity when we isolate the shooting from the strategic confounds present in games.

I hope we agree that if you want information on how big the hot hand effect size can be in individual players, game data is not where you want to look. If you want to make the best scientific inference you can, based on the data you have available, where would you look? All controlled shooting designs and the NBA’s Three Point Contest involve the same physical act that is carried out in games. These designs all differ in how players take their shots. The story is pretty much the same across studies: substantial but heterogeneous effect sizes. Now scientific inference here is not restricted to isolating a behavior. We have to remember we are studying human beings, and we have additional scientific knowledge about human beings. Human performance in other domains has been found to be largely affected by variation in confidence (self-efficacy, Albert Bandura), attention and concentration (e.g. Daniel Kahneman), the flow state (Mihaly Csikszentmihalyi, think “the zone”), and even physiological state (e.g. Churchland http://bit.ly/1HZemTO ). There is no reason to believe these factors do not operate to the same degree when shooting a basketball.

Now, I have a question: in 1984 would anyone’s prior have been that either the hot hand doesn’t exist, or if it does, it is weak? There was no reason to have this prior. Why would it be anyone’s prior now?

I get that everyone *wants* to believe in the hot hand, and we should be suspicious of this kind of motivated reasoning because people are likely to only confirm their priors. On the other hand, there is a dual motivation, held among researchers, to be able to say that these experts don’t know what they are talking about, that with high-powered statistics and without any knowledge of basketball we can know more. This is sometimes true. But we should have some humility. We are looking at 0s and 1s, the player and the coach have a far richer information set (more than what SportVU picks up). We can’t pretend we are measuring everything. Jeremy Arkes made us aware of a beautiful quote from Bill Russell on this very issue, see it here: http://bit.ly/1IamgOH

II. Bias as a function of sample size.

—————–

You are correct that the bias is small when players take a lot of shots, which is true in our controlled shooting study of the Spanish semi-pros and true for *some* of the NBA Three Point contestants. But the bias matters crucially in the original study. When the data from the original study is analyzed correctly the original conclusions are not just invalidated, they are reversed! (here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479). This is an important point. There was never a foundation to conclude that the belief in the hot hand was a fallacy; the original data set actually had statistically significant evidence of the hot hand. The mathematical fact of the bias, the fact that it cannot be argued with, and that it matters in the original study, opens the door to all these other (important!) subtleties that people have been mentioning for years. The issues of power, measurement error, and omitted variable bias all likely understate the hot hand effect to a much greater degree (especially in game data), but because these issues have required a model of the world to even discuss them, they have been thrown into the category of “debatable” and the fallacy view has lived on. But you can’t debate the math, and the math matters in the original study. Now if you want to have a justifiable opinion on the hot hand, you have to think about these other issues.

III. Bias correction

—————–

The bias correction illustrates the difference relative to the bias in the null Bernoulli model. This is a *conservative* correction, the bias is actually worse if a player actually has the hot hand, please see the comments of Jonathan Weinstein above, and the responses.

Thanks again

]]>At the same time, I think M&S could do a better job of communicating what they found and – especially – what they did to measure the bias. Let me try to provide a brief explanation for the bias that will hopefully prove intuitive for some, and explain what I think the authors actually did to measure the bias (which is not exactly what the paper says they did).

There are two different biases at work that need to be corrected. First is the bias created by the fact that the conditional flips cannot also appear as the next flip. Let’s call this the “conditional bias.” This can be corrected using sampling without replacement (SWOR): for example, if n=100 and frequency of heads is .50, our expectation after 3 consecutive heads would be 47/97 or .485 once we account for conditional bias.

However, what M&S have discovered is that the average expectancy after a streak is even less than the estimate provided by SWOR. For example, sequences with exactly 50 heads and 50 tails will have a conditional frequency that is less than .485 (approximately .46). This additional “M&S bias” of .025 results from the fact that the number of HHH streaks in a sequence is not independent of the success rate after HHH: more success after a streak also means more streaks. Thus, if every sequence is weighted equally, we will underweight the streaks followed by a hit and the observed mean conditional frequency will be less than estimated by SWOR. This extra bias is relatively small for larger sequences (for example, in their NBA three-point study, it amounts to only about 1% for an average player), but does have a larger impact when sequences are shorter and especially when computing the difference between “hot” and “cold” frequencies (as in Gilovich).
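One way to check these two numbers is to compare the SWOR expectation with a simulation that shuffles a fixed set of 50 heads and 50 tails. This is a rough R sketch; the function name prop_after_streak, the seed, and the number of shuffles are assumptions made for illustration, and the simulated mean will vary slightly from run to run.

```r
set.seed(1)
n <- 100; heads <- 50; k <- 3

# Sampling-without-replacement expectation after k heads: (heads - k) / (n - k).
swor <- (heads - k) / (n - k)   # 47/97

# Proportion of heads among the flips that immediately follow k heads in a row.
prop_after_streak <- function(x, k) {
  idx <- (k + 1):length(x)   # flips that have k predecessors
  streak <- rep(TRUE, length(idx))
  for (j in 1:k) streak <- streak & (x[idx - j] == 1)
  if (!any(streak)) return(NA)   # no streak to condition on
  mean(x[idx][streak])
}

# Shuffle the fixed sequence and weight every shuffled sequence equally.
base_seq <- rep(c(1, 0), c(heads, n - heads))
sims <- replicate(5000, prop_after_streak(sample(base_seq), k))
sim_mean <- mean(sims, na.rm = TRUE)

swor
sim_mean
```

If the description above is right, sim_mean should come out below swor, with the gap being the extra bias from weighting sequences equally.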

One confusing aspect of the paper is the presentation of their measurement of this bias. They say that they “derive an explicit formula for the expected conditional relative frequency of successes for any probability of success p, any streak length k, and any sample size n.” The presentation and discussion of table 1 (coin flip sequences with P=.5) also seem consistent with a model that estimates bias based on a particular probability of success. However, such a model would *not* correctly measure the bias in hot hand studies. As I observed above, these samples are not random trials from a player with a known true FG%, but are fixed frequencies from players of unknown true FG%: for every sequence in these studies the observed frequency of heads is exactly equal to the inferred “P” (by definition). And the bias in sequences with actual frequency X is *not* the same as the mean bias in all sequences generated when P=X. For example, Andrew notes above that sequences of 4 coin tosses (P=.50) will have a mean .41 success rate after a head, but if we look only at those cases where exactly 2 heads come up — and thus we would infer P=.50 — the success rate is actually just .33, not .41. The difference is much smaller for longer sequences, but does not disappear.
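The four-toss figures quoted here can be verified by enumeration. A minimal R sketch, with illustrative variable names:

```r
n <- 4
# All 16 sequences of four tosses, one per row (1 = heads, 0 = tails).
seqs <- as.matrix(expand.grid(rep(list(0:1), n)))

first  <- seqs[, 1:(n - 1)]
second <- seqs[, 2:n]
opps   <- rowSums(first)            # heads that have a following toss
follow <- rowSums(first * second)   # heads followed by heads
prop   <- follow / opps             # NaN when no head has a following toss
n_heads <- rowSums(seqs)            # total heads in each sequence

all_seqs  <- mean(prop, na.rm = TRUE)                 # about 0.405
two_heads <- mean(prop[n_heads == 2], na.rm = TRUE)   # about 0.333

all_seqs
two_heads
```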

So, what M&S call “P” is really the observed frequency (“F”) of hits in a finite sequence. The paper would be clearer if M&S did not imply they were estimating bias for sequences from a known probability, which does not seem relevant to the hot hand studies. Indeed, I’m not sure the word “probability” belongs anywhere in the paper. But fortunately, M&S did not base their bias estimate on all outcomes for a given probability (it took me some time to figure this out), but rather correctly estimate the bias for fixed distributions. Their method thus provides reasonable estimates of the combined effect of the two types of bias.

(I suspect the paper would also generate less confusion if the authors removed statements — mainly in the first six pages — that seem to imply a discovery about the actual probability of alternating outcomes, such as “the result has implications for evaluation and compensation systems, and suggests successful gambling systems.” The suggestion that a financial firm might reward an analyst based on typically guessing right about the market more than 15 days out of 30 – while ignoring the fact that her losses in the months she fails this test are larger than her profits in the other months – is particularly implausible.)

By the way, it does seem to me that a simpler solution for coping with this bias is available than that proposed by M&S. Rather than using the biased average of sequences and then correcting for it, why not just remove the bias by taking an average for the full sample (weighting all streaks equally)? Then generate a null by calculating the SWOR estimate for each player, and weighting these by number of streaks. While high-FG% players will be overrepresented in the full sample mean, they are similarly overweighted in the null estimate. So I believe this will still provide a valid comparison.

Does the discovery of this bias mean vindication for a strong “hot hand” effect? I’m much less sure of that than Andrew appears to be. That depends in part on the quality of the earlier studies which M&S now claim to have ‘reversed,’ and I’m not familiar enough with those studies to offer an opinion. In terms of an actual hot hand effect under game conditions, this recent study found that four consecutive baskets elevate the expected FG% on the next shot by only 1%, a very weak effect: http://www.sloansportsconference.com/wp-content/uploads/2014/02/2014_SSAC_The-Hot-Hand-A-New-Approach.pdf. So I feel that a “prior” of a small (c. 2%) hot hand effect is still quite reasonable, and we will need a lot more evidence before revising that upward.

]]>Dear Jonathan

we do have a way around this problem, re-sampling (via permutation), which matches exactly the core assumptions behind the null in the original study. It also allows us to pool data from all players. The method is described here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479

thanks!

]]>Kind of like how aging Barry Bonds took up his game to another level from 1998 to 2001.

]]>Dear Jonathan:

I agree “trick” is the wrong word. Re-weighting illustrates the clustering issue that is going on. Re-weighting, or treating the trial as the unit of observation, makes sense when you have a strong theoretical reason that every coin is coming from the same process with the same parameters. Coins satisfy this criterion; basketball players do not. If you treat the trial as the unit with multiple basketball players, then you will have selection bias in favor of finding the hot hand. If you add a fixed effect, then it is equivalent to the problem we point out with the linear probability model at the end of the intro.

Study 2 of the original paper has a severe endogeneity problem, which was pointed out quite early, e.g. on the first page of Avinash Dixit and Barry Nalebuff’s Thinking Strategically book they explain clearly the problem of strategic adjustment (see it here: http://bit.ly/1eXxdI3 ). Scientifically speaking, this is why Study 3 is so important, because it does not suffer from these issues. If you can show that there is no evidence of hot hand shooting in Study 3, it is reasonable to infer it doesn’t exist. This is also why great hay has been made about the no-effect result in the 3 point study of Koehler & Conley (2003).

ciao!

josh

Just for the record: I now get that there is no easy way around this problem when sequences are short (or even not-that-short, as in the GVT experiments) and probabilities vary.

]]>Hi Jonathan

I wish there was a way to get a notification when a reply comes in.

For your first paragraph:

Please forgive me if I misread what you mean. In your model there was a hot hand, P(H|H)-P(H|T)=0.2>0, and for the estimate (let’s use rf to avoid confusion) the expected value is E[rf(H|H)-rf(H|T)]=-0.2125, so it is also biased in finite samples, and far worse. This means that if you were to assume there is some unknown effect size, and you were to estimate it, and you saw rf(H|H)-rf(H|T)=0, you would have an underestimate, and falsely conclude that there is no hot hand. The correct test is not to compare these two groups of flips as if they are independent groups; the correct test is to re-sample as we do in our other 2 papers (permutation test, details in this paper which corrects for the bias: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479).

Now our goal was not to say one model of the hot hand is right and another model is wrong. There are many ways to operationalize the hot hand using statistical measures and we don’t want to take a stand on what it is, so we have 3 measures that are based on hit streaks which we justify based on identification and power grounds in the other papers, and we reject the null for patterns in these measures, and importantly, we don’t reject the null for patterns in miss streak measures. This is exactly what you would predict if the hot hand was there.

For your second paragraph:

Looking at each player individually *is* the problem. With 100 shots and a 50 percent shooter you are expected to find a difference of -8 percentage points when you condition on 3+ hits. Also, while we report mean adjusted differences, our tests are all based on the exact distribution under the null. For example, in the paper we perform binomial tests based on the true median and the true .05 percentile (using resampling). In case you are curious the median is -6 percentage points.

Hope that helps clear things up. It’s tough to strike the balance between readable exposition and full detail. The full detail is in the first two papers, but I see we need to add a little more detail to this paper.

thanks for kicking the tires!

ciao!

josh

Thanks for this explanation.

It’s not really a reweighting trick, in my opinion. I would say that the R script reweights the “trials” (flips that follow a head) by averaging first over “coins”. If we stuck with trials as a unit, we would not need to reweight at all.

Yes, I did think that I could solve the problem of differing probabilities with regression, and now I see why it might not work (although I still kind of want to try it). I still think there is no problem with GVT’s “Study 2” — since a player’s whole season should have a lot of trials, even for the longer streaks — but maybe you’re not saying that there was a bias problem with Study 2.

]]>Thanks. I completely missed that part of GVT. I’ve looked at GVT several times, and thought I knew what it said. But I apparently never got beyond “Study 2” — perhaps because I am a Sixers’ fan.

Both my comments were focused on GVT’s in-game field goal study, which I still think is not much affected by the M+S bias. As M+S point out, though, in-game data may be too complex to analyze convincingly for the hot hand effect.

]]>PS I was a little too quick to claim knowledge of the behavior of the median — it looks complicated based on a few examples. But more important points are (1) D being higher than its mean, for example D=0, is not necessarily evidence for the hot hand (2) nothing in GVT’s methods look to me like it’s invalidated by biased D.

]]>Joshua,

Thanks for the reply. I don’t question the bias you identified in the mean conditional frequency, and I think it’s a nice point showing a subtle and original example of selection bias. But I question the relevance of the bias for inference about whether the null model (iid) or a Markov/Hidden Markov model is correct. My example shows that even though under the null the expected conditional frequency is below p=.5, a string with conditional frequency .5 can still be evidence in favor of the null and against the hot hand.

I looked at Gilovich-Vallone-Tversky. If they relied on averaging conditional vs. unconditional frequency across players for their methods, your bias would certainly come into play. But they don’t. They look at each player individually, and typically summarize a data set by saying how many players had positive vs. negative serial correlation. In fact in footnote 3, p.304, they say it would be wrong to average across players (for reasons different from yours).

Now I realize that even looking at one player, the biased mean is there. But when GVT discuss how many players have positive vs. negative hot-hand effect, the question is not whether the *mean* is biased but whether the *median* is biased, an entirely different question. In fact the median is not systematically biased, so GVT are justified in thinking that in i.i.d. data, half the players will show hot-hand and half anti-hot-hand, though anti-hot-hand will be by bigger percentage margins, the cause of your bias. That is: Let D=(sampled conditional frequency)-(sampled unconditional frequency). You have shown that the mean of D is negative. But the median of D is not necessarily negative. (Because of integer constraints, the median can be slightly positive or slightly negative, somewhat arbitrarily.)
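For anyone who wants to check the mean-versus-median point on short sequences, here is a minimal R sketch. It assumes fair i.i.d. flips, takes the "sampled unconditional frequency" to be the share of heads in the whole sequence, and drops sequences with no H among the first n-1 flips. The function name `d_values` is made up for illustration.

```r
# D = (share of H among flips that follow an H) - (share of H in the sequence),
# computed for every equally likely sequence of n fair flips (1 = H, 0 = T).
d_values <- function(n) {
  seqs <- as.matrix(expand.grid(rep(list(0:1), n)))
  d <- apply(seqs, 1, function(s) {
    prev <- s[-n]   # flips that have a successor
    nxt <- s[-1]    # the flip that follows each of them
    if (!any(prev == 1)) return(NA_real_)  # conditional frequency undefined
    mean(nxt[prev == 1]) - mean(s)
  })
  d[!is.na(d)]
}

d4 <- d_values(4)
mean_d4 <- mean(d4)      # mean of D over the sequences where D is defined
median_d4 <- median(d4)  # median of D over the same sequences
```

Changing the argument of `d_values` shows how the median moves around with n while the mean stays below zero for short sequences.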

Best,

Jonathan

I think the broad-based intuition for communicating these findings may lie in the (2) thread. As (n) increases from 2 to 20, the conditional proportion P(H|H) can take on many more possible values. For instance:

n=2 P(H|H) {0, 1}

n=3 P(H|H) {0, 0.5, 1}

n=4 P(H|H) {0, 0.5, 0.67, 1}

n=20 P(H|H) {

> sort(unique(prob))
 [1] 0.0000000 0.1000000 0.1111111 0.1250000 0.1428571 0.1666667 0.1818182
 [8] 0.2000000 0.2222222 0.2500000 0.2727273 0.2857143 0.3000000 0.3333333
[15] 0.3636364 0.3750000 0.4000000 0.4166667 0.4285714 0.4444444 0.4545455
[22] 0.4615385 0.5000000 0.5384615 0.5454545 0.5555556 0.5714286 0.5833333
[29] 0.6000000 0.6153846 0.6250000 0.6363636 0.6428571 0.6666667 0.6923077
[36] 0.7000000 0.7142857 0.7272727 0.7333333 0.7500000 0.7692308 0.7777778
[43] 0.7857143 0.8000000 0.8125000 0.8181818 0.8235294 0.8333333 0.8461538
[50] 0.8571429 0.8666667 0.8750000 0.8823529 0.8888889 0.9000000 0.9090909
[57] 0.9166667 0.9230769 0.9285714 0.9333333 0.9375000 0.9411765 0.9444444
[64] 0.9473684 1.0000000

}

A small (n) is analogous to limiting the number of rectangles when numerically integrating f(x): more rectangles = better estimate. The frequencies of these discrete probability bins, which are an artifact of (n), also show P(H|H) converging towards 0.5 for larger (n), with the surrounding bins filling in while the 0’s and 1’s diminish.
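The small-n value sets can be checked by brute force. The sketch below assumes fair flips and exhaustive enumeration rather than simulation, and the helper name `cond_props` is made up for illustration. It lists how many distinct values P(H|H) can take and what its average is as n grows.

```r
# Per-sequence proportion of H among flips that follow an H, for every
# sequence of n fair flips (1 = H, 0 = T); NA when no H occurs in flips 1..n-1.
cond_props <- function(n) {
  seqs <- as.matrix(expand.grid(rep(list(0:1), n)))
  apply(seqs, 1, function(s) {
    prev <- s[-n]
    nxt <- s[-1]
    if (!any(prev == 1)) NA_real_ else mean(nxt[prev == 1])
  })
}

# Number of distinct values and the mean, for n = 2..10.
for (n in 2:10) {
  p <- cond_props(n)
  p <- p[!is.na(p)]
  cat(sprintf("n=%2d  distinct values: %3d  mean: %.4f\n",
              n, length(unique(round(p, 10))), mean(p)))
}

vals3 <- sort(unique(round(cond_props(3), 10)))  # sort() drops the NA
vals4 <- sort(unique(round(cond_props(4), 10)))
```

Enumeration doubles in size with each extra flip, so for something like n=20 a random sample of sequences is the more practical route.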

]]>Hi Zachary

The goal isn’t to find any pattern that rejects the null of a player with a fixed probability of success, the goal is to find specific patterns that are consistent with hot hand shooting. This means that if the hot hand exists, the patterns associated with hot hand shooting (certain types of hit streaks) will lead you to reject the null, but other patterns, for example those that have to do with miss streaks, may not lead you to reject the null. This is what we find across all studies we have looked at. If you peek in the appendix you can see the connection with runs. For a complete discussion on these issues, please see this earlier paper of ours: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479

]]>Hi Sam

It depends on how the predictions are evaluated. If you are evaluated based on the *percent* of predictions that are correct, you can game this, as outlined in the paper. There is a better way to game this evaluation system, unrelated to alternation in finite sequences. It is fun to think about.

]]>It does appear that players try to “hit the iron while it’s hot,” so to speak. That’s what is said about Steph Curry. There might be a way we can look into this; we recorded that data too. Thanks Z!

]]>Dear Jonathan-

The re-analysis of the data from the original Gilovich, Vallone and Tversky study (among others) is here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479

Also the re-weighting trick only works for coins, because we can assume all coins have the same probability of success. This doesn’t work for basketball shots. You might think that you can place this in a regression context and add a fixed effect, but that would bring us back to where we started.

]]>Dear Jonathan

Thanks, this is a nice question to think about.

We didn’t get into this in the intro because over-precision can make the text unreadable. Yes, the bias is calculated relative to the null that the player is a consistent shooter (shoots with a fixed probability of success), which is the reference distribution of the previous studies.

Now for the model you propose the bias is substantially *worse*. Consider p(H|H)-p(H|T) in column 3 as an estimate of the effect size in your model of the data generating process (which you assume to be 0.2). The estimate of the effect size is expected to be E[p(H|H)-p(H|T)]=-.2125, so the bias is -.4125, which is larger in magnitude than the bias of -.33 for this estimate when the data is generated by the consistent shooter model presented in the bottom row of column 3. To see how E[p(H|H)-p(H|T)] is calculated, note that in your model each sequence has probability C*0.5*(0.4)^a*(0.6)^(3-a), in which a is the number of alternations in the sequence and C is the normalizing constant that accounts for sequences in which p(H|H)-p(H|T) evaluates to missing.
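The calculation described above can be reproduced by enumerating the 16 sequences of 4 shots. This is only a sketch of that arithmetic: the function names are made up for illustration, and the "stay" probability of 0.6 (so 0.4 per alternation) is the value from the proposed model. It should return the -.33 and -.2125 figures quoted above.

```r
# All 2^4 sequences of 4 shots (1 = H, 0 = T).
seqs <- as.matrix(expand.grid(rep(list(0:1), 4)))

# p(H|H) - p(H|T) within one sequence; NA when either proportion is undefined.
diff_stat <- function(s) {
  prev <- s[-length(s)]
  nxt <- s[-1]
  if (all(prev == 1) || all(prev == 0)) return(NA_real_)
  mean(nxt[prev == 1]) - mean(nxt[prev == 0])
}
d <- apply(seqs, 1, diff_stat)

# Number of alternations (H->T or T->H) in each sequence.
alt <- apply(seqs, 1, function(s) sum(diff(s) != 0))

# Expected difference when the next shot repeats the previous outcome with
# probability p_stay; sequences with a missing difference are dropped and the
# remaining probabilities renormalized (the constant C).
expected_diff <- function(p_stay) {
  w <- 0.5 * (1 - p_stay)^alt * p_stay^(3 - alt)
  keep <- !is.na(d)
  sum(w[keep] * d[keep]) / sum(w[keep])
}

e_iid <- expected_diff(0.5)     # consistent shooter
e_markov <- expected_diff(0.6)  # proposed model with true effect size 0.2
bias_markov <- e_markov - 0.2
```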

In our paper from last year (“A Cold Shower for the Hot Hand Fallacy”) we explore the power of our statistical tests by considering plausible alternative models of hot hand shooting including a regime switching model (Hidden Markov), which is in the spirit of what you propose, but perhaps more plausible. What you see in this case is that the underestimate of the hot hand is even more severe, but for a different reason than your example. When looking at the relative frequency of H after HHH in the regime switching model, you are using HHH as a proxy for the probability of success in the hot state. This will necessarily be an underestimate of the hot hand because HHH will also occur when a player is in a non-hot state (i.e. you are pooling hot and not-hot shots together).

]]>Jonathan:

See the last 2 columns of Table 3 of this paper (linked in the above post). They give p.hat(hit|3 hits) – p.hat(hit|3 misses):

GVT estimate .03

bias-adjusted estimate .13

So, yeah, it does seem to make a difference in that classic analysis!

I think M+S have an interesting example of a bias that _can_ arise with certain kinds of averaging, but I don’t think they’ve made even a prima facie case that it _does_ arise in the GVT paper (and I don’t think that it does).

]]>Have you ever read Fish, S. Dennis Martinez and the Uses of Theory, Yale Law Journal 96, 1773-1800 (1987)?

I’m pretty sure that asking Larry Bird about hot hands would be a lot like this… (Kareem Abdul-Jabbar might have something edifying to say, but he is an outlier.)

]]>What I always wonder about in these papers, especially the original GVT one, is that they seem to assume that people’s shooting ability (or whatever) is unchanged throughout their career. Look at Serena Williams, is it just chance that she happens to have a Serena Slam over the last 12 months? Or is it that she really has taken her game up a level from where it was a couple of years ago?

]]>Andrew:

AHA! That makes all the difference!

]]>Bill:

Yes, you’re right. I should’ve said “proportion,” not “probability.” I’ve fixed it.

]]>So (with my correction below) I calculated what you say, and get (2 figure accuracy) 0.405, which agrees with what Andrew’s program produces and confirms what you say is being done.

It’s not true that the conditional probability is ~0.4. That is a bogus number calculated in a bogus way.

I am sure Andrew knows this. The question is, if this is what “hot hand” researchers are doing, do they know what they are doing?

]]>Your table has the line “TTHH 2 1”

This isn’t right, because the fourth H should not be counted (it isn’t followed by anything, a rule that you followed on every other line in your table).

Should be “TTHH 1 1”

which makes the total number of heads 24, not 25, and the conditional probability exactly 0.5.

Andrew was not correct to describe the number he calculated as “the conditional probability that he gets heads, given that he got heads on the previous flip”. Your table (except for that error) is identical to what I produced independently, and that number is the actual conditional probability that he gets heads, given that he got heads on the previous flip. Not what Andrew calculated.

That is what was getting me. Andrew made that assertion, which is clearly wrong, and I couldn’t understand what his program was trying to calculate. Some of the other comments (and Zachary’s) help to explain this. But Andrew’s assertion mystified me.
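The two numbers being argued over come from the same 16 four-flip sequences, counted in two different ways. A small R sketch of both counts (variable names are mine, and fair flips with all 16 sequences equally likely are assumed):

```r
seqs <- as.matrix(expand.grid(rep(list(0:1), 4)))  # all 16 sequences, 1 = H
prev <- seqs[, 1:3]   # flips that have a successor
nxt <- seqs[, 2:4]    # the flip that follows each of them

n_after_h <- rowSums(prev == 1)          # flips that follow an H, per sequence
n_hh <- rowSums(prev == 1 & nxt == 1)    # of those, how many are H

# Pool all flips across sequences: every flip counts equally.
pooled <- sum(n_hh) / sum(n_after_h)

# Compute the proportion within each sequence, then average the proportions:
# every sequence counts equally. TTTT and TTTH give 0/0 and are dropped.
per_seq <- n_hh / n_after_h
averaged <- mean(per_seq, na.rm = TRUE)
```

Here `pooled` is 12/24 = 0.5, matching the corrected table total of 24, and `averaged` is 17/42, or about 0.405, which is the number the program produces.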

]]>I think you can show this with a pretty simple directed acyclic graph: for any given sequence of kids (say, consider only families who have 3 kids), you want to know whether probability (boy|prior kid was a boy)=pr(boy), or approximately 0.5.

You obviously can only evaluate families for whom either kid 1 or kid 2 was a boy. So, the DAG includes 3 variables: sex of kid 1 and sex of kid 2, both of which influence the third, whether the family is included in your analysis. The DAG looks like:

kid1_boy --> include family in analysis <-- kid2_boy

Among included families, the two causes of inclusion are associated (in this case, because it's an either/or inclusion rule, they are inversely associated – if kid2 is a girl, then you know kid1 must have been a boy or else the family wouldn't be included in your analysis).

Something that confused me when I first read this is how any association with the sex of kid3 is induced – but upon consideration (by which I mean an embarrassing amount of time obsessively thinking about the problem) actually I don't think there is any association with the last element of the sequence, because that element has no influence on whether the family (or cluster) is included in the analysis.

In short, seems like a nice illustration of collider bias.
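A tiny enumeration makes the induced association concrete. This sketch assumes each of the 8 boy/girl patterns for a 3-kid family is equally likely and uses the either/or inclusion rule described above; the column names are made up for illustration.

```r
# All 8 equally likely three-kid families (1 = boy, 0 = girl).
fam <- expand.grid(kid1 = 0:1, kid2 = 0:1, kid3 = 0:1)

# Inclusion rule: kid 1 or kid 2 is a boy, so there is a "prior boy" to condition on.
inc <- fam[fam$kid1 == 1 | fam$kid2 == 1, ]

# Among included families, kid2's sex now depends on kid1's sex.
p_k2_boy_given_k1_boy <- mean(inc$kid2[inc$kid1 == 1])
p_k2_boy_given_k1_girl <- mean(inc$kid2[inc$kid1 == 0])
r_included <- cor(inc$kid1, inc$kid2)  # correlation within the selected sample
r_all <- cor(fam$kid1, fam$kid2)       # correlation before selection
```

Before selection the two sexes are uncorrelated; after conditioning on inclusion, a girl first guarantees a boy second, which is the inverse association described above.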

Maria

I think you (Zachary) might be misunderstanding the significance of the 4 flips for the authors’ argument. Certainly, the authors recognize the point you & BD McCullough are making. You are actually treating Tbl 1 as if it were a randomly generated sequence of 64 flips; it’s not, and the authors don’t analyze it as such.

The paper’s key point is that “in a finite sequence generated by repeated trials of a Bernoulli random variable the expected conditional relative frequency of successes, on those realizations that immediately follow a streak of successes, is strictly less than the fixed probability of success….”

If you want to disagree with them, then I think you need to show either (a) the authors are wrong about that; or (b) the authors are wrong to understand the analytic strategy of the classic “hot hand” studies as assuming, when they analyzed player performance over a particular interval, that if one examines a finite sequence of outcomes generated by a binary random process, the probability of the recurrence of a particular outcome following a specified string of such outcomes *is* in fact the *same* as the unconditional probability of that outcome within that set.

No one’s done either of those things so far in this discussion, at least as far as I can tell. Even Guy seems to have agreed there was exactly the defect M&S have identified in the original studies; Guy quarrels with how to specify the expected P(success|following specified string of successes) to determine whether the observed “streaks” do in fact differ from what one would expect to see by chance.

]]>To clarify: have you ever played basketball? Who calls a hot hand after 4 shots? A little domain expertise would be prudent here :p

]]>