You start with (say) 512 names of people you know are interested in football results. You send each of them a prediction for a particular match — half predicting a win for side A, and half for the other side.

After the game has been played, you throw away the addresses of the people you sent the wrong prediction to, and send another prediction to the 256 survivors, again split 50/50 between the teams.

After doing it again you have 64 people who have received three straight correct predictions, and so will wish to subscribe to your newsletter.
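The arithmetic of the scheme is just repeated halving; a one-loop sketch (using the 512 recipients above):

```python
# Each round, the half who received the wrong prediction are dropped.
recipients = 512
for game in range(1, 4):
    recipients //= 2
    print(f"after game {game}: {recipients} people with a perfect record")
# 512 -> 256 -> 128 -> 64: the 64 survivors have seen three straight correct predictions,
# even though every prediction was a coin flip.
```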

Interestingly, this would probably be difficult to pull off today; people would be very likely to compare notes on some Internet forum.

> How do we set up to determine 1.65 expected significant results in 33 comparisons? Is this a binomial distribution calculation (success/failure)? Is there an implicit 0.05 significance level here?

33 × 0.05 = 1.65

Brad:

Sure, with 33 independent comparisons and an expected 1.65 significant results, you could use dbinom() in R to compute the probability of seeing 0, 1, 2, 3, etc. significant results.
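dbinom() is an R function; for readers more comfortable in Python, here is a minimal stdlib sketch of the same calculation, with the binomial pmf written out via math.comb (33 comparisons, each with a 5% chance of a false positive under the null):

```python
from math import comb

def dbinom(k, n, p):
    """Binomial pmf: probability of exactly k successes in n trials,
    each with success probability p (Python analogue of R's dbinom)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 33, 0.05
print(f"expected significant results: {n * p:.2f}")  # 1.65
for k in range(4):
    print(f"P(exactly {k} significant) = {dbinom(k, n, p):.3f}")
```

So even if no effect is real, roughly one or two "significant" results are expected, and the chance of at least one is high.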

But I’d *not* recommend this as a data analysis. I’d do partial pooling on the effect sizes. My point was just that, when there are many comparisons, it’s no surprise to see some low p-values, just by chance.

Conversely, if you have good theoretical reasons to believe these effects, or if the effects are consequential, it can make sense to act on them right away, without using statistical significance as a threshold.

From the above post, my correspondent had written: “This has led to a new business of trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses, with a large number of success criteria.” And this suggested to me that there was no good theoretical reason to expect these effects, in which case from a Bayesian point of view we’d want to do a lot of partial pooling toward 0, which would give us little confidence in any of these claimed effects, even if the separate p-values happened to be less than 0.05 or whatever. The large number of potential comparisons is helpful in understanding how those low p-values came up in the first place.
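As a sketch of what "partial pooling toward 0" does to a single estimate, here is the standard normal-normal shrinkage formula. All numbers are hypothetical, and a real analysis would estimate the prior scale tau from the data with a hierarchical model rather than fixing it:

```python
def shrink_toward_zero(estimate, se, tau):
    """Posterior mean of a true effect under a skeptical prior N(0, tau^2),
    given an unbiased estimate with standard error se."""
    w = tau**2 / (tau**2 + se**2)  # shrinkage weight on the data
    return w * estimate

# Hypothetical numbers: an estimate of 2.0 with se = 1.0 (z = 2, p < 0.05),
# but a skeptical prior with tau = 0.5:
print(shrink_toward_zero(2.0, 1.0, 0.5))  # 0.4: most of the effect is shrunk away
```

With many comparisons and no strong theory, tau is plausibly small, so even nominally significant estimates get pulled most of the way to zero.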

Regarding quantitative understanding, there’s this article which treats the problem from a non-Bayesian perspective. Oddly enough, I haven’t recently written any articles giving the simple Bayesian solution to such problems; I should do that.

Thank you, Andrew, for noting the difference between unpooled drug efficacy comparisons and partially-pooled fund performance comparisons.

Would a kind reader mind please providing calculation details, to help readers like me get a better quantitative understanding of Andrew’s points?

— How do we set up to determine 1.65 expected significant results in 33 comparisons? Is this a binomial distribution calculation (success/failure)? Is there an implicit 0.05 significance level here?

— How do we set up and calculate the analogous significance for, say, 33 partially-pooled mutual funds? Are more details required, e.g. the size of the pools and/or the population of available stocks to choose from?

Thanks very much in advance for any assistance.

Grumbler:

Partial pooling, baby, partial pooling.

As for the incubation of the funds, is that wrong? Some funds will do better by chance, but it is also possible the best performing funds are doing better because they are more skillfully managed. It just helps to keep in mind the multiple “chances” to do well when calculating the odds that a fund’s performance is due to chance alone. If a coach “incubates” multiple players for a team and then cuts the worst ones, does anybody doubt the better-performing players are actually better? Or should I go back to my little league coach and tell him kicking me off for batting .100 when there were .300 players was a misuse of statistics?
