A colleague writes:

When I was in NYC I went to a party thrown by a group of Japanese bio-scientists. There, one guy told me about how the biggest pharmaceutical company in Japan did their statistics. They ran 100 different tests and reported the most significant one. (This was in 2006, and he said they had stopped doing this a few years back, so they were doing it until pretty recently…) I’m not sure if this was 100 multiple comparisons or 100 different kinds of test, but I’m sure they wouldn’t want to disclose their data…

Ouch!

Too bad this is how statistics gets done when billions of dollars of investment are at stake.

At a smaller scale, in many bio-science (and other?) research labs around the world (including here in the States), where millions of dollars of research funding are at stake, 100 experiments are run, but only the one experiment that produced a breakthrough (which might very well be due to a mistake or an accident) and fit the publishable hypothesis gets reported in prestigious scientific journals. Years later the problem might get discovered, or not at all…

At least they are a little better than the 100% data fabricators =)

http://xkcd.com/882/

Yes, that is why “selection effects” must be taken account of, as they are very naturally in error statistics. Now I’m not sure what people think the consequences of the likelihood principle are for this kind of “hunting with a shotgun”.

Prof Mayo, could you please explain how one takes this into account in practice using a non-Bayesian approach?

If I recall my Jaynes correctly (no guarantee there), one should perform a Bayesian calculation by including all relevant information, which presumably includes the information that one performed a whole bunch of statistical tests. My understanding is that using the Bayesian paradigm, a hierarchical modelling approach can be used to incorporate such information. I am not aware of the best way to do this with a non-Bayesian approach. The post-hoc corrections do seem appealing, and lead to larger uncertainty intervals.
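To make the hierarchical idea concrete, here is a minimal sketch (my own illustration, not from the thread): when all 100 estimates are modelled jointly as draws from a common population, an empirical-Bayes shrinkage estimate pulls the "winning" result back toward zero, precisely because the information from the other 99 tests is included. The moment-based estimate of the between-effect variance is an assumption of this sketch, not something the commenter specified.

```python
import random
import statistics

random.seed(1)

# Hypothetical setup: 100 tests whose true effects are all zero.
# Each observed z-score is the true effect plus N(0, 1) noise.
true_effects = [0.0] * 100
z = [e + random.gauss(0, 1) for e in true_effects]

winner = max(z)  # the "most significant" result that gets reported alone

# Empirical-Bayes sketch of a two-level hierarchical model:
#   effect_i ~ N(0, tau^2),   z_i ~ N(effect_i, 1)
# Method-of-moments estimate of tau^2 uses Var(z) = tau^2 + 1.
tau2 = max(0.0, statistics.pvariance(z) - 1.0)
shrink = tau2 / (tau2 + 1.0)       # posterior-mean multiplier, in [0, 1)
winner_shrunk = shrink * winner    # estimate after pooling all 100 tests

print(f"reported winner: {winner:.2f}, shrunk estimate: {winner_shrunk:.2f}")
```

Because the true effects here are all zero, the estimated between-effect variance is small and the extreme result is shrunk heavily; reporting the raw winner alone discards exactly the information that drives this correction.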

And I doubt any statistical paradigm will correct for the sort of data dredging that occurs when one tests a large number of hypotheses, chooses to report only the most favourable results, and makes use of calculations that do not include information about the other statistical tests that were performed.

Rob: I don’t regularly return to the site of my blog commentaries or other crimes (I’m sure you can subscribe, and I thought I had, but my e-mail is a mess).

Now I can’t tell if this is the same hypothesis tested 100 times, with those results that are statistically significant being reported, or 100 related hypotheses (e.g., benefits associated with a drug). It is typical in non-Bayesian inference to adjust, but differently in the different cases. That’s because the probability of finding k statistically significant test results out of 100 tests exceeds the “nominal” significance level for an individual test. The issue is related to that of optional stopping, with tests or confidence intervals: with a try-and-try-again stopping rule, the procedure is guaranteed to end with a statistically significant result, and with an interval that excludes the true value, even when the null hypothesis is true.
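The gap between the nominal level and the probability of finding at least one significant result can be computed exactly for the simplest case of independent tests; the sketch below (my own worked arithmetic, assuming independence and all-true nulls) also shows the standard Bonferroni adjustment that caps the family-wise error rate:

```python
# Probability of at least one "significant" result among m independent
# tests at level alpha, when every null hypothesis is true.
alpha, m = 0.05, 100
p_at_least_one = 1 - (1 - alpha) ** m
print(f"P(>=1 false positive in {m} tests at alpha={alpha}): "
      f"{p_at_least_one:.3f}")

# Bonferroni adjustment: test each hypothesis at alpha/m instead,
# which bounds the family-wise error rate by alpha.
adjusted = alpha / m
p_fwer = 1 - (1 - adjusted) ** m
print(f"FWER with Bonferroni (per-test level {adjusted}): {p_fwer:.4f}")
```

With 100 independent tests at the 5% level, a false positive is nearly certain (about 0.994), which is why reporting only the most significant of 100 tests is so misleading; the Bonferroni-adjusted per-test level restores the intended 5% bound, at the cost of wider intervals.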

Given your policy on blog comments, I appreciate that you replied to my question. Thank you.