In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons

Exhibit A: [2012] Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5, 189-211. (Andrew Gelman, Jennifer Hill, and Masanao Yajima)

Exhibit B: The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time, in press. (Andrew Gelman and Eric Loken)

24 thoughts on “In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons”

  1. I guess it might have been less troublesome if you had mentioned the condition “if we use (Bayesian) hierarchical modeling” in the title of your 2012 paper. ;-)

    On a more serious note, I really liked your 2012 paper, but I was always afraid people might ignore the part about careful Bayesian modeling … Would you want to correct anything in it nowadays?

  2. Maybe my grandmother has an answer to this astonishing outcome. She always tells me that God punishes the small sins during one’s lifetime. In this framing, God somehow sorts out a category of sins that just isn’t worth waiting until the Last Judgment for.

    So, perhaps you did something bad, but not really Bad. And now that’s the punishment: writing about multiple comparisons. On the upside, I think in my grandmother’s frame Satoshi Kanazawa will have to wait for the Last Judgment for his beauty nonsense. That is, one can see this as a sort of redemption: He lets you repair the slip of being grumpy at breakfast or something by making you expose multiple comparison problems.

    You are welcome.

  3. Pingback: Gelman recognizes his error-statistical (Bayesian) foundations | Error Statistics Philosophy

  4. From Andrew’s 2014 paper:

    “… it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough …”

    … if you look hard enough & do not adjust for all that hard looking? Looking hard means looking at many things, in this context, right? So long as you disclose how many things you looked at, things don’t seem that bad?

    About implicit multiple comparisons: If someone did “perform test after test in a search for statistical significance” but didn’t report it & adjust for that, yes, the p-value is meaningless.

    But I’m not sure I entirely buy this “potential comparisons” argument. Until a comparison is actually made, I don’t think it confounds the results.

    The crux of this issue lies in this specific statement from the paper: “given a particular data set, it is not so difficult to look at the data and construct completely reasonable rules for data exclusion, coding, and data analysis that can lead to statistical significance.”

    I’m not convinced that data sets are so obviously transparent that they’d let the researcher construct such rules on sight (unless he did indeed explicitly try stuff, but then that’s conscious p-hacking, which we are not considering here). And if such an obviously transparent exclusion rule were indeed applied, that would be equally obvious to the referees & readers too.

    Of course fishing, umm, “multiple comparisons,” happens, but I don’t buy the subconscious variety.

    • But the p value doesn’t mean what we think it means even if you didn’t go through a process where you calculate p_1, p_2, p_3, … and finally on p_12 you find p_12 < 0.05. All it takes is that there are a variety of possible p values that are reasonable to consider, and you look at your data and think about which one you want to calculate before you calculate it.

      You could get a p < 0.05 on the first try and still have it be invalid provided that your choice of p value was dependent on the data.
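
      For concreteness, a minimal simulation sketch of this point (the sample sizes and the “pick the biggest gap” rule below are invented for illustration): only one t-test is run per simulated dataset, but which outcome gets tested is chosen after looking at the data, so the nominal 5% error rate no longer holds.

```python
# Hypothetical illustration: one test per simulated dataset, but the choice of
# which outcome to test depends on the data, so the nominal 5% error rate fails.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group, n_outcomes = 5000, 30, 5
false_positives = 0

for _ in range(n_sims):
    # Null world: five outcomes, no real difference between groups A and B.
    a = rng.normal(size=(n_per_group, n_outcomes))
    b = rng.normal(size=(n_per_group, n_outcomes))
    # "Forking path": test only the outcome whose group means look most different.
    k = np.argmax(np.abs(a.mean(axis=0) - b.mean(axis=0)))
    false_positives += stats.ttest_ind(a[:, k], b[:, k]).pvalue < 0.05

print(f"false positive rate: {false_positives / n_sims:.3f} (nominal 0.05)")
# With these settings this tends to come out around 0.2, not 0.05.
```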

    • There’s an intermediate ground between “performing test after test” and “sub-conscious multiple comparisons.” Here’s what I’ve been telling my students for years when I discuss “data snooping”:

      One way in which researchers unintentionally obtain misleading results from data snooping is by failing to account for all of the data snooping they engage in. In particular, in accounting for Type I error when data snooping, you need to count not just the actual hypothesis tests performed, but also all the comparisons looked at when deciding which post hoc (i.e., not pre-planned) hypothesis tests to try.

      Hypothetical example:
      A research group plans to compare three dosages of a drug in a clinical trial. There’s no pre-planned intent to compare effects broken down by sex, but the sex of the subjects is recorded.
      The pre-planned comparisons show no statistically significant difference between the three dosages when the data are not broken down by sex. However, since the sex of the patients is known, the researchers decide to look at the outcomes broken down by combination of sex and dosage. They notice that the results for women in the high-dosage group look much better than the results for the men in the low-dosage group, and perform a hypothesis test to check that out. The number of comparisons is not just one plus the number m of pre-planned hypothesis tests, but m + 15, since in addition to the m pre-planned hypothesis tests, they’ve looked at fifteen comparisons: there are 3×2 = 6 dosage×sex combinations, and hence (6×5)/2 = 15 pairs of dosage×sex combinations.
      Thus if they were using a simple Bonferroni correction, to get overall significance rate .05, they would need to use significance level .05/(m + 15) for the post hoc test. (Of course, as pointed out in the Gelman et al. (2012) paper, if the new hypothesis test were statistically significant at this level, there would be a good chance that the effect was overestimated, if not a Type I error outright.)
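
      As a quick check of the arithmetic in this example (m is left as a placeholder, since it isn’t pinned down above):

```python
# Arithmetic from the hypothetical dosage-by-sex example above; m (the number
# of pre-planned tests) is illustrative, not specified in the comment.
from math import comb

m = 3                                # placeholder for the pre-planned tests
cells = 3 * 2                        # 3 dosages x 2 sexes
pairs_looked_at = comb(cells, 2)     # (6 * 5) / 2 = 15 pairwise comparisons
alpha = 0.05

print(pairs_looked_at)               # 15
print(alpha / (m + pairs_looked_at)) # Bonferroni level for the post hoc test
```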

  5. When I saw this title in the sneak preview I’d hoped that this post would have some content talking about your evolution from that paper to this paper. But on reflection it’s maybe a wee bit selfish for me to expect that without having actually read the new paper. I have now; thanks for writing it. When I first read your old paper my gut reaction was, “That’s great for his research context, but offers me nothing I can use in my own” (which at the time was epidemiology). This new one is much closer, though I’ve since switched fields and have entirely new multiple comparisons problems ;) but hey, no learning is really for nought.

    A question about pre-publication replication. One idea that a colleague thought of, for addressing the multiple comparisons issue, was something that seems similar to me — a split analysis where we divided the data into exploratory and confirmatory sets. All the farting around and learning-from-data would be conducted in the exploratory set, but only findings that were replicated in the confirmatory set would be taken seriously. (We had the N to support this, for some of our research questions at least.) Unfortunately for us, the PI hated this split-dataset idea, and so we never got to try it out and see how it worked in practice. Explaining these issues to researchers is one of the worst parts of being a statistician, especially in a field where it’s customary not to take multiple comparisons issues seriously — basically all of epidemiology seems to view itself as exploratory (but they still use p-values).
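
    For concreteness, a minimal sketch of that split-sample workflow, under assumptions of my own (a simple table of independent rows; the variable names and the t-test are placeholders, not anything from the actual project):

```python
# Hypothetical split-sample workflow: explore freely in one half, then treat
# only pre-specified tests in the other half as confirmatory.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "exposure": rng.integers(0, 2, size=1000),   # placeholder variables
    "outcome": rng.normal(size=1000),
})

explore = df.sample(frac=0.5, random_state=42)   # split once, up front
confirm = df.drop(explore.index)

def exposure_test(d):
    """Two-sample t-test of outcome by exposure group."""
    return stats.ttest_ind(
        d.loc[d.exposure == 1, "outcome"],
        d.loc[d.exposure == 0, "outcome"],
    ).pvalue

# All the exploratory poking around happens in `explore`; suppose it ends with
# this one comparison looking interesting. Only its result in `confirm` counts.
print(f"exploratory p = {exposure_test(explore):.3f}, "
      f"confirmatory p = {exposure_test(confirm):.3f}")
```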

    • Isn’t your idea similar to validation datasets in machine learning etc. used to guard against overfitting? Even the Netflix prize, I recall, had separate datasets for training vs validation.

      • Yep, same basic idea. I hadn’t (and still haven’t) seen it used much in the scientific literature outside machine learning, though I’m not sure why. In psychology I have seen it used, e.g., in the context of validating factor analyses.

        • Rahul:

          There can be a lot of choice about which data to include and exclude in your analysis, and which line to fit. I agree that sometimes an analysis is close to predetermined, but in lots and lots of examples that I see, the research team has many forking paths in data processing and analysis.

    • Hi Erin,

      The split halves analysis is certainly better than presenting exploratory analyses as confirmatory analyses, but it seems that — for analyses of samples — replication requires the analysis of new data.

      Let’s imagine that, due to random bias in the sample selection process, there is a spurious correlation in a dataset. We randomly split the dataset into two halves, and then we detect the spurious correlation in the “exploratory” half of the dataset. We develop a theory for the presence of that correlation, and then we detect that correlation in the “confirmatory” half of the dataset.

      I don’t think that the detection of the correlation in the “confirmatory” half of the dataset has confirmed anything regarding the theory or the hypothesis, because the development of the theory and hypothesis was based on data that were not independent of the data that suggested the theory and hypothesis.

      • Correction to the last line: I don’t think that the detection of the correlation in the “confirmatory” half of the dataset has confirmed anything regarding the theory or the hypothesis, because [testing] the theory and hypothesis was conducted with data that were not independent of the data that suggested the theory and hypothesis.

      • Hm! I see your point, but I think that you could take that even farther, to say that confirmation actually requires replication in a different lab. (If I’m doing the sampling for both studies, any biases introduced by my methods, community connections, location, ability to pay, etc in the first study are likely to apply to the second as well.) And it’s clear that that’s not really going to work for prepublication Type I error control!

        Now I’d be especially interested to try this on some real data. Or maybe some simulated data… it would be pretty easy to simulate a big dataset, identify the spurious correlations within it, and then see how many of those persist on subsets of various sizes.
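
        That simulation is easy to sketch (the sizes and the 5% threshold below are arbitrary choices): generate pure noise, flag the “significant” pairwise correlations in one half, and count how many also show up in the other half. A correlation induced by biased sampling, as in the scenario above, would of course appear in both halves, which is exactly the caveat being raised.

```python
# Hypothetical simulation: pure-noise data, so every "significant" correlation
# is spurious; we count how many discoveries in half 1 also appear in half 2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 400, 50
X = rng.normal(size=(n, p))                  # no real structure at all
half1, half2 = X[: n // 2], X[n // 2:]

def significant_pairs(data, alpha=0.05):
    """Column pairs whose Pearson correlation has p-value < alpha."""
    cols = data.shape[1]
    return {
        (i, j)
        for i in range(cols)
        for j in range(i + 1, cols)
        if stats.pearsonr(data[:, i], data[:, j])[1] < alpha
    }

found = significant_pairs(half1)             # "discoveries" in one half
persist = found & significant_pairs(half2)   # how many survive in the other
print(f"{len(found)} spurious hits in half 1, {len(persist)} also appear in half 2")
```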

      • Thanks for that link, wei!
        It’s so interesting for me to read genetic work — geneticists really seem to take the Type I error issue seriously in a way that others I’ve worked with typically don’t. I guess maybe it’s because they don’t want to waste a lot of money following genes that aren’t really doing anything.

        • Honestly, the abuse of statistics in genetics is pretty bad. Almost every practitioner I’ve met follows multiple comparison corrections as a blind ritual. Most have no clue about shrinkage estimation as an alternative strategy.

          They also get really excited about larger and larger sample sizes for detecting smaller effects, without considering that confounded correlations that are ‘real’ in a statistical sense are likely increasing too. Also, many tend to believe that removing a few PCA dimensions is sufficient to subtract out environmental confounding.
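
          For what it’s worth, a minimal sketch of the shrinkage alternative mentioned here: normal-normal empirical Bayes partial pooling of many noisy effect estimates toward zero. All numbers are invented, and a real analysis would use something far more elaborate (e.g., a full hierarchical model).

```python
# Hypothetical empirical-Bayes shrinkage: estimates y_j ~ N(theta_j, s^2) with
# theta_j ~ N(0, tau^2); the posterior mean pulls every estimate toward zero.
import numpy as np

rng = np.random.default_rng(3)
J, s = 1000, 1.0
theta = np.where(rng.random(J) < 0.05, rng.normal(0, 2, J), 0.0)  # mostly nulls
y = theta + rng.normal(0, s, J)                                   # noisy estimates

tau2 = max(y.var() - s**2, 0.0)     # method-of-moments estimate of tau^2
shrink = tau2 / (tau2 + s**2)       # shrinkage factor toward the prior mean 0
theta_hat = shrink * y              # partially pooled estimates

print(f"shrinkage factor = {shrink:.2f}")
print(f"largest raw |estimate| = {np.abs(y).max():.2f}, "
      f"after shrinkage = {np.abs(theta_hat).max():.2f}")
```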

        • I agree with you on the point that multiple-comparison correction does not correct the effect estimate (and we need shrunken estimates), and that EIGENSTRAT has its limitations.

          Can you expand on “confounded correlations … are likely increasing”? Any reference on the subject?
          Thanks,
          wei

      • The paper cited by Wei does not appear to be what Rahul (and I would guess also Erin) were talking about. From the paper’s abstract:

        “Because of the high cost of genotyping hundreds of thousands of markers on thousands of subjects, genome-wide association studies often follow a staged design in which a proportion … of the available sample are genotyped on a large number of markers in stage 1, and a proportion … of those markers are later followed up by genotyping them on the remaining samples in stage 2. The standard strategy for analyzing such two-stage data is to view stage 2 as a replication study and focus on findings that reach statistical significance when stage 2 data are considered alone.”

        What I believe Rahul is talking about is this: Collect all the data on all the cases. (Here “cases” is what is referred to as “samples” in the quote above.) Randomly split the data, with some proportion as the “leave-out” set and the rest as the “training” set. (This needs to be done carefully if there is a complex structure to the data – e.g., nesting.) Use the training set to develop a model; then use the leave-out set to see if the model works well there also.

        Additional comments:

        1. I believe that what the article describes may fall under what Efron calls “filtering”, which messes up the types of techniques he recommends, since it gives a misleading “background” distribution.

        2. David Draper proposes using what he calls “calibrated cross-validation” (CCV). This means partitioning the data into three sets: M for modeling, V for validation, and C for calibration. Use M to explore plausible models and V to test them, iterating the explore/test process as needed. Then fit the best model (or use Bayesian model averaging) using the data set combining M and V, reporting both inferences from this fit and the quality of predictive calibration of this model in C. See http://www.ams.ucsc.edu/~draper/draper-austin-2014-final-version.pdf for more detail.
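
        A rough sketch of that three-way partition, based on my reading of the CCV recipe above (the split sizes, the stand-in least-squares model, and the coverage check are all placeholders):

```python
# Hypothetical CCV-style workflow: M for modeling, V for validation, C held
# back for a final calibration check of the chosen model.
import numpy as np

rng = np.random.default_rng(11)
n = 900
X = rng.normal(size=(n, 3))
y = 0.5 * X[:, 0] + rng.normal(size=n)

idx = rng.permutation(n)
M, V, C = idx[:400], idx[400:650], idx[650:]     # arbitrary proportions

# ... iterate between exploring models on M and testing them on V ...
# Suppose the winner is a one-predictor least-squares fit (a stand-in model).
train = np.concatenate([M, V])
beta = np.linalg.lstsq(X[train][:, [0]], y[train], rcond=None)[0][0]
resid_sd = np.std(y[train] - beta * X[train, 0])

# Calibration on C: do roughly 95% of held-out points fall inside the
# nominal 95% predictive interval?
covered = np.mean(np.abs(y[C] - beta * X[C, 0]) < 1.96 * resid_sd)
print(f"empirical coverage of nominal 95% intervals on C: {covered:.2f}")
```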

  6. What does “Multiplicity would not be a huge problem in a setting of large real differences, large samples, small measurement errors, and low variation.” mean?

    Does it mean ‘raw p-values from testing procedures 1-4 are equivalent in interpretation when power is high’?

    Or ‘you are mostly chasing true signals by just focusing on the best (say top 10) raw p-values out of the 1000 tests when power is high’?
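
    A quick simulation of the second reading (the effect sizes, counts, and sample size are invented): with large true effects, large samples, and small measurement error, the smallest raw p-values out of 1000 tests are essentially all real signals.

```python
# Hypothetical high-power setting: 50 real effects among 1000 tests; the top 10
# raw p-values should nearly always belong to true signals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_tests, n = 1000, 200
is_signal = np.zeros(n_tests, dtype=bool)
is_signal[:50] = True                          # 50 real effects, 950 nulls
effect = np.where(is_signal, 1.0, 0.0)         # large effect relative to noise

pvals = np.array([
    stats.ttest_1samp(rng.normal(effect[j], 1.0, size=n), 0.0).pvalue
    for j in range(n_tests)
])

top10 = np.argsort(pvals)[:10]
print(f"true signals among the 10 smallest raw p-values: {is_signal[top10].sum()}/10")
```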

  7. “To put it another way, we view these papers – despite their statistically significant p-values – as exploratory, and when we look at exploratory results we must be aware of their uncertainty and fragility. It does not seem to us to be good scientific practice to make strong general claims based on noisy data, and one problem with much current scientific practice is the inadvertent multiplicity of analysis that allows statistical significance to be so easy to come by, with researcher degrees of freedom hidden because the researcher only sees one data set at a time.”

    Very well said, Andrew. A large part of this problem is precisely the issue of marketing an exploratory analysis as a definitive finding. The distinction between exploratory analyses and definitive analyses should be brought into the statistical training received by researchers, to help curb this problem.

    Running a web search for “Nature Journal exploratory analysis” yields only one paper in the top few hundred hits:

    “Comprehensive analysis of DNA methylation data with RnBeads”
    (Assenov et al., Nature Methods (2014), doi:10.1038/nmeth.3115), which discusses a software tool for analysis of DNA methylation data, appropriately describing features in the software for conducting exploratory analyses.

    However, I could not find any other Nature papers discussing results from a study labeled as exploratory. Set me straight if you find any, but I can’t imagine Nature editors accepting a paper describing an exploratory finding. This is a shame – if more initial “discoveries” were honestly labeled as the exploratory findings that they are (witness the recent STAP fiasco), and reproduction of findings was encouraged and also published, we would have a far more honest assessment of what effects are real (stand the repeated tests of time) and which are merely interesting initial findings needing further vetting. In truth, since journals such as Nature insist on “cutting edge” findings, many of their published results are exploratory findings, whether or not they will honestly describe them thusly.

    I did see a slew of papers in psychology-oriented publications describing exploratory analysis findings, as well as this honest effort published in an ophthalmology journal:

    “Exploratory Analysis of Diabetic Retinopathy Progression Through 3 Years in a Randomized Clinical Trial That Compares Intravitreal Triamcinolone Acetonide With Focal/Grid Photocoagulation”

    Bressler et al., Arch Ophthalmol. 2009;127(12):1566-1571

    wherein the authors honestly review their study:

    “This study has a number of potential weaknesses. Its protocol was not designed primarily to determine the effect of intravitreal corticosteroids on prevention of the progression of retinopathy, and the analyses presented were not planned secondary outcomes before the onset of the study, although the concept was considered because the analysis plan at the onset of the study included comparison among the change in retinopathy levels on fundus photographs.”

    and draw appropriate conclusions:

    “Conclusions: Intravitreal triamcinolone acetonide (4 mg) appeared to reduce the risk of progression of diabetic retinopathy. Given the exploratory nature of this analysis and because intravitreal triamcinolone adverse effects include cataract formation and glaucoma, use of this treatment merely to reduce the rates of progression of proliferative diabetic retinopathy or worsening of the level of diabetic retinopathy does not seem warranted at this time.”

    Kudos to Andrew for aptly describing a big part of the current crisis of irreproducible findings, to Bressler and co-authors of the diabetic retinopathy paper for providing a great example of how data-driven initial findings should honestly be written up for submission, and to the editor(s) who accepted the Bressler et al. paper for publication. We need a lot more of this.
