Kleber Neves writes:

I’ve been a long-time reader of your blog, eventually becoming more involved with the “replication crisis” and such (currently, I work with the Brazilian Reproducibility Initiative).

Anyway, as I’m now going deeper into statistics, I feel like I still lack some foundational intuitions (I was trained as a half computer scientist/half experimental neuroscientist). I write to ask a question about something that I think is simple, but I have a hard time wrapping my mind around.

Should we ever correct for multiple comparisons (within a false positive/false negative framework)?

Say, if I collect data Y and make two unplanned estimates, A and B, and test whether each is significantly different from 0, some people would recommend performing a correction for multiple comparisons (not very worrying in this case, since it’s just two comparisons, but bear with me). But what if this happens in a different time frame? What if I publish A first and only later get to the point in the analysis where I estimate B and then publish it as a separate paper? Should this be corrected? What if I release my dataset and several independent analysts use it to estimate A, B (and C, D, E …)? Wouldn’t that warrant a correction for multiple comparisons?

These scenarios made me think initially that, at the very least, these corrections are inconsistent. But I also find it weird that the number of comparisons performed should change my belief on an assertion – well, it’s not that I find it weird, I think I get the idea of p-hacking, forking paths and researcher degrees of freedom. What I don’t quite get is how this is formalized in terms of probability. I think I reach a similar problem when thinking about preregistration: if I get data Y and estimate a value A from data, does the uncertainty on my estimate of A decrease if it was preregistered?

Now, after thinking about this, I believe the answer is that I “contaminate” the prior. Whatever expectation I have before, it is assumed to be independent of the data. The data is another source of information. If I use the data to inform my choice during analysis (i.e. the prior, in a broad sense), the prior is no longer independent of the data, I’m using the same information twice, it’s “double dipping”, whereas if I do only the planned analysis, I have two independent sources of information.

If the above is correct, then I still don’t see much sense in correcting for multiple comparisons – the problem is actually the lack of a prior. I believe that’s the message of your “Why you should not correct for multiple comparisons” paper with Hill and Yajima. This would be the case for multiplicity that arises from exploratory analysis: I want the data to show me new possibilities of analysis. I can do that, aware of the “double dipping”, aware to be careful with inferences. This would also be the case for multiplicity that arises simply from having many different possible measures for the same thing. Either all measures are valid and then they should agree somewhat (and checking all of them would be what you call a “multiverse analysis”?) or the choice of measure should be justifiable by theory and previous knowledge. Again, the problem would not be using many measures, per se (until you find one with p less than 0.05), but the lack of an a priori justification for the choice of measure.

My reply:

That’s right. From the Bayesian point of view, if you estimate a lot of parameters using flat priors, with the goal of comparing those parameters to zero, you’ll get problems.

Why? Consider the frequency properties of this procedure of classical estimates with uncorrected multiple comparisons.

You can do a little simulation. Suppose you’re estimating theta_j, for j=1,…,100, and for each theta_j you have an unbiased estimate y_j with standard error 1 (for simplicity). Now suppose the true theta_j’s come from a normal distribution with mean 0 and standard deviation 1. Then your point estimates will have a distribution with mean 0 and standard deviation sqrt(2), because the sampling variance of 1 adds to the population variance of 1. So on average you’re overestimating the magnitude of your parameters. But that’s really the least of your problems. If you restrict your attention to the estimates that are statistically significantly different from zero, you’ll be way overestimating these theta_j’s. That’s the type M error problem, and no multiple comparisons correction of statistical significance will fix that.
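Here is a rough sketch of that simulation in Python with NumPy. The function name, the seed, and the 1.96 cutoff for "statistically significant" are my choices for illustration, and I use 100,000 parameters rather than 100 in the demo run only so the summaries are stable.

```python
import numpy as np

def simulate(n_params=100, tau=1.0, se=1.0, seed=0):
    """Draw true effects theta_j ~ N(0, tau) and unbiased estimates y_j ~ N(theta_j, se)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, tau, size=n_params)
    y = rng.normal(theta, se)
    # Conventional two-sided 5% test of each estimate against zero (assumed cutoff).
    significant = np.abs(y) > 1.96 * se
    return theta, y, significant

# Large n_params is used only to reduce simulation noise; the text uses 100.
theta, y, sig = simulate(n_params=100_000)

print("sd of true effects:           ", theta.std())
print("sd of point estimates:        ", y.std())
print("share flagged significant:    ", sig.mean())
print("mean |y| among significant:    ", np.abs(y[sig]).mean())
print("mean |theta| among significant:", np.abs(theta[sig]).mean())
```

The last two lines are the ones to compare: among the estimates that clear the significance threshold, the average estimated magnitude should come out well above the average true magnitude.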

But wait!, you might say: Flip it around. The above simulation gave these results because we assumed a certain distribution for the true theta_j’s that was concentrated near zero; no surprise that the point estimates end up likely to be much larger in absolute value. But what if that wasn’t the case? What if the underlying theta_j’s were much more spread, for example coming from a distribution with mean 0 and standard deviation 100? OK, fine—but then it doesn’t seem so plausible that you’re doing a study to compare all these theta_j’s to zero.
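To see the contrast, here is the same kind of simulation run under both assumptions. This is only a sketch: the "exaggeration" summary below (mean |y| divided by mean |theta| among the significant estimates) is a simple ratio I picked for illustration, and the function name and seed are hypothetical.

```python
import numpy as np

def exaggeration_summary(tau, n_params=100_000, se=1.0, seed=1):
    """Return (ratio of mean |estimate| to mean |true effect| among
    significant results, share of results that are significant)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, tau, size=n_params)
    y = rng.normal(theta, se)
    sig = np.abs(y) > 1.96 * se  # assumed two-sided 5% cutoff
    ratio = np.abs(y[sig]).mean() / np.abs(theta[sig]).mean()
    return ratio, sig.mean()

for tau in (1.0, 100.0):
    ratio, share = exaggeration_summary(tau)
    print(f"tau = {tau:5.0f}: exaggeration ratio = {ratio:.2f}, "
          f"share significant = {share:.2f}")
```

With the wide population distribution, nearly every estimate is significant and the ratio sits close to 1, which is the sense in which comparing each theta_j to zero stops being an interesting question.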

More generally, if we’re estimating a lot of parameters (100, in this example), we can *estimate* the population distribution of the theta_j’s from the data. That’s what Francis Tuerlinckx and I discussed in our original paper on Type M and Type S errors, from 2000.
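A quick way to get a feel for this is a moment-based shortcut: estimate the population variance as the variance of the y_j's minus the known sampling variance, then pull each estimate toward the overall mean by the implied weight. To be clear, this is a crude empirical-Bayes approximation I am using for illustration, not the full hierarchical analysis in the paper, and the function names and seed are made up.

```python
import numpy as np

def shrink(y, se=1.0):
    """Partially pool estimates y toward their mean, using a
    method-of-moments estimate of the population sd (tau)."""
    mu_hat = y.mean()
    # var(y) is roughly tau^2 + se^2, so subtract se^2 and floor at zero.
    tau2_hat = max(y.var(ddof=1) - se**2, 0.0)
    weight = tau2_hat / (tau2_hat + se**2)  # 0 = pool completely, 1 = no pooling
    theta_hat = mu_hat + weight * (y - mu_hat)
    return theta_hat, mu_hat, np.sqrt(tau2_hat)

def demo(n_params=100, tau=1.0, se=1.0, seed=2):
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, tau, size=n_params)
    y = rng.normal(theta, se)
    theta_hat, mu_hat, tau_hat = shrink(y, se)
    rmse_raw = np.sqrt(np.mean((y - theta) ** 2))
    rmse_shrunk = np.sqrt(np.mean((theta_hat - theta) ** 2))
    return tau_hat, rmse_raw, rmse_shrunk

tau_hat, rmse_raw, rmse_shrunk = demo()
print("estimated population sd:", tau_hat)
print("RMSE of raw estimates:   ", rmse_raw)
print("RMSE of shrunk estimates:", rmse_shrunk)
```

With 100 parameters the estimate of the population sd is noisy but usable; with only a handful it would be much less so, which is where the prior on the hyperparameters starts to matter.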

If the number of parameters is small, your inference will depend more strongly on your prior distribution for the hyperparameters that govern the population distribution of theta. That’s just the way it is: if you have less data, you need more prior information, or else you’ll have weaker inferences.

Regarding your question about whether “the number of comparisons performed should change my belief on an assertion”: The key here, I think, is that increasing the number of parameters also supplies new data, and that should be changing your inferences. You write, “What if I publish A first and only later get to the point in the analysis where I estimate B and then publish it as a separate paper? Should this be corrected? What if I release my dataset and several independent analysts use it to estimate A, B (and C, D, E …)? Wouldn’t that warrant a correction for multiple comparisons?” My answer is that new data from others shouldn’t change the *data* that you reported from experiment A, but it should change the *inferences* that we draw. No need to correct your original paper: it is what it is.