Fabio Rojas asks:

Should I do Bonferroni adjustments? Pros? Cons? Do you have a blog post on this? Most social scientists don’t seem to be aware of this issue.

My short answer is that if you’re fitting multilevel models, I don’t think you need multiple comparisons adjustments; see here.

With respect to genetics or gene expression applications, even the 2nd approach (FDR) does not make a lot of sense. In these cases one can use prior knowledge of co-regulation or biophysical interactions to build a realistic multi-level model. The latter in turn could be used to answer biologically and clinically relevant questions about co-expression, which in turn could suggest patterns of co-regulation.

Or if one is totally lazy, one could just put a Dirichlet Process Prior (or a DPM) on the regression coefficients and have most of them shrink down to zero.

My 2c

Here is a question I’ve been posing (mostly to myself) in this context: If this is data from a larger experiment, or if you are writing multiple papers from one experiment and/or if other groups are using the same data to investigate other outcomes, what is the right number of comparisons?

For instance: what should we multiply our p-value by when using, say, the 2009 March CPS – 200,000?

We recently had a referee ask us to do this on a BioMed paper, and I was like “should we pretend that this $5M project was only to investigate the outcomes in this particular paper, or should we just give up?”

One option I’m considering for the future is using False Discovery Rate corrections within families of outcomes, but this requires pre-specification and reporting of all outcomes and families of outcomes. Ted Miguel does something like this in his Sierra Leone paper:

http://eml.berkeley.edu//~emiguel/pdfs/miguel_gbf.pdf
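To make the "FDR within families of outcomes" idea concrete, here is a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure as the FDR method (the comment does not name one), and the family names and p-values are made up for illustration; nothing here is taken from the linked paper.

```python
from typing import Dict, List


def benjamini_hochberg(pvals: List[float], q: float = 0.05) -> List[bool]:
    """Reject/keep flags from the Benjamini-Hochberg step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject


def fdr_within_families(families: Dict[str, List[float]],
                        q: float = 0.05) -> Dict[str, List[bool]]:
    """Apply the correction separately inside each pre-specified family."""
    return {name: benjamini_hochberg(p, q) for name, p in families.items()}


# Hypothetical families and p-values, for illustration only.
families = {
    "health": [0.001, 0.020, 0.300],
    "education": [0.040, 0.045],
}
decisions = fdr_within_families(families)
for name, flags in decisions.items():
    print(name, flags)
```

The correction is only meaningful if the families and all their outcomes are fixed and reported in advance, which is the pre-specification requirement mentioned above.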

But all that said, I think my current feeling is that this is just another reason we never believe something just because one person found it in one context using one dataset. Conditional on that understanding, my first instinct is that we could just stick with regular p-values (or better, confidence intervals/standard errors).

Jrc:

You’re not listening! The correct answer to “what should we multiply our p-value by,” is: Don’t summarize your inferences with p-values! Fit a multilevel model and these problems go away. No pre-specifying and reporting required.

So if you had an experiment with M>20 outcome variables, all on fundamentally different aspects of the participants, you would estimate those all simultaneously? I can see how a multilevel model makes sense when you have one outcome and a bunch of different covariates you are interested in, but not really when you are trying to measure the effects of something (some “treatment”) across a wide variety of outcomes.

I think we generally agree on the use and misuse of p-values, but I’m not convinced that multilevel modeling across various outcomes really solves the multiple-hypothesis testing problem when you have outcomes measuring different aspects of some intervention.

Jrc:

Yup, I’d model all 20 outcomes. If you want to model them separately, that’s fine, you can model them all separately in a single model, with an independent factor for each outcome. That might be a bit silly but it could be an OK start. The key is to have informative priors for the effects of interest. Hierarchical is one way to have an informative prior but you could just assign the prior directly.

The general point is that it makes sense to study all the phenomena of interest, not to select a single thing. You can then display everything you’re studying; such a display serves as a substitute for selecting one or two p-values to look at.

Cynical mode on

And risk destroying the cottage industry of secondary publications from cutting and repackaging the same data?

Cynical mode off

In the social sciences, data sets like GSS, NELS88, and Add Health are collected precisely so that they are of maximum use to the broadest possible group of researchers. Not to mention the major federal data sets, including the census, that are used by researchers.

Aside from the issue of how to model your data, I think Andrew’s previous response would have been best if just cut off at the second exclamation: “You’re not listening! The correct answer to “what should we multiply our p-value by,” is: Don’t summarize your inferences with p-values!” If you use effect sizes (or MLM), have replications, and talk in terms of real-world significance instead of statistical significance, corrections strike me as unnecessary.

How does talking in terms of real world significance make correction unnecessary? Doesn’t the problem still persist?

> The key is to have informative priors for the effects of interest.

That _should_ be obvious but isn’t, and discussions about multiplicity (which often reflects trying to learn about the world blindly) are always a confusing mess.

I think it has to do with a false sense of objectivity: the idea that what can be learned from a study (aka “evidence”) must not depend on the individual, their background knowledge, or their intentions/purposes. As David Cox once put it to me, “you want inferences to stand as much as possible just on the study in hand”. With all due respect to David: “I want a perpetual motion machine”.

If you don’t know much and don’t have a good sense of purpose, then your prior is vague (no matter what you say), or you need to make adjustments to p-values _and_ confidence intervals, or you need to do shrinkage with a vague prior or equivalent penalization.

I have read the linked paper. Some issues with it:

1) On p. 191, it claims about Bonferroni: “Implicitly, it assumes that these test statistics are independent.” I have seen this in a number of discussions of Bonferroni, but it’s not true, is it? If k tests have a type I error probability of 0.05/k, the probability that at least one type I error occurs is bounded from above by 0.05 (which is the sum of the individual error probabilities) regardless of how dependent the individual tests are, i.e. what the joint error probabilities for more than one test are.

2) The kind of criticism “we don’t believe that the H0 is truly precisely zero” in my opinion isn’t an issue with p-values if they are interpreted correctly (i.e., a large p-value is not interpreted as *confirmation* of the H0); regardless of whether we believe the H0 to be potentially true or not, if the data cannot distinguish what we observe from what is expected under H0, surely the data can’t be evidence for any specific deviation from it. That’s the logic, and it doesn’t rely at all on believing H0 to be potentially true.

3) It seems to me that you made a case for multilevel models being beneficial in situations in which a good case can be made that the different tests are somehow connected (like “the same treatment in different groups” etc.), so that it makes sense to implicitly control the variation between parameters, which is what you’re effectively doing. But I think that there are a number of problems in which people do multiple testing in which such a case cannot convincingly be made. I’m not an expert in genetics but I’d guess that there can be arbitrary variation between what different genes do regarding certain outcomes of interest, just as a crude example. In any case, the question that gave rise to this posting didn’t mention anything like this.
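Point 1 can be checked numerically. The sketch below is only an illustration under stated assumptions: all k null hypotheses are true, the test statistics are standard normal with a common pairwise correlation rho (one particular form of dependence, chosen for convenience), and the function name and simulation settings are hypothetical. The general claim rests on the union bound, not on this simulation.

```python
import random
from statistics import NormalDist


def bonferroni_fwer(k: int, rho: float, alpha: float = 0.05,
                    n_sims: int = 20000, seed: int = 0) -> float:
    """Simulated familywise error rate of Bonferroni when all k nulls are true.

    Each z-statistic is sqrt(rho) * shared + sqrt(1 - rho) * own noise, so the
    statistics are N(0, 1) with pairwise correlation rho (an assumed setup).
    """
    rng = random.Random(seed)
    # Two-sided test at level alpha / k for each statistic.
    crit = NormalDist().inv_cdf(1 - alpha / (2 * k))
    sims_with_error = 0
    for _sim in range(n_sims):
        shared = rng.gauss(0.0, 1.0)
        any_rejected = False
        for _test in range(k):
            z = rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            if abs(z) >= crit:
                any_rejected = True
        sims_with_error += any_rejected
    return sims_with_error / n_sims


fwer_by_rho = {rho: bonferroni_fwer(k=10, rho=rho) for rho in (0.0, 0.5, 1.0)}
for rho, fwer in fwer_by_rho.items():
    print(f"rho={rho}: simulated FWER = {fwer:.4f}")
```

Under this setup the simulated error rate stays at or below about 0.05 for every rho, and it falls as the tests become more dependent. So dependence makes Bonferroni more conservative; it does not make it invalid.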

Regarding the multiple testing issue in general, I tend to interpret such tests as “exploratory devices” whose outcomes, whatever they are, need to be subjected to more focused investigation with new data before they can be treated as a “discovery”. I don’t oppose running many tests in an exploratory manner, to find “things worth having a closer look at” in the future. I often tell people that good graphs can deliver this kind of thing more convincingly than hypothesis tests, but even then a non-significant test may reduce the number of “potentially worthwhile things” to look at later and help the researcher to train their eyes a bit, assessing how (little) extreme some of their findings actually were.

Christian:

To respond to each point:

1. I don’t really care about this since I don’t think type 1 error rates etc are interesting.

2. I find statements such as false positives and negatives to be unhelpful, at least in the problems I work on, because I don’t think effects are zero. If you are interested in studying small effects, that’s fine, but then I’d rather model that directly rather than playing a game with “near-zero” etc.

3. Multilevel models work just fine when there are large differences between groups. Then the group-level variance will be large and there will not be much partial pooling, which is what we want. We discuss this in our paper.
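Point 3 can be illustrated with the simplest normal-normal case. This is a sketch under simplifying assumptions, not the model from the paper: the group-level mean mu, the group-level standard deviation tau, and a common standard error sigma are treated as known, whereas a real multilevel fit estimates mu and tau from the data. The function names and numbers are made up.

```python
from typing import List


def pooling_weight(sigma: float, tau: float) -> float:
    """Weight on a group's own estimate: tau^2 / (tau^2 + sigma^2)."""
    return tau ** 2 / (tau ** 2 + sigma ** 2)


def partially_pooled(y: List[float], sigma: float, mu: float,
                     tau: float) -> List[float]:
    """Posterior means in a normal-normal model with known mu, tau, sigma."""
    w = pooling_weight(sigma, tau)
    return [w * y_j + (1.0 - w) * mu for y_j in y]


# Hypothetical raw group estimates, each with standard error 1.0.
y = [-2.0, 0.5, 3.0]
sigma = 1.0
mu = sum(y) / len(y)

small_tau = partially_pooled(y, sigma, mu, tau=0.5)   # groups look similar
large_tau = partially_pooled(y, sigma, mu, tau=10.0)  # groups differ a lot

print("tau=0.5 :", [round(v, 3) for v in small_tau])
print("tau=10.0:", [round(v, 3) for v in large_tau])
```

When tau is large relative to sigma, the weight on each group's own estimate is close to 1 and the estimates barely move. When tau is small, they are pulled strongly toward mu.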

In some worlds, though, we do think it very possible that effect sizes are zero. Really. Especially in evaluation research. It’s true that underpowered studies lead to “nothing works” conclusions and that’s bad, but there are lots of interventions people try that make no difference.

Regarding genetics, hierarchical modelling is a well-established approach in the analysis of microarray data, precisely because Bonferroni is way too conservative.

I used to explain Bonferroni to students like this.

Someone says they are going to drop three equal size sheets of paper on the floor and they are willing to bet that the area on the floor that will be covered by them will be equal to or less than three times the area of a single sheet. Not a terribly risky bet.

(OK the extension to a non-finite collection of sheets of paper of arbitrary shape requires a theorem.)
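The paper-dropping bet is the union bound (Boole's inequality), and the theorem needed for infinitely many sheets is countable subadditivity of probability. Written out for Bonferroni, with notation ($A_i$, $k_0$, $\alpha$) introduced here purely for illustration, and assuming `amsmath` for `\text`:

```latex
% Union bound for events A_1, ..., A_k, with no assumption about dependence:
\[
  P\Big(\bigcup_{i=1}^{k} A_i\Big) \;\le\; \sum_{i=1}^{k} P(A_i).
\]
% Let A_i = \{ p_i \le \alpha/k \} be a rejection of a true null H_i.
% If k_0 \le k of the nulls are true, each with P(A_i) \le \alpha/k, then
\[
  \mathrm{FWER}
  = P\Big(\bigcup_{i:\,H_i \text{ true}} A_i\Big)
  \;\le\; k_0 \cdot \frac{\alpha}{k}
  \;\le\; \alpha .
\]
```

Overlap between the sheets (dependence between the tests) can only shrink the covered area, which is why the bound never needs independence.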

One situation in which I think Bonferroni adjustments can be useful is as part of a quick-and-dirty method of applying skepticism in reading a paper that’s got lots of hypothesis tests and p-values. In reading such a paper, I routinely count the number n of p-values, then look only at the tests with p-value less than .05/n to see if any of them have an effect size that seems of practical importance. Often that leaves me skeptical that the authors have provided any “evidence” for any of their “conclusions”.
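The screening routine described above can be sketched as follows. The `Result` record, the labels, and the numbers are hypothetical; judging whether a surviving effect size is of practical importance is left to the reader, as in the comment.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    label: str
    p_value: float
    effect_size: float


def bonferroni_screen(results: List[Result], alpha: float = 0.05) -> List[Result]:
    """Keep only results whose p-value is below alpha / n, n = number reported."""
    n = len(results)
    if n == 0:
        return []
    threshold = alpha / n
    return [r for r in results if r.p_value < threshold]


# Made-up p-values and effect sizes from an imaginary paper with five tests.
reported = [
    Result("outcome A", 0.040, 0.10),
    Result("outcome B", 0.030, 0.25),
    Result("outcome C", 0.004, 0.40),
    Result("outcome D", 0.200, 0.05),
    Result("outcome E", 0.049, 0.12),
]

survivors = bonferroni_screen(reported)
for r in survivors:
    print(r.label, r.p_value, r.effect_size)
```

With five reported tests the threshold is .05/5 = .01, so four of the five nominally "significant or nearly so" results drop out before effect sizes are even considered.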

Martha:

The trouble with this approach is that a study can have big multiple comparisons problems even if only one test was done, as Loken and I discuss in our Garden of Forking Paths paper.

Lack of necessity need not be a criticism of sufficiency.

Andrew,

Sounds like you skipped over the first sentence in my comment.

Martha:

Yes, but even if a paper has lots of hypothesis tests and p-values, there could be lots of others that could’ve been done, had the data come out differently.

Pre-registration!

Not sure if old replies are answered, but here I go:

I’ve been wondering how to deal with multiple comparisons in the context of a simple (i.e., no groups) factorial experiment where each individual is simultaneously exposed to n treatments (say, biographical traits). In that case, the outcome of interest is regressed on n treatments, thus analyzing n hypotheses. I don’t see how to fit a hierarchical model here (there’s no hierarchy in the data) and it wouldn’t really make sense to pull estimates towards the mean, but perhaps I’m ill-informed. Assuming no “primary hypotheses” were pre-registered, am I left with Bonferroni then? I noted that a couple of poli sci papers that do such conjoint analyses do not even bother to correct standard errors for multiple comparisons.