# Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.

The following email came in:

I’m in a PhD program (poli sci) with a heavy emphasis on methods. One thing that my statistics courses emphasize, but that doesn’t get much attention in my poli sci courses, is the problem of simultaneous inferences. This strikes me as a problem.

I am a bit unclear on exactly how this works, and it’s something that my stats professors have been sort of vague about. But I gather from your blog that this is a subject near and dear to your heart.

For purposes of clarification, I’ll work under the frequentist framework, since for better or for worse, that’s what almost all poli sci literature operates under.

But am I right that any time you want to claim that two things are significant *at the same time* you need to halve your alpha? Or use Scheffé or whatever multiplier you think is appropriate if you think Bonferroni is too conservative?

I’m thinking in particular of this paper [“When Does Negativity Demobilize? Tracing the Conditional Effect of Negative Campaigning on Voter Turnout,” by Yanna Krupnikov].

In particular the findings on page 803.

Setting aside the 25+ predictors, which smacks of p-hacking to me, to support her conclusions she needs it to simultaneously be true that (1) negative ads themselves don’t affect turnout; (2) negative ads for a disliked candidate don’t affect turnout; (3) negative ads against a preferred candidate don’t affect turnout; (4) late ads for a disliked candidate don’t affect turnout; AND (5) negative ads for a liked candidate DO affect turnout. In other words, her conclusion is valid iff she finds a significant effect at #5.

This is what she finds, but it looks like it just *barely* crosses the .05 threshold (again, p-hacking concerns). But am I right that since she needs to make inferences about five tests here, her alpha should be .01 (or whatever if you use a different multiplier)? Also, that we don’t care about the number of predictors she uses (outside of p-hacking concerns) since we’re not really making inferences about them?

First, just speaking generally: it’s fine to work in the frequentist framework, which to me implies that you’re trying to understand the properties of your statistical methods in the settings where they will be applied. I work in the frequentist framework too! The framework where I don’t want you working is the null hypothesis significance testing framework, in which you try to prove your point by rejecting straw-man nulls.

In particular, I have no use for statistical significance, or alpha-levels, or familywise error rates, or the .05 threshold, or anything like that. To me, these are all silly games, and we should just cut to the chase and estimate the descriptive and causal population quantities of interest. Again, I am interested in the frequentist properties of my estimates—I’d like to understand their bias and variance—but I don’t want to do it conditional on null hypotheses of zero effect, which are hypotheses of zero interest to me. That’s a game you just don’t need to play anymore.

When you do have multiple comparisons, I think the right way to go is to analyze all of them using a hierarchical model—not to pick one or two or three out of context and then try to adjust the p-values using a multiple comparisons correction. Jennifer Hill, Masanao Yajima, and I discuss this in our 2011 paper, Why we (usually) don’t have to worry about multiple comparisons.

To put it another way, the original sin is selection. The problem with p-hacked work is not that p-values are uncorrected for multiple comparisons, it’s that some subset of comparisons is selected for further analysis, which is wasteful of information. It’s better to analyze all the comparisons of interest at once. This paper with Steegen et al. demonstrates how many different potential analyses can be present, even in a simple study.

OK, so that’s my general advice: look at all the data and fit a multilevel model allowing for varying baselines and varying effects.
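As a rough illustration of what analyzing all the comparisons together buys you, here is a minimal sketch of partial pooling under a simple normal-normal model. It is an empirical-Bayes shortcut, not a full multilevel fit, and it assumes every comparison has the same known, positive standard error. The function name `partial_pool` and the eight estimates are hypothetical, made up for illustration.

```python
import statistics

def partial_pool(estimates, std_error):
    """Shrink raw estimates toward their common mean (normal-normal model).

    Assumes all estimates share one known, positive standard error.
    """
    mu = statistics.fmean(estimates)
    # Method-of-moments estimate of the between-comparison variance:
    # observed spread minus the part due to sampling noise, floored at 0.
    tau2 = max(0.0, statistics.variance(estimates) - std_error ** 2)
    # weight = 0 means complete pooling, weight = 1 means no pooling.
    weight = tau2 / (tau2 + std_error ** 2)
    pooled = [mu + weight * (y - mu) for y in estimates]
    return pooled, weight

# Hypothetical estimates from eight comparisons, each with standard error 1.0
raw = [2.8, 0.4, -0.9, 1.1, 0.2, -1.3, 0.7, 1.6]
pooled, weight = partial_pool(raw, std_error=1.0)

print(f"pooling weight: {weight:.2f}")
for y, p in zip(raw, pooled):
    print(f"raw {y:5.2f} -> partially pooled {p:5.2f}")
```

The largest raw estimate is pulled toward the group mean along with everything else, so nothing has to be selected and then corrected after the fact. A real analysis would fit the multilevel model directly and let the baselines and effects vary.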

I took a look at the linked paper. I like the title. “When Does Negativity Demobilize?” is much better than “Does Negativity Demobilize?” The title recognizes that (a) effects are never zero, and (b) effects vary. I can’t quite buy this last sentence of the abstract, though: “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate.” No way! There must be other cases when negativity can demobilize. That said, at this point the paper could still be fine: even if a paper is working within a flawed inferential framework, it could still be solid empirical work. After all, it’s completely standard to estimate constant treatment effects—we did this in our first paper on incumbency advantage and I still think most of our reported findings were basically correct.

Reading on . . . Krupnikov writes, “The first section explores the psychological determinants that underlie the power of negativity leading to the focal hypothesis of this research. The second section offers empirical tests of this hypothesis.” For the psychological model, she writes that first a person decides which candidate to support, then he or she decides whether to vote. That seems a bit of a simplification, as sometimes I know I’ll vote even before I decide whom to vote for. Haven’t you ever heard of people making their decision inside the voting booth? I’ve done that! Even beyond that, it doesn’t seem quite right to identify the choice as being made at a single precise time. Again, though, that’s ok: Krupnikov is presenting a model, and models are inherently simplifications. Models can still help us learn from the data.

OK, now on to the empirical part of the paper. I see what you mean: there are a lot of potential explanatory variables running around: overall negativity, late negativity, state competitiveness, etc etc. Anything could be interacted with anything. This is a common concern in social science, as there is an essentially unlimited number of factors that could influence the outcome of interest (turnout, in this case). On one hand, it’s a poopstorm when you throw all these variables into your model at once; on the other hand, if you exclude anything that might be important, it can be hard to interpret any comparisons in observational data. So this is something we’ll have to deal with: it won’t be enough to just say there are too many variables and then give up. And it certainly won’t be a good idea to trawl through hundreds of comparisons, looking for something that’s significant at the .001 level or whatever. That would make no sense at all. Think of what happens: you grab the comparison with a z-score of 4, setting aside all those silly comparisons with z-scores of 3, or 2, or 1, but this doesn’t make much sense, given that these z-scores are so bouncy: differences of less than 3 in z-scores are not themselves statistically significant.
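The arithmetic behind that last claim can be sketched in a few lines. The sketch assumes two independent estimates with equal standard errors, so that the difference between their z-scores has standard deviation sqrt(2). Under those assumptions the rough figure of 3 works out to about 2.8.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# Assumption: two independent estimates with the same standard error,
# so each z-score has sd 1 and their difference has sd sqrt(2).
sd_of_z_difference = math.sqrt(2)

# Gap between two z-scores needed for the difference itself to reach
# the two-sided 5% level.
threshold = std_normal.inv_cdf(0.975) * sd_of_z_difference
print(f"required gap in z-scores: {threshold:.2f}")  # about 2.77

def p_value_of_difference(z1, z2):
    """Two-sided p-value for the difference between two independent z-scores."""
    z_diff = (z1 - z2) / sd_of_z_difference
    return 2 * (1 - std_normal.cdf(abs(z_diff)))

# The "winner" with z = 4 versus a discarded comparison with z = 2
p_diff = p_value_of_difference(4, 2)
print(f"p-value for z = 4 vs z = 2: {p_diff:.2f}")
```

So a comparison with z = 4 is not clearly distinguishable from one with z = 2, which is why grabbing the maximum and discarding the rest throws away information.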

To put it another way, “multiple comparisons” can be a valuable criticism, but multiple comparisons corrections are not so useful as a method of data analysis.

Getting back to the empirics . . . here I agree that there are problems. I don’t like this:

Estimating Model 1 shows that overall negativity has a null effect on turnout in the 2004 presidential election (Table 2, Model 1). While the coefficient on the overall negativity variable is negative, it does not reach conventional levels of statistical significance. These results are in line with Finkel and Geer (1998), as well as Lau and Pomper (2004), and show that increases in the negativity in a respondent’s media market over the entire duration of the campaign did not have any effect on his likelihood of turning out to vote in 2004.

Not statistically significant != zero.
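To make that concrete, here is a small sketch using a made-up coefficient and standard error. These numbers are hypothetical and are not taken from Table 2 of the paper. It assumes a normal sampling distribution for the estimate.

```python
from statistics import NormalDist

std_normal = NormalDist()

def conf_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval for a single estimate."""
    z = std_normal.inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical coefficient and standard error, NOT from the paper
estimate, std_error = -0.8, 0.5

low, high = conf_interval(estimate, std_error)
p_value = 2 * (1 - std_normal.cdf(abs(estimate / std_error)))

print(f"estimate {estimate}, 95% interval ({low:.2f}, {high:.2f}), p = {p_value:.2f}")
```

The interval includes zero, so the estimate is not statistically significant, but it also includes effects about twice the size of the point estimate. The data here are consistent with no effect and with a substantial negative effect, which is a long way from showing that the effect is zero.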

Here’s more:

Going back to the conclusion from the abstract, “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate,” I think Krupnikov is just wrong here in her application of her empirical results. She’s taking non-statistically-significant comparisons as zero, and she’s taking the difference between significant and non-significant as being significant. Don’t do that.

Given that the goal here is causal inference, I think she would’ve been better off setting this up more formally as an observational study comparing treatment and control groups.

I did not read the rest of the paper, nor am I attempting to offer any evaluation of the work. I was just focusing on the part addressed by your question. The bigger picture, I think, is that it can be valuable for a researcher to (a) summarize the patterns she sees in data, and (b) consider the implications of these patterns for understanding recent and future campaigns, while (c) recognizing residual uncertainty.

Remember Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

The attitude I’m offering is not nihilistic: even if we have not reached anything close to certainty, we can still learn from data and have a clearer sense of the world after our analysis than before.

## 15 thoughts on “Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.”

1. To your correspondent’s specific question–I think the classical hypothesis testing framework would instruct you to *increase* alpha for each of the 5 tests. The reason is that the composite hypothesis stated is a conjunction, not a disjunction. In the typical Bonferroni example, we have one scientific hypothesis, and M statistical hypotheses corresponding to it. If *any* were established, we would take this to establish our scientific hypothesis. To control the type I error rate with respect to our scientific hypothesis, then, we need to be more exacting towards each individual statistical hypothesis–hence, the test-level alpha is decreased. In your example, we would only consider the scientific hypothesis established if *every* statistical hypothesis is established. To control the type I error rate with respect to the scientific hypothesis, then, we need to be more forgiving for each individual statistical hypothesis–hence, the test-level alpha is *increased*.

Note that I’m not commenting here on whether this overall approach is appropriate or not. But the textbook answer to your correspondent’s question is to use a higher alpha, not a lower alpha, for each test, in order to control the family wide type I error rate.
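The disjunction-versus-conjunction argument can be checked with a short calculation. This is only a sketch, and it rests on two strong simplifying assumptions: the M tests are independent, and every one of the M null hypotheses is true. The function names are hypothetical.

```python
def prob_any_rejects(alpha, m):
    """Chance that at least one of m independent true nulls is rejected."""
    return 1 - (1 - alpha) ** m

def prob_all_reject(alpha, m):
    """Chance that all m independent true nulls are rejected."""
    return alpha ** m

alpha, m = 0.05, 5

# Disjunction ("any" establishes the claim): error rate inflates,
# so Bonferroni lowers the per-test alpha.
print(f"P(any of {m} rejects) = {prob_any_rejects(alpha, m):.3f}")  # about 0.226
bonferroni_alpha = alpha / m
print(f"Bonferroni per-test alpha = {bonferroni_alpha:.3f}")

# Conjunction ("all" must be established): error rate collapses,
# so the per-test alpha could be raised and still give 5% overall.
print(f"P(all {m} reject) = {prob_all_reject(alpha, m):.2e}")
per_test_alpha = alpha ** (1 / m)
print(f"per-test alpha giving 5% overall = {per_test_alpha:.3f}")  # about 0.549
```

The calculation depends on all the nulls being true. If, say, four of the five effects are real and well powered, the chance that all five tests reject is close to the per-test alpha, so a raised alpha would no longer hold the overall error rate at 5%.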

• I see what you are saying – you want to reject that coefficients 1-4 are greater than 0 and coefficient 5 is less than 0… that isn’t a classical multiple hypothesis testing problem, it is actually just a simultaneous test of multiple coefficients, driven by clear theoretical predictions. I’ve often thought this could be a powerful statistical tool – if your model makes predictions that clear over a number of outcomes, enumerate them all and test all the relevant parameter estimates at once.

But let’s be clear about what probably happens in most of these circumstances, admitting openly I have no idea about this particular paper: First, you run all your regressions and get the sign of the coefficient estimates. Then you take your generic, over-arching, under-specified theoretical model and start making restrictions on relationships and parameters until you get a set of predictions consistent with all of the coefficient estimates you’ve already done. Then you claim that was your theory all along, and then it looks like you made a series of improbable predictions. #Science

Either way though, I agree with you that there is no real reason to inflate p-values here. But that is easier to say when you don’t feel any need to believe that low p-values imply scientific discovery or theoretical verisimilitude.

• Thank you for the explanation. I agree with your answer “I think the classical hypothesis testing framework would instruct you to *increase* alpha for each of the 5 tests”. Could you point to papers where authors have done that?

2. …I don’t want to do it conditional on null hypotheses of zero effect, which are hypotheses of zero interest to me.

Are they of exactly zero interest to you? Or do you maybe have small-magnitude, context-dependent interests in null hypotheses?

3. “Krupnikov writes, “The first section explores the psychological determinants that underlie the power of negativity…”

I can’t help but wonder if social scientists won’t really understand what statistics is all about as long as they think in terms of “determinants”. My first instinct is to say, “No, not determinants — influences,” but on second thought I realize that many may see “influence” as saying the same as “determinant”, whereas the point is to get out of deterministic thinking (and into probabilistic thinking).

4. Hi all,

I just came across this post (and website (WOW!)) after reading the Steegen / Gelman “multiverse” paper which I found very insightful. So I’m new here.

I am also just getting into thinking about statistics in the manner that is being done here, my background is Epidemiology and Public Health where it seems to be all about hypothesis testing and confidence intervals, so I don’t quite understand a) why the following is being said and b) what the exact alternatives are:

”The framework where I don’t want you working is the null hypothesis significance testing framework, in which you try to prove your point by rejecting straw-man nulls.

In particular, I have no use for statistical significance, or alpha-levels, or familywise error rates, or the .05 threshold, or anything like that. To me, these are all silly games, and we should just cut to the chase and estimate the descriptive and causal population quantities of interest. Again, I am interested in the frequentist properties of my estimates—I’d like to understand their bias and variance—but I don’t want to do it conditional on null hypotheses of zero effect, which are hypotheses of zero interest to me. That’s a game you just don’t need to play anymore.”

I would be very grateful if one of the readers could just point me to some papers so that I can understand why this is said and what could potentially be done instead? (Just as background: what is usually done in my field is survival analysis with prospective cohort data and sometimes TSCS / panel analysis.)

Cheers,
Kai

• Kai:

5. “Estimating Model 1 shows that overall negativity has a null effect on turnout in the 2004 presidential election (Table 2, Model 1). While the coefficient on the overall negativity variable is negative, it does not reach conventional levels of statistical significance. These results are in line with Finkel and Geer (1998), as well as Lau and Pomper (2004), and show that increases in the negativity in a respondent’s media market over the entire duration of the campaign did not have any effect on his likelihood of turning out to vote in 2004.”

Translation: the author didn’t collect evidence, so the original statement must be true. My hypothesis is that the author is a fool, I didn’t spend much time looking otherwise, so yeah, the author is a fool.

6. “The attitude I’m offering is not nihilistic: even if we have not reached anything close to certainty, we can still learn from data and have a clearer sense of the world after our analysis than before.”

Andrew,
It sounds nice, but I still don’t know how to do it, nor have I seen good examples. It looks like this should also affect the way you define your research questions… Can you point to your best empirical paper where you don’t use hypothesis testing, embrace uncertainty, and provide substantive conclusions about the world and the problem being studied?

Thanks!

• Sdaza:

I’ve published dozens of applied papers where I don’t use hypothesis testing, embrace uncertainty, and provide substantive conclusions about the world and the problem being studied. Just look here.

And here are a few examples:
– Our book Red State Blue State
– Our paper on decision making for home radon
– Our paper on toxicology
– Our two papers (here and here) on incumbency advantage
– Our paper on voting power

And there are lots out there: the vanishing swing voter, arsenic in Bangladesh, the effects of survey incentives, death sentencing reversals, trends in public opinion, political polarization, police stops, beauty and sex ratio (that one was a negative finding, but that’s a substantive conclusion too), etc etc etc. Once you start thinking this way, it’s not hard at all.

• Thank you!

• You and Susan should write a paper together. A model that psych folks could more easily adapt. (high impact nearly guaranteed)

For those not in the know — Andrew Gelman : Bayesian Stats :: Susan Gelman : Cognitive Development

• Andrew2:

Susan and I are working on a research project! Ball is now in my court. We gathered a bunch of pilot data and now I’m supposed to analyze it so we can figure out where to go next.