The following email came in:
I’m in a PhD program (poli sci) with a heavy emphasis on methods. One thing that my statistics courses emphasize, but that doesn’t get much attention in my poli sci courses, is the problem of simultaneous inferences. This strikes me as a problem.
I am a bit unclear on exactly how this works, and it’s something that my stats professors have been sort of vague about. But I gather from your blog that this is a subject near and dear to your heart.
For purposes of clarification, I’ll work under the frequentist framework, since for better or for worse, that’s what almost all poli sci literature operates under.
But am I right that any time you want to claim that two things are significant *at the same time* you need to halve your alpha? Or use Scheffe or whatever multiplier you think is appropriate if you think Bonfronni is too conservative?
I’m thinking in particular of this paper [“When Does Negativity Demobilize? Tracing the Conditional Effect of Negative Campaigning on Voter Turnout,” by Yanna Krupnikov].
In particular the findings on page 803.
Setting aside the 25+ predictors, which smacks of p-hacking to me, to support her conclusions she needs it to simultaneously be true that (1) negative ads themselves don’t affect turnout, (2) negative ads for a disliked candidate don’t affect turnout; (3) negative ads against a preferred candidate don’t affect turnout; (4) late ads for a disliked candidate don’t affect turnout AND (5) negative ads for a liked candidate DO affect turnout. In other words, her conclusion is valid iff she finds a significant effect at #5.
This is what she finds, but it looks like it just *barely* crosses the .05 threshold (again, p-hacking concerns). But am I right that since she needs to make inferences about five tests here, her alpha should be .01 (or whatever if you use a different multiplier)? Also, that we don’t care about the number of predictors she uses (outside of p-hacking concerns) since we’re not really making inferences about them?
First, just speaking generally: it’s fine to work in the frequentist framework, which to me implies that you’re trying to understand the properties of your statistical methods in the settings where they will be applied. I work in the frequentist framework too! The framework where I don’t want you working is the null hypothesis significance testing framework, in which you try to prove your point by rejecting straw-man nulls.
In particular, I have no use for statistical significance, or alpha-levels, or familywise error rates, or the .05 threshold, or anything like that. To me, these are all silly games, and we should just cut to the chase and estimate the descriptive and casual population quantities of interest. Again, I am interested in the frequentist properties of my estimates—I’d like to understand their bias and variance—but I don’t want to do it conditional on null hypotheses of zero effect, which are hypotheses of zero interest to me. That’s a game you just don’t need to play anymore.
When you do have multiple comparisons, I think the right way to go is to analyze all of them using a hierarchical model—not to pick one or two or three out of context and then try to adjust the p-values using a multiple comparisons correction. Jennifer Hill, Masanao Yajima, and I discuss this in our 2011 paper, Why we (usually) don’t have to worry about multiple comparisons.
To put it another way, the original sin is selection. The problem with p-hacked work is not that p-values are uncorrected for multiple comparison, it’s that some subset of comparisons is selected for further analysis, which is wasteful of information. It’s better to analyze all the comparisons of interest at once. This paper with Steegen et al. demonstrates how many different potential analyses can be present, even in a simple study.
OK, so that’s my general advice: look at all the data and fit a multilevel model allowing for varying baselines and varying effects.
What about the specifics?
I took a look at the linked paper. I like the title. “When Does Negativity Demobilize?” is much better than “Does Negatively Demobilize.” The title recognizes that (a) effects are never zero, and (b) effects vary. I can’t quite buy this last sentence of the abstract, though: “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate.” No way! There must be other cases when negativity can demobilize. That said, at this point the paper could still be fine: even if a paper is working within a flawed inferential framework, it could still be solid empirical work. After all, it’s completely standard to estimate constant treatment effects—we did this in our first paper on incumbency advantage and I still think most of our reported findings were basically correct.
Reading on . . . Krupnikov writes, “The first section explores the psychological determinants that underlie the power of negativity leading to the focal hypothesis of this research. The second section offers empirical tests of this hypothesis.” For the psychological model, she writes that first a person decides which candidate to support, then he or she decides whether to vote. That seems a bit of a simplification, as sometimes I know I’ll vote even before I decide whom to vote for. Haven’t you ever heard of people making their decision inside the voting booth? I’ve done that! Even beyond that, it doesn’t seem quite right to identify the choice as being made at a single precise time. Again, though, that’s ok: Krupnikov is presenting a model, and models are inherently simplifications. Models can still help us learn from the data.
OK, now on to the empirical part of the paper. I see what you mean: there are a lot of potential explanatory variables running around: overall negativity, late negativity, state competitiveness, etc etc. Anything could be interacted with anything. This is a common concern in social science, as there is an essentially unlimited number of factors that could influence the outcome of interest (turnout, in this case). On one hand, it’s a poopstorm when you throw all these variables into your model at once; on the other hand, if you exclude anything that might be important, it can be hard to interpret any comparisons in observational data. So this is something we’ll have to deal with: it won’t be enough to just say there are too many variables and then give up. And it certainly won’t be a good idea to trawl through hundreds of comparisons, looking for something that’s significant at the .001 level or whatever. That would make no sense at all. Think of what happens: you grab the comparison with a z-score of 4, setting aside all those silly comparisons with z-scores of 3, or 2, or 1, but this doesn’t make much sense, given that these z-scores are so bouncy: differences of less than 3 in z-scores are not themselves statistically significant.
To put it another way, “multiple comparisons” can be a valuable criticism, but multiple comparisons corrections are not so useful as a method of data analysis.
Getting back to the empirics . . . here I agree that there are problems. I don’t like this:
Estimating Model 1 shows that overall negativity has a null effect on turnout in the 2004 presidential election (Table 2, Model 1). While the coefficient on the overall negativity variable is negative, it does not reach conven- tional levels of statistical significance. These results are in line with Finkel and Geer (1998), as well as Lau and Pomper (2004), and show that increases in the negativity in a respondent’s media market over the entire duration of the campaign did not have any effect on his likelihood of turning out to vote in 2004.
Not statistically significant != zero.
Going back to the conclusion from the abstract, “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate,” I think Krupnikov is just wrong here in her application of her empirical results. She’s taking non-statistically-significant comparisons as zero, and she’s taking the difference between significant and non-significant as being significant. Don’t do that.
Given that the goal here is causal inference, I think it would’ve been better off setting this up more formally as an observational study comparing treatment and control groups.
I did not read the rest of the paper, nor am I attempting to offer any evaluation of the work. I was just focusing on the part addressed by your question. The bigger picture, I think, is that it can be valuable for a researcher to (a) summarize the patterns she sees in data, and (b) consider the implications of these patterns for understanding recent and future campaigns, while (c) recognizing residual uncertainty.
Remember Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”
The attitude I’m offering is not nihilistic: even if we have not reached anything close to certainty, we can still learn from data and have a clearer sense of the world after our analysis than before.