
Against Screening

Matthew Simonson writes:

I have a question that may be of interest to your readers (and even if not, I’d love to hear your response). I’ve been analyzing a dataset of over 100 Middle Eastern political groups (MAROB) to see how these groups react to government repression. Observations are at the group-year level and include human-coded variables based on news reports and archival research such as “Did the group attack civilians in year X” or “was the group outlawed in year Y.” Though I started from a theory, there was no pre-registration, and I’m almost certainly guilty of unwittingly chasing significance down forking paths, but I now have the rare opportunity to make amends. I have come across another dataset of political groups coded at the group-year level (NAVCO 2.0) where I could potentially run the same analyses, thus rendering my first dataset “exploratory,” and then use this new dataset to confirm or refute what I’ve previously found. NAVCO 2.0 has different coding rules (e.g. it tends to lump all the opposition groups in a given campaign into a single unit), wider geographic and temporal scope, and thus relatively little precise overlap with MAROB. Some of the variables it coded are similar to MAROB but don’t match up one-to-one and have different coding rules. Nevertheless, there are a number of aspects of my theory which I think are testable under both datasets. This seems like a golden opportunity to avoid p-hacking, and it may be the closest one ever really gets in international relations to running an experiment over again to confirm one’s earlier findings. How do I go about this in a scientifically-honest manner? How do I avoid screwing this up?

My reply: I do think it makes sense to preregister some analysis on your new dataset, but I don’t think you should pull out just one particular hypothesis to study.

Instead I recommend a multilevel model. Or, if you don’t feel like doing that, you could try an intermediate approach and perform a secret-weapon-type analysis where you look at all of the comparisons of interest, and display all your estimates and uncertainties in a single plot. In any case, you want to get away from the trap of just picking out estimates that exceed some “statistical significance” threshold.
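As a rough illustration of looking at every comparison at once, here is a minimal sketch in Python. All of it is assumed for illustration: the eight comparisons and their standard errors are simulated rather than taken from MAROB or NAVCO, and the hypothetical helper `partial_pool` is a crude normal-normal shrinkage calculation standing in for a real multilevel fit. A text listing stands in for the single plot of estimates and uncertainties.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: J comparisons of interest, each with a raw estimate
# and a standard error. These numbers are simulated, not real data.
J = 8
true_effects = rng.normal(0.0, 0.1, size=J)
se = rng.uniform(0.1, 0.3, size=J)
est = rng.normal(true_effects, se)


def partial_pool(est, se):
    """Crude method-of-moments partial pooling, not a full multilevel model."""
    # Precision-weighted overall mean across comparisons.
    mu = np.average(est, weights=1 / se**2)
    # Between-comparison variance: observed spread minus average
    # sampling variance, floored at zero.
    tau2 = max(np.var(est, ddof=1) - np.mean(se**2), 0.0)
    # Shrinkage weight on the raw estimate: 0 = complete pooling, 1 = none.
    shrink = tau2 / (tau2 + se**2)
    pooled = mu + shrink * (est - mu)
    # Conditional sd given mu and tau2; ignores uncertainty in both.
    pooled_se = np.sqrt(shrink * se**2)
    return mu, tau2, pooled, pooled_se


mu, tau2, pooled, pooled_se = partial_pool(est, se)

# Show every comparison, not just the ones that clear a threshold.
print("comparison   raw est (+/- 2 se)    partially pooled (+/- 2 sd)")
for j in range(J):
    print(f"{j + 1:>10}   {est[j]:+.2f} (+/- {2 * se[j]:.2f})"
          f"        {pooled[j]:+.2f} (+/- {2 * pooled_se[j]:.2f})")
```

The point of the listing is that all eight rows are reported together, with nothing filtered by a significance cutoff. A real analysis would fit the multilevel model properly (for example in Stan) rather than rely on this shortcut.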

This “get away from the trap” thing applies to your initial, exploratory analysis as well. For example, here’s a tempting strategy: Do the exploratory analysis, pick out a statistically significant finding, then run the confirmatory analysis just to be sure. That would be a mistake. Why? Because that statistical significance from the original study—that low p-value or that posterior interval that happens to exclude zero—is itself noisy. Better to push through all your ideas than to do some sort of noisy screening process which is only a step above making your research decisions based on consulting random numbers.
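To see why screening on statistical significance is noisy, here is a small simulation sketch. The settings are assumptions chosen only for illustration: many ideas with small true effects (standard deviation 0.1), an exploratory and a confirmatory study that each measure every effect with standard error 0.2, and a conventional 1.96-standard-error cutoff as the screen.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed settings, for illustration only.
n_ideas = 10_000
se = 0.2                                   # standard error in each study
true = rng.normal(0.0, 0.1, size=n_ideas)  # small true effects

explore = rng.normal(true, se)  # exploratory estimates
confirm = rng.normal(true, se)  # independent confirmatory estimates

# The tempting strategy: keep only what is "significant" in exploration.
passed = np.abs(explore) > 1.96 * se

# How much do the screened exploratory estimates overstate the truth?
exaggeration = np.mean(np.abs(explore[passed])) / np.mean(np.abs(true[passed]))

# How often does a screened finding come up significant again,
# with the same sign, in the confirmatory study?
replicated = (np.abs(confirm[passed]) > 1.96 * se) & (
    np.sign(confirm[passed]) == np.sign(explore[passed])
)
replication_rate = replicated.mean()

print(f"share of ideas passing the screen: {passed.mean():.3f}")
print(f"exaggeration ratio among screened: {exaggeration:.2f}")
print(f"replication rate among screened:   {replication_rate:.3f}")
```

Under these assumed settings, the estimates that survive the screen tend to overstate the underlying effects, and most of them do not reach significance again in the second study. The screen is picking out noise as much as signal.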


  1. Noisiness is thy name. Yes, Andrew, that noisiness has dogged anything to do with ME issues.

    I simply do not envision how such an effort can do justice to a very complex topic. For starters, I see one reference listed in the link provided. Moreover, one would need a command of not only the political history but, ironically enough, a grounding in the economics of wars and conflicts, which I would speculate is beyond the reach of most researchers. Even classified. More importantly, there have been such limited pools of assumptions and ‘theories’ that impede prospects for reducing or ending violence in the ME region.

    • RE: ‘Better to push through all your ideas than to do some sort of noisy screening process which is only a step above making your research decisions based on consulting random numbers.’
      This is where basic logic and grounding in cognitive biases could be helpful, with the caveat that some of these cognitive biases need way more refinement. I’ve been exploring these for over 16 years. Maybe longer. Just a hobby.

  2. Bill Harris says:

    Andrew, I wonder if it’s worth explaining your penultimate paragraph to distinguish it from the “green jellybean” approach. You had a nice post on that sometime recently, I think. This post is more cryptic, and the difference between a secret weapon and a green jellybean seems less blatant.

    My assumption: in the green jellybean case (see xkcd, if that reference is too cryptic), one is using p-values to indicate which “tests” are “significant,” given that you’re doing sequential tests /and/ that you’ve configured the process to allow 5% of the tests to fail erroneously. Out of 20 jellybean colors, one shouldn’t be surprised that some color jellybean shows up as “significant,” even if (when) none of them cause acne.

    In the “secret-weapon-type analysis,” you’re doing one analysis, and you are /estimating/ the true probabilities of each of the multiple results being in a certain range. There’s no multiple “testing” (I admit I’m not 100% confident I can argue that one through to the end: what’s the logical difference between 20 analyses done simultaneously and 20 tests done sequentially, especially if I shield my eyes and refuse to look at the results until the computer has printed them all? I can explain the difference between “test” and “analysis” but perhaps not as clearly between “sequential”–especially “blindfolded sequential”–and parallel). More importantly, assuming the model was set up appropriately, the analysis returns the probability of the result, not the probability of having data this extreme if the assumed hypothesis were true.

    Is that close? How would you clarify that?
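The jellybean arithmetic in the comment above can be checked directly. This is a minimal sketch that assumes the 20 tests are independent and that no color has any real effect; the variable names are illustrative.

```python
# Assumptions: 20 independent tests, each with a 5% false-positive rate,
# and no true effect for any jellybean color.
alpha = 0.05
n_tests = 20

# Probability that at least one test comes out "significant" by chance.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Expected number of chance "significant" results across all tests.
expected_false_positives = alpha * n_tests

print(f"P(at least one false positive) = {p_at_least_one:.3f}")
print(f"Expected false positives       = {expected_false_positives:.1f}")
```

With these assumptions the chance of at least one spurious "significant" color is about 64%, so finding one such color out of 20 is unsurprising.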
