The scope for snooping

Macartan Humphreys sent the following question to David Madigan and me:

I am working on a piece on the registration of research designs (to prevent snooping). As part of it we want to give some estimates for the “scope for snooping” and how this can be affected by different registration requirements.

So we want to answer questions of the form:
“Say that in truth there is no relation between x and y, and you were willing to mess about with models until you found a significant relation between them. What are the chances that you would succeed if:
1. You were free to choose the indicators for x and y
2. You were free to choose h control variables from some group of k possible
3. You were free to divide up the sample in k ways to examine heterogeneous treatment effects
4. You were free to select from some set of k reasonable models”

People have thought a lot about the first problem of choosing your indicators. We have done a set of simulations to answer the other questions, and find, for example, that freedom to add control variables gives you a lot of latitude for small datasets, but this decreases quickly as datasets become larger; freedom to focus on subpopulations gives huge latitude; freedom to select models (e.g., linear, logit, probit) doesn’t do much.

The question is: are there analytic results on these things already? Or is there already a literature assessing these different approaches to snooping?
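
Macartan’s subpopulation point is easy to illustrate with a simulation. Below is a stdlib-only sketch (not the authors’ actual code; the sample size, the number of subgroups, and the crude per-subgroup z-test with a 1.96 cutoff are all arbitrary choices): under a true null, test the treatment effect separately in each of k subgroups and declare victory if any one is “significant.”

```python
import math
import random

def snoop_rate(n=400, k=10, reps=500, seed=1):
    """Chance of finding >= 1 'significant' subgroup effect under a true null."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.random() < 0.5 for _ in range(n)]   # random binary treatment
        y = [rng.gauss(0, 1) for _ in range(n)]      # outcome unrelated to x
        g = [rng.randrange(k) for _ in range(n)]     # k arbitrary subgroups
        for j in range(k):
            y1 = [yi for yi, xi, gi in zip(y, x, g) if gi == j and xi]
            y0 = [yi for yi, xi, gi in zip(y, x, g) if gi == j and not xi]
            if len(y1) < 2 or len(y0) < 2:
                continue
            m1 = sum(y1) / len(y1)
            m0 = sum(y0) / len(y0)
            v1 = sum((v - m1) ** 2 for v in y1) / (len(y1) - 1)
            v0 = sum((v - m0) ** 2 for v in y0) / (len(y0) - 1)
            se = math.sqrt(v1 / len(y1) + v0 / len(y0))
            if se > 0 and abs(m1 - m0) / se > 1.96:  # crude per-subgroup test
                hits += 1
                break
    return hits / reps
```

With k subgroups the success rate behaves roughly like 1 − 0.95^k, so with k = 10 the snooper “finds” an effect in something like 40–50% of null datasets, versus about 5% for a single pre-specified test.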

David wrote:

I’ve been involved in a large-scale drug safety signal detection project for the last two or three years. We have shown empirically that for any given safety issue, by judicious choice of observational database (we looked at 10 big ones), method (we looked at about a dozen), and method setup, you can get *any* answer you want: a big positive and highly significant RR, a big negative and highly significant RR, and everything in between. Generally I don’t think there is any way to say definitively that any one of these analyses is a priori obviously stupid (although “experts” will happily concoct an attack on any approach that does not produce the result they like!). The medical journals are full of conflicting analyses and I’ve come to the belief that, at least in the medical arena, the idea that human experts *know* the *right* analysis for a particular estimand is false.

I’m all for registration of observational studies with pre-specified protocols. The bit I’m not so sure about is whether such a process will necessarily produce better answers…

To which Macartan replied:

That’s very interesting, if a bit depressing. The following result is easy: for any variables x, y and coefficient b there is a z such that a regression of y on x, controlling for z, yields b, with as low a p as you like; simply define z = y – bx. Of course whether you can find that z is another question…

I saw this today on registration: Mathieu et al: “Comparison of Registered and Published Primary Outcomes in Randomized Controlled Trials” which suggests that only 31% of articles that registered actually registered properly and stuck to their plans; among those that changed their plans there were a lot of positive results… However at least with registration you can go back and see what effects are due to the changing plans.
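
Macartan’s construction can be checked numerically: with z = y − bx, the identity y = b·x + 1·z means the two-regressor fit is exact, so the coefficient on x is b and the residuals are zero. A quick sketch (plain Python; the no-intercept OLS solver via the 2×2 normal equations is just for illustration):

```python
import random

def coef_on_x(x, y, z):
    # OLS of y on x and z (no intercept): solve the 2x2 normal equations
    sxx = sum(a * a for a in x)
    szz = sum(c * c for c in z)
    sxz = sum(a * c for a, c in zip(x, z))
    sxy = sum(a * b for a, b in zip(x, y))
    szy = sum(c * b for c, b in zip(z, y))
    det = sxx * szz - sxz * sxz
    return (sxy * szz - szy * sxz) / det  # Cramer's rule, coefficient on x

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [rng.gauss(0, 1) for _ in range(100)]  # unrelated to x
b = 2.5                                    # any coefficient we like
z = [yi - b * xi for yi, xi in zip(y, x)]  # the constructed control
print(coef_on_x(x, y, z))  # ~2.5, even though x and y are unrelated
```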

Meanwhile, I wrote the following reply to the original question:

The short answer is that I think a determined researcher can find all sorts of things. My solution to this snooping problem is not to forbid analyses but rather the opposite: to set up the data so people can do all possible analyses. If everything is noise, that should show up in the distribution of findings.

To which Macartan replied:

Sounds very democratic, but it requires that different people are happy to find all sorts of things, which, if they have correlated stakes in the answers, they might not.

But there is something else here which I find harder to put my finger on: say I propose model A ex ante and find wonderful effects, and then you come along and run models B, C, D, E, … and find no effects; should I infer that my result was spurious? No, not unless I thought that B, C, D, … are just as plausible tests of whatever my claim is. But of course if I did find them just as plausible then I would have been happy to include them in my initial statement of the test to be run. In other words, the extra analyses that you would admit would only matter to me if they are the ones that I wouldn’t have forbidden in the first place. What precommitting then does is just move forward the conversation about what the family of plausible models is, to a point where it is not influenced by results.

Then I wrote:

It’s not a matter of the original finding being spurious; it’s about putting it in a larger context. Consider the 8-schools example in chapter 5 of Bayesian Data Analysis. Inference for each individual school is informed by the data from the others. See also this presentation (which includes that example).
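
For readers without the book at hand, here is the flavor of the 8-schools calculation, using the published estimates and standard errors from Rubin (1981). This is a minimal sketch: the between-school sd tau is fixed at an assumed value, whereas the full analysis in BDA averages over its posterior.

```python
# Partial pooling for the 8-schools data (Rubin 1981; BDA ch. 5).
# With tau fixed, each school's posterior mean is a precision-weighted
# average of its own estimate and the pooled mean mu.
y  = [28, 8, -3, 7, -1, 1, 18, 12]    # estimated coaching effects
se = [15, 10, 16, 11, 9, 11, 10, 18]  # standard errors

tau = 5.0  # assumed between-school sd (a hyperparameter here)
mu = sum(yj / s**2 for yj, s in zip(y, se)) / sum(1 / s**2 for s in se)
post = [(yj / s**2 + mu / tau**2) / (1 / s**2 + 1 / tau**2)
        for yj, s in zip(y, se)]
```

Each school’s estimate is pulled toward the pooled mean, with the noisiest estimates (largest standard errors) shrinking the most; the extreme school A, at +28, ends up far closer to the others.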

Then Macartan:

I don’t see how that addresses the issue. It is still the case that for whatever model you settle on (including a multi level Bayesian model that uses data from all schools) someone can muck about with features of the model to get results they like.

And me:

Yes, that’s always true in any case. But a multilevel model will handle many of the issues of concern.

At this point it seemed worth posting the discussion for all of you.


  1. “Generally I don’t think there is any way to say definitively that any one of these analyses is a priori obviously stupid (although “experts” will happily concoct an attack on any approach that does not produce the result they like!). The medical journals are full of conflicting analyses and I’ve come to the belief that, at least in the medical arena, the idea that human experts *know* the *right* analysis for a particular estimand is false.”

    This seems overly harsh to me. OMAP is working with fairly low-quality data (prescription claims for Medicare/Medicaid, GPRD and other clinical databases, and the like) due to the nature of the problem (one of the OMAP investigators is actually here at the University of Florida). The clearest examples of high-quality medical data are likely randomized, controlled, double-blinded clinical trials. But there is a whole layer of data between these two extremes of data quality (prospective cohort studies, for example).

    Sure, it is true that prospective cohort studies tend to be underpowered to detect rare adverse drug side effects (for precisely the same reason that RCTs are). But there is a lot of research that does not generate conflicting results, or where the experts really seem to have a good grasp of the problem. The link between serum cholesterol levels and cardiovascular events, for example, seems relatively solid and widely replicated.

    So I would be careful about generalizing to all of medical research.

    That being said, I have a great deal of frustration with medical database research for a lot of the same reasons as David Madigan does.

  2. We report simulations and discussions relevant to the original question in a recent paper, “False-Positive Psychology.”

    We also discuss possible solutions to the problem.

    We end up proposing disclosure rather than pre-registration for pragmatic reasons, but those may be specific to the domain of study.

  3. dan says:

    In some sense, I would say it’s even worse than is being suggested above: even in a completely randomized experiment, it is still possible to find anything you want given the right control variables. Not “on average,” of course, but in any given experiment.

    Rubin discusses this issue in “Estimating causal effects of treatments in randomized and nonrandomized studies” (1974). So Andrew’s remark that “that’s always true in any case” is exactly the point; the apparently shocking points above actually reduce to a much more general discussion about whether we can find out anything at all.

  4. freddy says:

    There are some ways of permitting limited data-snooping and still having well-calibrated inference in the end. For example, see Zhang, Tsiatis, and Davidian:

    … the method lets you pick adjustment variables post hoc, in a randomized trial setting, and thus buy yourself some efficiency over using unadjusted analyses (the default) or using a pre-specified set of adjustments. Typically it won’t be a huge efficiency gain, but it’s better than nothing.