A potential big problem with placebo tests in econometrics: they’re subject to the “difference between significant and non-significant is not itself statistically significant” issue

In econometrics, or applied economics, a “placebo test” is not a comparison of a drug to a sugar pill. Rather, it’s a sort of conceptual placebo, in which you repeat your analysis using a different dataset, or a different part of your dataset, where no intervention occurred.

For example, if you’re performing some analysis studying the effect of some intervention on outcomes during the following year, you run a placebo test by redoing the analysis but choosing a year in which no intervention was done. Or you redo the analysis but choosing an outcome that should be unrelated to the intervention being studied.

Ummm, what’s a precise definition? I can’t find anything on Wikipedia, and I can’t find “placebo” in Mostly Harmless Econometrics . . . hmmm, let me google a bit more . . . OK, here’s something:

A placebo test involves demonstrating that your effect does not exist when it “should not” exist.

That’s about right.

The problem comes in how the idea is applied. What I often see is the following: In the main analysis, the key finding is statistically significant, hence publishable and treated as real. Then the placebo controls show nothing statistically significant (they just look like noise), and the researcher concludes that the main model is fine. But if the researcher is not careful, this process runs into the “difference between significant and non-significant is not itself statistically significant” issue.

I don’t have any easy answers here, in part because placebo tests are often imprecise, in that there are various ways in which correlations between the “placebo treatment” and the outcome, or between the treatment and the “placebo outcome,” can leak into the data. Hence a positive “placebo effect” does not necessarily invalidate a causal finding, and one would typically not expect the true “placebo effect” to be zero anyway—that is, with a large enough sample size, one would expect to reject the null hypothesis on the placebo. But in any case I think there are serious problems with the standard practice in which the researcher hopes not to reject the placebo and then takes this as evidence supporting the favored theory.
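
To make the arithmetic concrete, here’s a minimal numerical sketch (all numbers invented for illustration) of how a “significant” main estimate and a “non-significant” placebo estimate can differ by an amount that is not itself statistically significant:

```python
# Minimal numerical sketch of the problem described above; all numbers
# are invented for illustration.
import math

def z_and_p(estimate, se):
    """Return the z-statistic and two-sided normal p-value."""
    z = estimate / se
    return z, math.erfc(abs(z) / math.sqrt(2))

beta_main, se_main = 2.0, 0.8        # "significant" main estimate
beta_placebo, se_placebo = 0.6, 0.8  # "non-significant" placebo estimate

print("main:       z = %.2f, p = %.3f" % z_and_p(beta_main, se_main))
print("placebo:    z = %.2f, p = %.3f" % z_and_p(beta_placebo, se_placebo))

# The comparison that actually matters: the difference between the two
# estimates (assuming independence, so the standard errors add in quadrature).
se_diff = math.sqrt(se_main**2 + se_placebo**2)
print("difference: z = %.2f, p = %.3f" % z_and_p(beta_main - beta_placebo, se_diff))
```

With these invented numbers the main estimate clears the conventional threshold (p ≈ 0.01), the placebo does not (p ≈ 0.45), and yet the difference between the two estimates comes with p ≈ 0.2.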

P.S. The above cat doesn’t care at all about the placebo effect. Not one bit.

13 thoughts on “A potential big problem with placebo tests in econometrics: they’re subject to the “difference between significant and non-significant is not itself statistically significant” issue”

  1. Interesting point, but I’m not sure I agree. In the usual ‘difference between significant and not is not significant’ case, I really want to get at a hypothesis like, “the effect differs across men and women,” and I incorrectly instead generate two test statistics, one against the null that the effect is zero for men, and one for women.

    But is the placebo test case analytically identical? In this case, the substantive null we wish to test isn’t that the effect is the same under the treatment and the placebo. Rather, somewhat loosely, we wish to know if we would come to qualitatively different conclusions if we studied the placebo rather than the actual treatment. Generating the two stats and separately assessing the conclusions we would draw then seems reasonable.

    Suppose, for example, that I find an effect of 2.5 with a standard error of 1.0 for the real treatment. I then estimate the model under the placebo treatment and find an effect of 0.5 with a standard error of 1.0. If I were to test the null that the two effects are equal, I get a test statistic of (2.5 - 0.5)/sqrt(2) ≈ 1.4 (if the two estimates are uncorrelated), and I cannot reject the null that the two effects are equal (see the quick check at the end of this comment).

    But if I saw these results, I would find the estimated treatment effect more credible. It’s not clear to me why the null that the two effects are actually the same is the one of interest. I don’t think it is.
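
    A quick check of that arithmetic, treating the two estimates as independent as assumed above:

    ```python
    # Verify the difference-in-estimates test statistic from the comment above.
    import math

    beta_treat, se_treat = 2.5, 1.0      # "significant" treatment estimate
    beta_placebo, se_placebo = 0.5, 1.0  # placebo estimate that looks like noise

    se_diff = math.sqrt(se_treat**2 + se_placebo**2)  # sqrt(2), about 1.41
    z_diff = (beta_treat - beta_placebo) / se_diff
    print(round(z_diff, 2))  # about 1.41, well short of the usual 1.96 cutoff
    ```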

    • Chris:

      Fair enough. I guess the real issue is not statistical but “sociological,” in that if a placebo test is set up with the goal of succeeding (finding that the claimed results are healthy and robust), then researchers will be able to find robustness; but if the test is set up with the goal of finding the claims are not robust, then people will be able to find that non-robustness instead. My impression from lots of papers I’ve seen is that placebo tests, or robustness checks more generally, are set up as a sort of rhetorical tool to shoot down potential objections from reviewers, and they’re not typically open-ended explorations.

  2. Andrew,

    Can you elaborate on why “with a large enough sample size, one would expect to reject the null hypothesis on the placebo”?

    Suppose I do a randomized “experiment” with two groups, A and B, where there is no difference between the treatments received by A and B. In other words, B receives a placebo treatment of nothing relative to A. Then with a large enough sample size I should NOT reject the null hypothesis that A and B experience the same treatment effect, i.e. the null that the placebo treatment has zero effect.

    Admittedly, in econometrics we are not typically thinking of a controlled experiment. But I think the same principle applies. If the placebo “treatment” is nonexistent, and it is essentially arbitrary which observations are deemed “treatment” and “control”, then why would we expect to see statistically significant differences in a large sample?

    • The point of this article is that the proper null hypothesis for the placebo is that it has the same effect as the main treatment; it is a mistake to take the null for the placebo as having no effect. Seems reasonable. That seems to address most concerns here; does it miss any?
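
      Picking up on the question above about large samples, here’s a minimal simulation sketch (purely illustrative, all numbers invented): if the placebo comparison is truly null, the rejection rate stays near 5% at any sample size, but if even a small bias or spillover leaks into it, the rejection rate climbs toward 100% as n grows, which is the sense in which one would “expect to reject the null hypothesis on the placebo” with enough data.

      ```python
      # Rejection rate of a simple two-sample placebo test as n grows,
      # for a truly null placebo vs. one with a small amount of leakage.
      import math
      import random

      def placebo_rejection_rate(leak, n, sims=500, sd=1.0):
          """Share of simulations in which |z| > 1.96 for the placebo contrast."""
          rejections = 0
          for _ in range(sims):
              treated = [random.gauss(leak, sd) for _ in range(n)]
              control = [random.gauss(0.0, sd) for _ in range(n)]
              diff = sum(treated) / n - sum(control) / n
              se = sd * math.sqrt(2.0 / n)
              rejections += abs(diff / se) > 1.96
          return rejections / sims

      for n in (100, 1000, 5000):
          print(n,
                round(placebo_rejection_rate(0.0, n), 2),    # truly null: ~0.05
                round(placebo_rejection_rate(0.07, n), 2))   # small leakage: grows with n
      ```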

  3. Are placebo tests ever used in medicine? I know that sounds silly, because they use “real” placebos, but I mean “placebo” in the sense we in economics use it. For example, suppose I am trying to test whether coffee causes liver cancer. A placebo test would be to take the same sample and see if my methodology makes it look like coffee causes mononucleosis, when we have good theory saying coffee should have zero effect. That would be important because in medicine a big problem is that some people are generally unhealthy, prone to all kinds of illness, and that can be correlated with something like coffee-drinking (or, better, whisky, or cigarettes).

    • It’s extremely uncommon. In the few instances that come to mind, the studies used the case-control design, which is notorious for a high risk of confounding by unmeasured attributes of the participants because cases and controls are sampled separately. I’ve never seen this type of additional analysis in reports of a randomized controlled trial. (Obviously I’m reporting just on my own experience here, which may be biased by the topics I follow and the journals I routinely read.)

      The principle is accorded some recognition. When reviewing a body of literature on the health effect of an exposure, one of the general principles is to be skeptical of causality if the effect is not very specific. So, to use your example, if coffee were associated with mononucleosis, that might indeed cast doubt on its causal relationship to liver cancer. But the approach to this is quite informal. And it is, in some respects, a questionable perspective to take. There are exposures that have multiple adverse consequences: smoking tobacco causes or contributes to the causation of a mind-boggling plethora of diseases. In fact, it may be that recognition of the causal role of smoking in lung cancer was slightly delayed because of this.

  4. Sounds like looking for positive and negative control situations.

    Positive control: testing your experiment (or design, or hypothesis, etc.) against something where you know what the effects will be
    Negative control: testing your experiment with something you know should have no effect

  5. The worst is when the placebo estimate is actually bigger than the real point estimate, but is less precisely estimated and then doesn’t cross some p-value threshold, and so the point estimate becomes BetaHat and the placebo estimate becomes 0 (#PassedPlaceboTest).

    It is also true that, depending on the type of placebo test, the combined results of many placebo tests sometimes have the flavor of a sampling distribution under the null, in the way that a permutation/randomization test might generate the true sampling distribution of BetaHat under the null. But a) I rarely see anyone treat placebo results in this way or do them systematically in the manner this thinking would necessitate; and b) the usefulness of such sampling distributions (whether a “distribution under the null” is useful at all, and then considering the violations of the null inherent in the process that Andrew identifies, specifically not expecting the placebo results to actually be 0) is not totally clear, since historically such distributions have been used only to generate p-values, not to be re-centered and re-scaled into uncertainty intervals around the point estimate. So even if the theory could be worked out to treat various placebo tests as generating a more formal estimate of the sampling distribution of BetaHat, we’d still be stuck with something we don’t believe (here’s what we would get if there were no effect, but we know there is an effect, so… how do we bound our uncertainty about that effect size estimate?).

    But I think this argument is interesting both in terms of Andrew’s framing here (the difference between significant and non-significant is not itself significant) and his general arguments against permutation/randomization tests (they produce uninteresting objects of analysis). Placebo tests fail in both dimensions: they are not done rigorously enough to produce the estimate of a sampling distribution they are hinting at; and if they were done rigorously enough, the sampling distribution that resulted might not be that useful.

  6. In epidemiology, these “placebo tests” are usually called negative controls. Negative controls are often used for informal sensitivity analyses or sanity checks. The OMOP group proposed using the distribution of estimated effects in a range of negative controls as a sort of empirical null distribution when doing hypothesis testing (https://onlinelibrary.wiley.com/doi/full/10.1002/sim.5925). Eric Tchetgen Tchetgen has an interesting recent article (https://arxiv.org/pdf/1808.04945.pdf) on how (and under what assumptions) negative controls can be used to identify causal effects in the presence of unobserved confounders. This is sort of an extension of standard difference-in-differences methods, which also make use of negative controls.
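
    A minimal sketch of that empirical-null idea, with made-up numbers (the calibration in the linked paper is more elaborate, e.g. it accounts for each negative control’s own standard error; this is just the flavor):

    ```python
    # Rough sketch: use estimates from negative controls as an empirical null
    # distribution and compute a calibrated p-value for the main estimate.
    # All numbers below are invented.
    import math
    import statistics

    # Estimated "effects" for outcomes believed to have no true effect.
    # Any systematic bias in the design shows up as a shift/spread here.
    negative_control_estimates = [0.31, -0.05, 0.22, 0.40, 0.12, 0.27, -0.10, 0.35]

    null_mean = statistics.mean(negative_control_estimates)
    null_sd = statistics.stdev(negative_control_estimates)

    # Calibrated two-sided p-value for the main estimate, judged against the
    # empirical null rather than a textbook null centered at zero.
    beta_main = 0.55
    z = (beta_main - null_mean) / null_sd
    p_calibrated = math.erfc(abs(z) / math.sqrt(2))
    print(round(null_mean, 2), round(null_sd, 2), round(p_calibrated, 2))
    ```

    The only design choice here is to re-center and re-scale the null using the negative controls; the linked paper develops the idea more formally.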

  7. Thanks for the answers, Clyde and Z. Something like that ought to be applied to the effect of smoking on heart disease, which seems much less well-established than its effect on lung cancer, where the mechanism is also clearer. The hard part would be getting data on some malady which is clearly not caused by smoking, e.g. car accidents, yellow fever. Maybe some genetic disorder that only shows up at age 50?
