The principle is accorded some recognition. When reviewing a body of literature on the health effect of an exposure, one of the general principles is to be skeptical of causality if the effect is not very specific. So, to use your example, if coffee were associated with mononucleosis, that might indeed cast doubt on its causal relationship to liver cancer. But the approach to this is quite informal. And it is, in some respects, a questionable perspective to take. There are exposures that have multiple adverse consequences: smoking tobacco causes or contributes to the causation of a mind-boggling plethora of diseases. In fact, it may be that recognition of the causal role of smoking in lung cancer was slightly delayed because of this.

]]>It is also true that, depending on the type of placebo test, the combined results of many placebo tests sometimes have the flavor of a sampling distribution under the null, in the way that a permutation/randomization test might generate the true sampling distribution of BetaHat under the null. But a) I rarely see anyone treat placebo results in this way, or run them systematically in the manner this thinking would necessitate; and b) the usefulness of such sampling distributions is not totally clear. That covers both whether a “distribution under the null” is useful at all and the violations of the null inherent in the process that Andrew identifies, specifically that we should not expect the placebo results to be actually 0. Historically, such distributions have been used only to generate p-values, not re-centered and rescaled to serve as uncertainty intervals around the point estimate. So even if the theory could be worked out to treat various placebo tests as generating a more formal estimate of the sampling distribution of BetaHat, we’d still be stuck with something we don’t believe: here’s what we would get if there were no effect, but we know there is an effect, so how do we bound our uncertainty about that effect size estimate?

But I think this argument is interesting both in terms of Andrew’s framing here (the difference between “significant” and “not significant” is not itself significant) and his general arguments against permutation/randomization tests (they produce uninteresting objects of analysis). Placebo tests fail on both dimensions: they are not sufficiently rigorous to produce an estimate of the sampling distribution they are hinting at; and if they were done rigorously enough, the resulting sampling distribution might not be that useful.

]]>Positive control: testing your experiment (or design, hypothesis, etc.) against something where you know what the effects will be

Negative control: testing your experiment with something you know should have no effect

Here’s a typical example. A change happened at time T, and the main analysis compares trends before and after time T. The placebo test uses the same analysis but makes the comparisons at other time points. All the data are observational, and things are happening at all times. So true effects won’t be exactly zero, even for the placebo comparisons.
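To make that example concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the series length, the size of the jump at T, the slow background drift (standing in for "things are happening at all times"), the window width, and the helper names `simulate_series` and `before_after_diff`. It uses a simple before/after difference in means rather than any particular paper's specification.

```python
import random
import statistics


def before_after_diff(y, cut, window):
    """Mean of the `window` points from `cut` onward minus the mean of the `window` points before it."""
    before = y[cut - window:cut]
    after = y[cut:cut + window]
    return statistics.mean(after) - statistics.mean(before)


def simulate_series(n, true_cut, jump, drift, noise_sd, rng):
    """Linear drift plus Gaussian noise, with a level shift of `jump` from `true_cut` onward."""
    return [
        drift * t + (jump if t >= true_cut else 0.0) + rng.gauss(0.0, noise_sd)
        for t in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n, true_cut, window = 400, 200, 40  # hypothetical values
y = simulate_series(n, true_cut, jump=2.0, drift=0.01, noise_sd=1.0, rng=rng)

# Main analysis: compare the windows on either side of the real change at T.
real_estimate = before_after_diff(y, true_cut, window)

# Placebo analysis: same comparison at other time points whose windows
# do not overlap the real change.
placebo_cuts = [60, 100, 140, 260, 300, 340]
placebo_estimates = {c: before_after_diff(y, c, window) for c in placebo_cuts}

print(f"estimate at T={true_cut}: {real_estimate:.2f}")
for cut, est in placebo_estimates.items():
    print(f"placebo at t={cut}: {est:.2f}")
```

Because of the drift, the expected placebo difference in this setup is drift × window = 0.4, not zero, so a sufficiently precise placebo comparison would "find" an effect even though nothing special happened at that time point.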

]]>Can you elaborate on why “with a large enough sample size, one would expect to reject the null hypothesis on the placebo”?

Suppose I do a randomized “experiment” with two groups, A and B, where there is no difference between the treatments received by A and B. In other words, B receives a placebo treatment of nothing relative to A. Then with a large enough sample size I should NOT reject the null hypothesis that A and B experience the same treatment effect, i.e. the null that the placebo treatment has zero effect.

Admittedly, in econometrics we are not typically thinking of a controlled experiment. But I think the same principle applies. If the placebo “treatment” is nonexistent, and it is essentially arbitrary which observations are deemed “treatment” and “control”, then why would we expect to see statistically significant differences in a large sample?
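The two positions can be put side by side in a small simulation. This is only a sketch under stated assumptions: two groups with unit-variance outcomes, a two-sided z-test at the 5% level, and a hypothetical "tiny but nonzero" placebo difference of 0.01 to represent the observational case. The function name `rejection_rate` and all numeric settings are mine, not from the thread. To keep it fast, each group mean is drawn directly from its exact sampling distribution, Normal(mu, 1/n), rather than simulating n individual observations.

```python
import math
import random


def rejection_rate(n, true_diff, reps, rng, z_crit=1.96):
    """Share of simulated two-group comparisons that reject 'no difference' at the 5% level.

    Each group has n unit-variance observations, so each sample mean is drawn
    directly from Normal(mu, 1/n) instead of simulating n individual points.
    """
    se_mean = 1.0 / math.sqrt(n)
    se_diff = math.sqrt(2.0) * se_mean  # standard error of the difference in means
    rejections = 0
    for _ in range(reps):
        mean_a = rng.gauss(0.0, se_mean)
        mean_b = rng.gauss(true_diff, se_mean)
        z = (mean_b - mean_a) / se_diff
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps


rng = random.Random(1)  # fixed seed so the sketch is reproducible
sample_sizes = [100, 10_000, 1_000_000]
reps = 2000

# Case 1: a truly arbitrary placebo split, so the difference is exactly zero.
null_rates = {n: rejection_rate(n, 0.0, reps, rng) for n in sample_sizes}

# Case 2: a placebo whose true difference is tiny but not exactly zero
# (0.01 is a hypothetical value chosen for illustration).
small_rates = {n: rejection_rate(n, 0.01, reps, rng) for n in sample_sizes}

for n in sample_sizes:
    print(f"n={n}: exact null {null_rates[n]:.3f}, tiny effect {small_rates[n]:.3f}")
```

Under the exact null the rejection rate stays near the nominal 5% at every sample size, which is the point being made here. Once the placebo difference is even slightly nonzero, the rejection rate climbs toward 1 as n grows, which is the situation described for observational placebo comparisons.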

]]>Fair enough. I guess the real issue is not statistical but “sociological,” in that if a placebo test is set up with the goal of succeeding (finding that the claimed results are healthy and robust), then researchers will be able to find robustness; but if the test is set up with the goal of finding the claims are *not* robust, then people will be able to find that non-robustness instead. My impression from lots of papers I’ve seen is that placebo tests, or robustness checks more generally, are set up as a sort of rhetorical tool to shoot down potential objections from reviewers, and they’re not typically open-ended explorations.

But is the placebo test case analytically identical? In this case, the substantive null we wish to test isn’t that the effect is the same under the treatment and the placebo. Rather, somewhat loosely, we wish to know if we would come to qualitatively different conclusions if we studied the placebo rather than the actual treatment. Generating the two stats and separately assessing the conclusions we would draw then seems reasonable.

Suppose for example that I find an effect with the real treatment of 2.5 with a standard error of 1.0. I then estimate the model under the placebo treatment and find an effect of 0.5 with a standard error of 1.0. If I were to test the null that the two effects are equal, I get a test stat of (2.5 - 0.5)/sqrt(1.0^2 + 1.0^2) = 2/sqrt(2) ≈ 1.4 (if the two estimates are uncorrelated), and I cannot reject the null that the two effects are equal.

But if I saw these results I would find the estimated treatment effect more credible. It’s not clear to me why the null that the two effects are actually the same is the one of interest. I don’t think it is.
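For anyone who wants to check the arithmetic in the example above, here is a short sketch. It assumes normal approximations for all three test statistics and, as stated in the example, uncorrelated estimates; the helper name `z_and_p` is mine.

```python
import math


def z_and_p(estimate, se):
    """Return the z statistic and its two-sided p-value under a normal approximation."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| > |z|) for a standard normal Z
    return z, p


real_est, real_se = 2.5, 1.0        # estimate under the real treatment
placebo_est, placebo_se = 0.5, 1.0  # estimate under the placebo treatment

z_real, p_real = z_and_p(real_est, real_se)
z_placebo, p_placebo = z_and_p(placebo_est, placebo_se)

# Difference between the two estimates; the variances add only because
# the estimates are assumed to be uncorrelated.
diff = real_est - placebo_est
diff_se = math.sqrt(real_se ** 2 + placebo_se ** 2)
z_diff, p_diff = z_and_p(diff, diff_se)

print(f"real:       z = {z_real:.2f}, p = {p_real:.3f}")
print(f"placebo:    z = {z_placebo:.2f}, p = {p_placebo:.3f}")
print(f"difference: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

The real estimate clears the conventional 5% threshold on its own (p of roughly 0.01), the placebo does not, and the test of the difference does not either (p of roughly 0.16), which is exactly the pattern under discussion.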

]]>