Comments on: A potential big problem with placebo tests in econometrics: they’re subject to the “difference between significant and non-significant is not itself statistically significant” issue

By: Eric Rasmusen

Eric Rasmusen — Wed, 03 Oct 2018 14:56:08 +0000

Thanks for the answers, Clyde and Z. Something like that ought to be applied to the effect of smoking on heart disease, which seems much less well-established than its effect on lung cancer, where the mechanism is also clearer. The hard part would be getting data on some malady which is clearly not caused by smoking, e.g. car accidents, yellow fever. Maybe some genetic disorder that only shows up at age 50?

By: Dzhaughn

Dzhaughn — Thu, 27 Sep 2018 17:45:29 +0000

In reply to Devin Caughey. The point of this article is that the proper null hypothesis for the placebo is that it has the same effect as the main treatment; it is a mistake to take the null for the placebo as having no effect. Seems reasonable. That seems to address most concerns here; does it miss any?

By: Clyde Schechter

Clyde Schechter — Thu, 27 Sep 2018 14:35:40 +0000

In reply to Eric B Rasmusen. It's extremely uncommon. In the few instances that come to mind, the studies used the case-control design, which is notorious for a high risk of confounding by unmeasured attributes of the participants because cases and controls are sampled separately. I've never seen this type of additional analysis in reports of a randomized controlled trial. (Obviously I'm reporting just on my own experience here, which may be biased by the topics I follow and the journals I routinely read.) The principle is accorded some recognition. When reviewing a body of literature on the health effect of an exposure, one of the general principles is to be skeptical of causality if the effect is not very specific. So, to use your example, if coffee were associated with mononucleosis, that might indeed cast doubt on its causal relationship to liver cancer. But the approach to this is quite informal. And it is, in some respects, a questionable perspective to take. There are exposures that have multiple adverse consequences: smoking tobacco causes or contributes to the causation of a mind-boggling plethora of diseases. In fact, it may be that recognition of the causal role of smoking in lung cancer was slightly delayed because of this.

By: Z

Z — Thu, 27 Sep 2018 14:35:06 +0000

In epidemiology, these “placebo tests” are usually called negative controls. Negative controls are often used for informal sensitivity analyses or sanity checks. The OMOP group proposed using the distribution of estimated effects in a range of negative controls as a sort of empirical null distribution when doing hypothesis testing (https://onlinelibrary.wiley.com/doi/full/10.1002/sim.5925). Eric Tchetgen Tchetgen has an interesting recent article (https://arxiv.org/pdf/1808.04945.pdf) on how (and under what assumptions) negative controls can be used to identify causal effects in the presence of unobserved confounders. This is sort of an extension of standard difference-in-differences methods, which also make use of negative controls.
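The empirical-null idea can be illustrated with a deliberately simplified sketch. It is not the OMOP procedure itself: it just fits a normal distribution to a handful of negative-control estimates and asks how unusual the estimate of interest looks against that distribution, rather than against a null centered at zero. It assumes all estimates have similar precision, and every number and function name below is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def naive_p(estimate, std_error):
    """Two-sided p-value against the usual null of exactly zero effect."""
    z = estimate / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def empirical_null_p(estimate, control_estimates):
    """Two-sided p-value against a normal fitted to negative-control estimates.

    Simplifying assumption: the spread of the control estimates already
    reflects both sampling error and residual bias, so no separate
    standard error is added for the estimate of interest.
    """
    null = NormalDist(mean(control_estimates), stdev(control_estimates))
    z = (estimate - null.mean) / null.stdev
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical log relative risks for outcomes believed to be unaffected.
negative_controls = [0.10, 0.25, -0.05, 0.30, 0.15, 0.20, 0.05, 0.35]

# Hypothetical estimate of interest and its standard error.
estimate, std_error = 0.40, 0.10

p_naive = naive_p(estimate, std_error)
p_calibrated = empirical_null_p(estimate, negative_controls)
print(f"naive p = {p_naive:.5f}, empirical-null p = {p_calibrated:.3f}")
```

With these made-up numbers the negative controls are themselves shifted away from zero, so an estimate that looks decisive against a zero null is much less striking against the empirical one.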

By: jrc

jrc — Wed, 26 Sep 2018 23:12:22 +0000

The worst is when the placebo estimate is actually bigger than the real point estimate, but is less precisely estimated and then doesn’t cross some p-value threshold, and so the point estimate becomes BetaHat and the placebo estimate becomes 0 (#PassedPlaceboTest).

It is also true that, depending on the type of placebo test, the combined results of many placebo tests sometimes have the flavor of a sampling distribution under the null, in the way that a permutation/randomization test might generate the true sampling distribution of BetaHat under the null. But a) I rarely see anyone treat placebo results in this way or do them systematically in the manner this thinking would necessitate; and b) the usefulness of such sampling distributions (whether “distribution under the null” is useful at all, and then considering the violations of the null inherent in the process that Andrew identifies, specifically not expecting the placebo results to be actually 0) is not totally clear, since historically such distributions have been used only to generate p-values, not to be re-centered and re-formatted/sized to be used as uncertainty intervals around the point estimate. So even if the theory could be worked out to treat various placebo tests as generating a more formal estimate of the sampling distribution of BetaHat, we’d still be stuck with something we don’t believe (here’s what we would get if there were no effect, but we know there is an effect, so... how do we bound our uncertainty about that effect size estimate?).

But I think this argument is interesting both in terms of Andrew’s framing here (diff b/w sig and not sig is not sig) and his general arguments against permutation/randomization tests (they produce uninteresting objects of analysis). Placebo tests fail in both dimensions: they are not sufficiently rigorous to produce an estimate of the sampling distribution they are hinting at; and if they were done rigorously enough, the sampling distribution that resulted from them might not be that useful.

By: george

george — Wed, 26 Sep 2018 21:34:24 +0000

Sounds like looking for positive and negative control situations.

Positive control: testing your experiment (or design, or hypothesis, etc.) against something where you know what the effects will be.
Negative control: testing your experiment with something you know should have no effect.

By: Eric B Rasmusen

Eric B Rasmusen — Wed, 26 Sep 2018 19:45:25 +0000

Are placebo tests ever used in medicine? I know that sounds silly, because they use “real” placebos, but I mean “placebo” in the sense we in economics use it. For example, suppose I am trying to test whether coffee causes liver cancer. A placebo test would be to take the same sample and see if my methodology makes it look like coffee causes mononucleosis, when we have good theory saying coffee should have zero effect. That would be important because in medicine a big problem is that some people are generally unhealthy, prone to all kinds of illness, and that can be correlated with something like coffee-drinking (or, better, whisky, or cigarettes).

By: Devin Caughey

Devin Caughey — Wed, 26 Sep 2018 19:34:13 +0000

Erin Hartman and Danny Hidalgo have a forthcoming article on this issue: https://www.erinhartman.com/equivalence/

By: Andy Whitten

Andy Whitten — Wed, 26 Sep 2018 18:31:47 +0000

In reply to Andrew. That makes sense. Thanks.

By: Andrew

Andrew — Wed, 26 Sep 2018 18:05:33 +0000

In reply to Andy Whitten. Andy W: Here's a typical example. A change happened at time T, and the main analysis compares trends before and after time T. The placebo test uses the same analysis but makes the comparisons at other time points. All the data are observational, and things are happening at all times. So true effects won't be exactly zero, even for the placebo comparisons.
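A small calculation may help separate the two cases in this exchange. If the placebo effect is exactly zero, the rejection rate stays at the nominal level however large the sample; if it is small but not exactly zero, the rejection rate climbs toward 1 as the sample grows. The sketch below assumes a simple two-group comparison with a known outcome standard deviation and a two-sided z-test. The effect size of 0.02 standard deviations and the function name are illustrative assumptions, not values from the discussion.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def rejection_probability(true_effect, n_per_group, sigma=1.0, alpha=0.05):
    """Probability that a two-sided z-test rejects the null 'effect = 0'.

    Assumes two equal-sized groups and a known outcome SD `sigma`.
    """
    std_error = sigma * sqrt(2.0 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = true_effect / std_error
    # Reject when the z statistic lands in either tail.
    return STD_NORMAL.cdf(-z_crit + shift) + STD_NORMAL.cdf(-z_crit - shift)

for n in (100, 10_000, 1_000_000):
    exact_zero = rejection_probability(0.0, n)
    tiny_effect = rejection_probability(0.02, n)  # hypothetical 0.02 SD effect
    print(f"n={n:>9,}  zero effect: {exact_zero:.3f}  tiny effect: {tiny_effect:.3f}")
```

The randomized experiment with identical arms corresponds to the first column; placebo comparisons in observational data, where other things are changing at every time point, are closer to the second.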

By: Andy Whitten

Andy Whitten — Wed, 26 Sep 2018 17:54:35 +0000

Andrew,

Can you elaborate on why “with a large enough sample size, one would expect to reject the null hypothesis on the placebo”?

Suppose I do a randomized “experiment” with two groups, A and B, where there is no difference between the treatments received by A and B. In other words, B receives a placebo treatment of nothing relative to A. Then with a large enough sample size I should NOT reject the null hypothesis that A and B experience the same treatment effect, i.e. the null that the placebo treatment has zero effect.

Admittedly, in econometrics we are not typically thinking of a controlled experiment. But I think the same principle applies. If the placebo “treatment” is nonexistent, and it is essentially arbitrary which observations are deemed “treatment” and “control”, then why would we expect to see statistically significant differences in a large sample?

By: Andrew

Andrew — Wed, 26 Sep 2018 16:41:49 +0000

In reply to Chris Auld. Chris: Fair enough. I guess the real issue is not statistical but "sociological," in that if a placebo test is set up with the goal of succeeding (finding that the claimed results are healthy and robust), then researchers will be able to find robustness; but if the test is set up with the goal of finding the claims are not robust, then people will be able to find that non-robustness instead. My impression from lots of papers I've seen is that placebo tests, or robustness checks more generally, are set up as a sort of rhetorical tool to shoot down potential objections from reviewers, and they're not typically open-ended explorations.

By: Chris Auld

Chris Auld — Wed, 26 Sep 2018 16:36:43 +0000

Interesting point, but I’m not sure I agree. In the usual ‘difference between significant and not is not significant’ case, I really want to get at a hypothesis like, “the effect differs across men and women,” and I incorrectly instead generate two test statistics, one against the null that the effect is zero for men, and one for women.

But is the placebo test case analytically identical? In this case, the substantive null we wish to test isn’t that the effect is the same under the treatment and the placebo. Rather, somewhat loosely, we wish to know if we would come to qualitatively different conclusions if we studied the placebo rather than the actual treatment. Generating the two stats and separately assessing the conclusions we would draw then seems reasonable.

Suppose for example that I find an effect with the real treatment of 2.5 with a standard error of 1.0. I then estimate the model under the placebo treatment and find an effect of 0.5 with a standard error of 1.0. If I were to test the null that the two effects are equal, I get a test stat of (2.5 - 0.5)/sqrt(1.0^2 + 1.0^2) = 2/sqrt(2) ~= 1.4 (if the stats are uncorrelated), and I cannot reject the null that the two effects are equal.

But if I saw these results I would find the estimated treatment effect more credible. It’s not clear to me why the null that the two effects are actually the same is the one of interest. I don’t think it is.
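The arithmetic in this example can be checked with a short sketch. It assumes, as the comment does, that the two estimates are uncorrelated, and additionally that each is approximately normal so that z statistics map to two-sided p-values. The function names are illustrative only.

```python
import math

def z_stat(estimate, std_error):
    """z statistic against the null of zero effect."""
    return estimate / std_error

def z_difference(est_a, se_a, est_b, se_b):
    """z statistic for est_a - est_b, assuming uncorrelated estimates."""
    return (est_a - est_b) / math.sqrt(se_a**2 + se_b**2)

def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

treatment = (2.5, 1.0)  # estimate, standard error (from the example)
placebo = (0.5, 1.0)

z_treat = z_stat(*treatment)
z_placebo = z_stat(*placebo)
z_diff = z_difference(*treatment, *placebo)  # 2 / sqrt(2), about 1.41

for label, z in [("treatment", z_treat), ("placebo", z_placebo), ("difference", z_diff)]:
    print(f"{label:>10}: z = {z:.2f}, p = {two_sided_p(z):.3f}")
```

Read separately, the treatment clears the conventional 5% threshold and the placebo does not, while the direct comparison of the two does not clear it. That is exactly the pattern under debate in this thread.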