Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests

Peter Bergman points me to this discussion from Cyrus of a presentation by Guido Imbens on design of randomized experiments.

Cyrus writes:

The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis—what Imbens referred to as “testing”—along with a (2) Neyman-type point estimate of the sample average treatment effect and confidence interval—what Imbens referred to as “estimation.” . . .

Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I [Cyrus] took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them.

I agree completely. This is something I’ve been saying for a long time—I actually became aware of the problem when working on my Ph.D. thesis, where I tried to fit a model that was proposed in the literature but it did not fit the data. Thus, the confidence interval that you would get by inverting the hypothesis test was empty. You might say that’s fine—the model didn’t fit, so the conf interval was empty. But what would happen if the model just barely fit? Then you’d get a really tiny confidence interval. That can’t be right.

Here’s what was happening:

Sometimes you can get a reasonable confidence interval by inverting a hypothesis test. For example, the z or t test or, more generally, inference for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny precise confidence interval to no interval at all. To put it another way, as your fit gets gradually worse, the inference from your confidence interval becomes more and more precise and then suddenly, discontinuously has no precision at all. (With an empty interval, you’d say that the model rejects and thus you can say nothing based on the model. You wouldn’t just say your interval is, say, [3.184, 3.184] so that your parameter is known exactly.)

The only thing I didn’t like about the above discussion–it’s not Cyrus’s fault, I think I have to blame it on Guido–is the emphasis on the Fisher-style permutation test. As I’ve written before (for example, see section 3.3 of this article from 2003), I like model checking but I think the so-called Fisher exact test almost never makes sense, as it’s a test of an uninteresting hypothesis of exactly zero effects (or, worse, effects that are nonzero but are identical across all units) under a replication that typically doesn’t correspond to the design of data collection. I’d rather just skip that Fisher and Neyman stuff and go straight to the modeling.

OK, I understand that Guido has to communicate with (methodologically) ultraconservative economists. Still, I’d prefer to see the modeling approach placed in the center, and then he can mention Fisher, Neyman, etc., for the old-school types who feel the need for those connections. I doubt I would disagree with anything Guido would do in a data analysis; it’s perhaps just a question of emphasis.

P.S. I realize from the comments that my above example isn’t clear enough. So here is some more detail:

The idea is that you’re fitting a family of distributions indexed by some parameter theta, and your test is a function T(theta,y) of parameter theta and data y such that, if the model is true, Pr(T(theta,y)=reject|theta) = 0.05 for all theta. The probability here comes from the distribution p(y|theta) in the model.

In addition, the test can be used to reject the entire family of distributions, given data y: if T(theta,y)=reject for all theta, then we can say that the test rejects the model.

This is all classical frequentist statistics.

Now, to get back to the graph above, the confidence interval given data y is defined as the set of values theta for which T(theta,y)!=reject. As noted above, when you can reject the model, the confidence interval is empty. That’s ok since the model doesn’t fit the data anyway. The bad news is that when you’re close to being able to reject the model, the confidence interval is very small, hence implying precise inferences in the very situation where you’d really rather have less confidence!

This awkward story doesn’t always happen in classical confidence intervals, but it can happen. That’s why I say that inverting hypothesis tests is not a good general principle for obtaining interval estimates. You’re mixing up two ideas: inference within a model and checking the fit of a model.
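Here is a minimal numerical sketch of the pattern described above. The setup is an assumption chosen for illustration, not the model from the thesis: y_i ~ N(theta, 1) for n = 10 observations, with the test rejecting theta when T(theta,y) = sum((y_i - theta)^2) exceeds 18.307, the 0.95 quantile of a chi-square distribution with 10 degrees of freedom. Because T(theta,y) = S + n(ybar - theta)^2, where S is the residual sum of squares, the non-rejected set is ybar +/- sqrt((18.307 - S)/n) when S is below the cutoff and empty otherwise. The data vector and function names are hypothetical.

```python
import math

# Assumed toy setup (not from the post): y_i ~ N(theta, 1) with n = 10.
# 18.307 is the 0.95 quantile of a chi-square distribution with 10 degrees of freedom.
CRIT = 18.307

# Hypothetical data pattern with mean zero; scaling it up makes the fit worse.
BASE = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]


def inverted_interval(y, crit=CRIT):
    """Return {theta : sum((y_i - theta)^2) <= crit} as (lo, hi), or None if empty."""
    n = len(y)
    ybar = sum(y) / n
    misfit = sum((v - ybar) ** 2 for v in y)  # smallest value the test statistic can take
    slack = crit - misfit
    if slack < 0:
        return None  # the test rejects every theta: empty interval
    half_width = math.sqrt(slack / n)
    return (ybar - half_width, ybar + half_width)


def widths_by_spread(scales):
    """Interval width (or None) for the base data multiplied by each scale factor."""
    widths = []
    for k in scales:
        interval = inverted_interval([k * v for v in BASE])
        widths.append(None if interval is None else interval[1] - interval[0])
    return widths


scales = [1.0, 2.0, 2.3, 2.35, 2.4]
for k, w in zip(scales, widths_by_spread(scales)):
    print(k, "empty" if w is None else round(w, 3))
```

As the scale factor grows, the data look less and less like draws with variance 1, and the interval narrows until it disappears. The sample mean, and so the information about theta, is the same in every case.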

35 thoughts on “Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests”

  1. Can you define exactly what you mean by a confidence interval? Cox (2006) defines the CI as the range of values for the parameter for which the test would not reject the null at that value. Additionally, I am unclear what the “test” is in your diagram above. Is the test not what we are using to create the confidence interval? Or are you referring to a particular null hypothesis (e.g. mu = 0 for a difference of means test) that is or is not in the confidence band?

    I would be indebted to anyone who could show this property numerically or with a simulation example.

    On permutation tests, there is more opportunity for model building and testing than you admit. The procedure is simple. I’ll use the language of potential outcomes, where every unit has a Yc and Yt, but we only observe one or the other based on Z, the treatment indicator. First, posit a model which relates Yc and Yt for each unit. The simplest is Yt = tau + Yc, a constant effects model, but other models are possible. For example, with a network S (adjacency matrix), we could model spillover between units as Yt = tau + sigma * z^t * S, where tau is the direct effect and sigma is a parameter that reflects the amount of spillover between units. For interesting values of the parameter(s), adjust the data, removing the hypothesized effect. Then run the classic sharp null to generate a p-value for these particular parameters. As you can quickly see, this method fully embraces inverting tests as the confidence interval or region is simply the hypotheses that we did not reject.

    Jake Bowers and I are working on papers and software demonstrating the power of this approach, especially with respect to spillover effects in networks, but there are many possible applications for researchers doing creative modeling.
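The procedure described in this comment can be sketched as follows for the constant-effects model Yt = tau + Yc: for each candidate tau, subtract tau from the treated outcomes, run the sharp-null permutation test on the adjusted data, and keep the values of tau that are not rejected. This is only an illustration under stated assumptions. The outcomes, the grid of tau values, the difference-in-means statistic, and the function names are all hypothetical, and the assignments are enumerated exactly because the example is tiny.

```python
from itertools import combinations


def sharp_null_pvalue(y, z):
    """Two-sided permutation p-value for the sharp null of no effect on any unit.

    Uses the absolute difference in means and enumerates every assignment
    with the same number of treated units as in z.
    """
    n = len(y)
    treated = [i for i in range(n) if z[i] == 1]
    m = len(treated)

    def abs_diff_means(idx):
        chosen = set(idx)
        mean_t = sum(y[i] for i in chosen) / m
        mean_c = sum(y[i] for i in range(n) if i not in chosen) / (n - m)
        return abs(mean_t - mean_c)

    observed = abs_diff_means(treated)
    stats = [abs_diff_means(idx) for idx in combinations(range(n), m)]
    # Small tolerance guards against floating-point ties.
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)


def invert_permutation_test(y, z, tau_grid, alpha=0.05):
    """Return the candidate effects tau that the sharp-null test does not reject."""
    accepted = []
    for tau in tau_grid:
        # Remove the hypothesized constant effect from the treated units.
        adjusted = [yi - tau * zi for yi, zi in zip(y, z)]
        if sharp_null_pvalue(adjusted, z) > alpha:
            accepted.append(tau)
    return accepted


# Hypothetical experiment: four control units followed by four treated units.
y = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
z = [0, 0, 0, 0, 1, 1, 1, 1]
tau_grid = [0.5 * k for k in range(21)]  # 0.0, 0.5, ..., 10.0
print(invert_permutation_test(y, z, tau_grid))
```

The accepted set plays the role of the confidence region. With a richer model (for example the spillover model above) the grid would range over pairs of parameters rather than a single tau.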

  2. It seems your general point would be that the confidence interval should not be conditional on the correctness of the model. I would say that, when the model is close to being rejected, the very precise confidence intervals are perfectly reasonable if you interpret them as conditional: it’s just that the data are also telling you that the condition is unlikely to hold. But is it possible to have confidence intervals that aren’t conditional on the correctness of the model? Does it even make sense to talk about such confidence intervals?

  3. Just to be clear, what kind of test are you talking about? What does it mean for the model to be rejected entirely? Something like p-value(parameter != x) < alpha for all x? Isn't this a case of your test being overloaded (it is testing something besides the parameter, something about the overall model)? If your test is inherently comparative (like a LRT or BF) is that possible?

  4. I will amplify my earlier comment in the light of this from the PS:

    The bad news is that when you’re close to being able to reject the model, the confidence interval is very small, hence implying precise inferences in the very situation where you’d really rather have less confidence!

    I’m not so sure one would always rather have less confidence in that situation. It depends what question you want the confidence interval to answer. The precise confidence interval is telling you that, if the true distribution is from the assumed family, then the parameter value must be very close to the point estimate. This information can be useful in two cases: (1) if you have great prior confidence that the distribution is indeed from the assumed family, then you can conclude that true parameter value is very close to the point estimate, and (2) if the point estimate itself seems implausible, then you can conclude that the actual distribution is not from the assumed family. It’s easy to think of situations where the latter case might apply, where you’re trying to ascertain the implications of a model to decide whether it’s a reasonable model, and the more precise you can be about those implications, the better equipped you are to make that decision.

    • Andy H.:

      Confidence intervals are commonly used to indicate uncertainty in an estimate. For this purpose, I think the pattern in the above graph (a very narrow interval, implying a very high precision, when the model does not fit well) is undesirable.

  5. I guess I’m with Andy H. I never take confidence intervals to be anything but conditional on the model used to specify them. In that case, if there’s only a tiny parameter space consistent with the model, then the answer has to be in there somewhere (again, conditional on the model).

    Suppose you run a regression and state that a 95 percent CI is from 3.5 to 6.9. Surely you wouldn’t be surprised if some other model gives you a confidence interval of 2.1 to 3.6. If now you state, conditional on both models being true, my confidence interval is 3.5 to 3.6, well, what’s wrong with that? It’s your insistence that both models are correct that has shrunk the size of your confidence interval.

    • Jonathan:

      The trouble in your example can be seen by considering three analysts with the same model and slightly different datasets. Analyst 1 has the interval [3.5, 3.6] as in your example above. He proudly publishes his super-precise result, secure in the knowledge that he has a classically valid confidence interval. Analyst 2, with nearly the same data but a slightly better fit to the model, gets the interval [3.0, 4.1]. That’s ok but not so precise. Analyst 1 is getting better results because his model fits worse. Next there’s Analyst 3, whose model fits slightly worse than that of Analyst 2. His interval is empty. So, instead of being able to make a very strong claim, he can say nothing at all about the parameter.

      The problem is that the actual information from the data about the parameter is approximately the same in the 3 settings but the resulting inferences are much different. This problem won’t occur in Bayesian inference, or likelihood inference, or inference based on an estimate and a bootstrap s.e. It only occurs when you try to create a conf interval by inverting a hyp test.
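The three-analyst story can be made concrete with a small sketch. Everything here is an assumption for illustration: the model y_i ~ N(theta, 1) with n = 10, a test that rejects theta when sum((y_i - theta)^2) exceeds 18.307 (the 0.95 quantile of chi-square with 10 degrees of freedom), a flat prior for the Bayesian comparison, and three hypothetical datasets that share the same mean but differ in spread.

```python
import math

CRIT = 18.307  # 0.95 quantile of chi-square with 10 degrees of freedom (assumed test cutoff)
BASE = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical data shape


def inverted_test_width(y):
    """Width of {theta : sum((y_i - theta)^2) <= CRIT}, or None if the set is empty."""
    n = len(y)
    ybar = sum(y) / n
    misfit = sum((v - ybar) ** 2 for v in y)
    if misfit > CRIT:
        return None
    return 2 * math.sqrt((CRIT - misfit) / n)


def flat_prior_posterior_width(y):
    """Width of the central 95% posterior interval for theta.

    With y_i ~ N(theta, 1) and a flat prior, the posterior is N(ybar, 1/n),
    so the width depends only on n, not on how well the model fits.
    """
    return 2 * 1.96 / math.sqrt(len(y))


# Same mean in every dataset; only the spread (and hence the fit) changes.
analysts = {"Analyst 2": 2.0, "Analyst 1": 2.35, "Analyst 3": 2.4}
for name, scale in analysts.items():
    data = [scale * v for v in BASE]
    print(name, inverted_test_width(data), flat_prior_posterior_width(data))
```

In this toy setup the posterior interval coincides with estimate +/- 1.96 s.e., and it is identical for all three analysts, while the inverted-test interval goes from moderate to tiny to empty. The flip side is that the posterior width gives no warning about the misfit, so the fit has to be checked separately.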

      • I need to think about this a bit more, but I’m not sure how this can’t occur in Bayesian analysis as well. Suppose you have priors which peak around beta=-100 and beta=1 and you have ten normally distributed observations with standard error 10 around 100. Won’t you get a very narrow posterior around 0? Much narrower than the standard error of the data…. The fact that the model doesn’t fit well is causing this result… well, that and a mistaken, misbegotten prior.

        • Jonathan:

          Bayes will never give you an empty interval, so you’ll never have that discontinuity where, as the model fit gets worse, your inference gets more and more precise until it suddenly becomes completely imprecise.

          To put it another way, in the theory of hypothesis testing, the empty interval and the interval (-infinity, +infinity) are completely different. But with interval estimation, they are the same: the empty interval and the whole-real-line interval both convey that you can say nothing about the parameter in question.

          • Never say never, even Bayesianly.

            Think of a prior and data model that puts all the probability on positive outcomes.

            And your outcome turns out to be negative.

            So the model must be wrong but the only reason not to get an empty interval would be computational.

  6. …but the intervals you do get from Bayes, or likelihood, or anything parametric are still liable to be terrible summaries of uncertainty if the model is massively wrong, which is what you suggest the underlying problem is.

    More broadly, I don’t think it’s a fair criticism (in general) to say that inverting hypothesis tests means one is mixing up “inference within a model and checking the fit of a model”. That’s because one can invert hypothesis tests without needing any (parametric) model – and when doing so, checking the fit of the model is not a particular concern.

    • Freddy:

      1. If you don’t like parametric models, just replace “parameter” in the above discussion with “scalar function of the infinite-dimensional model.”

      2. Checking the fit can be important for nonparametric models too!

  7. I’m no statistician but am I right in thinking that the main (or only) requirement for a valid CI is that it is generated in such a way that it has the required coverage? (A 95% CI covers the true value for 95% of random datasets, assuming H_0).

    The interval produced by inverting the hypothesis test could also be a CI, right? So long as it has the right coverage. If you’re testing using alpha=0.05 you’ll get no interval 5% of the time (i.e. rejection) and a finite interval 95% of the time. When your model fits really well (e.g. p > 0.5 in GoF test) you’ll get a really large interval since a large volume of parameter space satisfies p > 0.05. The (average) coverage of all these very different sized intervals (including the zero-width ones) may not be different from the expected 95% coverage. (A few years back I exchanged a few emails with Fred James at CERN about this and he thought the test-based intervals had about the right coverage.)

    Does anyone know any more rigorous statements of this idea? Is it really true that inverted-test-based intervals generally have the wrong coverage? If not then maybe we shouldn’t discount them. The question is then whether they are useful intervals. An interval selected to be (-infty,+infty) on 95% of occasions and zero-width on 5% of occasions covers the true value with probability 95%. Right coverage, just not very useful.
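The coverage question can be checked by simulation in a toy case. The sketch below assumes the model is exactly true, with y_i ~ N(theta, 1) and n = 10, and inverts a test that rejects theta when sum((y_i - theta)^2) exceeds 18.307, the 0.95 quantile of chi-square with 10 degrees of freedom. The true value, number of simulations, seed, and function name are arbitrary choices, not anything from the post.

```python
import random

CRIT = 18.307  # 0.95 quantile of chi-square with 10 degrees of freedom
N = 10


def simulate(theta_true=3.0, n_sims=20000, seed=1):
    """Return (coverage, fraction of empty intervals) when the model is correct."""
    rng = random.Random(seed)
    covered = 0
    empty = 0
    for _ in range(n_sims):
        y = [rng.gauss(theta_true, 1.0) for _ in range(N)]
        ybar = sum(y) / N
        # The interval is empty when even the best-fitting theta is rejected.
        if sum((v - ybar) ** 2 for v in y) > CRIT:
            empty += 1
        # theta_true lies in the inverted-test set exactly when it is not rejected.
        if sum((v - theta_true) ** 2 for v in y) <= CRIT:
            covered += 1
    return covered / n_sims, empty / n_sims


coverage, empty_fraction = simulate()
print(coverage, empty_fraction)
```

In this setup coverage is 95% by construction, and the interval is empty somewhat less than 5% of the time, since emptiness requires rejecting the best-fitting theta and not just the true one. So the intervals are valid in the coverage sense; the question raised in the post is whether their widths are a sensible expression of uncertainty.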

    • Simon:

      There are many ways of getting uncertainty intervals. My point is not that inversion of hyp tests is always bad but that I don’t see it as a general underlying principle. In the above example, the question is what the user does with an empty interval. When the interval is nonempty the usual practice is to take it at face value and go with the model. When the interval is empty then the user rejects the model. This melange of interval estimation and testing is not what is described by Neyman-Pearson theory, and the difficulty is that a single procedure is being used to do both. Underlying this is the practical impossibility of using an empty interval estimate to do anything except reject a model.

  8. This is somewhat confusing to me. But I’d really like to hear your views on the reverse problem of interpreting a Bayesian confidence interval as a p-value.

    A simple example:
    * my observations are a set of 1 gazillion coin flips
    * my model assumes an iid process and has a single free parameter m which holds p(heads)
    * I have some prior for m. (e.g uniform in the interval [0;1])
    * I use Bayes to estimate posterior(m)

    I observe that the Bayesian 2.5-97.5% confidence interval (from the posterior) does not span 0.5. And so I declare that I can reject the null-hypothesis that the coin is fair (p<0.05). What do you think about this type of reasoning (in general)?
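For concreteness, here is a minimal sketch of the calculation in this example. With a uniform prior on m and y heads in n iid flips, the posterior is Beta(1 + y, 1 + n - y). The counts (560 heads in 1000 flips), the number of draws, and the seed are hypothetical, and the interval is approximated by sampling with the standard library instead of computing exact Beta quantiles.

```python
import random


def posterior_interval(heads, flips, n_draws=20000, seed=0):
    """Approximate central 95% posterior interval for p(heads) under a uniform prior.

    The posterior is Beta(1 + heads, 1 + flips - heads); quantiles are
    estimated from sorted Monte Carlo draws.
    """
    rng = random.Random(seed)
    draws = sorted(
        rng.betavariate(1 + heads, 1 + flips - heads) for _ in range(n_draws)
    )
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return lower, upper


lower, upper = posterior_interval(heads=560, flips=1000)
print(lower, upper, lower <= 0.5 <= upper)
```

Whether an interval that excludes 0.5 should be reported as a rejection with p<0.05 is exactly the question posed in the comment; the code only produces the interval.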

    • 1. A uniform [0,1] distribution does not make sense for the probability of heads on a coin flip (see here). So you’re already in the realm of convenience models that have serious problems when it comes to interpreting inferences.

      2. I don’t think there’s any need to go around rejecting the null hypothesis that the probability of the coin is 0.5. In your model, you’ve already assigned zero probability to that particular event. It’s fine just to report your posterior distribution for the parameter.

      • Surely one can bias a coin by bending it suitably. For example, if you bend it almost in half with, say, tails inside, it will essentially have to land on its two edges to “count” as a tail; it would be much more likely to land on a portion of the outer (“heads”) surface.

      • 1. Thanks for the pdf. I am always up for sharpening my skills. [However, there is no reason to argue with the prior which was not important for the question. Also as you state a coin can be biased if the “toss”-rules are changed. You might not know these rules, and it might be that the tosser is deliberately trying to bias the result. A uniform prior probably still wouldn’t be a sensible choice, but there are other binary outcome processes where it would. ]

        2. I agree that the posterior distribution gives richer information and I always report that. However, often you want to convince an audience who have never heard of a posterior. In those cases it is helpful to frame the posterior as a confidence interval (of course conditional on your assumptions). It also helps to be able to state that the observed outcome is unlikely to have come from a process with p_heads=0.5. Of course I would prefer to use this kind of framing in a way that is not wrong and does not grate on Bayesian ears. (Probably the most Bayesian way to do this would be to compare to a p_heads=0.5 model using Bayes factors.)

        3. Point taken regarding “…. you’ve already assigned zero probability to that particular event.” – You are probably right that it is a poor example, but please ignore that. I want to find a nice way to frame the posterior. For example I might want to highlight that parameter #i is most likely greater than zero according to the posterior. I think my audience would be most comfortable seeing that as a traditional p-value for a null-hypothesis (H0: model with parameter #i<=0).

      • I think Oobla Bling-blong was asking an interesting question. Sorry but I think it is muddying the waters to start talking about whether a real coin can be biased. As someone else mentioned you could certainly imagine a skilled flipper who could get it to go heads most of the time for example. I don’t understand your point 2. why do you say “you’ve already assigned zero probability to that particular event”? Please could you say how you would test if a coin has 0.5 probability of landing heads?

      • Re: 1, surely with a `gazillion’ coin flips using convenience models should not cause “serious problems”? At least for large values of a gazillion…

        Re: 2, what if we were searching through a bunch of coins for the badly biased ones? (Or some analogous search through scientific ‘coins’). Then a testing statement is needed; the `reject’ statement gets the interpretation that ‘here is a coin that merits further investigation’; non-rejections are interpreted as ‘no suggestion of anything interesting’. Neither statement requires You to believe (with non-zero probability) that the coin is exactly unbiased.

        • Worzel:

          1. If n=gazillion and y is not close to 0 or n then the posterior interval will be so narrow that I don’t even know what the question is. Any statistical method will work fine, and I’d go for the simple and effective p +/- sqrt(p*(1-p)/n).

          2. You can do inferences on the probabilities directly. In this case you’d want a hierarchical model with a distribution for these varying probabilities. The distribution would have hyperparameters that you’d estimate from the data. You can do Bayesian inference which gives, among other things, inferences about each of the individual probabilities. You can use those inferences to make decisions however you’d like. No real need to talk about rejection; you’re just estimating a bunch of probabilities. You can also feed your inferences through a decision analysis in which costs and benefits of “further investigation” are more explicitly specified.
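The simple summary mentioned in point 1, p +/- sqrt(p*(1-p)/n), can be computed as below. The counts are hypothetical stand-ins for a very large number of flips, and the function name is an assumption.

```python
import math


def estimate_and_se(heads, flips):
    """Sample proportion and its standard error sqrt(p*(1-p)/n)."""
    p = heads / flips
    se = math.sqrt(p * (1 - p) / flips)
    return p, se


# Hypothetical data: 520,000 heads in 1,000,000 flips.
p, se = estimate_and_se(520_000, 1_000_000)
print(p, se, (p - se, p + se))
```

With n this large the standard error is tiny compared with any difference from 0.5 that would matter in practice, which is the sense in which any reasonable method gives essentially the same answer.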

  9. Pingback: Should we invert hypothesis tests to create confidence intervals | Jarad Niemi

    • Jared, the crux of the problem is (I think) that the test statistic T is not ancillary for other parameters in the correct expanded model which fits the data. The behavior of T can be quite far from the “null” behavior when projected down onto the used model for theta which are plausible under the correct model. It’s well known that likelihood-based methods (including Bayes) aren’t guaranteed to be robust to misspecification. That said, the credible / LRT interval should be a little better behaved than a generic statistic because it is normalized.

  10. I’ve been playing around with a simple model which does behave like this in the hypothesis test to confidence interval case, but the Bayesian posterior always stays narrow even when the model is a bad fit. It’s a simple model [equation not preserved] when the data are generated from [equation not preserved] with a normally distributed error. I find that for non-zero beta, the hypothesis test to confidence interval case gives no confidence interval for alpha when beta is large enough while the Bayesian posterior for alpha still stays very narrow. So I think the hypothesis test to confidence interval case is actually behaving better here as it is indicating that the model is a bad fit. Whereas if someone blindly used the Bayesian approach they would continue to think they have a tight constraint on alpha.

  11. Andrew, A belated response to your post – I was travelling. Always interesting to read your take on conceptual issues. Had I known you would be commenting on these presentations in Guadalupe, I would have been more careful! Let me offer some comments on your interpretation of my position (my position itself was fairly summarized in the post – somewhere on the iiie website the slides were posted, but my internet connection here in Oslo is a little shaky so I can’t find it). The context for these presentations was the analyses of randomized experiments in development economics. My reading of that literature is that most of these analyses involve using regression methods: there is some outcome Y, a treatment indicator W, and some pretreatment variables X, and then researchers run a regression of Y on W and X and construct a confidence interval for the coefficient on W based on a normal approximation to the sampling distribution.

    Whether economists are particularly conservative in their methodology I don’t know, but they certainly do like their established methods! Fisher randomization tests are not among those, they are very rarely done in analyses by economists. Personally I think they are a useful tool. This view is partly (heavily?!) influenced by conversations with Paul Rosenbaum. On the topic of estimation I am more in agreement with your (Gelman’s) view of using modelling, and preferably fully principled Bayesian modelling, although I do that all too rarely in my own empirical work (most of my work is theoretical, so I am worried if you say you would agree with most of my data analyses – I don’t do that much real data analysis in the end). Back to the randomization tests. Why do I like them? I think they are a good place to start an analysis. If you have a randomized experiment, and you find, using a randomization test based on ranks, that there is little evidence of any effect of the treatment, I would be unlikely to be impressed by any model-based analysis that claimed to find precise non-zero effects of the treatment. It is possible, and the treatment could affect the dispersion and not the location, but in most cases if you don’t find any evidence of any effects based on that single randomization-based test, I think you can stop right there. I see the test not so much as answering whether in the population the effects are all zero (not so interesting), rather as answering the question whether the data are rich enough to make precise inferences about the effects.

    More specifically, I was thinking of settings with randomization at a cluster level. Say we have individual level data, but the treatment was randomized at the community level. In many cases researchers may be interested in the average effect, averaged over the individuals, not over the clusters. If the clusters are of very different sizes, it may be the case that it is impossible to precisely estimate the average effect over the entire population, while, at the same time, you may be able to estimate precisely the average effect over the clusters. In that case I thought it would be useful to complement an analysis of the average effect for the population with a test of the assumption of no effect. This is what my comments on separating the notion of testing for the presence of effects versus estimating the average effect of interest referred to.

    As an aside, more in the spirit of your comments, in the presentation I mentioned that simply testing for the presence of any effects is not very interesting. If you report to a policy maker that you are very confident that a particular program has some effect, but you cannot tell the policy maker whether the effects are typically positive or negative, it is unlikely to be of interest to the policy maker.

    • Guido:

      Thanks for the response. The bit about what can be estimated in a cluster study is interesting, and it does make sense to me that it could make sense to estimate the “wrong” weighted average treatment effect if such estimation is simpler to do.

      Regarding hypothesis testing, I have three points:

      1. On the general issue of hyp testing, I agree with you. A hyp test can be fine, and indeed it can be helpful to know that a certain pattern in data could easily have occurred purely by chance.

      2. My post above was not intended to criticize hyp testing, it was criticizing the idea of testing a family of hyps and using the set of non-rejections as a conf interval. I see the appeal of this approach but, as indicated in the graph above, it doesn’t always work. I am frustrated because in classical statistical theory, this inversion is often portrayed as the fundamental way to get interval estimation. Inverting a family of hyp tests is a way to get interval estimates, and sometimes it works just fine, but I don’t think it’s right for it to be viewed as the way. I don’t see this inversion principle as any more fundamental than other competing principles out there (for example, estimate +/- s.e., bootstrapping, or Bayes). When doing causal inference, the inversion procedure has another problem, which is that it is typically applied to a series of models that assume a constant treatment effect. A zero treatment effect I can imagine. But a treatment that has the effect +5 (for example) on every unit in the population? That I don’t buy.

      3. My specific problem with the so-called Fisher exact test is that it is based on a replication that is almost never relevant (a new table with fixed row and column margins). This is not a criticism of hyp tests in general, just of this particular hyp test. Again, I am particularly disturbed by the view that many people have that the Fisher test (or, even worse, the interval estimates obtained by inverting the Fisher test) are somehow fundamental. There are lots of methods for testing and estimating relations in a table of counts, and the Fisher test is one of them. I don’t think it deserves any privileged status. This is the sort of thing I was talking about when I referred to conservatism. The idea that there is some status-quo procedure that is considered to be the best or most pure, and that you have to jump through hoops to persuade people to do something different.

      • The thing I like about the inversion approach is that assuming the model is a good approximation, then one knows the coverage is as stated. I think for science having correct coverage of confidence intervals is very desirable. It seems much more concrete to me to be able to say that if my model is a good approximation then 95% of my confidence intervals will contain the true values of the parameter. Rather than saying these 95% confidence intervals are a representation of my uncertainty about the parameters of the model. That seems very nebulous to me and in my opinion is not appropriate for science. I like the Bayesian approach but I think one should always then check that the obtained credibility intervals have correct coverage. As far as I am aware, in the limit of large data sets where the likelihood tends to multivariate Gaussian this always happens anyway except in a few pathological cases.

        • Chris:

          In the limit of large data sets etc., everything gives the same answer. When you’re not in the limit and different methods give different answers, then the graph above illustrates a problem with interval estimates obtained by inverting hypothesis tests. Ultimately what is relevant is not your subjective feeling of what is “nebulous” or “appropriate for science” but rather the information conveyed by these inferences and confidence statements.

          • Andrew,

            All I am saying is that in my opinion correct coverage is a valuable property for confidence intervals to have. I am saying I value that information highly as in my opinion it is a very concrete and unambiguous property for the intervals to have. Yes, as you have shown if one uses a bad model then the inversion based confidence intervals give bad results, so I agree maybe there is some robustness weakness with the inverting case. But I think if one can do model checks to make sure that the model is a good description of the underlying process generating the data (I am feeling a bit like bringing coals to Newcastle here), then the inverting case is good in my view as it is going to give the correct coverage.

  12. Lost track of this interesting post (nice paper Guido, I got caught in the Gail/Donner debate years ago and thought it really was a Fisher/Cochran _silent_ debate).

    Peto’s “O-E” method for meta-analysis would be a nice example for Guido – the _implicit_ idea is that if it is reasonable to assume the treatment effect is either strictly positive or negative (i.e. can vary in magnitude but not sign) then any weighted average is OK (has the right sign) for detecting the effect and one should choose the weighted average to maximize the power. I don’t believe it’s relevant that such a weighted average does not represent any population.

    Unfortunately it’s probably never reasonable to assume the treatment effect is either strictly positive or negative.

    As for Fisher test based intervals – perhaps plot the coverage of confidence or credible intervals against the nuisance (control rate) parameter. For credible intervals use a different prior than assumed as Gustafson and Greenland suggest.


  13. Andrew’s point (above) that 5% of the time you can get an empty interval isn’t specific to this form of interval construction; there’s the well-known valid-but-useless interval where one reports an empty set 5% of the time, or the whole real line the other 95%. I think the deeper problem is that 95% confidence on its own isn’t worth anything inferentially, as the valid-but-useless example attests.

    To make inference work, using 95% confidence as a measure of validity, one needs to do something that distinguishes between all the valid 95% intervals out there. The test-inverting method comes from Neyman’s idea of optimizing the probability that the interval does not cover values that are not the truth – Teddy Seidenfeld’s book on the philosophical problems of inference discusses this. But Neyman’s approach is not the only one; some form of minimum length criterion could be used, or optimizing posterior belief, if you like, there are lots of options. If, as seems to be the case here, one wants sane inference when the model doesn’t fit well, then an option that reflects this would be appropriate.

    • I agree of course that coverage isn’t the only desirable property a confidence interval should have. But I also think people who use Bayesian methods should ideally check that their credibility intervals have correct coverage. Although I appreciate in practice this can be very time consuming.
