Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests

I’m reposting this classic from 2011 . . . Peter Bergman pointed me to this discussion from Cyrus of a presentation by Guido Imbens on design of randomized experiments.

Cyrus writes:

The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis–what Imbens referred to as “testing”–along with a (2) Neyman-type point estimate of the sample average treatment effect and confidence interval–what Imbens referred to as “estimation.” . . .

Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I [Cyrus] took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them.

I agree completely. This is something I’ve been saying for a long time–I actually became aware of the problem when working on my Ph.D. thesis, where I tried to fit a model that was proposed in the literature but it did not fit the data. Thus, the confidence interval that you would get by inverting the hypothesis test was empty. You might say that’s fine–the model didn’t fit, so the conf interval was empty. But what would happen if the model just barely fit? Then you’d get a really tiny confidence interval. That can’t be right.

The (stylized) graph above shows what was happening.

Sometimes you can get a reasonable confidence interval by inverting a hypothesis test. For example, the z or t test or, more generally, inference for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny precise confidence interval to no interval at all. To put it another way, as your fit gets gradually worse, the inference from your confidence interval becomes more and more precise and then suddenly, discontinuously has no precision at all. (With an empty interval, you’d say that the model rejects and thus you can say nothing based on the model. You wouldn’t just say your interval is, say, [3.184, 3.184] so that your parameter is known exactly.)

The only thing I didn’t like about the above discussion—it’s not Cyrus’s fault, I think I have to blame it on Guido—is the emphasis on the Fisher-style permutation test. As I’ve written before (for example, see section 3.3 of this article from 2003), I like model checking but I think the so-called Fisher exact test almost never makes sense, as it’s a test of an uninteresting hypothesis of exactly zero effects (or, worse, effects that are nonzero but are identical across all units) under a replication that typically doesn’t correspond to the design of data collection. I’d rather just skip that Fisher and Neyman stuff and go straight to the modeling.

OK, I understand that Guido has to communicate with (methodologically) ultraconservative economists. Still, I’d prefer to see the modeling approach placed in the center, and then he can mention Fisher, Neyman, etc., for the old-school types who feel the need for those connections. I doubt I would disagree with anything Guido would do in a data analysis; it’s perhaps just a question of emphasis.

Here is some more detail:

The idea is that you’re fitting a family of distributions indexed by some parameter theta, and your test is a function T(theta,y) of parameter theta and data y such that, if the model is true, Pr(T(theta,y)=reject|theta) = 0.05 for all theta. The probability here comes from the distribution p(y|theta) in the model.

In addition, the test can be used to reject the entire family of distributions, given data y: if T(theta,y)=reject for all theta, then we can say that the test rejects the model.

This is all classical frequentist statistics.

Now, to get back to the graph above, the confidence interval given data y is defined as the set of values theta for which T(theta,y)!=reject. As noted above, when you can reject the model, the confidence interval is empty. That’s ok since the model doesn’t fit the data anyway. The bad news is that when you’re close to being able to reject the model, the confidence interval is very small, hence implying precise inferences in the very situation where you’d really rather have less confidence!
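Here’s a minimal sketch of that setup (a toy example with made-up numbers, not the actual image-reconstruction problem from my thesis): a single observation y ~ N(theta, 1) with theta constrained to be nonnegative, and the usual two-sided z test inverted to get the interval. As y drifts below the constraint, the interval shrinks toward a single point and then becomes empty.

```python
def inverted_test_interval(y, z=1.96):
    """Invert the level-0.05 test T(theta, y) = reject iff |y - theta| > z,
    for a single observation y ~ N(theta, 1) with theta constrained to be >= 0.
    Returns (lower, upper), or None if every theta >= 0 is rejected."""
    lower = max(0.0, y - z)
    upper = y + z
    return (lower, upper) if upper >= 0.0 else None

# As y moves away from the constrained parameter space, the interval
# first shrinks toward a single point and then disappears entirely.
for y in [0.0, -1.0, -1.9, -1.96, -2.0]:
    print(y, inverted_test_interval(y))
```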

This awkward story doesn’t always happen in classical confidence intervals, but it can happen. That’s why I say that inverting hypothesis tests is not a good general principle for obtaining interval estimates. You’re mixing up two ideas: inference within a model and checking the fit of a model.

P.S. To clarify (see Larry’s comment), perhaps I should replace “confidence interval” in the above title by the more generic phrase, “interval estimate.”

43 thoughts on “Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests”

  1. I think the presentation link should be http://cyrussamii.com/wp-content/uploads/2011/06/Imbens_June_8_paper.pdf (it was updated on Cyrus’s original post).

    I must be especially dense today, so I’m having a little trouble seeing exactly when this is and is not a concern. It seems like this is an issue only when θ is a parameter specific to the family of distributions it indexes (or some larger encompassing family), not when θ is something like the population median or IQR; or am I mistaken?

    • Gray:

      It depends on the model. The trouble is that, in the inversion-of-hypothesis-testing framework, the hyp test is doing double duty, being used to create a conf interval if the model is accepted and being used to reject the model if the interval is empty. It is only under some special conditions that these two jobs won’t interfere with each other.

      • I might be tripping over the word “model,” which I usually think of (as an econometrician) as being distinct from the true data generating process. In the example that the linked post discusses, it seems that the issue is that the null hypothesis for Fisher’s test maintains an additional assumption about constant treatment effects across individuals that we don’t want to impose when constructing a CI. Is that the problem—namely that some test statistics impose additional constraints that don’t make sense when estimating the parameter value—or am I hung up on this particular example and this is a special case of something more general?

        I’m convinced this scenario is bad… now I’m trying to figure out if it’s something I’m likely to screw up :)

        • Gray:

          My problem with Fisher’s exact test is that its reference distribution (that is, where the p-values come from) is conditional on fixed margins in both dimensions of the table, which very rarely corresponds to any actual design.

        • True, but lots of problems have one margin fixed and the argument for “approximate ancillarity” of the second margin has always made good sense to me. And just to give the example I use in which the underlying hypergeometric is in fact correct: you have w women and m men and you fired f employees. The first two parameters are clearly fixed and the third is sometimes absolutely fixed as well, and even if it isn’t, it’s usually almost fixed. Fisher’s exact test is, IMO, exactly the right test to use in this circumstance, possibly stratified for important covariates. Agreed? (That said, I have no patience at all for a 5% reference p-value in this situation, particularly given the discreteness of the test statistic distribution.)
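          For concreteness, here is what that firing example might look like in code (a sketch with made-up counts; scipy’s fisher_exact uses the hypergeometric reference distribution with both margins fixed, which is exactly the point under discussion):

```python
from scipy.stats import fisher_exact

# Hypothetical counts: w = 30 women, m = 70 men, f = 10 employees fired.
# 2x2 table of (fired, kept) by (women, men); both margins treated as fixed.
table = [[6, 24],   # women: fired, kept
         [4, 66]]   # men:   fired, kept

# One-sided test: were women fired at a disproportionately high rate?
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(odds_ratio, p_value)
```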

        • Jonathan,

          I think the so-called Fisher exact test can be a convenient approximation in some settings, but in most examples I would not consider it “exact.” Thus, if it’s easy to do and gives reasonable answers I have no problem with it, but I don’t think it makes sense to stick with it if it’s really hard to compute.

          Regarding your example, yes I agree that that is a rare case where both margins are fixed, if some fixed number of people is picked for firing or promotion.

  2. This again raises for me the issue of why do hypothesis tests at all?

    It seems to me that whatever the outcome of your analysis, some set of actions is implied. What do you DO if the hypothesis is rejected? If it is not rejected?

    If there are ACTIONS, then there should be a loss function. That is, if I reject and the hypothesis is false, what is the loss when I take the indicated action? If I reject and the hypothesis is true, what is the loss? Etc. for accepting.

    So all of this just leads me back to thinking that this whole issue should be framed in terms of decision theory and not in terms of hypothesis testing. (This even goes for confidence/credible intervals, since there’s always a decision to be made, even if it is just as simple as “should I publish this result?”)

  3. > As noted above, when you can reject the model, the confidence interval is empty. That’s ok since the model doesn’t fit the data anyway. The bad news is that when you’re close to being able to reject the model, the confidence interval is very small, hence implying precise inferences in the very situation where you’d really rather have less confidence!

    The link to Imbens’ presentation is out of date and I haven’t read your paper yet so I may well be missing something but, with that as a qualifier, why wouldn’t the approach below yield CIs that aren’t strongly dependent upon goodness-of-fit?

    1. For the sake of simplicity, let’s say I’m comparing two linear models: y = A1 # x1 and y = A2 # x2 where y is a p-element data vector, the Ai matrices are sensitivities and the xi vectors contain the model parameter values; x2 is x1 augmented with one or more parameters. The ML values of xi minimize the sum-squared deviation between y and Ai#xi. (More generally, you could calculate ML model parameter values as those which minimize a cost function which depends on the deviation between y and Ai#xi and the values of xi, e.g., you could implement regularized regression or LASSO.)

    2. Use an F-test (or something analogous) to decide which model better explains the data.

    3. Calculate a goodness-of-fit metric for the best fit model. Decide from first principles considerations or based on the observed distribution of the metric for training data whether you believe the best-fit model accounts for the data.

    4. If the fit residuals are uncorrelated and normally-distributed then cov(xi) = sigma^2 * inverse{ cov(Ai) }. (If the residuals are correlated then there’s an error covariance term to include but conceptually nothing is changed.)

    5. If the residuals aren’t MVN then you’d need to construct a more sophisticated cost function but you can follow the same process for computing cov(xi) as you do in the MVN case and obtain reasonable CIs, no? At a minimum you can calculate the Cramer-Rao lower bound on parameter uncertainties associated with each model.

    6. Compute the values of the ‘extra’ model parameters even if you rejected H1. Examine the distribution of values. The mean should be about zero and the covariance should behave as per 4) so long as the fit is plausible.
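    A rough sketch of steps 1, 2, and 4 above (my own toy data; I’m reading ‘#’ as matrix multiplication and using ordinary least squares for the ML fit under Gaussian errors):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: model 1 has an intercept and one regressor; model 2 adds a second regressor.
n = 100
A2 = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
A1 = A2[:, :2]                                  # model 1 is nested in model 2
y = A2 @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

def fit_ols(A, y):
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)   # step 1: least-squares (ML) fit
    resid = y - A @ x_hat
    rss = resid @ resid
    sigma2 = rss / (len(y) - A.shape[1])
    cov_x = sigma2 * np.linalg.inv(A.T @ A)         # step 4: cov(x) = sigma^2 (A'A)^{-1}
    return x_hat, rss, cov_x

x1, rss1, cov1 = fit_ols(A1, y)
x2, rss2, cov2 = fit_ols(A2, y)

# Step 2: F-test comparing the nested models.
df_extra = A2.shape[1] - A1.shape[1]
df_resid = n - A2.shape[1]
F = ((rss1 - rss2) / df_extra) / (rss2 / df_resid)
print(F, stats.f.sf(F, df_extra, df_resid))          # F statistic and its p-value
```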

    • Chris:

      Procedures exist which will fix just about any particular statistical problem. My point was not that inverting hypothesis tests will necessarily give bad interval estimates but rather that I don’t think it’s a good general principle.

  4. Isn’t this a problem in Bayesian models as well? If you have a model that is soundly contradicted by the data, you can still get a very narrow posterior distribution.

    • Jsalvatier:

      Yes, some Bayesian models can give narrow posterior distributions when they are contradicted by data. The resulting interval estimates won’t be empty, but they can be implausibly narrow.

  5. I would title this post “Why confidence intervals can sometimes be misleading”. We use confidence intervals to divide the parameter space into two groups: (i) the set of parameter values that are inconsistent with our data or model, and (ii) everything that’s not in group (i). We can be misled if we interpret group (i) as the set of rejected parameters and forget the possibility that we may reject because we have a bad model.

    I think the same issue arises in testing when we interpret rejection of a null hypothesis as evidence in favor of a particular alternative hypothesis even though the test has power against multiple alternatives.

    In some cases, this problem can be mitigated by plotting the criterion function rather than only reporting the interval (e.g., plotting T(theta,y) against theta). For example, if the criterion function is very flat, then the data provide little information about theta. Stock and Wright’s 2003 Econometrica paper shows an example of this for GMM estimation with weak identification.

  6. I am not sure I understand your point here. Every test defines a confidence interval and every confidence interval defines a test. Every confidence interval can be viewed as inverting a test. Perhaps what you mean is that you don’t like confidence intervals that correspond to a particular type of test?

    Larry

    • Larry:

      The statement “every confidence interval can be viewed as inverting a test” is correct in a Neyman-Pearson framework, but there are other sorts of confidence intervals out there. In cases such as the example above, I prefer the confidence intervals that would be obtained from a Bayesian posterior distribution, or by bootstrapping a point estimate. For another example, consider the Agresti-Coull confidence interval for a proportion, which has better properties (from the perspective of Agresti, Coull, myself, and many others) than any confidence intervals obtained by inverting hypothesis tests.

      We know that there are many forms of confidence intervals. My point is that many statisticians are trained to believe that inversion of hypothesis tests is the most fundamental definition of a confidence interval. But I do not think that is so, for reasons given above.
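      For reference, here is the standard Agresti-Coull construction mentioned above (a quick sketch; the counts in the example call are made up): add z^2 pseudo-observations, half successes and half failures, then use the usual Wald formula around the adjusted proportion.

```python
import math

def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull interval for a binomial proportion (95% by default)."""
    n_tilde = n + z**2
    p_tilde = (successes + z**2 / 2) / n_tilde
    half_width = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return max(0.0, p_tilde - half_width), min(1.0, p_tilde + half_width)

print(agresti_coull(0, 20))   # a sensible, nonempty interval even with zero successes
```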

      • I still don’t understand. I think I am just missing something. Say you get a confidence interval from a Bayesian posterior. Assuming it has frequentist coverage, then it can still be written as a test being inverted.

        • I’m sorry to be slow about this but I am still confused. Take the Agresti interval. Reject the null if it is not in the interval. This defines a level-alpha test. Let’s call it phi. Now invert phi and you get back the Agresti interval. What am I missing?
          Larry

        • Perhaps he means that he doesn’t like taking a test based on a frequentist notion of the sampling distribution of some parameter, and inverting it to get a confidence interval. Of course you can *define* a test in terms of “calculate a confidence interval by procedure foo and reject the point-hypothesis if the observed value is outside the interval” but this is more like “creating a test from a confidence interval procedure” than “creating a confidence interval procedure from a test”. Perhaps the point is, that if you design a confidence interval procedure it tends to be *designed* to give meaningful confidence intervals, whereas if you invert a test it might be more likely to give a confidence interval that’s a bit wacky as mentioned here.

  7. I’m not sure I see what the problem is either. Isn’t your example – what do you conclude about a really narrow confidence interval that arises because the model just barely fits the data – the result of fixing the level of the test at alpha=0.05 or whatever and not looking at other alphas?

    If instead you looked at a plot of the test rejection frequency (y-axis) vs. the hypothesized value of the parameter (x-axis), I think your dilemma could disappear. A horizontal line at 0.95, say, gives you the 95% confidence interval where it intersects the plot of the rejection frequency. In your example, the confidence interval is very narrow. But you just have to look at the plot to see if something is amiss. If the rejection frequency barely drops below 0.95 and never goes as far down as, say, 0.90 so that a 90% confidence interval is empty, you should be worried. But if the rejection frequency drops all the way down to, say, 0.11, so that you have to go all the way down to a 10% confidence interval to get an empty one, you probably won’t be so worried.
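    One way to read the “rejection frequency” curve is as 1 minus the p-value of the test at each hypothesized parameter value; here is a sketch under that reading, using a one-sample t-test on made-up data:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Made-up data: 25 i.i.d. observations; test H0: mu = mu0 over a grid of mu0.
rng = np.random.default_rng(1)
y = rng.normal(loc=0.3, scale=1.0, size=25)

mu_grid = np.linspace(-1.0, 1.5, 500)
p_values = np.array([stats.ttest_1samp(y, mu0).pvalue for mu0 in mu_grid])

# Where the curve lies below the horizontal line at 0.95, mu0 is inside
# the 95% confidence interval; the crossings mark the interval's endpoints.
plt.plot(mu_grid, 1 - p_values)
plt.axhline(0.95, linestyle="--")
plt.xlabel("hypothesized value mu0")
plt.ylabel("1 - p-value (rejection curve)")
plt.show()
```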

    Of course, you could have an intermediate case and still have a dilemma. But at least you’ll know what is going on.

    I’m also not sure how common empty confidence intervals are in practice. If you construct your confidence interval from a point estimate and an SE – the most common scenario, I’d guess – it will never be empty because it will always have the point estimate in it.

    So the simple answer to your problem is, if an empty confidence interval is a potential problem, just look at a plot of the rejection frequency. Am I missing something here?

    –Mark

    • Mark:

      Sure, but that’s not generally what is done. The procedure is to construct the interval for a particular data set. In which case what can happen is that the model is almost rejected and the confidence interval is really small. The discontinuity is unpleasant: as the model fit becomes gradually worse, the interval becomes increasingly precise, then suddenly the interval is empty and you can’t say anything at all about the parameter.

      • Andrew,

        I must still be missing something – I’m suggesting you estimate your model for a particular dataset, and plot the rejection frequency vs. the range of hypothesized values for the parameter of interest. Here’s an example: the paper by Chernozhukov and Hansen (“The Reduced Form”, 2005 working paper version) available at papers.ssrn.com/sol3/papers.cfm?abstract_id=937943, p. 36 (last page), left-hand panel. Their example doesn’t have the problem of empty confidence intervals (but it’s discussed elsewhere in the paper, as I recall). Wish I could somehow post a graph illustrating the point in the comments here.

        And is it the discontinuity w.r.t. model fit per se that’s unpleasant? What seems unpleasant to me is that for a given dataset and model, you can choose a conventional test level of, say, 10%, and your 90% confidence interval is empty. This is unpleasant whether your 95% confidence interval was narrow or wide. But if to get an empty confidence interval you have to go all way down to a test level of, say, 90%, so that your 10% [sic] confidence interval is empty … well, this doesn’t seem too unpleasant to me. But maybe I’m missing something here too.

        –Mark

        • Mark:

          The problem is that when people see a very narrow interval, they interpret this as there being very precise knowledge about the parameter. But when we see an empty interval, it is natural to interpret this as a sign that the data supply no information about the parameter.

        • Andrew,

          But an empty interval should be interpreted as a sign that the data provide conflicting information, rather than no information, about the parameter, right? It’s when we see a huge interval that the natural interpretation is that the data supply no information about the parameter, i.e., the parameter is unidentified.

          In any case, I think you’re saying that the problem is with common practice and interpretation rather than an inherent problem with the tools at hand, which is fair enough (and I agree!). But I had read your blog post title to imply the problem was with the tools rather than how they are used.

          –Mark

        • Andrew,

          And I guess my point is that you need empty intervals at high significance levels before you need to worry about the very-narrow-interval problem. If the confidence interval is never empty at any significance level (e.g., when you’re doing basic OLS) or becomes empty only at low significance levels (so the rejection frequency never quite touches zero but comes close), then … no problem. In that case the usual interpretation of a narrow confidence interval – the parameter is estimated precisely – should be OK.

          I think I’m with Aaron on this one – I like his suggested replacement title, “Why confidence intervals can sometimes be misleading”.

          –Mark

          NB: I really like the blog, btw – I’m a regular reader and recommend it to colleagues and students. Thanks!

  8. What’s wrong with narrow or empty confidence intervals if they are interpreted as “sets of parameters consistent (in the sense defined by confidence intervals) with the data”?
    Davies (Davies, P. L. (1995), Data features. Statistica Neerlandica 49, 185–245) defines “adequacy regions” by basically combining and computing several confidence intervals, including some that measure the appropriateness of the model rather than measuring parameters against each other given the model. The whole approach is based on not *assuming* any model but rather figuring out which models could have generated the data, which may differ strongly from each other (and may include a rather narrow range of parameters in a certain, not exclusive, parametric model). I think all of this is fine; one just shouldn’t believe that a narrow confidence interval means that “we know very precisely what is going on”.

    • Christian,

      That sounds interesting. In common practice, we take the width of the confidence interval as a sign of how much we know about the parameters, but there is a problem that the confidence interval width is itself a random variable (as well as being model-dependent), and people typically don’t think about that.

  9. Andrew,

    The paper “A classical measure of evidence for general null hypotheses” (http://www.sciencedirect.com/science/article/pii/S0165011413001255) provides a measure of evidence based on likelihood-ratio confidence regions, it also has a close relationship to p-values under specific hypotheses.

    If the confidence region is empty, the proposed evidence will be zero. This provides strong evidence against all theta \in Theta. Therefore, the conclusion should be that the model is not appropriate.

    Best,
    Alexandre.

    • Patriota:

      Exactly, that’s the issue. If you look at the graph above, there’s a point where the model just barely fits, and then the conf interval is very narrow, implying that a lot is known about theta; but actually what’s happening is that the model is very close to being rejected entirely.

      • Andrew,

        I’m trying to figure out a situation where this issue happens.

        For instance, if the model is misspecified then the ML estimator will tend to be on the border of the parameter space (for large n), and as n gets larger the confidence region will approach the empty set.

        But I do not know if this happens in other situations.

        What do you think?

        • Patriota:

          It can happen even if the model is correctly specified, if the parameter space is restricted so that the maximum likelihood estimate has high probability of being on the boundary. I first encountered the issue many years ago with models for image reconstruction; there, the intensities are constrained to be positive in each pixel.

        • Andrew,

          Can you provide a concrete example? If the parameter space is restricted, then you are incorrectly specifying the family of probability measures. This is the very same example I gave in an earlier post.

          The parameter space is part of the model; in fact it plays one of the most important roles in classical statistics, since it basically defines which measures should be considered. If you provide a restricted parameter space, you are evidently giving a restricted model, which may not be appropriate for fitting the data.

        • Patriota:

          The example is image reconstruction from emission tomography. We were working with data from a real experiment. It’s in my Ph.D. thesis, and it’s also described briefly in my 1996 article with Meng and Stern.

        • I don’t know what you mean by “the parameter space is restricted”, even if the model is correctly specified.

          Suppose that Y ~ N(m, s²); then the parameter is theta = (m, s²) and it lies in R x R+.

          If you restrict the parameter space to [0,\infty) x R+ and the mean of Y is negative, then you are not specifying the model correctly. The family of distributions is not correctly specified.

          The statistical model is the triplet (Omega, F, M), where M is the family of probability measures. If M is ill-defined we will have problems, and under parametric models the parameter space is essential to correctly specifying the family M.

  10. Andrew:

    You have re-posted this several times. One suggestion, if I may. Next time use a simulated example with fake data, and perform analysis side by side. Hopefully it will make things clearer.

    I still am not sure what your point is. When inverting the test we are making inferences for the study population only. A narrow interval means only a small subset of the parameter space is consistent with the data (that is, not rejected by the test).

    This can be interpreted as high precision – or high fragility. In the latter sense you want wide confidence intervals, meaning more of the parameter space is consistent with the data. The confidence interval is what it is; the issue is the interpretation. Many of these issues will come to light when checking model robustness.

    If you are willing to add a lot more assumptions and make inferences about a wider population, heterogeneous effects, etc. – despite a small sample say – then the fragility aspect washes out – but only by assumption. Many readers are not aware of these assumptions.

    I think this discussion is comparing apples and oranges, in two different languages. As such I don’t find it helpful.

    • P.S. I’m not disagreeing with you, or saying you are wrong. I take your point. My point is more about the ambiguity in this discussion.

      It strikes me there is a lot being left unsaid, including changes in study goals and objectives, target population, model choice, etc…

      I start from different defaults, and then make my way to yours. Others prefer to come out with statistical guns blazing. For the kind of problems you deal with that may make the most sense.

    • Fernando:

      I am not making an applied point here, it is more of a theoretical point. Students are often taught that the inversion of hypothesis tests is the most fundamental, exact form of interval estimation and that this should be used as a standard of comparison for any other approaches to interval estimation.

      This example demonstrates a difficulty with the approach of inverting hypothesis tests, even in a situation in which the model is true and the level of the test is exact.

      As discussed in response to an earlier commenter, this example came from my experience analyzing data for medical imaging. It would not be difficult to construct a fake-data example following these principles.

      • Agreed. But then you propose to do something else, which changes the budget of assumptions, including continuity of parameters. In which case, why test the null of exactly zero, say?

        As mentioned by one of the commenters, things make more sense in the context of a decision.

        Personally I like the hypothesis test and estimation proposed by Imbens and others. Why? Because I often deal with small samples, so assumptions have more weight relative to the data. (I understand that this may in fact argue for adding more prior information, not less, but the context of a decision often comes from the desire to let the data speak.)

        In large samples I would go Bayesian all the way, especially where some hierarchies are sparse. Do partial pooling. But in my typical setting everything is sparse.

        Perhaps a way to proceed is to do both types of analyses, see if they yield same decision. In small data settings, pre-specifying a prior might enhance the credibility of the study.

        • Here is the discussion as I see it between fictitious A & C:

          A: You should not do procedure A, it has problem X. You should adopt procedure B.

          C: Doing B involves assuming Y. The reason I am doing A is precisely to avoid assuming Y.

          A: Ok but then you can face problem X.

          C: Ok but I’d rather not assume Y.

          A: But Y makes a lot of sense, and opens a world of possibilities.

          C: Yes, but in the decision context I am in, with sparse data, I’d rather do A first. Then test aspects of Y. And then see if I can do B.

          A: But just do B and do model checking.

          C: But isn’t model checking conditional on Y? In my context I’d rather not assume Y.

          A: You should change your context.

          C: I’m working to get more data.

          A: Ok do that and then do B.

          C: It may take several years.

        • Fernando:

          I’m not saying that inversion of hypothesis tests should never be done; I’m saying that I don’t buy it as a fundamental procedure for the construction of uncertainty intervals. There are various different principles by which one can create uncertainty intervals, and I don’t believe any of these principles is uniquely fundamental.

          When a family of hypothesis tests can be easily inverted and it gives reasonable results, great. In other settings, inverting tests can be computationally expensive and can give results that don’t make sense. In those settings, I’d recommend doing something else and considering that something-else on its own terms rather than considering it as an approximation to an imagined ideal.

          Whether to use Bayesian inference is another story. In some small-sample settings it can be helpful to use classical inference and let others do the data combination later. In other small-sample settings there is a lot of prior information and it is foolish to ignore it. In addition, there are non-Bayesian ways of using prior information. Such approaches can be awkward but they can be done (see, for example, my paper with Weakliem, where we use non-Bayesian calculations to demonstrate the hopelessness of that beauty-and-sex-ratio example).
