Confidence intervals, compatibility intervals, uncertainty intervals

“Communicating uncertainty is not just about recognizing its existence; it is also about placing that uncertainty within a larger web of conditional probability statements. . . . No model can include all such factors, thus all forecasts are conditional.” (us, 2020)

A couple years ago Sander Greenland and I published a discussion about renaming confidence intervals.

Confidence intervals

Neither of us likes the classical term, “confidence intervals,” for two reasons. First, the classical definition (a procedure that produces an interval which, under the stated assumptions, includes the true value at least 95% of the time in the long run) is not typically what is of interest when performing statistical inference. Second, the assumptions are wrong: as I put it, “Confidence intervals excluding the true value can result from failures in model assumptions (as we’ve found when assessing U.S. election polls) or from analysts seeking out statistically significant comparisons to report, thus inducing selection bias.”
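
To see what the coverage definition does and does not promise, here is a minimal simulation sketch (Python with numpy/scipy; the numbers and the selection-on-significance mechanism are hypothetical, chosen only to echo the selection-bias point in the quote above). Under the assumed normal model the usual 95% interval covers the true mean about 95% of the time, but if only the statistically significant comparisons get reported, that property falls apart.

```python
# Minimal sketch (hypothetical setup): long-run coverage of the textbook 95%
# interval for a normal mean, with and without a violated assumption
# (here, reporting only the "statistically significant" comparisons).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n, reps = 0.1, 1.0, 25, 20_000
crit = stats.t.ppf(0.975, n - 1)

covered_all = covered_sig = n_sig = 0
for _ in range(reps):
    y = rng.normal(true_mu, sigma, size=n)
    se = y.std(ddof=1) / np.sqrt(n)
    lo, hi = y.mean() - crit * se, y.mean() + crit * se
    covers = lo <= true_mu <= hi
    covered_all += covers
    if lo > 0 or hi < 0:                # interval excludes zero: "significant"
        covered_sig += covers
        n_sig += 1

print("coverage, all intervals:   ", covered_all / reps)   # close to 0.95
print("coverage, significant-only:", covered_sig / n_sig)  # far below 0.95
```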

Uncertainty intervals

I recommended the term “uncertainty intervals,” on the grounds that the way confidence intervals are used in practice is to express uncertainty about an inference. The wider the interval, the more uncertainty.

But Sander doesn’t like the label “uncertainty interval”; as he puts it, “the word ‘uncertainty’ gives the illusion that the interval properly accounts for all important uncertainties . . . misrepresenting uncertainty as if it were a known quantity.”

Compatibility intervals

Sander instead recommends the term "compatibility interval," following the reasoning that the points outside the interval are outside because they are incompatible with the data and model (in a stochastic sense) and the points inside are compatible with our data and assumptions. What Sander says makes sense.

The missing point in both my article and Sander’s is how the different concepts fit together. As with many areas in mathematics, I think what’s going on is that a single object serves multiple functions, and it can be helpful to disentangle these different roles. Regarding interval estimation, this is something that I’ve been mulling over for many years, but it did not become clear to me until I started thinking hard about my discussion with Sander.

Purposes of interval estimation

Here's the key point. Statistical intervals (whether they be confidence intervals or posterior intervals or bootstrap intervals or whatever) serve multiple purposes. One purpose they serve is to express uncertainty in a point estimate; another purpose is to (probabilistically) rule out values outside the interval; yet another purpose is to tell us that values inside the interval are compatible with the data. The first of these goals corresponds to the uncertainty interval; the second and third correspond to the compatibility interval.

In a simple case such as linear regression or a well-behaved asymptotic estimate, all three goals are served by the same interval. In more complicated cases, no interval will serve all these purposes.

I’ll illustrate with a scenario that arose in a problem I worked on a bit over 30 years ago, and discussed here:

Sometimes you can get a reasonable confidence interval by inverting a hypothesis test. For example, the z or t test or, more generally, inference for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny precise confidence interval to no interval at all. To put it another way, as your fit gets gradually worse, the inference from your confidence interval becomes more and more precise and then suddenly, discontinuously has no precision at all. (With an empty interval, you’d say that the model rejects and thus you can say nothing based on the model. You wouldn’t just say your interval is, say, [3.184, 3.184] so that your parameter is known exactly.)

For our discussion here, the relevant point is that, if you believe your error model, this is a fine procedure for creating a compatibility interval—as your data becomes harder and harder to explain from the model, the compatibility interval becomes smaller and smaller, until it eventually becomes empty. That's just fine; it makes sense; it's how compatibility intervals should be.

But as an uncertainty interval, it’s terrible. Your model fits worse and worse, your uncertainty gets smaller and smaller, and then suddenly the interval becomes empty and you have no uncertainty statement at all—you just reject the model.
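
As a concrete version of this scenario, consider a minimal hypothetical stand-in (not the original 30-year-old problem): one measurement y ~ normal(theta, 1) of a mean that the model constrains to be nonnegative, with the interval obtained by inverting the two-sided test at each candidate theta.

```python
# Hypothetical illustration: a 95% interval for a mean theta constrained by the
# model to satisfy theta >= 0, from one measurement y ~ normal(theta, 1),
# obtained by inverting the two-sided z-test at each candidate theta.
def compatibility_interval(y, z=1.96):
    lo, hi = max(0.0, y - z), y + z
    return (lo, hi) if hi >= 0.0 else None   # None: every allowed theta is rejected

for y in [1.0, -1.0, -1.5, -1.9, -2.5]:
    print(f"y = {y:+.1f}  ->  interval: {compatibility_interval(y)}")
# As y drifts negative the interval shrinks toward [0, y + 1.96] and then, once
# y < -1.96, it is empty: sensible as a compatibility statement, useless as a
# statement of uncertainty about theta.
```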

At this point Sander might stand up and say, Hey! That's the point! You can't get an uncertainty interval here so you should just be happy with the compatibility interval. To which I'd reply: Sure, but often the compatibility interval isn't what people want. To which Sander might reply: Yeah, but as statisticians we shouldn't be giving people what they want, we should be giving people what we can legitimately give them. To which I'd reply: in decision problems, I want uncertainty. I know my uncertainty statements aren't perfect, I know they're based on assumptions, but that just pushes me to check my assumptions, etc. Ok, this argument could go on forever, so let me just return to my point that uncertainty and compatibility are two different (although connected) issues.

All intervals are conditional on assumptions

There’s one thing I disagree with in Sander’s article, though, and that’s his statement that “compatibility” is a more modest term than “confidence” or “uncertainty.” My take on this is that all these terms are mathematically valid within their assumptions, and none are in general valid when the assumptions are false. When the assumptions of model and sampling and reporting are false, there’s no reason to expect 95% intervals to contain the true value 95% of the time (hence, no confidence property), there’s no reason to think they will fully capture our uncertainty (hence, “uncertainty interval” is not correct), and no reason to think that the points inside the interval are compatible with the data and that the points outside are not compatible (hence, “compatibility interval” is also wrong).

All of these intervals represent mathematical statements and are conditional on assumptions, no matter how you translate them into words.

And that brings us to the quote from Jessica, Chris, Elliott, and me at the top of this post, from a paper on information, incentives, and goals in election forecasts, an example in which the most important uncertainties arise from nonsampling error.

All intervals are conditional on assumptions (which are sometimes called "guarantees"). Calling your interval an uncertainty interval or a compatibility interval doesn't make that go away, any more than calling your probabilities "subjective" or "objective" absolves you from concerns about calibration.

50 thoughts on "Confidence intervals, compatibility intervals, uncertainty intervals"

  1. There is something confusing me in your description. It seems like dichotomous thinking pervades your discussion: whether the interval is compatible with the model is always a probabilistic assessment; it never should result in a statement that the data is or is not compatible with the model. Regardless of which of the words "compatibility," "uncertainty," or "confidence" is used, the interval should never be used to conclude that the data is or is not consistent (there's a fourth word that could be used) with the model. I'm sure the particular word used will convey different things to different people, and I have no particular insight regarding what term is best to use.

    It is statements such as “yet another purpose is to tell us that values inside the interval are compatible with the data” that are confusing me. Wouldn’t it simplify things if we never used an interval to tell us such things?

    • In my experience, “The data is not consistent with the model” is a loose way of saying that there do not exist parameters such that the relevant test statistic passes the relevant threshold.

      You can think about the 95% confidence interval for a sample mean as the set of null hypotheses (in the class associated with your model) which cannot be rejected (p>0.05), and are therefore, in a probabilistic sense, compatible with your data.

      Since (for a two-sided test) p=1 when the null equals the sample mean this set will never be empty – there is always some population mean compatible with your sample mean. In more complicated models it can be empty (for a given confidence/compatibility level.)
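
A small sketch with made-up data, illustrating the equivalence described in this comment: the set of null means with two-sided p > 0.05 from a one-sample t-test coincides with the usual 95% t-interval.

```python
# Sketch with made-up data: the 95% t-interval equals the set of null means
# mu0 that a two-sided one-sample t-test does not reject at the 5% level.
import numpy as np
from scipy import stats

y = np.array([4.1, 5.3, 3.8, 6.0, 4.9, 5.5, 4.4, 5.1])

# Textbook 95% interval for the mean
ci = stats.t.interval(0.95, len(y) - 1, loc=y.mean(), scale=stats.sem(y))

# Invert the test: scan candidate nulls and keep those with p > 0.05
grid = np.linspace(y.mean() - 3, y.mean() + 3, 6001)
not_rejected = [m for m in grid if stats.ttest_1samp(y, popmean=m).pvalue > 0.05]

print("95% t-interval:    ", ci)
print("non-rejected nulls:", (min(not_rejected), max(not_rejected)))  # agrees up to grid resolution
```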

    • Yes, I suspect that Sander would not argue “that the points inside the interval are compatible with the data and that the points outside are not compatible.” I suspect that because we wrote in [https://www.nature.com/articles/d41586-019-00857-9] that “just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible.” We thus usually say that “values inside the interval are MOST compatible with the data”, given the model, because Confidence Intervals Exclude Nothing [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1646956/pdf/amjph00255-0090.pdf].

    • Dale, Valentin:

      I agree that we shouldn’t think deterministically or dichotomously. On the other hand, that’s how these intervals are often used! So I’m not quite sure what to think about all this.

      • I don't think any particular language will work for dichotomous thinking. If an interval is used that way, no choice of language will prevent misinterpretation, although the more obscure the term, the less likely it is to be easily misused. I agree that intervals are often used dichotomously, but I think devoting energy to finding the term least subject to misuse is wasted energy. Better to attack the dichotomous thinking directly. Perhaps showing the entire distribution rather than an X% interval is a step in the right direction.

        As in the question posed below about cancer diagnosis/treatment. What I want is something that allows me to have a feeling for how precise the estimates are based on the evidence. Telling me that there is a 30% chance that I have cancer is better than telling me I “might” have cancer, but it is not nearly as good as telling me that my chance is between 20% and 40%, and that is not nearly as good as showing me a distribution for my probability, centered on 30% and showing the shape and various quantiles. No doubt this is not what patients are able to digest, nor what clinicians are able or desire to present, but we have to start somewhere. Catering to the lowest common denominator will never get us where we need to go.

        • Dale:

          Language does not solve technical problems. But I think language can still matter.

          Again, consider the distinction between "compatibility" and "uncertainty" in the context of the diagram in the above post:

          – It can make sense for a compatibility interval to get smaller and smaller and then suddenly disappear: this corresponds to the zone of live possibilities getting smaller and smaller until it finally disappears and we have to conclude that there are no parameter values that are compatible with the data and model.

          – In contrast, I don’t think it makes sense for an uncertainty interval to get smaller and smaller and then suddenly disappear. An empty uncertainty interval doesn’t have any meaning!

  2. Thanks. What I draw from this is that no single term will work by itself to convey the potential meaning of an analysis. Which is probably a good lesson.

    But since we’re on c_____ interval terms, what do you think of conformal intervals as in Lei et al. 2018?

    Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

  3. 1). I think you have a typo here: “Sure, but often the uncertainty interval isn’t what people want.”

    2). I feel like the most important feature of the name is that it makes clear that the interval is conditional on the model (compatibility comes closest here). Although, I guess you could also argue we need better stats education rather than placing the burden on the name of the interval.

    3). Something I’ve been wondering: how important does everyone here feel intervals are when it comes to probability statements? For example, there’s a 35% chance your tumor is cancerous vs. the tumor has between a 20%–50% chance of being cancerous?

    • 3) is an interesting question. First, there are a bunch of different ways it might be meant which have different interpretations. It might be the standard error of prediction from a logit model, in which case it is a statement of our uncertainty over the probability given the data and model. Alternatively, we might have a model with some dependent variable which is unknown. Depending on the value of the unknown variable, the point prediction might be anywhere between 20%-50%. Or we could be stating something both about unmeasured variables and about modeling uncertainty. Or, we might be stating the results of entirely different models of the tumor carcinogenicity… All of these statements are consistent with some range of probabilities.

      I can think of a few other interpretations as well. These interpretations have different meanings depending on what you're intending to do with the information. So this is another example, like what is being discussed above, where the interval you choose to use depends on what you really want to know and how you intend to use it.

      • My perspective was as a consumer of the estimate. If someone says they expect the policy to change profits by +5 million vs. by between (-10M, 25M), the uncertainty could play a large role in my decision-making. But if someone says the chance of success is 35% vs. between 20%–50%, I’m not sure if that factors meaningfully into my decision making.

        However, I guess your point does answer that. When it’s provided, I can dig into the sources of uncertainty, and perhaps get a better estimate for my situation.

        • I've actually come across the probability interval question in my own research, which is largely focused on developing predictive models for a binary outcome. My gut impression is that something just feels a bit wrong about presenting a statement like "we believe Pr(Y=1) is between 0.4 and 0.7". I'm not sure how one could use that to offer betting odds or even assess calibration of the interval statement (how does a "true value" of probability fall within it or not?). I think it gets further complicated because the same logic doesn't necessarily apply to proportions or rates: "we believe that the proportion of votes for X candidate is between 0.4 and 0.7" seems perfectly fine. Also, it seems like the uncertainty interval can get applied in different places depending on how the problem is formulated. For example, predicting the probability of a soccer team winning a match. We could directly fit a logistic regression and calculate confidence/credible intervals for the predicted probability, or we could predict point spread and its uncertainty, which would give us a single probability of winning.
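
One common way to form the kind of interval for a predicted probability mentioned in the comment above (the data here are simulated purely for illustration, and the normal-theory approximation is an assumption): put an interval on the linear predictor from a logistic regression and map its endpoints through the inverse logit.

```python
# Sketch with simulated data: an approximate 95% interval for a predicted
# probability from a logistic regression, via a normal-theory interval on the
# linear predictor mapped through the inverse logit.
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.binomial(1, expit(-0.5 + 1.2 * x))   # true model, used only to simulate data

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

x_new = np.array([1.0, 0.8])                  # intercept term and a new value x = 0.8
eta = x_new @ fit.params                      # estimated linear predictor
se = np.sqrt(x_new @ fit.cov_params() @ x_new)
lo, hi = expit(eta - 1.96 * se), expit(eta + 1.96 * se)
print(f"Pr(Y=1 | x=0.8): about {expit(eta):.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```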

  4. “All of these intervals represent mathematical statements and are conditional on assumptions, no matter how you translate them into words.”

    Yes. It seems like a bit of personal preference in terms, and, as AG puts it in this post, what the focus is on. But in the end, everything is conditional on assumptions and conditional on the data (which I guess is another type of assumption). So no matter what term one wants to use, when writing the results in a manuscript, perhaps it is always best to also append "conditional on our data and model" to the writeup. Maybe some standard disclaimer language should always be used? (and maybe this would be ignored). It's sort of like how no statistical method is going to fix science – I don't see any particular term able to fix misunderstanding in this area.

  5. Some of the ideas in here may be helpful, though I understand Andrew’s not a big fan of decision theory used this way.

    If one wants intervals that get wider when the fit is terrible, one could use an estimation loss where all estimates are more equally-bad when the fit is terrible – and when the fit’s okay, reverts to something like a default estimation loss.

    Another approach would be to have a joint interval (region, formally) for the parameter of interest and the quality of the fit, and invite readers to think about how the parameter interval varies importantly (or doesn’t) depending on the plausible values of the fit.

    The tricky part in both approaches is specifying quality of fit, in an appropriate way. But with that in hand the rest is largely automatic, I believe.

  6. Why not take the bull by the horns and call it a resampling interval? The only uncertainty you’re representing or bounding (if you want to be dichotomous) is the part that comes from taking just one sample from a population. You aren’t considering all the other sources. The one complication I can see is that the interval is also conditional on the assumption that the sample accurately reflects population dispersion, but that’s inside the logic of resampling, so to speak.

  7. Seems to me like this problem is about people following scientific procedures rather than what things are named. There’s nothing wrong with the term “confidence interval”.
    It means whatever you want it to mean. There’s a lot wrong with people making claims based on models with invalid assumptions.

    It keeps coming back to that time and time again, but no one wants to put the hammer down on bad science. It's really pretty simple. But imagine thousands of research projects being cancelled and hundreds or thousands of researchers with no results to publish. Oh no, we'd have to admit that the last 50+ years of social science was entirely bogus.

  8. >But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny precise confidence interval to no interval at all.

    What does this mean? Does “reject the model” have a different meaning than “reject H0”? As in, the model is degenerate because of (say) perfect collinearity, so your estimate is undefined?

    • I think it's more like "there is no good estimate". For example, if you try to estimate a strictly positive thing (according to your model) but your measurement is negative. If it's not too negative – compared with the precision of the measurement – some (small) values will still be "compatible" in the sense that a test won't reject them. If it's even more negative all the (positive) values will be rejected. There is no confidence interval – we "reject the model entirely".

  9. > Neither of us likes the classical term, “confidence intervals,” for two reasons. First, the classical definition (a procedure that produces an interval which, under the stated assumptions, includes the true value at least 95% of the time in the long run) is not typically what is of interest when performing statistical inference.

    It’s not completely clear if what (both of) you don’t like is the name of the thing or the thing itself. If we change the term and call that thing a “compatibility interval” it still won’t be what is of interest when performing statistical inference.

    The interpretation of the classical (?) definition has issues but at least it’s well defined. Maybe we shouldn’t believe it but there is a theoretical meaning in, say, a 50% CI and 95% CI computed in some situation.

    When we relabel them as "compatibility intervals" we may say that they contain "highly compatible values". What does that mean precisely for each of those intervals?

    I’m not sure if the proposal is to keep the “classical” interpretation – stressing that it depends on a number of assumptions – or to keep the meaning of “compatibility” as vague as possible.

    (By the way, I don't know if you're aware that you switch randomly between "compatibility" and "compatability".)

  10. “When the assumptions of model and sampling and reporting are false, there’s no reason to expect 95% intervals to contain the true value 95% of the time” – in fact when computing a confidence or compatibility interval, the parameter is defined within the assumed model, and if the model doesn’t hold, it isn’t clear what is even meant by “the true value”.

    • As we have discussed extensively, it can also be quite clear what is meant there by “the true value” if the interval is for a parameter in the model that corresponds to a quantity of interest in the real world.

      • @Carlos: In a confidence/compatibility/whatever-named interval framework, the parameter is defined within the model. Of course you may have a clear idea how you want to interpret it regarding the real situation, but that doesn't give you a formal definition within any other model.

    • Agree but one could say (as above), if a model is wrong but the data in hand is compatible with that wrong model, the statement that the data is compatible with the wrong model is correct. It may not be compatible with the true model, but that is not the compatibility claim being made.

  11. The way you have chosen to define a confidence interval is focused too heavily on its performance rather than on the observed interval. A better definition of a 90% CI would be, "Those hypotheses for which the observed result is within a 90% margin of error." This makes it clear that the observed interval is a set in the parameter space, determined by the observed data, whose definition is based on performance.

    A Bayesian credible interval would be analogously defined as, “Those hypotheses for which the experimenter has 90% degrees of belief,” or “Those hypotheses for which the experimenter has assigned an unfalsifiable 90% credibility value.” It is not a verifiable statement about the actual parameter, the hypothesis, nor an experimental result.

    Here is a great cartoon that makes this point, https://lnkd.in/dsKHKTva

    • Geoff:

      Bayesian inference, like statistical inference in general, relies on models. There is no need to label such models as “beliefs,” any more than we need to refer to a logistic regression model, for example, as a “belief.”

  12. If the validity of the model is in question, sensitivity analyses can be performed while highlighting worst-case and best-case scenarios. Lumping frequentist and Bayesian intervals together as “uncertainty intervals” gives the Bayesian interpretation of probability the appearance of falsifiability.

      • I have seen you make this claim in the past, but I do not think we are using the word falsifiable in the same manner, much the same way Bayesians and frequentists do not use the word probability in the same manner. When a Bayesian assigns a “truthiness” credibility value to a hypothesis before and after an experiment, there is no real-life mechanism to empirically validate that value as having been the “right” number assigned to the hypothesis. It is not a promise of performance in repeated experiments. It certainly isn’t a factual statement about the parameter. It is just a number defined by the experimenter. No matter what number he produces he is always “right.” Only if there exists a real-life mechanism by which we can sample and observe population parameters can a probability distribution of parameters be empirically verified.

        In contrast, given enough time and money we could estimate a population parameter within an arbitrary margin of error and then investigate the coverage rate of a 95% confidence interval using a smaller sample size in repeated experiments. We can then determine if the promised performance is in line with the empirical result. We have the opportunity to falsify the claimed confidence level. This may not be practically possible, but it is possible in principle. This promised long-run performance is something everyone can universally agree upon. Using a Bayesian interpretation of probability, a credibility value is not a promise of anything and not necessarily universally agreed upon. If the prior distribution is chosen in such a way that the posterior is dominated by the likelihood or is proportional to the likelihood, Bayesian belief is more objectively viewed as confidence based on frequency probability of the experiment. The prior does not contain legitimate probability statements, it is simply a user-defined weight function for smoothing the likelihood, and the posterior is a crude approximate p-value function. Only then can the credibility values in the posterior, now viewed as frequency statements concerning the experiment, be empirically verified. This is what I mean by falsifiable.

        • Geoff:

          I recommend you read my paper with Shalizi and also my paper with Hennig on going beyond the terms “subjective” and “objective” in statistics.

          You write, “When a Bayesian assigns a ‘truthiness’ credibility value to a hypothesis before and after an experiment . . .” First, the term “truthiness” is insulting; second, you’re putting it in quotes even though you’re not quoting anyone; third there’s no reason not to use the existing non-insulting word “probability.” And, fourth, in the Bayesian inference I do, I don’t assign probabilities (or credibility values, or “truthiness,” whatever that means) to hypotheses; indeed, that’s a big point of my paper with Shalizi. In short, you’re arguing against a Bayesian who isn’t me.

        • I did not mean to be insulting with the word truthiness. I put it in quotes because I didn’t think it was a real word. I’ve read some of your papers before. I’ll have a look at this one. If the prior and posterior probability is not an assignment, and it does not represent the experimenter’s perception of whether the hypothesis is correct based on what he has experienced, and it is not a long-run sampling proportion, then I’m not sure what else it could be.

        • Geoff:

          You write: “If the prior and posterior probability is not an assignment, and it does not represent the experimenter’s perception of whether the hypothesis is correct based on what he has experienced, and it is not a long-run sampling proportion, then I’m not sure what else it could be.”

          Answer: It’s a mathematical model which can often be useful, especially if we’re willing to check it and improve it as necessary.

        • “The prior does not contain legitimate probability statements,”

          Maybe you meant “legitimate frequency statements”. Or maybe you don’t even see any difference between probability and frequency. That would be a very limiting view of probability. Broadening the perspective is possible and statements like “the probability that this tumor is cancer is X%” or “the probability that Biden completes his term is Y%” are no longer “illegitimate”.

        • > Answer: It’s a mathematical model which can often be useful, especially if we’re willing to check it and improve it as necessary.
          Yes, as is almost all of scientific reasoning. Perhaps reword "mathematical model" as "abstract representation" to make that clearer.

          Statistics predominantly uses probability models as the abstract representations of choice that imply various worlds. But all of reasoning is about the abstract representation and its apparent compatibility with the world we are in.

          The judgement that the implications apply to our world, given the apparent compatibilities, is a fallible but sometimes useful conjecture.

      • I do not agree with the claim in the paper you linked to that the prior is falsifiable via the posterior predictive distribution. Under the Bayesian paradigm, posterior probability is an unfalsifiable number assigned by the experimenter to hypotheses. This means posterior predictive probability is not a factual statement about a future experiment, it is an unfalsifiable prediction credibility value assigned by the experimenter. He is free to assign any number he wishes. If the prior distribution is chosen in such a way that the credible level of a posterior predictive interval matches its long-run performance to cover a future experimental result, Bayesian belief is more objectively viewed as confidence based on frequency probability of the experiment. The prior is a user-defined weight function used to smooth the likelihood. It is not a legitimate probability distribution for the parameter because the parameter was never sampled from the prior. The posterior depicts approximate p-values for hypotheses concerning the parameter, and the posterior predictive distribution depicts predictive p-values for hypothesis concerning future experimental results.

        • I’m not meaning to come across as aggressive, I just wanted to share my thoughts. It’s tough to express sentiment in short text blurbs. I won’t say any more. Thank you for reading.

        • OK, last try:

          You write, “Under the Bayesian paradigm, posterior probability is an unfalsifiable number assigned by the experimenter to hypotheses.”

          That is not the Bayesian paradigm that I follow. Again I point you to my paper with Shalizi where we explicitly say that we do not do this, indeed we discuss in that paper why we don’t like that approach.

          Again, you are arguing with Bayesians other than me.

          Also, you say that the prior “is not a legitimate probability distribution for the parameter because the parameter was never sampled from the prior.” That’s your definition of “legitimate,” and you can use the language however you want. Unfortunately, limiting “legitimate” probability to random sampling will exclude as “illegitimate” almost every application I’ve ever worked on in areas including pharmacology, environmental science, public opinion, etc etc etc.—and that has nothing to do with Bayesian inference: in none of these do we have probability sampling! It’s turtles all the way down.

          The confusions you have have been expressed by many others, and dispelling such confusions is a reason for a lot of the things I’ve written over the years, including my above-linked papers, my 1995 paper with Rubin, my 1996 paper with Meng and Stern, and a few zillion blog posts. Beyond that, I recommend you take a look at my books and applied articles for lots of examples of Bayesian model checking in practice.

          I appreciate you expressing your views here; you are not unique in your perspective, and I blame decades of writing by statisticians for propagating these confusions. One of the useful things about these comment threads is that it can reveal that these misunderstandings continue to exist, and give us the opportunity to clarify things.

        • @Geoff: In my view Bayesian probability calculations and interpretations of probability are to be distinguished. Bayes himself was rather ambiguous on his interpretation of probability. Bayesian statistics is often used together with an epistemic probability interpretation, but this doesn’t have to be the case. Bayesian calculus applies also if probabilities refer to the real underlying data generating process and are as such falsifiable by data.

        • Geoff,

          I actually agree with you. You can't falsify prior distributions; they're an assertion of an assumption. Essentially a Bayesian statement is something like:

          “If you assume that values with high p(parameters) are more likely than those with low values of p(parameters), and the world can be predicted by the function F(inputs), and you have the data given in the dataset D, then in the future you should assume that the values with high p(parameters | D) are the ones that will reliably predict outcomes of experiments”

          This isn’t a falsifiable statement, it’s a logically consistent tautology.

          A falsifiable statement would be something like:

          “the value of parameters = (x,y,z) together with the predictive function F, will accurately predict the outcome of an experiment typically to within error epsilon” which by collecting data you can falsify if the posterior probability density in the vicinity of (x,y,z) goes to zero.

          or

          “there exists a value (a,b,c) of the parameters such that F(inputs) will accurately predict the outcome of an experiment typically to within error epsilon” which you can falsify by showing that the error even when using the maximum probability density parameters after collecting data will typically exceed epsilon

          You can’t falsify a prior distribution because it’s not a statement about the world, it’s a statement about your state of knowledge about the world.

      • Andrew and Geoff,

        It is impossible to prove anything with science due to affirming the consequent. It is also impossible to disprove anything because you can only disprove a conjunction of (Theory + Auxiliary Assumptions).

        Let's rewrite that as (T and A). The negation is !(T and A) = [!T or !A or (!T and !A)]. So all you know is that at least one of your assumptions is wrong, and it is impossible to ensure they are all correct.

        That is the Duhem-Quine thesis. Imre Lakatos also wrote a lot about this. If anyone wants links just ask.

        So the idea that falsifiability is something to strive for is a strawman. Instead what you do is compare the relative compatibility of all known explanations using Bayes' rule. If someone comes up with a clever new explanation it reduces the posterior probability of all the others.

        The idea of testing a single theory/hypothesis/explanation at a time is fundamentally flawed. You should always be comparing at least the top few candidates.

        • > That is the Duhem-Quine thesis. Imre Lakatos also wrote a lot about this. If anyone wants links just ask.

          Also Paul Meehl, of course.

        • If you have a theory and some data, the fact that the data look very different from how the theory says they should look should tell us something important, shouldn't it?

          Of course you're right and the problem may come from auxiliary assumptions, but then it at least means that one of my (maybe not even explicit) assumptions is wrong, and I then need to go on figuring out which one it may be (or whether in fact the theory is wrong).

          I don't object to your philosophical remarks; however, they don't mean at all that we shouldn't check whether the data we observe are in line with a theory that we may believe or suspect to be true, be it plus some assumptions that we're explicitly and implicitly making. If there is disagreement/incompatibility, that's a major source of information for us and an opportunity to learn in any case. I'd be very surprised if Duhem/Quine/Lakatos disagreed with that when it comes to doing science in practice. The philosophy is about whether what we can learn may be somewhat different from what one naively might think, and this is fair enough, but for sure they can't say we should not even look!

        • Anoneuoid: Nicely put.

          Perhaps add on the economy of research to decide whether to pause with the currently most compatible explanation, or further investigate A or revise T of a currently less compatible explanation, or identify a new explanation and assess its compatibility.

          I think the biggest mistake is trying to get closure rather than appropriate pauses in science.

  13. Nice post! I think you clarified a question I’ve always had regarding your idea that this shrinking CI behaviour is bad. I’ve always thought that a CI that compresses to only the parameter values that even *could* have produced the data is exactly what I want a CI to do, but that’s because I’ve always viewed the CI as showing the parameters that are (stochastically?) compatible with the data *assuming some data-generating model*, and so this behaviour is exactly what I would expect; you just need to add a model-checking step as well as the inference step.

    I did have a couple questions though:

    First, given this example, wouldn't a credible interval *also* show this same shrinkage as the range of plausible parameters shrinks (although it might not shrink all the way to zero, depending on the prior)?

    Second: do you see a role here for replacing CIs with p-value functions (i.e. the p-value associated with each parameter, assuming it as the null)? I think they are probably better summaries of compatibility, and in your example, I think they would also reveal the problem, since the maximum p-value would also shrink (i.e. there would be no point in the parameter space that wouldn't be rejected at some level of significance; see the sketch after this thread).

    • Eric:

      1. No, the 95% posterior interval would never be empty. As the data become more discrepant from the model, the 95% posterior interval might get wider or it might get narrower or it might stay the same—it depends on specific aspects of the tails of the likelihood and prior distribution.

      2. I’m not really into tail-area probabilities anymore.
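
To make the p-value-function suggestion from the question above concrete, here is a sketch reusing the same hypothetical nonnegative-mean setup from the earlier code: the maximum attainable p-value over the allowed parameter space shrinks as the data become less compatible with the model as a whole.

```python
# Sketch, same hypothetical setup as earlier: one measurement y ~ normal(theta, 1)
# with the model requiring theta >= 0. The p-value function over theta shows the
# whole-model incompatibility that a single 95% interval hides until it empties.
import numpy as np
from scipy import stats

def p_value_function(y, thetas):
    # two-sided p-value of the z-test for each candidate theta
    return 2 * (1 - stats.norm.cdf(np.abs(y - thetas)))

thetas = np.linspace(0, 4, 401)
for y in [0.5, -1.0, -2.0, -3.0]:
    p = p_value_function(y, thetas)
    print(f"y = {y:+.1f}: max p-value over theta >= 0 is {p.max():.3f}")
# 0.5 -> 1.000, -1.0 -> 0.317, -2.0 -> 0.046, -3.0 -> 0.003: eventually every
# point in the allowed parameter space gets rejected at conventional levels.
```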
