Stupid-ass statisticians don’t know what a goddam confidence interval is

From page 20 in a well-known applied statistics textbook:

The hypothesis of whether a parameter is positive is directly assessed via its confidence interval. If both ends of the 95% confidence interval exceed zero, then we are at least 95% sure (under the assumptions of the model) that the parameter is positive.

Huh? Who says this sort of thing? Only a complete fool. Or, to be charitable, maybe someone who didn’t carefully think through everything he was writing and let some sloppy thinking slip in.

Just to explain in detail, the above quotation has two errors. First, under the usual assumptions of the classical model, you can’t make any probability statement about the parameter value; all you can do is make an average statement about the coverage of the interval. To say anything about the probability the parameter is positive, you’d need also to assume a prior distribution. And this brings us to the second error: if you assume a uniform prior distribution on the parameter, and the 95% confidence interval excludes zero, then under the normal approximation the posterior probability is at least 97.5%, not 95%, that the parameter is positive. So the above statement is wrong conceptually and also wrong on the technical level. Idiots.
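To make the 97.5% figure concrete, here is a minimal sketch. It assumes the setup described above: a flat prior and a normal approximation, so that the posterior is Normal(est, se^2). The names `est`, `se`, and `posterior_prob_positive` are made up for illustration.

```python
from statistics import NormalDist

def posterior_prob_positive(est, se):
    """P(theta > 0 | data), assuming the posterior is Normal(est, se**2)."""
    return 1.0 - NormalDist(mu=est, sigma=se).cdf(0.0)

z = NormalDist().inv_cdf(0.975)  # two-sided 95% multiplier, about 1.96

# Boundary case: the lower end of the 95% interval sits exactly at zero.
se = 1.0
est = z * se
p_boundary = posterior_prob_positive(est, se)

# An interval that clears zero by a wider margin gives a larger probability.
p_clear = posterior_prob_positive(3.0 * se, se)
```

In the boundary case only the lower 2.5% tail of the posterior falls below zero, which is where the 97.5% comes from. Any interval that excludes zero by more than that pushes the probability higher.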

P.S. Elina Numminen sent along the above picture of Maija, who is wistfully imagining the day when textbook writers will get their collective act together and stop spreading misinformation about confidence intervals and p-values.

203 Comments

1. George says:

I am confused that the book you linked to, with the offending quote, is written by Andrew Gelman…but I believe this post also to be written by Andrew Gelman. Is the link incorrect? Or is this a joke that I am missing?

• Corey says:

You did indeed miss the joke. Professor Gelman is issuing an erratum for his own textbook.

• someone says:

George, did you think Andrew is so mean-spirited that he’d be so severe on anyone other than himself?

2. Andrew, do you think it’s “accurate enough” to say that if a 0.95 confidence interval is [a,b] that with 0.95 “confidence” (without exactly defining “confidence”) the data are consistent with a true effect (in the data generating mechanism) of between a and b? I’m looking for the shortest almost honest statement of a frequentist result using CIs. I realize that even with this there is a sleight of hand: frequentists deal with an infinite set of repetitions of data, and don’t fully condition on ‘our data’; the CI is defined by long-term coverage and not by a probability interpretation for any one interval, consistent with what you wrote above. I’m harking back to this definition of a 1-alpha CI: the set of all values that if hypothesized would not be rejected at the alpha level.

• Anoneuoid says:

a 0.95 confidence interval is [a,b] that with 0.95 “confidence” (without exactly defining “confidence”) the data are consistent with a true effect

The greater the “confidence”, the wider the interval (ie, 99% CI is wider than 95%). This means “confidence” is inversely related to the support for whatever research hypothesis people are actually trying to evaluate (ie, a wider interval means the results are consistent with more possible explanations).

It sounds to me like “confidence” is being used in precisely the opposite way that people would expect. Thus this practice will cause much confusion and be used to mislead more than anything else.

• Andrew says:

Frank:

Defining a confidence interval as an inversion of a family of hypothesis tests creates its own problems; see here. The key point is that confidence intervals are used to express uncertainty (indeed, I prefer the term “uncertainty interval”), and this test-inversion procedure doesn’t do that except in some special cases. Yes, these are important special cases, but the point is that inversion of hypothesis tests doesn’t work as a general principle for producing uncertainty intervals.

• Alex says:

> The key point is that confidence intervals are used to express uncertainty

Isn’t that a mixing of two interpretations of probability, though? Or are you defining “confidence interval” outside of the technical definition in the frequentist interpretation of probability?

I would say that a confidence interval only expresses uncertainty inasmuch as it agrees with a Bayesian credible interval, and then you have to say which Bayesian credible interval you mean.

• Donny Williams says:

Frank:
My current view is that there is no perfect definition of confidence intervals, and I think the definition is best considered as a trade-off between strict correctness and conveying the relevant information. In this light, like you, I am currently defining a 1 - alpha CI as those values that would not be rejected at the alpha level. The reasons are many: 1) it is clear that, when zero is not rejected, other values are also not rejected. This allows for avoiding the no effect fallacy; 2) it begins to loosen the pseudo-skepticism of only testing a nil null hypothesis; 3) even when zero is rejected, it also shows that values very close to zero are not rejected when the interval is wide. This allows for conveying uncertainty in inference, which I do not view as the same as uncertainty of the interval.

As Andrew says, there are some problems with this definition. However, I would counter that there are problems with all definitions of CIs and it is not clear that using uncertainty conveys the points I listed above. I think these points are not perfect, but they are the most useful way of interpreting CIs.

Furthermore, for those that do not just use frequentist methods but actually adhere to frequentism, I have not often heard the word uncertainty used. This is because, in the strict sense, the CI is providing pre-data information about average (long-run) expectations of a sampling procedure. This can be just fine, for some, but I personally do not think it has much of a place in science. Of course, that is my personal belief and in the strict sense I think all statistics are speculative. However, when considering the trade-off between pros and cons, I prefer (at least currently) your definition: “the set of all values that if hypothesized would not be rejected at the alpha level.”
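For readers who want to see the test-inversion definition at work, here is a rough sketch for the simplest case only: a point estimate with a known standard error and a two-sided z-test. The numbers and the helper name `not_rejected` are hypothetical, and, as Andrew notes above, the agreement with the usual interval is a property of special cases like this one, not a general principle.

```python
from statistics import NormalDist

def not_rejected(theta0, est, se, alpha=0.05):
    """Two-sided z-test of H0: theta = theta0; True if H0 is NOT rejected."""
    z = (est - theta0) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value >= alpha

est, se, alpha = 1.2, 0.5, 0.05  # hypothetical estimate and standard error

# Scan a grid of hypothesized values and keep those the test does not reject.
grid = [i / 1000.0 for i in range(-2000, 4001)]
kept = [t for t in grid if not_rejected(t, est, se, alpha)]
inverted = (min(kept), max(kept))

# The usual estimate +/- z * se interval, for comparison.
z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
direct = (est - z_crit * se, est + z_crit * se)
```

Up to the grid spacing, `inverted` and `direct` have the same endpoints. Zero is rejected here, but values only slightly above 0.2 are not, which is the third point in the list above.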

• Blissex says:

«in the strict sense I think all statistics are speculative»

From my distant education in subjectivist/information theory based interpretations that is utterly uncontroversial, why is it said here as if it were heresy? :-)
More precisely the impression that I got long ago is that while “statistics” (in the sense of summaries) of a population are not at all speculative, they are arithmetic numbers, all “statistics” that are sample-derived (as estimators for the statistics of the population) are speculative, subject to leaps of faith on ergodicity and flat or non-flat priors.

• Martha (Smith) says:

Frank:

Saying “with 0.95 “confidence” (without exactly defining “confidence”)” sounds like a really bad idea to me.

3. It is barely comprehensible on its own, let alone deciphering which errors entail it. I wonder if we need a far better foundation in measurement theory, because it always seems that we all have gaps in our knowledge of measurement.

4. Dale Lehman says:

The second error should be corrected. Concerning the first (the misinterpretation of the confidence interval), I remain convinced this is an unproductive complaint. The difference between the confidence interval being a property of repeated random sampling and a statement about the one particular (likely non)random sample you happen to have just confuses people unnecessarily. Now you can argue that the difference is, in fact, a fundamental concept involving the very essence of statistical thinking – and there is some merit in that argument. However, the risk is that many people just get confused – and rather than avoiding statistics, just give up and establish very bad habits. In the quite common case where you have a single sample, and the 95% confidence interval ranges from a to b, I fail to see how saying “I’m 95% confident that the true parameter is between a and b” is so damaging. Yes, it is wrong. But exactly what is the damage being done? How would the correct interpretation be used any better? If you want to argue that the uncertainty is better appreciated by the correct interpretation, then I’m all for underlining the importance of uncertainty, but I don’t agree that the correct interpretation is the best way to develop that understanding. And, I hope we can leave Bayesian vs. frequentist debates out of this particular issue – while important, I don’t think they are the key to addressing the interpretation of a confidence interval.

One further elaboration. The word “confidence” is unfortunate and the source of much of the problem. But it also presents the opportunity to address the importance of uncertainty, the scientific method, measurement, sampling, and evaluation of evidence. I prefer to use “confidence” and then explore what that might mean rather than focusing on correctly interpreting the interval. I look forward to being more educated about this, as surely many readers will not agree with me.

• Anoneuoid says:

The difference between the confidence interval being a property of repeated random sampling and a statement about the one particular (likely non)random sample you happen to have just confuses people unnecessarily.

This is only the case because, for many common uses, the confidence interval approximates the credible interval of the same “level” calculated using a uniform prior. If not for this key property, confidence intervals would be rejected as unusable by most people.

There was a thread awhile back where Daniel Lakeland offered an explanation for this phenomenon but I remember not finding it very satisfying. I can’t find the thread at the moment, but still think there is something very important here that remains to be elucidated…

• We’ve been through this so many times I don’t know which version you’re referring to either ;-)

One thing I will say is that the “confidence” is in the *procedure*, any given realized interval we could have absolutely no confidence in… For example there are confidence procedures which give correct coverage, and yet for parameters that MUST logically be positive, they can yield for example a given interval which lies entirely to the left of zero…. No one would have any confidence in that interval, but there can be a convincing mathematical proof that the procedure only fails to contain the correct value 5% of the time (when the assumptions are met).
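A minimal simulation of that situation, under assumptions chosen purely for illustration: the true mean `mu` is small but positive, the noise is normal with a known `sigma`, and the procedure is the standard mean plus or minus 1.96 standard errors. None of these numbers come from a real problem.

```python
import random
from statistics import NormalDist

def simulate(mu=0.1, sigma=1.0, n=4, reps=20000, seed=1):
    """Return (coverage, share of intervals lying entirely below zero)."""
    rng = random.Random(seed)
    half = NormalDist().inv_cdf(0.975) * sigma / n ** 0.5
    covered = 0
    below_zero = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        lo, hi = xbar - half, xbar + half
        covered += lo <= mu <= hi   # did this interval contain the true mean?
        below_zero += hi < 0.0      # is the whole interval negative?
    return covered / reps, below_zero / reps

coverage, frac_below_zero = simulate()
```

The procedure covers the true value in roughly 95% of repetitions, yet a small share of the realized intervals sit entirely to the left of zero even though the parameter is positive by construction.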

• Martha (Smith) says:

Possibly a decent “short” explanation of the rough idea of “confidence interval” is that the percent confidence is in the long-term (over many applications) performance of the procedure.

• Elin says:

The idea of the procedure containing the “true” value a certain percentage of times is one that I can get across to students using statkey or some other visualization, including trying it on their own many times and graphing the results. So that tends to be what I focus on.

• Steve says:

It is damaging to say that “I’m 95% confident that the true parameter is between a and b” because such a statement is false. Ordinary people, including medical doctors, regulators, patients, etc. — that is people who have to make real world decisions can only possibly interpret such a statement to mean there is a 95% chance that we know the true parameter, when in fact we don’t know anything from such an isolated result. Rather than undertaking the difficult task of amassing loads of evidence for a well formed theory, consumers of science are given isolated studies and statistics that make it sound like the theory has been confirmed so easily. It also undermines the public’s confidence in science when various observational studies fail to replicate or are contradicted by other observational studies. If the public keeps getting told that a “we are 95% percent confident that hormone replacement therapy improved health,” and the next year medical science discovers that HRT increases mortality, the public is rightfully going to be confused. Scientists need to convey the substantial amount of uncertainty that exists in their work if they are ever going to have credibility with the public when they need, in the rare but important areas, to tell the public that they have a substantial amount of certainty about a claim.

• Jonathan (another one) says:

I think the problem in the first statement does come from the use of the term “confidence” outside of its ordinary English language meaning. I know a lot of people don’t like translation into wagering space, but 95% confidence to me is the same as the willingness to accept either side of a 20-1 bet. (Put risk aversion issues aside.) Put that way, no one who understands the repeated sample perspective is likely to express 95% confidence in every experiment whose outcome has a confidence interval which just touches 0.

• Corey Yanofsky says:

I think you mean 19-1 ;-). Also, confidence procedures can be consistent with the betting operationalization you’re discussing (https://arxiv.org/abs/0907.0139). You need a diachronic Dutch book to force full Bayes.

• Jonathan (another one) says:

Depends on whether “-” means “for” or “to.” Thanks for the reference.
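The “for” versus “to” distinction is just arithmetic, sketched here with exact fractions. The function names are invented for the example; `p` is the probability of the favored outcome.

```python
from fractions import Fraction

def odds_to_one(p):
    """Fair profit per unit staked on the long shot ('x to 1')."""
    return p / (1 - p)

def odds_for_one(p):
    """Fair total return per unit staked, stake included ('x for 1')."""
    return 1 / (1 - p)

p = Fraction(95, 100)
to_one = odds_to_one(p)
for_one = odds_for_one(p)

# Expected profit of a unit stake on the long shot at the fair 'to' odds:
# win to_one with probability 1 - p, lose the stake with probability p.
expected_profit = (1 - p) * to_one - p
```

At 95% this gives 19 to 1 and 20 for 1, and the expected profit at those odds is zero, so both commenters are describing the same fair bet in different conventions.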

• Yah! the term ‘confidence’ is open to many interpretations too.

• Ben Goodrich says:

Dale,

My current favorite example of the potential damage of confidence intervals is from the Summary for Policymakers of the Intergovernmental Panel on Climate Change.

http://www.ipcc.ch/pdf/assessment-report/ar5/syr/AR5_SYR_FINAL_SPM.pdf

To take one example, the first real paragraph says:

“The period from 1983 to 2012 was _likely_ the warmest 30-year period of the last 1400 years in the Northern Hemisphere, where such assessment is possible (_medium confidence_). The globally averaged combined land and ocean surface temperature data as calculated by a linear trend show a warming of 0.85 [0.65 to 1.06] °C {^2} over the period 1880 to 2012, when multiple independently produced datasets exist.”

The second footnote actually gets the definition of a confidence interval correct, albeit in a way that only a well-trained statistician would understand:

{^2}: Ranges in square brackets or following ‘±’ are expected to have a 90% likelihood of including the value that is being estimated

So, basically they are correctly saying to statisticians “The pre-data expectation of the indicator function as to whether the estimated confidence interval includes the true average temperature change is 0.9” and incorrectly saying to everyone else “there is a 0.9 probability that the average temperature rose between 0.65 and 1.06 degrees Celsius between 1880 and 2012”. I am hesitant to say it is okay for policymakers to adopt the latter misinterpretation because they would misinterpret the former interpretation.

I think your main point comes down to what the alternative is. If confidence intervals are inevitable, then I guess it would be less damaging for people to interpret them incorrectly than correctly. But if confidence intervals can be replaced by Bayesian intervals and interpreted correctly, I think that would be preferable.

The report gets more convoluted in its attempt to simultaneously be useful to policymakers and not wrong statistically. The first footnote says

{^1}: Each finding is grounded in an evaluation of underlying evidence and agreement. In many cases, a synthesis of evidence and agreement supports an assignment of confidence. The summary terms for evidence are: limited, medium or robust. For agreement, they are low, medium or high. A level of confidence is expressed using five qualifiers: very low, low, medium, high and very high, and typeset in italics, e.g., _medium confidence_. The following terms have been used to indicate the assessed likelihood of an outcome or a result: virtually certain 99-100% probability, very likely 90-100%, likely 66-100%, about as likely as not 33-66%, unlikely 0-33%, very unlikely 0-10%, exceptionally unlikely 0-1%. Additional terms (extremely likely 95-100%, more likely than not >50-100%, more unlikely than likely 0-<50%, extremely unlikely 0-5%) may also be used when appropriate. Assessed likelihood is typeset in italics, e.g., _very likely_.

So, they are using the word "confidence" not in the technical sense of a "confidence interval" but to describe the degree of agreement among the scientists who wrote the report on the basis of the (presumably frequentist) studies they reviewed. And then they have a seemingly Bayesian interpretation of the numerical probability of events being true but without actually using Bayesian machinery to produce posterior expectations. If they had asked me (which they didn't), I would have said to just do Bayesian calculations and justify the priors and whatnot in the footnotes instead of writing this mess.

• Dale Lehman says:

Thanks for these insights. The more I read people’s comments, the more convinced I become that the wrong interpretation is not so bad. What I see is (i) some who see this as the ultimate reason to say we must all be Bayesian and finally reject frequentist statistics, (ii) those that are satisfied with showing how much smarter they are than everyone else, (iii) those that are true scientists (which I respect) but would have everyone withhold drawing any tentative conclusions until what I can’t tell, and (iv) those who are legitimately concerned with how dangerous the misinterpretations of the evidence can be. To the latter group, I would respond that I’d rather see people understand that the evidence is indeterminate and misuse the confidence interval as an indication of this uncertainty rather than believe we can educate people to use the correct interpretation and either adopt Bayesian approaches or wait until more research is done before drawing any conclusions. Yes, I think the real battle is to teach people to accept conclusions as tentative and that re-evaluation with all new evidence is required. But I don’t think we are getting closer by insisting that the natural – and wrong – interpretation of the confidence interval is bad and stupid. I don’t find the betting metaphor intuitive. I’ve been teaching statistics to non-statisticians long enough to have a feel for what the general (but educated) public might find useful and I don’t find this confidence interval bashing to be productive. I realize this is not a popular view and that it is not “correct,” but reading through these comments (every time the issue comes up) does not convince me to change my mind.

• The advantage of the betting approach (for me anyway) is that it works for realized CIs (in standard but not all settings), i.e., after observing the data and calculating the CI. We had a long-ish and interesting-ish discussion about “bet-proof” CIs on this blog back in March-ish.

• Allan C says:

Dale: “I would respond that I’d rather see people understand that the evidence is indeterminate….Yes, I think the real battle is to teach people to accept conclusions as tentative and that re-evaluation with all new evidence is required.”

But even tentative conclusions based on less than perfect evidence sometimes need a probabilistic assignment in terms of how much we believe them. Balance of probabilities, beyond a reasonable doubt, and all that.

Since at least some very important statements (common to most) must be given an interpretation of belief, I would think it hard to divorce other common statistical concepts from being interpreted like that! So in that sense I agree with you; people are probably going to continue to apply the degree of belief interpretation to some avail (even when it’s not warranted) and I am not so sure that’s a bad thing…I think the larger issue is that some people (most?) have a propensity to take 95% to be approximately close enough to 100% to treat it as such.

BTW: that bet-proof conversation was very peculiar. I know Kadane framed the concepts in Principles of Uncertainty in that way…read both the thread on this blog and his book and I did not find either especially helpful.

• Anoneuoid says:

the wrong interpretation is not so bad

It isn’t bad at all as long as you know what you are doing. Using a confidence interval with the (technically wrong) credible interval interpretation is no different than using newtonian mechanics instead of general relativity for many common problems. For these cases it makes no practical difference which calculations you use (other than efficiency*), you will get the same answer (w/ negligible differences).

The main difference between the two scenarios is whether the user realizes they are doing this or not, which is totally an educational issue. Why isn’t this explained to the end-user?

They should know they are using an approximation to what they want that may break down under certain circumstances (eg when you need to use a more informative prior to avoid nonsense results). Instead, when looking it up they find pages and pages about the meaning of probability and philosophy and principles and nitpicking jargon, etc.

*At its heart, a CI is just the mean +/- some multiple of standard errors. This is really easy to calculate.
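As a rough illustration of how little computation is involved, here is a sketch using only the standard library. The data are made up, and the normal multiplier is an assumption; with a sample this small a t multiplier would usually be used instead.

```python
from statistics import NormalDist, mean, stdev

def mean_ci(data, level=0.95):
    """Mean +/- z standard errors, using a normal multiplier (an assumption)."""
    n = len(data)
    m = mean(data)
    se = stdev(data) / n ** 0.5          # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return m - z * se, m + z * se

sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.2]  # made-up numbers
lo, hi = mean_ci(sample)
```

The whole thing is a mean, a standard deviation, and one quantile lookup, which is why it is so tempting to read the result as if it were a credible interval.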

• Elin says:

So I teach intro stats to undergraduates in a discipline. I think that in terms of what people who won’t learn much more statistics than that course need to know five years after taking my course a lot of this discussion is kind of misguided.

I think most people who use statistics in their daily lives (say people who run a small business or people wanting to decide how many sandwiches to order for a meeting) should learn that interval estimation is preferable to point estimation and that rather than basing their plans on a single number, they should think about a range of numbers. And, then, how wide that range should be is a discussion based on how bad things are if you over- or underestimate the number and that “bad” includes both having to dispose of uneaten sandwiches and not having enough food so that people feel satisfied. I don’t think they are ever doing anything close to formal hypothesis testing in the rest of their lives except the tiny percent who might become researchers. Lots of them will be reading statistics though, whether they are classroom teachers or office managers, and I want them to be able to think about them competently and critically.

• Dale Lehman says:

Very well put (this is what I was trying to express, but you did it nicely). I would go one step further – I don’t think this logic is restricted to those that only will take one statistics course. I think it is the right pedagogical approach for the first course people take. Those that are serious about research or who take further courses, can and should be exposed to the limits and caveats that apply to such intervals. Indeed, even in the first course, I would discuss what the true definition of a confidence interval is – but I think it is counterproductive to try to convey that the interval is meaningless because it really doesn’t mean what people think it does. If you have one sample and want to say “something” then I think the wrong interpretation is just fine – and I would caution people about how limited it is. But trying to keep them from that interpretation makes statistical thinking appear worthless and unattainable for the masses (and even large subsets of the masses).

• Martha (Smith) says:

Dale said: ” Indeed, even in the first course, I would discuss what the true definition of a confidence interval is – but I think it is counterproductive to try to convey that the interval is meaningless because it really doesn’t mean what people think it does. “

I agree. And would like to add that often people who might understand the definition need to use confidence intervals in explaining results of a study to more lay people –e.g., someone trying to explain results of an analysis to a school board. I am not sure what the best way to do this is (but I think “bet-proof”! is *not* a good way), but one possibility I keep coming back to is something like,

“This gives a range of plausible values that tries to take into account that we have used a sample rather than complete data”

• Also agree with Elin here, esp. “most people who use statistics in their daily lives … should learn that interval estimation is preferable to point estimation”.

This is kinda-sorta why I was interested in the “bet-proof” interpretation. Instead of saying to students “you can’t interpret realized CIs the way you’d like, mistake, don’t do that” – which is a difficult message to get across when they’re also being taught how to calculate them – we can say “there’s an interpretation for realized CIs, it’s called bet-proofness, here’s how it works”.

• This bet-proof interpretation was not good, after trying hard to figure it out, and revising several times, here’s the summary I came up with:

http://models.street-artists.org/2017/03/08/bet-proofness-as-a-property-of-confidence-interval-construction-procedures-not-realized-intervals/

Note several things:

1) Not all confidence interval construction procedures are bet-proof, so you can’t just teach the interpretation, you have to also teach how to recognize a bet-proof procedure.

2) Those CI procedures that are bet proof must consistently produce supersets of some bayesian credible interval.

3) The “bet proofness” isn’t really about real bets surrounding real problems, it’s about being able to find an artificial problem which a given Bayesian procedure would have lost the bet over. So in my example: If I constructed a real prior about heights of people in my neighborhood, and then we went out and sampled and it turned out several people in my neighborhood were a million feet tall, then based on my prior putting almost all the probability mass between 3 and 7 ft, I would have lost the bet… We could adjust my Bayesian procedure into a bet proof confidence interval procedure so that just in case there are some million feet tall people in my neighborhood…. well you get the idea it’s not an improvement

With a Bet Proof confidence interval you can’t find a Bayesian procedure that wins on average in repeated applications no matter what the parameter being estimated is…. Yet this is irrelevant, in any given situation there is a particular parameter, and if the bayesian has a pretty reasonable idea what it is (like even just all people ever have been less than 10 feet tall and most people are less than 7), the Bayesian will eat the CI procedure’s lunch. So “bet proofness” sounds like “no bayesian can make money off me” and really means “there isn’t a fixed Bayesian prior that will still make money off the CI even when it really could be true that the parameter is far outside the Bayesian’s prior”

• (Commenting down here because I think the blog software gets grumpy if the nesting is too deep.)

Daniel:

#1 is easy. It works for “standard problems”, which is almost all of what we teach, at that level anyway. Point estimates with SEs are “standard” in this sense.

#2 is fine. There is indeed a close relationship between “bet-proof” frequentist CIs and Bayesian intervals. (I had a brief exchange with the authors of the paper after the blog discussion, and that’s how they think about it too.)

As for #3 and more generally … I can’t make up my mind. My motivation here came out of teaching, and based in part on what you and others think, I’m not as optimistic as I was. Maybe instead of “here’s what the interpretation of a realized CI is”, we can say to students “there’s an interpretation, it’s closely related to Bayesian approaches, but let’s not go there, too advanced and tricky, you don’t need it”. Not ideal. But still better than just being negative and saying “mistake, don’t try to interpret realized CIs”, imho anyway. (And definitely better than saying “there is no legitimate interpretation”, which is wrong.)

• I agree it’s a hard thing to do to teach about CIs since they’re pretty odd. the bet-proof stuff, because it’s always a superset of a bayesian interval, basically works when you do bayes with a flat prior.

I kind of like an algorithmic interpretation: a confidence procedure is a computational procedure that takes a random number generator, draws a certain number of draws from that generator, and constructs a range of values. If the generator is the appropriate kind that the CI expects, 95% of the time that it draws from the generator and constructs a range, a special value associated with the generator will be in the range.

The interpretation of a realized CI is just “this is the output of a CI procedure as described above” and then “a reader” below shows us that conditional on *only the range* (and not any info about the individual data points or the real science behind it all) the Bayesian probability associated with the special value being in the range is 95%

Note that we’re not conditioning on what actually happened (the individual data points) only what machinery was used to construct the interval. As far as a CI is concerned, it could be like a computer procedure that calls the RNG and gets several samples, and never prints them out, just uses them internally to produce the interval.

• I should say conditional on *only the range and the fact that it’s a CI procedure applied to an RNG meeting the assumptions of the CI procedure*
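The algorithmic reading above translates almost directly into code. This is only a sketch under stated assumptions: the generator is normal with a known `sigma`, the “special value” is its mean, and all names and numbers are hypothetical.

```python
import random
from statistics import NormalDist

def confidence_procedure(draw, n, sigma, level=0.95):
    """Call the generator n times, keep the draws private, return only a range."""
    xs = [draw() for _ in range(n)]
    center = sum(xs) / n
    half = NormalDist().inv_cdf(0.5 + level / 2.0) * sigma / n ** 0.5
    return center - half, center + half

rng = random.Random(0)
special_value = 3.0  # the generator's mean (hypothetical)
sigma = 2.0          # assumed known to the procedure

def draw():
    return rng.gauss(special_value, sigma)

# Run the procedure many times and count how often the range it returns
# contains the special value.
reps = 20000
hits = 0
for _ in range(reps):
    lo, hi = confidence_procedure(draw, n=10, sigma=sigma)
    hits += lo <= special_value <= hi
hit_rate = hits / reps
```

The caller never sees the individual draws, only the range, which matches the point about conditioning on the machinery and not on what actually happened. The 95% is a property of the loop, not of any single `(lo, hi)` pair.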

• Blissex says:

The way I remember being taught stats by amazingly good (I still think) statisticians of a non-american school (a deeply subjectivist school) is similar:

* Stochastic “numbers” have a different algebra from “arithmetic” numbers.
* Stochastic “numbers” arise from a sampling process.
* The sampling process must have an ergodic source; it can be biased, but the source must be ergodic.
* The “distribution” of the source is arithmetic, that of the samples is “stochastic”. Because of ergodicity they are related.
* Because they are related we can with great caution, trepidation and skepticism infer from properties of the stream of samples to properties of the population which is the source. Never confuse the properties of the sample with the properties of the population though.
* If the gods of statistics love us they give us a good signal to noise ratio. The signal to noise ratio is what we worry about.
* It is legitimate to use subjective priors as to supposed-known biases in the sampling process or supposed-known properties of the source. This adds signal to the signal to noise ratio, if the priors are right. The priors are a bet, just as the assumption of ergodicity, the samples are a fact.

For so-called “confidence intervals” it follows that as our blogger says it is very important to point out that they are “stochastic” and the 95% is an estimated property of the population of intervals, not of an interval.

The aspect of the textbook formulation that I find most unhappy is that “we are at least 95% sure”, because “95% sure” is not a good way to say “we hopefully expect that 95% of the intervals …”.

Does this make sense to anyone? Focusing on the eternal distinction between stream of samples and population, and what we can say about inferences from one to another?

5. Alasdair says:

Thanks for sharing. A nice example of how easy it is to mess up this stuff, even when you (think you) have a pretty good grasp of it. I know I’ve made some big clunking mistakes in my teaching (and no doubt in my research) when I really really shouldn’t. But, it’s easily done :)

• Martha (Smith) says:

+1

Not to mention those really zonked out days when I could easily find myself saying, “Man bites dog” when I mean “dog bites man”.

6. Bob says:

Andrew wrote:
the posterior probability is at least 97.5%, not 95%

I’m confused. I always thought that if a ladder were 9.75 feet long, then it would satisfy the criterion of being “at least 9.5 feet long.” It seems that the identification of the second error is sort of an error itself.

Bob
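For readers wondering where the 97.5% comes from, here is a small sketch. It assumes a flat prior and the normal approximation, so the posterior is Normal(estimate, std_error^2); the function and argument names are hypothetical.

```python
from statistics import NormalDist

def posterior_prob_positive(estimate, std_error):
    """P(theta > 0 | data) when the posterior is Normal(estimate, std_error**2)."""
    return 1.0 - NormalDist(mu=estimate, sigma=std_error).cdf(0.0)

# Boundary case: the lower end of the 95% interval sits exactly at zero,
# i.e. estimate = z * std_error with z roughly 1.96.
z = NormalDist().inv_cdf(0.975)
print(posterior_prob_positive(z * 1.0, 1.0))

# Any interval lying further above zero gives a larger probability.
print(posterior_prob_positive(3.0, 1.0))
```

Under these assumptions, 97.5% is the smallest posterior probability consistent with an interval that excludes zero from below. "At least 95%" is therefore true but looser than it needs to be, which is the point of the ladder remark.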

7. Jonathan says:

The cat analogy really fits. Billy (a cat) demanding attention in the bathroom, reached up and pulled at what he sensed or perhaps could even see slightly was just over the lip of the counter. He spilled coffee on himself. Now there was indeed something there so the confidence interval was correctly drawn in that sense but it didn’t exactly say what he needed. Uninformative prior in one sense, informative in another. Sort of like ‘I see a light in the distance’ and it turns out to be the train coming at you: you can be right and be wrong at the same time. I see lots of that. A main reason is in your self-corrected example: it’s hard to parse what the bleep it says because you’re talking about ‘positive’ in the numerical, graphical sense that positive is to the right and up, while speaking about intervals mapped to that space, but the approach is backwards – as is so much of statistics (thanks so much, Fisher) – so you start by talking about validity of a parameter instead of starting with the space. When you start with the parameter, you postulate causal links that minimize whatever else could be happening in that space. By you, I don’t mean ‘you’ but the general you. That does mirror the way people think – the Butler did it cliché or suspect the husband when the wife goes missing – but no one can convince me that people think ‘well’ and that thinking methods can’t be improved. Your examples consistently show how confused people are in their thinking; they regularly, inappropriately and willfully draw causal links out of ‘spaces’ which map the results of complex operations. We all do it, but some people seem to do nothing but that, and that group includes a ton of researchers, academics and scientists. Sort of like cats that spill coffee on themselves. He was right there was something there to pull at. So the parameter was valid. But also unfortunate.

8. Carlos Ungil says:

You’re too tough on the authors. They didn’t make any probability statements, at least in the fragment you cite. They talk about being at least 95% sure about something, but there is no explanation of what “being sure” means… it’s now that you are jumping onto a probabilistic interpretation (which of course is wrong).

• Blissex says:

«at least 95% sure about something, but there is no explanation of what “being sure” means»

But for me that’s exactly the problem: “being sure” is handwaving. Our blogger later writes appositely:

The above passage was from a textbook; in a textbook I want to get it right. Being sloppy in conversation is one thing; being sloppy in a textbook with tens of thousands of readers can create misunderstanding.

The meta-difficulty with that is textbooks are not read with great attention to details, that is they are not, at least initially, used as references. But then this reinforces the idea that things need to be pedantically made explicit.

9. Jeremy Fox says:

I’m not thrilled with calling yourself “stupid ass”, “idiot”, or “fool” over one imprecise phrase, even as a joke. It reads as an endorsement of name-calling as part of post-publication review. Which I know you wouldn’t endorse–so why write as if you do?

10. Paul Alper says:

One of the hurdles of dealing with confidence intervals is possibly linguistic. A confidence interval is a surrogate for what is desired and sounds reasonable. Now consider the medical world where we have such misleading terms as (relative) risk reduction, five-year survival and progression-free survival. Each surrogate purports to say something about mortality and seems outwardly sensible. Unfortunately, not only the lay public but medical experts also draw the wrong conclusions from those terms.

• Martha (Smith) says:

+1

Since frequentist statistical concepts are indeed usually “a surrogate for what is desired and sounds reasonable”, I often teach them from the point of view of “what we want” and “what we get”, to try to emphasize that the two are not the same.

11. a reader says:

I’ve been musing over whether it’s actually improper to call a confidence interval a credible interval; after some thought, it is just a credible interval that does not condition on a prior, right?

A twenty sided die rolls a “1” 5% of the time, and as Bayesians, we have no problem saying that if we rolled the die and covered it up, there’s a 5% chance we rolled a 1. So why do we have problems saying that *conditional only on the 95% confidence interval procedure*, there’s a 5% chance the true value is not inside the confidence interval?

Of course, we could condition on prior information to update that probability (or the interval itself), but it seems to me that a confidence interval is not just an approximation of a certain credible interval, it *is* a certain type of credible interval.

• Corey Yanofsky says:

By definition a credible interval is a summary of a posterior distribution; no posterior distribution, no credible interval.

Really, though, what you call the thing is less important than the properties it has in the face of actual data. Even from a frequentist perspective there are confidence procedures that no one would choose to use for statistical inference; the confidence coverage property is only necessary, not sufficient, for a given procedure to be satisfactory from that perspective. See here for example: http://learnbayes.org/papers/confidenceIntervalsFallacy/lostsub.html . (Likewise, there are priors that no one would choose to use for statistical inference.)

• a reader says:

Corey:

I think most people interpret a 95% credible interval as “an interval that has a 95% chance of containing the RV theta of interest”. If you want to take that away, then we lose the ability to say “well, credible intervals are so much easier to interpret!”.

With that said, I don’t want to get off-course of the question: from a Bayesian perspective, isn’t there nothing wrong with saying “Conditional *only* on the 95% confidence interval procedure, there is a 95% chance that the RV of interest is inside the confidence interval”…or even “conditional *only* on the data seen and the assumptions of the model…”.

Yes, of course this could be updated by use of prior information to provide an interval with better characteristics…but isn’t that statement alone valid?

• Corey Yanofsky says:

There’s nothing wrong with saying, “Conditional *only* on the 95% confidence interval procedure, there is a 95% chance that the RV of interest is inside the confidence interval” but that’s because it’s a pre-data probability statement. Bayesian probability calculations are a superset of frequentist probability calculations. This is quite different from saying “conditional *only* on the data seen and the assumptions of the model…”, and I would deny that the latter is valid. The paper behind the link I provided does a really good job of illuminating these issues.

(The difference between a confidence interval and a Bayesian interval is in some sense analogous to the difference between worst-case analysis and average-case analysis of algorithm space and time complexity. A confidence procedure requires a guarantee that holds in all cases, and in particular, in the worst case; a Bayesian procedure gives weights to possibilities using a prior in the same way that average-case complexity analysis gives weights to possible inputs.)

• Interesting that you put it this way. In some sense, there’s merit in the reverse as well: The confidence procedure only gives an average performance amortized over repeated application of the algorithm. The Bayesian result gives a result that applies to the outcome of each and every individual application of the algorithm.

• a reader says:

Corey:

Hmm, oh that’s right, you’re allowed to do foolish things under the rules of confidence intervals. For example, a function that returns the real line 95% of the time and a single point the remaining 5% creates a 95% confidence interval for any parameter…but the resulting interval is not a 95% credible interval conditional on the data + model (although it is a credible interval conditional *only* on the confidence procedure and not the data).
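The "foolish" procedure described above is easy to sketch. This is only an illustration: the function names are hypothetical, and the single point returned 5% of the time is an arbitrary choice.

```python
import math
import random

def trivial_interval(rng, level=0.95):
    """Ignore the data: return the real line with probability `level`, else one point."""
    if rng.random() < level:
        return (-math.inf, math.inf)
    return (0.0, 0.0)  # arbitrary single point (hypothetical choice)

def trivial_coverage(theta, reps=100_000, seed=1):
    """Long-run share of trivial intervals that contain theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = trivial_interval(rng)
        hits += lo <= theta <= hi
    return hits / reps

print(trivial_coverage(3.7))
```

The long-run coverage is about 95% for any theta other than the arbitrary point. Once you see which interval came out, though, you know whether it covers (the real line always does, the single point essentially never does), so the realized interval carries no 95% posterior content.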

Mentally meandering here, but are there additional clauses one could apply to make a confidence interval a credible interval conditional on only the model assumptions and the data?

• Corey says:

When you say “only the model assumptions and the data” I assume you mean “no prior”. What do you mean when you say “credible interval”? It can’t be the usual definition because that assumes the existence of a posterior distribution and hence a prior distribution too.

• a reader says:

Corey (I assume a different one?):

By credible interval, I simply mean the “simple” interpretation of a credible interval: a 95% credible interval has a 95% chance of containing the RV of interest.

What’s very tricky about this is what one is conditioning on for this probability statement. And by saying “only model assumptions and data”, I’m saying that we are *not* conditioning on expert information as you noted.

• Corey Yanofsky says:

(Same Corey, different device.)

There is no mathematical beast that satisfies your “simple” interpretation as far as I can tell. I’m not even sure what you mean, exactly, by the phrase “a 95% chance”. What does that look like in practice for you?

The discussion of relevant subsets in the paper illustrates the difficulty with trying to make the Bayesian omelet without cracking the Bayesian eggs. (I also forgot to say earlier that your example of a foolish confidence interval — the version in the paper is labelled “trivial” — and similar stupid things that no one would ever do aren’t really worth thinking about for more than five seconds; on the other hand, the paper’s discussion of the way the uniformly most powerful confidence interval fails to give us what we actually want is worthy of some contemplation.)

• a reader says:

Corey:

I’m somewhat confused by your comment that nothing satisfies what I’m calling a credible interval. To be clear, I’m referring to taking a Bayesian interpretation of a confidence interval. And I’m pointing out that it is not required that we condition on expert information; as you pointed out above, we can interpret a 95% confidence interval as having a 95% chance of containing the RV of interest, if we *only* condition on the CI procedure.

• Corey Yanofsky says:

It’s not required that one conditions on expert information but you do need to have some prior. Perhaps what you’re after are probability matching priors. (I personally do not consider data-dependent matching priors to be Bayesian.)

• a reader says:

Corey:

Can you articulate what is wrong with saying “conditional on the 95% confidence interval procedure alone, there is a 95% probability that the RV of interest is within the confidence interval”?

Again, you could condition on more information and make that statement false. But I’m merely applying the standard Bayesian interpretation of probability (“a six sided die that has been rolled and covered up has a 1/6 chance of being 1”) to a confidence interval.

• a reader: you’d better also condition on all the assumptions of the CI procedure being correctly met. In particular you’d better be sampling from a very large but finite population using a high quality random number generator.

I think if you condition on those things, it’s fair to say that you do have 95% probability that the true value of the parameter is in the interval in the same way that if you pull a ball out of a hat, and all you know is it has 100 balls and 3 of them are black, you have 3/100 chance of getting a black one.

• Which is another way of saying, that often the biggest problem with CI procedures is that they simply ARE NOT APPROPRIATE as we’re very often not repeated-sampling from a well defined population of fixed objects.

• a reader says:

Daniel:

I’m going to disregard your usual “but frequentist methods only work if the assumptions are met” comment as that discussion has already been played out enough.

But I will take the bait here: “Which is another way of saying, that often the biggest problem with CI procedures is that they simply ARE NOT APPROPRIATE as we’re very often not repeated-sampling from a well defined population of fixed objects.”

I’m taking a Bayesian definition of probability to a frequentist framework. If you know that in the long run, this procedure captures the RV of interest 95% of the time, then *conditional only on the results from this procedure and not on prior knowledge etc.*, there’s a 95% chance that the RV will be in that interval.

You just need to be careful about what you are conditioning on.

• Well, I wasn’t trying to bait you, and I was in fact agreeing with you that conditional on certain information, the Bayesian probability is the same as the frequency, just trying to point out that the thing you’re conditioning on isn’t just the use of a given CI procedure, but rather the use of the given CI procedure on a problem for which all the assumptions needed are met.

there’s a tendency to talk about “frequency guarantees” and yet the guarantees regularly fail to hold in practice.

But yes, “conditional on the use of a 95% CI procedure to estimate a parameter in the case where the CI procedure’s assumptions are met” and no further information, then the Bayesian probability you should assign to the parameter being in the interval would seem to be 95%.

• Corey says:

I can’t articulate what is wrong with saying, “conditional on the 95% confidence interval procedure alone, there is a 95% probability that the RV of interest is within the confidence interval” because as I already said above, that one is a pre-data probability. But once you know the data and use it to compute the realized confidence interval, the question arises what relevance that pre-data probability has for statistical inference. A Bayesian would say: none; you have the data, so condition on it. A frequentist has to resort either to Neyman’s “no learning, just inductive behavior” stance or to Mayo’s more sophisticated stance founded on her declaration by fiat that a claim is warranted just to the extent that it is the result of a procedure that is rarely in error. (Not that I have a problem with declarations by fiat; you’ve got to start somewhere.)

• Note though we aren’t conditioning on the data only the output of the CI procedure and our background about the sampling and the CI methodology.

• Corey: suppose you have a rover on mars. Its bandwidth isn’t high enough to actually send you all the data it’s collecting. But the data it’s collecting is repeated measurements of say the iron content of a particular large rock… We tell it: collect 100 samples from the rock, and construct a confidence interval for the mean iron content using standard CI procedure C, and transmit the interval to us.

I think after seeing the interval (which is now our data) and having no other information to condition on (eh…) we should assign a probability distribution for the mean iron content that has 95% probability mass over this interval.

Similarly, if you read a paper and they provide a detailed description of the data collection and analysis process, and you agree that it is appropriately RNG like, and they publish *only* a 95% confidence interval, not the data set… again, you should assign 95% probability to the parameter being in the interval unless you have a way to condition on further information.

the data collection and CI procedure is basically like a rolling-dice procedure.

I wouldn’t argue that this is a GOOD way to do the analysis, (in particular, I’d argue for doing the bayesian analysis on the rover, and xmitting a summary of the posterior) but I do think it’s got a certain logic.

• Corey says:

Daniel, let’s consider an example from the confidence interval vs credible interval paper I linked earlier. A realized interval for the uniformly most powerful confidence interval procedure is given: it is [1.0, 1.5]. The paper notes that this is consistent with two possible data sets: (y_1, y_2) = (1.0, 1.5) or (y_1, y_2) = (-3.5, 6.0). If I didn’t know which of those two possibilities gave rise to the realized CI then I would have to treat the y values as latent unknown variables (nuisance parameters, in effect) and marginalize them out. Likewise in your example: the contents of the memory registers of the rover are our latent unknown variables and our knowledge of the standard CI procedure enables us to compute the level set of latent unknowns that gave rise to the observed transmission. The latent unknowns are nuisance parameters and we marginalize them out to get the posterior distribution over the interest parameters. In both the UMP CI case and your Mars rover scenario, the fact that the function mapping latent unknowns to observed data was a confidence procedure at some specified confidence level is irrelevant to the Bayesian math.

• In order to marginalize them out you’d need to put priors over them, which would come from additional information you’re conditioning on right? So you’d also basically recommend conditioning on further information that you have, which I agree with… But it is different conditioning I think.

or am I missing the point?

• Corey says:

Daniel, what we’ve got structurally is just a bog-standard multi-level model with the interest parameter at the top and latent unknowns in the second layer. The (conditional) “prior” on the latent variables is just their sampling distribution given the interest parameter (and top level nuisance parameters if any). The weirdness is that the observed data is a constant random variable conditional on the latent unknowns.

This should ring a bell for you — it’s the same model structure that arises when dealing with rounded data in which the unobserved high-precision data enter the model as latent unknowns, where again the observed data are a deterministic function of the latent data.

• Corey, sure I see where you are coming from. But when you say that the prior is “just” the sampling distribution of iron component of a Mars rock… What is the shape of that sampling distro? Certainly that’s additional knowledge you’d have to condition on… Compared say to “the distro has a mean and standard deviation and therefore the assumptions of the CI procedure are met.”

I agree with you that there are other ways to analyze things. But using notation

P(theta|interval, ciproc, sat)

Where sat means ciproc assumptions are satisfied.

Might well be different from

P(theta|interval, ciproc, sampdistx)

Where sampdistx means you know the sampling distribution family of the individual data points. There are CI procedures that don’t rely on knowing the sampling distro exactly and so sampdistx isn’t a logical consequence of sat

• Corey Yanofsky says:

Daniel, I agree with this as far as it goes. We still need to address the relevance, if any, of the fact that the observed data are a 95% confidence interval (for a scalar parameter) calculated from latent data. If *all* you tell me that I’m going to be handed such a thing — maybe you don’t even tell me which confidence procedure generated the observed interval — and I’m to do Bayes upon it then my mind turns to stochastic process priors for all the function spaces you just involved me in and various limits I might try to apply to allow me to make a no-information-added claim about my posterior distribution. The confidence level enters into the math as a constraint on one of the function spaces but it’s not obvious to me that the amount of posterior mass enclosed in the realized confidence interval limits will necessarily be equal to the confidence level. Hell, it’s far from obvious to me that there will be a reasonable-looking no-information-added limit that will even yield a proper posterior at all.

• Corey,

It’s an interesting and non-obvious issue. But I think it has merit to consider just the “rolling dice” procedure. The procedure is “take one blob of 10^24 atoms arranged into approximately a cube and inject a large amount of linear and angular momentum, watch it bounce off a craps table with lumpy padded sides, and then read off the marks indicated on the surface that lies upwards”

Now, you could marginalize over the initial state of all of those atoms, or you could rely on the approximate symmetry properties of the dice and the sufficient quantity of initial energy to allow all possible faces to wind up facing upwards… and assign probability 1/6 to each face.

At some point I think I’d agree with you that the symmetry approach is a short-cut computational approximation, but it’s a very good one for well constructed dice. In fact, what it means to be a “well constructed die” is that the approximate probabilities assigned due to symmetry are very close to the actual observed frequencies in many many repetitions of a “good” die roll.

similarly, perhaps relying on the CI to give you 95% probability mass could be considered a computational short cut vs the ideal situation, but it might well be a very good one given a “well constructed and applied CI procedure”.

Also, let’s say that given some knowledge about how rocks form and prior information about the particular rock being measured, you might well find a very different posterior P(theta | science, instrumentation, all data points x) than you would from P(theta | Interval, CI proc, sat)

The relevance though is that it’s common enough to find a CI in a paper, with no knowledge of what data was collected, what CI procedure was used, but maybe a well described data collection procedure that does look like random sampling of some population etc… what should you do with such “transmitted” CI information?

• Corey, haven’t had time to consider the answer to this question but it might well be solvable in closed form:

what is the maximum entropy distribution on an unbounded univariate space given an interval in which 95% of the probability must lie? How about if there’s also a known point at which the pdf must take on a peak value?

hmmm…

• Corey Yanofsky says:

“…you could marginalize over the initial state of all of those atoms, or you could rely on the approximate symmetry properties of the dice and…”

I do both, actually.

“…the sufficient quantity of initial energy…”

So the dice tossing thing is just Jaynes and the physics of coin tosses again. It’s not the *quantity* of energy that matters — it’s two things: first, that the symmetries of the die and the symmetries of the physics (important in the coin toss case for demonstrating that you can’t bias a coin by weighting it) make the level sets of the toss outcomes both very finely grained and with nearly equal volume inside the state space of a tossed die. The second thing is — and I want to be clear that to say “sufficient quantity of initial energy” is to miss the point — the second thing is that I need to know that the amount of imparted energy and momentum is under sufficiently poor *control* that my prior for it must cover a region of state space that is large compared to the graininess of the level sets. That’s what lets me do the marginalization without breaking out my abacus.

“similarly, perhaps relying on the CI to give you 95% probability mass could be considered a computational short cut vs the ideal situation, but it might well be a very good one given a “well constructed and applied CI procedure”… what should you do with such “transmitted” CI information?”

I don’t think this sort of “transmission” is like the rolling die at all. The way I usually deal with such things is to decide whether I want to treat what I’m looking at as an estimator of a mean with, e.g., two-sigma error bars. If I feel confident that that’s the kind of thing I’m looking at then I assume that the estimator is approximately normal and fall back on the symmetry of normal confidence intervals and normal credible intervals under a flat prior. If I think the test statistic is a likelihood-related thing (LR or Wald or score) then I’ll do the same thing. If it’s a bootstrap CI then I just treat it as a credible interval directly; I’m pretty sure I’ve read papers justifying that sort of treatment. If the situation is so weird that I can’t be reasonably sure that something more-or-less sensible has been done then I throw up my hands and balk. And if some kind of shrinkage or simple multi-level modelling is needed and hasn’t been done then I’ll take the given CIs as input and do the shrinkage in rough fashion in my head.

• Carlos Ungil says:

> what is the maximum entropy distribution on an unbounded univariate space given an interval in which 95% of the probability must lie?

It is not well defined when passing to the limit. For given bounds (outside the interval) the maximum entropy distribution is flat inside the interval (with 95% mass) and flat outside the interval.
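The claim above can be written out as a short worked derivation. The symbols are assumptions introduced for illustration: the interval is $[a, b]$ and the outer bounds are $[A, B]$ with $A < a < b < B$.

```latex
% Maximize the entropy  H[p] = -\int_A^B p(x)\,\log p(x)\,dx
% subject to  \int_a^b p(x)\,dx = 0.95  and  \int_{[A,B]\setminus[a,b]} p(x)\,dx = 0.05 .
% The maximizer is constant on each region:
p(x) =
\begin{cases}
  \dfrac{0.95}{b-a}, & a \le x \le b, \\[2ex]
  \dfrac{0.05}{(B-A)-(b-a)}, & x \in [A,B]\setminus[a,b].
\end{cases}
% As B - A \to \infty the outer density tends to 0 pointwise, so no proper
% density on the whole real line keeps 5% of its mass outside [a, b] in the limit.
```

This is why the 95%/5% split alone pins down one integral of the posterior but not a usable distribution on an unbounded space.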

• Suppose we have a computer program, and a CPU with a hw RNG and a function handle with no knowledge of the implemented function f except it is a proper function of just its one argument. We write our own function which calls the hardware random number generator multiple times, collects the data, and runs it through a confidence interval procedure which we have been given without any details of its implementation. We then output just the interval.

Whether we have a complex electronic circuit, or a die being cast on a craps table out of control, or a random number generator being used to select from a large table of phone numbers or the like, all of them are arranged for the purpose of setting the frequency and mimicking an idealized random sequence.

I agree that a confidence interval procedure is just a constant function of the data. But basically a random sequence with a constant function applied to it is itself a random variable. If as a Bayesian I’m told that I’m going to get the output of a random variable then I will assign the probability of getting something particular to be the frequency with which the RNG has been designed to give that particular result.

When I have more information about the details of the random number generator, such as its seed or a mechanism by which it operates or the like, then I may assign different probabilities.

This post transcribed by my phone please excuse any odd transcription.

• Carlos: right that makes good sense. So perhaps there’s a sense in which it’s possible to assign a number to integrate(p(x), x in CI) while still not being possible to assign a full posterior distribution to x without additional information. That wouldn’t surprise me.

• Daniel et al.,

Did you reach a conclusion on the “Mars rover” example? I really like the setup. Seems very teachable. Would be v nice if the conclusion is equally clear, but I wasn’t sure it was.

–Mark

• Mark Shaffer:

Unfortunately I think the only thing we came to is that p(Theta in CI | CI Transmitted from Mars Rover, CI procedure assumptions are met) = 0.95

this isn’t enough information to give a posterior distribution over the parameter Theta it just constrains a particular integral of that posterior. We can say that provided the CI procedure’s assumptions are met, we should assign 95% mass to the interval, but we should assign 5% mass to “outside the interval” and we don’t have a general way to make a useful probability distribution from those two pieces of info when the parameter space is unbounded.

Any information we add which would allow us to make a proper probability distribution would be added information, and the combination of this added information, and the CI procedure/interval would potentially alter the probability being assigned to the interval.

• Daniel,

Thanks! Very clear. And maybe the “unfortunately” isn’t warranted, at least for me. I like the Mars rover example and the conclusion that “p(Theta in CI | CI Transmitted from Mars Rover, CI procedure assumptions are met) = 0.95” because it looks like it could be useful in a teaching context.

My problem all along is that teaching students how to calculate CIs and at the same time telling them “don’t try to interpret realized CIs, wrong, can’t do that” doesn’t work too well.

The Mars rover example – the frequentist robot hands its result to a Bayesian human, who interprets it (is that fair?) – looks like something that (a) students will understand and remember, and (b) is actually correct.

Maybe it needs a footnote so that “CI procedure assumptions are met” includes some extra assumptions (possibly the same ones that the “bet-proof” interpretation needs, i.e., it’s a “standard problem”)? Otherwise you could have a CI procedure with 95% coverage that sometimes generates intervals that are empty or the entire real line. But that’s OK. “Standard problem” includes almost everything that we teach at this level.

• Mark, I think the biggest issue is that the information the Bayesian conditions on is the absolute minimum information for the random CI generation procedure to have its frequency properties, and no more. In particular any knowledge of the underlying science or measurement tools or some logic such as you can’t have negative iron or the like changes the conclusion, and changes the modeling method. For example Corey suggests to make the data unknown parameters subject to your science knowledge, and then put distribs on them subject to them resulting in the given CI… It could radically alter the resulting conclusion.

If all you know is an rng gave you a random output with certain frequency behavior it’s justified to assign the probability to the event whose frequency is known… Like rolling a well made well rolled die, or calling a well tested rng function.

• Thanks Daniel. I think I get your point about prior science knowledge. But if you had in mind my passing remark about “standard problems” etc., I think that’s different. If the realized CI that the Mars rover is empty – possible for “nonstandard problems” – you know the conditional probability claim about this realized CI has to be wrong (and not because of prior science knowledge etc.). But maybe that’s not what you meant, in which case apologies (and thanks for continuing to engage … much appreciated).

• Oops! “If the realized CI that the Mars rover SENDS is empty”.

• Mark: in reality, yes, in the formalism I’m less sure, the information you’re using to infer that the CI can’t be empty is what? It seems it should be something else included on the right hand side of the vertical bar. But I think the point is well made, we ALWAYS have useful information about real problems. The biggest problem with interpreting a CI as a credible interval is that it treats the problem as if you are algorithmically testing the quality of a pseudo-random number generator. That’s never what you’re doing.

• Carlos Ungil says:

> If all you know is an rng gave you a random output with certain frequency behavior it’s justified to assign the probability to the event whose frequency is known…

Let’s say a rng gives you a number which is even with 50% probability. If that’s *all* you know (in particular, you do not know the number) it may be justified to assign 50% probability to the event that the number is even. On the other hand, if you know that the number generated is 42 it’s not justified to say that the probability that it is even is 50%.

If you get a confidence interval with 95% frequentist coverage it may be justified to say that the probability of covering the true value is 95% but only as long as you don’t know what is the interval. If you do, you should condition on the data and the frequentist coverage guarantee is no longer valid.

• Of course, it’s not that it can’t be empty, it’s that if it’s empty you know the conditional probability claim is wrong. (Which is what I think you meant.) And even if you only know that it’s possible that it’s empty, then you also know that if it’s not empty the conditional probability claim is also wrong. (If sometimes it’s going to be empty, the rest of the time it’ll be too wide.)

The way around it, I think, is to say “for standard problems only” (like with the simple bet-proof case) which means this can’t happen and the conditional probability claim will be ok. I guess you can say this is included on the right hand side of the vertical bar. But it’s different in that it’s something you know about the method rather than something you know about the parameter you’re estimating.

• a reader says:

Maybe I just see Bayesian statistics in a different light…but I think this all follows from my earlier point.

For example, Carlos’s example with even/odd numbers. Conditional on only knowing that you have a function that returns an even number 50% of the time, if you *just* condition on this fact, then any number you get, you can say “conditional only on the procedure, there is a 50% chance this number is even”. So if the function returns a 42, if you only condition on the procedure and not your expert information about even/odd numbers, you say “conditional only on what I know about this function and nothing else, there is a 50% probability that 42, the number returned by this function, is even”. Of course, you can also say “conditional on what I know about this function and what I learned in kindergarten, there’s 100% probability that 42 is even”.

To demonstrate further, suppose I use a discrete uniform rng and get output 8912437587987614581234095. If I don’t use a computer nor care to waste any time doing long division, I’m perfectly happy with saying “Given what I know about the rng + my mathematics background, the probability that the number above is divisible by 7 is 1/7”. After I check on my computer, I’m happy to update my posterior to 0 or 1, but I recognize that this is conditional on me having checked.

• Carlos: suppose you have brain damage, and you don’t know what it means to be even… (i.e., “I know what evenness is” isn’t on the right side of your conditioning bar). Then if someone tells you “here’s the number 42, it came out of a random number generator that gives even numbers 50% of the time”, what is the probability to you that the number is even?

Or alternatively, suppose someone gives you a number and says it’s from an RNG that gives numbers that are FLORG 50% of the time. You have no idea what FLORG means, but it’s a well defined thing. You have the number. Conditional on your information, you can only say it has 50% probability of being FLORG.

Your point is essentially amplifying what I already said, which is that conditioning ONLY on the knowledge that an interval came from a particular RNG / CI procedure is usually the wrong thing to do. But it’s a good amplification because it shows how background information is important at even the most basic level. We have LOTS of background information on every real world problem.

• Carlos:

“If you get a confidence interval with 95% frequentist coverage it may be justified to say that the probability of covering the true value is 95% but only as long as you don’t know what is the interval. If you do, you should condition on the data and the frequentist coverage guarantee is no longer valid.”

In Daniel’s “Mars rover” example, you can’t condition on the data because you don’t have the data – all the rover sent was the 95% interval it calculated. (I really do like this example!) And if you have the interval and nothing more, and the CI procedure assumptions are met and it’s a “standard problem” (no empty CIs possible etc.), then the claim is that you can assign a probability of 95% that the parameter is in the interval. Or am I misunderstanding your point here?

• a reader says:

Part of what I like about viewing a confidence interval in this way is that it very thoroughly points out that if there are values in the confidence interval that seem very unlikely to you, you shouldn’t just accept them as now being (relatively) likely values!

On the other hand, if there are values in the credible interval that seemed very unlikely to you a priori, you might need to reconsider if these values are really so unlikely.

Realistically, you should first reconsider if you had a reasonable prior/likelihood function.

• Corey Yanofsky says:

reader, Bayesian foundations (of the Cox-Jaynes variety) postulate that if B => A then Pr(A | B) = 1; that is, Bayesian probability models a “logically omniscient” reasoner who has an oracle for the logical implications of any set of assumptions. (That’s also helpful for postulating that we never condition on a contradiction.) For logical uncertainty we need something else; what, exactly, is not yet known but progress is being made: https://intelligence.org/2016/09/12/new-paper-logical-induction/

• a reader says:

Corey:

Either I’m not following your point, or you’re not following mine. Perhaps this will help illustrate: are you comfortable saying that since you don’t want to sit down and do the math/coding required, you’re fine with saying your personal probability that 12909809723450982345 is divisible by 7 is 1/7?

I’m fine with saying that. Now, even after I type in “mod(12909809723450982345, 7)”, I’m still fine with saying “Conditioning on what I just saw spit out by my computer, my personal probability that 129…45 is divisible by 7 is 0 (or 1). But before I saw that, my personal probability was 1/7”.

I don’t think this is just being annoying. I think it’s crucial to the interpretation that a Bayesian posterior is an update of *a* prior, and there are lots of different priors, some better than others.
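As a minimal sketch of the before/after reasoning in this comment (in Python rather than the `mod(...)` call quoted above; the variable names are hypothetical), the personal probability is 1/7 before the remainder is computed and collapses to 0 or 1 once it has been seen:

```python
# Hypothetical illustration of updating on the result of a computation.
n = 12909809723450982345

# Before computing anything: treat the 7 possible remainders as equally plausible.
prior_prob = 1 / 7

# Python integers are arbitrary precision, so the remainder is exact.
remainder = n % 7

# After seeing the computed remainder, the probability is 0 or 1.
posterior_prob = 1.0 if remainder == 0 else 0.0

print(f"prior P(divisible by 7)     = {prior_prob:.4f}")
print(f"remainder                   = {remainder}")
print(f"posterior P(divisible by 7) = {posterior_prob}")
```

The two probabilities do not contradict each other because each is conditional on a different state of information, which is the point being argued.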

• Corey Yanofsky says:

No, I’m not fine with that, in the sense that the doctrinaire Bayesian in me refuses to assign probabilities that could result in conditioning on a contradiction. It is a constant irritant niggling at me that using probability for logical uncertainty in that fashion (and especially as in the case of Bayesian numerical integration) works as well as it does since I know of no foundations to justify that use case.

• Carlos Ungil says:

Mark, in this case the data is what the rover provides: the pair ( lower-bound , upper-bound ) that defines the interval.

Daniel, the example of the rover was not about getting a confidence interval for an undefined parameter but for a well defined quantity (iron content) and using a well defined procedure (collect 100 samples from the rock, and construct a confidence interval for the mean iron content using standard CI procedure C, and transmit the interval to us).

> I think after seeing the interval (which is now our data) and having no other information to condition on (eh…) we should assign a probability distribution for the mean iron content that has 95% probability mass over this interval.

I can agree with that, but if you had received the complete set of data and you didn’t have any other information to condition on you would also assign a probability distribution for the mean iron content that has 95% probability mass over that interval. If you have no other information to condition on, you use a flat prior and the confidence interval is a credible interval.
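A small sketch of this equivalence, assuming a normal mean with known sigma (the numbers and function names below are made up for illustration): under a flat prior the posterior is N(xbar, sigma²/n), so the usual 95% confidence interval carries 95% posterior mass. It also reproduces the point from the original post that when such an interval excludes zero, the posterior probability of a positive parameter is at least 97.5%.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(xbar, sigma, n, level=0.95):
    """Classical z-interval for a normal mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

def flat_prior_posterior(xbar, sigma, n):
    """Posterior for the mean under a flat prior: N(xbar, sigma^2 / n)."""
    return NormalDist(mu=xbar, sigma=sigma / sqrt(n))

# Hypothetical data summary: the interval only just excludes zero.
xbar, sigma, n = 1.0, 5.0, 100

lo, hi = confidence_interval(xbar, sigma, n)
post = flat_prior_posterior(xbar, sigma, n)

mass_in_ci = post.cdf(hi) - post.cdf(lo)   # posterior mass inside the CI
prob_positive = 1 - post.cdf(0.0)          # posterior P(parameter > 0)

print(f"95% CI: ({lo:.3f}, {hi:.3f})")
print(f"posterior mass in CI: {mass_in_ci:.4f}")
print(f"posterior P(mu > 0):  {prob_positive:.4f}")
```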

Depending on the details (and it seems to be the case for this location parameter example if the likelihood is just dependent on mu and sigma), the bounds of the confidence interval can be a sufficient statistic. If you want to do a Bayesian analysis, the information sent by the rover is enough in that case.

In general, if you have a prior for the parameter you have a prior for the probability of the confidence interval returned by the rover containing the true value. Your posterior probability for the interval containing the true value does not have to be 95%. If your prior probability was 100%, the posterior probability will be 100%. If it was 0% it will be 0%. It can be 95%, but I guess in most cases it will be somewhere between your prior probability and 95%. Even if it cannot be calculated explicitly if you don’t have a model, the 95% CI can be interpreted as evidence supporting an increase (or maybe decrease, if it was higher) of your prior probability for the interval containing the true value.
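One way to sketch this, under strong assumptions that are not in the comment itself: the rover's interval is a symmetric normal-theory interval with known sigma (so its bounds recover the sample mean and standard error), and the prior on the parameter is normal. The prior N(3, 1) and the interval bounds below are hypothetical.

```python
from statistics import NormalDist

def interval_to_summary(lo, hi, level=0.95):
    """Recover (xbar, se) from a symmetric z-interval; assumes known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (lo + hi) / 2, (hi - lo) / (2 * z)

def normal_posterior(prior, xbar, se):
    """Conjugate update: precision-weighted average of prior mean and xbar."""
    w_prior = 1 / prior.stdev ** 2
    w_data = 1 / se ** 2
    var = 1 / (w_prior + w_data)
    mu = var * (w_prior * prior.mean + w_data * xbar)
    return NormalDist(mu, var ** 0.5)

def prob_in_interval(dist, lo, hi):
    return dist.cdf(hi) - dist.cdf(lo)

lo, hi = 0.02, 1.98                      # hypothetical interval sent by the rover
prior = NormalDist(mu=3.0, sigma=1.0)    # hypothetical informative prior

xbar, se = interval_to_summary(lo, hi)
post = normal_posterior(prior, xbar, se)

prior_p = prob_in_interval(prior, lo, hi)
post_p = prob_in_interval(post, lo, hi)

print(f"prior P(interval covers mu)     = {prior_p:.3f}")
print(f"posterior P(interval covers mu) = {post_p:.3f}")
```

With these particular numbers the posterior probability lands between the prior probability and 95%, which matches the guess in the comment; other priors can give other orderings.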

• Corey, suppose that blarg is a well-defined true or false property of numbers but requires exponential computing power in the size of the number. Blarg(100) takes one year of computing. Further suppose that a mathematical proof exists that a random number generator produces blarg numbers with probability 1/2. The RNG outputs x=8912437587987614581234095. What is your Bayesian probability that x is blarg?

Suppose that it is possible with a machine to deconstruct a rock atom by atom and count the iron atoms but the machine takes about 1 second per atom. Logically it is true or false that the number is or is not blarg, and it is true or false that the rock has more or less than say 1/2 iron atoms. Both require “only” the pure computation of a result by a computing machine. How do they differ?

• Corey Yanofsky says:

> what is your Bayesian probability that x is blarg

Not sure yet, get back to me after the Big Freeze.

I’ll use probability for logical uncertainty provisionally because it seems to work and because research into a foundation of logical uncertainty seems to show that something like probability theory works there too. Cox’s theorem doesn’t cover it though.

• How is computing blargness different from counting the percentage of iron atoms though?

I’m fine with saying this all seems a little scary and proceed with caution, but I’m honestly not clear on how computing blargness with a machine and computing percentage of iron atoms with a machine would be different, the second one seems to be pretty clearly the kind of thing we do with approximate data collection. The blargness thing isn’t obviously different though. Particularly for example if you can output a sequence of intermediate results from the blargness computation that you could update your probability on the basis of.

Not trying to “gotcha” here or anything, I honestly think this is an interesting bit of philosophy of science and / or logic. And I suspect ojm would chime in here on something related to constructivist logic and the blargness hypothesis (is that like the best band name ever?).

• Anonymous says:

Corey,

You don’t need to give up Bayes, merely recognize some hidden assumptions. Consider Laplace’s definition of a probability, namely, it’s the ratio of the favorable cases to the total cases. Write this as p=F/T. This is not a frequency of occurrence, but merely counting of possibilities.

This definition is great and can serve as the basis for a Jaynes style foundation for statistics as an extension of propositional logic. But there is a hidden assumption that F and T are actually known. If you relax that assumption to consider cases when we only have partial information about F and T, and use the sum/product rules to manipulate that added uncertainty, you get what looks like a “probability of a probability”.

This actually works though. See Jaynes’s Chapter 18 on the “Ap distribution” For example.

In the case where our information in principle fixes F/T but we can’t “effectively” compute F/T, then we can still assign probabilities to various F and T based on what we can effectively compute. As long as you assign some probability to every value which could be true, you won’t run into any logical difficulties. For example, if our information implies a “contradiction” so that F=0, then you’ll be OK as long as you assigned some probability to that possibility and didn’t set Pr(F=0) = 0.

I suppose you could call this an extension of a very strict interpretation of Jaynes, but since you’re still using the same equations to manipulate uncertainty (just at a deeper level), it makes more sense to me to consider it still “Bayes”. I don’t think Jaynes would have been bothered by that since he wrote that Chapter 18 on the Ap distribution after all.

• Corey Yanofsky says:

The difference between the blargness of the number of iron atoms and the question of whether more than 50% of the atoms are iron is that the blargness of any particular number is a logical consequence of my prior information (which I’m assuming here has imported enough axiom schemata and whatnot from first-order logic that we can actually reason about numbers and blargness) and the proportion of iron is not. As part and parcel of the fact that we aim to extend propositional logic, Cox’s theorem (and Van Horn’s uniqueness theorem even more explicitly) takes as a premise that all logical implications of the prior information get the same plausibility value as a tautology. If your system of probability doesn’t do that, it’s not an extension of propositional logic — which is fine, because we need to go beyond propositional logic to account for bounded computational resources. I’ll be satisfied when a system of probability exists that *actually describes how* to go about updating logical probabilities on the basis of a sequence of intermediate results from some ongoing computation. The closest thing I’ve seen to such a system is behind the link I gave “a reader”.

• Corey Yanofsky says:

Hey Big J, like naïve set theory, that approach will appear to work in limited domains but will run into problems with the undefinability of truth in fairly short order. http://intelligence.org/files/DefinabilityTruthDraft.pdf

• Corey, my naive understanding of intuitionist/constructivist logic is that the answer to “is x blarg” simply doesn’t exist until you’ve computed it. Or maybe until you’ve exhibited a computer program that would compute it… there’s obviously a difference.

I’ll concede that this area seems suspect, and probably not well resolved. In many practical cases we probably do well taking Joseph’s approach from a utilitarian perspective.

If J will send me his current email to my well known one I would appreciate it ;-)

• Anonymous says:

Corey,

Truth is definable enough in propositional logic, which Jaynes’s probability theory generalizes.

In propositional logic given a set of atomic propositions a1, a2,…, then to determine the truth of a compound proposition Q(a1,a2,…) dependent on them, you merely construct the truth table for Q and cycle through every possible “valuation” or true/false combination for each atomic proposition. Given n atomic propositions there are 2^n possible valuations. Given enough time you can check all 2^n and determine if Q evaluates to True for all of them.
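The brute-force check described here can be sketched in a few lines. The function name and the example propositions are hypothetical; the point is only that the loop visits all 2^n valuations, which is what makes the computation infeasible for large n.

```python
from itertools import product

def is_tautology(q, n):
    """Return True if q evaluates to True under every one of the 2**n valuations."""
    return all(q(*valuation) for valuation in product([False, True], repeat=n))

# Hypothetical compound proposition: (a1 and a2) implies a1.
def q_example(a1, a2):
    return (not (a1 and a2)) or a1

print(is_tautology(q_example, 2))            # true under all 4 valuations
print(is_tautology(lambda a, b: a or b, 2))  # fails when both are False
```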

If this computation can’t be done or won’t be done, then it represents a source of uncertainty. One on a deeper level than we usually deal with in statistics, but uncertainty nevertheless. No matter what you think of it, or how you think about it, the bottom line is you can consistently handle this with the same equations (sum/product rules) as any other uncertainty.

By “consistent”, I mean something like, if further computations verifying Q are made later, thereby reducing this kind of uncertainty, you can “update” in a Bayesian style to get better results which don’t inherently contradict what was claimed before.

Nor is the situation fundamentally different if we switch to predicate logic (first-order logic), since it too is semantically complete, according to a theorem by Gödel, which is all that’s really needed for this.

• Anonymous says:

I should add, this viewpoint has a lot more applications than it might look on the surface, as does Jaynes’s Chapter 18 on the Ap distribution for that matter.

In particular, suppose probability is the ratio of favorable to total cases, p=F/T as before, and there’s uncertainty as to the correct value of F and T. Then you wind up “estimating” F/T, or taking expectation values over Jaynes’s Ap distribution. It’s like “estimating a probability”.

Many Bayesians who only partially got what Jaynes was saying, claim it doesn’t make sense to estimate a (non-frequency) probability. But if you read Jaynes carefully, it sometimes does make sense.

• Corey Yanofsky says:

Okay, you’ve convinced me.

• Anon / J,

There are of course some difficulties in using propositional calculus with real-world applications. For example suppose we wish to figure out the truth table for

Q(A,B)

and A is a proposition like “Blarg(X) is a computation that halts and returns 0”

Nevertheless, I do think there’s plenty of opportunity for using Bayes in these scenarios; I just don’t know how far any kind of logical consistency guarantees really extend. The Gödel completeness theorem applies to base propositional calculus, where you assume the truth or falsity of “atomic” propositions is well decidable. If you start making propositions about Blarg(X), you’re only complete in the sense that *if you tell us whether A is true or false* then Q is definitely decidable.

• Anonymous says:

Daniel,

Sometimes I feel like I’m having the following conversation:

Me: “abstractly, the probability calculus is THE tool for handling situations where everything isn’t known”

Someone else: “yeah, but it doesn’t always apply because in situation xyz, we don’t know enough to carry out the computations”

Me: “uh….I think I see a way out of this…”

• Anonymous says:

Daniel,

As far as the logical consistency thing, I don’t think you need any more than to avoid assigning zero probability to things which could be true. The result is going to be as “consistent” as the underlying structure. So if you’re extending Propositional logic, the result would be as consistent as Propositional logic is.

• Anonymous says:

Jaynes’s Ap chapter seems perfectly designed for being ignored by applied statisticians. But it’s applicable and once you see the point, it’s a very natural and convenient approach to many problems. It would be worth writing up a bunch of examples, but I have a feeling it would go over the heads of the denizens of the stat community.

One thing though, it often doesn’t make sense to give probabilities to a high number of significant digits. When Andrew mentions this there’s sometimes push-back from some more thoughtful Bayesians.

However, from the view of Jaynes’s Ap chapter, the width of the Ap distribution places a bound on how many significant figures it makes sense to quote probabilities to (this is for logical probabilities, not just frequencies masquerading as “probabilities”. Obviously estimating real frequencies is error prone for separate reasons)

• In some sense my example shows how badly we need something like Bayes for logical uncertainty. The halting problem is in general undecidable, so no amount of work will in general help us eliminate the uncertainty of certain logical statements. Nevertheless, I agree with you that it seems useful to have “meta Bayes” and it will probably work right much of the time. I’d need to see some kind of more formal proof of something to really understand what the limits are. I really do need to reread that Ap distribution stuff. I’m pretty sure I didn’t get it the first time I tried a few years back.

• Blissex says:

«A confidence procedure requires a guarantee that holds in all cases, and in particular, in the worst case; a Bayesian procedure gives weights to possibilities using a prior in the same way that average-case complexity analysis gives weights to possible inputs.)»

That’s an interesting point of view on how priors “help”.

• a reader says:

I also think that interpretation (“conditional *only* on the data and model assumptions…”) makes it much more clear what are the advantages and disadvantages of confidence vs credible intervals.

12. Bill_R says:

If you flip it around and focus on the typical values of the estimate (Hartigan, 1969), then the C.I. is an approximation that covers about 95% of the typical values of the estimate. This depends, of course, on how well the math assumptions match whatever generated the observables.

13. Clyde Schechter says:

Ironically, were the term not already taken by Bayesian statistics, I think a better name for confidence interval would be credible interval (or, I would probably call it credibility interval). Here’s why.

A confidence interval is like a source of information that you know, in the absence of data, gets things right a certain percentage of the time. In fact, this is how I teach the concept of confidence interval. I tell my students to think of it like a friend whose movie recommendations usually turn out to match your taste. The CI calculation algorithm is a friend whose interval predictions about parameters are usually correct.

So I would prefer to call it a credibility interval; it’s an interval that derives from a highly credible process. But the Bayesians have appropriated (nearly) that term for something else, which, to my mind, would be better called a confidence interval. Confidence in the result (Bayesian credible interval) vs credibility of the process (Frequentist confidence interval)! But I suppose both terms are too entrenched for this semantic switch to happen.

14. Mayo says:

There are some people (Fraser, Fisher at times) who define the probability of the inference as the probability that the method used to obtain the inference would be correct frequentistly. This is a methodological use of probability: probability attaches to the method. (They try hard to identify when this kind of rubbing off can work.) A severe tester might say a one-sided lower Normal .975 interval for mean mu: mu > a is indicated severely because if mu ≤ a, then with high probability a larger observed mean would have been observed. (The analogous construal would hold for the one-sided upper interval.) Under this view, each point in the CI receives its own severity evaluation, and it’s the bounds (at each level) that matter. I take this to be akin to confidence distributions–on which there’s a big literature–but different interpretations are given by different CD people.
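A rough sketch of the severity calculation for the one-sided claim mu > a, assuming a normal mean with known standard error and taking severity to be P(X̄ ≤ observed mean; mu = a); the numbers and the function name are hypothetical. At the lower bound of the one-sided .975 interval this evaluates to .975, and claims with smaller a get higher severity.

```python
from statistics import NormalDist

def severity_lower(xbar_obs, se, a):
    """Severity for the claim mu > a: P(Xbar <= xbar_obs) computed under mu = a."""
    return NormalDist(mu=a, sigma=se).cdf(xbar_obs)

xbar_obs, se = 1.0, 0.5                 # hypothetical observed mean and standard error
z = NormalDist().inv_cdf(0.975)
a = xbar_obs - z * se                   # lower bound of the one-sided .975 interval

sev_at_bound = severity_lower(xbar_obs, se, a)
sev_weaker_claim = severity_lower(xbar_obs, se, a - 0.5)   # a less ambitious claim

print(f"lower bound a = {a:.3f}")
print(f"SEV(mu > a) = {sev_at_bound:.4f}")
print(f"SEV(mu > a - 0.5) = {sev_weaker_claim:.4f}")
```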

Now if only authors – not including Gelman (who doesn’t get this wrong) – were prepared to remove assertions from their texts blithely claiming that it is a theorem, proved by Birnbaum (1962), that the (strong) likelihood principle follows from weak conditionality.
https://errorstatistics.com/2017/12/27/60-yrs-of-coxs-1958-weighing-machine-links-to-bing-read-the-likelihood-principle/

15. I appreciate everyone’s thoughts about language for communicating confidence intervals. Until credible intervals dominate, we need to have language for CIs that is fairly honest and fairly accurate. Just as I often talk about margins of error and SEs, I think there is some value in CIs. The main value is that they are better than p-values (just as bar charts at the least are better than pie charts). I’m not completely convinced that saying that values between a and b are consistent with the data is too misleading to use, but the only consensus I can derive from the excellent discussion is that a 0.95 CI of [a,b] needs to be labeled exactly as “a 0.95 CI is [a,b]” just as we quote p-values without comment (except for non-helpful phrases such as “statistically significant”) instead of saying that the p-value is exactly the chance upon repeated identical experiments of getting a result more extreme than mine if these experiments were run with H0 in effect. It is too wordy to say in a medical paper that “in infinite repetitions of the experiment the line of confidence intervals would touch the true value 0.95 of the time”. So it would be nice to converge to some suggested short language to recommend to journals.

• Bill Harris says:

I once heard a presenter at a CS conference say that physicists have it good: they tend to give Greek- or Latin-derived names to concepts, and thus people have less incentive to conflate the conventional meaning with the technical meaning. In the presenter’s mind, CS folk, who tend to appropriate words from English or whatever natural language they’re using, are stuck with that conflation. If I only knew Greek or Latin, I could make up a word that derived from one of those languages and that referenced confidence intervals. In theory, someone unfamiliar with the term would look up the wordy and technical definition, and those who were familiar with it wouldn’t be tempted to conflate it with their reasoned personal confidence in a result.

I suppose it’s too late to start over. :-)

• Carlos Ungil says:

I wonder if the Latin-derived name might be one of the reasons why Fisher’s fiducial approach was not widely adopted.

• Martha (Smith) says:

@ Bill: Good point.

• Thanatos Savehn says:

αβεβαιότητα

See? Even the ancients struggled with the concept.

• Martha (Smith) says:

Wiktionary gives me “uncertainty, doubtfulness, incertitude, precariousness”.

• fred says:

> I appreciate everyone’s thoughts about language for communicating confidence intervals.

For 0.95 CI [a,b], how about saying that the data wouldn’t be surprising if the true parameter value were anywhere between a and b?

• Chris Wilson says:

Yep! Regardless of what some have suggested here and there, most researchers discuss and interpret their CIs (from least squares, max likelihood, whatever) like credible intervals, i.e. automatically and unconsciously apply a uniform prior over parameter space. I have never seen a research presentation where it was made clear that frequentist CIs are basically *procedural*, or an error statistical perspective was used.

16. Allan C says:

They also seem to mislead in other ways: http://statmodeling.stat.columbia.edu/2017/04/20/teaching-statistics-bag-tricks-second-edition/#comment-468972

More on the actual post,

I concur with Jeremy Fox above that being hyperbolic about your own mistakes is probably not the best use of the opportunity (BTW: regular readers of this blog probably spotted that the mistake was of your own doing given the provocative title, as it would only be something you bestow for your own mistakes). I think it might be more instructive to highlight the mistake as something that happens (even to legends in the field), spend an increased amount of time discussing why you think you wrote what you did and how it escaped revisions (etc.), why you were wrong, and how your thinking is now corrected (and how we know it’s better); even if you’re left speculating what your state of mind was back in the day, I think that’s a more productive enterprise than dismissing your past written word as stupid with only a small technical explanation as to why you were wrong at the time.

The above is written with the boundary constraint of time in mind. You write a lot for the benefit of many and the above is not to suggest you spend more time on these posts (though, that would be cool too); just that given whatever time you want to allot to the post, have more of it go towards discussing your state of mind at the time than the technical details now (with hyperbole toned down).

• Andrew says:

Allan:

In this case there’s not much to the story. I don’t remember when I wrote that passage (or maybe Jennifer wrote it, but I expect I was the author, because she’s typically been more successful at avoiding this sort of sloppy writing), but I expect I was just trying to explain a concept without fully thinking it through. It would be much better to just say that, if the assumptions are satisfied, in repeated applications the 95% confidence intervals will include the true value 95% of the time.
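The repeated-applications statement can be checked by simulation. The sketch below assumes normal data with known sigma and uses made-up values for the true mean, sigma, and sample size; it counts how often the z-interval covers the true value.

```python
import random
from statistics import NormalDist

def coverage(true_mu=1.0, sigma=2.0, n=25, reps=20000, level=0.95, seed=0):
    """Fraction of simulated z-intervals (known sigma) that contain true_mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and form its interval.
        xbar = sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= true_mu <= xbar + half_width:
            hits += 1
    return hits / reps

print(f"empirical coverage: {coverage():.3f}")
```

The coverage is a property of the procedure across replications; nothing in the simulation assigns a probability to the parameter given any single realized interval.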

17. Pancake, a bloody one says:

Ah, I was taking a bath in the warm waters of nostalgia by revisiting a statistics textbook that introduced me to the basics of basic frequentist concepts. How horrible that turned out to be! Oh my. There was a whole section on the interpretation of p-values, and the main point he was making was – p-value is the probability that the null hypothesis is true. Explicitly! On numerous occasions! There were other parts that were, well if not factually wrong in the strict sense, at least pedagogically questionable.

I should go to the library and see if there’s been a moment of epiphany for this writer and if he’s gotten his definitions/explanations right.

• Dale Lehman says:

It’s probably in the 11th edition, and still has data sets printed in the back of the book, and is still being used by the same instructors that have used it for the past 20 years, giving the same tests, and teaching the same way.

18. Tom Passin says:

The confidence interval, as I see it, is best thought of as a gauge or indication of the statistical qualities of the data. Originally at least, it is the width of a distribution – which you hope matches that of the underlying population – that includes x % of the data, where x is generally 95%.

To make sense of it, you have to assume things like –

1. The population distribution is the same as the sample distribution;
2. You can calculate the end points from the distribution. Usually one hopes that the normal distribution is close enough that you can use it.
3. Your sample estimates of the population distribution’s parameters (e.g., mu, sigma) are close enough to the real population’s parameters that it’s OK to use them without too much error.
4. You are willing to use imaginary replications to build up a picture of the statistics you would get if you could do enough repetitions of the experiment.

I see these other approaches to CIs as attempts to avoid some of the above limitations. But those points really dominate the situation no matter how much you wave your hands. Want to use a prior? Add one more assumption into the mix. Want to use some non-standard definition of CIs? Then they become even more obscure and hard to interpret. Want to use likelihoods instead? OK, but if your experiment doesn’t fit the assumptions that let you derive the likelihood rigorously, then you have added yet another set of assumptions that may or may not be so.

OTOH, if you use the basic “CI” as a gauge of how noisy a data set is, it’s very helpful. Let’s just not get too excited about exact numerical details.

19. Huw Llewelyn says:

Maybe Andrew is being a bit hard on himself. When estimating a single population parameter using random sampling, the prior probabilities of all its specified possible values are uniform by definition. My reasoning is described in the pre-print (https://arxiv.org/ftp/arxiv/papers/1710/1710.07284.pdf). I would be grateful for comments.

• Andrew says:

Huw:

I have a problem with uniform priors (and the resulting Bayesian interpretation of classical confidence intervals) for reasons discussed here.

20. Huw Llewelyn says:

Andrew:
I had read very carefully your 2013 ‘commentary’ to which you refer and cited it in my paper. I think that I have addressed the issues with which you have a problem. For example, I argue that the P value generally only provides an approximation to an ‘idealistic’, frequentist probability that the true parameter will be more extreme than the null hypothesis. The two are the same only under special circumstances. I use the example of the binomial distribution in the paper to make this point. In order to arrive at a ‘realistic’ probability, one has to make judgements about the study methods, etc. that can be represented by an ‘idealistic to realistic’ (I/R) index, which is based on a probability. (The idealistic probability can also be used as a bound for the realistic probability as suggested by Greenland and Poole). This index performs a similar role to the ‘Bayesian prior’ but is based on a ‘probability syllogism’ (there is an example of such a ‘probability syllogism’ in Bayes’ paper). I argue that when using random sampling to estimate a single parameter value the ‘base rate priors’ have to be uniform so that a non-uniform Bayesian prior can only be ‘non-base rate’ and has to be ‘conditional’ on past real or pseudo data based on identical observational methods. However the use of a ‘probability syllogism’ to calculate an ‘I/R’ index allows us to take these methodological issues (e.g. possible P hacking, possible differences in study subjects in different centers, etc.) into account.

• Anoneuoid says:

If we are presented with one of the sub-populations from 0% to 100% but not told which it is, and are asked to estimate its parameter value by taking random samples from that unknown sub-population, the result would depend only on the proportion of bilingual people in the sample. The relative sizes of the sub-populations would not affect the result. As the sample size increases, the sample proportion converges on the ‘true’ value equally rapidly whether it turned out to be a large more common sub-population (e.g. the ‘48%’ sub-population) or a smaller rarer sub-population (e.g. ‘the 32%’). Therefore the ‘prior probability’ of the sub-population in the source population is not relevant to the result of a sampling process and to include it inappropriately would bias the random sampling process.

A very large sample would converge on the ‘true’ value of the parameter in the sub-population. We can therefore consider a series of hypothetical results based on such a very large sample. These can be regarded as possible ‘model’ subsets, each one containing the same number of elements (i.e. the same large number of random selection results). This equal number of elements in each subset means that the prior probability of each possible ‘true’ result subset would be the same (i.e. the priors are uniform). We can choose any number ‘m’ of ‘possible outcome subsets’ for our model depending on the precision required when estimating the parameter value. If we then make ‘n’ selections from each of these ‘possible outcome-subsets’ then the result of these n selections can be represented by X1, X2, … Xn, any one of these being Xj.

So, you are saying that “taking a random sample” means “assume any sample from the population is as likely as any other”? In one way or another this (putting *a uniform “prior” on which sample is observed*) works out to often provide near equivalent results to putting *a uniform prior on the parameter value being estimated*?

• Huw Llewelyn says:

Thank you for your comment. There are many ways of thinking about the concept. For example, a scientist ideally would like to make a very large number (M) of observations but is restricted in practice to a smaller number of ‘n’ observations. This process can be modelled mathematically by regarding each very large number of M observations as a random sample of size M that converges on the true value of the parameter. We can build our model by specifying a range of possible values V1 to Vz. The possible ‘convergent’ results of M sampling observations in our model will therefore range from V1 to Vz. This result will not depend on the prior probability of any Vi but only the value of each Vi. Each possible ideal experiment will contain M observations so each set will contain an equal number of elements.

If we conduct a study based on a limited number of ‘n’ observations, then we will be selecting these elements at random from the ‘z’ model sets, each containing M elements, so that the prior probabilities of each of these model sets will be 1/z. The conditional probability of any Vi will be small or may be a probability density. In order to get a meaningful probability, we will have to specify a range (e.g. Vi to Vi+k). The accuracy of this range will depend on the number ‘z’ of the possible values of the parameter V that we have specified.

• Anoneuoid says:

Thanks Huw,

However, I don’t think you actually gave me a yes or no answer as to whether I had successfully put what you were saying into my own words… After a quick search I think these refs (second comes from the first) describe what I was getting at:

the “intimate similarities” between a subjective exchangeable prior distribution and an objective distribution introduced by the design using simple random sampling.

• Huw Llewelyn says:

I clearly did not understand your question properly. I do not think of these concepts in terms of exchangeability. Does the response below to Carlos Ungil answer your question?

21. Roger says:

Your complaint seems to be that a confidence interval might not really be a confidence interval, because it might be calculated using some other assumption. But if it is a confidence interval, and both ends of the 95% confidence interval exceed zero, then the conclusion is correct.

• Jeff Walker says:

This raises a question on which I would like clarification. The original conclusion is “then we are at least 95% sure (under the assumptions of the model) that the parameter is positive”. If we apply a frequentist modeling strategy (not really an assumption), then the parameter is fixed so the probability that the interval includes the parameter is either 100% or 0% and the conclusion is wrong (giving rise to this post). So, to say that we are 95% sure that the parameter is positive requires something more than the assumptions, such as specifying a flat prior. But I think Huw Llewelyn is arguing that the flat prior is part of frequentist assumptions, “by definition” of the explicitly stated assumptions. So then, under Huw’s argument, the statement is correct. So, are flat priors something that needs to be added in addition to the usual assumptions, or are they an automatic part of the assumptions (that has been entirely or largely unrecognized)? (Andrew’s comment that he linked in response to Huw is not arguing against Huw’s argument but simply that flat priors are often misleading and never better than informative priors).

• Andrew says:

Roger:

No. Careful definitions of confidence intervals always emphasize that the 95% coverage is an average and is not the property of any particular interval. The above passage was from a textbook; in a textbook I want to get it right. Being sloppy in conversation is one thing; being sloppy in a textbook with tens of thousands of readers can create misunderstanding.

• Allan C says:

I just picked up a couple books off my shelf and here’s what they have to say (All approximately state the definition of a CI correctly or incorrectly in the same way, technically, but their use of English is highly variable)

An Introduction to Mathematical Statistics and its Applications (5th edition) by Larsen and Marx.

Page 298: “The usual way to quantify uncertainty in an estimator is to construct a confidence interval. In principle, confidence intervals are ranges of numbers that have a high probability of ‘containing’ the unknown parameter as an interior point. By looking at the width of a confidence interval, we can get a good sense of the estimator’s precision.”

Beginning Statistics with Data Analysis by Mosteller, Fienberg and Rourke

Page 235: “The confidence statement says that if we make such calculations over and over in different samples, then 95% of the time the interval between the lower limit and upper limit will contain the true mean value of the statistic. Different samples usually produce different intervals.”

Statistical Models in Engineering by Hahn and Shapiro

Page 75: “A precise definition of a (1-alpha)100 percent confidence interval for u would involve the following statement: ‘If in a series of very many repeated experiments an interval such as the one calculated were obtained, we would in the long run be correct (1-alpha)100 percent of the time in claiming that u is located in the interval…’ Unfortunately, such a statement is difficult to comprehend for the nonstatistician and awkward to interpret operationally. Therefore, we shall consider a (1-alpha)100 percent confidence interval as a range within which we are (1-alpha)100 percent sure that the true parameter is contained, recognizing that this interpretation takes liberties with the strict classical definition (because the true parameter is not normally regarded as a random variable)”

Statistics for Experimenters by Box, Hunter and Hunter (second edition)

Page 96: “Then a 1-alpha confidence interval for delta would be such that, using a two-sided significance test, all values of delta within the confidence interval did not produce a significant discrepancy with the data at a chosen value of probability alpha but all the values of delta outside the interval did show a significant discrepancy.”

Page 103: “The lower and upper 1- alpha confidence limits for sigma squared are of the form (n-1)s^2 / B and (n-1)s^2 / A where, as before, the confidence limits are values of the variance sigma squared that will just make the sample value significant at the stated level of probability.”

Statistical Decision Theory and Bayesian Analysis by Berger (2nd Edition)

Some lengthy discussion starting in Chapter 1. On Page 23 there is: “These are: (i) the motivation [of confidence intervals] is based on repeated use of delta for different problems; and {ii} a bound R HAT on performance must be found which applies to any sequence of parameters from these different problems.”

Page 140: “Since the posterior distribution is an actual probability distribution on theta, one can speak meaningfully (though usually subjectively) of the probabilities that theta is in C. This is in contrast to classical confidence procedures which can only be interpreted in terms of coverage probability.”

Principles of Applied Statistics by Cox

Page 141: “The empirical interpretation of, for example, a 97.5% upper limit is that it is calculated by a procedure which in a long run of repeated applications would give too small a value in only about 2.5% of cases……Consideration, at least in principle, of a confidence distribution shows that in most cases the values near the center of a confidence interval are more likely than those at the extremes and that if the true value is actually above the reported upper limit it is not likely to be far above it. Conventional accounts of confidence intervals emphasize that they are not statements of probability. However, the evidential impact of, say, a 97.5% upper confidence limit is much the same as that of an analogous upper probability or credible, limit. The crucial distinction is that different confidence-limit statements, even if statistically independent, may not be combined by the laws of probability.”

Survey Sampling by Kish

Page 15: “There is considerable disagreement about the meaning of those [confidence] intervals. I avoid any adjective for the intervals t_p se(y bar), and the reader will utilize them according to his knowledge.”

There are quite a few more descriptions I could add but these pretty much resemble the mix that I have come across (and I have to get going for New Years festivities).

I think, based on what we have seen, Kish had the most appropriate approach!

• Martha (Smith) says:

Allan:

For my approach to teaching confidence intervals, see pp. 28 – 43 of the Day 2 Slides here:
http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

(This is from a continuing education course for people who have already had an introductory statistics course; but I have used a similar approach in teaching introductory courses, and for beginning-of-semester reviews in master’s level courses in Analysis of Variance and Regression.)

• Allan C says:

Thanks Martha.

I especially enjoyed your bullet point format and cautionary notes about sampling distributions. Your notes, in general, seem like a good step towards better comprehension.

I’ve often wondered if an introduction to philosophy of science (with a focus on epistemology) would aid students in qualitatively evaluating quantitative methods, which I think should be the end goal for most introductory and mid-level stats courses. Perhaps even paired with a mathematical modeling course that will help students realize the false edifice of certainty that higher level mathematics, coupled with puzzling sounding language can create. There has been so much energy spent on trying to find better ways to convey statistics that I think we may be missing some of the underlying issues (such as no way to evaluate methods without an epistemology, etc.).

I am wholly unsure, however.

• Very nice notes Martha – thanks for sharing

• Roger says:

You say: “95% coverage is an average and is not the property of any particular interval.” Okay fine, everything is some sort of estimate and the true values might differ. But you also say “under the assumptions of the model”. The assumptions of the model are what allow you to calculate that confidence interval, and to be 95% sure that the parameter is within the interval ends. And if all that is true, then you can also be at least 95% sure that the parameter is positive, if the ends are also positive. What am I missing here?

• Corey Yanofsky says:

You seem like the kind of person the Fallacy of Placing Confidence in Confidence Intervals paper was written for.

• Andrew says:

Roger:

The assumptions of the model imply that, in repeated operation, 95% of the intervals will contain their true values in the long run. It does not imply anything about a particular 95% interval. You need more assumptions for that—in particular, you need the assumption of a uniform prior distribution on the underlying parameter, or something equivalent to that assumption. And for reasons we’ve discussed many times on this blog, that uniform prior distribution often does not make sense.
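The "repeated operation" reading can be checked with a small simulation. This is only a sketch under assumed conditions (normal data, known sigma, and the function name `coverage` and all its defaults are made up for illustration): it counts how often the interval contains the true mean across many repeated samples, and says nothing about any single interval.

```python
import random
import statistics

def coverage(true_mean, sigma=1.0, n=20, reps=10_000, z=1.96, seed=0):
    """Fraction of repeated-sample 95% intervals that contain true_mean."""
    rng = random.Random(seed)
    half_width = z * sigma / n ** 0.5  # known-sigma normal interval
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / reps

# Long-run coverage for a few hypothetical true means.
estimates = {mu: coverage(mu) for mu in (0.0, 0.3, 5.0)}
for mu, frac in estimates.items():
    print(f"true mean {mu}: fraction of intervals covering it = {frac:.3f}")
```

The fraction should come out close to 0.95 whatever the true mean is. That is the averaging statement; turning it into a probability about one particular interval needs the extra prior assumption described above.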

• Huw Llewelyn says:

Andrew and Roger:
But it is my contention that uniform prior distributions are an essential feature of random sampling used to estimate the value of a single parameter! My reasoning is described in the pre-print: https://arxiv.org/ftp/arxiv/papers/1710/1 and the above comments.

• Carlos Ungil says:

What would be the problem in combining random sampling and non-uniform priors? You assume the prior is uniform but it’s not clear why.

• Huw Llewelyn says:

There are two kinds of prior probability of course: ‘base-rate’ priors and non-base rate priors. A base-rate prior is conditional on a universal set U so that the sets of those with the predicted outcome H and the items of evidence E1 and E2 are all subsets of U so that p(U|H) = p(U|E1) = p(U|E2) = 1. Thus Bayes rule in terms of a universal set is:
p(H|E1) = p(H|U) x p(E1|H) / p(E1|U) = 1/{1+p(Ḧ|U)/p(H|U) x p(E1| Ḧ)/p(E1|H)}

The posterior probability p(H|E1) can become the ‘non-base rate’ prior probability with respect to the subsequent evidence E2. If there is statistical independence between evidence E1 and E2 with respect to the hypothesis H (such that p(E1˄E2|H) = p(E1|H) x p(E2|H)) and p(H|E1) is the conditional prior probability of H (and Ḧ = ‘not H’), then the posterior probability p(H|E1˄E2) is:
1/{1 + p(Ḧ|E1)/p(H|E1) x p(E2|Ḧ)/p(E2|H)}

However, if we regard U as a universal set so that p(H|U) and p(Ḧ|U) are equal, then p(H|E1˄E2) can also be calculated as:
1/{1 + p(Ḧ|U)/p(H|U) x p(E1|Ḧ)/p(E1|H) x p(E2|Ḧ)/p(E2|H)}

Because the base-rate priors are equal, the likelihood ratios and probability ratios are also equal (e.g. p(Ḧ|E1)/p(H|E1) = p(E1| Ḧ)/p(E1|H)) so that:
p(H|E1˄E2) = 1/{1 + p(Ḧ|U)/p(H|U) x p(Ḧ|E1)/p(H|E1) x p(Ḧ|E2)/p(H|E2)}

The question is, are Bayesian priors of the base-rate or non base-rate variety? In order for Bayesian priors to be base-rate, then the study evidence must be a subset of the situations exhibiting the prior evidence. However, this cannot be the case as the prior evidence (e.g. E1 above) on which a prior probability is based is independent of the subsequent evidence (e.g. E2 above), which means that a Bayesian prior has to be of a non base rate nature and that the base rate priors have to be uniform.
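The odds-form expressions above can be checked numerically. The sketch below uses made-up probabilities (the values 0.5, 0.8, 0.2, 0.6, 0.3 and the helper name `posterior_from_odds` are assumptions for illustration, not taken from the paper) and assumes E1 and E2 are independent given H and given not-H. It only verifies the arithmetic; it takes no side on whether base-rate priors must be uniform.

```python
import math

def posterior_from_odds(prior_h, likelihood_ratios):
    """Odds form of Bayes' rule: 1 / (1 + prior odds against H x product of LRs).

    Each likelihood ratio is p(E | not H) / p(E | H).
    """
    odds_against = (1.0 - prior_h) / prior_h
    for lr in likelihood_ratios:
        odds_against *= lr
    return 1.0 / (1.0 + odds_against)

# Hypothetical inputs: equal base-rate priors, conditionally independent evidence.
prior_h = 0.5
p_e1_h, p_e1_not_h = 0.8, 0.2
p_e2_h, p_e2_not_h = 0.6, 0.3
lr1 = p_e1_not_h / p_e1_h
lr2 = p_e2_not_h / p_e2_h

# Route 1: update on E1, then use that posterior as the prior for E2.
post_e1 = posterior_from_odds(prior_h, [lr1])
sequential = posterior_from_odds(post_e1, [lr2])

# Route 2: start from the base-rate prior and apply both likelihood ratios.
combined = posterior_from_odds(prior_h, [lr1, lr2])

# Route 3: direct enumeration of the joint probabilities.
joint_h = prior_h * p_e1_h * p_e2_h
joint_not_h = (1.0 - prior_h) * p_e1_not_h * p_e2_not_h
direct = joint_h / (joint_h + joint_not_h)

print(sequential, combined, direct)
# With equal base-rate priors, the posterior odds after E1 equal the likelihood ratio.
print(math.isclose((1.0 - post_e1) / post_e1, lr1))
```

All three routes give the same posterior here, and the last line holds only because the assumed prior is 0.5. With any other prior the posterior odds and the likelihood ratio differ.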

• Carlos Ungil says:

Reading this comment and looking at the paper again I guess you’re not really talking about priors in the usual sense. I don’t understand what you’re trying to show or if it is beyond the usual “if the prior is flat, the posterior is equal to the likelihood”, but if it doesn’t prevent Bayesian analysis using informative priors it’s ok for me.

• Carlos Ungil says:

If the uniform prior is an essential feature of random sampling, don’t you find troubling that it is not invariant under reparametrizations?

Taking your example of the proportion of bilingual people pz, you say that the prior is uniform in [0 1]. But for the same problem and the same random sampling you could analyze the data in terms of the ratio of bilingual to non-bilingual people x=pz/(1-pz). The range for that parameter is [0 infinity]. If we set the upper limit to 1000, to avoid issues with infinity, a uniform prior on [0 1000] for x would lead to different conclusions than a uniform prior on [0 1] for pz.
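To make the reparametrization point concrete, here is a rough numerical sketch. The data (3 bilingual people out of 10) are invented for illustration, and the function names are hypothetical. A flat prior on the odds x = pz/(1-pz) over [0, 1000] corresponds to a prior density on pz proportional to 1/(1-pz)^2, so both analyses can be run on the pz scale and compared.

```python
def prob_pz_above(threshold, k, n, prior_density, upper=1.0, steps=100_000):
    """P(pz > threshold | k of n) by midpoint-rule integration over pz in (0, upper)."""
    width = upper / steps
    total = 0.0
    above = 0.0
    for i in range(steps):
        pz = (i + 0.5) * width
        # Unnormalised posterior: prior density times binomial likelihood kernel.
        weight = prior_density(pz) * pz ** k * (1.0 - pz) ** (n - k)
        total += weight
        if pz > threshold:
            above += weight
    return above / total

def flat_on_pz(pz):
    """Uniform prior on pz over [0, 1]."""
    return 1.0

def flat_on_odds(pz):
    """Uniform prior on x = pz/(1-pz), expressed as a density on pz (dx/dpz)."""
    return 1.0 / (1.0 - pz) ** 2

K, N = 3, 10                # hypothetical sample: 3 bilingual out of 10
ODDS_CAP = 1000.0           # upper limit on x, as in the comment above
pz_cap = ODDS_CAP / (1.0 + ODDS_CAP)

p_flat_pz = prob_pz_above(0.5, K, N, flat_on_pz)
p_flat_odds = prob_pz_above(0.5, K, N, flat_on_odds, upper=pz_cap)
print(f"P(pz > 0.5) with flat prior on pz:   {p_flat_pz:.3f}")
print(f"P(pz > 0.5) with flat prior on odds: {p_flat_odds:.3f}")
```

With these invented numbers the two "flat" priors give roughly 0.11 and 0.25 for the same event from the same data, which is the lack of invariance being described.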

• Blissex says:

Having read a bit the comments I think that most of this discussion can be summarized as:

#1 “we are at least 95% sure” is sloppy wording that should not be in a textbook, regardless.
#2 The obvious interpretation of “we are at least 95% sure” involves assuming flat priors.
#3 Assuming flat priors is either inevitable or dangerous.

I think that everybody in this discussion agrees on #1 and #2; the dissent is indeed on whether assuming flat priors is good or bad.

My understanding is that for most people flat priors are indeed “good enough”, even if they are pretty bad in several situations, and that the argument is whether the latter point needs to be explicit in a textbook passage about confidence intervals. But perhaps I am oversimplifying.

• Huw Llewelyn says:

The choice of possible values of the parameter depends on the accuracy required when using our mathematical model to calculate the probability that a parameter value will fall within a specified range e.g. between 0.5 and 0.6. If we choose 100 data points for the range 0 to 1, then the probability of the result falling between 0.5 and 0.6 inclusive will be obtained by adding 11 individual probabilities. If we choose 1000 data points between 0 and 1 then we will sum 101 probabilities to calculate the probability of the result falling between 0.5 and 0.6. However, in the latter case of using 1000 data points, the probabilities of each of the data points will be 1/10th of those when basing the calculation on 100 data points, so that the sum for the range 0.5 to 0.6 will be approximately the same but more accurate when using 1000 data points.
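The grid arithmetic can be illustrated with a short sketch. The sample (11 out of 20) is invented, the function names are hypothetical, and "100 data points" is read here as a grid with spacing 1/100 (so 101 points including both ends), which is an assumption about what was meant.

```python
def grid_posterior(k, n, steps):
    """Posterior over the grid 0, 1/steps, ..., 1 with equal prior weight per point."""
    weights = [(i / steps) ** k * (1.0 - i / steps) ** (n - k) for i in range(steps + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def range_probability(posterior, steps, low, high):
    """Number of grid points in [low, high] and their summed posterior probability."""
    lo_idx, hi_idx = round(low * steps), round(high * steps)
    selected = posterior[lo_idx : hi_idx + 1]
    return len(selected), sum(selected)

K, N = 11, 20  # hypothetical data: 11 successes in 20 draws

post_100 = grid_posterior(K, N, 100)
post_1000 = grid_posterior(K, N, 1000)
count_100, prob_100 = range_probability(post_100, 100, 0.5, 0.6)
count_1000, prob_1000 = range_probability(post_1000, 1000, 0.5, 0.6)

print(count_100, round(prob_100, 3))
print(count_1000, round(prob_1000, 3))
# Per-point probability at 0.55 on the coarse grid relative to the fine grid.
print(post_100[55] / post_1000[550])
```

The coarse grid sums 11 probabilities and the fine grid sums 101, each fine-grid probability is about a tenth of its coarse-grid counterpart, and the two range totals agree only approximately because the endpoints carry more weight on the coarse grid.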

• Carlos Ungil says:

What I meant is that after the experiment and the random sampling and everything is done you could carry the analysis using your original variable pz or a transformed variable like x=pz/(1-pz).

In the first case, you say the prior for pz is uniform on the range [0 1]. So, for example, the prior probability for pz being over 0.9 is 0.1.

In the second case, applying your reasoning the prior for x is uniform on the range [0 infinity] and that’s not even a proper prior. There is a non linear correspondence between the transformed variable x and the original variable pz. pz=0 maps to x=0, pz=0.5 maps to x=1 and pz=1 maps to x=infinity. pz>0.9 maps to x>9. The prior probability for pz being over 0.9 is 1 (or as close to 1 as we want, if we approach the limit using a sequence of intervals [0 b]).

I hope we agree that applying a uniform prior in the first case and in the second case will lead to different conclusions about pz and that seems problematic to me. If your argument for the prior being uniform “by definition” applies in one case but not in the other, could you explain why?
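The prior probabilities quoted above follow from a one-line calculation, sketched here with a hypothetical helper `prior_prob_pz_above`. It assumes the odds x are uniform on [0, b] and uses the fact that pz > t is the same event as x > t/(1-t).

```python
def prior_prob_pz_above(threshold, upper_odds):
    """P(pz > threshold) when the odds x = pz/(1-pz) are uniform on [0, upper_odds]."""
    odds_threshold = threshold / (1.0 - threshold)  # pz > t  <=>  x > t/(1-t)
    return max(0.0, upper_odds - odds_threshold) / upper_odds

# Under a uniform prior on pz itself, P(pz > 0.9) is simply 0.1.
# Under a uniform prior on the odds, the answer depends on the upper limit b.
for b in (100, 1000, 10**6):
    print(b, prior_prob_pz_above(0.9, b))
```

As b grows the prior probability that pz exceeds 0.9 approaches 1, compared with 0.1 under the uniform prior on pz.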

• Huw Llewelyn says:

If you transform the parameter values, you may simply change the shape of the distribution eg by making it highly skewed. However, the probability of each transformed data point will be the same as the probability of each corresponding untransformed value.

The general point is that Bayesian priors are conditional on unspecified evidence that does not define a universal set. The prior probability in Bayes rule is by contrast conditional on a universal set.

If the prior evidence and subsequent study evidence are statistically independent then the prior probabilities of the possible parameter values can be regarded as uniform. The latter prior can be combined with the likelihood of the prior evidence to form a posterior probability that becomes a Bayesian prior. The latter can then be combined with study evidence to form a second posterior probability.

• Carlos Ungil says:

In a proper Bayesian analysis, the results should not depend on the parametrization used. If you always use a uniform prior, your results will not be invariant under reparametrizations.

Looking at pz or x=pz/(1-pz) or y=(1-pz)/pz or any other transformation of pz you will arrive at different conclusions if you apply a flat prior “by principle”. That’s unacceptable to a Bayesian, but it’s a “feature” of some frequentist methods.

If I understand your point, you are not trying to be Bayesian. Your goal seems to be to “explain” or “justify” the frequentist methods so not being invariant under reparametrizations is not at all an issue.

• “The assumptions of the model [do] not imply anything about a particular 95% interval. You need more assumptions for that—in particular, you need the assumption of a uniform prior distribution on the underlying parameter, or something equivalent to that assumption.”

Err… umm… the “bet-proof” concept allows you to say something about a particular interval. It needs some extra assumptions, and there is a Bayesian connection, but if I recall correctly the extra assumptions don’t have anything to do with a uniform prior (and again iirc the extra assumptions you do need are pretty innocuous in most cases anyway).

Nobody on this blog has been wildly enthusiastic about the concept, to put it mildly (me included – I am less optimistic than I was when I first pointed it out), but it’s there.

22. Blissex says:

OK, OK, so for a slightly tangential point but quite related: in reading some of the comments here and other discussions on this and other blogs I often think “what are these guys wittering about” and since I am not a professional statistician (even if I was tempted that way…) I generally assume that I don’t get it. But then I went back and read in another recent post by our blogger an example of worthy but sloppy wording:

«stay focused on the three key leaps of statistics:
– Extrapolating from sample to population
– Extrapolating from control to treatment conditions
– Extrapolating from observed data to underlying constructs of interest.
Whatever methods you use, consider directly how they address these issues.
»

At least the first relates to confidence intervals. But I would object to the use of the word “extrapolating”; I would use “inferring” instead, because we are assuming ergodicity.
And for my subjectivist/information theory based understanding of “statistics” that is a key.

And then I realized that a lot of my incomprehension is based on this: for me statistics is really a branch of engineering (one using a lot of maths), specifically signal processing, for a lot of statisticians is a branch of “mathematics”, a derivative of what they call “probability theory”.
Then I read approvingly of the guy who says “procedure” (which is similar to my thinking about “process”).
Then someone in the comments mentions Neyman’s 1937 paper, and that is entirely based on pre-information-theory thinking.

And then I realize that there is a further dimension to the engineering/”maths” divide: that experiment design and interpretation is really engineering, driven by science and maths, but really engineering like statistics, and civilization is doomed: we are asking “mere scientists” to do engineering, and no wonder that irreproducibility of results is common.

Perhaps a suggestion: for the statisticians that agree with the above to escape to departments of engineering, and change the name of what they do to something like “urnal process engineering” :-).

• Martha (Smith) says:

Blissez said “I realized that a lot of my incomprehension is based on this: for me statistics is really a branch of engineering (one using a lot of maths), specifically signal processing, for a lot of statisticians is a branch of “mathematics”, a derivative of what they call “probability theory”. … And then I realize that there is a further dimension to the engineering/”maths” divide: that experiment design and interpretation is really engineering, driven by science and maths, but really engineering like statistics”

I disagree with a lot of this. I don’t see statistics as a branch of engineering; I also do not see statistics as a branch of mathematics. I also don’t see experiment design and interpretation as engineering per se.

I do see statistics as in some ways parallel to engineering, in that it uses a lot of mathematics, but is not mathematics per se. Statistics is often useful in engineering, but is also useful in other fields, such as biology, medicine, economics, demography.

Bottom line: I think you are trying to put things too neatly into boxes and “divides”.

• Martha (Smith) says:

Oops — Blissex, not Blissez.

23. Nick Adams says:

Having read all the above comments, it’s time to wrap it all up and so I append an executive summary:

1. The formal definition of a 95% confidence interval will be unintelligible to 95% of the people who are foolish enough to take a stats course.
2. 95% of the time the 95% confidence interval will provide a reasonable estimate of the range in which the true value lies. The other 5% it’s wrong and possibly ridiculous.
3. The label “95% confidence interval” is unfortunate because of the common language connotation of the word confidence and the fact that the coverage is unlikely to be precisely 95%.

So let’s stop worrying about the formal definition, stop calling it a 95% confidence interval, and instead go and read some Fisher…(he hated them after all).

• Martha (Smith) says:

I agree that “The label “95% confidence interval” is unfortunate”

24. The discussion has been great. I’m still left with wanting a definition of CI that is honest and teachable without resorting to the (preferred) Bayesian approach, and a better label than ‘confidence interval’. The fact that these issues are so dicey is a fundamental problem to frequentists. I recall hearing Don Berry say that his inability to teach confidence intervals (which he ultimately viewed as a problem with the paradigm, not with his teaching) led him to convert to Bayes.

“Confidence” as a term was somewhat of a con, just as “exact” was in Fisher’s “exact” test.

• Huw Llewelyn says:

I have come to the conclusion that P values and CIs have such tortuous definitions as to make them impossible to interpret in a clear and logical way by scientists or doctors like me (or to teach sensibly to our students) and that they should be abandoned. Their definitions are mathematically clear of course and the temptation to cling to them after all these years is understandable. I agree with Bayesians that what we need to know is the probability of a study result falling into a specified range of study outcomes if the study were continued with an infinite number of observations. However, I disagree that this can be based on some prior unspecified ‘evidence’ by only applying Bayes rule. I think that there is much more to it than that.

• Allan C says:

I am slightly confused when you say that you “disagree that this can be based on some prior unspecified ‘evidence’ by only applying Bayes rule.” Are you saying some have argued that the only way to incorporate prior information into an approximation of the long run frequency is via Bayes?

• Huw Llewelyn says:

Not exactly. It is that many Bayesians use Bayes rule in a very special way, in that a Bayesian prior is not solely conditional on an explicit universal set but conditional on an unspecified universal set and some unspecified evidence. This posterior probability then becomes a ‘Bayesian prior’ based on the unspecified evidence. The reasoning processes used in science and medicine tend to be based on specified evidence and do not only use Bayes rule but also probabilistic versions of the syllogism and reasoning by elimination. These do not seem to be used in statistical analyses to my knowledge. I think that if this were done, statistical reasoning might align better with medical and scientific reasoning and there might be fewer misunderstandings.

• Carlos Ungil says:

I would say that it is you who is using Bayes rule in a very special way. The “general” way of performing a Bayesian analysis combining a prior and a likelihood to get a posterior is commonly used in science, by the way. Maybe your dislike for what you call “Bayesian priors” is common in medicine, I don’t know. But as far as I know the FDA is happy with standard Bayesian analysis (in clinical trials for medical devices).

• Huw Llewelyn says:

I did not wish to imply that the Bayesian prior was ‘peculiar’. I said it was ‘special’: a Bayesian prior is valid but different to the prior probability used in Bayes rule. I am aware that the Bayesian approach is used by some in medicine to combine a ‘prior probability’ based on a general impression (without specifying the symptoms, signs and test results used to arrive at that impression) with a test result which is linked to likelihood ratio based on real data. Another approach that is based on specified evidence and the principles of a ‘probabilistic elimination theorem’ can be used also in medicine, scientific reasoning and perhaps some aspects of statistical hypothesis testing. It is outlined by me in an Oxford University Press blog: https://blog.oup.com/2013/09/medical-diagnosis-reasoning-probable-elimination/

• Carlos Ungil says:

What does “Bayes rule” mean for you? How would you say that a “Bayesian prior” is combined with a likelihood function to obtain a posterior distribution if not through use of Bayes’ rule?

• Huw Llewelyn says:

I explained this in some detail in my response to you at 12.53pm on 1 January. In essence Bayes rule calculates a posterior probability of an outcome H conditional on a single item of evidence E1 by using the base rate prior probabilities of H and E1 and the likelihood of E1 conditional on H. A Bayesian posterior probability

• Huw Llewelyn says:

Sorry the following was omitted from the above response:

By contrast a Bayesian prior probability is also based on other prior evidence that does not form a universal set. The calculation makes an assumption of statistical independence between the prior unspecified evidence and the observed evidence.

• Carlos Ungil says:

Thanks for your reply. It’s difficult for me to follow your explanations because I think you have a problem with the terminology. A Bayesian calculation combines a prior distribution with a likelihood (which is not exactly a probability distribution) to obtain a posterior distribution using Bayes’ rule. The mechanics of the calculation are not controversial at all. A different question is what makes a “valid” prior (for example, a common theme in this blog is if you can get away with data-dependent priors and still call yourself Bayesian).

Saying that the Bayesian approach combines “two likelihood distributions” or “prior unspecified evidence and the observed evidence” is a very confusing way to refer to what is usually known as combining the “prior distribution” and the (data-derived) “likelihood” (using Bayes’ rule!).

It’s true that the prior might be the posterior from a previous analysis (and the posterior might become the prior for the next analysis). But this doesn’t change the fact that right now you start with a prior probability distribution and update it using the likelihood to end with a posterior probability distribution.
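The mechanics described here can be sketched on a grid. Everything in this example is hypothetical: the grid size, the prior shape, the two batches of binomial data, and the function name `update`. It shows that using the posterior from a first batch as the prior for a second batch gives the same result as updating the original prior once on the pooled data.

```python
def update(prior, grid, k, n):
    """Bayes' rule on a grid: posterior is prior x likelihood, renormalised."""
    # The binomial coefficient is omitted because it cancels on normalisation.
    unnorm = [pr * p ** k * (1.0 - p) ** (n - k) for pr, p in zip(prior, grid)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

STEPS = 200
grid = [i / STEPS for i in range(STEPS + 1)]

# Hypothetical non-uniform prior, proportional to p * (1 - p).
raw = [p * (1.0 - p) for p in grid]
prior = [r / sum(raw) for r in raw]

# Two hypothetical batches of data: 4 of 10, then 9 of 15.
after_first = update(prior, grid, 4, 10)
sequential = update(after_first, grid, 9, 15)   # posterior reused as the prior
batch = update(prior, grid, 4 + 9, 10 + 15)     # single update on pooled data

max_gap = max(abs(s - b) for s, b in zip(sequential, batch))
print(f"largest difference between the two routes: {max_gap:.2e}")
```

The difference is at the level of floating-point rounding. The calculation is the same whether the starting distribution is called a prior or a previous posterior.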

• Chris Wilson says:

Huw, what do you mean by saying “The calculation makes an assumption of statistical independence between the prior unspecified evidence and the observed evidence.” ? At its essence, the Bayesian mechanics involve decomposing the joint distribution p(y,par) as p(y|par)*p(par) (ignoring the normalizing constant for now). What this assumes is the product rule of probability. I also agree with Carlos that your terminology is confusing and difficult to follow, and in places feels unnecessarily so. Just my $0.02, feel free to ignore :)

• Huw Llewelyn says:

The terminology of statistics is already a nightmare and I apologise because this has been compounded by some typos of mine (e.g. omitting the term ‘distribution’). One source of confusion is a failure to distinguish clearly between ‘base-rate’ and ‘non-base rate’ prior probabilities (I dislike these terms!). A Bayesian prior probability distribution for an unknown parameter is of the ‘non base-rate’ variety. It is combined with a likelihood probability distribution based on data to create an updated posterior probability distribution. (I contend that when random sampling is used to estimate the value of some single parameter, although the non-base rate prior may not be uniform, the base-rate prior is uniform.) The following extract from the next edition of the Oxford Handbook of Clinical Diagnosis tries to explain the difference between a ‘base-rate’ and ‘non base-rate’ prior. (I would also like to point out that I refer to the ‘product rule’ as an assumption of ‘statistical independence’).

“Bayesian statisticians emphasise the importance of specifying an ‘informal prior probability’ based on informal evidence that is then combined with substantiated probabilities (i.e. based on observations that others can share). For example, a Bayesian might suggest on the basis of such ‘informal evidence’ that the ‘prior probability’ of finding someone with appendicitis in a study is 0.6. When this ‘informal evidence’ is combined with another finding (e.g. LRLQ pain) a new posterior probability is created. This posterior probability then becomes the new prior probability if the evidence so far is combined with yet another finding (e.g. guarding).

It should be emphasised at this stage that there are two types of prior probability (1) the ‘base rate prior’ based on the universal set and (2) the non-base rate prior based on the universal set and one or more of its subset(s). The base rate prior proportion and probability for appendicitis is 100/400, if the universal set is a group of 400 patients studied to which patients with all the other findings belong (i.e. those with appendicitis, no appendicitis, LRLQ pain, no LRLQ pain, guarding, the ‘informal evidence’, etc.). The patients showing the ‘informal evidence’ used for the Bayesian prior cannot be assumed to be a ‘universal set’ of which those patients with LRLQ pain, guarding, appendicitis and NSAP were subsets. We have to assume therefore that those with the ‘informal evidence’ could be a subset of the 400 studied, giving rise to a non-base rate prior of 0.6.

The ‘non-base rate’ prior probability of 0.6 can be used to calculate a ‘posterior probability’ of appendicitis (Appx) by combining the ‘informal evidence’ (IE) and LRLQ pain:
1/{1 + [Pr(No Appx|IE)/Pr(Appx|IE)] × [Pr(LRLQ pain|No Appx)/Pr(LRLQ pain|Appx)]} = 1/{1 + [(1 − 0.6)/0.6] × [(125/300)/(75/100)]} = 0.73

The above calculation implies that there is statistical independence between the frequency of occurrence of the ‘informal evidence’ (IE) and LRLQ pain in those with appendicitis, and in those without appendicitis. For example, if the proportion of patients with the ‘informal evidence’ in those with appendicitis was 9/100 and its frequency in those without appendicitis had been 6/300, then the assumption of statistical independence means that the proportion with the informal evidence and LRLQ pain in those with appendicitis would be assumed to be 9/100 × 75/100 = 6.75/100. Similarly the proportion with the informal evidence and LRLQ pain (i.e. ‘IE & LRLQ pain’) in those without appendicitis would be assumed to be 6/300 × 125/300 = 2.5/300. We can now calculate the estimated proportion with appendicitis by using the base-rate prior proportion of 100/400 for appendicitis in the group studied (SG). Again, it is 0.73:
1/{1 + [Pr(No Appx|SG)/Pr(Appx|SG)] × [Pr(IE & LRLQ pain|No Appx)/Pr(IE & LRLQ pain|Appx)]} = 1/{1 + [(300/400)/(100/400)] × [(2.5/300)/(6.75/100)]} = 0.73”
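
The two calculations in the extract can be checked numerically. The sketch below uses only the counts quoted above; the function name `posterior_from_odds` is a hypothetical helper, and the product of likelihoods in the second route is exactly the independence assumption the extract describes.

```python
def posterior_from_odds(prior_prob, lik_if_absent, lik_if_present):
    """Posterior probability from prior odds times a likelihood ratio (Bayes' rule in odds form)."""
    prior_odds_against = (1 - prior_prob) / prior_prob
    return 1 / (1 + prior_odds_against * lik_if_absent / lik_if_present)

# Route 1: non-base-rate prior of 0.6 (given the informal evidence) combined with LRLQ pain.
route1 = posterior_from_odds(0.6, 125 / 300, 75 / 100)

# Route 2: base-rate prior of 100/400 combined with 'IE & LRLQ pain',
# whose likelihoods are products under the assumed statistical independence.
ie_and_pain_if_appx = (9 / 100) * (75 / 100)      # 6.75/100
ie_and_pain_if_no_appx = (6 / 300) * (125 / 300)  # 2.5/300
route2 = posterior_from_odds(100 / 400, ie_and_pain_if_no_appx, ie_and_pain_if_appx)

# The 0.6 prior itself follows from the base rate and the informal evidence alone.
prior_given_ie = posterior_from_odds(100 / 400, 6 / 300, 9 / 100)

print(round(route1, 2), round(route2, 2), round(prior_given_ie, 2))
```

Both routes agree because the second simply folds the informal evidence into the likelihood instead of into the prior.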

• Huw, I would say we’ve had many philosophical discussions on Bayesianism on this blog and your take on this seems to be in a different direction than the way that is commonly discussed here. So much so that I don’t recognize what you’re really talking about by the names you use etc.

The usual characterization of Bayes is that it calculates a degree of plausibility of an assertion about a true or false claim. The next more subtle thing is something I’m working on where it’s more of a degree of accordance with both theory and data. This enables you to have a meaningful discussion about Bayesian models for things where the model isn’t “perfect” and so “truth” is not well defined. I’ve got a half written paper on that.

But in all of these philosophical discussions one thing has been true: there is never anything called the “base rate” which meaningfully enters into the philosophy. A “Base rate” might be one piece of information you would use to assign a degree of plausibility or accordance or whatever, but it doesn’t hold any fundamental position in the philosophy. Given this, your description reads to me like someone coming from some existing well-developed background very different from “ours” here at the blog, and having a lot of specific ideas couched in that framework, but we don’t recognize that framework and so we’re all talking past each other.

Perhaps it’s just a terminology issue, but as Corey says, it mostly seems very idiosyncratic.

• Curious says:

Daniel,

How does a base rate differ philosophically from a point estimate from previous research used to specify the mean of a prior?

• Curious, well first off I hear base rate and I think frequency. So if we are trying to estimate a frequency then fine, but if we are demanding that a probability associated with a parameter be a frequency, then this is not what a Bayesian probability is… So that seems confusing.

Next, philosophically, any quantity can have a Bayesian probability distribution assigned to it, and that quantity can have any dimensions, for example something like length^3/time/temperature. If you have a historical record of such a quantity and use it to assign a location parameter to the distribution over it, in what sense is that a “base rate”? I’m truly completely lacking an answer.

It seems Huw has some ideas in mind that don’t align.

• Curious says:

Daniel,

I know you seriously answer and genuinely attempt to understand others’ perspectives in your responses on this blog, which is why I am confused by your response here, in that it seems you are determined to misunderstand Huw’s labels, which seem obvious. Huw is using an epidemiological example of the incidence of an event in a population and using the term base rate to refer to that incidence. The notion is that a screening tool is only useful if it provides information that improves the identification of someone with the condition beyond random selection, for which the base rate would be the estimated population probability.

I don’t understand what is confusing about that. If you are saying that this term is specific to diagnosis and selection problems and not to problems of physical distance, then sure I suppose I understand your point, but it is simply another term for the probability of incidence in a population which can be used to inform a model.

Let’s say we created a binary logistic model using a beta prior and estimated theta for the LRLQ. How would you assess whether this model combined with this screening tool provides utility to the diagnostician?

• Huw Llewelyn says:

You may know the ‘base rate’ prior probability (a term used in the phrase ‘the base-rate fallacy’) as the ‘unconditional prior probability’ [e.g. written as the p(A) or p(B)] of Bayes rule. Thus Bayes rule is p(A|B) = p(A) x p(B|A) / p(B). The term ‘unconditional’ is another confusing statistical misnomer of course because by p(A) and p(B) we really mean p(A|U) and p(B|U), U being some universal set such that A⊆U and B⊆U so that p(U|A) = 1 and p(U|B) = 1.

I explain this in the first few pages of Chapter 13 of the 3rd edition of the Oxford Handbook of Clinical Diagnosis (see http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-13 ). I am aware that Bayesians claim that probabilities are degrees of belief and not observed frequencies but the point I make is that probabilities obey the same rules as proportions even if they are imaginary proportions. I am in the process of rewriting this chapter for the 4th edition and sent an extract in my entry to this blog at 3.08pm on 3 January.

I am not disputing the way Bayesians use ‘Bayesian priors’ but am simply putting these Bayesian priors in the wider context of probability theory. The way that I explain the basics of probability theory makes it consistent with the way that my medical colleagues and I use the concepts verbally during discussions with each other, patients, etc.

In my recent paper preprint (https://arxiv.org/ftp/arxiv/papers/1710/1710.07284.pdf), I show that during random sampling to estimate the value of a fixed parameter, the underlying ‘unconditional prior’ (AKA ‘base rate prior’ AKA ‘prior probability conditional on a universal set’) is uniform even though the Bayesian prior probability conditional on other unspecified evidence is not uniform. This also allows us to calculate a frequentist posterior probability distribution based on data alone that can be combined with a Bayesian probability distribution. An advantage of ‘looking behind’ Bayesian probability distributions at the underlying uniform ‘base rate’ priors is that it allows frequentist and Bayesian concepts to be combined.

I hope that this clarifies my reasoning.

• Curious, thanks for the kind words. I truly am confused though. My impression was that Huw was talking about something more universal than diagnosis. Perhaps that is what confused me.

The mathematics of probability is the same whether they are thought of as proportions or degrees of plausibility. This is true. However using this fact to invent some sort of proportion story around a Bayesian analysis has been the single biggest source of confusion around interpretation of Bayesian analysis, so I’m generally not in favor of that. It becomes even more confusing when we think about Bayesian analysis of proportions.

Suppose we want to estimate the Bayesian probability under some model that the proportion of coffee drinkers who add milk is less than 30%…

You could imagine the set of coffee drinkers. Then you could imagine sampling from them uniformly using an rng. We now have a frequency probability that the sample will contain less than 30% milk takers. And this is conceptually totally different from the Bayesian probability that the full set of coffee drinkers has less than 30% of its population milk takers. There is not a need for any confusing “two kinds” of priors here.
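
To make the contrast concrete, here is a minimal sketch of the two different probabilities. Everything numeric is a made-up assumption (sample size 50, a "known" population proportion of 0.35 for the frequency calculation, 12 observed milk takers and a uniform prior for the Bayesian one), and the population proportion is treated as a continuous parameter. The helper `beta_cdf_int` relies on the standard identity linking the Beta CDF with integer parameters to a binomial tail.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a, b (binomial-tail identity)."""
    n = a + b - 1
    return sum(binom_pmf(j, n, x) for j in range(a, n + 1))

n = 50  # hypothetical sample size

# (1) Frequency probability: the population proportion is treated as known,
#     and the randomness is in which coffee drinkers the rng happens to pick.
true_p = 0.35  # hypothetical population proportion of milk takers
max_k = 14     # largest count strictly below 30% of 50
p_sample_below = sum(binom_pmf(k, n, true_p) for k in range(max_k + 1))

# (2) Bayesian probability: the data are fixed, and the probability describes
#     plausibility of the unknown population proportion under the model.
milk = 12                         # hypothetical observed number of milk takers
a, b = 1 + milk, 1 + (n - milk)   # uniform Beta(1, 1) prior -> Beta(13, 39) posterior
p_population_below = beta_cdf_int(0.30, a, b)

print(f"P(sample share < 30% | p = {true_p}) = {p_sample_below:.3f}")
print(f"P(population share < 30% | data)    = {p_population_below:.3f}")
```

The two numbers answer different questions and need not be close; neither calculation requires a second kind of prior.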

• Huw Llewelyn says:

Daniel. I am talking about something more universal than diagnosis – and also more universal than the Bayesian approach to diagnosis and statistical inference for that matter. It is the use of probability theory to explain human verbal reasoning in all walks of life but which is done intensely in medical settings. You must not forget that a working diagnosis is simply an example of a hypothesis and that a final diagnosis is an example of a theory. The same probability theory applies to both.

You end by talking about coffee drinkers in a population who drink milk. Instead of coffee drinkers, I use the example of people who are bilingual in my pre-print, and discuss the issues of uniform base-rate priors, non-uniform and non-base rate priors and posterior probabilities carefully and in some detail.

• Huw Llewelyn says:

Thank you everyone for your comments, which have been very helpful.

• Allan C says:

If I understand your views correctly your supposition is that there are two ways in which specification of a prior probability distribution occurs in practice: the first, where the prior is elicited from codified evidence which is there for all to see (hence a true reference set) and the second, where the prior is not elicited from a verifiable reference set but is rather based on some non-codified background information (or not).

To me the first is just empirical Bayes. And the second doesn’t have to be special in any sense. There are many times where the prior is not about specifying a distribution you actually believe the data generating mechanism to be consistent with but rather is constructed to help regularize or bound inferences.

How you know to regularize is a good question, and is domain specific; it can be based on past experience with related phenomena or could be a logical/economical constraint or could be based on something else peculiar to the domain. The reasons for the regularization can and should be stated alongside the actual prior. I do not see what is so inherently special about this.

• Huw Llewelyn says:

The Bayesian approach is essentially to combine two or more statistically independent likelihood distributions arising from a single parameter. These may be based on real data, pseudo data or some other method of arriving at the distribution.

Bayesian calculations assume implicitly that the underlying base rate prior probabilities are uniform. This allows the joint or individual distributions to be normalised to become posterior probability distributions. I exploit the same principle to create ‘frequentist’ posterior probability distributions in the above paper (https://arxiv.org/ftp/arxiv/papers/1710/1710.07284.pdf)

• Corey says:

Which Bayesians have you been reading? This is a view I would characterize as idiosyncratic, as I haven’t seen it expressed elsewhere.

• Martha (Smith) says:

For a few years, I taught a prob/stat course in a summer master’s program for secondary math teachers. They all had a (minimalist) intro stat course first. So one thing I did early was to try to get them to understand the correct definition of confidence interval. They, understandably, weren’t too happy. But then after some preliminary stuff on medical testing, I segued easily into some basic Bayesian statistics. They loved it.

25. Thomas says:

I teach estimation to medical students in this way:
– first establish the difference between population and sample, parameter and estimator
– demonstrate the concept of sampling variation when estimating a mean mu by x-bar. We use a simulation web applet where you can change the shape of the distribution of x, and sample size N. This conveys the essence of the central limit theorem, including the normality of x-bar regardless of the distribution of x, and the role of N for variability.
– pretend to be omniscient god, and define a prediction interval for x-bar (dependent on mu, sigma, N)
– become human again, and define the confidence interval for mu given a unique result x-bar, as the set of values of mu whose prediction interval would include that specific x-bar (pretty much like your explanation using testing, but we have not covered tests at that point). We signal but do not insist on the substitution of sigma by s.
– finally some examples from recent articles, to insist that there is no uncertainty about the observed estimate, but there is uncertainty about the unknown parameter.
– they seem ok with that…
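
The steps above can be sketched as a small simulation. This is not the web applet mentioned in the list, just an illustration under assumptions: a skewed exponential population with mean 1 and standard deviation 1, sigma treated as known, N = 50, and a normal-approximation multiplier of 1.96. The function names are hypothetical.

```python
import random
from math import sqrt

random.seed(1)

MU, SIGMA, N = 1.0, 1.0, 50  # exponential(1) population: mean 1, sd 1, clearly skewed
HALF_WIDTH = 1.96 * SIGMA / sqrt(N)

def sample_mean():
    """Draw N observations from the population and return x-bar."""
    return sum(random.expovariate(1.0) for _ in range(N)) / N

def prediction_interval(mu):
    """'Omniscient' view: where x-bar should fall about 95% of the time, given mu."""
    return mu - HALF_WIDTH, mu + HALF_WIDTH

def confidence_interval(xbar):
    """'Human' view: all values of mu whose prediction interval would contain x-bar."""
    return xbar - HALF_WIDTH, xbar + HALF_WIDTH

reps = 10_000
covered = 0
for _ in range(reps):
    lo, hi = confidence_interval(sample_mean())
    covered += lo <= MU <= hi  # does this particular interval contain the true mean?

coverage = covered / reps
print(f"Share of intervals containing mu: {coverage:.3f}")
```

The 95% describes the long-run share of intervals that contain mu across repeated samples; any single computed interval either contains mu or it does not.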

To more advanced students I would say that “95% probability” is a statement about the random vector (L, U), but once data have been observed, there is nothing random about (l, u). (Sometimes this leads to hairy arguments: a fetus has a 50% chance of being a girl, but once the baby is born, it’s either a girl or a boy; a CI is like a swaddled baby where you can’t tell the gender, etc. At that point I break down and agree that the CI gives you a pretty good idea about what the parameter might be given the data…)

My pick for a label would be “best guess interval”

• Martha (Smith) says:

1. A quibble with “a fetus has a 50% chance of being a girl, but once the baby is born, it’s either a girl or a boy”: Usually, the sex is set at conception; but we may not know the sex until later. And nowadays, sex is often identified in utero by ultrasound.

2. I agree with “best guess interval.”

• Thomas says:

re 1: sure, you are right. It would be more accurate to say that pre-conception a future baby has a 0.5 probability of being female, and that a CI is like an ongoing pregnancy before the ultrasound test. But one is no better off in terms of predicting gender before and after conception, whereas a CI is based on data. That’s where it doesn’t really help with understanding CIs. A better analogy might be Mendelian transmission rules, using the parents’ phenotypes or genotypes as observed data? If you came up with a good analogy I’d be happy to use it.

re 2. or plausible interval? I think that I have used the phrase “set of plausible values for the parameter” without much thought

26. Daniel et al.,

Alberto Abadie (MIT) has just come out with a working paper that is (I think) similar to your Mars rover setup except for frequentist significance tests rather than frequentist CIs:

“[W]e formally adopt a limited information Bayes perspective. In this setting, agents representing journal readership or the scientific community have priors, P, over some parameters of interests, θ ∈ Θ. That is, a member p of P is a probability density function (with respect to some appropriate measure) on Θ. While agents are Bayesian, we will consider a setting where journals report frequentist results, in particular, statistical significance. Agents construct limited information Bayes posteriors based on the reported results of significance tests. We will deem a statistical result informative when it has the potential to substantially change the prior of the agents over a large range of values for θ.”

And from the conclusion:

“In this article, we have shown that rejection of a point null often carries very little information, while failure to reject is highly informative.”

Alberto Abadie, “Statistical Non-Significance in Empirical Economics”
Working Paper, March 2018
https://economics.mit.edu/files/14851

• Keith O'Rourke says:

> limited information Bayes posteriors based on the reported results of significance tests
That would involve discerning the (marginal) likelihood of just what was observed – the reported results of significance tests.

So the resulting re-weighting of the prior distribution is greater for a failure to reject than for a rejection – interesting.
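
A rough sketch of that idea follows. It is not the paper's derivation: it assumes a standard normal prior on theta, a normally distributed estimate with a hypothetical standard error of 0.1, a two-sided 5% test of theta = 0, and a simple grid approximation. The only "data" the reader sees is whether the test rejected, so the likelihood is the probability of that reported outcome at each theta.

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2 * pi)

SE = 0.1     # hypothetical standard error of the estimate
CRIT = 1.96  # two-sided 5% test of theta = 0

def prob_reject(theta):
    """P(|estimate / SE| > CRIT) when estimate ~ Normal(theta, SE^2)."""
    shift = theta / SE
    return norm_cdf(-CRIT - shift) + 1 - norm_cdf(CRIT - shift)

# Grid approximation with a standard normal prior (an assumption, not from the paper).
step = 0.01
grid = [i * step for i in range(-600, 601)]
raw_prior = [norm_pdf(t) for t in grid]
prior = [w / sum(raw_prior) for w in raw_prior]

def posterior(likelihood):
    """Reweight the prior by the likelihood of the reported outcome and normalise."""
    weights = [w * l for w, l in zip(prior, likelihood)]
    norm = sum(weights)
    return [w / norm for w in weights]

post_reject = posterior([prob_reject(t) for t in grid])
post_accept = posterior([1 - prob_reject(t) for t in grid])

def total_variation(p, q):
    """Half the summed absolute difference: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

tv_reject = total_variation(prior, post_reject)
tv_accept = total_variation(prior, post_accept)
print(f"shift in prior after 'reject':     {tv_reject:.3f}")
print(f"shift in prior after 'not reject': {tv_accept:.3f}")
```

With these made-up settings the test rejects for most prior-plausible values of theta, so learning "rejected" barely reweights the prior, while learning "not rejected" concentrates it near zero. A much larger standard error or a much tighter prior could reverse the ordering.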

• Dale Lehman says:

I just moments ago came across the same paper (on Marginal Revolution – and left a comment there). What is unbelievable is that there is no reference to Andrew’s work. This relates to an earlier discussion on this blog. I think it is an example of how economists get things published – don’t read too much – then pretend you discovered something new.

• Andrew says:

Dale:

I do suspect that if they’d carefully digest the work of my collaborators and myself, that this could improve their understanding. But it’s more important to me that they get things right—or, close to right—than that they cite me. I’ve been frustrated for a long time with many economists’ naive views regarding identification, rigor, hypothesis testing, unbiasedness, etc., so if they come to discover type M and type S errors through the back door, that’s great. And if what’s necessary for them to believe it is that it be written in economists’ language, so be it.

• Dale,

Is this a case of just not citing related literature, or are any of the technical results in the paper actually not novel? I suspect you mean the former but it would be good to know if it’s the latter.

FWIW, I’m also an economist, and I also get annoyed when I see members of the tribe failing to cite the statisticians who first came up with the ideas. But in the examples that I can think of, the econometricians come off well and it’s the economists who are lazy at citation. The classic example is what economists would call “White standard errors” … even though IIRC White cited Huber in his 1980 paper.

• Dale Lehman says:

It is the former case (I haven’t read it carefully enough to say whether the technical results are novel or not). I also anticipated Andrew’s response. In terms of making progress, I also would not care whether people discover what others have discovered, as long as the movement is in the right direction. However, given the recent discussion about how to get published, I think this is an example of what bothers me about the advice to “not read too much” as a good formula for increasing publications.

• Ah, then that’s not so bad. It’s a very short working paper and this is presumably an early version, the focus is on the technical results, there are some papers by statisticians that show up in the references but not (yet?) in the main text … maybe the lit review will get updated before it gets published. :)

(Apologies for the double posting. Andrew, if it’s not too much trouble you can delete the other copy of this comment – I had intended to reply to Dale but messed it up.)
