You then proceed to write your article, book, mission statement, treatise, what have you, as if you were correct this particular time that the parameter lies in the interval. You only get a chance to be correct 95% of the time if you make the claim that it’s in the interval every time.

I’m amazed that the same people who balk at that advice have no problem saying, “there was an effect with a mean of 2.0, p < .05,” and then discussing it as if there really is an effect without any qualification, statement of confidence, etc. Even if you remind them they're probably only correct about that statement about 50% of the time at best, they still run with it like it's truth granted from God and is irreversible. Any contrary evidence only shows you moderators.

]]>OK, fair enough. I still think that one could define “we’re 95% confident” in this way, but then this means that this way of speaking doesn’t rule out that we get a different confidence value from a different procedure for the same data, so it’s also fine to argue that this definition, although not “wrong”, is somewhat misleading.

My main point was not about the correctness of the statement but rather about the fact that the statement, if used along these lines, is not an interpretation because it doesn’t interpret or explain anything in terms that would be comprehensible outside the CI context. So I’m with you in objecting to its use, if for different reasons.

]]>Richard:

I would be interested to know what you think of the Mueller-Norets (2016) paper and the “bet-proofness” concept for interpreting realized CIs that I’ve been rabbiting on about (sorry) in the comments here and in the other blog post. The previous literature M-N cite overlaps with what you cite in some of your papers (Buehler 1959, Robinson 1979, maybe some others?). “Bet-proofness” seems like a legitimate interpretation for realized CIs that is not unintuitive, and is applicable to a lot of everyday cases (what M-N call “standard problems”). Their paper is about extending it to some non-standard problems.

–Mark

]]>Christian, as we argue here, any assignment of a specific “label” – probability, confidence, call it X-ness – will fall victim to the reference class problem, because there are any number of procedures that could have generated the specific interval that have different X-ness. The obvious examples are mixture procedures (flip a coin, produce one or another interval), but other functions of the test statistics might also produce the interval in question in the specific case while having different confidence coefficients. It never makes sense to take the confidence coefficient and apply it to the interval, regardless of what you call it.

]]>Simon,

The “plausibility” of values within a CI is in fact associated with a probability distribution. That is, not all values within a CI are equally “plausible”.

For instance, if your data follows a normal distribution, then under repeated sampling, if you divide the 95%CI into 4 equal-sized parts, the true mean lies within the center two parts ~68% of the time and within the outer two parts ~27% of the time. Here, values within the CI are more “plausible” the closer they are to the point estimate (i.e., center) of the CI.
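
The split quoted above can be checked directly. This is only a sketch under assumptions the comment does not spell out: a z-based interval for a normal mean with known standard error, evaluated under repeated sampling. The function name `ci_quarter_shares` is made up for illustration. With the exact 1.96 multiplier the shares come out near 67% and 28%; the ~68% and ~27% figures match the common "plus or minus 2 standard errors" rule of thumb.

```python
from statistics import NormalDist

def ci_quarter_shares(level=0.95):
    """Split a z-based CI into four equal-width parts and return the
    long-run share of samples in which the true mean falls in the two
    central parts and in the two outer parts (hypothetical helper)."""
    z = NormalDist()
    half_width = z.inv_cdf(0.5 + level / 2)  # about 1.96 standard errors at 95%
    # The two central quarters span +/- half of the half-width around the estimate.
    inner = 2 * z.cdf(half_width / 2) - 1
    outer = level - inner
    return inner, outer

inner, outer = ci_quarter_shares()
print(f"central two quarters: {inner:.3f}, outer two quarters: {outer:.3f}")
```

These are statements about the procedure over many samples, not about any single computed interval.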

]]>I haven’t read all comments and perhaps somebody else has already noted this, but…

“we can be x% confident that…” has no meaning whatsoever outside the context of confidence intervals.

I’m actually with Russ Lyons: “Russ Lyons, however, felt the statement ‘We can be 95% confident that the true mean lies between 0.1 and 0.4,’ was just fine. In his view, ‘this is the very meaning of “confidence.”’”

But if this is so, saying that “We can be 95% confident that the true mean lies between 0.1 and 0.4.” doesn’t *add* anything to the calculation of the CI. It is certainly not an interpretation, it’s just the very same thing.

]]>When I was teaching I had a freshman/sophomore honors college course, taught as a seminar, and on the first or second day I did the following experiment with the class:

I drew a coin from my pocket and asked, “what’s the probability that it will come up heads when I flip it?” The class discusses this and everyone says 50%.

I flip the coin so that it falls on the floor and before seeing what came up I put my foot on the coin. So none of us knows how the coin came up. I ask again, “what’s the probability that the coin is showing heads?” More discussion. Most of the class will say 50%, but a few (usually students who have taken AP stats in high school) will say it’s either 0 or 1 but I don’t know which. This shows a divergence between a Bayesian and a frequentist interpretation of probability, which I exploit briefly by asking about betting on whether it’s heads or tails…even the ones who said that the probability is 0 or 1 but I don’t know which are still willing to take an even-money bet on its being heads.

I then (out of sight of the students, the coin is on the floor and they can’t see it) peek at the coin and determine how it came up. I then ask again, what’s the probability that it’s heads? Again a dichotomy, although not everyone who said 50% after the second round is sure that that’s the right answer now. But again, everyone is willing to take the even-money bet. (You can always allow them to bet on tails at even money as well, of course).

I then tell them (truthfully) what I saw, and I ask again, what’s the probability that it’s heads. This causes something of a conundrum since the class now has to guess whether I’m telling the truth or not. Regardless of whether I’ve seen heads or tails, few are confident enough to say 0% or 100% probability (whichever is appropriate), but no one is willing to say 50% either! Nor is anyone willing to offer or take an even-money bet.

I then invite one student to look at the coin and say what that student saw. With only one exception, the student told the truth and agreed with me, and then most of the class (not all!) were willing to go with 0% or 100%, though some were still cautious and only moved their estimate in the appropriate direction.

I do this at the end of the class, and the students are invited to look at the coin themselves as they leave.

As I said, there was one exception, a student who said that the coin was showing the opposite of what I had said and what was in fact showing. This happened the very first time I gave this class, and I was in fact very pleased that this student did this because the discussion that ensued was very interesting. That student went on to turn down a Rhodes fellowship, to study stats at Cambridge under a Marshall scholarship, earned a PhD in stats here in the states, and was awarded tenure a few years ago. I am enormously proud of him.

In any case, this experiment gives students a sense of the distinction between Bayesian and frequentist interpretations of probability, and allows me to start out this course on Bayesian decision theory with a concrete example that helps them sort it all out. Discussing probability in terms of bets on this real-life example also introduces that approach to defining probability. [Note that this was a non-calculus course that has been taken by students with all sorts of majors…most of them are not mathematicians, there are usually a few pre-med students and since some of the examples I give later in the course are from medicine they find useful things there…also pre-law, but many other majors as well. I even had one dance major early on, and she did just fine.]

]]>“Realized CI” is terminology that is sometimes used to distinguish CIs calculated using sample data from unrealized CIs, i.e. the CI procedure. I think it’s pretty clear.

And (for “standard cases” at least), realized CIs do have a formal interpretation – see the paper by Mueller & Norets and the Casella paper on conditional inference, both cited below. It’s rather different from the interpretation of an unrealized CI (coverage, repeated samples, etc.) but it’s still legitimate.

In other words, the mistake isn’t giving realized CIs an interpretation when none is possible. It’s using the wrong interpretation when a legitimate one is possible.

What I would really like to have is something I can point to that gets across the intuition of how to interpret a realized CI using this formal literature but in a simple and accessible way.

]]>“Confidence” seems like a really poor choice of word to use to label the concept. But I don’t know if I can think of a better one. Even though I generally don’t like using someone’s name to label something, that would be better than “confidence” or some other ordinary word that is likely to promote misinterpretations.

Maybe “reliability interval” would be better? “95% reliability” meaning it does what is intended 95% of possible times it could be used? (Still requires explanation, but not as bad as “confidence interval”.)

]]>“I am 95% confident that the true mean lies in the interval” … “practical use.”

I can’t practically use a confidence interval without some sort of semantics, even if informal. There’s a number (.95) for _something_. Frequentists would insist it’s not a probability (“that the true mean lies in the interval”) since that’s meaningless. Bayesians might say that this has meaning, but the confidence interval approach is in no way trying to estimate probabilities of truth (and the “particular cases” you malign just show that confidence intervals aren’t able to do this sensibly, which is fair since that’s not what they are even trying to do).

So I’m left with “confidence”. It’s not probability, and it’s not fair to impute some technical definition to the normal English word “confidence” just because that’s what some early statistician pulled up when naming the concept. No one thinks it makes sense as a probability. So how should I interpret it? I mean, practically, and granting maximum generosity to practicality over formality? Something that is useful … maybe how it influences a decision I make or a belief I form?

]]>This problem of the intro course is one I always have in mind, and in a way I wish I could just not cover some things, but if students only ever take one class in statistics, I think they need to be able to walk away with the idea of sampling variation.

]]>My experience is that many people do interpret frequentist confidence intervals in a Bayesian way. However, I wouldn’t go so far as to say that this is because it is what they want; I’d say that it is more because the correct interpretation of the frequentist confidence interval is complicated, whereas the Bayesian interpretation is easier.

By the way, I’ve found that one way to help people understand what frequentist confidence intervals are (and are not) is to teach them enough Bayesian analysis that they can see the difference between Bayesian intervals and frequentist intervals.

Unfortunately, there is not time in a typical intro stats class to do this. However, I was fortunate to be able to teach (for four summers) a prob/stat course for a master’s program for math teachers, who had already had a frequentist introductory statistics course. They really liked the Bayesian approach, and it helped them understand the frequentist approach better. But they had a better math background than many people taking an intro stats course, which made the course I taught feasible for them to grasp.

]]>My experience is that people (meaning mainly health professionals and researchers) interpret frequentist results in a Bayesian way, because those are the results that they really want. But I’m not sure if there is any real evidence that Bayesian results lead to better understanding and better decisions than frequentist results. I’m not even sure how you would approach that question, but I’d really like to know if anyone has tried.

]]>The trouble with this is that the people who are facing this problem are often constrained by protocols that prescribe frequentist methods.

Also, I’m not convinced that “they” can all understand Bayesian results; many laypeople (e.g., many physicians, I have read) believe that “uncertainty between two possible outcomes” must mean that each possibility has probability 1/2.

]]>Obvious suggestion – give them Bayesian results, which they can understand.

(someone had to say it)

]]>(B) is related to the following question: Even if a researcher has a good understanding of what “confidence interval” means, they often need to discuss results of a study with someone (e.g., a boss, a physician, a member of the school board) who doesn’t understand and isn’t willing or able to go to all the work involved in understanding. So there is a practical question of what is a *good enough* explanation of confidence interval for such a “layperson”.

I don’t have a good answer to this, but have come up with a possible candidate — something like, “Drawing inferences from data based on a sample always involves some uncertainty. The confidence interval helps describe some of this uncertainty — namely, the uncertainty (“sampling uncertainty” ) arising from the fact that we do not have complete data, but have to estimate from a sample. There are also other types of uncertainty involved — for example, measurement uncertainty, which we might or might not be able to estimate, depending on the particular circumstances. Unfortunately, the confidence interval itself involves some uncertainty: If all conditions are in place, the confidence interval only does what we intend for most samples we might encounter, but will always miss the mark for some small percentage of them.”

This is probably too unwieldy for the intended purpose. Does anyone have any better suggestions?

]]>There is a lot packed into (A) but the answer at the end is that the expected value of the number of “true value containing” CIs is 95 but there is no way to know for certain what it actually is. Just like if you roll a die 60 times under the hypothesis that the die is fair the expected number of 3s is 10 but you don’t actually know that it will be 10 and it’s actually not much more likely to be 10 than to be 9 or 11. If the die is fair it is unlikely that you would get 0 3s, but not impossible. But if you get 0 3s, you might want to reconsider the hypothesis that the die is fair.

(B) I’m not sure what you mean by the first sentence, but yes part of the discussion is that assuming simple random sampling and perfectly accurate measurement is assuming a lot. CIs are based on a theoretical model about sampling error not about measurement error.
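
To put numbers on the die example in (A), and on the "100 unrelated experiments" question it answers, here is a small sketch using the binomial distribution. It assumes independent rolls of a fair die and, for the intervals, independent experiments whose CIs each cover with probability exactly 0.95; `binom_pmf` is just an illustrative helper name.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Number of 3s in 60 rolls of a fair die: expected value is 10.
for k in (0, 9, 10, 11):
    print(f"P({k} threes in 60 rolls) = {binom_pmf(k, 60, 1 / 6):.6f}")

# Number of covering intervals among 100 independent 95% CIs: expected value is 95.
for k in (93, 95, 97):
    print(f"P({k} of 100 intervals cover) = {binom_pmf(k, 100, 0.95):.4f}")

# Chance that 92 or fewer of the 100 intervals cover the true value.
print(f"P(92 or fewer cover) = {sum(binom_pmf(k, 100, 0.95) for k in range(93)):.4f}")
```

Under these assumptions, exactly 10 threes is only about 2% more likely than exactly 9, and exactly 95 covering intervals happens only about 18% of the time, so 95 is an expectation and not a guarantee.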

]]>A) If confidence intervals are always constructed under the assumption that sampling was random (i.e. not systematically biased), do we really need to stress the repeatability of the experiment itself… Or could we say CIs are really about the amount of error that I’m willing to accept as a researcher? I mean, suppose a researcher conducts 100 unrelated experiments, could (s)he look back and say (assuming random sampling) that 95 of his/her CIs contain the ‘true value’ of the parameter?

B) A phrase like ‘true value’ doesn’t communicate that if the measurement procedure is biased (e.g. systematically underestimates) fewer than 95 of 100 CIs will contain the true value, no? It seems to me CIs are about precision, not accuracy then.

]]>But “Random sampling” is used as a default model for LOTS of stuff in science EVERY DAY, for example the effectiveness of drugs, and yet the only situation where it’s a good model is basically surveys of finite populations using random number generators.

So, if you want to find out the average age of drivers in California, if you randomly sample from the DMV registration it would be a good model to say that your data comes from a random number generator process. But if you want to find out how well crops grow given a certain fertilizer it’s a TERRIBLE model.

]]>Man invented the pie, but pi is the work of God.

(actually, more likely woman)

]]>Not always easy to notice a model is poor for the underlying science http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

There, a poor but fully Bayesian model produces the usual confidence intervals, which are taken to have good frequentist coverage properties.

Makes one wonder – good, good for what?

]]>I think less an error than bringing in an additional uncertainty that was not necessary in the conversation.

The conversation is about what would happen in an infinite number of replicate samples and you have raised the issue of what will happen in a given finite sample (which is subject to sampling variability which decreases with increasing n).

]]>It was Kronecker who said, “The integers alone are created by God; all else is the work of Man.”

]]>Thank you, Martha and George. It looks like my error here was in the use of the word “tends.” I was assuming that as the number of samplings increases, the percentage whose confidence interval contains the true value will come closer to 95%. That is, with 100 samplings you might be way off of 95%, but with 1000 you’d expect to be closer, and with 10,000 closer still. Or so I thought.

But I see now that this isn’t about “tending” but rather about the correctness of the algorithm over infinite samples. You cannot necessarily expect to get asymptotically closer. As Tom Dietterich says, there is no guarantee for any particular execution (or, as I take it, number of executions) of the algorithm.

It is good to be able to sort this out; thank you!

]]>I think most people would agree that inference based on a simplistic model that is a poor model of the underlying science isn’t going to work well.

]]>Agreed, the confusion occurred despite concepts being well defined.

In my opinion, much of this has to do with the way statistics is taught. Due to practical constraints, statistics is assessed in an exam environment where there is only one correct answer (or a limited number of talking points, each worth partial credit). The best way to learn statistics is to defend statistical analyses and realise on your own that there is a weakness to every method. All this talk of programming and pretty visualisations is just a learning distraction (but important when you actually have to do statistics!).

]]>+1

]]>Hi Diana

*As I understand it, with the repeated samplings, the percentage of the times that the true value lies within the confidence interval *tends* toward 95%. So you would expect the true value to lie outside the confidence interval for only 5 out of every hundred samplings–but this estimate becomes more accurate as the number of samplings approaches infinity.*

Not quite. The definition is with regard to an infinite number of replicate samples. If we replicated the study an unlimited number of times, calculating a confidence interval in the same way as we did for the actual data, 95% of the confidence intervals would cover the true value of whatever it is we’re estimating. It’s in this sense that we are “95% confident” that the intervals produced cover the truth.

The terminology is very confusing, possibly hopelessly so, but that’s what it means.
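
For readers who want to see the repeated-sampling definition in action, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration: normal data, a known standard deviation, a z-based interval, and the particular values of `mu`, `sigma`, and `n`. It draws many replicate samples, builds the interval the same way each time, and reports the fraction of intervals that cover the true mean.

```python
import random
from statistics import NormalDist, fmean

def coverage(n_reps, n=25, mu=10.0, sigma=2.0, level=0.95, seed=0):
    """Fraction of n_reps replicate z-intervals that contain the true mean."""
    rng = random.Random(seed)
    # Half-width of the interval: critical value times the standard error.
    half = NormalDist().inv_cdf(0.5 + level / 2) * sigma / n ** 0.5
    hits = 0
    for _ in range(n_reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half <= mu <= xbar + half:
            hits += 1
    return hits / n_reps

for reps in (100, 1000, 10000):
    print(f"{reps:>6} replicate samples: observed coverage {coverage(reps):.3f}")
```

The 95% is a property of the procedure over all possible samples. Any finite batch of replicates gives an observed fraction that scatters around it, and a single computed interval either covers the true value or does not.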

]]>See also Tom Dietterich’s comment below.

]]>Diana,

Bear in mind that a big part of the problem is, as pointed out by Peter Duong above, that the phrase “95% confident” is an abuse of terminology — or at least, a poor choice of terminology. The phrase “95% confident” is technically (in the context of confidence intervals) *defined* to mean, “we have used a procedure with the property that 95% of all possible samples will result in an interval that contains the true value of the parameter”

Note that I haven’t said “tends to” — that’s deliberate, but it begs the question of what “percentage” means in this context. I am OK talking about a percentage for an infinite set — e.g., “50% of all real numbers are <0" (and also "50% of all real numbers are < 1"). What I mean by such a sentence is "the probability that a randomly chosen real number will be less than 0 is 1/2". (Still some technicalities left out there, but it gets closer to being a precise definition.)

]]>+1

]]>I don’t use a scary voice, but put things in a “good news, bad news” framework: The good news: we can do this procedure which (assuming model assumptions fit) does what we’d like for 95% of suitable (as specified by model assumptions) samples.

The bad news: We don’t know whether or not the sample we have is one of the 95% for which the procedure works (i.e., gives an interval containing the true parameter), or one of the 5% for which it doesn’t.

Somebody, I can’t remember who, once said something to the effect of “God created the natural numbers; man invented the rest.”

I think it is fair to describe the history of mathematics as the incremental creation of abstract nonsense that solves problems that are otherwise unsolvable. Negative numbers and fractions are “nonsense,” but they were constructed to extend the basic properties of natural numbers and permit the solution of equations that have none within the natural numbers themselves. And on and on it goes.

]]>Yes, exactly.

]]>I sure hope the aliens won’t be gene-centrists. One species of those is enough.

More seriously, what about the fact (?) that frequentist CIs correspond to Bayesian credible intervals assuming flat priors? Wouldn’t that provide a practical interpretive benefit to users despite mixing up philosophies?
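
One case where that correspondence does hold is the mean of normal data with known standard deviation. The sketch below is restricted to that case and should not be read as a general claim. It computes the usual z-interval and, separately, a central 95% credible interval from a grid approximation to the posterior under a flat prior. The function names, the grid settings, and the simulated data are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, fmean

def z_confidence_interval(data, sigma, level=0.95):
    """Standard z-interval for a normal mean with known sigma."""
    xbar = fmean(data)
    half = NormalDist().inv_cdf(0.5 + level / 2) * sigma / math.sqrt(len(data))
    return xbar - half, xbar + half

def flat_prior_credible_interval(data, sigma, level=0.95, grid_points=4001):
    """Central credible interval for the mean from a grid posterior, flat prior."""
    xbar = fmean(data)
    se = sigma / math.sqrt(len(data))
    lo = xbar - 8 * se                      # grid covers +/- 8 standard errors
    step = 16 * se / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    # Log-likelihood at each candidate mean; a flat prior only adds a constant.
    loglik = [-sum((x - m) ** 2 for x in data) / (2 * sigma ** 2) for m in grid]
    top = max(loglik)
    weights = [math.exp(v - top) for v in loglik]
    total = sum(weights)
    tail = (1 - level) / 2
    cum, lower, upper = 0.0, None, None
    for m, w in zip(grid, weights):
        cum += w / total
        if lower is None and cum >= tail:
            lower = m
        if upper is None and cum >= 1 - tail:
            upper = m
    return lower, upper

rng = random.Random(1)
sigma = 2.0
data = [rng.gauss(5.0, sigma) for _ in range(20)]
ci = z_confidence_interval(data, sigma)
cred = flat_prior_credible_interval(data, sigma)
print("confidence interval:", ci)
print("flat-prior credible interval:", cred)
```

The two intervals agree up to the grid resolution here. In other models, or with informative priors, they can differ.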

]]>+1 to Jim — a crisp way to clarify a common confusion.

]]>I am jumping in here without being sure that (a) I’m anywhere near correct or (b) I’m commenting in the right place. I’m neither a Bayesian nor a frequentist; I’m at best an infrequentist.

That said, I think the whole business of “repeated sampling” needs clarification (maybe only for me).

“We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” What does “95% confident” mean, though, in relation to the repeated samplings?

As I understand it, with the repeated samplings, the percentage of the times that the true value lies within the confidence interval *tends* toward 95%. So you would expect the true value to lie outside the confidence interval for only 5 out of every hundred samplings–but this estimate becomes more accurate as the number of samplings approaches infinity.

So in a sense you’re never 95% confident of anything specific. You’re only confident that as your number of samplings increases, your percentage of outliers will get closer to 5%.

]]>And you are saying in part that most people interpret them as point + error rather than as an interval estimate, right? I think that’s probably right especially since they are often graphically displayed that way. Even the way students learn to calculate them by hand probably adds to that perception.

]]>+1

]]>Peter, I agree. It is unfair on non-expert users of statistics who are expert users of English to regularly deal with the subtle distinction between ‘confidence’ followed by ‘interval’ and ‘confidence’ followed by any other word. However, I’m not sure that simply calling Neyman’s confidence intervals “random intervals” is a good enough fix. What is the meaning of “random” in that context?

Rather than assuming we can fix the confusion by changing names of statistical objects, we need to find where the underlying difficulties lie and deal with them.

]]>To say what are the most likely values of a parameter, given the data.

]]>I think the issue is more abuse of terminology. The confidence interval, properly defined, is a random interval. The one calculated from the sample is more of a simulation of the confidence interval, but for some reason we call them CIs anyway.

]]>What do you think most people want to use them for?

]]>Yes, exactly, which is why I explicitly put it in terms of “functions that compute…” in my comment above:

But, the usual thing is to get one data set, create a simplistic model of repeated sampling that is a poor model of the science, produce one confidence interval, and then immediately make the following logically fallacious inferences:

A = (my algorithm corresponds to the way the world works), however this is KNOWN to be FALSE very often

B = (95% of intervals generated under repeated sampling contain the true value of the parameter)

C = (My particular interval contains the true parameter with 95% probability)

if A then B (a true statement about an algorithm.)

A is true (a most often FALLACIOUS statement about a scientific process)

therefore B is true (B actually has totally unknown truth value due to above)

if B then C (This is a FALSE statement regardless of whether B is true because frequency and probability don’t mean the same thing to most people, but it appears true as soon as you confuse frequency and probability together formally)

B is true therefore C is true (FALLACY: remember, B has no known truth value because A is FALSE. Even if B were true the statement “if B then C” is basically false when looked at carefully due to “probability” not meaning long run frequency to real world people who haven’t been brainwashed by stats classes)

So, after a fallacious assumption about a non-existent sampling process, and a pun between “probability as frequency” and “probability as degree of credibility” which makes “if B then C” seem like a true statement… people fallaciously deduce “my particular interval contains the true parameter with 95% probability” sometimes even when they can know with 100% credibility before looking at the data that the parameter can NOT possibly be in the interval (for example, when the parameter logically has to be greater than zero but the confidence procedure includes only negative values… the kind of thing that *can* happen with confidence procedures).
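
The parenthetical at the end can be illustrated with a toy simulation. This is a sketch under assumed conditions, not a description of any particular study: a parameter known in advance to be positive (`mu = 0.1`), normal noise with known `sigma`, and the ordinary z-interval. It counts how often the realized 95% interval lies entirely below zero, where the parameter cannot be.

```python
import random
from statistics import NormalDist, fmean

def share_of_impossible_intervals(n_reps=20000, n=25, mu=0.1, sigma=1.0,
                                  level=0.95, seed=0):
    """Fraction of z-intervals whose upper endpoint is below zero,
    even though the true mean mu is positive."""
    rng = random.Random(seed)
    half = NormalDist().inv_cdf(0.5 + level / 2) * sigma / n ** 0.5
    impossible = 0
    for _ in range(n_reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar + half < 0:          # entire interval is negative
            impossible += 1
    return impossible / n_reps

share = share_of_impossible_intervals()
print(f"share of intervals entirely below zero: {share:.4f}")
```

The procedure still has 95% coverage overall, yet a small share of its realized intervals contain only values that were ruled out before seeing any data. Saying one of those intervals contains the parameter "with 95% probability" would plainly be wrong.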

]]>OK but the other implicit assumption here is no systematic error.

Good thing there is seldom a lack of awareness and or agreement on the implicit assumptions ;-)

]]>I found this one very helpful CONDITIONAL INFERENCE FROM CONFIDENCE SETS George Casella, Cornell University https://projecteuclid.org/download/pdf_1/euclid.lnms/1215458835

“Although it might be argued that searching for relevant sets is an occupation only for the theoretical statistician, we must remember that practitioners are going to make conditional (post-data) inferences. Thus, we must be able to assure the user that any inference made, either pre-data or post-data, possesses some definite measure of validity.”

]]>