Hans van Maanen writes:

May I put another statistical question to you?

If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me. My visualisation is that she filled a bowl with 100 intervals, 95 of which do contain the true value and 5 do not, and she picked one at random.

Now, if she gives me two independent 95%-CI’s (e.g., two primary endpoints in a clinical trial), I can only be 90% sure (0.95^2 = 0.9025) that they both contain the true value. If I have a table with four measurements and 95%-CI’s, there’s only an 81% chance they all contain the true value. Also, if we have two results and we want to be 95% sure both intervals contain the true values, we should construct two 97.5%-CI’s (0.95^(1/2) = 0.9747), and if we want to have 95% confidence in four results, we need 98.7%-CI’s (0.95^(1/4) = 0.9873).

I’ve read quite a few texts trying to get my head around confidence intervals, but I don’t remember seeing this discussed anywhere. So am I completely off, is this a well-known issue, or have I just invented the Van Maanen Correction for Multiple Confidence Intervals? ;-))

I hope you have time for an answer. It puzzles me!

My reply:

Sure, I can help you, but in English:

1. “If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.” Not quite true. Yes, true on average, but not necessarily true in any individual case. Some intervals are clearly wrong. Here’s the point: even if you picked an interval at random from the bowl, once you see the interval you have additional information. Sometimes the entire interval is implausible, suggesting that it’s likely that you happened to have picked one of the bad intervals in the bowl. Other times, the interval contains the entire range of plausible values, suggesting that you’re almost completely sure that you have picked one of the good intervals in the bowl. This can especially happen if your study is noisy and the sample size is small. For example, suppose you’re trying to estimate the difference in proportion of girl births, comparing two different groups of parents (for example, beautiful parents and ugly parents). You decide to conduct a study of N=400 births, with 200 in each group. Your estimate will be p2 – p1, with standard error sqrt(0.5^2/200 + 0.5^2/200) = 0.05, so your 95% conf interval will be p2 – p1 +/- 0.10. We happen to be pretty sure that any true population difference will be less than 0.01 (see here), hence if p2 – p1 is between -0.09 and +0.09, we can be pretty sure that our 95% interval *does* contain the true value. Conversely, if p2 – p1 is less than -0.11 or more than +0.11, then we can be pretty sure that our interval *does not* contain the true value. Thus, once we *see the interval*, it’s no longer generally a correct statement to say that you can be 95% sure the interval contains the true value.
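The arithmetic in this example can be checked in a few lines. This is just a sketch of the example above; the function name and the 2-standard-error approximation of the 95% interval are my own choices:

```python
import math

# Two groups of N = 200 births each; use p ~ 0.5 for the standard error.
n = 200
se = math.sqrt(0.5**2 / n + 0.5**2 / n)
print(round(se, 3))  # 0.05, so the 95% interval is the estimate +/- 0.10

def ci_95(p2_minus_p1):
    """Approximate 95% interval: observed difference +/- 2 standard errors."""
    return (p2_minus_p1 - 2 * se, p2_minus_p1 + 2 * se)

# If prior knowledge says the true difference is below 0.01 in magnitude,
# an observed difference of 0.12 gives roughly (0.02, 0.22), an interval
# we can be pretty sure does NOT contain the true value.
print(ci_95(0.12))
```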

2. Regarding your question: I don’t really think it makes sense to want 95% confidence in four results. It makes more sense to accept that our inferences are uncertain; we should not demand, or act as if, they will all be correct.

This reminds me of the optional stopping thing discussed a bit in the Mayo thread the other day. Once the data is collected you will always have additional information like the sample size, which should perhaps be included in the likelihood.


That doesn’t make the stopping rule relevant in the classic case I consider. It’s a point well clarified by J. Berger, but I’m sorry I don’t have time to look it up now.

> We happen to be pretty sure that any true population difference will be less than 0.01.

The use of 95%CI is based on the scenario specified in the question. We have no clue what is reasonable and draw CI’s at random from a bowl just like lottery tickets. In this scenario, seeing the drawn interval doesn’t provide any useful information about the probability that the interval contains the true value or not.

You introduce a different scenario, where we have useful prior information that can be applied to the drawn CI.

Apples and Oranges (Frequentist vs. Subjective Bayesian)

I’ve never liked the “we are 95% sure that…” or “we are 95% confident that…” interpretations of 95% CIs. They feel question begging, in that they depend upon the reader understanding that “95% confident” is a statement about the long run coverage rate of the method. And this is the most common point of confusion!

If you show “we are 95% sure that…” to someone who doesn’t already understand this, they will naturally take it to mean “there is a 95% chance that the value of interest is in the interval”.

Good points Ben,

I agree. I hope this thread will bring us back to the several alternate terminological proposals available in the TAS19 Special Edition, say as proposed by Valentin et al., Sander Greenland, Raymond Hubbard, Steven Goodman, and Andrew Gelman et al., and debated by John Ioannidis. Evaluate the intersectionalities and non-intersectionalities.

I agree with Ben. However, I don’t think that this addresses the practical question of how you explain a 95% confidence interval to a lay person.

For example, I once had a student who seemed to understand the concept and its problems well herself, but asked the following practical question:

She worked for a school district, and sometimes was tasked with discussing statistical results with parents or teachers of students in that school district. She realized that they would not understand a technically correct answer, and asked for an explanation that the laypersons could understand. I did not have a good answer to give her. Later, after I had thought about it quite a while, the best I could come up with was,

“It is an interval that estimates part of the uncertainty in our estimate — namely, the uncertainty that comes from the fact that we have estimated from one particular sample. Had we used a different sample, we might have obtained a different estimate. However, there are other sources of uncertainty in our estimate, but they are not taken into account in the calculation of the confidence interval.”

Can anyone suggest a better explanation for a “lay audience”?

I find it useful to think about prevalence rates. Say that the point prevalence for breast cancer is 5%. Does this mean that every individual person has a 5% chance of currently having breast cancer? Well, of course not! Prevalence is an average value, derived from determining how many people across the world either have or do not have breast cancer at any given time. You give me one person, and they either have it or do not. I would need to account for all the information specific to that individual person to determine how likely THEY are to have breast cancer (i.e. I would need to include prior information).

Confidence intervals, like prevalence rates, are interpreted in the same way—they are statements about long-run frequency. Just like I cannot say that an individual person has a 5% chance of having breast cancer because the point prevalence is 5%, I cannot say that a given CI has a 95% chance of containing the true value because 95% of CIs in the long-term will contain it.

Here is what I offer in a quantitative reasoning text designed for students who will take no college mathematics. I’d be delighted to have critiques from this knowledgeable crew.

Sorry for the long quote.

[The poll reached 1,015 adults and has a margin of sampling error of plus or minus 3.6 percentage points.] The first paragraph quoted above reports the results of a survey. The second tells you something about how reliable those results are. It’s clear that the smaller the margin of error the more you can trust the results. Understanding the margin of error quantitatively — seeing what the number actually means — is much more complicated. A statistics course would cover that carefully; we can’t here. Since the term occurs so frequently, it’s worth learning the beginning of the story. Even that is a little hard to understand, so pay close attention.

The survey was conducted in order to discover the number of people who thought the [Obama] tax increase would benefit the economy. If everyone in the country offered their opinion then we would know that number exactly. If we gave the survey to just three or four people we could hardly conclude anything. The people at the Pew Research Center decided to survey the opinions of a sample of the population — 1,015 people chosen at random. Of the particular people surveyed, 0.44 × 1,015 ≈ 447 people thought the tax increase would benefit the economy. If they’d surveyed a different group of 1,015 people, they would probably see a different number, so a different percent. The 3.6 percentage point margin of error says that if they carried out the survey many times with different samples of 1,015 people, 95% of those surveys would report an answer that was within 3.6 percentage points of the true value. There’s no way to know whether this particular sample is one of the 95%, or one of the others. About five of every 100 surveys you see in the news are likely to be bad ones where the margin of error surrounding the reported answer doesn’t include the true value. Survey designers can reduce the margin of error by asking more people (increasing the sample size). But there will always be five surveys in every hundred where the true value differs from the survey value by more than the margin of error.

The report doesn’t explicitly mention 95%. That’s just built into the mathematical formula that computes the margin of error from the sample size. Even that conclusion may be too optimistic. The margin of error computation only works if the sample is chosen in a fair way, so that everyone is equally likely to be included. If they asked 1,015 people at random from an area where most people were Democrats (or Republicans) or rich (or poor) the result would be even less reliable. The report describes the efforts taken to get a representative sample.
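In case students ask where the 3.6 comes from: for a simple random sample, the back-of-the-envelope 95% margin of error is 1.96·sqrt(p(1−p)/n), with worst case p = 0.5. This is my own sketch; Pew’s published figure is larger because it also accounts for weighting and design effects.

```python
import math

# Naive 95% margin of error for a simple random sample of n = 1015,
# taking the worst case p = 0.5.
n = 1015
moe = 1.96 * math.sqrt(0.5 * 0.5 / n)
print(round(100 * moe, 1))  # 3.1 percentage points
```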

Ethan,

Just a quick thought from an ex philosophy professor who taught a few classes of the sort you mention, I’m wondering if some students might be confused by the fact that you write “About five of every 100 surveys you see in the news are likely to be bad ones” and then a couple sentences later write “But there will *always* be five surveys in every hundred …” (emphasis added).

Cheers,

Jeff

Jeff

You’re right, of course. I will delete that (since the idea appears correctly a few sentences earlier) or reword it (if I want to keep it for emphasis).

Thank you.

Let’s try and get into the mind of someone who will take no college mathematics. The word “sample” may conjure up visions of blood flooding into a syringe in their doctor’s surgery. And “error” might be equated with “mistake”. So, there is a good chance that the phrase “margin of sampling error of plus or minus 3.6 percentage points” is completely lost on them.

Peter

Indeed that sentence would be difficult (impossible?) for my students. It’s not mine, it’s quoted by the newspaper in a story about a survey conducted by the Pew Research Center. The quotation marks and attribution didn’t show up when I cut and pasted here.

What I am trying to do is parse it so that they can understand it (more or less) when they read it. (They do know about percentage points.)

I think the idea of giving a quote from a newspaper or other source and guiding the students in understanding it (to some reasonable extent) is definitely a good idea.

Martha

Thanks for the affirmation. That is in fact the strategy for the whole text.

In completely lay persons’ terms, I like to talk about trying to hit a target in the dark. We have a method that, in ideal circumstances, is guaranteed to hit the target 95% of the time. This interval was obtained using this method. We are pretty confident that it hits, but we cannot be completely sure.

This does seem to get the point across (as well as possible for a lay audience) amazingly succinctly. Thanks.

What I like about this is that one can then ask questions like: What do “ideal circumstances” mean? What if there is a turbulent wind? Or the “shooter” is sitting in the pouch of a kangaroo? Also, what is the cheapest, easiest way to improve the hit rate? And “do we really have no information about the location of the target at all?” What I do not like about it is that it disconnects the whole idea from the data, and makes it seem more “blackboxish”.

Jan said,

“What I do not like about it is that it disconnects the whole idea from the data, and makes it seem more “blackboxish”.”

Yes, it does make the process seem “blackboxish” — but, to be honest, the process of statistical inference does have an element of “blackboxishness” to it — so your objection is in some sense an objection to being open about that “uncomfortable truth”.

> being open about that “uncomfortable truth”.

Agree, and for instance there needs to be a good separation between the representation (model) and what the model tries to represent well. That 5% is only directly about an abstract model, not the actual (finite) set of surveys. You can never do better than, as Oliver Wendell Holmes put it, have good reasons to bet it would be 5% in any survey.

Not the room or time to properly explain it here, but I have found using a shadow metaphor helps get this across.

Statistics – seeing only the shadows and discerning what may have cast them.

Probability – mathematical models where what casts the shadows is set to a known parameter value and the ensemble of resulting possible shadows is derived.

Simulation – a means to “run” and learn about any probability model; the ultimate shadow-casting machine?

So we have a reality beyond direct access that we want to know about (what casts the shadow), an abstract but fully specified/known representation (model) of that reality (the probability shadow-generating model), and a means, as accurate as time allows, of learning about the representation (simulation).

Keith:

Nice metaphor!

There’s a Goethe quote that goes, “Where there is much light, the shadow is deep.” This fits in with your metaphor: The better the data and the analysis, the better the “shadow” (that results from the analysis) fits the reality.

I like the layperson definition of a confidence interval given by Martha because it highlights that the confidence interval captures only one part of the uncertainty. It also leads to the obvious question of what the “other sources of uncertainty” are, and whether they are large or small compared to the uncertainty quantified by the confidence interval.

Why would you choose to inflict these subtleties on a lay person? As soon as you say “however, there are other sources of uncertainty in our estimate, but they are not taken into account in the calculation of the confidence interval” you are inviting the lay person to ignore you. Why haven’t you taken these other sources of uncertainty into account? If you are explaining something to a lay person you need to focus on the needs of that person, not your need to get lost in the subtleties of your subject.

Peter, I don’t think of these as subtleties at all. For example, if there is model misspecification and I have a large sample, then the estimate may be very precise (i.e., a small confidence interval) but not very accurate. I assume the person needs to know the uncertainty in order to make decisions and it is important that we don’t just ignore other, potentially much larger, sources of uncertainty just because it is not as easy to estimate or takes longer to explain or makes decisions more difficult.
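A quick simulation makes the precise-but-inaccurate point concrete. Here the standard interval formula is applied to data whose collection is biased, an assumption violation the formula knows nothing about (all numbers are illustrative, not from the thread):

```python
import math
import random

random.seed(1)
# True mean is 0, but the sampling process is biased upward by 0.5.
true_mean, bias, n = 0.0, 0.5, 10_000
sample = [random.gauss(true_mean + bias, 1.0) for _ in range(n)]
xbar = sum(sample) / n
se = (sum((x - xbar) ** 2 for x in sample) / (n - 1)) ** 0.5 / math.sqrt(n)
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
# A very narrow interval (width ~0.04) that almost surely excludes the truth:
print((round(lo, 3), round(hi, 3)), lo <= true_mean <= hi)
```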

Nat said,

“I assume the person needs to know the uncertainty in order to make decisions and it is important that we don’t just ignore other, potentially much larger, sources of uncertainty just because it is not as easy to estimate or takes longer to explain or makes decisions more difficult.”

I agree wholeheartedly. Many people tend to think in dichotomous terms, but Mother Nature does not. We need to help more people accept the ubiquitousness of uncertainty — from quantum phenomena, to which genes get selected at conception, to other “random” events that affect people’s lives and the state of the world. Sweeping uncertainty under the rug is just a form of unscientific denial.

I wouldn’t attempt to explain a confidence interval to a lay audience. I would simply say that “we are pretty sure that the value lies between (lower limit) and (upper limit)” and leave it at that. Pretty sure is not the same as 100% sure, and the recipient of the information should be able to understand that.

This may work well enough for some people, but there is always the possibility that someone will ask for more detail, so we need to be prepared to go a little further if asked.

For the layperson, or just for teaching in general, I think the ring toss metaphor works well. Throw the ring and 95% of the time it lands around the stake.

I’m not the only one who uses this metaphor. EpiEllie has a nice tweetstorm on it:

https://mobile.twitter.com/EpiEllie/status/1073385394580979712

My own little tweak is that the game is actually “Mystery Ring Toss”. You don’t actually know where the stake is so you never get to find out if your ring found it or not.

Nice tweak. It points out that the “game” is at least as much a game of chance as of skill.

I like this ring metaphor, but isn’t there also a more literal use of confidence intervals as a range that contains 95% of all values? For example, is it incorrect if we specify the 95% confidence interval for income such that 95% of people have an income in that range? For any randomly selected person there is a 95% probability that their income will fall in the specified range. However, there is no “true” income value (i.e., no single stake).

Confidence intervals are not intervals that contain a specified percentage of all values.

First, confidence intervals relate to parameters calculated from multiple observations (such as a mean or proportion), not to the value of an individual observation. For individual observations, an interval in which future observations will fall with a specified probability, given previous observations (your sample), is a prediction interval (https://en.wikipedia.org/wiki/Prediction_interval).

Second, in your post, there is a confusion between an interval calculated from the actual distribution (i.e., we know the exact distribution in the population) vs. an interval estimated from a sample taken from the population (i.e., we can only estimate the exact distribution in the population). In English, there is no clear distinction, as “prediction interval” is used in both cases: an interval in which the future value of an individual observation, or of a parameter measured on a sample, should fall with a given probability given the actual distribution in the population. French names this a “fluctuation interval” and keeps “prediction interval” for the case described in the previous paragraph (predicting future observations based on previous ones).
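The distinction can be shown side by side. A sketch using the z approximation and made-up income numbers: the confidence interval for the mean is roughly sqrt(n) times narrower than the prediction interval for one future observation.

```python
import math
import random

random.seed(0)
n = 100
sample = [random.gauss(50_000, 20_000) for _ in range(n)]  # hypothetical incomes
xbar = sum(sample) / n
s = (sum((x - xbar) ** 2 for x in sample) / (n - 1)) ** 0.5

# Confidence interval for the MEAN income:
ci = (xbar - 1.96 * s / math.sqrt(n), xbar + 1.96 * s / math.sqrt(n))
# Prediction interval for ONE future person's income:
pi = (xbar - 1.96 * s * math.sqrt(1 + 1 / n),
      xbar + 1.96 * s * math.sqrt(1 + 1 / n))
print("CI width:", round(ci[1] - ci[0]), " PI width:", round(pi[1] - pi[0]))
```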

Thanks for pointing out the distinction that is addressed in French but not in English terminology.

> My own little tweak

I think that is a very big tweak (for the same reasons as the shadow metaphor above) – we don’t have direct access to the stake’s position (or what casts the shadow).

True, it’s an important point, but the heavy lifting is being done by the ringtoss metaphor. You can explain lots of things with just that.

So a realised CI is a specific toss of the ring, whereas a CI procedure is your ringtoss technique in general. On average you’ll get the stake 95% of the time, but for any single toss, either the ring has landed around the stake, or it hasn’t.

Or say there’s a defined playing field (parameter space) where the stake has to be placed, and your technique is such that every now and then you make a wild throw and miss the playing field completely. Ex post you know that after a wild throw like that, the ring can’t be around the stake. Yet your technique still means that 95% of the time you get the stake.

Mark, you might be interested in pg 116 of this very elementary book. It lays out a version of the ring toss where the stake (target) is not known to the person doing inference.

This looks great. Thanks for the tip!

I ordered the book, and it’s great, but it turns out the book uses archery, not ring toss, to explain CIs.

I think they missed a trick here….

I find it useful to say that 95% of the time, doing this calculation on many random samples (assuming the population satisfies other assumptions) will yield an interval that contains the population mean, but we have no way of knowing if this particular sample is one of the 95% or one of the 5%.

If I may offer a tweak to avoid some possible confusion: I’d modify what you wrote to say,

“If we do the same calculation on many, many random samples of the same size from the same population (assuming the population satisfies certain assumptions), we will get an interval that contains the population mean for 95% of those samples, but we have no way of knowing if our particular sample is one of the 95% or one of the 5%.”

+1 thanks!
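Martha’s description translates directly into a simulation, which can itself be a useful classroom demonstration (the parameters are arbitrary, and the known-sigma z interval is used to keep things simple):

```python
import math
import random

random.seed(42)
true_mean, sigma, n, reps = 10.0, 2.0, 50, 10_000
hits = 0
for _ in range(reps):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    se = sigma / math.sqrt(n)
    if xbar - 1.96 * se <= true_mean <= xbar + 1.96 * se:
        hits += 1
# Close to 0.95 across many samples -- but each single interval either
# contained the mean or it didn't.
print(hits / reps)
```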

“If you show “we are 95% sure that…” to someone who doesn’t already understand this, they will naturally take it to mean “there is a 95% chance that the value of interest is in the interval”.”

True, but most people can follow:

1) we have data from one experiment

2) we can repeat the experiment to get more data; after all, we already did it one time

so they understand quite naturally the fact that the intervals are expected to change.

Justin

For point 1, a widely-used example to make people think about this issue is Fieller confidence intervals, for the ratio of two Normal means. (See e.g. Section 5 of this article or all of this one.) These intervals can, with reasonable data, end up covering the entire real line. In those cases you would be 100 percent sure that the truth is in the interval you’ve been presented with, even without considering what values are plausible, as in Andrew’s reply. It’s also possible to contrive intervals that shrink to zero width for some data, making you 100 percent sure that the truth isn’t in there.
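Here is a rough sketch of the Fieller construction for the ratio of two independent Normal means; the function name and return convention are mine, not from the articles linked above. The confidence set is the set of θ where (x̄ − θȳ)² ≤ t²(SE_x² + θ²SE_y²), a quadratic inequality in θ, and depending on the coefficients the solution set can be an ordinary interval, the complement of an interval, or the whole real line.

```python
import math

def fieller_interval(xbar, ybar, se_x, se_y, t=1.96):
    """Approximate 95% Fieller confidence set for the ratio mu_x / mu_y,
    assuming independent Normal estimates. Returns ('interval', lo, hi),
    ('exclusive', lo, hi) meaning everything OUTSIDE (lo, hi), or
    ('real_line',)."""
    a = ybar**2 - t**2 * se_y**2
    b = -2 * xbar * ybar
    c = xbar**2 - t**2 * se_x**2
    disc = b**2 - 4 * a * c
    if disc < 0:
        # The quadratic never crosses zero; its sign is the sign of `a`.
        return ('real_line',) if a < 0 else ('empty',)
    r1 = (-b - math.sqrt(disc)) / (2 * a)
    r2 = (-b + math.sqrt(disc)) / (2 * a)
    lo, hi = min(r1, r2), max(r1, r2)
    return ('interval', lo, hi) if a > 0 else ('exclusive', lo, hi)

# Denominator clearly nonzero: an ordinary interval around the true ratio.
print(fieller_interval(1.0, 2.0, 0.1, 0.1))
# Both means indistinguishable from zero: the whole real line.
print(fieller_interval(0.1, 0.1, 0.2, 0.2))
```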

For point 2, the math you suggest looks a lot like Sidak correction.
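The Šidák correction is exactly Hans’s arithmetic run in reverse: for joint 95% coverage over m independent intervals, each individual interval needs confidence level 0.95^(1/m).

```python
# Sidak correction: individual level needed for joint 95% coverage
# of m independent intervals.
for m in (1, 2, 4):
    print(m, round(0.95 ** (1 / m), 4))  # 0.95, 0.9747, 0.9873
```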

It’s been known for 60-70 years now that it’s possible to get confidence intervals in real problems guaranteed to not contain the true parameter, and moreover, this guarantee is provable from the same assumptions used to create the CI.

There’s nothing in frequentist statistics enforcing even an elementary consistency with deductive logic. It’s an interesting exercise to see what happens when you start imposing such consistency requirements on statistical methods. Perhaps Mayo can work that math out for us and enlighten us with the results.

There’s no inconsistency, please see pages 198-9 of SIST.

I did. Wow. Just wow.

You really believe constructing a 95% CI from assumptions that also directly imply the parameter can’t be in that interval is a howler? And you say this because “confidence level refers to the probability the method outputs true intervals”?

So conditional on some made up assumptions about what will be seen in infinite repetitions that will never exist, the method works 95% of the time, but conditional on the one set of data the scientist actually has before them, the method is guaranteed to fail. And you think this criticism is a howler?

No, it’s not a howler. It implies frequentist methods aren’t even consistent with deductive logic in any practical sense, and are quite possibly horrific out in the wild. What about that Cox and Hinkley interpretation of CI’s you endorse, “as the set of parameter values consistent [with the data] at the confidence level”? In point of fact, all parameter values might be inconsistent!

Anonymous, I’m interested in what conditions lead to this. Can you point me to a good source on the topic?

https://bayes.wustl.edu/etj/articles/confidence.pdf

Let’s look at the truncated failure times example.

1) it is a small sample setting, I don’t believe any approach is good

2) how does the Bayesian interval vary with different Bayesian priors?

3) frequentism is not tied to always using one rule like Bayesian is tied to Bayes rule, so why is the frequentist method only allowed one method? I believe order statistics (the minimum) and/or bootstrapping could give a confidence interval that makes more sense.

4) you have to realize that allowing ‘the data to talk’ sometimes but rarely yields incongruent silly answers in frequentism, much like ‘allowing subjective beliefs to enter’ gives silly answers dictating parameter values (proof of god existing, search for MH370 being so far off track) and Drake-like equation-ness in posterior distributions

If frequentism is flawed, why do Bayesians (and everyone) use histograms? Why does the Strong Law of Large Numbers simply work? Why do likelihoods tend to swamp priors? Why does MCMC rely on frequentist notions of sampling and convergence? Why does the CLT exist? Why is there success of survey sampling, experimental design, and quality control? Just a few questions for now,

Justin

Justin,

You’re a frequentist fanatic, so this will likely make no impact, but here goes:

“1) it is a small sample setting, I don’t believe any approach is good”

The Bayesian approach, as given in detail in the paper, works perfectly.

“2) how does the Bayesian interval vary with different Bayesian priors?”

That’s easy enough for you to calculate. Are you insinuating that different answers for different assumptions are a fatal flaw of Bayes?

“3) frequentism is not tied to always using one rule like Bayesian is tied to Bayes rule, so why is the frequentist method only allowed one method? I believe order statistics (the minimum) and/or bootstrapping could give a confidence interval that makes more sense.”

Bayes isn’t tied to one rule. Bayes’ theorem is one theorem derivable from the basic sum/product rules of probability when applied to any statement (not just repeatable “frequentist” ones). There are infinitely many other theorems implied by the sum/product rules.

In Frequentism, however, the following scenario gets repeated ad nauseam: a flaw is found (every time without exception) and the frequentist uses their intuition to just barely, ad hoc, change things to avoid the flaw. Each such change brings their methods closer to Bayesian ones. They stoutly refuse to recognize this, or to do a complete analysis showing that when all flaws are removed you get Bayes. It’s a cheap way of insulating Frequentism from ever having to admit it was wrong.

“4) you have to realize that allowing ‘the data to talk’ sometimes but rarely yields incongruent silly answers in frequentism, much like ‘allowing subjective beliefs to enter’ gives silly answers dictating parameter values (proof of god existing, search for MH370 being so far off track) and Drake-like equation-ness in posterior distributions”

Frequentist methods don’t “rarely” lead to silly answers in practice. They usually do. I never said anything about subjective beliefs. Laplace’s definition of probability was “cases favorable divided by all cases”. Note, this is not “frequency of occurrence” but rather a simple counting of possibilities. This definition is not a frequentist definition; it is actually far more general and makes sense in singular cases where no repetition is possible. It has nothing to do with “subjective beliefs”.

“If frequentism is flawed, why do Bayesians (and everyone) use histograms?”

Because frequencies are just functions of data. Bayesians are allowed to use frequencies or any other function of data they want. The difference between Bayes and Frequentist isn’t that one uses frequencies and the other doesn’t; rather, the difference is that one claims probabilities are fundamentally frequencies while the other does not.

For Bayesians, frequencies are physical facts, like “temperature”, which are measured or estimated. Probabilities are used to describe our uncertainty about physical facts. One practical difference is that frequencies don’t change when our state of knowledge changes, but probabilities do.

“Why does the Strong Law of Large Numbers simply work?”

It fails far more often than people think. But when it does work there’s a simple Bayesian explanation (it was, after all, originally proved by one of the Bernoullis thinking along Bayesian lines) that can be most succinctly stated as: “whenever almost every possibility leads to A, that’s grounds for thinking A will be seen in practice, and it happens to be true a fair amount”.

“Why do likelihoods tend to swamp priors?”

Because if A and B are inputs to a problem but A carries more weight for the question being answered (for whatever reason), it tends to “swamp” B.

“Why does MCMC rely on frequentist notions of sampling and convergence?”

It doesn’t, although loose language often suggests that.

“Why does the CLT exist?”

It’s one example of a vastly general phenomenon usually associated with Bayesians. Any time a “process” moves a distribution to a higher-entropy distribution while maintaining a given set of constraints, then in the limit you reach the maximum entropy distribution subject to those constraints. “Distribution” here could legitimately refer to probability or frequency distributions, even though they’re two very different kinds of things.

“Why is there success of survey sampling, experimental design, and quality control?”

Part of the answer is that for simple cases (the ones Frequentists first tested their methods on) Bayes and frequentists largely agree. But there’s also a deeper issue going on.

This is a hard one to convey in a short space. But suppose you assume all outcomes of 100 coin flips are equally likely; then you’ll predict roughly 50 heads and 50 tails in 100 flips. From this Frequentists proclaim this ‘proves’ or ‘justifies’ the equally likely assumption. NOT TRUE!!! Far different assumptions, many of which are violently different, also imply you’ll get roughly 50 heads and 50 tails.

Why? Because almost any possible outcome, no matter what the physical cause or propensity, will lead to a roughly 50/50 split.

So there’s two ways to interpret this. The Frequentist one is that there’s something in the universe called “randomness” and coin flips have it. The Bayesian one is that the conclusion is incredibly insensitive to the details of what’s actually happening physically, and that’s why you tend to see 50/50 splits in practice.

In other words, the Frequentist thinks they’re assuming frequencies, and when they make a good prediction they think this frequentist view is bolstered, but what they’re actually doing is showing the vast majority of possibilities lead to the same outcome, and a good prediction merely proves the observed outcome was one of those “vast majority” of cases.

In other words, the great mistake of Frequentists is they think they’re making *necessary* assumptions, but they’re actually making *sufficient* assumptions that are incredibly far from being necessary.
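The “vast majority of possibilities” claim is easy to check by counting: among the 2^100 equally weighted sequences of 100 flips, well over 95% have between 40 and 60 heads.

```python
from math import comb

total = 2 ** 100
near_half = sum(comb(100, k) for k in range(40, 61))  # 40..60 heads inclusive
print(round(near_half / total, 3))  # well over 0.95
```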

“You’re a frequentist fanatic,”

No, I am not, but thanks for the name-calling and poisoning of the well. I will point out when people say that frequentism supposedly doesn’t work, however.

“The Bayesian approach as given in detail in the paper works perfect.”

The math works, sure, but for n = 3, I wouldn’t personally trust any interval from any method because it just isn’t enough data.

“That’s easy enough for you to calculate. Are you insinuating different answers for different assumptions is a fatal flaw of Bayes?”

The Drake-equation-ness depends on how wacky the priors are.

“In Frequentism, however, the following scenario gets repeated ad-nausem: a flaw is found (every time without exception) and the frequentist uses their intuition to just barely ad-hoc change things to avoid the flaw.”

What you call ad hoc and a breaking of the likelihood principle, others can call flexibility, practicality, and problem-solving.

“Frequentist methods don’t “rarely” lead to silly answer in practice. They usually do.”

Not in my experience. Your experience may vary, of course.

“Because frequencies are just functions of data. Bayesians are allowed to use frequencies or any other function of data they want. The difference between Bayes and Frequentist isn’t that one uses frequencies and the other doesn’t, rather, the different is one claims probabilities are fundamentally frequencies while the other does not.”

It seems frequencies are pretty fundamental to any type of probability then, since Bayesians rely on histograms and probability distributions.

“But suppose you assume all outcomes of 100 coin flips are equally likely, then you’ll predict roughly 50 heads and 50 tails in 100 flips. From this Frequentist proclaim this “proves’ or “justifies” the equally likely assumption. NOT TRUE!!! “

And not at all an accurate depiction. The first sentence, OK. The second sentence doesn’t follow. If there were:

a) a good experiment giving about 50 heads on average

b) a few to many repetitions of a)

c) a meta-analysis of the results

a-c still wouldn’t “prove” equally likely. It would, however, be evidence for equally likely.

Justin

Justin:

I think your mistake is buried within the following sentence of yours:

I agree with what you wrote there. I think your mistake is to think that statistics is about producing intervals that you can trust. My take on the replication crisis is that it’s all about scientists and publicists trying to use p-values and other statistical summaries to intimidate people into trusting—accepting as scientifically-proven truth—various claims for which there is no good evidence.

Pizzagate, himmicanes, ages ending in 9, beauty and sex ratios, power pose, ovulation and voting, ESP, etc etc etc . . . just a long stream of claims which the scientific and journalistic establishment are pushing at us without good evidence. I have no problem with some of these as conjectures, but they should be presented as such.

So, sure, I agree with you about not trusting anything based on pure statistical analysis with n=3. But I’d extend that distrust more generally, and I think a big big problem is that statistical methods have been sold as automatically generating trustworthy results.

>1) it is a small sample setting, I don’t believe any approach is good

It is absolutely 100% true that if you see the numbers given you can conclude “with 100% certainty, the truncation point is below min(data) = 12”

So any result that gives you an interval for the truncation point that includes points above 12 is insane. It’s like arguing that 1+1 is somewhere between 3 and 6: we have logical certainty that it isn’t.

Doesn’t matter how small your sample is; it could be just 1 point. You can always conclude with certainty at least that the truncation point is below that one data point.

> 2) how does the Bayesian interval vary with different Bayesian priors?

It will weight different parts of the interval [0,12] differently, but no matter how wacky your prior, it can NEVER give you an interval that includes any points above 12… whereas the example Frequentist interval is *entirely* above 12.

> 3) frequentism is not tied to always using one rule like Bayesian is tied to Bayes rule, so why is the frequentist method only allowed one method?

It isn’t; the point is that whatever the method, the thing that makes it good isn’t that it has guaranteed confidence coverage. Basically, in this case the thing that makes a frequentist interval-estimation method good is that it had better produce a subset of [0,12], since logic tells us that the truncation point *has to be there*, which the Bayes method does automatically. There’s also a proof (Cox’s theorem) which tells you that if the method agrees with binary logic in the limit where binary logic gives certainty, then it pretty much has to be Bayes… So there’s that.
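[A small grid sketch of the Bayesian side of this, assuming the example under discussion is the familiar truncated-exponential one, with density exp(-(y - theta)) for y >= theta and data {12, 14, 16}. Whatever prior you pick, the posterior puts exactly zero mass above min(data) = 12:]

```python
from math import exp

# Assumed setup: p(y|theta) = exp(-(y - theta)) for y >= theta,
# observed data {12, 14, 16}, so logically theta <= min(data) = 12.
data = [12.0, 14.0, 16.0]
lo, n = min(data), len(data)
grid = [i * 0.01 for i in range(2001)]   # candidate truncation points in [0, 20]

def posterior(prior):
    # Likelihood is proportional to exp(n * theta) for theta <= min(data), else 0.
    like = [exp(n * (t - lo)) if t <= lo else 0.0 for t in grid]
    post = [l * p for l, p in zip(like, prior)]
    z = sum(post)
    return [w / z for w in post]

flat  = posterior([1.0] * len(grid))
wacky = posterior([exp(-0.5 * (t - 18.0) ** 2) for t in grid])  # prior piled above 12

# No matter the prior, exactly zero posterior mass sits above min(data):
print(sum(w for t, w in zip(grid, flat) if t > lo))    # 0.0
print(sum(w for t, w in zip(grid, wacky) if t > lo))   # 0.0
```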

> a-c still wouldn’t “prove” equally likely. It would, however, be evidence for equally likely.

It’d be just as much evidence that all coin flips were equally likely except one of them (we don’t know which), which *has* to be equal to the observed value… It’d also be just as much evidence that only sequences where somewhere between 45% and 55% of the flips are heads are possible, and that if you get more than 13 heads in a row the genie will come out of the lamp and smite you and you’ll die before completing the experiment… etc.

There are 2^100 ≈ 1.27e30 possible sequences of 100 coin flips. You could invent any number of stories about magic fairies that intentionally exclude all but 1% of the possibilities, and you’d still probably get close to a 50/50 split: the fairies would still leave you with 1.27e28 possible sequences, almost all of which necessarily have a near 50/50 split. Is the fact that you got 50/50 evidence of magic fairies?
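[That claim is cheap to simulate. In the sketch below, an arbitrary "fairy rule" stands in for any story that excludes ~99% of sequences; the rule itself is invented for illustration:]

```python
import random

random.seed(1)

# Arbitrary "fairy rule" keeping only ~1% of all 100-flip sequences:
# treat each sequence as a 100-bit integer, keep it only if divisible by 100.
def fairy_keeps(x):
    return x % 100 == 0

heads_counts = []
while len(heads_counts) < 2000:
    x = random.getrandbits(100)          # one random sequence of 100 flips
    if fairy_keeps(x):
        heads_counts.append(bin(x).count("1"))

# Even among the surviving ~1%, the typical sequence still splits near 50/50.
print(sum(heads_counts) / len(heads_counts))   # close to 50
```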

“It is absolutely 100% true that if you see the numbers given you can conclude “with 100% certainty, the truncation point is below min(data) = 12”

Well, *you* can conclude that; I wouldn’t, because I am not convinced by a possibly wrong model and an analysis based on n=3 from a nonreplicated non-experiment. It is mathematically true, sure. I’d tackle this problem with bootstrapping.

“It’d also be just as much evidence that only sequences where somewhere between 45% and 55% of the flips are heads are possible, and that if you get more than 13 heads in a row the genie will come out of the lamp and smite you and you’ll die before completing the experiment… etc.”

We’ll just have to agree to disagree, I guess. The success of experimental design, quality control, survey sampling, and yes, even coin flipping and hypothesis testing used in the sciences and other areas all over the world kind of disproves the genie/fairy mocking, I’d opine.

Justin

A big problem is how you measure success. There is currently a problem that statistical significance (which is orthogonal to the frequentist vs. Bayesian issue) -> publication -> “success”.

So the very measure people have been using for “success” is flawed. Using other metrics, like the percentage of papers that replicate, tells a very different story. It seems that only ~10-50% of replications get significance in the same direction, when it should be 50% for sufficiently powered studies. And then there is the issue of properly interpreting these results, which I suspect is even bigger.


My point is that calling something successful because your model predicts it and it’s indeed seen is exactly what even Mayo, for example, argues against: you need to also see whether other alternative models would predict the same thing or different things. The genie/fairy model is just a stand-in for whatever else might be going on. Declaring success because your frequency model predicts whatever is seen is a non sequitur: given the mathematics, only extreme departures from the near-50/50 sequences being possible could produce anything other than about 50/50, since almost all of the possible sequences have a near 50/50 split.

Declaring success of the frequency model is like declaring success of a medical model because your model of toenail fungus predicts that almost all patients coming to a doctor with toenail fungus will live more than 5 years after their first visit. It’s almost got to be the case, given what we know about the mechanisms by which toenail fungus could cause mortality; the frequency properties of RCTs etc. have nothing to do with it. What would be amazing is if you showed you had a mechanistic model that no one had ever thought of, by which toenail fungus causes extreme increases in heart attack risk, and you did in fact see such extreme increases. This would be the equivalent of finding that in all sequences of 100 flips with a given coin, the last 35 of them had to be heads due to a mystery mechanism you’d discovered.

Everything else is just mis-attributing what is essentially a combinatorial counting argument as if it had deep meaning for the physics of flat metal discs.

Also it’s fine to say that in reality maybe our models are wrong, and we should check them… like maybe the truncation point changes in time in an oscillating way or whatever…

But in this case the frequency-based analysis predicts the wrong thing *even if all its assumptions are exactly mathematically true*. Doesn’t that bother you?

Like, what if you do an RCT of a medicine and there’s some kind of censoring of toxicity results, because toxicity below some threshold doesn’t get detected in the 6 months of the study but over the long term you get liver cancer from the drug… and you detect one person in your RCT with liver injury, so your frequency-based analysis says you can be “95% confident that fewer than 1 in 1000 people will have liver injury,” but a Bayesian analysis using real information we already know about the mechanism of the drug shows that it has to be at least 35 out of 1000.

Is that so different a situation? No. In fact, I’d be shocked if we couldn’t find a similar result in the last 10 years of drug trials: a dramatically incorrect estimate of adverse-event risk due to frequentist confidence intervals, which would have come out very differently with a Bayesian model using information from, say, a pharmacokinetic model plus some data collected in that experiment that would have ruled out low liver risk.

I even remember this example from over a decade ago on this blog: https://statmodeling.stat.columbia.edu/2007/08/20/jeremy_miles_wr/

“It seems that only ~10-50% of replications get significance in the same direction, when it should be 50% for sufficiently powered studies. And then there is the issue of properly interpreting these results, which I suspect is even bigger.”

Are Bayes factors remedying this, though? That is, are Bayesian approaches a) solving replication issues, and b) solving interpretation issues? Are the same people who are botching their understanding of p-values somehow understanding the intricacies of priors and MCMC?

Justin

“I think your mistake is to think that statistics is about producing intervals that you can trust.”

I mentioned intervals because the original poster mentioned intervals (from the truncated exponential example). I think intervals are just one aspect of stats, not the whole thing.

With that interval example, I do wonder why a) frequentists are denied knowing/using the scientific knowledge about the minimum of the process (background knowledge can enter without using Bayesian priors), and b) frequentists are apparently banned from constructing confidence intervals any other way (such as by bootstrap).

So again I agree with the Bayesian math here, but I don’t find it a convincing (to me) example.

“Pizzagate, himmicanes, ages ending in 9, beauty and sex ratios, power pose, ovulation and voting, ESP, etc etc etc . . . just a long stream of claims which the scientific and journalistic establishment are pushing at us without good evidence.”

Speaking of ESP, I read a paper that said “Bayesian results range from confirmation of the classical analysis to complete refutation, depending on the choice of prior,” so it doesn’t seem like Bayes is the answer either, or a means of preventing false positives. Frequentist or Bayesian, I’d be more concerned about experimental design than anything else.

Justin

No, doing the same thing with Bayes factors makes no difference. That is why I said “the issue is orthogonal to the Bayesian vs Frequentist issue”.

The point was to get you to say what your measure of “success” was.

This complaint simply isn’t true. In my original description of the problem I thought I was pretty explicit that frequentists are not being banned from doing anything — if they were, it would hardly make sense to ask “What went wrong, and how can it be fixed — how can a frequentist improve upon this confidence procedure?”. I suppose I have to infer that you’ve not yet noticed my reply to your post below (but earlier!) in which you raise this point (sample quote: “in no way am I claiming this is the only acceptable frequentist approach to the problem”).

I looked at Mayo’s book again to see if I misread this. Nope. She really does argue it’s OK if CI’s give results contradicting deductive logic because CI’s are consistent with their own definition.

The key thing about Mayo is that, for her, Frequentism is unfalsifiable. No amount of theoretical or practical failure could ever shake her belief that the foundations of statistics are frequencies.

This belief isn’t born out of vast experience with statistical inference or deep mathematical investigation (she’s done neither, even if one drops the “vast” and “deep”); rather, it’s born of nothing more than a failure to imagine how Bayesians could be right. She doesn’t see philosophically how Bayesians could be right, therefore frequentism has to be right, and no amount of theoretical or practical failure can ever shake that belief.

That this sort of crud is a major influence on statistics in 2019 shows just how intellectually bankrupt the statistics field really is.

Anon:

I would not say that Mayo’s statistical advice is a major influence on statistics. The value that I see in Mayo’s work is not in her writings about statistical methods but rather in her efforts to integrate statistical methods with the philosophy of statistics. Although I disagree with her on many of her statistical attitudes, I do think there are Cantor-sized holes in all the statistical philosophies out there, and I appreciate that she’s looking hard at that.

To put it another way, what I look for in a philosopher is different than what I look for in a statistician. Take my hero Lakatos. He doesn’t offer any specific prescriptions at all, but I feel that I’ve learned a lot from reading his work, that it gives me a better perspective on what I and others are doing.

“It’s been known for 60-70 years now that it’s possible to get confidence intervals in real problems garunteed to not contain the true parameter, and moreover, this guarantee is provable from the same assumptions used to create the CI. “

Intervals can go crazy with Bayesian approaches too: just create a prior that is funky (technical term).

Justin

Going further back to basics: if the true parameter were one confidence limit, there would be a likelihood of 0.025 of seeing the observed result or something more extreme, and the same applies if the other confidence limit were the true parameter. Provided that the conditional evidence is the observation alone (with no prior or other additional evidence, and assuming a symmetrical likelihood distribution), I argue in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212302 that the posterior probability of the true parameter being within the two confidence limits is 0.95. The critical point is the conditional evidence in the probability statement. You, Andrew, bring further prior evidence into the statement to make it Bayesian, which is fine but different.

I’m just a lowly computer scientist so correct me if I’m wrong, but confidence intervals, like p-values, are about samples not populations. Given an observed effect size d, the 95% CI is the central portion of the sampling distribution containing 95% of the probability mass, assuming d is the true population effect size. This means that if d is correct, 95% of observed effect sizes will fall in the CI. To make predictions about the population effect size, you need additional assumptions, eg the a priori probability distribution of true effect sizes in your field.

Nathan:

No, your understanding is not correct. There are some important special cases where your definition works, but in general what you’re saying doesn’t work. The sampling distribution is a distribution of the data, not of the parameters. If you want to consider the sampling distribution as a function of the parameters (that is, as a likelihood), then in general you can’t assume it has a finite integral, hence you can’t speak of its probability mass. Finally, you can in some settings make predictions (“confidence statements”) without using a prior distribution: those predictions might have 95% coverage (assuming the model is true) on average, but they can’t be said to represent 95% probability in any particular case, as in the example described in the post above.

P.S. There’s nothing “lowly” about being a computer scientist! We have expertise in different areas, that’s all.

I must be very confused. I thought I was saying the same thing you just said. My understanding is that the sampling distribution is the probability distribution of a test statistic for a given data generation process, like computing the mean of random samples of fixed size from a normally distributed population. This is, I think, how Wikipedia defines the term. This is a proper probability distribution (integral = 1), so mass is a sensible term.

My definition of CI comes from code examples all over the place, e.g., from Uri Simonsohn (http://urisohn.com/sohn_files/BlogAppendix/Colada20.ConfidenceIntervalsForD.R), the function ci_r in the CRAN predictionInterval package, this R tutorial https://www.cyclismo.org/tutorial/R/confidence.html, and many others. I haven’t seen an equivalent definition in stats texts (as opposed to code), so maybe all the coders are wrong… On the other hand, my code matches R’s t.test on many, many test cases.

I describe myself as a “lowly computer scientist” to emphasize that all my stats knowledge has come on-the-job, often from inspecting code attached to interesting papers or blog posts, and sometimes from online conversations like this one. I appreciate your time in explaining these matters. I’m sorry for wasting your time if I’m completely off in the weeds.

With apologies, I see that I messed up my definition of CI. I double checked the source material I cited and my own code, and see that the definition I gave above is just plain wrong. I will follow up if I have anything useful to add.

I still can’t see where the two of you disagree. Not all CIs are constructed from sampling distributions, sure. But for those that are, the interval is equivalent to an inverted significance test with the tail probabilities calculated from a distribution with location parameter set to the observed estimate (d). For these intervals, since the underlying distributions are *sampling* distributions, do they not provide the same conditional prediction that you first described? (i.e. if delta = d, Pr(d_rep in bounds) = confidence level)

and I realize this has nothing to do with the coverage statement that confidence procedures are intended to make, and which yields their strict frequentist interpretation
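[The inverted-test equivalence described above is easy to check numerically. A z-based sketch, added for illustration (it treats the estimated standard error as known so the standard library suffices; the t version is the same idea with t quantiles):]

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Made-up observations; treat the estimated standard error as known (z-based).
data = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7]
xbar = mean(data)
se = stdev(data) / sqrt(len(data))
z = NormalDist().inv_cdf(0.975)

lo, hi = xbar - z * se, xbar + z * se    # the usual 95% interval

def p_value(mu0):
    """Two-sided z-test of H0: mu = mu0 against the observed mean."""
    return 2 * (1 - NormalDist().cdf(abs(xbar - mu0) / se))

# Each interval endpoint is exactly the borderline of the 5%-level test:
print(round(p_value(lo), 6), round(p_value(hi), 6))   # 0.05 0.05
```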

Patrick- This sounds like an interesting point, albeit one that may not appeal to everyone reading this thread. I’d be happy to follow up by email if that suits you: natg@shore.net.

Andrew said,

“The sampling distribution is a distribution of the data, not of the parameters.”

I don’t understand this; I’m used to using the phrase “sampling distribution” to refer to the distribution of a statistic — e.g., “the sampling distribution of the mean” refers to the distribution of sample means of random samples (of fixed size n) taken from a given distribution.

I think Andrew was taking a shortcut here. A statistic of the data is itself data, in the sense of being fully observable. Also, it’s possible for the function to be “take the nth value,” in which case the sampling distribution is identical to the data distribution. In either interpretation, unobserved parameters do not have distributions, as they never do in Frequentist analysis.

The sample mean is a function of the data, so “sampling distribution of the (sample) mean” is a distribution of data.

“the 95% CI is the central portion of the sampling distribution containing 95% of the probability mass”

This is incorrect because a frequentist interval is not a probability mass or distribution. It’s simply a range. You could produce a *function* of frequentist intervals to produce a consonance function/curve, but even that is not a probability distribution, though it will resemble one.

For more basic understanding of confidence/consonance intervals, I’d recommend reading Cox, D.R. (2006). Principles of statistical inference.

And to understand consonance curves/functions, you can check out this R package to learn more about them.

https://cran.r-project.org/web/packages/concurve/index.html

If we relied on the technical definition of confidence intervals in order to justify their use, then it is hard to believe that anyone would use them.

However, they are indeed very heavily used. Therefore, assuming that not everyone is completely insane (a big assumption?), this suggests that they are interpreted in some other way, e.g. via a Bayesian, fiducial or P value route.

Yes, people use them because confidence intervals approximate the credible interval you would get using uniform priors. This is true for many simple applications but breaks down in others.

When the prior is uniform that means it is the same for every parameter value and thus the priors drop out of Bayes’ rule.

I suspect that, in general, “frequentist” methods are just computationally efficient approximations of the corresponding bayesian solution when the priors drop out. They have been developed for each of a few special cases, and misinterpreting them as their Bayesian equivalent is fine (in those cases).
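[In the simplest case this coincidence is exact, not just approximate. A sketch for a normal mean with known sigma under a flat prior on mu (numbers invented for illustration):]

```python
from math import sqrt
from statistics import NormalDist

# Normal data with known sigma: under a flat prior, the posterior for mu is
# Normal(xbar, sigma/sqrt(n)), so the 95% credible interval coincides with
# the classical 95% confidence interval.
xbar, sigma, n = 10.0, 2.0, 25
z = NormalDist().inv_cdf(0.975)

ci = (xbar - z * sigma / sqrt(n), xbar + z * sigma / sqrt(n))   # frequentist CI
post = NormalDist(mu=xbar, sigma=sigma / sqrt(n))               # flat-prior posterior
cred = (post.inv_cdf(0.025), post.inv_cdf(0.975))               # credible interval

print(ci)
print(cred)   # the same two endpoints
```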

Anoneuoid:

You have given us a standard Bayesian viewpoint. While many may agree with you, it is perhaps too much of a narrow viewpoint for many others.

Also Bayesian credible intervals based on uniform priors may be very bad approximations to confidence intervals, especially with small samples (and of course with discrete data). Moreover, how do you justify using a uniform prior? What is special about it?

Well, my post contains some opinion/speculation but also a fact you seem to disagree with. The fact is that in many popular use cases you will not be led astray by interpreting a confidence interval as a type of credible interval.

Here is some discussion I found:

https://stats.stackexchange.com/questions/355109/if-a-credible-interval-has-a-flat-prior-is-a-95-confidence-interval-equal-to-a

And I mentioned what is special about (roughly) uniform priors. If p(H0) ~ p(H1) ~ … ~ p(Hn), then the priors all drop out of Bayes’ rule. It allows for a simpler calculation at the expense of a (hopefully negligible) loss of accuracy. Think of it like dropping small terms from a denominator.

Sorry about the typos, I’m on mobile…

But another thing is that terms in the denominator where p(H_i) ~ 0 (very small relative to the others) can be dropped. So using a uniform distribution that covers the a priori plausible range is enough; i.e., it need not extend infinitely in both directions like some people approximate with a normal(mu, 1e4) or whatever.
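[A toy numerical version of this dropping-small-terms point (all the priors and likelihoods below are invented for illustration):]

```python
# 100 hypotheses: two carry almost all the prior; 98 share a tiny remainder.
priors = [0.495, 0.495] + [0.01 / 98] * 98
likes  = [0.30, 0.20] + [0.25] * 98      # p(data | H_i), made up for the example

full    = sum(p * l for p, l in zip(priors, likes))          # the whole denominator
trimmed = sum(p * l for p, l in zip(priors[:2], likes[:2]))  # tiny-prior terms dropped

post_full    = priors[0] * likes[0] / full
post_trimmed = priors[0] * likes[0] / trimmed

# Dropping the negligible-prior terms barely moves the posterior:
print(round(post_full, 3), round(post_trimmed, 3))   # 0.594 0.6
```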

I took a look at one of your papers and found this:

https://arxiv.org/abs/1809.02089

I think you will have difficulty understanding Bayes’ rule if you treat the denominator* as a “normalizing constant”. Is this a common practice among statisticians?

* p(data) = sum( p(data|H[0:n])*p(H[0:n]) )

It is common practice — the prior predictive density evaluated at the observed data is indeed the normalizing constant of the posterior distribution for the parameter. As long as we’re not considering uncertainty about the model it’s fine to just write it the way naked does (by me at least).

The problem I have with that is it hides away how scientists intuitively follow Bayes’ rule and why it works.

Eg, my post here:

https://statmodeling.stat.columbia.edu/2019/04/22/wanted-statistical-success-stories/#comment-1025707

Even in that case, you will miss that a uniform prior is special because it means the priors all cancel out. It just seems like mathematicians sweeping something pesky under the rug so they can focus on details that are largely irrelevant to everyone else.

Can’t say as I agree, but I also can’t say as I expect to convince you otherwise or be convinced otherwise by you.

¯\_(ツ)_/¯

https://stats.stackexchange.com/questions/275641/what-does-it-mean-intuitively-to-know-a-pdf-up-to-a-constant?noredirect=1&lq=1

I wonder how often/when this is true in practice. I would tend to assume the terms follow a power law and you only need to do a few of them and can ignore the rest.

> I wonder how often/when this is true in practice.

Essentially ALWAYS which is why we do MCMC. Imagine trying to integrate a 50 dimensional model, say one parameter for each state in the US.

Suppose you do it numerically by evaluating the density at just *two* points per dimension. Then you need 2^50 ≈ 1.13e15 function evaluations… yes, that’s 1.13 quadrillion.

Sorry, I must be missing your point.

1) I think going down the rabbit hole of figuring out computationally efficient ways to approximate evaluation of Bayes’ rule for high-dimensional models is exactly the type of thing leading statisticians to miss major aspects of what it tells us at a higher level. It is like trying to understand Newton’s force law by looking at how ephemerides are calculated: https://en.wikipedia.org/wiki/Orbit_modeling#Orbit_simulation_methods

Like this is the dismissive attitude I am seeing when searching the term “normalizing constant” (emphasis mine):

https://stats.stackexchange.com/questions/129666/why-normalizing-factor-is-required-in-bayes-theorem

People are being taught: “Don’t look inside p(data), there is nothing of interest for you there.” Instead, I think statisticians should take a closer look at p(data) and the role it plays.

2) For, e.g., Metropolis-Hastings, you never calculate p(data), since it cancels out when you take the ratio of two “posteriors”. Each accepted step is giving us another term in p(data), hopefully skipping all the negligible ones.

3) The reason MCMC works is precisely because we don’t need to calculate every possible prior*likelihood term (is there a better name for this?) to get a useful approximation of the posterior. The vast majority can be ignored since the prior is very small, the likelihood is very small, or it is approximately the same (equivalent for practical purposes) as a model that was already evaluated.
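[A minimal Metropolis sketch, added here as a toy illustration: the sampler only ever evaluates the unnormalized prior*likelihood, because p(data) would divide both sides of the acceptance ratio and cancel:]

```python
import random
from math import exp

random.seed(0)

def unnorm_post(theta):
    """prior * likelihood, deliberately left unnormalized: p(data) never appears."""
    prior = 1.0 if 0.0 <= theta <= 10.0 else 0.0   # flat prior on [0, 10]
    like = exp(-(theta - 4.0) ** 2)                # toy likelihood peaked at 4
    return prior * like

theta, samples = 5.0, []
for _ in range(20000):
    prop = theta + random.gauss(0.0, 1.0)
    # Acceptance ratio: p(data) would appear in numerator and denominator,
    # so it cancels; only the unnormalized terms matter.
    if random.random() < unnorm_post(prop) / unnorm_post(theta):
        theta = prop
    samples.append(theta)

print(sum(samples) / len(samples))   # close to 4, the center of the target
```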

Anon:

You write, “statisticians should take a closer look at p(data) and the role it plays.”

I think statisticians should definitely think hard about the data model, p(data|parameters). I don’t think the marginal distribution, p(data) = int p(data|parameters) d(parameters), is always so relevant, as it can involve integrating over all sorts of things that we don’t care much about.

To put it another way: from my perspective, p(data|parameters) is fundamental, it’s a key part of statistical modeling, whereas p(data) is less clearly defined.

I think one of the problems here is the notation, in which the expression p(data|parameters) looks more complicated than the expression p(data). Maybe it would be clearer if p(data|parameters) were given a simpler notation such as M (for “model”), and then p(data) would be written as int M d(parameters). Then it would be clear that I’m saying I’m interested in M, but not so interested in a somewhat arbitrary integration of M over a space that’s largely populated by things I don’t care about.

Anoneuoid: The asymptotic arguments you’ve used for model comparison, and for why we don’t need to evaluate the integral over all 2^50 points, are all valid… but from a computing perspective they have been considered and have led to dramatic developments in MCMC methods. These methods spend their time evaluating the density precisely where it needs evaluating and not where the contribution is negligible. In some sense these things were realized back in the 1940s, and certainly in the 1980s and 1990s, for example in Radford Neal’s seminal paper on Hamiltonian Monte Carlo. The goal is explicitly to sample only in the high-probability region.

The likelihoods p(data|params) are used to calculate p(data) along with the priors p(params), so of course they are more fundamental. But MCMC procedures can be understood as working by approximating p(data). Each new step gives us another “unnormalized posterior”[1] term; these collectively sum/integrate to approximately p(data), and if a step is accepted we store the associated parameter values. By using a Markov chain, hopefully we are not wasting effort calculating many negligible terms of p(data).

If you only care about estimating the parameters of the model, you won’t care directly about p(data), and you will probably throw away all the unnormalized posteriors used to get you there. But if you care about the probability that a model (+ associated parameter set) is correct, then you do want to normalize all these unnormalized posteriors by their sum to get the actual posterior probabilities.[2]

[1] I guess people refer to this term as the “unnormalized posterior”: p(data|params)*p(params)

[2] I think of “real” distributions as always discrete due to limits on measurement precision, so if you are sampling from a continuous approximation then I guess similar models (+ parameters) should be aggregated for this. “Similar” means there is no practical difference between the parameter values, etc. Actually, when p(data) is estimated this way it could be a good measure of convergence to check that its rate of growth approaches zero. Is that a thing?

Anoneuoid:

Yes, if you are doing a mixture model something like [p(Data|Params1,model1) p(Params1|model1) p(model1) + p(Data|Params2,model2) p(Params2|model2) p(model2) ]/Z

Where Z is the normalization constant, then typically you define p(model1) and p(model2) as parameters, say p[1] and p[2], put a prior on them, and enforce that they add to 1 (say, using a simplex in Stan with a Dirichlet prior on the simplex).

In this situation you absolutely need each of the sub-models to use a *normalized* representation of the density. This is a subtle point that has bitten me a few times; in Stan it means you can’t use ~ statements, and instead need functions like normal_lpdf to compute the normalized version of the conditional density.

I don’t think this is entirely forgotten, but it’s definitely a kind of advanced area of application and can be overlooked rather easily even by experts, it’s a good area to look for bugs.
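[The need for normalized sub-densities can be shown outside Stan too. A Python sketch (illustration only) with a two-component normal mixture: dropping each component’s 1/(sigma*sqrt(2*pi)) constant gives the wrong component weights:]

```python
from math import exp, pi, sqrt

def normal_pdf(y, sigma):
    """Properly normalized normal density (mean 0)."""
    return exp(-y * y / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))

def kernel(y, sigma):
    """Same shape with the normalizing constant dropped."""
    return exp(-y * y / (2 * sigma * sigma))

# Equal-weight mixture of Normal(0, 1) and Normal(0, 10): how much of the
# density at y = 0 belongs to the narrow component?
y = 0.0
good = normal_pdf(y, 1.0) / (normal_pdf(y, 1.0) + normal_pdf(y, 10.0))
bad  = kernel(y, 1.0) / (kernel(y, 1.0) + kernel(y, 10.0))

print(round(good, 3))   # 0.909: the narrow component rightly dominates
print(round(bad, 3))    # 0.5:   dropping the constants gets the split wrong
```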

I think calling concern about p(data) “advanced” is somewhat misleading. I wouldn’t consider myself to have an “advanced” understanding of MCMC (although I have written my own Gibbs samplers from scratch, etc.).

I’d say it is “atypical”, because the main way to apply Bayes’ rule has been parameter estimation via MCMC. In that case p(data) usually is not a primary concern (if at all).

RE: advanced applications.

I meant building finite mixture models in which there are multiple models competing for posterior mass through a finite simplex of mixing parameters. I think there are many more people doing single model estimation than there are doing model comparison across multiple models.

True. I wonder if that will change as cpu cycles get cheaper.

The current situation is widespread discontent from outsiders about statistics and seeming disagreement about nearly everything from inside the field… So perhaps it would be a good idea to reassess how statisticians have been thinking about things.

Many of the terminological problems stem from the resort to fallacies of composition and division.

Andrew: You seem to be giving the typical frequentist interpretation of CIs. I thought a key advantage of being Bayesian is that you get to assign a probability to the particular interval: some kind of epistemic statement of degree of belief or credibility. I may have read too quickly; I came here just to look something up but was drawn in by some interesting remarks.

Psssst! There’s a compromise to be had here, or I’m a monkey’s uncle. Go with it.

> There’s a compromise to be had here

Whether it will be had keeps being called into question by a few (many?) who at least verbalize the position that “those who do not agree that X is the only way forward are either stupid or evil or both”.

Right now, the X is Bayes versus Frequentist, but if one side succeeded in completely annihilating the other, the X would just change to something else.

I think the bottom line is that academia is not (now) a community trying to be scientifically profitable (bending over backwards to help each other become less wrong) but rather a debating community where the winners hope to take all (but of course no prisoners).

Much more like adversarial civil dispute processing, but with no rules, judges, or methods of enforcement…

From Andrew’s Apr 25 comment “I think a big big problem is that statistical methods have been sold as automatically generating trustworthy results”

I think a big big problem is that the *correct* choice between Bayes versus Frequentist is being sold by some as the only route for generating trustworthy results (and even usually just automatically).

The argumentative and ideological nature of current debates reifies the very statistics-related cautions that thought leaders say to guard against. The JAMA Current has allowed for comments at the end of John Ioannidis’ own response to ‘Retire Stat Sig’. But there are some gaps in reasoning to be filled.

Agree – as in “I assure you that I am most certainly correct and those who disagree with me are most certainly wrong or at least offensive to me”

Which is ridiculous, since for what most people are doing they give approximately the same numerical answer. So the end user is (usually) free to interpret the result of their frequentist calculation in a Bayesian way. Vice versa is fine too (although I have never heard of anyone wanting to do that…).

So there is this entire heated debate about something of no consequence to 99.9% of people who will use stats. Meanwhile the real problem of testing your hypothesis vs a default strawman hypothesis continues to go largely ignored and the BS conclusions are accumulating at an ever increasing pace.

I think the debate got off too cranky for some reason which, while entertaining, can set the stage for theatrical extremism.

To me the Bayes vs Frequentist debate is really about modeling processes vs replacing reality with a random number generator. Do you focus on the unknowns and try to discover how they work, or only on data and how functions of data behave mathematically? The Frequentist approach actively discourages mechanistic thinking, and I think this is why it is so offensive to me.

I almost put something in that post about “if people start testing their own hypotheses they will tend to become bayesian anyway, maybe that is the real reason there is such resistance to dropping the default strawman null”, but decided it would distract from the point.

And we do see people wanting to test strawman null models with Bayes factors and the like too. That could be just because they were mistrained to do that already though.

>That could be just because they were mistrained to do that already though.

Yes I think so. There’s nothing particularly Bayesian about Bayes Factors. They are maybe a useful tool to figure out which models you can drop from consideration as a computational simplification, or if you have to decide between a small discrete set of models, like whether you’ve detected a whale or a submarine via sonar or whatever.

One way that I like to think about this debate is how the central limit theorem functions. It’s a mathematical fact that if you add up almost any of the possible subsets of a bunch of numbers, the sum will be close to some value, so long as the population of numbers isn’t too weird (i.e., has no extreme outliers). This mathematical fact *does not rely on any fact about physics, biology, chemistry, psychology, social interaction, ecology, law, available medical treatments, etc.* It is entirely derivable from a counting argument about how many subsets it’s even possible to form that have averages far from the overall average, so focusing on the reliability of this fact and attributing it incorrectly to some fundamental physical property of the world is wrongheaded.
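A quick simulation of the counting argument (a sketch; the population and the “far” cutoff are arbitrary choices): among many randomly drawn subsets, only a tiny fraction have an average far from the overall average, with no physical assumptions anywhere.

```python
import random
import statistics

random.seed(1)
population = [random.uniform(0, 10) for _ in range(1000)]
mu = statistics.mean(population)

# Draw many random subsets and count how often the subset mean lands
# far from the population mean. Nothing physical is assumed: this is
# purely about how few such subsets exist.
n, trials, far = 50, 10_000, 0
for _ in range(trials):
    subset = random.sample(population, n)
    if abs(statistics.mean(subset) - mu) > 1.0:
        far += 1

print(far / trials)  # a small fraction, on the order of a percent
```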

At best, it is a way to improve measurements by cancelling out measurement error; at worst, it’s like asking a Ouija board which scientific laws you should believe in… The complaints you often give about testing null hypotheses are a kind of subset of the problem. A big problem is that in many fields we are replacing *theory* with *measuring things and then post-hoc theorizing about why we got those measurements, as if the measurements are inevitable facts about the world*. This is sure to fool you almost every time, as we aren’t even asking what quantities are the important determinants of the measurement. We can get RCT results, and it means that we can be reasonably sure that the thing we did caused the change in measurement… but generalization error can easily be huge when you move from the RCT to actual usage, because we have ignored all the important determinants of the outcome.

In medicine, for example, you recruit some people for a blood pressure trial, you randomize, and you find the drug reduces blood pressure and has few side effects. Fine, but you did the trial in southern Germany, where genetics, diet, exercise level, climate, jobs, social activities, and so forth all vary *considerably* from, say, Southeast Asia. Now if you give the blood pressure drug to people in Thailand or Cambodia, what will be the outcome, *and why*? Which of the many variable factors are critical to the outcome of the drug treatment? For example, activity of certain liver enzymes, or fish vs sausage in diets, or wealthy access to Mercedes-Benz automobiles for travel vs biking everywhere with a heavy bike trailer and hence having different exercise patterns? What?

This failure to even try to investigate a model of what happens is sometimes even taken as an *advantage*. A “model free inference”. I say bullshit.

Trying to extract useful information from the medical literature can be infuriating. For example, about a week ago I got bitten by some ants (I assume fire ants), and the pustules are still somewhat itchy a week later. I wanted to know how unusual this is, so I looked for a timecourse showing the percent of people whose symptoms resolved after x number of days. This data doesn’t seem to exist. All I found was “folk” knowledge that differs across various sources, some saying it should resolve in a few days, others several weeks.

Eventually, I found some useful info from a study where volunteers were bitten by the ants and observed. This was from all the way back in 1957:

https://jamanetwork.com/journals/jamadermatology/article-abstract/524964

Unfortunately they only describe typical results and a few case studies, so I didn’t get the timecourse I wanted. But it did answer my question.

Anon,

Here’s the method that works better than anything else I’ve tried for dealing with ant bites:

As soon as possible, use a cotton swab to put a *very small* dab of hydrocortisone cream on the pustule. Then cover it with a bandaid (whatever size and shape works best — I sometimes get multiple bites near each other so need to choose and position the bandaids to cover all, but not have adhesive directly on any pustule). Repeat after you bathe or sweat excessively. The bandaid serves three purposes: It keeps you from automatically scratching the bite, it prevents clothing etc. that brushes against the bite from initiating itching, and it keeps the hydrocortisone in place.

PS My guess is that the severity of symptoms depends on many factors, e.g., strain of ant, weather, sensitivity of individual, number of bites, previous exposure to the ant “venom”.

Thanks for the advice. I think I must be near the end for this instance though.

Yes, I think I also lucked out pain wise because it was early in the year. Supposedly the bites get worse in mid-summer because the concentrations of various venom components are seasonal:

https://www.sciencedirect.com/science/article/pii/S0091674986800859

Basically, I was surprised at how long the itchiness is lasting given how little pain I experienced compared to what others reported. I’d even describe my initial sensation as more a slight pinch followed by tingling than pain. Perhaps it was even a different type of ant though.

Yes, the actual biting is not a big deal; it’s the itching that drives you crazy — although the sting of the bite is usually worse the hotter the weather.

“The Fundamental Confidence Fallacy”.

I have found it useful to contemplate an example given by E.T. Jaynes (example #5 in this paper). The problem is to infer the support of a truncated exponential distribution. (The density is p(x) = exp(θ – x) for x > θ and zero otherwise.) There is an unbiased estimator and its distribution gives a pivotal quantity — it’s all very nice mathematically.

There’s one little problem though: the procedure can produce confidence intervals (let’s say, shortest two-sided 90% confidence intervals) that we can be certain do not contain the true value of θ. For example, if the observed data are, say, {x1, x2, x3} = {14, 12, 16}, then the shortest 90% confidence interval is [12.15, 13.83]. Since it lies entirely above the minimum of the sample, 12, we know that it cannot possibly contain the true value of θ.
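A numerical sketch of the two analyses for these data (the frequentist interval is the one quoted above from Jaynes’ example; the Bayesian interval assumes a flat prior on θ, under which the posterior has a simple closed form):

```python
import math

data = [14, 12, 16]
n, m = len(data), min(data)

# Shortest 90% CI for these data, as reported in Jaynes' example,
# built from the unbiased estimator mean(x) - 1 and its pivot:
ci = (12.15, 13.83)

# Flat-prior Bayesian analysis: the likelihood is proportional to
# exp(n * theta) for theta <= min(data) and zero otherwise, so the
# posterior is n * exp(n * (theta - m)) on (-inf, m]. The 90% highest-
# density credible interval is then [m + ln(0.10)/n, m].
cred = (m + math.log(0.10) / n, m)

print(ci[0] > m)                   # True: the whole CI sits above min(data)
print(round(cred[0], 3), cred[1])  # 11.232 12: respects theta <= 12
```

The credible interval can never cross the sample minimum, because the likelihood vanishes for θ above it; the confidence procedure built on the sample mean throws that information away.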

The questions to ask oneself are:

What went wrong, and how can it be fixed — how can a frequentist improve upon this confidence procedure?

What is the Bayesian analysis of the problem?

What has to be changed in the problem statement to make the Bayesian analysis yield exactly the same credible intervals as the above flawed confidence interval procedure?

Can’t you always choose a confidence/credible level where this is the case? Why 90% here?

Basically I don’t see what significance you could possibly attach to this since the width of the interval is arbitrary.

The point is that every single element of the interval is logically impossible: the exponential is truncated on the left, so if you see the data point 12 you know the truncation point is somewhere at or below 12, yet the confidence interval for the truncation point lies entirely above 12.

The point is that confidence coverage alone can’t guarantee that any particular interval isn’t garbage, in the sense of including regions of parameter space that the observed data already tell us cannot contain the true value. If you widen it you might get some values below the sample minimum, but you’re not going to move the entire confidence interval out of the garbage zone.

The challenge is to understand why a confidence interval procedure can fail in this way and how it can be fixed. In this particular case there is a right way to do it.

This makes more sense.

Well, for one thing, in this example the Bayesian math allows one to use the information that theta cannot be larger than the minimum observed value. The unbiased estimator based on the sample mean does not use that information. The confidence procedure works (I think) on average, given hypothetical replications of the whole shebang, but is not optimal for any one realized sample (the usual Bayes versus frequentist situation). In this example, the sample contains at least one value (16) that is pretty unlikely to occur in a sample of size 3, and it is jerking around the estimator based on the sample mean.

You’re exactly right (although I’d dispute that 16 is so large that its unlikeliness makes it, or something even larger, unworthy of concern). The important bit is that the sample minimum is also the sufficient statistic; using it restores the usual numerical equivalence of confidence intervals and credible intervals. Conversely, if a Bayesian never sees the full sample but only the sample mean, then the relevant likelihood comes from the distribution of the pivot, and again numerical equivalence of confidence intervals and credible intervals is restored.

> confidence coverage alone can’t guarantee that any particular interval isn’t garbage

Yes, confidence coverage is a very weak property that should never be thought of as other than a first step.

In a way, it reflects a rather sorry state of graduate education in mathematical statistics that this does not seem obvious to many statisticians.

To get better intervals further considerations need to be brought in such as basing them on likelihood, pivots, the least wrong reference set in bootstrapping, etc.

The truncated exponential is a good example. But:

First, it is a small sample size situation. I do not believe any interval from any school will be great.

Second, the example disallows the frequentist from using ANY other frequentist method, such as bootstrapping.

Third, it disallows the frequentist from using the scientific knowledge that theta < min(X_i)

Fourth, it doesn't say what the interval will be like when any other Bayesian prior is used.

So sure, anything can be viewed as a serious problem when you do 1-4.

Justin

You have to keep in mind what I’m attempting to illustrate here. Let me quote from the OP:

I believe that when AG wrote that, he was thinking about prior knowledge being the source of the information that lets us say that a realized interval was clearly wrong. I wanted to go further and show that the quoted text is true even without bringing prior information into the picture.

Let’s take your points in turn. To your first point: usually when people talk about intervals not being great in small sample size situations they’re talking about being unhappy with how wide the intervals are; this is a very different concern from the one I’m presenting.

To your second and third points: in no way am I claiming this is the only acceptable frequentist approach to the problem. I’m simply saying that *confidence coverage* alone isn’t enough, and there must be something more that makes our usual confidence procedures work. (And it turns out that in this case we can draw on this “something more” to improve on our inference, which is all to the good.) If you don’t believe me, ask Mayo; she’ll tell you that in problems of this sort, good long-run properties of a method of inference are merely necessary and not sufficient for severe testing in the case at hand. So don’t mistake this for an argument for Bayes. It’s about reference sets and recognizable subsets, issues that were of deep concern to Fisher, as described in this paper on the topic of reconciling pre-data and post-data properties of interval procedures.

To your fourth point: there is one thing we can say about any interval resulting from any other Bayesian prior: it won’t include values above the sample minimum, since the likelihood is zero there.

This year I feel like Sander Greenland’s public relations spokesperson. LOL Sander, get your attention over here, if you have the time & inclination.

https://discourse.datamethods.org/t/language-for-communicating-frequentist-results-about-treatment-effects/934/39

A very interesting discussion which I reviewed a couple of days ago.

The stakes are high, IMO.

I’m afraid that the problem is ill-stated.

The “95%” qualifies the interval building *procedure*. More precisely, it states that, when repeated on a large number N of independent samples from the same population, the proportion of resulting intervals containing the true value of the parameter will fluctuate around 0.95 (according to a binomial etc, etc…).

The only thing that can *safely* be said about ONE interval is that it either does contain the true value of the parameter or does not. The statement “this interval contains the true value of the parameter” is either true or false, but has no bloody probability: once the sampling is done, the interval you compute is a fixed one, and whether it contains the true value is a fixed fact (albeit unknown to us…), not a bloody random variable. What is unknown is whether the procedure you used worked or not.

In other words, when you state a 95% confidence interval, you are in fact stating: “Using a procedure that works 95% of the time, I say that the true value of the parameter belongs to this interval”. Your statement is either true or false; the “95%” probability is only the frequency with which you’ll be right when *repeatedly* using such procedures.
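A simulation sketch of this procedure-level reading (the normal population, sample size, and z cutoff are illustrative choices): each realized interval either covers the true mean or misses it, but the procedure covers roughly 95% of the time over repeats.

```python
import random
import statistics

random.seed(2)
true_mean, sigma, n, reps = 5.0, 2.0, 100, 2000

# Repeat the whole experiment many times; each run yields one fixed
# interval that either contains true_mean or doesn't.
covered = 0
for _ in range(reps):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    if xbar - 1.96 * se <= true_mean <= xbar + 1.96 * se:
        covered += 1

print(covered / reps)  # close to 0.95: a property of the procedure, not of any one interval
```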

[BTW: since this (very frequentist) use of probabilities can help you only to qualify the confidence you have in your statements, doesn’t this make such “frequentist” probabilities “subjective” objects? ;-))]

I do not think this problem is necessarily ill-stated. I think it points out somewhat problematic use of language in the usual statements about confidence intervals.

When we talk about a specific confidence interval (and we always do, the whole point of interval estimates is to have a specific one), we as frequentists cannot talk about probability, because the probability that the interval contains the parameter is either 0 or 1. The problem is we are uncertain which one, and we wish to somewhat quantify the uncertainty. So we appear to introduce a new measure of this uncertainty, that seems to be called “confidence”, that is calculated as a probability that a randomly selected element of the sample space of all simple random samples of size n from our population will produce an interval that contains the population parameter.

I believe the OP is asking, can we do the same thing for a pair of intervals? And I don’t see why not. Simply take a Cartesian square of the sample space, and calculate the proportion of pairs of samples that produce a pair of intervals that both contain the population parameter. The problem is, if you do that, you will end up saying things like “I am 90.25% confident that both of these obviously disjoint intervals contain the true value of the population parameter.” True, you will only say it less than 9.75% of the time, but I would rather not say it at all.
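The arithmetic behind this pairing construction, including the correction proposed in the original question, fits in a few lines (a sketch; it assumes the intervals are independent):

```python
def joint_level(p, k):
    # probability that k independent level-p intervals all cover their targets
    return p ** k

def per_interval_level(target, k):
    # level each of k independent intervals needs for joint coverage `target`
    return target ** (1 / k)

print(round(joint_level(0.95, 2), 4))         # 0.9025
print(round(joint_level(0.95, 4), 4))         # 0.8145
print(round(per_interval_level(0.95, 2), 4))  # 0.9747
print(round(per_interval_level(0.95, 4), 4))  # 0.9873
```

As the comment notes, though, this joint “confidence” remains a statement about the pair of procedures, not about any two realized intervals, which may be visibly disjoint.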

This issue keeps recurring and I keep dissenting. No, it is incorrect to state that the 95 percent confidence interval means there is a 95% chance that the true population parameter lies in that interval. But I still think that this incorrect restatement is not that bad.

Let’s be concrete: I just looked up the American Community Survey data for my zip code. The ACS is sample data, and median household income for my zip code is listed as $43,404 +/- $1,664 (the latter number is the margin of error). The margin of error is defined as based on a 90 percent confidence level, and the documentation correctly defines what a confidence interval means:

“For example, if all possible samples that could result under the ACS sample design were independently selected and surveyed under the same conditions, and if the estimate and its estimated standard error were calculated for each of these samples, then:

1. Approximately 68 percent of the intervals from one estimated standard error below the estimate to one estimated standard error above the estimate would contain the average result from all possible samples.

2. Approximately 90 percent of the intervals from 1.645 times the estimated standard error below the estimate to 1.645 times the estimated standard error above the estimate would contain the average result from all possible samples.

3. Approximately 95 percent of the intervals from two estimated standard errors below the estimate to two estimated standard errors above the estimate would contain the average result from all possible samples.”

My question is what advice would you give to someone who wants to use the data from my zip code to say something about household income in my zip code? I’d like to say that I’m 95% confident (I’d prefer a different word, but “confident” seems a bit better than “sure”) that the median household income is within 1.645*$1,664 of $43,404. But that’s wrong. If I provide the lengthy correct description, what is someone to do with that? Should they ignore the numbers completely? Is the advice that if no additional samples are taken we can say nothing about median household income? What exactly would your advice be about how to use this estimate?
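One mechanical detail worth making explicit (a sketch following the ACS documentation quoted above; the dollar figures are from the comment): the published margin of error is at the 90% level, so recovering the standard error and rescaling gives intervals at other levels.

```python
est, moe90 = 43404, 1664   # ACS estimate and its published 90%-level margin of error

se = moe90 / 1.645         # undo the 90% multiplier to recover the standard error
moe95 = 2 * se             # the ACS text's "two estimated standard errors" convention

print(round(se))                               # roughly 1012
print(round(est - moe95), round(est + moe95))  # approximate 95% interval
```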

I think this is a really important question. At least as a teacher I don’t want to leave my students thinking that we might as well not use well-collected data if there is uncertainty. In this case I think you have to ask “what are you using it for?” and “what are the consequences for making a wrong decision based on this?” And then I’d also possibly suggest that someone wanting to use the data consider working through their decision process with the mean and the high and low boundaries of the interval and see how different the results are. Then consider the impacts of those differences. They might not be very big for some purposes but be huge in others. If I’m deciding to locate a business in your neighborhood that will only appeal to people with HHI below 30k a year or above 60k, I think that interval is adequate to tell me this might not be a good spot. If I’m buying advertisements targeted at people with HHI between 45k and 50k and I can stop paying for them quickly if they don’t work out, then I’d say it’s probably fine too.

Elin said,

1. “At least as a teacher I don’t want to leave my students thinking that we might as well not use well collected data if there is uncertainty.”

and

2. “In this case I think you have to ask “what are you using it for?’ and “what are the consequences for making a wrong decision based on this?” And then I’d also possibly suggest that someone wanting to use the data consider working through their decision process with the mean and the high and low boundaries of the interval and see how different the results are. Then consider the impacts of those differences. They might not be very big for some purposes but be huge in others. “

Responding to (1): I think it is important that students learn that, even with the best statistical analysis and very well-collected data, there will always be some uncertainty. This is just part of the fact of life that we can never eliminate uncertainty completely.

Responding to (2): This is good advice in many cases — so it is important that they learn to ask themselves these and other questions that allow them to use what we can glean from (uncertain) information to make decisions that are more likely to be sound than if they ignored the data entirely.

I agree with both points 1 and 2. What I find distracting is the emphasis on the correct interpretation of the confidence interval. The fact that it is a property of repeated sampling rather than a property of the single sample you have seems less important to me than the myriad reasons to be skeptical, inquisitive, and probing, as well as the importance of recognizing and embracing uncertainty. Frankly I don’t find the repeated sampling point useful towards that end. Perhaps others disagree.

The fact that we cannot say whether the one particular interval we have is one of the 95% that cover the true value or the 5% that do not, I don’t find at all helpful. What is my best estimate that this is one of the 95%? 95%! What I think is far more important is that 95% is not really 95% for a host of reasons, from measurement issues, to modeling issues, to interpretation-of-probability issues. Again, others may disagree.

I both agree and disagree that the emphasis on the correct interpretation of the confidence interval is misplaced. I disagree with you because it is *not* true in general that we cannot be sure that a realized confidence interval has failed to cover the true value, and this is sure to be baffling unless the correct interpretation is borne in mind. But on the other hand, I agree with you because one of the things my linked example shows is that confidence coverage alone cannot justify an inference; the reason why the confidence interval procedures in common use seem to work well must lie elsewhere.

I have two problems with your disagreement. First, it will take me a while to decipher what “cannot be sure that…has failed” means. It is a double negative with an absolute positive in the middle. Can’t you say that more directly? More substantively, I’m not convinced that special cases prove a general point. If your point is that my incorrect description of a confidence interval is not always even close to correct, I’ll agree. But how common are these exceptional cases such as the one you provide? Further, my incorrect description of a confidence interval is not always close to correct for more important reasons: e.g., forking paths, measurement errors, model errors, etc.

My point is that any interpretation using the words “sure” “confident” “certain” is misleading. “Compatible” is better, but still does not portray the number of ways an inference can be unwarranted. In the midst of all these serious issues, I am not convinced that some extreme examples where the confidence interval is certain not to cover the true value is really that important.

“Can’t you say that more directly?”

I wanted to quote you directly. I’m saying confidence coverage alone doesn’t guarantee that a realized interval makes sense in light of *all* the information in the data set.

“I’m not convinced that special cases prove a general point.”

Depends on the nature of the general point being made. The claim is that confidence coverage alone cannot justify an inference (even when everything else has gone right); by giving an example of a realized confidence interval yielding a demonstrably false inference, I have provided a constructive proof of the claim. Actually, it’s the confidence interval procedures in common use that are the special cases, and it’s useful to know what makes them special.

“Further, my incorrect description of a confidence interval is not always close to correct for more important reasons [and so on]…”

No disagreement on those larger concerns from me, but I do disagree with two things you say. First, my example is not extreme; this isn’t like the goofy “independent of the data, 95% chance of returning the entire parameter space and 5% chance of returning the null set” procedure. The Jaynes paper motivates the model with an actual industrial application, and the subsequent treatment does appear at first look to be reasonable by frequentist lights, what with the unbiased estimator and pivot and so forth. Second, we raise those serious issues so that we can address them; and once we *have* addressed them as best as we are able, we *are* going to do inference! So it *is* important to know exactly when and why our confidence procedures are trustworthy.

Dale:

In your example I would give people the Bayesian interpretation, i.e., given various assumptions, I’m 95% sure the interval contains the true value. The difficulty comes from the contortions needed to interpret the frequency statement. In your case I don’t see any problem with just going Bayesian, or interpreting the classical estimate Bayesianly.

A lot depends on what questions are asking. If you really want to know about your zip code, then I think the Bayesian interpretation is the only relevant answer. If you care about a large number of randomly selected zip codes, then you can go with the ACS’s story. As you say, though, the ACS’s story is not so useful if you want to make a statement about a single zip code.

Also good to ask why ACS didn’t use a subjective, or any, Bayesian interval here.

Could it be the likelihood swamped the prior?

Why not allow the person in the zip code to use their own prior? Does it matter what a person believes the average household income should be?

Justin

I was having a conversation with someone about this on Reddit the other day, and when I gave an example of prior information that very clearly reduced the probability that this interval was one of the right ones, they agreed. But then they made a more interesting statement – that the probability would depend on what information you’re including in the model, so there could be a bunch of different probabilities that the parameter is there, all of them valid in a certain way.

This is an interesting thought, but is that right? To some extent this feels like a “it’s your opinion” view of Bayesian statistics, but a similar problem would also apply to frequentism (when you say that in situations like this the coin has a 50% probability of landing heads up, what factors are you using to define “situations like this?”). One possible answer that occurred to me is that your probability is incorrect unless you utilize every bit and every kind of information that you have, but then I don’t think you can realistically do that a lot of the time. My best guess is that you do have to utilize all information and that you’ll probably fail – and this is a contributor to that “(all/most) models are wrong” thing. What do you think?

I agree that the (calculated) probability would depend on what information you’re including in the model, but I disagree with “so there could be a bunch of different probabilities that the parameter is there, all of them valid in a certain way”.

The way I see it is that the validity of the calculated probability (i.e., the degree to which it fits reality) depends on the quality of the model. Thus, statistics is not just a matter of deciding on a model and using it to calculate probabilities; the art and science of choosing a good model for a given problem is also (very) important (and too often ignored). Choosing a good model requires both a good understanding of the various possible models, and a good understanding of the context/problem being studied. (And sometimes it even involves coming up with a new type of model.)

Hans van Maanen writes:

Primary endpoints in a clinical trial are not independent. On the other hand, fill the bowl with an infinite number of 95% CIs on the same endpoint, and draw two of them at random. These *are* independent. More realistically, repeat the clinical trial with an independent group of subjects from the same hypothetical population, and calculate the 95% CI from each trial. Say these two CIs are non-overlapping. As Hans writes, if each realized 95% CI has a .95 probability of containing the true parameter value, then two independent 95% CIs must have a .95 × .95 probability of both containing the true parameter value. But the probability that these two realized 95% CIs both contain the true parameter value is 0. Therefore, it is not the case that a realized 95% CI has a .95 probability of containing the true value of the parameter.

Andrew:

Re: ‘I agree with what you wrote there. I think your mistake is to think that statistics is about producing intervals that you can trust. My take on the replication crisis is that it’s all about scientists and publicists trying to use p-values and other statistical summaries to intimidate people into trusting—accepting as scientifically-proven truth—various claims for which there is no good evidence.’

Very well said. This was my guess when it came around to asking them for theories for their ‘science’ and ‘statistics’ claims. Just could not explain it all that well. Compelling evidence that something is amiss in medical and statistics education and, more fundamentally, conceptually and methodologically. Science itself is presumed to be self-correcting, but it is not a timely process by any means. I think Raymond Hubbard’s frame he refers to as ‘hypothetico-deductivism’, accepted as a model for much science and justified by the use of NHST, can be expanded. Would make for an intriguing query. I haven’t read Corrupt Research yet. But am fascinated by its reviews.

> If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.

If I throw a die I can be 50% sure that I will get an even number. This doesn’t mean that I can be 50% sure that the number I just got is an even number. If I’ve seen the number and I know what it means for a number to be even, then I can be 100% sure that the number is (or is not) even.