As Sander put it – only the “compatibility” criterion differs.

Frequentist compatibility is conditional on the specific tested parameter – how often would possible data be as discrepant as the observed data, or more so, if the specific tested parameter were true?

Bayesian compatibility is conditional on the observed data – what’s the distribution of parameters, each of which would generate possible data matching the observed data at least this often (or this plausibly)?
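To make the two criteria concrete, here is a minimal sketch for the simplest textbook case: a normal mean with known standard deviation. The data values, the function names, and the choice of a conjugate normal prior are all assumptions made for illustration, not anything taken from the comments. It shows the frequentist interval as the set of tested means with p-value at or above 0.05, and the Bayesian interval as a central posterior range. With a nearly flat prior the two coincide numerically, and only an informative prior pulls them apart.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(mu0, xbar, sigma, n):
    """Two-sided p-value: how often data at least this discrepant from mu0
    would arise if mu0 were the true mean (normal model, known sigma)."""
    se = sigma / n ** 0.5
    z = abs(xbar - mu0) / se
    return 2 * (1 - STD_NORMAL.cdf(z))

def frequentist_interval(xbar, sigma, n, level=0.95):
    """All tested means mu0 whose two-sided p-value is at least 1 - level."""
    se = sigma / n ** 0.5
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return xbar - z * se, xbar + z * se

def posterior_interval(xbar, sigma, n, prior_mean, prior_sd, level=0.95):
    """Central posterior interval for the mean under a normal prior
    (standard conjugate update with known sigma)."""
    data_precision = n / sigma ** 2
    prior_precision = 1 / prior_sd ** 2
    post_var = 1 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * xbar + prior_precision * prior_mean)
    post = NormalDist(post_mean, post_var ** 0.5)
    return post.inv_cdf((1 - level) / 2), post.inv_cdf((1 + level) / 2)

# Hypothetical data: observed mean 1.2 from n = 25 observations, sigma = 2.
xbar, sigma, n = 1.2, 2.0, 25
print(frequentist_interval(xbar, sigma, n))
print(posterior_interval(xbar, sigma, n, prior_mean=0.0, prior_sd=1e6))  # nearly flat prior
print(posterior_interval(xbar, sigma, n, prior_mean=0.0, prior_sd=0.5))  # informative prior
```

Both intervals here are conditional on the same assumed normal model, so neither says anything about uncertainty the model leaves out.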

This actually allows one to mitigate the degree of uncertainty laundering with Frequentist methods while introducing Bayesian inference in a way that is inoculated against uncertainty laundering, using a workflow introduced first with Frequentist methods.

Or so I hope.

]]>I share concerns about confusing types of intervals, but there is a sense in which they are more alike than different whenever the model is only hypothetical. In that case there is no real coverage validity (calibration), and both types of intervals are only showing compatibility of the data with their assumed model; only the “compatibility” criterion differs. This raises the possibility of other criteria, but then the resulting interval functions have usually turned out to be numerically the same as particular coverage or credibility functions (as with pure likelihood).

HDPI may be OK insofar as it sounds hard to misinterpret, but researchers are creative so may prove me wrong if they adopt it (which does not seem likely any time soon in my field).

]]>From a getting-paid-to-do-work perspective, of course you are absolutely right. I’d just say that being open and up-front about those particular issues, and telling people what *should* be done even if it can’t be done, is a task we should bend over backwards to do. Of course, it doesn’t make getting contracts any easier… let me agree entirely on that. It’s a pleasure when you find someone who will buy the real deal.

My wife was discussing the budget for a grant with one of her colleagues, and she proposed putting something in the budget explicitly for data analysis. Her colleague just said they should find some collaborator whose lab would do it free; after all, it’s only a few hours of a grad student’s time or something to press the buttons on the bioinformatics software and write up the results, right?

:-|

]]>I do like that McElreath calls Bayesian intervals “compatibility intervals”, but I think it might open the doors for possible confusion with frequentist intervals. What do you think about 95% highest density posterior intervals (HDPI) for Bayesian intervals? It’s one that John Kruschke uses in his Bayes book.

]]>Daniel: I agree to all that in principle. Unfortunately as with so much that is good “in principle”, it’s simply not practical (apart from infrequent exceptions), at least in my main application field (medical drug, device, and practice surveillance). There, few researchers can correctly interpret a P-value or CI let alone comprehend in detail the ordinary unrealistic model generating those; some high-prestige journals like JAMA even force authors to misinterpret P-values and CIs!

No surprise then that the labor involved in modeling out uncertainty sources in detail is well beyond that budgeted for analyses, and far beyond the training or competence of most teams. Worst of all, the incentives are all stacked to do no such thing, because it will inevitably lead to weaker conclusions not even worth a press release let alone acceptance in a high-status journal.

I strongly doubt the situation is any better in other health sciences or social sciences or psychology. In the face of such harsh reality, I see no alternative than to try and force honest description of conventional outputs. At least get away from terms promoting overconfidence, like “significance”, “confidence”, “coverage”, “credibility”, etc. in favor of less sensational, more modest ordinary-language descriptions, as illustrated in Chow & Greenland, http://arxiv.org/abs/1909.08579

]]>Thanks Andrew! I’d like to think we are getting closer…

For this iteration, in response:

First: “CI” is just an abbreviation for “confidence”, “coverage”, “credible”, “compatibility” etc. (e.g., “crap”) interval. It solves only a speed-typing problem. What they share is that none of them capture uncertainty outside of stylized (and in my work, unrealistic) examples. Otherwise we should face the fact that the interval estimates in research articles and textbooks do not deserve labels as strong as “confidence”, “coverage”, or “credible”. The key question is: Why should we care about uncertainty (or coverage, confidence, or credibility) given unrealistic models? At best we are only getting compatibility with those models (distinguished from the other Cs only in that it is not a hypothetical conditional; see Greenland & Chow, http://arxiv.org/abs/1909.08583).

Second: Fully agree that “confidence intervals” rarely have their claimed coverage properties and so are not coverage intervals (thus their name is a confidence trick, as Bowley said upon seeing them in 1934). That’s why I call them “compatibility intervals” in my work. And fully agree that “credible intervals” rarely warrant credibility near what is stated (e.g., 95%) and often contain incredible values, so that at least one modern Bayesian text (McElreath) also calls them “compatibility intervals” (albeit here the compatible models include an explicit prior).

Third: If you agree that all these CIs are model-based and thus do not capture total uncertainty, then you’ve made my point: “Uncertainty interval” (UI) is a very bad term for them because (apart from very special cases) CIs do not capture total uncertainty. Worse, CIs often capture only a minority of uncertainty, for the reasons I stated.

Adding those up: You have been a leader in condemning uncertainty laundering, hence I’m baffled as to why you’d continue to promote labeling CIs as UIs. It seems obvious (to me anyway) from past researcher performance that they already take CIs as representing total uncertainty; thus relabeling CIs as “uncertainty intervals” will only dig in this misinterpretation even deeper. At best, they could be labeled as “MINIMAL-uncertainty intervals” with a massive emphasis on “minimal”, but then we should caution that they may be WAY too narrow, and may be biased WAY off to an unknown side.

]]>Sander, shouldn’t we be advocating that people actually model those often unmodeled uncertainties? I mean, for example, unless your measurement apparatus is quite good, you should probably have a measurement error in your model, and unless you’re doing an extraordinary job of recruiting a wide variety of patients to match the demographics of your country, you should be including some kind of sample bias or something in your model, and when there are generating process issues, you should add reasonable “width” to your likelihoods, which can be accomplished through informative priors that bias the error scales away from zero intentionally…

having done all that, we won’t be perfect, but we won’t be fooling ourselves either, and now, with those components in our models, we can discuss them explicitly and argue over what a good model for them is…

Anything else is, I agree, fooling ourselves, and like Feynman said in his cargo cult lecture, the first thing we need to do is not fool *ourselves*.

]]>Sander:

Statistical interval estimates are used in different ways, including to express confidence in a conclusion, to express a range of credible values, and to express uncertainty about an inference. In that sense, all three terms, “confidence interval,” “credibility interval,” and “uncertainty interval,” are reasonable, as they represent three different goals that are served by interval estimation. Separating these concepts can help, as there are examples of confidence intervals that do not include credible values and do not summarize uncertainty, there are examples of credible intervals that do not convey confidence and do not capture uncertainty, and there are examples of uncertainty intervals that are not interpretable as confidence or credibility statements.

Regarding your point: all three of these concepts—“confidence interval,” “credibility interval,” and “uncertainty interval”—are model-based, and all of our models are wrong. So, sure, I agree, except in some rare cases, uncertainty intervals do not capture total uncertainty. But the same is the case for confidence and credibility intervals. Except in some rare cases, confidence intervals do not have the claimed confidence properties, and, except in some rare cases, credible intervals can exclude credible values and include incredible values.

If you want to call the term “uncertainty interval” a “sales gimmick,” fine. I’d prefer to say it’s a mathematical statement conditional on a model, which is what I’d also say of “confidence interval” or “credibility interval.” I don’t see how calling it a “CI” solves this problem.

]]>Great!

Now, if you’d only stop claiming that “confidence” and “credibility” intervals (CI) are “uncertainty intervals” we might approach stat nirvana. Until then, that “uncertainty” label is conning the reader and ourselves. Why? Because CI do NOT capture total uncertainty (outside of the highly idealized examples that characterize the toy universe of math stat). That means calling either kind of CI an “uncertainty interval” is part of the usual stat sales gimmick of empty quality assurance (AKA “error control”).

Look, we always compute CI from a data model. In 100% of my work (and I bet about the same % of yours) there’s serious uncertainty about the underlying physical data-generating process. That process has important features not captured by our model, like measurement errors and selection biases. In that case the CI flopping out of our software (whether SAS, Stan or Stata) are OVERCONFIDENCE intervals, and should not be assigned anything near either the numeric confidence or credibility shown alongside them.

Unless you carry out the arduous task of including all important uncertainty sources in the model, CI do NOT account for our actual uncertainties about the mechanisms producing the data. And that uncertainty can far exceed any uncertainty from the “random variation” allowed by the assumed model; see for example Greenland, S. (2005). Multiple-bias modeling for analysis of observational data (with discussion). J Royal Statist Soc A, 168, 267-308.
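As a rough illustration of the overconfidence point, here is a small simulation sketch. Everything in it is an assumption made for the example: the normal model, the parameter values, and the idea of representing unmodeled measurement or selection problems as a single study-level bias with an unknown sign. It computes how often a nominal 95% interval, built only from the assumed random variation, actually contains the true value.

```python
import random
from statistics import NormalDist

def coverage(bias_sd, n_sims=20000, mu=0.0, sigma=1.0, n=100, level=0.95, seed=1):
    """Fraction of nominal intervals that contain the true mean mu when each
    study carries an unmodeled systematic bias drawn from Normal(0, bias_sd)."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5                      # standard error under the assumed model
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(n_sims):
        # Hypothetical bias the analyst's model knows nothing about.
        bias = rng.gauss(0.0, bias_sd) if bias_sd > 0 else 0.0
        # Sample mean of n observations, shifted by the unmodeled bias.
        xbar = rng.gauss(mu + bias, se)
        if xbar - z * se <= mu <= xbar + z * se:
            hits += 1
    return hits / n_sims

print("no unmodeled bias:       ", coverage(bias_sd=0.0))
print("bias sd = 2x the std err:", coverage(bias_sd=0.2))
```

In this toy setup the interval width never changes, because the software only sees the assumed model; only the actual coverage drops once the bias is present.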

Note well: Model averaging and so-called “robust” (another con word) methods don’t address this uncertainty problem. Those methods only address uncertainty about the “best” mathematical form for combining the observations, not problems with the observations like measurement error, selection bias, and (in allegedly causal analyses) uncontrolled confounding.

At best then, we can only say that CI show us a range of good-fitting models (models “highly compatible with the data”) within the very restricted model family used to combine the observations.

]]>Sander:

It’s not a fallacy, it’s an assumption! But I agree that assumptions should be clear. So I’ve rewritten that entry; it now says, “16: You need this much more of a sample size to estimate an interaction that is half the size of a main effect.”

I agree that the entry as written was potentially misleading, so thanks for giving me the push to fix it.
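For readers wondering where a factor of 16 can come from, here is a minimal arithmetic sketch. It assumes a balanced 2x2 design with equal allocation and a common outcome standard deviation, and the function names are made up for this example. The main effect compares two halves of the sample, the interaction is a difference of differences across four quarters, so the interaction's standard error is twice as large; if the interaction is also half the size, the standard error has to shrink by a further factor of two.

```python
def se_main_effect(sigma, n):
    """Standard error of a difference between two groups of n/2 each."""
    return sigma * (1 / (n / 2) + 1 / (n / 2)) ** 0.5   # = 2 * sigma / sqrt(n)

def se_interaction(sigma, n):
    """Standard error of a difference of differences across four cells of n/4 each."""
    return sigma * (4 * (1 / (n / 4))) ** 0.5            # = 4 * sigma / sqrt(n)

def sample_size_ratio(interaction_fraction=0.5):
    """How many times larger n must be for the interaction to have the same
    estimate-to-standard-error ratio as the main effect, when the interaction
    is interaction_fraction times the size of the main effect."""
    se_ratio = se_interaction(1.0, 1.0) / se_main_effect(1.0, 1.0)  # = 2
    return (se_ratio / interaction_fraction) ** 2

print(sample_size_ratio(0.5))   # interaction half the size of the main effect
print(sample_size_ratio(1.0))   # interaction the same size as the main effect
```

The second call shows why the size assumption matters: with an interaction as large as the main effect, the same arithmetic gives a factor of 4 instead.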

]]>As an antidote to this fallacy go to this exchange:

https://statmodeling.stat.columbia.edu/2020/02/10/evidence-based-medicine-eats-itself/#comment-1242382 ]]>

It would be great if you got John Ioannidis here to debate the p-value debate. What is its disposition? Everyone goes off, leaving just shy of making an impact debate-wise. Is one to conclude that this debate is on the back burner?

]]>I have identified some individuals who I think can make superb contributions. This forum too can be helpful.

]]>On the one hand, I love learning about ALL of this stuff, especially the more subtle fallacies.

But on the other hand, my list of things to read just exploded exponentially.

So, thank you. Jerk.

]]>So just skip the earlier parts?

]]>