“Language for communicating frequentist results about treatment effects”

https://discourse.datamethods.org/t/language-for-communicating-frequentist-results-about-treatment-effects/934

Among many other topics, Sander explains why he prefers “compatibility interval” to “confidence interval”:

Just to clarify, the IPCC reference wasn’t intended as an appeal to authority. Rather, I was merely pointing out that others have come up with similar descriptors of probabilities to try to help make sense of the data/findings. Thanks for the notes and citations/links to posts on objective Bayes and reference priors – I’ll check those out. Agree with your last point about the big picture.

Thanks for your reply.

> one should be wary of appeals to ‘authority’

Agree, so we can disregard the reference to the IPCC.

> not attempting to assist with interpretation of findings

Also agree – for many of us that should be a main part of our job

> as a plausible range of effect sizes compatible with the data and model

That’s doable without a prior, but with a prior the compatibility assessment needs to include compatibility between the prior and background knowledge, and I don’t believe the ‘objective’ or ‘reference’ priors are compatible in that sense. Rather, they simply restore the compatibility assessment to being in terms of frequentist properties. For instance, in OBayes publications a big deal is made of obtaining frequency properties that are sometimes even better than those of frequentist-derived confidence intervals.

Some of this is discussed in the linked paper I gave above, in the simulation in Andrew’s post here http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/ and in Dan Simpson’s post on simulating fake data from default priors and how impossible such data is in reality http://statmodeling.stat.columbia.edu/2018/09/12/against-arianism-2-arianism-grande/

The big picture is that attempting to assist with interpretation of findings is an open question without any group being able to convince others widely as to how to go about it.

In any event, the thresholds underlying the MBI descriptors of likely, very likely, etc. were never intended to be prescriptive or to encourage/force dichotomous thinking. Rather, developed in extensive consultation with clinicians and practitioners over the years, they represent a first stab at such a scale. The overriding intention is, indeed, to fully and properly embrace uncertainty of estimation. Certainly no ‘sleight of hand’ or lowering of the standards of evidence was ever intended. There is no doubt that – as with all methods – there are examples of misuse of MBI, so the way that MBI is sometimes used in practice is something of a cause for concern. I agree with Andrew, of course, that the main focus should be on better designs, more precise measurement, improved reporting, and raw data sharing in this open science era, rather than sterile debates about methods or ‘rules’ of inference. I also agree that a focus on error rates is a distraction, especially as MBI does not involve hypothesis testing. Nevertheless, we took up this challenge because at some stage someone has to make a decision based on data – though preferably not from a single study – and decisions come with attendant errors. Essentially, our feeling is that not attempting to assist with interpretation of findings to aid decisions affecting policy and practice is ‘kicking the can down the road’, as it were. I am, however, grateful for David Spiegelhalter’s wise counsel that these are deeply contested issues, no opinions should be considered definitive, and one should be wary of appeals to ‘authority’. Thank you all once again for your interesting feedback – I am genuinely very grateful for the discussion.

1. Mastrandrea MD, Field CB, Stocker TF, Edenhofer O, Ebi KL, Frame DJ, Held H, Kriegler E, Mach KJ, Matschoss PR, Plattner G-K, Yohe GW, Zwiers FW (2010). Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties. Intergovernmental Panel on Climate Change (IPCC): https://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note.pdf

There were some papers advocating this sort of approach in medical research about 20 years ago – these were two (if I remember correctly):

https://www.ncbi.nlm.nih.gov/pubmed/11343760

https://jech.bmj.com/content/jech/52/5/318.full.pdf

It didn’t catch on, but maybe the inspiration for MBI came from there?

MBI’s appeal comes from this underlying sleight of hand. People are born Bayesians and find it almost impossible to think in frequentist terms. Outside of trained statisticians, I have found that nobody (literally) thinks in frequentist terms, and it is almost impossible to explain the frequentist interpretation to people who have not taken a number of statistical courses. (One statistical course is usually not enough.) Even well-trained professionals often backslide into Bayesian interpretations of frequentist results.

Such is life.

That is a question I was going to ask myself, but I thought that coming from me it would be viewed as naive. But from you, well, it will be taken seriously.

Confidence intervals start out as compatibility intervals (a set of parameter values not too incompatible with the data and the data-generating-process model) and only make it to confidence intervals when the data and assumptions seem beyond reasonable doubt. Credible intervals start out as compatibility intervals too (a set of parameter values not too incompatible with the data and the data-generating-process model, with respect to their initial distribution, the prior) and only make it to credible intervals when the data, assumptions and prior seem beyond reasonable doubt.

Drawing from arguments here – Inferential Statistics as Descriptive Statistics: Amrhein, Trafimow and Greenland https://peerj.com/preprints/26857.pdf
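To make “a set of parameter values not too incompatible with the data and model” concrete, here is a minimal sketch that builds such an interval by brute force: scan candidate effect values and keep those whose p-value exceeds a cutoff. It assumes a normal model with a known standard error, and the numbers (`estimate`, `se`) and the function names are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def p_value(theta, estimate, se):
    """Two-sided p-value for the hypothesis that the true effect equals theta,
    under a normal model with known standard error (an assumption)."""
    z = abs(estimate - theta) / se
    return 2 * (1 - NormalDist().cdf(z))

def compatibility_interval(estimate, se, alpha=0.05, half_width=10.0, steps=20001):
    """Keep every grid value whose p-value exceeds alpha; return the extremes."""
    lo = estimate - half_width * se
    step = 2 * half_width * se / (steps - 1)
    kept = []
    for i in range(steps):
        theta = lo + i * step
        if p_value(theta, estimate, se) > alpha:
            kept.append(theta)
    return min(kept), max(kept)

# Hypothetical estimate and standard error.
estimate, se = 1.2, 0.5
low, high = compatibility_interval(estimate, se)
print(low, high)
```

Under these assumptions the endpoints agree, up to the grid spacing, with the usual estimate ± 1.96 × se. The calculation itself says nothing about whether the data and model deserve to be trusted, which is the step that would promote the interval from “compatibility” to “confidence”.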

The fact is you can’t *make decisions mechanically* because decisions ultimately need to be about usefulness/utility, and no robot can decide for you, or for any group of people, what the utility should be.

However, once some kind of utility has been decided on, you *can* calculate mechanically, and *this* is what Bayes does. The frequentist probability that 95% of constructed intervals contain the true value just doesn’t let you calculate in the way you need to: relative probabilities *within* the interval can’t be calculated in frequentism. In particular, you’ll get different results with different tests of the same set of hypotheses, etc. To get consistency *within* the interval you’ll need to adopt the likelihood, and then if you insist on flat priors you’ll be doing something that is provably dominated in frequentist terms compared to a real-world prior in all but the most awkwardly constructed fake problems (i.e., problems where parameters of interest have numerical values that really do exceed the limits of an IEEE floating point number, for example).

So I think the answer is clear: teach decision making, teach the importance of utility, and teach Bayes. Drop confidence and NHST because, just like you say, “Perhaps that is much of the problem with NHST – we teach people that their answer is reject/do not reject rather than the reasoning behind the decision they reach”. Instead, *teach reasoning about decision making*. NHST is a broken paradigm whose goal was to mechanize discovery in an age when it seemed like mechanizing everything was a good idea (1940’s Dieselpunk ethos).
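As a rough illustration of “once the utility is decided, you can calculate mechanically”, here is a minimal sketch of an expected-utility comparison. Everything in it is an assumption made for the example: the posterior for the treatment effect is taken to be Normal(0.5, 1.0), and the utility weights harm twice as heavily as benefit and charges a fixed cost of 0.2 for treating. The human judgment lives in `utility_of_treating`; the rest is arithmetic.

```python
from statistics import NormalDist

def expected_utility(posterior, utility, lo, hi, steps=20000):
    """Approximate E[utility(effect)] under the posterior with a midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += utility(x) * posterior.pdf(x) * width
    return total

# Hypothetical posterior for the treatment effect.
posterior = NormalDist(mu=0.5, sigma=1.0)

def utility_of_treating(effect):
    # Hypothetical utility: harm counts double, and treating has a fixed cost.
    gain = effect if effect >= 0 else 2.0 * effect
    return gain - 0.2

# Integrate over a range wide enough to hold essentially all posterior mass.
eu_treat = expected_utility(posterior, utility_of_treating, -8.0, 9.0)
eu_no_treat = 0.0  # baseline: not treating is assigned utility zero
decision = "treat" if eu_treat > eu_no_treat else "do not treat"
print(round(eu_treat, 3), decision)
```

Changing the utility (say, weighting harm five times as heavily) can flip the decision with the same posterior, which is the sense in which the decision is not mechanical even though the calculation is.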

I guess I am thinking that there is no mechanistic answer that is satisfying. Perhaps that is much of the problem with NHST – we teach people that their answer is reject/do not reject rather than the reasoning behind the decision they reach – which is not mechanistic (although it may involve a number of calculable probabilities). If there were a mechanistic decision procedure, then we would hardly need humans to be involved. If humans are to be involved, then it means the essential parts of the decision involve reasoning – reasoning about the uncertainties, the kinds of mistakes we might make, the probabilities that we can determine, and the potential/need for further studies.

I am interested in any reactions to this, but I do want to return to MBI for a moment. In one sense, it is a step in the direction I am stating – use domain knowledge to establish limits from which to take actions, and then establish probabilities in relation to these limits. On the other hand, it appears to hide the uncertainty in the estimates and tries to establish a mechanistic decision making procedure to resurrect most of the NHST apparatus. Am I understanding that correctly?

In repeated batch lot testing everyone seems fine with the confidence interpretation. On the other hand, in routine diagnostic testing in a population with a well-known prevalence of disease, everyone seems fine with taking posterior probabilities as relevant and even literally.

But when there is the perception of a single case (say the case of the last batch sample done before the production line is destroyed), the first is seen as nonsensical. And when the distribution of parameter values in the assumed prior has no relevance at all to the unknown parameter values one is trying to pin down, some Bayesians (Rubin coined the term “sage Bayesians” for these) don’t take the resulting posterior literally or even as relevant.

In your example, presumably it’s a flat prior that gives that posterior probability, and the idea that, almost always, the unknown difference one will be trying to learn about will be of magnitude greater than 10^99999 does not seem relevant.

Now the point of the baiting is that any discussion of something in between seems taboo, and though it is occasionally blogged about (e.g. in this blog) and discussed in talks and a few papers, most seem to overlook or neglect these discussions. So that’s primarily why I asked for clarification.
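For the diagnostic testing case mentioned above, where the prior really is a known prevalence, the posterior probability is just Bayes’ rule. A minimal sketch, with hypothetical prevalence, sensitivity and specificity values chosen only for illustration:

```python
def posterior_probability_of_disease(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' rule."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

# Hypothetical test characteristics and population prevalence.
ppv = posterior_probability_of_disease(
    prevalence=0.01, sensitivity=0.95, specificity=0.90
)
print(round(ppv, 4))
```

Here the prior describes the population the patient was actually drawn from, which is why nobody objects to reading the result literally. The same arithmetic with a prior that has no connection to the unknown being estimated gives a number that is much harder to take at face value.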

Instead of “confidence interval,” I’d rather say “uncertainty interval.”

What exactly will you do with this number? I mean this absolutely seriously. Can you list some productive calculations or computations that you can use this number for?

I know it’s wrong, but I think nonsense like this MBI is a reaction to the fact that we want to say something based on our one sample. Unfortunately, if I am understanding the MBI correctly, it is being used to say things based on the confidence interval that should not be said – like a treatment is very likely beneficial if the lower end of the 95% interval lies in an ambiguous range while the upper end lies in the beneficial range.

> I think statisticians that insist on pointing out that confidence intervals are only valid for interpretations of a repeated procedure are being counterproductive.

OK, what would you suggest are other valid interpretations of a repeated procedure?

Some have been previously mentioned/discussed on this blog, but do you have preferences on those or alternatives?

1. Yes, there are problems with NHST and confidence intervals.

2. So, let’s introduce a new concept: MBI. We will bring in domain knowledge about meaningful sizes of effects (potentially both beneficial and/or harmful). Notwithstanding all the issues with confidence intervals, we will now unabashedly use those intervals, in conjunction with the domain knowledge, to derive decision rules based on those intervals. Except that we now ignore everything that is incorrect about the interpretation of the confidence intervals.

3. We now use MBI to derive decision rules based on confidence intervals in relation to meaningful beneficial or harmful effects. And, since numbers are hard for people, we’ll couch our decisions in terms of nice terms like “very likely,” “likely,” “possibly,” etc.
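To make step 3 concrete, here is a minimal sketch of the kind of calculation being described. It is not the MBI procedure itself: the thresholds for harm and benefit, the probability cutoffs behind the verbal labels, and the function names are all hypothetical. The disputed step is flagged in the code, namely treating Normal(estimate, se) as if it were a probability distribution for the true effect.

```python
from statistics import NormalDist

def region_probabilities(estimate, se, harm_threshold, benefit_threshold):
    """Mass of Normal(estimate, se) in the harmful, trivial and beneficial regions.
    Reading this mass as the probability of the true effect lying in each region
    is the contested interpretive step, not a frequentist result."""
    dist = NormalDist(estimate, se)
    p_harm = dist.cdf(harm_threshold)
    p_benefit = 1 - dist.cdf(benefit_threshold)
    return {
        "harmful": p_harm,
        "trivial": 1 - p_harm - p_benefit,
        "beneficial": p_benefit,
    }

def label(p):
    """Map a probability to a verbal descriptor (hypothetical cutoffs)."""
    if p >= 0.95:
        return "very likely"
    if p >= 0.75:
        return "likely"
    if p >= 0.25:
        return "possibly"
    return "unlikely"

# Hypothetical estimate, standard error and smallest meaningful effect sizes.
probs = region_probabilities(estimate=1.0, se=0.5,
                             harm_threshold=-0.2, benefit_threshold=0.2)
for region, p in probs.items():
    print(region, round(p, 4), label(p))
```

With these made-up inputs, the effect comes out as “likely” beneficial. The verbal label is entirely a function of where the interval sits relative to the thresholds, so any problem with interpreting the interval carries straight through to the label.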

I actually appreciate their attempt to resurrect confidence intervals – despite the technically correct critique of confidence intervals, I still support the incorrect interpretation that the probability of the true parameter being within the X% confidence interval is X%. Yes, the interval refers to a repeated procedure – but we often have only one sample and we need to use that to make decisions. I’ve said before – I think statisticians that insist on pointing out that confidence intervals are only valid for interpretations of a repeated procedure are being counterproductive – insisting on technical correctness at the risk of making an irrelevant point. I think this MBI idea is evidence that I am right about this. As a reaction to the critiques of NHST and confidence intervals we get nonsense like the MBI.
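For readers who want to see what the “repeated procedure” reading amounts to, here is a minimal simulation sketch. It assumes normally distributed data with a known standard deviation and uses a z-interval; the sample size, number of repetitions and seed are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, fmean

def coverage(true_mean=0.0, sigma=1.0, n=30, repetitions=2000, level=0.95, seed=0):
    """Fraction of simulated intervals that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5  # known-sigma z-interval (an assumption)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        # The interval is sample mean +/- half_width; check if it covers the truth.
        if abs(fmean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / repetitions

rate = coverage()
print(rate)
```

The simulated rate should land close to 0.95. That figure is a property of the procedure across repetitions; the simulation can only check it because the true mean is known, which is exactly what is missing when we have one sample in hand.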

For example, one of the things I find most useful about confidence intervals is their width. I take that as meaningful evidence about the degree of uncertainty in my evidence. The MBI application appears to completely ignore the width of the interval, choosing to focus on the endpoints of the interval and where they lie in the prespecified regions of harmful and beneficial effects.

As I see it, the truly damaging uses of inference are the attempts to reach dichotomous decisions using a mechanical procedure based on sample data. MBI relaxes this, reaching a slightly larger set of possible decisions – but at the risk of pretending the uncertainty is not really there. Hardly an improvement.

I have similar problems with Mayo’s severity testing – I hesitate to bring it up in the context of MBI as Mayo’s analysis seems to be far sounder and better reasoned than MBI. But I am left with the same feeling – that severity testing is an attempt to save the confidence interval as a useful concept. I believe it is useful. I think the harm is in attempting/insisting on making deterministic decisions on the basis of p-values, confidence intervals, MBI, severity tests, or anything else (to be fair to Mayo, I don’t think she is recommending reducing uncertainty to deterministic decisions).

Why is it so hard to use a confidence interval as meaningful evidence? Yes, surround it with all the caveats you need to (the repeated procedure, in my view, is one of the less important caveats as compared with nonrandom sampling, measurement issues, selection biases, etc.).

Am I missing the point? Is there more to MBI than I am seeing?
