Loving, hating, and sometimes misinterpreting conformal prediction for medical decisions

This is Jessica. Conformal prediction, a class of distribution-free approaches to quantifying predictive uncertainty, has attracted interest for medical AI applications. One reason is that prediction sets seem to align with the kinds of differential diagnoses doctors already use; another is that they can support common triage decisions like ruling in and ruling out critical conditions.

However, like any uncertainty quantification technique, the nuance needed to describe what conformal approaches provide can get lost in translation. We have catalogs of common misinterpretations of p-values, confidence intervals, Bayes factors, AUC, etc., to which we might now add misinterpretations of conformal prediction. The set below is based on what I’m seeing as I read papers about applying conformal prediction for medical decision-making. If you’ve encountered others that I’ve missed (even if not in a health setting), please share them.

Misconception 1: Conformal prediction provides individualized uncertainty

It would be great if we could get prediction sets with true conditional coverage without having to make distributional assumptions, i.e., if we could guarantee that the probability that a prediction set at any fixed test point X_{n+1} = x contains the true label is at least 1 – alpha. Unfortunately, assumption-free conditional coverage is not possible. But some enthusiastic takes on conformal prediction describe what it provides as if conditional coverage were achieved.
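To make the distinction concrete, the two guarantees can be written side by side. This is a sketch in notation I am assuming for illustration: C(·) is the prediction set, (X_{n+1}, Y_{n+1}) is the test pair, and the marginal probability is taken over both the calibration data and the test pair.

```latex
% Marginal coverage: what standard split conformal prediction guarantees
% under exchangeability of the calibration and test points.
\Pr\bigl( Y_{n+1} \in C(X_{n+1}) \bigr) \;\ge\; 1 - \alpha

% Conditional coverage: the "individualized" guarantee, which cannot be
% obtained without further assumptions.
\Pr\bigl( Y_{n+1} \in C(X_{n+1}) \,\big|\, X_{n+1} = x \bigr) \;\ge\; 1 - \alpha
\quad \text{for all } x
```

The first statement averages over patients; the second would have to hold for each patient's x separately.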

For example, Dawei Xie pointed me to this Nature Medicine commentary that calls for clinical uses of AI to include predictive uncertainty. The authors start with what appears to be a common motivation for conformal prediction in health: standard AI pipelines optimize population-level accuracy, failing to capture “the vital clinical fact that each patient is a unique person,” motivating methods that can “provide reliable advice for all individual patients.” The goal is to use uncertainty associated with the prediction to decide whether to abstain and bring in a human expert, who might gather more information or consider how the model was developed. 

This is all fine. The problem is that they propose to solve this challenge with conformal prediction, which they describe as a new tool “that can produce personalized measures of uncertainty.” You can get “relaxed” versions of conditional coverage, but no truly personalized quantification of uncertainty.

Misconception 2: The non-conformity score makes conformal prediction robust to distribution shift

Another potential source of misinterpretation is the non-conformity score. In split conformal prediction, this is the score that is calculated for (x,y) pairs in a held-out calibration set in order to find the threshold expected to achieve at least 1 – alpha coverage on test instances. Then, given a new instance, the score for each candidate label is compared to the threshold to determine which labels go in the prediction set. The non-conformity score can be any negatively-oriented score function (lower means better agreement) derived from the trained model’s predictions, though the more closely it approximates a residual, the more useful the sets are likely to be. A simple example would be 1 – f_hat(x_i)_y, where f_hat(x_i)_y is the softmax value for label y produced by the last layer of a neural net, and the threshold is based on the distribution of 1 – f_hat(x_i)_{y_i} in the calibration set, where y_i is the true label.
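Here is a minimal sketch of that procedure, using the 1 – softmax score and the usual finite-sample rank ceil((n + 1)(1 – alpha)) for the threshold. The function names and the synthetic "model" (random softmax vectors, with labels drawn from them) are assumptions made for illustration, not anything from the papers discussed.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Threshold from calibration scores s_i = 1 - f_hat(x_i)_{y_i}."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: every label is kept
    return np.sort(scores)[k - 1]

def prediction_set(test_probs, q_hat):
    """Labels y whose score 1 - f_hat(x)_y does not exceed the threshold."""
    return np.flatnonzero(1.0 - test_probs <= q_hat)

def simulate(n, rng, n_classes=3):
    """Hypothetical data: random softmax outputs, labels drawn from them."""
    logits = 2.0 * rng.normal(size=(n, n_classes))
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    u = rng.random(n)
    labels = (u[:, None] > np.cumsum(probs, axis=1)).sum(axis=1)
    return probs, np.minimum(labels, n_classes - 1)

rng = np.random.default_rng(0)
cal_probs, cal_labels = simulate(1000, rng)
test_probs, test_labels = simulate(5000, rng)

alpha = 0.1
q_hat = conformal_threshold(cal_probs, cal_labels, alpha)

# Empirical coverage: how often the true label's score is within the threshold.
test_scores = 1.0 - test_probs[np.arange(len(test_labels)), test_labels]
coverage = np.mean(test_scores <= q_hat)
avg_size = np.mean((1.0 - test_probs <= q_hat).sum(axis=1))
print(f"threshold={q_hat:.3f} coverage={coverage:.3f} avg set size={avg_size:.2f}")
print("set for first test point:", prediction_set(test_probs[0], q_hat))
```

Note that nothing in the threshold computation looks at how far a new x is from the training inputs; the score only reflects what the model's output says about each candidate label.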

One could say that non-conformity scores capture how dissimilar an (x,y) pair under consideration is from what the model has learned about label posterior distributions from the training data. But some of the application papers I’m seeing make more generic statements, describing the score as measuring how strange the new instance is, as if in an absolute sense, or how unusual the new instance is relative to the training data, as if it is used to detect distribution shift.

Misconception 3: You can get knowledge-free robustness to distribution shift 

Some papers acknowledge that standard split conformal coverage is not robust to violations of exchangeability, and cite work that relaxes this assumption to get coverage under certain types of distribution shifts. The risk here is describing these approaches as if one can get valid coverage under shifts without having to introduce any additional assumptions. Even in the work of Gibbs et al., which makes the fewest assumptions as far as I can tell, you still have to select a function class that covers the shifts you want coverage to be robust to. There is no “knowledge-free” way around violations of the typical assumptions.

Misconception 4: Conformal prediction can only provide marginal coverage over the randomness in calibration set and test points

In contrast to the above, I’ve also seen a few more skeptical takes on conformal prediction for medical decision-making, arguing that conformal prediction sets are unreliable under shifts in input and label distributions and for subsets of the data. Papers that make these arguments can also mislead, by implying that any use of conformal prediction equates to simple split conformal prediction where coverage is marginal over the randomness in the calibration and test set points. This neglects the development of approaches that provide class-conditional or group-conditional coverage, as well as the previously mentioned attempts at coverage under classes of shifts. Beware blanket statements that write off entire classes of approaches based on what the simplest variations achieve.
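As one illustration of going beyond marginal coverage, here is a sketch of class-conditional (sometimes called Mondrian) split conformal prediction, which calibrates a separate threshold per true class so that coverage holds within each class rather than only on average. The synthetic setup is an assumption for illustration: a hypothetical model that is much less confident on a rare class.

```python
import numpy as np

def quantile_threshold(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest score, or inf if n is too small."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def simulate(n, rng):
    """Hypothetical model outputs: class 2 is rare and poorly predicted."""
    priors = np.array([0.6, 0.3, 0.1])
    boost = np.array([3.0, 2.0, 0.5])  # how strongly the model favors the true class
    labels = rng.choice(3, size=n, p=priors)
    logits = rng.normal(size=(n, 3))
    logits[np.arange(n), labels] += boost[labels]
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs, labels

def coverage_by_class(probs, labels, q):
    """Per-class coverage; q is a single threshold or one threshold per class."""
    true_scores = 1.0 - probs[np.arange(len(labels)), labels]
    thresholds = np.broadcast_to(q, (3,))[labels]
    covered = true_scores <= thresholds
    return np.array([covered[labels == c].mean() for c in range(3)])

rng = np.random.default_rng(0)
cal_probs, cal_labels = simulate(20000, rng)
test_probs, test_labels = simulate(20000, rng)
alpha = 0.1

cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
# Pooled threshold: standard split conformal, marginal coverage only.
pooled_q = quantile_threshold(cal_scores, alpha)
# One threshold per class, each calibrated on that class's scores alone.
class_q = np.array(
    [quantile_threshold(cal_scores[cal_labels == c], alpha) for c in range(3)]
)

pooled_cov = coverage_by_class(test_probs, test_labels, pooled_q)
classwise_cov = coverage_by_class(test_probs, test_labels, class_q)
print("per-class coverage, pooled threshold:    ", np.round(pooled_cov, 3))
print("per-class coverage, per-class thresholds:", np.round(classwise_cov, 3))

# Prediction set for one test point: label c is kept if 1 - f_hat(x)_c <= class_q[c].
print("set for first test point:", np.flatnonzero(1.0 - test_probs[0] <= class_q))
```

The per-class guarantee still rests on exchangeability within each class, and it is paid for with larger sets for the classes the model handles poorly.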

Progress in AI may be exploding, but achieving nuance in discussions of uncertainty quantification is still hard.

8 thoughts on “Loving, hating, and sometimes misinterpreting conformal prediction for medical decisions”

  1. I’m confused by the (excessive, in my view) use of jargon in these examples. Let me state my intuitive understanding and see if I am also misunderstanding what conformal prediction can and can’t tell you about uncertainty. I don’t see it as quantifying uncertainty about individual cases, but I do see it as expressing uncertainty about individual cases. When a conformal prediction fails to classify an individual as belonging to group X or Group Y (to use a binary example, so Group Y is simply not group X) subject to probability 10%, I interpret that to mean that if I want my model to provide 90% confidence in its predictions, this particular observation cannot be classified. I am applying the result to an individual observation, but I am not quantifying my uncertainty about that observation. I am simply saying that this observation cannot clearly be classified at that level of desired accuracy.

    Is this a misconception?

    • Not trying to be jargon-y! But in a post about confusion arising from sloppy language I want to be a little precise.

      >I don’t see it as quantifying uncertainty about individual cases, but I do see it as expressing uncertainty about individual cases.
      This is where things get tricky to talk about. Part of what I’m saying in the post is that people seem to confuse the use of instance-specific info to generate the set with getting true conditional coverage, where Pr(Y in C(X) | X=x) >= 1 – alpha for all x.

      >I interpret that to mean that if I want my model to provide 90% confidence in its predictions, this particular observation cannot be classified
      I think that’s fine to say, as long as it doesn’t get misread as implying that we know the true probability that the true label is Y for this specific observation (which wouldn’t make sense, because the true label is either Group X or Group Y in your example).

  2. As an anecdote, I work with health insurance claims data and have had very good coverage in practice over extended periods of time. Or more specifically I expected Covid to mess all sorts of stuff up but it did not. Moreso continuous predictions of claim $$$, e.g. you have X comorbidities given your historical claims in 2023, and I predict you will generate $k in claims in 2024. I have done some examples though recently of classifications and sets and they look fine and dandy as well.

    (Ironically when I have tried to do metrics to measure drift, they are very noisy and prone to “everything drifts all the time”, maybe I should use conformal coverage degrading itself as a drift metric.)

    Now, I do different batch type processes where I think conformal inference and setting different overall error rates makes sense. In terms of individual doctors + patient decisions, I can understand how that gets confusing (even ignoring individualized uncertainty). The population coverage over different sets is sort of “why would I as an individual care”. That is lovely you know to have high recall over your patient population for disease Y you set the threshold to order a biopsy at 1% predicted probability — that doesn’t mean for me personally it makes sense to bother with the invasive test though.

    • Two questions.

      1) It is common actuarial practice in health insurance claims analysis to use cohorts of individuals. In other words, people are bundled into groups based on demographics, etc..

      Would you confirm that your claims model is at the individual patient level?

      2) PDFs of health care claims are very fat-tailed, invalidating Gaussian assumptions.

      Would you clarify how your modeling accounts for that fact?

      Ty!

      • This quote was in reference to individual person predictions. I model the log of the claims value, which for the claims data I look at to me is the obvious choice. (I don’t work for an insurance agency, I could see actuaries more worried than me about very high claims though to be fair.)

        Some idiosyncrasies of lumps of claim values at the low end I find more annoying than the high outliers (the conformal stuff is non-parametric, but of course you want the residual distribution as small as possible). So you have a mixture of people who get 1/2 particular services per year, and they are specific dollar amounts, and they cause spikes in the marginal distribution.

        • Thank you for those clarifications. One additional comment. Natural log transformations are standard modeling practice but still fail to adequately fit the extremes. A substantially better fit can be obtained using quantile regression and, instead of fitting the median, set the target at a much higher bound, e.g., 0.7 or more. FWIW.
