When do we expect conformal prediction sets to be helpful? 

This is Jessica. Over on Substack, Ben Recht has been posing some questions about the value of prediction bands with marginal guarantees, such as one gets from conformal prediction. It’s an interesting discussion that caught my attention since I have also been musing about where conformal prediction might be helpful.

To briefly review, given a training data set (X1, Y1), … ,(Xn, Yn), and a test point (Xn+1, Yn+1) drawn from the same distribution, conformal prediction returns a subset of the label space for which we can make coverage guarantees about the probability of containing the test point’s true label Yn+1. A prediction set Cn achieves distribution-free marginal coverage at level 1 − alpha when P(Yn+1 ∈ Cn(Xn+1)) >= 1 − alpha for all joint distributions P on (X, Y). The commonly used split conformal prediction process attains this by adding a couple of steps to the typical modeling workflow: you first split the data into a training set and a calibration set, fitting the model on the training set. You choose a heuristic notion of uncertainty from the trained model, such as the softmax values (pseudo-probabilities from the last layer of a neural network), to create a score function s(x,y) that encodes disagreement between x and y (in a regression setting, these are typically the absolute residuals). You compute q_hat, the ⌈(n+1)(1-alpha)⌉/n quantile of the scores on the calibration set, where n here counts the calibration points. Then given a new instance x_n+1, you construct a prediction set for y_n+1 by including all y’s for which the score is less than or equal to q_hat. There are various ways to achieve slightly better performance, such as using cumulative summed scores and regularization instead.
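
To make the recipe concrete, here is a minimal sketch of the calibration and set-construction steps in plain Python. It assumes the score for a label is one minus its softmax value, takes q_hat to be the ⌈(n+1)(1 − alpha)⌉-th smallest calibration score (n being the number of calibration points), and uses made-up calibration scores. The function names are my own for illustration, not from any library.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Return q_hat: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        # Too few calibration points for this alpha: only the full
        # label space can carry the guarantee.
        return math.inf
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, q_hat):
    """Include every label y whose score 1 - probs[y] is at most q_hat."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= q_hat]

# Toy calibration scores: 1 - softmax value of the TRUE label on each
# held-out calibration example (illustrative numbers only).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
q_hat = conformal_quantile(cal_scores, alpha=0.2)
print(q_hat)                                      # 0.7
print(prediction_set([0.40, 0.35, 0.25], q_hat))  # [0, 1]
```

Note that with only three calibration points and alpha = 0.1 the rank exceeds n, so the sketch returns an infinite threshold and every label lands in the set. That is the sense in which the guarantee can hold while the set is uninformative.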

Recht makes several good points about limitations of conformal prediction, including:

—The marginal coverage guarantees are often not very useful. Instead we want conditional coverage guarantees that hold conditional on the value of Xn+1 we observe. But you can’t get true conditional coverage guarantees (i.e., P(Yn+1 ∈ Cn(Xn+1)|Xn+1 = x) >= 1 − alpha for all P and almost all x) if you also want the approach to be distribution free (see e.g., here), and in general you need a very large calibration set to be able to say with high confidence that there is a high probability that your specific interval contains the true Yn+1.

—When we talk about needing prediction bands for decisions, we are often talking about scenarios where the decisions we want to make from the uncertainty quantification are going to change the distribution and violate the exchangeability criterion. 

—Additionally, in many of the settings where we might imagine using prediction sets there is potential for recourse. If the prediction is bad, resulting in a bad action being chosen, the action can be corrected, i.e., “If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong.”

Recht also criticizes research on conformal prediction as being fixated on the ability to make guarantees, irrespective of how useful the resulting intervals are. E.g., we can produce sets with 95% coverage with only two points, and the guarantees are always about coverage instead of the width of the interval.

These are valid points, worth discussing given how much attention conformal prediction has gotten lately. Some of the concerns remind me of the same complaints we often hear about traditional confidence intervals we put on parameter estimates, where the guarantees we get (about the method) are also generally not what we want (about the interval itself) and only actually summarize our uncertainty when the assumptions we made in inference are all good, which we usually can’t verify. A conformal prediction interval is about uncertainty in a model’s prediction on a specific instance, which perhaps makes it more misleading to some people given that it might not be conditional on anything specific to the instance. Still, even if the guarantees don’t stand as stated, I think it’s difficult to rule out an approach without evaluating how it gets used. Given that no method ever really quantifies all of our uncertainty, or even all of the important sources of uncertainty, the “meaning” of an uncertainty quantification really depends on its use, and what the alternatives considered in a given situation are. So I guess I disagree that one can answer the question “Can conformal prediction achieve the uncertainty quantification we need for decision-making?” without considering the specific decision at hand, how we are constructing the prediction set exactly (since there are ways to condition the guarantees on some instance-specific information), and how the decision would be made without a prediction set.

There are various scenarios where prediction sets are used without a human in the loop, like to get better predictions or directly calibrate decisions, where it seems hard to argue that it’s not adding value over not incorporating any uncertainty quantification. Conformal prediction for alignment purposes (e.g., controlling the factuality or toxicity of LLM outputs) seems to be on the rise. However, I want to focus here on a scenario where we are directly presenting a human with the sets. One type of setting where I’m curious whether conformal prediction sets could be useful is one where 1) models are trained offline and used to inform people’s decisions, and 2) it’s hard to rigorously quantify the uncertainty in the predictions using anything the model produces internally, like softmax values, which can be overfit to the training sample.

For example, a doctor needs to diagnose a skin condition and has access to a deep neural net trained on images of skin conditions for which the illness has been confirmed. Even if this model appears to be more accurate than the doctor on evaluation data, the hospital may not be comfortable deploying the model in place of the doctor. Maybe the doctor has access to additional patient information that may in some cases allow them to make a better prediction, e.g., because they can decide when to seek more information through further interaction or monitoring of the patient. This means the distribution does change upon acting on the prediction, and I think Recht would say there is potential for recourse here, since the doctor can revise the treatment plan over time (he lists a similar example here). But still, at any given point in time, there’s a model and there’s a decision to be made by a human.    

Giving the doctor information about the model’s confidence in its prediction seems like it should be useful in helping them appraise the prediction in light of their own knowledge. Similarly, giving them a prediction set over a single top-1 prediction seems potentially preferable so they don’t anchor too heavily on a single prediction. Deep neural nets for medical diagnoses can do better than many humans in certain domains while still having relatively low top-1 accuracy (e.g., here). 

A naive thing to do would be to just choose some number k of predictions from the model we think a doctor can handle seeing at once, and show the top-k with softmax scores. But an adaptive conformal prediction set seems like an improvement in that at least you get some kind of guarantee, even if it’s not specific to your interval like you want. Set size conveys information about the level of uncertainty like the width of a traditional confidence interval does, which seems more likely to be helpful for conveying relative uncertainty than holding set size constant and letting the coverage guarantee change (I’ve heard from at least one colleague who works extensively with doctors that many are pretty comfortable with confidence intervals). We can also take steps toward the conditional coverage that we actually want by using an algorithm that calibrates the guarantees over different classes (labels), or that achieves a relaxed version of conditional coverage, possibilities that Recht seems to overlook. 
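
As a rough illustration of the first of those options, calibrating over classes, one could compute a separate threshold for each label from the calibration points whose true label is that class, and then test each candidate label against its own threshold. The sketch below assumes the score is one minus the softmax value and uses the ⌈(n_y+1)(1 − alpha)⌉-th smallest score within each class. The function names and toy numbers are hypothetical, and this is not meant as the specific algorithm from any particular paper.

```python
import math
from collections import defaultdict

def class_conditional_thresholds(cal_scores, cal_labels, alpha):
    """One threshold per label, from calibration points with that true label."""
    by_label = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_label[label].append(score)
    thresholds = {}
    for label, scores in by_label.items():
        n = len(scores)
        k = math.ceil((n + 1) * (1 - alpha))
        # Classes with too few calibration points get an infinite threshold.
        thresholds[label] = sorted(scores)[k - 1] if k <= n else math.inf
    return thresholds

def class_conditional_set(probs, thresholds):
    """Include label y if its score 1 - probs[y] is within y's own threshold.

    Labels never seen in calibration are always included (conservative choice).
    """
    return [y for y, p in enumerate(probs)
            if 1.0 - p <= thresholds.get(y, math.inf)]

# Toy data: class 1 tends to get higher (worse) scores than class 0.
cal_scores = [0.1, 0.5, 0.2, 0.6, 0.3, 0.7, 0.4, 0.9]
cal_labels = [0, 1, 0, 1, 0, 1, 0, 1]
thresholds = class_conditional_thresholds(cal_scores, cal_labels, alpha=0.2)
print(thresholds)                                     # {0: 0.4, 1: 0.9}
print(class_conditional_set([0.5, 0.5], thresholds))  # [1]
```

The price is that each class needs enough calibration points of its own; rare conditions will get very loose thresholds, which is worth keeping in mind for something like the skin condition example.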

So while I agree with all the limitations, I don’t necessarily agree with Recht’s concluding sentence here:

“If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong. If you can, point predictions coupled with subsequent action are enough to achieve nearly optimal decisions.” 

It seems possible that seeing a prediction set (rather than just a single top prediction) will encourage a doctor to consider other diagnoses that they may not have thought of. Presenting uncertainty often has _some_ effect on a person’s reasoning process, even if they can revise their behavior later. The effect of seeing more alternatives could be bad in some cases (they get distracted by labels that don’t apply), or it could be good (a hurried doctor recognizes a potentially relevant condition they might have otherwise overlooked). If we allow for the possibility that seeing a set of alternatives helps, it makes sense to have a way to generate them that gives us some kind of coverage guarantee we can make sense of, even if it gets violated sometimes.

This doesn’t mean I’m not skeptical of how much prediction sets might change things over more naively constructed sets of possible labels. I’ve spent a bit of time thinking about how, from the human perspective, prediction sets could or could not add value, and I suspect it’s going to be nuanced, and the real value probably depends on how the coverage responds under realistic changes in distribution. There are lots of questions that seem worth trying to answer in particular domains where models are being deployed to assist decisions. Does it actually matter in practice, such as in a given medical decision setting, for the quality of decisions that are made if the decision-makers are given a set of predictions with coverage guarantees versus a top-k display without any guarantees? And what happens when you give someone a prediction set with some guarantee but there are distribution shifts such that the guarantees you give are not quite right? Are they still better off with the prediction set or is this worse than just providing the model’s top prediction or top-k with no guarantees? Again, many of the questions could also be asked of other uncertainty quantification approaches; conformal prediction is just easier to implement in many cases. I have more to say on some of these questions based on a recent study we did on decisions from prediction sets, where we compared how accurately people labeled images using them versus other displays of model predictions, but I’ll save that for another post since this is already long.

Of course, it’s possible that in many settings we would be better off using some inherently interpretable model for which we no longer need a distribution-free approach. And ultimately we might be better off if we can better understand the decision problem the human decision-maker faces and apply decision theory to try to find better strategies rather than leaving it up to the human how to combine their knowledge with what they get from a model prediction. I think we still barely understand how this occurs even in high stakes settings that people often talk about.

7 thoughts on “When do we expect conformal prediction sets to be helpful?”

  1. I think the issues about human vs algorithmic decision making are a red herring here. Conformal prediction is a property of the prediction model, not the decision process. And, as a property of the model, I think that the trimodal decision making to replace bimodal decision making is an improvement: when the decision is binary, X or Y (X or not X), then having the classes X (subject to a specified error rate), Y (subject to that specified error rate), and “too close to call” is an improvement. I see it as an invitation for a decision maker to think more carefully about these indeterminate cases (it still runs the risk of inviting binary thinking about the conforming prediction sets – they are still subject to errors, even if they are in the X or Y decision sets).

    I concur with your resistance to focusing on multiple stages of recourse. Many decisions, particularly medical ones, have little room for recourse. If the diagnosis is wrong, there will be subsequent decisions, but usually these will be inferior to the choices that were available initially. I think the ability to make corrections is important to decision making, but it is a poor excuse to avoid better portrayals of the uncertainty in the prediction model.

    I’ve been following conformal prediction for some time now and wondering why it hasn’t caught on more than it has. Rather than technical reasons, I have the impression that it is due to the (unnecessary in my mind) excessive mathematical detail in the literature. While many decisions have more than binary classifications, the binary case is much easier to describe yet most of the literature seems intent on describing the more general case. At its core, my understanding of conformal prediction is that it is based on ranks – deriving a probability measure from where a new observation falls compared with a ranking of the cases in the training data. The use of ranks means this is distribution free. What I don’t really understand is whether the probability measure that is derived is a valid probability – I get lost in the math when I try to understand that. It is called a “p value” just to confuse things further.

  2. Jessica:

    This is an interesting discussion, and it reminds me a bit of our increased awareness of the importance of time in statistical modeling and analysis.

    If you’re in a setting where you’re making a sequence of low-stakes decisions and can keep adapting, then I can see the argument that a statistical procedure could just spew out a stream of recommended decisions and adaptation rules. One thing about low-stakes decisions is that utilities tend to be approximately linear so you can go with expected value (which, by the way, is different than “go if p < 0.05"-style rules). If you're in a setting where you're making one big decision without much possibility of adapting, then uncertainty can matter a lot more, and I don't think that point predictions can yield nearly optimal decisions. I guess it also depends on what counts as a point prediction. For example, suppose you are predicting how people will vote, and your model predicts that there's a 60% probability you will vote for candidate X. What is the point prediction? Is it 0.6, or is it 1? If all probabilistic predictions are rounded to 0 or 1, then there's no way this is enough information to make nearly optimal decisions. If you are allowed to return the probability as a prediction, this should help. Difficulties will still remain, though. For example, suppose you are predicting a response with possible outcomes 1, 2, 3, 4, 5. As your point prediction, are you allowed to return the five predicted probabilities that sum to 1? Or must you give E(y)? Further complexities arise when predicting joint outcomes. I'm not saying that a full joint predictive distribution is always needed, just that a statement that, "If you can do something when your predictions are wrong, then point predictions coupled with subsequent action are enough to achieve nearly optimal decisions," can't be true in general; it must depend on some conditions on the problem being studied, as well as on how you are defining point predictions. Under some situations, point predictions are lossy; under others, not so much.

    This discussion reminds me of the point we make in Regression and Other Stories where we categorize the probability distribution of the error term as the least important of the assumptions of linear regression.

    We qualify this by saying that the distribution of the error term can be super-important when the goal is making predictions about individual outcomes, especially when they will be piped through a nonlinear transformation. We use the example of forecasting elections. If my point prediction is that candidate X will receive 49% of the two-party vote, with a predictive standard deviation of 3%, then it would be terrible to just summarize this by plugging in the point prediction and saying that the candidate will lose. Having a measure of predictive uncertainty is essential to this particular use of the model.

  3. These are all good points. On time, I can certainly imagine cases where the maximum utility of a choice of action drops off as more time goes by, and it is weird that this is one of those things we often ignore in modeling.

    I had a similar thought to yours in writing this — what exactly we consider a point estimate is itself somewhat arbitrary. So yes, makes sense to say that generalizations about when we don’t need uncertainty are fraught.

    [oops this was supposed to be a reply to Andrew]

  4. In deep learning, we can address these issues by reexpressing networks (or composition of networks). One of the lasting takeaways of deep learning is likely to be that we can now reliably calibrate over high-dimensional inputs. Similarity-Distance-Magnitude calibration is to distribution modeling, as neural networks are to hand-tuned feature engineering.

    Marginal CP for discrete classification should be viewed as an important theoretical result for pedagogical purposes, and as a substrate and contrast for building additional approaches. However, it typically should not be used directly as the final quantity to present end-users in these cases where you’re presenting a user a single document/image/point in isolation for downstream decision-making (at the unit of analysis of that instance). The quantity produced by taking a threshold on the CDF to form a set is rather counterintuitive, even setting aside distributional shift concerns, because the guarantee says nothing about the coverage of singleton sets (|C(x)|==1). This is very problematic from a statistical communication perspective, because there is a risk people will assume the documents assigned singleton sets have a high accuracy of at least 1-alpha, after having winnowed out the more non-conforming documents. Importantly, this assumption can be grossly wrong, since the singleton sets can have an accuracy (conditional on size) of less than 1-alpha. We might resign ourselves to these limitations (and try to develop user interfaces to encourage a long-term marginal coverage interpretation), but luckily there are more effective approaches (see first paragraph).

    Importantly, even a well-calibrated probability is not necessarily sufficiently informative. A decision maker needs a means of inspecting the reference class that determined the probability, including assessing the sample size of the data partition. It also needs to be possible to readily update that calibration process if an error is found in the labeled reference class, or if new data needs to be taken into consideration. The approach mentioned in the first paragraph addresses these issues.

    • Can you say more on “One of the lasting takeaways of deep learning is likely to be that we can now reliably calibrate over high-dimensional inputs.” Are you saying the models don’t require additional uncertainty quantification?

      “This is very problematic from a statistical communication perspective, because there is a risk people will assume the documents assigned singleton sets have a high accuracy of at least 1-alpha, after having winnowed out the more non-conforming documents.”

      I definitely get the risks you are concerned with. Singleton sets are a good example if we’re thinking about risks of the sets being taken too seriously. This reminds me of a similar point Recht argues: “The third reason the conformal guarantees are misleading is that they are conflating the probability the algorithm is correct with the probability the prediction is correct.”

      At the same time though, do we ever want someone to take an uncertainty interval as the truth about all the uncertainty? I would argue no (and I like points like Sander Greenland has made on how CIs are always conditional on assumptions we can’t evaluate, hence they are better thought of as values compatible with our modeling assumptions and available data).
      I don’t think any intervals are exactly the right thing straight out of the box, though some methods will certainly produce things closer to the interpretations we want to give. Anyway, this is why I can’t in good faith reject a general purpose method like conformal prediction for human decision scenarios without trying to understand how they impact the overall process.

      As a more minor point, I think it’s important not to assume that the strategy by which they are employed has to be all or nothing. We could potentially do things like withhold them based on some learned expectations about their effects (e.g., don’t show singletons or very small sets at all).

  5. Less than a month before you posted this, we put out a very relevant paper on arXiv:
    https://arxiv.org/abs/2401.13744v2
    Here’s the summary from the abstract:
    “In this work, we study the usefulness of conformal prediction sets as an aid for human decision making by conducting a pre-registered randomized controlled trial with conformal prediction sets provided to human subjects. With statistical significance, we find that when humans are given conformal prediction sets their accuracy on tasks improves compared to fixed-size prediction sets with the same coverage guarantee. The results show that quantifying model uncertainty with conformal prediction is helpful for human-in-the-loop decision making and human-AI teams.”
