This is Jessica. Continuing on the theme of calibration for decision-making I’ve been writing about, recall that a prediction model or algorithm is calibrated if, among the instances where the predicted probability is b, the event occurs at a rate of b, for every value of b. In practice we settle for approximate calibration and do some binning.
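To make the binning concrete, here’s a minimal sketch in Python (the function and variable names are my own, not from any particular library) of the kind of check this implies: bucket the predictions, compare the mean prediction to the empirical event rate in each bucket, and summarize the gaps.

```python
import numpy as np

def binned_calibration_check(probs, outcomes, n_bins=10):
    """Crude check of approximate calibration: bin the predicted probabilities,
    then compare the mean prediction to the empirical event rate in each bin.
    Returns per-bin summaries and a weighted average gap (an ECE-style number)."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0, 1, n_bins + 1)
    summaries, weighted_gap = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if hi == 1.0:                      # include predictions exactly equal to 1
            mask |= probs == 1.0
        if mask.sum() == 0:
            continue
        mean_pred, emp_rate = probs[mask].mean(), outcomes[mask].mean()
        summaries.append((mean_pred, emp_rate, int(mask.sum())))
        weighted_gap += mask.mean() * abs(mean_pred - emp_rate)
    return summaries, weighted_gap
```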
As I mentioned in my last post on calibration for decision-making, an outcome indistinguishability definition of perfect calibration says that the true distribution of individual-outcome pairs (x, y*) is indistinguishable from the distribution of pairs (x, ỹ) generated by simulating outcomes from the model’s predicted probabilities, ỹ ∼ Ber(p̃(x)). This makes clear that calibration can be subject to multiplicity, i.e., there may be many models that are calibrated over X overall but which make different predictions on specific instances.
This is bad for decision-making in a couple ways. First, being able to show that predictions are calibrated in aggregate doesn’t prevent disparities in error rates across subpopulations defined on X. One perspective on why this is bad is socially motivated. A fairness critique of the standard ML paradigm where we minimize empirical risk is that we may be trading off errors in ways that disadvantage some groups more than others. Calibration and certain notions of fairness, like equalized odds, can’t be achieved simultaneously. Going a step further, ideally we would like calibrated model predictions across subgroups without having to rely on the standard, very coarse notions of groups defined by a few variables like sex and race. Intersectionality theory, for example, is concerned with the limitations of the usual coarse demographic buckets for capturing people’s identities. Even if we put social concerns aside, from a technical standpoint, it seems clear that the finer the covariate patterns on which we can guarantee calibration, the better our decisions will be.
The second problem is that a predictor being calibrated doesn’t tell us how informative it is for a particular decision task. Imagine predicting a two-dimensional state. We have two predictors that are equally well calibrated. The first predicts perfectly on the first dimension but is uninformed about the second. The second predicts perfectly on the second dimension but is uninformed about the first. Which predictor we prefer depends on the specifics of our decision problem, like what the loss function is. But once we commit to a loss function when training a model, it is hard to go back later and recover the predictor that would have minimized a different loss. And so a mismatch between the loss function we train the model with and the loss function we care about when making decisions down the road can mean leaving money on the table, in the sense of producing a predictor that is not as informative as it could have been for our downstream problem.
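To make the two-dimensional example concrete, here’s a toy simulation of my own construction (not from any of the papers): two predictors, each perfectly calibrated on both dimensions of a binary state but informative about only one of them, and a pair of decision tasks that flip which predictor we’d prefer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.integers(0, 2, n)           # feature that determines dimension 1
x2 = rng.integers(0, 2, n)           # feature that determines dimension 2
y1, y2 = x1, x2                      # two-dimensional binary state

# Predictor A: perfect on dimension 1, base rate (0.5) on dimension 2.
# Predictor B: the mirror image. Both are calibrated on each dimension.
pA = np.stack([x1.astype(float), np.full(n, 0.5)], axis=1)
pB = np.stack([np.full(n, 0.5), x2.astype(float)], axis=1)

def decision_loss(pred, y, dim, c_fp=1.0, c_fn=1.0):
    """A toy decision task that depends on only one dimension of the state:
    act when the predicted probability for that dimension exceeds the
    cost-based threshold; pay c_fp for acting when y=0, c_fn for not
    acting when y=1."""
    act = pred[:, dim] > c_fp / (c_fp + c_fn)
    return np.mean(act * (1 - y) * c_fp + (~act) * y * c_fn)

# Task 1 cares about dimension 1, task 2 about dimension 2.
print(decision_loss(pA, y1, 0), decision_loss(pA, y2, 1))  # A wins task 1
print(decision_loss(pB, y1, 0), decision_loss(pB, y2, 1))  # B wins task 2
```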
Theoretical computer science has produced a few intriguing concepts to address these problems (at least in theory). They turn out to be closely related to one another.
Predictors that are calibrated over many possibly intersecting groups
Multicalibration, introduced in 2018 in this paper, guarantees that your predictions are calibrated on any collection of possibly intersecting groups that are supported by your data and can be defined by some function class (e.g., decision trees up to some max depth). Specifically, if C is a collection of subsets of X and α is a value in [0, 1], a predictor f is (C, α)-multicalibrated if for all S in C, f is α-calibrated with respect to S. Here α-calibrated with respect to S means that, conditional on the predicted value, the expected gap between the true probability and the prediction among members of S is at most α.
This tackles the first problem above, that we may want calibrated predictions without sacrificing some subpopulations. It’s natural to think about calibration from a hierarchical perspective, and multicalibration is a way of interpolating between what is sometimes called “strong calibration”, i.e., calibration on every covariate pattern (which is often impossible) and calibration only in aggregate over a distribution.
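As a rough empirical translation of the definition (a sketch of an audit, not the algorithm from the paper), one could take a fitted predictor and a collection of possibly overlapping group-membership masks and flag any group whose binned calibration error exceeds α:

```python
import numpy as np

def calibration_gap(probs, outcomes, member, n_bins=10):
    """Binned calibration error of the predictions restricted to one group."""
    p, y = probs[member], outcomes[member]
    if len(p) == 0:
        return 0.0
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    gap = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            gap += in_bin.mean() * abs(p[in_bin].mean() - y[in_bin].mean())
    return gap

def audit_multicalibration(probs, outcomes, groups, alpha=0.05):
    """groups: dict mapping a group name to a boolean membership mask over rows
    (groups may overlap). Returns the groups whose calibration error exceeds alpha."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    flagged = {}
    for name, member in groups.items():
        gap = calibration_gap(probs, outcomes, np.asarray(member, bool))
        if gap > alpha:
            flagged[name] = gap
    return flagged
```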
Predictors that simultaneously minimize loss for many downstream decision tasks
Omniprediction, introduced here, starts with the slightly different goal of identifying predictors that can be used to optimize any loss in a family of loss functions relevant to downstream decision problems. This is useful because we often don’t know, at the time we are developing the predictive model, what sort of utility functions the downstream decision-makers will be facing.
An omnipredictor is defined with respect to a class of loss functions and (again) a class of functions or hypotheses (e.g., decision trees, neural nets). But this time the hypothesis class fixes the space of functions that benchmark the performance of the (omni)predictor we get. Specifically, for a family L of loss functions and a family C of hypotheses c : X → R, an (L, C)-omnipredictor is a predictor f with the property that for every loss function ℓ in L, there is a post-processing function k_ℓ such that the expected loss of the composition of k_ℓ with f, measured using ℓ, is almost as small as that of the best hypothesis c in C.
Essentially an omnipredictor is extracting all the predictive power from C, so that we don’t need to worry about the specific loss function that characterizes some downstream decision problem. Once we have such a predictor, we post-process it by applying a simple transformation function to its predictions, chosen based on the loss function we care about, and we’re guaranteed to do well compared with any alternative predictor that can be defined within the class of functions we’ve specified.
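Here’s a minimal sketch of what that post-processing step looks like for a binary action (the losses and names below are hypothetical illustrations, not the k_ℓ construction from the paper): given a predicted probability, pick the action that minimizes expected loss under that probability.

```python
import numpy as np

def post_process(p, loss, actions):
    """Given predicted probability p for y=1, choose the action minimizing
    expected loss under that prediction: k_loss(p) = argmin_a E_{y~Ber(p)} loss(a, y)."""
    expected = [p * loss(a, 1) + (1 - p) * loss(a, 0) for a in actions]
    return actions[int(np.argmin(expected))]

# Two hypothetical downstream losses over a binary action space:
loss_costly_miss = lambda a, y: 1.0 * a * (1 - y) + 5.0 * (1 - a) * y  # missing y=1 is costly
loss_costly_act  = lambda a, y: 5.0 * a * (1 - y) + 1.0 * (1 - a) * y  # acting on y=0 is costly

p_hat = 0.3  # the same predictor output for some x
print(post_process(p_hat, loss_costly_miss, actions=[0, 1]))  # -> 1 (act)
print(post_process(p_hat, loss_costly_act,  actions=[0, 1]))  # -> 0 (don't act)
```

The same prediction gets mapped to different actions depending on the loss, which is the sense in which one predictor can serve many downstream tasks.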
Omnipredictors and multicalibration are related
Although the goal of omniprediction is slightly different from the fairness motivation for multicalibration, it turns out the two are closely related. Multicalibration can provide provable guarantees on the transferability of a model’s predictions to different loss functions. In particular, multicalibrated predictors are omnipredictors for the class of convex loss functions.
One way to understand how they are related is through covariance. Say we partition X into disjoint subsets. Multicalibration implies that for an average cell of the partition, for any hypothesis c in C, conditioning on the label does not change the expectation of c(x): no c has extra predictive power on y once we know which cell we’re in. We can think of multicalibration as learning a model of nature that fools correlation tests from C.
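A small sketch of this correlation-test view (again my own illustration, with hypothetical hypothesis functions passed in as a dict): bin x by the model’s prediction and check whether any c in C still covaries with y within a bin.

```python
import numpy as np

def correlation_audit(preds, outcomes, features, hypotheses, n_bins=10):
    """Within each level set (bin) of the predictor, check whether any hypothesis
    c in C still covaries with the outcome. A large covariance means c has
    predictive power the model failed to capture."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    worst = {}
    for name, c in hypotheses.items():
        c_vals = np.asarray([c(x) for x in features], float)
        covs = []
        for b in range(n_bins):
            m = bins == b
            if m.sum() > 1:
                covs.append(np.cov(c_vals[m], outcomes[m])[0, 1])
        worst[name] = max(abs(v) for v in covs) if covs else 0.0
    return worst
```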
Finding such predictors (in theory)
Naturally there is interest in how hard it is to find predictors that satisfy these definitions. For omnipredictors, a main result is that for any hypothesis class C, a weak agnostic learner for C is sufficient to efficiently compute an (L, C)-omnipredictor where L consists of all decomposable convex loss functions obeying mild Lipschitz conditions. Weak agnostic learning here means that whenever some hypothesis in C does noticeably better than random guessing, we can efficiently find one that does. Both multicalibration and omniprediction are related to boosting: if our algorithm returns a predictor that is only slightly better than chance, we can train another model to predict the error, and repeat this until our “weak learners” together give us a strong learner, one that is well correlated with the true function.
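A caricature of that boosting loop, under my own simplifications (a shallow regression tree stands in for a weak agnostic learner for some class C, and the stopping rule is ad hoc):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_toward_calibration(X, y, rounds=50, eta=0.5, tol=1e-3):
    """Boosting-style loop: repeatedly fit a weak learner to the residual
    y - f(x); if it finds structure the current predictions miss, use it to
    correct them, and stop once no detectable structure remains."""
    y = np.asarray(y, float)
    f = np.full(len(y), y.mean())            # start from the base rate
    for _ in range(rounds):
        weak = DecisionTreeRegressor(max_depth=2).fit(X, y - f)
        correction = weak.predict(X)
        if np.abs(correction).mean() < tol:  # weak learner is doing no better than chance
            break
        f = np.clip(f + eta * correction, 0.0, 1.0)
    return f
```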
Somewhat surprisingly, the complexity of finding the omnipredictor turns out to be comparable to that of finding the best predictor for some single loss function.
But can we achieve them in practice?
If it wasn’t already obvious, work on these topics is heavily theoretical. Compared to other research on quantifying prediction uncertainty, like conformal prediction, there are far fewer empirical results. Part of the reason I took an interest in the first place was that I was hearing multiple theorists respond to discussions of how to communicate prediction uncertainty by proposing that this was easy because you could just use a multicalibrated model. Hmmm.
It seems obvious that we need a lot of data to achieve multicalibration. How should researchers working on problems where the label spaces are very large (e.g., some medical diagnosis problems) or where the feature space is very high dimensional and hard to slice up think about these solutions?
Here it’s interesting to contrast how more practically-oriented papers talk about calibration. For example, I came across some papers related to calibration for risk prediction in medicine where the authors seem to be arguing that multicalibration is mostly hopeless to expect in practice. E.g., this paper implies that “moderate calibration” (the definition of calibration at the start of this post) is all that is needed, and “strong calibration,” which requires that the event rate equals the predicted risk for every covariate pattern, is utopic and not worth wasting time on. The authors argue that moderate calibration guarantees non-harmful decision-making, sounding kind of like the claims about calibration being sufficient for good decisions that I was questioning in my last post. Another implies that the data demands for approaches like multicalibration are too extreme, but that it’s still useful to test for strong calibration to see whether there are specific groups that are poorly calibrated.
I have more to say on the practical aspects, since as usual it’s the gap left between the theory and practice that intrigues me. But this post is already quite long so I’ll follow up on this next week.
Not sure I have fully grasped the topic here, but an interesting approach I’ve seen before is to repeat the model fitting process on bootstrapped training sets, with the requirement that each bootstrapped set meets a minimum sample size requirement for model stability. The spread of the predictions on a test set then roughly mirrors the prediction error you can expect when generalising, and you can visualise the miscalibration with LOWESS smoothers (one for every iteration). Smoothing has the advantage of pointing out miscalibration over the range of possible Y values, as opposed to bins, and can potentially identify subgroups by going back in after the fact and seeing who’s being lit up in the data.
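A rough sketch of the procedure I have in mind, with a logistic model and arbitrary smoothing settings standing in for whatever would actually be used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from statsmodels.nonparametric.smoothers_lowess import lowess

def bootstrap_calibration_curves(X_train, y_train, X_test, y_test,
                                 n_boot=200, min_events=50, seed=0):
    """Refit the model on bootstrap resamples of the training data (skipping
    resamples with too few events for stability) and return one LOWESS-smoothed
    calibration curve per refit, evaluated on the test set."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_train), len(y_train))
        if y_train[idx].sum() < min_events:
            continue
        model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        p = model.predict_proba(X_test)[:, 1]
        # smoothed observed event rate as a function of predicted risk
        curves.append(lowess(y_test, p, frac=0.3, return_sorted=True))
    return curves
```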
A philosophical question that has bothered me for some time is the idea of calibration at the individual prediction level – the idea that a person’s risk of Y really was, say, 20%. In the case of clinical informatics, calibration would relate to the proportionality of the treatment to the patient’s observed risk. For example, if a prediction was relatively high for this individual due to chance or an individual quirk like asymptomatic high systolic pressure (rather than reflecting some underlying pathophysiology), a model could be calibrated more generally, but lead to a disproportionate and potentially harmful treatment response. This to me is why any clinical prediction modelling exercise should begin with comprehensively reviewing the circumstances the predictors will be used in, the decision-making following prediction, etc. and why the idea of any kind of model being thrown into clinical practice without objectively reviewing its potential for harm is terrifying.
To clarify: the spread of predictions for each individual in a test set*
Interesting. Sounds like what ML people might call data augmentation.
I agree on individual prediction… got into this a bit in my last post as a point where the calibration literature misses what matters for many decisions in practice (e.g. being able to show lack of negligence on a specific case). Some of the multicalibration/omniprediction lit gets into the extent to which we can expect to distinguish between the model we’ve learned and the best (true) model, which I find useful for drawing attention to the limits of calibration. I like your philosophy of starting with a review of exactly how the predictors will be used in decision making. My sense is that this is rarely how it happens in many places where models are now being deployed. What a rigorous deployment and evaluation protocol for decision assistance looks like is something I’ve been thinking about lately.
This is a cool post! My background is in operations research, and the problem of calibrating a model was discussed in a good paper by Elmachtoub and Grigas (2022) (https://pubsonline.informs.org/doi/10.1287/mnsc.2020.3922). They show that more accurate predictive models can lead to worse decisions (using an LP as the example), and then propose training predictive models with decision loss instead of prediction loss. There’s a nice convex reformulation using duality too.
Second, the point on multicalibration seems to resemble a distributionally robust optimization problem. Since you will never know the true generating distribution of your data, the idea of DRO is to find the distribution that leads to the worst-case objective function value given constraints on the distance by which the empirical distribution can shift. There’s a paper by Daniel Kuhn and others somewhere that shows the DRO problem is the least conservative model that gives a guarantee on out-of-sample performance.
*the problem of calibrating a model for downstream decision tasks
Thanks. The INFORMS paper seems related; it also makes me think of this recent work defining calibration decision loss (the max improvement in payoff, over all payoff-bounded decision tasks, from calibrating the predictions): https://arxiv.org/abs/2404.13503
Maybe you mean this paper by Kuhn: https://proceedings.mlr.press/v139/li21t/li21t.pdf
I found the Kuhn paper; it’s this one: https://pubsonline.informs.org/doi/10.1287/mnsc.2020.3678. To me, it’s an incredible result, but I’m only familiar with the operations research literature so I have no clue how it relates to the CS or Stats literature on similar topics.
Thanks for sharing that arxiv paper! I’m reading through it now, lots of good stuff and I’m only a few pages in. As a young grad student, it’s a little overwhelming reaching out beyond my own field for research on similar topics but hopefully will be rewarding!