
What property is important in a risk prediction model? Discrimination or calibration?

Sanjay Kaul writes:

I am sure you must be aware of the recent controversy ignited by the 2013 American College of Cardiology/American Heart Association Cholesterol Treatment Guidelines that were released last month. They have been the subject of several newspaper articles and blogs, most of them missing the thrust of the guidelines. There is much to admire about these guidelines as they are more faithfully aligned with high-quality ‘actionable’ evidence than the 3 previous iterations. However, the controversy is focused on the performance of the risk calculator introduced for initiating treatment in individuals without established atherosclerotic disease or diabetes (so-called primary prevention cohort). The guidelines recommend statins for primary prevention in individuals who have a 10-year risk estimated to be 7.5% or higher. The risk calculator was derived from population cohorts studied in the 1990s. The discrimination for predicting the risk of atherosclerotic cardiovascular events defined as coronary heart disease deaths, myocardial infarctions and strokes is fair as measured by the area under the ROC curve (c index of 0.7 to 0.8 across the ethnic spectrum). [As an aside, very few risk models developed to predict risk of clinical events have c index >0.80]. However, when applied to recent population and randomized controlled trial cohorts, the risk calculator is limited by ‘miscalibration’ (overestimates risk by 75% to 150%). This is of course understandable as the prediction cohort is quite different from the development cohort in terms of baseline risk (many subjects in the former were on statin treatment that modifies risk). The single-event probability estimate (which is essentially what calibration characterizes) is referenced to the ‘state’ of the development cohort. If the state is different (as in the prediction cohort), should it surprise anyone that the risk model miscalibrates risk? So the question is: does calibration trump discrimination in risk prediction?

I guess this harks back to the age-old conundrum about weather forecasting: what is important, that it will rain tomorrow (patient will develop an event as a measure of discrimination) or that there is a 30% probability of rain tomorrow (there is a 30% probability of the individual patient developing an event in the future as a function of calibration)?

I have attached the Ridker and Cook paper published in Lancet and the response from the chairs of the guidelines (Stone and Lloyd-Jones) published in Lancet. The original risk tool publication is also attached (JACC paper).

Would appreciate posting this on your blog to get useful insights from you and your well informed readers.

This looks interesting but now I’m feeling too overwhelmed to look at it in detail. (It’s too bad that I read all sorts of crappy papers but then feel too busy to read the interesting stuff. . . .) Maybe some of you will have useful thoughts?


  1. jimmy says:

    i am confused by what these terms mean. what are the exact definitions of discrimination and calibration? and based on a quick reading of the above, are they roughly saying that in-sample prediction is different from out-of-sample prediction? wouldn’t one expect that? can someone lend some clarity?

    • anon says:

      If your probability estimates are well-calibrated then they will correspond to frequencies. Greater discrimination means greater AUC.

      That’s my understanding at least

    • dab says:

      FWIW, this paper by Nancy R Cook (which I found by following the link to Matt Bogard’s blog given in a comment below; she is also the author of one of the articles cited in the post above) seems to define the terms precisely in this context. I’m still not sure I understand them, but at least now I have definitions to chew on…. Hope that helps.
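
A rough sketch of the two definitions in this thread, on simulated data. Everything in it is an assumption for illustration and nothing comes from the guideline papers: the risks are drawn uniformly between 1% and 30%, and the "inflated" model simply doubles each true risk, which sits inside the 75% to 150% overestimation mentioned in the post. Discrimination is measured by the AUC (c index), calibration by comparing mean predicted risk with observed event frequency within risk groups.

```python
import random

def auc(scores, outcomes):
    """Rank-based AUC (c index): chance that a random event case outscores
    a random non-event case, with ties counted as one half."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, outcomes) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def calibration_table(pred, outcomes, bins=5):
    """Mean predicted risk vs. observed event frequency in equal-sized risk groups."""
    pairs = sorted(zip(pred, outcomes))
    size = len(pairs) // bins
    rows = []
    for b in range(bins):
        chunk = pairs[b * size:(b + 1) * size]
        rows.append((sum(p for p, _ in chunk) / len(chunk),
                     sum(y for _, y in chunk) / len(chunk)))
    return rows

random.seed(0)
n = 20000
true_risk = [random.uniform(0.01, 0.30) for _ in range(n)]  # hypothetical 10-year risks
outcome = [1 if random.random() < p else 0 for p in true_risk]
inflated = [2 * p for p in true_risk]  # hypothetical model that overestimates by 100%

print("AUC, true risks:    ", round(auc(true_risk, outcome), 3))
print("AUC, inflated risks:", round(auc(inflated, outcome), 3))
for name, pred in (("true", true_risk), ("inflated", inflated)):
    for mean_pred, observed in calibration_table(pred, outcome):
        print(f"{name:9s} predicted {mean_pred:.3f}   observed {observed:.3f}")
```

Doubling every risk leaves the ranking of patients untouched, so the two AUCs agree, while the calibration table for the inflated model shows predicted risks at roughly twice the observed frequencies. That is the sense in which a model can discriminate equally well in a new cohort and still be miscalibrated there.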

  2. Matt Bogard says:

    I think anon has a correct interpretation. I’ve been wrestling with these two concepts for a while, and still am not sure when each metric is most appropriate in a given context or application. About a year ago I tried to get a handle on this on my blog, but would definitely like more insight.

  3. Sherman Dorn says:

    Separate issue to ponder, since I’m wide awake at 4 pm: I’m wondering if the cost of miscalibration can outweigh the technical issue of discrimination vs. calibration in a practical sense; that is, that’s not inherent in the tradeoff but rather a trait of the types of miscalibration we’re likely to notice or where it makes more sense to devote resources. Overdiagnosis/overtreatment of prostate cancer has a pretty serious cost, so miscalibration matters a great deal by itself.

  4. M says:

    Cardiovascular epidemiology has been very heavily influenced by the Cochrane ‘hierarchy of evidence’ mindset, where unconfounded (but otherwise) marginal effects are the estimands of primary interest, and evidence of heterogeneity is often dismissed as specious ‘subgroup analysis’. I think this culture, as well as an over-emphasis on “clinical” decision-making rather than thinking of system-level impacts, contributes to myopic discussions in the cardiovascular field about risk-benefit. This is a tangent to the ‘discrimination vs. calibration’ discussion but I think (no offense to Sanjay Kaul) the culture is very clearly reflected in the description of the motivating question, which mentions possible differences in baseline risk between the training and test samples, but not differences in the distributions of effect modifiers between samples, or the presence of a market externality (‘market’ decision made by doctors and patients, with implications for the ambient level of statins in water etc.) that would lead to a difference between the clinical optimum and the societal optimum.

    Although there is a substantial literature on statin adverse events and pharmacological interactions with common substances including grapefruit juice, I have rarely heard cardiovascular epidemiologists touting the benefits of statins stopping to think about possible harms (in joint exposure to effect modifiers) – at least, until it was uncovered that diabetes risk might increase with use of statins. In one cardiovascular epidemiology course I took (before the diabetes evidence started coming in), the instructor actually suggested putting statins in the water supply because so many people have terrible ‘risk profiles.’ I think this could have major negative impacts on the frequency of adverse outcomes (not only among persons with concomitant grapefruit exposure etc., but also in utero exposures and other exposures to persons with no benefit from statins), as well as uncertain environmental impacts (i.e. on fish, amphibians, etc.).

    The ACC/AHA recommendation that statins be used in those with 10-year risk estimates > 7.5% makes me queasy for very similar reasons, if on a smaller scale than 100% of the population exposed. The performance of the risk prediction models is not great for individual predictions, and there are possible harms from being prescribed these medicines, so the misclassification matters. But more importantly, what proportion of the population would actually fall into this ‘primary prevention’ risk set? While statins might not be going directly into the water supply intentionally, if the drinking water supplies (many already contaminated with statins from the wastewater stream) become even more saturated with statins, then maybe the dose would move to a range where we start seeing more negative impacts?

    I’d welcome any thoughts on whether these guidelines should have weigh-in from other stakeholders than cardiovascular epidemiologists and cardiologists if the implication is a change in population frequency of exposure (i.e. from drinking water); whether thinking about effect modification would lead to different / more nuanced recommendations; whether the risk prediction model makes sense to use now if there are temporal trends in effect modifiers between the 1990s and present; etc.

  5. Thomas says:

    Is it a choice between discrimination and calibration? Or is it a matter of updating the risk calculator’s parameters that were developed about 20 years ago for a different cohort population? If “updating” is what is meant by calibration (I’m with Jimmy, above, in admitting to some confusion as to what these two terms mean precisely), recalibration definitely seems to be the order of the day. Once that’s done a discussion about discrimination would make more sense.

  6. Erik says:

    I think it’s rather clear that discrimination is the first necessity in a risk prediction model. Just blindly predicting the overall risk for the entire population would give you a calibrated model.

    However, calibration is still important to get right before you start using the model. Otherwise the doctor cannot really interpret the results in a meaningful way to give good treatment recommendations. Also, depending on how you quantify discrimination (AUC?), you have the fact that calibration depends even more on the target population.

    A good example for this is to look at how logistic regression behaves: for discrimination purposes the intercept is not needed at all. If you want to be calibrated as well, you need to get it right for the target population, where it corresponds to the baseline risk in the population if the other predictors are centered.
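
The logistic regression point can be checked numerically. This is a minimal sketch on simulated data, with a single centered predictor and made-up coefficients (`SLOPE`, `TRUE_INTERCEPT`, and the "too high" intercept of -1 are all hypothetical). It compares the same model under two intercepts, plus the "blind" model that predicts the overall event rate for everyone.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def auc(scores, outcomes):
    """Pairwise AUC: share of (event, non-event) pairs ranked correctly, ties count one half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

random.seed(1)
n = 2000
SLOPE = 1.0            # hypothetical coefficient of a single centered predictor
TRUE_INTERCEPT = -2.0  # hypothetical log-odds of the event at the predictor's mean
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1 if random.random() < sigmoid(TRUE_INTERCEPT + SLOPE * xi) else 0 for xi in x]

def predict(intercept):
    """Predicted risks from the same slope with a chosen intercept."""
    return [sigmoid(intercept + SLOPE * xi) for xi in x]

event_rate = sum(y) / n
models = {
    "right intercept": predict(-2.0),
    "intercept too high": predict(-1.0),        # baseline risk overstated
    "constant = event rate": [event_rate] * n,  # blind prediction of the overall risk
}
for name, pred in models.items():
    print(f"{name:22s} AUC {auc(pred, y):.3f}   "
          f"mean predicted {sum(pred) / n:.3f}   observed {event_rate:.3f}")
```

Changing the intercept shifts every predicted risk but not the ordering of patients, so the AUC is unaffected while the mean predicted risk drifts away from the observed event rate. The constant model is the mirror image: its mean prediction matches the event rate exactly, and its AUC is 0.5.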

  7. Jon Williams says:

    In general, is there a difference in how statisticians and machine-learning folks view calibration?

    Cook’s above-linked paper (Cook, Nancy R. “Use and misuse of the receiver operating characteristic curve in risk prediction.” Circulation 115.7 (2007): 928-935) basically uses the word “calibration” as a noun. The extent to which a model’s probabilities agree with observed frequencies is a measure of its calibration. Cook emphasizes the importance of considering calibration when building a model (at least when the goal is “to categorize individuals into risk strata”), but doesn’t consider any post-processing to improve the calibration.

    In the machine learning literature, on the other hand, “calibration” is an action: you calibrate your model to obtain good probabilities (if you need good probabilities). See e.g. Niculescu-Mizil, Alexandru, and Rich Caruana. “Predicting good probabilities with supervised learning.” Proceedings of the 22nd international conference on Machine learning. ACM, 2005.
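
To make "calibration as an action" concrete, here is a small sketch of isotonic regression, one of the post-processing methods discussed in the Niculescu-Mizil and Caruana paper (Platt scaling is the other). The pool-adjacent-violators routine below is a plain-Python illustration, not their code, and the data are simulated under the assumption that the raw model doubles every true risk.

```python
import bisect
import random

def fit_isotonic(pred, outcomes):
    """Pool-adjacent-violators: fit a non-decreasing step function that maps
    a raw prediction to an observed event frequency."""
    blocks = []  # each block is [sum of outcomes, count, largest prediction in block]
    for p, y in sorted(zip(pred, outcomes)):
        blocks.append([y, 1, p])
        # Merge backwards while the previous block's mean is not below the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]:
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]

    def recalibrate(p):
        # Use the first block whose upper edge is at or above p, else the last block.
        i = min(bisect.bisect_left(uppers, p), len(values) - 1)
        return values[i]

    return recalibrate

def simulate(n):
    risk = [random.uniform(0.01, 0.30) for _ in range(n)]  # hypothetical true risks
    y = [1 if random.random() < p else 0 for p in risk]
    raw = [2 * p for p in risk]  # hypothetical model that overestimates risk by 100%
    return raw, y

random.seed(2)
raw_cal, y_cal = simulate(10000)  # target-population data used to recalibrate
raw_new, y_new = simulate(10000)  # fresh data used to check the result

recalibrate = fit_isotonic(raw_cal, y_cal)
recal_new = [recalibrate(p) for p in raw_new]
observed = sum(y_new) / len(y_new)

print("observed event rate:      ", round(observed, 3))
print("mean raw prediction:      ", round(sum(raw_new) / len(raw_new), 3))
print("mean recalibrated prediction:", round(sum(recal_new) / len(recal_new), 3))
```

Because the fitted map is non-decreasing, it can only tie patients together, never reverse their order, so discrimination is largely preserved while the probabilities are pulled toward observed frequencies. The catch is that it needs outcome data from the population where the model will be used.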

  8. This whole question just seems like it’s symptomatic of a severe lack of understanding of what these people are supposed to be doing.

    hint: no-one gives a rats ass how well calibrated your model is, or how accurately you predict risk, what society cares about are the relative costs and benefits of proposed uses of your model for all of society. To the extent that calibration or prediction help you improve benefits and reduce costs, that’s the only thing that matters.

    The fact that there is no discussion in this question about things like QALYs or dollars or the effects on pollution (of the water supply as mentioned by “M” above) shows a severe lack of understanding of what the point of guidelines and risk models is.

    • Just to be fair and not point the finger at one particular area of risk analysis, the same thing occurs in a field closer to home for me. For example civil engineering may adopt new models for building code analysis, and these may be more accurate for predicting certain types of events, and ultimately may require more money invested when new buildings are built. But this may actually slow the adoption of these new codes, reduce the rate at which new buildings are put up, and increase the risk to society overall. Something like this may very well have happened when new seismic guidelines were adopted for hospitals in CA. The result was a large number of hospitals closing, and so it’s conceivable that more people are dying of heart attacks or car accidents or whatever, much more than are likely to be saved by the more stringent seismic standards.

    • Teetee says:

      I disagree that the absence of talk about QALYs or dollars shows a severe lack of understanding of the overall implications and use of these models… I thought this was a discussion about what is needed to create a proper risk prediction tool… to that extent, technical definitions of calibration and discrimination, which bear on how they are used (usually stratified by discipline), are needed…
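
One way to connect the two positions in this comment is to compute what a miscalibrated calculator does to the costs and benefits of a treat-above-threshold rule. The sketch below is a toy calculation, not an analysis of statins: the risk distribution, the relative risk reduction, and the per-person harm are all hypothetical, and the harm is chosen so that a true risk of 7.5% is exactly the break-even point.

```python
import random

random.seed(3)
true_risk = [random.uniform(0.01, 0.30) for _ in range(10000)]  # hypothetical 10-year risks

THRESHOLD = 0.075          # treat when predicted 10-year risk is at least 7.5%
RELATIVE_REDUCTION = 0.25  # hypothetical: treatment averts a quarter of events
# Hypothetical per-person harm of treatment, in "event" units, set so that
# treating someone whose true risk is exactly 7.5% breaks even.
HARM = RELATIVE_REDUCTION * THRESHOLD

def policy_value(predicted):
    """Expected net benefit per 1000 people (events averted minus harm)
    and the share of people treated, when treating at or above THRESHOLD."""
    treated = [p for p, q in zip(true_risk, predicted) if q >= THRESHOLD]
    net = sum(RELATIVE_REDUCTION * p - HARM for p in treated)
    return 1000 * net / len(true_risk), len(treated) / len(true_risk)

calibrated = policy_value(true_risk)                 # calculator reports the true risk
inflated = policy_value([2 * p for p in true_risk])  # calculator doubles every risk

print("calibrated: net benefit per 1000 = %.2f, treated = %.1f%%" % (calibrated[0], 100 * calibrated[1]))
print("inflated:   net benefit per 1000 = %.2f, treated = %.1f%%" % (inflated[0], 100 * inflated[1]))
```

Under these assumptions the inflated calculator treats more people and delivers less net benefit, because it pulls in people whose true risk is below the break-even point. With a smaller assumed harm the gap shrinks, which is the sense in which calibration matters only through the decisions it changes.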

  9. John Goodwin says:

    ‘hint: no-one gives a rats ass how well calibrated your model is, or how accurately you predict risk, what society cares about are the relative costs and benefits of proposed uses of your model for all of society. To the extent that calibration or prediction help you improve benefits and reduce costs, that’s the only thing that matters.’

    This isn’t nearly cynical enough. A Marxian might say that risk analysis is used by the elites as a convenient device to silence critiques of their personal preferences and tastes, using the socially constructed categories devised by them of ‘costs’ and ‘benefits’.

    The interesting technical point, at the level of system analysis, is why one would suppose the social solution space is path independent, and that static evaluation of costs and benefits, without smuggling in path dependence via a Hamiltonian formulation or shadow costs, is even reasonable.

    So I think your ‘rats ass’ test fails on two counts: it is insufficiently cynical about the role of risk analysis in politics, and it is technically short of providing an objective function (the social welfare function behind the costs and benefits) in a path dependent world. Why on earth wouldn’t we want to evaluate the effect of using this policy for one decade and that one for another? Cost-benefit analysis would never yield that.

    Genuine question: is there anyone who tries to include treatment *path* in risk analysis like this?

    • I’m not a Marxist, and I do honestly believe that people want outcomes not accuracy. But I definitely agree with you that path dependence (which is another way of saying *individual utility* I think) is important. So I’m not sure if we’re on the same side of the fence or not. I think in something like medical care, path dependence is considered in policies by recommending the order in which people try treatments. Something like: “first try diet and exercise, then try statins, then heart bypass…” obviously if one of these works to slow the risk of serious disease, then we stop treatment there and don’t progress to more serious higher cost treatments.

      If you’re saying that defining “costs” and “benefits” on behalf of society is an obnoxious self-serving thing to do, then I agree with you! I frequently complain that doctors make recommendations like “don’t treat sinus infections with antibiotics” based on some tradeoffs that in all likelihood don’t mimic the values of their patients. They make these recommendations because on average the antibiotics “only” shorten the duration by a couple of days, and the supposed “costs” are less effectiveness in antibiotics for more serious illnesses and more billing and insurance overhead etc. But I don’t think there is actually much evidence that “overprescribing antibiotics” to humans is actually causing antibiotic resistance, and all the pharmacologists I’ve talked to about it believe that essentially ALL of the resistance problem is created by agricultural overuse.

      Anyway, I’m serious, a poorly calibrated, low accuracy model which nevertheless takes people from “dying by the thousands of an easily preventable disease” to “essentially eradicating this disease” would be hugely valuable, even though it’s performing poorly by statistical standards.
