$3M health care prediction challenge

I received the following press release from the Heritage Provider Network, “the largest limited Knox-Keene licensed managed care organization in California.” I have no idea what this means, but I assume it’s some sort of HMO.

In any case, this looks like it could be interesting:

Participants in the Health Prize challenge will be given a data set comprised of the de-identified medical records of 100,000 individuals who are members of HPN. The teams will then need to predict the hospitalization of a set percentage of those members who went to the hospital during the year following the start date, and do so with a defined accuracy rate. The winners will receive the $3 million prize. . . . the contest is designed to spur involvement by others involved in analytics, such as those involved in data mining and predictive modeling who may not currently be working in health care. “We believe that doing so will bring innovative thinking to health analytics and may allow us to solve at least part of the health care cost conundrum . . .”

I don’t know enough about health policy to know if this makes sense. Ultimately, the goal is not to predict hospitalization, but to avoid it. But maybe if you can predict it well, it could be possible to design the system a bit better. The current system–in which the doctor’s office is open about 40 hours a week, and otherwise you have to go to the emergency room–is a joke.


  1. K? O'Rourke says:

    "Sounds good" – passionate minds that don’t know what can’t be done.

    The real challenge in working with claims and other admin data is understanding how the data was collected and coded along with how vested interests may have impacted this.

    Even obvious things can be really, really hard to get at. One example I once checked for an old colleague (whose published table I noticed was garbled in print, and after addressing that we wanted to look further) involved literally dozens of SAS recoding programs that were pages long …

    Anyways, the right group with appropriate backgrounds may well be able to work this through, but this sort of approach seems better suited to a problem that can be handed off as a stand-alone project.

    A bit disappointed that hiring well-trained individuals to collaborate repeatedly over time does not work or seems to be being given up on…


  2. anonymous says:

    "Ultimately, the goal is not to predict hospitalization, but to avoid it. But maybe if you can predict it well, it could be possible to design the system a bit better. The current system–in which the doctor's office is open about 40 hours a week, and otherwise you have to go the emergency room–is a joke."

    These are two different issues. One is about predicting hospitalization, presumably due to chronic conditions. The other is about the increasing difficulty in seeing a primary care physician for urgent but routine issues (e.g. strep throat), for which people are increasingly going to the ED – partly because of a shortage of primary care physicians, partly because of "moral hazard", and partly because there is a perception that it will be quicker. Provider organizations are dealing with this through urgent care centers at clinics and hospitals. No, you won't see "your" doctor, but most issues will be something a resident or PA can handle.

  3. Numeric says:

    "The current system–in which the doctor's office is open about 40 hours a week, and otherwise you have to go to the emergency room–is a joke."

    Many states have "half-way houses"–urgent care centers, open 24 hours a day, which provide pretty much all of the services a doctor's office would (if not more). These centers are far more cost-efficient than trauma centers, as emergency rooms are supposed to be. There are times, of course, when you need a trauma center.

  4. Marmaduke says:

    The only reason I can think of to want this sort of prediction is to figure out what rates to charge so that you can turn a profit at the end of the year. I don't see how solving this problem will improve health care.

  5. DWG says:

    Knox-Keene is the license that all fully-insured managed care organizations (as well as Blue Cross and Blue Shield PPOs) need to operate in California.

    A "limited license" is, per the below link "a medical and hospital organization[sic] authorized to assume full-risk."

    So it might be a fair translation to say Heritage is "the largest risk-bearing medical/hospital group in California", though this leaves out all the largest actual health plans.

    Looking quickly at the regulator's site (Dept of Managed Health Care), they had about 450,000 enrollees in 2009.

  6. Igor Carron says:

    I smell lawsuits. If the Netflix prize, an open contest for predicting people's taste in movies, ran into these sorts of judicial problems, then I don't see how HPN could avoid it.


  7. Ed says:

    "The current system–in which the doctor's office is open about 40 hours a week, and otherwise you have to go the emergency room–is a joke."

    Good point. How did we get this?

  8. zbicyclist says:

    If you can predict claims, you can figure out (a) who you want to drop, or (b) who you want to discourage in some way if you can't drop them, or (c) charge profitably if you can.

    Remember, their "health care cost conundrum" isn't necessarily our "health care cost conundrum". Every bureaucracy first cares about itself.

  9. Bob Carpenter says:

    There have been several such prediction contests already using the natural-language components of things like discharge summaries. This challenge looks like it adds other clinical predictors such as test results. That's great, because parsing those things out of the natural language is rather error prone.

    The first contest I know about was run by Cincinnati Children's Hospital, and involved predicting ICD-9 diagnosis codes:

    Then there have been several run by the i2b2, which involved predicting conditions for patients based on text discharge records:

    In the U.S., there are standards for de-identification in medical records. Here's a good overview:

    De-identification messes up the language processing a bit, but in predictable ways.

  10. K? O'Rourke says:

    Bob – neat stuff, but perhaps a bit different in that it's individual predictions of ICD-9 from text, and intimate knowledge of how the "variables" were defined/coded in the text is not that important.

    If you are getting 100,000 de-identified medical records (i.e. past treatments and diagnoses) then you have to really worry about re-identification through various combinations of rare events – the 50-year-old male breast cancer patient who had a motor vehicle trauma – i.e. you need to make sure there are at least n patients with every possible combination of variables.

    This group does worry a lot about that

    Note also, if the 3 million seems like a lot – this group spends 22 million a year – see financials here

    On the other hand they have a much larger non-selected sample of almost everyone in that province ~ 6 million


  11. K? O'Rourke says:

    This might also have some bearing on whether it is better data management/coding or better statistical methods that will be critical.

    A comparison of statistical learning methods on the GUSTO database

    Ennis M, Hinton G, Naylor D, Revow M, Tibshirani R. A comparison of statistical learning methods on the GUSTO database. Stat Med. 1998; 17 (21): 2501-2508.

    A battery of modern, adaptive non-linear learning methods are applied to a large real database of cardiac patient data. Each method is used to predict 30-day mortality from a large number of potential risk factors, and their performances are compared. None of the methods could outperform a relatively simple logistic regression model previously developed for this problem.


  12. K in TX says:

    Although I don't know much about health care, I can tell you this – providing large incentives to the at-large public for solving large problems is far more efficient than hiring a small team of 'experts' and waiting for their results.

    I wish the government would do more of this.

  13. Igor Carron says:

    Of interest to this discussion:
    Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case Studies by Arthur Asuncion, Michael Goodrich


  14. K? O'Rourke says:

    K in TX: acknowledged, small teams of 'experts' are all too often disappointing.

    But I do know a lot about trying to learn from administratively collected health care data, subject to the constraints of non-disclosure clauses I may have become encumbered by working there.

    The critical task seems to be extracting and putting together real rather than "faked" history on patients, based on almost insider knowledge of how various players report and game the reporting systems. And that's really hard work that takes person-years for each kind of outcome one tries to predict.

    If that has already been done, then it makes a bit more sense – but as Igor's post suggests, ensuring privacy seems almost impossible.


  15. Bob Carpenter says:

    I understand the risk of re-identification and even how likely it is to re-identify some people given lots of patients and lots of predictors. Given the continuous predictors, I doubt any two patients have the same combination of variables.

    My point was just that there were laws in the US about what counts as compliance with anonymization. Age is one of the things HIPAA requires to be anonymized (a real pain for pediatrics and geriatrics in particular).

    The i2b2 challenges weren't about ICD-9, but they were more about extracting/predicting conditions from text than about other predictors.

  16. Chris says:

    @Bob Age is not protected under HIPAA, just date of birth. This can be an issue in pediatrics when age in months is needed because month&year makes re-identification easier.

    A big problem with working with data like this is that it completely censors people who do not visit the doctor at all or have no health care coverage. What non-medical claims data is there that helps tell who successfully avoids the system? Younger men are more likely to be completely absent from the data set. This may be OK for HPN because they are only interested in treatment-seeking individuals, but makes it hard to generalize to the broader population.

  17. Igor Carron says:


    Thanks for pointing out the past examples of contests. The element I wanted to highlight in this discussion is that identification and database cloning have only recently surfaced as issues, thanks to new developments in compressed sensing/group testing. In light of these developments, the real question is: should the implementation of HIPAA be re-thought?


  18. K? O'Rourke says:

    Bob, first seconding Igor's thanks; but there is common law and the "reasonable man" standard to worry about here as well

    i.e. that damage due to disclosure of an individual's private medical records that a reasonable man could/should have foreseen happening …
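
The re-identification worry raised in comment 10 (rare combinations of attributes, such as the 50-year-old male breast cancer patient) can be made concrete with a small check. What follows is a minimal Python sketch, not anything HPN or the commenters describe using: the field names (`age_band`, `sex`, `dx`), the toy records, and the function name `rare_combinations` are all hypothetical. It counts how many records share each observed combination of the chosen attributes and reports the combinations seen fewer than k times, which is the basic idea behind k-anonymity.

```python
from collections import Counter


def rare_combinations(records, quasi_identifiers, k):
    """Return the observed attribute combinations shared by fewer than k records.

    records: list of dicts, one per (hypothetical) patient.
    quasi_identifiers: names of the fields an outsider might plausibly know.
    k: minimum number of records that must share each combination.
    """
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return sorted(combo for combo, n in counts.items() if n < k)


# Toy data, invented purely for illustration.
records = [
    {"age_band": "50-59", "sex": "M", "dx": "breast cancer"},
    {"age_band": "50-59", "sex": "F", "dx": "breast cancer"},
    {"age_band": "50-59", "sex": "F", "dx": "breast cancer"},
    {"age_band": "30-39", "sex": "M", "dx": "asthma"},
    {"age_band": "30-39", "sex": "M", "dx": "asthma"},
]

# The single male breast cancer record is unique, so it is flagged at k=2.
rare = rare_combinations(records, ["age_band", "sex", "dx"], k=2)
print(rare)
```

Note that this only looks at combinations that actually occur in the data, and that with continuous predictors (the point made in comment 15) nearly every record would be flagged unless the values are first binned or coarsened.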