Pro Publica’s new Surgeon Scorecards

Skyler Johnson writes:

You should definitely weigh in on this…

Pro Publica created “Surgeon Scorecards” based upon risk-adjusted surgery complication rates. They used hierarchical modeling via the lmer function in R’s lme4 package.

For the detailed methodology, click the “how we calculated complications” link, then, at the top of that next page, click on the detailed methodology link to download a publication-quality pdf.

At least three doctors have raised objections:

The little mixed model that could, but shouldn’t be used to score surgical performance

An alternative presentation of the ProPublica Surgeon Scorecard

Why the Surgeon Scorecard is a journalistic low point for ProPublica

Curious as to your critique of Pro Publica’s methodology and results.

Next time this sort of thing is done, maybe they’ll use Stan. But that’s not really the point. The real point is that, yes, I probably should weigh in on this, but it would take a bit of work! This is not your run-of-the-mill “p less than .05” paper in PPNAS; it’s a serious project.

I quickly read through the online critiques and I saw some good points and some bad points. The bad points were some generic ranting against “shrinkage”; the blogger in question didn’t seem to realize that these issues arise in any prediction problem and represent inferential uncertainty that is the inevitable consequence of variation. Another blogger complained about wide uncertainty intervals but, again, that’s just life. The more important criticisms involved data quality, and that’s something I can’t really comment on, at least without reading the report in more detail.
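To make the modeling issue concrete: the scorecard model is a hierarchical (mixed-effects) regression, and the “shrinkage” at issue is the partial pooling of surgeon-level effects toward the overall mean. Here is a minimal sketch of that kind of model in R using lme4; the data frame and covariate names are hypothetical, and ProPublica’s actual specification (outcome definitions, adjusters, hospital handling) may well differ.

    # Hypothetical sketch: one row per surgery, complication = 1 if a
    # complication occurred. Covariates and column names are illustrative only.
    library(lme4)

    fit <- glmer(
      complication ~ age + sex + comorbidity_score +
        (1 | surgeon_id) + (1 | hospital_id),
      data   = surgeries,      # hypothetical claims-derived data frame
      family = binomial
    )

    # partially pooled ("shrunken") surgeon effects on the log-odds scale;
    # surgeons with few cases are pulled strongly toward zero
    surgeon_effects <- ranef(fit)$surgeon_id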

It’s too bad. Something dumb like himmicanes and hurricanes is easy to criticize, easy for me to post on. But an important topic like rating doctors, that would require a lot more work for me to say anything definitive.

I will say, though, that I like what Pro Publica is doing. No model is perfect, but I think this is the way to start: You fit a model, do the best you can, be open about your methods, then invite criticism. You can then take account of the criticisms, include more information, and do better.

So go for it, Pro Publica. Don’t stop now! Consider your published estimates as a first step in a process of continual quality improvement.

15 thoughts on “Pro Publica’s new Surgeon Scorecards”

  1. A little bit of my time is spent advising on projects like this in the UK (google NCAPOP if you want to know more). I basically contribute a statistical eye to the committee and look out for two things: (1) that a project is making some effort to innovate and improve in calculation and communication, and (2) that a project does not take its stats as gospel truth (these are performance indicators – all they can do is indicate). That second point is a really tough one to get across to an excitable media, a worried public and a flustered workforce, but it has to be done. As for the methods themselves, I think most projects of this nature could up the complexity a bit, but there’s a balance to be found between trying to account for every potential bias and uncertainty, and keeping it accessible with a quick turnaround.

  2. To me this is a very complicated issue. There already exists a rating system for surgeons, but it is hidden from the sight of the public. Nothing on earth is more driven by gossip than the typical healthcare community. As a result, every hospital has people who are generally considered better than others and people considered to be clunkers. Often these ratings are based on the same kind of criteria as determining the cool kids in high school. Objective measures would require strict verification: did the dataset really capture what it says it did? Hospitals’ reimbursement goes up when a patient’s diagnosis is more complex, and as a result there are software packages out there to maximize the diagnostic severity. This padding of the diagnosis is probably of variable accuracy, since accuracy is not the goal; reimbursement is. Further, some problems are technical and surgeon-caused, but other problems are driven by the nature of the disease process and are stochastic in occurrence. Data without analysis is just numbers. Unfortunately, there is no process to generate fair and dispassionate referees.

  3. There was another piece about this that I will try to track down. There was a major issue in that some doctors routinely take on high-risk surgeries or operate on classes of patients more likely to have increased complication or mortality rates for reasons outside the doctor’s control. Definitely quite complicated. I will see about finding that piece.

  4. Kind of a side topic, but:

    If you go to the 2nd critique (http://www.datasurg.net/2015/07/24/an-alternative-presentation-of-the-propublica-surgeon-scorecard/), you can see how they present the point estimate and confidence interval for each doctor’s complication rate. You’ll notice they use a ‘faded’ confidence interval that becomes more transparent as you reach the edges. That’s good! However, they kind of cancel that out by laying the confidence interval over a 3-tiered spectrum with definitive cutoffs for ‘Low’, ‘Medium’, and ‘High’ complication rates. I can’t help but have my eye drawn to a comparison of the point estimate and the nearest definitive cutoff for complication rate.
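    As a rough illustration of the fading idea (not how the datasurg post actually did it; the estimates, interval endpoints, and column names below are made up), one can draw each interval as many short segments whose transparency increases toward the endpoints:

        # Hypothetical point estimates and 95% interval endpoints for three surgeons
        library(ggplot2)

        d <- data.frame(surgeon = c("A", "B", "C"),
                        est = c(2.1, 3.4, 4.0),
                        lo  = c(1.0, 1.8, 2.2),
                        hi  = c(4.2, 6.1, 7.0))

        # slice each interval into segments whose alpha falls off toward the edges
        fade <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
          x <- seq(d$lo[i], d$hi[i], length.out = 100)
          data.frame(surgeon = d$surgeon[i],
                     x = head(x, -1), xend = tail(x, -1),
                     alpha = 1 - abs(seq(-1, 1, length.out = 99)))
        }))

        ggplot() +
          geom_segment(data = fade,
                       aes(x = x, xend = xend, y = surgeon, yend = surgeon, alpha = alpha),
                       linewidth = 2, colour = "steelblue") +
          geom_point(data = d, aes(x = est, y = surgeon)) +
          scale_alpha_identity() +
          labs(x = "Adjusted complication rate (%)", y = NULL)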

  5. > No model is perfect, but I think this is the way to start: You fit a model, do the best you can, be open about your methods, then invite criticism. You can then take account of the criticisms, include more information, and do better.

    +1

    My former boss used to say (insist), “Write an equation. Make a prediction.” The objective was to test your understanding of the world, identify misunderstandings, and correct them. Similarly, my Ph.D. advisor used to say, “When you don’t know anything, do what you know.” That wasn’t an invitation to do nothing; rather, when you don’t know how things actually work, take a model that you understand and apply it. Look for systematic deviations between your predictions (or the best fit) and the data available. Use those deviations to refine your model or to create a new one which more accurately captures how things work.

  6. I agree that this study is a step forward. I am surprised, however, that nobody is objecting to the fact that the dataset is not made available. You can search by state, hospital, surgeon, and type of surgery and see the results, but you cannot replicate their results or try to improve the methodology (except on a theoretical level). I am sure the response would be that they can’t release the data in order to protect patient privacy, but I find the claim without substance. It is the same as stamping every government document with the label “protected for national security.” At the least, I am troubled that nobody (thus far) has objected to the lack of transparency in the data. Transparency in methods is good, but without the data we are once again thrust into the position of having to trust them because they are smart.

    • I actually don’t think that highly of the study, but the claim that privacy arguments are “without substance” seems entirely inaccurate to me. Medicare claims are very detailed. Based on my skim, the data would (or should) include at least surgeon, date, age, and sex. It would be really easy to go from there to identify a person, along with comorbid conditions (heart failure, HIV, etc.), from their claim. Data anonymization for surgeon-level data is not a trivial problem at all. Furthermore, ProPublica may not care at all about privacy, but their DUA with CMS almost certainly includes very restrictive language as to what they can release. Basically, if they went ahead and deliberately released the data, they would face (a) never getting Medicare data again, (b) lots of fines, (c) many lawsuits, and (d) perhaps even some time in prison. CMS restricts usage of its data pretty severely. Perhaps they shouldn’t, but that is a CMS issue and not a ProPublica issue.

  7. I’m trained to look at information like the surgeon scorecard as a diagnostic test. So when I see it, I ask questions like, “Selecting two surgeons at random, if surgeon A has a better score than surgeon B, what is the probability that surgeon A is actually better for patients?”

    I totally understand the concerns about data quality. However, if we give the analysis the benefit of the doubt and assume that the data are fine and the model is correct, then the question looks very answerable to me. One could (a) select many random pairs of surgeons, (b) simulate new scores for each surgeon using all the model coefficients, and (c) calculate the proportion of pairs in which the better surgeon based on the simulated scores is the same as the better surgeon based on the published scores (a rough sketch follows below). This is like the accuracy of a diagnostic test.

    However, I thought I’d ask about this here given your (and your audience’s) better understanding of these statistical issues. Would the approach outlined above be OK? If it showed that the surgeon scorecard could provide helpful information to patients, it seems like it could go a long way to addressing other people’s statistical complaints (though not the concerns about the data). On the other hand, if it showed that the scorecard did not provide helpful information, then addressing the data quality concerns would not help any because the problems with the analysis would be more fundamental.
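    A rough sketch of that pairwise check, with made-up published estimates and standard errors standing in for the actual model output:

        # Hypothetical: for each surgeon, a published adjusted rate and a standard
        # error summarizing the model's uncertainty (all numbers invented).
        set.seed(123)
        n_surgeons <- 500
        published  <- rnorm(n_surgeons, mean = 3, sd = 0.5)   # published adjusted rates (%)
        se         <- runif(n_surgeons, 0.5, 1.5)             # uncertainty around each rate

        # (a) many random pairs of distinct surgeons
        n_pairs <- 10000
        i <- sample(n_surgeons, n_pairs, replace = TRUE)
        j <- sample(n_surgeons, n_pairs, replace = TRUE)
        keep <- i != j; i <- i[keep]; j <- j[keep]

        # (b) simulate a new score for each surgeon in the pair
        sim_i <- rnorm(length(i), published[i], se[i])
        sim_j <- rnorm(length(j), published[j], se[j])

        # (c) how often does the simulated ordering agree with the published ordering?
        mean((sim_i < sim_j) == (published[i] < published[j]))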

  8. I looked into getting the scorecard data, and went as far as to call their headquarters and talk to a very nice editorial assistant, who told me they were working out the details of releasing data. I haven’t heard back from them yet. My guess is that, as pds00197 suggests, CMS is severely limiting their ability to do so.

  9. I think it would be great if more people like you got involved in the debates about physician and hospital outcome profiling, particularly because of your comment saying, “I probably should weigh in on this, but it would take a bit of work!”

    That is the humility that seems lacking in the Pro Publica offering, which is one of many quick-and-dirty looks at large datasets that are invariably followed by really pretty irresponsible claims about the inferences one can make from the data.

    Medicare first released national mortality rate data on hospitals in the mid 1980s and there is, not surprisingly, a huge literature since then on the pros and cons of using claims data to try to look for what is often a needle in a haystack, namely the preventable deaths or complications within the very large amount of morbidity and mortality that goes with being really sick and needing healthcare.

    For anyone interested, a good, accessible, and quick overview of the long history of provider report cards and the analytic pitfalls in this type of work is a piece by Sharon-Lise Normand here (http://arxiv.org/pdf/0710.4622.pdf).

    But the data quality is a big problem. Claims data are designed and collected exclusively to support billing, not outcome evaluation. Consider one point with regard to looking at complications. It seems straightforward to profile complication rates. However, Medicare does not require you to list any diagnoses unless they are necessary to justify the level of payment that you are requesting. So basically the provider gets to decide what diagnoses to list, and once the payment is supported there isn’t much incentive to report a complication: none at all in the cases where there is now no payment for a complication, and certainly none if someone is going to plaster your name all over the internet for having lots of complications.

    Makes you think about using complication rates for anything. This is why there is exactly 0 correlation between Medicare rates of hospital-acquired pressure ulcers from claims data and surveys of hospitalized patients by external evaluation organizations who actually lift the sheets and look at every patient (to give just one example from a project that I was involved in [pmid:24126644]).

  10. Dear Dr Gelman,
    Thank you for mentioning my blog. The major justification for my substantiated (or at least so I think) rant against the use of shrinkage in this case is the elephant in the room that no one wants to talk about: the substantially limited information (average number of cases per surgeon < 50) for otherwise infrequent events (average complication rate < 5%). In such a situation, shrinkage will make everyone look like everyone else, limiting the ability to draw meaningful conclusions by looking at the values of the random effects. Hence even though mixed models are to be preferred as more data accumulate (a point I make clear in the three blog posts I wrote), no modeling could overcome this severe lack of events. In this particular case, a major reason that mixed models were used (rather than the more conventional approach of treating each surgeon as his or her own fixed effect) is the potentially large number of surgeons with zero complications in their limited observation records. Curiously this information is not provided, although I’d have thought it would be important to report this number for the sake of transparency. The use of shrinkage models allows one to report performance for these people and generate nice colorful graphs (with one and possibly two decimal points) about their performance instead of reporting a NaN. For these people, the actual information comes not from their events but from the number of surgeries they performed, which raises the question of inclusion models (a point you convincingly address in your book) and the associated biases inherent in comparing surgeons who accept vs. those who do not accept Medicare, and the associated socioeconomic determinants of health outcomes. There are other technical issues that one could "rant" about, e.g. how the hospital and surgeon effects were combined, but I will not digress any further.
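    A quick simulation in the spirit of this comment, with assumed numbers (about 50 cases per surgeon and a 5% baseline complication rate; not ProPublica’s data or model), illustrates how tightly the partially pooled estimates cluster compared with the raw rates:

        # Hypothetical simulation: sparse events leave little surgeon-level information,
        # so the shrunken estimates end up nearly indistinguishable.
        library(lme4)
        set.seed(2015)

        n_surgeons <- 200
        n_cases    <- 50                                      # ~50 surgeries per surgeon
        true_logit <- rnorm(n_surgeons, qlogis(0.05), 0.5)    # true surgeon-level log-odds

        surgeon_id <- rep(seq_len(n_surgeons), each = n_cases)
        dat <- data.frame(
          surgeon      = factor(surgeon_id),
          complication = rbinom(n_surgeons * n_cases, 1, plogis(true_logit[surgeon_id]))
        )

        fit <- glmer(complication ~ 1 + (1 | surgeon), data = dat, family = binomial)

        # raw per-surgeon rates span a wide range (including many exact zeros);
        # the partially pooled rates sit in a narrow band around the overall mean
        raw    <- tapply(dat$complication, dat$surgeon, mean)
        pooled <- plogis(fixef(fit)[1] + ranef(fit)$surgeon[, 1])
        range(raw)
        range(pooled)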

    • +1

      Overall I think that the argument that a given attempt is the “first step in a process of continual quality improvement” is often used to justify crap, no matter how tiny or, worse, how misleading that first step is.

      There really is such a thing as a tool that does more harm than good. If the scorecard amplifies noise and leaves the false impression that noise artifacts are real surgeon-to-surgeon differences, it leads its users astray.

    • Christos:

      Thanks for the comments. I hope that Pro Publica can respond to them. I think this sort of dialogue (for example, your comment that the information is not provided regarding the number of surgeons with zero complications) is the way to move forward. You rant, they reply, etc.!
