What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?

Keith Turner writes:

I am not sure if you caught the big story in the New York Times last week about UNC’s pediatric heart surgery program, but part of the story made me interested to know if you had thoughts:

Doctors were told that the [mortality] rate had improved in recent years, but the program still had one star. The physicians were not given copies or summaries of the statistics, and were cautioned that the information was considered confidential by the Society of Thoracic Surgeons. In fact, surgeons at other hospitals often share such data with cardiologists from competing institutions.

While UNC said in a statement that it was “potentially reckless” to use the data to drive decision-making about where to refer patients, doctors across the country said it was simply one factor, among several, that should be considered.

In October 2017, three babies with complex conditions died after undergoing heart surgery at UNC. In a morbidity and mortality conference the next month, one cardiologist suggested that UNC temporarily stop handling some complex cases, according to a person who was in the room. Dr. Kibbe, the surgery department chairwoman, said in a recent interview that the hospital had never restricted surgeries.

In December, another child died after undergoing surgery a few months earlier for a complex condition.

The four deaths were confirmed by The Times, but are not among those disclosed by UNC. It has declined to publicly release mortality data from July 2017 through June 2018, saying that because the hospital had only one surgeon during most of that period, releasing the data would violate “peer review” protections.

Other information released by UNC shows that the hospital’s cardiac surgery mortality rate from July 2013 through June 2017 was 4.7 percent, higher than those of most of the 82 hospitals that publicly report similar information. UNC says that the difference between its rate and other hospitals’ is not statistically significant, but would not provide information supporting that claim. The hospital said the numbers of specific procedures are too low for the statistics to be a meaningful evaluation of a single institution.

Seems like a lot of these data for UNC are not going to be easy to get one’s hands on. But I wonder if there’s a story to be told with some of the publicly available data from peer institutions? And even in the absence of quantitative data from UNC’s program, I think there are a lot of interesting questions here (besides the ethical ones about hospitals at public institutions withholding mortality data): What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?

My reply:

Good question. I think the answer has to be that there’s other information available. If the only data you have are the mortality rates, and you can choose any hospital, then you’d want to do an 8-schools-type analysis and then choose the hospital where surgery has the highest posterior probability of success. Statistical significance is irrelevant, as you have to decide anyway. But you can’t really choose any hospital, and other information must be available.
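
Here’s a minimal sketch of what that could look like in Python, with invented death counts and a beta-binomial stand-in for the usual normal-model 8-schools setup (numpy and scipy are the only assumptions):

```python
# Minimal empirical-Bayes sketch of 8-schools-style partial pooling.
# All counts are invented for illustration, not real hospital data.
import numpy as np
from scipy import stats

deaths = np.array([12, 7, 15, 9, 11])        # hypothetical deaths
ops = np.array([260, 240, 310, 200, 255])    # hypothetical operation counts
raw = deaths / ops

# Fit a Beta(a, b) hyperprior by matching the moments of the raw rates
# (a quick stand-in for a full hierarchical fit).
m, v = raw.mean(), raw.var()
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common

# Each hospital's posterior mortality rate is Beta(a + y, b + n - y):
# small, noisy hospitals get shrunk toward the overall mean.
post = stats.beta(a + deaths, b + ops - deaths)

# Monte Carlo probability that each hospital has the lowest true rate.
draws = post.rvs(size=(10_000, len(deaths)), random_state=0)
p_best = (draws.argmin(axis=1)[:, None] == np.arange(len(deaths))).mean(axis=0)
print(np.round(p_best, 3))
```

The partial pooling is doing the real work: a hospital with few operations gets shrunk toward the overall rate, so one unlucky year doesn’t dominate the decision.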

In this case, I think the important aspects of decision making are not coming from the parents; rather, where this information is particularly relevant is for the hospitals’ decisions of how to run their programs and allocate resources, and for funders to decide what sorts of operations to subsidize. I’d think that “quality control” is the appropriate conceptual framework here.

Tomorrow’s post: What happens when frauds are outed because of whistleblowing?

23 thoughts on “What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?”

  1. “If the only data you have are the mortality rates, and you can choose any hospital, then you’d want to do an 8-schools-type analysis and then choose the hospital where surgery has the highest posterior probability of success.”

    The decision is not so simple. This is the classic example usually given to explain Simpson’s “paradox”. Hospitals do not typically have identical case mixes, i.e. hospitals with better reputations or more sophisticated equipment tend to receive more difficult cases and can consequently have higher mortality rates. Even without access to information about reputation (which I guess you’re considering to be additional data here), I’d be suspicious of a hospital with very low mortality rates, thinking it was an indication that they do not get many complicated cases. If you do have information about reputation, it’s tricky to combine that information (which is certainly not reliable itself) with hard mortality rate data that is subject to confounding.
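
    Here’s the reversal in miniature, with invented numbers (nothing from the actual hospitals): the hypothetical Hospital A is better within every case type but looks worse overall because it takes the harder cases.

    ```python
    # Invented numbers illustrating case-mix confounding (Simpson's paradox):
    # Hospital A is better within every case type, yet looks worse overall
    # because it takes a larger share of the complex cases.
    cases = {
        #         case type: (deaths, operations)
        "A": {"simple": (1, 100), "complex": (9, 100)},
        "B": {"simple": (3, 200), "complex": (4, 30)},
    }
    for hosp, mix in cases.items():
        for kind, (d, n) in mix.items():
            print(hosp, kind, f"{d / n:.1%}")           # A wins in each stratum
        d_tot = sum(d for d, _ in mix.values())
        n_tot = sum(n for _, n in mix.values())
        print(hosp, "overall", f"{d_tot / n_tot:.1%}")  # ...but loses overall
    ```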

    Also, I thought the reporting in that Times article was awful. Saying that 3 babies with “complex heart conditions” died without giving any sense of the expected number of deaths in similar circumstances is worse than useless. 3 babies dying has a lot of emotional impact on the reader and completely obscures the relevant question about whether it was due to poor care.

    • +1

      “I thought the reporting in that Times article was awful.”

      Me too. The authors seem aware of the problems they face in showing that this surgery center is somehow different. But just pointing out that raw mortality data do not address risk factors is not enough.

      It is absolutely true that some hospitals avoid risky surgeries precisely to avoid this type of criticism. My sister-in-law needed a liver transplant to save her life. The doctor at Medi-Cal explained that her unwillingness to quit smoking and eating salt – when punched into their statistical mortality calculator – made her too high a risk. He actually told her that they have their surgical success rate to think about. This despite a 100% mortality guarantee without surgery. She considered her options, chose not to change her lifestyle, and died a few months later of liver failure. The statistical model did not kill her, that was her choice, but I guess I was a little surprised at how things played out.

      • Wow, I can’t believe the doctor admitted that the surgical success rate for their facility was a factor. Sounds like that should definitely be illegal… When my wife’s grandmother had a stroke, we heard that the rehab facility we were aiming for tends to reject cases that are unlikely to recover because it makes them look bad (not due to limited capacity). We were only able to get her in because we had a connection, and indeed she didn’t recover and hurt their stats. Some type of poetic justice?

      • Well, that doctor was an ass, but the legitimate concern isn’t the surgical success rate, it’s the person who doesn’t smoke and has normal blood pressure and a healthy diet who wouldn’t get the liver transplant because the organ went to a patient who smoked and had high BP or whatever.

        Still, your basic point is correct, and by publishing raw mortality rates we actually *harm* patients, because people game the system by letting patients die rather than operating on them. What is needed is a ratio of estimated mortality rate at a given hospital to the average of estimated mortality rates across all hospitals *for the specific surgery, or at least all surgeries of similar type and complexity* (a sketch of such a ratio is below).

        What happened in the last couple of cases is irrelevant except insofar as it has some small effect on our overall estimate of hospital success.
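
        Something like this, where the baseline rates, procedure names, and caseload are all invented for illustration:

        ```python
        # Hypothetical observed/expected (O/E) mortality ratio: expected deaths
        # come from procedure-specific baseline rates, so a hospital is judged
        # against the difficulty of its own case mix. All numbers are invented.
        baseline = {"ASD_repair": 0.01, "arterial_switch": 0.03, "norwood": 0.15}

        # Invented caseload: (procedure, died?) for each operation at one hospital.
        caseload = [("ASD_repair", 0), ("norwood", 1), ("norwood", 0),
                    ("arterial_switch", 0), ("ASD_repair", 0), ("norwood", 1)]

        observed = sum(died for _, died in caseload)
        expected = sum(baseline[proc] for proc, _ in caseload)
        print(f"O/E ratio: {observed / expected:.2f}")  # > 1 means worse than baseline
        ```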

        • “the legitimate concern isn’t the surgical success rate, it’s the person who doesn’t smoke and has normal blood pressure and a healthy diet who wouldn’t get the liver transplant because the organ went to a patient who smoked and had high BP or whatever.”

          +1

          This is another case where incentives interfere with ethics.

  2. “Pediatric Heart Surgery” includes a lot of different procedures with different degrees of difficulty and risk – “a complex condition” means nothing. Without more information there is no way to compare these hospitals.

    The US economic system relies on competition to drive improvement in products and services. Hospitals compete on the number and types of services they offer — often using regulatory capture to prevent competitors from competing — but not on any objective measure of the quality of those services.

    • +1.

      I had to make literally this decision a couple of decades ago. There’s complex, and then there’s complex. My two-month-old daughter required two open-heart surgeries separated by a seven-day ECMO run to address a number of issues including transposition, a stenotic valve, and a festive assortment of garden-variety septal defects. I’m pretty sure this would qualify as “complex,” yet her team was always optimistic about her chances (and she’s in college now, so good on them). However, at the time we met a couple with an infant with hypoplastic left heart syndrome, and the prognosis for that child was less optimistic from the start.

      So, key takeaway: Data, schmata. At least at that level, it looks to me like accountability theatre.

  3. There are some medical procedures that are relatively routine, and others that only get done a couple times a year, and everything in between.

    What’s actually needed to determine whether there’s a problem is a probability of success conditional *on the specific type of operation*.

    If a local hospital gets all the people who fall off ladders or are in low-speed auto accidents, and ships all the gunshot wounds, high-speed auto accidents, and industrial machinery injuries to a local trauma center, you can bet the trauma center has a worse probability of success than the local hospital. This isn’t evidence at all that the trauma center is doing a bad job… if you moved these complex cases to the local hospital you might expect an *even worse* outcome.

    It’s a complex issue because there’s patient confidentiality involved, but the ideal thing from an estimation perspective would be to release data on everything that happens at every hospital… only then could you do statistics that compare similar cases across the whole country and fit hierarchical models that link outcomes through the departments/people involved to come up with a meaningful estimate of expected surgical success rates. (A sketch of that kind of model is below.)
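
    A minimal sketch of that kind of hierarchical model, assuming the PyMC library and purely simulated placeholder data (nothing here is real):

    ```python
    # Sketch of a hierarchical logistic model: the outcome depends on the
    # procedure type, plus a partially pooled "skill" effect per hospital.
    # Assumes PyMC; data below are simulated placeholders, not real records.
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(1)
    n = 500
    hospital = rng.integers(0, 10, n)    # which of 10 hospitals
    procedure = rng.integers(0, 4, n)    # which of 4 procedure types
    died = rng.binomial(1, 0.05, n)      # placeholder outcomes

    with pm.Model():
        base = pm.Normal("base", mu=-3.0, sigma=1.0, shape=4)    # per-procedure risk
        sigma = pm.HalfNormal("sigma", 1.0)                      # spread of hospital effects
        hosp = pm.Normal("hosp", mu=0.0, sigma=sigma, shape=10)  # hospital "skill"
        p = pm.math.invlogit(base[procedure] + hosp[hospital])
        pm.Bernoulli("obs", p=p, observed=died)
        idata = pm.sample(1000, tune=1000)
    ```

    The hospital effects are the quantity of interest: comparing those, rather than raw rates, is what makes the comparison fair across different case mixes.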

    • You can’t just compare outcomes for similar patients in different hospitals. You need a mechanistic model of each patient’s heart and a model for how exact surgical practices differ across physicians in each hospital to simulate the individual surgeries.

      (Only regular comments readers will understand this joke)

      • I laughed…

        But you really do need to compare similar surgeries; you can’t just say “it was a heart surgery.” And no, you don’t need to model individual hearts and individual surgeons, but yes, you do want to infer a “skill” for each surgical group at least.

        I’m working on some educational data, and the problems are very similar. Different batches of kids go to different schools… if you just compare scores across schools, you’re just finding out that rich kids are on the east side of town and poor kids are on the west side…

  4. The difficulty is the issue of small numbers. If a given procedure is done once a month at two different institutions, then all kinds of variables and simple stochastic variation will make it impossible to discern real differences. Statistics can’t give answers without data. I served as an interested neutral party on an American College of Surgeons committee in my community about 25 years ago looking at our outcomes for esophageal and pancreatic surgery. The best answer is to go with a team that does a minimum of 25 cases per year. We had no legal enforcement power to require this, but the word got out, and procedures got concentrated into local centers through informal but effective means. The generally doleful effects of consolidation by hospitals/insurance systems may have been a positive in this. (A small simulation of the small-numbers problem is below.)
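
    For instance, a quick simulation (invented rates, one operation a month at each hospital) shows how little power there is to detect even a doubled mortality rate:

    ```python
    # Simulation of the small-numbers problem: at one operation a month,
    # even a doubled mortality rate is rarely statistically detectable.
    # Rates are invented; uses Fisher's exact test from scipy.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 12                    # operations per year at each hospital
    p_a, p_b = 0.05, 0.10     # true mortality rates; B is twice as bad

    sims, reject = 5000, 0
    for _ in range(sims):
        da = rng.binomial(n, p_a)
        db = rng.binomial(n, p_b)
        table = [[da, n - da], [db, n - db]]
        _, pval = stats.fisher_exact(table)
        if pval < 0.05:
            reject += 1
    print(f"power to detect a 2x difference: {reject / sims:.1%}")  # tiny
    ```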

  5. A few thoughts on this: 1) was surgery the best option in all these cases? 2) in addition to surgical mortality, a mortality rate accounting for cases too complex for surgery (where the patients die without one) would be useful. 3) good measurement of the mortality risk of incoming patients would be very useful as a covariate. 4) a qualitative look at individual cases would help for such small samples.

    • yyw said, “1) was surgery the best option in all these cases?”

      My guess is that in all these cases, letting the child die was the only other option.

      This quote from the NYTimes article points to what seems to be a big part of the problem:

      “The best outcomes for patients with complex heart problems correlate with hospitals that perform a high volume of surgeries — several hundred a year — studies show. But a proliferation of the surgery programs has made it difficult for many institutions, including UNC, to reach those numbers: The North Carolina hospital does about 100 to 150 a year. Lower numbers can leave surgeons and staff at some hospitals with insufficient experience and resources to achieve better results, researchers have found.”

  6. If Thoracic Aortic Dissection Repair, Septal Myectomy, and Esophagectomy were sports, ESPN would spend hours reviewing the stats and drug companies would be tripping over each other to sponsor the shows. As it is, if a statistician gets close to the hospital records room, alarms go off and security runs to the scene with shredders.

    • Since you brought up sports, I am reminded that Steve Pearce, a decent player with a lifetime .254 batting average, was the MVP of the 2018 World Series. He, Ted Kluszewski, and Babe Ruth are the only people with multiple World Series home runs past the age of 35. I point this out to show that application of statistics may not always comport with real life events. I didn’t predict an Illinois over Wisconsin victory, either.

    • “ESPN would spend hours reviewing the stats”

      Sabermetrics for surgery, that’s what we need! The surgeons could be graded on “Successes above Replacement Level (SAR).”

  7. Statistics are irrelevant when using these data to answer certain questions about the quality of care at UNC. If the question is, “Do outcomes tend to be unusually bad at UNC?” then yes, this is a question about a population of patients over time, of which the recent past is a sample and from which we want to make a statistical prediction about future samples. But if the question is, “Have outcomes been unusually bad at UNC in the recent past?” then this is a question about a population, not about a sample – nothing is being estimated, inferred, or predicted. I will grant you that a potential patient is most interested in the answer to the first question. But if the sample size is just too small for reliable predictions, I think most patients would want to know the answer to the second question and then consider that as one piece of evidence for where they should get treatment. When the statistics can’t overcome the noise in the data that we have, there’s nothing wrong with handing a patient an objective fact and then letting her come to her own subjective conclusion about the evidentiary weight it deserves.

  8. “What does a ‘statistically significant difference in mortality rates’ mean when you’re trying to decide where to send your kid for heart surgery?”

    What does a ‘Bayes factor greater than 5’ mean when you’re trying to decide where to send your kid for heart surgery?

    I’d be much more focused on staff expertise, quality control, quantity and availability of data, and also on practical significance (say at least from equivalence testing). It is absurd to think people make major decisions *only* from a p-value (or Bayes factor, or anything else) and nothing else.

    Justin

    • “It is absurd to think people make major decisions *only* from a p-value (or Bayes factor, or anything else) and nothing else.”

      The military faces the problem of “multiple target recognition and tracking,” wherein multiple objects (ships, missiles, aircraft) at sea and in the air are only imperfectly detected and distinguished by radar. Even “tracking” individual targets is surprisingly difficult in that scenario. On top of that, decisions about how to employ countermeasures have to be made far faster than a human could manage, so they had to be automated.

      I worked at a company that got very wealthy by solving the multiple-target-recognition problem with Bayesian model selection (basically Bayes factors), based on Pr(object_j = known_craft_of_type_i | radar signature, physics, craft_characteristics, …), and then doing a Bayesian decision analysis on countermeasures. Bayesian model selection works particularly well here because the universe of models (i.e., possible craft types) is well known. They were doing this starting back in the 1980s. (A toy sketch of the idea is below.)

      I’ve never heard of anyone making money from a project like this using p-values, though.
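
      For the curious, here’s a toy version of that kind of model selection, with an invented one-dimensional “signature” model standing in for real radar physics (craft types, priors, and numbers all made up):

      ```python
      # Toy sketch of Bayesian model selection over a known universe of craft
      # types. The signature model (1-D Gaussian) and all numbers are invented.
      from scipy import stats

      types = {"fighter": (2.0, 0.5), "cargo": (6.0, 1.0), "missile": (0.5, 0.2)}
      prior = {"fighter": 0.3, "cargo": 0.5, "missile": 0.2}

      x = 1.7  # one observed radar-signature value

      # Posterior model probabilities: p(type | x) is proportional to
      # p(x | type) * p(type), normalized over the known universe of types.
      like = {t: stats.norm(mu, sd).pdf(x) for t, (mu, sd) in types.items()}
      post = {t: like[t] * prior[t] for t in types}
      z = sum(post.values())
      for t, w in post.items():
          print(t, round(w / z, 3))
      ```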

  9. Just to complicate things further – since hospitals follow the reputational effects of such mortality data, one plausible strategy is to go to the hospital with a bad record. They are probably going to great lengths to improve their outcomes. My wife used to tell me that a restaurant that recently had an episode of infectious illness was probably the safest place to eat: the reaction to the bad publicity means they were working harder than others to change perceptions.
