Don’t trust people selling their methods: The importance of external validation. (Sepsis edition)

This one’s not about Gladwell; it’s about sepsis.

John Williams points to this article, “The Epic Sepsis Model Falls Short—The Importance of External Validation,” by Anand Habib, Anthony Lin, and Richard Grant, who report that a proprietary model used to predict sepsis in hospital patients doesn’t work very well.

That’s to be expected, I guess. But it’s worth the reminder, given all the prediction tools out there that people are selling.

8 thoughts on “Don’t trust people selling their methods: The importance of external validation. (Sepsis edition)”

  1. The article states
    “These more complex prediction tools may present insurmountable barriers to local external validation, which will place more responsibility on model developers either to publish model performance characteristics, the variables included, and the settings in which they were obtained or to provide code and data on request to enable others to verify the model transferability to the local setting.”

    So, it is a bit ironic that when I contacted the authors to see if I could get the data they used to test the Epic model for external validity, the response I received was

    “Here is our editorial.
    No data.”

    • Did you contact Wong et al.? It seems like this editorial is about the results from that Wong et al. study, so the authors of the editorial wouldn’t have any data, right?

      • You’re right – I hadn’t read it carefully enough to see that the editorial referred to the Wong attempt at validation. I’ll contact those authors and report back – until then, “never mind.”

        • Update on my data request to the authors: their reply was “ ” – that is, nothing.
          Seriously, I don’t expect them to provide the data – in fact, I’m not sure they should, given that it may involve a large number of diagnostic variables from electronic health records for tens of thousands of people. I’d settle for a response to my email.

          I have a modest proposal, given that the editorial suggests the need for validation (including code and/or data) for “these more complex prediction tools.” I don’t think there is a workable distinction between the more complex and less complex tools. I suspect they mean the more complex tools are the machine learning methods and the less complex ones are classical techniques such as logistic regression; perhaps they would lump them together. In any case, as data sets get large (and particularly wider), all prediction models become complex and difficult to interpret. Logistic regression models aren’t really easier to interpret than neural networks – the latter only appear to be more of a black box, and there are measures of variable importance that make these models more similar than different (except that the former provide standard errors and associated inferential tools that the latter generally lack); see the sketch at the end of this comment. I think almost all prediction models will require validation, and the editorial about the sepsis paper should not single out any particular techniques.

          My modest proposal is that journals begin rating papers (either in categories or with a numerical score) regarding reproducibility. This would include factors on documented work flow, data availability, preregistration, code availability, etc. Rather than a qualitative judgement like accepting or rejecting a manuscript, this rating should be mostly objective, documenting the degree to which papers are consistent with open data policies. I think this would be much better than the data provision policies at many journals, which advocate open data but then permit the data sharing statement to simply state that the data is proprietary. Accepted papers would count in people’s careers, but they would count more if the open data rating is better. More importantly, a low rating on open data characteristics would alert observers and the media to the fact that the paper cannot easily be validated. I think an organization like the ASA could help promote this by listing journals that agree to such a policy with a sort of seal of approval.

          What do you think?
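
          To make the variable-importance point concrete, here is a minimal sketch in Python with scikit-learn, on simulated data – nothing below comes from the Epic model or the Wong et al. study, and the feature counts and model sizes are invented for illustration. The point is only that permutation importance treats a logistic regression and a small neural network identically, so neither is more of a “black box” from that standpoint.

          ```python
          # Minimal sketch (simulated data): permutation importance computed the same
          # way for a logistic regression and a small neural network.
          import numpy as np
          from sklearn.datasets import make_classification
          from sklearn.inspection import permutation_importance
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import train_test_split
          from sklearn.neural_network import MLPClassifier
          from sklearn.pipeline import make_pipeline
          from sklearn.preprocessing import StandardScaler

          # Simulated "patients": 20 candidate predictors, 5 of them actually informative.
          X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                     random_state=0)
          X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

          models = {
              "logistic regression": make_pipeline(StandardScaler(),
                                                   LogisticRegression(max_iter=1000)),
              "neural network": make_pipeline(StandardScaler(),
                                              MLPClassifier(hidden_layer_sizes=(32,),
                                                            max_iter=2000, random_state=0)),
          }

          for name, model in models.items():
              model.fit(X_train, y_train)
              imp = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                           n_repeats=20, random_state=0)
              top = np.argsort(imp.importances_mean)[::-1][:5]
              print(f"{name}: top 5 features by permutation importance: {top}")
          ```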

  2. Epic does not seem like a nice company in general. If I recall correctly, this was the company that for decades deliberately preyed on CS undergrads, grossly underpaying them while trapping them in the middle of nowhere and forcing them to spend their early careers learning the especially unpleasant minutiae of a wildly obsolete, obscure language until they realized how they were being scammed and how the wasted time was crippling the rest of their careers; then they lobbied for EHR systems like theirs to be made mandatory even though everyone knew it was a hot mess, and no one then (or now) seems to have a good thing to say about the experience of using their systems for such minor things as medicine; then they threatened to sue Tufte when he was going to write about their dangerously bad UX/UIs… (And this is just what I recall reading in passing, having no interest in them or EHRs.)

  3. The article and Epic are approaching this issue from completely the wrong angle. External validation is the wrong approach for these sorts of models. You should be training / fine-tuning sepsis models at each hospital.

    You will almost always get better performance by taking advantage of the data at your site (and much better statistical guarantees as well).

    The idea of having one universal model that is deployed everywhere is the wrong way to view this space.

    • That isn’t so clear to me. I would want to know how stable the model is across sites – if it varies greatly, then you may be correct and each hospital has its own factors that affect sepsis. But to the extent that some factors have similar effects across hospitals, and others do not, there is much to be learned from both of those facts. I don’t know much at all about sepsis, but I would imagine that many factors concern the patient’s health and demographic characteristics, and I would expect a better model using multiple sites for those factors. Factors that are hospital-specific are also important and may indicate things a hospital can, and should, attempt to change. But if models are developed for each site in isolation, don’t we lose the ability to identify these?

      • If only there were some way to share data across multiple contexts that can transition gracefully from the general to the specific as more information comes in. Like, the data gets only partially pooled together, maybe using Bayes’ rule somehow? Some kind of model that’s aware of hierarchical structures…
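
        For anyone who wants that spelled out: below is a minimal sketch of the partial-pooling idea in Python with PyMC (my choice of tool here, not anything the commenters specified), on simulated data. Each hospital gets its own baseline sepsis risk, partially pooled toward a shared mean, while the patient-level effect is shared across sites – so it sits between “one universal model” and “a separate model per hospital.” The hospital counts and effect sizes are invented for illustration.

        ```python
        # Minimal sketch (simulated data): varying-intercept logistic regression,
        # i.e., partial pooling of hospital baselines with a shared patient-level effect.
        import numpy as np
        import pymc as pm

        rng = np.random.default_rng(0)
        n_hospitals, n_per_hospital = 8, 300
        hospital = np.repeat(np.arange(n_hospitals), n_per_hospital)
        x = rng.normal(size=hospital.size)                    # one patient-level predictor
        true_baselines = rng.normal(-2.0, 0.7, n_hospitals)   # hospitals differ in baseline risk
        p_true = 1 / (1 + np.exp(-(true_baselines[hospital] + 1.2 * x)))
        y = rng.binomial(1, p_true)                           # simulated sepsis outcomes

        with pm.Model():
            mu_a = pm.Normal("mu_a", 0.0, 2.0)        # shared mean of hospital intercepts
            sigma_a = pm.HalfNormal("sigma_a", 1.0)   # how much hospitals differ
            a = pm.Normal("a", mu_a, sigma_a, shape=n_hospitals)  # partially pooled intercepts
            b = pm.Normal("b", 0.0, 1.0)              # patient-level effect, shared across sites
            pm.Bernoulli("y", logit_p=a[hospital] + b * x, observed=y)
            idata = pm.sample(1000, tune=1000, target_accept=0.9, random_seed=0)

        # sigma_a says how much hospitals actually vary once the shared structure is
        # accounted for – the stability-across-sites question raised in the reply above.
        print(float(idata.posterior["sigma_a"].mean()))
        ```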
