Statistical methods for healthcare regulation: rating, screening and surveillance

Here is my discussion of a recent article by David Spiegelhalter, Christopher Sherlaw-Johnson, Martin Bardsley, Ian Blunt, Christopher Wood, and Olivia Grigg, which is scheduled to appear in the Journal of the Royal Statistical Society:

I applaud the authors’ use of a mix of statistical methods to attack an important real-world problem. Policymakers need results right away, and I admire the authors’ ability and willingness to combine several different modeling and significance-testing ideas for the purposes of rating and surveillance.

That said, I am uncomfortable with the statistical ideas here, for three reasons. First, I feel that the proposed methods, centered as they are around data manipulation and corrections for uncertainty, have serious defects compared to a more model-based approach. My problem with methods based on p-values and z-scores (however they happen to be adjusted) is that they draw discussion toward error rates, sequential analysis, and other technical statistical concepts. In contrast, a model-based approach draws discussion toward the model and, from there, toward the process being modeled. I understand the appeal of p-value adjustments (lots of quantitatively trained people know about p-values), but I’d much rather draw the statistics toward the data than the other way around. Once you have to bring out the funnel plot, that is to me a sign of (partial) failure: you’re talking about properties of a statistical summary rather than about the underlying process that generates the observed data.
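
For readers who have not seen one, here is a minimal sketch of the kind of funnel plot I mean, using entirely simulated data and invented hospital caseloads: observed event rates plotted against volume, with approximate 95% and 99.8% normal-theory control limits around an assumed overall rate. A real analysis would use exact binomial limits and overdispersion corrections; this is only the shape of the thing.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data: hypothetical hospitals with varying caseloads,
# all drawn from the same underlying event rate.
rng = np.random.default_rng(0)
n = rng.integers(50, 2000, size=80)      # cases per hospital (invented)
p0 = 0.07                                # assumed overall event rate
deaths = rng.binomial(n, p0)             # observed events under the null
rate = deaths / n

# The "funnel": normal-approximation control limits for a proportion,
# which narrow as 1/sqrt(n).
n_grid = np.linspace(n.min(), n.max(), 200)
se = np.sqrt(p0 * (1 - p0) / n_grid)
plt.scatter(n, rate, s=10)
for z in (1.96, 3.09):                   # roughly 95% and 99.8% limits
    plt.plot(n_grid, p0 + z * se, "k--")
    plt.plot(n_grid, p0 - z * se, "k--")
plt.axhline(p0, color="k")
plt.xlabel("Number of cases")
plt.ylabel("Observed event rate")
plt.title("Funnel plot (simulated data)")
plt.show()
```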

My second difficulty is closely related: to me, the mapping from statistical significance to the ultimate healthcare and financial goals seems tenuous. I’d prefer a more direct decision-theoretic approach that focuses on practical significance.
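
As a toy illustration of what I mean by a decision-theoretic approach, the sketch below compares the expected loss of inspecting a hospital against the expected loss of doing nothing, given posterior draws for its excess event rate. The posterior here is a simulated stand-in and both cost figures are invented; the point is only that the decision falls out of losses, not out of a significance threshold.

```python
import numpy as np

# Stand-in for posterior draws of a hospital's excess event rate
# (in a real analysis these would come from a fitted model).
rng = np.random.default_rng(1)
excess = rng.normal(loc=0.01, scale=0.008, size=10_000)

COST_INSPECTION = 1.0    # invented: fixed cost of investigating a hospital
COST_PER_EXCESS = 300.0  # invented: cost per unit of unaddressed excess rate

# Expected loss of each action, averaging over posterior uncertainty.
loss_inspect = COST_INSPECTION
loss_ignore = COST_PER_EXCESS * np.mean(np.maximum(excess, 0))

print(f"expected loss if we inspect: {loss_inspect:.2f}")
print(f"expected loss if we ignore:  {loss_ignore:.2f}")
print("decision:", "inspect" if loss_inspect < loss_ignore else "ignore")
```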

That said, the authors of the article under discussion are doing the work and I’m not. I’m sure they have good reasons for using what I consider to be inferior methods, and I believe that one of the points of this discussion is to give them a chance to explain those reasons.

Finally, I am glad that these methods result in ratings rather than rankings. As has been discussed by Louis (1984), Lockwood et al. (2002), and others, two huge problems arise when constructing ranks from noisy data. First, with unbalanced data (for example, different sample sizes in different hospitals) there is no way to simultaneously get reasonable point estimates of parameters and their rankings. Second, ranks are notoriously noisy. Even with moderately large samples, estimated ranks are unstable and can be misleading, violating well-known principles of quality control by encouraging decision makers to chase noise rather than understand and reduce variation (Deming, 2000). Thus, although I am unhappy with the components of the methods being used here, I like some aspects of the output.
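
To see just how noisy ranks are, here is a small simulation sketch (all numbers invented): twenty hypothetical hospitals with modestly different true event rates and badly unbalanced caseloads, re-ranked across 1,000 replications. Even the truly best hospital’s estimated rank bounces around.

```python
import numpy as np

# Hospitals with modestly different true event rates and very
# different caseloads; how stable are their estimated ranks?
rng = np.random.default_rng(2)
J = 20
true_rate = np.linspace(0.05, 0.08, J)   # hospital 0 is truly best
n = rng.integers(50, 2000, size=J)       # unbalanced sample sizes

ranks = np.empty((1000, J), dtype=int)
for r in range(1000):
    obs = rng.binomial(n, true_rate) / n   # observed rates, one replication
    ranks[r] = obs.argsort().argsort()     # rank 0 = lowest observed rate

# The hospital with the lowest true rate is often not ranked lowest:
print("P(truly best hospital estimated as best):",
      np.mean(ranks[:, 0] == 0))
print("5th/50th/95th percentiles of its estimated rank:",
      np.percentile(ranks[:, 0], [5, 50, 95]))
```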

References

Deming, W. E. (2000). Out of the Crisis. Cambridge, Mass.: MIT Press.

Louis, T. A. (1984). Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association, 79: 393-398.

Lockwood, J. R., Louis, T. A., and McCaffrey, D. F. (2002). Uncertainty in rank estimation: implications for value-added modeling accountability systems. Journal of Educational and Behavioral Statistics, 27: 255-270.

1 thought on “Statistical methods for healthcare regulation: rating, screening and surveillance”

  1. The medical research community does have an unhealthy fixation on statistical significance and p-values.

    But let me play devil's advocate here. Maybe part of the reason for that is the need for algorithmic or policy decisions to come out of the research, and these often have a very binary flavor to them. As much as I wish medical practitioners would be more open to the complexities of trends and effect sizes, the reality is that they don't have the time or resources to do so when they are expected to be familiar with thousands of diseases and conditions. The questions often must be reduced to things like: does this drug work? Are these disease conditions different? Is it beneficial to take supplement X?

    What statistical tools would you suggest for reducing results into these binary statements without relying on statistical significance? Type S errors? (There's the additional problem that if I refer to a Type S error in a medical paper, most journal editors would probably tell me I'm submitting to the wrong journal.)
