David Spiegelhalter wants a checklist for quality control of statistical models?

David Spiegelhalter writes in with a quick question:

Although I don’t do any technical stuff now, I find myself arguing for using quantified expert judgement in assessing a distribution for the size of systematic biases in estimates from lower-quality data-sources, particularly for official stats such as migration estimates, but also in other areas.

We have promoted using expert judgement in trying to ‘de-bias’ observational studies in meta-analysis, which has had a good lot of citations but has not really caught on. This is essentially the same as assessing proper priors, but since Bayes’ theorem is not used, we can avoid the B-word.

Sander Greenland recently did a review of the whole area of quantifying biases in epidemiology, and questioned why it had not become established.

My experience is that audiences express scepticism about quantified judgement, wondering about the quality control. When I was discussing this at StanCon, I wondered if checklists for quality-control of priors had been established, and whether there was any chance of these becoming standardised and more ‘official’ (like CONSORT, STROBE etc). Things like…

What is the prior?
Whose responsibility is it?
When was it assessed?
What sources were used?
What is a reasonable range for sensitivity analysis?
What is its impact on conclusions?
Does the prior-predictive distribution look reasonable?
etc

Maybe this has all been done, in which case I would like to promote such a checklist.

I don’t know of any such checklist. My only comment is that I would change “prior” to “model” because I’m concerned about model assumptions in general. Indeed, the model for data and measurement is typically much more important than the prior distribution for model parameters.
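
To make the prior-predictive item on the list concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made up for illustration and not taken from the exchange above: the model (a binomial count whose probability has a normal prior on the logit scale), the two prior scales being compared, and the helper names `prior_predictive_props` and `share_extreme`. The idea is only to simulate data from the prior and the data model together and ask whether the simulated data look plausible.

```python
import math
import random

def prior_predictive_props(prior_sd, n_trials=50, n_sims=2000, seed=1):
    """Simulate y/n from: logit(p) ~ Normal(0, prior_sd), y ~ Binomial(n_trials, p)."""
    rng = random.Random(seed)
    props = []
    for _ in range(n_sims):
        logit_p = rng.gauss(0.0, prior_sd)            # draw a parameter from the prior
        p = 1.0 / (1.0 + math.exp(-logit_p))          # map it to the probability scale
        y = sum(rng.random() < p for _ in range(n_trials))  # draw data given that parameter
        props.append(y / n_trials)
    return props

def share_extreme(props, lo=0.05, hi=0.95):
    """Fraction of simulated proportions that fall below lo or above hi."""
    return sum(p < lo or p > hi for p in props) / len(props)

# Hypothetical comparison: a very wide prior versus a moderate one on the logit scale.
wide = share_extreme(prior_predictive_props(prior_sd=10.0))
narrow = share_extreme(prior_predictive_props(prior_sd=1.5))
print(f"share of extreme simulated proportions: sd=10 -> {wide:.2f}, sd=1.5 -> {narrow:.2f}")
```

In this toy setup the supposedly vague prior puts most of its simulated datasets at nearly all failures or nearly all successes, which is the kind of thing a prior-predictive check is meant to surface. Whether that is unreasonable depends on the application, and the same simulation checks the data model as much as the prior.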

14 thoughts on “David Spiegelhalter wants a checklist for quality control of statistical models?”

  1. I don’t mean this to knock your suggestion, which I think is good. But what you show isn’t a checklist. They are things that should have been done. A checklist affirms that the required steps have been done, or if they can be done on the spot, the steps are accomplished right then. Answers are Yes/No, or checked/unchecked, etc.

    For example:

    – Prior established
    – Responsibility for prior assigned
    – Prior assessed

    If the checklist is too much of a burden, it won’t get done. For private pilots, a pre-landing checklist is often running through the acronym GUMPS. It’s used to kick off the actions on the checklist:

    – Gas (fuel) checked
    – Undercarriage (i.e., landing gear) down
    – Manifold pressure adjusted
    – Propeller pitch flattened
    – Seatbelts secured (and sometimes also Switches set)

    So the checklist doesn’t want to know what the prior actually is, it just wants to know that it’s been handled.
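
The yes/no view of a checklist described in this comment can be sketched as a small data structure. This is only an illustration under assumptions: the `Checklist` class and its method names are hypothetical, and the three items are the ones listed in the comment. The point it shows is that the structure stores whether each step was handled, not what the prior is.

```python
from dataclasses import dataclass, field

@dataclass
class Checklist:
    """A checklist in the yes/no sense: it records that a step was done, not its content."""
    items: list
    done: set = field(default_factory=set)

    def check(self, item):
        # Only items that are on the list can be ticked off.
        if item not in self.items:
            raise ValueError(f"not on the checklist: {item!r}")
        self.done.add(item)

    def outstanding(self):
        # Items still unchecked, in their original order.
        return [item for item in self.items if item not in self.done]

    def complete(self):
        # True only when every item has been ticked off.
        return not self.outstanding()

# Hypothetical usage with the three example items from the comment.
prior_checks = Checklist(["Prior established",
                          "Responsibility for prior assigned",
                          "Prior assessed"])
prior_checks.check("Prior established")
prior_checks.check("Prior assessed")
print(prior_checks.outstanding())
print(prior_checks.complete())
```

Keeping the items as short affirmations is what keeps the burden low; the detailed answers (what the prior is, what sources were used) would live in the analysis report, not in the checklist itself.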

  2. OK, I have been a big fan of these concerns for a long time (Two cheers for Bayes. 1996) but I’ll step into it and point out three things.

    1. Multiple Bias Analysis is difficult, time-consuming work that requires extensive expertise and a search for sources of information about the biases and their direction/size. However, it’s highly likely to be criticized and dismissed. Not a good way to seek (quick) academic rewards.

    2. Careful and critical reflection on prior assumptions detracts from being able to pull the Bayesian crank and claim one has answered questions and done something important.

    3. The Prior: Fully comprehended last, put first, checked the least? https://statmodeling.stat.columbia.edu/2017/01/11/the-prior-fully-comprehended-last-put-first-checked-the-least/

    • Maybe if the work can be translated to appeal to the public at large, we might make further gains. The abysmal writing can so confuse ANYONE. Sheesh. I recognize though that it is a great boon to have extensive expertise, in many areas. But I think it is also a matter of attitude and temperament since some experts are quite elitist about their fields. I think those are biases that are off-putting.

    • + 1. Especially to #2. In many areas, reviewers struggle with non-classical stats in general (anything outside whatever version of NHST they have picked up). Reviewers with the depth of knowledge required to evaluate a full Bayesian workflow, with elaboration of things like prior predictive checks, are going to be hard to find indeed! Even those who have used Bayesian methods may not have spent much time around this corner of the statistics world, and even if they have, the time commitment on their part would be enormous. So the incentive in applied research – much of academia – will be on *results*, not so much how you got there.
      I don’t see much hope for structural change outside of: 1) making science mostly amateur, or 2) somehow uncoupling “results” from career advancement in the professional sciences. The rethink on #2 is not impossible, but would be really really hard to thread the needle just right. There’s just no incentive for the big players in the system right now to undertake such a task. Maybe things will change post-Covid…

      • Chris said,
        “Reviewers with the depth of knowledge required to evaluate a full Bayesian workflow, with elaboration of things like prior predictive checks, are going to be hard to find indeed! Even those who have used Bayesian methods may not have spent much time around this corner of the statistics world, and even if they have, the time commitment on their part would be enormous. So the incentive in applied research – much of academia – will be on *results*, not so much how you got there.”

        So sad. Coming to this as a mathematician, it is especially sad, because reviewing (we call it refereeing) in math consists largely of scrutinizing “how they got there” rather than just that a “result” is claimed. In pure math, there is vocabulary that helps make the distinction between “results” and “how you get there”. For example, we call something a “conjecture” unless there is a solid proof that has been gone over with a fine-toothed comb by several people. And a common way to consider a conjecture is to go back and forth between trying to prove it, and trying to disprove it by finding a counterexample. I think even the terminology of “referee” rather than “review” helps: part of a referee’s job is to point out a mistake (or violation of the rules). In contrast, a “reviewer” just summarizes.

  3. I think that Michael Betancourt’s work on Bayesian workflow is very important in this respect.

    We tried to learn from him by writing a paper with him, applying his methods to a psycholinguistic application:

    @article{SchadEtAlWorkflow,
    Author = {Daniel J. Schad and Michael Betancourt and Shravan Vasishth},
    Title = {Towards a principled {B}ayesian workflow: {A} tutorial for cognitive science},
    Year = {2020},
    pdf = {https://arxiv.org/abs/1904.12765},
    code = {https://osf.io/b2vx9/},
    journal = {Psychological Methods},
    note = {In Press}
    }

    In my MSc thesis in statistics, I did bias modeling à la Spiegelhalter, but I went fully Bayesian, unlike the original Rebecca Turner, Spiegelhalter et al. paper, which I thought was really amazing:

    @article{turner2009bias,
    title={Bias modelling in evidence synthesis},
    author={Turner, Rebecca M and Spiegelhalter, David J and Smith, Gordon CS and Thompson, Simon G},
    journal={Journal of the Royal Statistical Society: Series A (Statistics in Society)},
    volume={172},
    number={1},
    pages={21–47},
    year={2009},
    publisher={Wiley Online Library}
    }

    Bias modeling was really hard work, as Keith says. I was the only expert available, but obviously I am not the best person to be evaluating the biases in an experiment I had a stake in. I wanted to recruit psycholinguists who could serve as unbiased experts, so I could use their priors for bias adjustment, but I could convince nobody to put in the time. It’s a huge amount of effort: you have to think hard about bias, which is not something we are used to doing in psycholinguistics. My master’s thesis is here, in case someone wants to pick up on this important topic and has the Sitzfleisch to actually do it:

    https://www.ling.uni-potsdam.de/~vasishth/pdfs/MScDissertationVasishth.pdf

  4. I think that a checklist of necessary things addressed, provided with any data analysis, is a great idea. Then one can be sure that things like the distribution of the outcome variable were assessed**. Or, that at least someone was willing to say they were. This has nothing to do with Bayes, or bias, and everything to do with making sure people don’t skip important steps.

    ** Assessed could be that it was examined or that it was known based on properties of the outcome variable, e.g. binomial given n and p, or some combination, or really just making sure people look at the damn data instead of clicking on a few buttons in a regression dialog.
