
Challenges to the Reproducibility of Machine Learning Models in Health Care; also a brief discussion about not overrating randomized clinical trials

Mark Tuttle pointed me to this article by Andrew Beam, Arjun Manrai, and Marzyeh Ghassemi, Challenges to the Reproducibility of Machine Learning Models in Health Care, which appeared in the Journal of the American Medical Association. Beam et al. write:

Reproducibility has been an important and intensely debated topic in science and medicine for the past few decades. . . . Against this backdrop, high-capacity machine learning models are beginning to demonstrate early successes in clinical applications . . . This new class of clinical prediction tools presents unique challenges and obstacles to reproducibility, which must be carefully considered to ensure that these techniques are valid and deployed safely and effectively.

Reproducibility is a minimal prerequisite for the creation of new knowledge and scientific progress, but defining precisely what it means for a scientific study to be “reproducible” is complex and has been the subject of considerable effort by both individual researchers and organizations like the National Academies of Science, Engineering, and Medicine. . . .

Replication is especially important for studies that use observational data (which is almost always the case for machine learning studies) because these data are often biased, and models could operationalize this bias if not replicated. The challenges of reproducing a machine learning model trained by another research team can be difficult, perhaps even prohibitively so, even with unfettered access to raw data and code. . . .

Machine learning models have an enormous number of parameters that must be either learned using data or set manually by the analyst. In some instances, simple documentation of the exact configuration (which may involve millions of parameters) is difficult, as many decisions are made “silently” through default parameters that a given software library has preselected. These defaults may differ between libraries and may even differ from version to version of the same library. . . .

Even if these concerns are addressed, the cost to reproduce a state-of-the-art deep learning model from the beginning can be immense. For example, in natural language processing a deep learning model known as the “transformer” has led to a revolution in capabilities across a wide range of tasks, including automatic question answering, machine translation, and algorithms that can write complex and nuanced pieces of descriptive text. Perhaps unsurprisingly, transformers require a staggering amount of data and computational power and can have in excess of 1 billion trainable parameters. . . . A recent study estimated that the cost to reproduce 1 of these models ranged from approximately $1 million to $3.2 million using publicly available cloud computing resources. Thus, simply reproducing this model would require the equivalent of approximately 3 R01 grants from the National Institutes of Health and would rival the cost of some large randomized clinical trials. . . .

I sent this to Bob Carpenter, who’s been thinking a lot about the replication crisis in machine learning, and who knows all about natural language processing too.

Here was Bob’s reaction:

None of what they list is unique to ML: lots of algorithm parameters with default settings, randomness in the algorithms, differing results between library versions. They didn’t mention different results from floating point due to hardware or software settings or compilers. About random seeds, they say, “One study found that changing this single, apparently innocuous number could inflate the estimated model performance by as much as 2-fold relative to what a different set of random seeds would yield.” Variation from different seeds isn’t innocuous, it’s fundamental, and it should be required in reporting results. There’s nothing different about deep belief nets in this regard compared to, say, MCMC or even k-means clustering via EM (algorithms that have been around since the 1950s and 1970s). All too often, multiple runs are done, the best one is reported, and the variance is ignored.

The costs do seem larger in ML. One to three megabucks to reproduce the NLP transformers is indeed daunting. I also liked how they used (U.S. National Institutes of Health) R01 grants as the scale instead of megabucks.

Can’t wait to see the blog responses after six months of cave aging.

We’ll see if we get any comments at all. The post doesn’t involve racism, p-values, or regression discontinuity, so it might not interest our readers so much!
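
To make Bob’s point about seeds concrete, here is a minimal sketch of reporting the spread across seeds rather than the best run. Everything in it is an assumption made for illustration: the synthetic data, the toy threshold “model”, and the function names `make_data`, `run_once`, and `summarize` are hypothetical and do not come from the Beam et al. article.

```python
import random
import statistics


def make_data(n=120):
    """Fixed synthetic dataset: label is 1 when x > 0.5, with 20% label noise."""
    rng = random.Random(12345)  # the data do not change between runs
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < 0.2) for x in xs]
    return list(zip(xs, ys))


def accuracy(threshold, rows):
    """Share of rows where the rule 'predict 1 if x > threshold' is right."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)


def run_once(data, seed):
    """One toy 'training run'; the seed only controls the train/test split."""
    rng = random.Random(seed)
    rows = data[:]
    rng.shuffle(rows)
    train, test = rows[:90], rows[90:]
    # "Training": pick the threshold with the best training accuracy.
    candidates = [i / 20 for i in range(1, 20)]
    best = max(candidates, key=lambda t: accuracy(t, train))
    return accuracy(best, test)


def summarize(scores):
    """Report the whole distribution over seeds, not just the maximum."""
    return {
        "n_runs": len(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


data = make_data()
scores = [run_once(data, seed) for seed in range(20)]
summary = summarize(scores)
print(summary)
```

Reporting only `summary["max"]` would be the “best run” practice Bob describes; the mean, standard deviation, and range give a reader some idea of how much of the headline number is seed luck.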

The Beam et al. article concludes:

Determining if machine learning improves patient outcomes remains the most important test, and currently there is scant evidence of downstream benefit. For this, there is likely no substitute for randomized clinical trials. In the meantime, as machine learning begins to influence more health care decisions, ensuring that the foundation on which these tools are built is sound becomes increasingly pressing. In a lesson that is continuously learned, machine learning does not absolve researchers from traditional statistical and reproducibility considerations but simply casts modern light on these historical challenges. At a minimum, a machine learning model should be reproduced, and ideally replicated, before it is deployed in a clinical setting.

I pretty much agree. And, as Bob notes, the above is not empty Mom-and-apple-pie advice, as in the real world we often do see people running iterative algorithms (whether machine learning or Bayesian or whatever) without checks and validation. So, yeah.

But one place I will push back on is their claim that, to get evidence of downstream benefit, “there is likely no substitute for randomized clinical trials.” Randomized clinical trials are great for what they are, but they are limited in realism, duration, and sample size. There are other measures of downstream benefit that can be used. Other measures are imperfect, but then again, inferences from randomized experiments are imperfect too.


  1. Kenneth Tay says:

    “None of what they list is unique to ML”

    Neither is it limited to algorithms right? I mean, in place of a prediction model put a human who is asked to predict (as doctors and nurses are asked to do in their work).

    So it seems like we are requiring algorithms/ML to pass a higher bar re reproducibility. Not that I disagree with it… It’s just that these extra costs associated with reproducibility in ML don’t seem to be paid for human predictors.

  2. jim says:

    “At a minimum, a machine learning model should be reproduced, and ideally replicated, before it is deployed in a clinical setting.”

    “ideally”?? :) This is one reason I fear the battle for science has already been lost. It is not “ideal” to replicate results. It’s fundamental. No one should be acting on any research results that haven’t been replicated several times, or – ideally – many times.

    I don’t know why machine learning should be any different than any other approach regarding data: garbage in, garbage out. If the data are crap then the results won’t reproduce or replicate – that’s before you start worrying about variation from programming and modelling parameters like random seeds.

    If people are expecting ML to eliminate the inconvenience of getting quality data and replicating, they’re going to be confused for a long time.

  3. Giles Hooker says:

    Minor quibble: are we reproducing the process of ML or the success of the particular ML model? Many of the large deep learning models are reported with structures and parameters — you can ask “how well do these perform”, regardless of how those parameters were arrived at. Treat the learned model as being like a treatment (or at least treatment protocol).

    That doesn’t help reproducibility for the purposes of ML research (given how much human intervention goes into training deep models, I’m not sure that goal isn’t impossible) but it might be OK for medical uses — and actually reproducing how well this particular ML model does provides an incentive to its trainers to not “take the best random seed we can find”.

  4. Paolo Inglese says:

    I agree that the reproducibility problem is general and doesn’t regard only ML.
    It’s just that this historical period is characterised by a particular (maybe excessive) hype about ML (and AI).
    This makes ML attract more criticism than other methods, together with the fact that ML models tend to be black boxes with gazillions of parameters, and the fact that a lot of people want to surf the hype wave and apply the methods using the “default parameters”.
    One issue is that, unfortunately, scientists tend to lose a bit of their critical thinking in front of hyped techniques. Thus, we get several flawed papers that have not been reviewed properly. If you are critical towards a hyped technique, you are accused of being against progress.
    Laziness is another problem. People tend to use what they think gives the highest reward with the smallest effort. Applying ML is seen a bit like something that can solve complex problems easily.
    You add a bit of political push from some research groups and companies and then you get the full picture.

    • What do you think should be done to rectify these problems? I discern some obvious solutions that have continued to be raised here on the blog. Consumers of expertise should also develop tools that can enable them to distinguish scientism from science.

    • jim says:

      “ML models tend to be black boxes with gazillions of parameters”

      Are there really ML models with “gazillions” of parameters? I don’t know a lot about what’s in use. But my understanding is that *successful* ML models are doing relatively modest things, like tracking a few physical parameters (age, temp, maybe strain, use cycles and a few other things) on an industrial part and predicting failure.

      From my brief experience with modelling, piling on the parameters doesn’t usually do that much to improve the model because the noise in those parameters is larger than the difference they would make on the model – like maybe a few tenths of a percent improvement.

      Am I wrong about that? I could be but IME there’s just not very many truly significant controls on any given physical or even social process.

      • Giles Hooker says:

        See many of the deep learning models used for image data. These end up with tens of thousands of parameters. Probably mostly unneeded if you knew which ones but that’s the trick.

        Yes. For “interpretable” (tabular data seems to be the current nomenclature) smaller models seem to do just as well. But a lot of the success in ML, at least recently, is in images, natural language, and audio, where humans do a lot of subconscious processing that doesn’t seem to be “simple”.

    • bbis says:

      To borrow the Andrew hat for a moment – Criticism is hard. Much of the criticism within any field follows well established paths. The processes by which the work is done are well established and have been evaluated carefully in the past so the faults and flaws have also been identified so there are available the tools needed to critique work that is done within these boundaries. With new techniques and methods it takes time for people to work through the ideas, test what the methods can and can’t do well, where the most common issues arise. Then it takes time for these ideas of criticism to spread throughout the community and become an accepted standard. A question that remains is how to deal with the period while the techniques of criticism emerge.

      One issue with the way most people discuss criticism of research is the implication that somehow it is simple and straightforward. Just be skeptical and all will be clear and no errors will occur. As fields develop how work needs to be evaluated also develops (with more than a curt nod to Oscar Wilde) and the development of the process of critical evaluation can be just as difficult and indirect as the development of the research to which it is directed. It might make it easier for people to accept critical input if there was a better appreciation of the intellectual effort involved in the development of good critiques.

      I also think that skepticism without some critical foundation to evaluate new ideas is not very productive. One potential product of skepticism without a foundation is an excuse to not engage with new ideas. A balance that allows engagement without over commitment is the approach needed that walks a line between simple skepticism as generally practiced and simple buy-in.

      • When I say “be a skeptic”, I mean that scientists should always ask themselves: “What if?”. This goes together with the appreciation of the good aspects brought by a new technique or technology. I see that being a skeptic or constructively critical is even more respectful towards the topic, than just passively accepting it. Being intellectually engaged with something denotes respect and interest in the topic.

        At the same time, we also have to say that, unfortunately, in the last few years we have seen a hyped dogmatism towards AI/DL/ML.
        Several papers that lack scientific rigour have appeared, even in prestigious journals, just because of the hyped technique.

        It will take time for sure, but only the pressure of critical thinking can make the field more rigorous. This is what I was meaning in my comment.

  5. Mendel says:

    Am I understanding this right, that reproducing an ML model costs more than creating it in the first place?
    Or are they that expensive to create in the first place?

  6. Jonathan (another one) says:

    This is a great example of a Parkinson’s Law for computing. I’m old enough to remember when testing regressions on the Longley dataset (16 observations on 6 variables) was all you needed to benchmark the accuracy of the computing part of your estimation. As computing power expands, the complexity of the algorithms is allowed to grow, and the tunable parameters and random seeds (and quality of the associated RNGs) become issues that were never dreamed of.

    But the solution here is ridiculously simple: transparency. Provide the program used, including all user-set parameters and include a file of the default parameters used by the software. (That may actually require a little effort by software designers in some cases.) If the result is unstable with respect to the starting seed, or tuning parameters, or whatever, then let the world know about it.

    This is routine in expert witness work, in which experts are routinely required to turn over their data and the programs used to generate results. The remaining nonreplications come from an unwillingness of the replicator to use the exact software used by the originator — all manner of divergences can arise from the attempt to rewrite Stata code in SAS, for example, but transparency and openness solve these issues.
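
A minimal sketch of the “file of the default parameters” idea above, using only the Python standard library. The training function `fit_model` and its parameters are hypothetical stand-ins for a real library call, and the helper names `default_parameters` and `run_record` are assumptions made for illustration. `inspect.signature` only sees defaults declared in the function signature, so defaults set deeper inside a library would need separate handling.

```python
import inspect
import json
import platform
import sys


def fit_model(data, learning_rate=0.01, n_iter=100, shuffle=True, seed=None):
    """Hypothetical training function standing in for a library call."""
    return {"n": len(data)}


def default_parameters(func):
    """Return every keyword default that `func` would otherwise apply silently."""
    sig = inspect.signature(func)
    return {
        name: p.default
        for name, p in sig.parameters.items()
        if p.default is not inspect.Parameter.empty
    }


def run_record(func, **user_settings):
    """Bundle user-set values, the remaining defaults, and the environment."""
    defaults = default_parameters(func)
    unknown = set(user_settings) - set(defaults)
    if unknown:
        raise TypeError(f"unknown settings: {sorted(unknown)}")
    return {
        "function": func.__name__,
        "user_set": user_settings,
        "defaults_used": {
            k: v for k, v in defaults.items() if k not in user_settings
        },
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }


record = run_record(fit_model, learning_rate=0.1, seed=42)
print(json.dumps(record, indent=2, sort_keys=True))
```

Saving this record next to the results makes the silently applied defaults (here `n_iter` and `shuffle`) part of what gets turned over, along with the interpreter version they were run under.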

    • Although I totally agree about the transparency, I see a big problem especially with DL.
      DL sometimes requires a lot of computational power (that not everyone has). So, testing multiple parameters may be unfeasible for several. I would see more something like a public computational environment for research, where models can be uploaded, and benchmark tested. It will never happen, but I like to dream a lot.

      • Jonathan (another one) says:

        That’s a good point. If the resources required just to do the run are extensive enough, then it may not be enough just to provide the program. But as computing resources continue to get cheaper, benchmarking current models might be feasible, say, five years from now, for much more modest resource costs.

    • somebody says:

      > I’m old enough to remember when testing regressions on the Longley dataset (16 observations on 6 variables) was all you needed to benchmark the accuracy of the computing part of your estimation

      I don’t know if that was ever a good idea. Even today, a lot of hyped ML/deep learning tweaks that gain notoriety from high performance on standard large datasets like MNIST end up being just chance overfittings on hyperparameters that perform about as well as standard methods on truly OOS validations.

      • Jonathan (another one) says:

        I agree completely. My reference to the Longley data was meant somewhat sarcastically, although it was taken very seriously at the time… so much so that, apocryphally, some statistical packages wrote code to detect the Longley data and spit out the correct answers. Even if that didn’t happen, hacks to game benchmarks still happen all the time.

  7. Ron Kenett says:

    First step – get your terminology right. The discussion is mixing terms and this almost always leads to cacophonic discussions. See

    Second step – lay out in detail the data analysis pipeline. Just mentioning algorithms is not enough. See for example

    Third step – properly describe research claims. Predictive analytics is different from exploratory analysis. In both cases the emphasis should be on generalisability of findings. Usually, ML is poor at generalizability that leverages domain expertise….

  8. Stevec says:

    I don’t work in that field but have read a lot on ML since AlphaGo arrived. And Deep Mind, who created AlphaGo, fairly recently showed better results than traditional methods for protein folding.

    In general you use training data to essentially pattern match.

    Image recognition is the iconic example. Humans aren’t setting parameters. For example, 10M images in, and 10M human issued correct answers, one for each image. The algorithm finds the set of weights that minimise error (computer result v human correct result). That’s basically it. Usually (always?) you hold back some sample, and then validate the resulting neural net and see how well it does. Lots of computing power to determine the weights (parameters), but once you have the weights, to get “the next output from the next input” requires very little computing power.

    You can’t validate it the way you validate someone’s code. You validate it based on testing it against a non-training sample.
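
A minimal sketch of the hold-out idea described above, assuming a one-weight linear model and synthetic data in place of images and a neural net. The names `fit_slope` and `mean_squared_error`, the 80/20 split, and the noise level are all illustrative assumptions.

```python
import random


def fit_slope(rows):
    """Least-squares weight for y ~ w * x (no intercept): sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, y in rows)
    return sxy / sxx


def mean_squared_error(w, rows):
    """Average squared gap between the model output w * x and the known answer y."""
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)


rng = random.Random(0)

# Synthetic "labelled" data: the true weight is 2.0, plus Gaussian noise.
rows = []
for _ in range(100):
    x = rng.uniform(1.0, 2.0)
    rows.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

# Hold back 20% of the sample; the model never sees it during fitting.
rng.shuffle(rows)
train, held_out = rows[:80], rows[80:]

w = fit_slope(train)                      # "find the weights that minimise error"
train_mse = mean_squared_error(w, train)  # error on data the model was fitted to
held_out_mse = mean_squared_error(w, held_out)  # error on the non-training sample

print(f"w={w:.3f} train_mse={train_mse:.4f} held_out_mse={held_out_mse:.4f}")
```

The held-out error is the number that speaks to how the fitted weights behave on new inputs; the training error alone says only how well the weights match the data they were tuned on.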

    No idea how it’s being used in medicine. I’m sure it’s being overhyped, that’s a given.

    If I was a patient and a medical person said, we’re giving you this cocktail of drugs because that’s the best shot at fixing this problem, AI recommends it, I’d be asking questions. However, the key point seems to be that right now medical practitioners have a similar approach to a ML neural network, i.e., their own brain, trained on a much smaller subset of data. Different practitioners make different recommendations, based on their small subset of papers read, patients seen, results observed.

    Seems like it has great potential for identifying combinations that humans aren’t seeing. Exploratory work. Like setting it to work on the UK Biobank, which has over 500k standardised datasets.

    And definitely (I believe proven) potential on image related tasks – recognising cancers from xrays, recognising eye problems, recognising skin cancers – achieving the same as the top experts in these sub fields.

    • Stevec says:

      A note on the costs.. I was struck watching the AlphaGo documentary with how their hardware requirements dramatically reduced once they had initially “nailed the problem”. They beat Lee Sedol, “the Roger Federer of Go”, and they had a pretty big set of servers.. then a year later they played the current no 1 in the world, Ke Jie (can’t remember the right spelling) with the system running off something like 1/100 less computing power. And they demolished him, and didn’t use most of their time allocated.

      I don’t really know the computational details, I read David Silver’s paper, but it’s not my field. The key point that I took away – trying to solve a big problem (whatever that problem is) requires massive hardware, experts in the field, lots of trial and error with the layout of the neural net (and probably the algorithms), but then once you’re getting results you can validate (in this case winning or picking best next move each time) you see how to slim down the computational requirements.

      I might have misread how this plays out. But I’d guess the potential for cost reduction in this field is very high. RCT will stay expensive, computational costs will shrink, not just because hardware costs go down every year/decade, but possibly because of this other consideration.

      I would be very interested to get insights from ML experts if there are any here.

      • Ben says:

        > with the system running off something like 1/100 less

        Yes, treating the computers as the cost seems strange. I’d expect most of the cost to be labor in basically any algorithm development. Sure eventually you’ll scale it up to a big computer, but presumably a lot of your dev time will be in smaller stuff.

        And is the cloud cheap for stuff like this? It’s my impression it’s a pay for convenience thing.

  9. Ethan says:

    I think the fundamental problem is that there is a serious lack of open and public benchmarks for understanding how well various medical machine learning algorithms work. The primary reason why people have been so successful in computer vision and NLP is that they have public benchmarks that enable easy and clear evaluation of methods (ImageNet, SuperGLUE, etc). This makes research much more straightforward as it’s easier to understand how well your method actually does (and makes sure that your supposed “improvement” isn’t simply due to incorrect baselines). Right now we have no clue how effective many medical machine learning methods are or even if they work at all. Every paper evaluates on a different dataset with a different set of baselines and setup which makes comparisons and replication impossible. You even have documented cases where authors are unable to reproduce their own work from a couple years prior!

    The solution is quite simple: We need an entity to commit to creating a public benchmark for medical machine learning algorithms (medical records, notes, imaging, etc) in cases where the data can’t be made public. That entity would then run methods and compute scoreboards for which perform best. Of course that would cost money and resources, but I do believe the investment will be well worth the price.

  10. Mat says:

    One thing that hasn’t been mentioned in this thread is the approaches the software industry takes to achieve replicability (obvious disclaimer, I work on such approaches). The software industry fights with problems very similar to those of the machine learning/science community. It is, for example, non-trivial to build and run a simple program like emacs from 10 years ago and end up with the same version as before – in the worst case the computation just fails and in the best you just have a slightly different emacs caused by undeclared dependencies, non-determinism and so on.

    Besides reproducing things over and over again in different environments (automatic tests), they increasingly rely on computational approaches that can give you very high confidence in the replicability of your computations. If you had such confidence that a machine learning model will replicate if trained with the same inputs in the same computational context, you probably would just rely on it instead of relying on your own tests that are really costly and focus on testing the model in a different context. I am not sure where confidence in the replicability of published models comes from at the moment.

    One approach to get confidence in replicability of computations is the use of pure functions in a program – and this can be enforced. Pure functions prohibit any external side effects on your computation such as random number generation without fixed seed, internet access or others. If you need to use such effects you have to make them explicit and describe them, for example with seeds or content hashes – you have to pass through a controllable effect layer. Truly non-deterministic computations that can be caused by parallelism are of course more difficult to handle. Tools like Nix/Guix/Haskell/Elm and others use this approach to control a certain number of such side effects. Nix, for example, achieves bit-reproducibility for the build process of a subset of programs of the Linux open source ecosystem (see And in some sense, building a binary from source code is not much different from building a machine learning model from data (and code). More extensively effect-controlled pipeline tools that control other side effects would allow setting up machine learning models that are replicable with high confidence. Although these approaches are not perfect, such strategies could at least push this number of 20% of successful replication attempts to retrain published machine learning models closer to 100%.

    I wonder what role such tools could play in health sciences and even statistics in the wider sense if they were more widely used …
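
A minimal sketch of the “explicit seed plus content hash” pattern described above, written in Python for illustration. Python cannot enforce purity the way Nix or Haskell can, so this shows the discipline rather than the enforcement. The function names `content_hash` and `train`, the toy data, and the learning rate are assumptions, not part of any of the tools mentioned.

```python
import hashlib
import json
import random


def content_hash(obj) -> str:
    """SHA-256 of a canonical JSON encoding, used to identify inputs and outputs."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def train(data, seed):
    """Output depends only on the arguments (data, seed).

    No global random state, clock, network, or file access is touched,
    so the only source of randomness is the explicitly passed seed.
    """
    rng = random.Random(seed)  # local generator, not the module-level one
    order = list(range(len(data)))
    rng.shuffle(order)
    w = 0.0
    for i in order:  # one pass of stochastic gradient descent on y ~ w * x
        x, y = data[i]
        w += 0.05 * (y - w * x) * x
    return {"weight": w}


data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2]]

first = train(data, seed=7)
second = train(data, seed=7)

# The hashes of the inputs and of the result can be published with the model,
# so that a later rerun can be checked against them.
print("data  ", content_hash(data))
print("model ", content_hash(first))
print("match ", content_hash(first) == content_hash(second))
```

Matching output hashes across machines is a stronger claim than matching across two runs on one machine, since floating-point results can differ with hardware and library versions; that gap is what the environment-pinning tools mentioned above aim to close.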

  11. Chris says:

    When discussing ML in health care, a key aspect missing from the discussion above is the inability to share data due to privacy rules. Without good de-identified publicly available data, there is a limit to replication. Journals that have open data policies still waive those policies when researchers point to privacy concerns about the data sets. Those concerns are real, but at least a model matrix can be shared without risk of re-identification. On top of this, the peer-review process does not pay enough attention to full disclosure of the workflow. When a paper does not fully disclose its workflow, and all the details therein, then that work should not get published.
