David Bailey, a physicist at the University of Toronto, writes:
I thought you’d be pleased to hear that a student in our Advanced Physics Lab spontaneously used Stan to analyze data with significant uncertainties in both x and y. We’d normally expect students to use Python and orthogonal distance regression, and Stan is never mentioned in our course, but evidently it is creeping into areas I wasn’t expecting.
The other reason for this email is that I’d be curious to (eventually) read your comments on this new article by Anne‐Laure Boulesteix, Sabine Hoffmann, Alethea Charlton, and Heidi Seibold in Significance magazine: “A replication crisis in methodological research?” I [Bailey] have done a bit of research on scientific reproducibility, and I am always curious to understand better how statistical methods are validated in the messy real world. I often say that physics is much easier to do than medical and social sciences because “electrons are all identical, cheap, and unchanging – people are not.”
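As an aside on the "uncertainties in both x and y" problem Bailey mentions: neither the student's Stan model nor the course's Python code is shown here, so the following is only a minimal sketch of the general idea, not their method. It uses the closed-form Deming regression for a straight line, which reduces to orthogonal distance regression when the x and y error variances are assumed equal (delta = 1). The function name deming_fit and the constant, known variance ratio delta are assumptions made for illustration.

```python
import math


def deming_fit(x, y, delta=1.0):
    """Fit y = intercept + slope * x when both x and y are noisy.

    delta is the ASSUMED ratio var(y errors) / var(x errors), taken to be
    known and constant across points. delta = 1 gives orthogonal distance
    regression for a straight line.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired points")

    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Centered sums of squares and cross-products. The usual 1/(n-1)
    # factor cancels in the slope formula, so it is omitted.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxy == 0:
        raise ValueError("x and y are uncorrelated; slope is undefined")

    # Closed-form Deming slope; the intercept passes through the centroid.
    diff = syy - delta * sxx
    slope = (diff + math.sqrt(diff ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Illustrative, made-up points lying exactly on y = 2x + 1.
slope, intercept = deming_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Unlike ordinary least squares, this treats x as measured with error, so it avoids the attenuation of the slope toward zero that noise in x would otherwise cause. A full Bayesian treatment in Stan would instead model the true x values as unknown parameters, which also handles per-point uncertainties.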
I read the linked article by Boulesteix et al., and I agree with their general points, although they perhaps underestimate the difficulty of these evaluations. Here are three relevant articles from 2012-2014:
1. In a 2012 article, Eric Loken and I criticize statisticians for “not practicing what we preach.” That article is about teaching rather than methodology, but the issue is similar.
2. I ask how we choose our methods; see section 26.2 of this article from 2014.
3. Another take on what we consider as convincing evidence is this article from 2014 with Keith O’Rourke, which begins as follows:
The rules of evidence as presented in statistics textbooks are not the same as the informal criteria that statisticians and practitioners use in deciding what methods to use.
According to the official rules, statistical decisions should be based on careful design of data collection, reliable and valid measurement, and something approximating unbiased or calibrated estimation. The first allows both some choice of the assumptions and an opportunity to increase their credibility, the second tries to avoid avoidable noise and error, and the third tries to restrict to methods that are seemingly fair. This may be fine for evaluating psychological experiments, or medical treatments, or economic policies, but we as statisticians do not generally follow these rules when considering improvements in our teaching nor when deciding what statistical methods to use.
So, yes, it’s kind of embarrassing that statisticians are always getting on everybody else’s case for not using random sampling, controlled experimentation, and reliable and valid measurements—but then we don’t use these tools in our own decision making.
P.S. On Bailey’s webpage there’s a link to this page on “The Way of the Physicist”:
Our students should be able to
– construct mathematical models of physical systems
– solve the models analytically or computationally
– make physical measurements of the systems
– compare the measurements with the expectations
– communicate the results, both verbally and in writing
– improve and iterate
and to apply these skills to both fundamental and applied problems.
P.P.S. I sent the above comments to Boulesteix et al. (the authors of the above-linked article), who replied:
We agree with your discussion about the shortcomings of the sources of evidence on the effectiveness of statistical methods in Gelman (2013) and Gelman and O’Rourke (2014). Indeed, we generally expect methodological papers to provide evidence on the effectiveness of the proposed methods and this evidence is often given through mathematical theory or computer simulations “which are only as good as their assumptions” (Gelman, 2013). When evidence of improved performance is given in benchmarking studies, the number of benchmark data sets is often very limited and the process used to select datasets is unclear. In this situation, researchers are incentivized to use their researcher degrees of freedom to show their methods from the most appealing angle. We do not suggest any intentional cheating here, but in the fine-tuning of method settings, data sets and pre-processing steps, researchers (ourselves included) are masters of self-deception. If we do not at least encourage authors to minimize cherry-picking when providing this evidence, then it is just silly to ask for evidence.
We also agree with you that the difficulties of becoming more evidence-based should not be underestimated. At the same time, we are not arguing that it is an easy task; we say that it is a necessary one. As you say in Gelman and O’Rourke (2014), statistics is a young discipline and, in the current statistical crisis in science, we feel that we can learn from the progress that is being made in other disciplines (while keeping in mind the differences between statistics and other fields). Currently, “statisticians and statistical practitioners seem to rely on a sense of anecdotal evidence based on personal experience and on the attitudes of trusted colleagues” (Gelman, 2013). For much of its history, medicine was more of an art than a science and relied on this type of anecdotal evidence, which has only come to be considered insufficient in the last century. We might be able to learn important lessons from experiences in this field when it comes to evidence-based methodological research.
Starting from the list of evidence that you present in Gelman (2013) and Gelman and O’Rourke (2014), we might for instance establish a pyramid of evidence. The highest level of evidence in this pyramid could be systematic methodological reviews and pre-registered neutral comparison studies on a number of benchmark data sets determined through careful sample size calculation or based on neutral and carefully designed simulation studies. We have to admit that evidence on real data sets is easier to establish for methods where the principal aim is prediction rather than parameter estimation and we do not pretend to have the answer to all open questions. At the same time, it is encouraging to see that progress is being made on many fronts, ranging from the pre-registration of machine learning research (e.g. NeurIPS2020 pre-registration experiment, http://preregister.science/), to the replication of methodological studies (https://replisims.org/replications-in-progress/), and to the elaboration of standardized reporting guidelines for simulation studies (see, e.g., De Bin et al., Briefings in Bioinformatics 2020 for first thoughts on this issue).
We appreciate your argument in Gelman and Loken (2012) that we should be more research based in our teaching practice. Exactly as applied statisticians should, in an ideal world, select statistical methods based on evidence generated by methodological statisticians, teaching statisticians should, in an ideal world, select teaching methods based on evidence generated by statisticians committed to didactics research. In both fields, teaching and research, it seems deplorable that we do not practice what we preach for other disciplines. In the long run, we also hope that our teaching will benefit from more evidence-based methodological research. Our students are often curious to know which method they should apply to their own data and, in the absence of neutral, high-quality studies on the comparative effectiveness of statistical methods, it is difficult to give a satisfactory answer to this question. All in all, we hope that we can stop being “cheeseburger-snarfing diet gurus who are unethical in charging for advice we wouldn’t ourselves follow” (Gelman and Loken, 2012) regarding both teaching and research.
I responded that I think they are more optimistic than I am about evidence-based medicine and evidence-based education research. It’s not so easy to select teaching methods based on evidence-based research. And for some concerns about evidence-based medicine, see here.