David Bailey, a physicist at the University of Toronto, writes:

I thought you’d be pleased to hear that a student in our Advanced Physics Lab spontaneously used Stan to analyze data with significant uncertainties in both x and y. We’d normally expect students to use Python and orthogonal distance regression, and Stan is never mentioned in our course, but evidently it is creeping into areas I wasn’t expecting.
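For readers wondering what such an analysis involves: when x has measurement error comparable to y’s, naive least squares of y on x attenuates the slope, which is why students reach for orthogonal distance regression (or a measurement-error model in Stan). Here is a minimal pure-Python sketch of Deming regression, the simplest errors-in-both-variables estimator; the data are made up for illustration:

```python
import math

def deming_slope(x, y, delta=1.0):
    """Slope of a Deming regression, which allows measurement error in
    both x and y; delta is the ratio of the y-error variance to the
    x-error variance (delta=1 treats the two uncertainties as equal)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    d = syy - delta * sxx
    return (d + math.sqrt(d ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)

# Made-up noisy measurements scattered around y = 2x; the Deming slope
# comes out close to 2, whereas naive least squares of y on x would be
# attenuated toward zero by the noise in x.
x = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1, 6.8, 8.0]
y = [2.2, 3.9, 6.1, 8.0, 10.3, 12.0, 13.8, 15.9]
print(round(deming_slope(x, y), 2))
```

A full Bayesian treatment in Stan would instead model the true x values as latent parameters, but the closed-form Deming slope above conveys the basic idea.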

The other reason for this email is that I’d be curious to (eventually) read your comments on this new article by Anne‐Laure Boulesteix, Sabine Hoffmann, Alethea Charlton, and Heidi Seibold in Significance magazine: “A replication crisis in methodological research?” I [Bailey] have done a bit of research on scientific reproducibility, and I am always curious to understand better how statistical methods are validated in the messy real world. I often say that physics is much easier to do than medical and social sciences because “electrons are all identical, cheap, and unchanging – people are not.”

I read the linked article by Boulesteix et al., and I agree with their general points, although they perhaps underestimate the difficulty of these evaluations. Here are three relevant articles from 2012-2014:

1. In a 2012 article, Eric Loken and I criticize statisticians for “not practicing what we preach”; that piece concerned teaching rather than methodology, but it’s a similar issue.

2. I ask how we choose our methods; see section 26.2 of this article from 2014.

3. Another take on what we consider as convincing evidence is this article from 2014 with Keith O’Rourke, which begins as follows:

The rules of evidence as presented in statistics textbooks are not the same as the informal criteria that statisticians and practitioners use in deciding what methods to use.

According to the official rules, statistical decisions should be based on careful design of data collection, reliable and valid measurement, and something approximating unbiased or calibrated estimation. The first allows both some choice of the assumptions and an opportunity to increase their credibility, the second tries to avoid avoidable noise and error, and the third tries to restrict to methods that are seemingly fair. This may be fine for evaluating psychological experiments, or medical treatments, or economic policies, but we as statisticians do not generally follow these rules when considering improvements in our teaching nor when deciding what statistical methods to use.

So, yes, it’s kind of embarrassing that statisticians are always getting on everybody else’s case for not using random sampling, controlled experimentation, and reliable and valid measurements—but then we don’t use these tools in our own decision making.

**P.S.** On Bailey’s webpage there’s a link to this page on “The Way of the Physicist”:

Our students should be able to

– construct mathematical models of physical systems

– solve the models analytically or computationally

– make physical measurements of the systems

– compare the measurements with the expectations

– communicate the results, both verbally and in writing

– improve and iterate

and to apply these skills to both fundamental and applied problems.

I like that! It’s kind of like The Way of the Statistician. Bailey also wrote this article about how we learn from anomalies, in a spirit similar to our God is in Every Leaf of Every Tree principle.

**P.P.S.** I sent the above comments to Boulesteix et al. (the authors of the above-linked article), who replied:

We agree with your discussion about the shortcomings of the sources of evidence on the effectiveness of statistical methods in Gelman (2013) and Gelman and O’Rourke (2014). Indeed, we generally expect methodological papers to provide evidence on the effectiveness of the proposed methods and this evidence is often given through mathematical theory or computer simulations “which are only as good as their assumptions” (Gelman, 2013). When evidence of improved performance is given in benchmarking studies, the number of benchmark data sets is often very limited and the process used to select datasets is unclear. In this situation, researchers are incentivized to use their researcher degrees of freedom to show their methods from the most appealing angle. We do not suggest any intentional cheating here, but in the fine-tuning of method settings, data sets and pre-processing steps, researchers (ourselves included) are masters of self-deception. If we do not at least encourage authors to minimize cherry-picking when providing this evidence, then it is just silly to ask for evidence.

We also agree with you that the difficulties of becoming more evidence-based should not be underestimated. At the same time, we are not arguing that it is an easy task; we are saying that it is a necessary one. As you say in Gelman and O’Rourke (2014), statistics is a young discipline and, in the current statistical crisis in science, we feel that we can learn from the progress that is being made in other disciplines (while keeping in mind the differences between statistics and other fields). Currently, “statisticians and statistical practitioners seem to rely on a sense of anecdotal evidence based on personal experience and on the attitudes of trusted colleagues” (Gelman, 2013). For much of its history, medicine was more of an art than a science and relied on this type of anecdotal evidence, which has only come to be considered insufficient in the last century. We might be able to learn important lessons from experiences in this field when it comes to evidence-based methodological research.

Starting from the list of evidence that you present in Gelman (2013) and Gelman and O’Rourke (2014), we might for instance establish a pyramid of evidence. The highest level of evidence in this pyramid could be systematic methodological reviews and pre-registered neutral comparison studies on a number of benchmark data sets determined through careful sample size calculation or based on neutral and carefully designed simulation studies. We have to admit that evidence on real data sets is easier to establish for methods where the principal aim is prediction rather than parameter estimation and we do not pretend to have the answer to all open questions. At the same time, it is encouraging to see that progress is being made on many fronts, ranging from the pre-registration of machine learning research (e.g. NeurIPS2020 pre-registration experiment, http://preregister.science/), to the replication of methodological studies (https://replisims.org/replications-in-progress/), and to the elaboration of standardized reporting guidelines for simulation studies (see, e.g., De Bin et al., Briefings in Bioinformatics 2020 for first thoughts on this issue).

We appreciate your argument in Gelman and Loken (2012) that we should be more research-based in our teaching practice. Exactly as applied statisticians should, in an ideal world, select statistical methods based on evidence generated by methodological statisticians, teaching statisticians should, in an ideal world, select teaching methods based on evidence generated by statisticians committed to didactics research. In both fields – teaching and research – it seems deplorable that we do not practice what we preach for other disciplines. In the long run, we also hope that our teaching will benefit from more evidence-based methodological research. Our students are often curious to know which method they should apply to their own data and, in the absence of neutral, high-quality studies on the comparative effectiveness of statistical methods, it is difficult to give a satisfactory answer to this question. All in all, we hope that we can stop being “cheeseburger-snarfing diet gurus who are unethical in charging for advice we wouldn’t ourselves follow” (Gelman and Loken, 2012) regarding both teaching and research.

I responded that I think they are more optimistic than I am about evidence-based medicine and evidence-based education research. It’s not so easy to select teaching methods based on evidence-based research. And for some concerns about evidence-based medicine, see here.

Correction: the paper by Boulesteix, Hoffmann, Charlton, and Seibold is available at https://rss.onlinelibrary.wiley.com/doi/10.1111/1740-9713.01444 (the original link contained a stray parenthesis character).

Link fixed; thanks.

I remember as a second-ish year grad student I used Stan to fit some logistic growth model in some ‘intro to advanced artificially intelligent machine learning without any math’ course that turned out to be 90% exploratory data visualization with ggplot. I plotted credible & posterior predictive intervals etc. over the generic scatterplot we’d been meant to produce (a few other students might have fit the basic linear model w/ lm()). Nobody had any idea what to make of it lol.

What is the evidence this has been a benefit?

Apropos the wonderful physics study desiderata, I found the undergraduate study of quantum mechanics a profound experience, not least because the lab exercises designed to demonstrate the relevant effects were l-o-n-g (we’d miss meals), spooky, and fascinating. Measurements were critical, but math not so much.

As someone who chose to go into industry instead of academics, I enjoy this blog so much. Thanks for giving those of us not allowed in the ivory tower an insight into academic discussions :)

I think it’s a mistake to think of this as “what really goes on in the ivory tower.” My impression is there are plenty of academics here because it **isn’t** like what goes on in the ivory tower, and there are plenty of non-academics here as well. In many ways I think of this open discussion as “what **ought** to go on in academia, but mostly doesn’t anymore.”

Here is perhaps another useful example from the causal inference literature.

I have been struck by how a prominent stream of method evaluation papers basically lacks any statistical evidence. In one JASA paper, Shadish, Clark, and Steiner (2008) make claims about the ability of different adjustment and matching methods to reduce or remove confounding bias. This is done using a “doubly randomized preference trial” (DRPT) that randomly assigns people to whether they will be randomly assigned to treatment or whether they self-select into treatment. This paper was published along with interesting comments by Don Rubin, Jennifer Hill, and others. However none discuss this issue:

“Papers reporting on DRPTs have argued they provide evidence for bias of observational estimators and about which types of covariates and analysis methods most reduce that bias (Shadish, Clark, and Steiner 2008; Pohl et al. 2009; Steiner et al. 2010). However, they employ relatively small samples (e.g., n = 445, n = 202), and these claims are apparently not based on formal statistical inference: for Shadish, Clark, and Steiner (2008) and Steiner et al. (2010) the experimental and unadjusted observational estimates are statistically indistinguishable, so their results are “basically descriptive” (Steiner et al. 2010, p. 256). … The same is true of the smaller replication by Pohl et al. (2009). In fact, this problem is not unique to DRPTs: constructed observational studies (e.g., Griffen and Todd 2017) are often similarly underpowered to detect any bias to remove.”

This is a quote from my paper https://deaneckles.com/misc/Bias%20and%20high-dimensional%20adjustment%20in%20observational%20studies%20of%20peer%20effects%20(with%20SI).pdf and this is elaborated on in our Appendix 5. It turns out that if there is any evidence in that data of bias reduction, it comes from comparisons of the entirely observational estimators! — not from comparison with the analysis of the randomized treatment arm:

“In the main text, we comment on prior papers that report on doubly-randomized preference trials (DRPTs). In particular, we note that the experimental comparison provides little-to-no formal statistical evidence; rather, any evidence about bias or bias reduction comes from comparisons among observational estimators, which are not actually reported in these papers, but can be partially inferred from the results reported.

First, the comparisons between observational and experimental estimators are not statistically significant. This can be determined by analysis of the reported point estimates and standard errors for the experimental and (unadjusted) observational data in Tables 2 and 3 of Steiner et al. (2010) and Table 3 of Pohl et al. (2009).

Second, one can compare the different observational estimators. If there is evidence that two observational estimators (e.g., one unadjusted and one adjusted) are converging to different estimands, then this might be interpreted as explained by the presence of confounding (though other explanations may be possible). In particular, using the reported point estimates and standard errors for various regression adjustment estimators (ANCOVA) in Tables 2 and 3 of Steiner et al. (2010), one can conduct Wu–Hausman specification tests of the null hypothesis that the different estimators estimate the same quantity. These tests are potentially anti-conservative because of unknown covariance between the estimators (i.e., seemingly unrelated estimator tests should be used). We find that some of these tests (such as between the unadjusted and fully adjusted estimators) reject, which may be interpreted as providing entirely observational evidence for confounding. Thus, ironically, any statistical evidence for confounding bias or bias reduction through adjustment for covariates in Shadish et al. (2008) and Pohl et al. (2009) derives solely from the nonrandomized arms, not from comparison with the randomized arms.”
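To make the logic of that last step concrete, here is a minimal sketch of the Hausman-style comparison described above. The point estimates and standard errors are made up for illustration (they are not the numbers from Steiner et al. 2010), and the covariance between the estimators is set to zero, which is exactly the simplification that can make the test anti-conservative:

```python
import math

def hausman_z(b1, se1, b2, se2, cov=0.0):
    """Wu-Hausman-style z statistic for H0: two estimators target the
    same quantity. With cov=0 the (unknown) covariance between the
    estimators is ignored, which can make the test anti-conservative."""
    var_diff = se1 ** 2 + se2 ** 2 - 2 * cov
    return (b1 - b2) / math.sqrt(var_diff)

# Hypothetical unadjusted and fully covariate-adjusted (ANCOVA)
# estimates of the same treatment effect from the nonrandomized arm.
b_unadj, se_unadj = 8.3, 1.1
b_adj, se_adj = 4.9, 1.0

z = hausman_z(b_unadj, se_unadj, b_adj, se_adj)
# Rejecting H0 here would be purely observational evidence of
# confounding: no comparison with a randomized arm is involved.
print(abs(z) > 1.96)
```

The point of the sketch is the structure of the argument, not the numbers: the evidence for confounding comes from two observational estimators disagreeing with each other, not from either one disagreeing with the experimental benchmark.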

Thanks for the thoughtful post and also for getting in touch with us in advance!

In case anyone here is interested in a video version of the paper, go to https://youtu.be/b6X_O5pD3lA