
“Do you come from Liverpool?”

Paul Alper writes:

Because I used to live in Trondheim, I have a special interest in this NYT article about exercise results in Trondheim, Norway.

Obviously, even without reading the article in any detail, the headline claim that

The Secret to Longevity? 4-Minute Bursts of Intense Exercise May Help

can be misleading and is subject to many caveats.

The essential claims:

Such studies [of exercise and mortality], however, are dauntingly complicated and expensive, one reason they are rarely done. They may also be limited, since over the course of a typical experiment [of short duration], few adults may die. This is providential for those who enroll in the study but problematic for the scientists hoping to study mortality; with scant deaths, they cannot tell if exercise is having a meaningful impact on life spans.

However, exercise scientists at the Norwegian University of Science and Technology in Trondheim, Norway, almost 10 years ago, began planning the study that would be published in October in The BMJ.

More than 1,500 of the Norwegian men and women accepted. These volunteers were, in general, healthier than most 70-year-olds. Some had heart disease, cancer or other conditions, but most regularly walked or otherwise remained active. Few were obese. All agreed to start and continue to exercise more regularly during the upcoming five years.

Via random assignment, they were put into three groups: the control group, which “agreed to follow standard activity guidelines and walk or otherwise remain in motion for half an hour most days”; the moderate group, which exercised “moderately for longer sessions of 50 minutes twice a week”; and the third group, “which started a program of twice-weekly high-intensity interval training, or H.I.I.T., during which they cycled or jogged at a strenuous pace for four minutes, followed by four minutes of rest, with that sequence repeated four times.”
Note that those in the control group were allowed to indulge in interval training if they felt like it.

Almost everyone kept up their assigned exercise routines for five years [!!], an eternity in science, returning periodically to the lab for check-ins, tests and supervised group workouts.

The results:

The men and women in the high-intensity-intervals group were about 2 percent less likely to have died than those in the control group, and 3 percent less likely to die than anyone in the longer, moderate-exercise group. People in the moderate group were, in fact, more likely to have passed away than people in the control group [!!].

In essence, says Dorthe Stensvold, a researcher at the Norwegian University of Science and Technology who led the new study, intense training — which was part of the routines of both the interval and control groups — provided slightly better protection against premature death than moderate workouts alone.

Here can be found the BMJ article itself. A closer look at the BMJ article is puzzling because of the term “non-significant,” which appears in the BMJ article itself but not in the NYT.


This study suggests that combined MICT and HIIT has no effect on all cause mortality compared with recommended physical activity levels. However, we observed a lower all cause mortality trend after HIIT compared with controls and MICT.


The Generation 100 study is a long and large randomised controlled trial of exercise in a general population of older adults (70-77 years). This study found no differences in all cause mortality between a combined exercise group (MICT and HIIT) and a group that followed Norwegian guidelines for physical activity (control group). We observed a non-significant 1.7% absolute risk reduction in all cause mortality in the HIIT group compared with control group, and a non-significant 2.9% absolute risk reduction in all cause mortality in the HIIT group compared with MICT group. Furthermore, physical activity levels in the control group were stable throughout the study, with control participants performing more activities as HIIT compared with MICT participants, suggesting a physical activity level in control participants between that of MICT and HIIT.

As it happens, I [Alper] lived in Trondheim back before North Sea oil transformed the country. The Norwegian University of Science and Technology in Trondheim did not yet exist under that name; it was called the NTH, incorrectly translated as the Norwegian Technical High School. Back then, as today, exercise was the nation’s religion, and the motto of the country was

It doesn’t matter whether you win or lose. The important thing is to beat Sweden.

To give you a taste of what the country was like in the 1960s, while I was on a walk, a little kid stopped me and said, “Do you come from Liverpool?”

Dude should have his own blog.

Conference on digital twins

Ron Kenett writes:

This conference and the special issue that follows might be of interest to (some) of your blog readers.

Here’s what it says there:

The concept of digital twins is based on a combination of physical models that describe the machine’s behavior and its deterioration processes over time with analytics capabilities that enable lessons to be learned, decision making and model improvement. The physical models can include the control model, the load model, the erosion model, the crack development model and more, while the analytics model is based on experimental data and operational data from the field.

I don’t fully follow this, but it sounds related to all the engineering workflow things that we like to talk about.

Which sorts of posts get more blog comments?

Paul Alper writes:

Some of your blog postings elicit many responses and some, rather few. Have you ever thought of displaying some sort of statistical graph illustrating the years of data? For example, sports vs. politics, or responses for one year vs. another (time series), winter vs. summer, highly technical vs. breezy.

I’ve not done any graph or statistical analysis. Informally I’ve noticed a gradual increase in the rate of comments. It’s not always clear which posts will get lots of comments and which will get few, except that more technical material typically gets less reaction. Not because people don’t care, I think, but because it’s harder to say much in response to a technical post. I think we also get fewer comments for posts on offbeat topics such as art and literature. And of course we get fewer comments on posts that are simply announcing job opportunities, future talks, and future posts. And all posts get zillions of spam comments, but I’m not counting them.

As of this writing, we have published 10,157 posts and 146,661 comments during the 16 years since the birth of this blog. The rate of comments has definitely been increasing, as I remember not so long ago that the ratio was 10-to-1. Unfortunately, there aren’t so many blogs anymore, so I’m pretty sure that the total rate of blog commenting has been in steady decline for years.

Webinar: An introduction to Bayesian multilevel modeling with brms

This post is by Eric.

This Wednesday, at 12 pm ET, Paul Bürkner is stopping by to talk to us about brms. You can register here.


The talk will be about Bayesian multilevel models and their implementation in R using the package brms. We will start with a short introduction to multilevel modeling and to Bayesian statistics in general followed by an introduction to Stan, which is a flexible language for fitting open-ended Bayesian models. We will then explain how to access Stan using the standard R formula syntax via the brms package. The package supports a wide range of response distributions and modeling options such as splines, autocorrelation, and censoring all in a multilevel context. A lot of post-processing and plotting methods are implemented as well. Some examples from Psychology and Medicine will be discussed.

About the speaker

Paul Bürkner is a statistician currently working as a Junior Research Group Leader at the Cluster of Excellence SimTech at the University of Stuttgart (Germany). He is the author of the R package brms and a member of the Stan Development Team. Previously, he studied Psychology and Mathematics at the Universities of Münster and Hagen (Germany) and did his PhD in Münster on optimal design and Bayesian data analysis. He has also worked as a Postdoctoral researcher at the Department of Computer Science at Aalto University (Finland).

More on that credulity thing

I see five problems here that together form a feedback loop with bad consequences. Here are the problems:

1. Irrelevant or misunderstood statistical or econometric theory;
2. Poorly-executed research;
3. Other people in the field being loath to criticize, taking published or even preprinted claims as correct until proved otherwise;
4. Journalists taking published or even preprinted claims as correct until proved otherwise;
5. Journalists following the scientist-as-hero template.

There are also related issues such as fads in different academic fields, etc.

When I write about regression discontinuity, we often focus on item #2 above, because it helps to be specific. But the point of my post was that maybe we could work on #3 and #4. If researchers could withdraw from their defensive position by which something written by a credentialed person in their field is considered to be good work until proved otherwise, and if policy journalists could withdraw from their default deference to anything that has an identification strategy and statistical significance, then maybe we could break that feedback loop.

Economists: You know better! Think about your own applied work. You’ve done bad analyses yourself, even published some bad analyses, right? I know I have. Given that you’ve done it, why assume by default that other people haven’t made serious mistakes in understanding?

Policy journalists: You can know better! You already have a default skepticism. If someone presented you with a pure observational comparison, you’d know to be concerned about unmodeled differences between treatment and control groups. So don’t assume that an “identification strategy” removes these concerns. Don’t assume that you don’t have to adjust for differences between cities north of the river and cities south of the river. Don’t assume you don’t have to adjust for differences between schools in the hills and schools in the flatlands, or the different ages of people in different groups, or whatever.

I’m not saying this is all easy. Being skeptical without being nihilistic—that’s a challenge. But, given that you’re already tooled up for some skepticism when comparing groups, my recommendation is to not abandon that skepticism just because you see an identification strategy, some robustness checks, and an affiliation with a credible university.

Rob Tibshirani, Yuling Yao, and Aki Vehtari on cross validation

Rob Tibshirani writes:

About 9 years ago I emailed you about our new significance result for the lasso. You wrote about it in your blog. For some reason I never saw that full post until now. I do remember the Stanford-Berkeley Seminar in 1994 where I first presented the lasso and you asked that question. Anyway, thanks for admitting that the lasso did turn out to be useful!

That paper, “A significance test for the lasso,” did not turn out to be directly useful but did help spawn the post-selection inference area. Whether that area is useful remains to be seen.

Yesterday we just released a paper with Stephen Bates and Trevor Hastie that I think will be important, because it concerns a tool data scientists use every day: cross validation:

“Cross-validation: what does it estimate and how well does it do it?”

We do two things: (a) we establish, for the first time, what exactly CV is estimating, and (b) we show that the SEs from CV are often WAY too small, and show how to fix them. The fix is a nested CV, and there will be software for this procedure. We also show similar properties to those in (a) for the bootstrap, Cp, AIC, and data splitting. I am super excited!
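To spell out the quantities Rob is talking about, here's a minimal sketch (my own toy setup, not code from the Bates, Hastie, and Tibshirani paper): K-fold CV produces n held-out errors, the CV point estimate is their average, and the naive SE treats those errors as if they were independent, which is exactly the assumption the paper questions.

```python
import math
import random
import statistics

def kfold_cv(ys, k=5, rng=random):
    """K-fold CV for a toy intercept-only model (predict the training
    folds' mean). Returns the CV point estimate (mean squared error)
    and the naive SE that treats the n held-out errors as independent."""
    n = len(ys)
    idx = list(range(n))
    rng.shuffle(idx)
    errs = []
    for f in range(k):
        held_out = set(idx[f::k])
        train_mean = statistics.fmean(ys[i] for i in range(n) if i not in held_out)
        errs += [(ys[i] - train_mean) ** 2 for i in held_out]
    return statistics.fmean(errs), statistics.stdev(errs) / math.sqrt(n)

rng = random.Random(0)
estimates, naive_ses = [], []
for _ in range(300):  # replicate datasets to see the estimate's actual spread
    ys = [rng.gauss(0.0, 1.0) for _ in range(40)]
    est, se = kfold_cv(ys, rng=rng)
    estimates.append(est)
    naive_ses.append(se)

# The naive SE need not match the actual sd of the CV estimate across
# replications; the paper's point is that it is often too small, because
# the held-out errors share training data and so are correlated.
print("sd of CV estimate over replications:", round(statistics.stdev(estimates), 3))
print("average naive SE:", round(statistics.fmean(naive_ses), 3))
```

The fix they propose, nested CV, replaces that naive formula with one estimated from an inner layer of cross validation.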

I forwarded this to Yuling Yao and Aki Vehtari.

Yuling wrote:

The more cross validation papers I read, the more I want to write a paper with the title “Cross validation is fundamentally unsound,” analogous to the O’Hagan one:

1. (sampling variation) Even with iid data and exact LOO [leave-one-out cross validation], we rely on pseudo Monte Carlo method, which itself ignores sampling variation.

2. (outcome variation) Even worse than the unsound Monte Carlo, now we not only have sampling variation on x, but also variation on the outcome y, especially if y is binary or discrete.

3. (pointwise model performance) Sometimes we use pointwise cv error to understand local model fit, no matter how large a variance it might have. The solution to 1–3 is better modeling of cv errors, beyond the simple sample average. We have already applied hierarchical modeling to stacking, and we should apply it to cv, too.

4. (experiment design) Cross validation is a controlled experiment, which gives us the privilege of designing which unit(s) receive the treatment (left out) or control (not left out). Leave-one-cell-out is a type of block design. But there is no general guidance linking the rich literature on experimental design to the context of cv.

Just to clarify here: Yuling is saying this from the perspective of someone who thinks about cross validation all the time. His paper, “Using stacking to average Bayesian predictive distributions” (with Aki, Dan Simpson, and me) is based on leave-one-out cross validation. So when he writes, “Cross validation is fundamentally unsound,” he’s not saying that cross validation is a bad idea; he’s saying that it has room for improvement.

With that in mind, Aki wrote:

Monte Carlo is still useful even if you can reduce the estimation error by adding prior information. Cross validation is still useful even if you can reduce the estimation error by adding prior information. I think a better title would be something like “How to beat cross validation by including more information.”

Aki also responded to Rob’s original message, as follows:

Tuomas Sivula, Måns Magnusson, and I [Aki] have recently examined the frequency properties of leave-one-out cross validation in the Bayesian context, and specifically for model comparison, as that brings an additional twist in the behavior of CV.

– Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Uncertainty in Bayesian leave-one-out cross-validation based model comparison.

– Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Unbiased estimator for the variance of the leave-one-out cross-validation estimator for a Bayesian normal model with fixed variance.

Although Bayesian LOO-CV is slightly different, there are certainly the same issues in folds not being independent and the naive SE estimator being biased. However, it seems this is mostly problematic for small n, very similar models, or badly misspecified models (1st paper), and in addition there is a way to reduce the bias (2nd paper). I have shared your [Rob’s] paper with Tuomas and Måns and we will look carefully to see if there is something we should take into account, too.

Lots of this conversation is just whipping over my head but I thought I’d share it with you. I’ve learned so much about cross validation from working with Aki and Yuling, and of course all of statistics has benefited from Rob’s methods.

P.S. Zad sends in this picture demonstrating careful contemplation of the leave-one-out cross validation principle. I just hope nobody slams me in the Supplemental Materials section of a paper in The European Journal of Clinical Investigation for using a cat picture on social media. I’ve heard that’s a sign of sloppy science!

Is explainability the new uncertainty?

This is Jessica. Last August, NIST published a draft document describing four principles of explainable AI. They asked for feedback from the public at large, to “stimulate a conversation about what we should expect of our decision-making devices.”

I find it interesting because from a quick skim, it seems like NIST is stepping into some murkier territory than usual. 

The first principle suggests that all AI outputs resulting from querying a system should come with explanation: “AI systems should deliver accompanying evidence or reasons for all their outputs.” From the motivation they provide in the first few pages: 

“Based on these calls for explainable systems [40], it can be assumed that the failure to articulate the rationale for an answer can affect the level of trust users will grant that system. Suspicions that the system is biased or unfair can raise concerns about harm to oneself and to society [102]. This may slow societal acceptance and adoption of the technology, as members of the general public oftentimes place the burden of meeting societal goals on manufacturers and programmers themselves [27, 102]. Therefore, in terms of societal acceptance and trust, developers of AI systems may need to consider that multiple attributes of an AI system can influence public perception of the system. Explainable AI is one of several properties that characterize trust in AI systems [83, 92].”

NIST is an organization whose mission involves increasing trust in tech, so thinking about what society wants and needs is not crazy. The report summarizes a lot of recent work in explainable AI to back their principles, and acknowledges at points that techniques are still being developed. Still, I find the report to be kind of a bold statement. Explainable AI is pretty open to interpretation, as the research is still developing. The report describes ideals but on what seems like still underspecified ground.

The second principle, for instance, is:

Systems should provide explanations that are meaningful or understandable to individual users.

They expound on this: 

A system fulfills the Meaningful principle if the recipient understands the system’s explanations. Generally, this principle is fulfilled if a user can understand the explanation, and/or it is useful to complete a task. This principle does not imply that the explanation is one size fits all. Multiple groups of users for a system may require different explanations. The Meaningful principle allows for explanations which are tailored to each of the user groups. Groups may be defined broadly as the developers of a system vs. end-users of a system; lawyers/judges vs. juries; etc. The goals and desiderata for these groups may vary. For example, what is meaningful to a forensic practitioner may be different than what is meaningful to a juror [31]. 

Later they mention the difficulty of modeling the human model interaction:

As people gain experience with a task, what they consider a meaningful explanation will likely change [10, 35, 57, 72, 73]. Therefore, meaningfulness is influenced by a combination of the AI system’s explanation and a person’s prior knowledge, experiences, and mental processes. All of the factors that influence meaningfulness contribute to the difficulty in modeling the interface between AI and humans. Developing systems that produce meaningful explanations need to account for both computational and human factors [22, 58]

I would like to interpret this as saying we need to model the “system” comprised of the human and the model, which is a direction of some recent work in human AI complementarity, though it’s not clear how much the report intends formal modeling versus simply considering things like users’ prior knowledge. Both are hard, of course, and places where research is still very early along. As far as I know, most of the work in AI explainability and interpretability—with the former applying to any system for which reasoning can be generated and the latter applying to “self-explainable” models that people can understand more or less on their own—is still about developing different techniques. Less is on studying their effects, and even less about applying any formal framework to understand both the human and the model together. And relatively little has involved computer scientists teaming up with non-computer scientists to tackle more of the human side. 

The third principle requires “explanation accuracy”:

  • The explanation correctly reflects the system’s process for generating the output.

Seems reasonable enough but finding accurate ways to explain model predictions is not an easy problem. I’m reminded of some recent work in interpretability showing that some very “intuitive” approaches to deep neural net explanations like giving the salient features of an input given the output sometimes don’t exhibit basic traits we’d expect, like generating the same explanations when two models produce the same outputs for a set of inputs even if they have different architectures. 

Also, especially when the explanations must correctly reflect the system’s process, then they will often entail introducing the user to features or counterfactuals and may include probability to convey the model’s confidence in the prediction. This is all extra information for the end-user to process. It makes me think of the broader challenge of expressing uncertainty with model estimates. AI explanations, like expressions of uncertainty, become an extra thing that users have to make sense of in whatever decision context they’re in. “As-if optimization” where you assume the point estimate or prediction is correct and go ahead with your decision, becomes harder. 

One thing I find interesting, though, is how much more urgency there seems to be in examples like the NIST report or the popular tech press around explainable AI relative to the idea that we need to make the outputs of statistical models more useful by expressing uncertainty. NIST has addressed the latter in multiple reports, though never with the implied urgency and human-centric focus here. It’s not like there’s a shortage of examples where misinterpreting uncertainty in model estimates led to bad choices on the part of individuals, governments, etc. Suggesting that AI predictions require explanations and that measurements require uncertainty expressions comes from a similar motivation: modelers owe their users provenance information so they can make more informed decisions. However, the NIST reports on uncertainty have not really discussed public needs or requirements implied by human cognition, focusing instead on defining different sources and error measurements. Maybe times have changed, and if NIST did an uncertainty report now, it would be much less dry and technical, stressing the importance of understanding what people can tolerate or need. Or maybe it’s a branding thing. Explainable AI sounds like an answer to people’s deepest fears of being outsmarted by machines. Uncertainty communication just sounds like a problem.

At any rate, something I’ve seen again and again in my research, and which is well known in JDM work on reasoning under uncertainty is the pervasiveness of mental approximations or heuristics people use to simplify how they use the extra information, even in the case of seemingly “optimal” representations. It didn’t really surprise me to learn that heuristics have come up in the explainable AI lit recently. For instance, some work argues people rarely engage analytically with each individual AI recommendation and explanation; instead they develop general heuristics about whether and when to follow the AI suggestions, accurate or not. 

In contrast to uncertainty, though, which is sometimes seen as risky since it might confuse end-users of some predictive system, explainability often gets seen as a positive thing. Yet it’s pretty well established at this point that people overrely on AI recommendations in many settings, and explainability does not necessarily help as we might hope. For instance, a recent paper finds that explanations increase overreliance regardless of the correctness of the AI recommendation. So the relationship more explanation = more trust should not be assumed when trust is mentioned as in the NIST report, just as it shouldn’t be assumed that more expression of uncertainty = more trust.

So, lots of similarities on the surface. Though not (yet) much overlap between uncertainty expression/communication research and explainable AI research.

The final principle, since I’ve gone through the other three, is about constraints on use of a model: 

The system only operates under conditions for which it was designed or when the system reaches a sufficient confidence in its output. (The idea is that if a system has insufficient confidence in its decision, it should not supply a decision to the user.)

Elaborated a bit:

The previous principles implicitly assume that a system is operating within its knowledge limits. This Knowledge Limits principle states that systems identify cases they were not designed or approved to operate, or their answers are not reliable. 

Another good idea, but hard. Understanding and expressing dataset limitations is also a growing area of research (I like the term Dataset Cartography, for instance). I can’t help but wonder, why does this property tend to only come up when we talk about AI/ML models rather than statistical models used to inform real world decisions more broadly? Is it because statistical modeling outside of ML is seen as being more about understanding parameter relationships than making decisions? While examples like Bayesian forecasting models are not black boxes the way deep neural nets are, there’s still lots of room for end-users to misinterpret how they work or how reliable their predictions are (election forecasting for example). Or maybe because the datasets tend to be larger and are often domain general in AI, there’s more worries about overlooking mismatches in development versus use population, and explanations are a way to guard against that. I kind of doubt there’s a single strong reason to worry so much more about AI/ML models than other predictive models.     

Relative vs. absolute risk reduction . . . 500 doctors want to know!

Some stranger writes:

What are your thoughts on this paper? Especially the paragraph on page 6 “Similar to the critical appraisal ….. respectively”.
There are many of us MD’s who are quite foxed.
If you blog about it, please don’t mention my name and just say a doctor on a 500-member listserv asked you about this. And send me the link to that blog article please. There are at least 500 of us doctors who would love to be enlightened.

The link is to an article called, “Outcome Reporting Bias in COVID-19 mRNA Vaccine Clinical Trials,” which argues that when reporting results from coronavirus vaccine trials, they should be giving absolute risk rather than relative risk. These have the same numerator, different denominators. Let X be the number of cases that would occur under the treatment, Y be the number of cases that would occur under the control, and Z be the number of people in the population. The relative risk reduction (which is what we usually see) is (Y – X)/Y and the absolute risk reduction is (Y – X)/Z. So, for example, if X = 50, Y = 1000, and Z = 1 million, then the relative risk reduction is 95% but the absolute risk reduction is only 0.00095, or about a tenth of one percent. Here’s the wikipedia page.
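The arithmetic in that worked example is simple enough to put in a few lines (just the definitions above, not anything from the linked paper):

```python
# Worked example from the text: X = cases under treatment,
# Y = cases under control, Z = people in the population.
X, Y, Z = 50, 1000, 1_000_000

# Same numerator, different denominators:
relative_risk_reduction = (Y - X) / Y  # the "95%" headline number
absolute_risk_reduction = (Y - X) / Z  # about a tenth of one percent

print(relative_risk_reduction)   # 0.95
print(absolute_risk_reduction)   # 0.00095
```

The gap between the two numbers is entirely driven by how rare the event is in the study population, which is why the choice of which one to report matters rhetorically.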

What do I think about the above-linked article? It’s basically rehashing the wikipedia page. I’m not talking about Weggy-style plagiarism here, just that this is standard material. There’s nothing wrong with publishing standard material if people need to be reminded of some important message.

That said, I don’t think this article is particularly useful. I don’t think it conveys an important reason for parameterizing in terms of relative risk, which is that it can be much more stable than absolute risk. If the vaccine prevents 95% of infections, then that’s 95% of however many occur, which is a more relevant number than comparing to how many happened to occur in the control group in some particular study. Conversely, if that 95% is not a stable number, so that the vaccine prevents a greater proportion of infections in some settings and a lesser proportion in others, then we’d want to know that—this is an interaction—but, again, the absolute risk difference isn’t so relevant. Absolute risk does matter in some settings—for example, we wouldn’t be so interested in a drug that prevents 50% of cases in a disease that only affects 2 people in the world (unless knowing this would give us a clue of how to treat other diseases that have higher prevalence), but of course coronavirus is not a rare disease. Presumably the rate of infection was so low in those studies only because the participants were keeping pretty careful, but the purpose of the vaccines is to give it to everyone so we don’t have to go around keeping so careful.

So, unless there’s something I’m missing (that happens!), I disagree with the claim in the linked paper that failures to report absolute risk reduction “mislead and distort the public’s interpretation of COVID-19 mRNA vaccine efficacy and violate the ethical and legal obligations of informed consent.”

P.S. I’d never heard the term “foxed” before. From the above message, I’m guessing that it means “confused.” I googled the word, and according to the internet dictionary, it means “deceived; tricked,” but that doesn’t seem to fit the context.

When can we challenge authority with authority?

Michael Nelson writes:

I want to thank you for posting your last decade of publications in a single space and organized by topic. But I also wanted to share a critique of your argument style as exemplified in your Annals of Surgery correspondence [here and here]. While I think it’s important and valuable that you got the correct reasoning against post hoc power analysis on the record, I don’t think there was ever much of a chance that a correct argument was going to change the authors’ convictions significantly. Their belief was not a result of a logical mistake and so could not be undone by logic; they believed it because it was what they were originally taught and/or picked up from mentors and colleagues. I suggest that the most effective way to get scientists to change their practices, or at least to withdraw their faulty arguments, is to challenge authority with authority.

What if, after you present your rational argument, you then say something like: “I know this is what you were taught, as were many of my own very accomplished colleagues, but a lot of things are taught incorrectly in statistics (citations). However, without exception, every single one of the current, most-respected authorities in statistics and methodology (several recognizable names) agree that post hoc power analysis (or whatever) does not work, for precisely the reasons I have given. More importantly, their arguments and demonstrations to this effect have been published in the most authoritative statistical journals (citations) and have received no notable challenges from their fellow experts. Respectfully, if you are confident that your argument is indeed valid, then you have outwitted the best of my field. You are compelled by professional ethics to publicize your breakthrough proof in these same journals, at conferences of quantitative methodologists (e.g., SREE) and any other venue that may reach the top statistical minds in the social sciences. If correct, you will be well-rewarded: you’ll instantly become famous (at least among statisticians) for overturning points that have long been thought mathematically and empirically proven.” In short, put up or shut up.

My reply:

That’s an interesting idea. It won’t work in all cases, as often it’s a well-respected authority making the mistake: either the authority figure is making the error himself, or a high-status researcher is making the error based on respected literature. So in that case the appeal to authority won’t work, as these people are the authority on their fields. Similarly, we can’t easily appeal to authority to talk people out of naive and way-wrong interpretations of significance tests and p-values, as these mistakes are all over the place in textbooks. But on the occasions where someone is coming out of the blue with a bad idea, yeah, then it could make sense to bring in consensus as one of our arguments.

Of course, in some way whenever I make an argument under my own name, I’m challenging authority with authority, in that I bring to the table credibility based on my successful research and textbooks. I don’t usually make this argument explicitly, as there are many sources of statistical authority (see section 26.2 of this paper), but I guess it’s always there in the background.

What’s the biggest mistake revealed by this table? A puzzle:

This came up in our discussion the other day:

It’s a table comparing averages for treatment and control groups in an experiment. There’s one big problem here (summarizing differences by p-values) and some little problems, such as reporting values to ridiculous precision (who cares if something has an average of “346.57” when its standard deviation is 73?) and an error (24.19% where it should be 22%).

There’s one other thing about the table that bothers me. It’s not a problem in the table itself; rather, it’s something that the table reveals about the data analysis.

Do you see it?

Answer is in the comments.

Why did it take so many decades for the behavioral sciences to develop a sense of crisis around methodology and replication?

“On or about December 1910 human character changed.” — Virginia Woolf (1924).

Woolf’s quote about modernism in the arts rings true, in part because we continue to see relatively sudden changes in intellectual life, not merely from technology (email and texting replacing letters and phone calls, streaming replacing record sales, etc.) and power relations (for example arising from the decline of labor unions and the end of communism) but also ways of thinking which are not exactly new but seem to take root in a way that had not happened earlier. Around 1910, it seemed that the literary and artistic world was ready for Ezra Pound, Pablo Picasso, Igor Stravinsky, Gertrude Stein, and the like to shatter old ways of thinking, and (in a much lesser way) the behavioral sciences were upended just about exactly 100 years later by what is now known as the “replication crisis” . . .

The above is from a new paper with Simine Vazire. Here’s the abstract:

For several decades, leading behavioral scientists have offered strong criticisms of the common practice of null hypothesis significance testing as producing spurious findings without strong theoretical or empirical support. But only in the past decade has this manifested as a full-scale replication crisis. We consider some possible reasons why, on or about December 2010, the behavioral sciences changed.

You can read our article to hear the full story. Actually, I think our main point here is to raise the question; it’s not like we have such a great answer. And I’m sure lots of people have raised the question before. Some questions are just worth asking over and over again.

Our article is a discussion of “What behavioral scientists are unwilling to accept,” by Lewis Petrinovich, for the Journal of Methods and Measurement in the Social Sciences, edited by Alex Weiss.

State-level predictors in MRP and Bayesian prior

Something came up in comments today that I’d like to follow up on.

In our earlier post, I brought up an example:

If you’re modeling attitudes about gun control, think hard about what state-level predictors to include. My colleagues and I thought about this a bunch of years ago when doing MRP for gun-control attitudes. Two natural state-level predictors are Republican vote share and percent rural. These variables are also highly correlated. But look at Vermont: it’s one of the most Democratic states and also the most rural. Vermont also is a small state, so the MRP inference for Vermont will depend strongly on the fitted model, which in turn will depend strongly on the coefficients for R vote and %rural. I think you’ll need some strong priors here to get a stable answer. Default flat priors could mess you up. You might not realize the problem when fitting to one particular dataset, but you’ll be getting a really noisy answer.

To elaborate, consider the following 4 MRP models:

1. No state-level predictors. This is bad because for states without much data, estimates are pooled toward the national mean. This is clearly the wrong thing to do for Montana, say, as Montana is a small state where attitudes toward gun control are probably very far from the national average.

2. Republican vote share as a state-level predictor. Now the estimates are pooled toward the state-level regression model, which will estimate a negative effect of state-level Republican vote on gun control attitude. This will do the right thing for Montana. This model will partially pool Vermont to other strongly Democratic states such as California, Maryland, and Hawaii.

3. Percent rural as a state-level predictor. This should do ok also, with voters in more rural states being less likely to support gun control, and it should also do the right thing for Montana. This model will partially pool Vermont to other rural states such as Montana, North Dakota, and Mississippi.

4. Include Republican vote share and percent rural as two state-level predictors. This will do the right thing again for Montana and will split the difference on Vermont. The estimate for Vermont should also have a higher uncertainty, reflecting inferential uncertainty about the relative importance of the two state-level predictors.

Of all of these options, I think #4 is the best—but this presupposes that I’m willing to use strong priors to control the estimation. If I’m only allowed to use weak priors—or, worse, not allowed to use any priors at all—then #4 could give very noisy results.

This is where Bayes comes in. It’s not just that Bayes allows the use of prior information. It’s also that, by allowing the use of prior information, Bayes also opens the door to including more information into the model in the form of predictors.

To put it another way, models 1, 2, and 3 above are all special cases of model 4, but with very strong priors where one or two of the coefficients are assumed to be exactly zero. Paradoxically, putting Bayesian priors (or, more generally, some regularization) in model 4 allows us to fit a bigger, more general model than would otherwise be realistically possible.
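To make this concrete, here is a small numerical sketch (simulated data with made-up numbers, not our actual gun-control analysis): with normal priors, the MAP estimate of the regression has a closed form, and shrinking the prior scale of one coefficient toward zero collapses model 4 into the smaller model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated state-level data: two highly correlated predictors,
# standing in for Republican vote share and percent rural.
n = 50
rep_vote = rng.normal(0, 1, n)
pct_rural = 0.9 * rep_vote + 0.1 * rng.normal(0, 1, n)
X = np.column_stack([rep_vote, pct_rural])
y = -0.5 * rep_vote - 0.5 * pct_rural + rng.normal(0, 0.5, n)

def map_estimate(X, y, prior_scales, sigma=0.5):
    """MAP estimate for y ~ N(X beta, sigma^2) with beta_j ~ N(0, prior_scales[j]^2)."""
    D = np.diag(1.0 / np.asarray(prior_scales) ** 2)
    A = X.T @ X / sigma**2 + D
    return np.linalg.solve(A, X.T @ y / sigma**2)

# Model 4: both predictors, with informative but not dogmatic priors.
b4 = map_estimate(X, y, prior_scales=[1.0, 1.0])

# "Model 2" as a special case of model 4: a near-zero prior scale
# on percent rural pins that coefficient at essentially zero.
b2 = map_estimate(X, y, prior_scales=[1.0, 1e-6])

print(b4)  # both coefficients contribute
print(b2)  # second coefficient shrunk to ~0
```

The same trick, with the prior scales for both coefficients driven to zero, recovers model 1.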

Some issues when using MRP to model attitudes on a gun control attitude question on a 1–4 scale

Elliott Morris writes:

– I want to run a MRP model predicting 4 categories of response options to a question about gun control (multinomial logit)

– I want to control for demographics in the standard hierarchical way (MRP)

– I want the coefficients to evolve in a random walk over time, as I have data from multiple weeks (dynamic)

Do you know of any example Stan code that does this? So far I have been accomplishing this by interacting linear and quadratic terms for time with all my demographic controls, but that seems like a hack given what we could do with a gaussian process. The folks who made the dgo package for R seemed to have figured this out, but not for multinomial models!

My reply:

I’d recommend a linear model rather than a multinomial logit. With 4 categories I think the linear model should work just fine, and then you can focus on the more important parts of the model rather than having the link function be a hangup. You can always go back later and look into the discreteness of the data if you’d like.
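To illustrate with fake data (this is not Morris’s survey, and all the numbers are invented): code the four response categories as the numbers 1 through 4 and fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a 1-4 attitude item and a single demographic predictor.
n = 500
x = rng.normal(0, 1, n)
latent = 2.5 + 0.8 * x + rng.normal(0, 1, n)
y = np.clip(np.round(latent), 1, 4)  # observed response, coded 1-4

# Treat the 1-4 outcome as numeric and fit ordinary least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # intercept near the middle of the scale, positive slope
```

An ordered-categorical fit on the same data is the natural later check on whether the discreteness matters.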

Also, if you’re modeling attitudes about gun control, think hard about what state-level predictors to include. My colleagues and I thought about this a bunch of years ago when doing MRP for gun-control attitudes. Two natural state-level predictors are Republican vote share and percent rural. These variables are also highly correlated. But look at Vermont: it’s one of the most Democratic states and also the most rural. Vermont also is a small state, so the MRP inference for Vermont will depend strongly on the fitted model, which in turn will depend strongly on the coefficients for R vote and %rural. I think you’ll need some strong priors here to get a stable answer. Default flat priors could mess you up. You might not realize the problem when fitting to one particular dataset, but you’ll be getting a really noisy answer.

P.S. These issues would arise for any survey estimation procedure, not just MRP. MRP just makes the whole thing more transparent.

P.P.S. More here.

Understanding the value of bloc voting, using the Congressional Progressive Caucus as an example:

Daniel Stock writes:

I’m a public policy PhD student, interested in economic policy and a bit of political science.

I recently saw that the Congressional Progressive Caucus instituted bloc voting rules a few months ago: if at least two thirds of them agree on a bill or amendment, then all CPC members are bound to vote for it.

This got me thinking: how well would this bloc work, given the observed voting records / preferences of the members? There must be a sweet spot: it is irrelevant if (i) none of the members ever agree, or (ii) the members always agree. It might even be possible to calculate which member is expected to benefit the most (AOC?), in terms of winning the most extra votes to her side (and being forced to flip on the fewest votes).

I googled “optimal voting bloc” and your 2003 paper came up. Do you know if this is still an active / promising area of study, especially on the empirics side? It seems like most papers are about detecting blocs, rather than evaluating them.

This is a great research topic! I haven’t thought about coalitions in a while, but my guess is that this is a good area for further study. One thing I hadn’t considered at all in my earlier work was anything like this two-thirds rule. There seems to be room for both theoretical and empirical work here.

One lesson from my 2003 paper is that you can’t think about the effects of coalitions in isolation. The left-wing congressmembers have formed this caucus, but there are also conservative and moderate caucuses. I don’t know if these other groups have this sort of formal bloc-voting rule, but they must do some coordination.
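As a starting point for the empirical side, here’s a toy simulation of the two-thirds rule (all numbers hypothetical; I’m treating members’ votes as independent coin flips, which real roll-call data certainly are not):

```python
import random

random.seed(2021)

N_MEMBERS = 95     # roughly the size of the caucus; hypothetical
N_BILLS = 1000
THRESHOLD = 2 / 3  # support level at which the bloc binds everyone

binding_bills = 0
total_flips = 0

for _ in range(N_BILLS):
    p = random.random()  # bill-specific popularity within the caucus
    votes = [random.random() < p for _ in range(N_MEMBERS)]
    if sum(votes) / N_MEMBERS >= THRESHOLD:
        binding_bills += 1
        total_flips += votes.count(False)  # dissenters forced to vote yes

print(binding_bills, total_flips)  # how often the rule binds, and at what cost
```

A natural next step would be to replace the coin flips with member-specific ideal points estimated from roll-call votes.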

Anyway, if any of you have suggestions on where to go with this research project, just let us know in the comments.

Adjusting for differences between treatment and control groups: “statistical significance” and “multiple testing” have nothing to do with it

Jonathan Falk points us to this post by Scott Alexander entitled “Two Unexpected Multiple Hypothesis Testing Problems.” The important questions, though, have nothing to do with multiple hypothesis testing or with hypothesis testing at all. As is often the case, certain free-floating scientific ideas get in the way of thinking about the real problem.

Alexander tells the story about a clinical trial of a coronavirus treatment following up on a discussion by Lior Pachter. Here’s Alexander:

The people who got the Vitamin D seemed to do much better than those who didn’t. But there was some controversy over the randomization . . . they checked for fifteen important ways that the groups could be different, and found they were only significantly different on one – blood pressure.

The table doesn’t show continuous blood pressure measurements, but it says that, among the 50 people in the treatment group, 11 had previous high blood pressure, whereas of the 26 people in the control group, 15 had high blood pressure. That’s 22% with previous high blood pressure in the treatment group, compared to 58% in the control group.

Just as an aside, I can’t quite figure out what’s going on in the table, where the two proportions are reported as 24.19% and 57.69%. I understand where the 57.69% comes from: it’s 15/26 to four decimal places. But I don’t know how they got 24.19% from 11/50. I played around with some other proportions (12/49, etc.) but couldn’t quite come up with 24.19%. I guess it doesn’t really matter, though.
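For what it’s worth, the arithmetic is easy to check:

```python
# Counts quoted above: 11 of 50 in the treatment group and 15 of 26
# in the control group had previous high blood pressure.
treatment = 11 / 50
control = 15 / 26

print(round(100 * treatment, 2))  # 22.0 -- not the 24.19% in the table
print(round(100 * control, 2))    # 57.69 -- matches the table
```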

Anyway, Alexander continues:

Two scientists who support this study say that [the imbalance in blood pressure comparing the two groups] shouldn’t bother us too much. They point out that because of multiple testing (we checked fifteen hypotheses), we need a higher significance threshold before we care about significance in any of them, and once we apply this correction, the blood pressure result stops being significant.

Alexander then writes, and I agree with him:

Come on! We found that there was actually a big difference between these groups! You can play around with statistics and show that ignoring this difference meets certain formal criteria for statistical good practice. But the difference is still there and it’s real. For all we know it could be driving the Vitamin D results.

Or to put it another way – perhaps correcting for multiple comparisons proves that nobody screwed up the randomization of this study; there wasn’t malfeasance involved. But that’s only of interest to the Cordoba Hospital HR department when deciding whether to fire the investigators. If you care about whether Vitamin D treats COVID-19, it matters a lot that the competently randomized, non-screwed up study still coincidentally happened to end up with a big difference between the two groups. It could have caused the difference in outcome.

Well put.

The thing that Alexander doesn’t seem to fully realize is that there is an accepted method in statistics to handle this. What you do is fit a regression model on the outcome of interest, adjusting for important pre-treatment predictors. Such an analysis is often described as “controlling” for the predictors, but I prefer to reserve “control” for when the variables are actually being controlled and to use “adjust” for adjustments in the analysis.

Alexander does allude to such an analysis, writing:

Although the pre-existing group difference in blood pressure was dramatic, their results were several orders of magnitude more dramatic. The paper Pachter is criticizing does a regression to determine whether the results are still significant even controlling for blood pressure, and finds that they are. I can’t see any problem with their math, but it should be remembered that this is a pretty desperate attempt to wring significance out of a small study, and it shouldn’t move our needle by very much either way.

I disagree! Adjusting for pre-treatment differences is not a “desperate” strategy. It’s standard statistics (covered, for example, in chapter 19 of Regression and Other Stories, but it’s an old, old method; we didn’t come up with it, I’m just citing our book as a textbook presentation of this standard approach), nothing desperate at all. Also, no need to “wring significance” out of anything. The point is to summarize the evidence in the study. The adjusted analysis should indeed “move our needle” to the extent that it resolves concerns about imbalance. In this case the data are simple enough that you could just show a table of outcomes for each of the four categories: treatment or control, crossed with high or low blood pressure. I guess I’d prefer to use blood pressure as a continuous predictor, but that’s probably not such a big deal here.
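To be clear about what such a table would look like: the margins below (50 treated with 1 ICU admission and 11 with high blood pressure; 26 controls with 13 ICU admissions and 15 with high blood pressure) come from the posts quoted above, but the way the ICU cases split across blood-pressure strata is invented for illustration, since only the margins are reported.

```python
# Hypothetical cross-tabulation: (n, icu_admissions) for each cell.
# Only the margins match the study; the within-stratum splits are made up.
cells = {
    ("treatment", "high_bp"): (11, 1),
    ("treatment", "low_bp"): (39, 0),
    ("control", "high_bp"): (15, 9),
    ("control", "low_bp"): (11, 4),
}

for bp in ("high_bp", "low_bp"):
    n_t, icu_t = cells[("treatment", bp)]
    n_c, icu_c = cells[("control", bp)]
    print(f"{bp}: treatment {icu_t}/{n_t} = {icu_t / n_t:.0%}, "
          f"control {icu_c}/{n_c} = {icu_c / n_c:.0%}")
```

If the treatment-control gap holds up within both strata, the blood-pressure imbalance can’t be the whole story.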

Multiple comparisons and statistical significance never came up. The other thing is that you shouldn’t just adjust for blood pressure. It would be better to combine the pre-treatment indicators in some reasonable way and adjust for all of them. There’s a big literature on all of this, and not always clear agreement on what to do, so I’m not saying it’s easy. As Alexander notes, in a randomized trial any such adjustment will matter more when the sample size is small. That’s just the way it goes. I read Pachter’s linked post, and there he says that the experiment was originally designed as a pilot study, but then the results were so stunning that the researchers decided to share them right away, which seems fair enough.

Pachter summarizes as follows:

As for Vitamin D administration to hospitalized COVID-19 patients reducing ICU admission, the best one can say about the Córdoba study is that nothing can be learned from it.

And here’s his argument:

Unfortunately, the poor study design, small sample size, availability of only summary statistics for the comorbidities, and imbalanced comorbidities among treated and untreated patients render the data useless. While it may be true that calcifediol administration to hospital patients reduces subsequent ICU admission, it may also not be true.

I see his point regarding small sample size and data availability, but the concern about imbalanced comorbidities among treated and untreated patients . . . that can be adjusted for. I can see him saying the evidence isn’t as clear as is claimed, and there are always possible holes in any study, but is it really true that nothing can be learned from the study? The headline result, “only 1/50 of the treated patients was admitted to the ICU, whereas 13/26 of the untreated patients were admitted,” seems pretty strong.

I didn’t quite follow Pachter’s argument regarding poor study design. He says that admission to the intensive care unit could be based in part on adjustment for pre-treatment conditions. It could be that more careful adjustment would change the result, so it does seem like, as always, it would be better if the data could be made public in some form. Or maybe there’s an issue of information leakage so that the ICU assignment was made with some knowledge of who got the treatment? In any case, lots more will be learned from larger studies to come.

“Statistical significance” and “multiple testing” have nothing to do with it

But here’s the point. All this discussion of p-values and multiple comparisons adjustments is irrelevant. As Alexander says, to the extent that imbalance between treatment and control groups is a problem, it’s a problem whether or not this imbalance is “statistically significant,” however that is defined. The relevant criticisms of the study would be that the adjustment was done poorly, that the outcome measure is irrelevant to ultimate outcomes, that there was information leakage, or that Vitamin D creates other risks so the net benefits could be negative (all points I took from the above-linked posts). The discussions of statistical significance and testing and p-values and all the rest have nothing to do with any of this. So it’s frustrating to me that so much of Pachter’s and Alexander’s discussions focus on these tangential issues. Reading their posts, I keep seeing them drift toward the interesting questions and then spring back to these angels-on-a-pin probability calculations. Really the point of this post is to say: Hey, focus on the real stuff. The point of statistics is to allow non-statisticians to focus on science and decision making, not to draw them into a vortex of statistics thinking!

P.S. I just noticed that Pachter’s post that Alexander is reacting to is from Nov 2020. Pachter links to this clinical trial with 2700 patients scheduled to be completed at the end of June 2021, so I guess then we’ll know more.

This one’s for all the Veronica Geng fans out there . . .

I recently read Joseph Lanza’s excellent book from 1994, “Elevator Music: A Surreal History of Musak, Easy-Listening, and Other Moodsong.” I’ll have more to say about this book in a future post, but for now I just had to share this bit I noticed on page 53:

Lyndon Baines Johnson owned Muzak franchises in Austin during his early senatorial days.

Wow. Just wow. If Veronica Geng had only known this when writing her hilarious story. You know the one. Just think how Johnson could’ve riled up Shaw by piping Muzak into his bedroom. In true LBJ-as-recounted-by-Geng style, the senator would’ve put the volume control in some hard-to-reach place behind the sofa so that the elderly playwright would’ve had to contort himself to turn it down, making him feel even more foolish.

StanConnect 2021: Call for Session Proposals

Back in February it was decided that this year’s StanCon would be a series of virtual mini-symposia with different organizers instead of a single all-day event. Today the Stan Governing Body (SGB) announced that submissions are now open for anyone to propose organizing a session. Here’s the announcement from the SGB on the Stan forums: 

Following up on our previous announcement, the SGB is excited to announce a formal call for proposals for StanConnect 2021.

StanConnect is a virtual miniseries that will consist of several 3-hour meetings/mini-symposia. You can think of each meeting as a kind of organized conference “session.”

  • Anyone can feel free to organize a StanConnect meeting as a “Session Chair”. Simply download the proposal form as a docx, fill it out, and submit it to the SGB via email by April 26, 2021 (New York). The meeting must be scheduled for sometime this year after June 1.
  • The talks must involve Stan and be focused around a subject/topic theme. E.g. “Spatial models in Ecology via Stan”.
  • You will see that though we provide a few “templates” for how to structure a StanConnect meeting, we are trying to avoid being overly prescriptive. Rather, we are giving Session Chairs freedom to invite speakers related to their theme and structure the 3-hr meeting as they see fit.
  • If you have any questions, please feel free to post here.

I wasn’t involved in the decision to change the format, but I really like the idea of a virtual miniseries. I thought the full-day StanCon 2020 was great, but one nearly 24-hour global virtual conference feels like enough. And hopefully having a bunch of separately organized events will give more people a chance to get involved with Stan, whether as organizers, speakers, or attendees.

Discuss our new R-hat paper for the journal Bayesian Analysis!

Here’s your opportunity:

We welcome public contributions to the Discussion of the manuscript Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC by A. Vehtari, A. Gelman, D. Simpson, B. Carpenter and P. C. Bürkner, which will be featured as a Discussion Paper in the June 2021 issue of the journal. You can find the manuscript in the Advance publication section of the journal website. The contributions should be no more than two pages in length, using the BA latex style and should be submitted to the journal using the Electronic Journal Management System (EJMS) submission page, before May 10th, 2021. An announcement for the public Webinar presentation will follow. As a reminder, all BA Discussion Webinars can be viewed on the ISBA YouTube channel.

I’m very happy with this article. It takes the basic principle from our 1992 paper on R-hat and improves it in various ways. I also recommend you read the followup paper by Ben Lambert and Aki Vehtari on R*, which is a multivariate mixing statistic that uses nonparametric clustering, again, following the basic idea of comparing individual to mixed chains, but in a more general way.
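For readers who want the basic idea in code, here is the classic between/within comparison from the 1992 paper (a sketch of plain R-hat only; the rank-normalization, folding, and localization improvements in the new paper are not shown):

```python
import numpy as np

def rhat(chains):
    """Classic potential scale reduction factor for an (m chains, n draws) array."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)
mixed = rng.normal(size=(4, 1000))               # four well-mixed chains
stuck = mixed + np.array([[0], [0], [0], [3]])   # one chain off on its own

print(rhat(mixed))  # close to 1
print(rhat(stuck))  # well above 1: the chains disagree
```

The core move in both the old and new versions is the same: compare each chain to the pooled draws and flag disagreement.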

The 2019 project: How false beliefs in statistical differences still live in social science and journalism today

It’s the usual story. PNAS, New York Times, researcher degrees of freedom, story time. Weakliem reports:

[The NYT article] said that a 2016 survey found that “when asked to imagine how much pain white or black patients experienced in hypothetical situations, the medical students and residents insisted that black people felt less pain.” I [Weakliem] was curious about how big the differences were, so I read the paper.

Clicking through to read the research article, which was published in PNAS, I did not find any claim that the medical students and residents insisted that black people felt less pain.

I did see this sentence from the PNAS article: “Specifically, we test whether people—including people with some medical training—believe that black people feel less pain than do white people.” But, as Weakliem finds out, it turns out there were no average differences:

Medical students who had a high number of false beliefs rated the white cases as experiencing more pain; medical students who had a low number of false beliefs rated the black cases as experiencing more. High and low were defined relative to the mean, so that implied that medical students with average numbers of false beliefs rated the black and white cases about the same.

The authors included their data as a supplement to the article, so I [Weakliem] downloaded it and calculated the means. The average rating for the black cases was 7.622, on a scale of 1-10, while the average rating for the white cases was 7.626—that is, almost identical. The study also asked how the different cases should be treated—135 gave the same recommendation for both of their cases, 40 recommended stronger medication for their white case, and 28 for their black case. Since the total distribution of conditions was the same for the black and white cases, this means that in this sample, treatment recommendations were different for blacks and whites. However, the difference was not statistically significant at conventional levels (p is about .14)—that is, the sample difference could easily have come up by chance.

So you could conclude that, in this sample, there is no evidence that medical students rate the pain of blacks and whites differently, but perhaps some evidence that they treat white pain more aggressively. (If you just went by statistical significance, you would accept the hypothesis that they treat hypothetical black and white cases the same, but a more sensible conclusion would be that you should collect more data). The paper, however, didn’t do this. . . .

Hey, this is a big fat researcher degree of freedom! The authors of this paper easily could’ve summarized their results as, “White people no longer believe black people feel less pain than do white people.” That could’ve been the title of the PNAS article. And then the New York Times article could’ve been, “Remember that finding that white people believe that black people feel less pain? It’s no longer the case.”

OK, I guess not, as PNAS would never have published such a paper. The interaction between beliefs of physical differences, beliefs about pain, and attitudes toward pain treatment—that’s what made the paper publishable. Unfortunately, the patterns that were found could be explainable by noise—but, no problem, there were enough statistical knobs to be turned that the researchers could find statistical significance and declare a win. At that point, maybe they felt that going back and reporting, “No average difference,” would ruin their story.

Weakliem summarizes:

The statement that “the medical students and residents insisted that black people felt less pain” is false: they rated black and white pain as virtually equal. I [Weakliem] don’t blame Villarosa [the author of the NYT article] for that—the way it was written, I could see how someone would interpret the results that way. I don’t really blame the authors either—interaction effects can be confusing. I would blame the journal (PNAS) for (1) not asking the authors to show means for the black and white examples as standard procedure and (2) not getting reviewers who understand interaction effects.

I don’t know if I agree with Weakliem in letting the authors of these articles off the hook. The NYT article did misrepresent the claims in the PNAS article; the PNAS article did come in to test a hypothesis and then never report the result of that test; so both these articles failed their readers, at least regarding this particular claim. Indeed, the title of the NYT article is, “Myths about physical racial differences were used to justify slavery — and are still believed by doctors today”—a message that is completely changed by reporting that the PNAS study found no average belief in pain differences.

As for PNAS, I think it’s too much to expect they can find reviewers who understand interaction effects—that’s really complicated—and I guess it’s too much to expect that they would turn down an article that fits their political preconceptions. But, jeez, can’t they at least be concerned about data quality? Study 1 was based on 121 participants on Mechanical Turk. Study 2 was based on 418 medical students at a single university. I can see the rationale for Study 2—medical students grow up and become doctors, so we should be concerned about their views regarding medical treatment. But I can’t see how it can be considered scientifically legitimate to take data from 121 Mechanical Turk participants and report them in the abstract of the paper as telling us something about “a substantial number of white laypeople.” You don’t need to understand interaction effects to see the problem here; you just need to stop drinking the causal-identification Kool-Aid (the attitude by which any statistically significant difference is considered to represent some true population effect, as long as it is associated with a randomized treatment assignment, instrumental variable analysis, or regression discontinuity).

“America Has a Ruling Class”

I agree with these points made by Samuel Goldman:

America’s most powerful people have a problem. They can’t admit that they’re powerful.

Take Andrew Cuomo. On a recent call with reporters, the embattled Mr. Cuomo insisted that he was “not part of the political club.” The assertion was confounding because Mr. Cuomo is in his third term as governor of New York — a position his father also held for three terms. Mr. Cuomo has also served as state attorney general and as secretary of the Department of Housing and Urban Development. . . .

This sort of false advertising isn’t limited to Democrats. Senator Josh Hawley of Missouri, for instance, has embraced an image as a populist crusader against a distant “political class.” He does not emphasize his father’s career as a banker, his studies at Stanford and Yale Law School, or his work as clerk to prominent judges, including Chief Justice John Roberts. The merits of Mr. Hawley’s positions are open to debate. But his membership in the same elite that he rails against is not.

And it’s not only politicians. Business figures love to present themselves as “disrupters” of stagnant industries. But the origins of the idea are anything but rebellious. Popularized by a Harvard professor and promoted by a veritable industry of consultants, it has been embraced by some of the richest and most highly credentialed people in the world. . . .

Part of the explanation is strategic. An outsider pose is appealing because it allows powerful people to distance themselves from the consequences of their decisions. When things go well, they are happy to take credit. When they go badly, it’s useful to blame an incompetent, hostile establishment for thwarting their good intentions or visionary plans.

Another element is generational. Helen Andrews argues that baby boomers have never been comfortable with the economic, cultural and political dominance they achieved in the 1980s. “The rebels took over the establishment,” she writes, “only they wanted to keep preening like revolutionaries as they wielded power.” . . .

America has a de facto ruling class. Since World War II, membership in that class has opened to those with meritocratic credentials. But that should not conceal the truth that it remains heavily influenced by birth. Even if their ancestors were not in The Social Register, Mr. Cuomo, Ms. Haines and Mr. Hawley were born to families whose advantages helped propel their careers. Admitting the fact of noblesse might help encourage the ideal of oblige.

But there’s a limit to what can be accomplished by exhortation. Ultimately, the change must come from the powerful themselves. Just once, I’d like to hear a mayor, governor or president say: “Yes, I’m in charge — and I’ve been trying to get here for my entire life. I want you to judge me by how I’ve used that position, not by who I am.”

I’m reminded of our discussion a few years ago about so-called rogue academics.

As Ben Folds put it, it’s no fun to be the man.

I disagree with one part of Goldman’s op-ed, though, and that’s where he seems to argue that a powerful person can’t be oppositional. For example, after quoting director of national intelligence Avril Haines as saying “I have never shied away from speaking truth to power,” Goldman says, “that is a curious way of describing a meteoric career . . .” But there’s no reason she can’t have been speaking truth to power from the inside. Saying you speak truth to power is not the same as claiming outsider status.

I say this as someone who, like Avril Haines, Josh Hawley, and Steven Levitt, is well connected but still sometimes does things that antagonize powerful people. The point is not that it’s a virtue to be a “rogue” or whatever, just that our actions should be evaluated on their merits. And, as Goldman says, we should be open about the advantages we’ve had.
