
Can you trust international surveys? A follow-up:

Michael Robbins writes:

A few years ago you covered a significant controversy in the survey methods literature about data fabrication in international survey research. Noble Kuriakose and I put out a proposed test for data quality.

At the time there were many questions raised about the validity of this test. As such, I thought you might find a new article [How to Get Better Survey Data More Efficiently, by Mollie Cohen and Zach Warner] in Political Analysis of significant interest. It provides pretty strong external validation of our proposal and also a very helpful guide to the effectiveness of different strategies for detecting fabrication and low-quality data in international survey research.

I’ve not read this new paper—it’s hard to find the time, what with writing 400 blog posts a year etc!—but I like the general idea of developing statistical methods to check data quality. Data collection and measurement are not covered enough in our textbooks or our research articles—think about all those papers that never tell you the wording of their survey questions! And remember that notorious Lancet Iraq study, or that North-Korea-is-a-democracy study.

We’re hiring (in Melbourne)

Andrew, Qixuan and I (Lauren) are hiring a postdoctoral research fellow to explore research topics around the use of multilevel regression and poststratification (MRP) with non-probability surveys. This work is funded by the National Institutes of Health, and is collaborative work with Prof Andrew Gelman (Statistics and Political Science, Columbia University) and Assoc/Prof Qixuan Chen (Biostatistics, Columbia University).

My hope with this work is that we will develop functional validation techniques when using MRP with non-probability (or convenience) samples. Work will focus on considering the theoretical framing of MRP, as well as various validation tools like cross-validation with non-representative samples. Interested applicants need not have a background in MRP, but experience with multilevel models and/or survey weighting methods would be desirable. The team works primarily within an R language framework.
Interested applicants can apply at or contact the Monash chief investigator Dr Lauren Kennedy ( for further information. I should add that the successful applicant must have relevant work rights in Australia (that one’s from HR, and to do with the whole covid/travelling situation).

Hierarchical modeling of excess mortality time series

Elliott writes:

My boss asks me:

For our model to predict excess mortality around the world, we want to calculate a confidence interval around our mean estimate for total global excess deaths. We have real excess deaths for like 60 countries, and are predicting on another 130 or so. we can easily calculate intervals for any particular country. However, if we simulate each country independently, then the confidence interval surrounding the global total will be tiny, and incorrect, because you’ll never get simulations where like 160 countries are all off in the same direction. We need some way to estimate how errors are likely to be correlated between countries. What would Andrew Gelman recommend?

I [Elliott] shot back:

I’ll ask. He is going to recommend a hierarchical model, where you model excess deaths as a function of a global time trend, country-level intercept, country-level time trend and country-level covariates. Something like:

deaths ~ day + day^2 + (1 + day + day^2 | country) + observed_cases + observed_deaths + (observed_deaths | country)

Oh, residual excess deaths are definitely not a quadratic function, now that I think about it. Probably cubic or ^4. But you get the idea.

In brms, you could also do splines:

deaths ~ s(day) + s(day, by="country") + (1 | country) + ....

Then, you take the 95% CI [actually, posterior interval, not confidence interval — AG] from the posterior draws.

Otherwise, you can derive the country-by-country covariance matrix of day-level predicted excess deaths and simulate from a multivariate normal distribution (are excess deaths normally distributed? Maybe lognormal), then grab the CI off of that.

Yes, this all sounds vaguely reasonable. But definitely do the spline, not the polynomial. You should pretty much never be doing the polynomial.

I’d also recommend taking a look at the work of Leontine Alkema on Bayesian modeling of vital statistics time series.
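To see why the cross-country correlation matters so much for the global interval, here is a toy simulation (all numbers invented; this is a sketch, not the actual model) comparing the interval for a global total when country errors are independent versus when they share a global shock:

```python
import numpy as np

# Toy illustration: interval width for a global total under independent
# versus correlated country-level errors.
rng = np.random.default_rng(42)

n_countries = 160
n_sims = 10_000
mean = np.full(n_countries, 100.0)  # hypothetical mean excess deaths per country
sd = 20.0

# Independent errors: country draws are uncorrelated, so they average away.
indep = rng.normal(mean, sd, size=(n_sims, n_countries)).sum(axis=1)

# Correlated errors: a shared global shock induces correlation rho between countries.
rho = 0.5
shared = rng.normal(0.0, 1.0, size=(n_sims, 1))          # global shock
idio = rng.normal(0.0, 1.0, size=(n_sims, n_countries))  # country-specific noise
corr = (mean + sd * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * idio)).sum(axis=1)

def width(x):
    return np.quantile(x, 0.975) - np.quantile(x, 0.025)

print(f"independent 95% interval width: {width(indep):,.0f}")
print(f"correlated  95% interval width: {width(corr):,.0f}")
# The correlated interval is far wider: shared shocks don't average away
# across countries, which is exactly the boss's concern above.
```

A hierarchical model with a global time trend produces this correlation automatically, because uncertainty about the global trend is shared by every country's prediction.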

The Tall T*

Saw The Tall T on video last night. It was ok, but when it comes to movies whose titles begin with that particular character string, I think The Tall Target was much better. I’d say The Tall Target is the best train movie ever, but maybe that title should go to Intolerable Cruelty, which was much better than Out of Sight, another movie featuring George Clooney playing the George Clooney character. As far as Elmore Leonard adaptations go, I think The Tall T was much better.

One reason why that estimated effect of Fox News could’ve been so implausibly high.

Ethan Kaplan writes:

I just happened upon a post of yours on the potential impact of Fox News on the 2016 election [“No, I don’t buy that claim that Fox news is shifting the vote by 6 percentage points“]. I am one of the authors of the first Fox News study from 2007 (DellaVigna and Kaplan). My guess is that the reason the Martin and Yurukoglu estimates are so large is that their instrument is identified off a group of people who watch Fox because it is higher in the channel order. My guess is that such potential voters are decently more likely than the average voter to be influenced by cable news. Moreover, my guess is that there are not a huge number of such people who actually do spend tons of time watching Fox News. Moreover, there may be fewer of such people than in the year 2000 when U.S. politics was less polarized than it is today and when Fox News did not yet have as much of a well-known reputation for being a conservative news outlet.

Interesting example of an interaction; also interesting example of a bias that can arise from a natural experiment.

Questions about our old analysis of police stops

I received this anonymous email:

I read your seminal work on racial bias in stops with Professors Fagan and Kiss and just had a few questions.

1. Your paper analyzed stops at the precinct level. A critique I have heard regarding aggregating data at that level is that: “To say that the threshold test can feasibly discern whether racial bias is present in a given aggregate dataset would be to ignore its concerning limitations which make it unusable in its ability to perform this task. The Simpson’s paradox is a phenomenon in probability and statistics which refers to a pattern exhibited by aggregated data either disappearing or reversing once the data is disaggregated. When the police behave differently across the strata of some variable, but a researcher’s analysis uses data that ignores and aggregates across this distribution, Simpson’s paradox threatens to give statistics that are inconsistent with reality. The variable being place, police are treating different strata of place differently, and the races are distributed unequally across strata. The researchers who designed the threshold test do not properly control for place, as modeling for something as large as a county or precinct (which is what they do) does not properly account for place if the police structure their behavior along the lines of smaller hot spots.”

2. How would your paper account for changes in police deployment patterns?

3. What are your thoughts on this article? It addresses a paper from one of your colleagues but if the critiques were valid would they also apply to your paper in 2007?

My reply:

1. I’m not sure about most of this because it seems to be referring to some other work, not ours. It refers to a threshold test, which is not what we’re doing. As to the question of why we used precincts: this was to address concern that the citywide patterns could be explained by differences between neighborhoods; we discuss this at the beginning of section 4 of our paper. Ultimately, though, the data are what they are, and we’re not making any claims beyond what we found in the data.

2. The data show that the police stopped more blacks and Hispanics than whites in comparable neighborhoods, in comparison to their rate of arrests in the previous year. All these stops could be legitimate police decisions based on local information. We really can’t say; all we can do is give these aggregates.

3. I read the linked article, and it seems that a key point there is that most of the stops in question are legal, and that “Those lawful stops should have been excluded from his regression analysis, since they cannot form the basis for concluding that the officers making the stop substituted race for reasonable suspicion.” I don’t agree with this criticism. The point of our analysis is to show statistical patterns in total stops. Legality of the individual stops is a separate question. Another comment made in the linked article is that the analysis was “evaluating whether the police were making stops based on skin color rather than behavior.” This is not an issue with our analysis because we were not trying to make any such evaluation; we were just showing statistical patterns. There was also a criticism regarding the use of data from one month to predict the next month. I can’t say for sure but I don’t think that shifting things by a month would change our analysis. Another criticism had a prediction that a census tract would experience 120 stops in a month when it only had an average of 19. I don’t know the details here, but all regression models have errors. It all depends on what predictors are in the model. Finally, there is a statement that “It is a lot more comfortable to talk about the allegedly racist police than about black-on-black crime.” I don’t think it has to be one or the other. Our paper was about patterns in police stops; there’s other research on patterns of crime.
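For what it's worth, the Simpson's paradox concern raised in question 1 is easy to illustrate with a toy example (all numbers invented, nothing to do with the actual data): within each area one group can be stopped at a higher rate, while the aggregate comparison reverses because the groups are distributed unevenly across areas with different baseline stop rates.

```python
# Toy numbers showing an aggregation reversal (Simpson's paradox):
# within each area, group A is stopped at a higher rate than group B,
# yet the aggregate rate for A is lower.
stops = {
    # area: {group: (stops, population)}
    "hot_spot": {"A": (90, 100),   "B": (800, 1000)},
    "quiet":    {"A": (100, 1000), "B": (9, 100)},
}

def rate(group, areas=("hot_spot", "quiet")):
    s = sum(stops[a][group][0] for a in areas)
    n = sum(stops[a][group][1] for a in areas)
    return s / n

# Within each area, A's rate exceeds B's:
for area in stops:
    print(area, rate("A", (area,)), rate("B", (area,)))
# hot_spot: A 0.90 > B 0.80;  quiet: A 0.10 > B 0.09

# Aggregated across areas, the comparison reverses,
# because most of A lives in the low-stop-rate area:
print("aggregate", rate("A"), rate("B"))
```

Whether this mechanism is actually operating in any given dataset is an empirical question; the point of analyzing at the precinct level, as discussed above, was to address exactly this kind of between-neighborhood confounding.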

A recommender system for scientific papers

Jeff Leek writes:

We created a web app that lets people very quickly sort papers on two axes: how interesting it is and how plausible they think it is. We started with covid papers but have plans to expand it to other fields as well.

Seems like an interesting idea, a Yelp-style recommender system but with two dimensions.

A fill-in-the-blanks contest: Attributing the persistence of the $7.25 minimum wage to “the median voter theorem” is as silly as _______________________

My best shots are “attributing Napoleon’s loss at Waterloo to the second law of thermodynamics” or “attributing Michael Jordan’s 6 rings to the infield fly rule.” But these aren’t right at all. I know youall can do better.

Background here.

For some relevant data, see here, here, here, and here.

P.S. I get it that Cowen doesn’t like the minimum wage. I have no informed opinion on the topic myself. But public opinion on the topic is clear enough. Also, I understand that I might be falling for a parody. Poe’s law and all that.

P.P.S. Yes, this is all silly stuff. The serious part is that Cowen and his correspondent are basically saying (or joking) that what happens, should happen. I agree that there’s a lot of rationality in politics, but you have to watch out for circular reasoning.

In making minimal corrections and not acknowledging that he made these errors, Rajan is dealing with the symptoms but not the underlying problem, which is that he’s processing recent history via conventional wisdom.

Raghuram Rajan is an academic and policy star: University of Chicago professor, former chief economist for the International Monetary Fund, former chief economic advisor to the government of India, and a frequent presence in NPR and other prestige media.

He also appears to be in the habit of telling purportedly data-backed stories that aren’t really backed by the data.

Story #1: The trend that wasn’t

Gaurav Sood writes:

In late 2019 . . . while discussing the trends in growth in the Indian economy . . . Mr. Rajan notes:

We were growing really fast before the great recession, and then 2009 was a year of very poor growth. We started climbing a little bit after it, but since then, since about 2012, we have had a steady upward movement in growth going back to the pre-2000, pre-financial crisis growth rates. And then since about mid-2016 (GS: a couple of years after Mr. Modi became the PM), we have seen a steady deceleration.

The statement is supported by the red lines that connect the deepest valleys with the highest peak, eagerly eliding over the enormous variation in between (see below).

Not to be left behind, Mr. Rajan’s interlocutor Mr. Subramanian shares the following slide about investment collapse. Note the title of the slide and then look at the actual slide. The title says that the investment (tallied by the black line) collapses in 2010 (before Mr. Modi became PM).

Story #2: Following conventional wisdom

Before Gaurav pointed me to his post, the only other time I’d heard of Rajan was when I’d received his book to review a couple years ago, at which time I sent the following note to the publisher:

I took a look at Rajan’s book, “The Third Pillar: How Markets and the State Leave the Community Behind,” and found what seems to be a mistake right on the first page. Maybe you can forward this to him and there will be a chance for him to correct it before the book comes out.

On the first page of the book, Rajan writes: “Half a million more middle-aged non-Hispanic white American males died between 1999 and 2013 than if their death rates had followed the trend of other ethnic groups.” There are some mistakes here. First, the calculation is wrong because it does not account for changes in the age distribution of this group. Second, it was actually women, not men, whose death rates increased. See here for more on both points.

There is a larger problem here: there is received wisdom that white men are having problems, hence people attribute a general trend to men, even though in this case the trend is actually much stronger for women.

I noticed another error. On page 216, Rajan writes, “In the United States, the Affordable Care Act, or Obamacare, was the spark that led to the organizing of the Tea Party movement…” This is incorrect. The Tea Party movement started with a speech on TV in February, 2009, in opposition to Obama’s mortgage relief plan. From Wikipedia: “The movement began following Barack Obama’s first presidential inauguration (in January 2009) when his administration announced plans to give financial aid to bankrupt homeowners. A major force behind it was Americans for Prosperity (AFP), a conservative political advocacy group founded by businessmen and political activist David H. Koch.” The Affordable Care Act came later, with discussion in Congress later in 2009 and the bill passing in 2010. The Tea Party opposed the Affordable Care Act, but the Affordable Care Act was not the spark that led to the organizing of the Tea Party movement. This is relevant to Rajan’s book because it calls into question his arguments about populism.

The person to whom I sent this email said she notified the author so I was hoping he fixed these small factual problems and also that he correspondingly adjusted his arguments about populism. Arguments are ultimately based on facts; shift the facts and the arguments should change to some extent.

In the meantime, Rajan came out with a second edition of his book, and so I was able to check on Amazon to see if he had fixed the errors.

The result was disappointing. It seems that he corrected both errors but in a minimal way: changing “American males” to “Americans” and changing “the spark that led to the organizing of the Tea Party movement” to “an important catalyst in the organizing of the Tea Party Movement.” It’s good that the changes were made (though not so cool that they didn’t cite me), but I’m bothered by how minimal they were. These were not typos; they reflected real misunderstanding, and it’s best to wrestle with one’s misunderstanding rather than just making superficial corrections.

At this point you might say I’m being picky. They fixed the errors; isn’t that enough? But, no, I don’t think that’s enough. As I wrote two years ago, arguments are ultimately based on facts; shift the facts and the arguments should change to some extent. If the facts change but the argument stays the same, that represents a problem.

In making minimal corrections and not acknowledging that he made these errors, Rajan is dealing with the symptoms but not the underlying problem, which is that he’s processing recent history via conventional wisdom.

This should not be taken as some sort of blanket condemnation of Rajan, who might be an excellent banker and college professor. Lots of successful people operate using conventional wisdom. We just have to interpret his book not as an economic analysis or a synthesis of the literature but as an expression of conventional wisdom by a person with many interesting life experiences.

The trouble is, if all you’re doing is processing conventional wisdom, you’re not adding anything to the discourse.

“Do you come from Liverpool?”

Paul Alper writes:

Because I used to live in Trondheim, I have a special interest in this NYT article about exercise results in Trondheim, Norway.

Obviously, even without reading the article in any detail, the headline claim that

The Secret to Longevity? 4-Minute Bursts of Intense Exercise May Help

can be misleading and is subject to many caveats.

The essential claims:

Such studies [of exercise and mortality], however, are dauntingly complicated and expensive, one reason they are rarely done. They may also be limited, since over the course of a typical experiment [of short duration], few adults may die. This is providential for those who enroll in the study but problematic for the scientists hoping to study mortality; with scant deaths, they cannot tell if exercise is having a meaningful impact on life spans.

However, exercise scientists at the Norwegian University of Science and Technology in Trondheim, Norway, almost 10 years ago, began planning the study that would be published in October in The BMJ.

More than 1,500 of the Norwegian men and women accepted. These volunteers were, in general, healthier than most 70-year-olds. Some had heart disease, cancer or other conditions, but most regularly walked or otherwise remained active. Few were obese. All agreed to start and continue to exercise more regularly during the upcoming five years.

Via random assignment, they were put into three groups: the control group, which “agreed to follow standard activity guidelines and walk or otherwise remain in motion for half an hour most days”; the moderate group, which exercised “moderately for longer sessions of 50 minutes twice a week”; and the third group, “which started a program of twice-weekly high-intensity interval training, or H.I.I.T., during which they cycled or jogged at a strenuous pace for four minutes, followed by four minutes of rest, with that sequence repeated four times.”
Note that those in the control group were allowed to indulge in interval training if they felt like it.

Almost everyone kept up their assigned exercise routines for five years [!!], an eternity in science, returning periodically to the lab for check-ins, tests and supervised group workouts.

The results:

The men and women in the high-intensity-intervals group were about 2 percent less likely to have died than those in the control group, and 3 percent less likely to die than anyone in the longer, moderate-exercise group. People in the moderate group were, in fact, more likely to have passed away than people in the control group [!!].

In essence, says Dorthe Stensvold, a researcher at the Norwegian University of Science and Technology who led the new study, intense training — which was part of the routines of both the interval and control groups — provided slightly better protection against premature death than moderate workouts alone.

Here can be found the BMJ article itself. A closer look at it is instructive: the term “non-significant” appears in the BMJ article but not in the NYT piece.


This study suggests that combined MICT and HIIT has no effect on all cause mortality compared with recommended physical activity levels. However, we observed a lower all cause mortality trend after HIIT compared with controls and MICT.


The Generation 100 study is a long and large randomised controlled trial of exercise in a general population of older adults (70-77 years). This study found no differences in all cause mortality between a combined exercise group (MICT and HIIT) and a group that followed Norwegian guidelines for physical activity (control group). We observed a non-significant 1.7% absolute risk reduction in all cause mortality in the HIIT group compared with control group, and a non-significant 2.9% absolute risk reduction in all cause mortality in the HIIT group compared with MICT group. Furthermore, physical activity levels in the control group were stable throughout the study, with control participants performing more activities as HIIT compared with MICT participants, suggesting a physical activity level in control participants between that of MICT and HIIT.

As it happens, I [Alper] lived in Trondheim back before North Sea oil transformed the country. The Norwegian University of Science and Technology in Trondheim, Norway did not exist but was called the NTH, incorrectly translated as the Norwegian Technical High School. Back then, as now, exercise was the nation’s religion and the motto of the country was

It doesn’t matter whether you win or lose. The important thing is to beat Sweden.

To give you a taste of what the country was like in the 1960s, while I was on a walk, a little kid stopped me and said, “Do you come from Liverpool?”

Dude should have his own blog.

Conference on digital twins

Ron Kenett writes:

This conference and the special issue that follows might be of interest to (some) of your blog readers.

Here’s what it says there:

The concept of digital twins is based on a combination of physical models that describe the machine’s behavior and its deterioration processes over time with analytics capabilities that enable lessons to be learned, decision making and model improvement. The physical models can include the control model, the load model, the erosion model, the crack development model and more while the analytics model which is based on experimental data and operational data from the field.

I don’t fully follow this, but it sounds related to all the engineering workflow things that we like to talk about.

Which sorts of posts get more blog comments?

Paul Alper writes:

Some of your blog postings elicit many responses and some, rather few. Have you ever thought of displaying some sort of statistical graph illustrating the years of data? For example, sports vs. politics, or responses for one year vs. another (time series), winter vs. summer, highly technical vs. breezy.

I’ve not done any graph or statistical analysis. Informally I’ve noticed a gradual increase in the rate of comments. It’s not always clear which posts will get lots of comments and which will get few, except that more technical material typically gets less reaction. Not because people don’t care, I think, but because it’s harder to say much in response to a technical post. I think we also get fewer comments for posts on offbeat topics such as art and literature. And of course we get fewer comments on posts that are simply announcing job opportunities, future talks, and future posts. And all posts get zillions of spam comments, but I’m not counting them.

As of this writing, we have published 10,157 posts and 146,661 comments during the 16 years since the birth of this blog. The rate of comments has definitely been increasing, as I remember not so long ago that the ratio was 10-to-1. Unfortunately, there aren’t so many blogs anymore, so I’m pretty sure that the total rate of blog commenting has been in steady decline for years.
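For the record, the counts above work out to about 14 comments per post overall, consistent with an increase from the earlier 10-to-1 ratio:

```python
# Overall comments-per-post ratio from the counts quoted above.
posts = 10_157
comments = 146_661
print(f"{comments / posts:.1f} comments per post")  # → 14.4
```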

Webinar: An introduction to Bayesian multilevel modeling with brms

This post is by Eric.

This Wednesday, at 12 pm ET, Paul Bürkner is stopping by to talk to us about brms. You can register here.


The talk will be about Bayesian multilevel models and their implementation in R using the package brms. We will start with a short introduction to multilevel modeling and to Bayesian statistics in general followed by an introduction to Stan, which is a flexible language for fitting open-ended Bayesian models. We will then explain how to access Stan using the standard R formula syntax via the brms package. The package supports a wide range of response distributions and modeling options such as splines, autocorrelation, and censoring all in a multilevel context. A lot of post-processing and plotting methods are implemented as well. Some examples from Psychology and Medicine will be discussed.

About the speaker

Paul Bürkner is a statistician currently working as a Junior Research Group Leader at the Cluster of Excellence SimTech at the University of Stuttgart (Germany). He is the author of the R package brms and a member of the Stan Development Team. Previously, he studied Psychology and Mathematics at the Universities of Münster and Hagen (Germany) and did his PhD in Münster on optimal design and Bayesian data analysis. He has also worked as a Postdoctoral researcher at the Department of Computer Science at Aalto University (Finland).

More on that credulity thing

I see five problems here that together form a feedback loop with bad consequences. Here are the problems:

1. Irrelevant or misunderstood statistical or econometric theory;
2. Poorly-executed research;
3. Other people in the field being loath to criticize, taking published or even preprinted claims as correct until proved otherwise;
4. Journalists taking published or even preprinted claims as correct until proved otherwise;
5. Journalists following the scientist-as-hero template.

There are also related issues such as fads in different academic fields, etc.

When I write about regression discontinuity, we often focus on item #2 above, because it helps to be specific. But the point of my post was that maybe we could work on #3 and #4. If researchers could withdraw from their defensive position by which something written by a credentialed person in their field is considered to be good work until proved otherwise, and if policy journalists could withdraw from their default deference to anything that has an identification strategy and statistical significance, then maybe we could break that feedback loop.

Economists: You know better! Think about your own applied work. You’ve done bad analyses yourself, even published some bad analyses, right? I know I have. Given that you’ve done it, why assume by default that other people haven’t made serious mistakes in their understanding?

Policy journalists: You can know better! You already have a default skepticism. If someone presented you with a pure observational comparison, you’d know to be concerned about unmodeled differences between treatment and control groups. So don’t assume that an “identification strategy” removes these concerns. Don’t assume that you don’t have to adjust for differences between cities north of the river and cities south of the river. Don’t assume you don’t have to adjust for differences between schools in the hills and schools in the flatlands, or the different ages of people in different groups, or whatever.

I’m not saying this is all easy. Being skeptical without being nihilistic—that’s a challenge. But, given that you’re already tooled up for some skepticism when comparing groups, my recommendation is to not abandon that skepticism just because you see an identification strategy, some robustness checks, and an affiliation with a credible university.

Rob Tibshirani, Yuling Yao, and Aki Vehtari on cross validation

Rob Tibshirani writes:

About 9 years ago I emailed you about our new significance result for the lasso. You wrote about it in your blog. For some reason I never saw that full post until now. I do remember the Stanford-Berkeley Seminar in 1994 where I first presented the lasso and you asked that question. Anyway, thanks for admitting that the lasso did turn out to be useful!

That paper “A significance test for the lasso” did not turn out to be directly useful but did help spawn the post-selection inference area. Whether that area is useful remains to be seen.

Yesterday we just released a paper with Stephen Bates and Trevor Hastie that I think will be important, because it concerns a tool data scientists use every day: cross validation:

“Cross-validation: what does it estimate and how well does it do it?”

We do two things: (a) we establish, for the first time, what exactly CV is estimating, and (b) we show that the SEs from CV are often WAY too small, and show how to fix them. The fix is a nested CV, and there will be software for this procedure. We also show similar properties to those in (a) for bootstrap, Cp, AIC and data splitting. I am super excited!

I forwarded this to Yuling Yao and Aki Vehtari.

Yuling wrote:

The more cross validation papers I read, the more I want to write a paper with the title “Cross validation is fundamentally unsound,” analogous to O’Hagan’s “Monte Carlo is fundamentally unsound”:

1. (sampling variation) Even with iid data and exact LOO [leave-one-out cross validation], we rely on pseudo Monte Carlo method, which itself ignores sampling variation.

2. (outcome variation) Even worse than the unsound Monte Carlo, now we not only have sampling variation on x, but also variation on the outcome y, especially if y is binary or discrete.

3. (pointwise model performance) Sometimes we use pointwise cv error to understand local model fit, no matter how large a variance it could have. The solution to 1–3 is a better modeling of cv errors beyond the sample average. We have already applied hierarchical modeling to stacking, and we should apply it to cv, too.

4. (experiment design) Cross validation is a controlled experiment, which gives us the privilege of designing which unit(s) receive the treatment (left out) or control (not left out). Leave-one-cell-out is a type of block design. But there is no general guidance linking the rich literature on experimental design to the context of cv.

Just to clarify here: Yuling is saying this from the perspective of someone who thinks about cross validation all the time. His paper, “Using stacking to average Bayesian predictive distributions” (with Aki, Dan Simpson, and me) is based on leave-one-out cross validation. So when he writes, “Cross validation is fundamentally unsound,” he’s not saying that cross validation is a bad idea; he’s saying that it has room for improvement.

With that in mind, Aki wrote:

Monte Carlo is still useful even if you can reduce the estimation error by adding prior information. Cross validation is still useful even if you can reduce the estimation error by adding prior information. I think better title would be something like “How to beat cross validation by including more information.”

Aki also responded to Rob’s original message, as follows:

Tuomas Sivula, Måns Magnusson, and I [Aki] have recently examined the frequency properties of leave-one-out cross validation in the Bayesian context and specifically for model comparison as that brings an additional twist in the behavior of CV.

– Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Uncertainty in Bayesian leave-one-out cross-validation based model comparison.

– Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Unbiased estimator for the variance of the leave-one-out cross-validation estimator for a Bayesian normal model with fixed variance.

Although Bayesian LOO-CV is slightly different, there are certainly the same issues of the folds not being independent and the naive SE estimator being biased. However, it seems this is mostly problematic for small n, very similar models, or badly misspecified models (1st paper), and in addition there is a way to reduce the bias (2nd paper). I have shared your [Rob’s] paper with Tuomas and Måns and we will look carefully to see if there is something we should take into account, too.
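For readers unfamiliar with the naive SE estimator under discussion, here is a minimal sketch, assuming squared-error loss and a trivial mean-only model (the names and the toy model are illustrative, not from the papers above):

```python
import numpy as np

def loo_squared_errors(y):
    """Leave-one-out squared prediction errors for a mean-only model."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        pred = np.delete(y, i).mean()   # refit the model without observation i
        errs[i] = (y[i] - pred) ** 2
    return errs

rng = np.random.default_rng(0)
y = rng.normal(size=50)
errs = loo_squared_errors(y)
loo_estimate = errs.mean()
# The "naive" SE treats the n fold errors as independent draws, which they
# are not (each fold shares n-2 training points with every other fold):
naive_se = errs.std(ddof=1) / np.sqrt(len(errs))
```

The dependence among folds is exactly why the naive SE is biased, which is the issue the two papers above address.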

Lots of this conversation is just whipping over my head but I thought I’d share it with you. I’ve learned so much about cross validation from working with Aki and Yuling, and of course all of statistics has benefited from Rob’s methods.

P.S. Zad sends in this picture demonstrating careful contemplation of the leave-one-out cross validation principle. I just hope nobody slams me in the Supplemental Materials section of a paper in The European Journal of Clinical Investigation for using a cat picture on social media. I’ve heard that’s a sign of sloppy science!

Is explainability the new uncertainty?

This is Jessica. Last August, NIST published a draft document describing four principles of explainable AI. They asked for feedback from the public at large, to “stimulate a conversation about what we should expect of our decision-making devices.”

I find it interesting because from a quick skim, it seems like NIST is stepping into some murkier territory than usual. 

The first principle suggests that all AI outputs resulting from querying a system should come with explanation: “AI systems should deliver accompanying evidence or reasons for all their outputs.” From the motivation they provide in the first few pages: 

“Based on these calls for explainable systems [40], it can be assumed that the failure to articulate the rationale for an answer can affect the level of trust users will grant that system. Suspicions that the system is biased or unfair can raise concerns about harm to oneself and to society [102]. This may slow societal acceptance and adoption of the technology, as members of the general public oftentimes place the burden of meeting societal goals on manufacturers and programmers themselves [27, 102]. Therefore, in terms of societal acceptance and trust, developers of AI systems may need to consider that multiple attributes of an AI system can influence public perception of the system. Explainable AI is one of several properties that characterize trust in AI systems [83, 92].”

NIST is an organization whose mission involves increasing trust in tech, so thinking about what society wants and needs is not crazy. The report summarizes a lot of recent work in explainable AI to back its principles, and acknowledges at points that techniques are still being developed. Still, I find the report to be kind of a bold statement. Explainable AI is pretty open to interpretation, as the research is still developing. The report describes ideals, but on what still seems like underspecified ground.

The second principle, for instance, is:

Systems should provide explanations that are meaningful or understandable to individual users.

They expound on this: 

A system fulfills the Meaningful principle if the recipient understands the system’s explanations. Generally, this principle is fulfilled if a user can understand the explanation, and/or it is useful to complete a task. This principle does not imply that the explanation is one size fits all. Multiple groups of users for a system may require different explanations. The Meaningful principle allows for explanations which are tailored to each of the user groups. Groups may be defined broadly as the developers of a system vs. end-users of a system; lawyers/judges vs. juries; etc. The goals and desiderata for these groups may vary. For example, what is meaningful to a forensic practitioner may be different than what is meaningful to a juror [31]. 

Later they mention the difficulty of modeling the human model interaction:

As people gain experience with a task, what they consider a meaningful explanation will likely change [10, 35, 57, 72, 73]. Therefore, meaningfulness is influenced by a combination of the AI system’s explanation and a person’s prior knowledge, experiences, and mental processes. All of the factors that influence meaningfulness contribute to the difficulty in modeling the interface between AI and humans. Developing systems that produce meaningful explanations need to account for both computational and human factors [22, 58]

I would like to interpret this as saying we need to model the “system” comprised of the human and the model, which is a direction of some recent work in human AI complementarity, though it’s not clear how much the report intends formal modeling versus simply considering things like users’ prior knowledge. Both are hard, of course, and places where research is still very early along. As far as I know, most of the work in AI explainability and interpretability—with the former applying to any system for which reasoning can be generated and the latter applying to “self-explainable” models that people can understand more or less on their own—is still about developing different techniques. Less of it is about studying their effects, and even less about applying any formal framework to understand both the human and the model together. And relatively little has involved computer scientists teaming up with non-computer scientists to tackle more of the human side.

The third principle requires “explanation accuracy”:

  • The explanation correctly reflects the system’s process for generating the output.

Seems reasonable enough, but finding accurate ways to explain model predictions is not an easy problem. I’m reminded of some recent work in interpretability showing that some very “intuitive” approaches to explaining deep neural nets, like highlighting the features of an input most salient to the output, sometimes don’t exhibit basic traits we’d expect, such as producing the same explanations when two models with different architectures produce the same outputs for a set of inputs.

Also, especially when the explanations must correctly reflect the system’s process, they will often entail introducing the user to features or counterfactuals, and may include probabilities to convey the model’s confidence in the prediction. This is all extra information for the end-user to process. It makes me think of the broader challenge of expressing uncertainty with model estimates. AI explanations, like expressions of uncertainty, become an extra thing that users have to make sense of in whatever decision context they’re in. “As-if optimization,” where you assume the point estimate or prediction is correct and go ahead with your decision, becomes harder.

One thing I find interesting, though, is how much more urgency there seems to be in examples like the NIST report or the popular tech press around explainable AI, relative to the idea that we need to make the outputs of statistical models more useful by expressing uncertainty. NIST has addressed the latter in multiple reports, though never with the implied urgency and human-centric focus here. It’s not like there’s a shortage of examples where misinterpreting uncertainty in model estimates led to bad choices on the part of individuals, governments, etc. Suggesting that AI predictions require explanations and that measurements require uncertainty expressions comes from a similar motivation: modelers owe their users provenance information so they can make more informed decisions. However, the NIST reports on uncertainty have not really discussed public needs or requirements implied by human cognition, focusing instead on defining different sources and error measurements. Maybe times have changed, and if NIST did an uncertainty report now, it would be much less dry and technical, stressing the importance of understanding what people can tolerate or need. Or maybe it’s a branding thing. Explainable AI sounds like an answer to people’s deepest fears of being outsmarted by machines. Uncertainty communication just sounds like a problem.

At any rate, something I’ve seen again and again in my research, and which is well known in JDM work on reasoning under uncertainty is the pervasiveness of mental approximations or heuristics people use to simplify how they use the extra information, even in the case of seemingly “optimal” representations. It didn’t really surprise me to learn that heuristics have come up in the explainable AI lit recently. For instance, some work argues people rarely engage analytically with each individual AI recommendation and explanation; instead they develop general heuristics about whether and when to follow the AI suggestions, accurate or not. 

In contrast to uncertainty, though, which is sometimes seen as risky since it might confuse end-users of some predictive system, explainability often gets seen as a positive thing. Yet it’s pretty well established at this point that people overrely on AI recommendations in many settings, and explainability does not necessarily help as we might hope. For instance, a recent paper finds that explanations increase overreliance regardless of whether the AI recommendation is correct. So the relationship more explanation = more trust should not be assumed when trust is mentioned, as in the NIST report, just as it shouldn’t be assumed that more expression of uncertainty = more trust.

So, lots of similarities on the surface. Though not much overlap yet between uncertainty expression/communication research and explainable AI research.

The final principle, since I’ve gone through the other three, is about constraints on use of a model: 

The system only operates under conditions for which it was designed or when the system reaches a sufficient confidence in its output. (The idea is that if a system has insufficient confidence in its decision, it should not supply a decision to the user.)

Elaborated a bit:

The previous principles implicitly assume that a system is operating within its knowledge limits. This Knowledge Limits principle states that systems identify cases they were not designed or approved to operate, or their answers are not reliable. 

Another good idea, but hard. Understanding and expressing dataset limitations is also a growing area of research (I like the term Dataset Cartography, for instance). I can’t help but wonder: why does this property tend to come up only when we talk about AI/ML models, rather than statistical models used to inform real-world decisions more broadly? Is it because statistical modeling outside of ML is seen as being more about understanding parameter relationships than making decisions? While examples like Bayesian forecasting models are not black boxes the way deep neural nets are, there’s still lots of room for end-users to misinterpret how they work or how reliable their predictions are (election forecasting, for example). Or maybe because the datasets in AI tend to be larger and often domain-general, there are more worries about overlooking mismatches between the development and use populations, and explanations are a way to guard against that. I kind of doubt there’s a single strong reason to worry so much more about AI/ML models than about other predictive models.

Relative vs. absolute risk reduction . . . 500 doctors want to know!

Some stranger writes:

What are your thoughts on this paper? Especially the paragraph on page 6 “Similar to the critical appraisal ….. respectively”.
There are many of us MD’s who are quite foxed.
If you blog about it, please don’t mention my name and just say a doctor on a 500-member listserv asked you about this. And send me the link to that blog article please. There are at least 500 of us doctors who would love to be enlightened.

The link is to an article called, “Outcome Reporting Bias in COVID-19 mRNA Vaccine Clinical Trials,” which argues that when reporting results from coronavirus vaccine trials, they should be giving absolute risk rather than relative risk. These have the same numerator, different denominators. Let X be the number of cases that would occur under the treatment, Y be the number of cases that would occur under the control, and Z be the number of people in the population. The relative risk reduction (which is what we usually see) is (Y – X)/Y and the absolute risk reduction is (Y – X)/Z. So, for example, if X = 50, Y = 1000, and Z = 1 million, then the relative risk reduction is 95% but the absolute risk reduction is only 0.00095, or about a tenth of one percent. Here’s the wikipedia page.
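The arithmetic in the example above is simple enough to check directly (a short sketch, using the numbers from the post):

```python
# X: cases under treatment, Y: cases under control, Z: population size
X, Y, Z = 50, 1000, 1_000_000

relative_risk_reduction = (Y - X) / Y   # (1000 - 50) / 1000 = 0.95
absolute_risk_reduction = (Y - X) / Z   # 950 / 1,000,000 = 0.00095
```

Same numerator, different denominators: the relative figure compares against cases in the control group, while the absolute figure compares against the whole population.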

What do I think about the above-linked article? It’s basically rehashing the wikipedia page. I’m not talking about Weggy-style plagiarism here, just that this is standard material. There’s nothing wrong with publishing standard material if people need to be reminded of some important message.

That said, I don’t think this article is particularly useful. I don’t think it conveys an important reason for parameterizing in terms of relative risk, which is that it can be much more stable than absolute risk. If the vaccine prevents 95% of infections, then that’s 95% of however many occur, which is a more relevant number than comparing to how many happened to occur in the control group in some particular study. Conversely, if that 95% is not a stable number, so that the vaccine prevents a greater proportion of infections in some settings and a lesser proportion in others, then we’d want to know that—this is an interaction—but, again, the absolute risk difference isn’t so relevant. Absolute risk does matter in some settings—for example, we wouldn’t be so interested in a drug that prevents 50% of cases in a disease that only affects 2 people in the world (unless knowing this would give us a clue of how to treat other diseases that have higher prevalence), but of course coronavirus is not a rare disease. Presumably the rate of infection was so low in those studies only because the participants were keeping pretty careful, but the purpose of the vaccines is to give it to everyone so we don’t have to go around keeping so careful.

So, unless there’s something I’m missing (that happens!), I disagree with the claim in the linked paper that failures to report absolute risk reduction “mislead and distort the public’s interpretation of COVID-19 mRNA vaccine efficacy and violate the ethical and legal obligations of informed consent.”

P.S. I’d never heard the term “foxed” before. From the above message, I’m guessing that it means “confused.” I googled the word, and according to the internet dictionary, it means “deceived; tricked,” but that doesn’t seem to fit the context.

When can we challenge authority with authority?

Michael Nelson writes:

I want to thank you for posting your last decade of publications in a single space and organized by topic. But I also wanted to share a critique of your argument style as exemplified in your Annals of Surgery correspondence [here and here]. While I think it’s important and valuable that you got the correct reasoning against post hoc power analysis on the record, I don’t think there was ever much of a chance that a correct argument was going to change the authors’ convictions significantly. Their belief was not a result of a logical mistake and so could not be undone by logic; they believed it because it was what they were originally taught and/or picked up from mentors and colleagues. I suggest that the most effective way to get scientists to change their practices, or at least to withdraw their faulty arguments, is to challenge authority with authority.

What if, after you present your rational argument, you then say something like: “I know this is what you were taught, as were many of my own very accomplished colleagues, but a lot of things are taught incorrectly in statistics (citations). However, without exception, every single one of the current, most-respected authorities in statistics and methodology (several recognizable names) agree that post hoc power analysis (or whatever) does not work, for precisely the reasons I have given. More importantly, their arguments and demonstrations to this effect have been published in the most authoritative statistical journals (citations) and have received no notable challenges from their fellow experts. Respectfully, if you are confident that your argument is indeed valid, then you have outwitted the best of my field. You are compelled by professional ethics to publicize your breakthrough proof in these same journals, at conferences of quantitative methodologists (e.g., SREE) and any other venue that may reach the top statistical minds in the social sciences. If correct, you will be well-rewarded: you’ll instantly become famous (at least among statisticians) for overturning points that have long been thought mathematically and empirically proven.” In short, put up or shut up.

My reply:

That’s an interesting idea. It won’t work in all cases, as often it’s a well-respected authority making the mistake: either the authority figure is making the error himself, or a high-status researcher is making the error based on respected literature. So in that case the appeal to authority won’t work, as these people are the authority on their fields. Similarly, we can’t easily appeal to authority to talk people out of naive and way-wrong interpretations of significance tests and p-values, as these mistakes are all over the place in textbooks. But on the occasions where someone is coming out of the blue with a bad idea, yeah, then it could make sense to bring in consensus as one of our arguments.

Of course, in some way whenever I make an argument under my own name, I’m challenging authority with authority, in that I bring to the table credibility based on my successful research and textbooks. I don’t usually make this argument explicitly, as there are many sources of statistical authority (see section 26.2 of this paper), but I guess it’s always there in the background.

What’s the biggest mistake revealed by this table? A puzzle:

This came up in our discussion the other day:

It’s a table comparing averages for treatment and control groups in an experiment. There’s one big problem here (summarizing differences by p-values) and some little problems, such as reporting values to ridiculous precision (who cares if something has an average of “346.57” when its standard deviation is 73?) and an error (24.19% where it should be 22%).

There’s one other thing about the table that bothers me. It’s not a problem in the table itself; rather, it’s something that the table reveals about the data analysis.

Do you see it?

Answer is in the comments.

Why did it take so many decades for the behavioral sciences to develop a sense of crisis around methodology and replication?

“On or about December 1910 human character changed.” — Virginia Woolf (1924).

Woolf’s quote about modernism in the arts rings true, in part because we continue to see relatively sudden changes in intellectual life, not merely from technology (email and texting replacing letters and phone calls, streaming replacing record sales, etc.) and power relations (for example arising from the decline of labor unions and the end of communism) but also ways of thinking which are not exactly new but seem to take root in a way that had not happened earlier. Around 1910, it seemed that the literary and artistic world was ready for Ezra Pound, Pablo Picasso, Igor Stravinsky, Gertrude Stein, and the like to shatter old ways of thinking, and (in a much lesser way) the behavioral sciences were upended just about exactly 100 years later by what is now known as the “replication crisis” . . .

The above is from a new paper with Simine Vazire. Here’s the abstract:

For several decades, leading behavioral scientists have offered strong criticisms of the common practice of null hypothesis significance testing as producing spurious findings without strong theoretical or empirical support. But only in the past decade has this manifested as a full-scale replication crisis. We consider some possible reasons why, on or about December 2010, the behavioral sciences changed.

You can read our article to hear the full story. Actually, I think our main point here is to raise the question; it’s not like we have such a great answer. And I’m sure lots of people have raised the question before. Some questions are just worth asking over and over again.

Our article is a discussion of “What behavioral scientists are unwilling to accept,” by Lewis Petrinovich, for the Journal of Methods and Measurement in the Social Sciences, edited by Alex Weiss.
