Effective Number of Parameters in a Statistical Model

This is my talk for a seminar organized by Joe Suzuki at Osaka University on Tues 10 Sep 2024, 8:50-10:20am Japan time / 19:50-21:20 NY time:

Effective Number of Parameters in a Statistical Model

Andrew Gelman, Department of Statistics, Columbia University

Degrees-of-freedom adjustment for estimated parameters is a general idea in small-sample hypothesis testing, uncertainty estimation, and assessment of prediction accuracy. The effective number of parameters gets interesting in the presence of nonlinearity, constraints, boundary conditions, hierarchical models, informative priors, discrete parameters, and other complicating factors. Many open questions remain, including: (a) defining the effective number of parameters, (b) measuring how the effective number of parameters can depend on data and vary across parameter space, and (c) understanding how the effective number of parameters changes as sample size increases. We discuss these questions using examples from demographics, imaging, pharmacology, political science, and other application areas.

It will be a remote talk—I won’t be flying to Japan—so maybe the eventual link will be accessible to outsiders.

It feels kinda weird to be scheduling a talk nearly a year in advance, but since I had to give a title and abstract anyway, I thought I’d share it with you. My talk will be part of a lecture series they are organizing for graduate students at Osaka, “centered around WAIC/WBIC and its mathematical complexities, covering topics such as Stan usage, regularity conditions, WAIC, WBIC, cross-validation, SBIC, and learning coefficients.” I’m not gonna touch the whole BIC thing.

Now that we have loo, I don’t see any direct use for “effective number of parameters” in applied statistics, but the concept still seems important for understanding fitted models, in vaguely the same way that R-squared is useful for understanding, even though it does not answer any direct question of interest. I thought it could be fun to give a talk on all the things that confuse me about effective number of parameters, because I think it’s a concept that we often take for granted without fully thinking through.

Agreeing to give the talk could also motivate me to write a paper on the topic, which I’d like to do, given that it’s been bugging me for about 35 years now.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than Courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

Multilevel models for taxonomic data structures


This post from John Cook about the hierarchy of U.S. census divisions (nation, region, division, state, county, census tract, block group, census block, household, family, person) reminded me of something that I’ve discussed before but have never really worked out, which is the construction of families of hierarchical models for taxonomic data structures.

A few ideas come into play here:

1. Taxonomic structures come up a lot in social science. Cook gives the example of geography. Two other examples are occupation and religion, each of which can be categorized in many layers. Check out the Census Bureau’s industry and occupation classifications or the many categories of religion, religious denomination, etc.

2. Hierarchical modeling. If you have a national survey of, say, 2000 people classified by county, you won’t want to just do a simple “8 schools”-style hierarchical model, because if you do that you’ll pool the small counties pretty much all the way to the national average. You’ll want to include county-level predictors. One way of doing this is to include indicators for state, division, and region: that is, a hierarchical model over the taxonomy. This would be a simple example of the “unfolding flower” structure that expands to include more model structure as more data come in, thus avoiding the problem that the usual forms of weak priors are very strong as the dimensionality of the model increases while the data remain fixed (see Section 3 of my Bayesian model-building by pure thought paper from 1996).

I think there’s a connection here to quantum mechanics, in the way that, when a system is heated, new energy levels appear.

3. Taxonomies tend to have fractal structure. Some nodes have lots of branches and lots of depth, others just stop. For example, in the taxonomy of religious denominations, Christians can be divided into Catholics, Protestants, and others, then Protestants can be divided into mainline and evangelical, each of which can be further subdivided, whereas some of the others, such as Mormons, might not be divided at all. Similarly, some index entries in a book will have lots of sub-entries and can go on for multiple columns, within which some sub-entries have many sub-sub-entries, etc., and then other entries are just one line. You get the sort of long-tailed distribution that’s characteristic of self-similar processes. Mandelbrot wrote about this back in 1955! This might be his first publication of any kind about fractals, a topic that he productively chewed on for several decades more.
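Returning to point 2, the taxonomy-as-indicators idea can be sketched with a small simulation (all counts and standard deviations here are invented for the sketch; a real analysis would fit the model in Stan or lme4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical taxonomy: 4 regions, 3 divisions per region, 5 states per division.
n_regions, n_div_per, n_state_per = 4, 3, 5
n_div = n_regions * n_div_per      # 12
n_state = n_div * n_state_per      # 60
region_of_div = np.repeat(np.arange(n_regions), n_div_per)
div_of_state = np.repeat(np.arange(n_div), n_state_per)

# Effects at each level of the tree (sds invented; deeper levels vary less).
region_eff = rng.normal(0, 1.0, n_regions)
div_eff = rng.normal(0, 0.5, n_div)
state_eff = rng.normal(0, 0.25, n_state)

# A state's expected value sums the effects along its path through the taxonomy,
# so states sharing a division are more alike than states sharing only a region.
state_mean = (region_eff[region_of_div[div_of_state]]
              + div_eff[div_of_state]
              + state_eff)

# For a state with no data, the model predicts its division's mean rather than
# pooling all the way to the national average -- the payoff of the taxonomy.
div_mean = region_eff[region_of_div] + div_eff
print(state_mean.shape, div_mean.shape)
```

The same path-sum construction extends down to county and tract, which is what keeps small areas from being shrunk all the way to the national mean.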

My colleagues and I have worked with some taxonomic models for voting based on geography:
– Section 5 of our 2002 article on the mathematics and statistics of voting power,
– Our recent unpublished paper, How democracies polarize: A multilevel perspective.
These are sort of like spatial models or network models, but structured specifically based on a hierarchical taxonomy. Both papers have some fun math.

I think there are general principles here, or at least something more to be said on the topic.

Hey! Here are some amazing articles by George Box from around 1990. Also there’s some mysterious controversy regarding his center at the University of Wisconsin.

The webpage is maintained by John Hunter, son of Box’s collaborator William Hunter, and I came across it because I was searching for background on the paper-helicopter example that we use in our classes to teach principles of experimental design and data analysis.

There’s a lot to say about the helicopter example and I’ll save that for another post.

Here I just want to talk about how much I enjoyed reading these thirty-year-old Box articles.

A Box Set from 1990

Many of the themes in those articles continue to resonate today. For example:

• The process of learning. Here’s Box from his 1995 article, “Total Quality: Its Origins and its Future”:

Scientific method accelerated that process in at least three ways:

1. By experience in the deduction of the logical consequences of the group of facts each of which was individually known but had not previously been brought together.

2. By the passive observation of systems already in operation and the analysis of data coming from such systems.

3. By experimentation – the deliberate staging of artificial experiences which often might ordinarily never occur.

A misconception is that discovery is a “one shot” affair. This idea dies hard. . . .

• Variation over time. Here’s Box from his 1989 article, “Must We Randomize Our Experiment?”:

We all live in a non-stationary world; a world in which external factors never stay still. Indeed the idea of stationarity – of a stable world in which, without our intervention, things stay put over time – is a purely conceptual one. The concept of stationarity is useful only as a background against which the real non-stationary world can be judged. For example, the manufacture of parts is an operation involving machines and people. But the parts of a machine are not fixed entities. They are wearing out, changing their dimensions, and losing their adjustment. The behavior of the people who run the machines is not fixed either. A single operator forgets things over time and alters what he does. When a number of operators are involved, the opportunities for change because of failures to communicate are further multiplied. Thus, if left to itself any process will drift away from its initial state. . . .

Stationarity, and hence the uniformity of everything depending on it, is an unnatural state that requires a great deal of effort to achieve. That is why good quality control takes so much effort and is so important. All of this is true, not only for manufacturing processes, but for any operation that we would like to be done consistently, such as the taking of blood pressures in a hospital or the performing of chemical analyses in a laboratory. Having found the best way to do it, we would like it to be done that way consistently, but experience shows that very careful planning, checking, recalibration and sometimes appropriate intervention, is needed to ensure that this happens.

Here’s an example, from Box’s 1992 article, “How to Get Lucky”:

For illustration Figure 1(a) shows a set of data designed to seek out the source of unacceptably large variability which, it was suspected, might be due to small differences in five, supposedly identical, heads on a machine. To test this idea, the engineer arranged that material from each of the five heads was sampled at roughly equal intervals of time in each of six successive eight-hour periods. . . . the same analysis strongly suggested that real differences in means occurred between the six eight-hour periods of time during which the experiment was conducted. . . .

• Workflow. Here’s Box from his 1999 article, “Statistics as a Catalyst to Learning by Scientific Method Part II-Discussion”:

Most of the principles of design originally developed for agricultural experimentation would be of great value in industry, but most industry experimentation differed from agricultural experimentation in two major respects. These I will call immediacy and sequentially.

What I mean by immediacy is that for most of our investigations the results were available, if not within hours, then certainly within days and in rare cases, even within minutes. This was true whether the investigation was conducted in a laboratory, a pilot plant or on the full scale. Furthermore, because the experimental runs were usually made in sequence, the information obtained from each run, or small group of runs, was known and could be acted upon quickly and used to plan the next set of runs. I concluded that the chief quarrel that our experimenters had with using “statistics” was that they thought it would mean giving up the enormous advantages offered by immediacy and sequentially. Quite rightly, they were not prepared to make these sacrifices. The need was to find ways of using statistics to catalyze a process of investigation that was not static, but dynamic.

There’s lots more. It’s funny to read these things that Box wrote back then, that I and others have been saying over and over again in various informal contexts, decades later. It’s a problem with our statistical education (including my own textbooks) that these important ideas are buried.

More Box

A bunch of articles by Box, with some overlap but not complete overlap with the above collection, is at the site of the University of Wisconsin, where he worked for many years. Enjoy.

Some kinda feud is going on

John Hunter’s page also has this:

The Center for Quality and Productivity Improvement was created by George Box and Bill Hunter at the University of Wisconsin-Madison in 1985.

In the first few years reports were published by leading international experts including: W. Edwards Deming, Kaoru Ishikawa, Peter Scholtes, Brian Joiner, William Hunter and George Box. William Hunter died in 1986. Subsequently excellent reports continued to be published by George Box and others including: Gipsie Ranney, Soren Bisgaard, Ron Snee and Bill Hill.

These reports were all available on the Center’s web site. After George Box’s death the reports were removed. . . .

It is a sad situation that the Center abandoned the ideas of George Box and Bill Hunter. I take what has been done to the Center as a personal insult to their memory. . . .

When diagnosed with cancer my father dedicated his remaining time to creating this center with George to promote the ideas George and he had worked on throughout their lives: because it was that important to him to do what he could. They did great work and their work provided great benefits for long after Dad’s death with the leadership of Bill Hill and Soren Bisgaard but then it deteriorated. And when George died the last restraint was eliminated and the deterioration was complete.

Wow. I wonder what the story was. I asked someone I know who works at the University of Wisconsin and he had no idea. Box died in 2013 so it’s not so long ago; there must be some people who know what happened here.

For how many iterations should we run Markov chain Monte Carlo?

So, Charles and I wrote this article. It’s for the forthcoming Handbook of Markov Chain Monte Carlo, 2nd edition. We still have an opportunity to revise it, so if you see anything wrong or misleading, or if there are other issues you think we should mention, please let us know right here in comments!

Remember, it’s just one chapter of an entire book—it’s intended to be an updated version of my chapter with Kenny, Inference from Simulations and Monitoring Convergence, from the first edition. For our new chapter, Charles liked the snappier, “How many iterations,” title.

In any case, our chapter does not have to cover all of MCMC, just inference and monitoring convergence. But I think Charles is right that, once you think about inference and monitoring convergence, you want to step back and consider design: in particular, how many chains to run, where to start them, and how many iterations to run per chain.

Again, all comments will be appreciated.

Hydrology Corner: How to compare outputs from two models, one Bayesian and one non-Bayesian?

Zac McEachran writes:

I am a Hydrologist and Flood Forecaster at the National Weather Service in the Midwest. I use some Bayesian statistical methods in my research work on hydrological processes in small catchments.

I recently came across a project that I want to use a Bayesian analysis for, but I am not entirely certain what to look for to get going on this. My issue: NWS uses a protocol for calibrating our river models using a mixed conceptual/physically-based model. We want to assess whether a new calibration is better than an old calibration. This seems like a great application for a Bayesian approach. However, a lot of the literature I am finding (and methods I am more familiar with) are associated with assessing goodness-of-fit and validation for models that were fit within a Bayesian framework, and then validated in a Bayesian framework. I am interested in assessing how a non-Bayesian model output compares with another non-Bayesian model output with respect to observations. Someday I would like to learn to use Bayesian methods to calibrate our models but one step at a time!

My response: I think you need somehow to give a Bayesian interpretation to your non-Bayesian model output. This could be as simple as taking 95% prediction intervals and interpreting them as 95% posterior intervals from a normally-distributed posterior. Or if the non-Bayesian fit only gives point estimates, then do some bootstrapping or something to get an effective posterior. Then you can use external validation or cross validation to compare the predictive distributions of your different models, as discussed here; also see Aki’s faq on cross validation.
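Here’s a toy sketch of what that could look like (the data, the two “calibrations,” and the crude normal predictive distribution are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for observed flows plus point predictions from two calibrations
# (all numbers invented; the "new" calibration has smaller errors by design).
obs = rng.normal(10, 2, size=200)
pred_old = obs + rng.normal(0, 1.5, size=200)
pred_new = obs + rng.normal(0, 1.0, size=200)

def mean_log_score(obs, pred, n_boot=2000):
    """Bootstrap the residual sd to get an effective normal predictive
    distribution, then compute the mean log predictive density."""
    resid = obs - pred
    boot_sds = np.array([rng.choice(resid, size=resid.size, replace=True).std()
                         for _ in range(n_boot)])
    sigma = boot_sds.mean()
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - 0.5 * (resid / sigma) ** 2)

score_old = mean_log_score(obs, pred_old)
score_new = mean_log_score(obs, pred_new)
print(score_old, score_new)   # higher mean log score = better calibration
```

In a real comparison you would compute the scores on held-out events rather than the fitting period, but the scoring logic is the same.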

A Hydrologist and Flood Forecaster . . . how cool is that?? Last time we had this level of cool was back in 2009 when we were contacted by someone who was teaching statistics to firefighters.

Blue Rose Research is hiring (yet again) !

Blue Rose Research has a few roles that we’re actively hiring for as we gear up to elect more Democrats in 2024, and advance progressive causes!

A bit about our work:

  • For the 2022 US election, we used engineering and statistics to advise major progressive organizations on directing hundreds of millions of dollars to the right ads and states.
  • We tested thousands of ads and talking points in the 2022 election cycle and partnered with orgs across the space to ensure that the most effective messages were deployed from the state legislative level all the way up to Senate and Gubernatorial races and spanning the issue advocacy space as well.
  • We were more accurate than public polling in identifying which races were close across the Senate, House, and Gubernatorial maps.
  • And we’ve built up a technical stack that enables us to continue to build on innovative machine learning, statistical, and engineering solutions.

Now as we are looking ahead to 2024, we are hiring for the following positions:

All positions are remote, with optional office time with the team in New York City.

Please don’t hesitate to reach out with any questions ([email protected]).

Postdoc on Bayesian methodological and applied work! To optimize patient care! Using Stan! In North Carolina!

Sam Berchuck writes:

I wanted to bring your attention to a postdoc opportunity in my group at Duke University in the Department of Biostatistics & Bioinformatics. The full job ad is here: https://forms.stat.ufl.edu/statistics-jobs/entry/10978/.

The postdoc will work on Bayesian methodological and applied work, with a focus on modeling complex longitudinal biomedical data (including electronic health records and mobile health data) to create data-driven approaches to optimize patient care among patients with chronic diseases. The position will be particularly interesting to people interested in applying Bayesian statistics in real-world big data settings. We are looking for people who have experience in Bayesian inference techniques, including Stan!

Interesting. In addition to the Stan thing, I’m interested in data-driven approaches to optimize patient care. This is an area where a Bayesian approach, or something like it, is absolutely necessary, as you typically just won’t have enough data to make firm conclusions about individual effects, so you have to keep track of uncertainty. Sounds like a wonderful opportunity.

Bayes factors evaluate priors, cross validations evaluate posteriors

I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.

Bayes factors

Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data y, one model p_1(\theta_1, y) with parameters \theta_1 and a second model p_2(\theta_2, y) with parameters \theta_2.

The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,

\textrm{BF}_{1,2} = p_1(y) \, / \, p_2(y),

where we have

p_1(y) = \mathbb{E}[p_1(y \mid \Theta_1)] \ = \ \int p_1(y \mid \theta_1) \cdot p_1(\theta_1) \, \textrm{d}\theta_1


p_2(y) = \mathbb{E}[p_2(y \mid \Theta_2)] \ = \ \int p_2(y \mid \theta_2) \cdot p_2(\theta_2) \, \textrm{d}\theta_2.

The distributions p_1(y) and p_2(y) are known as prior predictive distributions because they integrate the likelihood over the prior.

There are ad-hoc guidelines from Harold Jeffreys of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.
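For concreteness, here is a toy Monte Carlo computation of the marginal likelihoods above (the model and all numbers are invented; the log-mean-exp step just keeps the average numerically stable):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.5, 1, size=50)   # invented data, y ~ normal(theta, 1)

def log_marginal(y, prior_sd, n_draws=100_000):
    """Monte Carlo estimate of log p(y) = log E[p(y | theta)], with the
    expectation taken over the prior theta ~ normal(0, prior_sd)."""
    theta = rng.normal(0, prior_sd, size=n_draws)
    loglik = (-0.5 * len(y) * np.log(2 * np.pi)
              - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1))
    m = loglik.max()                      # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(loglik - m)))

log_bf = log_marginal(y, prior_sd=1) - log_marginal(y, prior_sd=100)
print(log_bf)   # positive: the Bayes factor favors the tighter prior
```

Even though both priors lead to essentially the same posterior here, the wide prior spreads its prior predictive mass over a huge range and so gets penalized, which is the sensitivity discussed below.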

Held-out validation

Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data y into two pieces, y = (y', y''), then training on y' and testing on y''. The held-out validation values are

p_1(y'' \mid y') = \mathbb{E}[p_1(y'' \mid \Theta_1) \mid y'] = \int p_1(y'' \mid \theta_1) \cdot p_1(\theta_1 \mid y') \, \textrm{d}\theta_1


p_2(y'' \mid y') = \mathbb{E}[p_2(y'' \mid \Theta_2) \mid y'] = \int p_2(y'' \mid \theta_2) \cdot p_2(\theta_2 \mid y') \, \textrm{d}\theta_2.

The distributions p_1(y'' \mid y') and p_2(y'' \mid y') are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.

This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.
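Here is the held-out computation worked out for the simplest conjugate case, y ~ normal(theta, 1) with theta ~ normal(0, prior_sd), chosen so the posterior predictive density is available in closed form (data and prior scales invented):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1, size=100)     # invented data, y ~ normal(theta, 1)
y_train, y_test = y[:80], y[80:]

def heldout_lpd(y_train, y_test, prior_sd):
    """Log posterior predictive density of y_test given y_train for the
    conjugate model y ~ normal(theta, 1), theta ~ normal(0, prior_sd)."""
    post_prec = 1 / prior_sd**2 + len(y_train)   # posterior precision of theta
    post_mean = y_train.sum() / post_prec
    pred_var = 1 + 1 / post_prec                 # variance for one new y
    return np.sum(-0.5 * np.log(2 * np.pi * pred_var)
                  - 0.5 * (y_test - post_mean) ** 2 / pred_var)

lpd_tight = heldout_lpd(y_train, y_test, prior_sd=1)
lpd_wide = heldout_lpd(y_train, y_test, prior_sd=100)
print(lpd_tight, lpd_wide)   # nearly identical: the data wash out the prior
```

With 80 training points, the posterior barely depends on whether the prior scale is 1 or 100, so the two held-out scores nearly coincide, in contrast to the prior predictive computation above.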

(Leave one out) cross validation

Suppose our data is y_1, \ldots, y_N. Leave-one-out cross validation works by successively taking y'' = y_n and y' = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N and then averaging on the log scale.

\frac{1}{N} \sum_{n=1}^N \log\left( \strut p_1(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right)


\frac{1}{N} \sum_{n=1}^N \log \left( \strut p_2(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right).

Leave-one-out cross validation is interpretable as the expected log posterior density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.
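For the same toy conjugate model, exact leave-one-out is a short loop (in practice you would estimate this with PSIS-LOO via the loo package or ArviZ rather than refitting N times; the refit-per-point version is shown only because it matches the formula above):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 1, size=100)   # invented data, y ~ normal(theta, 1)

def loo_elpd(y, prior_sd):
    """Exact leave-one-out log predictive density, averaged over data points,
    for the conjugate model y ~ normal(theta, 1), theta ~ normal(0, prior_sd)."""
    n = len(y)
    lp = np.empty(n)
    for i in range(n):
        y_rest = np.delete(y, i)             # hold out point i
        post_prec = 1 / prior_sd**2 + (n - 1)
        post_mean = y_rest.sum() / post_prec
        pred_var = 1 + 1 / post_prec
        lp[i] = (-0.5 * np.log(2 * np.pi * pred_var)
                 - 0.5 * (y[i] - post_mean) ** 2 / pred_var)
    return lp.mean()

elpd = loo_elpd(y, prior_sd=1)
print(elpd)   # estimated ELPD per data point
```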

Conclusion and a question

The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors.

This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.
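The sensitivity claim is easy to check numerically in the conjugate normal model (all numbers invented; the marginal likelihood uses the standard matrix determinant lemma and Sherman-Morrison identities for a normal with a rank-one covariance term):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(1.0, 1, size=300)   # a few hundred data points, true theta = 1

def posterior_and_evidence(y, prior_sd):
    """For y ~ normal(theta, 1), theta ~ normal(0, prior_sd): return the
    posterior mean and sd of theta and the log marginal likelihood log p(y)."""
    n = len(y)
    post_prec = 1 / prior_sd**2 + n
    post_mean = y.sum() / post_prec
    # Marginally, y ~ normal(0, I + prior_sd^2 * 11'); determinant and
    # quadratic form follow from the rank-one update formulas.
    log_det = np.log1p(n * prior_sd**2)
    quad = y @ y - prior_sd**2 / (1 + n * prior_sd**2) * y.sum() ** 2
    log_p_y = -0.5 * (n * np.log(2 * np.pi) + log_det + quad)
    return post_mean, post_prec ** -0.5, log_p_y

m1, s1, lp1 = posterior_and_evidence(y, prior_sd=1)
m2, s2, lp2 = posterior_and_evidence(y, prior_sd=100)
print(abs(m1 - m2))   # posteriors nearly identical
print(lp1 - lp2)      # but the log Bayes factor is large
```

The two posterior means differ by a fraction of a percent, while the Bayes factor between the two priors is on the order of e^4, roughly 60 to 1, despite the models making essentially the same predictions.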

My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.

Further reading

For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online through the link, particularly Section 7.2 (“Information criteria and cross-validation,” p. 175) and Section 7.4 (“Model comparison using Bayes factors,” p. 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Difference-in-differences: What’s the difference?

After giving my talk last month, Better Than Difference in Differences, I had some thoughts about how diff-in-diff works—how the method operates in relation to its assumptions—and it struck me that there are two relevant ways to think about it.

From a methods standpoint the relevance here is that I will usually want to replace differencing with regression. Instead of taking (yT – yC) – (xT – xC), where T = Treatment and C = Control, I’d rather look at (yT – yC) – b*(xT – xC), where b is a coefficient estimated from the data, likely to be somewhere between 0 and 1. Difference-in-differences is the special case b=1, and in general you should be able to do better by estimating b. We discuss this with the Electric Company example in chapter 19 of Regression and Other Stories and with a medical trial in our paper in the American Heart Journal.
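A quick simulation of this point (setup and all numbers hypothetical): when the true pre/post relation has slope less than 1, both estimators are unbiased for the effect, but estimating b gives a less noisy answer.

```python
import numpy as np

rng = np.random.default_rng(6)

def one_experiment(b_true=0.5, effect=2.0, n=50):
    """One simulated two-period study in which the post-period outcome y has
    slope b_true < 1 on the pre-period measure x (regression to the mean)."""
    xT, xC = rng.normal(0, 1, n), rng.normal(0, 1, n)
    yT = b_true * xT + effect + rng.normal(0, 1, n)
    yC = b_true * xC + rng.normal(0, 1, n)
    # classic diff-in-diff: implicitly forces b = 1
    did = (yT.mean() - yC.mean()) - (xT.mean() - xC.mean())
    # regression: estimate b from the data along with the treatment effect
    X = np.column_stack([np.ones(2 * n),
                         np.concatenate([xT, xC]),
                         np.concatenate([np.ones(n), np.zeros(n)])])
    beta = np.linalg.lstsq(X, np.concatenate([yT, yC]), rcond=None)[0]
    return did, beta[2]          # beta[2] is the estimated treatment effect

sims = np.array([one_experiment() for _ in range(2000)])
print(sims.mean(axis=0))   # both approaches are unbiased (near 2.0)
print(sims.std(axis=0))    # but the regression estimate is less noisy
```

The gap in standard errors grows as the true b moves away from 1, which is exactly the case where forced differencing adds noise.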

Given this, what’s the appeal of diff-in-diff? I think the appeal of the method comes from the following mathematical sequence:

Control units:
(a) Data at time 0 = Baseline + Error_a
(b) Data at time 1 = Baseline + Trend + Error_b

Treated units:
(c) Data at time 0 = Baseline + Error_c
(d) Data at time 1 = Baseline + Trend + Effect + Error_d

Now take a diff in diff:

((d) – (c)) – ((b) – (a)) = Effect + Error,

where that last Error is a difference in difference of errors, which is just fine under the reasonable-enough assumption that the four error terms are independent.

The above argument looks pretty compelling and can easily be elaborated to include nonlinear trends, multiple time points, interactions, and so forth. That’s the direction of the usual diff-in-diff discussions.

The message of my above-linked talk and our paper, though, was different. Our point was that, whatever differencing you take, it’s typically better to difference only some of the way. Or, to make the point more generally, it’s better to model the baseline and the trend as well as the effect.

Seductive equations

The above equations are seductive: with just some simple subtraction, you can cancel out Baseline and Trend, leaving just Effect and error. And the math is correct (conditional on the assumptions, which can be reasonable). The problem is that the resulting estimate can be super noisy; indeed, it’s basically never the right thing to do from a probabilistic (Bayesian) standpoint.

In our example it was pretty easy in retrospect to do the fully Bayesian analysis. It helped that we had 38 replications of similar experiments, so we could straightforwardly estimate all the hyperparameters in the model. If you only have one experiment, your inferences will depend on priors that can’t directly be estimated from local data. Still, I think the Bayesian approach is the way to go, in the sense of yielding effect-size estimates that are more reasonable and closer to the truth.

Next step is to work this out on some classic diff-in-diff examples.

Donald Trump’s and Joe Biden’s ages and conditional probabilities of their dementia risk

The prior

Paul Campos posted something on the ages of the expected major-party candidates for next year’s presidential election:

Joe Biden is old. Donald Trump is also old. A legitimate concern about old people in important positions is that they are or may become cognitively impaired. (For example, the prevalence of dementia doubles every five years from age 65 onward, which means that on average an 80-year-old is eight times more likely to have dementia than a 65-year-old.)

Those are the baseline probabilities, which is one reason that Nate Silver wrote:

Of course Biden’s age is a legitimate voter concern. So is Trump’s, but an extra four years makes a difference . . . The 3.6-year age difference between Biden and Trump is potentially meaningful, at least based on broad population-level statistics. . . . The late 70s and early 80s are a period when medical problems often get much worse for the typical American man.

Silver also addressed public opinion:

An AP-NORC poll published last week found that 77 percent of American adults think President Biden is “too old to be effective for four more years”; 51 percent of respondents said the same of Donald Trump. . . . the differences can’t entirely be chalked up to partisanship — 74 percent of independents also said that Biden was too old, while just 48 percent said that of Trump.

The likelihood

OK, those are the base rates. What about the specifics of this case? Nate compares to other politicians but doesn’t offer anything about Biden or Trump specifically.

Campos writes:

Recently, Trump has been saying things that suggest he’s becoming deeply confused about some very basic and simple facts. For example, this weekend he gave a speech in which he seemed to be under the impression that Jeb Bush had as president launched the second Iraq war . . . In a speech on Friday, Trump claimed he defeated Barack Obama in the 2016 presidential election. . . . Trump also has a family history of dementia, which is a significant risk factor in terms of developing it himself.

Campos notes that Biden’s made his own slip-ups (“for example he claimed recently that he was at the 9/11 Twin Towers site the day after the attack, when in fact it was nine days later”) and summarizes:

I think it’s silly to deny that Biden’s age isn’t a legitimate concern in the abstract. Yet based on the currently available evidence, Trump’s age is, given his recent ramblings, a bigger concern.

It’s hard to know. A quick google turned up this:

On July 21, 2021, during a CNN town hall, President Joe Biden was asked when children under 12 would be able to get COVID-19 vaccinations. Here’s the start of his answer to anchor Don Lemon: “That’s under way just like the other question that’s illogical, and I’ve heard you speak about it because y’all, I’m not being solicitous, but you’re always straight up about what you’re doing. And the question is whether or not we should be in a position where you uh um, are why can’t the, the, the experts say we know that this virus is in fact uh um uh is, is going to be, excuse me.”

On June 18, after Biden repeatedly confused Libya and Syria in a news conference, The Washington Post ran a long story about 14 GOP lawmakers who asked Biden to take a cognitive test. The story did not note any of the examples of Biden’s incoherence and focused on, yes, Democrats’ concerns about Trump’s mental health.

And then there’s this, from a different news article:

Biden has also had to deal with some misinformation, including the false claim that he fell asleep during a memorial for the Maui wildfire victims. Conservatives — including Fox News host Sean Hannity — circulated a low-quality video on social media to push the claim, even though a clearer version of the moment showed that the president simply looked down for about 10 seconds.

There’s a lot out there on the internet. One of the difficulties with thinking about Trump’s cognitive capacities is that he’s been saying crazy things for years, so when he responds to a question about the Russia-Ukraine war by talking about windmills, that’s compared not to a normal politician but to the various false and irrelevant things he’s been saying for years.

I’m not going to try to assess or summarize the evidence regarding Biden’s or Trump’s cognitive abilities—it’s just too difficult given that all we have is anecdotal evidence. Both often seem disconnected from the moment, compared to previous presidents. And, yes, continually being in the public eye can expose weaknesses. And Trump’s statements have been disconnected from reality for so long that this seems separate from dementia, even if it could have similar effects from a policy perspective.

Combining the prior and the likelihood

When comparing Biden and Trump regarding cognitive decline, we have three pieces of information:

1. Age. This is what I’m calling the base rate, or prior. Based on the numbers above, someone of Biden’s age is about 1.6 times more likely to get dementia than someone of Trump’s age.

2. Medical and family history. This seems less clear, but from the above information it seems that someone with Trump’s history is more at risk of dementia than someone with Biden’s.

3. Direct observation. Just so hard to compare. That’s why there are expert evaluations, but it’s not like a bunch of experts are gonna be given access to evaluate the president and his challenger.

This seems like a case where data source 3 should have much more evidence than 1 and 2. (It’s hard for me to evaluate 1 vs. 2; my quick guess would be that they are roughly equally relevant.) But it’s hard to know what to do with 3, given that no systematic data have been collected.

This raises an interesting statistical point, which is how to combine the different sources of information. Nate Silver looks at item 1 and pretty much sets aside items 2 and 3. In contrast, Paul Campos says that 1 and 2 pretty much cancel and that the evidence from item 3 is strong.

I’m not sure what’s the right way to look at this problem. I respect Silver’s decision not to touch item 3 (“As for what to make of Biden and Trump in particular — look, I have my judgments and you have yours. Cognitively, they both seem considerably less sharp to me than they did in their primes”); on the other hand, there seems to be so much direct evidence that I’d think it would overwhelm a base-rate odds ratio of 1.6.
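To make the combination concrete, one can work on a ratio scale, multiplying the base-rate ratio by likelihood ratios for the other sources. Here is a toy sketch; only the 1.6 comes from the ages discussed above, and the other two ratios are pure placeholders, not estimates of anything:

```python
# Toy combination of the three information sources on a ratio scale.
# Only the 1.6 base-rate ratio comes from the post; the other two numbers
# are hypothetical placeholders to show the arithmetic, not estimates.
base_rate_ratio = 1.6     # item 1: Biden vs. Trump dementia risk, by age
lr_history = 1 / 1.6      # item 2: hypothetical, roughly canceling item 1
lr_observation = 1.0      # item 3: hypothetical; who knows how to quantify it

combined_ratio = base_rate_ratio * lr_history * lr_observation
print(combined_ratio)     # ~1.0 under these made-up numbers
```

Under Campos’s reading, items 1 and 2 roughly cancel and item 3 dominates; under Silver’s, item 1 is the only piece we can really quantify.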

News media reporting

The other issue is news media coverage. Silver argues that the news media should be spending more time discussing the statistical probability of dementia or death as a function of age, in the context of Biden and Trump, and one of his arguments is that voters are correct to be more concerned about the health of the older man.

Campos offers a different take:

Nevertheless, Biden’s age is harped on ceaselessly by the media, while Trump apparently needs to pull a lampshade over his head and start talking about how people used to wear onions on their belts before the same media will even begin to talk about the exact same issue in regard to him, and one that, given his recent behavior, seems much more salient as a practical matter.

From Campos’s perspective, voters’ impressions are a product of news media coverage.

But on the internet you can always find another take, such as this from Roll Call magazine, which quotes a retired Democratic political consultant as saying, “the mainstream media has performed a skillful dance around the issue of President Biden’s age. . . . So far, it is kid gloves coverage for President Biden.” On the other hand, the article also says, “Then there’s Trump, who this week continued his eyebrow-raising diatribes on his social media platform after recently appearing unable, at times, to communicate complete thoughts during a Fox News interview.”

News coverage in 2020

I recall that age was discussed a lot in the news media during the 2020 primaries, when Biden and Sanders were running against several much younger opponents. It didn’t come up so much in the 2020 general election because (a) Trump is almost as old as Biden, and (b) Trump had acted so erratically as president that it was hard to line up his actual statements and behaviors with the more theoretical, actuarial risk associated with Biden’s age.

Statistical summary

Comparing Biden and Trump, it’s not clear what to do with the masses of anecdotal data; on the other hand, it doesn’t seem quite right to toss all that out and just go with the relatively weak information from the base rates.

I guess this happens a lot in decision problems. You have some highly relevant information that is hard to quantify, along with some weaker, but quantifiable statistics. In their work on cognitive illusions, Tversky and Kahneman noted the fallacy of people ignoring base rates, but there can be the opposite problem of holding on to base rates too tightly, what we’ve called slowness to update. In general, we seem to have difficulty balancing multiple pieces of information.

P.S. Some discussion in comments about links between age and dementia, or diminished mental capacities, also some discussion about evidence for Trump’s and Biden’s problems. The challenge remains of how to put these two pieces of information together. I find it very difficult to think about this sort of question where the available data are clearly relevant yet have such huge problems with selection. There’s a temptation to fall back on base rates but that doesn’t seem right to me either.

P.P.S. I emailed Campos and Silver regarding this post. Campos followed up here. I didn’t hear back from Silver, but I might not have his current email, so if anyone has that, could you please send it to me? Thanks.

My SciML Webinar next week (28 Sep): Multiscale generalized Hamiltonian Monte Carlo with delayed rejection

I’m on the hook to do a SciML webinar next week:

These are organized by Keith Phuthi (who is at CMU) through University of Michigan’s Institute for Computational Discovery and Engineering.

Sam Livingstone is moderating. I’ll be presenting joint work with Alex Barnett, Chirag Modi, Edward Roualdes, and Gilad Turok.

I’m very excited about this project as it combines a number of threads I’ve been working on with collaborators. When I gave my job talk here, Leslie Greengard, our center director, asked me why we didn’t use variable step size integrators when doing Hamiltonian Monte Carlo. I told him we’d love to, but didn’t know how to do it in such a way as to preserve the stationary target distribution.

Delayed rejection HMC

Then we found Antonietta Mira’s work on delayed rejection. It lets you try a second Metropolis proposal if the first one is rejected. The key here is that we can use a smaller step size for the second proposal, thus recovering from proposals that are rejected because the Hamiltonian diverged (i.e., the first-order gradient-based algorithm can’t handle regions of high curvature in the target density). There’s a bit of bookkeeping (which is frustratingly hard to write down) for the Hastings condition to ensure detailed balance. Chirag Modi, Alex Barnett, and I worked out the details, and Chirag figured out a novel twist on delayed rejection that only retries if the original acceptance probability was low. You can read about it in our paper:
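To give a flavor of the mechanics, here is a minimal sketch of delayed rejection for a plain one-dimensional random-walk Metropolis sampler (a cartoon, not our HMC version and not Chirag’s low-acceptance-only variant): a bold first proposal, then a smaller second proposal with Mira’s correction factor in the acceptance probability to preserve detailed balance.

```python
import numpy as np

def dr_metropolis(logp, x0, n_iter=5000, s1=2.0, s2=0.5, seed=0):
    """Two-stage delayed-rejection random-walk Metropolis (a sketch)."""
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        # Stage 1: bold Gaussian proposal
        y1 = x + s1 * rng.normal()
        a1 = np.exp(min(0.0, logp(y1) - logp(x)))
        if rng.random() < a1:
            x = y1
        else:
            # Stage 2: timid proposal, with the delayed-rejection correction
            y2 = x + s2 * rng.normal()
            a1_rev = np.exp(min(0.0, logp(y1) - logp(y2)))  # alpha1(y2 -> y1)
            # ratio of stage-1 proposal densities q1(y2 -> y1) / q1(x -> y1)
            log_q = ((y1 - x) ** 2 - (y1 - y2) ** 2) / (2 * s1 ** 2)
            log_a2 = (logp(y2) - logp(x) + log_q
                      + np.log(max(1 - a1_rev, 1e-300))
                      - np.log(max(1 - a1, 1e-300)))
            if rng.random() < np.exp(min(0.0, log_a2)):
                x = y2
        chain.append(x)
    return np.array(chain)
```

On a target with both wide and narrow regions, the second stage rescues proposals that the bold step would have wasted.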

This works really well and is enough that we can get proper draws from Neal’s funnel (vanilla HMC fails on this example in either the mouth or the neck of the funnel, depending on the step size). But it’s inefficient in that it retries an entire Hamiltonian trajectory, which means if we cut the step size in half, we double the number of steps to keep the integration time constant.

Radford Neal to the rescue

As we were doing this, the irrepressible Radford Neal published a breakthrough algorithm:

What he managed to do was use generalized Hamiltonian Monte Carlo (G-HMC) to build an algorithm that takes one step of HMC (like Metropolis-adjusted Langevin, but over the coupled position/momentum variables) and manages to maintain directed progress. Instead of fully resampling the momentum each iteration, G-HMC resamples a new momentum value and then takes a weighted average with the existing momentum, with most of the weight on the existing momentum. Neal shows that with a series of accepted one-step HMC iterations, we can make directed progress just like HMC with longer trajectories.

The trick is stringing together sequences of acceptances. Usually this doesn’t work because we have to flip the momentum each iteration. We can re-flip it when regenerating, to keep going in the same direction on acceptances, but on rejections we reverse the momentum (this isn’t an issue with standard HMC because it fully resamples the momentum each time). So to get directed movement with the usual independent uniform draws, we’d need step sizes that are too small. What Radford figured out is that we can solve this problem by changing the way we generate the uniform(0, 1)-distributed probabilities for the Metropolis accept/reject step (we compare the generated variate to the ratio of the density at the proposal to the density at the previous point, and accept if the variate is lower). Radford realized that if we instead generate them in a sawtooth pattern (with micro-jitter for ergodicity), then when we’re at the bottom of the sawtooth, generating a sequence of values near zero, the acceptances will cluster together.
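Here is a cartoon of the two ingredients (my simplification for illustration, not Radford’s exact update rules): a partial momentum refresh that preserves the standard normal momentum distribution, and an accept-step sequence that sweeps through (0, 1) instead of being drawn independently.

```python
import numpy as np

rng = np.random.default_rng(0)

def refresh_momentum(p, alpha=0.95):
    """Partial momentum refresh for G-HMC: keep most of the old momentum,
    mix in a little fresh noise. If p ~ N(0, 1), the result is still N(0, 1);
    alpha = 0 recovers full resampling, as in standard HMC."""
    return alpha * p + np.sqrt(1.0 - alpha ** 2) * rng.normal(size=np.shape(p))

def sawtooth_uniforms(n, period=50, jitter=0.02):
    """A cartoon of the sawtooth idea: accept-step 'uniforms' sweep slowly
    from 0 to 1 with micro-jitter, so the small values (easy accepts)
    come clustered together instead of scattered."""
    u = (np.arange(n) % period) / period + jitter * rng.random(n)
    return u % 1.0
```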

Replacing Neal’s trick with delayed rejection

Enter Chirag’s and my intern, Gilad Turok (who came to us as an undergrad in applied math at Columbia). Over the summer, working with me, Chirag, and Edward Roualdes (who was here as a visitor), he built and evaluated a system that replaces Neal’s trick (the sawtooth pattern of acceptance probabilities) with Mira’s trick (delayed rejection). It indeed solves the multiscale problem. It exceeded our expectations in terms of efficiency—it’s about twice as fast as our delayed rejection HMC. Going one HMC step at a time, it is able to adjust its step size within what would be a single Hamiltonian trajectory. That is, we finally have something that works roughly like a typical ODE integrator in applied math.

Matt Hoffman to the rescue

But wait, that’s not all. There’s room for another one of the great MCMC researchers to weigh in. Matt Hoffman, along with Pavel Sountsov, figured out how to take Radford’s algorithm and provide automatic adaptation for it.

What Hoffman and Sountsov manage to do is run a whole lot of parallel chains, then use information in the other chains to set the tuning parameters for a given chain. In that way it’s like the Goodman and Weare affine-invariant sampler that’s used in the Python package emcee. This involves estimating the metric (the posterior covariance, or just the variances in the diagonal case) and also the step size, which they do through a heuristic largest-eigenvalue estimate. Among the pleasant properties of their approach is that the entire setup produces a Markov chain from the very first iteration. That means we only have to do what people call “burn in” (sorry Andrew, but notice how I say other people call it that, not that they should), not set aside some number of iterations for adaptation.

Edward Roualdes has coded up Hoffman and Sountsov’s adaptation and it appears to work with delayed rejection replacing Neal’s trick.

Next for Stan?

I’m pretty optimistic that this will wind up being more efficient than NUTS and also make things like parallel adaptation and automatic stopping a whole lot easier. It should be more efficient because it doesn’t waste work—NUTS goes forward and backward in time and then subsamples along the final doubling (usually—it’s stochastic with a strong bias toward doing that). This means we “waste” the work going the wrong way in time and beyond the point where we finally sample. But we still have a lot of evaluation to do before we can replace Stan’s longstanding sampler or even provide an alternative.

My talk

The plan’s basically to expand this blog post with details and show you some results. Hope to see you there!

How big a problem is it that cross-validation is biased?

Some weeks ago, I posted on Mastodon (you can follow me there) a thread about “How big a problem is it that cross-validation is biased?” I have also added that text to the CV-FAQ. Today I extended that thread, as we have a new paper out on estimating and correcting selection-induced bias in cross-validation model selection.

I’m posting here the whole thread for the convenience of those who are not (yet?) following me on Mastodon:

Unbiasedness has a special role in statistics, and too often there are dichotomous comments that something is not valid or is inferior because it’s not unbiased. However, often the non-zero bias is negligible, and often by modifying the estimator we may even increase bias but reduce the variance a lot, providing an overall improved performance.

In CV the goal is to estimate the predictive performance for unobserved data given the observed data of size n. CV has a pessimistic bias due to using fewer than n observations to fit the models. In the case of LOO-CV this bias is usually small and negligible. In the case of K-fold CV with a small K, the bias can be non-negligible, but if the effective number of parameters of the model is much less than n, then with K > 10 the bias is also usually negligible compared to the variance.
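As a toy illustration of the pessimistic bias (my sketch with a made-up setup, using the simplest possible “model,” a sample mean): predicting a new N(0,1) observation with the mean of n points has expected squared error 1 + 1/n, while K-fold CV fits on only n(K-1)/K points, so it over-estimates the error, less so as K grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 30, 4000
target = 1 + 1 / n   # true expected squared error of the mean fit on all n

def cv_mse(y, K):
    # K-fold CV squared error for the sample-mean "model"
    folds = np.array_split(rng.permutation(n), K)
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        errs.extend((y[f] - y[train].mean()) ** 2)
    return np.mean(errs)

biases = {K: np.mean([cv_mse(rng.normal(size=n), K) for _ in range(reps)]) - target
          for K in (2, 10, n)}   # K = n is LOO
print(biases)   # positive (pessimistic) bias, shrinking as K grows
```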

There is a bias correction approach by Burman (1989) (see also Fushiki (2011)) that reduces CV bias, but even in cases where the bias reduction is non-negligible, the variance tends to increase so much that there is no real benefit (see, e.g., Vehtari and Lampinen (2002)).

For time series, when the task is to predict the future (there are other possibilities, such as missing-data imputation), there are specific CV methods, such as leave-future-out (LFO), that have lower bias than LOO-CV or K-fold CV (Bürkner, Gabry, and Vehtari, 2020). There are sometimes comments that LOO-CV and K-fold CV would be invalid for time series. Although they tend to have a bigger bias than LFO, they are still valid and can be useful, especially in model comparison, where the bias can cancel out.

Cooper et al. (2023) demonstrate that in time series model comparison the variance is likely to dominate, that it is more important to reduce the variance than the bias, and that leaving out a few observations and using the joint log score is better than using LFO. The problem with LFO is that the data sets used for fitting the models are smaller, increasing the variance.

Bengio and Grandvalet (2004) proved that there is no unbiased estimate of the variance of CV in general, which has later been used as an argument that there is no hope. Instead of dichotomizing into unbiased or biased, Sivula, Magnusson, and Vehtari (2020) consider whether the variance estimates are useful and how to diagnose when the bias is likely not to be negligible (Sivula, Magnusson, and Vehtari (2023) also prove a special case where an unbiased variance estimate actually exists).

CV tends to have high variance, as the sample reuse does not make any modeling assumptions (this holds also for information criteria such as WAIC). Not making modeling assumptions is good when we don’t trust our models, but if we do trust them, we can get reduced variance in model comparison, for example, by examining the posterior directly or by using reference models to filter out noise in the data (see, e.g., Piironen, Paasiniemi, and Vehtari (2018) and Pavone et al. (2020)).

When using CV (or information criteria such as WAIC) for model selection, the performance estimate for the selected model has additional selection-induced bias. In the case of a small number of models this bias is usually negligible, that is, smaller than the standard deviation of the estimate or smaller than what is practically relevant. When the bias is negligible, we may choose a suboptimal model, but the difference from the performance of the oracle model is small.

In the case of a large number of models the selection-induced bias can be non-negligible, but this bias can be estimated using, for example, nested CV or the bootstrap. The concepts of selection-induced bias and the related potentially harmful overfitting are not new, but there hasn’t been enough discussion of when they are negligible or non-negligible.
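The selection-induced bias is easy to see in a toy simulation (my sketch with made-up numbers: pure-noise “predictors,” so no model truly helps). The naive approach reports the CV score of the CV-selected model; nested CV scores the whole selection procedure on outer held-out folds.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 200
y = rng.normal(size=n)               # pure-noise target
X = rng.normal(size=(n, p))          # candidate predictors, all junk

def kfold_mse(x, y, K=5):
    """K-fold CV mean squared error of the no-intercept fit y ~ b*x."""
    idx = np.array_split(np.arange(len(y)), K)
    errs = []
    for f in idx:
        tr = np.setdiff1d(np.arange(len(y)), f)
        b = x[tr] @ y[tr] / (x[tr] @ x[tr])
        errs.extend((y[f] - b * x[f]) ** 2)
    return np.mean(errs)

# Naive: select the best of p models by CV, report its own CV score
naive = min(kfold_mse(X[:, j], y) for j in range(p))

# Nested: outer folds score the selection procedure itself
nested_errs = []
for f in np.array_split(np.arange(n), 5):
    tr = np.setdiff1d(np.arange(n), f)
    j = int(np.argmin([kfold_mse(X[tr, k], y[tr]) for k in range(p)]))
    b = X[tr, j] @ y[tr] / (X[tr, j] @ X[tr, j])
    nested_errs.extend((y[f] - b * X[f, j]) ** 2)
nested = np.mean(nested_errs)
print(naive, nested)   # naive is optimistic; the gap is the selection bias
```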

In our new paper with Yann McLatchie, “Efficient estimation and correction of selection-induced bias with order statistics,” we review the concepts of selection-induced bias and overfitting, propose a fast-to-compute estimate of the bias, and demonstrate how this can be used to avoid selection-induced overfitting even when selecting among 10^30 models.

The figure here shows simulation results with p=100 covariates, with different data sizes n, and varying block correlation among the covariates. The red lines show the LOO-CV estimate for the best model chosen so far in forward search. The grey lines show the performance on independent, much bigger test data, which we usually don’t have available. The black lines show our corrected estimate, taking into account the selection-induced bias. Stopping the searches at the peak of the black curves avoids overfitting.

Although we can estimate and correct the selection-induced bias, we primarily recommend using more sensible priors and not doing model selection. See more in “Efficient estimation and correction of selection-induced bias with order statistics” and Bayesian Workflow.

Using forecasts to estimate individual variances

Someone who would like to remain anonymous writes:

I’m a student during the school year, but am working in industry this summer. I am currently attempting to overhaul my company’s model of retail demand. We advise suppliers to national retailers, our customers are suppliers. Right now, for each of our customers, our demand model outputs a point estimate of how much of their product will be consumed at one of roughly a hundred locations. This allows our customers to decide how much to send to each location.

However, because we are issuing point estimates of mean demand, we are *not* modeling risk directly, and I want to change that, as understanding risk is critical to making good decisions about inventory management – the entire point of excess inventory is to provide a buffer against surprises.

Additionally, the model currently operates on a per-day basis, so that predictions for a month from now are obtained by chaining together thirty predictions about what day N+1 will look like. I want to change that too, because it seems to be causing a lot of problems with errors in the model propagating across time, to the point that predictions over even moderate time intervals are not reliable.

I already know how to do both of these in an abstract way.

I’m willing to bite the bullet of assuming that the underlying distribution of the PDF should be multivariate Gaussian. From there, arriving at the parameters of that PDF just requires max likelihood estimation. For the other change, without going into a lot of tedious detail, Neural ODE models are flexible with respect to time such that you can use the same model to predict the net demand accumulated over t=10 days as you would to predict the net demand accumulated over t=90 days, just by changing the time parameter that you query the model with.

The problem is, although I know how to build a model that will do this, I want the estimated variance for each customer’s product to be individualized. Yet frustratingly, in a one-shot scenario, the maximum likelihood estimator of variance is zero. The only datapoint I’ll have to use to train the model to estimate the mean aggregate demand for, say, cowboy hats in Seattle at time t=T (hereafter (c,S,T)) is the actual demand for that instance, so the difference between the mean outcome and the actual outcome will be zero.

It’s clear to me that if I want to arrive at a good target for variance or covariance in order to conduct risk assessment, I need to do some kind of aggregation over the outcomes, but most of the obvious options don’t seem appealing.

– If I obtain an estimate of variance by thinking about the difference between (c,S,T) and (c,Country,T), aggregating over space, I’m assuming that each location shares the same mean demand, which I know is false.

– If I obtain one by thinking about the difference between (c,S,T) and (c,S,tbar), aggregating over time, I am assuming there’s a stationary covariance matrix for how demand accumulates at that location over time, which I know is false. This will fail especially badly if issuing predictions across major seasonal events, such as holidays or large temperature changes.

– If I aggregate across customers by thinking about the difference between (c,S,T) and (cbar,S,T), I’ll be assuming that the demand for cowboy hats at S,T should obey similar patterns as the demand for other products, such as ice cream or underwear sales, which seems obviously false.

I have thought of an alternative to these, but I don’t know if it’s even remotely sensible, because I’ve never seen anything like it done before. I would love your thoughts and criticisms on the possible approach. Alternatively, if I need to bite the bullet and go with one of the above aggregation strategies instead, it would benefit me a lot to have someone authoritative tell me so, so that I stop messing around with bad ideas.

My thought was that instead of asking the model to use the single input vector associated with t=0 to predict a single output vector at t=T, I could instead ask the model to make one prediction per input vector for many different input vectors from the neighborhood of time around t=0 in order to predict outcomes at a neighborhood of time around t=T. For example, I’d want one prediction for t=-5 to t=T, another prediction for t=-3 to t=T+4, and so on.

I would then judge the “true” target variance for the model relative to the difference between (c,S,T)’s predicted demand and the average of the model’s predicted demands for those nearby time slices. The hope is that this would reasonably correspond to the risks that customers should consider when optimizing their inventory management, by describing the sensitivity of the model to small changes in the input features and target dates it’s queried on. The model’s estimate of its own uncertainty wouldn’t do a good job of representing out-of-model error, of course, but the hope is that it’d at least give customers *something*.

Does this make any sense at all as a possible approach, or am I fooling myself?

My reply: I haven’t followed all the details, but my guess is that your general approach is sound. It should be possible to just fit a big Bayesian model in Stan, but maybe that would be too slow; I don’t really know how big the problem is. The sort of approach described above, where different models are fit and compared, can be thought of as a kind of computational approximation to a more structured hierarchical model, in the same way that cross-validation can be thought of as an approximation to an error model, or smoothing can be thought of as an approximation to a time-series model.

Improving Survey Inference in Two-phase Designs Using Bayesian Machine Learning

Xinru Wang, Lauren Kennedy, and Qixuan Chen write:

The two-phase sampling design is a cost-effective sampling strategy that has been widely used in public health research. The conventional approach in this design is to create subsample specific weights that adjust for probability of selection and response in the second phase. However, these weights can be highly variable which in turn results in unstable weighted analyses. Alternatively, we can use the rich data collected in the first phase of the study to improve the survey inference of the second phase sample. In this paper, we use a Bayesian tree-based multiple imputation (MI) approach for estimating population means using a two-phase survey design. We demonstrate how to incorporate complex survey design features, such as strata, clusters, and weights, into the imputation procedure. We use a simulation study to evaluate the performance of the tree-based MI approach in comparison to the alternative weighted analyses using the subsample weights. We find the tree-based MI method outperforms weighting methods with smaller bias, reduced root mean squared error, and narrower 95% confidence intervals that have closer to the nominal level coverage rate. We illustrate the application of the proposed method by estimating the prevalence of diabetes among the United States non-institutionalized adult population using the fasting blood glucose data collected only on a subsample of participants in the 2017-2018 National Health and Nutrition Examination Survey.

Yes, weights can be variable! Poststratification is better, but we don’t always have the relevant information. Imputation is a way to bridge the gap. Imputations themselves are model-dependent and need to be checked. Still, the alternatives of ignoring the design in the calculations or relying on weights have such problems of their own that I think modeling is the way to go. Further challenges will arise, such as imputing cluster membership in the population.

When I said, “judge this post on its merits, not based on my qualifications,” was this anti-Bayesian? Also a story about lost urine.

Paul Alper writes, regarding my post criticizing an epidemiologist and a psychologist who were coming down from the ivory tower to lecture us on “concrete values like freedom and equality”:

In your P.P.S. you write,

Yes, I too am coming down from the ivory tower to lecture here. You’ll have to judge this post on its merits, not based on my qualifications. And if I go around using meaningless phrases such as “concrete values like freedom and equality,” please call me on it!

While this sounds reasonable, is it not sort of anti-Bayes? By that I mean your qualifications represent a prior and the merits the (new) evidence. I am not one to revere authority but deep down in my heart, I tend to pay more attention to a medical doctor at the Mayo Clinic than I do to Stella Immanuel. On the other hand, decades ago the Mayo Clinic misplaced (lost!) a half liter of my urine and double charged my insurance when getting a duplicate a few weeks later.

Alper continues:

Upon reflection—this was over 25 years ago—a half liter of urine does sound like an exaggeration, but not by much. The incident really did happen and Mayo tried to charge for it twice. I certainly have quoted it often enough so it must be true.

On the wider issue of qualifications and merit, surely people with authority (degrees from Harvard and Yale, employment at the Hoover Institution, Nobel Prizes) are given slack when outlandish statements are made. James Watson, however, is castigated exceptionally precisely because of his exceptional longevity.

I don’t have anything to say about the urine, but regarding the Bayesian point . . . be careful! I’m not saying to make inferences or make decisions solely based on local data, ignoring prior information coming from external data such as qualifications. What I’m saying is to judge this post on its merits. Then you can make inferences and decisions in some approximately Bayesian way, combining your judgment of this post with your priors based on your respect for my qualifications, my previous writings, etc.

This is related to the point that a Bayesian wants everybody else to be non-Bayesian. Judge my post on its merits, then combine with prior information. Don’t double-count the prior.

This is what “power = .06” looks like (visualized by Art Owen).

Art Owen writes:

Here’s a figure you might like. I generated a bunch of data sets constructed to have true effect size 1 and power equal to 0.06. The first 100 confidence intervals that exclude 0 are shown. Their endpoints come close to the origin and their centers have absolute value far from 1. Just over 1/4 of them have the wrong sign.

When the CI endpoint is a full CI width away from zero, you’d be pretty safe regarding the sign. Details are in this article.

I do like the above figure. It’s a very vivid expression of This is what “power = .06” looks like. Get used to it. Somebody should send it to the University of Chicago economics department.
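You can reproduce the gist of Owen’s picture with a few lines of simulation (my own back-of-envelope setup: true effect 1 and standard error 3.33, which gives a two-sided z-test power of about 0.06; Owen’s exact construction may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
effect, se = 1.0, 3.33                 # power ~ 0.06 for a two-sided z-test
est = rng.normal(effect, se, 200_000)  # repeated noisy estimates of the effect
signif = np.abs(est) > 1.96 * se       # "statistically significant" results

power = signif.mean()
type_s = (est[signif] < 0).mean()             # wrong sign, among significant
type_m = np.abs(est[signif]).mean() / effect  # exaggeration factor
print(power, type_s, type_m)
```

The significant estimates are forced out past 1.96 standard errors, so they are wildly exaggerated, and a substantial fraction land on the wrong side of zero.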

PhD student, PostDoc, and Research software engineering positions

Several job opportunities in beautiful Finland!

  1. Fully funded postdoc and doctoral student positions in various topics, including Bayesian modeling, probabilistic programming, and workflows, with me and other professors at Aalto University and the University of Helsinki, funded by the Finnish Center for Artificial Intelligence

    See more topics, how to apply, and job details like salary at fcai.fi/we-are-hiring

    You can also ask me for further details

  2. Permanent full-time research software engineer position at Aalto University. Aalto Scientific Computing is a specialized type of research support, providing high-performance computing hardware, management, research support, teaching, and training. The team works with top researchers throughout the university. All the work is open-source by default, and the team takes an active part in worldwide projects.

    See more about the tasks, qualifications, salary, etc., at www.aalto.fi/en/open-positions/research-software-engineer

    This could also be a great fit for someone interested in probabilistic programming. I know some of the RSE group members, they are great, and we’ve been very happy to get their help, e.g., in developing the priorsense package.

There are no underpowered datasets; there are only underpowered analyses.

Is it ok to pursue underpowered studies?

This question comes from Harlan Campbell, who writes:

Recently we saw two different commentaries on the importance of pursuing underpowered studies, both with arguments motivated by thoughts on COVID-19 research:

COVID-19: underpowered randomised trials, or no randomised trials? by Atle Fretheim

Causal analyses of existing databases: no power calculations required by Miguel Hernán

Both explain the important idea that underpowered/imprecise studies “should be viewed as contributions to the larger body of evidence” and emphasize that several of these studies can, when combined together in a meta-analysis, “provide a more precise pooled effect estimate”.

Both sparked quick replies, including one from myself and others, and even got some press.

My personal opinion is that there are both costs (e.g., wasting valuable resources, furthering distrust in science) and benefits (e.g., learning about an important causal question) to pursuing underpowered studies. The trade-off may indeed tilt towards the benefits if the analysis question is sufficiently important; much like driving through a red light en route to the hospital might be advisable in a medical emergency, but should otherwise be avoided. In the latter situation, risks can be mitigated with a trained ambulance driver at the wheel and a wailing siren. When it comes to pursuing underpowered studies, there are also ways to minimize risks. For example, by committing to publish one’s results regardless of the outcome, by pre-specifying all of one’s analyses, and by making the data publicly available, one can minimize the study’s potential contribution to furthering distrust in science. That’s my two cents. In any case, it certainly is an interesting question.

I agree with the general principle that data are data, and there’s nothing wrong with gathering a little bit of data and publishing what you have, in the hope that it can be combined now or later with other data and used to influence policy in an evidence-based way.

To put it another way, the problem is not “underpowered studies”; it’s “underpowered analyses.”

In particular, if your data are noisy relative to the size of the effects you can reasonably expect to find, then it’s a big mistake to use any sort of certainty thresholding (whether that be p-values, confidence intervals, posterior intervals, Bayes factors, or whatever) in your summary and reporting. That would be a disaster—type M and S errors will kill you.

So, if you expect ahead of time that the study will be summarized by statistical significance or some similar thresholding, then I think it’s a bad idea to do the underpowered study. But if you expect ahead of time that the raw data will be reported and that any summaries will be presented without selection, then the underpowered study is fine. That’s my take on the situation.

The fundamental role of data partitioning in predictive model validation

David Zimmerman writes:

I am a grad student in biophysics and basically a novice to Bayesian methods. I was wondering if you might be able to clarify something that is written in section 7.2 of Bayesian Data Analysis. After introducing the log pointwise predictive density as a scoring rule for probabilistic prediction, you say:

The advantage of using a pointwise measure, rather than working with the joint posterior predictive distribution … is in the connection of the pointwise calculation to cross-validation, which allows some fairly general approaches to approximation of out-of-sample fit using available data.

But would it not be possible to do k-fold cross-validation, say, with a loss function based on the joint predictive distribution over each full validation set? Can you explain why (or under what circumstances) it is preferable to use a pointwise measure rather than something based on the joint predictive?

My reply: Yes, for sure you can do k-fold cross-validation. Leave-one-out (LOO) has the advantage of being automatic to implement in many models using Pareto-smoothed importance sampling, but for structured problems such as time series and spatial models, k-fold can make more sense. The reason we made such a big deal in our book about the pointwise calculation was to emphasize that predictive validation is fundamentally a process that involves partitioning the data. This aspect of predictive validation is hidden by AIC and related expressions such as DIC that work with the unpartitioned joint likelihood. When writing BDA3 we worked to come up with an improvement/replacement for DIC—the result was chapter 7 of BDA3, along with this article with Aki Vehtari and Jessica Hwang—and part of this was a struggle to manipulate the posterior simulations of the joint likelihood. At some point I realized that the partitioning was necessary, and this point struck me as important enough to emphasize when writing all this up.
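To make the pointwise idea concrete, here is a minimal sketch of k-fold cross-validation of the log pointwise predictive density. It assumes a toy setup, not anything from BDA3: a normal model with known standard deviation 1 and a flat prior on the mean, so the posterior predictive distribution for a new observation is normal with mean ybar and standard deviation sqrt(1 + 1/n). The key structural point is that each held-out observation contributes its own log predictive density term, which are then summed:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=60)   # toy data for illustration

def normal_logpdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

# 5-fold cross-validation estimate of the expected log pointwise
# predictive density (elpd), for a normal model with known sd = 1 and
# a flat prior on the mean: the posterior predictive for a new point
# is then N(ybar, sqrt(1 + 1/n)).
k = 5
folds = np.array_split(rng.permutation(len(y)), k)

elpd = 0.0
for test_idx in folds:
    train = np.delete(y, test_idx)
    mu_hat = train.mean()
    pred_sd = np.sqrt(1.0 + 1.0 / len(train))
    # pointwise: each held-out observation contributes its own log density
    elpd += normal_logpdf(y[test_idx], mu_hat, pred_sd).sum()

print(f"k-fold elpd estimate: {elpd:.1f}")
```

Nothing stops you from instead scoring the joint predictive density of each whole validation set, as the question asks; the pointwise version is what connects to LOO and to the approximations mentioned in the reply.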

And here’s Aki’s cross-validation FAQ and two of his recent posts on the topic:

from 2020: More limitations of cross-validation and actionable recommendations

from 2022: Moving cross-validation from a research idea to a routine step in Bayesian data analysis