Effective Number of Parameters in a Statistical Model

This is my talk for a seminar organized by Joe Suzuki at Osaka University on Tues 10 Sep 2024, 8:50-10:20am Japan time / 19:50-21:20 NY time:

Effective Number of Parameters in a Statistical Model

Andrew Gelman, Department of Statistics, Columbia University

Degrees-of-freedom adjustment for estimated parameters is a general idea in small-sample hypothesis testing, uncertainty estimation, and assessment of prediction accuracy. The effective number of parameters gets interesting in the presence of nonlinearity, constraints, boundary conditions, hierarchical models, informative priors, discrete parameters, and other complicating factors. Many open questions remain, including: (a) defining the effective number of parameters, (b) measuring how the effective number of parameters can depend on data and vary across parameter space, and (c) understanding how the effective number of parameters changes as sample size increases. We discuss these questions using examples from demographics, imaging, pharmacology, political science, and other application areas.

It will be a remote talk—I won’t be flying to Japan—so maybe the eventual link will be accessible to outsiders.

It feels kinda weird to be scheduling a talk nearly a year in advance, but since I had to give a title and abstract anyway, I thought I’d share it with you. My talk will be part of a lecture series they are organizing for graduate students at Osaka, “centered around WAIC/WBIC and its mathematical complexities, covering topics such as Stan usage, regularity conditions, WAIC, WBIC, cross-validation, SBIC, and learning coefficients.” I’m not gonna touch the whole BIC thing.

Now that we have loo, I don’t see any direct use for “effective number of parameters” in applied statistics, but the concept still seems important for understanding fitted models, in vaguely the same way that R-squared is useful for understanding, even though it does not answer any direct question of interest. I thought it could be fun to give a talk on all the things that confuse me about effective number of parameters, because I think it’s a concept that we often take for granted without fully thinking through.

Agreeing to give the talk could also motivate me to write a paper on the topic, which I’d like to do, given that it’s been bugging me for about 35 years now.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than Courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

Multilevel models for taxonomic data structures


This post from John Cook about the hierarchy of U.S. census divisions (nation, region, division, state, county, census tract, block group, census block, household, family, person) reminded me of something that I’ve discussed before but have never really worked out, which is the construction of families of hierarchical models for taxonomic data structures.

A few ideas come into play here:

1. Taxonomic structures come up a lot in social science. Cook gives the example of geography. Two other examples are occupation and religion, each of which can be categorized in many layers. Check out the Census Bureau’s industry and occupation classifications or the many categories of religion, religious denomination, etc.

2. Hierarchical modeling. If you have a national survey of, say, 2000 people classified by county, you won’t want to just do a simple “8 schools”-style hierarchical model, because if you do that you’ll pool the small counties pretty much all the way to the national average. You’ll want to include county-level predictors. One way of doing this is to include indicators for state, division, and region: that is, a hierarchical model over the taxonomy. This would be a simple example of the “unfolding flower” structure that expands to include more model structure as more data come in, thus avoiding the problem that the usual forms of weak priors are very strong as the dimensionality of the model increases while the data remain fixed (see Section 3 of my Bayesian model-building by pure thought paper from 1996).

I think there’s a connection here to quantum mechanics, in the way that, when a system is heated, new energy levels appear.

3. Taxonomies tend to have fractal structure. Some nodes have lots of branches and lots of depth, others just stop. For example, in the taxonomy of religious denominations, Christians can be divided into Catholics, Protestants, and others, then Protestants can be divided into mainline and evangelical, each of which can be further subdivided, whereas some of the others, such as Mormons, might not be divided at all. Similarly, some index entries in a book will have lots of sub-entries and can go on for multiple columns, within which some sub-entries have many sub-sub-entries, etc., and then other entries are just one line. You get the sort of long-tailed distribution that’s characteristic of self-similar processes. Mandelbrot wrote about this back in 1955! This might be his first publication of any kind about fractals, a topic that he productively chewed on for several decades more.
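Returning to point 2, the taxonomy-as-indicators idea can be sketched with a small simulation (all counts and standard deviations here are invented for the sketch; a real analysis would fit the model in Stan or lme4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical taxonomy: 4 regions, 3 divisions per region, 5 states per division.
n_regions, n_div_per, n_state_per = 4, 3, 5
n_div = n_regions * n_div_per      # 12
n_state = n_div * n_state_per      # 60
region_of_div = np.repeat(np.arange(n_regions), n_div_per)
div_of_state = np.repeat(np.arange(n_div), n_state_per)

# Effects at each level of the tree (sds invented; deeper levels vary less).
region_eff = rng.normal(0, 1.0, n_regions)
div_eff = rng.normal(0, 0.5, n_div)
state_eff = rng.normal(0, 0.25, n_state)

# A state's expected value sums the effects along its path through the taxonomy,
# so states sharing a division are more alike than states sharing only a region.
state_mean = (region_eff[region_of_div[div_of_state]]
              + div_eff[div_of_state]
              + state_eff)

# For a state with no data, the model predicts its division's mean rather than
# pooling all the way to the national average -- the payoff of the taxonomy.
div_mean = region_eff[region_of_div] + div_eff
print(state_mean.shape, div_mean.shape)
```

The same path-sum construction extends down to county and tract, which is what keeps small areas from being shrunk all the way to the national mean.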

My colleagues and I have worked with some taxonomic models for voting based on geography:
– Section 5 of our 2002 article on the mathematics and statistics of voting power,
– Our recent unpublished paper, How democracies polarize: A multilevel perspective.
These are sort of like spatial models or network models, but structured specifically based on a hierarchical taxonomy. Both papers have some fun math.

I think there are general principles here, or at least something more to be said on the topic.

Hey! Here are some amazing articles by George Box from around 1990. Also there’s some mysterious controversy regarding his center at the University of Wisconsin.

The webpage is maintained by John Hunter, son of Box’s collaborator William Hunter, and I came across it because I was searching for background on the paper-helicopter example that we use in our classes to teach principles of experimental design and data analysis.

There’s a lot to say about the helicopter example and I’ll save that for another post.

Here I just want to talk about how much I enjoyed reading these thirty-year-old Box articles.

A Box Set from 1990

Many of the themes in those articles continue to resonate today. For example:

• The process of learning. Here’s Box from his 1995 article, “Total Quality: Its Origins and its Future”:

Scientific method accelerated that process in at least three ways:

1. By experience in the deduction of the logical consequences of the group of facts each of which was individually known but had not previously been brought together.

2. By the passive observation of systems already in operation and the analysis of data coming from such systems.

3. By experimentation – the deliberate staging of artificial experiences which often might ordinarily never occur.

A misconception is that discovery is a “one shot” affair. This idea dies hard. . . .

• Variation over time. Here’s Box from his 1989 article, “Must We Randomize Our Experiment?”:

We all live in a non-stationary world; a world in which external factors never stay still. Indeed the idea of stationarity – of a stable world in which, without our intervention, things stay put over time – is a purely conceptual one. The concept of stationarity is useful only as a background against which the real non-stationary world can be judged. For example, the manufacture of parts is an operation involving machines and people. But the parts of a machine are not fixed entities. They are wearing out, changing their dimensions, and losing their adjustment. The behavior of the people who run the machines is not fixed either. A single operator forgets things over time and alters what he does. When a number of operators are involved, the opportunities for change because of failures to communicate are further multiplied. Thus, if left to itself any process will drift away from its initial state. . . .

Stationarity, and hence the uniformity of everything depending on it, is an unnatural state that requires a great deal of effort to achieve. That is why good quality control takes so much effort and is so important. All of this is true, not only for manufacturing processes, but for any operation that we would like to be done consistently, such as the taking of blood pressures in a hospital or the performing of chemical analyses in a laboratory. Having found the best way to do it, we would like it to be done that way consistently, but experience shows that very careful planning, checking, recalibration and sometimes appropriate intervention, is needed to ensure that this happens.

Here’s an example, from Box’s 1992 article, “How to Get Lucky”:

For illustration Figure 1(a) shows a set of data designed to seek out the source of unacceptably large variability which, it was suspected, might be due to small differences in five, supposedly identical, heads on a machine. To test this idea, the engineer arranged that material from each of the five heads was sampled at roughly equal intervals of time in each of six successive eight-hour periods. . . . the same analysis strongly suggested that real differences in means occurred between the six eight-hour periods of time during which the experiment was conducted. . . .

• Workflow. Here’s Box from his 1999 article, “Statistics as a Catalyst to Learning by Scientific Method Part II-Discussion”:

Most of the principles of design originally developed for agricultural experimentation would be of great value in industry, but most industry experimentation differed from agricultural experimentation in two major respects. These I will call immediacy and sequentially.

What I mean by immediacy is that for most of our investigations the results were available, if not within hours, then certainly within days and in rare cases, even within minutes. This was true whether the investigation was conducted in a laboratory, a pilot plant or on the full scale. Furthermore, because the experimental runs were usually made in sequence, the information obtained from each run, or small group of runs, was known and could be acted upon quickly and used to plan the next set of runs. I concluded that the chief quarrel that our experimenters had with using “statistics” was that they thought it would mean giving up the enormous advantages offered by immediacy and sequentially. Quite rightly, they were not prepared to make these sacrifices. The need was to find ways of using statistics to catalyze a process of investigation that was not static, but dynamic.

There’s lots more. It’s funny to read these things that Box wrote back then, that I and others have been saying over and over again in various informal contexts, decades later. It’s a problem with our statistical education (including my own textbooks) that these important ideas are buried.

More Box

A bunch of articles by Box, with some overlap but not complete overlap with the above collection, is at the site of the University of Wisconsin, where he worked for many years. Enjoy.

Some kinda feud is going on

John Hunter’s page also has this:

The Center for Quality and Productivity Improvement was created by George Box and Bill Hunter at the University of Wisconsin-Madison in 1985.

In the first few years reports were published by leading international experts including: W. Edwards Deming, Kaoru Ishikawa, Peter Scholtes, Brian Joiner, William Hunter and George Box. William Hunter died in 1986. Subsequently excellent reports continued to be published by George Box and others including: Gipsie Ranney, Soren Bisgaard, Ron Snee and Bill Hill.

These reports were all available on the Center’s web site. After George Box’s death the reports were removed. . . .

It is a sad situation that the Center abandoned the ideas of George Box and Bill Hunter. I take what has been done to the Center as a personal insult to their memory. . . .

When diagnosed with cancer my father dedicated his remaining time to creating this center with George to promote the ideas George and he had worked on throughout their lives: because it was that important to him to do what he could. They did great work and their work provided great benefits for long after Dad’s death with the leadership of Bill Hill and Soren Bisgaard but then it deteriorated. And when George died the last restraint was eliminated and the deterioration was complete.

Wow. I wonder what the story was. I asked someone I know who works at the University of Wisconsin and he had no idea. Box died in 2013 so it’s not so long ago; there must be some people who know what happened here.

For how many iterations should we run Markov chain Monte Carlo?

So, Charles and I wrote this article. It’s for the forthcoming Handbook of Markov Chain Monte Carlo, 2nd edition. We still have an opportunity to revise it, so if you see anything wrong or misleading, or if there are other issues you think we should mention, please let us know right here in comments!

Remember, it’s just one chapter of an entire book—it’s intended to be an updated version of my chapter with Kenny, Inference from Simulations and Monitoring Convergence, from the first edition. For our new chapter, Charles liked the snappier, “How many iterations,” title.

In any case, our chapter does not have to cover all of MCMC, just inference and monitoring convergence. But I think Charles is right that, once you think about inference and monitoring convergence, you want to step back and consider design: in particular, how many chains to run, where to start them, and how many iterations to run per chain.

Again, all comments will be appreciated.

Hydrology Corner: How to compare outputs from two models, one Bayesian and one non-Bayesian?

Zac McEachran writes:

I am a Hydrologist and Flood Forecaster at the National Weather Service in the Midwest. I use some Bayesian statistical methods in my research work on hydrological processes in small catchments.

I recently came across a project that I want to use a Bayesian analysis for, but I am not entirely certain what to look for to get going on this. My issue: NWS uses a protocol for calibrating our river models using a mixed conceptual/physically-based model. We want to assess whether a new calibration is better than an old calibration. This seems like a great application for a Bayesian approach. However, a lot of the literature I am finding (and methods I am more familiar with) are associated with assessing goodness-of-fit and validation for models that were fit within a Bayesian framework, and then validated in a Bayesian framework. I am interested in assessing how a non-Bayesian model output compares with another non-Bayesian model output with respect to observations. Someday I would like to learn to use Bayesian methods to calibrate our models but one step at a time!

My response: I think you need somehow to give a Bayesian interpretation to your non-Bayesian model output. This could be as simple as taking 95% prediction intervals and interpreting them as 95% posterior intervals from a normally-distributed posterior. Or if the non-Bayesian fit only gives point estimates, then do some bootstrapping or something to get an effective posterior. Then you can use external validation or cross validation to compare the predictive distributions of your different models, as discussed here; also see Aki’s faq on cross validation.
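Here’s a toy sketch of what that could look like (the data, the two “calibrations,” and the crude normal predictive distribution are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for observed flows plus point predictions from two calibrations
# (all numbers invented; the "new" calibration has smaller errors by design).
obs = rng.normal(10, 2, size=200)
pred_old = obs + rng.normal(0, 1.5, size=200)
pred_new = obs + rng.normal(0, 1.0, size=200)

def mean_log_score(obs, pred, n_boot=2000):
    """Bootstrap the residual sd to get an effective normal predictive
    distribution, then compute the mean log predictive density."""
    resid = obs - pred
    boot_sds = np.array([rng.choice(resid, size=resid.size, replace=True).std()
                         for _ in range(n_boot)])
    sigma = boot_sds.mean()
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - 0.5 * (resid / sigma) ** 2)

score_old = mean_log_score(obs, pred_old)
score_new = mean_log_score(obs, pred_new)
print(score_old, score_new)   # higher mean log score = better calibration
```

In a real comparison you would compute the scores on held-out events rather than the fitting period, but the scoring logic is the same.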

A Hydrologist and Flood Forecaster . . . how cool is that?? Last time we had this level of cool was back in 2009 when we were contacted by someone who was teaching statistics to firefighters.

Blue Rose Research is hiring (yet again) !

Blue Rose Research has a few roles that we’re actively hiring for as we gear up to elect more Democrats in 2024, and advance progressive causes!

A bit about our work:

  • For the 2022 US election, we used engineering and statistics to advise major progressive organizations on directing hundreds of millions of dollars to the right ads and states.
  • We tested thousands of ads and talking points in the 2022 election cycle and partnered with orgs across the space to ensure that the most effective messages were deployed from the state legislative level all the way up to Senate and Gubernatorial races and spanning the issue advocacy space as well.
  • We were more accurate than public polling in identifying which races were close across the Senate, House, and Gubernatorial maps.
  • And we’ve built up a technical stack that enables us to continue to build on innovative machine learning, statistical, and engineering solutions.

Now as we are looking ahead to 2024, we are hiring for the following positions:

All positions are remote, with optional office time with the team in New York City.

Please don’t hesitate to reach out with any questions ([email protected]).

Postdoc on Bayesian methodological and applied work! To optimize patient care! Using Stan! In North Carolina!

Sam Berchuck writes:

I wanted to bring your attention to a postdoc opportunity in my group at Duke University in the Department of Biostatistics & Bioinformatics. The full job ad is here: https://forms.stat.ufl.edu/statistics-jobs/entry/10978/.

The postdoc will work on Bayesian methodological and applied work, with a focus on modeling complex longitudinal biomedical data (including electronic health records and mobile health data) to create data-driven approaches to optimize patient care among patients with chronic diseases. The position will be particularly interesting to people interested in applying Bayesian statistics in real-world big data settings. We are looking for people who have experience in Bayesian inference techniques, including Stan!

Interesting. In addition to the Stan thing, I’m interested in data-driven approaches to optimize patient care. This is an area where a Bayesian approach, or something like it, is absolutely necessary, as you typically just won’t have enough data to make firm conclusions about individual effects, so you have to keep track of uncertainty. Sounds like a wonderful opportunity.

Bayes factors evaluate priors, cross validations evaluate posteriors

I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.

Bayes factors

Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data y, one model p_1(\theta_1, y) with parameters \theta_1 and a second model p_2(\theta_2, y) with parameters \theta_2.

The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,

\textrm{BF}_{1,2} = p_1(y) \, / \, p_2(y),

where we have

p_1(y) = \mathbb{E}[p_1(y \mid \Theta_1)] \ = \ \int p_1(y \mid \theta_1) \cdot p_1(\theta_1) \, \textrm{d}\theta_1


p_2(y) = \mathbb{E}[p_2(y \mid \Theta_2)] \ = \ \int p_2(y \mid \theta_2) \cdot p_2(\theta_2) \, \textrm{d}\theta_2.

The distributions p_1(y) and p_2(y) are known as prior predictive distributions because they integrate the likelihood over the prior.

There are ad-hoc guidelines from Harold Jeffreys of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.
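For concreteness, here is a toy Monte Carlo computation of the marginal likelihoods above (the model and all numbers are invented; the log-mean-exp step just keeps the average numerically stable):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.5, 1, size=50)   # invented data, y ~ normal(theta, 1)

def log_marginal(y, prior_sd, n_draws=100_000):
    """Monte Carlo estimate of log p(y) = log E[p(y | theta)], with the
    expectation taken over the prior theta ~ normal(0, prior_sd)."""
    theta = rng.normal(0, prior_sd, size=n_draws)
    loglik = (-0.5 * len(y) * np.log(2 * np.pi)
              - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1))
    m = loglik.max()                      # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(loglik - m)))

log_bf = log_marginal(y, prior_sd=1) - log_marginal(y, prior_sd=100)
print(log_bf)   # positive: the Bayes factor favors the tighter prior
```

Even though both priors lead to essentially the same posterior here, the wide prior spreads its prior predictive mass over a huge range and so gets penalized, which is the sensitivity discussed below.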

Held-out validation

Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data y into two pieces, y = (y', y''), then training on y' and testing on y''. The held-out validation values are

p_1(y'' \mid y') = \mathbb{E}[p_1(y'' \mid \Theta_1) \mid y'] = \int p_1(y'' \mid \theta_1) \cdot p_1(\theta_1 \mid y') \, \textrm{d}\theta_1


p_2(y'' \mid y') = \mathbb{E}[p_2(y'' \mid \Theta_2) \mid y'] = \int p_2(y'' \mid \theta_2) \cdot p_2(\theta_2 \mid y') \, \textrm{d}\theta_2.

The distributions p_1(y'' \mid y') and p_2(y'' \mid y') are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.

This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.
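Here is the held-out computation worked out for the simplest conjugate case, y ~ normal(theta, 1) with theta ~ normal(0, prior_sd), chosen so the posterior predictive density is available in closed form (data and prior scales invented):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1, size=100)     # invented data, y ~ normal(theta, 1)
y_train, y_test = y[:80], y[80:]

def heldout_lpd(y_train, y_test, prior_sd):
    """Log posterior predictive density of y_test given y_train for the
    conjugate model y ~ normal(theta, 1), theta ~ normal(0, prior_sd)."""
    post_prec = 1 / prior_sd**2 + len(y_train)   # posterior precision of theta
    post_mean = y_train.sum() / post_prec
    pred_var = 1 + 1 / post_prec                 # variance for one new y
    return np.sum(-0.5 * np.log(2 * np.pi * pred_var)
                  - 0.5 * (y_test - post_mean) ** 2 / pred_var)

lpd_tight = heldout_lpd(y_train, y_test, prior_sd=1)
lpd_wide = heldout_lpd(y_train, y_test, prior_sd=100)
print(lpd_tight, lpd_wide)   # nearly identical: the data wash out the prior
```

With 80 training points, the posterior barely depends on whether the prior scale is 1 or 100, so the two held-out scores nearly coincide, in contrast to the prior predictive computation above.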

(Leave one out) cross validation

Suppose our data is y_1, \ldots, y_N. Leave-one-out cross validation works by successively taking y'' = y_n and y' = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N and then averaging on the log scale.

\frac{1}{N} \sum_{n=1}^N \log\left( \strut p_1(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right)


\frac{1}{N} \sum_{n=1}^N \log \left( \strut p_2(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right).

Leave-one-out cross validation is interpretable as the expected log posterior density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.
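For the same toy conjugate model, exact leave-one-out is a short loop (in practice you would estimate this with PSIS-LOO via the loo package or ArviZ rather than refitting N times; the refit-per-point version is shown only because it matches the formula above):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 1, size=100)   # invented data, y ~ normal(theta, 1)

def loo_elpd(y, prior_sd):
    """Exact leave-one-out log predictive density, averaged over data points,
    for the conjugate model y ~ normal(theta, 1), theta ~ normal(0, prior_sd)."""
    n = len(y)
    lp = np.empty(n)
    for i in range(n):
        y_rest = np.delete(y, i)             # hold out point i
        post_prec = 1 / prior_sd**2 + (n - 1)
        post_mean = y_rest.sum() / post_prec
        pred_var = 1 + 1 / post_prec
        lp[i] = (-0.5 * np.log(2 * np.pi * pred_var)
                 - 0.5 * (y[i] - post_mean) ** 2 / pred_var)
    return lp.mean()

elpd = loo_elpd(y, prior_sd=1)
print(elpd)   # estimated ELPD per data point
```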

Conclusion and a question

The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors.

This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.
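The sensitivity claim is easy to check numerically in the conjugate normal model (all numbers invented; the marginal likelihood uses the standard matrix determinant lemma and Sherman-Morrison identities for a normal with a rank-one covariance term):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(1.0, 1, size=300)   # a few hundred data points, true theta = 1

def posterior_and_evidence(y, prior_sd):
    """For y ~ normal(theta, 1), theta ~ normal(0, prior_sd): return the
    posterior mean and sd of theta and the log marginal likelihood log p(y)."""
    n = len(y)
    post_prec = 1 / prior_sd**2 + n
    post_mean = y.sum() / post_prec
    # Marginally, y ~ normal(0, I + prior_sd^2 * 11'); determinant and
    # quadratic form follow from the rank-one update formulas.
    log_det = np.log1p(n * prior_sd**2)
    quad = y @ y - prior_sd**2 / (1 + n * prior_sd**2) * y.sum() ** 2
    log_p_y = -0.5 * (n * np.log(2 * np.pi) + log_det + quad)
    return post_mean, post_prec ** -0.5, log_p_y

m1, s1, lp1 = posterior_and_evidence(y, prior_sd=1)
m2, s2, lp2 = posterior_and_evidence(y, prior_sd=100)
print(abs(m1 - m2))   # posteriors nearly identical
print(lp1 - lp2)      # but the log Bayes factor is large
```

The two posterior means differ by a fraction of a percent, while the Bayes factor between the two priors is on the order of e^4, roughly 60 to 1, despite the models making essentially the same predictions.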

My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.

Further reading

For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online through the link, particularly Section 7.2 (“Information criteria and cross-validation,” p. 175) and Section 7.4 (“Model comparison using Bayes factors,” p. 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Difference-in-differences: What’s the difference?

After giving my talk last month, Better Than Difference in Differences, I had some thoughts about how diff-in-diff works—how the method operates in relation to its assumptions—and it struck me that there are two relevant ways to think about it.

From a methods standpoint the relevance here is that I will usually want to replace differencing with regression. Instead of taking (yT – yC) – (xT – xC), where T = Treatment and C = Control, I’d rather look at (yT – yC) – b*(xT – xC), where b is a coefficient estimated from the data, likely to be somewhere between 0 and 1. Difference-in-differences is the special case b=1, and in general you should be able to do better by estimating b. We discuss this with the Electric Company example in chapter 19 of Regression and Other Stories and with a medical trial in our paper in the American Heart Journal.
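A quick simulation of this point (setup and all numbers hypothetical): when the true pre/post relation has slope less than 1, both estimators are unbiased for the effect, but estimating b gives a less noisy answer.

```python
import numpy as np

rng = np.random.default_rng(6)

def one_experiment(b_true=0.5, effect=2.0, n=50):
    """One simulated two-period study in which the post-period outcome y has
    slope b_true < 1 on the pre-period measure x (regression to the mean)."""
    xT, xC = rng.normal(0, 1, n), rng.normal(0, 1, n)
    yT = b_true * xT + effect + rng.normal(0, 1, n)
    yC = b_true * xC + rng.normal(0, 1, n)
    # classic diff-in-diff: implicitly forces b = 1
    did = (yT.mean() - yC.mean()) - (xT.mean() - xC.mean())
    # regression: estimate b from the data along with the treatment effect
    X = np.column_stack([np.ones(2 * n),
                         np.concatenate([xT, xC]),
                         np.concatenate([np.ones(n), np.zeros(n)])])
    beta = np.linalg.lstsq(X, np.concatenate([yT, yC]), rcond=None)[0]
    return did, beta[2]          # beta[2] is the estimated treatment effect

sims = np.array([one_experiment() for _ in range(2000)])
print(sims.mean(axis=0))   # both approaches are unbiased (near 2.0)
print(sims.std(axis=0))    # but the regression estimate is less noisy
```

The gap in standard errors grows as the true b moves away from 1, which is exactly the case where forced differencing adds noise.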

Given this, what’s the appeal of diff-in-diff? I think the appeal of the method comes from the following mathematical sequence:

Control units:
(a) Data at time 0 = Baseline + Error_a
(b) Data at time 1 = Baseline + Trend + Error_b

Treated units:
(c) Data at time 0 = Baseline + Error_c
(d) Data at time 1 = Baseline + Trend + Effect + Error_d

Now take a diff in diff:

((d) – (c)) – ((b) – (a)) = Effect + Error,

where that last Error is a difference in difference of errors, which is just fine under the reasonable-enough assumption that the four error terms are independent.

The above argument looks pretty compelling and can easily be elaborated to include nonlinear trends, multiple time points, interactions, and so forth. That’s the direction of the usual diff-in-diff discussions.

The message of my above-linked talk and our paper, though, was different. Our point was that, whatever differencing you take, it’s typically better to difference only some of the way. Or, to make the point more generally, it’s better to model the baseline and the trend as well as the effect.

Seductive equations

The above equations are seductive: with just some simple subtraction, you can cancel out Baseline and Trend, leaving just Effect and error. And the math is correct (conditional on the assumptions, which can be reasonable). The problem is that the resulting estimate can be super noisy; indeed, it’s basically never the right thing to do from a probabilistic (Bayesian) standpoint.

In our example it was pretty easy in retrospect to do the fully Bayesian analysis. It helped that we had 38 replications of similar experiments, so we could straightforwardly estimate all the hyperparameters in the model. If you only have one experiment, your inferences will depend on priors that can’t directly be estimated from local data. Still, I think the Bayesian approach is the way to go, in the sense of yielding effect-size estimates that are more reasonable and closer to the truth.

Next step is to work this out on some classic diff-in-diff examples.

Donald Trump’s and Joe Biden’s ages and conditional probabilities of their dementia risk

The prior

Paul Campos posted something on the ages of the expected major-party candidates for next year’s presidential election:

Joe Biden is old. Donald Trump is also old. A legitimate concern about old people in important positions is that they are or may become cognitively impaired. (For example, the prevalence of dementia doubles every five years from age 65 onward, which means that on average an 80-year-old is eight times more likely to have dementia than a 65-year-old.)

Those are the baseline probabilities, which is one reason that Nate Silver wrote:

Of course Biden’s age is a legitimate voter concern. So is Trump’s, but an extra four years makes a difference . . . The 3.6-year age difference between Biden and Trump is potentially meaningful, at least based on broad population-level statistics. . . . The late 70s and early 80s are a period when medical problems often get much worse for the typical American man.

Silver also addressed public opinion:

An AP-NORC poll published last week found that 77 percent of American adults think President Biden is “too old to be effective for four more years”; 51 percent of respondents said the same of Donald Trump. . . . the differences can’t entirely be chalked up to partisanship — 74 percent of independents also said that Biden was too old, while just 48 percent said that of Trump.

The likelihood

OK, those are the base rates. What about the specifics of this case? Nate compares to other politicians but doesn’t offer anything about Biden or Trump specifically.

Campos writes:

Recently, Trump has been saying things that suggest he’s becoming deeply confused about some very basic and simple facts. For example, this weekend he gave a speech in which he seemed to be under the impression that Jeb Bush had as president launched the second Iraq war . . . In a speech on Friday, Trump claimed he defeated Barack Obama in the 2016 presidential election. . . . Trump also has a family history of dementia, which is a significant risk factor in terms of developing it himself.

Campos notes that Biden’s made his own slip-ups (“for example he claimed recently that he was at the 9/11 Twin Towers site the day after the attack, when in fact it was nine days later”) and summarizes:

I think it’s silly to deny that Biden’s age isn’t a legitimate concern in the abstract. Yet based on the currently available evidence, Trump’s age is, given his recent ramblings, a bigger concern.

It’s hard to know. A quick google turned up this:

On July 21, 2021, during a CNN town hall, President Joe Biden was asked when children under 12 would be able to get COVID-19 vaccinations. Here’s the start of his answer to anchor Don Lemon: “That’s under way just like the other question that’s illogical, and I’ve heard you speak about it because y’all, I’m not being solicitous, but you’re always straight up about what you’re doing. And the question is whether or not we should be in a position where you uh um, are why can’t the, the, the experts say we know that this virus is in fact uh um uh is, is going to be, excuse me.”

On June 18, after Biden repeatedly confused Libya and Syria in a news conference, The Washington Post ran a long story about 14 GOP lawmakers who asked Biden to take a cognitive test. The story did not note any of the examples of Biden’s incoherence and focused on, yes, Democrats’ concerns about Trump’s mental health.

And then there’s this, from a different news article:

Biden has also had to deal with some misinformation, including the false claim that he fell asleep during a memorial for the Maui wildfire victims. Conservatives — including Fox News host Sean Hannity — circulated a low-quality video on social media to push the claim, even though a clearer version of the moment showed that the president simply looked down for about 10 seconds.

There’s a lot out there on the internet. One of the difficulties with thinking about Trump’s cognitive capacities is that he’s been saying crazy things for years, so when he responds to a question about the Russia-Ukraine war by talking about windmills, that’s compared not to a normal politician but to the various false and irrelevant things he’s been saying for years.

I’m not going to try to assess or summarize the evidence regarding Biden’s or Trump’s cognitive abilities—it’s just too difficult given that all we have is anecdotal evidence. Both often seem disconnected from the moment, compared to previous presidents. And, yes, continually being in the public eye can expose weaknesses. And Trump’s statements have been disconnected from reality for so long that this seems separate from dementia, even if it could have similar effects from a policy perspective.

Combining the prior and the likelihood

When comparing Biden and Trump regarding cognitive decline, we have three pieces of information:

1. Age. This is what I’m calling the base rate, or prior. Based on the numbers above, someone of Biden’s age is about 1.6 times more likely to get dementia than someone of Trump’s age.

2. Medical and family history. This seems less clear, but from the above information it seems that someone with Trump’s history is more at risk of dementia than someone with Biden’s.

3. Direct observation. Just so hard to compare. That’s why there are expert evaluations, but it’s not like a bunch of experts are gonna be given access to evaluate the president and his challenger.

This seems like a case where data source 3 should have much more evidence than 1 and 2. (It’s hard for me to evaluate 1 vs. 2; my quick guess would be that they are roughly equally relevant.) But it’s hard to know what to do with 3, given that no systematic data have been collected.

This raises an interesting statistical point, which is how to combine the different sources of information. Nate Silver looks at item 1 and pretty much sets aside items 2 and 3. In contrast, Paul Campos says that 1 and 2 pretty much cancel and that the evidence from item 3 is strong.

I’m not sure what’s the right way to look at this problem. I respect Silver’s decision not to touch item 3 (“As for what to make of Biden and Trump in particular — look, I have my judgments and you have yours. Cognitively, they both seem considerably less sharp to me than they did in their primes”); on the other hand, there seems to be so much direct evidence that I’d think it would overwhelm a base-rate odds ratio of 1.6.
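To make the combination concrete, one can work on a ratio scale, multiplying the base-rate ratio by likelihood ratios for the other sources. Here is a toy sketch; only the 1.6 comes from the ages discussed above, and the other two ratios are pure placeholders, not estimates of anything:

```python
# Toy combination of the three information sources on a ratio scale.
# Only the 1.6 base-rate ratio comes from the post; the other two numbers
# are hypothetical placeholders to show the arithmetic, not estimates.
base_rate_ratio = 1.6     # item 1: Biden vs. Trump dementia risk, by age
lr_history = 1 / 1.6      # item 2: hypothetical, roughly canceling item 1
lr_observation = 1.0      # item 3: hypothetical; who knows how to quantify it

combined_ratio = base_rate_ratio * lr_history * lr_observation
print(combined_ratio)     # ~1.0 under these made-up numbers
```

Under Campos’s reading, items 1 and 2 roughly cancel and item 3 dominates; under Silver’s, item 1 is the only piece we can really quantify.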

News media reporting

The other issue is news media coverage. Silver argues that the news media should be spending more time discussing the statistical probability of dementia or death as a function of age, in the context of Biden and Trump, and one of his arguments is that voters are correct to be more concerned about the health of the older man.

Campos offers a different take:

Nevertheless, Biden’s age is harped on ceaselessly by the media, while Trump apparently needs to pull a lampshade over his head and start talking about how people used to wear onions on their belts before the same media will even begin to talk about the exact same issue in regard to him, and one that, given his recent behavior, seems much more salient as a practical matter.

From Campos’s perspective, voters’ impressions are a product of news media coverage.

But on the internet you can always find another take, such as this from Roll Call magazine, which quotes a retired Democratic political consultant as saying, “the mainstream media has performed a skillful dance around the issue of President Biden’s age. . . . So far, it is kid gloves coverage for President Biden.” On the other hand, the article also says, “Then there’s Trump, who this week continued his eyebrow-raising diatribes on his social media platform after recently appearing unable, at times, to communicate complete thoughts during a Fox News interview.”

News coverage in 2020

I recall that age was discussed a lot in the news media during the 2020 primaries, when Biden and Sanders were running against several much younger opponents. It didn’t come up so much in the 2020 general election because (a) Trump is almost as old as Biden, and (b) Trump had acted so erratically as president that it was hard to line up his actual statements and behaviors with the more theoretical, actuarial risk associated with Biden’s age.

Statistical summary

Comparing Biden and Trump, it’s not clear what to do with the masses of anecdotal data; on the other hand, it doesn’t seem quite right to toss all that out and just go with the relatively weak information from the base rates.

I guess this happens a lot in decision problems. You have some highly relevant information that is hard to quantify, along with some weaker, but quantifiable statistics. In their work on cognitive illusions, Tversky and Kahneman noted the fallacy of people ignoring base rates, but there can be the opposite problem of holding on to base rates too tightly, what we’ve called slowness to update. In general, we seem to have difficulty balancing multiple pieces of information.

P.S. Some discussion in comments about links between age and dementia, or diminished mental capacities, also some discussion about evidence for Trump’s and Biden’s problems. The challenge remains of how to put these two pieces of information together. I find it very difficult to think about this sort of question where the available data are clearly relevant yet have such huge problems with selection. There’s a temptation to fall back on base rates but that doesn’t seem right to me either.

P.P.S. I emailed Campos and Silver regarding this post. Campos followed up here. I didn’t hear back from Silver, but I might not have his current email, so if anyone has that, could you please send it to me? Thanks.

My SciML Webinar next week (28 Sep): Multiscale generalized Hamiltonian Monte Carlo with delayed rejection

I’m on the hook to do a SciML webinar next week:

These are organized by Keith Phuthi (who is at CMU) through University of Michigan’s Institute for Computational Discovery and Engineering.

Sam Livingstone is moderating. I’ll be presenting joint work with Alex Barnett, Chirag Modi, Edward Roualdes, and Gilad Turok.

I’m very excited about this project as it combines a number of threads I’ve been working on with collaborators. When I gave my job talk here, Leslie Greengard, our center director, asked me why we didn’t use variable step size integrators when doing Hamiltonian Monte Carlo. I told him we’d love to, but didn’t know how to do it in such a way as to preserve the stationary target distribution.

Delayed rejection HMC

Then we found Antonietta Mira’s work on delayed rejection. It lets you try a second Metropolis proposal if the first one is rejected. The key here is that we can use a smaller step size for the second proposal, thus recovering from proposals that are rejected because the Hamiltonian diverged (i.e., the first-order gradient-based algorithm can’t handle regions of high curvature in the target density). There’s a bit of bookkeeping (which is frustratingly hard to write down) for the Hastings condition to ensure detailed balance. Chirag Modi, Alex Barnett, and I worked out the details, and Chirag figured out a novel twist on delayed rejection that only retries if the original acceptance probability was low. You can read about it in our paper:
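To give a flavor of the mechanics, here is a minimal sketch of delayed rejection for a plain one-dimensional random-walk Metropolis sampler (a cartoon, not our HMC version and not Chirag’s low-acceptance-only variant): a bold first proposal, then a smaller second proposal with Mira’s correction factor in the acceptance probability to preserve detailed balance.

```python
import numpy as np

def dr_metropolis(logp, x0, n_iter=5000, s1=2.0, s2=0.5, seed=0):
    """Two-stage delayed-rejection random-walk Metropolis (a sketch)."""
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        # Stage 1: bold Gaussian proposal
        y1 = x + s1 * rng.normal()
        a1 = np.exp(min(0.0, logp(y1) - logp(x)))
        if rng.random() < a1:
            x = y1
        else:
            # Stage 2: timid proposal, with the delayed-rejection correction
            y2 = x + s2 * rng.normal()
            a1_rev = np.exp(min(0.0, logp(y1) - logp(y2)))  # alpha1(y2 -> y1)
            # ratio of stage-1 proposal densities q1(y2 -> y1) / q1(x -> y1)
            log_q = ((y1 - x) ** 2 - (y1 - y2) ** 2) / (2 * s1 ** 2)
            log_a2 = (logp(y2) - logp(x) + log_q
                      + np.log(max(1 - a1_rev, 1e-300))
                      - np.log(max(1 - a1, 1e-300)))
            if rng.random() < np.exp(min(0.0, log_a2)):
                x = y2
        chain.append(x)
    return np.array(chain)
```

On a target with both wide and narrow regions, the second stage rescues proposals that the bold step would have wasted.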

This works really well and is enough that we can get proper draws from Neal’s funnel (vanilla HMC fails on this example in either the mouth or the neck of the funnel, depending on the step size). But it’s inefficient in that it retries an entire Hamiltonian trajectory, which means if we cut the step size in half, we double the number of steps to keep the integration time constant.

Radford Neal to the rescue

As we were doing this, the irrepressible Radford Neal published a breakthrough algorithm:

What he managed to do was use generalized Hamiltonian Monte Carlo (G-HMC) to build an algorithm that takes one step of HMC (like Metropolis-adjusted Langevin, but over the coupled position/momentum variables) and manages to maintain directed progress. Instead of fully resampling the momentum each iteration, G-HMC resamples a new momentum value and then takes a weighted average with the existing momentum, with most of the weight on the existing momentum. Neal shows that with a series of accepted one-step HMC iterations, we can make directed progress just like HMC with longer trajectories.

The trick is stringing together sequences of acceptances. Usually this doesn’t work because we have to flip the momentum each iteration. We can re-flip it when regenerating, to keep going in the same direction on acceptances, but on rejections we reverse the momentum (this isn’t an issue with standard HMC because it fully resamples the momentum each time). So to get directed movement with the usual independent uniform draws, we’d need step sizes that are too small. What Radford figured out is that we can solve this problem by changing the way we generate the uniform(0, 1)-distributed probabilities for the Metropolis accept/reject step (we compare the generated variate to the ratio of the density at the proposal to the density at the previous point, and accept if the variate is lower). Radford realized that if we instead generate them in a sawtooth pattern (with micro-jitter for ergodicity), then when we’re at the bottom of the sawtooth, generating a sequence of values near zero, the acceptances will cluster together.
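Here is a cartoon of the two ingredients (my simplification for illustration, not Radford’s exact update rules): a partial momentum refresh that preserves the standard normal momentum distribution, and an accept-step sequence that sweeps through (0, 1) instead of being drawn independently.

```python
import numpy as np

rng = np.random.default_rng(0)

def refresh_momentum(p, alpha=0.95):
    """Partial momentum refresh for G-HMC: keep most of the old momentum,
    mix in a little fresh noise. If p ~ N(0, 1), the result is still N(0, 1);
    alpha = 0 recovers full resampling, as in standard HMC."""
    return alpha * p + np.sqrt(1.0 - alpha ** 2) * rng.normal(size=np.shape(p))

def sawtooth_uniforms(n, period=50, jitter=0.02):
    """A cartoon of the sawtooth idea: accept-step 'uniforms' sweep slowly
    from 0 to 1 with micro-jitter, so the small values (easy accepts)
    come clustered together instead of scattered."""
    u = (np.arange(n) % period) / period + jitter * rng.random(n)
    return u % 1.0
```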

Replacing Neal’s trick with delayed rejection

Enter Chirag’s and my intern, Gilad Turok (who came to us as an undergrad in applied math at Columbia). Over the summer, working with me, Chirag, and Edward Roualdes (who was here as a visitor), he built and evaluated a system that replaces Neal’s trick (the sawtooth pattern of acceptance probabilities) with Mira’s trick (delayed rejection). It indeed solves the multiscale problem. It exceeded our expectations in terms of efficiency—it’s about twice as fast as our delayed rejection HMC. Going one HMC step at a time, it is able to adjust its step size within what would be a single Hamiltonian trajectory. That is, we finally have something that works roughly like a typical ODE integrator in applied math.

Matt Hoffman to the rescue

But wait, that’s not all. There’s room for another one of the great MCMC researchers to weigh in. Matt Hoffman, along with Pavel Sountsov, figured out how to take Radford’s algorithm and provide automatic adaptation for it.

What Hoffman and Sountsov manage to do is run a whole lot of parallel chains, then use information in the other chains to set the tuning parameters for a given chain. In that way it’s like the Goodman and Weare affine-invariant sampler that’s used in the Python package emcee. This involves estimating the metric (the posterior covariance, or just the variances in the diagonal case) and also the step size, which they do through a heuristic largest-eigenvalue estimate. Among the pleasant properties of their approach is that the entire setup produces a Markov chain from the very first iteration. That means we only have to do what people call “burn in” (sorry Andrew, but notice how I say other people call it that, not that they should), not set aside some number of iterations for adaptation.

Edward Roualdes has coded up Hoffman and Sountsov’s adaptation and it appears to work with delayed rejection replacing Neal’s trick.

Next for Stan?

I’m pretty optimistic that this will wind up being more efficient than NUTS and also make things like parallel adaptation and automatic stopping a whole lot easier. It should be more efficient because it doesn’t waste work—NUTS goes forward and backward in time and then subsamples along the final doubling (usually—it’s stochastic with a strong bias toward doing that). This means we “waste” the work going the wrong way in time and beyond the point where we finally sample. But we still have a lot of evaluation to do before we can replace Stan’s longstanding sampler or even provide an alternative.

My talk

The plan’s basically to expand this blog post with details and show you some results. Hope to see you there!

How big a problem is it that cross-validation is biased?

Some weeks ago, I posted on Mastodon (you can follow me there) a thread about “How big a problem is it that cross-validation is biased?” I have also added that text to the CV-FAQ. Today I extended that thread, as we have a new paper out on estimating and correcting selection-induced bias in cross-validation model selection.

I’m posting here the whole thread for the convenience of those who are not (yet?) following me on Mastodon:

Unbiasedness has a special role in statistics, and too often there are dichotomous comments that something is not valid or is inferior because it’s not unbiased. However, often the non-zero bias is negligible, and often by modifying the estimator we may even increase bias but reduce the variance a lot, providing an overall improved performance.

In CV the goal is to estimate the predictive performance for unobserved data given the observed data of size n. CV has a pessimistic bias due to using fewer than n observations to fit the models. In the case of LOO-CV this bias is usually small and negligible. In the case of K-fold CV with a small K, the bias can be non-negligible, but if the effective number of parameters of the model is much less than n, then with K > 10 the bias is also usually negligible compared to the variance.
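As a toy illustration of the pessimistic bias (my sketch with a made-up setup, using the simplest possible “model,” a sample mean): predicting a new N(0,1) observation with the mean of n points has expected squared error 1 + 1/n, while K-fold CV fits on only n(K-1)/K points, so it over-estimates the error, less so as K grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 30, 4000
target = 1 + 1 / n   # true expected squared error of the mean fit on all n

def cv_mse(y, K):
    # K-fold CV squared error for the sample-mean "model"
    folds = np.array_split(rng.permutation(n), K)
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        errs.extend((y[f] - y[train].mean()) ** 2)
    return np.mean(errs)

biases = {K: np.mean([cv_mse(rng.normal(size=n), K) for _ in range(reps)]) - target
          for K in (2, 10, n)}   # K = n is LOO
print(biases)   # positive (pessimistic) bias, shrinking as K grows
```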

There is a bias correction approach by Burman (1989) (see also Fushiki (2011)) that reduces CV bias, but even in cases where the bias reduction is non-negligible, the variance tends to increase so much that there is no real benefit (see, e.g., Vehtari and Lampinen (2002)).

For time series, when the task is to predict the future (there are other possibilities, such as missing-data imputation), there are specific CV methods, such as leave-future-out (LFO), that have lower bias than LOO-CV or K-fold CV (Bürkner, Gabry, and Vehtari, 2020). There are sometimes comments that LOO-CV and K-fold CV would be invalid for time series. Although they tend to have a bigger bias than LFO, they are still valid and can be useful, especially in model comparison, where the bias can cancel out.

Cooper et al. (2023) demonstrate that in time series model comparison the variance is likely to dominate, that it is more important to reduce the variance than the bias, and that leaving out a few observations and using the joint log score is better than using LFO. The problem with LFO is that the data sets used for fitting the models are smaller, increasing the variance.

Bengio and Grandvalet (2004) proved that there is no unbiased estimate of the variance of CV in general, which has later been used as an argument that there is no hope. Instead of dichotomizing into unbiased or biased, Sivula, Magnusson, and Vehtari (2020) consider whether the variance estimates are useful and how to diagnose when the bias is likely not to be negligible (Sivula, Magnusson, and Vehtari (2023) also prove a special case where an unbiased variance estimate actually exists).

CV tends to have high variance, as the sample reuse does not make any modeling assumptions (this holds also for information criteria such as WAIC). Not making modeling assumptions is good when we don’t trust our models, but if we do trust them, we can get reduced variance in model comparison, for example, by examining the posterior directly or by using reference models to filter out noise in the data (see, e.g., Piironen, Paasiniemi, and Vehtari (2018) and Pavone et al. (2020)).

When using CV (or information criteria such as WAIC) for model selection, the performance estimate for the selected model has additional selection-induced bias. In the case of a small number of models this bias is usually negligible, that is, smaller than the standard deviation of the estimate or smaller than what is practically relevant. When the bias is negligible, we may choose a suboptimal model, but the difference from the performance of the oracle model is small.

In the case of a large number of models the selection-induced bias can be non-negligible, but this bias can be estimated using, for example, nested CV or the bootstrap. The concepts of selection-induced bias and the related potentially harmful overfitting are not new, but there hasn’t been enough discussion of when they are negligible or non-negligible.
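The selection-induced bias is easy to see in a toy simulation (my sketch with made-up numbers: pure-noise “predictors,” so no model truly helps). The naive approach reports the CV score of the CV-selected model; nested CV scores the whole selection procedure on outer held-out folds.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 200
y = rng.normal(size=n)               # pure-noise target
X = rng.normal(size=(n, p))          # candidate predictors, all junk

def kfold_mse(x, y, K=5):
    """K-fold CV mean squared error of the no-intercept fit y ~ b*x."""
    idx = np.array_split(np.arange(len(y)), K)
    errs = []
    for f in idx:
        tr = np.setdiff1d(np.arange(len(y)), f)
        b = x[tr] @ y[tr] / (x[tr] @ x[tr])
        errs.extend((y[f] - b * x[f]) ** 2)
    return np.mean(errs)

# Naive: select the best of p models by CV, report its own CV score
naive = min(kfold_mse(X[:, j], y) for j in range(p))

# Nested: outer folds score the selection procedure itself
nested_errs = []
for f in np.array_split(np.arange(n), 5):
    tr = np.setdiff1d(np.arange(n), f)
    j = int(np.argmin([kfold_mse(X[tr, k], y[tr]) for k in range(p)]))
    b = X[tr, j] @ y[tr] / (X[tr, j] @ X[tr, j])
    nested_errs.extend((y[f] - b * X[f, j]) ** 2)
nested = np.mean(nested_errs)
print(naive, nested)   # naive is optimistic; the gap is the selection bias
```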

In our new paper with Yann McLatchie, “Efficient estimation and correction of selection-induced bias with order statistics,” we review the concepts of selection-induced bias and overfitting, propose a fast-to-compute estimate of the bias, and demonstrate how this can be used to avoid selection-induced overfitting even when selecting among 10^30 models.

The figure here shows simulation results with p=100 covariates, with different data sizes n, and varying block correlation among the covariates. The red lines show the LOO-CV estimate for the best model chosen so far in forward search. The grey lines show the performance on independent, much bigger test data, which we usually don’t have available. The black lines show our corrected estimate, taking into account the selection-induced bias. Stopping the searches at the peak of the black curves avoids overfitting.

Although we can estimate and correct the selection-induced bias, we primarily recommend using more sensible priors and not doing model selection. See more in “Efficient estimation and correction of selection-induced bias with order statistics” and Bayesian Workflow.

Using forecasts to estimate individual variances

Someone who would like to remain anonymous writes:

I’m a student during the school year, but am working in industry this summer. I am currently attempting to overhaul my company’s model of retail demand. We advise suppliers to national retailers, our customers are suppliers. Right now, for each of our customers, our demand model outputs a point estimate of how much of their product will be consumed at one of roughly a hundred locations. This allows our customers to decide how much to send to each location.

However, because we are issuing point estimates of mean demand, we are *not* modeling risk directly, and I want to change that, as understanding risk is critical to making good decisions about inventory management – the entire point of excess inventory is to provide a buffer against surprises.

Additionally, the model currently operates on a per-day basis, so that predictions for a month from now are obtained by chaining together thirty predictions about what day N+1 will look like. I want to change that too, because it seems to be causing a lot of problems with errors in the model propagating across time, to the point that predictions over even moderate time intervals are not reliable.

I already know how to do both of these in an abstract way.

I’m willing to bite the bullet of assuming that the underlying distribution of the PDF should be multivariate Gaussian. From there, arriving at the parameters of that PDF just requires max likelihood estimation. For the other change, without going into a lot of tedious detail, Neural ODE models are flexible with respect to time such that you can use the same model to predict the net demand accumulated over t=10 days as you would to predict the net demand accumulated over t=90 days, just by changing the time parameter that you query the model with.

The problem is, although I know how to build a model that will do this, I want the estimated variance for each customer’s product to be individualized. Yet frustratingly, in a one-shot scenario, the maximum likelihood estimator of variance is zero. The only datapoint I’ll have to use to train the model to estimate the mean aggregate demand for, say, cowboy hats in Seattle at time t=T (hereafter (c,S,T)) is the actual demand for that instance, so the difference between the mean outcome and the actual outcome will be zero.

It’s clear to me that if I want to arrive at a good target for variance or covariance in order to conduct risk assessment, I need to do some kind of aggregation over the outcomes, but most of the obvious options don’t seem appealing.

– If I obtain an estimate of variance by thinking about the difference between (c,S,T) and (c,Country,T), aggregating over space, I’m assuming that each location shares the same mean demand, which I know is false.

– If I obtain one by thinking about the difference between (c,S,T) and (c,S,tbar), aggregating over time, I am assuming there’s a stationary covariance matrix for how demand accumulates at that location over time, which I know is false. This will fail especially badly if issuing predictions across major seasonal events, such as holidays or large temperature changes.

– If I aggregate across customers by thinking about the difference between (c,S,T) and (cbar,S,T), I’ll be assuming that the demand for cowboy hats at S,T should obey similar patterns as the demand for other products, such as ice cream or underwear sales, which seems obviously false.

I have thought of an alternative to these, but I don’t know if it’s even remotely sensible, because I’ve never seen anything like it done before. I would love your thoughts and criticisms on the possible approach. Alternatively, if I need to bite the bullet and go with one of the above aggregation strategies instead, it would benefit me a lot to have someone authoritative tell me so, so that I stop messing around with bad ideas.

My thought was that instead of asking the model to use the single input vector associated with t=0 to predict a single output vector at t=T, I could instead ask the model to make one prediction per input vector for many different input vectors from the neighborhood of time around t=0 in order to predict outcomes at a neighborhood of time around t=T. For example, I’d want one prediction for t=-5 to t=T, another prediction for t=-3 to t=T+4, and so on.

I would then judge the “true” target variance for the model relative to the difference between (c,S,T)’s predicted demand and the average of the model’s predicted demands for those nearby time slices. The hope is that this would reasonably correspond to the risks that customers should consider when optimizing their inventory management, by describing the sensitivity of the model to small changes in the input features and target dates it’s queried on. The model’s estimate of its own uncertainty wouldn’t do a good job of representing out-of-model error, of course, but the hope is that it’d at least give customers *something*.

Does this make any sense at all as a possible approach, or am I fooling myself?

My reply: I haven’t followed all the details, but my guess is that your general approach is sound. It should be possible to just fit a big Bayesian model in Stan, but maybe that would be too slow; I don’t really know how big the problem is. The sort of approach described above, where different models are fit and compared, can be thought of as a kind of computational approximation to a more structured hierarchical model, in the same way that cross-validation can be thought of as an approximation to an error model, or smoothing can be thought of as an approximation to a time-series model.

Improving Survey Inference in Two-phase Designs Using Bayesian Machine Learning

Xinru Wang, Lauren Kennedy, and Qixuan Chen write:

The two-phase sampling design is a cost-effective sampling strategy that has been widely used in public health research. The conventional approach in this design is to create subsample specific weights that adjust for probability of selection and response in the second phase. However, these weights can be highly variable which in turn results in unstable weighted analyses. Alternatively, we can use the rich data collected in the first phase of the study to improve the survey inference of the second phase sample. In this paper, we use a Bayesian tree-based multiple imputation (MI) approach for estimating population means using a two-phase survey design. We demonstrate how to incorporate complex survey design features, such as strata, clusters, and weights, into the imputation procedure. We use a simulation study to evaluate the performance of the tree-based MI approach in comparison to the alternative weighted analyses using the subsample weights. We find the tree-based MI method outperforms weighting methods with smaller bias, reduced root mean squared error, and narrower 95% confidence intervals that have closer to the nominal level coverage rate. We illustrate the application of the proposed method by estimating the prevalence of diabetes among the United States non-institutionalized adult population using the fasting blood glucose data collected only on a subsample of participants in the 2017-2018 National Health and Nutrition Examination Survey.

Yes, weights can be variable! Poststratification is better, but we don’t always have the relevant information. Imputation is a way to bridge the gap. Imputations themselves are model-dependent and need to be checked. Still, the alternatives of ignoring the design in the calculations or relying on weights have such problems of their own that I think modeling is the way to go. Further challenges will arise, such as imputing cluster membership in the population.

When I said, “judge this post on its merits, not based on my qualifications,” was this anti-Bayesian? Also a story about lost urine.

Paul Alper writes, regarding my post criticizing an epidemiologist and a psychologist who were coming down from the ivory tower to lecture us on “concrete values like freedom and equality”:

In your P.P.S. you write,

Yes, I too am coming down from the ivory tower to lecture here. You’ll have to judge this post on its merits, not based on my qualifications. And if I go around using meaningless phrases such as “concrete values like freedom and equality,” please call me on it!

While this sounds reasonable, is it not sort of anti-Bayes? By that I mean your qualifications represent a prior and the merits the (new) evidence. I am not one to revere authority but deep down in my heart, I tend to pay more attention to a medical doctor at the Mayo Clinic than I do to Stella Immanuel. On the other hand, decades ago the Mayo Clinic misplaced (lost!) a half liter of my urine and double charged my insurance when getting a duplicate a few weeks later.

Alper continues:

Upon reflection—this was over 25 years ago—a half liter of urine does sound like an exaggeration, but not by much. The incident really did happen and Mayo tried to charge for it twice. I certainly have quoted it often enough so it must be true.

On the wider issue of qualifications and merit, surely people with authority (degrees from Harvard and Yale, employment at the Hoover Institution, Nobel Prizes) are given slack when outlandish statements are made. James Watson, however, is castigated exceptionally precisely because of his exceptional longevity.

I don’t have anything to say about the urine, but regarding the Bayesian point . . . be careful! I’m not saying to make inferences or make decisions solely based on local data, ignoring prior information coming from external data such as qualifications. What I’m saying is to judge this post on its merits. Then you can make inferences and decisions in some approximately Bayesian way, combining your judgment of this post with your priors based on your respect for my qualifications, my previous writings, etc.

This is related to the point that a Bayesian wants everybody else to be non-Bayesian. Judge my post on its merits, then combine with prior information. Don’t double-count the prior.

This is what “power = .06” looks like (visualized by Art Owen).

Art Owen writes:

Here’s a figure you might like. I generated a bunch of data sets constructed to have true effect size 1 and power equal to 0.06. The first 100 confidence intervals that exclude 0 are shown. Their endpoints come close to the origin and their centers have absolute value far from 1. Just over 1/4 of them have the wrong sign.

When the CI endpoint is a full CI width away from zero, you’d be pretty safe regarding the sign. Details are in this article.

I do like the above figure. It’s a very vivid expression of This is what “power = .06” looks like. Get used to it. Somebody should send it to the University of Chicago economics department.
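You can reproduce the gist of Owen’s picture with a few lines of simulation (my own back-of-envelope setup: true effect 1 and standard error 3.33, which gives a two-sided z-test power of about 0.06; Owen’s exact construction may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
effect, se = 1.0, 3.33                 # power ~ 0.06 for a two-sided z-test
est = rng.normal(effect, se, 200_000)  # repeated noisy estimates of the effect
signif = np.abs(est) > 1.96 * se       # "statistically significant" results

power = signif.mean()
type_s = (est[signif] < 0).mean()             # wrong sign, among significant
type_m = np.abs(est[signif]).mean() / effect  # exaggeration factor
print(power, type_s, type_m)
```

The significant estimates are forced out past 1.96 standard errors, so they are wildly exaggerated, and a substantial fraction land on the wrong side of zero.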

PhD student, PostDoc, and Research software engineering positions

Several job opportunities in beautiful Finland!

  1. Fully funded postdoc and doctoral student positions in various topics, including Bayesian modeling, probabilistic programming, and workflows, with me and other professors at Aalto University and the University of Helsinki, funded by the Finnish Center for Artificial Intelligence

    See more topics, how to apply, and job details like salary at fcai.fi/we-are-hiring

    You can also ask me for further details

  2. Permanent full-time research software engineer position at Aalto University. Aalto Scientific Computing is a specialized type of research support, providing high-performance computing hardware, management, research support, teaching, and training. The team works with top researchers throughout the university. All the work is open-source by default, and the team takes an active part in worldwide projects.

    See more about the tasks, qualifications, salary, etc., at www.aalto.fi/en/open-positions/research-software-engineer

    This could also be a great fit for someone interested in probabilistic programming. I know some of the RSE group members, they are great, and we’ve been very happy to get their help, e.g., in developing the priorsense package.

There are no underpowered datasets; there are only underpowered analyses.

Is it ok to pursue underpowered studies?

This question comes from Harlan Campbell, who writes:

Recently we saw two different commentaries on the importance of pursuing underpowered studies, both with arguments motivated by thoughts on COVID-19 research:

COVID-19: underpowered randomised trials, or no randomised trials? by Atle Fretheim

Causal analyses of existing databases: no power calculations required by Miguel Hernán

Both explain the important idea that underpowered/imprecise studies “should be viewed as contributions to the larger body of evidence” and emphasize that several of these studies can, when combined together in a meta-analysis, “provide a more precise pooled effect estimate”.

Both sparked quick replies, including one from myself and others, and even got some press.

My personal opinion is that there are both costs (e.g., wasting valuable resources, furthering distrust in science) and benefits (e.g., learning about an important causal question) to pursuing underpowered studies. The trade-off may indeed tilt towards the benefits if the analysis question is sufficiently important; much like driving through a red light en route to the hospital might be advisable in a medical emergency, but should otherwise be avoided. In the latter situation, risks can be mitigated with a trained ambulance driver at the wheel and a wailing siren. When it comes to pursuing underpowered studies, there are also ways to minimize risks. For example, by committing to publish one’s results regardless of the outcome, by pre-specifying all of one’s analyses, and by making the data publicly available, one can minimize the study’s potential contribution to furthering distrust in science. That’s my two cents. In any case, it certainly is an interesting question.

I agree with the general principle that data are data, and there’s nothing wrong with gathering a little bit of data and publishing what you have, in the hope that it can be combined now or later with other data and used to influence policy in an evidence-based way.

To put it another way, the problem is not “underpowered studies”; it’s “underpowered analyses.”

In particular, if your data are noisy relative to the size of the effects you can reasonably expect to find, then it’s a big mistake to use any sort of certainty thresholding (whether that be p-values, confidence intervals, posterior intervals, Bayes factors, or whatever) in your summary and reporting. That would be a disaster—type M and S errors will kill you.

So, if you expect ahead of time that the study will be summarized by statistical significance or some similar thresholding, then I think it’s a bad idea to do the underpowered study. But if you expect ahead of time that the raw data will be reported and that any summaries will be presented without selection, then the underpowered study is fine. That’s my take on the situation.

The fundamental role of data partitioning in predictive model validation

David Zimmerman writes:

I am a grad student in biophysics and basically a novice to Bayesian methods. I was wondering if you might be able to clarify something that is written in section 7.2 of Bayesian Data Analysis. After introducing the log pointwise predictive density as a scoring rule for probabilistic prediction, you say:

The advantage of using a pointwise measure, rather than working with the joint posterior predictive distribution … is in the connection of the pointwise calculation to cross-validation, which allows some fairly general approaches to approximation of out-of-sample fit using available data.

But would it not be possible to do k-fold cross-validation, say, with a loss function based on the joint predictive distribution over each full validation set? Can you explain why (or under what circumstances) it is preferable to use a pointwise measure rather than something based on the joint predictive?

My reply: Yes, for sure you can do k-fold cross-validation. Leave-one-out (LOO) has the advantage of being automatic to implement in many models using Pareto-smoothed importance sampling, but for structured problems such as time series and spatial models, k-fold can make more sense. The reason we made such a big deal in our book about the pointwise calculation was to emphasize that predictive validation is fundamentally a process that involves partitioning the data. This aspect of predictive validation is hidden by AIC and related expressions such as DIC that work with the unpartitioned joint likelihood. When writing BDA3 we worked to come up with an improvement/replacement for DIC—the result was chapter 7 of BDA3, along with this article with Aki Vehtari and Jessica Hwang—and part of this was a struggle to manipulate the posterior simulations of the joint likelihood. At some point I realized that the partitioning was necessary, and this point struck me as important enough to emphasize when writing all this up.
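To make the pointwise idea concrete, here is a minimal sketch of k-fold cross-validation of the log pointwise predictive density. It assumes a toy setup, not anything from BDA3: a normal model with known standard deviation 1 and a flat prior on the mean, so the posterior predictive distribution for a new observation is normal with mean ybar and standard deviation sqrt(1 + 1/n). The key structural point is that each held-out observation contributes its own log predictive density term, which are then summed:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=60)   # toy data for illustration

def normal_logpdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

# 5-fold cross-validation estimate of the expected log pointwise
# predictive density (elpd), for a normal model with known sd = 1 and
# a flat prior on the mean: the posterior predictive for a new point
# is then N(ybar, sqrt(1 + 1/n)).
k = 5
folds = np.array_split(rng.permutation(len(y)), k)

elpd = 0.0
for test_idx in folds:
    train = np.delete(y, test_idx)
    mu_hat = train.mean()
    pred_sd = np.sqrt(1.0 + 1.0 / len(train))
    # pointwise: each held-out observation contributes its own log density
    elpd += normal_logpdf(y[test_idx], mu_hat, pred_sd).sum()

print(f"k-fold elpd estimate: {elpd:.1f}")
```

Nothing stops you from instead scoring the joint predictive density of each whole validation set, as the question asks; the pointwise version is what connects to LOO and to the approximations mentioned in the reply.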

And here’s Aki’s cross-validation FAQ and two of his recent posts on the topic:

from 2020: More limitations of cross-validation and actionable recommendations

from 2022: Moving cross-validation from a research idea to a routine step in Bayesian data analysis