A new piranha paper

Kris Hardies points to this new article, Impossible Hypotheses and Effect-Size Limits, by Wijnand and Lennert van Tilburg, which states:

There are mathematical limits to the magnitudes that population effect sizes can take within the common multivariate context in which psychology is situated, and these limits can be far more restrictive than typically assumed. The implication is that some hypothesized or preregistered effect sizes may be impossible. At the same time, these restrictions offer a way of statistically triangulating the plausible range of unknown effect sizes.

This is closely related to our Piranha Principle, which we first formulated here and then followed up with this paper. It’s great to see more work being done in this area.

My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar)

Hi all. The other day I announced a talk I’ll be giving at the NYU economics seminar. It will be Thurs 18 Apr 12:30pm at 19 West 4th St., room 517.

In my earlier post, I’d given the wrong day for the talk. I’d written that it was this Thurs, 7 Mar. That was wrong! Completely my fault here; I misread my own calendar.

So I hope nobody shows up to that room tomorrow! Thank you for your forbearance.

I hope to see you all on Thurs 18 Apr. Again, here’s the title and abstract:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

How large is that treatment effect, really? (My talk at the NYU economics seminar, Thurs 18 Apr, previously announced as Thurs 7 Mar)

Thurs 18 Apr 2024 (not Thurs 7 Mar), 12:30pm at 19 West 4th St., room 517:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Varying slopes and intercepts in Stan: still painful in 2024

Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, namely the varying effect of income on Republican preference by state. Each state has its own slope and intercept, tied together with a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ prior is what they would have used if Ben Goodrich had created it by that point.

Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.

I think we can do a bit better in the current version of Stan, but I have to confess up front that Andrew’s right—this is still painful. It took me around three hours to put together the model, simulations, and blog post, and I’m the one who designed the language! It would’ve been much faster if I hadn’t been trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.

The original Stan model

Here’s Will Hipson’s model:

data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}

parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real sigma; // population sigma
}

transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}

generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}

Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].

My Stan model

The documentation for the model is at the top of the Stan code; after that, the code has only a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code, just to link this back to what I was saying a few posts ago about comments).

/**
 * Varying slopes and intercept hierarchical linear regression.
 * N observations organized into J groups, with jj[n] being the group
 * and x[n, 1:K] the covariates for observation n.  The covariate
 * matrix x should include a column of 1s to include an intercept.
 * 
 * The slopes and intercept per group have a multivariate normal prior
 * and the scale has an exponential prior.  The location of the
 * multivariate normal prior has a standard normal hyperprior and its
 * covariance is decomposed into a correlation matrix with an LKJ
 * hyperprior and a scale vector with an exponential hyperprior. In
 * symbols: 
 *
 * Likelihood:
 *   y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
 *
 * Priors:
 *   sigma ~ exponential(1)
 *   beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
 * 
 * Hyperpriors:
 *   nu ~ normal(0, 1)
 *   scale(Sigma) ~ exponential(1)
 *   corr(Sigma) ~ lkj(2)
 *
 * where scale(Sigma) is the scale vector and corr(Sigma) is the
 * correlation matrix of Sigma.
 *
 * For efficiency and numerical stability, the covariance and
 * correlation matrices are Cholesky factored.
 */
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> N;                      // number of observations
  array[N] int<lower=1, upper=J> jj;   // group per observation
  int<lower=1> K;                      // number of covariates
  matrix[N, K] x;                      // data matrix
  vector[N] y;                         // observations
}
parameters {
  vector[K] nu;                        // location of beta[ , j]
  vector<lower=0>[K] tau;              // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega;     // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std;               // standard beta (beta - nu) / Sigma
  real<lower=0> sigma;                 // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1);  // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma                   // covariance of beta[, j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}

(WordPress is really annoying in its mishandling of angle brackets in pre environments.)

I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows_dot_product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, like multiple calculations of log(sigma), that’s the time waster in non-vectorized code.)

One could make the case for suffixing all the parameters of the prior for beta with _beta; you’d probably want to do that in a context with multiple groupings. But suffixed names get hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.

I put all the comments to the right at the end of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. The one exception is the single inline comment I’d keep, which spells out the implication of the non-centered parameterization for the distribution of beta.

Let’s take it for a spin

I actually wrote the model first. I would generally recommend building and debugging in stages, but I’ve built this model many times, so I decided to roll the dice.

What I haven’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to understand that way).

WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (c) cognitive science including psycholinguistics and cognitive anthropology, (d) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (e) philosophy including 20th century analytic philosophy and philosophy of mind.

HOW: Respond in a concise and neutral academic style without embellishments.

Please call me “Bob”.

This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.

BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.

Then I want to set N = 1000 observations. For each observation n in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.

GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]

import numpy as np
import scipy.linalg

# Set parameters
K = 5  # number of covariates
J = 10  # number of groups
N = 1000  # number of observations
sigma = 1.5  # scale of errors

# Generate nu and Sigma
nu = np.random.normal(0, 1, K)  # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T)  # create a positive semi-definite matrix for covariance

# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T  # K x J matrix

# Generate observations
jj = np.random.randint(0, J, N)  # random group for each observation
x = np.random.normal(0, 1, (N, K))  # N x K standard normal matrix
x[:, 0] = 1  # set first column to all 1s for intercept

# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)

y[:5]  # Show the first 5 values of y as a sample output

The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:

−2.27
8.86
8.20
1.30
7.27

These values are based on the simulation with K=5 covariates,
J=10 groups, N=1000 observations, and an error scale σ=1.5.

It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.

Anyway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with simulations and simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.

import numpy as np
import scipy.linalg

def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T 
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
    
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)

It takes Stan about 20 seconds to fit this data set; R-hats are all less than 1.01, ESS is in the thousands from a sample of size 4000, and all but a couple of parameters are recovered within their 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of covariance too seriously!
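
For completeness, here is a minimal way to pull those diagnostics from the fit object above in cmdstanpy (the exact summary column names depend on the CmdStan version installed):

# Continuing from the fit above: posterior summaries (R-hat, ESS, quantiles)
# and the standard sampler diagnostic report.
print(fit.summary().head(12))
print(fit.diagnose())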

Appendix on doc

Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this triggers some of the same issues. I want to say up front that doc is hard, and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in the R parts of his example with the same eye.

First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.

vector[N] mu; // declaring a mu vector

You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,

  sigma ~ exponential(1); // using an exponential prior on sigma

And this is what I meant by documenting the language.

mu[i] = x[i] * beta; // * is matrix multiplication in this context

and

  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix

Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.

The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.

Continuing with doc for z, I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.

Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i], despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first, then generalize later if you need to. I coded the general case above.

I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.

This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.

Free online book by Bruno Nicenboim, Daniel Schad, and Shravan Vasishth on Bayesian inference and hierarchical modeling using brms and Stan

Shravan points us to these materials:

Hierarchical models are bread and butter stuff for psycholinguists, so we are trying hard to make Stan/brms mainstream through various means. Teaching this stuff feels like the most important work I am doing right now, more important even than the scientific side of things.

We have chapters on hierarchical modeling in our book (to be published soon with CRC Press), we use both brms and Stan:

https://vasishth.github.io/bayescogsci/book/ [edit: made it a live link]

The online version will remain available for free. Comments/corrections are welcome; one can open issues: https://github.com/vasishth/bayescogsci/issues

This summer, I [Shravan] am teaching an intro to Bayes using brms/Stan, with a focus on hierarchical modeling, especially directed at researchers in linguistics who do experimental work:

https://www.mils.ugent.be/courses/module-9-bayesian-data-analysis/

Plus, at Potsdam, for the last seven years I have been running an annual summer school on stats for linguistics and psych, where our focus is on hierarchical modeling using Stan/brms:

https://vasishth.github.io/smlp2024/

Here, we teach both frequentist and Bayesian approaches to hierarchical modeling.

Cool! Good to have these resources out there.

Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson

I was teaching varying-intercept, varying-slope multilevel models, and . . . I can get them to fit in Stan, but the code is kinda ugly, so I was struggling to clean it up, with no success. This will be a real research project, to add appropriate functions and possibly expand the Stan language so that these models can be written at a higher, more intuitive level.

Varying-intercept models aren’t so bad. In lme4 or blme or rstanarm or brms, you write something like:

y ~ (1 | group) + x + z + x:z

and that transfers pretty directly into Stan. Just create the X matrix and go from there. Indeed, you can add as many batches of varying coefficients as you like, and it’s no biggie to code it up.

But once you get to varying intercepts and slopes, it all changes. In lme4 or blme or rstanarm or brms, you can just write things like:

y ~ (1 + z | group) + x + z + x:z

But if you want to program this directly in Stan, once you have varying intercepts and slopes, you have to deal with covariance-matrix decompositions and arrays of coefficient vectors, and it’s all a hairy mess.

What to do? For this semester’s class, Imma just gonna go with lme4/blme/rstanarm when fitting varying-intercept, varying-slope models. All this Stan coding is a rabbit hole that’s getting us away from the goal, which is to be able to fit, use, and evaluate statistical models for measurement and variation.

I would like to be able to more easily fit these in Stan, though. Why, you might ask? If we can fit them in lme4, or blme for more stability, or rstanarm for including more uncertainty in the inferences, then why bother coding directly in Stan?

The answer for why we want to code directly in Stan is that we’re often wanting to expand our models, for example adding mixture components, measurement error terms, time series or spatial dependence, etc.

For that reason, you will want to be able to code varying-intercept, varying-slope models in Stan—even if I won’t be teaching that in class this semester.

The good news is that I did some googling and found this tutorial by Will Hipson on programming hierarchical regressions in Stan. It’s from 2020 and I have not looked at every line of code there, but it all looks reasonable and there’s lots of explanation of the workflow. So maybe this is the best place to start, if you want to go in this direction, as you should!

Mister P and Stan go to Bangladesh . . .

Prabhat Barnwal, Yuling Yao, Yiqian Wang, Nishat Akter Juy, Shabib Raihan, Mohammad Ashraful Haque, and Alexander van Geen ask,

Is the low COVID-19–related mortality reported in Bangladesh for 2020 associated with massive undercounting?

Here’s what they did:

This repeated survey study is based on an in-person census followed by 2 rounds of telephone calls. Data were collected from a sample of 135 villages within a densely populated 350-km2 rural area of Bangladesh. Household data were obtained first in person and subsequently over the telephone. For the analysis, mortality data were stratified by month, age, sex, and household education. Mortality rates were modeled by bayesian multilevel regression, and the strata were aggregated to the population by poststratification. Data analysis was performed from February to April 2021. . . .

Mortality rates were compared for 2019 and 2020, both without adjustment and after adjustment for nonresponse and differences in demographic variables between surveys. Income and food availability reported for January, May, and November 2020 were also compared.

And here’s what they found:

All-cause mortality in the surveyed area was lower in 2020 compared with 2019, but measures to control the COVID-19 pandemic were associated with a reduction in rural income and food availability. These findings suggest that government restrictions designed to curb the spread of COVID-19 may have been effective in 2020 but needed to be accompanied by expanded welfare support.

More specifically:

Enumerators collected data from an initial 16 054 households in January 2020 . . . for a total of 58 806 individuals . . . A total of 276 deaths were reported between February and the end of October 2020 for the subset of the population that could be contacted twice over the telephone, slightly below the 289 deaths reported for the same population over the same period in 2019. After adjustment for survey nonresponse and poststratification, 2020 mortality changed by −8% (95% CI, −21% to 7%) compared with an annualized mortality of 6.1 deaths per 1000 individuals in 2019. However, in May 2020, salaried primary income earners reported a 40% decrease in monthly income (from 17 485 to 10 835 Bangladeshi Taka), and self-employed earners reported a 60% decrease in monthly income (23 083 to 8521 Bangladeshi Taka), with only a small recovery observed by November 2020.

I’ve worked with Lex and Yuling for a long time, and they both know what they’re doing.

Beyond the direct relevance of this work, the above-linked article is a great example of applied statistical analysis with multilevel regression and poststratification using Stan.

Bayesians are frequentists.

We wrote about this a few years ago, but it’s a point of confusion that keeps coming up, so I’m posting it again:

Bayesians are frequentists. The Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied.

Bayesian and frequentist inference are both about averaging over possible problems to which a method might be applied. Just as there are many different Bayesian approaches (corresponding to different sorts of models, that is, different sorts of assumptions about the set of problems over which to average), there are many different frequentist approaches (again, corresponding to different sorts of assumptions about the set of problems over which to average).

I see a direct mapping between the frequentist reference set and the Bayesian prior distribution. Another way to put it is that a frequentist probability, defined as relative frequency in repeated trials, requires some definition of what is a repeated trial. We discuss this, from various angles, in chapter 1 of BDA.

I agree that different groups of statisticians, who are labeled “Bayesians” and “frequentists,” can approach problems in different ways. I just don’t think the differences are so fundamental, because I think that any Bayesian interpretation of probability has to have some frequentist underpinning to be useful. And, conversely, any frequentist sample space corresponds to some set of problems to be averaged over.

For example, consider the statement, “Brazil had a 0.25 probability of winning the 2018 World Cup.” To the extent that this statement is the result of a Bayesian analysis (and not simply a wild guess), it would be embedded in a web of data and assumptions regarding other teams, other World Cups, other soccer games, etc. And, to me, this network of reference sets is closely related to the choice in frequentist statistics of what outcomes to average over.

Self-declared frequentists do have a way of handling one-off events: they call them “predictions.” In the classical frequentist world, you’re not allowed to make probabilistic inferences about parameters, but you are allowed to make probabilistic inferences about predictions. Indeed, in that classical framework, the difference between parameters and predictions or missing data is precisely that parameters are unmodeled, whereas predictions and missing data are modeled.

The comments are worth reading too. In particular, Ben Goodrich makes good points in disagreeing with me, and the late Keith O’Rourke and others add some useful perspective too.

This all came to mind because Erik van Zwet pointed me to some online discussion of our recent post. The commenter wrote:

In a recent blog, Andrew Gelman writes “Bayesians moving from defense to offense: I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”

Here is what is perplexing me.

It looks to me that ‘those thousands of medical trials’ are akin to long run experiments. So isn’t this a characteristic of Frequentism? So if bayesians want to use information from long run experiments, isn’t this a win for Frequentists?

What does going on offense really mean here?

Some of the participants in that thread did seem to get the point, but nobody knew about our saying, “Bayesians are frequentists,” so I added it to the lexicon.

So, just to say it again, yes, good Bayesian inference and good frequentist inference are both about having a modeled population (also called a prior distribution, or frequency distribution, or reference set) that is a good match to the applied question being asked.

A prior that corresponds to a relevant frequency distribution is a win for frequentists and for Bayesians too! Remember Hal Stern’s principle that the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use.

And, regarding what I meant by “Bayesians moving from defense to offense,” see our followup here.

The continuing challenge of poststratification when we don’t have full joint data on the population.

Torleif Halkjelsvik at the Norwegian Institute of Public Health writes:

Norway has very good register data (education/income/health/drugs/welfare/etc.) but it is difficult to obtain complete tables at the population level. It is however easy to get independent tables from different registries (e.g., age by gender by education as one data source and gender by age by welfare benefits as another). What if I first run a multilevel model to regularize predictions for a vast set of variables, but in the second step, instead of a full table, use a raking approach based on several independent post-stratification tables? Would that be a valid approach? And have you seen examples of this?

My reply: I think the right way to frame this is as a poststratification problem where you don’t have the full poststratification table, you only have some margins. The raking idea you propose could work, but to me it seems awkward in that it’s mixing different parts of the problem together. Instead I’d recommend first imputing a full poststrat table and then using this to do your poststratification. But then the question is how to do this. One approach is iterative proportional fitting (Deming and Stephan, 1940). I don’t know any clean examples of this sort of thing in the recent literature, but there might be something out there.

Halkjelsvik responded:

It is an interesting idea to impute a full poststrat table, but I wonder whether it is actually better than directly calculating weights using the proportions in the data itself. Cells that should be empty in the population (e.g., women, 80-90 years old, high education, sativa spray prescription) may not be empty in the imputed table when using iterative proportional fitting (IPF), and these “extreme” cells may have quite high or low predicted values. By using the data itself, such cells will be empty, and they will not “steal” any of the marginal proportions when using IPF. This is of course a problem in itself if the data is limited (if there are empty cells in the data that are not empty in the population).

Me: If you have information that certain cells are empty or nearly so, that’s information that you should include in the poststrat table. I think the IPF approach will be similar to the weighting; it is just more model-based. So if you think the IPF will give some wrong answers, that suggests you have additional information. I recommend you try to write down all the additional information you have and use all of it in constructing the poststratification table. This should allow you to do better than with any procedure that does not use this info.

Halkjelsvik:

After playing with a few scenarios (on a piece of paper, no simulation) I see that my suggested raking/weighting approach (which also would involve iterative proportional fitting) directly on the sample data is not a good idea in contexts where MRP is most relevant. That is, if the sample cell sizes are small and regularization matters, then the subgroups of interest (e.g. geographical regions) will likely have too little data on rare demographic combinations. The approach you suggested (full population table imputation based on margins) appears more reasonable, and the addition of “extra information” is obviously a good idea. But how about a hybrid: Instead of manually accounting for “extra information” (e.g., non-existing demographic combinations) this extra information can be derived directly from the proportions of the sample itself (across subgroups of interest) and can be used as “seed” values (i.e., before accounting for margins at the local level). Using information from the sample to create the initial (seed) values for the IPF may be a good way to avoid imputing positive values in cells that are structural zeros, given that the sample is sufficiently large to avoid too many “sample zeros” that are not true “structural zeros”.

So the following could be an approach for my problem?

1. Obtain regularized predictions from sample.

2. Produce full postrat seed table directly from “global” cell values in the sample (or from other available “global” data, e.g. if available only at national level). That is, regions start with identical seed structures.

3. Adjust the poststrat table by iterative proportional fitting based on local margins (but I have read that there may be convergence problems when there are many zeros in seed cells).

Me: I’m not sure! I really want to have a fully worked-out example, a case study of MRP where the population joint distribution (the poststratification table) is not known and it needs to be estimated from data. We’re always so sloppy in those settings. I’d like to do it with a full Bayesian model in Stan and then compare various approximations.
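
To make the iterative proportional fitting step concrete, here is a minimal two-way sketch in Python (a toy age-by-education table with made-up margins; the real problem would have more dimensions and a model-based or sample-based seed). Note that zero cells in the seed stay zero under IPF, which is one way to encode structural zeros:

import numpy as np

def ipf(seed, row_margin, col_margin, n_iter=1000, tol=1e-10):
    # Iteratively rescale rows and columns of the seed table so that its
    # margins match the known row and column totals (Deming-Stephan raking).
    table = seed.astype(float).copy()
    for _ in range(n_iter):
        table *= (row_margin / table.sum(axis=1))[:, None]   # match row totals
        table *= (col_margin / table.sum(axis=0))[None, :]   # match column totals
        if np.allclose(table.sum(axis=1), row_margin, atol=tol):
            break
    return table

# Toy example: age group x education, with margins from two separate registries.
seed = np.ones((3, 2))                      # uniform seed; set structural zeros to 0 here
row_margin = np.array([300., 500., 200.])   # age-group totals
col_margin = np.array([600., 400.])         # education totals
print(ipf(seed, row_margin, col_margin))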

Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”

Erik van Zwet, Sander Greenland, Guido Imbens, Simon Schwab, Steve Goodman, and I write:

We have examined the primary efficacy results of 23,551 randomized clinical trials from the Cochrane Database of Systematic Reviews.

We estimate that the great majority of trials have much lower statistical power for actual effects than the 80 or 90% for the stated effect sizes. Consequently, “statistically significant” estimates tend to seriously overestimate actual treatment effects, “nonsignificant” results often correspond to important effects, and efforts to replicate often fail to achieve “significance” and may even appear to contradict initial results. To address these issues, we reinterpret the P value in terms of a reference population of studies that are, or could have been, in the Cochrane Database.

This leads to an empirical guide for the interpretation of an observed P value from a “typical” clinical trial in terms of the degree of overestimation of the reported effect, the probability of the effect’s sign being wrong, and the predictive power of the trial.

Such an interpretation provides additional insight about the effect under study and can guard medical researchers against naive interpretations of the P value and overoptimistic effect sizes. Because many research fields suffer from low power, our results are also relevant outside the medical domain.

Also this new paper from Zwet with Lu Tian and Rob Tibshirani:

Evaluating a shrinkage estimator for the treatment effect in clinical trials

The main objective of most clinical trials is to estimate the effect of some treatment compared to a control condition. We define the signal-to-noise ratio (SNR) as the ratio of the true treatment effect to the SE of its estimate. In a previous publication in this journal, we estimated the distribution of the SNR among the clinical trials in the Cochrane Database of Systematic Reviews (CDSR). We found that the SNR is often low, which implies that the power against the true effect is also low in many trials. Here we use the fact that the CDSR is a collection of meta-analyses to quantitatively assess the consequences. Among trials that have reached statistical significance we find considerable overoptimism of the usual unbiased estimator and under-coverage of the associated confidence interval. Previously, we have proposed a novel shrinkage estimator to address this “winner’s curse.” We compare the performance of our shrinkage estimator to the usual unbiased estimator in terms of the root mean squared error, the coverage and the bias of the magnitude. We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.

Let me just repeat that last sentence:

We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.

From a Bayesian standpoint, this is no surprise. Bayes is optimal if you average over the prior distribution and can be reasonable if averaging over something close to the prior. Especially reasonable in comparison to naive unregularized estimates (as here).
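
As a toy illustration of that point (not the van Zwet et al. estimator itself, just textbook normal-normal shrinkage, with made-up values for the prior scale tau and the standard error), a quick simulation shows the shrinkage estimator beating the unbiased estimator in root mean squared error, both overall and among the “significant” results:

import numpy as np

rng = np.random.default_rng(1)
n_trials = 100_000
tau = 0.5      # prior sd of true effects (assumed, for illustration)
se = 1.0       # standard error of each trial's estimate

theta = rng.normal(0, tau, n_trials)         # true effects
est = theta + rng.normal(0, se, n_trials)    # unbiased estimates

# Normal-normal posterior mean: shrink toward zero by tau^2 / (tau^2 + se^2).
shrunk = est * tau**2 / (tau**2 + se**2)

sig = np.abs(est / se) > 1.96                # "statistically significant" trials

def rmse(estimate, truth):
    return np.sqrt(np.mean((estimate - truth) ** 2))

print("all trials:       unbiased", rmse(est, theta), " shrunk", rmse(shrunk, theta))
print("significant only: unbiased", rmse(est[sig], theta[sig]), " shrunk", rmse(shrunk[sig], theta[sig]))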

Erik summarizes:

We’ve determined how much we gain (on average over the Cochrane Database) by using our shrinkage estimator. It turns out to be about a factor 2 more efficient (in terms of the MSE) than the unbiased estimator. That’s roughly like doubling the sample size! We’re using similar methods as our forthcoming paper about meta-analysis with a single trial.

People sometimes ask me how I’ve changed as a statistician over the years. One answer I’ve given is that I’ve gradually become more Bayesian. I started out as a skeptic, wary of using Bayesian methods at all; then in grad school I started using Bayesian statistics in applications and realized it could solve some problems for me; when writing BDA and ARM, I still had the Bayesian cringe, using flat priors as much as possible or not talking about priors at all; then, with Aleks, Sophia, and others, I moved toward weakly informative priors; eventually, under the influence of Erik and others, I tried to use direct prior information. At this point I’ve pretty much gone full Lindley.

Just as a comparison to where my colleagues and I are now, check out my response in 2008 to a question from Sanjay Kaul about how to specify a prior distribution for a clinical trial. I wrote:

I suppose the best prior distribution would be based on a multilevel model (whether implicit or explicit) based on other, similar experiments. A noninformative prior could be ok but I prefer something weakly informative to avoid your inferences being unduly affected by extremely unrealistic possibilities in the tail of the distribution.

Nothing wrong with this advice, exactly, but I was still leaning in the direction of noninformativeness in a way that I would not anymore. Sander Greenland replied at the time with a recommendation to use direct prior information. (And, just for fun, here’s a discussion from 2014 on a topic where Sander and I disagree.)

Erik concludes:

I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?

That last question reminds me of our paper from 2008, Bayes: Radical, Liberal, or Conservative?

P.S. Also this:

You can click through to see the whole story.

P.P.S. More here on the “from defense to offense” thing.

Adding intermediate outcomes to an item-response (Bradley-Terry) model

Huib Meulenbelt writes:

Assume we have the following hierarchical Bradley-Terry model:

data {
  int<lower=0> K;                                // players
  int<lower=0> N;                                // number of rallies
  int<lower=1, upper=K> player_a;     // player a
  int<lower=1, upper=K> player_b;     // player b
  int<lower=0> y;                                 // number of rallies won
}
parameters {
  real<lower=0> sigma;
  vector[K] skill;                                   // ability for player K
}
model{
  sigma ~ lognormal(0, 0.5); 
  skill ~ normal(0, sigma);
  y ~ binomial_logit(N, skill[player_a] - skill[player_b]);
}

In this blog post you argue “there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.”

I agree completely. For each rally I obtained its length, and I would like this variable to influence the estimated skill levels of the player and opponent. The skill levels of the two players are closer to each other when the game ends in 11-7 and the match lasts 500 seconds than when the game ends in 11-7 and the match lasts only 100 seconds.

So, we move from
p[player A wins over player B] = logistic(skill_A – skill_B)

to

p[player A wins over player B] = logistic(f(rally_length) * (skill_A – skill_B))

How would you define this function?

My reply: This is known in psychometrics as an item-response model with a discrimination parameter. The multiplier, f(rally_length) in the above notation, is called the discrimination: the idea is that the higher it is, the more predictive the skill-level difference is of the outcome. If discrimination is zero, the skill difference doesn’t predict at all, and a negative discrimination corresponds to a prediction that goes in the unexpected direction (the worse player being more likely to win).

My answer to the immediate question above is: sure, try it out. You could start with some simple form for the function f and see how it works. Ultimately I’m not thrilled with this model because it is not generative. I expect you can do better by modeling the length of the rally as an intermediate outcome. You can do this in Stan too. I’d recommend starting with just a single parameter per player, but you might need to add another parameter for each player if the rally length varies systematically by player after adjusting for ability.
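
To make that concrete, here is a tiny Python sketch with one arbitrary choice for f (an exponential decay with a hypothetical rate gamma); the only point is that longer rallies push the win probability toward 1/2:

import numpy as np

def inv_logit(u):
    return 1.0 / (1.0 + np.exp(-u))

def p_win(skill_a, skill_b, rally_length, gamma=0.05):
    # Discrimination f(rally_length) = exp(-gamma * rally_length): long rallies
    # are taken to carry less information about the skill difference.
    f = np.exp(-gamma * rally_length)
    return inv_logit(f * (skill_a - skill_b))

print(p_win(1.0, 0.5, rally_length=5.0))    # short rallies: about 0.60
print(p_win(1.0, 0.5, rally_length=25.0))   # long rallies: closer to 0.54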

But the biggest thing is . . . above you say that you agree with me to model score differential and not just win/loss, but then in your model you’re only including win/loss as an outcome. You’re throwing away information! Don’t do that. Whatever model you use, I strongly recommend you use score differential, not win/loss, as your outcome.

Multilevel models for taxonomic data structures


This post from John Cook about the hierarchy of U.S. census divisions (nation, region, division, state, county, census tract, block group, census block, household, family, person) reminded me of something that I’ve discussed before but have never really worked out, which is the construction of families of hierarchical models for taxonomic data structures.

A few ideas come into play here:

1. Taxonomic structures come up a lot in social science. Cook gives the example of geography. Two other examples are occupation and religion, each of which can be categorized in many layers. Check out the Census Bureau’s industry and occupation classifications or the many categories of religion, religious denomination, etc.

2. Hierarchical modeling. If you have a national survey of, say, 2000 people classified by county, you won’t want to just do a simple “8 schools“-style hierarchical model, because if you do that you’ll pool the small counties pretty much all the way to the national average. You’ll want to include county-level predictors. One way of doing this is to include indicators for state, division, and region: that is, a hierarchical model over the taxonomy. This would be a simple example of the “unfolding flower” structure that expands to include more model structure as more data come in, thus avoiding the problem that the usual forms of weak priors are very strong as the dimensionality of the model increases while the data remain fixed (see Section 3 of my Bayesian model-building by pure thought paper from 1996).

I think there’s a connection here to quantum mechanics, in the way that, when a system is heated, new energy levels appear.

3. Taxonomies tend to have fractal structure. Some nodes have lots of branches and lots of depth, others just stop. For example, in the taxonomy of religious denominations, Christians can be divided into Catholics, Protestants, and others; Protestants can then be divided into mainline and evangelical, each of which can be further subdivided, whereas some of the others, such as Mormons, might not be divided at all. Similarly, some index entries in a book will have lots of sub-entries and can go on for multiple columns, within which some sub-entries have many sub-sub-entries, etc., and then other entries are just one line. You get the sort of long-tailed distribution that’s characteristic of self-similar processes. Mandelbrot wrote about this back in 1955! This might be his first publication of any kind about fractals, a topic that he productively chewed on for several decades more.

My colleagues and I have worked with some taxonomic models for voting based on geography:
– Section 5 of our 2002 article on the mathematics and statistics of voting power,
– Our recent unpublished paper, How democracies polarize: A multilevel perspective.
These are sort of like spatial models or network models, but structured specifically based on a hierarchical taxonomy. Both papers have some fun math.

I think there are general principles here, or at least something more to be said on the topic.

The importance of group-level predictors in multilevel models

Political scientist and Stan user Josh Alley points to this new paper on using hierarchical (multilevel) models to estimate heterogeneous effects.

I like it. The only thing I’d add is an emphasis that group-level predictors can be important. Jennifer and I discuss this in our multilevel modeling book too. For example, when doing MRP with a multilevel model for the 50 states, we also include indicators for region of the country and state-level predictors such as Republican vote share in the previous election. If you don’t do that, you partially pool small states to the national mean, which is a mistake.

The error for people to avoid is to naively think that just because you’ve fit a multilevel model, that you’ve now “controlled for” the groups and so group-level predictors are superfluous. That would be a mistake. Group-level predictors are important; indeed, one of the advantages of multilevel modeling is to facilitate the inclusion of group-level predictors.
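
Here is a deliberately stylized sketch of the point (known variances, one common shrinkage weight, made-up numbers throughout): when small-state estimates get pooled toward the national mean they end up far from the truth, whereas pooling toward a regression on a state-level predictor keeps them in the right neighborhood.

import numpy as np

rng = np.random.default_rng(0)
J = 50
prev_vote = rng.uniform(0.3, 0.7, J)                     # state-level predictor
theta = 0.1 + 0.8 * prev_vote + rng.normal(0, 0.02, J)   # true state quantities
n = rng.integers(5, 200, J)                              # very unequal sample sizes
y_bar = theta + rng.normal(0, 0.25 / np.sqrt(n))         # state sample means
se2 = 0.25**2 / n                                        # sampling variances
tau2 = 0.03**2                                           # assumed between-state variance

# Group-level regression of the state means on the state-level predictor.
X = np.column_stack([np.ones(J), prev_vote])
coef, *_ = np.linalg.lstsq(X, y_bar, rcond=None)
pred = X @ coef

# Precision-weighted partial pooling toward two different targets.
w = tau2 / (tau2 + se2)
toward_regression = w * y_bar + (1 - w) * pred
toward_grand_mean = w * y_bar + (1 - w) * y_bar.mean()

def rmse(estimate):
    return np.sqrt(np.mean((estimate - theta) ** 2))

print("pooled toward grand mean:             ", rmse(toward_grand_mean))
print("pooled toward group-level regression: ", rmse(toward_regression))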

Difference-in-differences: What’s the difference?

After giving my talk last month, Better Than Difference in Differences, I had some thoughts about how diff-in-diff works—how the method operates in relation to its assumptions—and it struck me that there are two relevant ways to think about it.

From a methods standpoint the relevance here is that I will usually want to replace differencing with regression. Instead of taking (yT – yC) – (xT – xC), where T = Treatment and C = Control, I’d rather look at (yT – yC) – b*(xT – xC), where b is a coefficient estimated from the data, likely to be somewhere between 0 and 1. Difference-in-differences is the special case b=1, and in general you should be able to do better by estimating b. We discuss this with the Electric Company example in chapter 19 of Regression and Other Stories and with a medical trial in our paper in the American Heart Journal.
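
Here is a small simulation sketch of that claim (not the analysis from the paper; the pre-post correlation rho, the sample sizes, and the effect size are all made up). The ANCOVA-style regression that estimates b has lower root mean squared error than the b = 1 difference in differences:

import numpy as np

rng = np.random.default_rng(2)
n_sims, n_per_group = 2000, 50
effect = 0.3
rho = 0.5                       # pre-post correlation (assumed)

dd_err, reg_err = [], []
for _ in range(n_sims):
    # Pre-test x and post-test y for control (C) and treated (T) units.
    xC = rng.normal(0, 1, n_per_group)
    yC = rho * xC + rng.normal(0, np.sqrt(1 - rho**2), n_per_group)
    xT = rng.normal(0, 1, n_per_group)
    yT = effect + rho * xT + rng.normal(0, np.sqrt(1 - rho**2), n_per_group)

    # Classic difference in differences: b fixed at 1.
    dd = (yT.mean() - yC.mean()) - (xT.mean() - xC.mean())

    # Regression adjustment: estimate b from the pooled pre-post relationship.
    x = np.concatenate([xC, xT])
    y = np.concatenate([yC, yT])
    z = np.concatenate([np.zeros(n_per_group), np.ones(n_per_group)])  # treatment indicator
    X = np.column_stack([np.ones_like(x), x, z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    dd_err.append(dd - effect)
    reg_err.append(coef[2] - effect)

print("rmse, diff-in-diff (b = 1):    ", np.sqrt(np.mean(np.square(dd_err))))
print("rmse, regression (estimated b):", np.sqrt(np.mean(np.square(reg_err))))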

Given this, what’s the appeal of diff-in-diff? I think the appeal of the method comes from the following mathematical sequence:

Control units:
(a) Data at time 0 = Baseline + Error_a
(b) Data at time 1 = Baseline + Trend + Error_b

Treated units:
(c) Data at time 0 = Baseline + Error_c
(d) Data at time 1 = Baseline + Trend + Effect + Error_d

Now take a diff in diff:

((d) – (c)) – ((b) – (a)) = Effect + Error,

where that last Error is a difference in difference of errors, which is just fine under the reasonable-enough assumption that the four error terms are independent.

The above argument looks pretty compelling and can easily be elaborated to include nonlinear trends, multiple time points, interactions, and so forth. That’s the direction of the usual diff-in-diff discussions.

The message of my above-linked talk and our paper, though, was different. Our point was that, whatever differencing you take, it’s typically better to difference only some of the way. Or, to make the point more generally, it’s better to model the baseline and the trend as well as the effect.

Seductive equations

The above equations are seductive: with just some simple subtraction, you can cancel out Baseline and Trend, leaving just Effect and error. And the math is correct (conditional on the assumptions, which can be reasonable). The problem is that the resulting estimate can be super noisy; indeed, it’s basically never the right thing to do from a probabilistic (Bayesian) standpoint.

In our example it was pretty easy in retrospect to do the fully Bayesian analysis. It helped that we had 38 replications of similar experiments, so we could straightforwardly estimate all the hyperparameters in the model. If you only have one experiment, your inferences will depend on priors that can’t directly be estimated from local data. Still, I think the Bayesian approach is the way to go, in the sense of yielding effect-size estimates that are more reasonable and closer to the truth.

Next step is to work this out on some classic diff-in-diff examples.

Better Than Difference in Differences (my talk for the Online Causal Inference Seminar Tues 19 Sept)

Here’s the announcement, and here’s the video:

Better Than Difference in Differences

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

It is not always clear how to adjust for control data in causal inference, balancing the goals of reducing bias and variance. We show how, in a setting with repeated experiments, Bayesian hierarchical modeling yields an adaptive procedure that uses the data to determine how much adjustment to perform. The result is a novel analysis with increased statistical efficiency compared with the default analysis based on difference estimates. The increased efficiency can have real-world consequences in terms of the conclusions that can be drawn from the experiments. An open question is how to apply these ideas in the context of a single experiment or observational study, in which case the optimal adjustment cannot be estimated from the data; still, the principle holds that difference-in-differences can be extremely wasteful of data.

The talk follows up on Andrew Gelman and Matthijs Vákár (2021), Slamming the sham: A Bayesian model for adaptive adjustment with noisy control data, Statistics in Medicine 40, 3403-3424, http://www.stat.columbia.edu/~gelman/research/published/chickens.pdf

Here’s the talk I gave in this seminar a few years ago:

100 Stories of Causal Inference

In social science we learn from stories. The best stories are anomalous and immutable (see http://www.stat.columbia.edu/~gelman/research/published/storytelling.pdf). We shall briefly discuss the theory of stories, the paradoxical nature of how we learn from them, and how this relates to forward and reverse causal inference. Then we will go through some stories of applied causal inference and see what lessons we can draw from them. We hope this talk will be useful as a model for how you can better learn from your own experiences as participants and consumers of causal inference.

No overlap, I think.

Using forecasts to estimate individual variances

Someone who would like to remain anonymous writes:

I’m a student during the school year, but am working in industry this summer. I am currently attempting to overhaul my company’s model of retail demand. We advise suppliers to national retailers, our customers are suppliers. Right now, for each of our customers, our demand model outputs a point estimate of how much of their product will be consumed at one of roughly a hundred locations. This allows our customers to decide how much to send to each location.

However, because we are issuing point estimates of mean demand, we are *not* modeling risk directly, and I want to change that, as understanding risk is critical to making good decisions about inventory management – the entire point of excess inventory is to provide a buffer against surprises.

Additionally, the model currently operates on a per-day basis, so that predictions for a month from now are obtained by chaining together thirty predictions about what day N+1 will look like. I want to change that too, because it seems to be causing a lot of problems with errors in the model propagating across time, to the point that predictions over even moderate time intervals are not reliable.

I already know how to do both of these in an abstract way.

I’m willing to bite the bullet of assuming that the underlying distribution of the PDF should be multivariate Gaussian. From there, arriving at the parameters of that PDF just requires max likelihood estimation. For the other change, without going into a lot of tedious detail, Neural ODE models are flexible with respect to time such that you can use the same model to predict the net demand accumulated over t=10 days as you would to predict the net demand accumulated over t=90 days, just by changing the time parameter that you query the model with.

The problem is, although I know how to build a model that will do this, I want the estimated variance for each customer’s product to be individualized. Yet frustratingly, in a one-shot scenario, the maximum likelihood estimator of variance is zero. The only datapoint I’ll have to use to train the model to estimate the mean aggregate demand for, say, cowboy hats in Seattle at time t=T (hereafter (c,S,T)) is the actual demand for that instance, so the difference between the mean outcome and the actual outcome will be zero.

It’s clear to me that if I want to arrive at a good target for variance or covariance in order to conduct risk assessment, I need to do some kind of aggregation over the outcomes, but most of the obvious options don’t seem appealing.

– If I obtain an estimate of variance by thinking about the difference between (c,S,T) and (c,Country,T), aggregating over space, I’m assuming that each location shares the same mean demand, which I know is false.

– If I obtain one by thinking about the difference between (c,S,T) and (c,S,tbar), aggregating over time, I am assuming there’s a stationary covariance matrix for how demand accumulates at that location over time, which I know is false. This will fail especially badly if issuing predictions across major seasonal events, such as holidays or large temperature changes.

– If I aggregate across customers by thinking about the difference between (c,S,T) and (cbar,S,T), I’ll be assuming that the demand for cowboy hats at S,T should obey similar patterns as the demand for other products, such as ice cream or underwear sales, which seems obviously false.

I have thought of an alternative to these, but I don’t know if it’s even remotely sensible, because I’ve never seen anything like it done before. I would love your thoughts and criticisms on the possible approach. Alternatively, if I need to bite the bullet and go with one of the above aggregation strategies instead, it would benefit me a lot to have someone authoritative tell me so, so that I stop messing around with bad ideas.

My thought was that instead of asking the model to use the single input vector associated with t=0 to predict a single output vector at t=T, I could instead ask the model to make one prediction per input vector for many different input vectors from the neighborhood of time around t=0 in order to predict outcomes at a neighborhood of time around t=T. For example, I’d want one prediction for t=-5 to t=T, another prediction for t=-3 to t=T+4, and so on.

I would then judge the “true” target variance for the model relative to the difference between (c,S,T)’s predicted demand and the average of the model’s predicted demands for those nearby time slices. The hope is that this would reasonably correspond to the risks that customers should consider when optimizing their inventory management, by describing the sensitivity of the model to small changes in the input features and target dates it’s queried on. The model’s estimate of its own uncertainty wouldn’t do a good job of representing out-of-model error, of course, but the hope is that it’d at least give customers *something*.

Does this make any sense at all as a possible approach, or am I fooling myself?

My reply: I haven’t followed all the details, but my guess is that your general approach is sound. It should be possible to just fit a big Bayesian model in Stan, but maybe that would be too slow, I don’t really know how big the problem is. The sort of approach described above, where different models are fit and compared, can be thought of as a kind of computational approximation to a more structured hierarchical model, in the same way that cross-validation can be thought of as an approximation to an error model, or smoothing can be thought of as an approximation to a time-series model.

Improving Survey Inference in Two-phase Designs Using Bayesian Machine Learning

Xinru Wang, Lauren Kennedy, and Qixuan Chen write:

The two-phase sampling design is a cost-effective sampling strategy that has been widely used in public health research. The conventional approach in this design is to create subsample specific weights that adjust for probability of selection and response in the second phase. However, these weights can be highly variable which in turn results in unstable weighted analyses. Alternatively, we can use the rich data collected in the first phase of the study to improve the survey inference of the second phase sample. In this paper, we use a Bayesian tree-based multiple imputation (MI) approach for estimating population means using a two-phase survey design. We demonstrate how to incorporate complex survey design features, such as strata, clusters, and weights, into the imputation procedure. We use a simulation study to evaluate the performance of the tree-based MI approach in comparison to the alternative weighted analyses using the subsample weights. We find the tree-based MI method outperforms weighting methods with smaller bias, reduced root mean squared error, and narrower 95% confidence intervals that have closer to the nominal level coverage rate. We illustrate the application of the proposed method by estimating the prevalence of diabetes among the United States non-institutionalized adult population using the fasting blood glucose data collected only on a subsample of participants in the 2017-2018 National Health and Nutrition Examination Survey.

Yes, weights can be variable! Poststratification is better, but we don’t always have the relevant information. Imputation is a way to bridge the gap. Imputations themselves are model-dependent and need to be checked. Still, the alternatives of ignoring design calculations or relying on weights have problems of their own, so I think modeling is the way to go. Further challenges will arise, such as imputing cluster membership in the population.
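To make the imputation idea a bit more concrete, here is a rough sketch under invented variable names. It is not the paper’s Bayesian tree-based MI procedure: a random forest plus bootstrap resampling stands in for the Bayesian tree model, and the summary at the end only reports between-imputation spread rather than full Rubin’s-rules variance.

```python
# Rough sketch of two-phase imputation: use rich phase-1 covariates to impute
# the phase-2 outcome for the whole phase-1 sample, then estimate the
# population mean with the phase-1 weights. All column names are made up.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_and_estimate(phase1: pd.DataFrame, outcome: str, covariates: list,
                        weight_col: str, n_imputations: int = 20, seed: int = 0):
    """phase1 has covariates and weights for everyone; `outcome` is observed
    only for the phase-2 subsample (NaN for everyone else)."""
    rng = np.random.default_rng(seed)
    observed = phase1[outcome].notna().to_numpy()
    X_obs = phase1.loc[observed, covariates].to_numpy()
    y_obs = phase1.loc[observed, outcome].to_numpy()
    X_all = phase1[covariates].to_numpy()
    w = phase1[weight_col].to_numpy()

    means = []
    for m in range(n_imputations):
        # Bootstrap the phase-2 cases to crudely mimic draws of the tree model.
        idx = rng.integers(0, len(y_obs), len(y_obs))
        model = RandomForestRegressor(n_estimators=200, random_state=m)
        model.fit(X_obs[idx], y_obs[idx])
        y_imp = phase1[outcome].to_numpy().copy()
        y_imp[~observed] = model.predict(X_all[~observed])
        # Weighted mean over the full phase-1 sample.
        means.append(np.average(y_imp, weights=w))

    means = np.asarray(means)
    return {"estimate": means.mean(),
            "between_imputation_sd": means.std(ddof=1)}
```

A proper analysis would also carry the within-imputation variance and the design features (strata, clusters) that the paper emphasizes; this sketch only illustrates the basic impute-then-weight recipe.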

Unifying Design-Based and Model-Based Sampling Inference (my talk this Wednesday morning at the Joint Statistical Meetings in Toronto)

Wed 9 Aug 10:30am:

Unifying Design-Based and Model-Based Sampling Inference

A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

No slides, but I whipped up a paper on the topic which you can read if you want to get a sense of the idea.
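For a very stripped-down picture of the general recipe in the abstract—model the outcome given the sampling weight, then poststratify on weight cells—here is a toy sketch with made-up data. It is nothing like the full quasi-Bayesian joint-regression estimator proposed in the talk and paper; the cell means here stand in for the regression step, and each cell’s population share is estimated from the weights themselves.

```python
# Toy illustration: "regress" the outcome on the (binned) sampling weight via
# cell means, estimate each weight bin's population share from the weights
# (a unit with weight w stands for w population members), and poststratify.
import numpy as np
import pandas as pd

def weight_smoothed_mean(y, w, n_bins: int = 10):
    """y: respondent outcomes; w: their sampling weights."""
    df = pd.DataFrame({"y": y, "w": w})
    # Poststratification cells defined by the binned weight.
    df["cell"] = pd.qcut(df["w"], q=n_bins, duplicates="drop")

    cells = df.groupby("cell", observed=True).agg(
        y_bar=("y", "mean"),       # outcome model: cell means of y given w
        pop_share=("w", "sum"),    # estimated population size of each cell
    )
    cells["pop_share"] /= cells["pop_share"].sum()

    # Poststratified estimate: cell means weighted by population shares.
    return float((cells["y_bar"] * cells["pop_share"]).sum())

# Fake data where higher-weight respondents tend to have higher outcomes.
rng = np.random.default_rng(1)
w = rng.lognormal(mean=0.0, sigma=1.0, size=2000)
y = 1.0 + 0.5 * np.log(w) + rng.normal(size=2000)
print(weight_smoothed_mean(y, w))
```

In the actual proposal the regression and the population distribution of the weights are modeled jointly, which is what allows partial pooling and small-area inference rather than this simple cell-mean calculation.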

Statistical mistake in published paper is fixed. Correction is issued. Original paper still sitting online with the wrong findings. Press release still sitting online with the wrong findings. Elsevier!

David Allison points to this post:

Often indirectly, but sometimes directly, we hear from true believers in concepts attached to obesity, nutrition, and public policy. The embedded question is “Why do you doubt this article of faith?” Among the many articles of faith in this realm is the belief that if we deliver just the right education or just the right nudges to people (or children) at risk for obesity, we will finally “tackle” the problem. It’s understandable. In fact, we see no problem with bringing strong beliefs to the subject of obesity, so long as we bring along even stronger analyses to test them.

An excellent case in point appears in the February issue of the Journal of Nutrition Education and Behavior. The initial analysis of a study of cooking programs found support for its effectiveness with children. But a stronger analysis showed that the effect was nil.

The premise of this study was straightforward. Test the presumption that exposing children to healthy foods in a cooking program could nudge them toward more often choosing healthy foods afterward. Frans Folkvord, Doeschka Anschütz, and Marieke Geurts set up this experiment with 125 children between the ages of 10 and 12. They randomized the children for exposure to ten-minute video clips from a cooking program. The experimental group saw a clip that emphasized only healthy foods, but the control group viewed a clip that emphasized unhealthy foods. After those clips, children in both groups chose between healthy and unhealthy snacks as a reward for participating.

Initially, Folkvord et al. reported positive results:

“These findings indicated a priming effect of the foods the children were exposed to, showing that nutrition education guided by reactivity theory can be promising.”

Unfortunately this initial analysis didn’t hold up under a closer look. . . . the initial finding was reversed and a corrigendum has been published . . .

Happy ending!

Just one thing. I clicked on the original article (there’s a link at the end of the above-linked post), and it still has the unsupported conclusions:

Results

Children who watched the cooking program with healthy foods had a higher probability of selecting healthy food than children who watched the cooking program with unhealthy foods (P = .027), or with the control condition (P = .039).

Conclusions and Implications

These findings indicated a priming effect of the foods the children were exposed to, showing that nutrition education guided by reactivity theory can be promising. Cooking programs may affect the food choices of children and could be an effective method in combination with other methods to improve their dietary intake.

Ummmmm, no! From the correction:

Results

Children who watched the cooking program with healthy foods did not have a higher probability to select the healthy food than children who watched the cooking program with unhealthy foods, or who were in the control condition.

Conclusions and Implications

These findings do not directly indicate a priming effect of the foods children were exposed to. The current study did not show that cooking programs affect food choices of children, although other studies showed that cooking programs could be an effective method in combination with other methods to improve children’s dietary intake.

The authors regret these errors.

The Journal of Nutrition Education and Behavior should put this correction on the main page of the original article.

As it is, if you go to the original article, you’ll see the bad abstract and then you have to scroll all the way down past all the references to get to a section called Linked Article where there’s a link to the Corrigendum. Or you can click on Linked Article on the main page. Either way, it’s not obvious, and there’s no good reason to keep the original abstract as the first thing that anyone sees.

This is not the fault of the authors, who were admirably open in the whole process. Everyone makes mistakes. It’s the publisher that should fix this. Also this completely uncorrected press release is still on the internet. Not a good look, Elsevier.

Cross-validation FAQ

Here it is! It’s from Aki.

Aki linked to it last year in a post, “Moving cross-validation from a research idea to a routine step in Bayesian data analysis.” But I thought the FAQ deserved its own post. May it get a million views.

Here’s its current table of contents:

1 What is cross-validation?
1.1 Using cross-validation for a single model
1.2 Using cross-validation for many models
1.3 When not to use cross-validation?
2 Tutorial material on cross-validation
3 What are the parts of cross-validation?
4 How is cross-validation related to overfitting?
5 How to use cross-validation for model selection?
6 How to use cross-validation for model averaging?
7 When is cross-validation valid?
8 Can cross-validation be used for hierarchical / multilevel models?
9 Can cross-validation be used for time series?
10 Can cross-validation be used for spatial data?
11 Can other utility or loss functions be used than log predictive density?
12 What is the interpretation of ELPD / elpd_loo / elpd_diff?
13 Can cross-validation be used to compare different observation models / response distributions / likelihoods?

P.S. Also relevant is this discussion from the year before, “Rob Tibshirani, Yuling Yao, and Aki Vehtari on cross validation.”
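P.P.S. For readers who want a concrete picture of the basic mechanics before diving into the FAQ, here is a toy sketch of K-fold cross-validation used for model comparison via out-of-fold log predictive density. The data, the polynomial models, and the plug-in Gaussian predictive distribution are all invented for illustration; this is not code from the FAQ or from the loo package.

```python
# Toy K-fold cross-validation: compare models by mean out-of-fold log
# predictive density, using a plug-in Gaussian predictive distribution.
import numpy as np

def kfold_log_score(x, y, degree, k=10, seed=0):
    """Mean held-out log predictive density for a degree-`degree` polynomial fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    lpd = []
    for test in folds:
        train = np.setdiff1d(idx, test)
        coefs = np.polyfit(x[train], y[train], degree)
        resid = y[train] - np.polyval(coefs, x[train])
        sigma = resid.std(ddof=degree + 1)
        mu = np.polyval(coefs, x[test])
        # Gaussian log density of the held-out points under the fitted model.
        lpd.append(-0.5 * np.log(2 * np.pi * sigma**2)
                   - 0.5 * (y[test] - mu) ** 2 / sigma**2)
    return np.concatenate(lpd).mean()

# Fake data: the truth is quadratic, so the quadratic model should score higher.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 200)
y = 1 + x - 0.7 * x**2 + rng.normal(scale=0.5, size=200)
print("linear   :", kfold_log_score(x, y, degree=1))
print("quadratic:", kfold_log_score(x, y, degree=2))
```

The FAQ covers the many ways this simple picture needs to be modified in practice: hierarchical models, time series, spatial data, and utilities other than the log score.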