Last year we saw that individual-level Loss(y_i, yhat_i) may not be great for choosing models for MRP (“individualism doesn’t work”). Last week we saw that weighted-to-the-population individual-level loss also isn’t great (“individualism doesn’t work (even when weighted)”).
Andrew commented that Wang & Gelman 2014 discuss this problem. We cited this paper in our discussion of MRP vs classical poststratification. Let’s walk through their back-of-envelope calculation:

Consider one cell (e.g. Missouri, income level $75,000–$150,000). Say its true proportion of Democrats is 40%. Consider 3 models that estimate this as 41%, 44%, 38%. The expected predictive losses (natural logs) are -[0.4 log(0.41) + 0.6 log(0.59)] = 0.6732, -[0.4 log(0.44) + 0.6 log(0.56)] = 0.6763, and -[0.4 log(0.38) + 0.6 log(0.62)] = 0.6739. The difference between the best (41%) and worst (44%) is 0.0031.
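To check the arithmetic, here is a minimal R sketch of the expected log loss for the three estimates. The function name `expected_log_loss` and the variable names are my own, not from Wang & Gelman 2014.

```r
# Expected log loss when the true proportion is p_true and a model predicts p_hat
expected_log_loss <- function(p_true, p_hat) {
  -(p_true * log(p_hat) + (1 - p_true) * log(1 - p_hat))
}

p_true <- 0.40                    # true proportion of Democrats in the cell
p_hat  <- c(0.41, 0.44, 0.38)     # the three model estimates

losses <- expected_log_loss(p_true, p_hat)
names(losses) <- p_hat

round(losses, 4)                  # expected loss for each estimate
diff(range(losses))               # gap between the best and worst estimate
```

Because the loss is minimized at the true proportion and is nearly flat around it, estimates several percentage points apart differ only in the third decimal place.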
Wang & Gelman 2014 claim these differences “would hardly be noticed in a cross-validation calculation unless the number of observations in the cell were huge”. But in politics, there’s a meaningful difference between winning 38% versus 44% of a group.
Is CV accurate enough to distinguish among these estimates? Bates et al. 2024 study uncertainty intervals for CV estimates. Let:
- yhat_i = estimated probability i is a Democrat, estimated using the CV folds excluding i
- e_i = Loss(y_i, yhat_i) = -[y_i log yhat_i + (1-y_i) log (1-yhat_i)]
- n = sample size for the cell we are considering
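These definitions can be sketched in R for a single simulated cell. Everything here is an assumption for illustration: the cell size, the number of folds, the seed, and the "model", which is just the proportion of Democrats among respondents outside the held-out fold (an intercept-only stand-in, not MRP).

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
n <- 500                               # hypothetical cell sample size
K <- 10                                # hypothetical number of folds
y <- rbinom(n, size = 1, prob = 0.4)   # simulated responses, true proportion 40%

# Randomly assign each respondent to one of K folds of (nearly) equal size
fold <- sample(rep(seq_len(K), length.out = n))

# yhat_i: proportion of Democrats among respondents outside i's fold
yhat <- numeric(n)
for (k in seq_len(K)) {
  held_out <- fold == k
  yhat[held_out] <- mean(y[!held_out])
}

# e_i: log loss of the held-out prediction for respondent i
e <- -(y * log(yhat) + (1 - y) * log(1 - yhat))

mean(e)                                # CV estimate of the log loss for this cell
```

Note that every yhat_i is computed from data that overlaps heavily with the data used for the other folds, which is one way to see why the e_i are not independent.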
Bates et al. 2024 call sqrt(Var(e_i)/n) the naive CV standard error. (Naive because it does not account for the correlation between the errors.) Extending Wang & Gelman 2014's back-of-envelope calculation, and ignoring variability across folds (so yhat_i is fixed at 0.41 for everyone in the cell):
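    a <- -log(0.41)                                  # loss when y_i = 1
    b <- -log(0.59)                                  # loss when y_i = 0
    mu <- 0.4 * a + 0.6 * b                          # expected loss, 0.6732
    var_e <- 0.4 * (a - mu)^2 + 0.6 * (b - mu)^2     # Var(e_i)
    SE <- 0.0031 / 2                                 # target SE: half the best-vs-worst gap
    n <- var_e / SE^2                                # solve SE = sqrt(var_e / n) for n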
We would need roughly n = 13,000 in this cell for the naive standard error to shrink to half the gap between the best and worst models, i.e. to distinguish between them.
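The same calculation can be wrapped in a function to see how the required cell size changes with the inputs. This is only a sketch of the back-of-envelope logic above: `log_loss_var` and `required_n` are hypothetical names, and the rule "naive SE equal to half the gap" is carried over from the snippet above rather than taken from Bates et al. 2024.

```r
# Variance of the log loss e_i when y_i ~ Bernoulli(p_true) and the model predicts p_hat
log_loss_var <- function(p_true, p_hat) {
  loss_if_1 <- -log(p_hat)         # loss when y_i = 1
  loss_if_0 <- -log(1 - p_hat)     # loss when y_i = 0
  mu <- p_true * loss_if_1 + (1 - p_true) * loss_if_0
  p_true * (loss_if_1 - mu)^2 + (1 - p_true) * (loss_if_0 - mu)^2
}

# Smallest n for which the naive SE, sqrt(Var(e_i) / n), is at most gap / 2
required_n <- function(p_true, p_hat, gap) {
  ceiling(log_loss_var(p_true, p_hat) / (gap / 2)^2)
}

# The cell from the example: true 40%, estimate 41%, best-vs-worst gap 0.0031
n_needed <- required_n(p_true = 0.4, p_hat = 0.41, gap = 0.0031)
n_needed

# Naive SE at a hypothetical cell size of 500
se_500 <- sqrt(log_loss_var(0.4, 0.41) / 500)
se_500
```

At a hypothetical cell size of 500, the naive SE is about 0.008, more than twice the 0.0031 gap it would need to resolve.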
As Wang & Gelman 2014 write:
The problem is that improved fits with binary data yield minuscule improvements in log loss, in moderate sample sizes nearly indistinguishable from noise even if the improved estimates are substantively important when aggregated (for example, state-level public opinion).
I have read your post several times since you posted it yesterday (yesterday in my time zone, at least). Something that became clear to me (again) was the connection between cross-validation and resampling techniques. I realised this when you wrote about correlated errors. Cross-validation is a kind of bootstrap technique: we sample (k-1)/k*n from the data without ‘returning the observations to the urn’ and assign an invisible label (1 to k) to each observation. For valid inference, we therefore need to assume that the observations are independent of each other. If they are not, we must use a different technique. I last considered this when forecasting GDP for a seminar paper in an econometrics seminar – the forecast was a multivariate time series, so the observations were strongly correlated with each other. The same applies to spatial models and networks. I suspect that, for your election forecasting, you have all of the above😅 Anyway, thank you for your thoughts!
Thank you, Raphael! Your close read and insights make this so fun.
1. “connection between cross-validation and resampling techniques”. Indeed, Bates et al. 2024 write “CV is part of a broader landscape of resampling techniques to estimate prediction error, with bootstrap-based techniques as the most common alternative.”
2. I think that the errors e_i are correlated even when the data observations are independent. The e_i correlations result from the CV structure, not the data structure. I like the explanation on p.1096 (below Corollary 3) of Bengio and Grandvalet 2004.
3. When we also have dependent data structure (e.g. time series, spatial, networks), CV becomes additionally challenging, as you say! Wang & Gelman 2014 say that one challenge is figuring out the random splitting into train and holdout sets, as Aki explains: https://users.aalto.fi/~ave/CV-FAQ.html#the-way-how-the-data-is-divided-in-cross-validation
3b. Wang & Gelman 2014 also say “in multilevel models, the observed loss function for data-level cross-validation can be so close to flat that the cross-validation estimates of prediction errors under candidate models can be swamped by random fluctuations.” But they also attribute this noise swamping to having binary outcome y (the last quote in the post).