This post has 3 questions for you below.
Multilevel Regression and Poststratification (MRP) aims to address nonresponse. Suppose we want to estimate E[Y], the population mean. But we only have Y for respondents. For example, suppose Y is voting Republican. And what if respondents are more or less Republican than the population ? If we have population data on X, e.g. a bunch of demographic variables, then we can estimate E[Y|X] and aggregate: E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of X, at least we fix that with some calibration.
With nonresponse worsening, we want to adjust for a lot of covariates X (including their interactions !). Estimates from such big models will be unstable without a lot of data and/or regularization.

As Andrew writes about MRP and RPP:
5. The multilevel part of MRP comes because you want to adjust for lots of cells j in your poststrat
…
But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any regularized prediction method that gives reasonable and stable estimates while including a potentially large number of predictors.
Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.
As Andrew writes about the lasso:
Lasso (“least absolute shrinkage and selection operator”) is a regularization procedure that shrinks regression coefficients toward zero…Somehow it took the Stanford school of open-minded non-Bayesians to regularize in the way that Bayesians always could—but didn’t…I was fitting uncontrolled regressions with lots of predictors and not knowing what to do…
“Over the past few months I [Tibshirani] have been consumed with a new piece of work [with Richard Lockhart, Jonathan Taylor, and Ryan Tibshirani] that’s finally done. Here’s the paper, the slides (easier to read), and the R package.”
QUESTION 1: These links are broken. [They’re no longer broken! I found the correct links and went in and updated them. — AG]
As Andrew writes about the “bet on sparsity principle“:
sparse models can be faster to compute, easier to understand, and yield more stable inferences.
Though this “easier to understand” may be kind of fake, as Andrew says:
lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model.
And:
In most settings I actually find it difficult to directly interpret more than one coefficient in a regression model.
Same.
QUESTION 2: Ok, so forgetting about the interpretability stuff, let’s say I am doing a giant Regularized Prediction and Poststratification (RPP). Do you know of papers doing RPP with sparsifying priors ?
Implementation time.
Starting with the lasso: Tibshirani’s selectiveInference R package gives a logistic example here (our Y is binary too). I extended their example using brms and rstanarm with Laplace priors with scale 1/lambda, see ESL p.72:

mod_rstanarm <- stan_glm( formula = formula_all, data = data, family = binomial(), prior = laplace(0, 1.25) )
Here are my results comparing the point estimates:

QUESTION 3: why does Bayesian inference with Laplace priors (and the same likelihood and data) give posterior means more similar to selectiveInference than lasso ? is the above expected behavior or do you suspect my code has a bug ?
(I asked about this here, in Andrew’s post about a post-selection inference question from Richard Artner.)
Moving away from lasso to other sparsifying priors, Piironen and Vehtari (2017) conclude:
the (regularized) horseshoe consistently outperforms Lasso in terms of predictive accuracy…A clear advantage for Lasso, on the other hand, is that it is hugely faster
brms even stopped supporting lasso in 2023.
In the end while these modifications of the horseshoe population model can help they typically don’t achieve any better performance in practice than a Cauchy population model or the binary mixture normal population model, both of which are much easier to configure and accurately fit out of the box. Consequently it’s hard to argue for the practical utility of the horseshoe model.
So back to QUESTION 2: which sparsifying priors should we use for RPP ?
Answer to Question 1 in regards to the R package. No this is not the selectiveInference R package. The name of this R package is covTest (as shown in the URL) and it can now be found on archived on CRAN at https://cran.r-project.org/src/contrib/Archive/covTest/
Thank you, Rodney !
I see Andrew also has helpfully already incorporated your correction (thank you, Andrew !).
As Andrew wrote in his post, I also don’t tend to use significance tests in my own work. But I have worked with applied researchers interested in using the lasso for inference, so the selectiveInference R package is more relevant, so I’d like to understand it better.
Do you use either the covTest or the selectiveInference R packages, Rodney ?
Thanks again !
iirc the post selection inference literature is about getting “better” p values, or whatever you’re using for hypothesis testing / intervals, after you performed some type of variable selection (say, the lasso)
Thanks, mt !
This is my understanding as well.
In my post I use the example from selectiveInference::fixedLassoInf(), whose documentation says:
I see that the post-selection-inference coefficients are in between the shrunken lasso coefficients and the completely unshrunken classical logistic glm. Why ?
my guess is that its the default values for tuning params or initial values between the implementations regularized versions.
Id expect the selective inf results to be close to all your lasso variants and drastically different then the classical unregulized version (which looks to be the case)
Seems like you’re aware already but most of the post sel inf stuff is for ‘error bars’ of Xs relation to Y.
While a lot of the survey applications of MRP is focused on the Y.
Thanks again, mt !
Yes, but they say in documentation for selectiveInference::fixedLassoInf():
And you say:
Yes, though presumably if you wanted to estimate uncertainty for the MRP esimate, you’d want some error bars ? A fully Bayesian approach makes this easiest, but if someone is using lasso then maybe selectiveInference is useful ? See Ben Goodrich’s comment below.
You can use archive.org to find the archived copies or just email the authors
fyi …
https://web.archive.org/web/20141009031737/https://statweb.stanford.edu/~tibs/ftp/covtest.pdf
Thanks, Gaurav !
That’s a good idea for next time. This time Andrew was kind enough to hunt things down for us.
Re. “lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model.”
you can just use stability selection, right?
Do you have a reference for “stability selection” ? Thanks again, Gaurav !
classic: https://arxiv.org/abs/0809.2932
Thanks, Gaurav ! I’m curious what Andrew thinks about this.
Shira:
Just a minor thing. The expectation E is just an operator, and it’s just fine to write E(y), no reason to use brackets as in E[y]. I mean, brackets are fine, in the same way that, mathematically, you can write 2*(3+4)=14 or 2*[3+4]=14, but they’re not necessary. I say this only because in your post you write expectations always with brackets, never with parentheses, which suggests to me you’re enforcing a rule that isn’t there.
Good point, Andrew !
Just seeing this post now: https://statmodeling.stat.columbia.edu/2020/02/05/abuse-of-expectation-notation
(does the latex display properly for you ?)
I wasn’t aiming to enforce any rule besides consistency (all brackets or all parentheses). I forget where I saw expectations written with brackets and got used to that aesthetically.
The use of [.] vs (.) always confused me, as people seem to mix and match. Is there some kind of unwritten rule that when you have a functional (a function that takes a function as an argument), you write E[f] or Pr[f], but when a function takes a concrete object as an argement like a scalar or vector you write f(x)?
Eric:
I think the true rule is that (), [], and {} are for most purposes equivalent in mathematical notation, except for special cases such as when [] is used for the integer part of a real number or when {} is used for a set.
> you write expectations always with brackets, never with parentheses
The alternative of writing expectations sometimes with brackets and sometimes with parenthesis seems much worse. Consistency is good even though “enforcing a rule that isn’t there” sounds bad.
Square brackets are often used when the argument of the function(al) is not a value / element but a function / random variable / set. Interestingly one of the first results in a quick search about this subject is another [authors’s] entry in this blog: https://statmodeling.stat.columbia.edu/2020/02/05/abuse-of-expectation-notation/
Carlos:
I always use parentheses for everything, except that sometimes with messy expressions it can be more readable to use some brackets to avoid ))) etc.
The usage in BDA is varied.
In page 6 we see sd(θ)/E(θ) and also exp(E[log(θ)]) and sd(E[log(θ)]). You are indeed avoiding ))) in those cases.
In page 60 we have E[Z^m(1-Z)^n] but there was not even a )) there. There are expressions way more messy, often including expectations, without brackets.
In page 357 there is an equation containing E(E(.|.)|.) and another equation containing otherwise similar expressions E[var(.|.)|.] + var[E(.|.)|.]. The latter is then transformed to E[.|.] + var[.|.] to have some local consistency, I guess.
There is at least one place (page 122) where square brackets are used for an actual function (not E or var): φ[(28-14.5)/9.1]=93%. I am not sure if it’s an oversight or a deliberate notation choice but it seems that φ([28-14.5]/9.1) would be more natural.
Carlos:
To be more precise, I use parentheses and brackets interchangeably but I usually prefer parentheses.
I love the attention to detail here, thank you Carlos and Andrew !
The convention derives from physics, where square brackets indicate functionals (i.e., functions of functions). See, for example, the APS’s guidelines:
https://journals.aps.org/authors/bracketing-mathematical-expressions-h9
Because a random variable is technically a function from sample space to expected value, functions of random variables like expectations are conventionally written with square brackets. Same for KL-divergence.
If you want to be careful talking about distinctions between random and bound variables, you run into trouble with BDA’s notation. I found it very very hard when I was just learning what a random variable is (you don’t get a precise definition in BDA from what I recall). The problem arises if we want to do something like define the cumulative distribution function of a scalar random variable Y, which is usually defined as the function
F_Y(y) = Pr[Y < y],
where y is a bound variable. Because the function F_Y is a function of a scalar bound variable y, it’s written with parens. Because Pr is a function of random variables, it’s written with square brackets. Because Y is random, it’s capitalized; because y is bound, it’s lowercase. Andrew doesn’t like the Y vs. y distinction, so it becomes very hard to define a cdf without confusion. I do get Andrew’s point about there not being enough letter types to indicate vectors with bold and matrices with caps and parameters with Greek and constants with roman and unknowns with italics—you start running out of options when trying to combine options. For example, matrices and RVs are both distinguished with capital letters, but what if I have a random matrix? I still think the distinction is super helpful for pedagogical purposes for beginners, though, which is why I tried to make these distinctions carefully and in the more formal appendix in my intro to Bayesian stats with Stan
Thanks, Bob ! This is really helpful.
Yup, I don’t think “random variable” is defined in BDA.
BDA p.23 defines a cumulative distribution function as:
F(v*) = Pr(v <= v*)
replacing the capital/lowercase with unstarred/starred.
I think the bigger issue is that you should not want a point estimate of the parameters of a multilevel model if you are going to use it for MRP because you should want a posterior predictive distribution in the (sub)population and not a point prediction. Perhaps it is plausible to ask what regularization scheme on a point estimator yields point predictions that match the me(di)an of a posterior predictive distribution so that you do not have to do MCMC to obtain it, but I don’t think anyone knows the answer to that question (I am not surprised by the table comparing the point estimates) and it would not allow you to propagate uncertainty through to whatever you are using the predictions for.
Sparsifying priors is kind of a misnomer because none of the draws from the prior or the posterior are exactly zero, so you still have to keep track of them regardless of how far away they are from zero. Dan has a good post on MRP, https://statmodeling.stat.columbia.edu/2019/08/22/multilevel-structured-regression-and-post-stratification/ , that links to his coauthored ArXiv paper on structured priors that are centered on special cases of the multilevel model that are less multilevel. Such priors would make it more likely that those model components get projected away if you use Aki et al.’s projpred package. It would be a worthwhile research question to see if projection yields better predictions in the population (it can only worsen the ELPD estimate from the sample but maybe by a negligible amount).
Thanks, Ben ! 3 questions for you:
1. As mt points out, selectiveInference does compute confidence intervals for the parameters. If someone were doing lasso and not MCMC, could they use these intervals to propagate uncertainty through to the predictions and whatever you use them for (e.g. population means) ?
I remember that excellent post (and paper) by Dan !
Their structured priors are random walks for variables like age.
2. What do you mean by these models being “less multilevel” ?
3. Why would the age parameters be more likely to be projected away by projpred in this case ?
1) The idea of using numerical optimization to get a point estimator, and then using the curvature at that point to get an interval estimator, and then using the interval estimator to get 95% of a posterior distribution centered at the midpoint (and often pushing that through a non-linear inverse link function and sometimes pushing the predictive distribution through a non-linear utility function) presumes that the posterior distribution is close to normal. That has the same issues of using numerical optimization to find the mode and interpreting the mode as the median and mean. In a multilevel regression context (presuming you don’t integrate a bunch of stuff out analytically), the posterior distribution is rarely close to multivariate normal and the posterior mode is nowhere near where all the MCMC draws are. The various penalization schemes for optimization that are nondifferentiable at zero shift the optimum and don’t care that happens to the objective function away from the optimum, whereas the principal effect of prior probability distributions is through how heavy the tails are.
2) The functions in the projpred package can take away a spline or a (1 + x | g) thing at the solution to its optimization problem. So, the result is less multilevel in the sense that there are fewer parameters in the posterior distribution that depend on other parameters.
3) The coefficients on age are unlikely to be projected away with a random walk prior on the coefficients. If you had something simpler (and probably worse) like y ~ age + (1 + age | g), then a normal or some kind of shrinkage prior on the deviations for the coefficient on age across levels of g might be more likely to get simplified to y ~ age by projpred.
Thanks again, Ben !
1) So if we want a sparsified prior for MRP, you’re advocating for Bayesian inference rather than lasso + selectiveInference + propagating uncertainty to the predictions, due to that propagation of uncertainty usually being based on poor approximations ?
2) Oh I meant why are the structured priors (e.g. random walks) “less multilevel” ? You said “structured priors that are centered on special cases of the multilevel model that are less multilevel.” Maybe I misread it !
3) Fom Gao et al. 2021 (the paper linked to in Dan’s post), compare their
age_j ~ N(0, sigma)independent normals to theirage_j | age_j-1 ~ N(age_j-1, sigma_RW)priors. Why would the independent normals be more likely to be projected away than the random walk prior ?Shira asks, “why does stan give posterior means more similar to selectiveInference than lasso?” Stan just does posterior inference w.r.t. a model and data. I would rephrase this to ask, “Why does Bayesian inference with model p(y, theta) and data y give you posterior means more similar to selectiveInference than lasso?” The usual answer is that { 0 } is a measure 0 set, so you never get an exact 0 answer. How close you get to zero will depend on posterior uncertainty.
Thanks again, Bob. I like your rephrase, so I edited the post.
Do you have examples of data/code you can share to try to answer your third question? I understand if the exact data cannot be shared but I’d appreciate having some code to play around with.
Yes, thank you Michael ! Here:
# SOURCE: https://search.r-project.org/CRAN/refmans/selectiveInference/html/fixedLassoInf.html# Required packages
library(selectiveInference)
library(glmnet)
library(brms)
library(rstanarm)
# Logistic model setup
set.seed(43)
n = 50
p = 10
sigma = 1
x = matrix(rnorm(n*p),n,p)
x = scale(x,TRUE,TRUE)
beta = c(3,2,rep(0,p-2))
y = x%*%beta + sigma*rnorm(n)
y = 1*(y>mean(y))
# glmnet lasso
gfit = glmnet(x,y,standardize=FALSE,family="binomial")
lambda = .8
beta_hat = coef(gfit, x=x, y=y, s=lambda/n, exact=TRUE)
print(beta_hat)
# Post-selection inference
out = fixedLassoInf(x,y,beta_hat,lambda,family="binomial")
print(out)
# Prepare data
data = data.frame(x,y)
colnames(data) <- c(paste0("V",1:p), "y")
# Extract selected variables from lasso
selected_vars <- rownames(beta_hat)[-1][as.vector(beta_hat[-1,]) != 0]
# GLM with selected variables
formula_selected <- as.formula(paste("y ~", paste(selected_vars, collapse = " + ")))
glm_model <- glm(formula_selected, data = data, family = "binomial")
summary(glm_model)
# Formula with all variables for brms
formula_all <- as.formula(paste("y ~", paste(paste0("V",1:p), collapse = " + ")))
1/lambda # scale of Laplace = double_exponential
# ESL p.72 double exponential (or Laplace) distribution for each input,
# with density (1/2τ ) exp(−|β|/τ ) and τ = 1/λ.
# this is the same parameterization as in stan and hence in brms priors:
# https://mc-stan.org/docs/functions-reference/unbounded_continuous_distributions.html#double-exponential-laplace-distribution
mod_brms <- brm(
formula = formula_all,
data = data,
family = bernoulli(link = "logit"),
prior = prior(double_exponential(0, 1.25), class = "b")
)
# brms stopped supporting lasso in 2023: https://github.com/paul-buerkner/brms/releases/tag/v2.20.3
mod_rstanarm <- stan_glm(
formula = formula_all,
data = data,
family = binomial(),
prior = laplace(0, 1.25)
)
# Extract point estimates
brms_coefs <- fixef(mod_brms)[,1] # intercept + all p vars
rstanarm_coefs <- coefficients(mod_rstanarm) # intercept + all p vars
lasso_coefs <- as.vector(beta_hat) # intercept + all p vars
glm_coefs <- coef(glm_model) # intercept + selected_vars
selectiveInference_coefs <- out$coef0 # selected_vars
all_vars <- c("(Intercept)", paste0("V", 1:p))
comparison <- data.frame(Variable = all_vars, lasso = lasso_coefs, brms = brms_coefs, rstanarm = rstanarm_coefs)
comparison[comparison$Variable %in% selected_vars,"selectiveInference"] = selectiveInference_coefs
comparison[comparison$Variable %in% selected_vars,"glm"] = glm_coefs[-1]
comparison[1,"glm"] = glm_coefs[1]
comparison
I am a data scientist so I enjoy these sorts of post, thought I admit to using ChatGPT for explaination.
Thanks, DBSI !
No shame in getting more explanations. If there are ways I can improve clarity, do let me know !