Survey Statistics: Sparsified MRP

This post has 3 questions for you below.

Multilevel Regression and Poststratification (MRP) aims to address nonresponse. Suppose we want to estimate E[Y], the population mean. But we only have Y for respondents. For example, suppose Y is voting Republican. And what if respondents are more or less Republican than the population ? If we have population data on X, e.g. a bunch of demographic variables, then we can estimate E[Y|X] and aggregate: E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of X, at least we fix that with some calibration.

With nonresponse worsening, we want to adjust for a lot of covariates X (including their interactions !). Estimates from such big models will be unstable without a lot of data and/or regularization.

As Andrew writes about MRP and RPP:

5. The multilevel part of MRP comes because you want to adjust for lots of cells j in your poststrat

But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any regularized prediction method that gives reasonable and stable estimates while including a potentially large number of predictors.

Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.

As Andrew writes about the lasso:

Lasso (“least absolute shrinkage and selection operator”) is a regularization procedure that shrinks regression coefficients toward zero…Somehow it took the Stanford school of open-minded non-Bayesians to regularize in the way that Bayesians always could—but didn’t…I was fitting uncontrolled regressions with lots of predictors and not knowing what to do…

“Over the past few months I [Tibshirani] have been consumed with a new piece of work [with Richard Lockhart, Jonathan Taylor, and Ryan Tibshirani] that’s finally done. Here’s the paper, the slides (easier to read), and the R package.”

QUESTION 1: These links are broken. [They’re no longer broken! I found the correct links and went in and updated them. — AG]

As Andrew writes about the “bet on sparsity principle“:

sparse models can be faster to compute, easier to understand, and yield more stable inferences.

Though this “easier to understand” may be kind of fake, as Andrew says:

lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model.

And:

In most settings I actually find it difficult to directly interpret more than one coefficient in a regression model.

Same.

QUESTION 2: Ok, so forgetting about the interpretability stuff, let’s say I am doing a giant Regularized Prediction and Poststratification (RPP). Do you know of papers doing RPP with sparsifying priors ?

Implementation time.

Starting with the lasso: Tibshirani’s selectiveInference R package gives a logistic example here (our Y is binary too). I extended their example using brms and rstanarm with Laplace priors with scale 1/lambda, see ESL p.72:

mod_rstanarm <- stan_glm(
formula = formula_all,
data = data,
family = binomial(),
prior = laplace(0, 1.25)
)

Here are my results comparing the point estimates:

QUESTION 3: why does Bayesian inference with Laplace priors (and the same likelihood and data) give posterior means more similar to selectiveInference than lasso ? is the above expected behavior or do you suspect my code has a bug ?

(I asked about this here, in Andrew’s post about a post-selection inference question from Richard Artner.)

Moving away from lasso to other sparsifying priors, Piironen and Vehtari (2017) conclude:

the (regularized) horseshoe consistently outperforms Lasso in terms of predictive accuracy…A clear advantage for Lasso, on the other hand, is that it is hugely faster

brms even stopped supporting lasso in 2023.

Michael Betancourt showed struggles with the horseshoe, concluding:

In the end while these modifications of the horseshoe population model can help they typically don’t achieve any better performance in practice than a Cauchy population model or the binary mixture normal population model, both of which are much easier to configure and accurately fit out of the box. Consequently it’s hard to argue for the practical utility of the horseshoe model.

So back to QUESTION 2: which sparsifying priors should we use for RPP ?

33 thoughts on “Survey Statistics: Sparsified MRP

  1. iirc the post selection inference literature is about getting “better” p values, or whatever you’re using for hypothesis testing / intervals, after you performed some type of variable selection (say, the lasso)

    • Thanks, mt !

      This is my understanding as well.

      In my post I use the example from selectiveInference::fixedLassoInf(), whose documentation says:

      This function computes selective p-values and confidence intervals for the lasso, given a fixed value of the tuning parameter lambda. …
      Note that the coefficients and standard errors reported are unregularized. Eg for the Gaussian, they are the usual least squares estimates and standard errors for the model fit to the active set from the lasso.

      I see that the post-selection-inference coefficients are in between the shrunken lasso coefficients and the completely unshrunken classical logistic glm. Why ?

      • my guess is that its the default values for tuning params or initial values between the implementations regularized versions.

        Id expect the selective inf results to be close to all your lasso variants and drastically different then the classical unregulized version (which looks to be the case)

        Seems like you’re aware already but most of the post sel inf stuff is for ‘error bars’ of Xs relation to Y.

        While a lot of the survey applications of MRP is focused on the Y.

        • Thanks again, mt !

          Id expect the selective inf results to be close to all your lasso variants and drastically different then the classical unregulized version (which looks to be the case)

          Yes, but they say in documentation for selectiveInference::fixedLassoInf():

          Note that the coefficients and standard errors reported are unregularized.

          And you say:

          Seems like you’re aware already but most of the post sel inf stuff is for ‘error bars’ of Xs relation to Y.

          While a lot of the survey applications of MRP is focused on the Y.

          Yes, though presumably if you wanted to estimate uncertainty for the MRP esimate, you’d want some error bars ? A fully Bayesian approach makes this easiest, but if someone is using lasso then maybe selectiveInference is useful ? See Ben Goodrich’s comment below.

  2. Re. “lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model.”

    you can just use stability selection, right?

  3. Shira:

    Just a minor thing. The expectation E is just an operator, and it’s just fine to write E(y), no reason to use brackets as in E[y]. I mean, brackets are fine, in the same way that, mathematically, you can write 2*(3+4)=14 or 2*[3+4]=14, but they’re not necessary. I say this only because in your post you write expectations always with brackets, never with parentheses, which suggests to me you’re enforcing a rule that isn’t there.

    • The use of [.] vs (.) always confused me, as people seem to mix and match. Is there some kind of unwritten rule that when you have a functional (a function that takes a function as an argument), you write E[f] or Pr[f], but when a function takes a concrete object as an argement like a scalar or vector you write f(x)?

      • Eric:

        I think the true rule is that (), [], and {} are for most purposes equivalent in mathematical notation, except for special cases such as when [] is used for the integer part of a real number or when {} is used for a set.

    • > you write expectations always with brackets, never with parentheses

      The alternative of writing expectations sometimes with brackets and sometimes with parenthesis seems much worse. Consistency is good even though “enforcing a rule that isn’t there” sounds bad.

      Square brackets are often used when the argument of the function(al) is not a value / element but a function / random variable / set. Interestingly one of the first results in a quick search about this subject is another [authors’s] entry in this blog: https://statmodeling.stat.columbia.edu/2020/02/05/abuse-of-expectation-notation/

        • The usage in BDA is varied.

          In page 6 we see sd(θ)/E(θ) and also exp(E[log(θ)]) and sd(E[log(θ)]). You are indeed avoiding ))) in those cases.

          In page 60 we have E[Z^m(1-Z)^n] but there was not even a )) there. There are expressions way more messy, often including expectations, without brackets.

          In page 357 there is an equation containing E(E(.|.)|.) and another equation containing otherwise similar expressions E[var(.|.)|.] + var[E(.|.)|.]. The latter is then transformed to E[.|.] + var[.|.] to have some local consistency, I guess.

          There is at least one place (page 122) where square brackets are used for an actual function (not E or var): φ[(28-14.5)/9.1]=93%. I am not sure if it’s an oversight or a deliberate notation choice but it seems that φ([28-14.5]/9.1) would be more natural.

    • The convention derives from physics, where square brackets indicate functionals (i.e., functions of functions). See, for example, the APS’s guidelines:

      https://journals.aps.org/authors/bracketing-mathematical-expressions-h9

      Because a random variable is technically a function from sample space to expected value, functions of random variables like expectations are conventionally written with square brackets. Same for KL-divergence.

      If you want to be careful talking about distinctions between random and bound variables, you run into trouble with BDA’s notation. I found it very very hard when I was just learning what a random variable is (you don’t get a precise definition in BDA from what I recall). The problem arises if we want to do something like define the cumulative distribution function of a scalar random variable Y, which is usually defined as the function

      F_Y(y) = Pr[Y < y],

      where y is a bound variable. Because the function F_Y is a function of a scalar bound variable y, it’s written with parens. Because Pr is a function of random variables, it’s written with square brackets. Because Y is random, it’s capitalized; because y is bound, it’s lowercase. Andrew doesn’t like the Y vs. y distinction, so it becomes very hard to define a cdf without confusion. I do get Andrew’s point about there not being enough letter types to indicate vectors with bold and matrices with caps and parameters with Greek and constants with roman and unknowns with italics—you start running out of options when trying to combine options. For example, matrices and RVs are both distinguished with capital letters, but what if I have a random matrix? I still think the distinction is super helpful for pedagogical purposes for beginners, though, which is why I tried to make these distinctions carefully and in the more formal appendix in my intro to Bayesian stats with Stan

      • Thanks, Bob ! This is really helpful.

        Yup, I don’t think “random variable” is defined in BDA.

        BDA p.23 defines a cumulative distribution function as:

        F(v*) = Pr(v <= v*)

        replacing the capital/lowercase with unstarred/starred.

  4. I think the bigger issue is that you should not want a point estimate of the parameters of a multilevel model if you are going to use it for MRP because you should want a posterior predictive distribution in the (sub)population and not a point prediction. Perhaps it is plausible to ask what regularization scheme on a point estimator yields point predictions that match the me(di)an of a posterior predictive distribution so that you do not have to do MCMC to obtain it, but I don’t think anyone knows the answer to that question (I am not surprised by the table comparing the point estimates) and it would not allow you to propagate uncertainty through to whatever you are using the predictions for.

    Sparsifying priors is kind of a misnomer because none of the draws from the prior or the posterior are exactly zero, so you still have to keep track of them regardless of how far away they are from zero. Dan has a good post on MRP, https://statmodeling.stat.columbia.edu/2019/08/22/multilevel-structured-regression-and-post-stratification/ , that links to his coauthored ArXiv paper on structured priors that are centered on special cases of the multilevel model that are less multilevel. Such priors would make it more likely that those model components get projected away if you use Aki et al.’s projpred package. It would be a worthwhile research question to see if projection yields better predictions in the population (it can only worsen the ELPD estimate from the sample but maybe by a negligible amount).

    • Thanks, Ben ! 3 questions for you:

      I think the bigger issue is that you should not want a point estimate of the parameters of a multilevel model if you are going to use it for MRP…

      1. As mt points out, selectiveInference does compute confidence intervals for the parameters. If someone were doing lasso and not MCMC, could they use these intervals to propagate uncertainty through to the predictions and whatever you use them for (e.g. population means) ?

      I remember that excellent post (and paper) by Dan !

      Their structured priors are random walks for variables like age.

      2. What do you mean by these models being “less multilevel” ?

      3. Why would the age parameters be more likely to be projected away by projpred in this case ?

      • 1) The idea of using numerical optimization to get a point estimator, and then using the curvature at that point to get an interval estimator, and then using the interval estimator to get 95% of a posterior distribution centered at the midpoint (and often pushing that through a non-linear inverse link function and sometimes pushing the predictive distribution through a non-linear utility function) presumes that the posterior distribution is close to normal. That has the same issues of using numerical optimization to find the mode and interpreting the mode as the median and mean. In a multilevel regression context (presuming you don’t integrate a bunch of stuff out analytically), the posterior distribution is rarely close to multivariate normal and the posterior mode is nowhere near where all the MCMC draws are. The various penalization schemes for optimization that are nondifferentiable at zero shift the optimum and don’t care that happens to the objective function away from the optimum, whereas the principal effect of prior probability distributions is through how heavy the tails are.

        2) The functions in the projpred package can take away a spline or a (1 + x | g) thing at the solution to its optimization problem. So, the result is less multilevel in the sense that there are fewer parameters in the posterior distribution that depend on other parameters.

        3) The coefficients on age are unlikely to be projected away with a random walk prior on the coefficients. If you had something simpler (and probably worse) like y ~ age + (1 + age | g), then a normal or some kind of shrinkage prior on the deviations for the coefficient on age across levels of g might be more likely to get simplified to y ~ age by projpred.

        • Thanks again, Ben !

          1) So if we want a sparsified prior for MRP, you’re advocating for Bayesian inference rather than lasso + selectiveInference + propagating uncertainty to the predictions, due to that propagation of uncertainty usually being based on poor approximations ?

          2) Oh I meant why are the structured priors (e.g. random walks) “less multilevel” ? You said “structured priors that are centered on special cases of the multilevel model that are less multilevel.” Maybe I misread it !

          3) Fom Gao et al. 2021 (the paper linked to in Dan’s post), compare their age_j ~ N(0, sigma) independent normals to their age_j | age_j-1 ~ N(age_j-1, sigma_RW) priors. Why would the independent normals be more likely to be projected away than the random walk prior ?

  5. Shira asks, “why does stan give posterior means more similar to selectiveInference than lasso?” Stan just does posterior inference w.r.t. a model and data. I would rephrase this to ask, “Why does Bayesian inference with model p(y, theta) and data y give you posterior means more similar to selectiveInference than lasso?” The usual answer is that { 0 } is a measure 0 set, so you never get an exact 0 answer. How close you get to zero will depend on posterior uncertainty.

  6. Do you have examples of data/code you can share to try to answer your third question? I understand if the exact data cannot be shared but I’d appreciate having some code to play around with.

    • Yes, thank you Michael ! Here:

      # SOURCE: https://search.r-project.org/CRAN/refmans/selectiveInference/html/fixedLassoInf.html
      # Required packages
      library(selectiveInference)
      library(glmnet)
      library(brms)
      library(rstanarm)

      # Logistic model setup
      set.seed(43)
      n = 50
      p = 10
      sigma = 1
      x = matrix(rnorm(n*p),n,p)
      x = scale(x,TRUE,TRUE)
      beta = c(3,2,rep(0,p-2))
      y = x%*%beta + sigma*rnorm(n)
      y = 1*(y>mean(y))

      # glmnet lasso
      gfit = glmnet(x,y,standardize=FALSE,family="binomial")
      lambda = .8
      beta_hat = coef(gfit, x=x, y=y, s=lambda/n, exact=TRUE)
      print(beta_hat)

      # Post-selection inference
      out = fixedLassoInf(x,y,beta_hat,lambda,family="binomial")
      print(out)

      # Prepare data
      data = data.frame(x,y)
      colnames(data) <- c(paste0("V",1:p), "y")

      # Extract selected variables from lasso
      selected_vars <- rownames(beta_hat)[-1][as.vector(beta_hat[-1,]) != 0]

      # GLM with selected variables
      formula_selected <- as.formula(paste("y ~", paste(selected_vars, collapse = " + ")))
      glm_model <- glm(formula_selected, data = data, family = "binomial")
      summary(glm_model)

      # Formula with all variables for brms
      formula_all <- as.formula(paste("y ~", paste(paste0("V",1:p), collapse = " + ")))

      1/lambda # scale of Laplace = double_exponential

      # ESL p.72 double exponential (or Laplace) distribution for each input,
      # with density (1/2τ ) exp(−|β|/τ ) and τ = 1/λ.
      # this is the same parameterization as in stan and hence in brms priors:
      # https://mc-stan.org/docs/functions-reference/unbounded_continuous_distributions.html#double-exponential-laplace-distribution

      mod_brms <- brm(
      formula = formula_all,
      data = data,
      family = bernoulli(link = "logit"),
      prior = prior(double_exponential(0, 1.25), class = "b")
      )
      # brms stopped supporting lasso in 2023: https://github.com/paul-buerkner/brms/releases/tag/v2.20.3

      mod_rstanarm <- stan_glm(
      formula = formula_all,
      data = data,
      family = binomial(),
      prior = laplace(0, 1.25)
      )

      # Extract point estimates
      brms_coefs <- fixef(mod_brms)[,1] # intercept + all p vars
      rstanarm_coefs <- coefficients(mod_rstanarm) # intercept + all p vars
      lasso_coefs <- as.vector(beta_hat) # intercept + all p vars
      glm_coefs <- coef(glm_model) # intercept + selected_vars
      selectiveInference_coefs <- out$coef0 # selected_vars

      all_vars <- c("(Intercept)", paste0("V", 1:p))
      comparison <- data.frame(Variable = all_vars, lasso = lasso_coefs, brms = brms_coefs, rstanarm = rstanarm_coefs)

      comparison[comparison$Variable %in% selected_vars,"selectiveInference"] = selectiveInference_coefs

      comparison[comparison$Variable %in% selected_vars,"glm"] = glm_coefs[-1]
      comparison[1,"glm"] = glm_coefs[1]

      comparison

Leave a Reply

Your email address will not be published. Required fields are marked *