Lasso regression etc in Stan

[cat picture]

Someone on the users list asked about lasso regression in Stan, and Ben replied:

In the rstanarm package we have stan_lm(), which is sort of like ridge regression, and stan_glm() with family = gaussian and prior = laplace() or prior = lasso(). The latter estimates the shrinkage as a hyperparameter while the former fixes it to a specified value. Again, there are possible differences in scaling but you should get good predictions. Also, there is prior = hs() or prior = hs_plus() that implement hierarchical shrinkage on the coefficients.

We discussed horseshoe in Stan a while ago, and there’s more to be said on this topic, including the idea of postprocessing the posterior inferences if there’s a desire to pull some coefficients all the way to zero. And informative priors on the scaling parameters: yes, these hyperparameters can be estimated from data alone, but such estimates can be unstable, and some prior information should be helpful. What we really need are a bunch of examples applying these models to real problems.
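
For concreteness, here is a minimal Stan sketch of the kind of model being discussed: a linear regression with a Laplace (double-exponential) prior on the coefficients and the shrinkage scale estimated as a hyperparameter. This is only an illustration with placeholder half-Cauchy hyperpriors, not the parameterization rstanarm actually uses.

data {
  int<lower=1> N;          // number of observations
  int<lower=1> K;          // number of predictors
  matrix[N, K] X;          // predictor matrix
  vector[N] y;             // outcome
}
parameters {
  real alpha;              // intercept
  vector[K] beta;          // regression coefficients
  real<lower=0> tau;       // Laplace scale (shrinkage), estimated from the data
  real<lower=0> sigma;     // residual standard deviation
}
model {
  beta ~ double_exponential(0, tau);  // Bayesian analogue of the L1 penalty
  tau ~ cauchy(0, 1);                 // half-Cauchy hyperprior on the shrinkage scale
  alpha ~ normal(0, 10);
  sigma ~ cauchy(0, 2);
  y ~ normal(alpha + X * beta, sigma);
}

Fixing tau to a specified value rather than estimating it corresponds roughly to the laplace() versus lasso() distinction Ben describes above.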

13 thoughts on “Lasso regression etc in Stan”

  1. I’ll also paste in an edited version of my reply on the users list:

    The “lasso” usually refers to penalized maximum likelihood estimates for regression models with L1 penalties on the coefficients. You have to choose the scale of that penalty.

    You can include a Laplace prior in a Bayesian model, and then the posterior is proportional to the lasso’s penalized likelihood. If you take the mode (the so-called MAP estimate), it’s just the lasso again. And the posterior mode isn’t a Bayesian estimator (this is not inconsistent with Andrew’s desire to throw a Laplace approximation around the mode and call it an approximate posterior, perhaps with some importance weighting if you care about approximate expectations).
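
    To spell that correspondence out as a worked equation (the symbols here are only for illustration): with a Laplace prior of scale $b$ on each coefficient, the log posterior is

    $\log p(\beta \mid y) = \log p(y \mid \beta) - \frac{1}{b} \sum_k |\beta_k| + \text{const},$

    so maximizing it (the MAP estimate) is the same optimization problem as the lasso's penalized maximum likelihood with penalty $\lambda = 1/b$.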

    The posterior mean is a Bayesian estimator in the sense that it averages over posterior uncertainty and is thus a posterior expectation. We know posterior means and posterior modes can be quite distant in models with asymmetric posteriors; sometimes the modes (or even means) don’t even exist.

    Stan provides penalized maximum likelihood. Unlike specialized lasso algorithms such as truncated gradient descent, straight-up gradient-based optimization is unlikely to find estimates of exactly zero for parameters; it will only get there through floating-point rounding, below 1e-300 or so.

    Stan also lets you calculate Bayesian posterior means. But these are never going to be exactly zero, because $\Pr[\beta = 0 \mid y] = 0$ when $\beta$ is a continuous parameter ($y$ is the data here).

    The absolute value in the L1 penalty (same as in Laplace prior) is bad news in that it means the posterior isn’t continuously differentiable. Stan’s samplers and optimizers can run into problems when the parameter jumps across that non-differentiable point.

  2. Bob said:

    > The absolute value in the L1 penalty (same as in Laplace prior) is bad news in that it means the posterior isn’t continuously differentiable. Stan’s samplers and optimizers can run into problems when the parameter jumps across that non-differentiable point.

    In rstanarm, we write the Laplace (and “lasso”) prior as a scale mixture of normals, so there is no lack of differentiability. But the heavy tails do cause divergences unless adapt_delta is really high.
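
    To illustrate the idea (this is only a generic sketch of a Laplace prior written as a scale mixture of normals, not rstanarm's actual internal code): if beta_k given tau2_k is normal with variance tau2_k, and tau2_k has an exponential distribution with rate $\lambda^2/2$, then marginally beta_k has a Laplace distribution with scale $1/\lambda$, and no absolute value ever appears in the log density. Reusing the data block from the sketch in the post above:

    parameters {
      vector[K] beta;
      vector<lower=0>[K] tau2;   // per-coefficient mixing variances
      real<lower=0> lambda;      // overall shrinkage (inverse Laplace scale)
      real<lower=0> sigma;
    }
    model {
      tau2 ~ exponential(square(lambda) / 2);  // exponential mixing on the variances
      beta ~ normal(0, sqrt(tau2));            // conditionally normal, marginally Laplace(0, 1/lambda)
      lambda ~ cauchy(0, 1);
      sigma ~ cauchy(0, 2);
      y ~ normal(X * beta, sigma);
    }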

  3. Also, I could have mentioned that rstanarm implemented prior = product_normal(), where each coefficient is, a priori, the product of two (or more) variates that are each standard normal. The implied PDF for the coefficient is infinite at 0 (by default, but you can shift the location to whatever you want), which is probably a bad idea but may be popular if your audience dogmatically believes the coefficient “is” zero and you are trying to use actual data to convince them otherwise. (A sketch of the construction appears below.)

    This actually seems to sample okay, even though the standard normal variates can be swapped when the coefficient “is” zero without changing the posterior kernel. NUTS can barely handle multimodality if the modes are close enough.
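
    A generic sketch of that product-of-normals construction (an illustration only, not rstanarm's actual code; two factors, and again reusing the data block from the sketch in the post):

    parameters {
      vector[K] z1;
      vector[K] z2;
      real<lower=0> sigma;
    }
    transformed parameters {
      vector[K] beta = z1 .* z2;  // each coefficient is the product of two standard normals
    }
    model {
      z1 ~ normal(0, 1);
      z2 ~ normal(0, 1);
      sigma ~ cauchy(0, 2);
      y ~ normal(X * beta, sigma);
    }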

    • Mike:

      I think that “automatic relevance determination” is what we in statistics call hierarchical modeling. And of course it’s not really automatic: if data are sparse, some prior information can help in estimating these hyperparameters.

      • And “automatic relevance determination” is a bad name as in GPs it is related to the nonlinearity and not the relevance! See more in Juho Piironen and Aki Vehtari (2016). Projection predictive input variable selection for Gaussian process models. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), doi:10.1109/MLSP.2016.7738829. arXiv preprint http://arxiv.org/abs/1510.04813

  4. Can someone help me understand the differences in these approaches?
    Specifically: what is the lasso prior, what is the hs prior, and what is the hs_plus prior (plus, as in positive?)? What are their meaningful differences?

    E.g.,
    vector[K] beta;
    real tau;

    beta ~ normal(0, tau);
    tau ~ cauchy(0, 1);
    y ~ normal(X * beta, sigma);

    What is ^^ that method most analogous to, lasso or hs?

    • The Laplace, horseshoe (hs) and horseshoe+ distributions are all continuous, symmetrical distributions that place high density very close to their central value, which then drops off quickly out to long, heavy tails. Compared to a Gaussian distribution they all place higher density near their centre and higher density out in the tails. When used as priors they have a tendency to shrink small posterior estimates towards the distribution’s central value (a Laplace prior centred at zero behaves similarly to frequentist LASSO regularisation; Bob Carpenter’s first comment gives detail on this). For this reason they’re most commonly used to carry out Bayesian regularisation of parameter estimates.

      If the main common features of the distributions are high density close to their centres and high density out in the tails, then their main practical difference is in how strong this tendency is. The horseshoe distribution tends to place even greater density near its centre, and out in its tails, than the Laplace, and the horseshoe+ even more so compared to the horseshoe. Figures 2 and 3 of this paper (https://arxiv.org/abs/1502.00560) give a good impression of this. Generally this means that the horseshoe and horseshoe+ will shrink small values more aggressively towards the centre than the Laplace, whilst having less of an effect on large values (see figure 3 of http://ceur-ws.org/Vol-1218/bmaw2014_paper_8.pdf).
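
      One common way to write the hierarchies down, roughly following the notation of the horseshoe+ paper linked above ($\tau$ is a global scale, $\lambda_j$ a local scale for coefficient $j$; take this as a sketch of the structure rather than a canonical parameterisation):

      Laplace: $\beta_j \sim \mathrm{Laplace}(0, b)$, with density $\frac{1}{2b}\exp(-|\beta_j|/b)$.
      Horseshoe: $\beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \lambda_j^2 \tau^2)$, with $\lambda_j \sim \mathrm{C}^+(0, 1)$.
      Horseshoe+: $\beta_j \mid \lambda_j \sim \mathrm{N}(0, \lambda_j^2)$, with $\lambda_j \mid \eta_j \sim \mathrm{C}^+(0, \tau\,\eta_j)$ and $\eta_j \sim \mathrm{C}^+(0, 1)$.

      The extra half-Cauchy layer at each step is what pushes more prior mass both towards zero and out into the tails.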

      Your example doesn’t correspond directly to any of these cases; a normal prior with a Cauchy-distributed SD (which should have a lower limit of 0) won’t carry out any notable shrinkage of the posterior estimates. The Laplace approach can be achieved by setting beta ~ double_exponential(0, tau); the two posts below give examples of coding up horseshoe and horseshoe+ models in Stan, and a minimal sketch follows the links.

      https://www.ariddell.org/horseshoe-prior-with-stan.html
      https://brandonwillard.bitbucket.io/bayes-horseshoe-plus/horseshoe-stan-nvm.html?
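
      For reference, here is a minimal non-centered horseshoe sketch along the lines of those posts (an illustration only; the heavy-tailed half-Cauchy scales tend to produce divergences unless adapt_delta is set quite high):

      data {
        int<lower=1> N;
        int<lower=1> K;
        matrix[N, K] X;
        vector[N] y;
      }
      parameters {
        vector[K] z;                 // standardized coefficients
        vector<lower=0>[K] lambda;   // local shrinkage scales
        real<lower=0> tau;           // global shrinkage scale
        real<lower=0> sigma;
      }
      transformed parameters {
        vector[K] beta = z .* lambda * tau;  // non-centered parameterization
      }
      model {
        z ~ normal(0, 1);
        lambda ~ cauchy(0, 1);   // half-Cauchy local scales
        tau ~ cauchy(0, 1);      // half-Cauchy global scale
        sigma ~ cauchy(0, 2);
        y ~ normal(X * beta, sigma);
      }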

      • Ewan,

        Thanks for the reply.
        Since I wrote the comment, I read the papers about HS and HS+, and they make much more sense to me.
        I’ve noticed that N(0,tau) /does/ definitely shrink, but actually overly shrinks. If you have 5 signals and 25 non-signals, so to speak, the 5 signals will be shrunken down. In a sense, it’s a bit like the horseshoe prior, but without the local shrinkage parameters, and as a result it only has global shrinkage (tau is estimated, lambda_i is not).

        Now I understand what horseshoe, horseshoe+, and lasso all do and how they do it. Thank you for the extra links, as well. I also wound up using brms with a hs prior to see how they code it, and it’s plain as day how it’s done. I’ll start using HS more often now.

  5. Hi,

    I am attempting to use the Lasso priors on a final project in my Bayesian course. I was wondering if anyone knew of some walk-through examples, or perhaps how to get plots similar to the coefficients-versus-shrinkage-parameter plot that is typical in the frequentist approach. When I specify the prior, I get a typical summary but am not sure exactly how to explain how the coefficients were determined. In the frequentist perspective, we use cross-validation to choose the shrinkage parameter which minimizes cross-validation error (or the minimum + 1 sd). From what I have read, the shrinkage parameter has a prior which is chi-square. So, is there a way to examine the shrinkage of the coefficients in a similar way?

    I am really curious to know whether Stan’s sampler is able to sample from a posterior distribution that has non-differentiable points. In a simple Gaussian model for y with mean theta, if we set a Laplace prior on the parameter theta, then the posterior distribution of theta will be continuous but not everywhere differentiable. Is Stan able to fit this model? Or will it return some error?

    I use Stan a lot in my work, so I really want to know how Stan handles these non-differentiable points in the posterior.

    Thanks,

    Shijing

    Y’all know the “Bayesian Lasso” doesn’t work*, right? Simulate from your prior and check to see whether you can have both small (i.e., zero) and large (i.e., non-zero) values at the same time. You can’t. (A quick sketch of this check is at the end of this comment.)

    There’s also quite a lot of theory (the number of trees that have died so Dutch/French asymptoticists can hold forth on why this doesn’t work in the mighty [and mightily incomprehensible] Annals of Statistics beggars belief). But long story short, the tail is much too light and there needs to be a spike at zero.

    The Bayesian Lasso is just a terrible exercise in Branding.

    * Work here means “promotes sparsity” / “a priori and hence a posteriori allows for approximately sparse signals”. If you’re using the Bayesian Lasso in a way that doesn’t think about sparsity at all and isn’t in high dimensions, I guess it might work.
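
    The check suggested above (simulate from the prior and look at whether a single draw can contain both essentially-zero and genuinely large coefficients) can be sketched as a prior-only Stan program, run with the fixed_param algorithm. The half-Cauchy hyperprior here is just a placeholder, not a claim about how any particular Bayesian lasso implementation sets its scale:

    data {
      int<lower=1> K;   // number of coefficients to simulate
    }
    generated quantities {
      real tau;         // global Laplace scale, drawn from a placeholder half-Cauchy
      vector[K] beta;   // one prior draw of the whole coefficient vector
      tau = fabs(cauchy_rng(0, 1));
      for (k in 1:K)
        beta[k] = double_exponential_rng(0, tau);
      // Repeat over many draws and inspect whether any single draw mixes
      // near-zero and large values of beta; that is the check being proposed.
    }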
