Tomi Peltola, Aki Havulinna, Veikko Salomaa, and Aki Vehtari write:

This paper describes an application of Bayesian linear survival regression . . . We compare the Gaussian, Laplace and horseshoe shrinkage priors, and find that the last has the best predictive performance and shrinks strong predictors less than the others. . . .

And here’s their R and Stan code!

P.S. Here’s more horseshoe from PyStan developer Allen Riddell, who writes:

Betting that only a subset of the explanatory variables are useful for prediction is a bet on sparsity. A popular model making this bet is the Lasso or, less handily, L1-regularized regression. A Bayesian competitor to the Lasso makes use of the “Horseshoe prior” (which I’ll call “the Horseshoe” for symmetry). This prior captures the belief that regression coefficients are rather likely to be zero (the bet on sparsity). The following shows how to use the Horseshoe in Stan.

And the Horseshoe+ prior from Anindya Bhadra, Jyotishka Datta, Nicholas Polson, and Brandon Willard, who write:

The horseshoe+ prior is a natural extension of the horseshoe prior . . . concentrates at a rate faster than that of the horseshoe in the Kullback-Leibler (K-L) sense . . . lower mean squared error . . . In simulations, the horseshoe+ estimator demonstrates superior performance in a standard design setting against competing methods, including the horseshoe and Dirichlet-Laplace estimators. . . .

And they too have R and Stan code!
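To give a sense of the construction, here is a hedged sketch of what the horseshoe+ hierarchy can look like in Stan (variable names are mine, not taken from their code): the horseshoe's single half-Cauchy local scale gets an extra half-Cauchy layer, lambda_i ~ C+(0, eta_i) with eta_i ~ C+(0, 1).

```stan
parameters {
  vector[J] beta;
  vector<lower=0>[J] lambda;  // local scales
  vector<lower=0>[J] eta;     // extra half-Cauchy layer: this is the "+"
  real<lower=0> tau;          // global scale
}
model {
  tau ~ cauchy(0, 1);         // half-Cauchy via the <lower=0> constraint
  eta ~ cauchy(0, 1);
  lambda ~ cauchy(0, eta);    // plain horseshoe would use cauchy(0, 1) here
  beta ~ normal(0, lambda * tau);
  // ... likelihood for the data goes here
}
```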

I thought this was going to be a study of how many Bayesians made it to the next round in the horseshoes tournament in Stanton. Imagine my disappointment.


Stan code given for the regression coefficients in the first P.S. link:

`beta[i] ~ normal(0, square(lambda[i]) * square(tau));`

I think the “square” calls should be dropped, since Stan parameterizes the normal distribution by its standard deviation rather than its variance:

`beta[i] ~ normal(0, lambda[i] * tau);`
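For context, a minimal sketch of the full horseshoe prior with the standard-deviation parameterization (half-Cauchy scales obtained via the `<lower=0>` constraints; variable names are illustrative, not from the linked code):

```stan
parameters {
  vector[J] beta;
  vector<lower=0>[J] lambda;  // local shrinkage scales, half-Cauchy(0, 1)
  real<lower=0> tau;          // global shrinkage scale, half-Cauchy(0, 1)
}
model {
  lambda ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);
  beta ~ normal(0, lambda * tau);  // scale, not variance
  // ... likelihood for the data goes here
}
```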

In the third link, the horseshoe prior is given by:

```stan
transformed parameters {
  vector[J] theta;
  vector[J] lambda;
  lambda <- lambda_step * tau;
  theta <- (theta_step .* lambda_step) * tau;
}
model {
  tau ~ cauchy(0, 1.0/J);
  lambda_step ~ cauchy(0, 1);
  theta_step ~ normal(0, 1);
  y ~ normal(theta, 1);
}
```

I don't understand a couple of things here. Why isn't tau constrained to be positive? And why can lambda be computed as the product of lambda_step and tau?

lambda_step is Cauchy(0, 1), so lambda, computed as the product of lambda_step and tau, is Cauchy(0, tau), which is how lambda is defined in the model. (I would compute theta as the product of theta_step and lambda in the code rather than redoing the lambda_step-and-tau product for theta.)

As long as the prior is zero mean, tau doesn’t strictly need to be constrained to be positive (if tau changes sign, one can change the signs of theta_step to give the same theta). Still, I would probably recommend constraining it to be positive to remove the sign ambiguity, as the ambiguity will make the posterior multimodal in tau and theta_step. Although a single chain will probably sample from one of the modes, multiple chains might end up in different modes, which can complicate the analysis of the posterior samples (e.g., automatic multi-chain convergence diagnostics will probably not like this). And of course, if one were to change the code to use Cauchy(0, tau) for lambda, negative tau values would not make sense.
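Putting those two suggestions together, the third link's code might be revised along these lines (a sketch keeping the original variable names and old-style `<-` assignment; constraining lambda_step positive as well, so the implied lambda is a proper half-Cauchy, is my own added assumption):

```stan
parameters {
  real<lower=0> tau;               // now constrained positive
  vector<lower=0>[J] lambda_step;  // also constrained, so lambda >= 0
  vector[J] theta_step;
}
transformed parameters {
  vector[J] lambda;
  vector[J] theta;
  lambda <- lambda_step * tau;     // lambda is half-Cauchy(0, tau)
  theta <- theta_step .* lambda;   // reuse lambda instead of repeating the product
}
model {
  tau ~ cauchy(0, 1.0/J);
  lambda_step ~ cauchy(0, 1);
  theta_step ~ normal(0, 1);
  y ~ normal(theta, 1);
}
```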

Absolutely: always identify parameters where you can, for exactly the reasons T mentions.

I just love seeing all this great advice and code in Stan. Thanks!