Jared writes:
I gave a talk at the Washington DC Conference about six different tools for fitting lasso models. You’ll be happy to see that rstanarm outperformed the rest of the methods.
That’s about 60 more slides than I would’ve used. But it’s good to see all that code.
I have checked out the slides but did not get the idea. What is driving the differences between the different methods of fitting lasso?
Besides the conceptual difference between a MAP estimate and a full posterior distribution in Stan, I imagine it is mostly the default hyperparameter tuning in each package that makes the difference; otherwise we should be pretty shocked if different algebra solvers gave diverging answers to the same strictly convex problem.
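A minimal sketch of that point, assuming toy data and a glmnet-style objective, (1/(2n)) * RSS + lam * sum(|b|). None of this comes from the talk: `make_data`, `lasso_cd`, and `lasso_ista` are hypothetical helpers written for illustration. Two different solvers (coordinate descent and proximal gradient) are run on the same objective, and the same solver is then run at two penalty values.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal operator of t * |.|)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def make_data(n=60, seed=0):
    """Hypothetical toy data: 3 predictors, one with a true coefficient of 0."""
    rng = random.Random(seed)
    true_beta = [2.0, 0.0, -1.0]
    X = [[rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0.0, 0.5)
         for row in X]
    return X, y

def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent on (1/(2n)) * RSS + lam * sum(|b|), no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X @ beta, kept up to date
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            beta[j] = new_bj
    return beta

def lasso_ista(X, y, lam, iters=2000):
    """Proximal gradient (ISTA) on the same objective."""
    n, p = len(X), len(X[0])
    # The trace of X'X / n bounds its largest eigenvalue, so 1/trace is a safe step.
    step = 1.0 / (sum(v * v for row in X for v in row) / n)
    beta = [0.0] * p
    for _ in range(iters):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                 for i in range(n)]
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) / n
                for j in range(p)]
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam)
                for j in range(p)]
    return beta

X, y = make_data()
for lam in (0.05, 0.5):
    print(lam, lasso_cd(X, y, lam), lasso_ista(X, y, lam))
```

On this toy problem the two solvers should agree to many decimal places at a given `lam`, while changing `lam` moves the coefficients noticeably. That is consistent with the guess that tuning, not the solver, drives the differences.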
+1. I realize that (1) I’m looking a gift horse in the mouth and (2) I don’t know what information was presented verbally in the presentation that doesn’t show up here, but I’m frustrated by being able to see only that xgboost and stan do better than the others, and not what the differences are in the hyperparameter/penalization selection machinery.
Here’s the talk
https://youtu.be/R-lVeYjJtw0
Can anyone recommend a good introductory tutorial for fitting Bayesian penalized regression with Stan/rstanarm/brms? Thank you.
Yes please! This would be very interesting.
The various methods either tuned the hyperparameters automatically (like rstanarm) or used a value that was heuristically chosen.
How is rstanarm tuning the hyperparameters? Based on the slides, I would guess that a default prior is used.
Jesper:
Yes, rstanarm’s priors are set by default. We’ve been talking about making the default priors stronger, actually, which I think should work well, given that they’re already defined relative to centered and scaled predictors and outcomes.
Automatically as in using a hierarchical model?
As Andrew mentions, rstanarm also preconditions the inputs, which changes the interpretation of the default priors.
Is that MSE on held-out data with point estimates? I find root mean squared error (aka RMSE) easier to interpret.
The lasso() prior used a chi-squared prior on the tuning parameter.
It’s the RMSE on held-out data.
Keras used MAE as the criterion function. Isn’t that a mistake?
Keras used MSE as the loss function and MAE as the metric.
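For readers following the MSE/RMSE/MAE exchange, here is a small sketch of the three quantities on made-up held-out values (the numbers are hypothetical, not from the talk). The loss is what the optimizer minimizes during fitting. A metric is only reported alongside it, so using MSE as the loss and MAE as the metric is not contradictory.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: the quantity minimized when MSE is the loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: same ranking as MSE, but in the units of y."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to a few large misses than MSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical held-out outcomes and point predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 12.0]

print("MSE ", mse(y_true, y_pred))
print("RMSE", rmse(y_true, y_pred))
print("MAE ", mae(y_true, y_pred))
```

Because RMSE is a monotone transform of MSE, comparing methods by either gives the same ordering. MAE can order them differently when one method has a few large errors.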
No mention of the regularized horseshoe, which is what I would use for sparse regression in rstanarm anyway, rather than better lasso defaults.
Would have done the horseshoe but wanted to keep it all lasso.
Very interesting topic. As mentioned above, I have checked out the slides but did not get the idea. What is actually driving the differences in performance between the different methods of fitting lasso? Do you have any for-dummies type of summary available?
Oh, I see I missed the talk that you link to in one of the comments above. I guess that could be a starting point. (Have not checked it out yet.) Thanks for the link!
OK, so I watched the talk. I liked that it was easy to follow, but what is the takeaway (besides how to pronounce “glmnet” – that was a good one)? You have illustrated that lasso can be implemented in several different ways using existing packages. You have shown that the results were different, but have not really explained why. I think it could very well have been due to different values of tuning parameters used in different implementations. Some implementations shared the values, and incidentally the results were similar, e.g. glmnet and lars. Still not quite sure what conclusion to draw from all this.
Why are they using `intercept=FALSE, standardize=FALSE` with glmnet?
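The thread does not say why the talk switched those options off. As background on why `standardize` matters at all, here is a sketch with one predictor and no intercept, using made-up data and a hypothetical helper `lasso_1d`. The lasso penalty acts on the raw coefficient, so re-expressing a predictor in different units changes the effective shrinkage unless the predictor is standardized first.

```python
import math

def soft_threshold(z, t):
    """Shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(x, y, lam):
    """Closed-form minimizer of (1/(2n)) * sum((y - x*b)^2) + lam * |b|."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    c = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, lam) / c

def standardize(x):
    """Center x and rescale it to unit variance."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)
    return [(xi - mean) / sd for xi in x]

def fitted(x, y, lam):
    """Fitted values x * b_hat, which are comparable across units of x."""
    b = lasso_1d(x, y, lam)
    return [xi * b for xi in x]

# Hypothetical data: the same predictor in metres and in centimetres.
x_m = [-2.0, -1.0, 0.0, 1.0, 2.0]
x_cm = [100.0 * v for v in x_m]
y = [-4.1, -1.9, 0.2, 2.1, 3.7]
lam = 0.5

raw_m, raw_cm = fitted(x_m, y, lam), fitted(x_cm, y, lam)
std_m = fitted(standardize(x_m), y, lam)
std_cm = fitted(standardize(x_cm), y, lam)

print("raw, metres:      ", raw_m)
print("raw, centimetres: ", raw_cm)
print("standardized (m): ", std_m)
print("standardized (cm):", std_cm)
```

Without standardization the two unit choices give different fitted values at the same `lam`. After standardization they coincide. Whether the talk's predictors were already on a common scale is not something the slides or this thread settle.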