Dean Eckles writes:
I remember reading on your blog that you were working on some tools to fit multilevel models that also include “fixed” effects — such as continuous predictors — that are also estimated with shrinkage (for example, an L1 or L2 penalty). Any new developments on this front?
I often find myself wanting to fit a multilevel model to some data, but also needing to include a number of “fixed” effects, mainly continuous variables. This makes me wary of overfitting to these predictors, so then I’d want to use some kind of shrinkage.
As far as I can tell, the main options for doing this now is by going fully Bayesian and using a Gibbs sampler. With MCMCglmm or BUGS/JAGS I could just specify a prior on the fixed effects that corresponds to a desired penalty. However, this is pretty slow, especially with a large data set and because I’d like to select the penalty parameter by cross-validation (which is where this isn’t very Bayesian I guess?).
We allow informative priors in blmer/bglmer. Unfortunately blmer/bglmer aren’t ready yet but they will be soon, I hope. They’ll be in the “arm” package in R. We’re also working on a bigger project of multilevel models for deep interactions of continuous predictors. But that won’t be ready for awhile; we still have to figure out what we want to do there.
For most data sets an MCMC sampler for such a model should take only seconds in R. But when the data set gets huge, I usually find that the bottleneck in processing lies in inverting the square of the design matrix to get the variances for the conditional posteriors of effects (inverting (X'X), which is a square matrix with dimension the number of observations). In such cases I just give up on sampling the effects jointly, sample them from independent normals, and do a metropolis hastings decorrelating step to break autocorrelation across the additive components. Doing so in C yields a pretty quick sampler with reasonable chains. My ability to do such hacky things is why I don't rely on BUGS etc. for my sampling, although perhaps there's a better way to sample such posteriors in cases with large data sets?
I've also seen solutions to these sorts of problems (e.g., MAP estimators) under the auspices of the 'relevance vector machine' or 'relevance vector regression'.
silly question: Y U NO USE cppBugs?
I'm theorizing that it would be a half-measure between the fully templated autodiff implementation that you may be pursuing for the deep-interactions project, and good old reliable BUGS/JAGS for simpler models. Reasonable guess?
And whoever brought up Tipping's RVM — my understanding at this moment in time is that it's very, very close to logistic regression with a Bayesian lasso. Not that this is in any way a bad idea, and in fact some ideas are so good that they get invented several times (Gaussian process regression == kriging, for example). That, again, is just my understanding.
A doctoral student in another lab implemented a very clever method that used AdaBoost to combine RVMs for classifying tumors by the most likely treatment combination to generate a response in the patient — I thought that was pretty damned clever and it inspired me to learn more about RVMs (I already knew about adaBoost).
The simple answer is that we didn't know about it. Our main concern's been with the lack of convergence of Gibbs sampling, not the slowness of BUGS. Having said that, something 10-100 times faster than BUGS for cases where Gibbs works would be nice.
Like the thing we're writing, it's going to be a big leap for the casual BUGS user to start compiling C++. The home page just lists the update() method, which looks like BUGS. I was wondering how they did that until I looked at the rest of the class for the model:
That looks more similar to how we're currently specifying our models, though we have users extend an abstract base class by implementing a virtual density function directly rather than trying to define a BUGS-like model as in CppBUGS or PyMC. We could always compile a CppBUGS-type specification down to what we're doing. I really like PyMC and see the value of having the BUGS-type variable types (Stochastic/Deterministic in CppBUGS).
Who is this "they" you speak of? Whit Armstrong wrote that whole thing himself. Then Dirk took up the cause of promoting it as an example of using the 'inline' package to write fast Gibbs samplers on-the-fly from R:
That's when I realized it could sometimes be easier (as well as faster) to use something like this than to bother with BUGS. Supercool and sexy, no?
What you guys are doing is also very cool, BTW.
But, X'X is pXp not nXn. Usually, p is small enough that inverting X'X is rather easy. Now, inverting Z'Z is another matter if you have a large number of random effects. But, I digress.
What kind of sampling is CppBUGS using? On cursory inspection (by someone else), it looked like random walk Metropolis, which is what I think PyMC is using as of now (though we know they're planning an auto-dif implementation of HMC in their model language).
If CppBUGS's sampler doesn't compute gradients, would there be a way to specify the sampling density using an R-defined function? That's what Andrew et al. were trying to do with their UMACS package for R. That would be something I could imagine lots of statisticians using.
Michael Malecki is going to help us with Rcpp so we can do something like the example you provided in the link from CppBUGS (i.e. compiling and linking C++ programs specified as R strings). I'm still not sure how useful this will be, as it will still require users to learn enough about C++ to compile a program. BUGS/R is convenient because statisticians already know R and the BUGS models looks just the way they're written in papers. (I actually prefer JAGS these days — it's more tightly integrated with R than WinBUGS and I've switched to working on a Mac.)
Something that'd convince people that CppBUGS is "better" than BUGS for a range of models is doing the kind of thing that Martyn Plummer's done for OpenBUGS and JAGS, as summarized in his blog post How fast is JAGS?. It tackles not only efficiency, but also expressiveness and the robustness questions which make people reluctant to try new packages if they already have one that works well enough.
Feel free to replace "they" or "them" in this message and others with "the author or authors". It's common usage to refer to one or more people of unknown gender as "they", though you'd never be able to sneak it by a New Yorker copy editor.
We're looking at p > 2000 [p is the number of predictors], even with just two-way multilevel interactions from a handful of predictors.
If p were much smaller, we might start thinking about Newton-type methods involving (inverse) Hessians, as described in the recent paper of Girolami and Calderhead. Though even then, the HMC approach was fairly competitive for most of the problems they evaluated and MUCH simpler.
Ouch. But, don't you just need to
invert X'X once? I must be missing something.
Rodney, to get the posterior variance I think you have to invert something like (X'X) + B, where B is a diagonal matrix with elements that are prior variances of the effects. In hierarchical contexts you'd have to do this on every iteration of the chain as B is being updated….on my 16-core 12-gig-of-ram server this inversion takes several seconds with a 2000×2000 matrix!
That said…I guess the models asked about are not hierarchical, but have a constant shrinkage prior. So yep, you should only need to do the inversion once. A good MCMC implementation should be very fast, but I have no idea how BUGS/JAGS handles sampling such linear models.
CppBUGS does look pretty appealing. I wonder how it compared in speed to MCMCglmm, which is reasonably fast and has been what we've been using primarily.
Somewhat hidden in my question is a more philosophical issue of thinking like a Bayesian while selecting penalties/priors based on cross-validation results.