Fast CAR: Two weird tricks for fast conditional autoregressive models in Stan

Max Joseph writes:

Conditional autoregressive (CAR) models are popular as prior distributions for spatial random effects with areal spatial data. Historically, MCMC algorithms for CAR models have benefitted from efficient Gibbs sampling via full conditional distributions for the spatial random effects. But, these conditional specifications do not work in Stan, where the joint density needs to be specified (up to a multiplicative constant).

CAR models can still be implemented in Stan by specifying a multivariate normal prior on the spatial random effects, parameterized by a mean vector and a precision matrix. This works, but is slow and hard to scale to large datasets.

Order(s) of magnitude speedups can be achieved by combining 1) sparse matrix multiplications from Kyle Foreman (outlined on the stan-users mailing list), and 2) a fancy determinant trick from Jin, Xiaoping, Bradley P. Carlin, and Sudipto Banerjee. “Generalized hierarchical multivariate CAR models for areal data.” Biometrics 61.4 (2005): 950-961.

With the oft-used Scotland lip cancer dataset, the sparse CAR implementation with the NUTS (No-U-Turn Sampler) algorithm in Stan gives 120 effective samples/sec compared to 7 effective samples/sec for the precision matrix implementation.

Details for these sparse exact methods can be found here.

Max Joseph is part of the Earth Lab Analytics Hub, University of Colorado – Boulder.

Bob Carpenter adds:

I put Max Joseph’s case study up on our web page. The top-level case studies page is here (with license and links to package dependencies and the case study).

The direct link to the case study is here.

Very cool!

Cool indeed. With this set-up, the implementation of such models has moved from “statistics research” to “statistics practice.”

P.S. I took a look at the document and I have a few questions/comments:

1. The map of Scotland is distorted. No big deal but might as well get that right.

2. I can’t believe that the “tau ~ gamma(0.5, .0005);” prior is a good idea. It just looks weird to me. At the very least I’d suggest a reparameterization to make it scale-free.

3. Is there a way to package up all that code into a function? I don’t have enough experience with Stan functions to be sure, but the idea would be to have a CAR model with all that code inside the function so that, as a user, I could just say theta ~ CAR(…) with all the parameters, and then I wouldn’t have to worry about all those matrix manipulations.

21 thoughts on “Fast CAR: Two weird tricks for fast conditional autoregressive models in Stan

  1. We’d love to see more user-contributed case studies like this one. So far, we have one for PyStan (usying Jupyter) and a bunch for RStan (using knitr), but we’ll take them in any interface you can get to render.

    We’ll blog them here and post them on the Stan web page. Everyone wins, because this is how our users (and developers!) learn to program useful models in Stan. The case studies page explains what we need in the way of license and content (short version is that you can keep copyright and keep source on your own repository, but need to license the content open source so we can distribute it):

    http://mc-stan.org/documentation/case-studies.html

    I’ll put one out on movement HMMs for ecology and on typical sets as soon as I can find the time—they’re both “in process” now.

  2. To all:

    Another advantage of the case studies format is that everything’s all in one place, so an outsider such as myself can quickly look it over and raise some issues. See the P.S. I added to the above post.

  3. 1. The map of Scotland is distorted. No big deal but might as well get that right.

    True – it does look distorted. These are the polygons associated with the CARBayes package which bundles up the lip cancer data as an example, but there doesn’t seem to be any projection information in there.

    2. I can’t believe that the “tau ~ gamma(0.5, .0005);” prior is a good idea. It just looks weird to me. At the very least I’d suggest a reparameterization to make it scale-free.

    This prior is a carry-over from the WinBUGS example – they do seem to stick out like sore thumbs nowadays!

    3. Is there a way to package up all that code into a function? I don’t have enough experience with Stan functions to be sure, but the idea would be to have a CAR model with all that code inside the function so that, as a user, I could just say theta ~ CAR(…) with all the parameters, and then I wouldn’t have to worry about all those matrix manipulations.

    That would be great. I would guess most users would want to input their adjacency matrix rather than computing eigenvalues etc. I’m not sure how to efficiently implement this in Stan, since some of the efficiency gains come from pre-processing the data (e.g., in the transformed parameters block), but a CAR(…) function would be called from within the model block.

  4. It is great to have more case studies written up like this. And it is better to evaluate the log posterior kernel faster than evaluating the same log posterior kernel slower.

    But having an effective sample size that is 20% of the nominal sample size is not great by Stan standards. For the Scottish Lip Cancer example, it is possible to get an effective sample size that is much higher by post-multiplying a row-vector of parameters (that have a standard normal prior) by the inverse of the Cholesky factor of the precision matrix to form phi (transposed) in the transformed parameters block. However, this triangular solve is an O(N^3) operation, so it is slow for large N (which could easily be much larger than in the Scottish Lip Cancer example).

    The case study calculates effective sample size per second of post-warmup time, which, in fairness is exactly what Stan developers have recommended to compare MCMC efficiency. But I think we need to qualify that. Comparing n_eff / second across implementations only makes sense if the concept of effective sample size makes sense (which essentially requires the MCMC Central Limit Theorem to hold) for each implementation being compared. And it requires that our estimates of the effective sample size be reasonable, which often seems not to be the case particularly when the estimated effective sample size is low.

    Matt-tricking phi is “worse” on the estimated n_eff / second metric than the impressive 120 achieved in the case study. But I would feel more comfortable with a larger estimated n_eff and a lower estimated n_eff / second as long as the time it takes to estimate the model is feasible.

    • Thanks for your comments Ben, and your help over the past few years with previous versions of these models. This does seem inefficient compared to what we usually see with Stan with respect to n_eff / n_iter. The main advantage of this approach seems to be scalability. Anecdotally, this scales well to datasets that have 1000+ spatial units, at which point the Matt trick/Cholesky decomposition of the covariance matrix takes far too long.

      Good point about the assumptions behind n_eff – is there a better way to evaluate efficiency, or diagnose whether the estimated effective sample size is reliable?

      • Let me follow up Ben’s comments with some important detail.

        You cannot talk about scalability until you have determined that Stan is recovering a reasonable result. Formally, we need NUTS to be geometrically ergodic with respect to the given posterior so that a Central Limit Theorem holds and we get unbiased MCMC estimators whose variances can be estimated using the effective sample size. If NUTS is not geometrically ergodic then the effective sample size _is a meaningless number_. Stan will compute its estimator and print it, but it will not correspond to any meaning quantity.

        Verifying geometric ergodicity is really, really, really hard and so it often is simply assumed to hold, but we have decades of naively-implemented hierarchical models fit with Gibbs samplers that demonstrate how dangerous this is! Fortunately Stan provides two very powerful diagnostics. The first is the Gelman-Rubin statistic, i.e. Rhat — when calculated from many chains each initialized from random starting points, a lack of geometric ergodicity manifests as Rhat > 1.1 or so. The second are divergences. These are a unique to Hamiltonian Monte Carlo algorithms like NUTS, but they are incredibly sensitive to the kind of pathologies that can obstruct geometric ergodicity and hence valid MCMC estimators.

        So before we even talk about performance and scalability you have to demonstrate validity. _Any_ fit being compared has to have Rhat ~ 1 using multiple chains (the more the better) and no divergences encountered in sampling. None. Do these CAR models satisfy those conditions? If so, then we can talk about performance.

        In particular, the techniques combined into this model implementation speeds up the centered parameterization of the latent Gaussian model, but the centered parameterization is known to be very susceptible to breakdowns in geometric ergodicity. Ben is advocating the non-centered parameterization which needs the Cholesky decomposition and hence is more expensive, but in practice it tends to be much more statistically robust. Poor fits often run much faster than good fits! But, again, speed is irrelevant until the validity of the estimates has been demonstrated.

        tl:dr; Does this fast CAR model have Rhat ~ 1 when using multiple chains and show no divergences in any of those chains? If so then the centered parameterization it uses is appropriate for the data set you’re fitting and this improved implementation will be a huge gain! If not then you’re just scaling a poor fit.

        P.S. There is one other concern, and that is the effective sample size per transition. When Markov chains are mixing slowly the autocorrelations, and hence the effective sample size, can be really hard to accurately estimate, even when there are no concerns about statistical validity. This manifests in n_eff / n_iter < 0.001 or so, which is then a really good indication that you probably can't trust the n_eff estimation and hence any performance metric that depends on it. In practice I am comfortable with 0.01 < n_eff / n_iter < 1, provided that all the other diagnostics look okay.

        • Thanks Michael! This helps to clarify some of the concerns raised by Ben. In this case with the sparse implementation, we have Rhat values that are all <= 1.01, no divergent transitions, and n_eff/n_iter around 0.1 for the log-probability, though for other parameters (e.g., the global intercept beta[1]), n_eff/n_iter was 0.048, approaching the lower end of your comfort zone.

        • The global intercept is a concern, but it could just be due to strong linear correlations (paired with weak identifiability) causing NUTS to terminate prematurely. You can catch this from the pairs plots, but it’ll likely have to be resolved with stronger priors. Certainly get rid of those gamma priors.

  5. Several things:
    1) Tracy Chapman!!!!

    2) A Sparse Matrix in Stan!!!! I’m so happy there is finally a working version of this!

    3) For practical examples, you need to make an intrinsic version work as well (i.e. one where the precision matrix is singular). This is harder. The standard trick when you’re using MH is to propose using some sort of quadratic approximation (the laplace approximation, the matrix that comes from the information matrix etc) and then do a Hastings correction. I’m not completely sure how to do something similar for HMC. The problem is basically that for an intrinsic prior the determinant is zero, whereas the normalising constant is the product of the non-zero eigenvalues (which is not zero! and often depends on an unknown parameter). Multiplying and dividing by the right multivariate Gaussian will fix everything, but we know from MH experience (which may or may not be tranferable to HMC) that this Gaussian should be the taylor expansion of the likelihood at the mode, which usually doesn’t have a closed form or else your sampler behaves badly. Maybe freezing that distribution won’t have as much of a speed penalty with HMC.

    4) The thing that Ben suggested (the non-centred parameterisation) is what you should do for this type of data (see http://homepages.inf.ed.ac.uk/imurray2/pub/10hypers/hypers.pdf)

    5) You should be never compute a full eigen-decomposition of a sparse matrix (it costs N^3 so there’s no point having a sparse matrix). In this case you can work out what the pattern of the symbolic cholesky decomposition and then hard code it into every example. It would be easier if Stan supported sparse matrices directly (and given they’re in Eigen3, I assume it’s somewhere on the “to do list”).

    • Dan:

      1. It’s not like I was looking for a Tracy Chapman reference—actually that’s the only song of hers I can think of—it just jumped out at me given the topic.

      2. The intrinsic CAR model is sooo cool. I did not try to follow your comment in detail but one thing that’s confusing me is . . . it’s the posterior we’re working with, not the prior. Once you have some data, the posterior distribution integrates, and anything Gaussian will have a non-singular precision matrix, right?

      • Surely with Stan you’re always “Talkin’ bout a revolution”…

        It’s mainly that the way that you typically write the prior as a product of conditionals would involve a singular gaussian. If the observations are also gaussian, then you can just compute the posterior precision matrix and use that for both the determinant and the quadratic form. Otherwise, you have a prior (singular) precision matrix and some log-likelihood term. (in which case you need a full eigendecomposition of the precision matrix, which is too expensive for most models; the lip cancer dataset is really small!)

        The easiest way to fix this is to mulitply and divide by a normal approximation to the likelihood. (This allows you to combine this with the prior and have an approximate posterior that is Gaussian and non-singular). But for less sophisticated methods than HMC, this only works well if the normal distribution is a normal approximation to the likelihood for the current values of the parameters. This is usually not available in closed form, so you’d probably need to think harder for Stan

  6. I think there are better ways to evaluate the efficiency, but there is no point in starting a(nother) flame war in the comments section of Andy’s blog. It should suffice to do like 100 chains of the Fast CAR model and see if the standard deviation of the estimated posterior means across chains is similar to the estimated Monte Carlo standard error. However, if the shape of the posterior distribution is problematic for Stan, then there is a large probability that at least one of the 100 chains will go off the rails, which makes comparisons to the Monte Carlo standard error difficult but also means that the estimated n_eff was probably not reasonable in the first place.

  7. There’s a subtle error in the ICAR part of the example: the dimension of the support of the prior is (n-1) rather than n and so the term corresponding to $\tau$ in the log density should be $(n-1)\log(\tau)$.

    I’ve fixed it and made a pull request, so hopefully it’ll be fixed soon. Intrinsic models are annoying.

Leave a Reply to Andrew Cancel reply

Your email address will not be published. Required fields are marked *