Implementing Laplace approximation in Stan: What’s happening under the hood

Bob Carpenter and I have been talking about implementation of the Laplace approximation (based on mode and curvature of the log posterior density on the unconstrained scale) in Stan. As usual, there are lots of details that I don’t think about until the work is mostly done, and then lots more details that I never think about at all. As a public service, Bob prepared this summary.

Before going on, let me explain that the Laplace approximation is an old idea and a simple idea. Here are some advantages, limitations, and challenges of implementing Laplace:

Advantages of Laplace:
– Relatively cheap to compute, given that we already have a mode-finder in Stan.
– Easy to understand, use, and communicate.
– Works reasonably well in many examples (wherever the posterior can be well approximated by a normal distribution).
– Easy to take draws from the normal approximation, also easy to compute importance ratios and use Pareto-smoothed importance sampling.

Limitations of Laplace:
– Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails).
– Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.

These last two points are kind of obvious, in that the Laplace approximation is hundreds of years old and was the standard approach for Bayesian computation all the way through the 1980s, so if it really did the job, we wouldn’t need Stan at all.

In that case, why implement Laplace in Stan at all? The answer: it’s part of workflow. Mode-finding is fast, and we can do a lot of useful model exploration by fitting models fast. And, once you have the mode, a normal approximation centered at the mode can be a useful accessory.
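To make that concrete, here’s the approximation in formulas (just the standard textbook version, nothing Stan-specific): writing theta_hat for the posterior mode on the unconstrained scale and H for the Hessian of the log posterior at that mode,

% Normal approximation centered at the mode, with covariance equal to the
% inverse negative Hessian of the log posterior at the mode:
q(\theta) = \mathrm{normal}\bigl(\theta \mid \hat\theta,\; (-H)^{-1}\bigr),
\qquad H = \nabla^2 \log p(\hat\theta \mid y).

% Importance ratio for a draw theta^(m) from q, known up to a constant and
% usable with Pareto-smoothed importance sampling:
r^{(m)} \propto \exp\bigl(\log p(\theta^{(m)} \mid y) - \log q(\theta^{(m)})\bigr).

These two unnormalized log densities are exactly the log_p and log_q columns that Bob’s code writes out below.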

Finally, the challenges of implementing Laplace:
– We can’t easily and generally compute the second-derivative matrix (the Hessian) using autodiff, so instead we numerically differentiate the gradients (see the sketch after this list).
– In high dimensions, you’re carrying around a K x K matrix, which can just be too big to handle.
– All the programming details (see below).
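To give a sense of that first challenge, here’s a minimal sketch of the idea, assuming we already have a gradient function for the log posterior (in Stan the gradient comes from reverse-mode autodiff, and the actual routine Bob’s code calls below is finite_diff_hessian_auto from the math library; the fixed step size eps here is purely for illustration):

#include <Eigen/Dense>
#include <functional>

// Hypothetical sketch: approximate the Hessian of log p by central finite
// differences of its gradient.  `grad` stands in for an autodiff gradient
// of the log posterior; eps is a fixed step size for illustration only.
Eigen::MatrixXd finite_diff_hessian_sketch(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd&)>& grad,
    const Eigen::VectorXd& theta_hat, double eps = 1e-4) {
  int K = theta_hat.size();
  Eigen::MatrixXd H(K, K);
  for (int k = 0; k < K; ++k) {
    Eigen::VectorXd plus = theta_hat;
    Eigen::VectorXd minus = theta_hat;
    plus(k) += eps;
    minus(k) -= eps;
    // column k approximates the derivative of the gradient w.r.t. theta_k
    H.col(k) = (grad(plus) - grad(minus)) / (2 * eps);
  }
  // symmetrize to clean up finite-difference noise
  return 0.5 * (H + H.transpose());
}

That’s 2K gradient evaluations for K parameters, and the resulting K x K matrix is exactly the object that gets unwieldy in high dimensions (the second challenge above).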

Ok, now to Bob’s report:

After our long design back and forth, I [Bob] thought you might be amused to see what happens next on the coding front.

I [Bob] finished writing the C++ code and the tests for the Laplace approximation. I’d say it took me around 8 hours total to get it feature complete and tested. It would’ve been faster, but I’d never written anything in the service layer before, so I had to look up all the support functions and test functions, such as how to deal with CSV output from the sample writer. I know C++ pretty well, so that wasn’t a problem, but the assumptions surrounding our code base have gotten very deep and are thus very tough for a directionally challenged person like me to navigate. I thought you might be amused to see what the C++ code for a simple function like this looks like and then what the tests look like, so I’ve appended them. The rest of this note describes what the process looks like to get something into Stan.

A change like this requires checking out a new branch of the code from GitHub, doing the modification on that branch, then submitting a pull request to merge that branch back into the development branch of our code. That’s where I’m at right now. We do that so that the development branch of the code always works and is always ready for release and so that we can all work on different parts of the code simultaneously.

After the code and tests are written and the pull request is submitted, continuous integration kicks off tests that ensure we’ve followed basic coding standards, checks that the changes can be merged into the develop branch without conflicts, and then auto-formats the code and pushes the auto-formatted code back to my branch. Then CI kicks off automatic testing on the formatted branch. The CI steps are what Nick (Nicursor Serban) is maintaining; they run on the Flatiron Institute servers now, so we’re saving a few $K/month in Amazon fees. We test everything on multiple compilers for Windows, Linux, and the Mac. If the integration tests fail (e.g., the code won’t compile on Windows, or I relied on some undefined behavior that differs between Linux and the Mac or across compilation levels or arithmetic implementations), then I debug, fix it, and update the pull request, and another round of continuous integration kicks off automatically.

Then, when tests are passing for my submitted branch, another developer with merge permission (the people listed on our web site) goes through all the code and reviews it. They’ll typically request changes, such as how to write clearer doc, how to make matrix operations faster, or how to use built-in functions rather than a one-off. Then we either discuss the requested changes, or I just make them. After the updates, we go back to the top of continuous integration. When the tests pass and all changes are done, it goes back to the original reviewer for another review. This can iterate, but we try not to go through multiple cycles of review. When the reviewer is happy, they approve the changes and merge the code in the pull request into the develop branch of the code.

Before it is “in Stan” in a way that you can use it, we have to repeat this entire process for integration into CmdStan. That requires deciding what file formats to use and what the command-line interface is going to look like in terms of arguments and error handling.

CmdStan also requires the CmdStan user’s guide to be updated, which is its own set of updates and pull requests and code review.

Then when it’s merged into CmdStan, we repeat the whole process again for CmdStanPy. That requires defining what the arguments and defaults are going to look like in Python and what the in-memory data structures are going to be and interacting with the formats defined by CmdStan using files and the operating system. That then turns into a pull request that gets reviewed, and hopefully eventually merged. At least the doc for CmdStanPy is in the same repository, so it doesn’t require multiple pull requests.

Then when CmdStanPy is done, everything we did for CmdStanPy gets done again for CmdStanR, but the interfaces might change slightly to match R coding and interface conventions. This is mainly a matter of translation, but it’s just as much code, testing, review, and documentation as for CmdStanPy.

Then, the next time a release happens (every 3 to 4 months), the code automatically gets merged from the develop branch into the main branch, and gets included in our release. At this point, we try to write release notes so that users know what has changed version to version.

Anyway, that’s what it means to put something “in Stan” and “in the Stan interfaces”. It’s a lot of work, but there’s really no other choice if we want to keep things stable and we want to keep the state of our development going asynchronously with multiple people working on different parts of the code. By my count we need a pull request and code review for: Stan, CmdStan, CmdStan docs, CmdStanPy, and CmdStanR. It’ll require at least me (basic feature), Mitzi (CmdStan and CmdStanPy) and Jonah (CmdStanR) to be involved, plus three additional people to review the pull requests.

So that’s why I was so insistent during the design discussion on getting sign-off from you on exactly what to implement. If it goes all the way to CmdStanR and you’re not happy, we have to go right back to the top of this process and iterate, which is super time consuming, as I’m sure you can imagine given the description.

I’m tired just typing this, but I thought it might help you understand why I’m so obsessed with getting a final design that you are OK with before breaking ground on the code. I’m not at all complaining—just saying that it’s not so trivial to add even a simple algorithm like Laplace approximation.

If it’s just something like a new function to include in Stan, then it’s just a matter of doing a pull request for the math library (C++ code) and a new pull request for the language compiler, stanc (OCaml code) and for the functions reference and optionally the user’s guide. Much easier, but it still requires a lot of complementary skills.

And his code:

=============IMPLEMENTATION===========================

#ifndef STAN_SERVICES_OPTIMIZE_LAPLACE_SAMPLE_HPP
#define STAN_SERVICES_OPTIMIZE_LAPLACE_SAMPLE_HPP

#include <stan/callbacks/interrupt.hpp>
#include <stan/callbacks/logger.hpp>
#include <stan/callbacks/writer.hpp>
#include <stan/math/rev.hpp>
#include <stan/services/error_codes.hpp>
#include <stan/services/util/create_rng.hpp>
#include <string>
#include <type_traits>
#include <vector>

namespace stan {
namespace services {
namespace internal {

template <bool jacobian, typename Model>
void laplace_sample(const Model& model, const Eigen::VectorXd& theta_hat,
   int draws, unsigned int random_seed,
   int refresh, callbacks::interrupt& interrupt,
   callbacks::logger& logger,
   callbacks::writer& sample_writer) {
  if (draws <= 0) {
    throw std::domain_error("Number of draws must be > 0; found draws = "
   + std::to_string(draws));
  }
  
  std::vector<std::string> unc_param_names;
  model.unconstrained_param_names(unc_param_names, false, false);
  int num_unc_params = unc_param_names.size();

  if (theta_hat.size() != num_unc_params) {
    throw std::domain_error("Specified mode is wrong size; expected "
      + std::to_string(num_unc_params)
      + " unconstrained parameters, but specified mode has size = "
      + std::to_string(theta_hat.size()));
  }

  std::vector<std::string> param_tp_gq_names;
  model.constrained_param_names(param_tp_gq_names, true, true);
  size_t draw_size = param_tp_gq_names.size();

  // write names of params, tps, and gqs to sample writer
  std::vector<std::string> names;
  static const bool include_tp = true;
  static const bool include_gq = true;
  model.constrained_param_names(names, include_tp, include_gq);
  names.push_back("log_p");
  names.push_back("log_q");
  sample_writer(names);

  // create log density functor for vars and vals
  std::stringstream log_density_msgs;
  auto log_density_fun
    = [&](const Eigen::Matrix<stan::math::var, -1, 1>& theta) {
    return model.template log_prob<true, jacobian, stan::math::var>(
      const_cast<Eigen::Matrix<stan::math::var, -1, 1>&>(theta),
      &log_density_msgs);
  };

  // calculate inverse negative Hessian's Cholesky factor
  if (refresh > 0) {
    logger.info("Calculating Hessian\n");
  }
  double log_p;  // dummy
  Eigen::VectorXd grad;  // dummy
  Eigen::MatrixXd hessian;
  interrupt();
  math::internal::finite_diff_hessian_auto(log_density_fun, theta_hat, log_p,
  grad, hessian);
  if (refresh > 0) {
    logger.info(log_density_msgs);
    logger.info("\n");
  }

  // calculate Cholesky factor and inverse
  interrupt();
  if (refresh > 0) {
    logger.info("Calculating inverse of Cholesky factor");
    logger.info("\n");
  }
  Eigen::MatrixXd L_neg_hessian = (-hessian).llt().matrixL();
  interrupt();
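  // If -H = L L', then (-H)^{-1} = L^{-T} L^{-1}, so multiplying a standard
  // normal vector z by L^{-T} (computed next as inv_sqrt_neg_hessian) gives a
  // draw with covariance (-H)^{-1}, the covariance of the approximation.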
  Eigen::MatrixXd inv_sqrt_neg_hessian = L_neg_hessian.inverse().transpose();
  interrupt();
  Eigen::MatrixXd half_hessian = 0.5 * hessian;
   
  // generate draws and output to sample writer
  if (refresh > 0) {
    logger.info("Generating draws");
    logger.info("\n");
  }
  boost::ecuyer1988 rng = util::create_rng(random_seed, 0);
  Eigen::VectorXd draw_vec;  // declare draw_vec, msgs here to avoid re-alloc
  for (int m = 0; m < draws; ++m) {
    interrupt();  // allow interruption each iteration
    if (refresh > 0 && m % refresh == 0) {
      logger.info("iteration: ");
      logger.info(std::to_string(m));
      logger.info("\n");
    }
    Eigen::VectorXd z(num_unc_params);
    for (int n = 0; n < num_unc_params; ++n) {
      z(n) = math::std_normal_rng(rng);
    }

    Eigen::VectorXd unc_draw = theta_hat + inv_sqrt_neg_hessian * z;
    double log_p = log_density_fun(unc_draw).val();
    Eigen::VectorXd diff = unc_draw - theta_hat;
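    // unnormalized log density of the approximation at unc_draw:
    //   log normal(x | theta_hat, (-H)^{-1})
    //     = -0.5 * (x - theta_hat)' * (-H) * (x - theta_hat) + const
    //     = (x - theta_hat)' * (0.5 * H) * (x - theta_hat) + const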
    double log_q = diff.transpose() * half_hessian * diff; 
    
    std::stringstream msgs;
    model.write_array(rng, unc_draw, draw_vec, include_tp, include_gq, &msgs);
    if (refresh > 0) {
      logger.info(msgs);
      logger.info("\n");
    }
    std::vector<double> draw(&draw_vec(0), &draw_vec(0) + draw_size);
    draw.push_back(log_p);
    draw.push_back(log_q);
    sample_writer(draw);
  }
}
}  // namespace internal
  
/**
 * Take the specified number of draws from the Laplace approximation
 * for the model at the specified unconstrained mode, writing the
 * draws, unnormalized log density, and unnormalized density of the
 * approximation to the sample writer and writing messages to the
 * logger, returning a return code of zero if successful.
 *
 * Interrupts are called between compute-intensive operations.  To
 * turn off all console messages sent to the logger, set refresh to 0.
 * If an exception is thrown by the model, the return value is
 * non-zero, and if refresh > 0, its message is given to the logger as
 * an error.
 *
 * @tparam jacobian `true` to include Jacobian adjustment for
 * constrained parameters
 * @tparam Model a Stan model
 * @param[in] model model from which to sample
 * @param[in] theta_hat unconstrained mode at which to center the
 * Laplace approximation
 * @param[in] draws number of draws to generate
 * @param[in] random_seed seed for generating random numbers in the
 * Stan program and in sampling  
 * @param[in] refresh period between iterations at which updates are
 * given, with a value of 0 turning off all messages
 * @param[in] interrupt callback for interrupting sampling
 * @param[in,out] logger callback for writing console messages from
 * sampler and from Stan programs
 * @param[in,out] sample_writer callback for writing parameter names
 * and then draws
 * @return a return code, with 0 indicating success
 */
template <bool jacobian, typename Model>
int laplace_sample(const Model& model, const Eigen::VectorXd& theta_hat,
  int draws, unsigned int random_seed,
  int refresh, callbacks::interrupt& interrupt,
  callbacks::logger& logger,
  callbacks::writer& sample_writer) {
  try {
    internal::laplace_sample<jacobian>(model, theta_hat, draws, random_seed,
      refresh, interrupt, logger,
      sample_writer);
    return error_codes::OK;
  } catch (const std::exception& e) {
    if (refresh >= 0) {
      logger.error(e.what());
      logger.error("\n");
    }
  } catch (...) {
    if (refresh >= 0) {
      logger.error("unknown exception during execution\n");
    }
  }
  return error_codes::DATAERR;
}
}  // namespace services
}  // namespace stan
  
#endif

==========================TEST CASE====================
#include <boost/algorithm/string.hpp>
#include <gtest/gtest.h>
#include <stan/callbacks/stream_logger.hpp>
#include <stan/callbacks/stream_writer.hpp>
#include <stan/io/dump.hpp>
#include <stan/io/empty_var_context.hpp>
#include <stan/io/stan_csv_reader.hpp>
#include <stan/services/error_codes.hpp>
#include <stan/services/optimize/laplace_sample.hpp>
#include <test/test-models/good/services/multi_normal.hpp>
#include <test/unit/services/instrumented_callbacks.hpp>
#include <test/unit/util.hpp>
#include <cmath>
#include <iostream>
#include <vector>

class ServicesLaplaceSample: public ::testing::Test {
 public:
  ServicesLaplaceSample() : logger(msgs, msgs, msgs, msgs, msgs) { }

  void SetUp() {
    stan::io::empty_var_context var_context;
    model = new stan_model(var_context);  // typedef from multi_normal.hpp
  }

  void TearDown() { delete model; }

  stan_model* model;
  std::stringstream msgs;
  stan::callbacks::stream_logger logger;
  stan::test::unit::instrumented_interrupt interrupt;
};

TEST_F(ServicesLaplaceSample, values) {
  Eigen::VectorXd theta_hat(2);
  theta_hat << 2, 3;
  int draws = 50000;  // big to enable mean, var & covar test precision
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int return_code = stan::services::laplace_sample<true>(*model, theta_hat,
     draws, seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::OK, return_code);
  std::string samples_str = sample_ss.str();
  EXPECT_EQ(2, count_matches("y", samples_str));
  EXPECT_EQ(1, count_matches("y.1", samples_str));
  EXPECT_EQ(1, count_matches("y.2", samples_str));
  EXPECT_EQ(1, count_matches("log_p", samples_str));
  EXPECT_EQ(1, count_matches("log_q", samples_str));

  std::stringstream out;
  stan::io::stan_csv draws_csv = stan::io::stan_csv_reader::parse(sample_ss, &out);

  EXPECT_EQ(4, draws_csv.header.size());
  EXPECT_EQ("y[1]", draws_csv.header[0]);
  EXPECT_EQ("y[2]", draws_csv.header[1]);
  EXPECT_EQ("log_p", draws_csv.header[2]);
  EXPECT_EQ("log_q", draws_csv.header[3]);

  Eigen::MatrixXd sample = draws_csv.samples;
  EXPECT_EQ(4, sample.cols());
  EXPECT_EQ(draws, sample.rows());
  Eigen::VectorXd y1 = sample.col(0);
  Eigen::VectorXd y2 = sample.col(1);
  Eigen::VectorXd log_p = sample.col(2);
  Eigen::VectorXd log_q = sample.col(3);

  // because target is normal, laplace approx is exact
  for (int m = 0; m < draws; ++m) {
    EXPECT_FLOAT_EQ(0, log_p(m) - log_q(m));
  }

  // expect mean draws to be location params +/0 MC error
  EXPECT_NEAR(2, stan::math::mean(y1), 0.05);
  EXPECT_NEAR(3, stan::math::mean(y2), 0.05);

  double sum1 = 0;
  for (int m = 0; m < draws; ++m) {
    sum1 += std::pow(y1(m) - 2, 2);
  }
  double var1 = sum1 / draws;
  EXPECT_NEAR(1, var1, 0.05);

  double sum2 = 0;
  for (int m = 0; m < draws; ++m) {
    sum2 += std::pow(y2(m) - 3, 2);
  }
  double var2 = sum2 / draws;
  EXPECT_NEAR(1, var2, 0.05);

  double sum12 = 0;
  for (int m = 0; m < draws; ++m) {
    sum12 += (y1(m) - 2) * (y2(m) - 3);
  }
  double cov = sum12 / draws;
  EXPECT_NEAR(0.8, cov, 0.05);
}

TEST_F(ServicesLaplaceSample, wrongSizeModeError) {
  Eigen::VectorXd theta_hat(3);
  theta_hat << 2, 3, 4;
  int draws = 10;
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int RC = stan::services::laplace_sample<true>(*model, theta_hat, draws, seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::DATAERR, RC);
}

TEST_F(ServicesLaplaceSample, nonPositiveDrawsError) {
  Eigen::VectorXd theta_hat(2);
  theta_hat << 2, 3;
  int draws = 0;
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int RC = stan::services::laplace_sample<true>(*model, theta_hat, draws, seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::DATAERR, RC);
}

6 thoughts on “Implementing Laplace approximation in Stan: What’s happening under the hood”

  1. Nice!

    Out of curiosity: given how clean Stan syntax is, what are the benefits and detriments of implementing this natively vs. writing a program to parse and translate *.stan files for use by e.g. scipy.optimize.minimize in Python or optimx::optimx (or stats::optim) in R (either relying on base functions for e.g. distributions, or calling a few extra packages)?

    • Nikolai:

      I think that Bob built a version of Stan where the function and gradient evaluations can be called directly from Python so I guess that would be the way to go, if you wanted to do it that way. At some point if there are better inference algorithms in Python, maybe I’d learn the language so I could write models in Stan and fit them faster this way. Or maybe if I want to keep using R, someone would write some R package that calls Python calling Stan, I dunno . . .

      • Andrew’s talking about BridgeStan. That started with Daniel Lee and Dan Muck, then I completed their interface, then Edward Roualdes solved the C interfacing bottleneck, then we extended it to R and Julia with Brian Ward and Seth Axen’s help. I’m sure I’m missing people given that people have been jumping in to help with BridgeStan.

        BridgeStan is the inverse of what Nikolai’s asking for, which would be a way to have a Stan program compile down to Python code. On the other hand, Andrew’s right that BridgeStan would give you a way to replace Stan’s built-in L-BFGS with an arbitrary gradient-based (or even Hessian-based) optimizer in Python.

        The direct answer to Nikolai’s question is that the language transpiler team generalized the front-end parser to have a middle intermediate representation layer below which we can target different back ends. There’s a prototype TensorFlow implementation that was started by Adam Haber. We’ve been thinking about targeting JAX, but it’s a pretty large set of back-end math calls for the language with all of our distributions, CDFs, etc., as well as all of the data slicing, loops, user-defined functions, and transforms.

  2. Why can’t you easily compute the Hessian by autodiff? I’m surprised y’all are using numerical differentiation; I’m of the impression that numerical differentiation is a pretty fragile operation and am surprised that you can use it for a matrix that you then have to invert and Cholesky decompose. Also, how does Stan choose a step size for numerical differentiation?
