Manipulating a machine-learning method by feeding it doctored training data

A correspondent who usually prefers to remain anonymous points us to this article by Ilia Shumailov et al. which states:

The data-driven nature of modern machine learning (ML) training routines puts pressure on data supply pipelines, which become increasingly more complex. . . .

This emergent complexity gives a perfect opportunity for an attacker to disrupt ML training, while remaining covert. In the case of stochastic gradient descent (SGD), it assumes uniform random sampling of items from the training dataset, yet in practice this randomness is rarely tested or enforced. Here, we focus on adversarial data sampling. . . . by simply changing the order in which batches or data points are supplied to a model during training, an attacker can affect model behaviour. . . .

They focus on the idea that “it is possible to perform integrity and availability attacks without adding or modifying any data points,” but to me that doesn’t seem like such a big deal. Once you can get in and modify the order of the data, you should be able to change the data too, no?

Kinda like with Brian “Pizzagate” Wansink: Once you accept that he can pick and choose which data to include in his papers, it’s not such a big step to suppose that he can just make up an entire experiment.

That issue aside, this work sounds super-interesting to me, especially in its connection to generalization and poststratification (differences between the training and test data), as discussed for example here and here.
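To see concretely why ordering matters: with a finite learning rate, stochastic gradient descent is not invariant to the order in which batches are visited. Here's a toy sketch (plain Python, my own illustration rather than anything from the paper) in which the same ten batches, visited in two different orders, leave the parameter at slightly different values:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)
batches = [slice(i, i + 10) for i in range(0, 100, 10)]

def sgd_one_epoch(order, lr=0.2, theta=0.0):
    # one pass of SGD for least squares, visiting the batches in the given order
    for b in order:
        xb, yb = x[batches[b]], y[batches[b]]
        grad = -2 * np.mean(xb * (yb - theta * xb))  # d/dtheta of mean squared error
        theta -= lr * grad
    return theta

print(sgd_one_epoch(order=range(10)))            # batches in natural order
print(sgd_one_epoch(order=reversed(range(10))))  # same batches, reversed: a different endpoint

An attacker who controls only the sampling order is exploiting exactly this non-commutativity, just more systematically.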

Implementing Laplace approximation in Stan: What’s happening under the hood

Bob Carpenter and I have been talking about implementation of the Laplace approximation (based on mode and curvature of the log posterior density on the unconstrained scale) in Stan. As usual, there are lots of details that I don’t think about until the work is mostly done, and then lots more details that I never think about at all. As a public service, Bob prepared this summary.

Before going on, let me explain that the Laplace approximation is an old idea and a simple idea. Here are some advantages, limitations, and challenges of implementing Laplace:

Advantages of Laplace:
– Relatively cheap to compute, given that we already have a mode-finder in Stan.
– Easy to understand, use, and communicate.
– Works reasonably well in many examples (wherever the posterior can be well approximated by a normal distribution).
– Easy to take draws from the normal approximation, also easy to compute importance ratios and use Pareto-smoothed importance sampling.

Limitations of Laplace:
– Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails).
– Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.

These last two points are kind of obvious, in that the Laplace approximation is hundreds of years old and was the standard approach for Bayesian computation all the way through the 1980s, so if it really did the job, we wouldn’t need Stan at all.

In that case, why implement Laplace in Stan at all? The answer: it's part of workflow. Mode-finding is fast, and we can do a lot of useful model exploration by fitting models fast. And, once you have the mode, a normal approximation centered at the mode can be a useful accessory.
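Here's a minimal sketch of the whole recipe in Python (my own illustration, not Stan's code): find the mode, approximate the Hessian by finite differences of the gradient, draw from the resulting normal approximation, and compute the log importance ratios that Pareto-smoothed importance sampling would use.

import numpy as np
from scipy import optimize

def log_post(theta):
    # toy unnormalized log posterior; in Stan this is the model's log density
    # on the unconstrained scale
    return -0.5 * np.sum((theta - np.array([1.0, -2.0])) ** 2 / np.array([1.0, 4.0]))

def grad(theta, eps=1e-6):
    # central finite-difference gradient (the real implementation uses autodiff gradients)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (log_post(theta + e) - log_post(theta - e)) / (2 * eps)
    return g

# 1. find the mode
mode = optimize.minimize(lambda t: -log_post(t), x0=np.zeros(2)).x

# 2. Hessian by finite differences of the gradient
eps = 1e-4
cols = []
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    cols.append((grad(mode + e) - grad(mode - e)) / (2 * eps))
H = np.column_stack(cols)

# 3. draws from N(mode, (-H)^{-1}) via the Cholesky factor of -H
L = np.linalg.cholesky(-H)
z = np.random.default_rng(1).standard_normal((1000, 2))
draws = mode + z @ np.linalg.inv(L)  # each row has covariance (-H)^{-1}

# 4. log importance ratios (target minus approximation), ready for PSIS
log_q = -0.5 * np.einsum("ij,jk,ik->i", draws - mode, -H, draws - mode)
log_p = np.array([log_post(d) for d in draws])
log_ratios = log_p - log_q  # e.g., pass to a PSIS routine such as ArviZ's psislw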

Finally, the challenges of implementing Laplace:
– We can’t easily and generally compute the second-derivative matrix using autodiff, so instead we numerically differentiate the gradients.
– In high dimensions, you’re carrying around a K x K matrix, which can just be too big to handle.
– All the programming details (see below).

Ok, now to Bob’s report:

After our long design back and forth, I [Bob] thought you might be amused to see what happens next on the coding front.

I [Bob] finished writing the C++ code and the tests for Laplace approximation. I’d say it took me around 8 hours total to get it feature complete and tested. It would’ve been faster, but I never wrote anything in the service layer before, so I had to look up all the support functions and test functions, such as how to deal with CSV output from the sample writer. I know C++ pretty well, so that wasn’t a problem, but the assumptions surrounding our code base have gotten very deep and thus are very tough for a directionally challenged person like me to navigate. I thought you might be amused to see what the C++ code for a simple function like this looks like and then what the tests look like, so I’ve appended them. The rest of this note is a description of what the process looks like to get something into Stan.

A change like this requires checking out a new branch of the code from GitHub, doing the modification on that branch, then submitting a pull request to merge that branch back into the development branch of our code. That’s where I’m at right now. We do that so that the development branch of the code always works and is always ready for release and so that we can all work on different parts of the code simultaneously.

After the code and tests are written and the pull request is submitted, continuous integration kicks off tests that ensure we've followed basic coding standards, checks that the changes can be merged into the develop branch without conflicts, and then auto-formats the code and pushes the auto-formatted code back to my branch. Then CI kicks off automatic testing on the formatted branch. The CI steps are what Nick (Nicursor Serban) is maintaining—they run on the Flatiron Institute servers now, so we're saving a few $K/month in Amazon fees. We test everything on multiple compilers for Windows, Linux, and Mac. If the integration tests fail (e.g., it won't compile on Windows, or I relied on some undefined behavior that differs between Linux and the Mac or across compilation levels or arithmetic implementations), then I debug, fix it, and update the pull request, and another round of continuous integration kicks off automatically.

Then, when tests are passing for my submitted branch, another developer with merge permission (the people listed on our web site) goes through all the code and reviews it. They'll request changes, such as how to write clearer doc, how to make matrix operations faster, or how to use built-in functions rather than a one-off. Then we either discuss the requested changes, or I just make them. After the updates, we go back to the top of continuous integration. When the tests pass and all changes are done, it goes back to the original reviewer for another review. This can iterate, but we try not to go through multiple cycles of review. When the reviewer is happy, they approve the changes and merge the code in the pull request into the develop branch of the code.

Before it is “in Stan” in a way that you can use it, we have to repeat this entire process for integration into CmdStan. That requires deciding what file formats to use and what the command-line interface is going to look like in terms of arguments and error handling.

CmdStan also requires the CmdStan user’s guide to be updated, which is its own set of updates and pull requests and code review.

Then when it’s merged into CmdStan, we repeat the whole process again for CmdStanPy. That requires defining what the arguments and defaults are going to look like in Python and what the in-memory data structures are going to be and interacting with the formats defined by CmdStan using files and the operating system. That then turns into a pull request that gets reviewed, and hopefully eventually merged. At least the doc for CmdStanPy is in the same repository, so it doesn’t require multiple pull requests.

Then when CmdStanPy is done, everything we did for CmdStanPy gets done again for CmdStanR, but the interfaces might change slightly to match R coding and interface conventions. This is mainly a matter of translation, but it's just as much code, testing, review, and documentation as for CmdStanPy.

Then, the next time a release happens (every 3 to 4 months), the code automatically gets merged from the develop branch into the main branch, and gets included in our release. At this point, we try to write release notes so that users know what has changed version to version.

Anyway, that’s what it means to put something “in Stan” and “in the Stan interfaces”. It’s a lot of work, but there’s really no other choice if we want to keep things stable and we want to keep the state of our development going asynchronously with multiple people working on different parts of the code. By my count we need a pull request and code review for: Stan, CmdStan, CmdStan docs, CmdStanPy, and CmdStanR. It’ll require at least me (basic feature), Mitzi (CmdStan and CmdStanPy) and Jonah (CmdStanR) to be involved, plus three additional people to review the pull requests.

So that’s why I was so insistent during the design discussion to get a sign off from you on what exactly to implement. If it goes all the way to CmdStanR and you’re not happy, we have to go right back to the top of this process and iterate, which is super time consuming, as I’m sure you can imagine given the description.

I’m tired just typing this, but I thought it might help you understand why I’m so obsessed with getting a final design that you are OK with before breaking ground on the code. I’m not at all complaining—just saying that it’s not so trivial to add even a simple algorithm like Laplace approximation.

If it’s just something like a new function to include in Stan, then it’s just a matter of doing a pull request for the math library (C++ code) and a new pull request for the language compiler, stanc (OCaml code) and for the functions reference and optionally the user’s guide. Much easier, but it still requires a lot of complementary skills.

And his code:

=============IMPLEMENTATION===========================

#ifndef STAN_SERVICES_OPTIMIZE_LAPLACE_SAMPLE_HPP
#define STAN_SERVICES_OPTIMIZE_LAPLACE_SAMPLE_HPP

#include <stan/callbacks/interrupt.hpp>
#include <stan/callbacks/logger.hpp>
#include <stan/callbacks/writer.hpp>
#include <stan/math/rev.hpp>
#include <stan/services/error_codes.hpp>
#include <stan/services/util/create_rng.hpp>
#include <string>
#include <type_traits>
#include <vector>

namespace stan {
namespace services {
namespace internal {

template <bool jacobian, typename Model>
void laplace_sample(const Model& model, const Eigen::VectorXd& theta_hat,
   int draws, unsigned int random_seed,
   int refresh, callbacks::interrupt& interrupt,
   callbacks::logger& logger,
   callbacks::writer& sample_writer) {
  if (draws <= 0) {
    throw std::domain_error("Number of draws must be > 0; found draws = "
   + std::to_string(draws));
  }
  
  std::vector<std::string> unc_param_names;
  model.unconstrained_param_names(unc_param_names, false, false);
  int num_unc_params = unc_param_names.size();

  if (theta_hat.size() != num_unc_params) {
    throw std::domain_error("Specified mode is wrong size; expected "
    + std::to_string(num_unc_params)
    + " unconstrained parameters, but specified mode has size = "
    + std::to_string(theta_hat.size()));
    
  }

  std::vector<std::string> param_tp_gq_names;
  model.constrained_param_names(param_tp_gq_names, true, true);
  size_t draw_size = param_tp_gq_names.size();

  // write names of params, tps, and gqs to sample writer
  std::vector<std::string> names;
  static const bool include_tp = true;
  static const bool include_gq = true;
  model.constrained_param_names(names, include_tp, include_gq);
  names.push_back("log_p");
  names.push_back("log_q");
  sample_writer(names);

  // create log density functor for vars and vals
  std::stringstream log_density_msgs;
  auto log_density_fun
    = [&](const Eigen::Matrix<stan::math::var, -1, 1>& theta) {
    return model.template log_prob<true, jacobian, stan::math::var>(
      const_cast<Eigen::Matrix<stan::math::var, -1, 1>&>(theta),
      &log_density_msgs);
  };

  // calculate inverse negative Hessian's Cholesky factor
  if (refresh > 0) {
    logger.info("Calculating Hessian\n");
  }
  double log_p;  // dummy
  Eigen::VectorXd grad;  // dummy
  Eigen::MatrixXd hessian;
  interrupt();
  math::internal::finite_diff_hessian_auto(log_density_fun, theta_hat, log_p,
  grad, hessian);
  if (refresh > 0) {
    logger.info(log_density_msgs);
    logger.info("\n");
  }

  // calculate Cholesky factor and inverse
  interrupt();
  if (refresh > 0) {
    logger.info("Calculating inverse of Cholesky factor");
    logger.info("\n");
  }
  Eigen::MatrixXd L_neg_hessian = (-hessian).llt().matrixL();
  interrupt();
  Eigen::MatrixXd inv_sqrt_neg_hessian = L_neg_hessian.inverse().transpose();
  interrupt();
  Eigen::MatrixXd half_hessian = 0.5 * hessian;
   
  // generate draws and output to sample writer
  if (refresh > 0) {
    logger.info("Generating draws");
    logger.info("\n");
  }
  boost::ecuyer1988 rng = util::create_rng(random_seed, 0);
  Eigen::VectorXd draw_vec;  // declare draw_vec, msgs here to avoid re-alloc
  for (int m = 0; m < draws; ++m) {
    interrupt();  // allow interruption each iteration
    if (refresh > 0 && m % refresh == 0) {
      logger.info("iteration: ");
      logger.info(std::to_string(m));
      logger.info("\n");
    }
    Eigen::VectorXd z(num_unc_params);
    for (int n = 0; n < num_unc_params; ++n) {
      z(n) = math::std_normal_rng(rng);
    }

    Eigen::VectorXd unc_draw = theta_hat + inv_sqrt_neg_hessian * z;
    double log_p = log_density_fun(unc_draw).val();
    Eigen::VectorXd diff = unc_draw - theta_hat;
    double log_q = diff.transpose() * half_hessian * diff; 
    
    std::stringstream msgs;
    model.write_array(rng, unc_draw, draw_vec, include_tp, include_gq, &msgs);
    if (refresh > 0) {
      logger.info(msgs);
      logger.info("\n");
    }
    std::vector<double> draw(&draw_vec(0), &draw_vec(0) + draw_size);
    draw.push_back(log_p);
    draw.push_back(log_q);
    sample_writer(draw);
  }
}
}  // namespace internal
  
/**
 * Take the specified number of draws from the Laplace approximation
 * for the model at the specified unconstrained mode, writing the
 * draws, unnormalized log density, and unnormalized log density of the
 * approximation to the sample writer and writing messages to the
 * logger, returning a return code of zero if successful.
 *
 * Interrupts are called between compute-intensive operations.  To
 * turn off all console messages sent to the logger, set refresh to 0.
 * If an exception is thrown by the model, the return value is
 * non-zero, and if refresh > 0, its message is given to the logger as
 * an error.
 *
 * @tparam jacobian `true` to include Jacobian adjustment for
 * constrained parameters
 * @tparam Model a Stan model
 * @param[in] model model from which to sample
 * @param[in] theta_hat unconstrained mode at which to center the
 * Laplace approximation  
 * @param[in] draws number of draws to generate
 * @param[in] random_seed seed for generating random numbers in the
 * Stan program and in sampling  
 * @param[in] refresh period between iterations at which updates are
 * given, with a value of 0 turning off all messages
 * @param[in] interrupt callback for interrupting sampling
 * @param[in,out] logger callback for writing console messages from
 * sampler and from Stan programs
 * @param[in,out] sample_writer callback for writing parameter names
 * and then draws
 * @return a return code, with 0 indicating success
 */
template <bool jacobian, typename Model>
int laplace_sample(const Model& model, const Eigen::VectorXd& theta_hat,
  int draws, unsigned int random_seed,
  int refresh, callbacks::interrupt& interrupt,
  callbacks::logger& logger,
  callbacks::writer& sample_writer) {
  try {
    internal::laplace_sample<jacobian>(model, theta_hat, draws, random_seed,
      refresh, interrupt, logger,
      sample_writer);
    return error_codes::OK;
  } catch (const std::exception& e) {
    if (refresh >= 0) {
      logger.error(e.what());
      logger.error("\n");
    }
  } catch (...) {
    if (refresh >= 0) {
      logger.error("unknown exception during execution\n");
    }
  }
  return error_codes::DATAERR;
}
}  // namespace services
}  // namespace stan
  
#endif

==========================TEST CASE====================
#include <boost/algorithm/string.hpp>
#include <gtest/gtest.h>
#include <stan/callbacks/stream_logger.hpp>
#include <stan/callbacks/stream_writer.hpp>
#include <stan/io/dump.hpp>
#include <stan/io/empty_var_context.hpp>
#include <stan/io/stan_csv_reader.hpp>
#include <stan/services/error_codes.hpp>
#include <stan/services/optimize/laplace_sample.hpp>
#include <test/test-models/good/services/multi_normal.hpp>
#include <test/unit/services/instrumented_callbacks.hpp>
#include <test/unit/util.hpp>
#include <cmath>
#include <iostream>
#include <vector>

class ServicesLaplaceSample: public ::testing::Test {
 public:
  ServicesLaplaceSample() : logger(msgs, msgs, msgs, msgs, msgs) { }

  void SetUp() {
    stan::io::empty_var_context var_context;
    model = new stan_model(var_context);  // typedef from multi_normal.hpp
  }

  void TearDown() { delete model; }

  stan_model* model;
  std::stringstream msgs;
  stan::callbacks::stream_logger logger;
  stan::test::unit::instrumented_interrupt interrupt;
};

TEST_F(ServicesLaplaceSample, values) {
  Eigen::VectorXd theta_hat(2);
  theta_hat << 2, 3;
  int draws = 50000;  // big to enable mean, var & covar test precision
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int return_code = stan::services::laplace_sample<true>(*model, theta_hat,
     draws, seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::OK, return_code);
  std::string samples_str = sample_ss.str();
  EXPECT_EQ(2, count_matches("y", samples_str));
  EXPECT_EQ(1, count_matches("y.1", samples_str));
  EXPECT_EQ(1, count_matches("y.2", samples_str));
  EXPECT_EQ(1, count_matches("log_p", samples_str));
  EXPECT_EQ(1, count_matches("log_q", samples_str));

  std::stringstream out;
  stan::io::stan_csv draws_csv = stan::io::stan_csv_reader::parse(sample_ss, &out);

  EXPECT_EQ(4, draws_csv.header.size());
  EXPECT_EQ("y[1]", draws_csv.header[0]);
  EXPECT_EQ("y[2]", draws_csv.header[1]);
  EXPECT_EQ("log_p", draws_csv.header[2]);
  EXPECT_EQ("log_q", draws_csv.header[3]);

  Eigen::MatrixXd sample = draws_csv.samples;
  EXPECT_EQ(4, sample.cols());
  EXPECT_EQ(draws, sample.rows());
  Eigen::VectorXd y1 = sample.col(0);
  Eigen::VectorXd y2 = sample.col(1);
  Eigen::VectorXd log_p = sample.col(2);
  Eigen::VectorXd log_q = sample.col(3);

  // because target is normal, laplace approx is exact
  for (int m = 0; m < draws; ++m) {
    EXPECT_FLOAT_EQ(0, log_p(m) - log_q(m));
  }

  // expect mean draws to be location params +/- MC error
  EXPECT_NEAR(2, stan::math::mean(y1), 0.05);
  EXPECT_NEAR(3, stan::math::mean(y2), 0.05);

  double sum1 = 0;
  for (int m = 0; m < draws; ++m) {
    sum1 += std::pow(y1(m) - 2, 2);
  }
  double var1 = sum1 / draws;
  EXPECT_NEAR(1, var1, 0.05);

  double sum2 = 0;
  for (int m = 0; m < draws; ++m) {
    sum2 += std::pow(y2(m) - 3, 2);
  }
  double var2 = sum2 / draws;
  EXPECT_NEAR(1, var2, 0.05);

  double sum12 = 0;
  for (int m = 0; m < draws; ++m) {
    sum12 += (y1(m) - 2) * (y2(m) - 3);
  }
  double cov = sum12 / draws;
  EXPECT_NEAR(0.8, cov, 0.05);
}

TEST_F(ServicesLaplaceSample, wrongSizeModeError) {
  Eigen::VectorXd theta_hat(3);
  theta_hat << 2, 3, 4;
  int draws = 10;
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int RC = stan::services::laplace_sample<true>(*model, theta_hat, draws, seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::DATAERR, RC);
}

TEST_F(ServicesLaplaceSample, nonPositiveDrawsError) {
  Eigen::VectorXd theta_hat(2);
  theta_hat << 2, 3;
  int draws = 0;
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int RC = stan::services::laplace_sample<true>(*model, theta_hat, draws, seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::DATAERR, RC);
}

Free Bayesian Data Analysis course

We're organizing a free Bayesian Data Analysis course targeted at the Global South and other underrepresented groups. This is the third rendition of the BDA GSU course. Please see more information and the link to the registration form at the course web page.

The course is based on BDA3 book and BDA course at Aalto. All course material is freely available.

This is not the easiest Bayes course. The registration form requires you to answer some prerequisite questions. The web page has recommendations for easier material.

As all the material is free, you can choose to study at your own pace. We recommend registering and following the common schedule to benefit from the support of your peers and TAs.

If you want to volunteer to be a TA for the course, the course web page also has a link to TA registration.

The head TA is Meenal Jhajharia, who also did a great job in 2022.

The course is supported by the Stan governing body and NumFOCUS, and Eduflow is supporting us by providing a free license for the Peergrade tool.

Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models

Paul, Jonah, and Aki write:

Cross-validation can be used to measure a model’s predictive accuracy for the purpose of model comparison, averaging, or selection. Standard leave-one-out cross-validation (LOO-CV) requires that the observation model can be factorized into simple terms, but a lot of important models in temporal and spatial statistics do not have this property or are inefficient or unstable when forced into a factorized form. We derive how to efficiently compute and validate both exact and approximate LOO-CV for any Bayesian non-factorized model with a multivariate normal or Student-t distribution on the outcome values. We demonstrate the method using lagged simultaneously autoregressive (SAR) models as a case study.
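The key building block for the multivariate normal case is a standard identity: given the precision matrix K = Sigma^{-1}, the conditional distribution of one outcome given all the others is available in closed form, so pointwise leave-one-out predictive densities can be computed without refitting the model. Here's a minimal numpy sketch of that identity (my own illustration of the exact-LOO building block, not the paper's full importance-sampling machinery):

import numpy as np
from scipy.stats import norm

def loo_logpdf_mvn(y, mu, Sigma):
    # pointwise log p(y_i | y_{-i}) for y ~ N(mu, Sigma)
    K = np.linalg.inv(Sigma)          # precision matrix
    g = K @ (y - mu)
    cond_var = 1.0 / np.diag(K)       # Var(y_i | y_{-i}) = 1 / K_ii
    cond_mean = y - g * cond_var      # E(y_i | y_{-i}) = y_i - [K (y - mu)]_i / K_ii
    return norm.logpdf(y, loc=cond_mean, scale=np.sqrt(cond_var))

# toy example with a made-up 3x3 covariance matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)
print(loo_logpdf_mvn(rng.normal(size=3), np.zeros(3), Sigma))

In the Bayesian setting these conditional log densities are evaluated at each posterior draw of the mean and covariance and then combined via importance sampling, which is where the paper's contribution comes in.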

Aki’s post from last year, “Moving cross-validation from a research idea to a routine step in Bayesian data analysis,” connects this to the bigger picture.

Our first in-person StanCon since 2019 will be held at Washington University in St. Louis!

Our first in-person StanCon since 2019 will be held at Washington University in St. Louis. Registration and schedule coming soon!

Tutorials and workshops: Tuesday and Wednesday, June 20-21, 2023.

Conference: Thursday and Friday, June 22-23, 2023.

Tutorials: we're seeking proposals for half-day or full-day tutorials. These can either be focused on foundations of Stan programming and Bayesian inference or on advanced topics for a more specialized audience. Proposals must describe the content and objectives of the tutorial.

Contributed talks: if you’d like to present your work at StanCon 2023, we’re looking for people to talk either in the general session or poster session.
If you have any questions or are considering submitting a proposal, feel free to reach out to [email protected].

Sponsorship: We can’t do this without the support of generous sponsors that support our events! If you’re interested in sponsoring StanCon, let’s talk! ([email protected]).

More details can be found on the Stan Forums.

The conference is organized by Debashis Mondal (Washington University in St. Louis), Eric Ward (University of Washington & NOAA), Charles Margossian (Flatiron Institute), Vianey Leos Barajas (University of Toronto), and Yi Zhang (Sage Therapeutics, Inc). The affiliations of the organizers already indicate the mix of Stan users and developers in academia, business, nonprofits, and government.

Some previous StanCon materials are here.

Another physically motivated sampler: Microcanonical HMC

This time it’s astrophysicists rather than quantum physicists getting in on the sampling action.

Microcanonical HMC

Uroš Seljak, an astrophysics professor at UC Berkeley, and collaborators developed a form of Hamiltonian Monte Carlo (HMC) sampling with an alternative treatment of the energy, related to underdamped Langevin dynamics. Here's the link to the arXiv paper.

Uros presented a preliminary version last month when he visited Flatiron Institute for our workshop on measure transport, diffusion processes, and sampling (a topic which was way more popular than my co-organizer Michael Albergo and I anticipated).

Meaningful evaluation vs. NUTS

I like that the microcanonical HMC paper demonstrates an auto-tuning scheme that shows impressive improvements over the no-U-turn sampler from Stan rather than using vanilla Hamiltonian Monte Carlo as a baseline. Vanilla HMC is very difficult to tune to work and even harder to tune for efficiency, especially without integration time jittering (step size or number of steps).

Langevin diffusions everywhere

I’m seeing Langevin diffusions everywhere these days. Chirag Modi, a joint astro and math postdoc here at Flatiron Institute who did his Ph.D. with Uros, Alex Barnett, Edward Roualdes and I are working on mashing up our recent work on delayed rejection for HMC for multiscale distributions with Radford Neal’s latest partial momentum refresh Langevin sampler, with a dash of parallel auto-adaptation courtesy of Matt Hoffman and Pavel Sountsov’s latest sampler, MEADS. This is the project that motivated Edward to start BridgeStan.

The field is heating up

The field of NUTS competitors is continuing to heat up (physical systems pun purely coincidental). Stay tuned for the results of BridgeStan implementations with more extensive evaluations based on posteriordb.

Research fellow, postdoc, and doctoral student positions at Aalto University, Finland

We’re looking for research fellows, postdocs, and doctoral students for projects in

  • Bayesian workflows for iterative model building and networks of models (Proj. 7, Aalto University)
  • Evaluating and improving posterior inference for difficult posteriors
    (Proj. F9, FCAI/Aalto/Helsinki with Prof. Arto Klami)
  • Workflows for better priors (Proj. F19, FCAI/Aalto/Helsinki with Prof. Arto Klami)

See the abstracts below.

There are also many other topics in probabilistic modeling, ML, and AI at Aalto University and the University of Helsinki.

All topics and how to apply at

You can ask me (Aki) for more information.

Aalto University and the University of Helsinki have a strong Bayesian/ML/AI community. We contribute to open source software packages like Stan and ArviZ. Aalto pays postdocs well compared to many other countries. We have plenty of travel funds. Finland is a great place to live, with or without family. It is a safe, politically stable, and well-organized society, where equality is highly valued and corruption is low. Extensive social security supports people in all situations of life. Occupational and national public healthcare in Finland are great and free. You can manage well in work and everyday life with English (no need to learn Finnish unless you want to). Finland has been ranked the happiest country in the world in 2018–2021.

Topic: Bayesian workflows for iterative model building and networks of models

We formalize and develop theory and diagnostics for iterative Bayesian model building. The practical workflow recommendations and diagnostics guide the modeler through the appropriate steps to ensure safe iterative model building, or indicate when the modeler is likely to be in the danger zone.

Topic: Evaluating and improving posterior inference for difficult posteriors

Both MCMC and distributional approximations often struggle to handle complex posteriors, but we lack good tools for understanding how and why. We study diagnostics for identifying the nature of the computational difficulty, e.g. whether the difficulty is caused by narrow funnels or strong curvature. We also develop improved inference algorithms, e.g. via automated and semi-automated transformations.

Topic: Workflows for better priors

Bayesian models rely on prior distributions that encode knowledge about the problem, but specifying good priors is often difficult in practice. We are working on multiple fronts on making it easier, with contributions to e.g. prior elicitation, prior diagnostics, prior checking, and specification of priors in predictive spaces.

Update 4 – World Cup Qatar 2022 predictions (semifinals and winning probabilities)

Time for our last update! The Qatar 2022 World Cup is progressing fast, and only four teams – Argentina, France, Croatia and Morocco – are still in contention for the final victory. Who will be the winner on December 18th? Is our model better than Paul the Octopus, an almost perfect oracle during the 2010 World Cup?

Semifinals predictions

We report in the table below the posterior predictive match probabilities from our DIBP model – take a look here and here for other updates – for the two semifinals planned for Tuesday, December 13 and Wednesday, December 14, Argentina-Croatia and France-Morocco, respectively. We also report the usual ppd 'chessboard plots' for the exact outcomes in gray-scale color.

Notes: 'mlo' in the table denotes the 'most likely result', whereas darker regions in the plots correspond to more likely results. The first team listed in each sub-title is the 'favorite' (x-axis), whereas the second team is the 'underdog' (y-axis). The 2-way grid displays the 2 held-out matches in such a way that the closest match appears in the left panel of the grid, whereas the most unbalanced match ('blowout') appears in the right panel.

 

France and Argentina seem clearly ahead of Croatia and Morocco, respectively. Still, underdogs such as Morocco have a non-negligible chance – approximately 35% – of beating France and advancing to the final: consider that Morocco kept two clean sheets in the round of 16 and quarterfinal matches, against Spain and Portugal, respectively! Croatia already reached the final four years ago, so maybe it should not be considered a pure underdog…and Luka Modric, Croatia's captain, is still one of the best players in the world.

Note: keep in mind that the above predictions refer to regular time, not to extra time! Still, to get an approximate probability of advancing to the final, say for the favorite team, one could compute: favorite probability + 0.5*draw probability. The same could be done for the underdog team. In this way, with no further assumptions, the regular-time draw probability is split equally between the two teams in the event of extra time.
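In code, that back-of-the-envelope adjustment is just the following (a tiny sketch; the numbers are placeholders, not the model's output):

def advance_prob(p_win, p_draw):
    # approximate probability of reaching the final: win in regular time,
    # plus half the draw probability (extra time treated as a coin flip)
    return p_win + 0.5 * p_draw

print(advance_prob(p_win=0.55, p_draw=0.25))  # 0.675, with made-up inputs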

World Cup winning probabilities

We also provide some World Cup winning probabilities for the four teams, based on some forward simulations of the tournament.

The results are somewhat surprising! Unlike what the majority of the bookmakers say, Argentina has the highest chance of winning the World Cup. France comes in second place, whereas Morocco is the underdog, with only an 8% probability of becoming the World Cup winner.

Full code and details

You can find the complete results, R code, and analysis here. Some preliminary notes and model limitations can be found here. And use the footBayes package!

Final considerations

We had a lot of fun with these World Cup predictions; we think this has been a good and challenging statistical application. To summarize, the average of the correct probabilities, i.e., the average of the model probabilities for the actually observed outcomes, is 0.41, whereas the pseudo R-squared is 0.36 (up to the quarterfinal matches).

BridgeStan: Stan model log densities, gradients, Hessians, and transforms in R, Python, and Julia

We’re happy to announce the official 1.0.0 release of BridgeStan.

What is BridgeStan?

From the documentation:

BridgeStan provides efficient in-memory access through Python, Julia, and R to the methods of a Stan model, including log densities, gradients, Hessians, and constraining and unconstraining transforms.

BridgeStan should be useful for developing algorithms and deploying applications. It connects to R and Python through low-level foreign function interfaces (.C and ctypes, respectively) and is thus very performant and portable. It is also easy to install and keep up to date with Stan releases. BridgeStan adds the model-level functionality from RStan/PyStan that is not implemented in CmdStanR/CmdStanPy.
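As a rough sketch of what that in-memory access looks like from Python (treat the class and method names as assumptions based on the description above, and the file names as hypothetical; see the BridgeStan documentation for the actual interface):

import numpy as np
import bridgestan as bs

# compile a Stan program and bind it to data
model = bs.StanModel.from_stan_file("bernoulli.stan", "bernoulli.data.json")

theta_unc = np.random.default_rng(0).normal(size=model.param_unc_num())
lp, grad = model.log_density_gradient(theta_unc)  # log density and its gradient
theta_con = model.param_constrain(theta_unc)       # map back to the constrained scale

The point is that these calls happen in memory, with no CSV files or subprocesses in between, which is what makes this convenient for algorithm development.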

Documentation and source code

Detailed forum post

Here’s a post on the Stan forums with much more detail:

Among other things, it describes the history and relations to other Stan-related projects. Edward Roualdes started the project in order to access Stan models through Julia, then Brian Ward and I (mostly Brian!) helped Edward finish it, with some contributions from Nicholas Siccha and Mitzi Morris.

Where does this quote come from? “I don’t trust anything until it runs. In fact, I don’t trust anything until it runs twice.”

Google the above quote and you’ll see it attributed to me. And it does sound like the kind of thing I might say. But I don’t recall actually having ever said it, and a search does not turn up any original appearance of this saying.

Does anyone know the original source? Is it something I said in a talk and have forgotten? Or maybe it was misattributed to me, but then who said it, and when?

P.S. I was the one who said it! Back in 2017. Commenter Ahm reports:

I put it in my “Ideas to Inflict Upon Co-authors and PhD Students” notebook after hearing it in the talk “Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do” at the 2017 New York R Conference.

The talk is on Youtube (timestamp 27:51). Link here.

The full-ish quote is "Computer programming is basically the last bastion of rigor in the modern world. I don't trust anything until it runs. In fact, until it runs twice on the computer."

A different Bayesian World Cup model using Stan (opportunity for model checking and improvement)

Maurits Evers writes:

Inspired by your posts on using Stan for analysing football World Cup data here and here, as well as the follow-up here, I had some fun using your model in Stan to predict outcomes for this year’s football WC in Qatar. Here’s the summary on Netlify. Links to the code repo on Bitbucket are given on the website.

Your readers might be interested in comparing model/data/assumptions/results with those from Leonardo Egidi’s recent posts here and here.

Enjoy, soccerheads!

P.S. See comments below. Evers's model makes some highly implausible predictions and on its face seems like it should not be taken seriously. From the statistical perspective, the challenge is to follow the trail of breadcrumbs and figure out where the problems in the model came from. Are they from bad data? A bug in the code? Or perhaps a flaw in the model, so that the data were not used in the way that was intended? One of the great things about generative models is that they can be used to make lots and lots of predictions, and this can help us learn where we have gone wrong. I've added a parenthetical to the title of this post to emphasize this point. Also good to be reminded that just cos a method uses Bayesian inference, that doesn't mean that its predictions make any sense! The output is only as good as its input and how that input is processed.

Update 2 – World Cup Qatar 2022 Predictions with footBayes/Stan

Time to update our World Cup 2022 model!

The DIBP (diagonal-inflated bivariate Poisson) model performed very well in the first match-day of the group stage in terms of predictive accuracy – consider that the 'pseudo R-squared', namely the geometric mean of the probabilities assigned by the model to the 'true' final match results, is about 0.4, whereas, on average, the main bookmakers got 0.36.
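For reference, the 'pseudo R-squared' here is just the geometric mean of the probabilities the model assigned to the results that actually happened. A quick sketch (the probabilities below are made-up placeholders):

import numpy as np

def pseudo_r2(probs_of_observed):
    # geometric mean of the model probabilities of the observed outcomes
    return float(np.exp(np.mean(np.log(probs_of_observed))))

print(pseudo_r2([0.45, 0.38, 0.52, 0.31]))  # hypothetical match probabilities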

It's now time to re-fit the model after the first 16 group stage games with the footBayes R package and obtain the probabilistic predictions for the second match-day. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage played from November 25th to November 28th, along with some ppd 'chessboard plots' for the exact outcomes in gray-scale color – 'mlo' in the table denotes the 'most likely result', whereas darker regions in the plots correspond to more likely results.

Plot/table updates (see Andrew's suggestions from the previous post; we're still developing these plots to improve their appearance, and some more notes follow below): In the plots below, the first team listed in each sub-title is the 'favorite' (x-axis), whereas the second team is the 'underdog' (y-axis). The 2-way grid displays the 16 held-out matches in such a way that closer matches appear at the top-left of the grid, whereas more unbalanced matches ('blowouts') appear at the bottom-right. The matches are then ordered from top-left to bottom-right in terms of increasing winning probability for the favorite teams. The table instead reports the matches in chronological order.

The most unbalanced game seems to be Brazil-Switzerland, where Brazil is the favorite with an estimated winning probability of about 71%. The closest game seems to be Iran-Wales – Iran just won with a two-goal margin scored in the last ten minutes! – whereas France is given only a 44% probability of winning against Denmark. Argentina seems to be ahead against Mexico, whereas Spain seems to have a non-negligible advantage in the match against Germany.

Another predictive note: regarding the 'most likely outcomes' ('mlo' above), the model 'guessed' 4 'mlo' out of 16 on the previous match-day.

You can find the complete results, R code, and analysis here.

Some more technical notes/suggestions about the table and the plots above:

  • We replaced ‘home’ and ‘away’ by ‘favorite’ and ‘underdog’.
  • I find it difficult to handle 'xlab' and 'ylab' in faceted plots with ggplot2! (A better solution could in fact be to put the team names directly on the axes of the sub-plots.)
  • The value '4' actually stands for '4+', meaning that it captures the probability of scoring 4 or more goals. (I did not like the tick label '4+' in the plot, so for now we just use '4'; we could improve this.)
  • We could consider adding global x- and y-axes showing the probability margin between favorite and underdog. Thus, for Brazil-Switzerland, we would have a tick on the x-axis at approximately 62%, whereas for Iran-Wales at 5%.

For other technical notes and model limitations check the previous post.

Next steps: we are going to update the predictions for the third match-day and even compute some World Cup winning probabilities through a forward simulation of the whole tournament.

Stay tuned!

Time Series Forecasting: futile but necessary. An example using electricity prices.

This post is by Phil Price, not Andrew.

I have a client company that owns refrigerated warehouses around the world. A refrigerated warehouse is a Costco-sized building that is kept very cold; 0 F is a common setting in the U.S. (I should ask what they do in Europe). As you might expect, they have an enormous electric bill — the company as a whole spends around a billion dollars per year on electricity — so they are very interested in the cost of electricity. One decision they have to make is: how much electricity, if any, should they purchase in advance? The alternative to purchasing in advance is paying the “real-time” electricity price. On average, if you buy in advance you pay a premium…but you greatly reduce the risk of something crazy happening. What do I mean by ‘crazy’? Take a look at the figure below. This is the monthly-average price per Megawatt-hour (MWh) for electricity purchased during the peak period (weekday afternoons and evenings) in the area around Houston, Texas. That big spike in February 2021 is an ice storm that froze a bunch of wind turbines and also froze gas pipelines — and brought down some transmission lines, I think — thus leading to extremely high electricity prices. And this plot understates things, in a way, by averaging over a month: there were a couple of weeks of fairly normal prices that month, and a few days when the price was over $6000/MWh.

Monthly-mean peak-period (weekday afternoon) electricity price, in dollars per megawatt-hour, in Texas.

If you buy a lot of electricity, a month when it costs 20x as much as you expected can cause havoc with your budget and your profits. One way to avoid that is to buy in advance: a year ahead of time, or even a month ahead of time, you could have bought your February 2021 electricity for only a bit more than electricity typically costs in Texas in February. But events that extreme are very rare — indeed I think this is the most extreme spike on record in the whole country in at least the past thirty years — so maybe it’s not worth paying the premium that would be involved if you buy in advance, month after month and year after year, for all of your facilities in the U.S. and Europe. To decide how much electricity to buy in advance (if any) you need at least a general understanding of quite a few issues: how much electricity do you expect to need next month, or in six months, or in a year; how much will it cost to buy in advance; how much is it likely to cost if you just wait and buy it at the time-of-use rate; what’s the chance that something crazy will happen, and, if it does, how crazy will the price be; and so on.
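To make the buy-ahead tradeoff concrete, here's a toy calculation (all numbers invented for illustration, not the client's data): a forward price that carries a small premium versus a spot price that is usually cheaper but occasionally spikes.

import numpy as np

rng = np.random.default_rng(0)
n_months = 100_000
spot = rng.normal(45, 8, size=n_months)                 # typical $/MWh, made-up numbers
spike = rng.random(n_months) < 0.002                    # rare crisis month
spot[spike] = rng.uniform(500, 6000, size=spike.sum())
forward = 50.0                                          # lock in a price, pay a premium

print("mean spot price: ", round(spot.mean(), 1))               # cheaper on average...
print("99.9th pct spot: ", round(np.quantile(spot, 0.999), 1))  # ...with brutal tail risk
print("forward price:   ", forward)

Whether the premium is worth paying depends on how much of that tail risk the company can absorb, which is exactly the set of questions listed in the paragraph above.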


Bigshot chief scientist of major corporation can’t handle criticism of the work he hypes.

A correspondent who wishes to remain anonymous points us to this article in Technology Review, “Why Meta’s latest large language model survived only three days online. Galactica was supposed to help scientists. Instead, it mindlessly spat out biased and incorrect nonsense.” Here’s the story:

On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism. Yesterday the company took down the public demo that it had encouraged everyone to try out.

Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.

However, Meta and other companies working on large language models, including Google, have failed to take it seriously. . . .

There was some hype:

Meta promoted its model as a shortcut for researchers and students. In the company’s words, Galactica “can summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.”

Actually, though:

Like all language models, Galactica is a mindless bot that cannot tell fact from fiction. Within hours, scientists were sharing its biased and incorrect results on social media. . . . A fundamental problem with Galactica is that it is not able to distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text. People found that it made up fake papers (sometimes attributing them to real authors), and generated wiki articles about the history of bears in space as readily as ones about protein complexes and the speed of light. It’s easy to spot fiction when it involves space bears, but harder with a subject users may not know much about.

I’d not heard about this Galactica thing at all, but the article connected to some things I had heard about:

For the last couple of years, Google has been promoting language models, such as LaMDA, as a way to look up information.

A few months ago we discussed that Google chatbot. I was disappointed that the Google engineer was willing to hype it but not to respond to reasoned criticisms of his argument.

The Technology Review article continues:

And it wasn’t just the fault of Meta’s marketing team. Yann LeCun, a Turing Award winner and Meta’s chief scientist, defended Galactica to the end. On the day the model was released, LeCun tweeted: “Type a text and Galactica will generate a paper with relevant references, formulas, and everything.” Three days later, he tweeted: “Galactica demo is off line for now. It’s no longer possible to have some fun by casually misusing it. Happy?”

I hate twitter. LeCun also approvingly links to someone else who writes, in response to AI critic Gary Marcus:

or maybe it [Galactica] was removed because people like you [Marcus] abused the model and misrepresented it. Thanks for getting a useful and interesting public demo removed, this is why we can’t have nice things.

Whaaa?

Let’s unpack this last bit for a moment. Private company Meta launched a demo, and then a few days later they decided to remove it. The demo was removed in response to public criticisms, and . . . that’s a problem? “We can’t have nice things” because . . . outsiders are allowed to criticize published material?

This attitude of LeCun is ridiculous on two levels. First, and most obviously, the decision to remove the demo was made by Meta, not by Marcus. Meta is one of the biggest companies in the world; they have some agency, no? Second, what’s the endgame here? What’s LeCun’s ideal? Presumably it’s not a world in which outsiders are not allowed to criticize products. So what is it? I guess the ideal would be that Marcus and others would voluntarily suppress their criticism out of a public-spirited desire not to have Meta take “nice things” away from people? So weird. Marcus doesn’t work for your company, dude.

The funny thing is that the official statement from Meta was much more reasonable! Here it is:

Thank you everyone for trying the Galactica model demo. We appreciate the feedback we have received so far from the community, and have paused the demo for now. Our models are available for researchers who want to learn more about the work and reproduce results in the paper.

I don’t quite understand what it means for the demo to have been paused if the models remain available to researchers, but in any case they’re taking responsibility for what they’re doing with their own code; they’re not blaming critics. This is a case where the corporate marketing team makes much more sense than the company’s chief scientist.

This all relates to Jessica’s recent post on academic fields where criticism is suppressed, where research critique is taken as personal attacks, and where there often seems to be a norm of never saying anything negative. LeCun seems to have that same attitude, not about research papers but about his employer’s products. Either way, it’s the blame-the-critic game, and my take is the same: If you don’t want your work criticized, don’t make it public. It’s disappointing, but all too common, to see scientists who are opposed to criticism, which is essential to the scientific process.

The big picture

Look. I’m not saying LeCun is a bad person. I don’t know the guy at all. Anybody can have a bad day! One of his company’s high-profile products got bad press, so he lashed out. Ultimately no big deal.

It’s just . . . that idea that outside criticism is “why we can’t have nice things” . . . at worst this seems like an authoritarian attitude and at best it seems to reflect an extreme naivety about how science works. I guess that without outside criticism we’d all be driving cars that run on cold fusion, cancer would already have been cured 100 times over, etc.

P.S. I sent the above post to some people, and we got involved in a discussion of whether LeCun in his online discussions is "attacking" Galactica's critics. I said that, from my perspective, LeCun is disagreeing with the critics but not attacking them. To this, Thomas Basbøll remarked that, whether the critics are on the "attack" or not, LeCun is certainly on the defensive, reacting to the criticism as though it's an attack. Kind of like calling it "methodological terrorism" or something.

That’s an interesting point regarding LeCun being on the defensive. Indeed, instead of being in the position of arguing how great this product is for humanity, he’s spending his time arguing how it’s not dangerous. I can see how this can feel frustrating from his end.

P.P.S. LeCun responds here in comments.

Football World Cup 2022 Predictions with footBayes/Stan

It’s time for football (aka soccer) World Cup Qatar 2022 and statistical predictions!

This year my collaborator Vasilis Palaskas and I implemented a diagonal-inflated bivariate Poisson model for the scores through our `footBayes` R CRAN package (which depends on the `rstan` package), using as a training set more than 3000 international matches played during the years 2018-2022. The model incorporates some dynamic-autoregressive team-parameter priors for attack and defense abilities and the Coca-Cola/FIFA ranking differences as the only predictor. The model, first proposed by Karlis & Ntzoufras in 2003, extends the usual bivariate Poisson model by inflating the probability of draw occurrences. Weakly informative prior distributions are assumed for the remaining parameters, whereas sum-to-zero constraints are imposed on the attack/defense abilities to achieve model identifiability. Previous World Cup and Euro Cup models posted on this blog can be found here, here and here.

Here is the new model for the joint pair of scores (X, Y) of a soccer match. In brief:
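(The model display from the original post is an image. As a sketch, in the notation of Karlis & Ntzoufras the diagonal-inflated bivariate Poisson has the general form

\[
P(X = x, Y = y) =
\begin{cases}
(1 - p)\,\mathrm{BP}(x, y \mid \lambda_1, \lambda_2, \lambda_3), & x \neq y,\\
(1 - p)\,\mathrm{BP}(x, y \mid \lambda_1, \lambda_2, \lambda_3) + p\, D(x \mid \theta), & x = y,
\end{cases}
\]

where BP is the bivariate Poisson distribution, D is the inflation distribution on the diagonal, and in our setup the attack/defense abilities and the FIFA-ranking difference enter through log-linear models for the scoring rates lambda_1 and lambda_2. This is the general form only; see the footBayes documentation for the exact parameterization we used.)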

We fitted the model using HMC sampling, with 4 Markov chains and 2000 HMC iterations each, checking convergence and effective sample sizes. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage, played from November 20th to November 24th, along with some ppd 'chessboard plots' for the exact outcomes in gray-scale color ('mlo' in the table denotes the 'most likely result', whereas darker regions in the plots correspond to more likely results):

Better teams are acknowledged to have higher chances in these first group stage matches:

  • In Portugal-Ghana, Portugal has an estimated winning probability of about 81%, whereas in Argentina-Saudi Arabia Argentina has an estimated winning probability of about 72%. The match between England and Iran seems more balanced, and a similar trend is observed for Germany-Japan. The USA is estimated to be ahead in the match against Wales, with a winning probability of about 47%.

Some technical notes and model limitations:

  • Keep in mind that ‘home’ and ‘away’ do not mean anything in particular here – the only home team is Qatar! – but they just refer to the first and the second team of the single matches. ‘mlo’ denotes the most likely exact outcome.
  • The posterior predictive probabilities are reported to the third decimal digit, which could seem a bit 'bogus'… However, we transparently report the ppd probabilities exactly as returned by our package computations.
  • One could use these probabilities for betting purposes, for instance by betting on that particular result – among home win, draw, or away win – for which the model probability exceeds the bookmaker-implied probability. However, we are not responsible for any money you lose!
  • Why a diagonal-inflated bivariate Poisson model, and not other models? We developed some sensitivity checks in terms of leave-one-out CV on the training set to choose the best model. Furthermore, we also checked our model in terms of calibration measures and posterior predictive checks.
  • The model incorporates the (rescaled) FIFA ranking as the only predictor. Thus, we do not have many relevant covariates here.
  • We did not distinguish between friendly matches, world cup qualifiers, euro cup qualifiers, etc. in the training data, rather we consider all the data as coming from the same ‘population’ of matches. This data assumption could be poor in terms of predictive performances.
  • We do not incorporate any individual players’-based information in the model, and this also could represent a major limitation.
  • We’ll compute some predictions’ scores – Brier score, pseudo R-squared – to check the predictive power of the model.
  • We’ll fit this model after each stage, by adding the previous matches in the training set and predicting the next matches.

This model is just an approximation for a very complex football tournament. Anyway, we strongly support scientific replication, and for that reason the reports, data, and R and RMarkdown code can all be found here, on my personal web page. Feel free to play with the data and fit your own model!

And stay tuned for the next predictions in the blog. We’ll add some plots, tables and further considerations. Hopefully, we’ll improve predictive performance as the tournament proceeds.

Stan and Tensorflow for fast parallel Bayesian inference

Davendra Seunarine Maharaj writes:

We are seeking to characterize the performance and potential bottlenecks of the latest fast MCMC samplers. I see that Stan is currently using Intel TBB to parallelize the no-U-turn sampler (NUTS) across multiple chains. Do you know of any research that has attempted to parallelize the sampler itself within one chain?

Matt Hoffman (the inventor of the NUTS algorithm) responded:

Our group at Google has been very interested in using parallel compute in HMC variants (including NUTS), particularly on accelerators (e.g., GPUs). We’ve been working in the deep-learning-oriented autodiff+accelerator software frameworks TensorFlow and JAX, both of which are supported by our TensorFlow Probability library. It turns out that building on top of these frameworks gives you parallelism both within-chain (mostly from data parallelism) and across chains (because multiple chains can be run in parallel) “for free” without thinking too much about it—we just let the substrate (i.e., TF or JAX) decide how best to use the available resources.

Some of our systems papers you may (or may not) find interesting:
https://arxiv.org/abs/2002.01184
https://arxiv.org/abs/2001.05035

Some papers arguing theoretically and empirically that running many chains in parallel is a good thing to do if you have the resources:
https://proceedings.mlr.press/v130/hoffman21a.html
https://proceedings.mlr.press/v119/hoffman20a.html

And one about convergence diagnostics if you’re going to run many chains in parallel and don’t want to wait for them all to generate large effective sample sizes (with Andrew, among others):
https://arxiv.org/abs/2110.13017

Mark Jeffrey followed up:

Our research group comprises computer architecture and PL folks, and we are always keen to improve our understanding of a diversity of application domains to improve computer system support for them. It is encouraging to hear that Google has a significant interest in this area to the point that TensorFlow has a team developing it.

Two more questions:

1. Are you aware of any HMC/NUTS/MCMC competitions akin to those organized at DIMACS, e.g., does StanCon run a competition? This would give us pointers to potentially useful input models to work with.

2. In your opinion, does TensorFlow’s support for HMC methods supersede Stan, or will both continue to coexist with different strengths? I expect the students in my group may be more productive hacking on the internals of Stan than TensorFlow, but I am open to suggestion.

I passed those questions over to Bob Carpenter, who replied:

Stan can use multiple threads to evaluate the log density and gradients within a single chain and it can use multiple threads to run multiple chains in parallel. We can also send some computations to GPU to evaluate the log density.
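(As a concrete illustration of those two kinds of parallelism, here is a minimal CmdStanPy sketch; the option names follow CmdStanPy's interface, but treat the details as assumptions, and note that within-chain threading only helps if the Stan program uses reduce_sum or map_rect.)

from cmdstanpy import CmdStanModel

# compile with threading support so reduce_sum can use multiple threads
model = CmdStanModel(stan_file="model.stan", cpp_options={"STAN_THREADS": True})

fit = model.sample(
    data="data.json",
    chains=4,              # four MCMC chains
    parallel_chains=4,     # run all four chains at once
    threads_per_chain=2,   # within-chain threads for reduce_sum / map_rect
)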

Google did some comparisons of Stan and TensorFlow and the answer is not surprising: Stan was faster on CPU and TensorFlow faster on GPU. Most of the models used in Stan aren't easy to push onto GPU because they don't involve SIMD calculations within a single log density eval. This is in contrast to deep neural nets, which involve a lot of dense matrix operations, which are the bread and butter of GPUs.

TensorFlow usually runs 32 bit and Stan always runs 64 bit.

HMC and NUTS are both instances of MCMC, so I don’t see the contrast.

No, we’re not running any bakeoffs, nor do I know of any plans to do such. They are notoriously difficult to organize.

Finally, I’d suggest looking at some of Matt Hoffman and Pavel Sountsov’s new algorithms, which are designed to take advantage of the SIMD capabilities of GPUs and TPUs. I’m personally very excited about their recent MEADS algorithm.

Matt also sent along his responses:

1. In HMC variants, the overhead (at least in FLOPs) of the model-independent code is generally quite small compared to the cost of computing gradients. So as long as you’re FLOPs-bound, there’s not much point in aggressively optimizing the implementation. (But in https://proceedings.mlr.press/v130/hoffman21a.html we find that, when running on a GPU, NUTS in particular has a lot of control-flow overhead when run on top of TF or JAX; cf. also https://github.com/tensorflow/probability/blob/main/discussion/technical_note_on_unrolled_nuts.md)

So I don’t think there’s been a lot of work put into DIMACS-style implementation challenges. (Or I might just not know about it.)

But there are some “model zoos” out there that are more oriented towards comparing algorithmic improvements. Stan’s example models repo has lots of great stuff, there’s posteriordb, and our own inference gym.

2. I certainly don’t think that TFP is going to make Stan obsolete anytime soon. For one thing, I’d say that Stan’s documentation, community support, and general ease of use are in most respects well ahead of TFP. Also, I think it’s safe to say that Stan has better support for some modeling features—ODEs are a big example that springs to mind. Also, if you need double precision then it can be kind of a pain to get in TF/JAX.

Some of TFP’s big advantages over Stan to my mind are:

A. We get accelerator (i.e., GPU and TPU) support “for free” from building on top of TF and JAX. That includes support for sharding computation across multiple accelerators, which lets us do useful things like run 8x as many chains relatively easily, as well as letting us do kind-of-absurd things like apply HMC to a large Bayesian neural network using 512 TPU cores.

B. It’s easier to use TFP’s HMC as part of a larger system. If I want to, say, use HMC as an inner loop in training a deep generative model, or use an invertible neural network to precondition my sampler, I think it’d be harder to do that using Stan than using TFP.

C. I strongly prefer doing algorithm development in Python/JAX (using, e.g., FunMC) to C++. Having a model zoo that’s defined in the same autodiff framework as the one I’m writing my algorithm in is very convenient.

So compared to Stan, I think TFP is more geared towards problems where either you want the increased flexibility (and lower user-friendliness) of its lower-level API, or you’ve really hit a computational bottleneck that can be solved with access to more FLOPs.

Whereas for a lot of data-analysis problems, I think TFP’s advantages over Stan aren’t that relevant—if you can run 6 chains for 2000 iterations on a laptop in under five minutes, using a well-supported, well-documented, easy-to-use system, and everything converges with no tuning, why fix what’s not broken? Even in situations where things don’t converge properly, often the solution is to fix the model specification (e.g., by noncentering) rather than to throw more compute or customized algorithms at it.

Interesting discussion, and it’s good to see all this work being done on practical algorithms. No method will solve all problems, so it makes sense that multiple systems are being developed.

Simulation-based calibration checking (SBC) is stronger than you thought! (and the SBC package in R)

Martin Modrak writes:

We (Martin Modrák, Angie Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman, and Aki Vehtari) have a new preprint on SBC: Simulation-Based Calibration Checking for Bayesian Computation: The Choice of Test Quantities Shapes Sensitivity.

For those who are unaware, simulation-based calibration checking (SBC) is a method where you use simulated datasets to verify that you implemented your model correctly and/or that your sampling algorithm works. It has been known and used for a while, but was considered to have a few shortcomings, which we try to address. See the preprint or the documentation for the SBC package for a precise description of how it works.

The gist of our preprint is that the strength of SBC depends on the choice of “test quantities” for which you compute ranks. The default approach is to take all individual parameter values. This is already a very useful check, but it leaves a big class of bugs you can’t detect (e.g., when the posterior is just the prior distribution). However, when you add derived test quantities, combining the parameters with the simulated data, you can (in theory) detect any problem! (Yaaaay!) But in theory you may need infinitely many quantities :-(.

In practice, it seems to be quite sufficient to add just a few additional test quantities beyond the default. In particular, our experience as well as theoretical considerations indicate that the model likelihood is a very sensitive test quantity. The power of SBC is still limited by the number of simulated datasets you can reasonably fit, which primarily determines how large a discrepancy can go undetected.

More generally, we provide a firmly grounded theoretical analysis of SBC which will hopefully help others build better intuition on how and why it works in practice. Notably, we make it clear that SBC does not check whether the “data-averaged posterior equals the prior,” as is sometimes claimed in the literature (and as I myself was convinced just a couple of months ago :-D).

The SBC package supports all of the ideas discussed in the preprint. I personally now use SBC for almost all my Stan work from the get-go – although there is some extra work to set up SBC, you end up detecting bugs early and thus save time in the long run. If you want to see an example of simulation-driven model development workflow, check out the Small model implementation workflow.

Looking forward to your feedback on the preprint (as well as the package).
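
To make the rank-checking idea concrete before getting to my reaction, here’s a hand-rolled sketch of the core SBC loop in R, for a toy conjugate normal-mean model so that the “fit” can be exact posterior draws rather than an MCMC run. In real use you’d replace that step with a Stan fit (or just use the SBC package); the second test quantity is the joint log likelihood, along the lines Martin describes, and all the names here are made up for the example:

# Toy SBC: y ~ normal(theta, 1) with prior theta ~ normal(0, 1),
# so the exact posterior is available in closed form.
n_sims  <- 200   # number of simulated datasets
n_draws <- 100   # posterior draws per "fit"
n_obs   <- 10

loglik <- function(theta, y) sum(dnorm(y, theta, 1, log = TRUE))

rank_theta <- rank_ll <- numeric(n_sims)
for (s in 1:n_sims) {
  theta_sim <- rnorm(1, 0, 1)              # draw the parameter from the prior
  y         <- rnorm(n_obs, theta_sim, 1)  # simulate a dataset
  post_sd   <- sqrt(1 / (1 + n_obs))       # conjugate posterior sd
  post_mean <- post_sd^2 * sum(y)          # conjugate posterior mean
  theta_draws <- rnorm(n_draws, post_mean, post_sd)
  # rank of the simulated value among the posterior draws
  rank_theta[s] <- sum(theta_draws < theta_sim)
  # derived test quantity combining parameters and data: the log likelihood
  rank_ll[s] <- sum(sapply(theta_draws, loglik, y = y) < loglik(theta_sim, y))
}
hist(rank_theta)  # roughly flat histograms = no evidence of miscalibration
hist(rank_ll)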

I agree that simulation-based calibration checking is an important part of statistical workflow. I often do a simpler version where I just set the parameters to reasonable values, then simulate data, then fit the model to the simulated data to check whether I can recover the parameter values to within a reasonable level of accuracy. But it’s definitely cleaner to do this by drawing from the prior.
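
In code, that simpler check is only a few lines. Here’s a sketch with cmdstanr; the model file and parameter name are hypothetical stand-ins:

library(cmdstanr)

mod <- cmdstan_model("normal_mean.stan")  # hypothetical model with a parameter theta
theta_true <- 2.5                         # fix the parameter at a reasonable value
y_fake <- rnorm(100, theta_true, 1)       # simulate fake data at that value
fit <- mod$sample(data = list(N = 100, y = y_fake), refresh = 0)
quantile(as.vector(fit$draws("theta")), c(0.025, 0.5, 0.975))
# the posterior interval should comfortably bracket 2.5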

When Cook and Rubin first suggested this idea around 2004, it was common to fit Bayesian models using hand-coded Gibbs and Metropolis algorithms, and the point of SBC was to find errors in the code that you’d write. Nowadays with Stan we’re not so worried about bugs in the sampling code, but we are concerned with possible poor mixing of the HMC and poor identification of the parameters in the model. The big idea here, though, is, as Martin writes above, to incorporate this sort of checking into workflow.

Also, for implementation you’d like to have this done in parallel, which means it would be good to have a setup where you can access 100 processors directly. I’m not sure if that’s in Martin’s workflow linked above. I guess there must be some way to link RStudio directly with AWS so that you just put in your credit card number and it accesses multiple processors as needed?
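
One low-tech version of that, whether the cores are local or rented: each simulated dataset is fit independently, so the whole thing is embarrassingly parallel, and base R’s parallel package is enough to spread the work over however many cores you have. Here one_sbc_fit is a placeholder for whatever function simulates the s-th dataset, fits it, and returns the ranks:

library(parallel)

# one_sbc_fit(s): placeholder function that simulates dataset s, fits the
# model, and returns the SBC ranks.  Each call is independent of the others.
# (mclapply forks, so this works on Mac/Linux; Windows needs a cluster backend.)
results <- mclapply(1:200, one_sbc_fit, mc.cores = detectCores())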

New version of Posteriordb: a database of posterior distributions for evaluating Bayesian computation algorithms

Måns Magnusson writes:

Now that @avehtari and I are done with the PR review, a new version of posteriordb is finally out. There is still a need for more complex posteriors (more than 100 parameters) or models that take longer to estimate (minutes or hours rather than seconds). The ideal posteriors are complex posteriors that have already been published, where the data are open and can be included. Feel free to reach out if you have implemented such posteriors. I’m happy to include them.

The update includes new posterior distributions, bug fixes, and updated documentation.

I think this project is really important.