Time Series Forecasting: futile but necessary. An example using electricity prices.

This post is by Phil Price, not Andrew.

I have a client company that owns refrigerated warehouses around the world. A refrigerated warehouse is a Costco-sized building that is kept very cold; 0°F is a common setting in the U.S. (I should ask what they do in Europe). As you might expect, they have an enormous electric bill — the company as a whole spends around a billion dollars per year on electricity — so they are very interested in the cost of electricity. One decision they have to make is: how much electricity, if any, should they purchase in advance? The alternative to purchasing in advance is paying the “real-time” electricity price. On average, if you buy in advance you pay a premium…but you greatly reduce the risk of something crazy happening. What do I mean by ‘crazy’? Take a look at the figure below. This is the monthly-average price per megawatt-hour (MWh) for electricity purchased during the peak period (weekday afternoons and evenings) in the area around Houston, Texas. That big spike in February 2021 reflects an ice storm that froze a bunch of wind turbines and also froze gas pipelines — and brought down some transmission lines, I think — thus leading to extremely high electricity prices. And this plot understates things, in a way, by averaging over a month: there were a couple of weeks of fairly normal prices that month, and a few days when the price was over $6000/MWh.

Monthly-mean peak-period (weekday afternoon) electricity price, in dollars per megawatt-hour, in Texas.

If you buy a lot of electricity, a month when it costs 20x as much as you expected can cause havoc with your budget and your profits. One way to avoid that is to buy in advance: a year ahead of time, or even a month ahead of time, you could have bought your February 2021 electricity for only a bit more than electricity typically costs in Texas in February. But events that extreme are very rare — indeed I think this is the most extreme spike on record in the whole country in at least the past thirty years — so maybe it’s not worth paying the premium that would be involved if you buy in advance, month after month and year after year, for all of your facilities in the U.S. and Europe. To decide how much electricity to buy in advance (if any) you need at least a general understanding of quite a few issues: how much electricity do you expect to need next month, or in six months, or in a year; how much will it cost to buy in advance; how much is it likely to cost if you just wait and buy it at the time-of-use rate; what’s the chance that something crazy will happen, and, if it does, how crazy will the price be; and so on.
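
To see the flavor of that tradeoff, here is a minimal back-of-the-envelope sketch. Every number in it is hypothetical (made up for illustration, not the client’s figures): it just compares the cost of locking in a forward price against paying the real-time price when that price has a small chance of an extreme spike.

import numpy as np

rng = np.random.default_rng(1)

# All numbers below are hypothetical, for illustration only.
load_mwh = 10_000         # monthly load for one facility, in MWh
forward_price = 55.0      # $/MWh locked in ahead of time (includes a premium)
typical_price = 45.0      # $/MWh typical real-time price
spike_price = 1_500.0     # $/MWh average price in a crisis month
p_spike = 1 / 200         # chance that a given month is a crisis month

# Simulate many months of the "just pay real-time" strategy.
n_months = 1_000_000
spike = rng.random(n_months) < p_spike
realtime_cost = load_mwh * np.where(spike, spike_price, typical_price)
forward_cost = load_mwh * forward_price

print(f"forward cost:           ${forward_cost:,.0f}")
print(f"mean real-time cost:    ${realtime_cost.mean():,.0f}")
print(f"99.9th pct real-time:   ${np.quantile(realtime_cost, 0.999):,.0f}")

With these made-up numbers the forward contract costs a bit more on average, but it caps the worst month; the whole decision problem is about how much of that premium is worth paying, facility by facility and month by month.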


Football World Cup 2022 Predictions with footBayes/Stan

It’s time for football (aka soccer) World Cup Qatar 2022 and statistical predictions!

This year my collaborator Vasilis Palaskas and I implemented a diagonal-inflated bivariate Poisson model for the scores through our `footBayes` R package on CRAN (which depends on the `rstan` package), using as a training set more than 3000 international matches played from 2018 to 2022. The model incorporates dynamic autoregressive priors for the team-specific attack and defense abilities and uses the Coca-Cola/FIFA ranking difference as the only predictor. The model, first proposed by Karlis & Ntzoufras in 2003, extends the usual bivariate Poisson model by inflating the probability of draws. Weakly informative prior distributions are assumed for the remaining parameters, and sum-to-zero constraints on the attack/defense abilities are imposed to achieve model identifiability. Previous World Cup and Euro Cup models posted on this blog can be found here, here and here.

Here is the new model for the joint scores (X, Y) of a soccer match. In brief:
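
In outline (this is the generic Karlis and Ntzoufras formulation, not necessarily the exact parameterization used in footBayes), the diagonal-inflated bivariate Poisson adds extra probability mass on equal scores:

\Pr(X = x, Y = y) =
\begin{cases}
(1 - p)\,\mathrm{BP}(x, y \mid \lambda_1, \lambda_2, \lambda_3), & x \neq y,\\
(1 - p)\,\mathrm{BP}(x, y \mid \lambda_1, \lambda_2, \lambda_3) + p\,D(x \mid \theta), & x = y,
\end{cases}

where BP is the bivariate Poisson pmf and D is a discrete distribution that inflates draws. The scoring rates then get a log-linear structure along the lines of log λ1 = μ + att_home + def_away + β (rank_home − rank_away), with the attack and defense abilities evolving over time through the dynamic autoregressive priors mentioned above.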

We fitted the model using HMC sampling, with 4 Markov chains and 2000 HMC iterations each, checking convergence and effective sample sizes. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage, played from November 20th to November 24th, along with some ppd ‘chessboard plots’ for the exact outcomes in grayscale (‘mlo’ in the table denotes the ‘most likely result’, whereas darker regions in the plots correspond to more likely results):

As expected, the stronger teams are given higher chances in these first group-stage matches:

  • In Portugal-Ghana, Portugal has an estimated winning probability of about 81%, whereas in Argentina-Saudi Arabia, Argentina has an estimated winning probability of about 72%. The match between England and Iran seems more balanced, and a similar pattern is observed for Germany-Japan. The USA is estimated to be ahead in the match against Wales, with a winning probability of about 47%.

Some technical notes and model limitations:

  • Keep in mind that ‘home’ and ‘away’ do not mean anything in particular here – the only true home team is Qatar! – they simply refer to the first and second team listed for each match. ‘mlo’ denotes the most likely exact outcome.
  • The posterior predictive probabilities are reported to three decimal places, which may suggest more precision than is warranted… However, for transparency we report the ppd probabilities exactly as returned by our package computations.
  • One could use these probabilities for betting purposes, for instance by betting on the particular result – home win, draw, or away win – for which the model probability exceeds the bookmaker-implied probability. However, we are not responsible for any money you lose!
  • Why a diagonal-inflated bivariate Poisson model, and not some other model? We ran sensitivity checks using leave-one-out CV on the training set to choose the best model. We also checked our model in terms of calibration measures and posterior predictive checks.
  • The model incorporates the (rescaled) FIFA ranking as the only predictor. Thus, we do not have many relevant covariates here.
  • We did not distinguish between friendly matches, World Cup qualifiers, Euro Cup qualifiers, etc. in the training data; rather, we treated all the data as coming from the same ‘population’ of matches. This assumption could hurt predictive performance.
  • We do not incorporate any player-level information in the model, which could also be a major limitation.
  • We’ll compute some prediction scores – Brier score, pseudo R-squared – to check the predictive power of the model (a small sketch of the Brier score appears just after this list).
  • We’ll refit this model after each stage, adding the previous matches to the training set and predicting the next matches.
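
As a concrete illustration of the first of those scores (a rough sketch, not the footBayes code), the multiclass Brier score for three-way match forecasts is just the mean squared distance between the forecast probability vector and the observed outcome:

import numpy as np

def brier_score(probs, outcomes):
    """Multiclass Brier score for match forecasts.

    probs:    (N, 3) predicted probabilities for (home win, draw, away win)
    outcomes: length-N observed outcomes coded 0, 1, 2
    Lower is better; always forecasting (1/3, 1/3, 1/3) scores 2/3.
    """
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(3)[np.asarray(outcomes, dtype=int)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# Toy example with made-up forecasts (not the model's actual output):
print(brier_score([[0.81, 0.13, 0.06], [0.40, 0.30, 0.30]], [0, 2]))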

This model is just an approximation to a very complex football tournament. In any case, we strongly support scientific replication, and for that reason the reports, data, R and RMarkdown code can all be found here, on my personal web page. Feel free to play with the data and fit your own model!

And stay tuned for the next predictions in the blog. We’ll add some plots, tables and further considerations. Hopefully, we’ll improve predictive performance as the tournament proceeds.

Stan and Tensorflow for fast parallel Bayesian inference

Davendra Seunarine Maharaj writes:

We are seeking to characterize the performance and potential bottlenecks of the latest fast MCMC samplers. I see that Stan currently uses Intel TBB to parallelize the no-U-turn sampler (NUTS) across multiple chains. Do you know of any research that has attempted to parallelize the sampler itself within a single chain?

Matt Hoffman (the inventor of the NUTS algorithm) responded:

Our group at Google has been very interested in using parallel compute in HMC variants (including NUTS), particularly on accelerators (e.g., GPUs). We’ve been working in the deep-learning-oriented autodiff+accelerator software frameworks TensorFlow and JAX, both of which are supported by our TensorFlow Probability library. It turns out that building on top of these frameworks gives you parallelism both within-chain (mostly from data parallelism) and across chains (because multiple chains can be run in parallel) “for free” without thinking too much about it—we just let the substrate (i.e., TF or JAX) decide how best to use the available resources.
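
As a cartoon of the “chains are just another array dimension” point (a plain random-walk Metropolis sketch in NumPy, not TFP’s HMC and not their actual code), note that running many chains at once only adds a leading batch axis, so every update is a whole-array operation that the framework is free to parallelize:

import numpy as np

rng = np.random.default_rng(0)

def log_density(theta):
    # standard-normal target, evaluated for all chains at once
    return -0.5 * np.sum(theta ** 2, axis=-1)

n_chains, dim, n_iters, step = 1024, 10, 2000, 0.5
theta = rng.normal(size=(n_chains, dim))   # leading axis = chains
lp = log_density(theta)
for _ in range(n_iters):
    prop = theta + step * rng.normal(size=theta.shape)
    lp_prop = log_density(prop)
    accept = np.log(rng.uniform(size=n_chains)) < lp_prop - lp
    theta = np.where(accept[:, None], prop, theta)
    lp = np.where(accept, lp_prop, lp)

# Each chain contributes one draw here; averaging across 1024 chains
# already gives a decent estimate of the posterior mean.
print(theta.mean(axis=0)[:3])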

Some of our systems papers you may (or may not) find interesting:
https://arxiv.org/abs/2002.01184
https://arxiv.org/abs/2001.05035

Some papers arguing theoretically and empirically that running many chains in parallel is a good thing to do if you have the resources:
https://proceedings.mlr.press/v130/hoffman21a.html
https://proceedings.mlr.press/v119/hoffman20a.html

And one about convergence diagnostics if you’re going to run many chains in parallel and don’t want to wait for them all to generate large effective sample sizes (with Andrew, among others):
https://arxiv.org/abs/2110.13017

Mark Jeffrey followed up:

Our research group comprises computer architecture and PL folks, and we are always keen to improve our understanding of a diversity of application domains to improve computer system support for them. It is encouraging to hear that Google has a significant interest in this area to the point that TensorFlow has a team developing it.

Two more questions:

1. Are you aware of any HMC/NUTS/MCMC competitions akin to those organized at DIMACS, e.g., does StanCon run a competition? This would give us pointers to potentially useful input models to work with.

2. In your opinion, does TensorFlow’s support for HMC methods supersede Stan, or will both continue to coexist with different strengths? I expect the students in my group may be more productive hacking on the internals of Stan than TensorFlow, but I am open to suggestion.

I passed those questions over to Bob Carpenter, who replied:

Stan can use multiple threads to evaluate the log density and gradients within a single chain and it can use multiple threads to run multiple chains in parallel. We can also send some computations to GPU to evaluate the log density.

Google did some comparisons of Stan and TensorFlow and the answer is not surprising: Stan was faster on CPU and TensorFlow faster on GPU. Most of the models used in Stan aren’t easy to push onto GPU because they don’t involve SIMD calculations within a single log density eval. This is in contrast to deep neural nets, which involve a lot of dense matrix operations, which are the bread and butter of GPUs.

TensorFlow usually runs 32 bit and Stan always runs 64 bit.

HMC and NUTS are both instances of MCMC, so I don’t see the contrast.

No, we’re not running any bakeoffs, nor do I know of any plans to do such. They are notoriously difficult to organize.

Finally, I’d suggest looking at some of Matt Hoffman and Pavel Sountsov’s new algorithms, which are designed to take advantage of the SIMD capabilities of GPUs and TPUs. I’m personally very excited about their recent MEADS algorithm.

Matt also sent along his responses:

1. In HMC variants, the overhead (at least in FLOPs) of the model-independent code is generally quite small compared to the cost of computing gradients. So as long as you’re FLOPS-bound, there’s not much point in aggressively optimizing the implementation. (But in https://proceedings.mlr.press/v130/hoffman21a.html we find that, when running on a GPU, NUTS in particular has a lot of control-flow overhead when run on top of TF or JAX; cf. also https://github.com/tensorflow/probability/blob/main/discussion/technical_note_on_unrolled_nuts.md)

So I don’t think there’s been a lot of work put into DIMACS-style implementation challenges. (Or I might just not know about it.)

But there are some “model zoos” out there that are more oriented towards comparing algorithmic improvements. Stan’s example models repo has lots of great stuff, there’s posteriordb, and our own inference gym.

2. I certainly don’t think that TFP is going to make Stan obsolete anytime soon. For one thing, I’d say that Stan’s documentation, community support, and general ease of use are in most respects well ahead of TFP. Also, I think it’s safe to say that Stan has better support for some modeling features—ODEs are a big example that springs to mind. Also, if you need double precision then it can be kind of a pain to get in TF/JAX.

Some of TFP’s big advantages over Stan to my mind are:

A. We get accelerator (i.e., GPU and TPU) support “for free” from building on top of TF and JAX. That includes support for sharding computation across multiple accelerators, which lets us do useful things like run 8x as many chains relatively easily, as well as letting us do kind-of-absurd things like apply HMC to a large Bayesian neural network using 512 TPU cores.

B. It’s easier to use TFP’s HMC as part of a larger system. If I want to, say, use HMC as an inner loop in training a deep generative model, or use an invertible neural network to precondition my sampler, I think it’d be harder to do that using Stan than using TFP.

C. I strongly prefer doing algorithm development in Python/JAX (using, e.g., FunMC) to C++. Having a model zoo that’s defined in the same autodiff framework as the one I’m writing my algorithm in is very convenient.

So compared to Stan, I think TFP is more geared towards problems where either you want the increased flexibility (and lower user-friendliness) of its lower-level API, or you’ve really hit a computational bottleneck that can be solved with access to more FLOPs.

Whereas for a lot of data-analysis problems, I think TFP’s advantages over Stan aren’t that relevant—if you can run 6 chains for 2000 iterations on a laptop in under five minutes, using a well-supported, well-documented, easy-to-use system, and everything converges with no tuning, why fix what’s not broken? Even in situations where things don’t converge properly, often the solution is to fix the model specification (e.g., by noncentering) rather than to throw more compute or customized algorithms at it.

Interesting discussion, and it’s good to see all this work being done on practical algorithms. No method will solve all problems, so it makes sense that multiple systems are being developed.

Simulation-based calibration checking (SBC) is stronger than you thought! (and the SBC package in R)

Martin Modrak writes:

We (Martin Modrák, Angie Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman, and Aki Vehtari) have a new preprint on SBC: Simulation-Based Calibration Checking for Bayesian Computation: The Choice of Test Quantities Shapes Sensitivity.

For those who are unaware, simulation-based calibration checking (SBC) is a method where you use simulated datasets to verify that you implemented your model correctly and/or that your sampling algorithm works. It has been known and used for a while, but was considered to have a few shortcomings, which we try to address. See the preprint or the documentation for the SBC package for a precise description of how it works.

The gist of our preprint is that the strength of SBC depends on the choice of “test quantities” for which you compute ranks. The default approach is to take all individual parameter values. This is already a very useful check, but it leaves a big class of bugs you can’t detect (e.g., when the posterior is just the prior distribution). However, when you add derived test quantities, combining the parameters with the simulated data, you can (in theory) detect any problem! (Yaaaay!) But in theory you may need infinitely many quantities :-(.

In practice, it seems to be quite sufficient to add just a few test quantities beyond the default. In particular, our experience as well as theoretical considerations indicate that the model likelihood is a very sensitive test quantity. The power of SBC is still limited by the number of simulated datasets you can reasonably fit, which primarily determines how large a discrepancy can go undetected.
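
Here is a minimal sketch of the SBC rank computation itself (a toy version, not the SBC package’s code), using a conjugate normal model so that exact posterior draws are available, with the joint log likelihood as an extra derived test quantity:

import numpy as np

rng = np.random.default_rng(0)
S, M, n, sigma = 1000, 99, 20, 1.0   # simulations, posterior draws, data size, known sd

ranks_mu, ranks_loglik = [], []
for _ in range(S):
    mu = rng.normal(0.0, 1.0)              # draw the parameter from its prior N(0, 1)
    y = rng.normal(mu, sigma, size=n)      # simulate a dataset
    post_var = 1.0 / (1.0 + n / sigma**2)  # exact conjugate posterior
    post_mean = post_var * y.sum() / sigma**2
    draws = rng.normal(post_mean, np.sqrt(post_var), size=M)

    # default test quantity: the parameter itself
    ranks_mu.append(np.sum(draws < mu))
    # derived test quantity: joint log likelihood of the simulated data
    loglik = lambda m: -0.5 * np.sum((y - m) ** 2) / sigma**2
    ranks_loglik.append(np.sum(np.array([loglik(d) for d in draws]) < loglik(mu)))

# If everything is correct, both rank histograms are uniform on 0..M.
print(np.histogram(ranks_mu, bins=10, range=(0, M + 1))[0])
print(np.histogram(ranks_loglik, bins=10, range=(0, M + 1))[0])

If you replace the exact posterior draws with a buggy sampler, the rank histograms stop being uniform, and the derived quantity can catch classes of bugs that the parameter ranks alone miss.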

More generally, we provide a firmly grounded theoretical analysis of SBC which will hopefully help others to build better intuition on how and why it works in practice. Notably, we make it clear that SBC does not check whether “data-averaged posterior equals the prior” as is sometimes claimed in the literature (and as I also was convinced just a couple months ago :-D )

The SBC package supports all of the ideas discussed in the preprint. I personally now use SBC for almost all my Stan work from the get-go – although there is some extra work to set up SBC, you end up detecting bugs early and thus save time in the long run. If you want to see an example of a simulation-driven model development workflow, check out the Small model implementation workflow.

Looking forward to your feedback on the preprint (as well as the package).

I agree that simulation-based calibration checking is an important part of statistical workflow. I often do a simpler version where I just set parameters to reasonable values, then simulate data, then fit the model to the simulated data to check to see if I can recover the parameter values to within a reasonable level of accuracy. But it’s definitely cleaner to do this by drawing from the prior.

When Cook and Rubin first suggested this idea around 2004, it was common to fit Bayesian models using hand-coded Gibbs and Metropolis algorithms, and the point of SBC was to find errors in the code that you’d write. Nowadays with Stan we’re not so worried about bugs in the sampling code, but we are concerned with possible poor mixing of the HMC and poor identification of the parameters in the model. The big idea here, though, is, as Martin writes above, to incorporate this sort of checking into workflow.

Also, for implementation you’d like to have this done in parallel, which means it would be good to have a setup where you can access 100 processors directly. I’m not sure if that’s in Martin’s workflow linked above. I guess there must be some way to link RStudio directly with AWS so that you just put in your credit card number and it accesses multiple processors as needed?

New version of Posteriordb: a database of posterior distributions for evaluating Bayesian computation algorithms

Måns Magnusson writes:

@avehtari and I are now done with the PR review and, finally, a new version of posteriordb is out. There is still a need for more complex posteriors (> 100 parameters) or models that take longer to estimate (minutes or hours rather than seconds). The ideal additions are complex posteriors that have already been published, where the data are open and can be included. Feel free to reach out if you have implemented such posteriors; I’m happy to include them.

The update includes new posterior distributions, bug fixes, and updated documentation.

I think this project is really important.

Tigers need your help.

Jim Logue, Director of Baseball R&D at the Detroit Tigers, writes:

We are now hiring a Principal Quantitative Analyst. With this position we’re looking for someone with extensive Bayesian experience, with a secondary emphasis on baseball knowledge.

The Tigers went 66-96 last year so the good news is that if you join them now you can take some credit for whatever improvement they show next year!

I assume that knowledge of Stan will be a plus.

0 to 100K users in 10 years: how Stan got to where it is

I did a talk at Simons Foundation on how Stan was initially developed and how it got to where it is today:

It was the first time I’d presented this and I was having so much fun I ran out of time near the end. If you’d like to go more slowly, here’s

Also, here’s a direct link to the TV clip that was played near the start of the talk.

A baseball analytics job using Stan!

Tony Williams writes:

I have nothing to do with this job, but it might be interesting to your readers since they specifically mention Stan as a desired skill.

From the link:

Data Scientist, Baseball Research & Development

The Cleveland Guardians Baseball Research & Development (R&D) group is seeking data scientists at a variety of experience levels . . . You will analyze video, player tracking, and biomechanics data as well as traditional baseball data sources like box scores to help us acquire and develop baseball players into a championship-caliber team. . . .

Qualifications

– Demonstrated experience or advanced degree in a quantitative field such as Statistics, Computer Science, Economics, Machine Learning, or Operations Research.

– Programming skills in a language such as R or Python to work efficiently at scale with large data sets.

– Desire to continue learning about data science applications in baseball.

And then in the Preferred Experience section, along with “Demonstrated research experience in a sports context (baseball is a plus)” and “Experience with computer vision” and a few other things, they have:

– Experience with Bayesian statistics and languages such as Stan.

How cool is that??

And, hey! I just looked it up . . . the Guardians have a winning record this year and they’re headed for the playoffs! Nothing like the Cleveland MLB teams I remember from my childhood . . .

Programmer position with our research group—at the University of Michigan!

Hey, midwesterners—here’s a chance for you to join the big team! It’s for our research project with Yajuan Si, Len Covello, Mitzi Morris, Jonah Gabry, and others on building models and software for epidemic tracking (see this paper, this paper, or, for the short version, this op-ed):

Summary

The Survey Research Center (SRC) at the University of Michigan’s Institute for Social Research (ISR) invites applications for a Full Stack Programmer with the Survey Methodology Program, in collaboration with the team of Stan developers at Columbia University.

Our multidisciplinary research team is involved in cutting-edge statistical methodology research and the development of computational algorithms, software, and web-based interfaces for application and visualization. We are looking for a Full-Stack Programmer to work on an innovative infrastructure to enable user-friendly implementation and reproducibility of statistical methods for population generalizability. The position provides an opportunity to work in an exciting and rewarding research area that constantly poses new technical and computational problems.

Responsibilities*

Develop, test, maintain, document, and deploy an interface for statistical inferences with sample data and result visualization, specifically the implementation of multilevel regression and poststratification.

Provide timely technical support during research deployments.

Create documentation and tutorials about the developed applications for use by interdisciplinary research teams.

Required Qualifications*

Bachelor’s or Master’s degree in Statistics/Computer Science/Information Science/Informatics or related technical discipline or a combination of education and software-development experience in a research or corporate environment to equal three (3) years.

Skills in R/Python programming.

Experience in data visualization and dashboard construction.

Experience with databases. Direct experience with demographic or geospatial data a plus.

Familiarity with C++ or Java. Knowledge of Stan programming a plus.

Track record of successful application development a plus.

Desired Qualifications*

Dashboard development experience preferred.

Work Locations

This position will be on-site at the University of Michigan offices in Ann Arbor, with flexible scheduling and remote opportunities made available within our overall Center policies. If relocation is required, we will allow for reasonable time and flexibility to make necessary arrangements.

This is an exciting research project and we’re hoping to find someone who can take a lead on the programming. Click through to the link for more details and to apply.

“Which user?”: When chunking your math and your code, what seems thickerer depends on where you’re coming from.

One of my favorite joke scenarios is the one about the comedians who give a number to each joke. To me, what’s funny about the number thing is that it’s true—or, to be precise, it’s true that sometimes all you have to do is remember the label for some story, and that’s enough to make you laugh, without the need to play through the entire narrative.

I was thinking about this recently when working with Bob Carpenter on a simple Stan program.

Here’s what I started with:

data {
  int J;
  int T;
  int shock[J,T];
}
transformed data {
  int prev_shock[J,T];
  int prev_avoid[J,T];
  for (j in 1:J){
    prev_shock[j,1] = 0;
    prev_avoid[j,1] = 0;
    for (t in 2:T){
      prev_shock[j,t] = prev_shock[j,t-1] + shock[j,t-1];
      prev_avoid[j,t] = prev_avoid[j,t-1] + 1 - shock[j,t-1];
    }
  }
}
parameters {
  real<lower=0, upper=1> a;
  real<lower=0, upper=1> b;
}
model {
  real p[J,T];
  for (j in 1:J){
    for (t in 1:T){
      p[j,t] = a^prev_shock[j,t] * b^prev_avoid[j,t];
      shock[j,t] ~ bernoulli(p[j,t]);
    }
  }
}

And here’s Bob’s cleaned-up version:

data {
  int<lower=0> J;
  int<lower=0> T;
  array[J, T] int<lower=0, upper=1> shock;
}
transformed data {
  matrix<lower=0>[J, T] prev_shock;
  matrix<lower=0>[J, T] prev_avoid;
  for (j in 1:J) {
    row_vector[T - 1] shock_j = to_row_vector(shock[j, 1:T - 1]);
    prev_avoid[j] = cumulative_sum(append_col(0, 1 - shock_j));
    prev_shock[j] = cumulative_sum(append_col(0, shock_j));
  }
}
parameters {
  real<lower=0, upper=1> a;
  real<lower=0, upper=1> b;
}
transformed parameters {
  matrix<lower=0, upper=1>[J, T] p = a.^prev_shock .* b.^prev_avoid;
}
model {
  for (j in 1:J) {
    shock[j] ~ bernoulli(p[j]);
  }
}

Looking at the change in code, I see a few things:
1. Bob changed int shock[J,T]; to array[J,T] int shock;
2. Bob added bounds on the data and transformed data.
3. Bob changed my looping into various vector things such as append_col, to_row_vector, and cumulative_sum.
4. Bob did something where he referred to shock[j] and p[j] which are either rows or columns of matrices.

Just to go through these quickly:
1. This is a fix in the Stan code, as I was using an old formulation for arrays.
2. This is a matter of taste, but I guess Bob’s approach is better here. Including these bounds explicitly serves as an error check: for example, if you feed it shock data other than 0’s and 1’s, the program will now return an error.
3. I have mixed feelings on this bit. I assume this is more efficient, and to Bob it’s cleaner; to me it’s much more confusing! I guess that when writing a case study, it will make sense to explain the pluses and minuses of the two formulations (a quick check that they compute the same thing follows this list).
4. I find this really confusing, the idea of using a single index to pull out one row or column of an array, so I’d much prefer to do this with two nested loops.
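
Here is a quick NumPy sketch, with made-up shock data, checking that the two formulations really do compute the same prev_shock: prepend a zero to the lagged series, then take a cumulative sum.

import numpy as np

shock_j = np.array([0, 1, 1, 0, 1])   # one row of the shock array (made up)

# Loop version (the formulation in my original program):
prev_shock_loop = np.zeros_like(shock_j)
for t in range(1, len(shock_j)):
    prev_shock_loop[t] = prev_shock_loop[t - 1] + shock_j[t - 1]

# Vectorized version (Bob's formulation, in NumPy terms):
prev_shock_vec = np.cumsum(np.concatenate(([0], shock_j[:-1])))

print(prev_shock_loop)   # [0 0 1 2 2]
print(prev_shock_vec)    # [0 0 1 2 2]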

Bob responded to my third and fourth points by saying that, to him, the matrix version is clearer. I get that. I like the loops and explicit indexing, but I can see these indexes also offer opportunities for bugs, for example if you type i instead of j somewhere. Similarly with the thing where you use a single index to pull out a single row or column of a matrix: to me it looks like a trick, but if, like Bob, you’re familiar with the programming language it’s a transparent step.

Example of the logistic transform

I’ll write logistic regression like this:

Pr(y=1) = logit^{-1} (a + bx).

But a lot of people will write:

Pr(y=1) = e^{a + bx} / (1 + e^{a + bx})

or some other thing like that. To me, the “logit^{-1}” or “invlogit” formulation is clearest: invlogit is that S-shaped curve with a slope of 1/4 at 0, and when its argument goes from -5 to 0 to 5, it goes from 0.01 to 0.5 to 0.99, and the ratio of exponentials thing is a bit confusing; indeed I think the best way to read it is to chunk it into “invlogit.” But to people who are less familiar with statistical notation, the thing with the exponentials is clearer. For this example, I’m the “Bob Carpenter” who has internalized the transformation and prefers the chunked version, while these other users are still stuck on the other side (as I am with vectors and matrices in Stan) so they find the clunky notation more explicit and clear.

I say this because when considering coding options, we sometimes talk about a tradeoff between computational efficiency and clarity for the user, and, yes, sometimes this tradeoff does exist, but often the answer to the question, “What is clear code for the user?”, depends on another question: “Which user?”

How to win the Sloan Sports hackathon

Stan developer Daniel Lee writes:

I walked in knowing a few things about the work needed to win hackathons:

– Define a problem.
If you can clearly define a problem, you’ll end up in the top third of the competition. It has to be clear why the problem matters and you have to communicate this effectively.

– Specify a solution.
If you’re able to specify a solution to the problem, you’ll end up in the top 10%. It has to be clear to the judges that this solution solves the problem.

– Implement the solution.
If you’ve gotten this far and you’re now able to actually implement the solution that you’ve outlined, you’ll end up in the top 3. It’s hard to get to this point. We’re talking about understanding the topic well enough to define a problem of interest, having explored enough of the solution space to specify a solution, then applying skills through focused effort to build the solution in a short amount of time. Do that and I’m sure you’ll be a finalist.

– Build interactivity.
If the judges can do something with the solution, specifically evaluate “what if” scenarios, then you’ve gone above and beyond the scope of a hackathon. That should get you a win.

Winning a hackathon takes work and focus. It’s mentally and physically draining to compete in a hackathon. You have to pace yourself well, adjust to different challenges as they come, and have enough time and energy at the end to switch context to present the work.

One additional note: the solution only needs to be a proof of concept and pass a smell test. It’s important to know when to move on.

Positive, negative, or neutral?

We’ve talked in the past about advice being positive, negative, or neutral.

Given that Daniel is giving advice on how to win a competition that has only one winner, you might think I’d call it zero-sum. Actually, though, I’d call it positive-sum, in that the goal of a hackathon is not just to pick a winner, it’s also to get people involved in the field of study. It’s good for a hackathon if its entries are good.

The story

Daniel writes:

I [Daniel] participated in the SSAC22 hackathon. I showed up, found a teammate [Fabrice Mulumba], and won. Here’s a writeup about our project, our strategy for winning, and how we did it.

The Data

All hackathon participants were provided data from the 2020 Stanley Cup Finals. This included:

– Tracking data for 40 players, the puck, and the referees. . . . x, y, z positions with estimates of velocity recorded at ~100 Hz. The data are from chips attached to jerseys and in the puck.

– Play-by-play data. Two separate streams of play-by-play data were included: hand-generated and system-generated. . . .

– Other meta data. Player information, rink information, game time, etc.

Data was provided for each of the 6 games in the series. For a sense of scale: one game has about 1.5M rows of tracking data with 1.5 GB of JSON files across the different types of data.

The Hackathon

There were two divisions for the Hackathon: Student and Open. The competition itself had very little structure. . . . Each team would present to the judges starting at 4 pm and the top teams would present in the finals. . . .

Daniel tells how it came together:

Fabrice and I [Daniel] made a pretty good team. But it almost didn’t happen.

Both Fabrice and I had competed in hackathons before. We first met around 8:30 am, half an hour before the hackathon started. As Fabrice was setting up, I saw that he had on an AfroTech sweatshirt and a Major League Hacking sticker on his laptop. I said hi, asked if he was competing alone, and if he was looking for a teammate. He told me he wanted to compete alone. I was hoping to find a teammate, but had been preparing to compete alone too. While it’s hard to do all the things above alone, it’s actually harder if you have the wrong teammate. We went our separate ways. A few minutes later, we decided to team up.

Something about the team felt right from the start. Maybe I was more comfortable teaming up with one of the few other POC in the room. Maybe there was a familiar cadence and vibe from having parents that immigrated to the US. Maybe it was knowing that the other had been through an intense working session in the past and was voluntarily going through it again. Whatever it was, it worked.

In the few days prior, I had spent a couple hours trying to gain some knowledge about hockey from friends that know the sport. The night before, I found a couple of people that worked for the LA Kings and asked questions about what they thought about and why. I came in thinking we should look at something related to goalie position. Fabrice came in wanting to work on a web app and focus on identifying a process within the game. These ideas melded together and formed the winning project.

For the most part, we worked on separate parts of the problem. We were able to split the work and trust that the other would get their part done. . . .

The Winning Project: Sloan Goalie Card

We focused on a simple question. Does goaltender depth matter?

Having access to x, y, z position of every player meant that we could analyze where the goalie was at the time when shots were taken. Speaking to some hockey people, we found out that this data wasn’t publicly available, so this would be one of the first attempts at this type of analysis.

In the allotted time, we pulled off a quick analysis of goalie depth and built the Sloan Goalie Card web app.

I don’t know anything about hockey so I can’t comment on the actual project. What I like is Daniel’s general advice.

P.S. I googled *how to win a hackathon*. It’s a popular topic, including posts going back to 2014. Some of the advice seems pretty ridiculous; for example one of the links promises “Five Easy Steps to Developer Victory”—which makes me wonder what would happen if two competitors tried this advice for the same hackathon. They couldn’t both win, right?

Stan downtown intern posters: scikit-stan & constraining transforms

It’s been a happening summer here at Stan’s downtown branch at the Flatiron Institute. Brian Ward and I advised a couple of great interns. Two weeks or so before the end of the internship, our interns present posters. Here are the ones from Brian’s intern Alexey and my intern Meenal.

Alexey Izmailov: scikit-stan

Alexey built a version of the scikit-learn API backed by Stan’s sampling, optimization, and variational inference. It’s plug and play with scikit-learn.

Meenal Jhajharia: unconstraining transforms

Meenal spent the summer exploring constraining transforms and how to evaluate them, with the goal of refining Stan’s transform performance and adding new data structures. This involved figuring out how to evaluate the transforms against target distributions (with respect to convexity, conditioning if convex, and sampling behavior in the tails, body, and near the mode of the target densities). Results are turning out to be more interesting than we suspected, in that different transforms seem to work better under different conditions. We’re also working with Seth Axen (Tübingen) and Stan devs Adam Haber and Sean Pinkney.

They don’t make undergrads like they used to

Did I mention they were undergrads? Meenal’s heading back to University of Delhi to finish her senior year and Alexey heads back to Brown to start his junior year! The other interns at the Center for Computational Mathematics, many of whom were undergraduates, have also done some impressive work in everything from using normalizing flows to improve sampler proposals for molecular dynamics to building 2D surface PDE solvers at scale to HPC for large N-body problems. In this case, not making undergrads like they used to is a good thing!

Hiring for next summer

If you’re interested in working on statistical computing as an intern next summer, drop me a line at [email protected]. I’ll announce when applications are open here on the blog.

 

Discussion on identifiability and Bayesian inference

A bunch of us were in an email thread during the work on the article, “Bayesian workflow for disease transmission modeling in Stan,” by Léo Grinsztajn, Liza Semenova, Charles Margossian, and Julien Riou, and the topic came up of when to say that a parameter in a statistical model is “identified” by the data.

We did not come to any conclusions, but I thought some of you might find our discussion helpful in organizing your own thoughts.

It started with Liza, who said:

Definitions are important when it comes to identifiability. There are a few methods in the literature addressing (or, at least, discussing) the issue, but it is not always specified which definition is being used.

The most important pair of definitions is probably structural identifiability and practical identifiability.

Practical non-identifiability is linked to the amount and quality of data. It answers the question of whether parameters can be estimated given available data.

Structural non-identifiability arises from the model structure and cannot be resolved by collecting more data, or better-quality data. Example: y = a * b. We can measure y as much as we want, but we will not be able to distinguish between a and b no matter the number of measurements. One might hope that the issue could be resolved either via reparametrisation or via sensible priors (i.e., by adding the Bayesian sauce). In this specific example, the issue is caused by the scaling symmetry of the equation: the equation does not change if we substitute a with k*a and b with (1/k)*b. I wonder whether knowledge of the type of symmetry can help us understand which priors would be “sensible,” as these priors would need to break the detected symmetries. (Everything above applies to any kind of model, but my focus at work at the moment is on ODEs. For ODE systems there is a lot of literature on analytical methods for finding symmetries, and the simplest of them can be applied to mid-size systems by hand.)

An intuition concerning Stan (the software) would be that it produces warnings in every case of non-identifiability. But I am starting to get the feeling that this is not the case. I am preparing an example of an ODE, dx/dt = – a1*a2*x, x(0) = x_0, with uniform priors U(0,10) on both a1 and a2, and max_treedepth = 15. No warnings. Convergence is good.
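
To make the flat direction in that ODE example concrete (a toy sketch, not Liza’s actual code): the solution is x(t) = x_0 exp(-a1*a2*t), so the likelihood depends on a1 and a2 only through their product and is exactly flat along a1*a2 = constant, no matter how much data you collect. Only the priors, or a reparametrisation in terms of k = a1*a2, can break the tie.

import numpy as np

rng = np.random.default_rng(0)

# Data from dx/dt = -a1*a2*x, x(0) = 1, with true a1*a2 = 1 and observation noise.
t = np.linspace(0.1, 3.0, 15)
y = np.exp(-1.0 * t) + rng.normal(0.0, 0.05, size=t.size)

def log_lik(a1, a2, sigma=0.05):
    mu = np.exp(-a1 * a2 * t)
    return -0.5 * np.sum((y - mu) ** 2) / sigma**2

# Exactly the same log likelihood everywhere along a1*a2 = 1:
for a1 in (0.5, 1.0, 2.0, 4.0):
    print(a1, round(log_lik(a1, 1.0 / a1), 3))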

A further list of definitions includes structural global identifiability, structural local identifiability, identifiability of the posterior, weak identifiability. Linked terms are observability and estimability.

My aim is to distil existing methods to test for identifiability and demonstrate them on a few examples, as well as to answer the questions emerging along the way, e.g., “Do Bayesian priors help?”, “Can knowledge of the type of invariance of the system help us choose the priors?”, “How do we test a system for identifiability, both analytically and numerically?”, “Can analytical methods help us find good model reparametrisations to avoid numerical issues?”.

I responded:

One insight that might be helpful is that identifiability depends on data as well as model. For example, suppose we have a regression with continuous outcome y and binary predictors x1,…,x10. If your sample size is low enough, you’ll have collinearity. Even with a large sample size you can have collinearity. So I’m not so happy with definitions based on asymptotic arguments. A relevant point is that near-nonidentifiability can be as bad as full nonidentifiability. To put it another way, suppose you have an informative prior on your parameters. Then nonidentifiability corresponds to data that are weak enough that the prior is dominant.
Also relevant is this paper, “The prior can often only be understood in the context of the likelihood,” also this, “The experiment is just as important as the likelihood in understanding the prior.”

Ben Bales added:

Identifiability problems are complicated things that are often the result of a failure.

Like if a car explodes, we can sit around and explain what happens once it has exploded (fire, smoke, etc.), but the fundamental problem of making a car that doesn’t explode is simpler (and more useful if we wanted to drive a car).

The last time I hit identifiability problems on the forums we worked it out with simulated data.

My reaction more and more to a model that doesn’t behave as expected is first we should find a model that does kinda behave well and go from there. The reason I think this is because finding a model that does work is a much more methodical, reliable thing than trying to figure out geometry or whatever is causing it. I tell people to look at posterior correlations and stuff to find non-identifiabilities, but I don’t know how far it’s ever gotten anyone.

I threw in:

Just to shift things slightly: I think that one of the problems with the statistical literature (including my own writings!) is how vague it all is. For example, we talk about weakly informative priors or weak identification, and these terms are not clearly defined. I think the appropriate way to go forward here is to recognize that “identification,” however defined, is a useful concept, but I don’t think that any strict definition of “identification” will be useful to us here, because in the Bayesian world, everything’s a little bit identified (as long as we have proper priors) and nothing is completely identified (because we’re not asymptotic). So maybe, rather than talking about identified or weakly identified or partially identified etc., we should be developing quantitative tools to assess the role of the prior and how it affects posterior inferences. Or something like that.

To which Liza replied:

Yes to quantitative tools. Some attempts in that direction have been made, but it seems that none of them has been widely adopted:

1). Percentage overlap between prior and posterior (Garrett and Zeger, 2000). In applications, they used a threshold of 35%: if the overlap is greater than the threshold, they considered the parameter in question weakly identifiable. The overlap is calculated using posterior samples and kernel smoothing. This approach has only been used when priors were uniform, and the threshold would be hard to calibrate for other priors. Moreover, there are examples where the posterior of a non-identifiable parameter differs from its prior (Koop et al, 2013). Also, the posterior and prior of an identifiable parameter can have high overlap – for instance, when the prior is informative and the data only confirm prior knowledge.

2). Xie and Carlin (2005) suggest two measures of Bayesian learning (computed as KL divergence): how much we can potentially learn about a parameter, and how much there is left to learn given data y. The computational algorithm involves MCMC.

3). Data cloning (Lele et al, 2010): a technique used to obtain MLE estimates in complex models using Bayesian inference. Supposedly, it is also useful for studying identifiability. The idea is that the dataset gets cloned K times (i.e., each observation is repeated K times), while the model remains unchanged. For estimable parameters the scaled variance (the ratio of the posterior variance at K > 1 to the posterior variance at K = 1) should, in theory, behave as 1/K. If a parameter is not estimable, the scaled variance is larger than 1/K. The method was proposed in a setting with uniform priors; the question of what happens if priors are not uniform remains open. I have applied data cloning to an example of two ODEs. The first ODE only contains the parameter k: x'(t) = -k*x; the second ODE is overparametrized and contains two parameters, lambda1 and lambda2: x'(t) = -lambda1*lambda2*x; the priors on the parameters are always the same, U(0,10). The plot of scaled variances is attached. Note that with the Stan setup I chose (max_treedepth = 15) there were no warnings.
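
As a toy numerical illustration of that data-cloning diagnostic (a sketch using exact conjugate formulas rather than MCMC, and normal priors rather than uniform ones): for an estimable parameter the scaled variance tracks 1/K, while for a redundant parameterization such as a sum a + b it levels off.

import numpy as np

n, sigma, tau = 20, 1.0, 1.0   # data size, data sd, prior sd

def scaled_var_estimable(K):
    # y ~ N(mu, sigma^2), mu ~ N(0, tau^2): posterior variance with K clones
    var = lambda k: 1.0 / (1.0 / tau**2 + k * n / sigma**2)
    return var(K) / var(1)

def scaled_var_redundant(K):
    # y ~ N(a + b, sigma^2), a, b ~ N(0, 1): only a + b is identified
    def var_a(k):
        c = k * n / sigma**2
        precision = np.array([[1 + c, c], [c, 1 + c]])
        return np.linalg.inv(precision)[0, 0]
    return var_a(K) / var_a(1)

for K in (1, 2, 4, 8, 16, 32):
    print(K, round(1 / K, 3),
          round(scaled_var_estimable(K), 3),   # tracks 1/K
          round(scaled_var_redundant(K), 3))   # flattens out instead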

Another question is the connection of identifiability and model selection: if an information criterion takes the number of parameters into account, shouldn’t it be the number of identifiable parameters?

An overall suggestion for the “identifiability workflow”: it should be a combination of analytical and numerical methods and include all or some of the steps:
1. apply analytical methods by hand, if applicable (e.g. testing scaling invariances is possible for mid-size ODE systems)
2. numerical methods:
(a) Hessian method (standardized eigenvalues which are too small correspond to redundant parameters, but picking the threshold is hard),
(b) simulation-based method (start with some true values, generate data, and fit the model, then calculate the coefficient of variation; it is small for identifiable parameters and large for nonidentifiable)
(c) profile likelihood method: a flat profile indicates parameter redundancy
3. symbolic methods

P.S. There is no consensus on the term either – both non-identifiability and unidentifiability are used.

Charles then said:

Thanks for pooling all these ideas. As Andrew mentions, it can be good to distinguish finite data cases and asymptotics. The intuitive way I think about non-identifiability is we can’t solve a * b = x for a and b. There is an infinite number of solutions, which in a Bayesian context suggests an infinite variance. So maybe thinking in terms of variance might be helpful? But we could survey problems scientists face and see how these motivate different notions of identifiability.

If I understand correctly, (1) measures how much we learn from the data / how much the posterior is driven by the prior. The data could be not very informative (relative to the prior) or (equivalently) the prior could be very informative (because it is well constructed or otherwise). I think this type of criterion goes back to the questionable idea that the prior shouldn’t influence the posterior. This becomes a question of robustness, because we can think of the prior as additional data with a corresponding likelihood (so putting in a bad prior is like adding bad data).

(2) sounds interesting too. Working out “how much there is to learn from more data” reminds me of a prediction problem for my qualifying exams. For simple enough models, the learning rate plateaus quickly and more data wouldn’t improve predictions. Maybe the idea here is that the posterior variance of the parameter doesn’t go to 0 with more data.

(3) also sounds very neat and relates to learning rates. I’m not sure what “estimable” and “non-estimable” mean here. But in a unif-normal case, I agree the posterior variance goes down with n. We could then ask what the expected rate is for multimodal distributions. A minor note: rather than cloning the data, can we simply lower the likelihood variance?

Job opportunity for a Stan user at Dugway

Scott Hunter writes:

Job opportunity as a statistician at US Army Dugway Proving Ground, in remote Dugway, Utah. Work is rewarding in that we solve problems to improve protecting our warfighters from chemical and biological attacks. Looking for someone comfortable coding in R and Stan (primarily using the rstan and rstanarm packages). Most problems deal with linear models as well as logistic models which may or may not be hierarchical. Design of Experiment experience is a plus as well as writing Shiny applications. Pay will depend on education/experience and ranges from $66,214 to $122,684 (GS11 – GS13 pay scale). Work week is Monday through Thursday with telework opportunities. If there is any interest, please send a resume soonest to [email protected].

I’ve done some Stan trainings for this group in the past and I really enjoyed working with them!

PPL Workshop this Friday (29 July 2022)

The Computational Abstractions for Probabilistic and Differentiable Programming (CAPP) Workshop is happening this Friday, 29 July 2022. It’s organized out of Cambridge and is running on UK time.

Program and Zoom link

The schedule of speakers is online and the free Zoom link is available under the Location tab of their web site:

Despite the title, the talks sound very practical. The speaker lineup includes developers from BUGS, JAGS, Stan, JAX, and Turing.jl.

My talk

I (Bob) will be talking about our plan to introduce Python-like comprehensions into the Stan language to deal with constant matrix constructions like Gaussian Process covariance matrices. This is motivated by our shift in underlying matrix data structure for autodiff in Stan—what the Stan developers have been calling “mat-var” in meetings and pull requests because it’s an autodiff variable with a matrix value and matrix adjoint rather than our original approach using a matrix of scalar autodiff variables.

P.S. This is a big reduction in memory pressure and time for larger-scale matrix ops for Stan. The really clever part is how the C++ devs (Tadej Ciglarič, Steve Bronder, Rok Češnovar, et al.) have managed to abstract the mat-var construct in a way that’s backward compatible with our existing autodiff with simpler code. If you like C++ and functional template metaprogramming, I urge you to follow along. Steve’s about to drop all the dev-facing doc for all of this—it’s currently in a massive PR undergoing final revisions.

Solving inverse problems using Bayesian inference with FFT in Stan, from Python . . . this example has it all!

Brian Ward, Bob Carpenter, and David Barmherzig write:

The HoloML technique is an approach to solving a specific kind of inverse problem inherent to imaging nanoscale specimens using X-ray diffraction.

To solve this problem in Stan, we first write down the forward scientific model given by Barmherzig and Sun, including the Poisson photon distribution and censored data inherent to the physical problem, and then find a solution via penalized maximum likelihood. . . .

The idealized version of HCDI (holographic coherent diffraction imaging) is formulated as:
Given a reference R and data Y = |F(X + R)|^2,
recover the source image X,
where F is an oversampled Fourier transform operator.

However, the real-world setup of these experiments introduces two additional difficulties. First, the data are measured from a limited number of photons, where the number of photons received by each detector is modeled as Poisson distributed with expectation given by Y_ij (referred to in the paper as Poisson shot noise). The expected number of photons each detector receives is denoted N_p. We typically have N_p < 10 due to the damage that radiation causes to the biomolecule under observation. Second, to prevent damage to the detectors, the lowest frequencies are removed by a beamstop, which censors low-frequency observations.
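
To make the data-generating process concrete, here is a minimal NumPy sketch of that forward model (a simplified illustration, not the authors’ Stan or Python code; the image, reference, oversampling factor, and beamstop size are all made up):

import numpy as np

rng = np.random.default_rng(0)

N, oversample, N_p, beamstop_radius = 32, 2, 5, 3   # hypothetical settings

X = rng.random((N, N))                   # hypothetical source image
R = np.zeros((N, N)); R[:, -4:] = 1.0    # hypothetical known reference

# Oversampled Fourier transform: zero-pad the object, then take the 2D FFT.
pad = (oversample - 1) * N
obj = np.pad(X + R, ((0, pad), (0, pad)))
Y = np.abs(np.fft.fft2(obj)) ** 2

# Scale so an average detector expects about N_p photons, then add Poisson shot noise.
rate = Y * (N_p / Y.mean())
photons = rng.poisson(rate)

# Beamstop: censor the lowest spatial frequencies around zero frequency.
freq = np.fft.fftfreq(oversample * N) * oversample * N
mask = (np.abs(freq)[:, None] < beamstop_radius) & (np.abs(freq)[None, :] < beamstop_radius)
observed = np.where(mask, -1, photons)   # -1 marks censored detectors

print(observed.shape, int((observed == -1).sum()), "censored detectors")

The inference problem then runs this map in reverse: given the censored photon counts, recover X.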

In this Python notebook, Ward et al. do it all. They simulate fake data from the model, they define helper functions for their Fourier transforms in Stan (making use of Stan’s fast Fourier transform function), they write the Bayesian model (including an informative prior distribution) directly as a Stan program, they run it from Python using cmdstanpy, then they do some experimentation to see what happens when they vary the sample size of the data. The version in the case study uses optimization, but the authors tell me that sampling also works, and they’re now working on the next step, which is comparison with other methods.

I wish this computational infrastructure had existed back when I was working on my Ph.D. thesis. It would’ve made my life so much simpler—I wouldn’t’ve had to write hundreds of lines of Fortran code!

In all seriousness, this is wonderful, from beginning to end, showing how Bayesian inference and Stan can take on and solve a hard problem—imaging with less than 10 photons per detector, just imagine that!

Also impressive to me is how mundane and brute-force this all is. It’s just step by step: write down the model, program it up, simulate fake data, check that it all makes sense, all in a single Python script. Not all these steps are easy—for this particular problem you need to know a bit of Fourier analysis—but they’re all direct. I love this sort of example where the core message is not that you need to be clever, but that you just need to keep your eye on the ball, and with no embarrassment about the use of a prior distribution. This is where I want Bayesian inference to be. So thanks, Brian, Bob, and David for sharing this with all of us.

P.S. Also this, which team Google might be wise to emulate:

Reproducibility

This notebook’s source and related materials are available at https://github.com/WardBrian/holoml-in-stan.

The following versions were used to produce this page:

%load_ext watermark
%watermark -n -u -v -iv -w
print("CmdStan:", cmdstanpy.utils.cmdstan_version())
Last updated: Fri Jul 01 2022

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.4.0

scipy     : 1.8.0
cmdstanpy : 1.0.1
numpy     : 1.22.3
IPython   : 8.4.0
matplotlib: 3.5.2

Watermark: 2.3.1

CmdStan: (2, 30)
The rendered HTML output is produced with
jupyter nbconvert --to html "HoloML in Stan.ipynb" --template classic --TagRemovePreprocessor.remove_input_tags=hide-code --CSSHTMLHeaderPreprocessor.style=tango --execute

Stan goes mountain climbing, also a suggestion of 3*Alex

Jarrett Phillips writes:

I came across this recent preprint by Drummond and Popinga (2021) on applying Bayesian modeling to assess climbing ability. Drummond is foremost a computational evolutionary biologist (as am I) who is also an experienced climber. The work looks quite interesting. I was previously unaware of such an application and thought others may also appreciate it.

I’m not a climber at all, but it’s always fun to see new applications of Stan. In their revision, I hope the authors can collaborate with someone who’s experienced in Bayesian data visualization and can help them make some better graphs. I don’t mean they should just take their existing plots and make them prettier; I mean that I’m pretty sure there are some more interesting and informative ways to display their data and fitted models. Maybe Jonah or Yair could help—they live in Colorado so they might be climbers, right? Or Aleks would be perfect: he’s from Slovenia where everyone climbs, and he makes pretty graphs, and then the paper would have 3 authors named some form of Alex. So that’s my recommendation.

Academic jobs in Bayesian workflow and decision making

This job post (with two research topics) is by Aki (I promise that next time I post about something else).

I’m looking for postdocs and doctoral students to work with me on Bayesian workflow at Aalto University, Finland. You can apply through a joint call (with many other related topics); there are separate application forms for postdocs and for doctoral students.

We’re also looking for postdocs and doctoral students to work on Probabilistic modeling for assisting human decision making, with Finnish Center for Artificial Intelligence funding. You can apply through a joint call (with many more probabilistic modeling topics); see the application form.

To get some idea of how we might approach these topics, you can check out what I’ve been talking about and working on recently.

For five years straight, starting in 2018, the World Happiness Report has singled out Finland as the happiest country on the planet.