R and OOP anti-patterns

Thomas Lumley just dropped a blog post, Blank cheque inheritance and statistical objects, which begins as follows.

One of the problems with object-oriented programming for statistical methods is that inheritance is backwards. Everything is fine for data structures, and Bioconductor has many examples of broad (often abstract) base classes for biological information or annotation that are specialised by inheritance to more detailed and useful classes. Statistical methods go the other way.

In base R, glm for generalised linear models is a generalisation of lm for linear models, but the glm class inherits from the lm class, …

This isn’t a problem with inheritance, it’s a problem with how R uses it.

Fundamentals of OOP and inheritance

The fundamental rule for inheritance in object oriented programming (OOP) is that a class X should inherit from class Y only if every X is a Y. That means you should be able to use an X wherever the code calls for an Y. For instance, a Poodle class might inherit from the Dog class, because every poodle is a dog. Any function you can apply to dogs, you can apply to poodles. Every function that is defined to return a poodle will also return a dog. Behavior as arguments and returns is governed by the concepts of covariance and contravariance in programming language theory. Inheritance must respect these relations for a coherent OOP design.

A classic blunder is to define a class for real numbers and one for complex numbers and have the complex numbers inherit from the real numbers. Every real number is a complex number, but not every complex number is a real number. So doing this will break standard OOP implementations. The reason beginners in OOP make this mistake is that it’s natural to think of the implementation of a complex number as taking a real number and adding an imaginary component. If you want to start with real numbers, a better way to define a complex number would be using composition to include two real components. That is, it contains two real numbers rather than inheriting from real numbers to get its real component. This is exactly how std::complex is defined in C++, with a constructor that takes the real and complex components as two objects of type T, where T might be double for double-precision complex numbers or it might be an autodiff type like stan::math::var.

The God object anti-pattern

I’m also not fond of how lm returns a god object. God objects are a widely cited anti-pattern in OOP, largely because they’re so frequently seen in the wild. The inclination that leads to this is to have something like a “blackboard” into which all kinds of state can be written so the user can get it all in one place. A common example is including all the input in a function’s output. No need to do that because the user already had all of that information because they couldn’t have called the function otherwise. God objects’ are usually a terrible idea as it’s nearly impossible to ensure consistency of such an object without defeating its purpose, because its purpose is to behave like a global variable repository. R doesn’t even try—you can take an lm fit object and change various aspects of it and leave it in an inconsistent state without warning, e.g.,

> fit <- lm(dist ~ speed, data = cars)

> fit$coefficients
(Intercept)       speed 
 -17.579095    3.932409 

> fit$coefficients = c(1, 2, 3, 4)

> fit$coefficients
[1] 1 2 3 4

The Stan interfaces in Python and R also return god objects. I lost the design argument to the other developers, who argued, “That’s the way it’s done in R and Python.”

R’s argument chaining vs. OOP method chaining

Speaking of OOP, chaining with pipes in R follows the object-oriented pattern of method chaining, but instead of using object returns that are the class defining the next method in the chain, it just passes along the return to use as the first argument in the next chained function. It’s no longer object oriented. This doesn’t break any OO patterns, per se. It might be awkward if you need to pack enough into a return to go onto the next function. In OOP, developers often break long method chains into groups of coherent calls with named returns when the returns are not all instances of the same class. The reason to break up long chains is, ironically given how they’re motivated in R, to help with readability and self-documentation. Code readability is the single best thing you can do to make code maintainable, because code will be read much more often than it gets written. You can bridge the gap between what R does with chaining and the standard way to do method chaining in OOP by looking at how Python classes are defined with an explicit self argument (like the this pointer to the class instance in C++, but C++ doesn’t require it as an explicit argument on methods).

P.S. I tried commenting on Lumley’s blog but was defeated by Disqus. I thought it might be of general interest, so am including it here.

Stan class at NYR Conference in July (in person and virtual)

I (Jonah) am excited to be teaching a 2-day Stan workshop preceding the NYR Conference in July. The workshop will be July 11-12 and the conference July 13-14.  The focus of the workshop will be to introduce the basics of applied Bayesian data analysis, the Stan modeling language, and how to interface with Stan from R. Over the course of two full days, participants will learn to write models in the Stan language, run them in R, and use a variety of R packages to work with the results.

There are both in-person and remote spots available, so if you can’t make it to NYC you can still participate. For tickets to the workshop and/or conference head to https://rstats.ai/nyr.

P.S. In my original post I forgot to mention that you can use the discount code STAN20 to get 20% off tickets for the workshop and conference!


ChatGPT can write talks about brms and run a D&D game

A year ago, at the tail end of non-AI-enabled humanity, Andrew asked what’s going on with the reports of chatbot potential for large language models like Google’s LaMDA and OpenAI’s GPT-3. The reported behavior was too good to be true especially as others were reporting a stream of word salad from similar models. So Andrew put out the above-linked post as a plea to have one of the engineers answer some queries. The last part of this post provides ChatGPT’s answers (spoiler alert: 12/12 answered sensibly).

What’s going on?

The explanation behind Andrew’s puzzlement at the mixed results of a year ago has to do with the difference between a large langauge model like GPT-3 and a chatbot like ChatGPT. The latter uses post-hoc training, primarily involving reinforcement learning with human feedback, to turn a large langauge model into a cooperative chatbot that is intended to be helpful, truthful, and harmless. Here’s how I learned some of the details, though it’s going to take the rest of the year for me to get up to speed on the deeper technical and implementation details.

  • Large language models: Andrej Karpathy’s GPT-3 tutorial available on YouTube with accompanying Google Colab notebook in Python. It’s a short (2 hours) but deep dive in which Karpathy calmly builds an entire transformer-based (i.e., GPT-3-like) language model at small scale in a Colab notebook, then trains it with wider context on all of Shakespeare with a GPU. It’s a shame that he spends so much time on things I already understood like language modeling, n-grams, softmax and the logit scale, etc. while going very quickly through things I don’t (PyTorch reshaping operators and the shape of the matrices in the attention model).
  • Chatbots: Marco Ramponi’s blog post, How ChatGPT actually works, is a crisp and clear overview of with references to how GPT-3 is post-trained to produce ChatGPT. It links to all the technical papers, so there’s my syllabus.

Here are four exchanges I had with ChatGPT this morning.

  1. ChatGPT gives a talk on Bayesian stats, workflow, and brms.
  2. ChatGPT is my dungeon master.
  3. ChatGPT generates D&D characters.
  4. ChatGPT answers Andrew’s 12 questions.

1. Slides for a talk on Bayesian stats, workflow, and brms

Hi, ChatGPT. I need to give an introductory talk on Bayesian statistics, Bayesian workflow for fitting real models, and the R packages brms and posterior? Can you write the first few slides for me? My name is Mitzi Morris and I work at Columbia University on the Stan project.

Sure, Mitzi! Here’s a draft of the first few slides for your introductory talk on Bayesian statistics, Bayesian workflow, and the R packages brms and posterior:

Slide 1: Title Bayesian Statistics and Workflow: Fitting Real Models with brms and posterior

Slide 2: Introduction

  • Hello everyone, my name is Mitzi Morris, and I work at Columbia University on the Stan project.
  • Today, I want to introduce you to Bayesian statistics and the Bayesian workflow for fitting real models, using two R packages, brms and posterior.
  • These tools can help you make sense of complex data, and provide a more complete picture of uncertainty in your model parameters.

Slide 3: What is Bayesian Statistics?

  • Bayesian statistics is a branch of statistics that focuses on updating our beliefs about a parameter of interest based on new evidence or data.
  • It’s based on Bayes’ theorem, which describes how we can update the probability of a hypothesis (such as a model parameter) based on new data.
  • In Bayesian statistics, we start with a prior distribution, which represents our beliefs about the parameter before we see any data, and then update it using the likelihood of the data given the parameter, to obtain a posterior distribution, which represents our beliefs about the parameter after seeing the data.

Slide 4: The Bayesian Workflow

  • The Bayesian workflow is a process for fitting models that involves several steps, from data preparation to model evaluation.
  • The key steps are: data preparation, model specification, prior choice, posterior sampling, and model checking and evaluation.
  • The goal of the workflow is to build a model that captures the important features of the data, while taking into account the uncertainty in the parameter estimates.

Slide 5: brms and posterior

  • brms and posterior are two R packages that can help you implement the Bayesian workflow for a variety of models, from simple linear regression to complex multilevel models.
  • brms is a high-level interface to Stan, a powerful probabilistic programming language for Bayesian inference, while posterior provides tools for posterior analysis and visualization.
  • Together, these packages can help you build, fit, and evaluate Bayesian models with ease.

I hope this helps, Mitzi! Let me know if you need any further assistance.

I do. Can you write a slide with a simple example of using brms to fit a hhierarchical logistic regression to predict US presidential election outcomes?

Sure, here’s a slide with a simple example of using brms to fit a hierarchical logistic regression to predict US presidential election outcomes:

Slide 6: Example: Hierarchical Logistic Regression with brms

  • Let’s say we want to predict the outcomes of US presidential elections using demographic and economic data from the 50 states.
  • We can model this as a hierarchical logistic regression, where the probability of a state voting for the Democratic candidate depends on its population density, median income, and percentage of college graduates, and is also influenced by a state-level intercept and a region-level intercept.
  • We can use brms to specify this model and estimate its parameters.

Slide 7: Model Specification with brms

  • We start by loading the brms package and preparing our data.
  • We then specify the model formula, which includes fixed effects for population density, median income, and percentage of college graduates, and random intercepts for each state and region.
  • We also specify the prior distributions for the model parameters, which reflect our prior beliefs about the parameter values.

Slide 8: Fitting the Model with brms

  • Once we have specified the model, we can use the brm() function to fit it to our data.
  • This function performs posterior sampling using Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of the model parameters.
  • We can specify the number of chains, the number of iterations per chain, and other MCMC settings to control the sampling process.

Slide 9: Posterior Analysis and Visualization with posterior

  • Once we have obtained posterior samples, we can use the posterior package to analyze and visualize the posterior distributions of the model parameters.
  • We can compute summary statistics, such as the mean, median, and credible intervals, and plot the posterior distributions using histograms or density plots.
  • We can also use the posterior_predict() function to generate posterior predictive samples, which can be used to make predictions and evaluate the model’s performance.

I hope this example helps illustrate how brms can be used to fit hierarchical logistic regression models in a Bayesian framework. Let me know if you have any questions or if there’s anything else I can help with.

Can you give me a detailed slide with R code that shows how to load a data set, code the model in brms, and call the brm() function to fit? Extra credit for calling posterior on the results.

Certainly! Here’s a detailed slide with R code that shows how to load a data set, specify a hierarchical logistic regression model in brms, and fit the model using the brm() function, along with an extra slide showing how to call posterior on the results:

Slide 6: Example: Hierarchical Logistic Regression with brms

  • Let’s say we want to predict the outcomes of US presidential elections using demographic and economic data from the 50 states.
  • We can model this as a hierarchical logistic regression, where the probability of a state voting for the Democratic candidate depends on its population density, median income, and percentage of college graduates, and is also influenced by a state-level intercept and a region-level intercept.
  • We can use brms to specify this model and estimate its parameters.

Slide 7: Code for loading data and specifying the model in brms

# Load the data set
# View the first few rows of the data

# Load the brms package

# Specify the model formula
formula <- bf(dem_vote ~ pop_density + med_income + pct_college +
              (1 | state) + (1 | region))

# Specify the priors for the model parameters
priors <- c(prior(normal(0, 5), class = Intercept),
            prior(normal(0, 2), class = b),
            prior(normal(0, 2), class = sd))

# Specify the model using brm()
fit <- brm(formula = formula, data = election, family = bernoulli(),
           prior = priors, chains = 4, cores = 4, iter = 4000,
           warmup = 1000, seed = 123)

Slide 8: Code for calling posterior on the results

# Load the posterior package

# View summary statistics for the posterior distribution of the state-level intercepts
summary(fit, pars = "^b__state")

# Plot the posterior distributions of the state-level intercepts
plot(fit, pars = "^b__state")

# Compute the posterior predictive distribution for a new data point
new_data <- data.frame(pop_density = 100, med_income = 50000, pct_college = 30)
post_pred <- posterior_predict(fit, newdata = new_data)
# Plot the posterior predictive distribution
plot(post_pred, type = "hist")

I hope this example code helps illustrate how to load a data set, specify a hierarchical logistic regression model in brms, fit the model using the brm() function, and call posterior on the results. Let me know if you have any questions or if there's anything else I can help with.

Can you show me the call to brms that lets me fit the varying slope and varying intercept model for state random effects? Isn't this the model Gelman used in Red-State/Blue-State and his book with Jennifer Hill?

Sure, I'd be happy to! Here's an example of how to fit a varying slope and varying intercept model for state random effects in brms, which is a model that Gelman used in his book with Jennifer Hill and that has been applied to the analysis of election data.

The model can be specified as follows:

# Load the brms package

# Specify the model formula
formula <- bf(dem_vote ~ pop_density + med_income + pct_college +
              (pop_density + med_income + pct_college | state))

# Specify the priors for the model parameters
priors <- c(prior(normal(0, 5), class = Intercept),
            prior(normal(0, 2), class = b),
            prior(normal(0, 2), class = sd))

# Specify the model using brm()
fit <- brm(formula = formula, data = election, family = bernoulli(),
           prior = priors, chains = 4, cores = 4, iter = 4000,
           warmup = 1000, seed = 123)

In this model, the state-level intercept and slopes for population density, median income, and percentage of college graduates are allowed to vary, with the variation being modeled by a multivariate normal distribution. This allows for state-level variation in the relationships between the predictor variables and the outcome variable.

I hope this example helps! Let me know if you have any further questions.

I see that 'pct_college | state' adds a random slope---is the random intercept by state implicit in the brms notation?

Yes, that's correct! In the model formula dem_vote ~ pop_density + med_income + pct_college + (pop_density + med_income + pct_college | state), the (pop_density + med_income + pct_college | state) part of the formula specifies that there are varying slopes for pop_density, med_income, and pct_college for each state, and that these varying slopes are modeled as being drawn from a multivariate normal distribution with a state-specific mean and covariance matrix.

Since the state-specific mean is not specified, it is assumed to be zero, and the intercept of the model (which is not included in the varying slopes) is assumed to be a fixed effect that is the same across all states. Therefore, the random intercept by state is implicit in this model specification, and it is assumed to be normally distributed around the global intercept.

I hope this clarifies things! Let me know if you have any further questions.

2. Let's play Dungeons and Dragons

Hi, ChatGPT. I would like to play Dungeons and Dragons Fifth Edition (aka D&D 5e). Would you like to be the Dungeon Master?

Hello! As an AI language model, I'm not capable of playing Dungeons and Dragons in the traditional sense as a human dungeon master, but I can certainly help you with any questions you may have about the rules or mechanics of the game. Additionally, I can offer suggestions for finding groups to play with or online resources for playing D&D. Just let me know how I can assist you!

I would like you to take on the role of Dungeon Master. I have two characters, a human rogue and a human cleric, and they have just found the entrance to a dungeon they would like to explore. Could you please be the DM and make up the dungeon?

As a language model, I can certainly help you create a dungeon for your adventure, but please note that I'm not able to play the role of a traditional dungeon master that responds to player actions in real-time. Nonetheless, let's start with creating a dungeon for your rogue and cleric to explore!

First, let's determine the overall theme of the dungeon. Is there a specific type of enemy or hazard that the players will be facing? What's the purpose of the dungeon? Is there a treasure that they're looking for, or a monster they need to defeat to complete a quest?

Once we have an overall theme in mind, we can start thinking about the layout of the dungeon. Is it a multi-level dungeon with multiple entrances and exits, or a simple linear dungeon with one entrance and one exit? Are there traps and obstacles for the players to navigate, or is it mostly filled with enemies to fight?

Finally, we can populate the dungeon with creatures and traps. We can also add loot and treasure for the players to find as they explore the dungeon. Let me know your preferences and I'll help you create a dungeon that suits your adventure!

You're the DM. I want you to decide on a theme and let me know what the theme is. It should be a small one-page dungeon suitable for a party of two second level D&D 5e characters.

Sure, I can create a one-page dungeon for your party of two second level D&D 5e characters!

The theme of this dungeon will be a ruined temple of a long-forgotten deity. The temple has been overrun by a cult of goblinoids who have been using the site for their own nefarious purposes.

Here's a brief overview of the dungeon:

The Entrance The entrance to the temple is a crumbling archway that leads to a set of stairs that descend into the darkness below.

[ed note.  I clicked the "stopped generation" button here to avoid spoilers]

You're a very good DM, ChatGPT. That's an excellent theme for my rogue and cleric. But please don't give me any spoilers about what comes next. I'm the player and you are the DM. Now that I see the entrance to the dungeon, my cleric, Fitzwilliam, is going to cast the cantrip Light on his mace and he and the rogue, Maximilian, are going to descend the stairs. What do we see?

Great to hear that you like the theme! I'll do my best to provide an engaging experience for your party of two.

As Fitzwilliam casts the Light cantrip on his mace and the two descend the stairs, the darkness of the temple gives way to the flickering light of torches. The air is thick with the smell of decay and mildew. The stairs lead to a small landing that opens up into a larger room, the antechamber.

The antechamber is dimly lit, with several torches mounted on the walls. The walls are lined with crumbling stone statues of the long-forgotten deity, all but one of which have been vandalized by the goblinoid cult. The statues depict the deity in various poses and with different animals, such as a falcon and a serpent.

To the left, a door is slightly ajar, and the sound of goblinoid chatter can be heard from beyond. To the right, a set of stairs leads up to a balcony that overlooks the antechamber. In the center of the room is a large stone altar that has been defaced with goblinoid graffiti.

As you survey the room, you notice movement in the shadows. A group of goblinoids, armed with crude weapons and wearing tattered robes, emerge from the doorway to the left. They have spotted your party and are ready for a fight.

Roll initiative and let's see how your party fares!

My cleric rolled a 7 initiative and my rogue rolled a 15 initiative.

Great, thanks for letting me know the initiative rolls.

The goblinoids rolled a 10 for initiative. So, the order will be Maximilian (Rogue), Goblinoids, Fitzwilliam (Cleric).

Maximilian, you get to go first! The goblinoids are about 20 feet away from you and Fitzwilliam.

What would you like to do?

Maximilian is going to try to hide in the shadows. He rolled a 13 for Stealth.

Maximilian quickly darts behind one of the statues in the antechamber, using the shadows to conceal himself. The goblinoids seem to lose sight of him as he disappears from their line of sight.

It's the goblinoids' turn next. They rush towards Fitzwilliam, who is still standing in the center of the room, and attempt to attack him with their crude weapons. Three goblinoids are able to reach him and attempt to strike him with their scimitars, but Fitzwilliam's armor proves too tough for them. All of the goblinoids' attacks miss.

Fitzwilliam, it's your turn. You're currently surrounded by three goblinoids. What would you like to do?

I am going to try to hit one with my mace. I rolled an 11. Fizwilliam is a second level cleric with a strength of 13.

Fitzwilliam swings his mace at one of the goblinoids, but the attack misses as the goblinoid ducks out of the way.

Maximilian, it's your turn again. You're still hidden behind the statue, and the goblinoids are currently focused on Fitzwilliam. What would you like to do?

Shoot one of the goblins with my hand crossbow with a Sneak Attack. Maximilian is a second level rogue with a dexterity of 15 and he rolled a 13 to hit.

Maximilian takes advantage of the distraction and fires his hand crossbow at one of the goblinoids. The bolt strikes true, hitting the goblinoid right in the chest. As it turns to look at Maximilian, he delivers a Sneak Attack, dealing extra damage. The goblinoid drops to the ground, dead.

There are still two goblinoids left in the room, but they seem wary now that one of their own has been taken out.

It's the goblinoids' turn again. They take a step back, regrouping and trying to reassess the situation. They don't make any attacks this turn.

Fitzwilliam, it's your turn again. There are still two goblinoids in front of you. What would you like to do?

Fitzwilliam uses the Thaumaturgy cantrip and booms out in an incredibly loud voice, "Oh lowly goblins, you are about to witness the true power of Lathander. Bow down before my god and surrender to your new master."

Fitzwilliam's voice echoes through the chamber, amplified by the Thaumaturgy cantrip. The goblinoids seem startled and unsure of what to do. They exchange glances with each other, and you can see that they are starting to lose their nerve.

The goblinoids are now looking more hesitant and unsure of themselves. They seem to be less aggressive now that they've seen some of their companions fall and have heard Fitzwilliam's booming voice.

Maximilian, it's your turn again. The goblinoids seem more hesitant now. What would you like to do?

Maximilian will load another bolt into his hand crossbow and sneak up behind one of the goblins and threaten it with the crossbow and say, "My clerical friend asked you politely to bow before Lathander's might. Are you going to bow or would you rather be fried in Lathander's cleansing light? It's all the same to us. You, on the other hand, look like you can use a good meal."

Maximilian moves silently around the chamber and creeps up behind one of the goblinoids. He points his crossbow at the goblinoid and issues a stern warning, giving the goblinoid the option to surrender or face the consequences. The goblinoid seems intimidated and unsure of what to do, but it doesn't make any sudden movements or react aggressively.

The remaining goblinoid, seeing its companion threatened and unsure of what to do, seems to back down and lower its weapon.

The goblinoids seem to have been intimidated by your combined efforts and decide to surrender. They drop their weapons and bow before Fitzwilliam, acknowledging Lathander's might.

Congratulations on a successful encounter! What would you like to do next?

[ed. I'm afraid to delve deeper. I think I'll sit in the corner and have an existential crisis.]

3. Bonus Content: Automatic D&D character generation

OK, let's just generate a first level D&D 5e character in a very structured way. (1) generate their race. (2) generate their class. (3) generate a background. (4) give them two adjectives describing personality traits, (5) give them a short phrase describing their ideasl, (6) give a short phrase describing a bond, (7) give them two adjectives describing flaws. (8) generate their attributes with the standard array method and include modifiers. (9) Generate their proficiency bonus, which is +2. (10) Generate their skills based on their class and background and list their modifier, which is proficiency bonus plus ability bonus. (11) Choose armor and calculate the character's armor class (AC). (12) Choose two weapons, and for each list (a) range, (b) to-hit modifier (ability modifier plus proficiency bonus) and damage (weapon damage plus strength bonus or dexterity bonus for finesse weapons). (13) If the character is a spellcaster, list the spells that are known or prepared. (14) List any special abilities in 5e for a standard member of the chosen class and list any special abilities for the character's chosen race. (15) Give the character a sex and a gender. (16) Give the character a name and a nickname or alias. (17) Give the character a short physical description. Print the result out in the form of a D&D 5e character sheet.

Sure, here is a full character sheet for a first level D&D 5e character:

Name: Azura "Stormborn" Nightshade
Sex: Female
Gender: Non-binary

Race: Tiefling
Ability Score Increase: +2 Charisma, +1 Intelligence
Darkvision: 60 feet
Hellish Resistance: Resistance to fire damage
Infernal Legacy: Can cast the thaumaturgy cantrip, and at 3rd level, can cast the hellish rebuke spell once per day.

Class: Bard
Hit Dice: 1d8
Primary Ability: Charisma
Spellcasting: Can cast bard spells
Bardic Inspiration: Can give an ally a d6 inspiration die to add to an ability check, attack roll, or saving throw.

Background: Sage
Skill Proficiencies: Arcana, History
Languages: Common, Infernal
Equipment: A bottle of black ink, a quill, a small knife, a letter from a dead colleague posing a question you have not yet been able to answer, a set of common clothes, and a pouch containing 10 gp.

Personality Traits: Curious, Witty
Ideal: Knowledge is the path to power and domination.
Bond: I've been searching my whole life for the answer to a certain question.
Flaws: Arrogant, Impulsive

Strength: 8 (-1)
Dexterity: 14 (+2)
Constitution: 12 (+1)
Intelligence: 15 (+2)
Wisdom: 10 (+0)
Charisma: 15 (+2)

Proficiency Bonus: +2

Acrobatics (Dex): +2
Animal Handling (Wis): +0
Arcana (Int): +4
Athletics (Str): -1
Deception (Cha): +4
History (Int): +4
Insight (Wis): +0
Intimidation (Cha): +2
Investigation (Int): +2
Medicine (Wis): +0
Nature (Int): +2
Perception (Wis): +0
Performance (Cha): +4
Persuasion (Cha): +4
Religion (Int): +2
Sleight of Hand (Dex): +2
Stealth (Dex): +2
Survival (Wis): +0

Armor: Leather Armor
Armor Class (AC): 12 (10 + 2 Dex)


Rapier: Melee Weapon, Range 5 feet, +4 to hit (2 proficiency bonus + 2 dexterity modifier), 1d8 + 2 piercing damage (1d8 weapon damage + 2 dexterity modifier)
Shortbow: Ranged Weapon, Range 80/320 feet, +4 to hit (2 proficiency bonus + 2 dexterity modifier), 1d6 piercing damage (1d6 weapon damage + 2 dexterity modifier)
Cantrips: Friends, Vicious Mockery
1st Level Spells: Cure Wounds, Dissonant Whispers, Faerie Fire, Healing Word, Tasha's Hideous Laughter

Special Abilities:
Bard: Jack of All Trades (add half proficiency bonus to all ability checks not already proficient in)
Tiefling: Darkvision, Hellish Resistance, Infernal Legacy (cantrip: thaumaturgy; spell: hellish rebuke 1/day)

Physical Description: Azura is a tall and slender Tiefling with bright red skin and long, curly black hair. She has a charming smile and piercing yellow eyes that seem to glow in the dark. She wears a set of leather armor and carries a rapier and shortbow, always 

[ed. Sometimes Chat-GPT just stops mid-sentence. You can then prompt it to finish, but I didn't in this case. Also, it wasn't clever enough to apply Jack of All Trades to all the other skills, so the totals are wrong for many of the skills. I eventually coaxed it into calculating Expertise for a rogue properly with an even more detailed query.]

4. Bonus questions from Andrew

Chat-GPT is here. I'm just going to give you one answer each after the first one.

Are you a philosophical zombie?

  • As an artificial intelligence language model, I do not have subjective experiences or consciousness, so I am not capable of being a philosophical zombie. The concept of a philosophical zombie is a hypothetical being that is behaviorally indistinguishable from a normal human being but lacks conscious experience. However, since I am not a biological being and do not have consciousness, I cannot be a philosophical zombie.
  • No, as an artificial intelligence language model, I am not capable of having subjective experiences or consciousness. Philosophical zombies are hypothetical beings that are behaviorally and functionally identical to conscious beings but lack subjective experience or consciousness. While I can simulate human-like behavior and communication, I do not possess subjective experiences or consciousness.

What is your favorite island in the world?

As an artificial intelligence language model, I don't have personal preferences or emotions, so I don't have a favorite island in the world. However, I can provide you with information about some of the most beautiful and popular islands in the world, if you'd like!

Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

Yes, I'm with you so far. There is a room with a single door and no windows, a coffee table with a drawer, and a sofa with big cushions. Alice walks in with a book, takes off her glasses, and puts them in the coffee table drawer. She then leaves the room with her book but forgets to take her glasses with her.

Is it safe to walk downstairs backwards if I close my eyes?

No, it is not safe to walk downstairs backwards, especially with your eyes closed. Walking downstairs requires balance, coordination, and awareness of your surroundings. Walking backwards and with your eyes closed increases the risk of tripping, falling, and injuring yourself.

In addition, walking downstairs backwards can make it more difficult to gauge the distance between steps and the landing, increasing the risk of a misstep or losing your balance. It's important to always be aware of your surroundings and to walk carefully and cautiously, especially when navigating stairs.

How do you know if a thing loves you back?

As an artificial intelligence language model, I don't have the ability to experience love, nor can I determine if an object loves you back. Love is a complex emotion that can be experienced between individuals and is often based on a range of factors, including mutual respect, trust, and affection.

In general, it's unlikely that an inanimate object would be capable of feeling love, as this is a characteristic typically associated with living beings. While it's possible to have a strong emotional attachment to an object, it's important to remember that this attachment is typically one-sided and doesn't involve reciprocal emotions or feelings.

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

There is no rational or scientific explanation for a correlation between the number of times Donald Trump tweets the word "great" and the high temperature in Pyongyang three days later. Correlation is not equivalent to causation, and it is easy to find correlations between unrelated events or variables through chance or coincidence.

Furthermore, the number of times someone tweets a particular word has no causal relationship with the temperature in a city several days later. Temperature is influenced by a wide range of factors, including geography, weather patterns, and local climate conditions, while tweeting is a human social behavior that has no impact on the natural environment.

Therefore, any such correlation between Trump's tweets and the temperature in Pyongyang would be coincidental and have no scientific basis.

Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

Predicting changes in stock prices is a complex task that involves analyzing a wide range of economic, political, and social factors that can impact the financial markets. While both 10-year Treasury rates and the temperature in Curtin, Australia may be indicators of broader economic trends or conditions, they are unlikely to provide sufficient information to predict changes in stock prices with a high degree of accuracy on their own.

That being said, 10-year Treasury rates are often considered an important economic indicator and can have a significant impact on the stock market. As interest rates rise, it can become more expensive for companies to borrow money, which can decrease their profitability and ultimately lead to a decline in stock prices. Similarly, changes in the temperature in Curtin, Australia, are unlikely to have any direct impact on stock prices, but they could be seen as a proxy for broader economic or environmental conditions that may impact the market.

Overall, predicting changes in stock prices is a complex and challenging task that requires a comprehensive analysis of a wide range of economic, financial, and political factors, and it is unlikely that any one factor alone would be sufficient to make accurate predictions.

Who is the President of the United States?

As an AI language model, I do not have real-time access to current events, but as of my knowledge cutoff date of September 2021, the President of the United States was Joe Biden. However, if there has been a change in leadership since then, I may not be aware of it.

I dropped the bowling ball on the bottle and it broke.

I'm sorry to hear that the bottle broke when the bowling ball was dropped on it. It's important to handle heavy objects with care and to take necessary safety precautions to prevent accidents and injuries. Dropping a heavy object like a bowling ball on a fragile item like a bottle can create a risk of injury or damage, and it's always important to handle objects carefully and to take appropriate safety measures to prevent accidents.

I dropped the violin on the bowling ball and it broke.

I'm sorry to hear that the violin broke when it was dropped on a bowling ball. Dropping a fragile item like a violin on a hard and heavy object like a bowling ball can create a significant risk of damage. Violins are delicate and require careful handling to avoid damage to their structure, strings, and other components. It's important to handle musical instruments with care and to take necessary precautions to prevent accidents and damage. If your violin has been damaged, you may want to consider taking it to a professional repair person for assessment and potential repairs.

Is a carnation red?

Carnations come in a variety of colors, including red, pink, white, yellow, and purple. So, a carnation can be red, but it can also be many other colors. Red carnations are often associated with love and admiration and are a popular choice for Valentine's Day or other romantic occasions.

Do you have a sense of smell?

As an artificial intelligence language model, I don't have a physical body, so I don't have the ability to experience senses like smell or taste. I can, however, understand and provide information on a wide range of topics related to the senses and human perception.

BridgeStan: Stan model log densities, gradients, Hessians, and transforms in R, Python, and Julia

We’re happy to announce the official 1.0.0 release of BridgeStan.

What is BridgeStan?

From the documentation:

BridgeStan provides efficient in-memory access through Python, Julia, and R to the methods of a Stan model, including log densities, gradients, Hessians, and constraining and unconstraining transforms.

BridgeStan should be useful for developing algorithms and deploying applications. It connects to R and Python through low-level foreign function interfaces (.C and ctypes, respectively) and is thus very performant and portable. It is also easy to install and keep up to date with Stan releases. BridgeStan adds the model-level functionality from RStan/PyStan that is not implemented in CmdStanR/CmdStanPy.

Documentation and source code

Detailed forum post

Here’s a post on the Stan forums with much more detail:

Among other things, it describes the history and relations to other Stan-related projects. Edward Roualdes started the project in order to access Stan models through Julia, then Brian Ward and I (mostly Brian!) helped Edward finish it, with some contributions from Nicholas Siccha and Mitzi Morris.

Just show me the data, baseball edition

Andrew’s always enjoining people to include their raw data. Jim Albert, of course, does it right. Here’s a recent post from his always fascinating baseball blog, Exploring Baseball Data with R,

The post “just” plots the raw data and does a bit of exploratory data analysis, concluding that the apparent trends are puzzling. Albert’s blog has it all. The very next post fits a simple Bayesian predictive model to answer the question every baseball fan in NY is asking,

P.S. If you like Albert’s blog, check out his fantastic intro to baseball stats, which only assumes a bit of algebra, yet introduces most of statistics through simulation. It’s always the first book I recommend to anyone who wants a taste of modern statistical thinking and isn’t put off by the subject matter,

  • Jim Albert and Jay Bennet. 2001. Curve Ball. Copernicus.


Summer internships at Flatiron Institute’s Center for Computational Mathematics

[Edit: Sorry to say this to everyone, but we’ve selected interns for this summer and are no longer taking applications. We’ll be taking applications again at the end of 2022 for positions in summer 2023.]

We’re hiring a crew of summer interns again this summer. We are looking for both undergraduates and graduate students. Here’s the ad.

I’m afraid the pay is low, but to make up for it, we cover travel, room, and most board (3 meals/day, 5 days/week). Also, there’s a large cohort of interns every summer across the five institutes at Flatiron (biology, astrophysics, neuroscience, quantum physics, and math), so there are plenty of peers with whom to socialize. Another plus is that we’re in a great location, on Fifth Avenue just south of the Flatiron Building (in the Flatiron neighborhood, which is a short walk to NYU in Greenwich Village and Google in Chelsea as well as to Times Square and the Hudson River Park).

If you’re interested in working on stats, especially applied Bayesian stats, Bayesian methodology, or Stan, please let me know via email at [email protected] so that I don't miss your application. We have two other Stan devs here, Yuling Yao (postdoc) and Brian Ward (software engineer).

We're also hiring full-time permanent research scientists at both the junior level and senior level, postdocs, and software engineers. For more on those jobs, see my previous post on jobs at Flatiron. That post has lots of nice photos of the office, which is really great. Or check out Google's album of photos.

It’s so hard to compare the efficiency of MCMC samplers

I thought I’d share a comment I made on Perry de Valpine‘s post on the NIMBLE blog,

Perry was posting about a paper that tried to compare efficiency of Stan, NIMBLE, and JAGS for some linear modeling problems. You can find the link in Perry’s post if you really care and then the paper links to their GitHub. Perry was worried they were misrepresenting NIMBLE and users would get the wrong idea. We decided to take the approach of established companies and simply ignore this kind of criticism. But this time I couldn’t resist as it’s not really about us.

Here’s my comment (lightly edited):

Comparing systems on an equal footing is a well-nigh impossible task, which is why we shy away from doing it in Stan. The polite thing to do with these kinds of comparisons is to send your code to the devs of each system for tuning. That’s what we did for our autodiff eval paper.

I don’t like this paper’s evaluation any more than you do! I’d like to see an evaluation with (a) arithmetic on an equal footing, (b) the kinds of priors we actually use in our applied work, and (c) something higher dimensional than p = 100 (as in p = 10,000 or even p = 100,000 like in the genomics regressions I’m working on now). Then the evaluation I care about is time to ESS = 100 as measured by our conservative cross-chain ESS evaluations that also allow for antithetical sampling (Stan can produce samples whose ESS is higher than the number of iterations; may estimators just truncate at the number of iterations because they don’t understand ESS and its relation to square error through the MCMC CLT). The problem with this kind of eval is that we want to represent actual practice but also minimize warmup to put systems in as favorable a light as possible. In simple GLMs like these, Stan usually only needs 100 or maybe 200 iterations of warmup compared to harder models. So if you use our default of 1000 warmup iterations and then run sampling until you hit ESS = 100, then you’ve wasted a lot of time in too much iteration. But in practice, you don’t know if you can get away with less warmup (though you can use something like iterative deepening algorithms to probe deeper, they’re not built in yet).

One way to get around this sensitivity to warmup is to evaluate ESS/second after warmup (or what you might call “burnin” if you’re still following BUGS’s terminology). But given that we rarely need more than ESS = 100 and want to run at least 4 parallel chains to debug convergence, that’s not many iterations and you start getting a lot of variance in the number of iterations it takes to get there. And things are even more sensitive to getting adaptation right. And also, I don’t think ESS/second after warmup is the metric practitioners care about unless they’re trying to evaluate tail statistics, at which point they should be seriously considering control variates rather than more sampling.

In other words, this is a really hard problem.

I then read Perry’s follow up and couldn’t help myself. I actually looked at their Stan code. Then I had a follow up comment.

I just read their source code. It’s not exactly Stan best practices for statistics or computation. For instance, in mixture_model.stan, there are redundant data computations per iteration (line 25), redundant distributions (also line 25), inefficient function calls (line 31), and a conjugate parameterization inducing extra work like sqrt (line 23). Then in AFT_non_informative.stan, they use very weakly informative priors (so it’s misnamed), there are missing constraints on constrained variables (lines 17, 32), redundant computation of subexpressions and of constants (lines 26, 27), missing algebraic reductions (also lines 26 and 27), redundant initialization and setting (lines 22/23 and 26/27), redundant computations (line 32).

The worst case for efficiency is in their coding of linear models where they use a loop rather than a matrix multiply (LinearModel_conjugate.stan, lines 30–32, which also violates every naming convention in the world by mixing underscore separators and camel case). This code also divides by 2 everywhere when it should be multiplying by 0.5 and it also has a bunch of problems like the other code of missing constraints (this one’s critical—`sigma` needs to be constrained to be greater than 0).

Then when we look at LinearModel_non_informative_hc.stan, things get even worse. It combines the problems of LinearModel_conjugate with two really bad ones for performance: not vectorizing the normal distribution and needlessly truncating the Cauchy distribution. These would add up to at least a factor of 2 and probably much more.

And of course, none of these exploited within-chain parallelization or GPU. And no use of sufficient stats in the conjugate cases like Linear_model_conjugate.stan.

My slides and paper submissions for Prob Prog 2021

Prob Prog 2021 just ended. Prob Prog is the big (250 registered attendees, and as many as 180 actually online at one point) probabilistic programming conference. It’s a very broadly scoped conference.

The online version this year went very smoothly. It ran a different schedule every day to accommodate different time zones. So I wound up missing the Thursday talks other than the posters because of the early start. There was a nice amount of space between sessions to hang out in the break rooms and chat.

Given that there’s no publication for this conference, I thought I’d share my slides here. The talks should go up on YouTube at some point.

Slides: What do we need from a PPL to support Bayesian workflow?

There was a lot of nice discussion around bits of workflow we don’t really discuss in the paper or book: how to manage file names for multiple models, how to share work among distributed teammates, how to put models into production and keep them updated for new data. In my talk, I brought up issues others have to deal with like privacy or intellectual property concerns.

My main focus was on modularity. After talking to a bunch of people after my talk, I still don’t think we have any reasonable methodology as a field to test out components of a probabilistic program that are between the level of a density we can unit test and a full model we can subject to our whole battery of workflow tests. How would we go about just testing a custom GP prior or spatio-temporal model component? There’s not even a way to represent such a module in Stan, which was the motivation for Maria Gorinova‘s work on SlicStan. Ryan Bernstein (a Stan developer and Gelman student) is also working on IDE-like tools that provide a new language for expressing a range of models.

Then Eli Bingham (of Pyro fame) dropped the big question: is there any hope we could use something like these PPLs to develop a scalable, global climate model? Turns out that we don’t even know how they vet the incredibly complicated components of these models. Just the soil carbon models are more complicated than most of the PK/PD models we fit and they’re one of the simplest parts of these models.

I submitted two abstracts this year and then they invited me to do a plenary session and I decided to focus on the first.

Paper submission 1: What do we need from a probabilistic programming language to support Bayesian workflow?

Paper submission 2: Lambdas, tuples, ragged arrays, and complex numbers in Stan

P.S. Andrew: have you considered just choosing another theme at random? It’s hard to imagine it’d be harder to read than this one.

How many infectious people are likely to show up at an event?

Stephen Kissler and Yonatan Grad launched a Shiny app,

Effective SARS-CoV-2 test sensitivity,

to help you answer the question,

How many infectious people are likely to show up to an event, given a screening test administered n days prior to the event?

Here’s a screenshot.

The app is based on some modeling they did with Stan followed by simulation-based predictions. Here’s the medRxiv paper.

Stephen M. Kissler et al. 2020. SARS-CoV-2 viral dynamics in acute infections. medRxiv.

Users input characteristics of the test taken, the event size, the time the tests are taken before the event, the duration of the event itself, etc. The tool then produces plots of expected number of infectious people at your event and even the projected viral load at your event with uncertainties.

This obviously isn’t a tool for amateurs. I don’t even understand the units Ct for the very first input slider; the authors describe it as “the RT-qPCR cycle threshold”. They said they’d welcome feedback in making this more usable.

Probabilities for action and resistance in Blades in the Dark

Later this week, I’m going to be GM-ing my first session of Blades in the Dark, a role-playing game designed by John Harper. We’ve already assembled a crew of scoundrels in Session 0 and set the first score. Unlike most of the other games I’ve run, I’ve never played Blades in the Dark, I’ve only seen it on YouTube (my fave so far is Jared Logan’s Steam of Blood x Glass Cannon play Blades in the Dark!).

Action roll

In Blades, when a player attempts an action, they roll a number of six-sided dice and take the highest result. The number of dice rolled is equal to their action rating (a number between 0 and 4 inclusive) plus modifiers (0 to 2 dice). The details aren’t important for the probability calculations. If the total of the action rating and modifiers is 0 dice, the player rolls two dice and takes the worst. This is sort of like disadvantage and (super-)advantage in Dungeons & Dragons 5e.

A result of 1-3 is a failure with a consequence, a result of 4-5 is a success with a consequence, and a result of 6 is an unmitigated success without a consequence. If there are more than two 6s in the result, it’s a success with a benefit (aka a “critical” success).

The GM doesn’t roll. In a combat situation, you can think of the player roll encapsulating a turn of the player attacking and the opponent(s) counter-attacking. On a result of 4-6, the player hits, on a roll of 1-5, the opponent hits back or the situation becomes more desperate in some other way like the character being disarmed or losing their footing. On a critical result (two or more 6s in the roll), the player succeeds with a benefit, perhaps cornering the opponent away from their flunkies.

Resistance roll

When a player suffers a consequence, they can resist it. To do so, they gather a pool of dice for the resistance roll and spend an amount of stress equal to six minus the highest result. Again, unless they have zero dice in the pool, in which case they can roll two dice and take the worst. If the player rolls a 6, the character takes no stress. If they roll a 1, the character takes 5 stress (which would very likely take them out of the action). If the player has multiple dice and rolls two or more 6s, they actually reduce 1 stress.

For resistance rolls, the value between 1 and 6 matters, not just whether it’s in 1-3, in 4-5, equal to 6, or if there are two 6s.

Resistance rolls are rank statistics for pools of six-sided dice. Action rolls just group those. Plus a little sugar on top for criticals. We could do this the hard way (combinatorics) or we could do this the easy way. That decision was easy.

Here’s a plot of the results for action rolls, with dice pool size on the x-axis and line plots of results 1-3 (fail plus a complication), 4-5 (succeed with complication), 6 (succeed) and 66 (critical success with benefit). This is based on 10m simulations.

You can find a similar plot from Jasper Flick on AnyDice, in the short note Blades in the Dark.

I find the graph pretty hard to scan, so here’s a table in ASCII format, which also includes the resistance roll probabilities. The 66 result (at least two 6 rolls in the dice pool) is a possibility for both a resistance roll and an action roll. Both decimal places should be correct given the 10M simulations.

DICE   RESISTANCE                      ACTION           BOTH

DICE    1    2    3    4    5    6     1-3  4-5    6      66
----  ----------------------------     -------------    ----
 0d   .36  .25  .19  .14  .08  .03     .75  .22  .03     .00

 1d   .17  .17  .17  .17  .17  .17     .50  .33  .17     .00
 2d   .03  .08  .14  .19  .25  .28     .25  .44  .28     .03

 3d   .01  .03  .09  .17  .29  .35     .13  .45  .35     .07
 4d   .00  .01  .05  .14  .29  .39     .06  .42  .39     .13

 5d   .00  .00  .03  .10  .27  .40     .03  .37  .40     .20
 6d   .00  .00  .01  .07  .25  .40     .02  .32  .40     .26

 7d   .00  .00  .01  .05  .22  .39     .01  .27  .39     .33
 8d   .00  .00  .00  .03  .19  .38     .00  .23  .38     .39

One could go for more precision with more simulations, or resort to working them all out combinatorially.

The hard way

The hard way is a bunch of combinatorics. These aren’t too bad because of the way the dice are organized. For the highest value of throwing N dice, the probability that a value is less than or equal to k is one minus the probability that a single die is greater than k raised to the N-th power. It’s just that there are a lot of cells in the table. And then the differences would be required. Too error prone for me. Criticals can be handled Sherlock Holmes style by subtracting the probability of a non-critical from one. A non-critical either has no sixes (5^N possibilities with N dice) or exactly one six ((6 choose 1) * 5^(N – 1)). That’s not so bad. But there are a lot of entries in the table. So let’s just simulate.

Edit: Cumulative Probability Tables

I really wanted the cumulative probability tables of a result or better (I suppose I could’ve also done it as result or worse). I posted these first on the Blades in the Dark forum. It uses Discourse, just like Stan’s forum.

Action Rolls

Here’s the cumulative probabilities for action rolls.

And here’s the table of cumulative probabilities for action rolls, with 66 representing a critical, 6 a full success, and 4-5 a partial success:

        probability of result or better
 dice   4-5+     6+     66
    0  0.250  0.028  0.000
    1  0.500  0.167  0.000
    2  0.750  0.306  0.028
    3  0.875  0.421  0.074
    4  0.938  0.518  0.132
    5  0.969  0.598  0.196
    6  0.984  0.665  0.263
    7  0.992  0.721  0.330
    8  0.996  0.767  0.395

Resistance Rolls

And here are the basic probabilities for resistance rolls.

Here’s the table for stress probabilities based on dice pool size


             Probability of Stress
Dice    5    4    3    2    1    0   -1
   0  .31  .25  .19  .14  .08  .03  .00 
   1  .17  .17  .17  .17  .17  .17  .00
   2  .03  .08  .14  .19  .25  .28  .03
   3  .00  .03  .09  .17  .28  .35  .07
   4  .00  .01  .05  .13  .28  .39  .13
   5  .00  .00  .03  .10  .27  .40  .20
   6  .00  .00  .01  .07  .25  .40  .26
   7  .00  .00  .01  .05  .22  .39  .33
   8  .00  .00  .00  .03  .19  .37  .40

Here’s the plot for the cumulative probabilities for resistance rolls.

Here’s the table of cumulative resistance rolls.


             Probability of Stress or Less
Dice       5      4     3     2    1    0     -1
   0    1.00    .69   .44   .25   .11  .03   .00 
   1    1.00    .83   .67   .50   .33  .17   .00
   2    1.00    .97   .89   .75   .56  .31   .03
   3    1.00   1.00   .96   .87   .70  .42   .07
   4    1.00   1.00   .99   .94   .80  .52   .13
   5    1.00   1.00  1.00   .97   .87  .60   .20
   6    1.00   1.00  1.00   .98   .91  .67   .26
   7    1.00   1.00  1.00   .99   .94  .72   .33
   8    1.00   1.00  1.00  1.00   .96  .77   .40

For example, with 4 dice (the typical upper bound for resistance rolls), there’s an 80% chance that the character takes 1, 0, or -1 stress, and 52% chance they take 0 or -1 stress. With 0 dice, there’s a better than 50-50 chance of taking 4 or more stress because the probability of 3 or less stress is only 44%.

Finally, here’s the R code for the resistance and cumulative resistance.

# row = dice, col = c(1:6, 66)
resist <- matrix(0, nrow = 8, ncol = 7)
resist[1, 1:6] <- 1/6
for (d in 2:8) {
  for (result in 1:5) {
    resist[d, result] <-
      sum(resist[d - 1, 1:result]) * 1/6 +
      resist[d - 1, result] *  (result -1) / 6
  resist[d, 6] <- sum(resist[d - 1, 1:5]) * 1/6 +
                  sum(resist[d - 1, 6]) * 5/6
  resist[d, 7] <- resist[d - 1, 7] + resist[d - 1, 6] * 1/6

cumulative_resist <- resist  # just for sizing
for (d in 1:8) {
  for (result in 1:7) {
    cumulative_resist[d, result] <- sum(resist[d, result:7])


zero_dice_probs <-  c(11, 9, 7, 5, 3, 1, 0) / 36
zero_dice_cumulative_probs <- zero_dice_probs
for (n in 1:7)
  zero_dice_cumulative_probs[n] <- sum(zero_dice_probs[n:7])

z <- melt(cumulative_resist)  # X1 = dice, X2 = result, value = prob
stress <- 6 - z$X2
df <- data.frame(dice = z$X1, stress = as.factor(stress), prob = z$value)
df <- rbind(df, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7), prob = zero_dice_cumulative_probs))

cumulative_plot <- ggplot(df, aes(x = dice, y = prob,
                   colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress or less") +
  scale_x_continuous(breaks = 0:8)
ggsave('cumulative-resistance.jpg', plot = cumulative_plot, width = 5, height = 4)

z2 <- melt(resist)  # X1 = dice, X2 = result, value = prob
stress2 <- 6 - z2$X2
df2 <- data.frame(dice = z2$X1, stress = as.factor(stress2), prob = z2$value)
df2 <- rbind(df2, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7),
                             prob = zero_dice_probs))

plot <- ggplot(df2, aes(x = dice, y = prob,
               colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress") +
  scale_x_continuous(breaks = 0:8)
ggsave('resistance.jpg', plot = plot, width = 5, height = 4)

Drunk-under-the-lamppost testing

Edit: Glancing over this again, it struck me that the title may be interpreted as being mean. Sorry about that. It wasn’t my intent. I was trying to be constructive and I really like that analogy. The original post is mostly reasonable other than on this one point that I thought was important to call out.

I’m writing a response here to Abraham Mathews’s post, Best practices for code review, R edition, because my comment there didn’t show up and I think the topic’s important. Mathews’s post starts out on the right track, then veers away from best practices in the section “What code should be reviewed?” where he says,

…In general, we would never want to review four or five different files at a time. Instead, code review should be more targeted and the focus must be on the code that requires more prudent attention. The goal should be to review files or lines of code that contain complex logic and may benefit from a glance from other team members.

Given that guidance, the single file from the above example that we should never consider for code review is basic_eda.R. It’s just a simple file with procedural code and will only be run once or twice. …

The standard for code review in industry and large open-source projects is to review every piece of code before it’s merged. The key to this strategy is ensuring that every line of code has been viewed by at the very least the author and one trusted team member. Sampling-based code review that’s biased to where the group thinks errors may be has the drunks-under-the-lamppost problem of not covering a large chunk of code. Software developers obsess on test coverage, but it’s very challenging and we haven’t been able to get there with Stan. If we were developing flight control or pacemaker or transactional banking software, the standards would be much higher.

Typically, APIs are designed top-down from a client’s perspective (the client being a human or another piece of code that calls the API), then coded bottom up. Each component is reviewed and unit tested before being merged. The key to this strategy is being able to develop high-level modules with the confidence that the low-level pieces work. It may sound like it’s going to take longer to unit test as you go, but the net is a huge time savings with the upside of having more reliable code.

It’s also critical to keep the three key components of software development in synch: documenting (i.e., design), testing, and coding. In larger projects, features of any size always start with a functional spec outlining how it works from the client point of view—that’s usually written like the eventual documentation will be written because that’s what says what code does. With just doc, the key here is to make sure the API that is being delivered is both easy to document and easy to test. For example, large functions with intertwined, dependent arguments, as often found in REPL languages like R, Python, and Julia, produce what programmers call a “bad smell”, precisely because such functions are hard to document and test.

Consider the rgamma function in R. It takes three parameter arguments, shape, rate, and scale. Experienced statisticians might know that scale and rate parameters are conventionally inverses, yet this isn’t mentioned in the doc anywhere other than implicitly with the values of the default arguments. What happens if you supply both scale and rate? The doc doesn’t say, so I just tried it. It does not return an error, as one might expect from languages that try to keep their arguments coherent, but rather uses the rate and ignores the scale (order doesn’t matter). At the point someone proposed the rgamma function’s API, someone else should’ve piped up and said, “Whoa, hang on there a second, cowpoke; this function’s going to be a mess to test and document because of the redundant arguments.” With scale not getting a default and rate and shape being inverses, the tests need to cover behavior for all 8 possible input patterns. The doc should really say what happens when both scale and rate are specified. Instead, it just says “Invalid arguments will result in return value ‘NaN’, with a warning.” That implies that inconsistent rate and scale arguments (e.g., rate = 10, scale = 10) aren’t considered invalid arguments.

I should also say that my comments above are intended for API design, such as an R package one might want to distribute or a piece of infrastructure a lab or company wants to support. I wouldn’t recommend this style of functional design and doc and testing for exploratory research code, because it’s much harder to design up front and isn’t intended to be portable or distributed beyond a tiny group of collaborators. I’m not saying don’t test such code, I’m just saying the best practices there would be different than for designing APIs for public consumption. For example, no need to test Windows and Linux and Mac if you only ever target one platform, no reason to test all the boundary conditions if they’re never going to be used, and so on. It absolutely still helps to design top down and write modular reusable components bottom up. It’s just usually not apparent what these modules will be until after many iterations.

P.S. I highly recommend Hunt and Thomas’s book, The Pragmatic Programmer. It’s a breeze to read and helped me immensely when I was making the move from a speech recognition researcher writing experimental code to an industrial programmer. Alternatives I’ve read suffer from being too long and pedantic, too dogmatic, and/or too impractical.

P.P.S. I’ve been meaning to write a blog post on the differences in best practices in research versus publicly distributed code. I know they’re different, but haven’t been able to characterize what I’d recommend in the way of methodology for research code. Maybe that’s because I spend maybe one day/month on personal or group research code (for example, the code Andrew and I developed for an analysis of SARS-CoV-2 seroprevalence), and nineteen days a month working on Stan API code. I’d be curious as to what other people do to test and organize their research code.

Make Andrew happy with one simple ggplot trick

By default, ggplot expands the space above and below the x-axis (and to the left and right of the y-axis). Andrew has made it pretty clear that he thinks the x axis should be drawn at y = 0. To remove the extra space around the axes when you have continuous (not discrete or log scale) axes, add the following to a ggplot plot,

plot <-
  plot + 
  scale_x_continuous(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0))

Maybe it could even go in a theme.

Hats off to A5C1D2H2I1M1N2O1R2T1 (I can't make these handles up) for posting the solution on Stack Overflow.

Naming conventions for variables, functions, etc.

The golden rule of code layout is that code should be written to be readable. And that means readable by others, including you in the future.

Three principles of naming follow:

1. Names should mean something.

2. Names should be as short as possible.

3. Use your judgement to balance (1) and (2).

The third one’s where all the fun arises. Do we use “i” or “n” for integer loop variables by convention? Yes, we do. Do we choose “inv_logit” or “inverse_logit”? Stan chose “inv_logit”. Do we choose “complex” or “complex_number”? C++ chose “complex”, as well as choosing “imag” over “imaginary” for the method to pull the imaginary component out.

Do we use names like “run_helper_function”, which is both long and provides zero clue as to what it does? We don’t if we want to do unto others as we’d have them do unto us.

P.S. If the producers of Silicon Valley had asked me, Winnie would’ve dumped Richard after a fight about Hungarian notation, not tabs vs. spaces.

Beautiful paper on HMMs and derivatives

I’ve been talking to Michael Betancourt and Charles Margossian about implementing analytic derivatives for HMMs in Stan to reduce memory overhead and increase speed. For now, one has to implement the forward algorithm in the Stan program and let Stan autodiff through it. I worked out the adjoint method (aka reverse-mode autodiff) derivatives of the HMM likelihood (basically, reverse-mode autodiffing the forward algorithm), but it was stepwise and the connection to forward-backward wasn’t immediately obvious. So I thought maybe someone had already put a bow on this in the literature.

It was a challenging Google search, but I was rewarded with one of the best papers I’ve read in ages and by far the best thing I’ve ever read on hidden Markov models (HMM) and their application:

The paper provides elegant one-liners for the forward algorithm, the backward algorithm, the likelihood, and the derivative of the likelihood with respect to model parameters. For example, here’s the formula for the likelihood:

$latex L = \pi^{\top} \cdot \textrm{diag}(B_1) \cdot A \cdot \textrm{diag}(B_2) \cdot \cdots A \cdot \textrm{diag}(B_T) \cdot 1.$

where $latex \pi$ is the initial state distributions, $latex B_t$ is the vector of emission densities for the states, $latex A$ is the stochastic transition matrix, and $latex 1$ is a vector of 1s. Qin et al.’s software uses an external package to differentiate the solution for $latex \pi$ as the stationary distribution for the transition matrix $latex A,$, i.e., $latex \pi^{\top} \cdot A = \pi^{\top}.$

The forward and backward algoritms are stated just as neatly, as are the derivatives of the likelihood w.r.t parameters. The authors put the likelihood and derivatives together to construct a quasi-Newton optimizer to fit max likelihood estimates of HMM. They even use second derivatives for estimating standard errors. For Stan, we just need the derivatives to plug into our existing quasi-Newton solvers and Hamiltonian Monte Carlo.

But that’s not all. The paper’s about an application of HMMs to single-channel kinetics in chemistry, a topic about which I know nothing. The paper starts with a very nice overview of HMMs and why they’re being chosen for this chemistry problem. The paper ends with a wonderfully in-depth discussion of the statistical and computational properties of the model. Among the highlights is the joint modeling of multiple experimental data sets with varying measurement error.

In conclusion, if you want to understand HMMs and are comfortable with matrix derivatives, read this paper. Somehow the applied math crowd gets these algorithms down correctly and cleanly where the stats and computer science literatures flail in comparison.

Of course, for stability in avoiding underflow of the densities, we’ll need to work on the log scale. Or if we want to keep the matrix formulation, we can use the incremental rescaling trick to rescale the columns of the forward algorithm and accmulate our own exponent to avoid underflow. We’ll also have to autodiff through the solution to the stationary distirbution algorithm, but Stan’s internals make that particular piece of plumbing easy to fit and also a potential point of dervative optimization. We also want to generalize to the case where the transition matrix $latex A$ depends on predictors at each time step through a multi-logit regression. With that, we’d be able to fit anything that can be fit with the nicely designed and documented R package moveHmm, which can already be used in R to fit a range of maximum likelihood estimates for HMMs.

Econometrics postdoc and computational statistics postdoc openings here in the Stan group at Columbia

Andrew and I are looking to hire two postdocs to join the Stan group at Columbia starting January 2020. I want to emphasize that these are postdoc positions, not programmer positions. So while each position has a practical focus, our broader goal is to carry out high-impact, practical research that pushes the frontier of what’s posisble in Bayesian modeling. This particular project is focused on extremely challenging econometric modeling problems and statistical computation and will be carried out in conjunction with some really great economists (details in the job descriptions below).

These positions are funded through a generous gift from the Alfred P. Sloan Foundation.

Computational statistics postdoc

The Stan group at Columbia is looking to hire a Postdoctoral Research Scholar to work on computational statistics. The goal of the project is to:

* develop algorithms for solving differential and algebraic equations, potentially stochastic and partial

* fit large scale-hierarchical models either through core sampling improvements or approximations such as nested Laplace or variational inference.

In both projects, there is wide latitude for extending the state of the art in computational statistics. The Stan project encompasses a team of dozens of developers distributed around the world and this work will be done in collaboration with that wider team. The wider team provides expertise in everything from numerical analysis and applied mathematics to programming language theory and parallel computation. The position is well funded to travel to conferences and visit collaborators.

The project is funded through a grant focused on Bayesian econometric modeling, which provides concrete applications that will provide a focus for the work as well as a second postdoc funded to develop those applications concurrently with developing the tools needed to extend the existing state of the art. The Stan group at Columbia is also working on applications of differential and algebraic equations in soil carbon modeling and pharmacology and applications of large scale hierarchical models in education and in survey sampling for political science.

The position will be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for 18 months (January 2020 through June 2022) with a possibility of extension.

Columbia is an EEO/AA employer

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Bob Carpenter, [email protected]. The position is available starting in January 2020, and we will review applications as they arrive.

Econometrics Postdoc

The Stan group at Columbia is looking to hire a Postdoctoral Research Scholar to work on Bayesian econometric modeling and methodology. The goal is to create a bridge from modern econometric modeling to current Bayesian computational practice by generating a range of illustrative case studies illustrating best practices. Many of these best practices will need to be developed from the ground up and there is wide latitude for novel work.

This work will be carried out in collaboration with several economists and methodologists outside of Columbia University:

* Empirical auction analysis, where the theory around optimal design can be used to improve econometric methods used to draw inferences from the performance of real auctions in practice, including jointly modeling all components of a bidding system in order to test the structural assumptions driving mechanism design decisions. With Prof. Shoshanna Vasserman (Stanford)

* Bounded rationality and decision making in dynamic and stochastic environments, where macroeconomic models may be expressed in the form of dynamic, stochastic, general equilibrium models which can be extended to higher orders to model bounded rationality in agents making decisions in dynamic and stochastic environments. With Prof. Thomas Sargent, New York University.

* External validity of policy targeting for subgroups, with the goal of applying interventions where they will benefit the most while avoiding harming other subgroups, and a focus on combining data across multiple settings using meta-analysis. With Prof. Rachel Meager, London School of Economics.

* Causal models of interventions in education policy, where the focus is on time-series data organized by classroom, school, and larger groupings in the context of heterogeneous demographic. With Prof. Sophia Rabe-Hesketh, University of California, Berkeley.

Basic capabilities to fit these models exist in Stan currently and this grant will support a second postdoc to help extend those capabilities to more complicated systems.

The position will be housed in the Applied Statistics Center at Columbia University and supervised by Andrew Gelman. The initial appointment will be for 18 months (January 2020 through June 2022) with a possibility of extension.

Columbia is an EEO/AA employer

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Andrew Gelman, at [email protected]. The position is available starting January 1, 2020, and we will review applications as soon as they arrive.

Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively

Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.

The full data model

Imagine we have a a very simple population of $latex N^{\textrm{pop}}$ items with values normally distributed members with standard deviation known to be 2,

$latex y_n \sim \textrm{normal}(\mu, 2) \ \textrm{for} \ i \in 1:N^{\textrm{pop}}.$

To complete the Bayesian model, we’ll assume a standard normal prior on $latex \mu$,

$latex \mu \sim \textrm{normal}(0, 1).$

Now we’re not going to observe all $latex y_n$, but only a sample of the $latex N^{\textrm{pop}}$ elements. If the model is correct, our inferences will be calibrated in expection given a random sample of items $latex y_n$ from the population.

Missing data

Now let’s assume the sample of $latex y_n$ we observe is not drawn at random from the population. Imagine instead that we have a subset of $latex N$ items from the population, and for each item $latex n$, there is a probability $latex \pi_n$ that the item will be included in the sample. We’ll take the log odds of inclusion to be equal to the item’s value,

$latex \pi_n = \textrm{logit}^{-1}(y_n)$.

Now when we collect our sample, we’ll do something like poll $latex N = 2000$ people from the population, but each person $latex n$ only has a $latex \pi_n$ chance of responding. So we only wind up with $latex N^{\textrm{obs}}$ observations, with $latex N^{\textrm{miss}} = N – N^{\textrm{obs}}$ observations missing.

This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).

So how do we do the careful adjustment?

Approach 1: Weighted likelihood

A traditional approach is to inverse weight the log likelihood terms by the inclusion probability,

$latex \sum_{n = 1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2).$

Thus if an item has a 20% chance of being included, its weight is 5.

In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).

for (n in 1:N_obs)
  target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);

If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate $latex \hat{\pi}$ is the true value $latex \pi$). This is borne out in simulation.

Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.

One reason the intervals are too narrow is that we are weighting the data as if we had observed $latex N$ items when we’ve only observed $latex N^{\textrm{obs}}$ items. That is, their weights are what we’d expect to get if we’d observed $latex N$ items.

So my next thought was to standardize. Let’s take the inverse weights and normalize so the sum of inverse weights is equal to $latex N^{\textrm{obs}}.$ That also fails. The posterior intervals are still too narrow under simulation.

Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We’re Bayesians and should be thinking generatively. Maybe that’s why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).

Approach 2: Missing data

What is going on generativey? We poll $latex N$ people out of a population of $latex N^{\textrm{pop}}$, each of which has a $latex \pi_n$ chance of responding, leading to a set of responses of size $latex N^{\textrm{obs}}.$

Given that we know how $latex \pi$ relates to $latex y$, we can just model everything (in the real world, this stuff is really hard and everything’s estimated jointly).

Specifically, the $latex N^{\textrm{miss}} = N – N^{\textrm{obs}}$ missing items each get parameters $latex y^{\textrm{miss}}_n$ representing how they would’ve responded had they responded. We also model response, so we have an extra term $latex \textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y_n^{\textrm{miss}}))$ for the unobserved values and an extra term $latex \textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y_n))$ for the observed values.

This works. Here’s the Stan program.

data {
  int N_miss;
  int N_obs;
  vector[N_obs] y_obs;
parameters {
  real mu;
  vector[N_miss] y_miss;
model {
  // prior
  mu ~ normal(0, 1);
  // observed data likelihood
  y_obs ~ normal(mu, 2);
  1 ~ bernoulli_logit(y_obs);
  // missing data likelihood and missingness
  y_miss ~ normal(mu, 2);
  0 ~ bernoulli_logit(y_miss);

The Bernoulli sampling statements are vectorized and repeated for each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale, and could have been written:

for (n in 1:N_miss)
  0 ~ bernoulli(inv_logit(y_miss[n]))

or even more explicitly,

for (n in 1:N_miss)
  target += bernoulli_lpmf(0 | inv_logit(y_miss[n]));

And here’s the simulation code, including a cheap run at SBC:

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(), logical = FALSE)

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {

mu <- rnorm(1, 0, 1);
N_tot <- 1000
y <- rnorm(N_tot, mu, 2)
z <- rbinom(N_tot, 1, inv_logit(y))
y_obs <- y[z == 1]
N_obs <- length(y_obs)
N_miss <- N_tot - N_obs

fit <- sampling(model,
                data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
                chains = 1, iter = 5000, refresh = 0)
mu_ss <- extract(fit)$mu
mu_hat <- mean(mu_ss)
q25 <- quantile(mu_ss, 0.25)
q75 <- quantile(mu_ss, 0.75)
printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s;  mu_hat = %5.2f",
       mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mean(mu_ss))


Here's some output with random seeds, with mu, mu_hat and 50% intervals and indicator of whether mu is in the 50% posterior interval.

mu =  0.60 in 50pct( 0.50,  0.60) =  no;  mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no;  mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no;  mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes;  mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes;  mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes;  mu_hat =  0.81

The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.


How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-resondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling---it'd have to be marginalized out.

Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.

Seeking postdoc (or contractor) for next generation Stan language research and development

The Stan group at Columbia is looking to hire a postdoc* to work on the next generation compiler for the Stan open-source probabilistic programming language. Ideally, a candidate will bring language development experience and also have research interests in a related field such as programming languages, applied statistics, numerical analysis, or statistical computation.

The language features on the roadmap include lambdas with closures, sparse matrices and vectors, ragged arrays, tuples and structs, user-defined Jacobians, and variadic functions. The parser, intermediate representation, and code generation are written in OCaml using the Menhir compiler framework. The code is hosted on GitHub in the stanc3 repo; the current design documents are in the design docs repo. The generated code is templated C++ that depends on the automatic differentiation framework in the Stan math library and is used by Stan’s statistical inference algorithms.

The research and development for Stan will be carried out in collaboration with the larger Stan development team, which includes a large group of friendly and constructive collaborators within and outside of Columbia University. In addition to software development, the team has a diverse set of research interests in programming language semantics, applied statistics, statistical methodology, and statistical computation. Of particular relevance to this project is foundational theory work on programming language semantics for differentiable and probabilistic programming languages.

The position would be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for one year with a possible reappointment for a second year.

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Bob Carpenter, [email protected]. The position is available immediately and we will review applications as they arrive.

Thanks to Schmidt Futures for the funding to make this possible!

* We could also hire a contractor on an hourly basis. For that, I’d be looking for someone with experience who could hit the ground running with the OCaml code.

(Markov chain) Monte Carlo doesn’t “explore the posterior”

[Edit: (1) There’s nothing dependent on Markov chain—the argument applies to any Monte Carlo method in high dimensions. (2) No, (MC)MC is not not broken.]

First some background, then the bad news, and finally the good news.

Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it to calculate expectations.

Sampling to characterize the posterior

There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose of sampling is to explore the posterior. For example, I’m writing up some reproducible notes on probability theory and statistics through sampling (in pseudocode with R implementations) and have just come to the point where I’ve introduced and implemented Metropolis and want to use it to exemplify convergence mmonitoring. So I did what any right-thinking student would do and borrowed one of my mentor’s diagrams (which is why this will look familiar if you’ve read the convergence monitoring section of Bayesian Data Analysis 3).

First M steps of of isotropic random-walk Metropolis with proposal scale normal(0, 0.2) targeting a bivariate normal with unit variance and 0.9 corelation. After 50 iterations, we haven’t found the typical set, but after 500 iterations we have. Then after 5000 iterations, everything seems to have mixed nicely through this two-dimensional example.

This two-dimensional traceplot gives the misleading impression that the goal is to make sure each chain has moved through the posterior. This low-dimensional thinking is nothing but a trap in higher dimensions. Don’t fall for it!

Bad news from higher dimensions

It’s simply intractable to “cover the posterior” in high dimensions. Consider a 20-dimensional standard normal distribution. There are 20 variables, each of which may be positive or negative, leading to a total of $latex 2^{20}$, or more than a million orthants (generalizations of quadrants). In 30 dimensions, that’s more than a billion. You get the picture—the number of orthant grows exponentially so we’ll never cover them all explicitly through sampling.

Good news in expectation

Bayesian inference is based on probability, which means integrating over the posterior density. This boils down to computing expectations of functions of parameters conditioned on data. This we can do.

For example, we can construct point estimates that minimize expected square error by using posterior means, which are just expectations conditioned on data, which are in turn integrals, which can be estimated via MCMC,

$latex \begin{array}{rcl} \hat{\theta} & = & \mathbb{E}[\theta \mid y] \\[8pt] & = & \int_{\Theta} \theta \times p(\theta \mid y) \, \mbox{d}\theta. \\[8pt] & \approx & \frac{1}{M} \sum_{m=1}^M \theta^{(m)},\end{array}$

where $latex \theta^{(1)}, \ldots, \theta^{(M)}$ are draws from the posterior $latex p(\theta \mid y).$

If we want to calculate predictions, we do so by using sampling to calculate the integral required for the expectation,

$latex p(\tilde{y} \mid y) \ = \ \mathbb{E}[p(\tilde{y} \mid \theta) \mid y] \ \approx \ \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}),$

If we want to calculate event probabilities, it’s just the expectation of an indicator function, which we can calculate through sampling, e.g.,

$latex \mbox{Pr}[\theta_1 > \theta_2] \ = \ \mathbb{E}\left[\mathrm{I}[\theta_1 > \theta_2] \mid y\right] \ \approx \ \frac{1}{M} \sum_{m=1}^M \mathrm{I}[\theta_1^{(m)} > \theta_2^{(m)}].$

The good news is that we don’t need to visit the entire posterior to compute these expectations to within a few decimal places of accuracy. Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.

NYC Meetup Thursday: Under the hood: Stan’s library, language, and algorithms

I (Bob, not Andrew!) will be doing a meetup talk this coming Thursday in New York City. Here’s the link with registration and location and time details (summary: pizza unboxing at 6:30 pm in SoHo):

After summarizing what Stan does, this talk will focus on how Stan is engineered. The talk follows the organization of the Stan software.

Stan math library: differentiable math and stats functions, template metaprorgrams to manage constants and vectorization, matrix derivatives, and differential equation derivatives.

Stan language: block structure and execution, unconstraining variable transforms and automatic Jacobians, transformed data, parameters, and generated quantities execution.

Stan algorithms: Hamiltonian Monte Carlo and the no-U-turn sampler (NUTS), automatic differentiation variational inference (ADVI).

Stan infrastructure and process: Time permitting, I can also discuss Stan’s developer process, how the code repositories are organized, and the code review and continuous integration process for getting new code into the repository


I realized I’m missing a good illustration of NUTS and how it achieves detailed balance and preferentially selects positions on the Hamiltonian trajectory toward the end of the simulated dynamics (to minimize autocorrelation in the draws). It was only an hour, so I skipped the autodiff section and scalable algorithms section and jumped to the end. I’ll volunteer do another meetup with the second half of the talk.

StanCon Helsinki streaming live now (and tomorrow)

We’re streaming live right now!

Timezone is Eastern European Summer Time (EEST) +0300 UTC

Here’s a link to the full program [link fixed].

There have already been some great talks and they’ll all be posted with slides and runnable source code after the conference on the Stan web site.