Is data science a discipline?

Jeannette Wing, director of the Columbia Data Science Institute, sent along this link to this featured story (their phrase) on their web site.

Is data science a discipline?

Data science is a field of study: one can get a degree in data science, get a job as a data scientist, and get funded to do data science research. But is data science a discipline, or will it evolve to be one, distinct from other disciplines? Here are a few meta-questions about data science as a discipline.

  • What is/are the driving deep question(s) of data science? Each scientific discipline (usually) has one or more “deep” questions that drive its research agenda: What is the origin of the universe (astrophysics)? What is the origin of life (biology)? What is computable (computer science)? Does data science inherit its deep questions from all its constituency disciplines or does it have its own unique ones?
  • What is the role of the domain in the field of data science? People (including this author) (Wing, J.M., Janeia, V.P., Kloefkorn, T., & Erickson, L.C. (2018)) have argued that data science is unique in that it is not just about methods, but about the use of those methods in the context of a domain—the domain of the data being collected and analyzed; the domain for which a question to be answered comes from collecting and analyzing the data. Is the inclusion of a domain inherent in defining the field of data science? If so, is the way it is included unique to data science?
  • What makes data science data science? Is there a problem unique to data science that one can convincingly argue would not be addressed or asked by any of its constituent disciplines, e.g., computer science and statistics?

I don’t understand how bullet point two is supposed to distinguish data science from the more prosaically titled field of applied statistics.

The story goes on to enumerate ten research challenges in data science. Some of them are hot AI topics like ethics and fairness, some of them are computer science topics such as computing systems for data-intensive applications, and some of them are statistics topics like causal inference.

What can we do with complex numbers in Stan?

I’m wrapping up support for complex number types in the Stan math library. Now I’m wondering what we can do with complex numbers in statistical models.

Functions operating in the complex domain

The initial plan is to add some matrix functions that use complex numbers internally:

  • fast Fourier transforms
  • asymmetric eigendecomposition
  • Schur decomposition

The eigendecomposition and Schur decomposition are already working. We need to settle on an interface for FFTs before adding those. Are there other functions like these we should add?
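To see why these functions need complex arithmetic internally even when their inputs and the quantities we ultimately care about are real, here’s a small illustration in R (not Stan, just for exposition): a real, asymmetric matrix can have complex eigenvalues, and the discrete Fourier transform of a real signal is complex-valued in general.

A <- matrix(c(0, -1,
              1,  0), nrow = 2, byrow = TRUE)  # a 90-degree rotation matrix
eigen(A)$values    # complex eigenvalues: 0+1i and 0-1i
fft(c(1, 2, 3, 4)) # complex DFT coefficients of a real-valued signal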

Complex random variables?

But what if we add a complex number type to the language? That’d essentially give us complex random variables (and constants) as first-class citizens. We’d have something like the following.

complex x;             // declaration

x = complex(-1, 2.9);  // constructor

x.real = 1.3;          // setters
x.imag = -2.5;

complex y = sqrt(x);   // complex sqrt

real yr = y.real;      // getters
real yi = y.imag;

Technically, we’re treating a complex number just like a pair (the elements being the real and imaginary components). We will be able to differentiate with respect to real and/or imaginary components, and thus differentiate through functions operating in the complex domain, but we’re not taking derivatives with respect to an entire complex number.

It’s easy in the math library to create a matrix of complex numbers, but we’ll need new types in the language: complex_matrix, complex_vector, complex_row_vector. Arrays will just follow from general principles.

Applications, anyone?

So what can we do with complex random variables if we have them?

A quick search online found a discussion of complex variables on stats.stackexchange which mirrors some of our internal discussions.

I have no idea where we’d use something like characteristic functions in the Stan language or complex polynomial roots, but am hoping someone else can clue me in.

A normalizing flow by any other name

Another week, another nice survey paper from Google. This time:

What’s a normalizing flow?

A normalizing flow is a change of variables. Just like you learned way back in calculus and linear algebra.

Normalizing flows for sampling

Suppose you have a random variable $latex \Theta$ with a gnarly posterior density $latex p_{\Theta}(\theta)$ that makes it challenging to sample. It can sometimes be easier to sample a simpler variable $latex \Phi$ and come up with a smooth function $latex f$ such that $latex \Theta = f(\Phi).$ The implied distribution on $latex \Theta$ can be derived from the density of $latex \Phi$ and the appropriate Jacobian adjustment for change in volume,

$latex \displaystyle p_{\Theta}(\theta) = p_{\Phi}(f^{-1}(\theta)) \cdot \left|\, \textrm{det} \textrm{J}_{f^{-1}}(\theta) \,\right|,$

where $latex \textrm{J}_{f^{-1}}(\theta)$ is the Jacobian of the inverse transform evaluated at the parameter value. This is always possible in theory—the unit hypercube with a uniform distribution is a sufficient base distribution for any multivariate target, with the function $latex f$ being the inverse cumulative distribution function.
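Here’s a minimal R sketch of that formula in one dimension, assuming a uniform base variable on (0, 1) and taking f to be R’s qnorm (the standard normal inverse CDF), so that the implied target density is standard normal.

theta <- 0.5
f_inv <- pnorm                    # inverse of the flow f = qnorm
jacobian <- dnorm(theta)          # derivative of pnorm evaluated at theta
p_phi <- dunif(f_inv(theta))      # base density, equal to 1 on (0, 1)
p_theta <- p_phi * abs(jacobian)  # change-of-variables formula from above
p_theta                           # matches dnorm(0.5)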

Of course, we don’t know the inverse CDFs for our posteriors or we wouldn’t need to do sampling in the first place. The hope is that we can estimate an approximate but tractable normalizing flow, which when combined with a standard Metropolis accept/reject step will be better than working in the original geometry.

Normalizing flows in Stan

Stan uses changes of variables, aka normalizing flows, in many ways.

First, Stan’s Hamiltonian Monte Carlo algorithm learns (aka estimates) a metric during warmup that is used to provide an affine transform, either just to scale (mean field metric, aka diagonal) or to scale and rotate (dense metric). If Ben Bales’s work pans out, we’ll also have low rank metric estimation soon.

Second, Stan’s constrained variables are implemented via changes of variables with efficient, differentiable Jacobians. Thank Ben Goodrich for all the hard ones: covariance matrices, correlation matrices, Cholesky factors of these, and unit vectors. TensorFlow Probability calls these transforms “bijectors.” These constrained-variable transforms allow Stan’s algorithms to work on unconstrained spaces. In the case of variational inference, Stan fits a multivariate normal approximation to the posterior, then samples from the multivariate normal and transforms the draws back to the constrained space to get an approximate sample from the model.

Third, we widely recommend reparameterizations, such as the non-centered parameterization of hierarchical models. We used to call that specific transform the “Matt trick” until we realized it already had a name. The point of a reparameterization is to apply the appropriate normalizing flow to make the posterior closer to isotropic Gaussian. Then there’s a deterministic transform back to the variables we really care about.
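As a much simplified illustration of what the non-centered parameterization does, here’s an R sketch assuming a one-dimensional normal with location mu and scale tau (in a real hierarchical model these would themselves be parameters): rather than sampling theta directly, we sample an isotropic standard normal variable and transform it deterministically.

mu <- 1
tau <- 0.1
phi <- rnorm(1000)         # sample in the "nice" isotropic geometry
theta <- mu + tau * phi    # deterministic flow back to the scale we care about
c(mean(theta), sd(theta))  # approximately mu and tau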

What’s next?

The real trick is automating the construction of these transforms. People hold out a lot of hope for neural networks or other non-parametric function fitters there. It remains to be seen whether anything practical will come out of this that we can use for Stan. I talked to Matt and crew at Google about their work on normalizing flows for HMC (which they call “neural transport”), but they told me it was too finicky to work as a black box in Stan.

Another related idea is Riemannian Hamiltonian Monte Carlo (RHMC), which uses a second-order Hessian-based approximation to normalize the posterior geometry. It’s just very expensive on a per-iteration basis because it requires a differentiable positive-definite conditioning phase involving an eigendecomposition.

Monte Carlo and winning the lottery

Suppose I want to estimate my chances of winning the lottery by buying a ticket every day. That is, I want to do a pure Monte Carlo estimate of my probability of winning. How long will it take before I have an estimate that’s within 10% of the true value?

It’ll take…

There’s a big NY state lottery for which there is a 1 in 300M chance of winning the jackpot. Back of the envelope, to get an estimate within 10% of the true value of 1/300M will take many millions of years.

Didn’t you say Monte Carlo only took a hundred draws?

What’s going on? The draws are independent with finite mean and variance, so we have a central limit theorem. The advice around these parts has been that we can get by with tens or hundreds of Monte Carlo draws. With a hundred draws, the standard error on our estimate is one tenth of the standard deviation of the variable whose expectation is being estimated.

With the lottery, if you run a few hundred draws, your estimate is almost certainly going to be exactly zero. Did we break the CLT? Nope. Zero has the right absolute error properties. It’s within 1/300M of the true answer after all! But it has terrible relative error properties; its relative error after a lifetime of playing the lottery is basically infinity.

The moral of the story is that error bounds on Monte Carlo estimates of expectations are absolute, not relative.

The math

The draws are Bernoulli with a p chance of success, so the standard error of the Monte Carlo estimator

$latex \displaystyle \hat{\mu} = \frac{1}{M} \sum_{m = 1}^M y^{(m)}$

is going to be

$latex \displaystyle \textrm{se}[\hat{\mu}] = \sqrt{\frac{p \cdot (1 - p)}{M}}$

for $latex M$ draws and a

$latex y \sim \textrm{bernoulli}(p)$

probability of winning.
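Plugging in the numbers with a quick R back-of-the-envelope, assuming p is 1 in 300 million and requiring the standard error to be 10% of p:

p <- 1 / 300e6
M <- p * (1 - p) / (0.1 * p)^2  # solve sqrt(p * (1 - p) / M) = 0.1 * p for M
M                               # roughly 3e10 draws
M / 365                         # about 80 million years at one ticket per day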

Extra credit: Sequential decision theory

How long would it take to convince yourself that playing the lottery has an expected negative return if tickets cost $1, there’s a 1/300M chance of winning, and the payout is $100M?

Although no slot machines are involved, this is the simplest framing of a so-called “bandit” problem. More sophisticated problems would involve several lotteries with generalized payout schedules that might be stateful or non-stationary.

Abuse of expectation notation

I’ve been reading a lot of statistical and computational literature and it seems like expectation notation is abused as shorthand for integrals by decorating the expectation symbol with a subscripted distribution like so:

$latex \displaystyle \mathbb{E}_{p(\theta)}[f(\theta)] =_{\textrm{def}} \int f(\theta) \cdot p(\theta) \, \textrm{d}\theta.$

This is super confusing, because expectations are properly defined as functions of random variables.

$latex \displaystyle \mathbb{E}[f(A)] = \int f(a) \cdot p_A(a) \, \textrm{d}a.$

Incidentally, the square bracket convention arises because random variables are functions and square brackets are conventionally used for functionals (that is, functions that apply to functions).

Expectation is an operator

With the proper notation, expectation is a linear operator on random variables, $latex \mathbb{E}: (\Omega \rightarrow \mathbb{R}) \rightarrow \mathbb{R}$, where $latex \Omega$ is the sample space and $latex \Omega \rightarrow \mathbb{R}$ the type of a random variable. In the abused notation, expectation is not an operator because there’s no argument, just an expression $latex f(\theta)$ with an unbound variable $latex \theta.$

In this post (and last week’s), I’ve been following standard notational conventions where capital letters like $latex A$ are random variables and their corresponding lower case letters are used as bound variables. Then rather than using $latex p(\cdots)$ for every density, densities are subscripted with the random variables from which they are derived, so the density of random variable $latex A$ is written $latex p_A$.

Bayesian Data Analysis notation

Gelman et al.’s Bayesian Data Analysis book overloads notation, using lower case $latex a$ for both $latex A$ and $latex a$. This requires the reader to do a lot of sleuthing to figure out which variables are random and which are bound. It led to no end of confusion for me when I was first learning this material. It turns out disambiguating a dense formula with ambiguous notation is easier when you already understand the result.

The overloaded notation from Bayesian Data Analysis is fine in most applied modeling work, but it makes it awkward to talk about random variables and bound variables simultaneously. For example, on page 20 of the third edition, Gelman et al. write (using $latex \textrm{E}$ for the expectation symbol, round parens instead of brackets, and an italic derivative symbol),

$latex \displaystyle \textrm{E}(u) = \int u p(u) du.$

Here, the $latex u$ in $latex \textrm{E}(u)$ is understood as a random variable and the other uses of $latex u$ as bound variables. It’s even worse with the covariance definition,

$latex \displaystyle \textrm{var}(u) = \int (u - \textrm{E}(u))(u - \textrm{E}(u))^{T} p(u) du,$

where the $latex u$ in $latex \textrm{var}(u)$ and $latex \textrm{E}(u)$ are random variables, whereas the remaining uses are bound variables.

Using more explicit notation which distinguishes random and bound variables, includes the multiplication operators, specifies the range of integration, disambiguates the density symbol, and sets the derivative symbol in roman rather than italics, these become

$latex \displaystyle \mathbb{E}[U] = \int_{\mathbb{R}^N} u \cdot p_U(u) \, \textrm{d}u.$

$latex \displaystyle \textrm{var}[U] = \int_{\mathbb{R}^N} (u - \mathbb{E}[U]) \cdot (u - \mathbb{E}[U])^{\top} \cdot p_U(u) \, \textrm{d}u.$

This lets us clearly write variance out as an expectation as

$latex \textrm{var}[U] = \mathbb{E}[(U - \mathbb{E}[U]) \cdot (U - \mathbb{E}[U])^{\top} ],$

which would look as follows in Bayesian Data Analysis notation,

$latex \textrm{var}(u) = \textrm{E}((u - \textrm{E}(u))(u - \textrm{E}(u))^T).$

Conditional expectations and posteriors

The problem’s even more prevalent with posteriors or other conditional expectations, which I often see written using notation

$latex \displaystyle \mathbb{E}_{p(\theta \, \mid \, y)}[f(\theta)]$

for what I would write using conditional expectation notation as

$latex \displaystyle \mathbb{E}[f(\Theta) \mid Y = y].$

As before, this uses random variable notation inside the expectation and bound variable notation outside, with $latex Y = y$ indicating the random variable $latex Y$ takes on the value $latex y$.

Rao-Blackwellization and discrete parameters in Stan

I’m reading a really dense and beautifully written survey of Monte Carlo gradient estimation for machine learning by Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. There are great explanations of everything including variance reduction techniques like coupling, control variates, and Rao-Blackwellization. The latter’s the topic of today’s post, as it relates directly to current Stan practices.

Expectations of interest

In Bayesian inference, parameter estimates and event probabilities and predictions can all be formulated as expectations of functions of parameters conditioned on observed data. In symbols, that’s

$latex \displaystyle \mathbb{E}[f(\Theta) \mid Y = y] = \int f(\theta) \cdot p(\theta \mid y) \, \textrm{d}\theta$

for a model with parameter vector $latex \Theta$ and data $latex Y = y.$ In this post and most writing about probability theory, random variables are capitalized and bound variables are not.

Partitioning variables

Suppose we have two random variables $latex A, B$ and want to compute an expectation $latex \mathbb{E}[f(A, B)].$ In the Bayesian setting, this means splitting our parameters $latex \Theta = (A, B)$ into two groups and suppressing the conditioning on $latex Y = y$ in the notation.

Full sampling-based estimate of expectations

There are two unbiased approaches to computing the expectation $latex \mathbb{E}[f(A, B)]$ using sampling. The first one is traditional, with all random variables in the expectation being sampled.

Draw $latex (a^{(m)}, b^{(m)}) \sim p_{A,B}(a, b)$ for $latex m \in 1:M$ and estimate the expectation as

$latex \displaystyle\mathbb{E}[f(A, B)] \approx \frac{1}{M} \sum_{m=1}^M f(a^{(m)}, b^{(m)}).$

Marginalized sampling-based estimate of expectations

The so-called Rao-Blackwellized estimator of the expectation involves marginalizing $latex p_{A,B}(a, b)$ and sampling $latex b^{(m)} \sim p_{B}(b)$ for $latex m \in 1:M$. The expectation is then estimated as

$latex \displaystyle \mathbb{E}[f(A, B)] \approx \frac{1}{M} \sum_{m=1}^M \mathbb{E}[f(A, b^{(m)})].$

For this estimator to be efficiently computable, the nested expectation must be efficiently computable,

$latex \displaystyle \mathbb{E}[f(A, b^{(m)})] = \int f(a, b^{(m)}) \cdot p(a \mid b^{(m)}) \, \textrm{d}a.$

The Rao-Blackwell theorem

The Rao-Blackwell theorem states that the marginalization approach has variance less than or equal to that of the direct approach. In practice, this difference can be enormous. The size of the gain depends on how efficiently we could estimate $latex \mathbb{E}[f(A, b^{(m)})]$ by sampling $latex a^{(n)} \sim p_{A \mid B}(a \mid b^{(m)}),$

$latex \displaystyle \mathbb{E}[f(A, b^{(m)})] \approx \frac{1}{N} \sum_{n = 1}^N f(a^{(n)}, b^{(m)}).$

Discrete variables in Stan

Stan does not have a sampler for discrete variables. Instead, Rao-Blackwellized estimators must be used, which essentially means marginalizing out the discrete parameters. Thus if $latex A$ is the vector of discrete parameters in a model, $latex B$ the vector of continuous parameters, and $latex y$ the vector of observed data, then the model posterior is $latex p_{A, B \mid Y}(a, b \mid y).$

With a sampler that can efficiently make Gibbs draws (e.g., BUGS or PyMC3), it is tempting to try to compute posterior expectations by sampling,

$latex \displaystyle \mathbb{E}[f(A, B) \mid Y = y] \approx \frac{1}{M} \sum_{m=1}^M f(a^{(m)}, b^{(m)})$ where $latex (a^{(m)}, b^{(m)}) \sim p_{A,B \mid Y}(a, b \mid y).$

This is almost always a bad idea if it is possible to efficiently calculate the inner Rao-Blackwellization expectation, $latex \mathbb{E}[f(A, b^{(m)})].$ With discrete variables, the formula is just

$latex \displaystyle \mathbb{E}[f(A, b^{(m)})] = \sum_{a \in A} p(a \mid b^{(m)}) \cdot f(a, b^{(m)}).$

Usually the summation can be done efficiently in models like mixture models where the discrete variables are tied to individual data points or in state-space models like HMMs where the discrete parameters can be marginalized using the forward algorithm. Where this is not so easy is with missing count data or variable selection problems where the posterior combinatorics are intractable.
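Here’s a toy R comparison of the two estimators under assumed densities (everything here is made up purely for illustration): $latex A$ is binary with conditional probability $latex \textrm{logit}^{-1}(b)$ of being 1, $latex B$ is standard normal, and $latex f(a, b) = a \cdot b^2.$ Both estimators are unbiased, but the Rao-Blackwellized one has lower Monte Carlo error.

set.seed(1)
M <- 1000
f <- function(a, b) a * b^2
p_a_given_b <- function(b) plogis(b)  # assumed conditional Pr[A = 1 | b]

estimate <- function() {
  b <- rnorm(M)                       # draw b ~ p(b)
  a <- rbinom(M, 1, p_a_given_b(b))   # draw a ~ p(a | b)
  c(plain = mean(f(a, b)),
    rao_blackwellized = mean(p_a_given_b(b) * f(1, b)
                             + (1 - p_a_given_b(b)) * f(0, b)))
}
estimates <- replicate(1000, estimate())
apply(estimates, 1, sd)  # Monte Carlo error is smaller for the RB estimator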

Gains from marginalizing discrete parameters

The gains to be had from marginalizing discrete parameters are enormous. This is even true of models coded in BUGS or PyMC3. Cole Monnahan, James Thorson, and Trevor Branch wrote a nice survey of the advantages of marginalization for some ecology models that compares marginalized HMC with Stan to JAGS with discrete sampling and JAGS with marginalization. The takeaway here isn’t that HMC is faster than JAGS, but that JAGS with marginalization is a lot faster than JAGS without.

The other place to see the effects of marginalization is in the Stan User’s Guide chapter on latent discrete parameters. The first change-point example shows how much more efficient the marginalization is by comparing it directly with estimates generated from exact sampling of the discrete parameters conditioned on the continuous ones. This is particularly true of the tail statistics, which can’t be estimated at all with MCMC because too many draws would be required. I had the same experience in coding the Dawid-Skene model of noisy data coding, which was my gateway to Bayesian inference—I had coded it with discrete sampling in BUGS, but BUGS took forever (24 hours compared to 20m for Stan for my real data) and kept crashing on trivial examples during my tutorials.

Marginalization calculations can be found in the MLE literature

The other place marginalization of discrete parameters comes up is in maximum likelihood estimation. For example, Dawid and Skene’s original approach to their coding model used the expectation maximization (EM) algorithm for maximum marginal likelihood estimation. The E-step does the marginalization and it’s exactly the same marginalization as required in Stan for discrete parameters. You can find the marginalization for HMMs in the literature on calculating maximum likelihood estimates of HMMs (in computer science, electrical engineering, etc.) and in the ecology literature for the Cormack-Jolly-Seber model. And they’re in the Stan user’s guide.

Nothing’s lost, really

[edit: added last section explaining how to deal with posterior inference for the discrete parameters]

It’s convenient to do posterior inference with samples. Even with a Rao-Blackwellized estimator, it’s possible to sample $latex a^{(m)} \sim p(a \mid b^{(m)})$ in the generated quantities block of a Stan program and then proceed from there with full posterior draws $latex (a^{(m)}, b^{(m)})$ of both the discrete and continuous parameters.

As tempting as that is because of its simplicity, the marginalization is worth the coding effort, because the gain in efficiency from working in expectation with the Rao-Blackwellized estimator is enormous for discrete parameters. It can often take problems from computationally infeasible to straightforward.

For example, to estimate the posterior distribution of a discrete parameter, we need the expectation

$latex \displaystyle \mbox{Pr}[A_n = k] = \mathbb{E}[\textrm{I}[A_n = k]]$

for all values $latex k$ that $latex A_n$ might take. This is a trivial computation with MCMC (assuming the number of values is not too large) and is carried out in Stan by defining an indicator variable and setting it. In contrast, estimating such a quantity by sampling $latex a^{(m)} \sim p(a \mid b^{(m)})$ is very inefficient, and increasingly so as the probability $latex \mbox{Pr}[A_n = k]$ being estimated gets smaller.

Examples of both forms of inference are shown in the user’s guide chapter on latent discrete parameters.

Royal Society spam & more

Just a rant about spam (and more spam) from pay-to-publish and closed-access journals. Nothing much to see here.

The latest offender is from something called the “Royal Society.” I don’t even know to which king or queen this particular society owes allegiance, because they have a .org URL. Exercising their royal prerogative, they created an account for me without my permission.

I clicked through to their web site to see if the journal was both free and open, but it turns out to be a pay-to-publish journal. The sovereign in question must have rather shallow pockets, as they’re asking authors to pony up a $1260 (plus VAT!) “article processing charge”.

Whenever I get review requests for pay-to-publish or closed-access journals, I just decline. Not so easy here, as I can’t reply to their email. Instead, they indicate they’re going to keep spamming me and I have to actively opt out. I think I’m going to change my policy to never review for a journal that signs me up for their reviewing site without my permission.

Now onto the details. Here’s the text of the mail they sent (my bold face).

Dear Dr Carpenter,

An account has been created for you on the Royal Society Open Science ScholarOne Manuscripts site for online manuscript submission and review. You are likely to be invited to contribute to the journal as a reviewer or author in the near future. For information about the journal please visit http://rsos.royalsocietypublishing.org.

Your account login details are as follows:

USER ID: [email protected]
PASSWORD: To set your password please click the link below. Clicking the link will take you directly to the option for setting your permanent password.
https://mc.manuscriptcentral.com/rsos?URL_MASK=blah-blah-blah

Please log in to https://mc.manuscriptcentral.com/rsos to:
– Choose a memorable user ID and password for easy login
– Associate your ORCID ID with your account
– Add your affiliation details
– Tell us when you are unavailable, or delete your account
– Opt out of emails relating to the Royal Society and/or its journal content

When you log in for the first time, you will be asked to complete your profile.

The Royal Society is committed to your privacy and to the protection of your personal information. View our privacy policy at https://royalsociety.org/about-us/terms-conditions-policies/privacy-policy/#authors.

Please contact us if you have any questions about why an account has been created.

Sincerely,

Royal Society Open Science Editorial Office
[email protected]

Keep up-to-date:
Content and keyword alerts: http://royalsocietypublishing.org/alerts
Twitter: https://twitter.com/RSocPublishing
Facebook: http://www.facebook.com/RoyalSocietyPublishingFanPage
Log in to Remove This Account – https://mc.manuscriptcentral.com/rsos?URL_MASK=blah-blah-blah

To avoid more spam, I thought I’d log in to remove the account, but wait, they have quirky weak password requirements.

Create New Password
Password Requirements:

Cannot be a recently used password
Cannot be the same as your username
Minimum of 8 characters
Minimum of 2 numbers
Minimum of 1 letter (Upper or lower case)
More Information May be Required

If the site requires more information, you will be taken to your profile next.

* = Required Fields

Of course it required more information.

Your Profile Needs to be Updated
The following profile item(s) need to be updated before you can access the site:

Salutation is a required field
Country / Region is a required field

Thankfully, a country and salutation were pre-entered by whoever decided to put me into this system. After this, I had to click through several more screens of profile creation for reviewers until I found a “delete your account” section. Done. Finally.

Done, not done

No sooner had I shaken off a pay-to-publish journal than I got hit with the same thing from a closed-access IEEE journal (the name “ScholarOne” is repeated across spams, so maybe that’s the name of the spam tool).

Dear Dr. Carpenter:

Welcome to the Transactions on Computer-Aided Design of Integrated Circuits and Systems – ScholarOne Manuscripts site for online manuscript submission and review. Your name has been added to our reviewer database in the hopes that you will be available to review manuscripts for the Journal which fall within your area of expertise.

Your USER ID for your account is as follows:

SITE URL: https://mc.manuscriptcentral.com/tcad
USER ID: [email protected]
PASSWORD: To set your password please click the link below. Clicking the link will take you directly to the option for setting your permanent password.

https://mc.manuscriptcentral.com/tcad?blah-blah-blah

When you log in for the first time, you will be asked to complete your full postal address, telephone, and fax number. You will also be asked to select a number of keywords describing your particular area(s) of expertise.

Thank you for your participation.

Beautiful paper on HMMs and derivatives

I’ve been talking to Michael Betancourt and Charles Margossian about implementing analytic derivatives for HMMs in Stan to reduce memory overhead and increase speed. For now, one has to implement the forward algorithm in the Stan program and let Stan autodiff through it. I worked out the adjoint method (aka reverse-mode autodiff) derivatives of the HMM likelihood (basically, reverse-mode autodiffing the forward algorithm), but it was stepwise and the connection to forward-backward wasn’t immediately obvious. So I thought maybe someone had already put a bow on this in the literature.

It was a challenging Google search, but I was rewarded with one of the best papers I’ve read in ages and by far the best thing I’ve ever read on hidden Markov models (HMMs) and their application:

The paper provides elegant one-liners for the forward algorithm, the backward algorithm, the likelihood, and the derivative of the likelihood with respect to model parameters. For example, here’s the formula for the likelihood:

$latex L = \pi^{\top} \cdot \textrm{diag}(B_1) \cdot A \cdot \textrm{diag}(B_2) \cdots A \cdot \textrm{diag}(B_T) \cdot 1,$

where $latex \pi$ is the initial state distribution, $latex B_t$ is the vector of emission densities for the states, $latex A$ is the stochastic transition matrix, and $latex 1$ is a vector of 1s. Qin et al.’s software uses an external package to differentiate the solution for $latex \pi$ as the stationary distribution for the transition matrix $latex A$, i.e., $latex \pi^{\top} \cdot A = \pi^{\top}.$
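Here’s what that one-liner looks like as a few lines of R, assuming hypothetical inputs pi (initial state distribution), A (K x K transition matrix), and B (T x K matrix of emission densities evaluated at the observations); hmm_likelihood is just a name for this sketch.

hmm_likelihood <- function(pi, A, B) {
  alpha <- pi * B[1, ]               # pi' * diag(B_1), as a row vector
  for (t in 2:nrow(B)) {
    alpha <- (alpha %*% A) * B[t, ]  # right-multiply by A * diag(B_t)
  }
  sum(alpha)                         # final multiplication by the vector of 1s
}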

The forward and backward algorithms are stated just as neatly, as are the derivatives of the likelihood with respect to the parameters. The authors put the likelihood and derivatives together to construct a quasi-Newton optimizer to fit maximum likelihood estimates of HMMs. They even use second derivatives for estimating standard errors. For Stan, we just need the derivatives to plug into our existing quasi-Newton solvers and Hamiltonian Monte Carlo.

But that’s not all. The paper’s about an application of HMMs to single-channel kinetics in chemistry, a topic about which I know nothing. The paper starts with a very nice overview of HMMs and why they’re being chosen for this chemistry problem. The paper ends with a wonderfully in-depth discussion of the statistical and computational properties of the model. Among the highlights is the joint modeling of multiple experimental data sets with varying measurement error.

In conclusion, if you want to understand HMMs and are comfortable with matrix derivatives, read this paper. Somehow the applied math crowd gets these algorithms down correctly and cleanly where the stats and computer science literatures flail in comparison.

Of course, for stability in avoiding underflow of the densities, we’ll need to work on the log scale. Or if we want to keep the matrix formulation, we can use the incremental rescaling trick to rescale the columns of the forward algorithm and accumulate our own exponent to avoid underflow. We’ll also have to autodiff through the solution to the stationary distribution algorithm, but Stan’s internals make that particular piece of plumbing easy to fit and also a potential point of derivative optimization. We also want to generalize to the case where the transition matrix $latex A$ depends on predictors at each time step through a multi-logit regression. With that, we’d be able to fit anything that can be fit with the nicely designed and documented R package moveHMM, which can already be used in R to fit a range of maximum likelihood estimates for HMMs.
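As a sketch of the incremental rescaling trick mentioned above (again in R, under the same assumed inputs as the earlier snippet, with hmm_log_likelihood just a name for this sketch), we can renormalize the forward vector at each step and accumulate the log normalizers, returning the log likelihood directly:

hmm_log_likelihood <- function(pi, A, B) {
  alpha <- pi * B[1, ]
  log_lik <- log(sum(alpha))
  alpha <- alpha / sum(alpha)             # rescale to avoid underflow
  for (t in 2:nrow(B)) {
    alpha <- (alpha %*% A) * B[t, ]
    log_lik <- log_lik + log(sum(alpha))  # accumulate the exponent ourselves
    alpha <- alpha / sum(alpha)
  }
  log_lik
}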

Macbook Pro (16″ 2019) quick review

I just upgraded yesterday to one of the new 2019 Macbook Pro 16″ models:

  • Macbook Pro (16″, 2019), 3072 x 1920 pixel display, 2.4 GHz 8-core i9, 64GB 2667 MHz DDR4 memory, AMD Radeon Pro 5500M GPU with 4GB of GDDR6 memory, 1 TB solid-state drive

    US$4120 list including Apple Care (about US$3800 after the education discount)

The only other upgrade option is an additional 4GB GPU memory for US$100.

My computer for the last seven-plus years and my basis for comparison is a mid-2012 Macbook Pro:

  • Macbook Pro (15″ Retina, Mid 2012), 2880 x 1800 pixel display, 2.3 GHz 4-core i7, 16 GB 1600MHz DDR3 memory, 256 GB solid-state drive

I did 100% of my work on Stan during that time using this computer and overall, it’s been the best computer I’ve ever had. But my old computer was dying. The screen was so burned in I could read last week’s xkcd (I never got the replacement during the recall). The battery was so shot it’d go from 30% power to auto shutdown in the blink of an eye.

I have no idea what’s available in the Windows PC or Linux world for a similar price. It probably comes with 32 cores, 256GB memory, and also acts as a hoverboard. For comparison, while working at Bell Labs in the mid-1990s, I once spent US$7200 for a dual-boot Linux/Windows XP Thinkpad with a super high-res monitor for the time and enough memory and compute power to develop and run our speech recognition code. So while US$3800 may seem outrageously expensive, the bigger picture is that really powerful computers just keep getting more affordable over time.

Form factor

I bought the new computer sight unseen without paying too much attention to anything other than that it had 8 cores, 64GB memory, and an escape key. I was expecting something like a PC gamer deck. Compared to my previous machine, the cases are exactly the same size and the machines are the same weight, at least insofar as I can tell physically without reading the specs. It’s even the old silver color, which I strongly prefer to the space grey.

I like that the apple on the lid doesn’t light up.

Ease of upgrade

Apple makes it super easy to move everything from an old machine. Once I entered enough passwords on my menagerie of Apple devices, it took less than 2 hours to transfer everything from the old machine to the new one via my home wireless network.

The only software I’ve had to upgrade to get back to working on Stan is Xcode (the C++ compiler). And I did that just from the command line using this one-liner:

> xcode-select --install

Hats off to Dirk Avery for his blog post on Catalina, Xcode, and Homebrew.

It really was that easy. The entire Stan developer toolchain just works. R, RStan, etc., all just ran once I ran the above command from the terminal.

The keyboard, touchpad, and touchbar

There’s an escape key. I’ve been using emacs for 30+ years and it’s a big deal to me and others like me.

Keyboards matter a lot to me. I’m a very fast typist—around 100 words/minute the last time I tested myself on transcription (two years of school, part time jobs as secretary and keypunch operator, followed by tens of thousands of adult hours at the keyboard).

Overall, I consider this keyboard a downgrade from my 2012 Macbook Pro. I had the same problem with ThinkPads between 1996 and 2010—the keyboards just kept getting worse with every new model. At least the new Macbook Pro keyboards are a lot better than the very-short-throw, low-feedback keyboards used in the time between my 2012 Mac and the new 2019 ones.

The touchpad is huge compared to the old machine. I was worried I’d be accidentally hitting it all the time because I set it to use touch rather than click, but that has thankfully not happened.

The touchbar’s fine for what it’s there for. Its default is to display the only controls I ever used on the old computer—volume and brightness.

Display

Together, the keyboard and display are the most important parts of a computer to me. I’ve always prioritized displays over CPUs. I bought a first-generation Retina Macbook Pro as soon as they were available.

The monitor in the 16″ Macbook Pros is impressive. After using it for a day, the color on all my other devices (previous computer, iPhone, iPad) now looks off (specifically, blue-shifted). Sitting next to each other at max brightness, the old monitor is so dim one might think its backlighting was broken.

Even though it’s not that much bigger, having spent 7 years on a slightly smaller one, this one feels a fair bit bigger. They squeezed it into the same form factor by reducing the bezel size. There are also a few more pixels.

Is it faster?

Yes, much. I haven’t done any formal measurements, but with twice as many cores, each of which is faster, and much faster memory, one would expect to see exactly what I’m seeing informally—the Stan C++ unit tests compile and run more than 50% faster.

Not much compared to the PC heyday when every 18 months saw a doubling of straight-line speed. But enough to be noticeable and well worth the upgrade if that was all I was getting.

I haven’t tried any GPU code yet. I wouldn’t expect too much from a notebook on that front.

64 GB?

It wasn’t that much more expensive to fully load the machine’s memory. This means we should be able to run 8 processes, each using nearly 8 GB of memory.

Ports and dongles

There’s a headphone jack on the right (instead of left as it was on my old computer) and two USB-C jacks on either side. I just plugged the power into one of the ones on the left and it worked.

Ports and dongles are the great weakness of Apple-knows-best design in my experience. I’m going to have to buy a USB-C to HDMI dongle. I really liked that the 2012 Macbook Pro had an HDMI port.

I’m also going to have to figure out how to charge my iPad and iPhone. I prefer to travel without the iPad-specific wall wart.

Apple seems to think they get points for being “minimal”, flying in the face of every review I’ve ever read of an Apple product. So here you go, Apple, another negative review of your choice in the port department to ignore.

Am I an Apple fanboy?

I certainly don’t self-identify as an Apple fanboy. I use exclusively Apple products (Macbook, iPhone, iPad) primarily because I’m lazy and hate learning new interfaces and managing software. My decision’s driven almost entirely by the Macbooks, because I want Unix on my notebook without the incompatibility of Cygwin or the administrative headache of Linux on a notebook.

It’s clear the Macbook isn’t getting the most love among Apple’s products. I also resent Apple’s we-know-best attitude, which I blame for their cavalier attitude toward backward compatibility at both the software and hardware levels. It’s no surprise Microsoft still dominates the corporate PC market and Linux the corporate server market.

Overall impression

I love it. For my use, the 8 cores, faster 64GB memory, and the high-resolution, high-brightness 16″ monitor more than make up for the slightly poorer keyboard and reduced port selection.

I also ordered the same machine for Andrew and he’s been using his a day or two longer than me, so I’m curious what his impressions are.

Field goal kicking—like putting in 3D with oblong balls

Putting

Andrew Gelman (the author of most posts on this blog, but not this one) recently published a Stan case study on golf putting [link fixed] that uses a bit of geometry to build a regression-type model based on angles and force.

Field-goal kicking

In American football, there’s also a play called a “field goal.” In the American football version, a kicker (often a player migrating from the sport everyone else in the world calls “football”) tries to kick an oblong-ish “ball” from somewhere between 10 and 70 meters away through a pair of vertical posts and above a horizontal crossbar. If you’re not from the U.S. or another metrically-challenged country still using (British) imperial measures, it’ll help to know that a meter is roughly 1.1 yards.

Sounds kind of like putting, only in 3D and with no penalty for kicking too hard or far and wind effects instead of terrain features. This modeling problem came to my attention from the following blog post:

Unlike Gelman’s golf-putting example, Long’s model combines a kick-by-kick accuracy model with a career-trajectory model for kickers, another popular contemporary sports statistics adjustment. Long used brms, a Bayesian non-linear multilevel modeling package built on top of Stan, to fit his model of field-goal-kicking accuracy. (For what it’s worth, more people use brms and rstanarm now than use Stan directly in R, at least judging from CRAN downloads through RStudio.)

Model expansion

The focus of Gelman’s case study is model expansion—start with a simple model, look at the residuals (errors), figure out what’s going wrong, then refine the model. Like Gelman, Long starts with a logistic regression model for distance; unlike Gelman, he expands the model with career trajectories and situational effects (like “icing” the kicker) rather than geometry. An interesting exercise would be to do what Gelman did and replace Long’s logistic model of distance with one based on geometry. I’m pretty sure this could be done with brms by transforming the data, but someone would need to verify that.

Similarly, Gelman’s model still has plenty of room for expansion if anyone wants to deal with the condition of the greens (how they’re cut, moisture, etc.), topography, putter career trajectories, situational effects, etc. My father was a scratch golfer in his heyday on local public courses, but he said he’d never be able to sink a single putt if the greens were maintained the way they were for PGA tournaments. He likes to quote Lee Trevino, who said pro greens were like putting on the hood of a car; Trevino’s quotes are legendary. My dad’s own favorite golf quote is “drive for show, putt for dough”—he was obsessive about his short game—his own career was ended by knee and rotator cuff surgery—hockey wasn’t good to his body, either, despite playing in a “non-contact” league as an adult.

It would be fun to try to expand both Long’s and Gelman’s models further. This would also be a natural discussion for the Stan forums, which have a different readership than this blog. I like Gelman’s and Long’s posts because they’re of the hello-world variety and thus easy to understand. Of course, neither’s ready to go into production for bookmaking yet. It’d be great to see references to some state-of-the-art modeling of these things.

Other field goals

Field goals in basketball (shots into the basket from the floor as opposed to free throws) would be another good target for a model like Gelman’s or Long’s. Like the American football case and unlike golf, there’s a defense. Free throws wouldn’t be a good target as they’re all from the same distance (give or take a bit based on where they position themselves side to side).

Are there things like field goals in rugby or Australian-rules football? I love that the actual name of the sport has “rules” in the title—it’s the kind of pedantry near and dear to this semanticist’s heart.

Editorial

I thought twice about writing about American football. I boycott contact sports like football and ice hockey due to their intentionally violent nature. I’ve not watched American football in over a decade.

For me, this is personal now. I have a good friend of my age (mid-50s) who’s a former hockey player who was recently diagnosed with CTE. He can no longer function independently and has been given 1–2 years to live. His condition resulted from multiple concussions that started in school and continued through college hockey into adult hockey. He had a full hockey scholarship and would’ve been a pro (the second best player after him on our state-champion high-school team in Michigan played for the NY Rangers). My friend’s pro hopes ended when an opponent broke both his knees with a stick during a fight in a college game. He continued playing semi-pro hockey as an adult and accumulating concussions. Hockey was the first sport I boycotted, well over 30 years ago when my friend and my father were still playing, because it was clear to me the players were trying to hurt each other.

I’m now worried about baseball. I saw too many catchers and umpires rocked by foul tips to the face mask this season. I feel less bad watching baseball because at least nobody’s trying to hurt the catchers or umpires as part of the sport. The intent is what originally drove me out of watching hockey and football before the prevalence of CTE among former athletes was widely known. I simply have no interest in watching people trying to hurt each other. Nevertheless, it’s disturbing to watch an umpire who can no longer see straight or walk on his own get led off the field, or to see a catcher don the gear again after multiple concussions. As we know, that doesn’t end well.

Econometrics postdoc and computational statistics postdoc openings here in the Stan group at Columbia

Andrew and I are looking to hire two postdocs to join the Stan group at Columbia starting January 2020. I want to emphasize that these are postdoc positions, not programmer positions. So while each position has a practical focus, our broader goal is to carry out high-impact, practical research that pushes the frontier of what’s possible in Bayesian modeling. This particular project is focused on extremely challenging econometric modeling problems and statistical computation and will be carried out in conjunction with some really great economists (details in the job descriptions below).

These positions are funded through a generous gift from the Alfred P. Sloan Foundation.

Computational statistics postdoc

The Stan group at Columbia is looking to hire a Postdoctoral Research Scholar to work on computational statistics. The goal of the project is to:

* develop algorithms for solving differential and algebraic equations, potentially stochastic and partial

* fit large-scale hierarchical models either through core sampling improvements or through approximations such as nested Laplace or variational inference.

In both projects, there is wide latitude for extending the state of the art in computational statistics. The Stan project encompasses a team of dozens of developers distributed around the world and this work will be done in collaboration with that wider team. The wider team provides expertise in everything from numerical analysis and applied mathematics to programming language theory and parallel computation. The position is well funded to travel to conferences and visit collaborators.

The project is funded through a grant focused on Bayesian econometric modeling, which provides concrete applications to focus the work, as well as a second postdoc funded to develop those applications concurrently with developing the tools needed to extend the existing state of the art. The Stan group at Columbia is also working on applications of differential and algebraic equations in soil carbon modeling and pharmacology and applications of large-scale hierarchical models in education and in survey sampling for political science.

The position will be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for 18 months (January 2020 through June 2022) with a possibility of extension.

Columbia is an EEO/AA employer

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Bob Carpenter, [email protected]. The position is available starting in January 2020, and we will review applications as they arrive.

Econometrics Postdoc

The Stan group at Columbia is looking to hire a Postdoctoral Research Scholar to work on Bayesian econometric modeling and methodology. The goal is to create a bridge from modern econometric modeling to current Bayesian computational practice by generating a range of case studies illustrating best practices. Many of these best practices will need to be developed from the ground up, and there is wide latitude for novel work.

This work will be carried out in collaboration with several economists and methodologists outside of Columbia University:

* Empirical auction analysis, where the theory around optimal design can be used to improve econometric methods used to draw inferences from the performance of real auctions in practice, including jointly modeling all components of a bidding system in order to test the structural assumptions driving mechanism design decisions. With Prof. Shoshanna Vasserman (Stanford).

* Bounded rationality and decision making in dynamic and stochastic environments, where macroeconomic models may be expressed in the form of dynamic, stochastic, general equilibrium models, which can be extended to higher orders to model bounded rationality in the agents making those decisions. With Prof. Thomas Sargent, New York University.

* External validity of policy targeting for subgroups, with the goal of applying interventions where they will benefit the most while avoiding harming other subgroups, and a focus on combining data across multiple settings using meta-analysis. With Prof. Rachel Meager, London School of Economics.

* Causal models of interventions in education policy, where the focus is on time-series data organized by classroom, school, and larger groupings in the context of heterogeneous demographics. With Prof. Sophia Rabe-Hesketh, University of California, Berkeley.

Basic capabilities to fit these models exist in Stan currently and this grant will support a second postdoc to help extend those capabilities to more complicated systems.

The position will be housed in the Applied Statistics Center at Columbia University and supervised by Andrew Gelman. The initial appointment will be for 18 months (January 2020 through June 2022) with a possibility of extension.

Columbia is an EEO/AA employer

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Andrew Gelman, at [email protected]. The position is available starting January 1, 2020, and we will review applications as soon as they arrive.

Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively

Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.

The full data model

Imagine we have a very simple population of $latex N^{\textrm{pop}}$ items whose values are normally distributed with standard deviation known to be 2,

$latex y_n \sim \textrm{normal}(\mu, 2) \ \textrm{for} \ n \in 1:N^{\textrm{pop}}.$

To complete the Bayesian model, we’ll assume a standard normal prior on $latex \mu$,

$latex \mu \sim \textrm{normal}(0, 1).$

Now we’re not going to observe all $latex y_n$, but only a sample of the $latex N^{\textrm{pop}}$ elements. If the model is correct, our inferences will be calibrated in expectation given a random sample of items $latex y_n$ from the population.

Missing data

Now let’s assume the sample of $latex y_n$ we observe is not drawn at random from the population. Imagine instead that we have a subset of $latex N$ items from the population, and for each item $latex n$, there is a probability $latex \pi_n$ that the item will be included in the sample. We’ll take the log odds of inclusion to be equal to the item’s value,

$latex \pi_n = \textrm{logit}^{-1}(y_n)$.

Now when we collect our sample, we’ll do something like poll $latex N = 2000$ people from the population, but each person $latex n$ only has a $latex \pi_n$ chance of responding. So we only wind up with $latex N^{\textrm{obs}}$ observations, with $latex N^{\textrm{miss}} = N - N^{\textrm{obs}}$ observations missing.

This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).

So how do we do the careful adjustment?

Approach 1: Weighted likelihood

A traditional approach is to inverse weight the log likelihood terms by the inclusion probability,

$latex \sum_{n = 1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2).$

Thus if an item has a 20% chance of being included, its weight is 5.

In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).

for (n in 1:N_obs)
  target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);

If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate $latex \hat{\mu}$ is the true value $latex \mu$). This is borne out in simulation.

Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.

One reason the intervals are too narrow is that we are weighting the data as if we had observed $latex N$ items when we’ve only observed $latex N^{\textrm{obs}}$ items. That is, their weights are what we’d expect to get if we’d observed $latex N$ items.

So my next thought was to standardize. Let’s take the inverse weights and normalize so the sum of inverse weights is equal to $latex N^{\textrm{obs}}.$ That also fails. The posterior intervals are still too narrow under simulation.

Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We’re Bayesians and should be thinking generatively. Maybe that’s why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).

Approach 2: Missing data

What is going on generatively? We poll $latex N$ people out of a population of $latex N^{\textrm{pop}}$, each of whom has a $latex \pi_n$ chance of responding, leading to a set of responses of size $latex N^{\textrm{obs}}.$

Given that we know how $latex \pi$ relates to $latex y$, we can just model everything (in the real world, this stuff is really hard and everything’s estimated jointly).

Specifically, the $latex N^{\textrm{miss}} = N - N^{\textrm{obs}}$ missing items each get parameters $latex y^{\textrm{miss}}_n$ representing how they would’ve responded had they responded. We also model response, so we have an extra term $latex \textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y_n^{\textrm{miss}}))$ for the unobserved values and an extra term $latex \textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y_n))$ for the observed values.

This works. Here’s the Stan program.

data {
  int N_miss;
  int N_obs;
  vector[N_obs] y_obs;
}
parameters {
  real mu;
  vector[N_miss] y_miss;
}
model {
  // prior
  mu ~ normal(0, 1);
  // observed data likelihood
  y_obs ~ normal(mu, 2);
  1 ~ bernoulli_logit(y_obs);
  // missing data likelihood and missingness
  y_miss ~ normal(mu, 2);
  0 ~ bernoulli_logit(y_miss);
}

The Bernoulli sampling statements are vectorized and repeated for each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale. For example, the missing-data statement could have been written as:

for (n in 1:N_miss)
  0 ~ bernoulli(inv_logit(y_miss[n]));

or even more explicitly,

for (n in 1:N_miss)
  target += bernoulli_lpmf(0 | inv_logit(y_miss[n]));

And here’s the simulation code, including a cheap run at SBC:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(logical = FALSE))

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {

# SIMULATE DATA
mu <- rnorm(1, 0, 1);
N_tot <- 1000
y <- rnorm(N_tot, mu, 2)
z <- rbinom(N_tot, 1, inv_logit(y))
y_obs <- y[z == 1]
N_obs <- length(y_obs)
N_miss <- N_tot - N_obs

# FIT STAN MODEL (COMPILED ABOVE)
fit <- sampling(model,
                data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
                chains = 1, iter = 5000, refresh = 0)
mu_ss <- extract(fit)$mu
mu_hat <- mean(mu_ss)
q25 <- quantile(mu_ss, 0.25)
q75 <- quantile(mu_ss, 0.75)
printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s;  mu_hat = %5.2f",
       mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mean(mu_ss))

}

Here's some output with random seeds, with mu, mu_hat and 50% intervals and indicator of whether mu is in the 50% posterior interval.

mu =  0.60 in 50pct( 0.50,  0.60) =  no;  mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no;  mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no;  mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes;  mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes;  mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes;  mu_hat =  0.81

The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.

Exercise

How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-respondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling—it'd have to be marginalized out.

Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.

All the names for hierarchical and multilevel modeling

The title Data Analysis Using Regression and Multilevel/Hierarchical Models hints at the problem, which is that there are a lot of names for models with hierarchical structure.

Ways of saying “hierarchical model”

hierarchical model
a multilevel model with a single nested hierarchy (note my nod to Quine’s “Two Dogmas” with circular references)
multilevel model
a hierarchical model with multiple non-nested hierarchies
random effects model
Item-level parameters are often called “random effects”; reading all the ways the term is used on the Wikipedia page on random effects illustrates why Andrew dislikes the term so much (see also here; both links added by Andrew)—it means many different things to different communities.
mixed effects model
that’s a random effects model with some regular “fixed effect” regression thrown in; this is why lme4 is named after linear mixed effects and NONMEM after nonlinear mixed effects models (see the sketch after this list).
empirical Bayes
Near and dear to Andrew’s heart, because regular Bayes just isn’t empirical enough. I jest—it’s because “empirical Bayes” means using maximum marginal likelihood to estimate priors from data (just like lme4 does).
regularized/penalized/shrunk regression
common approach in machine learning where held out data is used to “learn” the regularization parameters, which are typically framed as shrinkage or regularization scales in penalty terms rather than as priors
automatic relevance determination (ARD)
Radford Neal’s term in his thesis on Gaussian processes and now widely adopted in the GP literature
domain adaptation
This one’s common in the machine-learning literature; I think it came from Hal Daumé III’s paper, “Frustratingly easy domain adaptation” in which he rediscovered the technique; he also calls logistic regression a “maximum entropy classifier”, like many people in natural language processing (and physics)
variance components model
I just learned this one on the Wikipedia page on random effects models
cross-sectional model
apparently a thing in econometrics for groups of points measured at the same time; the opposite of a time series, in which measurements involve multiple times
nested data model, split-plot design, random coefficient
The Wikipedia page on multilevel models listed all these.
integrated nested Laplace approximation (INLA), expectation maximization (EM), …
Popular algorithmic approaches that get confused with the modeling technique.
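
To make the connection concrete, here's a minimal R sketch (my own toy with simulated data, not taken from any of the sources above) of how several of these names land on the same model:

library(lme4)      # maximum marginal likelihood ("empirical Bayes")
library(rstanarm)  # the same model with full Bayes via Stan

# simulate a toy data set: outcome y, predictor x, grouping factor g
g <- rep(1:10, each = 20)
x <- rnorm(200)
y <- rnorm(10)[g] + 0.5 * x + rnorm(200)
df <- data.frame(y, x, g = factor(g))

# "mixed effects" / "random effects": fixed slope for x, varying intercept by group
fit_mle <- lmer(y ~ x + (1 | g), data = df)

# the same "hierarchical" / "multilevel" model, now with priors and full Bayes
fit_bayes <- stan_lmer(y ~ x + (1 | g), data = df)

The lmer fit estimates the group-level scale by maximum marginal likelihood, which is the empirical Bayes move above; stan_lmer fits the same hierarchical/multilevel structure with full Bayes.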

I’m guessing the readers of the blog will have more items to add to the list.

If you liked this post

You might like my earlier post, Logistic regression by any other name.

Calibration and sharpness?

I really liked this paper, and am curious what other people think before I base a grant application around applying Stan to this problem in a machine-learning context.

Gneiting et al. define what I think is a pretty standard notion of calibration for Bayesian models based on coverage, but I’m not 100% sure if there are alternative sensible definitions.

They also define a notion of sharpness, which for continuous predictions essentially means narrow posterior intervals, hence the name.

By way of analogy to point estimators, calibration is like unbiasedness and sharpness is like precision (i.e., inverse variance).
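
Here's a minimal R sketch of what checking both properties could look like; it's my own toy construction with simulated draws, not code from the paper.

# fake setup: posterior predictive draws for 100 held-out outcomes
y     <- rnorm(100)                            # realized outcomes
draws <- matrix(rnorm(4000 * 100), 4000, 100)  # hypothetical posterior predictive draws
lower <- apply(draws, 2, quantile, 0.05)       # 90% interval endpoints
upper <- apply(draws, 2, quantile, 0.95)

coverage <- mean(y >= lower & y <= upper)  # calibration check: should be near 0.90
width    <- mean(upper - lower)            # sharpness: smaller average width is sharper

The idea being that among models whose intervals cover at roughly the nominal rate, you'd prefer the one with the narrowest intervals.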

I seem to recall that Andrew told me that calibration is a frequentist notion, whereas a true Bayesian would just believe their priors. I’m not so worried about those labels here as about the methodological ramifications of taking the ideas of calibration and sharpness seriously.

Seeking postdoc (or contractor) for next generation Stan language research and development

The Stan group at Columbia is looking to hire a postdoc* to work on the next generation compiler for the Stan open-source probabilistic programming language. Ideally, a candidate will bring language development experience and also have research interests in a related field such as programming languages, applied statistics, numerical analysis, or statistical computation.

The language features on the roadmap include lambdas with closures, sparse matrices and vectors, ragged arrays, tuples and structs, user-defined Jacobians, and variadic functions. The parser, intermediate representation, and code generation are written in OCaml using the Menhir compiler framework. The code is hosted on GitHub in the stanc3 repo; the current design documents are in the design docs repo. The generated code is templated C++ that depends on the automatic differentiation framework in the Stan math library and is used by Stan’s statistical inference algorithms.

The research and development for Stan will be carried out in collaboration with the larger Stan development team, which includes a large group of friendly and constructive collaborators within and outside of Columbia University. In addition to software development, the team has a diverse set of research interests in programming language semantics, applied statistics, statistical methodology, and statistical computation. Of particular relevance to this project is foundational theory work on programming language semantics for differentiable and probabilistic programming languages.

The position would be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for one year with a possible reappointment for a second year.

To apply, please send a CV to Bob Carpenter, [email protected], along with a statement of interest and experience in this area if it's not already covered in the CV. The position is available immediately and we will review applications as they arrive.

Thanks to Schmidt Futures for the funding to make this possible!


* We could also hire a contractor on an hourly basis. For that, I’d be looking for someone with experience who could hit the ground running with the OCaml code.

Why does my academic lab keep growing?

Andrew, Breck, and I are struggling with funding for the Stan group at Columbia, just like most small groups in academia. The short story is that to have a decent chance of making payroll in the following year, we have to apply for so many grants that our expected amount of funding goes up. So our group keeps growing, which puts even more pressure on us in the future to write more grants to make payroll. It’s a better kind of problem to have than firing people, but the snowball effect means a lot of work beyond what we’d like to be doing.

Why does my academic lab keep growing?

Here’s a simple analysis. For the sake of argument, let’s say your lab has a $1.5M annual budget. And to keep things simple, let’s suppose all grants are $0.5M. So you need three per year to keep the lab afloat. Let’s say you have a well-oiled grant machine with a 40% success rate on applications.

Now what happens if you apply for 8 grants? There’s roughly a 30% chance you get fewer than the 3 grants you need, a 30% chance you get exactly the 3 grants you need, and a 40% chance you get more grants than you need.

If you’re like us, a 30% chance of not making payroll is more than you’d like, so let’s say you apply for 10 grants. Now there’s only about a 17% chance you won’t make payroll (still not great odds!), roughly a 20% chance you get exactly 3 grants, and a whopping 60% chance you wind up with 4 or more grants.
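
Those numbers are just binomial probabilities, which is easy to check in R (a back-of-the-envelope sketch under the assumptions above: a 40% success rate and 3 grants needed):

needed <- 3
rate <- 0.4
for (n_apps in c(8, 10)) {
  p_short   <- pbinom(needed - 1, n_apps, rate)  # fewer grants than needed
  p_exact   <- dbinom(needed, n_apps, rate)      # exactly enough
  p_surplus <- 1 - p_short - p_exact             # more than needed
  cat(sprintf("%2d applications: P(<3) = %.2f, P(=3) = %.2f, P(>3) = %.2f\n",
              n_apps, p_short, p_exact, p_surplus))
}

That gives roughly 32% / 28% / 41% for 8 applications and 17% / 21% / 62% for 10.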

The more conservative you are about making payroll, the bigger this problem is.

Wait and See?

It’s not quite as bad as that analysis suggests, because once a lab’s rolling, it’s usually working in two-year chunks, not one-year chunks. But it takes a while to build up that critical mass.

It would be great if you could apply and wait and see before applying again, but it’s not so easy. Most government grants have fixed deadlines, typically once or at most twice per year. The ones like NIH that have two submission periods per year have a tendency not to fund first applications. So if you don’t apply in a cycle, it’s usually at least another year before you can apply again. Sometimes special one-time-only opportunities with partners or funding agencies come up. We also run into problems like government shutdowns—I still have two NSF grants under review that have been backed up forever (we’ve submitted and heard back on other grants from NSF in the meantime).

The situation with Stan at Columbia

We’ve received enough grants to keep us going. But we have a bunch more in process, some of which we’re cautiously optimistic about. And we’ve already received about half a grant more than we anticipated, so we’re going to have to hire even if we don’t get the ones in process.

So if you know any postdocs or others who might want to work on the Stan language in OCaml and C++, let me know ([email protected]). A more formal job ad will be out soon.

Software release strategies

Scheduled release strategy

Stan’s moved to a scheduled release strategy where we’ll simply release whatever we have every three months. The Stan 2.20 release just went out last week. So you can expect Stan 2.21 in three months. Our core releases include the math library, the language compiler, and CmdStan.

That requires us to keep these repositories in sync and working with each other. We’ve had this always-ready-to-release discipline in place since roughly Stan 2.0, when we started picking up many more developers and working on many issues in parallel. Letting the continuous integration builds drift into a non-deliverable state is a recipe for disaster when there are two dozen programmers working on a code base.

What we used to do is decide to do a release when we had an important bug fix or new feature. Inevitably, Daniel and I would have a meeting to do triage and then add a few just-one-more-thing issues to the release. It would take months to do a release just to get all the last-minute fixes in. Now we just release on the scheduled date. We figure that’ll get features to users faster on net. It’s also good for developers to see their recent commits released while they can still remember what they did.

Down to the wire for 2.20

For this release (2.20), I rebuilt the base C++ classes for our compiled models, and Mitzi updated the way Stan and CmdStan are built and tested so that a model and all the service code can be compiled separately.

The good news is that compiling a Stan model in CmdStan on my 2012 notebook now takes just under 7 seconds; it used to be close to 30 seconds. This really tested our resolve for the release, as the final tweaks to the makefiles happened on the morning of the release. Mitzi, who was doing the builds, got up early, and Sebastian, who was reviewing the build pull request, stayed up late in Switzerland to get it in. Otherwise, we were willing to put things off three months just to stick to our process.

So thanks for all the hard work. Hopefully, these compile times will percolate through to RStan and PyStan (the other interfaces get the speedup for free by depending on CmdStan).

AnnoNLP conference on data coding for natural language processing

This workshop should be really interesting.

Silviu Paun and Dirk Hovy are co-organizing it. They’re very organized and know this area as well as anyone. I’m on the program committee, but won’t be able to attend.

I really like the problem of crowdsourcing. Especially for machine learning data curation. It’s a fantastic problem that admits of really nice Bayesian hierarchical models (no surprise to this blog’s audience!).

The rest of this note’s a bit more personal, but I’d very much like to see others adopt similar plans for data curation and application going forward.

The past

Crowdsourcing is near and dear to my heart as it’s the first serious Bayesian modeling problem I worked on. Breck Baldwin and I were working on crowdsourcing for applied natural language processing in the mid 2000s. I couldn’t quite figure out a Bayesian model for it by myself, so I asked Andrew if he could help. He invited me to the “playroom” (a salon-like meeting he used to run every week at Columbia), where he and Jennifer Hill helped me formulate a crowdsourcing model.

As Andrew likes to say, every good model was invented decades ago for psychometrics, and this one’s no different. Phil Dawid had formulated exactly the same model (without the hierarchical component) back in 1979, estimating parameters with EM (itself only published in 1977). The key idea is treating the crowdsourced data like any other noisy measurement. Once you do that, it’s just down to details.

Part of my original motivation for developing Stan was to have a robust way to fit these models. Hamiltonian Monte Carlo (HMC) only handles continuous parameters, so like in Dawid’s application of EM, I had to marginalize out the discrete parameters. This marginalization’s the key to getting these models to sample effectively. Sampling discrete parameters that can be marginalized is a mug’s game.
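
To make the marginalization concrete, here's a minimal R sketch (my own illustration with made-up argument names, not the Stan code) of the marginal likelihood of a single item's labels under a Dawid-Skene-style model: rather than sampling the item's discrete true category, we sum it out on the log scale.

log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))

item_log_lik <- function(labels, coders, log_prev, log_resp) {
  # labels[n]         : category assigned by the n-th annotation of this item
  # coders[n]         : which coder produced that annotation
  # log_prev[k]       : log prevalence of true category k
  # log_resp[j, k, l] : log prob that coder j labels a category-k item as l
  lp <- log_prev
  for (n in seq_along(labels))
    lp <- lp + log_resp[coders[n], , labels[n]]
  log_sum_exp(lp)  # marginalize over the item's unknown true category
}

# toy call: 3 categories, 2 coders who both label the item as category 2
log_prev <- log(c(0.5, 0.3, 0.2))
log_resp <- log(array(1 / 3, c(2, 3, 3)))  # hypothetical uniform-guessing coders
item_log_lik(labels = c(2, 2), coders = c(1, 2), log_prev, log_resp)

A Stan implementation does essentially the same thing with log_sum_exp in the model block.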

The present

Coming full circle, I co-authored a paper with Silviu and Dirk recently, Comparing Bayesian models of annotation, that reformulated and evaluated a bunch of these models in Stan.

Editorial Aside: Every field should move to journals like TACL. Free to publish, fully open access, and roughly one month turnaround to first decision. You have to experience journals like this in action to believe it’s possible.

The future

I want to see these general techniques applied to creating probabilistic corpora, to online adaptive training data (aka active learning), to joint corpus inference and model training (a la Raykar et al.’s models), and to evaluation.

P.S. Cultural consensus theory

I’m not the only one who recreated Dawid and Skene’s model. It’s everywhere these days.

I recently discovered an entire literature dating back decades on cultural consensus theory, which uses very similar models (I’m pretty sure either Lauren Kennedy or Duco Veen pointed me to it). The authors go more deeply into the philosophical underpinnings of the notion of consensus driving these models (basically the underlying truth of which you are taking noisy measurements). One neat innovation in the cultural consensus theory literature is a mixture model of truth—you can assume multiple subcultures are coding the data with different standards. I’d thought of mixture models of coders (say experts, Mechanical Turkers, and undergrads), but not of the truth.

In yet another small world phenomenon, right after I discovered cultural consensus theory, I saw a cello concert organized through Groupmuse by a social scientist at NYU I’d originally met through a mutual friend of Andrew’s. He introduced the cellist, Iona Batchelder, and added as an aside she was the daughter of well known social scientists. Not just any social scientists, the developers of cultural consensus theory!

Peter Ellis on Forecasting Antipodal Elections with Stan

I liked this intro to Peter Ellis from Rob J. Hyndman’s talk announcement:

He [Peter Ellis] started forecasting elections in New Zealand as a way to learn how to use Stan, and the hobby has stuck with him since he moved back to Australia in late 2018.

You may remember Peter from my previous post on his analysis of NZ traffic crashes.

The talk

Speaker: Peter Ellis

Title: Poll position: statistics and the Australian federal election

Abstract: The result of the Australian federal election in May 2019 stunned many in the politics-observing class because it diverged from a long chain of published survey results of voting intention. How surprising was the outcome? Not actually a complete outlier; about a one in six chance, according to Peter Ellis’s forecasting model for Free Range Statistics. This seminar will walk through that model from its data management (the R package ozfedelect, built specifically to support it), the state-space model written in Stan and R that produces the forecasts; and its eventual visualisation and communication to the public. There are several interesting statistical issues relating to how we translate crude survey data into actual predicted seats, and some even more interesting communication issues about how all this is understood by the public. This talk is aimed at those with an interest in one or more of R, Stan, Bayesian modelling and forecasts, and Australian voting behaviour.

Location: 11am, 31 May 2019. Room G03, Learning and Teaching Building, 19 Ancora Imparo Way, Clayton Campus, Monash University [Melbourne, Australia]

The details

Ellis’s blog, Free Range Statistics, has the details of the Australian election model and much much more.

You can also check out his supporting R package, ozfedelect, on GitHub.

From hobbyist to pundit

Ellis’s hobby led to his being quoted by The Economist in a recent article, Did pollsters misread Australia’s election or did pundits?. Quite the hobby.

But wait, there’s more…

There are a lot more goodies on Peter Ellis’s blog, both with and without Stan.

A plea

I’d like to make a plea for a Stan version of the Bradley-Terry model (the statistical basis of the Elo rating system) for predicting Australian Football League matches. It’s an exercise somewhere in the regression sections of Gelman et al.’s Bayesian Data Analysis to formulate the model (including how to extend to ties). I have a half-baked Bradley-Terry case study I’ve been meaning to finish, but would be even happier to get an outside contribution! I’d be happy to share what I have so far.
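
In the meantime, here's a minimal maximum-likelihood sketch of the Bradley-Terry core in R; it's a toy with simulated data, not the Stan case study I'm hoping for, and all the variable names are my own.

# toy round-robin data: 6 teams, 10 rounds, simulated with known abilities
set.seed(1)
n_teams <- 6
true_ability <- rnorm(n_teams)
matchup <- t(combn(n_teams, 2))
matchup <- matchup[rep(1:nrow(matchup), 10), ]                       # 10 rounds
p1 <- plogis(true_ability[matchup[, 1]] - true_ability[matchup[, 2]])
w1 <- rbinom(nrow(matchup), 1, p1)                                   # did the first team win?
winner <- ifelse(w1 == 1, matchup[, 1], matchup[, 2])
loser  <- ifelse(w1 == 1, matchup[, 2], matchup[, 1])

# negative log likelihood: P(i beats j) = inv_logit(ability[i] - ability[j])
bt_neg_log_lik <- function(ability, winner, loser)
  -sum(plogis(ability[winner] - ability[loser], log.p = TRUE))

# pin team 1's ability to 0 for identifiability and fit the rest by ML
fit <- optim(rep(0, n_teams - 1),
             function(a) bt_neg_log_lik(c(0, a), winner, loser),
             method = "BFGS")
round(c(0, fit$par) - (true_ability - true_ability[1]), 2)  # estimation error

A Stan version would put priors on the abilities, which also takes care of identifiability without pinning one to zero, and would need the extension to ties mentioned above.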

[edit: fixed spelling of “Elo”]

Stan examples in Harezlak, Ruppert and Wand (2018) Semiparametric Regression with R

I saw earlier drafts of this when it was in preparation and they were great.

Jarek Harezlak, David Ruppert and Matt P. Wand. 2018. Semiparametric Regression with R. UseR! Series. Springer.

I particularly like the careful evaluation of variational approaches. I also very much like that it’s packed with visualizations and largely based on worked examples with real data and backed by working code. Oh, and there are also Stan examples.

Overview

From the authors:

Semiparametric Regression with R introduces the basic concepts of semiparametric regression and is focused on applications and the use of R software. Case studies are taken from environmental, economic, financial, medical and other areas of applications. The book contains more than 50 exercises. The HRW package that accompanies the book contains all of the scripts used in the book, as well as datasets and functions.

There’s a sample chapter linked from the book’s site. It’s the intro chapter with lots of examples.

R code

There’s a thorough site supporting the book with all the R code. R comes with its own warning label on the home page:

All of the examples and exercises in this book [Semiparametric Regression with R] depend on the R computing environment. However, since R is continually changing readers should regularly check the book’s News, Software Updates and Errata web-site.

You’ve got to respect the authors’ pragmatism and forthrightness. I’m pretty sure most of the lack of backward compatibility users experience in R is primarily due to contributed packages, not the language itself.

Background reading

The new book’s based on an earlier book by an overlapping set of authors:

D. Ruppert, M. P. Wand and R. J. Carroll. 2003. Semiparametric Regression. Cambridge University Press.

Cost and Format

First the good news. You can buy a pdf. I wish more authors and publishers had this as an option. I want to read everything in pdf format on my iPad.

Now the bad news. The pdf is US$89.00, or $29.95 per chapter. The softcover book is US$119.99; the print edition is a bit less on Amazon, at US$109.29 as of today. I wonder who works out the pennies in these prices.

Here’s the Springer page for the book in case you want a pdf.

Sometimes the Columbia library has these Springer books available to download a chapter at a time as pdfs. I’ll have to check about this one when I’ve logged back into the network.