Hello, world! Stan, PyMC3, and Edward

Being a computer scientist, I like to see “Hello, world!” examples of programming languages. Here, I’m going to run down how Stan, PyMC3 and Edward tackle a simple linear regression problem with a couple of predictors.

No, I’m not going to take sides—I’m on a fact-finding mission. We (the Stan development team) have been trying to figure out whether we want to develop a more “pythonic” interface to graphical modeling in Stan. By the way, if anyone knows any guides to what people mean by “pythonic”, please let me know—I’m looking for something like Bloch’s Effective Java or Myers’s Effective C++. Rather than reinventing the wheel, I’ve been trying to wrap my head around how PyMC3 and Edward work. This post just highlights some of my observations. Please use the comments to let me know if I’ve misunderstood or misrepresented something in PyMC3 or Edward; I really want to understand what they’re doing more clearly. If I’m way off base, I’ll update the main post.

Bayesian linear regression

I’ll use a simple linear regression in all three frameworks to give you an idea of what it looks like. That is, we’ll use the sampling distribution

$latex y_n \sim \mbox{Normal}(\alpha + \beta_1 x_{n,1} + \beta_2 x_{n,2}, \sigma)$

and the priors

$latex \alpha, \beta_1, \beta_2 \sim \mbox{Normal}(0, 10)$

$latex \sigma \sim \mbox{Normal}(0, 1)$, restricted to $latex \sigma > 0$

In Edward, I went with their example prior, which is a lognormal on variance,

$latex \log \, \sigma^2 \sim \mbox{Normal}(0, 1)$, again restricted to $latex \sigma > 0$.

 
Hello, world! Stan

data {
  int N;
  vector[N] y;
  matrix[N, 2] x;
}
parameters {
  real alpha;
  vector[2] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 1);
  y ~ normal(alpha + x * beta, sigma);
}

 
Hello, world! Edward

import edward as ed
import numpy as np
import tensorflow as tf

from edward.models import Normal

X = tf.placeholder(tf.float32, [N, 2])
sigma = tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))
beta = Normal(loc=tf.zeros(2), scale=tf.ones(2))
alpha = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, beta) + alpha, scale=sigma * tf.ones(N))

I’m not 100% sure this would work—I borrowed pieces of examples from their Supervised Learning (Regression) tutorial and their Linear Mixed Effects Models tutorial.

 
Hello, world! PyMC3

This is copied directly from the official Getting Started with PyMC3 tutorial:

from pymc3 import Model, Normal, HalfNormal

basic_model = Model()

with basic_model:

    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

It’s not quite standalone as is—there are free variables Y, X1, X2. In the tutorial, they make sure the necessary data variables (Y, X1, X2) are defined in the environment before attempting to define the model. So they start their tutorial by running this simulation first:

import numpy as np
import matplotlib.pyplot as plt

# Initialize random number generator
np.random.seed(123)

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma

So you really need to consider the definitions of Y, X1, X2 as part of the model specification. And keep in mind that alpha, sigma, beta here are not the same variables as defined in the model scope previously; they are masked in the basic_model environment by the definitions of those variables there.

 
Model size

Both Edward and PyMC3 model definitions are substantially shorter than Stan’s. That’s largely because of Stan’s standalone static type definitions—the actual model density is the line-for-line similar in all three interfaces.

What’s happening in both PyMC3 and Edward is that the distribution functions are defining the shapes and sizes of the variables. And Python itself is dynamically typed, so variables just get their types from what is assigned to them.

 
Joint density model abstraction and data binding

In both Stan and Edward, the program defining a model defines a joint log density $latex \log p(y, \theta)$ that acts as a function from data sets to concrete posterior densities. In both Stan and Edward, the language distinguishes data variables from parameter values and provides an object-level representation of data variables.

In PyMC3, the data is included as simple Python types in the model objects as the graph is built. So to get a model abstract, you’d have to write a function that takes the variables as arguments then returns the model instantiated with data. The definition of the deterministic node mu here is in terms of the actual data vectors X1 and X2—these aren’t placeholders, their values are used from the containing environment. The definition of the stochastic node Y_obs explicitly includes actual data through observed=Y. This maneuver is necessary because they couldn’t assign a stochastic node to Y (well, they could, seeing as it’s Python, but it wouldn’t do what a naive user might expect). The parameter values from the simulation (alpha, beta, sigma) are not used in the model definition—rather, they’re masked (redefined) in the scope of the basic_model using the with statement.

I wonder how much of this behavior of PyMC3 is due to building on top of Theano (the symbolic differentiation package they use for gradients). I couldn’t figure out how to do code generation autodiff or symbolic autodiff any other way back when I was thinking about it and it’s one of the main reasons we went with templated autodiff instead.

Like Stan, Edward defines a function rather than binding in the data directly to the model. The tf.placeholder() business is defining lazy arguments to the model (a lambda abstraction in more formal terms). As with Stan, Edward then lets you plug in data for the placeholders to produce an instantiated model implementing the posterior density determined by conditioning on that data. (Here’s the TensorFlow doc on placeholder.)

In BUGS and JAGS, the model definition didn’t specify what variables are data and which is parameters, but it is fixed when the model is run with data by inspecting the data.

The compilation bottleneck for Stan is when the program is compiled into the joint density (a C++ executable in CmdStan, an Rcpp object in RStan, or a Cython object in Python). Adding the data is fast as it only binds the data $latex y$ in the joint density to create a concrete posterior density; in implementation terms, we construct an instance of an immutable C++ class from the data. I’m pretty sure Edward would work the same way. In PyMC3, the compilation down to Theano must only happen after the data is provided; I don’t know how long that takes (seems like forever sometimes in Stan—we really need to work on speeding up compilation).

 
Variable sizes and constraints inferred from distributions

In PyMC3, shape=2 is what determines that beta is a 2-vector.

The PyMC3 program also explicitly uses the half-normal distribution because they implicitly use the sampling distribution to define constraints on the parameters, so that they can use the same kind of underlying unconstraining transforms as Stan under the hood in order to run HMC on an unconstrained space.

Edward doesn’t seem to support constraints (at least not in their intro tutorials—they transform variables manually, which is easy for lognormal, but substantially harder elsewhere). At least that’s what I’m guessing is the reason behind their formulation of the prior on scale, tf.sqrt(tf.exp(tf.Variable(tf.random_normal([])))).

Edward is also different than PyMC3 and Stan in that it broadcasts up the parameters so that they are all the same size. That’s the purpose of the size in scale=tf.ones(2) in the expression defining beta and also the purpose of the multiplication in scale=sigma * tf.ones(N) in the expression defining y.

Stan and PyMC3 use more standard vectorized functions that broadcast internally. Also, Stan, like BUGS and JAGS, allows truncations for univariate distributions. It means Stan can get away without a separate half-normal distribution. I don’t know if truncation is possible in either Edward or PyMC3.

 
Non-trivial use of embedding?

One nice aspect of embedded languages is that you can use all the language environment tools and can use the language itself. One commonly mentioned benefit is autocompletion in the REPL environment (interactive Python interpreter). We can do a bit of that in Stan in emacs and in Rstudio for R, but it’s hardly the smooth embedding of PyMC3 or Edward.

I couldn’t find examples in either Edward or PyMC3 that make non-trivial use of the embedding in Python. All the examples are just scripts (sequences of Python statements). I’d like to see an example in which they take advantage of being embedded in Python to build something like a hierarchical model component that could be used with other models. We can’t do that with a function in Stan because functions can’t introduce new parameters. I don’t quite know enough Python to pull off an example myself, but it should be possible. I asked a few of the PyMC3 developers and never got a concrete reference, so please comment if you have a working example somewhere I can cite.

 
Type checking and data types

Stan enforces static typing on all of its variables. It then provides a matrix arithmetic style that is like that of R and MATLAB. So you just multiply a matrix times a vector and you get a vector. Argument types like requiring a matrix argument as the covariance parameter in a multivariate normal are enforced statically, too. Size constraints and content constraints like positive definiteness are enforced at run time.

With PyMC3 and Edward, we’re in a slightly different situation. Although Python itself is dynamically typed, the random variables in these languages resolve as class instances in their respective libraries. I don’t know how they do error checking—hopefully at object construction time rather than when the model is executed.

Stan has Cholesky factor parameters for correlation and covariance matrices, simplex parameters, ordered and unit-vector parameters. I’m not sure if or how these would be handled in PyMC3 or Edward. Having the Cholesky factors is key for efficient multivariate covariance estimation that’s numerically stable.

One place Stan is lacking is in sparse and ragged data strutures or tuple data structures for multiple returns. We have a single sparse operation (sparse matrix times dense vector, which is the one you need for efficient sparse GLM predictor matrices), but no direct sparse data types. We require users to code their own raggedness in long (melted) form. Is there a way to do any of that more easily in PyMC3 or Edward?

Stan has rather clunky handling of missing data compared to BUGS or JAGS (two languages I know pretty well, unlike PyMC3 and Edward!). PyMC3 gives you a halfway house in that you pass in a mask if the data that’s missing is missing-at-random from an observation vector or matrix (Python doesn’t have the undefined (NA) data structure from R). I’ve been trying to figure out how to make this easier in Stan from the get go.

 
Whose functions?

Stan uses its own math libraries and all functions resolve to the math lib equivalents (these often delegate to Eigen or Boost) which bottom out in Stan’s automatic differentiation. PyMC3 and Edward functions need to bottom out in Theano and TensorFlow functions to allow analytic derivatives and automatic differentiation respectively. I know that Theano uses NumPy, but I’m not sure if that’s also the case with TensorFlow (there seem to be multiple options for data representations in Edward).

Both PyMC3 and Edward used a class named Normal to represent a variable with a normal sampling distrbution in the model. These classes are not the same in the two interfaces, nor are they the normal distributions built into TensorFlow (tf.Normal, not ed.Normal) or into SciPy (scipy.stats.norm).

Thus none of PyStan, PyMC3, or Edward can just call arbitrary Python functions as parts of their models. I don’t know where the languages stand in terms of functions available. Stan has a library of linear algebra, probability, differential equation, and general math functions listed in the back of our manual, but I’m not sure where to find a list of functions or distributions supported in PyMC3 or Edward (partly because I think some of this delegates to Theano and TensorFlow).

 
Extending with user-defined functions

Stan provides a language for writing new functions in the Stan language. I couldn’t see how you’d define new desnities in either PyMC3 or Edward directly in their APIs. Of course, you could use functions to put models together (that’s what I expected to see exemplified in their tutorials).

PyMC3 has instructions for adding new functions, but they don’t work for autodiff (presumably you’d need to go into Theano to do that the way you have to go into C++ to add a new underlying distribution as a Stan built-in). So you’re presumably either out of luck or you try to use something like the zero-trick of BUGS and JAGS.

I have idea what you’d do in Edward to write a new density function.

In all three languages, you can go deep and add new differentiable functions to the underlying math library. I don’t know if that requires C++ in either Theano or Tensorflow, or if you could do it in Python (perhaps with some reduced efficiency).

 
Minor nitpicks with PyMC3

I’m not a big fan of duplicating all the variable names in quotes. My guess is that there’s not enough reflection inside the PyMC3 model class to avoid it.

The PyMC3 argument naming mu, sd bothers me because I’m a neat freak like every other low-level API designer. For consistency, the naming should be one of

  • mu, sigma
  • location, scale
  • mean, sd

The first one seems opaque. The second one is what I’d have used and what Edward uses. The last choice is what R uses, but I don’t like it because it confuses the parameter with a property of a random variable.

 
Efficiency and sampling

This post isn’t about efficiency of the model’s ability to calculate values and derivatives. In Stan (and presumably the other interfaces if they’re at all efficient), that time is all taken up doing gradient calculations.

I’m more concerned with user efficiency and clarity in writing models, especially when you have a bunch of models you want to build in a sequence as you’re doing an applied project.

Until Edward implements something like NUTS, it’s not going to be very effective at solving general purpose practical problems with full Bayes because HMC is so hard to tune in general (see the Hoffman and Gelman NUTS paper in JMLR for a sequence of striking examples that even downplay how hard the tuning is by collapsing one dimension of the grid that was searched for tuning parameters) and other approaches are so slow to mix. There’s no fundamental obstacle other than that NUTS is a fussy recursive algorithm; I just don’t think the Edward devs care much about doing practical MCMC.

One thing that’s very nice about PyMC3 (don’t know much about inference in Edward) is that you can piece together block samplers. The rest of the PyMC3 interface is very nice for doing lots of things that wind up being pretty clunky in Stan. Some of these design issues are very relevant for deciding if we want to try to embed a graphical modeling language for Stan in Python.

 
What might an embedded Stan graphical modeling language look like?

We could define a Python API the following way, which would make the way a model is built up in the embedded language very similar to how it’s built up in Stan, but with quite a few limitations arising from the restriction to graphical modeling.

joint = StanModel();
with joint:
    N = data(integer())
    y = data(vector(N))
    x = data(matrix(N, 2))

    alpha = parameter(real())
    beta = parameter(vector(2))
    sigma = parameter(real(lower=0))

    alpha.normal(0, 10)
    beta.normal(0, 10)
    sigma.normal(0, 1, lower=0)
    y.normal(alpha + x * beta, sigma)

data = {N:10, y:np.rand.randn(N), x:np.rand.randn(shape=(N, 2))}
posterior = model.condition(data)
sample = posterior.sample(...)

That’d require a lot of using statements, or you could prefix all those functions/constructors with “st.” or maybe something else (I’m have much less experience with Python than R).

And all the arguments would have names, but do we really want to write normal(loc=mu, scale=sigma) rather than normal(mu, sigma)?

 
Other thoughts?

I’ll leave the floor open at this point. As you can see, I didn’t do a very deep dive into PyMC3 or Edward and am happy to clarify vague descriptions or correct misconceptions in how these languages would be used in practice. Like I said up front, we’re thinking about adding a graphical modeling language to Stan, so I want to understand all of this better. No better way to hold my feet to the fire than blog about it!

NIMBLE is an API that’s very much like PyMC in R. The big drawback there is that they don’t have autodiff (so it’s like PyMC, not PyMC3). Without autodiff or symbolic diff, it’s pretty much impossible to implement HMC or L-BFGS or gradient descent.