**PLEASE NOTE:** This is a **guest post** by **Llewelyn Richards-Ward**.

When two packages appear to do the same thing, let’s return to the Zen of Python, which suggests that:

There should be one—and preferably only one—obvious way to do it.

Why is this particular mantra important? I think because the majority of users of packages like PyMC and PyStan are probably not programmers or statisticians. They are folk, like myself, who have an area of interest or expertise that these packages can assist us in understanding. So the implicit invitation to choose the “better” package, frankly, is probably not in our skill-set. With this in mind, I wrote to Prof. Andrew Gelman asking whether he would consider a blog or comparison between these two packages to add the expertise I don’t possess. He copied Bob Carpenter and Allen Riddell into the email, both of whom are subject matter experts around Stan and Python, respectively. The subsequent discussion stimulated so many more questions that I believe it is worth sharing some of the insights they offered me. This short summary is for end users and others with a healthy degree of OCD and inquisitiveness.

How much is each implementation of Bayesian analysis used? Some (crude) indicators, namely counts of Stack Overflow results with Google search results in parentheses, showed:

- PyMC2: 49 (4,150),
- PyMC3: 173 (12,300),
- Stan: 1,116 (262,000),
- PyStan: 4 (4,720).

[ed. No idea how you search for Stan on Google — we should’ve listened to Hadley and named it sStan3 or something.]

This fits with Stan being the powerhouse, with PyMC3 gaining a Python following and PyStan either being so clear to use that no one asks questions, or just not much used in Python.

**What goes in…**

The original question about which to choose arose following some online (Stack Exchange, Wikipedia) searching for a solid set of reasons to choose between learning PyStan or PyMC3. I’ve used PyMC2 and it is, well, Pythonic, but probably not referenced elsewhere all that often. Where there are examples, they are reasonable in terms of entry-level Bayesian modeling, but arguably also ‘toy’ examples.

In contrast, I’d read that Stan is powerful but disadvantaged by the need to learn its model-coding language (actually, having never programmed in C++, this turns out to be relatively easy, with well-documented syntax in the manual). The main drawback is not having access to attributes and methods within the text-formatted Stan language as you write the code, at least if you write it in Python. Turning statistical notation into function or expression code, while keeping the Pythonic spirit of simplicity, is what the Stan language, Theano, patsy and others are all attempting. Each has strengths and limitations, with probably no outright ‘best way’ at present. Overall, the key message is that coding a model in PyStan, PyMC2 or PyMC3 relies much more on understanding your model than on knowing how to write code.
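To give a concrete flavour of what learning the Stan language involves, here is a minimal sketch of a simple linear regression in Stan’s model-coding language (an illustrative example, not taken from the post):

```stan
// Simple linear regression: y[n] ~ normal(alpha + beta * x[n], sigma)
data {
  int<lower=0> N;        // number of observations
  vector[N] x;           // predictor
  vector[N] y;           // outcome
}
parameters {
  real alpha;            // intercept
  real beta;             // slope
  real<lower=0> sigma;   // residual standard deviation
}
model {
  y ~ normal(alpha + beta * x, sigma);  // vectorized likelihood
}
```

The blocks declare data, parameters, and the log density separately; in PyStan this program is passed as a string and compiled to C++ behind the scenes.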

**What comes out…**

Visualisation is at the heart of taking data and developing human intuition and understanding. Both PyMC and PyStan produce traces and have the capacity to simply query variables. Neither has a particularly well-developed wrapper for plots, although PyMC is clearly ahead. Alongside PyStan is the R-based ShinyStan package (see also Introducing ShinyStan), which “can be used to explore the output of any MCMC program written in any programming language, including but not limited to Stan, JAGS, BUGS, MCMCPack, NIMBLE, emcee, and SAS”. As an aspirational plotting package for a Python implementation, either with PyStan or PyMC, it is well worth looking at; I’m unsure whether it can be imported and used via the rpy2 package. It is useful to know that there is agreement and an intention to collaborate between PyMC and PyStan on how plots are produced. Neither will have seaborn as a dependency; this is somewhat puzzling, as seaborn seems to handle traces from either package well. Both can convert the trace to a pandas object, allowing use of the pandas plotting API. So there is an exciting opportunity here for matplotlib (soon to be significantly improved) gurus to join with Allen Riddell in creating the “best of all worlds plotting API” for PyMC/PyStan.
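Since both packages can expose their traces as plain arrays, a basic posterior summary needs only ordinary Python. A minimal sketch, using simulated draws as a stand-in for real package output:

```python
import random
import statistics

random.seed(1)

# Stand-in for a trace: 4,000 posterior draws of one parameter "mu".
# Both PyMC and PyStan can hand you raw draws in a form like this.
trace = {"mu": [random.gauss(2.0, 0.5) for _ in range(4000)]}

def summarize(draws):
    """Posterior mean and central 95% credible interval from raw draws."""
    s = sorted(draws)
    n = len(s)
    return statistics.mean(s), (s[int(0.025 * n)], s[int(0.975 * n)])

mean, (lo, hi) = summarize(trace["mu"])
print(round(mean, 1))   # close to the true value of 2.0
print(lo < 2.0 < hi)    # the interval should cover the true value
```

Anything fancier (pair plots, trace plots) is then a matter of handing these arrays to matplotlib, seaborn, or pandas.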

**The bit in between…**

‘Under active development’ probably best describes the bit in between the code for the model and the output. It is useful to read and participate in the Stan and PyMC online fora. This is why the original question about comparing PyMC3 and PyStan is simply wrong: both are excellent packages, and each does different things across the two dimensions of implementation and usability. PyMC3 represents huge steps in modeling language and sampler flexibility. Bob Carpenter wrote, “I think they’re probably behind in functions and distributions and maybe in sampler adaptation. But they’re way ahead in Python integration directness! And we so far only have continuous params. I don’t know what PyMC does for discrete params. It’s a tradeoff, because marginalizing them out where possible leads to much faster samplers but is a pain from the coding side.” For a useful overview of why Stan was created, and particularly of the strengths it possesses, Bob Carpenter’s presentation provides some key insights.

**Additional considerations:**

- PyMC2 used Metropolis–Hastings (MH) samplers; PyMC3 uses a “No-U-Turn Sampler (NUTS; Hoffman, 2014), a self-tuning variant of Hamiltonian Monte Carlo (HMC; Duane, 1987)”. PyStan uses the NUTS variant of HMC by default, but allows other current and future customizations by virtue of its overall architecture. The expectation from NUTS/HMC is that convergence will be significantly more rapid, an expectation borne out by empirical testing. For the non-technical user, both allow use of NUTS, which means the main weakness of plain HMC (the need to hand-tune the step size and number of steps at the start) is well managed.
- The level of documentation in PyStan offers a basic overview, but you need to read the Stan manual to really understand the code structure and depth. Similarly, the PyMC3 documentation (still in beta) is ‘emerging’, but introduces a powerful package. For a fuller treatment of current PyMC documentation, look to PyMC 2.3.4. Overall, the Stan manual is significantly more comprehensive and well written (hat off to Bob Carpenter). Additionally, there is the popular Stan-users list, and Stan’s library of pre-coded models (about 100 models from the BUGS and ARM examples).
- Optimization also is important: Stan is based on C++ code, whereas PyMC2’s optimization path is less clear. One would imagine Cython or other optimizations are possible, but how this pans out in real-world analyses, where time is probably better spent on the problem rather than on solving optimization issues, seems like a barrier to PyMC2’s utility. Enter Theano. PyMC3, through using Theano, is a huge step forward in optimization. Theano allows Python “to transparently transcode models to C and compile them to machine code, thereby boosting performance. Theano is a library that allows expressions to be defined using generalized vector data structures called tensors, which are tightly integrated with the popular NumPy ndarray data structure, and similarly allow for broadcasting and advanced indexing, just as NumPy arrays do. Theano also automatically optimizes the likelihood’s computational graph for speed and provides simple GPU integration.” (from the PyMC3 manual).
- Missing data is more difficult to manage in PyStan/Stan; PyMC can easily leverage masked arrays, perhaps using pandas to clean the data first, providing a tidier set of input data and/or managing missingness within the MCMC model.
- Discrete parameters seemed more difficult to manage in PyStan than in PyMC, or at least that was my initial impression. However, as Bob Carpenter noted: “The key idea is that mixing is better when you marginalize out discrete parameters (as a consequence of the Rao-Blackwell theorem), and you get way more precise tail statistic estimates.” This is another useful illustration of how end-user impressions can be wrong and how we simply may not ask the right questions. A useful comparison for developing an understanding of both the theory and practice of discrete parameters and mixing is Chris Fonnesbeck’s example of the coal-mining disasters model. In PyMC it is relatively intuitive to code and solves efficiently; in PyStan, the coding is fully explained in the manual on pp. 117–122. I suggest that coding and comparing both approaches will help a lot in understanding how these packages approach problems and account for statistical and computational theory. [ed. PyMC3 and Stan models will look more similar in problems without discrete parameters]
- ODE solvers are implemented in the Stan code base. PyMC2’s capacity for this appears to rely on optimization through Cython, odeint [ed. Boost’s odeint is what we use right now in Stan 2.8 for solving ordinary diff eqs] or similar indirect methods. PyMC3’s use of ODE solvers is unknown at this stage.
- Stan has variational inference under development, although not yet in PyStan; PyMC3 does not appear to offer variational inference. There is another package, BayesPy, that currently offers only conjugate-exponential-family models (variational message passing), with plans for more.
- Scaling of algorithms is important also. Michael Betancourt presents a useful examination around why scaling in terms of size and dimensions is an ongoing challenge, rounding this out with discussion of solutions being developed in Stan.
- Stan and PyStan do not compile models to directed acyclic graphs (DAGs); PyMC3 does use the DAG approach. Because of this, PyMC3 allows production of graphics of the model, which is a very useful feature. Stan relies on the language syntax at the start to illustrate the model.
- Although maybe not an immediate consideration for end users of these packages, it is also worth considering how each package might adapt to and solve open research questions in Bayesian computing (future-proofing). A couple of examples, amongst many no doubt, are how variational inference will be handled, or how HMC and discrete parameters interact. If these or similar future areas are relevant, or even might be, I would strongly recommend finding out where each package stands in terms of a plan to approach the problem.
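To make the NUTS/HMC point above concrete: plain HMC simulates Hamiltonian dynamics with a leapfrog integrator, and its classic weakness is that the step size and number of steps must be chosen by hand (the tuning NUTS automates). A minimal sketch for a one-dimensional standard normal target, not taken from either package:

```python
import math
import random

random.seed(0)

# Target: standard normal, log p(q) = -q^2 / 2, so grad log p(q) = -q.
def grad_log_p(q):
    return -q

def hmc_step(q, eps=0.1, steps=20):
    """One Hamiltonian Monte Carlo transition with leapfrog integration."""
    p = random.gauss(0.0, 1.0)                 # resample momentum
    h0 = 0.5 * p * p + 0.5 * q * q             # initial Hamiltonian
    qn, pn = q, p
    pn += 0.5 * eps * grad_log_p(qn)           # half step in momentum
    for i in range(steps):
        qn += eps * pn                         # full step in position
        if i < steps - 1:
            pn += eps * grad_log_p(qn)         # full step in momentum
    pn += 0.5 * eps * grad_log_p(qn)           # final half step in momentum
    h1 = 0.5 * pn * pn + 0.5 * qn * qn
    # Metropolis correction for leapfrog discretization error:
    return qn if random.random() < math.exp(min(0.0, h0 - h1)) else q

q, draws = 0.0, []
for _ in range(5000):
    q = hmc_step(q)
    draws.append(q)

mean = sum(draws) / len(draws)
var = sum(d * d for d in draws) / len(draws)
print(abs(mean) < 0.15, abs(var - 1.0) < 0.3)  # should be near 0 and 1
```

NUTS removes the need to pick `eps` and `steps` by adapting the step size and choosing the trajectory length on the fly, which is why both packages can run without hand tuning.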
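The masked-array point above can be illustrated in a few lines of NumPy (a sketch with made-up data; in PyMC, masked entries can additionally be treated as parameters and imputed during sampling):

```python
import numpy as np

# Missing values encoded as NaN, as they often arrive after cleaning
# with pandas (made-up data).
y = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# A masked array hides the missing entries from computations.
y_masked = np.ma.masked_invalid(y)

print(y_masked.count())   # 4 observed values
print(y_masked.mean())    # 3.25, the mean of the observed values only
```

In Stan, by contrast, observed and missing entries typically have to be separated out explicitly in the data and parameters blocks.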
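The coal-mining disasters model mentioned above is a change-point problem, and the Stan manual’s treatment marginalizes the discrete change-point out of the likelihood. A toy sketch of that marginalization, with made-up counts and fixed rates rather than the real data:

```python
import math

# Toy count series with an apparent drop partway through (made-up data,
# not the actual coal-mining series).
y = [4, 5, 4, 5, 4, 1, 0, 1, 1, 0]
early_rate, late_rate = 4.5, 0.7   # assume the Poisson rates are known

def log_poisson(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

# Marginalize the discrete change-point s out of the likelihood:
# log p(y) = log sum_s p(s) * prod_t Poisson(y[t] | rate(t, s))
T = len(y)
log_terms = []
for s in range(T):   # uniform prior over candidate change-points
    ll = sum(log_poisson(y[t], early_rate if t < s else late_rate)
             for t in range(T))
    log_terms.append(ll - math.log(T))

m = max(log_terms)                                  # log-sum-exp trick
log_marginal = m + math.log(sum(math.exp(t - m) for t in log_terms))

# Posterior over the change-point, recovered after marginalization:
post = [math.exp(t - log_marginal) for t in log_terms]
print(max(range(T), key=lambda s: post[s]))  # index 5, where the rate drops
```

This is exactly the pattern the Stan manual codes inside the model block; PyMC can instead sample `s` directly as a discrete random variable.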

**The wider stuff…**

Both PyMC and PyStan have support from statisticians and academics who are leaders in their field. While it requires a reasonable statistical background, Bayesian Data Analysis (3rd edition) is a ‘must-have’ if you are serious about this area. Examples in the Appendix are in Stan, easily implementable in PyStan, and show clear links between textbook and application. A gentler introduction is to be had from Gelman and Hill’s multilevel regression book.

There are some excellent recordings of presentations by the leaders in this field, notably by Bob Carpenter who presents a ‘start-to-go’ MCMC presentation, followed by a practical introduction to Stan and its capability.

John Salvatier (from an engineering background) and Chris Fonnesbeck (bio-statistics) both have a strong online presence around PyMC3. [ed. John was the one who recommended we use autodiff in Stan!] Many of their IPython notebooks offer exemplars of PyMC2 and PyMC3. Chris Fonnesbeck also has a series of three talks introducing and then developing Bayesian and MCMC techniques in Python, including use of PyMC2.

There are other examples of using PyStan in climatology.

**And next?**

Initially my idea was to compare PyMC3 and PyStan on a more complex problem. Bob Carpenter suggested IRT 2PL problems as a useful benchmark, or possibly using an LKJ prior for correlation matrices (p. 420 of the Stan manual). These ideas appeal as a way forward, comparing how each package handles greater complexity and breadth, rather than the more simplistic comparisons to date. What about a forum to compare and contrast, or a competition at the next SciPy event for the most elegant comparison?! An excellent ‘go-to’ comparison currently available is Frequentism and Bayesianism 4: Bayesian in Python. My thought is that some productive friction between packages can only help move toward a best-practice position.

After further discussion, Prof. Gelman also commented on how PyStan, Stan and PyMC could all combine/evolve toward the common goal each has. In his words:

Ultimately I think/hope that PyMC can evolve into a wrapper for PyStan that will allow graphical models, missing-data imputation, fake-data simulation, posterior predictive checks, predictive model comparison (as in this paper, etc). As I see it, Stan is super-efficient and will continue to improve when it comes to sampling, optimization, and distributional approximation, so it would make sense for PyMC to use Stan as its “engine” rather than reinventing the wheel. But, from a statistical point of view, there’s a big gap between a graphical model and an unnormalized posterior density. And if Python users could work with graphical models and have Stan in the background ready to compute, that could be the best of all possible worlds. We’ve been talking about doing this in R but perhaps Python would be a better environment for it. I’d be happy to talk with the PyMC people (I’m cc-ing John Salvatier here) if there is interest in moving in that direction. I think this approach would be cool: if PyMC includes PyStan as its engine, PyMC could automatically have all the benefits of Stan but do so much more, enough to put Python at the forefront of serious Bayesian computing.

**Conclusion**

The only reasonable conclusion, I believe, is that one should go well beyond naive “Which is better?” or “Which is easier?” questions. Various opinions weight things like the language used to model the problem, the output, and the ‘black box’ that is the sampler. PyMC3, however, seems to offer a significant step up from PyMC 2.3.4. With the integration of Python behind it, PyMC3, Stan and PyStan now seem to be running in the same race. My choice, for what it is worth, will be to go with PyStan, which for me just feels more robust computationally. It has a clear development plan, a rich set of functions, comprehensive documentation, a dedicated user forum, textbooks that inform and lead the package, and the robustness of a large multi-disciplinary team across programming, statistics, and mathematical theory.

**P.S.** Llewelyn Richards-Ward wrote the above post. But you can blame Andrew for the title. Following up on the success of this recent post, we’re gonna try to go all Buzzfeed from now on.

I’ve been using PyStan for ca. two years now. Before that I oscillated between PyMC and implementing a Python wrapper for JAGS. Virtually all of the relative advantages of Stan vs PyMC have been mentioned. IMO the biggest benefits are the extensive manual in the case of Stan and the graph representation in the case of PyMC.

Irrespective of that, the reason I was not satisfied with PyMC was that in PyMC it was very easy to write a slow model. Sure, if implemented correctly, PyMC is as fast as Stan or JAGS. And of course, similar to PyMC, a different model formulation in Stan can affect the total simulation time. However, in Stan the performance difference between an unoptimized and an optimized model is often negligible, so that most of the time I just skip any optimization attempts. With PyMC an unoptimized model was awfully slow and practically useless without further optimization. Even with the simplest models I had to do a lot of optimization, which was tiring. That is why I gave up on PyMC2. I have no experience with PyMC3, so I can’t speak to that.

IMO, Stan development should be separated into client and server, and should focus on the server side, providing just a rudimentary platform-specific API for the client. I don’t use anything beyond pystan.stan(), pystan.stanmodel.sampling() and pystan.stanfit.extract(). I use my own plot and diagnostic utilities. This separation would make it easier to achieve integration of Stan as an engine in PyMC (as envisaged in the post) or with some other client. It would also be a first step towards multi-client and multi-server functionality that would allow people to easily run Stan on a computing cluster. This functionality is what I’m missing most in Stan currently.

Re: client/server separation. This is what I’d like to see as well. I do, however, think that it’s worth keeping/maintaining an interface in Python that’s similar to RStan in order to allow users to transition from R to Python (if that’s what they’re interested in doing).

The post says:

“Optimization also is important: Stan is based on C++ code, PyMC2’s optimization path is less clear. One would imagine Cython or other optimizations are possible, but how this pans out in terms of real-world analyses…”

How does this compare to RStan? Would switching to PyStan usually offer a performance boost?

I don’t have an empirical answer to this question, but that choice shouldn’t matter much. Most of the time will be spent in the guts of Stan. RStan and PyStan are similar in that they both rely on packages to manage interaction with C++.

If I were making a choice between the two, I’d base my decision on how fast it is for me to do the tasks I want to do around the Stan program. And if you have to offload computation onto servers, I’d suggest exporting the data and using CmdStan over both RStan and PyStan, since it has fewer dependencies, installation is trivial, and it is possibly easier to compile the program once and run it across machines on the cluster.

I believe they are working on including variational inference in PyMC3:

https://github.com/pymc-devs/pymc3/blob/2250386c826e8744e13654188074dd69dad829de/pymc3/examples/advi.ipynb

Also new distributions are extremely easy to write for PyMC3, and seem a bit less so in Stan, but that’s probably just my bias towards Python…

I believe that’s true. Maybe Thomas can let us know how far along VI is in PyMC3.

Regarding distributions: if you have the hang of the Stan language, it’s as easy as defining a function.
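For example, in the Stan language of the time (2.x), a user-defined density is just a function in the functions block whose name ends in `_log`, after which it can appear in a sampling statement. A minimal sketch (illustrative, not from the comment):

```stan
functions {
  // Hand-rolled triangular density f(y) = 2y on (0, 1), on the log scale.
  real triangle_log(real y) {
    return log(2 * y);
  }
}
parameters {
  real<lower=0, upper=1> y;
}
model {
  y ~ triangle;   // resolves to triangle_log because of the _log suffix
}
```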

The core algorithm is implemented and it’s possible to fit models. I suspect there is still a bug or convergence problems, as I did test it on a hierarchical model and compared it to NUTS, where it gave completely different mean estimates (https://github.com/pymc-devs/pymc3/blob/980a96f392380cf0aca0382c4652856ada565340/pymc3/examples/GLM-hierarchical-ADVI.ipynb). What is not implemented are the recent additions of a grid search for eps. Any help would be greatly appreciated, as I think this would be a fantastic addition. Moreover, Theano has good support for stochastic gradient methods, so these should be pretty easy to extend.

I believe this issue has been fixed now.

There’s still some bugs in their ADVI implementation, I believe. At some point it might be more useful for them to try to integrate one of the Theano libraries—such as Parmesan, which has variational autoencoders—instead of building ADVI from scratch.

If you’re going to go BuzzFeed on the titles, check out

http://clickotron.com/

and the blog post describing it at

http://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/

In terms of speed, I’d really like to see a timing benchmark on something very simple, such as HMC on hierarchical linear regression.

I’ll get around to writing a comparison one of these days. I think it should include a bit more:

– a couple examples. I’m sure PyMC and Stan have different characteristics on different problems.

– check estimates. Hopefully we see they both get to the same place (or the right place if an analytic solution exists). Implementation is hard, and this would be a sanity check that it’s been implemented correctly.

– time per effective sample size. Possibly use the defaults to time it? This will change drastically based on the number of warmup iterations and sampling iterations. For Stan, I think we should be counting it in two ways: runtime and with C++ compile time. Unfortunately, that’s still part of the overall process.

– compare optimized code vs non-optimized code. I didn’t realize this was a problem in PyMC. This is useful to see.
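Time per effective sample is the natural metric here. A crude sketch of estimating effective sample size from a chain, truncating the autocorrelation sum at the first negative term (a simplification of, not identical to, the estimator Stan actually uses):

```python
import random

random.seed(0)

# An AR(1) series as a stand-in for MCMC output: correlated draws whose
# effective sample size is much smaller than the raw draw count.
rho, n = 0.9, 10000
x = [random.gauss(0.0, 1.0)]
for _ in range(n - 1):
    x.append(rho * x[-1] + ((1 - rho ** 2) ** 0.5) * random.gauss(0.0, 1.0))

def ess(chain):
    """Crude ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    m = len(chain)
    mu = sum(chain) / m
    c0 = sum((v - mu) ** 2 for v in chain) / m
    tau = 1.0
    for lag in range(1, m // 2):
        c = sum((chain[i] - mu) * (chain[i + lag] - mu)
                for i in range(m - lag)) / m
        if c / c0 < 0.0:      # truncate once the estimate turns negative
            break
        tau += 2.0 * c / c0
    return m / tau

# Theory for AR(1): ESS ≈ n * (1 - rho) / (1 + rho), roughly 526 here.
print(round(ess(x)))
```

Dividing wall-clock time (with or without C++ compile time, as noted above) by this number gives a comparable cost per effective draw.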

Timing comparisons are tricky.

Great post, I agree with all the points but wanted to add the following:

* Code complexity: Because PyMC3 relies on Theano, the actual PyMC3 code base is pure Python and comparably light at around ~9,000 LOC, whereas Stan comes in at around 37,000 lines of advanced C++.

* Developer hours: PyMC3 probably has 1/10th of the developer hours being spent on it with mostly comparable feature set. So I think with more resources, PyMC3 could progress very quickly (related to the point above). We have seen an uptick in outside contributions recently so there is hope.

* The two points lead me to think that Andrew’s idea of having PyMC3 be a front-end to Stan is the wrong approach. Relying on Theano as a computational backend for all the heavy lifting is one of its biggest benefits.

* Computational features: Theano does a lot of tricks on the compute graph to do smart caching, lazy evaluation and symbolic simplifications (for speed and numerical robustness). In addition, compilation to the GPU could lead to more speed-ups but that hasn’t been fully explored yet. To my knowledge, these aren’t implemented in Stan.

* Variational Inference: There is a PR for ADVI as has been noted in another comment: https://github.com/pymc-devs/pymc3/pull/775

Thanks, Thomas! It’s actually really surprising that people think we have so many resources. We actually don’t. =).

Just wanted to clear up some of the claims:

* Code complexity: if we want to compare apples to apples, we’re not that bad (all these numbers include doc, includes, whitespace… I don’t have the patience to figure out how to remove it):

– Stan Library without the Stan compiler (my guess is that this is the fairest comparison): ~11000 LOC

– Stan language compiler (this would be like including Python itself, which is what lets you specify models in PyMC): ~12000 LOC

– Stan Math library (fills the role of Theano): ~41000 LOC

So, I think the right comparison would be 11k vs 9k, but at the end of the day, it doesn’t matter.

* Developer hours: I think you’re grossly overestimating it. When I talked to Chris the first time (2012), it sounded like PyMC had years of developer hours on Stan. But then again, part of it is being realistic with comparison: I don’t know if you’re counting Theano hours too…

* Computational features: we’re doing auto diff, not symbolic diff. Means we can express more, but can’t do those optimizations. Given PyMC3 is in Python, I think Theano is going to be the right move for a while.

Anyway, in short, comparing the two isn’t straightforward, but Stan isn’t as complex as you think and we don’t have that many hours behind it.

In these kinds of comparisons I think it’s also worth mentioning that much of the Stan developer time goes into developing new statistical methodology, such as sampling algorithms and unconstraining transformations and new distributions, which can then be quickly adopted by other projects such as PyMC. One of the huge benefits of Stan is that you get state-of-the-art methods and techniques as soon as they are developed.

A valid point. And there are many examples where PyMC3 was directly inspired by Stan: NUTS, auto-transformations, ADVI, etc. I think such cross-pollination can only be helpful to both communities, which is why I’m excited to have this discussion.

Thanks, Daniel. I don’t disagree and Stan-devs should take the assumption that many more dev-hours went into it as a compliment :).

When counting lines (a poor measure indeed) or developer effort I’m specifically excluding Theano. That’s actually my point. By relying on a third-party library for everything but probability distributions, samplers, and glue (-> PyMC3), there are many things we didn’t have to build. I see that Stan has a similar level of abstraction with Eigen.

Thanks for commenting. These apples-to-oranges comparisons are very challenging, which is why the Stan-dev team has shied away from trying it ourselves. There’s a huge problem in understanding other systems well enough to not be hugely biased even in attempts at “fair” comparisons. Nevertheless, I find these discussions really useful, especially because nobody’s being a jerk here!

As an example, this “how much time did things take” discussion is very challenging. For instance, it may be that PyStan took 1/10th the developer hours of PyMC. We already had Stan for the command line and an underlying API for sampling and optimization, so all PyStan needed to do was wrap it in memory (non-trivial given Cython and NumPy, but still not as challenging as reimplementing fairly complex algorithms like NUTS or ADVI).

Another angle for which “how much time” discussions are misleading is that some developers are way faster than others — even on the Stan-dev team, we have a widely varying set of programmer skills from those with a lot of professional coding experience (like me) to undergrads who’ve never taken a computer science class. So you really somehow need to abstract this to “how much time would a given team take to implement X”, which is pretty much hopeless given the mix of mathematical and coding skills and the varying toolsets.

To add to what Michael said, much of our dev time goes into (a) documentation, (b) testing, and as Michael pointed out, (c) inventing the algorithms like NUTS and ADVI in the first place. We probably have as many lines of testing code as we have lines of code.

A whole lot more of our dev time goes into making sure Stan can run through the command line, R, and Python, as well as being wrappable from Julia, Stata, and MATLAB. Having a non-embedded language that ports across interfaces is both a positive and a negative. It lets people talk about modeling in Stan independently of interface-specific code, but it means we can depend much less on helpful native features of the interface language (though in the case of R, it does let us build ShinyStan, for which I don’t think there’s a PyMC equivalent, though you could argue ShinyStan isn’t really part of Stan proper; and it’s also letting us build RStanARM, an lme4 replacement using the same expression language for GLMs).

One of the main advantages I see with an approach like PyMC’s is the ability to do symbolic integration. It is also one of the main drawbacks, as it limits flexibility in adding new algorithms, routines, etc. For example, we have a stiff ODE solver implemented now that autodiffs the stiff solver.

While we don’t play with the expression graph symbolically in Stan, we could — the data structure’s there. But we do a huge amount of work on the low-level library for making derivatives efficient. The main example of this is our vectorized derivative code for probability functions, which computes up to a proportion, dropping constants, and shares subcomputations in the (local) expression graph (which naively would be a sum of a bunch of calls to the non-vectorized function).

It’d be very interesting to see speed comparisons here. I haven’t actually played with Theano, so all these comments are purely theoretical.

The route for us to using GPUs is in the recent enhancements to Eigen, which is the matrix library we use, and to the ODE solvers. We do not have a way to parallelize derivative calculations in general, though. Does Theano? It’s a tricky problem, but one well worth exploring if you have the graph at hand.

But in the end, the main factor is that most of our users are R users, because that’s what most serious research statisticians (the target audience for Stan from the get-go) use day to day if they use anything open source (there’s a surprising [to me] amount of Stata, SAS, MATLAB, etc. still in use). We just weren’t willing to commit to Julia or Python (much less something closed source like MATLAB or Stata) when most of the research community was still using R. I’m not a huge fan of R (or Python, for that matter) myself, by the way (I think the aRgh site sums up my feelings about R pretty well) — it’s just the reality of the field at the moment. I’m holding out hope that Julia will clean out the playing field — it actually feels like a language built from the ground up for the kinds of things I do.

I always felt you guys supported way too many interfaces for a project that’s relatively young. Doesn’t that spread out your precious resources a bit thin? My impression was that by just supporting something like R and Python you’d cover like 80% of your users? (Out of curiosity, what are the splits of Stan users by interface language?)

Doesn’t supporting so many languages lead to a lot of userbase fragmentation too?

I don’t know about Julia. The history of programming languages has always had people coming up with “clean” languages that are great in theory but somehow never catch on. I suspect Julia is destined for this path. Somewhat like the Esperanto of human languages.

Fortunately, for the majority of languages the interfaces are thin wrappers around CmdStan, and I personally think maintaining these interfaces is worth it even as a young project (although I don’t work on the interfaces, so I don’t actually know how much time they consume). There’s not much of a userbase fragmentation, but it does fragment our developing process. E.g., we have code tailored specifically to RStan vs CmdStan, and it’s a shame that the interfaces aren’t unified in terms of their features.

Ideally whenever Stan 3 gets released, the interfaces would be as minimal as possible so that they all profit from Stan features simultaneously.

“from a statistical point of view, there’s a big gap between a graphical model and an unnormalized posterior density”

I know the latter is a more general class of models than the former, but it seems like PGM captures the vast majority of common applied use cases. Can someone explain the distinction further and describe an example of where PGM would be inadequately expressive?

Anon:

I think that almost all the models we’d want to fit would be graphical models. That’s why I think it would be so great to have a Stan wrapper in Python (or R) that works with graphical models and translates to Stan.

I think it really requires a precise definition of what a graphical model is. There are many flavors, ranging from very pure expressions of conditional independence among random variables through systems like BUGS which give you a way to define deterministic nodes and compute functions of random variables and also introduce hacks to let you get around the limitations of strict graphical modeling.

Here are a few examples:

1. A simple mixture model can be represented as a graphical model if you have a discrete parameter z for the responsibility, saying for each n, which component it derived from. That gives you a model like p(y, z, mu, sigma). In Stan, we marginalize that z parameter out and have a density p(y, mu, sigma) which is no longer easy to represent as a graphical model.

2. What about custom distributions? Often, you can think of things as being written as graphical models if you had the distribution in your language. When you don’t, what do you do? In BUGS, there’s the zeros trick, where to add lp to the total log density, you use

`0 ~ poisson(-lp)`

, because log poisson(0 | lambda) = -lambda.

3. Undirected graphical models, like Markov random fields, aren’t easily cast as directed graphical models.

4. What about a weighted regression, where each y[n] gets a “weight” w[n] and the density is given by w[n] * log p(y[n]). Not really a graphical model (this one’s defective in not being a proper Bayesian model, either, because the weights aren’t part of the model).

5. I don’t know how you feel about cut in BUGS (see link above), but that’s not a graphical model in the strict sense (nor is it a proper Bayesian model, either).

I’m sure I’m missing a bunch. But there are other issues of practical importance that this doesn’t get at.

6. The way Stan lets you assign and reassign local variables isn’t something that fits into a graphical modeling language. That’s why you see all those intermediate arrays created to hold what should be local variables in BUGS.

7. Ditto for user-defined functions. None of that’s easily decomposable into graphical models. Now if the user-defined functions somehow only defined modular sub-graphs, you might be able to do something more strict. For instance, where does the user-defined system of differential equations for compartments in a PK/PD model fit into a graphical model? In PK-BUGS it goes into an expression that gets dumped into a special function — very much like Stan, but also very much not integrated into the graphical modeling language.
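The first example above can be written out numerically: marginalizing the indicator z out of a two-component mixture is a log-sum-exp over the components. A toy sketch with fixed parameter values:

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of a normal distribution."""
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Two-component mixture with mixing weight theta (made-up values).
# Instead of sampling the discrete indicator z, marginalize it out:
# log p(y) = log( theta * N(y|mu1,sigma) + (1-theta) * N(y|mu2,sigma) )
theta, mu1, mu2, sigma = 0.3, -2.0, 2.0, 1.0

def marginal_lpdf(y):
    return log_sum_exp(math.log(theta) + normal_lpdf(y, mu1, sigma),
                       math.log(1 - theta) + normal_lpdf(y, mu2, sigma))

# Sanity check: exponentiating the marginalized log density recovers
# the weighted sum of component densities exactly.
direct = (theta * math.exp(normal_lpdf(0.5, mu1, sigma))
          + (1 - theta) * math.exp(normal_lpdf(0.5, mu2, sigma)))
print(abs(math.exp(marginal_lpdf(0.5)) - direct) < 1e-12)  # True
```

The marginalized density p(y, mu, sigma) no longer mentions z at all, which is exactly why it stops being a straightforward graphical model.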

I never tested this myself, but as to the point of marginalizing out discrete variables I heard that it can warp the posterior in non-trivial ways and thus make it much harder for the sampler to converge.

Do you have an example? Or a reference?

Of course, the posterior for the continuous variables is the same either way, so this only makes sense to me if I imagine you’re comparing to something like (blocked) Gibbs, where the conditionals of the continuous parameters given fixed discrete parameters are easier to sample from than the marginalized density. Then you have to make up for the loss of efficiency in the discrete sampling, but I could imagine it happening.

In the examples I’ve coded, HMC on the marginalized models has worked much better than Gibbs (as implemented by BUGS or JAGS; I’ve never tried mixing HMC and something like Gibbs or random-walk Metropolis). That includes all the ones in the manual, like the CJS model in ecology, the Dawid-Skene model in data coding, or the mining disaster change-point example. Some of these would take days to converge in BUGS but converged very quickly in Stan. Simple mixtures seem to work much better this way, too, but that’s not surprising because marginalizing out is exactly what the E step in EM does, and it works well for mixture modeling.

Thanks for the answer, Bob. I think that was mentioned in the case of Bayesian non-parametrics but I don’t have a reference or example.

Assuming you mean a Dirichlet process, then we can’t do that in Stan because the parameter vector size remains fixed. We can get close with a large-K approximation. I have basically no intuition about what would happen, but I have to imagine it’s going to depend a lot on the likelihood.