ProbProg 2021 is Oct 20–22 and registration is open

ProbProg 2021, the probabilistic programming conference, is coming up in two weeks, and registration is only US$25. It’s entirely online again this year and runs from 10:30 am Wednesday, 20 October, to 2:30 pm Friday, 22 October, EDT (Eastern Daylight Time, aka Boston time).

Here’s the schedule of talks and posters. On Friday at noon, I’ll be talking about Bayesian workflow and what probabilistic programming languages need in order to support it. I’ll survey a range of PPLs, all of which, I conclude, lack key features required for seamless workflow support. Perhaps surprisingly, I conclude that BUGS is the best of a bad bunch for workflow, at least for problems that can be expressed in its language and for which Gibbs sampling is effective. Most other systems, including Stan, PyMC3, Pyro, and all the ML toolkits, require rewriting models for different uses like simulation-based calibration, posterior predictive checks, and cross-validation. Part III of the Stan User’s Guide shows how to code these model variants in Stan. I’ll try to squeeze in some details of how we hope to extend Stan to better support workflow.
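
To give a flavor of the kind of rewriting involved, here’s a toy example (purely illustrative, not from the talk): a simple regression with a generated quantities block bolted on just to produce posterior predictive replications. Simulation-based calibration, prior predictive simulation, and cross-validation would each need yet another variant of the same program.

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  // replicated data for posterior predictive checks; SBC, prior
  // predictive simulation, and cross-validation each need yet
  // another variant of this program
  real y_rep[N] = normal_rng(alpha + beta * x, sigma);
}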

Hope to see you there!

7 thoughts on “ProbProg 2021 is Oct 20–22 and registration is open”

  1. Bob:

    That sounds like a cool conference. With regard to your Bugs comment: I see what you’re saying, but . . . an important part of workflow is being able to fit models flexibly and reliably. When using Bugs, the space of feasible models is restricted so much that I would consider this a restriction of workflow. To put it another way, even if the model you are currently fitting can be fit effectively in Bugs, your workflow is still limited because there are other nearby models that you can’t fit.

  2. Hey Bob, although you may not have time before the talk, eventually take a look at Turing.jl in Julia. You can start here: https://turing.ml/dev/docs/using-turing/guide

    Some interesting aspects to it. For example, you can ask it to sample from the prior without altering the model.

    chain = sample(model, Prior(), 1000)

    You can ask it to sample from the prior predictive by just giving it “missing” data. It can use Gibbs sampling to mix different samplers for different dimensions of the model (so, for example, do discrete sampling for discrete parameters and NUTS for continuous ones).

    I would say it’s a bit more sensitive to the use of “performance” tricks. For example, in a recent thread on discourse.julialang.org I found that sampling 300 independent normals using a loop was like a 2-hour sampling procedure, whereas with MvNormal it was 10 seconds… so you definitely need to understand https://turing.ml/dev/docs/using-turing/performancetips and read various threads… It doesn’t have a manual anywhere near as nice as Stan’s… But it’s a domain specific language embedded in what I think is the nicest extant programming language for numerical work ever invented. In general you can include whatever kind of Julia code you want in your model.

    • In fairness, I think all general-purpose samplers require knowledge of various “performance tricks” to really get the most out of them.

      In JAGS, you have to know when to take advantage of conjugacy and when not to. Sometimes it helps, but sometimes “breaking” conjugacy actually leads to faster sampling with less autocorrelation in hierarchical models.

      In Stan (and, it sounds like, Julia?), knowing how to vectorize is critical. In Stan, I also find anecdotally that to get the best performance, I need to do my best to get the base model as close as possible to multivariate normals and to use transformed parameters a lot (see the sketch at the end of this comment).

      In BUGS, well, maybe “performance” was never a word you could associate with BUGS. But as Bob says, this means there is no incentive to do any “tricks” so the code ends up being pretty much a one-to-one translation of the model.

      It strikes me that the “performance tricks” seem to be related to developers’ assumptions about what kinds of models they expect their software to be used for, which makes sense. BUGS/JAGS are built for relatively simple hierarchical models with conjugate distributions. I don’t know much about Julia, but Stan seems to have been built with large-scale hierarchical regression models in mind as the primary use case.

      As you say, though, the availability of an extensive manual for Stan is a big plus!
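
      Concretely, the transformed-parameters trick I have in mind is the standard non-centered parameterization, something like this (a generic sketch; the priors and data structure are just for illustration):

      data {
        int<lower=1> N;
        int<lower=1> J;
        int<lower=1, upper=J> group[N];
        vector[N] y;
      }
      parameters {
        real mu;
        real<lower=0> tau;
        vector[J] theta_raw;   // standard-normal base parameters
        real<lower=0> sigma;
      }
      transformed parameters {
        // shift and scale standard normals rather than sampling
        // theta ~ normal(mu, tau) directly
        vector[J] theta = mu + tau * theta_raw;
      }
      model {
        mu ~ normal(0, 5);
        tau ~ normal(0, 5);
        sigma ~ normal(0, 5);
        theta_raw ~ std_normal();
        y ~ normal(theta[group], sigma);
      }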

      • In general, Julia is very fast for any kind of code. “Vectorization” is not a thing in Julia, except apparently in the context of autodifferentiation using reverse-mode autodiff.

        I’ve had plenty of cases where Stan was failing to fit quickly, so it’s definitely not a Julia-only thing. I think if anything the reason I mention it is that Turing’s manual isn’t nearly as explicit as the Stan manual.

        Fortunately the Stan manual can easily be used to inspire Turing models.

        • Daniel:

          Indeed, one reason all of Stan is open source is that we’re happy for people to take whatever innovations we have and use them in other languages.

        • (Daniel Lakeland) although you may not have time before the talk, eventually take a look at Turing.jl in Julia

          I try to keep up on at least the basics of all these languages. Turing.jl is definitely in my survey.

          (Daniel Lakeland) “Vectorization” is not a thing in Julia, except apparently in the context of autodifferentiation using reverse-mode autodiff.

          This is the same as for Stan. It’s not the loops that are slow, like in Python or R. The reason vectorization matters is that if you just naively evaluate log normal(y[n] | mu, sigma) for multiple y[n] in a loop, you wind up calculating log(sigma) multiple times. The problem in doing this is compounded under autodiff, where the expression graph grows. Essentially, the reverse pass of reverse-mode autodiff is interpreted and there’s no way to get around it with the kind of autodiff design used in Stan or Julia.
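
          To make the contrast concrete, here are two Stan model blocks for the same likelihood (illustrative fragments only); the vectorized form computes log(sigma) once and keeps the expression graph small.

          // looped: normal_lpdf is evaluated N times, recomputing
          // log(sigma) and growing the autodiff graph every iteration
          model {
            for (n in 1:N)
              y[n] ~ normal(mu, sigma);
          }

          // vectorized: a single call shares log(sigma) across all N terms
          model {
            y ~ normal(mu, sigma);
          }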

          (Daniel Lakeland) For example, you can ask it to sample from the prior without altering the model.

          In Stan, we don’t assume a factorization of the log density into a prior and a likelihood, so this isn’t strictly possible by design. Given that Julia also doesn’t declare which piece of the density is likelihood and which is prior, this must mean something different than “prior” in the way statisticians think about it. For example, suppose I have a Bayesian model p(y | theta) * p(theta). Can Julia figure out that p(theta) is the prior and just return draws from that? Or does it do what BUGS would do and return samples from the joint density p(y, theta)? The latter doesn’t involve figuring out which piece of the density is actually the prior. There’s also the tricky business of marginalizing latent parameters to define a likelihood, such as for a mixture model. The mixture responsibilities are considered part of neither the prior nor the likelihood in this:


          mu ~ ...; sigma ~ ...;
          z[n] ~ bernoulli(theta);
          y[n] ~ normal(mu[z[n]], sigma[z[n]]);

          In this case, the prior is p(mu, sigma) and the likelihood is p(y | mu, sigma), with the zs only being part of what’s called the “full data likelihood” in frequentist circles, p(y, z | mu, sigma). So it’d be really amazing if Julia could actually evaluate the likelihood of such a model by marginalizing out z.
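
          For reference, here is roughly how that marginalization gets written by hand in Stan for a two-component normal mixture (a sketch in the spirit of the User’s Guide chapter; the particular priors are just for illustration):

          data {
            int<lower=0> N;
            vector[N] y;
          }
          parameters {
            real<lower=0, upper=1> theta;   // mixing proportion
            ordered[2] mu;                  // ordered to break label switching
            vector<lower=0>[2] sigma;
          }
          model {
            mu ~ normal(0, 10);
            sigma ~ exponential(1);
            for (n in 1:N)
              // sum z[n] out of p(y[n], z[n] | theta, mu, sigma)
              target += log_mix(theta,
                                normal_lpdf(y[n] | mu[1], sigma[1]),
                                normal_lpdf(y[n] | mu[2], sigma[2]));
          }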

          (Daniel Lakeland) You can ask it to sample from the prior predictive by just giving it “missing” data.

          This is a feature of BUGS and JAGS, but not of Stan. This is one of the reasons we’re interested in a blockless Stan along the lines of Maria Gorinova’s SlicStan. As long as we organize programs by blocks, we can’t fluently move a quantity between data and parameters. This was the main area in which I was expecting pushback on Stan’s design, but it turns out not a lot of people do this (though we would very much like to encourage more people to do it).
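
          The Stan workaround is a separate program that does all the simulation with _rng calls in generated quantities and is run with the fixed_param algorithm; here’s a minimal sketch for a toy normal model (purely illustrative):

          data {
            int<lower=0> N;
          }
          generated quantities {
            // prior and prior predictive draws; run with algorithm=fixed_param
            // since there is nothing for the sampler to do
            real mu = normal_rng(0, 1);
            real sigma = fabs(normal_rng(0, 1));   // half-normal prior
            real y_sim[N];
            for (n in 1:N)
              y_sim[n] = normal_rng(mu, sigma);
          }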

          (Daniel Lakeland) But it’s a domain specific language embedded in

          We intentionally went standalone so that Stan could be easily used in R, Python, etc. For better or worse, Julia’s still very much a niche language compared to the popularity of R and Python. Matt and I considered embedding in Python, but couldn’t see how to get the efficiency we needed on the back end without writing C++ anyway.

          (Daniel Lakeland) I would say it’s a bit more sensitive to the use of “performance” tricks

          (gec) In fairness, I think all general-purpose samplers require knowledge of various “performance tricks” to really get the most out of them.

          +1 to gec’s quote. Languages may try to pretend they automate performance, but at the bleeding edge, you need to know not only the language but also the properties of your hardware in order to optimize.

          To make things efficient in either BUGS or JAGS you have to be concerned with conjugacy. The main difference is the non-conjugate samplers they use and some specializations for GLMs in JAGS.

          (Andrew) Indeed, one reason all of Stan is open source is that we’re happy for people to take whatever innovations we have and use them in other languages.

          Indeed! In this case, it’s something both projects inherited from taped forms of reverse-mode autodiff. In Julia, though, the integration of autodiff into the language itself is deeper.

          (gec) BUGS/JAGS are built for relatively simple hierarchical models with conjugate distributions. I don’t know much about Julia, but Stan seems to have been built with large-scale hierarchical regression models in mind as the primary use case.

          BUGS was built for the more general non-conjugate case. It’s just not very efficient there. There’s nothing specific about regressions in either Stan or BUGS, but JAGS does some optimizations specifically for GLMs. One of the reasons we built the Stan language is that Matt and Daniel and I couldn’t figure out how to extend the lme4 language to the full set of models that Andrew and Jennifer discuss in their regression book. It was easier to work from a more general basis like BUGS (i.e., to exploit the inventor’s paradox).

          P.S. The Turing.jl examples are strange to me in that they use the same variance for the prior and sampling distribution (e.g., here and here).

  3. I’m reposting this discussion here because Daniel Lakeland has been declared a robot by the blog and can’t get past the reCAPTCHA system (whereas I just coded 18 images to help Google’s self-driving car business). Andrew says they’re working on it, and he’ll presumably post some news once they’ve figured out how to reduce spam without reCAPTCHA.

    Hi Bob! I wrote the following comment for the blog on the ProbProg post, but in the last few days, every time I try to post, it doesn’t show me the “I’m not a robot” box to click, and then when I submit it says I didn’t enter the reCAPTCHA value… so something’s broken on the blog for me…

    Hi Bob, thanks for your comments! I suspected you must know something about Turing.

    (Bob) Given that Julia also doesn’t declare which piece of the density is likelihood and which is prior, this must mean something different than “prior” in the way statisticians think about it.

    Perhaps. What happens is that when you write your Turing @model macro, it builds a function which takes data as input and constructs an object of a model type. The Turing “compiler” (which is a macro written in Julia that compiles to Julia code) understands whether a named variable is an input (observable data) or is created in the scope of the model (a parameter). So if you ask it to sample from the prior, it ignores the input data, and I think it must sample from the prior predictive distribution. I’m not sure if it just throws away the simulated data or gives it back to you as well. I’d have to try it.

    (Bob) It’s not the loops that are slow, like in Python or R. The reason vectorization matters is that if you just naively evaluate log normal(y[n] | mu, sigma) for multiple y[n] in a loop, you wind up calculating log(sigma) multiple times. The problem in doing this is compounded under autodiff, where the expression graph grows.

    Yes, it appears that loops are slow in Turing when you’re doing this together with ReverseDiff, for the reasons you mention… but not when using some other autodiff packages (there is ForwardDiff, which is forward mode, and Zygote, which does source-to-source rewriting). The reverse tape in Turing can be compiled: it uses the Memoization.jl package to build the tape and then compiles it after the first run through, I think; remember, unlike Stan, the compiler is right there *in* the language. However, even if you’re compiling the tape, it’s still a lot of computation, and it’s slow. So calling a multivariate function is definitely faster.

    (Bob) So it’d be really amazing if Julia could actually evaluate the likelihood of such a model by marginalizing out z.

    In your example, I don’t know where “theta” came from, but it seems better to discuss whether something is observed or not… everything that isn’t observed is a parameter, so if the z’s are unobserved, then they’re just parameters and you sample them. If you want to “marginalize them out,” you just look at the samples of y independently of what z was chosen.

    P.S. As a report from the field, I had to code another 9 images for palm trees to submit. So a total of 3 reCAPTCHAs to log in and to submit this.
