Skip to content

Whatever you’re looking for, it’s somewhere in the Stan documentation and you can just google for it.

Someone writes:

Do you have link to an example of Zero-inflated poisson and Zero-inflated negbin model using pure stan (not brms, nor rstanarm)? If yes, please share it with me!

I had a feeling there was something in the existing documentation already! So I googled *zero inflated Stan*, and . . . yup, it’s the first link:

We don’t generally recommend the Poisson model; as discussed in Regression and Other Stories, we prefer the negative binomial. So I’m not thrilled with this being the example in the user’s guide. But the code is simple enough that it wouldn’t take much to switch in the negative binomial instead. Really, the main challenge with the negative binomial is not the coding so much as the interpretation of the parameters, which is something we were struggling with in chapter 15 of Regression and Other Stories as well.

Anyway, the real message of this post is that the Stan documentation is amazing. Thanks, Bob (and everybody else who’s contributed to it)!


  1. Brent Hutto says:

    I don’t have the knowledge or the time (at the moment) to make a useful comment but I’d like to say there’s an interesting discussion possible around the general case where the distribution which is highly preferable for creating a valid model just happens to produce parameters that are nigh impossible to interpret meaningfully. For purely predictive models, it seems to me not a big issue. But for the other kind of model where you’re trying to make a meaningful description of the relationships within the data it’s a big deal. Is there often a trade-off decision to balance poor model fit (or threats to model validity) versus interpretability of the results?

    • What’s the problem for interpretability? The inflation/hurdle models just add extra probability for zero values and the negative binomial adds an independent control for variance > mean (but can’t handle underdispersion where variance < mean).

      • Andrew says:


        Negative binomial can be interpretable, but (a) you have to use the alternative parameterization for the negative binomial, not the standard parameterization, and (b) even with the alternative parameterization, the extra parameter is not so easy to interpret. In section 15.2 of Regression and Other Stories, we define phi as a “reciprocal dispersion parameter” so that sd(y|x) = sqrt(E(y|x) + E(y|x)^2/phi: “In this parameterization, the parameter phi is restricted to be positive, with lower values corresponding to more overdispersion, and the limit phi -> infinity representing the Poisson model (that is, zero overdispersion).” Our description has the virtue of being unambiguous, but it’s still not so easy to interpret phi.

        • What do we want to interpret? The distribution? Its parameters? Its priors?

          Any parameterization of negative binomial will let you define expectation and standard deviation as derived quantities.

          How about using a mean/variance parameterization? All we require is that variance >= sqrt(mean)? That’s super easy to interpret, but it pushes us to a joint prior in order to be able to normalize w.r.t. the constraint. So what do we do? Well, we could take scale to just be psi * sqrt(mean) for psi >= 1. Or we could take scale to be sqrt(mean) + psi * sqrt(mean) for psi >= 0.

          Andrew took the approach that’s popular in the regression literature, which does something similar, only with variance rather than scale. That is, variance is parameterized in terms the mean and parameter psi >= 0 as variance = mu + psi * m, making the scale sd = sqrt(mean + psi * mean). This is the most popular approach in the regression literature, but I don’t personally find variance very easy to interpret. Also for reasons that escape me, Andrew parameterized in terms of phi = 1 / psi, which I would think would be more challenging for prior formulation. What priors did you use for psi, Andrew?

  2. Alain says:

    You don’t recommend the Poisson model for count data because count data almost always have more variance than what is predicted by the Poisson model. I think I get it, but… in epidemiology, it is common to use Poisson regression for binary outcomes to compute rates or risk ratio. Since you have also written that binary variables can’t be over-dispersed, is it fine to use the Poisson regression for binary outcomes?

    • You could create a zero-inflated negative binomial in the same way you create a zero-inflated Poisson. Or you can alternatively rethink the way negative binomial is coded. That is,

      y ~ neg_binomial(alpha, beta)

      just marginalizes out the lambda in

      lambda ~ gamma(alpha, beta)
      y ~ poisson(lambda)

      So an alternative for over dispersed regression is to just add an overdispersion term epsilon[n],

      y[n] ~ poisson(x[n] * beta + epsilon[n])

      Then you can give the overdispersion parameter a hierarchical prior,

      epsilon[n] ~ normal(0, sigma[group[n]])

      You could probably do something like this using the negative binomial, but I find the parameterization trickier in the same way I find writing hierarchical beta models tricky.

    • Andrew says:


      If you have binary outcomes and you’re using the log model, then I guess the rates are very close to zero, so Poisson is the same as logistic.

  3. You’re welcome. I mainly used the doc as an excuse to learn all these models. So thanks for all the feedback as I was writing a lot of it.

    I hope to spend a month or so this year rewriting the whole thing to bring it up to our current coding practices. And maybe write a shorter getting started with Stan modeling document that tries to convey the overall way things work more succinctly.

    • I’m not thrilled with this being the example in the user’s guide.

      The joy of open source is that you can submit a revision as a pull request. It’s all just R markdown.

      For everyone else, Andrew and I have an ongoing debate over whether to write simple hello-world-type computer scientist examples or complicated narrative what-if statistician-type examples. I opted for the former. Andrew’s books opt for the latter.

  4. somebody says:

    > Zero inflation does not work for continuous models

    We can’t just model zeros separately from non-zeros and treat the continuous model as a separae likelihood on some finite probability p?

    model {
        p_zero ~ beta(prior_zero, prior_nonzero)
        n_zero ~ binomial(N, p_zero)
        mu ~ normal(mu_prior_mu, mu_prior_sd)
        sigma ~ log_normal(log_sigma_prior_mu, log_sigma_prior_sd)
        non_zeros ~ normal(mu, sigma)

    I feel like this should be okay since with any continuous model the probability of getting an exact real number is zero, so you’ll almost surely not get any observations where `is_zero = False` but the value is zero anyways by chance from the continuous distribution.

    [editor: code escaped]

    • Technically, if you zero-inflate a continuous distribution, it’s no longer continuous. But that doesn’t mean you can’t do it. The result is just a mixed distribution.

      Here’s the typical “spike and slab” formulation using lambda in (0, 1) as the mixing rate, which is also the zero-inflation rate.

      p(x) = lambda * bernoulli(x | 0) + (1 - lambda) * normal(x | mu, sigma)

      In Stan, you can code this mixture directly,

      target += log_mix(lambda, bernoulli_lpdf(x | 0), normal_lpdf(x | mu, sigma));

      Or you can code it in a way that is more efficient and reflects the inflation.

      target += log1m(lambda) + normal_lpdf(x | mu, sigma);
      if (x == 0) 
        target += log(lambda);

      P.S. I added <pre></pre> tags around your code—Wordpress doesn’t do markdown.

      • somebody says:

        I see why my formulation doesn’t make sense, but I’m having trouble wrapping my mind around the likelihood of a mixed distribution, maybe because I’ve never taken measure theory. I think the density should be

        lambda * dirac_delta(x) + (1 – lambda) f_theta(x)

        where parameters are (lambda, theta). I’m having trouble going from this density to the log likelihood

        log(1 – lambda) + log f_theta(x) + x == 0 ? log(lambda) : 0

      • somebody says:

        I’ve rethought my question. Okay, so a dominating measure should be

        mu(A) = Indicator(1 in A) + Lebesgue(A)

        Integrating your density:

        Integral lambda Indicator(x == 0) + (1 – lambda) Normal(x) d mu

        over the set {0} gives lambda + (1 – lambda) Normal(0), and integrating over the set R – {0} gives 1, so we end up with a probability greater than 1?

        • somebody says:

          “integrating over the set R – {0} gives 1”

          Sorry, I mean gives (1 – lambda)

          Integral over {0} + Integral over R – {0} = lambda + (1 – lambda) Normal(0) + (1 – lambda) = 1 + (1 – lambda) Normal(0) > 1?

      • somebody says:

        Sorry to blow you up with a gazillion comments, but this thread in discourse seems to suggest something different

  5. I agree, the Stan documentation is amazing. I googled “Stan Jacobian” today and landed on this gem of a quote:

    “A transformation samples a parameter, then transforms it, whereas a change of variables transforms a parameter, then samples it. Only the latter requires a Jacobian adjustment.”

    Exactly what I needed at exactly the right time :-) Thanks Stan team!!!

Leave a Reply

Where can you find the best CBD products? CBD gummies made with vegan ingredients and CBD oils that are lab tested and 100% organic? Click here.