Skip to content

How to “cut” using Stan, if you must

Frederic Bois writes:

We had talked at some point about cutting inference in Stan (that is, for example, calibrating PK parameters in a PK/PD [pharmacokinetic/pharmacodynamic] model with PK data, then calibrating the PD parameters, with fixed, non updated, distributions for the PK parameters). Has that been implemented?

(PK is pharmacokinetic and PD is pharmacodynamic.)

I replied:

This topic has come up before, and I don’t think this “cut” is a good idea. If you want to implement it, I have some ideas of how to do it—basically, you’d first fit model 1 and get posterior simulations, then approx those simulations by a mixture of multivariate normal or t distributions, then use that as a prior for model 2. But I think a better idea is to fit both models at once while allowing the PK parameters to vary, following the general principles of partial pooling as in this paper with Sebastian Weber.

It would be fun to discuss this in the context of a particular example.

I also cc-ed Aki, who wrote:

Instead of parametric distribution a non-parametric would be possible with the approach presented here by Diego Mesquita, Paul Blomstedt, and Samuel Kaski. This could be also used so that PK and PD are fitted separately, but combined afterwards (this is a bit like 1-step EP with a non-parametric distribution).

So, lots of possibilities here.


  1. Sean S says:

    Interesting that sometimes when you give someone a more powerful tool, they don’t initially see the opportunity to do something in a new way, but they would rather hamstring the tool so it behaves the same as what is familiar.

    There’s probably some analogy in there to wanting everything to be a hammer.

    OTOH – I think we know that part of getting published requires passing a peer review process that requires ‘only hammers allowed.’

    • Pharamcometricians have two models, a pharmacokinetic (PK) model and a pharmacodynamic (PD) model. The PK model is about how the body acts on the drug (usually how it diffuses and is cleared) and the PD model is about how the drug acts on the body (often a disease outcome). It’s easy to formulate the joint model and in a lot of cases you can fit it successfully in BUGS or in Stan. The problem is that the PD model is much less well specified than the PK model. The posterior for the joint model distorts the well calibrated posterior from the PK model.

      BUGS developed cut not as a shortcut, but in order to prevent the PD model’s misspecification from polluting the PK model while still allowing some uncertainty in the PK model.

      • Rather than doing a cut to handle this, it makes sense to understand that model error is also part of Bayes and quantified in the same way as measurement error or random variation. Errors aren’t mean zero? add a bias, errors are skewed? use a skewed distribution. Errors have time structure? Add a function of time… errors have different sizes under different circumstances? make the sd a function of circumstance…

        if your model error is large then the vagueness of the prediction means the PD model has limited effect on the PK fit.

  2. Ben says:

    > I don’t think this “cut” is a good idea. If you want to implement it, I have some ideas of how to do it

    I don’t think this is quite the right way to talk about this.

    We cut in all models, at some level. No way we have good, probabilistic models all the way down to the sensor level. At some point when we take measurements, we write down numbers — those are cuts. We’re taking something we don’t know 100% precisely and replacing it with a point. So cutting in models is something we should talk about, and it should factor into our data analysis — it’s just not a feature in Stan (and I don’t think it should be, though I have definitely dug around searching for ways to do this haha).

    So here Frederic is hitting a cut question. However Frederic got here, (and he could either start with the full model and add the cut, or start with the cut model and wonder about putting it all together), I think the correct way to answer this is simulated data.

    In either case Frederic can probably simulate from the complete model (even if he can’t do inferences there). Then he can do inferences with the simple model and see how biased he is. In the end this might be hard to characterize or quantify the biases or whatnot — but it’s better than just assuming.

    And the question of whether the normal approximations like you suggest are a good idea — do the same thing. This goes for any approximation.

    The idea that some data should influence some things and not the other (in the link) — I don’t like that. I think we should be always thinking of the full model as a reference (where Bayes rule does what it wants with the data), and then any tricks are just approximations of that in some way. And you can experiment with fitting approximations to full models that exceed your ability to do inference by taking advantage of simulated data (cuz it’s easier to generate fake data that fit a model).

    • Andrew says:


      Good point. Your comment falls into the general category of, “Everybody’s doing X, and we’re better off admitting it and then conducting the process formally.”

      For example:
      – Everyone’s doing partial pooling (deciding how much to use inferences from experiment A to make decisions about condition B), so let’s admit it and do it formally.
      – Everyone’s doing drugs (if it’s not alcohol and tobacco, it’s caffeine; if it’s not caffeine, it’s endorphins etc.), so let’s admit it and act accordingly.
      – Everyone is disabled at times so let’s admit at and accommodate it rather than presenting disability doesn’t exist.
      – Everyone’s fitting multiple models, so let’s do some model averaging and post-selection inference rather than pretending that the multiverse doesn’t exist.
      – Everybody has non-representative samples, so let’s admit it and do MRP rather than playing games with nonexistent sampling probabilities.
      And, yours:
      – Everybody’s doing “cut,” so let’s admit it and do it well.

    • > taking advantage of simulated data (cuz it’s easier to generate fake data that fit a model).
      + Infinity
      (in the sense that repeatedly pointing it out could always help)

  3. Nick Menzies says:

    This post also relevant:

    I think the issue of model misspecification is the key one. The advice to just build the correctly-specified model and fit it to all the data can be difficult to implement in some situations. In these situations I am not sure that joint inference is the clear winner.

    • Exactly. I think the previous posts are missing the point of what’s going on in the pharma models. The results of joint inference don’t make sense, which is why they came up with cut in the first place.

      I think the real question is whether a reasonable joint model can be formulated that outperforms these cut-based hacks.

  4. Eric Novik says:

    We (Arya Pourzanjani, Daniel Lee, and I) have done this kind of a model and were able to fit everything jointly as Andrew suggested. If remember correctly, Arya solved the PK system analytically and then we had a 4-state ODE for PD. Happy to share some of the code if you would find it useful. (can’t share the data, unfortunately)

  5. Your suggested first solution to approximate the cut distribution (“basically, you’d first fit model 1 and get posterior simulations, then approx those simulations by a mixture of multivariate normal or t distributions, then use that as a prior for model 2”) would not approximate the cut distribution, if I understand it correctly. This would approximate the standard posterior distribution, with an error coming from a mixture being an imperfect representation of the first posterior. It would still be standard Bayesian inference since the parameter on which you put a prior gets updated in the second stage.

    The point of the “cut” is that some marginal distributions would not get updated. A long-and-dirty way (but conceptually simple) of getting an approximation of the cut, is as follows: perform the ‘second stage’ posterior simulation, conditional on many independent draws from the ‘first stage’ posterior, and finally putt all the draws together.

    With a bit of notation, if p(theta1 | Y1) is the first posterior, and p(theta2|Y2,theta1) is the second stage posterior, the cut distribution is (in my understanding) either the joint distribution on (theta1, theta2) with density p(theta1 | Y1) p(theta2|Y2,theta1), or refers to the second marginal (on theta2) of that distribution. On the other hand, if you use the first posterior as a prior for the second stage, you’re doing standard Bayesian inference leading to a joint distribution with density p(theta1 | Y1, Y2) p(theta2|Y2,theta1), up to various approximation errors.

    Interested readers might want to consult the very few articles that are specifically on the topic of cut distribution, e.g. Plummer’s (, Liu, Bayarri & Berger’s ( or (Jacob, Murray, Holmes and Robert Another work with O’Leary and Atchadé describes (Section 5.5, another approach to approximate the cut distribution, which is a bit more convenient than the ‘long-and-dirty’ approach described above.

Leave a Reply