How to “cut” using Stan, if you must

Frederic Bois writes:

We had talked at some point about cutting inference in Stan (that is, for example, calibrating PK parameters in a PK/PD [pharmacokinetic/pharmacodynamic] model with PK data, then calibrating the PD parameters with fixed, non-updated distributions for the PK parameters). Has that been implemented?

(PK is pharmacokinetic and PD is pharmacodynamic.)

I replied:

This topic has come up before, and I don’t think this “cut” is a good idea. If you want to implement it, I have some ideas of how to do it: basically, you’d first fit model 1 and get posterior simulations, then approximate those simulations by a mixture of multivariate normal or t distributions, then use that as a prior for model 2. But I think a better idea is to fit both models at once while allowing the PK parameters to vary, following the general principles of partial pooling as in this paper with Sebastian Weber.

It would be fun to discuss this in the context of a particular example.
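
For concreteness, here is a rough sketch of what that two-stage recipe could look like in rstan, using a single multivariate normal in place of the mixture. Everything here is placeholder scaffolding: ‘pk_model.stan’, ‘pd_model.stan’, pk_data, pd_data, and the parameter name theta_pk are hypothetical, and pd_model.stan is assumed to put a multi_normal(mu_pk, Sigma_pk) prior on the PK parameters.

library(rstan)

# Stage 1: fit the PK model alone and keep posterior draws of its parameters
fit_pk <- stan(file = 'pk_model.stan', data = pk_data)  # hypothetical model and data
draws  <- as.matrix(fit_pk, pars = 'theta_pk')

# Approximate the PK posterior with a single multivariate normal
mu_pk    <- colMeans(draws)
Sigma_pk <- cov(draws)

# Stage 2: fit the PD model, passing the approximation in as a prior.
# Inside the hypothetical pd_model.stan, the prior would read:
#   theta_pk ~ multi_normal(mu_pk, Sigma_pk);
fit_pd <- stan(file = 'pd_model.stan',
               data = c(pd_data, list(mu_pk = mu_pk, Sigma_pk = Sigma_pk)))

As Pierre notes in the comments below, this two-stage procedure targets the ordinary posterior (up to approximation error), not the cut: the PK parameters still get updated by the PD data in stage 2.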

I also cc-ed Aki, who wrote:

Instead of a parametric distribution, a non-parametric one would be possible with the approach presented here by Diego Mesquita, Paul Blomstedt, and Samuel Kaski. This could also be used so that PK and PD are fitted separately but combined afterwards (this is a bit like 1-step EP with a non-parametric distribution).

So, lots of possibilities here.

P.S. See Pierre Jacob’s comment below for some reasonable next steps.

15 thoughts on “How to “cut” using Stan, if you must”

  1. Interesting that sometimes when you give someone a more powerful tool, they don’t initially see the opportunity to do something in a new way; instead they’d rather hamstring the tool so it behaves the same as what is familiar.

    There’s probably some analogy in there to wanting everything to be a hammer.

    OTOH – I think we know that part of getting published requires passing a peer review process that requires ‘only hammers allowed.’

    • Pharmacometricians have two models, a pharmacokinetic (PK) model and a pharmacodynamic (PD) model. The PK model is about how the body acts on the drug (usually how it diffuses and is cleared) and the PD model is about how the drug acts on the body (often a disease outcome). It’s easy to formulate the joint model, and in a lot of cases you can fit it successfully in BUGS or in Stan. The problem is that the PD model is much less well specified than the PK model. The posterior for the joint model distorts the well-calibrated posterior from the PK model.

      BUGS developed cut not as a shortcut, but in order to prevent the PD model’s misspecification from polluting the PK model while still allowing some uncertainty in the PK model.

      • Rather than doing a cut to handle this, it makes sense to understand that model error is also part of Bayes and is quantified in the same way as measurement error or random variation. Errors aren’t mean zero? Add a bias. Errors are skewed? Use a skewed distribution. Errors have time structure? Add a function of time. Errors have different sizes under different circumstances? Make the sd a function of circumstance.

        If your model error is large, then the vagueness of the prediction means the PD model has limited effect on the PK fit.
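
        For instance (a hypothetical Stan fragment, not from any of the models discussed here), an observation model with a bias term and an error sd that depends on a circumstance covariate x:

        //error_model.stan (illustrative only)
        data {
          int N;
          vector[N] y;   // observations
          vector[N] mu;  // model predictions
          vector[N] x;   // covariate indexing "circumstance"
        }
        parameters {
          real bias;     // errors need not be mean zero
          real a;        // log error sd when x = 0
          real b;        // lets the error sd vary with circumstance
        }
        model {
          y ~ normal(mu + bias, exp(a + b * x));
        }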

  2. > I don’t think this “cut” is a good idea. If you want to implement it, I have some ideas of how to do it

    I don’t think this is quite the right way to talk about this.

    We cut in all models, at some level. No way do we have good probabilistic models all the way down to the sensor level. At some point when we take measurements, we write down numbers — those are cuts. We’re taking something we don’t know 100% precisely and replacing it with a point. So cutting in models is something we should talk about, and it should factor into our data analysis — it’s just not a feature in Stan (and I don’t think it should be, though I have definitely dug around searching for ways to do this haha).

    So here Frederic is hitting a cut question. However Frederic got here (he could either start with the full model and add the cut, or start with the cut model and wonder about putting it all together), I think the correct way to answer this is with simulated data.

    In either case Frederic can probably simulate from the complete model (even if he can’t do inference there). Then he can do inference with the simple model and see how biased he is. In the end it might be hard to characterize or quantify the biases — but it’s better than just assuming.
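
    For instance, here’s a toy version of that check (my own sketch, nothing to do with PK/PD): simulate from a “full” model in which y2 depends nonlinearly on theta1, fit the shortcut that plugs in a point estimate of theta1, and measure the bias in theta2:

    set.seed(1)
    theta1 <- 0.5; theta2 <- -1                  # truths used to simulate
    bias <- replicate(10000, {
      y1 <- rnorm(10, theta1, 1)                 # small data set informing theta1
      y2 <- rnorm(100, exp(theta1) + theta2, 1)  # full model: nonlinear in theta1
      (mean(y2) - exp(mean(y1))) - theta2        # plug-in estimate's error for theta2
    })
    mean(bias)  # around -0.08: the point-estimate shortcut biases the theta2 fit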

    And for the question of whether normal approximations like you suggest are a good idea: do the same thing. This goes for any approximation.

    The idea that some data should influence some things and not others (in the link) — I don’t like that. I think we should always be thinking of the full model as a reference (where Bayes’ rule does what it wants with the data), and then any tricks are just approximations of that in some way. And you can experiment with fitting approximations to full models that exceed your ability to do inference by taking advantage of simulated data (cuz it’s easier to generate fake data that fit a model).

    • Ben:

      Good point. Your comment falls into the general category of, “Everybody’s doing X, and we’re better off admitting it and then conducting the process formally.”

      For example:
      – Everyone’s doing partial pooling (deciding how much to use inferences from experiment A to make decisions about condition B), so let’s admit it and do it formally.
      – Everyone’s doing drugs (if it’s not alcohol and tobacco, it’s caffeine; if it’s not caffeine, it’s endorphins etc.), so let’s admit it and act accordingly.
      – Everyone is disabled at times, so let’s admit it and accommodate it rather than pretending that disability doesn’t exist.
      – Everyone’s fitting multiple models, so let’s do some model averaging and post-selection inference rather than pretending that the multiverse doesn’t exist.
      – Everybody has non-representative samples, so let’s admit it and do MRP rather than playing games with nonexistent sampling probabilities.
      And, yours:
      – Everybody’s doing “cut,” so let’s admit it and do it well.

  3. This post is also relevant:

    https://statmodeling.stat.columbia.edu/2018/07/09/joint-inference-modular-inference-pierre-jacob-lawrence-murray-chris-holmes-christian-robert-discuss-conditions-strength-weaknesses-choices/

    I think the issue of model misspecification is the key one. The advice to just build the correctly-specified model and fit it to all the data can be difficult to implement in some situations, and in those situations I am not sure that joint inference is the clear winner.

    • Exactly. I think the previous posts are missing the point of what’s going on in the pharma models. The results of joint inference don’t make sense, which is why they came up with cut in the first place.

      I think the real question is whether a reasonable joint model can be formulated that outperforms these cut-based hacks.

  4. We (Arya Pourzanjani, Daniel Lee, and I) have done this kind of model and were able to fit everything jointly as Andrew suggested. If I remember correctly, Arya solved the PK system analytically and then we had a 4-state ODE for the PD. Happy to share some of the code if you would find it useful. (Can’t share the data, unfortunately.)

  5. Your suggested first solution to approximate the cut distribution (“basically, you’d first fit model 1 and get posterior simulations, then approx those simulations by a mixture of multivariate normal or t distributions, then use that as a prior for model 2”) would not approximate the cut distribution, if I understand it correctly. This would approximate the standard posterior distribution, with an error coming from a mixture being an imperfect representation of the first posterior. It would still be standard Bayesian inference since the parameter on which you put a prior gets updated in the second stage.

    The point of the “cut” is that some marginal distributions would not get updated. A long-and-dirty (but conceptually simple) way of getting an approximation of the cut is as follows: perform the ‘second stage’ posterior simulation conditional on many independent draws from the ‘first stage’ posterior, and finally put all the draws together.

    With a bit of notation, if p(theta1 | Y1) is the first posterior, and p(theta2|Y2,theta1) is the second stage posterior, the cut distribution is (in my understanding) either the joint distribution on (theta1, theta2) with density p(theta1 | Y1) p(theta2|Y2,theta1), or refers to the second marginal (on theta2) of that distribution. On the other hand, if you use the first posterior as a prior for the second stage, you’re doing standard Bayesian inference leading to a joint distribution with density p(theta1 | Y1, Y2) p(theta2|Y2,theta1), up to various approximation errors.

    Interested readers might want to consult the very few articles that are specifically on the topic of the cut distribution, e.g. Plummer’s (https://link.springer.com/article/10.1007/s11222-014-9503-z), Liu, Bayarri & Berger’s (https://projecteuclid.org/euclid.ba/1340370392), or Jacob, Murray, Holmes & Robert’s (https://arxiv.org/abs/1708.08719). Another work, with O’Leary and Atchadé, describes another approach to approximating the cut distribution (Section 5.5, https://arxiv.org/abs/1708.03625), which is a bit more convenient than the ‘long-and-dirty’ approach described above.

    • Hi everyone, just came across this post! I am a university student and am working on one of your examples from your paper “Better together? Statistical learning in models made of modules” for my thesis. I am attempting to code the cut posterior of the biased-data example in Stan. So far I have this and it doesn’t seem to work!

      library(StanHeaders)
      library(Rcpp)
      library(ggplot2)
      library(rstan)
      library(withr)
      options(mc.cores = parallel::detectCores())
      rstan_options(auto_write = TRUE)

      # simulate the two data sets: Y1 informs theta1, Y2 informs theta1 + theta2
      N1 <- 100
      N2 <- 1000
      Y1 <- rnorm(N1, mean = 0, sd = 1)
      Y2 <- rnorm(N2, mean = 1, sd = 1)
      lambda1 <- 1    # prior precision of theta1
      lambda2 <- 100  # prior precision of theta2

      model_data1 <- list(N1 = N1, Y1 = Y1, lambda1 = lambda1)
      fit2 <- stan(file = 'model1.stan', data = model_data1)

      //model1.stan
      data {
        int N1;
        vector[N1] Y1;
        real lambda1;
      }

      parameters {
        real theta1;
      }

      model {
        Y1 ~ normal(theta1, 1);
        // lambda1 is a precision; normal() takes an sd, so use 1/sqrt(lambda1)
        theta1 ~ normal(0, inv_sqrt(lambda1));
      }

      my_samples2 <- extract(fit2)
      theta1samples <- my_samples2$theta1[2001:2100]  # keep 100 posterior draws of theta1
      model_data4 <- list(N1 = N1, N2 = N2, Y2 = Y2, theta1samples = theta1samples, lambda2 = lambda2)
      fit5 <- stan(file = 'cut1ci.stan', data = model_data4)

      //cut1ci.stan
      data {
        int N1;
        int N2;
        vector[N2] Y2;
        real lambda2;
        vector[N1] theta1samples;
      }

      parameters {
        real theta2;
      }

      model {
        // pairs each of the first N1 observations of Y2 with one theta1 draw
        for (i in 1:N1)
          Y2[i] ~ normal(theta1samples[i] + theta2, 1);
        theta2 ~ normal(0, inv_sqrt(lambda2));
      }

      my_samples4 <- extract(fit5)

      Then, if both sets of draws are correct, how do I put them together?

      Thank you for your help!

      • I am glad to see some people (especially students!) interested in this topic!

        First, here are some recent references on “Bayesian inference and cuts”
        https://arxiv.org/abs/2003.06804
        https://arxiv.org/abs/2005.07342
        https://arxiv.org/abs/2009.00217
        Exciting stuff, at least some people are taking this question seriously.

        It’s difficult to sample from the cut distribution. Good luck! Personally I’d strongly recommend coding everything yourself in R or python so that you can really understand what’s going on.

        Next, I am no Stan expert but I would code the second model in such a way that it takes one “theta1” as input, and then I would run Stan for the second model separately N1 times, each time with a different theta1. This way, your Stan models are each very standard, and you can distribute the executions across machines. You can take the post-burnin theta2s from each of these runs and just put them together (without weighting them in any special way) to obtain a sample from the cut distribution. That’s what I had in mind in the comment above. Of course it’s very brute-force and the choice of (many) burn-in periods is difficult.
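
        In rstan that recipe might look roughly like this (a sketch, reusing the variables from the code above; ‘model2_given_theta1.stan’ is a hypothetical variant of cut1ci.stan that declares a single real theta1 in its data block instead of the vector of draws):

        # compile once, then run the second model once per first-stage draw
        model2 <- stan_model(file = 'model2_given_theta1.stan')
        theta2_draws <- unlist(lapply(theta1samples, function(t1) {
          fit <- sampling(model2,
                          data = list(N2 = N2, Y2 = Y2, lambda2 = lambda2, theta1 = t1),
                          chains = 1, iter = 2000, refresh = 0)
          extract(fit)$theta2  # post-warmup theta2 draws, given this theta1
        }))
        # pooling these, unweighted, approximates the cut marginal of theta2;
        # keeping them grouped by theta1 gives draws of the joint (theta1, theta2)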

        For this particular example (due to Liu, Bayarri & Berger, by the way), you can work out these distributions analytically and sample from them directly. Some relevant R code is here: https://github.com/pierrejacob/bettertogether/tree/master/inst/reproduce
        That might help you to check that your code is doing what it’s supposed to.
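
        Concretely, for this conjugate case exact cut draws take only a couple of lines of R (a sketch, assuming the prior precisions lambda1 and lambda2 and the unit likelihood sd from the code above):

        # stage 1: theta1 | Y1 is N(sum(Y1)/(lambda1 + N1), 1/(lambda1 + N1)); Y2 never enters
        theta1_cut <- rnorm(4000, sum(Y1) / (lambda1 + N1), sqrt(1 / (lambda1 + N1)))
        # stage 2: theta2 | Y2, theta1 is N((sum(Y2) - N2*theta1)/(lambda2 + N2), 1/(lambda2 + N2))
        theta2_cut <- rnorm(4000, (sum(Y2) - N2 * theta1_cut) / (lambda2 + N2),
                            sqrt(1 / (lambda2 + N2)))
        # each (theta1_cut[i], theta2_cut[i]) pair is a draw from the cut distribution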

        Let me reiterate that, in my understanding, the solution proposed by Andrew Gelman in the above post does not work. Although it can be coded up in Stan and executed in two successive steps, it does not actually approximate the cut distribution but rather the standard posterior distribution. Frederic Bois’s question about the cut (“Has that been implemented?”) underestimates the difficulty of implementing such an approach with standard MCMC tools; see Martyn Plummer’s paper on the topic. Approximating the cut is still mostly an open question.

        • Thank you for your reply, Prof. Pierre; it is good to hear from you! Thank you for all the feedback and for the resources you have provided! My supervising professor wants me to do this distribution in Stan, even though it seems it would probably be easier to do it in R/Python/JAGS/BUGS. I will look into your suggestion!
