Bayesian Cognitive Modeling Examples Ported to Stan

There’s a new intro to Bayes in town: Bayesian Cognitive Modeling: A Practical Course, by Michael Lee and Eric-Jan Wagenmakers (Cambridge University Press).

This book’s a wonderful introduction to applied Bayesian modeling. But don’t take my word for it — you can download and read the first two parts of the book (hundreds of pages including the bibliography) for free from the book’s home page (linked in the citation above). One of my favorite parts of the book is the collection of interesting and instructive example models coded in BUGS and JAGS (also available from the home page). As a computer scientist, I prefer reading code to narrative!

In both spirit and form, the book’s similar to Lunn, Jackson, Best, Thomas, and Spiegelhalter’s BUGS Book, which wraps their seminal set of example models up in textbook form. It’s also similar in spirit to Kruschke’s Doing Bayesian Data Analysis, especially in its focus on applied cognitive psychology examples.

Bayesian Cognitive Modeling Examples Now in Stan!

One of Lee and Wagenmakers’s colleagues, Martin Šmíra, has been porting the example models to Stan, and the first batch is already available in the new Stan example model repository (hosted on GitHub).

Many of the models involve discrete parameters in their BUGS formulations, which need to be marginalized out in the Stan versions. The Stan 2.5 manual adds a whole new chapter with some non-trivial marginalizations (change-point models, CJS mark-recapture models, and categorical diagnostic accuracy models).
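
To give a flavor of the technique, here’s a minimal sketch of the basic pattern (a toy two-component mixture, not one of the ported models): a latent binary indicator z[n] with Pr(z[n] = 1) = theta is summed out of the joint density with log_sum_exp.

data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real<lower=0,upper=1> theta;   // Pr(z[n] = 1)
  real mu0;                      // component locations
  real mu1;
}
model {
  mu0 ~ normal(0, 10);
  mu1 ~ normal(0, 10);
  for (n in 1:N)                 // z[n] is summed out of p(y[n], z[n])
    increment_log_prob(log_sum_exp(
      log1m(theta) + normal_log(y[n], mu0, 1),
      log(theta) + normal_log(y[n], mu1, 1)));
}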

Expect the rest soon! And feel free to jump on the Stan users group to discuss the models and how they’ve been coded.

Warning: The models are embedded as strings in R code. We’re looking for a volunteer to pull the models out of the R code and generate data for them in a standalone file that could be used in PyStan or CmdStan.

Your Models Next?

If you’d like to contribute Stan models to our example repo, the README at the bottom of the front page of the GitHub repository linked above contains information on what we’d like to get. We only need open-source distribution rights — authors retain copyright for all their work on Stan. Contact us either via e-mail or via the Stan users group.

38 thoughts on “Bayesian Cognitive Modeling Examples Ported to Stan”

        • Part of the new manual chapter explains how to do inference for discrete $latex y$.

          In a nutshell, you compute $latex p(x,y)$ as a transformed parameter as part of the marginalization. If necessary, it can be normalized via Bayes’s rule to $latex p(y|x)$, allowing you to do better posterior inference for $latex y$ than if you had sampled it each iteration.

          If the expectations are too much of a pain to calculate, you can also sample $latex y$ from $latex p(y|x)$ in the generated quantities block using the categorical RNG function. Then inference proceeds as if Stan had discrete parameters.
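
          Continuing the toy mixture sketch from the post above, the generated quantities block might look something like this (binary case, so bernoulli_rng stands in for the categorical RNG; with more than two outcomes you’d normalize a vector and use categorical_rng):

          generated quantities {
            real pr_z1[N];               // Pr(z[n] = 1 | y, theta, mu) via Bayes's rule
            int<lower=0,upper=1> z[N];
            for (n in 1:N) {
              real lp0;
              real lp1;
              lp0 <- log1m(theta) + normal_log(y[n], mu0, 1);
              lp1 <- log(theta) + normal_log(y[n], mu1, 1);
              pr_z1[n] <- exp(lp1 - log_sum_exp(lp0, lp1));
              z[n] <- bernoulli_rng(pr_z1[n]);
            }
          }

          Stashing lp0 and lp1 in transformed parameters instead would avoid computing them twice.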

        • I’m looking forward to seeing an example here. I looked for the 2.5 manual on the Stan site, but it looks like it’s not released yet, and I can’t figure out how to get the unfinished version from Git either (do you guys only push to GitHub at release?). So it looks like I’ll have to wait.

          On the other hand, I’ll plop down a simple example here and maybe the discussion will generate some other blog or Stan interest:

          Suppose I have three kinds of events; the variable that describes the type of event will be called “y”, and it takes on the values {a,b,c}. Each individual event has an associated cost, D. The cost D we’ll assume is normally distributed, with different parameters for each possible outcome a, b, c.

          We have priors over the mean and stddev of costs for each category. Pretend these priors are moderately different, so that we don’t have a situation where the labels are more or less symmetric (for example, the mean cost of type a is somewhere near 1000, the mean cost of type b is somewhere near 200, and the mean cost of type c is mostly less than 100; there may also be differences in the stddev). Or we may have different distributions on the three types: say a and b are kind of normal, and c is maybe exponential or gamma distributed.

          We observe D for say 10 events, and we want to figure out as best we can what we know about the labels for the sequence of events.

          So I’d like to write:

          D ~ normal(mu(y),sigma(y));

          (or if the distributions are different I’d need something else??)

          and I’d have priors on mu(y) and sigma(y) for each possible value of y.

          and I might have a prior over the relative probability of each value of y, say p(a) = 0.2, p(b) = 0.2, p(c) = 0.6

          Ultimately I want a distribution over the label sequences for the 10 events, with the inference on the type of each event being the main thing we want to know.

          Now, we don’t have y as a discrete parameter in Stan, so I’d need to write down something like incrementing the log probability using the law of total probability:

          p(D) = p(D|a) p(a) + p(D|b) p(b) + p(D|c) p(c)

          where p(a), p(b), p(c) are my prior values for the labels and p(D|a), etc., are the distributions for the costs under each label

          Now, this is unconditional on the actual value of y, but what I want to know is: what were the y values likely to be in these 10 events?

          So, it sounds like you’re suggesting I’d generate y values randomly from the prior, and then calculate p(D) conditional on those random values?

        • We work in branches, as outlined on our developer process wiki page: https://github.com/stan-dev/stan/wiki/Developer-Process

          You’ll find the new manual in branch feature/issue-786-manual-2.5

          “So, it sounds like you’re suggesting I’d generate y values randomly from the prior, and then calculate p(D) conditional on those random values?”

          No — generate expectations of D conditional on the posterior distribution of the parameters (all conditioned on the data).

          Not everything can be easily marginalized. It’s easier when there is a latent discrete parameter associated with the data in some way. So if your D is really a D_i for each data item i, then it should be no problem (see the sketch just below). But if there’s something combinatoric, like a sequence of categorical outcomes, the problem is that you either need to do something clever with dynamic programming (like the forward-backward algorithm for HMMs, which lets you efficiently generate sufficient statistics) or you run into a combinatorial explosion doing the marginalization.
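
          For the per-item case, your cost example might look roughly like this (a sketch, untested; it assumes normal costs for all three labels, made-up prior scales, and the fixed label prior passed in as data):

          data {
            int<lower=1> N;
            real D[N];                 // observed cost per event
            simplex[3] p_label;        // prior label probabilities, e.g., (0.2, 0.2, 0.6)
          }
          parameters {
            real mu[3];                // cost location per label
            real<lower=0> sigma[3];    // cost scale per label
          }
          model {
            mu[1] ~ normal(1000, 300); // moderately different priors per label
            mu[2] ~ normal(200, 100);
            mu[3] ~ normal(50, 50);
            for (k in 1:3)
              sigma[k] ~ cauchy(0, 100);
            for (n in 1:N) {           // marginalize out the label of event n
              vector[3] lp;
              for (k in 1:3)
                lp[k] <- log(p_label[k]) + normal_log(D[n], mu[k], sigma[k]);
              increment_log_prob(log_sum_exp(lp));
            }
          }
          generated quantities {
            int label[N];              // posterior draws of each event's label
            for (n in 1:N) {
              vector[3] lp;
              for (k in 1:3)
                lp[k] <- log(p_label[k]) + normal_log(D[n], mu[k], sigma[k]);
              label[n] <- categorical_rng(softmax(lp));
            }
          }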

          One example that’s intractable is feature selection in a regression where you have a binary indicator of whether the feature’s included or not. There are 2^P possible states, so if there are a lot of predictors P it won’t work.

        • Thanks, I looked up the manual and I think your example problem makes things clearer.

          I know someone who’s interested in figuring out, from several time series with one observation per year for a couple hundred years, in which years some kind of duplication or censoring occurred. So potentially there are, say, 4 time series, each with 100 observations, and maybe 5% of these observations might actually be missing without our knowing it: instead of 1900, 1901, 1902, 1903, … you might have 1900, 1901, 1903, … in one time series and a different set of years in another. There is, then, one latent time series of which all the others are “copies with error”, but the errors occur not only as, say, normal variation from the overall series, but also as progressive time-mismatch due to the censoring/duplication.

          In such a model, there is conceptually a discrete parameter for each year in each time series, so let’s say 400 binary (or maybe trinary) parameters. Almost all of these are 0, but 5% of them or so might be 1 (missing) or -1 (duplicate). That seems intractable, because as you say 2^400 is a big number.

          Anyway, thanks for the help, and for the pointer to the manual.

        • If you add a latent parameter per year that’s continuous, like a probability of being censored, you might be able to make the same kind of inferences without discrete parameters.

          P.S. The Stan users group is probably a better place for this discussion. There are many more people to help with problems.

  1. I don’t get the cover. There must be some connection between blood-stained bodies crying out in agony as they fail to escape Hades and Bayesian modeling, but I can’t see what it is. One thing’s for sure though: I will never complain about puppies on the cover of Bayesian books again.

    • I believe the ominous red bodies are constructed from lego pieces… although I’m not totally sure what the significance of that is supposed to be. I guess it is something to do with building complex forms out of a lot of simple pieces, or something. What exactly they are grasping for is anyone’s guess.

      I guess reading this book on the bus would elicit a rather different reaction from passengers than would reading the puppy book.

  2. This is a really great book. It complements Lunn et al in that it talks about very specific modeling problems that would be of interest to psychologists (and psycholinguists). Also, it doesn’t expect you to have had a formal education in statistics. For beginners who have at least a passive knowledge of calculus, reading the Lynch book alongside this one will give a classical, statistician’s view of Bayesian modeling (Lynch) and its application in building process models in psychology. I read the Kindle edition, which was surprisingly easy to read.

    • Exactly. I should’ve emphasized those points above. Yet even without assuming much stat background, the modeling discussions are very solid mathematically and organized by model in a nice modular way. Andrew always says he tries to write books that don’t say anything that’s wrong, and that’s harder and harder to do the less math you can assume from the audience and the more you have to fudge definitions.

      I don’t know the Lynch book (Springer), but it’s a whopping $118.97 for the hardcover and $79.95 for the Kindle version on Amazon. Lee and Wagenmakers (Cambridge Uni Press), by contrast, is $40.50 paperback and $25.99 on the Kindle, whereas Gelman et al.’s BDA3 (Chapman and Hall/CRC) is $60.52 hardcover and $57.49 Kindle. I wrote a book with Cambridge back when I worked on programming language semantics, and I had a great experience. Given how liberal they are with allowing electronic versions and how reasonably priced they are and how they have relatively high production values, I don’t know why everyone doesn’t publish with Cambridge.

      • I agree about CUP, they are awesome. I also like CRC Press books, I think they might have been the first to figure out how to deal with statistics books on Kindles. Springer still has no clue. Springer definitely needs to start pricing their books more sensibly.

        Lynch’s book is apparently available online for free although I read it as a physical book. I read the Lee and Wagenmakers book on a Kindle (on a Mac) and it was nicely formatted for the computer screen.

  3. Is it possible to work with a model like this in STAN:

    (x/(1-x))^(k/(c+k)) - ((1-x)/x)^(c/(c+k)) =
    c*e0^(-k/(c+k))*s0^(-c/(c+k))*t - (e0/s0)^(c/(c+k)) + (c/k)*(s0/e0)^(k/(c+k))

    I cannot solve it for x in the general case (sorry, I don’t know the terminology for this), even using Mathematica 10. However, I can solve for x given any specific combination of k, c, e0, s0.

    For example in R I can do this:
    #Start Code
    c = 1; k = 1; s0 = 10; e0 = 100; t = 1:500
    z = c*e0^(-k/(c+k))*s0^(-c/(c+k))*t - (e0/s0)^(c/(c+k)) + (c/k)*(s0/e0)^(k/(c+k))

    p = matrix(nrow = length(z))
    for (i in 1:length(z)) {
      f <- function(x) (x/(1-x))^(k/(c+k)) - ((1-x)/x)^(c/(c+k)) - z[i]
      p[i] <- uniroot(f, interval = c(0, 1), tol = 10^-9)$root
    }

    plot(t, p)
    #End code

    Obviously I want to fit c,k,e0,s0 to some data.

    • You can code any model where you can write down the log density. Then you can fix some variables and either optimize or sample from the others.

      I had to look up uniroot, and it is, not surprisingly, a root finder. We don’t have root finders, but if you could express what you’re doing as a statistical model with a probability function, then you could code it in Stan.

      • Bob,

        I can do this in STAN for the special case of c = k, because in that case it is possible for me to solve directly for x (i.e., an equation with “x =” on one side). The problem with the general model above is that I cannot put it in a form with a single x on one side of the equation. Is this what you mean by “statistical model with a probability function”?

        • I’m afraid I’m not sure what you want to do. Stan only does two things: sampling and optimization. In both cases, this is from a probability density $latex p(\theta | y)$ where $latex \theta$ is the vector of unknowns (parameters) and $latex y$ a vector of known values (data). Usually the density $latex p(\theta | y)$ is the posterior of a Bayesian model expressed by a joint density $latex p(\theta, y)$, taking $latex p(\theta | y) \propto p(\theta, y)$ by Bayes’s rule. Stan only needs $latex p(\theta | y)$ up to a constant.

        • Bob,

          My situation is like the following code (it is the above model when c = k). I want to know if it is possible somehow to fit the general model above, where I cannot write out an explicit function for “x” (this is “p” in the STAN model).

          ######
          install.packages("msm")    # used for truncated normal dist when generating data
          install.packages("rstan")
          require(msm)
          require(rstan)

          #### Generate some data
          generate.data <- function(c = 1, k = 1, s0 = 10, e0 = 100, t = 1:500) {
            z = c*e0^(-k/(c+k))*s0^(-c/(c+k))*t - (e0/s0)^(c/(c+k)) + (c/k)*(s0/e0)^(k/(c+k))

            p = matrix(nrow = length(z))
            for (i in 1:length(z)) {
              f <- function(x) (x/(1-x))^(k/(c+k)) - ((1-x)/x)^(c/(c+k)) - z[i]
              p[i] <- uniroot(f, interval = c(0, 1), tol = 10^-9)$root
              p[i] <- rtnorm(n = 1, mean = p[i], sd = .05, lower = 0, upper = 1)
            }
            return(p)
          }

          subj1<-generate.data(c=1, k=1, s0=10, e0=100, t=1:500)
          subj2<-generate.data(c=.5,k=.5, s0=10, e0=100, t=1:500)
          subj3<-generate.data(c=.3,k=.3, s0=10, e0=100, t=1:500)
          subj4<-generate.data(c=.25,k=.25, s0=10, e0=100, t=1:500)

          dat<-rbind(subj1,subj2,subj3,subj4)
          dat<-cbind(rep(1:length(subj1),4),dat)
          dat<-cbind(rep(1:4,each=length(subj1)),dat)
          colnames(dat)<-c("Subj","Time","y")

          ####Put in the format STAN likes

          Ndata=nrow(dat)
          Nsubj=length(unique(dat[,"Subj"]))
          subj=dat[,"Subj"]
          y=dat[,"y"]
          t=dat[,"Time"]
          dat2<- list(Ndata=Ndata, Nsubj=Nsubj, subj=subj, y=y, t = t)

          ####Then fit with this code
          STAN_code <- '
          data {
            int Ndata;
            int Nsubj;
            int subj[Ndata];
            real y[Ndata];
            int t[Ndata];
          }
          parameters {
            real s0;
            real e0;
            real k[Nsubj];
          }
          model {
            real z[Ndata];
            real p[Ndata];

            for (i in 1:Ndata) {
              z[i] <- (k[subj[i]]*t[i] + s0 - e0) / (s0*e0)^.5;
              p[i] <- .5*(z[i] / ((z[i])^2 + 4)^.5 + 1);
            }
            y ~ normal(p, .05);
          }
          '
          fit <- stan(model_code = STAN_code, data = dat2, iter = 3000, chains = 3, warmup=500)
          ####

        • This is an interesting problem, because there are all kinds of implicit functions like that in engineering (i.e., x is a function of the other stuff, but it’s not explicit; you have to solve numerically for the x).

          Here’s the question though: what do you want to do with this model? Do you want to observe values of k and c with errors and do inference on the values of x? Or what?

        • I see right at the end you said you want to fit c, k, s0, etc. to data, so I’m assuming this means you’ve got observations of x and you want to know which values of c, k, s0, etc. are consistent with those observations.

          In a theoretical sense (not a Stan programming sense), this seems amenable to an ABC-type technique. You can’t write down a probability density for x, but given values of c, k, s0, etc., you can GENERATE x values using a root finder like uniroot to solve your equation.

          I don’t think Stan can do that, but there might be other approaches you could take.

        • Daniel,

          Yes, that is what I want to do. I posted some example code for the simplified case above. Perhaps there is a way to call R’s uniroot function from STAN at each step? That is what I’m wondering. Can you provide a link to the ABC technique?

        • ABC stands for “Approximate Bayesian Computation” and is a class of Bayesian fitting techniques for models in which we can’t write down a likelihood but we can generate data. You can google around for those keywords, but maybe start with Wikipedia’s article to get a flavor:

          http://en.wikipedia.org/wiki/Approximate_Bayesian_computation

          and then maybe look at an R package for examples?

          http://cran.r-project.org/web/packages/abc/

          A typical example where this might get used is when you’re trying to do inference on a function defined by a differential equation, and you want to find parameters that make the differential equation go “close” to the data. But it’s just as valid for problems where you have an implicit function like yours: given parameters, you can solve for x, but only numerically, not in closed form. I don’t think Stan does this as-is, but once Stan has user-defined functions, you might be able to define a function that calls out to R’s uniroot or something… I don’t know. A toy sketch of the rejection version is below.
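
          For concreteness, the simplest rejection flavor of ABC looks something like this R sketch, where simulate_model is a hypothetical stand-in for “given parameters drawn from the prior, solve the equation for x with uniroot”, and the prior, distance, and epsilon are all made up:

          # Toy ABC rejection sampler: keep prior draws whose simulated
          # data lands close to the observed data.
          abc_rejection <- function(y_obs, n_draws, epsilon) {
            kept <- list()
            for (s in 1:n_draws) {
              theta <- runif(4, 0, 10)            # draw (c, k, s0, e0) from a made-up prior
              y_sim <- simulate_model(theta)      # hypothetical: solve for x via uniroot
              if (mean((y_sim - y_obs)^2) < epsilon)  # accept if "close enough"
                kept[[length(kept) + 1]] <- theta
            }
            do.call(rbind, kept)                  # accepted draws approximate the posterior
          }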

    • question, c and k are not identified in your model; only their ratio is. To see this, you can verify that your model equation is invariant to the mapping (c -> cK, k -> kK). This may not be a problem if their joint prior does not share this invariance, but it’s worth being aware of this feature of your model…

      • Corey,

        You got me to look at it again, and I had missed a c/k term on the left side of the equation, so the above is in error (as is my Stan model, but it didn’t matter for the example data since c = k there). The correct model is below:

        (c/k)*(x/(1-x))^(k/(c+k)) - ((1-x)/x)^(c/(c+k)) =
        c*e0^(-k/(c+k))*s0^(-c/(c+k))*t - (e0/s0)^(c/(c+k)) + (c/k)*(s0/e0)^(k/(c+k))

        I don’t think this addresses what you are saying, though. With regard to that: the right side (which I have called “z” in the code) does include a lone “c”, so we get different results for different values of c and k that share the same ratio. Perhaps I misunderstood. New code:

        #Start Code
        plot(0, 0, ylim = c(0,1), xlim = c(0,500), type = "n", xlab = "t", ylab = "p")
        for (j in 1:2) {
          c = j; k = j; s0 = 10; e0 = 100; t = 1:500
          z = c*e0^(-k/(c+k))*s0^(-c/(c+k))*t - (e0/s0)^(c/(c+k)) + (c/k)*(s0/e0)^(k/(c+k))

          p = matrix(nrow = length(z))
          for (i in 1:length(z)) {
            f <- function(x) (c/k)*(x/(1-x))^(k/(c+k)) - ((1-x)/x)^(c/(c+k)) - z[i]
            p[i] <- uniroot(f, interval = c(0, 1), tol = 10^-9)$root
          }
          lines(t, p, col = c("Red","Blue")[j], lwd = 3)
        }
        legend("bottomright", c("c=k=1","c=k=2"), col = c("Red","Blue"), lwd = 3)
        #End code

        http://s27.postimg.org/az8cx3wlv/image.png

        • Sorry, the actual model used by STAN (the simplified equation for the c = k case) is correct, at least according to my double-checking. The code used to generate the data (using the general model) was incorrect, but it didn’t matter for the parameters used, since they all set c = k, so c/k = 1.

        • question: I recommend using Maxima, or another computer algebra system such as Mathematica if you know it, to do manipulations on your equations when they are as complicated as yours, just to avoid accidental errors.

          Also, did you take a look at the ABC package for R? That might do what you need. Another option might be to use the “mcmc” package in R. You could set up a posterior density based on a normal error between the data and the numerical value of x calculated with uniroot. “mcmc” isn’t super convenient the way Stan is, but it can handle models that Stan can’t, like this one.

        • Daniel,

          I opened the vignette for the ABC package about an hour ago, taking a look. My first impression is that it will not help because I need to “generate summary statistics” some other way.

        • A “summary statistic” in the ABC world is really anything that helps you determine whether you have a good fit. You could probably just use the data itself, looking at mean squared error or something like that. I actually think the “mcmc” package might be more helpful for you: you can set up a normal error model but use uniroot within the log-probability function to calculate the theoretical x.

        • Check out the “mcmc” package. You can set up the log-posterior density as an R function, using uniroot much as you were doing when you modified that example script, and the mcmc package will give you a plug-in function to do the sampling. The thing you’ll have to do is tune the proposal (scales and/or covariances), typically by adjusting scales, to get a good acceptance rate. Stan makes all of that tuning go away, but it comes at the cost of requiring fairly “analytical” models. Overall, you’re not going to get something as fast as Stan from this, but tuning can have a big effect, and you can do the “numerical only” models. A sketch of what I mean is below.

          Note: you can probably make uniroot’s tolerance pretty big, since you’re also going to be putting a normal “error” model on top of it anyway. As long as uniroot is giving you something accurate to within a small fraction of the normal error scale, it is probably “accurate enough”, and that could save you a fair amount of time.
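
          Here’s a rough, untested sketch of that approach with the “mcmc” package, assuming the corrected equation above; the log-scale parameterization, the priors, and the uniroot bracketing interval are all made-up choices:

          # Log posterior for mcmc::metrop: solve numerically for the
          # theoretical p, then put a normal error model around it.
          library(mcmc)

          log_post <- function(theta, t, y) {
            cc <- exp(theta[1]); k <- exp(theta[2])  # log scale keeps c, k, s0, e0 positive
            s0 <- exp(theta[3]); e0 <- exp(theta[4])
            z <- cc*e0^(-k/(cc+k))*s0^(-cc/(cc+k))*t -
                 (e0/s0)^(cc/(cc+k)) + (cc/k)*(s0/e0)^(k/(cc+k))
            p <- numeric(length(z))
            for (i in seq_along(z)) {
              f <- function(x) (cc/k)*(x/(1-x))^(k/(cc+k)) - ((1-x)/x)^(cc/(cc+k)) - z[i]
              root <- tryCatch(uniroot(f, interval = c(1e-6, 1 - 1e-6), tol = 1e-4)$root,
                               error = function(e) NA)
              if (is.na(root)) return(-Inf)          # reject proposals where no root is bracketed
              p[i] <- root
            }
            sum(dnorm(y, mean = p, sd = 0.05, log = TRUE)) +
              sum(dnorm(theta, 0, 5, log = TRUE))    # weak priors on the log scale
          }

          # out <- metrop(log_post, initial = rep(0, 4), nbatch = 2000,
          #               scale = 0.1, t = t, y = y)
          # out$accept  # tune scale until the acceptance rate looks reasonable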
