Fitting models with discrete parameters in Stan

This book, “Bayesian Cognitive Modeling: A Practical Course,” by Michael Lee and E. J. Wagenmakers, has a bunch of examples of Stan models with discrete parameters—mixture models of various sorts—with Stan code written by Martin Smira! It’s a good complement to the Finite Mixtures chapter in the Stan manual.

30 thoughts on “Fitting models with discrete parameters in Stan”

  1. I glanced at this book. It has everything that makes frequentists grate their teeth when they hear Bayesians machinate. For example, a credibility interval contains the “true” value of the parameter (exercise 1.1.3). This is only correct if the prior distribution is the “true” prior (it’s easy enough to create two separate priors such that, with the same data, the two posteriors have credibility intervals that don’t overlap). There is also a gratuitous slap at asymptotic analysis (see the discussion of the “plug-in” principle, exercise 1.2.1, with no corresponding discussion of convergence in law or probability). Etc.

    • Numeric:

      This may be so. But from my perspective I don’t need this book for theory; we already have BDA. What this book does offer that’s special is implementation of a bunch of useful models.

      • I have to say BDA is much more circumspect about its claims. This book reminds me of the time, decades ago, when I came across a French history text written in 1930 at the 6th-grade level (so I could read it). One of the questions at the end of the first chapter was “Which country started the World War by invading Belgium?” This book reminds me of that question-begging approach.

    • Oi numeric! You raise some interesting theoretical/philosophical concerns. For the issue of exercise 1.2.1, see the classic book by Aitchison and Dunsmore (1975) — if I recall correctly, they make the same point throughout. Perhaps we should have cited this work, but we felt the argument was sufficiently intuitive. We can rarely be sure whether we’ve arrived at the land of asymptotia (and it would be nice to have methods that work outside its sacred borders). Also, as far as “true values” are concerned, I am no longer sure whether the Bayesian agenda requires the concept of truth at all (in real applications, what do we mean by “truth” anyway?). This philosophical issue gets some discussion in the book, but perhaps not enough (see also Karl Pearson’s “Grammar of Science” for an interesting exposé on scientific truth, where Pearson compares humans to telephone operators who lack any knowledge of the outside world). As the title of the book implies, our aim was to provide some practical guidelines for implementing Bayesian models in cognitive science. Anyway, bottom line: Martin did a great job, and the contributors to the Stan mailing list were extremely helpful! And of course I am very happy to hear that frequentists will grate their teeth! ;-)

      Cheers,
      E.J.

      • We can rarely be sure whether we’ve arrived at the land of asymptotia…

        Whenever I read this sort of claim I like to point to Charles Geyer’s paper on the subject: “If the log likelihood is approximately quadratic with constant Hessian, then the maximum likelihood estimator (MLE) is approximately normally distributed. No other assumptions are required. We do not need independent and identically distributed data. We do not need the law of large numbers (LLN) or the central limit theorem (CLT). We do not need sample size going to infinity or anything going to infinity.”

        In short: a plot of the log-likelihood might be all the evidence you need (a quick sketch of that check follows below).
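
        As a concrete illustration of that check, here is a minimal Python sketch (mine, not from the thread), assuming hypothetical binomial data: it computes the log-likelihood over a grid and compares it to the quadratic approximation at the MLE. If the two nearly coincide over the region that matters, Geyer’s condition is plausibly satisfied.

        ```python
        # Minimal sketch (hypothetical data, not from the thread): is the
        # log-likelihood approximately quadratic near the MLE?
        import numpy as np

        y, n = 24, 30                        # hypothetical data: 24 successes in 30 trials
        theta = np.linspace(0.5, 0.95, 200)  # grid over the plausible region

        def loglik(t):
            return y * np.log(t) + (n - y) * np.log(1 - t)

        mle = y / n
        # observed information = minus the second derivative of the log-likelihood at the MLE
        info = y / mle**2 + (n - y) / (1 - mle)**2
        quadratic = loglik(mle) - 0.5 * info * (theta - mle)**2

        # Plotting loglik(theta) against `quadratic` is the visual version of this
        # check; here we just report the largest gap on the grid.
        print("max |log-likelihood - quadratic approx| on grid:",
              np.max(np.abs(loglik(theta) - quadratic)))
        ```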

      • > I am very happy to hear that frequentists will grate their teeth! ;-)
        If in jest, I like that is sad.

        I often find very thoughtful texts are self-deflating by being smug, in particular about Bayesian approaches taking you to the world of omnipotence. I do feel Bayesian approaches make doing science easier, but not easy.

        > Pearson’s “Grammar of Science” for an interesting exposé on scientific truth
        Have you read Peirce’s critique of this work by Pearson? https://en.wikisource.org/wiki/Popular_Science_Monthly/Volume_58/January_1901/Pearson%27s_Grammar_of_Science

        (I am a bit embarrassed the ASA quotes this work.)

        • Oops, how did that happen?

          > I am very happy to hear that frequentists will grate their teeth! ;-)
          Even if in jest, that is sad.

    • Credibility intervals can contain the true value of the parameter for a very wide variety of priors, there’s not just one “true” prior for which the posterior will contain the true value.

      Practically speaking, if the region around the true value is in the high probability interval of the prior, and the data model isn’t terribly wrong, then the posterior will also contain the true value.

      Suppose that the true value in some instance is 0, just for concreteness, and suppose you have a reasonably accurate data model. Then any of the following priors will probably work:

      normal(0,1)
      normal(100,1000)
      normal(5,30)
      laplace(1,10)

      mixture(1/3 * normal(0,10) + 1/3 * normal(1,2) + 1/3 * normal(-3,2))

      etc etc

      It all depends on what you think you know, but to think that there’s “one true prior” is seriously confused. (A small numerical sketch below illustrates the point.)
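
      To make this concrete, here is a small numerical sketch (mine, not from the thread), assuming a true value of 0, a normal data model with known sigma, and the first few priors listed above; the conjugate normal-normal update gives each posterior in closed form, and each 95% interval is checked for whether it contains the true value.

      ```python
      # Small sketch (hypothetical data, not from the thread): several different
      # priors, one true value (0), one reasonably accurate data model.
      import numpy as np

      rng = np.random.default_rng(1)
      true_value = 0.0
      sigma = 1.0
      y = rng.normal(true_value, sigma, size=25)   # hypothetical data

      priors = {                                   # (prior mean, prior sd)
          "normal(0, 1)":      (0.0,   1.0),
          "normal(100, 1000)": (100.0, 1000.0),
          "normal(5, 30)":     (5.0,   30.0),
      }

      for name, (m0, s0) in priors.items():
          # conjugate normal-normal update for the mean, sigma known
          post_var = 1.0 / (1.0 / s0**2 + len(y) / sigma**2)
          post_mean = post_var * (m0 / s0**2 + y.sum() / sigma**2)
          lo, hi = post_mean - 1.96 * post_var**0.5, post_mean + 1.96 * post_var**0.5
          print(f"{name:18s} 95% interval ({lo:+.3f}, {hi:+.3f})  contains 0: {lo < true_value < hi}")
      ```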

      • To make this argument comparable to a comparison with confidence intervals, one would have to calculate all possible priors and determine which ones have the “true” parameter in them (an impossible task), and then compare that to all possible confidence intervals and see which ones have the “true” parameter in them. But if one is a “true” Bayesian there is one “true” prior (BDA gets around it by assuming there is a true parameter but our knowledge about the parameter is unknown, so there could be many “true” priors since everyone’s uncertainty is different). Of course (as demonstrated in
        https://normaldeviate.wordpress.com/2012/06/14/freedmans-neglected-theorem/) “… it is easy to prove that for essentially any pair of Bayesians, each thinks the other is crazy,” which gives an interesting insight into the concept of “coherence” as it applies to Bayesian analysis. Maybe solipsism is the better word?

        I don’t expect deep philosophical insights in an introductory book, but I object to smugness. The comments above indicate that there have been many thoughts on this by many people, and even in an introductory text that should be reflected, particularly by an upstart philosophy (“a decent respect to the opinions of mankind”, etc.).

        • Numeric:

          I continue to be exhausted by this sort of argument which singles out Bayesian inference for its choice of priors without noting that all inference is sensitive to the choice of model, or optimization criteria, or whatever. All those logistic regressions that are just assumed to be true, then all this agonizing over a bit of regularization. It exhausts me. Any solipsism of Bayesian inference is also there for any other method of statistical inference.

        • Freedman’s theorem is neglected for a good reason: the set of all priors isn’t the set we care about because almost all priors in that set are uncomputable. We only care about the subset containing computable priors (a denumerable set!), since these are the priors it’s actually possible to use, and Freedman’s theorem tells us nothing about them.

    • I have no interest in a comparison with confidence intervals in general. Confidence intervals answer a question about how well some “confidence procedure,” when blindly applied to problems it was maybe designed for, will perform in the long run across many different problems… and basically never do I care about that problem. I don’t apply the same confidence procedure repeatedly; I build individualized Bayesian models for individual problems.

          Also: “but if one is a true Bayesian there is one true prior” … umm, no. This is absurd.

          For each Bayesian working on each possible problem there is some prior which that Bayesian will choose to use in that problem. Just as for each undergraduate in a physics lab using a meter stick with nothing but 0, 0.25, 0.5, 0.75 and 1.0 meter markings on it, there is some measurement that each undergrad will choose to represent the length of their foot. They make a guess informed by some boundaries and some general sense of the relative size of things, some visual processing information, but the guess will be a lot different than if they had millimeter markings on the stick.

        • We should make the distinction between the situation where there is a clear meaning to the word “true value of the parameter” and otherwise. Pretty clearly we can say that there is a “true value for the length of your foot” at least, when rounded to the nearest mm for example. There need not be a “true diameter of a crumpled ball of paper” yet there is a value for that diameter which when used in a calculation will lead to an approximately correct fall trajectory in a “dropping paper balls” experiment. What you might call the “true value of the *effective* diameter”.

          http://models.street-artists.org/2013/04/26/1719/

          In almost all models we’re really talking about the true value of an effective parameter, but in many models the effective parameter is closely related to a similar “real” parameter. For example, if you want to hit a golf ball and pretend it has no dimples, you can probably choose a slightly different diameter than what you’d measure with a caliper, and get a good approximation for the trajectory, but the difference in diameter isn’t going to be big, especially considering how the ball doesn’t stay in the boundary region of the drag coefficient for a long time.

        • “(BDA gets around it by assuming there is a true parameter but our knowledge about the parameter is unknown, so there could be many “true” priors since everyone’s uncertainty is different)”

          Numeric, there is no “gets around” about it. If distributions model uncertainty, there can be many distributions which correctly do so for the same true parameter. If distributions model frequencies, then presumably there’s just one correct answer.

          If you want to model frequencies why not just call yourself a “Frequencyologist”, refer to all distributions as “frequency distributions”, write them as F(x) instead of P(x), and leave all the inference in the face of uncertainty to the Bayesians?

        • And you could publish all your stuff in the Journal of Applied Frequencies ’n Stuff.

          This seems like a fair compromise to me. Frequentists have their lane, and Bayesians have theirs. Just so long as it’s understood that if the thing Bayesians are uncertain about is itself a frequency, then they are free to make inferences about it just the same as anything else.

        • P.s. Germany did start WWII in Europe, and it’s no more glib for a Bayesian to assume Bayesian stuff even though some people don’t understand it than it is for a quantum physicist to assume quantum physics even though some people don’t understand it.

        • The comment about “the World War” came from the 1930s (the absence of a number on the war is a clue) and it’s much less clear than for WW2 exactly what the initial event was.

          But in any case, I presume the point of the example was to show how one can “prove” one’s version of a debatable assertion (“who started a war”) by creating a rhetorical connection to an established fact (which country’s army marched into Belgium). I find it amusing to watch scientists resorting to all kinds of rhetoric to show that they are being “objective”, because subjectivity is bad m’kay.

  2. Way too much to comment on extensively, but let me touch on some of the
    arguments made. My original objection was to the implicit presumption
    of superiority of Bayesian methods as exemplified in “Bayesian
    Cognitive Modeling: A Practical Course”, exercise 1.1.3:

    Figure 1.1 shows that the 95% Bayesian credible interval for θ
    extends from 0.59 to 0.98. This means that one can be 95% confident
    that the true value of θ lies between 0.59 and 0.98.

    EJ Wagenmakers states:

    You raise some interesting theoretical/philosophical concerns. For
    the issue of exercise 1.2.1, see the classic book by Aitchison and
    Dunsmore (1975) — if I recall correctly, they make the same point
    throughout.

    Answer:

    Obviously, my comment that frequentists grate their teeth refers to
    a large amount of research/discussion by frequentists.

    Daniel Lakeland says:

    Practically speaking, if the region around the true value is in the
    high probability interval of the prior, and the data model isn’t
    terribly wrong, then the posterior will also contain the true value.

    Answer:

    Yes, but one prior gives a region A of the line which is 95%
    credible, and another prior gives a smaller and completely contained
    region B, so are we to conclude that the region A/B has no
    credibility? If so, why is it part of region A?

    Andrew opines:

    I continue to be exhausted by this sort of argument which singles
    out Bayesian inference for its choice of priors without noting that
    all inference is sensitive to the choice of model, or optimization
    criteria, or whatever. All those logistic regressions that are
    just assumed to be true, then all this agonizing over a bit of
    regularization. It exhausts me. Any solipsism of Bayesian
    inference is also there for any other method of statistical
    inference.

    Answer:

    I concur, and this was my original point: to wit, to claim that the
    confidence interval is lacking while the Bayesian approach is
    correct (at 95% confidence) is simply incorrectly privileging one
    method over another without pointing out that “all inference is
    sensitive to the choice of model.” As far as the solipsism of
    statistical models goes, methods which use consistency measures are to
    be preferred. While the Spanos/Mayo “school” is typically not
    referred to in Andrew’s work, I think it has real possibilities in
    terms of evaluating models, so perhaps there is a method which, if
    not eliminating sensitivity, could reduce it to some extent.

    Corey states:

    Freedman’s theorem is neglected for a good reason: the set of all
    priors isn’t the set we care about because almost all priors in
    that set are uncomputable.

    Answer:

    The priors admit of a measure in a probability space. I fail to see
    why computability is a restriction that should be applied in the
    context of the Kolmogorov axioms.

    Daniel Lakeland offers a post-modern interpretation of probability:

    Also if one is a true Bayesian there is one true prior…umm, no.
    This is absurd.

    Answer:

    Objective Bayesians, I think, would argue there is one true prior.
    Subjective Bayesians would not. I will note that, in general, if there
    is not a true prior then there is not a true model. Maybe true
    models are a frequentist construct.

    Anonymous says:

    P.s. Germany did start WWII in Europe,

    Answer:

    Dreyfus was guilty! More seriously, France had announced it would
    stand by its treaty with Russia and Russia had mobilized and announced
    that it would aid Serbia against Austria-Hungary, which had been
    guaranteed support by Germany (note–it was Servia at the time, but
    during the war it didn’t play well that the Entente was fighting on
    behalf of servants, hence the name change). Thus the usual conclusion
    is that the system of alliances would inevitably lead to a general war
    (and this was the conclusion of a set of academics from Germany,
    France and England in the early thirties). Revisionist historians
    have emphasized German militarism and a prevalent belief in Germany
    that war was inevitable as underlying causes for the war, but these
    were underlying causes that made Germany more likely to fight (just as
    the defeat in the Russo-Japanese war and the various perceived
    humiliations imposed on Russia by Austria-Hungary and Germany over
    Balkan incidents predisposed Russia to resort to war). If anyone
    “started” the war, it was either Austria-Hungary or Serbia. As an
    analogous historical situation, Serbia harbored terrorists who killed
    the archduke, and when that happened to the US, we attacked
    Afghanistan. Who started that war?

    • “I fail to see why computability is a restriction that should be applied in the context of the Kolmogorov axioms.”

      But the context is, y’know, actually doing statistics, for which we require effective computability at the bare minimum.

      Or let me put it another way: no one should care about Freedman’s theorem for roughly the same reason that no one should worry that someone will break an orange into five pieces and then put them back together and get an orange the size of the solar system — even though Banach-Tarski says that this is possible.

    • “Yes, but one prior gives a region A of the line which is 95%
      credible, and another prior gives a smaller and completely contained
      region B, so are we to conclude that the region A/B has no
      credibility? If so, why is it part of region A?”

      Two people sit down in front of a roulette wheel. One chooses 1/4 of the numbers at random using a random number table printout and hopes for the best; the other has a computer in their shoe and uses it and the laws of physics to predict a quadrant of the wheel (see the book “The Eudaemonic Pie”), betting on all the numbers in that quadrant.

      Both are operating from different information; is there any reason we should be surprised that the person with the computer in their shoe wins more often? Now, admittedly, this is about a model for the data. But the prior is the same way, in the sense that two different people with two different background information states can choose two different priors, and both of them can be valid!

      In other words, the A and B in your example use different information, and arrive at a calculation that gives different output. The only question is whether one of them uses incorrect or un-justified information, not whether one uses “the true prior” and another uses “something else”.

      “The Objectively One-True-Prior” is nonsense; I don’t think any “objective Bayesians” would agree with the idea of some kind of Platonic “true prior” out there in the universe. But what isn’t nonsense is “whether the true value of the parameter is somewhere in the high probability region of the chosen prior or not”.

      It’s no use using a prior like normal(0,1) when the true value of the parameter is 75. But any prior that contains 75 in its high probability region is “a true prior”.

      You MIGHT be able to argue that if you can somehow specify exactly a given state of information, then that determines a particular prior. But since specifying exactly a state of information is probably even more difficult than specifying a prior for a single parameter, I think this leads to uselessness for the most part.

      Vague priors are a little like saying “I have at least a dollar in the bank”. This is a true statement for a wide variety of people from a near-broke struggling grad student to Bill Gates. It’s just not very informative. But if you happen to have 3 months of historical bank statements for the given person, you can probably construct a much more informative prior for the closing balance today.

      • In fact the only meaning I can think of for “the one true prior” is “the prior the human race would arrive at if it chose a very vague initial prior and a data model containing all that we know about the problem as of today, and then input into the Bayesian machinery all of the data collected by every person who has ever studied the problem.”

        Which is obviously an unreasonable requirement for any given real person.

        • “The exact and entire past, present, and future history of the universe”–Leonard J. Savage.

          Any reasonable person should be able to have priors over this space.

      • I think you’re missing my point. The book in question states (in a particular exercise I cite, but more generally in tone throughout) that the “true” parameter is in the credibility interval with 95% probability; but look, the confidence interval does not have this property, being correct only 95% of the time in some type of long-run frequentist experiment. So the Bayesian solution is “correct.” This makes frequentists’ teeth grate, and maybe I shouldn’t have posted such an off-hand comment, but qualifications are even more necessary in an introductory text than in an advanced one, as naive minds are susceptible to the underlying meanings. Bayesians have noted this and complained bitterly, but one aspect of sustained conflict is that the originally aggrieved side tends to adopt the worst behavior of the other (the British didn’t bomb German cities until London was bombed, but then they repaid it 100-fold). The book grates, even though it is useful and I appreciate the authors’ writing it.

        • I don’t have the book, but the statement “the true parameter is in the credibility interval with 95% probability” is *true* tautologically; it’s the definition of probability for a Bayesian.

          However, this is NOT a statement about how frequently the true parameter will be in the interval if you repeat a similar kind of analysis across many different problems! It’s perfectly reasonable for 95% Bayesian probability intervals to have higher or lower frequentist coverage. For example, perhaps in 1000 different models your 95% probability intervals contain the true parameter 100% of the time… does that mean your intervals were wrong? No, because probability isn’t frequency! (A small simulation at the end of this comment illustrates the point.)

          You could reword that statement to avoid the overloaded term probability as follows:

          95% of the weight of the evidence used in my model indicates that the true value of the parameter is in the given interval.
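
          Here is a small simulation sketch of that point (mine, not from the thread), assuming a normal(0, 1) prior, a normal data model with known sigma = 1, and n = 4 observations: the long-run frequency with which the 95% credible interval covers a fixed true value depends on where that value sits relative to the prior, and can land above or below 95% without the Bayesian statement being wrong.

          ```python
          # Small simulation sketch (assumed setup, not from the thread): frequentist
          # coverage of a 95% Bayesian credible interval under a normal(0, 1) prior.
          import numpy as np

          rng = np.random.default_rng(0)
          n, sigma = 4, 1.0
          prior_mean, prior_sd = 0.0, 1.0
          reps = 20000

          for true_value in (0.0, 2.5):
              y = rng.normal(true_value, sigma, size=(reps, n))        # repeated experiments
              post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)      # same in every replication
              post_mean = post_var * (prior_mean / prior_sd**2 + y.sum(axis=1) / sigma**2)
              lo = post_mean - 1.96 * np.sqrt(post_var)
              hi = post_mean + 1.96 * np.sqrt(post_var)
              coverage = np.mean((lo < true_value) & (true_value < hi))
              print(f"true value {true_value:+.1f}: long-run coverage of the 95% "
                    f"credible interval ~ {coverage:.3f}")
          ```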

    • Joe:

      The most common discrete-parameter models are latent-variable or mixture models, which can be expressed by marginalizing the discrete variable out of the likelihood, using the mixture formulation described in the Stan manual (chapter 10 in Stan manual version 2.12). A minimal sketch of that marginalization follows.
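
      As a rough illustration of what that formulation does, here is a minimal NumPy sketch (mine, not code from the book or the manual): the discrete component indicator of a two-component normal mixture is summed out of the likelihood with log-sum-exp, which is the marginalization that Stan’s log_mix / log_sum_exp idiom performs inside the model block.

      ```python
      # Minimal sketch (illustration only): marginalizing a discrete mixture
      # indicator out of the likelihood, as in the Stan manual's mixture formulation.
      import numpy as np

      def normal_lpdf(y, mu, sigma):
          return -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

      def mixture_loglik(y, lam, mu1, mu2, sigma):
          """Two-component normal mixture log-likelihood with the discrete indicator
          z_i summed out: per observation,
          log_sum_exp(log(lam) + normal_lpdf(y | mu1, sigma),
                      log(1 - lam) + normal_lpdf(y | mu2, sigma))."""
          lp1 = np.log(lam) + normal_lpdf(y, mu1, sigma)
          lp2 = np.log1p(-lam) + normal_lpdf(y, mu2, sigma)
          return np.logaddexp(lp1, lp2).sum()

      # hypothetical two-cluster data and parameter values, just to show the call
      rng = np.random.default_rng(3)
      y = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])
      print(mixture_loglik(y, lam=0.5, mu1=-2.0, mu2=3.0, sigma=1.0))
      ```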
