Self-study resources for Bayes and Stan?

Someone writes:

I’m interested in learning more about data analysis techniques; I’ve bought books on Bayesian Statistics (including yours), on R programming, and on several other ‘related stuff’. Since I generally study this whenever I have some free time, I’m looking for sources that are meant for self study. Are there any sources that you can recommend that are particularly useful for this?

My reply: I recommend you take a look at the Stan case studies. Also the book by McElreath.

43 thoughts on “Self-study resources for Bayes and Stan?”

  1. I’m glad the person who emailed you also indicated they are reading material related to Bayes and R/Stan. I have become more and more convinced that learning Bayes is as much about studying the subject directly as it is about the related material (mathematical modeling, philosophy of science, etc.).

    I’m sure others will provide amazing recommendations for Bayes/Stan so I’ll try to help out with some of the related components:

    Principles of Applied Statistics (2011), Cox and Donnelly

    An Accidental Statistician: The Life and Memoirs of George E. P. Box (2013), George E. P. Box

    Introduction to Scientific Programming and Simulation Using R (2014), Owen Jones et al.

    An Introduction to Statistical Learning: with Applications in R, Gareth James et al.

    A Concrete Approach to Mathematical Modelling (1995), Mike Mesterton-Gibbons

    Modelling with Differential and Difference Equations (2004), Fulford et al.

    Statistical Models in Engineering (1994), Hahn & Shapiro

    A Paul Meehl Reader: Essays on the Practice of Scientific Psychology (2006), edited by Waller et al.

  2. Great suggestions here (and hopefully more will follow). I originally sent that e-mail on January 2nd, so we can all estimate the lag of this blog now ;-) Since then I’ve worked through most of McElreath’s book and it’s been massively useful to me. I now recommend that to anyone who shows interest in Bayes and Stan.

    Thanks again for pointing me in this direction!

  3. Well, the important selection bias is using only co-arrest networks rather than multiple network types, which would reduce the singletons. Not only is co-arrest not reflective of all co-offending, it’s also not reflective of (among other things) ties that might exist between nodes in the component and nodes who never were co-arrested with someone in the big component.

  4. Just note that Chapter 6 of “Statistical Rethinking” is a bit outdated on model comparison and model averaging (the same holds partially for BDA3, too). See http://link.springer.com/article/10.1007/s11222-016-9696-4 (preprint https://arxiv.org/abs/1507.04544) and https://arxiv.org/abs/1704.02030

    For BDA3 there are some R demos (including RStan and RStanARM demos) at https://github.com/avehtari/BDA_R_demos

    Stan case studies are also excellent. A few more and it would be possible to use only them as course material :)
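
    For what it looks like in practice, here is a minimal sketch (assuming rstanarm and the loo package are installed; the models and the mtcars data are just placeholders):

    ```r
    # Minimal sketch: PSIS-LOO and WAIC for fitted models, plus a model comparison.
    # The same workflow applies to any Stan model that saves a pointwise log-likelihood.
    library(rstanarm)
    library(loo)

    fit1 <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
    fit2 <- stan_glm(mpg ~ wt + hp, data = mtcars, refresh = 0)

    loo1  <- loo(fit1)    # PSIS-LOO, with Pareto-k diagnostics (the "warnings")
    loo2  <- loo(fit2)
    waic2 <- waic(fit2)   # WAIC for the same model, to compare against loo2

    print(loo1)
    print(waic2)
    loo_compare(loo1, loo2)  # difference in elpd_loo and its standard error
    ```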

    • I’m outlining revisions for the 2nd ed now, and I’m really unsure what to do about Chapter 6. Readers really like the intro to information theory, since they don’t get it in other texts. But it needs to be abbreviated somehow. I do plan to strip out much of the content on AIC/DIC. I also want to emphasize pointwise inspection of individual models more than averaging. More than anything, I need to figure out how to prevent readers from getting the notion that they need to select a model this way (instead of by, e.g., background theory).

      Re LOO: As you know, not everyone is yet convinced that LOO is an advance over other metrics. I haven’t made up my mind. Will give you the chance to convince me in Helsinki sometime.

      • What’s the inferential goal? I think machine learning is taking over precisely because it concentrates on prediction rather than significance.
        When the inferential goal is prediction, leave-one-out cross-validation makes a lot of sense, as it’s a proxy for having a true held-out test set (like actually using your model to predict tomorrow’s weather).

        An alternative (though asymptotically similar) method for measuring prediction that I really like is outlined in this paper:

        I use an informal version of calibration and sharpness to compare models in my repeated binary trials case study.
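
        Roughly, that informal check looks like this (a made-up illustration, not the actual case-study code; y and yrep here are stand-ins for observed outcomes and posterior predictive draws):

        ```r
        # Informal calibration (do X% intervals cover ~X% of the data?) and
        # sharpness (how wide are those intervals?). Data and draws are simulated
        # here just so the snippet runs on its own.
        set.seed(123)
        y    <- rnorm(100, mean = 2, sd = 1)                       # "observed" outcomes
        yrep <- matrix(rnorm(4000 * 100, mean = 2, sd = 1), 4000)  # draws x observations

        check_interval <- function(y, yrep, prob = 0.5) {
          lower <- apply(yrep, 2, quantile, probs = (1 - prob) / 2)
          upper <- apply(yrep, 2, quantile, probs = 1 - (1 - prob) / 2)
          c(coverage = mean(y >= lower & y <= upper),  # calibration: should be near prob
            width    = mean(upper - lower))            # sharpness: smaller is better
        }

        check_interval(y, yrep, prob = 0.5)
        check_interval(y, yrep, prob = 0.9)
        ```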

        • I’m very sympathetic to that view. But the alternatives share the prediction focus. It’s not sufficient to distinguish among them.

        • Richard:

          I think Waic is kind of ok too, but Aki has a point that the best justification for Waic is as an approximation for Loo, and in that case I prefer Loo as the argument for it is more straightforward. Aic is fine as a starting point for linear models with flat priors, but that’s about it. And Dic is of historical interest as an intermediate step that happened along the way to our current understanding.

          Bob:

          Prediction is fine but it can be important to predict in new scenarios that are different from past scenarios. For example, using Mister P to generalize to a population that is different from the survey data at hand. Or using a differential equation model in pharmacology to make predictions for dosing scenarios that are different from those in the experimental data.

          It’s fine with me when people say that prediction is the only problem in statistics that matters—as long as they recognize that some sort of modeling (or generalization or regularization or whatever you want to call it) is necessary to make predictions for new cases, in the very important scenario where the observed data are not a random sample from the population of interest.

          Hmmm, maybe that needs its own blog post?

        • Re WAIC vs LOO: Re-running all my textbook examples with both, the only big difference I find is that LOO warns that it is unreliable all the time, which I kinda like! I suppose the subtle concern is that LOO maybe trades bias for variance, relative to WAIC.

          From a teaching perspective, I need to start with AIC, because most biologists are familiar with it. It feels like a burden of presentation.

        • Richard:

          I think of Waic as an approximation to Loo. Once Aki explained that to me, I found it more difficult to recommend Waic to people. I think it’s a lot better than Dic, though. I don’t really have a problem with people using Waic, I just find it awkward to recommend.

          I agree that Aic is an excellent starting point, especially if people in the audience have already heard about it! To me the key step in explaining any of these things is to step away from the idea that there is some sort of Platonic “information criterion” out there to be discovered, and instead to consider all of these as methods for estimating out-of-sample pointwise prediction error. From there, it’s clear that Aic gives such an estimate in certain simple settings, that Loo is a reasonable general approach, and that approximations such as Waic or Psis-Loo can be useful for computing fast and stable estimates.
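
          As a toy illustration of that “same target” point (flat-prior linear regression with plug-in predictive densities; just a sketch, not a serious comparison):

          ```r
          # -AIC/2 and an explicit leave-one-out log score estimate the same thing:
          # out-of-sample pointwise predictive log density. Simulated data, lm() only.
          set.seed(1)
          n <- 200
          dat <- data.frame(x = rnorm(n))
          dat$y <- 1 + 2 * dat$x + rnorm(n)

          fit <- lm(y ~ x, data = dat)
          k <- attr(logLik(fit), "df")               # 3: intercept, slope, sigma
          aic_elpd <- as.numeric(logLik(fit)) - k    # equals -AIC/2

          loo_elpd <- sum(sapply(seq_len(n), function(i) {
            f <- lm(y ~ x, data = dat[-i, ])
            sigma_mle <- sqrt(mean(residuals(f)^2))  # plug-in (MLE) residual sd
            dnorm(dat$y[i], mean = predict(f, newdata = dat[i, ]),
                  sd = sigma_mle, log = TRUE)
          }))

          c(aic = aic_elpd, loo = loo_elpd)  # should be in the same ballpark here
          ```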

        • LOO, PSIS-LOO and WAIC have quite similar bias and variance as long as p_eff is small, with LOO having slightly higher variance and WAIC higher bias, but as shown in Fig 2 of https://arxiv.org/abs/1507.04544, it’s possible to shrink and bias LOO so that it behaves similarly to WAIC. Most of the time these differences are not important. However, for influential observations the differences are bigger and WAIC can fail there, while LOO still works and PSIS-LOO can tell you when it’s not working well.

          The biggest problem for me with *IC is that usually the story about the balance between fit and complexity is emphasized, the connection to the predictive task is forgotten, and they get misused, for example, for hierarchical models (predicting for a new hospital and not for a new patient) and time series (not taking into account that it’s easy to predict one missing observation in the middle of a time series).
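
          A sketch of what the “new hospital” version looks like (assuming a recent rstanarm and loo; the data, and names like hospital, are invented):

          ```r
          # Hold out whole hospitals, instead of single patients, when the predictive
          # task is "a new hospital". Everything here is simulated / hypothetical.
          library(rstanarm)
          library(loo)

          set.seed(1)
          dat <- data.frame(
            hospital = factor(rep(1:10, each = 20)),
            y        = rnorm(200, mean = rep(rnorm(10), each = 20))
          )

          fit <- stan_glmer(y ~ 1 + (1 | hospital), data = dat, refresh = 0)

          loo(fit)  # ordinary PSIS-LOO: "one more patient in a known hospital"

          # Grouped folds keep each hospital together: "predict for a new hospital"
          folds <- kfold_split_grouped(K = 10, x = dat$hospital)
          kfold(fit, folds = folds)
          ```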

        • These threads really move a lot while I sleep.

          Thanks, Aki. That was my understanding, from reading the papers, so glad I didn’t misunderstand too much. I still need to sit down and simulate in context of data I understand, so I can get some intuition.

          Your biggest problem is mine as well. My book warns the reader about those things, but I fear it also encourages misuse. I thought about having an example with explicit LOO for whole clusters (hospitals) in the 2nd edition, to emphasize the issue.

          Re time series, I have ecology colleagues who resist *IC because they often want to predict outside the range of past data, making the whole prediction task rather less well defined. Also, they are less interested in total predictive accuracy than avoiding extirpation/extinction.

        • The warnings of PSIS-LOOIC are the best part. So many times students have come to me complaining about those warnings, and it turns out to be the case that the flagged data points are wrong.

        • When the oil light goes on in the car, my approach is to unscrew that little light bulb so it doesn’t bother me. I’m also a big fan of unplugging the smoke detector.

  5. After the discussion about observables vs parameters, prediction vs inference etc I went back and re-read some of Bernardo and Smith’s ‘Bayesian Theory’. I’d recommend this, despite not subscribing to the approach as such, as it’s a pretty good read on these (and other) topics.

    Be warned, I guess, that they take an unabashedly ‘subjectivist’* and operational/predictive stance, where parameter inference is a secondary, special limiting case of predictive inference (and hence limited to parameters definable as large-sample functions of observables, i.e. identifiable parameters?).

    Which means, for example, I’m not sure how well the approach fits with e.g. ODE-style models derived from other considerations of the sort familiar to scientists, engineers and applied mathematicians (if that’s what you’re after). Though Andrew seems to be shifting towards a predictivist stance while also working with ODE models and things too. So stay tuned, I guess?

    They also discuss a bit the whole M-open and M-closed issue raised in Aki’s comment above (e.g. from one of his links: “The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit”). Though as far as I remember, they don’t really give much advice for dealing with M-openness…

    (*but see Andrew and Christian’s recent ‘Beyond…’ paper)

    • I can agree with Aki’s comment if we replace “the true data-generating process” with “a good data generating process that we have reason to believe is capable of predicting well” or something like that.

      “The true data-generating process” is, as best we know, quantum physics, and quantum physics isn’t even close to any of the models used for anything much. For example, Navier-Stokes is as close as damn it to “the true data-generating process” compared to, say, predator-prey models in ecology, which are in turn as close as damn it to real compared to ideal rational-actor models in economics, which are again as close as damn it compared to thick-arms-and-voting-patterns models…

      I take the “M-open” setting to be one where we have little reason to commit to the idea that we’ve circumscribed all the models we’d be interested in comparing as of today, whereas when we really do think that one of the models in question is a very good model, we can approximate things as “M-closed”. The truly “M-closed” setting never exists, but there are settings where one of our models is an approximation that is as close as we care to spend the effort to get.

      • There are some, depending on what you are looking for, but I’m still a bit dissatisfied with what I’ve seen.

        You might find Data Assimilation: A Mathematical Approach interesting –

        http://www.springer.com/gp/book/9783319203249

        There are a few others around. Another interesting one from a different direction is ‘Nonlinear time series analysis’ by Kantz which uses tools from dynamical systems theory to do data analysis (rather than fitting prespecified models to data):

        https://www.cambridge.org/core/books/nonlinear-time-series-analysis/519783E4E8A2C3DCD4641E42765309C7

        Somewhat in-between, and recent, is Dynamic Data Analysis by Ramsay and Hooker but I haven’t read it properly yet:

        http://www.springer.com/kr/book/9781493971886

        • Appreciate the recommendations. I am still waiting on that other book you recommended to come in. Once it’s finished I’ll look to grab some of these.

          Functional Data Analysis by Ramsay et al. has been in my statistics wish list on Amazon for a couple of months now. Maybe I’ll add their Dynamics book to the priority list (itself a list of 70+ books).

          I’ll have to wait for the prices to come down, though. In case readers on this blog aren’t aware, prices on Amazon swing wildly. Sometimes books in my wish lists start at $200+ and on some random days I can grab them for $25. There’s a website that actually tracks some of the more popular items! (https://ca.camelcamelcamel.com/) ….but I usually just check the wish lists daily (not that it would be too difficult to write a script in R to check it for you)

          For the curious, Gelman’s BDA could have been purchased on Jan 18th for $10 less than its current price (https://ca.camelcamelcamel.com/Bayesian-Analysis-Third-Andrew-Gelman/product/1439840954). Which is pretty stable, actually. The rarer the book / the lower the demand, the wilder the swings, I have found.

        • Parameter estimation and inverse problems by Aster et al. is probably relevant too (the second edition has material on Bayes).

          Tarantola’s Inverse Problems book is interesting though a little idiosyncratic.

        • Oh, I should mention Jari Kaipio’s ‘Statistical and Computational Inverse Problems’, and not just coz I sometimes have coffee with him. Another good Bayesian inverse problems book, if that’s your jam. Bolker’s Ecological Data Analysis (I think that’s the title) is also quite nice; it includes some differential equation stuff and is fairly elementary.

      • My favourite book from a modelling point of view is Stochastic Processes in Physics and Chemistry by van Kampen, where the _result_ is typically a differential equation governing your stochastic process (e.g. a master equation or Fokker-Planck equation).

  6. I’d recommend the chapter of the 8th Edition of John Stuart Mill’s Logic on probability.

    • Mill, John Stuart. 1882. A System of Logic: Ratiocinative and Inductive. Eighth Edition. Franklin Square, New York: Harper & Brothers, Publishers.

    Specifically Book III, Chapter 18. I think he does a brilliant job laying out the issues of “subjective” probability. I particularly like this quote:

    We must remember that the probability of an event is not a quality of the event itself, but a mere name for the degree of ground which we, or some one else, have for expecting it. … Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence.

    He discusses several notions of probability, including classical notions of equiprobable events.

    As I’ve said before, I also believe some familiarity with measure theory is necessary to understand modern statistics (in the same way that some familiarity with manifolds is required to understand modern physics). It’s not absolutely necessary, but it sure simplifies things. I don’t know good references for this. Ash’s short book on probability is OK, but you don’t really need that level of formalism. Texts like DeGroot and Schervish cover most of the material, but it’s very very long (hundreds of pages to get through).

    • For foundations of probability, I recommend chapter 1 of BDA as we include several real-data examples in which probabilities are constructed using a combination of empirical information and subject-matter knowledge:

      – Spell checking (data on word frequencies and typing errors, plus theoretical expectations that inference from a given database would be relevant in new cases)

      – Football point spreads (data on point spreads and game outcomes, plus economic theory that point spreads should give unbiased predictions of score differentials)

      – Record linkage (data on matches and non-matches, plus whatever theory it took to construct the algorithm that was used to create the uncalibrated scores)

      In each of these cases the model is imperfect, and that’s part of probability in the real world too.
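
      For instance, the spelling example comes down to a small Bayes-rule calculation along these lines (the frequencies and error rates here are invented placeholders, not the numbers in the book):

      ```r
      # Toy version of the spell-checking idea: which word was intended, given that
      # "radom" was typed? Word frequencies and typing-error rates are made up.
      candidates <- c("random", "radon", "radom")
      prior      <- c(7.6e-5, 6.1e-6, 3.1e-7)  # p(word): relative word frequencies
      likelihood <- c(1e-3,   1e-3,   0.95)    # p(typed "radom" | intended word)

      posterior <- prior * likelihood / sum(prior * likelihood)
      setNames(round(posterior, 3), candidates)
      ```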

      • If I didn’t know you so well, I’d think you were kidding recommending the first chapter of BDA for an intro to probability. I think what you’re forgetting is that others don’t start trying to read your book already knowing the material. Your first chapter is at best a sketch of the notation you’ll use for people who’ve already taken a class in math stats. I tried reading it several times over the first, second, and third editions, and failed each time. And I was a math major—it’s not because I was afraid of calc or even analysis.

        I do think the first chapter of BDA is a great intro to the principles of Bayesian stats for someone who’s already fluent in math stats. The bullets in the second paragraph already assume you know what a conditional distribution is! Then the very first topic you discuss assumes randomness by scare-quoting the word “random”. Then you go on to exchangeability of densities without ever defining what a density is. Rather than defining what a density (or mass function) is, your first mention is in a block called “Probability notation” where you explain that p(.|.) denotes a conditional probability density and p(.) a marginal density, without ever defining these.

        For an introduction, it’s very very confusing to overload p(.|.) and p(.) the way you do (for every density) and also overloading random variables (traditionally capital letters) and bound variables (traditionally lower-case letters). I understand why you do this now, though I found it almost impossibly confusing when first trying to understand it. At the time I was trying to read it, I knew what a pdf, pmf, and cdf were, but didn’t know anything about random variable notation, so the whole thing was just frustratingly opaque.

        I’m actually trying to write an intro to probability theory right now based on proper definitions, but not a lot of theory. It’s what I wanted when I was first trying to learn the material. I didn’t need 300 pages of fluff and digressions into frequentist stats like you find in most introductions to probability theory in math stats textbooks.

        • Bob:

          1. If I didn’t know you so well, I’d think you were kidding recommending people read John Stuart Mill!

          2. I didn’t recommend the first chapter of BDA as an introduction to probability, I recommended it (in particular the three examples mentioned above, which are in sections 1.4, 1.6, and 1.7) for foundations of probability. Just about any intro probability book will cover the math of random variables. But, sure, I guess I should’ve recommended that too.

        • I think there are two kinds of foundations in play here, philosophical and mathematical.

          Mill does a great job of laying out what I think of as the philosophical foundations of probability as used in Bayesian statistics. He covers roughly the same ground as section 1.5 of BDA.

          I think of the mathematical foundation of probability as the kinds of topics covered in probability theory textbooks, like events, measure, expectations, basic laws of probability, the central limit theorem, etc.

          You presuppose knowledge of mathematical probability in BDA. The examples you cite provide examples of inference, going from prior and likelihood to posterior. I don’t see how they relate to the foundations of probability.

          Foundations of the theory of applied (Bayesian) statistics, perhaps?

        • Bob:

          Most of the discussions I’ve seen regarding the foundations of probability have had little to nothing on the idea that probability is a measurement. I think the examples of chapter 1 of BDA are useful for demonstrating what it means to measure probability (as opposed to simply defining it) in practice.

        • Hmmm, I find this extremely problematic. Clearly a prior is not a measurement, and probabilities over parameters in general are not measurements; if they were measurements, they’d be observables.

          If you replace the word probability with frequency in your statement, then sure, the BDA examples are useful demonstrations of what it means to measure frequency.

          Which gets at Bob’s point that the foundations in BDA are more the foundations of applied Bayesian statistics, not the foundations of probability.

        • Daniel, I like X’s recent posts about viewing the prior as a kind of reference measure. Then the posterior yields a probability measure interpreted relative to the prior. So, definitely not an observable, but I think is getting at what Andrew means here…

        • I don’t know that people need to read the whole first chapter of BDA to grasp the foundations of probability when they can just read the two pages that comprise section 1.5 on probability as a measure of uncertainty. Or really just the first paragraph of that section, which ends with “We take for granted a common understanding on the part of the reader of the mathematical definition of probability: that probabilities are numerical quantities, defined on a set of ‘outcomes,’ that are nonnegative, additive over mutually exclusive outcomes and sum to 1 over all possible mutually exclusive outcomes (11).”

          Some might quibble that is the definition of probability favored by frequentists, and that a Bayesian-who-shall-not-be-cited in BDA tried to provide a foundation for probability where those characteristics were theorems, and that the myriad of proofs of Cox’s theorem have been slightly controversial among commentators on this very blog, and that the move from discrete probability to continuous probability has been disputed even among Bayesians.

          But as long as “the ultimate proof [that probabilities may be a reasonable approach to summarizing uncertainty] is in the success of the applications (13)” by some unspecified criterion, that is foundational enough. Probably.
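
          For reference, that quoted definition written out in symbols (just the elementary finite version; nothing specifically Bayesian about it):

          ```latex
          % Outcomes \omega_1, \dots, \omega_K are mutually exclusive and exhaustive.
          \[
            \Pr(\omega_k) \ge 0 \ \text{for all } k,
            \qquad
            \sum_{k=1}^{K} \Pr(\omega_k) = 1,
            \qquad
            \Pr(A \cup B) = \Pr(A) + \Pr(B)
            \ \text{for disjoint } A, B \subseteq \{\omega_1, \dots, \omega_K\}.
          \]
          ```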
