Everything I need to know about Bayesian statistics, I learned in eight schools.

This post is by Phil.

I’m aware that there are some people who use a Bayesian approach largely because it allows them to provide a highly informative prior distribution based on subjective judgment, but that is not the appeal of Bayesian methods for a lot of us practitioners. It’s disappointing and surprising, twenty years after my initial experiences, to still hear highly informed professional statisticians claim that what distinguishes Bayesian statistics from Frequentist statistics is “subjectivity” (as seen in a recent blog post and its comments).

My first encounter with Bayesian statistics was just over 20 years ago. I was a postdoc at Lawrence Berkeley National Laboratory, with a new PhD in theoretical atomic physics but working on various problems related to the geographical and statistical distribution of indoor radon (a naturally occurring radioactive gas that can be dangerous if present at high concentrations). One of the issues I ran into right at the start was that many researchers were trying to use data from statewide radon surveys to make maps of radon concentrations within states. The surveys had been designed to characterize the statistical distribution of radon in each state, not the spatial distribution, so sample sizes were much higher in highly populated counties than in low-population counties.

Within the counties with lots of measurements, the statistical distribution of radon measurements was roughly lognormal, with a geometric standard deviation of around 3 (a dimensionless number) and a geometric mean that varied from county to county. One of the first datasets I looked at in detail was from Minnesota, where counties with more than 100 radon survey measurements had observed geometric means as low as 83 becquerels per cubic meter and as high as 136 becquerels per cubic meter.

100 radon measurements from a county is enough to characterize the county’s radon concentration pretty well, especially if one is willing to assume the statistical distribution of measurements is lognormal. For instance, one of the parameters of interest to policy-makers is the expected fraction of homes with radon concentrations exceeding the recommended “action level” of 150 becquerels per cubic meter, and with 100 measurements you could estimate that fraction to within a few percentage points, which was close enough.
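
For concreteness, here is a small sketch of both calculations in R, using a made-up geometric mean rather than the actual survey data: the lognormal tail fraction above the action level, and roughly how precisely 100 measurements pin that fraction down.

```r
## Illustrative sketch only: hypothetical county with GM = 120 Bq/m^3 and GSD = 3.
gm     <- 120        # hypothetical geometric mean (Bq/m^3)
gsd    <- 3          # geometric standard deviation, as in the post
action <- 150        # action level (Bq/m^3)

## Fraction of homes above the action level under the lognormal assumption
true_frac <- 1 - plnorm(action, meanlog = log(gm), sdlog = log(gsd))

## How well do 100 measurements estimate that fraction?
set.seed(1)
est_frac <- replicate(2000, {
  y <- rlnorm(100, meanlog = log(gm), sdlog = log(gsd))
  1 - plnorm(action, meanlog = mean(log(y)), sdlog = sd(log(y)))
})
round(c(true_fraction = true_frac, sd_of_estimate = sd(est_frac)), 3)
```

The standard deviation of the estimated fraction should come out to a few percentage points, consistent with the claim above.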

But what about counties with many fewer measurements? Many counties had a number of measurements in the single digits. Unsurprisingly, by far the lowest and highest observed geometric mean radon concentrations were from counties with very few measurements: if you only measure two homes in a county, 4% of the time they will both come from the highest 20% of homes in the county, so your observed geometric mean will be much higher than the true geometric mean, and 4% of the time they will both come from the lowest 20% of homes in the county, with the opposite result. I still remember, 20 years later, that the highest county geometric mean was in Lac Qui Parle County, with just two sampled homes and an observed geometric mean of 500 becquerels per cubic meter.

Using the statistics language S-Plus (this was years before R was released), I created my own model of what I thought was happening: I could choose a statistical distribution of “true” geometric means and geometric standard deviations, simulate the survey sampling process, and look at the resulting statistical distribution of simulated county geometric means and standard deviations. It was evident right away that if some of the true geometric means were near 500 becquerels per cubic meter, then some of the sampled geometric means were very likely to be far higher than that: the statistical distribution of measurements is far wider than the statistical distribution of true values. So it seemed to me extremely unlikely that the geometric mean radon concentration of all of the homes in Lac Qui Parle County was anywhere near 500 becquerels per cubic meter. Instead, the true geometric mean was probably much lower, and the survey had just happened to sample two homes from the upper end of the distribution. But however certain I was, how could I come up with an estimate and an uncertainty for the county’s true geometric mean radon concentration? I asked this question of the UC Berkeley statistics professor who gave us occasional advice, but didn’t get a useful answer. That professor’s attitude was: I have an “unbiased” estimate of the geometric mean, 500 becquerels per cubic meter, and an uncertainty in that quantity…what else could I hope for? This was very unsatisfying to me, especially since it was clear that the “unbiased” estimate was in fact far more likely to be too high than too low. But being unsatisfied didn’t get me closer to having a good answer.
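
Here is a minimal sketch of that kind of simulation in R (rather than S-Plus), with invented numbers standing in for the actual Minnesota survey: draw “true” county geometric means, simulate a small survey in each county, and compare the spread of the observed geometric means to the spread of the true ones.

```r
## Sketch only: invented "true" county geometric means, not the real survey.
set.seed(2)
n_counties <- 85
true_gm <- rlnorm(n_counties, meanlog = log(120), sdlog = 0.3)  # hypothetical spread of true GMs
gsd     <- 3                                                    # within-county GSD, as in the post
n_homes <- sample(2:30, n_counties, replace = TRUE)             # only a handful of homes in many counties

obs_gm <- sapply(seq_len(n_counties), function(j) {
  y <- rlnorm(n_homes[j], meanlog = log(true_gm[j]), sdlog = log(gsd))
  exp(mean(log(y)))   # observed geometric mean for county j
})

## The observed GMs are spread much more widely than the true GMs,
## with the most extreme values coming from the counties with the fewest homes.
range(true_gm)
range(obs_gm)
```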

Fortunately, my high school friend Andrew Gelman was a new statistics professor at Berkeley, and he let me sit in on his graduate-level course on Bayesian data analysis. Quite early in the course we were introduced to “hierarchical” models, of which the first example was an analysis (originally by Don Rubin) of the effectiveness of classes that were designed to improve students’ performance on the Scholastic Aptitude Test. I won’t go into details here; if you’re interested, find any edition of Bayesian Data Analysis (Gelman et al.). To perform the evaluation, Rubin first estimated the effect (and its uncertainty) of the training, on average, in each of the eight schools. For our purposes, we take those estimates as given:

School   Estimated treatment effect   Standard error of the estimate
A                    28                             15
B                     8                             10
C                    -3                             16
D                     7                             11
E                    -1                              9
F                     1                             11
G                    18                             10
H                    12                             18

Now, suppose I want to estimate the “true effect size” in School A. One argument is that the best estimate is 28 points: I have an unbiased estimate and a standard error; what else can I hope for?

On the other hand, the standard error is larger than the estimated effect in most of the schools. A hypothesis test fails to reject the hypothesis that the true effect size is the same — 8 points, the precision-weighted average of the eight estimates — in all of the schools. So another argument is that my best estimate of the effect in School A is 8 points. After all, the “same course” was taught in each of the schools, so each of these eight schools gives me a single estimate of the same quantity, namely the course effectiveness.

Rubin thought (and I think most people would agree) that neither of those arguments represents our state of knowledge. The first argument treats School A completely in isolation, ignoring the considerable evidence that courses similar to the one taught in School A have typical effect sizes of less than 20 points. The second argument assumes that the true effect in all schools is exactly equal, in spite of the courses being taught by different teachers to different students. Rubin suggested a sort of middle path: (1) assume that each school’s “true effect” is drawn from a normal distribution with unknown mean and standard deviation, and (2) assume the observed effect in each school is sampled from a normal distribution with a mean equal to the true effect and a standard deviation given in the table above. (As a physicist I recognized this as an “inverse problem”: we are given the results in the table above, and we have to work backwards to deduce the underlying parameters that produced them.)
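
In the usual notation, with $y_j$ the observed effect and $\sigma_j$ the standard error for school $j$ (taken as known from the table above), Rubin’s model is

$$\theta_j \sim \mathrm{N}(\mu, \tau^2), \qquad y_j \mid \theta_j \sim \mathrm{N}(\theta_j, \sigma_j^2), \qquad j = 1, \dots, 8,$$

with a prior distribution still to be chosen for the mean $\mu$ and standard deviation $\tau$ of the true effects.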

An interesting feature of Rubin’s model is that it contains both of the conventional approaches discussed above as special cases: if we force the standard deviation of true effects to be zero, then all of the schools end up with the same estimated true effect (with a central estimate of 8 points), whereas if we force the standard deviation of true effects to be infinite, then each school’s estimated true effect is equal to its observed effect. One can play around with a few different prior distributions for the standard deviation of true effects and find that it doesn’t really matter: any reasonable choice that doesn’t force the standard deviation to be very small or very large turns out to give about the same posterior distribution. School A ends up with an estimated true effect of 10 points, with about 50% of the probability between 7 and 16 points.
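
For anyone who wants to play along, here is a sketch of the standard grid-plus-simulation computation for this model (the approach described in Bayesian Data Analysis), assuming a uniform prior on the mean and standard deviation of true effects; the inputs are the estimates and standard errors from the table above.

```r
## 8-schools data from the table above
y     <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)

## Marginal posterior of tau (between-school sd) on a grid, with mu integrated out,
## assuming uniform priors on mu and tau.
tau_grid <- seq(0.01, 40, length.out = 2000)
log_post <- sapply(tau_grid, function(tau) {
  V      <- 1 / sum(1 / (sigma^2 + tau^2))        # posterior variance of mu given tau
  mu_hat <- V * sum(y / (sigma^2 + tau^2))        # posterior mean of mu given tau
  0.5 * log(V) - 0.5 * sum(log(sigma^2 + tau^2)) -
    0.5 * sum((y - mu_hat)^2 / (sigma^2 + tau^2))
})
post_tau <- exp(log_post - max(log_post))
post_tau <- post_tau / sum(post_tau)

## Simulate from the joint posterior: tau, then mu | tau, then theta_A | mu, tau.
set.seed(3)
n_sims <- 5000
tau    <- sample(tau_grid, n_sims, replace = TRUE, prob = post_tau)
V      <- 1 / sapply(tau, function(t) sum(1 / (sigma^2 + t^2)))
mu_hat <- sapply(tau, function(t) sum(y / (sigma^2 + t^2)) / sum(1 / (sigma^2 + t^2)))
mu     <- rnorm(n_sims, mu_hat, sqrt(V))

prec    <- 1 / sigma[1]^2 + 1 / tau^2             # School A is the first entry
theta_A <- rnorm(n_sims, (y[1] / sigma[1]^2 + mu / tau^2) / prec, sqrt(1 / prec))

quantile(theta_A, c(0.25, 0.5, 0.75))   # compare with the ~10 (7 to 16) quoted above
```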

The thing that was so great about this example at the time was that it was almost exactly like the data and model I was already using for my indoor radon data. If I worked with the logarithms of the radon measurements, then I had data that were approximately normally distributed within each county, and I was already assuming that the county means were (log)normally distributed in my simulation program. So immediately after leaving Andrew’s classroom I went back to my office, coded up the calculations, and came up with estimates for the geometric mean radon concentration in every county in Minnesota. A few weeks later, I improved the model by including a couple of explanatory variables in a linear regression. The final estimate for Lac Qui Parle County: 192 +/- 29 becquerels per cubic meter, much lower than the geometric mean of its two observations, but higher than that of the typical Minnesota county. (In the unlikely event that you’re interested in the details, including model checking, you can read Price PN, Nero AV, and Gelman A, “Bayesian prediction of mean indoor radon concentrations for Minnesota counties,” Health Physics 71:922-936, 1996.)
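
To give a flavor of how the same partial-pooling structure looks for radon, here is a sketch on simulated data (not our actual dataset), using a classical mixed model via lme4 (one of the non-Bayesian routes to essentially the same partial pooling that comes up in the comments below), with log radon regressed on a hypothetical county-level predictor and partially pooled county intercepts.

```r
## Sketch only: simulated data standing in for the survey, and lme4's lmer()
## as one convenient way to get partially pooled county estimates; a full
## Bayesian fit would also propagate uncertainty in the variance parameters.
library(lme4)

set.seed(4)
n_county  <- 40
n_per     <- sample(c(2, 5, 20, 100), n_county, replace = TRUE)  # some counties with only 2 homes
county    <- factor(rep(seq_len(n_county), n_per))
uranium   <- rep(rnorm(n_county), n_per)          # hypothetical county-level predictor
true_mean <- 4.7 + 0.5 * uranium + rep(rnorm(n_county, 0, 0.3), n_per)
log_radon <- rnorm(length(county), true_mean, log(3))            # within-county GSD of 3

d   <- data.frame(log_radon, uranium, county)
fit <- lmer(log_radon ~ uranium + (1 | county), data = d)

## Partially pooled county estimates, back-transformed to geometric means (Bq/m^3)
cc        <- coef(fit)$county
uranium_c <- tapply(d$uranium, d$county, mean)    # the county-level predictor values
county_gm <- exp(cc[, "(Intercept)"] + cc[, "uranium"] * uranium_c)
head(sort(county_gm, decreasing = TRUE))
```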

Although the characteristic of Rubin’s 8-schools model that I most appreciated at the time was its close match to the problem I was working on, the thing I most appreciate about it now is that it cleanly illustrates almost every theme of Bayesian modeling that I have encountered in the past twenty years: where do models come from; what happens if you pick the “wrong” form of the prior distribution; how do you check the sensitivity of the results to your assumptions…  I feel like once you’ve done the 8-schools problem, the rest of Bayesian hierarchical modeling is just tinkering with the details.

At the time that we did the radon work described above, Bayesian data analysis was not at all mainstream. Indeed, in some ways it was controversial.  Many of Andrew’s colleagues at Berkeley were openly hostile towards, or mocking of, Bayesian statistics, and I once sat in a conference room in slack-jawed wonder as I heard two UC Berkeley statistics professors discussing some Bayesian work that had just been presented in terms that were both shockingly ignorant and nearly slanderous.  They really just had no idea what they were talking about. Anyone who has ever worked through the eight schools problem knows more than those guys did. The anti-Bayes prejudice of the UC Berkeley faculty was a major factor in Andrew leaving for Columbia, which is a pity and I’m still unhappy about it.

One of the criticisms those professors brought up was that “of course, Bayesian analysis is so subjective…”  But in fact, the normal-normal model that we used for the radon analysis was no more (or less!) “subjective” than the normal-normal model our Frequentist project advisor had suggested using for an analysis of variance of those same data; indeed, it was the same model!

As for me, I am completely indifferent to whether a given statistical approach is “Frequentist” or “Bayesian”; what I want to know is, does it work for my problem? If there’s a Frequentist method for estimating the radon concentration in Lac Qui Parle County, Minnesota that is better — by which I mean either more accurate in its central estimate or providing more accurate uncertainty estimates, or both — then I want to use it.

That said, I am almost always dealing with situations in which the underlying parameters of interest (theta) are observed subject to error (measurements y), which is to say I am trying to find p(theta | y), and most or all Bayesian methods are designed to address exactly that issue. So I find myself using Bayesian methods a lot. But that’s a pragmatic decision based on what works for my problem, not a philosophical decision based on what school of statistical philosophy I want to associate myself with.
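
In symbols, that is just Bayes’ rule: the posterior for the parameters given the measurements is

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta),$$

which is exactly the “work backwards from noisy measurements to the underlying parameters” problem described above.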


35 thoughts on “Everything I need to know about Bayesian statistics, I learned in eight schools.”

  1. Phil:

    I agree with just about everything you wrote above (of course), and this fits in with two of my general themes regarding comparison of statistical methods: (1) experienced researchers are typically good with the methods that they are familiar with, and (2) different methods can be good for different problems. Regarding point 1, perhaps if you’d been working at Lawrence Stanford National Laboratories and had been a family friend of Paul Switzer, you would have ended up having success with some method from spatial statistics. Regarding point 2, as you note above, hierarchical Bayes is particularly appropriate in sparse-data settings such as the problem of estimating the average home radon concentration in a county where only two measurements are available.

    But your post makes me think of another issue: (3) you can end up adapting your problem to available methods. For example, our familiarity with the 8-schools example made it natural to see the radon problem in that way. But if you’d not known about hierarchical Bayes, you might have just given up on the goal of estimating the average home radon concentration in Lac Qui Parle County. Instead, for example, you could have come up with an adaptive procedure to pool adjacent counties until the total sample size reached some predetermined level that would represent stable estimation. Such a procedure would require research, and it might be that there was some existing method in spatial statistics that was a good fit to this problem. And you’d be saying, “Everything I need to know about spatial smoothing, I learned from the *** example.”

    And it gets better than that. Spatial smoothing is, like hierarchical Bayes, an open-ended methodology. Beyond the challenges of deciding which groups of adjacent counties to combine (and the opportunities to get uncertainties in the estimates by making the algorithm stochastic and bootstrapping the smoothing and estimation process), the approach can be extended by using covariate information (for example, county-level soil uranium measurements) to inform the choice of which counties to combine.

    And the procedure could even take on a partial-pooling flavor without being Bayesian. For example, one could compute estimates at many different levels of combination (for example, combine until each super-county has at least 10 measurements, or 20, or 50, or 100), then fit some 1/sqrt(n) type curve of the variation as a function of the number of observations in the county-bundles, then for each county (including Lac Qui Parle!) take some weighted average of the local observations and the observations in larger and larger bundles. This could end up giving estimates that are just as good as what we got using the 8-schools model, but without ever performing a Bayesian analysis.

    None of the above contradicts your point. In your problem, a hierarchical Bayes approach did work well. Indeed, lots of researchers—even those who didn’t go to high school with me!—have found Bayesian inference to be helpful; it’s a great way to deal with models with lots of parameters and local information. Often, what makes a statistical approach useful is not just that it can solve a problem, but that the tools for the solution are readily available and that there is some earlier, similar analysis that can be used as a template. And of course there are lots and lots of non-Bayesian examples of this too.

    • Andrew, you’ll recall that we did (with your grad student John Boscardin) develop a way of using variograms to incorporate spatial information. So it’s not that we had to choose between spatial methods and Bayesian methods; we could and did use Bayesian spatial methods. It’s a pity that work was never published. John did a great job on it and should have gotten a publication out of it, and indeed he wrote 80% of a paper and I kept thinking I’d find time to finish it off, but that project had run out of time and money. Ah, well, whaddyagonnado.

      Also, in later work I used surface geology type indicators as additional regression variables, recognizing that part of the reason they were useful was simply that they did some spatial pooling: I compared the behavior of the actual geologic types to similarly sized random clumpings of contiguous counties and found that the actual geologic types performed only slightly better (in terms of predictive accuracy) than the fake ones. This approach is not as good as a genuine spatial model — the better approach would have been to use the geologic information as well as spatial information in an explicitly spatial model — but it was very easy to do with my existing machinery, and I judged, correctly I think, that the small additional benefit of incorporating the spatial model wouldn’t have been worth it.

      Just a year or two later, when BUGS became available and it was much easier to fit more complicated models, that calculation might have changed, but by then I had moved on to other things. Also, it’s just generally easier to do everything now. I don’t remember exactly how I got the latitude and longitude of each zip code, but I remember it took a long time…maybe I had to use gopher or archie to download it from somewhere but it took a long time to figure out where, or maybe I ordered a diskette from some company and it took a week to arrive. The 8 Schools problem is now a 10-minute exercise in Stan, and zip code centroids can be found in a few seconds using your favorite search engine. These kids today, they don’t know how good they have it. We used to live in a cardboard box in the middle of the road, get up at 3 a.m., and head to the mines.

      • Phil:

        Yes, I didn’t mean to imply that spatial and Bayes are mutually exclusive. What I meant to say was that we could’ve ended up with estimates very similar to what we got, but traveling a much different road, based on adaptive spatial aggregation and smoothing rather than hierarchical modeling.

  2. Phil:
    > I am trying to find p(theta | y)
    Not me, I am just trying to get a sufficiently self-correcting/evolving process to get well reasoned guesses at (a fictional theta) that’s adequately purposeful. And I think Bayesian methods are often the best bet.

    But otherwise, I agree with most of your post.

    In fact, my thesis can be seen as an abandoned attempt at a general frequentist-based approach to the 8-schools problem – http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf – and the history stuff there is quite relevant. It was abandoned as I realised that Bayesian methods, at least for me, were a better way forward (though the third-order asymptotic theory stuff was so much fun).

    There are frequentist methods to bring to bear on these problems (as Andrew notes); the strange thing maybe is the way the full Bayesian hierarchical model seemed to arise so late. Cochran may have been one of the first with a Normal-Normal model, and he did a number of applied examples with Yates (1937 & 38) – both of which seemed to enrage Fisher, who had very different ideas (e.g., ignore the individual school standard error estimates, except for model testing; treat each school treatment estimate as just a random outcome and use my t-test).

    p.s. Don Rubin told me once that he was completely unaware that Cochran (his PhD supervisor) had worked on this problem.

    • Playing Devil’s advocate:

      1. In the radon example the hierarchical Bayes estimate for Lac Qui Parle (LQP) looks more reasonable than the completely unpooled estimate. But in a way this is circular: it looks reasonable because it is more in accordance with our priors. Why? Because we used our priors to estimate it. If we judge success by what we think is reasonable, then including what we think is reasonable in the estimation is likely to be “successful”, almost by definition.

      2. I think to get at success we need to put the method in the context of a decision, like the allocation of radon abatement efforts, or whatever. And here we get to the bit O’Rourke mentioned about a “sufficiently self-correcting/evolving process”. Using unpooled estimates we would worry about LQP enough to perhaps do a second round of (larger) sampling. Perhaps we would learn that indeed radon is very high in that county, or not.

      So I think the right comparison is which method converges faster to the “truth”, or at least avoids the most harmful consequences from radon exposure, for a given cost, over a certain time period. I suspect both will have similar convergence properties but different variances. Different loss functions may give different assessments of methods.

      • Anonymous:

        I’m happy to say that you asked questions that we have already answered!

        1. Take a look at our 1996 paper. We demonstrate the effectiveness of our model using cross-validation. The unpooled estimate does not perform so well, as of course it shouldn’t, from basic mathematical principles.

        2. Our analysis was all in service of decision making. See our 1999 paper for all the details. We even go into detail on the loss function (a tradeoff between dollars and lives).

        • The validation is done using the same sample, which is OK, but it does not get at the behavior over repeated sampling. Also, it focuses on counties with more than 50 observations. And, as far as I could tell, it is not set up as a straight horse race with the unpooled model.

          Note that if I had to bet on that horse race my money would be on hierarchical Bayes. But I’m just saying that the case could be more watertight if framed differently.

          As for the second paper, it looks very interesting! Will have to read carefully.

      • Anonymous:

        > Using unpooled estimates we would worry about LQP enough to perhaps do a second round of (larger) sampling.

        OK, but why not first get unpooled, pooled, and partially pooled estimates and see which better predicts the larger sample?

        Either that result was idiosyncratic, or you are off to a start in learning that this usually happens. There are other studies where a second, larger sample was obtained, or you can fake this with cross-validation (use a small subset to predict the larger set) and then extend with mathematical analysis or simulations more generally. Having identified where partial pooling does not work well, you can “test” whether the situation at hand is likely one of those. Then you repeat the above process to assess that testing.

        I prefer to separate accurately representing uncertain quantities from making optimal decisions given that representation, since decisions are based on arbitrary (preference-based) loss functions.

        • @K?:

          The hierarchical estimation is especially useful when measurements are very sparse (e.g., two measurements in LQP). That is not an ideal candidate for cross-validation (the paper AG referred to only does CV in those counties with 50 or more observations; perhaps these are in a class of their own). There is only so far you can go by pulling on your own bootstraps.

          More and more, I do not know what “accurately” means separate from the loss function. A lot of evaluation-laden decisions go into computing an estimate (regularization, minimizing squared residuals, absolute distance, etc.) before we even decide what to do.

    • K?, you say you’re not trying to find p(theta | y), you’re trying to “get well reasoned guesses at (a fictional theta)”. I guess I don’t see the difference there. I, too, am interested in “well reasoned” estimates (which I guess can be called guesses). So I could say “I’m trying to find a well reasoned guess at p(theta | y)”. Perhaps it will become clearer to me when I read your thesis.

      Me, I’m almost always interested in the underlying parameters rather than in direct features of the data. I want to know “the probability that Denver will beat Seattle in the Super Bowl, given the teams’ performances during the year and the latest injury report”, or “the probability the snowpack in Northern California will end up at less than 20% of normal for the season, given the current state of the snowpack and the weather.”

      Like you, I find it strange that some current bread-and-butter Bayesian methods took so long to become standard. I did not know about Cochran and Yates specifically. Considering how old Bayes’ Theorem is, it is remarkable how long it took to develop useful methods for application. I’m sure computational difficulties were a big part of that.

      • “I’m almost always interested in the underlying parameters rather than in direct features of the data.”
        So you believe that these parameters exist?

        • Christian:

          In a sampling setting such as the radon problem, the underlying parameter theta_j in county j is essentially identical to the average radon level of all the houses in county j (or something like that), and that indeed exists.

        • Christian raises a good point though. There are probabilities that are neither 0 nor 1 because they involve “true” randomness (or at least we think they do), like “will this radon atom undergo radioactive decay within the next hour”, and there are “probabilities” that are nonzero only because we don’t know the answer. At the moment the roulette ball is released, whether it will land on 00 is determined; the fact that we assign a probability of 1/38 reflects our ignorance of the outcome, not genuine indeterminacy. The statistical distribution of possible snowpack in the Sierra Nevada this year is probably another example of ignorance rather than true randomness, being already effectively determined, beyond the reach of quantum mechanical fluctuations. So I could have agreed with K? that I, too, am often interested in a “fictional theta” rather than a real theta.

          In other cases, though, there really is a real theta. As Andrew says, we were interested in “what geometric mean radon concentration would we obtain if we measured the radon concentration in every house in the county?” That quantity “exists” in the sense that there is a correct answer, even though we don’t know what it is. There’s also a right answer to “what fraction of white female voters in Connecticut in the last election voted for Obama,” to give an example that applies to Andrew’s research.

          So, sometimes the underlying parameter I’m interested in does “exist”, and sometimes it doesn’t. But even if it doesn’t, I want the best estimate of it I can get!

      • I am just objecting to p(theta | y) as being _the_ answer with a _self-evident_ measure of quality (and there is little of that in my thesis).

        If it’s a well-constructed and tested two-stage gambling machine, then, having observed y, p(theta | y) is the answer and it can’t be any better – but this is seldom the case in empirical research, and so some sense of repeated performance is needed.

        > I want to know “the probability that Denver will beat Seattle in the Super Bowl
        Well, we all do, but it’s probably better to settle for not being repeatedly too wrong about whether team X will beat team Y in game Z (for some reference set).

        > I’m sure computational difficulties were a big part of that
        You are likely right given Cochran was still unsuccessfully struggling with just the likelihoods (~1980) as they can be multimodal.

        • Keith:

          It wasn’t just computational difficulties. If you read Rubin’s 8-schools article (it came out in 1980 or 1981), you’ll see that he struggled a bit with the whole idea of averaging over the posterior distribution of the hyperparameter. The entire analysis, which could be done in about 5 seconds today, was spread out over several pages, partly because Rubin seemed to feel a bit uncomfortable (or maybe he felt his audience would be uncomfortable) with just treating it as a full Bayesian problem. Recall that, in the 1970s, there was a lot of stuff on “empirical Bayes,” but it typically was based on a point estimate of the hyperparameters. And there was lots of agonizing over “exchangeability,” etc. It’s only after decades of experience with multilevel models that we easily understand how to think about such issues.

          It would be worth doing a fuller historical study here (to complement the article that Christian Robert and I wrote regarding the anti-Bayesians of the postwar period).

  3. The school example just seems to me to be combining two estimates, A and F (where A is the estimate for School A, and F is the mean across the schools). One (non-Bayes) way to do this is to simply weight by c and (1-c), so cA + (1-c)F is the estimator of the effect. Varying c between 0 and 1 gives the full range of combinations, and the standard errors are calculated in the usual manner. I suppose my point is that this is being presented as an example of how Bayesian inference offers something frequentist inference does not, and I don’t see that.

    • Numeric:

      See my comment above (the first comment on this post). The short answer is that there are many roads to Rome. To use your notation, the question is what value of “c” to use. It’s easy to derive a good value of c (which will depend on sample size) using the Bayesian algebra. Another way to do it would be to guess a functional form for c using statistical intuition (apparently, R. A. Fisher was good at that sort of thing) and then go from there. Depending on how you do the derivation, you can get back to the standard formula in various ways. I’ve found the Bayesian approach to work well in that it is easy to add predictors, spatial structure, etc., and the details pretty much take care of themselves. But indeed there are other ways to approach the problem.

      On the other hand, the success in recent years of Bayesian inference should be telling us something. Partial pooling comes up often enough that it’s good to have a general way to handle it, rather than having to come up with some clever functional form for weighted averaging that has to be developed anew for each problem.

      • Sorry–I tuned out the first comment when I saw “I agree with just about everything you wrote”, expecting some triumphalism. Credit to you for discussing alternative approaches rather than engaging in self-congratulations.

        Incidentally, in my simple formulation (estimator = cA + (1-c)F), the value of c that gives 10 (the Bayesian answer) is .12, or about 1/8th. I interpret this as giving equal weighting to the results in all schools (since there are 8), and then taking the unique contribution of a particular school to adjust the score for that school. But, of course, the 1/8th is pretty close to the non-informative prior, which we aren’t supposed to use any longer (or, hardly ever).

        • numeric, c should (and, in the Bayesian analysis, does) depend on the standard error of the estimate for each school, so it should differ from school to school. That’s not a very large effect with the schools, whose standard errors only vary from 9 to 18, but it’s a big effect in the radon analysis. In general it certainly would not be good to use 1/n where n is the number of units.
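
          (To spell out that dependence: in the normal-normal model from the post above, conditional on the population mean $\mu$ and the between-school standard deviation $\tau$, the partially pooled estimate for school $j$ is $c_j y_j + (1 - c_j)\mu$ with

          $$c_j = \frac{1/\sigma_j^2}{1/\sigma_j^2 + 1/\tau^2},$$

          so a school with a larger standard error $\sigma_j$ gets a smaller $c_j$ and is pulled more strongly toward the overall mean.)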

        • Phil:

          The 1/n weight was the one Fisher argued could be the only reasonable one. I’m not sure anyone other than him would know why, but it does provide what some have called a robust random effects model.

          Also, rather than having a hyper-prior, a prior, and a data model, one could arguably call the prior and data model together a compound data model, and claim not to have a hyper-prior and hence not to be Bayesian – as I once did here http://link.springer.com/chapter/10.1007/978-1-4613-0141-7_11 .

          But for anything much other than Normal assumptions with the within-study variances assumed/taken as known (which is often quite wrong), how to do it might well be an (un)reasonable PhD thesis topic.

        • Numeric:

          The term “noninformative prior” is of course not clearly defined. But in regression models it is often used to denote the no-pooling model, i.e., c=1 in your notation, which is very far from the c=1/8 that you give; indeed, it’s just about at the other end of the scale.

  4. I think a remark of Phil’s points to a common slip: “situations in which the underlying parameters of interest (theta) are observed subject to error (measurements y), which is to say I am trying to find p(theta | y).”

  5. Hmm… I think I side with “numeric” here – how would the results differ compared to using a standard, frequentist mixed model?
    I’m using Bayesian stats now, but I was trained at a frequentist university. It seems to me that the main advantage of the Bayesian approach is the ease of incorporating a wide range of different components in a model, with a wide range of parametric structures. But in cases such as these I’m struggling to see the added benefit of Bayes. Any possibility of shedding some light on that?

    • Gustaf:

      Most of the benefit comes from partially pooling the county-level coefficients. I don’t care if you call that Bayesian or not; I don’t think Phil cares, either. Over the years, I’ve encountered various non-Bayesians who do not want to do partial pooling; Phil discusses this in his post above. Those non-Bayesians refuse to fit “a standard, frequentist mixed model”; their ideology is holding them back. In my opinion, once you fit the mixed model or the hierarchical model or the multilevel model or whatever you call it, you’re most of the way there.

      Bayesian ideas do become helpful when the number of groups is small; Rubin discusses this in the original 8-schools paper from 1980 or 1981.

      • Although many (most?) non-Bayesians (and perhaps Bayesians) initially pushed back strongly on any sort of pooling.

        In conversations with others who worked on meta-analysis in the 1980s, most recalled some outright dismissal of the idea.

        I mentioned some of mine here http://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/#comment-73427 and this was from really smart statisticians, some of them currently regarded as being in the top 5-10% of academic statistics.

        • Keith:

          Yes, we’ve moved past the “It’s Bayesian, don’t do it!” stage to the “Everybody does it, it’s not particularly Bayesian!” stage. An improvement, I think.

      • Just went through Rubin’s article. Apparently an early nickname for posterior predictive checks was “phenomenological Bayesian monitoring” :-)

        A Google search only points to this other late-1970s article by Rubin, Multiple Imputations In Sample Surveys – A Phenomenological Bayesian Approach To Nonresponse. Rubin gives a reason for this choice of adjective:

        “I am sympathetic with the imputation position for two reasons. First, it is phenomenological in that it focuses on observable values. There do not exist parameters except under hypothetical models; there do, however, exist actual observed values and values that would have been observed. Focusing on the estimation of parameters is often not what the applied person wants to do since a hypothetical model is simply a structure that guides him to do sensible things with observed values.”

  6. Pingback: Posterior samples | Sam Clifford

  7. This is one of my favorite stories in statistics, one that I would like to refer to when someone asks me “what are you doing with your statistics?”. Thanks, Phil!

  8. Pingback: Creating a Lenin-style democracy « Statistical Modeling, Causal Inference, and Social Science

  9. Pingback: Running Bayesian models | Sam Clifford

  10. Pingback: How about zero-excluding priors for hierarchical variance parameters to improve computation for full Bayesian inference? - Statistical Modeling, Causal Inference, and Social Science

  11. Thank you so much, this was great!!

    I have read your friend Andrew Gelman’s BDA book. It has many excellent examples, and it is the benchmark against which I feel I can check my level of knowledge and understanding.

    Unfortunately, like 90% of stats textbooks, it is a slog to get through. His examples are clear, but the text is overly erudite and not at all clear. I am a data scientist of many years, and I find myself often reading both stats and CS books. The CS books are nearly always clearer and easier to read, and I do not think this is because the topics are simpler.

    You have shown in this article that it IS possible to explain stats concepts and have it make sense. Thank you!!
