
I have read your friend Andrew Gelman’s BDA book. It has many excellent examples, and it is the benchmark against which I check my level of knowledge and understanding.

Unfortunately, like 90% of stats textbooks, it is a slog to get through. His examples are clear, but the prose is overly erudite and hard to follow. I am a data scientist of many years, and I find myself often reading both stats and CS books. The CS books are nearly always clearer and easier to read, and I do not think this is because the topics are simpler.

You have shown in this article that it IS possible to explain stats concepts and have it make sense. Thank you!!

Keith:

Yes, we’ve moved past the “It’s Bayesian, don’t do it!” stage to the “Everybody does it, it’s not particularly Bayesian!” stage. An improvement, I think.

Just went through Rubin’s article. Apparently an early nickname for posterior predictive checks was “phenomenological Bayesian monitoring” :-)

A Google search only points to this other late-1970s article by Rubin, “Multiple Imputations In Sample Surveys – A Phenomenological Bayesian Approach To Nonresponse.” Rubin gives a reason for this choice of adjective:

“I am sympathetic with the imputation position for two reasons. First, it is phenomenological in that it focuses on observable values. There do not exist parameters except under hypothetical models; there do, however, exist actual observed values and values that would have been observed. Focusing on the estimation of parameters is often not what the applied person wants to do since a hypothetical model is simply a structure that guides him to do sensible things with observed values.”

Although many (most?) non-Bayesians (and perhaps some Bayesians) initially pushed back strongly on any sort of pooling.

In conversations with others who worked on meta-analysis in the 1980s, most recalled some outright dismissal of the idea.

I mentioned some of mine here http://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/#comment-73427 and this was from really smart statisticians, some of them currently regarded as being in the top 5-10% of academic statistics.

Gustaf:

Most of the benefit comes from partially pooling the county-level coefficients. I don’t care if you call that Bayesian or not; I don’t think Phil cares, either. Over the years, I’ve encountered various non-Bayesians who do not want to do partial pooling; Phil discusses this in his post above. *Those* non-Bayesians refuse to fit “a standard, frequentist mixed model”; their ideology is holding them back. In my opinion, once you fit the mixed model or the hierarchical model or the multilevel model or whatever you call it, you’re most of the way there.

Bayesian ideas do become helpful when the number of groups is small; Rubin discusses this in the original 8-schools paper from 1980 or 1981.

I’m using Bayesian stats now, but was trained at a frequentist university. It seems to me that the main advantage of the Bayesian approach is the ease of incorporating a wide range of different components in a model, with a wide range of parametric structures. But in cases such as these I’m struggling to see the added benefit of Bayes. Any possibility of shedding some light on that?

Phil:

The 1/n weight was the one Fisher argued could be the only reasonable one. I’m not sure anyone other than him would know why, but it does provide what some have called a robust random-effects model.

Also, rather than having a hyper-prior, a prior, and a data model, one could arguably lump the prior and data model together into a compound data model, claim not to have a hyper-prior, and hence claim not to be Bayesian – as I once did here: http://link.springer.com/chapter/10.1007/978-1-4613-0141-7_11 .

But for anything much beyond Normal assumptions with the within-study variances assumed/taken as known (which is often quite wrong), how to do it might well be an (un)reasonable PhD thesis topic.

Numeric:

The term “noninformative prior” is of course not clearly defined. But in regression models it is often used to denote the no-pooling model, i.e., c=1 in your notation, which is very far from the c=1/8 that you give; indeed, it’s just about at the other end of the scale.

Numeric, c should (and, in the Bayesian analysis, does) depend on the standard error of the estimate for each school, so it should differ from school to school. That’s not a very large effect with the schools, whose standard errors only vary from 9 to 18, but it’s a big effect in the radon analysis. In general it certainly would not be good to use 1/n where n is the number of units.
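As a rough sketch of that dependence: reading the earlier notation (estimator = cA + (1-c)S) with A as a school’s own estimate and S as the pooled average, so that c=1 is no pooling, the standard normal-model weight is c_j = tau^2/(tau^2 + sigma_j^2). The value of tau (the between-school sd) below is an assumption chosen for illustration only, not a number from the post:

```python
import numpy as np

# Sketch (illustrative only): how the pooling weight varies with each
# school's standard error. tau, the between-school sd, is an assumed value.
tau = 10.0
sigma = np.array([9., 10., 10., 11., 11., 15., 16., 18.])  # 8-schools-style standard errors

# Weight on the school's own estimate A in estimate = c*A + (1-c)*S:
# c = 1 would be no pooling, c = 0 complete pooling.
c = tau**2 / (tau**2 + sigma**2)
print(np.round(c, 2))
```

Noisier schools (larger sigma_j) get a smaller c, i.e., they are pulled harder toward the overall average, which is why a single c shared by all schools (such as 1/n) is not what the Bayesian analysis produces.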

Sorry – I tuned out the first comment when I saw “I agree with just about everything you wrote,” expecting some triumphalism. Credit to you for discussing alternative approaches rather than engaging in self-congratulation.

Incidentally, in my simple formulation (estimator = cA + (1-c)S), the value of c that gives 10 (the Bayesian answer) is .12, or about 1/8th. I interpret this as giving equal weighting to the results in all schools (since there are 8), and then taking the unique contribution of a particular school to adjust the score for that school. But, of course, the 1/8th is pretty close to the non-informative prior, which we aren’t supposed to use any longer (or, hardly ever).

@K?:

The hierarchical estimation is especially useful when measures are very sparse (e.g., two measurements in LQP). Not an ideal candidate for cross-validation (the paper AG referred to only does CV in counties with 50 or more observations; perhaps those are in a class of their own). There is only so far you can go by pulling on your own bootstraps.

More and more, I do not know what “accurately” means separately from the loss function. A lot of evaluation-laden decisions go into computing an estimate (regularization, minimizing squared residuals or absolute distance, etc.) before we even decide what to do.

Numeric:

See my comment above (the first comment on this post). The short answer is that there are many roads to Rome. To use your notation, the question is what value of “c” to use. It’s easy to derive a good value of c (which will depend on sample size) using the Bayesian algebra. Another way to do it would be to guess a functional form for c using statistical intuition (apparently, R. A. Fisher was good at that sort of thing) and then go from there. Depending on how you do the derivation, you can get back to the standard formula in various ways. I’ve found the Bayesian approach to work well in that it is easy to add predictors, spatial structure, etc., and the details pretty much take care of themselves. But indeed there are other ways to approach the problem.

On the other hand, the success in recent years of Bayesian inference should be telling us *something*. Partial pooling comes up often enough that it’s good to have a general way to handle it, rather than having to come up with some clever functional form for weighted averaging that has to be developed anew for each problem.

Keith:

It wasn’t just computational difficulties. If you read Rubin’s 8-schools article (it came out in 1980 or 1981), you’ll see that he struggled a bit with the whole idea of averaging over the posterior distribution of the hyperparameter. The entire analysis, which could be done in about 5 seconds today, was spread out over several pages, partly because Rubin seemed to feel a bit uncomfortable (or maybe he felt his audience would be uncomfortable) with just treating it as a full Bayesian problem. Recall that, in the 1970s, there was a lot of stuff on “empirical Bayes,” but it typically was based on a point estimate of the hyperparameters. And there was lots of agonizing over “exchangeability,” etc. It’s only after decades of experience with multilevel models that we easily understand how to think about such issues.
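The “5 seconds today” computation really is short. Here is a minimal sketch (not Rubin’s code; it uses the standard 8-schools numbers and, as assumptions, flat priors on mu and on the between-school sd tau): put tau on a grid, and average the conditional shrinkage estimates over the posterior p(tau | y) rather than plugging in a single point estimate of tau, as 1970s-style empirical Bayes would:

```python
import numpy as np

# Classic 8-schools data: estimated treatment effects and standard errors.
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

# Grid over the between-school sd tau (flat priors on mu and tau assumed).
tau_grid = np.linspace(0.01, 40.0, 400)
log_post = np.empty_like(tau_grid)
mu_hat = np.empty_like(tau_grid)

for i, tau in enumerate(tau_grid):
    v = sigma**2 + tau**2                  # marginal variance of y_j given mu, tau
    w = 1.0 / v
    mu_hat[i] = np.sum(w * y) / np.sum(w)  # precision-weighted grand mean
    # log p(tau | y), with mu integrated out under a flat prior
    log_post[i] = (-0.5 * np.log(np.sum(w))
                   - 0.5 * np.sum(np.log(v))
                   - 0.5 * np.sum(w * (y - mu_hat[i]) ** 2))

post = np.exp(log_post - log_post.max())
post /= post.sum()

# Full Bayes: average the conditional shrinkage estimate over p(tau | y),
# instead of plugging in a single point estimate of tau.
theta_hat = np.zeros_like(y)
for p, tau, m in zip(post, tau_grid, mu_hat):
    w_pool = sigma**2 / (sigma**2 + tau**2)  # weight on the grand mean
    theta_hat += p * (w_pool * m + (1 - w_pool) * y)

print(np.round(theta_hat, 1))
```

Every school’s estimate is pulled toward the grand mean, with the noisiest schools pulled hardest; the agonizing part in 1980 was not this arithmetic but the justification for averaging over tau at all.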

It would be worth doing a fuller historical study here (to complement the article that Christian Robert and I wrote regarding the anti-Bayesians of the postwar period).

Anonymous:

> Using unpooled estimates we would worry about LQP enough to perhaps do a second round of (larger) sampling.

OK, but why not first get unpooled, pooled, and partially pooled estimates and see which best predicts the larger sample?

Either this was idiosyncratic, or you are off to a start in learning that this usually happens. There are other studies where a second, larger sample was obtained, or you can fake this with cross-validation (use a small subset to predict the larger set) and then extend more generally with mathematical analysis or simulations. Having identified where partial pooling does not work well, you can “test” whether the situation in hand is likely one of those. Then you repeat the above process to assess that testing.

I prefer to separate accurately representing uncertain quantities from making optimal decisions given that representation since decisions are based on arbitrary (preference based) loss functions.

I am just objecting to p(theta | y) as being _the_ answer with a _self-evident_ measure of quality (and there is little of that in my thesis).

If it’s a well-constructed and tested two-stage gambling machine, then, having observed y, p(theta | y) is the answer and it can’t be bettered – but this is seldom the case in empirical research, and so some sense of repeated performance is needed.

> I want to know “the probability that Denver will beat Seattle in the Super Bowl”

Well, we all do, but it is probably better to settle for not being repeatedly too wrong about whether team X will beat team Y in game Z (for some reference set).

> I’m sure computational difficulties were a big part of that

You are likely right, given that Cochran was still unsuccessfully struggling with just the likelihoods (~1980), as they can be multimodal.

Christian raises a good point, though. There are probabilities that are neither 0 nor 1 because they involve “true” randomness (or at least we think they do), like “will this radon atom undergo radioactive decay within the next hour,” and there are “probabilities” that are nonzero only because we don’t know the answer. At the moment the roulette ball is released, whether it will land on 00 is determined; the fact that we assign a probability of 1/38 reflects our ignorance of the outcome, not genuine indeterminacy. The statistical distribution of possible snowpack in the Sierra Nevada this year is probably another example of ignorance rather than true randomness, the outcome already being deterministically beyond the reach of quantum-mechanical fluctuations. So I could have agreed with K? that I, too, am often interested in a “fictional theta” rather than a real theta.

In other cases, though, there really is a real theta. As Andrew says, we were interested in “what geometric mean radon concentration would we obtain if we measured the radon concentration in every house in the county?” That quantity “exists” in the sense that there is a correct answer, even though we don’t know what it is. There’s also a right answer to “what fraction of white female voters in Connecticut in the last election voted for Obama,” to give an example that applies to Andrew’s research.

So, sometimes the underlying parameter I’m interested in does “exist”, and sometimes it doesn’t. But even if it doesn’t, I want the best estimate of it I can get!

Christian:

In a sampling setting such as the radon problem, the underlying parameter theta_j in county j is essentially identical to the average radon level of all the houses in county j (or something like that), and that indeed exists.

“I’m almost always interested in the underlying parameters rather than in direct features of the data.”

So you believe that these parameters exist?

The validation is done using the same sample, which is OK but does not get at the behavior over repeated sampling. Also, it focuses on counties with more than 50 observations. And, as far as I could tell, it is not set up as a straight horse race with the unpooled model.

Note that if I had to bet on that horse race, my money would be on hierarchical Bayes. I’m just saying that the case could be more watertight if framed differently.

As for the second paper, it looks very interesting! Will have to read carefully.

K?, you say you’re not trying to find p(theta | y), you’re trying to “get well reasoned guesses at (a fictional theta)”. I guess I don’t see the difference there. I, too, am interested in “well reasoned” estimates (which I guess can be called guesses). So I could say “I’m trying to find a well reasoned guess at p(theta | y)”. Perhaps it will become clearer to me when I read your thesis.

Me, I’m almost always interested in the underlying parameters rather than in direct features of the data. I want to know “the probability that Denver will beat Seattle in the Super Bowl, given the teams’ performances during the year and the latest injury report”, or “the probability the snowpack in Northern California will end up at less than 20% of normal for the season, given the current state of the snowpack and the weather.”

Like you, I find it strange that some current bread-and-butter Bayesian methods took so long to become standard. I did not know about Cochran and Yates specifically. Considering how old Bayes’ Theorem is, it is remarkable how long it took to develop useful methods for application. I’m sure computational difficulties were a big part of that.

Phil:

Yes, I didn’t mean to imply that spatial and Bayes are mutually exclusive. What I meant to say was that we could’ve ended up with estimates very similar to what we got, but traveling a much different road, based on adaptive spatial aggregation and smoothing rather than hierarchical modeling.

Anonymous:

I’m happy to say that you asked questions that we have already answered!

1. Take a look at our 1996 paper. We demonstrate the effectiveness of our model using cross-validation. The unpooled estimate does not perform so well, as of course it shouldn’t, from basic mathematical principles.

2. Our analysis was all in service of decision making. See our 1999 paper for all the details. We even go into detail on the loss function (a tradeoff between dollars and lives).

Andrew, you’ll recall that we did (with your grad student John Boscardin) develop a way of using variograms to incorporate spatial information. So it’s not that we had to choose between spatial methods and Bayesian methods; we could and did use Bayesian spatial methods. It’s a pity that work was never published. John did a great job on it and should have gotten a publication out of it, and indeed he wrote 80% of a paper and I kept thinking I’d find time to finish it off, but that project had run out of time and money. Ah, well, whaddyagonnado.

Also, in later work I used surface-geology type indicators as additional regression variables, recognizing that part of the reason they were useful was simply that they did some spatial pooling: I compared the behavior of the actual geologic types to similarly sized random clumpings of contiguous counties and found that the actual geologic types performed only slightly better (in terms of predictive accuracy) than the fake ones. This approach is not as good as a genuine spatial model — the better approach would have been using the geologic information as well as using spatial information in an explicitly spatial model — but it was very easy to do with my existing machinery, and I judged, correctly I think, that the small additional benefit of incorporating the spatial model wouldn’t have been worth it.

Just a year or two later, when BUGS became available so it was much easier to fit more complicated models, that calculation might have changed, but by then I had moved on to other things. Also, it’s just generally easier to do everything now. I don’t remember exactly how I got the latitude and longitude of each zip code, but I remember it took a long time… maybe I had to use gopher or archie to download it from somewhere but it took a long time to figure out where, or maybe I ordered a diskette from some company and it took a week to arrive. The 8 Schools problem is now a 10-minute exercise in Stan, and zip code centroids can be found in a few seconds using your favorite search engine. These kids today, they don’t know how lucky they have it. We used to live in a cardboard box in the middle of the road, get up at 3 a.m., and head to the mines.

Playing Devil’s advocate:

1. In the radon example the hierarchical Bayes estimate for Lac Qui Parle (LQP) looks more reasonable than the completely unpooled estimate. But in a way this is circular. It looks reasonable because it is more in accordance with our priors. Why? Because we used our priors to estimate it. If we judge success by what we think is reasonable, then including what we think is reasonable in the estimation is likely to be “successful,” almost by definition.

2. I think to get at success we need to put the method in the context of a decision, like the allocation of radon abatement efforts, or whatever. And here we get the bit O’Rourke mentioned about a “sufficiently self-correcting/evolving process.” Using unpooled estimates, we would worry about LQP enough to perhaps do a second round of (larger) sampling. Perhaps we would learn that radon is indeed very high in that county, or not.

So I think the right comparison is which method converges faster to the “truth”, or at least avoids the most harmful consequences from radon exposure, for a given cost, over a certain time period. I suspect both will have similar convergence properties but different variances. Different loss functions may give different assessments of methods.

> I am trying to find p(theta | y)

Not me, I am just trying to get a sufficiently self-correcting/evolving process to get well reasoned guesses at (a fictional theta) that’s adequately purposeful. And I think Bayesian methods are often the best bet.

But otherwise, I agree with most of your post.

In fact, my thesis can be seen as an abandoned attempt at a general frequentist based approach to the 8 schools problem – http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf – and the history stuff there is quite relevant. It was abandoned as I realised Bayesian methods, at least for me, were a better way forward (though the third order asymptotic theory stuff was so much fun).

There are frequentist methods to bring to bear on these problems (as Andrew raises); the strange thing, maybe, is that the full Bayesian hierarchical model seemed to arise so late. Cochran may have been one of the first with a Normal-Normal model, and he did a number of applied examples with Yates (1937 & 38) – both of which seemed to enrage Fisher, who had very different ideas (e.g., ignore the individual school standard-error estimates except for model testing, treat each school’s treatment estimate as just a random outcome, and use my t-test).

P.S. Don Rubin told me once that he was completely unaware that Cochran (his PhD supervisor) had worked on this problem.

I agree with just about everything you wrote above (of course), and this fits in with two of my general themes regarding comparison of statistical methods: (1) experienced researchers are typically good with the methods that they are familiar with, and (2) different methods can be good for different problems. Regarding point 1, perhaps if you’d been working at Lawrence Stanford National Laboratories and had been a family friend of Paul Switzer, you would have ended up having success with some method from spatial statistics. Regarding point 2, as you note above, hierarchical Bayes is particularly appropriate in sparse-data settings such as the problem of estimating the average home radon concentration in a county where only two measurements are available.

But your post makes me think of another issue: (3) you can end up adapting your problem to available methods. For example, our familiarity with the 8-schools example made it natural to see the radon problem in that way. But if you’d not known about hierarchical Bayes, you might have just given up on the goal of estimating the average home radon concentration in Lac Qui Parle County. Instead, for example, you could have come up with an adaptive procedure to pool adjacent counties until the total sample size reached some predetermined level that would represent stable estimation. Such a procedure would require research, and it might be that there was some existing method in spatial statistics that was a good fit to this problem. And you’d be saying, “Everything I need to know about spatial smoothing, I learned from the *** example.”

And it gets better than that. Spatial smoothing is, like hierarchical Bayes, an open-ended methodology. Beyond the challenges of deciding which groups of adjacent counties to combine (and the opportunities to get uncertainties in the estimates by making the algorithm stochastic and bootstrapping the smoothing and estimation process), the approach can be extended by using covariate information (for example, county-level soil uranium measurements) to inform the choice of which counties to combine.

And the procedure could even take on a partial-pooling flavor without being Bayesian. For example, one could compute estimates at many different levels of combination (for example, combine until each super-county has at least 10 measurements, or 20, or 50, or 100), then fit some 1/sqrt(n)-type curve of the variation as a function of the number of observations in the county-bundles, then for each county (including Lac Qui Parle!) take some weighted average of the local observations and the estimates from larger and larger bundles. This could end up giving estimates that are just as good as what we got using the 8-schools model but without ever performing a Bayesian analysis.

None of the above contradicts your point. In your problem, a hierarchical Bayes approach *did* work well. Indeed, lots of researchers—even those who didn’t go to high school with me!—have found Bayesian inference to be helpful; it’s a great way to deal with models with lots of parameters and local information. Often, what makes a statistical approach useful is not just that it *can* solve a problem, but that the tools for the solution are readily available and that there is some earlier, similar analysis that can be used as a template. And of course there are lots and lots of non-Bayesian examples of this too.