Tangent (which I feel is ok on a post by the #1 tangent writer):

I go to Horniman all the time with the kids and have a couple of friends who have worked there. BTW they say the walrus is soaked in cyanide and that’s why you’re not allowed on the (also embarrassingly unrealistic – it’s not even cold!) plastic iceberg… I digress. The walrus is an embarrassment to them but it’s also the only reason people across the world (even on this blog) have heard of a small museum in South London. So they can’t take it away and say ‘Remember when we had that ridiculous overstuffed walrus there for 100 years – weird!’ because they would lose visitors. The good thing is their other exhibitions are superb. They don’t want people who visit to remember the visit for the walrus (and they don’t). They’re starting from below 0 so can’t put on something weak. I guess what I’m saying is that if people are aware that it’s embarrassing to use the eight schools, they’re starting from below 0 and might have to work harder on everything else they do, like thinking about how to simulate data.

PS – apparently no museum staff have ever taken photos of themselves riding the walrus on their last day.

]]>The point isn’t about how important real-world examples are for testing *data models*; it’s about how important real-world examples are for testing *computational methods for sampling from posteriors*.

You can’t test models of how electrons interact in a plasma using simulated data, you need to collect some measurements from real plasma. Similarly, you can’t test models for economic decision making using random number generators, you need to collect some data on people’s spending habits… But these are tests of models of real world processes, data models.

On the other hand, you *can* test a sampling algorithm on *any* probability density in N dimensions, and there’s no reason to think that posterior distributions from real-world, small-sample, information-poor statistical problems out of textbooks are especially good for testing those algorithms.
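
As a rough illustration of testing a sampler on a density whose answer is known in advance, here is a minimal sketch. It assumes a plain random-walk Metropolis sampler and a 3-dimensional standard normal target; the sampler, step size, seed, and chain length are all illustrative choices, not anything proposed in the post. Because the target's means (0) and variances (1) are known exactly, any mismatch points at the algorithm rather than at the data.

```python
import math
import random


def log_density(x):
    # Log of an unnormalised standard normal density in len(x) dimensions.
    return -0.5 * sum(v * v for v in x)


def random_walk_metropolis(log_p, start, n_steps, step_size, rng):
    """Return the list of states visited by a random-walk Metropolis chain."""
    current = list(start)
    current_lp = log_p(current)
    draws = []
    for _ in range(n_steps):
        proposal = [v + rng.gauss(0.0, step_size) for v in current]
        proposal_lp = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        draws.append(list(current))
    return draws


rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
draws = random_walk_metropolis(log_density, [0.0, 0.0, 0.0], 50000, 1.0, rng)
kept = draws[5000:]  # discard warm-up draws

means = [sum(d[i] for d in kept) / len(kept) for i in range(3)]
variances = [
    sum((d[i] - means[i]) ** 2 for d in kept) / len(kept) for i in range(3)
]
print("means:", means)
print("variances:", variances)
```

The same harness could be pointed at any other density with known moments (heavy tails, strong correlation, funnels) by swapping out `log_density`.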

At least that’s my interpretation of his point. I hope I’m not misstating it.

]]>A thought experiment:

Your grad student tells you that she has an algorithm that will predict the next 100 entries in a real-world data set of 1000 existing entries. But she is having trouble getting the data set to test it. You say “No problem, you know the basic parameters, test your algorithm on a synthetic data set. That’s better anyway.”

She develops a method to generate 1000 synthetic entries intended to simulate the real-world data. She runs her algorithm and predicts the next 100 entries. She then generates 100 new synthetic entries using the method she developed, but the numbers don’t match very well.

The next morning, she gets an e-mail with the full, real-world, up-to-date data set. She runs her algorithm and generates a prediction. She waits until 100 real-world entries are added to the data set, and discovers that her algorithm did an excellent job of predicting them. As more data accumulates, the fit gets better.

Do you tell her:

1. It failed when it mattered, which was with the synthetic data. Move on to a new project.

– or –

2. Your method of generating synthetic data must have been faulty. Nice algorithm!

Now swap “real world” and “synthetic” in the thought experiment. The algorithm works on the synthetic data but not on the real data. Do you say:

1. The algorithm may well be fine, the data set must not be representative. [Grad student: Not representative OF WHAT?]

– or –

2. Your method of generating synthetic data must have been faulty.

I know that statisticians would want to dive in, figuring out more of the statistical properties of the real-world data to understand what went wrong. And they wouldn’t be happy until they could generate synthetic data that works. But does any of that add to the confidence in the algorithm, or does it just allow the statisticians to sleep better at night?

Dan wrote:

“Experiments using well-designed simulated data are unbelievably important. Real data sets are not.”

Can my thought experiment be shoehorned into this? I am genuinely interested in how that would happen.

]]>And Bob and Dan agreeing that its specificity makes it “very challenging for students as a first example of a hierarchical model”, well, I don’t know what students you’re talking to.

Me, mainly; I’m the student. But I’m reasoning partly from theory: there are several moving parts (measurement error and hierarchical modeling), which can be tackled separately and then combined. I also find the first hierarchical model example in BDA (rats in chapter 5) very challenging because it assumes a lot of background in the way it manipulates Jacobians for priors, plugs in moment-based estimates, etc. I get it now, but it’s a tough first example for someone without a strong math-stats and applied-stats background.

]]>Real world knowledge is also why I like my baseball case study (and why it fails for most readers worldwide)—we have 100 years of real-world data to act as an informal but very strong prior.

]]>Another great example of this is Mahar et al.’s lung clearance diagnostic model. It’s a physically motivated use of an exponential decay mixture.

]]>as opposed to this:

]]>Real walrus for comparison: https://www.dkfindout.com/uk/animals-and-nature/seals-sea-lions-and-walruses/walrus/

The Horniman Walrus: https://www.horniman.ac.uk/collections/stories/horniman-highlights-tour

Now, the problem I have with it for students is that it makes things too simple and straightforward. The data is all in hand, there is no sense of varying quality between the studies, no sense of studies being missed, sigma_j is taken as known without discussion, etc. So my sense is they think they will understand something about meta-analysis and doing realistic hierarchical analyses when they don’t yet.

Now, the simple bootstrap is worse, as it seems so straightforward and easy and works well for toy problems. I have seen countless people (often statisticians) doing silly bootstrapping on real problems, thinking they should be simple too.

]]>https://www.atlasobscura.com/places/the-lion-of-gripsholm-castle-strangnas-sweden

]]>Now that I know that it is not just me who finds the 8 schools example unsatisfying, I don’t need to blame it on my lack of statistical erudition.

Paradoxically, I like the 8 schools example *more* after reading the post and the comments.

]]>(Photography was a major pain* in the late 19th century, but it was possible and lots of people put up with said pain, especially folks doing expedition sorts of things.)

*: Plates were only replaced by film (invented in 1884) in the very late 19th century. It really was a pain.

]]>If you hate it as a computational example but like it in other ways, that didn’t come across to me. And maybe that’s not true either, maybe you just plain hate it. And that’s fine. If you claimed it’s a _bad_ example, that would be a factual claim with which I would take issue. But if you’re just saying you hate it, hey, that’s also a factual claim and it’s not like I think you’re lying about it. And it’s certainly fine with me: De gustibus non est disputandum, and all that. (Why on earth does autocorrect try to ‘correct’ that to xustibus. WTF?) ]]>

One of the difficulties with the 8 schools example as it stands is that it’s not so easy for students to directly emulate, as in real-life examples you typically don’t want to treat the sigma_j’s as known.

Unfortunately, I don’t have any idea how to track down these data. About 15 or 20 years ago I asked Rubin if he had the raw data, and he looked and couldn’t find anything.

]]>I run into real-world problems all the time in which a multi-level model would be a great approach, and I always have to explain to someone, or several people, or a bunch of people, what that means. The 8 schools problem is great for this. The measurements are subject to substantial error, the true values are from some distribution with unknown mean and variance (and for that matter unknown distribution), the classical estimate is unrealistic, and I’m trying to estimate both the distribution of the true values and the true values associated with the individual observations. This general structure comes up all the time. The fact that it isn’t hard to compute…who tf cares? If anything that’s a feature, not a bug.

And Bob and Dan agreeing that its specificity makes it “very challenging for students as a first example of a hierarchical model”, well, I don’t know what students you’re talking to. When I was a student I didn’t find it “very challenging”, I found it extremely clear…and it’s not just me: I have used it to explain hierarchical models to many people and they have all understood it, or at least claimed to.

Echoing Tom P (above), one of the things I like is that the example invites pondering the two extremes — complete pooling and no pooling — and the fact that both of them seem unreasonable. It’s great for walking people through options that might seem tempting, like testing each school’s estimate against the mean, and if it differs by a ‘statistically significant’ amount then let it stand alone, otherwise give it the mean effect, and illustrating how crazy that would be in practice….and, by extension, how crazy it is to do it in many other real-world problems where that is in fact what some people do.
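
To make the two extremes concrete, here is a minimal sketch. The eight estimates and standard errors are the values commonly quoted from Rubin (1981) and BDA, typed from memory, so check them against the book. The function names, and the shortcut of plugging in the weighted mean for mu at a fixed between-school sd `tau`, are assumptions made for illustration; this is not the full Bayes analysis.

```python
# Eight schools estimates and standard errors as commonly quoted from
# Rubin (1981) / BDA. Typed from memory: verify before relying on them.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def pooled_mean(y, sigma, tau):
    """Precision-weighted mean of y when the true effects have sd tau."""
    weights = [1.0 / (s * s + tau * tau) for s in sigma]
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)


def shrunk_estimates(y, sigma, tau):
    """School estimates for a fixed tau, with mu set to its weighted mean.

    tau = 0 gives complete pooling; a very large tau approaches no pooling.
    """
    mu = pooled_mean(y, sigma, tau)
    if tau == 0.0:
        return [mu] * len(y)
    estimates = []
    for v, s in zip(y, sigma):
        shrink = s * s / (s * s + tau * tau)  # weight placed on mu
        estimates.append(shrink * mu + (1.0 - shrink) * v)
    return estimates


for tau in (0.0, 5.0, 1e6):  # complete pooling, something in between, ~no pooling
    rounded = [round(e, 1) for e in shrunk_estimates(y, sigma, tau)]
    print("tau =", tau, "->", rounded)
```

The intermediate value `tau = 5.0` is arbitrary; the point is only that every choice between the extremes pulls each school part of the way toward the common mean, with noisier schools pulled further.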

Dan and Bob, you both seem really focused on how it’s not a great _computational_ example, and given what you work on maybe that’s understandable, but you seem to be completely missing the fact that it is a great _conceptual_ example.

]]>Another thing. You characterize the 8-schools data as “smooth and lovely.”

Actually, the 8 schools data are smooth and lovely only in retrospect!

Here’s the background. Throughout the 1970s, lots of research was done on hierarchical regression modeling, often in education examples. Standard practice for both theory and application was to use point estimates for the hyperparameters. The 8 schools example (published in 1981, based on an experiment conducted in the late 1970s) was special because it was *difficult*, as the natural point estimate of the group-level variance was zero. This motivated Rubin to perform a full Bayes analysis.
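
A minimal sketch of how a point estimate of the group-level variance can come out as zero. It assumes a DerSimonian-Laird style method-of-moments estimator, chosen here only for illustration and not necessarily the estimator used in the 1970s literature, and the data values as commonly quoted from BDA (typed from memory).

```python
# Eight schools estimates and standard errors as commonly quoted from BDA.
# Typed from memory: verify before relying on them.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def moment_estimate_tau2(y, sigma):
    """Return (tau2_hat, Q) for a DerSimonian-Laird style moment estimator."""
    w = [1.0 / (s * s) for s in sigma]
    mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Q measures how much the estimates vary relative to their standard errors.
    q = sum(wi * (yi - mu_hat) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    # If Q is below its null expectation (k - 1), the raw estimate is negative
    # and gets truncated to zero.
    tau2_hat = max(0.0, (q - (len(y) - 1)) / c)
    return tau2_hat, q


tau2_hat, q = moment_estimate_tau2(y, sigma)
print("Q =", round(q, 2), "with", len(y) - 1, "degrees of freedom")
print("moment estimate of tau^2 =", tau2_hat)
```

With these numbers Q falls below its degrees of freedom, so the truncated estimate is exactly zero, which would imply complete pooling if taken at face value.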

So, I think that to describe the 8 schools as smooth and lovely is like saying that Hamlet sucks cos it’s full of cliches.

And it’s relevant that the 8 schools was a real application. Yes, Rubin (or someone else) could’ve run a simulation study of hierarchical modeling where the group-level variance was low enough that the point estimate would often be zero; indeed, Yeojin, Sophia, Jingchen, Vince, and I did this as part of our development of zero-avoiding priors for stable Bayesian point estimation of hierarchical variance parameters. But (a) I don’t know that Rubin’s example would’ve been so compelling with fake data, and (b) let’s face it: Rubin published the 8 schools in 1981 and we did our work in 2013, and this work was very much inspired by struggles we’d had with the 8 schools over the years.

Let me put it another way. Remember that famous passage from Boswell:

After we came out of the church, we stood talking for some time together of Bishop Berkeley’s ingenious sophistry to prove the nonexistence of matter, and that every thing in the universe is merely ideal. I observed, that though we are satisfied his doctrine is not true, it is impossible to refute it. I never shall forget the alacrity with which Johnson answered, striking his foot with mighty force against a large stone, till he rebounded from it — “I refute it thus.”

The 8 schools is a stone that has been used to refute some methods. It’s not the only stone out there, that’s for sure. But a useful stone it’s been.

]]>You write, “But it’s real data! Firstly, no it isn’t. You grabbed it out of a book.”

No. I put it into a book.

]]>My problem is that when you rip these things out of context they become essentially useless. It is unclear what “population” of datasets you can generalize to.

]]>One can play around with a few different prior distributions for the standard deviation of true effects and find that it doesn’t really matter: any reasonable choice that doesn’t force the standard deviation to be very small or very large turns out to give about the same posterior distribution. School A ends up with an estimated true effect of 10 points, with about 50% of the probability between 7 and 16 points.

So the point of the example is that *real world prior information is very important* and it shows you how to use it.

As far as that goes, I think it’s a useful example.

PS: I think the gamma(4, 3/4) prior gives you a most likely value of 4 and a 95% range from about 1.4 to 11.7 which seems reasonable.
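
The numbers in the PS can be checked with a few lines of standard-library Python. This sketch assumes "gamma(4, 3/4)" means shape 4 and rate 3/4, which is the reading that gives a mode of 4. Because the shape is an integer, the CDF has a closed form and the quantiles can be found by bisection.

```python
import math

SHAPE = 4     # integer shape, so the CDF has a closed (Erlang) form
RATE = 0.75   # assumed to be a rate, not a scale


def gamma_cdf(x):
    """CDF of the gamma(SHAPE, RATE) distribution for integer SHAPE."""
    if x <= 0.0:
        return 0.0
    z = RATE * x
    partial = sum(z ** k / math.factorial(k) for k in range(SHAPE))
    return 1.0 - math.exp(-z) * partial


def gamma_quantile(p, lo=0.0, hi=100.0):
    """Invert the CDF by bisection on the interval [lo, hi]."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


mode = (SHAPE - 1) / RATE
lower = gamma_quantile(0.025)
upper = gamma_quantile(0.975)
print("mode:", mode)
print("central 95% interval:", round(lower, 2), "to", round(upper, 2))
```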

]]>In these cases, the real-world aspect can be important, because the important part isn’t so much the statistics as the building of the predictive model from real-world considerations. Though it’s still very useful to simulate how well your model would fit if your model were really and truly correct, which only happens in simulation.

]]>I also prefer examples where we have some idea of what to expect in the posterior. With 8 schools, the posterior is so sensitive to the prior that it feels like we’re just making up answers, because we don’t have enough data to cross-validate meaningfully. So I never know what to make of any posteriors I’m shown from 8 schools.

P.S. I’m not asking for details, but I am curious if Spooky Horrible Goose either is the goose or derived their name from the goose mentioned in the body of the post.

]]>Making the whole thing harder is that it’s nearly impossible to know the tails well enough with most real data sets.

]]>Or you could treat them as all being part of one and the same distribution, in which case the data seems to show that the same treatment effect can have quite different results.

Or, more likely, the truth lies somewhere in between. The key point for me was that the problem as posed gives us no way to choose the location on that continuum.

]]>Oh c’mon. That cannot possibly mean anything.

Sometimes the real world data is available in a spreadsheet, comprehensive and detailed, objectively gathered and presented.

It just sounds nutty to insist that a simulated data set is always better if the goal is to gain greater understanding of the real world data.

]]>I also don’t think it successfully shows very much about that class of models. What it does show could be demonstrated more clearly and more efficiently by comparing simulated data with three different values of $latex \tau$ (small, intermediate, and large).
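
A minimal sketch of what such a simulated comparison might look like. The population mean, the three values of tau, the seed, and the number of replicates are all hypothetical choices for illustration; the standard errors are borrowed from the eight schools example as commonly quoted from BDA (typed from memory).

```python
import random
import statistics

# Standard errors borrowed from the eight schools example (from memory).
SIGMA = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]
MU = 8.0  # hypothetical population mean effect


def simulate_schools(tau, rng):
    """Draw true effects theta_j ~ N(MU, tau), then data y_j ~ N(theta_j, sigma_j)."""
    theta = [rng.gauss(MU, tau) for _ in SIGMA]
    y = [rng.gauss(t, s) for t, s in zip(theta, SIGMA)]
    return theta, y


rng = random.Random(2024)  # arbitrary fixed seed for repeatability
avg_spread = {}
for tau in (1.0, 10.0, 30.0):  # small, intermediate, large (hypothetical)
    spreads = []
    for _ in range(2000):
        _, y = simulate_schools(tau, rng)
        spreads.append(statistics.stdev(y))
    avg_spread[tau] = statistics.mean(spreads)
    print("tau =", tau, "average sd of observed effects =", round(avg_spread[tau], 1))
```

Since the true `theta` values are returned alongside `y`, the same function can be used to check how well any fitted model recovers them, which is not possible with the real data.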

]]>1) Demonstrate how well an algorithm works for sampling from a posterior distribution?

or

2) Demonstrate in about the simplest form possible how and why one should build a hierarchical measurement error model from real world considerations?

It seems to me your complaint is primarily about 8 schools as an instance of (1), not 8 schools as an instance of (2), and yet it also seems to me that 8 schools is primarily aimed at (2), which is an educational objective, not an algorithmic one.

]]>