Tangent (which I feel is ok on a post by the #1 tangent writer):

I go to Horniman all the time with the kids and have a couple of friends who have worked there. BTW they say the walrus is soaked in cyanide and that’s why you’re not allowed on the (also embarrassingly unrealistic – it’s not even cold!) plastic iceberg… I digress. The walrus is an embarrassment to them but it’s also the only reason people across the world (even on this blog) have heard of a small museum in South London. So they can’t take it away and say ‘Remember when we had that ridiculous overstuffed walrus there for 100 years – weird!’ because they would lose visitors. The good thing is their other exhibitions are superb. They don’t want people who visit to remember the visit for the walrus (and they don’t). They’re starting from below 0 so can’t put on something weak. I guess what I’m saying is that if people are aware that it’s embarrassing to use the eight schools, they’re starting from below 0 and might have to work harder on everything else they do, like thinking about how to simulate data.

PS – apparently no museum staff have ever taken photos of themselves riding the walrus on their last day.

]]>Matt, I think Dan didn’t do a good job of putting the proper context on his post, for example see here and the follow up from Phil:

The point isn’t about how important real world examples are for testing *data models*; it’s about how important real world examples are for testing *computational methods for sampling from posteriors*.

You can’t test models of how electrons interact in a plasma using simulated data, you need to collect some measurements from real plasma. Similarly, you can’t test models for economic decision making using random number generators, you need to collect some data on people’s spending habits… But these are tests of models of real world processes, data models.

On the other hand, you *can* test a sampling algorithm on *any* probability density in N dimensions, and there’s no reason to think that posterior probability distributions from real world small sample information poor statistical problems out of textbooks are good for testing those algorithms.

At least that’s my interpretation of his point. I hope I’m not misstating it.
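To make the sampler-testing point concrete, here is a minimal sketch of checking a sampler against a density whose answer is known in advance. It assumes a plain random-walk Metropolis sampler and a standard normal target in two dimensions; the function names, step size, and number of draws are illustrative choices, not anything from the original post.

```python
import math
import random

def log_density(x):
    """Log of an unnormalized standard normal density in len(x) dimensions."""
    return -0.5 * sum(v * v for v in x)

def random_walk_metropolis(log_p, start, n_steps, step_size, rng):
    """Return n_steps correlated draws targeting the density exp(log_p)."""
    current = list(start)
    current_lp = log_p(current)
    draws = []
    for _ in range(n_steps):
        proposal = [v + rng.gauss(0.0, step_size) for v in current]
        proposal_lp = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        draws.append(list(current))
    return draws

rng = random.Random(0)
draws = random_walk_metropolis(log_density, [0.0, 0.0], 20000, 1.0, rng)

# The target has mean 0 and variance 1 in every coordinate, so the
# sampler's output can be checked against known truth.
first = [d[0] for d in draws]
mean_first = sum(first) / len(first)
var_first = sum((v - mean_first) ** 2 for v in first) / len(first)
print(mean_first, var_first)
```

Because the target is chosen rather than observed, the same check can be repeated on harder densities (heavier tails, strong correlation, more dimensions) without needing any real data set.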

]]>“when real data doesn’t happen to behave itself”

A thought experiment:

Your grad student tells you that she has an algorithm that will predict the next 100 entries in a real-world data set of 1000 existing entries. But she is having trouble getting the data set to test it. You say “No problem, you know the basic parameters, test your algorithm on a synthetic data set. That’s better anyway.”

She develops a method to generate 1000 synthetic entries intended to simulate the real-world data. She runs her algorithm and predicts the next 100 entries. She then generates 100 new synthetic entries using the method she developed, but the numbers don’t match very well.

The next morning, she gets an e-mail with the full, real-world, up-to-date data set. She runs her algorithm and generates a prediction. She waits until 100 real-world entries are added to the data set, and discovers that her algorithm did an excellent job of predicting them. As more data accumulates, the fit gets better.

Do you tell her:

1. It failed when it mattered, which was with the synthetic data. Move on to a new project.

– or –

2. Your method of generating synthetic data must have been faulty. Nice algorithm!

Now swap “real world” and “synthetic” in the thought experiment. The algorithm works on the synthetic data but not on the real data. Do you say:

1. The algorithm may well be fine, the data set must not be representative. [Grad student: Not representative OF WHAT?]

– or –

2. Your method of generating synthetic data must have been faulty.

I know that statisticians would want to dive in, figuring out more of the statistical properties of the real-world data to understand what went wrong. And they wouldn’t be happy until they could generate synthetic data that works. But does any of that add to the confidence in the algorithm, or does it just allow the statisticians to sleep better at night?

Dan wrote:

“Experiments using well-designed simulated data are unbelievably important. Real data sets are not.”

Can my thought experiment be shoehorned into this? I am genuinely interested in how that would happen.

]]>And Bob and Dan agreeing that its specificity makes it “very challenging for students as a first example of a hierarchical model”, well, I don’t know what students you’re talking to.

Me, mainly. I’m the student. But I’m reasoning partly by theory that there are several moving parts, measurement error and hierarchical modeling, which can be tackled separately and then combined. I also find the first hierarchical model example in BDA (rats in chapter 5) very challenging because it assumes a lot of background in the way it manipulates Jacobians for priors, plugs in moment-based estimates, etc. I get it now, but it’s a tough first example for someone without a strong math-stats and applied-stats background.

]]>Stan case study on my dropping paper balls experiment would be a good idea. It has some nice aspects where I intentionally included model error (the paper balls are nothing like smooth spheres, but they can be treated as such for this purpose and after inferring a best fit radius, it induces a sufficiently small model error that it doesn’t matter, all models are wrong, but some are useful kinda thing). Maybe I’ll look into writing it up.

]]>That’s a really good point—although highly sensitive to the prior, *reasonable* choices of priors produce similar results.

Real world knowledge is also why I like my baseball case study (and why it fails for most readers worldwide)—we have 100 years of real-world data to act as an informal but very strong prior.

]]>Thanks—that’s exactly why I wrote the Lotka-Volterra case study. I’d been talking to Michael Betancourt a lot. He’d run a course for physics students where they estimated gravity constants. I really like that example and we’d love to have a Stan case study on it. Michael reported that the students did what physics students are good at—they devised clever measurement schemes involving their phones.

Another great example of this is Mahar et al.’s lung clearance diagnostic model. It’s a physically motivated use of an exponential decay mixture.

]]>+1, certainly not an example of the worst taxidermy …

as opposed to this:

]]>Real walrus for comparison: https://www.dkfindout.com/uk/animals-and-nature/seals-sea-lions-and-walruses/walrus/

The Horniman Walrus: https://www.horniman.ac.uk/collections/stories/horniman-highlights-tour

Thanks – I had remembered it being an example selected by Don to provide a good contrast between full Bayes and empirical Bayes.

Now, the problem I have with it for students is that it makes things too simple and straightforward. The data is all in hand, there is no sense of varying quality between the studies, no sense of studies being missed, sigma_j is taken as known without discussion, etc. So my sense is they think they will understand something about meta-analysis and doing realistic hierarchical analyses when they don’t yet.

Now, the simple bootstrap is worse, as it seems so straightforward and easy and works well for toy problems. I have seen countless people (often statisticians) doing silly bootstrapping on real problems, thinking they should be simple too.

]]>It’s often done as a convenience in meta-analysis, especially as the raw data is seldom available, but even with just summaries like mean and variance you can treat sigma_j as UNknown in the modelling. Now, if the samples were large it will make little difference, but if they were small you can get multi-modality in the likelihoods and all sorts of numerical problems can occur.

]]>https://www.atlasobscura.com/places/the-lion-of-gripsholm-castle-strangnas-sweden

]]>Now that I know that it is not just me who finds the 8 schools example unsatisfying, I don’t need to blame it on my lack of statistical erudition.

Paradoxically, I like the 8 school example *more* after reading the post and the comments.

]]>(Photography was a major pain* in the late 19th century, but it was possible and lots of people put up with said pain, especially folks doing expedition sorts of things.)

*: Plates were only replaced by film (invented in 1884) in the very late 19th century. It really was a pain.

]]>It’s true that one typically doesn’t want to treat the sigma_j as known, but it so happens that I am working with some data right now, tonight, in which I need, or at least strongly want, to treat the sigma_j as known. And I’ve run into it before, too. It may be more common than you think.

]]>Dan,

If you hate it as a computational example but like it in other ways, that didn’t come across to me. And maybe that’s not true either, maybe you just plain hate it. And that’s fine. If you claimed it’s a _bad_ example, that would be a factual claim with which I would take issue. But if you’re just saying you hate it, hey, that’s also a factual claim and it’s not like I think you’re lying about it. And it’s certainly fine with me: De gustibus non est disputandum, and all that. (Why on earth does autocorrect try to ‘correct’ that to xustibus. WTF?)

This came up because you asked a couple of days ago about something I said on a post about a computational method. So the reason and the context was computational. Also a few modelling reasons I listed above. I hope that clears up any lingering confusion.

]]>One of the difficulties with the 8 schools example as it stands is that it’s not so easy for students to directly emulate, as in real-life examples you typically don’t want to treat the sigma_j’s as known.

Unfortunately, I don’t have any idea how to track down these data. About 15 or 20 years ago I asked Rubin if he had the raw data, and he looked and couldn’t find anything.

]]>I run into real-world problems all the time in which a multi-level model would be a great approach, and I always have to explain to someone, or several people, or a bunch of people, what that means. The 8 schools problem is great for this. The measurements are subject to substantial error, the true values are from some distribution with unknown mean and variance (and for that matter unknown distribution), the classical estimate is unrealistic, and I’m trying to estimate both the distribution of the true values and the true values associated with the individual observations. This general structure comes up all the time. The fact that it isn’t hard to compute…who tf cares? If anything that’s a feature, not a bug.

And Bob and Dan agreeing that its specificity makes it “very challenging for students as a first example of a hierarchical model”, well, I don’t know what students you’re talking to. When I was a student I didn’t find it “very challenging”, I found it extremely clear…and it’s not just me: I have used it to explain hierarchical models to many people and they have all understood it, or at least claimed to.

Echoing Tom P (above), one of the things I like is that the example invites pondering the two extremes — complete pooling and no pooling — and the fact that both of them seem unreasonable. It’s great for walking people through options that might seem tempting, like testing each school’s estimate against the mean, and if it differs by a ‘statistically significant’ amount then let it stand alone, otherwise give it the mean effect, and illustrating how crazy that would be in practice….and, by extension, how crazy it is to do it in many other real-world problems where that is in fact what some people do.
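A small sketch of the two extremes and the continuum between them. It assumes the eight estimates and standard errors as they are commonly tabulated (for example in Bayesian Data Analysis); those numbers are not quoted in this thread. The partial-pooling function conditions on fixed values of mu and tau, which is a simplification of the full hierarchical analysis, and the function names are illustrative.

```python
# Estimated effects and standard errors for the eight schools, as commonly
# tabulated (e.g. in Bayesian Data Analysis); not quoted in this thread.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def complete_pooling(y, sigma):
    """Precision-weighted mean and its standard error (one common effect)."""
    weights = [1.0 / s ** 2 for s in sigma]
    mean = sum(w * v for w, v in zip(weights, y)) / sum(weights)
    se = (1.0 / sum(weights)) ** 0.5
    return mean, se

def partial_pooling(y, sigma, mu, tau):
    """Conditional posterior means of the school effects given mu and tau."""
    estimates = []
    for v, s in zip(y, sigma):
        shrink = s ** 2 / (s ** 2 + tau ** 2)  # 1 = full pooling, 0 = none
        estimates.append(shrink * mu + (1.0 - shrink) * v)
    return estimates

pooled_mean, pooled_se = complete_pooling(y, sigma)
no_pooling = list(y)  # each school keeps its own estimate
print(round(pooled_mean, 2), round(pooled_se, 2))
for tau in (0.0, 5.0, 10.0, 1e6):  # tau = 0 is complete pooling
    print(tau, [round(t, 1) for t in partial_pooling(y, sigma, pooled_mean, tau)])
```

Setting tau to zero reproduces complete pooling and a very large tau reproduces no pooling, so the whole question becomes where on that continuum tau should sit.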

Dan and Bob, you both seem really focused on how it’s not a great _computational_ example, and given what you work on maybe that’s understandable, but you seem to be completely missing the fact that it is a great _conceptual_ example.

]]>Another thing. You characterize the 8-schools data as “smooth and lovely.”

Actually, the 8 schools data are smooth and lovely only in retrospect!

Here’s the background. Throughout the 1970s, lots of research was done on hierarchical regression modeling, often in education examples. Standard practice for both theory and application was to use point estimates for the hyperparameters. The 8 schools example (published in 1981, based on an experiment conducted in the late 1970s) was special because it was *difficult*, as the natural point estimate of the group-level variance was zero. This motivated Rubin to perform a full Bayes analysis.
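As a rough illustration of why a point estimate of the group-level variance lands at zero, here is a sketch using one simple moment-based estimator: the sample variance of the eight estimates minus the average sampling variance, truncated at zero. This is an assumption made for illustration and not necessarily the estimator used in the original analysis; the data are the commonly tabulated eight schools values, which are not quoted in this thread.

```python
# Estimated effects and standard errors for the eight schools, as commonly
# tabulated (e.g. in Bayesian Data Analysis); not quoted in this thread.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

n = len(y)
mean_y = sum(y) / n

# Observed spread of the eight estimates (sample variance).
var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)

# Spread expected from measurement error alone.
mean_sampling_var = sum(s ** 2 for s in sigma) / n

# Moment-based estimate of the between-school variance, truncated at zero.
tau2_hat = max(0.0, var_y - mean_sampling_var)
print(round(var_y, 1), round(mean_sampling_var, 1), tau2_hat)
```

The observed spread is smaller than what measurement error alone would produce, so the truncated estimate is zero and a plug-in analysis collapses to complete pooling.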

So, I think that to describe the 8 schools as smooth and lovely is like saying that Hamlet sucks cos it’s full of cliches.

And it’s relevant that the 8 schools was a real application. Yes, Rubin (or someone else) could’ve run a simulation study of hierarchical modeling where the group-level variance was low enough that the point estimate would often be zero; indeed, Yeojin, Sophia, Jingchen, Vince, and I did this as part of our development of zero-avoiding priors for stable Bayesian point estimation of hierarchical variance parameters. But (a) I don’t know that Rubin’s example would’ve been so compelling with fake data, and (b) Let’s face it: Rubin published the 8 schools in 1981 and we did our work in 2013, and this work was very much inspired by struggles we’d had with the 8 schools over the years.

Let me put it another way. Remember that famous passage from Boswell:

After we came out of the church, we stood talking for some time together of Bishop Berkeley’s ingenious sophistry to prove the nonexistence of matter, and that every thing in the universe is merely ideal. I observed, that though we are satisfied his doctrine is not true, it is impossible to refute it. I never shall forget the alacrity with which Johnson answered, striking his foot with mighty force against a large stone, till he rebounded from it — “I refute it thus.”

The 8 schools is a stone that has been used to refute some methods. It’s not the only stone out there, that’s for sure. But a useful stone it’s been.

]]>In this, as in many things Andrew, you are the exception :p

]]>You write, “But it’s real data! Firstly, no it isn’t. You grabbed it out of a book.”

No. I put it into a book.

]]>Spooky Horrible Goose is Charlie’s halloween name. I imagine it’s related to the Untitled Goose Game that all the cool kids have been talking about.

]]>I think you’re right. A thing that I think would be more useful than 8-schools (or StackLoss or any of the others) is building field-specific directories of data sets. Because the structure of actual data in, say, neuroscience differs from the structure of actual data in, say, education. 8-Schools is a perfectly ok example of a certain type of educational data. We should have more and, using that, get insight on our models and algorithms in that specific field.

My problem is that when you rip these things out of context they become essentially useless. It is unclear what “population” of datasets you can generalize to.

]]>From this perspective it could be an important example because it shows that different assumptions based on real world information lead to different results, which is an indication that you should argue for and make relatively strong assumptions (i.e., moderately informative priors, for example priors that avoid zero or infinity on the between-schools standard deviation). Classical statisticians seem to see this as a flaw, whereas I see this as connecting the model to reality: we know that there will be variation among schools, tau = 0 is not realistic, neither is variation orders of magnitude larger than between-student variation, so we can choose a gamma type prior that places the high probability region on something like 1/10 to 10 x the student standard deviation and get some kind of reasonable estimates… as Phil said (and I’m assuming he’s correct, because I have never actually done numerical experiments on the 8 schools example):

One can play around with a few different prior distributions for the standard deviation of true effects and find that it doesn’t really matter, any reasonable choice that doesn’t force the standard deviation to be very small or very large turns out to give about the same statistical distribution. School A ends up with an estimated true effect of 10 points, with about 50% of the probability between 7 and 16 points.

So the point of the example is that *real world prior information is very important* and it shows you how to use it.

As far as that goes, I think it’s a useful example.

PS: I think the gamma(4, 3/4) prior gives you a most likely value of 4 and a 95% range from about 1.4 to 11.7 which seems reasonable.
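The numbers quoted for the gamma(4, 3/4) prior can be checked with a short sketch. It assumes the shape-rate parameterization (shape 4, rate 3/4, as in Stan's gamma), which the comment does not state explicitly. Because the shape is an integer, the CDF has a closed form and the quantiles can be found by bisection using only the standard library.

```python
import math

SHAPE = 4     # integer shape, so the CDF has a closed (Erlang) form
RATE = 0.75   # assumes the shape-rate convention, as in Stan's gamma

def gamma_cdf(x, shape=SHAPE, rate=RATE):
    """CDF of a gamma distribution with integer shape."""
    if x <= 0.0:
        return 0.0
    z = rate * x
    partial = sum(z ** k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-z) * partial

def gamma_quantile(p, lo=0.0, hi=200.0):
    """Invert the CDF by bisection on the interval [lo, hi]."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mode = (SHAPE - 1) / RATE          # most likely value
lower = gamma_quantile(0.025)      # lower end of the central 95% range
upper = gamma_quantile(0.975)      # upper end of the central 95% range
print(mode, round(lower, 2), round(upper, 2))
```

Under that parameterization the mode is 4 and the central 95% interval runs from roughly 1.45 to 11.69, consistent with the figures quoted above.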

]]>I also like the golf putting example for the same reasons.

In these cases, the real-world aspect can be important, because the important part isn’t so much the statistics as the building of the predictive model from real-world considerations. Though it’s still very useful to simulate how well your model would fit if your model were really and truly correct, which only happens in simulation.

]]>My own personal favorite introduction model is either my own dropping balls or the Stan group’s Lotka-Volterra lynxes because these emphasize that Bayesian statistics is capable of allowing us to infer things about mechanistic models of how the world works. Too much of science these days is just “Measure some stuff and see if there’s a statistically significant difference between condition A and condition B” and we need to move towards “think about how things work, come up with simplified assumptions and see how well they can do to explain real world data”.

]]>That last paragraph is what I meant in my comment above by the posterior being driven by the prior in 8 schools and us having no way to evaluate it.

]]>It took me a long time coming from the wilds of machine learning to understand the point of the 8-schools example. That’s because, as Dan says, it’s an “extremely specific case of a hierarchical measurement error model.” That makes it very challenging for students as a first example of a hierarchical model.

I also prefer examples where we have some idea of what to expect in the posterior. With 8 schools, the posterior is so sensitive to the prior it feels like we’re just making up answers because we don’t have enough data to cross-validate meaningfully. So I never know what to make of any posteriors I’m shown from 8 schools.

P.S. I’m not asking for details, but I am curious if Spooky Horrible Goose either is the goose or derived their name from the goose mentioned in the body of the post.

]]>The problem with real data is that it may not generalize well enough. The problem with constructed data is that you may get the tails of the distributions wrong (i.e., not representative of real data), and neglect bias, in which case any calculations that are sensitive to the tails or bias can be far off.

Making the whole thing harder is that it’s nearly impossible to know the tails well enough with most real data sets.

]]>Or you could treat them as all being part of one and the same distribution, in which case the data seems to show that the same treatment effect can have quite different results.

Or, more likely, the truth is somewhere in between. The key point for me was that the problem as posed gives us no way to choose the location on that continuum.

]]>The only question that real data can answer when exploring a model or an algorithm (not, of course, when you’re actually interested in what the real data measures) is this: Does there exist at least one data set with properties that make this new method better than or different to existing methods?

]]>The presentation is not the point. Sometimes real data fits smoothly within the model assumptions, but you can’t know that before you fit the model and investigate. So build data that tests your model/algorithm in various ways so that you at least know what you are looking for when real data doesn’t happen to behave itself.

]]>Oh c’mon. That cannot possibly mean anything.

Sometimes the real world data is available in a spreadsheet, comprehensive and detailed, objectively gathered and presented.

It just sounds nutty to insist that a simulated data set is always better if the goal is to gain greater understanding of the real world data.

]]>I don’t think 2 is true either. Again, the standard deviations are known. This is an extremely specific case of a hierarchical measurement error model.

I also don’t think it successfully shows very much about that class of models. What it does show could be demonstrated more clearly and more efficiently by comparing simulated data with three different values of $latex \tau$ (small, intermediate, and large).
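A minimal sketch of the simulated comparison described here, under stated assumptions: the known standard errors are the commonly tabulated eight schools values (not quoted in this thread), and the population mean of 8 and the three values of tau are hypothetical choices made only for illustration.

```python
import random

# Known standard errors, taken from the commonly tabulated eight schools
# values (an assumption; they are not quoted in this thread).
SIGMA = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def simulate_schools(mu, tau, sigma, rng):
    """Draw true effects theta_j ~ N(mu, tau), then estimates y_j ~ N(theta_j, sigma_j)."""
    theta = [rng.gauss(mu, tau) for _ in sigma]
    y = [rng.gauss(t, s) for t, s in zip(theta, sigma)]
    return theta, y

rng = random.Random(1)
MU = 8.0  # hypothetical population mean effect
for tau in (0.1, 5.0, 25.0):  # hypothetical small, intermediate, and large tau
    theta, y = simulate_schools(MU, tau, SIGMA, rng)
    print("tau =", tau)
    print("  true effects:", [round(t, 1) for t in theta])
    print("  estimates:   ", [round(v, 1) for v in y])
```

Since the true effects are known in each simulated data set, any fitted model can be scored on how well it recovers them, which is not possible with the real eight schools data.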

]]>1) Demonstrate how well an algorithm works for sampling from a posterior distribution?

or

2) Demonstrate in about the simplest form possible how and why one should build a hierarchical measurement error model from real world considerations?

It seems to me your complaint is primarily about 8 schools as an instance of (1), not 8 schools as an instance of (2), and yet it also seems to me that 8 schools is primarily aimed at (2), which is an educational objective, not an algorithmic one.

]]>