When did the use of stylized data examples become standard practice in statistical research and teaching?

In our discussion of the statistician Carl Morris, Phil pointed to an interview of Morris by Jim Albert from 2014 which contained this passage:

In the spring of 1970, some years after our Stanford graduations, we talked one evening outside the statistics department at Stanford and decided to write a paper together. What should it be about? Brad [Efron] suggested, “Let’s work on Stein’s estimator.” Because so few understood it back then, and because we both admired Charles Stein so much for his genius and his humanity, we chose this topic, hoping we could honor him by showing that his estimator could work well with real data.

Stein already had proved remarkable theorems about the dominance of his shrinkage estimators over the sample mean vector, but there also needed to be a really convincing applied example. For that, we chose baseball batting average data because we not only could use the batting averages of the players early in the season, but because we also later could observe how those batters fared for the season’s remainder—a much longer period of time.

What struck me about this quote was that there was such a long delay between the theoretical work and “a really convincing applied example.” Also, the “applied example” was only kind of applied. Yeah, sure, it was real data, and it addressed a real problem—assessing performance based on noisy information—but it was what might best be called a stylized data example.
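The flavor of Stein's shrinkage idea can be conveyed in a few lines. Here's a minimal sketch of the positive-part James-Stein estimator on simulated data (known unit variance, shrinkage toward the grand mean; the numbers are illustrative, not Efron and Morris's actual batting-average data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: k true means, one noisy observation of each
# (variance assumed known, as in the classic James-Stein setting).
k, sigma = 10, 1.0
theta = rng.normal(0.0, 2.0, size=k)   # unobserved true means
y = rng.normal(theta, sigma)           # observed data (the raw estimates)

# Positive-part James-Stein estimator, shrinking toward the grand mean:
#   theta_hat_i = ybar + c * (y_i - ybar),
#   c = max(0, 1 - (k - 3) * sigma^2 / S),  S = sum((y_i - ybar)^2)
ybar = y.mean()
S = np.sum((y - ybar) ** 2)
c = max(0.0, 1.0 - (k - 3) * sigma**2 / S)
theta_hat = ybar + c * (y - ybar)

print(f"shrinkage factor c = {c:.3f}")
print("raw squared error:", np.sum((y - theta) ** 2))
print("JS  squared error:", np.sum((theta_hat - theta) ** 2))
```

Each estimate is pulled part of the way toward the grand mean; Stein's theorem says the total squared error is smaller in expectation, though of course not in every realization.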

Don’t get me wrong; I think stylized data examples are great. Here are some other instances of stylized data examples in statistics:
– The 8 schools
– The Minnesota radon survey
– The Bangladesh arsenic survey
– Forecasting the 1992 presidential election
– The speed-of-light measurements

What do these and many other examples have in common, besides the fact that my colleagues and I used them to demonstrate methods in our books?

They are all real data, they are all related to real applied problems (in education research, environmental hazards, political science, and physics) and real statistical problems (estimating causal effects, small-area estimation, decision making under uncertainty, hierarchical forecasting, model checking), and they’re all kind of artificial, typically using only a small amount of the relevant information for the problem at hand.

Still, I’ve found stylized data examples to be very helpful, perhaps for similar reasons as Efron and Morris:

1. The realness of the problem helps sustain our intuition and also gives a sense of real progress being made by new methods, in a way that is more understandable and convincing than, say, a reduction in mean squared error.

2. The data are real and so we can be surprised sometimes! This is related to the idea of good stories being immutable.

Indeed, sometimes researchers demonstrate their methods with stylized data examples and the result is not convincing. Here’s an example from a few years ago, where a colleague and I expressed skepticism about a certain method that had been demonstrated on two social-science examples. I was bothered by both examples, and indeed my problems with these examples gave me more understanding as to why I didn’t like the method. So the stylized data examples were useful here too, even if not the way the original author intended.

In section 2 of this article from 2014 I discussed different “ways of knowing” in statistics:

How do we decide to believe in the e↵ectiveness of a statistical method? Here are a few potential sources of evidence (I leave the list unnumbered so as not to imply any order of priority):

• Mathematical theory (e.g., coherence of inference or convergence)
• Computer simulations (e.g., demonstrating approximate coverage of interval estimates under some range of deviations from an assumed model)
• Solutions to toy problems (e.g., comparing the partial pooling estimate for the eight schools to the no pooling or complete pooling estimates)
• Improved performance on benchmark problems (e.g., getting better predictions for the Boston Housing Data)
• Cross-validation and external validation of predictions
• Success as recognized in a field of application (e.g., our estimates of the incumbency advantage in congressional elections)
• Success in the marketplace (under the theory that if people are willing to pay for something, it is likely to have something to offer)

None of these is enough on its own. Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don’t necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled.

The very imperfections of each of these sorts of evidence gives a clue as to why it makes sense to care about all of them. We can’t know for sure so it makes sense to have many ways of knowing. . . .

For more thoughts on this topic, see this follow-up paper with Keith O’Rourke.

In the above list of bullet points I described the 8 schools as a “toy problem,” but now I’m more inclined to call it a stylized data example. “Toy” isn’t quite right; these are data from real students in real schools!
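Since the 8 schools come up so often, here is a minimal sketch of the three estimates mentioned in the bullet points above, using the effects and standard errors from Rubin (1981). The between-school scale tau is fixed at an illustrative value here; a real hierarchical analysis would estimate it (e.g., in Stan):

```python
import numpy as np

# The classic 8-schools data (Rubin 1981): estimated treatment effects
# of SAT coaching in eight schools, with their standard errors.
y     = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

# No pooling: each school's estimate is its own observed effect.
no_pool = y

# Complete pooling: one precision-weighted common effect (about 7.7).
w = 1.0 / sigma**2
complete_pool = np.sum(w * y) / np.sum(w)

# Partial pooling, with the between-school scale tau fixed at an
# illustrative value and shrinkage toward the complete-pooling
# estimate, as a simplification of the full hierarchical model.
tau = 8.0
shrink = (1.0 / sigma**2) / (1.0 / sigma**2 + 1.0 / tau**2)
partial_pool = shrink * y + (1.0 - shrink) * complete_pool

print(np.round(partial_pool, 1))
```

The partial-pooling estimates land between the two extremes for every school, with noisier schools (larger sigma) shrunk more toward the common effect.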

Let me also distinguish stylized data examples from numerical illustrations of a method that happen to use real data. Introductory statistics books are full of examples like that. You’re at the chapter on the t test or whatever and they demonstrate it with data from some experiment in the literature. “Real data,” yes, but not really a “real example” in that there’s no engagement with the applied context; the data are just there to show how the method works. In contrast, the intro stat book by Llaudet and Imai uses what I’d call real examples. Still with the edges smoothed, but what I’d call legit stylized data examples.

It’s my impression that the use of stylized data examples has been standard in statistics research and education for a while. Not always, but often enough that it’s not a surprise to see them. The remark by Carl Morris in that interview makes me think that this represents a change, that things were different 50 years ago and before.

And I guess that’s right—there really has been a change. When I think of the first statistics course I took in college, the data were all either completely fake or they were numerical illustrations. And even Tukey’s classic EDA book from 1977 is full of who-cares examples like the monthly temperature series in Yuma, Arizona. At that point, Tukey had decades of experience with real problems and real data in all sorts of application areas—yet when writing one of his most enduring works, he went with who-cares data. Why? I think it’s because that’s how it was done back then. You have your theory, you have your methods, and the point of the methods research article or book is to show how to do it, full stop. In the tradition of Snedecor and Cochran’s classic book on statistical methods. Different methods but the same general approach. But something changed, and maybe the 1970s was the pivotal period. Maybe the Steve Stiglers of the future can figure this one out.

11 thoughts on “When did the use of stylized data examples become standard practice in statistical research and teaching?”

    • well, “↵” means form feed, so it’s also a form of writing ff, right? Would be interesting to know how this kind of mojibake happens. Did some algorithm convert a ff ligature into U+000C FORM FEED, which was then rendered as “↵”?

  1. I think stylized examples are great, but need to be chosen carefully, as they can have pitfalls as well. The more “real world” the example, the more people will bring their own insights to the problem. These would be great if we were really trying to solve the problem, but can get in the way if we’re trying to explicate a method. And of course there’s the opposite problem which some commenters mention here, that sometimes the “real worldness” of the problem is obscure, as for the benighted souls who don’t understand baseball statistics at all.

  2. I thought they’d been around pretty much forever. I just used Fisher’s iris data for the regression section of a new Getting started in Stan with Python and cmdstan—it’s actually pretty cool in that it’s three subgroups of positive-constrained data on varying scales. Speaking of Carl Morris, I put a lot of effort into replicating Efron and Morris’s classic paper in Hierarchical partial pooling for repeated binary trials (in Stan, of course). The baseball turns out to be a distraction (though I can’t recommend Jim Albert’s Curve Ball enough for those who do like baseball and want to understand how applied stats is done at a deep conceptual level without a lot of math getting in the way), but the model really applies to any repeated binary trials (e.g., Rats in clinical trials, the stylized data used for hierarchical modeling in BDA and the BUGS examples). I’ve also done a case study on the classic ODE model that must be in every introductory ODE solving class on the planet, Predator-prey population dynamics: the Lotka-Volterra model in Stan. Arguably, the holographic coherent diffraction imaging phase retrieval model that Brian Ward and I worked on with David Barmherzig is almost in this direction, too, HoloML in Stan: Low-photon Image Reconstruction. It’s also the most Star Trek-sounding thing I’ve ever been involved with. We were replicating David’s paper with Stan and then extending it to Bayes in order to highlight Stan’s new(ish) complex number and FFT capabilities.

    One thing I like about the stylized data is that it’s easy to check your work against the source. Another thing I like about it is that it gives us all a set of common reference points on which we can build. Philosophers lean on this all the time with stylized thought experiments.

    Yet another thing I like about it is that it strips away all of the real-world complexity. Andrew and I often disagree about how far to go in this direction. I like a simple binomial example (as in my new intro). Andrew tends to prefer stylized data, like 8-schools as a Hello World example. The reason I don’t like 8-schools so much is that it’s a rather sophisticated model statistically. I think trying to figure out why means and standard deviations are now data and what a hierarchical prior does is a step too far, because I found it complicated when I was first learning. Andrew’s used to dealing with a more sophisticated audience than me (I was raised by wolves in computer science), and he treats the complexity as a learning opportunity.

    I often have trouble with the audience around stylized data. They want to know more about it. What were the temperatures and rainfall for those years of cats and bunnies? Where are the school covariates like average SAT scores or whatever in the 8-schools example? When you generate purely synthetic data according to a model, this never comes up. But then neither does model misspecification and error analysis, unless you purposely build a misspecified model, as for example is often done when illustrating concepts like simulation based calibration.
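As a concrete version of the simple binomial example mentioned in this comment, here is a sketch of partial pooling for repeated binary trials (the counts and the prior sample size below are made up for illustration; a full hierarchical analysis would estimate the hyperparameters, e.g., in Stan):

```python
import numpy as np

# Hypothetical repeated-binary-trial data (made-up counts for
# illustration): successes y out of n trials for each of five units.
y = np.array([3, 10, 7, 0, 18])
n = np.array([10, 20, 10, 5, 25])

raw = y / n                 # no-pooling estimates
pooled = y.sum() / n.sum()  # complete-pooling estimate

# Partial pooling via a Beta(a, b) prior with its mean set to the
# pooled rate and an illustrative prior "sample size" a + b = 15;
# a hierarchical model would estimate these from the data.
prior_n = 15.0
a, b = pooled * prior_n, (1.0 - pooled) * prior_n
partial = (y + a) / (n + a + b)  # posterior means, shrunk toward pooled

print(np.round(partial, 3))
```

Each posterior mean is a convex combination of the unit's raw rate and the pooled rate, with small-n units (like the 0-for-5 one) pulled hardest toward the common value.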

  3. There are several overlapping but distinct terms here: stylized, toy, example, case, and probably more. I personally like examples – there are few general results when it comes to data and I only learn through application to data (usually real data, but occasionally from simulated data). Toy examples are rarely helpful (exceptions would be Anscombe’s Quartet and Andrew’s recent extension of that). But toy or stylized examples in textbooks I find particularly annoying. They generally illustrate exactly the wrong thing – they focus on the analytical technique rather than the data (measurement and preparation of the data for analysis). The problem is that students learn how a particular technique is supposed to work, but then find they can’t use it because the real data they ultimately want to analyze are not as clean as the toy example they learned from. Even worse, they learn not to think in multivariate ways – learning how a t test (sorry to use the NHST terminology) applies distracts them from the fact that there are few cases where isolating a single explanatory variable makes much sense.

    Real data seems the opposite of “stylized.” It has its own issues since each real data set is unique and lessons learned from one may not apply to others. This seems particularly true when it comes to machine learning models – the large variety of models seems to call for either a ranking of performance or a decision guide for which model to use for which type of data. I haven’t seen much progress on either (and I’m actually thankful for that, since that prevents data analysis from becoming a purely algorithmic exercise). The best protection against one example creating misleading impressions of generalization is to have more than one example.

  4. Could you say more what you mean by stylized data? This seems to be the defining feature:

    > and they’re all kind of artificial, typically using only a small amount of the relevant information for the problem at hand

    Is it stylized just because we know if this was the “real deal” we’d do more stuff and we’re simplifying things a bit?

  5. I have seen a sign in restaurants something like: “Quality Food, Low Prices, Friendly Service; Choose any 2.”

    I have thought for a while that I need something similar for examples in stats class (and possibly other subjects as well), something like:

    Examples:
    Simple (quick, short, easily understood)
    Effective (demonstrate the topic)
    Relevant (meaningful to the student)

    Choose any 2.

    Most textbook and lecture examples that I see/use choose the first 2, and I think this is what is meant by stylized data examples. I have some great projects that I have worked on that would be very relevant/interesting examples, but they would require taking 10-15 minutes away from statistics instruction to give an adequate background and/or would involve techniques outside the scope of the class. So, I don’t use many of these (or I oversimplify them), instead using the simple, effective examples like temperatures in Yuma, which are not relevant to the great majority of students.

    • Greg:

      My problem with the temperature-in-Yuma example is not that it’s irrelevant to students; it’s irrelevant as a subject of data analysis to just about everybody. It’s real data but I don’t think it’s a real example; there’s no real underlying problem, as there is in the 8 schools, the Minnesota radon survey, etc.

      • Sure, dismiss my entire profession, why don’t you?

        And you try living in Yuma in July and see if you still think there’s no real underlying problem with temperatures there.

        Meanwhile, did a quick Google of “Tukey Yuma” to remind myself what analysis was done with the data. Google, not believing me, returned:
        2. Yuma Turkey Trot
        5. Valley Meat Co., Yuma, Arizona
        6. Yuma Community Food Bank Thanksgiving Food Drive
        8. Cheap flights from Istanbul to Yuma, AZ starting at £376

        I confess it took me way too long to figure out the last one.

  6. I really appreciate these data settings being given a name – distinguishing “stylized data” examples from real data examples would clear things up tremendously. When learning statistics or even reading statistics papers, it can be annoying to see this disconnect. In stylized data examples, all the complexities of the data and science are ignored except the part that is relevant for the statistical method being discussed. Then when the method is applied, there’s a conclusion stated definitively (e.g. “according to this analysis, treatment A was X amount more effective than treatment B”) when really such a conclusion is misleading without addressing the other concerns about the data assumed away for simplicity.

    Relatedly, I think this is the reason why NHST remains a zombie in many applications: in stylized data examples you can assume that the null hypothesis theta=0 is realistic and relevant even if it is not in reality. This is not to say that stylized data examples don’t have their place, of course; I agree with everyone else here who finds these useful. But it should be clear when we are and are not being totally serious about the data/application.
