Rasmus Bååth reports the following fun story in a blog post, The source of the cake dataset (it’s a hierarchical modeling example included with the R package lme4).
Rasmus writes,
While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938).”
The search is on.
… after a fair bit of flustered searching, I realized that this scholarly work, despite its obvious relevance to society, was nowhere to be found online.
The plot thickens like cake batter until Megan N. O’Donnell, a reference librarian (officially, Research Data Services Lead!) at Iowa State, the source of the original, gets involved. She replies to Rasmus’s query,
Sorry for the delay — I got caught up in a deadline. The scan came out fairly well, but page 16 is partially cut off. I’ll put in a request to have it professionally scanned, but that will take some time. Hopefully this will do for now.
Rasmus concludes,
She (the busy Research Data Services Lead with a looming deadline) is apologizing to me (the random Swede with an eccentric cake thesis digitization request) that it took a few days to get me everything I asked for!?
Reference librarians are amazing! Read the whole story and download the actual manuscript from Rasmus’s original blog post. The details of the experimental design are quite interesting, including the device used to measure cake breakage angle, a photo of which is included in the post.
I think it’d be fun to organize a class around generating new, small scale and amusing data sets like this one. Maybe it sounds like more fun than it would actually be—data collection is grueling. Andrew says he’s getting tired of teaching data communication, and he’s been talking a lot more about the importance of data collection on the blog, so maybe next year…
P.S. On a related note, there’s something called a baath cake that’s popular in Goa, and it confused my web search.
Hey—that reminds me of the Ticket to Baaaath story (which you commented on!).
Nice, but there is more mystery, imo. Who added this odd dataset to lme4, and what triggered that?
Earliest I found (just clicking around) was lme4_0.9975-13 added 2007-02-11 13:41:
https://cran.r-project.org/src/contrib/Archive/lme4/
It seems they moved from subversion to git in 2011, but I didn’t find a mention in the changelog from the link above.
I don’t have access to it, but from my understanding the cake dataset was a worked example in Cochran, W. G., and Cox, G. M. (1957) Experimental designs, 2nd Ed. So I guess it was lifted from there. Would be interesting to know what model was applied to it in that book.
https://books.google.se/books?redir_esc=y&id=VPcYAQAAIAAJ&focus=searchwithinvolume&q=cake
I’m 99.9% sure it was Doug Bates. (And I support Rasmus Bååth’s guess about why/where it came from.)
Doug is a great statistician, but not always perfect at keeping the contexts of examples, e.g. https://github.com/lme4/lme4/issues/615
I also spent a while digging around for the original source of the ‘contraception’ data set that Doug often uses as an example. I think the data are originally taken from here: https://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#bang which says they “come from the 1988 Bangladesh Fertility Survey”. The data set is also available from Stata at https://www.stata-press.com/data/r18/bangladesh.dta , but I don’t know how/when the raw data were acquired; I haven’t tried to get ahold of the original Huq and Cleland 1990 report (I’m not as dedicated as Rasmus, and I’m not looking for interesting new information – just curious to confirm provenance …)
Also, could someone find me the raw data from the 8 schools experiment in Bayesian Data Analysis?
There we refer to Rubin (1981), which in turn refers to Alderman and Powers (1979):
Alderman, D. & Powers, D. The effects of special preparation on SAT-Verbal scores. Research Report 79-1. Princeton, N.J.: Educational Testing Service, 1979.
A quick Google search turns up The effects of special preparation on SAT-Verbal scores, by Donald L. Alderman and Donald E. Powers (1980), American Educational Research Journal, Vol. 17, No. 2, pp. 239-251, which must be the published version of that report.
Lots of Donalds on this one, huh?
Anyway, Alderman and Powers (1980) has a bunch of tables with lots more information beyond what was used in Rubin (1981) and BDA, but not the raw data on the individual students. Many years ago I asked Rubin if he had the raw data but he wasn’t able to find anything.
Such data sets are prevalent in teaching design of experiments.
See
Kenett, R. S., & Steinberg, D. M. (1987). Some experiences teaching factorial design in introductory statistics courses. Journal of Applied Statistics, 14(3), 219-227.
Hunter, W. G. (1977). Some ideas about teaching design of experiments, with 25 examples of experiments conducted by students. The American Statistician, 31(1), 12-17.
When I was first learning Stan, I used a dataset I had collected on the taste of bay leaves. Cake is better!
https://kaplanas.github.io/Bay-Leaf-Experiment/bay_leaf_analysis.html
I think that was a really neat study! And I agree with the conclusion that it doesn’t really matter whether the bay leaf has that much of an impact; you’ll add it anyway, because that’s part of the ritual 😄
I’ve been trying to run down something in nlme and running the Orange data over and over. So now you got me going and I looked and the documentation says Draper and Smith, third edition problem 24N. I just pulled the second edition from my shelf and there is no chapter 24! They went from 10 to 26 between editions.