This is not new–it just happened to come up in class the other day and I thought it was worth saying again. I made the point in my 2003 article, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing, and in chapter 6 of Bayesian Data Analysis, back in 1995.
Here’s an explainer from 2010:
So-called exploratory and confirmatory methods are not in opposition (as is commonly assumed) but rather go together. The history on this is that “confirmatory data analysis” refers to p-values, while “exploratory data analysis” is all about graphs, but both these approaches are ways of checking models. Bayesians should embrace graphical displays of data—which I interpret as visual posterior predictive checks—rather than, as is typical, treating exploratory data analysis as something to be done quickly before getting to the real work of modeling.
Here’s how I see things usually going in a work of applied statistics:
Step 1: Exploratory data analysis. Some plots of raw data, possibly used to determine a transformation.
Step 2: The main analysis—maybe model-based, maybe non-parametric, whatever. It is typically focused, not exploratory.
Step 3: That’s it.
I have a big problem with Step 3 (as maybe you could tell already). Sometimes you’ll also see some conventional model checks such as chi-squared tests or qq plots, but rarely anything exploratory. Which is really too bad, considering that a good model can make exploratory data analysis much more effective and, conversely, I’ll understand and trust a model a lot more after seeing it displayed graphically along with data.
There’s some history to all of this. John Tukey’s classic book, Exploratory Data Analysis, was published in 1977 but I believe he began to work in that area in the 1960s, about ten or fifteen years after doing his also extremely influential work on multiple comparisons (that is, confirmatory data analysis). I’ve always assumed that Tukey was finding p-values to be too limited a tool for doing serious applied statistics–something like playing the piano with mittens. I’m sure Tukey was super-clever at using the methods he had to learn from data, but it must have come to him that he was getting the most from his graphical displays of p-values and the like, rather than from their Type 1 and Type 2 error probabilities that he’d previously focused so strongly on. From there it was perhaps natural to ditch the p-values and the models entirely–as I’ve written before, I think Tukey went a bit too far in this particular direction–and see what he could learn by plotting raw data. This turned out to be an extremely fruitful direction for researchers, and followers in the Tukey tradition–I’m thinking of statisticians such as Bill Cleveland, Howard Wainer, Andreas Buja, Diane Cook, Antony Unwin, etc.–continued to make progress here. More recently, this has overlapped with work by Hadley Wickham and others on EDA-friendly graphics software and work by Jessica Hullman and others on the communication of uncertainty and variation.
The actual methods and case studies in the EDA book . . . well, that’s another story. Hanging rootograms, stem-and-leaf plots, goofy plots of interactions, the January temperature in Yuma, Nevada—all of this is best forgotten or, at best, remembered as an inspiration for important later work. Tukey was a compelling writer, though–I’ll give him that. I read Exploratory Data Analysis twenty-five years ago and was captivated. At some point I escaped its spell and asked myself why I should care about the temperature in Yuma–but, at the time, it all made perfect sense. Even more so once I realized that his methods are ultimately model-based and can be even more effective if understood in that way (a point that I became dimly aware of while completing my Ph.D. thesis in 1990–when I realized that the model I’d spent two years working on didn’t actually fit my data–and which I first formalized at a conference talk in 1997 and published in 2003 and 2004. It’s funny how slowly these ideas develop.).
It’s funny about Tukey’s work on multiple comparisons. He wrote an entire unpublished book on the topic, along with many short research articles. Individually each of these articles is readable and compelling, but, stepping back, I see how it’s all based on a foolish familywise error framework.
Here’s what I think happened: Tukey was following some version of the operations-research approach to statistics associated with Wald. It makes sense–this was the framework that they used to win the second world war. And it was pretty much the only game in town (yes, there was also Bayes, but for historical reasons Bayesian methods got no respect back then). Tukey was brilliant, a great problem solver, co-inventor of the fast Fourier transform and lots of other things, but for whatever reason he didn’t apply his depth, breadth, and creativity toward thinking about the fundamentals. I guess you’d call him a fox. In the usual telling, the fox is the hero and the hedgehog is the boring obsessive, and it’s fair to say that the world benefited from Tukey’s foxness, his interest in developing new methods and solving problems rather than refactoring the foundations of statistics–but I think that his lack of depth in that area contributed to him wasting a lot of time and effort on multiple comparisons, to the extent that his only way forward was to rip it all up and start again with EDA.
From a modern perspective, EDA and CDA are the same thing: they’re both ways of comparing observed data to hypothetical replications under a model. Recall the goal of EDA to discover the unexpected: “the unexpected” is defined relative to “the expected,” hence EDA is model checking just as CDA is, with the only differences being: (a) in EDA the display is visual rather than numerical, (b) in EDA the reference model is often defined implicitly rather than explicitly.
Indeed, the “news you can use” aspects of this post–and of my general point about EDA being the same as CDA–are:
1. If you’re gonna do EDA, make your reference model as explicit as possible. The clearer your assumptions, the better you can find problems. It’s Popper–or, really, Lakatos–in action.
2. If you’re gonna fit complex models (which we’re doing more and more of in statistics and machine learning), EDA is more important than ever. EDA is not a set of qq plots you make before getting to the serious bit of modeling; it’s a key step in workflow. I frame this Bayesianly as this is the simplest way for me to do the work, but you can do non-Bayesian versions as long as you have generative models for your data, and as long as your methods are flexible enough to accommodate different sources of information. (Recall the most important aspect of a statistical method.)
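To make the "hypothetical replications under a model" idea concrete, here is a minimal sketch of a posterior predictive check in plain Python. Everything specific in it is an assumption for illustration and not something from the papers cited above: the data are simulated with one planted outlier, the reference model is a normal distribution with a noninformative prior, and the function name posterior_predictive_check is hypothetical. The replicated test statistics it returns are what you would plot, with the observed value marked, to get the graphical (EDA) version of the same check.

```python
import random
import statistics


def posterior_predictive_check(y, stat, n_rep=2000, seed=0):
    """Compare stat(y) with stat(y_rep) for data replicated under a normal model.

    Assumed reference model: y_i ~ Normal(mu, sigma) with the standard
    noninformative prior p(mu, sigma^2) proportional to 1/sigma^2.
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    s2 = statistics.variance(y)  # sample variance with n - 1 in the denominator
    t_obs = stat(y)
    t_rep = []
    for _ in range(n_rep):
        # Posterior draw of sigma^2: (n - 1) * s2 / chi-square with n - 1 df.
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        sigma = ((n - 1) * s2 / chi2) ** 0.5
        # Posterior draw of mu given sigma.
        mu = rng.gauss(ybar, sigma / n ** 0.5)
        # One replicated dataset of the same size, and its test statistic.
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]
        t_rep.append(stat(y_rep))
    # Share of replications at least as extreme as the observed statistic.
    p_value = sum(t >= t_obs for t in t_rep) / n_rep
    return t_obs, t_rep, p_value


# Hypothetical data: 30 well-behaved points plus one planted outlier.
data_rng = random.Random(1)
y_obs = [data_rng.gauss(0.0, 1.0) for _ in range(30)] + [12.0]

t_obs, t_rep, p_value = posterior_predictive_check(y_obs, max)
print(f"observed max = {t_obs:.1f}, share of replications >= observed = {p_value:.4f}")
```

The numerical summary (a tail-area probability) and the picture (a histogram of t_rep with t_obs marked) come from the same set of replications, which is the sense in which the exploratory and confirmatory versions are one procedure.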
Unfortunately, Tukey was stuck in an old-fashioned statistical framework–good enough to solve operations research problems in the war, but not strong enough for all the applied problems he and others were encountering in the 1960s and later–so, for historical reasons, EDA and CDA were perceived as opposites. Which is too bad. Hence this post, which repeats things that my colleagues and I have been saying for 30 years, and which was implicit in the work of Rubin, Box, and others for longer than that.
I have a blog post on the same themes; I’d be interested in feedback or ideas to incorporate in a revision: https://jdonland.github.io/posts/crossword_times/crossword_times.html
The thesis is that many elements of data visualization are implicitly models, and it behooves one to at least think about whether those models are appropriate for the phenomenon being modeled. I show posterior predictive violin plots based on a toy Bayesian model as an alternative to the LOESS-based smoothing ggplot2 produces by default.
We wrote a vis-centric paper in 2021 that you might like: https://hdsr.mitpress.mit.edu/pub/w075glo6
I’ll check it out!
Two observations.
First, apparently Tukey spent a lot of time doing data analysis. He wrote “The Future of Data Analysis” in about 1960. (Ann. Math. Statist. 33(1): 1-67 (March, 1962). DOI: 10.1214/aoms/1177704711)
As I recall (it’s been decades since I looked at EDA), many of the techniques described were manual calculations that could be performed relatively easily and quickly. Such methods have the advantages (1) the analyst is actually looking at the data and (2) one does not have to wait for a computer run to be done or engage in long and complex calculations on a desk calculator. Advantage 2 does not matter today.
Second, I fail to understand your assertion “From a modern perspective, EDA and CDA are the same thing: they’re both ways of comparing observed data to hypothetical replications under a model.”
If I take some data (pairs x,y) and plot that data I might see something that looks like a line at a 30 degree angle or something that looks like a sine function. The underlying models associated with straight lines and sine functions are probably different. Sure, you have some rough model that the x-values are related to the y-values but the EDA process lets you focus on a more detailed model.
Tukey, in the article cited above, made much the same point:
In connection with stochastic process data we noted the advantages of techniques which were revealing in terms of many different models. This is a topic which deserves special attention.
If one technique of data analysis were to be exalted above all others for its ability to be revealing to the mind in connection with each of many different models, there is little doubt which one would be chosen. The simple graph has brought more information to the data analyst’s mind than any other device. It specializes in providing indications of unexpected phenomena.
Bob76:
Tukey definitely did a lot of data analysis! As I wrote in the above post, I just don’t think he was sophisticated in his thinking about fundamentals. He knew enough to abandon the familywise-error-rate hypothesis testing framework but was not quite up to the ideas of Box etc. on model checking for Bayesian inference. That’s all fine–his EDA work was inspiring and was an important step in the path to our current modern synthesis.
For a long time, Tukey’s group had a contract with Consumer Reports to do a lot of the design and analysis of their testing. Some of that moved up to Yale when Bill Eddy and (I am going to get this name all wrong as my memory is getting bad) Deccon Bancroft did their graduate work up at Yale – the latter eventually became the head of Consumer Reports’s testing for many years. I remember them in the Computer Center printing out all the graphics and tables (ah yes, the wonderful era of computer centers, sort of reminds one of the cloud). I believe John Hartigan also had some influence on what they did.
Roy:
Yes, I remember that Consumer Reports used multiple comparisons methods. In retrospect I don’t think that made a lot of sense–that just happened to be the methods that Tukey was working on in the 1950s. From a modern perspective, I think multilevel modeling works much better. In practice I don’t think it really made any difference; I just think the multiple comparisons approach is more complicated and gives worse results.
Similar to the post, I’ve found it easier to distinguish between EDA and CDA by their goals rather than methods. Statistics can be used to detect potentially meaningful patterns in data (EDA) and it can also be used to quantify the degree to which someone should believe that pattern is meaningful (CDA). My impression is that the tension between these approaches arises from a belief that you can’t do both of those things at the same time.
This perspective makes no logical sense to me: as you say, “discovery” represents some deviation from expectation, implying a model and some way of quantifying the discrepancy between data and model predictions. So why would the evidence needed to convince yourself that you’ve discovered a pattern be necessarily different from the evidence needed to convince someone else of that pattern? However, I think the ritualistic way in which most scientists are taught to perform statistics makes it easy for us to fall for superstitions of this kind.
This is a nice perspective to read about. I never understood what was the big deal with the distinction between “exploratory” and “confirmatory”, and maybe with reason.
I’m wondering though, if Tukey himself abandoned multiple comparisons 60 years ago, why do we still see them in many papers today? Does it reduce to the more widespread issue of p-values and decision-flavored statistics?
Luciano:
If, for whatever reason, you’ve committed to using least squares estimates with no partial pooling, then you will need some sort of multiple comparisons adjustments, and, from that perspective, Tukey’s methods were not all so bad. Once you allow yourself to do partial pooling, multiple comparisons adjustments are no longer needed (see our 2012 paper, Why we (usually) don’t have to worry about multiple comparisons), but back in the 1950s-1960s, not so many researchers knew about partial pooling, and even those who knew (such as Tukey!) often viewed it as slightly illegitimate. Even now, 60 years later, many statisticians and applied researchers have that attitude. And they have their methods, including multiple comparisons corrections. The way things go is that if applied researchers have a method, they can adapt their research to the method and get reasonable results. So I’m not saying that people who are using methods that I think are out of date are necessarily doing bad science–the method can work for them!
Thanks Andrew. I’ve been following the blog long enough to know of pooling as an alternative, and I will check the paper. Reassuringly, old doesn’t necessarily mean bad, but if I have the option, I’d rather do otherwise!
I am not a statistician, but having read your blog for a long time and read about Tukey, it seems that a lot of his work with the NSA and in election forecasting was Bayesian. Is there a good biography of his work over the years? It seems like some of his most important work might have been veiled in secrecy.
David:
A quick web search turns up this mini-biography of Tukey. Regarding his commercial and classified work: I’m sure it was very high quality. Unfortunately when work cannot be published it does not always get fully developed. It’s not just that the work is secret, so outsiders don’t get to see it. It’s also that, without publication and outside scrutiny, it’s harder to develop an idea. One way that multilevel models advanced during the 1970s and 1980s was by a series of applied and theoretical papers by Lindley, Efron, Morris, Rubin, Dempster, and others. I don’t think any of these researchers would’ve taken us where we needed to go all on their own.
Thank you very much, Andrew. I read the mini-biography, and I’m even more impressed now with all the areas to which he contributed and how incredibly prolific he was. I love that his brother-in-law was a Bayesian on the same faculty at Princeton. And it alludes to the fact that they used Naive Bayes in the NBC election forecasting work in the early 1960’s.
I take your point on the purpose of work in the public vs. in secret.
There are some hints of Bayesian statistics in what is known of the work of Turing at Bletchley Park, and I note that I. J. Good had worked there. If we imagine that NSA is trying to recover a cryptovariable from cipher text, this is ideally suited to Bayesian methods, because there is an obvious prior: take all possible cryptovariables to be of equal prior probability. So our imaginary cryptanalyst might routinely be a Bayesian, but they are not accumulating evidence that Bayesian statistics is useful in cases where there is no obvious prior.
The main point of Bayes rule is the normalization of the posterior. I.e., probability is a relative measure, and what you put in that denominator is of utmost importance.* The priors are more like a minor adjustment and can all cancel out for approximate cases.
Likewise the main problem with “frequentist” methods is they try to consider each hypothesis/model in isolation. Or else they limit themselves to disjunctions we know the answer to beforehand (either the model is exactly correct or it’s not… it is not) that don’t map to any real world problem.
* The denominator is sum( p(H[0:n])*p(D|H[0:n]) ). I.e., if you only consider the possibility the earth is flat, then that has probability = 1. But once someone mentions “maybe it’s a sphere”, you add that in the denominator and p_flat drops way down due to the lower likelihood (worse fit to data). NB: this does not require the earth is exactly a sphere!
NB #2: The true probability is some kind of “string theory” calculation where out of N = 10^500 possible universes the Earth is flat in x of them, then p = x/N. Eg: https://en.wikipedia.org/wiki/Dichronauts
Our Bayesian probability can only approximate this intractable calculation since the denominator is necessarily incomplete in practice.
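The flat-versus-sphere arithmetic in this comment can be sketched in a few lines of Python. The function name posterior and all the numbers below are made up for illustration; nothing here is an estimate of anything. The only point is how the posterior for one hypothesis changes when a second hypothesis enters the denominator.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses.

    priors and likelihoods are dicts keyed by hypothesis name. The
    denominator is the sum of p(H) * p(D | H) over the hypotheses
    actually listed, so it is only as complete as that list.
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}


# Illustrative, invented likelihoods: the data fit "flat" badly, "sphere" well.
likelihoods = {"flat": 1e-6, "sphere": 1e-2}

# With one hypothesis on the table, it gets probability 1 however badly it fits.
only_flat = posterior({"flat": 1.0}, {"flat": likelihoods["flat"]})

# Adding "sphere" with an equal prior moves almost all the mass away from "flat".
with_sphere = posterior({"flat": 0.5, "sphere": 0.5}, likelihoods)

print(only_flat)
print(with_sphere)
```

With equal priors the prior terms cancel and the posterior odds are just the likelihood ratio, which is the sense in which the priors act as a minor adjustment here.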
I believe a number of sources state quite explicitly that Bletchley used Bayesian methods. ISTR that I. J. Good once stated that he was frustrated that, around 1960 or so, when he was confronted with statements like, “If Bayesian reasoning is so good, why doesn’t anyone use it successfully?” he could not describe the work at Bletchley.
Here’s what looks like a pretty good article on Good, Bletchley, and Bayes: https://projecteuclid.org/journals/statistical-science/volume-38/issue-2/The-Secret-Life-of-I-J-Good/10.1214/22-STS870.full
Here’s a quote from that paper:
The Bayes factor played a central role in Turing’s overall philosophy of cryptanalysis. This was not confined to the special case of the attack on the Enigma, as became clear several decades later in 2012 (the centenary of Turing’s birth), when GCHQ declassified and released a paper Turing had written during the war, “The Applications of Probability to Cryptography” (Turing, 2012a). In this remarkable document (intended for newcomers to Bletchley or at least the Bayesian approach), Turing illustrated how Bayesian methods could be applied to four distinct classical cryptanalytic problems (the Vigenère cipher, a letter subtractor problem, the theory of repeats and transposition ciphers). In each case, an attack based on Bayes factors was described, together with a discussion of how the computations needed for the attack could be carried out in practice.
Turing’s 1941-42 report The Applications of Probability to Cryptography is available at
https://discovery.nationalarchives.gov.uk/details/r/C11510465
Written in 1941 or 1942, it appears to have been declassified about seventy years later.
Yes, we discussed some of that in our recent post on Good.
That site appears to require login, here is another: https://arxiv.org/abs/1505.04714
He describes dealing with a combinatorics problem that was intractable (at least at the time). He then assumes an encryption model and plugs stats from observed messages (likelihood), guesstimates (prior), and some auxiliary assumptions (that the probability other models were used is negligible) into Bayes rule, which simplifies the problem to Bayes factors.
In 2022, I got John Chambers to provide an oral history for the Computer History Museum.
This has links to both video and transcript. He mentioned John Tukey often, so there may be some insights.
He gave me the birthdate of S as May 5, 1976, at which point it was a bunch of slides, then it took a while for software to happen.
https://www.computerhistory.org/collections/search/?s=CHAMBERS+MASHEY
I’ve long wondered how Tukey’s EDA book would have looked if it had been done *after* S was there.
John:
I always assumed that Tukey’s EDA work was the inspiration for much of S, in its graphics and also its interactive, you-can-try-anything nature.
Agreed, as I think is supported by JC’s interview.
Tukey was already well along writing the EDA book (I still have my copy!) when JC&co proposed building S, so it was “in the air.”
On reflection, I guess I really wished for a 2nd edition about 10 years later, after S was well-established & less use of paper-and-pencil methods. Of course, I have William Cleveland’s books from the early 1990s, which in some sense did that.
(As you’d likely know, but others may not:
Chambers and Cleveland were in adjacent departments (which in Research were typically 6-10 people) in Henry Pollak’s Mathematics and Statistics Research Center, part of Executive Director RC Prim’s Research, Communications Principles Division.
Tukey had the really-rare title of Associate Executive Director … i.e., ultra-senior position, but unbothered by administration, budgets, etc. :-)
By somewhat odd coincidence, as a high school junior in 1962, I’d attended a National Youth Conference on the Atom in Chicago, where one of the speakers was Henry Pollak. The attending students could move around and sample talks, but fairly quickly, half were listening to him. He gave a great talk about the importance of theory and practice, first describing the classic Minimal Spanning Tree graph theory problem, then posing the question “What good is this?” and answering by saying it saves a lot of money in deciding where to build network links. Anyway, Pollak’s center was a very strong organization, with many other good researchers.)
As far as I know Tukey cared a lot for the foundations (which I’d think is part of what you call “the fundamentals”), and he made contributions, some of which are very important to me. He wrote “More honest foundations for data analysis” https://www.sciencedirect.com/science/article/abs/pii/S0378375896000328 very late in his life, but I think it’s just wonderful. It builds on the Configural Polysampling work that was a bit before. “The future of data analysis” of 1960 has some brilliant thoughts that are still worth reading, mostly in the first section. He also was central in the beginnings of robust statistics. Although it is probably fair to say that his contributions there are not particularly systematic, he was one of the first to tell people that small deviations from assumed models may have bad consequences, and that we shouldn’t rely on models to hold and on what works optimally if the model assumptions are violated. His Collected Works have scattered small papers with remarks on foundations and things such as data analysis workflow.
You mention the EDA-related Tukey tradition. I see myself in the foundational Tukey tradition, the hallmark of which is “models are not there to be believed”. Chances are Laurie Davies would see himself in the same tradition. Not sure how much Larry Wasserman is influenced by Tukey in this respect.
Christian:
I agree that Tukey cared about foundations. When I said, “I just don’t think he was sophisticated in his thinking about fundamentals,” I was specifically referring to the fundamentals of inferential statistics. He had enough sense to see the problems with trying to fit all of statistical practice into a classical hypothesis testing framework, hence his abandonment of the multiple comparisons research, but he didn’t seem to be aware of the larger Bayesian perspectives of Box/Rubin/etc., that incorporated both inference and model checking. I’m not saying that Tukey was a fool or that he was a bad statistician; I just think he was working within an impoverished theoretical framework, so that all he could think of doing was to abandon inference entirely. Like you, I love Tukey’s “honest foundations” article as well as his article and book on EDA, and the tools he recommends there are great, but they won’t directly solve hard problems that require real number crunching. I think that Tukey’s attitude–that inference involved hypothesis testing, familywise error rates, and all the rest–was restrictive, and that now we can do better, both by applying his EDA principles to modeling and inference, and by applying modern methods of modeling and inference to EDA. That was kind of the point of my 2003 paper. And, as I assume you’re aware, I don’t think models are “there to be believed” either.
Sure, no big disagreement here between us. I believe Tukey has made comments on the Bayesian approach but unfortunately I don’t remember where right now. I think he was as sceptical about having a higher level probability model over parameters or over general “lower level” models as he was about the lower level models themselves (modelling directly the data generating process). I imagine he’d have asked immediately, OK, but now what if we base an analysis on an assumed (fully Bayesian) model, and this doesn’t hold? As he did regarding frequentist models. I don’t think he’d have thought of the Bayesian approach as a solution to the model uncertainty problem for that reason.
Christian:
The way Rubin put it to me once was as follows: When you have multiple comparisons you need to do some sort of adjustment. In the Bayesian approach, you adjust the estimates toward zero, which reduces the number of (conventionally) “statistically significant” comparisons. In the classical approach, you’re not allowed to touch the estimates, so you adjust by making the confidence intervals wider. This has the paradox that the more information you get (that is, the more comparisons you have), the wider your individual intervals become. So I think the problem that Tukey (and many others) had was not so much an aversion to probability modeling as much as being stuck in a framework in which they seemed to think that the point estimates were untouchable. One thing I appreciate about the work of Efron and Morris with empirical Bayes in the 1970s, and the work of Hastie and Tibshirani with lasso in the 1990s, was that they framed adjustment in a way that would be acceptable to non-Bayesians or anti-Bayesians.
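Here is a rough sketch of the two kinds of adjustment, in plain Python. The assumptions are mine and are chosen to keep the arithmetic short: every comparison has the same known standard error, the classical adjustment is a Bonferroni correction (one familiar choice, not Tukey’s own procedures), and the partial pooling is a crude empirical Bayes estimate with the prior centered at zero and its variance set by the method of moments. The estimates are invented and the function names are hypothetical.

```python
from statistics import NormalDist


def bonferroni_halfwidth(se, n_comparisons, alpha=0.05):
    """Half-width of a Bonferroni-adjusted interval: the estimate is left
    alone and the interval widens as the number of comparisons grows."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))
    return z * se


def partial_pooling(estimates, se):
    """Shrink raw estimates toward zero under theta_j ~ Normal(0, tau^2).

    tau^2 comes from a method-of-moments estimate, a deliberate
    simplification for this sketch, not a full multilevel model.
    """
    mean_square = sum(y * y for y in estimates) / len(estimates)
    tau2 = max(0.0, mean_square - se ** 2)
    shrink = tau2 / (tau2 + se ** 2)  # 0 = pool completely, 1 = no pooling
    post_mean = [shrink * y for y in estimates]
    post_sd = (shrink * se ** 2) ** 0.5  # equals (1/se^2 + 1/tau^2) ** -0.5
    return post_mean, post_sd, shrink


# Invented raw comparisons, each with standard error 1.
estimates = [2.1, -0.4, 0.9, -1.3, 0.2, 1.6, -0.8, 0.5]
se = 1.0

for k in (1, 10, 100):
    print(f"{k:>3} comparisons: classical interval = estimate +/- {bonferroni_halfwidth(se, k):.2f}")

post_mean, post_sd, shrink = partial_pooling(estimates, se)
print(f"shrinkage factor = {shrink:.2f}, posterior sd = {post_sd:.2f}")
print([round(m, 2) for m in post_mean])
```

In the classical column the point estimates never move and the intervals only get wider as comparisons are added; in the partial pooling version the estimates themselves are pulled toward zero and the posterior sd is smaller than the raw standard error.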
The difference between us is probably that you met Tukey’s work via multiple comparisons and I met it via robust statistics, so it seems different parts of his work may dominate our ideas about what he was about in the first place.
A little bit of exploratory spatial modeling would reveal that Yuma is in Arizona. And almost in California. But a good ways from Nevada.
I spent a night in Yuma last summer. If you’re going to do fieldwork, January temperatures are definitely a better topic than July temperatures.
John:
Whoops! I should’ve remembered this from the classic film, 3:10 to Yuma.