Valentin Amrhein points us to a recent article, “Exploratory hypothesis tests can be more compelling than confirmatory hypothesis tests,” published in the journal Philosophical Psychology. The article, by Mark Rubin and Chris Donkin, distinguishes between “confirmatory hypothesis tests, which involve planned tests of ante hoc hypotheses” and “exploratory hypothesis tests, which involve unplanned tests of post hoc hypotheses.”
All of that reminded me of two old posts:
From 2016, Thinking more seriously about the design of exploratory studies: A manifesto
From 2010, Exploratory and confirmatory data analysis:
I use exploratory methods all the time and have thought a lot about Tukey and his book and so wanted to add a few comments.
– So-called exploratory and confirmatory methods are not in opposition (as is commonly assumed) but rather go together. The history on this is that “confirmatory data analysis” refers to p-values, while “exploratory data analysis” is all about graphs, but both these approaches are ways of checking models. I discuss this point more fully in my articles, Exploratory data analysis for complex models and A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. The latter paper is particularly relevant for the readers of this blog, I think, as it discusses why Bayesians should embrace graphical displays of data—which I interpret as visual posterior predictive checks—rather than, as is typical, treating exploratory data analysis as something to be done quickly before getting to the real work of modeling.
– Let me expand upon this point. Here’s how I see things usually going in a work of applied statistics:
Step 1: Exploratory data analysis. Some plots of raw data, possibly used to determine a transformation.
Step 2: The main analysis—maybe model-based, maybe non-parametric, whatever. It is typically focused, not exploratory.
Step 3: That’s it.
I have a big problem with Step 3 (as maybe you could tell already). Sometimes you’ll also see some conventional model checks such as chi-squared tests or qq plots, but rarely anything exploratory. Which is really too bad, considering that a good model can make exploratory data analysis much more effective and, conversely, I’ll understand and trust a model a lot more after seeing it displayed graphically along with data.
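To make the idea of a graph as a posterior predictive check concrete, here is a minimal sketch in Python. It is not taken from the articles mentioned above. It assumes a plain normal model with the standard noninformative prior, and the data, the function names `posterior_predictive_replicates` and `tail_check`, and the choice of the minimum and maximum as summaries are all made up for illustration.

```python
import random
import statistics


def posterior_predictive_replicates(y, n_rep=1000, seed=0):
    """Draw replicated data sets from a normal model fit to y.

    Assumes y_i ~ Normal(mu, sigma) with the noninformative prior
    p(mu, sigma^2) proportional to 1/sigma^2 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    s2 = statistics.variance(y)
    reps = []
    for _ in range(n_rep):
        # sigma^2 | y is scaled inverse-chi-squared with n - 1 degrees of freedom
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, y is normal, centered at the sample mean
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # one replicated data set of the same size as the observed data
        reps.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)])
    return reps


def tail_check(y, reps, stat=max):
    """Share of replicated data sets whose statistic is at least the observed one."""
    observed = stat(y)
    return sum(stat(rep) >= observed for rep in reps) / len(reps)


# Hypothetical skewed data that a normal model should fit poorly
data_rng = random.Random(1)
y = [data_rng.expovariate(1.0) for _ in range(100)]

reps = posterior_predictive_replicates(y)
p_max = tail_check(y, reps, stat=max)
p_min = tail_check(y, reps, stat=min)
print(f"share of replicates with max >= observed max: {p_max:.3f}")
print(f"share of replicates with min >= observed min: {p_min:.3f}")
```

The graphical version of the same check would plot a histogram of `y` next to histograms of a handful of the replicated data sets in `reps` and look for ways in which the real data stand out. The numerical summaries here are only a stand-in for that picture.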
– Anyone can run a regression or an Anova! Regression and Anova are easy. Graphics is hard. Maybe things will change with the software and new media—various online tools such as Gapminder make graphs that are far far better than the Excel standard, and, with the advent of blogging, hot graphs are popular on the internet. We’ve come a long way from the days in which graphs were in drab black-and-white, when you had to fight to get them into journals, and when newspaper graphics were either ugly or (in the case of USA Today) of the notoriously trivial, “What are We Snacking on Today?”, style.
Even now, though, if you’re doing research work, it’s much easier to run a plausible regression or Anova than to make a clear and informative graph. I’m an expert on this one. I’ve published thousands of graphs but created tens of thousands more that didn’t make the cut.
One problem, perhaps, is that statistics advice is typically given in terms of the one correct analysis that you should do in any particular setting. If you’re in situation A, do a two-sample t-test. In situation B, it’s Ancova; for C you should do differences-in-differences; for D the correct solution is weighted least squares, and so forth. If you’re lucky, you’ll get to make a few choices regarding selection of predictors or choice of link function, but that’s about it. And a lot of practical advice on statistics actually emphasizes how little choice you’re supposed to have—the idea that you should decide on your data analysis before gathering any data, that it’s cheating to do otherwise.
One of the difficulties with graphs is that it clearly doesn’t work that way. Default regressions and default Anovas look like real regressions and Anovas, and in many cases they actually are! Default graphics may sometimes do a solid job at conveying information that you already have (see, for example, the graphs of estimated effect sizes and odds ratios that are, I’m glad to say, becoming standard adjuncts to regression analyses published in medical and public health journals), but it usually takes a bit more thought to really learn from a graph. Even the superplot—a graph I envisioned in my head back in 2003 (!) back at the very start of our Red State, Blue State project, before doing any data analysis at all—even the superplot required a lot of tweaking to look just right.
Perhaps things will change. One of my research interests is to more closely tie graphics to modeling and to develop a default process for looking through lots of graphs in a useful way. Researchers were doing this back in the 1960s and 70s—methods for rotating point clouds on the computer, and all that—but I’m thinking of something slightly different, something more closely connected to fitted models. But right now, no, graphs are harder, not easier, than formal statistical analysis.
– To return briefly to Tukey’s extremely influential book: EDA was published in 1977 but I believe he began to work in that area in the 1960s, about ten or fifteen years after doing his also extremely influential work on multiple comparisons (that is, confirmatory data analysis). I’ve always assumed that Tukey was finding p-values to be too limited a tool for doing serious applied statistics—something like playing the piano with mittens. I’m sure Tukey was super-clever at using the methods he had to learn from data, but it must have come to him that he was getting the most from his graphical displays of p-values and the like, rather than from their Type 1 and Type 2 error probabilities that he’d previously focused so strongly on. From there it was perhaps natural to ditch the p-values and the models entirely—as I’ve written before, I think Tukey went a bit too far in this particular direction—and see what he could learn by plotting raw data. This turned out to be an extremely fruitful direction for researchers, and followers in the Tukey tradition are continuing to make progress here.
– The actual methods and case studies in the EDA book . . . well, that’s another story. Hanging rootograms, stem-and-leaf plots, goofy plots of interactions, the January temperature in Yuma, Arizona: all of this is best forgotten or, at best, remembered as an inspiration for important later work. Tukey was a compelling writer, though; I’ll give him that. I read Exploratory Data Analysis twenty-five years ago and was captivated. At some point I escaped its spell and asked myself why I should care about the temperature in Yuma, but, at the time, it all made perfect sense. Even more so once I realized that his methods are ultimately model-based and can be even more effective if understood in that way (a point that I became dimly aware of while completing my Ph.D. thesis in 1990, when I realized that the model I’d spent two years working on didn’t actually fit my data, and which I first formalized at a conference talk in 1997 and published in 2003 and 2004. It’s funny how slowly these ideas develop.).
For exploring 2-D data (i.e., x-y or time series data), I use GF4, a Python program that I call a “Waveform Calculator”. It works like an HP Reverse Polish Notation (RPN) calculator (for those who remember using them), but using 2-D data sets instead of numbers. The graphs are very clear and readable, and no programming is required. Input data can come from comma- or tab-separated files.
As a calculator, you can do basic math on a data set – add two curves, subtract, multiply, divide, integrate, differentiate, etc. FFTs that don’t have to be powers of two in length. Curve fitting and smoothing. You can overlay curves to compare them. GF4 has a function generator that can create some basic curve types, including normal curves, Gaussian CDFs, sines and damped sines, steps, ramps, Gaussian and uniform noise, and delta functions.
The program makes exploring data easy and fun, and it would be really good for students as they start to learn probability and statistics. There is a plugin mechanism, so you could add your own processing commands (which could call R and Stan routines via Python, for example).
I’ve recently open-sourced the code. It’s at
https://github.com/tbpassin/gf4-project/
The (very incomplete) User’s Guide is at
https://tompassin.net/gf4/docs/GF4_Users_Guide.html
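For readers who have not used an RPN calculator, the idea of a stack whose entries are whole curves rather than numbers can be sketched in a few lines of Python. This is not GF4's code or its API. The names `Curve` and `CurveStack` are hypothetical, and the sketch assumes that both operands of a binary operation share the same x grid.

```python
from dataclasses import dataclass


@dataclass
class Curve:
    xs: list
    ys: list


class CurveStack:
    """Hypothetical RPN stack whose entries are curves instead of numbers."""

    def __init__(self):
        self.stack = []

    def push(self, curve):
        self.stack.append(curve)

    def _binary(self, op):
        # As on an HP calculator, the curve pushed first is the left operand.
        right = self.stack.pop()
        left = self.stack.pop()
        if left.xs != right.xs:
            raise ValueError("this sketch assumes both curves share one x grid")
        ys = [op(a, b) for a, b in zip(left.ys, right.ys)]
        self.push(Curve(list(left.xs), ys))

    def add(self):
        self._binary(lambda a, b: a + b)

    def sub(self):
        self._binary(lambda a, b: a - b)

    def mul(self):
        self._binary(lambda a, b: a * b)

    def integrate(self):
        """Replace the top curve with its cumulative trapezoid-rule integral."""
        c = self.stack.pop()
        total, out = 0.0, [0.0]
        for i in range(1, len(c.xs)):
            total += 0.5 * (c.ys[i] + c.ys[i - 1]) * (c.xs[i] - c.xs[i - 1])
            out.append(total)
        self.push(Curve(list(c.xs), out))


# Usage: push a ramp and a constant, add them, then integrate the sum.
xs = [i / 100 for i in range(101)]
calc = CurveStack()
calc.push(Curve(xs, list(xs)))            # ramp y = x
calc.push(Curve(xs, [1.0] * len(xs)))     # constant y = 1
calc.add()                                # top of stack is now y = x + 1
calc.integrate()                          # running integral of x + 1 from 0
area = calc.stack[-1].ys[-1]
print(f"integral of x + 1 over [0, 1]: {area:.4f}")
```

Each operation consumes its inputs from the stack and pushes the result, so a chain of steps reads left to right with no intermediate variable names, which is what makes this style convenient for quick exploration.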
I started college in 1977 and remember that when I took statistics a couple of years in, there was a lot of Tukey in the background. I feel like a lot of his thought gets covered up by the specific graphs. Oddly to me, box plots (and stem and leaf) are considered very important in K-8 statistics, yet I find that they are very complex to teach to beginners.
I think that all of his most influential analytical work was the classified work that was done for the military and intelligence agencies. I think he was restricted in what he could publish and that maybe explains some of the way that the book leaves you wondering what now.
Perhaps another part of “exploratory analysis” should be an exploration not just of the data that’s available but the subject knowledge related to it and what data might be relevant to the question that’s not available and of course the potential scale of the impact of missing data.
Looking back at the discussion a few months ago regarding whether or not heavy traffic makes driving safer, several considerations were brought into the discussion for which no data were available but which could change the outcome of an analysis if they were available. For example, at the time, I intuited that the risk profile of drivers probably rises throughout the day – doctors and engineers work during the day; bartenders and waiters work in the evenings; and people often drink in the evenings. Another consideration is that there is possibly an effect from general tiredness. In the data we were using, the number of accidents increased with increasing time of day until after 8pm. How does one disentangle these effects?
IMO assessing subject-related questions and overall data availability – can the data I have provide sound or important information about the subject question? – is extremely important for any kind of data analysis and should be considered as early in the process as possible.
A few weeks ago I interviewed John Chambers to get oral history for the Computer History Museum. That was fun!
https://www.computerhistory.org/collections/catalog/102792763
Tukey of course was mentioned.
The record is there, but video and transcript aren’t yet edited and up.
Slightly off topic, but back in the day (the 1970s) a number of people at the Yale Stat Dept. were out of Princeton, were heavily influenced by Tukey, and they worked for Consumer Reports. I haven’t looked at Consumer Reports for a long time, but there was quite a spell where how they evaluated and presented their results was heavily influenced by Tukey (John Hartigan also). I of course was only about 5 at the time, and pass this on as lore that I have heard – I also sell bridges and marshland, when not being a giraffe.
Roy:
That makes sense. Consumers Union is headquartered in the northern suburbs of NY, so I could imagine them driving over to New Haven to talk with the folks at Yale.
My dad subscribed to Consumer Reports for decades, and I actually remember, when reading it back in the 1970s, that they would do multiple comparisons corrections! They didn’t call it that, and I had never heard the term, but I remember that they’d be rating 8 brands of vacuum cleaner or whatever, and they’d group them into batches that were not statistically different from each other. Years later in grad school I ran into an old unpublished paper of Tukey on grouping for multiple comparisons, which I thought of as the Consumer Reports problem, but I had no idea that Tukey’s colleagues or students had worked with them. Years after that, I realized that this grouping idea was fundamentally unsound. One way to see this is to imagine you are rating 1000 brands of vacuum cleaner instead of 8; then it would not make sense to divide them into clusters. More generally, ranking is a lot less stable than you might think, even if you’re Tukey. In retrospect, it’s an interesting example of a statistical method that is cleverly engineered to solve a problem that seems reasonable (to divide a bunch of ratings into clusters) but ultimately does not make sense, in this case for the fundamental reason that it is calibrated based on the nonsensical (in this case, as in most cases) null hypothesis of zero differences within groups.
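The point about ranks being unstable can be checked with a quick simulation. The sketch below is not Tukey’s grouping procedure or anything Consumer Reports did. It is a made-up setup in which each brand has a true quality drawn from a standard normal distribution and is measured once with independent normal noise, and the function name `top_rank_stability` and the noise level are assumptions chosen for illustration.

```python
import random


def top_rank_stability(n_brands, noise_sd=0.5, n_sims=300, seed=0):
    """Fraction of simulated tests in which the top-rated brand is truly the best."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Hypothetical true qualities and one noisy measurement per brand
        true_quality = [rng.gauss(0, 1) for _ in range(n_brands)]
        measured = [q + rng.gauss(0, noise_sd) for q in true_quality]
        best_true = max(range(n_brands), key=true_quality.__getitem__)
        best_measured = max(range(n_brands), key=measured.__getitem__)
        hits += best_true == best_measured
    return hits / n_sims


stability_8 = top_rank_stability(8)
stability_1000 = top_rank_stability(1000)
print(f"8 brands:    top-rated brand is truly best in {stability_8:.0%} of simulations")
print(f"1000 brands: top-rated brand is truly best in {stability_1000:.0%} of simulations")
```

With many brands the true qualities near the top are packed closely together relative to the measurement noise, so the identity of the winner becomes much less reliable than it is with a handful of brands.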
Not long ago I was contacted by Consumer Reports to see if I could help them fit multilevel models. I was concerned that it would conflict with other consulting that I do, so I said no. Had I known about the Tukey connection, maybe I would’ve found a way to do it.
Regarding Tukey et al. doing all sorts of brilliant work that they couldn’t share because of business or military secrecy: years ago, I gave a talk about some of our work in Bayesian hierarchical modeling for political science. Some statistics professor came up to me afterward and said that this was all old stuff and he’d done it with Tukey back in 1960 or something like that. I think the idea was they did it for the TV networks and didn’t publish it because it was some kind of trade secret, and I got the impression that this professor thought our research was no big deal because they’d already done the same thing, decades before. I thought about all this later, and . . . no, I don’t think that everything we did was trivial. There’s a big difference between solving one particular problem (as these dudes did in 1960) and coming up with a general solution to a class of problems. Also, it matters that a method is published. Not just for getting credit, but because when you put a method out there, people can use it (“beta-testing”) and people can criticize it. In retrospect I think this professor mistakenly underestimated hierarchical Bayesian methods: he’d successfully applied them to one problem, and instead of recognizing the importance and generality of the approach, he just decided it was trivial and he didn’t bother to think through its implications.
For the life of me I can’t remember for certain if the contract was with people at Yale or with Tukey’s group at Princeton, though I am leaning to the latter. One thing to remember is computers were not ubiquitous then, nor inexpensive – that was one of the advantages of a contract with a University.
Some of the people involved became fairly well-known statisticians; one became the long-time director of research at Consumer Reports. They were grad students then and did a lot of the analyses.
I think if you look at the history of EM there are some parallels. The chapter on Tukey in The Theory That Would Not Die is very interesting. He worked on the Nike missile project and at the Fire Control Research Office (the ballistics kind of fire).
But I agree with what you say about people saying everything has been done before.
I don’t think the discussion is paying enough attention to model uncertainty or what might be called more generally “analysis uncertainty”. For example, Andrew mentioned EDA for finding data transformations. It has been well documented in the statistical literature that if a formal analysis pretends that the resulting transformation was pre-specified, the resulting confidence bands will be too narrow. Bayesian uncertainty intervals will also be too narrow. I think that we need to be careful about what types of EDA give us false confidence.