Tukeyian uphill battles

It seems that at least once a year, I find myself begging someone to make exploratory plots of some experimental data. I say begging because I have found that often when I’m being presented with some analysis and I ask questions like Did you plot all the variables first? or Did you look at this relationship visually first? not everyone hears me. I tend to experience this with students in disciplines that stress math over statistics, like computer science and economics, but it’s not exclusively those backgrounds.

During these exchanges, I feel like I understand Tukey a bit better, and why he sometimes gets misinterpreted as advocating for some “pure EDA” ideal. I don’t believe there’s a clear divide between model-free exploratory analysis and model-driven confirmatory analysis, and I don’t really think that’s what Tukey was arguing for (as we wrote for example in this recent paper). But you only have to have your plotting requests ignored a few times to realize that you can’t just say once, in a normal tone, Hey how about plotting all the data first?, and expect everyone to put aside the more complicated stuff they want to do and make plots. You have to repeat it, again and again, probably loudly, until you seem like some sort of graphing extremist.

I suspect reasons this might happen include latent beliefs that plotting data is somehow inferior to anything that feels more like math. Another is that making plots for the sake of checking what the data look like, or discovering things you didn’t expect, seems disconnected from, and even a distraction from, the ultimate goal of fitting some model. I’m not sure this is unique to non-stats backgrounds. At least in the stats courses I took as a grad student, I don’t recall graphs or EDA being addressed outside of some class of model we had already decided we’d use, where the graphs were used for checking assumptions and fit. It also probably doesn’t help that so many empirical papers present plots as though they are some sort of nuisance to be glossed over (I’m thinking about cases where the little asterisks come out to decorate the plots, lest the reader should have to actually think about the differences they see), or don’t include them at all in favor of tables of coefficients.

More broadly, I think there’s a tendency to not want to acknowledge our own cognitive limitations as researchers, at least in fields like computer science where we like to think we work on hard technical problems. I remember as a grad student, I worked with a behavioral economist on a project and she insisted that I remake plots whenever relying on the default options produced unnecessary cognitive challenges, things like missing data being green, positive responses being red, etc., or positioning data for different conditions out of some natural order we used when we talked about them. Her reasoning was that we couldn’t afford to waste the extra effort mentally adjusting things. She was right: trying to ascertain some data generating process from a bunch of observed data is plenty complicated already, so why make it harder for yourself? But at the time I remember feeling kind of impatient, like, this is wasting my time when I have more important analyses to do.

I’m not sure what can be done to instill better intuitions about the value of plotting data in general to non-stats students. I teach the EDA process and its relationship to software in visualization courses for computer science students. They seem to get it in this context, but in a visualization class they’re incentivized to care about plots, and they have me talking directly about the dangers of modeling data you haven’t really looked at closely and how helpful plots can be for figuring out what kind of structure you’re dealing with. I don’t really know if they apply what they learn to subsequent analyses they do. I know a lot of intro stats classes start with graphing and summarizing data, and the students I teach have generally all taken some stats. But maybe once they start learning about different types of models it gets forgotten. It seems like what really needs to be instilled is the idea that there’s a data analysis workflow, and that the early phases really are essential. Also that when we analyze data we have to make a conscious effort to overcome our tendencies to think we already understand things well enough to skip those phases, or to assume that the data we’re working with will generally be trustworthy as is. Maybe teaching intro stats courses around the idea of a workflow rather than emphasizing differences between models would help.

 

36 thoughts on “Tukeyian uphill battles”

  1. I work in applied statistics and operations research, and virtually all of the statistical problems I work on revolve around making a very specific comparison between groups. The vast majority of these comparisons can be made with exploratory analysis, like graphing trends over time, boxplots, histograms, etc. When I was a student I had the idea that public policy work required huge and elaborate models, but it turns out most of the questions I get asked can be answered with a simple, well-designed graph or a table.

  2. Preach.

    These days I mostly work on my own or in a small group of very competent colleagues, but for many years I worked with grad students, postdocs, interns, and colleagues (at a national research laboratory) who didn’t have much data analysis experience, and it was often the case that they had not only not done much (or any) graphical data exploration with the data I was helping with, they were oddly reluctant to do so.

    One huge problem turned out to be: a lot of these people were relying on Excel for everything. I don’t want anyone to get the wrong idea here, so I’ll say I love Excel; it’s a fantastic program for tabulating and organizing data. For several years I was in a sports betting pool that used complicated rules; someone had encapsulated these in an Excel spreadsheet and it was great, you could click on a tab and see how many points each of us had accumulated, click another tab and see color-coded results from individual games, and so on. It was much better than anything I would have coded in R or Python or similar. Yay Excel. But: Excel is lousy for exploratory graphics, and for lots of other data analysis tasks besides.

    This was really driven home to me when I agreed to help a grad student analyze his data. The first time I met with him, of course I asked him to make a bunch of plots and to save all of the ones with interesting or unexpected features so that he and I could discuss them next time. Next time came, and he had only made a few plots. Repeat. He just would not make the damn plots! That would have been his problem alone, except I had promised his advisor that I would help with his project. So one day I went to his office, sat down next to him, and said “OK, let’s start by making a scatterplot of Y vs X, with a different symbol for each category of A”. And he went click-click-click-click, selecting columns and clicking in dialog windows, and finally came up with the plot. It was much slower than simply typing plot(a$x, a$y, pch = as.integer(as.factor(a$A))), but it wasn’t unbearable. Fine. But then I asked for Y vs Z…and the clicking started again! And then the same plot, but only for the cases where Q > 0, and again, click click click! It took half an hour to make twenty or so plots. Many things that should have been routine were cumbersome.
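
    To give a sense of what I mean by “routine”, here is a minimal sketch of those three requests as one-liners, with a made-up data frame a and made-up columns x, y, z, A, and Q standing in for his actual data:

        # fake data, just so the example is self-contained
        set.seed(1)
        a <- data.frame(x = rnorm(200), z = rnorm(200),
                        A = sample(c("low", "mid", "high"), 200, replace = TRUE),
                        Q = rnorm(200))
        a$y <- 2 * a$x + rnorm(200)

        plot(a$x, a$y, pch = as.integer(as.factor(a$A)))  # Y vs X, one symbol per category of A
        plot(a$z, a$y, pch = as.integer(as.factor(a$A)))  # Y vs Z, same symbols
        with(subset(a, Q > 0), plot(x, y))                # same plot, only the cases where Q > 0

    Each of those is a few seconds of typing.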

    Excel has improved somewhat since then, I gather, but of course so have other programs and platforms (e.g. the tidyverse). People who try to do data analysis or data science, and only use Excel for the task, are crippled when it comes to exploratory data analysis. Depending on their data and their questions they may be able to make it work, but for many problems it’s a huge handicap. And there are a _lot_ of people in this boat.

    In addition to people trying to use a poor tool, I think people who lack data analysis experience sometimes fail to understand how badly things can go wrong if they don’t look at their data. I find “Anscombe’s Quartet” to be fantastic for having this conversation…which is not surprising because that’s what Anscombe designed it for.
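
    For anyone who wants to run that demo, Anscombe’s quartet ships with base R as the anscombe data frame; something along these lines shows four data sets with nearly identical summary statistics and fitted lines but very different shapes:

        op <- par(mfrow = c(2, 2))
        for (i in 1:4) {
          x <- anscombe[[paste0("x", i)]]
          y <- anscombe[[paste0("y", i)]]
          plot(x, y, main = paste("Data set", i))
          abline(lm(y ~ x))  # essentially the same fitted line in all four panels
        }
        par(op)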

    • Thank you, Phil, well put.

      I’m an administrative judge. Sometimes our cases involve comparing things that were affected by an event to things that were not affected by it. (Being intentionally vague.) Rather than offering us graphs or plots of all of the measurements, experts and lawyers tend to lump the groups and then perform NHST to show “statistically significant differences” between the averages. Very frustrating.

    • “He just would not make the damn plots!” I completely understand this feeling :-)

      I use Anscombe’s quartet in my classes and a few similar examples where “smart” defaults (optimized layouts) make you miss errors in your data. They do help.

      For people who don’t know R, I often suggest Tableau as vastly better than Excel. Once you get the hang of dragging and dropping to rows and columns, it makes it very quick to make effective plots (because used this way it will default you to position encodings for all quantitative data and faceting for nominal/ordinal data). So it’s harder for someone to say, but it takes so long. But then you’re asking computer scientists to use a graphical user interface and again, they may feel like that’s too “easy” ….

      • The learning curve for R is definitely a problem (as it is for any statistical programming language), so if there’s a graphical tool that can do the job, that’s great. I think the key feature would be the speed of specifying what you want to plot, and how, rather than (for instance) just making plots that look better. Excel has some famously awful choices for shading, colors, fonts, and so on, but even taken together those don’t constitute a major impediment to exploratory graphics, it’s really just how much clickity-click you have to do in order to do even simple stuff. If Tableau fixes that, or even takes a big step in the right direction, then sure, that would be better than a stats language that the person doesn’t already know.

    • On the other hand, for those of us who are fluent in Excel, there’s always the activation energy to overcome before dropping into the potential well of a faster and more capable plotting environment, and sometimes…

      …sometimes I just need the damn plots quickly so I can see what’s going on. And that’s why I’ve actually generated geospatial maps using Excel.

      • Yes, see my response to Jessica (probably immediately above this). You use the tool you know. And this is by no means unique to Excel, even when it comes to using a poor tool for geospatial maps: how many times in my career have I had geospatial data in R, and rather than using the appropriate library I just plot( dat$longitude, scale*dat$latitude )? Too many, that’s how many.
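
        For what it’s worth, the slightly less crude version of that shortcut is to let asp handle the longitude/latitude distortion instead of picking a scale factor by hand. A sketch with made-up points:

            set.seed(1)
            dat <- data.frame(longitude = runif(100, -124, -114),  # made-up points, roughly California
                              latitude  = runif(100, 32, 42))
            # asp corrects for degrees of longitude shrinking with latitude
            plot(dat$longitude, dat$latitude,
                 asp = 1 / cos(mean(dat$latitude) * pi / 180),
                 xlab = "longitude", ylab = "latitude")

        Still not a real map, but at least the shapes aren’t squashed.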

    • I have been a longtime Excel user, but agree that creating clear, understandable graphs takes too much effort. But learning new packages takes time and effort, also. I used Google Sheets and OpenRefine as well because the ‘transfer of learning’ from Excel is considerable, i.e. they are easy to learn if you already use Excel. Since we are talking visualization, imagine a map of applications that illustrates their similarity as distance. It would quickly give one an idea of which other applications they might consider to improve various tasks in their workflow.

  3. Jessica said,
    ” at least once a year, I find myself begging someone to make exploratory plots of some experimental data. I say begging because I have found that often when I’m being presented with some analysis and I ask questions like Did you plot all the variables first? or Did you look at this relationship visually first? not everyone hears me. I tend to experience this with students in disciplines that stress math over statistics, like computer science and economics, but it’s not exclusively those backgrounds. …

    I suspect reasons this might happen include latent beliefs that plotting data is somehow inferior to anything that feels more like math.”

    • (oops — pressed “submit” too early) To continue by responding to the quote:

      Speaking as a mathematician, I think that this is a matter of the students’ having a limited math background that gives the impression that “doing math” is carrying out a bunch of procedures. This is not what mathematicians do. Instead, we start with some situation, look at patterns (of whatever sort might appear), make “conjectures” about them, and try to prove or disprove the conjectures. And our attempts to prove and disprove typically involve looking at examples to give us possible insight about what is going on in the situation being studied. So to me, what Jessica calls for is behaving like a mathematician, and what she decries is behaving like someone whose math background only requires carrying out algorithms, rather than what a mathematician would call “real” mathematics.

  4. I think one contributing factor is that the recent interest in data visualization has often focused on it as the last step of the analysis, primarily as an essential part of the communication process. When I teach data visualization, I emphasize it as an essential part of the entire process (along the lines of the workflow Andrew speaks of). In fact, it is used from the beginning and throughout the process. If people think of it only as part of how you tell your story (a common phrase that is used), they tend not to think of it as a necessary part of finding your story.

    • Yeah, I’ve definitely started projects where I had the wrong technical priorities.

      It seems plausible that the people in the post see themselves as computer cogs and that someone before them did plots or someone after them will do plots, and they just don’t think of plots as part of their toolkit or their responsibility.

      Like, in all the cases where we ask where-are-the-plots, we can probably also start with what-is-the-data or what-is-the-problem. Those are different questions than where-are-the-plots (we are just making plots here — ask my boss why it’s important), but someone who thinks of themselves as a computer cog might look at the plot question the same way (we already assumed there was a signal — why are you asking about plots?).

      • I agree, and I think it is worthwhile emphasizing that the plots are not the place to begin – it is with the questions. Plots come second, often in conjunction with redefining the questions. And both of these steps are often not the focus of the more technical courses that people study.

  5. “I tend to experience this with students in disciplines that stress math over statistics, like computer science and economics”

    For computer science there’s a strong tendency to want to automate everything. Plotting data is great for a manual analysis that has a human being in the loop, but if you’re trying to automate the analysis of 1000 different data sets it’s not really an option.

    • OK, but counterpoint: if you do the data analysis *never having had a human in the loop in the first place* you could be in big trouble. Maybe look at a random subsample of your 1000 data sets so that you can at least detect “typical” weirdnesses?

      The only scenario I can think of where skipping this step would be OK would be something like:

      * you really only care about getting a certain level of out-of-sample (cross-validated, etc.) predictive accuracy, and
      * you’re willing to ignore the possibility that the data has potentially useful predictive features that would be apparent to your eyeballs that your machinery is missing
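
      Concretely, the spot check could be as cheap as something like this (assuming, purely hypothetically, that the 1000 data sets live in csv files with columns x and y):

          files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
          set.seed(1)
          spot_check <- sample(files, min(12, length(files)))  # a random handful of the 1000

          op <- par(mfrow = c(3, 4))
          for (f in spot_check) {
            d <- read.csv(f)
            plot(d$x, d$y, main = basename(f))  # eyeball each sampled data set for "typical" weirdness
          }
          par(op)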

  6. For young scholars in econ and political science, and probably every field, one problem is they haven’t read enough papers. As you get older, you get impatient with reading millions of papers, and you want things simple, or at least for the paper to quickly tell you what’s important. And you get tired of technique. You want a quick answer, because you can’t read all the papers you see, and you need to know if it’s worth reading carefully. This is particularly important if the author is not at a top-ten department or is too young to have a recognizable name.

    Even older scholars usually don’t realize that it’s often the diagram (or, for a theory paper, the numerical example) that sells a paper and makes it famous. The author has to do the fancy stuff, too, but often the fancy stuff is too hard to read, so people skip it. Even if they read it, what they remember is the diagram or the numerical example. And if you want to persuade the reader that your result is correct, a diagram can be crucial, even if the reader, self-deceived, thinks it’s the fancy regression that has persuaded him.

  7. This happens because people are lazy and they don’t really care about doing good work or advancing science. They just want to get “the answer” and score the bonus.

    We see it all the time in work that’s presented here. People don’t do the basic stuff. Science is exciting and people want to be heroes! My goodness, what’s the next amazing thing that a five minute video can do for us?

  8. As the most data sensitive person in the neighborhood, I get asked a lot to help with analysis. And I always begin with show me your pictures, and tell me your story. Nothing beats looking at the data, outliers, weird relationships, we’ve all seen it. Often the response is, tell me what to do; I have to present it tomorrow.

  9. Jessica, Andrew,

    Wow – this took me by surprise. I wrote about the need for a theory of applied statistics a decade ago and your paper fits right into that box. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2171179

    Part of that note was an appeal for collaborations with other disciplines in order to develop such a theory. In particular I expanded on the role of conceptual understanding in statistics applications. For how this relates to the reproducibility debates see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070

    Regarding Tukey and EDA: the approach to EDA is fundamentally “deconstructive”. I refer here to Jacques Derrida’s approach. By using graphics you can bring out insights derived from associating non-standard patterns with your understanding. If you use brushing and spatio-temporal displays like bubble plots à la Hans Rosling, this is particularly effective in bringing out multidimensional features.

    In the 1980s, while I worked for Bell Labs, Paul Tukey (a distant cousin of John) was doing that in a static mode with graphical features in S. See https://www.amazon.com/Graphical-Methods-Data-Analysis-Statistics/dp/0412052717.

    We have moved a long way since then. What also happened is that many disciplines moved into this area, further developing the methodology and technology. Unfortunately, John Tukey’s leadership was not really followed by statisticians, and statistics moved to the sidelines in this game (so it seems to me).

    Given all that, I applaud the initiative of Jessica and Andrew. Thank you for a great and important paper.

    On the cynical side – I wonder if it would be published in a mainstream statistics journal. My prior is “no”.

    Best,

    ron

  10. “I’m not sure what can be done to instill better intuitions about the value of plotting data in general to non-stats students.”

    Make it fun by making it easy. That is, get rid of SPSS in the classroom and instead switch to SYSTAT, JASP or other software that makes it ridiculously easy to plot histograms, scatterplots and the like. Make it part of stats courses from the get-go by forcing students to ALWAYS plot their data first, in meaningful and silly ways alike, before they get to the number-crunching and coefficient-creating abstract part.

    • I’ve been demanding students start by plotting their data since I started teaching stats. Actually, they have to start every lab exercise by *describing* their data (what are the variables, units if any, what’s a case, why was this data originally taken), then plotting it and answering some questions about the plot. Then they get to whatever the lab is about that week (often coming back to look at the plot again in the context of their analysis).

      This post and the discussion in the comments have made me realize why I insist on that, and why I’m so startled when someone doesn’t do that. I teach intro stats but I’ve never taken a stats course myself. I got my degree in particle physics, and learned stats on the job (mainly from all the modeling we do). But all through my grad school and postdoc work, we made graphs for everything, stared at graphs during meetings, and pinned graphs on our office walls to help us think about things. Data selection (selecting data and removing background) was always shown in a graph first, with actual numbers for rejection rates etc. often shown in a “by the way, here are the numbers” way. This was true within our group as well as at conferences etc. Nobody ever pointed this out to me; it was baked into how we communicated and how we thought.

      We made graphs for stuff in class, too, even in undergrad, and my memory is that we usually thought seriously about them rather than making them because we had to. But I think experiencing that graph-first *culture* in grad school was where I really absorbed it.

      If that’s true, the question here becomes: How do we spread this culture around? Like you said, it’s probably a good start to push students to start everything by making *and considering* some plots.

      • This brings up the broader matter of spatial reasoning vs number crunching (and/or formula manipulation) in mathematics and its applications. In teaching mathematics, I included a lot more visualization than many students were used to. For example, in calculus, giving a picture of the graph of the derivative of a function, and using that to make a sketch of the graph of the function itself. And part of the reason I always thought this was worthwhile was that when I took an undergraduate course in mechanics (as part of my physics minor), the professor reasoned from a certain differential equation to a qualitative sketch of the graph of the function involved.

  11. I agree with the importance of plotting data, but when you only deal with binary variables (treated or not treated, infected or not infected), what can you learn from a plot?

    • Well you can compare the two PDFs, or plot out the differences between groups at each quantile and see how the treatment affects the shape of the outcome distribution. It would tell you real quick about the plausibility of “constant treatment effects”, or whether any difference in means is driven exclusively by the tail of the treatment group distribution, or even just whether treatment increases variance.

      And you could do the same by subgroups if you think effect modifiers might be a big deal. Plotting the two (Treatment/Control) outcome distributions in some way or another seems like a pretty obvious first step.
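
      As a rough sketch (with simulated data, and assuming a continuous outcome as above), the whole comparison is only a few lines of R:

          set.seed(1)
          n <- 500
          treated <- rbinom(n, 1, 0.5)
          y <- rnorm(n) + ifelse(treated == 1 & runif(n) < 0.2, 2, 0)  # effect only in an upper tail

          # overlay the two empirical densities
          plot(density(y[treated == 0]), main = "Outcome by treatment group")
          lines(density(y[treated == 1]), lty = 2)
          legend("topright", legend = c("control", "treated"), lty = c(1, 2))

          # difference between groups at each quantile (a crude quantile treatment effect plot)
          p <- seq(0.05, 0.95, by = 0.05)
          qdiff <- quantile(y[treated == 1], p) - quantile(y[treated == 0], p)
          plot(p, qdiff, type = "b", xlab = "quantile", ylab = "treated minus control")
          abline(h = 0, lty = 3)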

  12. +1 for the post

    It reminds me of the recent post about Day Science and Night Science (where apparently Day Science is writing grant applications and Night Science is looking for more grants). In this framework making figures does not look like work; it looks like a childish attempt to satisfy one’s idle curiosity. But this is what Science should be about, satisfying curiosity, not computing means.

  13. I suspect many people don’t make plots because these people don’t see anything when they look at the plots.

    I’ve made plots where each plot tells me all sorts of things, but when I’ve shown the same plots to other people and asked them what the plots tell us, I get nothing.

    When I look at a plot, I think about what was happening to make the data look like that. So, I build a mental model that could explain what produced the data. But, most people don’t think conceptually, so when they look at a plot, they probably only see a bunch of points or some lines.

    If you fit a model, then you get a model.

  14. Nice post.

    Seems there is a generalized dismissal of the value of visual reasoning pretty much everywhere, though it is now dissipating; see, e.g., Visual Reasoning with Diagrams, edited by Amirouche Moktefi and Sun-Joo Shin.

    Two anecdotes:

    The journal would not allow a single graph that vividly displayed the study results, so the senior author put it in the title – Stability of treatment preferences: although most preferences do not change, most people change some of their preferences. N Kohut, M Sam, K O’Rourke, DK MacFadden, I Salit, PA Singer

    An experienced statistician who asked me about a problematic data set was outraged when I suggested that, rather than group means and variances, they plot the raw data – “I am not an idiot, I know how to work means and variances”. They emailed me an apology the next day – “the plots of raw data clarified things nicely, sorry.”

  15. FWIW, the project that motivated me to write this post went down this week in a spectacular blaze of measurement error. Looking on the bright side, next time I expect I’ll get a little less resistance when I ask for plots and simulations!
