Details matter (at least, they do for me), but we don’t yet have a systematic way of going back and forth between the structure of a graph, its details, and the underlying questions that motivate our visualizations. (Cleveland, Wilkinson, and others have written a bit on how to formalize these connections, and I’ve thought about it too, but we have a ways to go.)
I was thinking about this difficulty after reading an article on graphics by some computer scientists that was well-written but to me lacked a feeling for the linkages between substantive/statistical goals and graphical details. I have problems with these issues too, and my point here is not to criticize but to move the discussion forward.
When thinking about visualization, how important are the details?
Aleks pointed me to this article by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, “A Tour through the Visualization Zoo: A survey of powerful visualization techniques, from the obvious to the obscure.” They make some reasonable points, but a big problem I have with the article is in the details of the actual visualizations they show. Briefly:
Figure 1A looks like it should be on a log scale, also it has an unclear y-axis (I don’t think “Gain/Loss Factor” is a standard term) and a time axis that is not fully labeled (you have to reconstruct it from the title of the graph).
Figure 1B has that notorious alphabetical order, also some weird visual artifacts that get created by stacking curves, and an x-axis that is not fully labeled. (What is the point labeled “2001”? Is it Jan 1, July 1, or some other date?) Yes, I realize that one purpose of the article is to criticize such graphs (“While such charts have proven popular in recent years, they do have some notable limitations. . . . stacking may make it difficult to accurately interpret trends that lie atop other curves.”). Still, it doesn’t help to list the industries in alphabetical order.
Figure 1C (see below) seems just wrong. If you look at the graphs, unemployment seems to have gone up by something like a factor of 10 in almost every sector! Something went terribly wrong here; perhaps each graph was rescaled to its own range, which wouldn’t make much sense in a small multiples plot. (For unemployment rates, I’d think you’d want zero as a baseline, or maybe some conventional “natural rate” such as 3%.) On a more minor note, it would help to put the labels on the upper left of each series rather than below the axes. Also, the colors don’t seem to add any information, and it’s a bit odd to list “Other” as the second or third category of industries–I still can’t figure out how that happened!
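To make the rescaling worry concrete, here is a minimal sketch with made-up numbers (these are not the data from the article, and the sector names are hypothetical). It only illustrates the guess above: if each panel is stretched to its own range, every series runs from the bottom of its panel to the top, no matter how large the actual change was. On a shared axis with a zero baseline, the plotted heights stay proportional to the rates.

```python
def plotted_heights(values, lo, hi):
    """Map data values to fractions of panel height for an axis running from lo to hi."""
    return [(v - lo) / (hi - lo) for v in values]

# Made-up unemployment rates (percent) for two hypothetical sectors.
rates = {
    "Sector A": [4.0, 5.0, 10.0],
    "Sector B": [8.0, 9.0, 15.0],
}

# Option 1: each panel rescaled to its own min and max.
own = {name: plotted_heights(v, min(v), max(v)) for name, v in rates.items()}

# Option 2: every panel shares one axis, from zero to the overall maximum.
shared_hi = max(max(v) for v in rates.values())
shared = {name: plotted_heights(v, 0.0, shared_hi) for name, v in rates.items()}

for name in rates:
    print(name)
    print("  own range:", [round(h, 2) for h in own[name]])
    print("  shared   :", [round(h, 2) for h in shared[name]])
```

In the own-range version the first point of each series sits at height zero, so the visual ratio of last to first is unbounded. In the shared version Sector A goes from 4 to 10 and is drawn 2.5 times as high at the end as at the start.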
I could keep going here through all the other graphs in the article. But maybe these criticisms are irrelevant. On one hand, they don’t matter because the writers of the article are simply trying to give an example of each sort of graph. On the other hand, I worry that people will see this sort of authoritatively-written article and take the graphs as models for their own work.
Is it important to get the details “right”?
What harm is done, if any, by having ambiguous labels, uninformative orderings of variables, inconsistent scaling of axes, and all the rest? From a psychological or graphical perception perspective, maybe these create no problem at all. Perhaps such glitches (from my perspective) are either irrelevant to the general message of the graph or, from the other direction, force the reader to look at the graph and read the surrounding text more carefully to figure out what’s going on. After all, a graph isn’t a TV show, readers aren’t passive, so maybe it’s actually good to make them work to figure out what’s going on.
At a statistical level, though, I think the details are very important, because they connect the data being graphed with the underlying questions being studied. For example, if you want to compare unemployment rates for different industries, you want them on the same scale. If you’re not interested in an alphabetical ordering, you don’t want to put it on a graph. If you want to convey something beyond simply that big cars get worse gas mileage, you’ll want to invert the axes on your parallel coordinate plot. And so forth. When I make a graph, I typically need to go back and forth between the form of the plot, its details, and the questions I’m studying.
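One way to read the parallel-coordinates point: when two adjacent axes are negatively related (heavier cars, lower mileage), the lines between them all cross and that one obvious fact dominates the picture. A rough sketch of one possible rule, assuming made-up car data and hypothetical helper names (`pearson`, `axes_to_invert`), is to flip any axis that is negatively correlated with a chosen reference variable. This is an illustration of the idea, not a claim about how any particular plotting package does it.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def axes_to_invert(columns, reference):
    """Mark each axis True if it runs opposite to the reference variable."""
    ref = columns[reference]
    return {name: pearson(ref, vals) < 0 for name, vals in columns.items()}

# Made-up data for five hypothetical cars.
cars = {
    "weight_kg": [900, 1200, 1500, 1800, 2100],
    "horsepower": [70, 95, 130, 160, 210],
    "mpg": [42, 35, 28, 22, 17],
}

flip = axes_to_invert(cars, "weight_kg")
print(flip)
```

With the mileage axis flipped, the lines for these cars run roughly parallel, and any car that departs from the size-mileage pattern shows up as a line that crosses the others.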
If you wanted to say I’m wrong, you could perhaps invoke an opportunity cost argument, that the time I spend worrying about where to label the lines on a graph (not to mention the time I spend blogging about it!) is time I could be spending doing statistical modeling and data analysis. For me, the details of the graphing are absolutely necessary to the statistical analysis–decades ago, before I did everything on the computer, I spent lots and lots of time making graphs by hand, using colored pens and all the rest–but for others, maybe not.
Dot plots, line plots, and scatterplots
My biggest complaint about the Heer et al. article is that it doesn’t mention what are perhaps the three most important kinds of graphs: dot plots, line plots, and scatterplots. See here for a dot plot (from Jeff and Justin), and here for some line plots and scatterplots. (I just picked these for convenience; there are dozens more in Red State, Blue State and all over the place in the statistical literature.) Perhaps the authors felt that readers would be already familiar with these ideas and didn’t need to see them again. But I think, No, the readers do need to see these again! A clearer understanding of line plots would’ve been a big help in making Figure 1C, for example. And some dot plotting principles would’ve helped with Figure 4C (coming up with an ordering more sensible than alphabetical, and displaying the “KB” numbers as dots on a scale; as is, you can pretty much only read the size of each number, which really means we’re seeing the numbers on a very crude logarithmic scale).
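The two dot plot principles mentioned for Figure 4C (order by value, and show each number as a position on a common scale) can be sketched in a few lines. This is a text-only toy with made-up sizes and hypothetical item names, assuming all values are positive. A real version would use a plotting library.

```python
def text_dot_plot(values, width=40):
    """Return text lines: one row per item, sorted by value (largest first),
    with an 'o' placed on a shared linear scale that starts at zero."""
    hi = max(values.values())
    label_w = max(len(name) for name in values)
    lines = []
    for name, v in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        pos = round(v / hi * (width - 1))  # column of the dot on the common scale
        lines.append(f"{name.ljust(label_w)} |{'.' * pos}o")
    return lines

# Made-up sizes in KB for four hypothetical items.
sizes_kb = {"alpha": 12, "beta": 300, "gamma": 45, "delta": 150}

lines = text_dot_plot(sizes_kb)
for line in lines:
    print(line)
```

Sorting by value puts the comparison of interest into the row order, and the common scale lets the reader compare positions rather than counting digits.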
Do I have anything constructive to say here?
OK, OK, I’m not trying to be a grump. Different people have different perspectives, and that’s fine. My point, I think, is that there’s something missing in many discussions–even well-informed discussions–of visualization. What’s missing is the link between the substantive questions (what are the reasons for making the graph in the first place?) and the details of the graph. It’s a weakness of our software, and of our conceptual frameworks for thinking about graphs, that we don’t usually have a systematic way of making that link. Instead we go through menus of possibilities (actual forced options on computer packages, or mental menus in which we make choices based on what we’ve seen before) and then have to go back and fix things.
We should be able to do better. I’m not faulting Heer et al. for not doing better, since I don’t have my own general solution either. Rather, I’m using their article as an opportunity to push for further thinking on all of this.
P.S. I wrote this in my standard blog style, which was to start with something I’d seen and go from there. Once it was done, I changed the title, “When thinking about visualization, how important are the details?” to the grabbier “A data visualization manifesto” (snappier than “A statistical graphics manifesto,” perhaps?) and appended the very first two paragraphs above as an intro. This should be better, right? Readers should be more interested in my point than in how I got there. I didn’t feel like revising the whole piece, but I guess I will if I want to rewrite the article for publication somewhere, which maybe I’ll do if I find the right coauthor.