Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go

I continue to struggle to convey my thoughts on statistical graphics, so I’ll try another approach, this time telling my own story.

For newcomers to this discussion: the background is that Antony Unwin and I wrote an article on the different goals embodied in information visualization and statistical graphics, but I have difficulty communicating on this point with the infovis people.

Maybe if I tell my own story, and then they tell their stories, this will point a way forward to a more constructive discussion.

So here goes.

I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines.

In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other students were not making a lot of graphs. I discovered and absorbed the principles of Cleveland’s The Elements of Graphing Data.

In grad school and beyond, I continued to use graphs in my research. But I noticed a disconnect in how statisticians thought about graphics. There seemed to be three perspectives:

1. The proponents of exploratory data analysis liked to graph raw data and never think about models. I used their tools but was uncomfortable with the gap between the graphs and the models, between exploration and analysis.

2. From the other direction, mainstream statisticians–Bayesian and otherwise–did a lot of math and fit a lot of models (or, as my ascetic Berkeley colleagues would say, applied a lot of procedures to data) but rarely made a graph. They never seemed to care much about the fit of their models to data.

3. Finally, textbooks and software manuals featured various conventional graphs such as stem-and-leaf plots, residual plots, scatterplot matrices, and q-q plots, all of which seemed appealing in the abstract but never did much for me in the particular applications I was working on.

In my article with Meng and Stern, and in Bayesian Data Analysis, and then in my articles from 2003 and 2004, I have attempted to bring these statistical perspectives together by framing exploratory graphics as model checking: a statistical graph can reveal the unexpected, and “the unexpected” is defined relative to “the expected”–that is, a model. This fits into my larger philosophy that puts model-checking at the center of the statistical enterprise.

Meanwhile, my graphs have been slowly improving. I realized a while ago that I didn’t need tables of numbers at all. And here and there I’ve learned of other ideas, for example Howard Wainer’s practice of giving every graph a title.

A statistical graph does not stand alone. It needs some words to go along with it to explain it. I realized this when the New York Times ran a graph and some maps we’d created to show variation in opinion on gay rights. The final published version was an improvement on our R-created originals, but some of the in-between versions didn’t work at all. I realized that our plots, graphically strong though they were, did not stand on their own. The messages we thought the graphs were delivering were not the messages that the NYT graphic designers were picking up. This experience has led me to want to put more effort into explaining every graph, not merely what the points and lines are indicating (although that is important and can be hard to figure out in many published graphs) but also what message the graph is sending.

Most graphs are nonlinear and don’t have a natural ordering. A graph is not a linear story or a movie you watch from beginning to end; rather, it’s a cluttered house which you can enter from any room. The perspective you pick up if you start from the upstairs bathroom is much different than what you get by going through the living room–or, in graphical terms, you can look at clusters of points and lines, you can look at outliers, you can make lots of different comparisons. That’s fine but if a graph is part of a scientific or journalistic argument it can help to guide the reader a bit–just as is done automatically in the structuring of words in an article.

I’ve found systematic thinking and informal experimentation to be helpful in designing better graphs. For example, in making red state blue state maps, we’ve progressed beyond simple red and blue. Our first step was to go from red through purple to blue, but we got better results using shades of pink and light blue. And various other bits of scaling helped out even more. Standard options for color choices didn’t work so well, but we were able to build what we needed in R.
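
To make the color point concrete, here is a minimal sketch in base R, not the code behind our actual maps; the hex values are illustrative guesses at a muted red-pink-white-light-blue ramp, not the colors we ended up using.

    # Illustrative diverging scale; the specific hex colors are made up for this sketch.
    pal <- colorRampPalette(c("#b2182b", "#f4a582", "#f7f7f7", "#92c5de", "#2166ac"))
    shades <- pal(11)   # e.g., 11 bins of two-party vote share

    # Quick swatch check: one bar per shade, no axes, just the colors.
    barplot(rep(1, length(shades)), col = shades, border = NA,
            space = 0, axes = FALSE, main = "Candidate map colors")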

Compare, for example, this grid of maps:

[first grid of maps: image not reproduced here]

to this:

[second grid of maps: image not reproduced here]

The first set isn’t bad–it goes well beyond what I could’ve done even four years ago–but I think the second set is much improved (thanks to Yair and others).

While all this was happening, I also was learning more about decision analysis. In particular, Dave Krantz convinced me that the central unit of decision analysis is not the utility function or even the decision tree but rather the goal.

Applying this idea to the present discussion: what is the goal of a graph? There can be several, and there’s no reason to suppose that the graph that is best for achieving one of these goals will be optimal, or even good, for another.

So, to recap, here’s where I stood as of the beginning of April, 2009: I’m a statistician who loves graphs and uses them all the time, I’m continually working on improving my graphical presentation of data and of inferences, but I’m probably stuck (without realizing it) in a bit of a rut of dotplots and lineplots. I’m aware of an infographics community–I link to some of their blogs!–and think of them as allies with a common goal of displaying data informatively and attractively.

Here’s an example of where I’m coming from: a blog post entitled, “Is the internet causing half the rapes in Norway? I wanna see the scatterplot.” To me, visualization is not an adornment or a way of promoting social science. Visualization is a central tool in social science research. (I’m not saying visualization is strictly necessary–I’m sure you can do a lot of good work with no visual sense at all–but I think it’s a powerful approach, and I worry about people who believe social science claims that they can’t visualize. I worry about researchers who believe their own claims without understanding them well enough to visualize the relation of these claims to the data from which they are derived.)

What changed in April 2009 was that I noticed Nathan Yau’s blog post on the five best data visualizations of the year, and I wasn’t so impressed by them. I realized that, although I had a lot in common with infographics people such as Yau, we had very different perspectives.

Since then I’ve had a lot of frustration trying to communicate with infovis people to get a sense of their goals.

Part of these differences may be simply temperamental: When I see a graph (or, worse, a table), my first thought is to think of how to improve it (getting down to details such as cleaning up an x-axis by labeling tick marks every 10 years rather than every year, for a graph that spans 100 years) and, even more so, thinking about how to include more information (a favorite piece of advice being to increase the time scale, for example displaying the past 50 years rather than the past 5 years of some series). In contrast, the tendency of Yau and other information visualization researchers is to celebrate, to feature the images that they like and then to move on to the next item. Both approaches make sense: by focusing on details, I think I can help people display their data better and, especially, to display more data. At the same time, graphics designers continue to innovate, and Yau and others provide a service by highlighting new visualizations without getting caught up in the specifics.
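
As an aside, the tick-mark detail is easy to act on. Here is a tiny base-R illustration with made-up data, labeling a century-long series every 10 years instead of every year:

    # Fake 100-year series, purely for illustration.
    years <- 1911:2010
    y <- cumsum(rnorm(length(years)))

    # Suppress the default x-axis, then put tick labels every 10 years.
    plot(years, y, type = "l", xaxt = "n", xlab = "Year", ylab = "Series")
    axis(1, at = seq(1920, 2010, by = 10))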

I think our disagreements are more than just the difference between a grumpy and an agreeable attitude, and I think it’s more than aesthetics. I think we have fundamentally different goals: the statistical goal of understanding data, models, and inferences is different from the infovis goal of creating something new and visually exciting. Sometimes a visualization can serve both purposes (for example, the Baby Name Voyager) but often it doesn’t work that way. Minimalist graphs such as those in Red State Blue State can convey a lot of information to the educated and interested reader but are not so accessible to an outsider. From the other direction, pretty graphs such as the map of air crash locations are inviting and get the casual reader to think about data but are frustrating to a more experienced analyst who quickly realizes that the visualization conveys very little new information (and requires quite a bit of effort to decode what information is present in the graph).

I have a new thought on how infovis designers think differently from statisticians when it comes to graphs:

Compare attitudes on pie charts, bubble plots, and dot plots. Statisticians such as Bill Cleveland (and myself) hate pie charts, dislike bubble plots (in which the area of the circle is proportional to the quantity being displayed), and love dot plots. Why? Because, as Cleveland discussed in his 1985 book, it’s easier to make visual comparisons of positions than of areas or angles. Almost always, the purpose of a statistical graph is to make comparisons, so we think a lot about how best to do this.
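
A toy comparison (invented numbers) makes the point in a few lines of R: the same five shares drawn as a pie chart and as a dot plot. Judging which slice is bigger is harder than judging which point sits farther to the right.

    # Five made-up shares that sum to 1.
    shares <- c(A = 0.23, B = 0.19, C = 0.21, D = 0.17, E = 0.20)

    op <- par(mfrow = c(1, 2))                 # side-by-side panels
    pie(shares, main = "Pie chart")            # compare angles/areas
    dotchart(sort(shares), xlim = c(0, 0.25),  # compare positions on a common scale
             pch = 19, main = "Dot plot")
    par(op)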

Infovis people, though, do not seem to like dot plots very much. They think pie charts are OK (for example, Yau discusses how to make a good pie chart in his recent book), and they love bubble plots.

Why? What’s their principle? I think I know, and I think their principle is that graphs should be logical. Or, to put it another way, form should follow function. If you have a bunch of numbers that add up to a constant, then a pie chart displays this partitioning. Similarly, if you are displaying quantities, counts, or volumes, a bubble chart is logical because you’re showing physical areas. In contrast, displaying magnitudes as positions (as in a dot plot) is not natural at all: form is not following function, and this might help to explain why non-statisticians can get confused by such a plot.

I’m not saying that infovis people are hardliners on this one, just that I think there is a principle here, a principle that also helps to explain the appeal of Florence Nightingale’s famous graph of Crimean war data.

As Antony Unwin and I have discussed, we don’t think this is a good statistical graph at all (with due respect to Nightingale’s pioneering work). The spiral pattern is an effective graphical display of the already well-known fact that the cycle of 12 months repeats every year. Beyond this, it’s hard to get much out of the graph at all. It does have a logical appeal, though, in that a periodic pattern is represented by a circle, with counts being represented by areas. In contrast, our preferred lineplot is much more abstract.

For another example, consider the map, made famous by Tufte, of Napoleon’s march through Russia. The x- and y-axes represent spatial positions, and the width of the shaded area represents the size of the army. All of these make logical sense.

As I’ve written elsewhere, people can often be usefully understood through their dislikes. I dislike graphs such as Nightingale’s, which give virtually no info; in fact, you have to work your butt off to recover the basics. I recognize that her graph might have been an effective infographic–even 150 years later it has a visual appeal–but I don’t think it’s a good model for statistical graphics if the goal is to learn something new from data or to convey data patterns to others who are already interested in a topic.

What do Infovis people dislike? Based on the discussions they’ve been having, one thing they hate is for people to criticize graphs. They are open-minded and don’t have too many negative things to say about anyone’s visualization.

What are my own goals in this discussion? My first goal is to get statisticians and social science researchers to think more about their goals in displaying numerical information. It would be great if infovis could inspire and empower researchers to better visualize their data, models, and inferences. My second goal is for graphics designers and creators of information visualization tools and infographics to become aware of a statistical perspective in which a graph can not only be evocative of data but can also convey quantitative comparisons. Appreciating new tools is fine, but I think infovis could also benefit from focused criticism and improvement, which might start with reflections on the goals of any graph. My third, modest, goal is for statisticians and graphics designers alike to consider the virtues of multiple displays: maybe an infographic to grab the reader’s attention, followed by a more conventional dotplot or lineplot to display as much of the data as possible, and maybe then an unusual and innovative plot that might be hard to read but might inspire some out-of-the-box thinking. One way to get the best of both worlds is to recognize the limitations of our separate approaches. On the web, there’s plenty of space for multiple visualizations of the same data.

P.S. I realize that I’m rambling a bit in this discussion. I don’t really know what to do. I’ve tried to express myself as directly as possible and to look at things from the perspective of people who disagree with me, but I still am struggling to be understood. I’m not blaming anyone but myself for this; I’m just explaining my motivation for continuing to babble. I’m hoping that if I toss off enough loose ends, someone will pick one up.

14 thoughts on “Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go”

  1. Great article, and I think you pitched it just right.

    A quick thought: does some of the difference come from the need to verify? That is, you look at a graph and want to break the data down, scramble the pieces a bit and recombine them, to see if you get the same answer. You want to learn underlying principles. You want to see if it makes sense. Which is something our society and schooling neglects, on the whole.

    An info-viz, on the other hand, is about presentation. In some sense, things are clearer than in your approach. But you can’t readily break out details to learn about them or to see if they make sense. The graph presents what it presents and that’s it.

    In that sense, is it like the distinction between fiction and non-fiction? Fictional literature — at least good literature — makes a point about the real world. It reveals things, highlights them. But you can’t really dig into the fictional world very deeply to check its consistency with the real world. Non-fiction, by contrast, is all about teaching you something directly. Something you can analyze down to the step, and something you can compare with the real world.

    If this is true, as you say, scientists can be inspired to do better storytelling, and infovis’ers can be inspired to do something equivalent to “historical fiction”. And as you say, you can have multiple versions: The Civil War movie, with the true story in the Extras of the DVD, as it were.

    Thanks for making another pass at the topic!

    • I got to thinking about “what is the purpose of a graph, anyhow?” when I recently visited a site that allows you to enter a web address and it creates a word cloud for you (http://www.wordle.net/). The clouds look great and serve my (visual) purposes, but what do they actually do?

      There is information in them: larger words represent higher frequency of usage. Color can add another factor. But it appears to me that they are mostly useful for inspiring connections. You see words next to each other that you might not have thought of, you see words of similar size (frequency) that you might not have connected before, etc. It basically inspires you and might trigger a new insight, but then it might not.

      Statistical graphs can also “inspire” or “surprise” you. You see an interesting pattern that you would not have guessed, or perhaps a departure from what you would have expected, and dig further. But they’re also required to carry more information, and that information must be easily and accurately extractable.

  2. Are the two grids of graphs presenting the same data? For example: wealthy white Catholics in Colorado, Nebraska, and South Dakota are noticeably opposed to vouchers on the blue-brown graph and noticeably support vouchers on the green-orange one.

  3. It’s interesting to hear you say that physicists graph everything and statisticians don’t, or at least didn’t. That agrees well with my experience.

    Physics:
    In high school I did an internship at NASA, looking at data from solar flares, and most of my job consisted of sitting in front of a special graphics workstation — at the time, way back in the early 80s, many conventional computer consoles couldn’t display high-resolution graphics — and running a program that would automatically fit three different models to solar flare x-ray spectra, summarize the goodness-of-fit, and plot the spectrum and the fits. Sometimes I, or one of the scientists, would notice something peculiar about the plot and look into it: maybe the fit procedure hadn’t found the global best fit, maybe there was a problem with the data, etc.

    The same graphics-heavy emphasis continued in grad school. I worked briefly on a project to quantify the spatial distribution of visible matter in the universe — how much are galaxies clumped together? — and in that project, too, the lead scientist relied heavily on plots (in this case, plots of galaxy locations superimposed on density contour plots) both to make sure the model fit and to understand the results. When I started working on my own PhD project, which was related to the path through quantum-mechanical parameter space that an atom takes as it ionizes in an increasing electric field, one of the things I did was to, as Andrew says, “graph everything”: energy levels versus field, average quantum number versus time, etc., etc. And physics books tend to have lots of plots in them, showing how a particular function varies as a function of distance or applied force or whatever. So, yes, almost all physicists are used to plotting both models and results and to understanding the world through those plots.

    To the extent that I thought about it at all (which I did sometimes), I always assumed statisticians would be like that too, only more so: in many physics applications you have a really really good reason for applying a particular physical model, whereas much of statistics is more empirical. So it was utterly shocking to me — really! — when, as a postdoc, I went to meet with a statistician who was helping out with my project and discovered that not only did she not normally plot everything, she didn’t normally plot _anything_ and indeed couldn’t remember how to make a plot on her computer! (This was 1992, more than ten years after my high school NASA experience; by this point every computer could make plots). This still amazes me.

    I now realize that many people (not physicists!) greatly under-value plots. I work with a lot of grad students, and I seem to refer all of them to Anscombe’s Quartet within the first few weeks, to try to convince them of the importance of plotting their data and models. And I just gave a talk (about finding and interpreting problems or anomalies in “smart meter” electricity data) in which the first of my summary points was “plot the data.”

    I think there are several reasons that a lot of people don’t make enough plots:

    (1) It can be a pain to make the plot, especially with data that need to be processed in some way first and especially if people are using an inappropriate tool like a spreadsheet in order to do their work. For example, I work a lot with time-indexed data from different sources, and it can be surprisingly hard — meaning it might take, say, 5 minutes instead of 5 seconds — just to do something simple like overlay the different time series on each other on the same plot: the formats for the dates and times are often different (in both trivial ways like MM-DD-YYYY vs YY-MM-DD, or AM/PM vs military time, and less trivial ways like UTC versus local time, adjusted for daylight saving or not), and there can be other issues too (see the sketch after this list).

    (2) It takes judgment, trained through practice, to even figure out what plots to make. Should I plot residuals vs time, residuals vs predictions, or what?

    (3) For those of us who use plots all the time it is very easy and natural to interpret them, but people who are less practiced at it have to do more work to interpret the plots. Everyone grasps time-series plots pretty well, but scatterplots (for example) take at least a little bit of practice to interpret. I’ve seen people look at a plot of data vs predictions and not notice some really obvious features such as trends, offsets, heteroskedasticity (sp?)… if they looked at these more, they’d learn to interpret them much more quickly and the plots would be a lot more useful.

    (4) I touched on it in (1) but it’s worth saying again: I think one of the biggest reasons people don’t make plots is that a lot of people use a spreadsheet tool for their data analysis. I remember once working with a grad student and asking him to make about 10 different plots to start to get a handle on our data: plot these two things versus time on the same plot, plot one versus the other, plot the relationship only for the time periods when this other variable is high and only when that other variable is low, and so on. Days went by and the student kept on saying he hadn’t had time to do it. Eventually I went to his office on campus, sat down next to him, and had him make the first of the plots. It was awful: click-click-click-click-click in all these little windows, just to make a simple plot overlaying some time series. Finally we had it. OK, now let’s plot Y vs X for all cases… not too bad, click-click-click. Great, now plot Y vs X for cases when Z > 1. Click-click-click-click-click-click-click. Jeezus. It literally took two hours to make those 10 plots, which an experienced user of stats software would have made in perhaps 10 or 15 minutes. Horrible.
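
    To make point (1) concrete, here is a minimal R sketch with invented timestamps and values: parse two series recorded under different date/time conventions into a common time class, then overlay them on one plot.

      # Two toy series with mismatched timestamp formats (values are made up).
      t_a <- c("03-15-2011 01:00 PM", "03-15-2011 03:00 PM")   # MM-DD-YYYY, AM/PM clock
      kw_a <- c(4.2, 4.8)
      t_b <- c("11-03-15 13:30", "11-03-15 15:30")             # YY-MM-DD, 24-hour clock
      kw_b <- c(3.9, 4.6)

      # Parse both into POSIXct so they share a common time axis.
      t_a <- as.POSIXct(t_a, format = "%m-%d-%Y %I:%M %p", tz = "UTC")
      t_b <- as.POSIXct(t_b, format = "%y-%m-%d %H:%M",    tz = "UTC")

      # Overlay the two series on one plot.
      plot(t_a, kw_a, type = "b", xlab = "Time", ylab = "kW",
           xlim = range(c(t_a, t_b)), ylim = range(c(kw_a, kw_b)))
      lines(t_b, kw_b, type = "b", col = "blue")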

  4. As an information visualization researcher with a foot in the statistics world, I’ll add a few thoughts.

    First, there seems to be a terminology mismatch between you and your audience. I would consider statistical graphics to be an important *subset* of information visualization, not a separate field. I would consider Bill Cleveland, Antony Unwin, and yourself to be doing good information visualization research; research that would be very welcome and would have a very large impact at InfoVis or similar conferences, should you choose to publish (or give a keynote) there. Certainly, Bill Cleveland’s papers have been widely cited and highly influential in the InfoVis community.

    Your observations seem to focus on what I would call “infographics”, a relatively small segment of the field (comprising maybe 10-20% of InfoVis attendees) and the branch of information visualization furthest removed from stats graphics. Infographics is really graphic design inspired by a small data set. It is an artistic field dominated by artists. Despite being a small portion of the whole field, it has become the public face of infovis, primarily because the results are sexy—they sell newspapers and make blogs popular (and bring money and attention to the field), not because it is representative.

    When you use the term “infovis” to refer to just “infographics” you end up getting your audience confused. People, rightly I think, come away thinking that you’re painting with too broad a brush. If you get the terminology right, I think you’ll find it easier to communicate with the InfoVis community. Remember, InfoVis people think that Andrew Gelman does InfoVis research (at least part time)!

    Second, the InfoVis community has spent a lot of time discussing the trade-offs between stat graphics (and other simple, data-rich visualizations) and infographics. Some of your comments (e.g. pie charts, bubble plots, and dot plots) simply recapitulate discussions that have occurred within the field for the last 20 years. A result of this discussion has been the recognition, as you point out at the end, that different visualizations should be judged by different criteria. Infographics get judged as art, while our attempts at data visualization are evaluated by how well they communicate quantitative data. Since both types of graphics are presented at conferences, we must switch between different evaluation criteria on the fly. Picking the wrong criteria for a visualization is something of a faux pas. It signals that the offender is naive or that they are challenging the field’s assumption that art and quantitative communication are worthy *independent* goals.

    My recommendation would be to back off the infographics critique (unless you have something artistic to say), and engage the rest of the InfoVis community who I think would be very interested in your statistical insights. Unfortunately, this side of the InfoVis community doesn’t have a huge web presence. But InfoVis is in Rhode Island this year; if you come, you can meet the rest of us.

  5. As an engineer, I learned to graph everything, and to build models based on physical phenomenology, to try to fit the existing data and predict future data. So I come at the whole Info Viz scenario from the same starting point as you, and I hack my way along to get what I need from the data.

    I’m interested to read that you don’t care for the Florence Nightingale charts. I’ve never thought that they were easy to read or that they effectively showed anything. You and I seem to be the contrarians.

    • Jon: The club of contrarians is not nearly so exclusive and similar sceptical comments have been made elsewhere!

      The often repeated praise for Florence Nightingale’s most famous graph flows partly from respect for her other achievements. Nor did she invent this kind of graphic.

      It is curious how often it is overlooked that Naomi Robbins showed directly that much simpler line plots are more effective for FN’s data.

      Robbins, N. B. 2005. Creating More Effective Graphs. Hoboken, NJ: John Wiley, pp. 144-149.

  6. Hi Andrew,
    I read through your post with great interest. I have often wondered about this. I understand both schools of thought, which I call the anti-chartjunk vs. the chartjunk school (for simplicity’s sake ;)). I found an academic paper which discussed the many benefits of chartjunk, finding that it increased memory and perception by a great deal. (Not all kinds of chartjunk, but a particular subset). I think the biggest difference between the two schools of thought is the target user, and it boils down to the amount of time an everyday user would want to spend reading a graph/chart. It’s true that some anti-chartjunk data visualizations that are so “mind-blowingly” good are not easy for more than 60% of the population to interpret.

    That said, we are creating a tool to help users create infographics, and I know that there are plenty of ways to improve information visualization. Would be very interested to hear your thoughts on it :)

    Ching

    • Ching:

      I think the so-called chartjunk study has been overhyped. The trouble is that the non-chartjunk graphs used in their comparisons are of very low quality. See here for more.

      As I wrote in my discussion of that study, a huge, huge drawback of chartjunk is that it limits the amount of information you can display in a graph. If all you want is to display a sequence of 5 numbers, then, sure, go for the chartjunk, I don’t really care. But why limit yourself to only displaying 5 numbers?

  7. I’ve really enjoyed this series of posts on data visualisation. In particular, the idea of getting people from different disciplines to think about how we see and display data rings pretty true with me. Just to note that many of the published responses to the Cleveland and McGill, 1987, Journal of the Royal Society paper about Graphical Perception point out that data visualisation represents a real opportunity for people from different disciplines to work together. OK, the context might be different to that in this blog, but for me, data vis is about getting better understanding of data, for both me and other users of the data, so the graphics need to be meaningful for all audiences.

    Great article.

    • Sorry – that Journal of the Royal Statistical Society. Don’t want to mix up my Royal Societies – I might incur the wrath of Sir Isaac Newton’s ghost.

  8. I’m very interested to hear you say that you don’t feel like you need tables of numbers any more. I’ve been wondering about this myself for a while. To me the relative relationships between the bits of data seem far more interesting than whether it’s 3.2 or 3.4 or whatever.

    Funnily enough my friend took some advice from me and went graph-heavy in her Master’s thesis, omitting data tables, and was marked down. I feel very guilty about giving bad advice, but I can’t shake the feeling that we were right and the graphs were more important: they really made the relationships in the data stand out, and using both graphs and data tables just feels like a waste of space.

    As for my take on this whole infovis thing, I’m going to learn a lot of lessons from both camps: knock ’em dead with the infovis, and then use something a bit weightier once you’ve got their attention; I work in healthcare, so making visualisations appealing to clinicians and managers and the public is a big deal for me.

  9. Andrew:

    You appear to be at least a bit surprised and frustrated by a relative lack of debate here, else why do you keep revising this statement? I don’t think the key issue is whether your argument was or is unclear.

    I do think your political science persona could come to the fore to help analyse the lack of debate:

    1. Once anyone announces a debate that hinges at least partly on how you define “fields”, “communities” or “approaches”, lots of people will back off automatically from a discussion that promises to be uninteresting and futile. Who cares much what anything is called, or whether approaches are distinct or overlapping or whatever? For example, I can readily agree that logically statistical graphics is a subset of information visualization, and _also_ that in practice people who march under these banners are trying very different things.

    2. The hardest hitting statement in your postings remains the personal declaration that none of the supposedly best examples of information visualization in a particular 2009 batch struck you as any good. (FWIW, I agree.) As an opening statement in a debate, that’s not encouraging to anyone whose views are quite different, especially in so far as the debate is to be carried out on your turf. (I know you’re certainly not insisting on that, or even expecting it.)

    That said, I agree very much that the strongest commitment among people who do statistical graphics is that the graphics are a means to an end, understanding the data. Novelty of graphical representation is not an aim, and possibly not even a desirable side-effect. Being able to use a standard graph form that readers will recognise is a virtue. It is not clear that information visualization people sign up to those commitments, or at least not to the same degree, and that’s fine.

    Also, statistical graphics does seem predominantly a very conservative field. All of the really important ideas used at all frequently were in place by 1900; computing has had only one main effect, making those graphs easier to produce; with notable exceptions, the interest in statistical graphics remains in paper-printable 2D graphics, with only marginal interest in anything dynamic, interactive or animated. It’s not at all surprising to me that to people outside the field it looks stale and staid, and quaint so long as the main references remain books from the 1970s or 1980s. I exaggerate, but I think these statements are much more nearly true than their opposites.

    As a small warning: “dotplots” mean pointillist histograms to many people, especially in medical statistics; slowly but surely, more people are calling them strip plots. I think you mean “dot charts” in Cleveland’s sense.

    On the history: Whatever the appearance to you when younger, exploratory data analysis and model-based thinking were always much closer than you write here. Tukey wrote about plotting residuals from models long before he started writing about stem-and-leaf and box plots. Plotting residuals from fits is a key theme in his “Exploratory Data Analysis” book from 1977.

