## Choices in graphing parallel time series

I saw this graph posted by Tyler Cowen, and my first thought was that the bar plot should be replaced by a line plot: six lines, one for each income category, with each line being a time series of these changes. With a line plot, you can more easily see each time series (these are hard to see in the bar plot because you have to follow each color and jump from decade to decade) and also compare the patterns for each category. The line plot pretty much dominates the bar plot.

At least that was the theory. Now here’s what actually happened.

I downloaded the data as Excel files, saved them as csv, then read them into R. In all, it took close to an hour to get the data set up in the format that was needed to make the graphs. At this point it was pretty easy to make the line plot. But the result was disappointing: The six lines are hard to untangle (sure, a better color scheme might help, but it wouldn’t really solve the problem) and the graph as a whole is much less clear than the original bar plot.

My next try was small multiples: six little graphs, each with its own time series. That didn’t work so well either (although, on the plus side, it only took a few minutes to make that graph).

Then I thought of plotting the incomes over time (all these income values are inflation-adjusted, of course). I like this one a lot. In particular, it shows that the drop from 2000-2010 is really a drop since 2007. (Although I suppose Cowen would argue that the drop was really happening earlier and it was just that the economy was doing a Wile E. Coyote, standing in midair and not actually going into freefall until people realized they had gone off the edge of the cliff.)

Still, even the time-trends graph is not quite a replacement for the original bar plot which shows so much drama. I think my recommended solution is to give the bar plot for the initial impression and then follow up immediately with the time-trends graph, which shows the big picture much more clearly.

P.S. The data are in the location indicated by the caption of the first graph above. Here’s my (ugly) R code to make the graphs:
```
n_years <- 64

# Save F02AR_2010 as csv file
income_share <- read.csv ("F02AR_2010.csv", skip=4, nrows=n_years)
year_income_share <- as.numeric (substr (income_share[,1], 1, 4))

# Remove thousands separators from F07AR_2010, then save as CSV file
income_mean <- read.csv ("F07AR_2010.csv", skip=5, nrows=n_years)
year_income_mean <- as.numeric (substr (income_mean[,1], 1, 4))

# Stop if the two files do not cover the same years
if (sum(year_income_share!=year_income_mean)==0) year <- year_income_share else stop()

# Convert income shares to mean incomes for each fifth and for the top 5 percent
income <- (income_share[,2:7]/100)*income_mean[,6]
income[,1:5] <- income[,1:5]/.2
income[,6] <- income[,6]/.05

# Reverse rows and years so that both run in increasing year order
income <- income[n_years:1,]
year <- rev(year)

# Changes from one decade boundary to the next
decades <- match (seq(1950,2010,10), year)
income_decades <- income[decades,]
n_decades <- length (decades)
after_decades <- income_decades[2:n_decades,]
before_decades <- income_decades[1:(n_decades-1),]
total_changes <- ((after_decades - before_decades)/before_decades)/10

# Year-to-year proportional changes, then averaged within each decade
after <- income[2:n_years,]
before <- income[1:(n_years-1),]
changes <- ((after - before)/before)
avg_changes <- array (NA, c(n_decades-1,ncol(income)))
dimnames (avg_changes) <- list (paste(seq(1950,2000,10),"s",sep=""), colnames(income))
for (i in 1:(n_decades-1)){
  avg_changes[i,] <- colMeans (changes[decades[i]:(decades[i+1]-1),])
}

# Graph 1: all six series of average annual changes on one line plot
pdf ("changes1.pdf", height=6, width=8)
y <- avg_changes
x_labels <- rownames (y)
line_labels <- c("Lowest fifth", "Second fifth", "Third fifth", "Fourth fifth", "Highest fifth", "Top 5 percent")
n_x <- nrow (y)
n_lines <- ncol (y)
par (mar=c(3,4,1,1), mgp=c(2,.5,0), tck=-.01)
plot (c(1,n_x), range(y), xlab="", ylab="Avg annual change", xaxt="n", yaxt="n", bty="l", type="n")
y_ticks <- seq (-2,4,2)
axis (2, y_ticks/100, paste (y_ticks, "%", sep=""))
par (mgp=c(1,.5,0))
axis (1, 1:n_x, x_labels)
abline (0, 0, col="gray")
colors <- c("black", "gray20", "gray35", "gray45", "gray65", "brown")
for (i in 1:n_lines){
  lines (1:n_x, y[,i], col=colors[i])
  text (4, y[4,i], line_labels[i], col=colors[i])
}
mtext ("Average annual change in mean family income, 1950-2010,\nby quintile and for the top 5 percent", 3, -1)
dev.off ()

# Graph 2: small multiples, one panel per income category
pdf ("changes2.pdf", height=4, width=5)
y <- avg_changes
x_labels <- rownames (y)
line_labels <- c("Lowest fifth", "Second fifth", "Third fifth", "Fourth fifth", "Highest fifth", "Top 5 percent")
n_x <- nrow (y)
n_lines <- ncol (y)
par (mar=c(3,4,1,1), mgp=c(2,.5,0), tck=-.01, mfrow=c(3,2))
for (i in 1:n_lines){
  plot (c(1,n_x), c(-.03,.05), xlab="", ylab="Avg annual change", xaxt="n", yaxt="n", bty="l", yaxs="i", type="n")
  y_ticks <- seq (-2,4,2)
  # Tick positions must be proportions (as in graph 1), since y is not in percent
  axis (2, y_ticks/100, paste (y_ticks, "%", sep=""))
  par (mgp=c(2,1.5,0))
  axis (1, 1:n_x, x_labels)
  lines (1:n_x, y[,i])
  mtext (line_labels[i])
}
mtext ("Average annual change in mean family income, 1950-2010,\nby quintile and for the top 5 percent", 3, -1, outer=TRUE)
dev.off ()

# Graph 3: income levels over time, on a log scale
pdf ("income1.pdf", height=6, width=8)
y <- income
x_labels <- year
line_labels <- c("Lowest fifth", "Second fifth", "Third fifth", "Fourth fifth", "Highest fifth", "Top 5 percent")
n_x <- nrow (y)
n_lines <- ncol (y)
par (mar=c(3,4,1,1), mgp=c(2,.5,0), tck=-.01)
plot (range(year), range(y), xlab="", ylab="Avg family income (in 2010 dollars)", xaxt="n", yaxt="n", bty="l", type="n", log="y")
axis (1, seq(1950,2010,10))
axis (2, c(1e4,2e4,5e4,1e5,2e5), c("10K","20K","50K","100K","200K"))
for (i in 1:n_lines){
  lines (year, y[,i])
  text (year[n_years-8], y[n_years-8,i]*.88, line_labels[i])
}
mtext ("Trends in mean family income, 1947-2010,\nby quintile and for the top 5 percent", 3, -1)
dev.off ()
```
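The step in the code above that turns income shares into group means is easy to miss, so here is a small worked sketch of the arithmetic. The numbers are made up for illustration (they are not taken from the Census tables), and `overall_mean`, `share`, and `top5_share` are hypothetical names. The sketch assumes, as the code above appears to, that the shares are percentages of aggregate income and that they are multiplied by the overall mean family income.

```r
# Hypothetical numbers, for illustration only (not from the Census tables)
overall_mean <- 80000            # mean income over all families
share <- c(4, 10, 16, 23, 47)    # percent of aggregate income held by each fifth
top5_share <- 20                 # percent of aggregate income held by the top 5 percent

# A group's mean income is its share of total income times the overall mean,
# divided by the fraction of families in the group (0.20 or 0.05).
quintile_mean <- (share / 100) * overall_mean / 0.20
top5_mean <- (top5_share / 100) * overall_mean / 0.05

print(quintile_mean)
print(top5_mean)

# Sanity check: the five quintile means average back to the overall mean
print(mean(quintile_mean))
```

The last line is a useful check on real data too: if the five quintile means do not average back to the overall mean (up to rounding in the published shares), the wrong column was probably read.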

### 19 Comments

1. Manoel Galdino says:

I guess there’s a typo: “by” Tyler Cowen, not “my” Tyler Cowen, right?

2. andrew says:

I don’t think I agree — the last chart ends up obscuring the fact that incomes are down for all quintiles from 2000-present.

Maybe six small multiples line charts, one for each decade?

• Antonio Pedro says:

The information regarding quintiles from 2000-on is clearly there. Yet, if the focus is on the later period – as opposed to the entire period – you might want to do a different plot.

3. John Mashey says:

1) For this size of interval, the decadal graphs seem *silly* to me: in what sense is there anything meaningful about the choice of specific 10-year periods? As you note, the last decadal drop is really from 2007.

2) Given the spreadsheet, another approach, not a replacement for the last graph, is to plot approximations of the first derivatives for the 6 lines, since it is not so easy for humans to compare diagonal lines on a chart.

For example, see a time series of CO2 concentrations at a site in Antarctica over the last 2000 years.

Then, that is converted to a slope chart by computing linear regressions over 25, 51, or 75 years in the interval leading up to each year. That was mostly to illustrate the importance of picking a long enough interval, but it also illustrated the unusually rapid dip into 1600AD.

You might try something similar, although with a shorter interval, maybe 2-4 years, and plotting those, i.e., 6 lines. That allows for a comparison of growth rates by comparing positions at each date on the vertical scale, rather than comparing diagonal lines to see who’s growing/shrinking faster.
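A minimal sketch of this slope-chart idea in R, under some assumptions: the function name `rolling_growth` is hypothetical, the slope is fit to log income so that it reads as a growth rate, and the data are synthetic series with known growth rates rather than the Census numbers. With the real data, the `income` matrix and `year` vector from the post's code could be used in place of `toy` and `year`.

```r
# Rolling growth rate: slope of a least-squares line fit to log(y)
# over the `window` years ending at each year.
rolling_growth <- function(year, y, window = 4) {
  n <- length(y)
  stopifnot(window >= 2, window <= n)
  slope <- rep(NA_real_, n)          # no estimate until a full window is available
  for (t in window:n) {
    idx <- (t - window + 1):t
    slope[t] <- coef(lm(log(y[idx]) ~ year[idx]))[2]
  }
  slope
}

# Synthetic data for illustration only: two series growing at 1% and 3% per year
year <- 1947:2010
toy <- cbind(slow = 20000 * 1.01^(year - 1947),
             fast = 20000 * 1.03^(year - 1947))

# One column of rolling slopes per series
growth <- apply(toy, 2, rolling_growth, year = year, window = 4)

# Growth rates can now be compared by vertical position at each date
matplot(year, 100 * growth, type = "l", lty = 1,
        xlab = "", ylab = "Growth rate over previous 4 years (% per year)")
abline(h = 0, col = "gray")
```

A short window keeps the slopes responsive but noisy; lengthening `window` smooths the lines at the cost of blurring turning points such as 2007.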

• Drew D says:

I very much like this approach, thanks.

4. Carlos J. Gil Bellosta says:

Maybe it would be useful to break the data into two parts: representing on the one hand the overall variation (of the mean or median for the data) and then the variations with respect to the mean for each quintile.

Then the overall trend (in which the lower quintiles tend to get worse off over time) would be clearer in a graph similar to the second one: the crossing of the lines would be much clearer.
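One way this two-part breakdown could be sketched in R is shown here. The data are synthetic and the variable names are hypothetical; the sketch uses the mean (not the median) and takes "variation with respect to the mean" to be each group's ratio to the overall mean, which is only one possible reading of the suggestion.

```r
# Synthetic data for illustration only: three equal-sized groups
year <- 1947:2010
t <- year - 1947
income <- cbind(lowest  = 12000 * 1.008^t,
                middle  = 40000 * 1.012^t,
                highest = 110000 * 1.018^t)

# Part 1: the overall trend. With equal-sized groups, the overall mean
# is the mean of the group means. (With the post's data this would use
# only the five quintile columns, not the overlapping top 5 percent.)
overall <- rowMeans(income)

# Part 2: each group relative to the overall mean in the same year
relative <- income / overall

op <- par(mfrow = c(1, 2), mar = c(3, 4, 2, 1))
plot(year, overall, type = "l", log = "y",
     xlab = "", ylab = "Overall mean income", main = "Overall trend")
matplot(year, relative, type = "l", lty = 1, log = "y",
        xlab = "", ylab = "Ratio to overall mean", main = "Relative to the mean")
abline(h = 1, col = "gray")
par(op)
```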

5. Peter Nelson says:

Could someone share the data? The mean family income for quantiles data I can find only extends back to 1967. I’d really like to give John Mashey’s suggestion a shot.

• Andrew says:

Peter:

See P.S. above!

• Peter Nelson says:

Sorry, when I first loaded the page, the PS didn’t exist. Thanks!

• Andrew says:

No need to be sorry; I added the P.S. in response to your comment.

6. Rob Bray says:

While the use of log income on the y scale (without any label) allows growth rates to be compared, the presentation shown in the chart is more likely to mislead most readers who will look at the chart and say there really is too much difference in the income levels of the various quintiles.

While the use of decades in the first chart is a limitation, it clearly tells a story: strong growth across the first two decades, with the bottom quintile doing better than average, and then very skewed growth in the 1980s and 1990s.

I think it is a matter of appropriate charts for different audiences – but in terms of telling the story I think Tyler is clearer. (Although he may have done better to use time spans associated with particular economic phases rather than constant decades – given that he is showing annualised growth rates this should not cause any problems.)
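To illustrate why annualised rates make unequal time spans comparable, here is a minimal R sketch. The function `annualized_growth` and the income figures are hypothetical, and it uses a compound (geometric) rate, which is a different calculation from the average of year-to-year changes used in the post's code.

```r
# Compound annual growth rate between two values that are n_years apart
annualized_growth <- function(start_value, end_value, n_years) {
  (end_value / start_value)^(1 / n_years) - 1
}

# Hypothetical incomes: the same total rise over spans of different length
rate_short <- annualized_growth(50000, 60000, 7)    # a 7-year expansion
rate_long  <- annualized_growth(50000, 60000, 12)   # a 12-year expansion

print(round(100 * c(rate_short, rate_long), 2))
```

Because each rate is expressed per year, a 7-year phase and a 12-year phase can sit side by side on the same axis.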

• Andrew says:

Rob:

1. Just to be clear: Tyler didn’t make the graph, he just posted it.

2. I agree with all of what you wrote, and I think it is consistent with what I wrote:

Still, even the time-trends graph is not quite a replacement for the original bar plot which shows so much drama. I think my recommended solution is to give the bar plot for the initial impression and then follow up immediately with the time-trends graph, which shows the big picture much more clearly.

There’s no reason to choose just one graph here. With modern technology, it’s easy enough to show the dramatic graph first and then follow up with the detailed graph (and then with the numbers in a spreadsheet).

7. Rob Bray says:

Thanks for the response – I am still concerned about most people interpreting the log axis in terms of the relativities of actual earnings.

I would also consider, in a case such as this, presenting the series as index numbers from the base year. This would allow both the cumulative trends in income over time and the actual relative patterns of growth in specific periods to be seen – it may though be, as with your line graph of growth rates, that there is too much overlap between the series.

As you said, multiple graphs are useful, but in practice it is so often the case that, when one actually gets down to presenting results, you end up having to choose the single one which best illustrates what is most important, rather than presenting different cuts to illustrate all of the facets.

8. Bob Carpenter says:

First, how about adjusting for inflation? Given the centering around 0, this will have a huge effect on how the graph looks. And it’ll also change the relative behavior of the decades as they had varying rates of inflation.

Second, Wikipedia beat you to it, with nice divisions by president:

http://en.wikipedia.org/wiki/File:IncomeInequality7.svg

and there’s also a nice discussion following some of the above with plots in:

http://aneconomicsense.com/2012/07/20/the-shift-from-equitable-to-inequitable-growth-after-1980-helping-the-rich-has-not-helped-the-not-so-rich/

This particular graph’s nice in equalizing the differences to 1980 and then plotting changes in real income.

http://aneconomicsense.files.wordpress.com/2012/07/real-incomes-by-distributional-shares-1980-2010.png

• Andrew says:

Bob:

I find the divisions by president to be a distraction but that’s a matter of taste. And, yes, the numbers were already adjusted for inflation before I even touched them (see the y-axis of my graph).

9. ezra abrams says:

if I were grading this blog post

1/2 letter grade deduction for not precisely specifying the data source:
at this url http://www.census.gov/hhes/www/income/data/historical/families/,
go to TABLE F3

1/2 letter grade deduction for using R when you could have done it in the original excel sheet in about a second and a half

1/2 letter grade deduction cause tyler's original post made the point of unequalness occurring after 1975 better

1/2 letter grade deduction for not noting that graphs like these should be evaluated on an experimental basis: you make two versions of the graph and show them to a pool of people, and time the people for how long it takes to figure out what the graph means. You should have also referenced people like Naomi Robbins, whose little paperback is 1/4 the cost and weight, and has twice as much useful info, as the books by tufte and few; point out the basic work in psychology and perception of graphical data done by cleveland; point out that today, most of this is done by people who are asking how to convey medical info (see science (aaas) sept 2011 review article by spiegelhalter and refs therein; paywalled at http://www.sciencemag.org/content/333/6048/1393.short)

• Andrew says:

Ezra:

If you think it’s better to make graphs in Excel than in R, then I think it’s pretty much impossible for us to talk with each other.

Also, I think it’s ridiculous to suggest that I be penalized for posting a graph without “evaluating it on an experimental basis.” You might be very good at running such experiments, but I’m not. I’m good at making graphs. If you want to run an experiment using my graphs, go for it. There’s this concept of the division of labor: I do what I’m good at, you do what you’re good at. I would love it if people were to run experiments to test my conjectures. In the meantime, I believe I add value by discussing my experiences here.

10. Kaiser says:

Great post, Andrew. One difference between the bar chart and your final chart is the change vs. value scales. I’m thinking that if you make an index of the incomes (first year = 100), and let all the lines start at the same level, it would work better to highlight the percent changes over time.
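A minimal sketch of the indexing idea in R, using synthetic data and hypothetical variable names. With the real data, the `income` matrix and `year` vector built in the post's code could be substituted, assuming the rows are in increasing year order.

```r
# Synthetic data for illustration only: three series with different growth rates
year <- 1947:2010
t <- year - 1947
income <- cbind(lowest  = 10000 * 1.010^t,
                middle  = 40000 * 1.015^t,
                highest = 90000 * 1.020^t)

# Divide each column by its first-year value, so every series starts at 100
index <- 100 * sweep(income, 2, income[1, ], "/")

# All lines share a starting point, so gaps between them show
# cumulative differences in growth since the base year
matplot(year, index, type = "l", lty = 1, log = "y",
        xlab = "", ylab = "Mean family income (first year = 100)")
abline(h = 100, col = "gray")
```

Indexing discards the income levels, so this graph complements the time-trends graph instead of replacing it.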

11. Rick Wicklin says:

I agree that this is a difficult problem to solve for every series. The initial data and your various solutions remind me of a similar situation I faced. I saw a graph in Businessweek and thought, “I can do better.” My attempts to improve it are here: http://blogs.sas.com/content/iml/2010/12/03/how-does-participation-in-social-media-vary-with-age/

I think a lesson is that “series plot” is a valuable attempt, but sometimes the data values cause overlap and we are forced to try other visualizations.