Hey, I think something’s wrong with this graph! Free copy of Regression and Other Stories to the first commenter who comes up with a plausible innocent explanation of this one.

Paul Alper points us to this column by Dana Milbank discussing the above graph from Georgia’s Department of Public Health:

Ok, the comb-style bar graph is, as always, a bad idea, as it multiplexes two dimensions (county and time) on a single x-axis. The graph should be a lineplot, with one line per county, and the lines labeled directly with county names.

But, wait . . . the graph has another problem. The ordering of the counties changes with each date!

But, wait . . . that’s not the biggest problem. As Milbank writes:

But on closer inspection, the dates on the chart showed a curious ordering: April 30 was followed by May 4; May 5 was followed by May 2, which was followed by May 7 — which in turn was followed by April 26. The dates had been re-sorted to create the illusion of a decline. The five counties were likewise re-sorted on each day to enhance the illusion.
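The ordering Milbank describes is easy to check mechanically. Here is a minimal sketch in Python that uses only the adjacent pairs of dates named in the quote. The year 2020 and the helper name `is_chronological` are my assumptions for illustration; nothing here comes from the code behind the chart.

```python
from datetime import datetime

# Adjacent pairs of x-axis labels, as listed in Milbank's column.
# The year is assumed to be 2020; the quoted labels omit it.
PAIRS = [
    ("April 30", "May 4"),
    ("May 5", "May 2"),
    ("May 2", "May 7"),
    ("May 7", "April 26"),
]

def parse(label, year=2020):
    """Turn a label such as 'May 4' into a date."""
    return datetime.strptime(f"{label} {year}", "%B %d %Y").date()

def is_chronological(first, second):
    """Hypothetical helper: True when `second` falls after `first`."""
    return parse(first) < parse(second)

# Pairs where the date on the left is later than the date on its right.
out_of_order = [(a, b) for a, b in PAIRS if not is_chronological(a, b)]
print(out_of_order)
```

A check like this, run on the axis labels before publishing, would flag any time axis that runs backward.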

Or maybe the software did it automatically? Fortunately, the Georgia Department of Public Health has all their data and code on Github, so you can go run the program yourself and see . . . just kidding!

Milbank continues:

The governor’s office apologized for what state Rep. Scott Holcomb, an Atlanta Democrat, properly called a “cuckoo” presentation of data. But as the Atlanta Journal-Constitution noted, it was the third such “error” in as many weeks involving sloppy counting of cases, deaths and other measures tracking covid-19. . . .

I have no idea if this was a software default (in which case, I assume the people who made the graph screwed up under pressure) or if some extra effort had to be made to sort the numbers in this way. Maybe someone just thought the graph looked prettier this way?

As a statistician who specializes in graphical communication, I’m usually happy to see statistical graphics in the news—but, after seeing this example and this one from a few days ago, I’m not so happy.

P.S. In comments people are trying to come up with innocent explanations of the above graph. It’s possible. I don’t know Excel or whatever program was used to make that graph.

Free copy of Regression and Other Stories to the first commenter who comes up with a plausible innocent explanation of the graph. You have to provide enough detail about the software to demonstrate how someone could’ve made this graph by accident.

83 thoughts on “Hey, I think something’s wrong with this graph! Free copy of Regression and Other Stories to the first commenter who comes up with a plausible innocent explanation of this one.”

    • Well, it would be straightforward to produce such a graph in Excel – but not by accident.

      You could sort the data in descending order, and the dates are just strings.
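To make the "dates are just strings" point concrete, here is a small sketch in Python rather than Excel. The labels are chosen only for illustration and are not read off the Georgia chart. It shows that date labels sorted as text come out in a different order from the same labels sorted as dates.

```python
from datetime import datetime

# Illustrative labels only; these are not taken from the Georgia chart.
labels = ["May 2", "April 30", "May 10"]

# Sorted as plain strings: compared character by character.
as_text = sorted(labels)

# Sorted as dates: parse first (year 2020 assumed), then compare.
as_dates = sorted(labels, key=lambda s: datetime.strptime(s + " 2020", "%B %d %Y"))

print(as_text)   # ['April 30', 'May 10', 'May 2']
print(as_dates)  # ['April 30', 'May 2', 'May 10']
```

Text ordering alone would not produce the published chart, which is ordered by case counts. The point is only that a text column carries no notion of time, so any sort applied to the rows is free to scramble it.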

        • Kind of like when Trump sent a clear message to Kemp that he (“wink, wink”) should not open the state, but that he should do whatever he believed was right. Until the next day, when he said he wasn’t happy with Kemp. I remember it like it was 2 weeks ago.

  1. Frankly, this seems too dumb to be an intentional attempt to deceive. I wonder if it really is some kind of weird error due to inexperienced use of Excel or something.

  2. I’m with the people saying this seems too bizarre to be deliberate. The best explanation I can come up with is that the data is in Excel and they accidentally sorted based on some column not charted, like the name of the supervisor who entered that data line or something.

      • I think Alex has the right of this – if you have columns A, B and C and you select them to sort in Excel, if you sort from smallest to largest without specifying which column then column A will define the order. They might have had an additional column such as the mean of value (which had been used in other analysis) that accidentally defined the ordering. Possibly a case of the results match what we wanted/expected to see, let’s keep moving.

        • But made available without proofing? That’s pretty egregious given the public health policy implications. And the impression of the graphic is too in line with the story some leadership wants to tell. Perhaps I am too cynical…..

    • Pretty sure it’s not excel. A ‘clustered column’ where the categories and colours appear in different orders within each cluster is actually very difficult to achieve in excel and would require the user to manually change the individual columns and alter the legend. Just changing the sort order of the data wouldn’t do that.

  3. The counties seem to be sorted by the biggest number of confirmed cases to the lowest. The dates seem to be sorted by the biggest number of confirmed cases that day to the lowest.

    • Fernando:

      Yes, the data are definitely sorted that way. The question is whether this could’ve happened by accident via some bizarre software setting, or did someone do it on purpose to make it look like cases are monotonically decreasing?

      • It’s more bizarre than that. The dates are NOT sorted by the tallest column. Just look at the second and third groups and the pattern is already broken. I’m guessing the dates are ordered by the sum of daily new cases.

        • It’s very common practice to order the categories for a bar chart by the values; usually it greatly improves the value of the chart, especially if the alternative is, e.g., alphabetical. Of course, sequencing of dates should trump this, but I can imagine it being a mistake / poor choice, rather than deliberate.

  4. It looks like this chart comes from a dashboard created using SAS: https://ga-covid19.ondemand.sas.com/

    An updated version of this chart can be found by clicking on the “Top Five Counties” tab on the panel directly below the panel with the map of Georgia – the dates and counties were ordered consistently on the latest version of the chart when I checked it.

    • Digging a bit deeper, it’s quite possible the dashboard was created with SAS Visual Analytics (https://www.sas.com/en_us/software/visual-analytics/viya-features.html). I haven’t used it, but I imagine it is a tool similar to Tableau. Interestingly, one of the listed features of this software is the following (from the above link):

      “Custom sort allows you to rank order category data items in a table or graph by characteristics (e.g., products, customers). The characteristics that are most important to your organization will be displayed first.”

      It’s possible the innocent explanation is that a feature designed to be helpful (i.e. quickly and easily sort your bar chart categories from largest to smallest, rather than alphabetically) has been misapplied here due to a misclick (or perhaps it is performed by default, I don’t know).

  5. Due to labour cut backs, the people qualified to perform this task were let go. To save money in the future they gave the task to the first person who wanted to prove that making graphs “isn’t so hard”. They weren’t quite sure how best to do this though. To save face, they started copying and pasting code from wherever they could find it. They were just told to make the situation look as good as possible, so when this graph came out they patted themselves on the back, sat back, and congratulated themselves.

  6. Something like this has happened to me more than once…

    Note that even for each day, the counties are ordered, i.e., their order changes from day to day.

  7. I’ll guess it was the result of someone using the histogram sort feature in Stata. Hopefully I’m wrong since I already pre-ordered the book (which was originally hardcover but somehow got changed to paperback one day (not by me)).

    • I’ve been using Stata for 25 years, and while I don’t claim to know all the ins and outs of its graphics capabilities, I’ve never seen Stata produce a graph that looks remotely like that stylistically.

      And while it is possible to rig up a Stata data set so as to produce a multi-bar graph that was inappropriately ordered like this, it couldn’t be done by accident in any way I can think of. It would have to be intentional. In fact, it would be fairly complicated to do and require a fair amount of programming: not something somebody would just stumble into through an innocent mistake.

  8. Innocent explanation:
    It’s a side effect of how SAS will sometimes order things by the way they appear in the data, and the character date variable meant that the axis was unordered by default, meaning it would inherit the data set order.

    The left to right order follows a sort by total number of cases. At some point the data was sorted by total cases by day (overall) and then cases by county within those days. The plot x-axis was for an unordered variable, but it inherited the order in the data.
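The mechanism described in this comment can be sketched outside SAS. What follows is Python, not SAS, and the rows are invented for illustration. It assumes an axis that treats a character date as an unordered label and shows categories in the order it first meets them, and compares that with an axis that knows the values are dates.

```python
from datetime import date

# Invented rows in the order they sit in the data set: (date as text, county, cases).
# They are assumed to have been sorted by case counts at some earlier step.
rows = [
    ("2020-04-29", "A", 50), ("2020-04-29", "B", 40),
    ("2020-04-30", "B", 25), ("2020-04-30", "A", 20),
    ("2020-04-28", "B", 30), ("2020-04-28", "A", 10),
]

# Unordered (character) axis: categories appear in order of first appearance.
axis_as_text = list(dict.fromkeys(d for d, _, _ in rows))

# Date-typed axis: categories are placed by their value, whatever the row order.
axis_as_date = sorted({date.fromisoformat(d) for d, _, _ in rows})

print(axis_as_text)
print([d.isoformat() for d in axis_as_date])
```

Under these assumptions the text axis inherits whatever order the rows happen to be in, while the date axis stays chronological no matter how the data set was sorted.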

  9. I can replicate the moving of the dates in ggplot2 with the reorder() function. But the sorting in descending order is not working as in the graph displayed:

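    library("tidyverse")
    # The name of the input data frame was lost from the original comment.
    # `cases_wide` is a placeholder: one row per county, one column per date.
    df <- cases_wide %>%
      rownames_to_column() %>%
      gather(key = "key", value = "value", -rowname)
    df

    # reorder() places the dates (key) in order of decreasing mean value
    ggplot(df, aes(reorder(key, -value), value)) +
      geom_bar(aes(fill = rowname), position = "dodge", stat = "identity")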

  11. My wild guess is they have used Power BI – the software can automatically sort the chart in descending order based on the y axis (think it’s sorted on the total) and Power BI does not recognise data type unless you explicitly set them in the PowerQuery editor. The color scheme also seems similar to what Power BI offers.

  12. This is certainly not an excel graph and no STATA graph looks like that. I’d go with Steve’s suggestion of SAS. You can get this error if you don’t format the date column as… well… a date. Though why you would sort it by cases is strange, and the default in almost all languages is to sort in ascending (rather than descending) order. So you have to go through a few steps that don’t make sense to get to this graph. This graph is from the “Top 5 Counties” tab, which is not the first tab on the DPH website. The first tab is the daily new cases with a 7-day rolling average, which (to my knowledge) has never been strangely sorted. So again, there was some extra special sorting in the graph highlighted in this post.

    Here’s the Atlanta Journal-Constitution story on it: https://www.ajc.com/news/state–regional-govt–politics/just-cuckoo-state-latest-data-mishap-causes-critics-cry-foul/182PpUvUX9XEF8vO11NVGO/, which highlights other Georgia data mishaps. It leads with this amazing line: “Where does Sunday take place twice a week? And May 2 come before April 26? The state of Georgia, as it provides up-to-date data on the COVID-19 pandemic.”

    Here is the state spokesperson’s explanation: “The x axis was set up that way to show descending values to more easily demonstrate peak values and counties on those dates. Our mission failed. We apologize. It is fixed.” Why that would actually be helpful

  13. I would like to participate for the free copy, but, and this is a big but, some “mistakes” cannot be amended. This is as big a mistake as you can make with data during a pandemic.

  14. To my mind a problem bigger than all those mentioned is that what is reported are total deaths by county rather than per capita deaths by county. It is relevant information that Hall county’s population is less than 1/4 of the population of any of the other reported counties. Cobb and Dekalb have essentially the same population, and Gwinnett and Fulton are similar, both larger by a factor of about 5/4 than Cobb and Dekalb.

  15. It looks like the dates are sorted according to the total; so if you sum the counties and sort, you would get this order of dates.
    Next, within each date, we have once again a descending order.

    So I think one could obtain this graph, just by trying to show the bars in descending order, where first there is a descending order over date and then a descending order within date.

    If you are a bit stats savvy (or minimally numerically inclined) it seems silly. But I teach a course in which students learn how to visualize data and present using data visualizations (loosely inspired by a course by Andrew, in fact). I believe that some of my students could come up with this, and they would be convinced that this was a pretty good figure (maybe this says something about me). These students would be the ones who worked hard on mastering the tools (in my case ggplot2) but who were a bit lacking in their feel for the topic.

    As for which software was used, no idea. I could do it in R but it is not something I would come up with naturally.

    • Winner!

      So, what the graph shows is how bad things were in the worst counties on the worst days. So, if you were looking for days and locations when morgues or ERs were overflowing, this graph might help.

      It may be that when the site was developed, the ordering of the dates was correct. In some incorrect sense of “correct.”

    • By coincidence I came across this quote: “Never attribute to malice that which is adequately explained by stupidity.”

      A bit harsh, but it seems pretty appropriate for the situation at hand.

      When can we order the book, Andrew?

  16. Do you want my address? :)

    This is a common business graphic. It has a sensible case.

    Both dates and counties are treated as categorical data. A common business case is product sales (counties in the above) by region/location (dates in the above). It’s sorted the way it is so the exec can see which region has the largest sales and which product has the largest sales in each region. I’m sure Excel would do this automatically with certain data configurations.

    It would even make sense if it were product sales by date if exec is trying to highlight the biggest sales days of the year – exec wants top five or ten sales days and what products sold best on those days; or dates could be equal to various promotions (e.g., “back to school”, “Halloween”, “Spring Break” etc).

    It may not be what a statistician would consider the best presentation, but in the cases above it’s suitably functional.

    • Actually the more I think about it the more I realize it’s actually the appropriate form for the purpose I described. The top/bottom selling regions/locations will be immediately obvious. Which product sells best is a secondary concern but it will be easy to see at a glance for any given location/region.

      A good use case for Andrew: you have a team of statisticians / data scientists and you want to track their working habits. So the dates could be replaced with data scientist name and the counties could be replaced with the use time for each software title that was active on their computer for a given day. :)

      • I’m able to reproduce everything but the sorting by totals automatically in excel. If I set up the dates as series, it plots them in the order of occurrence.

  17. Their problem was they did not fit a cubic polynomial to it before publishing it! While I would really like to get a free copy of the book, sorry I don’t believe there is an innocent explanation. Because even if you can come up with a software explanation that reproduces this behavior in an innocent way, think of the number of people who had to have looked this over both before and after it was published and didn’t see anything wrong with it.

    And I would chuckle at all this, as well as predictions of 500 deaths and it is just the flu, except for many people this is serious stuff, both in terms of people’s physical and economical well-being. And unfortunately, for a number of reasons I am in a high-risk category, so I am not laughing.

  18. In Matplotlib graphing in Python, I’ve seen both of these issues happen before. If the x axis was not typed correctly, Matplotlib will automatically sort the graph based on the string or numbers it interprets. It is actually a common confusion for rookies – you can find several stack overflow topics on this.

    Additionally, it just looks like the counties are ordered by size for each date. That seems like a default as well.

    Sorry I can’t give specifics, but I’ve made this exact type of mistake before early in my coding career (I’m a finance guy learning Python so I can escape Excel hell)

    • Wow, this sounds so weird to me. I do a lot of plotting with pyplot, but I always give the plotting function the data in the order I want. I don’t even know how to let pyplot sort it automatically. It would never have occurred to me.

  19. Back in antiquity, a typical introductory statistics course would spend (i.e., invest) much time on data presentation and how to detect misleading graphs: distortion due to omitting the origin, distortion due to perspective, colors, shapes. From the comments in this blog, the emphasis on available software indicates trickier things can be done, either innocently or on purpose. Technology can be dangerous to the nation’s health and in Georgia at least, “timing is everything.”

  20. Innocent explanation for why this wasn’t picked up by anyone after its creation: the fonts on the X axis are tiny and no one noticed they were not ordered?

  21. If only there was a world-class biomedical engineering department or a world-class research hospital system or a world-renowned center that studies diseases and their control located somewhere in Georgia that the folks at the GA Dept. of Health could consult with!!

  22. The more I look at this the more I see what the makers intended. I suspect that their audience was Average Georgians (AGs), not Modelling Wonks (MWs). The MWs have an extremely strong prior for chronological presentation, but that’s not necessarily what the AGs are into. Not defending this chart really. But some kinds of information wouldn’t be readily visible in a time series.

    If you bust out all the different types of information that could be shown in a chart with these data, here’s how the actual chart (BC) would compare with Andrew’s time series (TS) alternative (“prop” means “proportion”):

    Information shown            TS   BC

    Chronology by County:        Y    N
    Chronology Overall:          N    N
    Sum by County Overall:       N    N
    Prop by County Overall:      N    N
    Max / Min Date Overall:      N    Y
    Max / Min Date By County:    Y    N
    Sum by County per Date:      Y    Y
    Prop by County per Date:     N    Y

    If Andrew added a second axis and sixth time series for the total number of cases that would give Andrew’s chart the overall advantage. But really it depends on what you’re interested in.

  23. Why would this one chart be monkeyed with to show a fake decline? The Georgia website has a giant graph called “COVID-19 Cases Over Time” that clearly (and accurately) shows the time trend.

    • Terry:

      You’ve heard of the irresistible force vs. the immovable object? This one is Hanlon’s Razor vs. Occam’s Razor: On one hand, it seems much more plausible that this graph was made by accident by some frustrated underling than that it was part of some devious disinformation campaign; on the other hand, it would take lots of steps to make this particular graph; indeed it seems harder to make than a default version where the dates are in order.

  24. My best guess is that they chose the incorrect column. Usually they would have the cumulative count next to the daily count. So they accidentally chose the daily instead of the cumulative, to avoid formatting the date correctly, and displayed it as it came in the original dataset. They wanted to see which county displayed the largest count by each day, and thought that would be the best way to sort it. And I think the default for most programs is to sort descending first, instead of ascending. Which would explain that it’s downwards instead of upwards.

  25. I used Excel to redo the offending graph. A graph with a normal time line shows no obvious conclusions. There is also a difference in at least one data point not readily explainable. If you want the pdf of the graph I did, let me know how to send.

  26. This seems to be caused by: sort by daily volume, sort by county volume, iterate on rows of data and plot (given fixed labels, colours).

    It can be a genuine mistake in an automated piece of code.

    – Correct: plot(data.sort("date", "county_name"))
    – This example: plot(data.sort("date_total", "county_total"))
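The two calls in this comment are pseudocode, so here is one way they might look as runnable Python. This is a sketch under assumptions: the records are invented, the names `chronological`, `by_volume`, and `bar_order` are hypothetical, and I read "county_total" as the county's count on that day. Both orders are descending, to match the chart.

```python
# Invented records: (ISO date, county, new cases). Not the Georgia data.
records = [
    ("2020-04-28", "A", 10), ("2020-04-28", "B", 30),
    ("2020-04-29", "A", 50), ("2020-04-29", "B", 40),
    ("2020-04-30", "A", 20), ("2020-04-30", "B", 25),
]

# Daily total across counties, used as the (mistaken) primary sort key.
date_total = {}
for day, _, cases in records:
    date_total[day] = date_total.get(day, 0) + cases

def chronological(rec):
    """Intended key: date, then county name."""
    day, county, _ = rec
    return (day, county)

def by_volume(rec):
    """Mistaken key: largest daily total first, then largest county count."""
    day, _, cases = rec
    return (-date_total[day], -cases)

def bar_order(key):
    """Left-to-right sequence of bars as (date, county) pairs."""
    return [(d, c) for d, c, _ in sorted(records, key=key)]

print(bar_order(chronological))
print(bar_order(by_volume))
```

With the second key, the dates come out of sequence and the county order flips from one date to the next, which are the two oddities noted in the post. The only difference between the two versions is the sort key, which is what makes the "genuine mistake" story at least conceivable.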

  27. If the chance of getting this book for every person is equiprobable, my expectation will be very low. It would be better to work at a restaurant, or get a job coding, because this task could take more than an hour or two. So the responses you’re getting are likely people who don’t think of their time like this, people with salary jobs, steady income, etc.

    Why the vocabulary “multiplexing?” I’m wondering when your audience switched to communications engineers.

  28. I re-created this chart in R at http://freerangestats.info/blog/2020/05/23/ordering-in-bar-charts. It was easier than I thought. The different sequencing of the bars within each daily clump of bars can be done relatively simply with tidytext::reorder_within() and clever use of the “group” aesthetic in geom_bar().

    I am inclined to think that this was mostly just poor judgement, using a facility in SAS VA (as mentioned in other comments) that is a good idea with many bar charts, but not when the main categorical variable is time, and not for shifting sequences within clumps of bars. But I can’t prove that, and it’s certainly possible or even probable that poor judgement was tempered with a desire to make the chart descend to the right.

    Interestingly but not relevant to the discussion on visualisation methods, my county-level data sourced from the New York Times is quite different to that in the image. I am sure there are reasons and have no info to judge which is better.
