Tableau and the Grammar of Graphics

The first edition of Leland Wilkinson’s book, The Grammar of Graphics, came out in 1999. Whether or not you’ve heard of the book, if you’re an R user you’ve almost certainly heard of the concept indirectly, because . . . you know ggplot2? What do you think the “gg” in ggplot2 stands for? That’s right!

Then in 2002, Chris Stolte, Diane Tang, and Pat Hanrahan of Stanford University published an article, Polaris: A System for Query, Analysis, and Visualization of Multidimensional Relational Databases, in which they cite The Grammar of Graphics:

Wilkinson [41] recently developed a comprehensive language for describing traditional statistical graphics and proposed a simple interface for generating a subset of the specifications expressible within his language. We have extended Wilkinson’s ideas to develop a specification that can be directly mapped to an interactive interface and that is tightly integrated with the relational data model. . . .

The primary distinctions between Wilkinson’s system and ours arise because of differences in the data models. We chose to focus on developing a tool for multidimensional relational databases . . . The differences in design are most apparent in the table algebra . . .

Shortly afterward, this work was developed into Tableau:

In 2003 Tableau spun out of Stanford University with VizQL™, a technology that completely changes working with data by allowing simple drag and drop functions to create sophisticated visualizations. The fundamental innovation is a patented query language that translates your actions into a database query and then expresses the response graphically.

Both ggplot2 and Tableau have become very successful, the first as free, open-source software and the second as a commercial product. The documentation for ggplot2 (as well as its name) very clearly cites The Grammar of Graphics. It would be good if Tableau did this too.

22 thoughts on “Tableau and the Grammar of Graphics”

  1. I never realised Tableau is part of the family of Grammar of Graphics–inspired software.

    Almost all the Tableau-generated graphs I have seen looked icky. Maybe people are misusing it. They always looked like MBA-oriented flashy stuff.

  2. I knew that Tableau was inspired by the Grammar of Graphics, so I feel like they did cite them somewhere. Maybe not very prominently?

    Tableau is a pleasure to use, and the defaults are more aesthetically pleasing than Excel or base R. But it’s always possible for a user to drive a tool into the ground, no matter how good the tool.

  3. But what does the 2 stand for? I’m sorry, I have to disagree. ggplot2 might be based on better theory, but it is hard to learn and hard to teach, since it departs from the graphics package that most users have already learned. I much prefer lattice for that reason, since it builds upon what users already know. Furthermore, ggplot2 is only 2D (is that what the 2 means?). Let’s have a ggplot3 which fixes all of these issues (or a ggplot1?).

    • Rodney:

      I used to feel that way, but recently I used ggplot2 to make a graph that I needed, which would’ve taken a lot longer using the graphics tools I’d known before. So now I’m a fan of the practice as well as the theory.

    • I guess it depends on when you learned R. I think a lot of students nowadays learn the tidyverse, and ggplot2 (while not really part of it) is obviously conceptually related.

      Sometimes I get the feeling that the dialect differences between those using base R and those using the tidyverse are becoming so wide that the two camps are becoming mutually incomprehensible.

      (I think Hadley Wickham originally developed ggplot but quickly superseded it with ggplot2, though I don’t remember where I read that. Another problem with ggplot2 is that it’s designed for print rather than the web and isn’t interactive. But maybe people are building extensions for those kinds of things.)

      • I would say ggplot2 is part of the tidyverse. I feel like the tidyverse started as a relatively accessible way to perform regular data manipulation, cleaning, and management. In the last couple of years it has gotten much more focused on programming.

        That has had the effect of making the everyday tools I use less easy to access (well, I don’t use them daily enough to remember them; then old code doesn’t comply, then I have to look it up and end up swamped in the weeds of programming, which I can only do in a rudimentary way). Its divergence from base R started with the intention that results be predictably what you wanted, or break noisily. And this has started to change a little bit as well (you’ll definitely read a few warnings in the documentation about being careful of the structure you use, because it might result in not calculating what you thought you were calculating, while still outputting something that is the same shape, size, and class that you expected).

        But what I really meant to comment on was that there are definitely packages and gg-extensions that provide 3D and dynamic graphs while still using the syntax of ggplot2.
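
        For instance, a minimal sketch (assuming the plotly package is installed; its ggplotly() function is one such extension) of making a static ggplot2 chart interactive:

        ```r
        library(ggplot2)
        library(plotly)  # assumed installed; provides ggplotly()

        # An ordinary static ggplot2 chart built from the built-in mtcars data
        p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
          geom_point()

        # ggplotly() converts the static plot into an interactive, web-ready widget
        ggplotly(p)
        ```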

    • I agree with Joe. ggplot2 only departs from what you have already learned if it’s not the first graphics package you learn, which I think it increasingly is. And though it has a higher start-up cost to learn, I think its conceptualization is more powerful and scalable. It seems to make it easy for people to develop extensions, though unfortunately not really for 3D.

      Kinda relatedly, I wonder how much of people’s difficulty with learning R is because they were taught base R instead of the tidyverse to start. Anecdotally, it seems that when I meet people who find R non-intuitive, it’s because they were learning / being taught base R instead of the tidyverse.

  4. I read Wilkinson’s book when it was pretty new, and I really liked the concept but had no place to use it. When Wickham’s 2009 book came out, I started using ggplot2. My impression at the time was that ggplot2 was heavily inspired by the GoG but syntactically sweetened in ways that made it easier to type but a bit harder to get creative in visualizing data.

    From memory (I haven’t reread Wilkinson’s book recently), the original GoG made explicit use of layers, which seemed to make it easier to get creative in their use (perhaps because one was always dealing with their syntax and semantics), while ggplot2 made them less visible, which usually just worked, until you needed them and had to remember that there was such a construct and how one used it.

    Joe mentioned the tidyverse. It’s perhaps noteworthy that the tinyverse also uses ggplot2.

  5. “Modern platforms such as R and Tableau are built around the fundamentals of the Grammar of Graphics” All the important stuff gets mentioned in blogs, Andrew: https://www.tableau.com/about/blog/2019/2/three-waves-data-visualization-brief-history-and-predictions-future-100830

    I don’t disagree that GoG should be mentioned alongside VizQL, though I for one am guilty of emphasizing other stuff (like VizQL and Show Me) when I tell people who want to know about Tableau what’s special about it. The reason I recommend it to people who don’t want to deal with code or learn much about visualization is that, if you use the shelf model, it does its best to generate visualizations that are perceptually effective, by assigning as many variables as possible to position encodings. So its differentiating features are not as clearly linked to GoG.

  6. My old job was heavily reliant on Tableau for creating scalable data exploration tools for non-expert users, so I had to learn it. It’s very powerful and very good at what it does, but you really need to think about optimizing your data model for Tableau if you intend to use the data in Tableau. This was actually the hard part. We often ended up having to create two versions of the output from an ETL process, one for spreadsheets and one for Tableau, with slightly different data models in order to be able to do what we wanted to do in Tableau in a way that scaled.

    I’m also decent at ggplot2 and some related packages like tmap. I never knew about the connection to Grammar of Graphics. Interesting.

  7. Two observations about Tableau. First, it is notable how successful it has been commercially. Yes, the output is high quality and the user interface does not require coding (although I find it unnecessarily clunky compared with something like the Graph Builder in JMP). But it is an expensive commercial product, and I find it ironic that it has so many customers – the same customers that insist that their new hires be proficient in coding in multiple open-source languages.

    The second point I’d make is that Tableau does produce nice visualizations, but the process of getting them conflicts with some fundamental principles. The best example is that the default display for all data is aggregated: if you put a continuous x and y variable in a scatterplot, you get a single point. It takes a little work (admittedly not much, but still, the default should be to show the disaggregated data) to disaggregate it and show the individual points. Having taught the use of many of these programs, I find it disturbing that the software leads you toward aggregation when I think it should do the opposite. As we know, average effects are not nearly as interesting or important as more granular impacts, and I believe that good software should help people think that way, with aggregation delayed as long as possible.
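
    By contrast, ggplot2’s default is the disaggregated behavior described here: one mark per row of the data, with any aggregation added as an explicit layer. A rough sketch using the built-in mtcars data (the binned-means layer is just one way to opt in to aggregation):

    ```r
    library(ggplot2)

    # Default behavior: every row of mtcars gets its own point (no aggregation)
    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point()

    # Aggregation is opt-in: overlay binned means on top of the raw points
    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point(alpha = 0.4) +
      stat_summary_bin(fun = mean, bins = 5, geom = "point", colour = "red", size = 3)
    ```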

    • Great points.

      The default aggregation is really irritating! I agree it should work the other way: display the data, then aggregate. This seems at odds with Tableau’s apparent effort to force people to follow the purported “best practice”. Forced on the presentation but not on the analysis! If anything it should be the other way around, but I’d prefer they just provide the tools to do the job and then get the hell out of the way.

      In general, the “best practice” automation that has infiltrated all modern charting tools is also quite irritating. The person who generates the chart is supposed to know the best practices for their individual case. The software should enable people to use them, rather than dictate what they are.

    • Agreed. I even have a paper where we look at how aggregation may impact the sort of conclusions that people without much analysis background arrive at: https://mucollective.northwestern.edu/files/2020-effects-aggregation-choices.pdf. As one might expect, when data are aggregated there’s less sensitivity to changes in the sample size of the dataset, in terms of how many conclusions people report feeling comfortable drawing from the data and how confident they report being in those conclusions. We also saw more observations about the data that simply noted whether an effect/difference was present or absent when the data were aggregated, relative to more statements that mentioned an effect size when they weren’t.

      Maybe Tableau has a good reason for aggregating (I expect it’s the default to support the big datasets of their ‘power users’). If you know how to set up the data connection, it supports large datasets (e.g., millions of rows). Still, one might think that if it can detect when you have dates or geospatial data, it could also detect the number of marks needed to disaggregate, and default to aggregation only when disaggregation would lead to latency or overplotting.

  8. A couple of things about that product really rubbed me the wrong way. The first was the lack of any clear public description of the language model or an explicit grammar. That suggests to me that this is less a developed interface you can count on working a given way and more like Wolfram Alpha. This negative impression was really clinched when they explained that you can just use natural language and say things like “as a bar chart” to affect how the visualization is done.

    That’s a very bad sign, in my experience, for this kind of product. It makes for really great sales presentations, but if you only had relatively simple data-presentation needs you wouldn’t be paying for this product, which means you’ll eventually need to do something a little more complex or out of the norm, and natural-language commands are horrible for that kind of thing. You aren’t ever totally sure what they’ll do if you try to compose them with other operations, and it’s often very difficult to guess the syntax pattern for adding various kinds of modifiers, if they are allowed at all.

    I don’t want to sound harsh here; it seems like there is a lot of great functionality, but in my experience the people who need (and will purchase) a product like this and the ones who need to be able to just type in English and have something simple but reasonable pop out aren’t likely to be the same group. But maybe I’m missing some use case that unifies the two and isn’t already covered by relatively easy-to-use free tools.

    • But based on the good things other people have said I think I’m probably just confused here and shouldn’t let my reaction to some marketing determine how I feel about the tool.

  9. Two other grammars to refer to, if one wants a broader context:
    1. Karl Pearson’s Grammar of Science, https://pure.mpg.de/%E2%80%A6/item_2%E2%80%A6/component/file_2368441/content
    2. David Cox’s Grammar of Research, https://link.springer.com/article/10.1007/s10654-017-0288-1
    Both aim higher than the Grammar of Graphics. Both are mentioned in my slides in
    https://ceeds.unimi.it/wp-content/uploads/2020/02/Kenett_Causality_2020.pdf

    The drawback to Tableau is that it fills the place of an analytic platform, and users start thinking that analytics is all about graphics…

    • Thanks for the link to the Cox paper, which, as expected, is filled with sensible advice. I particularly appreciated the comments on Student’s Biometrika paper on the Lanarkshire milk experiment. It should be much better known; my informal survey among econometricians suggests that it is overlooked.

  10. In one of my courses, I teach the grammar of graphics using ggplot2. I find that students pick it up much more quickly when I’m explicit in naming the parameters of function calls. For example, instead of discussing something like this:

    ggplot(mtcars) + geom_point(aes(mpg, hp, color = am))
    

    I discuss something like this:

    ggplot(
      data = mtcars
    ) +
    geom_point(
      mapping = aes(
        x = mpg,
        y = hp,
        color = am
      )
    )
    

    This allows them to read explicitly what the functions are doing, and they get it much more quickly.
