Any graph should contain the seeds of its own destruction

The title of this post is a line that Jeff Lax liked from our post the other day. It’s something we’ve been talking about for a long time; the earliest reference I can find is here, but I’m sure it had come up before then.

The above histograms illustrate. The upper left plot averages away too much of the detail. The graph with default bin widths, on the upper right, is fine, but I prefer the lower left graph, which has enough detail to reveal the limits of the histogram’s resolution. That’s what I mean by the graph containing the seeds of its own destruction. We don’t need confidence bands or anything else to get a sense of the uncertainty in the bar heights; we see that uncertainty in the visible noise of the graph itself. Finally, the lower right graph goes too far, with so many bins that the underlying pattern is no longer clear.

My preferred graph here is not the smoothest or even the one that most closely approximates the underlying distribution (which in this case is a simple unit normal); rather, I like the graph that shows the data while at the same time giving a visual cue about its uncertainty.

P.S. Here’s the code:

a <- rnorm(1000)
par(mar=c(3,3,1,1), mgp=c(1.5,0.5,0), tck=-.01, mfrow=c(2,2))
hist(a, breaks=seq(-4,4,1), bty="l", main="Not enough bins", xlab="")
hist(a, bty="l", main="Default bins", xlab="")
hist(a, breaks=seq(-4,4,0.25), bty="l", main="Extra bins", xlab="")
hist(a, breaks=seq(-4,4,0.1), bty="l", main="Too many bins", xlab="")

P.S. Yeah, yeah, I agree, it would be better to do it in ggplot2. And, yeah, yeah, it's a hack to hardcode the histogram boundaries at +/-4. I'm just trying to convey the graphical point; go to other blogs for clean code!
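
For what it’s worth, here’s a rough ggplot2 sketch of the same idea (this assumes ggplot2 is installed; binwidth stands in for the hardcoded breaks, so it avoids the +/-4 hack, and the titles are just carried over from the base-R version):

```r
library(ggplot2)

d <- data.frame(x = rnorm(1000))

# helper: one histogram panel with a given bin width
binned <- function(bw, title)
  ggplot(d, aes(x)) +
    geom_histogram(binwidth = bw, boundary = 0) +  # boundary = 0 aligns bin edges at 0
    ggtitle(title)

binned(1,    "Not enough bins")
binned(0.25, "Extra bins")
binned(0.1,  "Too many bins")
```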

26 thoughts on “Any graph should contain the seeds of its own destruction”

  1. I always try to use histograms with the maximal possible number of bins (i.e., pixel-wide bins). I don’t see any problem with it. If it looks too noisy, then I just draw more samples from my MCMC!

    In case I can’t draw more samples, or I need to superimpose one histogram on top of another for comparison, then yes, larger bins are welcome.

  2. If the goal of a histogram is to estimate a density, then a histogram is a kernel fit with a fixed bandwidth. This raises the possibility of a “Bayesian histogram” that puts a prior on the bandwidth, although it would be hard to visualize overlaid “posterior simulations of histograms.”

  3. This is somewhat connected to an idea my colleague Aimee Taylor had for communicating uncertainty in maps. Maps showing the geographical variation of some phenomenon inferred under a statistical model often aim to look as smooth as possible by using high-density pixelation. Deliberately “pixelating”, i.e. blurring, areas of high uncertainty could instead be a useful way of communicating the limits of the model (and the underlying data). This paper presents a simple algorithm for deciding how much to pixelate according to uncertainty:
    https://arxiv.org/pdf/2005.11993.pdf

    • James:

      Interesting. Your colleague’s idea sounds like the opposite of my suggestion! I’m saying to deliberately make the graph at a high resolution to make it noisy to convey uncertainty; you’re going in the opposite direction. But I agree the ideas are related.

      • You’re right, my comment lacked clarity. Your argument re histograms is about how to convey limits in the underlying data in a simple way. The pixelation idea is about conveying uncertainty in the model (this will mostly be due to limited data). I guess this would be like trying to show the uncertainty of a density estimator.

    • “Intrinsic” visualization of uncertainty is the term I have sometimes heard applied to attempts to make the difficulty of perception proportional to the uncertainty affecting the data. It’s an idea that’s been batted around in cartography and visualization for a while. Here’s a paper from a few years ago that proposes something similar to the one you shared, value-suppressing uncertainty palettes for maps: https://idl.cs.washington.edu/files/2018-UncertaintyPalettes-CHI.pdf

      A difficult decision is how much to blur. Ideally it’s not an arbitrary choice. My colleague Matt Kay makes some suggestions about how to parameterize these kinds of plots here: https://osf.io/6xcnw

      • Thank you, Jessica. I really like the paper by Correll, Moritz, and Heer. I haven’t yet read your colleague’s preprint; thank you for sharing it.

        Rather than approaching the problem by hampering perception, we were aiming to provide precision only in areas where it is merited. But I agree that these are two sides of the same coin.

        The how-much-to-blur decision is indeed difficult. I’m eager to read your colleague’s preprint.

  4. Jessica: thanks for highlighting the bivariate color plot idea. It’s a nice idea, but I really don’t find it very intuitive (at least not when I look at those plots). We do actually reference one of the bivariate map ideas by Lucchesi et al. It’s important to note that this idea applies to **choropleth** maps, not **isopleth** maps (the subject of our paper). Isopleths represent spatial continua inferred under a geospatial model. I don’t know of any similar ideas for conveying uncertainty in isopleths.

    • I agree the VSUPs can be hard to interpret. Showing uncertainty in maps is hard in general because you’re often already restricted to using color for the thematic variable which is hard to read anyway. My sense is that VSUPs are most appropriate when you have many spatial units and you know people are going to want to try to make comparisons across them. Trying to make sense of bivariate color scales is too much work if you only have a few spatial units. The same approach could be applied at pixel level though.

      This old Efron and Diaconis paper on bootstrapping used resampling to show uncertainty in kriged rainfall maps: https://www.vanderbilt.edu/psychological_sciences/graduate/programs/quantitative-methods/quantitative-content/diaconis_efron_1983.pdf
      I’ve done a lot of work on animated uncertainty visualization and at one point created animated bootstrap maps like this of radon data (which I got from Andrew and Phil, actually!), as well as various types of animated maps of election data. Cartographers have also played with animation some; see, e.g., work from the 90s/early 2000s by Bastin, Fisher, and Ehlschlaeger if you and Aimee are looking for inspiration.

  5. I think the takeaway to be emphasized here is how important it is to do due diligence (maybe not the right term) with your data. You can’t have a “preferred graph” if you don’t. It is very tempting to bang out hist(a, bty="l", main="My Data"), see the distribution, trust that R’s defaults are good enough, and move on.

  6. Sometimes you want a ton of bins! I was looking at a variable which consisted of cumulative scores from many items on a questionnaire. Timepoints 1 and 4 included data from scores from the full questionnaire. Timepoints 2 and 3 came from a ‘brief’ version that supposedly measured the same thing but with less granularity. The two scores had been put on the same scale by dividing by their respective totals (put into proportions) and then multiplying by the full measure total. When I looked at the histogram with a reasonable number of bins, it looked fine. When I put on a ton of bins, suddenly I could see the discreteness of the brief questionnaire that had been scaled up to the full score scale, and I could easily tell that the brief was very right shifted! They didn’t measure the same thing at all! It was two different distributions. Fewer bins hid this information.
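
The situation this commenter describes can be sketched in R. The specific numbers below (a 0–10 brief score and a 0–100 full score) are invented for illustration; the point is only that coarse bins hide the rescaled discreteness and fine bins reveal it:

```r
set.seed(1)
full  <- rnorm(500, mean = 60, sd = 10)      # full questionnaire: many distinct scores
brief <- rbinom(500, size = 10, prob = 0.7)  # brief version: only 11 possible values
brief_rescaled <- brief / 10 * 100           # put on the full-measure scale, as described

par(mfrow = c(2, 2))
hist(full,           breaks = 20,  main = "Full, coarse bins")
hist(brief_rescaled, breaks = 20,  main = "Brief, coarse bins")  # looks innocuous
hist(full,           breaks = 200, main = "Full, fine bins")
hist(brief_rescaled, breaks = 200, main = "Brief, fine bins")    # spikes expose the discreteness
```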

  7. > but I prefer the lower left graph, which has enough detail to reveal the limits of the histogram’s resolution

    This assumes a smooth unimodal distribution, and knowing the limits of this histogram’s resolution depends on this assumption in a kind of circular logic. The lower left looks ideal only once we’ve seen the lower right.

  8. Andrew says he is interested in histograms of data, not histograms of simulations, but uses simulated data. Why? Histograms are good for real data: they are useful for identifying gaps, heaping, modes, and other features. As there is no optimal histogram, draw several, and, above all, use bin boundaries that make sense for the data. Chapter 3 in my book gives examples, and the plots and code are on the book’s website (http://www.gradaanwr.net/content/03-examining-continuous-variables/). The final exercise in the chapter (not on the web) is possibly the most useful for thinking about how to use context in studying a graphic.

    • Antony:

      Real data are fine too. I used simulated data because that works well for examples. Users can play with the simulation to get different results. The point here is that the simulated data represent real data with a fixed sample size. They don’t represent a set of simulation draws that could easily be expanded.

  9. This whole question arises from the need to choose a bin width for a histogram. I therefore much prefer CDFs, which don’t require such a choice. People sometimes find them hard to interpret because the magnitude of the plotted variable in a CDF is on the horizontal axis rather than the vertical one. In papers from the mid-20th century, researchers sometimes simply plotted CDFs rotated by 90 degrees, so that “higher” corresponds to “more.” I think this is a great way of plotting data. Why did it fall out of favor?
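
The rotated-CDF idea is easy to sketch in base R. This is just a minimal illustration of the axis swap, not a reproduction of the mid-century plotting style:

```r
a <- rnorm(1000)
s <- sort(a)
p <- seq_along(s) / length(s)   # empirical CDF heights, no bin width needed

par(mfrow = c(1, 2))
plot(s, p, type = "s", xlab = "x", ylab = "F(x)", main = "ECDF")
plot(p, s, type = "s", xlab = "F(x)", ylab = "x",
     main = "Rotated ECDF")     # axes swapped, so "higher" means "more"
```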

  10. This is why I don’t like violin and kernel density plots: compared with histograms, it’s much harder to display both “model” and “residual”.

    Something I wish were easier in e.g. ggplot2 is vertically-oriented, center-aligned histograms (the binned analogue of a violin plot).
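
One base-R way to approximate what this commenter wants, a binned analogue of a violin plot, is to compute the histogram without plotting and draw each bar as a rectangle mirrored around a vertical axis. This is a rough sketch, not a ggplot2 solution:

```r
a <- rnorm(1000)
h <- hist(a, breaks = 30, plot = FALSE)  # counts and break points, no plot

plot(NULL, xlim = max(h$counts) * c(-1, 1), ylim = range(h$breaks),
     xlab = "count (mirrored)", ylab = "value")
rect(-h$counts, head(h$breaks, -1),      # left edges: mirrored counts
      h$counts, tail(h$breaks, -1))      # right edges: counts
```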
