What better way to start the new year than with a discussion of statistical graphics.
Mikhail Shubin has this great post from a few years ago on Bayesian visualization. He lists the following principles:
Principle 1: Uncertainty should be visualized
Principle 2: Visualization of variability ≠ Visualization of uncertainty
Principle 3: Equal probability = Equal ink
Principle 4: Do not overemphasize the point estimate
Principle 5: Certain estimates should be emphasized over uncertain
And this caution:
These principles (as any visualization principles) are contextual, and should be used (or not used) with the goals of this visualization in mind.
And this is not just empty talk. Shubin demonstrates all these points with clear graphs.
Interesting how this complements our methods for visualization in Bayesian workflow.
I’m puzzled by the statement “A boxplot is a perfect tool for showing variability in the data, but it should not be used for visualizing the posterior distribution.” Doesn’t the provided example show that a boxplot is bad for *both*? What makes it “perfect” for showing variability (if, for example, the data being shown had something like a gamma distribution)?
My argument was like this:
Imagine you measure some value several times with a noisy measurement tool. The measurements look like samples from a normal (or gamma) distribution. It makes sense to believe that the measurements in the middle of the distribution are closer to the true value, while the measurements at the margins are errors. Therefore, when visualizing these measurements, it makes sense to emphasize the interval in the middle of the distribution.
Now imagine you used MCMC to estimate the same value. When plotting MCMC samples, there is no reason to emphasize one sample over another, as they all carry equal weight by construction!
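A quick way to see this point (my own sketch in plain Python, not code from the post): because posterior draws all carry equal weight, interval summaries are just sample quantiles of the draws, with no draw emphasized over another.

```python
import random
import statistics

random.seed(42)
# Hypothetical posterior draws, standing in for MCMC output
draws = [random.gauss(0.0, 1.0) for _ in range(10000)]

# Each draw counts equally: a 50% central interval is simply the
# 25% and 75% sample quantiles of the draws.
q1, q2, q3 = statistics.quantiles(draws, n=4, method='inclusive')
print(f"50% central interval: [{q1:.2f}, {q3:.2f}], median {q2:.2f}")
```

For a standard normal posterior the interval endpoints come out near ±0.67, but nothing in the computation privileges the central draws; the quantiles fall where they do only because more equally-weighted draws land in the middle.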
Would a boxplot be appropriate for visualizing variability in a population? Well, I don’t know. I guess it depends on whether you think the average values are more important than the marginal ones.
So yeah, I made a mistake by calling the boxplot “a perfect tool for showing variability in the data”. I’d better write “boxplots may be appropriate sometimes”.
Just to add to this discussion: I hate boxplots!
I hate them too!
(OK, they worked well for Tukey when he was trying to do statistical analyses by hand on plane trips – but that need has passed.)
Justin Matejka and George Fitzmaurice made a wonderful illustration of the variability in boxplots, which is a great educational piece.
https://www.autodeskresearch.com/publications/samestats
My problem with boxplots is not that they hide details (all plots based on summary statistics do this), but that they misrepresent the data by pretending the probability distribution is much narrower than it is.
https://www.autodeskresearch.com/publications/samestats is a wonderful article explaining why visualizing the full distributions is important for exploratory research. But I think there is a difference between exploratory and communicative visualization.
(see here https://ctg2pi.wordpress.com/2014/09/18/single-axiom-of-visualization/)
Imagine I visualized my posterior distributions and found no dinosaurs. In fact, the vast majority of the posterior distributions I have ever obtained look like boring Gaussians. When communicating my results to someone else, could I go straight to the point and show only my summary statistics? Yeah, I think I could. But even in this case a boxplot would be misleading.
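The samestats point is easy to reproduce on a toy scale (my own made-up numbers, not data from the article): two datasets can share an identical five-number summary, and hence an identical boxplot, while having very different shapes.

```python
import statistics

# Two hypothetical 9-point datasets: one with its bulk near the
# median, one bimodal with a hollow middle.
unimodal = [0, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 6]
bimodal  = [0, 0.1, 2, 2.1, 3, 3.9, 4, 5.9, 6]

def five_number(d):
    """Min, quartiles, max: everything a boxplot shows."""
    return [min(d)] + statistics.quantiles(d, n=4, method='inclusive') + [max(d)]

print(five_number(unimodal))  # [0, 2.0, 3.0, 4.0, 6]
print(five_number(bimodal))   # [0, 2.0, 3.0, 4.0, 6] -- same boxplot

def near_median(d):
    """Count points in the middle of the range."""
    return sum(1 for x in d if 2.2 < x < 3.8)

print(near_median(unimodal), near_median(bimodal))  # 3 vs 1
```

The boxplots are pixel-for-pixel identical, yet the second dataset has almost no mass where the box suggests it is concentrated.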
Wow, nice to see someone still reads it now… or at least someone was reading it half a year ago =)
One point of criticism of my own post from 2015: Principle 3, “Equal probability = Equal ink”, does not work well for visualizing heavy-tailed distributions. Sometimes you have relevant probability mass spread across a very large interval, so if you follow the principle and spread the ink correspondingly, the ink becomes invisible.
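To put a rough number on this (my own illustration, using the standard Cauchy as a textbook heavy-tailed distribution): the 95% central interval can be an order of magnitude wider than the interquartile range, so ink spread proportionally to probability gets very faint outside the box. For a Gaussian the same ratio is only about 2.9.

```python
import math

def cauchy_quantile(p):
    # Quantile function of the standard Cauchy distribution
    return math.tan(math.pi * (p - 0.5))

iqr = cauchy_quantile(0.75) - cauchy_quantile(0.25)            # 2.0
central_95 = cauchy_quantile(0.975) - cauchy_quantile(0.025)   # ~25.4

# The 95% interval is ~13x wider than the IQR, so the 45% of the
# mass between the box and the 95% limits gets smeared very thin.
print(central_95 / iqr)
```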
yeah, maybe I should revive the blog…
Good post. I actually like Principle 3 as is. I think when values are spread out over a wide range, we are more uncertain about the values, so it makes sense that the ink is less visible. For that reason, I prefer panel B to panel C under your point 5.
Under point 5, is the y-axis on a probability scale? Why does the histogram at the very bottom shrink drastically from panel A to panel B?
Even wide uncertain distributions have some summary statistics (mean, credible interval) which could be important and relevant for the reader, so it would be useful to make them visible in the figure.
As for the bottom histogram… eh, I don’t remember.
Expanded answer: it all depends on what message you want to send with your figure.
If it is “Some estimates are very uncertain”, then panel B would work best.
If it is “Here are our estimates, but some are very uncertain”, then panel C would work best.
If it is “Here are our estimates, and uncertainty does not matter” (maybe because the variances of the different estimates are not directly comparable), then panel A would work best.
If it is “Here are our mean estimates”, then maybe you should use a table.
If you are doing exploratory data visualization and have no message to send… in that case, I would prefer panel C as a default choice.
An alternative to the standard boxplot is the Tufte boxplot, which I think is a nice visual even if it does not satisfy “Equal probability = Equal ink”.
In the linked blog post by Mikhail Shubin, figure 4B is very similar to a violin plot. These plots are easy to make in R.
In the linked blog post by Mikhail Shubin, figure 5 A to C are ridgeline plots (formerly joyplots). These are also easy to make in R if you like that sort of thing.