Visually weighting regression displays

Solomon Hsiang writes:

One of my colleagues suggested that I send you this very short note that I wrote on a new approach for displaying regression result uncertainty (attached). It’s very simple, and I’ve found it effective in one of my papers where I actually use it, but if you have a chance to glance over it and have any ideas for how to sell the approach or make it better, I’d be very interested to hear them. (Also, if you’ve seen that someone else has already made this point, I’d appreciate knowing that too.)

Here’s an example:

Hsiang writes:

In Panel A, our eyes are drawn outward, away from the center of the display and toward the swirling confidence intervals at the edges. But in Panel B, our eyes are attracted to the middle of the regression line, where the high contrast between the line and the background is sharp and visually heavy. By using visual-weighting, we focus our readers’ attention on those portions of the regression that contain the most information and where the findings are strongest. Furthermore, when we attempt to look at the edges of Panel B, we actually feel a little uncertain, as if we are trying to make out the shape of something through a fog. This is a good thing, because everyone knows that feeling, even if we have no statistical training (or ignore that training when it’s inconvenient). By aligning the feeling of uncertainty with actual statistical uncertainty, we can more intuitively and more effectively communicate uncertainty in our results to a broader set of viewers.

I like that. But, once you’re making those edges blurry, couldn’t you also spread them out, to get the best of both worlds, the uncertainty bounds and the visual weighting?

Think about it this way. Suppose that, instead of displaying the fitted curve and error bounds, you make a spaghetti-style plot showing, say, 1000 draws of the regression curve from the uncertainty distribution. Usually when we do this we just let the lines overwrite, but suppose that instead we make each of the 1000 lines really light gray but then increase the darkness when two or more lines overlap. Then you’ll get a graph where the curve is automatically darker where the uncertainty distribution is more concentrated and lighter where the distribution is more vague.
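
Here is a minimal sketch of that overlap idea in Python, using only the standard library. Everything numeric in it is an assumption made up for illustration: a straight-line fit with intercept estimate 0 (s.e. 0.2) and slope estimate 1 (s.e. 0.3), draws taken independently, and a coarse pixel grid. It counts how many of the drawn lines pass through each pixel and converts the count k to darkness as 1 - (1 - alpha)^k, which is what standard alpha compositing of k translucent strokes on a white background gives.

```python
import random

def spaghetti_darkness(n_draws=1000, alpha=0.01, n_cols=21, n_rows=41,
                       y_min=-4.0, y_max=4.0, seed=0):
    """Return grid[row][col] of darkness in [0, 1] for overlapping light lines.

    All model numbers below are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    # x is centered at 0, so uncertainty in the line grows toward the edges.
    xs = [-2.0 + 4.0 * j / (n_cols - 1) for j in range(n_cols)]
    counts = [[0] * n_cols for _ in range(n_rows)]
    for _ in range(n_draws):
        a = rng.gauss(0.0, 0.2)  # one draw of the intercept
        b = rng.gauss(1.0, 0.3)  # one draw of the slope
        for j, x in enumerate(xs):
            y = a + b * x
            if y_min <= y < y_max:
                # Pixel row that this line crosses in column j.
                i = int((y - y_min) / (y_max - y_min) * n_rows)
                counts[min(i, n_rows - 1)][j] += 1
    # k overlapping strokes of opacity alpha leave (1 - alpha)^k of the
    # white background showing through.
    return [[1.0 - (1.0 - alpha) ** k for k in row] for row in counts]

grid = spaghetti_darkness()
```

The resulting grid can be handed to any image-plotting routine (for example matplotlib's `imshow` with a gray colormap). The same formula shows why the per-line opacity has to shrink as the number of draws grows: with too large an alpha, every pixel that a few lines touch saturates to black.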

Now take this a step further. You don’t actually need to draw the 1000 lines; instead, you can do it analytically and just plot the color intensities in proportion to the distributions. The result will look something like Hsiang’s visually-weighted regression but more spread out where the curve is more uncertain.
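
A rough sketch of the analytic version, again in standard-library Python and again with made-up numbers: it assumes a straight-line fit whose intercept and slope estimates are independent and normal (hypothetical values 0 ± 0.2 and 1 ± 0.3, with x centered at 0), so the fitted value at each x is normal with standard deviation sqrt(se_a^2 + x^2 se_b^2). The darkness at each (x, y) is set proportional to that normal density, rescaled so the darkest pixel is 1. This is one way to read "in proportion to the distributions," not the only one.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def analytic_darkness(xs, ys, a_hat=0.0, b_hat=1.0, se_a=0.2, se_b=0.3):
    """Return grid[row][col] of darkness for y = ys[row], x = xs[col].

    a_hat, b_hat, se_a, se_b are hypothetical estimates and standard
    errors; intercept and slope are assumed independent.
    """
    grid = []
    for y in ys:
        row = []
        for x in xs:
            mean = a_hat + b_hat * x                      # fitted curve at x
            sd = math.sqrt(se_a ** 2 + (x * se_b) ** 2)   # its standard error
            row.append(normal_pdf(y, mean, sd))
        grid.append(row)
    peak = max(max(row) for row in grid)
    # Rescale so the most concentrated point is fully dark.
    return [[v / peak for v in row] for row in grid]

xs = [-2.0, 0.0, 2.0]
ys = [i / 10 for i in range(-40, 41)]
shade = analytic_darkness(xs, ys)
```

Because the density's peak height scales as 1/sd, the curve comes out darkest where the standard error is smallest and both lighter and wider where it is larger, with no dependence on how many lines one would otherwise have drawn.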

10 Comments

  1. m says:

    ggplot2 lets you set an alpha level. Overplotting lines with some translucence (alpha < 1) will give you darker regions. Is this what you're thinking of?

  2. Wayne says:

    I’ve been doing work with ensembles generated by adding white noise to an original series and then processing it, repeatedly. Your suggested method works well: just plot each ensemble member with transparency of about 70% (an alpha of 0.3). The graph is nice and dark in the center, with some softness at the edges. The only downside is that the optimal transparency level depends on how many lines you’re going to plot and how close together they are.

  3. Andrew, I love your idea in theory. My concern is that readers might be better able to perceive the spread near the tails than the increase in density near the middle. I think Hsiang gets it right saying that we want to “focus our readers’ attention on those portions of the regression that contain the most information and where the findings are strongest.” If readers look at the plot you describe and spend time with their eyes pointing at the ends then it’s not doing what we want. It could go either way. Sounds like an empirical question ripe for a lab experiment using eye tracking.

  4. Andrew says:

    Stephen:

    I agree. One could also do various intermediate solutions such as lighter shadings for strands that are further away from the central line, thus reducing the visual effect of the outliers.

    But the real point of my suggestion was not to imply that my graph would be better than Hsiang’s but rather to draw a mathematical connection between his idea and an alpha-blended spaghetti plot.

  5. Carlos says:

    There’s some experimental evidence that using blurring and opacity to represent many levels of uncertainty is not a good idea. I think it’s here:

    http://kosara.net/papers/RobertKosaraPhD.pdf

    The theory that seems best supported by current experiments is that you can encode that something is there or not with opacity and blurring (a single bit). But our eyes are not very good at telling different opacity levels when other visual information is present on the plot.

  6. Ian Finlayson says:

    Very interesting to see. I had been doing something similar recently using model fits, with the intensity of the colour of each data point corresponding to the density (with a constant applied for aesthetic reasons) of the underlying data. Found it a useful and visually-pleasing way to show that when my linear effects appeared to be curving, the data behind the curved sections had very low density.

    Moved away from that approach when I wanted to start plotting estimated lines and confidence intervals, but some nice inspiration for marrying the two approaches.

  7. Andrew – your first suggestion of blurring and spreading out was very similar to what I was trying to do in this paper:

    http://amstat.tandfonline.com/doi/abs/10.1198/000313008X370843
    or
    http://www.mrc-bsu.cam.ac.uk/personal/chris/papers/denstrip.pdf
    if that’s inaccessible.

    The original idea was to replace error bars with shaded density strips, but I realised that could be generalised to regions with smoothly-varying uncertainty bounds. It’s all in the R package “denstrip”.

  8. John Mashey says:

    See IPCC AR4 WG I, from 2007.

    Various researchers have “reconstructed” Northern Hemisphere temperature histories from tree rings, ice cores, corals, sediments, etc. (‘proxies’).
    Each reconstruction produces a line and usually uncertainty ranges, which tend to get larger further back in time.
    [The most famous of these is the 1999 “hockey stick” from Mann, Bradley and Hughes, but there are plenty more.]

    The 2nd graph is a “spaghetti chart” of reconstructions of Northern Hemisphere temperatures.
    They naturally differ because they use different combinations of proxies, different mathematical techniques, and sometimes claim to represent different parts of the NH (from 0°N, or 23.5°N, or 30°N, and it makes a difference).

    I’ve always disliked the spaghetti graph because it draws the eyes to the edges of the envelope, and likely overemphasizes the more jiggly curves, and of course loses uncertainty ranges (and discussions often found in the original papers). It obscures the fact that sometimes a lot of the reconstructions agree.

    I prefer the 3rd graph, which shows “the likely reality is in the middle somewhere, probably where it is darkest, when you combine these.”

    It eliminates the total unrealism of a hard, sharp line against a less strong background.

  9. […] the moment, just a quick link to an interesting idea I saw on Andrew Gelman’s blog.  The idea of de-emphasizing a region of high variability, instead of drawing humongous areas […]