Antony Unwin saw this scatterplot (see here for background):
and had some comments and suggestions. I’ll show his plots below, but first I want to talk about jittering. I wonder if the main problem with my original graph above is that it is too small.
In any case, the jittering makes it look weird, but I wonder whether it would be better if it were jittered a bit more, so that the clusters of points blurred into each other completely. (Since the data are integers, we could just jitter by adding random U(-.5,.5) numbers to each point in the x and y directions.)
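For what it's worth, here is a minimal R sketch of the U(-.5,.5) jitter just described. The data are simulated stand-ins (the names econ and soc and the -8 to 8 range are assumptions, not the survey data), and jitter_unif is a helper made up for this illustration.

```r
# Hypothetical stand-in data: two integer-valued scores (names and ranges are assumptions).
set.seed(1)
n <- 500
econ <- sample(-8:8, n, replace = TRUE)
soc  <- sample(-8:8, n, replace = TRUE)

# Add independent U(-w, w) noise to each value; with w = 0.5 the
# jittered points can land anywhere in the unit cell around each integer.
jitter_unif <- function(v, w = 0.5) v + runif(length(v), -w, w)

econ_j <- jitter_unif(econ)
soc_j  <- jitter_unif(soc)

plot(econ_j, soc_j, pch = 16, cex = 0.4,
     xlab = "economic score (jittered)", ylab = "social score (jittered)")
```

Base R's jitter(x, amount = 0.5) should do the same thing; the helper is only there to make the uniform noise explicit.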
But Antony says:
Jittering makes strong assumptions, which are rarely mentioned. It is bad for small cell sizes (you can get odd patterns unless you specifically adjust your jittering to account for cell size and how many do that?) and is bad for large cell sizes (because of overplotting and because you get solid blocks which can hardly be distinguished from one another). In fairness I should declare myself as an anti-jittering fundamentalist and say that there are hardly any circumstances when I think jittering is useful. Jittering is a legacy from the days when you could only plot points. Area plots should always be the first choice.
Maybe he’s right. A gray-scale plot using image() might be a better way to go in a situation like this one with many hundreds of data points.
The Unwin solution
OK, now here are Antony’s plots:
and the following description:
The attached fluctuation diagram was drawn in iPlots (which expects mosaicplot variables to be factors):
    soc.scoreS1 <- -soc.score.S
    econ <- as.factor(econ.score.S)
    soc <- as.factor(soc.scoreS1)
    imosaic(econ, soc, type="fluctuation")
I think this plot shows the bivariate distribution of the data much better. It is clear where the bulk of the data lie and differences between cells are much more apparent. Best of all, you can link to other displays. I have included the same plot twice more, once with the Democrats highlighted, once with the Republicans highlighted.
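For readers without iPlots, a rough base-R approximation of a fluctuation diagram might look like the sketch below. This is not iPlots' implementation: it assumes the usual convention that each cell gets a square whose area is proportional to its count, and it uses simulated 5-point scores (the names and the sample are assumptions).

```r
# Simulated stand-in data: two 5-level integer scores (an assumption, not the survey data).
set.seed(2)
econ <- sample(1:5, 300, replace = TRUE, prob = c(1, 2, 4, 2, 1))
soc  <- sample(1:5, 300, replace = TRUE, prob = c(1, 3, 3, 2, 1))
counts <- table(econ = factor(econ, levels = 1:5),
                soc  = factor(soc,  levels = 1:5))

# Side length is sqrt(count / max count), so AREA is proportional to the count
# and the fullest cell gets the largest square.
side <- sqrt(counts / max(counts))

# One row per (i, j) cell, with the half-width of its square (0.45 leaves a small gap).
cells <- expand.grid(i = seq_len(nrow(counts)), j = seq_len(ncol(counts)))
half  <- 0.45 * side[cbind(cells$i, cells$j)]

plot(NA, xlim = c(0.5, nrow(counts) + 0.5), ylim = c(0.5, ncol(counts) + 0.5),
     xlab = "econ", ylab = "soc", asp = 1)
rect(cells$i - half, cells$j - half, cells$i + half, cells$j + half,
     col = "gray40", border = NA)
```

Empty cells get a zero-size square, so they simply do not show up.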
The parties were selected from a barchart of party affiliation:
    ibar(pid)
He then makes a plot showing Dems, Reps, and Independents in the same grid:
You just include another level in the fluctuation diagram:
I [Antony] think it looks better with the cells shaded, which can be achieved with
or colour them by party with
It might be better to leave out the Independents, as there are not so many, though the ordering looks right, with their plot lying between those for the Democrats and the Republicans.
Here are the pics:
(Sorry about all the white space. I converted the plots from pdf to png to display on the blog.)
(Fixed 1/24/2008 10:42)
These new (to me, anyway) plots look good.
But in your original, it seems like there are no single point outliers, so, what if you just made each dot = several observations?
Another solution for the problem of over-plotting is adjust the transparency for each point to create an elevation effect. See e.g. the alpha-channel function in the Mondrian software
There are times, however, when jittering works and fluctuation diagrams won't – when the discreteness is truly two-dimensional, not just the product of two discrete 1d variables. It doesn't happen very often though!
The y-axis on the fluctuation diagrams is a little odd – normally we expect small values on the bottom, not on the top.
They seem to have fallen out of favor lately, but…
Sunflower plots, anyone?
When the cells are more or less discrete (as in your example), Antony's approach is superior: it allows reading off the frequency in a cell.
But when the variables are continuous, jittering does help identify the areas that might have high overlap. Yes, we adjust the point size so that no area is blacked out, but artifacts in the data won't filter that out. Better solutions such as the interactive fisheye projection and alpha channel transparency are described in this thesis (although the concepts have been known before that).
I like the first two of Antony's plots, although they would be better if the thermometers were filled with blue and red (for Dems and Pubs) rather than white and red.
I like jittering in some circumstances. It is bad when there are lots of points per "cell" (as in Andrew's original plot), in part because even post-jittering you can't see what you have. But when there is a modest amount of overlap it can be pretty useful. I think it's most appropriate when the data are "artificially" overlapping (e.g. due to rounding, or due to forcing people to choose a category). I think jittering is useful in the following example, but would not be useful if there were 4 times as many points:
Jittering has a place as a 2 second hack during diagnostic phase and can actually be better than some other representations. For instance, I was once doing MCMC sampling on a Dirichlet distribution and got some very strange (looking) results when testing. Adding a little jitter made it clear that what was confusing me was that the Metropolis algorithm had a very low acceptance rate near the boundaries and thus many of my points were actually multiple samples. Adding the jittering was incredibly easy to do. A gray scale plot is nice, but when exploring it may not be as easy as just adding a bit of noise to make sure things look the way you expect.
Of course, the deeper problem was that one should switch to the soft-max basis for Dirichlet variables, but it was the jittered plot that made that fact so abundantly clear to me.
I second the transparency recommendation. If you use col=rgb(r,g,b,alpha) the results are very nice. You are just limited to Quartz, windows, or pdfs as output unless you install the Cairo package.
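A small sketch of the rgb() transparency idea, writing to a pdf since that device handles alpha. The data are simulated, and the alpha value of 0.2 and the point size are arbitrary choices for illustration.

```r
# Simulated integer data with heavy overplotting (an assumption, not the survey data).
set.seed(3)
n <- 2000
x <- sample(-8:8, n, replace = TRUE)
y <- sample(-8:8, n, replace = TRUE)

# rgb() takes red, green, blue and alpha on a 0-1 scale by default;
# alpha = 0.2 means roughly five stacked points are needed to reach near-black.
pt_col <- rgb(0, 0, 0, alpha = 0.2)

out <- file.path(tempdir(), "alpha-scatter.pdf")
pdf(out)                       # the pdf device supports semi-transparent colours
plot(x, y, pch = 16, cex = 2, col = pt_col,
     xlab = "x", ylab = "y")
dev.off()
```

Darker spots then correspond to cells with more observations, though shades are harder to compare than sizes.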
Antony's points on small and large data sets make sense, but I do think jittering has a place where the designer thinks it is important to show every response. In many applications, the point is to show patterns, not individual responses, so I too usually use contour or image plots.
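As a rough illustration of the image-plot option (the same image() idea raised in the post), here is a base-R sketch that tabulates integer scores and shades each cell by its count. The data are simulated and the 32-step gray ramp is an arbitrary choice.

```r
# Simulated integer scores clipped to -8..8 (an assumption, not the survey data).
set.seed(4)
lev <- -8:8
x <- pmin(pmax(round(rnorm(1000, 0, 3)), -8), 8)
y <- pmin(pmax(round(rnorm(1000, 0, 3)), -8), 8)

# Cross-tabulate, keeping empty levels so the grid is complete.
counts <- table(factor(x, levels = lev), factor(y, levels = lev))
z <- matrix(as.vector(counts), nrow = nrow(counts))

# White for empty cells, black for the fullest cell.
image(lev, lev, z, col = gray(seq(1, 0, length.out = 32)),
      xlab = "x score", ylab = "y score")
```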
Plotting all data (which includes jittering) is very undesirable in large data sets. Try printing these graphs! Or even opening them up in certain software.
For Andrew's purpose, I believe jittering is actually fine; what would really help is to put univariate histograms / pdfs along each axis. This really helps readers visualize the point that most of America is right of Kerry on economic issues, etc.
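One way to sketch the marginal-histogram suggestion in base R is with layout(): the jittered scatterplot goes in the main panel and bar charts of the two marginal counts go along the top and right. The data are simulated stand-ins and the panel proportions are arbitrary.

```r
# Simulated integer scores clipped to -8..8 (an assumption, not the survey data).
set.seed(5)
n <- 800
lev <- -8:8
x <- pmin(pmax(round(rnorm(n,  1, 3)), -8), 8)
y <- pmin(pmax(round(rnorm(n, -1, 3)), -8), 8)
mx <- table(factor(x, levels = lev))   # marginal counts for x
my <- table(factor(y, levels = lev))   # marginal counts for y

# Panel 1 = scatterplot (bottom left), 2 = x marginal (top), 3 = y marginal (right).
layout(matrix(c(2, 0, 1, 3), nrow = 2, byrow = TRUE),
       widths = c(4, 1), heights = c(1, 4))

par(mar = c(4, 4, 0.5, 0.5))
plot(x + runif(n, -0.5, 0.5), y + runif(n, -0.5, 0.5), pch = 16, cex = 0.4,
     xlim = c(-8.5, 8.5), ylim = c(-8.5, 8.5), xlab = "x score", ylab = "y score")

par(mar = c(0, 4, 0.5, 0.5))
barplot(as.vector(mx), space = 0, axes = FALSE, axisnames = FALSE)

par(mar = c(4, 0, 0.5, 0.5))
barplot(as.vector(my), space = 0, horiz = TRUE, axes = FALSE, axisnames = FALSE)

# Reset the layout and margins for later plots.
layout(1)
par(mar = c(5, 4, 4, 2) + 0.1)
```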
Jittering isn't ideal, as it is at best an indirect way to show (part of) what you want to be shown, the frequency of each pair of values. And plotting small areas instead is more faithful to the data as they arrive, especially the discreteness of each variable.
But beyond that the choice, in this example, is between six and half a dozen. The presumption seems to be that voters may be placed on continuous scales and we're interested in the marginal and joint distributions. That being so, I find the original plot more congenial, especially as the area plot does not, in the event, seem to reveal any interesting fine structure that would be poorly shown or not shown at all with jittering.
I'd suggest jittering by adding U(-.45, .45) which would leave a fine grid-like reminder of the discreteness, like a chessboard.
Incidentally, "fluctuation diagram" is not an especially transparent term. Who would guess from the term that it means a plot like that shown?
Perhaps this topic got so many interesting comments because of Andrew's provocative heading! It's a pity the graphics got cut off. The last three sets of pairs (Dems and Ind) should be three sets of three, with the Republicans to the right.
Several people have mentioned transparency/alpha-blending and even Martin Theus's Mondrian (get your free copy at stats.math.uni-augsburg.de/software/), which pleased me, but transparency is not so effective for this kind of data. Comparing shades of grey is not nearly as easy as comparing sizes (Cleveland & McGill, yet again).
I agree with Hadley on genuine 2-d data, though I would also try bubble plots with alpha-blending as an alternative. As for the scales, the plots were drawn in a hurry without prettification, which is also my excuse for using default colours and not the appropriate party colours. Apologies.
Anyone mentioning sunflowerplots for a dataset of this size has suffered from too much sun. (Actually, I have my doubts about sunflower plots for any size of dataset.)
Alexs: If you have interactivity, you certainly don't need jittering. Next time try Mondrian or iPlots.
Phil: Jittering to counteract rounding makes for more natural looking scatterplots, but may mislead. Shouldn't we distinguish between plots of raw data and plots of models of the data? And if you are going to model the data, it makes sense to think of a better model than jittering. I'm all for both kinds of plots, but I think that you should always have plots of the data as given.
Kaiser: Agreed, the one-d displays would help a lot.
Nick: You're right that there is not any interesting fine structure here, but at least with the fluctuation diagrams you have a chance of seeing if there is any. There is little chance with Andrew's original style of plot. Fluctuation diagrams are used primarily for comparing equivalent groupings (eg sales by regions this year and last year or different clusterings of the same data). Hence the name. If you have a better suggestion…
Antony: I don't have a better idea for a term for "fluctuation diagram" because I am not clear that one is needed, or what the precise definition of such a diagram is in any case.
We don't need distinct terms for every minutely different kind of diagram. I know that everyone with an original idea wants to flag that idea, and good luck to them, but otherwise the proliferation of terminology is often just a nuisance. Yet it's also true that a catchy name can help an idea spread. Box plots caught on where the same idea as dispersion diagrams or under other names attracted only localised attention in work long before Tukey wrote about them.
Another example: as you know from elsewhere, I have also found the literature on spine plots pretty unclear and even contradictory on precisely what a spine plot is.
But the term spine plot is catchy.
+1 for Ted's note that, 'Jittering has a place as a 2 second hack during diagnostic phase.'
If you're doing anything even slightly deeper, I never have seen the point. Just because it's elementary in R doesn't necessarily make it worth doing. Difficulties with output are your problem, pace Cleveland & McGill.
(Parenthetically, didn't those two worthies like the here-maligned sunflower plots themselves, once?)
For applications where there is high data density, check out hexbin plots in R.
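A minimal sketch of a hexbin plot, assuming the hexbin package is installed (it is not part of base R, hence the guard). The data are simulated continuous values and xbins = 30 is an arbitrary choice.

```r
# Simulated dense continuous data (an assumption for illustration only).
set.seed(6)
n <- 5000
x <- rnorm(n)
y <- x + rnorm(n)

# Only run if the hexbin package is available.
if (requireNamespace("hexbin", quietly = TRUE)) {
  suppressPackageStartupMessages(library(hexbin))
  hb <- hexbin(x, y, xbins = 30)   # count points falling in each hexagonal cell
  plot(hb)                         # shade hexagons by count, with a legend
}
```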