A tale of two discussion papers

Over the years I’ve written a dozen or so journal articles that have appeared with discussions, and I’ve participated in many published discussions of others’ articles as well. I get a lot out of these article-discussion-rejoinder packages, in all three of my roles as reader, writer, and discussant.

Part 1: The story of an unsuccessful discussion

The first time I had a discussion article was the result of an unfortunate circumstance. I had a research idea that resulted in an article with Don Rubin on monitoring the mixing of Markov chain simulations. I knew the idea was great, but back then we worked pretty slowly, so it was a while before we had a final version to submit to a journal. (In retrospect I wish I’d just submitted the draft version as it was.) In the meantime I presented the paper at a conference. Our idea was very well received (I had a sheet of paper so people could write their names and addresses to get preprints, and we got either 50 or 150 (I can’t remember which, I guess it must have been 50) requests), but there was one person who came up later and tried to shoot down our idea. The shooter-down, Charlie Geyer, has done some great work, but in this case he was confused, I think in retrospect because we did not have a clear discussion of the different inferential goals that arose in the sorts of calculations he was doing (inference for normalizing constants of distributions) and those I was doing (inference for parameters in fitted models). In any case, the result was that our new and exciting method was surrounded by an air of controversy. In some ways that was a good thing: I became well known in the field right away, perhaps more than I deserved at the time (in the sense that most of my papers up to then and for the next few years were on applied topics; it was a while before I published other major papers on statistical theory, methods, and computation). But overall I’d rather have been moderately known for an excellent piece of research than very well known for being part of a controversy. I didn’t seek out controversy; it arose because someone else criticized our work without seeing the big picture, and at the time neither he nor I nor my collaborator had the correct synthesis of my work and his criticism.

(Again, the synthesis was that he was trying to get precise answers for hard problems and was in a position where he needed to have a good understanding of the complex distributions he was simulating from, whereas I was working on a method to apply routinely in relatively easy (but nontrivial!) settings. For Charlie’s problems, my method would not suffice because he wouldn’t be satisfied until he was directly convinced that the Markov chain was exploring all the space. For my problems, Charlie’s approach (to run a million simulations and work really hard to understand the computation for a particular model) wasn’t a practical solution. His approach to applied statistics was to handcraft big battleships to solve large problems, one at a time. I wanted to fit lots of small and medium-sized models (along with the occasional big one), fast.)

Anyway, this “different methods for different goals” conversation never occurred, so I left that meeting with an unpleasant feeling that our method was controversial, not fully accepted, and not fully understood. I got it into my head that our article should be published with discussion, so that Geyer and others could comment and we could respond.

But we never had that discussion, not in those words. Neither Charlie nor I nor Don Rubin was aware enough of the sociological context, as it were, so we ended up talking past each other.

In retrospect, that particular discussion did not work so well.

Here’s another example from about the same time: the Ising model. The first plot shows one chain from the Gibbs sampler. After 2000 iterations, it looks like it’s settled down to convergence (we’re plotting the log probability density, which is a commonly used summary for this sort of distribution).

But then look at the second plot: the first 500 iterations. If we’d only seen these, we might have been tempted to declare victory too early!

At this point, the naive take-home point might be that 500 iterations was not enough but we’re safe with 2000. But no! Even though the last bit of those 2000 looks as stationary and clean as can be, if we start from a different point and run for 2000, we get something different:

This one looks stationary too! But a careful comparison with the graphs above (even clearer when I displayed these on transparency sheets and overlaid them on the projector) reveals that the two “stationary” distributions are different. The chains haven’t mixed, the process hasn’t converged. R-hat reveals this right away (without even having to look at the graphs, but you can look at the graphs if you want).
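
To make this concrete, here is a minimal sketch in R of the potential-scale-reduction idea: two chains that each look stationary on their own but sit at different levels, as in the traces above. The AR(1) series are a made-up stand-in for the log-density traces (not an Ising sampler), and this is the classic pre-split form of R-hat without later refinements.

set.seed(1)
n_iter <- 2000
ar1 <- function(center) center + arima.sim(list(ar = 0.9), n = n_iter)
chains <- cbind(ar1(center = 0), ar1(center = 3))   # iterations x chains

rhat <- function(chains) {
  n <- nrow(chains)
  B <- n * var(colMeans(chains))        # between-chain variance
  W <- mean(apply(chains, 2, var))      # within-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled estimate of the target variance
  sqrt(var_plus / W)                    # near 1 only if the chains agree
}

rhat(chains)   # far above 1 here: the chains have not mixed

Each chain on its own would pass a casual eyeball test for stationarity, but the between-chain variance blows up the ratio; rerun it with both chains centered at the same value and R-hat drops to about 1.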

As I wrote in our article in Bayesian Statistics 4,

This example shows that the Gibbs sampler can stay in a small subset of its space for a long time, without any evidence of this problematic behavior being provided by one simulated series of finite length. The simplest way to run into trouble is with a two-chambered space, in which the probability of switching chambers is very low, but the above graphs are especially disturbing because the probability density in the Ising model has a unimodal (in the sense that this means anything in a discrete distribution) and approximately Gaussian marginal distribution on the gross scale of interest. That is, the example is not pathological; the Gibbs sampler is just very slow. Rather than being a worst-case example, the Ising model is typical of the probability distributions for which iterative simulation methods were designed, and may be typical of many posterior distributions to which the Gibbs sampler is being applied.

So that was my perspective: start from one point and the chain looks fine; start from two points and you see the problem. But Charlie had a different attitude toward the Ising example. His take on it was: the Ising model is known to be difficult, no one but a fool would try to simulate it with 2000 iterations of a Gibbs sampler. There’s a huge literature on the Ising model already!

Charlie was interested in methods for solving large, well-understood problems one at a time. I was interested in methods that would be used for all sorts of problems by statisticians such as myself who, for applied reasons, bite off more in model than we can chew in computation and understanding. For Charlie with the Ising model, multiple sequences missed the point entirely, as he knew already that 2000 iterations of Gibbs wouldn’t do it. For me, though . . . as an applied guy I was just the kind of knucklehead who might apply Gibbs to this sort of problem (in my defense, Geman and Geman made a similar mistake in 1984, I’ve been told), so it was good to have a practical convergence check.

Again, I think that in our discussion and rejoinder, Don and I presented our method well, in the context of our applied purposes. But I think it would’ve worked better as a straight statistics article. Nothing much useful came out of the discussion because none of us cut through to the key difference in the sorts of problems we were working on.

Part 2: A successful discussion

In the years since then, I’ve realized that communication is more than being right (or, should I say, thinking that one is right). Statistical ideas (and, for that matter, mathematical and scientific ideas in general) are sometimes best understood through their limitations. It’s Lakatos’s old “proofs and refutations” story all over again.

Recently I was involved in a discussion that worked out well. It started a few years ago with a post of mine on the differences between the sorts of data visualizations that go viral on the web (using some examples that were celebrated by statistician/designer Nathan Yau), as compared to statistical graphics of the sort that we are trained to make. It seemed to me that many visualizations that are successful with general audiences feature unique or striking designs and puzzle-like configurations, whereas the most successful statistical graphics have more transparent formats that foreground data comparisons. Somewhere in between are the visualizations created by lab scientists, who generally try to follow statistical principles but usually (in my view) try too hard to display too much information on a single plot.

My posts, and various follow-ups, were disliked by many in the visualization community. They didn’t ever quite disagree with my claim that many successful visualizations involve puzzles, but they didn’t like what they perceived as my negative tone.

In an attempt to engage the fields of statistics and visualization more directly, I wrote an article (with Antony Unwin) on the different goals and different looks of these two sorts of graphics. Like many of my favorite papers, this one took a lot of effort to get into a journal. But finally it was accepted in the Journal of Computational and Graphical Statistics, with discussion.

The discussants (Stephen Few, Robert Kosara, Paul Murrell, and Hadley Wickham; links to all four discussions are here on Kosara’s blog) politely agreed with us on some points and disagreed with us on others. And then it was time for us to write our rejoinder.

In composing the rejoinder I finally came upon a good framing of the problem. Before, we’d spoken of statistical graphs and information visualization as having different goals and looking different. But that didn’t work. No matter how often I said that it could be a good thing that an infovis is puzzle-like, or no matter how often I said that as a statistician I would prefer graphing the data like This but I can understand how graphing it like That could attract more viewers . . . no matter how much I said this sort of thing, it was interpreted as a value judgment (and it didn’t help when I said that something “sucked,” even if I later modified that statement).

Anyway, my new framing, which I really like, is in terms of tradeoffs. Not “two cultures,” not “different goals, different looks,” but tradeoffs. So it’s not stat versus infographics; instead it’s any of us trying to construct a graph (or, better still, a grid of graphs) and recognizing that it’s not generally possible to satisfy all goals at once, so we have to think about which goals are most important in any given situation:

In the internet age, we should not have to choose between attractive graphs and informational graphs: it should be possible to display both, via interactive displays. But to follow this suggestion, one must first accept that not every beautiful graph is informative, and not every informative graph is beautiful.

Yes, it can sometimes be possible for a graph to be both beautiful and informative, as in Minard’s famous Napoleon-in-Russia map, or more recently the Baby Name Wizard, which we featured in our article. But such synergy is not always possible, and we believe that an approach to data graphics that focuses on celebrating such wonderful examples can mislead people by obscuring the tradeoffs between the goals of visual appeal to outsiders and statistical communication to experts.

So it’s not Us versus Them, it’s each of us choosing a different point along the efficient frontier for each problem we care about.

And I think the framing worked well. At least, it helped us communicate with Robert Kosara, one of our discussants. Here’s what Kosara wrote, after seeing our article, the discussions (including his), and our rejoinder:

There are many, many statements in that article [by Gelman and Unwin] that just ask to be debunked . . . I [Kosara] ended up writing a short response that mostly points to the big picture of what InfoVis really is, and that gives some examples of the many things they missed.

While the original article is rather infuriating, the rejoinder is a great example of why this kind of conversation is so valuable. Gelman and Unwin respond very thoughtfully to the comments, seem to have a much more accurate view of information visualization than they used to, and make some good points in response.

Great! A discussion that worked! This is how it’s supposed to go: not a point-scoring debate, not people talking past each other, but an honest and open discussion.

Reflections

Perhaps my extremely, extremely frustrating experience early in my career (detailed in Part 1 above) motivated me to think seriously about the Lakatosian attitude toward understanding and explaining ideas. If you compare Bayesian Data Analysis to other statistics books of that era, for example, I think we did a pretty good job (although maybe not good enough) of understanding the methods through their limitations. But even with all my experience and all my efforts, this can be difficult, as revealed by the years it took for us to finally process our ideas on graphics and visualization to the extent that we could communicate with experts in these fields.

20 thoughts on “A tale of two discussion papers”

  1. “Tradeoffs” is a very important concept. Thomas Sowell is fond of saying, “There are no solutions, only tradeoffs.”

  2. It’s interesting that you got into a tussle with Geyer. He’s clearly very strongly in favor of “one big run” and I agree with him in the sense that if 2000 iterations *are* enough, you can’t get them by starting at 20 different locations and running 100 iterations in each chain (even though, with multiple cores or on a cluster of workstations, you’d LOVE to do this). But I’m surprised that he didn’t see your point about the concept of convergence checking. I mean, we all have to stop at some point, and we’d like some way to see when the simulation at that point obviously hasn’t mixed yet.

    Speaking of mixing, Geyer’s parallel tempering seems like a pretty useful technique. Any chance of including it in Stan, so that Stan will run NUTS on 2 or 3 additional tempered posterior densities, with the tempering coefficients given as an array or something? If you do switching with lowish probability, then the chains may be mostly able to run in parallel threads on multiple cores without a lot of synchronization overhead, and this could be a fantastic way to improve exploration of the space at basically no wall-clock-time cost.

  3. Specifically, I imagine something like the following method:

    You set up your N temperatures, maybe something like

    stan_parallel_temps <- c(1.25, 1.5, 3)  ## you always need 1 to be first, so perhaps it's better to just declare the additional temps and let Stan put 1 at the front of this list on its own.

    During warmup, Stan estimates the average length between U-turns in the untempered distribution internally. It then chooses a number N at random, exponentially distributed (or maybe gamma distributed, or with a distribution the user specifies in the model file), with mean equal to some constant times this average inter-U-turn length (rounded to the nearest int).

    All the chains then run N steps and synchronize on a thread semaphore, so that when all the high-temperature threads are done with N steps the temperature-1 thread can proceed. It proceeds by choosing a random adjacent pair of temperatures and attempting an exchange between those states, then generating a new N and setting all the threads back to work to do N HMC timesteps.

    In this scheme you usually complete several HMC trajectories before attempting an exchange, so you don't disrupt the NUTS sampler too much; meanwhile you have several tempered distributions running in parallel and feeding your untempered simulation new regions of space, so you may be able to explore the space more readily. And as I said, with 4 cores your wall-clock overhead is only due to the synchronization, which shouldn't be too bad.
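
    For what it's worth, here's a toy version of that exchange step in R, with random-walk Metropolis standing in for NUTS and a made-up bimodal target; the temperatures and tuning constants are arbitrary, and none of this touches Stan's actual internals:

    set.seed(2)
    log_post <- function(x) log(0.5 * dnorm(x, -5) + 0.5 * dnorm(x, 5))  # bimodal toy target
    temps    <- c(1, 1.25, 1.5, 3)   # temperature 1 first: that chain is the real posterior
    n_sweeps <- 2000
    n_steps  <- 10                   # sampler steps between exchange attempts

    metropolis <- function(x, temp, n_steps) {
      for (s in seq_len(n_steps)) {
        prop <- x + rnorm(1, sd = 2)
        if (log(runif(1)) < (log_post(prop) - log_post(x)) / temp) x <- prop
      }
      x
    }

    state <- rep(0, length(temps))
    draws <- numeric(n_sweeps)
    for (sweep in seq_len(n_sweeps)) {
      # each chain advances independently (these could be separate threads)
      state <- mapply(metropolis, state, temps, MoreArgs = list(n_steps = n_steps))
      # attempt a swap between one randomly chosen adjacent pair of temperatures
      i <- sample(length(temps) - 1, 1)
      log_ratio <- (log_post(state[i + 1]) - log_post(state[i])) *
                   (1 / temps[i] - 1 / temps[i + 1])
      if (log(runif(1)) < log_ratio) state[c(i, i + 1)] <- state[c(i + 1, i)]
      draws[sweep] <- state[1]       # keep only the untempered chain's draws
    }
    hist(draws, breaks = 50)         # both modes should show up

    A single untempered random-walk chain gets stuck in whichever mode it starts in; with the tempered companions and occasional swaps, the temperature-1 draws visit both modes.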

  4. The problem with Andrew’s InfoVis critique is its subjectivity. When we say “successful statistical graphics,” what do we mean by “successful”? Assuming we mean something like “conveys key features of the dataset to the target audience,” has anyone tested the efficacy of various formats?

    The same goes for notions of “attractiveness” and “informativeness”: what matters is not what the makers of infographics think but what the audience thinks. The only metric that comes close to empiricism in this whole debate is perhaps the graphs that people vote as best on Reddit or in online competitions, etc.

    It is somewhat ironic that a data-driven field like statistics has itself relied very little on data when judging the merits of various visualization strategies. I find the whole debate clouded a lot by personal bias and subjectivity in tastes.

    • Rahul:

      I am not offering a critique of infovis; I’m exploring the different goals of data graphics. A problem with discussions of data graphics, from Tufte on forward, has been the implicit assumption that there is a right way to do things, that there is some single goal. This assumption has led to difficulties in the empirical study of visualization as well. The point of my article with Unwin is to think more carefully about the multiple goals and tradeoffs involved in graphics. I am not a skilled or talented experimenter myself, but I hope that I can contribute to this field in my own way, which in this case is to try to make explicit the different goals involved in the visual display of quantitative information.

      • Can you say more about
        “A problem with discussions of data graphics, from Tufte on forward, has been the implicit assumption that there is a right way to do things, that there is some single goal.”

        I have Cleveland’s books and all of Tufte’s and went to his course, but I generally got the idea that there were many ways to show data, some of which were truly awful, and some of which were compelling, although the latter often required creativity to invent. Of course, SGI’s customers were always slicing and dicing data in different ways, sometimes the same data but for different uses. I certainly used different slides with the same data for different audiences.

        It is certainly the case that if there are 3 slightly different ways to display the same data, with no extra content, it might be a good idea to settle on one and use it, as people will then recognize it quickly.

        Anyway, I may have misinterpreted Tufte, or can easily have missed discussions, but you might expand on this.

        • John:

          I think Tufte is very reasonable but I think a message people got out of his book is that it’s possible to do it all: to have one single graph that beautifully and clearly displays a complex dataset and tells a story.

          If I could be granted one wish in the world of graphical communication, it would be that the famous Napoleon-in-Russia graph/map had never been created. Everybody’s trying to replicate that experience, and I think they’d typically do better to separate their goals and not try to make one display that does it all.

          A related issue is that Cleveland, Tufte, etc., focus on one graph at a time. The idea is for each graph to be clear and crisp. That’s fine, but look at what happened to me when my colleagues and I wrote Red State Blue State. We had about 100 graphs. We put care into each graph, each one was clear and crisp, but in aggregate they were monotonous. They all looked the same. A bit of variation, even some chartjunk, might’ve helped a bit.

        • Thanks.
          If I read this right, it sounds more like the existence of cool (but rare) graphs like the Napoleon one sometimes encourages suboptimal behavior, i.e., cramming too much in.

          But, as a good outcome for you :-), I just ordered the book, even though I have a big queue of others here to read.

        • I wouldn’t say that people trying to emulate the very best — or the most impressive, as say Minard on Napoleon — is a major problem in statistical graphics. I am more concerned about a pervasive mediocrity, in which people’s graphs are often dull and ineffective, but could and should be improved.

          Certain kinds of graphs have become unhelpful stereotypes. Many histograms I see suppress too much detail, because the bin width is too coarse. Box plots work well when you are comparing _lots_ of groups, but for comparing a small number of groups, density traces, quantile plots or dot/strip plots are usually better as they give detail too, and can easily be combined with summary information. (It’s astonishingly standard in introductory texts to back up t-tests of means with box plots, slurring completely over the key detail that the latter are only indirectly connected to the former.)
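
          To make that concrete, here is a small sketch in R with made-up data: the box plot shows only summaries, while a strip plot shows every observation and can carry the group means on top.

          set.seed(3)
          y <- c(rnorm(20, 10, 2), rnorm(20, 12, 2), rnorm(20, 11, 4))
          g <- factor(rep(c("A", "B", "C"), each = 20))

          op <- par(mfrow = c(1, 2))
          boxplot(y ~ g, main = "Box plot")                     # summaries only
          stripchart(y ~ g, vertical = TRUE, method = "jitter",
                     pch = 1, main = "Strip plot + means")      # every observation
          points(1:3, tapply(y, g, mean), pch = 19)             # overlay group means
          par(op)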

          Non-linear scales deserve more use. Many graphics groups even resist the idea of logarithmic scales, as supposedly too difficult for the audience, although it’s rare to meet anyone competent in addition but incompetent in multiplication. For more technical audiences, slightly non-standard scales such as logit, square root, reciprocal, cube root often help greatly. In decent software, a non-standard scale can always be labelled on the original scale.

          In many fields, an extraordinary amount of effort is expended in producing elaborate tables of results in a ritual display, all the way down to * ** *** for significance levels, a throwback to the 1930s. I am not aware of evidence that they are widely consulted or used, but there is frequent insistence on their ritual inclusion in theses, papers, and reports. (Naturally, people _should_ be able to access your detailed results somehow, but I still think this practice is oversold.) As Andrew in particular has often commented, simple table-like displays based on dot charts or similar designs are in practice much more likely to be looked at and understood (and if not, why not?).

          An old joke runs that armies are often well trained to fight the last war. In graphics, many statistical people are well trained to reproduce the kinds of graphics they learned in introductory courses, often on very small datasets. What doesn’t help here is a pervasive residual snobbery against graphs as stuff for showing the obvious to the ignorant (Tufte’s wording), and in favor of fitting the most complicated, recently fashionable model that’s currently hot (or cool).

        • Needless to say, in computing, some of us have long used log scales, necessarily.
          In the Computer History Museum, one wall has a linear Moore’s Law chart with a log-scale version next to it; we did feel it necessary to have both.

      • Andrew:

        Can you elaborate about your statement “This assumption has led to difficulties in the empirical study of visualization as well.”

        What studies are you talking about?

        • Thanks. I agree with your point there: that was just a bad study.

          I have been reading through your paper and one point jumps out at me:

          Choosing Nathan Yau’s list as an archetype of infovis is where I think you go wrong. Those clearly aren’t good examples (to me at least), yet I’ve seen many infographics I love that don’t conform to your somewhat simple and austere ideals of good statistical graphs.

          I think choosing Yau’s list was where you created a straw man, which the rest of your paper then proceeded to demolish.

      • Objections to the single goal arose on the old S news group with Brian Ripley and others – the _goodness_ of a graph should primarily depend on its purpose (and also on whether it needs to stand alone).

        Also, when I was teaching an intro stats course, the material on _good graphs_ followed the section on confounding. It was hard not to worry that most of the examples of good graphs were just confounded with good stories – content making the form look better than it actually was.

      • An aside: One of Paul Murrell’s rules for making good graphs (in one of those rebuttals) was a bit new and confusing to me. He writes:

        “2. Use horizontal lengths in preference to vertical lengths.”

        Does anyone know why? And what’s an example? In scatterplots both vertical and horizontal lengths are in use by default. Is the rule for histograms? Or…..?

  5. Andrew,

    I am intrigued. Are you getting any visitors to your blog from down under at all?

    I am doing my best, and please keep it up, quality and quantity. Wow, you are the Statistics Castle!! (in-blog joke)

  6. Interesting take on Part 1. You left out a couple of details, which I’m impolitic enough to share. One was that Rob Kass (then-editor of Stat Sci) got your paper and Charlie’s at the same time, and noticed that they said essentially the opposite thing (run a few parallel chains vs. “what, are you crazy? always run one long chain” — a point of view he stands by to this day). So he decided to publish them as a set, which he knew would lead to controversy… which is sort of what you’re looking for as an editor publishing discussion papers (the old Oscar Wilde adage about how you’d rather be talked about than not talked about). Another detail was that, as I recall, Charlie inadvertently sampled an improper posterior while illustrating his method, an embarrassing blunder that he likely would have caught had he but used 2 parallel chains (your idea). You (or Don?) made this point more than once in your rejoinder, which probably contributed to the communication breakdown; indeed I was passing notes between you and Charlie for a while; I thought I was in junior high again ;)

    Fortunately we’re all older and wiser now, as Part 2 helps reveal :)

    PS Folks don’t comment as much as we should because we’re all overwhelmed with other stuff we gotta do

    • Brad,

      Actually, there’s more to the story than that. When we submitted to Stat Sci, the editor was Carl Morris, who sent it to a referee who wanted to reject it. Don and I had to do a bit of revision. But, the reason we submitted to Stat Sci in the first place was that we wanted a lively discussion by Charlie and others.

      In retrospect, I think that was a mistake on our part; we should’ve just sent the paper to JASA. All the discussion stuff was a waste of time because none of the discussants (including Don, Charlie, and me) had a grip on the fundamental point of contention: Charlie was using simulation to attack large problems for major research projects, whereas Don and I were interested in simulation as a more generic tool for practitioners. Essentially, Charlie was in the world of the computational physicists developing big tools to crack big problems, and Don and I were anticipating the world of BUGS, developing automatic tools that would give good answers to relatively easy problems and hopefully reasonable answers to relatively hard problems.

      To put it another way, Charlie seemed to think I was an idiot because I didn’t realize how difficult Monte Carlo integration is, as a general problem; and I found Charlie frustrating because he was making a big deal about methods such as time-series estimates of Monte Carlo variance which I’d already tried and abandoned years earlier because they didn’t work on simple examples.

      The whole experience was a bit traumatic to me, but it was only a few years ago that I was able to really figure out what was going on.

      • I’m glad you’re over the trauma and at peace with things now. Maybe I shouldn’t be stirring the pot. But then, you started it ;)

        I think Charlie would continue to maintain that since one cannot ever define the term “overdispersed” in a setting where you don’t know the right answer ahead of time, there is no way to “overdisperse” a bunch of parallel sampling chains relative to the target. Your argument seems to be that there are a lot of “easy” problems for which you can sort of do this, and most of the problems we need to solve are “easy.” But I honestly think he’d still reject that, all these years later.

        I’m an applied guy too so I’m basically on your side. But only in the sense that 2 or at most 3 sampling chains are something my students and I routinely use; we don’t actually compute G&R (or its offspring) and simply “go to the beach” while the chains run and the G&R stat is computed for all possible parameters, with the machine stopping the sampler when all are less than some number (1.05 or whatever). I think there was a time when you recommended this? Anyway we just use informal methods to assess mixing and then maybe use effective sample size ideas (which Charlie wrote up way back when, copying them from the time series literature) to decide how many samples we’ve really got and whether we need any more.

        • Brad:

          I agree with you that overdispersion is less important than Don and I originally thought. Measuring mixing of chains is important but that can be done without overdispersion. The advantage of a numerical measure such as R-hat is that it allows us to monitor thousands of parameters without needing to look at graphs of each one. Then, if there are problems, the user can zoom in and look at traceplots of parameters that are poorly mixing.

          Our current practice in Stan is to run 4 chains, monitor convergence using split R-hat, and go until the total effective sample size is some number such as 100. Split R-hat uses some within-chain information and seems to work better than the old R-hat at catching problems of poor mixing, and our new measure of effective sample size uses both between- and within-chain information and, in the examples we’ve looked at, outperforms the purely within-chain approach and the purely between-chain approach that Geyer and I used in our 1992 papers. So progress is being made. Details on split R-hat and our new effective sample size measures are in the Stan manual and in the forthcoming 3rd edition of Bayesian Data Analysis; we’re also in the process of writing up a paper on them.
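
          To give a rough idea of the split-R-hat computation (a simplified sketch of the idea in R, not the exact Stan implementation; see the manual for the details): split each chain in half and apply the usual between/within comparison to the half-chains, so that a chain that is still drifting gets flagged even when the full chains happen to overlap.

          split_rhat <- function(chains) {   # chains: iterations x chains matrix
            n_half <- floor(nrow(chains) / 2)
            first  <- chains[1:n_half, , drop = FALSE]
            last   <- chains[(nrow(chains) - n_half + 1):nrow(chains), , drop = FALSE]
            halves <- cbind(first, last)     # each chain contributes two half-chains
            n <- nrow(halves)
            B <- n * var(colMeans(halves))           # between-half-chain variance
            W <- mean(apply(halves, 2, var))         # within-half-chain variance
            sqrt(((n - 1) / n * W + B / n) / W)
          }

          # e.g., four chains of draws for one parameter, one column per chain:
          # split_rhat(cbind(chain1, chain2, chain3, chain4))

          Within-chain drift shows up as disagreement between a chain’s first and second halves, which a between-chain comparison of the full chains would average away.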
