Jessica Hullman and I wrote an article that begins,

Computer science research has produced increasingly sophisticated software interfaces for interactive and exploratory analysis, optimized for easy pattern finding and data exposure. But assuming that identifying what’s in the data is the end goal of analysis misrepresents strong connections between exploratory and confirmatory analysis and contributes to shallow analyses. We discuss how the concept of a model check unites exploratory and confirmatory analysis, and review proposed Bayesian and classical statistical theories of inference for visual analysis in light of this view. Viewing interactive analysis as driven by model checks suggests new directions for software, such as features for specifying one’s intuitive reference model, including built-in reference distributions and graphical elicitation of parameters and priors, during exploratory analysis, as well as potential lessons to be learned from attempting to build fully automated, human-like statistical workflows.

Jessica provides further background:

Tukey’s notion of exploratory data analysis (EDA) has had a strong influence on how interactive systems for data analysis are built. But the assumption has generally been that exploratory analysis precedes model fitting or checking, and that the human analyst can be trusted to know what to do with any patterns they find. We argue that the symbiosis of analyst and machine that occurs in the flow of exploratory and confirmatory statistical analysis makes it difficult to make progress on this front without considering what’s going on, and what should go on, in the analyst’s head. In the rest of the paper we do the following:

– We point out ways that optimizing interactive analysis systems for pattern finding, and trusting the user to know best, can lead to software that conflicts with goals of inference. For example, interactive systems like Tableau default to aggregating data to make high-level patterns more obvious, but this can lead some users to overlook variation in the data. Researchers evaluate interactive visualizations and systems based on how well people can read data, how much they like using the system, or how evenly they distribute their attention across data, not how good their analyses or decisions are. Various algorithms for progressive computation or privacy preservation treat the dataset as though it were an object of inherent interest, without considering its use in inference.

– We propose that a good high-level understanding frames interactive visual analysis as driven by model checks. The idea is that when people are “exploring” data using graphics, they are implicitly specifying and fitting pseudo-statistical models, which produce reference distributions to compare to data. This makes sense because the goal of EDA is often described in terms of finding the unexpected, but what is unexpected is only defined via some model or view of how the world should be. In a Bayesian formulation (following Gelman 2003, 2004, our primary influence for this view), the reference distribution is produced by the posterior predictive distribution. So looking at graphics is like doing posterior predictive checks, where we are trying to get a feel for the type and size of discrepancies so we can decide what to do next. We like this view for various reasons, including because (1) it aligns with the way that many exploratory graphics get their meaning from implicit reference distributions, like residual plots or Tukey’s “hanging rootograms”; (2) it allows us to be more concrete about the role prior information can play in how we examine data; and (3) it suggests that to improve tools for interactive visual analysis we should find ways to make the reference models more explicit, so that our graphics better exploit our abilities to judge discrepancies, such as through violations of symmetry, and the connection between exploration and confirmation is reinforced.
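To make the idea concrete, here is a minimal sketch of a posterior predictive check, not taken from the paper: the “reference model” is a normal distribution (with a crude posterior for its mean and the standard deviation fixed at the sample value, purely for simplicity), the data are simulated to be skewed, and the discrepancy a viewer might notice in a histogram is formalized as a skewness test statistic compared against replicated datasets drawn from the model. All names and modeling choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: right-skewed, while our implicit
# reference model is a symmetric normal distribution.
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)
n = len(y)
ybar, s = y.mean(), y.std(ddof=1)

def skewness(x):
    # Sample skewness: the kind of asymmetry a viewer might
    # notice "by eye" in a histogram.
    x = np.asarray(x)
    return np.mean(((x - x.mean()) / x.std(ddof=0)) ** 3)

# Replicated datasets from an approximate posterior predictive
# distribution: mu ~ Normal(ybar, s/sqrt(n)), sigma fixed at s.
n_rep = 1000
t_rep = np.empty(n_rep)
for i in range(n_rep):
    mu = rng.normal(ybar, s / np.sqrt(n))
    y_rep = rng.normal(mu, s, size=n)
    t_rep[i] = skewness(y_rep)

t_obs = skewness(y)

# Posterior predictive p-value: how often replications are at
# least as skewed as the observed data. A value near 0 flags a
# discrepancy between data and reference model.
ppp = np.mean(t_rep >= t_obs)
print(f"observed skewness = {t_obs:.2f}, ppp = {ppp:.3f}")
```

The graphical analogue of this check is simply plotting the observed histogram alongside histograms of a few `y_rep` draws; the point of the paper’s framing is that the eyeball version and the formal version are the same operation.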

– We review other proposed theories for understanding graphical inference: Bayesian cognition, visual analysis as implicit hypothesis testing, multiple comparisons. The first two can be seen as subsets of the Bayesian model checking formulation, and so can complement our view.

– We discuss the implications for designing software. While our goal is not to lay out exactly what new features should be added to systems like Tableau, we discuss some interesting ideas worth exploring more, like how the user of an interactive analysis system could interact with graphics to sketch their reference distribution or make graphical selections of what data they care about and then choose between options for their likelihood, specify their prior graphically, see draws from their model, etc. The idea is to brainstorm how our usual examinations of graphics in exploratory analysis could more naturally pave the way for increasingly sophisticated model specification.

– We suggest that by trying to automate statistical workflows, we can refine our theories. Sort of like the saying that if you really want to understand some topic, you should teach it. If we were to try to build an AI that can do steps like identify model misfits and figure out how to improve the model, we’d likely have more ideas about what sorts of features our software should offer people.

– We conclude with the idea that the framing of visualization as model checking relates to ideas we’ve been thinking about recently regarding data graphics as narrative storytelling.

**P.S.** Ben Bolker sent in the above picture of Esmer helping out during a zoom seminar.

From the paper:

“According to many accounts of how knowledge is created during data analysis, so-called exploratory analysis is “model-free” and consists of preparing oneself with data, searching for useful representations or transformations, and noting interesting observations.”

and:

“Model-driven inference plays a role even in canonically exploratory activities; after all, what is surprising is defined by the implicit model of our expectations.”

Why not just use the word “knowledge” instead of “model”? Or better yet, just say what you really mean: “preconceived notion.” You don’t have an implicit model until you have a preconceived notion about every aspect of a phenomenon, to the extent that you can envision an input and predict the output.

I cannot imagine anyone thinking of anything anymore as being “model-free” because even a single vague notion is a “model” these days.

Same rant, different day.

Are you familiar with Forrest Young’s ViSta software? It was seemingly an early attempt to routinize statistical workflows, and it was built on top of XLISP-STAT. https://www.uv.es/visualstats/Book/ provides one of perhaps several Web sites focused on ViSta.

I haven’t read the paper in full yet; I probably should. Anyway, on one hand this is super interesting and relevant; on the other hand, it is also super hard to come up with a theory of how unformalized decisions by the user, as an integral part of interactive analysis, influence later work with a model that resulted from it. (The same issue is probably behind quite a few “forking paths” in the notorious garden…)

We’re not really trying to predict downstream analyses or results with theory. Our goal is more to understand exploratory activities like examining graphs as model-driven in a way that gives us testable claims (so that, as people who create software and visualization tools, we can continue to improve our ‘model’ of what happens in EDA). And can theories of graphical inference give us insight into how to design software for interactive EDA that does a better job of making the transition from EDA to CDA natural, over, say, the status quo of assuming you should just optimize all graphics for perception and let the user do whatever they will in EDA, assuming that this will be separate from any CDA they do later.

Thanks for posting this draft. It is an important topic and you have plenty of constructive ideas. The article comes across as in favour of more modeling, against looking at data closely, and against overreliance on patterns found in graphics. Looking at data closely is a good thing!

You use histograms as examples a few times and imply that all that matters is what might be a suitable distributional model. How would that approach cope with outliers, multiple modes, gaps, heaping, and other possible features that might arise?

Overreliance on patterns found in graphics is undoubtedly a problem, but that is because graphics is rarely taught well, if at all. If you look at graphics you have to know how to interpret them AND you have to know how to check any results you think you can see. Check with additional data, check with other variables, and, most importantly, check with context and background information. You really only mention checking in connection with models. That makes your examples bland and your discussion dry. For instance, you could have explained where the data in the left of Figure 1 come from, what they mean, and how you would check why the residual plot has that unusual pattern.

You rightly mention uncertainty and the difficulty of representing it. The term uncertainty can also be used in connection with uncertainty about the quality of data and uncertainty about the provenance of data. These are important forms of uncertainty too, and graphics are a great help in exploring them. Using graphics in this way led to several of the comments and queries I had about the examples in your recent book, Andrew.

The term interactive means different things to different people. Given the number of references cited, it is disappointing that you make no mention of the kind of interactive graphics found in Paul Velleman’s Data Desk and John Sall’s JMP, both impressive pieces of software whose versions of over 30 years ago would still make most modern interactive graphics software look rudimentary. You told me you had never used that kind of software, Andrew, so I imagine you would not presume to write critically about it, but your readers should at least be aware it exists.

Thanks for the comments! I agree our examples are a bit high level, and we could probably go through examples with more attention to what the data are. ‘The term uncertainty can also be used in connection with uncertainty about the quality of data and uncertainty about the provenance of data.’ Agree. One thing that’s been useful to me about using Bayesian cognition to understand visual statistical inference is that we can use Bayesian models to try to learn about how people perceive bias in the data and how it impacts their comparisons. We mention this but only briefly. My impression as someone who does visualization in computer science is that we haven’t thought enough about how to represent non-quantified uncertainty; we like precise things and so we’ve avoided it. Also, thanks for the pointers to Data Desk and JMP; it probably would be useful to talk a bit more about some of the earlier systems that are more in line with what we are proposing.