Hey! Here’s some R code to make colored maps using circle sizes proportional to county population.

Kieran Healy shares some code and examples of colored maps where each region is given a circle in proportion to its population. He calls these “Dorling cartograms,” which sounds kinda mysterious to me but I get that there’s no easy phrase to describe them. It’s clear in the pictures, though:

I wrote to Kieran asking if it was possible to make the graphs without solid circles around each point, as that could make them more readable.

He replied:

Yeah it’s easy to do that, you just give different parameters to geom_sf(), specifically you set the linewidth to 0 so no border is drawn on the circles. So instead of geom_sf(color="gray30") or whatever you say geom_sf(linewidth=0). But I think this does not in fact make things more readable with a white, off-white, or light gray background:

The circle borders do a fair amount of work to help the eye see where the circles actually are as distinct elements. It’s possible to make the border more subtle and still have it work:

In this version the circle borders are only a *very slightly* darker gray than the background, but it makes a big difference still.

Finally you could also remove the circle borders but make the background very dark, like this:

Not bad, though the issue becomes properly seeing the dark orange, especially smaller counties with very high pct Black. This would work better with one of the other palettes.

Interesting. Another win for ggplot.
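
For anyone who wants to try this at home, here’s a minimal sketch of the general approach. This is not Kieran’s actual code: the county object, the variable names, and the projection are stand-ins, and I’m assuming the Dorling layout is built with the cartogram package.

# Sketch: build a Dorling cartogram from a county-level sf object and draw it
# with subtle borders and with no borders (ggplot2 >= 3.4 for the linewidth argument).
library(sf)
library(cartogram)   # provides cartogram_dorling()
library(ggplot2)

# `counties` is assumed to be an sf object with columns `pop` and `pct_black`;
# cartogram_dorling() wants a projected CRS, so reproject first.
counties_proj <- st_transform(counties, crs = 5070)   # US Albers, for example
dorling <- cartogram_dorling(counties_proj, weight = "pop")

# Subtle border: only slightly darker than the background
ggplot(dorling) +
  geom_sf(aes(fill = pct_black), color = "gray85", linewidth = 0.1) +
  scale_fill_viridis_c(option = "magma") +
  theme_void()

# No border at all
ggplot(dorling) +
  geom_sf(aes(fill = pct_black), linewidth = 0) +
  scale_fill_viridis_c(option = "magma") +
  theme_void()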

Hand-drawn Statistical Workflow at Nelson Mandela

In September 2023 I taught a week-long course on statistical workflow at the Nelson Mandela African Institution of Science and Technology (NM-AIST), a public postgraduate research university in Arusha, Tanzania established in 2009.

NM-AIST – CENIT@EA

The course was hosted by Dean Professor Ernest Rashid Mbega and the Africa Centre for Research, Agricultural Advancement, Teaching Excellence and Sustainability (CREATES) through the Leader Professor Hulda Swai and Manager Rose Mosha.

Our case study was an experiment on the NM-AIST campus designed and implemented by Dr Arjun Potter and Charles Luchagula to study the effects of drought, fire, and herbivory on growth of various acacia tree species. The focus was pre-data workflow steps, i.e. experimental design. The goal for the week was to learn some shared statistical language so that scientists can work with statisticians on their research.

Together with Arjun and Charles, with input from Drs Emmanuel Mpolya, Anna Treydte, Andrew Gelman, Michael Betancourt, Avi Feller, Daphna Harel, and Joe Blitzstein, I created course materials full of activities. We asked participants to hand-draw the experimental design and their priors, working together with their teammates. We also did some pencil-and-paper math and some coding in R.

Course participants were students and staff from across NM-AIST. Over the five days, between 15 and 25 participants attended on a given day.

Using the participants’ ecological expertise, we built a model to tell a mathematical story of how acacia tree height could vary by drought, fire, herbivory, species, and plot location. We simulated parameters and data from this model, e.g. beta_fire = rnorm(n = 1, mean = -2, sd = 1) and then simulated_data = rnorm(n, beta_0 + beta_fire*Fire + … + beta_block[Block], sd_tree). We then fit the model to the simulated data.
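
For concreteness, here’s a rough R sketch of this kind of fake-data simulation. The design sizes, baseline, and prior values are illustrative stand-ins rather than the actual course numbers, and the sub-block assignment is simplified to the tree level:

# Simulate a fake dataset from the assumed model, then fit the model to it.
set.seed(123)
n_block <- 8
n_tree_per_block <- 20
n <- n_block * n_tree_per_block

block <- rep(1:n_block, each = n_tree_per_block)
fire <- rep(rep(0:1, length.out = n_block), each = n_tree_per_block)  # block-level treatment
drought <- rbinom(n, 1, 0.5)      # simplified: assigned per tree
herbivory <- rbinom(n, 1, 0.5)    # simplified: assigned per tree

# Draw parameters from illustrative priors (heights in cm)
beta_0 <- rnorm(1, mean = 300, sd = 50)
beta_fire <- rnorm(1, mean = -2, sd = 1)
beta_drought <- rnorm(1, mean = -3, sd = 1)
beta_herbivory <- rnorm(1, mean = -1, sd = 1)
sd_block <- 10
sd_tree <- 20
beta_block <- rnorm(n_block, mean = 0, sd = sd_block)

# Simulate data and fit
height <- rnorm(n, beta_0 + beta_fire*fire + beta_drought*drought +
                   beta_herbivory*herbivory + beta_block[block], sd_tree)
fake <- data.frame(height, fire, drought, herbivory, block = factor(block))

library(lme4)
fit <- lmer(height ~ fire + drought + herbivory + (1 | block), data = fake)
summary(fit)   # note the larger standard error on the fire coefficient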

Due to difficulty in manipulating fire, fire was assigned at the block-level, whereas drought and herbivory were assigned at the sub-block level. We saw how this reduced precision in estimating the effect of fire:

We redid the simulation assuming a smaller block effect and saw improved precision. This confirmed the researchers’ intuition that they need to work hard to reduce the block-to-block differences.

To keep the focus on concepts not code, we only simulated once from the model. A full design analysis would include many simulations from the model. In Section 16.6 of ROS they fix one value for the parameters and simulate multiple datasets. In Gelman and Carlin (2014) they consider a range of plausible parameters using prior information. Betancourt’s workflow simulates parameters from the prior.
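
Continuing the illustrative sketch above, a full design analysis might look something like this, fixing the parameter value as in the ROS approach and simulating many datasets:

# Repeat the simulation to estimate the sampling variability of the fire estimate
# under different assumptions about the block-to-block differences.
sim_once <- function(sd_block) {
  beta_fire <- -2                                # fix the parameter at one value
  beta_block <- rnorm(n_block, 0, sd_block)
  height <- rnorm(n, beta_0 + beta_fire*fire + beta_drought*drought +
                     beta_herbivory*herbivory + beta_block[block], sd_tree)
  fit <- lmer(height ~ fire + drought + herbivory + (1 | block),
              data = data.frame(height, fire, drought, herbivory, block = factor(block)))
  fixef(fit)["fire"]
}
estimates_big_blocks   <- replicate(200, sim_once(sd_block = 10))
estimates_small_blocks <- replicate(200, sim_once(sd_block = 2))
sd(estimates_big_blocks)     # precision of the fire estimate with large block effects
sd(estimates_small_blocks)   # improved precision when block-to-block differences are small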

Our course evaluation survey was completed by 14 participants. When asked “which parts of the class were most helpful to you to understand the concepts?”, respondents chose instructor explanations, drawings, and activities as more helpful than the R code. However, participants also expressed eagerness to learn R and to analyze the real data in our next course.

The hand-drawn course materials and activities were inspired by Brendan Leonard’s illustrations in Bears Don’t Care About Your Problems and I Hate Running and You Can Too. Brendan wrote me,

I kind of think hand-drawing stuff makes it more fun and also maybe less intimidating?

I agree.

More recently, I have been reading Introduction to Modern Causal Inference by Alejandro Schuler and Mark van der Laan, who say

It’s easy to feel like you don’t belong or aren’t good enough to participate…

yup.

To deal with that problem, the voice we use throughout this book is informal and decidedly nonacademic…Figures are hand-drawn and cartoonish.

I’m excited to return to NM-AIST to continue the workflow steps with the data that Dr Arjun Potter and Charles Luchagula have been collecting. With the real data, we can ask: is our model realistic enough to achieve our scientific goals?

Listen to those residuals

This is Jessica. Speaking of data sonification (or sensification), Hyeok, Yea Seul Kim, and I write

Data sonification, mapping data variables to auditory variables such as pitch or volume, is used for data accessibility, scientific exploration, and data-driven art (e.g., museum exhibitions), among others. While a substantial amount of research has been done on effective and intuitive sonification design, software support is not commensurate, limiting researchers from fully exploring its capabilities. We contribute Erie, a declarative grammar for data sonification that enables abstractly expressing auditory mappings. Erie supports specifying extensible tone designs (e.g., periodic wave, sampling, frequency/amplitude modulation synthesizers), various encoding channels, auditory legends, and composition options like sequencing and overlaying. Using standard Web Audio and Web Speech APIs, we provide an Erie compiler for web environments. We demonstrate the expressiveness and feasibility of Erie by replicating research prototypes presented by prior work and provide a sonification design gallery. We discuss future steps to extend Erie toward other audio computing environments and support interactive data sonification.

Have you ever wanted to listen to your model fit? I haven’t, but I think it’s worth exploring how one would do so effectively, either for purposes of making data representations accessible to blind and visually impaired users, or for other purposes like data journalism or creating “immersive” experiences of data like you might find in museums.

But it turns out it’s really hard to create data sonifications with existing tools! You have to learn low-level audio programming and use multiple tools to do things like combine several sonifications into a single design. Other tools only offer the ability to make sonifications corresponding to a narrow range of chart types, perhaps as a result of a bias toward thinking about sonifications only from the perspective of how they map to existing visualizations.

Hyeok noticed some of these issues and decided to do something about it. Erie provides a flexible specification format where you can define a sonification design in terms of tone (the overall quality of a sound) and encodings (mappings from data variables to auditory features). You can compose more complex sonifications by repeating, sequencing, and overlaying sonifications, and it interfaces with standard web audio APIs. 

Documentation on how to install and use Erie is here. There’s also an online editor you can use to try out the grammar. But first I recommend playing some of the examples, which include some simple charts and recreations of data journalism examples. My favorites are the residuals from a poorly fit model and a better fitting one. Especially if you play just the data series of these back to back, the better fit should sound more consistent and slightly more harmonious.

This was really Hyeok’s vision; I can’t claim to have contributed very much to this work. But it was interesting to watch it come together. During our meetings about the project, it was initially very unfamiliar to me, trying to interpret audio variables like pitch as carrying information about data values, and I can’t really say it’s gotten easier. I guess this gets at how hard it is to make data easily consumable in a serial format like audio, at least for users who are accustomed to all the benefits of parallel visual processing. 

Using the term “visualization” for non-visual representation of data

The other day we linked to a study whose purpose was to “investigate challenges faced by curators of data visualizations for blind and low-vision individuals.”

JooYoung Seo, the organizer of that project, provides further background:

With the exception of a few IRB-related constraints, below is my brief response to your feedback.

1. Unclear terminology. You recommended that we use the verb “create” instead of “curate” in our survey, and we completely agree with your confusion about this terminology. We chose to use the verb curate because our research was funded by the Institute of Museum and Library Services (IMLS) to develop a tool that would make it easier for data curators to create accessible visualizations. We also began our research proposal as a community partnership with the Data Curation Network (DCN), so we were using terminology that was tailored to a specific professional group. In an effort to strike a better balance between the confusion of the terminology and the directionality of our research goals, we will add some explanations to make it easier to understand.

2. The inappropriateness of the term “visualization”. You raised the issue of the inappropriateness of using the term “visualization” in our survey to refer to accessible data “sensification”. This is very insightful.

I can assure you that our team is in no way trying to subscribe to or perpetuate the term visualization. As the PI, I am a lifelong blind person and my student who is co-leading this research is also a lifelong low-vision person, so we have given a lot of thought to the term “visualize”.

In our research, visualization is one way of encoding/decoding data representation. We believe that accessible data representation can only be achieved through multimodal data representation (sonification, tactile representation, text description, AI conversation, etc.) that comprehensively conveys various modalities along with visualization. Our initial research, Multimodal Access and Interactive Data Representation (MAIDR: https://github.com/uiuc-ischool-accessible-computing-lab/maidr), which we will present in May at the CHI2024 conference, reflects this belief, and this survey is an extension of our MAIDR research.

Despite the bias that the term visualization can introduce, we chose to use it in this survey for two reasons: first, we wanted to follow the convention of using terminology that is more easily understood by survey participants, assuming that they are data curators with varying levels of experience with accessibility, ranging from no experience to expert level experience. We could of course use the terms “data sensification” or “data representation” for further explanation, but since this initial study is focused on observing and understanding the status quo rather than “education,” we wanted to reduce potentially confusing new concepts as much as possible.

In parallel to our survey of data curators, we are also conducting a separate survey with blind people asking them about the accessibility issues with data visualization that they encounter in their daily lives. In that survey, we want to understand how blind people are approaching visualization.

Second, the reason we use the term visualization in our research involving blind and low-vision people is to challenge the misconception that being visually impaired excludes people from visualization altogether. For example, there are many blind and low-vision people who use their residual vision to approach visualization. Depending on when they became blind, some people use the visualizations that remain in their brains as they learn. As someone who became blind as a teenager, I still use visual cues like color and brightness to help me learn and retain information.

If you cannot use the term visualize just because you can’t see, “See you tomorrow,” “Let’s see,” and “let me take a look” would also be unusable for blind people. Blind people are just as capable of using visual encoding/decoding.

I get his point on the term “visualization.” Indeed, I can visualize scenes with my eyes closed. In our paper, we used “sensification” in part to emphasize that we are interested in engaging other senses than vision, especially hearing and the muscular resistance sense.

Click here to help this researcher gather different takes on making data visualizations for blind people

Here’s the survey, and here’s what it says:

The purpose of this study is to investigate challenges faced by curators of data visualizations for blind and low-vision individuals. This includes, but is not limited to, graphs, charts, plots, diagrams, and data tables.

We invite individuals who meet the following criteria to participate in our survey:
– Aged 18 or older
– Data curators or professionals who have experience creating a visualization to depict data and are interested in the accessibility of data visualizations
– Data professionals with or without experience in assisting BLV individuals with data visualizations
– Currently located in the United States

The survey uses the term “curating,” which doesn’t seem quite right to me. I create lots of visualizations; I don’t spend much time “curating” them. So that’s confusing. I think they should replace “curate” by “create,” or maybe “create or use,” throughout.

Also, they keep saying “visualizations,” which doesn’t sound quite right given that vision will not be involved. I’d prefer a term such as “sensification” or “vivification,” as we discuss in our article on the topic.

Also it’s funny how the survey starts with screenfuls of paperwork. The whole IRB thing really is out of control. It’s a mindless bureaucracy. I don’t blame the researcher on this study—it’s not his fault, he’s at a university and he has to play by the rules. You just end up with ridiculous things like this:

When they write the story of the decline and fall of western civilization, they’ll have to devote a chapter to Institutional Review Boards. Academic bigshots lie, cheat, and steal, and in the meantime people who do innocuous surveys have to put in these stupid warnings. Again, not the fault of the researcher! It’s the system.

Anyway, sensification for blind and low-vision people is a topic worth studying, and if you fill out the survey maybe it will be of help, at least a starting point for further exploration of this issue.

P.S. More here.

Minimum criteria for studies evaluating human decision-making

This is Jessica. A while back on the blog I shared some opinions about studies of human decision-making, such as to understand how visualizations or displays of model predictions and explanations impact people’s behavior. My view is essentially that a lot of the experiments being used to do things like rank interfaces or model explanation techniques are not producing very informative results because the decision task is defined too loosely.

I decided to write up some thoughts rather than only blogging them. In Decision Theoretic Foundations for Human Decision Experiments (with Alex Kale and Jason Hartline), we write: 

Decision-making with information displays is a key focus of research in areas like explainable AI, human-AI teaming, and data visualization. However, what constitutes a decision problem, and what is required for an experiment to be capable of concluding that human decisions are flawed in some way, remain open to speculation. We present a widely applicable definition of a decision problem synthesized from statistical decision theory and information economics. We argue that to attribute loss in human performance to forms of bias, an experiment must provide participants with the information that a rational agent would need to identify the normative decision. We evaluate the extent to which recent evaluations of decision-making from the literature on AI-assisted decisions achieve this criteria. We find that only 6 (17%) of 35 studies that claim to identify biased behavior present participants with sufficient information to characterize their behavior as deviating from good decision-making. We motivate the value of studying well-defined decision problems by describing a characterization of performance losses they allow us to conceive. In contrast, the ambiguities of a poorly communicated decision problem preclude normative interpretation. 

We make a couple main points. First, if you want to evaluate human decision-making from some sort of information interface, you should be able to formulate the task you are studying as a decision problem as defined by statistical decision theory and information economics. Specifically, a decision problem consists of a payoff-relevant state, a data-generating model which produces signals that induce a distribution over the state, an action space from which the decision-maker chooses a response, and a scoring rule that defines the quality of the decision as a function of the action that was chosen and the realization of the payoff-relevant state. Using this definition of a decision problem gives you a statistically coherent way to define the normative decision, i.e., the action that a Bayesian agent would choose to maximize their utility under whatever scoring rule you’ve set up. In short, if you want to say anything based on your results that implies people’s decisions are flawed, you need to make clear what is optimal, and you’re not going to do better than statistical decision theory. 
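
To make the four components concrete, here’s a toy decision problem coded up in R. This is my own made-up example, not one from the paper; the prior, signal probabilities, and payoffs are arbitrary.

# A toy decision problem: binary payoff-relevant state, a noisy signal,
# a binary action, and a scoring rule. The normative response forms posterior
# beliefs via Bayes' rule and picks the action maximizing posterior expected score.
p_signal_given_1 <- 0.8    # data-generating model: P(signal = 1 | state = 1)
p_signal_given_0 <- 0.2    #                        P(signal = 1 | state = 0)

# Scoring rule: rows are actions (0, 1), columns are states (0, 1)
score <- matrix(c( 0, -5,    # action 0: fine if state is 0, costly miss if state is 1
                  -4,  5),   # action 1: false alarm if state is 0, reward if state is 1
                nrow = 2, byrow = TRUE)

normative_action <- function(signal, prior) {
  like1 <- if (signal == 1) p_signal_given_1 else 1 - p_signal_given_1
  like0 <- if (signal == 1) p_signal_given_0 else 1 - p_signal_given_0
  post1 <- like1 * prior / (like1 * prior + like0 * (1 - prior))  # posterior P(state = 1)
  expected <- score %*% c(1 - post1, post1)                       # expected score per action
  which.max(expected) - 1                                         # the normative action
}
normative_action(signal = 1, prior = 0.3)   # 1: act
normative_action(signal = 0, prior = 0.3)   # 0: don't act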

The second requirement is that you communicate to the study participants sufficient information for a rational agent to know how to optimize: select the optimal action after forming posterior beliefs about the state of the world given whatever signals (visualizations, displays of model predictions, etc.) you are showing them. 

When these criteria are met you gain the ability to conceive of different sources of performance loss implied by the process that the rational Bayesian decision-maker goes through when faced with the decision problem: 

  • Prior loss, the loss in performance due to the difference between the agent’s prior beliefs and those used by the researchers to calculate the normative standard.
  • Receiver loss, the loss due to the agent not properly extracting the information from the signal, for example, because the human visual system constrains what information is actually perceived or because participants can’t figure out how to read the signal.
  • Updating loss, the loss due to the agent not updating their prior beliefs according to Bayes rule with the information they obtained from the signal (in cases where the signal does not provide sufficient information about the posterior probability on its own).
  • Optimization loss, the loss in performance due to not identifying the optimal action under the scoring rule. 

Complicating things is loss due to the possibility that the agent misunderstands the decision task, e.g., because they didn’t really internalize the scoring rule. So any hypothesis you might try to test about one of the sources of loss above is actually testing the joint hypothesis consisting of your hypothesis plus the hypothesis that participants understood the task. We don’t get into how to estimate these losses, but some of our other work does, and there’s lots more to explore there. 
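
Continuing the toy example above, here’s a sketch of how one of these losses can be conceived of in principle: the gap in expected score between the rational benchmark and an agent who updates and optimizes correctly but holds a different prior than the one used to define the normative standard. (Again, this is just an illustration, not the estimation approach from our papers.)

# Expected score of a policy when states and signals come from the assumed
# data-generating process, and the resulting "prior loss" for a miscalibrated agent.
true_prior <- 0.3
expected_score <- function(policy) {
  total <- 0
  for (state in 0:1) {
    p_state <- if (state == 1) true_prior else 1 - true_prior
    for (signal in 0:1) {
      p1 <- if (state == 1) p_signal_given_1 else p_signal_given_0
      p_sig <- if (signal == 1) p1 else 1 - p1
      total <- total + p_state * p_sig * score[policy(signal) + 1, state + 1]
    }
  }
  total
}
benchmark     <- expected_score(function(s) normative_action(s, prior = true_prior))
miscalibrated <- expected_score(function(s) normative_action(s, prior = 0.7))
benchmark - miscalibrated   # prior loss; about 1.6 with these numbers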

If you communicate to your study participants part of a decision problem, but leave out some important component, you should expect their lack of clarity about the problem to induce heterogeneity in the behaviors they exhibit. And then you can’t distinguish such “heterogeneity by design” from real differences in decision quality across the conditions you are trying to study. You don’t know if participants are making flawed decisions because of real challenges with forming accurate beliefs or selecting the right action under different types of signals, or because they are operating under a different version of the decision problem than you have in mind.

Here’s a picture that comes to mind:

Diagram showing an underspecified decision problem being interpreted differently by different people.

I.e., each participant might have a unique way of filling in the details about the problem that you’ve failed to communicate, which differs from how you analyze it. Often I think experimenters are overly optimistic about how easy it is to move from the left side (the artificial world of the experiment) to draw conclusions about the right. I think sometimes people believe that if they leave out some information (e.g., they don’t communicate to participants the prior probability of recidivating in a study on recidivism prediction, or they set up a fictional voting scenario but don’t give participants a clear scoring rule when studying effects of different election forecast displays), they are “being more realistic,” because in the real world people rely on their own intuitions and past experience, so there are lots of possible influences on how a person makes their decision. But, as we write in the paper, this is a mistake, because people will generally have different goals and beliefs in an experiment than they do in the real world. Even if everyone in the experiment is influenced by a different factor that does operate in the real world, the idea that the composition of all these interpretations gives us a good approximation of real-world behavior is not supported; as we say in the paper, it “arises from a failure to recognize our fundamental uncertainty about how the experimental context relates to the real world.” We can’t know for sure how good a simulacrum our experimental context is for the real-world task, so we should at least be very clear about what the experimental context is so we can draw internally valid conclusions.

Criterion 1 is often met in visualization and human-centered AI, but Criterion 2 is not

I don’t think these two criteria are met in most of the interface decision experiments I come across. In fact, as the abstract mentions, Alex and I looked at a sample of 46 papers on AI assisted decision-making that a survey previously labeled as evaluating human decisions; of these 11 were interested in studying tasks for which you can’t define ground truth, like emotional responses people had to recommendations, or had a descriptive purpose, like estimating how accurately a group of people can guess the post-release criminal status of a set of defendants in the COMPAS dataset. Of the remaining 35, only a handful gave participants enough information for them to at least in theory know how to best respond to the problem. And even when sufficient information to solve the decision problem in theory is given, often the authors use a different scoring rule to evaluate the results than they gave to participants. The problem here is that you are assigning a different meaning to the same responses when you evaluate versus when you instruct participants. There were also many instances of information asymmetries between conditions the researchers compared, like where some of the prediction displays contained less decision-relevant information or some of the conditions got feedback after each decision while others didn’t. Interpreting the results is easier if the authors account for the difference in expected performance based on giving people a slightly different problem. 

In part the idea of writing this up was that it could provide a kind of explainer of the philosophy behind work we’ve done recently that defines rational agent benchmarks for different types of decision studies. As I’ve said before, I would love to see people studying interfaces adopt statistical decision theory more explicitly. However, we’ve encountered resistance in some cases. One reason, I suspect, is that people don’t understand the assumptions made in decision theory, so this is an attempt to walk through things step by step to build confidence. Though there may be other reasons too, related to people distrusting anything that claims to be “rational.”

They solved the human-statistical reasoning interface back in the 80s

Eytan Adar pointed me to a video, Reasoning Under Uncertainty, from the historical ACM SIGCHI (computer-human interaction) Video Project. Beyond fashionable hairstyles, it demos interfaces from a software curriculum to teach high school students statistical reasoning in 1989. They’ve got stretchy histograms, where you can adjust bars to see how the moments change (reminding me of more recent work on sketch-based and other graphical elicitation interfaces, including some of my own), and shifty lines, where you drag a line trying to find the best fit, as in more recent work on regression by eye. Ben Shneiderman’s interface design guidelines came up recently on the blog, and both are nice early examples of interacting with data through direct manipulation, one of the ideas he pioneered. They were also using hypothetical outcome plots years before we attempted to make them cool! 

Sometimes I suspect the majority of the good ideas for statistical software were already out there back in the 80s and 90s, at least in pedagogical form. I’m thinking about work like Andreas Buja and colleagues’ “Elements of a viewing pipeline for data analysis” (from 1988) and their early mentions of graphical inference techniques like the line-up in the 90s. There’s also the ASA Statistical Graphics video library which includes a lot of cool early work.

It’s too bad that many of the mainstream GUI-based interactive visualization systems we see today dropped many of the ideas related to sampling and uncertainty in favor of what we’ve called  “data exposure” (making it as immediate as possible for the user to see and interact with the data itself). Much of the research too, at least in the more computer-science oriented data visualization community, seems more interested in pursuing variants of an exposure philosophy like “behavior-driven optimization”, where the system is trying to learn what data or queries the user might want to see next so as to serve those up. The idea of GUI-based visualization tools that let you simulate processes, for example, has never really caught on (though we’re still trying, e.g., here). 

P.S. I haven’t watched the video that follows it, but it promises to describe ‘The Andrew System’, a computing system that can offer the benefits of many diverse and useful systems within a common environment. Maybe they were on to something there too!

Progress in 2023, Charles edition

Following the examples of Andrew, Aki, and Jessica, and at Andrew’s request:

Published:

Unpublished:

This year, I also served on the Stan Governing Body, where my primary role was to help bring back the in-person StanCon. StanCon 2023 took place at Washington University in St. Louis, MO, and we got the ball rolling for the 2024 edition, which will be held at Oxford University in the UK.

It was also my privilege to be invited as an instructor at the Summer School on Advanced Bayesian Methods at KU Leuven, Belgium, and to teach a 3-day course on Stan and Torsten, as well as workshops at StanCon 2023 and at the University at Buffalo.

Progress in 2023, Jessica Edition

Since Aki and Andrew are doing it… 

Published:

Unpublished/Preprints:

Performed:

If I had to choose a favorite (beyond the play, of course) it would be the rational agent benchmark paper, discussed here. But I also really like the causal quartets paper. The first aims to increase what we learn from experiments in empirical visualization and HCI through comparison to decision-theoretic benchmarks. The second aims to get people to think twice about what they’ve learned from an average treatment effect. Both have influenced what I’ve worked on since.

Ben Shneiderman’s Golden Rules of Interface Design

The legendary computer science and graphics researcher writes:

1. Strive for consistency.

Consistent sequences of actions should be required in similar situations; identical terminology should be used in prompts, menus, and help screens; and consistent color, layout, capitalization, fonts, and so on, should be employed throughout. Exceptions, such as required confirmation of the delete command or no echoing of passwords, should be comprehensible and limited in number.

2. Seek universal usability.

Recognize the needs of diverse users and design for plasticity, facilitating transformation of content. Novice to expert differences, age ranges, disabilities, international variations, and technological diversity each enrich the spectrum of requirements that guides design. Adding features for novices, such as explanations, and features for experts, such as shortcuts and faster pacing, enriches the interface design and improves perceived quality.

3. Offer informative feedback.

For every user action, there should be an interface feedback. For frequent and minor actions, the response can be modest, whereas for infrequent and major actions, the response should be more substantial. Visual presentation of the objects of interest provides a convenient environment for showing changes explicitly.

4. Design dialogs to yield closure.

Sequences of actions should be organized into groups with a beginning, middle, and end. Informative feedback at the completion of a group of actions gives users the satisfaction of accomplishment, a sense of relief, a signal to drop contingency plans from their minds, and an indicator to prepare for the next group of actions. For example, e-commerce websites move users from selecting products to the checkout, ending with a clear confirmation page that completes the transaction.

5. Prevent errors.

As much as possible, design the interface so that users cannot make serious errors; for example, gray out menu items that are not appropriate and do not allow alphabetic characters in numeric entry fields. If users make an error, the interface should offer simple, constructive, and specific instructions for recovery. For example, users should not have to retype an entire name-address form if they enter an invalid zip code but rather should be guided to repair only the faulty part. Erroneous actions should leave the interface state unchanged, or the interface should give instructions about restoring the state.

6. Permit easy reversal of actions.

As much as possible, actions should be reversible. This feature relieves anxiety, since users know that errors can be undone, and encourages exploration of unfamiliar options. The units of reversibility may be a single action, a data-entry task, or a complete group of actions, such as entry of a name-address block.

7. Keep users in control.

Experienced users strongly desire the sense that they are in charge of the interface and that the interface responds to their actions. They don’t want surprises or changes in familiar behavior, and they are annoyed by tedious data-entry sequences, difficulty in obtaining necessary information, and inability to produce their desired result.

8. Reduce short-term memory load.

Humans’ limited capacity for information processing in short-term memory (the rule of thumb is that people can remember “seven plus or minus two chunks” of information) requires that designers avoid interfaces in which users must remember information from one display and then use that information on another display. It means that cellphones should not require reentry of phone numbers, website locations should remain visible, and lengthy forms should be compacted to fit a single display.

Wonderful, wonderful stuff. When coming across this, I saw that Shneiderman taught at the University of Maryland . . . checking his CV, it turns out that he taught there back when I was a student. I could’ve taken his course!

It would be interesting to come up with similar sets of principles for statistical software, statistical graphics, etc. We do have 10 quick tips to improve your regression modeling, so that’s a start.

Progress in 2023

Published:

Unpublished:

Enjoy.

“How not to be fooled by viral charts”

Good post with the above title from economics journalist Noah Smith.

Just for you, I’ll share a few more from some of our old blog posts:

Suspiciously vague graph purporting to show “percentage of slaves or serfs in the world”:


Debunking the so-called Human Development Index of U.S. states:


(Worst) graph of the year:

The worst graph ever made?:


And, ok, this isn’t a “viral chart” at all, but it’s the absolute worst ever:

You can go through the blog archives to find other fun items.

Judgments versus decisions

This is Jessica. A paper called “Decoupling Judgment and Decision Making: A Tale of Two Tails” by Oral, Dragicevic, Telea, and Dimara showed up in my feed the other day. The premise of the paper is that when people interact with some data visualization, their accuracy in making judgments might conflict with their accuracy in making decisions from the visualization. Given that the authors appear to be basing the premise in part on results from a prior paper on decision making from uncertainty visualizations I did with Alex Kale and Matt Kay, I took a look. Here’s the abstract:

Is it true that if citizens understand hurricane probabilities, they will make more rational decisions for evacuation? Finding answers to such questions is not straightforward in the literature because the terms “judgment” and “decision making” are often used interchangeably. This terminology conflation leads to a lack of clarity on whether people make suboptimal decisions because of inaccurate judgments of information conveyed in visualizations or because they use alternative yet currently unknown heuristics. To decouple judgment from decision making, we review relevant concepts from the literature and present two preregistered experiments (N=601) to investigate if the task (judgment vs. decision making), the scenario (sports vs. humanitarian), and the visualization (quantile dotplots, density plots, probability bars) affect accuracy. While experiment 1 was inconclusive, we found evidence for a difference in experiment 2. Contrary to our expectations and previous research, which found decisions less accurate than their direct-equivalent judgments, our results pointed in the opposite direction. Our findings further revealed that decisions were less vulnerable to status-quo bias, suggesting decision makers may disfavor responses associated with inaction. We also found that both scenario and visualization types can influence people’s judgments and decisions. Although effect sizes are not large and results should be interpreted carefully, we conclude that judgments cannot be safely used as proxy tasks for decision making, and discuss implications for visualization research and beyond. Materials and preregistrations are available at https://osf.io/ufzp5/?view only=adc0f78a23804c31bf7fdd9385cb264f. 

There’s a lot being said here, but they seem to be getting at a difference between forming accurate beliefs from some information and making a good (e.g., utility optimal) decision. I would agree there are slightly different processes. But they are also claiming to have a way of directly comparing judgment accuracy to decision accuracy. While I appreciate the attempt to clarify terms that are often overloaded, I’m skeptical that we can meaningfully separate and compare judgments from decisions in an experiment. 

Some background

Let’s start with what we found in our 2020 paper, since Oral et al base some of their questions and their own study setup on it. In that experiment we’d had people make incentivized decisions from displays that varied only how they visualized the decision-relevant probability distributions. Each one showed a distribution of expected scores in a fantasy sports game for a team with and without the addition of a new player. Participants had to decide whether to pay for the new player or not in light of the cost of adding the player, the expected score improvement, and the amount of additional monetary award they won when they scored above a certain number of points. We also elicited a (controversial) probability of superiority judgment: What do you think is the probability your team will score more points with the new player than without? In designing the experiment we held various aspects of the decision problem constant so that only the ground truth probability of superiority was varying between trials. So we talked about the probability judgment as corresponding to the decision task.

However, after modeling the results we found that depending on whether we analyzed results from the probability response question or the incentivized decision, the ranking of visualizations changed. At the time we didn’t have a good explanation for this disparity between what was helpful for doing the probability judgment versus the decision, other than maybe it was due to the probability judgment not being directly incentivized like the decision response was. But in a follow-up analysis that applied a rational agent analysis framework to this same study, allowing us to separate different sources of performance loss by calibrating the participants’ responses for the probability task, we saw that people were getting most of the decision-relevant information regardless of which question they were responding to; they just struggled to report it for the probability question. So we concluded that the most likely reason for the disparity between judgment and decision results was probably that the probability of superiority judgment was not the most intuitive judgment to be eliciting – if we really wanted to elicit the beliefs directly corresponding to the incentivized decision task, we should have asked them for the difference in the probability of scoring enough points to win the award with and without the new player. But this is still just speculation, since we still wouldn’t be able to say in such a setup how much the results were impacted by only one of the responses being incentivized. 
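
To see why the two elicited quantities aren’t interchangeable, here’s an illustrative R sketch with made-up numbers (not the actual stimuli from the 2020 study). With these numbers the new player is “better” in the probability-of-superiority sense, but adding them is not worth the cost once you look at the award-relevant probabilities:

# Made-up score distributions with and without the new player
n_sims <- 1e5
score_without <- rnorm(n_sims, mean = 100, sd = 15)
score_with    <- rnorm(n_sims, mean = 105, sd = 15)

# The elicited judgment: probability of superiority
mean(score_with > score_without)                                 # about 0.59

# The incentivized decision: add the player only if expected payoff is higher
threshold <- 110   # points needed to win the award
award <- 5         # value of the award
cost <- 1          # cost of adding the player
payoff_without <- award * mean(score_without > threshold)        # about 1.26
payoff_with    <- award * mean(score_with > threshold) - cost    # about 0.85
ifelse(payoff_with > payoff_without, "add the player", "keep the current team")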

Oral et al. gloss over this nuance, interpreting our results as finding “decisions less accurate than their direct-equivalent judgments,” and then using this as motivation to argue that “the fact that the best visualization for judgment did not necessarily lead to better decisions reveals the need to decouple these two tasks.” 

Let’s consider for a moment by what means we could try to eliminate ambiguity in comparing probability judgments to the associated decisions. For instance, if only incentivizing one of the two responses confounds things, we might try incentivizing the probability judgment with its own payoff function, and compare the results to the incentivized decision results. Would this allow us to directly study the difference between judgments and decision-making? 

I argue no. For one, we would need to use different scoring rules for the two different types of response, and things might rank differently depending on the rule (not to mention one rule might be easier to optimize under). But on top of this, I would argue that once you provide a scoring rule for the judgment question, it becomes hard to distinguish that response from a decision by any reasonable definition. In other words, you can’t eliminate confounds that could explain a difference between “judgment” and “decision” without turning the judgment into something indistinguishable from a decision. 

What is a decision? 

The paper by Oral et al. describes abundant confusion in the literature about the difference between judgment and decision-making, proposing that “One barrier to studying decision making effectively is that judgments and decisions are terms not well-defined and separated.“ They criticize various studies on visualizations for claiming to study decisions when they actually study judgments. Ultimately they describe their view as:

In summary, while decision making shares similarities with judgment, it embodies four distinguishing features: (I) it requires a choice among alternatives, implying a loss of the remaining alternatives, (II) it is future-oriented, (III) it is accompanied with overt or covert actions, and (IV) it carries a personal stake and responsibility for outcomes. The more of these features a judgment has, the more “decision-like” it becomes. When a judgment has all four features, it no longer remains a judgment and becomes a decision. This operationalization offers a fuzzy demarcation between judgment and decision making, in the sense that it does not draw a sharp line between the two concepts, but instead specifies the attributes essential to determine the extent to which a cognitive process is a judgment, a decision, or somewhere in-between [58], [59].

This captures components of other definitions of decision I’ve seen in research related to evaluating interfaces, e.g., a decision as “a choice between alternatives,” typically involving “high stakes.” However, like these other definitions, I don’t think Oral et al.’s definition very clearly differentiates a decision from other forms of judgment. 

Take the “personal stake and responsibility for outcomes” part. How do we interpret this given that we are talking about subjects in an experiment, not decisions people are making in some more naturalistic context?    

In the context of an experiment, we control the stakes and one’s responsibility for their action via a scoring rule. We could instead ask people to imagine making some life-or-death decision in our study and call it high stakes, as many researchers do. But they are in an experiment, and they know it. In the real world people have goals, but in an experiment you have to endow them with goals.

So we should incentivize the question to ensure participants have some sense of the consequences associated with what they decide. We can ask them to separately report their beliefs, e.g., what they perceive some decision-relevant probability to be, as we did in the 2020 study. But if we want to eliminate confounds between the decision and the judgment, we should incentivize the belief question too, ideally with a proper scoring rule so that it’s in their best interest to tell the truth. Now both our decision task and our judgment task, from the standpoint of the experiment subject, would seem to have some personal stake. So we can’t easily distinguish the decision based on its personal stakes.

Oral et al. might argue that the judgment question is still not a decision, because there are three other criteria for a decision according to their definition. Considering (I), will asking for a person’s belief require them to make a choice between alternatives? Yes, it will. Because any format we elicit their response in will naturally constrain it. Even if we just provide a text box to type in a number between 0 and 1, we’re going to get values rounded at some decimal place. So it’s hard to use “a choice among alternatives” as a distinguishing criterion in any actual experiment. 

What about (II), being future-oriented? Well, if I’m incentivizing the question then it will be just as future-oriented as my decision is, in that my payoff depends on my response and the ground truth, which is unknown to me until after I respond.

When it comes to (III), overt or covert actions, as in (I), in any actual experiment, my action space will be some form of constrained response space. It’s just that now my action is my choice of which beliefs to report. The action space might be larger, but again there is no qualitative difference between choosing what beliefs to report and choosing what action to report in some more constrained decision problem.

To summarize, by trying to put judgments and decisions on equal footing by scoring both, I’ve created something that seems to achieve Oral et al.’s definition of decision. While I do think there is a difference between a belief and a decision, I don’t think it’s so easy to measure these things without leaving open various other explanations for why the responses differ.

In their paper, Oral et al. sidestep incentivizing participants directly, assuming they will be intrinsically motivated. They report on two experiments where they used a task inspired by our 2020 paper (showing visualizations of expected score distributions and asking, Do you want the team with or without the new player, where the participant’s goal is to win a monetary award that requires scoring a certain number of points). Instead of incentivizing the decision with the scoring rule, they told participants to try to be accurate. And instead of eliciting the corresponding probabilistic beliefs for the decision, they asked them two questions: Which option (team) is better?, and Which of the teams do you choose? They interpret the first answer as the judgment and the second as the decision. 

I can sort of see what they are trying to do here, but this seems like essentially the same task to me. Especially if you assume people are intrinsically motivated to be accurate and plan to evaluate responses using the same scoring rule, as they do. Why would we expect a difference between these two responses? To use a different example that came up in a discussion I was having with Jason Hartline, if you imagine a judge who cares only about doing the right thing (convicting the guilty and acquitting the innocent), who must decide whether to acquit or convict a defendant, why would you expect a difference (in accuracy) when you ask them ‘Is he guilty’ versus ‘Will you acquit or convict?’ 

In their first experiment using this simple wording, Oral et al. find no difference between responses to the two questions. In a second experiment they slightly changed the wording of the questions to emphasize that one was “your judgment” and one was “your decision.” This leads to what they say is suggestive evidence that people’s decisions are more accurate than their judgments. I’m not so sure.

The takeaway

It’s natural to conceive of judgments or beliefs as being distinct from decisions. If we subscribe to a Bayesian formulation of learning from data, we expect the rational person to form beliefs about the state of the world and then choose the utility maximizing action given those beliefs. However, it is not so natural to try to directly compare judgments and decisions on equal footing in an experiment. 

More generally, when it comes to evaluating human decision-making (what we generally want to do in research related to interfaces) we gain little by preferring colloquial verbal definitions over the formalisms of statistical decision theory, which provide tools designed to evaluate people’s decisions ex-ante. It’s much easier to talk about judgment and decision-making when we have a formal way of representing a decision problem (i.e., state space, action space, data-generating model, scoring rule), and a shared understanding of what the normative process of learning from data to make a decision is (i.e., start with prior beliefs, update them given some signal, choose the action that maximizes your expected score under the data-generating model). In this case, we could get some insight into how judgments and decisions can differ simply by considering the process implied by expected utility theory. 

On a proposal to scale confidence intervals so that their overlap can be more easily interpreted

Greg Mayer writes:

Have you seen this paper by Frank Corotto, recently posted to a university repository?

It advocates a way of doing box plots using “comparative confidence intervals” based on Tukey’s HSD in lieu of traditional error bars. I would question whether the “Error Bar Overlap Myth” is really a myth (i.e. a widely shared and deeply rooted but imaginary way of understanding the world) or just a more or less occasional misunderstanding, but whatever its frequency, I thought you might be interested, given your longstanding aversion to box plots, and your challenge to the world to find a use for them. (I, BTW, am rather fond of box plots.)

My reply: Clever but I can’t imagine ever using this method or recommending it to others. The abstract connects the idea to Tukey, and indeed the method reminds me of some of Tukey’s bad ideas from the 1950s involving multiple comparisons. I think the problem here is in thinking of “statistical significance” as a goal in the first place!

I’m not saying it was a bad idea for this paper to be written. The concept could be worth thinking about, even if I would not recommend it as a method. Not every idea has to be useful. Interesting is important too.
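
For readers unfamiliar with the overlap issue being addressed, here’s a small numerical illustration in R of the general scaled-interval idea (not necessarily Corotto’s exact construction): two independent estimates whose 95% intervals overlap even though the difference between them is significant at the 5% level, along with intervals rescaled so that non-overlap roughly corresponds to that significance threshold.

# Two independent estimates with equal standard errors
est1 <- 10; se1 <- 1
est2 <- 13; se2 <- 1

# The 95% intervals overlap...
est1 + c(-1, 1) * 1.96 * se1   #  8.04 to 11.96
est2 + c(-1, 1) * 1.96 * se2   # 11.04 to 14.96

# ...but the difference is significant at the 5% level:
z <- (est2 - est1) / sqrt(se1^2 + se2^2)   # about 2.12
2 * pnorm(-abs(z))                         # p is about 0.034

# Intervals scaled by 1.96/sqrt(2), i.e., roughly 83-84% intervals: for two
# independent estimates with similar standard errors, non-overlap of these
# approximately corresponds to p < 0.05 for the difference.
est1 + c(-1, 1) * (1.96 / sqrt(2)) * se1   #  8.61 to 11.39
est2 + c(-1, 1) * (1.96 / sqrt(2)) * se2   # 11.61 to 14.39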

EDA and modeling

This is Jessica. This past week we’ve been talking about exploratory data analysis (EDA) in my interactive visualization course for CS undergrads, which is one of my favorite topics. I get to talk about model checks and graphical inference, why some people worry about looking at data too much, the limitations of thinking about the goal of statistical analysis as rejecting null hypotheses, etc. If nothing else, I think the students get intrigued because they can tell I get worked up about these things!

However, I was also reminded last week in reading some recent papers that there are still a lot of misconceptions about exploratory data analysis in research areas like visualization and human-computer interaction. EDA is sometimes described by well-meaning researchers as being essentially model-free and hypothesis-free, as if it’s a very different style of analysis than what happens when an analyst is exploring some data with some hunches about what they might find. 

It bugs me when people use the term EDA as synonymous with having few to no expectations about what they’ll find in the data. Identifying the unexpected is certainly part of EDA, but casting the analyst as a blank slate loses much of the nuance. For one, it’s hard to even begin making graphics if you truly have no idea what kinds of measurements you’re working with. And once you learn how the data were collected, you probably begin to form some expectations. It also mischaracterizes the natural progression as you build up understanding of the data and consider possible interpretations. Tukey for instance wrote about different phases in an exploratory analysis, some of which involve probabilistic reasoning in the sense of assessing “With what accuracy are the appearances already found to be believed?” Similar to people assuming that “Bayesian” is equivalent to Bayes rule, the term EDA is often used to refer to some relatively narrow phase of analysis rather than something multi-faceted and nuanced. 

As Andrew and I wrote in our 2021 Harvard Data Science Review article, the simplistic (and unrealistic) view of EDA as not involving any substantive a priori expectations on the part of the analyst can be harmful for practical development of visualization tools. It can lead to a plethora of graphical user interface systems, both in practice and research, that prioritize serving up easy-to-parse views of the data, at the expense of surfacing variation and uncertainty or enabling the analyst to interrogate their expectations. These days we have lots of visualization recommenders for recommending the right chart type given some query, but it’s usually about getting the choice of encodings (position, size, etc.) right. 

What is better? In the article we had considered what a GUI visual analysis tool might look like if it took the idea of visualization as model checking seriously, including displaying variation and uncertainty by default and making it easier for the analyst to specify and check the data against provisional statistical models that capture relationships they think they see. (In Tableau Software, for example, it’s quite a pain to fit a simple regression to check its predictions against the data). But there was still a leap left after we wrote this, between proposing the ideas and figuring out how to implement this kind of support in a way that would integrate well with the kinds of features that GUI systems offer without resulting in a bunch of new problems. 

So, Alex Kale, Ziyang Guo, Xiao-li Qiao, Jeff Heer, and I recently developed EVM (Exploratory Visual Modeling), a prototype Tableau-style visual analytics tool where you can drag and drop variables to generate visualizations, but which also includes a “model bar.” Using the model bar, the analyst can specify provisional interpretations (in the form of regression) and check their predictions against the observed data. The initial implementation provides support for a handful of common distribution families and takes input in the form of Wilkinson-Pinheiro-Bates syntax. 
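
This is not EVM itself, but here’s a rough R sketch, using the ggplot2 mpg example data, of the kind of provisional model check the model bar is meant to make easy: specify a regression in formula syntax, simulate a fake dataset from the fitted model, and compare it visually to the observed data.

# Provisional model check: does data simulated from the model look like the real data?
library(ggplot2)
data(mpg, package = "ggplot2")

fit <- lm(hwy ~ displ + class, data = mpg)        # provisional model, formula syntax
mpg$predicted <- simulate(fit, nsim = 1)[[1]]     # one draw of fake data from the fit

# Overlay observed and model-simulated values, faceted the same way as the view
long <- rbind(data.frame(mpg, value = mpg$hwy,       source = "observed"),
              data.frame(mpg, value = mpg$predicted, source = "model"))
ggplot(long, aes(displ, value, color = source)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ class)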

The idea is that generating predictions under different model assumptions absolves the analyst from having to rely so heavily on their imagination to assess hunches they have about which variables have explanatory power. If I think I see some pattern as I’m trying out different visual structures (e.g., facetting plots by different variables) I can generate models that correspond to the visualization I’m looking at (in the sense of having the same variables as predictors as shown in the plot), as well as view-adjacent models, that might add or remove variables relative to the visualization specification.

As we were developing EVM, we quickly realized that trying to pair the model and the visualization by constraining them to involve the same variables is overly restrictive. And a visualization will generally map to multiple possible statistical models, so why aim for congruence?

I see this project, which Alex presented this week at IEEE VIS in Melbourne, as an experiment rather than a clear success or failure. There have been some interesting ideas proposed over the years related to graphical inference, and the connection between visualizations and statistical models, but I’ve seen few attempts to locate them in existing workflows for visual analysis like those supported by GUI tools. Line-ups, for instance, which hide a plot of the observed data amongst a line-up of plots representing the null hypothesis, are a cool idea, but the implementations I’ve seen have been standalone software packages (e.g., in R) rather than attempts to integrate them into the types of visual analysis tools that non-programmers are using. To bring these ideas into existing tools, we have to think about what kind of workflow we want to encourage, and how to avoid new potential failure modes. For example, with EVM there’s the risk that the ability to directly check different models as one looks at the data leaves analysts with a sense that they’ve thoroughly checked their assumptions and can be even more confident about what explains the patterns. That’s not what we want.

Playing around with the tool ourselves has been interesting, in that it’s forced us to think about what the ideal use of this kind of functionality is, and under what conditions it seems to clearly benefit an analysis over not having it. The benefits are nuanced. We also had 12 people familiar with visual analysis in tools like Tableau use the system, and observed how their analyses of datasets we gave them seemed to differ from what they did without the model bar. Without it they all briefly explored patterns across a broad set of available variables and then circled back to recheck relationships they had already investigated. Model checking, on the other hand, tended to structure all but one participant’s thinking around one or two long chains of operations geared toward gradually improving models, through trying out different ways of modeling the distribution of the outcome variable, or the selection of predictor variables. This did seem to encourage thinking about the data-generating process, which was our goal, though a few of them got fixated on details in the process, like trying to get a perfect visual match between predictions and observed data (without any thought as to what they were changing in the model spec).

Figuring out how to avoid these risks requires understanding who exactly can benefit from this, which is itself not obvious because people use these kinds of GUI visual analysis tools in lots of different ways, from data diagnostics and initial data analysis to dashboard construction as a kind of end-user programming. If we think that a typical user is not likely to follow up on their visual interpretations by gathering new data to check if they still hold, then we might need to build in hold-out sets to prevent perceptions that models fit during data exploration are predictive. To improve the ecosystem of visual analysis tools, we need to understand goals, workflow, and expertise.

“Are there clear examples of the opposite idea, where four visually similar visualizations can have vastly different numerical stats?”

Geoffrey Yip writes:

You've written before on how numerical stats can mislead people. There are great visuals for this idea through the Causal Quartets or Anscombe's Quartet. Are there clear examples of the opposite idea, where four visually similar visualizations can have vastly different numerical stats?

My reply: Sure, graphs can be horrible and convey no information or even actively mislead. Tables can be hard to read, but I guess that, without actually faking the numbers, it would be hard to make a table that's as misleading as some very bad graphs. But I think that a good graph should always be better than the corresponding table; see this article from 2002 for an exploration of this point.

Evaluating Visualizations for Inference and Decision-Making (Jessica Hullman’s talk in the Columbia statistics seminar next Monday)

Social Work Bldg room 903, at 4pm on Mon 18 Sep 2023:

Evaluating Visualizations for Inference and Decision-Making

Research and development in computer science and statistics have produced increasingly sophisticated software interfaces for interactive visual data analysis. Data visualizations have also become ubiquitous for communication in the news and scientific publishing. Despite these successes, our understanding of how to design effective visualizations for data-driven decision-making remains limited. Design philosophies that emphasize data exploration and hypothesis generation can encourage pattern-finding at the expense of quantifying uncertainty. Designing visualizations to maximize perceptual accuracy and self-reported satisfaction can lead people to adopt visualizations that promote overconfident interpretations. I will motivate a few alternative objectives for measuring the effectiveness of visualization, and show how a rational agent framework based in statistical decision theory can help us understand the value of a visualization in the abstract and in light of empirical study results.

This is a super-important topic, also interesting because in many cases people think evaluation is a big deal without thinking too hard about what the goals of the graph are. It's hard to design a good evaluation without having some goals in mind. For example, see this discussion from a few years ago of a study that was described as finding that "chartjunk is more useful than plain graphs" and this paper with Antony Unwin on the different goals of infovis and statistical graphics.

Another thing I like about the above abstract from Jessica is how she’s talking about two different goals of statistical graphics: (1) clarifying a point that you want to convey, and (2) providing opportunity for discovery. Both are important!

A rational agent framework for improving visualization experiments

This is Jessica. In The Rational Agent Benchmark for Data Visualization, Yifan Wu, Ziyang Guo, Michalis Mamakos, Jason Hartline and I write: 

Understanding how helpful a visualization is from experimental results is difficult because the observed performance is confounded with aspects of the study design, such as how useful the information that is visualized is for the task. We develop a rational agent framework for designing and interpreting visualization experiments. Our framework conceives two experiments with the same setup: one with behavioral agents (human subjects), and the other one with a hypothetical rational agent. A visualization is evaluated by comparing the expected performance of behavioral agents to that of a rational agent under different assumptions. Using recent visualization decision studies from the literature, we demonstrate how the framework can be used to pre-experimentally evaluate the experiment design by bounding the expected improvement in performance from having access to visualizations, and post-experimentally to deconfound errors of information extraction from errors of optimization, among other analyses.

I like this paper. Part of the motivation behind it was my feeling that even when we do our best to rigorously define a decision or judgment task for studying visualizations,  there’s an inevitable dependence of the results on how we set up the experiment. In my lab we often put a lot of effort into making the results of experiments we run easier to interpret, like plotting model predictions back to data space to reason about magnitudes of effects, or comparing people’s performance on a task to simple baselines. But these steps don’t really resolve this dependence. And if we can’t even understand how surprising our results are in light of our own experiment design, then it seems even more futile to jump to speculating what our results imply for real world situations where people use visualizations. 

We could summarize the problem in terms of various sources of unresolved ambiguity when experiment results are presented. Experimenters make many decisions in designing a study, some of which they may not even be aware they are making, and these decisions influence the range of possible effects we might see in the results. When studying information displays in particular, we might wonder about things like:

  • The extent to which performance differences are likely to be driven by differences in the amount of task-relevant information the displays convey. For example, different visualization strategies for showing a distribution often vary in how much they summarize the data (e.g., means versus intervals versus density plots).
  • How instrumental the information display is to doing well on the task: if participants understood the problem but answered without looking at the visualization, how well would we expect them to do?
  • To what extent participants in the study could be expected to be incentivized to use the display. 
  • What part of the process of responding to the task – extracting the information from the display, or figuring out what to do with it once it was extracted – led to observed losses in performance among study participants. 
  • And so on.

The status quo approach to writing results sections seems to be to let the reader form their own opinions on these questions. But as readers we’re often not in a good position to understand what we are learning unless we take the time to analyze the decision problem of the experiment carefully ourselves, assuming the authors have even presented it in enough detail to make that possible. Few readers are going to be willing and/or able to do this. So what we take away from the results of empirical studies on visualizations is noisy to say the least.

An alternative, which we explore in this paper, is to construct benchmarks using the experiment design to make the results more interpretable. First, we take the decision problem used in a visualization study and formulate it in decision-theoretic terms: a data-generating model over an uncertain state drawn from some state space, an action chosen from some action space, a visualization strategy, and a scoring rule. (At least in theory, we shouldn't have trouble picking up a paper describing an evaluative experiment and identifying these components, though in practice, in fields where many experimenters aren't thinking very explicitly about things like scoring rules at all, it might not be so easy.) We then conceive a rational agent who knows the data-generating model and understands how the visualizations (signals) are generated, and compare this agent's performance under different assumptions in pre-experimental and post-experimental analyses.

Pre-experimental analysis: One reason for analyzing the decision task pre-experimentally is to identify cases where we have designed an experiment to evaluate visualizations but we haven’t left a lot of room to observe differences between them, or we didn’t actually give participants an incentive to look at them. Oops! To define the value of information to the decision problem we look at the difference between the rational agent’s expected performance when they only have access to the prior versus when they know the prior and also see the signal (updating their beliefs and choosing the optimal action based on what they saw). 
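In symbols (my own shorthand for the setup, so take the notation loosely): write θ for the uncertain state drawn from a prior π, v for the signal the visualization strategy produces, a for an action, and S(a, θ) for the scoring rule. Then the value of information is

\mathrm{VOI} \;=\; \mathbb{E}_{v}\Big[\max_{a}\, \mathbb{E}_{\theta \mid v}\, S(a,\theta)\Big] \;-\; \max_{a}\, \mathbb{E}_{\theta \sim \pi}\, S(a,\theta),

that is, the rational agent's expected score when it updates on the signal and acts optimally, minus its expected score acting on the prior alone.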

The value of information captures how much having access to the visualization is expected to improve performance on the task, in payoff space. When multiple visualization strategies are being compared, we calculate it using the maximally informative strategy. Pre-experimentally, we can look at the size of the value-of-information unit relative to the range of possible scores given by the scoring rule. If the expected difference in score between making the decision after looking at the visualization and making it from the prior alone is a small fraction of the range of possible scores on a trial, then we don't have much "room" to observe gains in performance (when studying a single visualization strategy) or differences in performance (more commonly, when comparing several visualization strategies).
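Here's a toy numerical version of that "room" argument (my own made-up example, not one from the paper): a binary state with prior 0.3, a 0/1 scoring rule, and a display that reports the state correctly 80% of the time.

# rational agent with only the prior picks the action with higher expected score
p_state <- 0.3                                   # prior probability the state is 1
prior_score <- max(p_state, 1 - p_state)         # = 0.7 under a 0/1 scoring rule

# the "visualization" reports the state correctly with probability 0.8
acc <- 0.8
p_signal_1 <- acc * p_state + (1 - acc) * (1 - p_state)
post_1 <- acc * p_state / p_signal_1             # P(state = 1 | signal = 1)
post_0 <- (1 - acc) * p_state / (1 - p_signal_1) # P(state = 1 | signal = 0)

# expected score when the agent sees the signal and acts on the posterior
signal_score <- p_signal_1 * max(post_1, 1 - post_1) +
  (1 - p_signal_1) * max(post_0, 1 - post_0)     # = 0.8

value_of_information <- signal_score - prior_score   # = 0.1
# only a tenth of the 0-to-1 score range is available for the display to "earn"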

We can also pre-experimentally compare the value of information to the baseline reward one expects to get for doing the experiment regardless of performance. Assuming we think people are motivated by payoffs (which is implied whenever we pay people for their participation), a value of information that is a small fraction of the expected baseline reward should make us question how likely participants are to put effort into the task.   

Post-experimental analysis: The value of information also comes in handy post-experimentally, when we are trying to make sense of why our human participants didn't do as well as the rational agent benchmark. We can look at what fraction of the value-of-information unit human participants achieve with different visualizations. We can also differentiate sources of error by calibrating the human responses. The calibrated behavioral score is the expected score of a rational agent who knows the prior but, instead of updating from the joint distribution over the signal and the state, updates from the joint distribution over the behavioral responses and the state. This distribution may contain information that the agents were unable to act on. Calibrating (at least in the case of non-binary decision tasks) helps us see how much.

Specifically, calculating the difference between the calibrated score and the rational agent benchmark as a fraction of the value of information measures the extent to which participants couldn't extract the task-relevant information from the stimuli. Calculating the difference between the calibrated score and the expected score of human participants (e.g., as predicted by a model fit to the observed results) as a fraction of the value of information measures the extent to which participants couldn't choose the optimal action given the information they gained from the visualization.
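Writing R for the rational agent benchmark, C for the calibrated behavioral score, and B for the expected behavioral score (my shorthand, not necessarily the paper's notation), those two quantities are roughly

\frac{R - C}{\mathrm{VOI}} \;\; \text{(loss from information extraction)} \qquad \text{and} \qquad \frac{C - B}{\mathrm{VOI}} \;\; \text{(loss from acting on what was extracted)}.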

There is an interesting complication to all of this: many behavioral experiments don’t endow participants with a prior for the decision problem, but the rational agent needs to know the prior. Technically the definitions of the losses above should allow for loss caused by not having the right prior. So I am simplifying slightly here.  

To demonstrate how all this formalization can be useful in practice, we chose a couple of prior award-winning visualization research papers and applied the framework. Both are papers I'm an author on (why create new methods if you can't learn things about your own work?). In both cases, we discovered things that the original papers did not account for, such as weak incentives to consult the visualization assuming you understood the task, and a better explanation for a disparity in how visualization strategies ranked by performance on a belief task versus a decision task. These were the first two papers we tried to apply the framework to, not cherry-picked to be easy targets. We've also already applied it in other experiments we've done, such as benchmarking privacy budget allocation in visual analysis.

I continue to consider myself a very skeptical experimenter, since at the end of the day, decisions about whether to deploy some intervention in the world will always hinge on the (unknown) mapping between the world of your experiment and the real-world context you're trying to approximate. But I like the idea of making greater use of rational agent frameworks in visualization, because we can at least gain a better understanding of what our results mean in the context of the decision problem we are studying.

My two courses this fall: “Applied Regression and Causal Inference” and “Communicating Data and Statistics”

POLS 4720, Applied Regression and Causal Inference:

This is a fast-paced one-semester course on applied regression and causal inference based on our book, Regression and Other Stories. The course has an applied and conceptual focus that's different from other available statistics courses.
Topics covered in POLS 4720 include:
• Applied regression: measurement, data visualization, modeling and inference, transformations, linear regression, and logistic regression.
• Simulation, model fitting, and programming in R.
• Causal inference using regression.
• Key statistical problems include adjusting for differences between sample and population, adjusting for differences between treatment and control groups, extrapolating from past to future, and using observed data to learn about latent constructs of interest.
• We focus on social science applications, including but not limited to: public opinion and voting, economic and social behavior, and policy analysis.
The course is set up using the principles of active learning, with class time devoted to student-participation activities, computer demonstrations, and discussion problems.

The primary audience for this course is Poli Sci Ph.D. students, and it should also be ideal for statistics-using graduate students or advanced undergraduates in other departments and schools, as well as students in fields such as computer science and statistics who’d like to get an understanding of how regression and causal inference work in the real world!

STAT 6106, Communicating Data and Statistics:

This is a one-semester course on communicating data and statistics, covering the following modes of communication:
• Writing (including storytelling, writing technical articles, and writing for general audiences)
• Statistical graphics (including communicating variation and uncertainty)
• Oral communication (including teaching, collaboration, and giving presentations).
The course is set up using the principles of active learning, with class time devoted to discussions, collaborative work, practicing and evaluation of communication skills, and conversations with expert visitors.

The primary audience for this course is Statistics Ph.D. students, and it should also be ideal for Ph.D. students who do quantitative work in other departments and schools. Communication is sometimes thought of as a soft skill, but it is essential to statistics and scientific research more generally!

See you there:

Both courses have lots of space available, so check them out! In-person attendance is required, as class participation is crucial for both. POLS 4720 is offered Tu/Th 8:30-10am; STAT 6106 will be M/W 8:30-10am. These are serious classes, with lots of homework. Enjoy.

In the real world people have goals and beliefs. In a controlled experiment, you have to endow them

This is Jessica. A couple weeks ago I posted on the lack of standardization in how people design experiments to study judgment and decision making, especially in applied areas of research like visualization, human-centered AI, privacy and security, NLP, etc. My recommendation was that researchers should be able to define the decision problems they are studying in terms of the uncertain state on which the decision or belief report in each trial is based, the action space defining the range of allowable responses, the scoring rule used to incentivize and/or evaluate the reports, and the process that generates the signals (i.e., stimuli) that inform on the state. And that not being able to define these things points to limitations in our ability to interpret the results we get.

I am still thinking about this topic, and why I feel strongly that when the participant isn’t given a clear goal to aim for in responding, i.e., one that is aligned with the reward they get on the task, it is hard to interpret the results. 

It's fair to say that when we interpret the results of experiments involving human behavior, we tend to be optimistic about how what we observe in the experiment relates to people's behavior in the "real world." The default assumption is that the experiment results can help us understand how people behave in some realistic setting that the experimental task is meant to proxy for. There sometimes seems to be a divide among researchers, between a) those who believe that judgment and decision tasks studied in controlled experiments can be loosely based on real-world tasks without worrying about things being well-defined in the context of the experiment, and b) those who think that the experiment should provide (and communicate to participants) some unambiguously defined way to distinguish "correct" or at least "better" responses, even if we can't necessarily show that this standard matches one we expect to operate in the real world.

From what I see, more researchers running controlled studies in applied fields are in the former camp, whereas the latter perspective is more standard in behavioral economics. Those in applied fields appear to think it's ok to put people in a situation where they are presented with some choice or asked to report their beliefs about something, without spelling out exactly how what they report will be evaluated or how their payment for doing the experiment will be affected. And I will admit that I too have run studies that use under-defined tasks in the past.

Here are some reasons I've heard for not using a well-defined task in a study:

People won’t behave differently if I do that. People will sometimes cite evidence that behavior in experiments doesn’t seem very responsive to incentive schemes, extrapolating from this that giving people clear instructions on how they should think about their goals in responding (i.e., what constitutes good versus bad judgments or decisions) will not make a difference. So it’s perceived as valid to just present some stuff (treatments) and pose some questions and compare how people respond.

The real world version of this task is not well-defined. Imagine studying how people use dashboards giving information about a public health crisis, or election forecasts. Someone might argue that there is no single common decision or outcome to be predicted in the real world when people use such information, and even if we choose some decision like ‘should I wear a mask’ there is no clear single utility function, so it’s ok not to tell participants how their responses will be evaluated in the experiment. 

Having to understand a scoring rule will confuse people. Relatedly, people worry that constructing a task where there is some best response will require explaining complicated incentives to study participants. They might get confused, which will interfere with their “natural” judgment processes in this kind of situation. 

I do not find these reasons very satisfying. The problem is how to interpret the elicited responses. Sure, it may be true that in some situations participants will act more or less the same whether you put some display of information on X in front of them and say "make this decision based on what you know about X," or you display the same information and ask the same thing but also explain exactly how you will judge the quality of their decision. But I don't think it matters if they act the same. There is still a difference: in the latter case, where you've defined what a good versus bad judgment or decision is, you know that the participants know (or at least that you've attempted to tell them) what their goal is when responding. And ideally you've given them a reason to try to achieve that goal (incentives). So you can interpret their responses as their attempt at fulfilling that goal given the information they had at hand. In terms of the loss you observe in responses relative to the best possible performance, you still can't disambiguate the effect of their not understanding the instructions from their inability to perform well on the task despite understanding it. But you can safely consider the loss you observe as reflecting an inability to do that task (in the context of the experiment) properly. (Of course, if your scoring rule isn't proper, then you shouldn't expect them to be truthful even under perfect understanding of the task. But the point is that we can be fairly specific about the unknowns.)
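As an aside on that last parenthetical, here's a quick R sketch (my own toy example, not from any particular study) of why properness matters: under a quadratic (Brier) score, someone who understands the task maximizes their expected score by reporting their true belief.

# expected Brier-type score for reporting r when your true belief is p
true_belief <- 0.7
reports <- seq(0, 1, by = 0.01)
expected_score <- function(r, p) -(p * (1 - r)^2 + (1 - p) * r^2)

plot(reports, expected_score(reports, true_belief), type = "l",
     xlab = "reported probability", ylab = "expected score")
abline(v = true_belief, lty = 2)   # the curve peaks at the true belief, 0.7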

When you ask for some judgment or decision but don’t say anything about how that’s evaluated, you are building variation in how the participants interpret the task directly into your experiment design. You can’t say what their responses mean in any sort of normative sense, because you don’t know what scoring rule they had in mind. You can’t evaluate anything. 

Again this seems rather obvious, if you’re used to formulating statistical decision problems. But I encounter examples all around me that appear at odds with this perspective. I get the impression that it’s seen as a “subjective” decision for the researcher to make in fields like visualization or human-centered AI. I’ve heard studies that define tasks in a decision theoretic sense accused of “overcomplicating things.” But then when it’s time to interpret the results, the distinction is not acknowledged, and so researchers will engage in quasi-normative interpretation of responses to tasks that were never well defined to begin with.

This problem seems to stem from a failure to acknowledge the differences between behavior in the experimental world versus in the real world: We do experiments (almost always) to learn about human behavior in settings that we think are somehow related to real world settings. And in the real world, people have goals and prior beliefs. We might not be able to perceive what utility function each individual person is using, but we can assume that behavior is goal-directed in some way or another. Savage’s axioms and the derivation of expected utility theory tell us that for behavior to be “rationalizable”, a person’s choices should be consistent with their beliefs about the state and the payoffs they expect under different outcomes.
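In the usual formalization (a one-line summary, not a quote from Savage), a choice is rationalizable if it maximizes expected utility under the person's subjective beliefs about the state:

a^* \in \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{\theta \sim p(\theta)}\big[\, U(a, \theta) \,\big],

where p(θ) is the person's belief over states and U their utility function. The experimenter may not observe either, but goal-directed behavior presumes they exist.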

When people are in an experiment, the analogous real world goals and beliefs for that kind of task will not generally apply. For example, people might take actions in the real world for intrinsic value – e.g., I vote because I feel like I’m not a good citizen if I don’t vote. I consult the public health stats because I want to be perceived by others as informed. But it’s hard to motivate people to take actions based on intrinsic value in an experiment, unless the experiment is designed specifically to look at social behaviors like development of norms or to study how intrinsically motivated people appear to be to engage with certain content. So your experiment needs to give them a clear goal. Otherwise, they will make up a goal, and different people may do this in different ways. And so you should expect the data you get back to be a hot mess of heterogeneity. 

To be fair, the data you collect may well be a hot mess of heterogeneity anyway, because it’s hard to get people to interpret your instructions correctly. We have to be cautious interpreting the results of human-subjects experiments because there will usually be ambiguity about the participants’ understanding of the task. But at least with a well-defined task, we can point to a single source of uncertainty about our results. We can narrow down reasons for bad performance to either real challenges people face in doing that task or lack of understanding the instructions. When the task is not well-defined, the space of possible explanations of the results is huge. 

Another way of saying this is that we can only really learn things about behavior in the artificial world of the experiment. As much as we might want to equate it with some real world setting, extrapolating from the world of the controlled experiment to the real world will always be a leap of faith. So we better understand our experimental world. 

A challenge when you operate under this understanding is how to explain to people who have a more relaxed attitude about experiments why you don’t think that their results will be informative. One possible strategy is to tell people to try to see the task in their experiment from the perspective of an agent who is purely transactional or “rational”:

Imagine your experiment through the eyes of a purely transactional agent, whose every action is motivated by what external reward they perceive to be in it for them. (There are many such people in the world actually!) When a transactional agent does an experiment, they approach each question they are asked with their own question: How do I maximize my reward in answering this? When the task is well-defined and explained, they have no trouble figuring out what to do, and proceed with doing the experiment. 

However, when the transactional agent reaches a question for which they can't determine how to maximize their reward, because they haven't been given enough information, they shut down. This is because they are (quite reasonably) unwilling to take a guess at what they should do when it hasn't been made clear to them.

But imagine that our experiment requires them to keep answering questions. How should we think about the responses they provide? 

We can imagine many strategies they might use to make up a response. Maybe they try to guess what you, as the experimenter, think is the right answer. Maybe they attempt to randomize. Maybe they can't be bothered to think at all and they call in the nearest cat or three-year-old to act on their behalf.

We could probably make this exercise more precise, but the point is that if you would not be comfortable interpreting the data you get under the above conditions, then you shouldn’t be comfortable interpreting the data you get from an experiment that uses an under-defined task.