In the real world people have goals and beliefs. In a controlled experiment, you have to endow them

This is Jessica. A couple weeks ago I posted on the lack of standardization in how people design experiments to study judgment and decision making, especially in applied areas of research like visualization, human-centered AI, privacy and security, NLP, etc. My recommendation was that researchers should be able to define the decision problems they are studying in terms of the uncertain state on which the decision or belief report in each trial is based, the action space defining the range of allowable responses, the scoring rule used to incentivize and/or evaluate the reports, and the process that generates the signals (i.e., stimuli) that inform on the state. And I argued that not being able to define these things points to limitations in our ability to interpret the results we get.

I am still thinking about this topic, and why I feel strongly that when the participant isn’t given a clear goal to aim for in responding, i.e., one that is aligned with the reward they get on the task, it is hard to interpret the results. 

It’s fair to say that when we interpret the results of experiments involving human behavior, we tend to be optimistic about how what we observe in the experiment relates to people’s behavior in the “real world.” The default assumption is that the experiment results can help us understand how people behave in some realistic setting that the experimental task is meant to proxy for. There sometimes seems to be a divide among researchers, between a) those who believe that judgment and decision tasks studied in controlled experiments can be loosely based on real world tasks without worrying about things being well-defined in the context of the experiment and b) those who think that the experiment should provide (and communicate to participants) some unambiguously defined way to distinguish “correct” or at least “better” responses, even if we can’t necessarily show that this understanding matches some standard we expect to operate in the real world.

From what I see, there are more researchers running controlled studies in applied fields that are in the former camp, whereas the latter perspective is more standard in behavioral economics. Those in applied fields appear to think it’s ok to put people in a situation where they are presented with some choice or asked to report their beliefs about something but without spelling out to them exactly how what they report will be evaluated or how their payment for doing the experiment will be affected. And I will admit I too have run studies that use under-defined tasks in the past. 

Here are some reasons I’ve heard for not using a well-defined task in a study:

People won’t behave differently if I do that. People will sometimes cite evidence that behavior in experiments doesn’t seem very responsive to incentive schemes, extrapolating from this that giving people clear instructions on how they should think about their goals in responding (i.e., what constitutes good versus bad judgments or decisions) will not make a difference. So it’s perceived as valid to just present some stuff (treatments) and pose some questions and compare how people respond.

The real world version of this task is not well-defined. Imagine studying how people use dashboards giving information about a public health crisis, or election forecasts. Someone might argue that there is no single common decision or outcome to be predicted in the real world when people use such information, and even if we choose some decision like ‘should I wear a mask’ there is no clear single utility function, so it’s ok not to tell participants how their responses will be evaluated in the experiment. 

Having to understand a scoring rule will confuse people. Relatedly, people worry that constructing a task where there is some best response will require explaining complicated incentives to study participants. They might get confused, which will interfere with their “natural” judgment processes in this kind of situation. 

I do not find these reasons very satisfying. The problem is how to interpret the elicited responses. Sure, it may be true that in some situations, participants in experiments will act more or less the same when you put some display of information on X in front of them and say “make this decision based on what you know about X” as when you display the same information and ask the same thing but also explain exactly how you will judge the quality of their decision. But – I don’t think it matters if they act the same. There is still a difference: in the latter case where you’ve defined what a good versus bad judgment or decision is, you know that the participants know (or at least that you’ve attempted to tell them) what their goal is when responding. And ideally you’ve given them a reason to try to achieve that goal (incentives). So you can interpret their responses as their attempt at fulfilling that goal given the information they had at hand. In terms of the loss you observe in responses relative to the best possible performance, you still can’t disambiguate the effect of their not understanding the instructions from their inability to perform well on the task despite understanding it. But you can safely consider the loss you observe as reflecting an inability to do that task (in the context of the experiment) properly. (Of course, if your scoring rule isn’t proper then you shouldn’t expect them to be truthful under perfect understanding of the task. But the point is that we can be fairly specific about the unknowns.)
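To make the scoring-rule point concrete, here’s a minimal sketch in R (the belief value is made up): under a quadratic (Brier-style) score the expected score is maximized by reporting your true belief, while under a naive linear score it pays to exaggerate toward 0 or 1.

q <- 0.7                      # participant's true belief that the state is 1 (made up)
p <- seq(0, 1, 0.01)          # possible reported probabilities

# Expected quadratic (Brier-style) score, higher is better: a proper rule
quad <- q * (1 - (1 - p)^2) + (1 - q) * (1 - p^2)
# Expected linear score (pays p if the state is 1, 1 - p if it is 0): not proper
lin <- q * p + (1 - q) * (1 - p)

p[which.max(quad)]   # 0.7: truthful reporting is optimal
p[which.max(lin)]    # 1.0: exaggeration is optimal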

When you ask for some judgment or decision but don’t say anything about how that’s evaluated, you are building variation in how the participants interpret the task directly into your experiment design. You can’t say what their responses mean in any sort of normative sense, because you don’t know what scoring rule they had in mind. You can’t evaluate anything. 

Again this seems rather obvious, if you’re used to formulating statistical decision problems. But I encounter examples all around me that appear at odds with this perspective. I get the impression that it’s seen as a “subjective” decision for the researcher to make in fields like visualization or human-centered AI. I’ve heard studies that define tasks in a decision theoretic sense accused of “overcomplicating things.” But then when it’s time to interpret the results, the distinction is not acknowledged, and so researchers will engage in quasi-normative interpretation of responses to tasks that were never well defined to begin with.

This problem seems to stem from a failure to acknowledge the differences between behavior in the experimental world versus in the real world: We do experiments (almost always) to learn about human behavior in settings that we think are somehow related to real world settings. And in the real world, people have goals and prior beliefs. We might not be able to perceive what utility function each individual person is using, but we can assume that behavior is goal-directed in some way or another. Savage’s axioms and the derivation of expected utility theory tell us that for behavior to be “rationalizable”, a person’s choices should be consistent with their beliefs about the state and the payoffs they expect under different outcomes.

When people are in an experiment, the analogous real world goals and beliefs for that kind of task will not generally apply. For example, people might take actions in the real world for intrinsic value – e.g., I vote because I feel like I’m not a good citizen if I don’t vote. I consult the public health stats because I want to be perceived by others as informed. But it’s hard to motivate people to take actions based on intrinsic value in an experiment, unless the experiment is designed specifically to look at social behaviors like development of norms or to study how intrinsically motivated people appear to be to engage with certain content. So your experiment needs to give them a clear goal. Otherwise, they will make up a goal, and different people may do this in different ways. And so you should expect the data you get back to be a hot mess of heterogeneity. 

To be fair, the data you collect may well be a hot mess of heterogeneity anyway, because it’s hard to get people to interpret your instructions correctly. We have to be cautious interpreting the results of human-subjects experiments because there will usually be ambiguity about the participants’ understanding of the task. But at least with a well-defined task, we can point to a single source of uncertainty about our results. We can narrow down reasons for bad performance to either real challenges people face in doing that task or lack of understanding the instructions. When the task is not well-defined, the space of possible explanations of the results is huge. 

Another way of saying this is that we can only really learn things about behavior in the artificial world of the experiment. As much as we might want to equate it with some real world setting, extrapolating from the world of the controlled experiment to the real world will always be a leap of faith. So we better understand our experimental world. 

A challenge when you operate under this understanding is how to explain to people who have a more relaxed attitude about experiments why you don’t think that their results will be informative. One possible strategy is to tell people to try to see the task in their experiment from the perspective of an agent who is purely transactional or “rational”:

Imagine your experiment through the eyes of a purely transactional agent, whose every action is motivated by what external reward they perceive to be in it for them. (There are many such people in the world actually!) When a transactional agent does an experiment, they approach each question they are asked with their own question: How do I maximize my reward in answering this? When the task is well-defined and explained, they have no trouble figuring out what to do, and proceed with doing the experiment. 

However, when the transactional human reaches a question that they can’t determine how to maximize their reward on, because they haven’t been given enough information, they shut down. This is because they are (quite reasonably) unwilling to take a guess at what they should do when it hasn’t been made clear to them. 

But imagine that our experiment requires them to keep answering questions. How should we think about the responses they provide? 

We can imagine many strategies they might use to make up a response. Maybe they try to guess what you, as the experimenter, think is the right answer. Maybe they attempt to randomize. Maybe they can’t be bothered to think at all and they call in the nearest cat or three year old to act on their behalf. 

We could probably make this exercise more precise, but the point is that if you would not be comfortable interpreting the data you get under the above conditions, then you shouldn’t be comfortable interpreting the data you get from an experiment that uses an under-defined task.

A time series so great, they plotted it twice. (And here’s a better way to do it:)

Someone who I don’t know writes:

If you decide to share this publicly, say in your blog, let me stay anonymous.

It’s funny how people want anonymity on these things!

Anyway, my correspondent continues:

I came across this 2016 PNAS article, “Seasonality in human cognitive brain responses.”

It has this interesting figure:

The same data are plotted twice, once in the left half of the figure, and again in the right half. The horizontal axis is repeated, so we are not looking at data fabrication. In the caption, the authors say “n=28”. (Two pairs of dots overlap, so you see only 26 dots in each half). They also describe this figure as a “double plot”. I did an internet search for “double plot” and, so far as I can tell, there is no such thing. The closest thing was a dual-axis plot, which is not what the authors have here. They’ve used “double plots” in other figures in the paper too.

Going by how the authors drew the x-axis and their disclosure that “n=28”, I assume that the authors did not mean to deceive the readers. But I still find it deceptive. I can hardly think of a situation where repeating a plot is a good idea. But if an author must do it, they should probably not just call it a “double plot” and leave it at that. They should describe what it is they have done and why.

Yeah, this is wack! The natural thing would be to just show one year and not duplicate any data—I guess then there’s a concern that you wouldn’t see the continuity between December and January. But, yeah, repeating the entire thing seems like a bit much.

Here’s what I’d recommend: Display one year, Winter/Spring/Summer/Fall, then append Fall on the left and Winter on the right (so now you’re displaying 18 months) but gray out the duplicate months, so then it’s clear that they’re not additional data, they’re just showing the continuity of the pattern.

Best of both worlds!
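If you want to try this, here’s a minimal sketch in base R with made-up monthly data (all names and values are just for illustration); the duplicated months are drawn in gray so they read as context rather than extra data:

set.seed(1)
value <- rnorm(12)                               # one made-up measurement per month
wrapped <- c(value[10:12], value, value[1:3])    # repeat the last 3 months on the left, first 3 on the right
pos <- -2:15                                     # 18 positions in all
dup <- c(rep(TRUE, 3), rep(FALSE, 12), rep(TRUE, 3))
plot(pos, wrapped, type = "l", col = "grey70", xaxt = "n", xlab = "month", ylab = "value")
points(pos, wrapped, pch = 19, col = ifelse(dup, "grey70", "black"))
axis(1, at = pos, labels = c(10:12, 1:12, 1:3))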

P.S. The duplicate graph reminds me of a hilarious lampshade I saw once that looked like a map of the world, but it actually was two maps: that is, it went around the world twice, so that from any horizontal angle you could see all 360 degrees. I tried to find an image online but no amount of googling took me to it.

Defining decisions in studies of visualization and human-centered AI

This is Jessica. A few years ago, Dimara and Stasko published a paper pointing to the lack of decision tasks in evaluation of visualization research, where it’s common to talk about decision-making, but then to ask simpler perceptual style questions in the study you run. A few years earlier I had pointed to the same irony when taking stock of empirical research on visualizing uncertainty, where despite frequent mention of “better decision-making” as the objective for visualizing uncertainty, few well-defined decision tasks are studied, and instead most studies evaluate how well people can read data from a chart and how confident they report feeling about their answers to the task. Colloquially, I’ve heard a decision task described as a choice between alternatives, or a choice where the stakes are high, but neither of these isolates a clear set of assumptions. 

Then there is all the research being produced on “AI-advised decisions” – how people make decisions with the aid of AI and ML models – which has become much more popular in the last five years. Some of this research is invested in isolating decision tasks to build general understanding about how people use model predictions. E.g., according to one recent survey of empirical human-subjects studies on using AI to augment human decisions, these studies are “necessary to evaluate the effectiveness of AI technologies in assisting decision making, but also to form a foundational understanding of how people interact with AI to make decisions.” The body of empirical work on AI-advised decisions as a whole is thought to allow us to “develop a rigorous science of human-AI decision-making”; or “to assess that trust exists between human-users and AI-embedded systems in decision making”; etc. Reading these kinds of statements, it would seem there must be some consensus on what a decision is and how decisions are a distinct form of human behavior compared to other tasks that are not considered decisions.

But looking at the range of tasks that get filed under “studying decision making” in both of these areas, it’s not very clear what the common task structure is that makes these studies about decision-making. For some, the point seems to be to compare human decisions to some definition of rational behavior, though sometimes this is defined by the researchers and other times it’s not. Sometimes the point is to study “subjective decisions,” like helping a friend decide whether the list price of a house matches its valuation, where participants are intended to use their own judgment about what to prioritize.

If we are going to isolate decision-making as an important class of behavior when we study interfaces, I think we should be able to give a definition of what that means. I get that human decision-making might appear hard to formalize sometimes (because if it wasn’t, why haven’t the humans figured out how to automate it?) And decision theory isn’t necessarily familiar if you’re coming from a computer science background. But it seems hard to make progress or learn from a body of empirical work on decisions if we can’t say exactly what classifies a task as a decision. The survey papers I’ve seen on decision making in visualization or in human-centered AI conclude that we need a more coherent definition of decision, but no one appears to be suggesting anything concrete.

So here’s a proposal for one way to understand what constitutes a decision problem: a decision task involves the person choosing a response from some set of possible responses, where we know that the quality of the response depends on the realization of a state of the world which is uncertain at the time of the decision. Additionally, for the fields I’m talking about above, we generally want to assume that there is some information (or signal) available to the decision maker when they make their decision which is correlated with the uncertain state. 

We can summarize this by saying that when we talk about people making a decision we should be able to point to the following (a minimal code sketch follows the list):

  1. An uncertain state of the world. E.g., if we are taking recidivism prediction as the decision task, the uncertain state is whether the person will commit another crime after being released, which can take the value 0 (no) or 1 (yes).  
  2. An action space from which the decision maker chooses a response. Above I called the action space a choice of response, because I’ve seen a colloquial understanding of decision that assumes a task has to involve choosing between actions that correspond to something that feels like a real-world choice between a small number of options. It doesn’t. The response to the decision problem could be a probability forecast or some other numeric response. What matters is that we have a way to evaluate the quality of that response that accounts for the realization of the state.
  3. A (scoring) rule that assigns some quality score to each chosen action so we can evaluate the decision. I think sometimes people hear ‘scoring rule’ and they think it has to be a proper scoring rule, or at least a continuous function, or something like that. It can refer to any way in which we assign value to different responses to the task. However, we should acknowledge that we can use scoring rules for different purposes even in a single study, and think about why we do this. E.g., sometimes we might have one scoring rule that we use to incentivize participants in our study (which may just be a flat reward scheme regardless of the quality of your responses, which is used a lot at least in visualization research). Then we have some different scoring rule that we use to evaluate their responses, like evaluating the accuracy of the responses. If we want to conclude that we have learned about the quality of human decisions, it’s the latter we care about more, but in many situations we should be thinking carefully about the former as well, so that participants understand the decision problem the same way that we do when we analyze it. So we should be able to identify both when running a decision study.

 And, optionally for studies comparing different interfaces to aid a decision maker:

  4. Some signal (or set of signals) that inform about the uncertain state and that the decision-maker has access to in making their decision. This could be a visualization of some relevant data, or the prediction made by a model on some instance.
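To make these components concrete, here is a minimal sketch in R of a toy decision problem along these lines (all names and numbers are made up for illustration): a binary uncertain state, a signal correlated with it, a binary action space, and a simple scoring rule.

set.seed(1)
n <- 1000
state <- rbinom(n, 1, 0.3)               # 1. uncertain state (e.g., recidivism: 1 = yes)
signal <- rnorm(n, mean = state)         # 4. signal correlated with the state (e.g., a risk score)
action <- as.integer(signal > 0.5)       # 2. action chosen from {0, 1}; here a simple threshold rule
# 3. scoring rule: made-up payoffs, e.g., -5 for a miss, -1 for a false alarm, 0 otherwise
score <- ifelse(state == 1 & action == 0, -5,
                ifelse(state == 0 & action == 1, -1, 0))
mean(score)                              # average score of this decision strategy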

This may seem obvious to some, as I am essentially just describing components of statistical decision theory. But I think making these aspects of a decision problem explicit when talking about decisions in visualization and HCAI research would already be a step forward. It would at least give us a list of things to make sure we can identify if we are trying to understand a decision task. And it could help us realize the limitations on how much we can really say about decision-making when we can’t specify all these components. For example, if there’s no uncertain state on which the quality of someone’s response to the task depends, what’s the point of trying to evaluate different decision strategies? Or if we can’t describe how the signals our interface is providing compare in terms of conveying information about the uncertain state, then how can we evaluate different approaches to presenting them?

Something else I’ve noticed is papers that refer to a decision task but then bury the information about the nature of the scoring rule that is used, especially that used to incentivize the participants, as if it doesn’t matter for anything. But if we give no thought to what to tell study participants about how to make a decision well, then we should be careful about using the results of the study to talk about making better or worse decisions – they might have been trying different things, doing whatever seemed quickest, etc. 

Also, some studies resist defining decision quality even in assessing responses, as if this would take away from the realism of the task. I think there’s a temptation in studying these topics to assume there’s some inherent black-box nature to how people make decisions or use visualizations that absolves us as researchers of having to try to formalize anything. It isn’t necessarily wrong to study tasks where we can’t say what exactly would constitute a better decision in some real-world decision pipeline, but if we want to work toward a general understanding of human decision making with AIs or with visualizations through controlled empirical experiments, we should study scenarios that we can fully understand. Otherwise we can make observations perhaps, but not value judgments.

Related to this, I think adhering to this definition of decision would make it easier for researchers to tell the difference between normative decision studies and descriptive or exploratory ones. If the goal is simply to understand how people approach some kind of decision task, or what they need, like this example of how child welfare workers screen cases, then it doesn’t necessarily matter if we can’t say what the scoring rule (or rules) is – maybe that’s part of what we’re trying to learn. But I would argue that whenever we want to conclude something about decision quality, we should be able to describe and motivate the scoring rule(s) we’re using. I’ve come across papers in both of the areas I mentioned that seem to confuse the two. I also have mixed feelings about labeling some decisions as “subjective,” though I understand the motivation behind trying to distinguish the more formally defined tasks from those that seem underspecified. There’s a risk of “subjective” making it sound like it’s possible to have a decision for which there really is no scoring rule, implicit or not, but I don’t think that makes sense.

Of course there is lots more that could be said about all this. For example, if you’re going to study some decision task with the goal of producing new knowledge about human decision making in general, I think you should be able to go further than just specifying your decision problem: you should also understand its properties and motivate why they are important. I find that often there is an iterative process of specifying the decision task for an experiment – you take a stab at specifying something you think might work, then you attempt to understand how well it “works” for the purposes of your experiment. This process can be very opaque if you don’t have a good sense of what properties matter for the kind of claim you hope to make from your results. I have some recent work where we lay out an approach to evaluating the decision experiments themselves, but will leave that for a follow-up post.

P.S. These thoughts are preliminary, and I welcome feedback from those studying decisions in the areas I’ve mentioned, or other domains.

Can Visualization Alleviate Dichotomous Thinking? Some experimental evidence.

Jouni Helske, Satu Helske, Matthew Cooper, Anders Ynnerman, and Lonni Besançon write:

Can Visualization Alleviate Dichotomous Thinking? Effects of Visual Representations on the Cliff Effect

Common reporting styles for statistical results in scientific articles, such as p-values and confidence intervals have been reported to be prone to dichotomous interpretations, especially with respect to the null hypothesis significance testing framework. . . . This type of reasoning has been shown to be potentially harmful to science. Techniques relying on the visual estimation of the strength of evidence have been recommended to reduce such dichotomous interpretations but their effectiveness has also been challenged. We ran two experiments on researchers with expertise in statistical analysis to compare several alternative representations of confidence intervals and used Bayesian multilevel models to estimate the effects of the representation styles on differences in researchers’ subjective confidence in the results. We also asked the respondents’ opinions and preferences in representation styles. Our results suggest that adding visual information to classic CI representation can decrease the tendency towards dichotomous interpretations—measured as the `cliff effect’: the sudden drop in confidence around p-value 0.05—compared with classic CI visualization and textual representation of the CI with p-values. All data and analyses are publicly available at https://github.com/helske/statvis.

This sounds cool. I’ll let co-blogger Jessica judge the relevance and quality of the research, as this is in her area of expertise.

The three ages of -i

Laura Wattenberg’s Namerology is always worth a read:

[The above] u-turn curve, while accurate, is misleading. What looks like a return to earlier times is actually a revolution. And each of the three sections of the historical curve—flat, peak, and flat again—is its own cultural era of naming.

1900 to WWII: -i is for Immigrants

In the early decades of the 20th Century, classic English naming still dominated, especially among American-born parents. The shape of English names left -i names scarce. In fact, the biblical name Naomi accounted for the majority of American -i babies born, without ever cracking a top-100 name list.

The remaining -i names of the period included a scattering of rarer biblical names, girls’ nicknames, and an impressive variety of names from cultures around the world, brought to America by immigrant parents. After the biblical names Levi and Eli, the next most common boys’ -i names were Hiroshi (Japanese), Henri (French) and Luigi (Italian). The top girls’ -i names included the Finnish names Lempi and Aili. Names like these were seldom adopted by American families of other backgrounds, and most of them disappeared as immigration declined.

WWII to the Reagan Era: -i is for Informal

The mid-century brought an “American girl” explosion, a wave of new -i hits with a casual, carefree attitude and a youthful sound. The top 35 -i names of the middle period were all female, and except for the holdover Naomi every one was 2 syllables. Most significantly, the names were newly configured, homegrown hits that made up a distinctly American style.

The top 3 names, Lori, Vicki and Terri, demonstrate the blueprint. A familiar nickname like Laurie or Vicky, or occasionally a surname like Tracy or word name like Brandy, was updated with an -i spelling. The -i variant was perceived as fresh and female, a perception which some parents leveraged to put a feminine edge on names like Toni, Jeri and Randi. (For a sense of the male counterparts to this all-female -i phase, the five fastest-rising names of 1957 were Mike, Jeff, Tim, Greg and Tom.) With a couple of generations’ distance this whole era of -i names now looks remarkably unified, and like the face of a generation.

“Mike, Jeff, Tim, Greg and Tom.” That’s funny.

Wattenberg continues:

Reagan Era to Today: -i is for Impact

The -i style of recent times is defined by the quest to stand out, with previous generations’ name standards as the backdrop. For boys, the -i ending itself achieved that goal since it had always been uncommon. Parents made hits of biblical names like Levi and Malachi, and imports like Giovanni and Nikolai were increasingly chosen by families of diverse ethnic backgrounds. A slew of new African-American -i names for both sexes were built on the model of African names like Imani. Striking words and brand names like Bodhi and Armani also became popular given names.

Notably, the -i choices of this era also made an impact with length, at both extremes. The typical American boy’s name is 2 syllables and 5-6 letters. Not one of the top 9 -i male names of the Impact era fit that mold. Parents went short with names like Kai and Ari, and long with the likes of Giovanni and Malachi.

A Shift in Mindset

The three eras reflect not just different styles, but different naming impulses. In the first era, most parents assumed they would select baby names from a traditional pool, or from their own family trees. Parents of the middle era began to push against the weight and formality of the past, but they weren’t prepared to go too far out on a limb and conformity was still the order of the day. Then in the third, contemporary era, parents rejected the traditional model of a set pool of names and moved toward something more like personal branding.

Put it together and you have a good thumbnail portrait of American name history, and arguably of the evolution of American culture. All through the lens of a single letter.

Cool!

New open access journal on visualization and interaction

This is Jessica. I am on the advisory board of an open access visualization research journal called the Journal of Visualization and Interaction (JoVI), recently launched by Lonni Besançon, Florian Echtler, Matt Kay, and Chat Wacharamanotham. From their website:

The Journal of Visualization and Interaction (JoVI) is a venue for publishing scholarly work related to the fields of visualization and human-computer interaction. Contributions to the journal include research in:

  • how people understand and interact with information and technology,
  • innovations in interaction techniques, interactive systems, or tools,
  • systematic literature reviews,
  • replication studies or reinterpretations of existing work,
  • and commentary on existing publications.

One component of their mission is to require materials to be open by default, including exposing all data and reasoning for scrutiny, and making all code reproducible “within a reasonable effort.” Other goals are to emphasize knowledge and discourage rejection based on novelty concerns (a topic that comes up often in computer science research; see, e.g., my thoughts here). They welcome registered reports, and say they will not impose top-down constraints on how many papers can be published, which can lead to arbitrary-seeming decisions on papers that hinge on easily fixable mistakes. This last part makes me think they are trying to avoid the kind of constrained decision processes of conference proceedings publications, which are still the most common publication mode in computer science. There are existing journals like Transactions on Visualization and Computer Graphics that give authors more chances to go back and forth with reviewers, and my experience as associate editor there is that papers don’t really get rejected for easily fixable flaws. Part of JoVI’s mission seems to be about changing the kind of attitude that reviewers might bring, away from looking for reasons to reject and toward trying to work with the authors to make the paper as good as possible. If they can do this while also avoiding some of the other CS review system problems, like lack of attention or insufficient background knowledge among reviewers, perhaps the papers will end up being better than what we currently see in visualization venues.

This part of JoVI’s mission distinguishes it from other visualization journals:

Open review, comments, and continued conversation

All submitted work, reviews, and discussions will by default be publicly available for other researchers to use. To encourage accountability, editors’ names are listed on the articles they accept, and reviewers may choose to be named or anonymous. All submissions and their accompanying reviews and discussions remain accessible whether or not an article is accepted. To foster discussions that go beyond the initial reviewer/author exchanges, we welcome post-publication commentaries on articles.

Open review is so helpful for adding context to how papers were received at the time of submission, so I hope it catches on here. Plus I really dislike the attitude that it is somehow unfair to bring up problems with published work, at least outside of the accepted max 5 minutes of public Q&A that happens after the work is presented at a conference. People talk amongst themselves about what they perceive the quality or significance of new contributions to be, but many of the criticisms remain in private circles. It will be interesting to see if JoVI gets some commentaries or discussion on published articles, and what they are like.

This part is also interesting: “On an alternate, optional submission track, we will continually experiment with new article formats (including modern, interactive formats), new review processes, and articles as living documents. This experimentation will be motivated by re-conceptualizing peer review as a humane, constructive process aimed at improving work rather than gatekeeping.” 

distill.pub is no longer publishing new stuff, but some of their interactive ML articles were very memorable and probably had more impact than more conventionally published papers on the topic. Even more so, I like the idea of trying to support articles as living documents that can continue to be updated. The current publication practices in visualization seem a long way from encouraging a process where it’s normal to first release working papers. Instead, people spend six months building their interactive system or doing their small study to get a paper-size unit of work, and then they move on. I associate the areas where working papers seem to thrive (e.g., theoretical or behavioral econ) with theorizing or trying to conceptualize something fundamental to behavior, rather than just describing or implementing something. The idea that we should be trying to write visualization papers that really make us think hard over longer periods, and that may not come in easily bite-size chunks, seems kind of foreign to how the research is conceptualized. But any steps toward thinking about papers as incomplete or imperfect, and building more feedback and iteration into the process, are welcome.

Some cool interactive covid infographics from the British Medical Journal

I agree with Aleks that these are excellent. The above image is just a screenshot; the links below are all live and interactive:

Covid-19 test calculator: How to interpret test results

Current evidence for covid-19 prophylaxis: Visual summary of living systematic review and network meta-analysis

Visualising expert estimates of covid-19 transmission: What might be the best ways of protecting ourselves from covid-19?

Covid-19 lateral flow tests: Calculator for interpreting test results

Great stuff, and a model for risk communication going forward.

Two talks about robust objectives for visualization design and evaluation

This is Jessica. I’ll be giving a talk twice this week, on the topic of how to make data visualizations more robust for inference and decision making under uncertainty. Today I’m speaking at the computer science seminar at University of Illinois Urbana Champaign, and Wednesday I’ll be giving a distinguished data science lecture at Cornell. In the talk I consider what’s a good objective to use as a target in designing and evaluating visualization displays, one that is “robust” in the sense that it leads us to better designs even if people don’t use the visualizations as intended. My talk will walk through what I learned from using effect size judgments and decisions as a design target, and how aiming for visualizations that facilitate implicit model checks can be a better target for designing visual analysis tools. At the end I jump up a level to talk about what our objectives should be when we design empirical visualization experiments. I’ll talk about a framework we’re developing that uses the idea of a rational agent with full knowledge of a visualization experiment design to create benchmarks that can be used to determine when an experiment design is good (by asking, e.g., is the visualization important for doing well on the decision problem under the scoring rule used?) and which can help us figure out what causes losses in observed performance by participants in our experiment.

Open problem: How to make residual plots for multilevel models (or for regularized Bayesian and machine-learning predictions more generally)?

Adam Sales writes:

I’ve got a question that seems like it should be elementary, but I haven’t seen it addressed anywhere (maybe I’m looking in the wrong places?)

When I try to use binned residual plots to evaluate a multilevel logistic regression, I often see a pattern like this (from my student, fit with glmer):

I think the reason is partial pooling: the group-level intercepts are shrunk towards the grand mean.

I was able to replicate the effect (albeit kind of mirror-imaged—the above plot was from a very complex model) with fake data:

library(lme4)   # for glmer()
library(arm)    # for binnedplot()

# Simulate grouped binary data with normally distributed group-level intercepts
makeData <- function(ngroup=100, groupSizeMean=10, reSD=2){
  groupInt <- rnorm(ngroup, sd=reSD)                  # true group intercepts
  groupSize <- rpois(ngroup, lambda=groupSizeMean)    # random group sizes
  groups <- rep(1:ngroup, times=groupSize)
  n <- sum(groupSize)
  data.frame(group=groups, y=rbinom(n, size=1, prob=plogis(groupInt[groups])))
}
dat <- makeData()
mod <- glmer(y ~ (1|group), data=dat, family=binomial)
binnedplot(predict(mod, type='response'), resid(mod, type='response'))

Model estimates (i.e., point estimates of the parameters from a hierarchical model) of extreme group effects are shrunk towards 0---the grand mean intercept in this case---except at the very edges when the 0-1 bound forces the residuals to be small in magnitude (I expect the pattern would be linear in the log odds scale).

When I re-fit the same model on the same data with rstanarm and looked at the fitted values I got basically the same result.

On the other hand, when looking at 9 random posterior draws the pattern mostly goes away:

Now here come the questions---is this really a general phenomenon, like I think it is? If so, what does it mean for the use of binned residual plots for multilevel logistic regression, or really any time there's shrinkage or partial pooling? Can binned residual plots be helpful for models fit with glmer, or only by plotting individual posterior draws from a Bayesian posterior distribution?

My reply: Yes, the positive slope for resid vs expected value . . . that would never happen in least-squares regression, so, yeah, it has to do with partial pooling. We should think about what's the right practical advice to give here. Residual plots are important.

As you note with your final graph above, the plots should have the right behavior (no slope when the model is correct) when plotting the residuals relative to the simulated parameter values. This is what Xiao-Li, Hal, and I called "realized discrepancies" in our 1996 paper on posterior predictive checking, but then in our 2000 paper on diagnostic checks for discrete-data regression models using posterior predictive simulations, Yuri, Francis, Ivan, and I found that the use of realized discrepancies added lots of noise in residual plots.

What we'd like is an approach that gives us the clean comparisons but without the noise.
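For readers who want to try the posterior-draw version Adam describes, here's a minimal sketch, assuming the simulated dat from the code above and a recent rstanarm (posterior_epred(); older versions would use posterior_linpred(fit, transform = TRUE)):

library(rstanarm)
library(arm)
fit <- stan_glmer(y ~ (1 | group), data = dat, family = binomial, refresh = 0)
p_draws <- posterior_epred(fit)          # matrix of posterior draws x observations
par(mfrow = c(3, 3))
for (s in sample(nrow(p_draws), 9)) {    # binned residual plot for 9 random posterior draws
  binnedplot(p_draws[s, ], dat$y - p_draws[s, ], main = paste("Posterior draw", s))
}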

Hey—here’s some ridiculous evolutionary psychology for you, along with some really bad data analysis.

Jonathan Falk writes:

So I just started reading The Mind Club, which came to me highly recommended. I’m only in chapter 2. But look at the above graph, which is used thusly:

“As figure 5 reveals, there was a slight tendency for people to see more mind (rated consciousness and capacity for intention) in faster animals (shown by the solid sloped line)—it is better to be the hare than the tortoise. The more striking pattern in the graph is an inverted U shape (shown by the dotted curve), whereby both very slow and very fast animals are seen to have little mind, and human-speeded animals like dogs and cats are seen to have the most mind. This makes evolutionary sense, as potential predators and prey are all creatures moving at roughly our speed, and so it pays to understand their intentions and feelings. In the modern world we seldom have to worry about catching deer and evading wolves, but timescale anthropomorphism stays with us; in the dance of perceiving other minds, it pays to move at the same speed as everyone else.”

Wegner, Daniel M.; Gray, Kurt. The Mind Club (pp. 29-30). Penguin Publishing Group. Kindle Edition.

That “inverted U shape” seems a bit housefly-dependent, wouldn’t you say? And how is the “slight tendency” less “striking” than this putative inverse U shape?

Yeah, that quadratic curve is nuts. As is the entire theory.

Also, what’s the scale of the x-axis on that graph? If a sloth’s speed is 35, the wolf should be more than 70, no? This seems like the psychology equivalent of that political science study that said that North Carolina was less democratic than North Korea.

Falk sent me the link to the article, and it seems that the speed numbers are survey responses for “perceived speed of movement.” GIGO all around!

The “percentogram”—a histogram binned by percentages of the cumulative distribution, rather than using fixed bin widths

Jamie Elsey writes:

I’ve been really interested to see you talking more about data visualisation in your blog as it’s a topic I really enjoy and think is underappreciated. I’ve recently been working on some ways of legibly presenting uncertainty as part of my work, and devised what is, to me, a slightly novel way of showing distributions of data in a way I find to be quite useful. I wondered if you have seen this type of thing before, and what you think? – basically, it is like a histogram or density plot in that it shows the overall shape of the distribution, but what I find nice is that each bar is made to have the same area and to specifically represent a chosen percentage. One could call it a “percentogram.” Hence, it is really easy to assess how much of the distribution is falling in particular ranges. You can also specifically color code the bars according to, e.g., particular quantiles, deciles, etc.

I think this could be potentially useful for plotting things like posterior distributions, or the results of things like cost effectiveness analyses where some of the inputs include uncertainty/are simulated with variability. This is not a proper geom yet and the code is probably a bit janky, but if you’d like to see the code I can also share what I have so far (it is a function that will take a vector of data and returns a dataframe from which this kind of plot can be easily made).

I thought you might find it interesting especially if it is something you haven’t seen before, or maybe there is some good reason why this kind of plot is not used!

The above graphs show percentograms for random draws from the normal and exponential distributions.

In response to Elsey’s question, my quick answer is that I’ve seen histograms with varying bin widths but not with equal probability.

Elsey did some searching and found this on varying binwidth histograms, with references going back to the 1970s. It makes sense that people were writing about the topic back then, because that was a time when statisticians thought a lot about unidimensional data display. Nowadays we think more about time series and scatterplots, but histograms still get used, which is why I’m sharing the idea here.

I googled *equal probability histograms in r* and found this amusing bit from 2004, classic R-list stuff, no messing around:

Q: I would like to use R to generate a histogram which has bars of variable bin width with each bar having an equal number of counts. For example, if the bin limits are the quartiles, each bar would represent 1/4 of the total probability in the distribution. An example of such an equal-probability histogram is presented by Nicholas Cox at http://www.stata.com/support/faqs/graphics/histvary.html.

A: So you can calculate the quartiles using the quantile() function and set those quartiles as breaks in hist().

Indeed:

percentogram <- function(a, q=seq(0, 1, 0.05), ...) {
  # Breaks at quantiles of the data, so each bar covers an equal share of the observations
  hist(a, breaks=quantile(a, q), xlab="", main="Percentogram", ...)
}

I'll try it on my favorite example, a random sample from the Cauchy distribution:

> y <- rcauchy(1e5)
> percentogram(y)

And here's what comes up:

This is kinda useless: there's a wide range of the data and then you see no detail in the middle. You'll get similar problems with a classical equal-width histogram (try it!).

There's no way out of this one . . . except that if we're going with percentiles anyway, we could just trim the extremes:

percentogram <- function(a, q=seq(0, 1, 0.05), ...) {
  b <- quantile(a, q)
  # Keep only observations within the requested quantile range, then bin by quantiles
  include <- (a >= b[1]) & (a <= b[length(b)])
  hist(a[include], breaks=b, xlab="", main=paste("Percentogram between", q[1], "and", q[length(q)], "quantiles"), ...)
}

OK, now let's try it, chopping off the lower and upper 1%:

percentogram(y, q=c(0.01, seq(0.05,0.95,0.05), 0.99))

Not bad! I kinda like this percentogram as a default.
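And since Elsey mentioned color coding, hist() will recycle a vector of colors across the bars, so one way to shade alternating 5% bins of the trimmed percentogram (20 bins here) is:

percentogram(y, q = c(0.01, seq(0.05, 0.95, 0.05), 0.99),
             col = rep(c("grey40", "grey85"), length.out = 20))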

“Assault Deaths in the OECD 1960-2020”: These graphs are good, and here’s how we can make them better:

Kieran Healy posts this time series of assault deaths in the United States and eighteen other OECD countries:

Good graph. I’d just make three changes:

1. Label y-axis as deaths per million (with labels at 0, 25, 50, 75, 100) rather than deaths per 100,000. Why? I just think that “per million” is easier to follow. I can picture an area with a million people and say, OK, it would have about 50 or 75 deaths in the U.S., as compared to 0 or 15 in another OECD country.

2. Put a hard x-axis at y=0. As it is now, the time series kinda float in midair. A zero line would provide a useful baseline.

3. When listing the OECD countries, use the country names, not the three-letter abbreviations, and list them in decreasing order of average rates, rather than alphabetically.

I’d also like to see the rates for other countries in the world. But could be a mess to cram them all on the same graph, so maybe do a few more: one for Latin America, one for Africa, one for the European and Asian countries not in the OECD. Or something like that. You could display all 4 of these graphs together (using a common scale on the y-axis) to get a global picture.

And another

OK, that was good. Here’s another graph from Healy, who introduces it as follows:

It’s a connected scatterplot of total health spending in real terms and life expectancy of the population as a whole. The fact that real spending and expectancy tend to steadily increase for most countries in most years makes the year-to-year connections work even though they’re not labeled as such.

And the graph itself:

I’d seen a scatterplot version of this one . . . This time-series version adds a lot of context, in particular showing how the U.S. used to fit right in with those other countries but doesn’t anymore.

And what are my graphics suggestions?

1. Maybe label two or three of those OECD countries, just to give a sense of the range? You could pick three of them and color them blue, or just heavy black, and label them directly on the lines. I’d also label the U.S. directly on the red line; no need for a legend.

2. To get a sense of the time scale, you could put a fat dot along each series every 10 years. Or, if that’s too crowded, you could do every 20 years: 1980, 2000, 2020. Otherwise as a reader I’m in an awkward position of not having a clear sense of how the curves line up.

3. Again, I’d like to see a few more graphs showing the other countries of the world.

Nationally poor, locally rich: Income and local context in the 2016 presidential election

Thomas Ogorzalek, Spencer Piston, and Luisa Godinez Puig write:

When social scientists examine relationships between income and voting decisions, their measures implicitly compare people to others in the national economic distribution. Yet an absolute income level . . . does not have the same meaning in Clay County, Georgia, where the 2016 median income was $22,100, as it does in Old Greenwich, Connecticut, where the median income was $224,000. We address this limitation by incorporating a measure of one’s place in her ZIP code’s income distribution. We apply this approach to the question of the relationship between income and whites’ voting decisions in the 2016 presidential election, and test for generalizability in elections since 2000. The results show that Trump’s support was concentrated among nationally poor whites but also among locally affluent whites, complicating claims about the role of income in that election. This pattern suggests that social scientists would do well to conceive of income in relative terms: relative to one’s neighbors.

Good to see that people are continuing to work on this Red State Blue State stuff.

P.S. Regarding the graph above: They should’ve included the data too. It would’ve been easy to put in points for binned data just on top of the plots they already made. Clear benefit requiring close to zero effort.

Association between low density lipoprotein cholesterol and all-cause mortality

Larry Gonick asks what I think of this research article, Association between low density lipoprotein cholesterol and all-cause mortality: results from the NHANES 1999–2014.

The topic is relevant to me, as I’ve had cholesterol issues. And here’s a stunning bit from the abstract:

We used the 1999–2014 National Health and Nutrition Examination Survey (NHANES) data with 19,034 people to assess the association between LDL-C level and all-cause mortality. . . . In the age-adjusted model (model 1), it was found that the lowest LDL-C group had a higher risk of all-cause mortality (HR 1.7 [1.4–2.1]) than LDL-C 100–129 mg/dL as a reference group. The crude-adjusted model (model 2) suggests that people with the lowest level of LDL-C had 1.6 (95% CI [1.3–1.9]) times the odds compared with the reference group, after adjusting for age, sex, race, marital status, education level, smoking status, body mass index (BMI). In the fully-adjusted model (model 3), people with the lowest level of LDL-C had 1.4 (95% CI [1.1–1.7]) times the odds compared with the reference group, after additionally adjusting for hypertension, diabetes, cardiovascular disease, cancer based on model 2. . . . In conclusion, we found that low level of LDL-C is associated with higher risk of all-cause mortality.

The above quotation is exact except that I rounded all numbers to one decimal place. The original version presented them to three decimals (“1.708,” etc.) and that made me cry.

In any case, the finding surprised me. I don’t know that it’s actually a medical surprise; I just had the general impression that cholesterol is a bad thing to have. Also, I was gonna say I was surprised that the estimated effects were so large, but then I saw the large widths of the confidence intervals, and that surprised me too at first, but then I realized that not so many people in the longitudinal study would have died during the period, so the effective sample size isn’t quite as large as it might seem at first.

The researchers also fit some curves:

Next, the inferences that the curve came from:

The data are consistent with high risks at low cholesterol levels and nothing happening at high levels, also consistent with other patterns, as can be seen from the uncertainty lines.

The published paper does a good job of presenting data and conclusions clearly without any overclaiming that I can see.

Anyway, I don’t really know what to make of this study, and I know nothing about the literature in the area. I’ll still go by my usual algorithm and just trust my doctor on everything.

I’m posting because (a) I just think it’s cool that the author of the Cartoon Guide to Statistics reads our blog, and (b) it can be helpful to our readers to see an example of my ignorance.

When plotting all the data can help avoid overinterpretation

Patrick Ruffini and David Weakliem both looked into this plot that’s been making the rounds, which seems to suggest a sudden drop in some traditional values:

Percent who say these values are 'very important' to them

But the survey format changed between 2019 and 2023, both moving online and randomizing the order of response options.

Perhaps one clue that you shouldn’t draw sweeping conclusions specific to these values is that there is a drop in the importance of “self-fulfillment” and “tolerance” too. Weakliem writes that once you collapse a couple response options…

there’s little change–they are almost universally regarded as important at all three times. The results for “self-fulfillment,” which isn’t mentioned in the WSJ article, are particularly interesting–the percent rating it as very important fell from 64% in 2019 to 53% in 2023. That’s hard to square with either the growing selfishness or the social desirability interpretations, but is consistent with my hypothesis. These figures indicate some changes in the last few years, but not the general collapse of values that is being claimed.

If the importance of everything drops at once, this might be a clue that selective interpretation of some thematically-related drops is likely not justified — whether this is because of survey format changes or otherwise (say something else becoming comparatively more important, but not asked about).

So perhaps this is a good reminder of the benefits of plotting more of the data — even if you want to argue the action is all in a few of the items. (You could even think of this as something like a non-equivalent comparison group or differences-in-differences design.)

Update: Here is a plot I made from the numbers from the Weakliem post. In making this plot, I formulated one guess of why the original plot has this weird x-axis: when making it with a properly scaled x-axis of years, you can easily run into problems with the tick labels running into each other. (Note that I copied the original use of “’23” as a shortening of 2023.)

Small multiples of WSJ/NORC survey data

[This post is by Dean Eckles.]

They did a graphical permutation test to see if students could reliably distinguish the observed data from permuted replications. Now the question is, how do we interpret the results?

1. Background: Comparing a graph of data to hypothetical replications under permutation

Last year, we had a post, I’m skeptical of that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”, discussing recently published “estimates of the causal impact of a poverty reduction intervention on brain activity in the first year of life.”

Here was the key figure in the published article:

As I wrote at the time, the preregistered plan was to look at both absolute and relative measures on alpha, gamma, and theta (beta was only included later; it was not in the preregistration). All the differences go in the right direction; on the other hand when you look at the six preregistered comparisons, the best p-value was 0.04 . . . after adjustment it becomes 0.12 . . . Anyway, my point here is not to say that there’s no finding just because there’s no statistical significance; there’s just a lot of uncertainty. The above image looks convincing but part of that is coming from the fact that the responses at neighboring frequencies are highly correlated.

To get a sense of uncertainty and variation, I re-did the above graph, randomly permuting the treatment assignments for the 435 babies in the study. Here are 9 random instances:
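The EEG data themselves aren’t reproduced here, but the permutation idea is simple; here’s a minimal sketch in R with made-up data (all names and dimensions are hypothetical stand-ins):

set.seed(1)
n_babies <- 435
n_freq <- 30                                          # number of frequency bins (made up)
eeg <- matrix(rnorm(n_babies * n_freq), n_babies)     # stand-in for the EEG power measurements
treat <- rbinom(n_babies, 1, 0.5)                     # stand-in treatment indicator
par(mfrow = c(3, 3))
for (i in 1:9) {
  perm <- sample(treat)                               # randomly permute the treatment labels
  diff <- colMeans(eeg[perm == 1, ]) - colMeans(eeg[perm == 0, ])
  plot(1:n_freq, diff, type = "l", xlab = "frequency bin",
       ylab = "treatment - control", main = paste("Permutation", i))
}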

2. Planning an experiment

Greg Duncan, one of the authors of the article in question, followed up:

We almost asked students in our classes to guess which of ~15 EEG patterns best conformed to our general hypothesis of negative impacts for lower frequency bands and positive impacts for higher-frequency bands. One of the graphs would be the real one and the others would be generated randomly in the same manner as in your blog post about our article. I had suggested that we wait until we could generate age and baseline-covariate-adjusted versions of those graphs . . . I am still very interested in this novel way of “testing” data fit with hypotheses — even with the unadjusted data — so if you can send some version of the ~15 graphs then I will go ahead with trying it out on students here at UCI.

I sent Duncan some R code and some graphs, and he replied that he’d try it out. But first he wrote:

Suppose we generate 14 random + 1 actual graphs; recruit, say, 200 undergraduates and graduate students; describe the hypothesis (“less low-frequency power and more high-frequency power in the treatment group relative to the control group”); and ask them to identify their top and second choices for the graphs that appear to conform most closely with the hypothesis. I would also have them write a few sentences justifying their responses in order to coax them to take the exercise seriously.

The question: how would you judge whether the responses convincingly favored the actual data? More than x% first-place votes; more than y% first or second place votes? Most votes? It would be good to pre-specify some criteria like that.

I replied that I’m not sure if the results would be definitive but I guess it would be interesting to see what happens.

Duncan responded:

I agree that the results are merely useful but not definitive.

I agree, and Drew Bailey, who was also involved in the discussion, added:

The earlier blog post used these graphs to show that the data, if manipulated with randomly-generated treatment dummies, produced an uncomfortable number of false positives. This new exercise would inform that intuition, even if we want to rely on formal statistics for the most systematic assessment of how confident we should be with the results.

3. Experimental conditions

Duncan was then ready to go. He wrote:

I am finally ready to test randomly generated graphs out on a large classroom of undergraduate students.

Paul Yoo used Stata to generate 15 random graphs plus the real one (see attached). The position (10th) in the 16 for the PNAS graph was determined from a random number draw. (We could randomize its position but that increases the scoring task considerably.) We put an edited version of the hypothesis that was preregistered/spelled out in our original NICHD R01 proposal below the graphs. My plan is to ask class members to select their first and second choices for the graph that conforms most closely to the hypothesis.

Bailey responded:

Yes, with the same caveat as before (namely, that the paths have already forked: we aren’t looking at a plot of frequency distributions for one of the many other preregistered outcomes in part because these impacts didn’t wind up on Andrew’s blog).

4. Results

Duncan reported:

97 students examined the 16 graphs shown in the 4th slide in the attached powerpoint file. The earlier slides set up the exercise and the hypothesis.

Almost 2/3rds chose the right figure (#10) on their first guess and 78% did so on their first or second guesses. Most of the other guesses are for figures that show more treatment-group power in the beta and gamma ranges but not alpha.

5. Discussion

I’m not quite sure what to make of this. It’s interesting and I think useful to run such experiments to help stimulate our thinking.

This is all related to the 2009 paper, Statistical inference for exploratory data analysis and model diagnostics, by Andreas Buja, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah Swayne, and Hadley Wickham.
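The nullabor package in R, by some of these same authors, implements this “lineup” idea. If I remember its documentation correctly, the canonical toy example looks something like this:

library(nullabor)
library(ggplot2)

# embed the real data in a lineup of null plots in which mpg has been randomly
# permuted; the position of the true plot is chosen at random
d <- lineup(null_permute("mpg"), mtcars)
ggplot(d, aes(mpg, wt)) +
  geom_point() +
  facet_wrap(~ .sample)

The viewer is then asked to pick out the plot that looks different, without being told which panel holds the real data.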

As with hypothesis tests in general, I think the value of this sort of test is when it does not reject the null hypothesis, which represents a sort of negative signal that we don’t have enough data to learn more on the topic.

The thing is, I’m not clear what to make of the result that almost 2/3rds chose the right figure (#10) on their first guess and 78% did so on their first or second guesses. On one hand, this is a lot better than the 1/16 and 1/8 we would expect by pure chance. On the other hand, the fact that some of the alternatives were similar to the real data . . . this is all getting me confused! I wonder what Buja, Cook, etc., would say about this example.
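To make the pure-chance comparison concrete, here’s a quick binomial calculation, under the (strong) assumption that each student guesses independently and uniformly at random:

# chance that 64 or more of 97 students (roughly two-thirds) pick the true plot,
# if each picks one of 16 plots at random
pbinom(63, size = 97, prob = 1/16, lower.tail = FALSE)

# first-or-second choice: 78% of 97 is roughly 76 students, and two picks out
# of 16 gives a chance probability of 2/16 = 1/8 per student
pbinom(75, size = 97, prob = 1/8, lower.tail = FALSE)

Both probabilities are essentially zero. But the uniform-guessing model is exactly what’s in question when some of the null plots look a lot like the real data.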

6. Expert comments

Dianne Cook responded in detail in comments. All of this is directly related to our discussion so I’m copying her comment here:

The interpretation depends on the construction of the null sets. Here you have randomised the group. There is no control of the temporal dependence or any temporal trend, so where the lines cross, or how volatile the lines are, is possibly distracting.

You have also asked a very specific one-sided question – it took me some time to digest what your question is asking. Effectively it is: in which plot is the solid line much higher than the dashed line only in three of the zones? When you are randomising groups, the group labels have no relevance, so it would be a good idea to set the higher-valued one to be the solid line in all null sets. Otherwise, some plots would be automatically irrelevant. People don’t need to know the context of a problem to be an observer for you, and it is almost always better if the context is removed. A different question, eg in which plot are the lines getting further apart at higher Hz, or in which plot are the two lines the most different, would likely yield different responses. The question you ask matters. We typically try to keep it generic: “which plot is different” or “which plot shows the most difference between groups”. Being too specific can create the same problem as creating the hypothesis post-hoc after you have seen the data, eg you spot clusters and then do a MANOVA test. You pre-registered your hypothesis so this shouldn’t be a problem. Thus your null hypothesis is “There is NO difference in the high-frequency power between the two groups.”

When you see as much variability in the null sets as you have here, it would be recommended to make more null sets. With more variability, you need more comparisons. Unlike a conventional test where we see the full curve of the sampling distribution and can check if the observed test statistic has a value in the tails, with randomisation tests we have a finite number of draws from the sampling distribution on which to make a comparison. Numerically we could generate tons of draws but for visual testing, it’s not feasible to look at too many. However, you still might need more than your current 15 nulls to be able to gauge the extent of the variability.

For your results, it looks like 64 of the 97 students picked plot 10 as their first pick. Assuming that this was done independently and that they weren’t having side conversations in the room, you could use nullabor to calculate the p-value:

> library(nullabor)
> pvisual(64, 97, 16)
      x simulated binom
[1,] 64         0     0

which means that the probability that this many people would pick plot 10, if it really were a null sample, is 0. Thus we would reject the null hypothesis and conclude, with strong evidence, that there is more high-frequency power in the high-cash group. You can include the second votes by weighting the p-value calculation by two picks out of 16 instead of one, but here the p-value is still going to be 0.

To understand whether observers are choosing the data plot for reasons related to the hypothesis, you have to ask them why they made their choice. Again, this should be very specific here because you’ve asked a very specific question, things like “the lines are constantly further apart on the right side of the plot”. For people who chose null plots instead of 10, it would be interesting to know what they were looking at. In this set of nulls, there are so many other types of differences! Plot 3 has differences everywhere. We know there are no actual group differences, so an observed difference this big is consistent with there being no true difference. It is ruled out as a contender only because the question asks whether there is a difference in 3 of the 4 zones. We see crossings of lines in many plots, so crossings are something we are very likely to see assuming the null is true. The big scissor pattern in 8 is interesting, but we know this has arisen by chance.

Well, this has taken some time to write. Congratulations on an interesting experiment and an interesting post. Care needs to be taken in designing data plots, constructing the null-generating mechanisms, and wording questions appropriately when you apply the lineup protocol in practice.

This particular work was born of curiosity about a published data plot. It reminds me of our work in Roy Chowdhury et al (2015) (https://link.springer.com/article/10.1007/s00180-014-0534-x). It was inspired by a plot in a published paper where the authors reported clustering. Our lineup study showed that this was an incorrect conclusion, and the clustering was due to the high dimensionality. I think your conclusion now would be that the published plot does show the high-frequency difference reported.

She also lists a bunch of relevant references at the end of the linked comment.

Problems with a CDC report: Challenges of comparing estimates from different surveys. Also a problem with rounding error.

A few months ago we reported on an article from the Columbia Journalism Review that made a mistake by comparing numbers from two different sources.

The CJR article said, “Before the 2016 election, most Americans trusted the traditional media and the trend was positive, according to the Edelman Trust Barometer. . . . Today, the US media has the lowest credibility—26 percent—among forty-six nations, according to a 2022 study by the Reuters Institute for the Study of Journalism.” That sentence makes it look like there was a drop of at least 25 percentage points (from “most Americans” to “26 percent”) in trust in the media over a six-year period. Actually, though, as noticed by sociologist David Weakliem, the “most Americans” number from 2016 came from one survey and the “26%” from 2022 came from a different survey asking an entirely different question. When comparing comparable surveys, the drop in trust was about 5 percentage points.

This comes up a lot: when you compare data from different sources and you’re not careful, you can get really wrong answers. Indeed, this can even arise if you compare data from what seems to be the same source—consider these widely differing World Bank estimates of Russia’s GDP per capita.

It happened to the CDC

Another example came up recently, this time from the Centers for Disease Control and Prevention. The story is well told in this news article by Glenn Kessler. It started out with a news release from the CDC stating, “More than 1 in 10 [teenage girls] (14%) had ever been forced to have sex — up 27% since 2019 and the first increase since the CDC began monitoring this measure.” But, Kessler continues:

A CDC spokesman acknowledged that the rate of growth highlighted in the news release — 27 percent — was the result of rounding . . . The CDC’s public presentation reported that in 2019, 11 percent of teenage girls said that sometime in their life, they had been forced into sex. By 2021, the number had grown to 14 percent. . . . the more precise figures were 11.4 percent in 2019 and 13.5 percent in 2021. That represents an 18.4 percent increase — lower than the initial figure, 27 percent.

Rounding can be tricky. It seems reasonable to round 11.4% to 11% and 13.5% to 14%—indeed, that’s how I would report the numbers myself, as in a survey you’d never realistically have the precision to estimate a percentage to an accuracy of less than a percentage point. Even if the sample is huge (which it isn’t in this case), the underlying variability of the personal-recall measurement is such that reporting fractional percentage points would be inappropriate precision.

But, yeah, if you’re gonna compare the two numbers, you should compute the ratio based on the unrounded numbers, then round at the end.
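The arithmetic is easy to check:

# percent change computed from the rounded figures, as in the news release
(14 - 11) / 11        # about 0.27, the "27 percent" figure

# percent change computed from the unrounded figures, rounding only at the end
(13.5 - 11.4) / 11.4  # about 0.184, i.e., an 18.4 percent increase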

This then logically brings us to the next step, which is that this “18.4% increase” can’t be taken so seriously either. It’s not that an 18.4% increase is correct and that a 27% increase is wrong: both are consistent with the data, along with lots of other possibilities.

The survey data as reported do show an increase (although there are questions about that too; see below), but the estimates from these surveys are just that—estimates. The proportion in 2019 could be a bit different from 11.4% and the proportion in 2021 could be a bit different from 13.5%. Considering sampling error alone, these data might be consistent with an increase of 5% from one year to the next, or 40%. (I didn’t do any formal calculations to get those numbers; this is just a rough sense of the range you might get, and I’m assuming the difference from one year to the other is “statistically significant,” so that the confidence interval for the change between the two surveys would exclude zero.)
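If you want a rough sense of what sampling error alone could do, here’s a toy simulation. The sample size is made up, the simple binomial model ignores the survey’s clustered design, and none of this touches nonsampling error:

set.seed(123)
n_sims <- 10000
n <- 4000   # hypothetical number of girls per survey wave; not the actual YRBS sample size
p_2019 <- rbinom(n_sims, n, 0.114) / n
p_2021 <- rbinom(n_sims, n, 0.135) / n
pct_change <- (p_2021 - p_2019) / p_2019
quantile(pct_change, c(0.025, 0.975))

A smaller effective sample size, which is what the clustered design implies, would make the interval for the percent change wider still.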

There’s also nonsampling error, which gets back to the point that these are two different surveys: sure, they were conducted by the same organization, but there will still be differences in nonresponse. Kessler discusses this too, linking to a blog post by David Stein, who looked into this issue. Given that the surveys are only two years apart, it does seem likely that any large increases in the rate could be explained by sampling and data-collection issues rather than representing large underlying changes. But I have not looked into all this in detail.

Show the time series, please!

The above sort of difficulty happens all the time when looking at changes in surveys. In general I recommend plotting the time series of estimates rather than just picking two years and making big claims from that. From the CDC page, “YRBSS Overview”:

What is the Youth Risk Behavior Surveillance System (YRBSS)?

The YRBSS was developed in 1990 to monitor health behaviors that contribute markedly to the leading causes of death, disability, and social problems among youth and adults in the United States. These behaviors, often established during childhood and early adolescence, include

– Behaviors that contribute to unintentional injuries and violence.
– Sexual behaviors related to unintended pregnancy and sexually transmitted infections, including HIV infection.
– Alcohol and other drug use.
– Tobacco use.
– Unhealthy dietary behaviors.
– Inadequate physical activity.

In addition, the YRBSS monitors the prevalence of obesity and asthma and other health-related behaviors plus sexual identity and sex of sexual contacts.

From 1991 through 2019, the YRBSS has collected data from more than 4.9 million high school students in more than 2,100 separate surveys.

So, setting aside everything else discussed above, I’d recommend showing time series plots from 1991 to the present and discussing recent changes in that context, rather than presenting a ratio of two numbers, whether that be 18% or 27% or whatever.
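A generic sketch of what that could look like (the file and column names here are hypothetical; these are not the actual YRBS estimates):

library(ggplot2)

# `yrbs_trend.csv` is a hypothetical file with one row per survey year and
# columns year, est (estimated percent), and se (standard error of the estimate)
yrbs <- read.csv("yrbs_trend.csv")
ggplot(yrbs, aes(year, est)) +
  geom_line() +
  geom_pointrange(aes(ymin = est - 2 * se, ymax = est + 2 * se)) +
  labs(x = "Survey year", y = "Estimated percent")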

Plotting the time series doesn’t remove any concerns about data quality; it’s just an appropriate general way to look at the data that gets us less tangled in statistical significance and noisy comparisons.

How to add scaffolding when making a graph that’s hard to follow?

Julien Gori writes with a visualisation / statistical graphics question:

Some time ago, I was working with a linear model with exponentially modified Gaussian noise, which I had reason to believe would fit my empirical data well. To assess the fit of the model to the data, I worked up a visualisation, which is essentially a combination of binning + QQplots. See figure above.

The visualisation is pretty straightforward for people who know how to interpret QQplots. In the example attached, you see that the empirical data is left skewed for low values of the independent variable (ID) and right skewed for high values of ID, which means I should probably have the variance of the noise increase with ID levels. I have a few ideas to make the visualisation better, which I might explore depending on your reply.

I didn’t think much of it at the time, but one thing led to another and I was advised to reach out to you . . . Do you see value in this type of visualisation; perhaps there are better ways of observing the same information, or perhaps it is an obvious visualisation?

My reply: the above graph is clever and I can imagine it being very useful to someone who’s deep into these data and wants to understand more about this particular pattern. As an outsider, though, I don’t find it straightforward at all! I was kinda thinking the red dots should be some ordered version of the blue dots, but the ranges of the points don’t line up. For example, there’s a red point with y above 7 but no blue point in that range. Looking at things the other way, there are a bunch of blue points below the red diagonal line but no red points below the line. And there are a bunch of blue points with x above 5 but only a few red points in that range, and none above 5.2 or 5.3.

Maybe what you need is an intermediate graph to help map out what you’re doing, both to help readers follow what is going on and maybe to help you interpret the data as well?

P.S. Gori supplies more detail here.
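P.P.S. For readers who haven’t worked with this kind of display, here’s a generic sketch of the binning + QQ-plot idea, with simulated data and a per-bin normal fit standing in for the exponentially modified Gaussian; this is not Gori’s code or data:

library(ggplot2)

set.seed(1)
# fake data: the response y gets more skewed as the independent variable ID increases
d <- data.frame(ID = rep(1:6, each = 200))
d$y <- 0.5 * d$ID + rnorm(nrow(d), sd = 0.5) + rexp(nrow(d), rate = 3 / d$ID)

# within each level of ID, compare empirical quantiles of y to quantiles of a
# fitted reference distribution (here a normal with matching mean and sd)
qq_bins <- do.call(rbind, lapply(split(d, d$ID), function(bin) {
  probs <- ppoints(100)
  data.frame(ID = bin$ID[1],
             theoretical = qnorm(probs, mean = mean(bin$y), sd = sd(bin$y)),
             empirical = quantile(bin$y, probs))
}))

ggplot(qq_bins, aes(theoretical, empirical)) +
  geom_point(size = 0.5) +
  geom_abline() +
  facet_wrap(~ ID, scales = "free")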

Tidyverse examples for Regression and Other Stories

Bill Behrman writes:

I commend you for providing the code used in Regression and Other Stories. In the class I’ve co-taught here at Stanford with Hadley Wickham, we’ve found that students greatly benefit from worked examples.

We’ve also found that students with no prior R or programming experience can, in a relatively short time, achieve considerable skill in manipulating data and creating data visualizations with the Tidyverse. I don’t think the students would be able to achieve the same level of proficiency if we used base R.

For this reason, as I read the book, I created a Tidyverse version of the examples.

For these examples, I tried to write model code suitable for learning how to manipulate data and create data visualizations using the Tidyverse.

For readers seeking the code for a given section of the book, I’ve provided a table of contents at the top of each example with links from book sections to the corresponding code.

I notice that the book’s website has a link to an effort to create a Python version of the examples. Perhaps the examples I’ve created can serve as a resource for those seeking to learn the Tidyverse tools for R.

This is a great resource! It’s good to see this overlap of modeling and exploratory data analysis (EDA) attitudes in statistics. Traditionally the modelers don’t take graphics and communication seriously, and the EDA people disparage models. For example, the Venables and Ripley book, excellent though it was, had a flaw in that it did not seem to take modeling seriously. I appreciate the efforts of Behrman, Wickham, and others on this, and I’m sure it will help lots of students and practitioners as well.

Behrman adds:

I couldn’t agree more on the complementary roles of EDA and modeling.

For some of the tidyverse ROS examples, I added an EDA section at the top, both to illustrate how to understand the basics of a dataset and to orient readers to the data before turning to the modeling.

We’ve found that with ggplot2, students with no prior R or programming experience can become quite proficient at data visualization. When we give them increasingly difficult EDA challenges, they actually enjoy becoming data detectives. Our alums doing data work in industry tell us that EDA is one of their most useful data tools.

Since ours is an introductory class, what we teach in “workflow” has modest aims, primarily to help students better organize their work. We created a function dcl::create_data_project() to automatically create a directory with subdirectories useful in almost all data projects. With two more commands, students can make this a GitHub repo.

We stress the importance of reproducibility and having all data transformations done by scripts. And to help their future selves, we show students how to use the Unix utility make to automate certain tasks.

P.S. We put their code, along with others, on the webpage for Regression and Other Stories:

And here’s their style guide and other material they use in their data science course.