with Lauren Kennedy and Jessica Hullman: “Causal quartets: Different ways to attain the same average treatment effect”

Lauren, Jessica, and I just wrote a paper that I really like, putting together some ideas we’ve been talking about for awhile regarding variation in treatment effects. Here’s the abstract:

The average causal effect can often be best understood in the context of its variation. We demonstrate with two sets of four graphs, all of which represent the same average effect but with much different patterns of heterogeneity. As with the famous correlation quartet of Anscombe (1973), these graphs dramatize the way in which real-world variation can be more complex than simple numerical summaries. The graphs also give insight into why the average effect is often much smaller than anticipated.

And here’s the background.

Remember that Anscombe (1973) paper with these four scatterplots that all look different but have the same correlation:

This inspired me to make four graphs showing with different individual patterns of causal effects but the same average causal effect:

And these four; same idea but this time conditional on a pre-treatment predictor:

As with the correlation quartet, these causal quartets dramatize all the variation that is hidden when you just look at a single-number summary.

In the paper, we don’t just present the graphs; we also give several real-world applications where this reasoning made a difference.

During the past few years, we’ve been thinking more and more about models for varying effects, and I’ve found over and over that considering the variation in an effect can also help us understand its average level.

For example, suppose you have some educational intervention that you hope or expect could raise test scores by 20 points on some exam. Before designing your study based on an effect of 20 points, think for a moment. First, the treatment won’t help everybody. Maybe 1/4 of the students are so far gone that the intervention won’t help them and 1/4 are doing so well that they don’t need it. Of the 50% in the middle, maybe only half of them will be paying attention during the lesson. And, of those who are paying attention, the new lesson might confuse some of them and make things worse. Put all this together: if the effect is in the neighborhood of 20 points for the students who are engaged by the treatment, then the average treatment effect might be more like 3 points.

OK, I made up all those numbers. The point is, these are things we should be thinking about whenever you design or analyze a study—but, until recently, I wasn’t!

Anyway, these issues have been coming up a lot lately—it’s the kind of thing where, once you see it, you start seeing it everywhere, you can’t un-see it—and I was really excited about this “graphical quartet” idea as a way to make it come to life.

As part of this project, Jessica created an R package, causalQuartet, so you can produce your own causal quartets. Just follow the instructions on that Github page and go for it!

Software to sow doubts as you meta-analyze

This is Jessica. Alex Kale, Sarah Lee, TJ Goan, Beth Tipton, and I write,

Scientists often use meta-analysis to characterize the impact of an intervention on some outcome of interest across a body of literature. However, threats to the utility and validity of meta-analytic estimates arise when scientists average over potentially important variations in context like different research designs. Uncertainty about quality and commensurability of evidence casts doubt on results from meta-analysis, yet existing software tools for meta-analysis do not necessarily emphasize addressing these concerns in their workflows. We present MetaExplorer, a prototype system for meta-analysis that we developed using iterative design with meta-analysis experts to provide a guided process for eliciting assessments of uncertainty and reasoning about how to incorporate them during statistical inference. Our qualitative evaluation of MetaExplorer with experienced meta-analysts shows that imposing a structured workflow both elevates the perceived importance of epistemic concerns and presents opportunities for tools to engage users in dialogue around goals and standards for evidence aggregation.

One way to think about good interface design is that we want to reduce sources of the “friction” like the cognitive effort users have to exert when they go to do some task; in other words minimize the so-called gulf of execution. But then there are tasks like meta-analysis where being on auto-pilot can result in misleading results. We don’t necessarily want to create tools that encourage certain mindsets, like when users get overzealous about suppressing sources of heterogeneity across studies in order to get some average that they can interpret as the ‘true’ fixed effect. So what do you do instead? One option is to create a tool that undermines the analyst’s attempts to combine disparate sources of evidence every chance it gets. 

This is essentially the philosophy behind MetaExplorer. This project started when I was approached by an AI firm pursuing a contract with the Navy, where systematic review and meta-analysis are used to make recommendations to higher-ups about training protocols or other interventions that could be adopted. Five years later, a project that I had naively figured would take a year (this was my first time collaborating with a government agency) culminated in a tool that differs from other software out there primarily in its heavy emphasis on sources of heterogeneity and uncertainty. It guides the user through making their goals explicit, like what the target context they care about is; extracting effect estimates and supporting information from a set of studies; identifying characteristics of the studied populations and analysis approaches; and noting concerns about assymmetries, flaws in analysis, or mismatch between the studied and target context. These sources of epistemic uncertainty get propagated to a forest plot view where the analyst can see how an estimate varies as studies are regrouped or omitted. It’s limited to small meta-analyses of controlled experiments, and we have various ideas based on our interviews of meta-analysts that could improve its value for training and collaboration. But maybe some of the ideas will be useful either to those doing meta-analysis or building software. Codebase is here.

Lancet finally publishes letters shooting down erroneous excess mortality estimates.

Ariel Karlinsky writes:

See here for our (along with 4 other letters) critique of IHME/Lancet covid excess mortality estimates which Lancet has published after first dragging their feet for almost a year, then rejecting it and then accepting it.

Our letter and this tweet with the above graph shows the issue best than the letter itself, and should be right up your alley as it plots the raw data and doesn’t hide it behind regression coefficients, model averaging etc.

I wonder if there was some politics involved? I say this because when Lancet screws up there often seems to be some some political angle.

On the plus side, it took them less than a year to publish the critique, which is slower than they were with Surgisphere but much faster than with that Andrew Wakefield article.

P.S. Here are some old posts on the University of Washington’s Institute for Health Metrics and Evaluation (not to be confused with the Department of Epidemiology at that university):

14 Apr 2020: Hey! Let’s check the calibration of some coronavirus forecasts.

5 May 2020: Calibration and recalibration. And more recalibration. IHME forecasts by publication date

9 May 2021: Doubting the IHME claims about excess deaths by country

19 Sep 2021: More on the epidemiologists who other epidemiologists don’t trust

Statistical analysis: (1) Plotting the data, (2) Constructing and fitting models, (3) Plotting data along with fitted models, (4) Further modeling and data collection

It’s a workflow thing.

Here’s the story. Carlos Ronchi writes:

I have a dataset of covid hospitalizations from Brazil. The values of interest are day of first symptoms, epidemiological week and day of either death or cure. Since the situation in Brazil has been escalating and getting worse every day, I wanted to compare the days to death in hospitalized young people (20-29 years) between two sets of 3 epidemiological weeks, namely weeks 1-3 and 8-10. The idea is that with time the virus in Brazil is getting stronger due to mutations and uncontrolled number of cases, so this is somehow reflected in the time from hospitalization to death.

My idea was to do an Anova by modeling the number of days to death from hospitalization in patients registered in 3 epidemiological weeks with a negative binomial regression. The coefficients would follow a normal distribution (which would be exponentiated afterwards). Once we have the coefficients we can simply compare the distributions and check the probability that the days to death are bigger/smaller in one of the groups.

Do you think this is a sound approach? I’m not sure, since we have date information. The thing is I don’t know how I would do a longitudinal analysis here, even if it makes sense.

My reply: I’m not sure either, as I’ve never done an analysis quite like this, so here are some general thoughts.

First step: Plotting the data

Start by graph the data using scatterplots and time-series plots. In the absence of variation in outcomes, plotting the data would tell us the entire story, so from this point of view the only reason we need to go beyond direct plots is to smooth out variation. Smoothing the variation is important—at some point you’ll want to fit a model, I fit models all the time!—; I just think that you want to start with plotting, for several reasons:

1. You can sometimes learn a lot from a graph: seeing patterns you expected to see can itself be informative, and then there are often surprises as well, things you weren’t expecting to see.

2. Seeing the unexpected, or even thinking about the unexpected, can stimulate you to think more carefully about “the expected”: What exactly did you think you might see? What would constitute a surprise? Just as the steps involved in planning an experiment can be useful in organizing your thoughts even if you don’t actually go and collect the data, so can planning a graph be helpful in arranging your expectations.

3. A good plot will show variation (any graph should contain the seeds of its own destruction), and this can give you a sense of where to put your modeling effort.

Remember that you can make lots of graphs. Here, I’m not talking about a scatterplot matrix or some other exhaustive set of plots, but just of whatever series of graphs you make while exploring your data. Don’t succumb to the Napoleon-in-Russia fallacy of thinking you need to make one graph that shows all the data at once. First, that often just can’t be done; second, even if a graph with all the data can be constructed, it can be harder to read than a set of plots; see for example Figure 4.1 of Red State Blue State.

Second step: Statistical modeling

Now on to the modeling. The appropriate place for modeling in data analysis is in the “sweet spot” or “gray zone” between (a) data too noisy to learn anything and (b) patterns so clear that no formal analysis is necessary. As we get more data or ask more questions, this zone shifts to the left or right. That’s fine. There’s nothing wrong with modeling in regions (a) or (b); these parts of the model don’t directly give us anything new, but they bridge to the all-important modeling in the gray zone in the middle.

Getting to the details: the way the problem is described in the above note, I guess it makes sense to fit a hierarchical model with variation across people and over time. I don’t think I’d use a negative binomial model of days to death; to me, it would be more natural to model time to death as a continuous variable. Even if the data happen to be discrete in that they are rounded to the nearest day, the underlying quantity is continuous and it makes sense to construct the model in that way. This is not a big deal; it’s relevant to our general discussion only in the “pick your battles” sense that you don’t want to spend your effort modeling some not-so-interesting artifacts of data collection. In any case, the error term is the least important aspect of your regression model.

Third step: Using graphs to understand and find problems with the model

After you’ve fit some models, you can graph the data along with the fitted models and look for discrepancies.

Fourth step: Improving the model and gathering more data

There are various ways in which your inferences can be lacking:

1. No data in regime of interest (for example, extrapolating about 5-year survival rates if you only have 2 years of data)

2. Data too noisy to get a stable estimate. This could be as simple as the uncertainty for some quantity of interest being larger than you’d like.

3. Model not fitting the data, as revealed by your graphs in the third step above.

These issues can motivate additional modeling and data collection.

Controversy over an article on syringe exchange programs and harm reduction: As usual, I’d like to see more graphs of the data.

Matt Notowidigdo writes:

I saw this Twitter thread yesterday about a paper recently accepted for publication. I thought you’d find it interesting (and maybe a bit amusing).

It’s obvious to the economists in the thread that it’s a DD [difference-in-differences analysis], and I think they are clearly right (though for full disclosure, I’m also an economist). The biostats author of the thread makes some other points that seem more sensible, but he seems very stubborn about insisting that it’s not a DD and that even if it is a DD, then “the literature” has shown that these models perform poorly when used on simulated data.

The paper itself is obviously very controversial and provocative, and I’m sure you can find plenty of fault in the way the Economist writes up the paper’s findings. I think the paper itself strikes a pretty cautious tone throughout, but that’s just my own judgement.

I took a look at the research article, the news article, and the online discussion, and here’s my reply:

As usual I’d like to see graphs of the raw data. I guess the idea is that these deaths went up on average everywhere, but on average more in comparable counties that had the programs? I’d like to see some time-series plots and scatterplots, also whassup with that bizarre distorted map in Figure A2? Also something weird about Figure A6. I can’t imagine there are enough counties with, say, between 950,000 and 1,000,000 people to get that level of accuracy as indicated by the intervals. Regarding the causal inference: yes, based on what they say it seems like some version of difference in differences, but I would need to see the trail of breadcrumbs from data to estimates. Again, the estimates look suspiciously clean. I’m not saying the researchers cheated, they’re just following standard practice and leaving out a lot of details. From the causal identification perspective, it’s the usual question of how comparable are the treated and control groups of counties: if they did the intervention in places that were anticipating problems, etc. This is the usual concern with observational comparisons (diff-in-diff or otherwise), which was alluded to by the critic on twitter. And, as always, it’s hard to interpret standard errors from models with all these moving parts. I agree that the paper is cautiously written. I’d just like to see more of the thread from data to conclusions, but again I recognize that this is not how things are usually done in the social sciences, so to put in this request is not an attempt to single out this particular author.

It can be difficult to blog on examples such as this where the evidence isn’t clear. It’s easy to shoot down papers that make obviously ridiculous claims, but this isn’t such a case. The claims are controversial but not necessarily implausible (at least, not to me, but I’m a complete outsider.). This paper is an example of a hard problem with messy data and a challenge of causal inference from non-experimental data. Unfortunately the standard way of writing these things in econ and other social sciences is to make bold claims, which then encourages exaggerated headlines. Here’s an example. Click to the Economist article and the headline is the measured statement, “America’s syringe exchanges might be killing drug users. But harm-reduction researchers dispute this.” But the Economist article’s twitter link says, “America’s syringe exchanges kill drug users. But harm-reduction researchers are unwilling to admit it.” I guess the Economist’s headline writer is more careful than their twitter-feed writer!

The twitter discussion has some actual content (Gilmour has some graphs with simulated data and Packham has some specific responses to questions) but then the various cheerleaders start to pop in, and the result is just horrible, some mix on both sides of attacking, mobbing, political posturing, and white-knighting. Not pretty.

In its subject matter, the story reminded me of this episode from a few years ago, involving an econ paper claiming a negative effect of a public-health intervention. To their credit, the authors of that earlier paper gave something closer to graphs of raw data—enough so that I could see big problems with their analysis, which led me to general skepticism about their claims. Amusingly enough, one of the authors of the paper responded on twitter to one of my comments, but I did not find the author’s response convincing. Again, it’s a problem with twitter that even if at some point there is a response to criticism the response tends to be short. I think blog comments are a better venue for discussion; for example I responded here to their comment.

Anyway, there’s this weird dynamic where that earlier paper displayed enough data for us to see big problems with its analysis, whereas the new paper does not display enough for us to tell much at all. Again, this does not mean the new paper’s claims are wrong, it just means it’s difficult for me to judge.

This all reminds me of the idea, based on division of labor (hey, you’re an economist! you should like this idea!), that the research team that gathers the data can be different from the team that does the analysis. Less pressure then to come up with strong claims, and then data would be available for more people to look at. So less of this “trust me” attitude, both from critics and researchers.

They say that stocks go down during the day and up at night.

Bruce Knuteson writes:

Prompted by your blog post this morning, I attach a plot from Figure 3 of They Still Haven’t Told You showing overnight and intraday returns to AIG (with logarithmic vertical scale, updated with data through the end of October).

If you invested $1 in AIG at the start of 1990 and received only intraday returns (from market open to market close), you would be left with one-twentieth of a penny, suffering a cumulative return of -99.95%. If you received only overnight returns (from market close to the next day’s market open), you would have $1,017, achieving a cumulative return of roughly +101,600%.

You can easily reproduce this plot yourself. Data are publicly available from Yahoo Finance.

AIG is just one of many stocks with a suspiciously divergent time series of overnight and intraday returns.

If you have a compelling innocuous explanation for these strikingly suspicious overnight and intraday returns that I have not already addressed, I would of course be keen to hear it.

Alternatively, if you can think of a historical example of a strikingly suspicious return pattern in a financial market that turned out to clearly be fine, I would be keen to hear it.

If neither, perhaps you can bring these strikingly suspicious return patterns to the attention of your readers.

What continues to stun me is how something can be clear and unambiguous, and it still takes years or even decades to resolve.

The linked article is fun to read, but I absolutely have no idea about this, so just sharing with you. Make of it what you will. You can also read this news article from 2018 by Matt Levine which briefly discusses Knuteson’s idea.

“The Undying Holiday-Suicide Myth”

Kevin Lewis points us to this article from the Annenberg Public Policy Center:

The holiday-suicide myth, the false claim that the suicide rate rises during the year-end holiday season, persisted in some news coverage through the 2021-22 holidays . . . In fact, although the U.S. suicide rate increased in 2021 after two years of declines, the average daily suicide rate during the holiday months remained among the lower rates of the year.

APPC’s media analysis, which is based on newspaper stories published over the 2021-22 holiday season, found that a little more than half of the stories that directly discussed the holidays and the suicide rate supported the false myth, while the remainder debunked it.

How horrible! It’s a Gladwell world out there, we just live in it.

The APPC article continues with some graphs:

OK, just 15 of those bad news stories in the past year. That’s 15 more than we’d hope, but not as bad as it sounded from the headline!

Also, I gotta say that this second graph above is not so great. It takes up a lot of space and conveys very little information. Why not just show the two lines of raw data (number of news stories supporting or debunking) on the same graph? That would be much clearer and would avoid the weird Rorschach dynamics of the second graph above. (By the way: I don’t think I’ve ever been able to spell Rorschach correct on the first try. Or even the second try.)

But let’s forget about the confusing graph and forget about the problem with the news stories. What about the data? APPC supplies the graph:

This is a nice graph.

The only thing is . . . given the whole holiday thing, it would be helpful to see the data for each week. They supply a ridiculous table which provides essentially no information beyond the time series graph but they manage to display so that it takes up a whole page on the screen:

I made it small here to spare you some of the pain. “130.52,” indeed. Why stop at 2 decimal places? Why not give full precision with 130.5183428923748923748293?

In any case, the topic of suicide is important. The APPC article states:

It’s important for reporters and news organizations to dispel the myth because allowing people to think that suicide is more likely during the holiday season can have contagious effects on people who are contemplating suicide. National recommendations for reporting on suicide advise journalists not to promote information that can increase contagion, such as reports of epidemics or seasonal increases, especially when the claim has no basis in fact.

I get that . . . but would they also say it’s wrong for news outlets to report that suicide rates are higher in the summer and early fall? I’m kinda suspicious of the recommendation not to report things.

Just to be on the safe side, I delayed this post until February.

Alberto Cairo’s visualization course

Alberto Cairo writes:

Every semester I [Cairo] teach my regular introduction to information design and data visualization class. Most students are data scientists, statisticians, engineers, interaction designers, plus a few communication and journalism majors.

At the beginning of the semester, many students are wary about their lack of visual design and narrative skills, and they are often surprised at how fast they can improve if they are willing to engage in intense practice and constant feedback. I’m not exaggerating when writing “intense”: an anonymous former student perfectly described the experience of taking my class in RateMyProfessors: “SO. MUCH. WORK”.

Indeed. The only way to learn a craft is to practice the craft nonstop.

My classes consist of three parts:

First month: lectures, readings, discussions, and exercises to master concepts, reasoning, and software tools. I don’t grade these exercises, I simply give credit for completion, but I hint what grades students would receive if I did grade them.

Second month: Project 1. I give students a general theme and a client. This semester I chose The Economist magazine’s Graphic Detail section, so a requirement for the project was that students tried to mimic its style. Once a week during this second month I give each student individualized advice on their progress prior to the deadline. I don’t give most feedback after they turn their project in, but before.

Third month: Project 2. I give students complete freedom to choose a topic and a style. I also provide weekly feedback, but it’s briefer and more general than on Project 1.

He then shares some examples of student projects. The results are really impressive! Sure, one reason they look so good is that they’re copying the Economist’s style (see “Second month”) above, but that’s fine. To have made something that looks so clean and informative is a real accomplishment and a great takeaway from a semester-long course.

When I teach, I try—not always with success—to make sure that, when the course is over, students can do a few things they could not do before, and that they can fit these new things into their professional life. Cairo seems to have done this very effectively here.

You wish you’re first to invent this scale transform: -50 x*x + 240 x – 7

This post is written by Kaiser, not Andrew.

Harvard Magazine printed the following chart that confirms a trend in undergraduate majors that we all know about: students are favoring STEM majors at the expense of humanities.

I like the chart. The dot plot is great for showing this dataset. They placed the long text horizontally. The use of color is crucial, allowing us to visually separate the STEM majors from the humanities majors.

Then, the axis announced itself.
I was baffled, then disgusted.
Here is a magnified view of the axis:

Notice the following features of this transformed scale:

  • It can’t be a log scale because many of the growth values are negative.
  • The interval for 0%-25% is longer than for 25%-50%. The interval for 0%-50% is also longer than for 50%-100%. On the positive side, the larger values are pulled in and the smaller values are pushed out.
  • The interval for -20%-0% is the same length as that for 0%-25%. So, the transformation is not symmetric around 0.

I have no idea what transformation was applied. I took the growth values, measured the locations of the dots, and asked Excel to fit a polynomial function, and it gave me a quadratic fit, R square > 99%.

Here’s the first time I’ve seen this transform: -50 x^2 + 240 x – 7.

This formula fits the values within the range extremely well. I hope this isn’t the actual transformation. That would be disgusting. Regardless, they ought to have advised readers of their unusual scale.

Using the inferred formula, I retrieved the original growth values. They are not extreme, falling between -50% and 125%. There is nothing to be gained by transforming the scale.

The following chart undoes the transformation.

(I also grouped the majors by their fields.)

P.S. A number of readers have figured out the actual transformation, which is the log of the relative ratio of the number of degrees. A reader also pointed us to an article by Benjamin Schmidt who made the original chart. In several other analyses, he looked at “market share” metrics. I prefer the share metric also for this chart. In the above, a 50% increase doesn’t really balance out a 33% decrease because the 2011 values differ across majors.

P.P.S. Schmidt adds some useful information in comments:

Yeah, this is code that I run every year. Harvard Magazine asked if they could use the 2022 version, and must have done something in Illustrator. (Also they dropped a bunch of fields that don’t apply to Harvard–Harvard has no business major, so they hide business.)

I’ll switch it to label fractions rather than percentages next time, and grouping across areas as above is a solid improvement. But the real problem here is that there aren’t *enough* log scales in the world for it to be obvious what’s going on. They are better when discussing rates of change. A linear scale implies the possibility of a 150% drop. That’s worse than meaningless–it’s connected to a lot of mistaken reasoning about percentages. (E.g. people thinking that a 5% drop followed by a 5% rise would bring you back to the same point; the failure to understand compound interest; etc.) IMO charts shouldn’t reflect expectations when those expectations rest on bad mental models.

Here, FWIW, is the code.

p1 = c2 %>%
ggplot() + aes(color = label, y = change, x = reorder(Discipline, change, function(x) {x[1]})) +
geom_point() + labs(title=paste0(“Fig. 1: Change in degrees, “, y1, “-“, y2), y = paste0(“Change in number awarded “, y1, “-“, y2), x = “Major”, color = “General Field”, caption = “Sources: NCES IPEDS data; taxonomy adapted from American Academy of Arts and Sciences.\nBen Schmidt, 2022”) +
coord_flip() + scale_y_continuous(breaks = c(1/2, 2/3, 4/5, 1, 5/4, 3/2, 2, 3), labels=scales::percent(c(1/2, 2/3, 4/5, 1, 5/4, 3/2, 2, 3) – 1)) + theme_bw() + scale_x_discrete(“Major”) + theme(legend.position = “bottom”) + scale_color_brewer(type=’qual’, palette = 2, guide = guide_legend(ncol = 2))

Labeling the x and y axes: Here’s a quick example where a bit of care can make a big difference.

I’ll have more to say about the above graph some other time—it comes from this excellent post from Laura Wattenberg. Here I just want to use it as an example of statistical graphics.

Overall, the graph is great: clear label, clean look, good use of color. I’m not thrilled with the stacked-curve thing—it makes it hard to see exactly what’s going on with the girls’ names in the later time period—I think I’d prefer a red line for the girls, a blue line for the boys, and a gray line with total so you can see that if you want. Right now the graph shows boys and total, so we need to do some mental subtraction to see the trend for girls. Also, a very minor thing: the little squares with M and F on the bottom could be a bit bigger and just labeled as “boys” and “girls” or “male” and “female” rather than just “M” and “F.”

Also, I think the axis labels could be improved.

Labeling the x-axis every four years is kind of a mess: I’d prefer just labeling 1900, 1920, 1940, etc., with tick marks every 5 and 10 years.

As for the y-axis, it’s not so intuitive to have labels at 4, 8, 12, etc—I’d prefer 5, 10, 15, etc., or really just 10, 20, 30 would be fine—; also, I don’t find numbers such as 10,000 per million to be so intuitive. I’d prefer expressing in terms of percentages. So that 10,000 per million would be 1%. Labeling the y-axis as 1%, 2%, etc., that should be clearer, no? Lots less mysterious than 4000, 8000, etc. Right now, about 3% of baby names end in “i.” At the earlier peak in the 1950s-70s, it was about 2%.

Just to be clear: I like the graph a lot. The criticisms point to ways to make it even better (I hope).

The point of this post is how, with a bit of thought, we can improve a graph. We see lots of bad graphs that can be improved. It’s more interesting to see how even a good graph can be fixed up.

Data-perception tasks where sound perception is more effective than sight

Francis Goodling writes:

When you cut [remove a frame from a movie reel], you also lose a frame’s worth of the magnetic strip that holds the soundtrack. And while a missing 1/24 of a second is undetectable to the eye, it turns out that 1/24 of a second in lost sound is impossible to miss: there is a tic in the music, a skip in the background noise, or a word that has a bite taken out of it. You can’t see the lost frame, but you can hear it. At 24 frames per second, the eye loses track and registers seamless animation, but the ear is counting time.

Interesting. In Delivering data differently, Gwynn, Jonathan, and I talked about the comparative advantages and disadvantages of looking vs. listening as modes of understanding data. With visualization you can consider a lot more data at once: this is a result of visualization occurring over space, whereas sounds develop over time, also our brains can just process a lot more information through visual than sonic channels. Two advantages of sonic perception that we did consider were:

1. To gather visual information you have to look; you can’t do much visual processing in the background. In contrast, you can notice sound without paying attention. This suggests a role for sonic information transfer for diagnostic methods where the goal is not to draw inferences, perceive complex patterns, or synthesize large amounts of data but rather to be alerted to sudden changes.

2. Sound and music are emotionally engaging in a way that visual images typically are not. Sounds can be soothing, annoying, or all sorts of other things. Perhaps there is some way to harnessing this emotional connection for the purpose of learning from data.

Goodling’s above-quoted remark suggests a third distinction:

3. With vision, our brain fills in gaps so it can be difficult to notice small changes; consider those “find 7 differences between these two pictures” puzzles they used to have in the kids’ pages in the newspaper. In contrast, sonic gaps or discordances jump out at you; recall that famous Bugs Bunny cartoon with the wrong note on the piano. It’s similar to how a block of wood can look smooth, but if you run a finger along it you’ll feel the imperfections (and maybe even get a splinter)!.

When I was first thinking about general data perception, I was thinking of other senses as replacements for sight, thus trying to come up with sonic or haptic alternatives to scatterplots, time series plots, and so on. My current thinking is to set aside the things that vision can do best, and instead focus on the applications where the other senses are effective in ways that vision is not.

Bayesian geomorphology

John “not the Jaws guy” Williams points us to this article by Oliver Korup which begins:

The rapidly growing amount and diversity of data are confronting us more than ever with the need to make informed predictions under uncertainty. The adverse impacts of climate change and natural hazards also motivate our search for reliable predictions. The range of statistical techniques that geomorphologists use to tackle this challenge has been growing, but rarely involves Bayesian methods. Instead, many geomorphic models rely on estimated averages that largely miss out on the variability of form and process. Yet seemingly fixed estimates of channel heads, sediment rating curves or glacier equilibrium lines, for example, are all prone to uncertainties. Neighbouring scientific disciplines such as physics, hydrology or ecology have readily embraced Bayesian methods to fully capture and better explain such uncertainties, as the necessary computational tools have advanced greatly. The aim of this article is to introduce the Bayesian toolkit to scientists concerned with Earth surface processes and landforms, and to show how geomorphic models might benefit from probabilistic concepts. I briefly review the use of Bayesian reasoning in geomorphology, and outline the corresponding variants of regression and classification in several worked examples.

Cool! It’s good to see Bayesian ideas in different scientific fields. And they use Stan. I also like that they have graphs of data and fitted models.

What is the original source of the bulls-eye graphs that represent bias and variance?

You’ve all seen those bulls-eye pictures representing bias and variance (see for example the wikipedia page). Where did this come from? I’m writing an article where I want to cite the idea and I’m not sure where to look. It’s a pretty natural idea but it must come from somewhere, right?

Update 3 – World Cup Qatar 2022 predictions (round of 16)

World Cup 2022 is progressing, many good matches and much entertainment. Time then for World Cup 2022 predictions of the round of 16 matches from our DIBP model  – here the previous update. In the group stage matches the average of the model probabilities for the actual final results was about 0.52.

Here there are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 round of 16 to be played from December 3rd to December 6th, along with some ppd ‘chessboard plots’ for the exact outcomes in gray-scale color – ‘mlo’ in the table denotes the ‘most likely result’ , whereas darker regions in the plots correspond to more likely results. In the plots below, the first team listed in each sub-title is the ‘favorite’ (x-axis), whereas the second team is the ‘underdog’ (y-axis). The 2-way grid displays the 8 held-out matches in such a way that closer matches appear at the top-left of the grid, whereas more unbalanced matches (‘blowouts’) appear at the bottom-right.  The matches are then ordered from top-left to bottom-right in terms of increasing winning probability for the favorite teams. The table reports instead the matches according to a chronological order.

Apparently, Brazil is highly favorite against South Korea, and Argentina seems much ahead against Australia, whereas much balance is predicted for Japan-Croatia, Netherlands-United States and Portugal-Switzerland. Note: take in consideration that these probabilities refer to the regular times, then within the 90 minutes. The model does not capture supplementary times probabilities.

You find the complete results, R code and analysis here. Some preliminary notes and model limitations can be found here.

Next steps: we’ll update the predictions for the quarter of finals. We are still discussing about the possibility to report some overall World Cup winning probabilities, even though I am personally not a huge fan of these ahead-predictions (even coding this scenario is not straightforward…!). However, we know those predictions could be really amusing for fans, so maybe we are going to report them after the round of 16. We also could post some pp checks for the model and more predictive performance measures.

Stay tuned!

The more I thought about them, the less they seemed to be negative things, but appeared in the scenes as something completely new and productive

This is Jessica. My sabbatical year, which most recently had me in Berkeley CA,  is coming to an end. For the second time since August I was passing through Iowa. Here it is on the way out to California from Chicago and on the way back.

A park in Iowa in AugustA part in Iowa in November

If you squint (like, really really squint), you can see a bald eagle overhead in the second picture.

One association that Iowa always brings to mind for me is that Arthur Russell, the musician, grew up there. I have been a fan of Russell’s music for years, but somehow had missed Iowa Dream, released in 2019 (Russell died of AIDS in 1992, and most of his music has been released posthumously). So I listened to it while we were driving last week. 

Much of Iowa Dream is Russell doing acoustic and lofi music, which can be surprising if you’ve only heard his more heavily produced disco or minimalist pop. One song, called Barefoot in New York, is sort of an oddball track even amidst the genre blending that is typical of Russell. It’s probably not for everyone, but as soon as I heard it I wanted to experience it again. 

NPR called it “newfound city chaos” because Russell wrote it shortly after moving to New York, but there’s also something about the rhythm and minutae of the lyrics that kind of reminds me of research. The lyrics are tedious, but things keep moving like you’re headed towards something. The speech impediment evokes getting stuck at times and having to explore one’s way around the obstruction. Sometimes things get clear and the speaker concludes something. Then back to the details that may or may not add up to something important. There’s an audience of backup voices who are taking the speaker seriously and repeating bits of it, regardless of how inconsequential. There’s a sense of bumbling yet at the same time iterating repeatedly on something that may have started rough but becomes more refined. 

Then there’s this part:

I really wanted to show somehow how things deteriorate

Or how one bad thing leads to another

At first, there were plenty of things to point to

Lots of people, places, things, ideas

Turning to shit everywhere

I could describe these instances

But the more I thought about them

The less they seemed to be negative things

But appeared in the scenes as something completely new and productive

And I couldn’t talk about them in the same way

But I knew it was true that there really are

Dangerous crises

Occurring in many different places

But I was blind to them then

Once it was easy to find something to deplore

But now it’s even worse than before

I really like these lyrics, in part because they make me uncomfortable. On the one hand, the idea of wanting to criticize something, but losing the momentum as things become harder to dismiss closer up, seems opposite of how many realizations happen in research, where a few people start to notice problems with some conventional approach and then it becomes hard to let them go. The replication crisis is an obvious example, but this sort of thing happens all the time. In my own research, I’ve been in a phase where I’m finding it hard to unsee certain aspects of how problems are underspecified in my field, so some part of me can’t relate to everything seeming new and productive. 

But at the same time the idea of being won over by what is truly novel feels familiar when I think about the role of novelty in defining good research. I imagine this is true in all fields to some extent, but especially in computer science, there’s a constant tension around how important novelty is in determining what is worthy of attention. 

Sometimes novelty coincides with fundamentally new capabilities in a way that’s hard to ignore. The reference to potentially “dangerous crises” brings to mind the current cultural moment we’re having with massive deep learning models for images and text. For anyone coming from a more classical stats background, it can seem easy to want to dismiss throwing huge amounts of unlabeled data at too-massive-and-ensembled-to-analyze models as a serious endeavor… how does one hand off a model for deployment if they can’t explain what it’s doing? How do we ensure it’s not learning spurious cues, or generating mostly racist or sexist garbage? But the performance improvements of deep neural nets on some tasks in the last 5 to 10 years is hard to ignore, and phenomena like how deep nets can perfectly interpolate the training data but still not overfit, or learn intermediate representations that align with ground truth even when fed bad labels, makes it hard to imagine dismissing them as a waste of our collective time. Other areas, like visualization, or databases, start to seem quaint and traditional. And then there’s quantum computing, where the consensus in CS departments seems to be that we’re going all in regardless of how many years it may still be until its broadly usable. Because who doesn’t like trying to get their head around entanglement? It’s all so exotic and different.

I think many people gravitate to computer science precisely because of the emphasis on newness and creating things, which can be refreshing compared to fields where the modal contribution it to analyze rather than invent. We aren’t chained to the past the way many other fields seem to be. It can also be easier to do research in such an environment, because there’s less worry about treading on ground that’s already been covered.

But there’s been pushback about requiring reviewers to explicitly factor novelty into their judgments about research importance or quality, like by including a seperate ranking for “originality” in a review form like we do in some visualization venues. It does seem obvious that including statements like “We are first to …” in the introduction of our papers as if this entitles us to publication doesn’t really make the work better. In fact, often the statements are wrong, at least in some areas of CS research where there’s a myopic tendency to forget about all but the classic papers and what you saw get presented in the last couple years. And I always cringe a little when I see simplistic motiations in research papers like, no one has ever has looked at this exact combination (of visualization, form of analysis etc) yet. As if we are absolved of having to consider the importance of a problem in the world when we decide what to work on.

The question would seem to be how being oriented toward appreciating certain kinds of novelty, like an ability to do something we couldn’t do before, affects the kinds of questions we ask, and how deep we go in any given direction over the longer term. Novelty can come from looking at old things in new ways, for example developing models or abstractions that relate previous approaches or results. But these examples don’t always evoke novelty in the same way that examples of striking out in brand new directions do, like asking about augmented reality, or multiple devices, or fairness, or accessibility, in an area where previously we didn’t think about those concerns much.

If a problem is suddenly realized to be important, and the general consensus is that ignoring it before was a major oversight, then its hard to argue we should not set out to study the new thing. But a challenge is that if we are always pursuing some new direction, we get islands of topics that are hard to relate to one another. It’s useful for building careers, I guess, to be able to relatively easily invent a new problem or topic and study it in a few papers then move on. And I think it’s easy to feel like progress is being made when you look around at all the new things being explored. There’s a temptation I think to assume that  it will all “work itself out” if we explore all the shiny new things that catch our eye, because those that are actually important will in the end get the most attention. 

But beyond not being to easily relate topics to one another, a problem with expanding, at all times, in all directions at once, would seem to be that no particular endeavor is likely to be robust, because there’s always an excitement about moving to the next new thing rather than refining the old one. Maybe all the trendy new things distract from foundational problems, like a lack of theory to motivate advances in many areas, or sloppy use of statistics. The perception of originality and creativity certainly seem better at inspiring people than obsessing over being correct.

Barefoot in NY ends with a line about how, after having asked whether it was in “our best interest” to present this particular type of music, the narrator went ahead and did it, “and now, it’s even worse than before.” It’s not clear what’s worse than before, but it captures the sort of committment to rapid exploration, even if we’re not yet sure how important the new things are, that causes this tension. 

When a conceptual tool is used as a practical tool (Venn diagrams edition)

Everyone’s seen Venn diagrams so they’re a great entry to various general issues in mathematics and its applications.

The other day we discussed the limitations of Venn diagrams with more than 3 circles as an example of our general failures of intuitions in high dimensions.

The comment thread from that post featured this thoughtful reflection from Eric Neufeld:

It’s true that Venn diagrams are not widely applicable. But thinking about this for a few days, suggests to me that Venn diagrams play a role similar to truth tables in propositional logic. We can quickly establish the truth of certain tautologies, mostly binary or ternary, with truth tables, and from there move to logical equivalences. And so on. But in a foundation sense, we use the truth tables to assert certain foundational elements and build from there.

Something identical happens with Venn diagrams. A set of basic identifies can be asserted and subsequently generalized to more widely applicable identifies.

Some find it remarkable that all of logic can be seen as resting on purely arbitrary definitions of two or three primitive truth tables (usually and, or and not). Ditto, the core primitives of sets agree with intuition using Venn diagrams. No intuition for gigantic truth tables or multidimensional Venn diagrams.

That’s an interesting point and it got me thinking. Venn diagrams are a great way to teach inclusion/exclusion in sets, and the fact that they can be cleanly drawn with one, two, or three binary factors underlines the point that inclusion/exclusion with interactions is a general idea. It’s great that Venn diagrams are taught in schools, and if you learn them and mistakenly generalize and imagine that you could draw complete Venn diagrams with 4 or 5 or more circles, that’s kind of ok: you’re getting it wrong with regard to these particular pictures—there’s no way to draw 5 circles that will divide the plane into 32 pieces—but you’re correct in the larger point that all these subsets can be mathematically defined and represent real groups of people (or whatever’s being collected in these sets).

Where the problem comes up is not in the use of Venn diagrams as a way to teach inclusions, unions, and intersections of sets. No, the bad stuff happens when they’re used as a tool for data display. Even in the three-circle version, there’s the difficulty that the size of the region doesn’t correspond to the number of people in the subset—and, yes, you can do a “cartogram” version but then you lose the clear “Venniness” of the three-circle image. The problem is that people have in their minds that Venn diagrams are the way to display interactions of sets, and so they try to go with that as a data display, come hell or high water.

This is a problem with statistical graphics, that people have a few tools that they’ll use over and over. Or they try to make graphs beautiful without considering what comparisons are being facilitated. Here’s an example in R that I pulled off the internet.

Yes, it’s pretty—but to learn anything from this graph (beyond that there are high numbers in some of the upper cells of the image) would take a huge amount of work. Even as a look-up table, the Venn diagram is exhausting. I think an Upset plot would be much better.

And then this got me thinking about a more general issue, which is when a wonderful conceptual tool is used as an awkward practical tool. A familiar example to tech people of a certain age would be the computer language BASIC, which was not a bad way for people to learn programming, back in the day, but was not a great language for writing programs for applications.

There must be many other examples of this sort of thing: ideas or techniques that are helpful for learning the concepts but then people get into trouble by trying to use them as practical tools? I guess we could call this, Objects of the class Venn diagrams—if we could just think of a good set of examples.

Concreteness vs faithfulness in visual summaries

This is Jessica. I recently had a discussion with collaborators that got me thinking about trade-offs we often encounter in summarizing data or predictions. Specifically, how do we weigh the value of deviating from a faithful or accurate representation of how some data was produced in order to it more interpretable to people? This often comes up as sort of an implicit concern in visualization, when we decide things like whether we should represent probability as frequency to make it more concrete or usable for some inference task. It comes up more explicitly in some other areas like AI/ML interpretability, where people debate the validity of using post-hoc interpretability methods. Thinking about it through a visualization example more has made me realize that at least in visualization research, we still don’t really have principled foundation for resolving these questions.

My collaborators and I were talking about designing a display for an analysis workflow involving model predictions. We needed to visualize some distributions, so I proposed using a discrete representation of distribution based on how they have been found to lead to more accurate probability judgments and decisions among non-experts in multiple experiments. By “discrete representations” here I mean things like discretizing a probability density function by taking some predetermined number of draws proportional to the inverse cumulative distribution function and showing it in a static plot (quantile dotplot), or animating draws from the distribution we want to show over time (hypothetical outcome plots), or possibly some hybrid of static and animated. However, one of my collaborators questioned whether it really makes sense to use, for example, a ball swarm style chart if you aren’t using a sampling based approach to quantify uncertainty. 

This made me realize how common it is in visualization research to try to separate the visual encoding aspect from the rest of the workflow. We tend to see the question of how to visualize distribution as mostly independent from how to generate the distribution. So even if we used some analytical method to infer a sampling distribution, the conclusions of visualization research as typically presented would suggest that we should still prefer to visualize it as a set of outcomes sampled from the distribution. We rarely discuss how much the effectiveness of some technique might vary when the underlying uncertainty quantification process is different. 

On some level this seems like an obvious blind spot, to separate the visual representation from the underlying process. But I can think of a few reasons why researchers might default to trying to separate encodings from generating processes and not necessarily question doing this. For one, having worked in visualization for years, at least in the case of uncertainty visualization I’ve seen various instances where users of charts seem to be more sensitive to changes to visual cues than they are to changes to descriptions of how some uncertainty quantification was arrived at. This implies that aiming for perfect faithfulness in our descriptions is not necessarily where we want to spend our effort. E.g, change an axis scaling and the effect size judgments you get in response will be different, but modifying the way you describe the uncertainty quantification process alone probably won’t result in much of a change to judgments without some addtional change in representation. So the focus naturally goes to trying to “hack” the visual side to get the more accurate or better calibrated responses.

I could also see this way of thinking becoming ingrained in part because people who care about interfaces have always had to convince others of the value of what they do through evidence that the reprersentation alone matters. Showing the dependence of good decisions on visualization alone is perceived as sort of a fundamental way to argue that visualization should be taken seriously as a distinct area.

At the same time though, disconnecting visual from process could be criticized for suggesting a certain sloppiness in how we view the function of visualization. Not minding the specific ways that we break the tie between the representation and the process might imply we don’t have a good understanding of the constraints on what we are trying to achieve. Treating the data generating process as a black box is certainly much easier than trying to align the representations to it, so it’s not necessarily surprising that the research community seems to have settled with the former.

Under this view, it becomes research-worthy to point out issues that only really arise because we default to thinking that representation and generation are separate. For example, there’s a well known psych study suggesting we don’t want to visualize continuous data with bar charts because people will think they are seeing discrete groups (and vice versa). It’s kind of weird that we can have these one-off results be taken very seriously, but then not worry so much about mismatch in other contexts, like acknowledging that making some assumptions to compute a confidence interval and then sampling some hypothetical outcomes from that is different from using sampling directly to infer a distribution. 

I suspect that for this particular uncertainty visualization example, the consequences of the visual metaphor not faithfully capturing the underlying distribution-generating process seem minor relative to the potential benefits of getting people to think more concretely about the implications of error in the estimate. There’s also a notion of frequency inherent in the conventional construction of confidence intervals, which maybe makes a frequency representation seem less egregiously wrong. Still, there’s the potential for the discrete representation to be read as mechanistic, i.e., as signifying a bootstrap-style construction process even where it doesn’t, which is what my collaborator seemed to be getting at.

But on the other hand, any data visualization is a concretization of something nebulous, i.e., an abstraction encoded in the visual-spatial realm to represent our knowledge of some real-world thing approximated by a measurement process. So one could also point out that it doesn’t really make sense to act as though there are situations where we are free from representational “distortion.”

Anyway, I do think there’s a valid criticism to be made through this example: research hasn’t really attempted to address these trade-offs directly. Despite all the time we spend emphasizing the importance of the right representation in interactive visualization, I expect most of us would be hard pressed to explain the value of a more concrete representation over a more accurate one for a given problem without falling back on intuitions. Should we be able to get precise about this, or even quantify it? I like the idea of trying, but in an applied field like infovis I would expect the majority to judge it not worth the effort (if only because theory over intuition is a tough argument to make when funding exists without it).

Like I said above, a similar trade-off seems to come up in areas like AI/ML interpretability and explainability, but I’m not sure if there are attempts yet to theorize it. It could maybe be described as the value of human-model alignment, meaning the value of matching the representation of some information to metaphors or priors or levels of resolution that people find easier to mentally compute with, versus generating-model alignment, where we constrain the representation to be mechanistically accurate. It would be cool to see examples attempting to quantify this trade-off or otherwise formalize it in a way that could provide design principles.

Here’s a fun intro lesson on how to read a graph!

Paul Alper sent us this fun feature from the New York Times that teaches students how to read a graph.

They start with the above scatterplot and then ask a series of questions:

What do you notice?

What do you wonder?

How does this relate to you and your community?

What’s going on in this graph? Create a catchy headline that captures the graph’s main idea.

The questions are intended to build on one another, so try to answer them in order.

Then they follow up with some details:

This graph appeared in the Nov. 17, 2021 New York Times Upshot article “Where Are Young People Most Optimistic? In Poorer Nations.” It displays statistics from an international survey of more than 21,000 people from 21 countries conducted by Gallup for UNICEF. A report entitled “The Changing Childhood Project: A Multigenerational, International Survey of 21st Century Childhood” offers all of the survey’s findings with its 32-question survey and its methodology by country. The survey sample was nationally representative and was conducted by landline and mobile telephone from February to June 2021. The survey’s objective was to find out how childhood is changing in the 21st century, and where divisions are emerging between generations.

Are the 15- to 24 year olds (youth) more optimistic than 40+ year olds (parents)? Does the difference in optimism vary between the least wealthy and most wealthy countries? How might the degree of political stability, economic opportunity, climate change and the Covid pandemic affect the youths’ and parents’ responses? Which countries’ statistics surprise you? What do you think about the statistics for the United States?

And:

Here are some of the student headlines that capture the stories of these charts: “The Opposing Futures in the Eyes of Different Generations” by Helena of Pewaukee High School and “The Ages of Optimism” by Zoe, both from Wisconsin; “Is Each Generation Making the World Better?” by Maggie of Academy of Saint Elizabeth in Morristown, N.J. and “Generation Battle: Is the World Getting Better or Worse?” by Taim of Gladeville Middle School in Mt. Juliet, Tenn.

Cool! I really like the idea of teaching statistical ideas using recent news.

And it seems they do this every week or two. Here’s the next one that came up:

Since they’re doing it as a bar graph anyway, I guess they could have the y-axis go all the way down to zero. Also, hey, let’s see the time series of divorces too!

“Graphs do not lead people to infer causation from correlation”

Madison Fansher, Tyler Adkins, and Priti Shah write:

Media articles often communicate the latest scientific findings, and readers must evaluate the evidence and consider its potential implications. Prior work has found that the inclusion of graphs makes messages about scientific data more persuasive (Tal & Wansink, 2016). One explanation for this finding is that such visualizations evoke the notion of “science”; however, results are mixed. In the current investigation we extend this work by examining whether graphs lead people to erroneously infer causation from correlational data. In two experiments we gave participants realistic online news articles in which they were asked to evaluate the research and apply the work’s findings to a real-life hypothetical scenario. Participants were assigned to read the text of the article alone or with an accompanying line or bar graph. We found no evidence that the presence of graphs affected participants’ evaluations of correlational data as causal. Given that these findings were unexpected, we attempted to directly replicate a well-cited article making the claim that graphs are persuasive (Tal & Wansink, 2016), but we were unsuccessful. Overall, our results suggest that the mere presence of graphs does not necessarily increase the likelihood that one infers incorrect causal claims.

A paper by Wansink didn’t replicate??? Color me gobsmacked.

Rich guys and their dumb graphs: The visual equivalents of “Dow 36,000”

Palko links to this post by Russ Mitchell linking to this post by Hassan Khan casting deserved shade on this post, “The Third Transportation Revolution,” from 2016 by Lyft Co-Founder John Zimmer, which includes the above graph.

What is it about rich guys and their graphs?

[Images: a “slaves-serfs” graph and another screenshot from 2016]

Or is it just a problem with transportation forecasts?

[Image: VMT (vehicle miles traveled) forecast chart]

I’m tempted to say that taking a silly statement and putting it in graph form makes it more persuasive. But maybe not. Maybe the graph thing is just an artifact of the power point era.

Rich guys . . .

I think the other problem is that people give these rich guys a lot of slack because, y’know, they’re rich, so they must know what they’re doing, right? That’s not a ridiculous bit of reasoning. But there are a few complications:

1. Overconfidence. You’re successful so you start to believe your own hype. It feels good to make big pronouncements, kinda like when Patrick Ewing kept “guaranteeing” the Knicks would win.

2. Luck. Successful people typically have had some lucky breaks. It can be natural to attribute that to skill.

3. Domain specificity. Skill in one endeavor does not necessarily translate to skill in another. You might be really skillful at persuading people to invest money in your company, or you might have had some really good ideas for starting a business, but that won’t necessarily translate into expertise in transportation forecasting. Indeed, your previous success in other areas might reduce your motivation to check with actual experts before mouthing off.

4. No safe haven. As indicated by the last graph above, some of the official transportation experts don’t know jack. So it’s not clear that it would even make sense to consult an official transportation expert before making your forecast. There’s no safe play, no good anchor for your forecast, so anything goes.

5. Selection. More extreme forecasts get attention. It’s the man-bites-dog thing. We don’t hear so much about all the non-ridiculous things that people say.

6. Motivations other than truth. Without commenting on this particular case, in general people can have financial incentives to take certain positions. Someone with a lot of money invested in a particular industry will want people to think that this industry has a bright future. That’s true of me too: I want to spread the good news about Stan.

So, yeah, rich people speak with a lot of authority, but we should be careful not to take their internet-style confident assertions too seriously.

P.S. I have no reason to believe that rich people make stupider graphs than poor people do. Richies just have more resources, so we all get to see their follies.