Supplement that alphabetized display with another graph showing the states in a more informative order.

I just wrote a long post inspired by a recent post from economist Paul Krugman. Krugman’s post was good, but I’m annoyed that his graph (reproduced above) lists the states alphabetically. Don’t do that! It’s called the Alabama first error.

I would’ve put this as a P.S. on my earlier post but I was afraid that would distract people from my larger point, so I’m just raising the graphical issue here.

If the goal is to have a look-up table, then, sure, alphabetical is fine. But I don’t think that’s the point of that graph. Indeed, if you wanted a look-up table, I’d still prefer a non-alphabetical graph and then you could click to get the numbers in a spreadsheet.

How best to order the states in that graph, then? You could try different things. My first idea is to list in order of average per-capita income by state. (These rankings don’t change much over time; for clarity we could just order by average per-capita income in 2020.)

P.S. All the commenters so far are disagreeing with me, so let me reassess.

I doubt that most readers are looking at this graph to look up individual states. I think the goal is to present the general trend and variation across U.S. states. For this purpose, alphabetical order makes it hard to see systematic patterns that might be clearer using any reasonable ordering.

That said, alphabetical order has the benefit of familiarity, and given that all of you think this is important, I’m willing to believe that my take is a minority view, and maybe the designer of the graph is better off going with the majority.

So I’ll alter my recommendation. Instead of saying, “Don’t alphabetize,” I’ll say, “Supplement with another graph showing the states in a more informative order.”

R wins statistics award.

Elena Belogolovsky writes:

Congratulations to the R Core Team on receiving the 2026 Rousseeuw Prize for Statistics.

R has made creative, open-ended statistical analysis and graphics accessible to generations of statisticians and applied researchers. It has also been central to statistical research, methodology, and applications during decades when statistics became more computational and more important across science, engineering, business, and public health.

One of the great strengths of R is that it is not just a software platform. It is also a community. The system of R packages allows anyone to implement a new method and share it with the world, helping make statistical research more open, useful, and alive. R has also been the medium for major developments in statistical graphics, transforming applied statistics and the way people work with data.

The volunteers who have developed, guided, and maintained R and the R community are richly deserving of this major award.

I agree with the committee that the R team is an excellent recipient of this award. I say this for several reasons:

– Most obviously, R is super-useful and it’s changed statistics, both by enabling more complicated and reliable analysis and by establishing a common language for statistical coding.

– R integrates statistical modeling with graphics, which traditionally (but, in my opinion, mistakenly) have been thought of as in opposition.

– R is open source. This might sound like no big deal, but its predecessor was Splus, which was a commercial package. Before that came S, which was open but was not set up to expand in a scalable way.

– With its system of packages, R became modular: different groups of users (including me!) could write their own packages and develop new and useful tools without needing to get tangled in core R issues. For example, we have cmdstanr, which lets you run Stan programs from R. This is super-useful for Bayesian workflow.

– R is a programming language, not a menu-based set of commands. This is no big deal now, given that the natural comparison to R is Python, but, back in the day, when R’s competitors were Sas, Spss, Stata, etc., it was a big deal that with R you write programs, you don’t just push buttons. A big deal for workflow in statistics and data science.

– Regarding the R community . . . ok, this gets complicated. Still and all, the R core team is very helpful to outsiders and has been a clear net benefit to the communities of developers, statisticians, and users.

I’m sure I’m missing a few things. My only disagreement with the award citation is that it doesn’t mention S, the statistical software environment developed by John Chambers and others at Bell Labs back in the 1980s. R is a rewrite of S. With lots of improvements, but I do think the S team deserves credit for setting up the template.

Statistical analysis recapitulates the development of statistical methods

We ran this a few years ago but it remains interesting so I’m reposting:

There’s a old saying in biology that the development of the organism recapitulates the development of the species: thus in utero each of us starts as a single-celled creature and then develops into an embryo that successively looks like a simple organism, then like a fish, an amphibian, etc., until we reach our human form in preparation for birth.

Modern biologists don’t believe in this recapitulation. But taking this as an intriguing idea, I see an analogy with statistical practice.

Some version of this recapitulation occurs just about whenever we do applied statistics. We start with the simplest methods–univariate data summaries and some basic multivariate analyses–then we perform some comparisons which we check via standard errors and off-the-shelf hypothesis tests, then we move to modeling. We might well start with least squares and maximum likelihood and then move to regularization and multilevel modeling as needed, then throw in measurement error models, selection models, nonparametric this and that, and so forth.

The analogy isn’t perfect–in particular, we don’t always begin an analysis with simple averages and plots; sometimes we begin with a sophisticated nonparametric data-exploration tool such as lowess or deep nets. And, lots of methods for graphical exploratory data analysis have only been developed recently; indeed, even methods as basic as scatterplots are only a few centuries old.

Within the context of modeling, though, it does seem to me that we tend to start simple and then add more complicated features one at a time–and this seems like a sensible way to proceed. In so proceeding, we’re motivated in part by computational stability but also in part by the logic of increasing complexity: we take each step for a reason. Thus it is logical that statistical analysis recapitulates the development of statistical methods.

Jonah’s seminar tomorrow: “Bayesian Workflow and the Software That Shapes It”

This is Leo. Jonah Gabry (Stan developer, Andrew’s collaborator, etc.) is spending the whole month of May as a visiting professor here with us at the University of Trieste in Italy. Tomorrow, May 19th, in the De Finetti room at the University of Trieste, at  9 am NYC time (GMT-4), Jonah will give the following talk:

“Bayesian Workflow and the Software That Shapes It”

based on the upcoming book:  “Bayesian Workflow”.

For anyone local, you are welcome to come in person. Anyone else can join on Microsoft Teams (available here).

Should French pollsters be using Mister P?

An anonymous statistics student from France sends in the above plots (click twice to see big versions) and writes:

I’m trying to push French pollsters to start doing MRP.

I made a poll agregator and applied it to the last 100 days of the last five french presidential elections.

I did some smoothing using an algorithm from a paper of Aki Vehtari. It is Kalman-RTS with cross-validated levels of noises.

I tested it on some simulated data to confirm it is fitting properly.

I put the data and the code on my blog.

What I shared as “the data” is the smoothed result. I fitted it on the wikipedia pages of the french polls.

On the plots, the same parties (with changed names or fusions) are on the same position horizontally to allow comparisons.

I see some periodic movements in opinion that I think may be coming from a periodic non-response.
Also, the movements seem far too large to me. I can believe 10% increase for a candidate in five years, but not in less than 100 days.

The French polling industry is in profound need of reform. A fun fact: They allow themselves to change the final result by plus or minus one point based on the feelings of the person in charge of the poll. They call that the “pifomètre” or nosemeter. I heard about this in an interview with sociologist Hugo Touzet on his book, “Produire l’Opinion: Une Enquête Sur Le Travail Des Sondeurs.” I trust his descriptions of their methods since he has interviewed their workers.

I think MRP would allow the pollsters to do predictions for the legislative elections and municipal elections, which have been largely ignored because they are too difficult and expensive with quota sampling.

I know next to nothing about French polling, but, yeah, I do think they should be using Mister P (multilevel regression and poststratification; MRP).

P.S. Here’s a fun cranky post from this student.

All graphs are comparisons, and the relevance of this principle to practical advice for producing better graphs

Following up on this discussion from a few years ago, John Kastellec offers some practical advice for producing better graphs:

In this short paper I present a few practical tips for producing better published graphs. These include: making labels big enough to read; avoiding legends and labeling lines directly; using small multiple plots; and using different line types and shapes to draw distinctions.

I agree with these points. Well, not the last one. I much prefer using colors, rather than line types and shapes. Sometimes line types and shapes can work, but I recommend first using colors.

Also, I’d add the recommendation to give each graph a title or embed it in a figure with a caption. Remember, a picture plus a thousand words is better than two pictures or two thousand words. It’s a classic error to think that a graph should be self-explanatory.

In any case, I disagree with this statement from John:

These are suggestions designed for better published graphs; they are not generally applicable for scholars’ own visualizations in the course of their research workflow. Indeed, some of the basic problems arise because what works well as a default during the data exploration phase does not translate well to published graphs.

No no no no no!

The same attributes that make a graph more useful and readable by others will make it more useful and readable by you. You are the first audience for any graph. All too often I’ve seen people make ugly, unreadable default graphs, and this has inhibited their ability to learn from their data and models. Give your graphs titles, label those axes, use small multiples without making each plot too crowded, use colors effectively, etc etc . . . do it for yourself too!

OK, here’s a story for you. Many years ago I knew a couple who were moving and they needed to sell their apartment. To make it more sellable, they had all sorts of work done, fixing everything that was broken, getting new appliances, new floors, etc. It was wonderful! And they should’ve done it earlier, while they were still living there!

Oh, one more thing. I don’t like “comb plots,” which multiplex an x- or y-axis. Here’s an example from John’s paper:

I think a simple dot plot would be better, with a red dot and a blue dot on each row. That said, I appreciate that they listed the issues in decreasing order of frequency rather than alphabetically.

Overall I think that John’s article should be useful to many people. Even if you don’t agree with all the recommendations, it gives you something to think about.

For more on the topic, placing graphical displays in the context of statistical measurement, I recommend chapter 2 of Regression and Other Stories.

Remember what Tufte taught us: All graphs are comparisons.

Edward Tufte on graphs as comparisons

I’m always going around saying that graphs are comparisons. For example, see Section 2.3 of Regression and Other Stories or the discussion here.

“All graphs are comparisons” has four implications:
1. If you’re considering making a graph, think of how to construct it to focus on the comparisons you want to examine and to convey.
2. If you’re doing research or writing up results, think about what are the comparisons you care about, and then figure out how to make graphs to show these.
3. To evaluate a graph you’ve already made, think about what comparisons it is displaying.
4. To understand a graph produced by someone else, think of it as displaying a comparison.

The graphs-as-comparison idea is relevant for both research and presentation graphics. Indeed, I’ve argued that research graphics and presentation graphics are pretty much the same thing. In a research graph, you still have an audience–it is you and your collaborators–and in a presentation graph you want convey some understanding to your reader.

Anyway, the other day I came across this old post where Dale Lehman in the comment thread points to this quote from Edward Tufte:

To be truthful and revealing, data graphics must bear on the question at the heart of quantitative thinking: “Compared to what?”

And Chandrasekhar Ramakrishnan points to this other Tufte line:

Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives.

I read these Tufte books many years ago, so I guess that’s where my “Graphs are comparisons” mantra came from.

The stories behind our published research from last year

It’s January so time to look back on what we’ve done in the past year.  I thought this time I’d give a little story of background on each of our published papers.

First, here’s the list of recently published papers:

Also we completed some new work that’s not yet been published:

We have a lot on deck for 2026, including two new books (Bayesian Workflow and the second edition of the edited Handbook of Monte Carlo) and a bunch of research articles on different topics in statistical modeling, causal inference, and social science.

And you can expect another 600 or so blog posts.

The stories behind the papers

It’s hard for me to pick my favorites among all the recently published papers, so let me just say something about each of them, in the same order they were listed above (roughly inverse chronological order of publication):

  • Adaptive sequential Monte Carlo for structured cross validation in Bayesian hierarchical models:  GH took a couple of my classes and had ideas for a couple of papers, including this one.  This is his idea that I just helped on a small amount.
  • Reanalysis of “Competition and innovation: An inverted-U relationship”:  This was originally a blog post.  The editor of the Journal of Robustness Reports asked me to submit it to them.  It took a couple rounds–the reviewers made some good points!–and fun thing about this journal is you can go to the link and see the entire review process.
  • The ladder of abstraction in statistical graphics:  I absolutely love this paper.  It originated in a talk I gave to Ron Yurko’s statistical graphics class at CMU.  I sent it to the journal and they had some good suggestions for improvement that my friend and colleague Kaiser Fung was able to do.
  • Statistical workflow:  As many of youall know, we’ve been writing a book on Bayesian Workflow–it will appear very soon!  I felt that the workflow concept would be useful in non-Bayesian statistics too, so my colleagues and I organized a special issue of a journal, where we solicited a bunch of articles from theoretical and applied researchers, mostly not Bayesian, to get different perspectives on workflow.  The journal issue is looking good–I guess it will be out soon–and we wrote this short article to lead off that issue.  It’s a short paper and I recommend you take a look!
  • Adjusting for underreporting of child protective services involvement in the Future of Families and Child Wellbeing Study and assessing its empirical implications through illustrative analyses of young adult disconnection:  OK, I don’t have much to say about this one.  It’s by my colleagues at the school of social work at Columbia; I was involved in the survey weighting for the study.
  • A multilevel Bayesian approach to climate-fueled migration and conflict:  Hey, I don’t remember much about this at all!  But, yeah, multilevel modeling, I guess I did something useful here!
  • Artificial intelligence and aesthetic judgment:  This one’s mostly by Jessica and Ari, but I made some contributions throughout, which might be recognized from earlier appearances of some of these ideas on the blog.  It’s published in Sankhya because I think they asked me to submit something for a special issue, and we had this cool paper that we couldn’t figure out what to do with.
  • Discussion of “Statistical exploration of the manifold hypothesis”:  This journal sometimes runs papers with discussions (they did a couple of mine in the past decade), and sometimes I contributed something.  Here I saw a good opportunity to remind people of my thoughts on Tibshirani’s “bet on sparsity” principle and where it can go wrong.
  • Meta-analysis with a single study:  What can I say?  This paper has an awesome title.  Erik, Witold, and I have been meeting weekly and will be coming out with more articles soon on science and meta-science.
  • Normative scientific conflict is unavoidable and should be welcomed:  I can’t remember how, but I came across an announcement of a special issue of the journal Theory and Society on the topic of normative scientific conflict.  I had some things to say on the topic, and this seemed like a good outlet.  I like this paper!  You should read it.
  • Russian roulette:  The need for stochastic potential outcomes when utilities depend on counterfactuals:  This paper has a funny story behind it.  I was contacted by economist Amanda Kowalski about a paper she and her colleagues had written about causal inference.  That paper got me thinking about stochastic potential outcomes and asymmetric utility functions, and I had this idea of demonstrating these ideas in a simple example of Russian roulette.  Jonas joined as a collaborator and clarified a bunch of issues that I’d been sloppy with.  We asked Amanda if she wanted to join in, but she was too busy on her own stuff.  Anyway, the final paper is cool–it’s really clean, and it’s timely because lots of people are interested in going beyond the stable unit treatment value assumption.
  • Multilevel regression and poststratification using margins of poststratifiers:  Improving inference for HIV health outcomes during the COVID-19 pandemic:  Qixuan has been taking the lead on a bunch of papers we’ve been doing, generalizing MRP in various ways.  I think we’re gradually moving toward a bright future of generalizing from sample to population.
  • Statistical graphics and comics:  Parallel histories of visual storytelling:  This is an idea that I’ve had for a while.  I mentioned it in class offhandedly one day, and one of the students told me she was interested in the topic too, so we wrote this article.  It was a true collaboration.  It’s kind of a specialized topic, but I think it should have a potentially wide audience, because lots of people love comics and lots of people love statistical graphics.  We focus on the fascinating question of how it is that these two modes of communication have developed only in the past few centuries, even though they could have been invented much earlier.  This is a sister paper to the “ladder of abstraction” paper mentioned above.
  • Letter to the editor:  Long story here.   Back in 2017, a bigshot professor lied about me in a published article in the journal, Perspectives on Psychological Science.  It was the kind of crap article that should never have been accepted, but at the time that journal was run by a corrupt cabal and they were publishing their friends’ articles essentially without peer review.  At the time I complained to the journal but only got rude responses from the cabal.  But things change.  The journal is now run by civilized people and they published my letter.  Better 8 years late than never at all.  And, no, the people who wrote and published the lies never apologized.  Of course not!  Apologies are for losers, not for members of the prestigious National Academy of Sciences.
  • Rethinking approaches to analysis of global randomised controlled trials:  Epidemiologist Jay Brophy wrote this one.  I had some minor contribution, I can’t remember what.
  • Simulation-based calibration checking for Bayesian computation:  The choice of test quantities shapes sensitivity:  This is the latest version of a long series of papers on SBC, starting with Samantha Cook’s Ph.D. thesis, which we turned into a paper that was published twenty years earlier.  I continue to be interested in the idea of accompanying inferences with simulations that check the computations.
  • Visualizing distributions of covariance matrices:  This paper is nearly 20 years old!  At the time we had difficulty getting it published and we moved on to other things.  Then a couple years ago a journal asked me for an article and I sent them this one.  Unfortunately it was a so-called predatory journal, and one of my coauthors didn’t want our article appearing there.  Fair enough!  But then we thought we might as well get it published, so we sent it off.  I like the paper, and I also like that it’s on the relatively understudied topic of visualizing models (as opposed to visualizing data).
  • Interrogating the “cargo cult science” metaphor:  This topic had been bugging me for a while, and Megan and I wrote this paper which got rejected by a couple of places.  Neither of us really knows how to communicate with researchers in the field of science studies, so it was a hard paper to place, even though it makes a clean point.  Then I happened to hear about the journal Theory and Society, which seemed like the perfect place.  I don’t know if anyone read our article, but I’d like to think that, in the future, people will think twice before talking about cargo cult science.
  • A calibrated BISG for inferring race from surname and geolocation:  This is Philip’s project.  I did help out a bit, but I remain frustrated in that we haven’t been able to frame this in a fully Bayesian or generative way.  We’re continuing to work on the problem, and we have a new method, supercaliBISG, which does even better than caliiBISG, which is an improvement on BISG, which itself has the word “improved” in its title (and also calls itself Bayesian, but it’s not fully so).
  • Hierarchical Bayesian models to mitigate systematic disparities in prediction with proxy outcomes:  I can’t remember exactly where his paper came from, but it was somehow associated with some conversations we had with Sharad Goel and others on statistical measures of disparity.  As is often the case, I think much is gained by framing the problem within a generative model.
  • The piranha problem:  Large effects swimming in a small pond:  This one’s important!  The basic idea–there are probabilistic or statistical constraints regarding patterns of dependence in high dimensions, and this has implications for our understanding of patterns in complex structures–was mine, but the coauthors did most of the rest, to collect some relevant mathematical results.  As I like to say, I think there’s more to be said in this area, maybe some connections to random matrix theory.  Also, the paper has an unusual publication story.  What happened was that a student from the statistics club at San Diego State University asked me to do a remote meeting with them.  I did so–it was a fun conversation–and it turned out that their faculty adviser, Richard Levine, was editor of the Notices of the American Mathematical Society, and was looking for general-interest math papers with applied or statistical relevance.  So I sent him the piranha paper.  Articles in this journal have a strict limit of no more than 10 pages and no more than 20 references.  It was hard for me to keep the references under 20 while demonstrating the applied relevance of the topic, so I cheated and wrote a blog post entitled, “Here are just some of the factors that have been published in the social priming and related literatures as having large effects on behavior,” so that just counted as 1 reference in our paper.  Kind of like if the genie gives you 3 wishes and you spend one of them on more wishes.
  • For how many iterations should we run Markov chain Monte Carlo?:  This is an update of my paper with Kenny Shirley for the new edition of the MCMC handbook.  Charles took the lead on this chapter.

The signal-to-noise ratio in statistics

This is Erik. When fortune smiles on us, we may get an unbiased, normally distributed estimator y with standard error s of some (unknown) parameter of interest theta. Here, we’ll even assume that s is known. The difference between knowing s and having to estimate it, is the difference between a t-test and a z-test. That’s a minor difference when there aren’t any serious outliers and the sample size is not too small. So, let’s just hope for the best and assume that y has the normal distribution with mean theta and standard deviation s. The 95% confidence interval for theta is y ± 1.96 × s. All very standard.

The z-statistic is the ratio of the estimator to its standard error, so z=y/s. It follows that z has the normal distribution with mean theta/s and standard deviation 1. There doesn’t seem to be a good name in statistics for theta/s, i.e. the ratio of the true parameter to the standard error of its estimator. However, we could borrow a term from engineering: the signal-to-noise ratio or SNR. So, let’s define SNR=theta/s. Then the z-statistic has the normal distribution with mean SNR and standard deviation 1.

The SNR is easy to interpret. If theta=0 then the SNR is zero as well. SNR=1 (or -1) means that the parameter we’re trying to estimate has about the same magnitude as the noise in our estimator. That’s not a very favorable situation. For example there is a 16% chance that the estimator has a different sign than the true parameter. That’s because

P(y < 0 | SNR=1) = P(z < 0 | SNR=1) = pnorm(0,1,1)=0.16.

If SNR=2.8 (or -2.8) then the probability to reject the hypothesis that theta=0 is 80% (alpha=0.05 two-sided). That’s because

P(|z|>1.96 | SNR=2.8)=pnorm(-1.96,2.8,1) + 1 – pnorm(1.96,2.8,1) = 0.8.

There is a 1-1 relation between the absolute z-statistic and the two-sided p-value for testing the hypothesis that theta=0. In R, we have z=qnorm(1-p/2) and p=2*pnorm(-abs(z)).Still, I like z-statistics better than p-values because of their direct relation to the SNR. In fact, we can think of the z-statistic as an estimate of the SNR with standard error 1. The SNR and z-statistic say something about the quality of an experiment without reference to hypothesis testing.

A few days ago, I posted a histogram of z-statistics from PubMed, and noted the lack of z-statistics between -2 and 2. I’ve now also made histograms of the corresponding two-sided p-values, with and without log-transformed axis. As expected, these show a steep drop at 0.05. To me, the histogram of the z-statistics is easiest to read. One might even say that the p-value is a distortion of the z-statistic. Or do you think that goes too far?

The fifth anniversary of a viral histogram

This is Erik. Five years ago, I wrote a short paper about The significance filter, the winner’s curse and the need to shrink (with Eric Cator). The main purpose was to publish some mathematical results for later reference. To make the paper a little more interesting, we wanted to add a motivating example. I came across a paper by Barnett and Wren (2019) who scraped more than a million confidence intervals of ratio estimates from PubMed and made them publicly available. I converted the confidence intervals to z-statistics, made a histogram, and was struck by the lack of z-statistics between -2 and 2 (i.e. non-significant results).

load(url("https://github.com/agbarnett/intervals/raw/master/data/Georgescu.Wren.RData"))
d=complete[complete$mistake==0,]
L=log(d$lower)  # take the log because these are ratio estimates
U=log(d$upper)
estimate=(L+U)/2
stderror=(U-L)/(2*1.96)
z=estimate/stderror
hist(z[abs(z)<10],100)

Richard McElreath noticed the figure, snipped it from our paper and posted it on Twitter. Next, it was picked up by several (relatively) large accounts; Harlan Krumholz, Inquisitive Bird, Kareem Carr, John Holbein, Cremieux and  Nicolas Fabiano. Just 2 weeks ago John Holbein re-upped his earlier post and got another few thousand likes. The histogram is also quite popular with bloggers, see here, here, here, here, here, here, here, here, here, here, here and here. Adrian Barnett and David Borg wrote a blog post with their own version of the histogram. Several memes were also created.

For the fifth anniversary of the histogram, I wanted to react to a few typical comments. For example, Adriano Aguzzi commented:

Let’s not hyperventilate about this. It’s in the nature of things that negative results are rarely informative and therefore rarely published. And that is perfectly legitimate.

It is disappointing – to say the least – that many people still fail to see the problem of distorting the scientific record by selectively reporting and publishing results that meet p<0.05.

Another typical comment (Simo110901):

I don’t think this is inherently bad, a part of this bias certainly comes from publishing bias, but a significant part (hopefully the majority) could be that researchers are often really good at formulating educated guesses and therefore able to reject null results in most cases.

Many other commenters also believe that the lack of non-significant results is due to researchers’ ability to size their studies exactly right to obtain statistical significance with minimal undershoot. This is very unlikely. In the figure below, I compare the z-statistics from Barnett and Wren (2019) to a set of more than 20,000 z-statistics of the primary efficacy endpoints of clinical trials from the Cochrane Database of Systematic Reviews (CDSR).

d=read.csv("https://osf.io/xq4b2/?action=download")
d=d %>% filter(RCT=="yes",outcome.group=="efficacy", outcome.nr==1, abs(z)<20 ) 
d=group_by(d,study.id.sha1) %>% sample_n(size=1)        # one outcome per study 
hist(d$z[abs(d$z)<10],50)


The histogram from the CDSR (right) shows no appreciable gap. I can’t know for sure why that is, but would guess it’s due to the fact that clinical trials are serious research. They are usually pre-registered in the sense that they have a protocol which was approved by some Institutional Review Board. They are expensive and time-consuming, so even if they are not significant it would be a shame not to get a publication out of them. Finally, it would be unethical to the participants not to publish.

Another typical comment (Daniel Lakens):

This is not an accurate picture of how biased the literature is. The authors only analyze p-values in abstracts.

Barnett and Wren (2019) collected z-statistics from both abstracts and full text sources. The full text data are available for papers that are on PubMed Central. There are 961,862 abstracts and  348,809 full-text sources. Below, I show the z-statistics separately. The distributions are remarkably similar, although there is a slightly higher proportion non-significant results from the full texts

Any automated scraping algorithm is bound to miss some things. It’s quite possible that non-significant results that are not in the abstract or main text are still reported in separate tables, appendices and supplements. However, it doubt that that’s the reason for the huge gap. I’m quite convinced that the z-statistics from PubMed really do provide strong evidence of publication bias against non-significant results in the medical literature. However, it should be noted that the underrepresentation of z-statistics between -2 and 2 is probably not only due to publication bias, but also due to authors not reporting confidence intervals for non-significant results. Of course, that is still not a good thing.

Predictive Modelling for Football Analytics is available!

This post is by Leo.
After a long and exciting journey, the book I co-authored with Dimitris Karlis, and Ioannis Ntzoufras Predictive Modelling for Football Analytics edited by CRC Press is available!

The book discusses the most well-known classical and Bayesian models, along with the main computational tools used in the football analytics domain. It also introduces the footBayes R package (built on Stan and CmdStan), which accompanies the reader through all the examples proposed in the book. It aims to be both a practical guide and a theoretical foundation for students, data scientists, sports analysts, and football professionals who wish to understand and apply predictive modelling in a football context.

This text is primarily for senior undergraduates, graduate students, and academic researchers in mathematics, statistics, and computer science who are interested in learning about football analytics. For sure, we really enjoyed writing this book.

Here’s the table of contents:

Chapter 1 – A short introduction to football analytics

Chapter 2 – Methods, algorithms and computational tools

Chapter 3 – Tournament and game prediction via simulation

Chapter 4 – Implementation of basic models in R via footBayes

Chapter 5 – Additional statistical models for the scores

Chapter 6 – Modelling international matches: the Euro and World cup experience

Chapter 7 – Compare statistical models’ performance with the bookmakers

You can order the book here.

And you can download all the data and reproducible R code of the book here.

P.S. I still remember fitting my first Stan model on football data when I was a visiting scholar at Columbia, in the Department of Statistics, back in 2016. I was sitting in Andrew’s office, hoping for a decent fit, and discussing with Jonah how to improve it. Almost ten years later, we now have some proof that we can always improve our models;)

P.S.2 As the great coach José Mourinho once said, “He who knows only about football, knows nothing about football”. For me, football is simply a tool to make statistics more accessible and engaging especially for those outside the field.

Reanalysis of that Nobel prizewinning study of patents and innovation (with R and Stan code)

A few days ago we discussed a paper from 2005, Competition and Innovation: An Inverted-U Relationship, two of whose authors recently won the Nobel prize in economics.

I had some concerns about the analysis, which I can express with reference to the above figure from that paper:

1. The paper’s all about an inverted U relationship, but this is driven by fitting a quadratic curve rather than, say, a curve with diminishing returns.

2. The line does not seem to go through the data points. In particular, the curve seems to be too low at the rightmost part of the graph–an artifact of fitting a quadratic, perhaps?

3. The y-axis is some weighted count of patents but it’s being used as a measure of the more abstract concept of “innovation.”

4. The x-axis is an average of profit margins but it’s being used as labeled a measure of the more abstract concept of “competition.”

5. The model predicts patents from profit margin in the same year, but to the extent the model is appropriate I think you’d expect a lag.

6. They use Poisson regression, but the data are not counts, also if you don’t somehow correct for overdispersion your standard errors will be too low.

The plan

Here are the ways I’m gonna adjust for the above issues in my reanalysis:

1. I’ll fit a quadratic curve to replicate what they did in the paper, and I’ll also fit another family of curves (the “hinge”) that allows for nonlinearity but without enforcing non-monotonicity.

2. One problem with the above graph is that it excludes 20% of the data (see the figure caption). I’ll plot the fitted curves showing all the data. Another possible reason for the problematic fit is that in the paper they say they adjust for industry and year effects, and so I’ll make a plot showing the fitted curve and the data broken down by industry. I’ll do the adjustments in two ways: “fixed effects” as in the published paper (using various R packages), and multilevel modeling (using Stan).

3, 4. I’ll relabel the axes of the graph to more accurately capture the authors’ measures of competitiveness and innovation, but I’ll swallow the concern that the analysis only uses a subset of the data that were available at the time (from the paper, “Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

5. This lag thing is an issue, but I have enough concerns with items 3 and 4 that it’s hard for me to take the paper’s theoretical and causal arguments seriously. But, sure, you could replicate my analyses at different lags.

6. I’ll compare four sorts of models: Poisson, quasipoisson, negative binomial, and normal regression on log(y+1). Quasipoisson and negative binomial are two different ways to correct Poisson regression for overdispersion. Modeling on log(y+1) is a completely different way to model data that are mostly positive but have some zeroes, and I think it actually makes the most sense for this example, as the data are not actually counts. When fitting Poisson and negative binomial regressions, I’ll round the data to the nearest integer, which turns out to essentially not affect the results.

The above is not a preregistration plan; I wrote it in the middle of the project after doing some of the analyses already.

The data

Bradford in comments links to a post from 2014 by Leif Nelson and Uri Simonsohn that includes a file linking to the website of Nicholas Bloom, one of the authors of the paper. Navigating the site takes me to this page with a link to a zip file with the data, which I then downloaded.

Good job by Bloom to post the data from a paper published in 2005. I’m not so good with data availability myself.

The data file has 354 records, including data from 1973-1994 and from 17 industries, and for each of these records it has the weighted patent count (“patcw”), the measure of profit margin (“Lc”), and a bunch of variables that I didn’t try to figure out, because these are all I need to replicate the main analysis. (The file does not seem to include a code book, but I’m not complaining, as they’re already way ahead of me by making the data available at all.)

Initial data analysis: quadratic regression using Poisson, quasipoisson, and negative binomial models

I get going by plotting the data and fitting some quadratic models:

Compared to the graph at the top of this post, the above plot shows more data (I’m not trimming the upper and lower 10% of observations), and the quadratic curves are much flatter than were shown in that paper:

(1) green curve: poisson fit to rounded data (using glm() in R)
(2) red curve: quasipoisson fit to rounded data (using glm())
(3) blue curve: quasipoisson fit to raw data (using glm())
(4) pink curve: negative binomial fit to rounded data (using glm.nb())
(5) orange curve: negative binomial fit to rounded data (using stan_glm())

The first four curves are so similar that the lines overlap and you can’t see the green and red curves: They’re there, just overwritten. For each model I display the point estimate of the curve (best fit for models 1-4, posterior median for model 5). In the models above, quasipoisson uses the Poisson fit and then adjusts standard errors to account for overdispersion; negative binomial is a probability distribution that accounts for overdispersion in a different way.

We went from 1 to 2 to see if switching to quasipoisson made a difference (it did; the standard error was much bigger after we allowed for overdispersion); we went from 2 to 3 to see if rounding made a difference (it did’t for these data); we went from 3 to 4 to see if switching to negative binomial made a difference (it didn’t do much, but the standard error increased slightly); and we went from 4 to 5 to see if switching to full Bayes made a difference (it looked a lot different, actually).

I’ll put the code at the end of this post so as not to distract from the story. Here are the relevant pieces of console output:

(1) poisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    25.85 
Lc          164.76    54.90 
I(Lc^2)     -88.39    29.14

(2) quasipoisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    84.64 
Lc          164.76   179.78 
I(Lc^2)     -88.39    95.44

(3) quasipoisson fit to raw data
            coef.est coef.se
(Intercept) -75.01    84.01 
Lc          165.13   178.43 
I(Lc^2)     -88.55    94.72 

(4) negative binomial fit to rounded data
            Estimate Std. Error
(Intercept)   -80.80      89.49
Lc            177.20     190.10
I(Lc^2)       -94.85     100.92

(5) negative binomial fit to rounded data (Bayesian)
            Median MAD_SD
(Intercept)  -7.1   33.2 
Lc           21.1   69.5 
I(Lc^2)     -12.2   37.3

In a paper I’d prefer to display these uncertainties graphically; here I’m giving the console output to give a sense of how things might go in our usual workflow of fitting models and looking at them. Here you can see that the quadratic terms all have large standard errors–except for the very first model, the Poisson, but its standard error is too low as it does not account for overdispersion.

The only thing that really puzzles me here is what’s going on with model 5. Could it be that the Bayesian model uses the posterior median rather than the optimum? I’ll check by re-fitting it, running stan_glm on “optimizing” setting:

(5_opt) negative binomial fit to rounded data (Bayesian posterior mode)
            Median MAD_SD
(Intercept)  -5.3   36.8 
Lc           17.1   76.7 
I(Lc^2)     -10.9   39.4 

That’s not much different from the posterior median. So I guess the difference between the negative binomial regressions fit by glm.nb() and stan_glm() arise from differences in the fitting algorithms, or maybe something I missed in the coding. If I wanted to proceed further down this track I would have to investigate this a bit further, reading up on what glm.nb is actually doing in R, and in Stan programming the negative binomial model myself from scratch and also comparing to brms.

For now I’ll just move on. My guess is that the difference in fits has something to do with how seriously the model takes the small number of influential data points near the top of the graph, but I’ll set that aside for now because we’ll be fitting some more models.

Quadratic regression adjusting for industry and year effects

The next step is to adjust for industry and year effects, which I’ll do in a few ways. I’ll aid in the interpretation of these by plotting the data and fitted curve separately for each industry. In the data the industry codes took on 17 different values ranging from 22 to 49, which according to the paper correspond to “two-digit SIC codes.” So I googled “two-digit SIC codes,” which are listed in various places online. I could not find an official link, but various unofficial sources seemed to agree; here’s one such list.

For each plot, black dots show the data for that industry, with a large dot showing the data from the first year of the data and thin black lines showing the time sequence. The colored curves are the fitted quadratics for each industry in an average year, as explained below.

It’s kind of weird that Furniture has so many patents, and that the number of patents for machinery and computers started out high and then dropped, and that electric and gas services had no patents for the first few years and then suddenly had a lot . . . this is just what’s in the data. I checked a few things to make sure I didn’t garble the categories but it’s possible that I’m missing something.

Here are the models I fit:

(3a) blue curve: quasipoisson fit to rounded data, including factors for industry and year (using glm())
(4a) pink curve: negative binomial fit to rounded data, including factors for industry and year (using glm.nb())
(5a) orange curve: negative binomial fit to rounded data, including factors for industry and year (using stan_glm())
(6a) purple curve: multilevel negative binomial fit to rounded data, with varying intercepts for industry and year (using stan_glmer())

Models (3a), (4a), and (5a) include unmodeled coefficients for industry and year (“fixed effects,” in economics jargon); model (6a) considers these coefficients as latent variables and estimates their distributions (this is what economists call “random effects”).

From the above graph you can see that, after adjusting for industry and year effects, the quadratic curves are much stronger. Most of this comes from the industry effects; there’s not much evidence for unexplained variation at the year level. Here’s the relevant console output from model (6a), which conveniently estimates the scale of each batch of varying intercepts:

stan_glmer
 family:       neg_binomial_2 [log]
 formula:      patcw_rounded ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -68.3   27.1 
Lc          146.6   57.5 
I(Lc^2)     -77.9   30.9 

Auxiliary parameter(s):
                      Median MAD_SD
reciprocal_dispersion 5.6    1.0   

Error terms:
 Groups Name        Std.Dev.
 year   (Intercept) 0.11    
 sic2   (Intercept) 2.11    
Num. levels: year 22, sic2 17

In this fitted model (6a), and also in (4a), and (5a), not shown here, the estimated coefficient for the quadratic term is a bit more than 2 standard errors away from zero, that is, statistically significant at the conventional level. The estimated quadratic term in (3a), the quasipoisson regression, is a bit more than 4 standard errors from zero; I guess this difference is attributable to the different ways that the quasipoisson and negative binomial effectively weight the extreme values in the data.

Quadratic regressions on log(y+1)

When modeling count data we usually start with the negative binomial model with log link. And here I wanted to connect to whatever version of Poisson regression was used in that published paper.

But, as noted above, these data aren’t counts–they’re not even integers, and I think it makes sense to just directly model them on the log scale. Some of the observations have zero values, though, and so I’ll follow the standard practice of modeling nonnegative data y by fitting regressions to log(y+1). My general recommendation along these lines is to model log(y+A), where A is some constant that corresponds to a baseline level of the data. In this case, the 354 data points include 46 zeros, then another 60 values between 0 and 1, then various values (none of them are integers!) ranging as high as 44.7. For the purpose of studying patent counts as measures of innovation, it seems reasonable to add 1 to these data, which blurs the lowest values (there is no big distinction between 0 and the lowest nonzero observation, 0.037) while preserving the distinction between the higher levels. This seems about right: to the extent these data will be supplying a signal on innovation, we’ll want to mostly be learning from data on the high end.

Here’s what happens when we fit the quadratic regression, not adjusting for industry and year effects, to the log(y+1) transformed data. Now we can just use normal errors and we don’t have to worry about Poisson or negative binomials, so there’s just one curve:

You can see the zeros at the very bottom of the graph. The model of normally-distributed errors:

(7) green curve: normal regression fit to log(y+1) (using lm())

is not perfect, but I think it’s close enough, and we no longer have to worry about exactly how we’re modeling extremely high values. On the log scale we still see a fitted quadratic curve. The fit is noisy–the coefficient for the quadratic term has a standard error that is much higher than the estimate itself–but, again, let’s move on to the regressions that adjusts for industry and year effects.

As before, we show the data and fit (this time, with both on the log(y+1) scale) broken out by industry:

The curves come from two fitted models:

(8) orange curve: normal regression fit to log(y+1), including factors for industry and year (using stan_glm())
(9) purple curve: multilevel regression fit to log(y+1), with varying intercepts for industry and year (using stan_glmer())

They’re pretty similar; I assume that the small differences between the two fits arise from the fact that the least-squares model (8) does more adjustment for industries. The estimated coefficients for the quadratic terms are 3 or 4 standard errors from zero. Here’s the output from model (9):

 family:       gaussian [identity]
 formula:      log_patcw ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -86.8   24.6 
Lc          186.6   52.5 
I(Lc^2)     -98.6   27.9 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.5    0.0   

Error terms:
 Groups   Name        Std.Dev.
 year     (Intercept) 0.09    
 sic2     (Intercept) 1.04    
 Residual             0.47    
Num. levels: year 22, sic2 17

Hinge regression on log(y+1)

As we have discussed already, the big problem with quadratic regression is that it enforces non-monotonicity. I’d like to fit a model that allows nonlinearity but without a declining slope automatically turning into a negative slope.

My first thought was to fit a model with an asymptote, something like this: b_0 + b_1*(1 – exp(x/b_3)), as an alternative. But this curve has the opposite problem: it is restricted to be monotonic. What I really want is a model that can have diminishing returns or can asymptote or can have that inverted U shape, with the question answered by the data.

We have such a model–the hinge function:


This is a curve that smoothly connects two straight lines which can have arbitrary slopes. The parameters of the curve are:
x0: the value of x where the two lines would intersect
a: the value of y where the two lines would intersect
b0: the slope of the line on the left side of the hinge
b1: the slope of the line on the right side of the hinge
delta: the scale of the continuous curve connecting the two lines

For our purposes, we are most interested in b1: is there evidence that this slope is negative?

We program the hinge model in Stan–the code’s right there at the linked post–also including a multilevel model with varying intercepts for industry and year, as above. Here’s the whole program:

functions {
  vector logistic_hinge(vector x, real x0, real a, real b0, real b1, real delta) { 
    vector[size(x)] xdiff = x - x0;
    return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta);
  }
}
data {
  real<lower=0> delta;
  int N;
  vector[N] x, y;
  int J1, J2;
  array[N] int<lower=1,upper=J1> group1;
  array[N] int<lower=1,upper=J2> group2;
}
parameters {
  real x0;
  real a, b0, b1;
  real<lower=0> sigma, sigma1, sigma2;
  vector<offset=0, multiplier=sigma1>[J1] a1;
  vector<offset=0, multiplier=sigma2>[J2] a2;
}
model {
  x0 ~ normal(1, 1);
  a ~ normal(0, 100);
  b0 ~ normal(0, 100);
  b1 ~ normal(0, 100);
  a1 ~ normal(0, sigma1);
  a2 ~ normal(0, sigma2);
  y ~ normal(logistic_hinge(x, x0, a, b0, b1, delta) + a1[group1] + a2[group2], sigma);
} 

When fitting the model, we specify the scale parameter delta, setting it to 0.05, which allows for a gentle curve within the range of the data (as you can see from the plots above, x ranges from about 0.85 to 1.0). There’s some arbitrariness here, but it’s just too hard to fit this parameter directly from the data. In effect this curvature is hard-coded into the quadratic regressions fit earlier.

We assign weak priors to the other parameters in the data, effectively excluding extreme values for the slopes of the curves. The only one of these priors that might be confusing is the prior for x0, the x-position of the hinge. We’re soft-bounding it at the high and low ends just so that the fit won’t get lost in extreme values: once x0 is far outside the range of the data, the curve effectively becomes a straight line and the location of the hinge becomes non-identified.

So, yeah, the hinge is a bit more work to fit compared to the quadratic, but that’s the price we have to pay to fit a more flexible model to this small dataset. And in this case I think the flexible model is absolutely necessary given the goal of seeing whether the data indicate that inverted U shape.

Here’s the result:

In each plot, the blue curve represents the posterior median of the parameters and the red curves correspond to 20 draws from the posterior distribution.

Here’s the summary of inferences:

 variable   mean median    sd   mad      q5    q95 rhat ess_bulk ess_tail
   lp__    71.69  71.85  6.65  6.87   60.81  82.19 1.00      838     1428
   x0       0.94   0.93  0.07  0.09    0.82   1.05 1.04      132     1204
   a        4.03   4.03  0.72  0.69    2.84   5.23 1.01     1178     1549
   b0      52.71  36.93 42.67 24.45   15.29 146.35 1.03      251     1288
   b1     -50.66 -31.22 46.50 22.85 -149.89 -11.44 1.03      146      883
   sigma    0.47   0.47  0.02  0.02    0.44   0.50 1.00     4279     2726
   sigma1   1.11   1.08  0.21  0.20    0.81   1.50 1.01      880     1217
   sigma2   0.08   0.08  0.04  0.04    0.02   0.16 1.00      999     1301
   a1[1]   -0.25  -0.25  0.30  0.30   -0.74   0.26 1.00      486      736
   a1[2]    0.23   0.22  0.30  0.30   -0.26   0.75 1.00      491     1036
...

The key parameter is b1, the slope on the right side of the curve. The posterior mean is -50.66 with standard error 46.50–so, apparently, not statistically significant–but if you look at the posterior quantiles, you’ll see that the distribution is very skewed. Indeed, it turns out that all 4000 of the posterior simulation draws of b1 are negative here.

So, after all this modeling, it seems that the data do clearly indicate an inverted U pattern!

That said, I haven’t explored the hinge model too carefully. For example, when I re-fit it with some other choices of the hinge scale parameter delta, it sometimes doesn’t mix well. I think the above graph with delta=0.5 fits these points about as well as is possible, so I’m ok with it as a data summary, but I’m not saying that my fitting procedure is fully computationally robust. If I were to be working more on the computation for this particular problem, I’d start by removing the year effects, as they can be adding stress to the computation without affecting the fit in any meaningful way.

Before getting to the interpretation of these results when it comes to competition and innovation, I want to tie up a couple of loose ends.

Quadratic regression on log(y+1), programmed in Stan

In fitting the above multilevel hinge regression, I moved from the pre-programmed Stan code of stan_glmer() to a custom Stan program. Just to check that nothing funny is going on here, I’ll go back and program the multilevel quadratic regression directly in Stan. It’s easy enough to write the program; I just replace the hinge by a quadratic function and get rid of the priors, and, indeed, we get essentially the same result as when fitting using stan_glmer(). I won’t bother showing the new graph here. No surprise, it’s just good to check.

The other thing we can do with the Bayesian fit is see whether the peak of the curve falls within the range of the data. The curve b0 + b1*x + b2*x^2 has its peak at x = -b1/(2*b2) (just take the derivative, set to 0, and solve for x), so we can compute the posterior distribution of this value from our simulations. It turns out that only 3 of the 4000 simulated curves has a peak outside the range of the data, so fair enough that, according to the fitted quadratic model (which I don’t like), the inverse U pattern is occurring within those bounds.

Adjusting for the group-level mean of the predictor

In general when fitting a multilevel model, it’s a good idea to adjust for the group-level mean of the predictor–in this case, the average value of x within each industry. Otherwise you have to worry about correlations between x and the varying intercepts. In this case, adding this group-level predictor doesn’t change much. I won’t share the result here just because I’ve already done a lot of work to write this up and there are enough other concerns with the analysis, but, yeah, it’s a good idea to include this predictor too.

Summary

OK, so what have we learned?

• The graph from the original paper (reproduced at the top of the above post) did not show a good fit because it didn’t display the adjustments for industry. We can see the pattern a lot clearer using separate plots for each industry.

• I think it makes more sense all around to model these data on the log(y+1) scale rather than using Poisson regression or any of its variants.

• The patterns within and between industries aren’t so clear to me. I don’t get why furniture has so many patents, why the number of patents for machinery and computers dropped, why the number of patents for electric and gas services shot up, and so forth. I’m not saying these numbers are wrong, and I might have made some mistake in coding; I’m just saying I’m baffled.

• I think the hinge model is much more appropriate than the quadratic for the propose of seeing the extent to which the data support an inverted U pattern. It wasn’t hard at all to program, debug, and fit the hinge model, and in this case it does support the inverted U. So in that sense the published paper was correct in its statistical conclusion, even if I think they got kinda lucky in finding the pattern with their quadratic curves.

• I still don’t buy the claim in the paper that they “find strong evidence of an inverted U relationship” between “product market competition and innovation.” Again, my main problem here is their measure of innovation using average weighted patent counts, an issue that further bothers me given the odd data patterns seen in the graphs for some of the industries, also with the selection in the data (“Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

• Also the model predicts y from x in the same year, and we’d expect a lag. I didn’t bother addressing this because of my other concerns about the data.

I guess that later work followed up with more comprehensive data sources, but I remain concern that any inverse U pattern, even if supported by this particular dataset, could depend strongly on various artifacts of the measures they are using. There aren’t really a lot of data points at high values of x here, so this downward slope is defined by how these patents are being counted in just a couple of industries.

All this might sound like picky criticism, but it’s the authors of the paper (and the Nobel prize committee), not me, who are making these broad claims about competition and innovation, and concerns about measurement seem really important here.

That said, it’s cool to see that a careful statistical analysis does find the inverted U. It does seem like a real pattern in the data, so the only question is the relevance of these data to the larger economics question.

tl;dr

In response to one of the comments below, I wrote that I don’t at all trust the idea of using time variation of patents within an industry to measure time variation of innovation. I get that the authors were doing their best using available data; still, I don’t think their data and analysis provide evidence for their substantive conclusions.

In the immortal words of John Tukey, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

Looking at it that way, all my statistical analysis with the transformations and the quadratic curves and the hinge function was all a waste of time, as I’m just doing a fancy analysis of bad (or, at least, irrelevant) data. But . . . summarizing the statistical information still has value in itself. My analysis just took a few hours, a small fraction of the time the authors must have spent on preparing, analyzing, and interpreting their data. I think it’s important, when doing statistical analysis, to do the best we can do, in this case accounting for all these sources of variation and uncertainty.

Or, to put it another way, yes, it would have been fine to dismiss the published results entirely given the problems with the data. But some of the problems with the data became apparent only after I made those plots showing the time series for each industry. At that point, all the fitting may well have been a waste of time–but I only thought of making the plots because I was trying to understand the puzzling fit.

All the links connecting theory, measurement, data, analysis, and conclusions are important.

P.S. All these data and measurement problems just leap out at us, but there’s a whole world of people out there who just accept these conclusions–even some of the graphs from that 2005 paper–as truth. For example I came across this post by political journalist Matthew Yglesias that just straight up accepts the iffy empirical claims from that paper. It’s tricky–journalists are busy, and it’s natural to think that a much-cited paper written by two Nobel prize winners and published in a top journal has to be correct. I don’t really know what to say, except that this is one of the useful functions of social media, to allow us to push back against default narratives. Not that the default is necessarily wrong, just that it can be wrong, and it’s hard for journalists to escape the bubble and recognize this.

P.P.S. There was some concern about the arbitrariness of the log(y+1) transformation so I also fit the model to sqrt(y), which is a standard variance-stabilizing transformation for count data. The results looked essentially the same:

So the inverted U really seems to be there–but you have to assume the curve has the same shape for all industries. If you fit a separate curve for each industry, there’s no way you’d find the U, and you can see that in the scatterplots. The trouble is that the data are too sparse and variable to try to estimate a separate curve for each industry. Just look at electric and gas services, for example. Or passenger transit.

Data and code
Continue reading

Statistical Graphics and Comics: Parallel Histories of Visual Storytelling

We blogged about this paper a few months ago. Following the suggestion of commenter Jarfas, we sent it to Nightingale, the journal of the Data Visualization Society.

Nightingale did a great job at formatting the article, and here it is!

The article begins:

What do data visualization and comics have in common? One of these is used to communicate in science and journalism, and the other appears in fine art and the entertainment media, but both combine text and image to tell stories. And both these media are relatively new, having made rapid progress only in the past few centuries, despite requiring little in the way of raw material to produce. We connect this history to a combination of abstraction and accessibility in both these forms of visual expression: comic strips and scatterplots both now seem intuitive but represent the development of abstract conventions. We also discuss differences between these two methods of visual storytelling in their goals and in how they are experienced by the reader.

As a heavy user of statistical graphics, a theorist of the use of graphics in statistical inference, and a fan of bandes dessinées, it meant a lot to me to put all this together in one paper, and I was lucky to be able to work with Susan Kruglinski on this project.

As background, here are some posts on comics and bandes dessinées:

And here are some published articles on the role of graphics in statistical inference:

I won’t list our blog posts on the topic because there are too many–779 so far, almost as many as the 793 in the Literature category–so you can just click through and read old posts as far as you’d like to go back. Here are the most recent few:

And here are slides from a talk from a few years back, Choices in statistical graphics: my stories.

Again, it was a lot of fun to work with Susan and put all this together in one article.

The history of Venn diagrams

As we discussed a couple years ago, I think that Venn diagrams have no practical use beyond three circles, but we should all be able to agree that the above pictures are pretty.

Jack Murtagh supplies the background:

Have you ever seen a proper Venn diagram with four overlapping circles? No, because it’s impossible. . . . John Venn knew of the shortcoming with circles, so he proposed ellipses to represent four sets. . . . This overcomes the limitations with circles but only temporarily. Ellipses work for four and five sets before failing in the same way that circles did. As the number of sets grows, we need more and more exotic shapes to portray them.

After reading that, I assumed that, in six or more dimensions, a complete Venn diagram would require at least one nonconvex shape. But I was wrong–a quick google search led to this article from 1997-2005 by Frank Ruskey and Mark Weston with lots of details.

Murtagh continues:

One could reasonably argue that beyond four sets of elements, Venn diagrams lose their utility. The four-ellipse image is already pretty chaotic. Maybe for five-plus sets we should abandon visual representations. But utility does not animate the mathematician so much as beauty and curiosity.

Agreed! Except that I’d say that the four-ellipse image is already too complicated for me to imagine it being useful in any case. Even the three-circle Venn diagram doesn’t seem to me to serve any practical purposes other than pedagogical.

In any case, here’s the fascinating punchline:

Venn and his successors believed that ellipses couldn’t portray all 32 regions required for a five-set diagram. Not until 1975 did mathematician Branko Grünbaum prove them wrong by example [see image above] . . . Grünbaum’s diagram displays a pleasing rotational symmetry. . . . Typical two- and three-circle Venn diagrams share this property. . . . But the four-ellipse diagram doesn’t have rotational symmetry. Can that be fixed? What do two, three and five have in common that four doesn’t?

In 1960 a then undergraduate student at Swarthmore College, David W. Henderson, answered this question with a surprising discovery (Stan Wagon and Peter Webb filled in some gaps later): Rotationally symmetric Venn diagrams are possible only when the number of sets is a prime number—a number divisible only by 1 and itself, such as 2, 3 and 5 but not 4. Henderson only showed that a prime number of sets is necessary, not that you can always design a symmetric Venn diagram for every prime number. . . . Mathematicians at the University of South Carolina settled the question in 2004 by showing that rotationally symmetric Venn diagrams exist for every prime number of sets.

Cool! Just one thing. I have no idea why Murtagh named Henderson, Wagon, and Webb but left anonymous the “mathematicians at the University of South Carolina” who settled the question. He links to the paper, though, so I can share the authors’ names: they’re Jerrold Griggs, Charles E. Killian, and Carla D. Savage. Thanks, Jerrold, Charles, and Carla!

Pascal’s triangle, the Ramanujan principle, and what makes something look like a part of an ellipse or a part a parabola?

John Cook writes:

The nth row of Pascal’s triangle contains the binomial coefficients C(n, r) for r ranging from 0 to n. For large n, if you print out the numbers in the nth row vertically in binary you can see a circular arc.

He explains:

The length of the numerical representation of a number is roughly proportional to its logarithm. Changing the base only changes the proportionality constant. The examples above suggests that a plot of the logarithms of a row of Pascal’s triangle will be a portion of a circle, up to some scaling of one of the axes, so in general we have an ellipse.

Cook continues with an explanation of why the ellipse fits so well:

WoЇfgang pointed out that the curve should be a parabola rather than an ellipse because the binomial distribution is asymptotically normal. Makes perfect sense.

So I redid my plots with the parabola that interpolates log C(n, r) at 0, n/2, and n. This also gives a very good fit, but not as good!

But that’s not a fair comparison because it’s comparing the best (least squares) elliptical fit to a convenient parabolic fit.

So I redid my plots again with the least squares parabolic fit. The fit was better, but still not as good as the elliptical fit.

I think the reason the ellipse fits better than the parabola has to do with the limitations of the central limit theorem. First of all, it applies to CDFs, not PDFs. Second, it applies to absolute error, not relative error. In practice, the CLT gives a good approximation in the middle but not in the tails. With all the curves mentioned above, the maximum error is in the tails.

Beyond the issue of the tails, I think there’s a perceptual issue, which is that we learn about parabolas in their convex orientation, as here:

The other thing is that a circle or an ellipse is finite and a parabola keeps going forever. So, the very fact that this graph stops makes it look less parabola-like, as compared to the sort of graph you might make where you can visually follow the curve off the edge of the graph.

The Ramanujan principle

Also I wanted to connect Cook’s point, that a table of numbers expressed in positional notation is approximately a graph of their logarithms, to the Ramanujan principle:

Tables are commonly read as crude graphs: what you notice in a table of numbers is (a) the minus signs, and thus which values are positive and which are negative, and (b) the length of each number, that is, its order of magnitude.

The name of the principle comes from a famous story of the mathematician Srinivasa Ramanujan supposedly conjecturing the asymptotic form of the partition function based on a look at a table of the first several partition numbers: he was essentially looking at a graph on the logarithmic scale.

The ladder of abstraction in statistical graphics

It was so much fun having a graphics post yesterday that I thought I’d do another, this time sharing one of my favorite recent articles, which begins:

Graphical forms such as scatterplots, line plots, and histograms are so familiar that it can be easy to forget how abstract they are. As a result, we often produce graphs that are difficult to follow. We propose a strategy for graphical communication by climbing a ladder of abstraction, starting with simple plots of special cases and then at each step embedding a graph into a more general framework. We demonstrate with two examples, first graphing a set of equations related to a modeled trajectory and then graphing data from an analysis of income and voting.

I really like this idea of presenting a sequence of increasingly abstract graphs. It’s kind of a graphical analogue to statistical workflow, in that we can understand the more complicated product by explicitly connecting it to the simpler steps that came before. All too often, we have this killer graph which we then have to spend lots of time explaining. Instead of presenting the graph and providing a separate explanation, my new recommendation is to build up from simpler graphs, explaining each new degree of freedom as it comes up.

Statistical graphics: When does it make sense to introduce deliberate distortion to counteract an expected perceptual illusion?

It’s been awhile since we’ve had a post entirely devoted to graphics!

Kaiser writes:

The link here contains an example of how the line-angle illusion can lead to misreading of trends on line charts:

Is there a bigger difference in revenue at Time 1 than Time 2? Many of us will think so but on careful judgment, I think all of us can agree that the difference at Time 2 is in fact larger. . . .

Studies have shown that humans tend to read not the vertical gaps but the angular gaps. Again, this issue is illustrated in the first mentioned paper:

Matthias explained that their implementation of the hammock plot uses a strategy to counteract this line-angle illusion.

I take this to mean they distort the data in such a way that after readers apply the line-angle illusion, the resulting view would convey correctly the correct trend. A kind of double negative strategy. The paper linked above offers one such counter-illusion strategy.

I imagine this is a bit controversial as we are introducing deliberate distortion to counteract an expected perceptual illusion.

I’m not aware of any software that offers built-in functions that perform this type of illusion-busting adjustments. Do you know any?

I’ve actually thought about this question a lot! In some sense, just about all statistical graphs introduce deliberate distortion to counteract an expected perceptual illusion, in the sense that, with the exception of maps and astronomical charts, a graph is an abstract representation of data.

But to get closer to what Kaiser is asking: the analogy I’ve given is, suppose you’re building a wooden chair but using boards that are warped. In this case, the right thing to do is to incorporate the warp into the design, i.e. cut some pieces shorter than others and at different angles, etc., so that they fit together as is, rather than trying to go all rectilinear and then glue/nail everything together. The trouble with the latter strategy is that the wood will exert pressure on the joints and eventually the chair will break or distort itself in some way.

So, similarly, if you can anticipate that a graph will be misread, it’s a good idea to account for this possibility in the design.

Most simply we do this by just not making graphs such as tilted pie charts, 3-D bar charts, and other gimmicks that jump off the page and engage all sorts of visual illusions.

The other common option is to make an additional graph. For example, you could keep the graph above with the two time series but just add another graph showing their difference. Why not?

But what about Kaiser’s original question: are there any graphs with deliberate distortions designed to counteract perceptual illusions of visual artifacts? (I guess that a log transformation doesn’t count here.)

I don’t have any perfect examples here, but I have one example from my applied research that comes close.

The example comes from my paper with Yotam on social penumbras. First there’s the explanatory diagram:

Then the data graph:

There are two things going on here.

First, the diagram is a circle but the data graphs are quarter circles, which we did for two reasons: (a) the quarter circle takes up only a quarter as much space, which is important when we’re displaying 14 of these at once (yes, you could just make smaller full circles but then you have only half the resolution when comparing sizes), and (b) for the goal of comparing one group to the next, quarter circles are better because you can compare the slices, as compared to full circles which all just look like bullseyes and are hard to tell apart.

Second, areas of shapes are notoriously difficult to compare. Why did we do these damn circle plots at all? Why not dot plots or repeated bar charts or some other visualization that would facilitate linear comparisons? The answer is that it was important to us to preserve the “feel” of the penumbra, the idea of concentric social groups. We were willing to pay a bit in statistical clarity in order to have this conceptual unity of the graph and the content of the paper.

But then the issue arises that, when comparing areas, people don’t really compare areas. Nor do they compare linear dimensions. At least according to Cleveland’s classic book, the implicit comparison is something in between. So by displaying the data as areas, we’re knowingly handing people a distortion. For example, if a certain group represents 1% of the population, then the core group (the yellow circle in the graph) will take up 1% of the area of the full circle and thus will be 10% in linear dimension.

That’s bad, right? Maybe not! Several of the groups in our study did have core populations, and if these were displayed as 1% in the linear dimension, they’d be really hard to tell apart. By using these intuitive-looking area graphs, we’re implicitly doing a square-root transformation without having to explain it. So I think it’s fair to say that we’re taking advantage of a perceptual illusion.

P.S. Pro tip: Do you see how in our graph, we order the groups by increasing size. Not alphabetically. You should almost never display your data alphabetically. OK, here’s a rare counterexample, with a slightly prettier visualization and some more graphs in Section 2.3 (“All graphs are comparisons”) of Regression and Other Stories:

For these we displayed the data straight up, no distortion.

Election analytics positions available at the New York Times

Will Davis writes:

I oversee the Election Analytics department (aka The Needle and The Times/Siena Poll) at The Times.

We’re lucky enough to have two great roles open on the team right now for people excited to make a career in the field of election analytics.

* Election analyst: This is a mostly technical role, but it comes with the opportunity to write. This person would be responsible for some most essential elements of our election polling and modeling — modeling unit-level turnout and vote share, creating baselines for The Needle and helping us continue to innovate the design of The Times/Siena Poll.

* Election researcher: This is a great job for somebody with less of a technical skillset who’s eager to be around a team they can learn a ton from. It’s primarily focused on manual research and data entry.

These sound like excellent an opportunity to combine statistical modeling, computation, and graphics.

“Exploratory data analysis” and “confirmatory data analysis” are the same thing.

This is not new–it just happened to come up in class the other day and I thought it was worth saying again. I made the point in my 2003 article, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing, and in chapter 6 of Bayesian Data Analysis, back in 1995.

Here’s an explainer from 2010:

So-called exploratory and confirmatory methods are not in opposition (as is commonly assumed) but rather go together. The history on this is that “confirmatory data analysis” refers to p-values, while “exploratory data analysis” is all about graphs, but both these approaches are ways of checking models. Bayesians should embrace graphical displays of data—which I interpret as visual posterior predictive checks—rather than, as is typical, treating exploratory data analysis as something to be done quickly before getting to the real work of modeling.

Here’s how I see things usually going in a work of applied statistics:

Step 1: Exploratory data analysis. Some plots of raw data, possibly used to determine a transformation.

Step 2: The main analysis—maybe model-based, maybe non-parametric, whatever. It is typically focused, not exploratory.

Step 3: That’s it.

I have a big problem with Step 3 (as maybe you could tell already). Sometimes you’ll also see some conventional model checks such as chi-squared tests or qq plots, but rarely anything exploratory. Which is really too bad, considering that a good model can make exploratory data analysis much more effective and, conversely, I’ll understand and trust a model a lot more after seeing it displayed graphically along with data.

There’s some history to all of this. John Tukey’s classic book, Exploratory Data Analysis, was published in 1977 but I believe he began to work in that area in the 1960s, about ten or fifteen years after doing his also extremely influential work on multiple comparisons (that is, confirmatory data analysis). I’ve always assumed that Tukey was finding p-values to be too limited a tool for doing serious applied statistics–something like playing the piano with mittens. I’m sure Tukey was super-clever at using the methods he had to learn from data, but it must have come to him that he was getting the most from his graphical displays of p-values and the like, rather than from their Type 1 and Type 2 error probabilities that he’d previously focused so strongly on. From there it was perhaps natural to ditch the p-values and the models entirely–as I’ve written before, I think Tukey went a bit too far in this particular direction–and see what he could learn by plotting raw data. This turned out to be an extremely fruitful direction for researchers, and followers in the Tukey tradition–I’m thinking of statisticians such as Bill Cleveland, Howard Wainer, Andreas Buja, Diane Cook, Antony Unwin, etc.–continued to make progress here. More recently, this has overlapped with work by Hadley Wickham and others on EDA-friendly graphics software and work by Jessica Hullman and others on the communication of uncertainty and variation.

The actual methods and case studies in the EDA book . . . well, that’s another story. Hanging rootograms, stem-and-leaf plots, goofy plots of interactions, the January temperature in Yuma, Nevada—all of this is best forgotten or, at best, remembered as an inspiration for important later work. Tukey was a compelling writer, though–I’ll give him that. I read Exploratory Data Analysis twenty-five years ago and was captivated. At some point I escaped its spell and asked myself why I should care about the temperature in Yuma–but, at the time, it all made perfect sense. Even more so once I realized that his methods are ultimately model-based and can be even more effective if understood in that way (a point that I became dimly aware of while completing my Ph.D. thesis in 1990–when I realized that the model I’d spent two years working on didn’t actually fit my data–and which I first formalized at a conference talk in 1997 and published in 2003 and 2004. It’s funny how slowly these ideas develop.).

It’s funny about Tukey’s work on multiple comparisons. He wrote an entire unpublished book on the topic, along with many short research articles. Individually each of these articles is readable and compelling, but, stepping back, I see how it’s all based on a foolish familywise error framework.

Here’s what I think happened: Tukey was following some version of the operations-research approach to statistics associated with Wald. It makes sense–this was the framework that they used to win the second world war. And it was pretty much the only game in town (yes, there was also Bayes, but for historical reasons Bayesian methods got no respect back then). Tukey was brilliant, a great problem solver, co-inventor of the fast Fourier transform and lots of other things, but for whatever reason he didn’t apply his depth, breadth, and creativity toward thinking about the fundamentals. I guess you’d call him a fox. In the usual telling, the fox is the hero and the hedgehog is the boring obsessive, and it’s fair to say that the world benefited from Tukey’s foxness, his interest in developing new methods and solving problems rather than refactoring the foundations of statistics–but I think that his lack of depth in that area contributed to him wasting a lot of time and effort on multiple comparisons, to the extent that his only way forward was to rip it all up and start again with EDA.

From a modern perspective, EDA and CDA are the same thing: they’re both ways of comparing observed data to hypothetical replications under a model. Recall the goal of EDA to discover the unexpected: “the unexpected” is defined relative to “the expected,” hence EDA is model checking just as CDA is, with the only differences being: (a) in EDA the display is visual rather than numerical, (b) in EDA the reference model is often defined implicitly rather than explicitly.

Indeed, the “news you can use” aspects of this post–and of my general point about EDA being the same as CDA–are:

1. If you’re gonna do EDA, make your reference model as explicit as possible. The clearer your assumptions, the better you can find problems. It’s Popper–or, really, Lakatos–in action.

2. If you’re gonna fit complex models (which we’re doing more and more of in statistics and machine learning), EDA is more important than ever. EDA is not a set of qq plots you make before getting to the serious bit of modeling; it’s a key step in workflow. I frame this Bayesianly as this is the simplest way for me to do the work, but you can do non-Bayesian versions as long as you have generative models for your data, and as long as your methods are flexible enough to accommodate different sources of information. (Recall the most important aspect of a statistical method.)

Unfortunately, Tukey was stuck in an old-fashioned statistical framework–good enough to solve operations research problems in the war, but not strong enough for all the applied problems he and others were encountering in the 1960s and later–so, for historical reasons, EDA and CDA were perceived as opposites. Which is too bad. Hence this post, which repeats things that my colleagues and I have been saying for 30 years, and which was implicit in the work of Rubin, Box, and others for longer than that.

Intergenerational socioeconomic mobility over time for different ethnic groups

Elisa Jácome, Ilyana Kuziemko, Suresh Naidu write:

We estimate long-run trends in intergenerational relative mobility for representative samples of the U.S.-born population. Harmonizing all surveys that include father’s occupation and own family income, we develop a mobility measure that allows for the inclusion of non-whites and women for the 1910s–1970s birth cohorts. We show that mobility increases between the 1910s and 1940s cohorts and that the decline of Black-white income gaps explains about half of this rise. We also find that excluding Black Americans, particularly women, considerably overstates the level of mobility for twentieth-century birth cohorts while simultaneously understating its increase between the 1910s and 1940s.

This is an interesting paper both for its substantive content and for its use of data. The authors also prepared a teaching supplement that walks through the analysis in detail. Good job!

P.S. For both of the displays above, I think a grid of small plots would work better. There’s this problem where researchers seem to think they need to cram as much as possible into one graph, but they you end up with all these symbols and colors, and the reader needs to go back and forth between the graph and the legend . . . it’s kind of a mess. On the plus side, it’s good to see any scatterplots at all in an empirical paper!