
They added a hierarchical structure to their model and their parameter estimate changed a lot: How to think about this?

Jesús Humberto Gómez writes:

I am an epidemiologist and currently I am studying my fourth year of statistics degree.

Currently we have a dataset with data structure shown here:

We want to investigate the effect of mining contamination on the blood lead levels. We have a total of 8 inhabited locations and the participants and politicians want to know the mean levels in each location.

To give an answer, we have constructed a hierarchical model: level one is children (or mothers), and level two is defined by the locations, with mining zone as a population effect (that is, a fixed effect). There are no explanatory variables at the second level. The sizes of the locations are 114, 37, 19, 11, 63, 56, 40, 12 (first four mining zone, second four non mining zone).

The model converges properly. Our data are left-censored and this has been taken into account in the likelihood. The mining effect obtained is 23% higher than that obtained running the model without the hierarchical structure (0.59 vs 0.48) and this worries us.

In addition, we have doubts about what we are doing, because locations are nested in the zones (mining vs non mining) and we are modeling zone as a population effect.

Then, is this approach correct?

At the moment, we are going to study separately the mining effect in children and mothers, but in the future we will study both mothers and children together since the blood lead levels correlation is high.

Of course, we are using Stan.

My reply:

First, good call on using Stan! This gives you the flexibility to expand the model as needed.

Now on to the question. Without looking at all the details, I have a few thoughts:

First, if you fit a different model you’ll get a different estimate, so in that sense there’s no reason to be bothered that your estimate is changing.

But it does make sense to want to understand why, or should I say how, the estimate changes when you add more information, or when you improve your model. For simple regressions there are methods such as partial correlations that are designed to facilitate these explanations. We need something similar for multilevel models, and for statistical models in general: a “trail of breadcrumbs” tracking how inferences for quantities of interest change as we change our models.

For the particular example discussed above, I have one more suggestion which is to include, as a group-level predictor, the group-level average of your individual-level predictor. Bafumi and I discuss this in our unpublished paper from 2006.
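As a rough sketch of the data-preparation step this suggestion implies (not code from the paper), the group-level average of an individual-level predictor can be computed and attached to every observation before the model is fit. The function names `group_means` and `add_group_mean` and the toy data are hypothetical.

```python
from collections import defaultdict

def group_means(groups, x):
    """Return {group: mean of x within that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, xi in zip(groups, x):
        totals[g] += xi
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}

def add_group_mean(groups, x):
    """Return, for each observation, the pair (x_i, mean of x in its group)."""
    means = group_means(groups, x)
    return [(xi, means[g]) for g, xi in zip(groups, x)]

# Hypothetical toy data: three locations and one individual-level predictor.
groups = ["a", "a", "b", "b", "b", "c"]
x = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0]
rows = add_group_mean(groups, x)
for g, (xi, xbar) in zip(groups, rows):
    print(g, xi, xbar)
```

The second element of each pair would then enter the multilevel model as a group-level predictor alongside the individual-level one.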

Rao-Blackwellization and discrete parameters in Stan

I’m reading a really dense and beautifully written survey of Monte Carlo gradient estimation for machine learning by Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. There are great explanations of everything including variance reduction techniques like coupling, control variates, and Rao-Blackwellization. The latter’s the topic of today’s post, as it relates directly to current Stan practices.

Expectations of interest

In Bayesian inference, parameter estimates and event probabilities and predictions can all be formulated as expectations of functions of parameters conditioned on observed data. In symbols, that’s

\displaystyle \mathbb{E}[f(\Theta) \mid Y = y] = \int f(\theta) \cdot p(\theta \mid y) \, \textrm{d}\theta

for a model with parameter vector \Theta and data Y = y. In this post and most writing about probability theory, random variables are capitalized and bound variables are not.

Partitioning variables

Suppose we have two random variables A, B and want to compute an expectation \mathbb{E}[f(A, B)]. In the Bayesian setting, this means splitting our parameters \Theta = (A, B) into two groups and suppressing the conditioning on Y = y in the notation.

Full sampling-based estimate of expectations

There are two unbiased approaches to computing the expectation \mathbb{E}[f(A, B)] using sampling. The first one is traditional, with all random variables in the expectation being sampled.

Draw (a^{(m)}, b^{(m)}) \sim p_{A,B}(a, b) for m \in 1:M and estimate the expectation as

\displaystyle\mathbb{E}[f(A, B)] \approx \frac{1}{M} \sum_{m=1}^M f(a^{(m)}, b^{(m)}).

Marginalized sampling-based estimate of expectations

The so-called Rao-Blackwellized estimator of the expectation involves marginalizing p_{A,B}(a, b) and sampling b^{(m)} \sim p_{B}(b) for m \in 1:M. The expectation is then estimated as

\displaystyle \mathbb{E}[f(A, B)] \approx \frac{1}{M} \sum_{m=1}^M \mathbb{E}[f(A, b^{(m)})]

For this estimator to be efficiently computable, the nested expectation must be efficiently computable,

\displaystyle \mathbb{E}[f(A, b^{(m)})] = \int f(a, b^{(m)}) \cdot p(a \mid b^{(m)}) \, \textrm{d}a.
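To make the two estimators concrete, here is a minimal simulation sketch in Python (not Stan) under an assumed toy model chosen only because the inner expectation has a closed form: B ~ Normal(0, 1), A given B = b is Normal(b, 1), and f(a, b) = a^2, so that \mathbb{E}[f(A, b)] = b^2 + 1. The function names and the settings `M` and `reps` are hypothetical.

```python
import random
import statistics

def full_sampling_estimate(M, rng):
    """Average f(a, b) = a**2 over M joint draws (a, b)."""
    total = 0.0
    for _ in range(M):
        b = rng.gauss(0.0, 1.0)   # b ~ Normal(0, 1)
        a = rng.gauss(b, 1.0)     # a | b ~ Normal(b, 1)
        total += a ** 2
    return total / M

def rao_blackwell_estimate(M, rng):
    """Average the closed-form inner expectation E[A**2 | b] = b**2 + 1 over M draws of b."""
    total = 0.0
    for _ in range(M):
        b = rng.gauss(0.0, 1.0)
        total += b ** 2 + 1.0     # a is never sampled
    return total / M

rng = random.Random(1234)
M, reps = 200, 500
full = [full_sampling_estimate(M, rng) for _ in range(reps)]
rb = [rao_blackwell_estimate(M, rng) for _ in range(reps)]
print("full sampling: mean %.3f, sd %.3f" % (statistics.mean(full), statistics.stdev(full)))
print("marginalized:  mean %.3f, sd %.3f" % (statistics.mean(rb), statistics.stdev(rb)))
```

In this toy setup the true value is 2 and the per-draw variances are 8 for the full estimator and 2 for the marginalized one, so the marginalized estimator should show roughly half the standard error at the same M.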

The Rao-Blackwell theorem

The Rao-Blackwell theorem states that the marginalization approach has variance less than or equal to the direct approach. In practice, this difference can be enormous. It will be based on how efficiently we could estimate \mathbb{E}[f(A, b^{(m)})] by sampling a^{(n)} \sim p_{A \mid B}(a \mid b^{(m)}),

\displaystyle \mathbb{E}[f(A, b^{(m)})] \approx \frac{1}{N} \sum_{n = 1}^N f(a^{(n)}, b^{(m)})
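One way to see where the inequality comes from is the law of total variance, a standard identity (stated here for the per-draw variance under independent draws, not quoted from the survey):

\displaystyle \textrm{Var}[f(A, B)] = \mathbb{E}\big[\textrm{Var}[f(A, B) \mid B]\big] + \textrm{Var}\big[\mathbb{E}[f(A, B) \mid B]\big] \geq \textrm{Var}\big[\mathbb{E}[f(A, B) \mid B]\big].

The left-hand side is the per-draw variance of the full sampling estimator and the right-hand side is that of the marginalized estimator. The gap, \mathbb{E}[\textrm{Var}[f(A, B) \mid B]], is the extra noise introduced by sampling a rather than integrating it out.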

Discrete variables in Stan

Stan does not have a sampler for discrete variables. Instead, Rao-Blackwellized estimators must be used, which essentially means marginalizing out the discrete parameters. Thus if A is the vector of discrete parameters in a model, B the vector of continuous parameters, and y the vector of observed data, then the model posterior is p_{A, B \mid Y}(a, b \mid y).

With a sampler that can efficiently make Gibbs draws (e.g., BUGS or PyMC3), it is tempting to try to compute posterior expectations by sampling,

\displaystyle \mathbb{E}[f(A, B) \mid Y = y] \approx \frac{1}{M} \sum_{m=1}^M f(a^{(m)}, b^{(m)}) where (a^{(m)}, b^{(m)}) \sim p_{A, B \mid Y}(a, b \mid y).

This is almost always a bad idea if it is possible to efficiently calculate the inner Rao-Blackwellization expectation, \mathbb{E}[f(A, b^{(m)})]. With discrete variables, the formula is just

\displaystyle \mathbb{E}[f(A, b^{(m)})] = \sum_{a \in A} p(a \mid b^{(m)}) \cdot f(a, b^{(m)}).

Usually the summation can be done efficiently in models like mixture models where the discrete variables are tied to individual data points or in state-space models like HMMs where the discrete parameters can be marginalized using the forward algorithm. Where this is not so easy is with missing count data or variable selection problems where the posterior combinatorics are intractable.
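For the mixture case, the summation over each data point's discrete component indicator can be sketched as follows. This is an illustrative Python version of the log-sum-exp calculation, not Stan code and not taken from the User's Guide. The normal mixture, the function names, and the numbers are assumptions made for the example.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def component_terms(y, weights, mus, sigmas):
    """log p(z = k) + log p(y | z = k) for each component k."""
    return [math.log(w) + normal_lpdf(y, mu, s)
            for w, mu, s in zip(weights, mus, sigmas)]

def mixture_log_lik(ys, weights, mus, sigmas):
    """Log likelihood with every point's component indicator summed out."""
    return sum(log_sum_exp(component_terms(y, weights, mus, sigmas)) for y in ys)

def responsibilities(y, weights, mus, sigmas):
    """Pr[z = k | y, parameters] for each component k."""
    terms = component_terms(y, weights, mus, sigmas)
    norm = log_sum_exp(terms)
    return [math.exp(t - norm) for t in terms]

# Hypothetical data and parameter values.
ys = [-1.2, 0.3, 4.1]
weights, mus, sigmas = [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]
print(mixture_log_lik(ys, weights, mus, sigmas))
print(responsibilities(4.1, weights, mus, sigmas))
```

Because each indicator is tied to a single data point, the sum has only as many terms as there are components, which is what keeps the marginalization cheap.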

Gains from marginalizing discrete parameters

The gains to be had from marginalizing discrete parameters are enormous. This is even true of models coded in BUGS or PyMC3. Cole Monnahan, James Thorson, and Trevor Branch wrote a nice survey of the advantages of marginalization for some ecology models that compares marginalized HMC with Stan to JAGS with discrete sampling and JAGS with marginalization. The takeaway here isn’t that HMC is faster than JAGS, but that JAGS with marginalization is a lot faster than JAGS without.

The other place to see the effects of marginalization is in the Stan User’s Guide chapter on latent discrete parameters. The first change-point example shows how much more efficient the marginalization is by comparing it directly with estimates generated from exact sampling of the discrete parameters conditioned on the continuous ones. This is particularly true of the tail statistics, which can’t be estimated at all with MCMC because too many draws would be required. I had the same experience in coding the Dawid-Skene model of noisy data coding, which was my gateway to Bayesian inference. I had coded it with discrete sampling in BUGS, but BUGS took forever (24 hours compared to 20m for Stan for my real data) and kept crashing on trivial examples during my tutorials.

Marginalization calculations can be found in the MLE literature

The other place marginalization of discrete parameters comes up is in maximum likelihood estimation. For example, Dawid and Skene’s original approach to their coding model used the expectation maximization (EM) algorithm for maximum marginal likelihood estimation. The E-step does the marginalization and it’s exactly the same marginalization as required in Stan for discrete parameters. You can find the marginalization for HMMs in the literature on calculating maximum likelihood estimates of HMMs (in computer science, electrical engineering, etc.) and in the ecology literature for the Cormack-Jolly-Seber model. And they’re in the Stan User’s Guide.

Nothing’s lost, really

[edit: added last section explaining how to deal with posterior inference for the discrete parameters]

It’s convenient to do posterior inference with samples. Even with a Rao-Blackwellized estimator, it’s possible to sample a^{(m)} \sim p(a \mid b^{(m)}) in the generated quantities block of a Stan program and then proceed from there with full posterior draws (a^{(m)}, b^{(m)}) of both the discrete and continuous parameters.

As tempting as that is because of simplicity, the marginalization is worth the coding effort, because the gain in efficiency from working in expectation with the Rao-Blackwellized estimator is enormous for discrete parameters. It can often take problems from infeasible to straightforward computationally.

For example, to estimate the posterior distribution of a discrete parameter, we need the expectation

\displaystyle \mbox{Pr}[A_n = k] = \mathbb{E}[\textrm{I}[A_n = k]].

for all values k that A_n might take. This is a trivial computation with MCMC (assuming the number of values is not too large) and is carried out in Stan by defining an indicator variable and setting it. In contrast, estimating such a variable by sampling a^{(m)} \sim p(a \mid b^{(m)}) is very inefficient, and increasingly so as the probability \mbox{Pr}[A_n = k] being estimated gets smaller.

Examples of both forms of inference are shown in the user’s guide chapter on latent discrete parameters.
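The difference for small probabilities can be illustrated with a toy Python sketch (again, not Stan). It assumes a single binary parameter A with a made-up conditional probability Pr[A = 1 | b], an inverse logit of b - 7, and uses independent Normal(0, 1) draws as a stand-in for posterior draws of b. All names and settings are hypothetical.

```python
import math
import random
import statistics

def prob_a_equals_1(b):
    """Hypothetical conditional probability Pr[A = 1 | b]."""
    return 1.0 / (1.0 + math.exp(-(b - 7.0)))

def estimate_by_expectation(bs):
    """Average the conditional probabilities over the draws of b."""
    return sum(prob_a_equals_1(b) for b in bs) / len(bs)

def estimate_by_indicator(bs, rng):
    """Sample a ~ Bernoulli(Pr[A = 1 | b]) per draw and average the indicators."""
    hits = sum(1 for b in bs if rng.random() < prob_a_equals_1(b))
    return hits / len(bs)

rng = random.Random(2024)
M, reps = 1000, 200
by_expectation, by_indicator = [], []
for _ in range(reps):
    bs = [rng.gauss(0.0, 1.0) for _ in range(M)]   # stand-in for posterior draws of b
    by_expectation.append(estimate_by_expectation(bs))
    by_indicator.append(estimate_by_indicator(bs, rng))

print("expectation: mean %.5f, sd %.5f"
      % (statistics.mean(by_expectation), statistics.stdev(by_expectation)))
print("indicator:   mean %.5f, sd %.5f"
      % (statistics.mean(by_indicator), statistics.stdev(by_indicator)))
```

Both estimators target the same probability, but the indicator version only registers the handful of draws where a happens to equal 1, so its spread across replications should be much larger.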

David Leavitt and Meg Wolitzer

Staying at a friend’s place, I saw on the shelf Martin Bauman, a novel by David Leavitt published in 2000 that I’d never heard of. I read it and it was excellent. I’d call it “Jamesian”: I’ve never read anything by Henry James, but the style seems to fit the many descriptions of James that I’ve gathered from literary critics over the years. Comparing to authors I’ve actually read, I’d say that Martin Bauman is similar to The Remains of the Day and other books by Ishiguro: a style that is so simple and open and guileless that it approaches parody. Indeed, The Remains of the Day is clearly parodic, or at least a classic of the “unreliable narrator” genre; Martin Bauman falls just short of this, to the extent that, when I looked up reviews of the book, I found that some labeled the book as satire and others took it straight. I’m not sure what Leavitt was intending, but as a reader I’d prefer to just take the book’s sincerity at face value, with any parodic elements merely representing Leavitt’s recognition of life’s absurdities.

In any case, I’m reminded of a couple other authors we’ve been discussing recently. First is Ted Heller / Sam Lipsyte, whose style is in some way the complete opposite of Leavitt’s (straight rather than gay, brash rather than decorous, etc.) but is telling a similar story. An amusing comparison is that Heller/Lipsyte describe male characters in an accurate way, while all the women are pictured through the prism of sexual and social desire. With Leavitt it’s the reverse: the female characters get to be simply human, while the men are viewed through the prism.

The other comparison is to Meg Wolitzer. I’ve read several of her books recently, and she has a style that’s direct and open, similar though not identical to that of Leavitt. I get the impression that Leavitt is a bit more ruthless, willing to let his characters hang in classic British style (e.g., Evelyn Waugh or George Orwell), in contrast to Wolitzer who likes her characters so much that she wants to give them a happy ending. But, still, lots of similarities, not just in biography (the two authors, close to the same age, had literary success while in college and then each wrote a series of what might be called upper-middlebrow novels about families and relationships) but also in style.

Are GWAS studies of IQ/educational attainment problematic?

Nick Matzke writes:

I wonder if you or your blog-colleagues would be interested in giving a quick blog take on the recent studies that do GWAS (Genome-Wide-Association Studies) on “traits” like IQ, educational attainment, and income?

Matzke begins with some background:

The new method for these studies is to claim that a “polygenic score” can be constructed — these postulate that there are thousands of SNPs (single-nucleotide polymorphisms) that have tiny independent effects on the trait, and that by adding these up, the trait can be predicted to some degree. (The SNPs could themselves be causal, or perhaps in linkage disequilibrium (LD) with causal SNPs.)

I am an evolutionary biologist/phylogeneticist, but I do not work in GWAS. However, my sense of it is that the main way these studies work is that they construct hundreds of thousands of individual linear models, one for each (non-linked) SNP, do something like a Bonferroni correction, and then take all the SNPs beyond the p-value cutoff (something like 5×10^-8, although even within a paper there seem to be multiple cutoffs used) as the interesting ones. Then the individual effects are summed to produce a polygenic score for educational attainment, IQ, etc.
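The select-then-sum procedure described above can be sketched in a few lines of Python. This is a toy illustration of the description only, not the pipeline of any particular study. The function names, the SNP effects, the p-values, and the genotype are all made up, and the per-SNP regressions are taken as already done. A Bonferroni correction of 0.05 over roughly one million tests gives a cutoff of the 5×10^-8 kind mentioned.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff for a family-wise level alpha."""
    return alpha / n_tests

def select_snps(p_values, threshold):
    """Indices of SNPs whose p-value falls below the cutoff."""
    return [j for j, p in enumerate(p_values) if p < threshold]

def polygenic_score(genotype, effects, selected):
    """Sum of effect_j * allele_count_j over the selected SNPs for one individual."""
    return sum(effects[j] * genotype[j] for j in selected)

# Hypothetical per-SNP regression results and one individual's allele counts (0, 1, or 2).
p_values = [1e-9, 0.03, 2e-8, 0.5]
effects = [0.02, 0.10, -0.01, 0.05]
genotype = [2, 1, 1, 0]

threshold = bonferroni_threshold(0.05, 1_000_000)
selected = select_snps(p_values, threshold)
print(threshold, selected, polygenic_score(genotype, effects, selected))
```

Note that the SNP with the largest estimated effect in this made-up example is dropped because its p-value misses the cutoff. The score depends on the selection step as well as on the effect sizes.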

This work today has received a huge publicity boost in the New York Times:

– An editorial by a psychologist arguing that progressives should welcome these new results, with few hints about the limitations and problems with these kinds of studies:

Why Progressives Should Embrace the Genetics of Education
By Kathryn Paige Harden
Dr. Harden is a psychologist who studies how genetic factors shape adolescent development.
July 24, 2018

– A news report by Carl Zimmer on the results, which seems much more responsible and mentions some of the limitations stated in the paper, but not what I think are possible bigger statistical issues:

Years of Education Influenced by Genetic Makeup, Enormous Study Finds
More than a thousand variations in DNA were involved in how long people stayed in school, but the effect of each gene was weak, and the data did not predict educational attainment for individuals.
By Carl Zimmer
July 23, 2018

Here’s the referenced paper:

Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals
James J. Lee, Robbee Wedow, […]David Cesarini
Nature Genetics (2018)
Published: 23 July 2018

There was another round of this a few months back in various publications, about race and intelligence, that also involved GWAS, and seemed to try to prepare/repair the ground for people to accept the idea of genetic differences in IQ between races.

– How Genetics Is Changing Our Understanding of ‘Race’
By David Reich
March 23, 2018

– DNA is not our destiny; it’s just a very useful tool
Ewan Birney
Yes, our genes affect everything we do, from educational attainment to health, but they are only a contributing factor

– Denying Genetics Isn’t Shutting Down Racism, It’s Fueling It
By Andrew Sullivan

Some pushback:

– Genetic Intelligence Tests Are Next to Worthless
And not just because one said I was below average.
Carl Zimmer, May 29, 2018

He then expresses some concerns:

I [Matzke] am worried that (a) these GWAS statistical methods might be fundamentally flawed, despite their widespread popularity, leading to wrong or largely wrong conclusions both about the genetics of intelligence/education and perhaps many other traits (medical traits etc.), and (b) if flawed methods are contributing to the same bad old narratives about genetic causes of inequality (going back to eugenics, anti-immigration propaganda, genetic racism, etc.), we really need to know that!

Things that make me worry:

* The Lee et al. study reports that their polygenic score, derived from ~1 million individuals, still explains only 11% of the variance in educational attainment, and the median effect for an individual SNP was ~1 week of education

* The effect sizes go down by 40% when family-level variation is used (e.g. siblings where one has the SNP and one doesn’t)

* The polygenic score’s predictive ability, such as it was for a European-derived population, didn’t work well for an African-American population. Another case of GWAS predictions outside of the training population being problematic is this one on schizophrenia:
Polygenic risk score for schizophrenia is more strongly associated with ancestry than with schizophrenia
David Curtis

Key quote: “There are striking differences in the schizophrenia PRS between cohorts with different ancestries. The differences between subjects of European and African ancestry are much larger, by a factor of around 10, than the differences between subjects with schizophrenia and controls of European ancestry. . . . Two kinds of explanation suggest themselves. The most benign, from the point of view of the usefulness of the PRS, is that the PRS does indeed indicate genetic susceptibility to schizophrenia and that the contributing alleles are under stronger negative selection in African than non-African environments. The least benign would be to say that the PRS is basically an indicator of African ancestry and that for some reason, perhaps through mechanisms such as social adversity, subjects in the PGC with schizophrenia have a higher African ancestry component than controls. It does not seem that the latter can be a full explanation, because it does seem that the PRS is associated with schizophrenia risk in a homogeneous sample after correction for principal components. On the other hand, it is difficult to accept that the PRS does not index ancestry to at least some extent. . . . Whatever the explanation, these results have important implications for the interpretation of the PRS. . . .”

Much of the GWAS data comes from sources like the UK BioBank. We know, even if all the samples are from “European” individuals, that there will be genetic structure in the data due to ancestral geography and isolation-by-distance. All of these social “traits” (education, income, and IQ, which correlates with both; Ken Richardson argues that IQ may be nothing more than an index of these middle/upper-class attributes) also have geographically-structured variation, simply due to the history of economic development (among many other things). It seems to me that all it would take would be some regional historical variation in wealth/education, and some spatial structure in the genetics, to lead to weak correlations between certain alleles and educational attainment. That would apply in the UK with deep ancestral genetic structure, or in the USA where the history of immigration (even just European immigration) has been highly nonrandom, as has the wealth and status accumulation by ethnic group. I think it is not a stretch to say that there might be a difference in wealth and educational attainment between USA people with different European ethnicities — say, classic WASP populations in New England that date back to before the American Revolution, versus southern and eastern European populations that came later. This difference in wealth/average education would not have a genetic cause, but it would definitely have genetic correlations.
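The mechanism described here can be shown with a small simulation. The following Python sketch assumes two hypothetical regions that differ in both allele frequency and average years of education, with no causal effect of the allele at all. Every number and name in it is invented for illustration.

```python
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_region(n, allele_freq, mean_edu, rng):
    """Allele counts (0, 1, 2) and years of education, generated independently."""
    alleles = [sum(1 for _ in range(2) if rng.random() < allele_freq) for _ in range(n)]
    edu = [rng.gauss(mean_edu, 2.0) for _ in range(n)]
    return alleles, edu

rng = random.Random(7)
# Hypothetical regions: different allele frequencies, different mean education.
alleles_1, edu_1 = simulate_region(5000, 0.2, 12.0, rng)
alleles_2, edu_2 = simulate_region(5000, 0.6, 14.0, rng)

pooled = correlation(alleles_1 + alleles_2, edu_1 + edu_2)
within_1 = correlation(alleles_1, edu_1)
within_2 = correlation(alleles_2, edu_2)
print("pooled: %.3f, within region 1: %.3f, within region 2: %.3f"
      % (pooled, within_1, within_2))
```

The pooled correlation comes entirely from the regional differences, while within each region the allele and the outcome are unrelated by construction. Within-family comparisons and principal-component adjustments are intended to remove this kind of structure.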

Matzke concludes:

I wonder if these GWAS studies for wealth/IQ/education are mostly picking up accidental correlations due to ancestry (perhaps with a moderate proportion of genuinely causal alleles, perhaps mostly ones with a pathological effect). This would be a ready explanation of why polygenic scores can be nonpredictive or pathological outside of the training population, and why the effect sizes decrease dramatically when studying variants within families.

In other words, are GWASes on education and perhaps many other social traits mostly bunk? Are we perhaps going to see another great statistical crisis (like the crises in small-data psychology, or the p-value/replicability crises), but in the “big data” arena of Genome-Wide Association Studies?

And, is Harden’s essay in the New York Times, “Why Progressives Should Embrace the Genetics of Education”, thus wildly misguided, expressing confidence about statistical results that we shouldn’t be confident in, and dissuading skepticism about the Very Long And Bad history of people trying to explain systematic inequalities through genetics, when in fact we should be maintaining or increasing our skepticism in the modern world of genomics and GWAS?

PS: There is an extensive FAQ from the authors of the Nature Genetics study, which makes me feel somewhat better about the population stratification issue:

Also this from Graham Coop about the generic topic of between-population differences:
Polygenic scores and tea drinking

This stuff is so technical, and I have not tried to follow the details. But the topic seems important enough that I thought I’d share with all of you. Speaking generally, I can see the appeal of both sides of the argument. On one hand, even noisy data can provide some insight, and it seems reasonable to start by drawing hypotheses and tentative conclusions based on what we have; on the other, when a variable or set of variables explains only a small percentage of the variation in the outcome, you have to be concerned that selection biases will overwhelm any effect of interest. We can draw an analogy here to surveys with 10% response rates: for many purposes this is just fine, as long as we adjust for relevant differences between sample and population, but there will be questions for which the results of any comparisons are driven by biases that are hard to adjust for.

Causal inference in AI: Expressing potential outcomes in a graphical-modeling framework that can be fit using Stan

David Rohde writes:

We have been working on an idea that attempts to combine ideas from Bayesian approaches to causality developed by you and your collaborators with Pearl’s do calculus. The core idea is simple but, we think, powerful, and it allows some problems that previously had known solutions only with the do calculus to be solved in the Bayesian framework (in particular the front door rule).

In order to make the idea accessible we have produced a blog post (featuring animations), an online talk and technical reports. All the material can be found here.

Currently we focus on examples and intuition, we are still working on proofs. Although we don’t emphasise it, our idea is quite compatible with probabilistic programming languages like Stan where the probability of different outcomes for different counterfactual actions can be computed in the generated quantities block.

I took a quick look. What they’re saying about causal inference being an example of Bayesian inference for latent variables makes sense; I think this is basically the perspective of Rubin (1974). I think this is a helpful way of thinking, so I’m glad to see it being expressed in a different language. I’d recommend adding Rubin (1974) to your list of references. This is also the way we discuss causal inference in our BDA book (in all editions, starting with the first edition in 1995), where we take some of Rubin’s notation and explicitly integrate it into a Bayesian framework. But the causal analyses we do in BDA are pretty simple; it seems like a great idea to express these general ideas in a more computing-friendly framework.

Regarding causal inference in Stan: I think that various groups have been implementing latent-variable and instrumental-variables models, following the ideas of Angrist, Imbens, and Rubin, but generalizing to allow prior information and varying treatment effects. It’s been a while since I’ve looked at the Bayesian instrumental and latent-variables literature, but it’s my recollection that I thought things could be improved using stronger priors: a lot of the pathological results that arise with weak instruments can be traced to bad things in the limits with weak priors. These are the sorts of examples where a full Bayesian inference can be worse than some sort of maximum likelihood or marginal maximum likelihood because of problems of integrating over the distribution of a ratio whose denominator is of uncertain sign.

I guess what I’m saying is that I think there are some important open problems of statistical modeling here. Improvements in conceptualization and computation (such as may be demonstrated by the above-linked work) could be valuable in motivating researchers to push forward on the modeling as well.

My review of Ian Stewart’s review of my review of his book

A few months ago I was asked to review Do Dice Play God?, the latest book by mathematician and mathematics writer Ian Stewart.

Here are some excerpts from my review:

My favorite aspect of the book is the connections it makes in a sweeping voyage from familiar (to me) paradoxes, through modeling in human affairs, up to modern ideas in coding and much more. We get a sense of the different “ages of uncertainty”, as Stewart puts it.

But not all the examples work so well. The book’s main weakness, from my perspective, is its assumption that mathematical models apply directly to real life, without recognition of how messy real data are. That is something I’m particularly aware of, because it is the business of my field — applied statistics.

For example, after a discussion of uncertainty, surveys and random sampling, Stewart writes, “Exit polls, where people are asked who they voted for soon after they cast their vote, are often very accurate, giving the correct result long before the official vote count reveals it.” This is incorrect. Raw exit polls are not directly useful. Before they are shared with the public, the data need to be adjusted for non-response, to match voter demographics and election outcomes. The raw results are never even reported. The true value of the exit poll is not that it can provide an accurate early vote tally, but that it gives a sense of who voted for which parties once the election is over.

It is also disappointing to see Stewart trotting out familiar misconceptions of hypothesis testing . . . Here’s how Stewart puts it in the context of an otherwise characteristically clearly described example of counts of births of boys and girls: “The upshot here is that p = 0.05, so there’s only a 5% probability that such extreme values arise by chance”; thus, “we’re 95% confident that the null hypothesis is wrong, and we accept the alternative hypothesis”. . . .

As I recall the baseball analyst Bill James writing somewhere, the alternative to good statistics is not no statistics: it’s bad statistics. We must design our surveys, our clinical trials and our meteorological studies with an eye to eliminating potential biases, and we must adjust the resulting data to make up for the biases that remain. . . . One thing I like about Stewart’s book is that he faces some of these challenges directly. . . .

I believe that a key future development in the science of uncertainty will be tools to ensure that the adjustments we need to make to data are more transparent and easily understood. And we will develop this understanding, in part, through mathematical and historical examples of the sort discussed in this stimulating book.

As you can see from the above excerpts, my review is negative in some of the specifics but positive in general. Stewart had some interesting things to say but, when he moved away from physics and pure mathematics to applied statistics, he got some details wrong.

A month or so after my review appeared, Stewart replied in the same journal. His reply is short so I’ll just quote the whole thing:

In his review of my book Do Dice Play God?, Andrew Gelman focuses on sections covering his own field of applied statistics (Nature 569, 628–629; 2019). However, those sections form parts of just two of 18 chapters. Readers might have been better served had he described the book’s central topics — such as quantum uncertainty, to which the title of the book alludes.

Gelman accuses me of “transposing the probabilities” when discussing P values and of erroneously stating that a confidence interval indicates “the level of confidence in the results”. The phrase ‘95% confident’, to which the reviewer objects, should be read in context. The first mention (page 166) follows a discussion that ends “there’s only a 5% probability that such extreme values arise by chance. We therefore … reject the null hypothesis at the 95% level”. The offending sentence is a simplified summary of something that has already been explained correctly. My discussion of confidence intervals has a reference to endnote 57 on page 274, which gives a more technical description and makes essentially the same point as the reviewer.

I also disagree with Gelman’s claim that I overlook the messiness of real data. I describe a typical medical study and explain how logistic and Cox regression address issues with real data (see pages 169–173). An endnote mentions the Kaplan-Meier estimator. The same passage deals with practical and ethical issues in medical studies.

Here’s my summary of what Stewart said:

1. My review focuses on my own areas of expertise, which only represent a small subset of what the book is about.

2. His technically erroneous statements about hypothesis testing should be understood in context.

3. He doesn’t mention the bit about polling. Maybe he agrees he made a mistake there but he doesn’t want to talk about it, or maybe he didn’t want to look into polling too deeply, or maybe thinks the details of exit polls don’t really matter.

In reply, I’ll just say:

1a. I like Stewart’s book, and my review was largely positive!

1b. I think my review is more valuable when I can engage with my areas of expertise. Had I focused my review on Stewart’s treatment of quantum mechanics, I wouldn’t have had much of anything useful to say.

2. I recognize that it’s a challenge to convey technical concepts in words. It’s easy to write something that vaguely seems correct but actually is not. Here’s an embarrassing example from one of my own textbooks! So I have sympathy for Stewart here. Still, he got it wrong.

3. I think polling is important! If you’re gonna include something on exit polls in your book, you should try your best to get it right.

By writing a book with many examples, you leave many hostages to fortune. That’s ok—a book can have mistakes and still be valuable.

Which teams have fewer fans than their namesake? I pretty much like this person’s reasoning except when we get to the chargers and raiders.

Someone pointed me to this delightful collection of short statistical analyses:

In the Chicago Bears roast thread, 69memelordharambe420 posted “There are more Bears than Bears fans.” That got me [the author of this post] thinking: Is that true? And more generally, which teams have fewer fans than there exist whatever they’re named after?

To start, I needed a rough estimate of the number of NFL fans in the world. This turned out to be difficult to find. I found several reasonable estimates that ranged from 200,000,000 to 400,000,000, but the average estimate seems to be about 300,000,000, so I decided to go with that. If you prefer a different estimate, you can easily scale all of the final numbers up or down as needed.

Of those 300,000,000, about 90%, or 270,000,000, consider themselves fans of one team in particular. To find out how these 270,000,000 fans apportion themselves among the 32 teams, I used this page, which lists how many likes each team has on Facebook (it lists the St. Louis Rams and the San Diego Chargers but still has accurate numbers for the Facebook likes, I checked), and calculated the total number of likes across the 32 teams: 91,712,968. Then, I took the number of likes for each team and multiplied it by 270,000,000/91,712,968 (then rounded to the nearest whole number) to get the best estimate that I was realistically going to be able to get for the total number of fans that each team has. Here are my results:

Bears: There are roughly 12,092,476 Bears fans. There are eight species of bear, plus the grizzly-polar hybrid. I won’t go through all of my calculations, but I came up with a final number of 1,148,364. There are more Bears fans than bears.

Lions: There are roughly 5,642,181 Lions fans. The worldwide lion population is somewhere around 20,000. There are more Lions fans than lions.

Packers: There are roughly 16,024,215 Packers fans. I don’t really feel like doing extensive research on the worldwide meatpacking industry, but the U.S. meatpacking industry employs about 148,100 and there is no way that there are a hundred times that number outside of the country. There are more Packers fans than packers.

Vikings: There are roughly 6,200,740 Vikings fans. The Viking Age ended nearly a millennium ago. There are more Vikings fans than Vikings.

. . .
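The apportionment step in the quoted post is just proportional scaling, and a minimal sketch may make it concrete. The two totals come from the post; the function name and the like count passed to it are hypothetical, since the per-team Facebook numbers are not reproduced here.

```python
# Totals quoted in the post.
TEAM_FANS_TOTAL = 270_000_000      # 90% of an assumed 300,000,000 NFL fans
FACEBOOK_LIKES_TOTAL = 91_712_968  # likes summed over the 32 teams

# Every team's like count is multiplied by the same factor.
SCALE = TEAM_FANS_TOTAL / FACEBOOK_LIKES_TOTAL


def estimated_fans(team_likes: int) -> int:
    """Scale one team's Facebook likes up to an estimated fan count."""
    return round(team_likes * SCALE)


# Hypothetical like count, for illustration only (not a number from the post).
example_likes = 4_000_000
print(f"scale factor: {SCALE:.3f}")
print(f"estimated fans for {example_likes:,} likes: {estimated_fans(example_likes):,}")
```

Because the factor is shared across teams, swapping in a different guess for the total number of NFL fans rescales every team's count by the same proportion, which is the point the post makes about scaling the final numbers up or down.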

He runs out of steam near the end. For example:

Chargers: There are roughly 4,700,430 Chargers fans. The Los Angeles Chargers don’t seem to have been named after an actual thing, so I’ll improvise. . . .

I followed the link, which goes to wikipedia, where it says:

Frank Leahy, picked the Chargers name when he purchased an AFL franchise for Los Angeles: “I liked it because they were yelling ‘charge’ and sounding the bugle at Dodger Stadium and at USC games.”

OK, fine, but I think when they yell “charge” (especially when following that bugle tune), they’re talking about a cavalry charge. So the number of “chargers” in the world would be the number of horses in cavalry around the world, or something like that. So I think it’s pretty clear there are more Chargers fans than chargers.

Similarly, the author writes:

There are roughly 10,099,869 Raiders fans. Meanwhile, a ‘raider’ isn’t really an actual thing.

But a “raider” is an actual thing, right? I’m thinking a raider is some kind of pirate, maybe a land pirate of some sort, like a Viking, more generally some kind of violent thief. So what we have to do is compare the number of Raiders fans to the number of muggers in the world. I’m too lazy to estimate that—I’ll leave this one to the criminologists in the audience—but I’m guessing that it’s less than 10 million. Hmmm, there are approx 8 billion people in the world, suppose that half are little kids or too old to mug, that gives 4 billion, then we can assume that almost all the muggers are male, that’s 2 billion, hmmmm, if 0.1% of all men in the world are muggers, that would be 2 million muggers. For the number of muggers to be 10 million, we’d have to have approx 1/2 of 1% of all men in the world being muggers, and that just seems a bit high to me.
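Here is that back-of-the-envelope mugger calculation as a short script, in case you want to plug in your own guesses. Every input is one of the rough assumptions from the paragraph above; the variable names are mine.

```python
WORLD_POPULATION = 8_000_000_000  # "approx 8 billion people"
SHARE_OF_MUGGING_AGE = 0.5        # half are little kids or too old to mug
SHARE_MALE = 0.5                  # assume almost all muggers are male
RAIDERS_FANS = 10_099_869         # fan estimate quoted above

# Pool of potential muggers under these assumptions.
candidate_men = WORLD_POPULATION * SHARE_OF_MUGGING_AGE * SHARE_MALE

# Number of muggers if 0.1% of that pool are muggers.
muggers_at_one_in_thousand = candidate_men * 0.001

# Mugging rate needed for muggers to equal Raiders fans.
required_rate = RAIDERS_FANS / candidate_men

print(f"men of mugging age: {candidate_men:,.0f}")
print(f"muggers at a 0.1% rate: {muggers_at_one_in_thousand:,.0f}")
print(f"rate needed to match Raiders fans: {required_rate:.2%}")
```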

OK, I’ve done my part now.

P.S. Zad sent the above picture illustrating that there are more cats than fans of Cats.

The intellectual explosion that didn’t happen

A few years ago, we discussed the book, “A Troublesome Inheritance: Genes, Race, and Human History,” by New York Times reporter Nicholas Wade.

Wade’s book was challenging to read and review because it makes lots of claims that are politically explosive and could be true but do not seem clearly proved given available data. There’s a temptation in reviewing such a book to either accept the claims as correct and move straight to the implications, or conversely to argue that the claims are false.

The way I put it was:

The paradox of racism is that at any given moment, the racism of the day seems reasonable and very possibly true, but the racism of the past always seems so ridiculous.

I reviewed Wade’s book for Slate, we discussed it on the blog, and then I further discussed on the sister blog the idea that racism is a framework, not a theory, and that its value, or anti-value, comes from it being a general toolkit which can be used to explain anything.

I recently came across a review essay on Wade’s book, by sociologist Philip Cohen from 2015, that made some interesting points, in particular addressing the political appeal of scientific racism.

Cohen quotes from a book review in the Wall Street Journal by conservative author Charles Murray, who wrote that the publication of “A Troublesome Inheritance” would “trigger an intellectual explosion the likes of which we haven’t seen for a few decades.”

This explosion did not happen.

Maybe one reason that Murray anticipated such an intellectual explosion is that this is what happened with his own book, “The Bell Curve,” back in 1994.

So Murray’s expectation was that A Troublesome Inheritance would be the new Bell Curve: Some people would love it, some would hate it, but everyone would have to reckon with it. That’s what happened with The Bell Curve, and also with Murray’s earlier book, Losing Ground. A Troublesome Inheritance was in many ways a follow-up to Murray’s two successful books, and it was written by a celebrated New York Times author, so it would seem like a natural candidate to get talked about.

Another comparison point is Jared Diamond’s “Guns, Germs, and Steel,” which, like Wade’s book, attempted to answer the question of why some countries are rich and some are poor. I’m guessing that a big part of Diamond’s success was his book’s title. His book is not so much about guns or steel, but damn that’s a good title. A Troublesome Inheritance, not so much.

So what happened? Why did Wade’s book not take off? It can’t just be the title, right? Nor can it simply be that Wade was suppressed by the forces of liberal political correctness. After all, those forces detested Murray’s books too.

Part of the difference is that The Bell Curve got a push within the established media, as it was promoted by the “even the liberal” New Republic. A Troublesome Inheritance got no such promotion or endorsement. But it’s hard for me to believe that’s the whole story either: for one thing, the later book was written by a longtime New York Times reporter, so “the call was coming from inside the house,” as it were. But it still didn’t catch on.

Another possibility is that Wade’s book was just ahead of its time, not scientifically speaking but politically speaking. In 2014, racism seemed a bit tired out and it did not seem to represent much of a political constituency. After 2016, with Donald Trump’s victory in the U.S. and the rise of neo-fascist parties in Europe, racism is much more of a live topic. If Wade’s book had come out last year, maybe it would be taken as a key to understanding the modern world, a book to be taken “seriously but not literally” etc. If the book had come out when racism was taken to represent an important political constituency, then maybe there would’ve been a more serious attempt to understand its scientific justifications. At this point, though, the book is five years old so it’s less likely to trigger any intellectual explosions.

Anyway, the above is all just preamble to a pointer to Philip Cohen’s thoughtful article.

The latest Perry Preschool analysis: Noisy data + noisy methods + flexible summarizing = Big claims

Dean Eckles writes:

Since I know you’re interested in Heckman’s continued analysis of early childhood interventions, I thought I’d send this along: The intervention is so early, it is in their parents’ childhoods.

See the “Perry Preschool Project Outcomes in the Next Generation” press release and the associated working paper.

The estimated effects are huge:

In comparison to the children of those in the control group, Perry participants’ children are more than 30 percentage points less likely to have been suspended from school, about 20 percentage points more likely never to have been arrested or suspended, and over 30 percentage points more likely to have a high school diploma and to be employed.

The estimates are significant at the 10% level. Which may seem like quite weak evidence (perhaps it is), but actually the authors employ a quite conservative inferential approach that reflects their uncertainty about how the randomization actually occurred, as discussed in a related working paper.

My quick response is that using a noisy (also called “conservative”) measure and then finding p less than 0.10 does not constitute strong evidence. Indeed, the noisier (more “conservative”) the method, the less informative is any given significance level. This relates to the “What does not kill my statistical significance makes me stronger” fallacy that Eric Loken and I wrote about (and here’s our further discussion)—but only more so here, as the significance is at the 10% rather than the conventional 5% level.

In addition, I see lots and lots and lots of forking paths and researcher degrees of freedom in statements such as, “siblings, especially male siblings, who were already present but ineligible for the program when families began the intervention were more likely to graduate from high school and be employed than the siblings of those in the control group.”

Just like everyone else, I’m rooting for early childhood intervention to work wonders. The trouble is, there are lots and lots of interventions that people hope will work wonders. It’s hard to believe they all have such large effects as claimed. It’s also frustrating when people such as Heckman routinely report biased estimates (see further discussion here). They should know better. Or they should at least know enough to know that they don’t know better. Or someone close to them should explain it to them.

I’ll say this again because it’s such a big deal: If you have a noisy estimate (because of biased or noisy measurements, small sample size, inefficient (possibly for reasons of conservatism or robustness) estimation, or some combination of these reasons), this does not strengthen your evidence. It’s not appropriate to give extra credence to your significance level, or confidence interval, or other statement of uncertainty, based on the fact that your data collection or statistical inference are noisy.
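A small simulation can show why noise does not strengthen a significant result. This is only a sketch: the true effect and standard error are invented for illustration and are not taken from the Perry Preschool papers. It draws many noisy estimates, keeps the ones that clear the 10% significance threshold, and checks how much those survivors overstate the truth and how often they have the wrong sign.

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.1     # hypothetical true effect
STANDARD_ERROR = 0.5  # hypothetical noisy design: SE is 5x the true effect
Z_CRIT = 1.645        # two-sided p < 0.10 under a normal approximation
N_SIMS = 100_000

# Keep only the simulated estimates that reach "significance at the 10% level".
significant = []
for _ in range(N_SIMS):
    estimate = random.gauss(TRUE_EFFECT, STANDARD_ERROR)
    if abs(estimate) > Z_CRIT * STANDARD_ERROR:
        significant.append(estimate)

share_significant = len(significant) / N_SIMS
# Average magnitude of significant estimates, relative to the true effect.
exaggeration = statistics.mean(abs(e) for e in significant) / TRUE_EFFECT
# Share of significant estimates pointing in the wrong direction.
wrong_sign = sum(e < 0 for e in significant) / len(significant)

print(f"share significant: {share_significant:.3f}")
print(f"exaggeration ratio: {exaggeration:.1f}")
print(f"wrong-sign share: {wrong_sign:.3f}")
```

Under these made-up settings, any estimate that reaches significance has to be at least 1.645 standard errors from zero, so it is necessarily several times larger than the true effect. Making the method noisier raises that threshold in absolute terms and makes the exaggeration worse, not better.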

I’d say that I don’t think the claims in the above report would replicate—but given the time frame of any potential replication study, I don’t think replication will be tested one way or another, so a better way to put it is that I don’t think the estimates are at all accurate or reasonable.

But, hey, if you pick four point estimates to display, you get this:

That and favorable publicity will get you far.

P.S. Are we grinches for pointing out the flaws in poor arguments in favor of early childhood intervention? I don’t think so. Ultimately, our goal has to be to help these kids, not just to get stunning quotes to be used in PNAS articles, NPR stories, and TED talks. If the researchers in this area want to flat-out make the argument that exaggeration of effects serves a social good, that these programs are so important that it’s worth making big claims that aren’t supported by the data, then I’d like to hear them make this argument in public, for example in comments to this post. But I think what’s happening is more complicated. I think these eminent researchers really don’t understand the problems with noise, researcher degrees of freedom, and forking paths. I think they’ve fooled themselves into thinking that causal identification plus statistical significance equals truth. And they’re supported by an academic, media, and governmental superstructure that continues to affirm them. These guys have gotten where they are in life by not listening to naysayers, so why change the path now? This holds in economics and policy analysis, just as it does in evolutionary psychology, social psychology, and other murky research areas. And, as always, I’m not saying that all or even most researchers are stuck in this trap; just enough for it to pollute our discourse.

What makes me sad is not so much the prominent researchers who get stuck in this way, but the younger scholars who, through similar good intentions, follow along these mistaken paths. There’s often a default assumption that, as the expression goes, with all this poop, there must be a pony somewhere. In addition to all the wasted resources involved in sending people down blind alleys, and in addition to the statistical misconceptions leading to further noisy studies and further mistaken interpretations of data, this sort of default credulity crowds out stronger, more important work, perhaps work by some junior scholar that never gets published in a top 5 journal or whatever because it doesn’t have that B.S. hook.

Remember Gresham’s Law of bad science? Every minute you spend staring at some bad paper, trying to figure out reasons why what they did is actually correct, is a minute you didn’t spend looking at something more serious.

And, yes, I know that I’m giving attention to bad work here, I’m violating my own principles. But we can’t spend all our time writing code. We have to spend some time unit testing and, yes, debugging. I put a lot of effort into doing (what I consider to be) exemplary work, into developing and demonstrating good practices, and into teaching others how to do better. I think it’s also valuable to explore how things can go wrong.

Are the tabloids better than we give them credit for?

Joshua Vogelstein writes:

I noticed you disparage a number of journals quite frequently on your blog.
I wonder what metric you are using implicitly to make such evaluations?
Is it the number of articles that they publish that end up being bogus?
Or the fraction of articles that they publish that end up being bogus?
Or the fraction of articles that get through their review process that end up being bogus?
Or the number of articles that they publish that end up being bogus AND enough people read them and care about them to identify the problems in those articles.

My guess (without actually having any data), is that Nature, Science, and PNAS are the best journals when scored on the metric of fraction of bogus articles that pass through their review process. In other words, I bet all the other journals publish a larger fraction of the false claims that are sent to them than Nature, Science, or PNAS.

The only data I know on it is described here. According to the article, 62% of social-science articles in Science and Nature published from 2010-2015 replicated. An earlier paper from the same group found that 61% of papers from specialty journals published between 2011 and 2014 replicated.

I’d suspect that the fraction of articles on social sciences that pass the review criteria for Science and Nature is much smaller than that of the specialty journals, implying that the fraction of articles that get through peer review in Science and Nature that replicate is much higher than the specialty journals.

My reply: I’ve looked at no statistics on this at all. It’s my impression that social science articles in the tabloids (Science, Nature, PNAS) are, on average, worse than those in top subject-matter journals (American Political Science Review, American Sociological Review, American Journal of Sociology, etc.). But I don’t know.
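To see how Vogelstein's different metrics can come apart, here is a toy calculation. All the inputs are hypothetical: I made up the submission pools and acceptance rates so that the published replication rates land near the 62% and 61% figures he cites. Nothing here is an estimate of what the journals actually do.

```python
def published_replication_rate(n_true, n_false, accept_true, accept_false):
    """Share of published papers that are true, given a submission pool
    and separate acceptance rates for true and false claims."""
    pub_true = n_true * accept_true
    pub_false = n_false * accept_false
    return pub_true / (pub_true + pub_false)


# Hypothetical submission pool, the same for both kinds of journal.
N_TRUE, N_FALSE = 300, 700

# Hypothetical acceptance rates: a very selective journal and a less selective one.
selective = published_replication_rate(N_TRUE, N_FALSE, accept_true=0.10, accept_false=0.026)
specialty = published_replication_rate(N_TRUE, N_FALSE, accept_true=0.50, accept_false=0.137)

print(f"selective journal: {selective:.3f} of published papers replicate")
print(f"specialty journal: {specialty:.3f} of published papers replicate")
```

In this made-up setup the two journals publish nearly the same fraction of replicable papers, even though the selective journal lets through 2.6% of the false submissions and the specialty journal 13.7%. So the journal that looks no better on "fraction of published articles that are bogus" can look much better on "fraction of bogus submissions that get through." But that conclusion depends entirely on the assumed submission pools and acceptance rates, which is exactly the part nobody has data on.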


A computer program can be completely correct, it can be correct except in some edge cases, it can be approximately correct, or it can be flat-out wrong.

A statistical model can be kind of ok but a little wrong, or it can be a lot wrong. Except in some rare cases, it can’t be correct.

An iterative computation such as a Stan fit can have approximately converged, or it can be far from convergence. Except in some rare cases, it will never completely converge.

Where are the famous dogs? Where are the famous animals?

We were having a conversation the other day about famous dogs. There are surprisingly few famous dogs. Then I realized it’s not just that. There are very few famous animals, period.

If you exclude racehorses and the pets of heads of state, these are all the famous animals we could think of:

dogs: Lassie, Rin Tin Tin, Balto
cats: Trim, Grumpy cat, Morris the cat
horses: Clever Hans, Traveller
sheep: Dolly
groundhogs: Punxsutawney Phil
octopuses: Paul
gorillas etc.: Harambe, also that chimp that learned sign language
dolphins: Flipper
cows: Mrs. O’Leary’s
lions: Cecil
elephants: Jumbo
dinosaurs: Sue

That’s only 18. 18! Or 19 if you count Dolly as 2. Just 18 or 19 from the entire animal kingdom. I’m sure we’re missing a few, but still. I wouldn’t have thought that there were so few famous animals (again, not counting racehorses and royal or presidential pets, which I’d consider to be special cases).

P.S. Fictional animals don’t count.

P.P.S. Lots of good suggestions in comments. The #1 missing item above is Laika. You don’t have to believe me on this, but we did discuss Laika in our conversation. It was just my bad to forget to include her when typing up the blog post.

From comments, some others in addition to Laika:

horses: Bucephalus, Incitatus, Mr. Ed
lions: Elsa
gorillas: Koko

Top 5 literary descriptions of poker

Yesterday I wrote about Pocket Kings by Ted Heller, which gives one of the most convincing literary descriptions of poker that I’ve ever read. (Much more so than all those books and articles where the author goes on expense account to compete at the World Series of Poker. I hope to never see that again.)

OK, here’s my list of the best literary descriptions of poker, starting at the top:

1. James Jones, From Here to Eternity. The best ever. An entirely convincing poker scene near the beginning drives the whole plot of this classic novel.

2. Dealer’s Choice, by Patrick Marber. Deemonds!

3. David Spanier, Total Poker. Lots of wonderful stories as well as some poker insight. He wrote some other books about poker that were not so interesting or readable.

4. Frank Wallace, Poker: A guaranteed income for life by using the advanced concepts of poker. I tracked this one down and read it after reading about it in Total Poker. Wallace’s book is pretty much devoid of any intentional literary merit, but I agree with Spanier that on its own terms it’s a kind of outsider-art masterpiece.

5. Ted Heller, Pocket Kings. See my review from yesterday.

That’s it. I can’t think of anything else I’ve read about poker that would be worth mentioning here. Lots of poker manuals which in some cases are well written but I would not say they are particularly interesting to read except for the poker content, and lots of books about poker by serious writers with poker scenes that do not seem at all insightful in any general way. So the above five, that’s all I have to offer.

Am I missing anything that’s worth including in the above list?

P.S. In my first version of this post, I forgot Dealer’s Choice. I added it after Phil reminded me.

Pocket Kings by Ted Heller

So. I’m most of the way through Pocket Kings by Ted Heller, author of the classic Slab Rat. And I keep thinking: Ted Heller is the same as Sam Lipsyte. Do these two guys know each other? They’re both sons of famous writers (OK, Heller’s dad is more famous than Lipsyte’s, but still). They write about the same character: a physically unattractive, mildly talented, borderline unethical shlub from New Jersey, a guy in his thirties or forties who goes through life powered by a witty resentment toward those who are more successful than him. A character who thinks a lot about his wife and about his friends his age, but never his parents or siblings. (A sort of opposite character from fellow Jerseyite Philip Roth / Nathan Zuckerman, whose characters tended to be attractive, suave, and eternally focused on the families of their childhoods. Indeed, the Heller/Lipsyte character is the sort of irritating pest who Roth/Zuckerman is always trying to shake off.)

It’s hard for me to see how Ted Heller and Sam Lipsyte can coexist in the same universe, but there you have it. One thing I don’t quite understand is the age difference: Lipsyte was born in 1968, which makes sense given the age of his characters, but Heller was born twelve years earlier, which makes him a decade or two older than the protagonist of Pocket Kings. That’s ok, of course—no requirement that an author write about people his or her own age—still, it’s a bit jarring to me to think about in the context of these particular authors, who seem so strongly identified with this particular character type.

One more thing. With their repeated discussions of failure, fear of failure, living with failure, etc., these books all seem to be about themselves, and their authors’ desire for success and fears of not succeeding.

Some works of art are about themselves. Vermeer making an incredibly detailed painting of a person doing some painstaking task. Titanic being the biggest movie of all time, about the biggest ship of all time. Primer being a low-budget, technically impressive movie about some people who build a low-budget time machine. Shakespeare with his characters talking about acting and plays. And the Heller/Lipsyte oeuvre.

I feel like a lot of these concerns are driven by economics. What with iphones and youtube and all these other entertainment options available, there’s not so much room for books. In Pocket Kings, Heller expresses lots of envy and resentment toward successful novelists such as Gary Shteyngart and everybody’s favorite punching bag, Jonathan Franzen—but, successful as these dudes are, I don’t see them as having the financial success or cultural influence of comparable authors in earlier generations. There’s less room at the top, or even at the middle.

And, as we’ve discussed before, it doesn’t help professional writers that there are people like me around, publishing my writing every day on the internet for free.

Back to Pocket Kings. It’s not a perfect book. The author pushes a bit hard on the jokes at times. But it’s readable, and it connects to some deep ideas—or, at least, ideas that resonate deeply with me.

It’s giving nothing away to say that the book’s main character plays online poker as an escape from his dead-end life, and then he’s living two parallel lives, which intersect in various ways. He’s two different people! But this is true of so many of us, in different ways. We play different roles at home and at work. And, for that matter, when we read a novel, we’re entering a different world. Reading about this character’s distorted life made me question my own preference for reading books and communicating asynchronously (for example, by blogging, which is the ultimate in asynchronous communication, as I’m writing this in August to appear in January). Face-to-face communication can take effort! There must be a reason that so many people seem to live inside their phones. In that sense, Pocket Kings, published in 2012, was ahead of its time.

Some Westlake quotes

Clint Johns writes:

I’m a regular visitor to your blog, so I thought you might be interested in this link. It’s a relatively recent article (from 7/12) about Donald Westlake and his long career. For my money, the best part of it is the generous number of Westlake quotations from all sorts of places, including interviews as well as his novels. There are lots of writers who can turn a phrase, but Westlake was in a class by himself (or maybe with just a few others).

The Westlake quotes are good, but my favorite for these sorts of quotes is still George V. Higgins.

Graphs of school shootings in the U.S.

Bert Gunter writes:

This link is to an online CNN “analysis” of school shootings in the U.S. I think it is a complete mess (you may disagree, of course).

The report in question is by Christina Walker and Sam Petulla.

Gunter lists two problems:

1. Graph labeled “Race Plays A Factor in When School Shootings Occur”:
AFAICT, they are graphing number of casualties vs. time of shooting. But they should be graphing the number of shootings vs time; in fact, as they should be comparing incident *rates* vs time by race, they should be graphing the proportion of each category of schools that have shooting incidents vs time (I of course ignore more formal statistical modeling, which would not be meaningful for a mass market without a good deal of explanatory work).

2. Graph of “Shootings at White Schools Have More Casualties”:
The area of the rectangles in the graph appears to be proportional to the casualties per incident but with both different lengths and widths, it is not possible to glean clear information by eye (for me anyway). And aside from the obvious huge 3 or 4 largest incidents in the White Majority schools, I do not see any notable differences by category. Paraphrasing Bill Cleveland, the graph is a puzzle to be deciphered: it appears to violate most of the principles of good graphics.

Moreover, it is not clear that casualties per incident is all that meaningful anyway. Maybe White schools involved in shootings just have more students so that it’s easier for a shooter to amass more casualties.

The “appropriate” analysis is: “Most school shootings everywhere involve 1 or 2 people, except for a handful of mass shootings at White schools. The graph is a deliberate attempt to mislead, not just merely bad.”

Unfortunately, as you are well aware, due to intense competition for viewer eyeballs, both formerly only print (NYT, WSJ, etc.) and purely online news media are now full of such colorful, sometimes interactive, and increasingly animated data analyses whose quality is, ummm… rather uneven. So impossible to discuss statistical deficiencies and the possible political/sociological consequences of such mass media data analytical malfeasance in it all.

My reply:

I think the report is pretty good. Sure, some of the graphs don’t present data patterns so clearly, but as Antony Unwin and I wrote a few years ago, infovis and statistical graphics have different goals and different looks. In this case, I think these are the main messages being conveyed by these plots:
– There have been a lot of school shootings in the past decade.
– They’ve been happening all over the place, at all different times and to all different sorts of students.
– This report is based on real data that the researchers collected.
Indeed, at the bottom of the report they provide a link to the data on Github.

Regarding Gunter’s points 1 and 2 above, sure, there are other ways of analyzing and graphing the data. But (a) I don’t see why he says the graph is a deliberate attempt to mislead, and (b) I think the graphs are admirably transparent.

Consider for example the first two graphs in the report, here:

and here:

Both these graphs have issues, and there are places where I would’ve made different design choices. For example, I think the color scheme is confusing in that the same palette is used in two different ways, also I think it’s just wack to make three different graphs for early morning, daytime, and late afternoon and evening (and to compress the time scales for some of these). Also a mistake to compress Sat/Sun into one day: distorting the scale obscures the data. Instead, they could simply have rotated that second graph 90 degrees, running day of week down from Monday to Sunday on the vertical axis and time of day from 00:00 to 24:00 on the horizontal axis. One clean graph would then display all the shootings and their times.

The above graph has a problem that I see a lot in data graphics, and in statistical analysis more generally, which is that it is overdesigned. The breaking up into three graphs, the distortion of the hour and day scales, the extraneous colors (which convey no information, as time is already indicated by position on the plot) all just add confusion and make a simple story look more complicated.

So, sure, the graphs are not perfect. Which is no surprise. We all have deadlines. My own published graphs could be improved too.

The thing I really like about the graphs in Walker and Petulla’s report is that they are so clearly tied to the data. That’s important.

If someone were to do more about this, I think the next step would be to graph shootings and other violent crimes that occur outside of schools.

In Bayesian inference, do people cheat by rigging the prior?

Ulrich Atz writes in with a question:

A newcomer to Bayesian inference may argue that priors seem sooo subjective and can lead to any answer. There are many counter-arguments (e.g., it’s easier to cheat in other ways), but are there any pithy examples where scientists have abused the prior to get to the result they wanted? And if not, can we rely on this absence of evidence as evidence of absence?

I don’t know. It certainly could be possible to rig an analysis using a prior distribution, just as you can rig an analysis using data coding or exclusion rules, or by playing around with what variables are included in a least-squares regression. I don’t recall ever actually seeing this sort of cheatin’ Bayes, but maybe that’s just because Bayesian methods are not so commonly used.

I’d like to believe that in practice it’s harder to cheat using Bayesian methods because Bayesian methods are more transparent. If you cheat (or inadvertently cheat using forking paths) with data exclusion, coding, or subsetting, or setting up coefficients in a least squares regression, or deciding which “marginally significant” results to report, that can slip under the radar. But the prior distribution—that’s something everyone will notice. I could well imagine that the greater scrutiny attached to Bayesian methods makes it harder to cheat, at least in the obvious way by using a loaded prior.
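To make the idea of a loaded prior concrete, here is a minimal sketch using the textbook conjugate normal model with known standard error. The data summary and both priors are invented for illustration; the function name is mine.

```python
def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Posterior mean and sd for a normal prior combined with a normal
    likelihood summarized by an estimate and its standard error."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / std_error**2
    post_var = 1.0 / (prior_precision + data_precision)
    # Posterior mean is a precision-weighted average of prior mean and estimate.
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, post_var**0.5


# Hypothetical data summary: estimate 0.2 with standard error 0.2.
estimate, std_error = 0.2, 0.2

weak = normal_posterior(0.0, 10.0, estimate, std_error)    # weak prior
loaded = normal_posterior(1.0, 0.05, estimate, std_error)  # prior rigged toward 1.0

print(f"weak prior:   mean {weak[0]:.3f}, sd {weak[1]:.3f}")
print(f"loaded prior: mean {loaded[0]:.3f}, sd {loaded[1]:.3f}")
```

The rigged prior does drag the answer to where the cheater wants it. But to get there, the analyst has to write down something like normal(1, 0.05) in plain sight, and any reader can ask where that came from.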

American Causal Inference May 2020 Austin Texas

Carlos Carvalho writes:

The ACIC 2020 website is now up and registration is open.

As a reminder, proposal information can be found on the front page of the website.
Deadline for submissions is February 7th.

I think that we organized the very first conference in this series here at Columbia, many years ago!

Is it accurate to say, “Politicians Don’t Actually Care What Voters Want”?

Jonathan Weinstein writes:

This was a New York Times op-ed today, referring to this working paper. I found the pathologies of the paper to be worth an extended commentary, and wrote a possible blog entry, attached. I used to participate years ago in a shared blog at Northwestern, “Leisure of the Theory Class,” but nowadays I don’t have much of a platform for this.

The op-ed in question is by Joshua Kalla and Ethan Porter with title, “Politicians Don’t Actually Care What Voters Want,” and subtitle, “Does that statement sound too cynical? Unfortunately, the evidence supports it.” The working paper, by the same authors, is called, “Correcting Bias in Perceptions of Public Opinion Among American Elected Officials: Results from Two Field Experiments,” and begins:

While concerns about the public’s receptivity to factual information are widespread, much less attention has been paid to the factual receptivity, or lack thereof, of elected officials. Recent survey research has made clear that U.S. legislators and legislative staff systematically misperceive their constituents’ opinions on salient public policies. We report results from two field experiments designed to correct misperceptions of sitting U.S. legislators. The legislators (n=2,346) were invited to access a dashboard of constituent opinion generated using the 2016 Cooperative Congressional Election Study. Here we show that despite extensive outreach efforts, only 11% accessed the information. More troubling for democratic norms, legislators who accessed constituent opinion data were no more accurate at perceiving their constituents’ opinions. Our findings underscore the challenges confronting efforts to improve the accuracy of elected officials’ perceptions and suggest that elected officials may be more resistant to factual information than the mass public.

Weinstein’s criticism of the Kalla and Porter article is here, and this is Weinstein’s main point:

The study provided politicians with data on voters’ beliefs, and attempted to measure changes in the politicians’ perception of these beliefs. No significant effects were found. But there are always many possible explanations for null results! The sensational, headlined explanation defies common sense and contradicts other data in the paper itself, while other explanations are both intuitive and supported by the data.


The authors claim that the study is “well-powered,” suggesting an awareness of the issue, but they do not deal with it adequately, say by displaying confidence intervals and arguing that they prove the effect is small. It is certainly not obvious that a study in which only 55 of 2,346 potential subjects complied with all phases is actually well-powered.

My reaction to all this was, as the social scientists say, overdetermined. That is, the story had a bunch of features that might incline me to take one view or another:

1. Weinstein contacted me directly and said nice things about this blog. +1 for the criticism. A polite email shouldn’t matter, but it does.

2. Weinstein’s an economist, Kalla and Porter are political scientists and the topic of the research is politics. My starting point is to assume that economists know more about economics, political scientists know more about politics, sociologists know more about sociology. So +1 for the original paper.

3. On the substance, there’s some work by Lax and Phillips on congruence of political attitudes and legislative positions. The summary of this work is that public opinion does matter to legislators. So +1 for the criticism. On the other hand, public opinion is really hard to estimate. Surveys are noisy, there’s lots of conflicting information out there, and I could well believe that, in many cases, even if legislators would like to follow public opinion, it wouldn’t make sense for them to do much with it. So +1 for the original paper.

4. The sample size of 55, that seems like an issue, and I think we do have to worry about claims of null effects based on not seeing any clear pattern in noisy data. So +1 for the criticism.

5. The paper uses Mister P to estimate state-level opinion. +1 for the paper.

And . . . all the pluses balance out! I don’t know what I’m supposed to think!

Also, I don’t know any of these people—I don’t think that at the time of this writing [July 2019] I’ve ever even met them. None of this is personal. Actually, I think my reactions would be pretty similar even if I did know some of these people. I’m willing to criticize friends’ work and to praise the work of people I dislike or don’t know personally.

Anyway, my point in this digression is not that it’s appropriate to evaluate research claims based on these sorts of indirect arguments, which are really just one step above attitudes of the form, “Don’t trust that guy’s research, he’s from Cornell!”—but rather to recognize that it’s inevitable that we will have some reactions based on meta-data, and I think it’s better to recognize these quasi-Bayesian inferences that we are doing, even if for no better reason than to avoid over-weighting them when drawing our conclusions.

OK, back to the main story . . . With Weinstein’s permission, I sent his criticisms to Kalla and Porter, who replied to Weinstein’s 3-page criticism with a 3-page defense, which makes the following key point:

His criticisms of the paper, however, do not reflect exposure to relevant literature—literature that makes our results less surprising and our methods more defensible . . .

Since Miller and Stokes (1963), scholars have empirically studied whether elected officials know what policies their constituents want. Recent work in political science has found that there are systematic biases in elite perceptions that suggest many state legislators and congressional staffers do not have an accurate assessment of their constituents’ views on several key issues. . . . Hertel-Fernandez, Mildenberger and Stokes (2019) administer surveys on Congressional staff and come to the same conclusion. . . . elected officials substantially misperceive what their constituents want. The polling that does take place in American politics either is frequently devoid of any issue content (horserace polling) or is devised to develop messages to distract and manipulate the mass public, as documented in Druckman and Jacobs (2015). Contrary to Professor Weinstein’s description, our results are far from “bizarre,” given the state of the literature.

Regarding the small sample size and acceptance of the null, Kalla and Porter write:

Even given our limited sample size, we do believe that our study is sufficiently well-powered to demonstrate that this null is normatively and politically meaningful . . . our study was powered for a minimal detectable effect of a 7 percentage point reduction in misperceptions, where the baseline degree of misperception was 18 percentage points in the control condition.
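To get a feel for what that power statement implies, here is a minimal sketch. It assumes a conventional two-sided 5% test and 80% power, neither of which is stated in the quoted passage, and it uses the usual normal approximation. The function name `implied_standard_error` is made up for illustration.

```python
from statistics import NormalDist


def implied_standard_error(mde, alpha=0.05, power=0.80):
    """Standard error implied by a minimal detectable effect (MDE).

    Normal-approximation rule for a two-sided test:
        MDE = (z_{1 - alpha/2} + z_{power}) * SE
    alpha and power are assumed defaults, not values from the paper.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return mde / multiplier


# Figures quoted in the response, in percentage points.
mde_points = 7.0
baseline_points = 18.0

se_points = implied_standard_error(mde_points)   # under the assumed alpha and power
relative_mde = mde_points / baseline_points      # share of baseline misperception

print(f"implied standard error: {se_points:.1f} percentage points")
print(f"MDE as a share of baseline misperception: {relative_mde:.0%}")
```

Under those assumptions the implied standard error is about 2.5 percentage points, and the minimal detectable effect is about 39% of the baseline misperception. Whether a design with 55 full compliers can deliver a standard error that small is exactly what the criticism is asking.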

So there you have it. In summary:
Research article
Response to criticism

I appreciate the behavior of all the researchers here. Kalla and Porter put their work up on the web for all to read. Weinstein followed up with a thoughtful criticism. Harsh, but thoughtful and detailed, touching on substance as well as method. Kalla and Porter used the criticism as a way to clarify issues in their paper.

What do I now think about the underlying issues? I’m not sure. Some of my answer would have to depend on the details of Kalla and Porter’s design and data, and I haven’t gone through all that in detail.

(To those of you who say that I should not discuss a paper that I’ve not read in full detail, I can only reply that this is a ridiculous position to take. We need to make judgments based on partial information. All. The. Time. And one of the services we provide on this blog is to model such uncertain reactions, to take seriously the problem of what conclusions should be drawn based on the information available to us, processed in available time using available effort.)

But I can offer some more general remarks on the substantive question given in the title of this post. My best take on this, given all the evidence I’ve seen, is that it makes sense for politicians to know where their voters stand on the issues, but that information typically isn’t readily available. At this point, you might ask why politicians don’t do more local polling on issues, and I don’t know—maybe they do—but one issue might be that, when it comes to national issues, you can use national polling and approximately adjust using known characteristics of the district compared to the country, based on geography, demographics, etc. Also, what’s typically relevant is not raw opinion but some sort of average, weighted by likelihood to vote, campaign contributions, and so forth.
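The adjustment described above can be sketched in a few lines. This is a crude reweighting, not Mister P: it assumes opinion within each demographic cell is the same in the district as in the country, and it treats likelihood to vote as an extra weight. The function `district_estimate`, the two cells, and all the numbers are hypothetical, chosen only to show the arithmetic.

```python
def district_estimate(national_support, district_share, turnout=None):
    """Reweight national cell-level support to a district's composition.

    national_support: dict cell -> proportion supporting a policy nationally
    district_share:   dict cell -> share of the district's adults in that cell
    turnout:          optional dict cell -> probability of voting (extra weight)
    """
    weights = {}
    for cell, share in district_share.items():
        vote_prob = turnout[cell] if turnout is not None else 1.0
        weights[cell] = share * vote_prob
    total = sum(weights.values())
    return sum(weights[c] * national_support[c] for c in weights) / total


# Made-up numbers, purely illustrative (not from any survey).
national_support = {"under_45": 0.60, "45_plus": 0.40}
district_share = {"under_45": 0.30, "45_plus": 0.70}
turnout = {"under_45": 0.40, "45_plus": 0.70}

raw = district_estimate(national_support, district_share)
likely_voters = district_estimate(national_support, district_share, turnout)

print(f"all adults:    {raw:.2f}")
print(f"likely voters: {likely_voters:.2f}")
```

With these toy inputs the estimate for all adults is 0.46 and the turnout-weighted estimate is about 0.44, which illustrates how the number a politician cares about can differ from raw district opinion.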

I guess what I’m saying is that I don’t see a coherent story here yet. This is not meant as a criticism of Kalla and Porter, who must have a much better sense of the literature than I do, but rather to indicate a difficulty in how we think about the links between public opinion and legislator behavior. I don’t think it’s quite that “Politicians Don’t Actually Care What Voters Want”; it’s more that politicians don’t always have a good sense of what voters want, politicians aren’t always sure what they would do with that information if they had it, and whatever voters think they want is itself inherently unstable and does not always exist independent of framing. As Jacobs and Shapiro wrote, “politicians don’t pander.” They think of public opinion as a tool to get what they want, not as some fixed entity that they have to work around.

These last comments are somewhat independent of whatever was in Kalla and Porter’s study, which doesn’t make that study irrelevant to our thinking; it just implies that further work is needed to connect these experimental results to our larger story.

Call for proposals for a State Department project on estimating the prevalence of human trafficking

Abby Long points us to this call for proposals for a State Department project on estimating the prevalence of human trafficking:

The African Programming and Research Initiative to End Slavery (APRIES) is pleased to announce a funding opportunity available through a cooperative agreement with the U.S. Department of State, Office to Monitor and Combat Trafficking in Persons (J/TIP) with the following two aims:

1. To document the robustness of various methodological approaches in human trafficking prevalence research.
2. To identify and build the capacity of human trafficking teams in the design, testing, and dissemination of human trafficking prevalence data.

To achieve these aims, we are seeking strong research teams to apply at least two methods of estimating human trafficking prevalence in a selected hot spot and sector outside the United States.*

View the full call for proposals of the Prevalence Reduction Innovation Forum (PRIF).

Application deadline: March 4, 2020, 5:00 PM Eastern Standard Time. Please submit full proposals to (strongly preferred) with the subject line “PRIF proposal” or mail to the address indicated below by this deadline. Late submissions will not be accepted.

Dr. Lydia Aletraris, Project Coordinator
African Programming and Research Initiative to End Slavery
School of Social Work, Room 204
279 Williams Street,
Athens GA, 30602, USA

Award notification: April 2020

Questions prior to the deadline may be submitted via email. Use the subject line “PRIF questions.”

Award Amount: $200,000-$450,000. Only in exceptional circumstances might a higher budget be considered for funding.

Eligibility: Nonprofit organizations in or outside of the United States, including universities, other research organizations, NGOs, INGOs are eligible to apply. Government agencies and private entities are not eligible to apply.