
Merlin did some analysis of possible electoral effects of rejections of vote-by-mail ballots . . .

Elliott writes:

Postal voting could put America’s Democrats at a disadvantage: Rejection rates for absentee ballots have fallen since 2016, but are higher for non-whites than whites

The final impact of a surge in postal voting will not be known until weeks after the election. Yet North Carolina, a closely contested state, releases detailed data on ballots as they arrive. So far, its figures suggest that a tarnished election is unlikely—but that Democrats could be hurt by their disproportionate embrace of voting by mail. . . .

The Tar Heel state has received eight times as many postal votes as it had by this point in 2016. Despite fears about first-time absentee voters botching their ballots, the share that are rejected has in fact fallen to 1.3%, from 2.6% in 2016. This is probably due in part to campaigns educating supporters on voting by mail, and also to new efforts by the state to process such ballots.

However, these gains have been concentrated among white and richer voters, causing North Carolina’s already large racial gap in rejection rates to widen. In 2016 black voters sent in 10% of postal ballots, but 18% of discarded ones. This year, those shares are 17% and 42%. That hurts Democrats, who rely on black voters’ support. . . .

Partisan differences over voting by mail exacerbate this effect. In the past, Democrats and Republicans were equally likely to do so. But polling by YouGov now shows that 51% of likely Democratic voters plan to vote absentee, compared with 32% of Republicans. Extrapolating North Carolina’s patterns nationwide, a model built by Merlin Heidemanns of Columbia University finds that 0.7% of ballots intended for Joe Biden, the Democrats’ presidential nominee, will be rejected postal votes, versus 0.3% of those cast for Donald Trump. . . .
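To see how much the racial gap widened, the shares quoted above can be turned into approximate rejection rates. Here is a minimal Python sketch, assuming the overall rejection rates (2.6% in 2016, 1.3% so far this year) apply to the same pool of ballots as the quoted shares. The function name is made up for illustration, and the split is black voters versus everyone else, which is coarser than the article's own breakdown.

```python
def group_rejection_rates(overall_rate, ballot_share, rejected_share):
    """Approximate rejection rates for a group and for everyone else.

    overall_rate   -- fraction of all postal ballots rejected
    ballot_share   -- the group's share of all postal ballots
    rejected_share -- the group's share of rejected postal ballots
    """
    group_rate = overall_rate * rejected_share / ballot_share
    rest_rate = overall_rate * (1 - rejected_share) / (1 - ballot_share)
    return group_rate, rest_rate


# Figures as quoted in the article (assumed to refer to the same set of ballots).
figures = {
    "2016": (0.026, 0.10, 0.18),
    "2020": (0.013, 0.17, 0.42),
}

for year, (overall, ballots, rejected) in figures.items():
    black_rate, rest_rate = group_rejection_rates(overall, ballots, rejected)
    print(f"{year}: black voters {black_rate:.1%}, everyone else {rest_rate:.1%}, "
          f"ratio {black_rate / rest_rate:.1f}")
```

Under these assumptions the rejection rate fell for both groups, but much more for everyone else than for black voters, so the ratio of the two rates goes from roughly 2 to roughly 3.5.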

Kyle Hausmann saw the above-linked article and asked if we had any thoughts on how impactful that might be on the election outcome. He also asked “whether or not trends in ballot rejection might already be implicitly baked into your Economist forecast, simply by virtue of the rejected ballots not being included in the historical voter count data.”

Merlin replied:

Elliott and I expect the overall number of rejected ballots to be higher and for Democrats to be disproportionately negatively affected because rejection rates are larger within the groups that tend to vote for them and because they are more likely to vote absentee to begin with. While an equity issue, we don’t expect this to meaningfully affect the outcome of the election given that it primarily affects states that are safely Democrat aside from one or two that are mentioned in the article. I did some work on this for USA Today and did some further exploratory analysis of the NC data here and some raw number rejections by ethnic group based on 2016 data here.

It’s not baked into our forecast because vote-by-mail numbers will be at a historic high this year.

Interactive analysis needs theories of inference

Jessica Hullman and I wrote an article that begins,

Computer science research has produced increasingly sophisticated software interfaces for interactive and exploratory analysis, optimized for easy pattern finding and data exposure. But assuming that identifying what’s in the data is the end goal of analysis misrepresents strong connections between exploratory and confirmatory analysis and contributes to shallow analyses. We discuss how the concept of a model check unites exploratory and confirmatory analysis, and review proposed Bayesian and classical statistical theories of inference for visual analysis in light of this view. Viewing interactive analysis as driven by model checks suggests new directions for software, such as features for specifying one’s intuitive reference model, including built-in reference distributions and graphical elicitation of parameters and priors, during exploratory analysis, as well as potential lessons to be learned from attempting to build fully automated, human-like statistical workflows.

Jessica provides further background:

Tukey’s notion of exploratory data analysis (EDA) has had a strong influence on how interactive systems for data analysis are built. But the assumption has generally been that exploratory analysis precedes model fitting or checking, and that the human analyst can be trusted to know what to do with any patterns they find. We argue that the symbiosis of analyst and machine that occurs in the flow of exploratory and confirmatory statistical analysis makes it difficult to make progress on this front without considering what’s going on, and what should go on, in the analyst’s head. In the rest of the paper we do the following:

– We point out ways that optimizing interactive analysis systems for pattern finding and trusting the user to know best can lead to software that conflicts with goals of inference. For example, interactive systems like Tableau default to aggregating data to make high level patterns more obvious but this diminishes some people’s acknowledgment of variation. Researchers evaluate interactive visualizations and systems based on how well people can read data, how much they like using the system, or how evenly they distribute their attention across data, not how good their analysis or decisions are. Various algorithms for progressive computation or privacy preservation treat the dataset as though it is an object of inherent interest without considering its use in inference.

– We propose that a good high level understanding frames interactive visual analysis as driven by model checks. The idea is that when people are “exploring” data using graphics, they are implicitly specifying and fitting pseudo-statistical models, which produce reference distributions to compare to data. This makes sense because the goal of EDA is often described in terms of finding the unexpected, but what is unexpected is only defined via some model or view of how the world should be. In a Bayesian formulation (following Gelman 2003, 2004, our primary influence for this view), the reference distribution is produced by the posterior predictive distribution. So looking at graphics is like doing posterior predictive checks, where we are trying to get a feel for the type and size of discrepancies so we can decide what to do next. We like this view for various reasons, including because (1) it aligns with the way that many exploratory graphics get their meaning from implicit reference distributions, like residual plots or Tukey’s “hanging rootograms”; (2) it allows us to be more concrete about the role prior information can play in how we examine data; and (3) it suggests that to improve tools for interactive visual analysis we should find ways to make the reference models more explicit so that our graphics better exploit our abilities to judge discrepancies, such as through violations of symmetry, and the connection between exploration and confirmation is enforced.

– We review other proposed theories for understanding graphical inference: Bayesian cognition, visual analysis as implicit hypothesis testing, multiple comparisons. The first two can be seen as subsets of the Bayesian model checking formulation, and so can complement our view.

– We discuss the implications for designing software. While our goal is not to lay out exactly what new features should be added to systems like Tableau, we discuss some interesting ideas worth exploring more, like how the user of an interactive analysis system could interact with graphics to sketch their reference distribution or make graphical selections of what data they care about and then choose between options for their likelihood, specify their prior graphically, see draws from their model, etc. The idea is to brainstorm how our usual examinations of graphics in exploratory analysis could more naturally pave the way for increasingly sophisticated model specification.

– We suggest that by trying to automate statistical workflows, we can refine our theories. Sort of like the saying that if you really want to understand some topic, you should teach it. If we were to try to build an AI that can do steps like identify model misfits and figure out how to improve the model, we’d likely have more ideas about what sorts of features our software should offer people.

– We conclude with the idea that the framing of visualization as model checking relates to ideas we’ve been thinking about recently regarding data graphics as narrative storytelling.

P.S. Ben Bolker sent in the above picture of Esmer helping out during a zoom seminar.

Follow-up on yesterday’s posts: some maps are less misleading than others.

Yesterday I complained about the New York Times coronavirus maps showing sparsely-populated areas as having a case rate very close to zero, no matter what the actual rate is. Today the Times has a story about the fact that the rate in rural areas is higher than in more densely populated areas, and they have maps that show the rate in sparsely populated areas! 

I’m not sure what is going on with these choices. It does make sense to me to show only rural areas if you are doing a story on the case rate in rural areas, and it would make sense to me to show only urban areas if you were doing a story on the case rate in urban areas, but neither of these make sense to me as a country-wide default. (It’s also a bit strange to me that they changed the scale, showing average cases per million on the new plot with numbers up to about 800; while showing average cases per 100,000 on the other plot, with numbers up to about 64, which is 640 per million. These are not wildly different and could work fine on the same scale.)

I could imagine leaving some areas blank if there are literally no permanent residents there — National Wilderness and National Forest, for instance — but if they are going to do that, they should not use the same color for ‘zero population density’ that they use for ‘zero coronavirus case rate’. These mean different things. That’s what I really dislike about the other plot: the same color is used for low-population areas, independent of the rate. Everywhere else on the map the color means “rate”, and then there are these huge sections where the color means “population density.” On this one, at least they use different colors for the places where they aren’t showing us the data (white) and where the rate is low (gray). So, of the two, this one is better. But I think they should just combine the two plots.

“Election Forecasting: How We Succeeded Brilliantly, Failed Miserably, or Landed Somewhere in Between”

I agreed to give a talk in December for Jared, and this is what I came up with:

Election Forecasting: How We Succeeded Brilliantly, Failed Miserably, or Landed Somewhere in Between

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Several months before the election we worked with The Economist magazine to build a presidential election forecasting model combining national polls, state polls, and political and economic fundamentals. This talk will go over how the forecast worked, our struggles in evaluating and improving it, and more general challenges of communicating data-based forecasts. For some background, see this article.

Actually, the title is misleading. Our model could fail miserably (for example, if Joe Biden wins Alabama, which we say has less than a 1% chance of happening) or it could land somewhere in between (for example, if Biden wins the electoral college but with just 51% of the popular vote, which is at the edge of our forecast interval) but it can’t really succeed brilliantly. Even if our model “correctly predicts 49 states” or whatever, that’s as much luck as anything else, as our estimates have margins of error. That’s one reason why, many years ago, my colleague and I decided not to put more effort into election forecasting: it’s a game where you can’t win big but you can lose big.

Anyway, I’ll be able to say more about all this in a couple weeks.

An odds ratio of 30, which they (sensibly) don’t believe

Florian Wickelmaier and Katharina Naumann write:

In a lab course, we came across a study on the influence of “hemispheric activation” on the framing effect in decision making by Todd McElroy and John J. Seta [Brain and Cognition 55 (2004) 572-580, doi:10.1016/j.bandc.2004.04.002]:

Two experiments were conducted to determine whether the functional specializations of the left and the right hemispheres would produce different responses to a traditional framing task. In Experiment 1, a behavioral task of finger tapping was used to induce asymmetrical activation of the respective hemispheres. In Experiment 2, a monaural listening procedure was used. In both experiments, the predicted results were found. Framing effects were found when the right hemisphere was selectively activated whereas they were not observed when the left hemisphere was selectively activated.

Two aspects of this study reminded us of recurring topics in your blog. [No, it was not cats, John Updike, or Jamaican beef patties; sorry! — AG]

Use of buzzwords: Why call it “hemispheric activation” when what participants did in Exp. 1 was tapping with their left versus right hand? This is a bit like saying “upper-body strength” instead of fat arms.

Unrealistic effect size: A 30-fold increased framing effect when tapping with your left hand (“right hemisphere activated”) sounds like a lot. Even the foreign-language researchers claimed only a 2-fold increase ( agree-with-the-view-that-being-convinced-an-effect-is-real-relieves-a-researcher-from-statistically-testing-it/). Maybe it was a combination of low power and selection by significance that rendered so large an effect?

Here are the original data (Tab. 1):

        right-hand tapping     left-hand tapping
        safe     risky         safe     risky
gain    8        4             12       1
loss    7        4             3        9

With right-hand tapping, the odds ratio is 8/4/(7/4) = 1.1 (no framing effect). With left-hand tapping, it is 12/1/(3/9) = 36. So the ratio of odds ratios is about 30.

We asked our students to try to replicate the experiment. We used an Edlin factor of about 0.1 for sample size calculation. Our data are 52/31/(26/57) = 3.7 with right-hand tapping and 56/27/(30/53) = 3.7 with left-hand tapping. The effect has vanished in the larger sample.

We think this makes a useful teaching example as it illustrates the now well-known limitations of a small-scale study with flashy results. We also see some progress because students increasingly become aware of these limitations and get the chance to learn how to avoid them in the future.
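The odds ratios quoted above are easy to check. Here is a short Python sketch using the counts from the original table and from the replication. The function names are made up for illustration, and the standard error is the usual large-sample approximation for a log odds ratio, which is rough when a cell count is as small as 1.

```python
import math


def odds_ratio(gain_safe, gain_risky, loss_safe, loss_risky):
    """Odds of the safe choice under gain framing, divided by the same odds under loss framing."""
    return (gain_safe / gain_risky) / (loss_safe / loss_risky)


def se_log_odds_ratio(*cells):
    """Large-sample standard error of a log odds ratio: sqrt of the sum of reciprocal counts."""
    return math.sqrt(sum(1 / c for c in cells))


# Counts in the order (gain safe, gain risky, loss safe, loss risky).
studies = {
    "original": {"right": (8, 4, 7, 4), "left": (12, 1, 3, 9)},
    "replication": {"right": (52, 31, 26, 57), "left": (56, 27, 30, 53)},
}

for name, hands in studies.items():
    or_right = odds_ratio(*hands["right"])
    or_left = odds_ratio(*hands["left"])
    # The two hands are separate groups, so the variances of the log odds ratios add.
    se_log_ratio = math.sqrt(se_log_odds_ratio(*hands["right"]) ** 2
                             + se_log_odds_ratio(*hands["left"]) ** 2)
    print(f"{name}: OR right = {or_right:.1f}, OR left = {or_left:.1f}, "
          f"ratio = {or_left / or_right:.1f}, se of log ratio = {se_log_ratio:.2f}")
```

The standard error on the log scale is much larger for the original study than for the replication, which is the small-sample problem in one number.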

This reminds me of the 50 shades of gray study.

I agree that it’s good for students to be able to do these replication experiments themselves. Also good that we can start with a default skepticism about such claims, rather than having to first find some major problem in the study. Attention given to pizzagate-like irregularities should not distract us from the larger problem of hardworking scientists using bad research methods and getting bad conclusions. Remember, honesty and transparency are not enuf.

All maps of parameter estimates are (still) misleading

I was looking at this map of coronavirus cases, pondering the large swaths with seemingly no cases. I moused over a few of the gray areas. The shading is not based on counties, as I assumed, but on some other spatial unit, perhaps zip codes or census blocks or something. (I’m sure the answer is available if I click around enough).  Thing is, I doubt that all of the cases in the relatively low-population areas in the western half of the country are concentrated in those little shaded areas. I suspect those are where the tests are performed, or similar, not the locations of the homes of the infected people. [Added later: Carlos Ungil points out that there was indeed a link, just below the map, that says “For per capita: Parts of a county with a population density lower than 10 people per square mile are not shaded.”] 

I’m well aware that all maps of parameter estimates are misleading (one of my favorite papers), but I think the way in which this map is misleading may be worse than some of the alternatives, such as coloring the entire county. Yes, coloring the whole county would give a false impression of spatial uniformity for some of those large counties, but I think that’s better than the current false impression of zero infection rates in a large swath of the country. In terms of cases per 100,000 Nevada is much worse than Ohio but it sure doesn’t look like that on the map. [Note: I originally said ‘Illinois’ but either that was a mistake, pointed out by Carlos Ungil, or it changed when the map was updated in the past hour].  


Many western states appear to have low case rates but actual rates are not low

Hiring at all levels at Flatiron Institute’s Center for Computational Mathematics

We’re hiring at all levels at my new academic home, the Center for Computational Mathematics (CCM) at the Flatiron Institute in New York City.

We’re going to start reviewing applications January 1, 2021.

A lot of hiring

We’re hoping to hire many people for each of the job ads. The plan is to grow CCM from around 30 people to around 60, which will involve hiring 20 or 30 more postdocs and permanent research staff over the next few years!!! Most of those hires are going to be on the machine learning and stats side.

What we do

I’m still working on Stan and computational stats and would like to hire some more people to work on computational stats and probabilistic programming. There’s also lots of other fascinating work going on in the center, including equivariant neural networks for respecting physical constraints, spike sorting for neural signal classification, Markov chain Monte Carlo for molecular dynamics, phase retrieval (Fourier inverse problem) for cryo electron microscopy, and lots of intricate partial differential equation solvers being developed for challenging problems like cellular fluid dynamics or modeling light flow in the visual system of a wasp.

Plus, there’s a lot of communication across centers. There are Stan users in both the astro and bio centers, working on things like the LIGO project for gravitational waves, ocean biome factor modeling, and protein conformation estimation. If you like science, stats, and computation, it’s a great place to hang out.

The mission

The Flatiron Institute is unusual for an academic institution in that it’s focused squarely on computation.

The mission of the Flatiron Institute is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation.

The motivation remains to fill a gap in scientific software development that’s not supported well by research grants, academia, or industry.

The job listings

The job listings are for two postdoc-level positions (called “fellows”) and an open-rank faculty-level position (your choice of a “research scientist” or “data scientist” title). Please keep in mind that we’re hiring for all of these positions in bulk over the next few years across the range of center interests.

If you’re interested in comp stats and are going to apply, please drop me a line directly at

“Model takes many hours to fit and chains don’t converge”: What to do? My advice on first steps.

The above question came up on the Stan forums, and I replied:

Hi, just to give some generic advice here, I suggest simulating fake data from your model and then fitting the model and seeing if you can recover the parameters. Since it’s taking a long time to run, I suggest just running your 4 parallel chains for 100 warmup and 100 saved iterations and set max treedepth to 5. Just to get things started, cos you don’t want to be waiting for hours every time you debug the model. That’s like what it was like when I took a computer science class in 1977 and we had to write our code on punch cards and then wait hours for it to get run through the computer.
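Here is what the fake-data check looks like in miniature. This is only a sketch in plain Python, not Stan: the model is a toy linear regression, least squares stands in for the real fit, and all names and parameter values are made up for illustration. The point is the loop: pick true parameter values, simulate data, fit, and compare.

```python
import random


def simulate_fake_data(a, b, sigma, n, rng):
    """Simulate y = a + b*x + normal noise for n made-up data points."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def fit_least_squares(xs, ys):
    """Stand-in for the real model fit: ordinary least squares for intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


rng = random.Random(2020)                      # fixed seed so the check is repeatable
true_a, true_b, true_sigma = 1.0, 2.0, 0.5     # assumed "true" values for the simulation
xs, ys = simulate_fake_data(true_a, true_b, true_sigma, n=200, rng=rng)
a_hat, b_hat = fit_least_squares(xs, ys)
print(f"true a = {true_a}, estimated a = {a_hat:.2f}")
print(f"true b = {true_b}, estimated b = {b_hat:.2f}")
```

If the estimates are nowhere near the true values, the problem is in the model or the code, not in your real data. If you are running Stan through CmdStanPy, I believe the short-run settings correspond to the `chains=4`, `iter_warmup=100`, `iter_sampling=100`, and `max_treedepth=5` arguments of `CmdStanModel.sample`, but check the documentation for your interface.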

P.S. Commenter Gec elaborates:

In my [the commenter’s] experience, I treat an inefficient model as a sign that I don’t really understand the model. Of course, my lack of understanding might be “shallow” in that I just coded it wrong or made a typo. But typically my lack of understanding runs deeper, in that I don’t understand how parameters trade off with one another, whether they lead to wonky behavior in different ranges of values, etc.

While there is no one route to improving this understanding, some of it can come from finding analytic solutions to simplified/constrained versions of the full model. A lot comes from running simulations, since this gives insight into how the model’s behavior (i.e., patterns of data) relate to its parameter settings. For example, I might discover that a fit is taking a long time because two parameters, even if they are logically distinct, end up trading off with one another. Or that, even if two parameters are in principle identifiable, the particular data being fit doesn’t distinguish them.

It might seem like these model explorations take a long time, and they do! But I think that time is better spent building up this understanding than waiting for fits to finish.

Exactly. Workflow, baby, workflow.

Piranhas in the rain: Why instrumental variables are not as clean as you might have thought

Woke up in my clothes again this morning
I don’t know exactly where I am
And I should heed my doctor’s warning
He does the best with me he can
He claims I suffer from delusion
But I’m so confident I’m sane
It can’t be a statistical illusion
So how can you explain
Piranhas in the rain
And if you see us on the corner
We’re just dancing in the rain
I tell my friends there when I see them
Outside my window pane
Piranhas in the rain.
— Sting (almost)

Gaurav Sood points us to this article by Jonathan Mellon, “Rain, Rain, Go away: 137 potential exclusion-restriction violations for studies using weather as an instrumental variable,” which begins:

Instrumental variable (IV) analysis assumes that the instrument only affects the dependent variable via its relationship with the independent variable. Other possible causal routes from the IV to the dependent variable are exclusion-restriction violations and make the instrument invalid. Weather has been widely used as an instrumental variable in social science to predict many different variables. The use of weather to instrument different independent variables represents strong prima facie evidence of exclusion violations for all studies using weather as an IV. A review of 185 social science studies (including 111 IV studies) reveals 137 variables which have been linked to weather, all of which represent potential exclusion violations. I conclude with practical steps for systematically reviewing existing literature to identify possible exclusion violations when using IV designs.

That sounds about right.
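To see why an exclusion-restriction violation matters, here is a small simulation sketch in Python. Everything in it is made up for illustration: the instrument z plays the role of rain, the true effect of x on y is 1, the first-stage coefficient is 0.5, and `direct_effect` is a hypothetical second path from z to y that bypasses x. With a single instrument, the IV estimate is cov(z, y) / cov(z, x).

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)


def iv_estimate(direct_effect, n=50_000, seed=0):
    """Simulate data and return the simple IV (Wald) estimate of the effect of x on y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)                    # instrument, e.g. rainfall
        ui = rng.gauss(0, 1)                    # unobserved confounder of x and y
        xi = 0.5 * zi + ui + rng.gauss(0, 1)    # first stage: z affects x
        # True effect of x on y is 1.0; direct_effect is the exclusion violation.
        yi = 1.0 * xi + direct_effect * zi + ui + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return covariance(z, y) / covariance(z, x)


print(f"exclusion restriction holds:    {iv_estimate(0.0):.2f}")
print(f"direct effect of 0.2 from z:    {iv_estimate(0.2):.2f}")
```

In this setup the bias is the direct effect divided by the first-stage coefficient, so a direct path of 0.2 pushes the estimate from about 1 to about 1.4. A weak first stage magnifies even a small violation.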

This story reminds me of when we were looking at the notorious ovulation-and-voting study and we realized that the evolutionary psychology and social priming literatures are just loaded with potential confounders:

But the papers on ovulation and voting, shark attacks and voting, college football and voting, etc., don’t just say that voters, or some voters, are superficial and fickle. No, these papers claim that seemingly trivial or irrelevant factors have large and consistent effects, and that I don’t believe. I do believe that individual voters can be influenced by these silly things, but I don’t buy the claim that these effects are predictable in that way. The problem is interactions. For example, the effect on my vote of the local college football team losing could depend crucially on whether there’s been a shark attack lately, or on what’s up with my hormones on election day. Or the effect could be positive in an election with a female candidate and negative in an election with a male candidate. Or the effect could interact with parent’s socioeconomic status, or whether your child is a boy or a girl, or the latest campaign ad, etc.

This is also related to the piranha problem. If you take these applied literatures seriously, you’re led to the conclusion that there are dozens of large effects floating around, all bumping against each other.

Or, to put it another way, the only way you can believe in any one of these studies is if you don’t believe in any of the others.

It’s like religion. I can believe in my god, but only if I think that none of your gods exist.

The nudgelords won’t be happy about this latest paper, as it raises the concern that any nudge they happen to be studying right now is uncomfortably interacting with dozens of other nudges unleashed upon the world by other policy entrepreneurs.

Maybe they could just label this new article as Stasi or terrorism and move on to their next NPR appearance?

Presidents as saviors vs. presidents as being hired to do a job

There’s been a lot of talk about how if Biden is elected president it will be a kind of relief, a return to problem solving and dialing down of tension. This is different from Obama, who so famously inspired all that hope, and it made me think about characterizing other modern presidents in this way:

Saviors: Trump, Obama, Clinton, Reagan, Roosevelt

Hired to do or continue a job: Bush 2, Bush 1, Nixon, Johnson, Truman

I’m not quite sure how I’d characterize the other elected presidents from that era: Carter, Kennedy, Eisenhower. Carter in retrospect doesn’t have a savior vibe, but he was elected as a transformative outlier. Kennedy looms large in retrospect but it’s not clear that he was considered as a savior when he was running for president. Eisenhower I’m not sure about either.

Another complication is that there have been changes in Congress at the same time. There was the radicalism of the post-1974 reform movement, the 1994 Newt Gingrich revolution, and then the locked-in partisanship of congressional Republicans since 2010. All of these can be considered responses to executive overreach by opposition presidents, and all of them have motivated counterreactions.

Estimated “house effects” (biases of pre-election surveys from different pollsters) and here’s why you have to be careful not to overinterpret them:

Elliott provides the above estimates from our model. As we’ve discussed, as part of our fitting procedure we estimate various biases, capturing in different ways the fact that surveys are not actually random samples of voters from an “urn.” One of these biases is the “house effect.” In our model, everything’s on the logit scale, so we divide by 4 to get biases on the probability scale. The above numbers have already been divided by 4.
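The divide-by-4 rule comes from the slope of the inverse logit curve, which is p(1 - p) and so equals 1/4 at p = 0.5. Here is a quick Python check, with made-up function names, showing that a small shift on the logit scale moves the probability by about a quarter of that amount when the baseline is near 50%.

```python
import math


def invlogit(x):
    """Inverse logit: maps the real line to (0, 1)."""
    return 1 / (1 + math.exp(-x))


def prob_shift(logit_shift, baseline_prob=0.5):
    """Change in probability caused by adding logit_shift on the logit scale."""
    baseline_logit = math.log(baseline_prob / (1 - baseline_prob))
    return invlogit(baseline_logit + logit_shift) - baseline_prob


for baseline in (0.50, 0.54):
    for shift in (0.04, 0.08):
        print(f"baseline {baseline:.2f}, logit shift {shift:.2f}: "
              f"exact change {prob_shift(shift, baseline):.4f}, "
              f"divide-by-4 approximation {shift / 4:.4f}")
```

The approximation is best at a baseline of exactly 0.5 and stays close for vote shares in the range seen in a two-party presidential race.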

So we estimate the most biased polling organizations to range from about +2 Biden to +3 Trump, but with most of them between -1 and 1 percentage points. (This is in the scale of vote proportion, not vote margin.)

You can also see that there’s lots of uncertainty about the house effect for any given pollster. That’s because we used a weak prior, normal with mean 0 and sd 1.5, implying that a priori we didn’t say much about house effects except that they’re most likely less than 2 percentage points in either direction.

I have no reason to think that most of these biases represent any kind of political biases coming from the polling organizations. Rather, different orgs use different data collection methods and do different adjustments. So they’ll get different answers. In any given election cycle, these different approaches will have different biases, but with only a few polls it’s hard to pin down these biases with any precision, especially given that each poll has its own idiosyncratic bias as well, having to do with whatever was going on the day the survey was in the field.

Don’t overinterpret the chart!

It’s tempting to read the above graph and use it to rate the polls. Don’t do that!

Let me emphasize that a survey having a low estimated bias in the above chart does not necessarily mean it’s a better poll. I say this for two reasons:

1. The estimates of bias are really noisy! It would be possible to get an illusorily precise estimate of house effects by doing some simple averaging, but that would be wrong because it would not account for nonsampling errors that vary by poll.

2. Even if these estimates were precise, they’re just for the current 2020 campaign. Polling and adjustment strategies that work well this year might run into problems in other settings.

Whassup with the dots on our graph?

The above is from our Economist election forecast. Someone pointed out to me that our estimate is lower than all the dots in October. Why is that? I can come up with some guesses, but it’s surprising that the line is below all the dots.

Merlin replied:

That happened a bunch of times before as well. I’d assume that there are more pro Dem pollsters than pro Rep or unbiased pollsters. Plus those who don’t adjust should also be above the line. And then we are still getting dragged towards the fundamentals.

And Elliott wrote:

First, the line is not lower than all the dots; there are several polls (mainly from the internet panels that weight by party) that are coming in right on it.

Merlin’s email explains a lot of this, I think — our non-response correction is subtracting about 2 points from Dems on margin right now.

But perhaps a more robust answer is that state polls simply aren’t showing the same movement. That’s why 538 is also showing Biden up 7-8 in their model, versus the +10 in their polling average.

And I realized one other thing, which is that the difference between the dots and the line is not as big as it might appear at first. Looking at the spread, at first it looks like the line for Biden is at around 54%, and the dots are way above the line. But that’s misleading: you should roughly compare the line to the average of the dots. And the average of the recent blue dots is only about 1 percentage point higher than the blue line. So our model is estimating that recent national polls are only very slightly overestimating Biden’s support in the population. Yes, there’s that one point that’s nearly 5 percentage points above the line, and our model judges that to be a fluke—such things happen!—but overall the difference is small. I’m not saying it’s zero, and I wouldn’t want it to be zero; it’s just what we happen to see given the adjustments we’re using in the model.

P.S. Elliott also asked whether this is worth blogging about. I replied that everything’s worth blogging about. I blog about Jamaican beef patties.

Pre-register post-election analyses?

David Randall writes:

I [Randall] have written an article on how we need to (in effect) pre-register the election—preregister the methods we will use to analyze the voting, with an eye to determining if there is voter fraud.

I have a horrible feeling we’re headed to civil war, and there’s nothing that can be done about it—but I thought that this might be a feather in the wind to prevent a predictable post-voting-day wrangle that could tip the country over the edge.

I wanted to get this out a month ago, and ideally in some less political venue. It got caught with an editor for several weeks, and now there is much less time to do anything. Still, better late than never.

Here’s what Randall writes in his article:
Continue reading ‘Pre-register post-election analyses?’ »

Between-state correlations and weird conditional forecasts: the correlation depends on where you are in the distribution

Yup, here’s more on the topic, and this post won’t be the last, either . . .

Jed Grabman writes:

I was intrigued by the observations you made this summer about FiveThirtyEight’s handling of between-state correlations. I spent quite a bit of time looking into the topic and came to the following conclusions.

In order for Trump to win a blue state, either:

1.) That state must swing hard to the right or

2.) The nation must have a swing toward Trump or

3.) A combination of the previous two factors.

The odds of winning the electoral college conditional on winning a state are therefore a statement about the relative likelihood of these scenarios.

Trump is quite unlikely to win the popular vote (538 has it at 5% and you at <1%), so the odds of Trump winning a deep blue state due to a large national swing are extremely low. Therefore, if the state's correlation with the nation is high, Trump would almost never win the state. In order for Trump to win the state a noticeable proportion of the time (say >0.5%), the correlation needs to be low enough that the state specific swing can get Trump to victory without making up the sizable national lead Biden currently has. This happens in quite a number of states in 538’s forecast, but also can be seen less frequently in The Economist’s forecast. For example, in your forecast Trump only wins New Mexico 0.5% of the time, but it appears that Biden wins nationally in a majority of those cases due to its low correlation with most swing states.

It is hard for me to determine what these conditional odds ought to be. If Trump needs to make up 15 points in New Mexico, I can’t say which of these implausible scenarios is more likely: That he makes up 6 points nationally and an additional 9 in New Mexico (likely losing the election) or that he makes up 9 nationally and an additional 6 in New Mexico (likely winning the election).

If you are interested in more on my thoughts, I recently posted an analysis of this issue on Reddit that was quite well received: Unlikely Events, Fat Tails and Low Correlation (or Why 538 Thinks Trump Is an Underdog When He Wins Hawaii).

I replied that yes, the above is just about right. For small swings, these conditional distributions depend on the correlations of the uncertainties between states. For large swings, these conditional distributions depend on higher-order moments of the joint distribution. I think that what Fivethirtyeight did was to start with highly correlated errors across states and then add a small bit of long-tailed error that was independent across states, or something like that. The result of this is that if the swing in a state is small, it will be highly predictive of swings in other states, but if the swing is huge, then it is most likely attributable to that independent long-tailed error term and then it becomes not very predictive of the national swing. That’s how the Fivethirtyeight forecast can simultaneously say that Biden has a chance of winning Alabama but also say that if he wins Alabama, this doesn’t shift his national win probability much. It’s an artifact of these error terms being added in. As I wrote in my post, such things happen: these models have so many moving parts that I would expect just about any model, including ours, to have clear flaws in its predictions somewhere or another. Anyway, you can get some intuition about these joint distributions by doing some simulations in R or Python. It’s subtle because we’re used to talking about “correlation,” but when the uncertainties are not quite normally distributed and the tails come into play, it’s not just correlation that matters. Or, to put it another way, the correlation conditional on a state’s result being in the tail is different than the correlation conditional on it being near the middle of its distribution.
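Here is a minimal simulation sketch of that mechanism in Python. To be clear, this is not the Fivethirtyeight model or ours: the two hypothetical states, the shared national error with sd of 2 points, the independent t_3 state errors with scale 1, and the cutoffs for "small" and "huge" swings are all made-up assumptions, chosen only to show how the predictive value of one state's swing can depend on where you are in the distribution.

```python
import math
import random

def student_t(rng, df):
    """Draw from a Student-t distribution with df degrees of freedom."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) written as a gamma draw
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)

def simulate_swings(n_draws=200_000, national_sd=2.0, state_scale=1.0, df=3, seed=1):
    """Simulate swings in two states, A and B (all settings are illustrative assumptions).

    Each state's swing = shared normal national error + independent long-tailed error.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        national = rng.gauss(0.0, national_sd)
        swing_a = national + state_scale * student_t(rng, df)
        swing_b = national + state_scale * student_t(rng, df)
        draws.append((swing_a, swing_b))
    return draws

def predictive_ratio(draws, lo, hi):
    """Among draws with lo < swing in A < hi, return (mean swing in B / mean swing in A, count)."""
    subset = [(a, b) for a, b in draws if lo < a < hi]
    mean_a = sum(a for a, _ in subset) / len(subset)
    mean_b = sum(b for _, b in subset) / len(subset)
    return mean_b / mean_a, len(subset)

draws = simulate_swings()
ratio_small, n_small = predictive_ratio(draws, 1.0, 3.0)        # modest swing in state A
ratio_large, n_large = predictive_ratio(draws, 10.0, math.inf)  # huge swing in state A

print(f"small swings in A: n = {n_small}, mean(B) / mean(A) = {ratio_small:.2f}")
print(f"huge swings in A:  n = {n_large}, mean(B) / mean(A) = {ratio_large:.2f}")
```

If the setup behaves as intended, the ratio should come out noticeably higher for the small swings than for the huge ones: a modest swing in state A mostly reflects the shared national error, while a huge swing mostly reflects state A's own long-tailed term and so says much less about state B.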

It’s good to understand how our models work. One advantage of a complicated multivariate prediction is that it gives so many things to look at, that if you look carefully enough you’ll find a problem with just about any prediction method. Reality is more complicated than any model we can build—especially given all the shortcuts we take when modeling.

Once you’ve made a complicated prediction, it’s great to be able to make it public and get feedback, and to take that feedback seriously.

And once you recognize that your model will be wrong—as they say on the radio, once you recognize that you are a sinner—that’s the first step on the road to improvement.

Also the idea that the correlation depends on where you are in the distribution: that can be important sometimes.

Reference for the claim that you need 16 times as much data to estimate interactions as to estimate main effects

Ian Shrier writes:

I read your post on the power of interactions a long time ago and couldn’t remember where I saw it. I just came across it again by chance.

Have you ever published this in a journal? The concept comes up often enough and some readers who don’t have methodology expertise feel more comfortable with publications compared to blogs.

My reply:

Thanks for asking. I’m publishing some of this in Section 16.4 of our forthcoming book, Regression and Other Stories. So you can cite that.

Calibration problem in tails of our election forecast

Following up on the last paragraph of this discussion, Elliott looked at the calibration of our state-level election forecasts, fitting our model retroactively to data from the 2008, 2012, and 2016 presidential elections. The plot above shows the point prediction and election outcome for the 50 states in each election, showing in red the states where the election outcome fell outside the 95% predictive interval. The actual intervals are shown for each state too and we notice a few things:

1. Nearly 10% of the statewide election outcomes fall outside the 95% intervals.

2. A couple of the discrepancies are way off, 3 or 4 predictive sd’s away.

3. Some of the errors are in the Democrats’ favor and some are in the Republicans’. This is good news for us in that these errors will tend to average out (not completely, but to some extent) rather than piling up when predicting the national vote. But it also is bad news in that we can’t excuse the poor calibration based on the idea that we only have N = 3 national elections. To the extent that these errors are all over the place and not all occurring in the same direction and in the same election, that’s evidence of an overall problem of calibration in the tails.

We then made a histogram of all 150 p-values. For each state election, if we have S simulation draws representing the predictive distribution, and X of them are lower than the actual outcome (the Democratic candidate’s share of the two-party vote), then we calculate the p-value as (2*X + 1) / (2*S + 2), using a continuity correction so that it’s always between 0 and 1 (something that Cook, Rubin, and I did in our simulation-based calibration paper, although we neglected to mention that in the article itself; it was only in the software that we used to make all the figures).
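As a minimal sketch of that calculation in Python (the function name and the toy numbers are mine, not taken from our forecasting code):

```python
def calibration_p_value(draws, actual):
    """Continuity-corrected p-value: (2*X + 1) / (2*S + 2).

    draws  -- S simulation draws from the predictive distribution
    actual -- the observed outcome (here, Democratic share of the two-party vote)
    """
    s = len(draws)
    x = sum(1 for d in draws if d < actual)  # X = number of draws below the outcome
    return (2 * x + 1) / (2 * s + 2)

# Toy example: three hypothetical draws, with the outcome above two of them.
print(calibration_p_value([0.4, 0.5, 0.6], 0.55))  # (2*2 + 1) / (2*3 + 2) = 0.625

# With a hypothetical S = 3000 draws, the most extreme attainable values
# are 1/6002 (all draws above the outcome) and 6001/6002 (all draws below).
print(calibration_p_value([0.5] * 3000, 0.4))
```

The correction keeps the result strictly inside (0, 1) even when the outcome falls below or above every simulation draw.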

Here’s what we found:

There are too many p-values below 0.025, which is consistent with what we saw in the plots with the interval coverage. But the distribution is not quite as U-shaped as we might have feared. The problem is at the extremes: the lowest of the 150 p-values are three values of 0.00017 (even one of these should happen only about once in 6000 predictions), and the highest are two values above 0.9985 (and one of these should happen only about once in every 600 cases).

Thus, overall I’d say that the problem is not that our intervals are generally too narrow but that we have a problem in the extreme tails. Everything might be just fine if we swap out the normal error model for something like a t_4. I’m not saying that we should add t_4 random noise to our forecasts; I’m saying that we’d change the error model and let the posterior distribution work things out.

Let’s see what happens when we switch to t_4 errors:

Not much different, but a little better coverage. Just to be clear: We wouldn’t use a set of graphs like this on their own to choose the model. Results for any small number of elections will be noisy. The real point is that we had prior reasons for using long-tailed errors, as occasional weird and outlying things do happen in elections.
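To get a rough feel for how much more room a long-tailed error model leaves for outlying results, here is a small Python sketch comparing the chance of landing more than 3 or 4 sd's from the center under a normal distribution and under a t_4. This is only an illustration of the two distributions, not a refit of our model: the choice to rescale the t_4 to unit variance is my assumption for the comparison, and the helper names are made up.

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2.0))

def t4_cdf(t):
    """Closed-form CDF of a Student-t distribution with 4 degrees of freedom."""
    w = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(w)) * (1.0 - (t * t / w) / 12.0)

def t4_two_sided_tail(k, unit_variance=True):
    """P(|T| > k sd's) for a t_4, optionally rescaled to have variance 1."""
    scale = math.sqrt(2.0) if unit_variance else 1.0  # a t_4 has variance 2
    return 2.0 * (1.0 - t4_cdf(k * scale))

for k in (2, 3, 4):
    print(f"{k} sd's out: normal {normal_two_sided_tail(k):.5f}, "
          f"t_4 {t4_two_sided_tail(k):.5f}")
```

The comparison is only about the shape of the error distribution. In the actual forecast the error model sits inside the larger model and the posterior distribution works things out, so the effect on the intervals need not match these raw tail probabilities.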

At this point you might ask, why did it take us so long to notice the above problem? It took us so long because we weren’t looking at those tails! We were looking at national electoral votes and looking at winners of certain swing states. We weren’t checking the prediction for Obama’s vote margin in Hawaii and Trump’s margin in Wyoming. But these cases supply information too.

We have no plans to fix this aspect of our model between now and November, as it won’t have any real effect on our predictions of the national election. But it will be good to think harder about this going forward.

General recommendation regarding forecasting workflow

I recommend that forecasters do this sort of exercise more generally: produce multivariate forecasts, look at them carefully, post the results widely, and then carefully look at the inevitable problems that turn up. No model is perfect. We can learn from our mistakes, but only if we are prepared to do so.

Stan’s Within-Chain Parallelization now available with brms

The just-released R package brms version 2.14.0 supports within-chain parallelization of Stan. This new functionality is based on the recently introduced reduce_sum function in Stan, which makes it possible to evaluate sums over (conditionally) independent log-likelihood terms in parallel, using multiple CPU cores at the same time via threading. The idea of reduce_sum is to exploit the associativity and commutativity of the sum operation, which allows any large sum to be split into many smaller partial sums.
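The partial-sum idea can be sketched in a few lines of Python. This is not Stan or brms code and does not use their APIs: the normal likelihood, the simulated data, and the helper names are assumptions for illustration, with `grainsize` borrowed only as a name for the slice length. Python threads also will not give the speedup that Stan's threading can, so the sketch shows the algebra, not the performance.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def normal_lpdf(y, mu, sigma):
    """Log density of a normal(mu, sigma) distribution at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def partial_log_lik(y_slice, mu, sigma):
    """Sum of log-likelihood terms over one slice of the data."""
    return sum(normal_lpdf(y, mu, sigma) for y in y_slice)

def log_lik_partial_sums(y, mu, sigma, grainsize=250, threads=4):
    """Split the full sum into slices, evaluate each slice separately, add the partial sums."""
    slices = [y[i:i + grainsize] for i in range(0, len(y), grainsize)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(lambda s: partial_log_lik(s, mu, sigma), slices))
    return sum(partials)

rng = random.Random(0)
y = [rng.gauss(1.0, 2.0) for _ in range(10_000)]  # simulated data, for illustration only

total_serial = partial_log_lik(y, 1.0, 2.0)
total_split = log_lik_partial_sums(y, 1.0, 2.0)
print(total_serial, total_split)  # equal up to floating-point rounding
```

Because addition is associative and commutative, the order in which the slices finish does not matter, which is what lets the terms be handed out to different cores.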

Paul Bürkner did an amazing job enabling within-chain parallelization via threading for a broad range of models supported by brms. Note that currently threading is only available with the CmdStanR backend of brms, since the minimal Stan version supporting reduce_sum is 2.23 and rstan is still at 2.21. It may still take some time until rstan can directly support threading, but users will usually not notice any difference between the two backends once configured.

We encourage users to read the new threading vignette to get a sense of what speedups they can expect for their models. The speed gain from adding more CPU cores per chain will depend on many model details. In brief:

  • Stan models taking days/hours can run in a few hours/minutes, but models running just a few minutes will be hard to accelerate
  • Models with computationally expensive likelihoods will parallelize better than those with cheap-to-calculate ones like a normal or a Bernoulli likelihood
  • Non-hierarchical models and hierarchical models with few groupings will greatly benefit from parallelization, while hierarchical models with many random effects will gain somewhat less in speed

The new threading feature is marked as “experimental” in brms, since it is entirely new and there may be a need to change some details depending on further experience with it. We are looking forward to hearing from users about their experiences with the new feature at the Stan Discourse forums.

She’s wary of the consensus based transparency checklist, and here’s a paragraph we should’ve added to that zillion-authored paper

Megan Higgs writes:

A large collection of authors describes a “consensus-based transparency checklist” in the Dec 2, 2019 Comment in Nature Human Behavior.

Hey—I’m one of those 80 authors! Let’s see what Higgs has to say:

I [Higgs] have mixed emotions about it — the positive aspects are easy to see, but I also have a wary feeling that is harder to put words to. . . . I do suspect this checklist will help with transparency at a fairly superficial level (which is good!), but could it potentially harm progress on deeper issues? . . . will the satisfactory feeling of successfully completing the checklist lull researchers into complacency and keep them from spending effort on the deeper layers? Will it make them feel they don’t need to worry about the deeper stuff because they’ve already successfully made it through the required checklist?

She summarizes:

I [Higgs] worry the checklist is going to inadvertently be taken as a false check of quality, rather than simply transparency (regardless of quality). . . . We should always consider the lurking dangers of offering easy solutions and simple checklists that make humans feel that they’ve done all that is needed, thus encouraging them to do no more.

I see her point, and it relates to one of my favorite recent slogans: honesty and transparency are not enough.

I signed on to the checklist because it seemed like a useful gesture, a “move the ball forward” step.

Here are a couple of key sentences in our paper:

Among the causes for this low replication rate are underspecified methods, analyses and reporting practices.

We believe that consensus-based solutions and user-friendly tools are necessary to achieve meaningful change in scientific practice.

We said among the causes (not the only cause), and we said necessary (not sufficient).

Still, after reading Megan’s comment, I wish we’d added another paragraph, something like this:

Honesty and transparency are not enough. Bad science is bad science even if it is open, and applying transparency to poor measurement and design will, in and of itself, not create good science. Rather, transparency should reduce existing incentives for performing bad science and increase incentives for better measurement and design of studies.

We are stat professors with the American Statistical Association, and we’re thrilled to talk to you about the statistics behind voting. Ask us anything!

It’s happening at 11am today on Reddit.

It’s a real privilege to do this with Mary Gray, who was so nice to me back when I took a class at American University several decades ago.

Fiction as a window into other cultures

tl;dr: more on Updike.

In our recent discussion of reviews of John Updike books, John Bullock pointed us to this essay by Claire Lowdon, who begins:

In the opening scene of Rabbit, Run (1960), John Updike’s second published novel, the twenty-six-year-old Harry Angstrom – aka Rabbit – joins some children playing basketball around a telephone pole. One of the boys is very good.

He’s a natural. The way he moves sideways without taking any steps, gliding on a blessing: you can tell. The way he waits before he moves. With luck he’ll become in time a crack athlete in the high school; Rabbit knows the way. You climb up through the little grades and then get to the top and everybody cheers; with the sweat in your eyebrows you can’t see very well and the noise swirls around you and lifts you up, and then you’re out, not forgotten at first, just out, and it feels good and cool and free. You’re out, and sort of melt, and keep lifting, until you become like to these kids just one more piece of the sky of adults that hangs over them in the town, a piece that for some queer reason has clouded and visited them. They’ve not forgotten him: worse, they never heard of him. Yet in his time Rabbit was famous through the county; in basketball in his junior year he set a B-league scoring record that in his senior year he broke with a record that was not broken until four years later.

Are the kids reading John Updike now? Or is he, like his most famous creation, “just one more piece of the sky of adults”? For “adults”, in 2019, read Dead White Males . . .

Lowdon continues:

When we reread Rabbit, Run today, with almost 2020 vision [this review was published in 2019 — AG], the novel’s assumptions about men and women leap out at us. . . .

There are plenty of things I [Lowdon] could say in “defence” of these awkward moments. First, they are all thought or spoken by Updike’s characters, not by him. (Counter-objection: Updike is closely aligned with his own protagonists.) . . . Second, Updike himself balances the male gaze with powerful moments of insight into the female perspective. . . . I could go on – or, indeed, start to counter those counter-objections. . . .

So far, so balanced: after all, it would be an unusual choice in this era to review Updike and not address, in some way or another, the perception that he’s a sexist. This wouldn’t be quite on the level of reviewing Leni Riefenstahl’s cinematography, but you get the point.

But then Lowdon turns in a new (to me) and interesting direction:

In the third episode of the charming podcast “Medieval History for Pleasure and Profit”, Alice Rio and Alice Taylor give a surprising response to a listener’s question, “how badly did it smell, really?” They point out how relative smell is – how a medieval person travelling forward in time to today would be overwhelmed by the stench of petrol fumes, which we mostly don’t notice. The things we smell in Updike’s work, or Bellow’s, are as indicative of our own time as of theirs. And times change very quickly. . . .

Who knows – in another two decades, 2019’s heated discussions about race and gender may look equally quaint. The atrocities we’re unconsciously committing in our novels today are probably something to do with the environment. All those casual plane journeys in Rachel Cusk’s Outline trilogy! . . . If we lift our muzzles from the scent trail of sexism in Updike’s early work and look around, we find ourselves standing firmly in the past.

One reason we read fiction is to make sense of our world. Another reason is to learn about worlds other than ours. Updike, like all fiction writers, does both these things. Even bad writers are there to make sense of our world and tell us about other worlds. Bad writing can do this indirectly (by revealing an author’s unconsidered stereotypes) and tediously (so that it’s just not worth the effort to read), but it still does so in some ways. Recall my argument about works of alternative history.

Lowdon is making the point that anything we read will be presenting a different perspective. That said, some perspectives seem to date faster than others. Mark Twain, for example, seems strikingly modern to me, even though he was writing 150 years ago. Just to be clear, I’m not simply using “modern” as a shorthand for “interesting” or “relevant.” Shakespeare remains interesting and relevant, but I wouldn’t say he has a modern perspective: he has all this stuff about noble blood etc. which I guess is how most people thought back then but which today seems “faintly naff” (as Lowdon would say).

I’m also reminded of the point I encountered in some book about translation (maybe this one), about the tension when translating a book from another language: On one hand, you want your reading of the book to be similar to the experience of a native speaker of that language, hence you want a smooth translation into readable modern English. On the other hand, one reason you’re reading a book from another culture is that you want to get a feel for that culture, so you want the English translation to capture some of this.

To put it another way: when I read a book written in England, I don’t want words such as “lift,” local markers such as “Tesco,” and expressions such as “too clever by half” to be translated into American, any more than I’d want Lucinda Williams’s music re-sung by someone with a mid-Atlantic accent. The only difference with a book written in a foreign language is that some amount of translation is required—it’s just not clear how much.

I don’t think anyone’s proposing that we read a Bowdlerized Updike, shorn of its mid-twentieth-century sexual politics, any more than I’d want a hypothetical reader of this blog in 2100 to push the delete key because I refer too many times to airline flights, beef patties, and other signifiers of our current resource-devouring era.

Speaking of retro sexual politics, let me remind you that one characteristic of stereotyping is that it can go in any direction. Recall this example. It seems to me that the most important aspect of stereotyping is not its direction but rather its strongly essentialist perspective.

That all said, it’s not just that Updike is old-fashioned and was a man of his time; his specific attitudes can affect how we experience his books. You can learn from his books, but that doesn’t mean you will like them, if his views are just too far from your own. I can enjoy novels from authors whose political and social views differ a lot from mine—it takes me out of my comfort zone, and that can be good—but if you go too far, and without any irony, eventually I’ll find it just too unpleasant to take. When I read supposedly humorous essays by people joking about how parents should be allowed to give their kids a good whuppin’, I don’t think it’s charming; I’m just repulsed. It’s just too much of a distraction, at least for me.

I think that’s the point about Updike’s male gaze etc. It doesn’t happen to bother me when I read the books, but I can see that it could bother you a little so that you have to come to terms with it (that’s Lowdon’s position), or it could bother you so much that you just don’t want to deal with it. I respect that last position as well.

All of this becomes more complicated because we’re talking about fiction rather than nonfiction or political speeches or journalism or whatever where statements can be taken literally. I have a friend who no longer enjoys football because he can’t stop thinking about the injuries. I respect that position, even though I’m not quite there yet.

P.S. My own take on Updike is different from most of what I’ve seen. I read Updike for the content, not the style. Or, to put it another way, I value Updike’s style because it’s a way for him to get to his content. To me, Rabbit, Run is not about the glissando of Updike’s descriptions or his male gaze or whatever; it’s all about Harry, this character who’s still young but with adult responsibilities that he doesn’t really want. Kind of like John Updike, or the United States, in 1960. On the back of my paperback copy of Rabbit, Run is the following quote from a review . . . oh, I don’t remember it exactly, let me go to my bookshelf . . . it’s not there! I wonder what happened to my copy of Rabbit, Run? I like to reread this book from time to time. OK, let me do some web searching . . . here’s the blurb, I think: “a powerful writer with his own vision of the world.”

P.P.S. Let me again plug James Atlas’s book about the writing of literary biography.