
Reverse-engineering the problematic tail behavior of the Fivethirtyeight presidential election forecast

We’ve been writing a bit about some odd tail behavior in the Fivethirtyeight election forecast: for example, it was giving Joe Biden a 3% chance of winning Alabama (which seemed high), it was displaying Trump winning California as in “the range of scenarios our model thinks is possible” (which didn’t seem right), it allowed the possibility that Biden could win every state except New Jersey (?), and it implied that, in the scenarios where Trump won California, he only had a 60% chance of winning the election overall (which seems way too low).

My conjecture was that these wacky marginal and conditional probabilities came from the Fivethirtyeight team adding independent wide-tailed errors to the state-level forecasts: this would be consistent with Fivethirtyeight leader Nate Silver’s statement, “We think it’s appropriate to make fairly conservative choices especially when it comes to the tails of your distributions.” The wide tails allow the weird predictions such as Trump winning California. And independence of these extra error terms would yield the conditional probabilities such as the New Jersey and California things above. I’d think that if Trump were to win New Jersey or, even more so, California, that this would most likely happen only as part of a national landslide of the sort envisioned by Scott Adams or whatever. But with independent errors, Trump winning New Jersey or California would just be one of those things, a fluke that provides very little information about a national swing.

You can really only see this behavior in the tails of the forecasts if you go looking there. For example, if you compute the correlation matrix of the state-level forecasts, that correlation is mostly estimated from the bulk of the distribution, as the extreme tails contribute only a small share of the probability. Remember, the correlation depends on where you are in the distribution.

Anyway, that’s where we were until a couple days ago, when commenter Rui pointed to a file on the Fivethirtyeight website with the 40,000 simulations of the vector of forecast vote margins in the 50 states (and also D.C. and the congressional districts of Maine and Nebraska).

Now we’re in business.

I downloaded the file, read it into R, and created the variables that I needed:

library("rjson")

# Read Fivethirtyeight's file of simulation draws (one simulated map per draw)
sims_538 <- fromJSON(file="simmed-maps.json")
states <- sims_538$states
n_sims <- length(sims_538$maps)

# Each draw is a vector of length 59: an unnamed field, "Trump" and "Biden"
# columns (presumably their electoral-vote totals), and then the margins in
# the 56 state-level units (50 states, D.C., and the congressional districts
# of Maine and Nebraska)
sims <- array(NA, c(n_sims, 59), dimnames=list(NULL, c("", "Trump", "Biden", states)))
for (i in 1:n_sims){
  sims[i,] <- sims_538$maps[[i]]
}

# Keep the state-level margins; positive values are Trump leads, so
# (margin + 1)/2 is Trump's share of the two-party vote
state_sims <- sims[,4:59]
trump_share <- (state_sims + 1)/2
biden_wins <- state_sims < 0
trump_wins <- state_sims > 0

As a quick check, let’s compute Biden’s win probability by state:

> round(apply(biden_wins, 2, mean), 2)
  AK   AL   AR   AZ   CA   CO   CT   DC   DE   FL   GA   HI   IA   ID   IL   IN   KS   KY   LA   M1   M2   MA   MD   ME   MI 
0.20 0.02 0.02 0.68 1.00 0.96 1.00 1.00 1.00 0.72 0.50 0.99 0.48 0.01 1.00 0.05 0.05 0.01 0.06 0.98 0.51 1.00 1.00 0.90 0.92 
  MN   MO   MS   MT   N1   N2   N3   NC   ND   NE   NH   NJ   NM   NV   NY   OH   OK   OR   PA   RI   SC   SD   TN   TX   UT 
0.91 0.08 0.10 0.09 0.05 0.78 0.00 0.67 0.01 0.01 0.87 0.99 0.97 0.90 1.00 0.44 0.01 0.97 0.87 1.00 0.11 0.04 0.03 0.36 0.04 
  VA   VT   WA   WI   WV   WY 
0.99 0.99 0.99 0.86 0.01 0.00 

That looks about right. Not perfect—I don’t think Biden’s chances of winning Alabama are really as high as 2%—but this is what the Fivethirtyeight model is giving us, rounded to the nearest percent.

And now for the fun stuff.

What happens if Trump wins New Jersey?

> condition <- trump_wins[,"NJ"]
> round(apply(trump_wins[condition,], 2, mean), 2)
  AK   AL   AR   AZ   CA   CO   CT   DC   DE   FL   GA   HI   IA   ID   IL   IN   KS   KY   LA   M1   M2   MA   MD   ME 
0.58 0.87 0.89 0.77 0.05 0.25 0.10 0.00 0.00 0.79 0.75 0.11 0.78 0.97 0.05 0.87 0.89 0.83 0.87 0.13 0.28 0.03 0.03 0.18 
  MI   MN   MO   MS   MT   N1   N2   N3   NC   ND   NE   NH   NJ   NM   NV   NY   OH   OK   OR   PA   RI   SC   SD   TN 
0.25 0.38 0.84 0.76 0.76 0.90 0.62 1.00 0.42 0.96 0.97 0.40 1.00 0.16 0.47 0.01 0.53 0.94 0.08 0.39 0.08 0.86 0.90 0.85 
  TX   UT   VA   VT   WA   WI   WV   WY 
0.84 0.91 0.16 0.07 0.07 0.50 0.78 0.97 

So, if Trump wins New Jersey, his chance of winning Alaska is . . . 58%??? That’s less than his chance of winning Alaska conditional on losing New Jersey.

Huh?

Let’s check:

> round(mean(trump_wins[,"AK"] [trump_wins[,"NJ"]]), 2)
[1] 0.58
> round(mean(trump_wins[,"AK"] [biden_wins[,"NJ"]]), 2)
[1] 0.80

Yup.

Whassup with that? How could that be? Let’s plot the simulations of Trump’s vote share in the two states:

par(mar=c(3,3,1,1), mgp=c(1.7, .5, 0), tck=-.01)
par(pty="s")
rng <- range(trump_share[,c("NJ", "AK")])
plot(rng, rng, xlab="Trump vote share in New Jersey", ylab="Trump vote share in Alaska", main="40,000 simulation draws", cex.main=0.9, bty="l", type="n")
polygon(c(0.5,0.5,1,1), c(0,1,1,0), border=NA, col="pink")
points(trump_share[,"NJ"], trump_share[,"AK"], pch=20, cex=0.1)
text(0.65, 0.25, "Trump wins NJ", col="darkred", cex=0.8)
text(0.35, 0.25, "Trump loses NJ", col="black", cex=0.8)

The scatterplot is too dense to read at its center, so I'll just pick 1000 of the simulations at random and graph them:

subset <- sample(n_sims, 1000)
rng <- range(trump_share[,c("NJ", "AK")])
plot(rng, rng, xlab="Trump vote share in New Jersey", ylab="Trump vote share in Alaska", main="Only 1000 simulation draws", cex.main=0.9, bty="l", type="n")
polygon(c(0.5,0.5,1,1), c(0,1,1,0), border=NA, col="pink")
points(trump_share[subset,"NJ"], trump_share[subset,"AK"], pch=20, cex=0.1)
text(0.65, 0.25, "Trump wins NJ", col="darkred", cex=0.8)
text(0.35, 0.25, "Trump loses NJ", col="black", cex=0.8)

Here's the correlation:

> round(cor(trump_share[,"AK"], trump_share[,"NJ"]), 2)
[1] 0.03

But from the graph with 40,000 simulations above, it appears that the correlation is negative in the tails. Go figure.
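
It's easy enough to check that impression numerically with the objects defined above; the 0.45 cutoff is just an arbitrary choice of where "the tail" starts:

# Correlation restricted to the draws where Trump does unusually well in NJ
# (the 0.45 cutoff is arbitrary)
in_tail <- trump_share[,"NJ"] > 0.45
sum(in_tail)   # how many draws are out there
round(cor(trump_share[in_tail,"AK"], trump_share[in_tail,"NJ"]), 2)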

OK, fine. I only happened to look at Alaska because it was first on the list. Let's look at a state right next to New Jersey, a swing state . . . Pennsylvania.

> round(mean(trump_wins[,"PA"] [trump_wins[,"NJ"]]), 2)
[1] 0.39
> round(mean(trump_wins[,"PA"] [biden_wins[,"NJ"]]), 2)
[1] 0.13

OK, so in the (highly unlikely) event that Trump wins in New Jersey, his win probability in Pennsylvania goes up from 13% to 39%. A factor of 3! But . . . it's not enough. Not nearly enough. Currently the Fivethirtyeight model gives Trump a 13% chance to win in PA. Pennsylvania's a swing state. If Trump wins in NJ, then something special's going on, and Pennsylvania should be a slam dunk for the Republicans.

OK, time to look at the scatterplot:

The simulations for Pennsylvania and New Jersey are correlated. Just not enough. At least, this still doesn't look quite right. I think that if Trump were to do 10 points better than expected in New Jersey, he'd be the clear favorite in Pennsylvania.

Here's the correlation:

> round(cor(trump_share[,"PA"], trump_share[,"NJ"]), 2)
[1] 0.43

So, sure, if the correlation is only 0.43, it almost kinda makes sense. Shift Trump from 40% to 50% in New Jersey, and the expected shift in Pennsylvania from these simulations would be only 0.43 * 10%, or 4.3%. But Fivethirtyeight is predicting Trump to get 47% in Pennsylvania, so adding 4.3% would take him over the top, at least in expectation. Why, then, is the conditional probability, Pr(Trump wins PA | Trump wins NJ), only 39%, and not over 50%? Again, there's something weird going on in the tails. Look again at the plot just above: in the center of the range, x and y are strongly correlated, but in the tails, the correlation goes away. Some sort of artifact of the model.
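
Here's a quick way to check that back-of-the-envelope calculation against the simulations that are already loaded; the linear fit is just a rough summary, nothing here depends on its exact form:

# Slope of PA share on NJ share in the simulations, and the implied PA share
# if Trump were at exactly 50% in NJ (a rough linear extrapolation)
fit_nj_pa <- lm(trump_share[,"PA"] ~ trump_share[,"NJ"])
round(coef(fit_nj_pa), 2)
round(sum(coef(fit_nj_pa) * c(1, 0.5)), 3)   # predicted PA share at NJ = 0.50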

What about Pennsylvania and Wisconsin?

> round(mean(trump_wins[,"PA"] [trump_wins[,"WI"]]), 2)
[1] 0.61
> round(mean(trump_wins[,"PA"] [biden_wins[,"WI"]]), 2)
[1] 0.06

These make more sense. The correlation of the simulations between these two states is a healthy 0.81, and here's the scatterplot:

Alabama and Mississippi also have a strong dependence and give similar results.

At this point I graphed the correlation matrix of all 50 states. But that was too much to read, so I picked a few states:

some_states <- c("AK","WA","WI","OH","PA","NJ","VA","GA","FL","AL","MS")

I ordered them roughly from west to east and north to south and then plotted them:

cor_mat <- cor(trump_share[,some_states])
image(cor_mat[,rev(1:nrow(cor_mat))], xaxt="n", yaxt="n")
axis(1, seq(0, 1, length=length(some_states)), some_states, tck=0, cex.axis=0.8)
axis(2, seq(0, 1, length=length(some_states)), rev(some_states), tck=0, cex.axis=0.8, las=1)

And here's what we see:

Correlations are higher for nearby states. That makes sense. New Jersey and Alaska are far away from each other.

But . . . hey, what's up with Washington and Mississippi? If NJ and AK have a correlation that's essentially zero, does that mean that the forecast correlation for Washington and Mississippi is . . . negative?

Indeed:

> round(cor(trump_share[,"WA"], trump_share[,"MS"]), 2)
[1] -0.42

And:

> round(mean(trump_wins[,"MS"] [trump_wins[,"WA"]]), 2)
[1] 0.31
> round(mean(trump_wins[,"MS"] [biden_wins[,"WA"]]), 2)
[1] 0.9

If Trump were to pull off the upset of the century and win Washington, it seems that his prospects in Mississippi wouldn't be so great.

For reals? Let's try the scatterplot:

rng <- range(trump_share[,c("WA", "MS")])
plot(rng, rng, xlab="Trump vote share in Washington", ylab="Trump vote share in Mississippi", main="40,000 simulation draws", cex.main=0.9, bty="l", type="n")
polygon(c(0.5,0.5,1,1), c(0,1,1,0), border=NA, col="pink")
points(trump_share[,"WA"], trump_share[,"MS"], pch=20, cex=0.1)
text(0.65, 0.3, "Trump wins WA", col="darkred", cex=0.8)
text(0.35, 0.3, "Trump loses WA", col="black", cex=0.8)

What the hell???

So . . . what's happening?

My original conjecture was that the Fivethirtyeight team was adding independent long-tailed errors to the states, and the independence was why you could get artifacts such as the claim that Trump could win California but still lose the national election.

But, after looking more carefully, I think that's part of the story---see the NJ/PA graph above---but not the whole thing. Also, lots of the between-state correlations in the simulations are low, even sometimes negative. And these low correlations, in turn, explain why the tails are so wide (leading to high estimates of Biden winning Alabama etc.): If the Fivethirtyeight team was tuning the variance of the state-level simulations to get an uncertainty that seemed reasonable to them at the national level, then they'd need to crank up those state-level uncertainties, as these low correlations would cause them to mostly cancel out in the national averaging. Increase the between-state correlations and you can decrease the variance for each state's forecast and still get what you want at the national level.
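
To see why the low correlations force the state-level variances up, here's a toy calculation, not anything from the Fivethirtyeight or Economist models: the standard deviation of a simple average of n equally correlated errors. The particular numbers (sigma of 3 percentage points, rho of 0.75 vs. 0.2) are made up for illustration.

# sd of an average of n errors, each with sd sigma and pairwise correlation rho
national_sd <- function(sigma, rho, n=50) sigma * sqrt((1 + (n-1)*rho) / n)
national_sd(sigma=0.03, rho=0.75)   # high between-state correlation
national_sd(sigma=0.03, rho=0.20)   # low between-state correlation: more cancellation
# state-level sd needed with rho=0.2 to match the national sd you get with rho=0.75
0.03 * national_sd(0.03, 0.75) / national_sd(0.03, 0.20)

The real national number is a weighted average of state results rather than a simple average, but the qualitative point is the same: the lower the between-state correlations, the more the state errors cancel, so each state's error has to be bigger to produce a given amount of national uncertainty.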

But what about those correlations? Why do I say that it's unreasonable to have a correlation of -0.42 between the election outcomes of Mississippi and Washington? It's because the uncertainty doesn't work that way. Sure, Mississippi's nothing like Washington. That's not the point. The point is, where's the uncertainty in the forecast coming from? It's coming from the possibility that the polls might be way off, and the possibility that there could be a big swing during the final weeks of the campaign. We'd expect a positive correlation for each of these, especially if we're talking about big shifts. If we were really told that Trump won Washington, then, no, I don't think that should be a sign that he's in trouble in Mississippi. I wouldn't assign a zero correlation to the vote outcomes in New Jersey and Pennsylvania either.

Thinking about it more . . . I guess the polling errors in the states could be negatively correlated. After all, in 2016 the polling errors were positive in some states and negative in others; see Figure 2 of our "19 things" article. But I'd expect shifts in opinion to be largely national, not statewide, and thus with high correlations across states. And big errors . . . I'd expect them to show some correlation, even between New Jersey and Alaska. Again, I'd think the knowledge that Trump won New Jersey or Washington would come along with a national reassessment, not just some massive screw-up in that state's polls.

In any case, Fivethirtyeight's correlation matrix seems to be full of artifacts. Where did the weird correlations come from? I have no idea. Maybe there was a bug in the code, but more likely they just took a bunch of state-level variables and computed their correlation matrix, without thinking carefully about how this related to the goals of the forecast and without looking too carefully at what was going on. In the past few months, we and others have pointed out various implausibilities in the Fivethirtyeight forecast (such as that notorious map where Trump wins New Jersey but loses all the other states), but I guess that once they had their forecast out there, they didn't want to hear about its problems.

Or maybe I messed up in my data wrangling somehow. My code is above, so feel free to take a look and see.

As I keep saying, these models have lots of moving parts and it's hard to keep track of all of them. Our model isn't perfect either, and even after the election is over it can be difficult to evaluate the different forecasts.

One thing this exercise demonstrates is the benefit of putting your granular inferences online. If you're lucky, some blogger might analyze your data for free!

Why go to all this trouble?

Why go to all the above effort rooting around in the bowels of some forecast?

A few reasons:

1. I was curious.

2. It didn't take very long to do the analysis. But it did then take another hour or so to write it up. Sunk cost fallacy and all that. Perhaps next time, before doing this sort of analysis, I should estimate the writing time as well. Kinda like how you shouldn't buy a card on the turn if you're not prepared to stay in if you get the card you want.

3. Teaching. Yes, I know my R code is ugly. But ugly code is still much more understandable than no code. I feel that this sort of post does a service, in that it provides a model for how we can do real-time data analysis, even if in this case the data are just the output from somebody else's black box.

No rivalry

Let me emphasize that we're not competing with Fivethirtyeight. I mean, sure the Economist is competing with Fivethirtyeight, or with its parent company, ABC News---but I'm not competing. So far the Economist has paid me $0. Commercial competition aside, we all have the same aim, which is to assess uncertainty about the future given available data.

I want both organizations to do the best they can do. The Economist has a different look and feel from Fivethirtyeight---just for example, you can probably guess which of these has the lead story, "Who Won The Last Presidential Debate? We partnered with Ipsos to poll voters before and after the candidates took the stage.", and which has a story titled, "Donald Trump and Joe Biden press their mute buttons. But with 49m people having voted already, creditable performances in the final debate probably won’t change much." But, within the constraints of their resources and incentives, there are always possibilities for improvement.

P.S. There's been a lot of discussion in the comments about Mississippi and Washington, which is fine, but the issue is not just with those two states. It's with lots of states with weird behavior in the joint distribution, such as New Jersey and Alaska, which was where we started. According to the Fivethirtyeight model, Trump is expected to lose big in New Jersey and is strong favorite, with a 80% chance of winning, in Alaska. But the model also says that if Trump were to win in New Jersey, that his chance of winning in Alaska would drop to 58%! That can't be right. At least, it doesn't seem right.

And, again, when things don't seem right, we should examine our model carefully. Statistical forecasts are imperfect human products. It's no surprise that they can go wrong. The world is complicated. When a small group of people puts together a complicated model in a hurry, I'd be stunned if it didn't have problems. The models that my collaborators and I build all have problems, and I appreciate when people point these problems out to us. I don't consider it an insult to the Fivethirtyeight team to point out problems in their model. As always: we learn from our mistakes. But only when we're willing to do so.

P.P.S. Someone pointed out this response from Nate Silver:

Our [Fivethirtyeight's] correlations actually are based on microdata. The Economist guys continually make weird assumptions about our model that they might realize were incorrect if they bothered to read the methodology.

I did try to read the methodology but it was hard to follow. That's not Nate's fault; it's just hard to follow any writeup. Lots of people have problems following my writeups too. That's why it's good to share code and results. One reason we had to keep guessing about what they were doing at Fivethirtyeight is that the code is secret and, until recently, I wasn't aware of simulations of the state results. I wrote the above post because once I had those simulations I could explore more.

In that same thread, Nate also writes:

I do think it's important to look at one's edge cases! But the Economist guys tend to bring up stuff that's more debatable than wrong, and which I'm pretty sure is directionally the right approach in terms of our model's takeaways, even if you can quibble with the implementation.

I don't really know what he means by "more debatable than wrong." I just think that (a) some of the predictions from their model don't make sense, and (b) it's not a shock that some of the predictions don't make sense, as that's how modeling goes in the real world.

Also, I don't know what he means by "directionally the right approach in terms of our model's takeaways." His model says that, if Trump wins New Jersey, that he only has a 58% chance of winning Alaska. Now he's saying that this is directionally the right approach. Does that mean that he thinks that, if Trump wins New Jersey, that his chance of winning in Alaska goes down, but maybe not to 58%? Maybe it goes down from 80% to 65%? Or from 80% to 40%? The thing is, I don't think it should go down at all. I think that if things actually happen so that Trump wins in New Jersey, that his chance of winning Alaska should go up.

What seems bizarre to me is that Nate is so sure about this counterintuitive result, that he's so sure it's "directionally the right approach." Again, his model is complicated. Lots of moving parts! Why is it so hard to believe that it might be messing up somewhere? So frustrating.

P.P.P.S. Let me say it again: I see no rivalry here. Nate's doing his best, he has lots of time and resource constraints, he's managing a whole team of people and also needs to be concerned with public communication, media outreach, etc.

My guess is that Nate doesn’t really think that a NJ win for Trump would make it less likely for him to win Alaska; it’s just that he’s really busy right now and he’d rather reassure himself that his forecast is directionally the right approach than worry about where it’s wrong. As I well know, it can be really hard to tinker with a model without making it worse. For example, he could increase the between-state correlations by adding a national error term, or by adding national and regional error terms, but then he’d have to decrease the variance within each state to compensate, and then there are lots of things to check, lots of new ways for things to go wrong---not to mention the challenge of explaining to the world that you’ve changed your forecasting method. Simpler, really, to just firmly shut that Pandora’s box and pretend it had never been opened.

I expect that sometime after the election's over, Nate and his team will think about these issues more carefully and fix their model in some way. I really hope they go open source, but even if they keep it secret, as long as they release their predictive simulations we can look at the correlations and try to help out.

Similarly, they can help us out. If there are any particular predictions from our model that Nate thinks don’t make sense, he should feel free to let us know, or post it somewhere we will find it. A few months ago he commented that our probability of Biden winning the popular vote seemed too high. We looked into it and decided that Nate and other people who’d made that criticism were correct, and we used that criticism to improve our model; see the "Updated August 5th, 2020" section at the bottom of this page. And our model remains improvable.

Let me say this again: the appropriate response to someone pointing out a problem with your forecasts is not to label the criticism as a "quibble" that is "more debatable than wrong" or to say that you're "directionally right," whatever that means. How silly that is! Informed criticism is a blessing! You're lucky when you get it, and you should use that criticism as an opportunity to learn and to do better.

Merlin did some analysis of possible electoral effects of rejections of vote-by-mail ballots . . .

Elliott writes:

Postal voting could put America’s Democrats at a disadvantage: Rejection rates for absentee ballots have fallen since 2016, but are higher for non-whites than whites

The final impact of a surge in postal voting will not be known until weeks after the election. Yet North Carolina, a closely contested state, releases detailed data on ballots as they arrive. So far, its figures suggest that a tarnished election is unlikely—but that Democrats could be hurt by their disproportionate embrace of voting by mail. . . .

The Tar Heel state has received eight times as many postal votes as it had by this point in 2016. Despite fears about first-time absentee voters botching their ballots, the share that are rejected has in fact fallen to 1.3%, from 2.6% in 2016. This is probably due in part to campaigns educating supporters on voting by mail, and also to new efforts by the state to process such ballots.

However, these gains have been concentrated among white and richer voters, causing North Carolina’s already large racial gap in rejection rates to widen. In 2016 black voters sent in 10% of postal ballots, but 18% of discarded ones. This year, those shares are 17% and 42%. That hurts Democrats, who rely on black voters’ support. . . .

Partisan differences over voting by mail exacerbate this effect. In the past, Democrats and Republicans were equally likely to do so. But polling by YouGov now shows that 51% of likely Democratic voters plan to vote absentee, compared with 32% of Republicans. Extrapolating North Carolina’s patterns nationwide, a model built by Merlin Heidemanns of Columbia University finds that 0.7% of ballots intended for Joe Biden, the Democrats’ presidential nominee, will be rejected postal votes, versus 0.3% of those cast for Donald Trump. . . .

Kyle Hausmann saw the above-linked article and asked if we had any thoughts on how impactful that might be on the election outcome. He also asked “whether or not trends in ballot rejection might already be implicitly baked into your economist forecast, simply by virtue of the rejected ballots not being included in the historical voter count data.”

Merlin replied:

Elliott and I expect the overall number of rejected ballots to be higher and for Democrats to be disproportionately negatively affected because rejection rates are larger within the groups that tend to vote for them and because they are more likely to vote absentee to begin with. While an equity issue, we don’t expect this to meaningfully affect the outcome of the election given that it primarily affects states that are safely Democrat aside from one or two that are mentioned in the article. I did some work on this for USA Today and did some further exploratory analysis of the NC data here and some raw numbers of rejections by ethnic group based on 2016 data here.

It’s not baked into our forecast because vote-by-mail numbers will be at a historic high this year.

Interactive analysis needs theories of inference

Jessica Hullman and I wrote an article that begins,

Computer science research has produced increasingly sophisticated software interfaces for interactive and exploratory analysis, optimized for easy pattern finding and data exposure. But assuming that identifying what’s in the data is the end goal of analysis misrepresents strong connections between exploratory and confirmatory analysis and contributes to shallow analyses. We discuss how the concept of a model check unites exploratory and confirmatory analysis, and review proposed Bayesian and classical statistical theories of inference for visual analysis in light of this view. Viewing interactive analysis as driven by model checks suggests new directions for software, such as features for specifying one’s intuitive reference model, including built-in reference distributions and graphical elicitation of parameters and priors, during exploratory analysis, as well as potential lessons to be learned from attempting to build fully automated, human-like statistical workflows.

Jessica provides further background:

Tukey’s notion of exploratory data analysis (EDA) has had a strong influence on how interactive systems for data analysis are built. But the assumption has generally been that exploratory analysis precedes model fitting or checking, and that the human analyst can be trusted to know what to do with any patterns they find. We argue that the symbiosis of analyst and machine that occurs in the flow of exploratory and confirmatory statistical analysis makes it difficult to make progress on this front without considering what’s going on, and what should go on, in the analyst’s head. In the rest of the paper we do the following:

– We point out ways that optimizing interactive analysis systems for pattern finding and trusting the user to know best can lead to software that conflicts with goals of inference. For example, interactive systems like Tableau default to aggregating data to make high level patterns more obvious but this diminishes some people’s acknowledgment of variation. Researchers evaluate interactive visualizations and systems based on how well people can read data, how much they like using the system, or how evenly they distribute their attention across data, not how good their analysis or decisions are. Various algorithms for progressive computation or privacy preservation treat the dataset as though it is an object of inherent interest without considering its use in inference.

– We propose that a good high level understanding frames interactive visual analysis as driven by model checks. The idea is that when people are “exploring” data using graphics, they are implicitly specifying and fitting pseudo-statistical models, which produce reference distributions to compare to data. This makes sense because the goal of EDA is often described in terms of finding the unexpected, but what is unexpected is only defined via some model or view of how the world should be. In a Bayesian formulation (following Gelman 2003, 2004, our primary influence for this view), the reference distribution is produced by the posterior predictive distribution. So looking at graphics is like doing posterior predictive checks, where we are trying to get a feel for the type and size of discrepancies so we can decide what to do next. We like this view for various reasons, including because (1) it aligns with the way that many exploratory graphics get their meaning from implicit reference distributions, like residual plots or Tukey’s “hanging rootograms”; (2) it allows us to be more concrete about the role prior information can play in how we examine data; and (3) it suggests that to improve tools for interactive visual analysis we should find ways to make the reference models more explicit so that our graphics better exploit our abilities to judge discrepancies, such as through violations of symmetry, and the connection between exploration and confirmation is enforced.

– We review other proposed theories for understanding graphical inference: Bayesian cognition, visual analysis as implicit hypothesis testing, multiple comparisons. The first two can be seen as subsets of the Bayesian model checking formulation, and so can complement our view.

– We discuss the implications for designing software. While our goal is not to lay out exactly what new features should be added to systems like Tableau, we discuss some interesting ideas worth exploring more, like how the user of an interactive analysis system could interact with graphics to sketch their reference distribution or make graphical selections of what data they care about and then choose between options for their likelihood, specify their prior graphically, see draws from their model, etc. The idea is to brainstorm how our usual examinations of graphics in exploratory analysis could more naturally pave the way for increasingly sophisticated model specification.

– We suggest that by trying to automate statistical workflows, we can refine our theories. Sort of like the saying that if you really want to understand some topic, you should teach it. If we were to try to build an AI that can do steps like identify model misfits and figure out how to improve the model, we’d likely have more ideas about what sorts of features our software should offer people.

– We conclude with the idea that the framing of visualization as model checking relates to ideas we’ve been thinking about recently regarding data graphics as narrative storytelling.

P.S. Ben Bolker sent in the above picture of Esmer helping out during a zoom seminar.

Follow-up on yesterday’s posts: some maps are less misleading than others.

Yesterday I complained about the New York Times coronavirus maps showing sparsely-populated areas as having a case rate very close to zero, no matter what the actual rate is. Today the Times has a story about the fact that the rate in rural areas is higher than in more densely populated areas, and they have maps that show the rate in sparsely populated areas! 

I’m not sure what is going on with these choices. It does make sense to me to show only rural areas if you are doing a story on the case rate in rural areas, and it would make sense to me to show only urban areas if you were doing a story on the case rate in urban areas, but neither of these make sense to me as a country-wide default. (It’s also a bit strange to me that they changed the scale, showing average cases per million on the new plot with numbers up to about 800; while showing average cases per 100,000 on the other plot, with numbers up to about 64, which is 640 per million. These are not wildly different and could work fine on the same scale.)

I could imagine leaving some areas blank if there are literally no permanent residents there — National Wilderness and National Forest, for instance — but if they are going to do that, they should not use the same color for ‘zero population density’ that they use for ‘zero coronavirus case rate’. These mean different things. That’s what I really dislike about the other plot: the same color is used for low-population areas, independent of the rate. Everywhere else on the map the color means “rate”, and then there are these huge sections where the color means “population density.” On this one, at least they use different colors for the places where they aren’t showing us the data (white) and where the rate is low (gray). So, of the two, this one is better. But I think they should just combine the two plots.

“Election Forecasting: How We Succeeded Brilliantly, Failed Miserably, or Landed Somewhere in Between”

I agreed to give a talk in December for Jared, and this is what I came up with:

Election Forecasting: How We Succeeded Brilliantly, Failed Miserably, or Landed Somewhere in Between

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Several months before the election we worked with The Economist magazine to build a presidential election forecasting model combining national polls, state polls, and political and economic fundamentals. This talk will go over how the forecast worked, our struggles in evaluating and improving it, and more general challenges of communicating data-based forecasts. For some background, see this article.

Actually, the title is misleading. Our model could fail miserably (for example, if Joe Biden wins Alabama, which we say has less than a 1% chance of happening) or it could land somewhere in between (for example, if Biden wins the electoral college but with just 51% of the popular vote, which is at the edge of our forecast interval) but it can’t really succeed brilliantly. Even if our model “correctly predicts 49 states” or whatever, that’s as much luck as anything else, as our estimates have margins of error. That’s one reason why, many years ago, my colleague and I decided not to put more effort into election forecasting: it’s a game where you can’t win big but you can lose big.

Anyway, I’ll be able to say more about all this in a couple weeks.

An odds ratio of 30, which they (sensibly) don’t believe

Florian Wickelmaier and Katharina Naumann write:

In a lab course, we came across a study on the influence of “hemispheric activation” on the framing effect in decision making by Todd McElroy and John J. Seta [Brain and Cognition 55 (2004) 572-580, doi:10.1016/j.bandc.2004.04.002]:

Two experiments were conducted to determine whether the functional specializations of the left and the right hemispheres would produce different responses to a traditional framing task. In Experiment 1, a behavioral task of finger tapping was used to induce asymmetrical activation of the respective hemispheres. In Experiment 2, a monaural listening procedure was used. In both experiments, the predicted results were found. Framing effects were found when the right hemisphere was selectively activated whereas they were not observed when the left hemisphere was selectively activated.

Two aspects of this study reminded us of recurring topics in your blog. [No, it was not cats, John Updike, or Jamaican beef patties; sorry! — AG]

Use of buzzwords: Why call it “hemispheric activation” when what participants did in Exp. 1 was tapping with their left versus right hand? This is a bit like saying “upper-body strength” instead of fat arms.

Unrealistic effect size: A 30-fold increase in the framing effect when tapping with your left hand (“right hemisphere activated”)? Sounds like a lot. Even the foreign-language researchers claimed only a 2-fold increase (https://statmodeling.stat.columbia.edu/2015/09/22/i-do-not-agree-with-the-view-that-being-convinced-an-effect-is-real-relieves-a-researcher-from-statistically-testing-it/). Maybe it was a combination of low power and selection by significance that rendered so large an effect?

Here are the original data (Tab. 1):

         right-hand tapping     left-hand tapping
         safe     risky         safe     risky
gain        8         4           12         1
loss        7         4            3         9

With right-hand tapping, the odds ratio is 8/4/(7/4) = 1.1 (no framing effect). With left-hand tapping, it is 12/1/(3/9) = 36. So the ratio of odds ratios is about 30.

We asked our students to try to replicate the experiment. We used an Edlin factor of about 0.1 for sample size calculation. Our data are 52/31/(26/57) = 3.7 with right-hand tapping and 56/27/(30/53) = 3.7 with left-hand tapping. The effect has vanished in the larger sample.

We think this makes a useful teaching example as it illustrates the now well-known limitations of a small-scale study with flashy results. We also see some progress because students increasingly become aware of these limitations and get the chance to learn how to avoid them in the future.
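
Their arithmetic checks out. Here's the calculation in R for anyone who wants to follow along; the little helper function is mine, not theirs:

# Odds ratios from the 2x2 tables above:
# (gain safe / gain risky) / (loss safe / loss risky)
odds_ratio <- function(gain_safe, gain_risky, loss_safe, loss_risky) {
  (gain_safe / gain_risky) / (loss_safe / loss_risky)
}
or_right <- odds_ratio(8, 4, 7, 4)    # right-hand tapping, about 1.1
or_left  <- odds_ratio(12, 1, 3, 9)   # left-hand tapping, 36
round(or_left / or_right, 1)          # ratio of odds ratios (roughly 30, as described above)
odds_ratio(52, 31, 26, 57)            # replication, right-hand tapping, about 3.7
odds_ratio(56, 27, 30, 53)            # replication, left-hand tapping, about 3.7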

This reminds me of the 50 shades of gray study.

I agree that it’s good for students to be able to do these replication experiments themselves. Also good that we can start with a default skepticism about such claims, rather than having to first find some major problem in the study. Attention given to pizzagate-like irregularities should not distract us from the larger problem of hardworking scientists using bad research methods and getting bad conclusions. Remember, honesty and transparency are not enuf.

All maps of parameter estimates are (still) misleading

I was looking at this map of coronavirus cases, pondering the large swaths with seemingly no cases. I moused over a few of the gray areas. The shading is not based on counties, as I assumed, but on some other spatial unit, perhaps zip codes or census blocks or something. (I’m sure the answer is available if I click around enough).  Thing is, I doubt that all of the cases in the relatively low-population areas in the western half of the country are concentrated in those little shaded areas. I suspect those are where the tests are performed, or similar, not the locations of the homes of the infected people. [Added later: Carlos Ungil points out that there was indeed a link, just below the map, that says “For per capita: Parts of a county with a population density lower than 10 people per square mile are not shaded.”] 

I’m well aware that all maps of parameter estimates are misleading (one of my favorite papers), but I think the way in which this map is misleading may be worse than some of the alternatives, such as coloring the entire county. Yes, coloring the whole county would give a false impression of spatial uniformity for some of those large counties, but I think that’s better than the current false impression of zero infection rates in a large swath of the country. In terms of cases per 100,000 Nevada is much worse than Ohio but it sure doesn’t look like that on the map. [Note: I originally said ‘Illinois’ but either that was a mistake, pointed out by Carlos Ungil, or it changed when the map was updated in the past hour].  

 

[Map caption: Many western states appear to have low case rates but actual rates are not low]

Hiring at all levels at Flatiron Institute’s Center for Computational Mathematics

We’re hiring at all levels at my new academic home, the Center for Computational Mathematics (CCM) at the Flatiron Institute in New York City.

We’re going to start reviewing applications January 1, 2021.

A lot of hiring

We’re hoping to hire many people for each of the job ads. The plan is to grow CCM from around 30 people to around 60, which will involve hiring 20 or 30 more postdocs and permanent research staff over the next few years!!! Most of those hires are going to be on the machine learning and stats side.

What we do

I’m still working on Stan and computational stats and would like to hire some more people to work on computational stats and probabilistic programming. There’s also lots of other fascinating work going on in the center, including equivariant neural networks for respecting physical constraints, spike sorting for neural signal classification, Markov chain Monte Carlo for molecular dynamics, phase retrieval (Fourier inverse problem) for cryo electron microscopy, and lots of intricate partial differential equation solvers being developed for challenging problems like cellular fluid dynamics or modeling light flow in the visual system of a wasp.

Plus, there’s a lot of communication across centers. There are Stan users in both the astro and bio centers, working on things like the LIGO project for gravitational waves, ocean biome factor modeling, and protein conformation estimation. If you like science, stats, and computation, it’s a great place to hang out.

The mission

The Flatiron Institute is unusual for an academic institution in that it’s focused squarely on computation.

The mission of the Flatiron Institute is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation.

The motivation remains to fill a gap in scientific software development that’s not supported well by research grants, academia, or industry.

The job listings

The job listings are for two postdoc-level positions (called “fellows”) and an open-rank faculty-level position (your choice of a “research scientist” or “data scientist” title). Please keep in mind that we’re hiring for all of these positions in bulk over the next few years across the range of center interests.

If you’re interested in comp stats and are going to apply, please drop me a line directly at bcarpenter@flatironinstitute.org.

“Model takes many hours to fit and chains don’t converge”: What to do? My advice on first steps.

The above question came up on the Stan forums, and I replied:

Hi, just to give some generic advice here, I suggest simulating fake data from your model and then fitting the model and seeing if you can recover the parameters. Since it’s taking a long time to run, I suggest just running your 4 parallel chains for 100 warmup and 100 saved iterations and set max treedepth to 5. Just to get things started, cos you don’t want to be waiting for hours every time you debug the model. That’s like what it was like when I took a computer science class in 1977 and we had to write our code on punch cards and then wait hours for it to get run through the computer.
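
For concreteness, here's roughly what those settings look like if you're fitting the model from R with rstan; the file and data names are placeholders, not anything from the original question:

library("rstan")

# Short exploratory run: 4 chains, 100 warmup + 100 saved iterations each,
# max treedepth capped at 5, just to see whether the model is behaving at all.
# "my_model.stan" and fake_data are placeholders for your model and simulated data.
fit_quick <- stan(
  file = "my_model.stan",
  data = fake_data,
  chains = 4,
  cores = 4,
  warmup = 100,
  iter = 200,              # total iterations per chain, including warmup
  control = list(max_treedepth = 5)
)
print(fit_quick)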

P.S. Commenter Gec elaborates:

In my [the commenter’s] experience, I treat an inefficient model as a sign that I don’t really understand the model. Of course, my lack of understanding might be “shallow” in that I just coded it wrong or made a typo. But typically my lack of understanding runs deeper, in that I don’t understand how parameters trade off with one another, whether they lead to wonky behavior in different ranges of values, etc.

While there is no one route to improving this understanding, some of it can come from finding analytic solutions to simplified/constrained versions of the full model. A lot comes from running simulations, since this gives insight into how the model’s behavior (i.e., patterns of data) relate to its parameter settings. For example, I might discover that a fit is taking a long time because two parameters, even if they are logically distinct, end up trading off with one another. Or that, even if two parameters are in principle identifiable, the particular data being fit doesn’t distinguish them.

It might seem like these model explorations take a long time, and they do! But I think that time is better spent building up this understanding than waiting for fits to finish.

Exactly. Workflow, baby, workflow.

Piranhas in the rain: Why instrumental variables are not as clean as you might have thought

Woke up in my clothes again this morning
I don’t know exactly where I am
And I should heed my doctor’s warning
He does the best with me he can
He claims I suffer from delusion
But I’m so confident I’m sane
It can’t be a statistical illusion
So how can you explain
Piranhas in the rain
And if you see us on the corner
We’re just dancing in the rain
I tell my friends there when I see them
Outside my window pane
Piranhas in the rain.
— Sting (almost)

Gaurav Sood points us to this article by Jonathan Mellon, “Rain, Rain, Go away: 137 potential exclusion-restriction violations for studies using weather as an instrumental variable,” which begins:

Instrumental variable (IV) analysis assumes that the instrument only affects the dependent variable via its relationship with the independent variable. Other possible causal routes from the IV to the dependent variable are exclusion-restriction violations and make the instrument invalid. Weather has been widely used as an instrumental variable in social science to predict many different variables. The use of weather to instrument different independent variables represents strong prima facie evidence of exclusion violations for all studies using weather as an IV. A review of 185 social science studies (including 111 IV studies) reveals 137 variables which have been linked to weather, all of which represent potential exclusion violations. I conclude with practical steps for systematically reviewing existing literature to identify possible exclusion violations when using IV designs.

That sounds about right.

This story reminds me of when we were looking at the notorious ovulation-and-voting study and we realized that the evolutionary psychology and social priming literatures are just loaded with potential confounders:

But the papers on ovulation and voting, shark attacks and voting, college football and voting, etc., don’t just say that voters, or some voters, are superficial and fickle. No, these papers claim that seemingly trivial or irrelevant factors have large and consistent effects, and that I don’t believe. I do believe that individual voters can be influenced by these silly things, but I don’t buy the claim that these effects are predictable in that way. The problem is interactions. For example, the effect on my vote of the local college football team losing could depend crucially on whether there’s been a shark attack lately, or on what’s up with my hormones on election day. Or the effect could be positive in an election with a female candidate and negative in an election with a male candidate. Or the effect could interact with parent’s socioeconomic status, or whether your child is a boy or a girl, or the latest campaign ad, etc.

This is also related to the piranha problem. If you take these applied literatures seriously, you’re led to the conclusion that there are dozens of large effects floating around, all bumping against each other.

Or, to put it another way, the only way you can believe in any one of these studies is if you don’t believe in any of the others.

It’s like religion. I can believe in my god, but only if I think that none of your gods exist.

The nudgelords won’t be happy about this latest paper, as it raises the concern that any nudge they happen to be studying right now is uncomfortably interacting with dozens of other nudges unleashed upon the world by other policy entrepreneurs.

Maybe they could just label this new article as Stasi or terrorism and move on to their next NPR appearance?

Presidents as saviors vs. presidents as being hired to do a job

There’s been a lot of talk about how if Biden is elected president it will be a kind of relief, a return to problem solving and dialing down of tension. This is different from Obama, who so famously inspired all that hope, and it made me think about characterizing other modern presidents in this way:

Saviors: Trump, Obama, Clinton, Reagan, Roosevelt

Hired to do or continue a job: Bush 2, Bush 1, Nixon, Johnson, Truman

I’m not quite sure how I’d characterize the other elected presidents from that era: Carter, Kennedy, Eisenhower. Carter in retrospect doesn’t have a savior vibe, but he was elected as a transformative outlier. Kennedy looms large in retrospect but it’s not clear that he was considered as a savior when he was running for president. Eisenhower I’m not sure about either.

Another complication is that there have been changes in Congress at the same time. There was the radicalism of the post-1974 reform movement, the 1994 Newt Gingrich revolution, and then the locked-in partisanship of congressional Republicans since 2010, all of these which can be considered both as responses to executive overreach by opposition presidents and which have motivated counterreactions.

Estimated “house effects” (biases of pre-election surveys from different pollsters) and here’s why you have to be careful not to overinterpret them:

Elliott provides the above estimates from our model. As we’ve discussed, as part of our fitting procedure we estimate various biases, capturing in different ways the fact that surveys are not actually random samples of voters from an “urn.” One of these biases is the “house effect.” In our model, everything’s on the logit scale, so we divide by 4 to get biases on the probability scale. The above numbers have already been divided by 4.
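
The divide-by-4 shortcut is just the fact that the logistic curve has slope 1/4 at 50%, so a small shift on the logit scale moves a near-50% probability by about a quarter of that amount. A quick illustration, with an arbitrary shift of 0.1 on the logit scale:

# Divide-by-4 rule: near 50%, a shift of delta on the logit scale moves the
# probability by roughly delta/4
delta <- 0.1
plogis(0 + delta) - plogis(0)   # exact change, starting from 50%
delta / 4                       # the approximation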

So we estimate the most biased polling organizations to range from about +2 Biden to +3 Trump, but with most of them between -1 and 1 percentage points. (This is in the scale of vote proportion, not vote margin.)

You can also see that there’s lots of uncertainty about the house effect for any given pollster. That’s because we used a weak prior, normal with mean 0 and sd 1.5, implying that a priori we didn’t say much about house effects except that they’re most likely less than 2 percentage points in either direction.

I have no reason to think that most of these biases represent any kind of political biases coming from the polling organizations. Rather, different orgs use different data collection methods and do different adjustments. So they’ll get different answers. In any given election cycle, these different approaches will have different biases, but with only a few polls it’s hard to pin down these biases with any precision, especially given that each poll has its own idiosyncratic bias as well, having to do with whatever was going on the day the survey was in the field.

Don’t overinterpret the chart!

It’s tempting to read the above graph and use it to rate the polls. Don’t do that!

Let me emphasize that a survey having a low estimated bias in the above chart does not necessarily mean it’s a better poll. I say this for two reasons:

1. The estimates of bias are really noisy! It would be possible to get an illusorily precise estimate of house effects by doing some simple averaging, but that would be wrong because it would not account for nonsampling errors that vary by poll.

2. Even if these estimates were precise, they’re just for the current 2020 campaign. Polling and adjustment strategies that work well this year might run into problems in other settings.

Whassup with the dots on our graph?

The above is from our Economist election forecast. Someone pointed out to me that our estimate is lower than all the dots in October. Why is that? I can come up with some guesses, but it’s surprising that the line is below all the dots.

Merlin replied:

That happened a bunch of times before as well. I’d assume that there are more pro Dem pollsters than pro Rep or unbiased pollsters. Plus those who don’t adjust should also be above the line. And then we are still getting dragged towards the fundamentals.

And Elliott wrote:

First, the line is not lower than all the dots; there are several polls (mainly from the internet panels that weight by party) that are coming in right on it.

Merlin’s email explains a lot of this, I think — our non-response correction is subtracting about 2 points from Dems on margin right now,

But perhaps a more robust answer is that state polls simply aren’t showing the same movement. That’s why 538 is also showing Biden up 7-8 in their model, versus the +10 in their polling average.

And I realized one other thing, which is that the difference between the dots and the line is not as big as it might appear at first. Looking at the spread, at first it looks like the line for Biden is at around 54%, and the dots are way above the line. But that’s misleading: you should roughly compare the line to the average of the dots. And the average of the recent blue dots is only about 1 percentage point higher than the blue line. So our model is estimating that recent national polls are only very slightly overestimating Biden’s support in the population. Yes, there’s that one point that’s nearly 5 percentage points above the line, and our model judges that to be a fluke—such things happen!—but overall the difference is small. I’m not saying it’s zero, and I wouldn’t want it to be zero; it’s just what we happen to see given the adjustments we’re using in the model.

P.S. Elliott also asked whether this is worth blogging about. I replied that everything’s worth blogging about. I blog about Jamaican beef patties.

Pre-register post-election analyses?

David Randall writes:

I [Randall] have written an article on how we need to (in effect) pre-register the election—preregister the methods we will use to analyze the voting, with an eye to determining if there is voter fraud.

I have a horrible feeling we’re headed to civil war, and there’s nothing that can be done about it—but I thought that this might be a feather in the wind to prevent a predictable post-voting-day wrangle that could tip the country over the edge.

I wanted to get this out a month ago, and ideally in some less political venue. It got caught with an editor for several weeks, and now there is much less time to do anything. Still, better late than never.

Here’s what Randall writes in his article:
Continue reading ‘Pre-register post-election analyses?’ »

Between-state correlations and weird conditional forecasts: the correlation depends on where you are in the distribution

Yup, here’s more on the topic, and this post won’t be the last, either . . .

Jed Grabman writes:

I was intrigued by the observations you made this summer about FiveThirtyEight’s handling of between-state correlations. I spent quite a bit of time looking into the topic and came to the following conclusions.

In order for Trump to win a blue state, either:

1.) That state must swing hard to the right or

2.) The nation must have a swing toward Trump or

3.) A combination of the previous two factors.

The conditional odds of winning the electoral college dependent on winning a state therefore are a statement about the relative likelihood of these scenarios.

Trump is quite unlikely to win the popular vote (538 has it at 5% and you at <1%), so the odds of Trump winning a deep blue state due to a large national swing are extremely low. Therefore, if the state's correlation with the nation is high, Trump would almost never win the state. In order for Trump to win the state a noticeable proportion of the time (say >0.5%), the correlation needs to be low enough that the state specific swing can get Trump to victory without making up the sizable national lead Biden currently has. This happens in quite a number of states in 538’s forecast, but also can be seen less frequently in The Economist’s forecast. For example, in your forecast Trump only wins New Mexico 0.5% of the time, but it appears that Biden wins nationally in a majority of those cases due to its low correlation with most swing states.

It is hard for me to determine what these conditional odds ought to be. If Trump needs to make up 15 points in New Mexico, I can’t say which of these implausible scenarios is more likely: That he makes up 6 points nationally and an additional 9 in New Mexico (likely losing the election) or that he makes up 9 nationally and an additional 6 in New Mexico (likely winning the election).

If you are interested in more on my thoughts, I recently posted an analysis of this issue on Reddit that was quite well received: Unlikely Events, Fat Tails and Low Correlation (or Why 538 Thinks Trump Is an Underdog When He Wins Hawaii).

I replied that yes, the above is just about right. For small swings, these conditional distributions depend on the correlations of the uncertainties between states. For large swings, they depend on higher-order moments of the joint distribution. I think that what Fivethirtyeight did was to start with highly correlated errors across states and then add a small bit of long-tailed error that was independent across states, or something like that. The result is that if the swing in a state is small, it will be highly predictive of swings in other states, but if the swing is huge, then it is most likely attributable to that independent long-tailed error term, and then it becomes not very predictive of the national swing. That’s how the Fivethirtyeight forecast can simultaneously say that Biden has a chance of winning Alabama and also say that if he wins Alabama, this doesn’t shift his national win probability much. It’s an artifact of these error terms being added in.

As I wrote in my post, such things happen: these models have so many moving parts that I would expect just about any model, including ours, to have clear flaws in its predictions somewhere or another.

Anyway, you can get some intuition about these joint distributions by doing some simulations in R or Python. It’s subtle because we’re used to talking about “correlation,” but when the uncertainties are not quite normally distributed and the tails come into play, it’s not just correlation that matters. Or, to put it another way, the correlation conditional on a state’s result being in the tail is different from the correlation conditional on it being near the middle of its distribution.
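
Here's the kind of quick simulation I have in mind. It's only a sketch of the general mechanism, with made-up numbers rather than anything taken from an actual forecast: a shared national error plus a small independent long-tailed error in each state. In the bulk the two states track each other reasonably well, but an extreme swing in one state is mostly attributed to its own long-tailed term and so says much less about the national swing.

# Toy version of the mechanism described above (all numbers made up):
# a shared national error plus independent long-tailed state errors
set.seed(123)
n <- 40000
national <- rnorm(n, 0, 0.025)             # shared national swing
state_1  <- national + 0.01 * rt(n, df=3)  # state swing = national + independent
state_2  <- national + 0.01 * rt(n, df=3)  #   long-tailed error

round(cor(state_1, state_2), 2)            # correlation over all the draws

# Fraction of state 1's swing attributed to the national term, for
# moderately large vs. extreme swings in state 1
moderate <- state_1 > 0.02 & state_1 < 0.04
extreme  <- state_1 > 0.10
round(mean(national[moderate]) / mean(state_1[moderate]), 2)
round(mean(national[extreme])  / mean(state_1[extreme]),  2)

With these made-up numbers the national term accounts for most of a garden-variety swing but a much smaller share of a really big one, which is the kind of pattern that lets a forecast say Biden might win Alabama while barely moving his national win probability when he does.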

It’s good to understand how our models work. One advantage of a complicated multivariate prediction is that it gives you so many things to look at that, if you look carefully enough, you’ll find a problem with just about any prediction method. Reality is more complicated than any model we can build—especially given all the shortcuts we take when modeling.

Once you’ve made a complicated prediction, it’s great to be able to make it public and get feedback, and to take that feedback seriously.

And once you recognize that your model will be wrong—as they say on the radio, once you recognize that you are a sinner—that’s the first step on the road to improvement.

Also the idea that the correlation depends on where you are in the distribution: that can be important sometimes.

Reference for the claim that you need 16 times as much data to estimate interactions as to estimate main effects

Ian Shrier writes:

I read your post on the power of interactions a long time ago and couldn’t remember where I saw it. I just came across it again by chance.

Have you ever published this in a journal? The concept comes up often enough and some readers who don’t have methodology expertise feel more comfortable with publications compared to blogs.

My reply:

Thanks for asking. I’m publishing some of this in Section 16.4 of our forthcoming book, Regression and Other Stories. So you can cite that.
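For what it’s worth, here’s the back-of-the-envelope version of the argument in R. This is my paraphrase of the reasoning, not a quote from the book; it assumes a balanced design with total sample size n and residual sd sigma:

sigma <- 1
n <- 1000
se_main <- sqrt(sigma^2/(n/2) + sigma^2/(n/2))   # main effect: difference of two means
se_interaction <- sqrt(4*sigma^2/(n/4))          # interaction: difference of differences across four cells
se_interaction / se_main                         # = 2: the interaction is estimated half as precisely

If you further assume the interaction is half the size of the main effect, its signal-to-noise ratio is 1/4 as large, and since standard errors shrink like 1/sqrt(n), matching the main effect’s precision requires 4^2 = 16 times as much data.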

Calibration problem in tails of our election forecast

Following up on the last paragraph of this discussion, Elliott looked at the calibration of our state-level election forecasts, fitting our model retroactively to data from the 2008, 2012, and 2016 presidential elections. The plot above shows the point prediction and election outcome for the 50 states in each election, with the states where the outcome fell outside the 95% predictive interval shown in red. The actual intervals are shown for each state too, and we notice a few things:

1. Nearly 10% of the statewide election outcomes fall outside the 95% intervals.

2. A couple of the discrepancies are way off, 3 or 4 predictive sd’s away.

3. Some of the errors are in the Democrats’ favor and some are in the Republicans’. This is good news for us in that these errors will tend to average out (not completely, but to some extent) rather than piling up when predicting the national vote. But it also is bad news in that we can’t excuse the poor calibration based on the idea that we only have N = 3 national elections. To the extent that these errors are all over the place and not all occurring in the same direction and in the same election, that’s evidence of an overall problem of calibration in the tails.

We then made a histogram of all 150 p-values. For each state election, if we have S simulation draws representing the predictive distribution, and X of them are lower than the actual outcome (the Democratic candidate’s share of the two-party vote), then we calculate the p-value as (2*X + 1) / (2*S + 2), using a continuity correction so that it’s always strictly between 0 and 1 (something that Cook, Rubin, and I did in our simulation-based calibration paper, although we neglected to mention that in the article itself; it was only in the software that we used to make all the figures).
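In code the computation is just a few lines. Here’s a sketch, where pred_sims and outcome are hypothetical names for an S x 150 matrix of simulated Democratic vote shares and the vector of 150 actual results:

S <- nrow(pred_sims)
p_value <- sapply(seq_along(outcome), function(j) {
  X <- sum(pred_sims[, j] < outcome[j])   # simulation draws below the actual outcome
  (2*X + 1)/(2*S + 2)                     # continuity-corrected, always strictly between 0 and 1
})
hist(p_value, breaks=20)                  # should look roughly uniform if the forecasts are calibrated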

Here’s what we found:

There are too many p-values below 0.025, which is consistent with the interval coverage we saw in the plots. But the distribution is not quite as U-shaped as we might have feared. The problem is at the extremes: the lowest of the 150 p-values are three values of 0.00017 (a value that low should occur only about once in every 6000 predictions), and the highest are two values above 0.9985 (each of which should occur only about once in every 600 predictions).
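To put those extremes in context, here’s a quick back-of-the-envelope check (my addition, not part of the original analysis). If the 150 p-values really were uniformly distributed, as they should be under perfect calibration:

1 - pbinom(2, 150, 0.00017)   # probability of three or more p-values at or below 0.00017
1 - pbinom(1, 150, 0.0015)    # probability of two or more p-values at or above 0.9985

The first of these is essentially zero and the second is about 2%, so what we saw really is more extreme than chance would suggest.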

Thus, overall I’d say that the problem is not that our intervals are generally too narrow but that we have a problem in the extreme tails. Everything might be just fine if we swap out the normal error model for something like a t_4. I’m not saying that we should add t_4 random noise to our forecasts; I’m saying that we’d change the error model and let the posterior distribution work things out.
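Here’s a quick way to see in R why the t_4 swap targets the far tails without changing the middle of the intervals much (rescaling the t_4 to have the same sd as the normal):

sd_out <- c(2, 3, 4)
2*pnorm(-sd_out)                # normal: about 0.05, 0.003, 0.00006
2*pt(-sd_out*sqrt(2), df=4)     # unit-sd t_4 (a t_4 has sd sqrt(2)): about 0.05, 0.013, 0.005

At 2 sd the two models give nearly the same tail probability, but at 3 or 4 sd the t_4 puts far more mass out there, which is exactly where our calibration problem shows up.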

Let’s see what happens when we switch to t_4 errors:

Not much different, but a little better coverage. Just to be clear: We wouldn’t use a set of graphs like this on their own to choose the model. Results for any small number of elections will be noisy. The real point is that we had prior reasons for using long-tailed errors, as occasional weird and outlying things do happen in elections.

At this point you might ask, why did it take us so long to notice the above problem? It took us so long because we weren’t looking at those tails! We were looking at national electoral votes and looking at winners of certain swing states. We weren’t checking the prediction for Obama’s vote margin in Hawaii and Trump’s margin in Wyoming. But these cases supply information too.

We have no plans to fix this aspect of our model between now and November, as it won’t have any real effect on our predictions of the national election. But it will be good to think harder about this going forward.

General recommendation regarding forecasting workflow

I recommend that forecasters do this sort of exercise more generally: produce multivariate forecasts, look at them carefully, post the results widely, and then carefully look at the inevitable problems that turn up. No model is perfect. We can learn from our mistakes, but only if we are prepared to do so.

Stan’s Within-Chain Parallelization now available with brms

The just-released R package brms, version 2.14.0, supports within-chain parallelization in Stan. This new functionality is based on the recently introduced reduce_sum function in Stan, which makes it possible to evaluate sums over (conditionally) independent log-likelihood terms in parallel, using multiple CPU cores at the same time via threading. The idea of reduce_sum is to exploit the associativity and commutativity of the sum operation, which allows any large sum to be split into many smaller partial sums.

Paul Bürkner did an amazing job enabling within-chain parallelization via threading for the broad range of models supported by brms. Note that threading is currently only available with the CmdStanR backend of brms, since the minimal Stan version supporting reduce_sum is 2.23 and rstan is still at 2.21. It may take some time until rstan directly supports threading, but users will usually not notice any difference between the two backends once they are configured.
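For concreteness, here is roughly what a threaded fit looks like, using the epilepsy example data that ships with brms (adjust the formula and the number of chains and threads to your own model and hardware):

library(brms)
fit <- brm(
  count ~ zAge + zBase * Trt + (1 | patient),   # Poisson multilevel model from the brms examples
  data = epilepsy, family = poisson(),
  backend = "cmdstanr",      # threading currently requires the CmdStanR backend
  chains = 4, cores = 4,
  threads = threading(2)     # 2 threads per chain, i.e., 8 CPU cores in total
)

The threading() helper also accepts a grainsize argument that controls how the log-likelihood terms are chunked across threads.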

We encourage users to read the new threading vignette in order to get an intuition for the new feature and for the speedups one can expect for their model. The speed gain from adding more CPU cores per chain will depend on many model details. In brief:

  • Stan models that take days or hours can run in a few hours or minutes, but models that run in just a few minutes will be hard to accelerate
  • Models with computationally expensive likelihoods will parallelize better than those with cheap-to-calculate likelihoods such as the normal or the Bernoulli
  • Non-hierarchical models and hierarchical models with few groupings will benefit greatly from parallelization, while hierarchical models with many random effects will gain somewhat less in speed

The new threading feature is marked as “experimental” in brms, since it is entirely new and some details may need to change as we gain more experience with it. We look forward to hearing from users about their experiences with the new feature at the Stan Discourse forums.

She’s wary of the consensus-based transparency checklist, and here’s a paragraph we should’ve added to that zillion-authored paper

Megan Higgs writes:

A large collection of authors describes a “consensus-based transparency checklist” in the Dec 2, 2019 Comment in Nature Human Behaviour.

Hey—I’m one of those 80 authors! Let’s see what Higgs has to say:

I [Higgs] have mixed emotions about it — the positive aspects are easy to see, but I also have a wary feeling that is harder to put words to. . . . I do suspect this checklist will help with transparency at a fairly superficial level (which is good!), but could it potentially harm progress on deeper issues? . . . will the satisfactory feeling of successfully completing the checklist lull researchers into complacency and keep them from spending effort on the deeper layers? Will it make them feel they don’t need to worry about the deeper stuff because they’ve already successfully made it through the required checklist?

She summarizes:

I [Higgs] worry the checklist is going to inadvertently be taken as a false check of quality, rather than simply transparency (regardless of quality). . . . We should always consider the lurking dangers of offering easy solutions and simple checklists that make humans feel that they’ve done all that is needed, thus encouraging them to do no more.

I see her point, and it relates to one of my favorite recent slogans: honesty and transparency are not enough.

I signed on to the checklist because it seemed like a useful gesture, a “move the ball forward” step.

Here are a couple of key sentences in our paper:

Among the causes for this low replication rate are underspecified methods, analyses and reporting practices.

We believe that consensus-based solutions and user-friendly tools are necessary to achieve meaningful change in scientific practice.

We said among the causes (not the only cause), and we said necessary (not sufficient).

Still, after reading Megan’s comment, I wish we’d added another paragraph, something like this:

Honesty and transparency are not enough. Bad science is bad science even if it is open, and applying transparency to poor measurement and design will not, in and of itself, create good science. Rather, transparency should reduce existing incentives for performing bad science and increase incentives for better measurement and design of studies.

We are stat professors with the American Statistical Association, and we’re thrilled to talk to you about the statistics behind voting. Ask us anything!

It’s happening at 11am today on Reddit.

It’s a real privilege to do this with Mary Gray, who was so nice to me back when I took a class at American University several decades ago.