With the presidential election season coming up (not that it’s ever ended), here’s a quick summary of the problems/challenges with two poll-based forecasting methods from 2020.
How this post came about: I have a post scheduled about a dispute between election forecasters Elliott Morris and Nate Silver about whether the site Fivethirtyeight.com should be including polls from the Rasmussen organization in their analyses.
At the end of the post I had a statistical discussion about the weaknesses of existing election forecasting methods . . . and then I realized that this little appendix was the most interesting thing in my post!
Whether Fivethirtyeight includes Rasmussen polls is a very minor issue, first because Rasmussen is only one pollster and second because if you do include their polls, any reasonable approach would give them a very low weight or a very large adjustment for bias. So in practice it doesn’t matter much for the forecast whether you include those polls, although I can see that from a procedural standpoint it can be challenging to come up with a rule to include or exclude them.
Now for the more important and statistically interesting stuff.
Key issues with the Fivethirtyeight forecast from 2020
They start with a polling average and then add weights and adjustments; see here for some description. I think the big challenge here is that the approach of adding fudge factors makes it difficult to add uncertainty without creating weird artifacts in the joint distribution, as discussed here and here. Relatedly, they don’t have a good way to integrate information from state and national polls. The issue here is not that they made a particular technical error; rather, they’re using a method that starts in a simple and interpretable way but then just gets harder and harder to hold together.
Key issues with the Economist forecast from 2020
From the other direction, the weakness of the Economist forecast (which I was involved in) was a lack of robustness to modeling and conceptual errors. Consider that we had to overhaul our forecast during the campaign. Also our forecasts had some problems with uncertainties, weird things relating to some choices in how we modeled between-state correlation of polling errors and time trends. I don’t think there’s any reason that a Bayesian forecast should necessarily be overconfident and non-robust to conceptual errors in the model, but that’s what seems to have happened with us. In contrast, the Fivethirtyeight approach was more directly empirical, which as noted above had its own problems but didn’t have a bias toward overconfidence.
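To make the between-state correlation issue concrete, here is a minimal sketch. It is not the Economist model or the Fivethirtyeight procedure. It assumes each state's polling error is the sum of a shared national error and an independent state-specific error, both normal; the function names, the standard deviations, and the two-state setup are all illustrative assumptions. Under that assumption the correlation between any two states' errors is sd_national^2 / (sd_national^2 + sd_state^2), so the choice of those two scales largely determines how often the simulated states all miss in the same direction.

```python
import random

def simulate_state_errors(n_draws, n_states, sd_national, sd_state, seed=0):
    """Draw polling errors as shared national error + independent state error.

    Hypothetical toy model for illustration only.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        national = rng.gauss(0.0, sd_national)  # same shift in every state
        draws.append([national + rng.gauss(0.0, sd_state) for _ in range(n_states)])
    return draws

def correlation(xs, ys):
    """Sample Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

sd_national, sd_state = 2.0, 2.0  # percentage points; made-up values
draws = simulate_state_errors(20000, 2, sd_national, sd_state)
empirical = correlation([d[0] for d in draws], [d[1] for d in draws])
implied = sd_national**2 / (sd_national**2 + sd_state**2)
print(f"implied correlation {implied:.2f}, simulated {empirical:.2f}")
```

Raising sd_national relative to sd_state pushes the correlation toward 1, which widens the uncertainty in the national total even if each state's marginal uncertainty stays the same.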
Key issues with both forecasts
Both of the 2020 presidential election forecasts had difficulty handling data other than horse-race polls. The challenging information included: economic and political “fundamentals,” which were included in the forecasts but with some awkwardness, in part because these variables themselves change during the campaign; known polling biases such as differential nonresponse; knowledge of systematic polling errors in previous elections; issues specific to the election at hand (street protests, covid, Clinton’s email server, Trump’s sexual assaults, etc.); issue attitudes in general, to the extent they were not absorbed into horse-race polling; estimates of turnout; vote suppression; and all sorts of other data sources such as new-voter registration numbers. All of these came up as possible concerns with forecasts, and it’s not so easy to include them in a forecast. No easy answers here. At some level we just need to be transparent, and people can take our forecasts as data summaries, but these concerns arise in every election.
good summary andrew, although i was 4 years old in 2000 and don’t think i was forecasting
Elliott:
The first election I ever forecast was 1992. We did it when writing our article, Why are American Presidential election campaign polls so variable when votes are so predictable?. If you want to claim that votes are predictable, you have to do some prediction! This was a fundamentals-based prediction, not a polls-based prediction. I would’ve asked for your input, but you were -4 years old at the time. Kari Lock and I did some poll-based forecasting in 2008 (Bayesian combination of state polls and election forecasts, published in 2010), and I’m sure your input would’ve been valuable then (as a 12-year-old, you could’ve run out and got us sandwiches when we were hungry, right?); I just didn’t think to ask.
I think you mean 2020 not 2000?
When Andrew wrote this:
“and then I realized that this little appendix was the most interesting think in my post!”
I thought I found a typo because I felt it should be
“and then I realized that this little appendix was the most interesting thing in my post!”
However, upon reflection, maybe this is not a typo but is intentional, as in think = thought.
Typos fixed; thanks.
I did some analysis on election forecasts last year, and found that Ann Selzer’s final Iowa poll has done better at predicting Iowa’s election outcomes than either 538 or the Economist. You can also use this result to improve their final national forecasts, by re-weighting the individual simulation draws based on how closely they match the final Selzer poll. This makes me think that the push to include basically all polls regardless of quality, along with data other than horse-race polls, might actually be harming the models rather than helping, if you can improve the models by giving more weight to gold-standard polling in a single state. But this is all looking at predictions a day to a week ahead of the election, and probably doesn’t generalize to forecasts many months ahead.
https://secondhandcartography.com/2022/10/15/ann-selzer/
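A minimal sketch of the re-weighting idea described in this comment, under stated assumptions: each simulation draw carries an Iowa margin and a national outcome, and each draw is weighted by a normal kernel on the distance between its Iowa margin and the poll's margin. The kernel, the bandwidth poll_sd, the function names, and the fake draws are all hypothetical; the linked post may weight the draws differently.

```python
import math
import random

def reweight_by_poll(draws, poll_margin, poll_sd):
    """Weight each draw by a normal kernel on (draw's Iowa margin - poll margin).

    draws: list of (iowa_margin, candidate_wins) tuples.
    Returns weights normalized to sum to 1.
    """
    raw = [math.exp(-0.5 * ((iowa - poll_margin) / poll_sd) ** 2) for iowa, _ in draws]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_win_prob(draws, weights):
    """Total weight of the draws in which the candidate wins."""
    return sum(w for (_, wins), w in zip(draws, weights) if wins)

# Fake simulation draws in which the Iowa margin tracks the national margin.
rng = random.Random(1)
draws = []
for _ in range(5000):
    national = rng.gauss(4.0, 3.0)               # national margin, made up
    iowa = national - 6.0 + rng.gauss(0.0, 3.0)  # Iowa leans 6 points the other way
    draws.append((iowa, national > 2.0))         # toy rule: win if national margin > 2

uniform = [1.0 / len(draws)] * len(draws)
weights = reweight_by_poll(draws, poll_margin=-7.0, poll_sd=3.5)
before = weighted_win_prob(draws, uniform)
after = weighted_win_prob(draws, weights)
print(f"win probability before {before:.3f}, after re-weighting {after:.3f}")
```

Because the poll here is worse for the candidate than the typical simulated Iowa margin, the re-weighted win probability comes out lower. A smaller poll_sd trusts the poll more but leaves fewer draws with meaningful weight.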
Paul:
Yes, if you assume that the error of a particular poll will be extremely well predicted by the errors of polls from that organization in the two previous elections, then you can get a very precise forecast.
I guess I have a few questions:
(A) Would this have worked better or worse if you used polls from the Trafalgar Group instead of Selzer (worth thinking about, not an actual question — could substitute Trafalgar with any pollster with below-average Dem percentages)
(B) I think the Brier score for 538 is lower than for this “Selzer+” model for 2022, which is kinda what we’d get at if we assumed pollster-level bias was generated by an AR process across years.
(C) What if we look at polls taken before the final 30 days of the election?
Elliott
(A) For 2020 and 2016, I think anchoring to Trafalgar polls would probably have improved the 538 model, but not by as much as anchoring to Selzer polls. That’s based on a quick look at the errors in the final 2020 Trafalgar polls, compared to the error in the 538 forecast and in the final Selzer poll.
(B) I calculate that the Brier score is lower for Selzer+ than for 538 in 2022, technically (0.0376 v 0.0378). The predictions from 538 and Selzer for Iowa were basically identical, and the tiny differences you see are driven by small changes in the uncertainties. https://secondhandcartography.com/2022/12/29/2022-forecast-performance/
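For reference, the Brier score is the mean squared difference between the forecast probabilities and the 0/1 outcomes, so lower is better. The sketch below uses made-up forecasts for four races, not the numbers from the linked post, and the function name is just for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Made-up forecasts for four races; an outcome is 1 if the event happened.
outcomes = [1, 0, 1, 1]
forecast_a = [0.9, 0.2, 0.7, 0.6]
forecast_b = [0.8, 0.1, 0.8, 0.6]
coin_flip = [0.5] * len(outcomes)

for name, probs in [("A", forecast_a), ("B", forecast_b), ("coin flip", coin_flip)]:
    print(name, brier_score(probs, outcomes))
```

A forecast that always says 50% scores 0.25, which gives a sense of scale for comparing two scores that differ only in the fourth decimal place.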
I’m inferring a broader point, that you think Selzer’s performance is driven mostly by 2016 and 2020, and anyone can get lucky twice. That’s definitely a concern, but I think her track record outside of those years is good enough that she can’t be purely dismissed as a “two-hit wonder.” It’s not like the Trafalgar Group, which did well in 2016 and 2020 and then had a pretty terrible 2022. Using the “Selzer+” approach also improves outcomes in other years, not just 2020. I couldn’t find past simulation draws from 538, but I could for the Economist, and found improvements in Brier scores from using the “Selzer+” method for all 4 elections tested (2008, 2012, 2016, and 2020). And see above about the 2022 election.
(C) Selzer’s polls from earlier in 2020 and 2022 were all tilted heavily toward the Democrats compared to the actual outcomes (Biden & Trump even in September, Grassley +3 in October). In 2022 this caused the “Selzer+” model to increase the chance of the Dems winning the House and Senate by 5 percentage points. I haven’t calculated what that would translate to in Brier scores.
Andrew and Elliott,
In 2020, I was one of the people who criticized statistical aspects of the Economist’s forecasting model in comments on this blog. To me, Andrew’s summary of the 2020 Economist and FiveThirtyEight models seems accurate and fair. Thank you for being evenhanded in your reflections.
With 2024 coming, it could make sense to revise Economist-style models, including through
A. More thorough posterior predictive checks in backtesting. Judged from the joint distribution of model-predicted state results, the actual 2008, 2012, and 2016 presidential election results were improbable according to the backtested 2020 Economist model (several Bayesian p-values were extreme). This shows model overconfidence that is identifiable and fixable in advance of the election you aim to predict.
B. Use of heavy tails, including testing substantially heavy tails (for example, t distributions with 1.5, 2, or 3 degrees of freedom, instead of 4 or more). For a tossup race, the predictions of heavy-tailed and non-heavy-tailed models can be very similar. However, when one candidate is far ahead, the predictions can differ greatly, with the heavy-tailed model less confident. So, in choosing tail weight, it is worth considering model behavior under the scenario that a candidate ends up far ahead.
C. Have a written plan for how and when to address problems found in deployed models (who investigates, what are timelines…). If problems happened several times in the past, they will happen in the future.
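Point B can be illustrated with a small sketch, under stated assumptions: the forecast error is either standard normal or Student's t with 2 or 3 degrees of freedom, all with the same scale parameter (not the same variance, which is infinite for 2 degrees of freedom), and the win probability is the chance that lead plus error stays positive. The function names and the numbers are hypothetical, and this is not how either 2020 model was built. The t CDFs use the known closed forms for 2 and 3 degrees of freedom so that only the standard library is needed.

```python
import math
from statistics import NormalDist

def t_cdf(x, df):
    """CDF of Student's t for df = 2 or 3, using closed-form expressions."""
    if df == 2:
        return 0.5 + x / (2.0 * math.sqrt(2.0 + x * x))
    if df == 3:
        u = x / math.sqrt(3.0)
        return 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi
    raise ValueError("this sketch only implements df = 2 and df = 3")

def win_prob(lead, scale, df=None):
    """P(lead + scale * error > 0), with error standard normal (df=None) or t_df."""
    z = lead / scale
    return NormalDist().cdf(z) if df is None else t_cdf(z, df)

scale = 2.0  # assumed error scale, in percentage points
for lead in (0.0, 2.0, 6.0):
    print(
        f"lead {lead:.0f}: normal {win_prob(lead, scale):.4f}, "
        f"t3 {win_prob(lead, scale, 3):.4f}, t2 {win_prob(lead, scale, 2):.4f}"
    )
```

With a lead of zero all three models give 50%, while at a lead of three scale units the normal model is far closer to certainty than either t model, which is the pattern described in point B.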
FWIW.
The issue of empirical vs model-based uncertainties is one I run into all the time, and in different contexts. When you have a small dataset (such as presidential elections) you don’t have enough data to characterize the tails of the distribution. That means you can’t use just empirical results, nor do you have enough data to fit a model for the tails. Compounding the problem, the underlying phenomena that govern the tail distribution are often changing with time, so your limited dataset is even more limited than it seems. There have been 59 presidential elections, which is already a rather small number, but it’s not like you can learn much about current election probabilities from the election of John Adams: things have changed. Same with, say, the probability of a flood of at least a specified magnitude in a specific location: you might have a 200-year record of peak annual flows — or much longer, in some places — but forests in the watershed have been replaced by farmland and development, maybe dams have been built, and the climate has changed, so that history might not be worth very much.
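A rough worked example of how little a short record says about the tails, assuming independent trials with a constant probability (an assumption the paragraph above argues is too generous): if an event has never occurred in n trials, the largest per-trial probability still consistent with that record at a given confidence level solves (1 - p)^n = 1 - confidence. The function name is hypothetical.

```python
def upper_bound_zero_events(n, confidence=0.95):
    """Upper confidence bound on a per-trial probability after 0 events in n trials.

    Solves (1 - p)**n = 1 - confidence for p, assuming independent trials
    with a constant probability.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

for n in (59, 200):
    print(n, round(upper_bound_zero_events(n), 4))
```

With 59 elections the bound is close to 5%, and with a 200-year flood record it is still about 1.5%, so events with probabilities well below those levels cannot be pinned down from the record alone, even before accounting for the fact that conditions have changed over time.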
It’s hard to convert these general observations (and perhaps obvious ones) into concrete advice. I agree with the spirit of Fogpine’s point A. Except for convenience (e.g. if you’re using ordinary linear regression) I think the normal distribution is a bad default in most cases, or at least in most cases I’ve run into. I feel like the t4 or maybe t6 distribution makes a better default, and then vary from that if you think the extreme tails should be even heavier . . . although you’ll usually be doing that without any specific statistical support, but based on general feelings or theory or something.
Phil:
There’s this paper on using a combination of theory and statistical analysis to estimate the probability of rare events.
I remember that paper. I like it, and I agree with the point that sometimes you can bring in information from related events rather than relying purely on the data you’re interested in. Historical flood heights may not be useful on their own, if you don’t have any 1-in-200 events in your dataset, but maybe you can use frequency data from other towns or other rivers or whatever. You don’t have enough U.S. presidential elections to say much about the tails but you have other elections, other countries, etc., and as that paper points out you can look at data at the state level rather than just the popular vote. I agree these are good ideas.
But in the end, I think that in most contexts, if you say such-and-such an event has a 0.2% probability when you have no examples of that event in the data, it’s rare that you should be confident that the ‘true probability’ isn’t 1% or 0.04%. And if you say the event has a probability of 0.001% or something, then your estimate might easily be off by a factor of 10 or more.
Of course, in the case of such rare events it’s not even clear how well “probability” can be defined, as has often been discussed.