Why are we making probabilistic election forecasts? (and why don’t we put so much effort into them?)

Political scientists Justin Grimmer, Dean Knox, and Sean Westwood released a paper that begins:

Probabilistic election forecasts dominate public debate, drive obsessive media discussion, and influence campaign strategy. But in recent presidential elections, apparent predictive failures and growing evidence of harm have led to increasing criticism of forecasts and horse-race campaign coverage. Regardless of their underlying ability to predict the future, we show that society simply lacks sufficient data to evaluate forecasts empirically. Presidential elections are rare events, meaning there is little evidence to support claims of forecasting prowess. Moreover, we show that the seemingly large number of state-level results provide little additional leverage for assessment, because determining winners requires the weighted aggregation of individual state winners and because of substantial within-year correlation.

I agree with all of that. There’s too much horse-race coverage—as we wrote many years ago, “perhaps journalists and then the public will understand that the [general election polls for president] are not worthy of as much attention as they get.” I also agree about the small sample limiting what we can learn about election forecast accuracy from data alone.

Grimmer et al. then continue:

We demonstrate that scientists and voters are decades to millennia away from assessing whether probabilistic forecasting provides reliable insights into election outcomes.

I think that’s an overstatement, and I’ll explain in the context of going through the Grimmer et al. paper. I don’t think this is a bad paper, I just have a disagreement with how they frame part of the question.

Right at the beginning, they write about "recent failures" of election forecasts. I guess that 2016 represents a failure in that some prominent forecasts in the news media were giving Clinton a 90% chance of winning the electoral vote, and she lost. I wouldn't say that 2020 was such a failure: forecasts correctly predicted the winner, and the error in the predicted vote share (about 2 percentage points of the vote) was within forecast uncertainties. Julia Azari and I wrote about 2016 in our paper, 19 things we learned from the 2016 election, and I wrote about 2020 in the paper, Failure and success in political polling and election forecasting (which involved a nightmare experience with a journal, not the one that ultimately published it). I'm not pointing to those articles because they need to be cited (Grimmer et al. already generously cite me elsewhere!) but just to give a sense that the phrase "recent failures" could be misinterpreted.

Also if you’re talking about forecasting, I strongly recommend Rosenstone’s classic book on forecasting elections, which in many ways was the basis of my 1993 paper with King. See in particular table 1 of that 1993 paper, which is relevant to the issue of fundamentals-based forecasts outperforming poll-based forecasts. I agree completely with the larger point of Grimmer et al. that this table is just N=3; it shows that those elections were consistent with forecasts but it can’t allow us to make any strong claims about the future. That said, the fact that elections can be predicted to within a couple percentage points of the popular vote given information available before the beginning of the campaign . . . that’s an important stylized fact about U.S. general elections for president, not something that’s true in all countries or all elections (see for example here).

Grimmer et al. are correct to point out that forecasts are not just polls! As I wrote the other day, state polls are fine, but this whole obsessing-over-the-state-polls is getting out of control. They’re part of a balanced forecasting approach (see also here) which allows for many sources of uncertainty. Along these lines, fundamentals-based forecasts are also probabilistic, as we point out in our 1993 paper (and is implicitly already there in the error term of any serious fundamentals-based model).

Getting back to Rosenstone, they write that scientists are decades away from knowing if "probabilistic forecasting is more accurate than uninformed pundits guessing at random." Really??? At first reading I was not quite sure what they meant by "guessing at random." Here are two possibilities: (1) the pundits literally say that the two candidates are equally likely to win, (2) the pundits read the newspapers, watch TV, and pull guesses out of their ass in some non-algorithmic way.

If Grimmer et al. are talking about option 1, I think we already know that probabilistic forecasting is more accurate than a coin flip: read Rosenstone and then consider the followup elections of 1984, 1988, 1992, and 1996, none of which were coin flips and all of which were correctly predicted by fundamentals (and, for that matter, by late polls). From 2000 on, all the elections have been close (except 2008, and by historical standards that was pretty close too), so, sure, in 2000, 2004, 2012, 2016, and 2020, the coin-flip forecast wasn’t so bad. But then their argument is leaning very strongly on the current condition of elections being nearly tied.

If they’re talking about option 2, then they have to consider that, nowadays, even uninformed pundits are aware of fundamentals-based ideas of the economy and incumbency, and of course they’re aware of the polls. So, in that sense, sure, given that respected forecasts exist and pundits know about them, pundits can do about as well as forecasts.

OK, now I see in section 3 of their paper that by “guessing at random,” Grimmer et al. really are talking about flipping a coin. I disagree with the method they are using in section 3—or, I should say, I’m not disagreeing with their math; rather, I think the problem is that they’re evaluating each election outcome as binary. But some elections have more information than others. Predicting 1984 or 1996 as a coin flip would be ridiculous. The key here is that forecasters predict the popular and electoral vote margins, not just the winner (see here).

I also don’t see why they are demanding the forecast have a 95% chance of having a higher score, but I guess that’s a minor point compared to the larger issue that forecasts should be evaluated using continuous outcomes.
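To illustrate what margin-based evaluation buys you, here is a small simulation sketch. The numbers are entirely made up (a hypothetical spread of state partisan leans and a forecaster with roughly 3-point error), and it ignores the within-year correlation across states that Grimmer et al. rightly emphasize, so treat it as an illustration of the idea rather than an evaluation of any real forecast:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, purely illustrative numbers -- not real election data.
# Fifty states whose Democratic two-party share runs from deep red to deep blue.
state_lean = np.linspace(35, 65, 50)               # long-run partisan lean, in percent
outcome = state_lean + rng.normal(0, 3, size=50)   # one election's actual results

informed = outcome + rng.normal(0, 3, size=50)     # forecast with roughly 3-point error
flat = np.full(50, 50.0)                           # "no information" benchmark

err_informed = np.abs(informed - outcome)
err_flat = np.abs(flat - outcome)

print("mean absolute error, informed forecast:", round(float(err_informed.mean()), 1))
print("mean absolute error, flat 50% guess:   ", round(float(err_flat.mean()), 1))
print("states where the informed forecast was closer:",
      int((err_informed < err_flat).sum()), "of 50")
# Caveat: state results within a year are correlated (Grimmer et al.'s point),
# so these are not 50 independent comparisons -- but the contrast in vote-share
# error is still stark, whereas the national winner call from that same election
# is a single 0/1 outcome.
```

Under these assumed numbers, a single election's worth of state vote shares already separates the informative forecast from the no-information guess, whereas the binary called-the-winner record from that same election is one data point.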

Finally, at the end they ask, why are we making probabilistic forecasts? I have some answers, other than “the lucrative marketing of statistical expertise.” First, political science. Rosenstone made a probabilistic forecasting model back in 1983, and we used an improved version of that model for our 1993 paper. The fact that U.S. general elections for president are predictable, within a few percentage points, helps us understand American politics. Second, recall baseball analyst Bill James’s remark that the alternative to good statistics is not “no statistics,” it’s “bad statistics.” Political professionals, journalists, and gamblers are going to make probabilistic forecasts one way or another; fundamentals-based models exist, polls exist . . . given that this information is going to be combined in some way, I don’t think there’s any shame in trying to do it well.

In summary, I agree with much of what Grimmer et al. have to say. We can use empirical data to shoot down some really bad forecasting models, such as those that were giving Hillary Clinton a 99% chance of winning in 2016 (something that can happen from a lack of appreciation for non-sampling error in polls, a topic that has been studied quantitatively for a long time; see for example this review by Ansolabehere and Belin from 1993), and other times we can see mathematical or theoretical problems even before the election data come in (for example this from October 2020). But once we narrow the field to reasonable forecasts, it's pretty much impossible to choose between them on empirical grounds. This is a point I made here and here; again, my point in giving all these links is to avoid having to restate what I've already written, I'm not asking them to cite all these things.

I sent the above to Grimmer et al., who responded:

To your point about elections from 30-40 years ago—sure, the forecasts from then look reasonable in retrospect. But again, as our calculations show, we need more information to distinguish those forecasts from other plausible forecasts. Suppose a forecaster like Rosenstone is accurate in his election predictions 85% of the time. It would take 48 years to distinguish his forecast from the 50% accuracy pundit (on average). This would mean that, on average, if Rosenstone started out of sample forecasting in 1980, then in 2028 we’d finally be able to distinguish from the 50% correct pundit. If we built a baseline pundit with more accuracy (say, accounting for obvious elections like you suggest) it would take even longer to determine whether Rosenstone is more accurate than the pundit.
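To get a feel for where a number like "48 years" comes from, here is a rough back-of-the-envelope sketch. The test, threshold, and parameters are my own choices for illustration, not necessarily the calculation Grimmer et al. perform: how often would an exact one-sided binomial test at the 5% level separate an 85%-accurate winner-caller from a fair coin, as a function of the number of elections observed?

```python
from math import comb

def power_vs_coin(n_elections, p_forecaster=0.85, alpha=0.05):
    """Power of a one-sided exact binomial test of H0: accuracy = 0.5,
    against a forecaster who calls the winner correctly with prob p_forecaster."""
    # smallest number of correct calls k such that P(X >= k | p = 0.5) <= alpha
    k_star = next(k for k in range(n_elections + 1)
                  if sum(comb(n_elections, j)
                         for j in range(k, n_elections + 1)) / 2**n_elections <= alpha)
    # probability the 85%-accurate forecaster clears that bar
    return sum(comb(n_elections, j) * p_forecaster**j * (1 - p_forecaster)**(n_elections - j)
               for j in range(k_star, n_elections + 1))

for n in (6, 12, 18, 24):                # i.e. 24, 48, 72, 96 years of elections
    print(f"{n:2d} elections ({4*n} years): power ~ {power_vs_coin(n):.2f}")
```

Under these assumptions the test has only modest power even after several decades of elections, which is the flavor of their argument; my disagreement, as discussed above, is with scoring only the binary winner call in the first place.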

I agree regarding the comparison to the baseline pundit, as this pundit can pay attention to the polls and prediction markets, read election forecasts, etc. The pundit can be as good as a forecast simply by repeating the forecast itself! But my point about Rosenstone is not that his model predicts the winner 85% of the time; it's that his model (or improved versions of it) predicts the vote margin to a degree of accuracy that allows us to say that the Republicans were heavily favored in 1984 and 1988, the Democrats were favored in 1992 and 1996, etc. Not to mention the forecasts for individual states. Coin flipping only looks like a win if you collapse election forecasting to binary outcomes.

So I disagree with Grimmer et al.’s claim that you would need many decades of future data to learn that a good probabilistic forecast is better than a coin flip. But since nobody’s flipping coins in elections that are not anticipated to be close, this is a theoretical disagreement without practical import.

So, after all that, my summary is that, by restricting themselves to evaluating forecasts in a binary way, they've overstated their case, and I disagree with their negative attitude about forecasting, but, yeah, we're not gonna ever have the data to compare different reasonable forecasting methods on straight empirical grounds—and that doesn't even get into the issue that the forecasts keep changing! I'm working with the Economist team and we're doing our best, given the limited resources we've allocated to the problem, but I wouldn't claim that our models, or anyone else's, "are the best"; there's just no way to know.

The other thing that's relevant here is that not much effort is being put into these forecasts! These are small teams! Ben Goodrich and I have helped out the Economist as a little side gig, and the Economist hasn't devoted a lot of internal resources to this either. I expect the same is true of Fivethirtyeight and other forecasts. Orders of magnitude more time and money are spent on polling (not to mention campaigns' private polls and focus groups) than on statistical analysis, poll aggregation, fundamentals models, and the rest. Given that the information is out there, and it's gonna be combined in some way, it makes sense that a small amount of total effort is put into forecasting.

In that sense, I think we already are at the endgame that Grimmer et al. would like: some version of probabilistic forecasting is inevitable, there's a demand for it, so a small amount of total resources are spent on it. I get the sense that they think probabilistic forecasts are being taken too seriously, but given that these forecasts currently show a lot of uncertainty (for example, the Economist forecast currently has the race at 60/40), I'd argue that they're doing their job in informing people about uncertainty.

Prediction markets

I sent the above discussion to Rajiv Sethi, an economist who studies prediction markets. Sethi points to this recent paper, which begins:

Any forecasting model can be represented by a virtual trader in a prediction market, endowed with a budget, risk preferences, and beliefs inherited from the model. We propose and implement a profitability test for the evaluation of forecasting models based on this idea. The virtual trader enters a position and adjusts its portfolio over time in response to changes in the model forecast and market prices, and its profitability can be used as a measure of model accuracy. We implement this test using probabilistic forecasts for competitive states in the 2020 US presidential election and congressional elections in 2020 and 2022, using data from three sources: model-based forecasts published by The Economist and FiveThirtyEight, and prices from the PredictIt exchange. The proposed approach can be applied more generally to any forecasting activity as long as models and markets referencing the same events exist.
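If I understand the abstract correctly, the idea is roughly the following caricature. The sizing rule, the prices, and the probabilities below are all invented for illustration; the paper's actual test handles budgets, risk preferences, and portfolio adjustment more carefully:

```python
# A deliberately simplified caricature of a "virtual trader" who believes the
# model and trades a winner-take-all contract against market prices. All numbers
# and the one-unit-per-day sizing rule are invented for illustration only.

def virtual_trader_profit(model_probs, market_prices, outcome, stake=1.0):
    """Buy the contract when the model thinks it is underpriced, sell when
    overpriced, one unit of stake per day; the contract settles at 1 if the
    event happens and 0 otherwise."""
    cash, position = 0.0, 0.0
    for p, q in zip(model_probs, market_prices):
        if p > q:        # model says the contract is cheap: buy
            cash -= stake * q
            position += stake
        elif p < q:      # model says it is expensive: sell (short)
            cash += stake * q
            position -= stake
    return cash + position * (1.0 if outcome else 0.0)

# Hypothetical daily series for one contract (not real data):
model_probs   = [0.58, 0.60, 0.57, 0.62]
market_prices = [0.52, 0.55, 0.59, 0.56]
print(virtual_trader_profit(model_probs, market_prices, outcome=True))
```

The appeal of this framing is that profitability is a continuous, economically meaningful score, so it accumulates evidence much faster than counting correct winner calls.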

Sethi writes:

I suspect that the coin flip forecaster would lose very substantial sums.

Joyce Berg and colleagues have been looking at forecasting accuracy of prediction markets for decades, including the IEM vote share markets. This survey paper is now a bit dated but has vote share performance relative to polls for elections in many countries.

They are looking at markets rather than models but the idea that we don’t have enough data to judge would seem to apply to both.

I think these comparisons need to deal with prediction markets. The implicit suggestion in the paper (I think) is that we don't know (and will never know) whether they can be beaten by coin flippers, and I think we do know.

Yes, as discussed above, I think the problem is that Grimmer et al. were only using binary win/loss outcomes in the analysis where they compared forecasts to coin flips. Throwing away the information on vote margin is going to make it much much harder to distinguish an informative forecast from noise.

Commercial election forecasting

I sent the above discussion to Fivethirtyeight’s Elliott Morris, who wrote:

1. It's interesting to see how academics answer the questions we ask ourselves all the time in model development and forecast evaluation. Whether we are better than dart-throwing is a very important question.

2. I’m reasonably confident we are better. As a particularly salient example, a monkey does not know that Wyoming is (practically speaking) always going to be red and CT always blue in 2024. Getting one of those states wrong would certainly erase any gains in accuracy (take the Brier score) from reverting probabilities towards 50-50 in competitive states.

3. Following from that, it seems a better benchmark (what you might call the “smarter pundit” model) would be how the state voted in the last election—or even better, p(win | previous win + some noise). That might still not replicate what pundits are doing but I’d find losing to that hypothetical pundit more troubling for the industry.

4. Could you propose a method that grades the forecasters in terms of distance from the result on vote share grounds? This is closer to how we think of things (we do not think of ourselves as calling elections) and adds some resolution to the problem and I imagine we’d see separation between forecasters and random guessing (centered around previous vote, maybe), much sooner (if not practically immediately).

5. Back on the subject of how we grade different forecasts, we calculate the LOOIC of candidate models on out-of-sample data. Why not create the dumb pundit model in Stan and compare information criterion in a Bayesian way? I think this would augment the simulation exercise nicely.

6. Bigger picture, I’m not sure what you’re doing is really grading the forecasters on their forecasting skill. Our baseline accuracy is set by the pollsters, and it is hard to impossible to overcome bias in measurement. So one question would be whether pollsters beat random guessing. Helpfully you have a lot more empirical data there to test the question. Then, if polls beat the alternative in long-range performance (maybe assign binary wins/losses for surveys outside the MOE?), and pollsters don’t, that is a strong indictment.

7. An alternative benchmark would be the markets. Rajiv’s work finds traders profited off the markets last year if they followed the models and closed contracts before expiration. Taking this metaphor: If you remove assignment risk from your calculations, how would a new grading methodology work? Would we need a hundred years of forecasts or just a couple cycles of beating the CW?

Morris's "smarter pundit" model in his point #3 is similar to the fundamentals-based models that combine national and state predictors, including past election results. This is what we did in our 1993 paper (we said that elections were predictable given information available ahead of time, so we felt the duty to make such a prediction ourselves) and what is done in a much improved way to create the fundamentals-based forecasts for the Economist, Fivethirtyeight, etc.
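A minimal rendering of that "previous result plus noise" baseline might look something like this. This is my own sketch with invented numbers (the 0.85 base probability, the noise level, and the example state margins are all hypothetical), not Morris's actual benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

def smarter_pundit(prev_margin, base_prob=0.85, noise_sd=0.05):
    """Hypothetical 'previous result plus noise' baseline: give the party that
    carried a state last time a fixed high win probability, jittered a little."""
    p = np.where(prev_margin > 0, base_prob, 1 - base_prob)
    return np.clip(p + rng.normal(0, noise_sd, size=len(prev_margin)), 0.01, 0.99)

# Hypothetical previous Dem-minus-Rep margins (in points) for five states:
prev_margin = np.array([-30.0, -8.0, 0.5, 6.0, 20.0])
print(smarter_pundit(prev_margin).round(2))
```

Even this crude baseline gets the safe states right, which is why it is a more demanding benchmark than a coin flip.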

Political science

Above I wrote about the relevance of effective forecasting to our understanding of elections. Following up on Sethi's mention of Berg et al.'s research on prediction markets and polls, political scientist Chris Wlezien points us to two papers with Bob Erikson.

Are prediction markets really superior to polls as election predictors?:

We argue that it is inappropriate to naively compare market forecasts of an election outcome with exact poll results on the day prices are recorded, that is, market prices reflect forecasts of what will happen on Election Day whereas trial-heat polls register preferences on the day of the poll. We then show that when poll leads are properly discounted, poll-based forecasts outperform vote-share market prices. Moreover, we show that win projections based on the polls dominate prices from winner-take-all markets.

Markets vs. polls as election predictors: An historical assessment:

When we have both market prices and polls, prices add nothing to election prediction beyond polls. To be sure, early election markets were (surprisingly) good at extracting campaign information without scientific polling to guide them. For more recent markets, candidate prices largely follow the polls.

This relates to my point above that one reason people aren’t always impressed by poll-based forecasts is that, from the polls, they already have a sense of what they expect will happen.

I sent the above discussion to economist David Rothschild, who added that, even beyond whatever predictive value they give,

Polling data (and prediction market data) are valuable for political scientists to understand the trajectory and impact of various events to get to the outcome. Prediction markets (and high frequency polling) in particular allow for event studies.

Good point.

Rothschild adds:

Duncan Watts and I have written extensively on the massive imbalance of horse-race coverage to policy coverage in Mainstream Media election coverage. Depending on how you count it, no more than 5-10% of campaign coverage, even at the New York Times, covers policy in any remotely informative way. There are a myriad of reasons to be concerned about the proliferation of horse-race coverage, and how it is used to distract or misinform news consumers. But, to me, that seems like a separate question (how much horse-race coverage should we have?) from making the best forecasts possible from the available data rather than reverting to earlier norms of focusing on individual polls without context (conditional on horse-race coverage, should we make it as accurate and contextualized as possible?).

Summary

I disagree with Grimmer et al. that we can't distinguish probabilistic election forecasts from coin flips. Election forecasts, at the state and national level, are much better than coin flips, as long as you include non-close elections, such as many states nowadays and most national elections before 2000. If all future elections are as close in the electoral college as 2016 and 2020, then, sure, the national forecasts aren't much better than coin flips, but then their conclusion is leaning very strongly on that condition. In talking about evaluation of forecasting accuracy, I'm not offering a specific alternative here; my main point is that the evaluation should use the vote margin, not just win/loss. When comparing to coin flipping, Grimmer et al. only look at predicting the winner of the national election, but when comparing forecasts, they also look at electoral vote totals.

I agree with Grimmer et al. that it is essentially impossible from forecasting accuracy alone to choose between reasonable probabilistic forecasts (such as those from the Economist and Fivethirtyeight in 2020 and 2024, or from prediction markets, or from fundamentals-based models in the Rosenstone/Hibbs/Campbell/etc. tradition). N is just too small; also, the models themselves, along with the underlying conditions, change from election to election, so it's not even as if there are stable methods to make such a comparison.

Doing better than coin flipping is not hard. Once you get to a serious forecast using national and state-level information and appropriate levels of uncertainty, there are lots of ways to go, and there are reasons to choose one forecast over another based on your take on the election, but you're not gonna be able to empirically rate them based on forecast accuracy, a point that Grimmer et al. make clearly in their Table 2.

Grimmer et al. conclude:

We think that political science forecasts are interesting and useful. We agree that the relatively persistent relationship between those models and vote share does teach us something about politics. In fact when one of us (Justin) teaches introduction to political science, his first lecture focuses on these fundamentals-only forecasts. We also agree it can be useful to average polls to avoid the even worse tendency to focus on one or two outlier polls and overinterpret random variation as systematic changes.

It is a leap to go from the usefulness of these models for academic work or poll averaging to justifying the probabilities that come from these models. If we can never evaluate the output of the models, then there is really no way to know if these probabilities correspond to any sort of empirical reality. And what’s worse, there is no way to know that the fluctuations in probability in these models are any more “real” than the kind of random musing from pundits on television.

OK, I basically agree (even if I think “there is really no way to know if these probabilities correspond to any sort of empirical reality” is a slight overstatement).

Grimmer et al. are making a fair point. My continuation of their point is to say that this sort of poll averaging is gonna be done, one way or another, so it makes sense to me that news organizations will try to do it well. Which in turn should allow the pundits on television to be more reasonable. I vividly recall 1988, when Dukakis was ahead in the polls but my political scientist colleagues told me that Bush was favored because of the state of the economy (I don't recall hearing the term "fundamentals" before our 1993 paper came out). The pundits can do better now, but conditions have changed, and national elections are much closer.

All this discussion is minor compared to horrors such as election denial (Grimmer wrote a paper about that too), and I'll again say that the total resources spent on probabilistic forecasting are low.

One thing I think we can all agree on is that there are better uses of resources than endless swing-state and national horserace polls, and that there are better things for political observers to focus on than election forecasts. Ideally, probabilistic forecasts should help for both these things, first by making it clear how tiny the marginal benefit is from each new poll, and second by providing wide enough uncertainties that people can recognize that the election is up in the air and it’s time to talk about what the candidates might do if they win. Unfortunately, poll averaging does not seem to have reduced the attention being paid to polls, and indeed the existence of competing forecasts just adds drama to the situation. Which perhaps I’m contributing to, even while writing a post saying that there are too many polls and that poll aggregation isn’t all that.

Let me give the last word to Sean Westwood (the third author of the above-discussed paper), who writes:

Americans are confused by polls and even more confused by forecasts. A significant point in our work is that without an objective assessment of performance, it is unclear how Americans should evaluate these forecasts. Is being “right” in a previous election a sufficient reason to trust a forecaster or model? I do not believe this can be the standard. Lichtman claims past accuracy across many elections, and people evaluated FiveThirtyEight in 2016 with deference because of their performance in 2008 and 2012. While there is value in past accuracy, there is no empirical reason to assume it is a reliable indicator of overall quality in future cycles. We might think it is, but at best this is a subjective assessment.

Agreed.

41 thoughts on "Why are we making probabilistic election forecasts? (and why don't we put so much effort into them?)"

  1. One thing that strikes me is that the campaigns themselves have all the forecast data they need, and are responding to data-based needs. I think we should assume a near coin-flip election for the foreseeable future simply because of a two-party feedback control system holding it there.

  2. This post has clarified an issue for me that I had not previously appreciated. That is the difference between polls and forecasts. From my simplistic view, polls are the raw material for forecasts. They could simply be used on their own, or forecasts can be developed using the polls (and other inputs) to derive probabilistic estimates. In that sense, polls alone are the most basic type of forecast (aside from coin flipping) – various forecast methodologies should be able to do better than simple poll aggregation, although you point out a number of issues regarding how to assess the evidence.

    What bothers me about both polls and forecasts is that you appear to treat them as exogenous to what they are trying to measure. Relevant actors respond to both by allocating their resources differently, changing messaging, and perhaps even shifting positions. The question is whether these things are good or not – what I view as an ethical, not a statistical, question. And here I have many concerns. I put little trust in any stated positions (I don’t know if they used to mean more, but in the past decade they seem to mean virtually nothing) and any reallocation of election resources is about winning the game – which I don’t view as the same as serving the public. The link between election results and public welfare is increasingly tenuous in my mind. I suppose I am one of those people losing my faith in democracy.

    • I have long believed that the real benefit of democracy on a large scale* is that it provides a nonviolent way to get rid of widely disliked leaders, not that it actually provides better leaders/decisions.

      *Because with about 700,000 people per Representative, they can’t meaningfully hear from all their constituents. On a local scale, or for small population nations, this problem is much less severe.

      • The papacy has changed hands mostly non-violently and mostly without serious challenges (yes, I know there are some major exceptions) for a couple of thousand years. Their secret is that they usually elect old popes who die in five or ten years.

        • I was going to say that's different because the Pope is a religious leader not a political one, but from Carolingian times to the 19th century the Pope was a monarch as well*, which still leaves over 500 years of basically peaceful succession between the end of the Western Schism/Avignon Papacy and the unification of Italy.

          Still the religious factor makes a big difference. Even when the Shoguns were effective total dictators of Japan, none of them overthrew the Emperor and declared themselves Emperor – the Emperor remained sacrosanct even if politically weak/powerless.

          Still, I think my point remains: democracy allows removing a leader peacefully, and that is arguably the real advantage (arguably one does not really get better leaders, since you get to choose only among people ambitious enough to want the job – thus arguably worse than randomly picked citizens.)

          *Ok technically he still is, but Vatican City’s population is like 800 and basically all clergy, so that doesn’t really count

  3. Evaluating how often the candidate who gets a plurality of the vote in a state is also the candidate that the model gives a greater than 1/2 probability of winning that state is an improper scoring rule that can be maximized by a dishonest predictor. Unfortunately, it is the only scoring rule that people use.

    There is no reason why people who publish probabilistic models couldn’t use a proper scoring rule, such as Expected Log Predictive Density (ELPD) of future polls, which is how Bayesians would evaluate any other predictive model. Much of Aki’s ELPD stuff was published after the 3rd edition of Bayesian Data Analysis but Richard McElreath explains it perfectly in section 7.2 of the second edition of his Statistical Rethinking. However, I don’t think I have ever seen an ELPD estimate for election-season models, and I don’t think these models perform particularly well at predicting future polls.

    If anyone were to estimate the ELPD via Pareto-Smoothed Importance Sampling (PSIS), which is what loo::loo does in R, they would presumably find that the estimate is overly sensitive to many past polls, which invalidates the estimator. However, Silva and Zanella (2022) shows how this issue with PSIS can be avoided using mixtures, which is quite simple and works really well and is explained in one of the vignettes for the loo package:

    https://mc-stan.org/loo/articles/loo2-mixis.html
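To make the improper-vs-proper point above concrete with made-up numbers: a forecaster who truly believes a race is 60/40 loses nothing on the "called the winner" metric by exaggerating the probability to 99%, but the expected log score (a proper scoring rule) punishes the exaggeration. A quick sketch, not tied to any particular published model:

```python
import numpy as np

true_p = 0.6                  # what the forecaster actually believes
reports = [0.6, 0.9, 0.99]    # honest report vs. increasingly exaggerated reports

for r in reports:
    # Expected "called the winner" rate: unchanged by exaggeration on the same side of 0.5
    hit_rate = true_p * (r > 0.5) + (1 - true_p) * (r < 0.5)
    # Expected log score: maximized by reporting the true belief (a proper scoring rule)
    log_score = true_p * np.log(r) + (1 - true_p) * np.log(1 - r)
    print(f"report {r:4.2f}: expected hit rate {hit_rate:.2f}, expected log score {log_score:7.3f}")
```

The hit rate is 0.60 for every report above 50%, while the expected log score is best for the honest 0.60 and gets steadily worse as the report is exaggerated.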

  4. “… the fact that elections can be predicted to within a couple percentage points of the popular vote given information available before the beginning of the campaign . . . that’s an important stylized fact about U.S. general elections for president, not something that’s true in all countries or all elections.” Did this used to be less true for presidential elections in this country? I started paying attention to politics in the 50s (I was a weird kid – the Democratic Party used to put out a little magazine called the Democratic Digest, and I read it regularly), and my memory is that political identities were not as fixed as they are now. It also seems to me that there was less difference between the parties, e.g., Earl Warren was the Republican governor of California before he was appointed to the Supreme Court.

    • Politics has changed dramatically since about the end of the Cold War, or maybe 2000ish. The parties have pulled farther apart, and states have become much more reliably R or D. Reagan won 49 states once!

    • So I disagree with Grimmer et al.’s claim that you would need many decades of future data to learn that a good probabilistic forecast is better than a coin flip.

      Seems to me that, just as forecasts can use averaging across models (with weighting based on track record of reliability), the value of probabilistic forecasts doesn't have to come from them as standalone forecasts. I don't just look at one and leave it at that. I look across the forecasts and do a kind of averaging. I guess averaging across forecasts that use different methodology might not be valid from a mathematical framework, but I dunno, I kind of can't help but do it.

      So then I don’t think it makes sense to compare individual forecasts to coin flips, because I compare a kind of averaging of forecasts and I would guess most people do also. The value of an individual forecast isn’t as a standalone, but as a part of a group.

      The other thing I remember here is how it seems to me to just be wrong to say a forecast that gave Trump a 30% chance of winning was “wrong” when Trump won. That should happen 1 out of 3 times. You can’t even say that one that gave him a 10% chance was “wrong.” Yeah, seems to me you’d have to run a lot of elections (at least 10?) to say the model is bad.

      I’m not sure I get the whole point of these comparisons – as if you could really evaluate which model is better in any generic way, across contexts (so much changes that might make one model better or worse in any given year).

  5. I'm a total outsider to this discussion, but here's something different. Most of us, pundits included, assess the state of an election on our reading of the people we're in contact with: what they think about the performance of the candidates and campaigns, the vibes, how they think the wider public will respond. There's a minus and a plus with this. The obvious minus is that our circle is not representative, and this is what polls try to overcome. The plus is that the information we take in is thick, whereas, AFAIK, most polling gives us thin data.

    What would be useful would be polling on some of the factors influencing ultimate outcome and not just the projected outcomes themselves. I assume this is what campaign organizations do as a matter of course. If I’m right, the issue raised in the OP reflects the more general question of how to combine thinner large-N data with thicker small and biased N.

    • My largest concern with polling this time is that the top line results, as I understand it, depend on what the pollster expects the composition of the electorate to be – thus heavily dependent on assumptions about turnout of different groups.

      So big questions for this election are whether:
      – low propensity voters who turned out for Trump in 2016/2020 but not for Republicans this time will show up;
      – abortion issue motivated voters who turned out in 2022 after the Dobbs decision will turn out disproportionately this time, or whether in a Presidential year electorate the effect will be muted vs a midterm

      But I am not by any means an expert, merely interested. I’d love clarification from more informed people on here.

  6. Andrew says, “in 2000, 2004, 2012, 2016, and 2020, the coin-flip forecast wasn’t so bad.”

    Let's break this down. We want our probabilistic forecasts to be calibrated in the same way we want our estimators to be unbiased. We also want our probabilistic forecasts to be sharp in the same way we want our estimators to be low variance. And just as in the estimation case, we may trade sharpness for calibration to reduce overall expected error in some cases.

    Over a long enough period, a coin flip might be reasonably well calibrated for elections on the whole. But now imagine someone predicts the winner every year and assigns that guess 100% probability. Those predictions are also well calibrated. The difference is that they’re maximally sharp—those are the best possible predictions someone could make about a binary outcome. Assuming it is calibrated, the coin flip is the worst prediction among calibrated predictions!

    The real question is, could we have done better? The answer may just be that there isn't enough information in what we're given—non-representative, non-missing-at-random poll results from a limited number of people coupled with whatever we can gin up about the fundamentals. The coin flip may be the best we can achieve given the information we have (i.e., our given covariates). For example, if all I tell you is that I'm going to flip a coin, and you have to call it in the air (after the flipping, but before it lands), then you should assign a 50-50 chance of heads. But that isn't "correct" in any fundamental metaphysical sense. Rather, it's just that you don't know the launch velocity, terrain, and environmental conditions well enough, and even if you did, you don't have a good enough simulator to predict the results. Laplace argued that a higher being (now affectionately known as "Laplace's Demon") could do that calculation—he was arguing that probabilities are epistemic and always based on what information you're given.
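A tiny numerical version of that calibration/sharpness contrast, under the assumption of a long run of true 50-50 races: the constant 50% forecast and an all-knowing forecast are both perfectly calibrated, but the Brier score separates them through sharpness.

```python
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.random(10_000) < 0.5          # assumed: a long run of true 50-50 races

coin = np.full(10_000, 0.5)                  # calibrated but maximally un-sharp
oracle = outcomes.astype(float)              # calibrated and maximally sharp

def brier(p, y):
    return float(np.mean((p - y) ** 2))

print("event frequency among the 50% forecasts:", round(float(outcomes.mean()), 3))  # ~0.5
print("Brier score, coin-flip forecaster:", brier(coin, outcomes))                    # 0.25
print("Brier score, oracle forecaster:   ", brier(oracle, outcomes))                  # 0.0
```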

    • Bob:

      The key is that there's more information in the data. It's not just that a model correctly predicts that Reagan would win in 1984; it's that it can predict he would win approximately 60% of the two-party vote. In 2000, 2004, 2012, and 2016, the elections were so close that it would not be reasonable to expect a good forecasting model to predict the winner with a probability much different from a coin flip.

    • If you are an election forecaster, and you predict 50/50, no one is going to be impressed. That is just what someone would say who knows nothing. Predicting 60/40 seems like the ideal prediction to me. If the favored candidate wins, you can claim that you correctly predicted it. Otherwise you say you missed what was essentially a coin toss.

      • Roger:

        Yes, good point: 60/40 is that sweet spot! This relates to something I wrote a couple months ago about incentives for forecasters.

        This is also one reason why we demand that forecasters explain their methods, and why we are disappointed when we can’t quite figure out what they’re doing and have to reverse-engineer them.

        There's also an annoying contradiction inherent in the current way that election forecasts are consumed. On one hand, we expect the election to be close and we already have tons of information from fundamentals and polls, so, with rare exceptions, new information will not be expected to change the forecast in any real way from day to day. On the other hand, consumers are desperate for horse-race news, so you get the sorry spectacle of forecasters staying relevant in the public eye by announcing that their forecast probability has shifted from 61.2% to 57.8% or whatever. The motivation for publicity is to turn the changing forecasts into news themselves.

        • > with rare exceptions, new information will not be expected to change the forecast in any real way from day to day.

          Another exception may be The Economist’s forecast which changes from “about a 1 in 2 chance” to “about a 3 in 5 chance” every other day :-)

        • > we’re not tweeting it out every day as if it’s news

          Maybe not every day – but shortly after your comment:

          https://x.com/TheEconomist/status/1829606030700863887

          “Our forecast reckons that Donald Trump and Kamala Harris have nearly equal chances”

          Actually The Economist has been sending tweets about the forecast most days since August 15 (15×3, 16×5, 17×4, 18×3, 21, 22×2, 27×2, 28×2).

          It’s curious that “52%”, “coin flip” and “nearly equal chances” are mentioned often but the “about a 3 in 5 chance” that has happened in different occasions in the last days was never mentioned explicitly (at least I cannot find it).

          If I'm getting it right, that 60% forecast was published between the 23rd and the 26th – a period when The Economist was just unusually quiet on Twitter about the forecast.

          The Harris probability was 60% again on the 29th but there was no tweet on that day either. They went directly from "nearly equal chances […] our forecast model finds" on the 28th to "our forecast reckons […] nearly equal chances" yesterday.

          So I guess not only you/they “are not tweeting it out every day as if it’s news” but there may be an intentional effort to not tweet it out when the forecast is not 50/50.

    • > a coin flip might be reasonably well calibrated for elections on the whole

      It’s really easy to get perfect calibration: pick at random one of the two parties and assign 50% probability to that party winning. (At least if we can assume that the outcome will always be that either one or the other wins.)

      • … and then you get a couple of faithless electors, nobody gets 270 electoral votes, and the House picks the third party candidate who got a couple of faithless-elector votes in a contingent election!

        (Ok, that is ridiculously unlikely, but maybe not more so than some of the 538 tail outcomes that show a margin of victory of like 500 EV.)

  7. In terms of reasons for constructing forecasts, another way to think about it is in relation to all the research related to "clinical vs statistical judgment" that's shown that when people have access to approximately the same information as a statistical model (and sometimes also when they have access to additional relevant information), the statistical model will generally outperform them. Given demand for forecasts, it would seem irrational for society to not at least try to statistically aggregate the information we think might be relevant to elections.

      • I think their perceived lack of helpfulness in practice is largely because the popular demand for forecasts often revolves around the binary outcome of who wins. This is the task-relevant prediction to many people, who don’t really care about how close the election is if their candidate loses. But this is also the prediction that’s hardest to evaluate. So something that seems rational in theory starts to seem potentially harmful or ill-motivated in practice.

        But it’s also hard to imagine a satisfying evaluation of whether people are worse off when they use forecasts. There are many possible actions/behaviors/states of mind etc that might be influenced by my having knowledge of an election forecast. The right thing to do conditional on a forecast is going to depend on my personal values and that’s hard to elicit. Attempts to study how people use election forecasts in lab settings can tell us something about possible directions of effects, e.g., if we present the forecast this way, people seem to get more overconfident, but it’s hard to say how much these effects might matter in reality. So the net benefit question seems unlikely to be answered, and we’re left with arguments like, It makes sense in theory.

        • Yeah.

          I think forecasts do affect people’s behavior, but how? If their preferred candidate is behind, are they less likely to vote because people don’t want to back a loser? But if their preferred candidate is too much ahead will some people not bother to vote? (I’ve heard that argument for 2016, but don’t entirely buy it because if that were a major factor seems like you’d see a lot fewer “vote only in Presidential years” voters in deep red/deep blue states.)

          But I also wonder if the forecasts make people feel more like the election is something out of our control. (Of course, living in a non swing state my vote will not have any effect on the Presidential outcome, but maybe on a potentially competitive Senate race?)

  8. How does 1.2bn bullshit reads factor into polls and forecasts?
    “According to a new report from the Center for Countering Digital Hate, Musk himself has posted 50 false election claims on X so far this year. They’ve got a total of 1.2bn views. None of them had a “community note” from X’s supposed fact-checking system”

    • Kt2:

      I dunno. Usually we act as if media effects are already baked into the polls, but that can't be right, as they're really more like fundamentals in that they represent an environment that exists before the campaign begins. I would think that Fox, Twitter, and other partisan news sources would have their effects, although I've been skeptical about claims of large effects (see also here).

      • I suppose media shifts once the campaigning has begun might be different?

        But honestly I think any model is going to have tons of uncertainty: this is the first presidential election in the modern highly polarized era with a major-party candidate change this late. It’s easy to write narratives for:
        – a close but definite Trump win (“Harris is still basically polling as Generic Democrat, and that can’t last” or “more voters are reluctant to vote for a female President than polling currently suggests” or “Trump is still being underestimated by polling”)
        – a 2000-style nail-biter, possibly resolved by the SC since recounts haven't finished in time ("the current national polling averages are right, but the EC-PV gap hasn't actually shrunk since 2016/2020, so about D+3.5 nationally means basically zero margin in the deciding state")
        – a close but definite Harris win (same as above but the EC-PV gap *has* shrunk)
        – a strong near Obama 2008 Harris win (either “the debate will show how much Trump has aged and he will look weak, so turnout for him will be much depressed” or “Dobbs decision effect isn’t captured in the polling”)

        My only real prior here is that the election will be close by historical standards, as all recent ones have been (even Obama 2008 is not that dramatic by 20th century standards, and I don’t think 2008 margins are possible today). That’s why I don’t really think Harris is being underestimated by a Trump in 2020 margin, even with possible overcorrection by pollsters, because I don’t think a better than +6 PV margin is plausible in today’s polarized environment.

        I don’t think Biden would have ended up quite as badly as he was polling in July (I don’t see NJ as becoming a swing state for example), either.

  9. Just glanced at today’s (8/31) fivethirtyeight presidential polls & spotted an error…not in statistical methodology, but in grade-school math. 47.1 minus 43.8 shows a difference of 3.3, not 3.2. Not a huge error, but makes me wonder what else 538/ABCNews gets wrong.

    • Tes:

      This could be a rounding thing. For example:
      47.06 rounds to 47.1
      43.83 rounds to 43.8
      If you subtract the rounded quantities, you get 47.1 – 43.8 = 3.3.
      But if you subtract the full numbers, you get 47.06 – 43.83 = 3.23, rounds to 3.2.

  10. The framing about “recent failures” of election forecasts always strikes me as weird. A probabilistic forecast model saying for example that Clinton had a 70% chance of winning (as reputable forecasts did in 2016) isn’t falsified by the fact that she lost the EC. It simply means that an event estimated to have a 30% probability occurred, which happens every day.

    Only if several times in a row the event rated less likely occurred can we start to assess that the model is probably wrong.

    I still agree with the main point that these forecasts don’t deserve as much attention.

  11. Strongly agree with the idea that this paper’s surprising finding seems to hinge on collapsing all the details behind the models into a binary win/loss prediction.

    What makes the models convincing and interesting (in a way that does not take 50 years to verify) is that the headline predictions come from aggregating lots of smaller predictions about voting all over the country. After the votes come in, there are hundreds of model checking comparisons we can make (as Andrew has in various forecast postmortem blog posts).

    At the risk of just rattling on about the painfully obvious, this is why campaigns strategize with probabilistic forecasts as opposed to coin flips.

  12. >this is why campaigns strategize with probabilistic forecasts as opposed to coin flips.

    I would love to know which forecasts the two campaigns are looking at.

    Although they probably know stuff the models don’t (like how their candidate is preparing for the debate).

      • Am I the only person bothered by the idea of campaigns strategizing with probabilistic forecasts? It only seems natural that they should care about public opinion, but I have no confidence that their strategizing has any meaningful relation to what a candidate or party will actually attempt to do – rather, I see it as a means for them to gear their messaging to what people want to hear. My own belief is that messaging and policy are almost two divergent sets. I suppose long term policy choices might be impacted by what polls are saying – for example, if the public strongly opposes aid to Ukraine, then it might affect how a candidate/party might make subsequent decisions. But in the heat of a race, I don't see that happening. Instead, I think the poll results will affect how much an issue is emphasized or how a candidate speaks about an issue. But I don't see any of those statements as binding at all, so as they say, "talk is cheap."

      • Dale:

        As always, the alternative to good statistics is not “no statistics,” it’s “bad statistics.” Campaigns will always be strategizing based on their expectations of the elections. It makes sense for them to form those expectations as best they can based on fundamentals, polls, past elections, and other information available to them.

        • I agree. But I am questioning whether “good statistics” in any way equals “good public policy.” In some ways, a better analysis of the polls may lead to worse policies (and more dishonesty). Am I being too skeptical here?

        • Dale:

          Yes, it could be that knowing more will lead to bad outcomes. This is something I worry about with all my work. I’m doing technical work, making it easier for people to learn from data. This can help the bad guys as well as the good guys. I don’t really know what to think about it. My usual perspective is that, on the whole, scientific and technological progress has been beneficial to humanity, but (a) that doesn’t mean this will continue to be the case, and (b) it’s not gonna be true of every individual scientific and technological development.

          Even in baseball, people have argued that sabermetrics has made the sport worse, by focusing it on the notorious “three true outcomes.”
