Some thoughts on election forecasting

I’ve written a lot on polls and elections (“a poll is a snapshot, not a forecast,” etc., or see here for a more technical paper with Kari Lock) but had a few things to add in light of Sam Wang’s recent efforts. As a biologist with a physics degree, Wang brings an outsider’s perspective to political forecasting, which can be a good thing. (I’m a bit of an outsider to political science myself, as is my sometime collaborator Nate Silver, who’s done a lot of good work in the past few years.)

But there are two places where Wang misses the point, I think.

He refers to his method as a “transparent, low-assumption calculation” and compares it favorably to “fancy modeling” and “assumption-laden models.” Assumptions are a bad thing, right? Well, no, I don’t think so. Bad assumptions are a bad thing. Good assumptions are just fine. Similarly for fancy modeling. I don’t see why a model should get credit for not including a factor that might be important.

Let me clarify. If a simple model with only a couple of variables does as well, or almost as well, as a complicated effort that includes a lot more information, then, sure, that’s interesting and suggests that all that extra modeling isn’t getting you much. Fine. But I don’t see that there’s anything wrong with putting in that additional info. In the elections context, it might not change your national forecasts much but it might help in individual districts.

Or, as Radford Neal put it in one of my favorite statistics quotes of all time:

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

Wang’s other mistake, I think, is his claim that “Pollsters sample voters with no average bias. Their errors are small enough that in large numbers, their accuracy approaches perfect sampling of real voting.” This is a bit simplistic, no? Nonresponse rates are huge and pollsters make all sorts of adjustments. In non-volatile settings such as a national general election, they can do a pretty good job with all these adjustments, but it’s hardly a simple case of unbiased sampling.

Finally, I was interested in Wang’s claim that Nate’s estimates overstated their uncertainty. This would be interesting, if true, given the huge literature on overconfidence in forecasting.

Yang Hu gathered Nate’s latest probability forecasts for me, along with the election outcomes, and I checked the calibration as follows. I divided the House elections into bins: those where Nate gave a 0-10% chance of the Republican winning, a 10-20% chance, etc., all the way up to 90-100%. For each of the ten categories, I counted the number of elections in that bin and the percentage where the Republicans actually won.

If the forecasts are perfectly calibrated, we’d expect to see the empirical frequency of Republican wins go smoothly from 0 to 1 as the bins go from forecast probability 0 to 1. If the forecasts are overconfident, we’d expect to see empirical frequencies closer to 0.5. If forecasts are underconfident (as Wang alleged), we’d see empirical frequencies closer to 0 and 1.
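The binning check described above can be sketched in a few lines of Python. This is only an illustration of the general procedure, not the code used for the tables in this post; the function name `calibration_table` and the toy forecasts in the usage lines are made up for the example.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Bin forecast probabilities and compare them with observed win frequencies.

    probs    -- forecast probabilities of a Republican win, each in [0, 1]
    outcomes -- 1 if the Republican actually won, else 0
    Returns one (bin_low, bin_high, n_cases, empirical_freq) tuple per bin;
    empirical_freq is None for an empty bin.
    """
    counts = [0] * n_bins
    wins = [0] * n_bins
    for p, won in zip(probs, outcomes):
        # A forecast of exactly 1.0 is placed in the top bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        wins[b] += int(won)
    rows = []
    for b in range(n_bins):
        freq = wins[b] / counts[b] if counts[b] else None
        rows.append((b / n_bins, (b + 1) / n_bins, counts[b], freq))
    return rows


# Toy data (hypothetical, not the actual 2010 forecasts).
toy_probs = [0.05, 0.05, 0.65, 0.65, 0.95]
toy_outcomes = [0, 0, 1, 1, 1]
toy_rows = calibration_table(toy_probs, toy_outcomes)
```

With real data, each row would be read the same way as the tables in this post: a bin whose empirical frequency sits far from the bin's forecast range is a sign of miscalibration, subject to the small counts in the middle bins.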

Here’s what we actually found:

Forecast R win prob    #cases    Empirical R win freq
     0-10%               165            0.01
    10-20%                11            0.27
    20-30%                 5            0.20
    30-40%                11            0.27
    40-50%                 6            0.50
    50-60%                12            0.67
    60-70%                10            0.90
    70-80%                10             1
    80-90%                19             1
   90-100%               186             1

So, yes, Nate’s forecasts do seem underconfident! Out of the 39 races where he gave the Republican candidate between 60% and 90% chance of winning, the Republicans snagged 38. Apparently he could’ve tightened up his predictions a lot. Wang appears to be correct.

(Since the original posting, I updated the above table to include all the election results.)

But . . . before jumping on Nate here, I’d suggest some caution. As we all know, district-by-district outcomes are highly correlated. And if you suppose that the national swing as forecast was off by 3 percentage points in either direction, you get something close to calibration.
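To see why correlation matters, here is a rough sketch under invented assumptions: every district's error is a shared national swing error plus an independent district error, both normal, with standard deviations of 4 and 3 points. Those numbers, the grid of forecast margins, and all the names in the code are hypothetical and are not taken from Nate's model. The forecast probability uses the total uncertainty, while the win frequency in any single election depends on that year's one realized swing.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


S_NAT = 4.0   # assumed sd of the national swing error, in percentage points
S_DIST = 3.0  # assumed sd of the independent district-level error
S_TOTAL = math.hypot(S_NAT, S_DIST)  # sd of the combined error


def forecast_prob(margin):
    """Forecast P(Republican win), given a forecast margin (R minus D)."""
    return norm_cdf(margin / S_TOTAL)


def win_prob_given_swing(margin, delta):
    """P(Republican win) once the national swing error delta is known."""
    return norm_cdf((margin + delta) / S_DIST)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical districts with forecast margins from -30 to +30 points.
margins = [m / 2.0 for m in range(-60, 61)]
bin_margins = [m for m in margins if 0.6 <= forecast_prob(m) < 0.9]

mean_forecast = mean(forecast_prob(m) for m in bin_margins)

# One election in which the national forecast happened to be exactly right.
freq_no_error = mean(win_prob_given_swing(m, 0.0) for m in bin_margins)

# Long-run frequency: average over the assumed distribution of swing errors.
deltas = [S_NAT * k / 100.0 for k in range(-600, 601)]
weights = [math.exp(-0.5 * (d / S_NAT) ** 2) for d in deltas]
freq_long_run = sum(
    w * mean(win_prob_given_swing(m, d) for m in bin_margins)
    for d, w in zip(deltas, weights)
) / sum(weights)
```

In this toy setup, `freq_long_run` matches `mean_forecast`, so the forecasts are calibrated across many elections, yet `freq_no_error` is noticeably higher: in a year where the national picture comes in as forecast, the 60-90% bin looks underconfident even though nothing is wrong with the probabilities.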

To put it another way: In statistics, we’re always looking for 95% confidence intervals. In experimental physics, I wouldn’t be surprised if 99% is the standard. But what does a 95% interval mean in political terms? Midterm elections occur every four years–so, if you want an interval that is correct 19 times out of 20, you have to account for 80 years of contingencies. And nobody would consider fitting their models to 80-year-old election data, at least not without a lot of adjustment.

So, to put it another way: if you really want 95% intervals and true calibration, you’ll need uncertainties that are wide enough so that, most of the time, you’re gonna look overconfident. I don’t see any easy answer here, but it’s an issue which, as a Bayesian election modeler, I’ve been aware of for awhile. Usually I just take whatever model probabilities are given to me and go from there, without trying to think too hard about their calibration. That is, I’ll either take wide Silver-style intervals and treat them as all-encompassing forecasts, or I’ll take narrow Wang-style intervals and treat them as conditional on a model.

It’s ironic that Wang characterizes his method as less assumption-laden than Nate’s. It’s simpler and more transparent, and I agree that these are virtues, but ultimately I think it’s more model-based in the sense that one has to rely strongly on a model to map Wang’s poll averages to election predictions. That’s fine–I love models–I just think there’s room for endless confusion when “assumption-laden” is used as a putdown.

Full disclosure: I (among others) gave Nate a few suggestions on combining information for his forecasting model. But the model and the effort behind it are Nate’s own.

P.S. Just for laffs, we also evaluated the calibration of another set of forecasts. The Huffington Post only gave probability forecasts for 118 battleground races, which I augmented by taking all of Nate’s essentially certain races (probabilities less than .02 or more than .98) and counting them as certain for Huffington Post as well. Here’s the calibration summary for Huffington:

Forecast R win prob    #cases    Empirical R win freq
     0-10%               130            0.01
    10-20%                24            0.29
    20-30%                 7            0.43
    30-40%                 6            0.17
    40-50%                 4            0.75
    50-60%                10            0.60
    60-70%                 7            0.86
    70-80%                11            0.90
    80-90%                25             1
   90-100%               142             1

P.P.S. Sorry about the ugly formatting. Serves me right for using tables, I suppose.

13 thoughts on “Some thoughts on election forecasting”

  1. I don't mean to nitpick, but this

    So, yes, Nate's forecasts do seem overconfident!

    threw me off. You mean "underconfident" here, right?

  2. Professor Gelman, I am interested to know what you think about what to me is the most jarring outcome of Tuesday night: the consensus of generic ballot polls performed very well in predicting the final vote share (R+7), but the translation into seats didn't match historical standards. Specifically, the GOP strongly overperformed in seats (56%) compared to its 2-way vote share (53%), when history suggests the exact reverse should happen to them as the minority party (e.g. Kastellec 2006, 2008).
    Anecdotally, this suggests that the incumbency advantage is actually negative this year. Might applying an inverse incumbency correction (penalizing instead of rewarding incumbency) to Bafumi-style forecasts have accorded better with the results?

    Another way to look at this is that while the generic ballot average was on the mark, local congressional district polls had a systematic bias of about D+2 (Mark Blumenthal at Pollster). One might read this as voters hedging an anti-Democratic/anti-incumbent vote at the level of specific candidates that is only revealed with a generic ballot-type question.

  3. "So, to put it another way: if you really want 95% intervals and true calibration, you'll need uncertainties that are wide enough so that, most of the time, you're gonna look overconfident. I don't see any easy answer here, but it's an issue which, as a Bayesian election modeler, I've been aware of for awhile. "

    I produced a Bayesian forecasting model this cycle that ended up being pretty well calibrated, so it isn't a fundamental property of Bayesian modeling.

    Nate's issue is probably in teasing out national-level and district level swings. He employs an empirical approach, determining how much noise to add by looking at how much election results have diverged from the polls X days out. He does this on a race level and on a national level. This double-counts variance, as he acknowledged a while back, but he prefers to be conservative, and probably figures that the extra variance compensates for ignoring the variance in his pollster rankings and house effect estimates.

    The deeper issue is that in order to properly tease out these effects, you need to actually write out a coherent model of why and how opinion moves, and this isn't compatible with his "figure out an optimal weight for a poll and then take a weighted average" approach.

    It's also pretty hard: there are a lot of things that need to be taken into account in order to capture all of the variance, including house effects, design effects, industry bias, hyper-parameters for race variance, race-to-race correlation, etc.

    Just as importantly, it seems that the political audience doesn't trust large Bayesian models yet, which is probably why Nate has been forced to do everything via nested regressions.

  4. Sebastian etc:

    Typo fixed. I guess I'm so used to talking about overconfidence that I couldn't help typing it!


    Much depends on the details of the districts–how many are near the 50% boundary, after adjusting for incumbency. I haven't yet done a seats-votes analysis of 2010 so I can't be sure. It's possible that incumbency advantage is lower this year which could explain some of the pattern you noticed.


    1. Nate has an econ degree, so it's possible he never learned much about Bayesian methods. I agree with you that, ultimately, a probability model is the best way to model uncertainty. Weighting and playing with error bars can only take you so far.

    2. When you say that your model "ended up being pretty well calibrated," that's still just an n=1 of national elections (or maybe n=3 or 5 or whatever if you fit your model to several past elections). There's still the question of calibration with respect to unexpected national events, no?

  5. Andrew: Think I know what you mean re: "Weighting and playing with error bars can only take you so far"

    but _given_ the probability model and even just a bit of data (or slightly more) and transformations you should be able to get weights and error bars to adequately approximate things?

    (David Cox worked out the technical details somewhere)


  6. Belegoster:

    One thing that's interesting is that the previously Republican-held open seats showed a six percent swing *toward* the Democrats, even in the context of a large national swing in the opposite direction. Or at least that was what the polls had shown; I don't know if it's held up.

    Since the Democrats obviously wouldn't have lost if incumbency advantage was that high, this seems to be a good example of Gelman's paper that asserted that Incumbency advantage varies widely from candidate to candidate.


    That's a good point. You could argue that if my probabilities seem well-calibrated, in the presence of race-to-race correlation, then they can't be well calibrated!

    94.7% of the results ended up being within my model's 95% confidence interval, and the looser CI's held up near-perfectly. But in the absence of national variation, that might actually be an ill sign.

    Also, estimating calibration via winning percentages is a bit hard, since it's especially sensitive to national shifts and most races were near 0 or 1. Perhaps if calibration is looked at by checking the accuracy of his observation CI's, it'd end up looking a lot better.

  7. Regarding the "assumption-laden models", one of my favourite quotes, I think by Wilfrid Kendall, is that "statisticians buy information with assumptions".

    Even apparently simple and transparent methods have lurking assumptions–usually linearity or iid-ness–which can often be wildly inappropriate.

  8. I've always thought that this would be a good way to evaluate intelligence analysis. Force analysts to put their predictions in probability clusters – maybe not 10, but say 4 categories – and then promote/reward the ones not who got the most predictions right, but who could best calibrate their uncertainty by matching the number of realized events to the probabilities they attached to them.

    My thought was that if we only reward prognosticators who get predictions correct, we bias the prognosticators towards making conservative predictions (ie., predicting high likelihood outcomes). With House Races this might be less of a problem because there is a well defined universe of "predictions" you need to make, but in intelligence analysis in many cases the analyst is also picking what to bet on, in addition to which bets to place.

  9. A good example is the Bank of England's forecasts. They make probability distributions for inflation & GDP, and in a review in 2005 they found that they were being under-confident:

    And then the 2008 outcomes were, relative to 2007 forecasts, so far out in the tails, they had 1/300K probability (inflation) or 1/30Tn probability (GDP).

    (You could say the problem was either they were using a too-thin-tailed distribution, or that it was thin tailed but variance changes over time).

  10. Joshua:

    There's some research in the decision analysis literature on this. The short answer is that if you have what's called a "proper scoring rule" for rewarding accuracy or probability judgments, it will automatically provide an incentive for calibration as well.

  11. Still one instance of "overconfident" needs changing to "underconfident", right?

    So, to put it another way: if you really want 95% intervals and true calibration, you'll need uncertainties that are wide enough so that, most of the time, you're gonna look overconfident.

    Unless I misunderstand.

Comments are closed.