What would it mean to really take seriously the idea that our forecast probabilities were too far from 50%?

Here’s something I’ve been chewing on that I’m still working through.

Suppose our forecast in a certain state is that candidate X will win 0.52 of the two-party vote, with a forecast standard deviation of 0.02. Suppose also that the forecast has a normal distribution. (We’ve talked about the possible advantages of long-tailed forecasts, but for the purpose of this example, the precise form of the distribution doesn’t matter, so I’ll use the normal distribution for simplicity.)

Then your 68% predictive interval for the candidate’s vote share is [0.50, 0.54], and your 95% interval is [0.48, 0.56].

Now suppose the candidate gets exactly half of the vote. Or you could say 0.499, the point being that he lost the election in that state.

This outcome falls right on the boundary of the 68% interval; it’s only one standard deviation away from the forecast. In no sense would this be called a prediction error or a forecast failure.

But now let’s say it another way. The forecast gave the candidate an 84% chance of winning! And then he lost. That’s pretty damn humiliating. The forecast failed.

Here we might just stop and say: Ha ha, people can’t understand probability.

But I don’t want to frame it that way. Instead, flip it around. If you don’t want to go around regularly assigning 84% probabilities to this sort of event, then, fine, assign a lower probability that candidate X wins, something closer to 50/50. Suppose you want the candidate to have a 60% chance of winning. Then you need to do some combination of shifting his expected vote toward 0.5 and increasing the predictive standard deviation to get this to work.

So what would it take? If our point prediction for the candidate’s vote share is 0.52, how much would we need to increase the forecast standard deviation to get his win probability down to 60%?

Let’s start with our first distribution, just to check that we’re on track:

> pnorm(0.52, 0.50, 0.02)
[1] 0.84

That’s right. A forecast of 0.52 +/- 0.02 gives you an 84% chance of winning.

We want to increase the sd in the above expression so as to send the win probability down to 60%. How much do we need to increase it? Maybe send it from 0.02 to 0.03?

> pnorm(0.52, 0.50, 0.03)
[1] 0.75

Uh, no, that wasn’t enough! 0.04?

> pnorm(0.52, 0.50, 0.04)
[1] 0.69

0.05 won’t do it either. We actually have to go all the way up to . . . 0.08:

> pnorm(0.52, 0.50, 0.08)
[1] 0.60

That’s right. If your best guess is that candidate X will receive 0.52 of the vote, and you want your forecast to give him a 60% chance of winning the election, you’ll have to ramp up the sd to 0.08, so that your 95% forecast interval is a ridiculously wide 0.52 +/- 2*0.08, or [0.36, 0.68].
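
As an aside, you can get that number directly rather than by trial and error: we need (0.52 - 0.50)/sd = qnorm(0.60), so sd = 0.02/qnorm(0.60):

> 0.02/qnorm(0.60)
[1] 0.079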

Here’s the point. If you really want your odds to be as close to even as 60/40, and you don’t want to allow really extreme outcomes in your forecast, then You. Have. To. Move. Your. Point. Prediction. Toward. 0.50.

For example, here’s what you get if you move your prediction halfway to 0.50 and also increase your uncertainty:

> pnorm(0.51, 0.50, 0.03)
[1] 0.63

Still a bit more than 60%, but we’re getting there.
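
Any combination with (point prediction - 0.50)/sd = qnorm(0.60), which is about 0.25, gives exactly a 60% win probability. For example, keeping the sd at 0.03, the point prediction has to come all the way down to about 0.508:

> 0.50 + 0.03*qnorm(0.60)
[1] 0.5076
> pnorm(0.5076, 0.50, 0.03)
[1] 0.60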

And what does this imply for election forecasts?

If our probabilistic forecast of candidate X’s vote share is 0.52 +/- 0.02, that would traditionally be considered a “statistical tie” or “within the margin of error.” And we wouldn’t feel embarrassed if candidate X were to suffer a close loss: that would be within the expected range of uncertainty.

But, considered as a probabilistic forecast, 0.52 +/- 0.02 is a strong declaration, a predictive probability of 84% that candidate X wins. 5-to-1 odds.

How is it that you can offer 5-to-1 odds based on a “statistical tie”???

What to do?

It seems like we’re trapped here between the irresistible force and the immovable object. On one hand, it seems weird to go around offering 5-to-1 odds to something that could be called a statistical tie; on the other hand, if we really feel we need to moderate the odds, then as discussed above we’d have to shift the forecast toward 0.50.

I’m still not sure on this, but right now I guess, yeah, if you really don’t buy the long odds, I think you should be shifting the prediction.

It would go like this: You do your forecast and it ends up as 0.52 +/- 0.02, but you don’t feel comfortable offering 5-to-1 odds. Maybe you only feel comfortable saying the probability is 60%. So you have to shift your point prediction down to 51% or maybe lower and also increase your uncertainty.

This now looks a lot like Bayesian inference—but the hitch is that your original 0.52 +/- 0.02 was already supposed to be Bayesian. The point is that the statement, “you don’t feel comfortable offering 5-to-1 odds,” represents information that was not already in your model.

So your next job is to step back and ask, why don’t you feel comfortable offering 5-to-1 odds? What’s wrong with that 84% probability, exactly? It’s tempting to just say that we should be wary about assigning probabilities far from 50% to any event, but that’s not right. For example, it would’ve been nuts to assign anything less than a 99% probability that the Republican candidate for Senate in Wyoming would cruise to victory. And, even before the election, I thought that Biden’s chance of winning in South Dakota was closer to 1% than to the 6% assigned by Fivethirtyeight. We talked about this in our recent article, that if you’re a forecaster who’s gonna lose reputation points every time a forecast falls outside your predictive interval, that creates an incentive to make those intervals wider. Maybe this is a good incentive, as it counteracts other incentives for overconfidence.

But then a lot has to do with what’s considered the default. Are the polls the default? (If so, which polls?) Is the fundamentals-based model the default? (If so, which model?) Is 50/50 the default? (If so, popular vote or electoral vote?)

So I’m still working this through. The key point I’ve extracted so far is that if we want to adjust our model because a predictive probability seems too high, we should think about shifting the point prediction, not just spreading out the interval.

From a Bayesian standpoint, the question is, when we say that the probability should be close to 50/50 (at least for certain elections), what information does this represent? If it represents generic information available before the beginning of the campaign, it should be incorporated into the fundamentals-based model. If it represents information we learn during the campaign, it should go in as new data, in the same way that polls represent new data. Exactly how to do this is another question, but I think this is the right way of looking at it.
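
To make that last point concrete, here’s a minimal sketch of what “going in as new data” could look like, using a simple normal-normal update. The numbers are purely illustrative, not our actual model: a “close election” prior with sd 0.015 and a poll-based estimate of 0.52 with its sd inflated to 0.03 to allow for possible systematic bias.

> prior_mean <- 0.50; prior_sd <- 0.015  # "elections are close" prior (illustrative)
> poll_mean <- 0.52; poll_sd <- 0.03     # poll-based estimate, sd inflated for possible bias
> post_var <- 1/(1/prior_sd^2 + 1/poll_sd^2)
> post_mean <- post_var*(prior_mean/prior_sd^2 + poll_mean/poll_sd^2)
> round(c(post_mean, sqrt(post_var)), 3)
[1] 0.504 0.013
> pnorm(post_mean, 0.50, sqrt(post_var))
[1] 0.62

The point prediction gets pulled toward 0.50, and the implied win probability drops from 84% to about 62%, which is the flavor of adjustment discussed above.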

174 thoughts on “What would it mean to really take seriously the idea that our forecast probabilities were too far from 50%?”

  1. I think this is a very interesting point, but isn’t there also an extra issue with symmetry? I know you’re using normality by way of simplification, but what if the underlying predictive were highly skewed? Would this alleviate the trade off between “location” and “scale” of the prediction?

    Essentially, I believe you’re right in the point about wanting to dilute your estimates because you think they’re overly precise, which indicates you’ve missed something in the model specification (maybe the prior. Or maybe the sampling distribution should have its variance inflated for measurement error. Or something…).

    • I think this is a good point. I wonder about the use of the gaussian errors. Should they really be symmetric and not highly skewed? 0.52 is very central on the interval [0,1] so you might think this is a good bet. On the other hand, is the interval really [0,1]? Don’t you have a pretty good prior that in a highly polarized case you’ll never see anything near 0.8, maybe even 0.7, 0.6. How far do you take it? I don’t know, but it seems to me that might be worth digging into.

      Maybe I’m wrong, but it would be interesting to look at the evidence we have. Perhaps the errors tend to fall systematically more toward the center than toward the extremes (i.e. the actual votes are closer to 0.5 than the polling, which is closer to either 0 or 1).

      Might be wrong, but interesting to consider anyway.

  2. Andrew,
    One thing that might make statisticians uneasy is the apparent target you set of a posterior estimate of 60%. It would seem preferable to reach a posterior from transparent priors, rather than adjust it based on how comfortable we are with it.

    I have been wondering about our prior on systematic polling error. It seems that extreme aggregate polls add credence to the possibility of high systematic error; this approach might help model some of the uncertainty you touch on in this post.

  3. This is not directly addressing the question at hand, but this discussion reminds me of a seemingly fundamental distinction between the Economist model, which seems to be based on more objective / rigorous prior specifications, and the 538 forecast, which as somebody put it to me, was more of a “gambler’s forecast”: the modeler had a prior subjective belief that things were more uncertain than they otherwise looked, went looking for things that would reflect that uncertainty, and doesn’t care how politically realistic the tails are because getting the uncertainty (more) right is what matters. Both of these models of course are now getting the same criticisms anyway, so a whole lot of good that did.

    This has made me think a lot about the problem of a subject matter expert who is personally convinced an outcome is much more likely to be different from what otherwise-reasonable data seems to suggest. Ultimately that expectation can be judged in hindsight, but properly / rigorously accounting for that modeler unease up front seems like a hard problem. Maybe your prior ends up being largely subjective, you disclose this fact, and people can take it or leave it up until scoring time shows who is right? To use a more specific example, if Ann Selzer tells you (or her polling tells you) that you should have a heavily informative prior of R+7 for Iowa, such that it would largely dominate the likelihood generated by any other polling you might conduct of the state, isn’t that actually a reasonable thing to do? And how do we figure out who the Ann Selzers are? Many people are great prognosticators until they aren’t.

    I realize that this dog can chase its tail all day long but there are a lot of fundamental questions that keep coming up.

    • This.

      Before that Ann Selzer poll, we had every reason to believe poll results were almost as steady as a rock heading into the election. For Iowa, 538 had the odds of a Biden win at 50-50 on Oct 28. His chances of winning dropped to 35% by Nov. 1. The Selzer poll seemed to be the sole contributor to this drop, as far as Iowa polls go.

      Ultimately, adding ~2-3 points in the direction of Trump for every poll of this election would seem to have been the easiest way to incorporate the systematic error into forecasts, but why? It seemed that in the lead-up to the election, Nate Silver was explicitly frustrated at suggestions that systematic error would still occur in 2020 as it did in 2016. That’s enough to suggest to me that 538’s priors did not account for this possibility.

    • This aligns with my first reaction too; currently the pervasive understanding seems to be that the forecaster’s role is simply to see what the data say, whereas adjusting whatever it takes to get the betting odds one thinks are correct suggests that it’s the forecaster’s implicit knowledge that should make the final call. Which is fine, but it seems somewhat opposed to expectations about forecasts (which may be unrealistic, suggesting we need to reframe election forecasting more drastically).

      Something about this also makes me think about what it means to update one’s beliefs like a Bayesian. A Bayesian belief stream requires movement to equal uncertainty reduction in expectation (http://faculty.haas.berkeley.edu/ned/AugenblickRabin_MovementUncertainty.pdf), such that for a coin flip example, the further your beliefs get from 50/50, the more certain you should be. But if we want to make assertions about odds, eg due to ontological uncertainty when forecasting events as complex as elections, you could see that as constraining the movement of the belief stream, so that it can’t ever move too far from 50/50.

      • It feels like, if your model parameters make sense and you get unexpected results, either you are wrong or your model doesn’t have all the data it needs.

        In the case of elections, for a few years now a portion of the population hasn’t taken part in polls or hasn’t been honest about their intent. So the question shifts to: what would be a good data source to correct for that?
        I feel if you take all the polls from previous presidential elections and model a time trend with the gap between the last poll and the results, the model will be a lot less certain about the outcome.
        Now, I have never done those types of analysis so I have no technical idea about the how.

        To come back to the theoretical level, maybe it is also a good test of the quality of your model. If you add the 2nd most informative data source to your model, by how much does its outcome shift?

  4. I wonder if ordinary concepts of probability really apply well to elections like this. In particular, it’s hard to impossible to know whether a mis-prediction is a statistical fluke or a basic problem with the prediction machinery. A presidential election cannot be run again more times with all the conditions the same, so it’s not possible to directly find out if the result was a fluke. If you can’t check directly, maybe the best you can do is a kind of meta-analysis, and they have their own issues. If you can’t say if your prediction was right, then maybe you are not doing science.

    In addition to problems with polls, and whether they can really measure what they claim to, there are problems related to human behavior. For example, it seems that more Trump supporters came out and voted than anticipated. Was this because of all the late rallies that Trump engaged in? It seems plausible, but it’s hard to say, especially beforehand. Was it because Republican voters feared election fraud by Democrats, having heard this would happen over and over, and therefore more wanted to vote to offset perceived fraud? It also seems plausible.

    These kinds of things are closely tied to nonlinear human behaviors, and they can be the hardest to predict reliably. They seem to have been in play more in this election (and perhaps the 2016 one) than in some previous ones. But how can they be incorporated into the election models?

    Crudely speaking, it’s obvious that the unmodeled existence of these sorts of things would at a minimum increase the variance and bias of forecasts. But that’s not really too helpful, and anyway these human behaviors change with time.

        • Anoneuoid –

          > Its a meaningless term.

          It has meaning to me, and I dare say the vast majority of people who hear it. I would say that your confusion about what it means is likely willful – as a rhetorical device.

          It’s like how you could just find out what “freeper” means by a simple Google (it should be obvious anyway). That you wouldn’t do that suggests (to me) a willful ignorance.

        • Some more detailed info on another “glitch”:

          The companies “uploaded something last night, which is not normal, and it caused a glitch,” said Marcia Ridley, elections supervisor at Spalding County Board of Election. That glitch prevented pollworkers from using the pollbooks to program smart cards that the voters insert into the voting machines.

          […]

          When voters sign in at a voting location, poll workers insert a voter access card into the Poll Pad tablet and encode it for that voter. The card is then inserted into voting machines to display the proper ballot for that voter. The glitch apparently prevented poll workers from encoding those cards.

          https://www.politico.com/news/2020/11/04/georgia-election-machine-glitch-434065

          In this case we have better info than the meaningless term “glitch”, and learn there was some kind of buggy update that wasn’t supposed to happen. From that it sounds like these voting machines are online and can receive remote commands?

        • “Never assume a conspiracy isn’t afoot just because there’s no evidence of a conspiracy.”

          +1!! Ben Franklin would have loved that!

      • No idea what a freeper is. But the government obviously has trouble counting votes correctly and these errors are only being found when people sanity check the outputs. Failing a sanity check indicates at least one problem somewhere.

        • If an auditor fills out 150 million ballots with known counts, then sends them through this system, how far off will the final count be?

          It could be a couple million, idk, but it’s clearly greater than 0. That’s without any fraud at all.

        • Ha. Voting machines around the country are glitching and apparently have cellular modems in them for some idiotic reason but there is “no evidence” the vote count is inaccurate.

          See the domain freep.com, “whoever posted this is a freeper.”

          I keep seeing this type of “logic” (nb, quotes do not indicate a conspiracy theory). It’s of the form:

          There is “no evidence” (ie, no RCT) for vitamin c and covid so we should not correct a deficiency in those patients. But two patients who received vitamin c along with a cocktail of kidney-damaging drugs got kidney stones. That is conclusive evidence it’s dangerous to correct vitamin c deficiencies.

      • Reading this thread just reminds me of the whole NHST problem. Stats people argue about the right way to adjust a p-value or whatever when the bigger issue is researchers have stopped testing their own hypotheses and instead test meaningless null hypotheses.

        What if you’ve got all this time and resources devoted to using garbage polls to predict garbage election vote counts and this has been going on for decades? Look at eg, kaggle datasets and half the time the “ground truth” has some kind of big problem. Here it’s your vote counts.

        There are now cell modems in the voting machines so the people in charge of counting votes are clearly incompetent. Who knows what else has been going on.

      • Integrity of the vote is extremely important. It’s right to be concerned.

        Some machines have vulnerabilities that make them hackable. That’s true and is not news, unfortunately. I think such machines should be removed from operation. I’d like to see nationwide standards — but not just one approved machine. We do have a problem here.

        But you (Anon) seem to have gone from “some machines have vulnerabilities that make them hackable” straight to “we can’t trust the outcome of this election.” I realize you did not make that exact quote, but it seems to be your attitude. If you have evidence for it, that’s one thing, but you don’t seem to have evidence.

        You are coming across as borderline nutty. I’m not saying you _are_ borderline nutty, but that’s the way you’re presenting. Don’t shoot the messenger! Just calling it the way I see it.

        • Where did I mention outcome of the election? I mentioned the accuracy of the vote counts. It seems everyone in this thread besides me wants to assume it is 100%, what is it actually?

          But yeah, if I knew there were modems in the voting machines I wouldn’t have even bothered to do my 3rd party votes. What a joke.

        • I doubt a single person in this thread thinks the count is 100% accurate. I don’t know where you’re getting that.

          Most states use paper ballots. I think there are fewer than a dozen that use electronic voting. If you’re in one of those states there’s more reason to worry.

        • > I doubt a single person in this thread thinks the count is 100% accurate. I don’t know where you’re getting that.

          All the calculations assume that’s a decent approximation, based on what? It only takes 1-3% error rate to put most election results within the range.

          > Most states use paper ballots. I think there are fewer than a dozen that use electronic voting. If you’re in one of those states there’s more reason to worry.

          This company alone (responsible for the “glitching” machines discussed above) claims to “serve” 40% of US voters:
          https://www.dominionvoting.com/about/

          That cell modems exist in voting machines at all indicates much deeper problems. So many bad decisions by different people had to be made for that to happen. I had no idea something so stupid could be done. These elections are clearly very insecure.

        • I mean, say there were a good enough reason for them to be networked at all (there isn’t), why not put a wired network card in there so people on site can at least tell if it’s plugged in or not?

        • See, you -are- talking about the election outcome.

          And you’re moving the goalposts: previously you were saying everyone here seems to think the count is 100% accurate, which is false. Now you’re talking about 98% accuracy.

          Let’s all get on board with improving election integrity. But I still encourage you to change the way you discuss this stuff. You could make your points in a way that others find convincing, or you can come across as a bit of a nut. You’re choosing to do the latter for some reason, at least in my opinion. I have no reason to lie to you about this.

        • Yea, vote counts determine the outcome. You need the error to be less than the margin of victory for it to be valid. Who believed otherwise?

          I didn’t move any goalposts. Been saying the same thing this entire thread. How accurate are these counts?

          https://statmodeling.stat.columbia.edu/2020/11/07/what-would-would-mean-to-really-take-seriously-the-idea-that-our-forecast-probabilities-were-too-far-from-50/#comment-1576690

          Btw, as far as I’m concerned, 98.5% of people voted for more debt and spying (the banks and intel community become more powerful), which are the only issues I really care about and are at the root of 90% of what other people care about but they don’t realize it (wealth inequality, abuse by corrupt law enforcement, etc).

          > Let’s all get on board with improving election integrity.

          Allowing cell modems in voting machines is basically zero effort at election integrity.

    • How much of your trouble is caused by the idea that the probability distribution should be some sort of bell-shaped curve (that just won’t let itself be massaged into a semblance of reality)?

      If I measured a small distance with a yardstick that had only inch marks, then the error term should not be bell-shaped: if my measurement is 13 inches, the probability of it being 12 or 14 is not small, it’s zero; and from 12.6 to 13.4 the probability density isn’t curved, it’s straight. When you argue that setting a 3% win probability for California is nonsense, and it should be 0%, don’t you argue for a probability distribution that has some of these properties?

  5. I’ve got a question.

    When you aggregate polling to get an average, you’re effectively aggregating and averaging the margin of errors together as well. But is that really valid? Shouldn’t the margin of error be from the highest of the range at one end of the polling to the lowest point of the range at the other end of the polling?

    • no because the aggregation reduces the error.

      You’re standing in the woods with your GPS. You let it take one measurement of your position. That measurement has a ±50m error. If you let it take 500 measurements of your position, each measurement might have ±50m error but the combined error shrinks as each individual measurement adds to a cluster around the actual position.
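
      A quick back-of-the-envelope version of this, assuming the 500 fixes are independent with ±50m error each: the standard error of their average is 50/sqrt(500).

      > 50/sqrt(500)
      [1] 2.236068

      That’s a couple of meters rather than fifty. The catch, as the replies below get into, is that this assumes the individual errors really are independent.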

      • jim –

        > no because the aggregation reduces the error.

        But wouldn’t that depend on whether you’re considering the individual polling as independent of each other? Seems to me that they should be considered as such.

        • “But wouldn’t that depend on whether you’re considering the individual polling as independent of each other?”

          Hmm, I think the short answer is that it doesn’t matter. Technically each measurement contributes to the total as a function of its individual measurement error, and I think that individual measurement error is what expresses the independence of the instrument.

          I suppose it’s possible that one poll would have such an insanely large error that it would drag the aggregate error way down, but that defeats the purpose of aggregating it so you would just throw it out.

  6. Without delving into this too much: looking at this post my thought is that 75% doesn’t seem so bad, and ±0.03 is reasonable, giving a 95% interval of [0.46, 0.58]. Even at 70% and ±0.04, we have [0.44, 0.60] – bit wide, but survivable. I think 3-1 odds feels fine.

    The problem with ±0.08 is that it goes too far in trying to move arbitrarily to 50% probability: 60% is really small, and we’re definitely approaching “people can’t understand probability” if we have to spit out 60% probability just to avoid people shaming forecasters. Seems easy in hindsight to move the point estimate of course, but I don’t see it being necessary.

  7. Seems like the point forecast needed to be shifted for the “shy Trump voters”. Then you still get your 99% probabilities for safe states and Florida looks more like a toss-up.

    • Except there’s no evidence for the “shy Trump voter”.

      Republicans did, after all, outperform Trump in the House, Senate, and a slew of state races. Trump lost votes that the Republican Party, overall, apparently did not.

  8. Andrew, I’ve said it since back in 2016, that it made sense to include an overall prior on a consistent bias in the polls. Let’s call it normal(0,.03), and combine this with a prior that the election will be close, say something like normal(0.5,.015), because all elections have been close in the last 20 years and it’s a commonly held belief that the US is highly polarized… and you’ll get the right answer I think.

    The point estimate for the real underlying opinion, will be drawn towards 50/50 because it’s more likely that there’s a 3 point bias in the polls (with their 7% response rate or whatnot) than that the election will have a 3 point real margin.

    I don’t really understand why you didn’t do this? Or am I missing something and this is in your model?

        • All right, I’ll partially answer my own question… I see in the code:

          https://github.com/TheEconomist/us-potus-model/blob/master/scripts/model/poll_model_2020.stan

          raw_polling_bias ~ std_normal();

          So this suggests you believed the bias was around 1 pct point? or is this 1 unit on the logit scale? I’m not sure.

          I find the code nearly impossible to read, because variable names like “mu_e_bias” and “raw_mu_c” are meaningless to me. Is there a write-up of the model construction that might help parse out the meaning?

        • Daniel:

          That’s just the raw value, it’s later multiplied by a state-specific scaling factor obtained from a Cholesky decomposition of the (estimated) state covariance matrix. The estimation of the state covariance, I think, isn’t part of the Stan model, but is plugged in from a 2016 estimate. (Correct me if I’m wrong)

          vector[S] polling_bias = cholesky_ss_cov_poll_bias * raw_polling_bias;

        • Got it. That makes more sense. It’s a fairly high-level model and I can see they might have preferred that methodology due to it being faster to fit.

          My own thoughts on methodology would be to make it more mechanistic/agent based (this should surprise no-one). I’d break down the demographics in each state, and then model the individual demographics on the basis of both vote share and turnout… with a popular vote level pooling parameter.

          I tried to do a similar kind of thing for a model of cost of living in the US. It was successful but extremely slow to fit, and hence to debug. I have some ideas about how I’d make that work better now. With sufficient computing power, you could make this model work for the Economist, but building it is a challenge. You’d probably want to start back in 2018 or so and begin collecting the data, debugging it, getting it to look right…

    • Note that it might make sense to give heavier tails than normal to those priors… allowing for the option that maybe everyone had enough of Trump and his pandemic response, and/or maybe the polls have even bigger errors due to intentional lying to pollsters or whatever. But the basic concept: pool the popular vote strongly towards +- a percent or two, and allow for a non-sampling bias across all the polls of several percent… I think it’d give you what you need to make more sense out of the results.

    • Just for kicks this is what just the bias term does:

      > nreps <- 10000      # number of simulation draws (assumed value)
      > voteshare.mean = 0.52
      > voteshare.sd = 0.02
      > bias.sd <- 0.03     # common polling bias, per the normal(0, .03) prior suggested above
      > results <- rep(NA, nreps)
      > for (i in 1:nreps){
      + bias <- rnorm(1, mean=0, sd=bias.sd)
      + pwin <- pnorm(voteshare.mean + bias, 0.5, voteshare.sd)
      + results[i] <- pwin
      + }
      > mean(results)
      [1] 0.7098428

      However maybe the uncertainty around vote share is already supposed to reflect polling bias?

  9. It seems to me that the normality assumption is what’s throwing things off. But the issue isn’t that the tails are too light, it’s that the tails are too *heavy*, and the gaussian is too leptokurtic! If the point estimate is .52, then maybe what we want is a nearly uniform posterior covering like .49 to .55, but with a sharp dropoff below that, to ensure that we’re not predicting a Democrat wins Wyoming. Would something like that work? It might make sense if the concern is a systemic bias, rather than a sampling bias…

  10. You want the distribution to be centered on your point prediction… that’s causing the problem. Don’t move the point prediction, just center the error distribution around a point below the point prediction. Now you need a *justification* for this. I don’t have a complete one, but there is the glimmer of a Lucas Critique: the fact that the polls are public actually causes reactions to it in which those ahead get complacent and those behind get determined to upset it. (You have to explain why they don’t just get discouraged, which is often alleged.) Then as the point prediction gets farther and farther from 50%, underperformance becomes more and more likely, although an increase in the point prediction still increases the aggregate probability of a win.

    This explains why you need to make an adjustment even after your Bayesian prediction of the polls… because the election result needs to take into account voter reaction to the polls themselves. I confess I have no idea how to calibrate this, unless you take a betting odds approach and move the center of the error term to calibrate to your subjective odds.

    • I think several factors would pertain to how to calibrate the point you are making.

      1. Related to number of voters: As campaigns are the primary users of polls, they would like their resources to go where they are more effective. Is the percentage of historical voters relative to potential voters low? Can the intention to vote (of groups, one party, etc) be increased? Places where 95% of registered voters vote have less leeway to affect the final outcome.

      2. Historical trends. Has the place gone to different parties in recent elections? Also indicator that campaign efforts can be effective.
      3. Size. This is more complicated. My intuition is that if the size is too small, campaign efforts are meaningless. If it is too big, both campaigns will make equal efforts and thus cancel out each other’s efforts to alter what the polls reflected.

    • This was a thought I had too, that this would be a parsimonious way to think about it. But in reality I’m not sure how much we can expect the polls to shift behavior. Maybe if we think of people as having their own internal pollster, where as they perceive things to shift from 50/50, they either ramp up their efforts if their candidate is losing or relax if their candidate is winning. So the system is on some level tending toward higher entropy and this tendency gets stronger the further we get from 50/50.

    • On further thought, the easier way to do this is make the error asymmetric with a mode at the point estimate and more weight towards 50%. Maybe something gamma-like.
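
      For what it’s worth, here’s one toy version of that (an illustrative choice, not a recommendation): keep the mode at 0.52 but give the error a gamma-shaped left tail, say vote share = 0.54 - G with G ~ gamma(shape = 2, scale = 0.02). The mode stays at 0.52 (though the mean drops to 0.50), and the win probability comes down to about 59%:

      > pgamma(0.04, shape = 2, scale = 0.02)  # P(vote share > 0.50) = P(G < 0.04)
      [1] 0.5939942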

  11. I played around with similar ideas for uncertainty color palettes here: https://osf.io/6xcnw

    The relevant part is section 4. The basic idea is to take what we know about people’s perception of probabilities and try to find a distribution to show them that they will perceive as the distribution we want them to see. If you assume that probability perception follows a “linear in probit” model (which plenty of past literature suggests it might) and that your target distribution is Normal, then what you need to do is exactly what you suggest: move the mean and scale the SD.

    If you have data on the slope and intercept of the linear-in-probit perceptual function for the communication context you are working in, these coefficients translate directly into how much to scale and shift the mean (top of page 4). While I was applying this only to color palettes, you could easily stuff an entire distribution through this transformation before reporting it.
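
    A rough sketch of how that transformation could work, under the assumption that a viewer perceives a displayed probability p as pnorm(a + b*qnorm(p)), with a and b known (the values below are made up; in practice they would come from the fitted perceptual function): to make the perceived distribution come out as a target Normal(mu, sigma), you would display Normal(mu + a*sigma, b*sigma) instead.

    > mu <- 0.52; sigma <- 0.02   # the forecast we want viewers to come away with
    > a <- -0.5; b <- 2           # made-up linear-in-probit perceptual coefficients
    > c(mu + a*sigma, b*sigma)    # mean and sd of the distribution to actually display
    [1] 0.51 0.04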

  12. All these interesting modeling points aside, I don’t get the major issue. Suppose I do NHST on a finding and the p value is .05. What are the betting odds of this result being true? I thought we all have rejected such reasoning – I certainly have. Now, because it is a political poll, somehow binary reasoning reasserts itself. An election has a binary outcome – but the “truth” of a decision is also binary. I thought evidence is always gray, however, and a decision analysis should include consideration of costs and benefits of various decisions. So, we have a political poll, or many polls and many models. The resulting confidence interval reveals something about the uncertainty of the evidence. But suddenly we are focused on what that evidence implies about the probability of candidate X winning the election. And, the concern is over what that probability says about the binary outcome. It seems to me like we have suddenly taken a giant step backwards. Isn’t the problem trying to stuff a square peg (the uncertain evidence) into a round hole (the binary result of the election)?

    • I don’t think your analogy quite works. In the NHST example, you aren’t betting whether the result is true… You’re (at best) betting whether you’d observe an effect this big or bigger were there no effect at all. And given that sharp nulls are almost never really true, there’s really nothing observable to bet on that’s related to the 0.05. In elections, though, there *is* something to bet on: who will win.

      • Then, use a different analogy. You study a new medical treatment and the odds ratio that it is better than the old one (on average) is 1.05, with a confidence interval of (1, 1.10). You can raise analogous questions about the odds you would (or would not) give on whether the new treatment really is more effective than the old. I maintain that the only reason why you might be more comfortable offering a bet on the medical treatment and not on the election is that the election will have a clear (don’t we wish!) binary outcome, while the medical treatment will not. It is our comfort or discomfort that I am questioning here. Andrew’s post seems to me like a retreat to binary thinking that would not come up if we were talking about the medical treatment.

        I am not questioning the issues surrounding whether the confidence intervals in the polls and models should be wider or not. There are legitimate issues to consider and many of the comments on this post are addressing those. But it is the interpretation of whether or not we are comfortable with the 84% probability of winning that seems like an about-face on the discussions we have had with NHST.

    • Yes — thanks for writing this. All the critiques (correctly) applied to social/life sciences are equally valid here, yet we seem to be ignoring them. There are giant measurement errors (Andrew’s kangaroo jumping on a scale from old posts), making sophisticated technical analyses pointless, as well as this problem of binary thinking (reminiscent of the awful “there is an effect” or “there isn’t an effect” methodology of the life sciences). It seems to me as well that these post-hoc analyses, thoughtful as they are, are a step backwards. I wonder what the meta-Andrew, writing from outside the perspective of this field, would say?

    • A good model should assign high probability to outcomes in the vicinity of the actual outcome (ie. high probability density to the actual outcome).

      Now, this model assigned 90+% to Biden wins, and he did… that’s not the issue.

      The issue is the model assigned high probability to vote differentials that were much much bigger than the real vote differentials.

        • I agree.

          If the model is giving you probabilities that you don’t believe, then the model is probably wrong. But, people are hard to model. The normal is very good for modeling things engineers build. The uncertainty that your model gives is much more sensitive to details of your model than is the point estimate.

        • Right. All other things being equal, the normal will be less wrong than anything else, more often. But they aren’t equal, unfortunately. Or the events are just too rare.

        • What if the model was:

          Wisconsin +8 (±10)
          Michigan +8 (±10)
          Minnesota +9 (±10)
          New Mexico +13 (±10)
          New Mexico +7 (±10)

          Now your model is much better. Some states are on the tail of the interval, but within it just the same.

          So – I’ll just keep arguing this point for the sake of argument until someone convinces me otherwise – the problem is in the error, not the mean.

        • I suspect that the problem with individual ranges so large is that they won’t be consistent with the popular vote being in a narrow interval of 0.5 +- 0.01 unless you have very odd correlations between states, like for every vote you get in NV you lose one in PA or something.

        • “they won’t be consistent with the popular vote being in a narrow interval of 0.5 +- 0.01 ”

          But that narrow interval expresses only the most recent elections, correct?

        • Hang on, though. Firstly you’re picking a specific selection of states that were close. So there’s a multiple comparison problem here. Secondly there’s an issue of correlation in the state polling errors. The website shows these state forecasts individually, but the model actually has these as correlated, so some of the probability mass placed in Wisconsin being <1% on the map will also be those instances where Michigan <2 etc. So showing a string of states being missed kinda exaggerates the issue.

        • I actually picked the ones off the top of the list that had large predicted margins… and then found that the real margins were *all* much lower. Basically, the mode in the output of the model was kinda far from reality. Now I don’t think the mode needs to be right ON the actual outcomes, but it shouldn’t be 5-10 percentage points away because of *polling*. That is, polling is just not that trustworthy, and the election results last time are very trustworthy… so it just makes sense to me that no matter how the polls work, you’d want to keep them moving the vote margins less than 5 percentage points or something.

          Gelman’s model wasn’t bad at the individual state level, but I think it makes more sense to use the aggregate popular vote level information to further constrain what is reasonable in the model. I think doing that would have pooled the state results back towards 50/50 if the balance were right… but I haven’t wrapped my head around the model.

        • > election results last time are very trustworthy…

          See above, that’s what everyone assumes, but based on what? Has anyone ever fed known input into the whole election system and seen what comes out the other end?

        • > I actually picked the ones off the top of the list that had large predicted margins… and then found that the real margins were *all* much lower.

          Well, ultimately the Economist predicted a popular vote margin of 8.8% with a 95% CI of about 2-16%. The current two-party popular vote margin is about 3.5%. This might increase by a little as more mail-in/provisional votes come in.

          So on average, state margins will be off by about 5%, but the miss is within the 95% CI. The 10% states should be atypical.

          I think 538 did a little better in this respect.

          >That is, polling is just not that trustworthy, and the election results last time are very trustworthy… so it just makes sense to me that no matter how the polls work, you’d want to keep them moving the vote margins less than 5 percentage points or something.

          You mean that election results in the last election (in particular, the last presidential election, *not* 2018) are a more reliable indicator of the current election than current polling? I dunno man, that sounds like a very strong assertion to make, and I think that if you apply it retrospectively that sort of modeling might perform very badly.

        • > You mean that election results in the last election (in particular, the last presidential election, *not* 2018) are more reliable an indicator of the current election than current polling?

          No, I mean that the measurement error in the election counts was low. It’s not zero, but just based on polling non-response, I’d expect polls to have up to say 5% errors, so ~ normal(0,0.025) or so… whereas the counts in the previous election I’d expect to have errors of say normal(0,0.0015) or something like that.

          Sure, things could shift from their positions 4 years ago, but at least we know what the positions 4 years ago were pretty well. The shift from then could be around +- 5% at most, I’d guess, and since the polling error is +-7%, including a polling bias of at least a few percentage points, the polls only give us kinda vague information about what is going on. I’d suggest that the polls could show you trends, but the overall level of support is hard to estimate.

          poll[i] = population_position + individual_poll_error[i] + overall_poll_bias

          aggregating polls will help you eliminate the individual_poll_error, but doesn’t give you *any* information about the overall_poll_bias, which it’s reasonable to believe could be as high as 3 or 4% points.
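
          A quick numerical version of that, with made-up values for the two error components (individual poll noise with sd 2.5 points, a shared bias with sd 3 points): averaging 20 polls shrinks the noise term but leaves the bias untouched, so the error in the aggregate barely improves on the bias alone.

          > bias_sd <- 0.03; poll_sd <- 0.025; n_polls <- 20
          > sqrt(poll_sd^2/n_polls + bias_sd^2)  # sd of the error in the poll average
          [1] 0.03051639
          > sqrt(poll_sd^2/n_polls)              # what you'd get if there were no shared bias
          [1] 0.00559017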

        • I threw around a bunch of numbers that weren’t entirely consistent there… but the point is that an individual poll might have several sources of error:

          1) an overall distribution of biases across states… which might have a most likely bias and a range of biases across states… a hierarchical prior so to speak.
          2) the individual bias in a given state… hierarchical prior across the polls
          3) the individual bias from the polling company … crosses state lines
          4) sampling noise in the individual poll

          put all together it wouldn’t be surprising to see a given poll more than say 10% points away from the truth, and an aggregate of polls in a given state around say 5% points away from the truth in that state, and again this more than 3 or 4 pct points away from the overall cross-state bias…

          But the total drift in the underlying position of the people is unlikely to be more than 3 to 5 pct points because that’s a rather largish swing for the US under the current conditions… call that maybe normal(0,3) pct points.

          so we start with a precise location 4 years ago, and we drift forward 4 years by an amount that’s very likely less than the error in the polling… and then we try to estimate it by processing polls that might easily have a lot of bias and random error… some of this bias we can eliminate when it’s due to the polling organization, and we may be able to reduce the non-response bias somewhat as well. But in the end we’re probably kidding ourselves if we think poll adjustment reduces the residual error in the poll aggregate below about the size of the prior known swing size (3 to 5%).

          Intuitively if we are constraining the nationwide swing, and the individual state swings, and then adding noisy polls… we should be pooling pretty far back towards the last election + noise…

          My impression from Gelman’s model is that he just didn’t model the error in the polling as sufficiently wide that the pooling came back enough towards “neutral”… Any model that predicts a mode of 12 pct point swings in a given state on the basis of largely polling is just … probably not that reasonable.

        • Daniel:

          I feel like you are vastly underestimating swings from election to election.

          For instance, in the UK general elections, between 2017 and 2019 the two-party margin changed from 2.9% to 15.2%. That’s a 13% swing in 2 years, far worse than polling error. Similarly, from 2004 to 2008 in the US you have a switch from a 2.4% GOP lead to a 7.3% DEM lead, i.e. a 10% swing.

          I mean you can make some argument about the “under the current conditions”, but again I think you’ll run into big difficulties defining what those are without the benefit of hindsight.

        • I definitely think each election you should re-evaluate the size of the swing. Each election is a singular event, and though it’s worth it to see how much swing we’ve had at other times or places, it’s not really that informative for a US election that a UK election had a large swing.

          The popular swing from Bush to Obama was about 9 percentage points or so. That was a fairly tremendous swing with no incumbent in the 2008 election.

          I’d agree we should probably include wider swings, but do so via long-tailed distributions, not by widening a normal distribution. The high-density region should be +- 3 or 4, with the tails extending into the +- 10 range. A t distribution with 3 degrees of freedom scaled by 3.33 gives you an 80% chance of being between +- 5.5, and yet the 95% interval is +- 10; that seems fairly reasonable.
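
          As a quick check of those numbers:

          > round(3.33*qt(c(0.90, 0.975), df = 3), 1)  # half-widths of the central 80% and 95% intervals
          [1]  5.5 10.6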

      • Daniel,
        You say “A good model should assign high probability to outcomes in the vicinity of the actual outcome (ie. high probability density to the actual outcome).” That seems to be an anodyne statement that we should all be able to get behind, but I find that I cannot.

        Suppose I assign about a 6% probability to pulling a coin out of my pocket and flipping four consecutive heads (on the first four tries). Now I pull out the coin and try it…and I flip four consecutive heads. There’s nothing wrong with my model, certainly there is not _obviously_ something wrong with my model, and yet I have assigned low probability to the actual outcome.

        Upsets do happen. A novice golfer hits a hole in one, a poker player gets dealt a full house of aces over kings, a roulette wheel comes up red ten consecutive times, etc. If you assign a low probability to an event that turns out to actually happen, it’s certainly a good idea to question your model. But I have to disagree with the blanket statement that a good model should assign high probability to the vicinity of the actual outcome.

        • Well, Phil, “higher is better” I think ought to be uncontroversial, and “no higher than your information justifies” is also uncontroversial. The goal of a Bayesian model is to balance those.

          IMHO we had information from the last election that we weren’t using, and instead we were relying on polls with single digit response rates to give more information than was justified.

          I still don’t know how to read the code, but I think if I’d been modeling this I would have modeled each state as having a bunch of unswingable voters, and a bunch of swing voters, and then unknown turnouts for each. I’d have used polling demographics to try to estimate turnouts, and poll responses to estimate swingable sentiment, and combined with census data to know the state demographics. You could then pool information about turnout across the states, and get correlations without any explicit correlation matrix type models.

          Finally you’d soft constrain the popular vote split to be within a couple percent… 48-52

          This is my love of mechanistic modeling at work. But I suspect it would be sloooow to fit as it’d require small time steps to manage the correlations that come up.

    • The problem is that, from the perspective of creating a model for the general public of a winner-take-all election, the uncertainty and vote-share estimates don’t matter. What people want to know is who will actually win the election, which is a dichotomous outcome. No one really cares if you were off by 10 votes or 10,000; in the end you must predict one outcome or the other and you will be either right or wrong. The error and uncertainty calculations are of interest to statisticians but not to the ordinary people who are looking at these models on the Economist and 538.

      As I mentioned in a comment on another post, I think part of the problem is that the predictions are difficult, and the probabilities difficult to calibrate, because of the infrequency of elections. But the infrequency of elections also contributes to people’s desire for accuracy. When elections are infrequent, you care MORE about having the correct binary prediction, because you have to live with the outcome for two or four years.

      I think this paradox means that it will be difficult to nudge the general public toward the “square hole” you envision. The actual event outcomes are discrete and the consequences people have to live with are discrete.

  13. I like this post. It does seem to me that 2016/2020 can give high confidence (i.e., probability) of a particular winner, but then after the election experts say the polling error was not outside the realm of normal errors. I think your simple example illustrates this seeming contradiction.

    I also agree that you can’t just make your intervals wider, so shifting the point prediction is a consideration. I am not sure why 538 lets the weight of the fundamentals model go to zero on election day. I think you do something similar? If we expect polls to be off, then maybe the fundamentals can tell us the direction of the error. It seems like things such as high polarization should temper expectations of a landslide victory.

    I am not certain you actually need to shift your point estimates. There may also be a possibility to change how you communicate the forecast so that you start from the assumption of polling bias when communicating the results. I tried to describe this in a comment on previous communication post.

  14. The OP is really about two issues. One is whether forecasts should be adjusted in some way so they “feel” right, the other the merits of adjusting the point estimate vs the dispersion. I’ll pass on the second; the first is the most interesting.

    I agree completely that, if you think your feeling is meaningful, you should try to figure out what information, if any, it reflects that isn’t incorporated in the model or the data it draws on. Commenters have given us a bunch of them: systematic response bias in the polls that may have increased since 2016, likely voter bias (again specific to 2020), etc. Now, suppose you harbor such doubts and have a basis for them—what should you do with them?

    I think this is a moment when it makes sense to go back to the initial point that all model output is conditional on the assumptions embodied in the model. Somewhere, if only on a linked website, there should always be a full list of these assumptions. The inchoate evidence that underlies the hunches people have about forecast error comes into play as it bears on this list.

    Example: turnout seems to have been unexpectedly high among Republicans/Trump voters. There were indications before the election that this might be the case; I posted in late October about the complete absence of Biden bumper stickers on the highways where I live but frequent signs of Trump support, the enthusiasm problem. This is soft, anecdotal evidence to be sure, but not nothing. I was also struck, as others were, by the Trump Stores that sprang up around the country (including here in the Pacific Northwest) with no apparent support from the Trump organization. More soft evidence.

    One of the assumptions of most of the polls might be that the likely voter weights from 2016 would also characterize 2020. If so, I would recommend either (a) making that assumption explicit when reporting forecasts or (b) creating an alternative scenario in which an arbitrary adjustment is made for a shift in turnout toward Trump and reporting both scenarios. I realize that the scenario approach wasn’t popular the last time I broached it here, but I think it can be clarifying in murky situations like this. The practical problem is that this is a poll-by-poll adjustment, but even so one might do this with a few polls and then extrapolate.

    • Peter said,

      “I think this is a moment when it makes sense to go back to the initial point that all model output is conditional on the assumptions embodied in the model. Somewhere, if only on a linked website, there should always be a full list of these assumptions. ”

      Yes! (i.e., transparency is needed to detect GIGO.)

    • Wasn’t there also a good deal of soft evidence that moderate Republicans were defecting? I’m thinking of the Lincoln Project, the couple of Republican governors who wouldn’t support Trump, the five attorneys from the Reagan Administration, etc? Trump obviously had a lot of enthusiastic supporters, but I thought more Republicans (and conservative independents) would simply stay home.

  15. I’ve had two thoughts on this:

    – Is it possible that election forecasts have become so well publicized that voters actually change behavior in response to them now? Maybe not in terms of candidate preference, but in terms of turnout? I think it’s at least plausible that Republicans thought Biden would win, and were motivated to protect their Senate majority, which in turn led to more votes for Trump.

    – Do you use a proper scoring rule on the sampled predictive posteriors to compare the performance of models? I wonder if using a metric like CRPS could help guide the modeling decisions to better trade-offs.

  16. I was wondering if there is any theory in statistics for quantifying “surprise”. A top Google result suggested defining surprise as the ratio between your posterior given some outcome and your prior. If my prior based on the forecast is that Biden is clearly favored and then Trump is +5 in Florida and +8 in Ohio, it seems like objectively that is surprising. However, I didn’t crunch the numbers on this. Could your forecast have told me that this outcome should not be very surprising? That would be useful. The reason I check the forecast is so that I have a good idea of what is likely to happen on election day. However, it seems unavoidable that I am in for at least a few surprises unless the polling is pretty accurate. I guess whether the model makes me more or less surprised on election day depends on what my prior would be without checking the forecast, but maybe the forecast sets a reasonable prior, assuming I can interpret it correctly.

  17. Here is our old friend 84% again! :-)

    You’ve discussed many times how non-informative priors can lead you “to make statements with 5:1 odds based on data that are indistinguishable from pure noise”.

    Even with informative priors “it seems weird to go around offering 5-to-1 odds to something that could be called a statistical tie”. Don’t call it a statistical tie then.

    That’s what you get if you define statistical tie as one standard deviation (using the normal distribution for simplicity): odds less extreme than 5-to-1.

    If you want anything that could be called a statistical tie to have odds no greater than 2 to 1, the simplest (and only) solution is not to call something a statistical tie if the difference is more than 0.431 standard deviations.

    • To be clear, the difference mentioned in the previous comment is the difference between the vote share estimate and 0.5. The difference between the estimates of the two candidates will be twice as much.

      In summary, don’t say that 0.52 +/- 0.02 (0.48 +/- 0.02) is a “statistical tie” if it feels weird to call anything more extreme than 0.509 +/- 0.02 (0.491 +/- 0.02) a statistical tie.
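
      A quick check of that threshold in R, matching the normal-distribution setup in the post:

      > round(qnorm(2/3), 3)
      [1] 0.431
      > round(pnorm(0.431), 3)
      [1] 0.667

      That is, a point estimate 0.431 standard deviations above 0.50 is exactly where the win probability reaches 2/3, i.e. 2-to-1 odds.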

  18. This may sound ridiculous, but I think an 84% chance just isn’t as big as it feels.

    As I mentioned in a comment on another post, I have a friend who was furious with Fivethirtyeight last election because they felt fivethirtyeight had ‘promised’ that Trump would lose. I pointed out that they had actually given Trump about a 20% chance of winning, and that’s not really a promise at all, but my friend still felt like they shouldn’t have said Clinton had an 80% chance if there was actually some chance she would lose. It’s an emotion-based argument, not a statistical one. 80% felt to my friend like it was practically 100%…but it isn’t.

    I don’t think giving something an 84% chance, and then having it fail to happen, represents an ’embarrassing’ failure. If I predict that you aren’t going to pull a coin out of your pocket, flip it three times, and get heads every time, I might be wrong but I’m not going to be embarrassed about making the prediction. If it happens, well, 12.5% events happen every now and then — indeed they aren’t even all that infrequent!

    So: maybe fatten the tails a bit so it’s an 80% chance rather than an 84% chance — turn it into 4:1 instead of 5:1 — but other than that I don’t think this example calls for rethinking anything fundamental. If something has a 20% chance of happening, and it happens, that should really only be mildly surprising. And I speak as someone who is easily surprised!

    • That’s what I think, too. Events to which you give an 84% chance of happening should actually go wrong 16% of the time. We’re seeing loads and loads of random experiments where an outcome is assigned a “pretty unlikely but realistically still quite possible” probability; that’s how I’d interpret something around 16%. And in fact such events happen all the time. As they should.

    • Maybe the issue is really that sober probability modelling and the ambition to look like a winner in the polling game really are quite alien to each other.

    • Was thinking the other day that we should communicate odds rather than probabilities. So when Andrew’s model moved Biden’s win probability from 90% to 95%, that is a big deal! It’s the difference between rolling a 1 on a 10-sided die versus a 20-sided die. But it looks like such a small difference :)

    • In a sense I think you’re right, but I think that’s mainly true for elections, because of how infrequent and consequential they are.

      In a board game where you roll a die every turn, the chance of avoiding a 1 is about 83%. That is “okay” because you’re rolling every turn and if there are many turns you can recover from rolling a 1. Even if you roll a lot of 1s, you can play another game and roll a bunch more times. A presidential election is like rolling a die once every 4 years. You cannot afford to roll a 1 because you will probably only be able to roll the dice about 15 times in your entire life. An even more extreme example would be Russian roulette with a round in one of six chambers; you really don’t want to roll a 1 here.

      To me these issues are less about probabilities and more about the psychology of people’s reactions to events. The meaning and importance of an 84% probability is not invariant across different types of events. It could be useful to incorporate psychological findings in risk-aversion, etc. to foreground the fact that people’s reactions may not scale in expected ways with probabilities.

      • Suppose I have a model which says “the probability of a ‘fair’ die coming up 1 is 1/6”. Say you obtain from some knight of the green table a ‘fair’ die; and you toss it once. Does the appearance of a “1” make you less confident in my model (whatever be its minutiae); or does it make you less confident that the die is fair? How to apportion one’s (mild) surprise? Is the model wrong or is the die itself an odd lot — or could it be both? The evidence is slender and the stakes are low and the question is not of great interest. But when the stakes are very high, the question becomes interesting; even when the underlying evidence may be no less slender than the single toss of a die.

    • Maybe you can get a better feel for the probabilities if you imagine betting a large sum of money. 80% sounds pretty good, but I wouldn’t bet a lot of money on such a gamble, if I couldn’t afford losing. I might bet on 99%.

  19. Unless my R skills have deteriorated significantly more than I think, doesn’t pnorm(0.52,0.5,0.02) give the c.d.f. of a normal distribution with mean 0.5 and sigma 0.02 evaluated at 0.52? How is that the probability of winning?
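
    (Working it out: it is the c.d.f. at 0.52, but by the symmetry of the normal distribution that equals the win probability P(vote share > 0.50) under the N(0.52, 0.02) forecast, which is presumably why the post writes it that way. A quick check:)

    > round(pnorm(0.52, 0.50, 0.02), 4)       # P(X <= 0.52) for X ~ N(0.50, 0.02)
    [1] 0.8413
    > round(1 - pnorm(0.50, 0.52, 0.02), 4)   # P(Y > 0.50) for Y ~ N(0.52, 0.02)
    [1] 0.8413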

  20. I think you’re zeroing in on the fundamental issue, which is that modeling the vote in a state as a single variable with some variance is just not what’s going on. The difference between 0.52 and 0.499 isn’t because you made a random draw from a noisy variable. It’s because some group that votes 30/70 showed up more than you thought, and another group that votes 90/10 showed up a little less.

    What happens if instead, you go into the crosstabs of each poll, model the voter intention for each demographic, and then add in additional uncertainty about turnout (which can also include uncertainty about voter suppression) for that demographic?

  21. The likelihood ratio (likelihood that the true proportion voting for candidate X is >0.50 versus likelihood that it is <0.50), if we observe a proportion of 0.5 + 1 s.d. in a sample, is 1/exp(-0.5*1*1) = exp(0.5) ≈ 1.65. If your prior odds are 1:1, this gives a posterior probability of about 62%, which is close to what you want. Or have I got something wrong?
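
    A quick check of the point-hypothesis version of that calculation (comparing “true share equal to the observed value” against “true share equal to 0.50”, with the observation one s.d. above 0.50):

    lr <- dnorm(1, 1, 1) / dnorm(1, 0, 1)   # exp(0.5), about 1.65
    lr / (1 + lr)                           # about 0.62 with 1:1 prior odds

    So the arithmetic checks out for that comparison, though it is a different quantity from the post’s 84% predictive probability.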

  22. This is related to some thoughts I’ve had about how Bayesian models are presented in general. In my field (microeconomics), Bayesian methods are extremely uncommon, and the methodological conservatives often bristle at the idea of “priors”. As a result, most practitioners who utilize Bayesian methods will report point estimates and “standard errors” in order to make the audience more comfortable. They avoid talking about posterior distributions for basically the same reasons you express here.

    This seems like a missed opportunity to me. When presenting Bayesian models, I think the emphasis should be on the prior instead of the posterior. In the context of the election model, predictions should be presented as “here’s what your prior about the polling bias would have to be in order for the model to predict Florida as 50-50.” In economics, it would be something like “here’s what the prior of the marginal effect would have to be in order for the posterior mean to be zero,” or something like that. What I like about this formulation is that it’s much closer to how we interpret (or should be interpreting) frequentist estimates, but it’s much more intuitive.

    My sense is that Bayesians spend too much energy choosing and then justifying priors. Let the priors speak for themselves, I say.

    • MJ—I’m not sure what you’re recommending here. If we don’t spend energy choosing and justifying priors, why would we trust them to speak for themselves?

      The traditional “subjective Bayes” methodology is to encode what you believe (or know) in your priors, then just turn the crank and accept the results. A reaction to that is “objective Bayes”, where the goal is to find “objective” reference priors and use those everywhere.

      Following the advice in some parts of Bayesian Data Analysis, a bunch of Stan devs including Andrew just wrote a Bayesian workflow paper where we recommend actually looking at the resulting posterior predictive inferences and checking them against held out data. That’s the usual way to measure calibration of a model. (I wrote all new chapters for Part III of the User’s Guide that shows how to code all the evaluations in Stan.)

      The methodological conservatives are only fussing about half the problem. They should be worrying even more about the choice of likelihoods! Let’s use an example from micro-econ. Let’s say I have a sequence of prices over time. Which time-series model do I choose? Do I look at the data to see if there are long-term trends, periods of varying volatility, exogenous effects from policy, heterogeneous random effects, spatial effects, etc.? We can only start to worry about priors on parameters for these things once they’re part of the likelihood. Even if I’m doing a simple linear regression, do I use normal errors or do I make the model less sensitive to extreme values by using Student-t distributed errors (and if so, how many degrees of freedom do I use or do I just throw that in as a parameter as well)?
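
      To make that last point concrete, here’s a minimal sketch in R (the size of the residual is just an illustrative number, not from any real data set): the log-likelihood penalty for one extreme residual is far harsher under normal errors than under Student-t errors, which is what drives the sensitivity.

      resid <- 5  # one residual five scale-units from the regression line
      dnorm(resid, mean = 0, sd = 1, log = TRUE)   # about -13.4: the normal model "panics"
      dt(resid, df = 4, log = TRUE)                # about -5.9: the t model shrugs

      In Stan terms, that’s the difference between normal(mu, sigma) and student_t(nu, mu, sigma) in the model block.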

      • > A reaction to that is “objective Bayes”, where the goal is to find “objective” reference priors and use those everywhere.

        I was refreshing my knowledge of “objective” Bayes and came across this: https://projecteuclid.org/download/pdf_1/euclid.ba/1340371039

        “In some settings, such as the NBC Election Night model of the 1960s and the 1970s (which used a fully Bayesian hierarchical model), considering alternative priors is important. For the NBC Election Night implementation, this meant multiple priors based on past elections to choose from in real time, and the choice was often crucial in close elections.”

        Nihil novi sub sole.

        • Love that note. It’s about calibration. And it was written by my first dean when I was at CMU, Steve Fienberg. It makes the same point I’m making about the likelihood also being important (this is an obvious point that Andrew continually stresses, but it seems to get lost in translation).

          My favorite part of the note is the citation of Chapman et al. (1975) with a picture! I like Mosteller and Wallace (1964) even more, though the only funny bits are the thank yous to the lab members who devoted their weekends to pushing slide rules and index cards to do the negative binomial calculations. That book is still way ahead of almost all of the natural language processing field in statistical sophistication. Steve worked on author identification, among other things—the last time I saw him before he died was at a conference on statistics, natural language, and the law, where people were trying to argue from linguistic principles about the source of text messages, letters, etc. At the time, he was working on analyzing the Reagan presidential library—Reagan was a very coherent and prolific author—the New Yorker did a piece a while back on some of his early political writing.

          I wonder how they could’ve fit a Bayesian hierarchical model in the 1960s. Anyone know?

      • I think my point is that there’s a better way to present Bayesian results. Of course I want to see the posterior associated with the researcher’s preferred prior, but I *also* want to see what other priors would look like. As a starting point, I think it would be useful to see the “extreme” prior that would nullify the preferred results. Let the audience find their own prior within that range, rather than imposing a single prior on the audience.

      • And a small quibble: Time series models are uncommon in microeconomics. The model you’re describing is decidedly macro, which is a field where Bayesian methods have become popular. Microeconomics is the land of least squares models and asymptotic theory.

  23. Personally, I’d reiterate that I think, given the circumstances, polling actually did quite well this year. There’s just a *vast* number of unprecedented factors this year. Just off the top of my head:

    1. The last minute supreme court smash-and-grab
    2. Rumours of Trump sending paramilitary forces to polling sites
    3. Large scale attempts at voter suppression that will take a long time to quantify
    4. Covid-19 and the shift to mail votes

    The most significant result was that large numbers of Democrats voted early, leading to pre-election-day news suggesting that Dems had a strong lead. One can easily see this as having an effect on the polls, with Dems becoming complacent and not showing up on election day and Republicans being motivated to overturn a “fait accompli”.

    All of these factors were difficult to foresee and near-impossible to estimate from historical data. The fact that the polling miss this year was firmly *average* seems almost miraculous to me.

    And yes, there are simple ad-hoc methods that in hindsight would have given more accurate results. But it would be deepest folly to trust stopped clocks over slow ones just because they happen to be more accurate in *one instance*. And as with every election, there’s a whole lot of stopped clocks out there.

    • It is interesting in this regard that the polls in early March (before COVID took hold) basically nailed the election perfectly… perhaps a coincidence, or perhaps an indication that all of the swingy subsequent events weakened people’s willingness to firmly state preferences that they would eventually return to…

  24. When polling opened the implied probability on 538 that Trump would be elected was 10%. The probability on Betfair (£400m traded) was 39%. Which was more precise? Does it matter?

  25. I’m slightly unclear: do you mean to feel better, or do you mean to communicate better so ‘ordinary’ people can understand better, or do you mean a more accurate portrayal? The reason I ask is that your use of ‘win probability’ conveys all of those. I can’t answer the first. The third is really hard because you’re reacting to an anecdote: what if the polls were wrong the other way by as much? I doubt you’d write that off as ‘hey, we were just as inaccurate but the guy we predicted to win did, so we don’t care’.

    The middle question is, to me: not many baseball fans grasp the concept of win probability. Or win shares. I’d say they more generally grasp OBP, but not slugging, so they can see OPS but they really don’t know what it stands for. Win shares and the like, no. Win probability, really no. They can grasp the general idea that 10-1 is longer odds than 4-1, but that’s in generalities that simplify to ‘the amount you need to offer as a payout to attract money to that side’, so at least bettors get it.

    So, if you’re trying to communicate the concept better, it’s not your fault that people don’t get it. IMO, what has happened is that a rather narrow class of people have become used to computing ‘talk’, so they accept ideas like win probability. That doesn’t mean they get it. But they accept that it’s a thing, which many people cannot do. As in, I’d bet lots of people think hand counting is more accurate than machine counting, because that’s what their minds accept. Someone looking at a ballot is to them material, even if in real life that person is far more likely to be inaccurate than a machine. Look at the firestorms over counting: they imagine the ones counting ‘their’ ballots are good, but the ones counting ‘the other’ ballots are bad, whether that means inaccurate or evil.

    Why do people think this way? Because understandings of probability and statistics matter to them in ways like ‘the odds I’ll make it across the street without getting run over’, which leads to the advice ‘don’t be a statistic’. As a person who tends to talk over people’s heads, I’d say that the existence of a more popular presentation of statistical thinking doesn’t mean the ideas aren’t going over heads. They are. The nature of the popular presentation is not that it reduces the difficult to the simple but that it packages the difficult to be appealing. This attracts some who understand a lot, some who understand a little, some who are attracted to the way it is presented, etc. It’s like the 4 Questions: you have to assume only 1 of the 4 kids is really interested, but all 4 are there.

    Here’s a very basic example. I’m a big Taylor Swift fan. Most Taylor fans are under 25 and many are teens or younger. I hear (and read) densely constructed imagery with lots of word play. Kids may hear that but they don’t get it. An example is from an early song: you made a rebel of a careless man’s careful daughter. It’s a cool line. It connects rebellion to relationships. It personifies that to ‘you’, so you as listener participate. It’s a tiny bit daring, implying the potential for sexuality if not the reality. But in there is wordplay because careless is also care less, which tends to carefree, and careful is also care full, so it becomes you made a rebel of a carefree man’s carefilled daughter. Why read it that way? Taylor has said she is so consumed by worry that she carries around bandages in case someone near her is attacked in the craziness. She’s described her obsession with details. Her fans know this, so they think every photo, every appearance is a clue. And then she proves that: posts a photo of cinnamon buns with a quote from her own song 22, nothing much going on, and that now reads to fans as a hint she had buns in the oven, meaning the top-selling album of 2020 that she made in secret. All this then includes meanings like the fact that her dad gave up a life on a Christmas Tree Farm so his daughter could become Taylor Swift, literally uprooting to Nashville where she moved into the country music scene. I don’t expect kids to get that. The press, which is run by people of adult age, generally can’t get past hunting for possible references to boyfriends or feuds or whatever.

    So, if you want to communicate, you end up with multiple levels, but those don’t connect in understanding the way they do in appeal. You simply cannot speak to all levels with the expectation that these levels will all get even near the same things. But you keep grinding away at it to refine what you can say, which is good.

    Look at baseball again: why can’t people get something like slugging percentage? One reason, IMO, is that a concept like OBP is just batting average with walks included, so it’s how often you’re on base. You can know that a higher slugging percentage means you’re more of a slugger but that only has meaning to most people as a relative term: you slug more or less than this guy or the usual. Same then with OPS: it’s somehow a mark that you’re more or less dangerous or productive. If you ask ‘what’s slugging percentage?’, not many people could tell you what it is. They may know OPS includes slugging percentage, but that doesn’t define it in their heads. You’re asking people to take a chain calculation of bases (which doesn’t include walks and some other stuff unless you go to a variant). It’s incredibly simple but most people can’t visualize a chain calculation. And the way it’s presented includes an additional abstraction layer: rather than just say count all the bases, it’s 2 by 2B, so the concept of a double is (accurately) divided into the label 2B and the number of bases. This kind of basic rigor is built into mathematical thinking, but most people don’t get that kind of abstraction.

    • I think you underestimate baseball fans. My dad couldn’t tell a normal distribution from a hole in the ground, but he could define at bats, plate appearances, batting average, slugging average, and on-base percentage. He knows that OBP isn’t just batting average plus walks, even if you include hit-by-pitch in the walk column, because sacrifice flies don’t count as at bats. Dad and I calculated a lot of sports stats when I was a kid.

      It’s a stretch to think of “slugging percentage” as a percentage because it’s just average bases per at bat—that is, units are bases/at-bat (though walks aren’t included as bases here because walks aren’t at bats); to think of it as a percentage, a double has to be 200%, a triple 300% and a home run 400%. On-base percentage on the other hand, is hits + bases on balls + hit-by-pitch divided by plate appearances (which include at bats, walks, hit by pitch, and sacrifices). It makes no sense whatsoever to add these numbers—they have different units! No wonder everyone’s confused.

      Edit: I meant to add: Commonly reported baseball stats are terrible. Those are very convoluted definitions only hardcore fans (like my dad and me) know (I fact-checked myself before posting and had the right definitions in my head). Take a look at Tom Tango et al.’s book on baseball stats, The Book: Playing the Percentages in Baseball, in the Amazon preview and you’ll see what I mean.

      • I think it also reflects the fact that the prevailing mass media and social media message was that the “race is close”, with many baselessly believing that a Trump victory was inevitable right up until the end.

    • Feels to me like the key point.

      Many people would be uninterested in giving 5-1 odds on something they believe has a 16% chance of happening. But not because it’s not a fair bet, rather precisely because it IS a fair bet. And they’re risk averse, so they don’t like making fair bets.

      This feels like Andrew saying that whenever the projection is within one standard deviation (we have a “statistical tie”), he’s not comfortable giving odds better than those implied by, say, a 40-60 probability of occurrence. Perfectly reasonable position for a risk-averse person.

  26. I don’t get it.

    All this does for me is point out what we already knew — that the concept of “margin of error” does a really shitty job of conveying the underlying reality of the probability distribution, and in particular the terminology “inside the margin of error” and “outside the margin of error” promote some major logical fallacies. [E.g., the fallacy that a) 0.52 +/- 0.02 is much more alike to b) 0.48 +/- 0.02 than it is to c) 0.53 +/- 0.02, because a) and b) are both “inside the margin of error” while c) is “outside the margin of error”].

    What really confuses me is that Andrew seems to be buying wholesale into that fallacy. I mean, who cares that it’s common to call it a “statistical tie”? That wording does a really poor job of characterizing the situation, and it’s silly to look at a full 68% of the probability distribution (from +1 std dev to -1 std dev) and go “meh, that’s pretty much all the same thing”.

    So yeah, just shitcan the idea of “statistical tie”.

  27. The symmetric posterior is problematic; it should be skewed, in my opinion.

    Central to Taleb’s point (although he is certainly not good at clarifying his points) is the fact that in the case of binary options, higher variance leads to the price being closer to 0.5 (it of course depends on the current state as well). When there is a longer time until maturity, there is more variance to be accumulated, and thus predictions at earlier times should be closer to 50/50 (the integral of the time-dependent volatility sigma(t) is non-decreasing as the maturity is increased). This should incorporate all possible events until the election on top of polling errors (news, weather on election day, etc.) and is of course highly subjective. So predicting 90/10 probabilities with months until the election seems absurd if we are predicting how the election will actually go, but is more reasonable if we are predicting how the election would go were it held today.
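
    A crude sketch of that point (my own toy numbers, ignoring drift, the 0-1 bounds, and everything else): if the expected vote share follows a driftless random walk with daily standard deviation s, the implied win probability today is pnorm((p0 - 0.5) / (s * sqrt(days))), which gets pulled toward 50/50 as the horizon grows.

    p0 <- 0.52    # current expected vote share
    s  <- 0.005   # assumed daily volatility (invented)
    for (days in c(1, 10, 100)) {
      cat(days, "days out:", round(pnorm((p0 - 0.5) / (s * sqrt(days))), 2), "\n")
    }
    # prints roughly 1, 0.9 and 0.66: same point estimate, very different implied odds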

        • And I would say that in your model that we discussed the other day, more time remaining “skews the posterior” for the terminal vote share: the median remains fixed at the current value but the mean gets closer to 50%.

        • Exactly. This happens both in Zhou’s and my variants of the model, just with slightly different mechanics. There absolutely should be pooling towards 0.5 with both greater volatility and greater time until the election. I’m surprised Andrew isn’t discussing this more! Although maybe the drift term in his model accomplishes this? I’m doubtful, because they’d need the drift to have variance/volatility estimated from the data, and I have not seen that idea mentioned or discussed.
          But this does seem like one potential fix for overconfident bets.

        • The drift term used to move the distribution toward their fundamentals-based forecast (based on macroeconomic conditions) without causing any spread. A few months ago the model was revised and seemed to spread out the distribution, but it’s unclear how much (the noise in the interval bounds made it difficult to see what the underlying prediction was). And there is no guarantee that it will make things go towards 50/50: the drift could move the polls-based prediction far from 50% if the fundamentals-based prediction is more extreme.

        • Carlos, yes, that makes sense to me. I would think that a two-step process could be used: 1. fit the model ‘as is’ with all the complex specifications. Then 2. estimate the relevant volatilities/uncertainties from the model (and maybe hierarchically with recent elections too, or something) and then use a Monte Carlo approach, as Zhou initially suggested, to pool the election day forecast toward 0.5. The main issue that I see that would need to be resolved is the structure of the volatility: you probably want many small movements while allowing the possibility of some larger shifts.
          What am I missing?

        • Basically, if you go back to our code and plot winprob versus winprob2, or vote share on a day versus final vote share (in Zhou’s revised code), you will see the classic sigmoidal shape for adding uncertainty to probabilities. Both axes are bounded in (0,1), but the final vote share has added volatility/uncertainty and thus imposes skepticism on shifts, particularly in the middle.

        • Enh okay, I’d have considered that an artifact of the parameterisation. You can certainly reparam to get rid of this effect and maybe it’s kinda desirable to do so.

        • If you increase variance in a bounded process like vote_share(t) I’m not sure you can get rid of mean reversion. I think reparametrization doesn’t get rid of the effect, it solves a different problem.

        • + 1. I don’t have a formal proof to hand, but intuitively and having played with simulations, I don’t see a way around this…

        • Anyway, in that model vote share is not more likely to go towards the mean than to stay far from it (compressed in the extremes with suppressed variance). The forecast is a deformed version of the random walk at the heart of the simulation (it’s constrained to remain in 0/1 and forced to go to 0 or 1 in the end) and it doesn’t remain close to 0.5 (it actually spreads slightly faster in the beginning).

          Using an autoregressive model (instead of a simple random walk, shrink the current level towards 0 using some factor) the vote share becomes really mean reverting and the ending value much less predictable. The forecast oscillates around 0.5 in the beginning with the divergence becoming larger as time goes on, slowly at first and then faster (but it doesn’t look like Fig. 3 in Taleb’s paper).

        • Well, in the model I posted you get rid of it if you do the logit transform to get back to the original random walk setup, no? The same in the model Taleb posted.

          Okay, I probably went too far in asserting it generally. But if you are talking about mean reversion in that sense, it’s basically impossible for Nate or Andrew’s model – or anyone else’s model for that matter – to not have it to some degree. It just happens that vote share estimates are near enough to 0.5 for this to not be a very visible effect. So it’s kinda a meaningless critique.

        • I don’t think we really disagree. I tried again to understand what Taleb’s paper is about but I cannot make much sense of it. I don’t know what the arbitrage pricing computations try to show; I don’t even see why they should be meaningful (an arbitrary model for an ill-specified underlying which cannot be traded). Even if a price could be determined from no-arbitrage considerations, its relevance to the admissibility of forecasts is not clear to me.

          I also have no idea what Fig. 3 is about. According to the caption “Shows the estimation process cannot be in sync with the volatility of the estimation of (electoral or other) votes as it violates arbitrage boundaries” but there is no reference to it in the text. I’ve only looked at the preprint, does the figure appear in the published paper? That would be funny because, in their reply to Clayton, Madeka and Taleb write “There is no mention of FiveThirtyEight in Taleb (2018)” but one of the lines in that chart is labeled “538”. The other is labeled “rigorous updating”, a concept which doesn’t appear again in the paper either.

        • Carlos:

          Try my quora answer link. https://www.quora.com/What-is-the-Nate-Silver-vs-Nassim-Taleb-Twitter-war-all-about

          The paper is actually a writeup of a twitter thread (linked in my answer), where he actually shows the Matlab code. As I establish in my quora answer, the graph is actually the result of a programming error.

          Taleb’s essential argument – I feel reasonably confident – is really about his personal intuition that 538’s estimates (purely by looking at the graph of win probability) show (a) too much variability to be real and (b) a wiggling up and down that indicates the estimate has mean-reversion and hence is not arbitrage-free. (I.e., if a candidate’s win prob is high, you should expect it to come back down again at a later point, so a candidate’s win probability is not a true estimate of their final win probability; and if these probabilities translate to betting odds, you should be able to make a riskless profit.)

          This is a poor critique because as is demonstrated, even Taleb’s own toy example can create win probability graphs with significant variability, and also it’s really impossible to determine the presence/absence of mean reversion purely from a single year’s win prob curve.

        • Looking at the code, I agree that the chart shows in orange the correct forecast obtained from that model [*] using a straightforward calculation (one that has nothing to do with arbitrage pricing of binary options). The green line is an incorrect forecast obtained by sticking a mysterious 14 in somewhere. As you say, he might be misunderstanding his own chart. I definitely give up on trying to understand it when not only are the labels and caption cryptic and there is no explanation in the text, but the labels may also be switched and the chart may not even mean whatever he thinks it means.

          [*] Which makes no sense as a model of vote share anyway. As time goes on, the probability of finding the vote share close to 0.5 (say between 0.25 and 0.75) goes to zero. But we all know that going back many, many scores the actual vote share stays around 0.5.

        • (I meant “many, many scores of years”; I saw Lincoln’s name in a long-term vote share chart and “four score and seven years” came to mind.)

          I noticed a reference from Taleb to another paper from Fry and Burke: “An options-pricing approach to election prediction” https://twitter.com/nntaleb/status/1299069089353170944

          Unfortunately (unsurprisingly?) it doesn’t make much sense either.

          They define as underlying asset the vote share for one party P(t). The vote share at the time of the election is P(T).

          The price at time t of the (binary call) option that pays $1 if P(T) is above K is exp(-r(T-t))pnorm(d) where d = ( log(P(t)/K) + (r – 0.5 sigma^2)(T-t))/(sigma sqrt(T-t))

          The price at time t of the (binary put) option that pays $1 if P(T) is below K is exp(-r(T-t))pnorm(-d)

          The risk-free rate r is an unnecessary distraction that only serves to induce mistakes like the one they commit in equation (3). To find the median P(T) implied by the option prices they solve for the strike K that makes the price of the call equal to 1/2. They should actually look for the strike that makes the price of the call and the put equal (so you’re indifferent about betting over/under K). Only when r=0 are both prices equal to 1/2. They find K as the product of three factors; the last one drops out when their mistake is corrected and we simply have to solve for d=0.

          The correct solution is that the implied median vote share at time T is K = P(t) exp( (r – 0.5 sigma^2) (T-t) )

          Setting r=0 for simplicity, the implied median vote share at time T is less than the current vote share due to the factor exp( (-0.5 sigma^2) (T-t) )

          So if we say that P(t)=0.5 is the share of Republican votes, the model says that the median Republican vote share at the election is less than 0.5 and Democrats have more than 50% probability of winning.

          But we could just as well say that P(t)=0.5 is the share of Democrat votes, so the model says that the median Democrat vote share at the election is less than 0.5 and Republicans have more than 50% probability of winning.

          Vote share is not a financial asset, and a lognormal model is definitely not adequate.

          It introduces a volatility drag: assuming that r=0 and there is no trend, if the current vote share is 0.5 the expected value in the future is still 0.5, but the distribution is skewed to the upside and the median goes down. When the vote share is P(t)=0.5, the price of the “win / P(T) above K=0.5” bet is lower than the price of the “lose / P(T) below K=0.5” bet, as is apparent in the formulas above (their equation 1).
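
          To illustrate with made-up numbers (sigma and the time to the election are arbitrary; this is just their lognormal model with r=0):

          P0    <- 0.5    # current "vote share"
          sigma <- 0.4    # assumed volatility
          tau   <- 0.25   # assumed time to the election
          P0 * exp(-0.5 * sigma^2 * tau)      # implied median of P(T): about 0.49, below 0.5
          1 - pnorm(0.5 * sigma * sqrt(tau))  # probability that P(T) ends above 0.5: about 0.46

          So a candidate currently at exactly 0.5 is given less than a 50% chance of finishing above 0.5, and the direction of the “skew” flips depending on which candidate’s share you decide to model. That’s the sense in which the lognormal assumption is doing the work.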

    • Jonjo:

      Unfortunately, as discussed in the above post, “predicting 90/10 probabilities with months until the election seems absurd” does not give us much as a general principle. We’d actually have no problem making a prediction months before the election with odds of 90/10, or even 99/1, that Biden would win California. And I don’t think that 90/10 was extreme odds, months before the election, that Biden would win the national popular vote. It’s easy to say that the price should be closer to 0.5, but that does not answer the question of which prices you want to set close to 0.5.

        • I was talking about 90/10 in the context of the national popular vote. And I am definitely not saying that it is easy to incorporate all that uncertainty into the model; I am just pointing out that if we are forecasting far into the future for a binary outcome with an underlying volatility process, then we have to adjust towards a 50/50 outcome, and more so the further away the forecasting horizon is. If the current estimate is far away from 50/50, then that of course also influences the estimate. And also, the underlying voting process perhaps has almost hard boundaries (estimated from polls/prior knowledge), so the adjustment for very one-sided states (California) could be minor. That is the take-away from the financial mathematics literature, although the underlying process in this case is very different from the financial ones.

        • Jonjo Shelvey: You shouldn’t have to “adjust” towards 50/50. A well defined model, for a given set of observations (i.e. fundamentals, polling etc) should inherently be closer to 50/50 early in the election vs later. The factors that cause this are part of the model. Adding an *additional* adjustment to bring the % closer to 50/50 is doubling up on this effect and is very hard to justify. The 90/10 estimate is what it was because of those effects.

        • I did not bother to answer your other strawman regarding the skewed distribution, as Carlos Ungil already did. Of course this can and should be incorporated in the model; the adjustment towards 50/50 should be the result of that model. I was talking about the underlying, which is a time-dependent stochastic process for the number of votes at time t<T, T being the time of the election. The smaller t is, the more variance there should be left. This is definitely incorporated somehow in some models, but Taleb’s criticism is that it is grossly underestimated in 538’s case (especially when t<<T).

  28. How much of your trouble is caused by the idea that the probability distribution should be some sort of bell-shaped curve (that just won’t let itself be massaged into a semblance of reality)?

    If I measured a small distance with a yardstick that had only inch marks, then the error term should not be bell-shaped: if my measurement is 13 inches, the probability of it being 12 or 14 is not small, it’s zero; and from 12.6 to 13.4 the probability density isn’t curved, it’s flat. When you argue that setting a 3% win probability for California is nonsense, and it should be 0%, aren’t you arguing for a probability distribution that has some of these properties?
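
    (A tiny simulation of that yardstick, with made-up lengths, just to show the shape:)

    true_len <- runif(1e5, 10, 16)      # true distances, spread over a few inches
    err <- round(true_len) - true_len   # error from reading off the nearest inch mark
    range(err)                          # bounded: nothing beyond half an inch
    hist(err)                           # and flat, not bell-shaped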

      • What if we’re measuring the same distance with multiple yardsticks?
        That can’t be normally distributed. That distribution still has no tails.

        If you take a poll as a measurement of the eventual outcome, I don’t think you can assume that there’ll be a normal distribution, because voter opinion forming is not a random process, and the polls are correlated.

        If you find a bell-shaped distribution doesn’t cut it, shouldn’t you question your assumption that this is the correct distribution?

        • Random sampling from a population and taking the proportion is a very classic binomial problem. Sampling error like that works very well with the central limit theorem.

          You could still get long tails in the distribution, sure, if you specify e.g. long tailed error distribution for late events in the election, or weird shit happening on election day, or some kind of systematic error in polling. That doesn’t look like what happened this time though.
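
          A quick simulation of the pure sampling-error part (n and p invented, just for scale):

          n <- 1000
          p <- 0.52
          sims <- rbinom(1e5, n, p) / n
          c(mean(sims), sd(sims), sqrt(p * (1 - p) / n))  # sd of about 0.016, and the histogram is about as bell-shaped as it gets

          The systematic, correlated part of the error is the piece that doesn’t come with any such guarantee.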

        • I mean yeah, situations where the central limit theorem doesn’t apply do exist. It’s just really hard to think them up – I mean, I know, because my courses are a pain in the arse.

          I think even your ruler example probably fails. You just need a *lot* of measurements. Suppose you measure the same distance a million times and every time you get 13 inches. Do we still believe that 13.4 is a likely result? I’d doubt it – since if the measurement were that close to the boundary, you’d expect someone to misjudge it a few of those times and give 14 instead.

  29. I am posting here as this seems to be the last post about election forecasts. To put it simply, the main issue is that when well-established sources (say 538 and the Economist) write that Biden has an 89% chance of winning, what is missing is that that estimate is dependent on the polls. And if the polls are biased then the prediction may be off. You and Silver obviously know this. Silver always talked about “can Biden survive a, say, 2-point polling error?” etc. But, still, if you have a lot of garbage in (the bias in the polls), then you have some garbage out too, no matter how cleverly you adjusted it. Obviously neither you nor Silver are pollsters, but I think readers usually do not realize that the predictions are conditional on a variety of assumptions that may well turn out to be wrong. In other words, the publicity that these predictions get is too much.

    There is also the issue (that I think was more relevant in 2016 but still) that predictions affect elections too. If in the end I would rather have Biden over Trump but I truly despise the DNC, I may decide to not vote, or vote Jill Stein, as a sign of protest because the polls tell me Biden is going to win anyway. Strangely for me, this issue is never mentioned.

    So, as usual, quantitative social science can be helpful but should be used with caution.

    • Tom:

      As Bill James wrote, the alternative to good statistics is not “no statistics,” it’s “bad statistics.” In the absence of careful analyses such as what we did for the Economist, it’s not like people would just throw up their hands and say they don’t know. Sure, some people will say that polls are crap and so we don’t know anything. But other people will overreact to each new poll, or will interpret polls without reference to previous elections.

      Also, yes, our predictions are conditional on assumptions, but we are allowing for the possibility that the polls can be off. That’s an explicit part of our model. Finally, it’s not at all true that the issue of forecasts affecting votes “is never mentioned.” We’ve talked about this a lot on this blog and also in our published article. I’m skeptical of this having much effect in 2020, in part because there does not seem to have been a consensus among the general population that Biden was a strong favorite.

      • Andrew, thanks for the reply. I agree with everything you said (I have not read this blog often recently and so I must have missed the “forecasts affecting votes” posts, my apologies). My point is that it is one thing to present, at least implicitly, the forecast as:

        1) “this forecast was built using the most advanced statistical tools and so it should be trusted”

        as opposed to:

        2) “this forecast was built using the most advanced statistical tools but there may be still a lot of noise out there and so use it with caution”.

        I think many see the forecast as 1) rather than 2), and so when the forecast turns out to be off by a wide margin, they trash the science rather than appreciate that doing quantitative social science (and here we are talking about estimating a population parameter, the simplest statistical problem!) is hard.

      • Let me add another thing about this statement: “we are allowing for the possibility that the polls can be off”. My understanding is that non-response bias is not modeled economically, just statistically. That is, the model does not take into account the motives that may lead polled individuals to not answer the poll (or to lie on it). Casey Mulligan (with whom I strongly disagree on many things) had some good points about this here (see the graph “Voter Incentives 1: Social Desirability”):

        http://caseymulligan.blogspot.com/2020/10/do-election-forecasts-suffer-from-lack.html

        (You could see this as a revenge of Heckman selection model on polisci!)

        Anyway, thanks again for your answer above, it is always a pleasure to read your blog.

        • What exactly would “economic” modelling of non-response bias actually involve? The linked article just makes a bunch of theories but doesn’t validate them with any sort of data, for example not recognising that there was no shy voter effect in 2018.

        • Answering a survey has pros and cons for each sampled individual. List the factors that affect the choice of answering the survey, say x and y, and then write:

          P(response = 1) = f(x,y)

          Then study carefully whether x and y are correlated with P(Trump = 1). I know adjustments are regularly done to polls for non-response bias. But I have checked some of the polls listed by 538, even the A-rated ones. They all report the sample size and sampling method, but none of them reports the response rate, which we know has been declining steeply for phone interviews (lower than 10%). So you are banking on x and y not being correlated with P(Trump = 1), or on the statistical adjustment doing a lot of the work. Nothing wrong with that, you go to statistical analysis with the data you have. But again this stuff is buried in the footnotes, if at all, and the careless reader (e.g. a Democratic operative in DC) leaves with the impression that the signal is much stronger than the noise.
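
          Here’s a toy version of the concern (every number is invented): if response propensity differs even slightly by candidate preference, the raw responding sample is biased, and nothing in the reported sample size tells you by how much.

          set.seed(1)
          N <- 1e6
          trump <- rbinom(N, 1, 0.48)               # true support: 48%
          p_resp <- ifelse(trump == 1, 0.05, 0.06)  # 5% vs 6% response rates
          resp <- rbinom(N, 1, p_resp) == 1
          mean(trump[resp])                         # poll estimate: about 0.43
          mean(trump)                               # truth: about 0.48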

        • But P(response) is also correlated with P(vote). A lot of the non-responders are just not going to vote, you can’t just assume that people with similar characteristics to a Trump voter, but don’t respond, are gonna be “shy” Trumpies. Further, how are you going to fit a model for non-response if by definition non-responders don’t respond and give you X and Y? Are you just gonna assume a f(x,y)?

          In the end I don’t really see the point of having a non-response model, when what you are actually interested in is the voting likelihood. You can directly estimate, from past performance in elections, using all your covariates how a person ticking Yes (or non-response) corresponds to the final result in the election, and the uncertainty in that estimate. Raw values like the response rate are going to be fairly meaningless, relatively speaking.

      • Tom –

        > but I think readers usually do not realize that the predictions are conditional on a variety of assumptions that may well turn out to be wrong.

        They state, explicitly, that their projections are conditional on the polls being accurate.

        538 goes on to explain what would happen if the polls were in error to different degrees.

        The NYT even went so far as to say what would happen if the polls were as far off as they were in 2016 and 2012.

        What would you have them do? It seems to me that the only way out of the dilemma you describe is for them simply to not do any projections. Which is fine, but if the polls are done then someone is going to do projections. And if the polls aren’t done then someone is going to estimate public sentiment without doing polls.

        • Joshua, I have read 538 more than the Economist (sorry Andrew!) and I agree with you that they have been very explicit about the assumptions. But most readers just read the headlines. 538 has the graph with Biden winning 89 out of 100 times. Maybe rather than the point estimates they could just show the confidence intervals (and yes maybe stats should be taught more in high school!).

          Plus, even if you know stats, the error could be due to drawing an outlier from a distribution you had identified correctly, or it could be due to the fact that there was sampling bias to begin with. I think even more sophisticated readers may think “well, we cannot have another outlier in 2020 as we had in 2016, can we?” as opposed to asking the hard questions of how these polls are produced in the first place (cheap advertising for many colleges, if you ask me) and entertaining the thought that there could be major sampling bias.

          I am not saying that we should go back to reading entrails as was done before systematic forecasts were produced. And so I have no problem with people doing election forecasts, just that these forecasts get too much attention for what they are worth. And clearly 538 has a product to sell and so they cannot downplay it too much.

        • 2016 *wasn’t* an outlier though. 2016 was bog standard when it comes to polling misses. Nate’s been repeatedly clear about that.

        • Tom –

          I think we are in some agreement. I think it’s worth investigating how the forecasters might better communicate about the uncertainty. I think that Andrew et al. and Silver et al. take that pretty seriously and deserve credit for doing so.

          I agree that clearly the issue of sampling bias needs to be tackled more comprehensively – particularly non-response bias. Of course, I also recognize how that would be very difficult to do, methodologically. But yeah, there definitely seems like a big need there.

          I agree that the forecasts get a lot of attention, and that in some ways that can have a negative impact. But I’m not sure that there’s a reasonable response to that problem. In a capitalist society, commercial entities will give consumers what consumers want to pay for. People like polling and forecasts. I like polling and forecasts. It reflects some attributes very basic to human nature. Modern communication technology and social media only amplify the effect of those basic attributes. Observing and trying to correct for the flaws is important. But we have to expect inadequacies, and having an expectation of being error-free is unreasonable. There will be errors, and determining what level of error is unacceptable is necessarily subjective.

          I come at this from watching uncertainty get weaponized in all manner of issues where statistical science intersects with political ideology. For me, it is important to avoid the easy ways that becomes weaponized.

          For example, I think that the link you provided above on the potential causes of sampling bias takes an important issue but deals with it in an entirely subjective manner. It’s a problem when people dress up a subjective examination of uncertainties and present it as if they’re being objective. IMO, people like Nate Silver and Andrew take great pains to avoid doing that. Of course, they can’t be perfect. And we can’t be completely objective in our own viewpoints either – but it’s our responsibility to be careful not to accept a false balance.

        • Joshua, let me emphasize that Andrew is one of my heroes for, among other things, the reasons you mention. That is why I was a bit surprised that he would participate in an endeavor in which, as you also recognize, there are strong incentives to claim that the evidence is stronger than it really is (though in Andrew’s defense, you are damned if you do and damned if you don’t, since others who are not as careful in presenting uncertainty will do it anyway).

          But this stuff matters. Dem operatives are saying they were surprised by the poor House and Senate results. I think many read the forecasts and figured that 2016 was just a glitch: after all, how many people could still support Trump after these four years, and the last months especially? And so they were left with that sense of security that the current message (basically, “we are not Trump”) was working, even if it was delivered by a 77 year old white man who voted to bomb Iraq, among other things (and, yes, I voted for him, with no hesitations).

          I admit that my views are colored by an aversion to how polls are used in US politics. Politics is not a beauty contest, it is also about making the case for a different future, providing options that may seem at first not to have much of a chance. E.g. in 1958, only 4% approved of marriage between blacks and whites (source: https://news.gallup.com/poll/163697/approve-marriage-blacks-whites.aspx).

          Thanks for your comments anyway.

          \end{rant}

  30. Tom –

    > There is also the issue (that I think was more relevant in 2016 but still) that predictions affect elections too. […] Strangely for me, this issue is never mentioned.

    I don’t think it is at all true that it is “never mentioned.”

    But what I do see often mentioned is an assumption that the influence of the polling only affects outcomes in one direction – the “suppression” …

    • … the “suppression” argument that Trump et al. are making. But I don’t see any reason to assume that the influence of the polling runs in any particular direction. For example, for everyone who might decide that the large polling advantage for Biden was a reason not to vote for him, there may well have been many who voted for Biden because of his lead in the polling. I think there’s plenty of evidence suggesting a bandwagon effect whereby people like to be associated with a winning candidate. Why else do you think that Trump was constantly claiming that the polls were wrong and he was going to win in a landslide? I think we can agree it wasn’t to convince Republican-sympathetic independents to vote for Kanye West.
