Probabilistic forecasts cause general misunderstanding. What to do about this?

The above image, taken from a site at the University of Virginia, illustrates a problem with political punditry: There’s a demand for predictions, and there’s no shortage of outlets promising a “crystal ball” or some other sort of certainty.

Along these lines, Elliott Morris points us to this very reasonable post, “Poll-Based Election Forecasts Will Always Struggle With Uncertainty,” by Natalie Jackson, who writes:

Humans generally do not like uncertainty. We like to think we can predict the future. That is why it is tempting to boil elections down to a simple set of numbers: The probability that Donald Trump or Joe Biden will win the election. Polls are a readily available, plentiful data source, and because we know that poll numbers correlate strongly with election outcomes as the election nears, it is enticing to use polls to create a model that estimates those probabilities.

Jackson concludes that “marketing probabilistic poll-based forecasts to the general public is at best a disservice to the audience, and at worst could impact voter turnout and outcomes.”

This is a concern to Elliott, Merlin, and me, given that we have a probabilistic poll-based forecast of the election! We’ve been concerned about election forecast uncertainty, but that hasn’t led us to take our forecast down.

Jackson continues:

[W]e do not really know how to measure all the different sources of uncertainty in any given poll. That’s particularly true of election polls that are trying to survey a population — the voters in a future election — that does not yet exist. Moreover, the sources of uncertainty shift with changes in polling methods. . . . In short, polling error is generally larger than the reported margin of error. . . . Perhaps the biggest source of unmeasurable error in election polls is identifying “likely voters,” the process by which pollsters try to figure out who will vote. The population of voters in the future election simply does not yet exist to be sampled, which means any approximation will come with unknown, unmeasurable (until after the election) errors.

We do account for nonsampling error in our model, so I’m not so worried about us understating polling uncertainty in general terms. But I do agree that ultimately we’re relying on pollsters’ decisions.
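To give a sense of what accounting for nonsampling error means in practice, here’s a minimal sketch of the general move (the numbers, including the 1.5-point nonsampling standard deviation, are placeholders for illustration, not the values in our model): the extra uncertainty gets added on the variance scale, so the effective margin of error is wider than the nominal one.

```python
import math

def total_margin_of_error(p_hat, n, nonsampling_sd=0.015):
    """Widen a poll's nominal margin of error with an extra nonsampling term.
    Placeholder numbers, for illustration only."""
    sampling_se = math.sqrt(p_hat * (1 - p_hat) / n)          # classical sampling error
    total_se = math.sqrt(sampling_se**2 + nonsampling_sd**2)  # variances add, not sds
    return 2 * total_se                                       # ~95% margin of error

# A 1000-person poll at 52%: the nominal margin of error is about 3.2 points,
# but with the extra term it comes out closer to 4.4 points.
print(round(total_margin_of_error(0.52, 1000), 3))
```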

Jackson also discusses one of my favorite topics, the challenge of communicating uncertainty:

Most people don’t have a solid understanding of how probability works, and the models are thoroughly inaccessible for those not trained in statistics, no matter how hard writers try to explain it. . . . It is little wonder that research shows people are more likely to overestimate the certainty of an election outcome when given a probability than when shown poll results.

That last link is to a paper by Westwood, Messing, and Lelkes that got some pushback a few months ago when it appeared on the internet, with one well-known pundit saying (mistakenly, in my opinion) “none of the evidence in the paper supports their claims. It shouldn’t have been published.”

I looked into all this in March and wrote a long post on the paper and the criticisms of it. Westwood et al. made two claims:

1. If you give people a probabilistic forecast of the election, they will, on average, forecast a vote margin that is much more extreme than is reasonable.

2. Reporting probabilistic forecasts can depress voter turnout.

The evidence for point 1 seemed very strong. The evidence for point 2 was not so clear. But point 1 is important enough on its own.

Here’s what I wrote in March:

Consider a hypothetical forecast of 52% +/- 2%, which is the way they were reporting the polls back when I was young. This would’ve been reported as 52% with a margin of error of 4 percentage points (the margin of error is 2 standard errors), thus a “statistical dead heat” or something like that. But convert this to a normal distribution and you’ll get an 84% probability of a (popular vote) win.

You see the issue? It’s simple mathematics. A forecast that’s 1 standard error away from a tie, thus not “statistically distinguishable” under usual rules, corresponds to a very high 84% probability. I think the problem is not merely one of perception; it’s more fundamental than that. Even someone with a perfect understanding of probability has to wrestle with this uncertainty.
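To spell that arithmetic out, here it is in a few lines of code, using the same hypothetical 52% estimate, 2-point standard error, and normal approximation as above:

```python
import math

def normal_cdf(x):
    # standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

point_estimate, standard_error = 0.52, 0.02            # the hypothetical forecast above
p_win = 1 - normal_cdf((0.50 - point_estimate) / standard_error)
print(round(p_win, 3))                                  # 0.841: the 84% in the text
```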

As is often the case, communication problems are real problems; they’re not just cosmetic.

Even given all this, Elliott, Merlin, and I are keeping our forecast up. Why? Simplest answer is that news orgs are going to be making probabilistic forecasts anyway, so we want to do a good job by accounting for all those sources of polling error that Jackson discusses.

Just one thing

Jackson’s article is on a site with the URL “centerforpolitics.org/crystalball/”. The site is called “Sabato’s Crystal Ball.”

Don’t get me wrong: I think Jackson’s article is excellent. We publish our pieces where we can, and to publish an article does not imply an endorsement of the outlet where it appears. I just think it’s funny that a site called “Crystal Ball” decided to publish an article all about the problems with pundits overstating their certainty. In all seriousness, I suggest they take Jackson’s points to heart and rename their site.

50 thoughts on “Probabilistic forecasts cause general misunderstanding. What to do about this?”

  1. Thanks for the kind words here. I appreciate your engagement with the article, and with the overall topic. I don’t disagree with anything you’ve said here (although the Crystal Ball name is not mine to comment on, of course – I very much appreciate their willingness to publish my piece).

    For what it’s worth, I am also realistic about these models existing. My goal is not to get them taken down, because I believe that would be a futile effort. My goal is to spread caution about the models, particularly given my role in the field in 2016. I was not cautious enough – not maliciously or carelessly; I just fell into the don’t-do-anything-subjective trap I describe in the article – and that was a disservice to everyone. I think these models have been taken too seriously because they worked in 2008 and 2012, which is really not a long track record to go on. Perspective on how we discuss the models, and how media covers them, is my goal.

    I am grateful that Andrew, Elliott, and Merlin are thinking so carefully about these things, and I have been impressed with their willingness to adjust the model and engage with criticism.

    • Natalie,

      I read your article and basically agree with everything. I would echo Andrew’s point that we are being super careful with uncertainty, most notably and novelly by (a) adding extra measurement error for other sources of non-sampling variance and (b) by attempting to adjust for partisan non-response, which to my knowledge no other major forecaster is doing.

      But this is where things get tricky, right? Because if every forecaster thinks they are doing something better, there is no incentive to NOT do a forecast. We certainly believe that our forecast is worthwhile because without it, we would be left in a world with fewer good forecasters and certainly more bad punditry. But there is definitely an acknowledgement of the doom loop here.

      One thing our forecast may be blind to is heretofore-unforeseen biases in likely-voter filters because of covid and postal service troubles. I am of the belief that such errors are about as likely as they always have been (I have been meaning to blog about this), but I think this is an area where we might be able to make some improvements.

      Finally, I am pleased that we can engage in a discussion about this subject with respect and level-headedness. Not everyone in this industry is so willing.

      • Andrew, it seems you have improved your model to have uncertainty intervals that widen noticeably as they move into the future.

        The “chance of winning the most votes” at the top of the page is 98% for Biden and 3% for Trump (!). Let’s say the right figure is 97.5% for Biden.

        The popular-vote prediction for Biden on election day is the 95% uncertainty interval [49.3%, 58.3%]. I got that by looking at the chart; the only figure actually given is the point estimate of 54.2% (different from the midpoint of the interval, which would be 53.8% by my reading).

        If the 95% uncertainty interval is calculated using quantiles, there is already a 2.5% probability of Biden getting less than 49.3% of the popular vote in the election, and one still has to add the probability of the [49.3%, 50%] interval.

        On the other hand, if it is a high-density interval it could be the case that the probability of Biden losing the popular vote is just 2.5%. Does the asymmetry of the distribution explain these numbers?
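        To make the arithmetic concrete, here is a minimal Monte Carlo sketch of the point (the normal distribution and its location and scale are made-up stand-ins, not the Economist model’s actual, possibly asymmetric posterior): with an equal-tailed interval whose lower bound sits below 50%, the probability of losing the popular vote necessarily exceeds 2.5%.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-in posterior for Biden's popular-vote share (made-up numbers,
# NOT the Economist model), used only to illustrate the quantile arithmetic.
share = rng.normal(0.542, 0.023, 1_000_000)

lo, hi = np.quantile(share, [0.025, 0.975])   # equal-tailed 95% interval
print(f"95% interval: [{lo:.3f}, {hi:.3f}]")
print(f"P(share < lower bound) = {np.mean(share < lo):.3f}")    # 0.025 by construction
print(f"P(share < 0.50)        = {np.mean(share < 0.50):.3f}")  # exceeds 0.025
```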

      • Yes, you do. You show two probabilities and the CI.

        I’m suggesting showing the CI *instead* of the probabilities. Just show the range of outcomes in creative ways, forget probabilities as a way to communicate.

  2. On the name: my read is that Crystal Ball is self-aware. Crystal balls don’t work. If someone told me that they got info from a crystal ball, I’d assume they were knocking the source.

      • Andrew –

        Could you elaborate on why you’re so disdainful of Lichtman’s prediction method?

        -snip-

        Retrospectively, the keys model accounts for the outcome of every American presidential election since 1860, much longer than any other prediction system. Prospectively, the Keys to the White House has correctly forecast the popular vote winner of all seven presidential elections from 1984 to 2012, usually months or even years prior to Election Day. The chart to the right shows the Keys model’s vote share forecasts in comparison to the actual vote for the incumbent party’s candidate in each of those election. On average, the Keys model missed the final election result by 2.4 percentage points.

        -snip-

        https://pollyvote.com/en/components/models/mixed/keys-to-the-white-house/

      • I’m with Caleb. “Sabato’s Crystal Ball” sounds like the musings of some guy. Rating things as “Safe”, “Likely”, “Lean”, or “Toss-up” conveys that the ratings are qualitative. The nebulousness of what “Likely” actually means is a feature because it gives the reader a sense that the predictions are imprecise.

        In contrast, the header of your model on the Economist website says “The Economist is analysing polling, economic and demographic data to predict America’s elections in 2020”. This implies a rigorous, quantitative approach that can be trusted to the final digit reported. Especially for people who are not familiar with all the assumptions that go into this sort of modeling, reporting a probability like 87% chance of a Biden win conveys a false sense about the precision of the approach. I don’t think you would say that you are certain the actual probability is 87% and not 84% or 90%, but the way you present things indicates this to the general public (and I’ve seen far worse examples from other forecasts where they report things like 87.23% which is plainly a ridiculous level of precision).

        I think you could revise the way you present your model results to better convey the inherent uncertainty in this sort of forecast.

        • “reporting a probability like 87% chance of a Biden win conveys a false sense about the precision of the approach. ”

          Following Andrew’s links I don’t see where you get just a plain “87%”. His 12-June post has two charts both of which have huge shaded areas clearly indicating the range of possibilities.

          I’m sure the public doesn’t understand exactly what a “95% confidence interval” means, but I’m even more sure people understand that forecasts and predictions are inherently uncertain.

          A lot of the public doesn’t even understand what “percent” means. No one is ever going to convey a detailed knowledge of uncertainty to those people. Yet they can still have an intuitive sense that predictions are uncertain, and the media is so full of wrong and wacky predictions that it would be surprising if they weren’t aware that election forecasts are highly uncertain.

        • > Following Andrew’s links I don’t see where you get just a plain “87%”.

          The Economist page is this one: https://projects.economist.com/us-2020-forecast/president

          Before going into details it presents this summary:

          “Right now, our model thinks Joe Biden is very likely to beat Donald Trump in the electoral college.

          Joe Biden – Democrat
          Chance of winning the electoral college: around 9 in 10 or 90%
          Chance of winning the most votes: better than 19 in 20 or 98%
          Predicted range of electoral college votes (270 to win): 220-434

          Donald Trump – Republican
          Chance of winning the electoral college: around 1 in 10 or 10%
          Chance of winning the most votes: less than 1 in 20 or 3%
          Predicted range of electoral college votes (270 to win): 104-318”

        • Bruce contends that the presentation from Andrew et al.:

          “…implies a rigorous, quantitative approach that can be trusted to the final digit…”

          Yet right at the head of the page, before the numbers it says:

          “…Right now, our model thinks…[it is]…very likely…”

          As well as the other factors you pointed out. I just don’t see that as offering unrealistic precision.

    • Agree with Caleb. I read “Crystal Ball” as self-deprecating. And their “About” page is consistent…

      “…we’re also modest enough to know that no Crystal Ball can foresee all the twists and turns of a turbulent era in American politics. Thus, our motto remains ‘He who lives by the Crystal Ball ends up eating ground glass!'”

      • Maybe you and Andrew are both right.

        They claim to be “modest” but just looking through the site they don’t seem to be mocking themselves too loudly, so the idea that they’re not intending it to be taken seriously seems like a stretch. They’re playing both sides of the bet: selling the “Crystal Ball” while claiming it doesn’t mean anything.

  3. My 2 cents: give up expectations that are unreasonable. You unreasonably expect reasonability, meaning you think that people can logically connect dots to the depth that you desire as a person who connects dots to that depth and beyond. Turn on the TV: when there’s a windstorm, they remind people not to touch downed electrical wires. Why? Isn’t that obvious? Don’t play in the water when there’s a hurricane. Uh, yes, but they remind us of that each time because a lot of people forget that lesson when they’re near the water and there’s a hurricane. See my point? You’re defining a result that you see as achievable, partly because that’s natural to you, partly because you’ve been trained, partly for other reasons, and that is not true for other people. That is, you are misled by their apparent ability to follow. You see behavior that looks conforming until it isn’t, because human responses, when they’re not honed to a specific, internalized understanding, tend to be noisy as bleep, meaning you have to expect a wide distribution of responses. Or prosaically, when confronted by an actual live wire, people who haven’t internalized the lesson are more likely to forget the don’t.

    I’m not sure that anyone trained in any intellectual field can fully comprehend the degree to which they’ve internalized approaches which others have not.

    To me, if it is demonstrated that presenting data causes people to not vote in material numbers, then you fix what you’re saying. Without that focus, without that specificity clarifying the space to be analyzed, you could actually make communication worse by trying to make it better.

    To sum, recognize the inherent nature of the problem, and wait until something gives you information which is actionable. You can better define actionable if you think about the limits of communication from the perspective of those who are not inside the bubble of your shared understandings.

  4. Quoting the post: “This would’ve been reported as 52% with a margin of error of 4 percentage points (the margin of error is 2 standard errors), thus a ‘statistical dead heat’ or something like that.”

    The error seems to be describing this as a “statistical dead heat” when in fact it indicates an 84% chance of victory.

    A strict reading of the notation “52% +/- 2%” yields the interval (50%, 54%). Later you talk about “margin of error” being two standard deviations, or (48%, 56%). I think reporting +/- sd is going to be very misleading to the public who will interpret any interval with more certainty than it deserves.

    Speaking of N = 1, I see the NY Times reporting things like “This man called the last election for Trump, what does he say now?” on the top of their web site (I’m not even going to bother linking this garbage). My dad called the last election for Trump based on talking to people at bars in suburban Detroit, so maybe he should take up political punditry with a larger audience than me.

    In the public’s defense, the link between the popular vote and the electoral vote is rather confusing statistically. The polls last time were pretty good on Clinton’s margin of victory in the popular vote, but not so good on Trump winning the electoral college.

    There’s a fun article in a recent New Yorker about how the Simulmatics corporation created the future. I suspect Andrew’s coverage on the blog will show up in six to nine months. It’s about Robert Kennedy’s application of analytics to his brother’s election in 1960. Not surprisingly, given the name, Simulmatics was a giant simulator of elections. It worked with a form of deterministic regression and poststratification (aka “Doctor P”). What surprised me is that the Kennedy election was a dead heat in the popular vote. And that so many people of both parties thought this kind of social science analytics was outright evil.

    Quoting the post again: “Elliott, Merlin, and I are keeping our forecast up. Why? Simplest answer is that news orgs are going to be making probabilistic forecasts anyway, so we want to do a good job by accounting for all those sources of polling error that Jackson discusses.”

    Academics are judged on the same basis as newspapers: the quality and quantity of their readership.

    Do you think a lot of these polls are being done by organizations that are not trying to do a good job accounting for error? Or to put it another way, when combining sources, how much do you try to account for underlying bias of the source organization as opposed to error and non-representativeness?

  5. In 2016 a friend expressed outrage that the election forecasts had been so wrong: “They said Clinton was a sure thing!” I said “I don’t know what forecasts you’ve been looking at; I mostly just look at fivethirtyeight[.com] and just before Election Day they gave Trump something like a 30% chance. That’s not a sure thing.” And my friend said “OK, mathematically that’s not a sure thing, but it’s pretty much a sure thing.”

    Honestly it’s hard to know what to do if people think about probability this way.

    Perhaps one solution is to not give numerical estimates at all. We could map probability estimates onto plain-language descriptions instead, e.g.:

    50-55% chance becomes “dead heat, could go either way”
    55-60%: “Candidate A has a slightly better chance of winning.”
    60-70%: “Candidate A has a noticeable edge, but if Candidate B wins that will only be a slight upset.”
    70-80%: “Candidate A will probably win, but it’s far from a sure thing.”
    80-90%: “Candidate A is very likely to win. A win by Candidate B would be a major upset, though not a historic one.”
    90-95%: “Candidate A is extremely likely to win. Not quite a sure thing, but getting to that territory.”
    95%+ : “Candidate A is extremely likely to win. A win by Candidate B would be one of the biggest upsets in US presidential election history.”

    I have to admit, I find it hard to come up with wording that would allow someone to put the phrases in correct order, even. And I’m not sure what problem this would solve…except my friend could not look at “Candidate A will probably win, but it’s far from a sure thing” and claim that the forecast said it was a sure thing!
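    For what it’s worth, once the wording is settled the mapping itself is trivial to automate; here’s a minimal sketch using the cutoffs and phrases from the list above (it assumes Candidate A is the favorite, i.e. p is at least 0.5):

```python
def describe_chance(p):
    """Translate a win probability for the favored Candidate A (0.5 <= p <= 1)
    into the plain-language phrasing proposed above."""
    phrases = [
        (0.55, "Dead heat, could go either way."),
        (0.60, "Candidate A has a slightly better chance of winning."),
        (0.70, "Candidate A has a noticeable edge, but a win by Candidate B "
               "would only be a slight upset."),
        (0.80, "Candidate A will probably win, but it's far from a sure thing."),
        (0.90, "Candidate A is very likely to win. A win by Candidate B would be "
               "a major upset, though not a historic one."),
        (0.95, "Candidate A is extremely likely to win. Not quite a sure thing, "
               "but getting to that territory."),
    ]
    for cutoff, phrase in phrases:
        if p < cutoff:
            return phrase
    return ("Candidate A is extremely likely to win. A win by Candidate B would "
            "be one of the biggest upsets in US presidential election history.")

print(describe_chance(0.71))  # "Candidate A will probably win, but it's far from a sure thing."
```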

    • Outside of 538, there were a number of sources giving 85-95 percent to Clinton in the weeks before the election.
      You called that range something between “a major upset” and “not quite a sure thing, but getting to that territory.”

      538 was an outlier in the predictions.

      • +1. On the other extreme was Sam Wang at Princeton who called it 99% for HRC IIRC. I believe Andrew’s model was somewhere around 90% shortly before the election.

      • Sure, there were other predictions out there, but I and many other people considered 538 to be the most credible. I’m no election forecasting expert so I had to choose who to believe based on track record and on how much their methodology made sense to me; 538 won on both counts.

        Most of the other predictions I saw were clearly over-certain. My default assumption is that any model predictions about anything will be over-certain, so any mechanistic model that doesn’t have a human inflating the error bars is likely to understate them. I remember discussing with friends the ways the forecasts could be wrong in either direction — the usual suspects being the extent to which the voting population differs from the polled population, and people favoring one candidate being more likely to answer — and the fact that this could lead to a regional or nationwide polling bias. This possibility was recognized by political commentators but I think a lot of models ignored it, Andrew’s included I think.

        Anyway I was trusting 538 so I was surprised but not shocked by Trump’s victory. But among my friends I think there was some wishful thinking and a tendency to believe the forecasts that said HRC had it sewn up.

        The funny thing is, though, that the friend I mentioned above did read 538 and thought of it as the single most reliable forecast!

    • At minimum all forecasters should express the uncertainty in appropriate and honest and accurate mathematical terms, with appropriate and honest caveats. They don’t have a responsibility to make the public understand probability. Everyone knows forecasts have error.

    • It can both be true that (a) people misunderstand probabilities and (b) forecasters are overconfident.

      Here are all the final 2016 US election forecasts from prominent models I could find. They are ordered from most to least confident.

      Probability of Clinton Win

      The Princeton Election Consortium: 99%

      PollyVote, DeSart and Holbrook: 99%

      The Huffington Post, Natalie Jackson: 98%

      PredictWise, David Rothschild: 93%

      Daily Kos: 92%

      Slate, Pierre-Antoine Kremp and Andrew Gelman (based on Drew Linzer model): 90%

      New York Times, The Upshot: 85%

      Elliott Morris: 84%

      PollSavvy: 82%

      FiveThirtyEight, Nate Silver et al.: 72% or 71% (polls-plus or polls-only forecasts)

      How much forecaster overconfidence was there in 2016? Well, that depends on how much of an upset you think Trump’s win was. But regardless, I think it’s useful evidence to look at the full set of 2016 forecasts. As Daniel and Chris note, many were much more confident than FiveThirtyEight.

      • I really like the idea of mapping forecast probabilities to sports probabilities to help intuition (at least for sports fans).

        So in the last election the forecasters could have said that the probability that Clinton is elected is equal to…
        – (Princeton and PollyVote): the probability of an extra point being kicked (under the old NFL rules).
        – (FiveThirtyEight): the probability of a 30 yard FG being kicked.

        The current Economist model could say that Biden’s probability of winning the electoral college is about the same as Kevin Durant making a free throw.

  6. The first story I use when teaching the problems of communicating uncertainty comes from one my stats teacher (David Bartholomew) told us:

    “We think that we know about uncertainty, and that when we have added a standard error or a confidence interval to a point estimate we have increased knowledge in some way or other. To many people, it does not look like that; they think that we are taking away their certainties—we are actually taking away information, and, if that is all that we can do, we are of no use to them.
    This was brought home to me forcibly when Peter Moore and I appeared before the Employment Select Committee of the House of Commons—which is not a random sample of the population at large. Our insistence that we could not deliver certainties was regarded as a sign of weakness, if not downright incompetence. One may laugh at that, but that is the way it was—and that is what we are up against.” (Bartholomew, 1986, p. 428, JRSS:A)

  7. I like the article and the discussion, but I want to push back a bit on the idea that people aren’t good at reasoning under uncertainty. We reason under uncertainty all the time in everyday life, and I don’t think we’re that bad at it! What we’re bad at is reasoning under uncertainty with abstract representations like “there’s an 82% chance of X” or “the vote share will be 52% +/- 2%”. We’ve been running experiments that use less anemic visual representations paired with real incentives (not just asking “what do you think the chance is?” or “how confident are you?”) and on average folks don’t do too badly.

  8. The postmortems on the U.S. election by the psephologists were quite varied in character and often a combination of the following:

    a. Even in Russian roulette there is a 17% chance of failure
    b. Well, she did win the popular vote by almost three million
    c. We noted that things were changing near the end
    d. The FBI and KGB did it
    e. Our competitors did even worse
    f. Sam Wang of the Princeton Election Consortium had to eat an insect and we didn’t

    Akin to the Big Bang of 13.8 billion years ago, some unexpected, momentous things just happen and they are hard to explain.

  9. I’m confused about how to do this even before we get to non-sampling error. How do we get to the 84%? That’s the biggest problem and the hardest thing to put a number on, since it involves the “unknown unknowns.” I’ve written a blog post on it, so as not to put too much here in the comments. Extract:

    Example (ii): 9 voters. I sample 3; 2 are for Clinton, 1 for Trump. My estimated margin is 67% for Clinton. What is my standard error? (Straightforward, with effort.) With what probability do I think Clinton will win? (Less than 67%.)

    Example (ii) is simpler. One way to think about it is that each voter has today, and will have on election day, the same unchanged probability of voting for Clinton, X, and I’m trying to estimate that number. My best (modal) guess for Clinton’s margin is then 9X, rounded to the nearest ninth, since 60%, for example, is an impossible margin. I can also figure out the probability that, if X = 55%, say, rolling that 55% die 9 times for the 9 voters will end up with Clinton getting 5, 6, 7, 8, or 9 votes (since the three voters I sampled could all, in this model, change their minds on election day and vote differently from today).
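    (As a quick check of that arithmetic, here is the calculation under the same all-voters-independent assumption, which the next paragraph rightly calls extreme:)

```python
from math import comb

X, n = 0.55, 9   # each of 9 voters independently votes Clinton with probability 0.55
p_clinton_wins = sum(comb(n, k) * X**k * (1 - X)**(n - k) for k in range(5, n + 1))
print(round(p_clinton_wins, 3))   # about 0.62 under this independence model
```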

    But that assumes each voter has an independent probability X of voting for Clinton, which is an extreme assumption and unrealistic. It’s closer to reality to think that we’ve got Clinton voters and we’ve got Trump voters and we’re trying to figure out how many of each there are: a voter doesn’t flip a coin to decide which he is.

    What is the opposite extreme? Assume that each of our sampled people has two other non-sampled people just like him. Then we can deduce that there are at least 6 people for Clinton and 3 for Trump. Since there are only 9 people total, that means that we can predict with 100% probability that Clinton will win, and that it will be by a 6-3 margin.

    So is it really useful to try to arrive at a number like 84%? I like the idea of knowing how an expert would bet on whether Clinton gets at least 50%. I think I’d prefer his honest answer without his depending too much on formal analysis, though. Like how I liked Brian Leiter’s ranking of philosophy departments before he went formalistic. I think Professor Leiter (a leftwing Nietzsche scholar) is biased, but I know his biases and can correct for them, and I value his gut opinion more than what he does now, which is some sort of expert survey where he gets to pick the experts so it looks fairer to naive people.

    This is related to bayesian-v-classical conundrum: how fancy do you make your model, and do you put in your subjective priors?

    https://www.rasmusen.org/blog1/hillary-clintons-current-margin-is-52-2-and-she-will-win-with-84-probability/

  10. “Predictive models give people the wrong impression about how likely a candidate is to win based on the polls.”

    “So do pundits.”

    “But models use data and express their predictions numerically, so people take them too seriously.”

    “Polls express their error numerically, too. Pundits seldom admit they could be in error. Some refuse to even admit they’ve ever been wrong.”

    “People know pundits’ interpretation of polls may be wrong without them acknowledging it. Everybody knows that an opinion is just an opinion.”

    “Have you ever heard of Fox News?”

    “Deceptive punditry is obviously harmful to democracy, but so are misunderstood model predictions.”

    “So, as a political scientist, instead of criticizing media that lies to voters, you’ve chosen to criticize media that allows voters to have access to information that’s too complicated for many of them to understand? What kind of priority is that?”

    “Well, the people who lie to voters pay me to conduct polls so that they’ll have something to lie about. The modelers are just free-riders.”

    “Ah.”
