Yup, here’s more on the topic, and this post won’t be the last, either . . .

Jed Grabman writes:

I was intrigued by the observations you made this summer about FiveThirtyEight’s handling of between-state correlations. I spent quite a bit of time looking into the topic and came to the following conclusions.

In order for Trump to win a blue state, either:

1.) That state must swing hard to the right or

2.) The nation must have a swing toward Trump or

3.) A combination of the previous two factors.

The conditional odds of winning the electoral college dependent on winning a state therefore are a statement about the relative likelihood of these scenarios.

Trump is quite unlikely to win the popular vote (538 has it at 5% and you at <1%), so the odds of Trump winning a deep blue state due to a large national swing are extremely low. Therefore, if the state's correlation with the nation is high, Trump would almost never win the state. In order for Trump to win the state a noticeable proportion of the time (say >0.5%), the correlation needs to be low enough that the state specific swing can get Trump to victory without making up the sizable national lead Biden currently has. This happens in quite a number of states in 538’s forecast, but also can be seen less frequently in The Economist’s forecast. For example, in your forecast Trump only wins New Mexico 0.5% of the time, but it appears that Biden wins nationally in a majority of those cases due to its low correlation with most swing states.

It is hard for me to determine what these conditional odds ought to be. If Trump needs to make up 15 points in New Mexico, I can’t say which of these implausible scenarios is more likely: That he makes up 6 points nationally and an additional 9 in New Mexico (likely losing the election) or that he makes up 9 nationally and an additional 6 in New Mexico (likely winning the election).

If you are interested in more on my thoughts, I recently posted an analysis of this issue on Reddit that was quite well received: Unlikely Events, Fat Tails and Low Correlation (or Why 538 Thinks Trump Is an Underdog When He Wins Hawaii).
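To make the New Mexico question concrete, here is a minimal sketch of one way to compare the two scenarios. It assumes, purely for illustration, that the national swing and the extra state-specific swing are independent zero-mean normals; the function names and the standard deviations tried are hypothetical and are not taken from either forecast.

```python
import math

def log_normal_pdf(x, sd):
    """Log density at x of a zero-mean normal with standard deviation sd."""
    return -0.5 * (x / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

def log_odds_state_heavy(sd_national, sd_state, gap=15.0, national_small=6.0):
    """Log density ratio of (6 national + 9 state) to (9 national + 6 state).

    Positive values mean the scenario with the larger state-specific
    swing is the more plausible of the two under these assumptions.
    """
    national_large = gap - national_small
    state_heavy = (log_normal_pdf(national_small, sd_national)
                   + log_normal_pdf(national_large, sd_state))
    national_heavy = (log_normal_pdf(national_large, sd_national)
                      + log_normal_pdf(national_small, sd_state))
    return state_heavy - national_heavy

# Illustrative standard deviations, in percentage points (assumed values).
for sd_national, sd_state in [(3.0, 3.0), (2.0, 4.0), (4.0, 2.0)]:
    print(sd_national, sd_state,
          round(log_odds_state_heavy(sd_national, sd_state), 2))
```

Under these assumptions the answer turns entirely on which error term is wider: equal standard deviations make the two scenarios equally plausible, and whichever component has the larger standard deviation is the one more likely to supply the 9 points. With non-normal tails the comparison changes again, which is part of why the question is hard.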

I replied that yes, the above is just about right. For small swings, these conditional distributions depend on the correlations of the uncertainties between states. For large swings, these conditional distributions depend on higher-order moments of the joint distribution. I think that what Fivethirtyeight did was to start with highly correlated errors across states and then add a small bit of long-tailed error that was independent across states, or something like that. The result of this is that if the swing in a state is small, it will be highly predictive of swings in other states, but if the swing is huge, then it is most likely attributable to that independent long-tailed error term and then it becomes not very predictive of the national swing. That’s how the Fivethirtyeight forecast can simultaneously say that Biden has a chance of winning Alabama but also say that if he wins Alabama, that this doesn’t shift his national win probability much. It’s an artifact of these error terms being added in. As I wrote in my post, such things happen: these models have so many moving parts that I would expect just about any model, including ours, to have clear flaws in its predictions somewhere or another. Anyway, you can get some intuition about these joint distributions by doing some simulations in R or Python. It’s subtle because we’re used to talking about “correlation,” but when the uncertainties are not quite normally distributed and the tails come into play, it’s not just correlation that matters–or, to put it another way, the correlation conditional on a state’s result being in the tail is different than the correlation conditional on it being near the middle of its distribution.
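Here is a rough sketch of the kind of simulation that can build this intuition, in Python using only the standard library. It is not Fivethirtyeight's model or ours: the 3-point national error, the 1-point state noise, the 1% chance of a shock, and the 15-point shock scale are all made-up numbers chosen to show the effect.

```python
import random
import statistics

def simulate(n_draws=200_000, seed=1):
    """Draw (national swing, state swing) pairs, in percentage points.

    The state swing is the shared national swing plus small local noise
    plus, rarely, a large shock that is independent of the national swing.
    All scales are illustrative assumptions.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        national = rng.gauss(0.0, 3.0)     # error shared across states
        local = rng.gauss(0.0, 1.0)        # ordinary state-specific noise
        if rng.random() < 0.01:            # rare long-tailed component
            local += rng.gauss(0.0, 15.0)
        draws.append((national, national + local))
    return draws

def mean_national_given_state(draws, lo, hi):
    """Average national swing among draws with lo <= state swing < hi."""
    vals = [nat for nat, state in draws if lo <= state < hi]
    return statistics.fmean(vals), len(vals)

draws = simulate()
print("modest state swing:", mean_national_given_state(draws, 2.0, 4.0))
print("huge state swing:  ", mean_national_given_state(draws, 15.0, float("inf")))
```

In this setup a modest state swing goes along with a national swing of nearly the same size, while a swing of 15 points or more comes almost entirely from the independent shock and says little about the nation. That is the same pattern as a Biden win in Alabama barely moving his national win probability.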

It’s good to understand how our models work. One advantage of a complicated multivariate prediction is that it gives so many things to look at, that if you look carefully enough you’ll find a problem with just about any prediction method. Reality is more complicated than any model we can build—especially given all the shortcuts we take when modeling.

Once you’ve made a complicated prediction, it’s great to be able to make it public and get feedback, and to take that feedback seriously.

And once you recognize that your model will be wrong—as they say on the radio, once you recognize that you are a sinner—that’s the first step on the road to improvement.

Also the idea that the correlation depends on where you are in the distribution: that can be important sometimes.

This reminds me of an issue when I used to work in finance. Correlations between price movements of 2 financial instruments are easy to estimate for reasonably typical market conditions.

But one often wants to know the correlation for very large moves. That’s much harder.

A very experienced trader once said to me that when markets go wild all correlations -> + or – 1.

This is why a portfolio of “independent” diversified securities does not give the protection one would wish if there is a market meltdown.

The usual understanding of these election models is that they are taking data from polls, plus some priors, and somehow aggregating data to make an objective prediction. If the model is “good” then the predictions will be good.

But it’s been interesting to watch this reflective process, of looking for odd predictions made by the model. It seems that when the model makes one of these weird predictions, the response is “that’s not what I was trying to say, let’s fix the model”. It seems that the election predictions are being treated as just the subjective judgment of the well-informed human experts who made it (which seems like the right way to look at it).

For Bayesians I guess the theoretically ideal thing to do would just be to write down all the probability distributions, treating each as the modeler’s subjective best judgment, and making sure they are consistent. Of course this is computationally hopeless even if everything is discrete.

So then maybe one purpose of a statistical model is simply as a tool that makes it possible to express these judgments at all. The choice of model and priors would just be a more compact, tractable alternative to telling everyone the full distribution. Posterior predictive checks would test whether the model is expressing the opinions that it should.

All this is very influenced by the ideas of Bayesian practitioners such as the main author of this blog. However, the perspective of statistical modeling as just a language for compactly expressing beliefs is one I have not run into before as far as I know, at least not stated in that exact way. Would be very interested in reading more stuff from this perspective — I would guess people have thought in these terms before.

What got me thinking about it was the use of “bidding languages” in combinatorial auctions, where you have to come up with some way for people to express preferences that doesn’t involve setting a price on all 2^N combinations of items for sale.

“It seems that the election predictions are being treated as just the subjective judgment of the well-informed human experts who made it (which seems like the right way to look at it).”

I find that a troubling aspect of this approach. It’s unclear which part of the model is mechanical and which part is a personal opinion. You can fix mechanical problems. You can’t fix personal opinion: no one knows if it’s right or wrong.

All well and good for election “modelling”: the model isn’t adding much if any value to the polls. The polls in this race are clear. But if they weren’t clear – if it was ±2% – would the model add any certainty? Unlikely – again, unless the certainty was from some other obvious source, for example if it was clear that one candidate would carry all the large states.

So that’s actually a good question: what does this or Nate’s model tell us that we can’t already see from the polls?

What if you were using this approach to forecast, say, hurricanes? Would you be going “well, jeez, in the tails it says one might occur in Maine but no, really I don’t think so so I’m going to tweak it out”?

Ilikegum:

The polls are data. They have value, but they need to be interpreted. Here are two naive interpretations of the polls that are wrong:

1. The polls should be taken literally, as if a poll with N respondents is the equivalent of a random sample of N balls from an urn.

2. The polls should be completely ignored as they provide no relevant information about the election.

There’s good evidence, both empirical and theoretical, against interpretation #1 and interpretation #2. Reasonable people all agree that we need something in between. A model (or, more generally, a forecasting method) is a way of getting something in between #1 and #2.

As for hurricanes: I’ve never tried to forecast them, but my colleagues and I have modeled all sorts of other things, and, yes, our workflow does involve going back and forth between modeling and examining the implications of our models.

“Here are two naive interpretations of the polls”

But no one would advocate either, right? The reality is that usually “the polls” means N *polls* each with n *respondents*, where N is in the hundreds if not thousands including all state polls over the year leading up to the election. People are already doing their own aggregating from many many polls over a period of a year and usually the polls are right.

It makes a lot of sense to “understand the implications” of a model. What’s not as clear is how those implications should be judged – what constitutes a “right” or “wrong” outcome of the model – and whether or how the model should be changed to reflect a judgement call, particularly when that judgement call is about an abstract probability in the tail of a model distribution.

In the case of elections it’s doubly academic, since a) these seem to be tweaks in the noise; and b) they won’t change the outcome. But OTOH again the judgement call scale is a smooth gradation so I can see people getting increasingly comfortable tweaking on decreasingly certain judgement situations, then applying that to something that matters.

I think the case for election models is pretty clear if you look at 2016. Conventional wisdom was that Clinton would definitely win, but 538 predicted around a 30% chance of Trump winning. Correlated polling errors in a few states can cause election outcomes that are not easy to predict from just raw polling data. You need to aggregate the polling data, quantify the historical accuracy of polls, incorporate your prior knowledge about election fundamentals, and then simulate all the “what if” scenarios. If you formalize the process in a model, then you can automatically incorporate new data in near real time, track how the race has changed over time, and quantitatively compare/contrast with other forecasts.

(aka ilikegum@fastmail.com)
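The point about correlated polling errors can be illustrated with a small simulation. This is only a sketch under invented assumptions: three swing states, a 2-point poll lead in each, and a total polling error of about 3 points per state that is either fully independent or mostly shared. None of these numbers come from an actual forecast.

```python
import random

def prob_lose_all(leads, shared_sd, indep_sd, n_draws=100_000, seed=2):
    """Estimate the chance the poll leader loses every listed state.

    Each state's result is its poll lead plus an error shared by all
    states plus an independent state-level error (all in points).
    """
    rng = random.Random(seed)
    sweeps = 0
    for _ in range(n_draws):
        shared = rng.gauss(0.0, shared_sd)
        if all(lead + shared + rng.gauss(0.0, indep_sd) < 0 for lead in leads):
            sweeps += 1
    return sweeps / n_draws

leads = [2.0, 2.0, 2.0]  # hypothetical poll leads in three swing states
print("independent errors:  ", prob_lose_all(leads, shared_sd=0.0, indep_sd=3.0))
print("mostly shared errors:", prob_lose_all(leads, shared_sd=2.6, indep_sd=1.5))
```

The per-state uncertainty is about the same in both runs, but losing all three states at once is several times more likely when most of the error is shared. That joint behavior is hard to read off a list of poll averages.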

“I think the case for election models is pretty clear if you look at 2016.”

I would say just the opposite. Models did no better than polls. I guess you’re going to argue that because Nate had a moderate chance of Trump winning that models were better, but that’s a low viscosity argument. No one gave that any credit and really no one knows what that even means. Most people interpreted that as “Clinton will win”. If the models had given Trump >=45% chance of winning while the polls were giving hands-down to Clinton, I might buy your argument.

So you can reasonably estimate the probability of a candidate winning just by reviewing the raw polling data? You should write out how to do that so others can understand, but don’t use any “model” though because we are trying to avoid that.

Or maybe you’re saying we can’t possibly make a reasonable prediction with a statistical model because there is too much uncertainty? What should we do then? Listen to experts, but only the ones who don’t use advanced statistical analysis?

“So you can reasonably estimate the probability of a candidate winning just by reviewing the raw polling data?”

Yep. If the polls are heavily in favor of Biden, which they are now, I bet on Biden. If they were heavily in favor of Trump, I would bet on Trump. If they were too close to call, I wouldn’t bet: exact same thing I’d do if there were a model with purportedly quantitative probabilities.

“Or maybe you’re saying we can’t possibly make a reasonable prediction with a statistical model because there is too much uncertainty?”

I’m saying that all you’re doing is expending a pile of work to regurgitate what we already know from polls: Biden is the likely winner.

I think the hurricane example is actually reasonable too.

I mean, if I were forecasting hurricanes, it would seem kind of dumb, but that’s because I have no business forecasting hurricanes. But for a meteorologist to do that, based on their own scientific knowledge and as part of a research community, and a commitment to respecting the data to the extent it is reliable, it seems very defensible.

I think Gelman/Heidemanns as well as the FiveThirtyEight team know a lot about polls and elections and I trust their judgments. (Although, see Nate Silver’s 2016 primary failure for a way this can go wrong.)

I guess this only works if the modeler’s subjective judgment is good, and they are working in good faith and not fooling themselves.

“based on…scientific knowledge”

If that’s what it’s based on, great. But intuition isn’t scientific knowledge. And I’m not even saying intuition is wrong. It’s just a lot harder to defend.

“I trust their judgments.”

I trust Andrew too. I trust that Andrew and all the people he works with are doing the best that they can. It doesn’t mean they’re always right. It’s not always clear what’s the right thing to do.

Going even further, I can’t see why we don’t just look at the end result – it predicts X to win by that much? – and judge whether we find it reasonable, and if not, tweak the model until we get what we wanted. I mean, what are the rules? Where does the tweaking stop? What features in the predictions are you allowed to remove through tweaking the model?

“I mean, what are the rules? Where does the tweaking stop?”

Yes, I guess that’s my question. Perhaps we need a statement and paper by a group of eminent statisticians with guidelines for tweakers.

Andrew (other):

Your comment would be correct if our model were doing nothing but forecasting a single scalar parameter. But that’s not the situation we’re in. We’re predicting all 50 states at once, which allows us to look at things like the predictive probability that Trump wins the national election conditional on him winning California, etc. It’s not a simple one-to-one mapping from assumptions to conclusions.

I’m not sure what you mean. The fact that it’s not a one-to-one mapping from assumptions to conclusions means that the tweaking of assumptions required to effect the desired change in the conclusion of interest could be complicated.

But that just seems like a computational problem. Just use a computer to do lots of tweaking until you get the desired conclusion, e.g. candidate X wins by Y margin.

That computational problem doesn’t seem relevant to the question of methodology: what are the rules? where does the tweaking stop? what is allowed and what is forbidden?

I would feel slightly more comfortable if you were tweaking the *prior* predictives rather than the posterior ones, but I don’t think that’s what you are doing. It would also be very interesting to know the *prior* predictive for the outcome of the election, before conditioning on any polling data.

Andrew (other):

You write: “Just use a computer to do lots of tweaking until you get the desired conclusion, e.g. candidate X wins by Y margin.” But we’re not just spitting out one number! We’re spitting out a 50-dimensional posterior distribution, which is mostly summarized by 50 estimates and a covariance matrix.

The answer to, “what is allowed and what is forbidden,” is that anything is allowed. If you don’t like it, that’s fine, but there aren’t really any alternatives, at least not yet. This sort of opinion forecasting is not a mature science. There are too many sources of nonsampling error floating around for this to be done automatically. Or, to put it another way, there are ways it could be done automatically, but I’d have no good reason to trust the results.

Or, to put it yet another way: you might not trust a model that my colleagues and I have built after examining its predictions and checking that they are reasonable. That’s fine. But then you really really really shouldn’t trust a model whose predictions haven’t been checked in this way!

I think the point about it being worthwhile to tweak the *prior* rather than the posterior is maybe worthwhile. It’s perfectly reasonable for example to construct a prior by initially doing something kind of vague, and then sticking in some “fake data” to constrain the prior down to something you like, and then go from there adding in the real data. This is worthwhile to think about.

Right, Daniel, I absolutely agree. Tweaking until the prior predictives look sensible (discarding the polling data) seems like a reasonable process of constructing a prior compatible with prior knowledge. Tweaking the posterior predictives — with anything allowed! — is wild.

You say that it is a “clear flaw” in 538’s forecast, but I am not convinced. As Jed points out, the same outcome happens in the Economist forecast as well, but to a lesser extent. The question seems to be about the degree to which we expect these types of outcomes. It is not difficult for me to imagine scenarios where rare state specific events (e.g., scandal, natural disaster, denying federal disaster aid to specific state) shift the outcome in a particular state largely independent of the national outcome. Maybe if you look back at data from past elections you could argue that this doesn’t happen, but with a small sample size it seems more like it would depend on what you assume is theoretically possible.

N:

Yes, all things are possible. Nobody’s saying these probabilities should be exactly zero. It’s all about the numbers. So, yes, “to a lesser extent.” You say, “It is not difficult for me to imagine scenarios where rare state specific events (e.g., scandal, natural disaster, denying federal disaster aid to specific state) shift the outcome in a particular state largely independent of the national outcome.” I could see this causing a shift of a couple percentage points, but not 10 percentage points. The only such examples I can think of historically are third-party challenges, and there are no state-specific third-party challenges in this election. So, again, the question is not about theoretical possibility (i.e., probability greater than 0), it’s about the probability.

Yet another way to say it is that these models are constructed by humans and are imperfect. I would be stunned if the Economist’s and Fivethirtyeight’s predictions did not have serious flaws somewhere, if you look hard enough. How could it be otherwise? Our default assumption should be that the predictions are flawed, and it makes sense that any time we look carefully at some aspect of the predictions that haven’t been carefully studied before, that we uncover issues.

Thanks for taking the time to respond.

I agree all models are flawed and most code has bugs. Making your code open source helps find problems, but if you’re paying well qualified people a salary to do stuff like code reviews and QA then that can work well too.

You may be right that 538’s model has serious flaws, I was mostly taking issue with describing it as “clearly flawed”. In your series of blog posts it didn’t seem that clear cut to me as a reader, but maybe I lost the thread.

I am confused by how the state correlation is computed (https://github.com/TheEconomist/us-potus-model/blob/b89e602ec03e81c797c66cf2743af6fdae4aa526/scripts/model/final_2016.R#L272), which is the correlation of the matrix containing all state-level census variables. Why is the correlation of white percentage the *same* as the correlation of the polling odds ratio at 2020?

Yuling:

I’m not sure, but I’ll say one thing: Our state correlations are a hack, constructed by all sorts of weird manipulations from past correlation matrix in order to get something that seemed reasonable to us and that gave reasonable results when used with past elections.

Andrew said,

“Reality is more complicated than any model we can build—especially given all the shortcuts we take when modeling.”

This observation deserves a name, which deserves to be in the lexicon.

(Any suggestions? Maybe just Reality?)

See the Borges story: On Rigor in Science.

“Also the idea that the correlation depends on where you are in the distribution: that can be important sometimes.”

The Formula That Killed Wall Street

https://www.wired.com/2009/02/wp-quant/