Australian polls failed. They didn’t do Mister P.

Neil Diamond writes:

Last week there was a federal election in Australia. Contrary to expectations and to opinion polls, the Government (a coalition between the Liberal (actually conservative) and National parties, referred to as LNP or the Coalition) was returned with an increased majority defeating the Australian Labor Party (ALP or Labor, no “u”).

Voting in Australia is a bit different since we have compulsory voting (that is, you get fined if you don’t vote) and preferential voting. Allocation of preferences is difficult and sometimes based on what happened at the last election, and the pollsters all do it differently.

Attached is a graph of the two party preferred vote over the last three years given by Kevin Bonham, one of the most highly regarded poll analysts in Australia. Note that in Australia Red means Labor and Blue means Liberal. The stars correspond to what actually happened at the election.

Since the election there has been much analysis of what went wrong with the polls. I’m attaching two links—one by a Nobel Laureate, Professor Brian Schmidt of the Australian National University, who pointed out that the published polls had a much lower variability than was expected, and another (very long) post from Kevin Bonham which looks at what has happened and suggests among other things that the polls “may have been oversampling voters who are politically engaged or highly educated (often the same thing).”

Diamond also links to this news article where Adrian Beaumont writes:

The Electoral Commission’s two party preferred projection is . . . the Coalition wins by 51.5-48.5 . . . Polls throughout the campaign gave Labor between 51 and 52% of the two party preferred vote. The final Newspoll had a Labor lead of 51.5-48.5 [in the other direction from what happened; thus the polls were off by 3 percentage points] . . . I [Beaumont] believe the poll failure was caused in part by “herding”: polls were artificially too close to each other, afraid to give results that may have seemed like outliers.

While this was a failure for the polls, it was also a failure of the betting markets, which many people believe are more accurate than the polls. . . . the Betfair odds . . . implying that the Coalition had only an 8% chance of winning. . . . It is long past time that the “betting markets know best” wisdom was dumped. . . .

Another reason for the poll failure may be that pollsters had too many educated people in their samples. Australian pollsters ask for age and gender of those they survey, but not for education levels. Perhaps pollsters would have been more accurate had they attempted to stratify by education to match the ABS Census statistics. People with higher levels of education are probably more likely to respond to surveys than those with lower levels.

Compulsory voting in Australia may actually have contributed to this problem. In voluntary voting systems, the more educated people are also more likely to vote. . . .

If there is not a large difference between the attitudes of those with a high level of education, and those without, pollsters will be fine. . . . If there is a big difference, as occurred with Trump, Brexit, and now it appears the [Australian] federal election, pollsters can miss badly. If you sort the seats by two party swing, those seats that swung to Labor tended to be highly educated seats in the cities, while those that swung biggest to the Coalition were regional electorates. . . .

I’m surprised to hear that Australian polls don’t adjust for education levels. Is that really true? In the U.S., it’s been standard for decades to adjust for education (see for example here). In future, I recommend that Australian pollsters go Carmelo Anthony.
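
To make that concrete, here’s a minimal toy sketch of plain poststratification by education (not full MRP, and not any pollster’s actual weighting scheme). Every number below is invented for illustration; the point is just that reweighting an education-skewed sample to census education shares can shift a two-party-preferred estimate by a few points, which is the size of error being discussed.

    # Toy poststratification by education -- a sketch, not any pollster's method.
    # All numbers are hypothetical and chosen only to illustrate the mechanics.
    cell_support = {"degree": 0.56, "no_degree": 0.46}   # Labor 2PP within each education cell
    sample_share = {"degree": 0.55, "no_degree": 0.45}   # education mix among poll respondents
    census_share = {"degree": 0.30, "no_degree": 0.70}   # education mix in the population (e.g. ABS Census)

    raw = sum(cell_support[c] * sample_share[c] for c in cell_support)
    poststratified = sum(cell_support[c] * census_share[c] for c in cell_support)

    print(f"raw, education-skewed estimate: {raw:.3f}")             # 0.515
    print(f"reweighted to census shares:    {poststratified:.3f}")  # 0.490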

24 thoughts on “Australian polls failed. They didn’t do Mister P.”

    • I disagree: the point of a prediction market is to aggregate information about a specific event. It’s not to make sure that the p assigned to each of a wide range of events is, on average, equal to the frequency with which the events occur.

      Fundamentally, the purpose of a prediction market is Bayesian, and when you place 92% credence on an outcome and instead an outcome with 8% credence occurs, you have done a poor job predicting.

      Obviously prediction markets will not always converge to 100% on the correct outcome seconds before the outcome is revealed, but the way we should evaluate them is the extent to which they approximate that… If they converge to 50/50, at least they are honest about no one having any idea what will happen. If they converge to 92/8 and the 8 happens… it’s an indication people have done a poor job evaluating the question.

      • when you place 92% credence on an outcome and instead an outcome with 8% credence occurs, you have done a poor job predicting

        Suppose one makes N probabilistic event predictions of 92%. To be calibrated, y ~ binomial(N, 0.92) of those events should obtain. If N = 100 and y = 100 or y = 80, you have a good reason to suspect your estimate of 92% is off. But in this case, we only have 0 ~ binomial(1, 0.92). We can’t even reject that forecaster by the p < 0.05 criterion.
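
        For concreteness, a quick numerical check of that (a small sketch using scipy; the N = 100 contrast is the one raised above):

          from scipy.stats import binom

          # One miss out of a single 92% forecast has probability 0.08 under the
          # forecaster's own model, so it can't be "rejected" at p < 0.05.
          print(binom.pmf(0, n=1, p=0.92))       # 0.08

          # With N = 100 forecasts at 92%, the extremes mentioned above really would be surprising:
          print(binom.pmf(100, n=100, p=0.92))   # P(y = 100) is about 2e-4
          print(binom.cdf(80, n=100, p=0.92))    # P(y <= 80) is tiny, well below 0.001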

        • This is exactly the “calibrated” (frequency) criterion that I rejected, though. Those other events have *nothing* to do with the event in question, and the people involved in betting on them are totally different people with different information.

          Fundamentally, every prediction market question is *always* an N=1 event. There can be no question of “calibration”; all we have is: did you put high credence on the event that occurred, or not? It’s a question of whether there existed good information in the world that was aggregated in a good way, or whether the information didn’t exist, was highly biased, or didn’t get aggregated well.

        • Imagine for example that I have 10 binary outcomes, they turn out to happen as:

          1011010011

          Now clearly 60% of these occurred…

          Now suppose that prior to finding out what happened I had two models. One of them gave the probability of occurrence (a 1 result) as:

          .9 .1 .9 .9 .1 .9 .1 .1 .9 .9

          the other gave

          .58 .58 .58 .58 .58 .58 .58 .58 .58 .58

          Clearly the first one is off by frequency… its expected number of 1s is .9*6 + .1*4 = 5.8.

          The other one is obviously off by frequency too… but its expected number of 1s is .58*10 = 5.8 as well…

          The frequency error in both of these is -.02 (an expected proportion of .58 against an observed .60), which seems really close, so perhaps they’re both basically just as good as each other?

          Now, what happens if we bet according to these models? Specifically, suppose when you win you get .4 dollars, and when you lose you get -.6 dollars.

          What would your outcomes have been? Under the first model, you’d have won all ten bets: .4*10 = $4.

          Under the second model you’d have won 6 of the times, and lost 4 of the times…

          0.4 * 6 – 0.6 * 4 = 0

          From the perspective of the *purpose* of a prediction market, the goal should be to predict accurately… we should measure goodness in terms of information content… the second model here has essentially no information content, other than a good approximation of the ultimate frequency. The first model has a LOT of information content; it’s almost completely accurate at each event.
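
          A quick check of that arithmetic (the betting rule of backing whichever side a model puts above 50% is just my reading of the setup above):

            outcomes    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
            model_sharp = [0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.9]  # P(outcome = 1)
            model_flat  = [0.58] * 10

            # Both models expect 5.8 ones out of 10, a frequency error of only -0.02 per event.
            print(round(sum(model_sharp), 2), round(sum(model_flat), 2))   # 5.8 5.8

            def bet_return(probs, outcomes, win=0.4, lose=-0.6):
                """Back whichever side the model puts above 50%; win +0.4, lose -0.6."""
                return sum(win if (p > 0.5) == (y == 1) else lose
                           for p, y in zip(probs, outcomes))

            print(round(bet_return(model_sharp, outcomes), 2))   # 4.0 -- all ten bets won
            print(round(bet_return(model_flat, outcomes), 2))    # 0.0 -- six wins, four losses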

        • Say there are two additional models giving predictions with the following probabilities:

          1 0 1 1 0 1 0 0 1 1

          .8 .2 .8 .8 .2 .8 .2 .2 .8 .8

          Every prediction favours the same outcome as in your “.9 .1 .9 .9 .1 .9 .1 .1 .9 .9”, but one of them is more certain, the other less certain.

          The first one is clearly the best… if the outcome is actually 1011010011

          But what if the outcome doesn’t always match the favoured prediction? Say they get seven right and three wrong.

          Would you say that the model doing predictions with certitude is better, because it makes “better predictions” (converging to 100%, as you said) in most cases?

        • I’ve been thinking about what the proper information metric is, but it being a weekend and a lot of other stuff going on I haven’t come up with it. Intuitively for the binary case I want to do something with log base 2 of the error…

          There’s probably some metric already well defined, but intuitively a perfect model predicts a long string of bits without any “additional” information, and the worse the model is, the more “correction bits” you need to get from the prediction to the right answer… If you already know how to define such a thing, let me know; otherwise you might think about it and figure it out, or I can come back to it maybe Monday.

        • Daniel, cross-entropy (or log-loss) is a popular metric. The Brier score is another. But if you don’t like calibration you may not like them either.
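
          For what it’s worth, both of those can be computed directly on the toy models from this thread (a small sketch; log-loss in base 2 is the same “bits of error” idea):

            import math

            outcomes    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
            model_sharp = [0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.9]
            model_flat  = [0.58] * 10

            def log_loss_bits(probs, outcomes):
                """Average -log2 of the probability assigned to what actually happened."""
                return sum(-math.log2(p if y == 1 else 1 - p)
                           for p, y in zip(probs, outcomes)) / len(outcomes)

            def brier(probs, outcomes):
                """Mean squared gap between the forecast probability of a 1 and the 0/1 outcome."""
                return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

            print(log_loss_bits(model_sharp, outcomes), brier(model_sharp, outcomes))  # ~0.15 bits, 0.01
            print(log_loss_bits(model_flat, outcomes),  brier(model_flat, outcomes))   # ~0.97 bits, ~0.24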

      • > Obviously prediction markets will not always converge to 100% on the correct outcome seconds before the outcome is revealed, but the way we should evaluate them is the extent to which they approximate that…

        What criteria do you propose then to evaluate prediction markets?

        Say that for a series of 100 football matches two different prediction markets A and B make the same predictions, and get it right 80 times, but with different levels of certitude:

        A) predicts the winner for each match as a sure event (100% certitude)

        B) predicts the winner for each match as a likely event (75% certitude)

        Is prediction market A better?

        Do you think that any prediction market can do a better job predicting simply by rounding to 0% or 100%?

        • see comment here: https://statmodeling.stat.columbia.edu/2019/11/09/australian-polls-failed-they-didnt-do-mister-p/#comment-1160696

          What you want to look at is the probability that you assigned to the outcome that happened. So if you assign 100% to “heads” and you get tails… you assigned 0 to that, and it should count against you.

          If you always assign 100% and you are always right… you’re obviously doing a good job: you have a lot of information, and you need very little correction… log(1) = 0, so you have 0 bits of error per prediction.

          Suppose you assign 75% to the outcome that occurs each time…

          -log(.75)/log(2) = .415 bits of error per prediction.

          Suppose you assign 1 to the actual outcome for 50% of the events, and .5 for the other half…
          .5 * (-log(.5)/log(2)) + .5 * (-log(1)/log(2)) = .5, so you have about half a bit of error per prediction.

          so I think what you want is 1/N * sum(-log(p(actual_outcome[i]))/log(2)) for the average bits of prediction error

        • Breaks down quite a bit if you assign p=0 to the actual outcome obviously, as this costs you infinite bits, but the point is that this does indicate a big error… and again that assigning say 0.08 to the actual outcome is a bad situation… it costs you 3.6 bits

        • But this also shows that you *shouldn’t* round off to 1/0 as you asked… because you *will* get some error, and then you’ve got infinite bit cost. From this perspective it makes some sense to use a prior that excludes certainty/zero, like a beta(1.1,1.1) rather than something like the standard beta(1,1) for inference on a binomial outcome…
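
          Putting the proposed formula and the “don’t allow exact 0/1” point together, a small sketch (the eps clipping is just one crude way to keep a certain-but-wrong forecast finite; it isn’t the beta(1.1,1.1) prior, only the same spirit):

            import math

            def avg_bits_of_error(probs, outcomes, eps=None):
                """1/N * sum(-log2 p(actual outcome)), the metric proposed above."""
                total = 0.0
                for p, y in zip(probs, outcomes):
                    q = p if y == 1 else 1 - p            # probability given to what happened
                    if eps is not None:
                        q = min(max(q, eps), 1 - eps)     # pull hard 0/1 calls away from the edge
                    total += -math.log2(q) if q > 0 else float("inf")
                return total / len(outcomes)

            print(avg_bits_of_error([0.75] * 4, [1, 1, 1, 1]))     # 0.415 bits, as above
            print(avg_bits_of_error([0.08], [1]))                  # ~3.6 bits for the 8% call
            print(avg_bits_of_error([1.0], [0]))                   # inf: certainty that misses
            print(avg_bits_of_error([1.0], [0], eps=0.01))         # ~6.6 bits once clipped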

  1. Seems to me there are two other possibilities. One is that the polls were correct and there has been hanky-panky. Generally elections are not so close that anything but blatant corruption and fraud could make a difference, but in a close election it might.

    Also, there is the possibility that polling itself has become an unreliable technique, as those polled may refuse to engage or outright lie. Any phone call I get that starts talking about a survey I cut off immediately, because so many “polls” are push-polls, disguised slanders, or loaded questions intended to be passed off as popular disapproval or support.

    But, overall, given enough elections, isn’t it likely that even a statistical projection with a properly estimated margin of error will sometimes fail?

    • Australian elections are run by the Australian Electoral Commission and are widely regarded as very reliable. Pacific Islanders shifted as a group, particularly in Queensland, and that may have been missed by the pollsters’ samples.

    • Multiple voting, false enrolments, fraudulent how-to-vote ads, party “advisers” permitted to “assist” voters in nursing homes, etc. … among a compliant, negligent public, a politically infiltrated electoral commission, and a skewed judiciary… Your point is valid. The Australian Electoral Commission is even presently trying to defend, in court, its inaction regarding ads by the Liberals that were admittedly designed to look like AEC ads, with no party branding, telling Chinese voters that the “correct way to vote” is to Vote 1 Liberal. The AEC is siding with the Liberals in this case. The commission admits to thousands of multiple votes each election, because Australia has no identity checks when voting and the Commission does not validate addresses anyway.
      https://quadrant.org.au/opinion/qed/2019/06/election-fraud-and-the-aec/
      https://morningmail.org/electoral-rorting/

  2. Is it just my own conservative bias, or is it always the case that polls overestimate how well the more leftwing candidate will do? Perhaps a simple explanation is that the pollsters intentionally put their thumbs on the scale a little bit, as one can when there are decisions to be made on weighting, mistakes in data collection, etc.

    • That certainly seems to be the case in the UK. The Conservatives did better than the polls expected in the 1992 and 2015 general elections. While I’m no expert, I believe the problem is partly about differential non-response bias, with young, highly educated people, who are disproportionately unlikely to vote Conservative, being more likely to respond to surveys. All that said, in the most recent election in 2017, Labour did better than most pollsters expected. (From my left-wing perspective I think the polls are over-interpreted by left-wing media and commentators due to some kind of wishful thinking.)

      • My experience has been that young people are the least likely to respond to surveys. However, if these samples are quota sampling young people, then I would expect the more educated young people to be oversampled relative to the less well educated young people. But I would have expected them to be more conservative than the less educated young people, who are more likely to be working class and so to vote Labour.

    • Eric:

      I don’t know about Australia. In the U.S. we looked at state polls over several elections and didn’t find any systematic error toward either party. But in any given election there is error. We found non-sampling error to be of the same order of magnitude as sampling error. So I think it should be easy to find errors in both directions, and it could be that some errors are more salient to you, or to some other observers.

  3. As a side project, Monica Alexander (monicaalexander.com) and I started MRP-based polling in Australia at the 2016 election: https://www.petitpoll.com/. We skipped 2019 as our first son arrived at pretty much the same time, but we’re back at it now.

    We are in the process of opening everything up, including data and model. If you’re in Australia, I’m presenting a paper about the model on 2 and 3 December at Monash and the ANU, respectively.

    Always keen to involve anyone who is interested, just get in touch.

    Rohan
