A different Bayesian World Cup model using Stan (opportunity for model checking and improvement)

Maurits Evers writes:

Inspired by your posts on using Stan for analysing football World Cup data here and here, as well as the follow-up here, I had some fun using your model in Stan to predict outcomes for this year’s football WC in Qatar. Here’s the summary on Netlify. Links to the code repo on Bitbucket are given on the website.

Your readers might be interested in comparing model/data/assumptions/results with those from Leonardo Egidi’s recent posts here and here.

Enjoy, soccerheads!

P.S. See comments below. Evers’s model makes some highly implausible predictions and on its face seems like it should not be taken seriously. From the statistical perspective, the challenge is to follow the trail of breadcrumbs and figure out where the problems in the model came from. Are they from bad data? A bug in the code? Or perhaps a flaw in the model, so that the data were not used in the way that was intended? One of the great things about generative models is that they can be used to make lots and lots of predictions, and this can help us learn where we have gone wrong. I’ve added a parenthetical to the title of this post to emphasize this point. Also good to be reminded that just cos a method uses Bayesian inference, that doesn’t mean that its predictions make any sense! The output is only as good as its input and how that input is processed.
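For example, one generic way to use those predictions is a posterior predictive check: simulate replicated score differences from the fitted model and compare them to the observed ones. Here’s a minimal sketch in R, assuming a stanfit object fit whose generated quantities include a matrix of replicated score differences called score_diff_rep, and a matches data frame with home_score and away_score columns (these names are assumptions, not taken from Evers’s code):

library(rstan)
library(bayesplot)

# Compare observed score differences with replications from the fitted model.
# Assumes "matches" is the filtered training data used in the fit and that the
# generated-quantities parameter is called "score_diff_rep" (names assumed).
score_diff_rep <- rstan::extract(fit, pars = "score_diff_rep")$score_diff_rep

ppc_dens_overlay(
  y    = matches$home_score - matches$away_score,
  yrep = score_diff_rep[1:100, ]
)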

9 thoughts on “A different Bayesian World Cup model using Stan (opportunity for model checking and improvement)”

  1. So without blinking you conclude that Qatar is about twice as likely as Germany to win the World Cup! I don’t mean this as a criticism; it made me think about what to do when you find very surprising results out of a model. Of course you want to check for bugs and stuff, but if it persists, at some point you kind of need to trust your model over your gut! It has happened to me a few times where I should have trusted my model more.

  2. Unrelated to the post:

    I remember an old post from Andrew about someone assuming that of course fracking would cause earthquakes. I’d like to find that post again, but I don’t remember enough detail to find it myself. Can anyone point to it?

  3. Model predictions are insane. Either the code is wrong, or the training data is insufficient. I think the filtering is to blame:

    matches %>%
      filter(
        year(date) >= 2020,
        home_team %in% countries$team,
        away_team %in% countries$team,
        tournament != "Friendly"
      )

    There are simply not that many non-friendly matches between World Cup teams since 2020. And I suspect it is a weird sample, since most of the remaining matches are within regions (e.g., teams from Asia playing teams from Asia).
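
    A quick way to see how thin the remaining data are is to count matches per team, a sketch assuming the same matches and countries data frames and column names as in the snippet above:

    library(dplyr)
    library(lubridate)
    library(tidyr)

    # Count non-friendly matches since 2020 between World Cup teams, per team
    # (columns as in the filtering snippet above).
    matches %>%
      filter(
        year(date) >= 2020,
        home_team %in% countries$team,
        away_team %in% countries$team,
        tournament != "Friendly"
      ) %>%
      pivot_longer(c(home_team, away_team), values_to = "team") %>%
      count(team, sort = TRUE) %>%
      print(n = 32)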

  4. “Model predictions are insane” Not really. Crazier things have happened, in particular in World Cup football (the Iceland phenomenon in 2018; Costa Rica leading their group in 2014, beating Italy and Uruguay; South Korea beating Italy in the 2002 round of 16; the list goes on).

    I agree with the filtering comment though. Results are very critically dependent on the sample data used. Sample data are (too) small, in particular for some teams (ahem, Tunisia). Excluding friendlies limits historical international matches to games played within their respective football associations. Including friendlies would have led to World Cup-atypical (i.e. higher) score differences. This was a design choice. Prior information in the form of world ELO rankings balances the “isolation from association specificity” to a degree.

    Which brings me to the comment on “comparing to prior information”. Many football World Cup predictions I have come across are essentially dressed prior probabilities derived from ELO rankings. Actual World Cup finalists often don’t match those prior information-based trends.

    PS. As football is obviously serious business, this forecasting endeavour is obviously not to be taken too seriously.
    PPS. Any forecast predicting Holland to win the WC is a good forecast.

    • > “Model predictions are insane” Not really. Crazier things have happened,

      The insane part is not predicting that crazy things could happen but saying that they are more likely to happen than non-crazy things. The ultimate “anything can happen” prediction is the uniform 1/32 probability. Saying that Tunisia is as likely to win as Germany would be hard to justify but saying that it’s five times more likely is something else.

      > Sample data are (too) small, in particular for some teams (ahem, Tunisia).

      Even more so for Cameroon or Senegal with no data at all. Which may be why the predicted probability for those teams is 1/32 – apparently ignoring any prior information.

      > Many football World Cup predictions I have come across are essentially dressed prior probabilities derived from ELO rankings.

      If winning one or two matches makes Tunisia and Morocco favorites – all the way from positions #28 and #24 to #3 and #8 – these predictions may not be giving enough weight to the prior information.

      I tried to see what the “naked prior probability” looks like in this model but it’s not easy.

      If there are no matches (zero rows) the script doesn’t work well: the predicted favorite is NA (50% probability), followed by Saudi Arabia (7.5%), Costa Rica (7%), Germany (4.7%) and Tunisia (3%). There is some variability between runs, but that seems to be the pattern.

      I tried using instead 16 matches, all with a 1-1 score, pairing #1 with #2, #3 with #4, etc., assuming that this fake data won’t perturb the prior predictions much. The results are equally puzzling. They change in every run, with probabilities ranging from below 2.5% to over 4%, but those four countries – Saudi Arabia, Costa Rica, Germany, Tunisia – seem to remain consistently close to the top for some reason.
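
      For reference, the fake 1-1 data can be put together roughly like this, a sketch assuming countries is ordered by ELO ranking and that the script expects these column names (both assumptions):

      library(dplyr)

      # 16 fake matches, all ending 1-1, pairing the teams ranked
      # #1 vs #2, #3 vs #4, ... (column names and ordering assumed).
      fake_matches <- tibble(
        date       = as.Date("2022-11-01"),
        home_team  = countries$team[seq(1, 31, by = 2)],
        away_team  = countries$team[seq(2, 32, by = 2)],
        home_score = 1,
        away_score = 1,
        tournament = "FIFA World Cup"
      )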

  5. Maurits:

    What about also reporting the team ability estimates (credible intervals) from your model?
    Something similar to what Andrew did in 2014 for his World Cup model. This should give us a better intuition about how your model fits the data and how much influence the ELO ratings have on the ability estimates (I suppose they could be very influential!).

    (I may also guess that there could be some identifiability issues with the team abilities in your model, even if you use informative priors.)
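
    For example, something along these lines, a sketch assuming a stanfit object fit with a vector parameter called ability, indexed in the same order as countries$team (the actual parameter name in your model may differ):

    library(rstan)
    library(dplyr)

    # Posterior medians and 50%/95% credible intervals for team abilities
    # (parameter name and indexing assumed).
    ability_draws <- rstan::extract(fit, pars = "ability")$ability

    tibble(
      team   = countries$team,
      lo95   = apply(ability_draws, 2, quantile, probs = 0.025),
      lo50   = apply(ability_draws, 2, quantile, probs = 0.25),
      median = apply(ability_draws, 2, median),
      hi50   = apply(ability_draws, 2, quantile, probs = 0.75),
      hi95   = apply(ability_draws, 2, quantile, probs = 0.975)
    ) %>%
      arrange(desc(median))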
