History of time series forecasting competitions

Here are a couple of posts from Rob Hyndman in 2017 and 2018 that remain interesting, and not just for time series:

M4 Forecasting Competition:

The “M” competitions organized by Spyros Makridakis have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models. . . .

Makridakis & Hibon (JRSSA, 1979) was the first serious attempt at a large empirical evaluation of forecast methods. It created heated discussion, and was followed by the M-competition comprising 1001 time series that participants were invited to forecast. The results were published in Makridakis et al (JF, 1982).

The M2 competition focused on different issues and involved a much smaller data set, but with richer contextual information about each series.

Twenty years after the first competition, the M3 competition was held, involving 3003 time series. . . .

Now, almost 20 years later again, Makridakis is organizing the M4 competition. Details are available at https://mofc.unic.ac.cy/m4/. . . . I [Hyndman] am pleased to see that this new competition involves two additions to the previous ones . . . It does not appear that there will be multiple submissions allowed over time, with a leaderboard tracking progress (as there is, for example, in a Kaggle competition). This is unfortunate, as this element of a competition seems to lead to much better results. See my paper on The value of feedback in forecasting competitions with George Athanasopoulos for a discussion. . . .

A brief history of time series forecasting competitions:

Prediction competitions are now so widespread that it is often forgotten how controversial they were when first held, and how influential they have been over the years. . . . The earliest non-trivial study of time series forecast accuracy was probably by David Reid as part of his PhD at the University of Nottingham (1969). Building on his work, Paul Newbold and Clive Granger conducted a study of forecast accuracy involving 106 time series . . .

Five years later, Spyros Makridakis and Michèle Hibon put together a collection of 111 time series and compared many more forecasting methods. They also presented the results to the Royal Statistical Society. The resulting JRSSA (1979) paper seems to have caused quite a stir, and the discussion published along with the paper is entertaining, and at times somewhat shocking. . . .

Maurice Priestley was in attendance again and was clinging to the view that there was a true model waiting to be discovered:

The performance of any particular technique when applied to a particular series depends essentially on (a) the model which the series obeys; (b) our ability to identify and fit this model correctly and (c) the criterion chosen to measure the forecasting accuracy.

Makridakis and Hibon replied:

There is a fact that Professor Priestley must accept: empirical evidence is in disagreement with his theoretical arguments.

Many of the discussants seem to have been enamoured with ARIMA models.

It is amazing to me, however, that after all this exercise in identifying models, transforming and so on, that the autoregressive moving averages come out so badly. I wonder whether it might be partly due to the authors not using the backwards forecasting approach to obtain the initial errors. — W.G. Gilchrist

I find it hard to believe that Box-Jenkins, if properly applied, can actually be worse than so many of the simple methods. — Chris Chatfield

Then Chatfield got personal:

Why do empirical studies sometimes give different answers? It may depend on the selected sample of time series, but I suspect it is more likely to depend on the skill of the analyst . . . these authors are more at home with simple procedures than with Box-Jenkins. — Chris Chatfield

Again, Makridakis & Hibon responded:

Dr Chatfield expresses some personal views about the first author . . . It might be useful for Dr Chatfield to read some of the psychological literature quoted in the main paper, and he can then learn a little more about biases and how they affect prior probabilities.

Snap!

Hyndman continues:

In response to the hostility and charge of incompetence, Makridakis & Hibon followed up with a new competition involving 1001 series. This time, anyone could submit forecasts, making this the first true forecasting competition as far as I am aware. They also used multiple forecast measures to determine the most accurate method.

The 1001 time series were taken from demography, industry and economics, and ranged in length between 9 and 132 observations. All the data were either non-seasonal (e.g., annual), quarterly or monthly. Curiously, all the data were positive, which made it possible to compute mean absolute percentage errors, but was not really reflective of the population of real data.
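A quick aside on why the positivity matters: the mean absolute percentage error divides each forecast error by the corresponding actual value, so it is undefined at zero and misleading when the data are near zero or can change sign. Here is a minimal sketch in Python, with made-up numbers rather than data from the competition:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100 * np.mean(np.abs((actual - forecast) / actual))

# Strictly positive series: MAPE is well defined.
print(mape([100, 120, 130], [110, 115, 128]))   # about 5.2

# A series that touches zero breaks the measure entirely.
print(mape([5, 0, -5], [4, 1, -4]))             # inf (division by zero)
```

Continuing with Hyndman's summary of the results: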

The results of their 1979 paper were largely confirmed. The four main findings (taken from Makridakis & Hibon, 2000) were:

1. Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.

2. The relative ranking of the performance of the various methods varies according to the accuracy measure being used.

3. The accuracy when various methods are being combined outperforms, on average, the individual methods being combined and does very well in comparison to other methods.

4. The accuracy of the various methods depends upon the length of the forecasting horizon involved.
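Finding 3 is worth unpacking: "combining" here just means averaging the point forecasts produced by several methods. Below is a minimal sketch of the mechanics in Python, using three simple benchmarks (naive, mean, drift) and an equal-weight average on a made-up series. The M-competition finding is about averages across many series, so the combination need not win on any single toy example like this one.

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up series: linear trend plus noise, with the last 6 points held out.
y = 100 + 0.8 * np.arange(36) + rng.normal(0, 4, 36)
train, test = y[:30], y[30:]
h = len(test)

# Three simple benchmark forecasts.
naive = np.repeat(train[-1], h)                # last observed value
mean_fc = np.repeat(train.mean(), h)           # overall mean
drift = train[-1] + np.arange(1, h + 1) * (train[-1] - train[0]) / (len(train) - 1)

combo = (naive + mean_fc + drift) / 3          # equal-weight combination

mae = lambda f: np.mean(np.abs(test - f))
for name, f in [("naive", naive), ("mean", mean_fc),
                ("drift", drift), ("combo", combo)]:
    print(f"{name:6s} MAE = {mae(f):5.2f}")
```

Back to Hyndman: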

The paper describing the competition (Makridakis et al, JF, 1982) had a profound effect on forecasting research. It caused researchers to:

– focus attention on what models produced good forecasts, rather than on the mathematical properties of those models;

– consider how to automate forecasting methods;

– be aware of the dangers of over-fitting;

– treat forecasting as a different problem from time series analysis.

These now seem like common sense to forecasters, but they were revolutionary ideas in 1982.

I don’t quite understand the bit about treating forecasting as a different problem from time series analysis. They sound like the same thing to me!

In any case, both of these posts by Hyndman were interesting: lots of stuff there that I haven’t ever really thought hard about.

8 thoughts on “History of time series forecasting competitions”

  1. “Clarke’s first law: When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.”

    Yup. Whole lotta that going on.

  2. Makridakis, Spiliotis and Assimakopoulos have a 2018 paper comparing the relative performance of ML vs statistical time series forecasting methods.

    Abstract
    Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined. Moreover, we observed that their computational requirements are considerably greater than those of statistical methods. The paper discusses the results, explains why the accuracy of ML models is below that of statistical ones and proposes some possible ways forward. The empirical results found in our research stress the need for objective and unbiased ways to test the performance of forecasting methods that can be achieved through sizable and open competitions allowing meaningful comparisons and definite conclusions.

    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0194889

  3. This is a timely post: I was about to do a blog about time series approaches and whether traditional ones like ARIMA are ready for retirement…and, if so, what should be the new standard. I’ve switched to state space / hidden variables models that give more flexibility and perform better in my applications, and I’ve wondered about the universality of that experience. Also, at least the way I implement these in Stan they are very slow and I’ve wondered if there are faster methods that are as good or better. Maybe this contest will help figure that out.

    • Phil:

      If you have some models you want to fit in Stan and they’re slow, you should post a question on the Stan forums, and if there’s no response there you could post something here on the blog.

      And, yes, it’s my impression that Arima has been way oversold.

    • IMHO Bayesian state space models are the way to go as long as you have some kind of meaningful knowledge of what the right hidden variables are and what information might inform the values for the hidden vars.

      Computationally they can be rough, though. Shoot me an email, Phil, and I’d be happy to talk about it.

    • Hi Phil, I wonder what the main performance issues with SSMs were that you experienced. I have been trying to implement several SSMs in Stan, and most of the bottleneck was in the Kalman filter (mostly for multivariate models, as the matrices grow quadratically). But in general, sampling efficiency was pretty high.
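For readers who haven't run into the computation being discussed in this thread: the Kalman filter is a per-time-step recursion, and in the multivariate case the scalar variance below becomes a d-by-d covariance matrix, so each update involves matrix products whose cost grows quickly with the state dimension; that is the bottleneck the commenter is describing. Here is a minimal univariate local-level filter in Python as a reference point (this is not the commenters' Stan code, and the noise scales are made up):

```python
import numpy as np

def local_level_filter(y, sigma_obs, sigma_state, m0=0.0, p0=1e6):
    """Kalman filter for the local level model:
        y_t  = mu_t + eps_t,        eps_t ~ N(0, sigma_obs^2)
        mu_t = mu_{t-1} + eta_t,    eta_t ~ N(0, sigma_state^2)
    Returns the filtered state means and the log-likelihood."""
    m, p = m0, p0                      # prior mean and variance for mu_1
    means, loglik = [], 0.0
    for yt in y:
        # Prediction: random-walk state, so the mean carries over.
        p_pred = p + sigma_state ** 2
        # Update.
        f = p_pred + sigma_obs ** 2    # one-step forecast variance
        k = p_pred / f                 # Kalman gain
        v = yt - m                     # one-step forecast error
        m = m + k * v
        p = (1 - k) * p_pred
        means.append(m)
        loglik += -0.5 * (np.log(2 * np.pi * f) + v ** 2 / f)
    return np.array(means), loglik

# Made-up data: a slowly drifting level observed with noise.
rng = np.random.default_rng(0)
level = np.cumsum(rng.normal(0, 0.3, 200))
y = level + rng.normal(0, 1.0, 200)
filtered, ll = local_level_filter(y, sigma_obs=1.0, sigma_state=0.3)
print(filtered[-5:], round(ll, 1))
```

If the states are marginalized out this way in Stan, the recursion has to run inside every log-probability (and gradient) evaluation, which is part of why these models can feel slow even when the sampler itself behaves well.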

  4. It does not appear that there will be multiple submissions allowed over time, with a leaderboard tracking progress (as there is, for example, in a Kaggle competition). This is unfortunate, as this element of a competition seems to lead to much better results.

    Why do empirical studies sometimes give different answers?

    I don’t understand this discussion. The cross-validation error is an estimator with its own bias and variance, and the variance is typically quite high. For any finite dataset and loss function, there exists an estimator that will dominate the true “best estimator”. This is true even for Koenker’s “quantile loss”, which they use for their uncertainty quantification competitions. This is easy to verify both theoretically and in simulations. If you forward-simulate data, the best cross-validated models will score better than the model you KNOW to be optimal.

    So it seems like the top end of any competitive setting is guaranteed to give unrealistic performance estimates that don’t hold long term. Methods from competition results are rarely, if ever, put into production. So I don’t think you can learn much from studying the top performers in these competitions, though maybe you can learn something from the larger separation within the bulk of submissions.
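The selection effect described in the last comment is easy to see in a small simulation: generate data from a known process, score many candidate forecasts on a holdout sample, and the best-scoring candidate will usually beat the measured score of the forecast you know to be optimal, simply because you are taking a minimum over noisy estimates. Here is a minimal sketch under made-up assumptions (iid Gaussian data, with constant forecasts standing in for competition entries):

```python
import numpy as np

rng = np.random.default_rng(1)
n_series, horizon, n_entries = 200, 12, 50

wins = 0
for _ in range(n_series):
    # True data-generating process: iid N(0, 1), so the optimal point
    # forecast under squared error is exactly 0 at every horizon.
    test = rng.normal(0, 1, horizon)
    optimal_score = np.mean(test ** 2)          # holdout score of the known-best forecast

    # "Entries": constant forecasts jittered around the optimum.
    entries = rng.normal(0, 0.3, n_entries)
    scores = [np.mean((test - c) ** 2) for c in entries]

    wins += min(scores) < optimal_score         # does some entry beat the true optimum?

print(f"best entry beat the known-optimal forecast on {wins} of {n_series} series")
```

The winning constant is not actually a better forecaster than zero; its apparent edge is pure selection on noise, which is the commenter's point about the top of a leaderboard.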
