Excellent performance of machine learning algorithms in a major time series competition. And then what is the role of statistical modeling? Here’s the answer:

Kevin Gray writes:

Perhaps you’ve seen this but, in case not, it may be of interest.

Here’s some commentary by Rob Hyndman and a few others.

Curtains for Statistics? I’m struggling to rationalize this.

The article that Gray is pointing to is “The M5 Accuracy competition: Results, findings and conclusions,” by Spyros Makridakis, Evangelos Spiliotis, and Vassilis Assimakopoulos, and here’s a key paragraph from its conclusions:

The exceptional performance of statistical methods versus ML [machine learning] ones found by Makridakis et al. (2018b), as well as in the early Kaggle competitions (Bojer & Meldgaard, 2020), first shifted towards ML and statistical methods in the M4 competition, and then to exclusively ML methods like in the Kaggle competitions which started in 2018 and the M5 described in this paper. It will be of great interest if ML methods continue to dominate statistical ones in future competitions, particularly for other types of data that are not exclusively related to hierarchical, retail sales applications.

I wouldn’t phrase it quite like this, as I consider the machine learning methods to be statistical too: they’re nonparametric, but they’re still statistical forecasting methods.

I see several places where probability modeling is relevant here:

1. As noted above, machine learning predictions correspond to nonparametric statistical models, and the moment we try to quantify prediction error they become probability models. (See the first sketch after this list.)

2. Data and prediction problems typically have multilevel structure (data in different countries, different years, different product lines, etc.), which creates a need for partial pooling across levels. The alternative to partial pooling is sometimes-complete-pooling and sometimes-no-pooling, which is just a crude form of partial pooling. We’ve discussed this issue many times over the years, whether in the context of time-series cross-sectional studies in political science and economics, analysis of public opinion, or transportability in causal inference. (See the second sketch after this list.)

3. As we say in Regression and Other Stories, the three goals of statistics are generalization from sample to population, from treatment to control group, and from observed data to underlying constructs of interest. These require various versions of poststratification and latent-variable modeling.

4. Finally, the performance of the different predictions is itself evaluated using traditional statistical approaches. This would start with averages but soon move to probability modeling (to convert variation in the data into uncertainty about expected performance) and poststratification (to assess performance for a population of problems of interest). The third sketch below illustrates both steps.
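
To make point 1 concrete, here’s a minimal sketch. The series and the naive persistence forecaster are placeholders of my own, not anything from the competition; the point is only that attaching an error distribution to a black-box point forecast is already probability modeling:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy series: trend plus noise (a stand-in for real data).
    y = 0.5 * np.arange(200) + rng.normal(0, 5, size=200)

    # Black-box point forecaster: one-step-ahead persistence.
    def forecast(history):
        return history[-1]

    # Residuals on a holdout window turn the point forecaster
    # into a probability model with empirical prediction intervals.
    resid = np.array([y[t] - forecast(y[:t]) for t in range(100, 200)])

    point = forecast(y)  # next-step point forecast
    lo, hi = point + np.quantile(resid, [0.05, 0.95])  # empirical 90% interval
    print(f"forecast {point:.1f}, 90% interval ({lo:.1f}, {hi:.1f})")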
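
For point 2, a minimal partial-pooling sketch. The data are simulated and the moment-based estimate of the between-group variance is deliberately crude; a real analysis would fit a multilevel model:

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy data: 8 groups (think product lines) with unequal sample sizes.
    sizes = [3, 5, 8, 12, 20, 30, 50, 80]
    truth = rng.normal(0, 1, size=len(sizes))
    groups = [rng.normal(mu, 2, size=n) for mu, n in zip(truth, sizes)]

    ybar = np.array([g.mean() for g in groups])            # no-pooling estimates
    se2 = np.array([g.var(ddof=1) / len(g) for g in groups])
    mu_hat = ybar.mean()                                   # complete-pooling estimate
    tau2 = max(ybar.var(ddof=1) - se2.mean(), 0.01)        # crude between-group variance

    # Precision-weighted compromise: noisier groups shrink more toward mu_hat.
    partial = (ybar / se2 + mu_hat / tau2) / (1 / se2 + 1 / tau2)
    for j, (raw, pp) in enumerate(zip(ybar, partial)):
        print(f"group {j} (n={sizes[j]}): raw {raw:+.2f} -> pooled {pp:+.2f}")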
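
And for points 3 and 4, a sketch of the evaluation step: an average, then a standard error, then poststratification to a target population of problems. The strata and score distributions are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(2)

    # Invented per-series score differences (method A minus method B)
    # across two strata of a forecasting competition.
    n = 500
    strata = rng.choice(["fast-moving", "intermittent"], size=n, p=[0.8, 0.2])
    diff = np.where(strata == "fast-moving",
                    rng.normal(-0.02, 0.15, size=n),   # A slightly better here
                    rng.normal(+0.05, 0.25, size=n))   # B better here

    # Step 1: average, with a standard error that converts variation
    # across series into uncertainty about expected performance.
    mean, se = diff.mean(), diff.std(ddof=1) / np.sqrt(n)
    print(f"competition sample: {mean:+.3f} +/- {se:.3f}")

    # Step 2: poststratify to a target population with different strata shares.
    target = {"fast-moving": 0.3, "intermittent": 0.7}
    post = sum(w * diff[strata == s].mean() for s, w in target.items())
    print(f"poststratified to target population: {post:+.3f}")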

7 thoughts on “Excellent performance of machine learning algorithms in a major time series competition. And then what is the role of statistical modeling? Here’s the answer:”

  1. The future is now! https://i.imgur.com/JbO83WT.png

    (just kidding, it looks like Statistics still leads the pack; I’d just rescaled its axis: https://trends.google.com/trends/explore?date=all&geo=US&q=Data%20Science,Statistics,Machine%20Learning, and Statistics also isn’t on as downward a trajectory when specified as a ‘Discipline’ https://trends.google.com/trends/explore?date=all&geo=US&q=%2Fm%2F0jt3_q3,%2Fm%2F06mnr,%2Fm%2F01hyh_)

    (I also wonder where the seasonality in the search results comes from. An election cycle? And I wonder about the earliest pushback against the divisions between these overlapping magisteria; e.g., I remember seeing this post some time back: https://magazine.amstat.org/blog/2013/07/01/datascience/)

  2. One field where I’m still particularly unimpressed with machine learning methods (statistical learning is perhaps a better name) is economic forecasting.

    There are tons and tons of papers applying machine learning methods to forecast macroeconomic variables (GDP, inflation, etc.). A lot report good performance vs. more traditional methods, but on closer inspection you see that this is *highly* dependent on the particular series, the country, the time period analyzed, etc.

    I’ve analyzed a lot of the series used in some papers and found absolutely no evidence of superiority of machine learning methods vs. traditional time series econometrics/statistics models (ARIMA, dynamic factor models, etc.).

    One thing most of these papers do not talk about is the *huge* number of modeling decisions that need to be made when applying some of these methods.
    Plus, cross-validation for time series is a *mess*. Whether you use some type of blocked cross-validation or information criteria, that introduces one more layer of possible modeling decisions.

    Just a simple example of a popular choice now for financial/macroeconomic series: the adaptive LASSO. Which first-step estimator did you choose: OLS, standard LASSO, ridge regression, elastic net? Then, in the second step, how did you use the first-step coefficients to set the penalty weights?
    What about the choice of hyperparameters? Was it some type of blocked cross-validation, and what block size did you choose? Or maybe you used an information criterion? Which one, AIC or BIC? (I sketch these choices at the end of this comment.)
    Most papers will just state their choices (in the good case; a lot of them don’t even make all the choices clear) and say theirs was the best-performing model. Which of course raises the suspicion that they tried a bunch of models and picked the best one to present.

    Time series is frequently a lot harder than the standard “iid case” in which most machine/statistical learning methods were developed.
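
    Here’s a rough sketch of what I mean, with made-up data and arbitrary choices throughout (ridge first step, a four-value alpha grid, five rolling-origin splits); change any of these and you can get a different “best” model:

        import numpy as np
        from sklearn.linear_model import Ridge, Lasso
        from sklearn.model_selection import TimeSeriesSplit

        rng = np.random.default_rng(3)

        # Toy AR(2) series; lags 1..6 as candidate predictors.
        T, p = 300, 6
        y = np.zeros(T)
        for t in range(2, T):
            y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
        X = np.column_stack([y[p - k - 1:T - k - 1] for k in range(p)])
        target = y[p:]

        # Choice 1: first-step estimator (could be OLS, LASSO, elastic net...).
        ridge = Ridge(alpha=1.0).fit(X, target)
        w = 1.0 / (np.abs(ridge.coef_) + 1e-6)  # adaptive penalty weights

        # Choice 2: tune the penalty by rolling-origin (blocked) CV
        # rather than AIC/BIC -- yet another modeling decision.
        best_alpha, best_mse = None, np.inf
        for alpha in [0.001, 0.01, 0.1, 1.0]:
            errs = []
            for tr, te in TimeSeriesSplit(n_splits=5).split(X):
                # Rescaling columns by 1/w makes a plain LASSO adaptive.
                m = Lasso(alpha=alpha, max_iter=10_000).fit(X[tr] / w, target[tr])
                errs.append(np.mean((m.predict(X[te] / w) - target[te]) ** 2))
            if np.mean(errs) < best_mse:
                best_alpha, best_mse = alpha, np.mean(errs)

        print(f"chosen alpha: {best_alpha}, CV MSE: {best_mse:.3f}")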

  3. I’ve never really understood these competitions: doesn’t the high sampling variance of the cross-validation estimator swamp the results? (A toy calculation of what I mean is below.) It looks like most of the top competitors used some kind of model average of lightgbm and recurrent neural networks, and I find it hard to believe there’s a meaningful difference between them.
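
    A toy calculation of what I mean, with made-up numbers:

        import numpy as np

        rng = np.random.default_rng(4)

        # Paired per-series score differences between two near-identical
        # model averages: a tiny true gap buried in series-to-series noise.
        n_series = 1000
        true_gap, noise_sd = 0.002, 0.10
        diff = rng.normal(true_gap, noise_sd, size=n_series)

        se = diff.std(ddof=1) / np.sqrt(n_series)
        print(f"estimated gap {diff.mean():+.4f}, standard error {se:.4f}")
        # The standard error (~0.003) exceeds the true gap (0.002), so the
        # leaderboard ordering of such methods is largely noise.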

  4. I was looking at this the other week. First, there was also an uncertainty competition looking at the confidence intervals: http://www.researchgate.net/publication/346493740

    A purely statistical method was very competitive. Description of the method here: https://arxiv.org/abs/2111.14721

    While I do believe machine learning algorithms should win on prediction tasks with lots of data, I wonder how much their dominance here was due to selection bias. Hosting the competition on Kaggle was bound to draw a lot more machine learning folks than statisticians.

  5. One academic very familiar with time series analysis commented “I was surprised by the ubiquity of lightGBM and its success against deep models. This comes at a good time because I’m teaching about various tree methods and will encourage students to try this particular implementation of boosting. Interestingly, boosting came out of the Stat literature, so that’s good, I guess. On a broader level, Statistics is a bit under siege these days from the algorithmic machine learners, but I think it is going to make us a better discipline in the end.” According to another academic (Rob Hyndman) and my own impressions, few statisticians take part in these competitions.

  6. The M5 competition assessed model performance against a huge number of time series. It was looking for models that work in general with no specific tuning. Not all problems look remotely like this. Generative modelling, for example, is a tool for tackling a very different problem: how to extrapolate a particular time series about which you have a lot of specific structural information, possibly on the basis of very little data.

    You can’t build generative models across 10000 separate time series, but if you are only interested in two or three time series, then you can likely do a lot better than lightGBM.

    I don’t see how the success of ML models in the M5 competition renders other statistical approaches problematical.
