Roy Mendelssohn points us to this paper by Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos, which begins:

Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined.

and continues:

Moreover, we observed that their computational requirements are considerably greater than those of statistical methods.
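To make "post-sample accuracy across multiple forecasting horizons" concrete, here is a minimal Python sketch of that kind of evaluation. It makes several assumptions: the excerpt above does not name the two accuracy measures, so sMAPE is used here purely as an illustration; the seasonal naive forecaster is a stand-in baseline rather than one of the paper's methods; the 18-step holdout and the toy series are made up; and all function names are hypothetical.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    terms = [
        2.0 * abs(a - f) / (abs(a) + abs(f))
        for a, f in zip(actual, forecast)
        if abs(a) + abs(f) > 0  # skip points where the ratio is undefined
    ]
    return 100.0 * sum(terms) / len(terms) if terms else 0.0


def seasonal_naive(history, horizon, period=12):
    """Forecast by repeating the last observed seasonal cycle."""
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(horizon)]


def holdout_smape_by_horizon(series, max_horizon, forecaster):
    """Hold out the last max_horizon points; score horizons 1..max_horizon."""
    train, test = series[:-max_horizon], series[-max_horizon:]
    forecast = forecaster(train, max_horizon)
    return {h: smape(test[:h], forecast[:h]) for h in range(1, max_horizon + 1)}


# Toy monthly series (made-up numbers): a repeating yearly pattern plus a trend.
pattern = [10, 12, 15, 14, 13, 18, 22, 21, 16, 13, 11, 19]
toy_series = [pattern[t % 12] + 0.1 * t for t in range(96)]

scores = holdout_smape_by_horizon(toy_series, 18, seasonal_naive)
print({h: round(s, 2) for h, s in scores.items() if h in (1, 6, 12, 18)})
```

Any other method can be scored the same way by passing a different forecaster with the same (history, horizon) signature, so competing methods are compared on identical held-out points.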

Mendelssohn writes:

For time series, ML models don’t work as well as traditional methods, at least to date. I have read a little on some of the methods. Some have layers of NNs. The residuals from one layer are passed to the next. I would hate to guess what the “equivalent number of parameters” would be (yes I know these are non-parametric but there has to be a lot of over-fitting going on).

I haven’t looked much at these particular models, but for the general problem of “equivalent number of parameters,” let me point you to this paper and this paper with Aki et al.

I would also point out again this paper (which I believe Dan Simpson mentioned a while back):

https://arxiv.org/abs/1806.06850

which shows that some NNs are essentially polynomial regressions, and polynomial regressions have gone out of favor for many good reasons. There was also a study with medical records (I can’t find the link) in which, I believe, a logistic regression outperformed all the ML algorithms.

There are clearly a lot of good ideas in machine learning algorithms, but in my view there are also a lot of exaggerated claims and a lack of understanding of why they work or don’t work, and of their limitations. Even more, one of the big claims is that understanding doesn’t matter, just the ability to forecast, but if they don’t even do that all that well, then ….

But I give them credit – the ML field comes up with the best names.

Roy: In case you are not aware, there are some similar concerns here: http://statmodeling.stat.columbia.edu/2018/10/30/explainable-ml-versus-interpretable-ml/

I don’t think the comparison is in any way reasonable. Not that it isn’t done well, but rather that I can’t seem to find any choices that they made which are defensible.

First, their choice of statistical methods and data was ridiculous. They chose “the six most accurate methods of the M3 Competition” for the statistical methods, i.e. methods developed explicitly for time series forecasting. Then they chose datasets *used in that same competition* for this comparison. (Talk about overfitting!) The datasets had between 81 and 126 observations – which is fine, but is exactly where we expect statistical methods to overfit less than ML.
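To put a rough number on the overfitting worry for series this short, here is a small Python sketch that counts the raw weights in a one-hidden-layer network and compares them with the number of one-step-ahead training windows a series of 81 or 126 observations can supply. The 12 lagged inputs and 24 hidden units are hypothetical values picked for illustration, not the paper's settings, and a raw weight count is only a crude stand-in for an "equivalent number of parameters", since regularization and early stopping reduce the effective count.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases in a fully connected net with one hidden layer."""
    hidden_layer = (n_inputs + 1) * n_hidden    # +1 for each hidden unit's bias
    output_layer = (n_hidden + 1) * n_outputs   # +1 for each output unit's bias
    return hidden_layer + output_layer


def training_windows(n_obs, n_inputs):
    """One-step-ahead examples available when each uses n_inputs lagged values."""
    return max(n_obs - n_inputs, 0)


# Hypothetical architecture, chosen only for illustration.
N_INPUTS, N_HIDDEN = 12, 24

for n_obs in (81, 126):  # the series lengths mentioned in the comment above
    weights = mlp_weight_count(N_INPUTS, N_HIDDEN)
    windows = training_windows(n_obs, N_INPUTS)
    print(f"{n_obs} observations: {weights} weights, {windows} training windows, "
          f"{weights / windows:.2f} weights per window")
```

Under these assumed settings the network has several times more weights than training examples, whereas a typical exponential smoothing model estimates only a handful of parameters from the same data.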

Second, their choice of ML methods and parameters was at best strange. The number of researcher degrees of freedom is off the charts: the packages used and the choice of parameters (they often used the defaults, but sometimes didn’t; they sometimes justified these departures with citations, but sometimes did not). The preprocessing was chosen using a strange heuristic that assumed all ML models should use the same pre-processing, based on what worked best for one method.

Finally, the ML methods they used were not appropriate ones for the comparison. Not only did they not choose methods that were particularly suited to time-series forecasting, they didn’t even use anything like the current generation of those model types. As an example, for two methods they used the RSNNS package, a wrapper for SNNS. The current version of SNNS seems to be from 1995(!!!!) and doesn’t support standard methods like ReLU for the mlp or rbf methods.

This bit from the Introduction section of the paper is interesting: “The motivation for writing this paper was an article [18] published in Neural Networks in June 2017. The aim of the article was to improve the forecasting accuracy of stock price fluctuations and claimed that “the empirical results show that the proposed model indeed display a good performance in forecasting stock market fluctuations”. In our view, the results seemed extremely accurate for stock market series that are essentially close to random walks so we wanted to replicate the results of the article and emailed the corresponding author asking for information to be able to do so. We got no answer and we, therefore, emailed the Editor-in-Chief of the Journal asking for his help. He suggested contacting the other author to get the required information. We consequently, emailed this author but we never got a reply. Not being able to replicate the result of [18] and not finding research studies comparing ML methods with alternative ones we decided to start the research leading to this paper.”

I also saw this paper a while ago and agree that the choice of models for “Team ML” was bizarre, on top of not being suited for the tiny datasets involved. If anybody is actually trying to do things like that for actual analysis, though, this paper will hopefully stop them.

It does seem like there’s a trend among the young people these days to just throw deep nets at everything, even low-dimensional tabular datasets. The wiser millennials, though, know that you should also try gradient boosting.

Michael:

Maybe every methods paper (mine included) should be required to include a section called, Problems Where Our Method Won’t Work. Not just a Limitations section giving a bunch of hypothetical objections, but a list of the sorts of problems where applying the method will give a bad answer.

+10 – of course (from experience) convincing others of this, even in post-publication review is challenging.

I would like to see this happen in general, for sure. Maybe also some kind of “do you really need deep learning???” page in the documentation for the popular frameworks.

There is definitely “folk knowledge” that neural networks shouldn’t be used for very small datasets.

At the same time, there is a strong incentive to always use deep learning, because then you are doing work in applied artificial intelligence, which means you are very cool and smart.

And like in the paper, it will probably sort of work okay, just not as well as a simpler baseline, and with a big waste of computational resources.

Sort of orthogonal to this: I have been re-reading Pearl’s book on causality (and recently read “The Book of Why”), along with some papers and books from the machine learning literature. Given that these are two very active areas of research, it is interesting how their very foundations are so diametrically opposite.

And as for the limits of the M3 competition, there will be an M4 competition in December open to anyone.

I’m a few hundred pages behind you (about 2/3 of the way through “The Book of Why”). I’ve had the same thought. This is particularly interesting since Pearl comes mostly from the computer science side, and did a lot of the development of the Bayesian Network method. So he’s switched sides.

Sorry, but this is a worthless comparison. At the very least, go win some competitions (e.g., Kaggle) using your traditional methods; then people who know what they are talking about will care.