What is the role of statistics in a machine-learning world?

I just happened to come across this quote from Dan Simpson:

When the signal-to-noise ratio is high, modern machine learning methods trounce classical statistical methods when it comes to prediction. The role of statistics in this case is really to boost the signal-to-noise ratio through the understanding of things like experimental design.


  1. G Jones says:

    Is there a nice war story or example of experimental design boosting signal-to-noise ratio so that machine learning methods then dominate?

    • Not sure what you mean by dominate here, but I really like the work of Vikas Raykar and crew on adding noisy measurement models to training. It can work with any kind of ML model. Most ML treats a corpus as a gold standard, whereas statisticians know there’s often important measurement error and that accounting for it can in some cases be a huge help.

      Also, you really want to be using statistics to evaluate systems even if they’re not built based on probabilistic principles.

      The use of the Penn Treebank in natural language processing has been a joke to anyone who knows anything about stats for 20+ years. Everyone’s required to train and test on the exact same sections, and papers have been reporting increases in precision and recall on that section that are dwarfed by cross-validation error across the other sections of the treebank. Classic overfitting and noise chasing, which persists only because practitioners don’t understand statistics.

      What I’d recommend instead is Chris Manning’s work (from an old CICLING paper) on fixing the errors in annotation in the treebank. This is also related to the noisy measurement models mentioned above, but it actually fixes the errors. It’s not considered kosher to do this, so the results don’t “count” in any sense. It’s not like anyone took his advice and started using his improved Treebank. Nope, back to testing and training on the exact same noisy data.
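      The fixed-test-section problem above can be illustrated with a toy simulation (this is a hedged sketch, not the Treebank itself: the section count, accuracies, and sizes are made up). Two models with nearly identical true accuracy are compared once on a single fixed section and then across all sections; the apparent gap on one section can easily exceed the mean gap, while the section-to-section spread shows how noisy any single-split comparison is.

```python
import random

random.seed(0)

def simulate_section(true_acc, n=500):
    """Simulate per-item correctness (1 = correct) for one corpus section."""
    return [1 if random.random() < true_acc else 0 for _ in range(n)]

def acc(items):
    return sum(items) / len(items)

# Ten hypothetical "sections"; model B is only 0.2 points better in truth.
sections_a = [simulate_section(0.900) for _ in range(10)]
sections_b = [simulate_section(0.902) for _ in range(10)]

# Fixed-split evaluation: everyone tests on section 0 only.
gap_fixed = acc(sections_b[0]) - acc(sections_a[0])

# Cross-section evaluation: per-section gaps and their spread.
gaps = [acc(b) - acc(a) for a, b in zip(sections_a, sections_b)]
mean_gap = sum(gaps) / len(gaps)
sd_gap = (sum((g - mean_gap) ** 2 for g in gaps) / (len(gaps) - 1)) ** 0.5

print(f"gap on the single fixed section: {gap_fixed:+.3f}")
print(f"mean gap across sections: {mean_gap:+.3f} (sd {sd_gap:.3f})")
```

      When the standard deviation across sections is comparable to or larger than the reported improvement, the "gain" on the fixed section is indistinguishable from noise.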

  2. Marc Intrater says:

    The big question, though, is how do we know when the signal-to-noise ratio is truly high? It may appear so based on our assessment, but even with a hold-out test sample, we still don’t know whether either the signal or the noise is representative of the broader reality we are trying to model. The superior machine learning methods may have done a great job of overfitting the noise or bias of our sample. This is particularly a problem when the underlying phenomena may be changing. How much can we really expect to know about the future?

  3. Terms like ‘signal’ and ‘noise’ appear to be holdovers from the intelligence world, I am speculating. I wonder whether they are appropriate to use in AI and machine learning. They seem anachronistic.

  4. malcolm says:

    Signal to noise has been and still is used in communication engineering to aid in designing receivers. I learnt the theory a long time ago, but basically the statistics used were around probability models, averages, and deviations.
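    The classical definition from communications is signal power divided by noise power, often quoted in decibels. A minimal sketch (the sine signal, noise level, and sample count here are arbitrary choices for illustration):

```python
import math
import random

random.seed(1)

# A clean sinusoidal signal plus additive Gaussian noise.
signal = [math.sin(2 * math.pi * t / 50) for t in range(1000)]
noise = [random.gauss(0, 0.3) for _ in range(1000)]
received = [s + n for s, n in zip(signal, noise)]

def power(x):
    """Mean squared amplitude of a sampled waveform."""
    return sum(v * v for v in x) / len(x)

# SNR = signal power / noise power; 10*log10 converts the ratio to dB.
snr = power(signal) / power(noise)
snr_db = 10 * math.log10(snr)
print(f"SNR is roughly {snr:.2f} ({snr_db:.1f} dB)")
```

    With a unit-amplitude sine (power 0.5) and noise of standard deviation 0.3 (power about 0.09), the ratio comes out somewhere around 5 to 6, i.e. a few dB; receiver design is largely about keeping that ratio high enough to decode reliably.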

  5. Matti Heino says:

    The M4 competition paper came out just a few months ago; they compared machine learning forecasting methods with statistical ones and came to the conclusion that the former need the latter to work.

  6. jimmy says:

    What qualifies as a modern machine learning method?
