What is the role of statistics in a machine-learning world?

I just happened to come across this quote from Dan Simpson:

When the signal-to-noise ratio is high, modern machine learning methods trounce classical statistical methods when it comes to prediction. The role of statistics in this case is really to boost the signal-to-noise ratio through the understanding of things like experimental design.

12 thoughts on “What is the role of statistics in a machine-learning world?”

    • Not sure what you mean by dominate here, but I really like the work of Vikas Raykar and crew on adding noisy measurement models to training. It can work with any kind of ML model. Most ML treats a corpus as a gold standard, whereas statisticians know there’s often important measurement error and that accounting for it can in some cases be a huge help.

      Also, you really want to be using statistics to evaluate systems even if they’re not built on probabilistic principles.

      The use of the Penn Treebank in natural language processing has been a joke to anyone who knows anything about stats for 20+ years. Everyone’s required to train and test on the exact same sections, and papers have been reporting increases in precision and recall on that test section that are dwarfed by cross-validation error across the other sections of the treebank (a toy sketch of this appears at the end of this comment). Classic overfitting and noise chasing, which persists only because practitioners don’t understand statistics.

      What I’d recommend instead is Chris Manning’s work (from an old CICLING paper) on fixing the annotation errors in the treebank. This is also related to the noisy measurement models mentioned above, but it actually fixes the errors. It’s not considered kosher to do this, so the results don’t “count” in any sense. It’s not like anyone took his advice and started using his improved Treebank. Nope, back to testing and training on the exact same noisy data.
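
      A minimal sketch of the frozen-test-split problem, using scikit-learn on a synthetic dataset: the data, the two logistic-regression “systems,” and the split sizes here are stand-in assumptions, not the actual treebank experiments. The point is just to compare a gain measured on one fixed split against the fold-to-fold spread you get from cross-validation.

      ```python
      # Toy illustration: a small "improvement" on one frozen test split vs. the
      # variation you see across cross-validation folds. Synthetic data stands in
      # for the treebank sections; nothing here reproduces the actual experiments.
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score, train_test_split

      X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

      # Everyone trains and tests on the exact same split.
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

      baseline = LogisticRegression(C=1.0, max_iter=1000)
      improved = LogisticRegression(C=0.5, max_iter=1000)  # the "new" system

      acc_base = baseline.fit(X_tr, y_tr).score(X_te, y_te)
      acc_new = improved.fit(X_tr, y_tr).score(X_te, y_te)
      print(f"gain on the frozen split: {acc_new - acc_base:+.4f}")

      # Cross-validation shows how much the score moves around just from
      # evaluating on a different slice of the data.
      cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
      print(f"fold-to-fold spread (std): {cv_scores.std():.4f}")

      # If the frozen-split gain is smaller than that spread, it is hard to argue
      # the improvement is anything more than noise chasing.
      ```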

      • It’s not quite the same thing, but one thing that’s done in ML on images these days is data augmentation, where instead of just doing a training step on an image, you’ll pre-process the image to do things like tweak the brightness, crop part of it out, or otherwise mess with it.
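
        A rough sketch of that kind of augmentation, in plain NumPy with no particular ML framework assumed; the image size, brightness range, and crop size are made-up illustrative values.

        ```python
        # Random brightness jitter and random crops applied to an image before
        # each training step; every pass sees a slightly different version.
        import numpy as np

        rng = np.random.default_rng(0)

        def random_brightness(img, max_delta=0.2):
            """Shift all pixel intensities by a random amount in [-max_delta, max_delta]."""
            return np.clip(img + rng.uniform(-max_delta, max_delta), 0.0, 1.0)

        def random_crop(img, crop_h, crop_w):
            """Cut a random crop_h x crop_w window out of an H x W x C image."""
            h, w = img.shape[:2]
            top = rng.integers(0, h - crop_h + 1)
            left = rng.integers(0, w - crop_w + 1)
            return img[top:top + crop_h, left:left + crop_w]

        # Toy "image": 32x32 RGB with values in [0, 1].
        image = rng.random((32, 32, 3))

        augmented = random_crop(random_brightness(image), 28, 28)
        print(augmented.shape)  # (28, 28, 3)
        ```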

  1. The big question, though, is how do we know when the signal-to-noise ratio is truly high? It may appear so based on our assessment, but even with a hold-out test sample, we still don’t know if either the signal or the noise is representative of the broader reality we are trying to model. The superior machine learning methods may have done a great job of overfitting the noise or bias of our sample. This is particularly a problem when the underlying phenomena may be changing. How much can we really expect to know about the future?

    • “How much can we really expect to know about the future?”

      Ah, the age-old question — but how can we know if we can’t know the future? ;~)

  2. The signal-to-noise ratio has been and still is used in Communication Engineering to aid in designing receivers. I learnt the theory a long time ago, but basically the statistics used were built around probability models, averages, and deviations.
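
     A small sketch of how those averages and deviations turn into a signal-to-noise ratio: estimate signal power and noise variance from samples of a noisy sinusoid. The waveform, noise level, and sample count are made-up illustrative values, not anything from a real receiver design.

     ```python
     # Estimate SNR for a known signal observed with additive noise:
     # signal power from the average of the squared signal, noise power
     # from the variance (deviation) of what is left after subtracting it.
     import numpy as np

     rng = np.random.default_rng(1)

     t = np.linspace(0.0, 1.0, 10_000)
     signal = np.sin(2 * np.pi * 50 * t)          # transmitted waveform
     noise = rng.normal(scale=0.3, size=t.shape)  # additive receiver noise
     received = signal + noise

     signal_power = np.mean(signal ** 2)
     noise_power = np.var(received - signal)
     snr_db = 10 * np.log10(signal_power / noise_power)
     print(f"estimated SNR: {snr_db:.1f} dB")
     ```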
