“Find the best algorithm (program) for your dataset.”

Piero Foscari writes:

Maybe you know about this already, but I found it amazingly brutal; while looking for some reproducible research resources I stumbled onto the following at (which would be nice if done properly, at least as a standardization attempt):

Find the best algorithm (program) for your dataset.
Upload your dataset and run existing programs on it to see which one works best.
No mention of proper procedures in the FAQ summary, but I did not dig deep.

In the financial community data snooping has been a well-known problem for at least 20 years, exacerbated by automated model searches, so it’s amusing that something like that can still run that openly in the related and supposedly less naive ML community.

My reply: Some people do seem to think that there’s some sort of magic to a procedure that minimizes cross-validation error.
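
To make the data-snooping concern concrete, here is a minimal simulation sketch, not anything taken from the site in question. It assumes a hypothetical pool of 50 "algorithms" that all have the same true accuracy of 0.7 and are scored independently on 200 held-out points (real algorithms evaluated on one dataset would have correlated scores). The function name `simulate_selection` and all of its parameters are invented for illustration.

```python
import random
import statistics


def simulate_selection(n_algorithms=50, n_test=200, true_accuracy=0.7,
                       n_trials=200, seed=0):
    """Compare the winner's reported score with a fresh-data score.

    Assumption: every algorithm has the same true accuracy, so any
    difference between their held-out scores is pure noise.
    """
    rng = random.Random(seed)

    def noisy_score():
        # Fraction of n_test held-out points classified correctly.
        hits = sum(rng.random() < true_accuracy for _ in range(n_test))
        return hits / n_test

    best_scores = []
    fresh_scores = []
    for _ in range(n_trials):
        # Score every algorithm once and keep the best-looking one.
        scores = [noisy_score() for _ in range(n_algorithms)]
        best_scores.append(max(scores))
        # Re-evaluate the "winner" on new data; its true accuracy is unchanged.
        fresh_scores.append(noisy_score())

    return statistics.mean(best_scores), statistics.mean(fresh_scores)


if __name__ == "__main__":
    best, fresh = simulate_selection()
    print(f"mean score of selected winner: {best:.3f}")
    print(f"mean score on fresh data:      {fresh:.3f}")
```

Under these assumptions the winner's reported score is biased upward relative to its score on fresh data, even though no algorithm is actually better than any other. The gap grows with the number of candidates tried and shrinks as the evaluation set gets larger.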


  1. Pierre Baptiste says:

    What is your opinion on these procedures? Is it possible that the multiple comparisons issue can be solved, perhaps by devising a correction?

    Note: the site seems to be down at this time.

  2. Dave says:

    This post must have been written a long time ago… The site doesn’t appear to be running any more.

  3. Peter Nelson says:

    If your goal is predictive accuracy (rather than an interpretable model or conclusions about causation), this is a fine thing to do. Minimizing cross-validation error isn’t magical, but it does offer a (noisy) estimate for generalization error.

    • Andrew says:


      Yes, minimizing cross-validation error can be fine, and it can be even better to include prior information to regularize the estimate of the tuning parameter. I don’t object to people minimizing cross-validation error; it’s just my experience that a lot of people seem to forget that the cross-validation estimate is itself noisy. The noisiness of the cross-validation error (and also, for that matter, the noisiness of estimated hyperparameters in Bayesian hierarchical modeling with flat priors on the hyperparameters) becomes clear if you fit the same model to multiple datasets, but people don’t always do this; there’s a tendency to just take the estimated hyperparameter and run with it.

      • Dean Eckles says:

        The main problem in a case like this is a regression towards the mean.

        Note that choosing among different methods, rather than just selecting the value of a single continuous tuning parameter, is quite a high-dimensional choice, and it can be, as Andrew mentions, a very noisy choice.

      • Peter Nelson says:

        (Responding to both Dean and Andrew): Yes, it’s unfortunate that people treat comparisons of cross-validation error more like a sporting competition and less like a noisy estimate. This is especially annoying in the context of ML papers which justify new models (or new variations on existing models) by eking out a narrow “victory” on a particular data set.

        However, there are plenty of circumstances (in industry) where you’re just trying to get a good black-box predictor, and throwing things at the wall to see what sticks is a good strategy. If the project is important enough to justify collecting much more data and running experiments to conclusively demonstrate small advantages between models, then you’ll do that. If it’s not, then the “true” ranking of several models that gave similar cross-validation or testing scores isn’t something you really care about.

        • Andrew says:


          What I’m concerned about is not so much the risk of picking a suboptimal model, but rather the problems that can arise when I use a noisy point estimate for a hyperparameter. The problem is if estimates jump around a lot.

          That said, I’m not saying that cross-validation is a bad idea, just that people should remember that the cross-validated estimate, like any estimate, is a random variable.

  4. Harald K says:

    I recall a joke machine learning paper which was about finding the perfect dataset for your algorithm, but my googling skills fail me at the moment.

    • Tom Dietterich says:

      You are referring to LaLoudouana, Doudou, et al. “Data set selection.” Journal of Machine Learning Gossip 1 (2003): 11-19. You can find it on Citeseer.

      • jrc says:

        Thanks for that Tom!

        “The mission of the Journal of Machine Learning Gossip (JMLG) is to provide an archival source of important information that is often discussed informally at conferences but is rarely, if ever, written down. It is designed to be open and accessible for all researchers, but with particular emphasis on providing guidance and advice to talented young researchers so that they do not get disheartened when they encounter the egoism, mutual admiration clubs and power games that unfortunately infect much of science.”

        Under the heading “How To Give a Poster” in the section linked by clicking “guidance and advice”:

        “…always be on the lookout for the opportunity to implement an ‘audience upgrade’. If a big shot with a larger BSI [Big Shot Index] walks by, attempt to grab their attention and restart your spiel from scratch.”
