More data beats better algorithms

Boris sent this along. I can’t comment on the examples used there, but I agree with the general point that it’s good to use more data. To get back to algorithms, what I’d say is that one important feature of a good algorithm is that it allows you to use more data. Traditional statistical methods based on independent, identically distributed observations can have difficulty incorporating diverse data, whereas more modern methods offer more ways to bring data in.

3 thoughts on “More data beats better algorithms”

  1. I definitely think the "more modern methods" could be used more often in applied research.

    I have come across some researchers who dismiss algorithmic methods as "exploratory," presumably meaning that algorithmic methods cannot be used for controlled research of a confirmatory nature. I think there is a difference, though, between exploratory statistical methods (those using computer speed to, e.g., search for predictive relationships) and exploratory research projects (those with no clear hypotheses). One could use algorithmic methods to check whether one's "confirmatory" research hypotheses hold in a particular dataset.

  2. "Data drives out analysis."

    The purpose of complicated analyses is often to produce a number that is not directly observable. For example, you might want to estimate the probability that a particular person has been reliable in paying his rent in the past, but not have access to that data. So, you could use a model.

    But if you can get actual data, the complex analysis isn't needed — nor are the expensive analysts.

    On the flip side, if you lose a data source the analysis has to stretch itself farther to replace that data. This will seldom produce as good a result.

  3. I thought this comment from one of the (competition-leading) BellKor team members was at least as interesting. Basically, he says "[O]ur experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset."
