As we get more data, we can fit more model. But at some point we become so overwhelmed by data that, for computational reasons, we can barely do anything at all. Thus, the curve above could be thought of as the product of two curves: a steadily increasing curve showing the statistical ability to fit more complex models with more data, and a steadily decreasing curve showing the computational feasibility of doing so.
Very neat graph!
What do you think the “accuracy of model” or “predictive power of model” vs sample size curve looks like?
Do you think it monotonically increases (as it theoretically should), or do you think there is a point at which the conceptual complexity of the data starts to overwhelm the modeler and result empirically worse models being created.
Of course we can always throw away data (or do subsampling to make it sound nicer) if we don’t want the model size to go down from some size upwards… (I’m not so sure whether I want to advertise this but *sometimes* at least it is probably better than doing only what the computer can do with the whole data set.)
So the optimal sample size is a little over 10,000. Funny you should say that because I completely agree. I’ve been working with a landscape genomics (definition: http://rsbl.royalsocietypublishing.org/content/6/4/502.full) data set — the response variable is an 8-by-550,000 matrix. Model fitting is slow, and so I fit subsets of the data instead. For some reason I have naturally been drawn to 8-by-10,000 matrices. Don’t really know why…it just seems to make everything work smoothly.
As we get more data we need to do more data cleaning (implying simple models). We also have to study descriptive statistics (implying more simple models) to check if the data can be regarded as one dataset or several. It is a dangerous assumption to make that very large datasets are homogeneous and can be treated as such. Could that be another explanation for your graphic? If the dataset is homogeneous, then there are good classical statistical reasons for not using all the available data, samples should be fine for identifying much of the information in the data.
I’m assuming this is a conceptual graph and there’s no existing evidence for the precise shape of the curve or the approximate optimal point, but I’d be happy to be corrected on this.
This also gives me an idea. I’ve been distributing a data set to marketing and economics academics that’s large (about 70 gig), but I notice that the papers (several dozen so far) only seem to use tiny pieces of it, although the models they use are often complex. This would potentially provide a way to look at a related curve — the size of the data used on the x axis, and the complexity of the model on the y axis. (The size of the potential usable data is fixed.) The size of the data used would be relatively easy to measure. The complexity of the model is a harder concept, especially since the models are looking at a variety of different phenomena using a variety of different approaches and published in a variety of venues.
Can someone point me to some way to measure model complexity? Say, if you took a random set of papers from a journal on different topics and wanted to do this? That’s an analogous task because the issues addressed by those using this data range rather broadly. My first thought would be to use some set of low-level experts (Mechanical Turk?) to evaluate the papers in a pairwise fashion, and then generate something like an ELO score in chess (the most complex model “winning”). That’s not fully satisfactory because bad writers will seem more complex, good writers will seem more simple, I would guess.
Spiegelhalter et al, JRSoc Ser B, 2002, Bayesian measures of model complexity and fit, available for free from Google Scholar somehow
It would help to break this down by n and p – isn’t the typical problem more an issue of p being large rather than n? I would think that large n is usually not much of a problem since it just means that parameter estimates effectively converge way before you’ve used your samples (so you don’t lose much by subsampling).
Peter Norvig gives a related talk titled the “Unreasonable Effectiveness of Data”. He seemed to be (overly?) optimistic about how using large-n datasets can be more useful than a smart model. From what I remember, the theme was that academics are often focused on incrementally improving a model’s prediction error based on the performance for a given dataset (e.g. a benchmark), whereas in his applications, you could do just as well by adding more data to improve the model’s performance. The choice of the model in his examples stops mattering after a certain sample size (why worry about trading off bias for variance when you have too much data… although I think this breaks down when your dataset is large due to p). As examples, he discusses a toy version of google translate where the algorithm is fairly simplistic , but performs well because of the sheer volume of data fed into it.
I think a counterpoint to that idea occurs in large p scenarios where the model _needs_ to be complex (to impose some structure on the large number of parameter estimates), but can’t be because of the computational cost. In GWAS studies, where the “model” is basically a large number of independently estimated marginal correlations with p-values on each correlation. The amount of actionable information learned from those studies has been somewhat disappointing (in addition to having some reproducibility issues).
In natural language tasks, the large N (number of data points) and large P (dimensionality) are related, because the high P derives from the high N and the distribution of words, phrases and concepts in language. If I have 20K tweets of 20 words each, I have at most 400K tokens, and maybe at the outside 100K unique tokens (out of the billions of unique tokens you find on web scale data).
So even with large N, the parameter estimates don’t converge quickly because many of the dimensions only show up sparsely. So even if you use something like SGD, you need multiple passes to get them right.
You will get a lot of the prediction power, but prediction continues to improve for rare docs. For instance, if 0.1% of the docs are French and 99% are English, you need a lot more randomly sampled data to converge on French than English. Sampling carefully from all the data can help here.
Think of a baseball analogy. There have been quite a few baseball players over the years. If I say “remember Eddie Brinkman?” in a Tweet (he was the Detroit Tigers shortstop when I was a rabid baseball fan in the early 70s), you won’t know it’s about baseball if you’ve only trained on a small sample. Of course, Tweets about Eddie Brinkman are probably rare (there are a dozen included re-Tweets indexed by Google) so it probably doesn’t help your overall predictive accuracy much.
I saw an interesting talk by Fernando Pereira of Google a couple years ago where his group found middle-frequency terms (say “Stan Musial”) better predictors of table columns containing baseball players than high frequency terms (say “Babe Ruth”). The problem was that Babe Ruth was on too many general lists of celebreties and athletes, not just on lists about baseball.
One interpretation of this graph that I particularly like is that it is a strong argument against some forms of physical reductionism (e.g. psychology to brain science, biology to chemistry or chemistry to physics). Is it possible to show some reasonable numbers on the complexity axis?
It’s true as a matter of psychology that as we get more data, we do tend to fit more flexible models. But that’s not necessary. It’s actually somewhat irrational. The world is complicated so we need flexible models to describe our prior beliefs about the world. Just because we don’t have much data doesn’t change that (it actually makes it more necessary to do a good job on describing the prior beliefs).
I am nowhere near big data and might miss the joke. But if you have say 10^9 data points why not take some simple statistics (mean?) over each 1000 entries and then perform more complicated analysis on 10^6 new data “points”.
As we get mote data we sample, improve research design, and advance theory.
The binding constraint with complex models, in my view, is not so much computational as it is cognitive. We are simple minded beings.
Suppose you had infinite data and an all powrful computer: do you fit a God model that explains everything or a model for dummies we can understand?
The challenge is not processing power so much as leveraging human cognition.
Is there some simple (i.e. not extremely theoretical) reference for this? I have clinical collaborators who are set on using >150,000 data points for a model, so it would be nice to be able to show them a published discussion of why this is a bad idea.
The second component curve moves over time due to Moore’s law (or just if you get more funding to access existing supercomputers you aren’t currently using), so while the overall shape may be similar, the position of the peak keeps moving to the right.
It also moves due to better algorithms, but given that we don’t even know how to measure a point on this made-up curve, it’s rather moot.
Also, computational feasibility is not yea/nay. There’s an amount of time it takes to fit a model of a given size and complexity, and different people have different cutoffs (Google simply has more computers than I do).
Which algorithm to use depends on the size of the data. Consider a regression. At some point, computing the hat matrix gives way to quasi-Newton methods which eventually give way to stochastic gradient methods in terms of speed to compute an answer. Also, how many decimal places you need those estimates to will affect which software you use.
it also depends on the ‘complexity’ of the underlying phenomenon. for much of the phenomena that we model, high sample rates may not be necessary. alternately, for much of the models that we impose on the phenomenon, high sample rates might not be necessary.
I just heard a talk by Peter Bartlett about model selection in “unlimited” data situations that essentially addresses this curve. http://www.stat.berkeley.edu/~bartlett/talks/201207-MMDS.pdf
He talks about the problem of model selection given a computational budget (rather than given a sample size). You can either use your computational budget to get more data or fit a more complex model. He shows that you can get oracle inequalities for model selection algorithms under this paradigm (as long as the candidate models are nested).
I long ago hypothesized (but not seriously) that the number of factors that can be reliably extracted in a principle components is related to sample size with the formula “Number of factors” = ln( ln( “sample size”)). Since ln( ln( N)) is approximately 2 for almost all the sample sizes one would encounter in survey research, my hypothesis lead me to ignore more than two dimensions in survey analysis. (I overheard something about the iterated logarithm once that led me to my hypothesis.
I suspect that a very similar result holds for the number of parameters that can be estimated in ARMA models.
Pingback: Peter Bartlett on model complexity and sample size « Statistical Modeling, Causal Inference, and Social Science