Asa writes:

I took your class on multilevel models last year and have since found myself applying them in several different contexts. I am about to start a new project with a dataset in the tens of millions of observations. In my experience, multilevel modeling has been most important when the number of observations in at least one subgroup of interest is small. Getting started on this project, I have two questions:

1) Do multilevel models still have the potential to add much accuracy to predictions when n is very large in all subgroups of interest?

2) Do you find SAS, Stata, or R to be more efficient at handling multilevel/“mixed effects” models with such a large dataset (I won’t be needing any logit/Poisson/GLM models)?

My reply:

Regarding software, I’m not sure, but my guess is that Stata might be best with large datasets. Stata also has an active user community that can help with such questions.

For your first question: if n is large in all subgroups, then multilevel modeling is typically not needed; you can simply fit a separate model in each group, which is equivalent to a full-interaction model. At that point you might become interested in details within subgroups, and then you might want a multilevel model after all.
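The separate-models/full-interaction equivalence can be checked numerically. A minimal sketch with made-up data (two groups, plain least squares via NumPy), not tied to any particular dataset from the discussion:

```python
import numpy as np

# Made-up data: two groups with different intercepts and slopes.
rng = np.random.default_rng(0)
n = 200
g = rng.choice(["a", "b"], size=n)
x = rng.normal(size=n)
y = np.where(g == "a", 1.0 + 2.0 * x, -0.5 + 0.3 * x) + rng.normal(scale=0.1, size=n)

# Option 1: fit a separate regression in each group.
separate = {}
for grp in ["a", "b"]:
    m = g == grp
    X = np.column_stack([np.ones(m.sum()), x[m]])
    separate[grp], *_ = np.linalg.lstsq(X, y[m], rcond=None)

# Option 2: one full-interaction model, with a group-specific
# intercept column and a group-specific slope column for each group.
X_full = np.column_stack([
    (g == "a").astype(float), (g == "b").astype(float),
    np.where(g == "a", x, 0.0), np.where(g == "b", x, 0.0),
])
beta, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# The design matrix is block-separable, so the two fits agree exactly.
assert np.allclose(separate["a"], [beta[0], beta[2]])
assert np.allclose(separate["b"], [beta[1], beta[3]])
```

The point is that with no pooling across groups, the full-interaction parameterization is just a bookkeeping device for the per-group fits; a multilevel model only changes things by partially pooling the group coefficients.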

Asa then wrote:

Yes, a “full interaction” model was the alternative I was thinking of. And yes, I can imagine the results from that model raising further questions about what’s going on within groups as well.

My previous guess was that SAS would be the most efficient for multilevel modeling with big data. But I just completely wrecked my (albeit early-2000s-era) laptop looping PROC MIXED a bunch of times with a much smaller dataset.

I don’t really know about the SAS vs. Stata issue. In general, I have warmer feelings toward Stata than SAS, but on any particular problem, who knows? I’m pretty sure that R would choke on any of these problems.

On the other hand, if you end up breaking the problem into smaller pieces anyway, maybe the slowness of R wouldn’t be so much of a problem. R does have the advantage of flexibility.

If R can hold the dataset in memory, then the lme4 package may be the fastest way to estimate the multilevel model. At least according to its author, it's much more efficient than commercial packages at estimating models with large numbers of fixed effect levels (subgroups).

http://matrix.r-forge.r-project.org/slides/2009-0…

In this context you could think of a multilevel model as a computational trick to make the computing problem easier, though I suspect you could also just get yourself a decent desktop computer running 64-bit Linux with 8 GB of RAM, stick the whole dataset in memory, and use lme4 as suggested above. After all, 10 million observations, at say 10 dimensions per observation and 8 bytes per dimension, is only about 1 GB of memory.
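The back-of-envelope arithmetic in that last sentence, spelled out (the observation and variable counts are just the illustrative numbers used above):

```python
# 10 million observations x 10 variables x 8 bytes per double-precision value.
n_obs = 10_000_000
n_vars = 10
bytes_per_double = 8

total_bytes = n_obs * n_vars * bytes_per_double
gb = total_bytes / 1e9
print(f"{gb:.1f} GB")  # 0.8 GB
```

So the raw data matrix fits comfortably on the hypothetical 8 GB machine, though the fitting routine will need working memory well beyond the data itself.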

Alternatively:

1) *Start* with a model for within-group effects, and dimension-reduce the problem to a set of Bayesian posteriors about the within-group descriptors.

2) Build your between-group model as a model of the differences between the posteriors of those within-group coefficients.

This should also confirm that lme4 can fit larger models than PROC MIXED in SAS – the difference lies in the use of sparse matrix methods in lme4.

If you are willing to try some other software as well, there are linear mixed-model programs built exactly for large datasets. Here are some from the field of animal breeding:

ASReml

http://www.vsni.co.uk/products/asreml/

BLUPF90

http://nce.ads.uga.edu/%7Eignacy/numpub/

DMU

http://www.dmu.agrsci.dk/

http://www.wcgalp8.org.br/wcgalp8/articles/paper/…

PEST/VCE

http://vce.tzv.fal.de/index.pl

MATVEC

http://statistics.unl.edu/faculty/steve/software/…

WOMBAT

http://agbu.une.edu.au/~kmeyer/wombat.html

I'll second the lme4/R recommendation, on the grounds that it would fit models for me that Stata 10's xtmixed wouldn't.

More specifically, for c. 300,000 observations and a cross-classified multilevel structure with large numbers of groups, Stata 10 seemed to be heading for a compute time of days, while lme4 did it in about ten minutes.

Caveat: Stata 11 has extended the relevant functionality, and might well be faster than 10.

Thanks everyone.

I figured out a pretty solid solution to this. N was about 90 million, so I did the more basic data work on a server running PostgreSQL. Then I used the R package 'RPostgreSQL' and a data tunnel to interface with it (allowing me to run things in the R GUI from my laptop), and it worked very well. There is also a PostgreSQL procedural language called 'PL/R' that I used, which lets you run R commands in the SQL environment. This whole thing has left me more impressed than ever with R's open-sourceness.

As an aside: how large an n are you talking about within the subgroups? I'm curious how quickly estimating separate models stops being a problem. 300? 1,000? More?