Yes, worry about generalizing from data to population. But multilevel modeling is the solution, not the problem

A sociologist writes in:

Samuel Lucas has just published a paper in Quality and Quantity arguing that anything less than a full probability sample of higher levels in HLMs yields biased and unusable results. If I follow him correctly, he is arguing that not only are the SEs too small, but the parameter estimates themselves are biased and we cannot say in advance whether the bias is positive or negative.

Lucas has thrown down a big gauntlet, advising us to throw away our data unless the sample of macro units is right, and to ignore published results that fail this standard. Extreme.
Is there another conclusion to be drawn?
Other advice to be given?
A Bayesian path out of the valley?

Here’s the abstract to Lucas’s paper:

The multilevel model has become a staple of social research. I textually and formally explicate sample design features that, I contend, are required for unbiased estimation of macro-level multilevel model parameters and the use of tools for statistical inference, such as standard errors. After detailing the limited and conflicting guidance on sample design in the multilevel model didactic literature, illustrative nationally-representative datasets and published examples that violate the posited requirements are identified. Because the didactic literature is either silent on sample design requirements or in disagreement with the constraints posited here, two Monte Carlo simulations are conducted to clarify the issues. The results indicate that bias follows use of samples that fail to satisfy the requirements outlined; notably, the bias is poorly-behaved, such that estimates provide neither upper nor lower bounds for the population parameter. Further, hypothesis tests are unjustified. Thus, published multilevel model analyses using many workhorse datasets, including NELS, AdHealth, NLSY, GSS, PSID, and SIPP, often unwittingly convey substantive results and theoretical conclusions that lack foundation. Future research using the multilevel model should be limited to cases that satisfy the sample requirements described.

And here’s my reaction:

My short answer is that I think Lucas is being unnecessarily alarmist. To me, the appropriate analogy is to regression models. Just as we can fit single-level regressions to data that are not random samples, we can fit multilevel models to data that are not two-stage random samples. Ultimately we are interested in generalizing to a larger population, so if our data are not simple random samples, we need to account for this, a concern that I and others address using multilevel modeling and poststratification; see, for example, my recent paper with Yair.

But this is not a problem unique to multilevel models. In any study, there is a concern when generalizing from data to population. Here’s what Lucas writes:

I contend, some datasets on which the MLM has been estimated are non-probability samples for the MLM. If so, estimators are biased and the tools of inferential statistics (e.g., standard errors) are inapplicable, dissolving the foundation for findings from such studies. Further, this circumstance may not be rare; the processes transforming probability samples into problematic samples for the MLM may be inconspicuous but widespread. If this reasoning is correct, many published MLM study findings are likely wrong and, in any case, cannot be evaluated, meaning that, to the extent our knowledge depends on that research, our knowledge is compromised.

You could replace “MLM” with “regression” in that paragraph and nothing would change. So, yes, I think Lucas is correct to be concerned about generalizing from sample to population (what Lucas calls “bias”); it’s a huge issue in psychology and medical studies performed on volunteers or unrepresentative samples. But I don’t see anything especially problematic about multilevel models, particularly if the researcher takes the next step and does poststratification (which is, essentially, regression adjustment) to correct for differences between sample and population. If the data are crap, it’ll be hard to trust anything that comes out of your analysis, but multilevel modeling won’t be making things any worse. On the contrary: multilevel analysis is a way to model bias and variation.
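
To make the poststratification idea concrete, here’s a minimal sketch in Python with made-up numbers. The cell definitions, population counts, and cell estimates below are all hypothetical; in a real analysis (as in the paper with Yair) the cell estimates would come from a fitted multilevel regression, which keeps small cells stable, rather than from raw cell means.

```python
import numpy as np

# Hypothetical poststratification cells (e.g., age group x education),
# with invented population counts from a census-style source.
cells = ["young/low-ed", "young/high-ed", "old/low-ed", "old/high-ed"]
pop_counts = np.array([30_000, 20_000, 35_000, 15_000])

# Cell-level estimates of the outcome. These stand in for model
# predictions from a multilevel regression; the numbers are made up.
cell_estimates = np.array([0.42, 0.55, 0.38, 0.61])

# The (nonrandom) sample over-represents some cells relative to the population.
sample_counts = np.array([400, 900, 150, 550])

naive = np.average(cell_estimates, weights=sample_counts)    # sample-weighted
poststrat = np.average(cell_estimates, weights=pop_counts)   # population-weighted

print(f"naive (sample-composition) estimate:  {naive:.3f}")
print(f"poststratified (population) estimate: {poststrat:.3f}")
```

The correction itself is just a reweighting of cell estimates to the population’s cell counts; the multilevel model earns its keep by producing reasonable estimates for cells with little or no data.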

Multilevel modeling doesn’t solve all problems (see my paper from a few years ago, “Multilevel modeling: what it can and can’t do”), but I think it’s the right way to go when you’re concerned about generalizing to a population. So in that sense I strongly disagree with Lucas, who writes, “Future research using the multilevel model should be limited to cases that satisfy the sample requirements described.” Random samples are great, and I admire Lucas’s thoughtful skepticism, but when we want to analyze data that are not random samples, I think it’s better to face up to the statistical difficulties and model them directly rather than running away from the problem.

22 thoughts on “Yes, worry about generalizing from data to population. But multilevel modeling is the solution, not the problem”

  1. I have a dumb question regarding MLM: Can it be used with multiple overlapping classes? For instance, we study student performance on a test, like the Wikipedia example, so each student is part of a school, the school is part of a district, etc. But the student is also part of a gender (M/F) and a socioeconomic status, all of which overlap with the school grouping. I guess the traditional solution is to use these as predictors, but what is the rationale for that, and how do you decide what is a class and what is a predictor?

    • Cedric:

      Yes, the groups can overlap. That’s one reason we sometimes prefer the term “multilevel” rather than “hierarchical” model. Jennifer and I discuss non-nested models in our book, I think in chapter 13. Also, in the paper with Yair (linked above) the groups all overlap in the way you describe.
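
      For example, here is a rough sketch of a non-nested model with crossed varying intercepts for school and gender, written in Python with PyMC (assuming PyMC is available); the data are simulated and all the numbers are invented, just to show the structure:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# Simulated data: each student belongs to one school AND one gender
# group, and the two groupings cross rather than nest.
n_students, n_schools, n_genders = 500, 20, 2
school = rng.integers(n_schools, size=n_students)
gender = rng.integers(n_genders, size=n_students)
score = (50 + rng.normal(0, 5, n_schools)[school]
            + np.array([-1.0, 1.0])[gender]
            + rng.normal(0, 8, n_students))

with pm.Model() as crossed_model:
    mu = pm.Normal("mu", 50, 20)
    sigma_school = pm.HalfNormal("sigma_school", 10)
    sigma_gender = pm.HalfNormal("sigma_gender", 10)
    a_school = pm.Normal("a_school", 0, sigma_school, shape=n_schools)  # school effects
    a_gender = pm.Normal("a_gender", 0, sigma_gender, shape=n_genders)  # gender effects
    sigma_y = pm.HalfNormal("sigma_y", 10)
    pm.Normal("score", mu + a_school[school] + a_gender[gender],
              sigma_y, observed=score)
    idata = pm.sample()
```

      As for class versus predictor: a two-level factor like gender could just as easily enter as an ordinary predictor; in this framework the distinction mostly comes down to whether its coefficients get a hierarchical prior, which matters more when a grouping has many levels.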

  2. The picture in this post is horrible and unappetizing. Reminds me of the business ecosystem in NYC’s Penn Station. Not the best way to market MLM….

      • Yes, the victimized eyes of a futuristic dystopia where all public spaces are dominated by Nathan’s Hot Dogs, Sbarros Pizza and Tie Rack.

  3. What is a non-probability sample? Does he mean randomized sample?

    “If so, estimators are biased and the tools of inferential statistics (e.g., standard errors) are inapplicable, dissolving the foundation for findings from such studies.”

    This could be said of pretty much any observational non-randomized data. It’s right to worry that such issues will affect the outcomes of hypothesis testing. But perhaps the solution is that we shouldn’t be using observational, non-randomized data in a hypothesis testing framework in the first place.

  4. What if this gentleman obtained a “random sample” and it had an unusually high number of girls? Would he recommend throwing the data away then? Frequentist understanding of statistics destroys one’s intuition to the point that people actually start believing that (entirely deterministic) random number generators can magically fix all the problems and that without this magic you can’t do anything. Every field I’m familiar with eventually gets into a frequentist cul-de-sac like this, and progress stops until people can think about the problem in non-magical terms.

    If your calculation shows that a “random sample” would fix all the problems, then what that really means is “the majority of possible samples would work just fine and only a minority of population subsets would lead us astray.” The key question is whether the particular sample you have is one of those majority (good) ones or one of the minority (bad) ones (the toy simulation at the end of this comment makes this concrete). This question has to be faced regardless of how the sample was obtained. There’s no getting around it.

    There’s sometimes a subtlety here though. If a human researcher were to pick the sample by eye, there may well be a strong tendency to pick one of those minority cases where the sample leads to substantially wrong conclusions about the population. In this case, using an impersonal random number generator to pick the sample allows for the maximal possibility that the sample picked will be one of those majority (good) ones. In that sense randomization can be a useful tool, but in no way does this magically alleviate the responsibility to deal with the key question in the previous paragraph.

    In other cases, by contrast, it’s quite possible for the sample from an observational or deterministic study to be one of those majority cases that works just fine. And if the answer to the key question is “it’s one of those minority (bad) samples,” then you have to deal with it somehow, and it makes absolutely no difference whether it was chosen with a random number generator or not.
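
    Here’s a toy simulation of the “majority of samples” point, with an invented population (the numbers mean nothing in particular):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented skewed population of 10,000 values (say, incomes).
population = rng.lognormal(mean=10, sigma=0.5, size=10_000)
true_mean = population.mean()

# Draw many size-100 subsets uniformly at random and check how often
# the subset mean lands within 10% of the population mean.
n_trials, n_sample = 5_000, 100
rel_err = np.array([
    abs(rng.choice(population, n_sample, replace=False).mean() - true_mean)
    for _ in range(n_trials)
]) / true_mean
print(f"fraction of random subsets within 10% of truth: {(rel_err < 0.10).mean():.2f}")

# An extreme 'chosen by eye' sample that avoids the upper tail entirely.
by_eye = np.sort(population)[:n_sample]
print(f"relative bias of the hand-picked sample: {by_eye.mean() / true_mean - 1:+.0%}")
```

    Most subsets are fine; a sufficiently perverse selection rule is not, and no appeal to the sampling mechanism changes which kind you actually have in hand.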

    • The “choosing by eye” problem is one of the major reasons to prefer random number generators in forensics. When there are opposing points of view, you want an impartial choice of which cases to look at.

      I agree with you though: either your sample is representative or it isn’t; the fact that it’s chosen by a random number generator only makes it much more likely to be representative than some other methods. Also, the random number generator can only choose from among cases that you know about, so you can already bias your sample in the sampling procedure by putting into the RNG a subset that you think is the entirety of the population. This is typical of random digit dialing, for example: you don’t reach people who only have cell phones, since you can’t legally dial a pay-by-the-minute line with a telephone poll (or maybe I’m wrong, but the general principle still stands; you certainly won’t get poor people who don’t have a phone at all, for example).

        • If there were any method at all which returned one of those majority (good) samples 100% of the time, that would be a great boon. Anyone who drank the frequentist Kool-Aid, though, would object because each sample wasn’t chosen with “equal probability,” thereby invalidating the results. The absurdity of it all.

          • I find these comments curious given http://statmodeling.stat.columbia.edu/2013/06/14/progress-on-the-understanding-of-the-role-of-randomization-in-bayesian-inference/

          In an 1885 entry, Peirce stated that the arguments for randomizing were so well accepted that he did not need to repeat them, but simply set out a recipe for taking random samples. Fisher _ended_ his squabble with Student/Gosset about systematic arrangements with a quip that the only way to be safe from the Devil was to use randomization, to which Richard Peto quipped that this presumes the Devil cannot foresee random outcomes …

          There may be a cost to randomization but almost always some benefit; even the most sophisticated computational experiments (e.g., non-naive Monte Carlo, numerical integration) at some point find it useful to add some randomness.

        • Keith:

          Randomization is great, but I think it’s silly for Lucas to pick on multilevel models. I see his attitude as a classic form of methodological conservatism. Certain methods (e.g., regression, Anova, whatever) are OK; they’ve been grandfathered in and are considered acceptable by default. But other, newer methods (e.g., multilevel models) are suspect, and when these come up, all of a sudden all their assumptions are examined. If Lucas wants to worry about his inferences, that’s fine, but as I wrote above, I think multilevel modeling is a way to move forward to address these concerns.

        • Andrew: I fully agree, I was commenting on Entsophy’s and alex’s comments (I’ll be more specific in the future).

        • I was once given some ASTM standard on using random sampled batches to do quality control type calculations. It basically said that if the samples were chosen using a random number generator then the sampled items were representative and could be counted as 1/N of the total. If the samples were chosen by *any other means* they were representative only of themselves, so if you took 3 samples by digging around in a bin and grabbing a widget with your eyes closed you could only learn that those 3 widgets had the property measured and you had no information about the other 1000 widgets.

          While I agree with you and the standard that randomization has many virtues, this kind of magical thinking displayed by the standard is something that Entsophy is legitimately against.

          In my opinion, the right way to deal with nonrandom samples is to consider the effect of the sampling on the biases introduced, and to model them using models that constrain the size of those biases to something reasonable. If you have a reasonably sized convenience sample (or other nonrandom sample), it can help to identify the biases by also getting a small random sample, or a sample by some entirely different convenience/biased method (a sketch of the general idea is at the end of this comment).

          In many, many areas we are left with nothing but nonrandom samples to analyze; we need a way forward even in such circumstances, and I agree with Andrew that multilevel + Bayesian modeling provides huge flexibility to incorporate the kinds of modeling assumptions we will need to make to account for biases and data-collection issues in general. I suspect you do too.
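
          Something like this toy PyMC sketch (assuming PyMC is available; all numbers invented) is the kind of thing I mean: treat the convenience sample as measuring the quantity of interest plus a bias term, put a prior on the bias that constrains it to a plausible size, and let a small random sample, if you have one, help pin the bias down:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)

# Invented data: a large convenience sample with an unknown upward bias,
# plus a small random sample from the same population.
convenience = rng.normal(10.8, 2.0, size=800)
small_random = rng.normal(10.0, 2.0, size=40)

with pm.Model() as bias_model:
    mu = pm.Normal("mu", 10, 5)        # population mean of interest
    bias = pm.Normal("bias", 0, 0.5)   # prior keeps the bias to a plausible size
    sigma = pm.HalfNormal("sigma", 5)
    pm.Normal("y_conv", mu + bias, sigma, observed=convenience)
    pm.Normal("y_rand", mu, sigma, observed=small_random)
    idata = pm.sample()
```

          Without the small random sample, the prior on the bias is doing most of the work, which is exactly the assumption you’d want to state out loud.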

        • O’Rourke,

          I gave an example where randomization is useful. The key point was that it’s useful for reasons completely different from those usually provided. In the example given, the official explanation is that randomization doesn’t work unless each sample is chosen with “equal probability.” That’s pretty much as meaningless an idea as was ever uttered in statistics. In reality, randomization won’t be of help if either (A) a researcher choosing the sample by eye isn’t liable to lead to one of those minority (bad) samples, or (B) the majority of samples aren’t good ones. If either of those holds, then randomization won’t improve the situation.

          Another example of this kind was given by that Wasserman post
          http://normaldeviate.wordpress.com/2013/06/09/the-value-of-adding-randomness/

          Larry says “But if we randomly assign people to the two groups then, magically theta=alpha”. Uh … no, it isn’t magic. Once N gets to be larger than about 50 or so, then almost every division into treatment and control groups will have that property approximately (the little simulation at the end of this comment shows the idea). It’s not the magic of randomness that makes it happen but rather the down-to-earth fact that almost every possible division works out that way.

          So again, if a given method for picking groups is liable to lead to one of those rare cases that are bad, then using an impersonal (usually completely deterministic) “randomization” technique provides maximal opportunity to be placed in one of those good divisions and will be an improvement.

          Note, though, that if you don’t have the property that “a majority of the possibilities work out well,” then randomness doesn’t do what people think it does, and some very smart people can get seriously misled. There’s a price to be paid for not understanding what really makes the whole thing work.
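
          A little simulation of the “almost every division” point (invented covariate, nothing special about the numbers):

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented covariate (say, age) for 100 subjects to be split 50/50.
age = rng.normal(40, 12, size=100)

# Look at the imbalance in mean age across many purely random splits.
n_splits = 10_000
imbalance = np.empty(n_splits)
for i in range(n_splits):
    idx = rng.permutation(100)
    imbalance[i] = abs(age[idx[:50]].mean() - age[idx[50:]].mean())

print("95th percentile of imbalance, in SDs of age:",
      round(np.quantile(imbalance, 0.95) / age.std(), 2))
```

          Even the worst few percent of random splits are only modestly imbalanced on this covariate, because almost every possible division is a balanced one; randomness just keeps a human from systematically landing in the rare bad ones.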

        • The double-negative construction in your first paragraph is a little bit confusing; I’m going to try to paraphrase and see if I understood correctly:

          If most samples are good samples, then randomization or most other methods of sampling will work well. If, however, we usually choose a method (i.e., “by eye”) that makes it much more likely to get one of the non-majority samples that are “bad” samples, then randomization will help because it doesn’t have this property.

          On the other hand, if many or most samples are “bad” samples, then randomization doesn’t help because it mostly gives “bad” samples since most of the samples are bad.

          I assume the latter case occurs when you have, say, mixtures of things going on, and most of the samples are polluted with the not-of-interest components of the mixture (e.g., say we’re trying to find out how green widgets work in a box of 1000 red widgets and 100 green widgets, except we’re color-blind so we can’t see the difference). In that case, random samples won’t help; we need to get someone who isn’t color-blind. As soon as we do that, most likely we don’t need to randomize which green widgets we take, except if there’s something “shiny” about certain green widgets that makes the color-sensitive person prefer them. Then randomizing among the green widgets could be helpful to avoid their “shiny”-type bias.

          I’m largely in agreement with you, though I tend to favor skepticism of non-random samples such that I will probably build a more complicated model in their presence than if I were given a simple chosen-by-RNG sample. The resulting inferences will likely have broader posterior distributions.

        • Daniel: Some would argue it should be much wider and possibly equal in width to the prior.

          Sander Greenland (2002) Multiple-bias modelling for analysis of observational data

        • I think Larry is correct, in that he likely means theta and alpha as parameters of a distribution, and randomization mathematically makes the _distributions_, and therefore all their parameters, equal.

          It is very hard (for me) to discern what most people think randomization means or why it works (even the students I taught and tested on this), so I tend not to think about it much (other than occasionally proposing anonymous surveys of practicing statisticians).

          For what people should think, I would suggest the Rubin paper I provided in the post I referred to (as a good start).

        • I agree with your general points about “randomization” and sample selection. But I also think that your quote from Larry is actually about something else. Your good points about “random sampling” have to do with when we do and do not have a sample that is representative of the population (whatever we mean by “population”). There are many good, non-statistical arguments one can make to show that one’s sample is more or less representative, regardless of how it was chosen.

          But the issue of splitting into two groups isn’t about “random sampling”; it’s about “random assignment.” And I think that “random assignment” is the issue in “randomization” that a lot of us are worried about. That’s because we study people, and people have the nasty habit of choosing what they want to do, or being able to determine their own covariates to some degree. So, consider something like the causal effect of union membership on health (from a post a month or so ago). Different kinds of people join and don’t join unions. It doesn’t matter how big N gets; these groups will always be different. And I think that some of the conversation here is maybe confusing these two issues. Either that, or I just think that, when it comes to making causal claims about people, we are much more often in the “bad” sample world than in the “good” sample world, and that’s fully down to selection-into-group.

  5. Randomisation doesn’t make a sample more likely to be representative than other methods. You can always do as well as, or (usually) better than, randomisation at getting a representative sample by using deliberate selection.

    These randomisation arguments are such nonsense. If I am trying to use regression to study the extension of a spring at various weights, do I need to randomly choose from the ‘population’ of possible weights? That’s crazy. You can choose the x values based on what will help you run an efficient experiment, and you’ll get valid inferences about your equation (the sketch at the end of this comment makes this concrete).

    We really need a backlash against randomisation. Introducing RCTs did a lot of good 30-40 years ago. But it has got to the point where everyone thinks randomisation is magic, because it is accessible and you don’t need to know any maths to toss a coin. There is a whole literature on things like optimum experimental design and minimisation, which has basically been buried.
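
    For the spring example, here’s a quick sketch (invented noise level and weight range) comparing the precision of the estimated spring constant when the weights are deliberately placed at the ends of the feasible range versus chosen at random:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented setup: extension = k * weight + noise, weights allowed in [0, 10] kg,
# measurement noise sd 0.3. Compare designs by the OLS slope's standard error.
noise_sd, n = 0.3, 20

def slope_se(x):
    """Standard error of the OLS slope for design points x: sigma / sqrt(Sxx)."""
    return noise_sd / np.sqrt(np.sum((x - x.mean()) ** 2))

designed = np.repeat([0.0, 10.0], n // 2)   # half the weights at each extreme
random_x = rng.uniform(0, 10, size=n)       # weights chosen 'at random'

print(f"SE of slope, deliberately chosen weights: {slope_se(designed):.4f}")
print(f"SE of slope, randomly chosen weights:     {slope_se(random_x):.4f}")
```

    Both designs give valid inference for the slope if the straight-line model holds; the deliberately chosen one just gives it with a smaller standard error, which is the basic point of optimum design (though spreading the points to the extremes forfeits the ability to check linearity in the middle of the range, so there is a trade-off).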

    • Now say, to speed up my spring-extension measurements, I buy 10 identical springs and split the measurements among them.

      Randomising how to split the weights among the 10 springs has a virtue now.

      I don’t know if, technically, you’d call what I described randomization or not.

