Jeff pointed me to this interesting paper by David Primo, Matthew Jacobsmeier, and Jeffrey Milyo comparing multilevel models and clustered standard errors as tools for estimating regression models with two-level data.

Much of the paper is a review of the different methods and models (I’d also recommend my own paper, Multilevel Modeling: What It Can and Cannot Do), and it’s worth a read, although it has a few statements that may be misleading to the casual reader. (For example, on page 448, they write, “Clustering arises because the attributes of states in which individuals reside do not vary across individuals within each state.” You can also get clustering if the attributes are similar within each state, even if not identical. But their main point is a good one, which is that clustering is a characteristic of the underlying individuals, not merely something that arises from clustered sampling or other structured data collection. Another example, which could be more misleading to non-experts, is when they write on page 452 that multilevel modeling “uses Equations 1-3 and the assumptions below to estimate coefficients, variances, and covariances that maximize the likelihood of observing the data, given the model.” This isn’t right, for two reasons. First, there are a lot of ways to do multilevel estimation, and when the number of groups is small or group-level variances are small, the Bayesian approach is more effective (see our book, for example). Second, even in the classical framework, multilevel modeling works with the marginal likelihood, averaging over the varying coefficients (the beta_j’s in their Equations 1-3). It’s not maximizing the joint likelihood. I’m sure Primo et al. realize this, but the point might be lost on a novice reader.)
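To spell out the joint-versus-marginal distinction, here is a sketch in generic notation. The symbols are illustrative shorthand and are not taken from Primo et al.'s Equations 1-3: y_j is the data in group j, beta_j is that group's varying coefficients, and theta collects everything else (the non-varying coefficients and the variance parameters). I'm assuming a simple nested two-level structure with groups independent given theta.

```latex
\begin{align*}
\text{joint likelihood:} \quad
  & p(y, \beta \mid \theta) = \prod_{j=1}^{J} p(y_j \mid \beta_j, \theta)\, p(\beta_j \mid \theta) \\
\text{marginal likelihood:} \quad
  & p(y \mid \theta) = \prod_{j=1}^{J} \int p(y_j \mid \beta_j, \theta)\, p(\beta_j \mid \theta)\, d\beta_j
\end{align*}
```

Classical multilevel estimation maximizes the second expression over theta, with the beta_j's integrated out; the beta_j's are then estimated conditional on the fitted theta.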

**Back to the main point**

Primo et al. compare three approaches: (1) least-squares estimation ignoring state clustering, (2) least-squares estimation ignoring state clustering, with standard errors corrected using cluster information, and (3) multilevel modeling. Their general points are that method (1) can be really bad–I agree–and that (2) and (3) have different strengths. In comparing (2) to (3), their evidence (beyond the literature review) is an example, analyzing data from a recently published paper on state politics, in which they can do method (2) with no problem, but method (3) doesn’t run in Stata (“despite repeated attempts using different models (a linear probability model as well as a logit model), the model failed to converge”).
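To see why method (1) can be so bad, here is a minimal sketch of methods (1) and (2) in the simplest possible case: estimating a single overall mean (a regression with only an intercept) from individuals grouped into states. This is not Primo et al.'s model or data. The function name, the simulated data, and the choice of 30 states with 50 people each are all assumptions for illustration, and I've left out the small-sample corrections that software typically applies to clustered standard errors.

```python
import math
import random


def naive_and_clustered_se(groups):
    """Estimate an overall mean from clustered data.

    groups: list of lists, one inner list of observations per cluster.
    Returns (estimate, naive_se, clustered_se).
    """
    values = [y for g in groups for y in g]
    n = len(values)
    mean = sum(values) / n  # least-squares point estimate, same for both methods

    # Method (1): usual standard error, treating all n observations as independent.
    resid_ss = sum((y - mean) ** 2 for y in values)
    naive_se = math.sqrt(resid_ss / (n - 1) / n)

    # Method (2): sandwich estimate. Residuals are summed within each cluster
    # before squaring, so within-cluster correlation is not averaged away.
    # (No finite-sample correction is applied in this sketch.)
    cluster_sums = [sum(y - mean for y in g) for g in groups]
    clustered_se = math.sqrt(sum(s ** 2 for s in cluster_sums)) / n

    return mean, naive_se, clustered_se


# Hypothetical data: 30 states, 50 people per state, with a shared state effect.
rng = random.Random(0)
states = []
for _ in range(30):
    state_effect = rng.gauss(0, 1)
    states.append([state_effect + rng.gauss(0, 1) for _ in range(50)])

est, se_naive, se_cluster = naive_and_clustered_se(states)
print(f"estimate {est:.3f}, naive s.e. {se_naive:.3f}, clustered s.e. {se_cluster:.3f}")
```

With state effects as large as the individual-level noise, the clustered standard error should come out several times larger than the naive one, even though the point estimate is identical.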

My thoughts:

1. One big advantage of multilevel modeling, beyond the cluster-standard-error approach recommended in this paper, is that it gives separate estimates for the individual states. Primo et al. minimize this issue by focusing on global questions–“Do voter registration laws affect turnout? Do legislators in states with term limits behave differently than legislators in states with no term limits?”–and in their example they focus on p-values rather than point estimates or estimates of variation. Thus, in the examples they look at, multilevel modeling doesn’t have such a big comparative advantage.

2. Another advantage of multilevel modeling comes with unbalanced data–in their context, different sample sizes in different states.

3. I agree that it’s frustrating when software doesn’t work, and I agree with Primo et al. completely that it’s better to go with a reasonable method that runs, rather than trying to use a fancier approach that doesn’t work on your computer. That said, I think their abstract would’ve been clearer if they had simply said, “Stata couldn’t fit our multilevel model,” rather than vaguer claims about “large datasets or many cross-level interactions.”

4. I’d like to get their data and try to fit their model in R. It might very well crash in R also–we’ve had some difficulties with lmer()–in which case it would be useful to figure out what’s going on and how to get it to work.

5. I’d recommend displaying their Table 1 as a graph. (John K. also wrote a paper on this for political scientists.)

6. I completely disagree with their statement on page 456 that cluster-adjusted standard errors “requires fewer assumptions” than hierarchical linear modeling. As Tukey emphasized, methods are just methods. A method can be motivated by an assumption but it doesn’t “require” the assumption. For a simple example, least squares is maximum likelihood for a model with normally distributed errors. But if the errors have a different distribution, least squares is still least squares: it did not “require” the assumption. To go to the next step, classical least squares (which is what Primo et al. recommend for their point estimation) is simply multilevel modeling with group-level variance parameters set to zero. Thus, their estimate requires *more* assumptions than the multilevel estimate.

7. But, to conclude, I’m not criticizing their choice of clustered standard errors for their example. It’s not a bad idea to use a method that you’re comfortable with. Beyond that, it can be extremely helpful to fit complete-pooling and no-pooling models as a way of understanding multilevel data structures. (See here for more of my pluralistic thinking on this topic.) I hope that as more people read our book, they’ll become more comfortable with multilevel models. But what I really hope is that the software will improve (maybe I have to do some of the work on this) so we can actually fit such models, especially varying-intercept, varying-slope models with lots of predictors and nonnested levels.
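Points 2 and 6 can be illustrated with the simplest multilevel model: varying intercepts, no predictors, and variance parameters treated as known. In that case the multilevel estimate of a state's mean is a precision-weighted average of the state's own sample mean and the overall mean. The sketch below is an illustration under those assumptions, not the model from the paper; the function name and the numbers are made up.

```python
def partial_pooling_mean(ybar_j, n_j, mu, sigma_y, sigma_alpha):
    """Multilevel estimate of one group's mean, variances taken as known.

    ybar_j: sample mean in group j        n_j: sample size in group j
    mu: overall mean                      sigma_y: within-group sd
    sigma_alpha: between-group sd (the group-level variance parameter)
    """
    if sigma_alpha == 0:
        # Group-level variance set to zero: every group gets the overall
        # mean, which is the complete-pooling (classical) estimate.
        return mu
    w_data = n_j / sigma_y ** 2      # precision of the group's own mean
    w_prior = 1 / sigma_alpha ** 2   # precision of the group-level distribution
    return (w_data * ybar_j + w_prior * mu) / (w_data + w_prior)


# Hypothetical state with sample mean 10, overall mean 4, within-state sd 2.
for n_j in (2, 200):
    for sigma_alpha in (0, 1, 1e6):
        est = partial_pooling_mean(10.0, n_j, 4.0, 2.0, sigma_alpha)
        print(f"n_j={n_j:>3}, sigma_alpha={sigma_alpha:g}: estimate {est:.3f}")
```

Setting sigma_alpha to zero returns the overall mean for every state, and letting it grow very large returns each state's own mean (no pooling). In between, a state with only 2 observations is pulled most of the way toward the overall mean (from 10 to 6 in this example), while a state with 200 observations barely moves. That automatic adjustment is what I mean by handling unbalanced data.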

My suggestion for problems of nonconvergence is to look at the estimates and condition number of the Hessian at the end of the iterations. This should show if either a zero random effect variance or an identification problem has occurred, which really means it is a problem with the model. It would obviously be better if the software worked this out. There may be better suggestions.

Comments in brackets:

"I hope that as more people read our book, they'll become more comfortable with multilevel models." [Yes! I'll vouch as a "MLM newbie" that your book is very helpful. Other novices may want to also look at introductory works (although older and less comprehensive than Gelman and Hill) by (1) Judith Singer (Harvard) and (2) Joop Hox (Utrect Univ.).] "But what I really hope is that the software will improve (maybe I have to do some of the work on this) so we can actually fit such models, especially varying-intercept, varying-slope models with lots of predictors and nonnested levels." [I'm not sure if you're familiar with GLIMMIX in SAS, but does it NOT due any of that?]

Re: #1. You say one of the advantages to multilevel modelling is its ability to separately estimate a coefficient for each state, using state-level data. How is this different than simply interacting a state dummy with some explanatory variable of interest? You can find the effect separately by state this way too.

Jason,

You're talking about the no-pooling estimate, as we call it in our book. Two issues arise with this approach:

(1) If you don't have a lot of data in each state, your estimates will be noisy.

(2) You can't estimate coefficients for individual states and state-level predictors at the same time.
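Issue (1) is easy to check by simulation. The sketch below is a toy setup, not anyone's real data: many states, two observations per state, and the group-level mean and variances assumed known so that the partial-pooling estimate has a closed form. The function name and all default values are assumptions for illustration.

```python
import math
import random


def simulate_rmse(n_states=2000, n_per_state=2, mu=0.0,
                  sigma_y=1.0, sigma_alpha=0.5, seed=1):
    """Compare no-pooling and partial-pooling estimates of state means.

    Returns (rmse_no_pooling, rmse_partial_pooling), each measured
    against the true simulated state means.
    """
    rng = random.Random(seed)
    w_data = n_per_state / sigma_y ** 2   # precision of a state's sample mean
    w_prior = 1 / sigma_alpha ** 2        # precision of the state-level distribution
    sq_err_no_pool = 0.0
    sq_err_partial = 0.0
    for _ in range(n_states):
        alpha = rng.gauss(mu, sigma_alpha)  # true state mean
        ys = [rng.gauss(alpha, sigma_y) for _ in range(n_per_state)]
        ybar = sum(ys) / n_per_state        # no-pooling estimate
        shrunk = (w_data * ybar + w_prior * mu) / (w_data + w_prior)
        sq_err_no_pool += (ybar - alpha) ** 2
        sq_err_partial += (shrunk - alpha) ** 2
    return (math.sqrt(sq_err_no_pool / n_states),
            math.sqrt(sq_err_partial / n_states))


rmse_no_pool, rmse_partial = simulate_rmse()
print(f"no pooling RMSE {rmse_no_pool:.3f}, partial pooling RMSE {rmse_partial:.3f}")
```

With so little data per state, the no-pooling estimates should be noticeably noisier than the partially pooled ones. In a real analysis the hyperparameters would themselves be estimated, which this sketch skips.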