Raymond Lim writes:

Do you have any recommendations on clustering and binary models? My particular problem is I’m running a firm fixed effect logit and want to cluster by industry-year (every combination of industry-year). My control variable of interest is measured by industry-year, and when I cluster by industry-year, the standard errors are 300x larger than when I don’t cluster. Strangely, this problem only occurs when doing logit and not OLS (linear probability). Also, clustering just by field doesn’t blow up the errors. My hunch is it has something to do with the non-nested structure of year, but I don’t understand why this is only problematic under logit and not OLS.

My reply:

I’d recommend including four multilevel variance parameters, one for firm, one for industry, one for year, and one for industry-year. (In lmer, that’s (1 | firm) + (1 | industry) + (1 | year) + (1 | industry.year)). No need to include (1 | firm.year) since in your data this is the error term. Try some varying slopes too.

If you have a lot of firms, you might first try the secret weapon, fitting a model separately for each year. Or if you have a lot of years, break the data up into 5-year periods and do the above analysis separately for each half-decade. Things change over time, and I’m always wary of models with long time periods (decades or more). I see this a lot in political science, where people naively think that they can just solve all their problems with so-called “state fixed effects,” as if Vermont in 1952 is anything like Vermont in 2008.

My other recommendation is to build up your model from simple parts and try to identify exactly where your procedure is blowing up. We have a graph in ARM that makes this point. (Masanao and Yu-Sung know what graph I’m talking about.)

(Political scientist here) A good deal of the rare events data I use do not contain enough observations to break up into five-year sets and do not contain a normally distributed number of 1s. Could you suggest a rule of thumb for dividing data with, say, 400 observations over a 40-year period? I realize this question is self-serving, and you'd probably need more specific information to make any hard suggestions, but I imagine others (especially in PS) could get some mileage out of exploring the trade-offs between capturing omitted, time-dependent factors within the separately estimated datasets and generating estimates with the full dataset, which would be insensitive to (possibly important) temporal differences across the years. Thanks!

When you say fixed effects logit do you mean conditional logit (e.g. as implemented with xtlogit in Stata) or a logit model with dummy variables? If the former, then there's no reason to expect similarities with OLS, because identification is potentially based on a small subset of the data (e.g. excluding any units that don't have variation in the outcome variable). If the latter, you need to be careful about whether the data are deep enough to get good estimates on the dummy variables. In either case, if you really are seeing a 300-fold increase in your standard errors, then you are probably trying to model more than what the data will allow (I suspect sparseness), and somehow OLS is masking that fact. Thus, I think you'd want to try to simplify things as much as possible.

Masanao and Yu-Sung, please direct me to the graph that Andrew referred to, i.e., on which page of ARM?

Presumably "clutering" should be "clustering".

(All the pedants who spotted this are presumably waiting for some other pedant to post first.)