Andrew Grogan-Kaylor writes:

More and more in my work with students, I’m coming to a place where I realize that I know a lot about, and am good at explaining, all kinds of statistical stuff like “we use a logit (or probit) for a binary outcome, and here’s why” or “when data are clustered inside neighborhoods, we use multilevel models”.

What I’m less good at is something that is emerging in a lot of the PhD students’ questions, which are more general questions about how to build statistical models. When do you know that you have “enough” independent variables? What variables should be included in your analysis even if they’re not going to end up being statistically significant? When you model interactions, how many interactions should you test? Which ones should you retain in your model?

I guess that what I’m looking for is a more “philosophical” piece about statistical model building in general, as opposed to what I usually read, which are pieces about the particulars of a specific statistical technique.

Do you know of any general overview of modeling such as an article? I recall you talking about something like this in your blog, but a search is not turning it up.

My reply: There must be some overviews out there, but the only ones I’m particularly happy with are those in chapter 4 (linear regression) and chapter 5 (logistic regression) of my book with Jennifer.

My brief bit of strategy advice is to start the model simple and add variables.

Also, take your most important main effects and include their interactions. That’s a trick I learned a couple years ago, and it’s worked over and over for me. It sounds obvious once you hear it, but if you look back on your earlier analyses, I bet you’ll find you weren’t always doing it.
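As an illustration of this advice, here is a minimal sketch in Python (the variable names x1, x2 and the simulated data are my own invention, not from the post): fit a regression that includes the two most important main effects and also their interaction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)  # first important predictor
x2 = rng.normal(size=n)  # second important predictor
# Simulated outcome whose true model includes an interaction:
# the effect of x1 depends on the level of x2.
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with intercept, both main effects, AND their interaction.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [1.0, 2.0, -1.0, 0.5]
```

A model with only the two main effects would force the x1 slope to be the same at every level of x2; adding the x1 * x2 column lets the data say otherwise.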

I keep aware of statistical significance, but I don’t think of model building as a process of testing variables or testing interactions. The main thing I get out of statistical significance is that if a coefficient is statistically significant and has the “wrong” sign–that is, it doesn’t make sense–then I look more carefully to try to understand what’s going on with the predictors.

One other tip–I think we mention this somewhere in the book–is to remember that a regression coefficient can be interpreted as the average difference, comparing two units that differ by 1 unit on predictor x but are identical on all other predictors. Sometimes this comparison doesn’t make a lot of sense, in which case it might not be worth your while to try too hard to interpret the corresponding coefficient.
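This interpretation can be checked directly. A small sketch (the coefficient values here are illustrative placeholders, not from the post): the predicted difference between two units that differ by 1 on x and match on everything else is exactly the coefficient on x.

```python
import numpy as np

# Hypothetical fitted coefficients: intercept, slope on x, slope on z.
beta = np.array([1.5, 0.8, -0.3])

def predict(x, z):
    """Linear predictor for one unit with values x and z."""
    return beta[0] + beta[1] * x + beta[2] * z

# Two units identical on z but differing by 1 unit on x:
diff = predict(x=3.0, z=2.0) - predict(x=2.0, z=2.0)
print(diff)  # equals the coefficient on x (0.8, up to floating point)
```

The caution in the post is about whether this "hold everything else fixed" comparison is meaningful for your data–e.g., in a model with both x and x squared as predictors, no two real units can differ on one and match on the other.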

> take your most important main effects and include their interactions

I heard this suggestion in a talk by David Cox more than a few years ago, and he attributed the idea to someone earlier (Tukey, I believe).

I believe the statistics discipline is more comfortable publishing deductive stuff (model implications) rather than suggestive stuff (might be a good idea to try) or rhetorical stuff (it is a good idea because …), so these "good ideas" end up being re-invented rather than well shared (blogs may help!)

But having unashamedly picked up the hammer of falsifiability, I see many of Andrew’s suggestions as increasing the likelihood of noticing what’s wrong in a model and elaborating it into a less wrong model – one that will not necessarily minimize some mixture of bias and variance (perhaps very inappropriately traded off for the application) but will increase the likelihood of generalizing to new data sets and applications.

Keith

It's been my good (or bad) fortune not to be a stat guru, although I've also been fortunate to have highly skilled people around. But what I learned (nearly 40 years ago in grad school) still seems like good advice.

When building a model, start with the theory you're using. For example, if you're building a model of labor force participation, what does the theory suggest? Can you measure the factors that the theory suggests are important? If not, can you find good–or better-than-good–proxies?

Then you ask about how you're going to estimate whatever relationship you have to estimate.

Maybe that's obvious, but I've seen a lot of job candidates in economics in the past few years who knew the properties of their estimators, but not the content of their theories.

A great book on this topic is

Harrell, F.: Regression Modeling Strategies.

In most social sciences, probably including economics, theory offers little guidance… so what to do?

off topic:

Does it make sense to include interaction terms without including all their respective constitutive terms? I have seen some economists doing that, but I'm not sure how to interpret it. Maybe it is just wrong. I'm not sure if you discuss this particular topic in your book.
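One way to see what that specification implies is a quick simulation (my own sketch, not from the thread, with invented variables x1 and x2): if you keep x1 * x2 but drop the main effect of x1, the model forces the effect of x1 to be exactly zero whenever x2 = 0, which is usually not what you intend.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True data-generating process has a main effect of x1 plus an interaction.
y = 2.0 * x1 + 0.5 * x1 * x2 + rng.normal(scale=0.5, size=n)

# Misspecified design: interaction included, constitutive term x1 omitted.
X = np.column_stack([np.ones(n), x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Under this model the effect of x1 is coef[2] * x2,
# so the implied slope of x1 at x2 = 0 is forced to be zero.
implied_slope_at_x2_zero = coef[2] * 0.0
print(coef, implied_slope_at_x2_zero)
```

The omitted main effect of x1 also gets dumped into the residual, inflating its variance, so unless theory really says the effect of x1 vanishes at x2 = 0, including both constitutive terms is the safer default.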

Many thanks,

I second the Harrell recommendation.

I regularly point people to Chapter 4, but there's plenty of good stuff in the rest of it.

Theory *always* offers something – otherwise, how did you decide what variables to measure?

Did you include the person's social security number as an IV? Well, why not? Because theory says it's ridiculous.

Sometimes, theory offers a lot, even in social sciences. But it always offers something.