# Data Analysis Using Regression and Multilevel/Hierarchical Models

Our book is finally out! (Here’s the Amazon link) I don’t have much to say about the book here beyond what’s on its webpage, which has some nice blurbs as well as links to the contents, index, teaching tips, data for the examples, errata, and software.

But I wanted to say a little about how the book came to be.

When I spoke at Duke in 1997, two years after the completion of the first edition of Bayesian Data Analysis, Mike West asked me when my next book was coming out. At the time, I was teaching statistical modeling and data analysis to the Ph.D. statistics students and was realizing that there were all sorts of things that I had thought were common knowledge, but that were not really written in any book, and that the students were struggling with. These skills included:

– Simple model building–for example, taking the logarithm when appropriate, building regression models by combining predictors rather than simply throwing them in straight from the raw data file.
– Simulation–for example, I had an assignment to forecast the 1986 legislative elections by district, using the 1984 data as a predictor, then to use this model to predict 1988, then get an estimate and confidence interval for the number of seats won by the Democrats in 1988. This is straightforward using regression and simulation, but none of the students even thought of doing simulation. They all tried to do it with point predictions, thus getting wrong results.
– Hypothesis testing as a convenient applied tool. For example, looking at the number of boys and girls born in each month over two years in a city, and using a chi^2 test to check for evidence of over- or under-dispersion.
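The seat-forecasting exercise above can be sketched in a few lines. This is a minimal illustration with synthetic data (the actual election data are not reproduced here), written in Python rather than the book's R: fit a regression of current on previous vote share, then simulate the predictive distribution of every district's outcome and count seats in each simulation. A point prediction would give a single deterministic seat count; the simulation propagates the predictive uncertainty into an interval.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-in for the election data: previous Democratic vote
# share by district (x) and current share (y), linearly related with noise.
n_districts = 435
x = rng.uniform(0.2, 0.8, n_districts)
y = 0.1 + 0.85 * x + rng.normal(0, 0.06, n_districts)

# Fit y ~ x by least squares.
X = np.column_stack([np.ones(n_districts), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma = np.sqrt(resid @ resid / (n_districts - 2))

# Predictive simulation: in each simulation draw, simulate every
# district's vote share, then count the seats won (share > 0.5).
n_sims = 1000
seat_sims = np.empty(n_sims)
for s in range(n_sims):
    y_pred = X @ beta + rng.normal(0, sigma, n_districts)
    seat_sims[s] = np.sum(y_pred > 0.5)

est = seat_sims.mean()
lo, hi = np.percentile(seat_sims, [2.5, 97.5])
print(f"estimated seats: {est:.0f}, 95% interval: [{lo:.0f}, {hi:.0f}]")
```

(A fuller version would also propagate uncertainty in the regression coefficients themselves, as the sim() function in the arm package does.)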
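The dispersion check in the last bullet can also be written out. Below is a sketch with made-up monthly birth counts (Python for illustration): estimate the overall proportion of boys, then compare the month-by-month variation to what the binomial model predicts, using a chi^2 statistic on 24 − 1 = 23 degrees of freedom (one degree of freedom is lost to estimating the proportion).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: total births per month for 24 months, and the
# number of boys among them.
n_births = rng.integers(800, 1200, size=24)
boys = rng.binomial(n_births, 0.512)

p_hat = boys.sum() / n_births.sum()
expected = n_births * p_hat
variances = n_births * p_hat * (1 - p_hat)

# If the binomial (constant-probability) model holds, this statistic is
# approximately chi^2 with 23 degrees of freedom.
chi2 = np.sum((boys - expected) ** 2 / variances)
df = len(boys) - 1
p_upper = stats.chi2.sf(chi2, df)   # small => evidence of overdispersion
p_lower = stats.chi2.cdf(chi2, df)  # small => evidence of underdispersion
print(f"chi2 = {chi2:.1f} on {df} df")
```

A statistic far above 23 suggests overdispersion (month-to-month variation beyond binomial); far below suggests underdispersion.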

And a bunch of other things, including the use of regression in causal inference, how randomized experiments work, practical model checking, discrete models other than the logit, etc etc. (Students from back then will recall the examples from the homeworks: the elections, the chickens, the dogs, the TV show, etc., most of which have made their way into the new book.) No book covered this stuff. I tried Snedecor and Cochran, but this just described methods without much explanation. Cox and Snell’s Applied Statistics book looked good, but students got nothing out of it–I think that book is great for people who already know applied statistics but not so useful for people who are trying to learn the topic.

So I thought I’d write a book called “Introduction to Data Analysis,” a prequel to Bayesian Data Analysis, with all the important things that I thought students should already know before getting into serious modeling. (I also had plans to discuss the steps of practical data analysis, including how to set up a problem, and a bunch of other things that I can’t remember. This never led anywhere, but at some point I’d like to pick that up again.) I took some notes and thought occasionally about how to put the book together.

The next step came in 2002, when I was talking with Hal Stern and he suggested that “Intro to Data Analysis” (or, as he put it, “All about Andy”) wasn’t enough of a unifying principle. We discussed it and came up with the idea of structuring the book around regression. I liked this idea, especially given Gary King’s comment from several years earlier that stat books tend to spend lots of time on simple models that aren’t so useful. I loved the idea of starting with regression right away, and helping students learn about the benefits of regression modeling, rather than mucking around with all those silly iid models.

(Just as an aside: I really really hate when textbooks give inference for iid Poisson data. I don’t think I’ve ever seen such a thing: multiple observations of Poisson data with a constant mean. Somebody will probably correct me on this, but I think it just doesn’t happen. I have to admit that we do give this model in BDA, but we immediately follow it with the more realistic model of varying exposures.)

Anyway, back to the book: starting with the good stuff is definitely the way to go. I tried to follow the book-writing rule of “tell ’em what they don’t know.” It’s supposed to be a “good parts” version (as William Goldman would say) of regression. It’s still pretty long, but that’s because regression has lots of good parts. Having Jennifer as a collaborator helped a lot, giving a second perspective on everything in addition to her special expertise in causal inference.

## 11 thoughts on “Data Analysis Using Regression and Multilevel/Hierarchical Models”

1. Amazon now knows that the book is available; I have ordered a copy. Thanks, Andrew and Jennifer…I look forward to reading it.

Bill

2. When you say to ignore the “not available yet” notice on Amazon, is there a way to bypass it? I'm still getting the book as not available yet.

3. On p. 20 you say, "It is never possible to "accept" a statistical hypothesis, only to find that the data are not sufficient to reject it." Isn't it the case that, strictly speaking, this is not correct? What about two one-sided t-tests (TOST, i.e., equivalence testing)? One can easily set up the usual null hypothesis as the alternative hypothesis and the alternative hypothesis as the null hypothesis. In effect one can end up accepting a statistical hypothesis (with a certain "confidence" interval). My understanding is that this is the rationale for testing the equivalence of generic drugs with brand drugs. I have a simple example with references on pages 132-134 of some notes I wrote: here.
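The TOST procedure the comment describes can be sketched as follows, with made-up measurements and a hypothetical equivalence margin delta (this is an illustration, not the commenter's actual example): run two one-sided t-tests against the margins −delta and +delta, and take the larger of the two p-values. A small combined p-value lets you "accept" the hypothesis that the two means are equivalent to within delta.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical measurements from a generic and a brand drug.
generic = rng.normal(10.0, 1.0, 50)
brand = rng.normal(10.1, 1.0, 50)

# Equivalence margin: call the drugs equivalent if the true mean
# difference lies within +/- delta.
delta = 0.5

diff = generic.mean() - brand.mean()
se = np.sqrt(generic.var(ddof=1) / len(generic)
             + brand.var(ddof=1) / len(brand))
dfree = len(generic) + len(brand) - 2  # simple approximation to the df

# Two one-sided tests: H0a: diff <= -delta, and H0b: diff >= +delta.
t_lower = (diff + delta) / se
t_upper = (diff - delta) / se
p_lower = stats.t.sf(t_lower, dfree)   # evidence against diff <= -delta
p_upper = stats.t.cdf(t_upper, dfree)  # evidence against diff >= +delta
p_tost = max(p_lower, p_upper)         # small => conclude equivalence
print(f"TOST p-value: {p_tost:.4f}")
```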

4. Reading the ARM book some more, there are some issues with the code.

First, the ~gelman/arm/examples directory is hard to download using, e.g., wget -r. I don't want to click through each directory and download the files one by one (who would?). But wget results in all kinds of irrelevant html files getting included. Please provide a zip file of the examples directory.

I still intend to read the book (because I paid for it ;-), but I have to read it passively, without running the novel parts of the code.

The arm package cannot be installed except on Windows, because of its dependencies. It is non-trivial to run even the chapter 3 code in a non-Windows environment (Mac OS 10.3.9); one wastes a lot of time just getting to the point that display() works. I unpacked the arm package from CRAN and extracted the functions "manually". I still cannot get sim() to work.

I think you should prominently display the fact that the book's code is essentially impossible to run on non-Windows machines. I think I would have thought twice about buying the book if I knew this. Normally one does not expect a book based on R to run only on Windows (see MASS by Venables and Ripley, the book by Pinheiro and Bates, and any number of other R/S-Plus books).

For the moment I think you should have a more impoverished version of arm as a package so that non-Windows people can use it too. This should at least have the datasets and the non-BUGS-based functions. As it stands, the book's code is pretty difficult to use.

I am also interested in the question: what proportion of Windows versus Darwin/Linux users is capable of programming? How much of your intended audience do you lose by restricting the code to Windows? (I know this restriction derives from BUGS and not from your work, but still.) My guess would be that you've lost a large audience.

5. In section 4.6 of the book, p. 69, the authors provide several principles, one of which is:

"Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome".

Consider an lmer model like the following, where sentence reading time RT is modeled by two orthogonal contrasts c1 and c2 in a repeated measures design; say there are three conditions a, b, c in the experiment that each subject saw (hence within subjects), and that each subject saw multiple sentences (items) carrying the manipulation of interest (the three conditions):

m1 = lmer(RT~c1+c2+(1|subject)+(1|item),data)

Now, say I fit another model m2 where I drop the (random) intercept for item because it has very low variance:

m2 = lmer(RT~c1+c2+(1|subject),data)

Then I run a model comparison

anova(m1,m2)

and the anova reports that m2 is the better model: there is no gain in keeping the random intercept for item.

Does it then make sense to remove the random intercept for item, even though substantively it would make sense to keep it, given that we do expect different sentences to make different contributions to reading time due to all kinds of factors orthogonal to the experimental manipulation (such as plausibility in the real world, etc.)? This kind of model comparison is discussed in detail in the Pinheiro and Bates book, so I am a bit unclear on how to apply the principle above in light of such discussions in the literature. The authors mention AIC and BIC in the book, but the practice of dropping factors as a consequence of model comparison is not discussed, as far as I can tell.

Also, it is not clear where the above principle comes from; it would be more interesting to understand WHY the authors proposed this principle. Perhaps it's in the book and I haven't found it yet.

6. Vasishth,

Thanks for the comments. To respond in order:

1. Yes, here I was thinking about classical hypothesis tests such as t-tests, F-tests, etc., where one is testing some specific model (which can typically be set up as a hypothesis that some particular parameter equals zero). I agree that it should be possible to accept (with reasonable certainty) a hypothesis such as theta>0.

The general approach of the book is estimation, not hypothesis testing. Here we were talking about these quick hypothesis tests that can be useful in applied statistics. I would not be testing hypotheses such as theta>0 (or |theta_1 – theta_2|

7. Andrew,

I am about halfway through the book and think it is great. Not only have I learned a ton of useful techniques, but also I have a much better understanding of how to interpret models and results. I'm looking forward to digging into the "good stuff" in the second half.

Regarding point 3 from your most recent comment, I suspect that you could address all of Vasishth's concerns about downloading files by providing a single Zip archive that contains all of the examples. His root problem was not being able to conveniently download all of the examples for offline study.

(Wget, BTW, is a program that can be used to recursively download hierarchies of files. It is often employed to download "everything" in a directory when a convenient all-the-files-in-one-big-bundle archive is not provided.)

Cheers,
Tom

8. Thanks for the detailed responses. I give in; I got hold of a Windows machine to finish this book (actually, an Intel Mac so I can dual boot it between Windows and OS X :). It's a pretty amazing book. In my opinion, the R code is an integral component; those using other software will be missing out (not least because of the developments in the lme4 and related packages).

In point 1, you say: "The general approach of the book is estimation, not hypothesis testing." This is probably just my ignorance, but I have never really understood the difference. When we do hypothesis tests, we want to know whether, say, the difference between two conditions is significant. But the p-value is closely tied to the confidence interval estimate; in essence, the p-value comes from there.

Here is a real-life example that I am working on right now. I have data from a reaction time study where the p-value for a particular experimental condition (coded with indicator variables) is 0.02, but when I compute the highest posterior density (HPD) interval for the coefficient corresponding to the condition (using R's mcmcsamp function) I get the HPD interval [-0.07, 0].

Now the question is: how do I interpret this effect? The effect is in the "predicted direction" and if I were to follow the conventional approach used in experimental research in my area (psycholinguistics), I would declare victory and move on. But the MCMC-based estimates do not really support the conclusion that there is an effect.

It seems to me that estimation and hypothesis testing cannot really be separated. I think I see the point that the goal of a particular research exercise can be only to find out what the coefficients are, along with their intervals, i.e., to do estimation (I assume this is what you meant by estimation). But even there, if the interval includes 0, we would conclude that the predictor does not have a significant effect on the dependent variable. We are essentially doing a hypothesis test.

I can't see how we do one without the other. Would you agree with this?

9. Vasishth,

In some settings (for example, coefficients in a linear model), estimation and hypothesis testing are simply dual problems, but in other settings, they are different. For example, consider inference for a variance parameter (such as "tau," the sd of the school effects in the 8-schools example of chapter 5 of Bayesian Data Analysis). We know that the true value of tau is positive, and we can summarize, for example, by an HPD interval. We can also test the hypothesis that tau=0. These are different problems.

In your example, if the 95% interval for theta is [-.07, 0], that sounds like it supports the conclusion that there is an effect. In practice it will depend on how large "0.07" is in real terms.
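For readers unfamiliar with HPD intervals: given a set of posterior draws, the HPD interval is the shortest interval containing the desired posterior probability. A minimal Python sketch (the draws here are synthetic, chosen to roughly mimic the [-0.07, 0] example above, not taken from the actual study):

```python
import numpy as np

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing `prob` of the posterior draws."""
    sorted_draws = np.sort(np.asarray(draws))
    n = len(sorted_draws)
    n_in = int(np.ceil(prob * n))
    # Slide a window of n_in consecutive draws and pick the narrowest one.
    widths = sorted_draws[n_in - 1:] - sorted_draws[: n - n_in + 1]
    i = np.argmin(widths)
    return sorted_draws[i], sorted_draws[i + n_in - 1]

rng = np.random.default_rng(8)
# Hypothetical posterior draws for a coefficient centered near -0.035.
draws = rng.normal(-0.035, 0.018, 10_000)
lo, hi = hpd_interval(draws, 0.95)
print(f"95% HPD interval: [{lo:.3f}, {hi:.3f}]")
```

For a symmetric posterior the HPD interval roughly matches the central interval; for skewed posteriors (like that of a variance parameter) the two can differ noticeably.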

P.S. I don't know that I trust the mcmcsamp() function. I just don't know exactly what it's doing. If you do want to use it, I'd suggest using mcsamp() which is our front end that runs multiple chains and converts to a Bugs object for easy display.

10. I finally installed Windows on my Mac (a traumatic experience) and finally got the code working. However, the startup instructions on the book's website did not work for me. I offer a working example for other souls as clueless as myself. The first problem is that the libraries have to be installed manually; they do not install automatically as advertised. Second, the library R2WinBUGS has to be loaded explicitly to run the critical bugs command.

Also, if anyone out there is thinking of installing a dual-boot environment on a Mac in order to install WinBUGS, there is a bug (no pun intended) in the license installation of WinBUGS. The decode command for the license does not work as advertised, but the license installs anyway.

The working version is here:
http://www.ling.uni-potsdam.de/~vasishth/temp/sch