In discussing the ongoing Los Angeles Times series on teacher effectiveness, Alex Tabarrok and I both were impressed that the newspaper was reporting results on individual teachers, moving beyond the general research findings (“teachers matter,” “KIPP really works, but it requires several extra hours in the school day,” and so forth) that we usually see from value-added analyses in education. My first reaction was that the L.A. Times could get away with this because, unlike academic researchers, they can do whatever they want as long as they don’t break the law. They don’t have to answer to an Institutional Review Board.
(By referring to this study by its publication outlet rather than its authors, I’m violating my usual rule (see the last paragraph here). In this case, I think it’s ok to refer to the “L.A. Times study” because what’s notable is not the analysis (thorough as it may be) but how it is being reported.)
Here I’d like to highlight a few other things that came up in our blog discussion, and then I’ll paste in a long and informative comment sent to me by David Huelsbeck.
But first some background.
I have never performed a value-added education analysis myself, but I thought a lot about the topic a few years ago when Jim Liebman, a colleague of mine at Columbia Law School, was named the Chief Accountability Officer for the New York City schools. I read a bunch of research articles on teacher performance and was particularly impressed by the work of Jonah Rockoff and his collaborators, who found that teachers–but not schools–can make a big difference in student performance.
Jennifer and I had a bunch of conversations with Jim about how to do the value-added analysis and how to present the results in an accessible way. I don’t know what the school district finally ended up doing, but I recall that Jim wanted to use gain scores (that is, post-test minus pre-test) whereas Jennifer and I preferred to regress on pre-test to avoid the usual regression-to-the-mean issues and also to bypass problems of calibration that arise when different tests are used in different grades.
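To see why regressing on the pre-test handles regression to the mean while gain scores do not, here is a minimal simulation (all numbers are invented: stable abilities, equally noisy tests, and no true gains at all):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Made-up setup: each student has a stable ability, and the pre- and
# post-tests are equally noisy measures of it, with no true gains.
ability = rng.normal(0, 1, n)
pre = ability + rng.normal(0, 1, n)
post = ability + rng.normal(0, 1, n)

# Gain scores: students with low pre-tests appear to "gain" and students
# with high pre-tests appear to "lose", purely from regression to the
# mean (the slope of gain on pre is about -0.5 in this setup).
gain_slope = np.polyfit(pre, post - pre, 1)[0]

# Regressing post on pre absorbs that artifact: the coefficient on pre
# soaks up the mean reversion, and the residuals are what one would then
# attribute to teachers.
reg_slope = np.polyfit(pre, post, 1)[0]

print(round(gain_slope, 2), round(reg_slope, 2))
```

Even though nobody truly gains or loses here, the gain-score analysis finds a strong negative relation with the pre-test; the regression formulation does not have that problem.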
I’ll now return to our blog discussion of the L.A. Times project.
First, there was a bit of back-and-forth about the feasibility of the proposal to switch to a system in which 80% of teachers are fired within their first two years. There was also some concern about over-reliance on test scores, partly as a measurement issue (getting good scores isn’t the same thing as learning) and also, more seriously from my perspective, an incentives issue about what might happen if individual teachers knew that their “value-added scores” would be made public (and maybe even used to fire them).
In addition, a statistical issue came up: How variable are those estimates for individual teachers? The L.A. Times article featured two teachers with extremely different scores, and I’m guessing that the difference between their ratings is statistically significant. If you just took two teachers at random from the middle of the pack, though, it might be difficult to really know which one is better than the other (on the test-score metric).
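A toy simulation makes the point concrete (the between-teacher spread, class size, and residual variation below are all invented numbers): the two extreme teachers separate cleanly, but two adjacent mid-pack teachers are statistically indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical numbers: true teacher effects with sd 0.1 (in units of
# student-level score SDs); each teacher is estimated from ~25 students
# with residual sd 0.5, giving a standard error of 0.5/sqrt(25) = 0.1.
true_effects = rng.normal(0, 0.1, 200)
se = 0.5 / np.sqrt(25)
estimates = true_effects + rng.normal(0, se, 200)

order = np.argsort(estimates)
lo_t, hi_t = order[0], order[-1]      # the two extreme teachers
mid_a, mid_b = order[99], order[100]  # two adjacent mid-pack teachers

def overlap(i, j):
    # Do the 95% intervals (estimate +/- 1.96*se) for i and j overlap?
    return abs(estimates[i] - estimates[j]) < 2 * 1.96 * se

print(overlap(lo_t, hi_t))    # extremes: typically clearly separated
print(overlap(mid_a, mid_b))  # mid-pack: intervals almost surely overlap
```

With numbers in this general range, ranking the middle of the distribution is mostly noise even when the tails are distinguishable.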
At a technical level, I think they should use multilevel models, partly to get more accurate estimates for individual teachers and partly to address the multiple comparisons problems that will inevitably arise. (See here for my paper with Jennifer and Masanao on multilevel models for multiple comparisons.)
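The partial-pooling idea can be sketched via the empirical-Bayes special case of a normal multilevel model (the between-teacher sd, standard error, and "extreme" threshold below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical numbers: true teacher effects with between-teacher sd
# tau, each observed with a common standard error se.
tau, se, J = 0.10, 0.12, 500
theta = rng.normal(0, tau, J)
y = theta + rng.normal(0, se, J)

# Partial pooling (the empirical-Bayes form of a normal multilevel
# model): shrink each raw estimate toward the grand mean (0 here) by a
# factor reflecting its noise relative to the true between-teacher spread.
shrink = tau**2 / (tau**2 + se**2)
pooled = shrink * y

# The shrunken estimates are closer to the truth on average, and far
# fewer teachers land beyond an arbitrary "extreme" threshold, which is
# one way the multilevel model tames the multiple comparisons problem.
rmse_raw = np.sqrt(np.mean((y - theta) ** 2))
rmse_pooled = np.sqrt(np.mean((pooled - theta) ** 2))
print(round(rmse_raw, 3), round(rmse_pooled, 3))
print(int(np.sum(np.abs(y) > 0.2)), int(np.sum(np.abs(pooled) > 0.2)))
```

A full multilevel model would estimate tau from the data and allow unequal standard errors, but the direction of the effect is the same: individual estimates get pulled toward the mean, and fewer spurious "stars" and "duds" survive.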
Finally, here are David Huelsbeck’s remarks:
Most of the statistical methodology described in the report by Richard Buddin of RAND that provides the basis for the LA Times article is straightforward and widely used. The details of FGLS or of the Bayesian methods used to correct for measurement error (not identified in the white paper) are not terribly important here. There is nothing special about so-called Value-Added Measurement (VAM); it’s just a context-specific brand name for using student test scores after controlling for other antecedents.
The Buddin white paper does a fair job of describing the study and its broad results, but as Steve Sailer notes in the comments here and at Marginal Revolution, it does little to bolster the claim that the value-added estimates are useful for evaluation of individual instructors or schools. The LA Times articles, however, strike me as completely irresponsible in their representation of the study and its limitations. Richard Buddin’s white paper suggests that he is not likely the source of the problem. Perhaps the LAT reporters lack the capacity to really understand what it is that they are reporting on or perhaps they’re trying to “sex up” the story; probably some of both.
It will be interesting to see what form the LAT release of the individual teacher results takes. Will they publish bare point estimates of VAM? Likely. It would be far more responsible of them to report a 95% confidence interval for each teacher, though I imagine they might protest that doing so would be confusing to the average reader. Of course, that confusion is probably warranted in this case.
I have some issues with the method. First, the model is a simple linear additive model: a student’s gain from the 45th to the 55th percentile is treated as equivalent to a gain from the 89th to the 99th. Also, a linear model is used even though the dependent variable is clearly limited, so I would expect the model to perform poorly toward the extremes. Can anyone here provide an informed opinion as to how much this might be expected to influence the validity of the estimates for individual teachers or schools?
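To illustrate the ceiling concern, here is a toy simulation (the ceiling value, class size, and effect sizes are all hypothetical): two teachers with the same true effect get very different measured gains when one class sits near the top of the scale.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25     # hypothetical class size
cap = 2.0  # hypothetical test ceiling (in score SDs)

def observed_gain(class_mean, effect):
    # Latent learning is additive, but the observed post-test score is
    # censored at the ceiling.
    pre = rng.normal(class_mean, 0.5, n)
    latent_post = pre + effect + rng.normal(0, 0.3, n)
    post = np.minimum(latent_post, cap)
    return (post - pre).mean()

# Same true teacher effect (+0.3), very different apparent gains:
low = observed_gain(0.0, 0.3)   # class far from the ceiling
high = observed_gain(1.8, 0.3)  # class crowded against the ceiling
print(round(low, 2), round(high, 2))
```

The teacher with the high-achieving class looks much weaker simply because there is no room on the test for those students to show their gains.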
As is often the case, the method relies on the assumption that the student-year error terms are exogenous. Given that the lagged test score is treated as a sufficient statistic for all prior inputs, I would expect this assumption to be violated. The use of robust standard errors only helps with the tests of significance or the computation of confidence intervals. However, I question whether the assignment of students to teachers is sufficiently random for this not to impact the individual VAM estimates. If a teacher inherits most of each year’s incoming class from an especially (in)competent teacher in the lower grade, a situation that is likely to be persistent, would we not still expect the VAM estimate to be biased?
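A small simulation illustrates how a persistent feeder-teacher effect can bias the estimates when the lagged score is not a sufficient statistic (all magnitudes here are invented: the feeder boost shows up fully in the pre-test but only half of it persists to the post-test):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: every current teacher has a true effect of zero,
# but half of them inherit classes from a strong feeder teacher (+0.3)
# and half from a weak one (-0.3). The feeder boost appears fully in the
# pre-test, yet only half of it persists into the post-test year.
J, n = 100, 25
feeder = np.repeat(np.tile([0.3, -0.3], J // 2), n)
ability = rng.normal(0, 1, J * n)
pre = ability + feeder + rng.normal(0, 0.3, J * n)
post = ability + 0.5 * feeder + rng.normal(0, 0.3, J * n)

# Naive VAM: pooled OLS of post on pre, then the per-class mean residual
# is attributed to the current teacher.
b, a = np.polyfit(pre, post, 1)
resid = post - (a + b * pre)
vam = resid.reshape(J, n).mean(axis=1)

# Teachers downstream of the strong feeder look systematically *worse*,
# purely because of the inherited, decaying boost.
strong = vam[::2].mean()   # classes fed by the +0.3 teacher
weak = vam[1::2].mean()    # classes fed by the -0.3 teacher
print(round(strong, 2), round(weak, 2))
```

Because the feeder assignment is persistent, this is a bias, not noise; it would not average out over years of data.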
Finally, a quibble with Buddin’s presentation more than with the method: there is no presentation of the restricted model, excluding the teacher VAM, to compare with the full model. I do note, comparing Table 4 with Table 8 (the teacher and school VAM are estimated independently), that although the variance of the school-effect estimates is in Buddin’s words “quite small” while that of the teacher effects is “large,” the R-squared of the two models differs by less than 0.01 for ELA and 0.001 for Math. This makes me suspicious that neither the teacher effects nor the school effects add much to the model. It’s not legit to guess from comparisons of coefficient estimates and standard errors, but my guess would be that the lagged test scores are doing all of the heavy lifting in these models, and that the Cohen’s f-squared for the individual teacher VAM is vanishingly small. In keeping with the Bayesian bent of this forum: by how much would one rationally revise one’s prior estimate of an individual teacher’s performance on the basis of this VAM estimate?
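For reference, Cohen’s f-squared for the incremental contribution of the teacher block can be computed directly from the two R-squared values. The numbers below are hypothetical, chosen only to match the “differ by less than 0.01” observation above:

```python
def cohens_f2(r2_full, r2_restricted):
    # Incremental effect size for the block of predictors that the full
    # model adds relative to the restricted model (here, the teacher
    # indicators).
    return (r2_full - r2_restricted) / (1 - r2_full)

# Hypothetical R-squared values; the resulting f-squared falls below
# even Cohen's conventional "small" threshold of 0.02.
print(round(cohens_f2(0.605, 0.600), 3))  # 0.013
```

With R-squared gaps this small, the implied effect size for the teacher indicators as a block is indeed tiny.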
The paper by McCaffrey, Sass, Lockwood, and Mihaly (2009), “The Intertemporal Variability of Teacher Effect Estimates,” Education Finance and Policy 4(4), referenced by Buddin, goes a long way toward addressing the concerns of others here regarding the typical magnitude of standard errors, variability, and forecasting accuracy of such models. Those authors estimate that restricting the grant of tenure to teachers with VAM estimates in the top three quintiles would be expected to improve test scores by about 0.04 standard deviations.
There is an extensive literature from both compensation and learning that details why measures such as these are likely to be more harmful than helpful in this context, but this is a statistics blog and this comment is too long already.
I’m not setting up Huelsbeck as some sort of unquestionable authority figure here. For example, I don’t share his concerns about the use of an additive model for a bounded variable: this sort of thing is done all the time and causes little harm; you just have to treat the estimates as some sort of average predictive comparison. But his comments generally seem reasonable to me, and he’s certainly coming at this with more knowledge than I have.