More on those L.A. Times estimates of teacher effectiveness

Posted on August 23, 2010 6:37 AM by Andrew

In discussing the ongoing Los Angeles Times series on teacher effectiveness, Alex Tabarrok and I both were impressed that the newspaper was reporting results on individual teachers, moving beyond the general research findings (“teachers matter,” “KIPP really works, but it requires several extra hours in the school day,” and so forth) that we usually see from value-added analyses in education. My first reaction was that the L.A. Times could get away with this because, unlike academic researchers, they can do whatever they want as long as they don’t break the law. They don’t have to answer to an Institutional Review Board.

(By referring to this study by its publication outlet rather than its authors, I’m violating my usual rule (see the last paragraph here). In this case, I think it’s ok to refer to the “L.A. Times study” because what’s notable is not the analysis (thorough as it may be) but how it is being reported.)

Here I’d like to highlight a few other things that came up in our blog discussion, and then I’ll paste in a long and informative comment sent to me by David Huelsbeck.

But first some background.

I have never performed a value-added education analysis myself, but I thought a lot about the topic a few years ago when Jim Liebman, a colleague of mine at Columbia Law School, was named the Chief Accountability Officer for the New York City schools. I read a bunch of research articles on teacher performance and was particularly impressed by the work of Jonah Rockoff and his collaborators, who found that teachers–but not schools–can make a big difference in student performance.

Jennifer and I had a bunch of conversations with Jim about how to do the value-added analysis and how to present the results in an accessible way. I don’t know what the school district finally ended up doing, but I recall that Jim wanted to use gain scores (that is, post-test minus pre-test) whereas Jennifer and I preferred to regress on pre-test to avoid the usual regression-to-the-mean issues and also to bypass problems of calibration that arise when different tests are used in different grades.
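A minimal simulation sketch can make the regression-to-the-mean issue concrete. It assumes a toy world (not the district's data or method) in which each student has a fixed true ability, both tests are noisy readings of it, and there is no teacher effect at all. The function names and noise levels are hypothetical.

```python
import random


def simulate_scores(n_students=5000, noise_sd=1.0, seed=0):
    """Toy data: no teacher effect; both tests are noisy readings of ability."""
    rng = random.Random(seed)
    pre, post = [], []
    for _ in range(n_students):
        ability = rng.gauss(0.0, 1.0)  # hypothetical true ability
        pre.append(ability + rng.gauss(0.0, noise_sd))
        post.append(ability + rng.gauss(0.0, noise_sd))
    return pre, post


def mean(xs):
    return sum(xs) / len(xs)


def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def regression_residuals(pre, post):
    """Residuals from a least-squares fit of post-test on pre-test."""
    mx, my = mean(pre), mean(post)
    slope = sum((x - mx) * (y - my) for x, y in zip(pre, post)) / sum(
        (x - mx) ** 2 for x in pre
    )
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


pre, post = simulate_scores()
gain = [b - a for a, b in zip(pre, post)]
resid = regression_residuals(pre, post)

# Gain scores are negatively related to the pre-test even with no teacher effect;
# regression residuals are uncorrelated with the pre-test by construction.
gain_vs_pre = correlation(gain, pre)
resid_vs_pre = correlation(resid, pre)
print(round(gain_vs_pre, 2), round(resid_vs_pre, 2))
```

With equal ability and noise variances, as assumed here, the theoretical correlation between gain score and pre-test is -0.5, so a teacher who happens to get students with high pre-test scores would look worse on gain scores for reasons unrelated to teaching.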

I’ll now return to our blog discussion of the L.A. Times project.

First, there was a bit of back-and-forth about the feasibility of the proposal to switch to a system in which 80% of teachers are fired within their first two years. There was also some concern about over-reliance on test scores, partly as a measurement issue (getting good scores isn’t the same thing as learning) and also, more seriously from my perspective, an incentives issue about what might happen if individual teachers knew that their “value-added scores” would be made public (and maybe even used to fire them).

In addition, a statistical issue came up: How variable are those estimates for individual teachers? The L.A. Times article featured two teachers with extremely different scores, and I’m guessing that the difference between their ratings is statistically significant. If you just took two teachers at random from the middle of the pack, though, it might be difficult to really know which one is better than the other (on the test-score metric).
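As a rough sketch of that comparison, assuming the two teachers' estimates are independent and each comes with a standard error, the difference can be checked with a simple z-statistic. The numbers below are hypothetical and are not taken from the L.A. Times data.

```python
import math


def difference_z(est_a, se_a, est_b, se_b):
    """z-statistic for the difference of two independent estimates."""
    return (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical value-added estimates (in test-score sd units) and standard errors.
z_extremes = difference_z(0.30, 0.08, -0.30, 0.08)  # two teachers far apart
z_middle = difference_z(0.03, 0.08, -0.02, 0.08)    # two teachers near the middle
print(round(z_extremes, 2), round(z_middle, 2))
```

Under these made-up inputs the extreme pair clears the conventional 1.96 threshold easily, while the middle-of-the-pack pair does not come close.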

At a technical level, I think they should use multilevel models, partly to get more accurate estimates for individual teachers and partly to address the multiple comparisons problems that will inevitably arise. (See here for my paper with Jennifer and Masanao on multilevel models for multiple comparisons.)
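The core idea of partial pooling can be sketched in a few lines. This is a simplified normal-normal calculation with the group mean and between-teacher standard deviation treated as known, which is an assumption made here for illustration. A real multilevel model would estimate them from the data, and the teachers and numbers are hypothetical.

```python
def partial_pool(raw_estimate, std_error, group_mean, group_sd):
    """Precision-weighted compromise between a raw estimate and the group mean."""
    w_data = 1.0 / std_error ** 2
    w_prior = 1.0 / group_sd ** 2
    return (w_data * raw_estimate + w_prior * group_mean) / (w_data + w_prior)


# Hypothetical teachers: (raw estimate, standard error).
teachers = {"A": (0.40, 0.30), "B": (0.40, 0.05), "C": (-0.10, 0.15)}
group_mean, group_sd = 0.0, 0.10  # assumed known here; a full model estimates them

pooled = {
    name: partial_pool(est, se, group_mean, group_sd)
    for name, (est, se) in teachers.items()
}
for name, value in pooled.items():
    print(name, round(value, 3))
```

Teachers A and B have the same raw estimate, but A's is much noisier and so is pulled much closer to the group mean. That shrinkage of noisy extremes is what tempers the multiple comparisons problem.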

Finally now here are David Huelsbeck’s remarks:

Most of the statistical methodology described in the report by Richard Buddin of RAND that provides the basis for the LA Times article is straightforward and used widely. The details of FGLS or the Bayesian methods used to correct for measurement error (not identified in the white paper) are not terribly important here. There is nothing special about so-called Value-Added Measurement (VAM). It’s just a context-specific brand name for using student test scores after controlling for other antecedents.

The Buddin white paper does a fair job of describing the study and its broad results, but as Steve Sailer notes in the comments here and at Marginal Revolution, it does little to bolster the claim that the value-added estimates are useful for evaluation of individual instructors or schools. The LA Times articles, however, strike me as completely irresponsible in their representation of the study and its limitations. Richard Buddin’s white paper suggests that he is not likely the source of the problem. Perhaps the LAT reporters lack the capacity to really understand what it is that they are reporting on or perhaps they’re trying to “sex up” the story; probably some of both.

It will be interesting to see what form the LAT release of the individual teacher results takes. Will they publish bare point estimates of VAM? Likely. It would be far more responsible of them to report the unique 95%-confidence interval for each teacher. Though, I imagine that they might protest that doing so would be confusing to the average reader. Of course, that confusion is probably warranted in this case.

I have some issues with the method. First, the model is a simple linear additive model. A student gain from the 45th to the 55th percentile is treated as equivalent to that from the 89th to the 99th. Also, the linear model is used though the dependent variable is clearly limited. I would expect the model to perform poorly towards the extremes. Can anyone here provide an informed opinion as to how much this might be expected to influence the validity of the estimates for individual teachers or schools?

As is often the case, the method relies on the assumption that the student-year error terms are exogenous. Given that the lagged test score is treated as a sufficient statistic for all prior inputs, I would expect this assumption to be violated. The use of robust standard errors only helps with the tests of significance or computation of confidence intervals. However, I question whether the assignment of students to teachers is sufficiently random for this not to impact the individual VAM estimates. If a teacher inherits most of each year’s incoming class from an especially (in)competent teacher in the lower grade, a situation that is likely to be persistent, would we not still expect the VAM estimate to be biased?

Finally, a quibble with Buddin’s presentation more than the method: there is no presentation of the restricted model ex-teacher VAM to compare with the full model. I do note, comparing Table 4 with Table 8 (the teacher and school VAM are estimated independently), that though the coefficient estimate variance of the school effects is in Buddin’s words “quite small” while that of the teacher effects is “large,” the R-sq of the two models differs by less than 0.01 for ELA and 0.001 for Math. This makes me suspicious that neither the teacher effects nor the school effects add much to the model. It’s not legit to guess from comparisons of coefficient estimates and standard errors, but my guess would be that the lagged test scores are doing all of the heavy lifting in these models. My guess is that the Cohen’s f-squared for the individual teacher VAM is vanishingly small. In keeping with the Bayesian bent of this forum, by how much would one rationally revise one’s prior estimate of an individual teacher’s performance on the basis of this VAM estimate?

The paper, McCaffrey, Sass, Lockwood and Mihaly (2009) The Intertemporal Variability of Teacher Effect Estimates, Education Finance and Policy, 4:4 referenced by Buddin goes a long way towards addressing the concerns of others here regarding the typical magnitude of standard errors, variability and forecasting accuracy of such models. Those authors estimate that restricting the grant of tenure only to teachers with VAM estimates in the top three quintiles would be expected to improve test scores by about 0.04 standard deviations.

There is an extensive literature from both compensation and learning that details why measures such as these are likely to be more harmful than helpful in this context, but this is a statistics blog and this comment is too long already.
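For readers unfamiliar with the Cohen’s f-squared mentioned in the remarks above, here is a small sketch of the standard formula for a block of added predictors, f² = (R²_full − R²_restricted) / (1 − R²_full). The R-squared values are hypothetical placeholders, not numbers from Buddin’s tables, which do not report the restricted model.

```python
def cohens_f2(r2_full, r2_restricted):
    """Cohen's f-squared for the block of predictors added in the full model."""
    if not 0.0 <= r2_restricted <= r2_full < 1.0:
        raise ValueError("need 0 <= r2_restricted <= r2_full < 1")
    return (r2_full - r2_restricted) / (1.0 - r2_full)


# Hypothetical R-squared values, not taken from Buddin's tables.
f2 = cohens_f2(r2_full=0.70, r2_restricted=0.69)
print(round(f2, 3))
```

By Cohen’s conventional benchmarks (0.02 small, 0.15 medium, 0.35 large), an increment of 0.01 in R-squared at this level would sit near the “small” end, which is the kind of comparison the restricted model would allow.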

I’m not setting up Huelsbeck as some sort of unquestionable authority figure here. For example, I doubt I would share his concerns about the use of an additive model for a bounded variable–this sort of thing is done all the time and causes little harm; you just have to treat the estimates as some sort of average predictive comparison. But his comments generally seem reasonable to me, and he’s certainly coming at this with more knowledge than I have.
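For a sense of scale on the percentile point in Huelsbeck’s remarks, the following sketch converts percentile moves into standard-deviation units. It assumes test scores are normally distributed, which is an illustrative assumption rather than a property of the actual tests.

```python
from statistics import NormalDist


def percentile_gain_in_sd(start_pct, end_pct):
    """Size of a percentile move in sd units, assuming normally distributed scores."""
    unit_normal = NormalDist()
    return unit_normal.inv_cdf(end_pct / 100) - unit_normal.inv_cdf(start_pct / 100)


middle = percentile_gain_in_sd(45, 55)
top = percentile_gain_in_sd(89, 99)
print(round(middle, 2), round(top, 2))
```

Under the normal assumption, the move from the 89th to the 99th percentile is more than four times as large in standard-deviation units as the move from the 45th to the 55th. How much that matters for the teacher-level estimates depends on how many students sit near the extremes.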

12 thoughts on “More on those L.A. Times estimates of teacher effectiveness”

jme on August 23, 2010 at 8:51 am said:

"It will be interesting to see what form the LAT release of the individual teacher results takes. Will they publish bare point estimates of VAM? Likely. It would be far more responsible of them to report the unique 95%-confidence interval for each teacher. Though, I imagine that they might protest that doing so would be confusing to the average reader. Of course, that confusion is probably warranted in this case."

This is precisely why I ranted slightly in my comment on the previous post. I'm agnostic on the actual merits of the research; I'm not qualified to judge.

But I am frequently frustrated by how easy it is for news outlets to earn praise from stats people without making any attempt to educate/explain the most basic statistical concept: variation.

I agree with Dr. Gelman that the difference between the most/least effective teachers is most likely "statistically significant", and I'm not harping on the lack of CIs because I think the research is bogus and that the error bars will reveal that fact.

It's just that any ranking based on point estimates _has_ to be interpreted in the context of the error estimates. This is a crucial skill in understanding data analysis, and I'd prefer it if people reserved their praise for news outlets for when they actually attempt this (difficult) task with skill.

In terms of furthering good discussions/explanations of statistics and data in popular news outlets, articles that simply avoid mentioning error estimates have simply dropped the ball.

K? O'Rourke on August 23, 2010 at 12:37 pm said:

Unlike researchers, they can report who refused to comment (I ran into this years ago when surveying published authors about financial relationships – you are not supposed to report who refused to answer your survey).

Nice that you posted a link on Jim Liebman's _adventures in school_

K?

Megan Pledger on August 23, 2010 at 1:22 pm said:

Here is part of what John Rogers, associate professor at the UCLA Graduate School of Education and Information Studies and director of the Institute for Democracy, Education and Access, said at the LA Times site…

(Quoted directly since he says it better than I do and has a more authoritative source…)

The National Academy of Sciences has identified several of the problems posed by value-added methods.

First, the National Academy of Sciences notes that student assignments to schools and classrooms are rarely random. It's not possible to definitively determine whether higher or lower student test scores result from teacher effectiveness or are an artifact of how students are distributed.

Second, you can't compare the growth of struggling students with the growth of high performers. In technical terms, standardized tests do not form equal interval scales. Enabling students to move from the 20th percentile to the 30th is not the same as helping students move from the 80th to the 90th percentile.

Third, estimates of teacher effectiveness can range widely from year to year. In recent studies, 10% to 15% of teachers in the lowest category of effectiveness one year moved to the highest category the following year, while 10% to 15% of teachers in the highest category fell to the lowest tier.

The National Academy of Sciences concluded that value-added methods "should not be used as the sole or primary basis for making operational decisions because the extent to which the measures reflect the contribution of teachers themselves, rather than other factors, is not understood."

David Huelsbeck on August 23, 2010 at 3:16 pm said:

To hopefully clarify a few of my remarks that were quoted by Professor Gelman above:

First, to the extent that I can claim to have one, this is *not* my area of expertise. My questions were genuine, not rhetorical.

From what I can tell, the Buddin study was state-of-the-art given the data available to him.

My concern is that while the study is well suited to addressing overall policy questions (e.g. Should LAUSD allocate limited funds to reducing average class size or to hiring more teachers with advanced degrees?) it is not very well suited to answering teacher-specific questions (e.g. Would my child be better off in Ms. A's 4th grade class or Mr. B's next year?). The answer to the second type of question might be somewhat clear in the cherry-picked contrast of Smith and Aguilar, but in general there will be little information in the vast majority of relevant pairwise contrasts.

Steve Sailer on August 23, 2010 at 5:35 pm said:

Thanks.

That was exactly my impression of Buddin's paper for the LA Times: it was good on the general issues of what affects test scores, but strikingly lacking in evidence validating the LA Times' intention of publicly praising some teachers and publicly shaming others.

What I've noticed in the LA Times' articles is an unspoken assumption that "teacher effectiveness" must be the cause of otherwise unexplained changes in test scores. It looks like Teacher Effectiveness is being assumed to be the catch-all explanation for everything that can't be explained by a handful of standard variables such as class size. Maybe it is, but I didn't see much attempt to justify that view.

K? O'Rourke on August 24, 2010 at 4:28 am said:

Missing managerial overlay/firewall?

One of the last projects I was involved in prior to formally getting into statistics was a project to evaluate value added by dentists in a public dental health system (the funding for public dental health was eventually cut by newly elected, less left-leaning politicians).

It had a nice focus – perhaps some dentists being left leaning and finding more cavities on the left versus right side of the mouth (which they could be given feedback on) and even sexier – catching those who were stealing the silver that was used in fillings (where they could be sent to prison).

But early in the project it was obvious to all involved that these "signals" needed managerial review before they were "acted" upon in any way.

And I believe that should be the case for anything that has a particular impact on an individual – stuff happens and sometimes this stuff is known to those involved.

K?

ceolaf on August 24, 2010 at 8:10 am said:

Basic issues that the value added methodology has so far failed to address:

1) False assumption of interval scales (mentioned above)

2) False assumption of random assignment of children to teachers (erroneously tossed aside by Gelman. Rothstein jr (princeton?) has shown that value added can be used to measure teacher impacts on students' PREVIOUS scores. Clearly non-random assignment.)

3) False assumption that teachers' primary responsibility is to cover tested material. Yes, pure content is incredibly important in graduate school and college. But there are more fundamental things going on in k12 (especially k6) that do not appear in the standards. Moreover, not all the standards are tested on the tests.

4) The predictability of tests (look at Koretz's Measuring Up to learn more) means that some teachers can knowingly focus on the tests to the exclusion of untested standards and the exclusion of other skills, mindsets and aptitudes (e.g. organization, how to take criticism, how to revise one's own work, considering alternative approaches, etc.). Compare the list of tested standards to the list of untested standards & other skills/mindsets/aptitudes and think about which ones should get the lion's share of instruction/attention. High stakes tests (i.e. stakes for schools, for students and now for teachers) give teachers large incentives to focus simply on the tested standards, to the exclusion of the larger formal and informal curriculum, meaning that value added intrinsically (at least currently) fails to measure teacher quality.

5) Because of the need for multiple years of data, lagged impact on students is rarely (never?) examined. There is a paper out there that shows (with random assignment, btw) that teachers with the best impact on short-term student scores (i.e. final exam) had the worst impact on long-term scores (i.e. final exam of the next course in the required sequence). Focus on the test you know is coming or focus on the stuff that's really important for the next course? These are not necessarily the same thing, and instructional time is quite limited.

ceolaf on August 24, 2010 at 8:11 am said:

Sorry. I meant to paste in a citation for that paper.

Carrell, S. & West, J. (2010). Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors

Russell Almond on August 24, 2010 at 8:36 am said:

I was going to make a longer comment on the original post, but just ran out of time. A few quick points:

1) I think it is irresponsible of the LA Times to post names unless they have a lot more information. There are a large number of reasons why a good teacher could have negative gain scores. My favorite is that of an ESL teacher whose students are transferred to a mainstream class when they master enough English; therefore the average class scores SHOULD go down if she is doing her job well.

Quite likely there is a problem; however, hordes of angry parents demanding that their students be transferred out of Smith's class are not likely to help the principal solve it.

2) Henry Braun wrote an ETS policy report covering value added models and similar issues. It is available from:
http://www.ets.org/Media/Research/pdf/PICVAM.pdf

Fred Thompson on August 25, 2010 at 6:30 am said:

Many of you may find this of interest in that it appears to show how value-added tests can be used with some sensitivity to encourage better teaching: http://www.tapsystem.org/publications/wp_eval.pdf

Kaiser on September 1, 2010 at 9:48 am said:

I haven't looked at the papers yet so won't make comments on the methodology as of now, but I do have some questions about the practical utility of this model:

1) Should they have been modeling absolute rather than relative scores? While I haven't seen the data, in theory percentiles can go down even when absolute scores go up. In theory, two students earning the same scores on the same test in two different years will end up with different percentile scores. Are we wanting to measure learning in an absolute sense or in a relative sense?

2) How many points in absolute score does a percentile-score translate to? In one of the examples, a "top performer" moved the average percentile scores by 5 percentile points. Does that mean a raw score improvement of 1% or 10%? If it is more like 1%, is this a practically significant number?

3) If teachers should be labeled as good or bad based on some system like this, we need to have a way to help "bad" teachers improve. That requires that we understand the causes of "good" and "bad" teachers. Nothing in the article indicates that we have the slightest clue about this. In fact, midway through the article, the reporters reach what I had called "story time" (see current post on my blog)… that is when they silently moved from an evidence-based discussion to making up stories about the data e.g. "the surest sign of a teacher's effectiveness was the engagement of his or her students". No data/evidence was cited in support of this causal statement.

4) Given the fixed pool of teachers at any given time, some students will need to be assigned to so-called "bad" teachers. The proponents have not discussed how they would solve this allocation problem. Good teachers sold to the highest bidder? (just kidding)

5) If indeed the hypothesis is true that allocation of students to teachers is random, then I can't see that we have any policy issues. Allocation seems fair.

6) Andrew already said this but the issue of perverse incentive is severe when teacher ratings such as these are published and especially if these ratings then lead to firings/promotions.

7) sorry this is not in order but going back to the first statement about modeling percentiles rather than raw scores… does this model create losers always even when the entire population of students improve their raw scores?

8) I side with Andrew in thinking that "fire the bottom 80%" is a ludicrous idea. You could do such a thing in a school or two but you can't do this to an entire school district. Given that there is some fixed number of teachers needed to staff the schools (the fire 80% rule does not change that), if we have to pay each retained teacher say 2-3 times the current salary to participate in this system, we have just doubled/tripled the budget. Who is paying for this?

Richard Rasiej on May 8, 2011 at 3:13 pm said:

I am a math teacher in Los Angeles, teaching at an LAUSD school. My academic background is mathematics (all but dissertation about 35 years ago) and my professional background, until I got into teaching six years ago, was actuarial science and financial modeling. One of the subjects I teach is AP Statistics, so I am conversant with the subject, although probably not at the level of even a good recently minted undergraduate statistics major.

Nevertheless, I'm sure you can appreciate my interest in the entire Value Added Method and its variants. The latest, as far as Los Angeles is concerned, is something called Academic Growth over Time (AGT), which was developed by LAUSD in association with the Value-Added Research Center of the Wisconsin Center for Education Research at the University of Wisconsin. According to the technical document, which is available on the LAUSD website for this project, "the AGT model is defined by four equations: a 'best linear predictor' AGT model defined in terms of true student post and prior achievement and three measurement error models for observed post and prior achievement"; the three measurement error models are for posttest measurement error, same-subject pretest measurement error, and other-subject pretest measurement error. Also, teacher effects are estimated using Empirical Bayes shrinkage.

How are teachers, much less the public, supposed to react to the introduction of models such as these? How can I best go about the work of determining for myself whether this model is useful for what its stated purpose is supposed to be? Personally, I am in favor of trying to come up with a better system of retaining good teachers and making it easier to dismiss those who fall short; I’m just struggling with trying to evaluate what’s being dumped in our (the teachers’) laps.

To a large extent, can't one make the argument that mandating a particular model and providing technical justification which is virtually incomprehensible to almost all the stakeholders is a form of mathematical mugging, where a desired policy outcome is dressed in the garb of allegedly "objective" mathematics?
