
Value-added assessment political FAIL

Jimmy points me to a sequence of posts (Analyzing Released NYC Value-Added Data Parts 1, 2, 3, 4) by Gary Rubinstein slamming value-added assessment of teachers.

A skeptical consensus seems to have arisen on this issue. The teachers groups don’t like the numbers and it seems like none of the reformers trust the numbers enough to defend them. Lots of people like the idea of evaluating teacher performance, but I don’t see anybody out there wanting to seriously defend the numbers that are being pushed out here.

P.S. Just to be clear, I’m specifically addressing the problems arising in value assessment of individual teachers. I’m not criticizing the interesting research by Jonah Rockoff and others on the distribution of teacher effects. It’s a lot easier to estimate the distribution of a set of parameters than to estimate the parameters individually.


  1. Laszlo says:

    This is way too important a topic, and the Chetty-Friedman-Rockoff work way too serious, to dismiss it easily. Could you comment more on why their results are so limited? Why aren’t their results promising for using VA for teacher retention, etc.? To put it one way: isn’t the distribution of teacher effects a super-important step? (And a phenomenal sample for estimating the distribution of long-term, really important effects?) What else would you need or use? As a Bayesian? :)

    See around page 50 here:

    (But if readers care, you can “promote” the whole site and an excellent video of Chetty’s presentation:

    • Andrew says:


      1. I think you misread my post. I specifically wrote that I’m not criticizing the research of Rockoff and his colleagues.

      2. Whatever you think of the use of value-added assessment to fire teachers, my impression is that it’s been a political failure. I see very few people defending the numbers.

      • Laszlo says:

        Andrew, I saw what you wrote originally (“not criticizing”), I just had to ask what you mean by that. Labeling it “a lot easier” is not exactly defending it (esp. its usefulness) or opposing Rubinstein. Or did you want to criticize him? Why not, then, if I may ask?

        • Andrew says:

          I’m neither supporting nor criticizing Rubinstein. Rather I’m making the meta-point that I don’t see many people out there defending value-added assessments of individual teachers.

          • Laszlo says:

            I see. Though the attention Chetty et al. got was extraordinary and it was not completely hostile.

            That said, would you like to be a person “out there defending value-added assessments of individual teachers?”

            (Basically, this post could have been less meta. But maybe I simply missed your previous posts on this.)

  2. MAYO says:

    “It’s a lot easier to estimate the distribution of a set of parameters than to estimate the parameters individually.” Indeed, and the latter is what is often needed in a scientific appraisal of a parameter (e.g., in a theory).

    • K? O'Rourke says:

      Possibly distinguish more clearly here between the representation and what is being represented?

      It is not necessarily more wrong to represent a single fixed unknown parameter (the represented) by a distribution (the representation) – for someone, for some purpose.

  3. Jonathan (a different one) says:

    While it’s certainly easier to estimate a distribution than to assign parameters to individual observations, can’t we be fairly confident that the low 5 percent are really worse than the top 5 percent? Granting that we can say very little about whether the low 5 percent are better or worse than the 5 percent just above them, why do we care?

    The lack of willingness of anyone to stand up and defend the results, I suspect, stems from a misunderstanding of what they’re trying to do — truncate the left end of the quality distribution with a full understanding that who finishes there, as opposed to just above, is largely a matter of luck.

    • K? O'Rourke says:

      And that’s why – understandably – no one wants to be fairly evaluated (with impact).

      The contribution of luck cannot be completely removed, and before the evaluation, no one knows what their luck (and its impact) will be.

      • Jonathan (a different one) says:

        Sure, the teachers don’t want to be judged, fairly or unfairly, but that doesn’t explain why the people who devised the test don’t want to defend it. (But see my response to Brett below.)

    • Brett Keller says:

      Jonathan – I sympathize, but Rubinstein’s argument is that we can’t even use it for that, because it’s not clear the bottom and top 5 percent are that different from the rest. He compares teachers in the top and bottom with themselves in other years, with themselves teaching different subjects, and (in some cases where data are available both ways in the same years) with themselves teaching the same subject in the same year, and finds that there’s not nearly as much consistency as you’d expect if those percentiles really meant what we think they do.
      That is, you can’t truncate the left end of the quality distribution if there’s so much noise that people jump in and out of that tail (and not just from a point just on the other side of the dividing line) from year to year and test to test.

      • Jonathan (a different one) says:

        First, so long as the test has some validity, even a highly variable one, there ought to be some benefits, even if attenuated. But second, this reminds me of the GE strategy of firing the lowest 10 percent of managers. It is sometimes argued that this policy has beneficial results even in the absence of evidence that the lowest 10 percent are well measured at all. Of course, it depends on an ability to find replacements, for example, and on a very carefully structured system of rewards to avoid sapping morale.

  4. jrkrideau says:

    I have not read much about value-added [1], except that I stumbled over the first Rubinstein post a few weeks ago along with some comments by another blogger, G F Brandenberg, but it strikes me that (speaking in psychometric terms) there is either a construct-validity problem, a grievous failure to operationalize the construct, or both.

    If those graphs are portraying the overall results accurately then the measurement system is no better than a random number generator.

    Does anyone know if there is a manual/technical papers on the development of the idea?

    1. I’m not in education and I live in Canada not the USA.

  5. mpledger says:

    What I can’t understand is why every American statistician wasn’t jumping up and down saying how awful the statistical process was and that it should be abolished, given that it’s going to seriously stuff up people’s lives when misidentified teachers get fired.

    • Andrew says:


      I think that’s the point. Lots of people attacked the assessments and not much of anybody on the other side was there to defend them. So I’m guessing these assessments aren’t going to be used for much.

      • David says:

        It’s a bit frustrating, in that nobody is giving much credence to what seems like the obvious conclusion: teacher quality doesn’t vary very much from teacher to teacher. The actual VAM models themselves do a great job predicting individual student performance; it’s just that teacher effects are not particularly large.

      • Jacob Hartog says:

        There are a host of arguments against these models in general: Jesse Rothstein’s paper discussing the biases introduced by dynamic tracking, in which he showed that a 5th grade teacher’s value added score was correlated with a student’s progress in 4th grade, was perhaps the most dramatic.

        What I think has been inadequately discussed is the use of individual specifications, rather than the zone of agreement across a broad swath of specifications. For example, the model used by NYCDOE doesn’t just control for a student’s prior-year test score (as I think everyone can agree is a good idea). It also assumes that different demographic groups will learn different amounts in a given year, and assigns a school-level random effect. The result, as was much ballyhooed at the time of the release of the data, is that the average teacher rating for a given school is roughly the same, no matter whether the school is performing great or terribly. The headline from this was “excellent teachers spread evenly across the city’s schools,” rather than “the specification of these models assumes that excellent teachers are spread evenly across the city’s schools.”

        To be partisan for a moment, imagine using a multi-level model to assess the efficacy of basketball players that imposed a team-level random effect: we might easily ‘discover’ that the average player on the Charlotte Hornets was as good as the average player on the Chicago Bulls, when really a) that is an effect of the model design, and b) good players are what makes a team good, just as good teachers are by-and-large what makes a school good.

        If I were more sophisticated, I’d try to extend Rothstein’s paper to show that dynamic sorting of teachers into high- and low-functioning schools messes up the models just as badly as dynamic sorting of students does.

        I should add that, as someone who taught in NYC schools for 8 years, I don’t think there’s anything wrong with measurement per se. The pre-existing observation and evaluation system was completely terrible, and it is totally reasonable to combine evaluations with test-based measurement in making decisions. But analytically privileging the results of tests over other forms of observation, let alone assigning a percentile to the coefficient of a single (remarkably complex) model and publishing it in the newspapers, is absolutely bonkers.

        I’d appreciate anyone’s comments showing me where I’m wrong.

        P.S. Rubinstein wrote a very funny and valuable book called “Reluctant Disciplinarian” about his early years teaching, that’s very much worth a read if you’re going to teach K-12. I’m not totally sold on his blog, though.

        • Ed says:

          “To be partisan for a moment, imagine using a multi-level model to assess the efficacy of basketball players that imposed a team-level random effect: we might easily ‘discover’ that the average player on the Charlotte Hornets was as good as the average player on the Chicago Bulls, when really a) that is an effect of the model design, and b) good players are what makes a team good, just as good teachers are by-and-large what makes a school good.”

          Although I agree with your point that if teachers are non-randomly sorted into classrooms in the same way that students are, the inclusion of classroom-level covariates will mask true teacher effects, your basketball analogy is a bit off. In your analogy, the players are more analogous to the students, and the value-added analysis would be better applied to the coach. Your argument that the team is good because the players are good would therefore seem to reinforce the notion that good schools are good because they have good students. A value-added model for the NBA would consider all players to be equal, with winning and losing dependent only on the value added by the coach.
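  Hartog’s specification point can be checked with a minimal simulation (all numbers below are hypothetical, and subtracting school means is used only as a stand-in for the fixed-effect limit of a school-level random effect): if good teachers cluster in good schools, the school adjustment absorbs exactly the between-school differences, and every school’s average estimated teacher is forced to look the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, teachers_per, students_per = 20, 10, 25

# True teacher quality is clustered: good schools have good teachers.
school_mean = rng.normal(0, 1.0, n_schools)  # between-school spread
teacher_q = school_mean[:, None] + rng.normal(0, 0.3, (n_schools, teachers_per))

# Student gains = teacher quality + student-level noise.
gains = teacher_q[:, :, None] + rng.normal(0, 1.0, (n_schools, teachers_per, students_per))

# Naive estimate: each teacher's mean student gain.
naive = gains.mean(axis=2)

# "School effect" specification: subtract each school's mean first
# (the limiting case of soaking up school differences with a school term).
adjusted = naive - naive.mean(axis=1, keepdims=True)

# Between-school spread of estimated average teacher quality:
print(naive.mean(axis=1).std())     # tracks the true school differences
print(adjusted.mean(axis=1).std())  # zero by construction
```

  With the school term included, the between-school spread of estimated teacher quality vanishes no matter how strong the true clustering is — the “excellent teachers spread evenly across the city’s schools” headline in miniature.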

  6. Clark Andersen says:

    Teacher evaluation is fundamentally challenging if you wish to go beyond visual observations of teachers going through the motions expected of good teaching (which was the standard approach when I was teaching high school math in the ’90s). The intellectual (and motivational) bell curve is a real phenomenon, and teachers of classes packed with students at the lower end of the scale will have a very different effect on their students than teachers with students at the higher end, for a given effort of teaching. This is further complicated by the need for more behavioral management at the lower end, which further reduces effective teaching time. The best evaluation approaches I’ve seen aggregate the changes in individual students attributable to various teachers, but even with adjustments for student demographics this seems a very noisy process at the scale of individual teachers. At the elementary school level, a single teacher may have fewer than 20 students over the course of a year. A high school teacher may have nearly 10X that number, but scattered over 5-8 heterogeneous classes.

  7. revo11 says:

    I’m not too familiar with this literature, but aren’t there often questionable linearity assumptions in these analyses? For example, the value/difficulty associated with improving a test score from a 50 to a 60 is likely to be very different from the value/difficulty associated with improving a test score from an 85 to a 95. Whereas my impression is that these “value added” assessments don’t make such a distinction and might be missing the external validation data needed to calibrate such distinctions.
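    revo11’s point can be illustrated with a toy calculation (the logit is used here only as one example of a nonlinear “difficulty” scale, not the scale any actual value-added model uses): a gain that is constant in raw points corresponds to very different gains once scores are mapped onto a nonlinear scale.

```python
import numpy as np

def logit(p):
    # log-odds: one illustrative nonlinear "difficulty" scale
    return float(np.log(p / (1 - p)))

# The same 10-point raw gain at two different baselines:
low_gain = logit(0.60) - logit(0.50)   # moving a score from 50 to 60
high_gain = logit(0.95) - logit(0.85)  # moving a score from 85 to 95
print(round(low_gain, 2), round(high_gain, 2))  # 0.41 1.21
```

    On this scale the 85-to-95 improvement is roughly three times the 50-to-60 one, even though both are “ten points” — so a model that treats raw-score gains as interchangeable is making a substantive calibration assumption.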

  8. […] Gelman notes that, on the subject of value-added assessments of teachers, “a skeptical consensus seems to […]

  9. EB says:

    Rubinstein’s plots are misleading. Because the data are quantized, points pile on top of one another invisibly. Density plots show a clearer relationship:

    (Whether these correlations are meaningful is a separate question, but the statistical graphics here certainly could be improved.)

    • Andrew says:


      I agree that density plots can be useful, but I disagree with Stucchio that scatterplots are bad. Why not do both? The scatterplot gives you a sense of the individual data that you don’t get from a heat map.
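    The overplotting problem is easy to quantify without any plotting library (the scores below are simulated, with a hypothetical year-to-year correlation of 0.35 chosen purely for illustration): once percentiles are quantized to integers, thousands of observations collapse onto far fewer visible dots, so a plain scatterplot understates both the amount of data and the strength of the relationship.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical pair of year-to-year percentile scores with true correlation
# 0.35, rounded to whole percentiles the way released ratings are quantized.
z = rng.normal(size=n)
w = 0.35 * z + np.sqrt(1 - 0.35 ** 2) * rng.normal(size=n)
x = np.clip(np.round(50 + 15 * z), 0, 99).astype(int)
y = np.clip(np.round(50 + 15 * w), 0, 99).astype(int)

visible = len(set(zip(x, y)))  # distinct dots a scatterplot can actually show
r = np.corrcoef(x, y)[0, 1]    # the correlation the piled-up points obscure
print(n, visible, round(r, 2))
```

    The number of distinct plotted points is far smaller than the number of observations, which is exactly why jittering, alpha blending, or a density/hexbin view (alongside, not instead of, the scatterplot) helps here.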

  10. […] Hartog writes the following in reaction to my post on the use of value-added modeling for teacher assessment: What I [Hartog] think has been […]