It’s no fun being graded on a curve

Mark Palko points to a news article by Michael Winerip on teacher assessment:

No one at the Lab Middle School for Collaborative Studies works harder than Stacey Isaacson, a seventh-grade English and social studies teacher. She is out the door of her Queens home by 6:15 a.m., takes the E train into Manhattan and is standing out front when the school doors are unlocked, at 7. Nights, she leaves her classroom at 5:30. . . .

Her principal, Megan Adams, has given her terrific reviews during the two and a half years Ms. Isaacson has been a teacher. . . . The Lab School has selective admissions, and Ms. Isaacson’s students have excelled. Her first year teaching, 65 of 66 scored proficient on the state language arts test, meaning they got 3’s or 4’s; only one scored below grade level with a 2. More than two dozen students from her first two years teaching have gone on to . . . the city’s most competitive high schools. . . .

You would think the Department of Education would want to replicate Ms. Isaacson . . . Instead, the department’s accountability experts have developed a complex formula to calculate how much academic progress a teacher’s students make in a year — the teacher’s value-added score — and that formula indicates that Ms. Isaacson is one of the city’s worst teachers.

According to the formula, Ms. Isaacson ranks in the 7th percentile among her teaching peers — meaning 93 per cent are better. . . .

How could this happen to Ms. Isaacson? . . . Everyone who teaches math or English has received a teacher data report. On the surface the report seems straightforward. Ms. Isaacson’s students had a prior proficiency score of 3.57. Her students were predicted to get a 3.69 — based on the scores of comparable students around the city. Her students actually scored 3.63. So Ms. Isaacson’s value added is 3.63-3.69.

Remember, the exam is on a 1-4 scale, and we were already told that 65 out of 66 students scored 3 or 4, so an average of 3.63 (or, for that matter, 3.69) is plausible. The 3.57 is “the average prior year proficiency rating of the students who contribute to a teacher’s value added score.” I assume that the “proficiency rating” is the same as the 1-4 test score but I can’t be sure.

The predicted score is, according to Winerip, “based on 32 variables — including whether a student was retained in grade before pretest year and whether a student is new to city in pretest or post-test year. . . . Ms. Isaacson’s best guess about what the department is trying to tell her is: Even though 65 of her 66 students scored proficient on the state test, more of her 3s should have been 4s.”

This makes sense to me. Winerip seems to be presenting this as some mysterious process, but it seems pretty clear to me. A “3” is a passing grade, but if you’re teaching in a school with “selective admissions” with the particular mix of kids that this teacher has, the expectation is that most of your students will get “4”s.

We can work through the math (at least approximately). We don’t know how this teacher’s students did this year, so I’ll use the data given above, from her first year. Suppose that x students in the class got 4’s, 65-x got 3’s, and one student got a 2. To get an average of 3.63, you need 4x + 3(65-x) + 2 = 3.63*66. That is, x = 3.63*66 - 2 - 3*65 = 42.58. This looks like x=43. Let’s try it out: (4*43 + 3*22 + 2)/66 = 3.63 (or, to three decimal places, 3.636). This is close enough for me. To get 3.69 (more precisely, 3.697), you’d need 47 4’s, 18 3’s, and a 2. So the gap would be covered by four students (in a class of 66) moving up from a 3 to a 4. This gives a sense of the difference between a teacher in the 7th percentile and a teacher in the 50th.
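
As a sanity check, the same arithmetic can be sketched in a few lines of Python. The function names are my own, and the class composition (66 students, exactly one 2, everyone else a 3 or a 4) is the first-year breakdown assumed above, not data from the actual teacher data report.

```python
# Sketch of the back-of-the-envelope calculation above.
# Assumption: 66 students, exactly one scored a 2, the rest scored 3 or 4.
N_STUDENTS = 66
N_TWOS = 1


def fours_needed(target_mean, n_students=N_STUDENTS, n_twos=N_TWOS):
    """Solve 4x + 3(n - n_twos - x) + 2*n_twos = target_mean * n for x."""
    n_rest = n_students - n_twos  # students scoring 3 or 4
    return target_mean * n_students - 2 * n_twos - 3 * n_rest


def class_mean(n_fours, n_students=N_STUDENTS, n_twos=N_TWOS):
    """Average score for a class with n_fours 4's, n_twos 2's, the rest 3's."""
    n_threes = n_students - n_twos - n_fours
    return (4 * n_fours + 3 * n_threes + 2 * n_twos) / n_students


# Round the fractional solutions to whole students.
actual_fours = round(fours_needed(3.63))     # reported actual average
predicted_fours = round(fours_needed(3.69))  # reported predicted average

print(actual_fours, round(class_mean(actual_fours), 3))
print(predicted_fours, round(class_mean(predicted_fours), 3))
print("students who would need to move from 3 to 4:",
      predicted_fours - actual_fours)
```

Under these assumptions the gap between the actual and predicted averages comes down to a handful of students crossing the 3/4 boundary.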

I wonder what this teacher’s value-added scores were for the previous two years.

23 thoughts on “It’s no fun being graded on a curve”

  1. This gives a sense of the difference between a teacher in the 70th percentile and a teacher in the 50th.

    unfortunate typo here – that should read 7th percentile.

    I think you're right on with the direction of your analysis – the scandal is not so much that she receives a low score, but that the va scoring would be – presumably – so unstable. If the difference between a lowest 10% teacher and an average teacher is that insignificant, do the va scores really tell us anything?

  2. I think you may have missed a very important point here: the system is being graded on a linear scale when the marginal improvements are not linear. In simpler terms, they are assessing an increase from 2.0 to 2.1 (delta = 0.1) the same as an increase from 3.6 to 3.7 (same delta of 0.1). But going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

    To put it another way, imagine there is a weight loss contest. A 300 lb. person can lose 20 lbs. with not much difficulty. But can a 120 lb. person also lose 20 lbs. so easily? I'm not sure they have addressed the non-linearity of their system properly.

  3. There seem to be a few problems with the scoring system.

    First is the use of discrete grades. The scoring instrument is too sensitive to a single grade shifting from a 4 to a 3.

    Second is the failure to account for mean reversion. Every grade is a random variable. A "3" student may get a "4" on any given day because of lucky guesses or a lucky match of questions to their preparation. Likewise, a "3" student could get a "2" or a "1" on any given day, due to illness, family problems or other disruptions. A class with high prior scores could have been lucky, on average in the prior year and would be expected on average to revert.

    Finally, and perhaps most important, there appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

  4. the article actually says this – the CI for her score is between the 0th and 52nd percentile. That's just crazy.

    Also, are they really using linear formulas for this? That, too, seems crazy.

  5. I found out several years ago that you just don't go around telling people (or hospitals, in my case) 'well you suck' based on a score that has a huge margin of error (7th – 52nd percentile!?). That nearly got my employer a one-way ticket to court.

    I love measuring and devising ways you can use data to steer firms, or teachers, but if your system gets this kind of press, the econometrician has failed big-time. Don't put nerds on high profile cases without the communication department nearby.

    That being said: The story makes you feel the lady is a good teacher, but all they give is information on how many hours she works, how happy her boss is and the colleges she went to.

    Adding value is something else and the model still might capture that component. Teaching & improving the smartest kids might take something more than enthusiasm.

  6. I attended the top high school in Pennsylvania. We achieved 100% proficiency (all 3s or 4s), but we failed our No Child Left Behind report card because we did not make adequate progress compared to the previous year (also 100% proficiency)! I guess we were supposed to have more 4s too…

  7. The cynic in me suggests this questionable measurement in education may be intentional. Let's suspend disbelief for a moment.

    Teachers unions like weak measurement because it strengthens their arguments for seniority based systems (which all unions have a bias towards).

    Management likes weak systems because (a) when they get criticized like this, they have an excuse to ignore the measurements "to study them further" (b) managements have a bias towards disguised arbitrariness [an example in my case being complicated math to determine my bonus, but math that can be manipulated by the levels above me]. So if they get to override the measurements when they want to, management has more control.

  8. ‘Ms. Isaacson may have two Ivy League degrees, but she is lost.’

    The NY Times continues to spread the elitist (and dead wrong) idea that graduates of ‘elite’ universities somehow have special and superior education in all things. I wouldn’t expect someone who teaches English and Social Studies (and thus majored in English or Social Studies) to understand statistical modeling any more than I’d expect an individual with an Ivy League MBA to understand macroeconomic theory and foreign affairs.

  9. My first thoughts were difficulties with mean reversion and non-linearity of score improvements here for students near the top tier (both mentioned in previous comments). I would hope that they accounted for both of these in the model.

    Did they? From the looks of things in the NYT article, they did not.

    That's leaving aside the fact that scoring students on a discrete 1-2-3-4 scale seems limited in its ability to do anything useful besides separate the lazy/incompetent or learning disabled from the reasonably literate students. Besides that, they seem to measure improvement on tests that are actually different.

    If I score a 3 on my 7th grade test and a 3 on my 8th grade test, does that mean I did not improve? If the tests are grade-level appropriate, then you can still improve in competency without moving ahead in your respective percentile score.

  10. The other problem is that the scoring system 1 (0?) to 4 is almost categorical. If the cut point for the 3 to 4 transition is, say, 35 correct, a student with a 34 will get a 3 and a student with 35 correct will get a 4. The same one-point difference will not help a student with a 20 move from a 3 to a 4. So, the 1 (or 0?) to 4 point scale can not be considered linear, but they are still using them as if they were linear by computing averages.

    This kind of scoring suffers from the same type of weakness as indicators such as "percentage at or above norms" or "percentage meeting or exceeding standards." They ignore the variation within categories, and are very sensitive to movement of scores near the cut-points.

  11. Gary Post and Raymond point out what seems to be a fundamental failing of this particular model of value added: the formula for the value "added" fails to account for the value added by maintaining a previous score.

    To use Raymond's example, if a school's current average score is already at the maximum of 4, the model says that no more value can be added; the school can only take value away. This conclusion is absurd, however, because maintaining perfect school-wide 4s adds a lot of value over the expected status quo.

  12. From EPI

    Donald Rubin, a leading statistician in the area of causal inference, reviewed a range of leading VAM techniques and concluded: "We do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions."….

    (I believe Andrew can vouch for this Rubin guy.)

  13. Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.

    But, if she is, the formula can't measure it because 4 is the highest score you can get.

    She would be better off under this formula ignoring all her best students and concentrating on her worst students.

  14. One would think that given 40+ years of Ed School research and experience, and 50+ years of quality management research and experience, a reasonably robust methodology with well understood application limits would have been worked out and standardized quite a while ago.

    One wonders why this is not the case.

    Using metrics like this to judge individual performance is a fairly notorious problem, for many of the reasons cited – sample size, limited control of the output by the person being evaluated, insufficient precision in scale, etc. It's better applied at the school level.

    But I'd be quite surprised if you asked the principal, and polled the teachers, if they didn't have pretty good agreement on the identity of the worst, and best, 10% of the teachers.

  15. QG,

    Yes, a well-designed 360 (I'd get some input from the students too) would be superior to what Winerip describes and if you combined it with a better test-based metric and some broader measures of student success you could actually come up with an excellent system.

    The problem is that the debate is intensely politicized (including some truly bizarre left/right alliances) and, to be blunt, education researchers have often done a poor job addressing the complexity of their subject (the practice of doing educational studies on the cheap doesn't help). All of this makes it difficult to get a good, robust system in place.

  16. It seems to me as if the idea of value-added scoring is a great idea, but the implementation has been somewhere between meaningless and offensive. As noted by lots of others, the 1/2/3/4 scoring is the worst problem. The underlying tests presumably ask a lot more than 4 questions, so quantizing that coarsely is just throwing away a huge amount of information (in the Shannon sense). Presumably using raw scores, perhaps on a 0-100 scale, would also get rid of a lot of the ceiling effects, too.

    This said, other people have talked about assumptions that the value-added model makes that aren't justified, such as the random assignment of students to teachers. And it seems as if multi-year smoothing would be really useful too.

  17. As W. Edwards Deming, economics practitioner, used to say, a deadly disease of management is to "base improvement upon visible numbers alone." He changed the game for Japan's industrial model after WWII and said with prescience in 1984 that western scientific management will not make the United States competitive as a workforce in the changing economic world. We have continued to use cheap measures of outcomes in lieu of developing the capacity of our workforce to own their work, continuously improve, and deliver quality. Instead, our "bean counting" behaviorists have taken on a more intense and new level of frenzy to improve the education sector and make a corporate buck in the bargain.
    We have NEVER invested much in capacity building among our educators, never provided the time for quality improvement focus time in their work schedule, and never looked at making radical changes that would impact the antecedents of poverty that children bring with them to kindergarten.

    To lose an educator who is bright, committed and loves her job because of an evaluation score based on "visible numbers alone" makes no sense as a return on investment. If the goal is more "4s" then invest in what it takes to help her get more "4s" – firing our way to success will be a lousy return on investment, discourage young people from selecting into the profession, and not solve the problems we face with the system. Read about Deming and ask yourself why our system is broken – it's not the educators, he would say. I agree.

  18. There are some features of Winerip's article which are not as clear as they could be, and some wrinkles about this particular case that weren't reported in the article. But a few things can be clarified. The New York State 7th grade math test is a mixture of multiple-choice and constructed response items, with a raw score ranging from 0 to 50. These raw scores are converted to scale scores ranging from 500 to 800, with relatively larger standard errors of measurement for extreme values, and the smallest standard error of measurement, 6 points, around the threshold for a student to be judged proficient. Students are classified into four proficiency levels, I, II, III, and IV. In 2009, the cut scores for Levels II, III and IV were 611, 650 and 693, respectively. These corresponded to raw scores of 11, 22 and 43, respectively. What New York City does to create "proficiency" scores is a linear interpolation based on the thresholds for the four proficiency levels. Thus, a raw score of 36, corresponding to a scale score of 677, would be assigned a "proficiency" score of 3 + (677-650)/(693-650) = 3.63. The maximum "proficiency" score, associated with a perfect raw score of 50 and a scale score of 800, is set to 4.5. There is no psychometric justification for this linear interpolation. The contractor for the value-added modeling, currently the Value-Added Research Center at the University of Wisconsin-Madison, normalizes the scale scores within subjects (i.e., English Language Arts and math), grades (3 through 8) and years (2005 to 2009, as available), and it's these normalized scale scores that are the outcomes in the value-added model.

  19. Shouldn't an evaluation system like this one also place a strong emphasis on transparency/interpretability? I mean, it's fine to have a black box spit out a number if you're some politician or thick-witted administrator who just wants an "objective" way of meeting your punitive quota, but one thing that is clear from the NYT article is that the system is not providing any useful feedback to the teachers. Given the granularity of the scoring and the distribution in this teacher's class, I guess it's clear that the difference must be in the number of 3's that didn't become 4's, but this really ought to be reported in a more direct way…setting aside all of the more fundamental problems that people here have noted.

  20. The assessment formula as implemented is worse than useless. The nonlinearity is very important, as is the wide confidence interval. That the confidence interval (0-52%) includes 0% indicates that the link function is flawed. Your confidence limits shouldn't be able to run up against the lower limit of 0.000000+% if you are using a proper link function. They should do the interval in proficiency score space, then convert those limits to probability space. Secondly, if the confidence interval is so wide, you can't distinguish the change in performance as significantly different from zero, let alone the unidentified tenure-earning percentage (that seems to be somewhat less than 52%).

    If the administrators are using these metrics to grant or delay tenure decisions, they are essentially using random number generators.

  21. What does a confidence interval really mean in this case anyway? A VAM score is basically a descriptive statistic, albeit a very complicated one. Does it make sense to think of it as a statistic in the classical sense and calculate a 95% confidence interval? Is one assuming then that the teacher's students are some theoretical random sample from all potential students that could have been in the teacher's class? How should we be quantifying error for such situations?

Comments are closed.