Value-added modeling in education: Gaming the system by sending kids on a field trip at test time

Just in time for Halloween, here’s a horror story for you . . .

Howard Wainer writes:

In my book “Uneducated Guesses,” in the chapter on value-added models, I discuss how the treatment of missing data can have a profound effect on the estimates of teacher scores. I made up a scenario in which a principal sends the best students on a field trip at the beginning of the year, when the ‘pre-test’ is given (so that their scores are imputed from the students who showed up), and the bottom half of the class has a matching field trip on the day of the post-test. Everyone laughed.

But apparently someone decided to take it seriously.

El Paso Schools Confront Scandal of Students Who ‘Disappeared’ at Test Time

You can’t make this stuff up.

This sort of thing is not surprising, but it’s worth keeping in mind. That a measurement system can be gamed does not mean it’s useless, but part of good measurement is to consider these problems.

11 thoughts on “Value-added modeling in education: Gaming the system by sending kids on a field trip at test time”

  1. Couldn’t this gaming be caught by having them report the metrics “percent of eligible student cohort absent at pre-test”, “percent of student cohort absent at post-test” and “number of students common to pre- and post-test”?

    Or perhaps by restricting the end-of-year analysis to only those students who were present for both tests?
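
    The three metrics proposed in this comment, plus the both-tests restriction, could be computed along the following lines. This is only a minimal Python sketch: the function name `attendance_report`, the dictionary layout, and the four-student roster are hypothetical and are not taken from any real reporting system.

```python
def attendance_report(enrolled, pre_scores, post_scores):
    """Summarize test-day absence for one cohort.

    enrolled:    iterable of ids for students eligible to be tested
    pre_scores:  dict mapping student id -> pre-test score (tested students only)
    post_scores: dict mapping student id -> post-test score (tested students only)
    """
    enrolled = set(enrolled)
    took_pre = enrolled & set(pre_scores)
    took_post = enrolled & set(post_scores)
    both = took_pre & took_post          # students common to pre- and post-test
    n = len(enrolled)
    gains = [post_scores[s] - pre_scores[s] for s in both]
    return {
        "pct_absent_pre": 100.0 * (n - len(took_pre)) / n,
        "pct_absent_post": 100.0 * (n - len(took_post)) / n,
        "n_both": len(both),
        # Mean gain restricted to students present for both tests.
        "mean_gain_both": sum(gains) / len(gains) if gains else None,
    }


# Hypothetical roster: two students miss the pre-test, one misses the post-test.
report = attendance_report(
    enrolled=["a", "b", "c", "d"],
    pre_scores={"a": 40, "b": 50},
    post_scores={"b": 58, "c": 70, "d": 75},
)
print(report)
```

    Note that restricting to students with both scores still treats the absent students as missing at random, so it is the absence percentages, not the restricted gain, that would flag a suspicious pattern.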

  2. Not surprising. There is a more nefarious version that I have seen quite often.

    It goes like this:

    An assessment test for 2nd-grade subtraction skills, where up to 10% of the questions can require borrowing:

    1) Pre-test for subtracting two-digit numbers: the third, fourth, sixth, seventh, and tenth questions out of 50 require borrowing
    2) Post-test: all the borrowing questions come at the end

    The students are not told to skip a question that is difficult.

    And “amazingly”, the schools that use this assessment tool [which is freely provided by the curriculum (textbook) provider] show DRAMATIC improvement between pre and post.

    • I enjoy studying radical and fringe political philosophies, as well as their internet cultures and subcultures. It’s been my experience that the term “white supremacist” is often applied to American Renaissance by others, but it is not a label that they apply to themselves, nor would it be honest to say that their ideology necessitates a positive belief in an objective superiority over non-white races. They do espouse an implicit subjective superiority, but this is hardly the same thing and nowhere near as vulgar as groups who use the term “white supremacist” to describe themselves. It could easily be the case that someone could come across an article of theirs online and not be aware of the social stigma attached to the source. Had he linked from stormfront, I would have been a bit more concerned.

  3. This piece is at risk of being quoted in a blog post. However, at least part of the issue is a misalignment of the incentives. The incentives matter for the teachers, but the individual students do not appear to face consequences in the same way (especially at early grade levels). It’s also a very bad error to make the people administering the test the same people who are held accountable for the results. This can create problems that are very hard to take out later on.

    My worry is we are creating an illusion of objectivity and the appearance of improvement. An independent body would not even necessarily solve the problem: think of S&P and its ratings.

  4. Pingback: More on value added measures of teaching | peakmemory

  5. I didn’t know that that website had any sort of point of view which would affect the accuracy of the facts that it presented. I got the site from the keynote address at a recent (10/18/13) conference on test fraud — the presenter is a scholar of impeccable credentials whose work I know and respect. But now that I have been wised-up I have removed it and replaced it with other sources that are more mainstream. Thank you for pointing this out to me.

    The story (pasted in below) will be appearing in my column in Chance (27(2), 2014).

    Life follows art: Gaming the missing data algorithm

    Howard Wainer
    National Board of Medical Examiners

    In 1969 Bowdoin College was pathbreaking when it changed its admissions policy to make college admissions tests optional. About one-third of its accepted classes took advantage of this policy and did not submit SAT scores. I followed up on Bowdoin’s class of 1999 and found that the 106 students who did not submit SAT scores did substantially worse in their first-year grades at Bowdoin than did their 273 classmates who did submit SAT scores. Would their SAT scores, had they been available to Bowdoin’s admissions office, have predicted their diminished academic performance?

    As it turned out, all of those students who did not submit SAT scores, actually took the test, but decided not to submit them to Bowdoin. Why? There are many plausible reasons, but one of the most likely ones was that they did not think that their test scores were high enough to be of any help in getting them into Bowdoin. Of course, under ordinary circumstances, this speculative answer is not the beginning of an investigation, but its end. The SAT scores of students who did not submit them have to be treated as missing data — at least by Bowdoin’s admissions office; but not by me. Through a special data gathering effort at the Educational Testing Service we retrieved those SAT scores and found that while the students who submitted SAT scores averaged 1323 (the sum of their verbal and quantitative scores), those who didn’t submit them averaged but 1201 – more than a standard deviation lower! As it turned out, had the admissions office had access to these scores they could have predicted the lower collegiate performance of these students.

    Why would a college opt for ignorance of useful information? Again there is a long list of possible reasons, and your speculations are at least as valid as mine, so I will focus on just one – the consequences of treating missing data as missing-at-random. The mean SAT score for Bowdoin’s class of 1999 was observed to be 1323, but the true mean, including all members of the class, was 1288. A mean score of 1323 places Bowdoin comfortably ahead of such fine institutions as Carnegie Mellon, Barnard and Georgia Tech, whereas 1288 drops Bowdoin below them. The influential US News and World Report (USN&WR) college rankings use mean SAT score as an important component. But those rankings use the reported scores as the mean, essentially assuming that the missing scores were missing-at-random. Thus, by making the SAT optional, a school could game the rankings and boost its placement.
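
    The gap between the reported and the true mean can be checked from the figures quoted above (273 submitters averaging 1323, 106 non-submitters averaging 1201). The short Python sketch below does that arithmetic; the helper name `pooled_mean` is just an illustrative choice. The pooled value comes out at roughly 1289, which agrees with the 1288 quoted above up to rounding in the group means.

```python
# Figures quoted in the column for Bowdoin's class of 1999.
n_submitted, mean_submitted = 273, 1323   # students who sent in SAT scores
n_withheld, mean_withheld = 106, 1201     # students who took the SAT but withheld it


def pooled_mean(n_a, mean_a, n_b, mean_b):
    """Size-weighted mean of two groups, given each group's size and mean."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)


true_mean = pooled_mean(n_submitted, mean_submitted, n_withheld, mean_withheld)
inflation = mean_submitted - true_mean    # what ignoring the withheld scores buys

print(f"reported mean (submitters only): {mean_submitted}")
print(f"mean over all {n_submitted + n_withheld} students: {true_mean:.1f}")
print(f"inflation from dropping withheld scores: {inflation:.1f} points")
```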

    Of course, Bowdoin’s decision to adopt a policy of “SAT Optional” predates the USN&WR rankings, so that was almost certainly not their motivation. But that cannot be said for all other schools that have adopted such a policy in the interim. Or so I thought.

    After completing the study described above I congratulated myself on uncovering a subtle way that colleges were manipulating rankings. Silly, pompous me. I suspect that one should never assume subtle, modest manipulations, if obvious large changes are so easy; USN&WR gets their information reported to them directly from the schools themselves, thus allowing the schools to report anything they damn well please.

    In 2013 it was reported that six prestigious institutions admitted falsifying the information they sent to USN&WR (and also the US Department of Education and their own accrediting agencies). Claremont McKenna College simply sent in inflated SAT scores; Bucknell admitted they had been boosting their scores by 16 points for years; Tulane upped theirs by 35 points. Emory used the mean scores of all the students that were admitted, which included students who opted to go elsewhere – they also inflated class ranks! And there is lots more.

    In a 2011 Chance article about the use of value-added models for the evaluation of teachers I discussed the treatment of missing data that was currently in use by the proponents of this methodology. The basic idea underlying these models is to partition the change in test scores – from pre-test scores at the beginning of the school year to post-test scores at the end – among the school, the student and the teacher. The average change associated with each teacher was that teacher’s ‘value-added’. There were consequences for teachers with low value-added scores and different ones for high scoring teachers. There were also consequences for school administrators based on their component of the total value-added amount.

    There are fundamentally two approaches taken in dealing with the inevitable missing data. One is to only deal with students who have complete data and draw inferences as if they were representative of all the students (missing-at-random). A more sophisticated approach that is used is to impute the missing values based on the scores of the students who had scores, perhaps conditioned on available covariates. Inferences from either approach have limitations, sometimes requiring what Don Rubin characterized as “heroic assumptions.”

    To make the problems of such a missing data strategy more vivid, I suggested (tongue firmly in cheek) that were I a principal in a school being evaluated I would take advantage of the imputation scheme by having an enriching, educational field trip for the top half of the students on the day of the pre-test and another, parallel one, for the bottom half on the day of the post-test. The missing groups would have scores imputed for them based on the mean of the scores of those who were there. Such a scheme would boost the change scores, and the amount of the increase would be greatest for schools with the most diverse populations. Surely a win-win.
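
    The field-trip scheme is easy to simulate. The Python sketch below is a toy version under stated assumptions, not any district's actual value-added model: every student truly gains the same number of points, missing scores are filled with the plain mean of the observed scores (no covariates), and the function names `mean_impute` and `apparent_gain`, the cohort size, and the score distribution are all made up for illustration.

```python
import random
import statistics


def mean_impute(scores):
    """Replace each missing score (None) with the mean of the observed scores."""
    observed = [s for s in scores if s is not None]
    fill = statistics.fmean(observed)
    return [fill if s is None else s for s in scores]


def apparent_gain(pre_true, true_gain, game=False):
    """School-level mean change score after mean imputation.

    If game is True, the top half (by true pre-test score) is away on
    pre-test day and the bottom half is away on post-test day.
    """
    post_true = [s + true_gain for s in pre_true]
    cutoff = statistics.median(pre_true)
    if game:
        pre_obs = [None if s > cutoff else s for s in pre_true]
        post_obs = [None if s <= cutoff else p
                    for s, p in zip(pre_true, post_true)]
    else:
        pre_obs, post_obs = list(pre_true), list(post_true)
    pre_filled = mean_impute(pre_obs)
    post_filled = mean_impute(post_obs)
    return statistics.fmean(post_filled) - statistics.fmean(pre_filled)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
true_gain = 5.0          # assumed: every student really improves by 5 points
for spread in (5.0, 15.0):   # standard deviation of true pre-test scores
    pre_true = [rng.gauss(50.0, spread) for _ in range(200)]
    honest = apparent_gain(pre_true, true_gain)
    gamed = apparent_gain(pre_true, true_gain, game=True)
    print(f"spread={spread}: honest gain={honest:.1f}, gamed gain={gamed:.1f}")
```

    Under these assumptions the gamed change score equals the true gain plus the gap between the top-half and bottom-half pre-test means, which is why the boost grows with the spread of the student population.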

    Whenever I gave a talk about value-added and mentioned this scheme to game the school evaluations it always generated guffaws from most of the audience (although there were always a few who busied themselves taking careful notes). I usually appended the obiter dictum that if I could think of this scheme, the school administrators in the field, whose career advancement was riding on the results, would surely be even more inventive. Sadly, I was prescient.

    On October 13, 2012, Manny Fernandez reported in the New York Times that former El Paso schools superintendent Lorenzo Garcia had been sentenced to prison for his role in orchestrating a testing scandal. The Texas Assessment of Knowledge and Skills (TAKS) is a state-mandated test for high school sophomores. The TAKS missing data algorithm was to treat missing data as missing-at-random, and hence the score for the entire school was based solely on those who showed up. Such a methodology is so easy to game that it was clearly a disaster waiting to happen. And it did. The missing data algorithm used by Texas was obviously understood by school administrators, for every aspect of their scheme was designed to keep potentially low-scoring students out of the classroom so they would not take the test and possibly drag scores down. Students identified as likely low performing “were transferred to charter schools, discouraged from enrolling in school or were visited at home by truant officers and told not to go to school on test day.”

    But it didn’t stop there. Some students had credits deleted from transcripts or grades changed from passing to failing so they could be reclassified as freshmen and so avoid testing. Sometimes students who were intentionally held back were allowed to catch up before graduation with “turbo-mesters” in which a student could acquire the necessary credits for graduation in a few hours in front of a computer.

    Superintendent Garcia boasted of his special success at Bowie High School, calling his program “the Bowie Model.” The school and its administrators earned praise and bonuses in 2008 for its high rating. Parents and students called the model “los desaparecidos” (the disappeared). It received this name because in the fall of 2007, 381 students were enrolled in Bowie as freshmen; the following fall, however, the sophomore class was composed of but 170 students.

    It is an ill wind indeed that doesn’t blow some good. These two examples contain the germ of good news. While the cheating methodologies employed exploit the shortcomings of the missing data schemes that were in use to game the system, they also tell us two important things:

    (i) Dealing with missing data is a crucial part of any practical situation, and doing it poorly is not likely to end well; and
    (ii) Missing data methodologies are not so arcane and difficult that the lay public cannot understand them.

    So we should not hesitate to employ the sorts of full-blooded methods of multiple imputation pioneered by Rod Little and Don Rubin; for opponents cannot claim that they are too complicated for ordinary people to understand. The unfolding of events has shown conclusively their general comprehensibility.

    Further readings and data sources:

    Fernandez, M. (October 13, 2012). El Paso schools confront scandal of students who ‘disappeared’ at test time. New York Times.

    Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

    Wainer, H. (2011). Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies. Princeton, NJ: Princeton University Press.

    Wainer, H. (2011). Value-added models to evaluate teachers: A cry for help. Chance, 24(1), 11-13.

