Rising test scores . . . reported as stagnant test scores

Joseph Delaney points to a post by Kevin Drum pointing to a post by Bob Somerby pointing to a magazine article by Natalie Wexler that reported on the latest NAEP (National Assessment of Educational Progress) test results.

In an article entitled, “Why American Students Haven’t Gotten Better at Reading in 20 Years,” Wexler asks, “what’s the reason for the utter lack of progress in reading scores?”

The odd thing, though, is that reading scores have clearly gone up in the past twenty years, as Somerby points out in text and Drum shows in this graph:

Drum summarizes:

Asian: +15 points
White: +5 points
Hispanic: +10 points
Black: +5 points

. . . Using the usual rule of thumb that 10 points equals one grade level, black and white kids have improved half a grade level; Hispanic kids have improved a full grade level; and Asian kids have improved 1½ grade levels.

Why does this sort of thing get consistently misreported? Delaney writes:

My [Delaney’s] opinion: because there is a lot of money in education and it won’t be possible to “disrupt” education and redirect this money if the current system is doing well. . . .

It also moves the goalposts. If everything is falling apart then it isn’t such a crisis if the disrupted industry has teething issues once they strip cash out of it to pay for the heroes who are reinventing the system.

But if current educational systems are doing well, and slowly improving through incremental change, then it is a lot harder to argue that there is a crisis in education, isn’t it?

Could be. The other thing is that it can be hard to get unstuck from a conventional story. We discussed a similar example a few years ago: that time it was math test scores, which economist Roland Fryer stated had been “largely constant over the past thirty years,” even while he’d included a graph showing solid improvements.

13 thoughts on “Rising test scores . . . reported as stagnant test scores”

    • Yes. Between 2000 and 2013, for the school age population, Whites declined from 62% to 53%. Hispanics increased from 16% to 24%. That is a pretty significant recomposition.

      I really wonder if the within-group improvement is real or if the scores are being inflated in some way. Even if they are calibrating the test really well across years, there still might be a natural drift upwards because of familiarity with the test, better prep, etc.

    • Well this would explain a lot.

      But of course, The Atlantic can’t mention this. So instead, they have an article full of misinformed drivel about theories of teaching and testing.

      What a waste.
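      The recomposition point raised above can be sketched numerically. This is a Simpson's-paradox-style effect: every group's score rises, yet the aggregate rises less than it would under a fixed population mix, because enrollment shifts toward lower-scoring groups. The shares and scores below are invented for illustration, loosely patterned on the figures quoted in this thread.

```python
# Hypothetical illustration of the recomposition effect: every group
# improves, but the population average understates the within-group gains
# because the mix shifts toward lower-scoring groups. Made-up numbers.
groups = {
    #            share_2000, score_2000, share_2013, score_2013
    "White":    (0.62,       270,        0.53,       275),
    "Hispanic": (0.16,       245,        0.24,       255),
    "Black":    (0.15,       240,        0.15,       245),
    "Asian":    (0.07,       275,        0.08,       290),
}

avg_2000 = sum(s00 * v00 for s00, v00, _, _ in groups.values())
avg_2013 = sum(s13 * v13 for _, _, s13, v13 in groups.values())
# Counterfactual: 2013 scores, but the 2000 population mix
avg_2013_fixed_mix = sum(s00 * v13 for s00, _, _, v13 in groups.values())

print(round(avg_2000, 1), round(avg_2013, 1), round(avg_2013_fixed_mix, 1))
# The actual aggregate gain comes out smaller than the fixed-mix gain,
# even though every single group improved.
```

      With these numbers the aggregate gains about 5 points while the fixed-mix counterfactual gains about 6.5, so the composition shift shaves roughly 1.5 points off the apparent progress.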

  1. These are combined race and Hispanic ethnicity groups across time, not by SES, where we are seeing some of the largest disparities (Sean Reardon's and others' research). Given the amount of federal and private foundation money going into intervention, should we have expected larger increases? If we stop intervening, as the quotes suggest, would the rates go down? Is the NAEP even the right measure to look at vs. state data?

    • Given the consistent inflation in high-stakes test scores (see Koretz), I'd say NAEP scores support far more robust inferences about changes in aggregate student performance across time. Scores on state tests tend to rise rapidly as new tests are introduced, often many times faster than NAEP or other external measures, as teachers learn to “teach to the test.”

  2. When I was teaching, they would show us stuff like this and I would wonder what the margins of error were. When I looked at the data for state tests, I found that one or two answers could shift the percentile an individual ranked in by as much as three to five points. That's why we were specifically told that targeting the goals (not the actual questions, which we couldn't see, not legally) would make such a difference.

    Or to put it another way, it doesn't take gut-reaction disbelief to suspect that white students averaging 270 at the beginning of the graph and 275 at the end might not represent real, i.e., significant, progress.

    Also, I preferred histograms for this kind of data. But that’s me.

    • Steven:

      I don’t really know, but according to the link, 10 points equals one grade level, so a gain of 5 points does represent significant progress.

      I agree that it would be good to show distributions as well as averages.

      • “I don’t really know, but according to the link, 10 points equals one grade level, so a gain of 5 points does represent significant progress.”

        How much has “what a 5th grader is supposed to know” changed since 1998?

        I know recently in CA they changed from whatever their previous curriculum was to the current Common Core curriculum and changed the tests, and everyone's test scores plummeted, because Common Core expects you to learn different things. If you're in 3rd grade today taking a Common Core test but had K, 1, and 2 under the previous curriculum, you just don't know some of the stuff on the test, because you were never expected to learn it and were never taught it.

        In a more general sense, I think it's very hard to compare knowledge through time; it's a lot like inflation in prices. We have the CPI, but to think that the CPI is the be-all and end-all of inflation comparisons is wrong. It's really just the tip of the iceberg, and it's a sausage factory deep inside. For example, what's the price of an iPhone in 1965? It's probably about $80 trillion and 50 years to produce it. ;-)

        Anyway it’s totally plausible to me that a kind of “grade inflation” could have taken hold in the measurement techniques we use so that the trends you’re seeing are just problems in measurement and that underlying knowledge could have easily increased, decreased, or stayed flat over that time period.

        What I mean is that, for example, the questions could have gotten easier over time, or teaching could have become more focused on the questions and less on other important knowledge, or the range of knowledge being tested could be narrower and therefore easier to teach, and so on, so that a broader, more accurate measurement would show a very different trend.

        • Stability equals consistency — it does NOT equal validity. It means the test measures the same things over time, but that does NOT mean it is a measure of what was ACTUALLY taught.

  3. I don’t like their point scale. Notice Drum had to give a grade level rule of thumb to try to translate into something meaningful.

    “Grade level” is very intelligible, so why not just use that? Create standards (not norms) for each grade level. Then score the students according to where they fall (so 2.0 equals second grade level). You could then compare actual and effective grade levels. Come to think of it, if you wanted to get crazy, you could actually require the students to meet the standards before advancing to the next grade.

  4. One problem with grade levels is that if the within-grade variability (say, standard deviation) is much larger than the mean group difference between successive grades then what looks like a substantial difference in terms of grades is not really that great in the context of a grade-standardized reference sample. A gain of “one grade level” sounds a lot more impressive than one of “five percentile points”, but they can be the same, depending on score distributions.
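    The point about distributions can be made concrete with a quick sketch. The numbers here are assumptions chosen for illustration (a 10-point gap between successive grades and a large within-grade standard deviation of 80 points), not actual NAEP parameters; with a normal model, a "one grade level" gain then corresponds to only about a five-percentile-point shift.

```python
# Sketch of the percentile-vs-grade-level point, with made-up numbers:
# a "one grade level" gain is a small percentile shift when the
# within-grade spread is large relative to the between-grade gap.
from statistics import NormalDist

grade_gap = 10.0   # assumed mean difference between successive grades (points)
within_sd = 80.0   # assumed within-grade standard deviation (points)

# A median student who gains one full "grade level" (10 points) moves
# from the 50th percentile to roughly:
new_percentile = 100.0 * NormalDist(mu=0.0, sigma=within_sd).cdf(grade_gap)
print(round(new_percentile))  # ~55: about a 5-percentile-point shift
```

    Shrinking the assumed within-grade SD makes the same 10-point gain look much larger in percentile terms, which is exactly why the two framings can diverge.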

  5. I wonder if they mean that the US has not changed its ranking compared to other countries. I believe that is true. I don’t see why that’s a particularly useful metric.
