Resolving confusions over that study of “Teacher Effects on Student Achievement and Height”

Someone pointed me to this article by Marianne Bitler, Sean Corcoran, Thurston Domina, and Emily Penner, “Teacher Effects on Student Achievement and Height: A Cautionary Tale,” which begins:

Estimates of teacher “value-added” suggest teachers vary substantially in their ability to promote student learning. Prompted by this finding, many states and school districts have adopted value-added measures as indicators of teacher job performance. In this paper, we conduct a new test of the validity of value-added models. Using administrative student data from New York City, we apply commonly estimated value-added models to an outcome teachers cannot plausibly affect: student height. We find the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement, raising obvious questions about validity. Subsequent analysis finds these “effects” are largely spurious variation (noise), rather than bias resulting from sorting on unobserved factors related to achievement. Given the difficulty of differentiating signal from noise in real-world teacher effect estimates, this paper serves as a cautionary tale for their use in practice.

This was, unsurprisingly, reported as evidence that value-added assessment doesn’t work. For example, a Washington Post article says “The paper is critiquing the idea that teacher quality can be measured by looking at their students’ test scores” and quotes an economist who says that “the effect of teachers on achievement may also be spurious.”

I asked some colleagues who work in education research, and they argued that the above interpretation of the empirical findings was mistaken:

1. The correlation of class mean test-score residuals across classes taught by the same teacher is large and positive, while the correlation of class mean height residuals across classes taught by the same teacher is zero. In their case, the method indicates sizeable teacher effects on test scores and zero teacher effects on height. Thus, this is a non-cautionary tale. The authors sort of say this in their abstract, noting that the “effects” on height are spurious (noise).

2. They find large variance of class-mean residuals (classroom “effects”) on height when they randomly assign students to teachers. This doesn’t seem right. If we randomly assign observations to groups and still find significant variation in “group effects,” there must be something else going on.

My colleagues’ key point was that one key piece of evidence for teacher effects on test scores is that these effects persist among teachers from year to year. If there appear to be large effects of teachers on height, but these effects do not persist among teachers from year to year, then we should not think of these as effects of teachers on height but rather as unexplained classroom-level residuals in our model.
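My colleagues' persistence argument can be illustrated with a toy simulation (all parameter values below are invented for illustration, not taken from the paper): give teachers a persistent effect on test scores but none on height, then compare class-mean residuals across two years of classes taught by the same teacher.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 200, 25

# True persistent teacher effects: real for test scores, zero for height.
teacher_effect_score = rng.normal(0, 0.2, n_teachers)

def class_mean_residuals(teacher_effect):
    """One year of class-mean residuals: teacher effect + class-year shock + averaged student noise."""
    class_shock = rng.normal(0, 0.1, n_teachers)                       # idiosyncratic class-year shock
    student_noise = rng.normal(0, 1, (n_teachers, n_students)).mean(axis=1)
    return teacher_effect + class_shock + student_noise

# Two years of classes taught by the same teachers
score_y1 = class_mean_residuals(teacher_effect_score)
score_y2 = class_mean_residuals(teacher_effect_score)
height_y1 = class_mean_residuals(0.0)  # no teacher effect on height
height_y2 = class_mean_residuals(0.0)

r_score = np.corrcoef(score_y1, score_y2)[0, 1]
r_height = np.corrcoef(height_y1, height_y2)[0, 1]
print(f"year-to-year correlation, scores: {r_score:.2f}")   # clearly positive
print(f"year-to-year correlation, height: {r_height:.2f}")  # near zero
```

Within a single year, both outcomes show class-mean variation of similar size; only the year-to-year correlation distinguishes a persistent teacher effect from classroom-level noise.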

Bitler et al. were informed of this discussion, and they contacted me with further information. Here’s Bitler:

We have noticed that our work has gotten some attention that mischaracterizes what we find. We have submitted the attached blog post to the Brookings education Brown Center blog to make the point that first, more sophisticated models which account for class-year shocks no longer produce value added of teachers on height and second, another, more policy relevant point, that a measure of teacher quality that adjusts slowly (because it extracts the persistent component) may not be the best for motivating teachers.

From their post:

We [Bitler et al.] have seen a number of claims that overgeneralize our findings, suggesting that all uses of all types of value-added models (VAMs) are invalid. This conclusion is not supported by our work. We believe that our study has some good news about some value-added models, particularly those used in research. At the same time, however, we see important cautions in our findings for the use of value-added models in policy and practice.

They continue with some good news about the use of VAMs in research:

That we find teacher effects on height does not . . . invalidate the consistent finding emerging from VAM research that teachers matter for students’ achievement and other life outcomes. The most sophisticated value-added models currently used in research concentrate on persistent contributions to student outcomes. . . . [and] appear to demonstrate that teachers vary substantially in their contribution to achievement growth and that exposure to high value-added teachers has measurable positive effects on students’ educational attainment, employment and other long-term outcomes. Importantly, we find no teacher effects on height using the more sophisticated models these researchers use.

And some cautions:

VAMs are hard to use well in practice. . . . the models behind many of these real world applications are much less sophisticated than the ones that researchers use, and our results suggest that they are more likely to result in misleading conclusions.

In many cases, practical applications use single-year models like the models that yield implausible teacher effects on height in our analyses. Our findings reinforce previous work identifying problems with these models, demonstrating that random error can lead observers to draw mistaken conclusions about teacher quality with striking regularity.

Again:

The more sophisticated models used in research draw on several years of a teacher’s students to estimate their persistent effect. When we use these multi-year models, we do not find effects on height, suggesting that the effects we see in simpler VAMs reflect year-to-year variation that should be seen as random errors rather than systematic factors.

When it comes to policy:

Multi-year models adjust for these random year-to-year errors, but because they identify teachers’ persistent contributions to student learning using multiple years of data, they may not pick up on short term changes in teacher effectiveness. As such, these models are of limited usefulness for motivating annual performance pay goals. However, multi-year models may work well for identifying persistently very good or very bad teachers, but only after several years of teaching.

This subtlety – that our results support the validity of multi-year VAMs while indicating that single-year VAMs are not valid – has been overlooked in much of the discussion of our study.

And, in summary:

VAMs have underscored the importance of teachers, and we believe that they have a role to play in future educational research and policy. We just need to be more modest in our expectations for these models and make sure that the empirical tool fits the job at hand.

I sent the above to my colleagues, who had two comments:

1. The only quibble we have with their blog post is when they say “our results support the validity of multi-year VAMs while indicating that single-year VAMs are not valid.” This is very poor wording. The persistent component of teacher value-added is contained in the single-year estimate; it just has a bunch of noise. We can reduce the noise (but not eliminate it completely in finite samples) with more classrooms taught by the same teacher. We think it’s wrong to refer to an unbiased noisy estimate as “not valid” when we know that the signal variance is meaningfully large. If zero noise were the bar, then nothing would be valid, and we have known about the magnitude of signal/noise in VA for well over a decade.

2. It still is quite unclear what is driving their finding of significant class-level residuals on height. Assuming no true peer effects on height, one possible explanation is correlated measurement error. But they find classroom height “effects” even when they randomly assign students to classrooms, which should not be possible if they are properly estimating the variance of random effects. The authors offer very little explanation to help us understand what might be going on, so it’s hard to know whether they have identified a pervasive problem in the estimation of group effects, whether there is an error in their code, or something else entirely.
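The colleagues’ first point, that a single-year estimate contains the persistent signal plus noise that shrinks as more classrooms are averaged, can be sketched numerically. The variance values below are hypothetical, chosen only for illustration:

```python
# Reliability of a teacher's value-added estimate averaged over K classrooms,
# under assumed (hypothetical) signal and noise variances.
signal_var = 0.04   # variance of persistent teacher effects (assumed)
noise_var = 0.05    # class-year noise variance of one class-mean residual (assumed)

def reliability(k):
    """Share of variance in a K-classroom average that is persistent signal."""
    return signal_var / (signal_var + noise_var / k)

for k in (1, 3, 5, 10):
    print(f"K={k:2d}: reliability = {reliability(k):.2f}")
```

The single-year (K=1) estimate is unbiased but mostly noise; averaging more classrooms raises the signal share without ever changing what is being estimated, which is why calling the single-year estimate “not valid” is a stretch.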
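The colleagues’ second point is that a properly estimated group-effect variance should vanish under random assignment. A minimal simulation (not the authors’ procedure; sample sizes are made up) shows what a method-of-moments estimate of the between-class variance component looks like when students really are assigned to classes at random:

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n_per = 100, 25
n_students = n_classes * n_per

def class_var_component():
    """Method-of-moments estimate of the between-class variance component
    when students are assigned to classes completely at random."""
    heights = rng.normal(0, 1, n_students)                 # student heights, no true class effect
    classes = rng.permutation(n_students).reshape(n_classes, n_per)
    group = heights[classes]                               # (n_classes, n_per)
    within_var = group.var(axis=1, ddof=1).mean()
    # ANOVA decomposition: Var(class means) = sigma_class^2 + sigma_within^2 / n_per
    return group.mean(axis=1).var(ddof=1) - within_var / n_per

estimates = np.array([class_var_component() for _ in range(500)])
print(f"mean estimated class variance component: {estimates.mean():.4f}")
```

The estimate is centered on zero, as it should be; finding sizeable classroom “effects” under random assignment, as the paper reports, suggests something else (correlated measurement error, a coding issue, or an estimator that does not subtract the within-class sampling variance) is at work.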

52 Comments

  1. Dan Wright says:

    The use of VAMs in education echoes a couple of points that come up on this blog often: that doing statistics isn’t easy, and that it is really important to consider how the data may have arisen before using any model for high-stakes decisions. When I’ve told education policy folks, for example, that different statistical procedures are needed depending on how students are assigned to schools/teachers, or that previous scores can act as colliders and create problems for estimating effectiveness, they tell me that they are required to get estimates by law or by their governor. It is unfortunate that lots of the critical research on these methods, and the more involved techniques that address some of these issues, came after the rating systems were already in place.

    • Michael Nelson says:

      What’s unfortunate is that advocates for VAMs being written into state law began their advocacy prior to the evidence coming in, based on flawed rational–not empirical–arguments. This would be just another case of lawmakers adopting a “data-driven metric” that will “hold teachers to account” except that there’s no way a legislator could come up with VAM on her own. My experience in Ohio and Tennessee in the early 00’s was that Battelle for Kids lobbied hard for the idea with conservative lawmakers. Specifically, they pushed a model that, according to the BFK website, is a registered trademark of its inventor (see https://en.wikipedia.org/wiki/William_Sanders_(statistician)). So far as I can tell, the whole phenomenon of VAM in practice is a product of the confluence of two forces: the movement among conservatives in the late 1990s/early 2000s to make schools “work” by having them adopt private-sector business practices, and opportunistic education researchers and organizations chasing state, and then federal, funding. Education as a field is still struggling with the detritus of that confluence, including VAMs, publicly-funded charter schools, school vouchers, and standardized testing on a massive, and intrusive, scale.

  2. A says:

    Didn’t read the study, but if a teacher encouraged a student to drink milk, that might affect height.

    • Martha (Smith) says:

      Perhaps more relevant: Students (at least below college level) are still growing (in particular, getting taller). This brings in the possibility that “teacher” and “variables influencing height” might have some correlation. For example, schools in affluent districts are able to hire better teachers, but students in affluent districts are also likely to be taller (on average) than students in poor districts, simply because quality of nutrition is likely to be higher in affluent districts than in poorer ones.

      Another factor that may be relevant to studies like this (but may be overlooked) is that a good teacher can influence a student’s performance in subsequent courses, not just in the current course taught by the good teacher.

      • gec says:

        > a good teacher can influence a student’s performance in subsequent courses, not just in the current course taught by the good teacher.

        As you might say, +1!

        I feel like many of the best teachers I had taught me not just skills but ways of thinking that have had a much more lasting impact on my life than the particular topic they taught. My high school English teacher Mrs. Fleischaker was the first to open my eyes to style as a tool, teaching me how to analyze and appreciate the choices writers make to serve their aims. Until that point, I thought writing was just putting on paper whatever you heard in your head. She also made us write constantly, which opened my mind to the idea that writing is a skill that you need to practice, you don’t just wait for the muses to inspire you.

        Since then, even though I’ve still never voluntarily read Hawthorne or Dickens, the perspective she taught me—and the associated analytical skills—has done way more to help my scientific career than, say, the actual college course I took on technical writing.

        • Martha (Smith) says:

          gec said, “I feel like many of the best teachers I had taught me not just skills but ways of thinking that have had a much more lasting impact on my life than the particular topic they taught.”

          One example I remember vividly: In teaching math (and later stats) courses, I typically would include a section in the first day handout (or even a separate handout) of what I expect of students in writing up their solutions to problems. Once I had a student who seemed pretty average in a course, but then later in another course, I noticed that her write-ups of problem solutions were unusually good. I commented on it once, and she replied nonchalantly that she was just doing what she had learned in the previous class. I also had contact with her several years later, when she was teaching high school math. She was disappointed that the other math teachers in her school thought that “showing your steps” was adequate — but her principal (who had been an English teacher) really supported her including good writing in her math courses.

          • David P says:

            On style in math homework, I can’t resist quoting Calvin Trillin: “Math was always my bad subject. I couldn’t convince my teachers that many of my answers were meant ironically.”

      • TO says:

        Affluence may influence weight but not necessarily height. Height largely depends on genetics.

        • Kyle C says:

          I’m not saying you’re wrong, but few people who attended Ivy League schools believe that. The male students seem very tall on average.

          • Michael Nelson says:

            Studies show that being taller than average is a causal factor in hiring and promotions and raises, and therefore of wealth and social class, at least among white males. As a result, taller people have some >0 additional propensity to be affluent, so the children of taller people have a >0 propensity to both be taller than average and to have parents who are rich and influential and/or have legacy status. So the genetic factor of height has an indirect causal effect on Ivy League admissions as mediated through those two environmental factors. Plus, there ought to be direct effects of height on admissions: the impact of height on job advancement probably also applies to college interviews; and people who write recommendations and who interview applicants for the Ivy League are likely to be wealthy and/or legacies and therefore to be taller than average, and people in general tend to favor those who resemble themselves, both in terms of personal history and appearance. All of these effects would reinforce each other over generations. To sum up: height → affluence → Ivy League.

        • Martha (Smith) says:

          Height can also be influenced by nutrition — I think it’s not so much the difference between “affluence” and “average” as it is between “average” and “poverty.”

          • TO says:

            “I think it’s not so much the difference between “affluence” and “average” as it is between “average” and “poverty.”

            This makes sense to some degree. But keeping the focus on the groups being compared, the income-mediated effect of nutrition on height would be weak. Affluence is not the norm and neither is poverty.

          • confused says:

            This sounds right, based on what I’ve heard.

            Both genetics and nutrition have a major effect on height.

            But I am not sure the nutritional factors affecting height necessarily “translate” from poverty in the tropics today or in the US and Europe centuries ago, to poverty in the US today. The kind of diet is dramatically different. It seems like this kind of thing has probably been studied by someone…

        • John Williams says:

          I was born in 1941, and over my lifetime, people in this country have gotten noticeably taller. I don’t think genetics change that fast.

          • confused says:

            Oh, that’s definitely a nutrition effect, but I am not sure the change in the US over time is comparable to the difference among social classes/wealth groups in the US today.

            The part I’m really skeptical about is a difference between wealthy people and middle/lower-middle class people in the US. I can believe a difference for the very poor, but I don’t think lower-middle class people in the US are generally suffering from malnutrition, or protein-starved the way poor people in the past or in many tropical nations today tend to be/have been.

  3. Joshua says:

    > It still is quite unclear what is driving their finding of significant class level residuals on height.

    Yes, but is it clear what is driving findings of significant differences in student outcomes by teacher? Is that not also an important question?

    Yes, determining which teachers have the biggest positive and negative effects on student outcomes is important and useful; but I think the real investigative challenge is determining which teacher attributes play causal roles, and how they interact with or are moderated/mediated by various environmental factors, student attributes, etc.

    • Michael Nelson says:

      Your point has important implications beyond the value of identifying good teachers vs the value of identifying causal factors. Specifically, VAMs, much like machine learning algorithms, do not necessarily explain why a teacher did better and so may be selecting on features that reinforce racism, classism and other forms of bigotry. In addition to promoting systemic problems, individual teachers may receive futile “incentives” to change or adopt traits that are effectively immutable.

  4. jd says:

    VAMs – velocità ascensionale media. First used as a metric by dubious Italian sports doctor Michele Ferrari as a way to quantify and compare speed of climbing on the big cols in professional cycling. https://en.wikipedia.org/wiki/VAM_(bicycling)

    I know this adds nothing to the discussion, but that’s the only VAM I know of, and hey, it’s interesting.

  5. Rick G says:

    The fact that teachers can only be evaluated over a long time scale (using this method) suggests a policy experiment: assign (or post-stratify) teachers to different pedagogical approaches (or ed school curricula) and then ask 10 years later how they did, in order to determine which pedagogical approaches work best. Since we can’t wait 10 years to find out the answer, create a prediction market for the outcome, and pick the curricula with the highest asset price(s). Repeat this every 10 years, continuously.

    • Michael Nelson says:

      Doesn’t this confound pedagogy with teacher characteristics? It assumes (to me, implausibly) that the only factors that vary among teachers and impact student performance are things that can be written into a curriculum, as opposed to things like enthusiasm, spending extra time and money outside the classroom, and actual fidelity to the curriculum.

      • Martha (Smith) says:

        Actual fidelity to the curriculum could be an important factor — and could conceivably affect teacher effectiveness in both directions. For example, a teacher might think that the curriculum is shortchanging the students, so try to make what they believe are needed improvements in how they teach (e.g., “This curriculum is just asking the students to engage in rote learning, but they really need to work more on developing thinking skills”). Or a teacher might think that the official curriculum is focusing on things that are unnecessary (e.g., “I never learned that, and I’m doing fine, so why should I waste my time and my students’ time by teaching it”).

  6. Stevec says:

    Interesting discussions, thanks.

    I’ve followed a little of the teacher VAM subject out of curiosity (I don’t work in education although my wife and many of her friends do).

    Under the “blank slate” model of the human mind (hopefully everyone knows what that is) then a decent statistical analysis might well uncover good v bad teachers. Obviously there’s a potential persistence effect there as gec suggested, so even then it’s not a simple problem.

    But I’ll suggest a different model of cognition, one supported by evidence rather than hope – students have different cognitive abilities. For example, I had a natural ability with maths when I was at school. Many of my friends didn’t. Put me in a class with an average teacher and my maths score will go up across the year. Put my maths-challenged friend in with an average teacher and their score might not move very much. Put them in with the best teacher and you might get the same “not very much” result.

    So, what you might be measuring with a “quality VAM score” is how the principal decides to place students and teachers. You might learn nothing at all about the teacher. And if the VAM score is used to hire/fire/raise pay then you might learn how much the principal likes the teacher. Assuming the non blank slate reality, at the top of the cognitive distribution you might well find the difference between teacher quality by measuring outcomes. But no one will be doing this experiment, as you would need large scale RCTs.

    Interested to hear comments on this suggestion.

    • Michael Nelson says:

      In most cases, principals make the ultimate decision as to what grade level a teacher is assigned, and may influence the resources assigned to that grade (including number of teachers and therefore number of classrooms), but they have slightly less influence over the assignment of students to individual teachers. At the elementary level, classes are usually constructed (according to my teacher wife) by the teachers from the previous grade level, who make recommended class lists based on which students they believe will perform best together. Principals then assign these class lists to individual teachers, perhaps with some tweaks. And parents can also have an impact on student assignment, at least informally. As you get into middle school and above, students’ choice of electives, tracks, and the resulting scheduling needs tend to dominate assignment.

      All of which is to say, it’s not at all clear from the outside why a particular student gets assigned to a particular teacher, and therefore what the VAM is actually measuring, under your theory of student cognition. For what it’s worth, I agree that class make-up is a huge factor, as is a teacher’s “fit” to that class–for example, a teacher who’s assigned a class with a large proportion of special needs students may see less year-to-year growth in learning for that class, unless that teacher happens to be especially good with special education students or with differentiated instruction.

  7. Stevec says:

    I’ll amend my last point because some teachers might be wonderful with advanced students and not so good with kids who struggle. Makes the design of a large scale RCT even harder.

    • Martha (Smith) says:

      Complicating things even further: I’ve seen cases where student X does much better than student Y in the same class, but in that class student Y has been learning how to be better at learning while student X has plateaued — and in the next class, student Y far surpasses student X.

      • Michael Nelson says:

        Plus, which/how many students grow the most in a classroom may be affected by factors with a questionable relationship to teacher quality or student ability. For example, I’ve seen administrators push (and even train) teachers to identify their so-called bubble kids–those whose performance is just below (“on the bubble”) the cutoff for Proficient–and then focus their efforts on those students. The rationale is that those students require the least individual growth to rise above the cutoff, so it will look like the school is making great progress at improving proficiency rates, and the school can avoid being put on the list of “failing schools.”

  8. confused says:

    Another thought: effectiveness at improving students’ test scores may not correlate with effectiveness at genuinely educating students in the ways that will benefit them in life. The really good teachers I had in high school, who really taught *how* to think about things and promoted real understanding of principles rather than recitation of facts, weren’t “teaching to the test”.

    • jim says:

      What is wrong with “teaching to the test”????

      The whole purpose of the test is to find out if students have learned what they’re supposed to learn, so I hope every teacher in the *%#+ country is teaching to the test!!! Teachers aren’t supposed to be teaching The Mystical Elements of Woo. They *are* supposed to be teaching reading, writing, math, history, government, social studies, science and problem solving, so that’s what we test for.

      I’ve heard a lot about all this learning that goes on that can’t be found in tests but so far I haven’t seen it. My guess is such beliefs are perpetrated by people who teach and test poorly. It’s outrageous that so many people believe that what’s learned cannot be measured. It’s patently ridiculous. Anything that can be taught can be tested for, period.

      Apologies, confused, you’re a fine and intelligent person, but I get so fed up with all the crackpot mysticism and woo woo about education that I just can’t stand it.

      • confused says:

        >>What is wrong with “teaching to the test”????

        Because it essentially reduces a field of knowledge to a list of rote boxes to be checked.

        >>The whole purpose of the test is to find out if students have learned what they’re supposed to learn

        Well, the problem here is twofold, as I see it: a) “what they’re supposed to learn” is not very well chosen, and b) the tests actually used are poor measures of it in practice.

        It’s not mysticism and woo woo, it’s about *what* is taught.

        For some fields, like math, it is not much of a problem, because math is clearly defined and right answers are objective.

        But for things like history/government/social studies, it is a huge problem. It is one thing to be able to recite a list of dates that certain events happened, it is another thing to understand the underlying principles. Science is somewhat more objective but often has the same problem with teaching as “list of facts” vs “underlying principles”.

        This is especially problematic in the modern world where basically all information is available at our fingertips. Rote memorization becomes much less useful, knowing how to find, judge, and filter information becomes far more important. You need to have enough “bedrock knowledge” to evaluate whether new stuff is plausible or not, but this is not necessarily the same things that are taught (principles vs. list of undigested facts).

        It isn’t necessarily that the kind of learning we want “cannot be measured” in principle, it is that in practice it *is not measured* by the kinds of tests that are used. (Though it probably cannot be measured if the same organizational mindset goes into writing the tests. That doesn’t make it impossible, but it does mean the tests need to be more like quasi-practical problems than the current style, and grading probably cannot be as mechanical/totally objective in the more subjective fields, which can raise problems.)

        • Ben says:

          > I’ve heard a lot about all this learning that goes on that can’t be found in tests but so far I haven’t seen it

          The complaints I’ve heard from the couple teachers I know have been more that the standards are difficult to keep up with/follow because there’s a lot of stuff there and it’s not all very obviously useful. So that lines up with the thing confused said “a) “what they’re supposed to learn” is not very well chosen”.

          Standards for my home state are available: https://www.tn.gov/content/dam/tn/education/standards/math/stds_math.pdf . Looking at them I think the conclusion I draw is that I don’t know much about how 7th graders learn math.

          I don’t doubt the value of testing. I also don’t doubt that current standardized tests might not be that great and justified by dubious research or lobbying or something someone just made up cuz they had to write something down in a meeting.

          > learning that goes on that can’t be found in tests

          Well this can be read as just a critique of test content, not statement that tests cannot measure learning.

          • confused says:

            >>Well this can be read as just a critique of test content, not statement that tests cannot measure learning

            I think this is essentially correct.

            But I don’t think it’s easily fixed just by changing the particular questions on the test; I think that the *kinds* of tests that are used are often poor fits. (Again, in some areas of knowledge like math it is much less of a problem.)

            There is also the inherent human problem that once formal performance measures are instituted, and people know what they are, they focus on only those aspects of performance that are measured – which may not lead to general improvement. I think this problem is fairly fundamental to large bureaucracies and probably can’t be avoided within the current relatively “top-down” system.

        • Martha (Smith) says:

          Confused said, “For some fields, like math, it is not much of a problem, because math is clearly defined and right answers are objective.”

          Baloney. There is a lot of mathematics that is not just about “getting the right answer.” It’s about the reasoning and thinking behind the answer. The latter is harder than the former to test for, so often is not tested for, and consequently is not taught by teachers who “teach to the test”.

      • Michael Nelson says:

        I will grant you that there are crackpot, woo-woo mysticists in education, so your argument isn’t a complete straw man. But there is a strong argument from empiricists that teaching to the test is counterproductive, too.

        Industry leaders frequently push for students to be taught “critical thinking skills,” which are not easily (or often) assessed through a standardized test. These are not the kind of people who reflexively defend bad instruction or reject measurement. Unfortunately, they are also not the kind of people who will pay a psychometrician to tell them that the thing they want, which certainly can be measured, cannot be measured well without committing vastly greater resources. Not when there are politicians and (sadly) psychometricians who will tell them the opposite.

        Assessments are designed to test for the things that are easiest to measure, like the old story about the drunk who looks for his lost keys under the light of the lamppost because it’s too dark to look where he dropped them. They are designed this way not because the designers are lazy or dumb, and not because they think what is learned cannot be measured, but because designers are told they must measure something and are given insufficient resources to measure the right thing. And they (we) often go along with it not because we are unscrupulous (though many are) but because getting paid while kids get shafted seems better than getting fired while kids still get shafted.

        Assessments also focus on testing for basic skills and knowledge, and teaching to that kind of test means not teaching advanced skills and knowledge. Again, advanced skills and knowledge are measurable, but why would we devote scarce resources to measuring them when the practical priority is ensuring that graduates meet a minimum standard? And when the political priority is ensuring that a sufficiently large proportion of students score well?

        Apologies, jim, but I get so fed up with fine and intelligent people insisting on blunt solutions for subtle problems. I’m sure that you would feel the same way about someone informing you that the most intractable problems in your field of expertise have one easy, obvious solution.

        • Martha (Smith) says:

          Several good points here.

        • Joshua says:

          Nice comment.

          From my experience, the problem is with validity. Does the test really measure what you think you’re measuring?

          In my experience as an educator, in a wide variety of educational contexts, I’d say testing often does not tell the tester much other than how a testee does on the test relative to other students.

          If that is the goal, in service of an overarching goal of ranking students relative to each other, then it does tell the tester what she/he wants to know – how to rank the students. Such a mechanism fits within the basic design of our educational paradigm, which is to perpetuate the social status quo.

          Much of this is not explicit for educators out working in educational environments, of course. But it was largely explicit for the founders of our basic educational paradigm (who wanted to train students to succeed in a hierarchical workplace), and at some point you have to ask: if our basic educational paradigm has been a mechanism for perpetuating the status quo for decades, does a tacit involvement become a kind of explicit involvement?

          Indeed, I think that many educators see their role as to distinguish between students of different ability – as much as to help each student achieve their potential equally (which is a much more complicated task).

          Most testing tells the teacher little about how to best help each student achieve his/her potential. Some criterion-referenced testing can do that, but you can do criterion-referenced assessments without testing. Most testing is norm-referenced, with the specific goal of evaluating each student with reference to the other students. As such, the goal of testing can be made rather explicit.

          And the real tragedy there is that it is all in service of a message to students that their role in their education is passive: They hand over the responsibility of assessment to the teacher – when a better goal in education is to encourage each student to be “meta-cognitive” with respect to assessment, and as such, accept responsibility for assessing their own learning. They can do that with a criterion matrix with no testing necessary.

          Testing is a tool to rank students, for the most part. Why should that be the goal of an educator?

          • Martha (Smith) says:

            “Indeed, I think that many educators see their role as to distinguish between students of different ability”

            Hard to agree or disagree, since “many” is such a vague word.

            ” – as much as to help each student achieve their potential equally (which is a much more complicated task).”

            I’m not sure what you mean by “help each student achieve their potential equally”. Do you mean that the teacher should put equal effort into helping each student achieve their potential? Or do you mean the teacher should help students achieve an equal proportion of their potential? Or if neither, what do you mean?

            To try to describe my own perspective as a teacher: I think that an important part of the teacher’s role is to try to see their students as individuals, who may have different strengths and weaknesses, and to help students improve where they are weak. I don’t see my job as being to rank students — although I do have to “group” them to give grades. Since different students do have different strengths and weaknesses, the grouping is more of a summary than a definitive ranking.

            I also see a large part of my job to be helping students learn from each other, and (more generally) helping them learn to learn. So, for example, I don’t give “straight lectures” — I include asking the class as a whole questions. I often say, “Get together with your neighbor to discuss this” before I take volunteers (or pick someone out if there are no volunteers) to give their answer. Then I often ask for a show of hands as to whether the other students agree or disagree. And then I will ask why or why not. So in (perhaps large) part, I am trying to teach them how to learn, including how to learn from and with others.

            • Joshua says:

              Martha –

              > Hard to agree or disagree, since “many” is such a vague word.

              Most teachers I have encountered in a wide variety of educational contexts – public schools (elementary, middle and high school) in different communities, workplace training, adult basic education, community colleges (including with seniors), undergraduate university and graduate schools – rely quite a bit on testing as a means to assign grades and effectively sort students by ability based on test scores. It has been rather rare that I’ve encountered teachers who whole-heartedly embrace an alternative to that approach, in part because relatively few educational institutions have a paradigm flexible enough to allow them to do so, but also in part because most of the teachers I’ve encountered have gotten to where they’ve gotten by functioning well within that system. As such, they tend to not perceive a need to utilize an alternative approach.

              I’m sorry that I can’t be more specific, as I don’t have any way to assign a %. The best I can do is describe my impressions of most of the teachers I’ve encountered. I certainly know that there are very few educational institutions that don’t use testing as a means to sort students.

              > I’m not sure what you mean by “help each student achieve their potential equally”. Do you mean that the teacher should put equal effort into helping each student achieve their potential? Or do you mean the teacher should “help students achieved an equal proportion of their potential? Or if neither, what do you mean?

              Yeah – I knew that was kind of ambiguous. I meant that one general goal is to sort students by ability, and another general goal is to have all students achieve their individual potential irrespective of their judged ability. Of course, those goals don’t have to be mutually exclusive – but in my experience “most” teachers (sorry for that word again) really do identify more with the former goal than the latter. That prioritization is, in fact, built into the very bones of our educational paradigm. The function is to sort students by ability and have a certain cohort pass out through the top.

              I know that comes across as melodramatic or polemical. But there are some key reasons why our educational system effectively perpetuates the social status quo. There are some key reasons why, in general, students who come from wealthier, more privileged backgrounds gain more from their educational experiences and are more able to leverage their educational experiences for future success. I have worked in some very privileged communities and some very prestigious schools and universities and seen how students from some backgrounds are effectively given a “membership card” at birth. Now it’s quite complicated, because the mechanics are somewhat indirect. We might say that some students get into better schools and do better at school because they score better on tests. But simply by virtue of the community into which some children are born, they are immediately more likely to be on track for doing better at those tests. I’m not suggesting that all teachers see themselves as having such a goal, or explicitly sign on to being a cog in a wheel of perpetuating the status quo. Indeed, some teachers teach precisely because they want to be a part of providing advancement to those students who don’t come from privileged backgrounds, and work very hard to help students to overcome the structural inequities that are built in.

              However, when (1) schools have functioned for decades as a mechanism for perpetuating the existing status quo, (2) (more to the point) schools largely rely on a testing paradigm which (a) doesn’t really test what students actually know very effectively and (b) provides relatively little information for teachers about how to address the needs of every individual student, (3) are based on a paradigm where there is a kind of competition between students to see how they can be sorted based on assumed ability, and (4) sort students on a basis where those with more privilege have a leg up as soon as they’re born, the use of testing does necessarily become a part of that process.

              > To try to describe my own perspective as a teacher: I think that an important part of the teacher’s role is to try to see their students as individuals, who may have different strengths and weaknesses, and to help students improve where they are weak. I don’t see my job as being to rank students — although I do have to “group” them to give grades. Since different students do have different strengths and weaknesses, the grouping is more of an “in a summary manner” than a definitive ranking.

              So then I’d say that you’re not in the category that I’m describing – but then the question is regarding your use of tests and grades as a part of the educational process. I mean, I’ve used tests and grades when teaching as well. Sometimes students really wanted that approach to be used, and it would have discouraged them if I had insisted on a different approach. And quite often I had to employ those methods or I couldn’t keep my job. But I always felt that by using those methods I was working against my overarching goals as an educator, although I usually felt that I could overcome their negative impact by directing my energy in other directions as well.

              > I also see a large part of my job to be helping students learn from each other, and (more generally) helping them learn to learn. So, for example, I don’t give “straight lectures” — I include asking the class as a whole questions. I often say, “Get together with your neighbor to discuss this” before I take volunteers (or pick someone out if there are no volunteers) to give their answer. Then I often ask for a show of hands as to whether the other students agree or disagree. And then I will ask why or why not. So in (perhaps large) part, I am trying to teach them how to learn, including how to learn from and with others.

              Sounds like a beautiful process. I’d like to be in your classes.

              I’m really not trying to bash teachers. Not sure why I started to lean in that direction. My point was much more to bash testing.

              • Joshua says:

                Sorry – I didn’t mean to assume anything about your use of tests, or your approval of them as a teaching tool or part of the educational process. I know it probably read like I was, and I wasn’t.

              • Martha (Smith) says:

                Joshua said,
                “I’m really not trying to bash teachers. Not sure why I started to lean in that direction. My point was much more to bash testing.”

                At times it was unclear whether you were bashing teachers or bashing testing. Part of the point I was trying to make is that there are lots of different types of tests. The best tests help students learn. And the teacher can have a big influence in making tests help students learn (or not). Which brings to mind something I have occasionally done in teaching: asking students to submit potential test questions, then having a class discussion of whether and why the questions are good or not. (Of course, not really possible in large classes – like the one with ~300 students I once taught.)

    • JFA says:

      There is not a large distinction between teaching to the test and teaching for understanding. While it is possible to do well on tests without really understanding the material, I’ve often found that understanding the material and doing well on the tests are highly correlated. And I have certainly never been in (nor know of anyone who has been in) the situation of really understanding the material and not doing well on tests.

      • There is actually a huge distinction. Take the Amateur Radio License tests… you can either learn a bunch of stuff about electrical engineering… or you can take the publicly available question pool and memorize which of the multiple choice answers is correct.

        • confused says:

          Yes. This is a bit of an extreme case, but a very good example of the general issue: people focus on those aspects of performance which are formally measured, sometimes to the detriment of overall performance.

          In education there is also a bit of an issue in terms of what the expected learning should be anyway.

          I mean, what is the purpose of public education?

          – to produce an educated citizenry capable of maintaining a democratic republic (original justification for US public education)
          – to make the next generation well-prepared for college/technical jobs and thus maintain the nation’s technical/economic position in the world (became important after Sputnik in the US, IIRC)
          – to provide basic knowledge useful for practical life (underemphasized, I think)

          Different goals will lead to different material being taught, and there is a finite amount of time/attention available.

          • Michael Nelson says:

            I like to think the best use of education is to promote self-actualization. Too many students come to school hungry, afraid, neglected or self-hating. If we settle for teaching them to survive, that’s all they’ll do.

  9. roger koenker says:

    The classical paper on student heights and milk is Student (1931), The Lanarkshire Milk Experiment, Biometrika, 23, 398-406.
    Yes, THAT Student. It is my favorite paper on what can go wrong with controlled experiments and should be much better known.

  10. steven t johnson says:

    Shouldn’t the clarifying remarks include a comparison between the correlation of student height and achievement to the correlation of teacher effectiveness and student achievement? And wouldn’t it have been desirable to double-check the correlation between teacher height and student achievement?

  11. Ron Kenett says:

    A side track to the above discussion is to invoke the ASA VAM statement https://www.amstat.org/asa/files/pdfs/POL-ASAVAM-Statement.pdf

    The interest in this is that it is a sort of precursor to the ASA p statement https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108?scroll=top&needAccess=true

    In my book on information quality we dedicate a section to evaluating the information quality of the ASA VAM statement. We rate it, using 8 information quality dimensions, at 57%, a rather low score (Section 6.3 in https://www.wiley.com/en-us/Information+Quality%3A+The+Potential+of+Data+and+Analytics+to+Generate+Knowledge-p-9781118890653).

    I presented this at the Boston JSM in 2014. It raised lots of interesting discussions.

    My comment on this was that the ASA should have better assessed the impact of the VAM statement before launching the p-statement, with its wide-ranging consequences, some good, many not so good.

    • Michael Nelson says:

      I see what you mean. A document with that title should have a statement up front in the form of: “Evidence indicates that VAMs have a high probability of being useful for ___. Evidence indicates VAMs have a high probability of being useless or harmful for ___.” Or, you know, at least cite some evidence somewhere in the document for God’s sake.

  12. Michael Nelson says:

    Your education colleagues seem intent on dismissing criticism of VAMs. That could just be my misapprehension of their tone, but the content of their responses is off-point in any case. Their initial critique of the paper, that correlations with scores are real and correlations with height are not, misses the point entirely. The authors’ argument is that (certain) VAMs are bad at measuring the real impact of teachers on student learning, and the evidence for their argument is that those same VAMs provide spurious estimates of teachers’ impact on student height. The fact that there is a real correlation between teacher and mean score, and not between teacher and height, is not a flaw in the authors’ argument–it is their argument. If a method estimates a non-zero effect when no effect exists, then the sign and magnitude of its estimate of an existing effect are questionable.
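    The noise mechanism behind that argument is easy to see in a toy simulation (all numbers here are hypothetical, chosen for illustration and not taken from the paper): even when teachers have exactly zero effect on an outcome, class means spread out by sampling noise alone, and a naive analysis reads that spread as a teacher-effect SD.

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 500       # hypothetical counts, for illustration only
class_size = 25
student_sd = 1.0       # all variation is student-level noise
# True teacher effect on this outcome (think: height) is exactly zero.

# Each class mean averages `class_size` independent student draws, so its
# sampling SD is student_sd / sqrt(class_size) even with no true effect.
class_means = rng.normal(0.0, student_sd / np.sqrt(class_size), n_teachers)

# A naive "value-added" reading treats the spread of class means as a
# teacher-effect SD -- nonzero here despite the true effect being zero.
naive_effect_sd = class_means.std(ddof=1)
print(f"naive teacher-'effect' SD: {naive_effect_sd:.3f} (true value: 0.0)")
```

    With these made-up numbers the naive SD comes out around 0.2 in outcome units, purely from noise, which is the sense in which a nonzero estimated “effect” on a null outcome undermines the estimates on real outcomes.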

    Your colleagues’ response to the blog post is that their “only quibble” with the authors is semantic, but then they belabor the point as if the authors’ “poor wording” were a sincere claim. Perhaps your colleagues were trying to imply that they now fully agree with the substance, while avoiding making any positive statements about the study. If we’re concerned with wording (as science communicators should be), a less convoluted response might’ve been something to the effect of: “We agree that the paper provides strong evidence that the conclusions drawn from single-year VAMs have low validity, though the author’s claim in the blog post that the VAMs themselves are ‘not valid’ is poorly phrased.” A direct statement like that would go a lot further toward resolving confusions over the study.

    The second critique in each response seems to amount to: “We don’t understand how this part of the findings is possible, and the authors do a bad job of explaining it,” both of which points are incontrovertible.

  13. conchis says:

    If there’s lots of noise in the raw 1-year teacher effects:
    1. wouldn’t these estimates get shrunk heavily back to the mean; and
    2. shouldn’t the scale of the residual noise be evident in the posterior distributions?
    Is it that these estimates should be adopting a Bayesian approach but aren’t, or that the residual noise is being ignored in interpreting them?
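    A minimal sketch of the shrinkage in question 1, under the standard normal-normal empirical-Bayes model (all variances here are assumed for illustration, not estimated from any real data): each raw 1-year estimate gets pulled toward the grand mean by the reliability ratio, which shrinks the spread of the estimates and reduces their error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: raw 1-year estimate = true teacher effect + noise.
n_teachers = 2000
signal_sd = 0.10   # assumed SD of true teacher effects
noise_sd = 0.20    # assumed sampling noise in a single year's estimate

true_effects = rng.normal(0.0, signal_sd, n_teachers)
raw_estimates = true_effects + rng.normal(0.0, noise_sd, n_teachers)

# Empirical-Bayes posterior mean: shrink each raw estimate toward the
# grand mean (0 here) by the reliability ratio signal_var / total_var.
shrink = signal_sd**2 / (signal_sd**2 + noise_sd**2)   # = 0.2 here
shrunk = shrink * raw_estimates

# Raw estimates overstate the spread of true effects; shrunk ones don't.
print(f"SD of true effects:     {true_effects.std(ddof=1):.3f}")
print(f"SD of raw estimates:    {raw_estimates.std(ddof=1):.3f}")
print(f"SD of shrunk estimates: {shrunk.std(ddof=1):.3f}")
```

    With noise twice the signal, the reliability ratio is only 0.2, so the heavy shrinkage conchis asks about is exactly what a Bayesian treatment would deliver; the question is whether published 1-year estimates actually apply it.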
