Robin Hanson and I discuss adjusting for variables you shouldn’t adjust for (for example, adjusting grades given sex, race, or pre-test scores)

In response to something Robin Hanson wrote on his blog (sorry I can’t find the exact link, I think it was at the end of July, 2008), I wrote:

Regarding the general point of adjusting for variables such as sex and race (and, for that matter, previous test scores) that you’re not “supposed” to adjust for, see the last (non-footnote) paragraph on page 4 that continues on to page 5 in this article.

The short version is that sometimes “fairness” or “the rules” are more important than inference about ability. Also, a decision rule that is optimal for each individual is not necessarily best for the group. Randomized rules are not in general optimal but they can provide a mix of outcomes that might be desirable in aggregate.

Robin responded:

Well I’d feel better if we had a coherent theory of “fairness” or “the rules” we could use to determine when we should not infer on all info available. Otherwise I fear those are just empty excuses for not inferring things we don’t want to infer. I can see the abstract possibility, but I’d want to see a concrete argument applied correctly to a specific circumstance, not just some vague hand-waving about “fairness.”

Then I wrote:

I think the course grades example is a real one. Suppose I give a pre-test at the beginning of the course, then at the end I give a final exam. For simplicity, suppose these are the only two pieces of information we have on the students, and imagine we can use them to predict future performance (e.g., grades in a future course). Once the course is over, the pre-test probably adds information (beyond what’s in the final exam alone), but it wouldn’t really be fair to use the pre-test to assign the final grade.
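
To make this concrete, here’s a toy simulation (mine, not anything from the original exchange; the noise levels are invented): a latent ability generates the pre-test, the final exam, and a future grade, and the pre-test noticeably improves the prediction of future performance even after the final exam is known.

```python
# Toy sketch: does the pre-test add predictive information beyond the final exam?
# All quantities are simulated; the noise levels are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
ability  = rng.normal(0, 1, n)             # latent "true ability"
pre_test = ability + rng.normal(0, 1, n)   # noisy pre-test score
final    = ability + rng.normal(0, 1, n)   # noisy final-exam score
future   = ability + rng.normal(0, 1, n)   # performance in a future course

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit of y on X (intercept included)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print("final exam only:  R^2 =", round(r_squared(final, future), 3))   # about 0.25
print("final + pre-test: R^2 =",
      round(r_squared(np.column_stack([final, pre_test]), future), 3)) # about 0.33
```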

The principle here, such as it is, seems to be that a course grade should be based on things done in the course. A justification for this principle is that, when a future employer (for example) sees a transcript, he or she can best understand it if the separate course grades represent separate pieces of information. As a Stat 100 instructor, my job in assigning grades is to record how well the students did in Stat 100; it’s not my job to second-guess transcript readers by giving my Bayesian estimate of the students’ true ability.

More generally, then, this is the principle that keeping information segregated has a benefit in making it easier for outsiders to make best use of the information. Similarly, if I buy a widget on Amazon and rate it, I’m making the best contribution to society if I accurately describe my own experience with the widget–rather than reading everyone else’s reviews and then Bayesianly shrinking my own judgment to the common mean. Any analyst can do this; the contribution I’m making as a rater is to describe my own experiences, and diluting this with other information will just make it more difficult for others to make use of what I’m telling them.
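
To put toy numbers on the Amazon example (the ratings and precisions below are made up): if I shrink my honest rating toward everyone else’s mean in the usual precision-weighted way, the review I post is mostly a restatement of what the other reviewers already said, and my independent signal is lost.

```python
# Hypothetical illustration of the shrinkage-toward-the-common-mean that the
# paragraph above argues a rater should *not* do.  All numbers are invented.
my_rating        = 2.0   # my honest experience with the widget (1-5 scale)
others_mean      = 4.3   # average of everyone else's reviews
my_precision     = 1.0   # weight I give my single experience
others_precision = 9.0   # weight implied by the many other reviews

# Precision-weighted average (the normal-normal posterior mean):
shrunken = (my_precision * my_rating + others_precision * others_mean) / (
    my_precision + others_precision
)
print(shrunken)  # about 4.07: my "2" has nearly vanished from the report
```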

And Robin wrote:

Well I completely agree that we want to keep info sources as modular as possible, so that we simplify as much as possible the task of combining info sources, including the task of updating the total given updates to each part. And in the case of course grades I agree that modularity suggests one grade a course only on work done for that course. But it seems to me that in this case we were discussing, the more modular choice is in fact to adjust the test scores for what we know about differing variances of different groups. If we don’t in fact do that with the test score, I don’t see another plausible process whereby that info will be included in the final result. Are you more imaginative here than I?
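
For concreteness, here is a minimal sketch of the kind of adjustment Robin is describing, as I understand it: with a normal prior on ability within each group and additive noise on the test score, the score gets pulled toward the group mean by an amount that depends on the group’s variance. The numbers are invented.

```python
# Normal-normal shrinkage of a single test score toward a group mean.
# Assumes ability ~ N(group_mean, group_sd^2) and score = ability + noise,
# noise ~ N(0, noise_sd^2).  All numbers below are made up for illustration.

def adjusted_score(score, group_mean, group_sd, noise_sd):
    """Posterior mean of ability given one noisy test score."""
    reliability = group_sd**2 / (group_sd**2 + noise_sd**2)
    return group_mean + reliability * (score - group_mean)

# Same raw score and group mean, different group variances:
print(adjusted_score(130, group_mean=100, group_sd=15, noise_sd=10))  # about 120.8
print(adjusted_score(130, group_mean=100, group_sd=5,  noise_sd=10))  # 106.0
```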

5 thoughts on “Robin Hanson and I discuss adjusting for variables you shouldn’t adjust for (for example, adjusting grades given sex, race, or pre-test scores)”

  1. I think you can also view this as about how statistical models relate to the real world.

    In some settings statistics is used to make an inference about a real physical quantity – you've a sample and want to know how many black marbles there are in an urn. In others you're trying to estimate something more intangible and theoretical – like the 'true underlying cancer rate' based on the number of people in a town and the number who've got cancer.

    You can use the same maths for both problems. But something about the extra-statistical aspects of the setting makes the two feel very different. In one you're trying to make an inference about a genuine quantity: it really exists, and you could find out exactly what it is in the real world. In the other, what you're trying to pin down is more a fiction that makes it easier for you to understand the data the world's throwing at you, rather than something that is genuinely out there in nature.

    Now grades are real in a sense that 'true underlying ability' isn't. I get the sense that people view grades as just more substantial than something which has a looser connection with the real world, and only want statistics to go so far. If you were trying to adjust for 'physical' measurement error – your marking getting sloppy toward the end of a pile of papers and you wrongly assigning marks – I don't think the same concerns would exist, and using the pre-test to pick that up wouldn't be an issue. It's just something about when statistics tries to go beyond that that freaks people out.

  2. I just read Robin's original post and the ensuing comment thread, and I find it both interesting and disappointing that the conversation there and here seems to be more about political feasibility and morality than about the functional form of Robin's assumptions. I mean, Robin's conclusion essentially rests on the (IMHO strong) assumption that the measurement error is iid. What reason do we have to believe that the measurement error is uncorrelated with the test score or the group variance? (Incidentally, someone did ask what would happen if the error were multiplicative, but they were ignored; besides, the error cannot be purely multiplicative anyway, since someone scoring at the mean would then have zero error.)

    Anyway, my initial reaction was that a more acceptable solution would be to get students to take more tests (I never understood the model of basing students' course grades on one final exam). In fact, the pre-test scenario doesn't strike me as being as bad as the Amazon scenario. In the Amazon case, the problem is statistical, because you would be using other people's observations to form your own, so the variances of the observations are not equal AND the observations are autocorrelated. But in the pre-test case, both scores are independent and observable only to you, so they escape both these problems.

    But to counter myself, I later realized Robin wasn't quite clear about what he meant by "noisy measure" — whether that means (a) a student's one-time test score is a noisy measure of that student's mean test score, (b) a student's mean test score is a noisy measure of underlying ability, or (c) both (a) and (b). Taking more tests would help with (a), but not with (b). My guess is Robin meant (b), but if so I think it was pretty careless of him not to state it that way.
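
To put rough numbers on the (a)/(b) distinction in the comment above, here is a small simulation of my own (the variances are invented): averaging more test sittings drives down the sitting-to-sitting noise of (a), but the error from (b), the gap between a student's mean score and underlying ability, puts a floor on how accurate the average can get.

```python
# Averaging k test sittings: helps with (a), does nothing for (b).
# All variances are arbitrary assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
ability    = rng.normal(0, 1, n)
mean_score = ability + rng.normal(0, 0.5, n)  # (b): persistent gap between ability and mean test score

def average_of_k_tests(k):
    """Average of k sittings, each a noisy draw around the student's mean score (that's (a))."""
    sittings = mean_score[:, None] + rng.normal(0, 0.8, (n, k))
    return sittings.mean(axis=1)

for k in (1, 4, 16):
    rmse = np.sqrt(((average_of_k_tests(k) - ability) ** 2).mean())
    print(f"{k:2d} tests: RMSE vs. underlying ability = {rmse:.3f}")
# RMSE falls from about 0.94 toward 0.5 but never below it: the (b) floor.
```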
