# Dumpin’ the data in raw

Benjamin Kay writes:

I just finished the Stata Journal article you wrote. In it I found the following quote: “On the other hand, I think there is a big gap in practice when there is no discussion of how to set up the model, an implicit assumption that variables are just dumped raw into the regression.”

I saw James Heckman (famous econometrician and labor economist) speak on Friday, and he mentioned that using test scores in many kinds of regressions is problematic, because the assignment of a score is somewhat arbitrary even if the order is not. He suggested that positive, monotonic transformations of scores contain the same information yet lead to different estimates and standard errors if, in your words, one just “dumped them into the regression.” It was somewhat of a throwaway remark, but considering it longer, I imagine he means that a given difference in test scores need not have a constant effect. The remedy he suggested was to recalibrate exam scores so that they have some objective meaning. For example, on a mechanics exam scored between one and a hundred, one can pass (65) only by successfully rebuilding the engine in the time allotted, with better scores indicating higher quality or faster work. In this example one might change the score to a binary variable for passing or not, an objective test of a set of competencies. However, doing that clearly throws away information.
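Heckman's point is easy to see in a small simulation: two positive monotone recodings of the same scores preserve the ordering of the students but give a linear regression different fits. This is a minimal sketch with made-up data (the log-linear "true" relation and all the numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(40, 100, size=200)                    # hypothetical raw exam scores
outcome = 2.0 * np.log(scores) + rng.normal(0, 0.1, 200)   # true relation is nonlinear

def r_squared(x, y):
    """R^2 from a simple linear regression of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Raw scores and squared scores carry the same ordering...
print(r_squared(scores, outcome))       # fit using the raw metric
print(r_squared(scores**2, outcome))    # ...but the linear model fits them differently
```

Any monotone recoding leaves rank-based quantities unchanged, but the linear model's fit, coefficients, and standard errors all depend on which metric you happened to dump in.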

Do you or the readers of the Statistical Modeling, Causal Inference, and Social Science blog have any advice here? Transforming the variable is problematic, yet the critique of using it raw seems a serious one, while narrowly mapping it onto a set of objective discrete skills seems to destroy a lot of information. Percentile ranks on exams might substitute for the raw scores in many cases, but they introduce other problems, such as in comparisons between groups.

My reply: Heckman’s suggestion sounds like it would be good in some cases but it wouldn’t work for something like the SAT which is essentially a continuous measure. In other cases, such as estimated ideal point measures for congressmembers, it can make sense to break a single continuous ideal-point measure into two variables: political party (a binary variable: Dem or Rep) and the ideology score. This gives you the benefits of discretization without the loss of information.

In chapter 4 of ARM we give a bunch of examples of transformations, sometimes on single variables, sometimes combining variables, sometimes breaking up a variable into parts. A lot of information is coded in how you represent a regression function, and it’s criminal to take the data as they appear in the Stata file and just dump them in raw. But I have the horrible feeling that many people either feel that it’s cheating to transform the variables, or that it doesn’t really matter what you do to them, because regression (or matching, or difference-in-differences, or whatever) is a theorem-certified bit of magic.

## 7 thoughts on “Dumpin’ the data in raw”

1. most definitely, time spent understanding the data before running any regressions is time well spent

while we caution against dumping in raw, we should also caution against indiscriminate standardization or normalization of data, which is also common practice. These procedures change the underlying distributions of the data and should be thought through before use.

there is software out there that claims to form any and all functions of the variables (powers, roots, cosines, etc.) and to pick out the "right" transformation. I suppose you might end up with a predictor like x1^2 / cos(x2). Does anyone have stories on this?

2. If the response has order but no stable magnitude, this may be a candidate for some of the purely ordinal techniques. Kurt Wittowski has been looking at multivariate U-statistics for this, and, in Psychometrics, Norman Cliff proposed looking at pairwise dominances.

This is similar to the problem of relating tactile and other sensory "magnitudes" to binary or choice responses. The "magnitudes" are arbitrary so you end up with a measurement that changes from sample to sample, even if the order is stable.

3. It sounds to me like the inferential point Heckman was making was a reference to measurement theory and the idea of admissible statistics. Reading up on this area has helped me think about these problems, even though there's still a lot of disagreement between statisticians (pro-transformation) and measurement theorists (anti-transformation).

Regarding the example: I think it's not quite right to view the test score as a variable that is transformed, thereby throwing away information. The problem is really that the test score is a summary statistic, and by using it you've already discarded other variables, transformed them, and thrown away information, without ever really getting a say in the matter. The real questions are (1) can you use more of the original data and throw away less information, and (2) can you make a better summary statistic?

Suppose you have an exam with five questions worth one point each. Given the choice, would you (a) sum them to create a score variable ranging from 0 to 5, (b) use five binary variables, one for each question, or (c) use some other manipulation? It depends on the situation, but (a) is a particular strategy with particular assumptions. The same issue arises with transformations: there's no reason why, by transforming what you're given, you can't create a metric that's better for your purposes. The classic example is transforming so that statistical assumptions (normality, etc.) are met, but you could also want a metric that predicts labor outcomes better than the original, or that fulfills some other purpose.

The problem is if you use the data as you're given it you're implicitly buying in to particular strategies and analytical aims without really considering the alternatives.
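The five-question exam in comment 3 can be made concrete: the summed score, strategy (a), is the item-level model, strategy (b), with all five coefficients forced to be equal, so (b) can never fit worse in-sample; it only costs degrees of freedom. A small numpy sketch with simulated data (the item weights and sample size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
items = rng.integers(0, 2, size=(300, 5)).astype(float)   # five 0/1 exam questions
# Hypothetical outcome where question 5 matters far more than the others
outcome = items @ np.array([0.2, 0.2, 0.2, 0.2, 2.0]) + rng.normal(0, 0.5, 300)

def rss(X, y):
    """Residual sum of squares from least squares with an intercept."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ beta) ** 2).sum())

total = items.sum(axis=1, keepdims=True)    # strategy (a): one summed score
print(rss(total, outcome), rss(items, outcome))   # (b) fits at least as well as (a)
```

Because the summed-score design lies in the column space of the item-level design, the comparison is a nested-model one: summing is fine exactly when the equal-coefficients assumption is roughly right.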

4. Kaiser: Yes, I'm assuming you'd pick the transformation using subject-matter understanding, perhaps enhanced by some data analysis, as illustrated in chapters 3-5 of ARM.

Bill: U-statistics etc are fine, but to my mind this sort of thing distracts from the more interesting and important modeling goals. I'd rather just use a reasonable transformation and go from there.

George: IRT is great but it doesn't solve the problem of how to combine, break up, and transform regression predictors.

Alex: I agree with you, except in your third paragraph where you mention normality as your assumption to focus on. In my experience, the distribution of the error term is the least important assumption of the regression model.

5. I think I must be missing something. What this post seems to be about is the basic issue of what to do when the relationship between an explanatory variable and an outcome is not expected to be linear. This issue is discussed in just about every basic stats book, where standard advice is to transform the explanatory variable (take the square, the square root, the logit, etc.) to make the observed or expected relationship more linear. Where a casual guess at a good transformation isn't sufficient, people can use the fairly well-developed research on Generalized Linear Models (for which McCullagh and Nelder is a classic book). Kaiser's warning, in an earlier comment, about just searching for crazy transformations that happen to give a nice linear relationship, is a good one, but not a new one in the GLM world.
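For readers who haven't met GLMs: a binomial GLM with a logit link can be fit by iteratively reweighted least squares in a few lines. This is an illustrative numpy sketch, not production code, and the data and coefficients are simulated:

```python
import numpy as np

def fit_logit_irls(x, y, n_iter=25):
    """Fit a binomial GLM with logit link by iteratively reweighted least squares."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                        # linear predictor
        mu = 1 / (1 + np.exp(-eta))           # inverse link: logistic
        w = mu * (1 - mu)                     # GLM working weights
        z = eta + (y - mu) / w                # working response
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return beta

rng = np.random.default_rng(3)
x = rng.normal(size=400)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))        # true intercept 0.5, slope 1.5
y = (rng.uniform(size=400) < p).astype(float)
print(fit_logit_irls(x, y))                   # should roughly recover [0.5, 1.5]
```

The point of the link function is that the model stays linear on the eta scale while the response can live on another scale entirely, which is one principled alternative to hand-picking a transformation.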

All of the above is what I would have said if I had only seen Kay's question. But none of the responses mention generalized linear models or the fact that "link functions" have a long research history, which makes me wonder if I am hopelessly wrong or out of touch. Has something happened over the past decade that invalidates the large body of research on link functions and the like? Has statistical practice moved on to something else?

Lest there be any doubt: I am not being sarcastic or dismissive. I don't keep up with progress in statistics as well as I should — this blog is pretty much the way I find out about new things — so there's a lot going on that I don't know about.