Transformations for non-normal data

Steve Peterson writes:

I recently submitted a proposal on applying a Bayesian analysis to gender comparisons on motivational constructs. I had an idea on how to improve the model I used and was hoping you could give me some feedback.

The data come from a survey based on 5-point Likert scales. Different constructs are measured for each student as scores derived from averaging a student’s responses on particular subsets of survey questions. (I suppose it is not uncontroversial to treat these scores as interval measures and would be interested to hear if you have any objections.) I am comparing genders on each construct. Researchers typically use t-tests to do so.

To use a Bayesian approach I applied the programs written in R and JAGS by John Kruschke for estimating the difference of means:
http://www.indiana.edu/~kruschke/BEST/

An issue in that analysis is that the distributions of student scores are not normal. There was skewness in some of the distributions and not always in the same direction. I wanted to find a distribution I could use for the likelihood that could better accommodate this sort of data.

Here is my idea: I re-scaled the data to [0,1] (actually to a slightly smaller interval, because otherwise there was a tendency to fit inappropriate bimodal curves in the posterior predictive checks) and modified one of Kruschke’s programs for comparing proportions to use beta rather than Bernoulli likelihoods.

What seems nice about the beta for this type of data is that it can be skewed in either direction and can even be bimodal. JAGS tracks the pair of credible shape parameters at each step in the chain of values sampled from the posterior; from these I can derive the means, standard deviations, differences in means and standard deviations, and effect sizes on the original scale.
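For reference, the moments on the original scale can be recovered from each sampled pair of beta shape parameters; a minimal sketch (Python for illustration; the shape values and the 1–5 rescaling bounds are assumptions, not taken from the actual analysis):

```python
# Recover mean/sd on the original scale from beta shape parameters.
# The pair (a, b) and the 1-5 bounds are illustrative assumptions.
a, b = 4.0, 2.0                      # one sampled (shape1, shape2) pair

mean01 = a / (a + b)                 # Beta(a, b) mean on [0, 1]
var01 = a * b / ((a + b) ** 2 * (a + b + 1))
sd01 = var01 ** 0.5                  # Beta(a, b) sd on [0, 1]

lo, hi = 1.0, 5.0                    # Likert-average scale before rescaling
mean_orig = lo + (hi - lo) * mean01  # undo the linear rescaling
sd_orig = (hi - lo) * sd01           # sd scales by the interval width
```

Applying this at each MCMC step gives posterior draws for the unrescaled summaries, from which differences and effect sizes follow directly.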

I get results very similar to those from t distributions when the data are fairly normal, but the posterior predictive graphs suggest that the betas do much better when the data are skewed.

So, do you think what I did makes sense? Can you think of a better alternative approach? It seems to me that a good model for this sort of data (scores constructed as averages of responses on Likert scales) would be of interest to social science researchers.

I replied: My quick thought is that if you’re averaging, the distribution should not make too much difference. If you are rescaling, I think it might make more sense to just rank all the data and then transform to z-scores, as everyone knows how to think about the normal distribution. Finally, I’d be very wary of trying to use the beta distribution to capture bimodality. I saw someone try to do this once, about 25 years ago, and it was a disaster. The trouble is that the bimodality that you find can be very sensitive to the parametric form that you are assuming. If you are interested in bimodality, I think you’re better off studying it more directly.
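The rank-then-z-score transformation can be sketched in a few lines (Python here, for illustration; the thread’s own code is R/JAGS, and the plotting position (r − 0.5)/n is one common convention among several):

```python
import numpy as np
from scipy.stats import rankdata, norm

def rank_to_z(x):
    """Rank the data (average ranks for ties), then map each rank to a
    standard-normal quantile via the plotting position (r - 0.5) / n."""
    r = rankdata(x)                  # ties get the average of their ranks
    return norm.ppf((r - 0.5) / len(r))

# An extreme value contributes only its rank, so it cannot dominate.
z = rank_to_z([1.2, 5.0, 3.3, 100.0])
```

Because only the ranks enter, replacing the outlier 100.0 with any value that keeps the same ordering yields exactly the same z-scores.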

To which Steve responded:

I guess I was hung up on having a very accurate model for the data. In frequentist statistics we rely on the CLT to give us approximate normality in the sampling distribution almost regardless of the shape of the population distributions when working with very large samples. Frequentist textbooks typically address the robustness of procedures when the normal population condition isn’t met, while Bayesian texts seem to emphasize the flexibility of modelling and the need to have a good model. I suppose what you are saying is that Bayesian estimation with a normal likelihood is also robust to departures from normality, so I shouldn’t worry.

My reply: I do think an accurate model for the data can be a good thing. I just think that the best way to get there is with regression models or mixture models, rather than trying to get cute by, for example, using the parameters of the beta distribution to catch multimodality. I have the same feelings about this as I do about the use of fifth-degree polynomial regression models to capture nonlinear patterns. In either case, I’d rather fit the relevant aspect of the data directly, rather than trying to get lucky and hope it happens to fit some family of curves that happens to be sitting around. (This is not to say that I disdain parametric models. It’s just been my experience that if you’re interested in multimodality, an explicit mixture model is the way to go.)
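An explicit mixture model of the sort mentioned here can be as simple as a two-component Gaussian mixture fit by EM. The following is an illustrative 1-D sketch, not code from the thread; a real analysis would use an established package, and the deterministic quantile-based initialization assumes non-degenerate data:

```python
import numpy as np

def em_gmm_1d(x, iters=200):
    """Minimal EM for a two-component 1-D Gaussian mixture (sketch only)."""
    x = np.asarray(x, float)
    # Deterministic initialization: put the components at the quartiles.
    mu = np.array([np.quantile(x, 0.25), np.quantile(x, 0.75)])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
                  / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of weights, means, and sds.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma
```

On clearly bimodal data this recovers the two modes directly, which is the point: the multimodality is modeled explicitly rather than hoped for in the tails of a beta.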

1. genauer says:

And what would be wrong with using the traditional Kruskal-Wallis test?

• Andrew says:

Genauer:

If you just rank all the data and then transform to z-scores, you’ll get something that’s basically the same as Kruskal-Wallis for the simple analysis, and it also allows you to do more elaborate analyses as desired. Kruskal-Wallis isn’t “wrong,” I’d just prefer something a bit more open-ended. Also, to me, doing the z-score and then applying the normal-theory analysis demystifies the procedure, so that instead of it being a clever way to get a p-value, it’s more clearly a method for gaining robustness by throwing away all the information in the data other than the ranks.

• genauer says:

Hmm,

in semiconductor manufacturing, K-W is used as a brute-force normalization of severely non-normal data, as a first step when screening vast amounts of data in a kind of dragnet manhunt, at the moment people realize that we are all overlooking something.

Isn’t the z-score sensitive to extreme outliers, though?

• Andrew says:

Genauer:

Read carefully: I’m saying first rank all the data, then transform to z-scores. This is almost identical to K-W, it’s just using normal theory rather than the distribution of order statistics as a “home base.” Outliers won’t be a problem, just as they are no problem with K-W (which also begins with the step of reducing all data to ranks).

In practice, if all you’re doing is the single test, I think the two approaches are essentially identical and you might as well do K-W, as it’s standard and thus requires less explanation.
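The near-equivalence of the two approaches can be checked numerically; a quick sketch with made-up, clearly separated groups (illustrative data only, and the two p-values agree approximately, not exactly):

```python
import numpy as np
from scipy.stats import rankdata, norm, kruskal, f_oneway

# Three made-up, clearly separated groups.
g1, g2, g3 = [1, 2, 3, 4, 5], [10, 11, 12, 13, 14], [20, 21, 22, 23, 24]

# Kruskal-Wallis on the raw data.
H, p_kw = kruskal(g1, g2, g3)

# Same idea via normal theory: pool, rank, map ranks to z-scores,
# then run an ordinary one-way ANOVA on the z-scores.
pooled = np.concatenate([g1, g2, g3])
z = norm.ppf((rankdata(pooled) - 0.5) / len(pooled))
z1, z2, z3 = z[:5], z[5:10], z[10:]
F, p_anova = f_oneway(z1, z2, z3)
```

Both routes reduce the data to ranks first; they differ only in which reference distribution serves as “home base,” which is why the normal-theory version extends painlessly to regression and hierarchical models.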

But if you want to go further (e.g., regression models, or partial pooling with unbalanced data as might occur if you have some groups with only one or two measurements), then I’d prefer my recommended approach as you can immediately throw the whole existing machinery of hierarchical linear models at the problem.

• genauer says:

Sorry Andrew, I was lost in my own thoughts.

What we actually did at the time was pool the data into separate groups (easy to separate by material supplier), then hunt for new variables we hadn’t considered so far, learning a lot more about physical interactions between production processes 500 steps and kilometers apart.

And coming from this specific (physics heavy) background, 5th degree polynomials are giving me the creeps.
In my business it is practically always a sign of not understanding the underlying mechanisms.

2. Thom says:

Wouldn’t an accurate model here be to fit a multilevel ordinal logistic regression with each subscale nested in a subset within a student? This seems more direct than the beta regression approach. Of course it may be overkill and comparing the simple averages may well work just fine.

3. Never Reviewed says:

The groups *are* different, so don’t worry about comparing them. Instead think about what process could generate all the individual variability, how this would be affected by gender, and attempt to model that. It is much more conducive to cumulative knowledge. Averages (the average male/female does not exist, so why study it other than as a stepping stone towards what is worth looking at deeper?) and transformations just seem to confuse matters in my opinion, maybe someone has a good justification but I have not come across it.