If you’re interested in the Box-Cox power transformation . . .

You can read this post from Danielle Navarro.

I also recommend Section 7.6 of Bayesian Data Analysis, which extends an example of Rubin where the Box-Cox power transformation fails. We do some cool stuff there, including a posterior predictive check that reveals the problem, and an extension of the model by incorporating a bound on the extreme end of the distribution.

The entire BDA3 book can be downloaded at the above link, but for convenience I’m including Section 7.6 right here for you.
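
For readers who want the definition in front of them: the Box-Cox transformation with power λ maps a positive value y to (y^λ − 1)/λ when λ ≠ 0, and to log(y) when λ = 0. Here is a minimal Python sketch of that definition and its inverse. The function names `box_cox` and `inv_box_cox` are my own choices for illustration; they are not taken from Navarro's post or from BDA.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transform of a single positive value y with power lam."""
    if y <= 0:
        raise ValueError("the Box-Cox transform is only defined for y > 0")
    if lam == 0:
        # The lam -> 0 limit of (y**lam - 1) / lam is log(y).
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inv_box_cox(z, lam):
    """Map a transformed value z back to the original scale."""
    if lam == 0:
        return math.exp(z)
    base = lam * z + 1.0
    if base <= 0:
        # No positive y maps to this z for this lam.
        raise ValueError("z is outside the range of the transform for this lam")
    return base ** (1.0 / lam)
```

The check inside `inv_box_cox` is a reminder that for λ ≠ 0 the transformed values are bounded on one side, so a normal model on the transformed scale can only ever be an approximation.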

8 thoughts on “If you’re interested in the Box-Cox power transformation . . .”

  1. Interesting post.

    It reminded me of a question I have wondered about for some time.

    When we are presented with data from a continuous distribution that is asymmetric or does not have support on the whole real line, and we have a bunch of covariates we want to use to explain the data, our first course in regression may give us methods for transforming the response. Perhaps this improves the fit, but primarily it seems to help us approximately satisfy the normality assumption used for tests and interval estimation.

    Later on we learn about glm, gam, etc., where the method isn’t to transform the response data but to use a link function to connect the covariates to specific parameter(s) in some possibly non-normal distribution for the response.

    The glm is more elegant, for sure, and the transform-the-response-first approach seems like it would often lead students to forget about all those ‘degrees of freedom’ used up in choosing the transformation. But when both options are available, they don’t seem like they would be equivalent.

    Is there theoretical or empirical evidence in the literature that, when both a response-variable transformation and a glm are available, the glm will always be better?

  2. Quoting from the (linked) blog post:

    “However, it’s sometimes worth stating the obvious because in doing so we can automatically rule out a great many possible candidates: for example, the log-normal distribution, much beloved by pharmacometricians in other contexts, will in this particular instance be unsuitable for our needs.”

    It wasn’t explained why. So I am curious why the log-normal distribution, specifically, has to be ruled out.

    • Sam:

      She says it right before there in the post: “if we want a family of probability distributions that is flexible enough to be able to independently describe the location (e.g., mean), scale (e.g., variance), and skewness of the data, your distributional family will require at least three parameters.” The lognormal distribution has only two parameters.

      • Why should the three parameters be independent? If she had taken that as a starting point then I would understand better. But I don’t.

        Also, why doesn’t she take several alternative biologically based models of growth and see what they predict for the distribution of weights? That’s what I would do. Maybe some of them generate a better fit with fewer independent parameters.

        • She didn’t say the family of distributions must be flexible enough to specify location, scale, and skew independently. She said that _if_ you want such a family, then you will need at least three parameters. This is close to being a tautology.

          I don’t know what you mean by ‘alternative biologically based models of growth’. Even if I did, I suspect I would be skeptical of such an approach: this seems to me to require a very empirical approach. We know people can be very heavy (compared to others their age) if they sit around drinking soft drinks and eating Cheetos, or very light if they are anorexic or have a tapeworm or whatever. I don’t think I’d trust a model that claims to be able to give a distribution, or even a distributional form.

          You may be able to search and find a two-parameter function that fits the data OK, but then the search through the space of all functions is itself a search through a parameter space.

        • “She said that _if_ you want such a family, then you will need at least three parameters. This is close to being a tautology.”

          Why do you want such a family? Simple question. Not answered. She just says it is obvious that you do. I think that is wrong. You don’t need to capture three properties to describe the data.

          “Even if I did, I suspect I would be skeptical of such an approach: this seems to me to require a very empirical approach.”

          I am coming at this from a scientific angle, rather than a statistical one. In biology, the classic work is Huxley’s “Problems of Relative Growth” (https://www.press.jhu.edu/books/title/2195/problems-relative-growth), which shows, among many other things, that the log-normal distribution, with appropriate parameters differing by application, is a good description of a vast amount of data (crab limb lengths, deer antler lengths, human head sizes, etc.). The underlying biological model here is that organisms (not just humans) grow by percentages that are (population-wise) independent of the previous ones. A product of many independent positive random factors is approximately log-normal (its log is a sum of independent terms, so the central limit theorem applies), hence…

          I am inclined to think that she is coming at this from a point of view that is too statistical and not scientific enough. She should fix that by talking to some mathematically inclined biologists.

        • Anon:

          Sometimes we work on applied statistics problems, sometimes we write statistics textbooks or tutorials. I’m not a big fan of the Box-Cox model myself, but it’s out there, and people use it and have questions about it, so I think it can be useful to explore and understand it, which is what Navarro does in that post. What to do in a particular applied problem is another story.
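
Two of the claims in the thread above are easy to check numerically. The sketch below is not from Navarro's post or from any commenter. It uses made-up growth increments (uniform between -5% and +15% per step, over 50 steps) purely for illustration, and all function names are hypothetical. First, it simulates sizes built up as a product of independent positive growth factors and compares the skewness of the sizes with the skewness of their logs. Second, it computes the skewness that a log-normal is forced to have once its mean and standard deviation are fixed, which is the sense in which a two-parameter family cannot set location, scale, and skewness separately.

```python
import math
import random


def simulate_size(n_steps, rng):
    """Final size after n_steps of independent multiplicative growth."""
    size = 1.0
    for _ in range(n_steps):
        # Assumed growth increment: uniform between -5% and +15% per step.
        size *= 1.0 + rng.uniform(-0.05, 0.15)
    return size


def mean_sd(xs):
    """Sample mean and (population-style) standard deviation."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return m, sd


def sample_skewness(xs):
    """Third standardized moment of a sample."""
    m, sd = mean_sd(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)


def lognormal_skewness(mean, sd):
    """Skewness implied by a log-normal with the given mean and sd.

    For a log-normal, cv**2 = exp(sigma**2) - 1 where cv = sd / mean,
    and skewness = (exp(sigma**2) + 2) * sqrt(exp(sigma**2) - 1),
    which simplifies to (cv**2 + 3) * cv.
    """
    cv = sd / mean
    return (cv ** 2 + 3.0) * cv


rng = random.Random(1)  # fixed seed so the run is reproducible
sizes = [simulate_size(50, rng) for _ in range(5000)]
logs = [math.log(s) for s in sizes]

skew_raw = sample_skewness(sizes)   # skewness on the original scale
skew_log = sample_skewness(logs)    # skewness after taking logs

m, sd = mean_sd(sizes)
skew_implied = lognormal_skewness(m, sd)  # what a log-normal fit would force

print("skewness of sizes:      ", round(skew_raw, 2))
print("skewness of log sizes:  ", round(skew_log, 2))
print("log-normal implied skew:", round(skew_implied, 2))
```

Under the multiplicative-growth story, `skew_log` should come out near zero while `skew_raw` is clearly positive. The last line shows the other side of the argument: once the mean and standard deviation are matched, the log-normal leaves no freedom in the skewness, so if the observed skewness differs from `skew_implied`, a third parameter is needed to capture it.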
