The reason for log transforming your data is not to deal with skewness or to get closer to a normal distribution; that’s rarely what we care about. Validity, additivity, and linearity are typically much more important.

The reason for log transformation is that in many settings it makes additive and linear models make more sense. A multiplicative model on the original scale corresponds to an additive model on the log scale. For example, consider a treatment that increases prices by 2%, rather than a treatment that increases prices by $20. The log transformation is particularly relevant when the data vary a lot on the relative scale. Increasing prices by 2% has a much different dollar effect for a $10 item than for a $1000 item. This example also gives some sense of why a log transformation won’t be perfect either, and ultimately you can fit whatever sort of model you want. But, as I said, in most cases I’ve seen of positive data, the log transformation is a natural starting point.
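To make the price example concrete, here is a minimal Python sketch. The helper name `apply_pct` and the two prices are just illustrative choices. It shows that a 2% increase gives very different dollar changes for a $10 item and a $1000 item, but the same change on the log scale.

```python
import math

def apply_pct(price, pct):
    """Return the price after a multiplicative increase of pct percent."""
    return price * (1 + pct / 100)

for price in [10.0, 1000.0]:
    new_price = apply_pct(price, 2)
    dollar_change = new_price - price                    # depends on the price level
    log_change = math.log(new_price) - math.log(price)   # always log(1.02)
    print(f"${price:8.2f}: dollar change {dollar_change:6.2f}, log change {log_change:.4f}")
```

On the log scale the treatment is a single additive shift of log(1.02), which is what an additive linear model can represent directly.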

The above is all background; it’s stuff that we’ve all said many times before.

What’s new to me is this story from Shravan Vasishth:

You’re routinely being cited as endorsing the idea that model assumptions like normality are the least important of all in a linear model:

Non-normality is relatively unimportant; at worst you just may lose a bit of power. I strongly recommend @StatModeling & Hill (2007, pp. 45-47)'s summary of key regression model assumptions. Normality of errors literally gets LOWEST priority. My experience supports this. 3/3 pic.twitter.com/R0BfQCoxdK

— Roger Levy (@roger_p_levy) December 8, 2018

This statement of yours is not meant to be a recommendation to NHST users. But it is being misused by psychologists and psycholinguists in the NHST context to justify analyzing untransformed all-positive dependent variables and then making binary decisions based on p-values. Could you clarify your point in the next edition of your book? I just reviewed a paper in JML (where we published our statistical significance filter paper) by some psychologists who insist that all data be analyzed using untransformed reaction/reading times. They don’t cite you there, but threads like the one above do keep citing you in the NHST context. I know that on p 15 of Gelman and Hill you say that it is often helpful to log transform all-positive data, but people selectively cite this other comment in your book to justify not transforming.

There are data-sets where 3 out of 547 data points drive the entire p<0.05 effect. With a log transform there would be nothing to claim and indeed that claim is not replicable. I discuss that particular example here.

I responded that (a) I hate twitter, and (b) In the book we discuss the importance of transformations in bringing the data closer to a linear and additive model.

Shravan threw it back at me:

The problem in this case is not really twitter, in my opinion, but the fact that people . . . read more into your comments than you intended, I suspect. What bothers me is that they cite Gelman as endorsing not ever log-transforming all-positive data, citing that one comment in the book out of context. This is not the first time I saw the Gelman and Hill quote being used. I have seen it in journal reviews in which reviewers insisted I analyze data on the untransformed values.

I replied that this is really strange given that in the book we explicitly discuss log transformation.

From page 59:

It commonly makes sense to take the logarithm of outcomes that are all-positive.

From page 65:

If a variable has a narrow dynamic range (that is, if the ratio between the high and low values is close to 1), then it will not make much of a difference in fit if the regression is on the logarithmic or the original scale. . . . In such a situation, it might seem to make sense to stay on the original scale for reasons of simplicity. However, the logarithmic transformation can make sense even here, because coefficients are often more easily understood on the log scale. . . . For an input with a larger amount of relative variation (for example, heights of children, or weights of animals), it would make sense to work with its logarithm immediately, both as an aid in interpretation and likely an improvement in fit too.

Are there really people going around saying that we endorse not ever log-transforming all-positive data? That’s really weird.

Apparently, the answer is yes. According to Shravan, people are aggressively arguing for not log-transforming.

That’s just wack.

Log transform, kids. And don’t listen to people who tell you otherwise.

I have trouble believing log transformed data is easier for most people to interpret, no matter what it represents. Any transformation is adding another step to the process that generated the data, making it more difficult to understand.

I think log transformed data adds complexity to understanding the transformed data, but reduces complexity of the model. It’d be more reasonable to fit linear models and people understand those better than exponential models.

I could believe that this is a wash.

Depends on how you define “most people”. In some disciplines, thinking in terms of order of magnitude is the de facto standard.

My understanding was that linearity is *not* an assumption of linear regression. Rather, we assume that E[e|X] ~ N(0, sigma^2) where e are the residuals, assumed to have a conditional mean of zero. Incidentally, linear relationships usually have better behaved residuals and the confusion between linearity of the data and conditional distribution of the residuals stems from there. Is it erroneous?

It’s more complicated than that.

y = a + b*x+c*x^2 + err

is *linear* in the coefficients a,b,c but nonlinear in the covariate x…

Normally people wouldn’t call this a “linear regression” though, because they mean “linear in x”, i.e., a line.

In any case, suppose you get some data that really is from the above formula, and you fit instead

y = q + r*x + err

you’ll still get a fit, and you’ll still get that the mean of all the err[i] values is zero in your dataset, but you won’t have E[e | x] = 0 for all x, nor will you have q = a and r = b (unless c is negligibly small for the range of x)
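A small sketch of that point in Python, using only the standard library. The coefficients a, b, c and the grid of x values are made up for illustration, and the data are noise-free so that the pattern comes entirely from the misspecification.

```python
def ols_line(xs, ys):
    """Closed-form least-squares fit of y = q + r*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sxx
    q = mean_y - r * mean_x
    return q, r

# Hypothetical "true" quadratic relationship.
a, b, c = 1.0, 2.0, 0.5
xs = [float(i) for i in range(11)]
ys = [a + b * x + c * x * x for x in xs]

# Fit the straight line y = q + r*x to the quadratic data.
q, r = ols_line(xs, ys)
resid = [y - (q + r * x) for x, y in zip(xs, ys)]

overall_mean = sum(resid) / len(resid)   # zero by construction of least squares
print("q, r:", q, r)                      # not equal to a, b
print("mean residual:", overall_mean)
print("residual at x=0, x=5, x=10:", resid[0], resid[5], resid[10])
```

The residuals average to zero over the whole dataset, but they are positive at the ends of the x range and negative in the middle, so the conditional mean of the error given x is not zero.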

Usually the justification for a linear regression like y = a + b*x + err is something about Taylor’s theorem, every sufficiently nice function can be represented as a Taylor series around a point and the error will be small in some neighborhood of x. If that neighborhood is the entire range of application for your formula, then you need not consider more complicated formulas.

> y = a + b*x+c*x^2 + err

I see this as a linear regression; transformations (given some nice behavior) of the DV don’t change the assumptions as far as I know. We can have g(x) = x² and y = a + b*x + c*g(x) + err as a linear regression. At least as I was taught.

Consider the expectation function y_hat = f(x, a, b, c) = a + b * x + c * x^2. As Daniel Lakeland pointed out, it’s a second-order polynomial function of the predictor x, but it’s a linear function of the coefficients a, b, and c. If you consider y = g(x, a, b, c, sigma) = f(x, a, b, c) + epsilon, where epsilon ~ normal(0, sigma), then you have a stochastic function.

Thanks for the explanations!

Logarithms may be hard for some people to understand(*), but they’re required to get the model closer to the process that generated the data. It’s very common to have multiplicative error. For example, when Y is the value of a financial instrument in the face of compound interest; when Y is the size of a population of salmon in the face of weather and predation; or when Y is the mass of a physical body in the face of forces exerted by other bodies.

(*) Back when I thought I might want to move from my professor job to Wall St, I took the intro to finance class in the MBA program at Carnegie Mellon. After many boring high-school level lectures on compound interest, we finally have to solve for an implied interest rate. I think OK, logarithms, at least we’re up to basic algebra now. Nope, professor says plug it into Excel because this isn’t a course about algebra. I can understand not talking about instantaneous interest and calculus, but no logarithms?

I wonder how many excel spreadsheets are breaking right now due to assuming interest rates would always be greater than zero.

What about data with negative values? How do we fix that? Take a ratio? When dealing with any sort of economic or accounting data log transforms are absolutely necessary. But some economic data can be negative, e.g. earnings.

Is the log-transform issue related to co-integration?

analyze the negative data separately from the positive data, and use the absolute value for the negative analysis?

It depends on the process. What Daniel Lakeland suggests sounds like the right thing to do for modeling credit and debt in a single variable. Faced with compound interest, as most credit and debt are, you get uncertain exponential growth in absolute value; -$1000 going to -$1100 or $10,000 going to $11,000 is the same 10% multiplicative error.

On the other hand, if you’re looking at something like a return on investment, you may get negative variables (you invest $10K and your investment’s now worth $9K, so the return is -$1K). But then it’s often better to look at the value of the underlying asset, not a transform like return that’s sensitive to some arbitrary sunk cost.

The bottom line is that it’s all problem dependent! That’s why we don’t just throw our spreadsheets into a hopper and let robo-statistician do our work for us.

one could use asinh(0.5x) = log(0.5x + sqrt(0.25x^2 + 1))
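For anyone who wants to check the identity numerically, here is a short Python sketch. The function names are made up for illustration; `math.asinh` is the standard library's inverse hyperbolic sine.

```python
import math

def asinh_half(x):
    """asinh(0.5 * x), defined for negative, zero, and positive x."""
    return math.asinh(0.5 * x)

def asinh_half_explicit(x):
    """The same quantity written out with log and sqrt."""
    return math.log(0.5 * x + math.sqrt(0.25 * x * x + 1.0))

for x in [-50.0, -1.0, 0.0, 1.0, 50.0]:
    print(x, asinh_half(x), asinh_half_explicit(x))

# For large positive x the transform is close to log(x); for large negative x
# it is close to -log(-x); near zero it is roughly linear (about x / 2).
print(asinh_half(1000.0), math.log(1000.0))
```

The factor of 0.5 is what makes the transform line up with log(x) for large x, since log(0.5x + 0.5x) = log(x).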

Does that seem “more interpretable” to you?

It depends on whether you know what an inverse hyperbolic sine is. Just squinting at the formula’s going to be hard to interpret if you don’t have that function chunked as performing some useful role. Once that function’s a critical part of the generative process, then I would think it would be more interpretable than an untransformed version to someone who understood the process.

It’s like saying the error term is normal with scale sigma. Statisticians know what that means. But really it’s just shorthand for saying the error is distributed with density p(epsilon | sigma) = exp(-(epsilon/sigma)^2 / 2) / sqrt(2 * pi * sigma^2).

Or consider logistic regression. What it’s saying is that the log odds of an outcome is a linear function of the predictors. Is that interpretable? I think that depends if you know that logit(u) = log(u / (1 - u)) and that the inverse is logit^-1(v) = 1 / (1 + exp(-v)). If I just write those functions out, it’s going to be hard to interpret if you don’t already know the meaning.
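Written as code, the two functions look like this. This is a plain Python sketch of the formulas above, with no attempt at numerical safeguards for extreme inputs.

```python
import math

def logit(u):
    """Log odds of a probability u in (0, 1)."""
    return math.log(u / (1.0 - u))

def inv_logit(v):
    """Map any moderate real number v back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

for p in [0.1, 0.3, 0.5, 0.9]:
    v = logit(p)
    print(p, v, inv_logit(v))   # the round trip recovers p
```

Whatever real number goes into `inv_logit`, the result stays strictly between 0 and 1 for inputs of moderate size; very large negative inputs would overflow `math.exp` in this simple version.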

I just mean say you are looking at the returns for a bunch of stocks. Which is easier to understand?

The percent the stock price has changed ytd, or the asinh transformed version? Here is what it looks like for ~7700 listed stocks: https://i.ibb.co/gjvXwNt/perfYTD.png

If you tell me the inverse hyperbolic sine of the year-to-date performance of a stock is 2.3, I think that is more difficult to interpret than telling me the YTD performance is +10%.

> Or consider logistic regression. What it’s saying is that the log odds of an outcome is a linear function of the predictors

I dislike this description of logistic regression. It makes it sound like you have some strong assumption in place about how the log odds transforms your data into a line or something… There’s nothing preventing you from doing nonlinear models though.

What logistic regression means is that inv_logit(YourModel(Covariates)) is guaranteed to be a number between 0 and 1 and YourModel is allowed to be anything and yet thanks to the nonlinear transformation, it can never violate these limits of 0 to 1.

YourModel could easily be a complex 35 term Fourier series with respect to Covariates, or a radial basis function or a Chebyshev polynomial or an exponential function or any old nonlinear thingy and yet it will overall output values between 0 and 1 as it should.

When I said “complex” i meant “tricky” not “using complex numbers”… YourModel should be a model outputting a real variable, so that inv_logit converts it to the range (0,1)

Anon, Terry:

My usual reason for log transformation is that effects and comparisons typically make more sense on a multiplicative scale than on an additive scale.

If a variable can be negative, then it can make sense to think of it as the difference between two positive values, and it could make sense to take the log of each. It depends on the context. Ultimately the choice of transformation represents an implicit choice of model.

Maybe I am missing something obvious, but how do you justify using a given base in terms of interpretability? Eg, which is most interpretable: log10(x) or ln(x) or log2(x)?

For a standard regression analogy, the difference between log10 and log2 is like the difference between taking height in inches or meters. That’s because if log10(z) = y, then log2(z) = log2(10) * y. So like the difference between height in inches or meters, the multiplicative factor can make a difference in terms of priors in the model.
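A quick numerical check of the change-of-base relation in Python; the value of z is arbitrary.

```python
import math

z = 5000.0                 # arbitrary positive value
y10 = math.log10(z)
y2 = math.log2(z)
factor = math.log2(10)     # about 3.32: one log10 unit is about 3.32 log2 units

print(y2, factor * y10)    # the two agree up to floating-point rounding

# A regression coefficient rescales the other way: if the outcome changes by
# beta10 per unit of log10(z), it changes by beta10 / factor per unit of log2(z).
beta10 = 1.0               # hypothetical coefficient
beta2 = beta10 / factor
print(beta10 * y10, beta2 * y2)
```

So the base changes the units of the coefficient but not the fitted relationship, just as with inches versus meters.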

Is it ok to transform count data with log(1+count) so that underdispersion and multivariate response correlation can be modelled?

Luke:

I don’t like log(1+x) because it destroys interpretability, also the 1 is arbitrary. If someone is going to do log(1+x), I’d rather have them do log(a+x) and then choose a reasonable value for “a”.

See also here.

How do you choose a “reasonable” value? See also Luke Smith’s point below about small values close to 0 distorting regression results.

Usually what should be done is to convert the x data to dimensionless ratios using a subject-matter specific scaling factor, and then you’d choose 1/scale as the epsilon increment, so you’d do

log(1/scale + x/scale)

For example, rather than working with population counts you’d work with say population / median population of a county in your state or some such thing, similarly votes / median state vote count in previous election or the like.

Cool. I’ve only ever thought about this in count regression models where a population size term (aka exposure) lets you convert something like disease rate into an expected count for a Poisson model.

This is a nice way of thinking about this in general. I’m always surprised how little play dimensions get in discussions of statistical modeling.

Could you spell this out a little bit? I’m confused about what “the x data” is referring to here, since in the example above log(1+x) x is used for the dependent variable.

Are you saying, instead of modelling for example:

population = intercept + b*numberOfHouses,

you could do:

log(1/meanPopulation + population/meanPopulation) = intercept + b*numberOfHouses

or something else?

Not quite. You already rescaled in your model. What I was thinking about in terms of exposure in epidemiology models is as follows. It’s like what Daniel Lakeland suggested, only with area-specific predictors. You could also do it with more general ones.

Suppose you have areas n in 1:N, where the data y[n] for each area is its population. At that point, you can build a very simple (heterogeneous, not spatial) hierarchical model

In this case, exp(lambda[n]) is the expected population in area n. The hierarchical model accounts for the population distribution of areas. It’s a silly model as we don’t really have anything to anchor the mu_lambda other than the y[n]. There may not be much opportunity for partial pooling if there’s a wide range of scales for y[n]. For instance, we might have a major metropolis and a small town if our data is on cities and these might vary several orders of magnitude.

It can make more sense in this case if we have something like the number of houses as a predictor x[n], to do something like this:

Now the `theta` parameter can be interpreted as a population per unit of housing. This will usually wind up being a better model given the predictor. In these kinds of models, the x[n] are called “exposure” terms.
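Here is a minimal Python sketch of the exposure idea, not a reconstruction of any particular model. It assumes the likelihood y[n] ~ Poisson(x[n] * theta) with a single positive `theta`, leaves out the hierarchical prior entirely, and uses made-up numbers for houses and population. The names `poisson_lpmf` and `log_lik` are hypothetical helpers.

```python
import math

def poisson_lpmf(y, rate):
    """Log probability of count y under a Poisson distribution with the given rate."""
    return y * math.log(rate) - rate - math.lgamma(y + 1)

def log_lik(theta, ys, xs):
    """Log likelihood under y[n] ~ Poisson(x[n] * theta).

    On the log scale the rate is log(x[n]) + log(theta), so log(x[n]) acts
    as an offset with its coefficient fixed at 1.
    """
    return sum(
        poisson_lpmf(y, math.exp(math.log(x) + math.log(theta)))
        for y, x in zip(ys, xs)
    )

# Made-up data for illustration: houses (exposure) and population per area.
houses = [400, 12000, 250000]
population = [1000, 30500, 640000]

# With a single theta, the maximum likelihood estimate is total y over total x.
theta_hat = sum(population) / sum(houses)
print("people per house:", theta_hat)
print("log likelihood at theta_hat:", log_lik(theta_hat, population, houses))
```

Because the areas differ by orders of magnitude in size, dividing out the exposure puts them on a common per-house scale, which is where partial pooling has a chance to be useful.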

There are many opportunities to use log transformations to make things more reasonable. Bob gives one example that I haven’t thought about so much. But Jens asks a question about transforming the observed data, and I think that’s a valid use of this kind of thing.

Suppose you have some data of counts of objects, and the counts are very large, like people in counties, or bacterial cells in water samples or whatever. You know this isn’t a continuous variable, but with the counts being potentially very large, the minimum increment is essentially “dx” compared to the size of the typical measurement, so you can treat it as-if continuous. But the best way to do this is to rescale the data from counts to fractions of a typical count… so suppose you’re talking about bacterial cells in a culture… and a typical number is maybe around 14500 but you will have some samples with counts like 5 and other samples with counts like 13551900…

So you take your data x which is counts of cells, and you divide it by the typical size 14500 and you get a ratio like “multiples of the typical value”… Now because this varies over several orders of magnitude between say 5/14500 = 0.00034 to 13551900/14500 = 934 you want to take a logarithm of this number and model *that* which will be much more compact between say -8 and +8

The only problem is it’s typically possible for you to get a 0 count and so log(0) = -inf and everything goes crazy.

Instead of taking the log of the data, you take the log of the data + the minimum possible increment. This perturbation affects *only* the left tail where you’re down close to 0 cells.

if scale=14500 then you take your data x and do (1+x)/scale and take the logarithm of this:

log( 1/scale + x/scale)

Now you have a variable which is centered somewhere around 0 (the fact that scale is “typical” means x/scale ~ 1 and log(1) = 0) and limited in tails to the region log(1/scale) to log(xmax/scale). Typically this is a much better behaved number to model.
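Using the numbers from the bacterial-count example, the transform can be sketched in Python as follows; `scaled_log` is a made-up helper name.

```python
import math

def scaled_log(x, scale):
    """log(1/scale + x/scale), which equals log((1 + x) / scale) for a count x."""
    return math.log(1.0 / scale + x / scale)

scale = 14500   # the "typical" count from the example
for count in [0, 5, 14500, 13551900]:
    print(count, round(scaled_log(count, scale), 2))
```

A zero count maps to log(1/scale) instead of minus infinity, a typical count maps to roughly 0, and the added increment is negligible everywhere except the far left tail.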

Thanks Bob and Daniel for clarifying.

So if I try to summarize for myself, Daniel shows a clever way to scale the dependent data, while avoiding log(0) = -Inf values. If I understand correctly, it corresponds to my example, modelling for example log(1/meanPopulation + population/meanPopulation) = intercept + b*numberOfHouses instead of population = intercept + b*numberOfHouses.

And Bob’s example looks to me like adding log(exposure) as an “offset” on the linear predictor, similar to when you would include log(surveyEffort) as an offset in a Poisson model of counts. I usually do this when modelling counts of animals, especially when counts can be zero, essentially modelling the count/surveyEffort.

Or I have still misunderstood and need to play around with some numbers to get it…

Should have said: “especially when the survey effort is varying”.

Adede:

I’m not a big fan of log(a + x), but if someone wants to do log(1 + x), I’d rather have them do log(a + x) and choose a. Setting a=1 based on the nominal scale of x, that makes no sense at all. For example, if x is income, should a = $1 or $1000 or $10,000 or what? If someone wants to do log(a + x), it’s their job to explain to me why they chose the particular value of a that they chose. To just do log(1 + x) without reflecting on it, that seems horrible.

The 1/4 power idea sounds interesting but I wonder if it’s just an ad-hoc hack or there’s any reference for this. There are more general families of power transforms that include it as a specific case (e.g. Box-Cox). These were designed with the goal of achieving “normality” though, which is not what we want to prioritize in this discussion, and also I’m thinking about transforming covariates and not dependent variables here.

How about modeling correlated count observations as a seemingly unrelated, overdispersed Poisson regression? That seems to match the effects in the data you describe.

One place where we really don’t want to use normal approximations is when we have wide tails. Posterior predictive checks will diagnose failure in these conditions as the data will be way more dispersed than the simulated data from the model. So will mean square error checks on the data.

One key issue is that if your data have small positive values close to 0, log transforming them can cause extreme values in your lower tail where none existed before. This can greatly impact your regression estimates.

+1.

You have to be careful if the measurement of values near zero is flawed or that you don’t have a truly multiplicative process. Otherwise, you want those values to impact your estimates. For instance, a bank account of one dollar gaining 10% is the same information about interest rates as a bank account of $1M gaining 10% (maybe not, given how banks set rates based on deposit size), so you want those small values to have the same impact as big values.

As another example, consider a lumber mill chopping lumber using a circular saw. The error’s not going to depend much on how long the 2 x 4 is that’s being cut. Whether it’s a quarter meter or four meters, you’ll probably have similarly scaled error. You also won’t see someone trying to cut 0.001 meter pieces from a board using a circular saw.

A long time ago I worked with data on a radioactive pollutant whose concentration was measured with error. Of course the concentration could never really be negative, but the measurement could be, and sometimes was. Those negative values did almost certainly represent extremely low concentration. One thing that is often done in cases like this is to set negative measurements (or zero measurements) to some fixed small number. Not zero, though, because I needed to work in log space.

Setting negative numbers and zeros to a small positive number probably would have been fine in practice, but such an approach would throw away a little bit of information: if the measurement was y = -0.5, the true concentration is probably lower than if the measurement was y = -0.1. So I ended up calculating an ‘adjusted concentration’ = (y/2) + sqrt( (y/2)^2 + d^2) where d is a small positive concentration. A measurement of 0 gets set to d, a slightly negative measurement gets set to a number slightly less than d, and a very negative measurement gets mapped to a concentration close to zero (but still slightly positive). I still think this is a good practical approach.
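The adjustment is easy to write down in Python. In this sketch the function name and the value d = 0.1 are illustrative choices, not values from the original analysis.

```python
import math

def adjusted_concentration(y, d):
    """(y/2) + sqrt((y/2)^2 + d^2): always positive, close to y when y >> d."""
    half = y / 2.0
    return half + math.sqrt(half * half + d * d)

d = 0.1   # hypothetical small positive concentration
for y in [-50.0, -0.5, -0.1, 0.0, 0.5, 5.0]:
    print(y, adjusted_concentration(y, d))
```

The mapping is smooth and strictly increasing, so the ordering of the measurements is preserved, and every adjusted value is positive and can be log transformed.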

A good way to handle this in a modern Bayesian fit would be to have a parameter that describes the underlying actual value which is limited to positive values, and then a measurement error model that describes the additive measurement.

This is exactly the message we’re always trying to get across. Model what’s actually going on. If there’s a latent value that’s constrained to be positive, model it as a parameter constrained to be positive. Then if the measurement can be negative that’s no problem—it’s no longer inconsistent.

I had a fairly involved discussion a while ago with some fisheries people on their population models. I felt they were too convinced they measured total fish caught accurately and suggested they rethink the model along those lines: a latent population which is always non-zero, then measurements, which could be noisy enough to imply you caught all the fish in the sea and then some. I didn’t make much headway on that discussion, which along with other cases like this, is why I’m so engaged in this particular thread.

Including a measurement error model adds a big layer of complexity that, in many cases, is completely unnecessary. Dealing with a vector of concentrations is just way easier than dealing with a vector of probability distributions. If you don’t really care whether a few true concentrations are 0.3 or 0.4 pCi/L because most of your measurements are in the range 0.8-5.0 anyway and you just want to make sure the really low ones aren’t exerting too much influence, you can just go ahead and impute the negative values, zeros, and ‘below detection limit’ values to something reasonable. If you do this you should check and make sure that, if you choose some different value that is also reasonable, the inferences you care about don’t change enough to matter. This is the situation I was in.

But although many situations are like that, many situations are not.

The situation Bob describes seems like it might be different: the measurement errors might really matter. Actually there are all kinds of issues with population sampling such that miscounting captured fish might be among the least of them. As fisheries biologists know well, fish aren’t balls in an urn from which you can withdraw a sample with equal probability. For instance, there are hard-to-sample populations that exchange individuals with easy-to-sample populations at some unknown rate, so even if you accurately quantify the easy-to-sample population you probably don’t really know what you want to know. It’s certainly an area that can benefit from good statistical models.

> you can just go ahead and impute the negative values, zeros, and ‘below detection limit’ values to something reasonable

Maybe, but it seems better to impute them via a parameter so that the parameter can vary around, otherwise you’ll get an apparently too precise estimate. It might not matter, but it might matter, and it seems like the only way you can tell is to fit the more complete model and show that it’s not much different from the approximate model. That can be a good idea if you’re planning to run this model over and over again at different times for example, you find a computational approximation that’s much faster and show it doesn’t offer much error… but if it’s a one-shot sort of thing, it seems like you want to do the more full model.

Dan, you say “It might not matter, but it might matter, and it seems like the only way you can tell is to fit the more complete model and show that it’s not much different from the approximate model.”

There may be cases in which that’s true, but it’s certainly not universal. As I mentioned earlier: You can impute the negatives and zeros to some low number that is on the low end of what is reasonable, and run your analysis; and then redo it with the negatives and zeros imputed to some other number that is on the high end of what is reasonable, and if the inferences for the main parameters of interest don’t change much then I don’t think there’s a need to do something much more complicated.

I’m not claiming that imputing a reasonable low number always works. Indeed, I have seen analyses in which people did something standard — replace all of their ‘below detection limit’ measurements with a value equal to half the detection limit — and got results that were pretty bad, or at least that I didn’t trust at all and that seemed fishy. But if they had tried imputing a value equal to 0.1 x (detection limit) instead, they would have gotten a very different answer, which would have warned them that their results were sensitive to their imputation procedure. In such a case they should probably fit a more complicated model, as you suggest. But I do think that if you try replacing ‘below detection limit’ measurements with 0.8 x (detection limit) and with 0.1 x (detection limit) and your key results don’t change much, you’re almost certainly fine.
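A sketch of that sensitivity check in Python. The measurements and the detection limit are made up purely for illustration, `None` marks a value below the detection limit, and the mean of the log concentrations stands in for whatever inference you actually care about.

```python
import math

def impute(values, detection_limit, fraction):
    """Replace below-detection-limit entries (None) with fraction * detection_limit."""
    return [fraction * detection_limit if v is None else v for v in values]

def mean_log(values):
    """Mean of the log-transformed values."""
    return sum(math.log(v) for v in values) / len(values)

# Made-up data, not from any real study.
measurements = [0.9, 1.4, None, 2.2, 3.1, None, 0.8, 4.7, 1.9, 1.2]
detection_limit = 0.5

low = mean_log(impute(measurements, detection_limit, 0.1))
high = mean_log(impute(measurements, detection_limit, 0.8))
shift = high - low

print("mean log concentration, 0.1 x DL:", low)
print("mean log concentration, 0.8 x DL:", high)
print("shift between the two imputations:", shift)
```

If the shift is small relative to the effects and uncertainties of interest, the simple imputation is probably fine; if it is not, that is the warning sign that a fuller model is needed.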

Sure, those seem like reasonable ideas. My preference would probably be to impute small values as draws from some distribution. I’d probably tend to use a gamma distribution, and try a few different sets of parameters.

imputing to a fixed value is always going to imply a falsely small variance, and generating some random numbers is certainly not much harder than a fixed number.

A similar but opposite problem:

Gold is a famously “nuggety” mineral; it’s common to get high gold assays that aren’t characteristic of the deposit in general. A common practice in less sophisticated operations is to use a “cut grade” when calculating anticipated ore grades from core samples, so that any analysis over 1oz/ton is cut to one ounce for the ore grade calculation. So if you have an assay that runs 17.31 oz/ton, you cut it to 1oz/ton. Even this usually isn’t enough to put the calculated grade on par with the ultimate mining grade.

What kind of transformation could be used to apply across all data for this situation? A log would be better than nothing but when you have a sequence of values like 0.39, 0.17, 0.33, 0.41, 0.15, 14.77, 0.49… the log isn’t going to cure that large value.
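To put numbers on that sequence, here is a short Python sketch using the seven assay values quoted above and the 1 oz/ton cut grade. The variable names are illustrative.

```python
import math
import statistics

assays = [0.39, 0.17, 0.33, 0.41, 0.15, 14.77, 0.49]   # oz/ton, from the example

raw_mean = statistics.mean(assays)
cut_mean = statistics.mean([min(a, 1.0) for a in assays])   # cut grade at 1 oz/ton
median = statistics.median(assays)

logs = [math.log(a) for a in assays]
geometric_mean = math.exp(statistics.mean(logs))
log_gap = max(logs) - statistics.median(logs)   # how far the nugget sits on the log scale

print("raw mean:", raw_mean)
print("cut-grade mean:", cut_mean)
print("median:", median)
print("geometric mean:", geometric_mean)
print("log-scale gap between largest value and median:", log_gap)
```

The log pulls the large value in a great deal, but it still sits more than three log units above the median, which is the sense in which the log alone does not cure it.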

I think again you would use a measurement error model with a long tail. Something like a T distribution. you are doing inference on the population mean, and a single or even a few outliers will not pull that estimate when the distribution has long tails

Yes! The model Daniel Lakeland suggests will also be much better predictively, because you’re going to continue to get that nuggety behavior in future observations. The point is to try to model what’s actually going on in the geology, then model how your measurements are derived from that.

It’s also possible to treat some of these observations with censoring, but that just throws away information if you actually have it and if you don’t have a wide enough error scale (wide enough tails, for example), it’ll be biased predictively.

What about a mixture model of Gaussian distributions? Would that be a bad idea in this scenario? Distributions for ‘nuggety’ and typical deposits.

jd, sure if the nuggets are typically always say around 10-15 oz/ton then it makes sense to say your distribution is one half-normal distribution normal(0,1) truncated to [0,inf] and one nuggety distribution normal(12,3) or similar, with an unknown mixture…

but you will typically have a harder time sampling this kind of distribution, because there’s a large energy barrier between the region around 0 and the region around 12. You can benefit from some kind of “bridge” between the two, say a t distribution with some small mixture quantity that prevents the region in between from going too close to zero density, even if it isn’t necessarily all that realistic…

Daniel – Good points. Makes sense.

When I first read jim’s post, I was thinking of two different processes that generated the ‘nuggety’ and typical observations. But re-reading the post, it just sounds like he is referring to outliers in general.

jd: definitely would use different distributions for different types of deposits. High-grade vein-hosted gold deposits with free gold (Ontario, Quebec, maybe Colorado) typically show a strong nugget effect, while base-metal deposits (porphyry Cu/Mo in New Mexico, Arizona, Utah) with secondary gold have a more normal distribution.

jim – Cool. Interesting example. It got me to playing around with mixture models and models with student family in ‘brms’ yesterday. I haven’t really had to model data like you described, before.

A few typical quantities that I’m modeling are: Servings per day of fruit and vegetable, Total energy intake per day, Minutes of physical activity per day, Percentage of waking time spent sedentary, Sleep efficiency, etc.

It would be brutally difficult to explain or present most of those on a log-transformed scale and still have the results be remotely useful to the reader. The result would probably be their eyes would glaze over and rather than figure out what “an additional 0.3 log-servings per day” means they’d just skip down to the p-value ;-)

There’s no reason you can’t transform the *presentation* to non-log scale. The key is keeping the *analysis* on a scale where the size of the uncertainty is better modeled…

Is “20% increase in energy intake per day in the experimental condition” really that hard to understand?

Brent:

You can model things on the log scale and then present results on the original or log scale as appropriate. For example, you could say the average number of servings per day for a particular item is 1.3, and then you could say that a particular treatment increases consumption by 10% (that is, it has a coef of 0.1 on the log scale). You can use logs to make the model more useful and sensible without actually presenting results as logarithms.
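To make the arithmetic concrete, here is a minimal sketch of going from a log-scale coefficient back to a percent change. The numbers 1.3 and 0.1 come from the comment above; the variable names are just for illustration. A coefficient of 0.1 is roughly, not exactly, a 10% increase.

```python
import math

baseline_servings = 1.3   # baseline servings per day, from the comment above
log_coef = 0.1            # treatment coefficient on the log scale

# Exponentiating a log-scale coefficient gives a multiplicative effect.
multiplier = math.exp(log_coef)
percent_change = 100 * (multiplier - 1)
treated_servings = baseline_servings * multiplier

print(f"multiplier: {multiplier:.3f}")
print(f"percent change: {percent_change:.1f}%")
print(f"servings per day under treatment: {treated_servings:.2f}")
```

For small coefficients the "coef times 100" reading is close enough for presentation; for larger ones it is safer to report exp(coef) - 1.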

This is similar to how we use average predictive comparisons for logistic models.

Personally I would use whatever presentation or plot that makes the best case. If it’s a bit more complicated, provide an explanation of how to understand the presentation. I think people like to learn new things if you make a good case for it and provide a clear presentation.

Absent substantive information of the underlying mechanism, log transform makes a lot of sense for variables with large dynamic ranges because it is unlikely for a linear additive process to generate that kind of data.

Of course, with an over-dispersed Poisson GLM using a log-link, you can preserve the non-negative expected values without transforming the response variable.

The log link is what models the predictors as having a multiplicative effect on the outcome. The error’s still going to be modeled by the Poisson plus overdispersion. Lots of knobs to twist in even simple GLMs!

I prefer the quasi-Poisson to log transforming the dependent variable, as quasi-Poisson is consistent no matter the conditional distribution of y given x. In my opinion, that makes it extremely useful for modeling non-negative outcomes. I do not need to abandon the original scale of y, can retain zero values without adding any arbitrary constants, and don’t have to worry about bias due to heteroskedasticity. Modeling the conditional expectation directly is just superior in so many ways to log transforming and then estimating.

+1 to this

Was this a response to the original suggestion? I think Dave C. above is suggesting a Poisson distribution for the dependent (that’s y, right?) variable, not transforming it.

It looks like the quasi-Poisson model is like the negative binomial in that it’s a gamma-Poisson compound. The difference is in how the variance is characterized: the negative binomial makes it a quadratic function of the mean, while the quasi-Poisson makes it a linear one. The section of the Poisson Wikipedia page on overdispersion has a nice clean definition. These models make a lot of sense, but they can be challenging to fit with MCMC because of the extra degree of freedom the overdispersion gives you.

Dave, Jesper:

We’re not always working with count data.

quasi-Poisson only assumes that variance is proportional to mean. It is not restricted to count data but works for continuous non-negative variables as well.

Dave:

Fair enough. Ultimately I’m in favor of modeling the data and the underlying process, and data transformations are typically best viewed as a shortcut.

Andrew may hate Twitter, but that hasn’t stopped his bot from making 5K posts and accumulating 25K followers in his name!

To me, Twitter feels like one big blog comment section indexed by hashtag rather than URL. The content being discussed is usually hosted elsewhere. For instance, Andrew’s 5K Twitter posts are mostly (all?) just links to blog posts. That then provides an anchor for a Twitter discussion around the topic that complements the blog post discussion here (very complementary in that I don’t think Andrew reads or responds to Twitter posts).

Just like some blogs are toxic, some Twitter threads are toxic. But most of the ones in stats look like Andrew’s blog comments with a whole lot of “likes”, cut out diagrams, and retweets thrown in.

Bob:

I don’t like twitter because it seems to encourage snappy replies rather than thoughtful arguments. From the other direction, I’m sure twitter has many virtues that blogs don’t have. I like the chance in a blog to expand on a point and to make digressions, and then to have lots of discussion in comments. I guess it depends on the blog, though. Some blog comment sections are cesspools.

Well, we all know the only truly thoughtful arguments are ones that pass peer review. In other words, it’s a cesspool all the way up (sorry about that; the metaphor decomposed when I changed direction).

If you like digression and discussion, you should love Twitter! The problem I find is that the stray comments clutter everything up needlessly—I’ve never figured out how to filter it down to actual content (I could take it on as an NLP project, but I’ve still never recovered from reading and annotating 5K random English-language Twitter posts). It might be easier if I had an account :-)

Don’t do it Bob! Back away from the prompt…

(Thanks again for this bog, Andrew. I do really mean it!)

“blog”

I don’t think it’s good to flippantly advise “log-transforming your data”, because in this Twitter-infested world that is all people are going to take away. By “your data” I assume you actually mean only your positive-valued response data (the left-hand side of the equation). Or do you mean data on the right-hand side as well or instead? That’s not clear from your post. Data can mean many things.

I think it’s much more important to think about the data generating process and pick an appropriate probability model. It may be that log-transforming works fine for a quick regression model, but if we’re dealing with count data with many zeros, it might be better to go with the negative binomial or Poisson GLM.

Also, I can think of and have actually seen examples (of what not to do!) of people using the log-transformation solely to achieve linearity without thinking about how this affects the error structure. As an example, suppose you want to fit the Ricker model to stock-recruitment curves for salmon populations (http://oregonstate.edu/instruct/fw431/sampson/LectureNotes/16-Recruitment4.pdf). If the errors are actually closer to normally distributed than log-normally distributed (basically what I mean is that the variance of large values is about the same as the variance of small values), but you log-transform, you’ll actually introduce heteroskedasticity. You’re often better off using non-linear regression. I have seen this in action: I can remember one talk I went to in particular where somebody was presenting a bunch of plots of stock-recruitment curves after back-transforming from the log scale and the regression line was clearly wrong in several of the plots (meaning not going through the rough center of the data). For many cases they would’ve been better off using non-linear regression.

I’ve got examples of this buried somewhere in my files, but if I can find them I’ll post the R code of a simulation of this example here.

Dalton:

My advice is not flippant; it’s serious! And I did qualify it with “usually.” In any case, I think the relation between x and y is typically much much more important than equality of variances or the distribution of the error term. See here for some discussion taken from my book with Jennifer.

I will say one thing, though. Instead of saying “log transform your positive data,” I should’ve said, “log transform your variables that are inherently positive.” For example, I wouldn’t typically apply the log transform to test scores or to responses on a Likert scale, even if it happens to be coded from 1 to 5.

Andrew,

You’re right. Flippant was the wrong word choice, I think I meant more so your style which was, as usual, more casual and in this case not precise. No insult was intended.

But I do have an example where log-transforming inherently positive values is the wrong choice because the error terms do matter. (It actually comes from one of Alix Gitelman’s courses at Oregon State). Suppose we have some stock recruitment data for salmon populations and we want to fit a Ricker curve to these data to find the equilibrium point (the number of adults we need to ensure replacement). If the errors are actually closer to normal, but we take the log of both sides because we don’t want to use non-linear regression, we will get a different answer for the equilibrium point than if we don’t take the log and instead use non-linear regression. If somebody uses this to set harvest quotas and the equilibrium point is actually higher than our estimate then that could cause damage to the population.

It’s generally true that “[i]f the errors are actually closer to normal” you don’t want to log-transform. A log *link* will work nicely, though, and avoid having to deal with nonlinear regression: in R’s glm (and presumably rstanarm etc.), y ~ x + offset(log(x)), family=gaussian(link="log") will do the trick. If you make it y ~ x + log(x) instead you get a generalized Ricker for little extra cost …
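For anyone wondering why that formula reproduces the Ricker curve, here is the algebra spelled out. The parameter names a and b are my own labels (the comment doesn't name them), with y as recruits and x as stock:

```latex
\begin{align*}
\mathrm{E}[y \mid x] &= a\,x\,e^{-b x} && \text{(Ricker curve)}\\
\log \mathrm{E}[y \mid x] &= \log a + \log x - b\,x \\
  &= \beta_0 + \beta_1 x + \log x,
  && \beta_0 = \log a,\quad \beta_1 = -b .
\end{align*}
% Generalized version, with a free coefficient \gamma on \log x:
\begin{align*}
\log \mathrm{E}[y \mid x] &= \beta_0 + \beta_1 x + \gamma \log x
\quad\Longleftrightarrow\quad
\mathrm{E}[y \mid x] = a\,x^{\gamma}\,e^{-b x}.
\end{align*}
```

The offset term is just log x with its coefficient fixed at 1, so the mean is linear in the parameters on the log scale while the Gaussian error stays additive on the original scale of y. Dropping the offset and estimating the coefficient on log x gives the generalized form in the second display.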

Cool! I didn’t know this.

This is an interesting discussion. Back when I taught this stuff (past tense) in an essentially introductory course (interdisciplinary grad program), I emphasized thinking carefully about what it is you want to measure and know. If you really want to know about percentage changes in variables, log transform. If you think the underlying process is a percentage growth rate, as so many are, log transform. Don’t do it mechanically or just to get a better fit. (A lot of my emphasis was on resisting overfitting, since that is the bane of a lot of newbies.) I realize this is all pretty basic, but it’s amazing how basic stuff is often smudged or overlooked altogether in standard stats texts.

Actually, what exasperates me is when I am forced to do an analysis of raw reading time data because a reviewer demands it. I don’t mind what others do in their papers as long as they release the data and code behind the paper. When reviewing papers, I don’t insist on transformation (I used to but I realized I shouldn’t impose anything on others if I don’t want them to impose anything on me), instead I just look at the data myself and make my own judgement.

And it irritates the heck out of me when people say to me, “because Gelman and Hill (2007) said so.” I’ve lost count of the number of times I’ve heard that.

Personally, I also hate when people will interpret “normality of errors is the least important assumption when fitting a linear model” as “normality of errors doesn’t matter so don’t bother inspecting your residuals to see if they deviate too strongly from a normality assumption”, which is fairly common.

Just because there are ways around this (particularly if the sample is large), it doesn’t mean one shouldn’t at least check residuals for deviations from normality and (particularly) for some specific kinds of these that may be more troublesome than others, even more so when it’s so simple to do that.

To be honest, back in 2006/2007 that’s exactly the inference I drew from Gelman and Hill’s statement. You can see why that inference seems to make sense, right? One really has to spell things out for the non-statistician user; they can’t possibly know what to do with a statement like “the residuals are the least important thing”.

I know, right? I mean, yes, if you have a large sample and under fairly mild assumptions this isn’t an issue for the most part – the Central Limit Theorem will take care of it. That’s what I was initially taught about how to think about the normality assumption in my original field.

But, while in practice the CLT will usually hold (and thus it’s not really incorrect), there can always be aberrant cases where it may not (e.g. with Cauchy-distributed or fat-tailed errors in general). In a world in which it’s not hard or (generally) computationally expensive to check the residuals, there is no real reason not to look at what kind of deviations from the normality assumption may be going on (if any), since most software will output a normal QQ-plot of residuals automatically when doing general model diagnostics (which in my particular case I wasn’t taught about either, even though this should probably be standard in any undergrad course on linear regression). I wonder how much published research also has some rather ugly residual plots simply because researchers didn’t bother looking at them since “normality of errors is the least important assumption”, particularly plots that allow one to detect violations of the critical assumptions.

I think the issues with statistical education don’t stop with how people are teaching how to properly think about NHST but go beyond this particular topic.

One of the things I love about Stan is that I no longer need to rely on linear regression and transformations just because the calculus makes it computationally tractable. It seems like the ideal future path for statistics is creating theoretically meaningful mathematical models and fitting them to the data. I’ve mostly worked with reaction times from psycholinguistic experiments for which the parameters of ex-Gaussian models have been associated with psychological correlates. It’s been nice to be able to fit the theoretically meaningful model directly.

+1 to this.

Fantastic to see Stan entering psycholinguistics finally. Long overdue.

One other argument I have recently encountered (a paper of mine was simply rejected using this argument) for not log-transforming reading times was that “cognition happens on the millisecond scale” so log-transforming takes you away from the true scale. But suppose the eyetracker was delivering data already log-transformed; then cognition would be happening on the log millisecond scale.

Also, I suspect people are confused by the meaning of the slope in a linear (mixed) model when it’s done on the log scale. E.g., if you have a two-condition predictor that is sum coded as +/- 0.5, and the estimates from the model are 8 for the intercept and 0.01 for the slope, people think that the effect size is exp(0.01), which is tiny (about 1 ms), which is certainly not realistic. But the effect is really exp(8+0.01/2)-exp(8-0.01/2), which is about 30 ms. I am guessing it’s this basic misunderstanding that makes people think that “cognition happens on the millisecond scale”.
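The back-transformation is easy to check by hand. A minimal sketch using the numbers from the comment (intercept 8, slope 0.01, sum coding of +/- 0.5); the variable names are only illustrative, and the back-transformed values are best read as typical (median-like) reading times rather than arithmetic means:

```python
import math

intercept = 8.0   # grand mean on the log-ms scale
slope = 0.01      # difference between the two conditions on the log scale
coding = 0.5      # sum coding: the conditions are coded +0.5 and -0.5

# Back-transform each condition separately, then take the difference in ms.
condition_plus = math.exp(intercept + coding * slope)
condition_minus = math.exp(intercept - coding * slope)
effect_ms = condition_plus - condition_minus

# The common mistake: exponentiating the slope on its own gives a ratio, not ms.
naive = math.exp(slope)

print(f"condition values: {condition_plus:.1f} ms vs {condition_minus:.1f} ms")
print(f"effect on the ms scale: {effect_ms:.1f} ms")
print(f"exp(slope) alone: {naive:.4f} (a multiplicative factor, about a 1% change)")
```

The same slope implies a different millisecond difference at a different intercept, which is exactly the point of treating the effect as multiplicative.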

We just don’t teach this stuff properly. Stats education is just a bunch of cookbook tricks. When I started out in 1999, I just knew how to do the repeated measures ANOVA by copying out some code on the internet and that was it. I knew I had won when p was less than 0.05 and I had lost if not. That’s how statistics is still taught. One can’t blame people for looking at stuff like log transforms etc. with suspicion.

Another thing I found odd is that psychologists routinely dismiss Box and Cox 1964 as irrelevant. That requires a lot of hubris. But then Box-Cox is not part of the standard education in stats in psych* disciplines so obviously people look at this paper with suspicion (how come I never learnt about this? it must be wrong). Just speculating of course.

Shravan:

I think that what’s most relevant is not the scale of the data but rather the scale of the comparisons. It makes more sense to me to think of a treatment that delays reaction time by 10%, than a treatment that delays by 10 milliseconds.

That said, it will depend on the context. I can imagine a treatment causing an absolute delay and also a multiplicative delay, in which case it would be best to include both these effects in the model, and gather sufficient data to disentangle them.

I’m not a big fan of Box-Cox transformations. Usually for me it will be log, or raw scale, or maybe some custom transformation to handle a variable such as income or education. I can’t really picture an example where I’d want to do the transformation to the -1/8th power or the 0.39 power or whatever, and the idea of estimating the transformation from the data alone sounds like a disaster. I can well believe that the method was helpful for the sorts of problems that Box and Cox were working on back in 1964, but I can’t really see it as relevant anymore. I’d rather just model the process directly.

Andrew, I agree that Box and Cox is not really to be taken literally in that whatever lambda happens to be is the power to use. But the principle seems reasonable, and the log and reciprocal are really useful transforms for reading time data. Kliegl, Masson have a paper on this.

I’m using the log transform stuff in another context, particularly in the display of data.

I’ve always assumed that the log transform fits some evolved property of human perception. Our sensory system and brain do some wacky things – non-linear transforms are the least of it.

Just a thought for your consideration.

Apparently people do perceive logarithmically: https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2013.00636.x

Adding to MST’s and AW’s comments: I think more about logarithms needs to be included in the secondary math curricula. See for example, the links Logarithms and Means, Lognormal Distributions 1, and Lognormal 2 at https://web.ma.utexas.edu/users/mks/ProbStatGradTeach/ProbStatGradTeachHome.html .

I do not log-transform response time (RT) data. The issue for me is that RT is offset quite a bit from zero. So the appropriate model might be something like log(RT-psi)~N(X\theta,I\sigma^2). I think fitting that model is fine. But that is not the same as log transforming. If you log-transform, you assume psi is zero. But psi is large; say 60% of E(RT). So, without knowledge of psi, the log doesn’t have much validity.
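A small numerical sketch of why the shift matters, under made-up numbers: a 500 ms baseline RT with psi equal to 60% of it (as in the comment), and a hypothetical treatment that slows the post-shift part by 10%. None of these values come from real data.

```python
import math

baseline_rt = 500.0         # hypothetical baseline RT in ms
psi = 0.6 * baseline_rt     # shift, taken as 60% of the baseline as in the comment
effect = 1.10               # hypothetical 10% slowdown of the part above the shift

# Only the part of RT above the shift is scaled by the treatment.
treated_rt = psi + (baseline_rt - psi) * effect

# Log-scale effect when the shift is modeled: recovers log(1.10).
coef_shifted = math.log(treated_rt - psi) - math.log(baseline_rt - psi)

# Log-scale effect when psi is silently assumed to be zero.
coef_plain = math.log(treated_rt) - math.log(baseline_rt)

print(f"treated RT: {treated_rt:.0f} ms")
print(f"effect on log(RT - psi): {coef_shifted:.4f}")
print(f"effect on log(RT):       {coef_plain:.4f}")
```

With these numbers the plain log transform reports an effect well under half the size of the one in the shifted model, and that ratio changes with psi, which is the sense in which the unshifted log is hard to interpret.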

The shifted log-normal is easy to fit since Stan and brms came around. Makes a lot of sense.

Over the years, I’ve started to think we could be better off modelling RTs and errors jointly, as a bivariate outcome. And maybe fixation times and regressions in a similar approach with eye tracking data. I don’t see this approach gaining much traction yet, but the psychometric literature seems to use it.

Can anyone think of an issue if you transform one variable of a bivariate outcome? Honestly, I haven’t considered this too carefully.

Very nice discussions. I humbly present to you a paper that I co-wrote that may be of interest. It presents an economic argument to the potential advantages of log-transformation of positive data.

Reducing Costs and Improving Fit for Clinical Trials that Have Positive-Valued Data

Maria Deyoreo & Brian P. Smith

Pages 234-242 12 May 2017

One other argument I have heard in favor of not log-transforming reading time data is that log transforming can make an interaction non-significant, or make a non-significant interaction significant. Someone told me that Balota, Aschenbrenner, and Yap, 2013 make this point but I couldn’t find it in that paper.

I never understood this argument. It presupposes that the raw reading time analysis somehow reflects the truth a priori. In practice, what I find is that if an interaction in raw RTs disappears in log RTs, it’s because there were a few extreme data points in the raw RTs driving the effect.

Not to mention that focusing on whether the effect is significant or not is utterly insane, but there I’m preaching to the enlightened on this blog.

Hey, I heard people are citing others out of context here so I’ll contribute to that. From Kruschke’s Book:

“If the initially assumed noise distribution does not match the data distribution, there are two ways to pursue a better description. The preferred way is to use a better noise distribution. The other way is to transform the data to a new scale so that they tolerably match the shape of the assumed noise distribution. In other words, we can either change the shoe to fit the foot, or we can squeeze the foot to fit in the shoe. Changing the shoe is preferable to squeezing the foot. In traditional statistical software, users were stuck with the pre-packaged noise distribution, and had no way to change it, so they transformed their data and squeezed them into the software. This practice can lead to confusion in interpreting the parameters because they are describing the transformed data, not the data on the original scale.”

I forgot the page number; I’ve been sitting on this quote for a couple of days, trying to come up with something relevant to say, but my brain is just… it just doesn’t.

But I find Kruschke’s approach intuitively appealing; I like the simplicity of it. Though I admit that I’ve never really thought about the perspective given in the post: maybe indeed log-transformed stuff is more easily interpretable! To me it has always just seemed like one more hoop to jump through. Maybe it has been in the applications I’ve worked on; maybe I’ve limited myself for no other reason than my prejudice.

https://www.spec.org/ was formed 30 years ago to improve methodology of benchmarking & its methods more or less became the gold standard, long used for CPU design driven by analysis of performance metrics.

It started by having 2 systems X & Y run N benchmarks, yielding runtimes Xi & Yi & converting them to ratios Ri = Xi/Yi, which gives the performance of Y relative to X on benchmark i, i.e., larger = faster.

We always used the Geometric Mean G = (R1 * R2 … RN)^(1/N), but of course, mathematically, that’s just a simpler way of:

ri = ln(Ri)

G = exp((r1 + r2 + … + rN)/N), i.e., exp of the arithmetic mean of the ri
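The equivalence of the two forms is quick to verify numerically. A minimal sketch with made-up ratios (not SPEC results):

```python
import math

ratios = [1.2, 0.8, 2.5, 1.1]   # hypothetical per-benchmark ratios Ri
n = len(ratios)

# Direct definition: N-th root of the product of the ratios.
g_product = math.prod(ratios) ** (1 / n)

# Log form: exponentiate the arithmetic mean of ri = ln(Ri).
log_ratios = [math.log(r) for r in ratios]
g_logs = math.exp(sum(log_ratios) / n)

print(g_product, g_logs)
```

The two agree up to floating-point rounding; the log form is the one that also gives access to a standard deviation of the ri.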

One has a fighting chance to use the log-transformed ri for normality tests, compute other moments, such as a stdev, which when exponentiated, becomes a multiplicative standard deviation, etc. One cannot do any of that with the untransformed Ri without getting contradictions, because the choice of numerator/denominator is arbitrary and can make each system look faster than the other.
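The contradiction with untransformed ratios can be shown with two benchmarks. The runtimes below are invented for illustration, and `geometric_mean` is a small helper defined here, not a SPEC tool:

```python
import math

x_times = [10.0, 20.0]   # hypothetical runtimes (seconds) for system X
y_times = [5.0, 80.0]    # hypothetical runtimes (seconds) for system Y

x_over_y = [x / y for x, y in zip(x_times, y_times)]   # Y's speed relative to X
y_over_x = [y / x for x, y in zip(x_times, y_times)]   # X's speed relative to Y

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp of the arithmetic mean of the logs
    return math.exp(arithmetic_mean([math.log(v) for v in values]))

am_xy = arithmetic_mean(x_over_y)
am_yx = arithmetic_mean(y_over_x)
gm_xy = geometric_mean(x_over_y)
gm_yx = geometric_mean(y_over_x)

# Arithmetic means: both exceed 1, so each system "beats" the other.
print(f"arithmetic means: {am_xy:.3f} and {am_yx:.3f}")
# Geometric means: exact reciprocals, so the two directions agree.
print(f"geometric means:  {gm_xy:.3f} and {gm_yx:.3f}, product {gm_xy * gm_yx:.3f}")
```

On the log scale, swapping numerator and denominator only flips the sign of each ri, which is why the summary stays consistent.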

Normality of ri is hardly guaranteed, but in practice there is often a good fit. If not, it usually means there is some subset of the benchmarks that is radically different, as seen with vector machines compared to scalar microprocessors.

The recent Machine Learning Performance MLPerf group uses a similar approach: https://mlperf.org/