https://www.quora.com/Why-is-the-Box-Cox-transformation-criticized-and-advised-against-by-so-many-statisticians-What-is-so-wrong-with-it/answer/Adrian-Olszewski-1?ch=10&share=b727f842&srid=MByz ]]>

It started by having 2 systems X & Y run N benchmarks, yielding runtimes Xi & Yi, & converting them to ratios Ri = Xi/Yi, which gives the performance of Y relative to X on benchmark i, i.e., larger = faster.

We always used the Geometric Mean G = (R1 * R2 * … * RN)^(1/N), but of course, mathematically, that’s just a simpler way of writing:

ri = ln(Ri)

G = exp(arithmetic mean of (r1, r2, …, rN))

One has a fighting chance to use the log-transformed ri for normality tests, compute other moments, such as a stdev, which when exponentiated, becomes a multiplicative standard deviation, etc. One cannot do any of that with the untransformed Ri without getting contradictions, because the choice of numerator/denominator is arbitrary and can make each system look faster than the other.
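To make the numerator/denominator point concrete, here is a minimal sketch in Python. The runtimes are made up for illustration and `geo_summary` is a hypothetical helper, not anything from a benchmarking suite.

```python
import math

# Hypothetical runtimes (seconds) for systems X and Y on four benchmarks.
x_times = [10.0, 5.0, 8.0, 30.0]
y_times = [5.0, 10.0, 8.0, 20.0]

def geo_summary(ratios):
    """Return (geometric mean, multiplicative standard deviation) of positive ratios."""
    logs = [math.log(r) for r in ratios]
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / (len(logs) - 1)
    return math.exp(mean), math.exp(math.sqrt(var))

ratios_xy = [x / y for x, y in zip(x_times, y_times)]  # Ri = Xi/Yi
ratios_yx = [y / x for x, y in zip(x_times, y_times)]  # the other arbitrary choice

g_xy, s_xy = geo_summary(ratios_xy)
g_yx, s_yx = geo_summary(ratios_yx)

# Arithmetic means of the raw ratios: with these numbers both exceed 1,
# so each system "looks faster" depending on which one is the denominator.
am_xy = sum(ratios_xy) / len(ratios_xy)
am_yx = sum(ratios_yx) / len(ratios_yx)

print(g_xy, g_yx, g_xy * g_yx)  # geometric means are reciprocals of each other
print(s_xy, s_yx)               # multiplicative stdev is the same either way
print(am_xy, am_yx)
```

On the log scale, swapping X and Y only flips the sign of every ri, which is why the geometric mean inverts cleanly and the multiplicative standard deviation does not change.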

Normality of ri is hardly guaranteed, but in practice there is often a good fit. If not, it usually means there is some subset of the benchmarks that is radically different, as seen with vector machines compared to scalar microprocessors.

The recent Machine Learning Performance (MLPerf) group uses a similar approach: https://mlperf.org/

]]>Adding to MST’s and AW’s comments: I think more about logarithms needs to be included in the secondary math curricula. See for example, the links Logarithms and Means, Lognormal Distributions 1, and Lognormal 2 at https://web.ma.utexas.edu/users/mks/ProbStatGradTeach/ProbStatGradTeachHome.html .

]]>“If the initially assumed noise distribution does not match the data distribution, there are two ways to pursue a better description. The preferred way is to use a better noise distribution. The other way is to transform the data to a new scale so that they tolerably match the shape of the assumed noise distribution. In other words, we can either change the shoe to fit the foot, or we can squeeze the foot to fit in the shoe. Changing the shoe is preferable to squeezing the foot. In traditional statistical software, users were stuck with the pre-packaged noise distribution, and had no way to change it, so they transformed their data and squeezed them into the software. This practice can lead to confusion in interpreting the parameters because they are describing the transformed data, not the data on the original scale.”

I forgot the page number; I’ve been sitting on this quote for a couple of days, trying to come up with something relevant to say, but my brain is just… it just doesn’t.

But I find Kruschke’s approach intuitively appealing; I like the simplicity of it. Though I admit that I’ve never really thought about the perspective given in the post: maybe log-transformed stuff is indeed more easily interpretable! To me it has always just seemed like one more hoop to jump through. Maybe it has been in the applications I’ve worked on; maybe I’ve limited myself for no other reason than my prejudice.

]]>I never understood this argument. It presupposes that the raw reading time analysis somehow reflects the truth a priori. In practice, what I find is that if an interaction in raw RTs disappears in log RTs, it’s because there were a few extreme data points in the raw RTs driving the effect.

Not to mention that focusing on whether the effect is significant or not is utterly insane, but there I’m preaching to the enlightened on this blog.

]]>jim – Cool. Interesting example. It got me to playing around with mixture models and models with student family in ‘brms’ yesterday. I haven’t really had to model data like you described, before.

]]>Should have said: “especially when the survey effort is varying”.

]]>Thanks Bob and Daniel for clarifying.

So if I try to summarize for myself, Daniel shows a clever way to scale the dependent data, while avoiding log(0) = -Inf values. If I understand correctly, it corresponds to my example, modelling for example log(1/meanPopulation + population/meanPopulation) = intercept + b*numberOfHouses instead of population = intercept + b*numberOfHouses.

And Bob’s example looks to me like adding log(exposure) as an “offset” on the linear predictor, similar to when you would include log(surveyEffort) as an offset in a Poisson model of counts. I usually do this when modelling counts of animals, especially when counts can be zero, essentially modelling the count/surveyEffort.

Or I have still misunderstood and need to play around with some numbers to get it…

]]>Sure, those seem like reasonable ideas. My preference would probably be to impute small values as draws from some distribution; I’d probably tend to use a gamma distribution, and try a few different sets of parameters.

Imputing to a fixed value is always going to imply a falsely small variance, and generating some random numbers is certainly not much harder than using a fixed number.
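A rough sketch of what drawing the imputed values could look like, using only Python’s standard library. The detection limit and the gamma shape/scale are hypothetical placeholders; in practice you would try several parameter sets, as suggested above.

```python
import random

DETECTION_LIMIT = 0.5      # hypothetical detection limit
SHAPE, SCALE = 2.0, 0.1    # hypothetical gamma parameters (mean = SHAPE * SCALE)

def draw_below_limit(rng, limit=DETECTION_LIMIT, shape=SHAPE, scale=SCALE):
    """Draw from a gamma distribution, rejecting draws at or above the limit."""
    while True:
        value = rng.gammavariate(shape, scale)
        if 0.0 < value < limit:
            return value

rng = random.Random(42)  # fixed seed so the sketch is reproducible
imputed = [draw_below_limit(rng) for _ in range(200)]

mean = sum(imputed) / len(imputed)
variance = sum((v - mean) ** 2 for v in imputed) / (len(imputed) - 1)
print(mean, variance)  # a fixed imputed value would give variance exactly 0
```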

]]>jd: definitely would use different distributions for different types of deposits. High-grade vein-hosted gold deposits with free gold (Ontario, Quebec, maybe Colorado) typically show a strong nugget effect, while base-metal deposits (porphyry Cu/Mo in New Mexico, Arizona, Utah) with secondary gold have a more normal distribution.

]]>Dan, you say “It might not matter, but it might matter, and it seems like the only way you can tell is to fit the more complete model and show that it’s not much different from the approximate model.”

There may be cases in which that’s true, but it’s certainly not universal. As I mentioned earlier: You can impute the negatives and zeros to some low number that is on the low end of what is reasonable, and run your analysis; and then redo it with the negatives and zeros imputed to some other number that is on the high end of what is reasonable, and if the inferences for the main parameters of interest don’t change much then I don’t think there’s a need to do something much more complicated.

I’m not claiming that imputing a reasonable low number always works. Indeed, I have seen analyses in which people did something standard — replace all of their ‘below detection limit’ measurements with a value equal to half the detection limit — and got results that were pretty bad, or at least that I didn’t trust at all and that seemed fishy. But if they had tried imputing a value equal to 0.1 x (detection limit) instead, they would have gotten a very different answer, which would have warned them that their results were sensitive to their imputation procedure. In such a case they should probably fit a more complicated model, as you suggest. But I do think that if you try replacing ‘below detection limit’ measurements with 0.8 x (detection limit) and with 0.1 x (detection limit) and your key results don’t change much, you’re almost certainly fine.
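Here is a minimal sketch of that sensitivity check. The measurements, the detection limit, and the choice of the geometric mean as the “key result” are all made up for illustration; substitute whatever inference you actually care about.

```python
import math

DETECTION_LIMIT = 0.5  # hypothetical
# None marks a 'below detection limit' reading (hypothetical data).
measurements = [1.2, 0.9, None, 3.4, 2.1, None, 0.8, 4.7, 1.6, 2.9]

def geometric_mean_with_imputation(values, fraction):
    """Replace censored readings with fraction * DETECTION_LIMIT, then summarize."""
    filled = [fraction * DETECTION_LIMIT if v is None else v for v in values]
    return math.exp(sum(math.log(v) for v in filled) / len(filled))

low = geometric_mean_with_imputation(measurements, 0.1)
high = geometric_mean_with_imputation(measurements, 0.8)
relative_change = (high - low) / low

print(low, high, relative_change)
```

With 2 of 10 readings censored, the two summaries differ by a factor of 8^(2/10), roughly 1.5, so this toy data set would count as sensitive to the imputation choice. With only a handful of censored values in a large sample, the two numbers would be much closer.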

]]>Reducing Costs and Improving Fit for Clinical Trials that Have Positive-Valued Data

Maria Deyoreo & Brian P. Smith

Pages 234-242 12 May 2017

Apparently people do perceive logarithmically: https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2013.00636.x

]]>When I said “complex” I meant “tricky”, not “using complex numbers”… YourModel should be a model outputting a real variable, so that inv_logit converts it to the range (0,1).

]]>> Or consider logistic regression. What it’s saying is that the log odds of an outcome is a linear function of the predictors

I dislike this description of logistic regression. It makes it sound like you have some strong assumption in place about how the log odds transforms your data into a line or something… There’s nothing preventing you from doing nonlinear models though.

What logistic regression means is that inv_logit(YourModel(Covariates)) is guaranteed to be a number between 0 and 1 and YourModel is allowed to be anything and yet thanks to the nonlinear transformation, it can never violate these limits of 0 to 1.

YourModel could easily be a complex 35 term Fourier series with respect to Covariates, or a radial basis function or a Chebyshev polynomial or an exponential function or any old nonlinear thingy and yet it will overall output values between 0 and 1 as it should.
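As a quick illustration, here is a sketch in Python. `your_model` is an arbitrary nonlinear function I made up (a couple of trig terms plus a cubic), standing in for whatever YourModel is.

```python
import math

def inv_logit(v):
    """Inverse logit, written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def your_model(x):
    # Hypothetical nonlinear predictor: real-valued, unbounded, not a line.
    return 3.0 * math.sin(2.0 * x) + 1.5 * math.cos(5.0 * x) + 0.1 * x ** 3

xs = [i / 10.0 for i in range(-50, 51)]
probs = [inv_logit(your_model(x)) for x in xs]

print(min(probs), max(probs))  # both strictly inside (0, 1) on this grid
```

Mathematically the output is always strictly between 0 and 1; in floating point it can round to exactly 0 or 1 once the model output gets extreme enough, which is one reason to work with the log odds directly when computing likelihoods.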

]]>There are many opportunities to use log transformations to make things more reasonable. Bob gives one example that I haven’t thought about so much. But Jens asks a question about transforming the observed data, and I think that’s a valid use of this kind of thing.

Suppose you have some data of counts of objects, and the counts are very large, like people in counties, or bacterial cells in water samples or whatever. You know this isn’t a continuous variable, but with the counts being potentially very large, the minimum increment is essentially “dx” compared to the size of the typical measurement, so you can treat it as-if continuous. But the best way to do this is to rescale the data from counts to fractions of a typical count… so suppose you’re talking about bacterial cells in a culture… and a typical number is maybe around 14500 but you will have some samples with counts like 5 and other samples with counts like 13551900…

So you take your data x which is counts of cells, and you divide it by the typical size 14500 and you get a ratio like “multiples of the typical value”… Now because this varies over several orders of magnitude between say 5/14500 = 0.00034 to 13551900/14500 = 934 you want to take a logarithm of this number and model *that* which will be much more compact between say -8 and +8

The only problem is it’s typically possible for you to get a 0 count and so log(0) = -inf and everything goes crazy.

Instead of taking the log of the data, you take the log of the data + the minimum possible increment. This perturbation affects *only* the left tail where you’re down close to 0 cells.

If scale=14500 then you take your data x and do (1+x)/scale and take the logarithm of this:

log( 1/scale + x/scale)

Now you have a variable which is centered somewhere around 0 (the fact that scale is “typical” means x/scale ~ 1 and log(1) = 0) and limited in tails to the region log(1/scale) to log(xmax/scale). Typically this is a much better behaved number to model.
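In Python, a sketch of the transformation on the counts mentioned above might look like this (the function name `log_scaled` is just mine):

```python
import math

SCALE = 14500  # the "typical" count from the example

def log_scaled(count, scale=SCALE):
    """log(1/scale + count/scale): finite even when count == 0."""
    return math.log(1.0 / scale + count / scale)

counts = [0, 5, 14500, 13551900]
transformed = [log_scaled(c) for c in counts]

for c, t in zip(counts, transformed):
    print(c, round(t, 3))
```

A zero count lands at log(1/scale) instead of -inf, a typical count lands essentially at 0, and the huge count lands a bit below 7.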

]]>I just mean say you are looking at the returns for a bunch of stocks. Which is easier to understand?

The percent the stock price has changed ytd, or the asinh transformed version? Here is what it looks like for ~7700 listed stocks: https://i.ibb.co/gjvXwNt/perfYTD.png

If you tell me the inverse hyperbolic sine of the year-to-date performance of a stock is 2.3, I think that is more difficult to interpret than telling me the YTD performance is +10%.

]]>Daniel – Good points. Makes sense.

When I first read jim’s post, I was thinking of two different processes that generated the ‘nuggety’ and typical observations. But re-reading the post, it just sounds like he is referring to outliers in general.

> you can just go ahead and impute the negative values, zeros, and ‘below detection limit’ values to something reasonable

Maybe, but it seems better to impute them via a parameter so that the parameter can vary around, otherwise you’ll get an apparently too precise estimate. It might not matter, but it might matter, and it seems like the only way you can tell is to fit the more complete model and show that it’s not much different from the approximate model. That can be a good idea if you’re planning to run this model over and over again at different times for example, you find a computational approximation that’s much faster and show it doesn’t offer much error… but if it’s a one-shot sort of thing, it seems like you want to do the more full model.

]]>jd, sure if the nuggets are typically always say around 10-15 oz/ton then it makes sense to say your distribution is one half-normal distribution normal(0,1) truncated to [0,inf] and one nuggety distribution normal(12,3) or similar, with an unknown mixture…

but you will typically have a harder time sampling this kind of distribution, because there’s a large energy barrier between the region around 0 and the region around 12. You can benefit from some kind of “bridge” between the two, say a t distribution with some small mixture quantity that prevents the region in between from going too close to zero density, even if it isn’t necessarily all that realistic…

]]>What about a mixture model of Gaussian distributions? Would that be a bad idea in this scenario? Distributions for ‘nuggety’ and typical deposits.

]]>Including a measurement error model adds a big layer of complexity that, in many cases, is completely unnecessary. Dealing with a vector of concentrations is just way easier than dealing with a vector of probability distributions. If you don’t really care whether a few true concentrations are 0.3 or 0.4 pCi/L because most of your measurements are in the range 0.8-5.0 anyway and you just want to make sure the really low ones aren’t exerting too much influence, you can just go ahead and impute the negative values, zeros, and ‘below detection limit’ values to something reasonable. If you do this you should check and make sure that, if you choose some different value that is also reasonable, the inferences you care about don’t change enough to matter. This is the situation I was in.

But although many situations are like that, many situations are not.

The situation Bob describes seems like it might be different: the measurement errors might really matter. Actually there are all kinds of issues with population sampling such that miscounting captured fish might be among the least of them. As fisheries biologists know well, fish aren’t balls in an urn from which you can withdraw a sample with equal probability. For instance, there are hard-to-sample populations that exchange individuals with easy-to-sample populations at some unknown rate, so even if you accurately quantify the easy-to-sample population you probably don’t really know what you want to know. It’s certainly an area that can benefit from good statistical models.

]]>I know, right? I mean, yes, if you have a large sample and under fairly mild assumptions this isn’t an issue for the most part – the Central Limit Theorem will take care of it. That’s what I was initially taught about how to think about the normality assumption in my original field.

But while in practice the CLT will usually hold (and thus it’s not really incorrect), there can always be aberrant cases where it may not (e.g. with Cauchy-distributed or fat-tailed errors in general). In a world in which it’s not hard or (generally) computationally expensive to check the residuals, there is no real reason not to look at what kind of deviations from the normality assumption may be going on (if any), since most software will output a normal QQ-plot of residuals automatically when doing general model diagnostics. In my particular case I wasn’t taught about those diagnostics either, even though this should probably be standard in any undergrad course on linear regression. I wonder how much published research also has some rather ugly residual plots simply because researchers didn’t bother looking at them since “normality of errors is the least important assumption”, particularly plots that allow one to detect violations of the critical assumptions.

I think the issues with statistical education don’t stop with how people are teaching how to properly think about NHST but go beyond this particular topic.

]]>Andrew, I agree that Box and Cox is not really to be taken literally in that whatever lambda happens to be is the power to use. But the principle seems reasonable, and the log and reciprocal are really useful transforms for reading time data. Kliegl, Masson have a paper on this.

]]>The shifted log-normal is easy to fit since Stan and brms came around. Makes a lot of sense.

]]>The 1/4 power idea sounds interesting but I wonder if it’s just an ad-hoc hack or whether there’s any reference for this. There are more general families of power transforms that include it as a specific case (e.g. Box-Cox). These were designed with the goal of achieving “normality” though, which is not what we want to prioritize in this discussion, and also I’m thinking about transforming covariates and not dependent variables here.

]]>Thanks for the explanations!

]]>Can anyone think of an issue if you transform one variable of a bivariate outcome? Honestly, I haven’t considered this too carefully.

]]>Was this a response to the original suggestion? I think Dave C. above is suggesting a Poisson distribution for the dependent (that’s y, right?) variable, not transforming it.

It looks like the quasi-Poisson model is like the negative binomial in that it’s a gamma-Poisson compound. The difference is in how the variance is characterized: the negative binomial makes it a quadratic function of the mean, whereas the quasi-Poisson makes it a linear one. The section of the Poisson Wikipedia page on overdispersion has a nice clean definition. These models make a lot of sense, but they can be challenging to fit with MCMC because of the extra degree of freedom the overdispersion gives you.

]]>Yes! The model Daniel Lakeland suggests will also be much better predictively, because you’re going to continue to get that nuggety behavior in future observations. The point is to try to model what’s actually going on in the geology, then model how your measurements are derived from that.

It’s also possible to treat some of these observations with censoring, but that just throws away information if you actually have it and if you don’t have a wide enough error scale (wide enough tails, for example), it’ll be biased predictively.

]]>This is exactly the message we’re always trying to get across. Model what’s actually going on. If there’s a latent value that’s constrained to be positive, model it as a parameter constrained to be positive. Then if the measurement can be negative that’s no problem—it’s no longer inconsistent.

I had a fairly involved discussion a while ago with some fisheries people on their population models. I felt they were too convinced they measured total fish caught accurately and suggested they rethink the model along those lines: a latent population which is always non-zero, then measurements, which could be noisy enough to imply you caught all the fish in the sea and then some. I didn’t make much headway in that discussion, which, along with other cases like this, is why I’m so engaged in this particular thread.

]]>Not quite. You already rescaled in your model. What I was thinking about in terms of exposure in epidemiology models is as follows. It’s like what Daniel Lakeland suggested, only with area-specific predictors. You could also do it with more general ones.

Suppose you have areas n in 1:N, where the data y[n] for each area is its population. At that point, you can build a very simple (heterogeneous, not spatial) hierarchical model

y[n] ~ Poisson(exp(lambda[n]))
lambda[1:N] ~ normal(mu_lambda, sigma_lambda)

In this case, exp(lambda[n]) is the expected population in area n. The hierarchical model accounts for the population distribution of areas. It’s a silly model as we don’t really have anything to anchor the mu_lambda other than the y[n]. There may not be much opportunity for partial pooling if there’s a wide range of scales for y[n]. For instance, we might have a major metropolis and a small town if our data is on cities and these might vary by several orders of magnitude.

It can make more sense in this case if we have something like the number of houses as a predictor x[n], to do something like this:

y[n] ~ Poisson(exp(log(x[n]) + theta[n]))
theta[1:N] ~ normal(mu_theta, sigma_theta)

Now exp(theta[n]) can be interpreted as the population per unit of housing in area n, since the expected count is x[n] * exp(theta[n]). This will usually wind up being a better model given the predictor. In these kinds of models, the x[n] are called “exposure” terms.
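To see what the exposure term does, here is a deliberately simplified sketch in Python. It is not the hierarchical model: it assumes a single shared theta for all areas (complete pooling), in which case the Poisson maximum-likelihood estimate has a closed form. The house and population numbers are invented.

```python
import math

# Hypothetical data: exposure x[n] (houses) and counts y[n] (population).
houses = [120, 4500, 80000]
population = [310, 11200, 205000]

def expected_count(x, theta):
    """Poisson mean with a log-exposure offset: exp(log(x) + theta) = x * exp(theta)."""
    return math.exp(math.log(x) + theta)

# Simplifying assumption: one theta shared by every area.
# Setting the derivative of the Poisson log-likelihood to zero gives
# exp(theta_hat) = sum(y) / sum(x).
theta_hat = math.log(sum(population) / sum(houses))
people_per_house = math.exp(theta_hat)

fitted = [expected_count(x, theta_hat) for x in houses]
print(people_per_house, fitted)
```

The offset puts the metropolis and the small town on the same per-house scale, which is what makes partial pooling of the theta[n] plausible in the full model.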

]]>It depends on whether you know what an inverse hyperbolic sine is. Just squinting at the formula’s going to be hard to interpret if you don’t have that function chunked as performing some useful role. Once that function’s a critical part of the generative process, then I would think it would be more interpretable than an untransformed version to someone who understood the process.

It’s like saying the error term is normal with scale sigma. Statisticians know what that means. But really it’s just shorthand for saying the error is distributed with density p(epsilon | sigma) = exp(-(epsilon/sigma)^2 / 2) / sqrt(2 * pi * sigma^2).
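As a sanity check that the shorthand and the written-out density really are the same thing, here is a small Python comparison against the standard library’s `statistics.NormalDist`. The value of sigma and the test points are arbitrary.

```python
import math
from statistics import NormalDist

def normal_density(epsilon, sigma):
    """The density written out above, term for term."""
    return math.exp(-((epsilon / sigma) ** 2) / 2) / math.sqrt(2 * math.pi * sigma ** 2)

sigma = 1.7  # arbitrary scale for the check
points = [-3.0, -0.5, 0.0, 1.2, 4.0]
reference = NormalDist(0.0, sigma)

max_gap = max(abs(normal_density(e, sigma) - reference.pdf(e)) for e in points)
print(max_gap)  # differences are at floating-point rounding level
```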

Or consider logistic regression. What it’s saying is that the log odds of an outcome is a linear function of the predictors. Is that interpretable? I think that depends if you know that logit(u) = log(u / (1 - u)) and that the inverse is logit^-1(v) = 1 / (1 + exp(-v)). If I just write those functions out, it’s going to be hard to interpret if you don’t already know the meaning.

]]>Consider the expectation function y_hat = f(x, a, b, c) = a + b * x + c * x^2. As Daniel Lakeland pointed out, it’s a second-order polynomial function of the predictor x, but it’s a linear function of the coefficients a, b, and c. If you consider y = g(x, a, b, c, sigma) = f(x, a, b, c) + epsilon, where epsilon ~ normal(0, sigma), then you have a stochastic function.
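A tiny numerical check of the “linear in the coefficients, not in x” distinction, in Python. The coefficient values are arbitrary.

```python
def f(x, a, b, c):
    """Expectation function: second-order polynomial in x."""
    return a + b * x + c * x ** 2

x = 3.0
coef1 = (1.0, 2.0, 0.5)
coef2 = (-4.0, 0.25, 1.5)
summed = tuple(p + q for p, q in zip(coef1, coef2))

# Linear in (a, b, c): adding coefficient vectors adds the outputs.
linear_in_coefs = abs(f(x, *summed) - (f(x, *coef1) + f(x, *coef2))) < 1e-12

# Not linear in x: f(1 + 2) differs from f(1) + f(2).
linear_in_x = abs(f(1.0 + 2.0, *coef1) - (f(1.0, *coef1) + f(2.0, *coef1))) < 1e-12

print(linear_in_coefs, linear_in_x)
```

That additivity in the coefficients is what lets ordinary linear regression machinery fit a, b, and c even though the curve in x is a parabola.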

]]>I think again you would use a measurement error model with a long tail, something like a t distribution. You are doing inference on the population mean, and a single outlier, or even a few, will not pull that estimate much when the distribution has long tails.

]]>Dave:

Fair enough. Ultimately I’m in favor of modeling the data and the underlying process, and data transformations are typically best viewed as a shortcut.

]]>Personally I would use whatever presentation or plot that makes the best case. If it’s a bit more complicated, provide an explanation of how to understand the presentation. I think people like to learn new things if you make a good case for it and provide a clear presentation.

]]>A similar but opposite problem:

Gold is a famously “nuggety” mineral; it’s common to get high gold assays that aren’t characteristic of the deposit in general. A common practice in less sophisticated operations is to use a “cut grade” when calculating anticipated ore grades from core samples, so that any analysis over 1oz/ton is cut to one ounce for the ore grade calculation. So if you have an assay that runs 17.31 oz/ton, you cut it to 1oz/ton. Even this usually isn’t enough to put the calculated grade on par with the ultimate mining grade.

What kind of transformation could be used to apply across all data for this situation? A log would be better than nothing but when you have a sequence of values like 0.39, 0.17, 0.33, 0.41, 0.15, 14.77, 0.49… the log isn’t going to cure that large value.
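For concreteness, here is what the numbers in that sequence look like in Python under the three treatments mentioned: raw, cut to 1 oz/ton, and logged.

```python
import math

assays = [0.39, 0.17, 0.33, 0.41, 0.15, 14.77, 0.49]  # oz/ton, from the sequence above
CUT_GRADE = 1.0                                        # oz/ton cap described above

raw_mean = sum(assays) / len(assays)
cut_mean = sum(min(a, CUT_GRADE) for a in assays) / len(assays)

log_assays = sorted(math.log(a) for a in assays)
# Spread of the six ordinary values versus the gap up to the nugget, on the log scale.
spread_of_rest = log_assays[-2] - log_assays[0]
gap_to_nugget = log_assays[-1] - log_assays[-2]

print(raw_mean, cut_mean)
print(spread_of_rest, gap_to_nugget)
```

Even after logging, the nugget sits several times further from its nearest neighbor than the whole spread of the other six values, so the log compresses the outlier but does not make it look like the rest.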

]]>I’ve always assumed that the log transform fits some evolved property of human perception. Our sensory system and brain do some wacky things – non-linear transforms are the least of it.

Just a thought for your consideration.

]]>quasi-Poisson only assumes that variance is proportional to mean. It is not restricted to count data but works for continuous non-negative variables as well.

]]>Dave, Jesper:

We’re not always working with count data.

]]>+1 to this

]]>Could you spell this out a little bit? I’m confused about what “the x data” is referring to here, since in the example above, log(1+x), x is used for the dependent variable.

Are you saying, instead of modelling for example:

population = intercept + b*numberOfHouses,

you could do:

log(1/meanPopulation + population/meanPopulation) = intercept + b*numberOfHouses

or something else?

]]>Shravan:

I think that what’s most relevant is not the scale of the data but rather the scale of the comparisons. It makes more sense to me to think of a treatment that delays reaction time by 10%, than a treatment that delays by 10 milliseconds.

That said, it will depend on the context. I can imagine a treatment causing an absolute delay and also a multiplicative delay, in which case it would be best to include both these effects in the model, and gather sufficient data to disentangle them.

I’m not a big fan of Box-Cox transformations. Usually for me it will be log, or raw scale, or maybe some custom transformation to handle a variable such as income or education. I can’t really picture an example where I’d want to do the transformation to the -1/8th power or the 0.39 power or whatever, and the idea of estimating the transformation from the data alone sounds like a disaster. I can well believe that the method was helpful for the sorts of problems that Box and Cox were working on back in 1964, but I can’t really see it as relevant anymore. I’d rather just model the process directly.

]]>Also, I suspect people are confused by the meaning of the slope in a linear (mixed) model when it’s done on the log scale. E.g., if you have a two-condition predictor that is sum coded as +/- 0.5, and the estimates from the model are 8 for the intercept and 0.01 for the slope, people think that the effect size is exp(0.01), which is tiny (1 ms), which is certainly not realistic. But the effect is really exp(8+0.01/2)-exp(8-0.01/2) ≈ 30 ms. I am guessing it’s this basic misunderstanding that makes people think that “cognition happens on the millisecond scale”.
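The back-transformation is easy to check in a couple of lines of Python, using the intercept and slope from the example and the +/- 0.5 sum coding:

```python
import math

intercept = 8.0   # log-ms, from the example
slope = 0.01      # log-ms difference between the two conditions

# Condition means on the millisecond scale under +/- 0.5 sum coding.
cond_high = math.exp(intercept + slope / 2)
cond_low = math.exp(intercept - slope / 2)

effect_ms = cond_high - cond_low   # difference in milliseconds
ratio = math.exp(slope)            # multiplicative effect, not a millisecond value

print(effect_ms, ratio)
```

So exp(0.01) is about 1.01, i.e. a 1% slowdown, and on a baseline of exp(8) (close to 3000 ms) that works out to roughly 30 ms.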

We just don’t teach this stuff properly. Stats education is just a bunch of cookbook tricks. When I started out in 1999, I just knew how to do the repeated measures ANOVA by copying out some code on the internet and that was it. I knew I had won when p was less than 0.05 and I had lost if not. That’s how statistics is still taught. One can’t blame people for looking at stuff like log transforms etc. with suspicion.

Another thing I found odd is that psychologists routinely dismiss Box and Cox 1964 as irrelevant. That requires a lot of hubris. But then Box-Cox is not part of the standard education in stats in psych* disciplines so obviously people look at this paper with suspicion (how come I never learnt about this? it must be wrong). Just speculating of course.

]]>Fantastic to see Stan entering psycholinguistics finally. Long overdue.

]]>To be honest, back in 2006/2007 that’s exactly the inference I drew from Gelman and Hill’s statement. You can see why that inference seems to make sense, right? One really has to spell things out for the non-statistician user, they can’t possibly know what to do with a statement like “the residuals are the least important thing”.

]]>Cool! I didn’t know this.

]]>It’s generally true that “[i]f the errors are actually closer to normal” you don’t want to log-transform. A log *link* will work nicely, though, and avoid having to deal with nonlinear regression: in R’s glm (and presumably rstanarm etc.), y ~ x + offset(log(x)), family=gaussian(link="log") will do the trick. If you make it y ~ x + log(x) instead you get a generalized Ricker for little extra cost …

]]>