## What are the key assumptions of linear regression?

Andy Cooper writes:

A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption).

I thought we had long since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.). Am I missing something? I agree that leverage and influence are important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion.

My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book with Jennifer we list the assumptions of the linear regression model. In decreasing order of importance, these assumptions are:

1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .

2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .

3. Independence of errors. . . .

4. Equal variance of errors. . . .

5. Normality of errors. . . .

Further assumptions are necessary if a regression coefficient is to be given a causal interpretation . . .

Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points.
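The distinction between estimating a mean and predicting an individual data point can be checked with a small simulation. What follows is a minimal sketch, not anything from the book: it assumes an intercept-only model with Exponential(1) errors, uses normal quantiles throughout, and the function name `coverage` and all settings are made up for illustration.

```python
import random
import statistics


def coverage(n=100, reps=2000, seed=1):
    """Return (coverage of a nominal 95% CI for the mean,
    coverage of a nominal 80% interval for one new observation)
    when the data are skewed but normal theory is used anyway."""
    rng = random.Random(seed)
    z95 = statistics.NormalDist().inv_cdf(0.975)
    z80 = statistics.NormalDist().inv_cdf(0.90)
    true_mean = 1.0  # mean of an Exponential(1) variable
    hit_mean = 0
    hit_pred = 0
    for _ in range(reps):
        y = [rng.expovariate(1.0) for _ in range(n)]
        m = statistics.fmean(y)
        s = statistics.stdev(y)
        # Interval for the mean: leans on the central limit theorem.
        if abs(m - true_mean) <= z95 * s / n ** 0.5:
            hit_mean += 1
        # Interval for a single new data point: leans on the assumed
        # shape of the error distribution itself.
        y_new = rng.expovariate(1.0)
        if abs(y_new - m) <= z80 * s * (1 + 1 / n) ** 0.5:
            hit_pred += 1
    return hit_mean / reps, hit_pred / reps


if __name__ == "__main__":
    mean_cov, pred_cov = coverage()
    print("95% CI for the mean, actual coverage:", mean_cov)
    print("80% predictive interval, actual coverage:", pred_cov)
```

Under these assumptions the interval for the mean should land near its nominal level, while the nominal 80% predictive interval covers noticeably more than 80% of new points, with all of its misses in the upper tail.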

1. alex says:

I understand that in the context of your book the assumptions are really an explanation of the form of a regression equation. But I’ve never really liked the more common talk of THE assumptions of linear regression. As you say it depends what you are using the model for. Normality is a concern if you are trying to predict a data point but not if you are trying to approximate a conditional expectation.

Even if all the assumptions are violated, you still have a model – and are better off than you were before. It is just possible that if you incorporate autocorrelation, for example, you could get a slightly better one. And, of course, data-driven model selection can also get you into trouble once you’ve fudged the model to get you past the diagnostics.

I wish we could get beyond “these are the assumptions, make sure they are met” to something more like “this is the impact of heteroscedasticity, but you don’t need to worry about it in this context, and this is how you can introduce it into a model if you want to incorporate it.”

2. Christian Hennig says:

We know that all observations are discrete and therefore we know that whatever can be measured can never be normally distributed. We can write down a model based on the normal distribution, we can use it, and we can investigate whether the data are distributed in such a way that they make a meal of the outcome of our analysis.
Claiming that normality is required and has to be tested is total nonsense though (although very, very widespread nonsense). Why test something we know cannot be true? (The linked article is even worse than that, by the way. “Variables” need to be normal??)

3. Entsophy says:

I don’t think those normality assumptions mean what people think they mean. Most everyone sticks to Frequentist intuitions and refuses to go full Objective Bayesian. What you’re actually assuming is that the errors in the data (note: not “past errors” or “future errors” or “errors given by the data generation mechanism” but the errors in the data you actually collected) lie in the high probability manifold of the multivariate Normal distribution (usually a hypersphere or hyperellipse if there’s different variances).

The “probability” calculations derived from those distributions are identifying which claims are true for the majority of possibilities in the high probability manifold. They’re taking a “majority vote” over the range of possibilities in other words. For example, if mu is a parameter of interest, then you might get “for the vast majority of the potential errors in the high probability manifold it happens that a< mu <b". Since the true errors in the data are themselves in the high probability manifold, then our best, or most reasonable, guess is that they are one of those "vast majority" and the inferred statements "a< mu <b" are in fact true.

So the assumption actually needed is that the true errors in the data lie in the high probability manifold. You merely need to know enough about the unknown errors in the data to guarantee this holds. Knowing reasonable bounds on their magnitude suffices usually. It's not needed or even generally true that the histogram of the errors in the data look like a N(0,sigma) or that they be independent, either statistically or causally. Future hypothetical errors are completely irrelevant. In fact the errors in the data can be ridiculously un-normal and non-independent and everything will be fine. The errors will still almost always be one of those 'vast majority' and "a< mu <b" will thus be an accurate statement.

And that's why normality doesn't need to be checked in practice. It's why normality assumptions are "unreasonably" effective in practice. That’s also why Frequentist Confidence Intervals basically never have the coverage properties they think are "guaranteed". In this case Frequentists and Bayesians aren’t interpreting the same basic facts differently. They’re making different claims about what those basic facts are, and the facts on the ground strongly favor Bayesians.

4. Brendon J. Brewer says:

Cool post Entsophy. There are only two parts I don’t like. One is the label “objective Bayesian”. Ugh. The other is: “It’s not needed or even generally true that the histogram of the errors in the data look like a N(0,sigma) or that they be independent”. IMO correlations are a big problem. Assuming independence when the errors are more typical of a correlated distribution can lead to very overconfident inferences.

• Entsophy says:

Brendon,

I don’t care for “Objective Bayesian” either because I just think of it as “Bayesian”, but there is a large group of people for whom “Bayesian” means Savage/De Finetti style subjective bayes which I’m not familiar with and have little sympathy for.

Correlations in errors are almost always present and rarely important for the simple questions the statistics is usually being asked. If you have Y_i = mu+e_i then the errors could be e_1,…,e_9 = -4,-3,-2,-1,0,1,2,3,4 and the fact that they’re highly correlated doesn’t prevent you from getting a good estimate for mu. In fact, the average of the data Y_i will exactly equal mu, so the point estimate couldn’t be any better. Pretty much any interval estimate you create, using any method, will contain the correct mu.

If you knew something about those correlations you could get a tighter interval estimate for mu. But the regular old intervals you’d get assuming NIID won’t be wrong. They’ll still contain mu, it’s just that they’ll be wider than they might have been. In typical cases the reduction in the length of the interval estimate isn’t worth the cost of better information about the correlations.
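The arithmetic in this example is easy to verify. In the sketch below the value `mu = 10.0` is a hypothetical choice (the comment does not give one), and the interval uses a normal quantile rather than a t quantile for simplicity.

```python
import statistics


def normal_interval(y, level=0.95):
    """Interval for the mean that treats the data as iid normal."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    m = statistics.fmean(y)
    se = statistics.stdev(y) / len(y) ** 0.5
    return m - z * se, m + z * se


mu = 10.0  # hypothetical true value, chosen only for illustration
errors = [-4, -3, -2, -1, 0, 1, 2, 3, 4]  # the patterned errors from the comment
y = [mu + e for e in errors]

mean = statistics.fmean(y)  # the errors sum to zero, so this equals mu
lower, upper = normal_interval(y)
print("mean of the data:", mean)
print("interval contains mu:", lower < mu < upper)
```

This particular set of errors sums to zero, which is why the point estimate is exact; the sketch says nothing about error patterns that do not cancel.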

• Brendon J. Brewer says:

I don’t think that’s right in general (about the correlations). With the iid prior on the data, you’re saying there’s a very high prior probability that positive errors will cancel negative ones. With a correlated error model there’s a higher prior probability that, for example, most of the errors are positive. Then cancellation won’t occur. Using the iid likelihood would give a posterior that is too narrow.

• alex says:

I think you’ve got that the wrong way round. Autocorrelated errors contain less information than iid errors. So using ols when you don’t have independence will give you CIs that are too small (more type I errors), your effective sample size is less than you are assuming. Correct modelling will result in wider intervals.

I agree point estimates will be unbiased, but they won’t be as efficient as they could be.
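The claim about intervals being too small under autocorrelation can be explored by simulation. This is a rough sketch under assumptions that are not in the comment: an intercept-only model, stationary AR(1) errors with unit marginal variance, and a normal-quantile interval. The function name `ci_coverage` and the settings `rho`, `n`, and `reps` are illustrative.

```python
import random
import statistics


def ci_coverage(rho, n=50, reps=2000, seed=1):
    """Share of nominal 95% intervals for the mean that contain the true
    mean when the errors follow an AR(1) process with coefficient rho
    but the standard error is computed as if they were independent."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)
    mu = 0.0
    innovation_sd = (1 - rho ** 2) ** 0.5  # keeps the marginal variance at 1
    hits = 0
    for _ in range(reps):
        e = rng.gauss(0, 1)  # start from the stationary distribution
        y = []
        for _ in range(n):
            y.append(mu + e)
            e = rho * e + rng.gauss(0, innovation_sd)
        m = statistics.fmean(y)
        se = statistics.stdev(y) / n ** 0.5  # iid formula, ignores correlation
        if abs(m - mu) <= z * se:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("coverage with independent errors:", ci_coverage(0.0))
    print("coverage with rho = 0.8:", ci_coverage(0.8))
```

With these settings the independent case should come out close to the nominal level, while the strongly autocorrelated case falls far below it, in line with the point about a smaller effective sample size.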

• Entsophy says:

Brendon and alex, I don’t think you’re getting what I’m saying at all. Statements like “you’re saying there’s a very high prior probability that positive errors will cancel negative ones” betray a serious misunderstanding of what I’m getting at. All I can suggest is to forget for a moment what you’ve been taught in statistics class and think in very literal/concrete terms about the one set of errors in the data. Call them E = the vector e_1,…,e_n. This is not a random variable or anything. It’s just the actual numbers for the errors in the data. In fact, for what follows it’s best to just forget that you’d ever heard of “probabilities”. Pretend you’ve never heard of “random”.

If we knew even one of those numbers e_1,…,e_n exactly, then we could determine mu exactly as well. In general we don’t know E exactly, but we can confine it to some domain W such that E is in W. The size of W is directly a measure of how well we know E. Usually we take the log and say the Entropy S=ln|W| is a measure of how well we know E. If we know E exactly then W is a single point and S=0. In other words we’d have perfect knowledge of E and can determine mu exactly because of it.

If W is quite large (~n*ln(sigma) in the case of NIID) then any interval estimates gained by taking that “majority vote” over W will be quite large as well. If you have definite knowledge of any patterns to those numbers e_1,…,e_n you can use that knowledge to find a smaller W’ such that E is in W’. Or in terms of entropy S> S’. So if you take that “majority vote” over W’ you’ll get smaller interval estimates.

Whether you use W or W’ though, as long as those actual numbers E are part of the “vast majority” your interval estimates will contain the true mu. It’s just that your greater knowledge of the patterns in those numbers (S>S’) allowed you to create smaller intervals using W’ than W.

• Entsophy says:

It should be “E= the vector e_1,…,e_n” above.

• Brendon J. Brewer says:

“Statements like “you’re saying there’s a very high prior probability that positive errors will cancel negative ones” betray a serious misunderstanding of what I’m getting at.”

Maybe I don’t understand what you’re getting at, but the fact remains, using an iid likelihood when the actual errors in the actual data (yes I’m aware that they’re fixed, that’s why I called the p(errors | parameters) the prior beliefs about the errors and not the actual distribution of the errors) are more typical of a correlated distribution can result in extremely terrible posterior inferences. I’m amazed you’ve never seen this.

• konrad says:

Entsophy: perhaps you can clarify by taking us through an example where correlation is high?

Say mu=0 (the true value being measured). We know in advance that all errors will be in [-2, 2]. We take 10 measurements, but unbeknownst to us they are highly correlated (due to an unknown causal mechanism) and end up near identical: e_i is in [0.9, 1.1] for every i.

The usual sort of normality assumption will fail because the correlated measurements will cause us to underestimate the variance of the normal distribution, leading to an overly confident inference that mu is close to 1. How is this avoided in your setup?
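konrad’s scenario can be made concrete with numbers. The ten error values below are hypothetical, picked only to lie in [0.9, 1.1] as described, and the interval again uses a normal quantile for simplicity.

```python
import statistics


def normal_interval(y, level=0.95):
    """Interval for the mean that treats the data as iid normal."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    m = statistics.fmean(y)
    se = statistics.stdev(y) / len(y) ** 0.5
    return m - z * se, m + z * se


mu = 0.0  # true value being measured, as in the example
# Hypothetical errors: nearly identical, all inside [0.9, 1.1].
errors = [0.90, 0.93, 0.95, 0.98, 1.00, 1.00, 1.02, 1.05, 1.07, 1.10]
y = [mu + e for e in errors]

lower, upper = normal_interval(y)
print("interval:", (round(lower, 3), round(upper, 3)))
print("interval contains mu:", lower <= mu <= upper)
```

Here the iid-normal interval is narrow and sits near 1, so it misses the true value of 0: the spread within the sample reveals nothing about the shared offset.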

• Brendon J. Brewer says:

That’s exactly what I was trying to get at, konrad. Thanks for making it more concrete.

• Entsophy says:

Incidentally, it was convenient to remove all mention of probability distributions in order to clarify the logic, but probability distributions arise naturally as a kind of generalization to what I said above:

Uniform distribution on W *generalizes to* arbitrary distributions P(x)
W *generalizes to* the high probability manifold of P(x)
S=ln|W| *generalizes to* S=-\sum_{x} P(x)lnP(x)

Although it’s more general the logic behind what’s going on is basically the same. All you have to do is drop the requirement that probability=frequency.

• Entsophy says:

Taking a majority vote *generalizes to* taking the average over P(x)

5. Cyrus says:

Facepalm.

Seriously.

• Piero says:

+1

• Brendon J. Brewer says:

Your comment might be more helpful if you actually said something, instead of simply trying to appear smarter than everyone else. I don’t even know what you’re facepalming about, as there are many people here saying many different things.

• Piero says:

Well, (needless to say, speaking for myself, not for Cyrus), facepalm because anyone who has some vague memory of the textbook treatment of classical (“OLS”) regression (which is what is discussed there) should remember that nowhere in the derivation of the properties of the OLS estimator (at the estimation or inference stage) is anything about the distribution of the *regressors* mentioned, assumed, invoked, used, etc.

So the requirement of “normality of regressors” is something that people just make up. I conjecture that it is because they did not do their homework when they took their intro applied regression course: having taught applied regression courses at the graduate level several times, I have the impression that some of the lazier students think that stating “this has to be normally distributed, that has to be normally distributed” allows them to stay on the safe side. So basically, one could conjecture that the paper linked to this blog post has been written by someone that did not really bother understanding the derivation of OLS when they studied it, and then thinks they can tell others how to do it right. Well worth a facepalm.

It is true, indeed, that by explaining this I’ll appear much less smart, so you have a point there.

6. Shravan Vasishth says:

It’s great that this topic has come up again. I want to ask the statisticians and Andrew the following question:

Suppose we are interested in null hypothesis tests in linear models, e.g., $H_0: \beta_1 = 0$, where $\beta_1$ is one of the parameters in the model. Suppose also that we have a “lot” of data. To make things concrete, assume that we have a 2×2 within-subjects design with 100 subjects; each subject sees each of the four conditions in the 2×2 design 24 times (the standard counterbalancing done in psychology). Linear mixed models are a standard way to analyze such data.

Is the normality assumption important for hypothesis testing in this situation? The answer seems to be yes; this is based on what I have learnt in an MSc programme I am doing at the University of Sheffield. The blanket statement Andrew makes in this post and in the G and H 2007 book does not match the argument I have been taught (see below). Andrew’s statement and the discussion below cannot both be right, unless there is an important caveat that has been left out.

Here is what I have understood so far (I’m excerpting this from my own notes, which are based on lecture notes from the MSc programme). I hope latex typesetting is possible in the comments to Andrew’s post. If the typesetting does not show up as intended, see p. 3 of my notes here:

https://github.com/vasishth/StatisticsNotes/blob/master/linearmodels.pdf

Note that $\hat{\beta} \sim N_p (\beta,\sigma^2 (X^T X)^{-1})$, and that
$\frac{\hat{\sigma}^2}{\sigma^2} \sim \frac{\chi^2_{n-p}}{n-p}$.

From distributional theory we know that $T=\frac{X}{\sqrt{Y/v}} \sim t_v$ when $X\sim N(0,1)$ and $Y\sim \chi^2_{v}$ are independent.

Let
$x_i$ be a column vector containing the values of the explanatory/regressor variables for a new observation $i$. Then if we define:

\begin{equation}
X=\frac{x_i^T \hat{\beta} - x_i^T \beta}{\sqrt{\sigma^2 x_i^T (X^T X)^{-1}x_i}} \sim N(0,1)
\end{equation}

and

\begin{equation}
Y=\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}
\end{equation}

it follows, with $v=n-p$, that $T=\frac{X}{\sqrt{Y/v}}$:

\begin{equation}
T= \frac{x_i^T \hat{\beta} - x_i^T \beta}{\sqrt{\hat{\sigma}^2 x_i^T (X^T X)^{-1}x_i}} =
\frac{ \frac{x_i^T \hat{\beta} - x_i^T \beta}{\sqrt{\sigma^2 x_i^T (X^T X)^{-1}x_i}}}{\sqrt{\frac{\hat{\sigma}^2}{\sigma^2}}}
\sim t_{n-p}
\end{equation}

I.e., a 95% CI (taking $\alpha = 0.05$):

\begin{equation}
x_i^T \hat{\beta} \pm t_{n-p,1-\alpha/2}\sqrt{\hat{\sigma}^2 x_i^T(X^T X)^{-1}x_i}
\end{equation}

So, although we can compute $\hat{\beta}$ without any distributional assumptions, we cannot calculate confidence intervals for parameters, and we cannot do hypothesis testing on these parameters using F tests: we don’t know that $\hat{\beta}$ is multivariate normal, because the distribution of $y$ might not be multivariate normal (because the distribution of $\epsilon$ might not be normal).

Although I find the Gelman and Hill book one of the best ones out there for fitting linear and linear mixed models from a non-statistician’s perspective, it does not help much to make statements about what’s important and what’s less important without really explaining why exactly. People often tell me that the normality assumption is *unimportant* (which is how the phrase “least important” is interpreted) because Andrew Gelman says so, i.e., a proof by reference to a higher authority. Of course, one can go through life taking the word of an expert at face value (we do this all the time with our doctors; I certainly do). But I (and others, I’m sure) really need to understand exactly how Andrew’s comments square with what I have above.

• Shravan Vasishth says:

I left out an important detail in my example: the dependent measure is something like reaction time or reading time in milliseconds; this usually does not lead to normally distributed residuals. I use the Box-Cox procedure to determine the transform that stabilizes variance (this is the method I have learnt as a student of statistics at Sheffield).

• Christian Hennig says:

Shravan: You’re technically right that the theory behind the standard inference assumes the normal distribution. However, the range of error distributions for which for example confidence coverage probabilities are approximately correct for at least moderate n is rather wide (you may run some simulations if you want to convince yourself). Although there are exceptions (beware of leverage outliers)!
As I wrote before, saying that the normal assumption *must be* fulfilled would imply that we could *never* use the theory.

• Shravan Vasishth says:

I followed Christian Hennig’s advice and ran a simulation. The R code is given below. The shape of residuals I generate below is pretty typical of what we see in reading studies. Our effects are often in the range of t=2.0 to t=2.5 or so.

The basic result is that power is about 40% when the residuals are skewed; and power is about 70% when the residuals are (approximately) normal.

It’s true that the coverage of the 95% confidence intervals does not change.

But isn’t the loss of power serious? If yes, would I not want to ensure in this kind of situation that my residuals are *approximately* normal, e.g., by using the Box-Cox procedure to find the appropriate transform?

It seems to me that the normality of residuals should not be dismissed as unimportant across the board. This is the precise conclusion researchers have drawn (that normality of residuals is completely irrelevant), and Gelman and Hill 2007 is often cited/mentioned as the justification for that conclusion. My impression is that Gelman and Hill’s comment has been misinterpreted to be a blanket statement.

I understand that I would need to investigate data with various properties and distributions, but the example below seems like one useful case for making this discussion more specific.

I’m grateful for any corrections and advice on this.

nsim<-100
n<-100
pred<-rep(c(0,1),each=n/2)
store<-matrix(0,nsim,5)

## should the distribution of errors be non-normal?
non.normal<-TRUE

## true effect:
beta.1<-0.5

for(i in 1:nsim){
  ## assume non-normality of residuals?
  if(non.normal){
    ## yes: skewed errors, centered at zero
    errors<-rchisq(n,df=1)
    errors<-errors-mean(errors)
  } else {
    ## no:
    errors<-rnorm(n)
  }
  ## generate data:
  y<-100 + beta.1*pred + errors
  fm<-lm(y~pred)
  ## store coef., SE, t-value, p-value:
  store[i,1:4]<-summary(fm)$coefficients[2,1:4]
  ## number of influential values (Cook's distance > 4/n):
  store[i,5]<-sum(cooks.distance(fm)>4/n)
}

## “observed” power for raw scores:
table(store[,4]<0.05)

## t-values' distribution:
summary(store[,3])

## CIs:
upper<-store[,1]+2*store[,2]
lower<-store[,1]-2*store[,2]
## CIs' coverage is unaffected by skewness:
table(lower<beta.1 & upper>beta.1)

## distribution of num. of influential values:
summary(store[,5])

## power about 40% with non-normally distributed residuals.
## power about 70% with normally distributed residuals.

## typical shape of residuals in reading studies:
library(car)
qqPlot(residuals(fm))

• Andrew says:

Shravan:

When data are all-positive, we recommend taking the log. Typically the model makes more sense on the log scale (that is, additive on the log scale is a multiplicative model on the original scale). This falls in item 2 (“additivity and linearity”) of my above list.
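To spell out the parenthetical, here is a sketch with a single predictor $x$ (the one-predictor form is an assumption made only to keep the notation short):

```latex
\begin{equation}
\log y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\quad\Longleftrightarrow\quad
y_i = e^{\beta_0}\, e^{\beta_1 x_i}\, e^{\epsilon_i}
\end{equation}
```

So a one-unit difference in $x$ corresponds to multiplying $y$ by $e^{\beta_1}$, and the error enters as a multiplicative factor $e^{\epsilon_i}$ rather than as an additive term.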

• Shravan Vasishth says:

Please put this into the next edition of your book to keep the great unwashed masses from misinterpreting your statement. :)

Also, it would help a lot if the importance ranking you provide is backed up with either simulations or an argument rather than just a statement. Right now it looks like it’s an opinion; obviously that’s not the case; but it would help a lot to know where the conclusion comes from. It’s possible I just don’t know the statistical literature, but in that case please just point the reader to references.

8. David W. Hogg says:

It may be of no interest, but I was able to write 55 pages on linear regression (for no identifiable reason I can discern) here: http://arxiv.org/abs/1008.4686

9. Chris P. says:

Is the assumption of normality of residuals (point 5 of Andrew’s list) the same thing as assuming that the outcome data is normally distributed (at least when modelling the data with a gaussian outcome linear regression)? Because I think these two are confused, and I can’t find a definitive separation of the two.

For example, while Andrew says that normality of the residuals is the least important assumption, and I know that MANOVA and LMs in general have been shown to be robust to violations of that assumption, Andrew still places a high value on checking a model’s output (simulating fake data, or posterior predictions) with the real data. In that case, wouldn’t non-normality of residuals show up as the fake data not looking like the real world data?

• Christian Hennig says:

Chris P.:
Second paragraph: “I know that MANOVA and LMs in general have been shown to be robust to violations of that assumption” – I think that’s a much too general interpretation of as far as I know rather modest results. Extreme outliers can make every strong effect seem non-significant. In fact, they are robust against some but not all violations of this assumption and therefore it still pays off to have a look at this.

• Corey says:

“…while Andrew says that normality of the residuals is the least important assumption… wouldn’t non-normality of residuals show up as the fake data not looking like the real world data?”

On this point, the first paragraph of the first comment hit the nail on the head. (I’m not trying to be snarky by repeating “first” — that’s just where the information happens to be.)

• Chris P. says:

Christian:
That’s a good point, but the normality of errors is still #5 on the list and is “generally the /least/ important” of the assumptions (Gelman & Hill 2006, p. 46). That is, for “estimating the regression line (as compared to predicting individual data points).” So I’m still confused about its importance. You’re arguing that it _is_ important, if I’m interpreting your comment correctly.

Corey:
I didn’t read any snark into it at all. :)

I’m assuming you mean the comment about “Normality is a concern if you are trying to predict a data point but not if you are trying to approximate a conditional expectation.” If this is the case (and Andrew makes the same point in the book, as I quoted above), then I guess I’m just confused about their relative importance. In the book, Andrew recommends /not/ performing normality diagnostics. But there is a clear emphasis on model checking through posterior predictive simulation (even Andrew’s most recent blog post on 7-Aug-13 recommends this). How are they practically different?

I can imagine where you could have:
– normal residuals and ‘fitted’ fake data simulations, (call this situation (a))
– normal residuals and ‘misfitted’ fake data simulations, (b)
– non-normal residuals and ‘misfitted’ fake data simulations (c)

But how can you have:
– non-normal residuals and ‘fitted’ fake data simulations (d)…. unless the ‘fitted’ fake data simulation was an accident of aggregation?

My reasoning is (and I’m hoping someone can correct me where I’m wrong) — if your model was specified correctly, your residuals would be normal, meaning your model’s predictions were generally right around the mark. Assuming a gaussian linear model. You may have the wrong theoretic model for the data and happened to have predicted the actual data very well (situation b). But how could you have situation d, unless by accident?

• Christian Hennig says:

Chris: The problem with discussing/ranking the importance of the normality assumption is that, as was discussed above already, “it depends”. There are a number of combinations of specific violations of the assumption and what you want to do in which the violation is actually important, and a number of other situations in which it’s pretty harmless. So overall it’s…?

• Andrew says:

Chris:

I think it’s generally good to see where the model assumptions don’t match the data, and indeed I did list normality as one of the assumptions. I just think that validity, additivity and linearity, independence, and equal variance are more important.

• Chris P. says:

Andrew:

I think I agree with the relative importance, especially after reading papers exploring the robustness to violations of the normality assumption. I guess I’m just confused about: why are normality assumptions de-emphasized (reasonably), but fake data simulation is emphasized?

Intuitively the fake-data simulations make complete and utter sense. Check that the model could have plausibly generated the observed data. No argument there. But, aren’t normality diagnostics, if done correctly, attacking the same problem? And aren’t they a bit more rigorous than “the simulated data /looks/ like the outcome data, that’s good.”

I don’t think I’m really arguing anything or trying to win a point, I’m just trying to figure out the thinking behind the emphasis and de-emphasis. To me, it makes sense to have both, with maybe more emphasis on the normality diagnostics, considering that they take into account the multitude of interacting predictors in the model to produce a measure of model fit that can be presented and easily interpreted in two dimensions.

• Andrew says:

Chris:

I think it’s a great idea to check the normality of the residuals using fake-data simulation. I just would put this step down on the list, after checking validity, additivity and linearity, independence, and equal variance.

And then there’s the question of what to do, once model violations have been found. Is it worth changing the model? Maybe so, especially if prediction is a concern.
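One way to picture checking normality of the residuals with fake-data simulation is sketched below. It is a simplified stand-in rather than the procedure from the book: it assumes a single predictor, simulates from the fitted point estimates instead of a full posterior, and uses residual skewness as the test quantity. The names `fit_line`, `skewness`, and `fake_data_check` are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, sigma, residuals)."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return a, b, sigma, resid


def skewness(v):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(v)
    s = statistics.pstdev(v)
    return sum(((vi - m) / s) ** 3 for vi in v) / len(v)


def fake_data_check(x, y, n_rep=500, seed=1):
    """Compare the observed residual skewness with its distribution in
    data sets simulated from the fitted normal-error model. Returns the
    observed skewness and the share of replications at least as extreme."""
    rng = random.Random(seed)
    a, b, sigma, resid = fit_line(x, y)
    observed = skewness(resid)
    exceed = 0
    for _ in range(n_rep):
        y_rep = [a + b * xi + rng.gauss(0, sigma) for xi in x]
        rep_resid = fit_line(x, y_rep)[3]
        if abs(skewness(rep_resid)) >= abs(observed):
            exceed += 1
    return observed, exceed / n_rep


if __name__ == "__main__":
    rng = random.Random(0)
    x = [i % 2 for i in range(200)]
    # Made-up data with skewed (centered exponential) errors.
    y = [100 + 0.5 * xi + rng.expovariate(1.0) - 1.0 for xi in x]
    print(fake_data_check(x, y))
```

A small share of replications at least as extreme signals that the normal-error model does not reproduce this feature of the data; whether that matters depends on what the model is for, as discussed above.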

13. Matt Williams says:

A somewhat belated comment, but: A couple of co-authors and I actually recently published a response to the Osborne and Waters paper you mentioned. Our response is at http://pareonline.net/pdf/v18n11.pdf

We make similar points to those mentioned here – e.g. that we may assume normality of errors, but not of the marginal distribution of the response variable, or of the predictors. We also address the misconception that measurement error can only bias simple regression coefficients downwards, and the suggestion of the original authors that corrections for attenuation be used for simple and partial correlation coefficients.

Part of our intent with the article was to provide a revised, open access summary of regression assumptions that is still pitched at the level of researchers with limited stats training, but without oversimplifying into inaccuracies. Feedback welcomed :)

• Andrew says:

Matt:

Thanks for the link. I think your article is an improvement upon the original, but given my post above you will probably not be surprised to hear that I am unhappy with its emphasis on the normal distribution. I think the assumptions of validity, additivity, and linearity are much much more important, with the normal distribution for the errors typically only being relevant if you are using the regression model to make predictions for individual cases.