
It’s not about normality, it’s all about reality

This is just a repost, with a snazzy and appropriate title, of our discussion from a few years ago on the assumptions of linear regression, from section 3.6 of my book with Jennifer.

In decreasing order of importance, these assumptions are:

1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .

2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .

3. Independence of errors. . . .

4. Equal variance of errors. . . .

5. Normality of errors. . . .

Further assumptions are necessary if a regression coefficient is to be given a causal interpretation . . .

Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points.


  1. Shravan says:

    …and I will again point out that this section in Gelman and Hill is misused by people using NHST to argue that it’s irrelevant to check the normality of residuals. This is particularly important because people use p-value > 0.05 to argue that the null is true. People don’t really get the point of the book. Andrew, in the next edition, perhaps you could add a whole chapter on what the problems are with NHST, and perhaps talk about the assumptions of linear models not in the frequentist part of the book but in the Bayesian part. Why does this book even have a frequentist part?

  2. Rahul says:

    Is validity overlooked because it can be inconvenient or because it is hard?

  3. numeric says:

    How does Simpson’s paradox fit into this checklist (

      • Clyde Schechter says:

        Actually, I see Simpson’s Paradox as fitting under #1. Simpson’s Paradox is about missing (or, less commonly, inappropriately included) variables in the model. The data don’t map to the research question. You can still have a Simpson’s paradox with analyses that don’t rely on (or even use) additivity or linearity.

        • jrc says:

          To me, #1 is more about “measurement” in the broadest sense: Does the thing I’m measuring actually map to the relationship in the world I am claiming to be investigating.

          #2, though, is about the statistical model: Does my regression model (not substantive theoretical model) reflect the world sufficiently such that I can properly interpret the results, or does it accidentally average over/away something important because it is mis-specified.

          So I think of Simpson’s Paradox in relation to #2. Whether the model is implicit or explicit, it is mis-specified such that the effect in Group A is assumed to be equal (a kind of linear additivity) to the effect in Group B. That isn’t about measurement, it is about modeling.

        • This is my read on it too. When your model is inappropriate and you move to an appropriate model, it can massively change your interpretation of the situation, and that’s more or less the essence of Simpson’s Paradox.

        • Also, Simpson’s Paradox can occur when linearity is in fact a great model; it’s just that you don’t have the right variables.

          • numeric says:

            The example in Wikipedia (for those too lazy to click on the link above) can be “correctly” modeled as

            y = a + x1b1 + x2b2 + u

            where the data is of the form

            y   x1   x2    a
            6    2    0    4
            7    3    0    4
            8    4    0    4
            9    5    0    4
            1    0    8   -7
            2    0    9   -7
            3    0   10   -7
            4    0   11   -7

            where b1 = b2 = 1 for all cases (the error term u is identically zero). Modeling

            y = a + xb + u

            gives the following:

                        Estimate Std. Error t value Pr(>|t|)
            (Intercept)   8.9634     1.7746   5.051  0.00233 **
            x            -0.6098     0.2449  -2.490  0.04718 *

            Multiple R-squared: 0.5081, Adjusted R-squared: 0.4262
            F-statistic: 6.198 on 1 and 6 DF, p-value: 0.04718

            That is, the slope of the regression is negative when the two groups are collapsed together, but when modeled separately each has the same positive coefficient 1. One can eyeball the graph in Wikipedia (I don’t know how to attach it to this comment; is there any way to do that, or do you want to suppress that ability because it would allow massive spamming?) and the intuitive answer is the model y = a + x1b1 + x2b2, but given the low n it’s unlikely any test of misspecification will pick it up. Note number 2 in the list applies since

            xb = x1b1 + x2b2

            and so the model y = a + xb + u is incorrectly specified as b = (x1b1 + x2b2)/x, which means b is non-constant. So somewhere Andrew needs a constancy condition on the b, which I don’t see in the list (maybe it is implied in 2). But Bayesians can let b vary–are there any examples of Bayesian regression where the b is non-constant (and correlated with the x for that matter)? If so, what are the convergence properties of such a model (poor, I would guess). And if there is such a model, it should work on the wikipedia case as that is the simplest case.
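The numbers quoted in this comment can be checked with a few lines of code. What follows is a minimal sketch in plain Python, not the commenter's own code (the pasted output looks like R's `lm` summary, but that is an assumption). The helper name `ols_simple` is hypothetical, and the pooled predictor is assumed to be x = x1 + x2, which is what reproduces the reported coefficients.

```python
# Data from the comment: y, x1, x2, and the group-specific intercept a.
rows = [
    (6, 2, 0, 4), (7, 3, 0, 4), (8, 4, 0, 4), (9, 5, 0, 4),
    (1, 0, 8, -7), (2, 0, 9, -7), (3, 0, 10, -7), (4, 0, 11, -7),
]

def ols_simple(x, y):
    """Least-squares (intercept, slope) for y = a + b*x (hypothetical helper)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Pooled model: collapse x1 and x2 into one predictor (assumed x = x1 + x2).
x = [x1 + x2 for _, x1, x2, _ in rows]
y = [r[0] for r in rows]
pooled_intercept, pooled_slope = ols_simple(x, y)

# R-squared of the pooled fit, for comparison with the reported 0.5081.
y_mean = sum(y) / len(y)
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
ss_res = sum((yi - (pooled_intercept + pooled_slope * xi)) ** 2
             for xi, yi in zip(x, y))
r_squared = 1 - ss_res / ss_tot
print(round(pooled_intercept, 4), round(pooled_slope, 4), round(r_squared, 4))
# -> 8.9634 -0.6098 0.5081

# Separate fits within each group (groups identified by the value of a).
group_fits = {}
for a in (4, -7):
    xg = [x1 + x2 for _, x1, x2, ai in rows if ai == a]
    yg = [yv for yv, _, _, ai in rows if ai == a]
    group_fits[a] = ols_simple(xg, yg)
print(group_fits)  # each group: intercept equal to its a, slope 1
```

The pooled slope comes out negative while each within-group slope is exactly 1, which is the sign reversal the comment describes.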

            • Keith O'Rourke says:

              > Andrew needs a constancy condition
              That is what additivity means – as in treatment effects are constant over different groups.

              Interesting how muddled thinking does seem to be on this issue.

              Replacing additivity with commonness and treatment effects with parameters may make it clear.

              The reality the wiki graph is depicting is that of two groups with differing intercepts but identical slopes.

              To represent that reality adequately, the statistical model needs to have common parameters for what is common in reality and different parameters for what is different – get any of those wrong and you have an additivity failure.

              The wiki graph is a nice illustration of this.

              This is how I would have extracted and coded the data

              1. A model with common intercept and common slope is wrong.

              (Intercept)        x
                   8.9634  -0.6098

              2. A model with different intercepts for groups and a common slope is adequate.

              factor(g)0  factor(g)1  x
                       4          -7  1

              3. A model with different intercepts and slopes for groups is wrong.

              factor(g)0  factor(g)1          x  factor(g)1:x
               4.000e+00  -7.000e+00  1.000e+00     1.652e-16

              Not much of a penalty here – but it’s not a real example (factor(g)1:x 1.652e-16 se=5.509e-17 t=2.999e+00 p=0.04 * )
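Model 2 above (different intercepts, one common slope) can be fit by hand by centering x and y within each group. The sketch below is plain Python and is not the commenter's code. The group label `g` (0 for the first cluster, 1 for the second) is an assumed coding suggested by the `factor(g)` terms in the output, and the function names are hypothetical.

```python
# The eight points as (x, y, g); g is an assumed group label.
data = [
    (2, 6, 0), (3, 7, 0), (4, 8, 0), (5, 9, 0),
    (8, 1, 1), (9, 2, 1), (10, 3, 1), (11, 4, 1),
]

def group_means(data):
    """Mean of x and mean of y within each group."""
    means = {}
    for g in sorted({gi for _, _, gi in data}):
        xs = [x for x, _, gi in data if gi == g]
        ys = [y for _, y, gi in data if gi == g]
        means[g] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return means

def fit_common_slope(data):
    """Least squares for y = a_g + b*x: separate intercepts, one shared slope."""
    means = group_means(data)
    # Centering within groups removes the group intercepts; the shared slope
    # is then the ordinary least-squares slope on the centered data.
    sxx = sum((x - means[g][0]) ** 2 for x, _, g in data)
    sxy = sum((x - means[g][0]) * (y - means[g][1]) for x, y, g in data)
    slope = sxy / sxx
    # Each intercept puts the fitted line through its group's mean point.
    intercepts = {g: my - slope * mx for g, (mx, my) in means.items()}
    return intercepts, slope

intercepts, slope = fit_common_slope(data)
print(intercepts, slope)  # -> {0: 4.0, 1: -7.0} 1.0
```

Sharing the slope while letting the intercepts differ recovers the values 4, -7, and 1 listed for model 2, whereas forcing a common intercept as in model 1 gives the negative pooled slope.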

              In philosophy speak “Awareness of commonness can lead to an increase in evidence regarding the target (2); disregarding commonness wastes evidence(3); and mistaken acceptance of commonness destroys otherwise available evidence(1).”

              These considerations of what to take as common and what as different are everywhere in applied statistics. Years ago I had tried to get this across here – – but my mistake likely was getting bogged down in explaining likelihood mechanics and losing most readers.

              Simple examples like this wiki one are probably a much better way to present the challenges and opportunities of commonness.

  4. Cimentada says:

    When you refer to validity, do you mean more concretely that the variables used are actually measuring what you want and that they’re reliable? Something along the lines of theory justifying the variables and their validity?

  5. FJR says:

    We have recently done some analysis on the impact of the assumption of normality on variable selection in linear regression models:

    Tractable Bayesian variable selection: beyond normality

  6. Elin says:

    Are you saying both that validity can mean that the variables are measured with validity but also that all needed variables are included and unneeded variables are excluded? I was looking for a word and settled for “needed,” but some other word close to it could be used (I have sometimes heard relevant and irrelevant). Also, isn’t a sufficient sample size, so that you don’t have problems caused by multicollinearity, important?

    How important are 4 and 5 outside of a specific analytical framework whether Bayesian or NHST?
