Collinearity in Bayesian models

Dirk Nachbar writes:

We were having a debate about how much of a problem collinearity is in Bayesian models. I was arguing that it is not much of a problem. Imagine we have this model

Y ~ N(a + bX1 + cX2, sigma)

where X1 and X2 have some positive correlation (r > .5), and they also have similar distributions. I would argue that if we assume zero-centered priors for b and c, then multi-chain MCMC should find some balance between the estimates.

In frequentist/OLS models it is a problem and both estimates of b and c will be biased.

With synthetic data, some people have shown that Bayesian estimates are pretty close to biased frequentist estimates.

What do you think? How does it change if we have more parameters than we have data points (low DF)?

My reply:

Yes, with an informative prior distribution on the coefficients you should be fine. Near-collinearity of predictors implies that the data can’t tell you so much about the individual coefficients—you can learn about the linear combination but not as much about the separate parameters—hence it makes sense to include prior information to do better.
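
For a concrete sketch, the model in the question with zero-centered, weakly informative priors on the coefficients might look like this in Stan (the prior scales here are placeholders and would depend on the scale of the data):

data {
  int<lower=0> N;
  vector[N] x1;
  vector[N] x2;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real c;
  real<lower=0> sigma;
}
model {
  // zero-centered, weakly informative priors; scales are placeholders
  b ~ normal(0, 1);
  c ~ normal(0, 1);
  a ~ normal(0, 5);
  sigma ~ normal(0, 5);  // half-normal, via the lower bound on sigma
  y ~ normal(a + b * x1 + c * x2, sigma);
}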

27 thoughts on “Collinearity in Bayesian models”

  1. In practice, though, this can be a big deal in terms of the speed/efficiency of sampling. Gibbs samplers will have a hard time traveling along the “ridge” in the likelihood space that is at an angle relative to the correlated parameters. Even gradient-based methods like HMC (or population methods that approximate gradients) can do better but will often experience a slowdown or have high autocorrelation as they attempt to explore the ridge. As Andrew says, an informative prior can help flatten that ridge so it is less of a problem, but for myself I would always try to reparameterize the model first to remove as much of the correlation as possible (of course, priors can be viewed as effectively reparameterizing the model and vice versa, so it boils down to which method leads to easier interpretation of the resulting posterior estimates).
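
    One common reparameterization for correlated predictors, and the one described in the Stan user’s guide, is to regress on a QR-orthogonalized predictor matrix and recover the original coefficients afterward. A minimal sketch (variable names are mine), assuming two correlated predictors stacked in a matrix X:

    data {
      int<lower=0> N;
      matrix[N, 2] X;            // columns are the correlated predictors x1, x2
      vector[N] y;
    }
    transformed data {
      // thin QR decomposition, scaled as in the Stan user's guide
      matrix[N, 2] Q = qr_thin_Q(X) * sqrt(N - 1);
      matrix[2, 2] R_inv = inverse(qr_thin_R(X) / sqrt(N - 1));
    }
    parameters {
      real a;
      vector[2] theta;           // coefficients on the orthogonal columns of Q
      real<lower = 0> sigma;
    }
    model {
      y ~ normal(a + Q * theta, sigma);
    }
    generated quantities {
      vector[2] beta = R_inv * theta;   // coefficients on the original x1, x2 scale
    }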

    • There’s a discussion with visualizations of this topic in the Stan user’s guide chapter on problematic posteriors.

      We usually try to do both—use as much prior information as we have (in both choosing the prior and the likelihood structure) and also choose a parameterization that’s easy to fit.

      So we’ll use informative hyperpriors where possible and a non-centered parameterization in a hierarchical model. And we’ll marginalize out discrete parameters. But even so, we still wind up using models with K predictor vectors in a K-way multi-logistic regression (which is only identified through the prior).

      P.S. Gibbs and Metropolis are non-starters in high dimensions with correlation. Michael Betancourt has a nice explanation of why in his conceptual intro to HMC paper on arXiv. HMC does much better, but still has diminishing performance with correlated posteriors.
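
      To make that multi-logistic point concrete, here is a minimal sketch of a K-way regression with a full coefficient matrix (names are placeholders); the likelihood is invariant to adding the same vector to every row of beta, so only the prior identifies it:

      data {
        int<lower=2> K;                    // number of outcome categories
        int<lower=1> D;                    // number of predictors
        int<lower=0> N;
        int<lower=1, upper=K> y[N];
        vector[D] x[N];
      }
      parameters {
        matrix[K, D] beta;                 // one coefficient vector per category
      }
      model {
        // without this prior the posterior is improper along the direction
        // beta[k] -> beta[k] + delta for all k
        to_vector(beta) ~ normal(0, 2);
        for (n in 1:N)
          y[n] ~ categorical_logit(beta * x[n]);
      }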

      • I find that I spend most of my model fitting time these days working out good parameterizations, since that leads to drastic efficiency gains when I want to estimate parameters (typically in Stan, though sometimes I work with models with nondifferentiable likelihoods that require population methods, but it helps either way).

        Personally, I find this conceptual work to be the most informative part of model development, since it forces me to do simulations and analytic work that leads to a deep understanding of the model. Of course, it means the final model is often not an “off-the-shelf” approach, but because it is so well-tuned to the problem at hand, it doesn’t require additional work to interpret the estimates (in contrast to the cases where people just throw a bunch of predictors at a GLM and try to figure out what they mean afterward).

        In other words, the parameterization that leads to the most efficient sampling often turns out to be the one that leads to best intuitive understanding of the model!

        • Have you been talking to Michael Betancourt? He’s been saying this about reparameterization from the get go. Matt Hoffman reparameterized our very first models coded in Stan (the so-called “Matt trick”, which we later found out was called a “non-centered parameterization”).

          Stan 2.19 adds a neat feature that hasn’t quite hit full generality yet. Traditionally, we write the centered parameterization as

          parameters {
            vector[K] alpha;
            real mu;
            real<lower = 0> sigma;
          }
          model {
            alpha ~ normal(mu, sigma);
          }
          

          and the non-centered parameterization as

          parameters {
            vector[K] alpha_std;
            real mu;
            real<lower = 0> sigma;
          }
          transformed parameters {
            vector[K] alpha = mu + sigma * alpha_std;
          }
          model {
            alpha_std ~ normal(0, 1);
          }
          

          As of 2.19, the non-centered parameterization can be written like this:

          parameters {
            real mu;
            real<lower = 0> sigma;
            vector<offset = mu, multiplier = sigma>[K] alpha;
          }
          model {
            alpha ~ normal(mu, sigma);
          }
          

          and the centered parameterization in full form as

          parameters {
            vector<offset = 0, multiplier = 1>[K] alpha;
            real mu;
            real<lower = 0> sigma;
          }
          model {
            alpha ~ normal(mu, sigma);
          }
          

          So all that changes is the offset and multiplier, which control the parameterization by using an underlying standardizing transform of alpha to (alpha – mu) / sigma. The implicit Jacobian effectively implements the centered or non-centered parameterization after the algebraic dust has settled.

          What we need to do is abstract our transform, inverse transform, and Jacobian code in a chainable transform the way that TensorFlow Probability has implemented bijectors. Then we’ll be able to have non-centered scale parameters using lognormal, etc.
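
          In the meantime, you can get something like a non-centered lognormal by hand, declaring the scale parameter on the log scale and exponentiating (a sketch with placeholder names; a by-hand workaround rather than the bijector-based feature described above):

          parameters {
            real mu;
            real<lower = 0> tau;
            // offset/multiplier on the log scale does the non-centering
            vector<offset = mu, multiplier = tau>[K] log_sigma;
          }
          transformed parameters {
            vector<lower = 0>[K] sigma = exp(log_sigma);  // implies sigma ~ lognormal(mu, tau)
          }
          model {
            log_sigma ~ normal(mu, tau);
          }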

          In 2.20, which will be out soon, we’ll also have the multivariate version where the offset is a vector and the multiplier a Cholesky factor.

        • Awesome stuff! I’m stuck in 2.18 at the moment, but I’m looking forward to updating, particularly for the multivariate form in 2.20. It looks like much of the stuff I have to do “by hand” at the moment will be more elegant and transparent.

          Along those lines, though I was thinking more about my own understanding of the model in my earlier post, the added transparency in these updates to Stan will help *other* people understand my code, which is another big plus.

          I haven’t been talking with Michael Betancourt, but evidently I should be! And when I first saw the “Matt trick” in action (I think in one of the Stan examples maybe even in the user manual?) it really opened my eyes about how to get the most out of Stan.

        • Bob:

          Yes. Mike Betancourt was there when I spoke on parameterization and Bayesian modeling in London a few years ago, and he’s been involved in lots of our conversations on the topic since then. Lots of new ideas and methods out there beyond what’s in that 2004 paper.

        • Is that the paper that led to your Pinocchio principle? I saw you give a talk on that once, but I hadn’t seen the actual paper. I’m not familiar with the terminology in section one (“data augmentation”). I tend to get lost in the narrative of these JASA papers aimed at statisticians.

          I’m not even talking about generating new modeling ideas for priors, just that some people have told me (namely you and, I think, Michael) that they like thinking in terms of

          alpha_std ~ normal(0, 1);
          ...
          alpha = mu + sigma * alpha_std;
          

          rather than in terms of

          alpha ~ normal(mu, sigma);
          

          The main reason we added multiplier and offset transforms in Stan 2.19 is so that we could use the second form and still get a non-centered parameterization under the hood.

          Where I do see this kind of thing is in moving from, say, a beta prior on a probability to a normal prior on the log odds. They behave differently. But I think that normal prior’s just taken for convenience in the logistic regression, not because of some assumption of a normal population distribution of effects on the log odds scale. I’m probably missing something here.

        • Bob:

          I can’t remember when I came up with the folk theorem and the Pinocchio principle, but that paper is definitely related to both of them.

          Regarding your other point: Yes, in the past few years I’ve been thinking that all parameters should be scaled; for example:

          real alpha ~ normal(mu, sigma);
          

          Or

          real alpha_standardized ~ normal(0, 1);
          alpha = mu + sigma*alpha_standardized;
          

          In either case, the rescaling was motivated by computational concerns (speed of computation and also readability of code) but it makes a lot of statistical sense. So . . . it’s an example of the Pinocchio principle!

        • McElreath, in one of his lectures this last time round (the lectures are online and I can’t remember which one, but he reads the blog so maybe he will chime in), has a really good example of centered versus non-centered parameterizations, the effect each has on the shape of the surface to be traversed, and the efficiency of an MCMC-type algorithm in each case.

  2. I actually just posted about this on the user group[1]. I was dealing with a situation in an Error Correction Model with variables that had time trends in them and had something like >99.9% correlations in log levels. I had a much larger number of data points than parameters, but that didn’t fix the problem. The only thing that fixed it was de-trending the variables, resulting in a much lower correlation.

    [1] https://discourse.mc-stan.org/t/linear-regression-with-highly-correlated-covariates/9546

  3. “In frequentist/OLS models it is a problem and both estimates of b and c will be biased”

    –> where did you hear that? Collinearity increases the variance of the estimates, but it does not create bias in an OLS setting.

    I don’t see a fundamental advantage of a Bayesian analysis for dealing with collinearity, except for the possibility of setting (shrinkage) priors (which does help), and possibly better ways to explore the induced correlations in the posterior (in principle, though, frequentists can do the same by exploring the variance-covariance matrix).

      • If the goal is to estimate E[y|x1,x2,…,xp], zeroing out the coefficient on xp in the model biases the other coefficient estimates. On the other hand, if the goal is to estimate E[y|x1,x2,…,xp-1], then zeroing out the coefficient on xp in the model of E[y|x1,x2,…,xp] doesn’t necessarily bias the other coefficient estimates.

        I think the classical approach of dropping one of the problem regressors in the face of high collinearity is motivated by the second goal, not the first. The idea is that we originally wanted to do the first, but high collinearity means the data won’t allow it, so we’re now going to do the second instead. This can be reasonable in prediction problems, for example, but is trickier in causal inference problems, because switching from goal one to goal two may remove the causal interpretation of the coefficient(s) of interest, depending on the assumed role of xp in the underlying causal model.

    • If they’re perfectly collinear, the notion of true values of coefficients hardly makes sense. The MLE isn’t well defined (even with an L1 penalty; an L2 penalty gives you identification).

      If there’s a lot of collinearity, the posterior’s going to be highly correlated (equivalently, the standard errors of the point estimates will be large).

      > I don’t see a fundamental advantage of a Bayesian analysis for dealing with collinearity

      The main advantage is in inference. In predictive inference, the uncertainty and correlation in coefficient posteriors gets factored in as we average over the posterior. I wrote a little case study involving simulated logistic regressions with correlations.

      In parameter estimation, a Bayesian posterior mean minimizes expected square error in estimation, whereas the MLE is mainly useful asymptotically. I also evaluate this in the case study.

      > except for the possibility to set (shrinkage) priors (which does help),

      No Bayes necessary. Just call that prior a “penalty” and you’re good to go with penalized MLE. At least for the point estimate.

      • Bob:

        Two predictors can be perfectly collinear in the sample, hence no separate identification from the likelihood, but not be collinear in the population, with true values of coefficients making sense. That’s what was going on in the example we had in BDA.

        • Thanks for pointing that out, Andrew. I need to keep highlighting the two notions of identifiability at play (identifiability and propriety of posterior).

          0. Maximum likelihood estimate (MLE) exists. Does the likelihood function have a maximum for a given data set? If we have a standard generalized linear model (GLM) likelihood and collinear data, then there’s no MLE. If we have logistic regression and the data are separable, there’s no MLE.

          1. Identifiability. Roughly speaking, a likelihood is classically identifiable if for every data set, a superset of data can be created that has an MLE. Thus GLMs are identifiable even if the MLE doesn’t exist for every data set. If we have separation in a data set for logistic regression, we just add a counterexample as a data point and it’s no longer separable. If we have collinearity in a data set, we won’t after adding a full-rank data matrix (one that has no collinearity).

          2. Propriety of posterior. Suppose we have a likelihood and data set where the MLE doesn’t exist. If we look at the posterior for a Bayesian model with that likelihood and an improper prior, the posterior will also be improper. The impropriety arises from the infinite ridge of constant probability mass. The impropriety can be corrected by adding a proper prior on the regression coefficients. Then even with no data, the posterior is proper.

          P.S. NUTS is really good at diagnosing impropriety—it’ll jet off along that ridge until it hits that magic 10^300 or so overflow. When you see posterior means near max arithmetic precision with divergences in every iteration (due to numerical overflow), that’s what’s happening. Gibbs, on the other hand, will slowly diffuse along the ridge, as its movement each iteration is going to be slow and in a random direction. There’s a long description of this in the Stan user’s guide chapter on problematic posteriors.

        • Bob, do you have references for these points?

          For identifiability, I would intuitively put it as flat spots somewhere on the multivariate likelihood function that remain flat, no matter how much data you get. And, taking advantage of all observations being discrete, a (formally correct?) definition is quoted below.

          Seems very different from yours. Also, I find the focus on the point of highest value on the likelihood surface (the MLE) misleading and distracting, as it’s the full likelihood function that matters, especially for Bayes in general.

          “Under any data model specification there is only a finite set of possible observations (since without lack of generality continuous observations are excluded) and their probabilities are fully determined by the data model for each parameter point (value). If for every parameter point in the data model, these probabilities are different, then in principle there is a one to one function from these probabilities of possible observations to the parameter points. If not, the data model is not identifiable and no amount of data would ever be able to discern which of the various parameters points gave rise to the observed data – no matter how much data is observed.” from page 29 of https://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf

        • > I would intuitively put it as flat spots somewhere on the multivariate likelihood function that remain flat, no matter how much data you get.

          That’s right. I think that’s just a different way of saying the same thing I said (modulo multimodality, which we can set aside for the moment).

          In the quoted definition, I don’t understand the “finite set of possible observations” bit, but beyond that, the idea’s the same.

          It’s easier to understand with an example, so I’ll use the one from the Stan user’s guide. Suppose my likelihood is defined to be

          p(y | mu1, mu2) = PROD_{n in 1:N} normal(y[n] | mu1 + mu2, 1)
          

          No matter what data set y I get, the parameters (mu1, mu2) produce the same likelihood value as (mu1 + c, mu2 – c). For the likelihood to be identifiable, there needs to be some data set that distinguishes the parameter values.
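
          Written out as a Stan program, the example looks like this (a minimal sketch; with flat improper priors on mu1 and mu2, the posterior is improper along the ridge where mu1 + mu2 is constant):

          data {
            int<lower=0> N;
            vector[N] y;
          }
          parameters {
            real mu1;
            real mu2;
          }
          model {
            // only the sum mu1 + mu2 is informed by the data;
            // without priors the posterior is improper along that ridge
            y ~ normal(mu1 + mu2, 1);
          }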

          The Wikipedia article on identifiability is surprisingly readable:

          > … In statistics, identifiability is a property which a model must satisfy in order for precise inference to be possible. A model is identifiable if it is theoretically possible to learn the true values of this model’s underlying parameters after obtaining an infinite number of observations from it. Mathematically, this is equivalent to saying that different values of the parameters must generate different probability distributions of the observable variables. … A model that fails to be identifiable is said to be non-identifiable or unidentifiable: two or more parametrizations are observationally equivalent.

        • > don’t understand the “finite set of possible observations” bit
          It just keeps the data space finite even with an infinite number of observations, which simplifies the proof.

          And I agree the wiki article is surprisingly readable.

          Thanks for the clarification.

        • This is something I’ve been saying for a while: observations always have finite precision. Even the most precise observations these days tend to come from, say, 32-bit A/D converters. Perhaps time measurements can be made with, say, 48-bit clocks or something.

          As an order of magnitude estimate, the mass of the sun is about 2e30 kg, that’s about 1.2e57 proton masses, which requires about 190 bits to represent. So unless you think we will ever measure anything as precisely as extracting and counting all the individual protons from the sun… you can be assured we will never measure anything with more accurate precision than about 190 bits. That’s a lot less than infinity ;-)

  4. Are you guys talking about bias that is somehow an artifact of imperfect estimation, or bias due to the parameters being correlated per se?

    I’ve been diving just a little bit into causal inference, reading the primer from Pearl et al. A key point there seems to be to include or not include parameters based on the causal model of the process generating the data, and not on whether the parameters are correlated. So you often, depending on the causal model, need to include collinear parameters to avoid bias/wrong estimates. That was a real eye opener for me, after years of hearing you should avoid collinearity because it creates bias.

    But you are talking about bias stemming from imperfect estimation, right?

    • Jens:

      Causal inference is a particular sort of out-of-sample prediction. When making predictions we want to use appropriate conditioning. This is an issue for causal prediction and it is an issue for non-causal prediction as well. We have an example in BDA of a hypothetical prediction problem with two perfectly collinear predictors, where it is appropriate to include both these variables as predictors, accepting posterior uncertainty, rather than to follow the classical approach of including only one of the predictors in the model. Actually I think we took that example out of BDA for the third edition to save space. Anyway, the general point is, whether or not you’re doing causal inference, you have to be careful not to let your data limitations drive your analysis. Excluding a predictor from a model corresponds to the strong assumption of its coefficient being zero.

  5. Please note that OLS is unbiased as long as y = Xb + e, where E[e | X] = 0. Here’s the proof:

    E[b.hat] =
    E[(X’X)^(-1)X’y] =
    E[(X’X)^(-1)X’(Xb + e)] =
    E[(X’X)^(-1)X’Xb] + E[(X’X)^(-1)X’e] =
    b + E[E[(X’X)^(-1)X’e | X]] =
    b + E[(X’X)^(-1)X’E[e | X]] =
    b

    Importantly, this result does not require any assumptions pertaining to collinearity. The variance of b.hat is impacted by collinearity, and b.hat isn’t computable in the event of perfect collinearity due to noninvertibility of X’X. But as you can see, OLS is not subject to “collinearity bias”.

  6. Assuming the model is correct, this is a model identifiability problem. A Bayesian approach helps if your knowledge of the subject matter allows you to set an informative prior.
