When you add a predictor the model changes so it makes sense that the coefficients change too.

Posted on January 4, 2017 9:54 AM by Andrew

Shane Littrell writes:

I’ve recently graduated with my Master of Science in Research Psych, but I’m currently trying to improve my stats knowledge (in psychology, we tend to learn a dumbed-down, “Stats for Dummies” version of things). I’ve been reading about “suppressor effects” in regression recently, and it got me curious about some odd results from my thesis data.

I ran a multiple regression analysis on several predictors of academic procrastination and I noticed that two of my predictors showed some odd behavior (to me). One of them (“entitlement”) was very nonsignificant (β = -.05, p = .339) until I added “boredom” as a predictor, and it changed to (β = -.10, p = .04).

The boredom predictor also had an effect on another variable, but in the opposite way. Before boredom was added, Mastery Approach Orientation (MAP) was significant (β = -.17, p = .003) but after boredom was added it changed to (β = -.05, p = .335).

It’s almost as if Entitlement and MAP switched Beta values and significance levels once Boredom was added.

What is the explanation for this? Is this a type of suppressor effect or something else I haven’t learned about yet?

My reply: Yes, this sort of thing can happen. It is discussed in some textbooks on regression, but we don’t really go into it in our book, except that we do have examples where we run a regression, then throw in another predictor, and the original coefficients change. When you add a predictor, the model changes, so it makes sense that the coefficients change too.
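
A minimal sketch of the point, using simulated data rather than Shane's thesis data. The variable names `x1` and `x2`, the sample size, and every coefficient in the data-generating process are assumptions chosen for illustration. The example fits the regression with and without the second predictor and compares the coefficient on the first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical predictors: x2 is correlated with x1 (made-up numbers).
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)
# Simulated outcome that depends on both predictors.
y = 0.3 * x1 - 0.5 * x2 + rng.normal(size=n)

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_short = ols(y, x1)       # model 1: y ~ x1
b_long = ols(y, x1, x2)    # model 2: y ~ x1 + x2
delta = ols(x2, x1)[1]     # slope of x2 on x1

print("x1 coefficient without x2:", round(b_short[1], 3))
print("x1 coefficient with x2:   ", round(b_long[1], 3))
# The two are linked by the omitted-variable identity:
# short = long + (coefficient on x2) * (slope of x2 on x1)
print("identity holds:", np.isclose(b_short[1], b_long[1] + b_long[2] * delta))
```

In this setup the coefficient on `x1` sits near zero until `x2` enters the model, which resembles the pattern described for entitlement and boredom. Nothing here says that this is what generated the actual data.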

27 thoughts on “When you add a predictor the model changes so it makes sense that the coefficients change too.”

Brenton on January 4, 2017 at 10:04 am said:

Good to be cautious, though, about overinterpreting these types of changes. The changes in coefficient values are fairly minor in absolute magnitude; I would suspect most of these effects are due to multicollinearity and sampling error. Beyond that, with the high degree of shared variance likely present in these measures, the pattern of regression coefficients is likely to be very fungible (Waller, 2008, Psychometrika; see a fairly accessible discussion here: http://www.ats.ucla.edu/stat/stata/faq/fungible/fungible2.htm). For these results (at least as presented here), I would focus on the zero-order effects and use the regression models only to conclude that most of the variance accounted for across the measures seems to overlap.

Jeff McLeod on January 4, 2017 at 10:24 am said:

Brenton is right, you ought to be cautious when interpreting suppressor variables. There should be a *theoretical* reason why the pattern makes sense.

Here’s an example from the educational testing world.

The criterion variable Y is college GPA.
Predictor X1 is an aptitude test like the SAT.
Predictor X2 is a measure of reading speed.

Assume all measures are transformed into deviations or z-scores.

Both X1 and X2 predict Y in a positive fashion.

But the regression results show X1 is positive and X2 has flipped negative.

A plausible interpretation is that while X2 is a predictor of college GPA, the composite (b1X1 – b2X2) is a BETTER predictor than either X1 or X2 alone. Stay with me here…

The negative regression coefficient b2 for X2 reflects that the construct of reading speed is more correlated with the residual of X1 than with the criterion itself. More clearly: some of the performance on a timed test like the SAT is valid variance related to college GPA, but some of the variance is an unfair advantage of reading speed that spuriously increases your SAT score. But this artificial advantage on the SAT is not necessarily correlated with the criterion (obviously, it will help in some classes, but not in others).

Again, this is an example. When reading speed is a suppressor, it means that the coefficients — if I may anthropomorphize them — want to clean the spurious construct variance out of the SAT. Consider the regression equations as penalizing fast readers to put them on par with ordinary readers to yield a better prediction of Y.
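
The reading-speed story can be sketched with simulated data. Everything below is an assumption made for illustration: the latent `ability` variable, the loadings, and the noise levels are invented, and the variables are not standardized (the signs do not depend on that). The sketch checks that speed correlates positively with GPA on its own yet gets a negative coefficient once the SAT score is in the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical data-generating process (all numbers are made up).
ability = rng.normal(size=n)                             # what drives GPA
speed = 0.3 * ability + rng.normal(size=n)               # reading speed (X2)
sat = ability + 0.8 * speed + 0.5 * rng.normal(size=n)   # timed test (X1)
gpa = ability + rng.normal(size=n)                       # criterion (Y)

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

r_speed_gpa = np.corrcoef(speed, gpa)[0, 1]   # zero-order correlation
b_sat_only = ols(gpa, sat)                    # gpa ~ sat
b_both = ols(gpa, sat, speed)                 # gpa ~ sat + speed

print("corr(speed, gpa):          ", round(r_speed_gpa, 3))
print("sat slope, sat only:       ", round(b_sat_only[1], 3))
print("sat slope, with speed:     ", round(b_both[1], 3))
print("speed slope, with sat:     ", round(b_both[2], 3))
```

Here the negative coefficient on `speed` removes the part of the SAT score that reflects speed rather than ability, so the SAT coefficient grows when speed is included.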

Good for you Shane Littrell for your interest in going deep.

- Shane on January 4, 2017 at 8:01 pm said:
  
  Thanks, Jeff! I like your suggestions and the advice makes sense. I read this blog pretty religiously and Dr. Gelman’s articles are always leading me in great directions for the types of things I need to get better at, so I’ve been digging deeper into studying stats in the past few months since submitting that email.
  
  Both of my masters are in psych, and I did well in all of my stats classes, but I’ve always felt that I only learned enough stats to just “get by.” Maybe it’s imposter syndrome, maybe not. But, when I eventually get my PhD, I want to ensure that I master as much of this type of material as I can so that I can minimize (as much as is possible) the mistakes that I’ve learned about through reading this blog (and Uri Simonsohn’s and Daniel Lakens’).
  
  Thanks again for the comments!
  
  - Martha (Smith) on January 5, 2017 at 12:39 am said:
    
    It’s not “imposter syndrome” — it’s having the good sense to realize that you don’t really understand. That’s a good trait.
    
    - Diana Senechal on January 5, 2017 at 8:04 am said:
      
      +1
      
      That would make a good op-ed: the difference between “impostor syndrome” and healthy recognition of one’s need for knowledge and understanding.
      
      Shane, thank you for raising the question about suppressor variables! This discussion is helpful and interesting.
  - Elin on January 5, 2017 at 4:02 pm said:
    
    OLS regression, particularly if you are using it with p values but also just in general, assumes that you have a correctly specified model (this is built into the Gauss-Markov assumptions). That means you don’t take variables in and out; you have all relevant and no irrelevant variables, and that is that. Estimates will always change somewhat when you have covariation among the predictors. I think it’s good you are focusing on the sign and magnitude more than anything else. What you are really doing is understanding the relationships among your predictors, which will help you develop a correctly specified model. But at that point of playing around you have to drop talking about p values for sure.
    
    As Martha said, good for you for pushing deeper in your understanding.
    
  - Angus on January 5, 2017 at 7:45 pm said:
    
    I’ve found that studying undergraduate statistics didn’t really teach me anything I’ve needed to know to be able to work on my PhD. Just enough to be able to understand what people are talking about here.
    
Henning on January 4, 2017 at 10:48 am said:

Thanks for the example Jeff McLeod. Maybe I’m too naive here, but couldn’t this be addressed with some kind of interactions?

Besides, when I read the original post, I was reminded of this post: http://marcfbellemare.com/wordpress/12082

- Jeff McLeod on January 4, 2017 at 11:33 am said:
  
  I agree with you Henning. I think Simpson’s Paradox is germane, and it’s a good way to think about the results.
  
  After finding a suppressor effect, I would follow it up by breaking the suppressor variable into groups (high vs low) and running regression within each group. If the regression coefficients flip back, you can at least suggest that the interaction is a plausible cause.
  
  So no, you’re not naïve Henning. I’d go the same route. But I look at the suppressor variable as a possible diagnostic that could indicate an unspecified interaction, or a multi-level situation as another commenter indicated.
  
  - psyoskeptic on January 4, 2017 at 6:22 pm said:
    
    Why wouldn’t you just put the interaction term in the model? Your method leads you to chase noise around 0.
    
    - Jeff McLeod on January 5, 2017 at 6:24 pm said:
      
      There is a good body of literature in psych methods about the mischief that can be created using interaction terms.
      
      One particularly insidious issue is that an interaction of two correlated predictors can act as a near proxy for a simple square of one or the other variable.
      
      But even if you hit the right construct as an interaction, your reader has no idea how to interpret it. How do you translate a multiplicative term into a concept the human mind can wrap itself around? Some say it’s a sort of “synergy” which sort of rings a bell. In my experience you have to aim for maximum clarity. There is a wonderful book by Aiken and West called Multiple Regression: Testing and Interpreting Interactions. To interpret an interaction, you often end up spelling it out in more detail, and often by rendering the model as linear components with different slopes.
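
The proxy issue mentioned above can be checked numerically. This is a rough sketch with simulated standard-normal predictors; the correlation values `rho` and the helper name `corr_interaction_vs_square` are hypothetical choices, not taken from the literature cited in the comment. It computes how strongly the product `x1 * x2` correlates with `x1 ** 2` as the two predictors become more correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

def corr_interaction_vs_square(rho):
    """Correlation between x1*x2 and x1**2 when corr(x1, x2) is about rho."""
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    return np.corrcoef(x1 * x2, x1**2)[0, 1]

# Illustrative correlation levels between the two predictors.
results = {rho: corr_interaction_vs_square(rho) for rho in (0.0, 0.5, 0.9)}

for rho, r in results.items():
    print(f"corr(x1, x2) = {rho:.1f} -> corr(x1*x2, x1^2) = {r:.2f}")
```

With uncorrelated predictors the product term carries little information about the square, but as the predictors become highly correlated the two terms become hard to tell apart in a regression.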
- Keith O'Rourke on January 5, 2017 at 8:32 am said:
  
  Henning:
  
  The link is much like the one I gave below.
  
  The reality the Simpson’s Paradox example depicts is that of two intercepts and one slope _not_ two intercepts and two slopes.
  
  Jeff goes into something much more specific (along with a term I have never heard of) but which may line up well with Shane’s current interests. More generally, it’s all about realizing the differences between models and hoping to get the least wrong one.
  
Keith O'Rourke on January 4, 2017 at 10:50 am said:

Some refer to this as the Greek parameter problem – ideally the betas should have subscripts that indicate the other independent variables that were included in the model.

More generally, it is the model that has to be specified well enough to adequately connect to reality in some sense.

Model specification includes all the parameters, some of which are taken as common for all observations, different for subsets of observations (e.g. interactions) or partially pooled (e.g. multilevel models). Getting these (too) wrong in various ways leads to various problems.

(A simple example where using one intercept rather than two leads to Simpson’s paradox – http://statmodeling.stat.columbia.edu/2016/09/08/its-not-about-normality-its-all-about-reality/#comment-303932 )
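
A small simulated version of the one-intercept-versus-two situation, offered as a sketch and not as a reproduction of the linked example. The group sizes, the common slope of -0.5, and the intercept gap are all invented. Both groups share the same slope, but fitting a single intercept reverses its sign.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000

# Two hypothetical groups with the same slope but different intercepts.
group = rng.integers(0, 2, size=n)              # 0 or 1
x = 3.0 * group + rng.normal(size=n)            # group 1 has larger x
y = -0.5 * x + 4.0 * group + 0.5 * rng.normal(size=n)

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

pooled = ols(y, x)             # one intercept, one slope
separate = ols(y, x, group)    # two intercepts (via group dummy), one slope

print("slope with one intercept: ", round(pooled[1], 3))
print("slope with two intercepts:", round(separate[1], 3))
```

The within-group slope is negative in both groups, while the pooled slope comes out positive because group membership shifts both `x` and `y` upward.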

- Martha (Smith) on January 5, 2017 at 12:43 am said:
  
  +1
  
- ojm on January 5, 2017 at 3:44 pm said:
  
  +1
  
Carol on January 4, 2017 at 10:53 am said:

Hi Shane,

Yes, this is suppression. I suggest that you take a look at Tzelgov & Henik, PSYCH BULL, 1991 or Lewis & Escobar, THE STATISTICIAN, 1986, for easy-to-understand explanations.

For an empirical example, Marcus Crede, Andrew, and I recently pointed out that the results in that famous (or infamous) PNAS air rage study by DeCelles and Norton were due to a particular kind of suppression (negative suppression, which results in a sign reversal).
The original article, our comment, and an off-the-mark reply from DeCelles and Norton have now all been published in PNAS.

Carol

- Shane on January 4, 2017 at 7:48 pm said:
  
  Carol:
  
  Thank you for the response! I’ll definitely check those out and look for your article. Coincidentally, I recognized Dr. Crede’s name b/c I just applied to Iowa St.’s PhD program.
  
  Whatever school I get into, I’m confident I’ll learn much more about these procedures and feel more comfortable with my analyses and interpretations.
  
  Thanks again!
  
  - Carol on January 5, 2017 at 10:44 am said:
    
    Hi Shane,
    
    I will send you via e-mail a manuscript that Nick Brown and I wrote that contains a clear (we think) description of the three kinds of suppression.
    
    Carol
    
Dan Wright on January 4, 2017 at 4:45 pm said:

It would be nice if the statistics packages could list

                 B      se B
X1 | X2,X3 …    4.2     .8

etc., rather than just X1, but the width of paper would become a problem (particularly with the length of variable names people use nowadays). I think just printing X1 makes people lazy when they say things like the B is the “effect of X1” rather than the “effect of X1 conditional on …”.

- zbicyclist on January 4, 2017 at 6:41 pm said:
  
  To solve the width of paper problem, we could limit it, e.g.
  X1 | X2
  X1 | X2,X3
  X1 | X2 et al. (for 3 or more)
  
NatashaRostova on January 4, 2017 at 5:43 pm said:

The more you study the math beneath regressions, the less magical they become. And eventually you cannot believe you ever viewed the results as anything fundamentally trustworthy or more than simply a mathematical abstraction. I know how controls are supposed to work in the ideal, but in reality I have trouble convincing myself the control is mapping to reality, rather than just an often fragile result of how the likelihood function decided to minimize the residuals.

I might have lost too much faith, actually.

- jrc on January 4, 2017 at 7:12 pm said:
  
  Sounds like you’ve lost just about the right amount of faith.
  
  I’m still amazed when I realize that people think of “I included a proxy measure of that characteristic in my matrix of right hand side variables as a linear relationship between it and the (probably poorly measured proxy) outcome variable” as “I’ve controlled for that.” Then they internalize “controlled for something” as “it isn’t a statistical problem anymore”. It’s like, because I had a list of 7 of your household assets, I can generate an index and then I’ve “controlled” for wealth. So now I know how much wealth matters, and I know that none of my other parameters are tainted by omitted wealth variables. Sure.
  
  So many problems and questions just start sounding ridiculous after you’ve internalized the math. Once you realize that you are just dealing with a mathematical projection, once you see your X matrix as a series of numbers that vaguely represent things in the world… then all that talk about moderating and mediating and suppressor variables (you know, where people over-literally interpret all of these parameters in relation to real things in the world and speculate wildly using their favorite theory jargon), all that just disappears and you wonder who taught these people to think about a) statistics; and b) the real world. Obviously someone who did them real harm in their quest to understand the way the world actually works.
  
  - Martha (Smith) on January 5, 2017 at 12:36 am said:
    
    Yes, it bugs me that people use the word “control” in this way. It would be more realistic to say “attempted to account for” rather than “controlled for”. Especially when people don’t understand the underpinnings of statistical methods (including their assumptions), they are likely to interpret the word “control” in a way that isn’t appropriate, e.g., as in “this switch controls this light.” That’s a deterministic situation, but when you are using inferential statistics you are dealing with lots of indeterminacy.
    
    - Keith O'Rourke on January 5, 2017 at 8:16 am said:
      
      +1 (I especially hate to hear that in response to the occasional journalist who asks about a possible confounder: “Oh, we controlled for that, so it’s not anything to be concerned about.”)
Jack on January 4, 2017 at 10:17 pm said:

Andrew:
It’s not enough to say this sort of thing can happen, because the model is different. One needs to think about what the model is and how well the measures capture the concepts in the model.

In the standard regression model, Y=XB, the underlying conceptual model of how the world works, and of how each of the RHS regressors relates to the LHS dependent variable and to one another, is that each X is correlated with the dependent variable (positively or negatively) and not correlated with the other RHS variables. But that is rarely a description of how the various aspects of the world we are monitoring relate to one another. RHS variables may interact; they may be correlated, so we observe changes in X1 and X2 concurrently.

Researchers using multivariate models need to think about their conceptual model of how the RHS variables interact and how they are associated with one another and their dependent variables. They need to think about whether their measures are correlated, even if there is no interaction or mediation, and even if the concepts they are trying to measure are not correlated.

This is not a problem of matrix algebra, as jrc frames it. Jeff McLeod gets at the issue when he talks about having a theoretical reason why the pattern makes sense. I would be looser than that, applying a lower level of conceptualizing than full-blown theory. And I would think about the meanings of the measures and how closely each measure as constructed relates to the underlying concept being measured as well as to the other measures in the model.

So, my response to Shane would not be to note that he has different models and this happens, but to ask him what his underlying conceptual model is. Does it predict or anticipate interactions among the RHS variables? Does he expect his RHS concepts to be correlated causally, or mediated, or spuriously correlated because they are causally associated with other variables not in the model? Are his measures correlated, even if the underlying constructs are not, and why? What is his causal model? What is his regression analysis confirming or raising questions about regarding his causal model, and how should his model be updated to reflect what he has learned in the analysis?

- jrc on January 4, 2017 at 10:34 pm said:
  
  Jack,
  
  Just for the record, I agree that the important things to consider when a practitioner is choosing a regression model are not issues about matrix algebra but issues about how the model relates to the world. I just think that understanding the matrix algebra is one important way of de-mystifying statistics… it is a reminder that there is no magic going on when you “control” for something.
  
  As for how I think about models when I am doing regression analysis: I think about comparisons in the world. Who is this model comparing to whom in order to make claims about the effect of some thing in the world? Does it make sense to compare within- or across-units? How do I specify a model that implicitly leverages the comparisons that I think are more valid or illuminating or useful instead of comparisons that are less useful?
  
  RCTs fit this easily: I want to compare the treated and the control units. But fixed-effects models for quasi-experimental or observational studies work similarly. If I want to know the effect of, say, a policy mandating maternal work leave that affected some states and not others, I might want to look at changes in my outcomes in the states that implemented the policy relative to changes in states that did not implement it. IV is the same – leveraging differences across people generated by the instrument forces a comparison among otherwise similar people who were affected by the instrument and those who weren’t.
  
  So yes, I agree with you that it isn’t about matrix algebra. But I don’t think it is about thinking carefully about all the cross-correlations either. It is about thinking about the world and how to use what we know about the world and about regression modeling to identify the effect of interest by making useful (as opposed to misleading) comparisons among people. The matrix bit is just one way to remind ourselves that there is no magic in statistics, only comparisons of differences across observations.
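
One way to see the "no magic" point is the standard Frisch-Waugh-Lovell result, sketched here on simulated data. The variable names and all coefficients are hypothetical. The coefficient on `x1` in a regression that "controls for" `x2` is exactly the slope you get after removing the linear part of `x2` from both `y` and `x1` and regressing one set of residuals on the other.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Hypothetical data: x1 and x2 are correlated, y depends on both.
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def fit(X, y):
    """Least-squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])
Z = np.column_stack([ones, x2])   # the "control" part of the model

b_full = fit(X_full, y)           # y ~ x1 + x2

# Residualize y and x1 on the controls, then regress residual on residual.
resid_y = y - Z @ fit(Z, y)
resid_x1 = x1 - Z @ fit(Z, x1)
b_fwl = (resid_x1 @ resid_y) / (resid_x1 @ resid_x1)

print("x1 coefficient, full model:       ", round(b_full[1], 4))
print("x1 coefficient, residual-on-resid:", round(b_fwl, 4))
```

So "controlling for" `x2` amounts to comparing observations on whatever variation in `x1` is left after a linear projection on `x2`. Whether that leftover variation is a useful comparison is a question about the world, not about the algebra.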
  
  - Jack on January 6, 2017 at 8:47 pm said:
    
    jrc,
    
    Agree with what you say, but it doesn’t speak to the problem Shane has, which is a cross-sectional observational data set.
    
    RCTs work to isolate effects, because if randomization is effective, all the other covariates are balanced. Quasi-experimental analysis, often done in a pre-post difference in difference framework, assumes the effects of the covariates over time are constant and the other differences in environment other than the policy change studied are similar across treatment and control groups, so they can be differenced out.
    
    Neither of these speaks to Shane’s question, which is that his parameter estimates are contingent on which other variables are included in the model. Andrew said “It’s a different model. So of course.” I’m recommending to Shane that he try to understand why the estimates are changing, and I offered several things for him to think about: interactions between his variables, collinearity due to correlation of the observed variables in the population, and collinearity due to overlap in what the measures as constructed were measuring. Each has different implications for interpreting the model and for the choice of whether and how the multiple measures are included in the final model selected.
    

Statistical Modeling, Causal Inference, and Social Science
