Good examples of lurking variables?

Rama Ganesan writes:

I have been using many of your demos from the Teaching Stats book . . . Do you by any chance have a nice easy dataset that I can use to show students how ‘lurking variables’ work using regression? For instance, in your book you talk about the relationship between height and salaries – where gender is the hidden variable.

Any suggestions?

30 thoughts on “Good examples of lurking variables?”

  1. I always liked the WWII bombing analysis — I think this was in some old textbook (by Tukey?). After the war they studied the accuracy of strategic bombing with regressions. Some things made sense (different types of bombers had different accuracy levels, higher altitude meant less accuracy). But one variable was whether enemy fighters opposed the bombers, and this had the *opposite* effect from what anyone would expect (fighter opposition meant more accuracy). Want to try to guess the hidden variable?

    Cloud cover. If the weather was cloudy the enemy wouldn’t bother to send up fighters, and accuracy was terrible because in that era bombing depended on sighting landmarks on the ground.

    • Eli:

      That’s good. What would complete the analysis would be the construction of some fake data to make the point. It would be really cool if the students in the class could themselves learn how to create the fake data.

      • You could maybe make up the data yourself and have them try to find the issue? That would be fun for both you and them!
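One way to build the kind of fake data suggested above is sketched here in Python with NumPy. It is only an illustration: the sample size, the probabilities, the effect sizes, and the names `simulate_bombing` and `ols` are all assumptions made up for this sketch, not numbers from the wartime study. Cloud cover is generated first, fighter opposition depends on it, and accuracy is hurt strongly by clouds and slightly by fighters.

```python
import numpy as np

def simulate_bombing(n=5000, seed=0):
    """Fake raid data in which cloud cover is the lurking variable (all numbers assumed)."""
    rng = np.random.default_rng(seed)
    cloudy = rng.random(n) < 0.5                       # lurking variable
    # Assumed: fighters rarely go up in cloudy weather, usually go up in clear weather.
    fighters = rng.random(n) < np.where(cloudy, 0.1, 0.8)
    # Assumed accuracy score: clouds hurt a lot, fighters hurt a little, plus noise.
    accuracy = 50.0 - 30.0 * cloudy - 5.0 * fighters + rng.normal(0.0, 5.0, n)
    return cloudy.astype(float), fighters.astype(float), accuracy

def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

cloudy, fighters, accuracy = simulate_bombing()
naive = ols(accuracy, fighters)              # cloud cover left out
adjusted = ols(accuracy, fighters, cloudy)   # cloud cover included
print("fighter coefficient, clouds omitted: ", round(naive[1], 2))
print("fighter coefficient, clouds included:", round(adjusted[1], 2))
```

With these made-up settings the fighter coefficient should come out positive when cloud cover is omitted and negative once it is included. Students could change the probabilities or effect sizes to see how strong the link between clouds and fighters has to be before the sign flips.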

  2. What about trade and conflict, with distance as a lurking variable? The bivariate relationship is positive, but this simply reflects that closer pairs of states tend to trade more and fight more.

  3. To me, height is the hidden variable in discussions of the relationship between sex and salaries, not the other way round. Can both those statements be valid?

    • Gender is closer to being the cause of salary discrepancy than height. Now, there might be various lurking variables between gender and salary, and Andrew has another blog post on that. (Andrew, could you please link? Thanks!)

        • It means that the “cause” is something like the long-term biases built into societal norms, but those biases are more related to gender than to height. I would guess that within gender, height might be an issue, but that across genders you’ll find that even people of the same height have a gender discrepancy. Combine that with the fact that the two genders tend to have different average heights, and you get a large portion of the bias.

        • How do you distinguish that statement from “within height, gender might be an issue, but that across heights you’ll find that even people of the same gender have a height discrepancy”?

        • In consumer research, an area that I am getting to know, they are always looking for intervening variables or ‘process measures’. A causes C via B. So they set up experiments, where they show that A causes C. Then they measure/manipulate B to show the causality.

        • dan, thanks for the reply. by more related, do you mean more correlated? and when you talk about looking within genders vs between genders, i think that is an example of when andrew talks about the all else equal fallacy.

          i preface this with the disclaimer that i don’t really know what i am talking about.
          rama, is that consumer research example your definition of a lurking variable? would you define a lurking variable as a confounder too? (again, i have no clue what any of these terms mean.) if yes, then something seems off. definitions of confounding usually say that the confounder cannot be in the causal path.

        • Jimmy (There is no reply button on your post so I’m replying to myself) A confound is something that you need to get rid of to get your paper published. An intervening variable/mediator is something you need to have to get your paper published. That just about sums it up for me.

  4. I seem to remember one from university about how the number of storks in an area was a fantastic predictor of the number of babies born in areas of Oslo. Turns out the hidden variable was the number of chimneys in the area, as storks like nesting there!

  5. Aren’t “lurking” or moderator variables in social research really the same thing as instrumental variables in econometrics?

      • To be clear, I understand a ‘lurking variable’ to be an ‘omitted variable’ in the metrics sense.

        The answer, then, is no. Recall that instrumental variables (IV) are used to ‘instrument’ for variables that we believe *are* correlated with the error term. For instance, consider a model that describes earnings in terms of education, age, and age^2. It’s likely that ability also helps explain wages, but ability is unobserved and is correlated with education. [So ability is the omitted variable, and thus is the lurking variable to which you were referring?] Since ability is in the error term, u, we have bias in the parameter on education. To avoid this, we look for a variable Z that

        (1) is correlated with education, so that Cov(educ, Z) ≠ 0

        (2) is not correlated with the error, so that Cov(Z,u) = 0.

        Then by estimating the model:

        educ = A + B*Z + u_2

        and taking the predictions (which are notably *not* correlated with ability) in place of educ in the earnings model, we are able to estimate the effect of education using only the variation in education that is *not* correlated with ability, which is what we wanted.

        So a ‘lurking variable’ is not the same thing as an ‘instrumental variable’ because they have strictly different definitions, but an IV is sometimes used in response to a well specified lurking variable.
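The two-step procedure described above can be tried on simulated data. What follows is a minimal sketch, not the Card analysis: the instrument `z`, the coefficients, and the assumed true return to education of 0.10 are invented for illustration, and age and age^2 are left out to keep it short. Ability is generated but then treated as unobserved, so it sits in the error term.

```python
import numpy as np

def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 20000
ability = rng.normal(size=n)   # unobserved: the omitted (lurking) variable
z = rng.normal(size=n)         # hypothetical instrument, generated independently of ability
educ = 12.0 + 1.0 * z + 1.0 * ability + rng.normal(size=n)
true_effect = 0.10             # assumed return to education
log_wage = 1.0 + true_effect * educ + 0.5 * ability + rng.normal(0.0, 0.5, n)

# Naive regression: ability is in the error term, so the educ coefficient is biased.
b_ols = ols(log_wage, educ)[1]

# First stage: educ = A + B*Z + u_2, then keep the fitted values.
a_hat, b_hat = ols(educ, z)
educ_hat = a_hat + b_hat * z

# Second stage: regress the outcome on the first-stage predictions.
b_iv = ols(log_wage, educ_hat)[1]

print("naive OLS estimate:", round(b_ols, 3))
print("two-stage estimate:", round(b_iv, 3))
print("assumed true value:", true_effect)
```

Under these assumptions the naive estimate should land well above 0.10, while the two-stage estimate should be close to it. Running the second stage by hand like this gives a usable point estimate, but the standard errors it would report are not the correct two-stage ones, so a dedicated IV routine is the safer choice for real inference.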

  6. hi andrew, (eli’s example is pretty good.) do you think it would be worthwhile to define some of the terms? i always find this conversation confusing, because i do not ever really know what people mean when they use these terms above. what is a lurking variable? is it a confounder? if yes, what is confounding? people will sometimes say (as part of their definition) that the confounder is associated with both exposure and outcome. however, others will strengthen this requirement and say that the confounder has to cause exposure and outcome. if so, is the uc berkeley admissions example an example of confounding? but why does it make sense to say that department causes gender? if a lurking variable is not a confounder, what is it? and then, how do other things such as interaction, instrumental variables, etc fit? how do you think about this?

  7. What about David Card’s classic dataset on return to schooling, where he instruments education with a dummy for proximity to a four-year college? The lurking variable here, as usual, is ability, where ability is correlated with education (meaning that an explanatory variable is correlated with the error term). The dataset is available from Wooldridge’s Intermediate metrics text and can be downloaded at the following link:

    http://academic.reed.edu/economics/parker/s11/312/asgns/data/CARD.DTA

      • From the fellow’s website:

        Note: The .dta file is a Stata dataset that should be downloaded and opened in Stata. The .def file is a text file with definitions of the variables and sample statistics. Open the latter in any text editor.

  8. My favorite is the (possibly apocryphal) finding that ice cream consumed and number of drownings are correlated, with the lurking variable being hot weather.
