
How much can we learn about individual-level causal claims from state-level correlations?

Hey, we all know the answer: “correlation does not imply causation”—but of course life is more complicated than that. As philosophers, economists, statisticians, and others have repeatedly noted, most of our information about the world is observational not experimental, yet we manage to draw causal conclusions all the time. Sure, some of these conclusions are wrong (more often than 5% of the time, I’m sure) but that’s an accepted part of life.

Challenges in this regard arise in the design of a study, in the statistical analysis, in how you write it up for a peer-reviewed journal, and finally in how you present it to the world.

School sports and life outcomes

An interesting case of all this came up recently in a post on Freakonomics that pointed to a post on Deadspin that pointed to a research article. The claim was that “sports participation [in high school] causes women to be less likely to be religious . . . more likely to have children . . . more likely to be single mothers.” And the advertised effects were huge: “a ten percentage-point increase in state-level female sports participation generates a five to six percentage-point rise in the rate of female secularism, a five percentage-point increase in the proportion of women who are mothers, and a six percentage-point rise in the proportion of mothers who, at the time that they are interviewed, are single mothers.” These effects are huge to start with—elasticities of 50% for things that have nothing (apparently) to do with the treatment—and are even larger when you consider that the outcomes are binary and, for example, sports participation can’t make you secular if you were already going to be secular anyway, it can’t cause you to have a child if you were already going to have a child anyway, etc.
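To make the binary-outcome point concrete, here is a back-of-the-envelope sketch. Only the 10-point and 5-point changes come from the quoted claim. The function name and the "would have had the outcome anyway" shares are made-up assumptions for illustration, and the sketch assumes the whole change runs through the new sports participants.

```python
def implied_flip_rate(d_participation, d_outcome, share_already_positive):
    """Share of 'movable' new participants whose outcome would have to flip.

    d_participation: rise in the participation rate, as a proportion (0.10 = 10 points)
    d_outcome: rise in the outcome rate, as a proportion (0.05 = 5 points)
    share_already_positive: hypothetical share of new participants who would
        have had the outcome anyway (an assumption, not a number from the paper)
    """
    if not 0.0 <= share_already_positive < 1.0:
        raise ValueError("share_already_positive must be in [0, 1)")
    # Outcome change per new participant, if all of it runs through them.
    per_new_participant = d_outcome / d_participation
    # Only those who would not have had the outcome anyway can be moved.
    return per_new_participant / (1.0 - share_already_positive)


for share in (0.0, 0.5, 0.8):  # hypothetical baseline shares
    rate = implied_flip_rate(0.10, 0.05, share)
    verdict = "impossible (> 1)" if rate > 1 else "possible in principle"
    print(f"already-positive share {share:.1f}: implied flip rate {rate:.2f}, {verdict}")
```

Under these assumptions, the larger the share of women who were going to have the outcome anyway, the larger the effect has to be among everyone else, and past a certain point the required rate exceeds 100%.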

But, as the authors of the paper (Phoebe Clarke and Ian Ayres) explain in their blog posts and the scholarly article, they’re not measuring the effects of sports participation directly. Here’s what they’re doing:

“This paper . . . adopts an instrumental-variables method . . . in which variation in rates of boys’ athletic participation across states before the passage of Title IX is used to instrument for changes in girls’ athletic participation following its passage . . .”

Here’s the summary from the published article in the Journal of Socio-Economics:

[Image: screenshot of the summary from the published article]

And here’s Ayres in the Freakonomics blog:

I apply the same methodology [as used earlier by economist Betsey Stevenson] to social outcomes, and find that sports participation causes women to be less religious, more likely to have children, and, if they do have children, more likely to be single mothers.

More specifically, their analysis is “comparing women in states with greater levels of 1971 male [high school] sports participation . . . to women in states with lower levels of 1971 male sports participation.” The outcomes are state-level average responses to General Social Survey questions for “respondents who completed tenth grade and who either attended high school before Title IX was passed in 1972 or after it came into full effect in 1978.” So they’re doing their best to target their analysis on the group of women who’d be affected by the treatment. Then they run individual-level regressions on binary variables (just as a minor point, I’d prefer to keep the original ordered responses; not a big deal but it can’t hurt), but the action is all coming from the state-level predictor (the measure of male athletic participation in 1971, by state).

The trouble is that instrumental variables regression is not magic. In this case, the problems are:
(a) the treatment is at the group, not the individual, level, and
(b) it’s not a clean “natural experiment.”
Think of it this way. Suppose some states were randomly selected to get the Title IX treatment and some were not. This would be the ideal scenario—but, even there, you’re measuring the effect of an aggregate policy, not the effect on individual participation. But it’s much worse than that. Actually, the treatment was applied to all the states, so all that could be studied was an interaction. It’s not even like these examples where a new policy is phased in, in different years in different states. Finally, of course the interaction being studied is not random; there are systematic differences between states with higher and lower boys’ high school sports participation in 1971. (The highest rates are reported in North Dakota, Nebraska, Minnesota, Iowa, Kansas, Montana, Arkansas, South Dakota, Vermont, Idaho, Oregon, and Wyoming.)

So, where does this stand on the correlation-causation scale? Clarke and Ayres are measuring correlations and giving them a causal interpretation. That’s not always such a bad thing to do, indeed it corresponds to an implicit model in which the observed variation can be taken as random (an ignorable treatment assignment, as Rubin would say). In this case they have a bit more—but not a lot more, in my opinion, because of problems (a) and (b) above. In short, I disagree with Ayres’s claim that this instrumental variable “is about as good as they come.” They come better. That doesn’t mean I don’t think Clarke and Ayres should publish their results. I just don’t think they should jump the gun on the causal interpretations.

Kaiser Fung has written about “story time”: after researchers do the hard work of causal identification and statistical analysis, they start with the unsupported speculation, with the general idea that some of the rigor of the design and analysis should leak into the speculation. I think story time is just fine (and I think Kaiser would agree with me on this). What’s important is to draw the line at the right place, to make it clear to your readers where the data analysis ends and the speculation begins. In this case, I think the analysis ends somewhere after the state-level correlational analysis and the discussion of possible identification. The causal reasoning is speculation.

What to do?

OK, fine. What are we getting from all this, besides general “Mom and apple pie” advice not to oversell our research results (advice that would be good for me to follow sometimes with my own work, I’m sure)?

I do think we can get somewhere, taking as a starting point the implausibility of the reported point estimates. As noted above, if a ten percentage-point increase in state-level female sports participation is associated with a five percentage-point increase in the proportion of women who are mothers, there’s no way that most of this can be coming from a direct effect. The implication would be that there’s this huge group of girls who (i) will have children if they do sports, and (ii) will not have children if they do not do sports. Clearly these estimated elasticities have to be driven by big differences between states that possibly have nothing to do with high school sports. The authors do some placebo controls—applying their analysis to some other outcomes—and get a mix of statistically significant and non-significant results, and that’s fine, but maybe the next step would be to do some more systematic comparisons, looking at lots of different state-level predictors (not just boys’ 1971 high school sports participation) and lots of different state-level outcomes. Report a big grid of correlations, then see what’s there.
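Here is a minimal sketch of what such a grid could look like. Everything in it is a placeholder: the variable names are hypothetical and the numbers are random noise standing in for real state-level data, so the printed correlations mean nothing in themselves. The point is only the layout, with one row per predictor and one column per outcome.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_grid(predictors, outcomes):
    """Correlation of every state-level predictor with every state-level outcome."""
    return {
        (p_name, o_name): pearson(p_vals, o_vals)
        for p_name, p_vals in predictors.items()
        for o_name, o_vals in outcomes.items()
    }


rng = random.Random(0)
n_states = 50

# Hypothetical variable names filled with random noise (assumption: real
# state-level measures would be substituted here).
predictors = {
    name: [rng.gauss(0, 1) for _ in range(n_states)]
    for name in ("boys_sports_1971", "pct_rural", "median_income")
}
outcomes = {
    name: [rng.gauss(0, 1) for _ in range(n_states)]
    for name in ("pct_secular", "pct_mothers", "pct_single_mothers")
}

grid = correlation_grid(predictors, outcomes)

# Print the grid: rows are predictors, columns are outcomes.
print(" " * 18 + "".join(f"{o:>20}" for o in outcomes))
for p in predictors:
    print(f"{p:<18}" + "".join(f"{grid[(p, o)]:>20.2f}" for o in outcomes))
```

If the boys' 1971 sports row does not stand out from the rows for other state characteristics, that is a hint that the estimates are picking up general differences between states rather than anything specific to sports.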

I’d also suggest making, for each outcome, a scatterplot of the state-level aggregate vs. boys’ sports participation in 1971. If you want to make the causal leap, go for it, but make clear that it’s a leap. In the meantime, the scatterplot (with the 50 states labeled by their convenient two-letter abbreviations) could give a lot of insight.

Finally, if you’re interested in the substantive questions about the effects of sports participation, I think it’s essential to make a connection to whatever is already known in this field. Sure, survey data have their limitations: as Clarke and Ayres note, kids select into sports participation. But there are ways of getting around this, various versions of natural experiments which, like the Title IX thing, are not perfect but provide some leverage. Also one can try to model the selection process. Lots of ways of doing this.

Accepting uncertainty and embracing variation

Also a minor point. The article includes the following footnote:

It is true that many successful women with professional careers, such as Sheryl Sandberg and Brandi Chastain, are married. This fact, however, is not necessarily opposed to our hypothesis. Women who participate in sports may “reject marriage” by getting divorces when they find themselves in unhappy marriages. Indeed, Sheryl Sandberg married and divorced before marrying her current husband.

This sort of case-by-case discussion can be interesting for formulating hypotheses but it looks odd to me when phrased as above. My problem is that, even if all the modeling assumptions are correct, the model’s predictions are only probabilistic. It’s not necessary to explain away every contrary example. This is not a big deal but I bring it up because one of our themes on this blog in recent months has been the love of certainty, and the desire to use statistical tools to transmute variation and uncertainty into sure things. Sometimes this works (the law of large numbers and all that) but when we get back to individual cases we should recognize the limitations of our models and our predictions.

Limitations of the claims, and limitations of the criticisms

As is generally the case with these correlation-causation things, I don’t want to say that the research hypotheses are false. It may well be true that, at the individual level, “sports participation causes women to be less religious, more likely to have children, and, if they do have children, more likely to be single mothers,” even if the actual effects are an order of magnitude lower than claimed. But state-level correlations don’t tell us much about this. Recall that if we were studying state-level correlations of income and voting, we’d come to the false conclusion that poor people are more likely to vote Republican. In the present example, the Title IX story helps, but only a little.
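The income-and-voting point can be illustrated with a small simulation. This is a toy sketch with invented parameters, not an analysis of real election data: richer states are made to lean less Republican, while within any state richer individuals are made to lean more Republican. The function names and all numeric settings are assumptions chosen for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_states=50, n_per_state=400, seed=1):
    """Return (state-level correlation, within-state individual correlation)."""
    rng = random.Random(seed)
    state_income, state_rep_share = [], []
    dev_income, dev_vote = [], []
    for _ in range(n_states):
        mean_income = rng.uniform(-1, 1)      # state average income (arbitrary units)
        base = 0.5 - 0.2 * mean_income        # assumed: richer states lean less Republican
        incomes, votes = [], []
        for _ in range(n_per_state):
            income = mean_income + rng.gauss(0, 1)
            # assumed: within a state, richer individuals lean more Republican
            p = min(max(base + 0.1 * (income - mean_income), 0.0), 1.0)
            incomes.append(income)
            votes.append(1 if rng.random() < p else 0)
        avg_income = sum(incomes) / n_per_state
        rep_share = sum(votes) / n_per_state
        state_income.append(avg_income)
        state_rep_share.append(rep_share)
        # Deviations from the state average isolate the within-state relationship.
        dev_income.extend(x - avg_income for x in incomes)
        dev_vote.extend(v - rep_share for v in votes)
    return pearson(state_income, state_rep_share), pearson(dev_income, dev_vote)


state_r, within_r = simulate()
print(f"state-level correlation of income and Republican share: {state_r:.2f}")
print(f"within-state individual correlation of income and Republican vote: {within_r:.2f}")
```

In this setup the two correlations have opposite signs by construction, which is the whole point: reading the state-level correlation as a statement about individuals gets the direction wrong.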


  1. Katya says:

    You note “Suppose some states were randomly selected to get the Title IX treatment and some were not. This would be the ideal scenario—but, even there, you’re measuring the effect of an aggregate policy, not the effect on individual participation.”

    I’d just like to understand why you suggest that measuring the effect of an aggregate policy is not ideal (assuming a truly random assignment, which was not the case here).

    As long as individuals across aggregates (states) are identical on observable and unobservable characteristics (i.e. the treatment was properly randomized at the state level), and within state correlations are accounted for in the cov-var matrix (clustered standard errors), why is this methodology an issue?

    Village and state level treatment assignments are quite common in randomized control trials in economics in order to avoid contamination of the control group. Such RCTs assign treatment to clusters, and power calculations account for those clusters.

    I’d love to hear your feedback on this!

    • Andrew says:


      I’m very interested in understanding the effects of state-level policies. But I was reacting to the authors’ statement that “sports participation [in high school] causes women to be less likely to be religious . . . more likely to have children . . . more likely to be single mothers.” That is an individual-level claim.

      • Øystein says:

        But you are also objecting to their claims of causality as such, if I read you correctly. Are you as sceptical of Stevenson’s claims of causality? She finds more moderate effects and emphasizes the fact that the participation data are state-level, but her identification is the same.
        (“I find that a 10-percentage point rise in state-level female sports participation generates a 1 percentage point increase in female college attendance and a 1 to 2 percentage point rise in female labor force participation.”)

        On another note, regarding Ayres’ history: If you compare his and Clarke’s tables they are funnily similar to the ones of Stevenson, but they certainly reference her work, some 15 times in 23 pages.

        • Dan Wright says:

          I think his (Andrew’s) point is that even if a policy causes more people in State X to do Y, that does not mean that the individuals who take part in the policy are more likely to do Y. And studies like City Walker (2014 … below) make me worry about causal conclusions too.
          note: I haven’t looked at the article, just read this blog.

      • jrc says:

        Just to press on this – and not in relation to this particular IV paper which has tons of issues* – I agree with Katya that we don’t need individual level randomization (or quasi-experimental variation) to get insights into individual behaviors. That said, the question is sort of “what level of aggregation in our identifying variation, cross-cut to what degree against institutional/political, geographic, or socio-cultural factors, still allows us to think about individual causal effects?”

        I’m guessing that in the State-Year panel-type setup (and people think of these repeated cross-sections that way – sometimes for good, sometimes for ill), where there is no variation within state-year cell in the covariates of interest, you think we’ve got too many confounding/overlapping effects to make individual claims. I’m also guessing that if the quasi-experiment provided variation at, say, the neighborhood level, some sort of region-X-time Fixed Effect/IV model might be ok to learn about individual behaviors. Is this about right?

        Also – I would say the state-year FE model is the workhorse estimation strategy for a large fraction of applied micro research, whether it be US policy variation or weather variation in Indonesia or whatever. And it is generally accepted practice in the field to think of these as individual-level effects. That is not me arguing that this conception is right or wrong, just pointing out how central it is to a whole lot of quantitative social science work. We can probably agree that at least these estimates give you something about some aggregate effect in a population, which at least makes them better than Mechanical Turk stuff (where they do have nice individual-level information, just not anything representative of anyone else). That said – I’m going to watch now for the rhetoric and pay more attention to how people are thinking about what these effects mean.

        *This paper doesn’t even seem to pass the IV-toolkit checklist: for instance – no first-stage F-stats reported, and the discussion of the validity of the instrument seems to be “This other person did it and people seemed to like it.” But maybe that second point is unfair of me.

  2. City walker says:

    “the interaction being studied is not random; there are systematic differences between states with higher and lower boys’ high school sports participation in 1971. (The highest rates are reported in North Dakota, Nebraska, Minnesota, Iowa, Kansas, Montana, Arkansas, South Dakota, Vermont, Idaho, Oregon, and Wyoming.)”

    Here’s my story: looking at the list above, the states with the highest rates of high school sports participation may also have the highest proportion of rural schools. “Making the team” in these small schools is not as competitive as in big suburban and city schools, so a large proportion of the kids participate. (I went to a small school, I speak from experience!) If the instrumental variable actually reflects the proportion of kids in the state who attend small rural schools, then…

    Does anyone want to build on my story?

    • zbicyclist says:

      A good example of a type M error, it seems to me. In my experience, type M errors aren’t unusual with data that’s aggregated.

      But I can’t resist a bit of a rant.

      Critics of Title IX’s passage predicted dire effects on the social order.

      Interestingly, the authors seem to support these fearful predictions from the socially conservative.
      More sports seems to lead to very large (almost “good enough to make alarmist headlines to raise money” large) effects in directions many people would regard socially undesirable:

      a decline in church attendance (note that churches have many effects, one of which is to create a focal point for the generation of social capital)

      more children outside of the social safety net of marriage.

      And yet the tone of the article is positive, invoking a national sports hero (Chastain; see jrc’s post), just in case we might miss the point that they think these are positive changes.

      The authors are from Yale — what would we think of the study if the identical study and analysis had been done by a sociologist at, say, Bob Jones University and the tone of the article was to regard these as negative changes?

      • zbicyclist says:

        Lest anyone think I’m seeing tone when none is there, this is one line of the 6 line abstract:

        “our results appear to paint a picture of independence from potentially patriarchal institutions (church and marriage)”

        Both church and marriage are multifaceted institutions; any summary of their positive and negative effects goes beyond a blog post.

  3. Moreno Klaus says:

    ….this study reflects the underlying higher probabilities of having kids in rural areas? I don’t know. I also don’t understand what possible theory could explain this. I also don’t get their instrument. It is not obvious to me that boys’ and girls’ participation is highly correlated.

    • City walker says:

      I did a super-fast research study (admittedly very cursory). The results suggest that my theory, that high state-level sports participation by boys is associated with a high proportion of small (presumably rural) high schools in those states, has some validity.

      Here are average high school sizes, by state, for 1999-2000 (I just used the first data table I found):
      Notice that those high participation states really do have the smallest average high school size. Here’s a graph:

      Perhaps Andrew will comment on what it means for the reported results.

  4. Here’s a “story time” that “explains” the effect which I’ve made up just to show what the state-level predictor -> individual level causality problem is:

    The story time model goes like this:

    States in which more girls participated in sports tended to occupy the time of a certain group of girls (athletes) doing things other than dating boys. With the other girls’ time taken up by sports, the *remaining* girls who didn’t participate in sports had more attention from boys, and were therefore more likely to become pregnant. Girls who became pregnant this way were overwhelmingly less likely to be devoutly religious, but these relationships were not built on a strong foundation, and with a lack of religiousness to “keep” them in marriages, these women were more likely to be divorced and single mothers.

    So that story time basically says “more participation in sports on aggregate produces a larger population of nonreligious child-bearing girls who reject their baby-daddies” but the “individual” causal treatment effects are THE OPPOSITE SIGN of the aggregate causal claim! It’s exactly the NON RELIGIOUS NON ATHLETES that become pregnant and single.

    Now, I don’t have any reason to believe my model, but I have about the same amount of reason to believe my model as the claimed model… not very much.

    • Manoel Galdino says:

      +1. Excellent comment.
      Don’t we all know about Ecological Fallacy?

      I was quite surprised by the fact that many readers of this blog thought that even a truly randomized treatment at the state level would allow identification of individual-level causality.

      • Daniel Gotthardt says:

        Of course there is the ecological fallacy and I’m usually the first to warn against it here in my department, but sometimes we don’t have better data and there is actually some quite interesting work on “ecological inference”, for example by Gary King:

        • Andrew says:


          Hey, much of that is my work!

          • Daniel Gotthardt says:


            I actually had no idea to be honest, I just recently stumbled upon Gary King’s work/site. Can you recommend any of your work on the topic? Searching your published articles I found – do you mean that?

            • Andrew says:


              There is that one paper, but most of my work on ecological regression went into Gary’s book. It was largely his project but we worked on it together. I derived many of the formulas and gave him many of the statistical ideas (for example, setting up the problem as a hierarchical model and various details of the model itself, also the idea of visualizing the model as being equivalent to tomography, which I knew about from my PhD thesis in medical imaging that I’d done a few years earlier), and we built the actual model together. There were lots of tricky details to work out, including some issues involving parameterization of a truncated normal distribution that we used for the hierarchical model.

              • Daniel Gotthardt says:


                Thank you for this insight. That does sound really interesting! I’d really be glad if I had more time to get into it! Maybe I can sneak this stuff into my upcoming thesis about modeling and explaining international differences in the labor force participation of women with children under the age of 3 when discussing issues of ecological reasoning.

      • jrc says:

        I think the ecological fallacy is not quite right here. In many of these state-year models, people aren’t running a regression of aggregates on aggregates, they are running a regression of individual outcomes on a bunch of individual covariates and also some right-hand-side variable of interest that varies only at the state-year level. That’s not the standard setup for the ecological fallacy, I don’t think.

        To me, there are usually two questions I’m particularly interested in and that determine how close these papers come to just regressing aggregates on aggregates, and how close they come to capturing real, individual-level effects:

        First – where is the variation in the RHS variable of interest coming from? Is it from some policy, from a history of the geographical sorting of humanity, from a meteor hitting a city, from federal court intervention, etc..? This gives me some way to think about how the individuals affected by the change we are interested in were subjected to it (or selected into it). Sometimes the RHS may be highly correlated with the type of people that live in some state (say, variation induced by a state referendum) and sometimes it may not (say variation in the percentage of people drafted into Vietnam caused by poor population estimates from the federal selective service board).

        Second – how is that variation playing out? Is it assigned by region and year? What is the serial correlation over space/time in the RHS? In this case, there is terrible variation – you essentially have a single-difference over time that hits some people harder than others (based on male athletic participation). But other times that variation is much better – you might have some policy that starts in different states at different times, so you get a bunch of single-differences across time but ones that hit at different points in time, separately identifying the policy effect from the temporal effect. Or you might be looking at some time-series of air-pollution and be able to instrument separately for each state and year’s pollution levels using the wind and Chinese manufacturing output, at which point you maybe have variation that, once conditional on state and year (separately), looks pretty random within some state’s time series.

        Anyway – I agree that in any of these cases you are never identifying some pure, individual-level causal effect (as if such a thing existed in the first place that was common across people). At their worst, the state-year models might reduce to the ecological fallacy. But at their best they become a kind of cluster randomized control trial where treatment is clustered at the state-year cell.

        This is me trying to make the case. I’d appreciate push back.

  5. jrc says:

    Non-Statistical Point:

    “The Chastain Effect”? That is near blasphemous. Brandi Chastain has 2 World Cup winner’s medals and 2 Olympic Gold ones. That woman is a national treasure.

    Our #USMNT may not win the World Cup this summer, but our #USWNT are favorites every time – yes, that does have to do with Title IX, but the title of this paper is like calling the effects of the 19th Amendment the “Hillary Effect”. Hillary Clinton obviously benefited from (and stands to benefit more in the future from) the 19th Amendment, but she is the tail probability benefit and not at all representative of the benefits derived by most of society.

    #USA – “Momentary insanity, nothing more, nothing less…I thought, ‘This is the greatest moment of my life on the soccer field’.” – Brandi Chastain

  6. Steve Sailer says:

    “The highest rates are reported in North Dakota, Nebraska, Minnesota, Iowa, Kansas, Montana, Arkansas, South Dakota, Vermont, Idaho, Oregon, and Wyoming.”

    Back in 1971, the North Central and Northwest were less religious, more progressive and populist (think geneticist and VP Henry Wallace of Iowa). They traditionally were more feminist being WASPy, German, and Scandinavian, rather than, say, Italian or Jewish: e.g., Congresswoman Jeannette Rankin, who voted against entering the war in 1917. Frontier states had been fairly feminist to attract women for their lonely cowpokes.

  7. Steve Sailer says:

    These days you have to break states down by race or you just run into Moynihan’s Law of the Canadian Border:

  8. BMGM says:

    And let us not forget the study that came out in the 1990s (when I was in graduate school). I can’t remember if it was for women in science in general or for women in physics in particular.

    They were looking into the factors that correlate with women who earn PhDs. Not surprisingly, encouragement of a HS science teacher was most commonly experienced. Participation in competitive athletics was the second most common factor–ahead of parental encouragement. The cohort they studied was the first wave of Title IX female athletes. It would be interesting to follow up now that girls’ athletics is more mature.

    A quick survey around the Joint Institute for Laboratory Astrophysics showed that 100% of the female professors had been or were currently nationally-ranked athletes. Nearly all of the female graduate students and postdocs were HS and/or college athletes. We were much more athletic than our male peers.

    The sample size for women who earn PhDs in physics is small, but that was a fun finding to discuss at parties. ;-)

