
How to approach a social science research problem when you have data and a couple different ways you could proceed?

tl;dr: Someone asks me a question, I can’t really tell what he’s talking about, so I offer some generic advice.

Joe Hoover writes:

An issue has come up in my subsequent analyses, which use my MrsP estimates to explore the relationship between county-level moral values and the county-level distribution of hate groups, as defined by the SPLC.

Setting aside issues of spatial auto-correlation, control variables, measurement, and all other potential complications, I want to explore the US county-level association between a county mean outcome X and the county-level distribution of rare-event Y (N Y = 0 is about 2800, N Y > 0 is about 250).

My initial analytical plan included two analyses:

1. Model Y as some zero-inflated function of X. I tried this and observed a lot of noise (small effects estimated with low uncertainty).

2. Employ a case-control design that includes all hate group counties + a random sample of counties without hate groups. This design is based on a recent paper that investigated the county-level distribution of hate groups. When I tried this approach, estimation uncertainty decreased and the effects were in the hypothesized direction (how convenient!).

My issue now is that I have two very different sets of results that rely on two very different designs. It seems to me that they address two different questions, but I am not entirely sure what question the second analysis really addresses:

1. If we know X for a given county, does that tell us anything about the expected rate of hate groups in that county? Answer: no.

2. Among counties that…mostly have at least one hate group, does knowing X tell us anything about the expected rate of hate groups in that county? Answer: yes?

Part of my confusion about how to work with these results derives from the complexity of the DGP: there are probably many counties that would be nice places to start a hate group, but maybe…there are no self-motivated bigots there. Or, the bigots there are introverted and don’t like to be in groups, etc.

I guess I’m thinking of these factors as something analogous to epidemiological exposure. For example, perhaps county-level population density increases the risk of contracting a virus at the county level. But, if the virus is rare, estimating a model that includes every county won’t reveal this relationship because most counties were never exposed.

This kind of epidemiological reasoning makes sense to me, but it is outside of my areas of expertise. And, I am also aware that it is probably not a coincidence that the reasoning which justifies the ‘good’ results ‘makes sense’ to me.

Accordingly, I would like to place myself on firmer ground by better understanding the precedents for these different analytical approaches. Specifically, I would like to know if it ever makes sense to use a case-control approach if you have data for the entire world (i.e., in my case, case-control requires throwing out observations, which feels strange). Also, I would like to have a better idea of how to interpret these kinds of results.

My reply:

I’m getting confused on the details here so let me try to step back and answer in the abstract. He’s fitting two completely different models to the same data . . . hmmmm, not quite the same data, more like two takes on the same problem.

Thinking about fundamentals . . . I was taught that, when stuck, we should think about statistical problems as prediction problems, with causal inference corresponding to prediction under various potential outcomes. So that’s what I’d do here. Instead of saying that you want to “explore the relationship between county-level moral values and the county-level distribution of hate groups,” try to define a more precise question (WWJD), and then some of the answers will flow.
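One way to make the prediction framing concrete is to simulate a fake world and see what each design would predict. The sketch below is not Joe's analysis: it assumes a simple logistic model for "county has at least one group" given a standardized X, with made-up coefficients (`TRUE_A`, `TRUE_B`) and roughly 3,050 hypothetical counties. It fits that model once to all counties and then repeatedly to a case-control subsample (all cases plus an equal number of randomly drawn controls). Under these assumptions the slope should come out about the same in both designs, while the case-control intercept is shifted by roughly the log of the control-to-case ratio, so its predicted probabilities describe the subsample rather than the population of counties.

```python
import math
import random


def fit_logistic(xs, ys, iters=30):
    """Newton-Raphson fit of P(y = 1) = 1 / (1 + exp(-(a + b * x)))."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient, intercept
            g1 += (y - p) * x      # gradient, slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


rng = random.Random(2024)

# ASSUMPTION: made-up data-generating process, chosen only so that
# roughly 7-8% of about 3,050 hypothetical counties are "cases".
TRUE_A, TRUE_B = -2.6, 0.5
n_counties = 3050
x_all = [rng.gauss(0.0, 1.0) for _ in range(n_counties)]
y_all = [
    1 if rng.random() < 1.0 / (1.0 + math.exp(-(TRUE_A + TRUE_B * x))) else 0
    for x in x_all
]

cases = [i for i, y in enumerate(y_all) if y == 1]
controls = [i for i, y in enumerate(y_all) if y == 0]

# Design 1: use every county.
full_a, full_b = fit_logistic(x_all, y_all)

# Design 2: all cases plus an equal-sized random sample of controls,
# repeated with 20 different control samples.
cc_fits = []
for _ in range(20):
    idx = cases + rng.sample(controls, len(cases))
    cc_fits.append(fit_logistic([x_all[i] for i in idx],
                                [y_all[i] for i in idx]))

cc_a = sum(a for a, _ in cc_fits) / len(cc_fits)
cc_b = sum(b for _, b in cc_fits) / len(cc_fits)

# With 1:1 sampling, the intercept shift is log(#controls / #cases).
expected_shift = math.log(len(controls) / len(cases))

print("cases:", len(cases), "controls:", len(controls))
print("full data      a = %.2f, b = %.2f" % (full_a, full_b))
print("case-control   a = %.2f, b = %.2f (mean of 20 draws)" % (cc_a, cc_b))
print("intercept shift %.2f vs expected %.2f" % (cc_a - full_a, expected_shift))
```

In this fake world the two designs answer the same slope question, and throwing away controls buys nothing except a noisier estimate. If real results differ sharply between the two analyses, that points to some other difference between them, for example a count outcome in one and a binary outcome in the other, rather than to the subsampling itself.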


  1. Bill says:

    It strikes me (not a statistician) that reliable, non-biased data is essential to any analysis. SPLC is not a source of non-biased data.

    • Well, if you take the SPLC’s data as fact about the world you may have issues, but if you take the SPLC’s data as fact about what the SPLC thinks, it’s unbiased information about that question ;-) so it all depends on the purpose of your analysis.

  2. Martha (Smith) says:

    Just in case anyone else is having trouble with the acronyms: Am I correct in assuming that
    SPLC = Southern Poverty Law Center?
    DGP = Data Generating Process?

  3. mpledger says:

    For model 1 – is it “(small effects estimated with high uncertainty)”?
    For model 1 I am assuming you are modelling it as a log-linear model, e.g. number of hate groups is a function of moral values. I think you should look at fitting the independent variable “moral values” with smoothing in a GAM so you can see if there are any non-linearities. And I think the number of hate groups is going to increase with the population size of the county, so try including the logged population size as an offset. A highly moral county may have lots of hate groups just because there are lots of people.

    I think the real problem is that the membership size of a hate group is a really important factor. A well-organised hate group might have merged with more disorganised hate groups to become one very large influential group – a badly organised hate group might have splintered into many new groups each too busy with themselves and too small to be influential – so that the number of hate groups isn’t going to have a good correlation with “morals” in a county, whereas the number of people participating in a hate group does.

    For model 2: I don’t like the idea of dropping counties out of the analysis – even randomly. Can you do a version of bootstrapping where you do the analysis multiple times with different random samples of hate groups?

    I don’t think a “rare virus” interpretation is a good idea because pretty near everyone has radio, tv and the internet. People are pretty exposed.
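mpledger’s point about population can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the real data: the counts are generated from a Poisson model in which X has no per-capita effect at all (`TRUE_B = 0`), but log population is made to correlate with X. All numbers and names here are hypothetical. A log-linear fit that ignores population should then show a clearly positive slope on X, while the same fit with logged population as an offset should put the slope back near zero.

```python
import math
import random


def draw_poisson(lam, rng):
    """Knuth's multiplication method; fine for the small means used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def fit_poisson(xs, ys, offsets, iters=50):
    """Newton-Raphson fit of log E[y] = offset + a + b * x; returns (a, b)."""
    # Start the intercept at the overall log rate so early steps stay stable.
    a = math.log(sum(ys) / sum(math.exp(o) for o in offsets))
    b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, o in zip(xs, ys, offsets):
            mu = math.exp(o + a + b * x)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += mu
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


rng = random.Random(7)

# ASSUMPTION: hypothetical counties in which X is correlated with log
# population but has no effect on the per-capita rate of groups.
TRUE_A, TRUE_B = -2.5, 0.0
n_counties = 3000
x = [rng.gauss(0.0, 1.0) for _ in range(n_counties)]
log_pop = [0.6 * xi + rng.gauss(0.0, 0.8) for xi in x]
y = [draw_poisson(math.exp(lp + TRUE_A + TRUE_B * xi), rng)
     for xi, lp in zip(x, log_pop)]

# Log-linear model for the raw count, ignoring population.
a_no_offset, b_no_offset = fit_poisson(x, y, [0.0] * n_counties)

# Same model with logged population as an offset (a model for the rate).
a_offset, b_offset = fit_poisson(x, y, log_pop)

print("slope on X without offset: %.2f" % b_no_offset)
print("slope on X with log-population offset: %.2f" % b_offset)
```

The smoothing part of the suggestion (a GAM in place of a single slope) is not shown here; the offset alone is enough to see why raw counts and per-capita rates can tell different stories about X.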
