Daljit Dhadwal writes:
On the Ask Metafilter site, someone asked the following:
How does statistical analysis differ when analyzing the entire population rather than a sample? I need to do some statistical analysis on legal cases. I happen to have the entire population rather than a sample. I’m basically interested in the relationship between case outcomes and certain features (e.g., time, the appearance of certain words or phrases in the opinion, the presence or absence of certain issues). Should I do anything different than I would if I were using a sample? For example, is a p-value meaningful in this kind of case?
This is a question that comes up a lot. For example, what if you're running a regression on the 50 states? These aren't a sample from some larger set of states; they're the whole population.
To get back to the question at hand, it might be that you’re thinking of these cases as a sample from a larger population that includes future cases as well. Or, to put it another way, maybe you’re interested in making predictions about future cases, in which case the relevant uncertainty comes from the year-to-year variation. That’s what we did when estimating the seats-votes curve: we set up a hierarchical model with year-to-year variation estimated from a separate analysis. (Original model is here, later version is here.)
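To make the year-to-year framing concrete, here's a minimal sketch (all numbers hypothetical, not from the seats-votes analysis): if the goal is predicting a future year, the relevant predictive uncertainty comes from variation between years, which you can estimate even when each year's data is a complete census of that year's cases.

```python
import numpy as np

# Hypothetical outcome rates (share of cases decided a given way)
# observed in each of 10 years -- complete data within each year.
yearly_rates = np.array([0.52, 0.48, 0.55, 0.60, 0.47,
                         0.51, 0.58, 0.44, 0.53, 0.50])

mean_rate = yearly_rates.mean()
between_year_sd = yearly_rates.std(ddof=1)  # year-to-year variation

# A crude predictive interval for next year's rate, treating years as
# exchangeable draws (normal approximation, purely illustrative):
lo = mean_rate - 2 * between_year_sd
hi = mean_rate + 2 * between_year_sd
print(f"predicted rate for a future year: {mean_rate:.2f} "
      f"in ({lo:.2f}, {hi:.2f})")
```

A full hierarchical model would do this more carefully (partial pooling, within-year uncertainty), but the point is the same: the interval is driven by between-year variance, not by pretending the cases were sampled from a larger pool.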
So, one way of framing the problem is to think of your “entire population” as a sample from a larger population, potentially including future cases. Another frame is to think of there being an underlying probability model. If you’re trying to understand the factors that predict case outcomes, then the implicit full model includes unobserved factors (related to the notorious “error term”) that contribute to the outcome. If you set up a model including a probability distribution for these unobserved factors, standard errors will emerge.
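The second frame can be sketched in a few lines of numpy (a toy illustration, with an assumed true coefficient and a made-up predictor, not real case data): even when the regression is run on every case, the standard errors are meaningful because they quantify the contribution of the unobserved factors in the probability model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "entire population" of cases: one binary predictor (say, an
# indicator for a phrase appearing in the opinion) and a continuous
# outcome driven partly by unobserved factors (the error term).
n = 500
x = rng.integers(0, 2, size=n).astype(float)
beta_true = np.array([0.3, 0.8])  # assumed, for illustration only
X = np.column_stack([np.ones(n), x])
y = X @ beta_true + rng.normal(scale=1.0, size=n)  # unobserved factors

# OLS fit; the standard errors come from the probability model for the
# error term, not from sampling cases out of some larger set.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

print("estimates:      ", beta_hat)
print("standard errors:", se)
```

Here the standard errors answer the question "how much could these coefficients differ under a rerun of the unobserved factors?", which is a sensible question even with the full population in hand.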