## How to interpret inferential statistics when your data aren’t a random sample

I’m having a bit of a ‘crisis’ of confidence regarding inferential statistics. I’ve been reading some of the work by Stephen Gorard (e.g. “Against Inferential Statistics”) and David Freedman and Richard Berk (e.g. “Statistical Assumptions as Empirical Commitments”). These authors appear to be saying this:

(1) Inferential statistics assume random sampling

(2) (Virtually) all experimental research (in psychology, for example) uses convenience sampling, not random sampling

(3) Therefore (virtually) all experimental research should have nothing to do with inferential statistics

If a researcher gets a convenience sample (say 100 college students), randomly assigns them to two groups and then uses multiple regression (let’s say) to analyse the results, is that researcher ‘wrong’ to report/use/rely on the p-values that result? [Perhaps the researcher could just use the parameter estimates – based on the convenience sample – and ignore the p values and confidence intervals…?]
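The setup in this question can be sketched as a small simulation (all numbers invented). The point of the sketch is that a reference distribution for the p-value can come from re-doing the random assignment itself, with no random-sampling assumption at all:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical convenience sample: 100 students, randomly assigned 50/50.
n = 100
treat = np.zeros(n, dtype=bool)
treat[rng.choice(n, 50, replace=False)] = True
# Simulated outcome with a made-up true treatment effect of 0.4.
y = rng.normal(0, 1, n) + 0.4 * treat

# Observed difference in means between the two randomized groups.
obs = y[treat].mean() - y[~treat].mean()

# Randomization (permutation) p-value: re-shuffle the assignment many times.
# The reference distribution comes from the random assignment itself,
# not from any assumption of random sampling from a population.
perms = 10_000
null = np.empty(perms)
for i in range(perms):
    t = rng.permutation(treat)
    null[i] = y[t].mean() - y[~t].mean()
p = (np.abs(null) >= abs(obs)).mean()
print(f"diff = {obs:.3f}, randomization p = {p:.4f}")
```

This doesn’t settle what the p-value refers to beyond the 100 people in hand, which is the question the post and comments below take up.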

Are inferential statistics just (totally) inappropriate for convenience samples?

I would love to hear your views. Perhaps you’ve written a blog post on the matter?

Yes, this has come up many times! Here are some posts:

How does statistical analysis differ when analyzing the entire population rather than a sample?

Analyzing the entire population rather than a sample

Yes, worry about generalizing from data to population. But multilevel modeling is the solution, not the problem

How do you interpret standard errors from a regression fit to the entire population?

What do statistical p-values mean when the sample = the population?

The longer answer is that random sampling is just a model. It doesn’t apply to opinion polls either. But it can be a useful starting point.

I’ve been worried that everything I’ve learnt (or taught myself) about inferential statistics has been a waste of time given that I’ve only ever used convenience samples and (in my area – psychology) don’t see anybody using anything other than convenience samples. When I read the Stephen Gorard papers and saw that at least one eminent expert in stats – Gene Glass – agreed, I had my ‘crisis of confidence.’

If you can restore my confidence, I would be very grateful. I really like using multiple regression to analyse my data and would hate to think that it’s all been a charade!

My reply: I don’t want to be too confident. I recommend you read the section in Regression and Other Stories where we talk about the assumptions of linear regression.

1. Roman says:

I do not agree with the points proposed by the questioner, especially points 2 and 3: “all experimental research (in psychology, for example) uses convenience sampling, not random sampling”. In this context the stochastic process to be modeled is linked to the random assignment of treatment/causal levels, not to how the people in the experiment were chosen.

• Andrew says:

Roman:

I agree that it’s not true that all experimental research (in psychology, for example) uses convenience sampling, not random sampling. For example, lots of work is done using survey experiments. I think it would be accurate to say that most experimental research uses convenience sampling. And then the issue of generalizing to the population does arise, for reasons discussed here; see also this recent article with Lauren Kennedy.

• Charles Driver says:

Even a perfect ‘random sample of the population’ is only a perfect sample *at that moment in time* and, with respect to humans in general, is just a somewhat better convenience sample…

2. Sean Mackinnon says:

I have been wrestling with this thought re: non-random sampling a lot lately, and really with the idea of inferential statistics in psychology as a whole. It really feels like there needs to be a paradigm shift in methods and analysis given the real-world constraints of who you can actually get to do psychological studies of various sorts.

I think back to the origins of statistics (e.g., Galton, Pearson) and how the point was fundamentally to show that there are different populations of people who are superior to other people in various ways. So all the methods of inferential statistics kind of came together around that general point. Populations were defined, and the goal was to approximate their parameters from samples.

I’ve been starting to wonder these days whether the questions that I really want the answer to are actually well-captured by conventional inferential statistics. That is, I’m not sure that I really am looking for the mean difference in 2+ populations. Or the theoretical slope of a linear regression line if I could measure the whole population. In psychology, it seems to me like the within-person processes are the most important — being able to reasonably understand, describe and predict what a given person will do and feel seems like the core. But a lot of the typical textbook stuff doesn’t really get at that, because inferential stats are all about population parameters. I think I kind of really want individual person parameters — which in theory, linear mixed models kind of get at but ultimately it always seems to boil down to the summary fixed effects in papers.
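The “individual person parameters” point can be illustrated with a toy simulation (all numbers invented): fit a separate slope per person and compare the spread of individual slopes to the single averaged “fixed effect” that would appear in a paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy within-person data (made-up numbers): each person has their own
# slope linking, say, daily stress to mood; the average slope can hide
# how different the individual slopes are.
people, days = 30, 50
true_slopes = rng.normal(0.5, 0.6, people)   # some positive, some negative
x = rng.normal(0, 1, (people, days))
y = true_slopes[:, None] * x + rng.normal(0, 1, (people, days))

# Per-person OLS slope: cov(x_i, y_i) / var(x_i).
per_person = np.array([
    np.cov(x[i], y[i])[0, 1] / np.var(x[i], ddof=1) for i in range(people)
])

print(f"average ('fixed effect') slope: {per_person.mean():.2f}")
print(f"range of individual slopes:     {per_person.min():.2f} "
      f"to {per_person.max():.2f}")
```

A multilevel model would shrink these per-person slopes toward the average; the point here is only that the average alone can describe almost no individual well.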

(This random musing brought to you by having a lot of time on my hands during sabbatical)

• Peter Dorman says:

This is right up the alley of my earlier (other thread) comment about small and large N studies. Psych is not the only terrain in which “within” or at least individual-level processes are really the object. This is also true in a lot of microeconomic contexts: health care choices, education choices, etc. Take the question I’ve spent some time on: do workers receive compensating wage differentials for the risks they face on the job? One can and should ask this question about individual workers and jobs. Does individual X actually compare risks across available jobs and choose one of them on the basis of satisfactory wage compensation? And does employer Y actually observe the wage/risk tradeoffs of its recruitment pool and select production methods whose risks are best for their net? The problem is that you can’t do these studies for everyone, so the issue of representativeness comes up; hence the necessity for large N statistical work. My point here is that, while population inference is necessary, it’s not sufficient, in part because, as you say, the processes we are trying to identify operate, or don’t, at much lower levels. Maybe this is a rule of empirical research: at least some time and effort should be spent directly observing instances of the thing you want to learn about (rather than the tracks they leave). OK, that may be difficult for psychology….

Back to convenience sampling. I have no formal training in the natural sciences, but I’ve had to be an interloper in the course of writing a book on climate change, which will finally (after years of delay) be coming out in 2022. One of many issues is methane: how much of it is stored in mostly stable deposits especially in marine environments and how susceptible it is to mobilization (release) due to temperature changes. I discovered this is hotly debated among specialists, and it is important for the “tipping points” aspect of the overall climate problem. After reading literature on both sides (and chatting with one of the participants), I became convinced this boils down in large part to a sampling issue. Talk about convenience samples! It’s *really* hard to identify and measure a methane hydrate formation. It’s a major enterprise. We know they’re out there because we’ve found a bunch. The question is whether they exist in such numbers and sizes that they constitute a potentially large greenhouse overhang, so to speak. This is about the past as well as the future, since the debate extends to the role of methane releases in past thermal events, like the Paleocene-Eocene Thermal Maximum. The “anti-methane” folks put a lot of credence in theoretical models that predict that the convenience sample is highly unrepresentative. The “pro-methane” folks think these theoretical priors are weak and more credence should be given to what we’ve been able to see and measure. My reading is that the “empiricists” (pro-methane) are the clear majority, but as an outsider I’m unqualified to pass a judgment, so in the book I briefly explain the debate and treat it as unresolved.

Maybe there are readers who understand this topic better than I do and can chime in.

My guess is that there are lots of topics in the natural sciences where it is difficult and expensive to conduct an observation, and convenience sampling rules.

• John Williams says:

The notorious sampling unit for people who study benthic invertebrates in streams is “the riffle nearest the road.”

• Anonymous says:

“Maybe there are readers who understand this topic better than I do and can chime in.”

I can chime in. Whether I understand it better or not…

There’s a tendency in “planet disaster research” to latch on to some ancient geological event and point to it as the quintessential proof that x or y disaster can happen. Tragically, there is a very strong selection bias in the available evidence for all ancient events, and the older the event the stronger the bias. There is ample controversy even regarding the causes of the Pleistocene megafauna extinction (woolly mammoth, sabre-tooth tigers etc), which occurred only in the last ~100K-10K years.

It would be interesting to try to get a handle on how evidential selection bias drives hypothesizing about such events. One way to do that would be to study the proposed explanations of an event prior to the discovery of significant definitive information about that event. Probably the best studied ancient event is the Cretaceous-Tertiary (K-T) extinction, which most – but not all – researchers now agree was caused by an asteroid impact. Prior to Alvarez’ discovery of the iridium spike at the K-T boundary, did anyone intuit that an asteroid impact was responsible for the event?

Also for a more general way of thinking about how humans explain things with biased information, I’m sure someone could concoct some problem that can be solved with certainty with, say, ten bits of information. But if we give people only 2, 3, or 5 bits of that information, what kinds of explanations do they generate? What’s the min number of bits of info that people can use to solve the problem? Something like this must have been done already. In turn that would play back into the work of people like Phil Tetlock – how much info do people need to solve a problem or make an accurate forecast?

• Peter Dorman says:

The PETM is a relatively recent discovery, since it happened for such a short period, given how long ago it was. (Maybe 200,000 yr duration, 55 million y.a.) In line with your speculation, people already had a prior for it, a big infusion of atmospheric CO2 from extensive vulcanism — since that was the explanation for the extinction of the dinosaurs before evidence for an asteroid impact was gathered. (Cue the Rite of Spring section of “Fantasia”.) Well, it turned out this was a pretty good starting point for investigating the PETM, since there really was a massive plate event at that time, the separation of N America from Europe in the N. Atlantic. The methane question comes down to whether the CO2 anomaly could have been solely due to this primary cause, or whether the vulcanism-induced warming triggered secondary feedback mechanisms, like methane releases, that added to it. This is difficult to tease out from the proxy evidence.

It should be obvious that this is a critical question for today. In my book I treat it as a risk of unknown magnitude. Not being able to rule it out matters a lot, even though we can’t rule it in either.

• Sean Mackinnon says:

That is really interesting stuff. Way outside my area of knowledge content-wise, but fascinating to think about what convenience sampling looks like in the natural sciences.

3. paper says:

Interesting discussion that happened around this issue in epidemiology (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3888189/)
Probably shouldn’t have your confidence shattered over this; there are enough problems in other areas that should make you question your sense of reality, sanity, life choices, etc… :)

4. Christian Hennig says:

Models are generally idealisations and their job is not to be true. In many places I see the claim that model assumptions need to be fulfilled in order for statistical inference to be valid, but this is wrong, wrong, wrong. If it were right, we could never do any statistical inference. In principle it is fine (and in fact what is always done, even with random sampling carried out as properly as possible, since no model is ever true) to use a model that differs from what goes on in reality.

However, every difference between model and reality should be a reason to question the model. The important question is, what, and how much of a difference does it make to our analyses? If convenience samples are used instead of random samples, we should critically ask what systematic differences may occur because of this and what their impact could be. There may or may not be available data/evidence on this. In case there isn’t, we should argue why we believe (if we believe this) that it won’t be a problem, which of course then is open to criticism of others.

5. Matawan says:

…hard to believe that in the year 2021 a basic statistical issue like this has not been definitively resolved.
Yet I see lots of uncertainty and only very nuanced ‘answers’ to the fundamental question.
(Luckily this issue rarely comes up in the real world of statistical practice?)

6. Sander Greenland says:

All this has been debated to death for decades in epidemiologic articles. Above, “Paper” notes one example debate involving Rothman et al.

Here’s an oversimplified summary of some main points that I had not seen mentioned above, based on a critical distinction for comparative studies examining causal effects of treatments (a distinction which was explicated at least by the 1960s):
1) What if anything do the statistics say about what happened to those in this experiment? (“internal validity”).
2) What if anything do the statistics say about what will happen to those not in this experiment? (“generalizability”, “external validity”, “transportability”).
Regarding those questions:
a) Successful randomization within the experiment ensures one can provide some statistically valid statement about effects in the experiment and thus answer (1), since it makes potential outcomes unconditionally exchangeable across the randomized groups (for observers who understand exactly what randomization does and does not do).
b) Successful randomization after successful random sampling from some source population ensures one can provide some statistically valid statement about effects on members of that source population and thus answer (2), since it ensures exchangeability of potential outcomes between those in and out of the experiment.
When there is no randomization or no random sampling, the usual quantitative fixes involve some sort of stratification or other adjustment to get closer to the desired exchangeability.
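The distinction between (1) and (2) can be seen in a toy simulation (all numbers invented): treatment effects vary with age, and the convenience sample is all college-aged, so a randomized experiment inside the sample recovers the sample’s own average effect (internal validity) but not the population’s (external validity).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population with heterogeneous treatment effects:
# the effect is larger for younger people (made-up numbers).
N = 100_000
age = rng.uniform(18, 80, N)
effect = 1.0 - 0.01 * (age - 18)          # individual causal effect
pop_ate = effect.mean()

# Convenience sample: college-aged people (18-22) only.
sample = np.where(age <= 22)[0]
sample_ate = effect[sample].mean()

# Randomized experiment inside the convenience sample: unbiased for the
# sample's own average effect, but not for the population's.
treat = rng.random(sample.size) < 0.5
y0 = rng.normal(0, 1, sample.size)
y = y0 + effect[sample] * treat
est = y[treat].mean() - y[~treat].mean()

print(f"population ATE  = {pop_ate:.2f}")
print(f"sample ATE      = {sample_ate:.2f}")
print(f"experiment est. = {est:.2f}   # tracks the sample, not the population")
```

Stratification or adjustment, as described above, would amount to reweighting the sample’s age-specific effects back toward the population’s age distribution.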

Some references (if you can’t download them I’ll be happy to send you any directly):
Greenland S, Robins JM. Identifiability, exchangeability and epidemiological confounding. Int J Epidemiol 1986;15:413–419.
Greenland S. Randomization, statistics, and causal inference. Epidemiology 1990;1:421–429.
Greenland S. On the logical justification of conditional tests for two-by-two contingency tables. Am Statist 1991;45:248–251.
– the last one focuses on explaining a paradox (only apparent, not real) of how one could have (as Fisher did) a randomization model (a) for testing a causal effect that led to marginal conditioning, yet others could have a random sampling model (b) with only a goal of source-population description that led to no conditioning. Writers like Rothman who seem to deny the importance of (b) are focused exclusively on internal validity, to the consternation of others who have inference beyond the observed sample in mind.
Since then there has been a huge literature on graphical and structural models which extend these points in full generality, albeit some cautions are needed in that far too much of that literature confuses randomization with no confounding (which vexes some philosophers no end); see
Greenland S, Mansournia MA. Limitations of individual causal models, causal graphs, and ignorability assumptions, as illustrated by random confounding and design unfaithfulness. Eur J Epidemiol 2015;30:1101–1110.

• Andrew says:

Sander:

Yes, all of this is well known to many and not known to many others. I remember many years ago taking a class where the instructor explained to us that if you have a randomized experiment, you can get valid causal inference even if the people in the study aren’t a random sample from any population. It was only years later that I realized he was wrong, in the sense that in real life we almost always care about the effect in some population of future cases; we don’t really care about the effect on whoever happens to be in the experiment.

7. psyoskeptic says:

As I’ve said to many students, in addition to a statement explaining the selected N, every paper should have a comment on sample representativeness.

This can be pretty broad but I usually wanted them to focus on whether they believe the sample is representative of some population, and why that is so. For example, in a perception study one might just run 4 friends in order to attempt to identify liminal detection of some stimulus. If they all come out pretty similar and the stimulus has no features likely to bias detection (such as racialized face detection vs. simple light detection), then you might argue it should generalize to the population. Or, there could be properties of the sample itself that suggest it is not representative even of the subpopulation whence it is drawn.

As long as this is eventually reviewed by people who realize that including such information is a good thing regardless of how it comes out, all is well. Unfortunately, far too many reviewers are perplexed when you actually say something like, “we ran as many participants as the budget would allow because we wanted the best parameter estimates possible,” or, “we believe that this sample is likely much less skewed than the population based on prior findings (here, here, here).” It’s like you’re not supposed to honestly report what you found and how you interpret it.

8. Martha (Smith) says:

Two women in their late seventies crossed paths in the alley between their houses. One was a mathematician, the other a biologist. So of course, the conversation quickly turned to bone density testing. The biologist said something to the effect that test results assumed a normal distribution; both women doubted that this was the case. So later, the mathematician tried looking this up on the web, and came across https://courses.washington.edu/bonephys/opbmdtz.html. Aha! The machines for testing bone density vary — “The early bone density machines in the 1970’s and early 1980’s all used different kinds of units, so results were reported in Z-scores to allow comparisons to normal people. Later bone density was measured in large populations and the Z-scores were compared to the general population and not just to healthy people.” (Hmm — perhaps a confusion here between “normal people” and a “normal distribution”?) “But when the bone density machines became commercial, the different companies would not agree on a standard measurement…. If the companies would have used the same standards, then we could always just look at the plain bone density in g/cm2, just like we look at cholesterol in mg/dl or weight in kg. Unfortunately, that did not happen. Instead, the T-score was invented.” (Well, maybe an existing concept of “T-score” was borrowed/adapted?)
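For what it’s worth, the two scores the linked page describes are both just standardizations of the same raw measurement against different reference groups. A sketch, with made-up reference values (these are not clinical constants):

```python
# Sketch of the two bone-density scores (reference values hypothetical):
# both standardize the same raw BMD in g/cm^2.
def t_score(bmd, young_adult_mean=1.00, young_adult_sd=0.11):
    """Compare to healthy young adults (the 'peak bone mass' reference)."""
    return (bmd - young_adult_mean) / young_adult_sd

def z_score(bmd, age_matched_mean, age_matched_sd):
    """Compare to the general population of the same age and sex."""
    return (bmd - age_matched_mean) / age_matched_sd

# Example: a measured BMD of 0.85 g/cm^2 for someone whose age group
# averages 0.90 with SD 0.10 (all numbers made up).
print(t_score(0.85))             # well below the young-adult reference
print(z_score(0.85, 0.90, 0.10)) # closer to typical for the age group
```

So the same bone can look alarming on one scale and unremarkable on the other, depending on which reference population the machine’s software standardizes against.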

• Sean Mackinnon says:

Oh wow, that is really interesting. I’ve been reading up a lot on the history of the normal distribution lately, and it seems to me that many variables are “normalized” to facilitate ranking of people rather than the population distributions themselves actually being normal.

Somewhere along the line, people started assuming that features of humans are also normally distributed (not just the errors). I guess Galton got real excited about height (which, coincidentally, actually does approximate a normal curve) even though lots of other human qualities aren’t.

It’s crazy that bone density is measured with T scores when it has a real world quantifiable measurement unit!

9. Michael Nelson says:

I’m reminded of a comment by Fisher in one of his early papers on estimation theory, to the effect that the sample at hand need not be a random sample for inferential analyses on the data to be valid–every sample, he insisted, is a random sample of *some* population! Perhaps he would’ve said that it’s the statistician’s job to generate probability statements and the practitioner’s job to find or create populations about which to infer. Much easier done when we’re talking about plots of land of varying slopes or vats of yeast.

• psych defector says:

+1

• OliP says:

One of the modern challenges of this is that, in terms of hype and impact, a practitioner might ‘accidentally’ forget to define the population that could plausibly have been sampled randomly, and imply in their paper title/abstract/etc that it applies to the broadest population possible.

I suppose the other thing here is the link to bias modelling. If you can define a plausible model for how the sample arose, then you can make some defensible adjustment for broader inference.

10. Winston Lin says:

Adam asks, “If a researcher gets a convenience sample (say 100 college students), randomly assigns them to two groups and then uses multiple regression (let’s say) to analyse the results, is that researcher ‘wrong’ to report/use/rely on the p-values that result?”

Cyrus Samii wrote a nice 2014 blog post on this question, and in the comments, I recommended a helpful paper by Reichardt and Gollob, although it’s about simple difference-in-means and t-test analyses, not multiple regression.
https://cyrussamii.com/?p=1622

As Cyrus mentions, there’s a “happy coincidence” (first discovered by Neyman) that justifies frequentist inference in completely randomized experiments with convenience samples. Cyrus focuses on “robust” standard errors in his post, but the same happy coincidence extends to conventional OLS standard errors if the two groups are equal-sized.

I don’t mean to suggest that such an analysis would answer all the questions we care about, but it can be a useful starting point (to borrow Andy’s phrase above) and you don’t have to assume random sampling to justify this starting point.
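The “happy coincidence” can be checked by simulation (made-up potential outcomes, and no population sampling at all): hold 100 units fixed, re-randomize repeatedly into equal-sized groups, and compare the true spread of the difference-in-means estimator to the conventional equal-variance standard error it reports.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed potential outcomes for 100 units -- no sampling from any
# population (made-up numbers, with a heterogeneous treatment effect).
n = 100
y0 = rng.normal(0, 1, n)
y1 = y0 + rng.normal(0.5, 0.3, n)

# Re-randomize many times; record the estimate and its conventional SE.
reps = 10_000
ests, ses = np.empty(reps), np.empty(reps)
for i in range(reps):
    t = np.zeros(n, dtype=bool)
    t[rng.choice(n, n // 2, replace=False)] = True
    y = np.where(t, y1, y0)
    ests[i] = y[t].mean() - y[~t].mean()
    # Conventional (pooled equal-variance) SE for a difference in means:
    sp2 = (y[t].var(ddof=1) + y[~t].var(ddof=1)) / 2
    ses[i] = np.sqrt(sp2 * (2 / (n // 2)))

# The average reported SE should closely track (Neyman: if anything,
# slightly exceed) the true SD of the estimator over re-randomizations.
print(f"true randomization SD: {ests.std():.3f}")
print(f"mean conventional SE:  {ses.mean():.3f}")
```

In this sketch the randomness comes entirely from the assignment mechanism, which is the sense in which the starting point needs no random-sampling assumption.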
