In my article about implausible p-values in psychology studies, I wrote:
“Women Are More Likely to Wear Red or Pink at Peak Fertility,” by Alec Beall and Jessica Tracy, is based on two samples: a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia. . . .
[There is a problem with] representativeness. What color clothing you wear has a lot to do with where you live and who you hang out with. Participants in an Internet survey and University of British Columbia students aren’t particularly representative of much more than … participants in an Internet survey and University of British Columbia students.
In response, I received this in an email from a prominent psychology researcher (not someone I know personally):
Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class. I really *hope* you [know] why that is an absurd criticism – especially of authors who never claimed that their study generalized to all humans. (And please spare me “but they said men and didn’t say THESE men” because you said there were problems in social psychology and didn’t mention that you had failed to randomly sample the field. Everyone who understands English understands their claims are about their data and that your claims are about the parts of psychology you happen to know about).
Just because a freshman might raise a question, that does not make the issue irrelevant! Freshmen can be pretty thoughtful sometimes. And I hope they remain skeptical of these studies even after they take their first psychology class. Like these freshmen, I am skeptical about generalizing to the general population based on 100 people from the internet and 24 undergraduates.
There is no doubt in my mind that the authors (and anyone else who found this study to be worth noting) are interested in some generalization to a larger population. Certainly not “all humans” (as claimed by my correspondent), but some large subset of women of childbearing age, some subset that includes college students in Canada and women of various ages who are on Mechanical Turk. The abstract to the paper simply refers to “women” with no qualifications.
Why should generalization be a problem? The issue is subtle. Let me elaborate on the representativeness issue using some (soft) mathematics.
Let B be the parameter of interest, in this case the difference in the probability of wearing red or pink shirts, comparing women in two different parts of their menstrual cycle, among the women who are wearing shirts and have regular menstrual periods.
The concern is that, to the extent that B is not very close to zero, it can vary by group. For example, perhaps B has a different sign for college students (who typically don’t want to get pregnant) than for married women who are trying to have kids. Perhaps B is much different for single women than for women with partners (similarly to what was argued in a different recent paper in Psychological Science).
I can picture three scenarios here:
1. Essentially no effect. Women’s clothing colors have very low correlations with the time of the month, and anything that you find in data will likely come from sampling variability or measurement artifacts.
2. Large and variable effects. Results will depend strongly on what population is studied. There is no reason to trust generalizations from an unrepresentative sample. The college freshmen are right.
3. Large and consistent effects. If the parameter B is large and pretty much the same sign everywhere, then a sample of college students or internet participants is just fine (measurement issues aside).
The point is that scenario 3 requires this additional assumption. Until you make that assumption, you can’t really generalize beyond people who are like the ones in the study.
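To make scenario 2 concrete, here is a back-of-envelope sketch in which B has opposite signs in two subpopulations. All the numbers (the two group effects and the population share of each group) are made-up assumptions, chosen only to show how a sample drawn from one group can get even the sign of the population-average effect wrong:

```python
# Hypothetical illustration of scenario 2: the effect B (the difference
# in probability of wearing red/pink across cycle phases) differs in
# sign across two subpopulations.  All numbers are assumptions.
B_students = 0.15       # assumed effect among college students
B_married = -0.10       # assumed effect among married women trying to conceive
share_students = 0.3    # assumed share of students in the wider population

# Population-average effect: weight each group's effect by its share.
B_overall = share_students * B_students + (1 - share_students) * B_married

print(f"estimate from a student-only sample: {B_students:+.3f}")
print(f"population-average effect:           {B_overall:+.3f}")
# The student-only sample gives +0.150, but the population average is
# -0.025: the wrong sign, not just the wrong magnitude.
```

Under these (assumed) numbers, a study of students alone would report a positive effect while the population-average effect is slightly negative, which is exactly why generalization requires the scenario-3 assumption.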
Now, in this particular case, I expect the authors gained confidence in their results because they appeared in two very different populations. They saw a large B in the group of internet participants and a large B in the college students, hence this is some evidence that B is large in general. This is a good idea in general—comparing two case studies is a good way to start looking at variation—but in this particular case I don’t trust it because the sample sizes are so small and the data analysis rules were somewhat flexible (see Eric Loken’s comment here).
Representativeness of samples is something that empirical economists have thought a lot about. In their use of phrases such as “local average treatment effect,” they recognize that treatment effects vary, and they are interested in looking at where an intervention works and where it doesn’t (as in this paper by Rajeev Dehejia).
Researchers in medicine and public health are also acutely aware of variation in treatment effects and the need to consider what population is being studied when an effect is being estimated. In medicine and public health (unlike in psychology), it tends to be expensive to add people to a study. Researchers want to maximize their power for a given cost, and so they often make an effort to restrict enrollment to the subset of people who they believe are most likely to respond to the treatment. (Hence, among other things, the fabled “decline effect” when a successful experimental treatment is applied to the general population.)
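The arithmetic behind that decline effect is the same weighting logic as before. In this sketch (all numbers are hypothetical), a trial enrolls only the subgroup believed most likely to respond, and the treatment is then rolled out to everyone:

```python
# Hypothetical "decline effect" arithmetic: the trial estimates the
# effect in a responsive subgroup; the population-wide effect is the
# share-weighted average over responders and everyone else.
effect_responders = 0.30    # assumed effect in the enrolled subgroup
effect_others = 0.05        # assumed effect in the rest of the population
share_responders = 0.25     # assumed population share of responders

trial_estimate = effect_responders
population_effect = (share_responders * effect_responders
                     + (1 - share_responders) * effect_others)

print(f"trial estimate:            {trial_estimate:.4f}")
print(f"population-wide effect:    {population_effect:.4f}")
# 0.3000 in the trial shrinks to 0.1125 in the general population.
```

Nothing about the trial was wrong; the “decline” is just the gap between the enrolled subgroup and the population the result gets applied to.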
I don’t know so much about social psychology but, given the sentiments expressed in my correspondent’s note above, I suspect that researchers in that field aren’t always so aware of the potential for treatment interactions; they seem to be implicitly operating under scenario 3 above, in which effects are universal, or at least where there is no reason to be concerned about extrapolating from 24 college students to “women” in general. The point of this post is to explain why such a generalization can be a mistake. This is a case where professional expertise can be a bad thing, a case where the intuition of a college freshman can be more valid than the experience of a much-published researcher.
I hope that the next time a freshman comes to my correspondent with a complaint about subjects in an experiment not being randomly sampled, he (my correspondent) will not merely dismiss the complaint but will instead discuss with the student its relevance under scenarios 1, 2, and 3.
P.S. I’m continuing to come back to this particular paper, not because I want to keep giving its authors a hard time, but because our discussion of it yielded a lot of thought-provoking comments that I think are worth exploring, and it helps for me to do this exploration in the context of particular cases.