Does it matter that a sample is unrepresentative? It depends on the size of the treatment interactions

In my article about implausible p-values in psychology studies, I wrote:

“Women Are More Likely to Wear Red or Pink at Peak Fertility,” by Alec Beall and Jessica Tracy, is based on two samples: a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia. . . .

[There is a problem with] representativeness. What color clothing you wear has a lot to do with where you live and who you hang out with. Participants in an Internet survey and University of British Columbia students aren’t particularly representative of much more than … participants in an Internet survey and University of British Columbia students.

In response, I received this in an email from a prominent psychology researcher (not someone I know personally):

Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class. I really *hope* you know why that is an absurd criticism – especially of authors who never claimed that their study generalized to all humans. (And please spare me “but they said men and didn’t say THESE men” because you said there were problems in social psychology and didn’t mention that you had failed to randomly sample the field. Everyone who understands English understands their claims are about their data and that your claims are about the parts of psychology you happen to know about).

Just because a freshman might raise a question, that does not make the issue irrelevant! Freshmen can be pretty thoughtful sometimes. And I hope they remain skeptical of these studies even after they take their first psychology class. Like these freshmen, I am skeptical about generalizing to the general population based on 100 people from the internet and 24 undergraduates.

There is no doubt in my mind that the authors (and anyone else who found this study to be worth noting) are interested in some generalization to a larger population. Certainly not “all humans” (as claimed by my correspondent), but some large subset of women of childbearing age, some subset that includes college students in Canada and women of various ages who are on Mechanical Turk. The abstract to the paper simply refers to “women” with no qualifications.

Why should generalization be a problem? The issue is subtle. Let me elaborate on the representativeness issue using some (soft) mathematics.

Let B be the parameter of interest, in this case the difference in the probability of wearing red or pink shirts, comparing women in two different parts of their menstrual cycle, among the women who are wearing shirts and have regular menstrual periods.

The concern is that, to the extent that B is not very close to zero, it can vary by group. For example, perhaps B has a different sign for college students (who typically don’t want to get pregnant) than for married women who are trying to have kids. Perhaps B is much different for single women than for women with partners (similarly to what was argued in a different recent paper in Psychological Science).

I can picture three scenarios here:

1. Essentially no effect. Women’s clothing colors have very low correlations with the time of the month, and anything that you find in data will likely come from sampling variability or measurement artifacts.

2. Large and variable effects. Results will depend strongly on what population is studied. There is no reason to trust generalizations from an unrepresentative sample. The college freshmen are right.

3. Large and consistent effects. If the parameter B is large and pretty much the same sign everywhere, then a sample of college students or internet participants is just fine (measurement issues aside).

The point is that scenario 3 requires this additional assumption. Until you make that assumption, you can’t really generalize beyond people who are like the ones in the study.
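To make the three scenarios concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the subgroup labels, the population and sample shares, and the subgroup values of B are hypothetical and are not estimates from the paper. The only idea it encodes is that the average effect in any group of people is a weighted average of subgroup effects, so a sample whose mix differs from the population's gives a different answer only when B varies across subgroups.

```python
# Hypothetical illustration: all shares and effects below are invented.
# B_g is the difference in the probability of wearing red or pink for subgroup g.

def average_effect(effects, shares):
    """Weighted average of subgroup effects; shares must sum to 1."""
    assert abs(sum(shares.values()) - 1.0) < 1e-9
    return sum(effects[g] * shares[g] for g in effects)

# Assumed mix of subgroups in the target population and in a convenience sample.
population_shares = {"students": 0.1, "single": 0.3, "partnered": 0.6}
sample_shares = {"students": 0.7, "single": 0.2, "partnered": 0.1}

# Assumed subgroup effects under each of the three scenarios.
scenarios = {
    "1: essentially no effect": {"students": 0.0, "single": 0.0, "partnered": 0.0},
    "2: large and variable": {"students": 0.20, "single": 0.05, "partnered": -0.10},
    "3: large and consistent": {"students": 0.15, "single": 0.15, "partnered": 0.15},
}

def generalization_gap(effects):
    """Average effect in the sample minus average effect in the population."""
    return (average_effect(effects, sample_shares)
            - average_effect(effects, population_shares))

gaps = {name: generalization_gap(effects) for name, effects in scenarios.items()}
for name, gap in gaps.items():
    print(f"scenario {name}: sample minus population effect = {gap:+.3f}")
```

With these made-up numbers, the gap is zero under scenarios 1 and 3. Under scenario 2 the sample average is positive while the population average is negative, so the convenience sample gets even the sign wrong.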

Now, in this particular case, I expect the authors gained confidence in their results because they appeared in two very different populations. They saw a large B in the group of internet participants and a large B in the college students, hence this is some evidence that B is large in general. This is a good idea in general—two case studies is a good way to get started in looking at variation—but in this particular case I don’t trust it because the sample sizes are so small and the data analysis rules were somewhat flexible (see Eric Loken’s comment here).
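To give a rough sense of what "so small" means, here is a back-of-the-envelope sketch of the standard error of a difference between two proportions. The baseline rate of 0.25 and the even split of 24 people into two groups of 12 are assumptions chosen only for illustration; they are not taken from the paper.

```python
import math

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical setup: 24 people split evenly, with an assumed baseline rate of 0.25.
se_small = se_diff_proportions(0.25, 12, 0.25, 12)

# The same calculation with a sample 100 times larger.
se_large = se_diff_proportions(0.25, 1200, 0.25, 1200)

print(f"standard error with 12 per group:   {se_small:.3f}")
print(f"standard error with 1200 per group: {se_large:.3f}")
```

Under these assumptions the standard error with 12 per group is about 0.18, so only a very large estimated difference could reach statistical significance, and any estimate that does will tend to overstate the true B. Multiplying the sample size by 100 shrinks the standard error by a factor of 10.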

Representativeness of samples is something that empirical economists have thought a lot about. In their use of phrases such as “local average treatment effect,” they recognize that treatment effects vary, and they are interested in looking at where an intervention works and where it doesn’t (as in this paper by Rajeev Dehejia).

Researchers in medicine and public health are also acutely aware of variation in treatment effects and the need to consider what population is being studied when an effect is being estimated. In medicine and public health (unlike in psychology), it tends to be expensive to add people to a study. Researchers want to maximize their power for a given cost, and so they often make an effort to restrict enrollment to the subset of people who they believe are most likely to respond to the treatment. (Hence, among other things, the fabled “decline effect” when a successful experimental treatment is applied to the general population.)

I don’t know so much about social psychology but, given the sentiments expressed in my correspondent’s note above, I suspect that researchers in that field aren’t always so aware of the potential for treatment interactions; they seem to be implicitly operating under scenario 3 above, in which effects are universal, or at least where there is no reason to be concerned about extrapolating from 24 college students to “women” in general. The point of this post is to explain why such a generalization can be a mistake. This is a case where professional expertise can be a bad thing, a case where the intuition of a college freshman can be more valid than the experience of an experienced and much-published researcher.

I hope that the next time a freshman comes to my correspondent with a complaint about subjects in an experiment not being randomly sampled, he (my correspondent) will not merely dismiss the complaint but will instead discuss with the student its relevance under scenarios 1, 2, and 3.

P.S. I’m continuing to come back to this particular paper, not because I want to keep giving its authors a hard time, but because our discussion of it yielded a lot of thought-provoking comments that I think are worth exploring, and it helps for me to do this exploration in the context of particular cases.

41 thoughts on “Does it matter that a sample is unrepresentative? It depends on the size of the treatment interactions”

  1. As I read the email, it was referring to 2 more than 3, that it asserted the value in this research was of its own and that only more work would find out if 3 is appropriate.

    I’m of mixed minds about this kind of research. Is it better to have this kind of thing out there or is it noise? And what noise gets picked up later in wrong ways … to the extent that if one could determine an overall value for this kind of stuff would it be positive, negative or nothing much?

    I think sometimes about one of the most ridiculous (to me) results I’ve seen: that eating hot dogs relates to childhood leukemia. The idea isn’t all that odd but the work was; it looked (don’t remember how) at hot dog consumption and found a meaningful result at something like between 14 and 18 or 16 and 20 hot dogs a week but nothing below and nothing above. That’s a lot of hot dogs but apparently there would be a target zone of danger for which I can’t imagine a non-far-fetched, non-highly indirect explanation. Worthwhile? I have no idea. It seems to indicate the lack of a link but then no one really expected one so is this any different from picking any food – like blueberry muffins – and asking the same question? Would we then have to look at all foods one by one or can we group processed meats, etc.? The utility may be there but is it really?

    I far prefer the famous look at rats tested about AC/DC singers. That at least is useful for entertaining people in a bar.

  2. I think this “prominent psychology researcher” does not deserve a response by Andrew. The criticism sounded quite inane to me.

    The study doesn’t have to generalize to all humans but it needs to generalize at least to those they claim it generalizes to. And if it doesn’t generalize to anyone at all outside those 124 subjects, it’s a fairly uninteresting study.

      • Well, as the psychologist writes: “their claims are about their data”, he implies only *these 24 women* are relevant, which is absurd or inane, as Rahul says. Who cares about 24 women? If the claims are that they found some cool effect in 24 women, that is not publishable — or should not be! Clearly, the journal editor believed the claims were to some extent generalizable.

        The psychologist sounds like a troll.

        • Jack:

          I think he’s no troll. I think he was annoyed that I was mocking his field, and I think he underrates some serious criticisms that have been expressed by undergraduates, perhaps because he has not thought hard about treatment interactions (which are, I think, not covered enough in textbooks on statistics and research methods).

          I remember getting the impression from a course I took that random sampling is necessary for surveys but not for experiments. I don’t think that’s correct—an experiment is typically interesting only to the extent that its results generalize to a population of interest—but it’s something that gets taught.

          • What exactly is random sampling in an experiment? Does that mean randomization at the design stage? Which individuals to assign to controls and how to block the design etc?


            • Rahul:

              By “random sampling in an experiment,” I mean that the people in the experiment are randomly sampled from some larger population. This is different from, for example, a medical experiment that is performed by randomly assigning treatments to some selected group of patients.

  3. “Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class.”

    Apparently psychology classes are now imparting negative knowledge.

    • I think you have raised an important point.

      Two instances from my career

      1. When I was teaching a class of Rehab Medicine students, they got very excited about their assignment to locate a journal article and assess it with a trial quality scale (an old one by Thomas Chalmers). I found out why when they handed them in: almost all were assessments of articles written by their professors, and the highest score was 30%. Next year one of their faculty decided they wanted to teach the course.

      2. When teaching undergrads, many from sociology and psychology, they mentioned that what the course was suggesting as adequate statistical analysis and practice did not seem to be what their professors and colleagues did in their papers. Soon after, one of those departments decided they should teach their own students statistics.

      One of the challenges of teaching and consulting about statistical practice is that you more than occasionally upset and ruin others’ perceptions about how adequately they reason from observations and experiments (i.e. how they grasp the world around them). That’s very unsettling.

  4. Jacob Cohen in the 90s was already trying to spread some better statistical practices among psychologists: his papers are, to a statistician, extremely sarcastic and, well, funny.
    I recently discovered them, in my effort to discuss statistical testing with hydrologists, and strongly recommend these two:
    The Earth is round (p<0.05)
    Things I have learned (so far)

    I am afraid this is all part of the greater discussion of how to make statistical thinking easier for non-statisticians (and people who do not want to read about statistics).
    Although, I find myself sometimes doing things that I know are bad statistical practice, but one has deadlines, unreliable data, co-authors to please and so on…

  5. The psychologist can be read a bit more charitably. Putting aside the other issues, it is somewhat remarkable that a causal impact of cycle on clothing choice can be shown among ANY group of women. How much it generalizes is a second question. This is likely true of many of the sorts of results psychologists seek — surprising influences on behavior. So it is likely the psychologist is accustomed to reading experimental results while aware their generalizability is an unaddressed question, and comfortable with that. We who come from other disciplines — where the point is usually to represent some population as much as to give evidence of a causal impact — see a “problem” because we read with expectations from our own traditions.

    • The naive question I have is: Do psychologists start with the assumption that in any group of ~100 individuals the impact of cycle on all arbitrarily chosen parameters (clothing, brand of cereal, grade of fuel, topping on pizza etc.) will be perfectly zero?

      Only then can this finding be newsworthy.

    • igyt, It is very hard to see why it would be a causal effect if it doesn’t generalize to *some* subpopulation beyond the women in the study. There’s no experimental manipulation!

    • In addition to Andrew’s point that they used the term “women” with no qualification, I’d add that they used statistics. You don’t need statistics if you are not sampling a larger population.

  6. Generalizing from your correspondent to social psychology seems as tenuous as Beall and Tracy’s generalization. That said, I can’t say I disagree. Social psychologists are traditionally not concerned with the representativeness of samples because one of the field’s pillars is the power of situations to shape behavior (e.g. Milgram), a corollary of which is that people are more similar than we expect.

    Obviously people can be different in large and important ways and some of those ways are predictable rather than random. But isn’t the problem here really just sample size? If Beall and Tracy’s study included 5000 MTurkers and 2400 undergraduates, assuming they found the same effects, wouldn’t scenario #3 be the most likely?

    • Karim:

      Had Beall and Tracy prechosen their data analysis and data selection rules and done it on 5000 Turkers and 2400 undergrads, I think there’s no way they would’ve found the same effects. I think the effects they found would’ve been much smaller.

      • Agree 100%. But doesn’t that mean that the representativeness of their sample is not really a concern, or at worst a minor concern?

        I wouldn’t bet on replicating their effect sizes in another sample, be it on mTurk, on a college campus, or in Central Park. But my reasons for not betting have little to do with the samples they chose.

  7. First, I feel I should start by saying that I am someone from within the field of social psychology. With that out of the way, I thought I might throw in my two cents regarding what I see as my field’s stance on the issue (assuming, of course, that an entire field can have a “stance”).

    The potential problem of non-representative samples and generalizability in psychology studies is, obviously, not a new one. What we teach (or at least what I teach and have been taught) is not that representativeness, random sampling, generalizing to the population of interest, etc are unimportant (as I fear was conveyed by the comment by the unnamed psychologist in your article). Instead, we teach multi-method convergence. Some research designs require the controlled conditions of the lab and, as much as I’d like to have a nationally representative sample of Americans (for example) to show up at my lab for my study, that clearly isn’t feasible. Lab studies with student samples aren’t bad by design, they are simply one tool for testing our theories. The way I see it is that the specific results of any given study, no matter how good the sample or how well designed, should never be generalized outside of that particular sample. Instead it is the theory that should be generalized to other samples – not the results. Controlled lab studies using convenience samples (such as students) should be supplemented with less controlled field studies with better samples. For any given sample, the researcher should start by asking “how would my theory or hypothesis play out among the sample I have?” and then see what the data have to say.

    Andrew, from this point of view, the most damning point you make about the “shirt color and menstruation” study is not that the samples are bad, but that it doesn’t seem like the theory played out the way it should have given the sample of college students used (at least in one of the studies). Specifically, if the theory says that women wear red at certain times during the month in order to attract a mate, then we would expect a null (or even negative) effect among a sample that is likely not interested in having children (such as college aged women). That’s not a problem of having a poor sample, that’s a problem with misinterpreting the results to support the theory when, in fact, they don’t.

    I also wonder if taking such a theory-centric approach can help address the question you raise of choosing between the nature and size of the parameters of interest. It seems to me that point 3 can never really be fully demonstrated. Moreover, when your population of interest is as broad as “women in general” or even “people of western cultures in general”, I have trouble coming up with just about any parameter that has “large and consistent effects.” Any non-zero parameter is almost certain to fall under point 2. That said, it seems to me that, without any good empirical or theoretical reason to think that an effect would differ from one group to the next, we should start from the assumption that it does not differ until the data (or a revised theory) say otherwise.

    Anyways, I fear this post is turning into a bit of a rant and so I’ll leave it at that. Hope this helps further the discussion.

    • As someone from not within the field I feel that the “field studies with better samples” follow-up is very rarely happening. What we are seeing is repeated iterations of sampling by convenience studies being churned out and things being left there.

      The glamour lies in being the first to report a sensational correlation. Not in publishing a validation on a larger, representative sample. What makes it worse, is that when reported in secondary sources (e.g. news) this crucial distinction of external validity is often entirely lost.

      I may be wrong: Has someone done a count of how many studies using college students come out for every one study with a wide, large representative sample?

      • To your first point, see Cialdini’s clever Dear Jane letter to the field of psychology:

        Also, see Baumeister et al.’s discussion of the narrowing of research methods in the field.

        I’m an insider and both of these articles are consistent with my experience in the field. What passes for method is what is expedient: college students hitting keys on a keyboard in a “controlled” environment.

          • Yes. Psychologists have wrapped themselves tightly in the warm blanket of Campbell’s claim that internal validity is the sine qua non of validity. It’s the only rationalization they have to help them explain to themselves why they persist in conducting research on tiny samples of the W.E.I.R.D.

          • Rahul and chicken: You both make very good points (and those two articles are fantastic). I think there is still some distance between what is the “ideal” in terms of social psych research and what occurs in practice, and that likely stems from a variety of different sources that have also resulted in a variety of other problems across many different fields within the social sciences (and many of the problems have been discussed in depth both on this website and others such as Dan Kahan’s cultural cognition blog).

            One of the outcomes of all the changes in the field can be collecting whatever data is “expedient”, but I don’t know if that’s always lab experiments. As we all know, the internet now provides a very easy way to collect data from many people even more quickly than in a lab. Mechanical Turk is the most obvious example (and has its own potential issues), but there are also many different sources to beg people to take your online study for little or no money (I’ve even gotten people from Craigslist to participate in a study for free on one occasion). I think it’s easy to brush off these sources of data, and they can be just as fraught with problems of their own, but they can also be fantastic with a little bit of creativity.

            My point here is that things really aren’t as bad as they may seem. There is an ever increasing variety of sources of data (some quite good, in fact, such as the TESS project funded through the NSF) and, perhaps unlike in the past, papers that rely solely on student samples for no reason other than “it was easy” should have increased difficulty being accepted at top tier journals.

            To speak directly to your second question, Rahul, I think that the two sources Mr. chicken provided give some answer. Such things as the cognitive revolution shifted the focus of the field away from behavior and towards cognition, tightly controlled experiments, and attempts to demonstrate causal relationships. As for your first question, I believe such a paper does exist, but I am having trouble remembering who the authors were. If memory serves me, it looked at many different fields of psychology and turned up roughly the results one would expect. That said, I think it may have also come out shortly before many of the changes cited above really got into full swing. I’m sure the problems this article and your comments have raised are still there, but things may not be as bad as they were even 6 or 7 years ago.

  8. I will read Rajeev’s paper carefully, it looks interesting. However, there is an important difference between (a) prediction and (b) explanation.

    In my work I define the goal of generalized causal inference as _explaining_ the causes of heterogeneity. That involves interactions but — this is key — not all interactions are created equal (see VanderWeele, Four types of effect modification: A classification based on directed acyclic graphs).

    A simple example will do. Take a population-level causal model X -> Y <- Z -> W, where X is treatment, Y outcome, Z moderator, and W an effect of Z. Suppose X and Z interact in Y=F_y(X,Z,U_y), where U_y is a disturbance (not drawn). We have a convenience sample from this population, and suppose we observe W but not Z.

    Here W interacted with X can be useful in predicting Y in sub-samples, and, by extrapolation and shape restrictions, out of sample. But I would not say W explains the heterogeneity. It is simply a proxy for Z.

    Sometimes prediction is all we need but often, e.g. in medicine, we want to explain and understand the heterogeneity so we can do something about it.
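The distinction between predicting and explaining heterogeneity can be sketched with a small simulation. This is only an illustration under assumptions not stated in the comment: binary X, Z, and W; a treatment effect of 1 when Z is true and 0 otherwise; W agreeing with Z 80% of the time; and standard normal noise. All of these choices, and the function names, are hypothetical.

```python
import random
from statistics import mean

random.seed(0)

def simulate(n=200_000):
    """Draw (x, z, w, y) rows from the assumed model in which Z moderates the effect of X."""
    rows = []
    for _ in range(n):
        z = random.random() < 0.5                    # moderator (unobserved in practice)
        w = z if random.random() < 0.8 else not z    # W is a noisy effect of Z
        x = random.random() < 0.5                    # randomized treatment
        effect = 1.0 if z else 0.0                   # the effect of X depends on Z only
        y = effect * x + random.gauss(0, 1)
        rows.append((x, z, w, y))
    return rows

def effect_of_x(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for x, _, _, y in rows if x]
    control = [y for x, _, _, y in rows if not x]
    return mean(treated) - mean(control)

rows = simulate()

# W "predicts" heterogeneity: the estimated effect differs across levels of W.
by_w = {w: effect_of_x([r for r in rows if r[2] == w]) for w in (False, True)}

# Holding Z fixed, W no longer matters: it was only standing in for Z.
by_z_and_w = {
    (z, w): effect_of_x([r for r in rows if r[1] == z and r[2] == w])
    for z in (False, True)
    for w in (False, True)
}

print("effect by W:", by_w)
print("effect by (Z, W):", by_z_and_w)
```

Under these assumptions the expected effect is 0.8 among rows with W true and 0.2 among rows with W false, so W is useful for prediction. Within each level of Z, though, the expected effect is the same whatever the value of W, which is the sense in which W is a proxy for Z rather than an explanation.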

  9. I was amused by your comment that “I don’t know so much about social psychology but, given the sentiments expressed in my correspondent’s note above, I suspect that researchers in that field aren’t always so aware of the potential for treatment interactions…”

    In light of the main point you are making (which I agree with), do you think you should be extrapolating from a sample of one to all researchers in the field of social psychology?

    • For one, social psychologists themselves don’t seem too disturbed by papers of this nature. In fact, this professor that emailed Andrew & other commentators from the field seem quite defensive of the original paper itself.

      That’s one reason why an extrapolation may not be too unfair.

    • Jonathan:

      I chose my words carefully and gave my statement appropriate qualifications. I have no problem with extrapolating from a small nonrepresentative sample, as long as that extrapolation is presented clearly as speculation. That is, I can speculate and you can argue why my speculation is wrong. For example, I received lots of positive feedback on my Slate piece and only this one negative email, so perhaps the attitude of this particular psychologist is not so representative of the field.

      On the other hand, these sorts of articles seem to get published over and over in Psychological Science, so there does seem to be something going on there.

  10. Relevant:

    Be sure to read the commentaries at the end.

    Tacit beliefs matter in science. As an anthropologist, I was trained to assume that everyone is a unique snowflake—not only are there important treatment interactions among samples, but also within samples. That affects my work and how I read others’ work. But I don’t reflect on the assumption too often, to be honest.

    My impression, like that of the authors of the linked paper above, is that social psychology tends towards the opposite assumption. One of the commentaries to the above paper actually does argue that student samples are good for generalizing to the world! But that’s only one. All of the other psychologists agreed that student samples are problematic.

    • ” As an anthropologist, I was trained to assume that everyone is a unique snowflake…. I don’t reflect on the assumption too often, to be honest.”

      Bug or Feature? :)

  11. I laughed when I saw this post and looked down on my clothes. I swear I had completely forgotten about the findings in the study when I chose my clothes this morning. As a fertility monitor user who got a “high” this morning (the scale goes low-high-peak), I have a better measurement than I recall the study had.
