Have social psychologists improved in their understanding of the importance of representative sampling and treatment interactions?

I happened to come across this post from 2013 where I shared an email sent to me by a prominent psychology researcher (not someone I know personally). He wrote:

Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class. I really *hope* you [realize] why that is an absurd criticism – especially of authors who never claimed that their study generalized to all humans. (And please spare me “but they said men and didn’t say THESE men” because you said there were problems in social psychology and didn’t mention that you had failed to randomly sample the field. Everyone who understands English understands their claims are about their data and that your claims are about the parts of psychology you happen to know about).

As I explained at the time, this researcher was mistaken, and the reason for the mistake turns on a bit of statistics that is not taught in the standard introductory psychology or statistics course. It goes like this:

Like these freshmen, I am skeptical about generalizing to the general population based on a study conducted on 100 internet volunteers and 24 undergraduates. There is no doubt in my mind that the authors, and anyone else who found this study worth noting, are interested in some generalization to a larger population. Certainly not “all humans” (as claimed by my correspondent), but some large subset of women of childbearing age. The abstract to the paper simply refers to “women” with no qualifications.

Why should generalization be a problem? The issue is subtle. Let me elaborate on the representativeness issue using some (soft) mathematics.

Let B be the parameter of interest. The concern is that, to the extent that B is not very close to zero, it can vary by group. For example, perhaps B has a different sign for college students than for married women who are trying to have kids.

I can picture three scenarios here:

1. Essentially no effect. B is close to zero, and anything that you find in this sort of study will likely come from sampling variability or measurement artifacts.

2. Large and variable effects. B is large for some groups, small for others, sometimes positive and sometimes negative. Results will depend strongly on what population is studied. There is no reason to trust generalizations from an unrepresentative sample.

3. Large and consistent effects. B is large and pretty much the same sign everywhere. In that case, a convenience sample of college students or internet participants is just fine (measurement issues aside).

The point is that scenario 3 requires this additional assumption that the underlying effect is large and consistent. Until you make that assumption, you can’t really generalize beyond people who are like the ones in the study.
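To make the three scenarios concrete, here is a tiny numerical illustration in R, with entirely made-up effect sizes and group shares (none of these numbers come from the study in question):

    # Scenario 2: effect flips sign across groups, so the group mix matters a lot.
    effect       <- c(students = 0.5, married = -0.3)   # hypothetical group-specific effects
    share_sample <- c(students = 0.9, married = 0.1)    # convenience sample, mostly students
    share_pop    <- c(students = 0.1, married = 0.9)    # assumed target population

    sum(effect * share_sample)   #  0.42: what the convenience sample would show
    sum(effect * share_pop)      # -0.22: the average effect in the population of interest

    # Scenario 3: same large effect everywhere, so the mix hardly matters.
    sum(c(0.5, 0.5) * share_sample)   # 0.5
    sum(c(0.5, 0.5) * share_pop)      # 0.5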

Which is why you want a representative sample, or else you need to do some modeling to poststratify that sample to draw inference about your population of interest. Actually, you might well want to do that modeling even if you have a representative sample, just because you should be interested in how the treatment effect varies.
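Here is a minimal sketch of that sort of poststratification in R, using simulated data and assumed population shares (so this is an illustration of the idea, not of any particular study): fit a model with a treatment-by-group interaction, then reweight the group-specific effects by the groups’ shares of the target population rather than their shares of the sample.

    set.seed(1)
    n         <- 2000
    group     <- factor(sample(c("married", "student"), n, replace = TRUE, prob = c(0.1, 0.9)))
    treatment <- rbinom(n, 1, 0.5)
    y         <- ifelse(group == "student", 0.5, -0.3) * treatment + rnorm(n)   # made-up effects

    fit <- lm(y ~ treatment * group)   # model with a treatment x group interaction
    cate_married <- coef(fit)["treatment"]                                       # baseline group
    cate_student <- coef(fit)["treatment"] + coef(fit)["treatment:groupstudent"]

    mean(y[treatment == 1]) - mean(y[treatment == 0])   # naive sample estimate, around 0.42
    pop_share <- c(married = 0.9, student = 0.1)        # assumed population shares
    cate_married * pop_share["married"] + cate_student * pop_share["student"]   # around -0.22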

And this loops us back to statistical modeling. The intro statistics or psychology class focuses on estimating the average treatment effect and designing the experiment so you can get an unbiased estimate. Interactions are a more advanced topic.

Has social psychology advanced since 2013?

As several commenters to that earlier post pointed out, it’s kinda funny for me to make general statements about social psychology based on the N=1 email that some angry dude sent to me—especially in the context of me complaining about people making generalizations from nonrepresentative samples.

So, yeah, my generalizations about the social psychology of 2013 are just speculations. That said, these speculations are informed by more than just one random email. First off, the person who sent that email is prominent in the field of social psychology: a tenured professor at a major university with tens of thousands of citations, and the author of a popular introductory psychology textbook. Second, at that time, Psychological Science, one of the leading journals in the field, published lots and lots of papers making broad generalizations from small, non-representative samples; see for example slides 15 and 16 here. So, although I can make no inferences about average or typical attitudes among social psychology researchers of 2013, I think it’s fair to say that the view expressed by the quote at the beginning of this post was influential and at least somewhat prevalent at the time, to the extent that this prominent researcher and textbook writer thought of this as a lesson to be taught to freshmen.

So, here’s my question. Are things better now? Are they still patiently explaining to freshmen that, contrary to naive intuition, it’s just fine to draw inferences about the general population from a convenience sample of psychology undergraduates?

My guess is that the understanding of this point has improved, and that students are now taught about the problem of “WEIRD” samples (Western, educated, industrialized, rich, and democratic). I doubt, though, that this gets connected to the idea of interactions and varying treatment effects. My guess is that the way this is taught is kinda split: still a focus on the model of constant effects, but then with a warning that your sample may not be representative of the population. The next step is to unify the conceptual understanding and the statistical modeling. We try to do this in Regression and Other Stories, but I don’t think it’s made it into more introductory texts.

24 thoughts on “Have social psychologists improved in their understanding of the importance of representative sampling and treatment interactions?”

    • Ruben:

      Thanks for the link. From the twitter post by Bavel:

      A replication of 27 survey experiments (n = 101,745) finds that convenience samples overwhelmingly produce similar findings to representative samples. In 0 of 393 analyses were the effects in opposite directions.

      That “0 out of 393” thing seemed hard to believe! The clarification comes from the linked article by Coppock et al.:

      In 0 of 393 opportunities do the CATEs have different signs while both being statistically distinguishable from 0.

      In any case, it’s an interesting result. I think much will depend on the quality of the studies in question. Lots of junk science proceeds by rooting around in unrepresentative samples of noisy data and finding statistically significant differences (which then will be large because the data are noisy). In this case, two different studies could easily yield two very different results, even if the two samples are drawn from the same population. If the analyses are done in a more reasonable way, then, yes, flipping signs would seem much less likely, for the same reason that “Simpson’s effect”-type reversals are rare enough so that, when they do happen (as with Red State Blue State), they’re notable.
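      For concreteness, here’s a minimal sketch in R of the kind of sign-agreement check described in that quote, using fabricated estimates and standard errors (not the Coppock et al. data): count the pairs where the original and replication estimates have opposite signs while both are individually distinguishable from zero.

          orig_est <- c(0.30, -0.10,  0.05,  0.40)   # fabricated original CATEs
          orig_se  <- c(0.10,  0.12,  0.20,  0.15)
          rep_est  <- c(0.25,  0.20, -0.02, -0.35)   # fabricated replication CATEs
          rep_se   <- c(0.08,  0.09,  0.15,  0.10)

          sig_orig <- abs(orig_est / orig_se) > 1.96
          sig_rep  <- abs(rep_est / rep_se) > 1.96
          opposite <- sign(orig_est) != sign(rep_est)

          sum(opposite & sig_orig & sig_rep)   # sign flips under the stricter criterion
          sum(opposite)                        # sign flips ignoring statistical significance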

      • I read the linked paper but I cannot figure out how they account for any post-stratification that was done in the original papers utilizing a convenience sample (not willing to dig that far!). This raises the question: does this study say that convenience samples reliably yield the same results as random samples, or does it say that modern approaches to post-stratification work quite well?

        • Matt:

          I’d guess that none or very few of the papers in that study used poststratification. Standard practice is just to estimate the causal effect from the existing data and be done with it, relying on the randomization to ensure that the estimate is unbiased.

      • Following up a bit:

        “Of the 156 CATEs that were significantly different from no effect in the original, 118 are significantly different from no effect in the MTurk replication.”

        “Of the 237 CATEs that were statistically indistinguishable from no effect in the original, 158 were statistically indistinguishable from 0 in the MTurk version.”

        70% of the replicated studies matched the statistical significance determination of the original study.

        76% of the statistically significant original studies were also statistically significant in the replicated studies.

        “In 0 of 393 opportunities do the CATEs have different signs while both being statistically distinguishable from 0”

        But 60% of the original studies weren’t statistically significant, so the above statement misrepresents the results. A representative statement of the results is:

        Of the 156 results that were statistically significant in the original work, none of the replications were of opposite sign from the original work.

        The authors discuss a number of caveats at the end of the paper. I’m no expert but the discussion seems thorough and sound at first pass. If I were to follow up on it – I won’t – I would review the original studies to get a better idea of what this study was actually trying to replicate, and then review this paper again.
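        For what it’s worth, the percentages above follow directly from the counts quoted from the paper; a quick check in R:

            sig_in_both    <- 118   # significant in both original and MTurk replication
            nonsig_in_both <- 158   # indistinguishable from 0 in both
            sig_original   <- 156
            total_cates    <- 393

            (sig_in_both + nonsig_in_both) / total_cates   # ~0.70: same significance call
            sig_in_both / sig_original                     # ~0.76: original-significant also significant in replication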

  1. I’m just a tad one note about this issue, but I think it’s a broader problem, rooted in a general cognitive bias, and should be viewed as such rather than within the narrower window of something like social psychology.

    For example, although I think the field of epidemiology is pretty focused on only generalizing from representative sampling, (to my surprise) we saw famous epidemiologists (e.g., Ioannidis and Bhattacharya) making the fundamental error of generalizing from a non-representative sample during their TV campaign to promote incorrect estimates about the virulence of COVID, and the likely associated mortality outcomes.

    Generalizing from sampling is an inherent feature of human cognition – based on our psychological and cognitive tendencies. Oftentimes, generalizing from unrepresentative sampling works. That’s why we do it! But it works less often (probably?) than generalizing from representative sampling.

    Choosing between sub-optimal and less sub-optimal strategies is hard!

    • On a side note here, it’s amazing to me how little ‘post-mortem’ has been done in the various camps around Covid since 2020. Everyone, contrarian and mainstream alike, seems to have come away feeling more vindicated than ever.

      • imho some of this has to be put on the “standard” research practices. If your usual thing is collect data, analyze with linear models, check for statistical significance, publish discovery…. then you’re going to get a lot of nada from the COVID pandemic. It’s just crazy messy, with different sub-groups behaving differently, the virus spreading through space in distinct waves, policies changing weekly, data quality varying through time… expanded rollouts of vaccines and home-tests and boosters and variation in school attendance and etc…

        If you treat it as a dynamic time and space varying process, you could get somewhere, but the sophistication required is well beyond what most people can do. It’s on the order of complexity of running the national weather service but without a reliable network of weather sensors or satellites.

        It’s quite possible you could learn something from the wastewater viral concentration data with some simplified models, but you’d still probably want ~1000 coupled differential equation based processes, and you’re going to have to do Bayesian analysis of it.

        In the end, my strong prior is that what you’ll find is this: to the extent that people limited their exposure to others, wore quality masks, had higher health education levels, kept their kids home from school, and took the vaccines as soon as they could, the risk of death or hospitalization was lower.

        It would be interesting however to simply run a quality census type health survey… for example follow up with everyone who was in the 2019 ACS survey and see what their family/household experience of COVID was.

        • This is spot on!

          Actually, not only are there few people who are capable of this kind of ultra-complicated system analysis, the super-computing resources required to carry it out are also scarce. Moreover, those scarce resources are mostly in the hands of people who use them for other things; public health policy analysis has never been a top priority in our economy (nor any other, as far as I know.)

          My guess is that these questions will never be adequately studied and will forever remain unsettled.

        • What do you mean by a lack of sophistication? Is it just a lack of appropriate data, or do you mean that the people studying these questions are using the wrong mathematical/statistical tools because they can’t use the right ones? If it’s the latter, would you mind sharing an example of what the right modelling process would look like?

        • Clyde: I agree, if you want to run sophisticated models you’ll need a big computer and those are mostly being used to do weather analysis, global circulation models, nuclear bomb aging studies, galactic formation simulations, etc.

          Blackthorn: I mean mostly that people are not studying things with the right kinds of models. I’ve seen adoption of agent based models in Ecology, a little in Biology, and every so often as a niche thing in Economics. Early on in the COVID pandemic there was a Kings College group that open-sourced their agent based model in C++ for pandemics. So I know there’s some niche stuff being done there.

          Basically, it’s hard, but if you want to understand something like what happened in the US in the COVID pandemic you should think of the US as not a single point with 330M people in it, and not as 50 “equal” states, and probably not even as 3000+ counties, but rather as maybe 33,000 regions of about 10,000 people each, placed in approximately the correct locations on a map of the US, with roads connecting them, and each region has its own socioeconomic / demographic situation, which can be poor, middle class, rich, mixed, etc. Then there’s policies at state and county levels that apply, there’s a mix of jobs, people migrating, airplanes flying from one airport to another, people driving for work and vacations, seasonal issues with vacations, big events like the Sturgis motorcycle rally, or music festivals or whatever.

          Of course, you don’t have accurate models for all these things, but we can specify stochastic rules that stand in for accurate models, and sometimes it doesn’t matter because that’s good enough. And we run this model forward with a timestep that is sufficiently small relative to the time course of the disease. Since people are most infectious in the first 5-10 days, that suggests a timestep of 1 day or maybe even as small as 1/2 day. You don’t necessarily have to model each of the 330M people in the US, but maybe you should have each entity represent say 10 people on average, so you’ve got 33M entities.

          Then, what kind of data do we have? Google had all sorts of “mobility” data they were publishing. We have case counts on a county basis for each day. We have some information about testing availability. One of the largely overlooked data sources is wastewater treatment… it’s a really excellent indicator of overall prevalence of viral infection in a region. We have vaccination records that could indicate maybe down to how many people per day got vaccinated per county, or maybe people/week/county. There’s death data indicating how many people died of all causes and of COVID each day or week, in each county… We’ve got some demographic data on deaths, like by race and Hispanic / Latino background and maybe by different kinds of jobs.

          So, you build some kind of rule-based model involving agents making decisions like whether they’re going to work or not, and whether they’re infected on a given day based on lots of stuff like whether there’s a stay-at-home order, whether they’re one of the “essential” workers like nurses, doctors, transportation people, and grocery store clerks and whatever… whether they use masks or not, whether they live in a household with other infected or high-risk people, etc.
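          To make the flavor of this concrete, here’s a deliberately oversimplified sketch in R of that kind of per-agent daily update rule – one homogeneous region, SIR-style states, and entirely made-up parameters, so nothing here is meant as a real COVID model:

              set.seed(42)
              n_agents  <- 10000
              state     <- rep("S", n_agents)
              state[sample(n_agents, 10)] <- "I"            # seed a few infections
              days_inf  <- rep(0, n_agents)
              masked    <- rbinom(n_agents, 1, 0.5)         # assumed mask usage
              stay_home <- rbinom(n_agents, 1, 0.3)         # assumed stay-at-home compliance
              beta      <- 0.3                              # assumed baseline daily transmission rate

              for (day in 1:180) {
                prev  <- mean(state == "I")                 # local prevalence drives exposure risk
                p_inf <- beta * prev * (1 - 0.5 * masked) * (1 - 0.7 * stay_home)
                new_inf <- state == "S" & runif(n_agents) < p_inf
                state[new_inf] <- "I"
                days_inf[state == "I"] <- days_inf[state == "I"] + 1
                state[state == "I" & days_inf > 10] <- "R"  # recover after ~10 infectious days
              }
              table(state)                                  # final susceptible / infected / recovered counts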

          And maybe we parallelize this, so we put different regions that are geographically separated on different machines, and people travel between machines if they drive on certain roads etc… So we might have 20 or 30 machines each simulating a given region, and 1 million agents or so on each machine, and we run the model forward trying to find parameter values for the rule set which approximately reproduce say wastewater concentration statistics and clinical case statistics and death and hospitalization statistics aggregated at the county level.

          You could get away with doing something like this without a “supercomputer” but it’d be a bunch of work to build the model, and to run it, because any given run is likely to be relatively fast, but to tune the parameters (or get a posterior sample) you’re going to run 3 years ~ 1000 timesteps maybe a million times…. That’s about a billion timesteps, and each timestep updates ~30M agents, so let’s call it on the order of 3×10^16 agent updates. If each update is maybe 1000 cycles, and you have a 1GHz processor… ok now I’ve gotta pull up a terminal… even if you run on 1000 nodes it’ll take you something like a year of wall-clock time to do the run… So yeah, you really do need a supercomputer. I’m sure there are ways to reduce that work a lot… you could probably homogenize stuff a bit and re-use sub-simulations, and use multiscale techniques to diffuse stuff around at rates related to the statistics of the underlying population… so I’m pretty confident with some quality model builders you could reduce that cost by a factor of 100 or 1000. So maybe with 100 nodes rented from Amazon you could run it for 100 days and get a lot of great answers.
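          Here’s that back-of-envelope calculation written out in R (all the inputs are just the rough guesses above, not measured numbers):

              runs              <- 1e6   # parameter settings / posterior draws
              steps_per_run     <- 1e3   # ~3 years at roughly daily timesteps
              agents            <- 3e7   # entities, each standing in for ~10 people
              cycles_per_update <- 1e3   # assumed cost of one agent update
              clock_hz          <- 1e9   # 1 GHz core
              nodes             <- 1e3

              updates      <- runs * steps_per_run * agents          # ~3e16 agent updates
              core_seconds <- updates * cycles_per_update / clock_hz
              core_seconds / nodes / 86400                           # ~350 days of wall clock on 1000 nodes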

          If you did all that though, you could certainly see if things like vaccination rates directly *caused* fewer deaths. Because after you’ve fit the model to mimic the actual vaccination data, you could reduce the vaccination take-up and run a counterfactual and find out whether you would predict more deaths in say LA county if vaccination take-up was 10% less… stuff like that. You could look at what would have been the effect of canceling the Sturgis rally. What would have been the effect of preventing people from vacationing in Florida… what would have been the effect of much more rapidly acquiring and spreading the word about using KN95 masks? etc etc.

          If you think that all sounds insane, it’s only because you haven’t hung out with people doing climate modeling or biogeochemistry or ecology of the global oceans, or stuff like that, because **this kind of stuff gets done** it just isn’t being applied to social sciences much.

        • Daniel: It’s not that I find the approach insane, I just don’t understand why these techniques aren’t more widely used. I don’t mean to imply that the academic market is “hyper-efficient,” but is it really as simple as grabbing a bunch of people working on climate/biogeochemistry/etc. and asking (begging?) them to work on some other problems instead? I often see this criticism that academics in a certain field use the wrong math; in the Econ case there’s often the additional implication that economists are just too stupid to use the “right” math, but it just doesn’t seem that likely to be true to me.

        • My impression is that there are two reasons agent based modeling never caught on in economics:

          1. Economists like math. They like proving theorems, they like differential equations. It’s more aesthetic that way. It feels more like you “understand” that way.

          2. Economists don’t like computers. The main challenge with ABMs is distributed computation, which is not a skill economists train. I don’t think this is equivalent to the “economists are too stupid” argument. I don’t think distributed computation is any harder than, say, Kakutani’s fixed point theorem and supermodularity and all the mathy nonsense that economists learn in their training. It’s kind of weird to see papers with all this fancy math and whatnot where the empirical analysis ends up loading a CSV file into Stata, typing “reg y x, robust”, and then a bunch of weak arguments as to why a linear model is appropriate. But that’s how it often is; the econometricians who learn tools like R and Stan and JABM and Apache Spark seem to have swum upstream.

          It can also be argued that models where you can prove the existence and weak Pareto optimality of finitely many equilibria also broadly support a neoclassical political philosophy. Personally, I don’t think most economists are all that neoclassical; if that is a factor, it’s through the legacy of folks like Samuelson and Arrow.

        • I agree Daniel, and would also add in Joshua’s point about political polarization overlay. It really is endlessly frustrating!

      • Chris –

        > various camps around Covid since 2020. Everyone, contrarian and mainstream alike, seems to have come away feeling more vindicated than ever.

        I guess it’s what to expect once an issue becomes polarized.

        It’s incredible how “lockdown deaths” is taken by many as an article of faith – despite the fact that it’s always (at least as far as I’ve seen) based on a highly speculative counterfactual assumption: that outcomes wouldn’t have been worse if the NPIs hadn’t been implemented. It’s based on reasoning that couldn’t really be more facile: correlation equals causation. IOW, negative outcomes were associated with NPIs, therefore the NPIs caused the negative outcomes. Yet that thinking, in the form of anger about NPIs, may well be the driving factor in the election of our next president (if DeSantis wins).

        Sure, there’s much of the same on the other side, but I honestly think the general level of analysis isn’t as bad. Although, I would say the focus on vindication is no less prominent.

      • It isn’t time for a post mortem yet. Most likely outcome is 2-3 variants that differ from each other by ~20% circulating at the same time. Too strong immunity vs one will mean weak immunity vs the others, a la dengue virus.

        So far everything has gone pretty much as you would expect based on prior knowledge, so I don’t see why it would stop now.

        • Although that’s in some sense true, I think it’s worth pointing out that deaths per day has plateaued at what is probably within a factor of 2 of what will be the long-term endemic rate… things haven’t changed much with respect to deaths since about May 2022. Yes we’ll probably see a climb due to holidays but it’s not going to be more than double the rate we had in Aug 2022 probably, unlike say Jan 2021 which was about 5x what we’ve got going on now (see the charts at ourworldindata.org).

          So, if you’re interested in the dynamic **transient** response, studying the period Jan 2020 to May 2022 is probably pretty reasonable, extending it to Jan 2023 makes some sense too, but is not necessary to get a sense of what happened during the period where immunity was building up to a steady-state.

        • Covid deaths are about 1/5 of the same time last year, just like tests. Categorizing “with covid” as a covid death was just as dumb as blaming every celebrity/athlete collapse or sudden death on vaccination. So I don’t think that means much. “Excess mortality” is a better number. We see 2022 tracked 2021 up until March or so, when the hysteria in the news and NPIs largely ended:

          United States reported 3,353,787 deaths, for the 52 weeks of year 2020 (all years of age). Expected deaths were 2,910,693. That is an increase of +443,094 deaths (+15.2%).

          United States reported 3,457,521 deaths for the 52 weeks of year 2021 (all years of age). Expected deaths were 2,937,434. That is an increase of +520,087 deaths (+17.7%).

          Year to date, United States reported 3,239,114 deaths for the 52 weeks of year 2022 (all years of age). Expected deaths were 2,979,305. That is an increase of +259,809 deaths (+8.7%).

          https://www.usmortality.com/deaths/excess-yearly-cumulative

          We will see how the winter ends up, but the danger from these vaccines has always been due to the inevitable “SARS-3”, after imprinting the same immunity on such a large swath of the population. I don’t think we are there yet.

          The one surprise is the IgG-4 antibodies after boosting. Basically, it seems chronic exposure to the spike leads to the body treating it as an allergen, and becoming tolerant.

          How the mRNA vaccines are producing said chronic exposure is unknown, but the CEO of Pfizer was asked during one of the early FDA approval meetings whether they checked what happened when they treated cells expressing endogenous retroviral proteins (~ 5% of the human genome, expressed during proliferation). He laughed at the question, but reverse transcription is the most likely explanation.

      • Chris:

        But I guess the people who were pushing ivermectin, or who were projecting 500 total deaths, or who were promoting forever lockdowns aren’t feeling more vindicated, right? Or if they are, at least the rest of us realize how wrong they were?

  2. The email seems to be making the claim “our studies are just claims about the sampled population,” but nobody wants to know how the minds of Psych 101 students at a research university in the Midwestern USA work. No press release makes such a narrow claim, because it’s obviously boring. If you want to make claims about Americans in general, or people in general, you obviously need a larger and more diverse sample, or you need a series of studies on specific populations with enough statistical power to see how each population varies from other populations.

  3. The simplest response to this researcher is, yeah, freshmen often make the mistake of confusing the motivation for random assignment with the motivation for random selection. A freshman would be wrong to say that random selection is necessary for experimental validity. But they would be equally wrong to say, as the researcher implies, that random assignment is sufficient for experimental generalizability. By accusing you of making a freshman mistake, he is himself making a freshman mistake.

  4. For most people, adolescent brain development is done by age 25, but it varies, and is often still in progress in the early 20s.
    What could possibly go wrong from sampling undergraduates?

  5. This is a great post that brings up the issue of generalizability of claims. Two issues were left untouched, including by the discussants:
    1. Formulation of claims
    The finding categories listed as (1), (2) and (3) are verbal. The discussion of claims is verbal. We should discuss how to present claims verbally.
    2. The sign test
    The term “same sign” keeps recurring, and this is quite critical. Directional tests have been proposed (Tukey; Gelman and Carlin). Why are they not mentioned?
    Our paper, linked below, addresses both of these: https://www.dropbox.com/s/zfmuc81ho2yschm/Kenett%20Rubinstein%20Scientometrics%202021.pdf?dl=0
