The gaps between 1, 2, and 3 are just too large.

Someone who wishes to remain anonymous points to a new study by David Yeager et al. on educational mindset interventions (link from Alex Tabarrok) and asks:

On the blog we talk a lot about bad practice and what not to do. Might this be an example of how *to do* things? Or did they just get lucky? The theory does not seem any stronger than for myriad other too-noisy-to-say-anything studies.

My reply: Hey, I actually was involved in that project a bit! I don’t remember the details but I did help them in some way, I think at the design stage. I haven’t looked at all the details but they seemed to be doing all the right things, including careful measurements, connection to theory, and analysis of intermediate outcomes.

My correspondent also asks:

Also, if we need 65 random schools and 12,000 students to do a study, I fear that most researchers could not do research. Is it pointless to do small studies? I fear they are throwing out the baby with the bathwater.

My reply:

You don’t need such a large sample size if you collect enough data on each individual student and class. Don’t forget, you can learn from N=1 in a good qualitative study—after all, where do you think the ideas for all these interventions came from? I do think, though, that the sloppy-measurement, small-N study is not such a good idea: in that case, it’s all bathwater with no baby inside.

What we really need are bridges between the following three things:

1. Qualitative research, could be N=1 or could be larger N, but the point is to really understand what’s going on in individual cases.

2. Quantitative research with careful measurement, within-person comparisons, and large N.

3. The real world. Whatever people are doing when they’re not doing research.

The gaps between 1, 2, and 3 are just too large.

32 thoughts on “The gaps between 1, 2, and 3 are just too large.”

  1. Could it be that certain things are just not suitable for study? Is it that stuff like “Growth Mindset” is so variable, so context-dependent, so vague, etc., and the effect sizes, if any, are so small, that we need these gargantuan sample sizes to detect them?

    In other words, for any effect so subliminal that we would need “65 random schools and 12,000 students” to discern it: with the effort involved, does it do any good to know it? Post hoc, would there be meaningful ways to intervene?

    • I think your thinking is in the right direction (where we ultimately bring in decision theory prior to conducting a study… and ask: even if we find something, can we actually use it to any great benefit, and if we do, do the impediments to implementation make it futile anyway? Do we trust our measurements? Will the model change in time or space? And again, how much money are we spending on this?), but you have to be careful about how you frame this.

      Saying that spending money on large-n studies that chase small, context-dependent effects is a poor use of resources could be mistaken for a vote of confidence in the situation we are currently in, where studies with small n are conducted, large effects are found, and publications/TED talks follow.

    • Rahul – not that I disagree with the main point, but I would say that, while 12,000 students is a large sample size, the clustering into 65 schools may mean that the “effective sample size” is much, much smaller. Sure, it will depend on the underlying correlations of student outcomes within vs. across schools, and how much structure you are willing to put on the covariance matrix of error terms (random effects, cluster-robust methods, etc.)… but, for instance, I worked on a study with 5,000 people in 60 or so clusters, and we got confidence intervals that were about 0.15 sd wide.

      My general rule of thumb is that the effective sample size ends up being closer to the number of clusters than to the number of actual observations, but that depends a lot on the setting (the underlying DGP of Y, and the correlation of the “treatment” variable within-cluster, which in cluster-randomized trials is often 1) and the modeling choices – my field tends towards very conservative estimates of standard errors (meaning our models tend to estimate bigger ones, conditional on data and intervention, than the models used in other fields). Looking at the blog post linked above, they seem to get much smaller standard errors than we did, but then again, as of now, the paper is offline so the authors can do some revisions, so maybe that will change…

      Anyway, this is just a technical point about big N small G. My feeling is that we should be more willing to sacrifice N if we can up G…but I guess that is a total digression (but since I wrote it, might as well post it). Also – 65 randomly chosen schools does not mean “that it [the sample] represented the full array of the U.S. public educational contexts.” I mean, 1 school per state? Sure, I could take a random sample of 24 people and call it “nationally representative” but that would be pretty clearly false in anything other than a misleadingly technical way.
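
      To make the effective-sample-size point concrete, here is a minimal sketch (mine, not anything from the study) using the standard design-effect approximation; the ICC values are purely hypothetical.

      ```python
      # A minimal sketch, not from the paper: effective sample size under clustering
      # via the standard design effect, DEFF = 1 + (m - 1) * ICC, where m is the
      # average cluster size and ICC is the intraclass correlation.

      def effective_n(n_total, n_clusters, icc):
          """Approximate effective sample size for a cluster-randomized design."""
          m = n_total / n_clusters          # average students per school
          deff = 1 + (m - 1) * icc          # variance inflation from clustering
          return n_total / deff

      # 12,000 students in 65 schools, as in the discussion; ICC values are assumptions.
      for icc in (0.05, 0.10, 0.20):
          print(f"ICC = {icc:.2f}: effective n ≈ {effective_n(12_000, 65, icc):,.0f}")
      ```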

      • @jrc

        My point is simple: If I need to take the pains to go to 65 different schools, and survey 12,000 different students to discern an effect, what I hope to discover better be something quite important and universal.

        • Maybe, but suppose instead it’s something that improves education very slightly, say it’s worth $10 per student. Now multiply by 10 million current students, and add up the discounted value for all future students… That $10 improvement might easily be worth a billion dollars.
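
          For what it’s worth, here is that back-of-the-envelope arithmetic spelled out; the $10 benefit and the 10 million students come from the comment, and the discount rate is just an assumption.

          ```python
          # Back-of-the-envelope only; every number is either taken from the comment
          # above ($10 benefit, 10 million students) or assumed (the discount rate).
          per_student = 10            # dollars of benefit per student per year
          students = 10_000_000       # current students affected
          discount = 0.05             # assumed annual discount rate

          annual = per_student * students        # $100 million per year
          present_value = annual / discount      # perpetuity approximation, ~$2 billion
          print(f"annual benefit ≈ ${annual:,}, present value ≈ ${present_value:,.0f}")
          ```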

        • Can you give any examples of this?

          A $10-per-student intervention that achieved its effect consistently across the population, where the effect was so subtle that such a large study was needed to discover it?

        • No, because no one ever does the sufficiently comprehensive evaluations to quantify this stuff, but that doesn’t mean it hasn’t happened many many times, just that we have insufficient posterior concentration to determine the true effect.

          Let me give some kind of example that will seem totally plausible though. Suppose there is the current textbook for 3rd grade math, and the new one written by some people who’ve been studying how to teach 3rd grade math for a decade. Now, we buy the textbook, it costs $50 and it gets used for at least 5 years, so it basically costs $10/yr to use (we’ll suppose the alternative is to stretch out our use of the current text we already have costing $0). Suppose that it does a better job of explaining 3rd grade math, in about as good a way *on average* as a single additional $20 individualized tutoring session over the course of the 3rd grade year. The $20/yr equivalent benefit minus the $10/yr cost leads to an *average* benefit of $10/student.

          Is that so hard to imagine? But think of how hard it would be to quantify that, particularly in such a way that the posterior estimate of the net benefit had 95% posterior probability of being greater than 0 (or “the confidence interval excludes zero”) and a maximum a-posteriori estimate of $10 (or “point estimate of the mean is $10”)? You’d easily have to study tens of thousands of students across tens or hundreds of schools, and you’d have to do it with randomized assignment of the textbook…

          It’s totally plausible to me that there are textbooks that do a better job of explaining on average, and that those textbooks are worth $10 more… but quantifying which ones with certainty would be a huge endeavor.
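
          As a rough sketch of why “tens of thousands” is plausible: using the textbook rule of thumb that you need about 16/d² subjects per arm for 80% power at alpha = 0.05, and assuming (purely for illustration) that the $10 net benefit corresponds to an effect of about 0.05 standard deviations on test scores:

          ```python
          # Rough power sketch for the textbook example; the mapping from a $10 net
          # benefit to d = 0.05 sd on test scores is an assumption for illustration.

          def n_per_arm(d):
              """Rule of thumb: ~80% power at alpha = 0.05 needs about 16 / d^2 per arm."""
              return 16 / d**2

          for d in (0.20, 0.10, 0.05):
              print(f"d = {d:.2f}: about {n_per_arm(d):,.0f} students per arm")
          # d = 0.05 already implies ~6,400 per arm (12,800 students in total), before
          # any design-effect inflation from randomizing whole schools rather than students.
          ```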

        • With enough money lots of things can be done. But it’s not cost-effective to devote tens of millions of dollars to accurately determine which is the better textbook, A or B, in this kind of case.

          That being said, plenty of people are willing to do crappy low powered studies using p values and declare victory. I had this discussion with a PhD Chemist friend who is also heavily involved in elementary school education. She declared that some “large” study (of like 1500 kids) had shown that tracking was bad for kids on average, and that it had been proven once and for all… I wound up arguing that the study, which was of I think 10 schools in one or two states, had poor design, no randomly assigned anything, and had very little in the way of actual information to be gleaned from it.

          So we had diametrically opposing points of view: One educated person sees news story about “science” with “proper statistics” and “a large sample” and declares “definitive victory”, and another sees “a hot mess of observational results on a tiny sample” and declares “offers virtually nothing of value”.

          Guess which way the headlines went?

        • Today I went to a talk by Yeager on the work in question (abstract at https://stat.utexas.edu/training/seminar-series#yeager). Based on the talk (and as the title in the link indicates), the intent was not to “discern an effect” but to study heterogeneity of effects. One plausible assumption is that the school can affect the effectiveness of the intervention — in particular, the environment in the school can plausibly “nurture” the effect of the intervention, or it can plausibly inhibit that effect. Thus studying a fairly large number of schools is important for studying how the school environment affects the effectiveness of the intervention. Another intent was to provide a database that other researchers can use to study related questions.
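
          A toy simulation of that heterogeneity point (my sketch, not the study’s actual analysis): give each school its own true effect and see how much of the school-to-school spread is swamped by within-school sampling noise.

          ```python
          # Toy simulation, not Yeager et al.'s model: each school gets its own true
          # treatment effect drawn around a small average; the question is how much of
          # the between-school spread survives the within-school estimation noise.
          import numpy as np

          rng = np.random.default_rng(0)
          n_schools, n_per_school = 65, 185        # roughly 12,000 students in 65 schools
          mean_effect, sd_effect = 0.05, 0.10      # assumed average effect and heterogeneity (sd units)

          true_effects = rng.normal(mean_effect, sd_effect, n_schools)
          estimates = []
          for tau in true_effects:
              treat = rng.integers(0, 2, n_per_school)              # within-school randomization
              y = rng.normal(0, 1, n_per_school) + tau * treat      # outcome in sd units
              estimates.append(y[treat == 1].mean() - y[treat == 0].mean())

          print("true sd of school-level effects:", sd_effect)
          print("sd of raw school-level estimates:", round(float(np.std(estimates)), 3))
          # The raw spread is inflated by estimation noise, which is why separating real
          # heterogeneity from noise takes both many schools and many students per school.
          ```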

  2. I’m reminded of Lubinski’s obituary for Humphreys: https://my.vanderbilt.edu/smpy/files/2013/02/HumphreysObit.pdf

    > One of the most dominant themes in his writing is the importance of incorporating reliable and construct-valid measures of individual differences into psychological and social science research and, especially, securing large samples. Humphreys learned about the need for large samples in psychological research during his military work in the 1950s. Too much psychological research, he maintained, is based on inadequate sample sizes, which is a key reason why so much psychological research fails to replicate. Well before the advent of meta-analysis, he was long aware of how correlations fluctuate when Ns are small. This is one reason that much of his empirical research during the last 20 years of his career was based on a wonderful longitudinal study, Project TALENT (Flanagan et al., 1962). Project TALENT is a stratified random sample of U.S. high schools. It consists of four cohorts, grades 9 through 12, with approximately 100,000 students per cohort (totaling over 400,000 participants). It also contains follow-ups at 1, 5, and 13 years following high school graduation. At Time 1, students were assessed in their high school for a full week on measures of ability, background, general information, interests, and personality, among other things. Humphreys and students under his supervision mined this impressive data bank as thoroughly as anyone. Humphreys believed that psychology would be much better off if resources were concentrated on a small number of such studies across multiple psychological domains, rather than the literally thousands of N < 100 psychological investigations, which employ measures having unknown psychometric properties, and typically contribute little to cumulative knowledge. Findings from Humphreys' empirical research, which primarily focused on the identification and development of intellectual talent, are widely cited in the modern psychological literature, handbooks, and textbooks.

    Project TALENT publications hold up well over the years, and more still occasionally come out. Certainly, I have vastly more faith in Project TALENT results than I do in any of Rosenthal’s priming or bias experiments from the 1960s on, such as the infamous ‘Pygmalion effect’… (I’d mention _The American Soldier_ surveys of hundreds of thousands or millions of soldiers in WWII, which may be the military research Humphreys worked on, but it’s not clear to me how much they actually influenced psychology.) One could also point to the enormous success of the UK Biobank recently. Would you rather have 1 UK Biobank or 5,000 candidate-gene studies, each with a small n?

    > One of the more memorable things about Bouchard’s course was that, occasionally, after presenting a compelling empirical demonstration to the class on the power that psychological variables can hold for predicting important socially-valued outcomes (educational achievements, occupational accomplishments, or life in general), Bouchard would turn to the class and say: “See, see, see what happens when psychologists choose to study *real* variables.”
    >
    > This point of view was not unrelated to that of Bouchard’s colleague Paul E. Meehl; Meehl would occasionally wonder out loud whether it would be scientifically prophylactic to require graduate students in psychology to take a minor in a natural science like biology or genetics. By doing so, Meehl speculated, they might be able to recognize a meaningful scientific contribution, if they should ever happen to encounter one in psychology!

    • I would say: no difference (Daniel: please explain!). But that’s not a problem from my quantitative worldview. The first few anecdotes serve to build hypotheses and carve out the details of the experiments you will try at n = 10 or n = 100. After that phase you re-adjust and go for an even bigger scale. Research should be iterative.

      • Anecdote: this one guy used to chew on this weed that tastes slightly bitter, and he swears it helped him with his asthma.

        Experiment: we fed varying doses of the weed to an asthma patient over time and recorded peak flow and other metrics; our score consistently co-varied with the dose, lagged by 3 days, up to doses of 3 g three times per day…
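
        For concreteness, here is one way such a hypothetical N=1 experiment might be analyzed; the data below are simulated, and only the 3-day lag and the peak-flow outcome are taken from the comment.

        ```python
        # Sketch of an N=1, within-person analysis: regress daily peak flow on the
        # dose lagged by 3 days. All data are simulated for illustration.
        import numpy as np

        rng = np.random.default_rng(1)
        days, lag = 120, 3
        dose = rng.choice([0.0, 3.0, 6.0, 9.0], size=days)       # grams per day
        peak_flow = 400 + 5.0 * np.roll(dose, lag) + rng.normal(0, 10, days)
        peak_flow[:lag] = 400 + rng.normal(0, 10, lag)            # no lagged dose on the first days

        # Least squares of peak flow on the 3-day-lagged dose.
        X = np.column_stack([np.ones(days - lag), dose[:-lag]])
        y = peak_flow[lag:]
        intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]
        print(f"estimated peak-flow change per gram (3-day lag): {slope:.2f}")
        ```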

        • Sounds like a lot of researcher degrees of freedom in the analysis. And you’d better hope everyone responds the same way. There are a lot of differences in terms of weight, lifestyle, genetics, etc. that could affect results.

        • Daniel didn’t say an N=1 was the way to go. He was simply demonstrating that an anecdote isn’t the same thing as a reasonably well-conducted N=1 study. Moreover, an anecdote doesn’t necessarily have to have N=1. Though, Andrew did say a qualitative (not quantitative) N=1 study, which makes Daniel’s example not exactly in line with Andrew’s original statement (seeing as he is taking measurements and such).

          For another example of a non-anecdotal N=1 quantitative study: one can learn a lot about the failure mode of stone-column pile foundations for wind turbines by testing one to failure (tension/compression). Just because I only failed one foundation doesn’t mean I don’t have a good idea of what to expect for the next N>1 given a certain soil and load condition.

        • Allanc:

          Right. In your stone pile foundations example, you have strong prior information that the relevant parameters vary very little among turbines of a certain type.

  3. Isn’t this the Yeager et al. study that was up for a short while and then removed from PsyArXiv, and is now embargoed? It appears to be under revision again. I recognize a couple of names in the author list that lead me to believe this will be a worthwhile effort (assuming they keep the researcher degrees of freedom in check or present some form of multiverse results that explore the stability of their findings across the host of interaction effects they could potentially explore!).

    Link: https://psyarxiv.com/md2qa/

  4. As a (volunteer) side project I am supporting a social institution with its research practices. They wanted to hop on the analytics bandwagon, so to say. They had volunteers going out interviewing every (!) senior citizen in their local population and wanted to use the data they were collecting (quite extensive) for good purposes. What really caught them off guard were my questions: why are you asking these questions, whom do you seek to influence, and what’s the relationship between the data you’re retaining and your organisation? It was all “we always did it this way,” and that from people with MScs in social science. The iterative nature of research (n = 1, n = 10, n = 100, etc.), but also of business (social institutions *are* in business, imho), was totally lost on them. Not anymore!

    • “We always did it this way.”
      Reminds me of a group of good high school teachers I once worked with who had signs that said TTWWADI with a slash through it (i.e., saying “No” to “that’s the way we’ve always done it”).

  5. This is an area where I think social psychology could improve its understanding of appropriate sample selection. I do not understand why students who already have high GPAs would be included in the study at all, except as a control group. An analogous sample in clinical research would be to include healthy people in a drug trial and then argue that the average treatment effect was zero because the healthy people could not actually get any healthier.

    • The target sample should be students with low GPAs and assessment results suggestive of a fixed-trait mindset. I would still not expect such brief interventions to have long-lasting effects, and I suspect a more focused and sustained effort would be required to actually change the way a sizable number of this group approaches complex problems that require resolute effort and persistence toward short-term goals in the service of long-term goals.
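
      A toy simulation of that selection point (purely illustrative numbers, not the study’s analysis): if the students who are already at the ceiling cannot respond at all, including them roughly halves the measured average effect.

      ```python
      # Toy illustration of dilution from including students who cannot improve.
      # The 50/50 split and the 0.2 effect are arbitrary assumptions.
      import numpy as np

      rng = np.random.default_rng(2)
      n = 10_000
      low_gpa = rng.random(n) < 0.5                # students with room to improve
      treat = rng.integers(0, 2, n)
      true_effect = np.where(low_gpa, 0.2, 0.0)    # assumed: only low-GPA students respond
      outcome = true_effect * treat + rng.normal(0, 1, n)

      ate_all = outcome[treat == 1].mean() - outcome[treat == 0].mean()
      ate_low = (outcome[(treat == 1) & low_gpa].mean()
                 - outcome[(treat == 0) & low_gpa].mean())
      print(f"ATE in the full sample: {ate_all:.2f}; ATE among low-GPA students: {ate_low:.2f}")
      ```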

  6. Interesting post. It seems to me that the bridge between (1) and (2) is the purview of a domain/approach I usually hear called “mixed methods” – have you spoken to anyone in that world about this issue? I normally see that phrase in job postings in e.g. educational psychology departments and possibly sociology as well, and I imagine you have colleagues in both places you could ask. (The extent of my knowledge on the topic begins and ends with the job postings, unfortunately.)
