Understanding the “average treatment effect” number

In statistics and econometrics there’s lots of talk about the average treatment effect. I’ve often been skeptical of the focus on the average treatment effect, for the simple reason that, if you’re talking about an average effect, then you’re recognizing the possibility of variation; and if there’s important variation (enough so that we’re talking about “the average effect” rather than simply “the effect”), then maybe we care enough about this variation that we should be studying it directly, rather than just trying to reduce-form it away.

But that’s not the whole story. Consider an education intervention such as growth mindset. Sure, the treatment effect will vary. But if the treatment’s gonna be applied to everybody, then, yeah, let’s poststratify and estimate an average effect: this seems like a relevant number to know.

What I want to talk about today is interpreting that number. It’s something that came up in the discussion of growth mindset.

The reported effect size was 0.1 points of grade point average (GPA). GPA is measured on something like a 1-4 scale, so 0.1 is not so much; indeed one commenter wrote, “I hope all this fuss is for more than that. Ouch.”

Actually, though, an effect of 0.1 GPA point is a lot. One way to think about this is that it’s equivalent to a treatment that raises GPA by 1 point for 10% of people and has no effect on the other 90%. That’s a bit of an oversimplification, but the point is that this sort of intervention might well have little or no effect on most people. In education and other fields, we try lots of things to try to help students, with the understanding that any particular thing we try will not make a difference most of the time. If mindset intervention can make a difference for 10% of students, that’s a big deal. It would be naive to think that it would make a difference for everybody: after all, many students have a growth mindset already and won’t need to be told about it.
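To make the arithmetic of that decomposition concrete, here's a quick simulation. The 10%/1-point split is just the hypothetical from the paragraph above, not anything estimated from the mindset study:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical mixture: 10% of students get a full 1-point GPA boost,
# the other 90% get no boost at all.
responder = rng.random(n) < 0.10
effect = np.where(responder, 1.0, 0.0)

# The average treatment effect comes out near 0.1 even though
# nobody's individual effect is anywhere near 0.1.
print(round(effect.mean(), 2))
```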

That’s all a separate question from the empirical evidence for that 0.1 increase. My point here is that thinking about an average effect can be misleading.

Or, to put it another way, it’s fine to look at the average, but let’s be clear on the interpretation.

I think this comes up in a lot of cases. Various interventions are proposed, and once the hype dies down, average effects will be small. Of course there’s no one-quick-trick or even one-small-trick that will raise GPA by 1 point or that will raise incomes by 44% (to use one of our recurring cautionary tales; see for example section 2.1 of this paper). An intervention that raised average GPA by 0.1 point or that raised average income by 4.4% would still be pretty awesome, if what it’s doing is acting on 10% of the people and having a big benefit on this subset. You try different interventions with the idea that maybe one of them will help any particular person.

Again, this discrete formulation is an oversimplification—it’s not like the treatment either works or doesn’t work on an individual person. It’s just helpful to understand average effects as compositional in that way. Otherwise you’re bouncing between the two extremes of hypothesizing unrealistically huge effect sizes or else looking at really tiny averages. Maybe in some fields of medicine this is cleaner because you can really isolate the group of patients who will be helped by a particular treatment. But in social science this seems much harder.

1. jd says:

But you don’t actually know if the effect is 1 GPA for 10%. It’s likely somewhere between 0.1 GPA for 100% and 1 GPA for 10%, right? And if you don’t know, then you don’t really know how effective it is. Is the idea just to do any interventions that have small average treatment effects but likely larger effects on some unknown subset? Sort of like a blind shotgun approach. I’m still not sure how to gauge the intervention based on an average treatment effect of 0.1 GPA. Without knowing who it is affecting, it doesn’t seem very informative.

• Ideally you estimate a distribution of effect sizes. You still may not be able to predict who it will affect, but you can at least see how many people it's likely to affect at the 1-point level, how many at the 0.5-point level, etc.

Of course, estimating a frequency distribution over students will require a lot of students, and the uncertainty in the shape of the distribution will be large compared to the standard error of the average.
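To illustrate that last point, you can simulate a mixture of effects and bootstrap both the mean and a feature of the distribution's shape; the shape feature is estimated far less precisely. All the numbers here are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # a plausible study size; all numbers here are made up

# Hypothetical effects: 10% respond at about 1 point, 90% at about 0,
# both with some individual-level noise.
effects = np.where(rng.random(n) < 0.10, 1.0, 0.0) + rng.normal(0, 0.3, n)

# Bootstrap the sampling variability of the mean vs. a crude "shape"
# feature (here, the 90th percentile of the effect distribution).
boots = rng.choice(effects, size=(2000, n), replace=True)
se_mean = boots.mean(axis=1).std()
se_p90 = np.quantile(boots, 0.9, axis=1).std()
print(se_mean, se_p90)  # the percentile is much noisier than the mean
```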

• Anonymous says:

“Is the idea just to do any interventions that have small average treatment effects but likely larger effects on some unknown subset? …Without knowing who it is affecting, it doesn’t seem very informative.”

Thus an opportunity for additional work to tease out the relative effects.

I feel like in social science people want to do one study and go “boom! There you go, call Congress and let's have some laws!!” Climate change and its various effects are probably a roughly appropriate analog for complexity in the physical sciences, and it's taken thousands and thousands of studies and at least 40 years all told to get where we are today. It would be more than astounding if such a complex phenomenon as human learning could be worked out in a few simple NHST studies.

• Martha (Smith) says:

“It would be more than astounding if such a complex phenomenon as human learning could be worked out in a few simple NHST studies.”

+ many

• jd says:

“Thus an opportunity for additional work to tease out the relative effects.”

That’s my point. An ATE of 0.1 GPA doesn’t say much of anything. It’s conjecture whether it affects a large number of people by a small amount or a small number of people by a large amount, until you know who it affects and by how much. One might even come up with a negative ATE, but if it positively affected a particular subset and negatively affected others, it still might be useful to the appropriate subset. So until more is known, I don’t see a small ATE in this scenario as particularly informative.

And I certainly wouldn’t suggest any NHST studies of any kind.

• Anonymous says:

“That’s my point…And I certainly wouldn’t suggest any NHST studies of any kind.”

+1+1+1+1+1

• confused says:

>>Climate change and its various effects are probably a roughly appropriate analog for complexity in the physical sciences, and it's taken thousands and thousands of studies and at least 40 years all told to get where we are today.

Arguably much more than 40 years. Svante Arrhenius wrote about CO2 from coal burning increasing global temperature over a century ago. But the rate of CO2 emission was much lower then, and the science was just not there to understand why it mattered – a change of a few degrees doesn’t intuitively sound all that significant.

• Martha (Smith) says:

Interesting. Thanks.

• Anonymous says:

“Arguably much more than 40 years. Svante Arrhenius wrote about CO2 from coal burning…over a century ago.”

That crossed my mind. However in the modern understanding of climate change feedbacks/forcings are the primary control on temperature and they vary significantly in sign and magnitude. I don’t know if Arrhenius had any concept of feedbacks/forcings as we know them today.

2. Su says:

Just to clarify, are you suggesting that we interpret ATE (i) based on the area of study and (ii) as a mixture between intervention and no intervention?

• I don't think he means to always treat the effect as a discrete mixture of “effect” and “no effect”. I'm sure Andrew would encourage people to treat the treatment as having continuously varying effects across different proportions of the population. In any realistic scenario, educational interventions will help some people a lot, help more people a little, have very little effect on a bunch of people, and hurt some… The result will be a distribution of effects, with a core, and tails, and asymmetry and the whole thing.

This is *different* from *our uncertainty* about the effect as well. It’s key to keep the frequency distribution and the uncertainty conceptually separate.

• Su says:

Thanks for clarifying.

But I have a follow-up question.
If we were to adjust for variables accounting for variations in the population, then the ATE should not be all that variable across different proportions of the population, right? Or is it that in reality we can't account for these always and hence we need to look at it more as a distribution of effects?

• Martha (Smith) says:

Su said,
“Or is it that in reality we can’t account for these always and hence we need to look at it more as a distribution of effects?”

I think this indeed describes the reality. In teaching statistics, I have tried to emphasize that all we can do is *try* to account for variations in the population, in the ambient conditions etc. It is misleading to say we “can account” or “have accounted” for these possible confounding factors. No statistical analysis can remove all uncertainty.

• Martha is definitely right. But suppose you do have a known variable: for example, a genetic SNP that changes the rate at which the drug is metabolized, so that certain groups who have this SNP are not helped as much by the drug. Then we can estimate a distribution of effect sizes p(effectsize | snp_yes) and p(effectsize | snp_no).

We could look at the shape of these two distributions and ask whether they are mostly the same with just a little shifting and spreading, say, or whether they're very differently shaped… For example, suppose with snp_yes you have one sub-group on whom the drug hardly works at all, so there's a big lump down near 0; but then there's another group where the SNP of interest is combined with a second SNP, and the combination deactivates the enzyme so that it doesn't degrade the drug very fast at all… so you've got another lump of probability out in the same range as the snp_no group, where maybe the drug even works a little better.

So now we can split into p(effectsize | snp_no), p(effectsize | snp_yes, secondsnp_no), and p(effectsize | snp_yes, secondsnp_yes)

There's absolutely no reason why the average over any one of these distributions needs to be anywhere near the average over all of them put together, nor should the scale or shape of the variation necessarily be the same.
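A sketch of that three-way split, with all genotype frequencies and effect sizes invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical genotype frequencies (all made up for the sketch):
snp = rng.random(n) < 0.3             # 30% carry the first SNP
second = snp & (rng.random(n) < 0.2)  # of those, 20% also carry the second

# Hypothetical subgroup effect distributions:
effect = np.where(~snp, rng.normal(1.0, 0.2, n),    # snp_no: drug works
         np.where(second, rng.normal(1.1, 0.2, n),  # both SNPs: works, maybe better
                  rng.normal(0.1, 0.2, n)))         # snp_yes only: barely works

# The overall average sits nowhere near the snp_yes-only subgroup's average.
print(effect.mean(), effect[~snp].mean(), effect[snp & ~second].mean())
```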

• Su says:

Thanks for clearing that up.

3. Micheal says:

Anyone have pointers to research on how we can best communicate this variability in benefit to, for example, teachers deciding if they should implement the intervention? Or to people deciding if they should bother dieting?

• Michael Nelson says:

Only half-joking: Match.com

• Martha (Smith) says:

I don't know of any such research. I do think that an important part of helping teachers, physicians, patients, etc. make decisions involving uncertainty is to help them be aware that there is almost always uncertainty — and that this needs to be done in all statistics courses that professionals take, and all science courses using statistics that professionals take. It also should be addressed in research papers that deal with these subjects — for example, a research paper on the effectiveness of a drug for a certain medical condition should include a discussion section pointing out how the uncertainty in the research findings might best be communicated to the patient, to help the patient make an informed decision. (Yes, I know this is pie-in-the-sky for drugs especially — I know the drug companies are out to sell their product. But I do have the right to dream of an ideal world!)

4. Daniel says:

When you have within-subjects data, you may still be generally interested in the ATE, but you could also add a visualization so we can see the distribution. If it’s precisely estimated to be 0 for 90% and 1 for 10%, that’s really helpful information. In some cases it’ll be a mixture of relatively-discrete effects like that, in some cases it’ll be a uni-modal distribution of effects around the ATE (whether it’s Gaussian or not), and in others, it’ll be a blend (think multi-peaked). Even without identifying the sources of this heterogeneity, just getting a sense of participant-based variation, you can learn. Still, we’d like to know what it is about these people that explains why the effect is larger or smaller for them.

The ATE is a starting point for research. Often just showing that the ATE is at a level we might care about is a useful contribution. How much heterogeneity-probing is enough for the first paper and what heterogeneity-probing can be left to follow-ups?

Identifying systematic heterogeneity (i.e., moderators) is helpful for its practical implications as well as developing a better understanding of the mechanisms by which an intervention is understood to work. Keeping an eye on leftover participant-based variation is also a good idea and sometimes will color the conclusions of the work, though sometimes it actually won’t that much. For example, imagine a distribution of the TE that is just the sum of two Gaussians with a big valley. The ATE is squarely in the valley, pretty meaningless, aside from summarizing the effect overall. Identifying the critical moderator allows you to split it into two Gaussians, each with their own conditional ATE. That’s done a lot for understanding this effect. You could still show each curve since it helps indicate that there is participant-based variability even within each level of the moderator, but if the curves are actually quite tight, it won’t change your conclusions much. If you had just relied on the ATE without caring about how it varied, you wouldn’t get to that level of confidence in your conclusions. If you see “messier” distributions (like 90-10 in the OP), then it’s very important that you seriously tried to get to know your data because it really changes the meaning.
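The two-Gaussians-with-a-valley case can be sketched like this, with a made-up binary moderator:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical: a binary moderator splits the population into two tight
# Gaussians of treatment effects, one centered at +1 and one at -1.
moderator = rng.random(n) < 0.5
te = np.where(moderator, rng.normal(1.0, 0.15, n), rng.normal(-1.0, 0.15, n))

ate = te.mean()  # lands in the valley near 0, where almost nobody is
cond_ates = te[moderator].mean(), te[~moderator].mean()
print(ate, cond_ates)

# Hardly anyone's individual effect is within 0.5 of the ATE.
print((np.abs(te - ate) < 0.5).mean())
```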

5. Roman Folw says:

I feel like this post is more about averages than the ATE.

6. Eric Loken says:

This also relates to the distinction between the average effect and the process or mechanism producing it. To say participants lost 1.5kg on average does NOT invite a discussion about the process by which the treatment induced that level of change. As Andrew’s example highlights – it could be a dramatic mechanism of 15kg loss for some (the relevant process that deserves investigation) combined with non-response for everyone else. Expected values don’t reflect individual level processes. It is *so* tempting to read off the regression coefficient and start speculating about process. I see it done all the time. (And often I’m guilty too, no doubt.)

• Garnett says:

Eric:
Please clarify “combined with non-response for everyone else”.
Do you mean that non-response is a response of *zero* or that the measurement is missing, perhaps due to some non-ignorable mechanism?

• Eric Loken says:

Well, it could be almost anything. It could be that 90% didn’t comply. It could be that the intervention literally has zero impact for 90%. The point is just that the expected value doesn’t represent a process. Again, if a reading program raises reading scores by 10 points on average, that doesn’t immediately warrant hypothesis generating about a uniform mechanism by which scores were increased by 10 points. The net effect is an expected value and not a description of a process. And thinking of the heterogeneity of treatment effects highlights that. As does the idea that a small net effect might be due to a large process for a subset of the population.

It’s kind of like the absurd notion that because people on average gain 10 pounds a decade, then there must be an approximately 10 kilocalorie per day energy imbalance, and thinking that might be a basis for intervention. The net is not the process.
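For the record, the arithmetic behind that figure, using the common (and itself oversimplified) rule of thumb of roughly 3,500 kcal per pound of body fat:

```python
# 10 pounds per decade, at ~3500 kcal per pound of fat (a rule of thumb),
# spread over a decade of days:
pounds_per_decade = 10
kcal_per_pound = 3500
days_per_decade = 365.25 * 10

imbalance_kcal_per_day = pounds_per_decade * kcal_per_pound / days_per_decade
print(round(imbalance_kcal_per_day, 1))  # about 9.6 kcal/day
```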

7. Michael Nelson says:

Medicine, schmedicine. Just try to find hard data on the differential effects of a medical treatment on non-white, non-males. At least historically, you’ll find it’s hit and miss at best. In education, we’ve been tracking demographic differences for decades, probably driven by our social justice streak. (Which also kept us doing subjective qualitative research for a long, long time, but that’s another issue.) At the moment, the vast majority of federal education research dollars go to studies with a special focus explicitly on one or more vulnerable populations (a stupid catch-all term for subgroups that are some combination of non-white, non-male, and challenged behaviorally, mentally, physically, economically, or criminally) within the total sample.

I'm not saying most education researchers do a great job of the design and analysis even most of the time, but the best work is really good, and we're still catching up in methods and statistics compared to many other social sciences (again, due to the social justice/qualitative/subjective history). But IES has been pushing the mantra of “What works, for whom, and under what conditions?” for two decades now, and they've mostly put their money behind rigor and specificity. (With the huge exceptions of over-reliance on p-values and chasing after politicians' pet ideas like charter schools and vouchers and teacher incentives.) The increasing emphasis on implementation fidelity, in fact, is another way of looking for large effects for small groups–and I think it's a model for achieving the kind of total-data analysis you've said is your ideal for all of social science, if we can just get it to be the main analysis instead of a secondary one.

Man, this all sounds very defensive, for a response to a post that was intelligent and optimistic. :) But I think all this needs to be said, for those unfamiliar with the intricacies of education research and whose fields may fetishize simplifying models *cough*economists*cough*. Seriously, though, let us not forget that most of the worst ed research you've highlighted here over the years came from lead authors in econ and psych, not education. Probably they have better PR people.

• yyw says:

Feynman called the field of education research “cargo cult science.” Has anything changed since then? If yes, what are the notable findings? What positive impacts have those findings had on student achievements? Would US students’ math proficiency for example have been worse if all those reforms in pedagogy of the past century were not adopted?

• Michael Nelson says:

You know, I'm familiar with the term, but I'd never read the paper itself. So I read it, and I thank you–it's as enjoyable and insightful as one expects from Feynman. In his talk, Feynman applies the term to every social science, and even to physics. He then points to anecdotes or studies that prove his point–but not in education. Feynman's only articulated complaint about education is that students' scores don't ever seem to go up, so the science must not be working. This is where Feynman's reasoning gets a little muddled. He says science is supposed to eliminate bad ideas and replace them with good ideas, but this is not so. Scientists discover whether ideas are good or bad. Changing practice, at least at that time, was up to civil servants and politicians. And to a lesser extent to the teachers themselves and the professors in schools of education. None of these types were easy to move with science back then, and most are too easy to move with marketing-masquerading-as-science now.

So that’s his first mis-perception, that scientists have as much practical sway in education as they do in building rockets. His second is the resilience of the mind of the so-called typically-developing child. Most of learning is human development plus experience, which makes formal education very robust to bad ideas, especially if the student is typically-developing and has no more disadvantages than average. (The caveat here is that studies at that time rarely looked at atypical or disadvantaged students, because white people, and before that time because such students weren’t allowed in the kinds of schools you could get funding to study. And also white people.) By robust I mean (for example) a bad reading teacher or a bad reading pedagogy usually does little to permanently stunt a student’s learning to read.

His third oversight is not to recognize that the most powerful non-demographic variables for predicting learning performance are found outside of the school. Most true education interventions are only trying to reverse the impact of those variables–abuse, poverty, systemic racism, crime, and so forth–and only after they’ve had that impact. Worse yet, we’ve put many social interventions inside the school, so much so that even the best parents don’t know what to do with their kids during the lockdown. Having an instructor ensure that a child is well-fed and nurtured makes about as much sense as having a cop respond to a broken-down car or a mental health emergency. If we ever do “defund the police,” teachers will be unbelievably jealous.

I would go so far as to say that having teachers focus almost exclusively on instruction, with other forms of care focused elsewhere, is one of education research’s best ideas, with strong empirical support. And it’s a prime example of our lack of influence in running schools.

• Martha (Smith) says:

Brings to mind an anecdote told by a friend who was teaching elementary school, talking about all the things such a teacher does. One that has really stuck in my mind is when one kid in her class came to school wearing a pair of jeans that had the inseam completely torn open, with only one safety pin to hold it together. At lunch time that day, she took him home and sewed up the jeans on her sewing machine. Just part of a day’s work for an elementary school teacher.

• yyw says:

Colleges of education exist because presumably their research and teaching of pedagogy make a positive difference. If there is no clear evidence that they do in practice, then federal funding agencies have to look closely to see if continued investment is warranted.

• Edubuu says:

“Scientists discover whether ideas are good or bad.”
I think you are guilty of burying the lede, to say the least.
Never mind whether your assertion is actually true (hint: it isn't), but everything else you wrote is pretty weak tea in defense of it.

A more defensible position is a paraphrase of George Box’s axiom: All [ideas] are wrong, but some are useful.

• Michael Nelson says:

I know, you wanted practical ideas and actual effect sizes. If you're curious, you can go to the What Works Clearinghouse to see a long list of interventions, by subject and type, with their effect sizes. But one of my favorite case studies is peer tutoring in general and the PALS (peer-assisted learning strategies) intervention in particular–basically, having students break into small groups or pairs and teach one another. Studies in the 90's, carefully-designed (though small-scale) with within-subjects measures, showed “large” effects, in the Cohen sense. As it was scaled up in the 2000's, effect sizes reduced over time, eventually settling around “small.” Part of that is due to larger samples, more diverse samples with greater individual differences, better methods and measures, etc. But there's strong evidence that the main problem is that ed researchers compare treatment classrooms to business-as-usual (BAU) classrooms, and BAU is not static. After its early successes, peer-tutoring started to get worked into professional development and published curricula. We also now know that different interventions manipulate different variables but generally the same small number of mediators, so as good ideas get worked into BAU, as they should, there are diminishing returns for all but the most powerful classroom interventions. Consequently, I'd say that less than half of current education research is about pedagogy, instead focusing on things like noncognitive skills.

• Martha (Smith) says:

Interesting. Thanks.

• yyw says:

Not a fan of group learning based on my own teaching experience, but I will check out the PALS. Thanks for the info.

8. James says:

There is a fair amount of interest from a number of economists on estimating heterogeneous treatment effects. The literature on this is relatively new but has garnered a good bit of attention. Unfortunately, applications tend to be focused in areas where there’s a lot of data, i.e. big tech companies. But I think there is more fundamental interest in the distribution of these effects in some circles.

Though to be fair to your post, Andrew, when working on a paper trying to look at such a distribution, I did have a fellow econ student tell me that these distributions are interesting and that only the average matters. So, the interest is certainly not universal.

Still, I think it’s worth nodding to that literature.

• Andrew says:

James:

I agree that these concerns are not new. See for example this discussion from 2009, but of course you could go back many decades earlier, to all the research on factorial experiments.

• Jeff Smith says:

I kinda like Heckman, Smith and Clements (1997) Review of Economic Studies but then I am one of the authors and so perhaps biased. But the idea of worrying about treatment effect heterogeneity was not new in economics even back then, when we rode horses and churned our own butter.

• Andrew says:

Jeff:

Sure. I was thinking of the literature in the 1950s and 1960s on factorial experiments. But there’s been lots of good stuff on varying treatment effects since then. I wrote a paper on the topic in 2004! Indeed, the very phrase, “average treatment effect” recognizes variation: if there were no variation, what would you be averaging? So these concepts are not new, and they keep coming up, but in the meantime they are often ignored in textbooks and applied work.

9. oncodoc says:

Most treatment advances in oncology are modest for the population studied. I have often considered that this might mean some improvement for all or a big improvement for some with little benefit for many. Trials should separate out the real excellent responders in order to help understand the underlying biology. I even remember some patients who had a great response in clinical trials that were “negative” overall. Our current methods grind everything into sausage, and while I like sausage the individual ingredients may be better separately in some cases.

• Martha (Smith) says:

This brings to mind a recent discussion with someone on clinical trials — I pointed out that since conclusions are often based on average effects, a treatment might be considered “empirically supported” even though some subjects in the clinical trial got worse under the treatment. Her reaction was that this practice was unethical (which, from the precept of “First, do no harm,” it is).

That is the case for many versions of psychotherapy as far as one can tell.

• confused says:

I’ve been concerned for a while about the opposite problem – drugs never reaching the market because the average effect is too small or negative, but they are helpful in a definable (and thus predictable-in-advance) subpopulation.

Certainly with psychiatric meds you often have to try several different things before you find something that really works; what will work for a particular individual isn’t necessarily predictable. So there may be an advantage to having more options out there.

10. Jonathan says:

Interesting post and comments. Thanks. My take: how average effect is viewed depends on the model to which it’s an input. When people complain about a .1 effect, they’re not necessarily being innumerate, since the objection may stand for a more complex model that evaluates worth. That may be dollars in a budget in the end, but it has to get there. What if the promise was to raise ‘all boats’? What if those being helped are, as they tend to be, those most easily helped? Like training programs tend to benefit people like women who left to have children or others who have strong backgrounds with a gap, and not those who, in the context training programs are pushed forward, most need help. Or, indeed, often no one but the most easily helped because the demand for people trained is limited, for whatever reason, to a narrow segment.

I've noticed that drug ads have become more targeted over time. The generic ‘pounding headache' form is now ‘if you have this specific version of this specific diagnosis, then maybe the beneficial effects might be worth the long list of serious side effects'. Maybe sometimes only for a chance at life for a few extra months. The model for development decisions has changed. Some of that is the way we pay for things makes that a rational business decision, and some is that mathematical modeling and analysis now generates these cases for development. (I wonder what would happen with one of those but not both.) So, in drugs, average effect has been filtered into segments. Imagine one of these specific breast cancer drugs evaluated against all breast cancers.

Maybe the underlying lesson is that all boats is a fiction, just like average effect, because some boats float much better than others. And if you impart movement to the system, meaning only distributions versus distribution singular, then not floating well can be the equivalent of sinking. And it rarely is some of the people some of the time.

11. malcolmkass says:

“I hope all this fuss is for more than that. Ouch.”

I’ve seen comments like this before on other reported education research. I suspect these types of beliefs are common with people on the qualitative side who misinterpret the magnitude of these effects.

• Michael Nelson says:

Also, most people don't realize that the size of the effect size (small, medium, large) depends on the field, and on the population. A nice alternative to effect size for this reason is reporting how many hours/days/grades of additional standard instruction would have a similar impact.

12. Ron Kenett says:

I cannot resist commenting on:
“Again, this discrete formulation is an oversimplification—it’s not like the treatment either works or doesn’t work on an individual person. It’s just helpful to understand average effects as compositional in that way. “

Compositional data analysis is a well established domain building on the work of John Aitchison https://en.wikipedia.org/wiki/John_Aitchison. In that community, CODA is not an R package but an acronym.

For an intro to CoDa (compositional data analysis), with applications to association rules, see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3033588

13. Peter says:

You write, “if the treatment’s gonna be applied to everybody…” but in the case of GPA such an extrapolation from small/medium-sample treatment studies to systems-level interventions seems inappropriate. At its most simple, GPA is a property of people in systems, not people in isolation. If an intervention that “raises GPA by 0.1 points” were applied to everybody in a school system, the result would probably be that GPAs in that system would become (implicitly or explicitly) curved such that performance that would have been assigned a 3.1 before the school-wide intervention would afterwards achieve a 3.0 instead, leading to an “effect” of 0 points even when there are real underlying improvements in student behavior.

Maybe this concern I'm expressing here is orthogonal to the one you articulated about averages–I just find it peculiar to see you writing as though this treatment effect (whether average or varying) is a stable property of the intervention that can be expected to describe the effect size of a scaled-up implementation. I'd assume these treatment studies are trying to make a more humble claim–something like, “this intervention has a positive impact on the measures we looked at (p<.05)"–whereas this other question of an effect size estimate that can be extrapolated to systems-level implementations seems ill-formed to me.

• There's also the issue of grade inflation. Performance that in, say, 1980 would have gotten you a C+ may today get you a B… magical improvements!

What’s worse is when you compare something you’re doing today to something that was done, say, a decade ago, and conclude that things are better now when in fact they went backwards. (This is kind of like earning \$1 in 2000 and \$1.09 today: you make *less* today, because after adjusting for inflation your \$1.09 buys only \$0.73 worth of 2000 stuff.)
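The inflation arithmetic in that aside works out as follows. The cumulative price-level factor of about 1.49 is back-solved from the commenter's own numbers, not taken from actual CPI data.

```python
# Commenter's example: $1.00 earned in 2000 vs. $1.09 earned today.
wage_2000 = 1.00
wage_today = 1.09

# Hypothetical cumulative price-level ratio implied by the example
# (the factor that makes $1.09 buy $0.73 of year-2000 goods).
inflation_factor = 1.49

real_wage_today = wage_today / inflation_factor
print(round(real_wage_today, 2))  # 0.73 -- a real pay cut despite nominal growth
```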

14. James Watson says:

It’s worth noting that this is becoming more and more standard in medicine. There is quite a large literature on heterogeneous risk of outcome and risk-stratified treatment effect estimation.
A few key references:
https://trialsjournal.biomedcentral.com/articles/10.1186/1745-6215-11-85
https://pubmed.ncbi.nlm.nih.gov/27375287/
https://www.bmj.com/content/363/bmj.k4245.short

15. Alex Sutton says:

I really welcome this post. As someone who has had a career in meta-analysis in medicine spanning several decades, I think this is a crucial issue that is not given enough consideration. As soon as you recognise the propensity for variation between individuals, you can appreciate how this translates into heterogeneity (i.e., variation) of AVERAGE treatment effects across studies. Using the example in the original post, if you have a higher proportion of responders in a population, then you will get a different average treatment effect. Repeat this, say, ten times in ten studies and you get ten different underlying treatment effects to average over in the meta-analysis (some of which may be truly harmful for some populations, as others have pointed out). Debate has continued as to how we should deal with this (often we only have study-level data, so trying to find patient-level predictors of response is very limited – an argument for obtaining the individual-level data). Perhaps the most important question is: if average effects truly vary across studies, for whom is the average of those average effects (i.e., the meta-analysis result) relevant?
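Alex's point, that varying responder proportions across studies generate heterogeneous average effects, can be sketched in a few lines. The responder proportions and the 1-point responder effect are invented for illustration, not drawn from any real meta-analysis.

```python
import numpy as np

# Ten hypothetical studies whose populations contain different
# proportions of responders (invented numbers).
responder_props = np.linspace(0.05, 0.25, 10)
responder_effect = 1.0  # effect among responders; 0 among everyone else

# Each study's true average treatment effect is its responder
# proportion times the responder effect.
study_effects = responder_props * responder_effect

# A simple meta-analytic average collapses this heterogeneity
# into a single number.
meta_average = study_effects.mean()

print(study_effects)  # ranges from 0.05 to 0.25
print(meta_average)   # 0.15 -- but relevant to which population?
```

No single study population actually experiences the 0.15 average of averages, which is precisely the question Alex raises about the meta-analysis result.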

• Hi Alex.

> As someone who has had a career in meta-analysis in medicine
No longer in that?

> If average effects truly vary across studies
Believe it was in WG Cochran’s last publication where he ended with the observation that treatment variation by time and place is an endless puzzle to sort out. Randomization enables group-based assessments of effects with a high degree of reassurance; all else requires fallible judgments.

16. Andre says:

Yes, but what are the side effects of the research project? Identifying small effects is definitely valuable, but really I think the questions need to be more comprehensive.

One of my roommates during an internship had schizophrenia. He was experimenting with a new medication prescribed to him. I don’t know if it was an RCT, or what. It was definitely an atypical antipsychotic. He had male and female social workers (not sure of their actual qualifications) check up on him every day. But many of these drugs are sedative. He would sleep, all day every day. In his case, it could have been better than the alternative. I can’t say; I never saw him off the medication or during a serious episode. But in his case, from the point of view of the social workers, sleeping all day means that symptoms of psychosis are gone. Problem solved.

But for some atypical antipsychotics, the benefits are truly not worth the side effects. They can cause sedation, diarrhea, weight gain, and reduced social and intellectual function.

The question that is asked (and I’ve seen this in papers when I’m looking at a new med) is, “Are the symptoms of psychosis reduced?” If yes, then the medication is good, and we can prescribe it or get it on the market. I remember reading a study after my mother suggested the drug to me, and the study raised a bunch of questions. Symptoms of psychosis went down for both treatment and control, but marginally more for the treated group; the plots had no uncertainty and the methods section was vague, so I couldn’t tell anything about this effect. Patients were also judged by more than one psychiatrist using psychosis diagnostics from the DSM. And psychotic illnesses like schizophrenia are episodic (they can be dormant for months or possibly never re-occur), and the paper didn’t mention whether patients were in an episode, or a serious episode, or not. So I decided the effect was marginal, if anything, and I didn’t want to waste time and money I don’t have with an apathetic psychiatrist and gamble my brain away with another drug.

But that’s not my point. Does this answer my research question? Sure, but what are the side effects of the study, and do they outweigh the positives of the study, and how do we measure that?

I for one, would rather mess up my kidneys 20 years from now, and be able to make marginal contributions to society or a conversation, than be a vegetable that shits his pants and can’t even work at a restaurant.

I see doctors make this mistake, and it is a mistake, that “this works for other people,” and this can be thought of, I guess, as looking at average treatment effects, as opposed to conditioning on the individual. The best doctors, I’ve noticed, are sympathetic to this fact.

Maybe there’s not some global sodium gold standard that everyone has to abide by; different levels of sodium might affect different people differently. For people with lower blood pressure, it is possible (though I only have 2 observations and unofficial conversations with a friend) that ever-so-slightly higher levels of sodium will help them retain water. I would fit the model myself, for free, and I know how to spend enough time with a dataset to find small signals, but it’s too hard to get the data. And if the methods are too far from the mainstream, journals will ignore it anyway. If we were to investigate this, I wouldn’t be making the grandiose claim that more salt is better and will keep you more hydrated, the kind of claim that’s often required for publication; it would be more that this is an observation and something we should consider when trying to think carefully about keeping our bodies hydrated.

So, to the education study. Did their GPA rise 0.1 points? Sure. But what are the side effects of the intervention? What long-term effects did the intervention have? Did it have negative effects on their personality and social function? Anxiety disorder? I’m not saying your study did, or didn’t; I’m just saying that if there are other long-term negative effects, then your study was a failure, in that there was net damage. But these things are hard to measure.

But if you lack the sensitivity to think about these things, and you’re only worried about measuring one positive aspect of something, then your heart is in the right place, but your mind isn’t. This is analogous to a CEO making a decision that makes his company money but does permanent damage to the environment, or something.

• Martha (Smith) says:

Lots of good points here.

To me, one important criterion for a physician to be a good physician is that they give a careful diagnosis, and then say, “These are the options for treatment”, and give the pros and cons of each option. Then I make the decision on treatment, using the information they have given, as well as other information I have about myself and the conditions in which I live. What medical research needs to do is give physicians the information they need to be a good physician: in particular, to be able to diagnose, list possible treatments, and give the pros and cons of each possibility, so patients can make informed decisions on treatment.

17. Oliver C. Schultheiss says:

I am probably just being naive, but my take on an average that has a non-negligible amount of variance is that there generally are moderators lurking in the background that need to be identified, and associated interaction effects that will supersede the main effect producing the average. After all, the real state of things is almost always more complex than we think and than we can represent by mere main effects. And even that interaction may just hide something far more complex. In other words, main effects (i.e., average effects) are useful fictions in almost all cases. And even an apparently negligible main effect (or average) may hide a disordinal interaction showing that, for instance, a growth mindset has beneficial effects for some people or under some conditions, but may be detrimental for others or under other conditions. Of course, statistical evidence will not suffice in the end — it needs to be backed up by and validated through an in-depth exploration of process and mechanism. But again — if the real state of affairs is more complex than our simple initial notions, then it should be possible to tease out those mechanisms generating the complexity.

• Martha (Smith) says:

Oliver said,
“But again — if the real state of affairs is more complex than our simple initial notions, then it should be possible to tease out those mechanisms generating the complexity.”

I’m not convinced about this. Maybe in some cases it is true with carefully designed, larger data sets, but nature can be incredibly complex, and in myriad ways.

• Oliver C. Schultheiss says:

Martha,

I completely get your point. But I wouldn’t be in the business of science if I didn’t hold out hope here. Call it the illusion of unlimited incremental knowledge gain, along with more complex models and more sophisticated model tests. My discipline — psychology — won’t make real progress unless we get to a sufficiently complex (but not overcomplex) model of the human mind that can be accepted as a unifying framework for further research. Clearly, isolated main effects, inspired by localized, ad-hoc “theorizing”, won’t get us there.

18. Anon says:

Those who actually read the paper in question were treated to a lot of careful discussion about heterogeneity in the growth mindset intervention by the context in which it was applied….