Earlier today we posted, “To Change the World, Behavioral Intervention Research Will Need to Get Serious About Heterogeneity,” and commenters correctly noted that this point applies not just in behavioral research but also in economics, public health, and other areas.

I wanted to follow this up with a question:

If variation in effects is so damn important and so damn obvious, why do we hear so little about it?

Here’s my quick response:

It’s difficult to estimate variability in treatment effects (recall the magic number 16), and in statistics we’re often trained to think that if something can’t be measured or estimated precisely, it can be safely ignored.
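The arithmetic behind that magic number 16 can be sketched in a few lines. This is a minimal illustration (the sample size and noise level are made-up numbers): in a balanced 2x2 design, the interaction estimate is a difference of differences, so its standard error is twice that of the main effect; if the interaction is also assumed to be half the size of the main effect, the z-ratio drops by a factor of 4 and the required sample size goes up by 16.

```python
import numpy as np

# 2x2 design: treatment T (0/1) crossed with subgroup B (0/1),
# n/4 observations per cell, outcome noise sd = sigma.
# Illustrative numbers only.
sigma, n = 1.0, 1000

# Main effect of T: mean(T=1) - mean(T=0), with n/2 obs per arm.
se_main = sigma * np.sqrt(1 / (n / 2) + 1 / (n / 2))    # = 2*sigma/sqrt(n)

# Interaction: [effect of T at B=1] - [effect of T at B=0],
# where each sub-effect uses only n/4 obs per cell.
se_sub = sigma * np.sqrt(1 / (n / 4) + 1 / (n / 4))
se_inter = np.sqrt(2) * se_sub                          # = 4*sigma/sqrt(n)

print(se_inter / se_main)   # 2.0: the interaction's se is twice as large
# If the interaction is also assumed half the size of the main effect,
# the z-ratio drops by a factor of 4, so the required sample size
# (which scales as 1/z^2) goes up by 4^2 = 16.
```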

When I talk about embracing variation and accepting uncertainty, this is one reason why.

**P.S.** Thanks to Diana for the above photo of Sisi, who is really good at curve fitting and choice of priors.

One reason for the relative lack of discussion wrt effect size, relative variable importance, etc., is the lack of literature explicitly developing methods for evaluation.

Here is a link to Ulrike Grömping’s archive of papers about it, as well as her R package, relaimpo.

https://prof.beuth-hochschule.de/groemping/software/relaimpo/
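For the curious, relaimpo’s default LMG metric averages each predictor’s sequential contribution to R² over all orderings of the predictors. Here is a rough Python sketch of that idea (the data-generating numbers are invented for illustration; the actual package does much more):

```python
import itertools
import numpy as np

def r2(X, y):
    # R^2 of an OLS fit with intercept; empty predictor set -> 0.
    if X.shape[1] == 0:
        return 0.0
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

def lmg(X, y):
    # Average each predictor's sequential R^2 gain over all orderings.
    p = X.shape[1]
    shares = np.zeros(p)
    perms = list(itertools.permutations(range(p)))
    for perm in perms:
        for i, j in enumerate(perm):
            before = list(perm[:i])
            shares[j] += r2(X[:, before + [j]], y) - r2(X[:, before], y)
    return shares / len(perms)

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # correlated predictors
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)
print(lmg(X, y))   # the shares sum to the full model's R^2
```

The appeal of the LMG decomposition is exactly that the shares are non-negative and sum to the full model’s R², even when predictors are correlated.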

What’s amazing to me about this is that at the companies I’ve worked for, it would be unthinkable not to care about heterogeneity. There’s huge value in finding interaction effects or slicing the data to find the segment or conditions where a given treatment works. And, of course, initial studies are really just EDA, since everything is rolled out with backtesting to confirm the hypothesized effects.

Why this is true in industry is as simple as: Finding heterogeneity is a competitive advantage, executives understand that (even if they wouldn’t use the word), and people who identify profitable heterogeneity get larger bonuses and are promoted.

As an outsider, it’s fascinating to me that there’s less concern. I imagine that the biggest reason that this is true is that there simply isn’t an incentive to find it. Perhaps there’s a cultural solution: If it becomes the expectation to highlight possible sources of heterogeneity in write-ups (or to share the data) and publishing follow-up studies confirming that heterogeneity is rewarded, presumably it’ll happen.

I can’t help but wonder if there’s a more fundamental issue. So many studies, particularly in healthcare, are aimed at value creation. Do we need to do a better job of tying value created by research to value received by the researcher?

OccasionalReader said:

“As an outsider, it’s fascinating to me that there’s less concern. I imagine that the biggest reason that this is true is that there simply isn’t an incentive to find it. Perhaps there’s a cultural solution: If it becomes the expectation to highlight possible sources of heterogeneity in write-ups (or to share the data) and publishing follow-up studies confirming that heterogeneity is rewarded, presumably it’ll happen.”

Good points.

I think this is something we are just starting to get at in a lot of education field research. Yes, interactions are hard, so more and more studies treat the question of effects on subgroups as the primary RQ, the main effect of interest. Even if that means giving the intervention to an entire school and then assessing just three students per classroom (which is feasible power-wise because of clustering) on some outcomes.

Andrew said,

“If variation in effects is so damn important and so damn obvious, why do we hear so little about it?”

I think one reason we hear so little about variation in effects is that it is far from obvious to many people.

Andrew also said,

“It’s difficult to estimate variability in treatment effects (recall the magic number 16), and in statistics we’re often trained to think that if something can’t be measured or estimated precisely, that it can be safely ignored.”

This seems to support my statement above.

Dixx said,

“One reason for the relative lack of discussion wrt effect size, relative variable importance, etc., is the lack of literature explicitly developing methods for evaluation.”

I partly agree and partly disagree. I think we first need to give more attention to variation in effects as just a part of reality/nature, and this needs to be emphasized from the get-go, i.e., starting in Stats 101. We need to keep talking about it, e.g., remembering to point out in every study that measures only an average effect size that the results require caution in application, precisely because of variation in effects. And, of course, as Dixx says, we also need literature explicitly developing methods for evaluating that variation.

To ramble on a bit about possibilities for such emphasis:

One example to illustrate is talking about things like heredity. I often think about a picture I have of my mother and her two siblings when they were ages ranging from about 16 to 21. My mom was the oldest but the shortest, her brother was the middle child but the tallest, and her sister was the youngest and the middle in height. There are other aspects of siblings that illustrate variability (e.g., interests, talents). And we can also point to the variability in terms of the biological processes of mitosis and meiosis. We need to talk about these things early and often.

I made the comment below to the recent post on “The two most important formulas in statistics” but I think it might be useful to repeat it here because the magic number 16 does not seem right to me.

Here was my comment:

I missed the 16 discussion last year, but if I had seen it I would have made a few points.

The usual definition of an interaction between factors A and B is (at least for two level factors) the difference between the effect of A at high B and the effect of A at low B, divided by two. The division is to make all the standard errors the same.

Using this definition, and if you assume that interactions are about half the size of main effects, 16 becomes 4.
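To spell out that rescaling: with the divide-by-two convention, the standard error of the interaction matches that of the main effects, so in the half-size scenario the 16 indeed becomes a 4. A quick check (illustrative numbers only):

```python
import numpy as np

sigma, n = 1.0, 1000                            # illustrative values

se_main = 2 * sigma / np.sqrt(n)                # diff of two means, n/2 each
se_effect_at_b = 2 * sigma / np.sqrt(n / 2)     # same diff, but n/4 per cell
se_inter_raw = np.sqrt(2) * se_effect_at_b      # difference of the two effects
se_inter_halved = se_inter_raw / 2              # the divide-by-two convention

print(se_inter_halved / se_main)  # 1.0: all standard errors now match
# A half-sized interaction then needs only (1/0.5)^2 = 4x the sample.
```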

But maybe it should be 1, at least in physical experiments. In their Bayesian method for finding active factors in fractional factorial designs, Meyer and Box (Journal of Quality Technology, 1993) assume a prior for active effects as N(0, gamma * sigma^2) and a prior for inactive effects as N(0, sigma^2), where they suggest that gamma be chosen to minimise the probability of finding no active factors. They say “important main effects and interactions tend, in our experience, to be of roughly the same order of magnitude, justifying the parsimonious choice of one common scale parameter gamma…”. In the BsProb() command in R in the BsMD package (Barrios, 2020, based on Meyer’s code) the default value of gamma is 2 (although it is possible to set different gamma values for main effects and interactions, but this appears to be rarely done).
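To make the mixture-prior idea concrete, here is a toy Python sketch. This is not the actual BsProb computation, and sigma, gamma, and the prior probability of activity are all illustrative assumptions; it just shows how the posterior probability that an effect is “active” compares the marginal density of its estimate under N(0, gamma * sigma^2) against N(0, sigma^2):

```python
from math import exp, pi, sqrt

def normal_pdf(x, var):
    # Density of N(0, var) at x.
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

def posterior_active(beta_hat, sigma2=1.0, gamma=2.0, prior_active=0.25):
    # Two-component mixture: "active" effects ~ N(0, gamma * sigma^2),
    # "inactive" effects ~ N(0, sigma^2). All parameter values are
    # made-up defaults for illustration (gamma=2 echoes the BsProb default).
    m_active = normal_pdf(beta_hat, gamma * sigma2)
    m_inactive = normal_pdf(beta_hat, sigma2)
    num = prior_active * m_active
    return num / (num + (1 - prior_active) * m_inactive)

print(posterior_active(0.5))  # small estimated effect: probably inactive
print(posterior_active(3.0))  # large estimated effect: probably active
```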

Also, for two level experiments, it is quite common to get the magnitude of the interaction approximately equal to the magnitude of the two main effects. This just means that one combination of the two factors is unusually high or low and the other three combinations give about the same response (Daniel, 1975, page 135).

Neil:

Yes, it depends on the context. I’m thinking of the sort of social and environmental science problems that I have worked on.

While 16 is important to remember, it’s also important to remember that sometimes the interaction is much, much larger than the main effect. I recently read a study with examine-silently, describe, and control conditions before subjects attempted to draw a subject accurately. Examining silently and describing were about equally better than control at improving drawings, by a small amount. But the largest effect by far was the length of the description: drawings that followed long descriptions were vastly better, and those that followed short ones were worse.

So, while it’s true that 16x the sample is needed to detect an interaction half the size of the main effect, it’s not true that interactions are constrained to be similar in size to main effects.

There is actually quite a long history and large literature on quantile treatment effects and related phenomena, but research often tends to be compartmentalized so that there is little communication between strands of the literature. Lehmann’s Nonparametrics book provides a very cogent definition and this has been elaborated. It is important in my view to distinguish what I would call observable heterogeneity, which is often confined to estimating conditional mean models perhaps with interactions, and a willingness to admit that there is something more going on than models of conditional expectations can reveal. Estimation of conditional quantile models have been quite helpful in some drug trials where response is negligible for many patients, but effective for some others. Mixture models and recent work on empirical Bayes methods for nonparametric mixtures is another place where heterogeneity is a central concern, and offers a valuable complement to classical shrinkage methods and hierarchical modeling.
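The drug-trial scenario described above (negligible response for many patients, strong response for some) is easy to simulate, and it shows what the average effect hides but the quantile treatment effects reveal. A sketch with made-up numbers:

```python
import numpy as np

# Simulated trial in the spirit described above: the treatment does nothing
# for most patients but helps a responsive 20% a lot. Illustrative only.
rng = np.random.default_rng(0)
n = 100_000

control = rng.normal(0.0, 1.0, n)
responder = rng.random(n) < 0.2
treated = rng.normal(0.0, 1.0, n) + np.where(responder, 3.0, 0.0)

ate = treated.mean() - control.mean()               # average effect, ~0.6
qs = [0.1, 0.5, 0.9]
qte = np.quantile(treated, qs) - np.quantile(control, qs)

print(ate)   # modest on average
print(qte)   # small at the 10th percentile, large at the 90th
```

The average effect looks modest and uniform; the quantile treatment effects make it clear the benefit is concentrated in a subset of the distribution.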

The point is to look for patterns in the variation with the goal of figuring out what “laws” the system is following. Not just calling all differences from the average “noise/error” then coming up with one-size-fits-all claims valid for some average scenario that doesn’t exist in reality.

“If variation in effects is so damn important and so damn obvious, why do we hear so little about it?”

This is an interesting question so I took a day to think about it. I think it is because estimating variation/heterogeneity is so far removed from theorizing. All the glamour in science goes to the theorist. This ties back to statistical significance. If I come up with a clever theory about an effect, I want to know if the effect exists. But to know how big it is, that is perceived as technician work that has no impact upon the viability of the theory. From this viewpoint, statistics is an engineering discipline, a tool you take off the shelf to carve your data into a recognizable shape once you know what it is supposed to look like.

Theorizing about how the data was generated is how you figure out the heterogeneity, though. The problem is people running studies seemingly uninformed by any kind of real theory, which amount to checking if paint prevents rust when 80% of the cases were already painted (so it didn’t matter) or didn’t get enough paint (so it also didn’t matter). Then you spend lots of money on a huge sample size to see whether paint prevents rust on average.

I couldn’t find a good review with a quick search, but my impression is that the NNT for most medical treatments is something like 5 to 100. That would mean 90%+ of the patients should not be getting the treatment.
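The arithmetic, for what it’s worth (the NNT values here are illustrative): NNT is the reciprocal of the absolute risk reduction, so an NNT of k means roughly 1 in k treated patients benefits.

```python
# NNT (number needed to treat) = 1 / absolute risk reduction, so an NNT
# of k means only about 1 in k treated patients benefits. Illustrative only.
def fraction_without_benefit(nnt):
    return 1 - 1 / nnt

for nnt in (5, 20, 100):
    print(nnt, f"{fraction_without_benefit(nnt):.0%} of treated patients see no benefit")
```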