Jason Collins discusses a paper by Milkman et al. that presented “a megastudy testing 54 interventions to increase the gym visits of 61,000 experimental participants.” Some colleagues and I discussed that paper a while ago; I think we were planning to write something up about it, but I don’t remember what happened with that.
As I recall, the study had two main results. First, researchers had overestimated the effects of various interventions: basically, people thought they had great ideas for increasing fitness participation, but the real world is complicated, and most things don’t work as effectively as you might expect. Second, regarding the interventions themselves, evidence was mixed: even in a large study it can be hard to detect average effects.
The overestimation of effect sizes is consistent with things we’ve seen before in other areas of policy research. Past literature tends to report inflated effect sizes: the statistical significance filter, both within and between studies, leads to a selection bias which is always a concern but particularly so when improvements are incremental (there is no magic bullet that will get people to the gym). Beyond this, the effects we typically envision are the effects when the treatment is effective. When considering the average treatment effect, we’re also averaging over all those people for whom the effect is near zero, as illustrated in Figure 1d of this paper.
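To see how the significance filter by itself can inflate reported effects, here is a minimal simulation sketch in Python. All the numbers are made up for illustration (a true effect of 0.1 with a standard error of 0.1, in arbitrary units); they are assumptions, not values from the Milkman et al. study, and the function name is hypothetical.

```python
import random
import statistics


def simulate_significance_filter(true_effect=0.1, se=0.1, n_studies=10_000, seed=0):
    """Simulate many noisy studies of one small effect, then apply a significance filter.

    All defaults are illustrative assumptions, not estimates from any real study.
    """
    rng = random.Random(seed)
    # Each study reports an unbiased but noisy estimate of the same true effect.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Keep only the estimates that reach conventional two-sided significance.
    significant = [e for e in estimates if abs(e) / se > 1.96]
    return {
        "mean_all": statistics.mean(estimates),
        "mean_significant": statistics.mean(significant),
        "share_significant": len(significant) / n_studies,
    }


if __name__ == "__main__":
    for name, value in simulate_significance_filter().items():
        print(f"{name}: {value:.3f}")
```

Under these assumed numbers, the average over all studies recovers the true effect, while the average over the studies that pass the filter sits well above it. That gap is the selection bias described above.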
The big problem: Where do the interventions come from?
Collins’s discussion seems reasonable to me. In particular, I agree with his big problem about the design of this “mega-study,” which is that there’s all sorts of rigor in the randomization and analysis plan, but no rigor at all when it comes to deciding what interventions to test.
Unfortunately, this is standard practice in policy analysis! Indeed, if you look at a statistics book, including mine, you’ll see lots and lots on causal inference and estimation, but nothing on how to come up with the interventions to study in the first place.
Here’s how Collins puts it:
At first glance, the list of 54 interventions suggests the megastudy has an underlying philosophy of “throw enough things at a wall and surely something will stick”. . . .
Fair enough. But this concession implicitly means the authors have given up on developing an understanding of human decision making that might allow us to make predictions. Each hypothesis or set of hypotheses they tested concern discrete empirical regularities. They are not derived from or designed to test a core model of human decision making. We have behavioural scientists working as technicians, seeking to optimise a particular objective with the tools at hand. . . .
A big problem here is that “tools at hand” are not always so good, especially when these tools are themselves selected based on past noisy and biased evaluations. What are those 54 interventions, anyway? Just some things that a bunch of well-connected economists wanted to try out. Well-connected economists know lots of things, but maybe not so much about motivating people to go to the gym.
A related problem is variation: These treatments, even when effective, are not simply push-button-X-and-then-you-get-outcome-Y. Effects will be zero for most people and will be highly variable among the people for whom effects are nonzero. The result is that the average treatment effect will be much smaller than you expect. This is not just a problem of “statistical power”; it’s also a conceptual problem with this whole “reduced-form” way of looking at the world. To put it another way: Lack of good theory has practical consequences.
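The arithmetic of that dilution can be sketched with a small simulation. This is a toy example under stated assumptions: 10% of people respond to the treatment, responders gain on average 1.0 units of the outcome (say, gym visits per week) with a standard deviation of 0.5, and everyone else has an effect of exactly zero. None of these numbers come from the study, and the function name is hypothetical.

```python
import random
import statistics


def simulate_diluted_effect(n_people=100_000, share_responders=0.10,
                            mean_effect=1.0, sd_effect=0.5, seed=1):
    """Compare the average treatment effect with the effect among responders.

    All defaults are illustrative assumptions, not estimates from any real study.
    """
    rng = random.Random(seed)
    all_effects = []
    responder_effects = []
    for _ in range(n_people):
        if rng.random() < share_responders:
            # Responders get a variable, nonzero individual effect.
            effect = rng.gauss(mean_effect, sd_effect)
            responder_effects.append(effect)
        else:
            # For most people the treatment does nothing.
            effect = 0.0
        all_effects.append(effect)
    return {
        "ate": statistics.mean(all_effects),
        "mean_among_responders": statistics.mean(responder_effects),
        "share_responders": len(responder_effects) / n_people,
    }


if __name__ == "__main__":
    for name, value in simulate_diluted_effect().items():
        print(f"{name}: {value:.3f}")
```

The effect we tend to picture is the one among responders, but the average treatment effect is roughly that number multiplied by the share of people who respond at all, so in this toy setup it is about a tenth as large.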
Psychology and other behavioral sciences have lots of “theory” (lists of causal mechanisms?) about how to influence behavior, and most approaches are consistent with one theory or another. But I agree that whether a theory will “work” in a given case is difficult to predict. If there is a mistake in the kind of research you discuss, perhaps it is that the problem is viewed as part of science, when it is better seen as a problem of engineering.
In general, engineers do not do controlled experiments. The resources required to test every little modification of a design are prohibitive. Instead, they go through a cycle of build-test-build-test, observing each new build as a sort of case study. If you run a gym and want more customers, you try various things, see what works, and keep making changes until you can’t improve the results anymore. Professors and other educators do this when they design courses. Methods of psychotherapy, and surgery, are developed this way. So is computer software, such as a web browser, and of course so are mechanical objects like automobiles and pepper mills.
While I agree that this incremental optimization is quite useful in many cases, it has one significant drawback: If you consider each improvement as a random experiment, you have only one observation (n=1) in that experiment. In an engineering or software lab that’s surely less of a problem than in a noisy gym membership study. In particular, imagine some intervention for that gym that coincides with a new TikTok trend that you need a gym membership for. Since n=1, we cannot distinguish between the effect of TikTok and the effect of the actual intervention. That is, if we know about that trend; otherwise we will simply overestimate the effect of our intervention. The latter is particularly problematic, since we may allocate resources to further useless treatments down the road. Therefore, I consider such incremental optimization problematic in the context of social science field studies.
I love the following article which explains this issue much better than I: https://slimemoldtimemold.com/2023/02/02/n1-hidden-variables-and-superstition/
Jonathan:
I think that engineers do lots of controlled experiments. They’re just focused more on the treatment and the outcome and less on things that statisticians tend to obsess on such as randomization and sample size. If someone decides to do X in situation A, then do Y in situation B, and compare the outcomes, that’s a controlled experiment!
I agree that engineers (in the broad sense that includes designers of just about anything) sometimes do experiments. But the build-test framework does not always require comparison of conditions, since feedback comes in other forms that provide specific information that is useful for improvement (e.g., bug reports, complaints from clients, devices that keep breaking in the same place).
Another way of stating my argument is that these studies that compare one kind of nudge to another cannot usually draw general conclusions about “types” of nudges, since each type can go through a process of refinement. And some interventions in these studies may suffer from the absence of any refinement.
Why is going to the gym being optimized rather than exercise itself?
Most people can just go to the park, play some sports, take stairs instead of the elevator, and so on.
Probably because it’s easier to measure “how many times a week did subjects go to the gym?” than to estimate all sources of strength or aerobic training in their lives. And strength training (e.g.) offers benefits that no other type of training offers, and interventions to increase working out in a gym are different than interventions to encourage commuting by bicycle or joining a running club.
But gyms are probably the worst possible option for strength training (with exercise machines). Bodyweight exercises are much better for the general population, training various muscle groups together, resulting in functional strength. And they require minimal equipment (a pullup bar or somewhere to hang gymnastic rings).
For a typical adult who is going for the health benefits, 20-25 minutes of bodyweight resistance exercise (includes 3 sets of 8-12 reps for two muscle groups, w/ warmup) 2-3 times a week can accomplish a lot. This can be done at home, or in a park, with minimal extra time loss for transit etc.
I exercise regularly (bodyweight for resistance training and cardio, biking to work for cardio, handstands and similar for balance, and active stretching), but I have never been to a gym in my life. If that was my only option for exercise, I would probably not do it, or do much less, because it is such a hassle and I don’t like the environment.
Another option is yoga. There are now great apps with various workouts, categorized by strength level and duration. If you have 10 minutes, you can still do a meaningful yoga session. A great isometric workout.
Conflating gyms with exercise (of any kind) is questionable, especially these days. “Easier to measure” is not an excuse to measure the wrong thing. IMO it would have been better to give up a lot of the sample size but measure a meaningful outcome (strength gain over time for various muscle groups, indicators of cardio health, etc).
Every gymnasium I have been in has free weights and barbells, which are hard to have at home because you need a big set but only use a fraction of it at any stage of your training. Body weight training is good but has limits, few people do difficult bodyweight training, and an intervention to encourage attending a gym is different than an intervention to encourage demanding bodyweight exercises.
I agree that reaching extreme levels with bodyweight exercises (eg one-armed pullups) can be super-difficult, but that is a red herring.
For the general population, as a public health intervention, bodyweight exercises that are within reach for everyone are perfectly feasible and include smooth progressions that start at very basic levels (eg look up wall pushups). Same applies to yoga.
I think it is unfortunate that people associate exercise with a dedicated setting (eg a gym) and/or tons of equipment. These just increase the barriers to entry and make it much less likely that people can integrate exercise, especially resistance training, into their weekly routine.
I think it is generally agreed that exercise has a nice dose response curve for most people, ie even small amounts provide a benefit. But going to the gym for 10-15 minutes of exercise is not going to make sense for most people.
I remember reading this other paper by the same lead author (and lots of co-authors): https://www.pnas.org/doi/10.1073/pnas.2101165118 It has exactly the same approach – lots of interventions, mostly unrelated, just trying everything and seeing what works – but in the context of text-message based nudges to get a doctor’s appointment to get a vaccine. Some interesting and important results – ‘we’ve got a flu shot with your name on it waiting for you’ seems to work best – but very little hypothesizing/discussion. I am impressed by studies like this (and jealous of her research budget…) but they also frustrate me because they don’t answer *why* a particular intervention works (or doesn’t) in this particular situation. They don’t even try. A quick scan suggests there isn’t a lot of overlap between the nudges tested in the gym study and the vaccine study. Even a simple comparison like that would be enlightening: oh, this nudge works for the gym but not for the doctor’s appointment, or something.