[image of cat lifting weights]

A graduate student who wishes to remain anonymous writes:

I was wondering if you could answer an elementary question which came to mind after reading your article with Carlin on retrospective power analysis.

Consider the field of exercise science, and in particular studies on people who lift weights. (I sometimes read these to inform my own exercise program.) Because muscle gain, fat loss, and (to a lesser extent) strength gain happen very slowly, and these studies usually last a few weeks to a few months at most, the effect sizes are all quite small. This is especially the case when comparing any two interventions that are not wholly idiotic, looking for a difference in means.

For example, a recent meta-analysis compared non-periodized to periodized strength training programs and found an effect size of approximately 0.25 standard deviations in favor of periodized programs. I’ll use this as an example, but essentially all other non-tautological, practically relevant effects are around this size or less. (Check out the publication bias graph (Figure 3, page 2091), and try not to have a heart attack when you see that someone reported an effect size of 5 standard deviations. More accurately, that’s after some mixed-effects model was applied, but still…)

In research on such programs, it is not uncommon to have around N=10 in both the control and experimental group. Sometimes you get N=20, maybe even as high as N=30 in a few studies. But that’s about the largest I’ve seen.

Using an online power calculator, I find you would need well over N=100 in each group to get acceptable power (say 0.8). This is problematic for the reasons outlined in your article.
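(As a check on this claim, here’s a quick sketch of the calculation using R’s built-in power.t.test; the particular online calculator isn’t specified, but any standard two-sample calculation should agree to within rounding:)

```r
# Sample size per group for a two-sample t-test to detect a
# difference of 0.25 sd with 80% power at the usual alpha = 0.05
pt <- power.t.test(delta = 0.25, sd = 1, power = 0.8, sig.level = 0.05)
ceiling(pt$n)  # per-group n: well over 100
```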

It seems practically impossible to recruit this many subjects for something like a multi-month weight training study. Should I then conclude that doing statistically rigorous exercise science—on the kinds of non-tautological effects that people actually care about—is impossible?

(And then on top of this, there are concerns about noisy measurements, ecological validity, and so on. It seems that the rot that infects other high-noise, small-sample disciplines also infects exercise science, to an even greater degree.)

My reply:

You describe your question as “elementary” but it’s not so elementary at all! Like many seemingly simple statistics questions, it gets right to the heart of things, and we’re still working on it in our research.

To start, I think the way to go is to have detailed within-person trajectories. Crude between-person designs just won’t cut it, for the reasons stated above. (I expect that any average effect sizes will be much less than 0.25 sd’s “when comparing any two interventions that are not wholly idiotic.”)

How to do it right? At the design stage, it would be best to try multiple treatments per person. And you’ll want multiple, precise measurements. This last point sounds kinda obvious, but we don’t always see it because researchers have been able to achieve apparent success through statistical significance with any data at all.

The next question is how to analyze the data. You’ll want to fit a multilevel model with a different trajectory for each person: most simply, a varying-intercept, varying-slope model.
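In lme4/rstanarm formula notation, that varying-intercept, varying-slope model looks like this (a sketch; the variable names match the simulation code later in this post):

```r
# Each person gets their own intercept and their own slope on time;
# the coefficient on exposure (treatment received so far) is the
# treatment effect we want to estimate.
f <- y ~ (1 + time | person_id) + time + exposure
```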

What would help is to have an example that’s from a real problem and also simple enough to get across the basic point without distraction.

I don’t have that example right here so let’s try something with fake data.

**Fake-data simulations and analysis of between- and within-person designs**

The underlying model will be a default gradual improvement, with the treatment being more effective than the control. We’ll simulate fake data, then run the regression to estimate the treatment effect.

Here’s some R code:

```r
## 1.  Set up
setwd("/Users/andrew/AndrewFiles/teaching/multilevel_course")
library("rstan")
library("rstanarm")
library("arm")
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

## 2.  Simulate a data structure with N_per_person measurements on each of J people
J <- 50               # number of people in the experiment
N_per_person <- 10    # number of measurements per person
person_id <- rep(1:J, rep(N_per_person, J))
index <- rep(1:N_per_person, J)
time <- index - 1     # time of measurements, from 0 to 9
N <- length(person_id)
a <- rnorm(J, 0, 1)
b <- rnorm(J, 1, 1)
theta <- 1
sigma_y <- 1

## 3.  Simulate data from a between-person experiment
z <- sample(rep(c(0,1), J/2), J)
y_pred <- a[person_id] + b[person_id]*time + theta*z[person_id]*time
y <- rnorm(N, y_pred, sigma_y)
z_full <- z[person_id]
exposure <- z_full*time
data_1 <- data.frame(time, person_id, exposure, y)

## 4.  Simulate data from a within-person experiment: for each person, do one
##     treatment for the first half of the experiment and the other treatment
##     for the second half.
z_first_half <- z
T_switch <- floor(0.5*max(time))
z_full <- ifelse(time <= T_switch, z_first_half[person_id], 1 - z_first_half[person_id])
for (j in 1:J){
  exposure[person_id==j] <- cumsum(z_full[person_id==j])
}
y_pred <- a[person_id] + b[person_id]*time + theta*exposure
y <- rnorm(N, y_pred, sigma_y)
data_2 <- data.frame(time, person_id, exposure, y)

## 5.  Graph the simulated data
pdf("within_design.pdf", height=7, width=10)
par(mfrow=c(2, 2))
par(mar=c(3,3,3,1), mgp=c(1.5, .5, 0), tck=-.01)
plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n",
     bty="l", main="Between-person design:\nControl group")
for (j in 1:J){
  ok <- data_1$person_id==j
  if (z[j] == 0){
    points(time[ok], data_1$y[ok], pch=20, cex=.5)
    lines(time[ok], data_1$y[ok], lwd=.5, col="blue")
  }
}
plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n",
     bty="l", main="Between-person design:\nTreatment group")
for (j in 1:J){
  ok <- data_1$person_id==j
  if (z[j] == 1){
    points(time[ok], data_1$y[ok], pch=20, cex=.5)
    lines(time[ok], data_1$y[ok], lwd=.5, col="red")
  }
}
plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n",
     bty="l", main="Within-person design:\nControl, then treatment")
for (j in 1:J){
  ok <- person_id==j
  if (z[j] == 0) {
    points(time[ok], data_2$y[ok], pch=20, cex=.5)
    lines(time[ok&time<=T_switch], data_2$y[ok&time<=T_switch], lwd=.5, col="blue")
    lines(time[ok&time>=T_switch], data_2$y[ok&time>=T_switch], lwd=.5, col="red")
  }
}
plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n",
     bty="l", main="Within-person design:\nTreatment, then control")
for (j in 1:J){
  ok <- person_id==j
  if (z[j] == 1) {
    points(time[ok], data_2$y[ok], pch=20, cex=.5)
    lines(time[ok&time<=T_switch], data_2$y[ok&time<=T_switch], lwd=.5, col="red")
    lines(time[ok&time>=T_switch], data_2$y[ok&time>=T_switch], lwd=.5, col="blue")
  }
}
dev.off()

## 6.  Fit models using rstanarm
fit_1 <- stan_glmer(y ~ (1 + time | person_id) + time + exposure, data=data_1)
fit_2 <- stan_glmer(y ~ (1 + time | person_id) + time + exposure, data=data_2)
print(fit_1)
print(fit_2)
```

And here are the simulated data from these 50 people with 10 measurements each.

I'm simulating two experiments. The top row shows simulated control and treatment data from a between-person design, in which 25 people get control and 25 get treatment, for the whole time period. The bottom row shows simulated data from a within-person design, in which 25 people get control for the first half of the experiment and treatment for the second half; and the other 25 people get treatment, then control. In all these graphs, the dots show simulated data and the lines are blue or red depending on whether control or treatment was happening:

[image of the four graphs of simulated data]

In this case the slopes under the control condition have mean 1 and standard deviation 1 in the population, and the treatment effect is assumed to be a constant, adding 1 to the slope for everyone while it is happening.

Here's the result from the regression with the simulated between-person data:

```
 family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) 0.0    0.2
time        1.2    0.2
exposure    0.5    0.3
sigma       1.0    0.0

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 0.91
           time        0.91     0.11
 Residual              1.01
Num. levels: person_id 50
```

The true value is 1 but the point estimate is 0.5. That's just randomness; simulate new data and you might get an estimate of 1.1, or 0.8, or 1.4, or whatever. The relevant bit is that the standard error is 0.3.

Now the within-person design:

```
 family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) -0.1   0.2
time        1.0    0.1
exposure    1.1    0.1
sigma       1.1    0.0

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 0.93
           time        0.96     0.11
 Residual              1.12
Num. levels: person_id 50
```

With this crossover design, the standard error is now just 0.1.

**A more realistic treatment effect**

The above doesn't seem so bad: sure, 50 is a large sample size, but we're able to reliably estimate that treatment effect.

But, as discussed way above, a treatment effect of 1 seems way too high, given that the new treatment is being compared to some existing best practices.

Let's do it again but with a true effect of 0.1 (thus, changing "theta = 1" in the above code to "theta = 0.1"). Now here's what we get.

For the between-person data:

```
stan_glmer
 family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) 0.2    0.2
time        1.0    0.2
exposure    0.0    0.3
sigma       1.0    0.0

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 1.1
           time        1.0      0.24
 Residual              1.0
Num. levels: person_id 50
```

For the within-person data:

```
stan_glmer
 family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) 0.2    0.2
time        1.0    0.1
exposure    0.1    0.1
sigma       1.0    0.0

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 0.98
           time        0.95     0.32
 Residual              1.00
Num. levels: person_id 50
```

The key here: The se's for the coefficient of "exposure"---that is, the se's for the treatment effect---have not changed; they're still 0.3 for the between-person design and 0.1 for the within-person design.

**What to do, then?**

So, what to do, if the true effect size really is only 0.1? I think you have to study more granular output. Not just overall muscle gain, for example, but some measures of muscle gain that are particularly tied to your treatment.

To put it another way: if you think this new treatment might "work," then think carefully about *how* it might work, and get in there and measure that process.

**Lessons learned from our fake-data simulation**

Except in the simplest settings, setting up a fake-data simulation requires you to decide on a bunch of parameters. Graphing the fake data is in practice a necessity in order to understand the model you're simulating and to see where to improve it. For example, if you're not happy with the above graphs---they don't look like what your data really could look like---then, fine, change the parameters.

In very simple settings you can simply suppose that the effect size is 0.1 standard deviations and go from there. But once you get to nonlinearity, interactions, repeated measurements, multilevel structures, varying treatment effects, etc., you'll have to throw away that power calculator and dive right in with the simulations.
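To make that concrete, here is a minimal sketch of what a simulation-based power calculation can look like for the between-person design above. To keep it fast, it fits a plain lm() rather than stan_glmer(), so the standard errors ignore the multilevel structure; treat it as a template to adapt, not a finished analysis:

```r
# Estimate power by simulation: repeatedly generate fake data from the
# assumed model, fit a regression, and count how often the treatment
# effect is clearly distinguished from zero.
sim_power <- function(theta, J = 50, n_times = 10, n_sims = 200) {
  hits <- replicate(n_sims, {
    person_id <- rep(1:J, each = n_times)
    time <- rep(0:(n_times - 1), J)
    a <- rnorm(J, 0, 1)                # person-level intercepts
    b <- rnorm(J, 1, 1)                # person-level slopes
    z <- sample(rep(c(0, 1), J/2))     # randomize treatment assignment
    exposure <- z[person_id] * time
    y <- rnorm(J * n_times,
               a[person_id] + b[person_id]*time + theta*exposure, 1)
    est <- coef(summary(lm(y ~ time + exposure)))["exposure", ]
    abs(est["Estimate"]) > 2 * est["Std. Error"]
  })
  mean(hits)
}
sim_power(theta = 1)    # the implausibly large effect
sim_power(theta = 0.1)  # the more realistic effect
```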

“To put it another way: if you think this new treatment might “work,” then think carefully about how it might work, and get in there and measure that process.”

Well, yes and no. This comes up a lot in clinical research. We see a lot of studies that show that a drug changes some physiologic measure like blood pressure or a cholesterol level. Those studies can be done rather easily, the measurements are simple to gather, and the effects are typically fairly large and emerge rapidly. But nobody really cares about blood pressure and cholesterol except to the extent that they are on the causal pathway to cardiovascular disease and mortality.

And therein lies the rub. While nearly everybody believes, for example, that increased blood pressure lies on a causal pathway to heart attacks, strokes, heart failure, and death, it turns out that when you study it long enough in a large enough cohort, you find that some blood pressure reducing medications will also reduce these important complications, but others, despite reducing blood pressure itself, do not reduce the complications. We imagine this is because they have other effects that offset the benefit from the blood pressure reduction, though this is not really certain.

So things like blood pressure and lipid levels are considered proxy measures. They’re a good start, but at the end of the day, they’re not what matters. Ultimately, what we really care about are symptomatic illnesses, disability, and death. And effects on those require large and long studies.

It sounds to me like what you are recommending here is the use of proxy measures. Maybe not; maybe there are steps along the pathway of effect that are, in and of themselves, considered important to exercisers. And maybe there is strong science backing up the causal connections of how a treatment works in this context. But if not, all we get is a tradeoff between a really high-quality answer to a question of so-so importance and a mediocre answer to a question that really matters.

I know nothing about the world of weight training, but it isn’t obvious to me why larger and longer studies can’t be done in that context. Perhaps it’s just a matter of lack of funding? Or is there some more inherent barrier? (In large, long clinical trials, keeping study participants involved all the way to the end is a major difficulty.)

I don’t think Andrew’s recommending proxy measures, I think he’s recommending considering a causal process and trying to measure additional factors that enter because of that.

For example, suppose you think repetition to exhaustion is important (i.e., you should barely be able to make the last lift). Why is it important? How important is it? Consider doing experiments where you sometimes do this and sometimes don’t, and track how it affects the different groups. Or suppose you think higher frequency (days per week) is better than high intensity (a long time and high weights) or something. Now try a variety of frequencies, and a variety of intensities, and see if you can fit a model that makes sense.

That’s very different from “try measuring how much metabolite X there is in the blood after lifting” and use that as a proxy measure for how strong you’re getting.

That’s a really cool post! It works just on its own to introduce R, the various Stan packages (I’ve never used those and now feel compelled to do so), fake data simulation, study design and visual and statistical data interpretation.

I could easily imagine this as a tutorial or as the first hour of a course on something like “study design and statistics”…

The image looks like the weight is lifting the cat (or at least, its front paws), not vice-versa.

Fantastic example … I think I got all the way through until “(1+time|person_id)”. What’s that? I believe Andrew once lamented that it’s hard to google “stan”… but it’s even harder with “R”.