Yesterday I shared the following exam question:
In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?
Here’s the solution:
If you have 80% power, then the underlying effect size for the main effect is 2.8 standard errors from zero. That is, the z-score has a mean of 2.8 and standard deviation of 1, and there’s an 80% chance that the z-score exceeds 1.96 (in R, pnorm(2.8, 1.96, 1) = 0.8).
Now to the interaction. The standard of an interaction is roughly twice the standard error of the main effect, as we can see from some simple algebra:
– The estimate of the main effect is ybar_1 – ybar_2, which has standard error sqrt(sigma^2/(N/2) + sigma^2/(N/2)) = 2*sigma/sqrt(N); for simplicity I’m assuming a constant variance within groups, which will typically be a good approximation for binary data, for example.
– The estimate of the interaction is (ybar_1 – ybar_2) – (ybar_3 – ybar_4), which has standard error sqrt(sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4)) = 4*sigma/sqrt(N). [algebra fixed]
And, from the statement of the problem, we’ve assumed the interaction is half the size of the main effect. So if the main effect is 2.8 on some scale with a se of 1, then the interaction is 1.4 with an se of 2, thus the z-score of the interaction has a mean of 0.7 and a sd of 1, and the probability of seeing a statistically significant effect difference is pnorm(0.7, 1.96, 1) = 0.10. That’s right: if you have 80% power to estimate the main effect, you have 10% power to estimate the interaction.
And 10% power is really bad. It’s worse than it looks. 10% power kinda looks like it might be OK; after all, it still represents a 10% chance of a win. But that’s not right at all: if you do get “statistical significance” in that case, your estimate is a huge overestimate:
> raw <- rnorm(1e6, .7, 1) > significant <- raw > 1.96 > mean(raw[significant])  2.4
So, the 10% of results which do appear to be statistically significant give an estimate of 2.4, on average, which is over 3 times higher than the true effect.
The most important point here, though, has nothing to do with statistical significance. It’s just this: Based on some reasonable assumptions regarding main effects and interactions, you need 16 times the sample size to estimate an interaction than to estimate a main effect.
And this implies a major, major problem with the usual plan of designing a study with a focus on the main effect, maybe even preregistering, and then looking to see what shows up in the interactions. Or, even worse, designing a study, not finding the anticipated main effect, and then using the interactions to bail you out. The problem is not just that this sort of analysis is “exploratory”; it’s that these data are a lot noisier than you realize, so what you think of as interesting exploratory findings could be just a bunch of noise.
I don’t know if all this in the textbooks, but it should be.
Some regression simulations in R
In response to a comment I did some simulations which I thought were worth including in the main post.
I played around in R to get a sense of how the standard errors depend on the parameterization.
For simplicity, all my simulations assume that the true (underlying) coefficients are 0; that’s no big deal, as my point here is to work out the standard error.
I started with the basic model in which I simulate 1000 data points with two predictors, each taking on the value (-0.5, 0.5). This is the same as the model in the above post: the estimated main effects are simple differences, and the estimated interaction is a difference in differences. I’ve also assumed the two predictors are independent, which is how I’d envisioned the problem:
library("arm") N <- 1000 sigma <- 10 y <- rnorm(N, 0, sigma) x1 <- sample(c(-0.5,0.5), N, replace=TRUE) x2 <- sample(c(-0.5,0.5), N, replace=TRUE) display(lm(y ~ x1)) display(lm(y ~ x1 + x2 + x1:x2))
And here's the result:
lm(formula = y ~ x1 + x2 + x1:x2) coef.est coef.se (Intercept) -0.09 0.32 x1 -0.41 0.63 x2 -0.13 0.63 x1:x2 -0.91 1.26
Ignore the estimates; they're pure noise. Just look at the standard errors. They go just as in the above formulas: 2*sigma/sqrt(N) = 2*10/sqrt(1000) = 0.63, and 4*sigma/sqrt(N) = 1.26.
Now let's do the exact same thing but make the predictors take on the value (0, 1) rather than (-0.5, 0.5):
x1 <- sample(c(0,1), N, replace=TRUE) x2 <- sample(c(0,1), N, replace=TRUE) display(lm(y ~ x1)) display(lm(y ~ x1 + x2 + x1:x2))
Here's what we see:
lm(formula = y ~ x1 + x2 + x1:x2) coef.est coef.se (Intercept) -0.44 0.64 x1 0.03 0.89 x2 0.95 0.89 x1:x2 -0.54 1.26
Again, just look at the standard errors. The s.e. for the interaction is still 1.26, but the standard errors for the main effects went up to 0.89. What happened?
What happened was that the main effects are now estimated at the edge of the data: the estimated coefficient of x1 is now the difference in y, comparing the two values of x1, just at x2=0. So its s.e. is sqrt(sigma^2/(N/4) + sigma^2/(N/4)) = 2*sqrt(2)*sigma/sqrt(N). Under this parameterization, the coefficient of x1 is estimated just from the data given x2=0, only half the data so the s.e. is sqrt(2) times as big as before. Similarly for x2.
But these aren't really "main effects"; in the context of the above problem, the main effect of the treatment is the average over men and women, hence if we put the problem in a regression framework, we should be coding the predictors as (-0.5, 0.5), not (0, 1).
But here's another possibility: what about coding each predictor as (-1, 1)? What happens then? Let's take a look:
x1 <- sample(c(-1,1), N, replace=TRUE) x2 <- sample(c(-1,1), N, replace=TRUE) display(lm(y ~ x1)) display(lm(y ~ x1 + x2 + x1:x2))
coef.est coef.se (Intercept) -0.23 0.31 x1 0.28 0.31 x2 -0.60 0.31 x1:x2 0.05 0.31
(Again, ignore the coefficient estimates and look at the standard errors.)
Hey---the coefficients are all smaller (by a factor of 2), and they're all equal! What happened?
The factor of 2 is clear enough: If you multiply x by 2, and x*beta doesn't change, then you have to divide beta by 2 to compensate, and its standard error gets divided by 2 as well. But what happened to the interaction? Well, that's clear too: we've multiplied x1 and x2 each by 2, so x1*x2 is multiplied by 4.
So to make sense of all these standard errors, you have to have a feel for the appropriate scale for the coefficients. In my post above, when I talked about interactions being half the size of main effects, I was thinking of differences between the two groups, which corresponds to parameterizing the predictors as (-0.5, 0.5).
As you can see from the above simulations, the exact answer will depend on your modeling assumptions.
It all depends on what you mean by "interaction half the size of the main effect"
This came up in comments.
What do I mean by "interactions of interest are half the size of main effects"? Suppose the main effect of the treatment is, say, 0.6 and the interaction with sex is 0.3, then the treatment effect is 0.45 for women and 0.75 for men. That's what I meant.
If, by "interactions of interest are half the size of main effects," you have in mind something like a main effect of 0.6 that is 0.3 for women and 0.9 for men, then you'll only need 4x the sample size, not 16x.
P.S. Further discussion here. Thanks, everyone, for all the thoughtful comments.