Perhaps I missed some of the points above; did anyone really simulate the case study as suggested? Admittedly, estimates of effect sizes are key; and these impact on the estimated SE in a model, correct?

Under the Null, main effect has SE 0.63; interaction has SE of 1.26, with N=1000, sigma 10.

With the given example, main effect 2.8*sigma; for x2== -.5, 2.1*sigma, and for x2== .5, 3.5*sigma. This is the implication of the interaction being half the size of the main effect (1.4; -.7 and +.7 effect).

This results in 3 interesting findings:

1. SE of main effect is estimated larger as 0.70

2. SE of main effect is estimated as 0.66 if we adjust for x2

3. SE of interaction is estimated as 1.26.
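These numbers can also be approximated without simulation. The sketch below assumes exactly balanced cells and treats any term left out of the model as extra residual variance; the coefficients 0.7*sigma for x2 and 1.4*sigma for the interaction are derived here from the cell means used in the script, they are not stated in the comment itself.

```r
N <- 1000
sigma <- 10

# A predictor coded -0.5/0.5 has variance 1/4; the product of two has variance 1/16.
var_x  <- 1/4
var_xx <- 1/16

# Under the null: SE = residual sd / sqrt(N * predictor variance)
se_main_null <- sigma / sqrt(N * var_x)
se_int       <- sigma / sqrt(N * var_xx)

# Under the alternative the cell means imply (assumption: balanced cells)
# y = 1.4*sigma + 2.8*sigma*x1 + 0.7*sigma*x2 + 1.4*sigma*x1*x2 + error
b_x2  <- 0.7 * sigma
b_int <- 1.4 * sigma

# Terms omitted from the fitted model inflate the residual sd
sd_x1_only <- sqrt(sigma^2 + b_x2^2 * var_x + b_int^2 * var_xx)
sd_x1_x2   <- sqrt(sigma^2 + b_int^2 * var_xx)

se_main_unadj <- sd_x1_only / sqrt(N * var_x)
se_main_adj   <- sd_x1_x2 / sqrt(N * var_x)

round(c(null = se_main_null, unadjusted = se_main_unadj,
        adjusted = se_main_adj, interaction = se_int), 2)
```

Under these assumptions the values land close to the simulated ones, which is consistent with reading the drop in SE after adjusting for x2 as a reduction in residual variance.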

My reflections:

– The only correct model is the model with interaction; and there the SE is identical to what was derived under the Null (SE 1.26).

– In practice, we will start with the main effect model, and some variance is explained by adjusting for other covariates that are associated with the outcome. This is indeed what we observed for x2, even if x2 is interacting with x1. So, this confirms the recommendation to include covariates that are associated with the outcome, more than searching for subgroup effects. https://www.ncbi.nlm.nih.gov/pubmed/2727470

– The inclusion of prognostic covariates is beneficial in linear models as well as in generalized linear models such as logistic or Cox regression: https://www.ncbi.nlm.nih.gov/pubmed/9620808; https://www.ncbi.nlm.nih.gov/pubmed/10783203; https://www.ncbi.nlm.nih.gov/pubmed/15196615; https://www.ncbi.nlm.nih.gov/pubmed/16275011

The R script:

library("arm")  # provides display()

N <- 1000

sigma <- 10

y <- rnorm(N, 0, sigma)

x1 <- sample(c(-0.5,0.5), N, replace=TRUE)

x2 <- sample(c(-0.5,0.5), N, replace=TRUE)

display(lm(y ~ x1))

display(lm(y ~ x1 + x2 + x1:x2))

# this was with y under the Null

# now with y under the alternative of separate x1 effects for x2 values

# specifically:

# overall effect of x1 = 2.8 * sigma; for x2==-.5: 2.1 * sigma; for x2==0.5: 3.5 * sigma

y[x1==.5 & x2== -.5] <- rnorm(length(y[x1==.5 & x2==-.5]), 2.1*sigma, sigma)

y[x1==.5 & x2== .5] <- rnorm(length(y[x1==.5 & x2== .5]), 3.5*sigma, sigma)

display(lm(y ~ x1)) # SE 0.72

display(lm(y ~ x1 + x2)) # SE 0.69

display(lm(y ~ x1 + x2 + x1:x2)) # SE interaction 1.31; 1.31/0.72 equals 1.82 rather than a factor 2

Rogers, W. M. (2002). Theoretical and mathematical constraints of interactive regression models. Organizational Research Methods, 5, 212–230.

Maxwell and Delaney have it covered (“Designing Experiments and Analyzing Data: A Model Comparison Perspective”, 2nd Ed., p. 318). They refer to an Abelson insight: The ratio of the t-values for the main effect and the interaction effect is equal to (t1+t2)/(t1-t2), where t1 and t2 refer to the simple main effects. Assuming t1 and t2 to be on the same side (i.e., ordinal interaction), the effect size of the interaction will always be smaller than the main effect unless one of the simple main effects is zero. Playing around with the equation by assuming that one simple main effect is half the size of the other (t2=.5*t1) yields the solution that the effect size of the main effect must be three times that of the interaction. In many cases where disordinal interactions are implausible, t2=.5*t1 may be a reasonable assumption. In this scenario, the power for the interaction would be ~5 times lower than the power for the main effect: pnorm(2.8, 1.96, 1) / pnorm(2.8/3, 1.96, 1).
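The arithmetic in this comment can be spelled out as a small sketch. The function name t_ratio is made up for illustration, and power is computed with the same normal approximation the comment uses.

```r
# Ratio of the main-effect t-value to the interaction t-value,
# given the two simple main effects t1 and t2 (formula quoted in the comment)
t_ratio <- function(t1, t2) (t1 + t2) / (t1 - t2)

# One simple effect half the size of the other
ratio <- t_ratio(1, 0.5)

z_main <- 2.8             # main effect 2.8 SEs from zero
z_int  <- z_main / ratio  # implied z for the interaction

power_main <- pnorm(z_main, 1.96, 1)
power_int  <- pnorm(z_int, 1.96, 1)

c(ratio = ratio, power_main = power_main, power_int = power_int,
  power_ratio = power_main / power_int)
```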

> anova(fit1)
Analysis of Variance Table

Response: Val
          Df Sum Sq Mean Sq F value Pr(>F)
Group      1 4.5071  4.5071  2.5798 0.1835
Sex        1 0.2409  0.2409  0.1379 0.7292
Group:Sex  1 0.0657  0.0657  0.0376 0.8557
Residuals  4 6.9881  1.7470

> anova(fit2)
Analysis of Variance Table

Response: Val
          Df Sum Sq Mean Sq F value Pr(>F)
Group      1 4.5071  4.5071  2.5798 0.1835
Sex        1 0.2409  0.2409  0.1379 0.7292
Group:Sex  1 0.0657  0.0657  0.0376 0.8557
Residuals  4 6.9881  1.7470

Perhaps related (And someone cites Gelman 2005 there):

https://stats.stackexchange.com/questions/175246/why-is-anova-equivalent-to-linear-regression

It’s probably another one of those things that is “obvious” to people with good training in stats but severely disturbing to most people trying to apply it.

set.seed(1234)
n = 4
mu = 1
a = rnorm(n, 0.0, 1)
b1 = rnorm(n/2, mu - mu/4, 1)
b2 = rnorm(n/2, mu + mu/4, 1) # interaction is mu/2
dat1 = dat2 = data.frame(Group = c(rep("C", n), rep("T", n)),
                         Sex = rep(c(rep("F", n/2), rep("M", n/2)), 2),
                         Val = c(a, b1, b2))

# Now male is the reference
dat2$Sex = factor(dat2$Sex, levels = c("M", "F"))

fit1 = lm(Val ~ Group*Sex, dat1)
fit2 = lm(Val ~ Group*Sex, dat2)

# Results
dat1
dat2
summary(fit1)
summary(fit2)

> dat1
  Group Sex        Val
1     C   F -1.2070657
2     C   F  0.2774292
3     C   M  1.0844412
4     C   M -2.3456977
5     T   F  1.1791247
6     T   F  1.2560559
7     T   M  0.6752600
8     T   M  0.7033681

> dat2
  Group Sex        Val
1     C   F -1.2070657
2     C   F  0.2774292
3     C   M  1.0844412
4     C   M -2.3456977
5     T   F  1.1791247
6     T   F  1.2560559
7     T   M  0.6752600
8     T   M  0.7033681

> summary(fit1)

Call:
lm(formula = Val ~ Group * Sex, data = dat1)

Residuals:
       1        2        3        4        5        6        7        8
-0.74225  0.74225  1.71507 -1.71507 -0.03847  0.03847 -0.01405  0.01405

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.4648     0.9346  -0.497    0.645
GroupT        1.6824     1.3218   1.273    0.272
SexM         -0.1658     1.3218  -0.125    0.906
GroupT:SexM  -0.3625     1.8692  -0.194    0.856

Residual standard error: 1.322 on 4 degrees of freedom
Multiple R-squared: 0.4079, Adjusted R-squared: -0.03622
F-statistic: 0.9184 on 3 and 4 DF, p-value: 0.5081

> summary(fit2)

Call:
lm(formula = Val ~ Group * Sex, data = dat2)

Residuals:
       1        2        3        4        5        6        7        8
-0.74225  0.74225  1.71507 -1.71507 -0.03847  0.03847 -0.01405  0.01405

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.6306     0.9346  -0.675    0.537
GroupT        1.3199     1.3218   0.999    0.374
SexF          0.1658     1.3218   0.125    0.906
GroupT:SexF   0.3625     1.8692   0.194    0.856

Residual standard error: 1.322 on 4 degrees of freedom
Multiple R-squared: 0.4079, Adjusted R-squared: -0.03622
F-statistic: 0.9184 on 3 and 4 DF, p-value: 0.5081

I usually do not (pretty much never) do anovas or use R factors, but this seems pretty disturbing.
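One reading of this output, offered as a sketch rather than a diagnosis: with R's default treatment (dummy) coding, GroupT is the treatment effect within the reference sex, not the average treatment effect, so it changes when the reference level changes while the interaction only flips sign. With sum-to-zero contrasts (contr.sum) the Group coefficient no longer depends on which sex comes first. The data are the eight values printed above; the object names are illustrative.

```r
dat <- data.frame(
  Group = factor(rep(c("C", "T"), each = 4)),
  Sex   = factor(rep(rep(c("F", "M"), each = 2), 2)),
  Val   = c(-1.2070657, 0.2774292, 1.0844412, -2.3456977,
            1.1791247, 1.2560559, 0.6752600, 0.7033681)
)
dat_M <- transform(dat, Sex = relevel(Sex, ref = "M"))  # male as reference

# Treatment coding: GroupT is the T - C difference in the reference sex
b_F <- coef(lm(Val ~ Group * Sex, dat))["GroupT"]    # effect among F
b_M <- coef(lm(Val ~ Group * Sex, dat_M))["GroupT"]  # effect among M

# Sum-to-zero coding: the Group coefficient averages over both sexes
sum_contr <- list(Group = "contr.sum", Sex = "contr.sum")
g_F <- coef(lm(Val ~ Group * Sex, dat,   contrasts = sum_contr))["Group1"]
g_M <- coef(lm(Val ~ Group * Sex, dat_M, contrasts = sum_contr))["Group1"]

c(dummy_ref_F = unname(b_F), dummy_ref_M = unname(b_M),
  sum_ref_F = unname(g_F), sum_ref_M = unname(g_M))
```

With contr.sum and two levels, Group1 is half the averaged C minus T difference, so its sign and scale differ from GroupT; what matters here is that it is the same in both fits.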

The basic function is this:

sim1 = function(n=40, mu=0.6343) {
  a = rnorm(n, 0.0, 1)
  b1 = rnorm(n/2, mu - mu/4, 1)
  b2 = rnorm(n/2, mu + mu/4, 1) # interaction is mu/2
  dat = data.frame(Group = c(rep("C", n), rep("T", n)),
                   Sex = rep(c(rep("M", n/2), rep("F", n/2)), 2),
                   Val = c(a, b1, b2))
  fit = lm(Val ~ Group*Sex, dat)
  return(summary(fit)$coefficients)
}

Where is my error?

It can’t be that an arbitrary label (and resulting positive or negative coefficient) halves my power from 70% to 35%!
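A possible way to see where the 70% and 35% come from, as a sketch under a normal approximation: with dummy coding the GroupT coefficient compares T and C within the reference sex only, so it uses half the data and targets either mu + mu/4 or mu - mu/4 depending on the label. The numbers below reuse n and mu from sim1(); the helper name power is made up for illustration.

```r
n     <- 40      # observations per treatment arm, as in sim1()
mu    <- 0.6343
sigma <- 1

# Dummy coding: T vs C within the reference sex, n/2 observations per cell
se_simple <- sigma * sqrt(2 / (n / 2))
z_ref_hi  <- (mu + mu / 4) / se_simple  # reference sex has the larger effect
z_ref_lo  <- (mu - mu / 4) / se_simple  # reference sex has the smaller effect

# -0.5/0.5 coding: average treatment effect, n observations per arm
se_avg <- sigma * sqrt(2 / n)
z_avg  <- mu / se_avg

# Normal approximation to power, as used elsewhere in this thread
power <- function(z) pnorm(z, 1.96, 1)

round(c(ref_hi = power(z_ref_hi), ref_lo = power(z_ref_lo),
        avg = power(z_avg)), 2)
```

On this reading the label does not change the data or the fit, only which simple effect the coefficient called GroupT is estimating.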

As you can see, using Group*Sex as factors results in slightly different values and only 70% power for the main effect (just as Howard wrote below). The SE of the interaction is also only sqrt(2) larger than SE of main (see Andrew’s reasoning above). Applied to your graph, this may account for the discrepancy between your “analytic results” (dashed lines) and the actual results.

If instead you encode the groups as (-0.5, 0.5) you get 80% power (sim2()). Now the z.coeffs are ~2.8 and ~0.7 (just as in Andrew’s calculation) and inflation of significant interaction effects is ~3x.

Which leads me to the question: why is dummy coding used in all regression packages, if it effectively diminishes power (from 80% to 70%)? And what would this look like for factors with more than two groups?
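The comment refers to sim2() without showing it. What follows is a hypothetical reconstruction, assuming it mirrors sim1() but codes both predictors as -0.5/0.5; the column names, seed, and number of replications are arbitrary choices.

```r
# Hypothetical sim2(): same data-generating process as sim1(),
# with Group and Sex replaced by numeric -0.5/0.5 predictors
sim2 <- function(n = 40, mu = 0.6343) {
  a  <- rnorm(n, 0, 1)
  b1 <- rnorm(n / 2, mu - mu / 4, 1)
  b2 <- rnorm(n / 2, mu + mu / 4, 1)  # interaction is mu/2
  dat <- data.frame(
    group = rep(c(-0.5, 0.5), each = n),
    sex   = rep(rep(c(-0.5, 0.5), each = n / 2), 2),
    val   = c(a, b1, b2)
  )
  summary(lm(val ~ group * sex, dat))$coefficients
}

set.seed(1)
t_vals <- replicate(1000, sim2()[c("group", "group:sex"), "t value"])

mean_t    <- rowMeans(t_vals)              # average t for main effect, interaction
power_est <- rowMeans(abs(t_vals) > 1.96)  # share of runs with |t| > 1.96
mean_t
power_est
```

If this matches what the commenter ran, the average t-values should sit near the 2.8 and 0.7 quoted above, with main-effect power near 80%.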

1-pnorm(1.96,2.8,1)

than

pnorm(2.8,1.96,1)

it took a while to see that the two areas under the respective bell curves are always the same,

visually just reflected about the vertical axis at (2.8+1.96)/2
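A quick check of that equivalence: both expressions reduce to pnorm(2.8 - 1.96), and the same holds for any pair of values when the sd is shared. The second pair of numbers is arbitrary.

```r
p_upper <- 1 - pnorm(1.96, 2.8, 1)  # area above 1.96 under N(2.8, 1)
p_lower <- pnorm(2.8, 1.96, 1)      # area below 2.8 under N(1.96, 1)
same_here <- isTRUE(all.equal(p_upper, p_lower))

# Same identity for an arbitrary pair (a, b)
a <- 0.3
b <- 1.7
same_general <- isTRUE(all.equal(1 - pnorm(a, b, 1), pnorm(b, a, 1)))

c(same_here, same_general)
```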

In particular, in a simple Gaussian model, for sigma_mainEffect to be greater than 10% larger than sigma_withInteraction, we need delta_interaction just about equal to sigma_withInteraction, implying delta_mainEffect = 2 * sigma_withInteraction. As delta_mainEffect gets smaller, so does the relative difference between sigma_mainEffect and sigma_withInteraction.

This (a) doesn’t change the sample size ratio very much at all and (b) would be considered a very strong effect in most fields.

I was just worried because a post with a theme, “Interactions are hard to estimate,” could be taken to imply, “Don’t estimate interactions.” But I *don’t* want to imply that. I *do* want people to estimate interactions, I just want them (a) to estimate them with Bayesian inference using prior information, and (b) to accept that their posterior inferences will have a lot of uncertainty, in particular not considering it a failure when the 95% interval includes zero.

Of course, I don’t expect you to be someone who wants to protect type I error rates **at all costs**!

Just to be clear, I never said, “don’t look for interactions because it might take away from your power to find a main effect.” I think interactions are very important; we just need to accept that in many settings we won’t be able to attain anything like near-certainty regarding the magnitude or even direction of particular interactions. That’s a message that people don’t want to hear, which is too bad, because uncertainty is a core statistical principle.

If you’re willing to make the assumption that the estimates of the main + interaction effects are normally distributed (not too strong an assumption), then I think the required sample size ratio for the interaction effect should actually be a function of the original required sample size and the distribution of the interaction variable (which we have stated is binary with p = 0.5)? I.e., if a small sample size is required, then the main effect must be large relative to sigma_mainEffectOnly, and so sigma_withInteraction should be very small (as presented in my pathological case above, you can actually make the sample size required *smaller* than the no interaction effect sample size!). But it’s not really clear to me that this function would be consistent across different response distributions (i.e. I think you get a different function if you’re looking at a Gaussian response vs. survival analysis model etc.).

Just a little tangent, but I think this is one area where the field of statistics has taken the very wrong approach and the field of machine learning has really gotten it right. Don’t worry, I’ll blame it all on p-values. Classical frequentist methods have been very afraid of model building, because it can make it very difficult to keep valid frequentist inference while also performing model selection from the data (although there’s been plenty of recent work on addressing this). This fear of invalid inference has scared many statisticians away from model building (don’t look for interactions because it might take away from your power to find a main effect!), while many machine learning researchers skipped p-value day and said “wow, it’s really easy to build models with high predictive power if we just survey a large number of models (or hyper parameters) and pick the one that does best on out-of-sample error!”

How do you think power should be taught, relative to how it is currently taught?

As I wrote above, I understand that my statistical advice can be upsetting because I’m a bearer of bad tidings. All I can tell you is that these issues have confused a lot of people for a long time, which is one reason why the replication crisis is a crisis and is one reason why people keep being surprised that “p less than 0.05” results aren’t getting replicated. Getting angry is easy but it won’t help you understand the world better in a replicable way.

(Accidentally hit submit there)

Seems a bit harsh, and a bit hard to believe. Who goes to a technical statistical methods blog and skins

See comment here for what I’d been thinking of.

In particular, suppose there is a *non-zero* interaction effect (which there has to be). If we are powered at 0.80 to detect a main effect in a model in which we’ve excluded the interaction effect, then the interaction effect gets absorbed into sigma. So sigma_mainEffectOnly can be arbitrarily larger than sigma_withInteractions. In fact, it’s theoretically possible to have power 0.80 with the main effect and power 1.00 when we include the interaction with the same sample size (i.e. the pathological case in which all the error came from excluding the interaction effect).

So under this interpretation of your problem, which I think is quite reasonable, the answer is undefined.

I absolutely agree. As if this whole hysterical obsession with cats hasn’t caused enough damage already!

I take no responsibility for readers who don’t read the post. I think everyone should read the comments too.

If I just wanted to write titles, I’d be on twitter. I blog because I want to work things through.

http://jakewestfall.org/blog/index.php/2015/05/11/the-hierarchical-ordering-principle/

Rather, my objection (1) was that the title is misleading because it’s easily misinterpreted by readers who don’t carefully read the blog post, which is almost certainly a large majority. I don’t argue that it’s actually common for main effects and interactions to have the same standardized effect size (e.g., the same partial correlation with Y)–like I said, hierarchical ordering is probably true more often than not–just that the natural interpretation of your title is that it refers to such cases. The context for me here is that many in my field mistakenly believe that interactions are *inherently* harder to detect than main effects *because they are interactions*–that is, even for the same partial correlation. And your title seems to just fuel those misconceptions.

I don’t really understand things like standardized effect size and partial eta—actually, I’ve never heard of partial eta. As discussed in the addendum, the power for estimating main effect and interaction will only be equal in a setting where, for example, the main effect is 0.6, with an effect of 0 for one group and 1.2 for the other. I agree that this *can* occur but I don’t think it’s typical. This is related to the piranha principle that we discussed not long ago on the blog.

In any case, I hope the R code and then the addendum helped.

To elaborate on one of the points implicitly arising in our discussion: There are cases where main effects are small and interactions are large. Indeed, in general, these labels have some arbitrariness to them, as my colleagues and I realized many years ago when studying congressional elections: recode the outcome from Democratic or Republican vote share to incumbent party vote share, and interactions with incumbent party become main effects, and main effects become interactions. So, yes, the above post is in the context of main effects which are modified by interactions; there’s the implicit assumption that if the main effect is positive, then it will be positive in the subgroups we look at, just maybe a bit larger or smaller. Again, I think this makes sense in most of the social science research I’ve seen, and I think it makes sense for most of the interactions that people look at—especially in the common setting that people look at lots of interactions, in which case I think most of them will have to be small—but it won’t apply in every case.

To continue that last thought: I think it makes sense, where possible, to code variables in a regression so that the larger comparisons appear as main effects and the smaller comparisons appear as interactions. This is what we did in our paper on incumbency analysis. You might say I’m now engaged in circular reasoning, but I think that in most cases, the very nature of a “main effect” is that it’s supposed to tell as much of the story as possible. When interactions are important, they’re important as modifications of some main effect. Again, not always (you can have a treatment that flat-out hurts men while helping women), but in such examples it’s not clear that the main-effects-plus-interaction framework is the best way of looking at things.

(1) Contrary to the suggestion of the title, an interaction and a main effect will have the same power given the same N and the same standardized effect size (partial eta^2). Of course, the body of the blog post doesn’t actually dispute this–it’s based on the explicit assumption that the interaction effect size is half that of the main effects–so I’d just say the title is misleading, given that most people who see the title’s conclusion will likely interpret it as applying to situations where the standardized effect size is held constant.

(2) As noted in the addendum, there’s some ambiguity about what it means for the interaction effect size to be half that of the main effect on an unstandardized scale. I’d argue that we can only sensibly compare the sizes of unstandardized effects in this way after equating them on var(X), otherwise it’s sort of apples-to-oranges. The 16X conclusion violates this principle because it compares main effects on a [-1/2, +1/2] scale to an interaction that’s on a [-1/4, +1/4] scale. So when you cut β_interaction in half in addition to that change of scaling, I’d argue you’re really putting the interaction at a quarter the size of the main effects, not half the size. If you equate the scales first and then set β_interaction to half on that scale, the sample size multiplier is 4X, not 16X (as hinted in the addendum). This also agrees with the fact that the sample size multiplier is 4X, not 16X, when the standardized effect size (partial eta^2) for the interaction is half that for the main effects.

(X11 - X12) - (X21 - X22) = (X11 + X22) - (X12 + X21)

If you code your interaction this way, it’ll be twice the size and have twice the SE as a main effect (coded as the difference between the marginal means). If you divide it by 2 it’ll be exactly the same. So,

(X11 + X22)/2 - (X12 + X21)/2 looks a lot like (X11 + X12)/2 - (X21 + X22)/2

So, maybe if you start off by assuming that the difference of differences (interaction) should be the same size as the difference in marginal means (main effect), you can arrive at the conclusion that the SE should be twice as large. But that’s just another way of saying you expect the interaction to be small.
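The factor of two can be checked directly from the contrast weights on the four cell means. This is a sketch assuming equal cell sizes and a common sigma; n_cell and the helper name contrast_se are illustrative.

```r
sigma  <- 1
n_cell <- 25  # hypothetical number of observations per cell

# Weights on the cell means, ordered X11, X12, X21, X22
w_main <- c(1/2, 1/2, -1/2, -1/2)  # difference in marginal means
w_int  <- c(1, -1, -1, 1)          # difference in differences

# SE of a linear contrast of independent cell means
contrast_se <- function(w) sigma * sqrt(sum(w^2) / n_cell)

se_ratio        <- contrast_se(w_int) / contrast_se(w_main)
se_ratio_halved <- contrast_se(w_int / 2) / contrast_se(w_main)
c(se_ratio = se_ratio, se_ratio_halved = se_ratio_halved)
```

Halving the interaction contrast halves both its size and its SE, so the t-value is unchanged; only the scale on which "half the size of the main effect" is judged moves.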

It’s not *just* that smaller effects have less power. That’s one of the factors of 4. The other factor of 4 is that the estimate of the interaction is a difference in differences, which will have twice the standard error (thus 4 times the variance) of the estimate of the main effect.

Regarding your question about a priori reasons: I guess it depends on the application. It makes sense to me that large interactions could be as large as the main effect or even larger (for example, if the main effect is 0.6 and the effect is 0 for women and 1.2 for men, then the interaction is twice the size of the main effect!); I’d guess that *typical* interactions are of the order of half the size of the main effect. A lot depends on whether you’re going in looking for one particular interaction, or if you’re rooting around looking for what interactions might turn up.

What do you base 1) above on though? I’ve asked people and they don’t seem to think there’s any a priori reason to assume an interaction will be half the size of a main effect.

Otherwise your claim seems to reduce to smaller effects = less power, which isn’t surprising.

Oh, sure, I was just assuming you’d do all the tests together. The tests are what they are, no need to be adjusting thresholds.

In the balanced design that I was assuming for this problem (a randomly assigned treatment, thus approximately equal proportions of men and women in each group), you can estimate the interaction without changing the estimate of the main effects.

Regarding your other question: I’m not at all proposing that decisions be made based on statistical significance or rejection of a hypothesis test. The real point here is working out the standard errors.

Maybe it will help to consider specific numbers. Suppose the main effect of the treatment is, say, 0.6 and the interaction with sex is 0.3, so that the treatment effect is 0.45 for women and 0.75 for men. Then if you do the regression with the predictors on the (-0.5, 0.5) scale, you’ll get estimates of 0.6 for the treatment effect and 0.3 for the interaction. If you do the regression with the predictors on the (-1, 1) scale, you’ll get estimates of 0.3 for the treatment effect and 0.075 for the interaction.
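These coefficients can be reproduced with a noise-free fit to the four cell means. It is a sketch: the data frame, the column names, and the choice of -0.5 for women are made up for illustration.

```r
# Four cells; sex = -0.5 for women, 0.5 for men (arbitrary labeling)
cells <- expand.grid(treat = c(-0.5, 0.5), sex = c(-0.5, 0.5))

# Treatment effect of 0.45 for women and 0.75 for men, no noise
effect  <- ifelse(cells$sex < 0, 0.45, 0.75)
cells$y <- effect * cells$treat

coef_half <- coef(lm(y ~ treat * sex, data = cells))

# Same data with predictors rescaled to (-1, 1)
cells_one <- transform(cells, treat = 2 * treat, sex = 2 * sex)
coef_one  <- coef(lm(y ~ treat * sex, data = cells_one))

round(rbind(half_scale = coef_half, unit_scale = coef_one), 3)
```

Doubling a predictor halves its coefficient, and the interaction term is a product of two doubled predictors, so its coefficient is divided by four.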

Depending on the context, maybe I’m being too pessimistic in the above post. One could easily imagine a treatment where the main effect is 0.6, with an effect of 0.3 for women and 0.9 for men. In that case, the interaction is as large as the main effect, so then you only need 4 times the sample size, not 16.
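The 16 versus 4 comparison follows from two ingredients: the interaction's SE is twice the main effect's (balanced design, -0.5/0.5 coding), and required sample size scales with the square of SE divided by effect. A minimal sketch, with a made-up function name:

```r
# Sample size multiplier needed for the interaction to reach the same
# z-score as the main effect (assumes balanced design, -0.5/0.5 coding)
sample_size_multiplier <- function(main, interaction) {
  se_factor <- 2  # interaction SE relative to main-effect SE
  (se_factor * main / interaction)^2
}

m_half  <- sample_size_multiplier(0.6, 0.3)  # interaction half the main effect
m_equal <- sample_size_multiplier(0.6, 0.6)  # effects of 0.3 and 0.9 by sex
c(half = m_half, equal = m_equal)
```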

Doesn’t our prior (it is prior, right?) intent to test the interaction change the main effect test itself? (And how many other interactions will we test?). Or are you thinking of running the same study design twice?

Second, we want power for the interaction test, supposing “that interactions of interest are half the size of main effects.” Are we to suppose that the interactions are half the size of our pre-experiment expectation (whatever we used in the main-effect power calculation) or does the power calculation for the interaction test depend on what we actually learned about the main effect? It seems weird to get to a situation where we perhaps don’t reject the main effect null, but continue to claim that the interaction test had high power (against a value which we perhaps don’t take seriously any more.)

The [-1 1] case seems most reasonable given that ANOVA computes differences.

Linear modeling examples you present above show that SE depends (partly) on the relative norm of the predictor vectors. Using a norm of 1 everywhere leads to both the main effect and interaction predictors also being of unit length, and thus all SEs are the same. Makes all numbers more comparable…

For instance I likely would have used a two-sided test for the interaction and if I’ve done the calculations correctly that would make the increase in sample size a factor of 20 (larger for not knowing the direction of say male effect – female effect).
