A journalist writes that he read a paper reporting on a medical experiment conducted on two different groups of people, and all that was reported was the estimated average effects. In this case, the treatment when applied to people in the first group was qualitatively different from the treatment when applied to the second group.
That is, there were groups 1 and 2, and in each group, there was a comparison of T to C. The journalist wanted to see T1 – C1 and T2 – C2, but all that was reported was (T1 + T2)/2 – (C1 + C2)/2, and the concern was that T1 and T2 were two different things. When asked why he didn’t share the separate estimates for 1 and 2, the author said that his team didn’t do this because they didn’t want to risk introducing too many statistical comparisons into their analysis.
The journalist asked for my thoughts on this, and I replied as follows:
Yes, I’ve seen people do this sort of averaging before. Sometimes it’s a mistake, other times it makes some sense because the separate estimates can be so noisy. The situation is that you can get a more stable estimate of (A+B)/2 than you can of either A or B, so that’s cool. The bad news is that now you’re not estimating either A or B, you’re estimating (A+B)/2, so the question is what interpretation this has.
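To make the stability point concrete, here is a minimal sketch, assuming the two estimates are independent and roughly normal. The true effects (+2 and -2) and the standard errors are made-up numbers chosen only for illustration; they do not come from the study under discussion.

```python
import math
import random
import statistics


def se_of_average(se_a, se_b):
    """Standard error of (A_hat + B_hat)/2, assuming independent estimates."""
    return math.sqrt(se_a**2 + se_b**2) / 2


def simulate(true_a, true_b, se_a, se_b, n_sims=20000, seed=1):
    """Draw many noisy estimates of A and B and summarize their spread."""
    rng = random.Random(seed)
    a_hats = [rng.gauss(true_a, se_a) for _ in range(n_sims)]
    b_hats = [rng.gauss(true_b, se_b) for _ in range(n_sims)]
    avg_hats = [(a + b) / 2 for a, b in zip(a_hats, b_hats)]
    return {
        "sd_a": statistics.stdev(a_hats),
        "sd_b": statistics.stdev(b_hats),
        "sd_avg": statistics.stdev(avg_hats),
        "mean_avg": statistics.fmean(avg_hats),
    }


# Hypothetical case: the two effects have opposite signs.
result = simulate(true_a=2.0, true_b=-2.0, se_a=1.0, se_b=1.0)
print("analytic SE of the average:", se_of_average(1.0, 1.0))
print(result)
```

With equal standard errors the average is less noisy than either piece by a factor of sqrt(2). In this made-up case, though, it is a precise estimate of something close to zero, which describes neither group. That is the interpretation problem in miniature.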
Here’s an example where some averaging is ok. A few decades ago my colleagues and I estimated “the incumbency advantage” in congressional elections. We estimated the effect separately for each election year, which made sense, because the effect was changing over time. It could’ve been reasonable to estimate by averaging over each decade, because within any given decade it doesn’t change so much and then you get a more stable estimate. What we did, though, was estimate for each year and then plot the time series of estimates, so that the reader could do the smoothing by eye. I think that was the best way to go, short of fitting a hierarchical time series model, which would’ve been more work (but maybe now this would be the way to go).
What we did not do, though, was separately estimate the incumbency advantage for Democrats and for Republicans. Actually, we did separate estimates, plotted the separate time series, and they just looked like two noisy versions of the same thing, so we decided to make things simple and estimate a single incumbency advantage for each year. I think this was ok, largely because (A+B)/2 can be interpreted as the average incumbency advantage for that year, and (a) there’s no strong theoretical reason to think the incumbency advantage would be much different between the two parties, and (b) even if it is, we’re estimating an average incumbency advantage, which has a clear enough interpretation.
Here’s an example where averaging doesn’t make sense to me. Many years ago I was working with some colleagues who were studying civil war. I don’t remember all the details, but the basic story was they were fitting logistic regression to predict whether a country would be in civil war. The data were country-years, and the outcome was 1 if civil war and 0 if not. I argued that they should be fitting two separate models: one model predicting the probability that a civil war starts in a given country and year, and one model predicting whether a civil war ends. These would be fit to two different datasets, the first being all the country-years that were not already in civil war and the second being the others. So, for example, the United States from 1789-1861 and 1866-present would be in the first dataset, and the United States from 1861-1865 would be in the second dataset. There’d be a lot more data points in dataset 1 than in dataset 2; that’s just the way it is. The point is that there’s no good reason to be interested in averages of these two processes.
I don’t know enough about the context of the problem to say more than that.
Re the civil war example, it is my understanding that onset and continuation have long been analyzed separately (here is a 2011 article stating it is “common” to do so https://journals.sagepub.com/doi/abs/10.1177/0022343310394697), so it looks like you won that argument.
Andrew, I would love more advice and discussion on this point.
Given that “the treatment when applied to people in the first group was qualitatively different from the treatment when applied to the second group,” publishing study-wise numbers seems reasonable. This is different from making comparisons, which you can pick up.
When my co-author and I were writing this: https://link.springer.com/epdf/10.1007/s11109-017-9395-7, we had similar concerns. Partly, we just decided to follow what was done earlier, and partly, we wanted to be upfront about the noisiness. On the other side, I struggle to think about theoretical “justifications” (vs. empirical explorations) for disagg. At any rate, we just published disagg. results.
The underlying concern and trade-off is clear: describing fully versus ‘over-interpreting’ (multiple comparisons). I suppose we are aware of what to do to clamp down on Type 1 errors. Is there ever a case for not ‘describing fully’? Often, I find that the spirit of your analysis is somewhat open-minded empiricism (with energy spent on thinking about reasonable priors, etc.). I remember your Rich State/Poor State, which has a bunch of this, and not everything is satisfying there, but maybe we need to let go of cogency and leap into explaining variance. But I think we are missing a mental model here and a rigorous way to think about what we are doing with the data.
Also, I would love to hear your views on whether the modal empirical paper in political science or economics publishes too few or too many statistical summaries. We can do advice on how to write those summaries up in a careful manner separately.
“the author said that his team didn’t do this because they didn’t want to risk introducing too many statistical comparisons into their analysis.” Chances are this refers to multiple testing, and keeping the number of statistical tests down, maybe even only running a single test regarding what was specified as hypothesis of interest in the very beginning, could be seen as a good thing, even though many on this blog may say, better keeping the number of tests down to zero… Of course, in advance, one may well be interested more in an average effect than in groupwise effects, depending on the situation.
I do, however, think that every data-analytic effort should be made to get more potentially relevant information out of the data. Not even showing (maybe not even looking at) the groupwise differences for fear of “too many statistical comparisons” looks like a travesty, and all the more so if there are good reasons to think that the two groups might be essentially different. Andrew in the first example actually showed groupwise differences despite not using them for the “main message”. Here, too, I think it can only do good to show such differences and comment on them, without necessarily running a test or something.
This issue is connected to the more general theme of whether and how doing too many things in data analysis may do harm. We know that making too many data-dependent decisions may hurt (“garden of forking paths”), but even then this has to be balanced against the danger of doing something foolish by ignoring important features of the data. Here the question is: what is the danger of just looking for more information or insight, even if this doesn’t change any analysis further downstream? One danger may be that even if the specific thing we see doesn’t influence later analyses, we can’t be sure that it wouldn’t have influenced them had we seen something else. There may also be a suspicion that authors have looked at multiple tests without reporting them all, or even that it pretty much amounts to running a test if a statistician with good intuition looks at a certain aspect and says, without explicitly computing a p-value, “nothing seems to be going on here”.
There’s also an issue of audience-friendliness (and being economical with time) when doing and reporting all kinds of stuff that in the end doesn’t have consequences and may seem boring or a distraction, but this doesn’t seem to be the issue here.
I don’t know why the data weren’t analyzed as a factorial design. Then one could check for an overall treatment effect and look for the differential effect of treatment on group by testing the treatment X group interaction. If significant, they could have probed the simple effect within each group.
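As a rough sketch of that factorial view, the average treatment effect and the treatment-by-group interaction can both be computed from the four cell means. Everything below is hypothetical: the cell means, the common standard deviation, the equal arm sizes, and the normal approximation are assumptions for illustration, not numbers from the study.

```python
import math


def diff_and_se(mean_t, mean_c, sd, n_per_arm):
    """Treatment-minus-control difference in one group, with its standard error.

    Assumes equal arm sizes and a common outcome standard deviation.
    """
    diff = mean_t - mean_c
    se = sd * math.sqrt(2 / n_per_arm)
    return diff, se


def factorial_summary(group1, group2):
    """Average effect and treatment-by-group interaction from two groups."""
    d1, se1 = diff_and_se(**group1)
    d2, se2 = diff_and_se(**group2)
    se_sum = math.sqrt(se1**2 + se2**2)
    return {
        "effect_1": d1,
        "effect_2": d2,
        "average_effect": (d1 + d2) / 2,
        "se_average": se_sum / 2,
        "interaction": d1 - d2,
        "se_interaction": se_sum,
        "z_interaction": (d1 - d2) / se_sum,
    }


# Hypothetical cell summaries for groups 1 and 2.
g1 = {"mean_t": 12.0, "mean_c": 10.0, "sd": 5.0, "n_per_arm": 50}
g2 = {"mean_t": 9.0, "mean_c": 10.0, "sd": 5.0, "n_per_arm": 50}
summary = factorial_summary(g1, g2)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these made-up inputs the average effect is 0.5 with a standard error of about 0.71, while the interaction is 3.0 with a standard error of about 1.41. The interaction's standard error is always twice that of the average effect in this setup, so checking for differential effects takes far more data than estimating the overall effect.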
This reminds me of something I like to do when reporting on factorial experiments. Let’s say the analysis says that there is no significant treatment*age group interaction for males but there is a significant treatment*age group interaction for females. For males, I prefer to report the treatment effect after aggregating the age groups; this gives a more precise estimate of the average effect on males. For females, we say the effect depends on age and show how.
I agree with that strategy. Report the main effects when the interaction is not significant; this reports the overall effect after aggregating groups. However, the main effect is not interpretable in the face of an interaction, so explore the interaction instead.
What if you have lots of funding so the sample size is very large and measurements are very precise?
In the limiting case all the main effects (and interactions) will always be observed significant. You can often even change the model to include a new effect/interaction then see the old values become significant in the other direction. There’s an effectively infinite number of plausible variations on model specification to be tried.
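A back-of-the-envelope sketch of the limiting case, assuming a simple two-arm comparison with equal arm sizes and a known outcome standard deviation. The effect size of 0.01 standard deviations is a made-up stand-in for a practically negligible effect.

```python
import math


def z_statistic(effect, sd, n_per_arm):
    """z-score for a difference in means between two equal-sized arms."""
    se = sd * math.sqrt(2 / n_per_arm)
    return effect / se


# The same tiny hypothetical effect at increasing sample sizes.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, round(z_statistic(effect=0.01, sd=1.0, n_per_arm=n), 2))
```

The z-score grows with the square root of the sample size: about 0.07 at 100 per arm and about 7.07 at a million per arm, with no change in the underlying effect.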
In practice the field will collectively choose a more stringent significance threshold (e.g., 5-sigma rather than 0.05) to avoid thinking about this too hard. I.e., significance measures ability to get $$$, not the phenomenon under study.
Here’s a fun article, https://datacolada.org/126, that suggests we should show group-wise means.