A journalist pointed me to a recent research article, “The impact of a poverty reduction intervention on infant brain activity,” which stated:
Here, we report estimates of the causal impact of a poverty reduction intervention on brain activity in the first year of life. . . . Shortly after giving birth, mothers were randomized to receive either a large or nominal monthly unconditional cash gift. Infant brain activity was assessed at approximately 1 y of age in the child’s home, using resting electroencephalography (EEG; n = 435). . . . using a rigorous randomized design, we provide evidence that giving monthly unconditional cash transfers to mothers experiencing poverty in the first year of their children’s lives may change infant brain activity.
This research was also featured in the newspaper:
A study that provided poor mothers with cash stipends for the first year of their children’s lives appears to have changed the babies’ brain activity in ways associated with stronger cognitive development, a finding with potential implications for safety net policy.
The differences were modest — researchers likened them in statistical magnitude to moving to the 75th position in a line of 100 from the 81st — and it remains to be seen if changes in brain patterns will translate to higher skills, as other research offers reason to expect.
Still, evidence that a single year of subsidies could alter something as profound as brain functioning highlights the role that money may play in child development and comes as President Biden is pushing for a much larger program of subsidies for families with children.
I actually don’t think a difference of 6 percentile points would be so modest, if there were really evidence that it was happening. I mean, sure, you’re not turning all these kids’ lives around, but it’s the usual story. First, they’re just paying people $333 a month: this is about 20% of the average income of the mothers in the sample, so it’s a lot, but not massive in all cases. Second, the effect will vary: a 6-percentile-point change on average could correspond to roughly an 18-percentile-point change for a third of the people. The point is that if there really were an average effect of 6 percentile points, I’d consider that to be respectable.
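To spell out that arithmetic (my illustrative numbers, not anything reported in the paper): if the effect were zero for two-thirds of the kids and concentrated in the remaining third, the average works out as

```latex
% Hypothetical decomposition of a 6-percentile-point average effect
\tfrac{2}{3}\cdot 0 \;+\; \tfrac{1}{3}\cdot 18 \;=\; 6 \text{ percentile points.}
```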
The other issue that was raised was external validity: you can change brain activity but will this change people’s lives? The argument in the published paper is that these brain activity patterns are associated with various good things—that’s the “as other research offers reason to expect” bit from the news report.
Here, though, I want to ask a more basic question: Do the data in the paper support the claim of “important evidence of the effects of increased income”?
At first glance
Are the paper’s claims supported by its evidence? From one perspective the answer would seem obviously to be yes, as it was published in a top journal, and it has this impressive figure:
According to the preregistration, they expected a positive effect in the alpha and gamma bands and a negative effect in the theta band, which is exactly what they found, so that looks good. On the other hand, I don’t see any uncertainty bounds on this graph . . . we’ll get back to this point.
Also, if you read the abstract carefully, the claims are kind of hedged. Not, “We found an effect on brain activity” or “Giving money causes changes in brain activity,” but “using a rigorous randomized design, we provide evidence that giving monthly unconditional cash transfers to mothers experiencing poverty in the first year of their children’s lives may change infant brain activity.” No problem with the randomized design, but the “may” in the sentence is telling. And there’s this:
The preregistered plan was to look at both absolute and relative measures of alpha, gamma, and theta (beta was only included later; it was not in the preregistration). All the differences go in the right direction; on the other hand, when you look at the six preregistered comparisons, the best p-value was 0.04 . . . after adjustment it became 0.12 . . . Anyway, my point here is not to say that there’s no finding just because there’s no statistical significance; I can just see now why there’s all that careful language in the abstract and the rest of the paper. Without a clean p-value, you don’t say you discovered an effect. You say you “may” have discovered something, or that the results are “suggestive,” or something like that. So they followed those rules.
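The paper and its supplement describe the exact correction used; just to make the mechanics concrete, here is a minimal sketch of a multiple-comparison adjustment across six comparisons, using the Benjamini-Hochberg false-discovery-rate procedure as one common choice. Only the 0.04 below comes from the paper; the other p-values are placeholders, so don’t read anything into the printed output.

```python
# Illustrative multiple-comparison adjustment over six preregistered
# comparisons. Only the 0.04 is from the paper; the rest are placeholders,
# and the paper's own adjustment procedure may differ from the one used here.
from statsmodels.stats.multitest import multipletests

pvals = [0.04, 0.10, 0.15, 0.20, 0.30, 0.50]   # placeholder values

# Benjamini-Hochberg FDR correction (one common choice)
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject, p_adj)
```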
Looking in detail
Before going on, I want to thank the article’s authors, Sonya Troller-Renfree, Molly Costanzo, Greg Duncan, Katherine Magnuson, Lisa Gennetian, Hirokazu Yoshikawa, Sarah Halpern-Meekin, Nathan Fox, and Kimberly Noble. Along with their article they include comprehensive supplementary material (including preregistration information) and access to all their data! This is a rare published research article where I can figure out what was really done.
I was wondering what was going on with the figure shown above, so I downloaded the data, which was easy! And I took a look. I know basically nothing about studies of brain activity, so I took the data as given.
My first plan was to follow their approach of separate analyses for each of the brain bands (theta, alpha, beta, and gamma), but instead of straight differences, I’d first take the log—all the measurements are positive and it would seem reasonable to start with proportional effects—and, most importantly, include pre-test brain activity as a predictor. Thus, for each frequency band, fit y ~ z + x, where y = outcome (log brain activity), z = treatment indicator, x = pre-treatment measure. Then try including the interaction of x and z. That’s our usual workflow. And I’d plot y vs. x with blue dots for the controls (z=0) and red dots for the treated kids (z=1).
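Here is a minimal sketch of that intended workflow, under made-up column names (log_power, treat, pre_power); the pre_power column is hypothetical, since, as noted below, the study has no pre-treatment EEG measurement:

```python
# Sketch of the intended per-band analysis: regress log brain activity on the
# treatment indicator and a (hypothetical) pre-treatment measurement, then
# also allow a treatment-by-pretest interaction. Column names are invented.
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

def fit_band(df: pd.DataFrame):
    m1 = smf.ols("log_power ~ treat + pre_power", data=df).fit()   # y ~ z + x
    m2 = smf.ols("log_power ~ treat * pre_power", data=df).fit()   # add the z:x interaction
    return m1, m2

def plot_band(df: pd.DataFrame):
    # Blue dots for controls (treat = 0), red dots for treated kids (treat = 1)
    colors = df["treat"].map({0: "blue", 1: "red"})
    plt.scatter(df["pre_power"], df["log_power"], c=colors, s=10)
    plt.xlabel("pre-treatment log power")
    plt.ylabel("log power at age 1")
```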
But I couldn’t follow that plan, because . . . there were no pre-treatment measurements of brain activity. I guess that makes sense: they weren’t gonna do these measurements on newborns! So, no pre-test. The study does have some individual demographic and socioeconomic variables, and I guess it makes sense to include them in the model, but I can’t imagine them having huge predictive power.
So let’s take a look at the data and see what we’ve got. The preregistration talked about two measures: Absolute and Relative Power. In my little analysis here I looked at the Absolute Power measure, because that is what the authors seemed to be focusing on in their paper. So here are the raw data for the 435 children in the study: first the log measurements themselves, then relative to the mean, then the averages, relative to the mean at each frequency, for the treated (red) and control (blue) groups:
Sorry about the blurry graph; I’m just trying to get the point across without spending too much time struggling with the different graphics formatting.
Anyway, the two groups of children almost entirely overlap, except for the three blue curves at the bottom of the graph. One of them has really low values. The average curve looks similar to what was in the published paper, and we see average differences as high as 7%, which isn’t nothing. There is a question of whether such a large difference could’ve arisen just by chance . . . we’ll get to that in a moment.
But first let’s follow the authors’ lead and go back to the original, unlogged scale. First the raw data, then the z-scores (y - y_bar)/s_y, where y_bar is the average over the children in the data and s_y is the sd over the children in the data, and we’re doing this normalization separately for each frequency:
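For concreteness, the per-frequency standardization looks like this, assuming a wide data frame with one row per child and one column per frequency bin (my assumed layout, not necessarily how the shared data file is organized):

```python
# Standardize each frequency bin across the 435 children:
# z = (y - mean over children) / (sd over children), column by column.
import pandas as pd

def zscore_by_frequency(power: pd.DataFrame) -> pd.DataFrame:
    return (power - power.mean(axis=0)) / power.std(axis=0)
```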
This looks like what they found! Again, full credit for a rare case of clean data sharing. I wish in the published article they’d shown the color graph with all 435 paths. It would’ve been easy enough to do.
At this point, I think it would make sense to construct a predictor using health, demographic, and socioeconomic variables measured before the treatment, and maybe I’ll get to that, but first let me move to the question of sampling variability. N=435 is a pretty large number of kids, but as we can see in the above graphs, there’s a lot of overlap between the red and blue curves, so as a baseline it would be good to see what could happen by chance alone.
To get the chance distribution, my first thought was to just look at sd/sqrt(n), and I guess that n is large enuf that the normal approximation would do the trick, but since we have the raw data right here I’ll just permute the 435 treatment assignments (keeping the same observations but randomly permuting the treatment assignment variable) and see what happens. I’ll reproduce the rightmost graph just above, and to see what might happen I’ll do it 9 times:
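A minimal version of that permutation exercise might look like the sketch below, again under an assumed layout of a (children x frequency bins) array of power values plus a 0/1 treatment vector:

```python
# Permutation check: keep the outcomes fixed, shuffle the treatment labels,
# recompute the treated-minus-control average curve, and do this 9 times to
# see what the comparison looks like under chance alone.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1234)

def group_difference(power, treat):
    # power: (n_children, n_freq) array; treat: length-n_children 0/1 array
    return power[treat == 1].mean(axis=0) - power[treat == 0].mean(axis=0)

def permutation_plots(power, treat):
    # 9 panels, each showing the difference curve under a random relabeling
    fig, axes = plt.subplots(3, 3, sharex=True, sharey=True)
    for ax in axes.flat:
        perm = rng.permutation(treat)
        ax.plot(group_difference(power, perm))
    fig.suptitle("Treated minus control averages under random relabeling")
    return fig
```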
Hmmmm . . . the patterns in these random permutations don’t look so different, either qualitatively or quantitatively, from what we saw in the actual comparison.
Measurement and statistical power
What’s going on, then? The simplest summary is that there’s a reason they didn’t find statistical significance, however measured: the data are consistent with no effect. As usual, the way we think of this is that there’s a lot of variation between people, and so even with a moderate sample size it will be difficult to detect small average differences. The authors of the paper wrote that the study was powered to detect differences of 0.21 standard deviations, but 0.21 standard deviations is a lot when you consider all the differences between children, along with the many factors that affect them before birth and in the first year. If the true effect is, say, 0.07 standard deviations, then this study just isn’t up to finding it, at least not using this direct approach of calculating averages or running simple regressions.

The authors also average over frequency bins, which seems like a good idea, but it doesn’t help as much as you might think, because the individual paths are so highly autocorrelated. It also doesn’t help to analyze the data on the original, unlogged scale; I think the log scale would make more sense, if you could figure out what was happening with that one kid with the really low measurements. These are small things, though. In the absence of a pre-test measurement, or more granularity in some other way, it just doesn’t seem like there’s enough going on in these data for any average treatment effect to show up.
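As a rough sanity check on those numbers, here is the textbook power calculation for a simple two-group comparison of means; the 50/50 split, two-sided 5% test, and absence of covariate adjustment are my assumptions, not necessarily those of the paper’s own power analysis:

```python
# Rough power arithmetic for a simple two-group comparison with 435 children
# total. Assumptions (mine, not the paper's): equal group sizes, two-sided
# alpha = 0.05, plain difference in means with no covariate adjustment.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

power_021 = analysis.power(effect_size=0.21, nobs1=217, alpha=0.05, ratio=1.0)
power_007 = analysis.power(effect_size=0.07, nobs1=217, alpha=0.05, ratio=1.0)

print(power_021, power_007)   # the 0.07-sd scenario comes out far below 80% power
```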
Another way of saying this is that, to the extent that there is an effect, we would anticipate that effect to be highly variable, with some kids benefiting much more than others.
My impression from reading the article and the quotes in the newspaper was that the researchers were like, yeah, sure, it’s not quite statistically significant, but that’s just kind of a technicality, because if you get enough data your p-values will go down. But that’s not right! I mean, yeah, it is correct that as you get more data your p-values will go down—but you don’t know where the estimate is going to end up. The effect might be negative, not positive, or it could just look patternless.
There are, I assume, good theoretical reasons that this treatment could have an effect on brain patterns and learning ability—giving a few hundred extra dollars a month to a poor person can make a difference, and I’ll take the researchers’ word for it on the relevance of this particular measure as a proxy for some ultimate cognitive outcome of interest. But, again, it would not be at all a surprise for average effects to be small and to show patterns much different from those expected by the researchers.
So I think the main message from the data so far is not that there’s evidence for an effect, and not that the effect is zero (not statistically significant != no effect), but just that, given the design of the study, the data are too noisy to learn anything useful about the effects of this particular treatment on this particular outcome.
Where does that leave us? This study was a supplement applied to a subset of kids in a larger study of 1000 children. In retrospect, maybe this was all a waste of effort—but I guess you couldn’t know this ahead of time. If there were strong theoretical reasons to believe in an effect size of 0.21 standard deviations, then with careful statistical analysis this might’ve all turned out ok. And, once the data have been collected, it’s great that they are being shared. It’s also possible that useful things will be learned from later waves of the study. Recall that my big disappointment when considering the statistical analysis was that there were no pre-treatment EEGs at age 0. We can’t go back in time, but it should be possible to do future comparisons.
I always recommend including a pre-test in the model, but it’s especially relevant here, given that a key part of this research involves the supposition that the EEG spectrum can be considered an important descriptor of a child. I think that would imply that the spectrum, or aspects of it, are stable over time, so that adjusting for a pre-test (maybe using simple linear regression, maybe some more sophisticated analysis) would give a huge benefit when trying to observe treatment effects. It often seems that the clean causal identification arising from randomized experiments leads researchers not to think carefully enough about including pre-tests in their designs and analyses. I’m speaking in generalities here, as I have no idea whether it would’ve been feasible to perform EEGs on newborns.
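To see how much a pre-test could be worth, here is a toy simulation, with entirely invented numbers, illustrating the general point that adjusting for a baseline measurement that correlates with the outcome shrinks the standard error of the estimated treatment effect by roughly a factor of sqrt(1 - rho^2):

```python
# Toy simulation (invented numbers): how much does adjusting for a pre-test
# that correlates rho with the outcome tighten the treatment-effect estimate?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n, rho, true_effect = 435, 0.7, 0.07          # hypothetical values

pre = rng.normal(size=n)
treat = rng.integers(0, 2, size=n)
post = rho * pre + np.sqrt(1 - rho**2) * rng.normal(size=n) + true_effect * treat
df = pd.DataFrame({"pre": pre, "treat": treat, "post": post})

unadjusted = smf.ols("post ~ treat", data=df).fit()
adjusted = smf.ols("post ~ treat + pre", data=df).fit()
print(unadjusted.bse["treat"], adjusted.bse["treat"])   # the adjusted se is smaller
```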
Finally, there is more in the study that I have not discussed. In particular, section 6 of the supplementary material presents data broken down by brain region, yielding results that the authors find to be consistent with their story of what is going on. I guess it might be possible to study this more carefully using a multilevel model. So for now I do not find the data convincing, but it is possible that a fuller analysis, along with the new data that will come in the future, will clarify some of these issues.
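If I were to take a crack at the by-region data, a minimal version of that multilevel idea might be something like the following, with a long-format data frame (one row per child by region by band) and invented column names; in practice I’d want partial pooling across regions and bands in a fully Bayesian fit, but the sketch conveys the structure:

```python
# Sketch of a multilevel analysis of power broken down by brain region:
# varying intercepts by child, with treatment effects allowed to differ by
# region and frequency band. Data layout and column names are assumptions.
import statsmodels.formula.api as smf

def fit_multilevel(df):
    model = smf.mixedlm("power ~ treat * region * band",
                        data=df, groups=df["child_id"])
    return model.fit()
```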