Maggie Fox writes:
Brain scans may be able to predict what you will do better than you can yourself . . . They found a way to interpret “real time” brain images to show whether people who viewed messages about using sunscreen would actually use sunscreen during the following week.
The scans were more accurate than the volunteers were, Emily Falk and colleagues at the University of California Los Angeles reported in the Journal of Neuroscience. . . .
About half the volunteers had correctly predicted whether they would use sunscreen. The research team analyzed and re-analyzed the MRI scans to see if they could find any brain activity that would do better.
Activity in one area of the brain, a particular part of the medial prefrontal cortex, provided the best information.
“From this region of the brain, we can predict for about three-quarters of the people whether they will increase their use of sunscreen beyond what they say they will do,” Lieberman said.
“It is the one region of the prefrontal cortex that we know is disproportionately larger in humans than in other primates,” he added. “This region is associated with self-awareness, and seems to be critical for thinking about yourself and thinking about your preferences and values.”
Hmm . . . they “analyzed and re-analyzed the scans to see if they could find any brain activity” that would predict better than 50%?! This doesn’t sound so promising. But maybe the reporter messed up on the details . . .
I took advantage of my library subscription to take a look at the article, “Predicting Persuasion-Induced Behavior Change from the Brain,” by Emily Falk, Elliot Berkman, Traci Mann, Brittany Harrison, and Matthew Lieberman. Here’s what they say:
– “Regions of interest were constructed based on coordinates reported by Soon et al. (2008) in MPFC and precuneus, regions that also appeared in a study of persuasive messaging.” OK, so they picked two regions of interest ahead of time. They didn’t just search for “any brain activity.” I’ll take their word for it that they looked only at these two, and that they didn’t actually look at 50 regions and then report just two of them.
– Their main result had a t-statistic of 2.3 (on 18 degrees of freedom, thus statistically significant at the 3% level) in one of the two regions they looked at, and a t-statistic of 1.5 (not statistically significant) in the other. A simple multiple-comparisons correction takes the p-value of 0.03 and bounces it up to an over-the-threshold 0.06, which I think would make the result unpublishable! On the other hand, a simple average gives a healthy t-statistic of (1.5+2.3)/sqrt(2) = 2.7, although that ignores any possible correlation between the two regions (they don’t seem to supply that information in their article).
– They also do a cross-validation but this seems 100% pointless to me since they do the cross-validation on the region that already “won” on the full data analysis. For the cross-validation to mean anything at all, they’d have to use the separate winner on each of the cross-validatory fits.
– As an outcome, they use before-after change. They should really control for the “before” measurement as a regression predictor. That’s a freebie. And, when you’re operating at a 6% significance level, you should take any freebie that you can get! (It’s possible that they tried adjusting for the “before” measurement and it didn’t work, but I assume they didn’t do that, since I didn’t see any report of such an analysis in the article.)
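Here’s the kind of adjustment I have in mind, as a minimal sketch on made-up data (the variable names, effect sizes, and simulation are all hypothetical; this is not the authors’ data or model):

set.seed(123)
n <- 20
sunscreen_before <- rnorm(n, mean = 3, sd = 1)   # hypothetical baseline sunscreen use
brain_signal     <- rnorm(n)                     # hypothetical MPFC activity measure
sunscreen_after  <- 0.5 * sunscreen_before + 0.3 * brain_signal + rnorm(n)
change <- sunscreen_after - sunscreen_before

# Change-score analysis (outcome is the before-after difference):
summary(lm(change ~ brain_signal))

# The "freebie": use the after measurement as the outcome and adjust for
# the before measurement as a regression predictor:
summary(lm(sunscreen_after ~ sunscreen_before + brain_signal))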
The bottom line
I’m not saying that the reported findings are wrong, I’m just saying that they’re not necessarily statistically significant in the usual way this term is used. I think that, in the future, such work would be improved by more strongly linking the statistical analysis to the psychological theories. Rather than simply picking two regions to look at, then taking the winner in a study of n=20 people, and going from there to the theories, perhaps they could more directly model what they’re expecting to see.
The difference between . . .
Also, the difference between “significant” and “not significant” is not itself statistically significant. How is this relevant in the present study? They looked at two regions, MPFC and precuneus. Both showed positive correlations, one with a t-value of 2.3, one with a t-value of 1.5. The first of these is statistically significant (well, it is, if you ignore that it’s the maximum of two values), the second is not. But the difference is not anything close to statistically significant, not at all! So why such a heavy emphasis on the winner and such a neglect of #2?
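To put a rough number on that (assuming the two estimates are independent with roughly equal standard errors): the z-score for the difference between the two regions is about (2.3 - 1.5)/sqrt(2) = 0.57, nowhere near any conventional threshold.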
Here’s the count from a simple document search:
MPFC: 20 instances (including 2 in the abstract)
precuneus: 8 instances (0 in the abstract)
P.S. The “picked just two regions” bit gives a sense of why I prefer Bayesian inference to classical hypothesis testing. The right thing, I think, is actually to look at all 50 regions (or 100, or however many regions there are) and do an analysis including all of them. Not simply picking the region that is most strongly correlated with the outcome and then doing a correction (that’s not the most statistically efficient thing to do; you’re just asking, begging to be overwhelmed by noise), but rather using the prior information about regions in a subtler way than simply picking out 2 and ignoring the other 48. For example, you could have a region-level predictor that represents prior belief in the region’s importance. Or you could group the regions into a few pre-chosen categories and then estimate a hierarchical model with each group of regions being its own batch, with group-level mean and standard deviation estimated from data. The point is, you have information you want to use (prior knowledge from the literature) without it unduly restricting the possibilities for discovery in your data analysis.
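To give a flavor of that second option, here’s a small sketch on simulated data (the grouping, the effect sizes, and the use of lme4 are all my own assumptions; nothing here comes from the paper):

library(lme4)

set.seed(1)
n_regions <- 50
region_group <- sample(c("self", "memory", "other"), n_regions, replace = TRUE)
group_mean   <- c(self = 0.3, memory = 0.1, other = 0)
true_effect  <- group_mean[region_group] + rnorm(n_regions, 0, 0.1)
estimate     <- true_effect + rnorm(n_regions, 0, 0.2)   # noisy per-region estimates

# Each pre-chosen group of regions is its own batch; the group-level mean and
# standard deviation are estimated from the data, and the group estimates are
# partially pooled toward the overall mean.
fit <- lmer(estimate ~ 1 + (1 | region_group))
summary(fit)
coef(fit)$region_group   # shrunken group-level estimates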
Near the end, they write:
In addition, we observed increased activity in regions involved in memory encoding, attention, visual imagery, motor execution and imitation, and affective experience with increased behavior change.
These were not pre-chosen regions, which is fine, but at this point I’d like to see the histogram of correlations for all the regions, along with a hierarchical model that allows appropriate shrinkage. Or even a simple comparison to the distribution of correlations one might expect to see by chance. By suggesting this, I’m not trying to imply that all the findings in this paper are due to chance; rather, I’m trying to use statistical methods to subtract out the chance variation as much as possible.
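For instance, here’s a quick simulation of pure noise (not the paper’s data) showing how big correlations can get by chance with n = 20 subjects and many regions:

set.seed(2)
n <- 20          # subjects
n_regions <- 50  # a hypothetical number of regions examined

# Correlations between pure-noise "activity" and a pure-noise outcome:
null_cors <- replicate(10000, cor(rnorm(n), rnorm(n)))
hist(null_cors, breaks = 50, main = "Correlations expected by chance alone, n = 20")

# Typical size of the largest |correlation| across 50 noise-only regions:
max_cors <- replicate(1000, max(abs(replicate(n_regions, cor(rnorm(n), rnorm(n))))))
mean(max_cors)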
P.P.S. Just to say this one more time: I’m not at all trying to claim that the researchers are wrong. Even if they haven’t proven anything in a convincing way, I’ll take their word for it that their hypothesis makes scientific sense. And, as they point out, their data are definitely consistent with their hypotheses.
P.P.P.S. For those who haven’t been following these issues, see here, here, here, and here.
Part of my concern with neuroscience studies is that the definition of what constitutes a "region" does not seem to be fixed. In some cases, it seems to correspond exactly to a physiologically defined area, but in others it seems to be roughly "that area within the anatomical region of interest that (on the averaged subtracted image) is most brightly colored". It seems to me that such flexibility would play havoc with statistical inferences, but maybe there are techniques to deal with it. Or maybe the analyses are always based on activity in an anatomically defined region – it's just my understanding of brain anatomy that's deficient.
Nevertheless, I'd be interested in finding out how the region of interest is defined here.
Morgan: In this case, the two regions appear to have been strictly defined ahead of time.
Thank you for the reply.
Also for the pointers in the P.P.P.S. Instant gratification to see that my concerns have already been looked into. And as a bonus the links reminded me of the existence of the word "voxel". Always liked that one.
Question: if you analyze this with a hierarchical model, wouldn't you then be making some strong assumptions about the distribution across these regions? In such a messy system as the brain, would you be confident that you're using the "right" distribution?
Gustaf: Sure, but what about the implicit model corresponding to the selection of statistically-significant correlations in two regions? Where does that leave all the others? I'd prefer to form a model, and then check it, than to make no assumptions and thus restrict myself to a very crude analysis of data. Or, to put it another way, I'd be happy with a non-model-based approach if it were to make good use of available scientific information. I think there must be something better than simply picking two regions and setting all the others aside (only to pick them up at the very end in a non-quantitative analysis).
Hey, n=20 is big in this literature! A lot of fMRI studies have far fewer subjects; I have seen papers with n=4, though there may be multiple observations per individual. The resolution (the size of the voxel) depends on the machine, among other things, and can be around a few cubic millimetres; it's gradually getting smaller.
There have been some serious questions raised about this type of methodology, for example:
http://www.mindhacks.com/blog/2008/06/the_fmri_sm…
…see also this fantastic poster by Craig Bennett; it shows up insufficient consideration of multiple comparisons in fMRI data – with a dead fish.
As for beating a dead fish: to try to get this across to some medical students studying gene expression and heart attacks about 5 years ago, we agreed with them to scramble the rows within each patient (some had had heart attacks and some had not) before doing the analysis.
But then, when the false-positive associations showed up, we had a really hard time stopping them from coming up with biological explanations for them. (OK, you can't reject a hypothesis based on the way it was generated, but they did seem overly interested/attached …)
As with other recent areas where multiple comparisons are so potentially hazardous, in addition to better analysis of what was done in the individual study, locating and considering other studies, and especially planning and conducting replication studies, will surely also be required.
K?
What the heck kind of average is this!
"(1.5+2.3)/sqrt(2) = 2.7"
This makes no sense to me as a useful measure:
> (50+100)/sqrt(2)
[1] 106.0660
> (50+1000)/sqrt(2)
[1] 742.4621
> (1+10)/sqrt(2)
[1] 7.778175
>
Please explain
Patrick:
I'm averaging t-scores. (1.5+2.3)/2 = 1.9, but then, since it's an average of two (roughly independent) estimates with about equal standard errors, the standard error of the average shrinks by a factor of sqrt(2), so the t-score scales up by sqrt(2): 1.9*sqrt(2) = 2.7.
Is that supposed to represent the t-score on the joint hypothesis test, essentially H0: y ~ 1 vs. H1: y ~ x1 + x2 + 1?
If so, it should be strictly higher than the test on just one of those variables. However:
> (1+10)/sqrt(2)
[1] 7.778175
Sorry, this is just not so obvious to me.
Patrick: See chapter 2 of ARM. The short answer is that I'm talking about computing averages. I never do the sort of hypothesis testing that you're talking about. Also, it's not true that the t-score for an average of two studies must be larger than both of the individual t-scores.
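To spell out the arithmetic (a sketch assuming two independent estimates with equal standard errors, which is all the back-of-the-envelope average uses):

# Two estimates with the same standard error, combined by simple averaging.
se   <- 1
est1 <- 1.5 * se            # an estimate with t-score 1.5
est2 <- 2.3 * se            # an estimate with t-score 2.3

avg    <- (est1 + est2) / 2
se_avg <- se / sqrt(2)      # SE of the average of two independent estimates
avg / se_avg                # t-score of the average: 2.69, i.e. (1.5 + 2.3)/sqrt(2)

# And the averaged t-score need not beat both of the individual t-scores:
(1 + 10) / sqrt(2)          # 7.78, which is less than 10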