Let’s do preregistered replication studies of the cognitive effects of air pollution—not because we think existing studies are bad, but because we think the topic is important and we want to understand it better.

In the replication crisis in science, replications have often targeted controversial studies on silly topics such as embodied cognition, extra-sensory perception, and power pose.

We’ve been talking recently about replication being something we do for high-quality studies on important topics. That is, the point of replication is not the hopeless endeavor of convincing ESP scholars etc. that they’re barking up the wrong tree, but rather to learn about something we really care about.

With that in mind, I suggest that researchers perform some careful preregistered replications of some studies of air pollution and cognitive function.

I thought about this after receiving this email from Russ Roberts:

You may have seen this. It is getting a huge amount of play with people amazed and scared by how big the effects are on cognition.

THIRTY subjects. Most results not significant.

If you feel inspired to write on it, please do…

What I found surprising was how many smart people I know have been taken by how large the effects are. Mostly economists. I have now become sensitized to be highly skeptical of these kinds of findings. (Perhaps too skeptical, but put that to the side…)

Patrick Collison, not an economist, but CEO of Stripe and a very smart person, posted this, which has been picked up and spread by The Browser which has a wide and thoughtful audience and by Collison on twitter. Collison’s piece is a brief list of other studies that “confirm” the cognitive losses due to air pollution.

My general reaction (using one of the studies) is that if umpires do a dramatically worse job on high pollution days because their brains are muddled by pollution, there must have been a massive (and noticeable) improvement in accuracy over the last 40 years as particulate matter has fallen in the US. Same with chess players—another study—there should be many more grandmasters and the quality of chess play overall in the US should be dramatically improved.

The big picture

There are some places where I agree with Roberts and some places where I disagree. I’ll go through all this in a moment, but first I want to set out the larger challenges that we face in this sort of problem.

I agree on the general point that we should be skeptical of large claims. A difficulty here is that the claims come in so fast: There’s a large industry of academic research producing millions of scientific papers a year, and on the other side there are about 5 of us who might occasionally look at a paper critically. A complicating factor here is that some of these papers are good, some of the bad papers can have useful data, and even the really bad papers have some evidence pointing in the direction of their hypotheses. So the practice of reading the cited papers is just not scalable.

Even in the above little example, Collison links to 9 articles, and it’s not like I have time to read all of them. I skimmed through the first one (The Impact of Indoor Climate on Human Cognition: Evidence from Chess Tournaments, by Steffen Künn, Juan Palacios, and Nico Pestel) and it seemed reasonable to me.

Speaking generally, another challenge is that if we see serious problems with a paper (as with the article Roberts sent, which I discuss in the P.S. below), we can set it aside. The underlying effect might be real, but that particular study provides no evidence. But when a paper seems reasonable (as with the article on chess performance), it could just be that we haven’t noticed the problems yet. Recall that the editors of JPSP didn’t see the huge (in retrospect) problems with Bem’s ESP study, and recall that Arthur Conan Doyle didn’t realize that the Cottingley fairy photos were faked.

To get back to Roberts’s concerns: I have no idea what the effects of air pollution on cognitive function are. I really just don’t know what to think. I guess the way that researchers are moving forward on this is to look at various intermediate outcomes such as blood flow to the brain.

To step back: on one hand, the theory here seems plausible; on the other hand, I know about all the social and statistical reasons why we should expect effect size estimates to be biased upward. There’s a naive view that associates large type S and type M errors with crappy science of the Wansink variety, but even carefully reviewed studies published in top journals by respected researchers have these problems.
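
To see what this looks like quantitatively, here is a quick simulation sketch of type S and type M errors, in the spirit of the retrodesign calculation that John Carlin and I wrote about. The assumed true effect (0.1) and standard error (0.25) are made-up numbers in arbitrary units, chosen only to illustrate a low-power setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed values, for illustration only (not from any study discussed here):
true_effect = 0.1   # true effect, in arbitrary outcome units
se = 0.25           # standard error of the study's estimate
n_sims = 100_000

# Sampling distribution of the estimate around the assumed true effect.
est = rng.normal(true_effect, se, n_sims)
signif = np.abs(est) > 1.96 * se   # draws that reach p < 0.05

power = signif.mean()                              # chance of significance
type_s = (est[signif] < 0).mean()                  # wrong sign, given significance
type_m = np.abs(est[signif]).mean() / true_effect  # exaggeration ratio

print(f"power: {power:.2f}")
print(f"type S (wrong sign) rate among significant results: {type_s:.2f}")
print(f"type M (exaggeration) factor: {type_m:.1f}")
```

With these made-up numbers, power is about 7%, roughly 13% of the statistically significant estimates have the wrong sign, and the significant estimates exaggerate the assumed true effect by about a factor of 6. That is how a small noisy study can deliver a huge, headline-friendly effect.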

Preregistered replication to the rescue

So we’re at an impasse. Plausible theories, some solid research articles with clear conclusions, but this is all happening in a system with systematic biases.

This is where careful preregistered replication studies can come in. The point of such studies is not to say that the originally published findings “replicated” or “didn’t replicate,” but rather to provide new estimates that we can use, following the time-reversal heuristic.
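
To make that concrete, here is a minimal sketch (in Python, with hypothetical placeholder numbers) of what “new estimates that we can use” might look like: a simple inverse-variance pooling of the original published estimate with the preregistered replication estimate. In practice I’d want something more elaborate, for example a hierarchical model that allows the original, non-preregistered estimate to be biased upward by selection on significance, but the basic idea is to treat the replication as evidence to be combined, not as a pass/fail verdict:

```python
import numpy as np

# Hypothetical placeholder numbers, not from any of the studies above:
# a noisy original estimate and a more precise preregistered replication.
estimates = np.array([0.50, 0.08])   # [original, replication], outcome units
ses       = np.array([0.20, 0.10])   # their standard errors

# Fixed-effect (inverse-variance) pooling.
w = 1.0 / ses**2
pooled = np.sum(w * estimates) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

print(f"pooled estimate: {pooled:.2f} (se {pooled_se:.2f})")
```

With these placeholder numbers the more precise replication dominates the pooled result; in the spirit of the time-reversal heuristic, you can think of the clean replication estimate as the baseline and then ask whether the noisy original study shifts it much.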

Again, the choice to perform the replication should be considered as a sign of respect for the original studies: that they are high enough quality, and on an important enough topic, to motivate the cost and effort of a replication.

Getting into the details

1. I agree with Roberts that the first study he links to has serious problems. I’ll discuss these below the fold, but the short story is that I see no reason to believe any of it. I mean, sure, the substantive claims might be true, but if the estimates in the article are correct, it’s really just by accident. I can’t see the empirical analysis adding anything to our understanding. It’s not as bad as that beauty-and-sex-ratio study which, for reasons of statistical power, was doomed from the start—but given what’s reported in the published paper, the data are too noisy to be useful.

2. As noted above, I looked quickly at the first paper on Collison’s list and I saw no obvious problems. Sure, the evidence is only statistical—but we sometimes can learn from statistical evidence. For reasons of scalability (see above discussion), I did not read the other articles on the list.

3. I’d like to push against a couple of Roberts’s arguments. Roberts writes:

If umpires do a dramatically worse job on high pollution days because their brains are muddled by pollution, there must have been a massive (and noticeable) improvement in accuracy over the last 40 years as particulate matter has fallen in the US.

Actually, I expect that baseball umpires have been getting much more accurate over the past 40 years, indeed over the past century. In this case, though, I’d think that economics (baseball decisions are worth more money), sociology (the increasing professionalization of all aspects of sports), and technology (umpires’ mistakes are clear on TV) would all push in that direction. I’d guess that air pollution is minor compared to these large social effects. In addition, the findings of these studies are relative, comparing people on days with more or less pollution. A rise or decline in the overall level of pollution, that’s different: it’s perfectly plausible that umps do worse on polluted days than on clear days because their bodies are reacting to an unexpected level of strain, and the same effect would not arise from higher pollution levels every day.

Roberts continues:

Same with chess players . . . there should be many more grandmasters and the quality of chess play overall in the US should be dramatically improved.

Again, I think it’s pretty clear that the quality of chess play overall has improved, at least at the top level. But, again, any effects of pollution would seem to be minor compared to social and technological changes.

So I feel that Roberts is throwing around a bit too much free-floating skepticism.

P.S. As promised, here are my comments on the first paper that Roberts linked to, that I do think has problems.

Here’s the abstract:

This paper assesses the effect of short-term exposure to particulate matter (PM) air pollution on human cognitive performance via a double cross over experimental design. Two distinct experiments were performed, both of which exposed subjects to low and high concentrations of PM. Firstly, subjects completed a series of cognitive tests after being exposed to low ambient indoor PM concentrations and elevated PM concentrations generated via candle burning, which is a well-known source of PM. Secondly, a different cohort underwent cognitive tests after being exposed to low ambient indoor PM concentrations and elevated ambient outdoor PM concentrations via commuting on or next to roads. Three tests were used to assess cognitive performance: Mini-Mental State Examination (MMSE), the Stroop Color and Word test, and Ruff 2 & 7 test. The results from the MMSE test showed a statistically robust decline in cognitive function after exposure to both the candle burning and outdoor commuting compared to ambient indoor conditions. The similarity in the results between the two experiments suggests that PM exposure is the cause of the short-term cognitive decline observed in both. The outdoor commuting experiment also showed a statistically significant short-term cognitive decline in automatic detection speed from the Ruff 2 and 7 selective attention test. The other cognitive tests, for both the candle and commuting experiments, showed no statistically significant difference between the high and low PM exposure conditions. The findings from this study are potentially far reaching; they suggest that elevated PM pollution levels significantly affect short term cognition. This implies average human cognitive ability will vary from city to city and country to country as a function of PM air pollution exposure.

And here are the key results:

[the paper’s key results figure, not reproduced here]

Also this:

[the paper’s reported PM2.5 concentrations: mean 41.4, standard deviation 46.1 µg/m³]

A mean of 41.4 with a standard deviation of 46.1 . . . That implies that much of the time the concentration was pretty low. So let’s look at the result as a function of concentration:

[the paper’s Fig. 2: differential T-score plotted against PM2.5 concentration]

Nothing much going on. And this is their best result—none of the other outcomes are even statistically significant!

The paper includes a desperate attempt to put a positive spin on things:

There appears to be a tendency for subjects exposed to the highest PM2.5 mass concentrations during the candle burning test to have a greater reduction in test performance after exposure to candle burning, however, a linear regression does not provide a statistically robust gradient value (p = 0.610).

That’s a good one: indeed, I don’t think anyone would call p=0.610 a statistically robust finding!

The paper continues:

To further assess the apparent tendency, we compared the differential T-score to subjects who were exposed to a PM2.5 concentration above or below the daily WHO recommendation (25 µg/m³), see Fig. 3. The average differential T-score for when PM2.5 was significantly greater when the PM2.5 concentration is greater than the WHO recommendation compared to when it is less than the recommendation. The two distributions, with PM2.5 greater or less than the WHO recommendation, were not normally distributed, as assessed by the Kolmogorov-Smirnov normality test. Hence, the Mann-Whitney test was performed to compare the medians of the two groups different T-scores, the results showed that the p-value not adjusted for ties was 0.045, and adjusted for ties was 0.041. When the PM2.5 concentration was less than the WHO recommendation, the median differential T-score (=50) was significantly higher than the value obtained (=42) when the PM2.5 concentration was greater than the recommendation. This finding suggests that higher exposures to PM2.5 lead to a greater decline in short term cognitive performance. The seemingly non-linear relationship between cognition and PM2.5 concentration, see Fig. 2, suggests a threshold mass concentration of PM2.5 is required before cognitive decline is observed.

Wow. I mean, just wow. I don’t think I’ve ever seen this many forking paths in a single paragraph. Daryl Bem, Brian Wansink, step aside and watch how the pros do it!

Seriously, though, I suppose that the authors of this paper, as with so many other researchers, are trying their best and just have the idea that the goal of quantitative research is to find something, somewhere, that’s “statistically significant”—and then move to story time.
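
To give a sense of how easy it is for a search like this to succeed on pure noise, here is a simulation sketch. It loosely mimics the setup described above: about 30 subjects, a handful of cognitive outcomes, a skewed PM2.5 exposure distribution with mean and spread in the ballpark of the reported values, and the WHO cutoff of 25 µg/m³. The data are generated with no pollution effect at all, and each simulated dataset is then run through a small battery of analyses of the kind quoted above (a regression of score change on concentration, plus dichotomize-at-the-cutoff comparisons with both a t-test and a Mann-Whitney test). All distributional choices are my assumptions, for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_outcomes, n_sims = 30, 5, 2_000
who_cutoff = 25.0  # WHO daily PM2.5 recommendation, µg/m³

hits = 0
for _ in range(n_sims):
    # Skewed exposures, roughly matching the reported mean ~41, sd ~46.
    pm25 = rng.lognormal(mean=3.3, sigma=0.9, size=n_subjects)
    pvals = []
    for _ in range(n_outcomes):
        # Null model: score changes are pure noise, unrelated to PM2.5.
        score_change = rng.normal(0.0, 10.0, n_subjects)
        # Path 1: linear regression of score change on concentration.
        pvals.append(stats.linregress(pm25, score_change).pvalue)
        # Path 2: dichotomize at the WHO cutoff and compare groups.
        lo = score_change[pm25 <= who_cutoff]
        hi = score_change[pm25 > who_cutoff]
        if min(len(lo), len(hi)) >= 2:
            pvals.append(stats.mannwhitneyu(lo, hi, alternative="two-sided").pvalue)
            pvals.append(stats.ttest_ind(lo, hi).pvalue)
    if min(pvals) < 0.05:
        hits += 1

print(f"at least one p < 0.05 in {hits / n_sims:.0%} of null datasets")
```

Even with no effect anywhere, this modest battery turns up at least one p < 0.05 in a large fraction of the simulated datasets. That is why a single p = 0.045 at the end of such a search tells us very little.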

In some sense, though, the paper was a success, as it was featured in the London Times:

Traffic pollution damages commuters’ brains

Going to work can make you stupid, scientists have found, with your brain capacity falling sharply because of exposure to traffic pollution during the daily commute.

The researchers tested people in Birmingham before and after they travelled along busy roads during rush hour, and found their performance in cognitive tests was significantly lower after their journey. . . .

To be fair, I’m guessing that the Science Editor for this newspaper lives in London, so maybe he had a tough commute and his cognitive abilities were diminished at the time he was editing this piece.

Also, Ray Keene writes for the London Times, right? So it’s not like their standards are so damn high.

13 thoughts on “Let’s do preregistered replication studies of the cognitive effects of air pollution—not because we think existing studies are bad, but because we think the topic is important and we want to understand it better.”

  1. This example makes me think of another suggestion (I’m not sure I would say this should be employed generally, but it seems to be relevant to this case in my opinion).

    I think much of the problem is that the analysis tends to focus on a few variables when the problem is highly multidimensional. Wouldn’t it be better to seek carefully curated studies that collect appropriate multidimensional data without conducting the subsequent analysis? The credit could easily be provided by having an appropriate journal run a special issue consisting of peer-reviewed (yes, peer-reviewed) data sets. The data would then be available for others to focus on the analysis. All I am proposing is to shift the emphasis from trying to conclude whether (and how much) pollution impairs cognitive ability to an emphasis on what good data for this problem would look like. I think this also makes pre-registration largely irrelevant. What we want is high quality data, ideally from multiple sources, multidimensional enough to address many of the shortcomings in the above example. As long as the carrot is publishing a finding, we will see too many papers with too many forking paths and too little data (sample size and/or number of relevant variables collected) to have faith in the conclusions.

    • Dale:

      I agree, and that’s traditionally what they do in economics: the Bureau of Labor Statistics etc. collect big ongoing datasets and then learn what they can. Also the tradition in astronomy for many centuries: lots of record-keeping, then people can go back to the data. In social science, we have the National Election Study and the General Social Survey.

  2. If I understand him correctly, Russ Roberts is pointing to a general problem in academia, a sort of team psychology in research speculation. A publishes a paper making a particular claim, then B cites A in making the same sort of claim, then C cites A and B, etc. This would be normal, except C, D and so on tend to cite only the studies that agree with their story. Before long you have a pseudo-consensus that can be invoked to lend credence to minimally informative work or simply unconstrained speculation. This is a very big problem in parts of the humanities, but I’ve seen the same thing close up in economics.

    I think Roberts is saying, taken individually these studies are not very strong, but what is really worrisome is that, by appearing to support one another, they are being used as a group to demonstrate that the effects of air pollution on cognition are large and well demonstrated.

    Like you, Andrew, I sympathize with part of where he is coming from; the problem of mutual support groups is real. On the other hand, while it’s good to hang on to skepticism about large claims in general and certainly about weak research methods, over the long haul we have seen increased recognition of the health hazards of various sorts of pollution, such as small particulates. It’s plausible that cognitive function would reflect this, maybe liminally. (Several decades ago I took an interest in the effect of air pollution on traffic accidents in the LA basin, but I didn’t do a full study. That was a mean smog back then.)

    For me the bottom line is that, when reviewing prior research, it should be imperative to not cherry-pick. And in these small-scale lit reviews the authors have the same responsibility to curate the studies they cite as they would in a meta-analysis. Sometimes studies are cited as examples of a method or claim, and if that’s all the citation is doing, fine. But if the citation is used to buttress your own claim, there’s an obligation to evaluate it, since your claim depends (in part) on theirs. The culture of just amassing a wall of supportive references is problematic.

  3. I am prepared to believe that air pollution has a deleterious effect on cognition. Leaded gasoline is one culprit, as is carbon monoxide; maybe nitrogen oxides. But particulate matter? Really? I suppose that it is an irritant, and that ordinary people who are not used to cogitation might possibly be negatively affected. But chess masters, who are used to concentrated thinking? Do we think that burning a few candles in their vicinity would have much effect?

    I used to be a bridge expert. In fact, my partner and I won a national charity game in 1983, edging out a world championship pair by half a matchpoint. A lot of players smoked back then. Talk about particulate matter, I once attended a bridge tournament where the smoke was so thick that you couldn’t see all the way across the room!

    Over time, bridge tournaments in the US started having smoking and non-smoking sections in different rooms, which provided conditions for a kind of natural experiment, as many players played under both conditions. The particulate matter concentration from hundreds of cigarettes was surely quite high. Yet I know of no player who said that he or she played better in the non-smoking sections than in the smoking sections. If that made a difference, surely there would be anecdotal evidence of it. The only related anecdote I have heard of is that Nimzowitsch once complained about his chess opponent smoking a cigar. But air pollution may not be the reason, as he also later complained that his opponent, who had put out his cigar, looked like he wanted to smoke it. ;) It seems to me that the lack of anecdotal evidence is important.

    • I can tell you for sure that trying to play chess or bridge in a smoke filled room would have severely hindered my ability. I mean, after passing out from holding my breath, it’d be pretty hard to continue.

      I suspect self-selection bias is heavy in your natural experiment.

    • IMO the question isn’t about how well one does in one room or another on a given day. It’s whether the people who regularly inhale *particulate matter* have a performance decline over time due to the particulate matter. Smoking specifically, however, isn’t that simple, because nicotine is a very powerful stimulant and probably enhances cognitive ability in some way or another.

      • Thanks, Jim.

        Yes, the cumulative effect of particulate matter over time is an interesting question. Certainly lead and other heavy metals have a deleterious effect.

        As for nicotine being a stimulant, is there much of it in second hand smoke? If not, then the possible effect of second hand smoke on those who are not smoking should be pertinent.

  4. I wish there were more than five people reading papers critically.

    Just glancing at the paper on the chess tournaments, the paper cites two previous papers, both of which have n < 25. From the abstracts, at least, they're hardly overwhelmingly convincing, although there's no clear reason that they're wrong.
