## What sort of identification do you get from panel data if effects are long-term? Air pollution and cognition example.

Don MacLeod writes:

Perhaps you know this study which is being taken at face value in all the secondary reports: “Air pollution causes ‘huge’ reduction in intelligence, study reveals.” It’s surely alarming, but the reported effect of air pollution seems implausibly large, so it’s hard to be convinced of it by a correlational study alone, when we can suspect instead that the smarter, more educated folks are more likely to be found in polluted conditions for other reasons. They did try to allow for the usual covariates, but there is the usual problem that you never know whether you’ve done enough of that.

Assuming equal statistical support, I suppose the larger an effect, the less likely it is to be due to uncontrolled covariates. But also the larger the effect, the more reasonable it is to demand strongly convincing evidence before accepting it.

“Polluted air can cause everyone to reduce their level of education by one year, which is huge,” said Xi Chen at Yale School of Public Health in the US, a member of the research team. . . .

The new work, published in the journal Proceedings of the National Academy of Sciences, analysed language and arithmetic tests conducted as part of the China Family Panel Studies on 20,000 people across the nation between 2010 and 2014. The scientists compared the test results with records of nitrogen dioxide and sulphur dioxide pollution.

They found the longer people were exposed to dirty air, the bigger the damage to intelligence, with language ability more harmed than mathematical ability and men more harmed than women. The researchers said this may result from differences in how male and female brains work.

The above claims are indeed bold, but the researchers seem pretty careful:

The study followed the same individuals as air pollution varied from one year to the next, meaning that many other possible causal factors such as genetic differences are automatically accounted for.

The scientists also accounted for the gradual decline in cognition seen as people age and ruled out people being more impatient or uncooperative during tests when pollution was high.

Following the same individuals through the study: that makes a lot of sense.

I hadn’t heard of this study when it came out so I followed the link and read it now.

You can model the effects of air pollution as short-term or long-term. An example of a short-term effect is that air pollution makes it harder to breathe, you get less oxygen in your brain, etc., or maybe you’re just distracted by the discomfort and can’t think so well. An example of a long-term effect is that air pollution damages your brain or other parts of your body in various ways that impact your cognition.

The model includes air pollution levels on the day of measurement and on the past few days or months or years, and also a quadratic monthly time trend from Jan 2010 to Dec 2014. A quadratic time trend, that seems weird, kinda worrying. Are people’s test scores going up and down in that way?

In any case, their regression finds that air pollution levels from the past months or years are a strong predictor of the cognitive test outcome, and today’s air pollution doesn’t add much predictive power after including the historical pollution level.

Some minor things:

Measurement of cognitive performance:

The waves 2010 and 2014 contain the same cognitive ability module, that is, 24 standardized mathematics questions and 34 word-recognition questions. All of these questions are sorted in ascending order of difficulty, and the final test score is defined as the rank of the hardest question that a respondent is able to answer correctly.

Huh? Are you serious? Wouldn’t it be better to use the number of questions answered correctly? Even better would be to fit a simple item-response model, but I’d guess that #correct would capture almost all the relevant information in the data. But to just use the rank of the hardest question answered correctly: that seems inefficient, no?
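To get a rough feel for how much information the "hardest question answered correctly" score might throw away, here is a minimal simulation sketch. It is not the CFPS test or the authors' scoring code: it assumes a simple Rasch-type item-response model, 24 items with evenly spaced difficulties, and that everyone attempts every item. The function names `pearson` and `simulate` are made up for this illustration.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_people=2000, n_items=24, seed=1):
    """Simulate test takers and score them two ways.

    Returns (corr of ability with number correct,
             corr of ability with rank of hardest item answered correctly).
    """
    rng = random.Random(seed)
    # Assumed item difficulties, evenly spaced from easy (-3) to hard (+3).
    difficulty = [-3 + 6 * j / (n_items - 1) for j in range(n_items)]
    ability, n_correct, hardest = [], [], []
    for _ in range(n_people):
        theta = rng.gauss(0, 1)  # simulated true ability
        # Rasch model: P(correct) = logistic(ability - difficulty).
        right = [rng.random() < 1 / (1 + math.exp(-(theta - b)))
                 for b in difficulty]
        ability.append(theta)
        n_correct.append(sum(right))
        # 1-based rank of the hardest item answered correctly; 0 if none.
        hardest.append(max((j + 1 for j, r in enumerate(right) if r), default=0))
    return pearson(ability, n_correct), pearson(ability, hardest)


r_count, r_hardest = simulate()
print(f"corr(ability, number correct)       = {r_count:.2f}")
print(f"corr(ability, hardest item correct) = {r_hardest:.2f}")
```

Under these assumptions the number-correct score should track ability more closely, since a single lucky answer on a hard item moves the "hardest item" score a lot. How big the gap is in the real data would depend on how the test was actually administered.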

Comparison between the sexes:

The authors claim that air pollution has a larger effect on men than on women (see above quote from the news article). But I suspect this is yet another example of The difference between “significant” and “not significant” is not itself statistically significant. It’s hard to tell. For example, there’s this graph:

The plot on the left shows a lot of consistency across age groups. Too much consistency, I think. I’m guessing that there’s something in the model keeping these estimates similar to each other, i.e. I don’t think they’re five independent results.
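The "significant versus not significant" point is just arithmetic, and a small sketch makes it concrete. The numbers below are hypothetical, chosen for illustration and not taken from the paper, and the calculation assumes the two estimates are independent.

```python
import math


def z_and_significant(estimate, se, z_crit=1.96):
    """Return the z-score and whether it clears the usual 5% threshold."""
    z = estimate / se
    return z, abs(z) > z_crit


# Hypothetical effect estimates in arbitrary units (NOT from the paper).
men_est, men_se = 25.0, 10.0
women_est, women_se = 10.0, 10.0

# Difference between the two estimates, assuming they are independent.
diff = men_est - women_est
diff_se = math.sqrt(men_se ** 2 + women_se ** 2)

for label, est, se in [("men", men_est, men_se),
                       ("women", women_est, women_se),
                       ("men - women", diff, diff_se)]:
    z, sig = z_and_significant(est, se)
    print(f"{label:12s} estimate={est:5.1f}  se={se:5.1f}  z={z:4.2f}  significant={sig}")
```

With these made-up numbers, one estimate clears the threshold and the other does not, yet the difference between them is only about one standard error from zero.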

The authors write:

People may become more impatient or uncooperative when exposed to more polluted air. Therefore, it is possible that the observed negative effect on cognitive performance is due to behavioral change rather than impaired cognition. . . . Changes in the brain chemistry or composition are likely more plausible channels between air pollution and cognition.

I think they’re missing the point here and engaging in a bit of “scientism” or “mind-body dualism” in the following way: Suppose that air pollution irritates people, making it hard for people to concentrate on cognitive tasks. That is a form of impaired cognition. Just cos it’s “behavioral,” doesn’t make it not real.

In any case, putting this all together, what can we say? This seems like a serious analysis, and to start with the authors should make all their data and code available so that others can try fitting their own models. This is an important problem, so it’s good to have as many eyes on the data as possible.

In this particular example, it seems that the key information is coming from:

– People who moved from one place to another, either from a high-pollution to a low-pollution area or vice-versa: you can see if their test scores went correspondingly up or down, after adjusting for expected cognitive decline by age during this period.

– People who lived in the same place but where there was a negative or positive trend in pollution: again you can see if these people’s test scores went up or down, again after adjusting for expected cognitive decline by age during this period.

– People who didn’t move: you can compare those who lived all along in high- or low-pollution areas and see who had higher test scores, after adjusting for demographic differences between people living in these different cities.

This leaves me with two thoughts:

First, I’d like to see the analyses in these three different groups. One big regression is fine, but in this sort of problem I think it’s important to understand the path from data to conclusions. This is especially an issue given that we might see different results from the three different comparisons listed above.
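As a starting point for that kind of split, here is a minimal sketch of how respondents might be sorted into groups before fitting anything. The record fields (`city_2010`, `pollution_2014`, and so on), the threshold, and the group labels are all hypothetical; the actual CFPS variables and a sensible cutoff for "a trend in pollution" would have to come from the data. Note that it separates stayers into changing and stable pollution, which is one reading of the second and third comparisons above.

```python
from collections import defaultdict


def classify(record, trend_threshold=5.0):
    """Assign one respondent to a comparison group.

    trend_threshold is an arbitrary, assumed cutoff (in pollution units)
    for calling the local pollution level "changed" between waves.
    """
    if record["city_2010"] != record["city_2014"]:
        return "mover"
    change = record["pollution_2014"] - record["pollution_2010"]
    if abs(change) >= trend_threshold:
        return "stayer_changing_pollution"
    return "stayer_stable_pollution"


def split_by_group(records, trend_threshold=5.0):
    """Partition records into the groups defined by classify()."""
    groups = defaultdict(list)
    for r in records:
        groups[classify(r, trend_threshold)].append(r)
    return dict(groups)


# Toy records, invented for illustration only.
records = [
    {"id": 1, "city_2010": "A", "city_2014": "B",
     "pollution_2010": 80.0, "pollution_2014": 40.0},
    {"id": 2, "city_2010": "A", "city_2014": "A",
     "pollution_2010": 80.0, "pollution_2014": 60.0},
    {"id": 3, "city_2010": "B", "city_2014": "B",
     "pollution_2010": 40.0, "pollution_2014": 41.0},
]

groups = split_by_group(records)
for name, members in groups.items():
    print(name, [m["id"] for m in members])
```

The same regression could then be fit within each group and the estimates compared, keeping in mind that the subsamples will be smaller and the estimates noisier.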

Second, I am concerned with some incoherence regarding how the effect works. The story in the paper, supported by the regression analysis, seems to be that what matters is long-term exposure. But, if so, I don’t see how the short-term longitudinal analysis in this paper is getting us to that. If effects of air pollution on cognition are long-term, then really this is all a big cross-sectional analysis, which brings up the usual issues of unobserved confounders, selection bias, etc., and the multiple measurements on each person are not really giving us identification at all.
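One way to see the worry is to simulate how much a long-run exposure average can change within a person between two survey waves, compared with how much it varies across people. The sketch below is a toy model, not the paper's data: it assumes nobody moves, each place has a fixed pollution level plus independent year-to-year noise, and the two waves are four years apart. All parameter values and the function name `exposure_variation` are made up.

```python
import random
import statistics


def exposure_variation(window_years, n_people=2000, gap_years=4,
                       between_sd=10.0, yearly_sd=5.0, seed=7):
    """Compare within-person and between-person variation in exposure.

    Exposure at each wave is the trailing average of annual pollution over
    `window_years`. Returns (sd of the within-person change between waves,
    sd of first-wave exposure across people).
    """
    rng = random.Random(seed)
    changes, levels = [], []
    total_years = window_years + gap_years
    for _ in range(n_people):
        # Assumed: a fixed local level per person (no movers, no trends).
        local_mean = rng.gauss(50.0, between_sd)
        series = [local_mean + rng.gauss(0.0, yearly_sd)
                  for _ in range(total_years)]
        first_wave = statistics.fmean(series[:window_years])
        second_wave = statistics.fmean(series[gap_years:])
        changes.append(second_wave - first_wave)
        levels.append(first_wave)
    return statistics.stdev(changes), statistics.stdev(levels)


for window in (1, 5, 20):
    within_sd, between_sd = exposure_variation(window)
    print(f"window={window:2d} years  within-person sd={within_sd:5.2f}  "
          f"between-person sd={between_sd:5.2f}")
```

In this toy setup, the longer the exposure window, the less the exposure measure moves within a person over four years, while the differences across people stay about the same. Movers and real pollution trends would add within-person variation, which is why it would help to see those groups analyzed separately.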

P.S. The problems with this study, along with the uncritical press coverage, suggest a concern not with this particular paper but a more general concern with superstar journals such as PNAS, Science, Nature, Lancet, NEJM, JAMA, etc., which is that they often seem to give journalists a free pass to report uncritically. This sort of episode makes me think the world would be better if these superstar journals just didn’t exist, or if they were all to shut down tomorrow and be replaced by regular old field journals.

1. Peter Dorman says:

I really like the suggestion to partition the sample according to the three groups. This not only digs deeper into the potential underlying processes; it also tests the model/evidence. One of the less commented on consequences of the hunt for statistical significance is the resistance to looking at subsamples, with their lower N. But subsamples matter, whether or not you think your effect should be approximately constant across them. Of course, it’s important that the selection into subsamples follows exogenous criteria, as yours does.

• David Bailey says:

Yes, I think it is hard to overemphasize this point. Many people don’t seem to realize that the benefit of large sample sizes is not just to reduce statistical uncertainties, but so that the sample can be sliced and diced to investigate and evaluate possible systematic issues and biases.

2. Data set or it didn’t happen.

• Which is to say that there are so many ways to analyze this data that no one with an interest in the question has reason to believe that all of the relevant ways were handled by this analysis.

why do we let people get away with publishing this stuff without data? all that work to put the dataset together thrown away after the “revealed truth” is published… :-(

• harryq says:

I don’t quite understand this sentiment. Out of the ~30 papers I’ve written/co-authored, I can only think of one where I had the right to share the data — and that’s because the data was created by me purely for illustration purposes. All of the other papers either (a) involved data use agreements that expressly forbade sharing the data or (b) used publicly-available data that weren’t really “mine” to share — e.g., data from CDC WONDER that can be easily downloaded.

And as someone who is ~dabbling in the area of data privacy, the idea of a bunch of privacy-naive researchers sharing sensitive data with potentially identifying information — solely for the purpose of satisfying a journal/reviewer’s “open data” requirement — makes me a bit nervous…

• Dale Lehman says:

I can’t let this pass uncontested. I can’t speak personally to your case and you may be accurate in your characterization, but I can say that too many people hide behind data use agreements. It is particularly egregious when they are private company data that is being used in a study that is aimed at influencing public policy. In those cases, I find the privacy/proprietary restrictions almost totally unjustified.

Of course health data and education data (and….) are different in that we really don’t want people able to identify individuals. But data can (and should) be anonymized. The degree to which data is not made available is more than an irritation – it is irresponsible and is corrupting analysis. Your reference to “privacy-naive” researchers is only one step away from “methodological terrorists” and “research parasites.” You may well be nervous about naive researchers getting their hands on sensitive data – I am definitely nervous about people doing sloppy or intentionally misleading research and getting it published. This blog has had a litany of examples of the abused position of “elite” researchers/editors/referees that have helped create the crisis we are in. And, if you don’t think it is a crisis – I do. It is much of the reason why the public no longer believes anything and facts have become optional. So, which is the worse danger?

• Clyde Schechter says:

At least in my sphere of clinical and epidemiological research, Data Privacy (TM) is the fastest growing industry. Let me tell you an experience I had just today.

I am a co-investigator on an NIH funded grant. The grant involves collaboration between my institution and another organization. The other organization will be collecting the data, and under the terms of the grant, I will be the one to analyze it. All of this was clearly spelled out in the grant, and the IRBs of both my institution and the other organization approved these arrangements at the start, about 4 years ago. Part of the arrangements also are that the other institution will mask the data in such a way that individuals cannot be identified.

The PI for the grant was also faculty at my institution at the time. But due to some administrative changes, he is now, technically, not an employee of my institution but has status as a “consultant.” Data collection is nearly complete, and the other organization’s legal counsel now insists that we must have a written data use agreement which will, among other things, forbid me to use the data for any other purpose than those spelled out in the grant, and forbid me to further disseminate the data. They have sent a draft agreement to the office of legal counsel at my institution, who have, in turn, raised additional concerns. Because the PI is no longer technically our employee, they now propose that his current employer also be party to the data use agreement, or that his access to the data be restricted to viewing it on our premises. Since he is the PI and is responsible for the overall conduct of the study, it is clear that he needs to have access to the data (although in our past collaborations he has always been content to just review my outputs and code without getting his hands on the data.) They have also noted that although both IRB approvals spell out that the other institution collects the data and I analyze it, the IRB approval does not mention anything about sending the data from there to here, so they want us to amend our IRB applications to cover this. (It’s unclear to me why they think the IRB might have imagined that I would analyze the data if it were never sent to me, but, whatever.)

Look, I’m not usually one to complain about IRBs and other regulation of research. In fact, I used to be an IRB member myself, and have defended regulation of research both here and on other blogs. I think privacy is important and that it cannot be left to the whims of investigators to do whatever they please. Some oversight is needed.

In the end, I’m sure we will resolve these particular problems and in some reasonable way I will get to analyze the data and the PI will get to see it if he chooses to. But, if the lawyers get their way, we will have to repeat the IRB approval process, which will delay us a few months, and probably also involve this third institution, whose lawyers will no doubt feel the need to justify their pay by adding still more hurdles to jump, leading to more delays. And I am even more certain that there will be an ironclad prohibition about sharing this data. And were I to break that prohibition, I am confident I would be disciplined, and perhaps even dismissed for doing so.

• Which is why your research proposal and every clinical research proposal should have initially explicitly required you to collect a signed statement from the clinical patients with an explicit opt-in opt-out decision in which they check a box saying “my data must never be shared or used for any other purpose” vs “my data must be anonymized and made available publicly to improve the information available for further scientific study”

or some similar thing.

let the patients explicitly make a decision, everything I’ve read says many patients are ANGRY that their data is siloed.

• Another way to put this is all the data privacy and siloing is about *protecting and benefiting the university* rather than protecting the patients.

• Another key observation: think how effective it would be to have a complete ban on publication of any research not including the dataset… right away all your lawyers would change their tune to be all about making sure you collect the appropriate releases… because otherwise there’d be no research funds for the organization.

• Andrew says:

Daniel:

Yah, or IRB rules where the default is data are shared, and if you don’t want to share the data you have to give a very good reason why not.

• harryq says:

Perhaps it would be useful if you could explain what you mean by “anonymized” and/or how you’d go about doing that. Are you removing some variables? If so, which variables? Are you adding noise to other variables? If so, which variables, what kind of noise, and how much noise? I might live in a big city, but if I’m the only 32-year-old white male with a PhD and a family history of heart disease, will my record be identifiable? If so, how will my data be protected?

I guess my point is that you make it sound like there’s a PROC ANONYMIZE in SAS where any user can input their data, click the picture of the stick-figure running man, and magically produce a dataset that has the SAS “100% Safe to Share” seal of approval. If that’s the case, then the Census Bureau has spent *way* too much time researching ways to produce a differentially private 2020 Census data release.

https://www.sciencemag.org/news/2019/01/can-set-equations-keep-us-census-data-private
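The "only 32-year-old white male with a PhD" question can at least be checked mechanically: count how many records share each combination of potentially identifying variables and flag the combinations held by very few people. This is the basic idea behind k-anonymity checks. The sketch below is only that check, not an anonymization procedure, and the field names and toy rows are invented for illustration.

```python
from collections import Counter

# Assumed set of quasi-identifiers; which variables belong here is itself
# a judgment call that depends on the dataset.
QUASI_IDENTIFIERS = ("age", "race", "sex", "education", "family_history")


def rare_records(rows, keys=QUASI_IDENTIFIERS, k=2):
    """Return rows whose quasi-identifier combination is shared by fewer than k rows."""
    counts = Counter(tuple(r[key] for key in keys) for r in rows)
    return [r for r in rows if counts[tuple(r[key] for key in keys)] < k]


# Toy rows, invented for illustration only.
rows = [
    {"id": "a", "age": 32, "race": "white", "sex": "M",
     "education": "PhD", "family_history": True},
    {"id": "b", "age": 45, "race": "white", "sex": "F",
     "education": "BA", "family_history": False},
    {"id": "c", "age": 45, "race": "white", "sex": "F",
     "education": "BA", "family_history": False},
]

for r in rare_records(rows):
    print("potentially identifiable:", r["id"])
```

Passing a check like this does not make a dataset safe to share; it only catches the most obvious problem, which is part of harryq's point.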

• Nope, there’s no one size fits all for anonymization. So, there should be an IRB requirement that an anonymization protocol be proposed and discussed. Also the releases signed by participants should mention anonymization and its failures. Some patients may be fine releasing the data even if their name is published, so the inability to anonymize perfectly may not be of interest to them. Others may be extremely privacy concerned…

releasing subsets of the dataset based on individual preferences of the subjects should be an OK thing. Releasing a variety of descriptive statistics of the withheld group to compare with the released group should be an ok thing…

releasing simulated data that is fit to a model together with the model and a description of how the descriptive statistics of the real data differ from the descriptive statistics of the simulated data should be a thing…

there are many many ways to release more than nothing, there is only one way to release nothing. Currently by default we have “release nothing”; to me that’s laziness encouraged by the fact that releasing nothing is maximally beneficial to the universities.

• bozobrain says:

It may be useful to recall how AOL released “anonymized” search queries years ago. Unfortunately, people were able to reveal the identities of individuals associated with the queries, which, if I recall, became quite a bit of a liability issue for the company.

https://www.nytimes.com/2006/08/09/technology/09aol.html

• Simons says:

A major difference between ‘privacy-naive researchers’ and ‘methodological terrorists’ is that, to the best of my knowledge, statistics is on the curriculum most of the time and (differential) privacy is absent from it most of the time. Anecdotally, I see lots of medical researchers who think that anonymization is the same as removing a person’s name while retaining birthday, sex, postcode, ethnicity, income etc. I’ve encountered one researcher who claimed that he managed to link two anonymized datasets with 98% matches. He was proud that he could bypass the custodians of both datasets as the ethical board had refused his request to give him the raw datasets. And then he was annoyed when I didn’t want to participate in analyzing the illegally linked data while waiting for the amended request to be reconsidered. The main point of that example isn’t that there are bad actors out there, but that a person not trained in anonymization or de-anonymization was capable of defeating the experts.

I am not an expert in privacy; I know just enough to realize that I know little and I am very afraid that currently many others don’t even (want to) know that. I don’t think that is primarily due to malice but largely due to inertia. Privacy didn’t use to be a core concern and neither did publishing data. The scale of data collection and processing has changed rapidly and for a while it seemed there were only benefits. Now we’re having multiple scandals regarding privacy – not necessarily related to research – and new laws are coming in. Some of these laws end up with ever longer consent forms that very few people read.

Yes, I’ve seen researchers and institutes hoard data. I’ve seen them do so with the intent to charge exorbitant prices or out of fear that there is something worthwhile in the data that they haven’t thought of and likely never will. They’d rather impede everyone else from advancing science than miss out on a penny. I haven’t seen them use privacy as an excuse, but if forced to release data as a condition to publish, I fully expect them to minimize the effort of anonymization as a cost-saving exercise and smaller groups may not even have the means to do better. And in terms of effort the order is obvious: don’t release the data at all, release all the data, release all data but remove some of the columns, release all data with injected noise but don’t bother to check whether this reproduces the original findings or whether the noise is sufficiently large…

• we should definitely be having lots of discussions about anonymization, as well as put laws in place making it illegal to weaponize data. for example insurance pricing. there should be a list of variables it’s legal to use, and everything else should be illegal. want to use a new variable, get Congress to pass a law adding it.

• Andrew says:

Daniel:

I agree that the real problem is people using available data for the purposes of theft, fraud, blackmail, espionage, denying health coverage, etc., and that’s what should be stopped. To fight these real problems by encouraging scientific researchers to not share their data, that seems to miss the point entirely.

• Right, particularly when some patients specifically participate in medical studies for the purpose of improving science, and WANT their data to be used as widely as possible in hopes of getting good treatments or cures or prevention strategies. Not even asking them is basically unethical in my opinion. Imagine if I donated food to a food bank and it was composted to improve the landscaping at the administrators’ mansions, or I donated my time to build a house for a housing charity and then they sold the house to the HR person at half the market value… that’s basically what collecting data and hoarding it and publishing papers that let you get grants is.

• Dale Lehman says:

I agree with the comments that anonymization is difficult – it is a subject worthy of more research certainly. I also think that we should separate the legal questions from the science questions – at least a bit. HIPAA is quite stringent and may not be what is good for scientific progress. We still need to obey HIPAA but should not substitute that for what is ethical or best for science. For example, I’ll be happy to make my prostate biopsy results public – I’ll even send you my medical records if you wish. And, when I participated in a research trial, I was happy to let that data be made available to anyone. I fully realize that there are many cases where people don’t feel the same – after all, there is a reason why we have HIPAA. But I do object to the blanket appeal to HIPAA as if that is all that needs to be said concerning what is made available or what is required if an analysis is to be trusted at all.

In some extreme cases, I think it might be appropriate to say that a study should not be published if the data cannot be provided – and in some of these cases it may well be true that the data cannot be provided. But we are often asking people to make real decisions (involving money, health, time, etc.) on the basis of studies. If we can’t release enough information for assurance that the findings are robust, then perhaps it should not be published.

What I am saying is that there are many gray areas where more data might be able to be provided, where constraints may prohibit that distribution, and/or where the study is being used to influence real decisions. We can’t afford to say that all data will be provided nor can we afford a blanket statement that HIPAA dictates that none be provided. It is a complex question and it deserves serious discussion and analysis.

• I like all these comments. I’d also add that it should *always* be OK to release simulated data from a model together with a bunch of descriptive statistical comparisons between the simulated data and the real dataset. In many cases, this would be enough for people to develop alternative models and discuss them scientifically.

If a study can’t release the real data, it should be required to produce this type of simulated data and a description of how it differs from the real data. We should have discussions about what kinds of descriptive statistics should generally be required for the model / data comparison.

• harryq says:

(This is a response to Daniel’s comment)

Not to plug my own work, but it is *not* always okay to release simulated data from a model.

tl;dr version: Suppose your dataset consists of individuals’ addresses and incomes. If you fit a spatial model to those data (i.e., model income given location using spatial random effects), the model will have a tendency to overfit any “spatially outlying” individuals (i.e., people who are geographically isolated). If you then use the posterior predictive distribution to generate new/synthetic incomes for these addresses, the synthetic values for the spatial outliers may end up being a little too similar to the true value.

This is obviously a bit of a special case, but I think the general result could apply to any model capable of overfitting a subset of the data. In any case, my broader point would be that if we can’t trust someone to correctly *analyze* their data, I certainly wouldn’t trust them to correctly *anonymize* their data.
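The overfitting mechanism can be illustrated with something much cruder than a spatial random-effects model. The sketch below swaps in a plain Gaussian-kernel smoother as a stand-in (it is not the model described in the comment), with made-up one-dimensional "addresses" and incomes. Synthetic values drawn around the fitted surface would inherit whatever the fit does at each location.

```python
import math


def kernel_fit(locations, incomes, bandwidth=1.0):
    """Gaussian-kernel weighted mean of income at each location (self included)."""
    fitted = []
    for x0 in locations:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
                   for x in locations]
        fitted.append(sum(w * y for w, y in zip(weights, incomes)) / sum(weights))
    return fitted


# Hypothetical data: position in km along a road, income in thousands.
locations = [0.0, 0.2, 0.4, 0.6, 0.8, 50.0]   # the last person is isolated
incomes = [30.0, 80.0, 45.0, 120.0, 60.0, 250.0]

fitted = kernel_fit(locations, incomes)
for x, y, f in zip(locations, incomes, fitted):
    print(f"location={x:5.1f}  true income={y:6.1f}  fitted={f:6.1f}  "
          f"gap={abs(y - f):6.1f}")
```

For the clustered individuals the fitted value is a blend of neighbors and sits far from any one person's income. For the isolated individual there is nobody to borrow from, so the fitted value essentially reproduces the true income, and a synthetic value centered there would too.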

• If you use a public dataset, it’s sufficient in my opinion to publish the description of how to get it, and the code you use to analyze it… But is this paper about a publicly available dataset? I honestly don’t know what the Chinese publish, so maybe it is? Other than that, I can’t see why it would be hard to anonymize the results here. If the study involves small towns where there are fewer than 1000 people you might have a problem… but I doubt that is the case.

• jrc says:

Some scatter plots or time-series/panel graphs would help too. It’s almost like people think it is beneath them to just visualize the data and variation.

• So you’re saying Pics or it didn’t happen.
:-D

• jrc says:

Except it did happen and I just hadn’t downloaded the Appendix yet. There is a lot there. And it is much better documented than I realized at first glance.

Dear Authors – My bad about the snarky comment above. Also, Figure 3 Panel A is striking. I see why it is the kind of result that a top journal would want to publish. It looks like you did a whole lot of work to convince them and yourselves that the result is real in the data. I should not have implied you were not doing your due diligence – you clearly did. Apologies.

• Figure 3 panel A is the one Andrew remarks on in the original post:

The plot on the left shows a lot of consistency across age groups. Too much consistency, I think. I’m guessing that there’s something in the model keeping these estimates similar to each other, i.e. I don’t think they’re five independent results.

I tend to agree. But the graph is kind of amazing. I mean here are adult men vs women who have nothing more than 6th grade educations… and the women are definitely less affected than the men on average. I have to wonder if this isn’t something like occupational. If you have a primary school education only in China and are a man, aren’t you probably more likely to be an outdoor laborer (construction etc)? And as a woman maybe something like a factory worker? So exposure is likely different here.

• Also I have to wonder about other determinants of health: is smoking incidence and intensity the same among undereducated Chinese men vs women? Is nutrition similar? Is exercise level in polluted air similar? How many petite Chinese women vs men run outdoor delivery pedal vehicles or whatever.

3. Koray says:

“All of these questions are sorted in ascending order of difficulty, and the final test score is defined as the rank of the hardest question that a respondent is able to answer correctly.”

I was not aware that there was such a thing as an objective ranking of math problem difficulty.

“Polluted air can cause everyone to reduce their level of education by one year”: this statement is insane. It presumes that they have a method of pulling people at random from crowds and estimating their level of education within a year. They can tell the difference between college sophomores and seniors, regardless of major. This alone would be impressive if true.

• >I was not aware that there was such a thing as an objective ranking of math problem difficulty.

You can at least rank it by percentage of people who tried the problem and got it wrong in a random sample of people in the population.

But general point taken.

My bigger issue with using the highest ranked question answered correctly is it’s noisy. For example for multiple choice questions, if you guess on the most difficult question and get it right suddenly you’re a genius.
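The empirical ranking suggested at the top of this comment could be computed along these lines. This is a minimal sketch with invented question labels and responses; `None` marks a question that a person did not attempt, and ties in the share wrong are left in input order.

```python
def rank_by_difficulty(responses):
    """Rank questions from easiest to hardest by share wrong among attempts.

    responses maps a question label to a list of True (right), False (wrong),
    or None (not attempted). Returns (ordered labels, share wrong per label).
    """
    share_wrong = {}
    for question, answers in responses.items():
        attempted = [a for a in answers if a is not None]
        if attempted:  # skip questions nobody tried
            share_wrong[question] = (
                sum(1 for a in attempted if not a) / len(attempted)
            )
    ordered = sorted(share_wrong, key=share_wrong.get)
    return ordered, share_wrong


# Toy responses from five hypothetical test takers.
responses = {
    "Q1": [True, True, True, True, False],
    "Q2": [True, False, True, None, False],
    "Q3": [False, False, True, None, None],
}

ordered, share_wrong = rank_by_difficulty(responses)
for q in ordered:
    print(f"{q}: {share_wrong[q]:.0%} wrong among those who attempted it")
```

A ranking like this is only as good as the sample it comes from, and it says nothing about the guessing problem, which affects the score for an individual rather than the ordering of the questions.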

• elin says:

“I was not aware that there was such a thing as an objective ranking of math problem difficulty.”

There absolutely is, especially since these are given to people at different ages. What children learn in school at age 5 is different than what they learn at 12.

3*2 is objectively easier than 32*3, which is easier than 32*9.

4. Nick Matzke says:

Prediction: in the near future someone will do a GWAS of air pollution.

5. Off topic, but if you want to include LaTeX, MathJax is turned on in posts, so you just write $latex e^x$.

In the transition to the new hosting, MathJax and <pre> tags got turned off in comments. Could you get those fixed, please?

6. Jacob says:

Lot of talk about data availability and it seems like nobody bothered googling the name of the data source. Here’s an English language website: http://www.isss.pku.edu.cn/cfps/en/index.htm

I poked around for a bit and it appears you can access the data after approval — doesn’t appear there are any costs, etc. involved just a check that you aren’t claiming to want to use the data for bad acts. I don’t see any indication of this, but it wouldn’t surprise me if the location data are unavailable without IRB approval. I also don’t know if the pollution data were linked to the respondent-level data by the researchers or if it was included with the respondent-level data.

7. Emerald Shelton says:

I think there is an interesting idea here. According to their study, ethical lapses are caused by air pollution. But is it correct to compare people’s health with morality? I’m not sure… It’s interesting to read their unpublished study about elevated levels of pollution compared to crime rates on the same days in LA and Chicago. By the way, similar research was conducted by researchers from the London School of Economics, who discovered that the crime rate in London is 8.4 per cent higher on the most polluted day. Everyone knows that ozone can cause asthma attacks as well as lead to lung diseases, but I can’t see its connection with morality. However, if the study is correct, organizations such as the EPA should work in this direction. As stated in the report https://complexminds.net/2019/04/15/air-pollution/, nowadays they help states to meet standards for common pollutants by issuing federal emissions standards and policy guidance for state implementation plans.