My NOAA story

I recently learned we have some readers at the National Oceanic and Atmospheric Administration so I thought I’d share an old story.

About 35 years ago my brother worked briefly as a clerk at NOAA in their D.C. (or maybe it was D.C.-area) office. His job was to enter the weather numbers that came in. He had a boss who was very orderly. At one point there was a hurricane that wiped out some weather station in the Caribbean, and his boss told him to put in the numbers anyway. My brother protested that they didn’t have the data, to which his boss replied: “I know what the numbers are.”

Nowadays we call this sort of thing "imputation," and we like it. But not in the raw data! I bet they have an NA code these days.
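To show what an NA code looks like in practice, here's a minimal sketch in Python. The file format, station name, and -9999 sentinel are invented for the example (though sentinel codes like that are a common convention in meteorological archives); the point is that missing observations become NA rather than made-up numbers.

```python
# Hypothetical raw station file: -9999 marks observations that never
# arrived (e.g., the station was destroyed), and the reader turns the
# sentinel into NA instead of fabricating a value.
import io

import pandas as pd

raw = io.StringIO(
    "station,date,temp_c\n"
    "STX01,1989-09-17,28.4\n"
    "STX01,1989-09-18,-9999\n"  # station knocked out by the hurricane
    "STX01,1989-09-19,-9999\n"
)

# na_values tells pandas to treat the sentinel code as missing (NaN).
df = pd.read_csv(raw, na_values=[-9999])
print(df)
print("missing temperature readings:", df["temp_c"].isna().sum())
```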

5 thoughts on "My NOAA story"

  1. For an excellent history of climate change data and models, check out historian Paul Edwards's A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Imputation plays a pretty important role, since we never have truly global instrument data, and certainly not historically. Edwards has some incredibly useful discussions of how to think about data and data work as part of the process and politics of doing science. In particular, he makes the important claim that without models there are no data; hence the argument by climate change deniers that we need better data, not models, is deeply problematic and misunderstands how climate science works.

  2. I'll make a second plug for Edwards's Vast Machine. It's a wonderful, engagingly written, and very insightful history, and, as asociologist says, its discussion of the history of imputation and diagnostic modeling is very relevant to the anecdote about NOAA, as well as to the highly political arguments over trust in climate models.

    One of my favorite anecdotes from Edwards's book is that by the mid-1960s there was real concern that the weight of the national archive of raw meteorological observations, which comprised hundreds of millions of punch cards, threatened the structural integrity of the National Weather Records Center building in Asheville, NC, where they were stored.

  3. Rahul, "imputation" refers to filling in missing data. Interpolation and extrapolation are specific mathematical methods that might be used to perform imputation, but they're far from the only ones.

    For instance, suppose I have data on a lot of people — say their sex, height, weight, age, LDL and HDL cholesterol levels, percent body fat, etc. — and I'm missing the body fat measurement for some of them. I might use all of the cases with complete data to fit a model that predicts body fat as a function of everything else, and use that model to impute the missing data. Or, if fitting such a model is complicated because of nonlinearities or other features, I might just find a group of people who are similar to the missing case — they're the same sex, about the same weight and age, they have about the same cholesterol measurements — and randomly pick the body fat measurement from one of them, and impute that to the missing case. (That's similar to the old "hot deck" procedure that the Census Bureau used to use.)
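    Here's a rough sketch in Python of both approaches on made-up data (the variables, sample size, and the simple linear model are all invented for the example; a real version would also standardize the predictors before computing nearest-neighbor distances):

    ```python
    # Made-up data: impute missing body fat (1) by regression fit to the
    # complete cases and (2) by a hot-deck-style draw from similar donors.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({
        "sex": rng.integers(0, 2, n),
        "age": rng.uniform(20, 70, n),
        "weight": rng.normal(75, 12, n),
        "hdl": rng.normal(55, 10, n),
    })
    # Body fat depends (noisily) on the other variables.
    df["body_fat"] = (10 + 5 * df["sex"] + 0.15 * df["age"]
                      + 0.1 * df["weight"] - 0.05 * df["hdl"]
                      + rng.normal(0, 2, n))
    missing = rng.random(n) < 0.2  # knock out 20% of the measurements
    df.loc[missing, "body_fat"] = np.nan

    predictors = ["sex", "age", "weight", "hdl"]
    complete = df.dropna()

    # (1) Regression imputation: fit on complete cases, predict the rest.
    model = LinearRegression().fit(complete[predictors], complete["body_fat"])
    reg_imputed = model.predict(df.loc[missing, predictors])

    # (2) Hot-deck style: find the 5 most similar complete cases and copy
    # the body fat value of one donor chosen at random.
    nn = NearestNeighbors(n_neighbors=5).fit(complete[predictors])
    _, idx = nn.kneighbors(df.loc[missing, predictors])
    donors = idx[np.arange(len(idx)), rng.integers(0, 5, len(idx))]
    hotdeck_imputed = complete["body_fat"].to_numpy()[donors]

    print(reg_imputed[:5])
    print(hotdeck_imputed[:5])
    ```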

    In case anyone is wondering why you even need to impute data, it's so you can use a single model for your whole database. If I am predicting, say, the number of days an employee will be absent over the course of their career, using the data listed above, then I don't want to exclude cases where I'm missing just one or two predictor values — I might end up excluding lots of cases that way — but I also don't want to fit a zillion separate models: one that doesn't use the LDL measurement, one that doesn't use the HDL measurement, one that doesn't use body fat, and so on.
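    To see how quickly complete-case analysis bleeds data, here's a tiny simulation (the 10 columns and 5% missingness rate are arbitrary): each column is nearly complete, yet only about 0.95^10, roughly 60%, of the rows survive listwise deletion.

    ```python
    # Each of 10 columns is independently missing 5% of its values, so the
    # fraction of fully complete rows shrinks multiplicatively.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    n, k = 10_000, 10  # rows, predictor columns
    df = pd.DataFrame(rng.normal(size=(n, k)))
    df = df.mask(rng.random((n, k)) < 0.05)  # poke random holes

    kept = len(df.dropna()) / n
    print(f"rows surviving listwise deletion: {kept:.1%}")  # ~0.95**10 = 60%
    ```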

    If your missing data problem is small, almost any reasonable-sounding method will probably do OK. (Still, it's good to try a couple of different methods and make sure you get the same results both ways; by "results" I don't mean the imputed values, I mean the results of your full analysis.) If your missing data problem is big, then you definitely want to learn some techniques from the statistical literature, because even methods that seem reasonable (like the hot-deck procedure) can generate statistical artifacts that can mess up your results.
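    Here's a sketch of that sanity check on simulated data (the variables and the two imputation methods, mean-filling versus regression, are just stand-ins): run the full analysis under each imputation and compare the final estimates, not the imputed values.

    ```python
    # Impute a partially missing predictor two different ways, rerun the
    # downstream regression under each, and compare the final coefficients.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 500
    x = rng.normal(0, 1, n)
    z = 0.5 * x + rng.normal(0, 1, n)            # predictor with holes
    y = 2.0 * x + 1.0 * z + rng.normal(0, 1, n)  # outcome of interest
    z[rng.random(n) < 0.3] = np.nan              # 30% of z goes missing

    def analyze(z_filled):
        """The 'full analysis': regress y on x and the imputed z."""
        X = np.column_stack([x, z_filled])
        return LinearRegression().fit(X, y).coef_

    obs = ~np.isnan(z)

    # Method 1: fill with the observed mean.
    z_mean = np.where(obs, z, np.nanmean(z))

    # Method 2: regression imputation of z from x on the observed cases.
    fit = LinearRegression().fit(x[obs].reshape(-1, 1), z[obs])
    z_reg = np.where(obs, z, fit.predict(x.reshape(-1, 1)))

    print("coefs, mean imputation:      ", analyze(z_mean))
    print("coefs, regression imputation:", analyze(z_reg))
    # Large disagreement means the missing data problem needs real care.
    ```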
