Skip to content
 

Know your data, recode missing data codes

We had a class assignment where students had to graph some data of interest. A pair of students made the above graph, as a reminder that some data cleaning is often necessary. The students came up with the excellent title as well!

77 Comments

  1. Adede says:

    I don’t get it. What’s missing? What needs to be cleaned up? Looks like a very sparse yet dispersed (bimodal?) dataset.

    • Jordan says:

      The missing values were probably coded as something like “-99” in the original dataset (this is a Qualtrics default, for example). The potential error here is forgetting to tell your statistical software that “-99” = “Missing”. From what I can tell, the choice of “-99” for missing values did its job by making the analyst scratch his/her head when they tried to perform their analysis.

      “Wait, those values are impossib- oooooooooooooooooh, I forgot to tell R that ‘-99’ = ‘Missing'”

  2. Jordan says:

    Forgive me for being a bit slow, but isn’t the lesson here to always originally code missing values with a number so absurd (-9999) that it would clobber any analysis that treated it as non-missing? That way, missing values in the same raw dataset will throw up a bajillion red flags regardless of the stats software you’re using.

    I might have titled this something like “Missing values in raw data should be so absurd that you couldn’t possibly forget to recode them after importing the dataset into your preferred software”. But then again, this blog wouldn’t be nearly as popular if I were the one choosing the titles :)

    • Dale Lehman says:

      One of my favorites is the US airfare data (from the Bureau of Transportation Statistics). Missing data is coded as 99999 for the airfare. If you don’t recode it, you get some interesting airfares. When I originally discovered this, after contacting them to find out what all the 9999 values were, I (we) also discovered that the data has been mistakenly entered as 9999 rather than 99999. Either way, pretty expensive fares, even before COVID.

    • Jon Mellon says:

      We always code don’t knows as 9999 in the British Election Study for that reason. Instantly blows up any model you include them in.

    • Mikhail Shubin says:

      Is the lesson here that one should always encode missing values with NA or Null or None or something more appropriate than wrong numbers?

      • Brent Hutto says:

        Mikhail,

        IMHO, yes that is the takeaway lesson.

        In fact I was gobsmacked to discover from this discussion that so many people still feel otherwise.

      • Carlos Ungil says:

        Sometimes “wrong” numbers can be (and often were) the most appropriate solution. Non-numeric choices are (were) not always an option and also have their own disadvantages. The lesson here is to read the dataset documentation (or at least verify if there are strange things going on which suggest that you should read the documentation).

        • Brent Hutto says:

          Is there any software in common use today which will not allow a blank or empty field?

          Of course it’s a good practice to read the documentation. But it’s also correct practice to design things in a way that fail in the safest possible way whenever someone does not read the document (or misunderstands it or the documentation is wrong/incomplete).

          My former life as a software engineer was a long, long time ago. But even in the 70’s we were taught that counting on an end-user to “read the manual” was no excuse for poor design and coding decisions. Placing values in a dataset which will lead to incorrect results with no error or warning is a poor design decision. Saying it’s all documented in the manual does nothing to change that.

          • Agreed, but it is still common practice to release fixed width field datasets, usually from some government sources. when the data file has “x, a two byte signed integer equal to the number of dollars spent per week on goldfish food” there’s not much you can do to deal with missing values other than code it as something absurd like -9999

            If the govt stops putting data out in this form, decades old analysis codes that analyze data series published since the 1960s will stop working. so while I agree with moving future analyses to more flexible formats like the free public domain SQLite format, as an analyst it’s important to remember the rule: always read the data codebook, and clean the data before analyzing.

            basically belt AND suspenders

          • Carlos Ungil says:

            Sometimes you have multiple kinds of missingness, one kind of “emptiness” is not enough. Even if there is only one, you can leave a numeric field empty when the data is given as text but not when it’s in binary format. Not all the datasets that one may want to use have been created “today” and expectations have to be set accordingly.

            • Brent Hutto says:

              So when do we stop perpetuating awful, error-prone design decisions because of limitation in half-century old software systems? In the year 2030? 2040? 2100?

              Literally this week I had someone suggest that I go through their dataset and replace the missing values (represented by blank fields in a CSV file) with the number “99” to make it clear they were missing. I declined the request and left them blank.

              I’ll be retired, dead and buried one day and people will still be littering datasets having nothing to do with 1960’s vintage legacy software with silly things like that out of sheer habit. Just stop, now, please. Legacy software dealing with legacy data will have to strip out all that legacy crap, yes. But there’s no reason to keep repeating the same problem in new data.

              • Agreed we should stop using this kind of thing for new data analysis. Incidentally a good way to handle multiple types of missingness is to set the field to a NULL value, and then have a second field “missingreason” where you can encode things like refused to answer, hung up, didn’t know, ate too many chili peppers, whatever.

              • Carlos Ungil says:

                > Legacy software dealing with legacy data will have to strip out all that legacy crap, yes.

                As will new software dealing with legacy data. Which I think is the scenario those students found themselves in. Hence the takeaway “if you use legacy data be prepared to handle it appropriately”.

              • jim says:

                “replace the missing values…with the number “99” to make it clear they were missing”

                See what I mean? There’s no rational reason for that. If the data are not there, clearly they are missing!

                I mean, when was the last time you found a blank field and – because you’d wisely filled all blanks with -99 – you were able to determine that the data had accidentally been deleted? And even if you did figure that out, how did you recover the missing data?

                Aye yi yi

            • jim says:

              “Sometimes you have multiple kinds of missingness, one kind of “emptiness” is not enough. “

              If you have to characterize a numeric data field – ie say why it’s empty – put the information in *separate field*. Really it’s just not that hard. Putting wrong numbers everywhere is just planting land mines. Not to mention having people who are entering the data consult a code book or type these things in. What a disaster. Make a separate field and create a lookup table for it. Boom, done, safe, accurate, transportable.

              Reading the documentation is a great idea and everyone should do it. No one ever does it, but everyone should. That’s probably because the documentation is usually so impenetrable that it’s easier just to figure it out on your own.

        • Mikhail Shubin says:

          There seems to be one case when “wrong” number is preferred to an empty fields: visa applications
          https://statmodeling.stat.columbia.edu/2020/08/05/kafka-comes-to-the-visa-office/

  3. Andrew says:

    Adede:

    I don’t remember the details, but it was something like: missing data were coded as -999 and they weren’t recoded before the analysis, leading to these weird averages. The point is that people often just perform analyses without inspecting their data, ending up with completely wrong numbers.

  4. Eric B Rasmusen says:

    I’m guessing, but I can imagine some foolish person decided that the best way to code “Missing” was not to use the string “missing”, because if he did that, then whoever ran a regression would have the app return an error like “Your data includes string values” instead of giving an answer. Instead, he had the bright idea of putting something like the integer “-66” or “-77” to represent missing values. I have to admit I might do that myself in a dataset for personal use, though I at least would use “-999”. Eyeballing the data, the user would see that “-66” was a nonsense value, and, being clever, would deduce that it must be the code for “missing value”.

    This reminds me of the CDC, except the CDC isn’t that smart. In their online official list of covid19 deaths, check out the latest week. It will always be astoundingly low. Then look at the table notes. They explain that the data is added as it comes in from the states, and that takes 1 to 8 weeks. They don’t go on to say it, but of course this means their reported deaths numbers for the past 2 months are pretty much worthless. (Note: they do this for *totals*; it’s not just “average death rate” I’m talking about.)

    • Jordan says:

      > some foolish person decided that the best way to code “Missing” was not to use the string “missing”

      The reason someone might use something like “-999” is so that when you import the dataset your stats software is much more likely to properly categorize each variable (e.g., string, integer, factor, etc.). Integers for missing values are far less likely to mess up this process whereas strings will interfere with it almost every time. Large negative integers are the standard for many professionally created datasets (e.g., General Social Survey).

      • Jordan says:

        Correction: I just checked the GSS and it looks like the use “99” for missing (Shoulda checked before pretending to be so certain, apologies!). Still, the point is that an absurdly high/low integer is less likely to interfere with importation.

        • Eric B Rasmusen says:

          Thanks. I spoke too hastily. There actually *is* a good reason to use integers, as you say. The choice of “99” is terrible, though, since that’s a realistic value for lots of variables. Better to go negative, though maybe in 1960 you couldn’t use negative signs. Better, at least, to go outside the 0-100 range, so when you saw it you couldn’t think it was a percentage value.

          I haven’t done regression work for five years or so. It does seem like a professional-level app ought to be able to input string data without jumping through hoops, though. Do you have to tell Stata and R and Python the type of the variable? I do recall starting with Excel, just to eyeball and clean the data before feeding it in, but by now I’d hope the real stats apps would have caught up.

          • Jordan says:

            I use R and for the most part R will automatically identify the correct type when importing a dataset. However, there are inherent assumptions in an automatic process like that. For example, AFAIK every kind of number can be treated as a string with no inherent order or scale (e.g., ID codes), but not every string can be treated as a number. So, if R detects a single unfamiliar string in a column, it just assumes that the entire column should be treated as a string. Of course, the only exception is the string “NA”, which it knows to treat as missing. If a column only contains integers and NAs, then it will correctly treat the column as an integer. The only problem with using NAs for missing values in the raw data is that other programs (e.g., SPSS) don’t recognize “NA” as missing automatically (I think).

            I guess the way I see it is that using a large negative integer avoids interfering with the automatic typing of variables across most software setups at the cost of needing to take an extra step to flag that value as missing. In R, it just requires adding a single line after importing the dataset:

            df[df==-999] <- NA

            • Clyde Schechter says:

              In Stata, the presence of anything that cannot be parsed as a number will cause that variable to imported as a string variable. You then have to add a little code to de-string the numbers and code the “NA” or similar as system-missing value. For importation to Stata, the most convenient arrangement is to put nothing in the place of a missing value. That is, in a delimited file, just have two consecutive delimiters, or in a fixed-width file, just leave a blank space in that column. In that case, Stata will automatically import that entry as system-missing.

              • Jordan says:

                That works in R as well, but the only problem is if you have different categories of missing data. If all of your missing data are missing at random, then leaving blanks works just fine. However, if some are missing at random and other data are missing for non-random reasons, then you still need to choose some kind of value to differentiate between the two. I tend to use “-9999” for missing-at-random and “-8888” and/or “-7777” to mark specific categories of missing-not-at-random.

              • jim says:

                “if some are missing at random and other data are missing for non-random reasons…”

                Still seems like the appropriate way to handle this is to make a new categorical field characterizing the missing data and leave the data field blank. Then you can have as many explanations as you need, they can be self-explanatory, and you can bang out an easy summary table of how much is missing for each reason.

            • R does give strings an inherent order, in the sense that “” operators, and “sort”, have defined behaviour. The problem is that if the strings are actually numbers, it won’t give what you want: it sorts strings by character order before length, so if you forget to change them back to numbers first you can easily get silly results like “2” < "12". That single correction line you give above doesn't change this problem, so if you then go on to treat those numbers as factor levels where the order matters then you can run into problems.

          • Jordan says:

            Meant to say that I completely agree that “99” is a terrible choice. Both because it is too close to reasonable values for a lot of variables and because it is positive. The best numbers will be things like “-9999” where most of the time its two or three orders of magnitude outside the possible range of most variables.

            Upon closer inspection, the GSS actually uses different values to represent missing values for different variables (sometimes its 9, sometimes its 99). I can’t imagine why they would do this, but perhaps it made a big difference back then? Maybe if a variable could take on a true value of 9, they used 99 as missing and only used 9 when the true values were between 1 and 8. If that’s the case, it could have reduced the overall size of the dataset and the amount of memory needed to analyze it using older computers? I have no clue really.

    • Anoneuoid says:

      This reminds me of the CDC, except the CDC isn’t that smart. In their online official list of covid19 deaths, check out the latest week. It will always be astoundingly low. Then look at the table notes. They explain that the data is added as it comes in from the states, and that takes 1 to 8 weeks. They don’t go on to say it, but of course this means their reported deaths numbers for the past 2 months are pretty much worthless. (Note: they do this for *totals*; it’s not just “average death rate” I’m talking about.)

      It’s not worthless, you just have to know how much is typically missing 1 week back, 2 weeks back, etc.

      Eg, I usually drop the latest point and then guesstimate 10-20k extra deaths onto the remaining most recent one. It’s pretty accurate if rather imprecise. I’d say after 5 weeks the data is petty stable.

  5. Merlin says:

    The dataset is Polity IV. They use -99, -88, and I think also -77 for either missing data or what they call interregnum periods, i.e. transitions between political regimes that can’t be clearly defined. These codes are not uncommon. The ALLBUS (a yearly survey in Germany) that I had the “pleasure” to work with in undergrad used -7, -8, -9; note for variables defined on the 1-9 range usually…

  6. Eric B Rasmusen says:

    This reminds me of something I like to say about music in church services. At our church, and others, we commonly have highly skilled musicians in the “praise band”, who practice selflessly during the week (most are volunteers). The worship pastor has degree in baroque violin from a top five music school; the guy on the bass guitar has a degree in violin-making, and so forth. But who is the most important member of the worship team? — the guy who posts the powerpoint slides. If he messes up and doesn’t put up the next slide with the words to the hymn for five seconds too long (as happens occasionally, if the leader decides to do the order differently than planned, or repeat a stanza, or just loses his place himself), the congregation is completely confused and the psychological effect is to pretty much destroy the entire hymn, no matter how skilled the musicians are or how much they practiced. The Slides Guy doesn’t have to have a fancy degree, but he has to be sensible, quick, and utterly reliable– and if need be, you could eliminate the entire praise band and still get by so long as he was there with his slides.
    The Data Guy is the same way.

    My PhD advisor, Frank Fisher, has a good article where he says every litigation expert witness team needs a Data Guy (he was IBM’s chief antitrust witness against the Justice Dept., and the Justice Dept’s chief witness against Microsoft). See Econometricians, and Adversary Proceedings. Franklin M. Fisher. Journal of the American Statistical Association , 81
    (Jun., 1986), 277-286. http://www.jstor.com/stable/2289215.

    I like to go meta. This article appears on page 8 of the Google Scholar results for “Franklin Fisher” with 48 cites, just below “Identifiability criteria in nonlinear systems: A further note”. Yet I don’t think I talk about any other paper of his as often. (I did pick him as advisor partly because he was such a good writer, so there’s a selection effect.)

  7. Michael Nelson says:

    I see nothing at all wrong with this graph, the students were obviously just data-smoothing with a cubic model. Only way it could be made any more clear is, perhaps, moving the values on the x-axis out of sequence.

  8. Thomas says:

    A second-hand story about the Multienter aids cohort study back around 1990. The item “nb of sex partners in past week” was to be coded 99 if the person didn’t want to answer. Some underpaid keyboard operator entered 88 by mistake. It made for some interesting mean values.

    • The entire analysis of alcohol consumption by alcoholics in the US is based on physiologically impossible values of consumption due to misunderstanding the question…

      There’s a government phone survey on risky behaviors https://www.cdc.gov/brfss/data_documentation/index.htm

      here’s the codebook: https://www.cdc.gov/brfss/annual_data/2018/pdf/codebook18_llcp-v2-508.pdf

      An important question is: “Avg alcoholic drinks per day in past 30” and it’s coded: 1-76 real answers, 77 is don’t know, 99 is refused to answer.

      The question immediately before this is “Days in past 30 had alcoholic beverage”. Lot’s of people answer 30 because they have basically 1 drink per day… But then they answer “Avg alcoholic drinks per day in past 30” with “30” because they mistakenly believe they’ve been asked not the average but the TOTAL. Worse yet, some people answer 60, or even 76! There are nontrivial numbers of people who claim to EVERY DAY on AVERAGE drink 3 to 4 times the physiologically possible worst case consumption.

      It’s widely acknowledged by studies that it takes a little over an hour to eliminate one standard drink from the bloodstream, though this does vary somewhat. But there being 24 hours in a day, 30 drinks a day is 6 more drinks than you can consume and keep a steady state alcohol level. So after 30 days of drinking that you’d have blood alcohol levels equivalent to downing 180 drinks all at once, which is pretty well guaranteed fatal.

      And yet the proportion of people who answer 30 drinks a day is nontrivial, several percent…

      From this you get such graphs as:

      https://www.washingtonpost.com/news/wonk/wp/2014/09/25/think-you-drink-a-lot-this-chart-will-tell-you/

      and entire books written on the subject of just how MUCH the high end drinker drinks. (The Philip Cook book that the wapo post mentions).

      So, folks, NO it’s NOT the case that you should code stuff as NA by putting in implausibly large integer values… because someone will take that and run with it, leveraging it into a multi million dollar a year research project!

  9. Jai says:

    I took away two valuable basic lessons.

    1. If you provide data, code missing values with an absurd value and document that in your codebook for all to see.
    2. If you consume data, read the codebook and handle missing values on import. Then summarize the data you loaded to detect outliers in the event you forgot to handle the missing values.

    This is the safest ecosytem. Otherwise, if you mix datatypes in the data you provide (like coding missing values with a string like “MISSING”) it can mess up the variable’s datatype on import. Observing the two lessons benefits data interchange. Absurd outliers alert the user about missing values while import errors alert the user about inconsistent datatypes that are likely to be data entry defects.

    • jim says:

      Jai,

      Using a numerical code for a numerical field is dangerous! You may ask, like that little dude used to do on TV, “What means DAN GER OUS?”

      Well it’s like having your shoes untied. 99% of the time you’ll get along just fine but one day a dog will jump out and try to bite you and that’s when you’ll step on your shoe lace and smash your face into the ground. Sorry I guess that’s a little bit graphic.

      • Brent Hutto says:

        Yes, the whole “-9999” thing is stupid IMO. If the data value is missing, it needs to be represented with something out of band that will throw an error if someone tries to use it with their intended software.

        • I’m pretty sure the reason it’s done is that software from the 60s 70s and 80s almost universally used binary fixed width fields and there was simply no way to get an out of band representation into such a file.

          Since the census and bureau of labor statistics and etc were early consumers and producers of such data files, they have continued to do such things to provide uniformity in their data series

          although CSV has become the standard… the real way to publish tabular data these days is actually to plop it all together with metadata into SQLite files.

      • Jai says:

        Jim, Brent, and Daniel,

        Okay. I thought Jordan in his comments above and my summarization of his points demonstrated that it is, in fact, not dangerous. Rather, it’s a safe way to get correct data typing while also being able to detect and distinguish between missing values and data defects.

        Brent, you used an image to express your disapproval of that approach. I’m ready to receive instruction. Why, and what instead? You suggested an “out of band” value. I thought that was the intention of -9999. What is an advisable “out of band” value? What distinguishes that and an “absurd value” and why is the “out of band” value better?

        Daniel suggested metadata with a SQLite file. I would love to see an example. Any to point to on GitHub?

        Jim, I think your suggestion below was not to code missing values at all. Leave them null. Add another categorical variable to document the reason for the null. Did I restate your alternative correctly?

        I’d like to understand these points if I may request help, please.

        Thank you all for responding to my comment.

        Jai

        • Brent Hutto says:

          By “out of band” I mean a value that can not possibly be interpreted as usable by the analysis software. If it’s a numeric value than “out of band” means not a number.

          I want the software to either throw an error or ignore the “out of band” value at run time. A number that’s simply implausible depends on some human to take action to remove it. I want fail-safe, not fail-wrong to be the behavior when the analysts fails to notice something.

          • Andrew says:

            Brent:

            That’s what R does. In R, NA’s are contagious. But we get our data from many different sources, so we also need to check our data for weird missing data codes, or at least be aware that such things can happen!

            • Brent Hutto says:

              Yes, I heartily agree with the way R handles its NA value’s special meaning. One of the many ways I find R simpatico with the way I view data handling and analysis.

              Unfortunately I can only use R for such a small part of my daily work that I never quite develop the fluency with it that I’d like.

            • Jai says:

              Andrew,

              I get it that we need to check for missing data codes when we import data. I’m trying to get to a so-called best practice when it comes to preparing a dataset for consumption by others.

              Say I just became King of the Bureau of Labor Statistics. Now’s my chance to mop up formats once and for all. I don’t think I know yet what the prescribed alternative is for representing missing data, just that as a consumer I should be ready for anything.

              What’s your advice?

              • Brent Hutto says:

                What I tell people who are preparing their data to send me (usually in a CSV file) is to leave it blank if it is missing. Because if they don’t, the first thing I have to do is read in their CSV and convert a bunch of stuff to blanks.

              • jim says:

                Jai: I agree with Brent. If a field has no value leave it blank. And that as a data consumer you definitely should be ready for anything! :)

        • Jai, SQLite files are well documented single file databases. I don’t have an example file to show you but it’s easy to get started with SQLite in R, Python, Julia etc. You can even create the files from the command line with sqlite3.

          To see how it works you can read any old set of CSV files into R and then jam them in separate tables in SQLite. Then you can build an extra table that has table name, variable name, and description as it’s columns. enter a description of every variable in your tables … voila you have a self describing database file

  10. jim says:

    What a great teaching moment for how to prepare and manage data!

    A field that doesn’t have data should, well, not have any data. If the blank needs to be qualified the appropriate way to do it is to provide a categorical field (“Answered, Refused to Answer, Had to Poo, Hung up Accidentally”) so you can analyze or filter on that information.

    can you say “use a relational database to gather and handle your data”? It’s funny how much discussion there is here of bad data handling yet the solution was invented oh so long ago.

    • Brent Hutto says:

      My first job after grad school we had a huge dataset and hundreds (maybe a couple thousand even?) of items, each had the actual value plus a parallel item with about seven or eight reasons for missingness some akin to “had to poo”. And when I said they had a parallel missingness variable that’s because I wrote a SAS macro to extract those codes (something like 9993, 9994, 9995, 9996, 9997, 9998, 9999) from the response variable and separate them out.

      I only worked there a few years but never, ever, not even once saw any of those missingness-category variable used or referred to. I think there was at least one missingness category that never even showed up in the actual data, although it was in the original codebook.

      So pretty useless effort in the end but you gotta give us points for thoroughness!

      • jim says:

        Ha, that’s hilarious!

        I suggest the metafield as a pacifier for people who freak when there is a blank. “OM*G did I delete that? OM*G did I lose that data on the import? OM*G is that supposed to be blank?? What happened?? Can we even use this data??” Strangely though they don’t worry too much about adding wrong data, even if it’s sure to screw something up eventually.

  11. Alex says:

    The code for missing data is 12345?? That’s the same code I use for my luggage!

  12. Bob76 says:

    There’s much discussion above of the value to use code missing data.

    I can see, if the data base was defined back in the days of unit record equipment (punched cards), a number like 99999 would require reserving 5 columns for the data.
    https://en.wikipedia.org/wiki/Unit_record_equipment

    So, if one has a variable that can take on values 1 through 5, 9 is a reasonable code for missing data. Using 9 rather than 99999 cuts the number of card columns needed to store the field from 5 to 1—an 80% reduction. It is probably the case that many long-lived databases can trace their origin back to unit record storage and their coding rules have changed little or not at all.

    Me, I use a value like -9999 or, if I am thinking, I use -32767 or -32768.

    Bob76

  13. Phillip Middleton says:

    +1

    Though this doesn’t absolve the data generation process from not using a data dictionary of some sort. It’s a PiTA naturally (sometimes made better with a few tools in the belt), but I think a requirement nevertheless. The data need not be in an RDBMS either, SQLite or any other – it could just as easily be JSON/XML or some other text file format. In the latter case, data typing is then the responsibility of the analytic software / coding environment, not the RDBMS.

  14. From The Grammar of Graphics (1999):

    Numerically coded missing values have the potential for doing great damage when ignored or miscoded because they are usually designed to look like outliers. Codes such as 99, or –999 can profoundly affect numerical calculations, even when sparse in the data. Jasso (1985), for example, analyzed the frequency of marital coitus in the prior month reported by a sample of 2,361 married couples. In her paper she computed a ceteris paribus effect of over 100 acts of coitus per month for a prototypical 45-year-old married woman. Reanalyzing her data, Kahn and Udry (1986) discovered four cases with highly unusual values of 88; no other coded value was greater than 63 and almost all were under 40. Since the missing value code for this variable was 99, Kahn and Udry concluded that the peculiar values were keypunch errors. While Jasso (1986) offered a rejoinder, it is most likely that inspecting histograms or tables prior to modeling would have triggered an alarm on Berkson’s intraocular traumatic test (Edwards et al., 1963).

Leave a Reply