What is the most important real-world data processing tip you’d like to share with others?

This question was in today’s jitts for our communication class. Here are some responses:

Invest the time to learn data manipulation tools well (e.g., tidyverse). Increased familiarity with these tools often leads to greater time savings and less frustration in the future.

Hmm, it's never just one tip. I never found it useful to begin writing code, especially on a greenfield project, unless I had thought through the steps to the goal. I often still write the code in outline form first and edit it before entering the programming steps. Some other tips:
1. Choose the right tool for the right job. Don’t use C++ if you’re going to design a web site.
2. Document code well but don’t overdo it, and leave some unit tests or assertions inside a commented field.
3. Testing code will always show the presence of bugs, not their absence (Dijkstra), but that doesn't mean you should be a slacker.
4. Keep it simple at first; you may have to rewrite the program several times if it's something new, so don't optimize until you're satisfied. Finally, if you can control the L1 cache, you can control the world (Sabini).

Just try stuff. Nothing works the first time and you’ll have to throw out your meticulous plan once you actually start working. You’ll find all the hiccups and issues with your data the more time you actually spend in it.

Consider the sampling procedure and the methods (specifics of the questionnaire etc.) of data collection for “real-world” data to avoid any serious biases or flaws.

Quadruple-check your group by statements and joins!!

Cleaning data properly is essential.

Write a script to analyze the data. Don’t do anything “manually”.

Don’t be afraid to confer with others. Even though there’s often an expectation that we all be experts in all things data processing, the fact is that we all have different strengths and weaknesses and it’s always a good idea to benefit from others’ expertise.

For me, cleaning data is always really time-consuming, in particular when I use real-world data and (especially) string data such as names of cities/countries/individuals. In addition, when you make a survey for your research, there will always be that one person who types "b" instead of "B" or "B " (hitting the Tab key). For these reasons, my tip is: never underestimate the power of Excel (!!) when you have this kind of problem.

Data processing sucks. Work in an environment that enables you to do as little of it as possible. Tech companies these days have dedicated data engineers, and they are life-changing (in a good way) for researchers/data scientists.

If the data set is large, try the processing steps on a small subset of the data to make sure the output is what you expect. Include checks/control totals if possible. Do not overwrite the same dataset in important, complicated steps.
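
A minimal sketch of that habit in R (mtcars and the toy process() function here are just stand-ins for your real data and pipeline):

process <- function(d) transform(d, kml = mpg * 0.425)    # stand-in for your real processing steps
small <- head(mtcars, 10)                                  # try the steps on a small slice first
out_small <- process(small)
stopifnot(nrow(out_small) == nrow(small))                  # control totals: row counts, sums, ...
stopifnot(isTRUE(all.equal(sum(out_small$mpg), sum(small$mpg))))
out_full <- process(mtcars)                                # only then the full data, into a NEW object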

When converting data types, for example extracting integers or converting to dates, always check the agreement between the data before and after conversion. Sometimes when I was converting factor levels to integers (numerical values somehow get recorded as categorical because of the presence of NAs), there were errors and the results were not what I expected (e.g., "3712" getting converted to "1672").
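
A quick R illustration of the factor-to-number trap described above (the values here are made up):

x <- factor(c("3712", "150", NA, "42"))     # a numeric variable that got read in as a factor
as.numeric(x)                               # WRONG: returns the underlying level codes (2, 1, NA, 3)
as.numeric(as.character(x))                 # right: 3712, 150, NA, 42
# check agreement before and after the conversion
stopifnot(all(as.character(x) == as.numeric(as.character(x)), na.rm = TRUE))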

Learn dplyr.

Organisation of files and ideas is vital – constantly leave reminders of what you were doing and why you made particular choices, either within the file names (indicating perhaps the date on which the code or data was updated) or within comments throughout the code that explain why you made certain decisions.

Thanks, kids!

P.S. Lots of good discussion in comments, especially this from Bob Carpenter.

94 thoughts on "What is the most important real-world data processing tip you'd like to share with others?"

  1. I always have dependency/bug issues with tidyverse/dplyr; not a fan at all. It has been so bad that the first thing I do is remove them if I get someone else's code. The pipes are also very slow; extra overhead for very little gain, in my opinion.

    A much better library to learn is data.table, just wish the syntax wasn’t so ugly.

    • I've never had dependency/bug issues with tidyverse packages. I agree that if speed is a factor, data.table is the way to go. The syntax is so bad, though, that I (a) never wanted to use it, and (b) knew that nobody else would be able to decipher my code if I did use it. The pipes are so intuitive to me that I don't mind the speed hit; I'm also not working with truly big data frequently, though.

      • I’ve never had dependency/bug issues with tidyverse packages

        Interesting, have you ever tried to reuse any old code with new versions of the packages?

        • I love pipes and many other aspects of the tidyverse, and would hate to have to do without it. I occasionally use data.table for the speed though.

          I have run into plyr/dplyr problems a few times. I prefer to use only dplyr when possible, never loading plyr at all.

          As you all probably know but maybe not: if you are going to use both plyr and dplyr, load dplyr first. Also, if you are using both in the same project (and maybe even if you aren't), specify the package explicitly when calling a function that is (or might be) in both: dplyr::thisfunction() or plyr::thisfunction()

        • Not sure how this happened, whether I interchanged plyr and dplyr or confused first with last, but I said it exactly backwards: if you need both dplyr and plyr then always load plyr first.
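
          To make the load order and the explicit :: calls concrete, here is a small sketch (assuming both packages are installed; summarise() is one of the functions that exists in both):

          library(plyr)     # plyr first...
          library(dplyr)    # ...then dplyr, so dplyr's versions mask plyr's when names collide
          dplyr::summarise(dplyr::group_by(mtcars, cyl), mean_mpg = mean(mpg))   # unambiguous
          plyr::summarise(mtcars, mean_mpg = mean(mpg))                          # explicitly plyr's version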

    • I second data.table. A list of data.table tips and tricks would be helpful. One I use frequently is dt[, .SD, .SDcols=j_cols] where j_cols is the vector of column names
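
      For anyone collecting such tips, here is a tiny sketch of that idiom with a toy table (column names made up):

      library(data.table)
      dt <- data.table(id = 1:4, a = c(10, 20, 30, 40), b = letters[1:4])
      j_cols <- c("id", "a")
      dt[, .SD, .SDcols = j_cols]               # select columns named in a character vector
      dt[, lapply(.SD, mean), .SDcols = "a"]    # apply a function over the selected columns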

    • If you are querying databases, dplyr translates the pipes to SQL, which is executed on the server. There is even a command to view the generated SQL.
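
      A hedged sketch of what that looks like, using an in-memory SQLite table as the stand-in "server" (assumes the dbplyr, DBI, and RSQLite packages are available):

      library(dplyr)
      library(dbplyr)
      con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
      DBI::dbWriteTable(con, "mtcars_db", mtcars)
      q <- tbl(con, "mtcars_db") %>%
        group_by(cyl) %>%
        summarise(mean_mpg = mean(mpg, na.rm = TRUE))
      show_query(q)   # prints the SQL that would run on the database
      collect(q)      # actually executes it and pulls the result into R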

  2. This is horrible advice:

    > Data processing sucks. Work in an environment that enables you to do as little of it as possible. Tech companies these days have dedicated data engineers, and they are life-changing (in a good way) for researchers/data scientists.

    That is the best way to lose control over one of the most crucial steps in data analysis. Even if you have a data engineering team, you still need to be part of the data engineering effort. And if you aren’t, you’ll find yourself working with an undocumented $%#& data warehouse that was not designed by people who care about research.

    • I always try to do both, as one of the pieces of advice suggests. Provide the raw data and a fully automated script for producing munged data from the raw data. There’s no point in making everyone do the redundant work, though it should be clear how the work was done. I remember Andrew once telling me he didn’t mind that the Gelman and Hill examples didn’t work and data were messed up because it reflected reality and students need to get used to it.

      • > I always try to do both
        Agree, but I would think of it in terms of a random audit – always haphazardly check bits of all parts of the research pipeline and especially let others know you will be doing this – absolutely for sure.

        > examples didn’t work and data were messed up because it reflected reality and students need to get used to it.
        In an Epi course I was co-teaching, I suggested we mess up the data set for the assigned analysis project. I was overruled, but the faculty member (outside the course) who volunteered to prepare the data set messed up, and so it was messed up anyway. When the students finally noticed this they vehemently complained. We responded by telling them that we had actually thought to do this on purpose but decided not to. They got it.

        • There are two types of data processing. One type is universal – for example, converting dates to a specific format; or standardizing some variable; or recoding F as female. It doesn’t make sense for each team or individual to repeat these tasks. The other type is case-specific: some assumptions are embedded in the processing, e.g. binning of a variable, any kind of imputation. I can’t take a data scientist/analyst seriously if these tasks are considered below his/her pay grade.

          Keith – sorry to hear you were overruled. Students should be given imperfect data (with errors, with missing columns or cells) which they need to fix. Also, the problem should be stated vaguely so that they will learn how to define and frame a problem.

  3. I know this tip will attract a barrage of criticism: but look at the data as much as possible, in as many ways as possible, and with whatever tools make that job as effortless as possible. Yes, that last part for me means using a tool (JMP for me) that allows me to look directly at the data in many ways without needing to write any code at all. We've had this discussion before – and probably don't need to rehash it now – but I think what is important is that the analyst understand the data they are looking at. If coding helps with that task, fine. But if the tip focuses on the code, then it leaves the impression that this is the only way you can clean the data. I maintain it is not.

  4. Do NOT write a crap load of imperative / stepwise code that poorly implements the query plan of a good relational database… write SQL instead.

    In particular, the tidyverse pipe stuff is basically what I’m telling you not to do, so use it sparingly… it can be good when you need to calculate some advanced statistics of the data that SQL won’t give you, but use it just for that kind of thing, where R’s numerical/computational abilities are key for the function you’re using.

    Here’s the thing: SQL databases are optimized for reading through big tables of data in an efficient way, and SQL is a declarative language that describes the dataset you will get at the end. A sequence/pipeline of data processing steps leaves you with an ungodly mess in terms of figuring out what came out the other end.

    If you have to, create a couple of temporary intermediate tables or some views or something, and learn how to optimize queries and create indexes… you WILL do a lot better job of data munging.

    I believe in this so much that I use sqldf almost everywhere a normal user might just write a couple of lines of regular indexing…

    sqldf(“select a,b from ds1 join ds2 on ds1.q = ds2.qq where a 56 order by b asc;”)

    makes a lot more sense to me a couple weeks later than whatever you’d do to do this in raw R or tidyverse pipes…

    • goram it, the blog ate the less than and greater than signs in my SQL… it should have read something like:

      sqldf("select a,b from ds1 join ds2 on ds1.q = ds2.qq where a > 1 and a < 56 order by b asc;")

      • It’s not, it’s a relatively simple use of the tool, which is fine, it’s specifically the: “crap load of imperative / stepwise code that poorly implements the query plan of a good relational database” that I’m talking about.

        if your pipeline has more than 3 or 4 of these steps, or if you have more than one of these pipelines, that is where you fall afoul of the aphorism.

        I use this kind of stuff sometimes, particularly where I want to summarize by quantiles or compute the mean of f(x) or whatnot. It’s great for that quick and dirty stuff

        but if you’re joining 4 tables, filtering, grouping by, re-joining, etc… do it in SQL

      • It’s jargony, using symbols like % and %>%, and thus leaving people like me confused (someone told me all this piping has something to do with first args, much like |> in OCaml, but clearly something else is going on as there are now two of these pipe operators).

        R itself is rather messy in terms of scope, function argument resolution (presumably theMeans goes into group_by as an argument, but it's written as if myGroup is the first argument), arbitrarily dependent return types, and the use of unbound variables in expressions like myGroup and value in the above.

        • I think of the %>% operator as the dot operator (from many other languages: Python, Java, Ruby, Javascript, Lua). If you can make that connection, it should be easier to grok.

          For example, in javascript and many object-oriented languages, an approach to API design often uses what is called ‘method chaining’, where operations are performed and update the same object without explicitly referencing the object:

          $('#my-div')
            .css('background', 'blue')
            .height(100)
            .fadeIn(200);

          Instead of

          var $div = $('#my-div');
          $div.css('background', 'blue');
          $div.height(100);
          $div.fadeIn(200);

          Example taken from https://schier.co/blog/2013/11/14/method-chaining-in-javascript.html

          The dplyr %>% operator functions similarly to the dot operator in these object-oriented languages.
          The object (typically a data frame) is implicitly passed as the first argument to the functions that follow the symbol (where the convention is that the first argument should always be the data object being operated on). In many traditional object-oriented languages this is enforced by the syntax.

          However, there is considerable meta-programming magic that dplyr provides to make the syntax even more terse, and I found Wickham’s Advanced R book very helpful: http://adv-r.had.co.nz/

        • The dot is just syntax for resolving a method. So if I have a variable x of type Set in Java or C++, I can use x.size() to get the number of elements in the set. The dot just says you’re calling a method rather than some other kind of function. With objects, the object itself is implicitly the first argument of any methods. So x.size() is really just utility syntax for what really behaves like size(x).

          The chaining pattern arises when there’s a class of type A with methods like foo() and bar() that return references of type A, so you can chain calls together in the pattern a.foo(…).bar(…), with each method acting on the original a.

        • Like the total opposite of functional programming basically… grab an object and mangle the crap out of it returning it as your return value so the next thing in line can take its turn…

          in the end the state of the object is what matters, and the changes made by your chain of stuff is potentially pretty opaque

          add in some macro weirdness so that you’ve invented a special syntax with lots of % signs and maybe nontrivial control over the time of evaluation etc… and it becomes a bit of a nightmare to keep track of formally… of course it’s often great interactively where you can just print out a couple lines of data and see if it did what you want…. definitely a tradeoff here.

        • Generally, f(a) %>% g(b) = g(f(a), b).
          The ‘.’ character lets you move it: f(a) %>% g(b,.) = g(b, f(a))
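
          A tiny runnable illustration of those two rules:

          library(magrittr)           # %>% also ships with dplyr
          x <- c(1, 4, 9)
          sqrt(x) %>% sum()           # same as sum(sqrt(x))  ->  6
          sqrt(x) %>% round(1)        # piped value becomes the FIRST argument: round(sqrt(x), 1)
          10 %>% seq(2, .)            # the dot moves it elsewhere: seq(2, 10)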

          Generally, can just use it like shell pipes though. Or like |xargs if you have to move output around.

          I use them interactively, but I do hate the overuse of them. The tidyverse is also a major pain to program around due to the heavy use of NSE/quoted expressions. The fact that they had to make the rlang package to make programming around the tidyverse "better" is a testament to how terrible it is to program around the tidyverse. And rlang is still a pain to use.

        • Non-standard evaluation makes R very confusing to me. I'm not trying to say it's good or bad, just that I can never tell when something's going to be evaluated in R from just looking at the code.

          The . syntax for binding is also odd. Other languages such as C++ would write what Stephen R. Martin wrote as g(b, .) using lambdas and closures as λ(x). g(b, x), where now x is just like any other variable and g(b, x) is just like any other expression in the language.

      • It’s ok as long as you keep it to one line. Stuff like this looks awful to me though:


        starwars %>%
          group_by(species) %>%
          summarise(
            n = n(),
            mass = mean(mass, na.rm = TRUE)
          ) %>%
          filter(n > 1)

        https://dplyr.tidyverse.org/

        This looks much cleaner:

        res = summarize(group_by(starwars, species),
                        n = n(),
                        mass = mean(mass, na.rm = TRUE))
        res = filter(res, n > 1)

        I’m trying out the “code” tag and something will probably get messed up with the formatting. Here is how it looks in sublime text:
        https://i.ibb.co/7QfL15R/pipes.png

        Sublime also doesn’t handle the indentation correctly when using the pipe operators…

        • Consider the SQL:

          select species, avg(mass) as meanmass, count(species) as n from starwars group by species having n > 1;

          you easily see what you’re going to get at the end: a table with three named columns, and where it came from (a table called starwars), etc

        • I don't like how all the different steps are just jumbled together on one line, though. It is like reading an unnecessarily confusing long sentence. Splitting it into multiple lines, plus some sort of indentation/brackets to show hierarchy, would be helpful.

        • Line breaks are not significant to the language, so it’s the same as

          select species, avg(mass) as meanmass, count(species) as n
          from starwars
          group by species
          having n > 1;

          and in fact when doing complex joins etc one normally does use many many lines.

        • That is much better. I would probably do:

          select species, avg(mass) as meanmass, count(species) as n from starwars
          group by species having n > 1;

        • Whether something looks cleaner or not is a matter of opinion and so there can be no accounting for that. To each their own.

          One critique of your approach, Anoneuoid, is your application of state and re-assignment to the same identifier, i.e., the variable `res`; this introduces ambiguity in the value of a variable, where its meaning depends upon its historical values rather than its mathematical relationship. It's not obviously bad here because the context is small, but when you have to deal with larger codebases it can be difficult to decipher the values of variables and their scope.

          The dplyr approach is modeled after the functional (more mathematical) models/paradigms of programming (like SQL, as Daniel Lakeland pointed out). No interstitial state is needed. Also, dplyr reads sequentially as a dataflow instead of inside out and then outside in again, which is cleaner and more readable for many (especially if they're coming from other languages). There are plenty of languages that have similar APIs and structure when dealing with sequencing operations on data objects: Haskell, Ruby, Python, C#, and even javascript. So it's definitely convergent evolution.

        • It's not obviously bad here because the context is small, but when you have to deal with larger codebases it can be difficult to decipher the values of variables and their scope.

          Yea, I wouldn’t do that except on consecutive lines.

        • Daniel, are you suggesting that every time you want to join two dataframes, you should export them both into your sql database and do your joining there?

        • sqldf does this automatically, and by default it uses SQLite so you don't need a database server, and it is fast enough that I've never complained.

          if you got the data from a DB server, just run a different query rather than join your data frames.

        • In a recent project, I got data through a database query, did a bunch of calculations on various groups to generate a new dataframe, then read in a file from an external source and joined to my dataframe, then did additional calculations from there.

          I think you’re saying that what I should have done instead is imported my dataframe into the database, also imported the datafile into the database, done my join in sql, and then worked with the resulting file. I’m trying to clarify if that is indeed what you are suggesting. I’m not asking if there is some way to make this easy or fast or whatever, just asking whether this is what you are suggesting. Trying to make it a yes or no question.

        • Well, under the hood, this is what

          sqldf(“select * from dataframe1 join dataframe2 on …”) does, yes, but I’m not suggesting you should manually do it all, it’s automated in a single line of sqldf

        • Also I’d argue rather than “read in a file from an external source and join” in your data analysis project you should have a script that’s run first that reads in all the external sources and plops them in the DB. Then the analysis code has one nice source from which it works. It doesn’t always work best that way, but it’s a good default.

        • Working with an early script that puts all the data into a database in a standardized format is a great way to detect and fix issues with the raw data all at once, before you start analyzing it… So basically it seems like a bad idea to in the middle of your analysis code, suddenly read in a file…. if the file is *perfectly clean* then this can work ok, but if not… suddenly you’ll need a bunch of irrelevant cleaning code right here in the middle of your analysis…

          (and don’t get me wrong, I do this kind of stuff all the time, but if it’s more than a quick one-off I always regret it later… and eventually go back and database-ify the data handling)

          much better to get all the data read in, cleaned, and saved into a database specifically for use on this project… then start the analysis, in which case every time you need some data, you do a SQL query

          if you’ve created a new data table that becomes useful in joins, yes write it into the database explicitly. If it’s a temporary dataset and you just want to get to some end result of this dataset joined to another one, write it into the database as a temporary table and do your join, and get back the result.

          if you’re working with a smallish set of data all of which is in clean files, just read the files into data tables first thing and then rely on sqldf to do the munging directly on the data tables, treat the initially read in data tables as if they were tables in a database…

          In case it’s not clear, sqldf is a package that lets you treat R data frames as if they were tables in a database, fairly transparently

          https://cran.r-project.org/web/packages/sqldf/index.html

        • RStudio is slow and heavy. That sounds odd until you have a workflow that isn't. I still do use RStudio to make packages and docs, though.

        • Yea, I could never go back now. You can use it for free indefinitely if you deal with a periodic nag screen. But I paid for a license (I think $70)… eventually. I just couldn’t justify not paying to myself any more.

          Basically the only negative is that it is not open source, but it’s so customizable it feels like open source.

  5. If an experiment or study is well designed and data collected with care to ensure freedom from bias and minimum variation, then the data will yield useful information, possibly without any data analysis at all. Conversely, if we really foul up then no amount of data analysis will reveal any useful information. I have been arguing this all my working life, usually to blank faces.

  6. Understand how the data was generated before you process it. If you understand how the data was collected, what it means, and how it has been transformed on its way to the system you are interfacing with, you'll be much less likely to make mistakes. Failure to do so can lead to things like bad merges, inappropriate combination of columns, mistaking missing values for zeros (and vice versa), etc.

  7. Create a project folder with organized subfolders. If using RStudio, then do this as an RStudio project. Then use an RStudio notebook to write the script.

    If you constantly write short R scripts for small, one-off projects, such as playing with concepts in this and other blogs/twitter/papers, then create a project folder for this. Even better, create an RStudio project in which each one-off project can be its own notebook file. Even better, create a blogdown site. https://rdoodles.rbind.io

  8. For me, cleaning data is always really time-consuming, in particular when I use real-world data and (especially) string data such as names of cities/countries/individuals. In addition, when you make a survey for your research, there will always be that one person who types "b" instead of "B" or "B " (hitting the Tab key). For these reasons, my tip is: never underestimate the power of Excel (!!) when you have this kind of problem.

    I don't trust any dataset that has been touched by Excel. You are basically playing Russian roulette with your data.

    I've seen it mess up data just by opening a csv, looking at it, and resaving it. E.g., ~20% of genomics datasets have been messed up by Excel: https://www.ncbi.nlm.nih.gov/pubmed/27552985

    • yep, don’t let Excel have an original copy of any data. A useful trick is to make your original data source read only… process it all you like into a new clean dataset… but ALWAYS have the original unprocessed version (mostly applies for things you aren’t downloading off a public website)

      • Whenever I start a project the first thing I do is make a directory structure that includes folders like source/downloads, code, data, plots, docs, refs, etc. Then the data gets copied from the source/downloads directory to the data directory. It is really annoying down the line when something looks strange if you don’t have the original data to check. You are left wondering if you have a bug or the data was bad to begin with.

    • there will always be that one person who types "b" instead of "B" or "B " (hitting the Tab key). For these reasons, my tip is: never underestimate the power of Excel (!!) when you have this kind of problem.

      Actually, I'm not even sure why you would want to use Excel for something like that. It is a one-liner in R:


      toupper(gsub(" ", "", x))

  9. My advice is to pick up a copy of The Pragmatic Programmer, read it, and follow its advice. Also keep in mind that writing one-off research code for a case study is very different than writing production code for Stan, which in turn is different than writing flight control and pacemaker software. “Horses for courses” is the guiding principle here.

    Most advice is situational. For example, you do want to use a C++ back end to deal with web site processing if you need to be performant at the billions of query scale. Or you want to plug C++ into the back end of your PHP, etc. the same way you do in R or Python for performant code.

    Advice like “document code, but don’t overdo it” is off on the wrong foot. Let me be more blunt. The only code that should be documented is code that implements a confusing algorithm or statements/expressions that violate what an expert in the language would expect. Or, code meant to be used in a tutorial setting. Otherwise, it’s better to modularize code and use meaningful names so that the code’s readable. Comments get stale, whereas code never lies, so professionals wind up reading the code after reading the documentation just to verify it does what it says it’s going to do.

    With “keep it simple at first, you may have to rewrite,” the second half is good advice. Usually it’s impossible to keep things simple until you really understand the problem which doesn’t usually arise until the third iteration/rewrite. Good rewrite of code simplifies and shortens it by modularizing common operations, which you don’t even know about until you’ve rebuilt the system a couple times for a couple of different applications.

    Controlling the cache is part of the modern performance battle in getting bytes from memory to CPU registers. It takes on the order of dozens of arithmetic operations to get a floating point value from memory to the CPU. Same issue with getting bytes from disk or a network into memory. At that level, operations can fail (as they can with disks or memory if you have enough trials). We also need to worry a lot about branch prediction so that streaming data gets streaming operations (a failed branch prediction means all the optimistic computing the CPU does ahead of the true data flow gets dumped and you go back to do it again, most likely also messing up your cache). And of course, concurrency and synchronization.

    I always like the conflicting nature of "look before you leap" and "he who hesitates is lost". You need a bit of designing and planning ahead so you know what your goal is and how to break it into subgoals so you can even get started. But too much is useless, as some of this advice mentions, since your plans don't survive contact with the real world, at least if you have to interface with other code modules.

    The one piece of advice I can get fully behind is:

    > If the data set is large, try the processing steps on a small subset of the data to make sure the output is what you expect. Include checks/control totals if possible. Do not overwrite the same dataset in important, complicated steps.

  10. For any reasonably complicated set of data you can almost never achieve the best visualization the first time. Make lots of different graphs and select among them.

  11. “Write a script to analyze the data. Don’t do anything “manually”.”

    NO. NO. NO. Page through the data a bit. Load a sample into Excel. Mess around with it.

    Then, start all over with a fresh copy and write that script. Prematurely going to scripting doesn’t really save you time.

    • Seriously, I would NEVER EVER use Excel for ANYTHING other than producing a PDF invoice to send to someone.

      I can produce pages and pages of graphs and short snippets of data and soforth in minutes in R. There is nothing about doing that stuff in excel which is faster, quite the opposite.

      I don’t disagree with the idea of messing around with the data to understand it better but I think if you are someone who sees Excel as more interactive and easier, then you have a huge opportunity to learn more R or Julia and make things much faster.

      Often when working on a project I produce a big sequence of "looks at the data" as exploration, each of which is maybe 5 or 10 lines of R. They essentially say something like "hey, I wonder what would happen if I subsampled the data like this, and plotted these two variables against each other as points, and then colored the points based on this 5-level factor, and made the points have alpha = .1 to get a general idea of the density of the different populations in different regions…"

      I can literally do that in less than 5 lines of R, which I can write in something like 15 seconds, certainly less than the time it takes to manually select the range of data and open the graphing dialog and try to click on something that kind of does sort of that in Excel.

      If I want to do that across all pairs of two variables in my dataset it’s a simple matter to put what I just wrote in a loop and create pages of PDFs…
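
      For concreteness, roughly that kind of five-line look, with ggplot2's built-in diamonds data standing in for whatever you're actually exploring:

      library(ggplot2)
      d <- diamonds[sample(nrow(diamonds), 5000), ]     # subsample the data
      ggplot(d, aes(carat, price, colour = cut)) +      # two variables, coloured by a 5-level factor
        geom_point(alpha = 0.1)                         # low alpha gives a feel for density by region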

      there’s no such thing as “prematurely” going to scripting in my opinion.

      • I use Excel to do a lot of data cleaning. If I have something in Excel like
        id things_I_like
        1 mice
        2 soccer
        3 diamonds

        and I want to recode “things_I_like” to coarser groupings then I’ll do this

        id things_i_like til_regroup code
        1 mice animals ="if id="&a2&" then til_regroup = '"&c2&"';"
        2 soccer sports
        3 diamonds fashion

        Then I copy the first cell under code down to the last data line, which gives all the recodings. I can then copy and paste all the code into SAS and look at the counts of the regroups to see if I need finer or coarser groupings. If something is wrong I just need to make a change in the til_regroup column and recopy the code column into SAS. I usually save the Excel spreadsheet so I can have a copy of the decisions I've made.

        So, using excel to write my SAS code (it could equally be R code). Sounds gruesome!

        • you are working in SAS which is pretty much the first problem… but again this is a step you should do in a database… make a table of the mappings, then join the table.

          food, category
          icecream, sweets
          chocolate, sweets
          chicken, meat
          pork, meat
          beef, meat

          now whatever table that had food, join it to the food, category table and take the category column. either use code that explicitly generates the table as a temp table, or keep the table in the DB for later reproducibility.
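
          A rough sketch of that lookup-table join in R, reusing the categories from the example above (sqldf here, though base merge() does the same job):

          library(sqldf)
          likes <- data.frame(id = 1:3, things_i_like = c("mice", "soccer", "diamonds"))
          map   <- data.frame(things_i_like = c("mice", "soccer", "diamonds"),
                              til_regroup   = c("animals", "sports", "fashion"))
          sqldf("select likes.id, likes.things_i_like, map.til_regroup
                 from likes join map on likes.things_i_like = map.things_i_like")
          # merge(likes, map, by = "things_i_like") gives the same result in base R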

  12. I think these are overlooked by people who say they are writing "scripts," not considering them to be software. Both of these save time on a project even as small as the R script that does my tax return. (Not that I do them. But it would save me time if I did.)

    (1) Use a source control system, such as git. (And learn to quickly read/scan diffs.) Get in the habit of checking small changes in frequently.

    (2) Frequently run an entire suite of unit tests, and add tests to the suite as you code. To get started, when you find something that needs to be corrected, write a test that fails before you fix the problem, and succeeds after it.

    Do these things because someday you will swear that your script worked last week when it doesn’t now; you will eliminate half the effort of debugging if you know whether or not you are wrong about last week, and what changed in the meantime.

  13. I've always asked myself how to add unit tests for a data analysis project. I can think of ways of adding data validation checks such that the data is what you expect, but when I think of it, it usually works in a company setting. If I'm an academic researcher working on a static csv file, then the data won't grow/shrink, for example. This data validation makes sense on some production systems where the data has to have a certain shape/structure to be digested whenever the data ingestion occurs (quite frequently).

    You could always just add tests for the final dataset once the paper is done, to cross-check that every time you run the script the final data is what you expect. But then again, you're merely testing the 'characteristics' of the data (column classes, # rows, …) because a decimal can always change. And this leads me to the results…

    How would you unit test the results of your analysis? I have some unit tests in some of my packages that test whether a given algorithm gives the same result on the same piece of data to some numerical precision, but is this the way to go for academic papers? We often have a few dozen graphs which are all complex in their own way. Do people actually unit test their results? How do they go about it?
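
    One hedged sketch of what testing those 'characteristics' could look like with testthat (the tiny final_data here is just a placeholder for whatever your script actually produces):

    library(testthat)
    # toy stand-in; in a real project final_data would come from your processing script
    final_data <- data.frame(id = 1:3, group = c("a", "b", "a"), outcome = c(1.2, 3.4, 2.1))
    test_that("final analysis dataset has the expected characteristics", {
      expect_named(final_data, c("id", "group", "outcome"))
      expect_gt(nrow(final_data), 0)
      expect_false(any(is.na(final_data$id)))
    })
    test_that("key summary matches a value from a previous, hand-checked run", {
      expect_equal(mean(final_data$outcome), 2.233333, tolerance = 1e-6)
    })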

  14. These data processing tips are like weight loss tips. Everybody suggests a diet that helped them, regardless of whether it would be appropriate for others.

    Data analysis is too diverse now to have a one-size-fits-all toolbox. It's not just about different tools being useful for different applications; different people have natural tendencies toward different tools.

  15. Here is my advice:

    Data processing is a serious, hard, time-consuming job and should be treated as such.

    Data processing is not something that should be done as quickly as possible before the Exciting Science begins. Dedicate time and resources to data processing. Dedicate time to learning data processing tools. It does not matter how you do it, as long as you do it properly. If you are a team lead, dedicate professionals to do the data processing and to explain to everyone else how it was done.

  16. To a student new to data analysis, exploratory data analysis and data mining leading to forking paths/p-hacking/HARKing look the same – probably because all the prior knowledge coming into a class leads them to believe the goal of science is to "find a significant difference". Does anyone have a good paper that clearly compares and contrasts these, and especially addresses how to teach EDA without encouraging forking/p-hacking/HARKing?

    • They are the same thing. Strawman NHST is just fundamentally incompatible with science.

      Imagine if you are testing your hypothesis, then you would p-hack/etc and hope to not find a significant difference from the prediction. That is why I say NHST is bizarro science.

      For anyone who has not seen it:
      Paul E. Meehl, “Theory-Testing in Psychology and Physics: A Methodological Paradox,” Philosophy of Science 34, no. 2 (Jun., 1967): 103-115. https://doi.org/10.1086/288135

        • "Sometimes the only difference between good exploratory data analysis and bad scientific practice is publish-or-perish incentives." What does this even mean? If a naive student explores a rich data set looking for patterns and "confirms" these patterns with p-values, out of genuine interest in the noble goal of making scientific discoveries, is this good EDA because the incentive is not publishing or perishing? Is there good EDA (model building, model checking, etc.)? If so, how do we distinguish it from bad EDA? Or is *all* EDA bad, as Anoneuoid implies?

        • Or is *all* EDA bad, as Anoneuoid implies?

          This is the exact opposite conclusion the reader was supposed to draw from my comment. Get rid of NHST, not EDA.

      • You seem to be saying that there is no exploratory data analysis that is not NHST/fishing. I’m guessing that most applied statisticians/data analysts would disagree and argue that some amount of data exploration is a pretty good practice. For example – several of the posts above + many posts from Andrew + Andrew Gelman (2004) Exploratory Data Analysis for Complex Models. I’m asking the question, for those that think some level and some types of data exploration is a pretty good practice, is there a clear demarcation between the kinds of exploratory analysis we should do and the kinds that we shouldn’t do and is there a good source describing this (blog posts are welcome).

        • I’m saying exploratory data analysis is great, everyone should do it. It is good scientific practice.

          Just don’t check for “significant correlations” or try to conclude anything from it. You need to come up with explanations for whatever you find, derive a prediction about what you expect to see in new data, and then go ahead and collect that data and compare the prediction to it.

        • in an exploratory analysis you try to build a variety of models that make sense and have a chance of explaining the data based on your knowledge of the process. then you fit them using a Bayesian approach and see whether the posterior for parameters makes sense…

          the confirmation should come from looking at posterior predictive and seeing where the models predict differences and testing those differences.

          nowhere should you make intermediate yes no decisions based on p values. it’s this noisy classification that will lead you totally down the wrong garden path.

        • Agree. So going back to my original question: to a naive student, this looks the same, just without p-values. I was asking for a good source on how to teach the similarities and differences between good EDA and bad EDA to students.

        • I don’t have a good source, but I think EDA without decision making on marginal p values is fine, EDA in which you look to classify things into “true” and “false” based on p values and then especially follow the trail of “true” will lead you completely astray.

          In a teaching environment, perhaps you can do a simulation to see how bad this whole dichotomous conditioning with noise is.

          Imagine a binary tree, you do a “test” and branch left or right based on the results, and then at the next step, you do another test… left or right…

          Suppose there are 7 conditional tests (8 levels of the tree). Now suppose that your p value even gives you *exactly* the truth 95% of the time…

          simulate your way through the tree and calculate the probability of getting to the right answer… to make it easy make the right answer be the far left leaf of the tree. You can call left 1 and right 0… so the sequence you want is 1,1,1,1,1,1,1

          how often, using a test that is 95% correct, is your final resting place the truth? You can calculate it pretty easily as 1-.95^7 = 0.30 but simulations might help people see how far off they can get… particularly if you think of putting consequences on the errors… like if you are trying to diagnose a disease or something and each error you make puts you down a path to treating some other disease which leads to serious health consequences if you don’t have that disease (imagine for example cancer drugs without cancer, or liver transplant without liver disease…)

          See how dramatically bad your consequences get as you pile on these important but 95% correct binary decisions…

        • Duh, sorry it’s .95^7 = .70 but anyway the idea remains the same, consequences of ruling out true stuff because of an error in a binary classification are that you can never go back and everything you do from then on is wrong. Consequences pile up.

          And that’s just if your “tests” are 95% correct. In reality even minor model specification errors can easily make your 95% correct turn into say 90% or 85% and more serious misspecification can lead you astray to a massive extent…
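
          For anyone who wants to run the simulation suggested above, here is a minimal base-R version of the 7-step tree with a test that is right 95% of the time:

          set.seed(1)
          n_sims  <- 1e5
          n_steps <- 7
          steps_ok <- matrix(runif(n_sims * n_steps) < 0.95, nrow = n_sims)   # TRUE = correct call at that step
          mean(rowSums(steps_ok) == n_steps)   # fraction ending at the far-left (true) leaf: about 0.70
          0.95^7                               # the analytic answer, about 0.698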

        • differences between good EDA and bad EDA

          I don't really think bad EDA exists. People should feel free to do whatever they want. Pray to god to reveal the secrets of your data, if that's something that appeals to you. Of course, some methods will be more efficient than others, but I still wouldn't use the term "bad".

          The problems only arise when the speculations arising from EDA are presented as something else.

  17. This was likely the most helpful in my career – “Don’t be afraid to confer with others”.

    I was a bit surprised it was in the class responses, as I believe it is generally rare behavior in statistical practice.

    When I was at the University of Toronto, I regularly conferred with other statisticians about the projects I was currently working on at the Toronto Hospital, most frequently with David F. Andrews. Usually about once a month, I would put together a summary of the projects and what I had done or planned to do in them, and go through this with David. It was not always pleasant, but it was almost always helpful – especially when we both missed something that could later be seen as my being irresponsible. I hadn't been, as I had run it by another statistician.

    Now, when I left Toronto, David told me that no one else working at the U of T had ever done this with him (over about 20 to 30 years) on applied projects they were actually working on.

  18. One tip would be to get good at data visualization. I find that there are many sorts of data errors (e.g., outrageously large numbers, weird unrealistic distributions given prior knowledge) that I can find faster with some relatively simple plots like histograms, scatterplots, and the like, and that are hard to identify from just raw numbers.

    For example, one time I found that personality data (Extraversion in this case) had a bimodal distribution. This was really odd and not typical for this particular measure. It turned out that in our multi-site project, one site used a 1-4 scale while the others used a 1-7 scale for measurement! This wasn't documented clearly, and I wouldn't have thought to investigate without reviewing the plots!

  19. A long time ago, I was taking a short course on survival analysis taught by a distinguished statistician. He was covering software and asked the class, "What is the most important SAS PROC?" People of course threw out MIXED and PHREG. "No," he said, "it's PROC PRINT!" That advice has saved my butt innumerable times since then. What you thought you did regarding data processing isn't always what you actually did.

  20. Every “do” must be followed by a “check”. (And every means 100%, including the most trivial.)
    If you’re going to replace “m” with “M”, don’t just run it and assume it’s worked because of no error message. Do a tabulation or a visualization to confirm that the result is as expected.
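
    A tiny do-then-check sketch in R for that "m" to "M" example (df and size_code are made-up names):

    df <- data.frame(size_code = c("m", "M", "L", "m", NA))   # made-up data with stray "m" values
    before <- table(df$size_code, useNA = "ifany")
    df$size_code <- sub("^m$", "M", df$size_code)             # the "do"
    after <- table(df$size_code, useNA = "ifany")
    print(before); print(after)                               # the "check": tabulate and compare
    stopifnot(sum(before) == sum(after))                      # nothing dropped or duplicated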

    Always think about side-effects. The processing step may fix the problem you’re thinking about but does it cause another problem you aren’t thinking about? This is one reason why the check is so important!

    If you are receiving data from some other team, open the data in a text editor and look at the first few lines. You’d be surprised (or not) how often you can immediately return to sender.
