A colleague writes,
Now that I work for a large software company, my computer code is heavily scrutinized: there are style guides, rules for indenting, conventions for variable naming, etc. I’ve come around to it–it really does make my code a lot better. blah blah blah. But my company doesn’t really have formal rules for R because hardly any engineers use it. I’m working on writing said rules, and it seems incomplete not to include something about writing efficient code: avoiding nested loops, etc. Do you know of any good references on how to write good R code?
1. I’m curious what the R experts say. My own stylistic preferences differ slightly from what appears to be the default in R packages: in particular, I like to indent 2 characters, but R seems to indent a lot more, which to my taste makes the code hard to read in a text editor. (My own stylistic preferences can be deduced from the examples of R code in my books.)
2. I know that there is general advice to avoid global variables–it’s better to pass information in function arguments. When writing Umacs we found this to be awkward and so we used global variables instead, but since then somebody explained to me how to do it all using local variables (without requiring a huge effort in passing lists of variables). Unfortunately I can’t remember now how I was going to do it. Maybe the idea was to put all the arguments in a list.
3. I personally like to fill up arrays with NA’s when I set them up, so that if something goes wrong, I’ll get lots of NA’s in the result, and I can trace the problem back to its source.
4. I think R has a debugger but I’ve never used it. I probably should.
5. As you know, I’ve been moving toward the idea of simulating fake data for every problem as a test of the algorithm and code. I call this the self-cleaning oven principle: a good package should contain the means of its own testing. We haven’t yet done this with “arm” but we should.
6. I agree about avoiding nested loops–when they cause programs to be slow. On the other hand, sometimes a matrix implementation can just be mysterious, and I find it helpful to spell things out with loops. (Again, we discuss this in our book–we even have a footnote or two explaining why we have some loops.)
7. I like to follow the general principle that lines of code should (almost) never be repeated. I’m always seeing students write scripts with cut-and-pasted code, and I’m always telling them to use a function and a loop instead.
8. A silly little thing: with if () statements, I recommend always using braces (curly brackets), even if the conditional command is just one line. If you or someone else wants to modify a function, it’s much easier to do so if the braces are already there.
9. R functions are getting uglier and uglier. I’d say the typical R function is 90% “paperwork” (exception handling, passing of names, etc.) and only 10% “meat” (to mix analogies). I attribute some of this to the S4 system of objects with slots etc. For one thing, it’s typically no longer possible to see what a function does by typing its name. I don’t know what to say here, except to recommend not drinking the Kool-Aid: maybe you can try to keep your functions clean rather than putting all the effort into the paperwork. (Unfortunately, we didn’t really follow this advice with bayesglm: we made the mistake of adapting the existing glm function.)
10. Scalability is a big issue in R. Ideally any new function would be accompanied by a statement explaining how it scales as the inputs increase in size.
11. When summarizing your output, I recommend working with “display()” (from the arm package) rather than “summary()”. The summary() function always seems to give a lot of crap, and we’ve tried to be cleaner and more focused with display(). One option is to set up functions for both so that users can typically use display(), with some extra information in summary().
12. It’s a good idea to graph inferences. Graphs aren’t just for raw data.
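Points 3 and 5 can be sketched together in a few lines of R. This is only an illustration under my own assumptions: the regression model, the variable names (a_true, b_true, b_hat), and the tolerance of 0.1 are all made up for the example and are not taken from arm or from the books.

```r
# Hypothetical example: the model, names, and tolerance are illustrative only.
set.seed(123)

# Point 5: simulate fake data from known parameter values.
n <- 1000
a_true <- 1.5
b_true <- -0.7
x <- rnorm(n)
y <- a_true + b_true * x + rnorm(n, sd = 0.5)

# Point 3: set up the result vector full of NA's, so that any entry
# the loop fails to fill is still visibly NA afterward.
n_boot <- 50
b_hat <- rep(NA_real_, n_boot)
for (i in 1:n_boot) {
  idx <- sample(n, replace = TRUE)   # bootstrap resample of the rows
  fit <- lm(y[idx] ~ x[idx])
  b_hat[i] <- coef(fit)[2]           # estimated slope
}

# Any leftover NA points to an iteration that went wrong.
stopifnot(!anyNA(b_hat))

# Fake-data check: the estimates should land near the known true slope.
stopifnot(abs(mean(b_hat) - b_true) < 0.1)
```

If the fitting code had a bug, either the NA check or the comparison with b_true would fail, which is the self-cleaning part.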
This seems like too much advice; maybe some of the above rules are unnecessary or can be written more generally.
In any case, if you’re writing guidelines, I recommend giving examples of the recommended approach and also the bad approach for each rule.
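For instance, here is roughly what such a paired example might look like for rule 7, with rule 8's braces thrown in. The data and the function name standardize() are hypothetical, made up just for this sketch.

```r
# Hypothetical data; the names are illustrative only.
scores <- list(a = c(2, 4, 6), b = c(1, 3, 5), c = c(10, 20, 30))

# Bad approach: the same line pasted three times with small edits.
z_a <- (scores$a - mean(scores$a)) / sd(scores$a)
z_b <- (scores$b - mean(scores$b)) / sd(scores$b)
z_c <- (scores$c - mean(scores$c)) / sd(scores$c)

# Recommended approach: write the operation once as a function...
standardize <- function(v) {
  # Braces even around a one-line conditional (rule 8).
  if (length(v) < 2) {
    stop("need at least two values to compute a standard deviation")
  }
  (v - mean(v)) / sd(v)
}

# ...and then loop over the inputs.
z <- list()
for (name in names(scores)) {
  z[[name]] <- standardize(scores[[name]])
}
```

Both versions give the same numbers, but a change to the second one (say, dividing by 2 standard deviations) has to be made in only one place.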
Perhaps others have suggestions too (or comments on my ideas)? Once you’ve written your guidelines, I hope you can publish them, with discussion, in a statistics journal so all can see. There may already be some R style guide that you can adapt and react to.