The future of R

Some thoughts from Christian, including this bit:

We need to consider separately

1. R’s brilliant library

2. R’s not-so-brilliant language and/or interpreter.

I don’t know that R’s library is all that brilliant; if necessary, I don’t think it would be hard to reprogram the important packages in a new language.

I would say, though, that the problems with R are not just in the technical details of the language. I think the culture of R has some problems too. As I’ve written before, R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like “GOTO 120.”

I learned about these problems a couple years ago when writing bayesglm(), which is a simple adaptation of glm(). But glm(), and its workhorse, glm.fit(), are a mess: They’re about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a couple hundred lines of naming, exception-handling, repetitions of chunks of code, pseudo-structured-programming-through-naming-of-variables, and general buck-passing. I still don’t know if my modifications are quite right: I did what was needed to the meat of the function, but no way can I keep track of all the if-else possibilities.
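For context, the "meat" of a glm fit is iteratively reweighted least squares. A minimal sketch in Python/NumPy for the logistic case (not the author's code, just an illustration of how few lines the core algorithm actually needs):

```python
import numpy as np

def irls_logistic(X, y, iters=25, tol=1e-8):
    """Minimal IRLS for logistic regression: the 'meat' of a glm fit."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta                    # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse link (logit)
        w = mu * (1.0 - mu)               # IRLS weights
        z = eta + (y - mu) / w            # working response
        # Weighted least squares step: solve (X'WX) beta = X'Wz
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Everything else in glm.fit() — family dispatch, offsets, starting values, rank-deficiency handling, error recovery — wraps around this core, which is where the length comes from.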

If R is redone, I hope its functions return to the lean-and-mean aesthetic of the original S (but with better graphics defaults).

23 thoughts on “The future of R”

  1. I agree. For most (simple) operations R code is concise and simple, but for anything complicated you beg for a real language. That does not mean C or C++ or FORTRANxx (we really have moved beyond those dinosaurs); we now have well-defined scripting languages that are so much better than the S language and its S3 and S4 classes. Really.

    Now, would I stop using R in its absence? Hardly: all I do are simple operations, as do 99% of us (as a guesstimate). Suck in a library or two, and get the job done. That is why we like R. Get rid of the arcane, obscure language (did you RTFM the data types?), and we might have something more universal.

  2. To quote Bill Venables: "Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous." And he wrote that almost three years ago.

    R's library is like the curate's egg, good in parts. The quality of packages is not checked and as a result you need to be very careful about using any package that you do not know well. We need a new version of the expression caveat emptor. What is the Latin for downloader?

  3. Yes, the innards of the R functions can be daunting. But the idea is that the calls to the functions are lean and mean. So the actual statements for your analysis are lean and mean, even if the code beneath them is not. For me it is an acceptable trade-off. Also, I do not think it is possible to achieve both goals: nice code and flexibility. It is the same story with optimised versus readable code. Heavily optimised code usually obscures what the code actually does.

  4. I would much rather see Python gain wider acceptance in the statistics community than wait for a better R. Python is already the de facto glue language in the open source community, has great support for native calls (C, Fortran), and is making inroads in scientific computing circles. I convert BDA and ARM examples into Python (Numpy, Scipy, Scikit) whenever I can. And of course, there is PyMC.

  5. I think this sort of reflects the different preferences/needs of statisticians and computer scientists. So R, as a stat package and a language, has to strike a balance somewhere. I'd love to just focus on programming the stat parts in a super efficient and concise language and let the CS folks take care of the peripheral stuff, but in reality we probably have to deal with the language specifics too, more or less.

    So for questions like this I listen to guys like Radford Neal, who is both a statistician and computer scientist. I think he is more concerned with speeding up R, maybe we can also get his take on how to get R "cleaner" for statisticians.

  6. I come from an IT programming background, and I believe it makes sense for scientists to use a programming language that can be used for a multitude of tasks, as well as sci/stat computing. This way, if a researcher wants to code up a webapp, data-scraping subroutines, etc., they wouldn't have to learn a different language and toolset.

    As for language specifics, Python's exception handling, support for iterators, regexes, and functional and OO styles, plus its readable format, make it a good candidate. My $0.02.

  7. I love python, and would love it if SciPy/Numpy became the better version of R. Alas, the python libraries last I checked are nowhere close. They lack vital core functions, to say nothing of the panoply of useful CRAN addons.

    More folks like Neal digging through the code guts would be great. Speedups like his usually have the ancillary effect of cleaning up the codebase.

  8. Think about it like playground surfacing. For metaphorical purposes R's library itself is play equipment while the actual foundation of the playground is the usage of a more advanced language.

  9. I think the problems with R's interpreter are not inherent to R. Anything that gets used by people is going to pick up weird warts over time, and it's possible to write bad code in any language. I don't think switching to some other language would really solve any of those problems, though it would make users of Language Y happy.

    If we really MUST pick a general purpose language I propose we ditch all of these Johnny-come-lately languages and use a language that actually makes sense: Smalltalk. The persistent, interactive, environment is suitable for the analysis needs of the practitioner and it has been a proven platform for production applications for decades for the engineer. It is also a heavily graphical environment (the GUIs we use today were famously born there) so we can even retain R's world-class static graphics production while adding a new suite of interactive graphics facilities. It has even been carefully designed by one of the so-called "CS experts" I hear so much about. (And while this is a bit snarky, it's not actually a joke. I actually think it would be a good choice if a choice needed to be made).

    On the Neal subject, I think it would be nice if R Core members didn't denigrate the work someone did (and nicely packaged for them) to get some nice gains without (as far as I can tell) semantic changes to the language. If I have a problem that takes 25 hours to process a day's worth of data that suddenly takes 20 hours, I just went from an infeasible approach to a feasible approach without code or hardware changes. That is a big deal to me.

    In the longer term, I think that R's interpreter performance problems are actually solvable to a degree. I noticed that Duncan Temple Lang (he's like R's Santa Claus) recently posted a working Rllvm package (unlike my own, which got to the "sort of working" stage before being lost to the day job), which is interesting since we can now start to JIT operations. The tracing JITs being built into Javascript interpreters are also interesting in that they can get performance gains from languages that are otherwise very annoying to compile. Something like the V8 engine might actually be a good host for a high-performance R.

    A shared runtime would also really help with the Playing Nicely With Others problem. For a while I thought the CLR might be the right approach. It's a good idea: IronPython, IronRuby… IronR? All sharing a common environment. There exist C and Fortran compilers for it, so it might even be possible to slowly bootstrap your way there. Unfortunately, it seems like the will for that sort of thing has faded (and people are understandably leery of tying themselves to a Microsoft technology, even if there is an open source implementation).

  10. I agree there's some work to be done on Numpy / Scipy; for example, there is no native equivalent of R's sample() call in Python, so I wrote the corresponding function in pure Python. But its matrix manipulation routines are robust enough, and the same is true for the plotting routines, accessed via Matplotlib. As a sign of its maturity, there is now a company founded purely to provide services around scientific Python (Enthought).

    Also, India announced a billion dollars to fund Python in scientific education under a project called FOSSEE.

    http://fossee.in/

    My ARM, BDA code can be found here:

    http://ascratchpad.blogspot.com/search/label/gcsr

    http://ascratchpad.blogspot.com/search/label/arm
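    A pure-Python stand-in for R's sample(), of the kind the commenter describes, might look something like this (a sketch, not the commenter's actual code; NumPy has since added numpy.random.choice, which covers the same ground):

```python
import random

def sample(x, size=None, replace=False, prob=None):
    """Rough analogue of R's sample(): draw `size` elements from x."""
    n = len(x)
    if size is None:
        size = n  # R's default: a permutation of x
    if replace:
        if prob is None:
            return [random.choice(x) for _ in range(size)]
        return random.choices(x, weights=prob, k=size)
    if prob is not None:
        raise NotImplementedError("weighted sampling without replacement")
    return random.sample(list(x), size)
```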

  11. Personally I love the R language. It does a great job of being useful for both interactive work and building libraries/classes. The semantics of passing arguments to functions are very cool; Tom Moertel detailed them here better than I could: http://blog.moertel.com/articles/2006/01/20/wondr

    About the R internals though, I know almost nothing. Maybe they're really crufty. But that doesn't affect most of us very often, and it can likely be sorted out by creative refactoring & cleanup if the test suite is good enough.

  12. As someone with a fairly strong programming background (many years of mathematical model development in C++), I have to say I've found R quite frustrating to work with. The plethora of data types, mysterious type conversion errors, the appearance of strings or "levels" when I thought I was working with numbers, and so forth make it a challenge. Everything has to be tried out line by line in the interpreter.

    Not being a SAS, SPLUS or S user, I can't comment on the analogous experience, but I'll just say that I find it even more challenging to get things right the first time in R than in Perl, and that's saying something.

    With respect to the specific comments about exception handling, etc., that kind of stuff typically improves the lives of end users a lot. Graceful failure and meaningful error messages can save enormous amounts of debugging time.

  13. I love Lisp, and would love to see that as a basis for New R, as Incanter proposes. Problem is, the vast majority of users simply cannot accept Lisp's (superior) syntax. And in some sense, we've been there, done that, left it behind: xlispstat, et al.

    Python's a nice language, but I haven't found it as easy to get up and keep up to date as R; you have to decide which version of Python you're going to run with, and, as others have commented, its "batteries included" barely applies to the statistical areas covered by R and its packages. Not to mention that if people don't like vectorizing their R code, exactly how are they going to begin to understand iterators or take advantage of its functional programming aspects?

    (It also feels like we've been there and done that with Python as well: Sage.)

    And, having used many languages over the years, I like R. Some of it is too flexible for its own good, but on the whole it's refreshing as a language, and it's a testimony to the brilliance of the S designers that it has so many modern features. I'm not persuaded that (essentially) attempting to port its users and libraries to Lisp or Python is the answer.
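    For what it's worth, the vectorizing habit the commenter refers to carries over almost unchanged between the two ecosystems; a NumPy example of the same loop-versus-vectorized choice R users face:

```python
import numpy as np

x = np.arange(1.0, 6.0)  # [1, 2, 3, 4, 5]

# Loop style, which newcomers often write first:
total = 0.0
for xi in x:
    total += xi * xi

# Vectorized style, as idiomatic in NumPy as it is in R:
total_vec = np.sum(x ** 2)

assert total == total_vec  # both are 55.0
```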

  14. All the Lisp people know that there has already been a Lisp-based statistical environment, right? It was called LispStat, the book about it is probably in your local University's library. You don't actually need to reinvent that wheel.

  15. There is a third huge feature of R besides the library and the language and that is the community. The community is worth more than either the language or the library because either of those can be recreated by an energetic community in very short order.

    The sad fact, though, is that this community will make it impossible to ever change R all that much. It may someday be possible for a competitor to arise that provides a significant enough advantage to eventually build a community, but that will be incredibly difficult to do. Systems like Incanter, with built-in features like parenthesis-based syntax that trigger allergies in large portions of the population, are starting with one foot in a very deep hole as a result.

  16. There are many R functions that try to do a lot of things. In these cases, almost always, a simple S4 object can yield very clean code, with an outside reader going straight to the "10 lines" of code. I think that is more a bad culture of R package development than a language problem.

  17. How could one speed up the evolution of packages? (The question is aimed not at core developers, but at the many people who use and write packages, whether in R or in numpy/scipy.) Sites like … with user comments and ratings must help; package reviews by expert critics might too. I'm sure R has such but I'm not an R guy.

    What makes for good packages? Analyzing some really good ones would at least be more fun than beating on crummy ones: "some people *enjoy* not quite understanding what they're doing".

    For a discussion on R vs numpy-scipy a year ago, mostly by software engineers, see
    http://stackoverflow.com/questions/1177019/what-c
    (and look for "bovine".)

Comments are closed.