R sucks

I was trying to make some new graphs using 5-year-old R code and I got all these problems because I was reading in files with variable names such as “co.fipsid” and now R is automatically changing them to “co_fipsid”. Or maybe the names had underbars all along, and the old R had changed them into dots. Whatever. I understand that backward compatibility can be hard to maintain, but this is just annoying.

29 thoughts on “R sucks”

  1. I think R’s behavior for handling special chars in column names has always been suboptimal. Often the first thing I do after reading a dataset is to sanitize the names in a way that I like. The gsub function can be helpful here.
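    As a rough illustration of the gsub approach, here is a minimal sketch. The helper name clean_names and the particular rules (lower case, dots as separators) are assumptions chosen for the example, not anything prescribed by R.

    ```r
    # Hypothetical helper: normalise column names to lower case with dots.
    clean_names <- function(nms) {
      nms <- tolower(nms)
      nms <- gsub("[^a-z0-9]+", ".", nms)  # any run of other characters becomes one dot
      gsub("^\\.+|\\.+$", "", nms)         # drop leading and trailing dots
    }

    # Toy data frame with one underscore name and one name containing a space.
    df <- data.frame(co_fipsid = 1:2, `State Name` = c("a", "b"),
                     check.names = FALSE)
    names(df) <- clean_names(names(df))
    names(df)
    ```

    Because the rule lives in your own function, the resulting names should not depend on which version of R read the file.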

  2. I’m not able to fully reproduce this behavior. I used the latest version of R (beta).
    Can you share some code to reproduce the error?

    df <- read.table(text = "A.b C.d
    1 3
    10 11",
    header = TRUE)

    str(df)
    ## 'data.frame': 2 obs. of 2 variables:
    ## $ A.b: int 1 10
    ## $ C.d: int 3 11

    sessionInfo()
    ## R version 3.0.1 Patched (2013-06-28 r63090)
    ## Platform: x86_64-unknown-linux-gnu (64-bit)

    ## locale:
    ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
    ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
    ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
    ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
    ## [9] LC_ADDRESS=C LC_TELEPHONE=C
    ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    ## attached base packages:
    ## [1] stats graphics grDevices utils datasets methods
    ## [7] base

    ## other attached packages:
    ## [1] XML_3.98-1.1 stringr_0.6.2 reshape2_1.2.2 plyr_1.8
    ## [5] MASS_7.3-26

    ## loaded via a namespace (and not attached):
    ## [1] compiler_3.0.1 tools_3.0.1

    • I cannot reproduce this behavior either with the current version of R. I believe it was because an old version of R changed the underscore to a dot, but more recent versions of R can respect the underscores now.

      > df <- read.table(text = "A.b C.d\n1 3\n10 11", header = TRUE)
      > df
      A.b C.d
      1 1 3
      2 10 11
      > sessionInfo()
      R version 3.0.1 (2013-05-16)
      Platform: x86_64-pc-linux-gnu (64-bit)

      locale:
      [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
      [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
      [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
      [7] LC_PAPER=C LC_NAME=C
      [9] LC_ADDRESS=C LC_TELEPHONE=C
      [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

      attached base packages:
      [1] stats graphics grDevices utils datasets methods base

      • I’m pretty sure that Andrew’s problem is that he has column headers with underscores which *used* to get converted to dots by make.names() (called from read.table()) but no longer do. In fact, ?make.names shows that there is an “allow_” argument (“for compatibility with R prior to 1.9.0”):

        > make.names("a_b")
        [1] "a_b"
        > make.names("a_b",allow_=FALSE)
        [1] "a.b"

        So following x <- read.table(…) with setNames(x,make.names(names(x),allow_=FALSE)) should
        reproduce the old behavio(u)r.

        This seems to do the trick:

        library("Defaults")
        setDefaults(make.names,allow_=FALSE)
        read.table(text=
        "first_col second_col
        12 34
        ",header=TRUE)
        ## first.col second.col
        ## 12 34

      • Why is underscore better? It’s annoying for me, because ESS in Emacs always converts it to <- unless I quote it first. I prefer the dot. And cleaning up names myself makes the code easier for me to maintain when I'm collaborating with folks who change variable names on me and capitalize inconsistently.

        • ESS will include an underscore as underscore (instead of <-) if you just hit underscore twice. Much easier than C-q _.

        • I’m a much faster typist than ESS gives me credit for. I can type <- basically three times as fast as I can recover from typing _, having it display <-, and having my brain say "hit delete, something went wrong".

          How can I turn off this *feature* in ESS that has always annoyed me!!!

        • Excellent thanks. I used to just think it was mildly annoying, but ever since starting to use ggplot2 with all its “geom_line” and “coord_cartesian” type function names etc it’s become a huge annoyance.

  3. `_` was not allowed in R names a long time ago, as it was synonymous with the assignment operator `<-`. R is not doing any such conversion today (modified from Ahmadou's comment):

    > df <- read.table(text = "A.b C_d\n1 3\n10 11", header = TRUE)
    > df
    A.b C_d
    1 1 3
    2 10 11

    This change was long requested by UseRs before R Core relented. The problem is the cleaning up that goes on when R reads in data to data frames – you have to turn off a lot of things if you want to read the file verbatim, but then you need to work harder as variable names may not be usable in R without quoting them.

    I now routinely read in data and then clean up the variable names by explicitly setting them. That way even if R’s sugar changes I have set the variable names myself so the rest of the code will continue to work. This assumes the underlying data file doesn’t change but I have a personal rule not to change raw data files.
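    A minimal sketch of that workflow, using a made-up two-column file (the column names and values are toy examples only). It assumes read.table() with check.names = FALSE, which keeps the header as written in the file:

    ```r
    # Read the header verbatim; check.names = FALSE skips make.names().
    raw <- read.table(text = "co_fipsid pop-2010\n1 10\n2 20",
                      header = TRUE, check.names = FALSE)
    original <- names(raw)   # names exactly as they appear in the file
    raw[["pop-2010"]]        # non-syntactic names need [[ ]] or backticks

    # Set the names explicitly so later code does not rely on R's defaults.
    stopifnot(ncol(raw) == 2)  # guard against the file layout changing
    names(raw) <- c("co.fipsid", "pop.2010")
    ```

    The explicit assignment is the part that protects the rest of the script. The stopifnot() line is just one way to notice if the raw file ever changes shape.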

    • Rahul:

      Good idea. I have to admit, though, that the backward compatibility problem is mine as well as R’s. In addition to the problem with the dots and underscores, I also had some code out of order, and some data files were in the wrong directories. I eventually did get it to work (using R 3.0), though.

    • Andrew’s problem derives from the absence of commercial purpose in the R project (which is not to say that there is anything wrong with the design of the R project). Any commercially-oriented solution would strive to include a system to preserve the execution of legacy code: see, e.g., the version command in Stata. The R project gets its user base from many qualities like “state of the art” and “addictive like crack”; “version resilient” is not one of them.

      A simple fix is Rahul’s fix. A less simple fix would be to bundle all versions of R together into a single software package, and to allow the user to set the version of the interpreter. I might be overestimating the feasibility of that task, beyond the size of the resulting package. But if some people are developing faster versions of R, it should be possible to develop a mere collage of all R interpreters, right?

      This has several implications for teachers too, since teaching R is teaching something that is moving freely, quickly and in many directions (e.g. knitr, ggplot2, Shiny, rCharts, Rcpp, RHadoop). The size and anarchic nature of the package ecology is one thing that you end up teaching to students who are used to much more hierarchical representations of social order. It also connects well with other learning themes in open access statistics.

      • It’s a tough tradeoff. It’s hard to maintain backward compatibility and add features or speed etc. at all times. A common software engineering problem.

        Python has a similar break in compatibility. In clusters I’ve administered (Linux) it was fairly common to maintain several legacy distributions of code for this very reason.

        A setenv / alias or similar got your R or Python to point to whatever version you wanted to use.

        I’m not sure if the “bundle all packages” is a neat solution. That causes bloat and bug fixes etc. get harder.

        One option R should consider is to provide a code translator. Something I can run on legacy code that will transform tricky parts to the new standard automagically.

  4. What you are describing is a fundamental difficulty of software development. Imagine if you had to ensure that Stan code written today would function identically with Stan version 12.1 ten years from now. Your dev team would kill you. That said, there are much better ways of handling versioning and dependency that would alleviate some of the frustration. See http://arxiv.org/pdf/1303.2140.pdf for one set of proposals.

    • I think your code must be a wee bit older than 5 years. If I understood the problem correctly, your file had variable names with underscores (e.g., co_fipsid). In R >= 1.9.0, underscores in names are valid, so mangling no longer takes place. Current R is preserving the original variable names, which is a feature that everyone wants.

      I’m sorry that this breaks your code, but I have to point out that R 1.9.0 was released in April 2004, which would make your code more like 10 years old. This is frankly too old for you to be justified in the use of such inflammatory language when it doesn’t work.

  5. The worst backwards-compatibility problem I’ve encountered with R is reading in a saved workspace. If you are missing a critical package that’s used by something in the image, it won’t load. You’re stuck as far as I can tell. (At least in base R.)

    This invariably bites me when a new version of R comes out and I jump to it a bit too soon (before some obscure package I can’t remember using is updated).

