Learn by experimenting!

A students wrote in one of his homework assignments:

Sidenote: I know some people say you’re not supposed to use loops in R, but I’ve never been totally sure why this is (a speed thing?). My first computer language was Java, so my inclination is to think in loops before using some of the other R functions that iterate for you. Maybe I should practice more with those.

There’s an answer to why loops are slow in R, I think it has something to do with memory allocation and the creation of new variables.

But the “why” is not really my point here.

Rather, my point is that you can directly check if it’s a speed thing. Just run a simulation.

For example:

N <- 1e5
date()
a <- rep(1, N)
date()
for (n in 1:N) a[n] <- 1
date()

OK, let's try it out:

> N <- 1e5
> date()
[1] "Fri Sep 15 09:15:18 2017"
> a <- rep(1, N)
> date()
[1] "Fri Sep 15 09:15:18 2017"
> for (n in 1:N) a[n] <- 1
> date()
[1] "Fri Sep 15 09:15:18 2017"

Damn! It ran too fast. Let's up the N to 10 million:

> N <- 1e7
> date()
[1] "Fri Sep 15 09:16:03 2017"
> a <- rep(1, N)
> date()
[1] "Fri Sep 15 09:16:03 2017"
> for (n in 1:N) a[n] <- 1
> date()
[1] "Fri Sep 15 09:16:07 2017"

Yup. The loop runs slower than the unlooped version.

Is it just that rep(1, N) is just some super-fast function? We'll try another:

N <- 1e7
date()
a <- rnorm(N)
date()
for (n in 1:N) a[n] <- rnorm(1)
date()

Here's what happens:

> N <- 1e7
> date()
[1] "Fri Sep 15 09:17:29 2017"
> a <- rnorm(N)
> date()
[1] "Fri Sep 15 09:17:29 2017"
> for (n in 1:N) a[n] <- rnorm(1)
> date()
[1] "Fri Sep 15 09:17:50 2017"

Again, the loop runs slower than the unlooped version.

The above is nothing definitive. The point is that, yes, a little experimentation can clarify our questions. The same principle applies in more complicated settings, for example this article which relates the results of experiments that had a big effect in the practice of computational statistics.

It's good to get in the habit of experimenting as a routine part of your workflow.

24 thoughts on “Learn by experimenting!

  1. This got me looking at the rnorm source code to see if I could point out all the overhead from calling it multiple times… but I came away more confused. Where is the loop implemented? It seems like the “n” argument just disappears:

    Here is the source for rnorm, it is a wrapper function that calls some compiled code called C_rnorm:

    rnorm <- function(n, mean=0, sd=1) .Call(C_rnorm, n, mean, sd)

    https://github.com/wch/r-source/blob/2ab7d19ebb0c167a60fe3d1d68982615725dc909/src/library/stats/R/distn.R

    You can see that function is in the stats library and there should be a C function somewhere called “rnorm”

    > stats:::C_rnorm$name
    [1] “rnorm”

    Here it is, without any argument for number of results to return (“n”):

    double rnorm(double mu, double sigma)
    {
    if (ISNAN(mu) || !R_FINITE(sigma) || sigma < 0.)
    ML_ERR_return_NAN;
    if (sigma == 0. || !R_FINITE(mu))
    return mu; /* includes mu = +/- Inf with finite sigma */
    else
    return mu + sigma * norm_rand();
    }

    https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/nmath/rnorm.c

    Then that function is calling norm_rand which also looks like it returns one result at a time:
    https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/nmath/snorm.c

    It seems norm_rand includes a couple different algorithms, we can see what is being used with RNGkind:

    > RNGkind()[2]
    [1] “Inversion”

    So we want to check the “Inversion” method:

    case INVERSION:
    #define BIG 134217728 /* 2^27 */
    /* unif_rand() alone is not of high enough precision */
    u1 = unif_rand();
    u1 = (int)(BIG*u1) + unif_rand();
    return qnorm5(u1/BIG, 0.0, 1.0, 1, 0);

    https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/nmath/snorm.c

    To see what that is doing we can look at unif_rand and qnorm5 which also look like they deal with one value at a time:
    https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/nmath/standalone/sunif.c
    https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/nmath/qnorm.c

    I really expected to find some loops in the source code, but there are none…

    • or `system.time()`, which is in base R. But those MCMC methods can take months or years to run, so I see why Andrew uses `date()` (being facetious here)

  2. The microbenchmark package is indispensable for timing fast code:

    > microbenchmark(a <- rep(1e5, 1), for (n in 1:1e5) a[n] <- 1)
    Unit: nanoseconds
    expr min lq mean median uq max neval cld
    a <- rep(1e+05, 1) 303 541 3714.97 2356 6187.5 27574 100 a
    for (n in 1:1e+05) a[n] <- 1 5587133 7129368 25695275.88 14099214 40116061.5 114521706 100 b

  3. Your first example of unlooped code has a mistake. It repeats N just 1 time, rather than repeating the number 1 N times. (But it takes <1 second to run either way.)

    N <- 1e5
    rep(N, 1) # 1e5
    rep(1, N) # 1 1 1 1 1 1 1 1 1 [etc]

  4. I think there are two ways to look at the “why” question. The first is speed, and the second is programming style. Every language (computer or otherwise) has a point of view with regard what things are easily expressed and things that are difficult. R (and S) trace their lineage more from Lisp, and thus are more comfortable with functional expressions of computational ideas. Mutation and imperative styles of programming are of course fully supported (for loops and the like), but I think that it is important to note that just because something is the dominate form of expression, doesn’t mean it is the most concise or expressive.

    On the speed note, I’d just add an aside, that for loops and vectorized formulations (e.g. *apply) are similar in performance, so long as the vectorize computation is happening in R. This is really the key in optimizing R code. What you want to do is find a way to express your computation so that the performance critical part of it is being evaluated in C.

    Consider the below. The first expression is vectorized, but actually slower than the for loop. One still might prefer the first expression to the second based on the conciseness of the code. On the other hand, the third expression vectorize the code and executes it down in C, which in general will net you about a ~100 times speed up.

    system.time({v <- replicate(1e6, function(x) rnorm(1))})
    user system elapsed
    2.189 0.053 2.242

    system.time({
    v <- rep(NA,1e6)
    for (n in 1:1e6) v[n] <- rnorm(1)
    })
    user system elapsed
    1.725 0.391 2.115

    system.time({v <- rnorm(1e6)})
    user system elapsed
    0.062 0.001 0.063

    • What I’ve read before is that the loops buried deep in R’s underlying C code are much more highly optimized than you can implement in R itself.

      • That is exactly right. R is what they call an interpreted language, which means that a program parses the commands and then does what they tell it to do (details about byte code compilation not withstanding). C code is directly compiled into machine instructions and requires no intermediary program. So, when you run a loop in R where it does lots of loops and each loop does relatively little work, the result is that a lot of time is spent parsing commands and dispatching the work.

        This will help you get some intuition about where in your code to look for speed up opportunities. If a for loop does a 100 iterations a second, there is probably not a lot of overhead wasted in the looping itself. However, if you are iterating 100,000 times a second, you probably have a prime candidate to try to push that loop down into the C code through either vectorization or a foreign language interface (e.g. Rcpp).

  5. Beginners and many experienced developers over emphasize code efficiency. Efficiency is not an issue for most programs.

    Trying to avoid loops is about a change of thinking about how to write code.

    Loops involve some amount of housekeeping, e.g., the loop variable.
    Having to use a variable for indexing increases the risk of making a mistake (different kinds of mistakes can be made with the alternative approaches, but I think they are less likely, at least for the kind of code I write).

    I found that avoiding loops completely involved lots of head scratching to start with, but over time the benefits have made the investment worthwhile.
    http://shape-of-code.coding-guidelines.com/2012/07/27/my-no-loops-in-r-hair-shirt/

  6. The poor performance of loops in R and Python has led to some misleading intuitions for users of Stan, who assume it’s going to behave the same way.

    Stan is like C++ in that loops are compiled down to very efficient code. This means intuitions developed from interpreted languages (e.g., R, Python, BUGS, JAGS) will lead users to think they need to vectorize in these situations.

    Stan is like interpreted languages in that the automatic differentiation is more efficient when rolled into vectorized calls, especially if expressions can be reduced to vectorized log density functions or matrix arithmetic. The reason isn’t that loops are slow per se. Rather, looped expressions involving derivatives create a large expression graph as the log density is evaluated. The reverse pass of reverse-mode autodiff then walks over that graph from the result down to the parameters propagating the chain rule. That code is essentially interpreted code, albeit more efficient interpreted code than you find in R because of much tighter static data typing. Vectorized expressions further allow us to avoid recomputing. For example, an evaluation of normal_lpdf(y | mu, sigma) will only evaluate log(sigma) once, even if y is a vector. The other efficiency boost from pushing things into matrix arithmetic comes from a combination of compound matrix derivatives (Stan extracts underlying values and operates on those using Eigen’s efficient matrix lib, which is itself much faster than looping even for pure double code).

    A further twist is that there are also statistical efficiency concerns with sampling. There is some latitude in parameterizations, priors, etc., when defining a model, that can have a big impact on the statistical efficiency (mixing) of the sampler (or optimizer).

    There’s a chapter in the manual on efficiency that covers both types in more detail.

  7. One sidenote:
    Most code using *apply functions is no faster than loops and I believe some include loops internally.

    So that aspect of R code basically avoids loops for aesthetic reasons and avoiding side effects.

    • This is where I have seen this misconception expressed (as both a student and teacher of R).

      Many of the R classes I have been part of have stated at some point that lapply/sapply (or more generally the *apply family) are more efficient than for loops. Simple examples demonstrate that this is not the case (at least, not in any meaningful sense). This is an unfortunate pattern because there ARE very good reasons to avoid the usage of for loops when *apply will suffice. Specifically, lapply/sapply are substantially more readable and do not allow side effects.

      The correct statement one can make from an efficiency standpoint is to avoid loops (including the *apply family) when “Vectorized” code will do. This has more to do with the overhead in calling C/Fortran code (as all R code eventually does). http://www.noamross.net/blog/2014/4/16/vectorization-in-r–why.html seems to do a good job summarizing the content/ideas around Vectorization.

      • “Specifically, lapply/sapply […] do not allow side effects”

        Not sure what you mean by side effects, but I think `apply(iris, 1, print)` is a sufficient counter-example to demonstrate that apply does allow side effects. Even more clear is to consider what `apply(iris, 1, function (x) {assign(“foo”, x, envir = .GlobalEnv)})` would do.

    • Cugrad:

      I know just enough R to do what I need to do, and no more. Every once in awhile I learn some more R because I need to learn it to do something that I couldn’t do before.

      I guess that my attitude toward R is similar to many scientists’ attitude toward statistics!

Leave a Reply to John Myles White Cancel reply

Your email address will not be published. Required fields are marked *