R on the cloud

Just as scientists should never really have to think much about statistics, I feel that, in an ideal world, statisticians would never have to worry about computing. In the real world, though, we have to spend a lot of time building our own tools.

It would be great if we could routinely run R with speed and memory limitations being less of a concern. One suggestion that sometimes arises is to run things on “the cloud.” So I was interested upon receiving this email from Niklas Frassa:

Time intensive calculations, as known from life science, finance or business intelligence, can now be processed at a whole new level of speed – in the Cloud. cloudnumbers.com provides an intuitive platform that enables everyone to run time consuming calculations on clusters with more than 1000 CPUs.

So far, High Performance Computing has only been accessible for large corporations and universities leading to significant competitive disadvantages for small and medium-sized companies. With cloudnumbers.com we finally make High Performance Computing accessible to everyone.

cloudnumbers.com’s scalable server environment results in minimal idle times – and great cost savings, as customers only pay for what they actually consume. Furthermore, users no longer need a degree in computer science to be able to access the computing power of supercomputers.

I don’t know anything more about this. Feel free to comment, either on this or any better options.

P.S. Based on the comments below, this cloudnumbers.com thing doesn’t sound so wonderful, at least for people like me who are already using RStudio.

P.P.S. See response from the company in comments below.

12 Comments

  1. DavidC says:

    I frequently use Amazon EC2 for this. Haven't tried other options.

  2. Stuart says:

    There's crdata, too, although I haven't actually used it.

  3. Janne says:

    A few comments as someone who is using supercomputing resources as part of his work:

    * On a per-project level you already pay for what you use (you get an allocation, actually, and can use CPU hours and storage up to that limit).

    * Existing machines – at least newer, larger, well-administered ones – have employees specifically tasked with making sure popular software runs properly. Some facilities even have people whose job it is to help their users port custom software to the computer. You certainly don't "need to be a computer scientist" to use them.

    * You would be surprised how often these machines are open for use by small companies and even private citizens. They are often funded in part or whole by the public, and generally one of their tasks is to provide computing resources for private companies and people. It may actually be quite easy and inexpensive to get a low-priority account for running the occasional large-scale calculation at some center.

    I'm not saying this new service is a bad idea – I think it may be a very good one – but it would be a mistake to take their advertising copy at face value and assume existing resources are too difficult to use and unavailable for small companies.

  4. Kyle says:

    The major issue I have with this is that any calculation you do in R can be done much faster in another language, assuming the libraries exist for that language. Even when R calls FORTRAN from the 1960s, it does so much more slowly than other high-level languages which can link to old compiled code (there are a number of blog posts I've seen demonstrating this). We probably should spend time translating the R libraries to a language with decent performance (and things like lexical scope).

    • Steve L says:

To follow up on my earlier request for references showing that calling external FORTRAN/C functions from R is slow, I'll just leave this recent post from Simon Urbanek on the R-devel mailing list as evidence to the contrary, in case anybody stumbles on this from the internet:
      http://thread.gmane.org/gmane.comp.lang.r.devel/2

  5. Madeleine says:

    I've also had a good experience running R on EC2. The multicore package makes it easy to parallelize a job on a single machine. Parallelizing across multiple machines requires more work, and I don't see anything on cloudnumbers.com indicating that they make that any easier than EC2 does.
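    For what it's worth, here is a minimal sketch of the single-machine parallelism Madeleine mentions. It assumes the `parallel` package (the `multicore` package she refers to was later merged into base R as `parallel`); the computation inside the loop is just a stand-in for whatever expensive job you actually have:

    ```r
    library(parallel)  # 'multicore' functionality now lives in base R's 'parallel'

    # Run 100 independent replicates across 4 cores.
    # mclapply forks the R process, so this works on Linux/Mac
    # (e.g. a typical EC2 instance) but not on Windows.
    results <- mclapply(1:100, function(i) {
      mean(rnorm(1e5))  # stand-in for an expensive computation
    }, mc.cores = 4)

    summary(unlist(results))
    ```

    Parallelizing *across* machines is indeed the harder part: `mclapply` only uses the cores on one box, and going beyond that means something like `makeCluster` over SSH, or MPI-based packages.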

  6. David Harris says:

    Cloudnumbers costs 50% more than EC2 for a roughly comparable machine. They justify that markup with this tweet: https://twitter.com/cloudnumberscom/status/875709

    For me, it's worth it to learn the ropes with EC2 and RStudio.

  7. dearl says:

    I think your stated goal of not having to worry much about computing and the proposed solution of "the cloud" are contradictory. What is "the cloud"? If you cannot answer that question, then "the cloud" is of no use to you. On the other hand, if you can answer it, then you've been worrying about computing, and "the cloud" has not solved your wish-for-less-worry problem!

    I think we're doomed. If you want to do applied science today (i.e. work with large datasets) you have to worry, to some degree, about computing. (And statistics too, you computer scientists / biologists!).

    (I'm a bioinformatics grad student)

  8. Yes, we are currently more expensive than AWS, and you can set up a compute cluster on AWS EC2 by yourself.

    We are a recently founded company. We are very much interested in your constructive feedback to improve our service, and we are working in many directions to fulfill users' requirements.

    Our three main features are:

    * Security: cloudnumbers.com provides an additional layer of security. For example, the communication between your calculation nodes is encrypted and all your private data in your workspace is encrypted.

    * Databases: The latest versions of public databases are hosted on our machines and can be mounted in seconds (even if bigger than 10 GB) into your cloud computing cluster. And you do not have to pay for that (no traffic charges!).

    * Applications: We provide you with a list of pre-configured applications which work out of the box. All the linking to libraries – e.g. MPI – is configured, and example scripts are provided to customize your code. If your R application is too slow, feel free to reimplement it in C++ and execute it, e.g., on a single-threaded high-memory machine at cloudnumbers.com.

    If you have any more questions or application wishes, please do not hesitate to contact us directly. Take the chance to have an impact on the development of a new cloud computing platform.

    Cloudnumbers.com
    Senior Community Manager
    markus.schmidberger@cloudnumbers.com

  9. Steve L says:

    @Kyle — the things you're saying … I've never heard before.

    As far as interpreted languages go, yes, R doesn't have the fastest interpreter speed, but can you provide some references to back up your statement that R is "so much slower at calling/linking to external libraries" (paraphrased)?

    This seems highly implausible to me.

  10. S.E. Lazic says:

    “Just as scientists should never really have to think much about statistics…”

    It’s a shame that too many scientists think this way. Statisticians should encourage more statistical thinking, not less!