Bob pointed me to this article by Ashlee Vance following up on the recent newspaper article. Bob writes that “Pfizer’s bought into the FUD (fear, uncertainty and doubt) argument that marketers employ to discourage the use of open-source or other free software.” From the Vance article:
Pfizer was a prominent R user mentioned in the story. The company relies on R for its nonclinical drug studies and has shied away from using the technology for clinical research that will ultimately be presented to regulators. For such work, Pfizer instead turns to software from SAS Institute, which brings in more than $2 billion a year in revenue from data analytics software and services.
Were Pfizer to use R in clinical studies, it would run the risk of seeing its research questioned or even rejected by regulators doubting the veracity of results based on what they view as an unknown quantity.
“It’s very hard to displace the industry standard in those types of cases,” said Max Kuhn, associate director of nonclinical statistics at Pfizer.
I’m actually working with Neal Thomas and other people at Pfizer on an improved and more trustworthy OpenBUGS implementation that they can use for their research. It’s worth it for Pfizer to put resources into an open-source project: open source can mean more beta testing and more reliability.
At a technical level, Sam Cook and I are working with them on implementing unit testing (the so-called “self-cleaning oven”; see item 5 here) for Bayesian modeling, following our earlier work in this area.
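The flavor of that kind of unit testing can be sketched with a toy example (a hedged illustration of the simulation-based checking idea from our earlier work, not the actual Pfizer or OpenBUGS test code; the model and function names here are made up): draw a parameter from the prior, simulate data given it, and check that the true value lands in its posterior distribution at the expected rates.

```python
# Toy sketch of simulation-based software checking for Bayesian code.
# Model (illustrative): theta ~ N(0, 1), y_i | theta ~ N(theta, 1).
# The conjugate posterior is N(sum(y)/(n+1), 1/(n+1)).
import math
import random
import statistics

def posterior_cdf_at_truth(n_obs=20, rng=random):
    """Draw theta from the prior, simulate data, and return the
    posterior CDF evaluated at the true theta."""
    theta = rng.gauss(0.0, 1.0)
    y = [rng.gauss(theta, 1.0) for _ in range(n_obs)]
    post_mean = sum(y) / (n_obs + 1)
    post_sd = math.sqrt(1.0 / (n_obs + 1))
    z = (theta - post_mean) / post_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF

def run_check(n_reps=2000, seed=1):
    """If the posterior code is correct, these quantiles should look
    Uniform(0, 1); a strong deviation signals a bug."""
    rng = random.Random(seed)
    qs = [posterior_cdf_at_truth(rng=rng) for _ in range(n_reps)]
    return statistics.mean(qs)
```

The appeal of this style of check is that it exercises the whole inference pipeline at once: a bug anywhere in the prior, the data simulation, or the posterior computation shows up as non-uniform quantiles.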
Finally, Vance concludes with a discussion of the size of R’s user community. I imagine this is tricky to define; for example, do you count students?
I'm a huge fan of R, recommend it all the time, and use it all the time. I also think it is ridiculous for people to claim that just because a product costs a lot of money, it will be more accurate or less error-prone than an open source product.
All of that said, it is still true that several years ago I downloaded and used a new version of R in which ordinary linear regression (the lm() command) was "broken": it gave wrong answers without generating an error message. This caused me a lot of hassle, confusion, and frustration. Now I always try to stay a couple of versions "behind" in R.
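A bug like that is exactly what a simple regression test against a known exact answer can catch. Here is a hedged sketch of the idea (a hypothetical example, not R's actual test suite for lm()): fit a line to data generated exactly from y = 1 + 2x and check that the fit recovers those coefficients.

```python
# Illustrative sanity check of the sort that can catch a broken
# least-squares routine.
def ols_fit(x, y):
    """Simple least squares for y = a + b*x via the closed-form
    normal equations."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def test_ols_recovers_exact_line():
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    y = [1.0 + 2.0 * xi for xi in x]  # exactly y = 1 + 2x
    a, b = ols_fit(x, y)
    assert abs(a - 1.0) < 1e-10 and abs(b - 2.0) < 1e-10
```

Checks like this, run automatically on every release, are cheap compared with the hassle of discovering silent wrong answers downstream.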
Are there significant advantages R has over Python for doing statistics, besides the fact that R has many more available statistical libraries? Python strikes me as a more user-friendly language.
The R Foundation has prepared a guideline document on the use of R in regulated clinical trial environments.
But Pfizer's decision is understandable. The SAS licence fees they pay are a kind of insurance against having their data analysis questioned. With millions of dollars of revenue at stake, what would you do?
The reasons that keep me from using Python for statistics are:
1. R's vast statistical library;
2. matrix computations are a bit easier to code in R than in SciPy;
3. graphics-generating capabilities (base, lattice, grid, ggplot);
4. LaTeX-generating capabilities (Sweave, xtable).
In my opinion, Python is much better designed than R: it has superior development tools (better IDEs, profiling, debugging, unit testing, etc.) and a vast module library that covers many facets of application development.
It would not be hard to extend Python in R's direction, e.g., by writing interfaces to the C/C++/Fortran code behind R packages. But it would be time-consuming, and I don't see the point.
You may be interested in RPy, which allows you to write in Python and call out to R for the statistical services you need.
gappy: I would recommend that you look into http://rpy.sourceforge.net/. I also work quite a bit in both R and Python, and I often find RPy useful for getting some additional statistical functionality into Python.
We used RPy with Andrew on one of our projects: the file parsing and graphical user interface would have been horribly messy in R, but were easy in Python. While this approach is fine for tools you use in-house, it makes any kind of installation very difficult.
Python is a good language for string, dictionary, and list processing, and with Numeric it becomes excellent for vectors and matrices (on a par with MATLAB), but R has all the libraries and the community.
Speaking of community, there is also a useful RPy mailing list.
As to whether using SAS constitutes insurance against errors, there's the literal interpretation: does anyone know if SAS (or anyone else, for that matter) offers indemnification to its users against calculation errors? It must cost a fortune to obtain that kind of indemnification if they offer it, given the damages errors could cause. Error indemnification is not part of SAS's academic site licenses.
The question in my mind is whether having the source makes a product more reliable, or at least easier to verify. It's easy to prove that software doesn't work, by finding a test case where it goes wrong or a bug in the code. It's pretty much impossible to verify that software does work, even if you have the source.
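That asymmetry can be made concrete with a toy example (purely illustrative): a single counterexample is enough to prove a routine wrong, while any number of passing cases never proves it right.

```python
# A deliberately buggy sample-variance function, of the kind a
# single counterexample can falsify.
def buggy_variance(xs):
    """Intended to be the sample variance, but divides by n
    instead of n - 1 -- a classic off-by-one."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n  # bug: should be n - 1

# Counterexample: the sample variance of [0, 2] is 2, but the
# buggy version returns 1.0 -- the bug is proven with one input.
counterexample = buggy_variance([0.0, 2.0])  # 1.0, not 2.0
```

Note that buggy_variance([x]) would also pass many plausible spot checks (e.g., it is correct for constant data), which is exactly why passing tests can never establish correctness.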