Software is as software does

We had a recent discussion about statistics packages where people talked about the structure and capabilities of different computer languages. One thing I wanted to add to this discussion is some sociology. To me, a statistics package is not just its code, it’s also its community, it’s what people do with it.

R, for example, is nothing special for graphics (again, I think in retrospect my graphs would be better if I’d been making them in Fortran all these years); what makes R graphics work so well is that there’s a clear path from the numbers to the graphs, and there’s a tradition in R of postprocessing.

In comparison, consider SAS. I’ve never directly used SAS, but whenever I’ve seen it used, whether by people working for me or with me, or just by people down the hall who left SAS output sitting in the printer, in all these cases there’s no postprocessing. It doesn’t look interactive at all. The user runs some procedure and then there are pages and pages and pages of output. The point about R graphics is not that they’re so great; it’s that R users such as myself graph what we want. In fact, lots of default R graphics are horrible. (Try applying the default plot() function to the output of a linear model if you want to see some yuck.) I think SAS is horrible, not out of some inherent sense of its structure as a computer program but because of what I see people do with it. In contrast, I see people using Stata creatively and flexibly, so I have much warmer feelings toward Stata (even though I don’t actually know how to use it myself).
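
To make that concrete, here is a rough sketch in R using the built-in cars data (purely illustrative, not anything from my own work; the point is the postprocessing, not the model):

    fit <- lm(dist ~ speed, data = cars)

    # The default diagnostics: four panels of yuck, fine for checking a fit
    # but not something you would ever show anyone
    par(mfrow = c(2, 2))
    plot(fit)

    # Postprocessing: pull out the numbers and draw the graph you actually want
    par(mfrow = c(1, 1))
    plot(cars$speed, cars$dist, pch = 19, col = "grey40",
         xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
    abline(fit, lwd = 2)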

The internals of a program do have something to do with how it’s used, I’m sure. I assume that Excel really is crappy; it’s not just that people use it to make crappy graphs. But as a user, I don’t really care about that; I just know to avoid it.

29 thoughts on “Software is as software does”

  1. Stata is more powerful for graphics than most people think. ggplot2, however, changed the way I think about data and presentation.

    I’d expected something a little different from this post. From the trenches of grad school, it seems obvious to me at least that R+LaTeX (and now +Sweave or +Markdown) are ways for folks to signal tech skills and sophistication, occasionally without the bother of having to be sophisticated about analysis or theory. (Gary King’s suggestion in passing in one of his papers advising grad students that reviewers are more charitable to papers written in Computer Modern may have done more to spur the adoption of LaTeX than any inherent advantage of the platform.)

  2. Perhaps that reflects the user’s desires rather than the program. SAS has both a customized output capability (ODS) and a graphing language (GTL). My group uses ODS output regularly to generate reports via LaTeX and Excel files for distribution. The reports are laid out with the typical to/from/subject/summary sections, section headers, etc. Likewise, one can build custom plots if one wishes, using either annotate commands or the newer Graph Template Language.

    R has an advantage in a more consistent language throughout, rather than the hodgepodge of languages SAS seems to employ.

  3. I would agree with PM that ggplot2 changes R graphics for the better. I used to make plots using R base graphics and the lattice package but found the tens (hundreds) of options difficult to remember, even after years of use!

    The ggplot2 package takes a more “object-oriented” approach to constructing statistical graphics, so I initially found it harder to use than the more functional base and lattice packages, but once I got the hang of it I could create good-looking plots without writing lots of custom plotting code. The documentation for the package is great, but I recommend Hadley Wickham’s book [ggplot2: Elegant Graphics for Data Analysis (Use R!)] to understand the framework.
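
    As a rough illustration of the layered style (just the built-in mtcars data, nothing fancy):

        library(ggplot2)

        # Data + aesthetic mappings + layers, instead of one monolithic call with many options
        ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
          geom_point(size = 3) +
          geom_smooth(method = "lm", se = FALSE) +
          labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")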

    Excel is great when used appropriately! I know of few other software programs that let you prototype layouts and basic plots as quickly. It’s also familiar to almost everyone. I often use it with R through the ‘xlsx’ package, which makes it easy to write an R data frame to anywhere in an Excel workbook.
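
    Something along these lines (written from memory, so treat the exact function names and arguments as a sketch rather than gospel):

        # function names as I remember them from the 'xlsx' package (requires Java)
        library(xlsx)

        df <- data.frame(group = c("a", "b", "c"), mean = c(1.2, 3.4, 5.6))

        # Simple case: one data frame, one sheet
        write.xlsx(df, file = "summary.xlsx", sheetName = "results", row.names = FALSE)

        # Or drop a data frame at an arbitrary position in a workbook
        wb    <- createWorkbook()
        sheet <- createSheet(wb, sheetName = "report")
        addDataFrame(df, sheet, startRow = 5, startColumn = 2, row.names = FALSE)
        saveWorkbook(wb, "report.xlsx")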

    • I’d agree that one can draw a good graph in Excel; it’s just that most people never get to that point. (The same is probably true of SAS, which I’ve used for six years or so and in which I have drawn a grand total of about three non-diagnostic graphs.)

      But the main advantage of using a proper programme (i.e. one that you *can* actually programme) is quick customisation: the ease of tweaking elements or adding in an extra column of data. This really makes a difference if you’re working on multiple figures at once.

      For example, last year I had to redraw 45 graphs for a report on hospital admissions. These had previously been graphed in Excel (by someone else), and there were errors in some of the confidence intervals (which had also been calculated in Excel). Since these needed fixing, I decided to redo the calculations in R and then redraw the graphs with ggplot2.

      It took some time to make the initial templates (there were about eight or nine different graph layouts, really), but then customising the individual templates was relatively quick, and the figures look (as you’d expect) much more professional and consistent in style (e.g. consistent legend placement). Having said that, I did lose about two weeks’ worth of evening relaxation time! TBH, I could have drawn the figures again in Excel in about the same time, but they still would have been cheap-looking.
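
      The template approach was basically one function per layout, something like this (the column names here are invented for illustration, not the ones from the actual report):

        library(ggplot2)

        # Hypothetical template: assumes a data frame with columns month, rate, lower, upper
        admissions_plot <- function(d, title) {
          ggplot(d, aes(x = month, y = rate)) +
            geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
            geom_line() +
            labs(title = title, x = NULL, y = "Admissions per 1,000") +
            theme_bw() +
            theme(legend.position = "bottom")
        }

        # Fixing an interval or tweaking the legend then means editing one function,
        # and every figure built from that template picks up the change.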

  4. It’s a losing battle for a commercial vendor: How many programmers and statisticians can SAS employ versus the community that codes for R? The R ecosystem will be wider and more up to date.

  5. It is my understanding that the free version of R is not good software for handling large databases (over 100K observations). Is this true?

    • It depends… We use it at the company I work for. We integrate R with SQL, and in general we can handle up to 10 million observations quite well on a single machine. Most of our time is spent cleaning and transforming data. There is the data.table package if the data fit in memory, and SQL if the data don’t fit in memory.
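
      For example, an in-memory sketch with data.table (simulated data, just to show that the scale itself isn’t the problem):

        library(data.table)

        # Ten million simulated rows fit comfortably in memory on an ordinary machine
        n  <- 1e7
        dt <- data.table(id    = sample(1e5, n, replace = TRUE),
                         value = rnorm(n))

        # Grouped summaries at this size run quickly
        summary_by_id <- dt[, .(mean_value = mean(value), n_obs = .N), by = id]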

      However, if you want to use all 10 million observations with some algorithms (some kind of cluster analysis, maybe some MCMC) you may run out of memory, and this will be a problem. But then you can move to a cluster or a server. Right now we use a server with RStudio, and that allows us to have more memory to handle large data. Again, if you have, say, ten billion cases, then you have to move to Hadoop or something like it. But, again, I don’t think Stata, SPSS, or even SAS will handle 10 billion cases well. With Big Data you have to use Hadoop. And though I’ve never used Hadoop with R, I guess there are packages that allow R to be integrated with Hadoop.

      Another possibility is to code in C (or Python or C++ or whatever you like) and call that code from R.
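
      For instance, with Rcpp (just one of several ways to do it):

        library(Rcpp)

        # Compile a small C++ function and call it from R like any other function
        cppFunction('
        double sum_cpp(NumericVector x) {
          double total = 0;
          for (int i = 0; i < x.size(); i++) total += x[i];
          return total;
        }')

        sum_cpp(rnorm(10))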

      All in all, free R is fine, and it’s no surprise that people at Google and Facebook use R.

    • It depends very strongly on what you want to do with those 100K observations. Whenever you’re working with more than a few thousand observations, you should probably be accessing them through some SQL database, including the free file-based SQLite if you don’t want to set up a SQL server. Once you’re doing that, you can easily do a lot of basic filtering, selection, and data reduction in SQL.
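
      A rough sketch of that workflow with DBI and RSQLite (the database file, table, and column names here are invented):

        library(DBI)
        library(RSQLite)

        # File-based database: no server to administer
        con <- dbConnect(RSQLite::SQLite(), "observations.sqlite")

        # Push the filtering and aggregation into SQL; pull only the reduced result into R
        monthly <- dbGetQuery(con, "
          SELECT strftime('%Y-%m', obs_date) AS month,
                 COUNT(*)                    AS n,
                 AVG(value)                  AS mean_value
          FROM   observations
          WHERE  value IS NOT NULL
          GROUP  BY month")

        dbDisconnect(con)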

      If you really need to run a nonlinear maximum likelihood fit of a complex multi-parameter model to millions of observations, you’ll have some difficulty. Sticking yourself on a machine with, say, 128 gigabytes of RAM would go a long way towards enabling that sort of thing, and can be done these days for not that much money. Sticking your problem onto a cluster would be another way, and that can be done with R.
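
      One pattern for spreading that kind of work out (here just local cores rather than a real cluster, and a made-up bootstrap example rather than a genuine maximum likelihood problem):

        library(parallel)

        cl <- makeCluster(8)            # 8 worker processes; a cluster scheduler scales this idea up
        clusterSetRNGStream(cl, 123)    # reproducible parallel random numbers

        fits <- parLapply(cl, 1:200, function(i) {
          d <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
          coef(lm(mpg ~ wt + hp, data = d))
        })

        stopCluster(cl)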

      Big datasets are way less of a problem for R now than they were say in 2000.

    • On the graphing theme, it is always possible to use R to graph summary statistics that have been produced by another stats package (e.g. SAS, if you have millions of observations). But for me this was a case of using the tools I knew I could use to produce the results I wanted (SAS for large-scale database handling, R and ggplot2 for graphing).
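
      Concretely, that can be as simple as this (the file and column names are invented):

        library(ggplot2)

        # SAS (or whatever handled the raw millions of rows) exports a small file of
        # summary statistics; R only ever sees a few hundred rows
        summ <- read.csv("admission_summaries.csv")   # assumed columns: group, est, lower, upper

        ggplot(summ, aes(x = group, y = est)) +
          geom_pointrange(aes(ymin = lower, ymax = upper)) +
          coord_flip() +
          labs(x = NULL, y = "Estimate (95% interval)")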

    • 1. One person’s large is another person’s small. 100K isn’t big even for R. Now your mileage may vary depending on which R package you want to apply to the data, but RStan will work through R with this much data (see the sketch at the end of this comment).

      2. Anything integrated with R at the C level and then redistributed (like Revolution R Enterprise) has to be distributed with source due to the terms of R’s licensing under the GNU General Public License (GPL).

      For instance, you can get the sources for Revolution R Enterprise here:

      http://www.revolutionanalytics.com/downloads/gpl-sources.php

      They don’t have to make it easy to find them, but that’s what Google’s for. And they also don’t have to make them easy to install. But they do have to distribute them and other people are free to write installers on top of them.
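
      To make the RStan point above concrete, a minimal sketch (simulated data and a toy model, nothing more):

        library(rstan)

        code <- "
        data {
          int<lower=0> N;
          vector[N] y;
        }
        parameters {
          real mu;
          real<lower=0> sigma;
        }
        model {
          y ~ normal(mu, sigma);
        }
        "

        y   <- rnorm(1e5, mean = 2, sd = 3)    # 100K observations
        fit <- stan(model_code = code, data = list(N = length(y), y = y),
                    chains = 4, iter = 1000)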

      • On closer inspection, Revolution R’s only releasing their modified versions of R itself, not their full “Enterprise” versions. Their page says, “We provide open-source R bundled with proprietary modules from Revolution Analytics that provide additional functionality for our users.” I wonder what they mean by “bundled with”?

  6. None of this should be a surprise, given the history. It is another example of a general rule in computing: any software system that gains commercial success retains characteristics from its period of origin, and an installed base that makes it harder to move than things that started later, especially if those later things spent more years in research environments that did not need to provide enterprise-grade support.

    1) SAS (http://en.wikipedia.org/wiki/SAS_Institute) started ~1966, i.e., mainframes, not particularly interactive. Graphics outputs: line printers, Calcomp plotters.
    Over time, SAS was developed to run on multiple platforms, with different floating point, and still be required to give exactly the same answers, for the enterprises and governments that required that. Unless you’ve been involved with such efforts, it may not be obvious how much work and QA time that takes. (At SGI, during the 1990s, I used to manage an alliance that included SAS folks, visited Cary, and met Jim Goodnight once; he was the CEO, but still used to write code to keep his hand in.)

    2) S started at Bell Labs in the mid-1970s, originally on an old MH GCOS system, but moved to UNIX as the 32-bit minis spread in the late 1970s. I.e., it really got rolling in interactive environments. Graphics output devices were more available, the GUI work at Xerox PARC was in the mid-1970s, personal computers were appearing, as were workstations (Apollo, Sun) and Bell Labs Blit terminals (1982). See the short video. (I had one of these, and got my lab to have the biggest concentration of them outside Computing Research. Certainly, by 1980/1981, anyone with any vision *knew* that multi-window bitmap displays were going to become pervasive, i.e., graphics accessible to a much wider audience, and was trying to figure out how to get volume and lower costs.) Of course, BTL was not selling commercial software, and inside, folks like Chambers could build software and it would spread, but Research didn’t sell support contracts. :-)

    • Exactly. If you had to submit your job on a mainframe and then wait for hours for it to return, you would want to have everything you could possibly need in your output. If you didn’t get a bunch of diagnostic output right then, you were less likely to ask for it later. That’s the environment SAS, SPSS, OSIRIS, BMD and possibly others I’m forgetting grew up in, of which only SAS and SPSS remain.

      Oddly enough, the fact that mainframe memory partitions were limited to 120k or so meant the algorithms used had to be economical of memory — which is an advantage now with “big data” applications relative to programs that assume everything is in memory.

  7. Take a look at JMP, folks. It’s a whole new ballgame! While it’s a SAS product, it’s nothing like SAS, although you can use them together very effectively. It’s also easily customizable: you set your preferences for what kinds of output you want generated automatically, and then choose additional results or hide parts of the output by toggling a menu of options. You can interact with the results in real time, changing fonts, colors, line thicknesses, markers, etc. by clicking on the items; it makes your data into an adjustable gadget. You understand your data at lightning speed.

  8. I agree with Anonymous. I think JMP is a hugely underrated SAS product. It provides visual analysis and interaction, plus the underlying statistical output.

  9. +1 for JMP. It’s ggobi+Mondrian on steroids, it has a nice lispy scripting language (more lispy than R), and it has a nice GUI library to boot.

  10. You can do customised printouts and graphs with SAS. You can basically get any number generated by SAS written to a dataset and then manipulate that dataset into a graph or a printout. It is a lot of really nitty-gritty work, since you need to specify everything and really know how to code. So a lazy person or someone in a hurry either uses the SAS default graphs or plots it with something else. I think you can generate any plot, looking any way you want, with SAS.

  11. I agree that community is an important part of using software. That is why it is important to connect with a GOOD community that is using the software in intelligent ways. SAS users have formed communities since 1976, but a few years ago SAS introduced various virtual communities (https://communities.sas.com/community/support-communities) to serve those people who do not attend user-group meetings. About the same time, I started writing my statistical programming blog to highlight some of the best practices in SAS programming that I’ve learned while interacting with SAS users and colleagues. When you see colleagues who are using SAS in a “horrible” way, encourage them to modernize their skills by participating in a real or virtual community that understands that sometimes less is more when it comes to voluminous statistical output and graphics.

  12. If I were you I wouldn’t bash SAS quite so quickly: when you google the Gelman-Rubin statistic, the first result that comes up is from the SAS website. You ought to thank them. With that said, I am also not a fan of SAS.
