Axes that extend below 0 or above 1: actually a bigger issue involving how statistical variables are stored on the computer

I was thinking more about axes that extend beyond the possible range of the data, and I realized that it’s not simply an issue of software defaults but something more important, and interesting, which is the way in which graphics objects are stored on the computer.

R (and its predecessor, S) is designed to be an environment for data analysis, and its graphics functions are focused on plotting data points. If you’re just plotting a bunch of points, with no other information, then it makes sense to extend the axes beyond the extremes of the data, so that all the points are visible. But then, if you want, you can specify limits to the graphing range (for example, in R, xlim=c(0,1), ylim=c(0,1)). The defaults for these limits are the range of the data.

What R doesn’t allow, though, are logical limits: the idea that the space of the underlying distribution is constrained. Some variables have no constraints, others are restricted to be nonnegative, others fall between 0 and 1, others are integers, and so forth. R (and, as far as I know, other graphics packages) just treats data as lists of numbers. You also see this problem with discrete variables: for example, when R makes a histogram of a variable that takes on the values 1, 2, 3, 4, 5, it doesn’t know to set up the bins at the correct places, instead setting up bins from 0 to 1, 1 to 2, 2 to 3, etc., sometimes making the histogram nearly impossible to read.
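A small illustration of the binning problem (my sketch, not from the original post): with integer-valued data, the default breaks need not line up with the values, while half-integer breaks give one correctly centered bar per value.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)   # integer-valued data

# Default: hist() picks "pretty" break points from the numeric range,
# so distinct values (e.g. 1 and 2) can land in the same bin.
hist(x, plot = FALSE)$breaks

# Workaround: put breaks at half-integers so each integer value
# gets its own bar.
h <- hist(x, breaks = seq(0.5, 5.5, by = 1), plot = FALSE)
h$counts   # one count per value 1..5
```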

What I think would be better is for every data object to have a “type” attached: the type could be integer, nonnegative integer, positive integer, continuous, nonnegative continuous, binary, discrete with bounded range, discrete with specified labels, unordered discrete, continuous between 0 and 1, etc. If the type is not specified (i.e., NULL), it could default to unconstrained continuous (thus reproducing what’s in R already). Graphics functions could then be free to use the type; for example, if a variable is constrained, one of the plotting options (perhaps the default, perhaps not) would be to have the constraints specify the plotting range.
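As a rough sketch of how this could look in R (the helper names here are invented for illustration, not from any package): attach the type as an attribute, and let graphics code consult it when choosing limits.

```r
# Hypothetical sketch of typed variables: attach a "vtype" attribute;
# a plotting helper maps the type to logical axis limits, falling
# back to the data range when no type is set (current R behavior).
set_type <- function(x, type) { attr(x, "vtype") <- type; x }

type_limits <- function(x) {
  type <- attr(x, "vtype")
  if (is.null(type)) return(range(x, na.rm = TRUE))  # untyped: range of the data
  switch(type,
         proportion  = c(0, 1),                      # continuous between 0 and 1
         nonnegative = c(0, max(x, na.rm = TRUE)),   # lower bound at 0
         range(x, na.rm = TRUE))                     # unconstrained default
}

p <- set_type(c(0.2, 0.5, 0.9), "proportion")
type_limits(p)   # c(0, 1): limits from the constraint, not the data range
```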

Lots of other benefits would flow from this, I think, and that’s why we’re doing this in “mi” and “autograph”. But the basic idea is not limited to any particular application; it’s a larger point that data are not just a bunch of numbers; they come with structure.

14 Comments

  1. John says:

    Lots of other benefits would flow from this, I think, and that's why we're doing this in "mi" and "autograph".

    What are “mi” and “autograph”?

  2. Bob O'H says:

    Of course, S already has integers defined, so you could write a hist.integer() function.

    I suspect putting ranges on variables like you suggest would be a huge programming headache, necessitating lots of checks that the range isn't exceeded, for example. I can almost see Martyn Plummer weeping at the mere suggestion.
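    Bob's hist.integer() suggestion could be sketched as follows (hypothetical code, not base R):

```r
# Sketch of an S3 method for integer vectors: hist() is a generic
# that dispatches on class, and an integer vector has implicit class
# "integer", so this method is picked up automatically.
hist.integer <- function(x, ...) {
  hist(as.numeric(x),   # convert so dispatch falls through to hist.default
       breaks = seq(min(x) - 0.5, max(x) + 0.5, by = 1), ...)
}

h <- hist(c(1L, 2L, 2L, 3L), plot = FALSE)   # dispatches to hist.integer
h$counts   # one bar per integer value
```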

  3. Andrew says:

    John: "mi" is multiple imputation; "autograph" is our function to automatically graph raw data (univariate histograms).

    Bob: The ranges are an option; the key idea is for variables to have types. I would think it could be an option, so that if no type is assigned, a function can just assume unrestricted continuous (which is the implicit default). We also have a function that tries to guess the type (for example, if a variable takes on only two values in the data, it's likely to be binary).
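    Such a type-guessing function might look something like this (a sketch; not the actual "mi"/"autograph" code):

```r
# Sketch: classify a variable from the values it takes in the data.
guess_type <- function(x) {
  u <- unique(x[!is.na(x)])
  if (length(u) == 2)            "binary"       # only two observed values
  else if (all(u == round(u)))   "integer"      # all whole numbers
  else if (all(u >= 0 & u <= 1)) "proportion"   # continuous in [0, 1]
  else                           "continuous"   # unconstrained fallback
}

guess_type(c(0, 1, 1, 0, 1))    # "binary"
guess_type(c(0.1, 0.5, 0.9))    # "proportion"
```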

  4. C. Zorn says:

    What I think would be better is for every data object to have a "type" attached: the type could be integer, nonnegative integer, positive integer, continuous, nonnegative continuous, binary, discrete with bounded range, discrete with specified labels, unordered discrete, continuous between 0 and 1, etc.

    In other words, like Stata does with variables…

    (This goes to the earlier discussion about databases and data management in R; I love R, but I do nearly all my data management in Stata precisely because it has useful commands such as -compress-.)

  5. Hadley says:

    One of my pet hates is density plots that extend outside the possible range of the data.

    But otherwise, sometimes you have to extend the range a little to avoid a collision between the plot annotations and plot data – I think it's the lesser of two evils.

  6. Karl says:

    You also see this problem with discrete variables; for example when R is making a histogram of a variable that takes on the values 1, 2, 3, 4, 5, it doesn't know to set up the bins at the correct places, instead setting up bins from 0 to 1, 1 to 2, 2 to 3, etc., making it nearly impossible to read sometimes.

    Sounds like you really want a barchart instead of a simple histogram. Compare:

    <pre>
    set.seed(1)                 # reproducible draws
    x <- rbinom(20, 5, 0.5)     # 20 draws from Binomial(5, 0.5)
    par(mfrow = c(1, 2))        # two panels side by side
    hist(x)                     # default bins misplace the bars
    barplot(table(x))           # one bar per observed value
    </pre>

  7. Kevin Wright says:

    In addition to 'type', I have thought that it would be useful for density functions to have a 'support' attribute to define the domain. It would be most useful for the beta distribution, which is where I once needed it.

  8. Andrew says:

    Hadley and Kevin: Exactly. The key is to distinguish between the "range of the data" and the "possible range of the data" (i.e., the "range of the sample space"). And then there's the range of the plot itself. plot() in R works with the range of the data, and the range of the plot, but not the range of the sample space.

    Karl: Sure, but I'd rather that hist() did it right the first time. I can and do write workaround scripts but ideally this wouldn't be needed so often.

  9. I think "types" are nice, but what R really needs is to allow variables to have both numeric values and value labels simultaneously–with R, you're stuck with either values or a (possibly ordered) factor, and neither is ideal if you're working with something like the NES; you either lose the values or lose the labels, since there's no easy way to keep both if you import data from Stata or SPSS. You could keep two copies of the data frame around, but that's just silly. recode in the car package makes things slightly more bearable, but the key word is slightly.

    If R had a sane implementation of data types and objects, this wouldn't be that hard, but R's object systems (S3 and S4) are just plain ridiculous.
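    The values-plus-labels idea could be sketched in plain R like this (hypothetical helpers, invented here; not the car package or a Stata importer):

```r
# Sketch: numeric codes plus value labels in one object, roughly the
# way Stata/SPSS-style software stores labelled variables.
labelled <- function(x, labels) { attr(x, "labels") <- labels; x }

party <- labelled(c(1, 2, 2, 3),
                  c("1" = "Dem", "2" = "Ind", "3" = "Rep"))

as.numeric(party)   # the codes are still there

# Look up the label for each code on demand.
labels_of <- function(x) unname(attr(x, "labels")[as.character(x)])
labels_of(party)    # "Dem" "Ind" "Ind" "Rep"
```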

  10. Hadley says:

    R does have a sane implementation of data types and objects, it's just different to what you are used to. It definitely has its warts (and is not particularly accommodating for many common tasks) but it does have a solid theoretical background.

    If you wanted to make a type that was a union of factors and continuous variables, it would certainly be possible (but not easy). The tricky part would be adapting the many statistical algorithms to correctly deal with the new data type.

  11. John says:

    what R really needs is to allow variables to have both numeric values and value labels simultaneously–with R, you're stuck with either values or a (possibly ordered) factor, and neither are ideal if you're working with something like the NES; you either lose the values or lose the labels

    unclass(myfactor) and as.numeric(myfactor) aren't good enough?

  12. Hadley says:

    Andrew: you mentioned in your comment that you had some code for determining variable types. Would you mind sharing it?

  13. Andrew says:

    Hadley,

    Versions of it are in "autograph" and "mi", two packages that we are close to releasing.

  14. Hadley says:

    Andrew: can I suggest that you think about alternative names for your packages? They are going to be impossible to find with search engines, which makes it hard to tell whether people are using your packages and what problems they are running into. With ggplot2, for example, I can easily track the entire web to see who is using it and what problems they are reporting.

    Wrt determining variable type, it seems to me like this functionality would be sufficiently useful (and general) to deserve its own package.