
R fixed its default histogram bin width!

I remember hist() in R as having horrible defaults, with the histogram bars way too wide. See this discussion:

A key benefit of a histogram is that, as a plot of raw data, it contains the seeds of its own error assessment. Or, to put it another way, the jaggedness of a slightly undersmoothed histogram performs a useful service by visually indicating sampling variability. That’s why, if you look at the histograms in my books and published articles, I just about always use lots of bins.

But somewhere along the way someone fixed it. R’s histogram function now has a reasonable default, with lots of bins. (Just go into R and type hist(rnorm(100)) and you’ll see.)
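If you'd rather count the bins than eyeball them, here is a minimal sketch. The seed and the variable names are my own choices for illustration; the only thing I'm relying on is that hist() uses breaks = "Sturges" unless told otherwise and returns its breakpoints when called with plot = FALSE.

```r
# Sketch: inspect the bins hist() picks by default, without drawing anything.
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(100)

h <- hist(x, plot = FALSE)       # default rule is breaks = "Sturges"
n_bins <- length(h$breaks) - 1   # number of bars that would be drawn

# Sturges' rule suggests ceiling(log2(n) + 1) classes. hist() treats this
# as a suggestion and rounds the breakpoints to "pretty" values, so the
# number of bars actually used can differ from it.
suggested <- nclass.Sturges(x)

cat("bins used:", n_bins, "| Sturges suggestion:", suggested, "\n")
```

Running the same lines in two R versions is a quick way to see whether the default really changed between them.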

I’m so happy!

P.S. When searching for my old post on histograms, I found this remark, characterizing the following bar graph:

[Image: Screen Shot 2016-01-24 at 4.17.21 PM]

This graph isn’t horrible—with care, you can pull the numbers off it—but it’s not set up to allow much discovery, either. This kind of graph is a little bit like a car without an engine: you can push it along and it will go where you want, but it won’t take you anywhere on its own.

5 Comments

  1. Carlos Ungil says:

    By “lots of bins” do you mean 8-12 bins in your example?

  2. Thomas Lumley says:

    I’m getting the same bins for hist(sin(1:100)) in 3.5.2 as in 3.0.1, and I can’t see anything in the NEWS file, so, um, maybe?

  3. gg says:

    There’s a rumor that, for ggplot2, Hadley chose the default histogram bins as 30 because this was the worst possible default width he’d encountered. The idea was that users would be enticed to set the width to something they found reasonable.

  4. I’ll bet $5 that Andrew “fixed” this in his own .Rprofile and forgot about it, or that some package he regularly uses replaces hist with a better version of hist.

  5. Dave says:

    I often like adding `breaks = "fd"` to my hist() calls because it generally increases the number of bins as you get more data.

    For example, hist(rgamma(1000, 0.25)) doesn’t really convey how skewed the distribution is until you increase the number of bins.
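The comparison suggested in the last comment can be sketched roughly as follows. The seed, the variable names, and the sample sizes in the loop are arbitrary choices for illustration; "FD" selects the Freedman-Diaconis rule, and hist() matches the rule name case-insensitively, so "fd" works too.

```r
# Sketch: default (Sturges) bins versus Freedman-Diaconis bins on skewed data.
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rgamma(1000, 0.25)          # strongly right-skewed sample

h_default <- hist(x, plot = FALSE)                # breaks = "Sturges"
h_fd      <- hist(x, breaks = "FD", plot = FALSE) # Freedman-Diaconis

bins <- c(default = length(h_default$counts),
          fd      = length(h_fd$counts))
print(bins)

# How the two suggested class counts move as the sample grows.
for (n in c(100, 1000, 10000)) {
  y <- rgamma(n, 0.25)
  cat("n =", n,
      "| Sturges:", nclass.Sturges(y),
      "| FD:", nclass.FD(y), "\n")
}
```

Drop plot = FALSE from either call to draw the histograms and compare how much of the skew each one shows.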
