After reading Rachel and Cathy’s book, I wrote that “Statistics is the least important part of data science . . . I think it would be fair to consider statistics as a subset of data science. . . . it’s not the most important part of data science, or even close.”
But then I received “Data Science for Business,” by Foster Provost and Tom Fawcett, in the mail. I might not have opened the book at all (as I’m hardly in the target audience) but for seeing a blurb by Chris Volinsky, a statistician whom I respect a lot.
So I flipped through the book and it indeed looked pretty good. It moves slowly but that’s appropriate for an intro book. But what surprised me, given the book’s title and our recent discussion on the nature of data science, was that the book was 100% statistics! It had some math (for example, definitions of various distance measures), some simple algebra, some conceptual graphs such as ROC curves, and some tables and graphs of low-dimensional data summaries, but almost no programming, data cleaning, etc. No code, no instructions on how to scrape or munge or whatever. There were some passages on data preprocessing and other nitty-gritty issues, but not many.
I’m not saying this as a criticism: it’s typical of statistics texts to present data that has already been cleaned or, when data cleaning or data gathering is discussed, to present it in bare-bones form. Lots of detail about statistical inference, only a little bit of detail about the data. That’s the way things go. Indeed, that’s how it goes in my own books. And perhaps this makes sense: data cleaning, etc., is idiosyncratic, while statistical principles are more generalizable.
In any case, now I understand better why people say that “data science” is just another word for “statistics” (as applied to a particular sort of problem). If data science is defined as Rachel Schutt and Cathy O’Neil define it, then, no, it’s a lot more than statistics; indeed, statistics is only a small part. But using Foster Provost and Tom Fawcett’s implicit definition, data science is just statistics, albeit reframed and refocused in a way that is more useful for certain online settings.
Again, this is not meant as a criticism of Provost and Fawcett’s book, or a criticism of data science more generally. Statistics is great, and it’s worth putting in the effort to present it in different ways to different audiences. I’m no fan of standard approaches to introductory statistics and I have no problem with the fundamental ideas being presented in a “data science” context.