More on “data science” and “statistics”

After reading Rachel and Cathy’s book, I wrote that “Statistics is the least important part of data science . . . I think it would be fair to consider statistics as a subset of data science. . . . it’s not the most important part of data science, or even close.”

But then I received “Data Science for Business,” by Foster Provost and Tom Fawcett, in the mail. I might not have opened the book at all (as I’m hardly in the target audience) but for seeing a blurb by Chris Volinsky, a statistician whom I respect a lot.

So I flipped through the book and it indeed looked pretty good. It moves slowly but that’s appropriate for an intro book. But what surprised me, given the book’s title and our recent discussion on the nature of data science, was that the book was 100% statistics! It had some math (for example, definitions of various distance measures), some simple algebra, some conceptual graphs such as ROC curves, some tables and graphs of low-dimensional data summaries, but almost no programming, data cleaning, etc. No code, no instructions on how to scrape or munge or whatever. There were some passages on data preprocessing and other nitty-gritty issues, but not many.

I’m not saying this as a criticism: it’s typical of statistics texts to present data that has already been cleaned—or, when data-cleaning or data-gathering is discussed, for that to be presented in bare-bones form. Lots of detail about statistical inference, only a little bit of detail about the data. That’s the way things go. Indeed, that’s how it goes in my own books. And perhaps this makes sense: data cleaning etc is idiosyncratic while statistical principles are more generalizable.

In any case, now I understand more why people say that “data science” is just another word for “statistics” (as applied to a particular sort of problem). If data science is defined as by Rachel Schutt and Cathy O’Neil, then, no, it’s a lot more than statistics, indeed statistics is only a small part. But using Foster Provost and Tom Fawcett’s implicit definition, data science is just statistics, albeit reframed and refocused in a way that is more useful for certain online settings.

Again, this is not meant as a criticism of Provost and Fawcett’s book, or a criticism of data science more generally. Statistics is great, and it’s worth putting in the effort to present it in different ways to different audiences. I’m no fan of standard approaches to introductory statistics and I have no problem with the fundamental ideas being presented in a “data science” context.

1. J Galbraith says:

I just started reading these two books concurrently and immediately noticed this difference in approach as well.

I’m hoping they will pair to provide a balanced survey of the field.

• John says:

It may be a biased survey depending on whether you take an unweighted average of the two books or weight the average by page count.

2. Robert says:

I am puzzled why you would exclude data cleaning/preprocessing from being part of ‘statistics’, because I would expect that if your job title is ‘applied statistician’ or similar, this would usually be a step you would have to go through. I mean, yes, writing code for scraping is something quite different, and arguably so is advanced database querying, but just organising the data correctly seems like an essential step.

• Wayne says:

I agree. Applied statistics textbooks really need to address how you organize data so that you have an audit trail and end up with reproducible results.

Even once data is initially obtained, cleaned, and organized, there are things you find in initial exploratory analysis that should also be part of statistics textbooks. This second-level cleaning/reconciliation requires domain knowledge, but isn’t as idiosyncratic as scraping free-form text. I’m thinking things like improperly documented data (units are metric for most variables, but one variable is in feet), odd encodings (e.g. “no weight taken” indicated by 0, or 999, or something else), measurements which don’t fall into theoretically-justified distributions, samples that aren’t representative of the desired population, etc.

I think this kind of thing can be generalized well-enough to qualify as statistics. (At least to the extent that there are canonical examples.) Which would be valuable to all kinds of scientists who are trained in statistics. This really does tie back into Andrew’s recent threads about p-values and scientists who have a brittle knowledge of statistics.
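The examples above (a stray variable in feet, sentinel codes for a missing weight) can be sketched in a few lines. This is only an illustrative sketch, not anyone's actual workflow: the records, the field names, the sentinel codes `0` and `999`, and the plausibility threshold are all assumptions made up for the example. Each change is logged in a `notes` field so there is an audit trail.

```python
# Hypothetical records: heights are supposed to be in metres, weights in kg.
records = [
    {"id": 1, "height": 1.72, "weight": 68.0},
    {"id": 2, "height": 5.9, "weight": 999},  # height looks like feet; 999 is a sentinel
    {"id": 3, "height": 1.65, "weight": 0},   # 0 used for "no weight taken"
]

MISSING_WEIGHT_CODES = {0, 999}  # assumed sentinel codes, not from any real codebook
FEET_TO_METRES = 0.3048
MAX_PLAUSIBLE_HEIGHT_M = 2.5     # assumed threshold for "this is probably in feet"


def clean_record(rec):
    """Return a cleaned copy of rec, with a note for every change made."""
    cleaned = dict(rec)
    notes = []

    # Recode sentinel values as explicitly missing.
    if cleaned["weight"] in MISSING_WEIGHT_CODES:
        notes.append(f"weight code {cleaned['weight']} treated as missing")
        cleaned["weight"] = None

    # Convert implausible heights, assuming they were recorded in feet.
    if cleaned["height"] > MAX_PLAUSIBLE_HEIGHT_M:
        notes.append(f"height {cleaned['height']} assumed to be feet, converted to metres")
        cleaned["height"] = round(cleaned["height"] * FEET_TO_METRES, 3)

    cleaned["notes"] = notes
    return cleaned


cleaned = [clean_record(r) for r in records]
for row in cleaned:
    print(row)
```

Whether a height of 5.9 really is in feet is a domain judgment, which is why the sketch records the assumption instead of converting silently.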

3. Nicholas says:

Um, I would go so far as to say data scraping and munging is part of my applied statistics toolbox. I find myself doing a lot of pulling tables out of papers and performing ad hoc meta-analysis to try and understand trial populations for planning a clinical trial. I also did a lot of the same using online databases when I was still doing genetics. I think the issue, no offense, is that in the academic world of a “pure” statistician all that stuff either never comes up, or is something your collaborators do.

• Andrew says:

Nicholas:

You write, “in the academic world of a ‘pure’ statistician all that stuff [data scraping and munging] either never comes up, or is something your collaborators do.”

No! It’s the other way. I live in the academic world and I get dirty with data (sometimes). But in the above post I discuss the book “Data Science for Business” by a couple of non-academics, which has just about nothing on that sort of data processing; it’s all about statistical inference. So don’t blame academics for that!

• Dean Eckles says:

Foster Provost is an academic. He’s a professor at NYU Stern.

I do think it is interesting that they don’t spend more time on those aspects, especially since for people managing data analysts or consuming their reports etc. (part of who the book is targeted at), they should understand where their analysts may be spending much of their time.

4. Rupert says:

I think this all comes down to people wanting to have a definition for data science, and if you are defining it you can say you are doing it. These authors are doing data science because they consider data science statistics. The computational biologist is a data scientist because that is what they think they are doing. A database programmer is a data scientist because he mines a database for data and likes to read blogs about big data. Some guy who made a visualization software package is a data scientist because people who don’t know what it means say he is one. Wait, I thought the PhD Ecologist was an ecologist but now they’ve fit a number of GLMs to their data so now they’re a data scientist. The definition is too vague and many want to think they are doing ‘that thing.’

• John says:

I think you hit the nail on the head. The phrase “data science” still does not have a concrete definition, although one is slowly forming. In the meantime, particularly in industry, you can fit any quantitative endeavor that helps the company out into the function of data science. It is a sexy term, however, and I will use it in job postings to get good junior candidates. Don’t get me started on the ambiguities of “big data”.

5. John says:

I think the field of Statistics should be defined based on its purposes and goals, and not based on a particular set of technologies currently in use, as these technologies will certainly become obsolete in the future. If one defines Statistics based on its goals, such as something based on the mathematical analysis of quantitative data, then the field is dynamic, and can expand as new methods and types of data become available. If Statistics is defined based on currently available methods, then the field is static, and must be replaced by another field (e.g. Data Science) as science/tech advances. Defining Statistics in this static way, based on current methods, is like defining Astronomy based on a particular type/brand of telescope – a definition that certainly cannot sustain the field through time.

• Longhai Li says:

This is a great comment on statistics field.

• John says:

Thanks! Though this just reminded me of an old quote:

“Computer science is no more about computers than astronomy is about telescopes.” (unknown)

It seems that computer scientists are also wary of defining their field in a static manner based on current technologies. I vaguely recall some Computer Science type department having referred to itself as “Computing Science” – a phrase that seems to get around the problem.

John

7. MikeM says:

Statistics was originally defined as data about the state, so data science subsumes statistics, if we accept that original meaning. Of course, now it means things like how to compute various properties of the data, whereas data science embraces things like visualization, which is not a mainstay of current statistics.

• Robert says:

I think the same comment applies to visualisation as applies to data cleaning – surely this is in fact a mainstay of applied statistics? Possibly more so – statistics textbooks which address using statistics whilst giving examples of real data typically graph the data. Bayesian Data Analysis seems indisputably a statistics textbook, for example, and there is a visualisation whenever a data set is presented for the first time. Maps of the US used to illustrate cancer death rates or electoral forecasts, in particular, seem to be what people mean by ‘visualisation’ even if they exclude histograms as being too basic to wear that title.

8. […] “More on “data science” and “statistics”” http://statmodeling.stat.columbia.edu/2013/11/19/22182/ … […]

9. Thomas Speidel says:

Frankly, this should hardly be surprising given how loosely defined and subjective the term “Data Science” is. The most sensible definition I have seen comes in the form of a Venn diagram with three circles containing hacking skills, math & stats, and subject matter expertise. The intersection of these three is defined as Data Science. Now how many folks really have a thorough and equally proportioned knowledge of all three areas? With all the big data and data science hype, it’s really easy to monetize by re-branding any of these three as data science.

• dab says:

To save any who haven’t already seen it the googling time, here’s a link to said Venn diagram:

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

• Georgette Asherman says:

I consult for an organization that went from calling itself ‘statistics’ to ‘data sciences’ nearly 10 years ago, before anyone had heard of the term. The reasoning was to include laboratory and social science professionals with strong data analysis and modeling skills in the capacity of the group. None of this involved ‘big data’ as the term is used now, although the computations were on a larger scale than the classical inference methods typical before. So I guess this was an overlap of two circles of the Venn diagram, statistics and subject matter knowledge.

This discussion is not just an academic question; it is important for human resources and recruiting. Now that universities offer Masters in Data Science, as well as Analytics, how will these graduates be placed and what will be the role of a grounding in inference out of Statistics programs? If anyone can do Data Science, what is the point of an advanced degree? When two books by good authors are so different, clearly a consensus is far off.

Does anyone remember Decision Sciences? I worked briefly in a group with that title and also heard it was an approach from the 1970s that later was disparaged. Is Data Science another such academic fad?

10. John says:

I agree with Thomas. I think data science should be used for a team/collective approach to managing, analysing, and reporting data, using reproducible methods commonly found in scientific disciplines. I am not convinced it is something an individual could conduct on their own. “Data Scientist” strikes me as a rather fanciful job title – “Jack of all trades, master of none.”

11. John Mashey says:

1) A certain well-known university President (and computer scientist) once said that any discipline that had science as part of its name probably wasn’t one. :-)

2) This reminds me a bit of a computational statistics journal whose topic list seemed to encompass much of computer/computing science.

3) We have often talked about renaming the Computer History Museum into the Computing Museum, but it’s always been hard to face the hassle.

• Andrew says:

John:

Regarding your point #1, see what Rachel Schutt and Cathy O’Neil say, reproduced in the fifth paragraph here.

• John Mashey says:

Yep. I don’t know who originated this, but it’s been a few years.
Of course, the epitome is “creation science,” hard to beat for something that calls itself science but isn’t.

Some computer scientists really do science some of the time, even if most of us do engineering or other things more of the time, although (like statisticians), sometimes we get to help natural scientists do better science.

• John Mashey says:

Stats & (climate) science, lately: RealClimate: Statistics & Climate, including my comment #15 about a good workshop in which all but one of the presentations were pretty good, the poor one being by one of Andrew’s “favorites.”

(The climate folks have long had to do serious data scrubbing and inferences from maddeningly partial data that cannot be fixed just by running a new experiment. A nice recent paper is Cowtan & Way (2013), in which they use satellite data and kriging to improve the estimates of temperature change over the Arctic and other areas where ground stations are at best sparse. Discussed here.)