Rachel Schutt and Cathy O’Neil just came out with a wonderfully readable book on doing data science, based on a course Rachel taught last year at Columbia. Rachel is a former Ph.D. student of mine and so I’m inclined to have a positive view of her work; on the other hand, I did actually look at the book and I did find it readable!
What do I claim is the least important part of data science?
Here’s what Schutt and O’Neil say regarding the title: “Data science is not just a rebranding of statistics or machine learning but rather a field unto itself.” I agree. There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science.
The question then arises: why do descriptions of data science focus so strongly on statistical tasks? (As Schutt and O’Neil write, “the media often describes data science in a way that makes it sound like it’s simply statistics or machine learning in the context of the tech industry.”) I think it’s because statistics is the fun part and the part that, in this context, is new. The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.
To put it another way: you can do tech without statistics but you can’t do it without coding and databases. But in recent years, lots of tech companies have made use of statistical methods (including various statistical ideas that have been developed in the computer science literature). So, from the industry perspective, the new part of data science is the statistics. Statistics is the least important part of data science, hence it is the part most recently added, hence it is the part that is getting the most attention right now.
Schutt and O’Neil also write:
People have said to us, “Anything that has to call itself a science isn’t.” Although there might be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what it represents may not be science but more of a craft.
What is Hadoop, anyway?
OK, back to the book. I read and enjoyed the first couple of chapters and then went back to the table of contents to see what else I could learn. Chapter 14 caught my eye: “Data Engineering: MapReduce, Pregel, and Hadoop.” I keep hearing about “map reduce” and “hadoop” but I’ve never known what they are about. Before checking out the chapter, I did a quick Wikipedia read. The wiki articles seem clear enough, but after a 30-second read (hey, I’m impatient!) I still don’t really have a sense of what is going on here. So on to the chapter, which is coauthored with David Crawshaw and Josh Wills:
You’re dealing with Big Data when you’re working with data that doesn’t fit into your compute unit. Note that makes it an evolving definition: Big Data has been around for a long time. . . . Today, Big Data means working with data that doesn’t fit in one computer.
Then they get into the details. I still don’t understand map reduce and hadoop, but at this point I’m pretty sure it’s my fault, not theirs—or, to put it another way, to learn it I’d need to be able to have a q-and-a discussion with Bob or Daniel Lee or someone else who can explain it to me, or else I’d need to put in a bit more work. Fair enough, it’s not like someone could learn Bayesian data analysis by just reading a book and not doing any homework.
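To fix ideas for myself (this is my own toy sketch, not something from the book), here is the map/reduce pattern in miniature, counting words on a single machine. The function names are just mine; as I understand it, Hadoop’s contribution is to run the map and reduce steps across many machines when the data don’t fit on one.

```python
from collections import defaultdict

# Toy single-machine sketch of the map/reduce pattern (word counting).
# Hadoop's job is to run steps like these with the documents and the
# intermediate counts spread across many machines.

def map_phase(document):
    # "map": emit a (key, value) pair for every word in a document
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # "shuffle" + "reduce": group the pairs by key and combine their values
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data is big", "data science is a craft"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))
# {'big': 2, 'data': 2, 'is': 2, 'science': 1, 'a': 1, 'craft': 1}
```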
In that hadoop chapter, we get the following motivation for comprehensive integration of data sources, a story that is reminiscent of the parables we sometimes see in business books:
By some estimates, one or two patients died per week in a certain smallish town because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic. In other words, if the records had been easier to match, they’d have been able to save more lives. On the other hand, if it had been easy to match records, other breaches of confidence might also have occurred. Of course it’s hard to know exactly how many lives are at stake, but it’s nontrivial.
We can assume we think privacy is a generally good thing. . . . But privacy takes lives too, as we see from this story of emergency room deaths.
But what about this story?
One or two patients per week? That’s roughly 75 people a year, which is a lot! To calibrate, I’d like to get a denominator: the total number of deaths in the town each year.
I’m not sure how large the “smallish town” is. Here’s Wikipedia: “A town is a human settlement larger than a village but smaller than a city. The size definition for what constitutes a ‘town’ varies considerably in different parts of the world. . . . In the United States of America, the term ‘town’ refers to an area of population distinct from others in some meaningful dimension, typically population or type of government. . . . In some instances, the term ‘town’ refers to a small incorporated municipality of less than 10,000 people, while in others a town can be significantly larger. Some states do not use the term ‘town’ at all, while in others the term has no official meaning and is used informally to refer to a populated place, of any size, whether incorporated or unincorporated. . . .” Wikipedia then goes state by state, for example, “In Alabama, the legal use of the terms ‘town’ and ‘city’ are based on population. A municipality with a population of 2,000 or more is a city, while less than 2,000 is a town.”
Just to go forward on this, I’ll assume the “smallish town” has 10,000 people. If approximately 1/70 of the population is dying every year, that’s 140 deaths a year. So that can’t be right—there’s no way that half the deaths in this town are caused by poor record-keeping in a hospital. If the town had 20,000 people (which would seem to be near the upper limit of the population of a town that one would call “smallish,” at least in the United States), then we’re talking 1/4 of the deaths, which still seems way too large a proportion. Even if it is a town with lots of old people, so that much more than 1/70 of the population is dropping off each year, the numbers just don’t seem to add up. Maybe the town happens to have a large regional hospital. But, 75 excess deaths a year caused by “lack of information flow” still seems like a lot, and if the patients are drawn from a large population, it seems a bit misleading to describe these deaths as being “in a certain smallish town.”
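Just to spell out the arithmetic behind the above (the 1/70 death rate, roughly one over a typical lifespan, and the two town sizes are the same rough assumptions as in the paragraph above, not numbers from the book):

```python
# Back-of-envelope check of the "one or two patients per week" claim.
# The crude death rate of ~1/70 per year (roughly one over a typical lifespan)
# and the town sizes are my rough assumptions, not figures from the book.

excess_deaths_per_year = 1.5 * 52  # "one or two a week" is roughly 75 a year

for population in (10_000, 20_000):
    total_deaths_per_year = population / 70
    share = excess_deaths_per_year / total_deaths_per_year
    print(f"town of {population:,}: about {total_deaths_per_year:.0f} deaths/year, "
          f"so the record-keeping deaths would be {share:.0%} of them")

# town of 10,000: about 143 deaths/year, so the record-keeping deaths would be 55% of them
# town of 20,000: about 286 deaths/year, so the record-keeping deaths would be 27% of them
```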
What I just did was statistical reasoning, or maybe I should call it mathematical reasoning or numeracy. Based on my calculations, I feel like there is something missing in the story that was told about the hospital records. I could be wrong, though. I might be missing something subtle or even something obvious. It’s hard for me to check, because the story is not sourced. This is a reminder that all data, big or small, is more easily used when its source is clear. From a statistical perspective, we want to know the data-generation process (also called the likelihood function, also called “where did the data come from”) as well as the numerical data (or, in this case, the story, or anecdote, or parable, itself).
I enjoyed Rachel and Cathy’s book: it’s readable, informative, and like no other book I’ve read on the topic of statistics or data science. It has a lot in common with the “365 stories” project that we started here (but which never got off the ground because far fewer than 365 people sent us their stories). I think/hope that lots of people will get a lot out of this book. It got me thinking about all sorts of things.
P.S. I wonder what Richard Stallman would think about the book. On one hand, it’s all about being a “data humanist,” which I think he’d like. On the other hand, Rachel Schutt is the Senior VP of Data Science at News Corp, which would surely be a turnoff to the Gnu-man. And I seem to recall he’s down on O’Reilly.