Statistics is the least important part of data science

This came up already but I’m afraid the point got lost in the middle of our long discussion of Rachel and Cathy’s book. So I’ll say it again:

There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .

The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.

To put it another way: you can do tech without statistics but you can’t do it without coding and databases.

This came up because I was at a meeting the other day (more comments on that in a later post) where people were discussing how statistics fits into data science. Statistics is important—don’t get me wrong—statistics helps us correct biases from nonrandom samples (and helps us reduce the bias at the sampling stage), statistics helps us estimate causal effects from observational data (and helps us collect data so that causal inference can be performed more directly), statistics helps us regularize so that we’re not overwhelmed by noise (that’s one of my favorite topics!), statistics helps us fit models, statistics helps us visualize data and models and patterns. Statistics can do all sorts of things. I love statistics! But it’s not the most important part of data science, or even close.

68 thoughts on “Statistics is the least important part of data science”

  1. I would rephrase as: “Statistics, properly defined, is a key part of data science”. Statistics should incorporate statistical software development, data munging, reproducibility, visualization, and report writing. That is most of the data science pipeline. Just my 2 cents.

    • Jeff:

      Interesting point on expanding the boundaries of statistics. To me, “statistics” already includes visualization and software development. Maybe also data munging too, but I’m not quite sure what that is! So, yeah, I’m with you on this. For example, suppose a researcher writes a program to scrape some data off the web. I don’t know how to do this, and I’ve not thought about it as a statistical skill, but maybe it is, and my views have just been too limited. Certainly I feel very strongly that visualization and software development are part of statistics, and I get annoyed when people don’t want to count it as statistics.

      • Andrew

        Data munging is just another way to say data cleaning. I think that is a critical part of statistics since it helps you understand many sources of hidden variation/bias you might miss if you only have the cleaned data. I think scraping data from the web is now a necessary/default skill we should be teaching statisticians, much the way software development wasn’t considered “core” but now is.

        I think we agree that if we define our field narrowly then of course it is less important and that may cause us to become obsolete. Hopefully that was the take home message from the conference.


        • Jeff:

          Yes, but I think there are disagreements. When I hear people say that data science is just statistics, I don’t think they’re necessarily saying that statistics includes programming, data munging, etc. I think they often have a narrow view of statistics, and they just don’t realize how much of data science falls outside those boundaries.

        • Andrew – Hopefully those of us with a broader view will eventually convince them that statistics is a much bigger discipline than narrow mathematical results – Jeff

        • It’s surely not “narrow” mathematical results but if the definition of “statistics” is expanded so far that it includes writing a function to scrape data from web pages then I think the definition will be too broad to be useful.

  2. I think this may be a semantic problem. Clearly people who have been working with databases for the last 70+ years have been doing science. The optimizations underlying classic databases are amazing, from data storage to SQL plan execution, and the technology advances that have made all modern data processing possible are clearly science. I agree with the idea that a majority of the work with data is both scientific and not focused on statistics.

    The BuzzWord “Data Science”, however, I think, is used primarily in a more statistical sense. There’s a difference between “Data Science” and data science, stupid as that may sound. I say that primarily because of the term’s tenure — pretty much anything prior to 2001 or 2002 didn’t use the term “Data Science” — and because of its usage compared to everything that was done before the BuzzWord took hold. One of the earliest usages was a book on the topic that is in fact titled “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics” — clearly Statistics was important to that thought process.

    The term is nearly synonymous with “Predictive Analytics”, which I think is very statistical in nature. The idea of taking data (typically a LOT of data) and extracting useful information is highly entrenched in statistical concepts, even if the tools and processes obscure the statistics and make the capability accessible to non-statisticians.

    Statistics existed before “Data Science”, as did (as you pointed out) visualization, large databases, distributed computing, and what have you. “Data Science” is the modern crossroads where these things intersect. The statistical part of this is not an option, it’s one of the key components that turned it into something hype and buzzword worthy, neither more nor less important than the others.

  3. Missing three paragraphs. Rich subject with numerous possible perspectives and meaningful detail. Blog != dumping ground for incomplete thoughts.

    2/10 would not read again

    • Ummm . . . if you really think that “Blog != dumping ground for incomplete thoughts,” you clearly don’t belong here! Incomplete thoughts is what we’re all about. If what you’re looking for is “complete thoughts,” I recommend you read my published articles and spend less time clicking on blogs.

    • This is a great example reply for statistical natural language processing. To wit, what’s the probability, given the comment’s text and context, that the comment is spam?

      This form of comment is tricky. It’s the exact opposite of spam of the form “ooh, I love your format,” “that post showed real expertise,” and “how do I get this to render on my browser?” Which is pretty clever!

      A great natural-language-based spam detector would be able to detect vagueness/non-responsiveness to the post. That is, answer the question, “Could this reply be put on any blog post by anyone anywhere?” Too bad that’s beyond the state of the art in NLP.

  4. I feel like you can keep regressing to something that’s important, but it doesn’t mean it’s the most important thing.

    Paper is the most important part of data science. You can’t do much without paper. Trees! You can’t have paper without trees. Trees are the most important part of data science (and so on). It’s all important, you miss one component, and you’re missing it all (I’d say).

  5. Outside of academia this is just semantics, but in academia it’s a real turf war. It would be in the interest of Statistics to annex programming into their domain, but Computer Science got there first. This has all sorts of implications for how the basics are taught.

  6. I would argue that statistics is very important to data science as it’s practiced right now, but not in the way that you seem to be describing. People who work with data tend to be very optimistic about what they can do with big data sets, and there’s a lot of hype about what exactly we can accomplish. The role of the statistician is to give a more grounded take on what’s actually possible and reasonable to do. This definitely requires very strong communication skills, but it’s an essential task that people from other disciplines aren’t really positioned to do.

  7. My favorite definition is

    Data Science = Statistics + Brogramming

    But jokes aside… I completely agree with your assessment. Dealing with big data stores, SQL, and data munging is not typically taught in even the most computational of Statistics PhD programs (like UCLA, for instance). So it is kind of hard to claim that these are an integral part of the subject matter even though they may be integral to its practice.

  8. I think it’s way too early to even be worrying about exact taxonomies.

    In industry there’s this bubble of big data, where everyone is rushing to gather data faster and faster until everyone has giant databases without anyone who has any idea how to use them. The data is messy and complex, riddled with unknown sampling mechanisms and other systematic biases, and without any strong statistical knowledge, analysis is limited to decimating the data to the point where constants, lines, and MAYBE logistic regressions can yield reasonable fits (incidentally, destroying your data until it’s easy to analyze is exactly what physicists do best, which is perhaps why it’s pretty straightforward to get a job with such a background). On the other hand, even those simple analyses yield significant benefits, so there’s little motivation to push much beyond that. Yet.

    Then there’s academia, where everyone is rushing to rebrand themselves to fit into this ephemeral definition of “big data” because all of a sudden funding agencies are taking notice (not to mention industry partnerships and other private sources of funding). People try to shove machine learning and statistics into holes that don’t quite fit because they don’t know any better. Seriously, how many papers parade their algorithms against sparkling clean data sets instead of real-world, messy data? When I was working in industry I led the development of a sentiment analysis algorithm that was trying to balance bias so that bulk positive and negative sentiment could be more easily seen visually (which was all that you could hope for given the incredible messiness of the data), but when we tried to submit it to an academic journal the referees lambasted the submission because we didn’t compare against every other traditional algorithm, even though they weren’t appropriate for the very real problem we were trying to solve.

    Let’s all calm down and try to solve real problems together instead of jockeying for position. Unless you’re writing a grant to fund Stan, then go ahead and bullshit all you want. Our data is clearly the biggest. :-D

  9. Pingback: HUFFPOLLSTER: Obamacare Support Shows Signs Of Decline | Tiggio Blogs and More

  10. Pingback: Somewhere else, part 91 | Freakonometrics

  11. Statistics is neither less nor more important than any other aspect of data science. It is simply a tool to be used at the most appropriate time.

    Once you have all the tools of your trade (maths, stats, programming, image analysis, visualisation, etc.) in your toolbox, then you’ll be a hell of a data scientist and very much in demand.

    It’s all about using the correct methods at the correct time.

    Lee Baker
    Chi-Squared Innovations

    • Jm:

      I disagree. For example, you can do science by putting together data to get a big database, then using these data to assess predictions from social-scientific theories. This is not perfect science—correlation is not necessarily causation, there are potential issues of nonrepresentative samples, etc.—but it’s still science.

      • I don’t think that’s true. Science is essentially making declarative statements about the world in a framework that allows for falsification. In your example, assessing predictions would entail some way of possibly falsifying those predictions, whether by significance testing, model fit or whatever. And that is – in my view – statistics.

  12. Much heat, little light in this thread. It sounds as if the conference Andrew attended was pretty much navel-gazing by a bunch of statisticians, unleavened with the views of non-statisticians working on similar issues. In industry, the fault lines are clear…traditional statistical (read 20th c) approaches to multivariate modeling were built with small data and do not work in the presence of petabytes of information or more generally anything that won’t fit on a laptop. This isn’t the fault of statisticians for not developing solutions that will work; in all fairness, it’s that the hardware has yet to catch up. Advances in computational power are such that, quite obviously, it soon will. To the point of an earlier post, academic statisticians can and should be faulted for developing “solutions” to big data that get funding or publish papers but that are based on toy datasets that aren’t even remotely scalable to real world challenges. Consider one academic who proposed a parallelized approach to leveraging R in the presence of big data using Hadoop and MapReduce but developed it on 127 variables. Real world big data can contain tens of thousands if not millions of predictors, so this frankly represents a refusal to offer a plausible answer or even take the issues seriously. And this isn’t just a “pure” vs “applied” question that would occupy a Bell Labs researcher for years. Another statistical knee-slapper is the frequently heard nostrum to “sample.” How? Based on what? Randomly? Can the dependence structures even be defined on which a sampling scheme could be built? In the absence of that, sampling big data is not a solution…it’s a variance-destroying prescription that will run any endeavor into the ditch.

    Then there is machine learning, a discipline separate and siloed in large part from developments in stats. These are the ones that most people today refer to as “data scientists”…the grads of MIT and CMU — and for the most part, they are eating statisticians’ lunch in the world of big data, offering solutions that are efficient and work — in general. ML experts are being asked to build predictive models with big data even though many of them don’t subscribe to reductionist statistical mandates regarding model-building or aren’t even aware of them. So, developments in this world will apply one-pass, algorithmic probability models evolved from theories of Kolmogorov Complexity that use all of the available information, not just a subset. Statisticians are a long way from offering solutions like these…at least until the technology catches up.

    • Our research group has long been a mixture of database people (indexing mechanisms, distributed data indexing etc.) and data mining/machine learning people, plus some activity on data visualisation. To me the “data science” term was very convenient when it became fashionable, because it was able to describe in a concise way what the group is doing. Before, I always had to make a painful paragraph-long explanation: “we are doing this and this and that, but hey, it’s quite connected in fact, look at all the common keywords in the VLDB and KDD call for papers”. More precisely, our group works on data clustering, which is central to both the database and the “mining” aspects of data processing.

  13. I am not exactly sure what Gelman is referring to here. It’s like saying a stove is the most important part of a kitchen (true but trivial). This post reminds me of the turf wars between the various social sciences in the 1960’s. At some margin, you need statistical skills (some inference has to ultimately be done with the database) while without data there is nothing to infer with (see induction).

    • I think at least some of the problem is that so much of statistical teaching and methodology is devoted to inferences from low/moderately powered situations. What statistical inference tools do you throw at a problem when the sample size is 50 million and any model you throw at it will be rejected at any conventional significance level? Sure, there are aspects of statistics that touch on useful things in this context (e.g. focusing on the bias aspect of the bias/variance tradeoff, confounding, etc.), but it’s not really the “core” of the statistical enterprise in terms of its institutional role in academia and society.
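A rough sketch of the large-sample point above, assuming a one-sample z-test and a made-up effect of one thousandth of a standard deviation (the function name and all numbers are illustrative, not taken from the comment):

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a one-sample z-test of 'mean == 0'."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical effect: practically negligible, but not exactly zero.
tiny_effect = 0.001

p_small = two_sided_p_value(tiny_effect, sd=1.0, n=1_000)
p_huge = two_sided_p_value(tiny_effect, sd=1.0, n=50_000_000)

print(f"n = 1,000:      p = {p_small:.3f}")
print(f"n = 50,000,000: p = {p_huge:.2e}")
```

With the same negligible effect, the test is nowhere near rejection at n = 1,000 and rejects overwhelmingly at n = 50 million, which is one way to read the claim that significance tests stop being the interesting tool at that scale.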

  14. Pingback: Why statistics can never be the most important part of science | LARS P. SYLL

  15. Pingback: DataBeat 2013 | LANDsds Sustainable Voice

  16. Food for thought:

    In what sense is coding a “science”?
    In what sense is data science a “science”?
    In what sense is the *practice* of data science a “science”?

    I think statisticians need to play a much larger role in this, otherwise we will end up in a world of utter confusion as you can pretty much justify any kind of “theory” by fishing among a large enough data set.

  17. I’m a bit late to this, but is calculating the mean of a collection of data “statistics”? Is working out whether to measure a central tendency with a median or a mean (or whatever) “statistics”? Because I’m pretty sure that most data science projects end up doing these things, and couldn’t get very far without them.

    The fact is that you can often do a lot and learn a lot from simple descriptive summaries of data.

    • Patrick:

      To me, if it’s just averaging or counting or comparing, that’s not statistics. For it to be statistics it needs to involve some structure in the data collection (for example, random sampling or random treatment assignment) or in the analysis (for example, regression, matching, or poststratification).

      There will always be methods that are at the border of my (or any) definition. For example, suppose you compute age-adjusted cancer rate by taking raw rates in age categories (from some scraped dataset) and then averaging. This could be considered a simple weighted average and thus just number crunching, or it could be considered basic statistics. Indeed, one reason I like statistics (rather than simply computation) as a framework for data science is that questions are framed in terms of estimation rather than simply operations on data.
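As a minimal sketch of the age-adjustment example above, assuming direct standardization against a fixed set of reference weights (all counts and weights below are invented for illustration):

```python
def age_adjusted_rate(cases, population, standard_weights):
    """Weighted average of age-specific rates (direct standardization)."""
    rates = [c / p for c, p in zip(cases, population)]
    total_weight = sum(standard_weights)
    return sum(r * w for r, w in zip(rates, standard_weights)) / total_weight


# Hypothetical data for three age categories: young, middle, old.
cases = [5, 40, 200]
population = [100_000, 80_000, 50_000]
# Hypothetical reference age distribution (shares of a standard population).
standard_weights = [0.5, 0.3, 0.2]

adjusted = age_adjusted_rate(cases, population, standard_weights)
crude = sum(cases) / sum(population)

print(f"crude rate:        {crude * 100_000:.1f} per 100,000")
print(f"age-adjusted rate: {adjusted * 100_000:.1f} per 100,000")
```

The arithmetic is only a weighted average. Whether it counts as statistics arguably depends on whether the result is treated as an estimate of the rate a reference population would experience, which is the framing the comment prefers.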

      So, to me, very simple descriptive summaries (averages, counts, rates, etc) of data that were not collected statistically: that’s not quite statistics. But more complicated descriptive summaries (such as what’s in Red State Blue State; that’s all descriptive), I’d call that statistics.

      • One of the jargon definitions of “Statistic” is macro or aggregate variable (some function of the data), but if you’re just calculating statistics (of this type) then you’re pretty much doing “number crunching”. You only need someone who’s got a “statistician” background if you actually want to know what those numbers imply about stuff that the number itself doesn’t tell you. For example, what happened at unmeasured locations, or might be expected to happen in the future under certain conditions, or actually happened at the measured locations given that the measurement instrument is very imprecise. If your question doesn’t implicitly or explicitly involve a model, it’s a question for number crunching.

        Other examples of this are the difference between say cash accounting and accrual accounting. One is more or less a record of financial transactions, and the other is a kind of model of past and future contractual arrangements.

  18. I never see a statistician claiming statistics is the most important thing.

    This is a bogeyman of a new breed of data scientists who need a job where the tools and interpretation are heavily dependent on statistical thinking. So one way to gain popularity is to discredit the statistical sciences.

    This is like a statistician saying computing is not the most important thing in behavioral intelligence.

    If this kind of statement comes from a statistician, then there may be some reason to think about it.

    • At JSM Nate Silver (who doesn’t consider himself a statistician) said that “Data scientist is just a sexed up word for statistician”, to which the audience erupted in applause.

  19. Pingback: Is Statistics the Least Important Part of Data Science? - Data Community DC

  20. For me it’s a pretty simple taxonomy. “Data Science” = Data Engineer/Architect + Statistician.

    Sometimes those two jobs can be done by the same person, more often the Data Engineer can do some stats or the Statistician can do some data engineering.

    Data Engineering is really about building the tools that the company needs to deal with big data, whereas a statistician uses the end results (“the data”) to carry out the analytical function that the company needs.

    A statistician who can run their own MapReduce jobs and build dashboards with NodeJS and D3 would be a prototypical “Data Scientist”, but I doubt there are too many of these outside of Google, Facebook, etc.

  21. I think to over-focus on the data science/stats debate is not really productive. The birth date for stats precedes the birth date for data science and yet here we are debating whether the former has the right to belong to the latter. Your aptitude to vary your learning would only result in you knowing a little more of something else ….and that really is it.

  22. Indeed, great point, although I’d love to hear what you think are the hard core requirements of data science. In my mind, it starts with a few key important parameters:
    1) A strong sense of business needs.
    2) A quantitative and creative mind (hard to find).
    3) More importantly, though, an attitude towards others.

    Here I mean, the willingness and ability to branch out of the world of data science in order to spread the word of data across the company. I call that the ‘soft stuff is the hard stuff’. My Lead Data Scientist wrote a great blog on this recently.

    Check it out @

    Analytically Yours,

  23. In general, data science/analytics groups at fortune companies are not heavily recruiting Ph.D.’s in Stats or need them, yet. A masters or advanced undergrad is just fine. In my experience as a fresh stats phd interviewing for these spots, an MBA with a quant talent was hired over the statistician most times, if a statistician was ever interviewed. Reasons are often communication skills or importance placed not on how the estimation of a method increases gains. Any method was acceptable. The one job I did get pays about half of what an MBA was getting, no more complicated than a logistic regression, ever. Communication or influencing members of the organization to accept data-based practices was probably the biggest deal and then maybe programming SQL. So, trying to up the method with some Stats PhD skill just has higher organizational costs. I agree with Andrew.

  24. Pingback: Statistics is the least important part of data science – Statistical Modeling, Causal Inference, and Social Science | Carmen J. McKell

  25. Pingback: Statistics and data science, again « Statistical Modeling, Causal Inference, and Social Science

  26. If you want to cook an egg, you need some source of heat, an egg, and a flat surface like a pan. If you lack only one of these things, you can’t cook an egg.

    Arguing what’s the “most important” or “least important” element to cook an egg is superfluous. You *need* all three.

  27. Pingback: Am I a data scientist? | Hyndsight

  28. Pingback: The Data Science Delusion – Cloud Data Architect

  29. Pingback: The Data Science Delusion - Launchship

  30. Pingback: The Data Science Delusion - Use-R!Use-R!

  31. Pingback: Missing Values, Data Science and R – Mubashir Qasim

  32. Pingback: Can you be a Data Scientist without coding? – Site Title

    • I think we are going there. But it is hard to be a data scientist without any form of coding. I mean how can you extract a very voluminous dataset if you do not know any SQL for instance. There are more and more tools that decrease the amount of coding you do. In general, you will see that as you learn more about the algorithms and how they function, coding them will be easier. When I did my MSc in Data Science, some of my classmates had never coded before and they managed to graduate in 2 years. So coding is not a big obstacle to becoming a Data Scientist. The more you do it, though, the easier it will get.

      • “I mean how can you extract a very voluminous dataset if you do not know any SQL for instance.”

        Well, I use the Query Builder in JMP and can pull data from multiple sources in very large data sets and join them. The SQL code is written for me as I select which tables to use, which fields to join on, and which variables I want in the joined dataset. So, why do I need to code?
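For readers who have not seen what such a tool produces, here is a hedged sketch of the kind of join a point-and-click query builder might write on the user's behalf. The table names, columns, and data are hypothetical, and this is not the actual output of JMP's Query Builder; it uses an in-memory SQLite database so the example is self-contained.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'east'), (2, 'west');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# The three choices described above: which tables, which fields to join on,
# and which variables to keep in the joined result.
query = """
SELECT c.region, SUM(o.amount) AS total
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
GROUP BY c.region
ORDER BY c.region
"""

rows = conn.execute(query).fetchall()
conn.close()

for region, total in rows:
    print(region, total)
```

Whether picking those three things in a dialog counts as "coding" is the question the comment raises; the generated SQL is the same either way.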
