Is data science a discipline?

Jeannette Wing, director of the Columbia Data Science Institute, sent along this link to this featured story (their phrase) on their web site.

Is data science a discipline?

Data science is a field of study: one can get a degree in data science, get a job as a data scientist, and get funded to do data science research. But is data science a discipline, or will it evolve to be one, distinct from other disciplines? Here are a few meta-questions about data science as a discipline.

  • What is/are the driving deep question(s) of data science? Each scientific discipline (usually) has one or more “deep” questions that drive its research agenda: What is the origin of the universe (astrophysics)? What is the origin of life (biology)? What is computable (computer science)? Does data science inherit its deep questions from all its constituency disciplines or does it have its own unique ones?
  • What is the role of the domain in the field of data science? People (including this author) (Wing, J.M., Janeia, V.P., Kloefkorn, T., & Erickson, L.C. (2018)) have argued that data science is unique in that it is not just about methods, but about the use of those methods in the context of a domain—the domain of the data being collected and analyzed; the domain for which a question to be answered comes from collecting and analyzing the data. Is the inclusion of a domain inherent in defining the field of data science? If so, is the way it is included unique to data science?
  • What makes data science data science? Is there a problem unique to data science that one can convincingly argue would not be addressed or asked by any of its constituent disciplines, e.g., computer science and statistics?

I don’t understand how bullet point two is supposed to distinguish data science from the more prosaically titled field of applied statistics.

The story goes on to enumerate ten research challenges in data science. Some of them are hot AI topics like ethics and fairness, some of them are computer science topics such as computing systems for data-intensive applications, and some of them are statistics topics like causal inference.

31 thoughts on “Is data science a discipline?”

  1. Good questions. I got some pushback in comments here a few years ago when I predicted that there was a need for a college major called something like data science. It looks like I was right, huzzah.

    But if you get philosophical, IMHO, the content is really data engineering, not data “science.” Replacing the loaded word “science” with “engineering” would answer or obviate some of these questions, especially #3. But since schools teach “computer science” in which the content is almost all programming, they are probably stuck with the “science” tag. Academics should be Wittgensteinian enough not to let the word determine what they think the field “really is.”

    • I am sorry to ask, but how did you get to the conclusion “since schools teach “computer science” in which the content is almost all programming”? IMO, it needs a lot more context to make it true. My undergrad CS curriculum contained much more than just programming: data structures, algorithms (proving algorithms are legit), distributed systems, cloud computing, intro to AI, network stuff.

      IMO, that statement is a little bit offensive to computer scientists, who could probably argue back by saying “schools teach ‘statistics’ in which the content is almost all linear regression, mean and standard error, they are probably stuck with the ‘science’ tag.”

      • Sorry. I based that on items I have seen written by recent CS grads. (I think one school was Rutgers.) Good for the schools that teach more to undergrads.

  2. Bob:

    1. Jeannette writes, “Each scientific discipline (usually) has one or more ‘deep’ questions that drive its research agenda.”

    I don’t think statistics has such questions. I think of statistics as a branch of engineering, and I don’t think engineering disciplines usually are driven by deep questions in that way. Consider mechanical engineering, chemical engineering, electrical engineering, etc. These fields are driven by particular problems (for example, developing a better battery), in the same way that subfields of statistics are driven by particular problems (for example, formalizing Bayesian workflow, scalable computation for hierarchical regressions, and construction of model-based survey weights, to consider three examples from my own subfields).

    Thinking about scientific disciplines such as biology, astrophysics, computer science, chemistry, physics, political science, economics, psychology, etc.: I’d say each of these fields is driven by a mix of deep questions and specific engineering challenges (what we call “puzzles” in social science).

    2. Jeannette writes, “data science is unique in that it is not just about methods, but about the use of those methods in the context of a domain.”

    Is data science really unique in this way? I’d think this describes just about every field within engineering. For example, civil engineering requires materials science (which itself is some sort of hybrid of physics and chemistry) but it’s all about the use of materials science in the context of particular domains of application such as building a dam or whatever.

      • I think it’s an almost true statement, though.

        That said, I still love the Two Envelopes Problem and as many times as I’ve thought it through and understood it, if I go a couple of years without thinking about it I have to go through it all again. That’s a sure sign that there’s some little part of it that I still don’t really understand at the deepest level. That may be a somewhat pathetic thing to admit, but on the other hand I’m far from alone in that. The fact that the Wikipedia page contains more than one ‘proposed resolution’ is enough to indicate that there’s something here worth thinking about. It’s especially funny-peculiar because it is super easy to pose the problem in a way that there is no paradox and nothing confusing about it. Explaining why the right answer is right is not a challenge at all, it’s explaining why the wrong answer is wrong that is so difficult. https://en.wikipedia.org/wiki/Two_envelopes_problem

        • Well,

          Statistics uses probability, which ultimately relies on philosophical questions. I don’t think anything else can get any deeper than that.

          Data ‘science’, AI, deep ‘learning’, ‘neural’ networks, etc. are just trendy marketing terms and they sell well.

          Here in SV, a bunch of former programmers whose skills in math/stats are just ok or less, simply re-branded themselves, because they were already in place. Can’t blame them, as I’d do the same.

          To be a data ‘scientist’, one would need to be very savvy with almost all of stats/math and be excellent at research methods from various sciences. However, currently that is mostly not the case. Your average DS is just a former coder, with very little knowledge in the above. But hey, he knows how to program in Python, because that’s what the companies ask for; a data-cruncher.

          That’s why those made-up fields are not going far. They are not dealing with very deep questions (e.g. self-awareness and survival instinct in case of AI). All of that fluff is just fancy automation.

          Cheers

    • The fact many people and scientists believe Statistics has no deep questions is a huge problem, and it filters into the work done in many other disciplines. I remember feeling that way after my first master’s degree (not in Statistics) and it was just a matter of not knowing enough to know what I didn’t know and having Statistics presented as a set of technical skills.

      The problem of making inferences from data and probability models contains many deep and hard questions. In general, inference is, and should be, hard. The problem of making inferences deserves more attention relative to the focus on all the “methods” churned out to produce results conditional on deep and largely unquestioned underlying assumptions, as well as many unsettled foundations that are treated as settled.

      • Thanks for pointing this out! Although stats is geared toward some practical aim in the long run, at its core it is about constructing formal models of the stochastic processes that generate data and using data to inform those models. As you say, though, most people never (need to) make contact with that deep level, which as you say, is pretty dang deep in that it relates to how we (as humans) build models to learn about the world.

        This makes me think that actually most engineering disciplines also have deep questions at their core, e.g., civil engineering is not just about how to build the most cost-effective bridge, but about the abstract concepts of resource flow, statics, etc.

        But on the other hand, I can see where Andrew is coming from in saying stats/engineering aren’t “driven” by those deep questions, in that people typically engage with those fields in a practical way. And one could argue that those “deep questions” really belong to some other domain, like mathematics or physics, i.e., once it’s deep it’s no longer stats/engineering. But then we get into arguments about how “fundamental” something is which, like Michael, makes me sad.

      • I agree. Inductive inference is one of the big questions. It’s basically the scientific method itself. One could do worse than starting with John Stuart Mill’s book A System of Logic, Ratiocinative and Inductive for a philosophical grand tour of the problems. Just make sure to get one of the later editions.

        I think the reason machine learning is so popular is that it put the focus on prediction rather than null hypothesis significance testing. That’s a huge sea change compared to traditional statistics practice (as defined by most intros to statistics in course or textbook form). Now this is not just machine learning. A lot of pragmatic Bayesians (my new term for those of us who aren’t either pure “objective” or “subjective” Bayesians) also focus on prediction.

  3. Last year I attended a short course by Garrett Grolemund. He pointed out that there is not a “Biological Science”, rather we talk about the biological sciences and there is not a “Physical Science”, rather we talk about the physical sciences. He thinks (and I agree) that we should talk more about the data sciences than about a specific field called “Data Science”: statistics would be one of the data sciences, but parts of mathematics, computer science, and others would also be in the data sciences.

    • UT Austin has a Department of Statistics and Data Sciences.

      The history of this name is interesting: The university never had a stand-alone “Department of Statistics” — there were statistics groups in Mathematics, Engineering, Educational Psychology, Business, Sociology, and perhaps other departments. So a number of years ago, there was a move to establish a Department of Statistics, but for reasons of internal policies, etc., it seemed better to propose a Division of Statistics. There was also, at the same time, a move to establish a Division of Scientific Computing, since there were people in this field scattered in several departments. At some point, the Dean of Natural Sciences decided to combine these efforts, and a Division of Statistics and Scientific Computation was proposed and approved by the administration. After a few years, this got upgraded to the current Department of Statistics and Data Sciences.

  4. Disciplines have a way of trying to subsume and denigrate other disciplines; so who is the author of “chemistry is merely the physics of the outer ring”?

  5. I agree with those who’ve commented that “data science” is a misnomer. The data sciences are those that share the practice of using data to study topics in their respective domains. It’s like calling astrophysicists “light scientists” because they use light to study stars.

    I do think there is potential for a standalone discipline called “information sciences” since there are actual scientists who study information itself–which, unlike data, is (arguably) not a purely human-created construct. But it’s probably inextricably embedded in a dozen or so different applied disciplines.

      • Your question (I think) is equivalent to asking, “What is information?” or “What is a useful definition of information?”

        The simplest answer to your question is just to quote the Wikipedia entry for information: “The concept of information has different meanings in different contexts.[1] Thus the concept becomes related to notions of constraint, communication, control, data, form, education, knowledge, meaning, understanding, mental stimuli, pattern, perception, representation, and entropy.”

        I have no doubt that semiotics has its own version of an answer to this question, but it’s probably not perfectly compatible with the answer you’d get from statistics or information theory.

        I’m sincerely intrigued by your assertion that “information is not information before it is noticed.” It makes me wonder what the “it” is before “it” is noticed. Entropy? Complexity? Potential information? The thing you are calling “it,” the thing that turns into what you are calling information once it’s perceived, is what I mean when I talk about the thing studied by information sciences.

        Whatever that thing is or is called, it seems unavoidable to conclude that it exists outside of our perception. For example: The thing that is in both a sample of raw observations and in its sample mean, and is dependent upon the population mean, is an attribute of the sample before, during and after I compute the mean, or if I never compute it. My actual knowledge of any of these quantities does not affect whether the sample obeys the central limit theorem.

        Deep thoughts!

        • Keith’s taking the philosophically pragmatic approach that constructs like “data science” are the inventions of cognitive agents rather than discoveries. Information is then encoded relative to the set of concepts we employ. The contrast is a more Platonic philosophy in which concepts like “atom” are out there for us to discover. Arguing about whether atoms are real is a fun philosophical exercise in semantics and pragmatics (the philosophical kind, not the linguistic kind).

  6. “Is data science a discipline”

    I hope nothing important hinges on the answer to this question. The debate seems knee-deep in fuzzy definitions, slippery logic, and at least a little bit of posturing. I keep wanting to add “ish” to many of the terms: “data science is engineeringish,” “data science has some deepish questions,” “data science is scienceish.”

    Maybe it is better to let each area of study be what it is and be content with some generalizations that are more or less true, kind of useful, malleable over time, and variable across contexts.

  7. I’m on the same line of thought as Terry and Keith. The question is boring and irrelevant – only academia cares whether it is a new field, a science, or whatever. A related issue is the myriad attempts to distinguish between data scientist, statistician, business analyst, data engineer, business intelligence analyst, etc. I’ve read so many attempts to distinguish between these and absolutely none of them make any sense. Among my favorites: one definition said that business analysts look at data from the past while data scientists are concerned with predicting the future – as if it makes sense to predict the future without looking at the past and that it makes sense to look at the past and not be thinking about the future. McKinsey made their own attempt to distinguish these positions – but if you look at their own job ads, they list all the analytics-type jobs under all the titles they just attempted to distinguish. All I can say is these definitional games are not converging and not helpful.

    • “only academia cares whether it is a new field, a science, or whatever”

      Well, academia and licensing boards and those selling certification. I wonder how the INFORMS attempt to provide “certification” for statistics/analytics has worked out? Or the voluntary professional accreditation program of the American Statistical Association?

      https://www.informs.org/ORMS-Today/Public-Articles/October-Volume-39-Number-5/Certified-Analytics-Professional
      https://www.amstat.org/asa/files/pdfs/accreditation/Guidelines.pdf

    • “as if it makes sense to predict the future without looking at the past and that it makes sense to look at the past and not be thinking about the future.”

      +1

      Or to have a financial person who has no expertise in data working with historical data, or a data person who has no expertise in finance predicting financial outcomes….

    • Dale > only academia cares whether it is a new field
      That may be true, but we need/want academia to function well in order to advance current methods and enable more capable students. Whether it is a new field is a big question, and it affects what careers will be, who gets them, how they progress and who ends up in charge.

      • Keith
        I don’t disagree with this, but it is the tip of an iceberg. Where there are standardized credentialing exams, it becomes simple – nurses, physicians, pharmacists, CPAs, etc. either have the credential or not (and sometimes a particular degree is required, sometimes it is just based on an exam and/or experience). In some fields, the academic degree stands as a proxy for some body of knowledge – true for many STEM degrees. However, there are plenty of academic disciplines where career preparedness has little relationship with the degree: e.g., it is hard to know what a person with a sociology (or political science, or economics, or history, or……) degree has the capability to accomplish by virtue of their degree. I am not downgrading such degrees, merely pointing out that the definition of the field does not map very well into a set of particular skills. So, with data science, does such a degree convey knowledge of what career a person can be successful at?

        I think this is difficult, as some careers demand you know Python, others R, others AWS, etc etc. The certification attempts mentioned by zbicyclist above are attempts to rise above these details and concentrate on more basic reasoning about data – a worthy goal, in my mind. Supplementing such a certification with certifications in particular technologies (e.g. SAS, Google Analytics, etc.) would seem to provide a great deal of information relevant to careers and capabilities. I think many people (admittedly, a conjecture on my part) would agree that the specific certifications are not what we are talking about when we think about data science. So, the academic field would be defined by the type of things the more general certifications are aimed at. When/if there is a generally accepted certification, the field will be relatively easy to identify.

        However, it may be hard to get there. We have many people with business degrees, many with MBAs, but there is no generally accepted set of fundamental questions that define the field. There is no licensing body either (though there are accrediting bodies for academic programs, with limited value for careers). There are a number of reasons for this, but I think the most fundamental is that the variability of careers is too large to standardize the “deep questions” of the field. Is data science closer to business or to engineering in that respect? I guess that is the question, and I’m not sure of the answer.

    • Semantics is usually boring. And annoying (Mitzi tells me there’s a special place in hell for us semanticists; I got so deep I wrote a textbook in semantics while teaching the class because nothing else went deep enough mathematically). Semantics is also usually relevant as it frames how people speak about things. The semantics of “data science”, “machine learning”, and “statistics” play into how science funding is allocated, how high school classes are organized and taught, and how HR departments at companies sort through applicants. Arguably the pragmatics around this of how it plays into decisions and conversations is even more important (that’s “pragmatic” in the linguistic sense, not the philosophical one, Keith, but that’s also relevant here).

      The reason companies are distinguishing these positions is to help sort their huge number of applicants. People working on visualization in Python and sampling algorithms in C++ might both refer to themselves as data scientists. Recently, I met someone who said they work on “featurizing” data sets to hand off to statisticians. I’m sure someone has a term for data scientists that’s analogous to “full stack web developer.”

  8. In my opinion the only way to consider data science as something new is if we include the aspects of computer science in it, particularly regarding data storage and management. Development of such tools, allowing effective computation on large and heterogeneous datasets, certainly is out of scope of statistics. On the other hand, it is so data dependent that it cannot be fully placed under the CS umbrella. Unless of course we consider statistics also part of computer science.

  9. I’m late to this, but as someone who does a fair amount of hiring in industry for Stats, OR, Economics, CS/SW Engineering, etc, I’ll add that when I see “Data Science” on someone’s resume, I have no idea what their skill set is, and I tend to assume that they learned a broad number of topics superficially, rather than one thing deeply. I get what the point is: a mix of computing and data analysis/modeling skills, but the term data science is too broad to be useful, in my opinion. Maybe it has some utility as an umbrella for interdisciplinary research?
