Data science as the application of theoretical knowledge

Patrick Atwater writes:

The insights that “much of what’s hard looks easy” and that it’s about “getting the damn data” highlight important points that the tech-industry-dominated definitions overlook in the excitement about production ML recommendation systems and the like.

Working to build from that grounded perspective, I put together a quick piece digging into what really defines data science, and I think the applied nature of the work that you hint at holds an important key. In many ways, the confused all-things-to-all-people nature echoes the fractal nature of a field like “management,” which devolves into poetic aphorisms and intellectual-lite books elucidating best practices while at the same time pulling from more formal academic disciplines (civil engineering, environmental science, and chemistry, for instance, in water management).

Consider an applied data science example. A friend at a water utility I work with built a billing calculator using R Shiny. That required something like a half day of analytical work and then a couple of weeks to get the UI/UX looking right and the servers playing nicely. Note that’s an analyst doing the work rather than a software engineer, which I think speaks to the interdisciplinary nature of data science and the oft-cited CS / Statistics / Domain expertise Venn diagram.

I don’t really have anything to say about this—the language is too far from mine—but I thought I’d share it with you.

3 thoughts on “Data science as the application of theoretical knowledge”

    • The Donoho piece is quite interesting. Contrast it with the recent (2012) article in the Journal of Statistics Education, “Statistical Education in the 21st Century: a Review of Challenges, Teaching Innovations and Strategies for Reform,” by Tishkovskaya and Lancaster. Remarkably (to me, at least) this survey, 56 pages long, does not contain any reference to data mining, machine learning, or data science! Donoho does a good job of showing what data science might be, and provides ample detail about the extent to which current data science efforts are bypassing statisticians. I know this issue has gotten plenty of attention, but it does seem like statistics and data science are like two ships passing in the night. I think the issue might deserve some more attention.

  1. I took a look at the water billing example and really hope the weeks were spent on the server and not the UI, which seems to have several problems.

    Why can’t I set irrigable area to 0 when living in a flat? Do I misunderstand the term?

    Evapotranspiration does not accept anything but whole numbers, even though the examples given include one digit after the decimal point.

    It is also very straightforward, which is obviously not a flaw, but implies that putting it together should not take more than a few hours. Maybe it would have been more efficient to have parts of the work done by a software engineer (or a sysadmin for the server) after all.

    I also don’t really get much from the blog post. All of the points mentioned seem to be a normal part of applied statistical work and do not really help to differentiate the data scientist.

    Disclosure: I tend to be skeptical about data science being anything but the IT crowd discovering more of statistics and feeling the need to put a fancy new term on it.
