Matt Asay
Contributing Writer

Data science needs drudges

analysis
Aug 3, 2021

Quality data science outputs depend on quality inputs. Data cleansing and preparation may not be exciting work, but it’s critical.


Data scientist may be one of the sexiest jobs of our century, as Harvard Business Review opines, but it sure does involve a lot of unsexy, manual labor. According to Anaconda’s 2021 State of Data Science survey, respondents said they spend “39% of their time on data prep and data cleansing, which is more than the time spent on model training, model selection, and deploying models combined.”

Data scientist? More like data janitor.

Not that there’s anything wrong with that. In fact, there’s much that is right with it. For years we’ve oversold the glamorous side of data science (build models that cure cancer!) while overlooking the simple reality that much of data science is cleaning and preparing data, and this aspect of data science is fundamental to doing data science well. As consultant Aaron Zhu notes, “Any statistical analysis and machine learning models can be as good as the quality of the data you feed into them.”

Someone’s got to get their hands dirty

Whether you see it as positive or negative, the time spent on data wrangling (data prep and cleaning) seems to be declining. Although data scientists today report they spend 39% of their time on data wrangling, last year the same Anaconda survey reported that number was 45%. Just a few years ago, the number might have been closer to 80%, by some estimates.

Such sky-high estimates were almost certainly incorrect, as Leigh Dodds of the Open Data Institute has argued. Worse, he insists, by demeaning the act of data wrangling we misunderstand the value of that wrangling. “I would argue that spending time working with data to transform, explore, and understand it better is absolutely what data scientists should be doing. This is the medium they are working in. Understand the material better and you’ll get better insights.”

In other words, while we might want to focus on data science outputs, we can’t do so effectively if we’ve overlooked the inputs. Garbage in, garbage out.

The people part of data science

For as long as we’ve been talking about data science and its ancestor “big data,” we’ve wrung our hands about machines obviating the need for people. This is true for data science as a category, but also for data wrangling as an input to that category.

It’s tempting to think that we can simply automate all of this data prep. How much thought can go into cleaning up data, after all? But the reality is that although some data work can be automated, it is ultimately a human task. Why? Data wrangling is a “critical part of the analytical process,” as suggested by Tim Stobierski, a contributing writer for Harvard Business School Online. It requires someone who can “understand what clean data looks like and how to shape raw data into usable forms.” For example, during the discovery phase of data wrangling, you need someone who can see gaps in the data as well as patterns.
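To make “seeing gaps” a little more concrete, here is a minimal sketch of the kind of first-pass check a person might run during discovery. It is only an illustration: the records, the column names, and the `missing_share` helper are all hypothetical, and it assumes that `None` or an empty string marks a missing value.

```python
from collections import Counter

# Hypothetical sample records; None or "" is assumed to mark a missing value.
records = [
    {"customer_id": 1, "region": "west", "spend": 120.0},
    {"customer_id": 2, "region": "", "spend": None},
    {"customer_id": 3, "region": "east", "spend": 87.5},
    {"customer_id": 4, "region": None, "spend": 43.0},
]


def missing_share(rows):
    """Return the fraction of missing values for each column seen in rows."""
    if not rows:
        return {}
    # Collect column names in first-seen order so the report is stable.
    columns = []
    for row in rows:
        for col in row:
            if col not in columns:
                columns.append(col)
    # Count rows where a column is absent, None, or an empty string.
    missing = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                missing[col] += 1
    return {col: missing[col] / len(rows) for col in columns}


gaps = missing_share(records)
for col, share in gaps.items():
    print(f"{col}: {share:.0%} missing")
```

A check like this only flags where to look. Deciding whether half the regions being blank is a data entry problem, a business reality, or a reason to drop the column is the part that still needs a person who knows the domain.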

Or, as noted in the Anaconda 2021 report, “While data preparation and data cleansing are time-consuming and potentially tedious, automation is not the solution. Instead, having a human in the mix ensures data quality, more accurate results, and provides context for the data.”

This has always been the case. In the early days of big data, we imagined a world in which we could just throw data at Apache Hadoop and out would pop “actionable insights.” However, life and data science don’t work that way. As I wrote back in 2014, ultimately data science is a matter of people. “Those who do data science well blend statistical, mathematical, and programming skills with domain knowledge.” That domain knowledge enables human creativity with data. The more familiar a person is with their business, the better they can prepare that data for modeling, and the more likely they are to intuit insights from patterns and anomalies.

Domain knowledge also should help with the eventual output of data science models. According to the Anaconda report, only “36% of people said their organization’s decision-makers are very data literate and understand the stories told by visualizations and models. In comparison, 52% described their organization’s decision-makers as mostly data literate but needing some coaching on the stories told by visualizations and models.” Well, that may partly be a problem with the recipients of the models/visualizations, but it also arguably has to do with the data scientists preparing them. Greater familiarity with their domains should enable them to more clearly articulate how their machine learning models describe what the business can learn from its data.

Again, that domain knowledge doesn’t start to become useful when the data scientist is on the final sprint to the boardroom with the models. It starts early, in the not-so-lowly task of data wrangling that is the foundation for all good data science. We should celebrate it, not deprecate it.

Matt Asay

Matt Asay runs developer marketing at Oracle. Previously Asay ran developer relations at MongoDB, and before that he was a Principal at Amazon Web Services and Head of Developer Ecosystem for Adobe. Prior to Adobe, Asay held a range of roles at open source companies: VP of business development, marketing, and community at MongoDB; VP of business development at real-time analytics company Nodeable (acquired by Appcelerator); VP of business development and interim CEO at mobile HTML5 start-up Strobe (acquired by Facebook); COO at Canonical, the Ubuntu Linux company; and head of the Americas at Alfresco, a content management startup. Asay is an emeritus board member of the Open Source Initiative (OSI) and holds a JD from Stanford, where he focused on open source and other IP licensing issues. The views expressed in Matt’s posts are Matt’s, and don’t represent the views of his employer.
