Quality data science outputs depend on quality inputs. Data cleansing and preparation may not be exciting work, but they’re critical.
Data scientist may be one of the sexiest jobs of our century, as Harvard Business Review opines, but it sure does involve a lot of unsexy, manual labor. According to Anaconda’s 2021 State of Data Science survey, respondents said they spend “39% of their time on data prep and data cleansing, which is more than the time spent on model training, model selection, and deploying models combined.”
Data scientist? More like data janitor.
Not that there’s anything wrong with that. In fact, there’s much that is right with it. For years we’ve oversold the glamorous side of data science (build models that cure cancer!) while overlooking the simple reality that much of data science is cleaning and preparing data, and that this work is fundamental to doing data science well. As consultant Aaron Zhu notes, “Any statistical analysis and machine learning models can be as good as the quality of the data you feed into them.”
Someone’s got to get their hands dirty
Positive or negative, time spent on data wrangling (data prep and cleaning) seems to be declining. Although data scientists today report they spend 39% of their time on data wrangling, last year the same Anaconda survey put that number at 45%. Just a few years ago, the number might have been closer to 80%, by some estimates.
Such sky-high estimates were almost certainly incorrect, as Leigh Dodds of the Open Data Institute has argued. Worse, he insists, by demeaning the act of data wrangling we misunderstand the value of that wrangling. “I would argue that spending time working with data to transform, explore, and understand it better is absolutely what data scientists should be doing. This is the medium they are working in. Understand the material better and you’ll get better insights.”
In other words, while we might want to focus on data science outputs, we can’t do so effectively if we’ve overlooked the inputs. Garbage in, garbage out.
The people part of data science
For as long as we’ve been talking about data science and its ancestor “big data,” we’ve wrung our hands about machines obviating the need for people. This is true for data science as a category, but also for data wrangling as an input to that category.
It’s tempting to think that we can simply automate all of this data prep. How much thought can go into cleaning up data, after all? But the reality is that although some data work can be automated, it is ultimately a human task. Why? Data wrangling is a “critical part of the analytical process,” as suggested by Tim Stobierski, a contributing writer for Harvard Business School Online. It requires someone who can “understand what clean data looks like and how to shape raw data into usable forms.” For example, during the discovery phase of data wrangling, you need someone who can see gaps in the data as well as patterns.
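To make the discovery phase concrete, here is a minimal Python sketch of the kind of gap check a person might run before deciding what to do next. Everything in it is an assumption made for illustration: the `profile_missing` helper, the list of values treated as missing, and the sample records are all hypothetical. Note what the code cannot do: it only counts the gaps. Deciding whether a blank region is a data-entry error or a meaningful signal still takes someone who knows the business.

```python
from collections import Counter


def profile_missing(rows, missing_markers=("", None, "N/A", "null")):
    """Count, per column, how many rows hold a value treated as missing.

    The default markers are an assumption; real data sets often use
    their own placeholders for "no value".
    """
    counts = Counter()
    columns = []  # remember first-seen column order for a stable report
    for row in rows:
        for column, value in row.items():
            if column not in columns:
                columns.append(column)
            # Strip whitespace so "  " and " N/A " count as missing too
            normalized = value.strip() if isinstance(value, str) else value
            if normalized in missing_markers:
                counts[column] += 1
    return {column: counts[column] for column in columns}


# Hypothetical sample records, invented for this example
records = [
    {"customer": "Acme", "region": "West", "revenue": "1200"},
    {"customer": "Globex", "region": "", "revenue": "N/A"},
    {"customer": "Initech", "region": "East", "revenue": None},
]

report = profile_missing(records)
for column, missing in report.items():
    print(f"{column}: {missing} of {len(records)} missing")
```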
Or, as noted in the Anaconda 2021 report, “While data preparation and data cleansing are time-consuming and potentially tedious, automation is not the solution. Instead, having a human in the mix ensures data quality, more accurate results, and provides context for the data.”
This has always been the case. In the early days of big data, we imagined a world in which we could just throw data at Apache Hadoop and out would pop “actionable insights.” However, life and data science don’t work that way. As I wrote back in 2014, ultimately data science is a matter of people. “Those who do data science well blend statistical, mathematical, and programming skills with domain knowledge.” That domain knowledge enables human creativity with data. The more familiar a person is with their business, the better they’re able to prepare that data for modeling, and the more likely they are to intuit insights from patterns and anomalies.
Domain knowledge should also help with the eventual output of data science models. According to the Anaconda report, only “36% of people said their organization’s decision-makers are very data literate and understand the stories told by visualizations and models. In comparison, 52% described their organization’s decision-makers as mostly data literate but needing some coaching on the stories told by visualizations and models.” That may partly be a problem with the recipients of the models and visualizations, but it arguably also has to do with the data scientists preparing them. Greater familiarity with their domains should enable data scientists to articulate more clearly what their machine learning models say the business can learn from its data.
Again, that domain knowledge doesn’t start to become useful only when the data scientist is on the final sprint to the boardroom with the models. It starts early, in the not-so-lowly task of data wrangling that is the foundation for all good data science. We should celebrate it, not deprecate it.


