New data management and integration solutions featuring AI and machine learning signal that help is on the way to meet the ballooning enterprise data challenge.
Artificial intelligence and machine learning already deliver plenty of practical value to enterprises, from fraud detection to chatbots to predictive analytics. But the audacious creative writing skills of ChatGPT have raised expectations for AI/ML to new heights. IT leaders can’t help but wonder: Could AI/ML finally be ready to go beyond point solutions and address core enterprise problems?
Take the biggest, oldest, most confounding IT problem of all: managing and integrating data across the enterprise. Today, that endeavor cries out for help from AI/ML technologies, as the volume, variety, variability, and distribution of data across on-prem and cloud platforms climb an endless exponential curve. As Stewart Bond, IDC’s VP of data integration and intelligence software, puts it: “You need machines to be able to help you to manage that.”
Can AI/ML really help impose order on data chaos? The answer is a qualified yes, but the industry consensus is that we’re just scratching the surface of what may one day be achievable. Integration software incumbents such as Informatica, IBM, and SnapLogic have added AI/ML capabilities to automate various tasks, and a flock of newer companies such as Tamr, Cinchy, and Monte Carlo put AI/ML at the core of their offerings. None come close to delivering AI/ML solutions that automate data management and integration processes end to end.
That simply isn’t possible. No product or service can reconcile every data anomaly without human intervention, let alone reform a muddled enterprise data architecture. What these new AI/ML-driven solutions can do today is reduce manual labor substantially across a variety of data wrangling and integration efforts, from data cataloging to building data pipelines to improving data quality.
Those can be noteworthy wins. But to have real, lasting impact, a CDO (chief data officer) approach is required, as opposed to the impulse to grab integration tools for one-off projects. Before enterprises can prioritize which AI/ML solutions to apply where, they need a coherent, top-down view of their entire data estate (customer data, product data, transaction data, event data, and so on) and a complete understanding of the metadata defining those data types.
The scope of the enterprise data problem
Most enterprises today maintain a vast expanse of data stores, each one associated with its own applications and use cases, a proliferation that cloud computing has exacerbated as business units quickly spin up cloud applications with their own data silos. Some of those data stores may be used for transactions or other operational activities, while others (mainly data warehouses) serve those engaged in analytics or business intelligence.
To further complicate matters, “every organization on the planet has more than two dozen data management tools,” says Noel Yuhanna, a VP and principal analyst at Forrester Research. “None of those tools talk to each other.” These tools handle everything from data cataloging to MDM (master data management) to data governance to data observability and more. Some vendors have infused their wares with AI/ML capabilities, while others have yet to do so.
At a basic level, the primary purpose of data integration is to map the schemas of various data sources so that different systems can share, sync, and/or enrich data. The latter is a must-have for developing a 360-degree view of customers, for example. But seemingly simple tasks, such as determining whether customers or companies with the same name are the same entity, and which details from which records are correct, require human intervention. Domain experts are often called upon to help establish rules to handle various exceptions.
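To make the exception problem concrete, here is a minimal, hypothetical sketch of the kind of hand-written matching logic a domain expert might encode in a rules engine. The company names, aliases, and fields are invented for illustration; real MDM rule sets run into the thousands of rules.

```python
# Hypothetical rules of the kind a domain expert might maintain in an MDM
# rules engine; every new edge case tends to require another rule.

LEGAL_SUFFIXES = {"inc", "inc.", "incorporated", "llc", "ltd", "corp", "corporation", "co"}

# Special case: an alias table curated by hand (invented example).
KNOWN_ALIASES = {"ibm": "international business machines"}

def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation and legal suffixes before comparing."""
    tokens = name.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def same_company(a: dict, b: dict) -> bool:
    """Rule-based match: normalized names first, then the special cases pile up."""
    na, nb = normalize_company(a["name"]), normalize_company(b["name"])
    if na == nb:
        return True
    if KNOWN_ALIASES.get(na) == nb or KNOWN_ALIASES.get(nb) == na:
        return True
    # Another special case: identical website domains imply the same entity.
    if a.get("domain") and a.get("domain") == b.get("domain"):
        return True
    return False

print(same_company({"name": "IBM Corp."}, {"name": "International Business Machines"}))  # True
print(same_company({"name": "Acme Inc."}, {"name": "Acme GmbH"}))  # False: needs yet another rule
```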
Those rules are typically stored within a rules engine embedded in integration software. Michael Stonebraker, one of the inventors of the relational database, is a founder of Tamr, which has developed an ML-driven MDM system. Stonebraker offers a real-world example to illustrate the limitations of rules-based systems: a major media company that created a “homebrew” MDM system that has been accumulating rules for 12 years.
“They’ve written 300,000 rules,” says Stonebraker. “If you ask somebody, how many rules can you grok, a typical number is 500. Push me hard and I’ll give you 1,000. Twist my arm and I’ll give you 2,000. But 50,000 or 100,000 rules is completely unmanageable. And the reason that there are so many rules is there are so many special cases.”
Anthony Deighton, Tamr’s chief product officer, claims that his MDM solution overcomes the brittleness of rules-based systems. “What’s nice about the machine learning based approach is when you add new sources, or more importantly, when the data shape itself changes, the system can adapt to those changes gracefully,” he says. As with most ML systems, however, ongoing training using large quantities of data is required, and human judgment is still needed to resolve discrepancies.
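Tamr has not published the internals described here, so the following is only a generic sketch of the approach Deighton outlines: represent each candidate pair of records as similarity features and let a classifier trained on human-labeled pairs decide whether they match, instead of accumulating hand-written rules. The records, labels, and features are invented, and the sketch assumes scikit-learn is installed.

```python
# Generic sketch (not Tamr's actual implementation): learn to match records
# from human-labeled pairs instead of accumulating hand-written rules.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(a: dict, b: dict) -> list[float]:
    """Turn a record pair into similarity features the model can learn from."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city_match = 1.0 if a.get("city", "").lower() == b.get("city", "").lower() else 0.0
    zip_match = 1.0 if a.get("zip") == b.get("zip") else 0.0
    return [name_sim, city_match, zip_match]

# Tiny invented training set: pairs labeled by human reviewers (1 = same entity).
pairs = [
    ({"name": "Acme Inc", "city": "Austin", "zip": "78701"},
     {"name": "ACME Incorporated", "city": "Austin", "zip": "78701"}, 1),
    ({"name": "Acme Inc", "city": "Austin", "zip": "78701"},
     {"name": "Apex Industries", "city": "Boston", "zip": "02108"}, 0),
    ({"name": "Globex Corp", "city": "Springfield", "zip": "62701"},
     {"name": "Globex Corporation", "city": "Springfield", "zip": "62701"}, 1),
    ({"name": "Globex Corp", "city": "Springfield", "zip": "62701"},
     {"name": "Initech LLC", "city": "Dallas", "zip": "75201"}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = LogisticRegression().fit(X, y)

# When a new source arrives, the same features apply; retraining (plus human
# review of low-confidence pairs) lets the system adapt without new rules.
candidate = ({"name": "ACME, Inc.", "city": "Austin", "zip": "78701"},
             {"name": "Acme Incorporated", "city": "Austin", "zip": "78701"})
print(model.predict_proba([pair_features(*candidate)])[0][1])  # match probability
```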
AI/ML is not a magic bullet. But it can provide highly valuable automation, not only for MDM, but across many areas of data integration. To take full advantage, however, enterprises need to get their house in order.
Weaving AI/ML into the data fabric
“Data fabric” is the operative phrase used to describe the crazy quilt of useful data across the enterprise. Scoping out that fabric begins with knowing where the data is, and cataloging it. That task can be partially automated using the AI/ML capabilities of such solutions as Informatica’s AI/ML-infused CLAIRE engine or IBM’s Watson Knowledge Catalog. Other cataloging software vendors include Alation, BigID, Denodo, and OneTrust.
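Those engines are proprietary, but the raw material they work from is easy to picture. Here is a minimal sketch, assuming a SQLite database and an invented table, of the kind of schema scan and column profiling a catalog starts with; commercial products layer ML-based classification (PII detection, semantic typing) on top of output like this.

```python
# Minimal sketch of automated catalog scanning: harvest schema metadata and
# simple column profiles from a database. Table and data are invented.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, signup_date TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'ana@example.com', '2024-01-15')")

catalog = []
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    for _, col, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
        rows = [r[0] for r in conn.execute(f"SELECT {col} FROM {table} LIMIT 100")]
        non_null = [v for v in rows if v is not None]
        # Naive semantic tagging; a catalog product would use trained classifiers.
        looks_like_email = any(
            isinstance(v, str) and re.match(r"[^@]+@[^@]+\.[^@]+", v) for v in non_null)
        catalog.append({
            "table": table, "column": col, "type": col_type,
            "null_fraction": 1 - len(non_null) / max(len(rows), 1),
            "tag": "email/PII?" if looks_like_email else None,
        })

for entry in catalog:
    print(entry)
```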
Gartner research director Robert Thanaraj’s message to CDOs is that “you need to architect your fabric. You buy the necessary technology components, you build, and you orchestrate in accordance with your desired outcomes.” That fabric, he says, should be “metadata-driven,” woven from a compilation of all the salient information that surrounds enterprise data itself.
His advice for enterprises is to “invest in metadata discovery.” This includes “the patterns of people working with people in your organization, the patterns of people working with data, and the combinations of data they use. What combinations of data do they reject? And what patterns of where the data is stored, patterns of where the data is transmitted?”
Jitesh Ghai, the chief product officer of Informatica, says Informatica’s CLAIRE engine can help enterprises derive metadata insights and act upon them. “We apply AI/ML capabilities to deliver predictive data… by linking all of the dimensions of metadata together to give context.” Among other things, this predictive data intelligence can help automate the creation of data pipelines. “We auto generate mapping to the common elements from various source items and adhere it to the schema of the target system.”
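Informatica hasn’t detailed how CLAIRE does this, so the snippet below is only an illustrative sketch of the general idea of auto-generated mapping: score each source column against a target schema by name similarity and propose the best match, with a confidence a human can accept or reject. The column names are invented; production tools also weigh data types, value distributions, and lineage metadata.

```python
# Generic sketch (not Informatica's CLAIRE): propose source-to-target column
# mappings by name similarity, for a human to accept or reject.
from difflib import SequenceMatcher

source_columns = ["cust_name", "cust_email_addr", "birth_dt", "postal_cd"]   # invented
target_schema  = ["customer_name", "email_address", "birth_date", "postal_code"]

def similarity(a: str, b: str) -> float:
    """Crude string similarity; real tools also compare types and value profiles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

proposed = {}
for src in source_columns:
    best = max(target_schema, key=lambda tgt: similarity(src, tgt))
    proposed[src] = (best, round(similarity(src, best), 2))

for src, (tgt, score) in proposed.items():
    print(f"{src:18} -> {tgt:15} (confidence {score})")
```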
IDC’s Stewart Bond notes that the SnapLogic integration platform has similar pipeline functionality. “Because they’re cloud-based, they look at… all their other customers that have built up pipelines, and they can figure out what is the next best Snap: What’s the next best action you should take in this pipeline, based on what hundreds or thousands of other customers have done.”
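As a rough illustration of that “next best step” idea (not SnapLogic’s actual algorithm), the sketch below simply counts which step most often follows the current one across a few invented historical pipelines and surfaces the top candidates as suggestions.

```python
# Minimal sketch of "next best step" recommendation: count which step most
# often follows the current one across historical pipelines, then suggest it.
# Illustrative only; pipeline step names are invented.
from collections import Counter, defaultdict

historical_pipelines = [
    ["read_salesforce", "map_fields", "dedupe", "load_snowflake"],
    ["read_salesforce", "map_fields", "filter_nulls", "load_snowflake"],
    ["read_s3_csv", "map_fields", "dedupe", "load_redshift"],
]

next_step_counts: dict[str, Counter] = defaultdict(Counter)
for pipeline in historical_pipelines:
    for current, following in zip(pipeline, pipeline[1:]):
        next_step_counts[current][following] += 1

def recommend_next(current_step: str, k: int = 2) -> list[tuple[str, int]]:
    """Return the k most common next steps seen after this one."""
    return next_step_counts[current_step].most_common(k)

print(recommend_next("map_fields"))  # e.g. [('dedupe', 2), ('filter_nulls', 1)]
```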
Bond observes, however, that in both cases recommendations are being made by the system rather than the system acting independently. A human must accept or reject those recommendations. “There’s not a lot of automation happening there yet. I would say that even in the mapping, there’s still a lot of opportunity for more automation, more AI.”
Improving data quality
According to Bond, the area where AI/ML is having the most impact is data quality. Forrester’s Yuhanna agrees: “AI/ML is really driving improved quality of data,” he says. That’s because ML can discover and learn from patterns in large volumes of data and recommend new rules or adjustments that humans lack the bandwidth to determine.
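To show the flavor of a machine-suggested rule, here is a minimal sketch that derives an outlier check from observed values using robust statistics, with a human steward left to approve it. The column name, values, and threshold are invented.

```python
# Sketch: derive a candidate data-quality rule from observed values instead of
# asking a human to enumerate it. Column name and data are invented.
import statistics

order_amounts = [120.0, 98.5, 143.2, 101.0, 99.9, 5120.0, 110.3, 97.8]

# Robust statistics (median / MAD) so one bad value doesn't hide itself.
median = statistics.median(order_amounts)
mad = statistics.median(abs(v - median) for v in order_amounts)
threshold = 10 * mad

suggested_rule = {
    "column": "order_amount",
    "check": f"abs(value - {median}) <= {threshold}",
    "rationale": "learned from 8 observed values",
}
flagged = [v for v in order_amounts if abs(v - median) > threshold]

# A human data steward reviews the suggestion before it becomes an enforced check.
print(suggested_rule)
print("values that would be flagged:", flagged)  # [5120.0]
```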
High-quality data is essential for transaction and other operational systems that handle vital customer, employee, vendor, and product data. But it can also make life much easier for data scientists immersed in analytics.
It’s often said that data scientists spend 80 percent of their time cleaning and preparing data. Michael Stonebraker takes issue with that estimate: He cites a conversation he had with a data scientist who said she spends 90 percent of her time identifying data sources she wants to analyze, integrating the results, and cleaning the data. She then spends 90 percent of the remaining 10 percent of her time fixing cleaning errors. Any AI/ML data cataloging or data cleansing solution that can give her a chunk of that time back is a game changer.
Data quality is never a one-and-done exercise. The ever-changing nature of data and the many systems it passes through have given rise to a new category of solutions: data observability software. “What this category is doing is observing data as it’s flowing through data pipelines. And it’s identifying data quality issues,” says Bond. He calls out the startups Anomalo and Monte Carlo as two players who claim to be “using AI/ML to monitor the six dimensions of data quality”: accuracy, completeness, consistency, uniqueness, timeliness, and validity.
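Those products’ internals aren’t public, but the underlying checks are easy to picture. Here is a generic sketch of simple batch checks for four of the six dimensions (completeness, uniqueness, validity, timeliness); accuracy and consistency generally require reference data or cross-system comparison. The field names and records are invented.

```python
# Generic observability sketch (not Anomalo's or Monte Carlo's implementation):
# simple checks against a batch of records for four of the six quality
# dimensions. Field names and data are invented.
from datetime import datetime, timedelta, timezone

records = [
    {"id": 1, "email": "ana@example.com", "updated_at": "2024-05-01T10:00:00+00:00"},
    {"id": 2, "email": None,              "updated_at": "2024-05-01T10:05:00+00:00"},
    {"id": 2, "email": "bo@example.com",  "updated_at": "2023-01-01T00:00:00+00:00"},
]

def check_batch(rows, now=None):
    now = now or datetime.now(timezone.utc)
    ids = [r["id"] for r in rows]
    latest = max(datetime.fromisoformat(r["updated_at"]) for r in rows)
    return {
        # Completeness: fraction of rows with a non-null email.
        "completeness_email": sum(r["email"] is not None for r in rows) / len(rows),
        # Uniqueness: no duplicate primary keys.
        "uniqueness_id": len(set(ids)) == len(ids),
        # Validity: every non-null email contains an "@".
        "validity_email": all("@" in r["email"] for r in rows if r["email"]),
        # Timeliness: data refreshed within the last day.
        "timeliness": (now - latest) < timedelta(days=1),
    }

print(check_batch(records, now=datetime(2024, 5, 1, 12, tzinfo=timezone.utc)))
```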
If this sounds a little like the continuous testing essential to devops, that’s no coincidence. More and more companies are embracing dataops, where “you’re doing continuous testing of the dashboards, the ETL jobs, the things that make those pipelines run and analyze the data that’s in those pipelines,” says Bond. “But you also add statistical control to that.”
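Statistical control in this context can be as simple as treating a pipeline metric like a control chart. The sketch below, with invented row counts, flags a run whose volume falls outside three standard deviations of recent history; a real dataops platform would track many such metrics per dataset.

```python
# Sketch of "statistical control" applied to a pipeline metric: alert when
# today's row count drifts more than 3 sigma from recent history.
# The counts are invented for illustration.
import statistics

daily_row_counts = [10_120, 9_980, 10_240, 10_060, 9_910, 10_180, 10_030]  # last 7 runs
todays_count = 6_450

mean = statistics.mean(daily_row_counts)
sigma = statistics.pstdev(daily_row_counts)

if abs(todays_count - mean) > 3 * sigma:
    print(f"ALERT: row count {todays_count} outside control limits "
          f"({mean - 3 * sigma:.0f} .. {mean + 3 * sigma:.0f})")
else:
    print("row count within control limits")
```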
The hitch is that observing a problem in the data happens after the fact. You can’t prevent bad data from reaching users without bringing pipelines to a screeching halt. But as Bond says, when a dataops team member applies a correction and captures it, “then a machine can make that correction the next time that exception occurs.”
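That capture-and-replay loop can be sketched in a few lines: record the fix a human made for a given bad value, and apply it automatically whenever the same exception recurs. The column and values below are invented.

```python
# Sketch of the correction-capture idea Bond describes: a human fixes an
# exception once, the fix is recorded, and the pipeline applies it
# automatically the next time the same bad value appears. Values are invented.
captured_corrections = {}  # (column, bad_value) -> corrected_value

def record_correction(column, bad_value, corrected_value):
    """Called when a dataops team member resolves an exception by hand."""
    captured_corrections[(column, bad_value)] = corrected_value

def apply_known_corrections(row):
    """Called inside the pipeline on every row before it reaches users."""
    return {col: captured_corrections.get((col, val), val) for col, val in row.items()}

record_correction("country", "U.SA.", "USA")                   # human fixes it once
print(apply_known_corrections({"id": 7, "country": "U.SA."}))   # auto-fixed next time
```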
More intelligence to come
Data management and integration software vendors will continue to add useful AI/ML functionality at a rapid clip, automating data discovery, mapping, transformation, pipelining, governance, and so on. Bond notes, however, that we have a black box problem: “Every data vendor will say their technology is intelligent. Some of it is still smoke and mirrors. But there is some real AI/ML stuff happening deep within the core of these products.”
The need for that intelligence is clear. “If we’re going to provision data and we’re going to do it at petabyte scale across this heterogeneous, multicloud, fragmented environment, we need to apply AI to data management,” says Informatica’s Ghai. Ghai even has an eye toward OpenAI’s GPT-3 family of large language models. “For me, what’s most exciting is the ability to understand human text instruction,” he says.
No product, however, possesses the intelligence to rationalize data chaos or clean up data unassisted. “A fully automated fabric is not going to be possible,” says Gartner’s Thanaraj. “There has to be a balance between what can be automated, what can be augmented, and what could be compensated still by humans in the loop.”
Stonebraker cites another limitation: the severe shortage of AI/ML talent. There’s no such thing as a turnkey AI/ML solution for data management and integration, so AI/ML expertise is necessary for proper implementation. “Left to their own devices, enterprise people make the same kinds of mistakes over and over again,” he says. “I think my biggest advice is if you’re not facile at this stuff, get a partner that knows what they’re doing.”
The flip side of that statement is that if your data architecture is basically sound, and you have the talent available to ensure you can deploy AI/ML solutions correctly, a substantial amount of tedium for data stewards, analysts, and scientists can be eliminated. As these solutions get smarter, those gains will only increase.


