Data governance is one of the toughest, dreariest problems in computing. Sadly, the tools offered with the major Hadoop distributions aren't really up to the task.
In the data business, we now have our own denier movement of people who claim that unless you start planning, you should keep throwing most of your data away. They call you a "data hoarder."
There's a bit of absurdity here. If you throw it away, you can't get it back; if you keep it, you can eventually organize and purge what you don't need. Those who store data now while getting their governance in place are not automatically "data hoarders." This is a false dilemma.
The idea that you need to come up with a perfect plan before keeping any data or bringing in any new sources is a little like saying we need perfect social justice for everyone before we can address police killings of African-Americans.
Instead, get started now. Stop throwing out the baby with the bathwater and begin finding your use cases. Meanwhile, make data the point rather than a side effect of your processes and govern it accordingly. These aren't "steps," but initiatives you need to undertake, usually in parallel.
That said, how do you go about planning? How do you start cataloging your data and establish some structure around its evolution? There are traditional solutions like those covered in last year's Forrester Wave report (Informatica, various IBM offerings, SAS, and Collibra, among others), but some of these come with a lot of baggage and form part of a vendor's overall platform play.
Meanwhile, a new class of data governance tools is being developed specifically for Hadoop. These tools have less of a legacy, but are also less mature. They are focused on the Hadoop ecosystem rather than your whole organization, allowing you to integrate them more closely with your new data architecture.
Cloudera Navigator
Navigator is Cloudera's closed-source offering for data governance. It incorporates both security auditing and metadata management, and it allows both integration with traditional data governance products like Informatica and automated data lineage tracking.
At its core, it tracks where data came from, what transformations happened to it, where the data landed, and where the heck it's located. You can even set up rules (policies) for automatically tagging data based on its type and origin.
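To make the idea concrete, here is a minimal Python sketch of a lineage record plus a type-and-origin tagging rule. It is not Navigator's API or data model; the names `LineageRecord`, `TagPolicy`, and `apply_policies`, along with the sample paths and tags, are all hypothetical and exist only to illustrate the concept.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LineageRecord:
    """One dataset's provenance: origin, transformations applied, and location."""
    name: str
    data_type: str   # e.g. "csv" or "parquet" (illustrative values)
    origin: str      # where the data came from
    location: str    # where the data landed
    transformations: list = field(default_factory=list)
    tags: set = field(default_factory=set)


@dataclass
class TagPolicy:
    """Hypothetical rule: apply `tag` when both the type and origin match."""
    tag: str
    data_type: Optional[str] = None      # None means "match any type"
    origin_prefix: Optional[str] = None  # None means "match any origin"

    def matches(self, record: LineageRecord) -> bool:
        type_ok = self.data_type is None or record.data_type == self.data_type
        origin_ok = (self.origin_prefix is None
                     or record.origin.startswith(self.origin_prefix))
        return type_ok and origin_ok


def apply_policies(record: LineageRecord, policies: list) -> set:
    """Add the tag of every matching policy to the record and return its tags."""
    for policy in policies:
        if policy.matches(record):
            record.tags.add(policy.tag)
    return record.tags


# Made-up example data: a cleaned table derived from a CRM export.
record = LineageRecord(
    name="orders_clean",
    data_type="parquet",
    origin="crm/exports/orders.csv",
    location="/warehouse/orders_clean",
    transformations=["drop_duplicates", "convert_csv_to_parquet"],
)
policies = [
    TagPolicy(tag="customer-data", origin_prefix="crm/"),
    TagPolicy(tag="raw", data_type="csv"),
]
print(sorted(apply_policies(record, policies)))
```

In this sketch only the origin-based rule fires, because the record's type is Parquet rather than CSV. A real governance tool would also persist these records and hook them into auditing, which this toy version leaves out.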
Navigator also allows you to trigger actions based on these policies, some of which aren't necessarily best done in Navigator (for example, triggering actions to archive or move data). Among the biggest concerns is that you can trigger auditing with or without Sentry, Cloudera's authorization module for Hadoop.
On the one hand, "choice is good," but on the other hand, if you go to the condiment counter at a fast food joint and find 15 brands of generic ketchup packages, which do you choose? I don't really need multiple paths for an audit implementation because... I just want to log the stuff already, and I don't care about choice for that.
Apache Atlas
Hortonworks is newer to the data governance game. It has proposed Apache Atlas, which was accepted into Apache's incubator (sometimes, but not always, a sign of project maturity). The rise to a top-level Apache project is a very political process.
Atlas has high hopes, but it's pretty early on in its development. It integrates with Apache Ranger according to the README.txt, though that's the only use of the word "Ranger" in the whole source repository, and it isn't a lot of code. While Atlas is part of Hortonworks' recent 2.3 release, it's clearly an early cut, and probably not the core of your master-data-management or data governance initiative at this point.
The buyer's lament
With Sentry versus Ranger and Navigator versus Atlas, you're seeing a real split. On one hand, Cloudera offers a mature, more complete offering; on the other hand, it's proprietary and already diverging from the less mature, less-thought-out Sentry product. Hortonworks answers with an open source offering, but obviously, it integrates with its own preferred security implementation.
In other words, we're seeing a sort of Hadoop distribution lock-in with each new layer we add. Part of why we pick an open source technology is to put the choice back in the user's hands.
Neither Navigator nor Atlas is a particularly complete offering, and while it's nice that Navigator can work with existing data governance offerings such as Informatica, these have their own plug-ins, anyhow.
You have to ask: Do I need a Hadoop data governance solution or do I need a complete data governance solution that includes Hadoop? In many cases, I'd say the latter.
It would be nice to see full-on open source data governance software. But for now, if you look at a complete, mature, and proprietary tool like Collibra, which offers a complete vision, you're unlikely to be happy even with Navigator. It would probably be easier for Collibra to deepen its Hadoop integration and offer better data lineage than for Cloudera to make Navigator a more complete offering. If you're using a proprietary product anyhow, you might as well use a complete one that covers all of your data (and if you have a lot of it, you probably have Informatica anyhow).
Someday a complete open source data governance or master data management tool will emerge. But it can't be aligned with a single technology vertical. I mean, I don't really want Data Governance for Hadoop, Data Governance for MongoDB, Data Governance for Oracle and a freaking data lake project just to tie back together my metadata from my data governance tools.
The catch with many existing tools is that they are heavy-duty and suited to bureaucratic organizations that hold long-winded data governance committee meetings. For organizations just getting into data governance, which simply need to stop digging, the implementation costs can be daunting.
Whichever governance software you choose, remember that owning a hammer doesn't make you a carpentry business, just as having a data governance tool doesn't make your initiative happen. Governance is really about your processes: the actual gathering and cataloging of data and how you think about data.
Meanwhile, whatever you do, don't listen to the naysayers and throw your data away because you haven't figured out how to govern it yet. That's like killing the patient because treatment is a lot of work.


