Andrew C. Oliver
Contributing Writer

Are you a data hoarder? Hadoop offers little choice

Aug 27, 2015

Data governance is one of the toughest, dreariest problems in computing. Sadly, the tools offered with the major Hadoop distributions aren't really up to the task.


In the data business, we now have our own denier movement of people who claim that unless you start planning first, you should keep throwing most of your data away. They call you a "data hoarder."

There's a bit of absurdity here. If you throw it away, you can't get it back; if you keep it, you can eventually organize and purge what you don't need. Those who store data now while getting their governance in place are not automatically "data hoarders." This is a false dilemma.

The idea that you need to come up with a perfect plan before keeping any data or bringing in any new sources is a little like saying we need perfect social justice for everyone before we can address police killings of African-Americans.

Instead, get started now. Stop throwing out the baby with the bathwater and begin finding your use cases. Meanwhile, make data the point rather than a side effect of your processes and govern it accordingly. These aren't "steps," but initiatives you need to undertake, usually in parallel.

That said, how do you go about planning? How do you start cataloging your data and establish some structure around its evolution? There are traditional solutions like those covered in last year's Forrester Wave report (Informatica, various IBM offerings, SAS, and Collibra, among others), but some of these come with a lot of baggage and form part of a vendor's overall platform play.

Meanwhile, a new class of data governance tools is being developed specifically for Hadoop. These tools have less of a legacy, but are also less mature. They are focused on the Hadoop ecosystem rather than your whole organization, allowing you to integrate them more closely with your new data architecture.

Cloudera Navigator

Navigator is Cloudera's closed-source offering for data governance. It incorporates both security auditing and metadata management, and it allows both integration with traditional data governance products like Informatica and automated data lineage tracking.

At its core, it tracks where data came from, what transformations happened to it, where the data landed, and where the heck it's located. You can even set up rules (policies) for automatically tagging data based on its type and origin.
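To make the idea concrete, here is a minimal sketch of policy-based tagging over a lineage record. It is not Navigator's API: the `Dataset` and `TagPolicy` classes, the `apply_policies` function, and the sample path, origin, and tag names are all hypothetical, invented only to illustrate matching on type and origin.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Dataset:
    """Hypothetical catalog entry (not a Navigator object)."""
    path: str                      # where the data is located
    data_type: str                 # e.g. "csv" or "avro"
    origin: str                    # the system the data came from
    transformations: List[str] = field(default_factory=list)  # lineage steps
    tags: Set[str] = field(default_factory=set)


@dataclass
class TagPolicy:
    """Hypothetical rule: apply `tag` when type and origin both match."""
    tag: str
    data_type: Optional[str] = None  # None means "any type"
    origin: Optional[str] = None     # None means "any origin"

    def matches(self, ds: Dataset) -> bool:
        type_ok = self.data_type is None or self.data_type == ds.data_type
        origin_ok = self.origin is None or self.origin == ds.origin
        return type_ok and origin_ok


def apply_policies(ds: Dataset, policies: List[TagPolicy]) -> Dataset:
    """Add the tag of every matching policy to the dataset."""
    for policy in policies:
        if policy.matches(ds):
            ds.tags.add(policy.tag)
    return ds


# Assumed sample data: an orders file that came from a CRM system.
orders = Dataset(path="/data/landing/orders", data_type="csv", origin="crm")
orders.transformations.append("deduplicate")  # record one lineage step

policies = [
    TagPolicy(tag="customer-data", origin="crm"),
    TagPolicy(tag="binary", data_type="avro"),
]
apply_policies(orders, policies)
print(sorted(orders.tags))
```

Only the first policy matches here, because the dataset's origin is "crm" and its type is not "avro". A real tool would persist these records and evaluate policies as data arrives; the sketch shows only the matching logic.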

Navigator also allows you to trigger actions based on these policies, some of which aren't necessarily best done in Navigator (for example, triggering actions to archive or move data). Among the biggest concerns is that you can trigger auditing with or without Sentry, Cloudera's authorization module for Hadoop.

On the one hand, "choice is good," but on the other hand, if you go to the condiment counter at a fast food joint and find 15 brands of generic ketchup packages, which do you choose? I don't really need multiple paths for an audit implementation because ... I just want to log the stuff already and I don't care about choice for that.

Apache Atlas

Hortonworks is newer to the data governance game. It has proposed Apache Atlas, which was accepted into Apache's incubator, which is sometimes, but not always, a sign of project maturity. The rise to a top-level Apache project is a very political process.

Atlas has high hopes, but it's pretty early on in its development. It integrates with Apache Ranger according to the README.txt, though that's the only use of the word "Ranger" in the whole source repository, and it isn't a lot of code. While Atlas is part of Hortonworks' recent 2.3 release, it's clearly an early cut, and probably not the core of your master-data-management or data governance initiative at this point.

The buyer's lament

With Sentry versus Ranger and Navigator versus Atlas, you're seeing a real split. On one hand, Cloudera offers a mature, more complete offering; on the other hand, it's proprietary and already diverging from the less mature, less-thought-out Sentry product. Hortonworks answers with an open source offering, but obviously, it integrates with its own preferred security implementation.

In other words, we're seeing a sort of Hadoop distribution lock-in with each new layer we add. Part of why we pick an open source technology is to put the choice back in the user's hands.

Neither Navigator nor Atlas is a particularly complete offering, and while it's nice that Navigator can work with existing data governance offerings such as Informatica, these have their own plug-ins, anyhow.

You have to ask: Do I need a Hadoop data governance solution or do I need a complete data governance solution that includes Hadoop? In many cases, I'd say the latter.

It would be nice to see full-on open source data governance software. But for now, if you look at a complete, mature, and proprietary tool like Collibra, which offers a complete vision, you're unlikely to be happy even with Navigator. It would probably be easier for Collibra to deepen its Hadoop integration and offer better data lineage than for Cloudera to make Navigator a more complete offering. If you're using a proprietary product anyhow, you might as well use a complete one that covers all of your data (and if you have a lot of it, you probably have Informatica anyhow).

Someday a complete open source data governance or master data management tool will emerge. But it can't be aligned with a single technology vertical. I mean, I don't really want Data Governance for Hadoop, Data Governance for MongoDB, Data Governance for Oracle and a freaking data lake project just to tie back together my metadata from my data governance tools.

The catch with many existing tools is they are heavy duty and suited to bureaucratic organizations that hold long-winded data governance committee meetings. For organizations just getting into data governance, who simply need to stop digging, the implementation costs can be daunting.

Whichever governance software you choose, remember that owning a hammer doesn't make you a carpentry business, just as having a data governance tool doesn't make your initiative happen. Governance is really about your processes: the actual gathering and cataloging of data and how you think about data.

Meanwhile, whatever you do, don't listen to the naysayers and throw your data away because you haven't figured out how to govern it yet. That's like killing the patient because treatment is a lot of work.