Andrew C. Oliver
Contributing Writer

Big data needs big security changes

analysis
Feb 12, 2016 | 6 mins
Analytics, Data Architecture, Jakarta EE

Access control for big data analytics needs policy-based security that includes context as well as users and roles

Security for big data analytics is challenging. Here's why: When you can't analyze in place, you need to copy that data, at which point all the stipulations about who can see or change which data under what circumstances should be replicated, too. Today, that's nearly impossible to do.

On the Hadoop/Spark side, we have only role-based, limited access control lists (ACLs) or the Wild West. But I believe there's a way forward: Adopt the policy-based approach that has arisen in the broader security market. To explore how that could work, we need to revisit the history of access control and how it evolved to produce a policy-based model.

A three-minute history of access control

In the beginning, there were usernames and passwords to keep out everyone who might want in, despite what Richard Stallman said.

There was an inherent problem with this system. The number of user/password combinations tended to explode as new applications were written, so we ended up with a different user/password for each application. Worse, some applications asked for different passwords to reach different levels of security.

We became smarter and separated "roles" from usernames. We'd have one "user/password," but to access the administrative functions, that user/password would also need an "admin" role, for example. However, each application tended to implement this on its own, so you still had a growing list of passwords to remember.

We became even smarter and created central systems that eventually became LDAP, Active Directory, and the like. These united the user/password in a core repository and established one place to look up the roles for a given user, but this replaced one problem with another.

In an ideal world, each new application looks at the list of roles in Active Directory and maps them to application roles, so there's a clean, one-to-one relationship. In reality, most applications think of roles differently, and besides, simply because you're an admin for one application doesn't mean you should be an admin for another. In the end, you've replaced an explosion of user/password combinations with an explosion in the number of roles.

Which raises the question: Who ends up in charge of adding new roles? It tends to be either some IT-administrative or shared-HR function. Since there's a good chance none of the people handed the menial task of adding roles will actually understand the application very well, this usually ends up being a "manager approval" or "rubber stamp," and that isn't, as they say, good.

Many applications still punt on the question of roles by using AD for authentication and having the application handle its own local role implementation. There's a lot to be said for this approach, because it's clearly the application administrator who knows who should have what level of access.

Meanwhile, there are clear rules that do not cleanly fit into a user/role system. At its simplest, being a banking customer doesn't mean I can withdraw money from any account, even if I have the "canWithdraw" role. Roles often need to be associated with data, which is why we have ACLs that map to entries in our data store. That is, account 1234 has an association that identifies me as its owner and my spouse as an authorized account administrator.
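To make the role-plus-ACL idea concrete, here is a minimal Python sketch. The `User` class, the in-memory `ACCOUNT_ACL` table, and the `can_withdraw` function are all hypothetical names invented for illustration; a real system would keep the ACL in its data store alongside the account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset = frozenset()


# Hypothetical ACL table: account number -> {user name: relationship}.
ACCOUNT_ACL = {
    "1234": {"me": "owner", "spouse": "authorized_admin"},
}


def can_withdraw(user: User, account: str) -> bool:
    """Require both the application role and an ACL entry on this account."""
    has_role = "canWithdraw" in user.roles
    has_acl_entry = user.name in ACCOUNT_ACL.get(account, {})
    return has_role and has_acl_entry


me = User("me", frozenset({"canWithdraw"}))
stranger = User("stranger", frozenset({"canWithdraw"}))

print(can_withdraw(me, "1234"))        # True: role and ACL entry both present
print(can_withdraw(stranger, "1234"))  # False: role alone is not enough
```

The role answers "may this user perform withdrawals at all?" while the ACL answers "on which records?" Neither check can stand in for the other.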

However, some businesses have rules that are more complicated than "is this yours?" or "what permissions do you have on this record?" Instead, they use what you might call "contextual" or "policy-based" security rules. In other words, I might have permission to withdraw money only while I'm within the continental United States. There's no way to express this in an ACL or role-based model. Instead, we've crossed over into policy-based security.
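A contextual rule like the withdrawal example might be sketched as follows. This is an illustration under stated assumptions, not a reference design: `WithdrawalRequest`, `may_withdraw`, and the `in_continental_us` flag are hypothetical, and the flag is assumed to be filled in by some trusted geolocation step outside this snippet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WithdrawalRequest:
    user: str
    roles: frozenset
    # Assumption: set by a trusted geolocation step, not by the caller.
    in_continental_us: bool


def may_withdraw(req: WithdrawalRequest) -> bool:
    """The decision depends on the request's context, not only on who asks."""
    return "canWithdraw" in req.roles and req.in_continental_us


roles = frozenset({"canWithdraw"})
at_home = WithdrawalRequest("me", roles, in_continental_us=True)
abroad = WithdrawalRequest("me", roles, in_continental_us=False)

print(may_withdraw(at_home))  # True
print(may_withdraw(abroad))   # False: same user, same role, different context
```

The same user with the same role gets two different answers, which is exactly what a static role list or ACL entry cannot capture.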

When you can do only some things sometimes

Policy-based security quite often lives in a central repository and relies on central authentication mechanisms (LDAP, Kerberos, and so on). The difference is, instead of maintaining simple roles (such as canWithdraw), each user is associated with a set of policies. The policies are based on a set of attributes about the user, an approach known as attribute-based access control (ABAC). Those policies cannot be centrally enforced as they are entirely application-dependent.

There are already standards for supporting this, derived in part from defense and other select industries. One such standard is eXtensible Access Control Markup Language (XACML), which allows you to express sets of policies. Enforcement is usually application-based, using some sort of algorithm or rule system. XACML is a pretty comprehensive standard for expression and even handles exceptions like conflicts in policy or two algorithms enforcing one policy.
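XACML resolves conflicting policies through named combining algorithms such as deny-overrides and permit-overrides. The Python sketch below imitates those two in simplified form. It is not XACML syntax and not any library's API: the `Decision` enum, the two sample policies, and the dictionary-shaped request are assumptions for illustration, and the standard's Indeterminate result is left out.

```python
from enum import Enum
from typing import Callable, Dict, List


class Decision(Enum):
    PERMIT = "Permit"
    DENY = "Deny"
    NOT_APPLICABLE = "NotApplicable"


Request = Dict[str, str]
Policy = Callable[[Request], Decision]


def engineers_may_read(req: Request) -> Decision:
    # Hypothetical policy: engineers are permitted, others are not covered.
    if req.get("role") == "engineer":
        return Decision.PERMIT
    return Decision.NOT_APPLICABLE


def outside_us_denied(req: Request) -> Decision:
    # Hypothetical policy: any request from outside the US is denied.
    if req.get("country") != "US":
        return Decision.DENY
    return Decision.NOT_APPLICABLE


def deny_overrides(policies: List[Policy], req: Request) -> Decision:
    """Any Deny wins; otherwise any Permit wins."""
    decisions = [policy(req) for policy in policies]
    if Decision.DENY in decisions:
        return Decision.DENY
    if Decision.PERMIT in decisions:
        return Decision.PERMIT
    return Decision.NOT_APPLICABLE


def permit_overrides(policies: List[Policy], req: Request) -> Decision:
    """Any Permit wins; otherwise any Deny wins."""
    decisions = [policy(req) for policy in policies]
    if Decision.PERMIT in decisions:
        return Decision.PERMIT
    if Decision.DENY in decisions:
        return Decision.DENY
    return Decision.NOT_APPLICABLE


policies = [engineers_may_read, outside_us_denied]
request = {"role": "engineer", "country": "FR"}

print(deny_overrides(policies, request).value)    # Deny
print(permit_overrides(policies, request).value)  # Permit
```

The two policies disagree about the same request, and the choice of combining algorithm decides the outcome. That choice is itself part of the policy and has to travel with it.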

Often these ABAC-driven policies, as in the case of RBAC, are based on data rather than application function alone (you can access the schematic for the F-22 only while you're in the United States working for this particular company and a citizen in good standing). One of the first steps in applying policy is often identifying and "tagging" the data to which the policy rules should apply.
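Tag-driven enforcement might look roughly like the Python sketch below. Everything here is hypothetical: the `export-controlled` tag, the `ExampleAero` employer, the `Subject` attributes, and the `TAG_RULES` table are invented for illustration and do not reflect how any particular product stores tags or policies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Subject:
    name: str
    employer: str
    location: str
    citizen_in_good_standing: bool


@dataclass(frozen=True)
class Record:
    record_id: str
    tags: frozenset = frozenset()


def export_controlled_rule(subject: Subject) -> bool:
    # Hypothetical rule mirroring the schematic example in the text.
    return (
        subject.location == "US"
        and subject.employer == "ExampleAero"
        and subject.citizen_in_good_standing
    )


def deny_rule(subject: Subject) -> bool:
    # Fail closed: a tag with no registered rule grants nothing.
    return False


# Hypothetical mapping from data tag to the rule guarding that data.
TAG_RULES: Dict[str, Callable[[Subject], bool]] = {
    "export-controlled": export_controlled_rule,
}


def can_read(subject: Subject, record: Record) -> bool:
    """Every tag on the record must approve the subject; untagged data is open."""
    return all(TAG_RULES.get(tag, deny_rule)(subject) for tag in record.tags)


def visible_ids(subject: Subject, records: List[Record]) -> List[str]:
    return [r.record_id for r in records if can_read(subject, r)]


records = [
    Record("cafeteria-menu"),
    Record("f22-schematic", frozenset({"export-controlled"})),
]
insider = Subject("ann", "ExampleAero", "US", True)
traveler = Subject("ann", "ExampleAero", "FR", True)

print(visible_ids(insider, records))   # ['cafeteria-menu', 'f22-schematic']
print(visible_ids(traveler, records))  # ['cafeteria-menu']
```

Because the rule attaches to the tag and not to a table or file path, the same policy can follow the data when it is copied into an analytics store, provided the tags are copied with it.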

Why you should care about advanced security

Clearly, using ABAC-style policies and XACML is a hefty step up from RBAC. You should have the motivation to do this, if only to avoid a big, fat $100 million fine. I mean, $100 million here and $100 million there, and before long it adds up to real money.

Also, some organizations have complex rules and ownership of data. As these companies increasingly move to become data-driven and can't analyze everything in place, but instead require centralization, they'll need a system that goes beyond the common RBAC models of today. Moreover, to make that feasible, they'll need tagging and libraries that allow them to apply policies expressed in something like XACML as well as the tools to manage the policy centrally while applying it locally where meaningful.

When we look at today's big data offerings, such as Ranger and Sentry, nothing comes close to answering this call. Even solutions for RDBMS-based systems tend to be proprietary, expensive, and often incomplete. Organizations doing high security with complex security rules are forced to implement this on their own. Heck, data tagging tools are still in their infancy for big data systems like Hadoop.

In other words, there's a big opportunity here for the vendors who can figure it out. Clearly, the defense industry is the first customer, because it's already doing it out of necessity. As more companies create central data repositories for big data analysis, the need for policy-based security is only going to grow.