Big data security: The evolution of the Hadoop security model

The security and protection of sensitive information is one of the most pressing concerns today. In the age of big data, many organizations collect data from a variety of sources, analyze it, and make decisions based on the analysis of massive datasets, so security at every step of this process becomes increasingly important. At the same time, laws and regulations such as HIPAA and other privacy rules require organizations to tighten access control and privacy restrictions on these datasets. Security breaches by internal and external attackers are on the rise, they typically go undiscovered for months, and those affected pay the price. Organizations that fail to properly control access to their data face prosecution, negative headlines, and regulatory fines.

Consider the following eye-opening statistics:

A study published this year by Symantec and the Ponemon Institute found that the average organizational cost of a data breach in the United States is 5.4 million dollars.[1] Another recent study estimates that cyber-crime alone costs the United States 14 billion dollars a year.

The 2011 breach of the Sony PlayStation Network was one of the largest security breaches in recent memory, and experts estimate Sony's breach-related losses at anywhere from 2.7 billion to 24 billion dollars (a wide range, but the breach was so large that it is hard to quantify).[2]

Netflix and AOL have faced lawsuits seeking millions of dollars (some already settled) over their management of large volumes of data and their protection of personal information, even for data that had been "anonymized" and released for research purposes.[3]

Beyond the quantifiable costs associated with security breaches (loss of customers and business partners, litigation, regulatory fines), organizations that suffer such incidents also take a hit to their credibility and reputation, which can even put them out of business.[4]

In short, without proper security controls, big data can easily become a very expensive problem.

What does this mean for organizations that handle big data? It means that the more data you have, the more important it is to protect it. It means controlling not only the data that leaves your network, but also access to the data inside your network. Depending on the sensitivity of the data, you may need to ensure that data analysts can see only the data they are allowed to analyze, and you must understand the consequences of releasing the data and the results of its analysis. The Netflix data leak alone shows that even data that has been "anonymized" can expose unexpected information, a concern that is the focus of the field of differential privacy.

Apache Hadoop is one of the most popular big data processing platforms. Although Hadoop's initial design did not consider security at all, its security model has been evolving. Hadoop's rise has also attracted plenty of criticism, and as security experts continue to point out its potential vulnerabilities and the security risks of big data, Hadoop keeps improving its security. A "Hadoop security" market has exploded, with many vendors releasing "security-enhanced" versions of Hadoop and solutions that complement Hadoop's security. These products include Cloudera Sentry, IBM InfoSphere Optim Data Masking, Intel's secure edition of Hadoop, DataStax Enterprise, DataGuise for Hadoop, Protegrity Big Data Protector for Hadoop, Revelytix Loom, and Zettaset Secure Data Warehouse, among many others not enumerated here. At the same time, Apache projects such as Apache Accumulo provide mechanisms for adding extra security when using Hadoop. More recently, open source projects such as Knox Gateway (contributed by Hortonworks) and Project Rhino (contributed by Intel) have emerged, promising major changes to Hadoop itself.

The strong demand for Hadoop to meet security requirements is driving changes in Hadoop itself, and that evolution is the focus of this article.

A brief history of Hadoop security

It is well known that Doug Cutting and Mike Cafarella initially developed Hadoop for the Nutch project without security in mind. Hadoop's original use cases revolved around managing large amounts of public web data, so confidentiality was not a concern. In its original conception, Hadoop assumed that clusters would always run in trusted environments, used by trusted users who cooperate with one another.

The original Hadoop had no security model: it did not authenticate users or services, and it provided no data privacy. Because Hadoop was designed to execute code across a distributed cluster of machines, anyone could submit code and have it executed. Although auditing and authorization controls (HDFS file permissions) were added in earlier versions, this access control was easy to circumvent, because any user could impersonate any other user with a single command-line switch. Such impersonation was pervasive and practiced by most users, so the existing security controls had little real effect.
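To give a sense of how easy this was, here is a minimal sketch (my illustration, not taken from the original design documents) of client-side identity spoofing on an early, pre-security release; it assumes a cluster old enough to trust the client-supplied hadoop.job.ugi property as-is:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ImpersonationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: on pre-security releases the cluster simply believed
        // whatever identity the client claimed via hadoop.job.ugi.
        conf.set("hadoop.job.ugi", "hdfs,supergroup");
        FileSystem fs = FileSystem.get(conf);
        // The client is now treated as the HDFS superuser; file permissions
        // no longer offer any real protection.
        System.out.println(fs.getFileStatus(new Path("/")).getOwner());
    }
}
```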

At the time, organizations that cared about security isolated Hadoop on private networks accessible only to authorized users. Even in those environments, however, there were plenty of accidents and security incidents, because Hadoop itself had so few internal security controls. Well-intentioned users could make mistakes (such as deleting massive amounts of data in seconds with a distributed delete). All users and programmers had the same level of access to all data in the cluster, any job could access any data in the cluster, and any user could potentially read any dataset. Because MapReduce had no notion of authentication or authorization, a mischievous user could lower the priority of other Hadoop jobs to make his own job finish faster, or worse, kill those jobs outright.

As Hadoop grew into a prominent data analytics and processing platform, security experts began to worry about threats from malicious users inside a Hadoop cluster. A malicious developer could easily write code to impersonate other users' Hadoop services (for example, writing a new TaskTracker and registering it as a Hadoop service, or impersonating the hdfs or mapred user and deleting everything in HDFS). Because DataNodes enforced no access control, a malicious user could read arbitrary data blocks from a DataNode, bypassing access controls, or write garbage data to a DataNode and corrupt the integrity of the data being analyzed. Anyone could submit a job to the JobTracker and have it executed arbitrarily.

Because of these security problems, the Hadoop community realized it needed stronger security controls, and a team at Yahoo! decided to focus on authentication, choosing Kerberos as Hadoop's authentication mechanism, as documented in their 2009 white paper.

They achieved their goals in the Hadoop 0.20.20x releases, which implemented the following mechanisms:

Mutual authentication with Kerberos RPC (SASL/GSSAPI) on RPC connections: SASL/GSSAPI is used to implement Kerberos mutual authentication of users, their processes, and Hadoop services over RPC connections.
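As a rough illustration of what this looks like from the client side, here is a minimal sketch assuming a Kerberized cluster and a keytab for a hypothetical principal alice@EXAMPLE.COM:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Core switches that turn on Kerberos authentication and service-level
        // authorization (normally set in core-site.xml rather than in code).
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");

        UserGroupInformation.setConfiguration(conf);
        // Hypothetical principal and keytab path; a keytab replaces an
        // interactive kinit for long-running clients and services.
        UserGroupInformation.loginUserFromKeytab(
                "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");
        System.out.println("Logged in as: "
                + UserGroupInformation.getLoginUser().getUserName());
    }
}
```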

Provides "Plug and Play" Authentication for HTTP Web consoles-that is, Web applications and Web console implementations can implement their own authentication mechanisms for HTTP connections. Includes (but is not limited to) HTTP Spnego authentication.

Enforcement of HDFS file permissions: the NameNode can enforce access control on files in HDFS based on file permissions, that is, user and group access control lists (ACLs).
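For example, directory ownership and permissions can be managed through the standard FileSystem API; the sketch below uses a hypothetical /data/reports directory and an analysts group:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path reports = new Path("/data/reports");   // hypothetical path

        // rwxr-x--- : owner and group may read, everyone else is denied.
        FsPermission perm = new FsPermission((short) 0750);
        fs.mkdirs(reports, perm);

        // Changing ownership requires superuser privileges on the cluster.
        fs.setOwner(reports, "alice", "analysts");
        fs.setPermission(reports, perm);
    }
}
```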

Delegation tokens for subsequent authentication checks: to reduce performance overhead and the load on the Kerberos KDC, delegation tokens are used after the initial user authentication between the various clients and services. In particular, a delegation token lets a client keep communicating with the NameNode on subsequent authenticated calls without involving the Kerberos server again.
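Clients rarely handle these tokens directly, but the API can be called explicitly; the sketch below is an assumption about typical usage, with "yarn" standing in as a hypothetical renewer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class DelegationTokenSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Ask the NameNode for a delegation token after the initial Kerberos
        // login; "yarn" is a hypothetical renewer principal.
        Token<?> token = fs.getDelegationToken("yarn");

        // Attach the token to the current user so that later RPC calls to the
        // NameNode authenticate with the token instead of hitting the KDC.
        UserGroupInformation.getCurrentUser().addToken(token);
        System.out.println("Obtained token kind: " + token.getKind());
    }
}
```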

Block access tokens for access control to data blocks: when a data block needs to be accessed, the NameNode makes an access control decision based on HDFS file permissions and issues a block access token (using HMAC-SHA1) that can be handed to a DataNode along with the block request. Because DataNodes have no concept of files or permissions, the block access token bridges HDFS permissions and access to individual data blocks.
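Block access tokens are issued and verified entirely inside HDFS, so the only knob an administrator normally touches is the switch that enables them; the sketch below shows that property (usually set in hdfs-site.xml) purely for illustration:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockTokenConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Enables HMAC-signed block access tokens: the NameNode issues them
        // and DataNodes verify them before serving or accepting blocks.
        conf.setBoolean("dfs.block.access.token.enable", true);
        System.out.println(conf.get("dfs.block.access.token.enable"));
    }
}
```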

Job tokens to enforce task authorization: job tokens are created by the JobTracker and passed to TaskTrackers to ensure that tasks can only work on the jobs they were assigned to. Tasks can also be configured to run as the user who submitted the job, which makes access control checks simpler.
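Running tasks under the identity of the submitting user was handled by a pluggable task controller in the 0.20.20x/1.x line; the property name and class in the sketch below are my best recollection of that configuration and should be treated as assumptions:

```java
import org.apache.hadoop.conf.Configuration;

public class TaskControllerConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumption: on Hadoop 0.20.20x/1.x the LinuxTaskController launched
        // each task as the job's submitting OS user instead of the TaskTracker
        // user (normally configured in mapred-site.xml, not in code).
        conf.set("mapred.task.tracker.task-controller",
                 "org.apache.hadoop.mapred.LinuxTaskController");
        System.out.println(conf.get("mapred.task.tracker.task-controller"));
    }
}
```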

Taken together, these mechanisms were a big step forward for Hadoop. Since then, a number of other notable changes have been made:

From "Plug and Play Authentication" to HTTP Spnego authentication-although the 2009-year Hadoop security design focuses on Plug and play authentication, because RPC connectivity (user, application, and Hadoop services) has been Kerberos-certified, The Hadoop developer community feels that it's better to stay in line with Kerberos. The Hadoop Web Console is now configured to use HTTP Spnego, a Kerberos implementation for the Web console. This will partially meet the much-needed consistency of Hadoop.

Network encryption: SASL connections can be configured to use a quality of protection (QOP) of "privacy", forcing encryption at the network layer; this covers Kerberos RPC connections as well as subsequent authentication using delegation tokens. Web consoles and MapReduce shuffle traffic can be configured to use SSL encryption. HDFS data transfer can also be configured for encryption.
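Concretely, these three layers of wire protection map to a handful of properties; the sketch below gathers them in one place, with the caveat that exact property names can vary slightly between Hadoop versions:

```java
import org.apache.hadoop.conf.Configuration;

public class WireEncryptionConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // RPC: QOP "privacy" adds SASL-level encryption on top of
        // authentication and integrity checks.
        conf.set("hadoop.rpc.protection", "privacy");
        // HDFS block data transfer between clients and DataNodes.
        conf.setBoolean("dfs.encrypt.data.transfer", true);
        // MapReduce shuffle traffic over SSL (Hadoop 2.x property name).
        conf.setBoolean("mapreduce.shuffle.ssl.enabled", true);
        System.out.println(conf.get("hadoop.rpc.protection"));
    }
}
```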

Hadoop's security model has largely not changed since that security redesign. Over time, some components of the Hadoop ecosystem have built their own security layers on top of Hadoop: Apache Accumulo provides cell-level authorization, for example, while HBase provides access control at the column and column-family level.
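To give a flavor of cell-level authorization, here is a minimal Accumulo sketch that attaches a visibility expression to a single cell; the table layout, row, and labels are hypothetical, and writing the mutation to a real cluster would additionally require a connector and a BatchWriter:

```java
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class CellVisibilitySketch {
    public static void main(String[] args) {
        // Hypothetical record: only users holding BOTH the "analyst" and
        // "pii" authorizations will be able to read this cell.
        Mutation m = new Mutation(new Text("customer-0001"));
        m.put(new Text("profile"), new Text("ssn"),
              new ColumnVisibility("analyst&pii"),
              new Value("xxx-xx-xxxx".getBytes()));
        System.out.println("Column updates in mutation: " + m.getUpdates().size());
    }
}
```

At scan time, a user's granted authorizations are evaluated against each cell's visibility expression, which is what makes the access control cell-level rather than table-level.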
