Why does Cloudera need to create a Hadoop security component Sentry?

Source: Internet
Author: User
Tags: hortonworks, hadoop ecosystem

1. Big Data Security System

To clarify this question, we must start from the four levels of a big data platform security system: perimeter security, data security, access security, and access behavior monitoring.

Perimeter security refers to network security technology in the traditional sense, such as firewalls and login authentication.

Data security, in the narrow sense, includes the encryption and decryption of user data, which can be subdivided into storage encryption and transmission encryption. It also includes data masking (desensitization), which can be seen as "lightweight" data encryption. For example, masking a birthday to "2014-x-x" keeps the outline of the data while making the exact value impossible to locate. The higher the degree of masking, the lower the readability of the data: the same example could be masked all the way to "x-x-x", which blocks the information completely.
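As a minimal sketch of this idea (the function name and masking policy here are invented for illustration, not taken from any specific library), "light" masking keeps the year while full masking blocks everything:

```python
def mask_birthday(date_str: str, keep_year: bool = True) -> str:
    """Mask a 'YYYY-MM-DD' birthday. Keeping the year is 'light' masking:
    the profile of the data survives, but the exact value does not."""
    year, _month, _day = date_str.split("-")
    return f"{year}-x-x" if keep_year else "x-x-x"

print(mask_birthday("2014-12-12"))                   # 2014-x-x
print(mask_birthday("2014-12-12", keep_year=False))  # x-x-x
```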

Access security mainly manages user authorization. In Linux/Unix systems, read/write/execute permission management over user, group, and other is the classic model, and HDFS extends this concept into a more complete ACL system. As big data applications spread and deepen, the need for differentiated access permissions within a single file is becoming increasingly important.

Access behavior monitoring mostly records users' access to the system, for example which files were viewed and which SQL queries were run. On the one hand, monitoring provides real-time alerts so that illegal or dangerous access can be dealt with quickly; on the other hand, it supports after-the-fact auditing and forensics, analyzing long-term access behavior to locate specific intent.
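The real-time side of such monitoring can be sketched in a few lines (the event fields, paths, and alert policy here are invented assumptions, not from any real audit system): scan the access log and flag both denied attempts and reads of sensitive data.

```python
from dataclasses import dataclass

@dataclass
class AccessEvent:
    user: str
    resource: str
    allowed: bool

# Hypothetical sensitive-data paths for this sketch.
SENSITIVE = {"/warehouse/sales_data"}

def alerts(events):
    """Flag denied attempts and any access to sensitive resources."""
    return [e for e in events if not e.allowed or e.resource in SENSITIVE]

log = [
    AccessEvent("alice", "/warehouse/readme",     True),
    AccessEvent("bob",   "/warehouse/sales_data", True),   # sensitive read
    AccessEvent("eve",   "/warehouse/sales_data", False),  # denied attempt
]
for e in alerts(log):
    print(e.user, e.resource, e.allowed)
```

The same log, retained long-term, serves the forensic side: it can be replayed later to reconstruct who touched what.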

Among the four security layers, the third layer, access security, has the most direct relationship with the business: multi-tenancy in an application relies directly on this layer's implementation of permission-based access control.

2. HDFS authorization system

In the third layer above, the Hadoop ecosystem long followed the Linux/Unix authorization model: file access is divided into read and write permissions (HDFS has no concept of executable files), and the permission holders fall into three classes: the owner, the group, and others (other). The model can only restrict permissions for these three classes. If you need to add a new group and give its users permissions different from those of the owner, group, or others, the classic Linux/Unix model cannot solve the problem elegantly.

An example illustrates the situation: assume the manager of a sales department has the right to modify the sales data file sales_data, members of the sales department may view sales_data, and people outside the department may not view it at all. The authorization on sales_data then looks like this:

    -rw-r----- 3 manager sales 0 sales_data

Later, the sales department expanded and two more sales managers arrived, manager1 and manager2, who should also be allowed to modify the sales data. Under the classic model, manager1 and manager2 can only share one new account, say manager_account, and let that account modify sales_data via setuid. This makes permission management for the same data complex and hard to maintain.

Because of these problems, HDFS ACL (Access Control List) support was added in Hadoop 2.4.0, and this new feature solves the problem above effectively. However, as Hadoop is used more widely in enterprises, more and more business scenarios require that the granularity of big data access control no longer stop at the file level: it must specify in more detail which data in a file may be read and written, which may only be read, and which may not be accessed at all. For SQL-based big data engines, access control must be accurate not only to table granularity but down to the row and column level.
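The gain from ACLs can be sketched with a toy in-memory model (the data structure and function names below are invented for illustration; real HDFS ACLs also involve a mask entry and are set with `hdfs dfs -setfacl`): named-user entries let manager1 and manager2 get write access without sharing an account.

```python
# Toy ACL for sales_data: the classic owner/group/other triple plus
# named-user entries, mirroring what HDFS ACLs added in Hadoop 2.4.0.
acl = {
    "owner": ("manager", "rw"),
    "group": ("sales", "r"),
    "named_users": {"manager1": "rw", "manager2": "rw"},  # new managers
    "other": "",
}

def can(user: str, groups: set, perm: str) -> bool:
    """Check one permission ('r' or 'w') against the toy ACL."""
    owner, owner_perm = acl["owner"]
    if user == owner:
        return perm in owner_perm
    if user in acl["named_users"]:          # named entry beats group/other
        return perm in acl["named_users"][user]
    grp, grp_perm = acl["group"]
    if grp in groups:
        return perm in grp_perm
    return perm in acl["other"]

print(can("manager1", {"sales"}, "w"))  # True: named-user entry grants rw
print(can("alice", {"sales"}, "w"))     # False: group grants read only
print(can("eve", set(), "r"))           # False: other has no access
```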

3. Authorization of Hiveserver2

Hive was one of the engines that first brought SQL, a high-level query language, to the Hadoop platform. The early Hive server process, called Hiveserver1, supported neither concurrent connections nor access authorization control. Both problems were later solved in Hiveserver2, which can use GRANT/REVOKE statements to restrict users' access permissions on databases, tables, and views, with row and column permissions controlled by generating views. However, Hiveserver2's authorization system was considered flawed: any authenticated user could grant himself access permission on any resource. In other words, Hiveserver2 did not provide a secure authorization system; it provided a safeguard against misoperation by ordinary users and was not designed to protect sensitive data. These were, however, mostly comments from certain companies; in fact Hiveserver2's security system is gradually improving, and the problems above are being fixed quickly.
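The flaw described above can be made concrete with a toy sketch (all names and the admin check are invented here; this is not Hiveserver2 code): if the GRANT operation itself is not protected, any authenticated user can grant himself anything, whereas a guarded version restricts who may grant.

```python
grants = set()  # (user, resource, privilege) triples

def naive_grant(requester, user, resource, priv):
    """The flaw: no check at all on who is doing the granting."""
    grants.add((user, resource, priv))

def guarded_grant(requester, user, resource, priv, admins):
    """A fix sketch: only designated admins may grant privileges."""
    if requester not in admins:
        raise PermissionError(f"{requester} may not grant privileges")
    grants.add((user, resource, priv))

# Any authenticated user can self-grant under the naive scheme.
naive_grant("mallory", "mallory", "sales_data", "SELECT")
print(("mallory", "sales_data", "SELECT") in grants)  # True
```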

However, authorization management is needed not only by Hive; other query engines also urgently need it to improve and standardize how applications access data. And for fine-grained authorization management, a large part of the functionality can be shared among engines, so an independent authorization management tool is necessary.

4. Security authorization management provided by Sentry

Against this background, developers at Cloudera took the existing authorization management model in Hiveserver2, extended it, refined many details, and produced Sentry, a fairly practical authorization management tool.


Many of Sentry's basic models and design ideas come from Hiveserver2, but it strengthens them with the concept of RBAC (role-based access control). In Sentry, permissions can only be granted to roles; when a role is attached to a user group, the users in that group obtain the corresponding permissions. The mapping between permissions, roles, and user groups is particularly clear in Sentry, and it shows how a permission ultimately reaches a user: permissions are bound to roles, and roles to user groups, through GRANT/REVOKE SQL statements, while the step from user group to user relies on Hadoop's user-group mapping. Hadoop provides two such mappings: one based on Linux/Unix users on the local server, the other from users to their groups through LDAP; the latter is more suitable for large systems because of its centralized configuration and ease of modification.
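The chain from permission to user can be sketched as three lookup tables (all names here are invented for illustration): privileges attach to roles, roles to groups, and users reach privileges only through the user-to-group mapping.

```python
# Sketch of Sentry's RBAC chain: privilege -> role -> group -> user.
role_privs  = {"analyst": {("sales_db.sales_data", "SELECT")}}
group_roles = {"sales": {"analyst"}}
# Stand-in for Hadoop's user-group mapping (local users or LDAP).
user_groups = {"alice": {"sales"}, "eve": {"marketing"}}

def privileges(user: str) -> set:
    """Resolve a user's effective privileges through groups and roles."""
    privs = set()
    for group in user_groups.get(user, set()):
        for role in group_roles.get(group, set()):
            privs |= role_privs.get(role, set())
    return privs

print(privileges("alice"))  # {('sales_db.sales_data', 'SELECT')}
print(privileges("eve"))    # set()
```

Note that no table maps privileges to users directly; that indirection is exactly what the GRANT/REVOKE statements manage.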

Sentry extends the data objects supported by Hiveserver2 from databases, tables, and views to servers, URIs, and columns. Although column-level permission control can also be implemented with views, view naming becomes very complex when there are many users and many large tables; in addition, because a view's name differs from the original table's, query statements written against the original table cannot be used directly.

Sentry 1.4 currently supports only the SELECT, INSERT, and ALL authorization levels, although later versions are expected to match the levels supported by Hiveserver2. Sentry grew out of the authorization model of Hiveserver2, but it is not limited to managing Hive: it also aims to manage Impala, Solr, and other query engines that need authorization management. The Sentry architecture is described below.

The Sentry architecture has three important components: Binding, Policy Engine, and Policy Provider.

Binding connects Sentry to the different query engines to be authorized: Sentry inserts its hook functions into different stages of each SQL engine's compilation and execution. These hook functions play two major roles. One acts as a filter, letting through only SQL queries whose issuer holds access permission on the corresponding data objects. The other takes over the authorization function: once Sentry is in use, the permissions managed by GRANT/REVOKE are handled completely by Sentry, the execution of GRANT/REVOKE happens entirely inside Sentry, and the authorization information for all engines is stored in the unified database configured for Sentry. In this way, the permissions of all engines are managed centrally.
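The filter role can be sketched as a single check (the function name and data shapes are invented for this sketch; a real binding hooks into the engine's compile pipeline): the engine hands the hook the (object, action) pairs a query touches, and the hook rejects the query unless every pair is authorized.

```python
def authz_hook(user_privs: set, required: list) -> bool:
    """Return True only if every (object, action) the query needs
    is present in the user's effective privileges."""
    return all(req in user_privs for req in required)

privs = {("sales_db.sales_data", "SELECT")}

print(authz_hook(privs, [("sales_db.sales_data", "SELECT")]))  # True
print(authz_hook(privs, [("sales_db.sales_data", "INSERT")]))  # False
```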

The Policy Engine decides whether an incoming permission request matches the stored permission descriptions, and the Policy Provider reads previously configured access permissions from files or databases. Any authorization system needs a Policy Engine and a Policy Provider, so they are public modules that can serve other query engines in the future.

5. Summary

Fine-grained access control on big data platforms is gradually being put into practice, led by the platform vendors Cloudera and Hortonworks. Cloudera is betting on the Sentry authorization system, while Hortonworks relies on its influence in the open-source community and on the acquired XA Secure. Whatever influence the two companies end up having on the big data platform market, fine-grained authorized access to big data platforms is worth studying.

6. Reference
  • http://zh.hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/
  • https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
