Cloudera recently published an article on the Rhino project and encryption of data at rest in Apache Hadoop. The Rhino project was co-founded by Cloudera, Intel and the Hadoop community, and aims to provide a comprehensive security framework for data protection.
Data encryption in Hadoop has two aspects: data at rest, meaning data persisted on disk, and data in transit, meaning data moving from one process or system to another. Most Hadoop components can encrypt data in transit, but encryption of data at rest is not supported. Security regulations and standards such as HIPAA, PCI DSS and FISMA also call for data protection and encryption.
The Rhino project contributed key security features to HBase 0.98, which provides cell-level encryption and fine-grained access control.
InfoQ recently talked to Steven Ross, Product Manager for Cloudera Security, about the Rhino project.
InfoQ: When did the Rhino project start? What is the goal of this project?
Steven Ross: To promote a comprehensive security framework for data protection in Apache Hadoop, Intel launched the Rhino project in early 2013 and set several major goals for it:
Provide hardware-enhanced encryption performance; support enterprise-class authentication and single sign-on services for Hadoop; provide role-based access control for Hadoop and cell-level access control in HBase; and ensure consistent auditing across critical Apache Hadoop components.
InfoQ: The Rhino project is a broad effort that contains several sub-projects, including Apache Sentry. Can you share some details about these sub-projects?
SR: In the summer of 2013, open source software released by Cloudera became the foundation of the Apache Sentry project (incubating). The project has received a great deal of help from engineers at Oracle, IBM and Intel. Apache Sentry provides fine-grained authorization for data and metadata in Hadoop clusters and has been deployed in production by some large enterprises.
Cloudera and Intel have a strategic partnership. Security architects and engineers from both teams have reiterated their commitment to accelerating the development of Apache Hadoop security features. The Rhino project and Apache Sentry share exactly the same goal: developing a more robust authorization mechanism for Apache Hadoop. The work of the security experts from both companies has been combined, and both companies are now investing in both projects.
InfoQ: What is Apache Sentry?
SR: Apache Sentry is a highly modular system that provides fine-grained, role-based authorization for data and metadata stored in an Apache Hadoop cluster.
Projects in the Hadoop ecosystem each have their own authorization system, and each requires separate configuration. The flexibility of Hadoop lets different projects in the ecosystem, such as Hive, Solr, MapReduce and Pig, access the same data. Because each project's authorization configuration is independent, administrators trying to keep policies consistent are likely to end up with inconsistent, overlapping ones.
Sentry provides a centralized set of policies that can be applied across many different access paths, which solves this IT management and security challenge. IT administrators can therefore set permissions on a data set and know that these controls will be enforced consistently no matter how the data is accessed.
Sentry technical details:
Sentry controls access to each schema object in the Hive Metastore through a set of privileges such as SELECT and INSERT. Schema objects are the common entities of data management, such as SERVER, DATABASE, TABLE, COLUMN, and URI (the location of files in HDFS). Cloudera Search has its own set of privileges (such as QUERY) and objects (such as COLLECTION).
Like other RBAC systems already familiar to IT teams, Sentry offers:
Hierarchical objects that automatically inherit permissions from their parent objects; rules that contain a set of multiple object/privilege pairs; groups that can be granted one or more roles; and users that can be assigned to one or more groups.
Sentry is usually configured to deny access to services and data by default. Users are therefore restricted from accessing the system until they are assigned to a group that has an explicit access role.
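The model described above (hierarchical objects, roles made of object/privilege pairs, groups holding roles, users belonging to groups, and denial by default) can be illustrated with a minimal sketch. This is not Sentry's API or policy format: the `Policy` class, its method names, and the representation of objects as tuples such as `("server1", "sales", "orders")` are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


def covers(granted_obj, obj):
    """A grant on a parent object covers that object and everything beneath it."""
    return obj[:len(granted_obj)] == granted_obj


@dataclass
class Policy:
    """Hypothetical in-memory RBAC policy; not Sentry's real data model."""
    role_grants: dict = field(default_factory=dict)  # role -> {(object, privilege)}
    group_roles: dict = field(default_factory=dict)  # group -> {role}
    user_groups: dict = field(default_factory=dict)  # user -> {group}

    def grant(self, role, obj, privilege):
        self.role_grants.setdefault(role, set()).add((obj, privilege))

    def add_role_to_group(self, role, group):
        self.group_roles.setdefault(group, set()).add(role)

    def add_user_to_group(self, user, group):
        self.user_groups.setdefault(user, set()).add(group)

    def is_allowed(self, user, obj, privilege):
        """Allow only if a role reached through the user's groups grants the
        privilege on the object or on one of its ancestors."""
        for group in self.user_groups.get(user, ()):
            for role in self.group_roles.get(group, ()):
                for granted_obj, granted_priv in self.role_grants.get(role, ()):
                    if granted_priv == privilege and covers(granted_obj, obj):
                        return True
        return False  # nothing matched: deny by default


policy = Policy()
policy.grant("analyst", ("server1", "sales"), "SELECT")  # database-level grant
policy.add_role_to_group("analyst", "analysts")
policy.add_user_to_group("alice", "analysts")

orders = ("server1", "sales", "orders")
print(policy.is_allowed("alice", orders, "SELECT"))  # True: inherited from the database
print(policy.is_allowed("alice", orders, "INSERT"))  # False: privilege never granted
print(policy.is_allowed("bob", orders, "SELECT"))    # False: bob belongs to no group
```

Because the check is a single function over one policy, every access path that calls it sees the same answer, which is the consistency property the centralized approach is meant to give.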
InfoQ: What is Advanced Encryption Standard New Instructions (AES-NI)? How does it relate to the Rhino project?
SR: Intel AES-NI is a new encryption instruction set in the Intel Xeon and Intel Core processor families. It accelerates the Advanced Encryption Standard (AES) algorithm and so increases the speed of data encryption.
When enabling encryption, the main concern for businesses is the CPU overhead it imposes, which can slow down the storage and retrieval of data. AES-NI hands encryption processing to dedicated hardware that performs encryption and decryption faster, thereby reducing the CPU load.
AES-NI plays an important role in the success of the encryption-related subprojects in the Rhino project. However, Hadoop users who use HDFS encryption are not required to use Intel chips or AES-NI. These technologies do improve encryption and decryption performance, and so reduce the impact on system performance when encryption is turned on.
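Since AES-NI is optional, an operator may want to know whether a given node has it. The sketch below is an illustration only and is not part of Hadoop or the Rhino project: it relies on the fact that Linux on x86 lists an `aes` flag in `/proc/cpuinfo` for processors with AES-NI, and the function names are hypothetical. On other platforms the file or the `flags` line may be absent, in which case the check simply reports False.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_aes_ni(cpuinfo_text):
    """True if the `aes` flag, which Linux reports for AES-NI, is present."""
    return "aes" in cpu_flags(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print("AES-NI available:", has_aes_ni(read_cpuinfo()))
```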
InfoQ: What is the future roadmap for the Rhino project?
SR: The overall goals of the Rhino project are likely to remain the same, while its sub-projects (which usually take one of two forms, either an Apache project or a JIRA under an existing project) continue to evolve. Now that the milestone for fine-grained HBase security (described above) has been reached, two other sub-projects are currently gaining momentum:
Encryption of data at rest in HDFS.
Unified authorization: providing a single set of access policies that is enforced regardless of how users access the data, whether through Hive, MapReduce, or other means. This work is being done through the Apache Sentry project.
All integration work has been completed, and the entire solution has been fully tested and documented.
The Rhino sub-projects are implemented as part of Apache Hadoop (and other related Apache projects). CDH bundles Apache Hadoop and other related projects in the ecosystem.
Original article: Data Encryption in Apache Hadoop with Project Rhino - Q&A with Steven Ross