Big Data Study Notes

Source: Internet
Author: User

From: http://www.csdn.net/article/2013-12-04/2817707-Impala-Big-Data-Engine

 

Big data processing is an important field within cloud computing. Since Google proposed the MapReduce distributed processing framework, open source software represented by Hadoop has been adopted by more and more companies. This article describes a new member of the Hadoop ecosystem: Impala.

Impala Architecture Analysis

Impala is a new query system developed by Cloudera. It provides SQL semantics and can query PB-scale data stored in Hadoop's HDFS and HBase. The existing Hive system also provides SQL semantics, but Hive executes queries on the MapReduce engine, so each query is still a batch job and struggles to meet the needs of interactive querying. In contrast, Impala's biggest feature is its speed. So how does Impala make big data queries fast? Before answering this question, we need to introduce Google's Dremel system, because Impala was initially designed with reference to Dremel.

Dremel is Google's interactive data analysis system. It is built on Google's GFS (Google File System) and other systems, and it supports Google data analysis services such as BigQuery. Dremel has two technical highlights: column storage of nested data, and a multi-layer query tree that lets tasks execute and aggregate results concurrently on thousands of nodes. Column storage is already used in relational databases, where it reduces the amount of data read during a query and so improves query efficiency. What sets Dremel's column store apart is that it targets nested data rather than traditional relational data. Dremel converts each nested record into columns for storage. When querying, it reads only the columns required by the query, applies the filter conditions, and on output reassembles the columns into nested records. Both the forward and the reverse conversion of records are performed by an efficient state machine. The multi-layer query tree, in turn, draws on the design of distributed search engines. The root node of the query tree receives the query and distributes it to the nodes in the next layer; the nodes at the bottom layer read and query the actual data and return their results to the layer above. For more information about the technical implementation of Dremel, see [Note: "Google Dremel principle: how to analyze 1 PB in 3 seconds", http://www.yankay.com/google-dremel-rationale/].
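The two ideas above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions, not Dremel's or Impala's actual implementation: the class names LeafNode and InnerNode, the function average, and the sample columns are all hypothetical; the columns are flat (the nested-record encoding and its state machine are omitted); and children are visited in a plain loop rather than concurrently. It shows a leaf scanning only the columns a query references, and each layer of the tree merging partial results before passing them upward.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LeafNode:
    """Hypothetical bottom-layer node: one data shard stored column by column."""
    columns: Dict[str, List]

    def partial_sum_count(self, value_col: str, filter_col: str, wanted) -> Tuple[float, int]:
        # Only the two referenced columns are read; "url" is never touched.
        values = self.columns[value_col]
        flags = self.columns[filter_col]
        selected = [v for v, f in zip(values, flags) if f == wanted]
        return sum(selected), len(selected)


@dataclass
class InnerNode:
    """Hypothetical upper-layer node: forwards the query and merges results."""
    children: List  # LeafNode or InnerNode instances

    def partial_sum_count(self, value_col: str, filter_col: str, wanted) -> Tuple[float, int]:
        total, count = 0, 0
        # A real system would dispatch to children in parallel; this loop is sequential.
        for child in self.children:
            s, c = child.partial_sum_count(value_col, filter_col, wanted)
            total += s
            count += c
        return total, count


def average(root, value_col: str, filter_col: str, wanted):
    """Finish the aggregation at the root from the merged (sum, count) pair."""
    total, count = root.partial_sum_count(value_col, filter_col, wanted)
    return total / count if count else None


# Made-up sample data spread over three leaves and two intermediate nodes.
leaf_a = LeafNode({"country": ["us", "cn", "us"], "latency_ms": [10, 20, 30], "url": ["/a", "/b", "/c"]})
leaf_b = LeafNode({"country": ["cn", "us"], "latency_ms": [40, 50], "url": ["/d", "/e"]})
leaf_c = LeafNode({"country": ["us"], "latency_ms": [70], "url": ["/f"]})
root = InnerNode([InnerNode([leaf_a, leaf_b]), InnerNode([leaf_c])])

# Roughly: SELECT AVG(latency_ms) WHERE country = 'us'
print(average(root, "latency_ms", "country", "us"))
```

Each node returns a (sum, count) pair rather than an average because averages from different shards cannot be combined directly, whereas sums and counts can simply be added at every layer of the tree.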

 
