A first look at Cloudera Impala

Last Update: 2014-06-25 Source: Internet

Author: User


Impala is a new query system developed by Cloudera. It provides SQL semantics and can query petabyte-scale data stored in Hadoop HDFS and HBase. The existing Hive system also provides SQL semantics, but Hive executes queries on the MapReduce engine, so it remains a batch process and struggles to support interactive queries. By contrast, Impala's biggest selling point is speed: it provides a real-time SQL query interface for data stored in HDFS and HBase.

Advantages of Impala

ZDNet describes some advantages of Impala:

The main advantages are as follows: a familiar SQL interface, faster queries than Hive, support for multiple storage engines and file formats, rich interfaces (ODBC, JDBC, and a command-line client), open source, and easy deployment.

Impala Architecture

The Impala solution contains the following parts:

Clients: Hue, ODBC clients, JDBC clients, and the Impala shell.
Hive Metastore: stores schema metadata. When you create, drop, or alter a table, or load data into a table, the Impala nodes are notified automatically.
Cloudera Impala: runs on the data nodes to parse, schedule, and execute queries. Each Impala instance can receive and schedule queries from clients. The queries are distributed to the Impala nodes, each of which acts as a worker process: it executes its part of the query and returns the result.
HBase and HDFS: store the data that Impala queries.

[Figure: Impala architecture]

The yellow parts of the figure are the Impala components. Impala uses the Hive SQL interface (including select, insert, join, and other operations), but currently implements only a subset of Hive SQL semantics (for example, UDFs are not supported yet). Table metadata is stored in the Hive Metastore. Statestore is a sub-service of Impala that monitors the health of each node in the cluster and provides functions such as node registration and failure detection. Impala runs a background service, impalad, on each node; impalad responds to external requests and performs the actual query processing. It consists of three modules: the query planner, the query coordinator, and the query exec engine. The query planner receives queries from SQL apps and ODBC and converts each query into many subqueries. The query coordinator distributes these subqueries to the nodes, and the query exec engine on each node executes its subquery and returns the result. These intermediate results are aggregated and finally returned to the user.
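The planner, coordinator, and exec engine flow described above can be sketched in a few lines of Python. This is only an illustration of the scatter-and-gather idea, not Impala's implementation: the node names, the sample rows, and the functions `execute_fragment` and `coordinate` are all hypothetical, and threads stand in for separate machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a local slice of a table of (region, amount) rows.
NODE_DATA = {
    "node1": [("east", 10), ("west", 5)],
    "node2": [("east", 7), ("north", 3)],
    "node3": [("west", 8), ("north", 1)],
}


def execute_fragment(rows):
    """Exec engine stand-in: compute a partial SUM(amount) GROUP BY region locally."""
    partial = defaultdict(int)
    for region, amount in rows:
        partial[region] += amount
    return dict(partial)


def coordinate(node_data):
    """Coordinator stand-in: run one fragment per node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(execute_fragment, node_data.values()))
    final = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial.items():
            final[region] += subtotal
    return dict(final)


result = coordinate(NODE_DATA)
print(sorted(result.items()))  # [('east', 17), ('north', 4), ('west', 13)]
```

Each fragment only touches its own slice, so the coordinator merges three small dictionaries instead of moving every row across the network.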

Impala Process

From the process perspective, there are three types of processes:

The Impala daemon
This is Impala's core process; its process name is impalad. It runs on all data nodes, reads and writes data, receives query requests from clients, executes in parallel the query requests sent by other nodes in the cluster, and returns intermediate results to the scheduling node. The scheduling node returns the final result to the client.
The Impala statestore
The state management process. It regularly checks the health of the Impala daemons and coordinates information among the impalad instances; Impala uses this information to locate the data to be queried. Its process name is statestored, and only one such process needs to run in the cluster. If an Impala node goes offline because of a hardware, network, software, or other failure, the statestore notifies the other nodes so that query tasks are not distributed to the unavailable node.
The Impala catalog service
The metadata management service. Its process name is catalogd. It distributes changes to table metadata to the other processes.
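The statestore's role (track which daemons are healthy so that work is not sent to dead ones) can be illustrated with a toy heartbeat tracker. This is a sketch under stated assumptions: the class `ToyStatestore`, its methods, and the 10-second timeout are hypothetical, and the real statestored protocol is not modeled here.

```python
class ToyStatestore:
    """Tracks the last heartbeat of each daemon and reports which ones look alive."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # daemon name -> time of its last heartbeat

    def heartbeat(self, daemon, now):
        """Record that `daemon` reported in at time `now` (seconds)."""
        self.last_seen[daemon] = now

    def live_daemons(self, now):
        """Return the daemons whose last heartbeat is within the timeout."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


store = ToyStatestore(timeout_seconds=10)
for host in ("h2.worker.com", "h3.worker.com", "h4.worker.com"):
    store.heartbeat(host, now=0)

# h3 stops reporting; h2 and h4 keep sending heartbeats.
store.heartbeat("h2.worker.com", now=12)
store.heartbeat("h4.worker.com", now=12)

schedulable = store.live_daemons(now=15)
print(schedulable)  # ['h2.worker.com', 'h4.worker.com']
```

A coordinator that consults `live_daemons` before distributing work would skip h3.worker.com, which is the behavior the text attributes to the statestore.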

These processes can be seen in a CDH 5 environment:

Impala process distribution
Hostname	Process name
h1.worker.com	statestored and catalogd
h2.worker.com	impalad
h3.worker.com	impalad
h4.worker.com	impalad

[[email protected] ~]# hostname
h1.worker.com
[[email protected] ~]# ps -ef | grep impala
impala   14048  7910  0 04:13 ?        00:00:30 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/catalogd --flagfile=/var/run/cloudera-scm-agent/process/57-impala-CATALOGSERVER/impala-conf/catalogserver_flags
impala   14070  7910  0 04:13 ?        00:03:01 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/statestored --flagfile=/var/run/cloudera-scm-agent/process/61-impala-STATESTORE/impala-conf/state_store_flags
root     48029 31543  0 10:13 pts/0    00:00:00 grep impala
[[email protected] ~]#

[[email protected] ~]# hostname
h2.worker.com
[[email protected] ~]# ps -ef | grep impala
impala   13919  4405  0 04:13 ?        00:01:12 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/impalad --flagfile=/var/run/cloudera-scm-agent/process/58-impala-IMPALAD/impala-conf/impalad_flags
root     24212 18173  0 10:16 pts/0    00:00:00 grep impala
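Checking each host by eye gets tedious on a larger cluster. As a rough sketch, the `ps -ef` output can be parsed to list which Impala daemons are running. The helper `impala_processes` is hypothetical, and the sample text is copied from the h1.worker.com output in this article; in practice the text would come from running `ps -ef` on each host.

```python
import os

# Sample `ps -ef | grep impala` output, copied from the h1.worker.com listing.
SAMPLE_PS_OUTPUT = """\
impala   14048  7910  0 04:13 ?        00:00:30 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/catalogd --flagfile=/var/run/cloudera-scm-agent/process/57-impala-CATALOGSERVER/impala-conf/catalogserver_flags
impala   14070  7910  0 04:13 ?        00:03:01 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/statestored --flagfile=/var/run/cloudera-scm-agent/process/61-impala-STATESTORE/impala-conf/state_store_flags
root     48029 31543  0 10:13 pts/0    00:00:00 grep impala
"""

IMPALA_DAEMONS = {"impalad", "statestored", "catalogd"}


def impala_processes(ps_output):
    """Return (pid, daemon_name) pairs for Impala daemons found in `ps -ef` output."""
    found = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD keeps its spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0]
        name = os.path.basename(executable)
        if name in IMPALA_DAEMONS:  # this also drops the `grep impala` line itself
            found.append((int(fields[1]), name))
    return found


print(impala_processes(SAMPLE_PS_OUTPUT))
# [(14048, 'catalogd'), (14070, 'statestored')]
```

Matching on the executable's base name rather than on the substring "impala" keeps the `grep impala` process out of the result.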

Why Impala is fast

I looked online for why Impala is fast. The main reasons are as follows.
Impala does not need to write intermediate results to disk, which saves a lot of I/O overhead.
It avoids the overhead of starting MapReduce jobs. MapReduce starts tasks slowly (the default heartbeat interval is 3 seconds), whereas Impala schedules work directly through its own service processes, which is much faster.
Impala abandons MapReduce entirely, a paradigm that is poorly suited to SQL queries. Instead, it borrows the MPP parallel database approach, as Dremel does, which allows more query optimization and avoids unnecessary shuffle, sort, and similar costs.
It uses LLVM to compile code at run time, avoiding the unnecessary overhead that comes with supporting general-purpose compilation.
It is implemented in C++ and includes many hardware-specific optimizations, such as the use of SSE instructions.
It uses an I/O scheduling mechanism that supports data locality, placing data and computation on the same machine wherever possible to reduce network overhead.
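The data-locality point in the list above can be made concrete with a small sketch. This is not Impala's scheduler: the function `assign_blocks`, the block IDs, and the short host names are hypothetical. It only shows the idea of preferring a node that already stores a replica of the block and falling back to a remote read otherwise.

```python
def assign_blocks(block_replicas, nodes):
    """Assign each block to a node, preferring the least-loaded node that holds a replica."""
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, replicas in block_replicas.items():
        local = [n for n in replicas if n in load]
        # If no query node holds a replica, any node may read the block remotely.
        candidates = local if local else list(nodes)
        chosen = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical layout: block id -> hosts that store a replica of it.
BLOCK_REPLICAS = {
    "blk_1": ["h2", "h3"],
    "blk_2": ["h2", "h4"],
    "blk_3": ["h3", "h4"],
    "blk_4": ["h9"],  # no replica on any query node
}
NODES = ["h2", "h3", "h4"]

plan = assign_blocks(BLOCK_REPLICAS, NODES)
print(plan)
# {'blk_1': 'h2', 'blk_2': 'h4', 'blk_3': 'h3', 'blk_4': 'h2'}
```

Only blk_4 needs a network read in this layout; the other three blocks are scanned on a machine that already stores them.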

Impala source code

https://github.com/cloudera/impala

Next, we will focus on Impala's source code. Personally, I think the architecture of a distributed database query engine is distinctive.

References

Cloudera Impala User Guide

Cloudera aims to bring real-time queries to Hadoop, big data

Impala: a new generation of open-source big data analysis engine

Original work. If you reprint it, please cite the source: http://blog.csdn.net/yangzhaohui168/article/details/34185579

