Impala is a new query system developed by Cloudera. It provides SQL semantics and can query petabyte-scale data stored in Hadoop HDFS and HBase. The existing Hive system also provides SQL semantics, but Hive executes queries on the MapReduce engine, so it remains a batch process and struggles to support interactive queries. In contrast, Impala's defining feature is speed: it provides a real-time SQL query interface for data stored in HDFS and HBase.
Advantages of Impala
An article on ZDNet describes some advantages of Impala. The main ones are: a familiar SQL interface, faster queries than Hive, support for multiple storage engines and file formats, rich interfaces (ODBC, JDBC, and a command-line client), open source code, and easy deployment.
Impala Architecture
The Impala solution contains the following parts:
Clients: Hue, ODBC clients, JDBC clients, and the Impala shell.
Hive MetaStore: stores schema metadata. When you create, delete, or modify a table structure, or load data into a table, the Impala nodes are notified automatically.
Cloudera Impala: runs on the data nodes to parse, schedule, and execute queries. Each Impala instance can receive and schedule queries from clients. Queries are distributed across the Impala nodes, each of which acts as a worker process: it executes its part of the query and returns the result.
HBase and HDFS: store the data that Impala queries.
[Figure: Impala architecture]
In the architecture diagram, the Impala components are shown in yellow. Impala uses the Hive SQL interface (including select, insert, join, and other operations) but currently implements only a subset of Hive SQL semantics (for example, UDFs are not supported yet). Table metadata is stored in the Hive MetaStore. Statestore is a sub-service of Impala that monitors the health of each node in the cluster and provides functions such as node registration and failure detection.
Impala runs a background service, impalad, on each node. impalad responds to external requests and performs the actual query processing. It consists of three modules: the query planner, the query coordinator, and the query exec engine. The query planner receives queries from SQL applications and ODBC clients and converts each query into many subqueries; the query coordinator distributes these subqueries to the nodes; and the query exec engine on each node executes its subquery and returns the result. The intermediate results are then aggregated and returned to the user.
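The planner, coordinator, and exec engine flow described above can be sketched as a small scatter-gather simulation. This is only an illustration under stated assumptions, not Impala's implementation: the node names, the sample rows, and the functions plan_query, execute_fragment, and coordinate are all hypothetical, and threads stand in for separate impalad processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a local slice of a table of (city, amount) rows.
NODE_DATA = {
    "h2": [("paris", 10), ("rome", 5)],
    "h3": [("paris", 7), ("oslo", 2)],
    "h4": [("rome", 1)],
}


def plan_query(nodes):
    """Planner: split a 'SUM(amount) GROUP BY city' query into one subquery per node."""
    return [{"node": node, "op": "partial_sum_by_city"} for node in nodes]


def execute_fragment(fragment):
    """Exec engine: run the subquery against the node's local rows only."""
    partial = {}
    for city, amount in NODE_DATA[fragment["node"]]:
        partial[city] = partial.get(city, 0) + amount
    return partial


def coordinate(fragments):
    """Coordinator: dispatch subqueries in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(execute_fragment, fragments))
    merged = {}
    for partial in partials:
        for city, total in partial.items():
            merged[city] = merged.get(city, 0) + total
    return merged


result = coordinate(plan_query(sorted(NODE_DATA)))
print(result)
```

Each node only ever touches its own rows, and only the small partial sums travel to the coordinator, which is the point of splitting the query this way.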
Impala Process
From the process perspective, there are three types of processes:
The Impala daemon
This is the core process of Impala; the process name is impalad. It runs on all data nodes, reads and writes data, receives query requests from clients, executes in parallel the query requests sent by other nodes in the cluster, and returns intermediate results to the scheduling node. The scheduling node then returns the final result to the client.
The Impala statestore
This is the state management process. It regularly checks the health of the Impala daemons and coordinates information among the impalad instances; Impala uses this information to locate the data to be queried. The process name is statestored, and a cluster needs only one such process. If an Impala node goes offline because of a hardware, network, software, or other failure, statestore notifies the other nodes so that queries are not distributed to the unavailable node.
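The health-tracking idea can be illustrated with a toy membership tracker. This is a minimal sketch under assumptions: the class StateStoreSketch, its timeout parameter, and the explicit "now" timestamps are hypothetical and do not reflect the actual statestored protocol.

```python
class StateStoreSketch:
    """Toy membership tracker; not Impala's actual statestored protocol."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of its last heartbeat

    def heartbeat(self, node, now):
        """Register a node, or refresh its liveness timestamp."""
        self.last_seen[node] = now

    def healthy_nodes(self, now):
        """Return the nodes whose last heartbeat is recent enough, sorted by name."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


store = StateStoreSketch(timeout_seconds=10)
for node in ("h2", "h3", "h4"):
    store.heartbeat(node, now=0)

# h3 stops reporting; h2 and h4 keep sending heartbeats.
store.heartbeat("h2", now=12)
store.heartbeat("h4", now=12)

print(store.healthy_nodes(now=15))
```

A coordinator that consults healthy_nodes before distributing work would skip h3, which mirrors how the notification keeps queries away from unavailable nodes.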
The Impala catalog service
This is the metadata management service; the process name is catalogd. It distributes changes to table metadata to the other Impala processes.
These processes can be seen in a CDH 5 environment:
Impala process distribution
Hostname | Process name
h1.worker.com | statestored and catalogd
h2.worker.com | impalad
h3.worker.com | impalad
h4.worker.com | impalad
[[email protected] ~]# hostname
h1.worker.com
[[email protected] ~]# ps -ef | grep impala
impala 14048 7910 0 04:13 ? 00:00:30 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/catalogd --flagfile=/var/run/cloudera-scm-agent/process/57-impala-CATALOGSERVER/impala-conf/catalogserver_flags
impala 14070 7910 0 04:13 ? 00:03:01 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/statestored --flagfile=/var/run/cloudera-scm-agent/process/61-impala-STATESTORE/impala-conf/state_store_flags
root 48029 31543 0 10:13 pts/0 00:00:00 grep impala
[[email protected] ~]#

[[email protected] ~]# hostname
h2.worker.com
[[email protected] ~]# ps -ef | grep impala
impala 13919 4405 0 04:13 ? 00:01:12 /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/impalad --flagfile=/var/run/cloudera-scm-agent/process/58-impala-IMPALAD/impala-conf/impalad_flags
root 24212 18173 0 10:16 pts/0 00:00:00 grep impala
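To check many hosts, the same inspection can be scripted. The sketch below is one possible approach, not a Cloudera tool: it takes the text of `ps -ef` output and reports which of the three Impala daemons appear in it. The function name impala_daemons is hypothetical, and the sample lines are copied from the h1.worker.com output above.

```python
IMPALA_DAEMONS = ("impalad", "statestored", "catalogd")


def impala_daemons(ps_output):
    """Return the set of Impala daemon names found in `ps -ef` output text."""
    found = set()
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0]      # command path without its flags
        name = executable.rsplit("/", 1)[-1]   # basename of the executable
        if name in IMPALA_DAEMONS:
            found.add(name)
    return found


sample = "\n".join([
    "impala 14048 7910 0 04:13 ? 00:00:30 "
    "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/catalogd "
    "--flagfile=/var/run/cloudera-scm-agent/process/57-impala-CATALOGSERVER/impala-conf/catalogserver_flags",
    "impala 14070 7910 0 04:13 ? 00:03:01 "
    "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/impala/sbin-retail/statestored "
    "--flagfile=/var/run/cloudera-scm-agent/process/61-impala-STATESTORE/impala-conf/state_store_flags",
    "root 48029 31543 0 10:13 pts/0 00:00:00 grep impala",
])

daemons = impala_daemons(sample)
print(sorted(daemons))
```

Matching on the executable's basename rather than on the substring "impala" keeps the `grep impala` process itself out of the result.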
Why impala is fast
I found the following explanations online for why Impala is fast:
Impala does not need to write intermediate results to the disk, saving a lot of I/O overhead.
It avoids the overhead of MapReduce job startup. MapReduce starts tasks slowly (the default heartbeat interval is 3 seconds), whereas Impala schedules work directly through its own service processes, which is much faster.
Impala abandons MapReduce entirely, a paradigm that is poorly suited to SQL queries. Instead, it borrows the design of MPP parallel databases, as Dremel does, so it can apply more query optimizations and avoid unnecessary shuffle, sort, and similar costs.
It uses LLVM to generate and compile code at run time, avoiding the overhead that general-purpose code paths carry.
It is implemented in C++ with many hardware-specific optimizations, such as the use of SSE instructions.
It uses an I/O scheduling mechanism that supports data locality, placing data and computation on the same machine whenever possible to reduce network overhead.
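The data-locality point in the list above can be made concrete with a small scheduling sketch. It is an illustration under assumptions, not Impala's scheduler: the function assign_blocks, the block names, and the replica layout are hypothetical. Each block goes to the least-loaded node that holds a local replica, and only falls back to a remote read when no such node exists.

```python
def assign_blocks(block_replicas, nodes):
    """Assign each block to the least-loaded node holding a replica, else to any node."""
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, replicas in sorted(block_replicas.items()):
        local = [n for n in replicas if n in load]
        candidates = local or list(nodes)  # no local replica means a remote read
        chosen = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


blocks = {
    "blk_1": ["h2", "h3"],
    "blk_2": ["h2", "h4"],
    "blk_3": ["h3", "h4"],
    "blk_4": ["h9"],  # the only replica is on a host that runs no query daemon
}
placement = assign_blocks(blocks, ["h2", "h3", "h4"])
print(placement)
```

In this example three of the four blocks are read locally, so only blk_4 has to cross the network.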
Impala source code
https://github.com/cloudera/impala
Next, we will focus on the Impala source code. Personally, I find the architecture of a distributed database query engine distinctive.
References
Cloudera Impala User Guide
Cloudera aims to bring real-time queries to Hadoop, big data
Impala: a new generation of open-source big data analysis engine
Original work. If you reprint it, please cite the source: http://blog.csdn.net/yangzhaohui168/article/details/34185579