Comparison between Impala and Hive

1. Impala Architecture

Impala is a real-time interactive SQL big data query tool developed by Cloudera, inspired by Google's Dremel. Instead of Hive's slow MapReduce batch processing, it uses a distributed query engine similar to those found in commercial parallel relational databases (composed of three parts: the Query Planner, Query Coordinator, and Query Exec Engine). You can use SELECT, JOIN, and aggregate functions to query data directly from HDFS or HBase, which greatly reduces latency. As shown in the architecture in Figure 1, Impala mainly consists of Impalad, the State Store, and the CLI.

Figure 1

Impalad: runs on the same nodes as the DataNodes and is represented by the impalad process. It receives query requests from clients (the Impalad that receives a query acts as the Coordinator for it; the Coordinator calls the Java front end through JNI to parse the SQL statement and generate a query plan tree, then distributes the plan fragments through the Scheduler to the other Impalads that hold the relevant data), reads and writes data, and executes queries in parallel; the results are streamed back to the Coordinator over the network, and the Coordinator returns them to the client. Impalad also maintains a connection with the State Store to determine which Impalads are healthy and can accept new work. Impalad starts three Thrift servers: beeswax_server (for client connections), hs2_server (HiveServer2-compatible, borrowing Hive's metadata), and be_server (used internally between Impalads), plus an ImpalaServer service.


Impala State Store: tracks the health and location of every Impalad in the cluster and is represented by the statestored process. It creates multiple threads to handle Impalad registration and subscription and to maintain a heartbeat connection with each Impalad; each Impalad caches a copy of the State Store's information. When the State Store goes offline, an Impalad that detects this enters recovery mode and repeatedly tries to re-register; once the State Store rejoins the cluster, the Impalads automatically return to normal and refresh their caches. Because each Impalad holds a State Store cache, it can keep working while the State Store is down, but the cached data cannot be updated, so if some Impalads fail during that time, execution plans may still be assigned to the failed Impalads and the corresponding queries will fail.


CLI: provides the command line tool for user queries (the Impala Shell is implemented in Python); Impala also provides Hue, JDBC, and ODBC interfaces.
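
As a concrete illustration of these interfaces, the sketch below queries Impala programmatically over its HiveServer2-compatible endpoint. It is a minimal sketch, assuming the third-party Python client impyla (which the article does not mention) and placeholder connection details.

    # Minimal sketch: query Impala through its HiveServer2-compatible
    # interface using the third-party impyla client (an assumption; the
    # article only mentions the Impala Shell, Hue, JDBC, and ODBC).
    from impala.dbapi import connect

    # "impalad-host" is a placeholder; 21050 is the HS2-compatible port.
    conn = connect(host="impalad-host", port=21050)
    cur = conn.cursor()

    # Whichever Impalad we connect to becomes the Coordinator for the
    # query, as described above.
    cur.execute("SELECT count(*) FROM customer_small")
    print(cur.fetchall())

    cur.close()
    conn.close()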

2. Relationship with Hive

Impala and Hive are both data query tools built on Hadoop, but each suits different needs. From the client's point of view they have much in common: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, and storage resource pools. The relationship between Impala and Hive within Hadoop is shown in Figure 2. Hive is suited to long-running batch queries and analysis, while Impala is suited to real-time interactive SQL queries; Impala gives data analysts a big data tool for quickly experimenting with and verifying ideas. You can use Hive for data conversion and then use Impala to perform fast analysis on the result dataset produced by Hive.

Figure 2
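
To make the Hive-then-Impala workflow concrete, here is a minimal sketch. It assumes a Hive batch job has already materialized a result table (the table and column names are hypothetical) and reuses the impyla client from the sketch above; depending on the Impala version, REFRESH (or INVALIDATE METADATA in later releases) tells Impala to pick up metadata for the table Hive just wrote.

    # Sketch: analyze a Hive-produced result table from Impala.
    # Assumes a Hive batch job has already run something like:
    #   CREATE TABLE daily_summary AS SELECT ... FROM raw_logs GROUP BY ...;
    # The table and column names are hypothetical.
    from impala.dbapi import connect

    conn = connect(host="impalad-host", port=21050)
    cur = conn.cursor()

    # Impala and Hive share the same metastore, so Impala can see the
    # table once its metadata cache is refreshed.
    cur.execute("REFRESH daily_summary")

    # Fast interactive analysis on the dataset Hive produced.
    cur.execute("SELECT category, SUM(total) FROM daily_summary GROUP BY category")
    for row in cur.fetchall():
        print(row)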


3. Impala Query Processing

Impalad is divided into a Java front end and a C++ backend. The Impalad that accepts the client connection serves as the Coordinator for that query; the Coordinator calls the Java front end through JNI to parse the user's SQL and generate an execution plan tree. Different operations correspond to different plan nodes, such as SelectNode, ScanNode, SortNode, AggregationNode, and HashJoinNode.

Each atomic operation in the execution plan tree is represented by a PlanFragment, and a query statement usually consists of multiple PlanFragments. PlanFragment 0 is the root of the execution tree, where the aggregated result is returned to the user; the leaf nodes of the tree are generally Scan operations, which execute in parallel in a distributed manner.

The execution plan tree generated by the Java front end is returned to the Impala C++ backend (the Coordinator) in Thrift format. The plan is divided into multiple stages, each called a PlanFragment; each PlanFragment can be executed in parallel by multiple Impalad instances (some PlanFragments, such as a final aggregation, can only be executed by a single instance), and together the fragments form the execution plan tree. Based on the plan and the data's location (Impala interacts with HDFS through libhdfs, using the hdfsGetHosts method to obtain the nodes where each file block resides), the Coordinator uses the scheduler (currently only simple-scheduler, which uses a round-robin algorithm) in Coordinator::Exec to assign the generated plan fragments to the corresponding backend executor Impalads (queries use LLVM for code generation, compilation, and execution; how this improves performance is covered in Section 4 below). The Coordinator calls the GetNext() method to pull computed results; if the statement is an INSERT, the results are written back to HDFS through libhdfs. When all input data has been consumed, execution ends and the query is deregistered.
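
The GetNext()-driven execution described above is essentially the classic pull-based (Volcano-style) iterator model. Below is a minimal sketch of that idea; the node classes and data are illustrative inventions, not Impala's actual ExecNode interfaces.

    # Minimal sketch of the pull-based (Volcano-style) execution model the
    # article describes: each plan node exposes get_next(), and the consumer
    # (ultimately the Coordinator) pulls rows upward through the tree.
    # These classes are illustrative, not Impala's real classes.

    class ScanNode:
        """Leaf node: streams rows from storage (here, a Python list)."""
        def __init__(self, rows):
            self._it = iter(rows)
        def get_next(self):
            return next(self._it, None)  # None signals end of stream

    class AggregationNode:
        """Consumes its child completely, then streams aggregated rows."""
        def __init__(self, child):
            self.child = child
            self._result = None
        def get_next(self):
            if self._result is None:
                total = count = 0
                row = self.child.get_next()
                while row is not None:
                    total += row
                    count += 1
                    row = self.child.get_next()
                self._result = iter([(total, count)])
            return next(self._result, None)

    # The "Coordinator" pulls from the root until the stream is exhausted.
    root = AggregationNode(ScanNode([3, 1, 4, 1, 5]))
    row = root.get_next()
    while row is not None:
        print(row)            # (14, 5): SUM and COUNT
        row = root.get_next()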


The query process of Impala is roughly as shown in Figure 3:

Figure 3

The following uses a SQL query as an example to analyze the Impala query process. For example, select sum(id), count(id), avg(id) from customer_small group by id; generates the following plan:

Plan Fragment 0
  PARTITION: UNPARTITIONED

  4: EXCHANGE
     Tuple ids: 1

Plan Fragment 1
  PARTITION: HASH_PARTITIONED: id

  STREAM DATA SINK
    Exchange id: 4
    UNPARTITIONED

  3: AGGREGATE
  |  Output: SUM(SUM(id)), SUM(COUNT(id))
  |  Group by: id
  |  Tuple ids: 1
  |
  2: EXCHANGE
     Tuple ids: 1

Plan Fragment 2
  PARTITION: RANDOM

  STREAM DATA SINK
    Exchange id: 2
    HASH_PARTITIONED: id

  1: AGGREGATE
  |  Output: SUM(id), COUNT(id)
  |  Group by: id
  |  Tuple ids: 1
  |
  0: SCAN HDFS
     Table=default.customer_small #partitions=1 size=193B
     Tuple ids: 0

As shown in the execution plan tree in Figure 4, the parts marked in green can be executed in parallel in a distributed manner:


Figure 4
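
The shape of this plan is a standard two-phase distributed aggregation: fragment 2 computes partial SUM(id) and COUNT(id) over each node's local scan ranges, the partial results are hash-partitioned on the group key and sent to fragment 1, which merges them, and avg(id) is derived at the end as SUM divided by COUNT. Below is a minimal sketch of that arithmetic; the data and the two-node split are made up for illustration.

    # Sketch of the two-phase aggregation the plan above expresses.
    # Each "backend" computes partial aggregates over its local rows
    # (plan fragment 2); the merge phase (plan fragment 1) combines them.
    from collections import defaultdict

    node1_rows = [1, 1, 2]          # ids scanned by one Impalad (made up)
    node2_rows = [2, 2, 3]          # ids scanned by another Impalad

    def partial_agg(rows):
        """Fragment 2: per-node partial SUM(id), COUNT(id) grouped by id."""
        acc = defaultdict(lambda: [0, 0])
        for ident in rows:
            acc[ident][0] += ident   # partial SUM(id)
            acc[ident][1] += 1       # partial COUNT(id)
        return acc

    def merge_agg(partials):
        """Fragment 1: merge partials -> SUM(SUM(id)), SUM(COUNT(id))."""
        merged = defaultdict(lambda: [0, 0])
        for part in partials:
            for ident, (s, c) in part.items():
                merged[ident][0] += s
                merged[ident][1] += c
        return merged

    merged = merge_agg([partial_agg(node1_rows), partial_agg(node2_rows)])
    for ident, (s, c) in sorted(merged.items()):
        # Fragment 0 returns SUM, COUNT, and AVG = SUM / COUNT per group.
        print(ident, s, c, s / c)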

4. Optimization Techniques Used by Impala Relative to Hive

1. MapReduce is not used for parallel computing. Although MapReduce is a good parallel computing framework, it is oriented more to batch processing than to interactive SQL execution. Compared with MapReduce, Impala turns the entire query into an execution plan tree rather than a series of MapReduce tasks. After the plan is distributed, Impala obtains results by pulling data: result data is aggregated by streaming along the execution tree, which avoids the overhead of writing intermediate results to disk and reading them back. Impala also runs as a long-lived service, avoiding the startup overhead incurred by every query execution; that is, unlike Hive, there is no MapReduce startup time.

2. LLVM is used to generate running code: code specific to each query is generated, and inlining is used to reduce function-call overhead and speed up execution (a conceptual sketch follows this list).

3. Available hardware instructions (SSE4.2) are fully utilized.

4. Better I/O scheduling: Impala knows the disk location of each data block, so it can make better use of multiple disks; it also supports direct data block reads and native-code checksum computation.

5. The best performance can be obtained by choosing an appropriate storage format for the data (Impala supports multiple storage formats).

6. Maximum use of memory: intermediate results are not written to disk, but are transmitted promptly between stages over the network in streaming fashion.
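
As a loose analogy to point 2 above, the following sketch contrasts interpreting a generic expression tree for every row with compiling a query-specific predicate once and running the compiled form per row. Python's built-in compile()/eval() stand in for LLVM here purely as an illustration; the predicate and data are made up.

    # Conceptual analogy to point 2 (LLVM code generation): instead of
    # interpreting a generic expression tree for every row, compile the
    # query's predicate into specialized code once, then run it per row.
    # Python's compile() stands in for LLVM; this is an analogy only.

    rows = [{"id": i, "amount": i * 10} for i in range(5)]   # made-up data

    # Generic interpreter: walks a small expression tree for every row.
    pred_tree = ("gt", "amount", 20)
    def interpret(tree, row):
        op, col, const = tree
        if op == "gt":
            return row[col] > const
        raise ValueError(op)

    # "Code generation": build source text specific to this query, compile
    # it once, and evaluate the compiled code per row (no tree-walking).
    src = "row['amount'] > 20"
    code = compile(src, "<query>", "eval")
    def compiled_pred(row):
        return eval(code, {}, {"row": row})

    assert [interpret(pred_tree, r) for r in rows] == [compiled_pred(r) for r in rows]
    print([r["id"] for r in rows if compiled_pred(r)])   # ids with amount > 20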

5. Similarities and Differences between Impala and Hive

Data Storage: both use the same storage data pool and can store data in HDFS or HBase.


Metadata: The two use the same metadata.


SQL Interpretation: similar in both; an execution plan is generated from the query through lexical and syntactic analysis.


Execution Plan:
Hive: depends on the MapReduce execution framework. The execution plan is divided into stages of map -> shuffle -> reduce -> map -> shuffle -> reduce... If a query is compiled into multiple rounds of MapReduce, more intermediate results are written to disk; because of how the MapReduce framework works, too many intermediate stages increase the total execution time of the query.
Impala: represents the execution plan as a complete execution plan tree, which can be distributed naturally to each Impalad for execution, rather than being forced into a pipelined map -> reduce pattern as in Hive. This gives Impala better concurrency and avoids unnecessary intermediate sorts and shuffles.


Data Stream:
Hive: uses a push model; each computing node pushes data to its downstream node after finishing its computation.
Impala: uses a pull model; a downstream node actively requests data from its upstream node through GetNext (see the iterator sketch in Section 3). Data can thus be streamed back to the client as soon as it is processed, rather than only after all processing completes, which is better suited to interactive SQL queries.


Memory Usage:
Hive: if not all data fits in memory during execution, external storage is used so that the query can still run to completion. At the end of each round of MapReduce, intermediate results are written to HDFS; likewise, because of the MapReduce architecture, the shuffle phase also writes to local disk.
Impala: when the data does not fit in memory, the current version (1.0.1) simply returns an error rather than spilling to external storage; later versions should improve this. This currently places certain limits on the queries Impala can process, so it is best used together with Hive. Impala transmits data between stages over the network and performs no disk writes during execution (except for INSERT).


Scheduling:
Hive: task scheduling depends on Hadoop's scheduling policy.
Impala: does its own scheduling. There is currently only one scheduler, simple-scheduler, which tries to satisfy data locality so that the process scanning the data is as close as possible to the machine where the data physically resides. The scheduler is still fairly simple: in SimpleScheduler::GetBackend you can see that it does not yet take factors such as load or network I/O into account. However, Impala does collect statistics on execution performance, and future versions should use these statistics for scheduling (see the sketch below).
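
The following is a minimal sketch of the locality-first, round-robin-fallback assignment idea described above. The data structures and host names are invented for illustration and do not reflect SimpleScheduler's actual implementation.

    # Sketch of a simple-scheduler-like assignment: prefer a backend on a
    # host that holds a replica of the scan range (data locality); otherwise
    # fall back to round robin. Invented structures, not Impala's real code.
    import itertools

    backends = ["host1", "host2", "host3"]          # Impalad locations
    rr = itertools.cycle(backends)                   # round-robin fallback

    # Each scan range lists the hosts holding replicas of its data block
    # (the kind of information hdfsGetHosts would report).
    scan_ranges = [
        {"range": "blk_1", "replicas": ["host2", "host3"]},
        {"range": "blk_2", "replicas": ["host9"]},   # no local backend
        {"range": "blk_3", "replicas": ["host1"]},
    ]

    assignment = {}
    for sr in scan_ranges:
        local = [h for h in sr["replicas"] if h in backends]
        # Prefer a local backend; otherwise take the next one round robin.
        assignment[sr["range"]] = local[0] if local else next(rr)

    print(assignment)  # {'blk_1': 'host2', 'blk_2': 'host1', 'blk_3': 'host1'}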


Fault Tolerance:
Hive: relies on Hadoop's fault tolerance capabilities.
Impala: there is no fault-tolerance logic within a single query; if a failure occurs during execution, an error is returned directly (this follows from Impala's design: it targets real-time queries, so if a query fails, the cost of simply running it again is low). At the service level, however, Impala is quite fault tolerant: all Impalads are equivalent, and a query can be submitted to any of them. If one Impalad fails, all queries running on it fail, but they can be resubmitted to another Impalad without affecting the service. There is currently only one State Store; if it fails, the service keeps running because each Impalad caches the State Store's information, but the cluster state can no longer be updated, so execution tasks may be assigned to a failed Impalad and those queries will fail.


Applicability:
Hive: complex batch query tasks and data conversion tasks.
Impala: real-time data analysis. Because it does not support UDFs, the problem domain it can handle is limited; it works well together with Hive, performing real-time analysis on result datasets produced by Hive.

6. Advantages and Disadvantages of Impala

Advantages:

  1. Supports SQL queries, enabling fast queries over big data.
  2. Can query existing data in place, reducing data loading and conversion.
  3. Offers a choice of multiple storage formats (Parquet, Text, Avro, RCFile, and SequenceFile).
  4. Can be used together with Hive.

Disadvantages:

  1. UDFs are not supported.
  2. Full-text search on text fields is not supported.
  3. Transforms are not supported.
  4. Fault tolerance during query execution is not supported.
  5. Memory requirements are high.
