Comparison Between Impala and Hive

1. Impala Architecture
Impala is a real-time interactive SQL big data query tool developed by Cloudera, inspired by Google's Dremel. Instead of the slow Hive + MapReduce batch-processing path, Impala uses a distributed query engine similar to those found in commercial parallel relational databases (composed of three parts: query planner, query coordinator, and query exec engine), so SELECT, JOIN, and aggregate queries can run directly against data in HDFS or HBase, greatly reducing latency. As shown in Figure 1, Impala consists mainly of Impalad, the State Store, and the CLI.

Figure 1

Impalad: runs on the same node as the DataNode and is represented by the impalad process. It receives query requests from clients (the impalad that receives a request becomes the Coordinator for that query; the Coordinator calls the Java front end through JNI to parse the SQL statement and generate a query plan tree, then uses the scheduler to distribute the plan to the other impalads that hold the relevant data), reads and writes data, and executes queries in parallel; the results are streamed back over the network to the Coordinator, which returns them to the client. Impalad also keeps a connection to the State Store so it knows which impalads are healthy and can accept new work. Impalad starts three Thrift servers: beeswax_server (client connections), hs2_server (a HiveServer2-compatible interface), and be_server (used internally between impalads), plus an ImpalaServer service.
Impala State Store: tracks the health and location of every impalad in the cluster and is represented by the statestored process. It creates multiple threads to handle impalad registrations and subscriptions and keeps a heartbeat connection with each impalad; each impalad caches a copy of the State Store's information. When the State Store goes offline, an impalad that detects this enters recovery mode and repeatedly tries to re-register; once the State Store rejoins the cluster, it automatically returns to normal and refreshes the cached data. Because each impalad holds a State Store cache, work can continue while the State Store is down, but if some impalad then fails, the cached data cannot be updated, and a plan fragment may be assigned to a dead impalad, causing the query to fail.
CLI: the command-line tool for user queries (the Impala shell is implemented in Python). Impala also provides Hue, JDBC, and ODBC interfaces.
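
Besides the shell, a client can talk to the hs2_server port programmatically. A minimal sketch using the third-party impyla package (the host name is made up; 21050 is the usual HiveServer2-compatible port, but check your deployment):

    from impala.dbapi import connect  # third-party "impyla" package

    # Connect to any impalad; that node acts as coordinator for our queries.
    conn = connect(host='impala-host', port=21050)  # hypothetical host name
    cur = conn.cursor()
    cur.execute('SELECT COUNT(*) FROM customer_small')
    print(cur.fetchall())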

2. Relationship with Hive
Impala and Hive are both data query tools built on Hadoop, each suited to a different problem area, but they have a great deal in common on the client side: shared table metadata, ODBC/JDBC drivers, similar SQL syntax, flexible file formats, and the same storage resource pools. Figure 2 shows how Impala and Hive relate within Hadoop. Hive is suitable for long-running batch queries and analysis, while Impala is suitable for real-time interactive SQL queries; Impala gives data analysts a big data tool for quickly experimenting with and validating ideas. You can use Hive for data conversion and then use Impala for fast analysis of the result dataset Hive produced, as sketched below.
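
A minimal sketch of that division of labor, again via impyla (the table, column, and host names are hypothetical, and the exact metadata-refresh statement depends on the Impala version):

    from impala.dbapi import connect

    # Step 1 runs in Hive (e.g. via the hive CLI): a slow batch conversion such as
    #   INSERT OVERWRITE TABLE daily_summary SELECT ... FROM raw_events GROUP BY ...;

    # Step 2: interactive analysis in Impala on the result set Hive produced.
    cur = connect(host='impala-host', port=21050).cursor()
    cur.execute('REFRESH daily_summary')  # let Impala see the table Hive wrote
    cur.execute('SELECT dt, SUM(clicks) FROM daily_summary GROUP BY dt')
    for row in cur.fetchall():
        print(row)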

Figure 2


3. Impala Query Processing Flow
Impalad is divided into a Java front end and a C++ back end. The impalad that accepts the client connection acts as the Coordinator for that query. The Coordinator calls the Java front end through JNI to parse the user's SQL and produce an execution plan tree; different operations correspond to different plan nodes, such as SelectNode, ScanNode, SortNode, AggregationNode, and HashJoinNode.

The execution plan is divided into stages, each represented by a PlanFragment, and a query statement usually consists of several plan fragments; the whole plan forms an execution plan tree. Plan Fragment 0 is the root of the tree, where the aggregated result is returned to the user; the leaves are usually Scan operations, which execute distributed and in parallel. Each plan fragment can be executed by multiple impalad instances in parallel (some fragments, such as a final aggregation, can only run on one instance).

The Java front end returns the execution plan tree to the C++ back end (the Coordinator) in Thrift format. Based on the plan, the Coordinator obtains data-location information (Impala interacts with HDFS through libhdfs, using the hdfsGetHosts method to find the nodes holding each file block) and then, via the scheduler (currently only simple-scheduler, which uses a round-robin algorithm), Coordinator::Exec assigns the plan fragments to the corresponding back-end impalads for execution (queries use LLVM for code generation, compilation, and execution, which improves performance). The Coordinator calls the GetNext() method to obtain the computed results; if the statement is an INSERT, the results are written back to HDFS through libhdfs. When all input data has been consumed, execution ends and the query is deregistered.
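
As a rough, runnable model of this flow (purely illustrative; Impala's real back end is C++, not Python): leaf scan fragments run in parallel near their data and stream batches up to the coordinator, which hands rows to the client incrementally rather than materializing intermediate results on disk.

    from concurrent.futures import ThreadPoolExecutor

    def scan_fragment(block):              # leaf fragment: runs where the block lives
        return [row * 2 for row in block]  # stand-in for per-row scan work

    def coordinator(blocks):
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(scan_fragment, b) for b in blocks]  # distribute
            for fut in futures:
                for row in fut.result():   # GetNext()-style consumption
                    yield row              # stream results back, no disk round trip

    print(list(coordinator([[1, 2], [3, 4], [5]])))  # -> [2, 4, 6, 8, 10]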
The following figure shows Impala's query process:

Figure 3

The following uses a SQL query statement to walk through the Impala query process, for example: SELECT SUM(id), COUNT(id), AVG(id) FROM customer_small GROUP BY id; The plan generated for this statement is:

Plan Fragment 0
  Partition: unpartitioned

  4: Exchange
     tuple ids: 1

Plan Fragment 1
  Partition: hash_partitioned: <slot 1>

  Stream data sink
    Exchange ID: 4
    unpartitioned

  3: Aggregate
  |  output: sum(<slot 2>), sum(<slot 3>)
  |  group by: <slot 1>
  |  tuple ids: 1
  |
  2: Exchange
     tuple ids: 1

Plan Fragment 2
  Partition: random

  Stream data sink
    Exchange ID: 2
    hash_partitioned: <slot 1>

  1: Aggregate
  |  output: sum(id), count(id)
  |  group by: id
  |  tuple ids: 1
  |
  0: Scan HDFS
     table=default.customer_small #partitions=1 size=193B
     tuple ids: 0

As the execution plan tree in Figure 4 shows, the parts marked in green can run distributed and in parallel:


Figure 4
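
The plan also shows why AVG(id) never appears verbatim in the fragments: fragment 2 produces per-group partial sums and counts, fragment 1 merely adds those partials together (hence sum(<slot 2>), sum(<slot 3>)), and the average falls out as sum/count at the end. A minimal sketch of this two-phase aggregation (the sample rows are made up):

    from collections import defaultdict

    # Phase 1 (plan fragment 2): each impalad aggregates its local HDFS blocks,
    # producing per-group partial results that are hash-partitioned by id.
    def partial_agg(rows):
        acc = defaultdict(lambda: [0, 0])  # id -> [sum, count]
        for id_ in rows:
            acc[id_][0] += id_
            acc[id_][1] += 1
        return acc

    # Phase 2 (plan fragment 1): merge the partials; note it only sums the
    # pre-aggregated slots, matching "output: sum(<slot 2>), sum(<slot 3>)".
    def merge_agg(partials):
        merged = defaultdict(lambda: [0, 0])
        for part in partials:
            for id_, (s, c) in part.items():
                merged[id_][0] += s
                merged[id_][1] += c
        return {id_: (s, c, s / c) for id_, (s, c) in merged.items()}  # avg = sum/count

    node1 = partial_agg([1, 2, 2, 3])  # rows scanned on one impalad
    node2 = partial_agg([2, 3, 3])     # rows scanned on another
    print(merge_agg([node1, node2]))   # {1: (1, 1, 1.0), 2: (6, 3, 2.0), 3: (9, 3, 3.0)}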

4. Optimization Techniques Impala Uses Relative to Hive
  1. No MapReduce for parallel computing. MapReduce is a fine parallel framework, but it is oriented toward batch processing rather than interactive SQL. Instead of compiling a query into a series of MapReduce jobs, Impala compiles it into an execution plan tree; after distributing the plan, it obtains results by pulling data, streaming and aggregating the result data up the tree, which avoids writing intermediate results to disk and reading them back. Impala also runs as a long-lived service, avoiding per-query startup overhead; that is, it has no MapReduce job-launch time, unlike Hive.
  2. LLVM is used to generate execution code specialized to each query, and inlining reduces function-call overhead and speeds up execution.
  3. Available hardware instructions (SSE4.2) are fully exploited.
  4. Better I/O scheduling: Impala knows which disk each data block resides on and can make better use of multiple disks; it also supports direct block reads and local checksum computation.
  5. Best performance can be obtained by choosing an appropriate storage format (Impala supports multiple storage formats).
  6. Maximum use of memory: intermediate results are not written to disk but are promptly streamed over the network.

5. Similarities and Differences Between Impala and Hive
Data storage: both use the same storage pool and support data in HDFS and HBase.
Metadata: both use the same metadata store.
SQL interpretation: fairly similar in both; an execution plan is generated through lexical analysis.
Execution plan:
Hive: depends on the MapReduce execution framework. The plan is a chain of map -> shuffle -> reduce -> map -> shuffle -> reduce... stages. If a query compiles into multiple rounds of MapReduce, more intermediate results are written to disk; because of how the MapReduce framework executes, these extra intermediate stages lengthen the running time of the whole query.
Impala: represents the plan as a complete execution plan tree, which distributes more naturally to the individual impalads than Hive's chained map -> reduce pipeline; this gives Impala better parallelism and avoids unnecessary intermediate sorts and shuffles.
Data flow:
Hive: uses a push model; after each compute node finishes its computation, it actively pushes its data to the downstream node.
Impala: uses a pull model; a downstream node requests data from its upstream node by calling GetNext(). Results can therefore be streamed back to the client as soon as one batch of data is processed, instead of waiting for all processing to finish, which is better suited to interactive SQL queries.
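
A toy contrast of the pull model using Python generators (illustrative only, not Impala code): the downstream operator pulls one row at a time, so partial results can reach the client immediately.

    # Downstream pulls from upstream; nothing waits for full completion.
    def scan():                    # upstream node
        for row in [3, 1, 2]:
            yield row

    def agg(upstream):             # downstream node
        total = 0
        for row in upstream:       # each iteration is a GetNext()-style request
            total += row
            yield total            # a partial result can stream out right away

    for running_total in agg(scan()):  # client sees 3, 4, 6 as they are produced
        print(running_total)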
Memory usage:
Hive: if the data does not all fit in memory during execution, external (disk) storage is used so the query can still complete. At the end of each MapReduce round, intermediate results are written to HDFS; likewise, because of how MapReduce executes, the shuffle phase also writes data to local disk.
Impala: when the data does not fit in memory, the current version (1.0.1) simply returns an error instead of spilling to external storage; later versions should improve this. This places some limits on the queries Impala can handle, so it is best used together with Hive. Impala transfers data between stages over the network and performs no disk writes during execution (except for INSERT).
Scheduling:
Hive: task scheduling depends on Hadoop's scheduling policy.
Impala: does its own scheduling. There is currently only one scheduler, simple-scheduler, which tries to satisfy data locality: the process scanning the data should be as close as possible to the machine where the data physically resides. The scheduler is still quite simple; as SimpleScheduler::GetBackend shows, it does not yet take load, network I/O status, or other factors into account. However, Impala does collect performance statistics during execution, and later versions should use that statistical information for scheduling.
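
A toy sketch of that policy, locality first with a round-robin fallback (host names are made up; the real logic lives in SimpleScheduler in C++):

    from itertools import cycle

    backends = ['node1', 'node2', 'node3']  # healthy impalads (hypothetical)
    fallback = cycle(backends)              # round-robin over all backends

    def assign(block_replica_hosts):
        """Prefer a backend that holds a replica of the block (data locality);
        otherwise fall back to round-robin, as simple-scheduler does."""
        for host in block_replica_hosts:
            if host in backends:
                return host
        return next(fallback)

    print(assign(['node2', 'node9']))  # -> node2 (local read)
    print(assign(['node8']))           # no local replica -> round-robin pick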
Fault tolerance:
Hive: relies on Hadoop's fault-tolerance capabilities.
Impala: has no fault-tolerance logic within a query; if something goes wrong during execution, an error is returned directly (this matches Impala's design: it targets real-time queries, so the cost of simply rerunning a failed query is low). In a broader sense, though, Impala degrades gracefully: all impalads are equivalent, and a query can be submitted to any of them. If one impalad fails, the queries running on it fail, but the user can resubmit them to be executed by another impalad without loss of service. There is currently only a single State Store, but even if it fails, service continues: every impalad caches the State Store's information and keeps working. The cluster state can no longer be updated, however, so a running task may be assigned to a failed impalad, causing that query to fail.
Applicability:
Hive: complex batch queries and data conversion tasks.
Impala: real-time data analysis. Because UDFs are not supported, the problem domains it can handle are somewhat limited; it can be used together with Hive to perform real-time analysis on Hive's result datasets.

6. Advantages and Disadvantages of Impala

Advantages:

  1. Supports SQL and fast querying of big data.
  2. Can query existing data in place, reducing data loading and conversion.
  3. Offers a choice of storage formats (Parquet, text, Avro, RCFile, and SequenceFile).
  4. Works well together with Hive.

Disadvantages:

  1. UDFs are not supported.
  2. Full-text search on text fields is not supported.
  3. Transform is not supported.
  4. There is no fault tolerance during query execution.
  5. Memory requirements are high.

Original article: http://tech.uc.cn/?p=1803
