Comparative analysis of Impala and Hive

Last Update:2017-02-27 Source: Internet

Author: User

1. Impala Architecture

Impala is a real-time, interactive SQL query tool for big data, developed by Cloudera and inspired by Google's Dremel. Impala does not use the slow Hive + MapReduce batch processing path. Instead, it uses a distributed query engine similar to those of commercial parallel relational databases (composed of a query planner, a query coordinator, and a query exec engine) to query data in HDFS or HBase directly with SELECT, JOIN, and aggregate functions, which greatly reduces latency. The architecture is shown in Figure 1. Impala consists primarily of impalad, the state store, and the CLI.

Figure 1

Impalad: Runs on the same node as the DataNode and is represented by the impalad process. It receives query requests from clients, and the impalad that receives a given request acts as the coordinator for that query. The coordinator calls the Java front end through JNI to parse the SQL statement and generate a query plan tree, and the scheduler then distributes the execution plan to the other impalads that hold the corresponding data. Each impalad reads and writes data, executes its part of the query in parallel, and streams its results back to the coordinator over the network. The coordinator then returns the results to the client. Impalad also maintains a connection with the state store to determine which impalads are healthy and can accept new work. Impalad starts three Thrift servers, beeswax_server (client connections), hs2_server (HiveServer2-compatible interface, borrowed from Hive), and be_server (internal use between impalads), as well as an ImpalaServer service.
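The coordinator flow described above (distribute plan fragments to the nodes that hold the data, run them in parallel, merge the partial results) can be illustrated with a small simulation. This is only a sketch in plain Python, not Impala code: the node names, the `LOCAL_ROWS` placement table, and the functions `execute_fragment` and `coordinate` are all hypothetical, and the "query" is a hard-coded grouped sum.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data placement: the rows each impalad can read locally.
LOCAL_ROWS = {
    "impalad-1": [("a", 3), ("b", 5)],
    "impalad-2": [("a", 4), ("c", 1)],
    "impalad-3": [("b", 2)],
}


def execute_fragment(node):
    """Run one plan fragment: a partial SUM(value) GROUP BY key on local rows."""
    partial = {}
    for key, value in LOCAL_ROWS[node]:
        partial[key] = partial.get(key, 0) + value
    return partial


def coordinate(healthy_nodes):
    """Act as coordinator: send fragments to nodes with data, merge the results."""
    targets = [node for node in healthy_nodes if node in LOCAL_ROWS]
    # Fragments run in parallel, one worker per target node.
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        partials = list(pool.map(execute_fragment, targets))
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged


result = coordinate(["impalad-1", "impalad-2", "impalad-3"])
print(sorted(result.items()))
```

In the real system the coordinator is whichever impalad received the client request, and the list of healthy nodes comes from the state store rather than from a function argument.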

Impala State Store: Tracks the health status and location of the impalads in the cluster and is represented by the statestored process. It creates multiple threads to handle impalad registrations and subscriptions and maintains a heartbeat connection with each impalad. Each impalad caches a copy of the information held in the state store. When the state store goes offline, the impalads can keep working from this cache: an impalad that detects the outage enters recovery mode and retries registration, and once the state store rejoins the cluster the impalad returns to normal automatically and refreshes its cached data. While the state store is offline, however, the cached data cannot be updated. If some impalad fails during that time, the execution plan may still be assigned to the failed impalad, and the query fails.
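The failure mode described above (a stale membership cache while the state store is offline) can be reproduced with a toy model. This is a minimal sketch under stated assumptions, not Impala's actual protocol: the classes `StateStore` and `Impalad`, their methods, and the node names are hypothetical, and heartbeats are reduced to an explicit `refresh()` call.

```python
class StateStore:
    """Toy stand-in for statestored: tracks which impalads are registered."""

    def __init__(self, members):
        self.members = set(members)
        self.online = True

    def snapshot(self):
        if not self.online:
            raise ConnectionError("state store is offline")
        return set(self.members)


class Impalad:
    """Toy impalad that schedules work from its cached membership list."""

    def __init__(self, statestore):
        self.statestore = statestore
        self.cached_members = statestore.snapshot()

    def refresh(self):
        """Try to update the cache; keep the old copy if the state store is down."""
        try:
            self.cached_members = self.statestore.snapshot()
            return True
        except ConnectionError:
            return False

    def run_query(self, alive_nodes):
        """Fail if the plan would be assigned to a node that is actually dead."""
        dead = self.cached_members - set(alive_nodes)
        if dead:
            raise RuntimeError(f"fragment assigned to failed node(s): {sorted(dead)}")
        return "ok"


store = StateStore({"n1", "n2", "n3"})
coordinator = Impalad(store)

store.online = False                                # state store goes offline
print(coordinator.refresh())                        # cache cannot be updated
print(coordinator.run_query({"n1", "n2", "n3"}))    # cached data is still valid

try:
    coordinator.run_query({"n1", "n2"})             # n3 died during the outage
except RuntimeError as err:
    print(err)

store.members.discard("n3")                         # state store returns without n3
store.online = True
coordinator.refresh()
print(coordinator.run_query({"n1", "n2"}))
```

The query only fails in the window where the cache still lists a node that has since died. Once the state store is back and the cache is refreshed, scheduling works again.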

CLI: A command-line tool for running user queries (the Impala shell is implemented in Python). Impala also provides Hue, JDBC, and ODBC interfaces.
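As an illustration of the command-line route, the sketch below assembles an impala-shell invocation that runs a single query and exits. It relies on the shell's `-i` (impalad host and port) and `-q` (query string) options and on 21000 as the usual impala-shell port. The helper `impala_shell_command`, the host name, and the table `web_logs` are hypothetical, and the command is only built here, not executed.

```python
import shlex


def impala_shell_command(host, query, port=21000):
    """Build an impala-shell argument list that runs one query and exits."""
    return ["impala-shell", "-i", f"{host}:{port}", "-q", query]


# Hypothetical host and table, for illustration only.
cmd = impala_shell_command(
    "impalad-host.example.com",
    "SELECT COUNT(*) FROM web_logs",
)

# Show the command as it would be typed in a terminal.
print(shlex.join(cmd))
```

On a machine with impala-shell installed, the list could be passed to `subprocess.run(cmd)`. Keeping the arguments as a list avoids shell-quoting problems with the SQL text.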

2. Relationship with Hive

Impala and Hive are both data query tools built on Hadoop, each with a different focus. From the client's perspective, however, they have a lot in common: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, storage resource pools, and so on. The relationship between Impala and Hive in Hadoop is shown in Figure 2. Hive is suited to long-running batch query analysis, while Impala is suited to real-time interactive SQL queries and gives data analysts a big data analysis tool for quickly experimenting with and validating ideas. A common pattern is to use Hive for data transformation and then use Impala for fast analysis of the result dataset that Hive produces.
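The "transform in Hive, analyze in Impala" pattern can be written down as an ordered list of statements. This is a sketch under assumptions: the tables `raw_logs` and `clean_logs`, their columns, and the helper `statements_for` are hypothetical, and the statements are only listed, not sent to a cluster. The `INVALIDATE METADATA` step reflects that Impala has to reload its view of the shared metadata before it can see a table created through Hive.

```python
# Each step is (engine, SQL statement); table and column names are hypothetical.
PIPELINE = [
    # 1. Batch transformation in Hive, writing a Parquet result table.
    ("hive",
     "CREATE TABLE clean_logs STORED AS PARQUET AS "
     "SELECT user_id, url, ts FROM raw_logs WHERE url IS NOT NULL"),
    # 2. Make the table created by Hive visible to Impala.
    ("impala", "INVALIDATE METADATA clean_logs"),
    # 3. Interactive analysis in Impala on the Hive output.
    ("impala",
     "SELECT url, COUNT(*) AS hits FROM clean_logs "
     "GROUP BY url ORDER BY hits DESC LIMIT 10"),
]


def statements_for(engine):
    """Return the statements, in order, that should run on the given engine."""
    return [sql for eng, sql in PIPELINE if eng == engine]


for engine, sql in PIPELINE:
    print(f"[{engine}] {sql}")
```

Because both engines share the same table metadata and file formats, the Hive output needs no export or conversion step before Impala queries it.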

Figure 2

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
