1. Impala Architecture
Impala is a real-time, interactive SQL query tool for big data, developed by Cloudera and inspired by Google's Dremel. Instead of the slow Hive + MapReduce batch processing, Impala uses a distributed query engine similar to those found in commercial parallel relational databases (composed of a Query Planner, a Query Coordinator, and a Query Exec Engine). It can query data directly in HDFS or HBase with SELECT, JOIN, and aggregate functions, which greatly reduces latency. The architecture is shown in Figure 1. Impala consists primarily of Impalad, the State Store, and the CLI.
Figure 1
Impalad: Runs on the same node as the DataNode and is represented by the impalad process. It receives query requests from clients, and the Impalad that receives a given query acts as the coordinator for it. The coordinator calls the Java front end through JNI to parse the SQL statement and generate a query plan tree, then uses the scheduler to distribute the execution plan to the other Impalads that hold the corresponding data. Each Impalad reads and writes data and executes its part of the query in parallel, streaming its results over the network back to the coordinator, which returns them to the client. Impalad also maintains a connection to the State Store to determine which Impalads are healthy and can accept new work. Each Impalad starts three Thrift servers, beeswax_server (client connections), hs2_server (the HiveServer2-compatible interface borrowed from Hive), and be_server (internal communication between Impalads), along with an ImpalaServer service.
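The coordinator flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Impala's actual code: the node names, the `DATA_BLOCKS` layout, and the functions `execute_fragment` and `coordinate` are all hypothetical, and threads stand in for separate impalad processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data layout: which rows live on which impalad node.
DATA_BLOCKS = {
    "impalad-1": [3, 8, 1],
    "impalad-2": [7, 2],
    "impalad-3": [5, 9, 4],
}


def execute_fragment(node, predicate):
    """Run one plan fragment on the node that holds the data:
    a local scan with a filter, followed by a partial aggregate."""
    rows = [r for r in DATA_BLOCKS[node] if predicate(r)]
    return {"node": node, "count": len(rows), "sum": sum(rows)}


def coordinate(predicate, healthy_nodes):
    """Coordinator role: send a fragment to every healthy node that holds
    data, run the fragments in parallel, then merge the partial results."""
    targets = [n for n in DATA_BLOCKS if n in healthy_nodes]
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        partials = list(pool.map(lambda n: execute_fragment(n, predicate), targets))
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
    }


# Roughly "SELECT COUNT(*), SUM(x) FROM t WHERE x > 3" over all three nodes.
result = coordinate(lambda r: r > 3, healthy_nodes=set(DATA_BLOCKS))
print(result)  # {'count': 5, 'sum': 33}
```

The point of the sketch is that each node computes only a partial aggregate over its local data, so only small intermediate results travel over the network to the coordinator.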
Impala State Store: Tracks the health and location of the Impalads in the cluster and is represented by the statestored process. It creates multiple threads to handle Impalad registrations and subscriptions and maintains a heartbeat connection with each Impalad. Each Impalad caches a copy of the State Store's information, so when the State Store goes offline the Impalads can keep working from that cache. An Impalad that detects the outage enters recovery mode and repeatedly tries to re-register; when the State Store rejoins the cluster, the Impalad automatically returns to normal and refreshes its cached data. While the State Store is offline, however, the cached data cannot be updated. If some Impalad fails during that time, the execution plan may still be assigned to the failed Impalad, causing the query to fail.
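The stale-cache failure mode described above can be reproduced in a few lines. The following is a toy model only, written under the assumption that membership is a simple set of node names; the classes `StateStore` and `Impalad` and their methods are hypothetical and do not mirror the real statestored protocol.

```python
class StateStore:
    """Toy stand-in for statestored: tracks which impalads are alive."""

    def __init__(self):
        self.online = True
        self.members = set()

    def register(self, node):
        self.members.add(node)

    def mark_failed(self, node):
        self.members.discard(node)


class Impalad:
    """Toy impalad that caches the cluster membership locally."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.cached_members = set()
        self.recovering = False

    def heartbeat(self):
        """Refresh the cache, or enter recovery mode if the store is down."""
        if not self.store.online:
            self.recovering = True  # keep the stale cache and retry later
            return False
        self.store.register(self.name)  # (re-)register with the store
        self.cached_members = set(self.store.members)
        self.recovering = False
        return True

    def dead_targets(self, actually_alive):
        """Nodes the cache would still schedule work on although they are dead."""
        return sorted(self.cached_members - actually_alive)


store = StateStore()
a, b = Impalad("a", store), Impalad("b", store)
a.heartbeat(); b.heartbeat(); a.heartbeat()  # a now caches {"a", "b"}

store.online = False          # state store goes offline
alive = {"a"}                 # meanwhile, impalad "b" fails
a.heartbeat()                 # cannot refresh: a enters recovery mode
stale_targets = a.dead_targets(alive)   # ['b']: a query sent there would fail

store.online = True           # state store rejoins the cluster
store.mark_failed("b")        # and learns that "b" is gone
a.heartbeat()                 # cache is refreshed, recovery mode ends
fresh_targets = a.dead_targets(alive)   # []
print(stale_targets, fresh_targets)
```

In this model the query can only fail during the window in which the cache is stale; once the heartbeat succeeds again, the failed node disappears from the scheduling candidates.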
CLI: A command-line tool for running user queries (the Impala shell is implemented in Python). Impala also provides Hue, JDBC, and ODBC interfaces.
2. Relationship with Hive
Impala and Hive are both data query tools built on Hadoop, each with a different focus. From the client's perspective, however, they have a lot in common: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, storage resource pools, and so on. The relationship between Impala and Hive in Hadoop is shown in Figure 2. Hive is suited to long-running batch queries and analysis, while Impala is suited to real-time interactive SQL queries, giving data analysts a big-data tool for quickly experimenting with and validating ideas. A common pattern is to use Hive for data transformation and then use Impala for fast analysis of the resulting dataset.
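As a rough sketch of the "transform in Hive, then analyze in Impala" workflow, the snippet below only builds the command lines and prints them; it does not run anything. The host name, table names, and column names are hypothetical. It assumes the standard `hive -e` and `impala-shell -i ... -q` options, and that Impala needs an `INVALIDATE METADATA` statement before it can see a table newly created through Hive.

```python
import shlex


def build_pipeline(impalad_host, source_table, result_table):
    """Return the argv lists for a Hive batch step followed by Impala queries.

    All table and column names are placeholders for illustration only.
    """
    # Step 1: long-running transformation, executed as a Hive batch job.
    hive_sql = (
        f"CREATE TABLE {result_table} STORED AS PARQUET AS "
        f"SELECT user_id, COUNT(*) AS visits FROM {source_table} GROUP BY user_id"
    )
    # Step 2: make the table created by Hive visible to Impala.
    refresh_sql = f"INVALIDATE METADATA {result_table}"
    # Step 3: interactive, low-latency query on the result dataset.
    query_sql = (
        f"SELECT user_id, visits FROM {result_table} ORDER BY visits DESC LIMIT 10"
    )
    return [
        ["hive", "-e", hive_sql],
        ["impala-shell", "-i", impalad_host, "-q", refresh_sql],
        ["impala-shell", "-i", impalad_host, "-q", query_sql],
    ]


commands = build_pipeline("impalad-host:21000", "raw_logs", "user_visits")
for argv in commands:
    print(shlex.join(argv))  # shell-quoted form, ready to paste into a terminal
```

Because both tools share the same table metadata, the table written by the Hive step can be queried from Impala without copying any data.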
Figure 2