Hive components and execution procedures

Source: Internet
Author: User

Transfer from http://blog.csdn.net/lifuxiangcaohui/article/details/40262021

The basic composition of Hive is summarized:

1. Components:

Meta-storage (Metastore)-a component that stores "system directories and metadata about tables, columns, partitions, and so on".
Drive (Driver)-a component that controls the HiveQL lifecycle when HiveQL queries pass through hive. The driver manages the session handle and the statistics of any session.
Query compiler Compiler-is a component that compiles hiveql into a map/reduce task in the form of a directed acyclic graph (directed acyclic graph, DAG).
Execution engine-the execution engine-is a component that performs a compiler-generated task in a dependency order (dependency order).
Hive Server Hiveserver-a component that provides a "robust interface (Thrift interface), JDBC/ODBC server, and an integrated hive and other application."
Client components-Similar to the command line interface CLI, the Web UI and the Jdbc/odbc driver. An extensible interface that contains both the SERDE and the Object Observer (Objectinspector) interface, similar to the user-defined function UDF described above, and the user-defined aggregate function Udaf (Users Defined AggregateFunction) interface that allows the user to define their own column functions.


2, the implementation of the process:

HIVEQL is submitted through the Cli/web UI or the external interface of the thrift, ODBC, or JDBC interface, through the Complier compiler, using the cloud data in Metastore for type detection and parsing, to generate a logical scheme (logical plan), Then, with simple optimization, a map-reduce task is presented in the form of a data structure of a directed acyclic graph Dag.


3, meta-storage (Metastore)

Storing all the metadata about tables, tables, partitions, schemas, columns and their types, table addresses, and so on, can be queried through the thrift interface and, because of the need to quickly provide it to the compiler, use an RDBMS

4. Query compiler (complier)

To generate an execution plan with metadata from cloud storage, follow these steps:
1). Parse (parse)-ANLR parse its generative syntax tree AST (This is also the hibernate): convert HQL to abstract syntax tree AST
2). Type checking and parsing (type checking and semantic analysis): Transforms the abstract syntax tree into this query block (Query block tree) and transforms the query block into a logical query plan (logic plan Generator);
3). Optimization (optimization): Rewrite the query plan (logical optimizer) and turn the logical query plan into a physical plan (physical plan Generator)-Select the best join strategy ( Physical optimizer)

Parse SA LPG lo PPG PO
HQL------->ast------>QB----->op tree------->op tree------->task tree------->task Tree

First the HQL statement parsing, constructs an AST tree, obtains the queryblock from the AST tree, then converts the QB to the corresponding operator, generates the logical query plan, optimizes the logical query plan (predicate pushes), generates the physical query plan, optimizes the physical query plan ( Mapjoinresolver/skewjoinresolver/commonjoinresolver) to get the final execution plan.

Mapjoinresolver: The small table of Mr Results into hashtablefiles-->distributedcache, large table from the distributed cache to obtain data to join; When the hash data is large, the distributed cache query efficiency is reduced, At the same time, the large table map > is waiting for the hash files, so the results of the column optimization processing of the small table into the DC to compress and update, large table traversal from the DC to remove the TAR packet, and then extract read local hash files


Hive completes the following transformations as part of the optimization phase:
1). Column pruning: The only columns needed in query processing are actually projected out of the row
2). Predicate push (predicate pushdown): After pushing the filter action only on one table to Tablescanoperator,
3). Partition clip (Partition pruning): Filter out fields that are not eligible on the partition
4). Map End connection (map side joins): When the join table is very small, copy it first in the Map section and then join, in the following format:
SELECT/*+ mapjoin (T2) */T1.C1, t2.c1 from T1 JOIN T2 on (t1.c2 = T2.C2);
The number of rows in a table that are controlled "at any time in memory" by Hive.mapjoin.size.key and Hive.mapjoin.cache.numrows, and the size provided to the System Union key
5). Connection reordering (join reordering): Save smaller tables in memory, large tables for traversal operations, ensure system memory does not overflow

5, the further optimization of Mapjoin

1). Data repartitioning to groupby the asymmetric (skews): With two mapreduce, the first phase distributes the data randomly (or in the case of distinct aggregation by distinct column) to reducers, and computes the aggregated value The results of these aggregations are then distributed to the second reducer by the Group by column;

Set Hive.groupby.skewindata= true;
SELECT t1.c1, sum (T1.C2)
from T1
GROUP by T1.C1;

2). Hash-based local aggregation in mappers: equivalent to combiner, aggregated in the map-side memory, then sent to reducers, parameter hive.map.aggr.hash.percentmemory describes mapper The amount of memory that can be used to control the portion of the hash table. If 0.5 ensures that the hash table size exceeds half of the maximum memory used for mapper, the portion of the aggregation stored there is sent to the reducers. The Hive.map.aggr.hash.min.reduction parameter is also used to control the amount of memory used for mappers



6. Execution engine (execution engines):

Executes according to the task's dependency sequence


7. Other Optimizations:

1). The left Semi join implements the In/exists subquery:
SELECT a.* from A left SEMI joins B on (A.key = B.key and B.key > 100);
Identical to select a.* from A where A.key in (select B.key FORM B where b.key > 100);
Role: Map end uses group by to reduce the amount of data flowing into the reduce end

2). Bucket Map Join:
Set hive.optimize.bucketmapjoin = true;
Work with the map join;
All join tables are divided into buckets, while the number of large buckets is an integer multiple of the small table bucket;
The column that makes the bucket must be a join column;

SELECT/*+mapjoin (a,c) */a.*, b.*, c.*
A join B on a.key = B.key
Join C on A.key=c.key;
In the real production environment, there will be hundreds of buckets;

3). Skew join:
Data skew during join, resulting in the reduce end oom
Set hive.optimize.skewjoin = true;
Set hive.skewjoin.key = threshold;

When the map of the Join is over the threshold, the in-memory a-k1/b-k1 data is stored in HDFs, and then the two pieces of data on the HDFs are then made into a map Join, together with the other key to form the final result

Hive components and execution procedures

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.