Notes: Major Technical Advancements in Apache Hive

Last Update:2015-04-27 Source: Internet

Author: User

http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-14-2.pdf
(Auxiliary reference: https://cwiki.apache.org/confluence/display/Hive/Correlation+Optimizer)

Introduction
Hive's primary deficiencies lie in storage and in query plan execution. The paper proposes three main improvements:

A new file format, ORC
Query planning optimizations (the Correlation Optimizer)
A vectorized execution model that takes full advantage of the CPU cache

Hive Architecture
Shortcomings identified in Hive

The storage formats are not type-aware, and only one row of data can be processed at a time. In Hive, storage efficiency is determined by the serialization and the file format. The previously supported Text and Sequence formats, and the RCFile format supported since v0.4, are all type-unaware, and RCFile can only process one row of data at a time. Type-unaware means that no type-specific optimization is possible; processing one row at a time means that the degree of parallelism is low and the compression ratio of the serialized data is low.
There are no data indexes (including statistical summary information), and complex data types are not decomposed. RCFile is designed for data scanning; it has no indexes and provides no additional semantics for skipping useless data. Complex types such as map and array are not decomposed, so accessing any member of such a value reads the entire value.
The correlations between data operations are ignored, resulting in many unnecessary shuffles.
Processing a single row at a time also limits the use of modern CPU caches and parallel processing.

File format optimization (ORC)
Type awareness; support for data indexes; support for decomposing complex data types

1. Table placement method. Note that ORC does not support placing columns into column groups.

Advantage 1: the default stripe size is 256 MB (RCFile uses 4 MB).
Advantage 2: complex data types are supported, see Table 1.

Advantage 3: stripe boundaries are aligned with HDFS block boundaries. Typically, the stripe size is smaller than the HDFS block size, and with this alignment a stripe is always stored inside a single block.
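The alignment idea can be illustrated with a minimal sketch. It assumes a writer that pads to the next block boundary whenever the next stripe would not fit in the space left in the current block; the function names and the padding strategy are hypothetical and are not taken from the ORC writer.

```python
def place_stripes(stripe_sizes, block_size):
    """Return a start offset for each stripe so that no stripe crosses a block boundary.

    Hypothetical strategy: if a stripe does not fit in the space left in the
    current block, skip (pad) to the start of the next block.
    """
    offsets = []
    pos = 0
    for size in stripe_sizes:
        if size > block_size:
            raise ValueError("a stripe larger than a block cannot be aligned")
        remaining = block_size - (pos % block_size)
        if size > remaining:
            pos += remaining  # pad up to the next block boundary
        offsets.append(pos)
        pos += size
    return offsets


def crosses_block(offset, size, block_size):
    """True if the byte range [offset, offset + size) spans more than one block."""
    return offset // block_size != (offset + size - 1) // block_size
```

With a block size of 512 and stripes of 300, 300 and 200 units, `place_stripes([300, 300, 200], 512)` places the stripes at offsets 0, 512 and 812, so each stripe can be read from a single block.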
Data indexes (Indexes)
To keep data loading fast, only sparse indexes are used. There are two types of indexes:

Data statistics, including count/min/max/sum/length.
Data statistics are kept at three levels: file, stripe, and logical data block (10,000 values per block by default, configurable).
Position pointers.
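To show how sparse statistics let a reader skip useless data, here is a minimal sketch for a single integer column. `BlockStats`, `build_sparse_index` and `blocks_to_read` are hypothetical names for illustration, not ORC APIs; only the block size of 10,000 values comes from the notes above.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    count: int
    minimum: int
    maximum: int
    total: int  # sum of the values in the block


def build_sparse_index(values, block_size=10_000):
    """Compute one statistics entry per logical block of `block_size` values."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append(BlockStats(len(block), min(block), max(block), sum(block)))
    return index


def blocks_to_read(index, low, high):
    """Return the positions of blocks that may contain a value in [low, high]."""
    return [i for i, s in enumerate(index)
            if s.maximum >= low and s.minimum <= high]
```

For a column holding 0..29999 in order, a predicate such as `value BETWEEN 12000 AND 12500` only needs the second block; the other two are skipped using their min/max entries alone.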

Compression.
There are two levels of compression:

Type-based compression. (For strings: if the number of distinct values divided by the total number of values is no greater than 0.8, dictionary encoding is used; otherwise the values are encoded directly as bytes.)
General-purpose compression (optional), e.g. gzip or snappy; the default compression unit is 256 KB.
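The choice of string encoding can be sketched as follows. This is an illustration under a stated assumption: dictionary encoding is chosen when the distinct-to-total ratio does not exceed the 0.8 threshold, since a dictionary only pays off when values repeat. The function names are hypothetical.

```python
def choose_string_encoding(values, threshold=0.8):
    """Pick "dictionary" or "direct" from the distinct-to-total ratio (assumed rule)."""
    if not values:
        return "direct"
    ratio = len(set(values)) / len(values)
    return "dictionary" if ratio <= threshold else "direct"


def dictionary_encode(values):
    """Replace each string by its position in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    position = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [position[v] for v in values]
```

A column with many repeated values (ratio 0.5, for example) is dictionary-encoded into a small dictionary plus a list of integer positions, while a column of all-distinct strings (ratio 1.0) is stored directly.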

Memory management. The actual stripe size in use is adjusted automatically according to the memory limit. (This presumably refers to the size of each chunk of data read.)

Query plan
Three shortcomings:

Unnecessary map phases. Because a MapReduce (MR) job contains at most one shuffle, it is normal for a query to be compiled into multiple MR jobs, and the intermediate output of each MR job is written back to HDFS. A map-only job (a map phase with no reduce phase) therefore introduces an unnecessary write to HDFS.
Duplicate data loading. When a table is used by more than one MR job, it is loaded multiple times.
Unnecessary data re-partitioning.
Eliminating unnecessary map phases.
Map-only jobs are generated when an MR job is converted into a map-only job. There are several such situations, the most representative of which is a hash join between a small table and a large table. To reduce the number of map phases, each time a reduce join is converted to a map join, Hive checks whether the small table taking part in the hash join in the map-only job is smaller than a threshold; if so, the map-only job is merged into its child job.
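The merge decision can be sketched on a simple linear chain of jobs. The `Job` class, the byte sizes and the rule of comparing the combined small-table size against the threshold are assumptions made for illustration; they are not Hive's planner code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    map_only: bool
    small_table_bytes: List[int] = field(default_factory=list)  # hash-join build sides
    child: Optional["Job"] = None


def merge_map_only_jobs(job, threshold_bytes):
    """Walk a chain of jobs and fold each qualifying map-only job into its child.

    Returns the names of the jobs that remain as separate stages.
    """
    remaining = []
    while job is not None:
        mergeable = (job.map_only
                     and job.child is not None
                     and sum(job.small_table_bytes) < threshold_bytes)
        if not mergeable:
            remaining.append(job.name)
        job = job.child
    return remaining
```

In a chain `join1 -> join2 -> agg` where `join1` hashes a 50 MB table and `join2` a 5 MB table, a 10 MB threshold folds only `join2` into `agg`, which saves one intermediate write to HDFS.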
Correlation optimizer
Based on YSmart (http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf).
There are two kinds of correlations:

Input correlation: a table is used more than once, by different MR jobs.
Job flow correlation: one operation depends on another operation, and both operations partition the data in the same way.
Three conditions determine whether an upstream ReduceSinkOperator (RSOp) is correlated with a downstream RSOp:

The emitted rows are sorted in the same way;
The data is partitioned in the same way;
There is no conflict in the number of reducers (?).
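The three conditions can be written down as a small predicate. The `ReduceSink` record and its fields are hypothetical simplifications, and treating "no conflict in the number of reducers" as "either side has no preference, or both agree" is an assumption, since the notes themselves mark that condition as uncertain.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReduceSink:
    sort_keys: Tuple[str, ...]
    sort_order: Tuple[str, ...]          # "asc" or "desc" for each sort key
    partition_keys: Tuple[str, ...]
    num_reducers: Optional[int] = None   # None means "no preference"


def is_correlated(upstream, downstream):
    """Check the three conditions: same sorting, same partitioning, compatible reducers."""
    same_sorting = (upstream.sort_keys == downstream.sort_keys
                    and upstream.sort_order == downstream.sort_order)
    same_partitioning = upstream.partition_keys == downstream.partition_keys
    reducers_compatible = (upstream.num_reducers is None
                           or downstream.num_reducers is None
                           or upstream.num_reducers == downstream.num_reducers)
    return same_sorting and same_partitioning and reducers_compatible
```

Two RSOps that both sort and partition on the same key are reported as correlated, so the second shuffle is a candidate for removal; changing the partition key on either side breaks the correlation.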

Operator tree transformations

The bottom-layer RSOps are still needed to emit the row data.
A DemuxOperator is added to remove unnecessary RSOps.

Operator coordination.
Because MapReduce pushes data, a lot of unnecessary data is transmitted. A coordination mechanism is therefore required to provide "on-demand transfer".

Query execution
The aim is to make full use of the features of modern CPUs, whose efficiency depends largely on the degree of parallelism. To keep multiple pipelines executing in parallel, branch instructions need to be reduced, and independence between data items also helps to increase parallelism. In addition, row-at-a-time execution results in poor cache performance.

Data is represented as batches of rows (1024 rows per batch by default, configurable).
In row-at-a-time mode, a row is processed by the entire query tree before the next row is processed; now a batch of rows is the unit of execution.
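The batch-at-a-time control flow can be sketched as below. This only illustrates the shape of the execution model (each operator runs a tight loop over one column of a batch); plain Python will not show the CPU-level gains. The column names, the example query and the helper functions are hypothetical.

```python
BATCH_SIZE = 1024  # default batch size mentioned in the notes


def batches(columns, batch_size=BATCH_SIZE):
    """Split column-oriented data (a dict of equal-length lists) into row batches."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield {name: col[start:start + batch_size] for name, col in columns.items()}


def filter_gt(batch, column, bound):
    """Filter operator: one tight loop over a single column; returns selected positions."""
    col = batch[column]
    return [i for i in range(len(col)) if col[i] > bound]


def sum_selected(batch, column, selected):
    """Aggregate operator: one tight loop over a single column for the selected rows."""
    col = batch[column]
    return sum(col[i] for i in selected)


def run_query(columns, batch_size=BATCH_SIZE):
    """SELECT SUM(price) WHERE qty > 10, processed one batch at a time."""
    total = 0
    for batch in batches(columns, batch_size):
        selected = filter_gt(batch, "qty", 10)
        total += sum_selected(batch, "price", selected)
    return total
```

Each operator is invoked once per batch rather than once per row, so the per-row call overhead and branching of the query tree are amortized over up to 1024 rows.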

Performance testing

File format
Query plan
Query execution
