Notes: Major Technical Advancements in Apache Hive

Source: Internet
Author: User
Tags: shuffle

Paper: http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-14-2.pdf
(Auxiliary reference: https://cwiki.apache.org/confluence/display/Hive/Correlation+Optimizer)
  1. Introduction
    Hive's main deficiencies lie in storage and in query plan execution. The paper proposes three main improvements:
      1. A new file format, ORC
      2. Query planning optimizations (the Correlation Optimizer)
      3. A vectorized execution model that takes full advantage of the CPU cache
  2. Hive Architecture

  3. Shortcomings of Hive
      1. The storage formats are not type-aware, and data is serialized only one row at a time. In Hive, storage efficiency is determined by the serialization method and the file format. The text and sequence file formats supported earlier, and RCFile (supported since v0.4), are all type-unaware, and RCFile serializes one row at a time. Type-unaware means that no type-specific optimization is possible; handling one row at a time means a low degree of parallelism and a low compression ratio for the serialized data.
      2. There are no data indexes (including statistical summaries), and complex data types are not handled natively. RCFile is designed for data scans; it has no indexes and provides no additional semantics for skipping unneeded data. Complex types such as map and array are not decomposed, so accessing any member of such a value means reading the whole value.
      3. Correlations between data operations are ignored, which results in many unnecessary shuffles.
      4. Processing one row at a time also limits the use of modern CPU caches and parallelism.
  4. File format optimization (ORC)
    Type awareness; support for data indexes; support for decomposing complex data types.
      1. Table placement method. Note that ORC does not support placing columns in column groups.

        Advantage 1: the default stripe size is 256 MB (in RCFile the default is 4 MB).
        Advantage 2: complex data types are supported; see Table 1.

        Advantage 3: stripe boundaries are aligned with HDFS block boundaries. Typically the stripe size is smaller than the HDFS block size, and with this alignment a stripe is always stored inside a single block.
      2. Data indexes
        To keep loading fast, only sparse indexes are used. There are two kinds:
        1. Data statistics, including count/min/max/sum/length.
          Statistics are kept at three levels: file, stripe, and logical data block (10,000 values per block by default, configurable).
        2. Position pointers.
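How block-level statistics let a reader skip data can be sketched as follows. This is a minimal illustration, not ORC's actual reader: `BlockStats`, `build_block_stats`, and `blocks_to_read` are hypothetical names, and the column is assumed to hold plain integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockStats:
    """Statistics kept for one logical block of column values (hypothetical layout)."""
    count: int
    minimum: int
    maximum: int
    total: int


def build_block_stats(values: List[int], block_size: int = 10_000) -> List[BlockStats]:
    """Compute count/min/max/sum for every block of `block_size` values."""
    stats = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        stats.append(BlockStats(len(block), min(block), max(block), sum(block)))
    return stats


def blocks_to_read(stats: List[BlockStats], low: int, high: int) -> List[int]:
    """Return the indexes of blocks that may contain a value in [low, high]."""
    return [i for i, s in enumerate(stats) if s.maximum >= low and s.minimum <= high]


column = list(range(50_000))          # toy data, sorted so blocks do not overlap
stats = build_block_stats(column)
selected = blocks_to_read(stats, 12_000, 13_000)
```

With sorted data only one of the five blocks overlaps the requested range, so the other four never need to be decoded. On unsorted data the min/max ranges overlap more and fewer blocks can be skipped.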
      3. Compression
        There are two levels of compression:
        1. Type-based encoding. (For strings: if the number of distinct values divided by the total number of values is no greater than 0.8, dictionary encoding is used; otherwise byte-level encoding is used.)

        2. General-purpose compression (optional), e.g. gzip or Snappy; the default compression unit is 256 KB.
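The choice of string encoding can be sketched as below. The sketch assumes that dictionary encoding is chosen when the ratio of distinct values to total values is at or below the 0.8 threshold, since a dictionary only pays off when values repeat. The function names and the "direct" label for the fallback are illustrative, not ORC's API.

```python
from typing import List, Optional, Tuple


def choose_string_encoding(values: List[Optional[str]], threshold: float = 0.8) -> str:
    """Pick an encoding from the distinct-value ratio of the non-null values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "direct"
    ratio = len(set(non_null)) / len(non_null)
    # Few distinct values: a dictionary is small and the references are cheap.
    return "dictionary" if ratio <= threshold else "direct"


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string by its position in a sorted dictionary."""
    dictionary = sorted(set(values))
    position = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [position[v] for v in values]


repetitive = ["a", "b", "a", "a", "b"]      # ratio 2/5 = 0.4
unique = ["a", "b", "c", "d", "e"]          # ratio 5/5 = 1.0
encoding_for_repetitive = choose_string_encoding(repetitive)
encoding_for_unique = choose_string_encoding(unique)
dictionary, codes = dictionary_encode(["b", "a", "b"])
```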
      4. Memory management. The actual stripe size in use is adjusted automatically according to the memory limit. (This concerns the data each writer buffers in memory before a stripe is flushed.)
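One way to picture this adjustment is proportional scaling: when the stripe sizes requested by all concurrent writers exceed the memory limit, each is shrunk by the same factor. The function below is a sketch under that assumption; the name `scaled_stripe_sizes` and the proportional rule are illustrative and not taken from the notes.

```python
from typing import List


def scaled_stripe_sizes(requested: List[int], memory_limit: int) -> List[int]:
    """Scale requested stripe sizes down so that their sum fits the memory limit."""
    total = sum(requested)
    if total <= memory_limit:
        return list(requested)          # everything fits, keep the requested sizes
    scale = memory_limit / total        # assumed rule: shrink all writers equally
    return [int(size * scale) for size in requested]


# Sizes in MB: three writers asking for 1024 MB in total, with 512 MB available.
fits = scaled_stripe_sizes([256, 256], 1024)
shrunk = scaled_stripe_sizes([256, 256, 512], 512)
```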
  5. Query plan
    Three shortcomings:
    1. Unnecessary Map phases. Because an MR (MapReduce) job can perform at most one shuffle, a query normally compiles into several MR jobs, and the intermediate output of each job is written back to HDFS. A Map phase that has no Reduce phase (a Map-only job) therefore introduces an unnecessary write-back to HDFS.

    2. Redundant data loading. When the same table is used by several different MR jobs, it is loaded multiple times.
    3. Unnecessary data re-partitioning.
    4. Eliminating unnecessary Map phases.
      Map-only jobs are generated when a Reduce Join is converted into a Map Join. There are several such cases; the most representative is a hash join between a small table and a large table. To cut down on Map phases, each time a Reduce Join is converted into a Map Join, Hive checks whether the total size of the small tables taking part in hash joins in the Map-only job is below a threshold; if it is, the Map-only job is merged into its child job.
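The merge decision can be sketched as follows. `Job` and `try_merge_into_child` are hypothetical names, and the sketch assumes the threshold applies to the combined size of the small tables that the merged job would have to hold in hash tables.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    """A toy stand-in for one job in the query plan (not a Hive class)."""
    name: str
    map_only: bool
    small_table_bytes: List[int] = field(default_factory=list)
    child: Optional["Job"] = None


def try_merge_into_child(job: Job, threshold_bytes: int) -> Job:
    """Merge a Map-only job into its child if the hash tables stay under the threshold."""
    child = job.child
    if not job.map_only or child is None:
        return job
    combined = job.small_table_bytes + child.small_table_bytes
    if sum(combined) >= threshold_bytes:
        return job                       # too large to hold in memory, keep jobs separate
    return Job(
        name=f"{job.name}+{child.name}",
        map_only=child.map_only,
        small_table_bytes=combined,
        child=child.child,
    )


aggregate = Job("aggregate", map_only=False)
map_join = Job("map_join", map_only=True, small_table_bytes=[10_000_000], child=aggregate)

merged = try_merge_into_child(map_join, threshold_bytes=25_000_000)
kept = try_merge_into_child(map_join, threshold_bytes=5_000_000)
```

In the first call the small table fits under the threshold, so the Map-only job and its child become one job and the intermediate write to HDFS disappears. In the second call the plan is left unchanged.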

    5. Correlation Optimizer
      Based on YSmart (http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf).
      There are two kinds of correlation:
        1. Input correlation: the same table is used by more than one MR job.
        2. Job flow correlation: one operation depends on another, and both partition the data in the same way.
          Three conditions determine whether an upstream RSOp (ReduceSinkOperator) is correlated with a downstream RSOp:
          1. The rows they emit are sorted in the same way.
          2. The rows they emit are partitioned in the same way.
          3. There is no conflict in the number of reducers.
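The three conditions can be written as a single predicate. The `ReduceSink` record below is a hypothetical simplification of an RSOp, and "no conflict" is assumed to mean that the reducer counts are either equal or left unspecified on at least one side.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReduceSink:
    """Simplified description of what an RSOp emits (not Hive's class)."""
    sort_keys: Tuple[str, ...]
    partition_keys: Tuple[str, ...]
    num_reducers: Optional[int] = None   # None: no explicit requirement


def is_correlated(upstream: ReduceSink, downstream: ReduceSink) -> bool:
    """Check the three conditions for job flow correlation."""
    same_sort = upstream.sort_keys == downstream.sort_keys
    same_partition = upstream.partition_keys == downstream.partition_keys
    no_conflict = (
        upstream.num_reducers is None
        or downstream.num_reducers is None
        or upstream.num_reducers == downstream.num_reducers
    )
    return same_sort and same_partition and no_conflict


join_sink = ReduceSink(sort_keys=("user_id",), partition_keys=("user_id",))
group_sink = ReduceSink(sort_keys=("user_id",), partition_keys=("user_id",), num_reducers=8)
other_sink = ReduceSink(sort_keys=("item_id",), partition_keys=("item_id",))

same_key = is_correlated(join_sink, group_sink)
different_key = is_correlated(join_sink, other_sink)
```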
        3. Operator tree transformation
          1. The bottom-layer RSOps are necessary: they emit the rows into the shuffle.
          2. The remaining RSOps are unnecessary; they are removed, and a DemuxOperator is added to dispatch rows to the operators that consume them.

        4. Operator coordination
          Because MR execution pushes data downstream, a lot of data would be forwarded unnecessarily. A coordination mechanism is therefore needed so that data is forwarded only on demand.
    6. Query execution
      The aim is to make full use of modern CPUs. Their efficiency depends largely on the degree of parallelism. To keep multiple pipelines executing in parallel, branch instructions need to be reduced. Data independence also helps to raise the degree of parallelism. Finally, executing one row at a time leads to poor cache performance.
        1. Data is represented as batches of rows (1024 rows per batch by default, configurable).
        2. In row-at-a-time mode, a row is processed by the entire query tree before the next row is processed; now execution proceeds one batch at a time.
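The change in control flow can be sketched as below: each operator runs over a whole batch before the next operator starts, instead of one row travelling through every operator. The function names are illustrative. Plain Python lists do not reproduce the CPU-cache or pipelining benefits; the sketch only shows the batch-at-a-time structure.

```python
from typing import Iterator, List

BATCH_SIZE = 1024   # default batch size mentioned in the notes


def batches(column: List[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Split a column into consecutive batches of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def filter_then_add(batch: List[int], bound: int, delta: int) -> List[int]:
    """Run two operators over the batch, one operator at a time."""
    kept = [v for v in batch if v < bound]      # filter operator, tight loop over the batch
    return [v + delta for v in kept]            # projection operator, tight loop over the batch


def run(column: List[int], bound: int, delta: int) -> List[int]:
    """Drive the operator chain one batch at a time."""
    out: List[int] = []
    for batch in batches(column):
        out.extend(filter_then_add(batch, bound, delta))
    return out


column = list(range(3000))
batch_sizes = [len(b) for b in batches(column)]
result = run(column, bound=2000, delta=1)
```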


    7. Performance testing
        1. File format


        2. Query plan
        3. Query execution


