How Hive implements SQL on MapReduce: SQL is ultimately decomposed into MR tasks, and GROUP BY in MR is no different from the word-count MR job

Source: Internet
Author: User
Tags: shuffle

Reposted from: http://blog.csdn.net/sn_zzy/article/details/43446027

The process of converting SQL to MapReduce

Let's look at how Hive transforms SQL into MapReduce tasks. The entire compilation process is divided into six phases:

    1. ANTLR defines the SQL grammar rules and performs lexical and syntax analysis, transforming the SQL into an abstract syntax tree (AST Tree)
    2. Traverse the AST Tree and abstract out the basic building block of the query, the QueryBlock
    3. Traverse the QueryBlock and translate it into an execution operator tree, the OperatorTree
    4. The logical-layer optimizer transforms the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffle data
    5. Traverse the OperatorTree and translate it into MapReduce tasks
    6. The physical-layer optimizer transforms the MapReduce tasks to generate the final execution plan

The implementation principle of Join
select u.name, o.orderid from order o join user u on o.uid = u.uid;

The map phase tags each row's output value with the table it came from, and the reduce phase uses the tag to determine the data source. (This is only the most basic join implementation; other implementations exist.)
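The tagging idea can be sketched in a few lines of Python. This is only a single-process simulation under stated assumptions, not Hive's actual code: the sample rows, the tag values `USER_TAG` and `ORDER_TAG`, and the helper names `map_phase`, `shuffle`, `reduce_phase`, and `join` are all made up for illustration, and the join key `uid` is used as the map output key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; table and column names follow the query above.
orders = [{"orderid": 101, "uid": 1}, {"orderid": 102, "uid": 2}, {"orderid": 103, "uid": 1}]
users = [{"uid": 1, "name": "alice"}, {"uid": 2, "name": "bob"}, {"uid": 3, "name": "carol"}]

USER_TAG, ORDER_TAG = 0, 1  # assumed tag values marking the source table


def map_phase(orders, users):
    """Emit (join key, (tag, payload)) for every row of both tables."""
    for u in users:
        yield u["uid"], (USER_TAG, u["name"])
    for o in orders:
        yield o["uid"], (ORDER_TAG, o["orderid"])


def shuffle(pairs):
    """Stand-in for the shuffle: sort by key, then by tag, and group by key."""
    ordered = sorted(pairs, key=lambda kv: (kv[0], kv[1][0]))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Split the values by tag, then emit every user/order pair per join key."""
    for _, values in grouped:
        names = [payload for tag, payload in values if tag == USER_TAG]
        order_ids = [payload for tag, payload in values if tag == ORDER_TAG]
        for name in names:
            for order_id in order_ids:
                yield name, order_id


def join(orders, users):
    return list(reduce_phase(shuffle(map_phase(orders, users))))


result = join(orders, users)
```

Here the user with `uid` 3 has no orders, so the reducer for that key emits nothing, which matches inner-join semantics.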

The implementation principle of Group by
select rank, isonline, count(*) from city group by rank, isonline;

The GROUP BY fields are combined into the map output key. Relying on MapReduce's sorting, the reduce phase keeps a LastKey to tell different keys apart. (This describes only the non-hash aggregation process on the reduce side.)
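A minimal Python sketch of the LastKey technique for the query above follows. It is a single-process simulation, with `sorted` standing in for the shuffle sort; the function names and the sample `city` rows are hypothetical.

```python
def groupby_map(rows):
    """Emit ((rank, isonline), 1) for every input row."""
    for row in rows:
        yield (row["rank"], row["isonline"]), 1


def groupby_reduce(sorted_pairs):
    """Stream over key-sorted pairs, flushing a count whenever the key changes."""
    last_key = None
    count = 0
    for key, value in sorted_pairs:
        if key != last_key:
            if last_key is not None:
                yield (*last_key, count)
            last_key, count = key, 0
        count += value
    if last_key is not None:  # flush the final group
        yield (*last_key, count)


def group_count(rows):
    # sorted() plays the role of the MapReduce sort on the map output key
    pairs = sorted(groupby_map(rows), key=lambda kv: kv[0])
    return list(groupby_reduce(pairs))


# Hypothetical sample data for the city table
city = [
    {"rank": "A", "isonline": 1},
    {"rank": "A", "isonline": 0},
    {"rank": "A", "isonline": 1},
    {"rank": "B", "isonline": 1},
]
counts = group_count(city)
```

Because the input arrives sorted, the reducer only ever holds one key and one counter in memory, which is the same shape as a word-count reducer.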

The implementation principle of Distinct
select dealid, count(distinct uid) num from order group by dealid;

When there is only one distinct field, and ignoring the hash GROUP BY in the map phase, it is enough to combine the GROUP BY field and the distinct field into the map output key and rely on MapReduce's sorting. The GROUP BY field alone serves as the reduce key, and saving the LastKey in the reduce phase completes the deduplication.
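The single-distinct case can be sketched as follows. This is a simplified single-process simulation: `sorted` stands in for the shuffle sort, partitioning by `dealid` is implied rather than modelled, `dealid` values are assumed to be non-null, and the function name and sample rows are hypothetical.

```python
def count_distinct_uid(rows):
    """Simulate count(distinct uid) ... group by dealid with a composite sort key."""
    # Map output key is the composite (dealid, uid); no value is needed.
    keys = sorted((row["dealid"], row["uid"]) for row in rows)

    results = []
    last_dealid = None  # current group (the reduce key)
    last_key = None     # last composite key seen, used for deduplication
    num = 0
    for key in keys:
        dealid = key[0]
        if dealid != last_dealid:
            if last_dealid is not None:
                results.append((last_dealid, num))
            last_dealid, num = dealid, 0
        if key != last_key:  # a new (dealid, uid) pair means a new distinct uid
            num += 1
            last_key = key
    if last_dealid is not None:  # flush the final group
        results.append((last_dealid, num))
    return results


# Hypothetical sample data for the order table
order_rows = [
    {"dealid": 1, "uid": "a"},
    {"dealid": 1, "uid": "b"},
    {"dealid": 1, "uid": "a"},
    {"dealid": 2, "uid": "a"},
]
distinct_counts = count_distinct_uid(order_rows)
```

Duplicates of the same `(dealid, uid)` pair arrive adjacent to each other after sorting, so comparing against the previous key is enough and no hash set is required.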

If there are multiple distinct fields, as in the following SQL:

select dealid, count(distinct uid), count(distinct date) from order group by dealid;

There are two ways of implementing this:

(1) Following the single-distinct method above, the data cannot be sorted by uid and by date separately, so deduplication by LastKey is not possible; the reduce phase still has to deduplicate in memory with a hash.

(2) The second implementation numbers all the distinct fields and generates n rows from each row of data, one per distinct field, so that the values of each field are sorted separately. The reduce phase then only needs to record the LastKey to deduplicate.

This implementation makes good use of MapReduce's sorting and saves the memory that deduplication would consume in the reduce phase, but the drawback is that it increases the amount of shuffle data.

Note that when generating the reduce value, only the row for the first distinct field needs to keep its value; the value of the rows for the other distinct fields can be empty.
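The two approaches for multiple distinct fields can be compared with a small Python sketch. Both functions are single-process simulations under assumptions of my own: the function names, the field numbers (0 for uid, 1 for date), and the sample rows are hypothetical, `dealid` is assumed non-null, and since the query contains only distinct counts the sketch emits keys with no values at all.

```python
def distinct_with_hash(rows):
    """Approach (1): key on dealid only and deduplicate with in-memory sets."""
    seen = {}  # dealid -> (set of uids, set of dates)
    for row in rows:
        uids, dates = seen.setdefault(row["dealid"], (set(), set()))
        uids.add(row["uid"])
        dates.add(row["date"])
    ordered = sorted(seen.items(), key=lambda kv: kv[0])
    return [(dealid, len(uids), len(dates)) for dealid, (uids, dates) in ordered]


def distinct_with_numbering(rows):
    """Approach (2): one output row per distinct field, then sort and use LastKey."""
    expanded = []
    for row in rows:
        expanded.append((row["dealid"], 0, row["uid"]))   # field number 0: uid
        expanded.append((row["dealid"], 1, row["date"]))  # field number 1: date
    expanded.sort()  # stand-in for the shuffle sort

    results = []
    last_dealid, last_key = None, None
    counts = [0, 0]
    for key in expanded:
        dealid, field_no, _ = key
        if dealid != last_dealid:
            if last_dealid is not None:
                results.append((last_dealid, counts[0], counts[1]))
            last_dealid, counts = dealid, [0, 0]
        if key != last_key:  # first time this (dealid, field, value) is seen
            counts[field_no] += 1
            last_key = key
    if last_dealid is not None:  # flush the final group
        results.append((last_dealid, counts[0], counts[1]))
    return results


# Hypothetical sample data for the order table
multi_rows = [
    {"dealid": 1, "uid": "a", "date": "2015-01-01"},
    {"dealid": 1, "uid": "b", "date": "2015-01-01"},
    {"dealid": 1, "uid": "a", "date": "2015-01-02"},
    {"dealid": 2, "uid": "a", "date": "2015-01-01"},
]
```

Both functions return the same counts. The difference is where the cost lands: the first keeps a set per distinct field in reducer memory, while the second doubles the number of shuffled rows in exchange for a constant-memory reducer.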
