Transferred from: http://blog.csdn.net/sn_zzy/article/details/43446027

The process of converting SQL to MapReduce
After learning the MapReduce implementations of basic SQL operations, let's look at how Hive transforms SQL into a MapReduce task. The entire compilation process is divided into six phases:
- ANTLR defines the SQL grammar rules and performs lexical and syntactic analysis, transforming the SQL into an abstract syntax tree (AST Tree)
- Traverse the AST Tree and abstract out the QueryBlock, the basic constituent unit of a query
- Traverse the QueryBlock and translate it into an operator tree (OperatorTree)
- The logical-layer optimizer rewrites the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data
- Traverse the OperatorTree and translate it into MapReduce tasks
- The physical-layer optimizer transforms the MapReduce tasks to generate the final execution plan
The implementation principle of join
```sql
select u.name, o.orderid from order o join user u on o.uid = u.uid;
```
In the map output value, tag each row with the table it came from; in the reduce phase, the tag tells the reducer which table produced each row. The MapReduce process is as follows (this is only the most basic join implementation; other implementations exist as well).
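The tag-based reduce-side join above can be sketched in Python. This is a minimal simulation, not Hive's internal encoding: the sample rows, the `0`/`1` tags, and the in-memory "shuffle" via `sort` are all illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

order = [(1, 101), (2, 101), (3, 102)]   # (orderid, uid)
user = [(101, "alice"), (102, "bob")]    # (uid, name)

# Map phase: emit (join key, (tag, payload)); the tag marks the source table.
mapped = [(uid, (0, name)) for uid, name in user] + \
         [(uid, (1, orderid)) for orderid, uid in order]

# Shuffle: sort by join key, then by tag, so user rows precede order rows.
mapped.sort(key=lambda kv: (kv[0], kv[1][0]))

# Reduce phase: for each key, separate rows by tag and cross the two sides.
result = []
for uid, group in groupby(mapped, key=itemgetter(0)):
    names, orderids = [], []
    for _, (tag, payload) in group:
        (names if tag == 0 else orderids).append(payload)
    for name in names:
        for orderid in orderids:
            result.append((name, orderid))
```

Sorting the tag into the key is what lets a real reducer buffer only the smaller side per key instead of both.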
The implementation principle of Group by
```sql
select rank, isonline, count(*) from city group by rank, isonline;
```
The GroupBy columns are combined into the map output key; MapReduce's sorting brings identical keys together, and in the reduce phase a saved LastKey is used to distinguish one key from the next. The MapReduce process is as follows (this shows, of course, reduce-side aggregation without map-side hash aggregation).
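The sort-plus-LastKey aggregation above can be sketched as follows. The sample rows are made up for illustration; the point is that once the shuffle sorts equal keys together, the reducer only needs to watch for key changes.

```python
rows = [("A", 1), ("A", 1), ("B", 0), ("A", 0), ("B", 0)]  # (rank, isonline)

# Map phase: the full GroupBy tuple is the key, the value is a partial count.
mapped = [((rank, isonline), 1) for rank, isonline in rows]

# Shuffle: MapReduce sorts by key, making rows of the same group adjacent.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: accumulate until the key changes (LastKey), then emit.
result = []
last_key, count = None, 0
for key, value in mapped:
    if last_key is not None and key != last_key:
        result.append((*last_key, count))
        count = 0
    last_key = key
    count += value
if last_key is not None:
    result.append((*last_key, count))
```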
The implementation principle of distinct
```sql
select dealid, count(distinct uid) num from order group by dealid;
```
When there is only one DISTINCT column (and ignoring map-side hash GroupBy), simply combine the GroupBy column and the DISTINCT column into the map output key and rely on MapReduce's sorting; at the same time, use the GroupBy column alone as the reduce key, and save the LastKey in the reduce phase to complete deduplication.
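A minimal Python sketch of this single-DISTINCT plan, with illustrative sample data: the map key is `(dealid, uid)`, the sort makes duplicate uids adjacent, and the reducer groups logically on `dealid` alone while deduplicating via the last uid seen.

```python
rows = [(1, "u1"), (1, "u1"), (1, "u2"), (2, "u1"), (2, "u3"), (2, "u3")]  # (dealid, uid)

# Map phase + shuffle: composite key (dealid, uid), sorted by the framework.
mapped = sorted(rows)

# Reduce phase: partition on dealid; count a uid only when it differs
# from the last uid seen within the current dealid group.
result = []
cur_deal, last_uid, num = None, None, 0
for dealid, uid in mapped:
    if dealid != cur_deal:
        if cur_deal is not None:
            result.append((cur_deal, num))
        cur_deal, last_uid, num = dealid, None, 0
    if uid != last_uid:
        num += 1
        last_uid = uid
if cur_deal is not None:
    result.append((cur_deal, num))
```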
If there are multiple DISTINCT columns, as in the following SQL:
```sql
select dealid, count(distinct uid), count(distinct date) from order group by dealid;
```
There are two ways of implementing this:
(1) If you still follow the single-DISTINCT-column approach above, the data cannot be sorted by both uid and date at once, so LastKey-based deduplication no longer works; deduplication must still be done in memory with a hash table during the reduce phase.
(2) The second implementation numbers each DISTINCT column and, for each input row, generates n rows of data (one per DISTINCT column). Values of the same column are then sorted together, and the reduce phase only needs to record the LastKey to deduplicate. This implementation makes good use of MapReduce's sorting and saves the memory that reduce-phase deduplication would otherwise consume, but the disadvantage is that it increases the amount of shuffled data.
It is important to note that when generating the reduce value, the value can be left empty for every expanded row except the one carrying the first DISTINCT column.
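Implementation (2) can be sketched as follows. The sample rows are illustrative; each input row is expanded into one record per DISTINCT column, tagged by that column's index, so the shuffle sorts each column's values independently within a dealid and a single LastKey check deduplicates.

```python
rows = [(1, "u1", "d1"), (1, "u1", "d2"), (1, "u2", "d1")]  # (dealid, uid, date)

# Map phase: expand each row into n records, tagged by DISTINCT column index.
mapped = []
for dealid, uid, date in rows:
    mapped.append((dealid, 0, uid))   # record for count(distinct uid)
    mapped.append((dealid, 1, date))  # record for count(distinct date)

# Shuffle: sort on (dealid, column index, value).
mapped.sort()

# Reduce phase: count a value only when the full key changes (LastKey).
counts = {}   # (dealid, column index) -> distinct count
last = None
for key in mapped:
    if key != last:
        dealid, idx, _ = key
        counts[(dealid, idx)] = counts.get((dealid, idx), 0) + 1
    last = key
```

Here `counts[(1, 0)]` is `count(distinct uid)` and `counts[(1, 1)]` is `count(distinct date)` for dealid 1.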
This is the implementation principle of Hive's SQL-on-MapReduce: SQL is ultimately decomposed into MR tasks, and the group-by MR job is no different in structure from the classic word-count MR job.