Transferred from: http://blog.csdn.net/sn_zzy/article/details/43446027

The process of converting SQL to MapReduce
After learning the MapReduce implementations of basic SQL operations, let's look at how Hive transforms SQL into a MapReduce task. The entire compilation process is divided into six phases:
- ANTLR defines the SQL grammar rules and performs lexical and syntactic analysis, transforming the SQL into an abstract syntax tree (AST Tree)
- Traverse the AST Tree and abstract out the QueryBlock, the basic constituent unit of a query
- Traverse the QueryBlock and translate it into an operator tree (OperatorTree)
- The logical-layer optimizer rewrites the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data
- Traverse the OperatorTree and translate it into MapReduce tasks
- The physical-layer optimizer transforms the MapReduce tasks to generate the final execution plan
The implementation principle of join
```sql
select u.name, o.orderid from order o join user u on o.uid = u.uid;
```
In the map output value, tag each row with the table it came from; in the reduce phase, the tag tells the reducer which table produced each row. The MapReduce process is as follows (this is only the most basic join implementation; other implementations exist as well).
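The tag-based reduce-side join above can be sketched in Python. This is a minimal simulation, not Hive's internal encoding: the sample rows, the `0`/`1` tags, and the in-memory "shuffle" via `sort` are all illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

order = [(1, 101), (2, 101), (3, 102)]   # (orderid, uid)
user = [(101, "alice"), (102, "bob")]    # (uid, name)

# Map phase: emit (join key, (tag, payload)); the tag marks the source table.
mapped = [(uid, (0, name)) for uid, name in user] + \
         [(uid, (1, orderid)) for orderid, uid in order]

# Shuffle: sort by join key, then by tag, so user rows precede order rows.
mapped.sort(key=lambda kv: (kv[0], kv[1][0]))

# Reduce phase: for each key, separate rows by tag and cross the two sides.
result = []
for uid, group in groupby(mapped, key=itemgetter(0)):
    names, orderids = [], []
    for _, (tag, payload) in group:
        (names if tag == 0 else orderids).append(payload)
    for name in names:
        for orderid in orderids:
            result.append((name, orderid))
```

Sorting the tag into the key is what lets a real reducer buffer only the smaller side per key instead of both.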
The implementation principle of Group by
```sql
select rank, isonline, count(*) from city group by rank, isonline;
```
The GroupBy columns are combined into the map output key; MapReduce's sorting brings identical keys together, and in the reduce phase a saved LastKey is used to distinguish one key from the next. The MapReduce process is as follows (this shows, of course, reduce-side aggregation without map-side hash aggregation).
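The sort-plus-LastKey aggregation above can be sketched as follows. The sample rows are made up for illustration; the point is that once the shuffle sorts equal keys together, the reducer only needs to watch for key changes.

```python
rows = [("A", 1), ("A", 1), ("B", 0), ("A", 0), ("B", 0)]  # (rank, isonline)

# Map phase: the full GroupBy tuple is the key, the value is a partial count.
mapped = [((rank, isonline), 1) for rank, isonline in rows]

# Shuffle: MapReduce sorts by key, making rows of the same group adjacent.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: accumulate until the key changes (LastKey), then emit.
result = []
last_key, count = None, 0
for key, value in mapped:
    if last_key is not None and key != last_key:
        result.append((*last_key, count))
        count = 0
    last_key = key
    count += value
if last_key is not None:
    result.append((*last_key, count))
```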
The implementation principle of distinct
```sql
select dealid, count(distinct uid) num from order group by dealid;
```
When there is only one DISTINCT column (and ignoring map-side hash GroupBy), simply combine the GroupBy column and the DISTINCT column into the map output key and rely on MapReduce's sorting; at the same time, use the GroupBy column alone as the reduce key, and save the LastKey in the reduce phase to complete deduplication.
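A minimal Python sketch of this single-DISTINCT plan, with illustrative sample data: the map key is `(dealid, uid)`, the sort makes duplicate uids adjacent, and the reducer groups logically on `dealid` alone while deduplicating via the last uid seen.

```python
rows = [(1, "u1"), (1, "u1"), (1, "u2"), (2, "u1"), (2, "u3"), (2, "u3")]  # (dealid, uid)

# Map phase + shuffle: composite key (dealid, uid), sorted by the framework.
mapped = sorted(rows)

# Reduce phase: partition on dealid; count a uid only when it differs
# from the last uid seen within the current dealid group.
result = []
cur_deal, last_uid, num = None, None, 0
for dealid, uid in mapped:
    if dealid != cur_deal:
        if cur_deal is not None:
            result.append((cur_deal, num))
        cur_deal, last_uid, num = dealid, None, 0
    if uid != last_uid:
        num += 1
        last_uid = uid
if cur_deal is not None:
    result.append((cur_deal, num))
```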
If there are multiple DISTINCT columns, as in the following SQL:
```sql
select dealid, count(distinct uid), count(distinct date) from order group by dealid;
```
There are two ways of implementing this:
(1) If you still follow the single-DISTINCT-column approach above, the data cannot be sorted by both uid and date at once, so LastKey-based deduplication no longer works; deduplication must still be done in memory with a hash table during the reduce phase.
(2) The second implementation numbers each DISTINCT column and, for each input row, generates n rows of data (one per DISTINCT column). Values of the same column are then sorted together, and the reduce phase only needs to record the LastKey to deduplicate. This implementation makes good use of MapReduce's sorting and saves the memory that reduce-phase deduplication would otherwise consume, but the disadvantage is that it increases the amount of shuffled data.
It is important to note that when generating the reduce value, the value can be left empty for every expanded row except the one carrying the first DISTINCT column.
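Implementation (2) can be sketched as follows. The sample rows are illustrative; each input row is expanded into one record per DISTINCT column, tagged by that column's index, so the shuffle sorts each column's values independently within a dealid and a single LastKey check deduplicates.

```python
rows = [(1, "u1", "d1"), (1, "u1", "d2"), (1, "u2", "d1")]  # (dealid, uid, date)

# Map phase: expand each row into n records, tagged by DISTINCT column index.
mapped = []
for dealid, uid, date in rows:
    mapped.append((dealid, 0, uid))   # record for count(distinct uid)
    mapped.append((dealid, 1, date))  # record for count(distinct date)

# Shuffle: sort on (dealid, column index, value).
mapped.sort()

# Reduce phase: count a value only when the full key changes (LastKey).
counts = {}   # (dealid, column index) -> distinct count
last = None
for key in mapped:
    if key != last:
        dealid, idx, _ = key
        counts[(dealid, idx)] = counts.get((dealid, idx), 0) + 1
    last = key
```

Here `counts[(1, 0)]` is `count(distinct uid)` and `counts[(1, 1)]` is `count(distinct date)` for dealid 1.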
This is the implementation principle of Hive's SQL-on-MapReduce: SQL is ultimately decomposed into MR tasks, and the group-by MR job is no different in structure from the classic word-count MR job.