Order by
Order by performs a global sort on the input, so it uses only one reducer (multiple reducers cannot guarantee a global order).
Because there is only one reducer, computation takes a long time when the input is large.
set hive.mapred.mode=nonstrict; (default)
set hive.mapred.mode=strict;
Order by works the same way as order by in a database: it outputs rows sorted by one or more columns.
The difference from a database is that when hive.mapred.mode=strict, a limit must be specified together with order by; otherwise, the query fails with an error.
hive> select * from test order by id;
FAILED: Error in semantic analysis: In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'id'
Cause: With order by, all data is sent to a single machine for the reduce phase, that is, there is only one reducer. If the data volume is large, the query may never produce a result. With limit N, at most N * (number of map tasks) records are sent to that reducer, which a single reducer can handle.
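As a minimal sketch of the fix, using the same test table as above, adding a limit clause lets the query pass the strict-mode check. The limit value of 100 is an arbitrary choice for illustration and does not come from the original example.

```sql
-- Strict mode rejects order by without limit.
set hive.mapred.mode=strict;

-- The limit value (100) is an arbitrary illustration.
-- The single reducer only has to produce the first 100 rows by id.
select * from test order by id limit 100;
```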
Sort by
Sort by is not a global sort. It sorts the data before it enters each reducer.
Therefore, if sort by is used and mapred.reduce.tasks > 1 is set, sort by only guarantees the order of each reducer's output and does not guarantee a global order.
Sort by is not affected by whether hive.mapred.mode is strict or nonstrict.
Sort by only orders rows by the specified fields within the same reducer.
With sort by, you can specify the number of reduce tasks (set mapred.reduce.tasks=<number>); running a merge sort over the reducers' output then yields the complete sorted result.
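The per-reducer behavior can be sketched as follows, assuming the same test table with an id column. The reducer count of 3 and the output directory '/home/hadoop/out_sort' are hypothetical values picked for illustration.

```sql
-- Use three reducers; the number 3 is an arbitrary illustration.
set mapred.reduce.tasks=3;

-- Each reducer sorts only its own share of the rows by id,
-- so every output file is ordered on its own,
-- but the files taken together are not globally ordered.
-- The directory name is a hypothetical example path.
insert overwrite local directory '/home/hadoop/out_sort'
select * from test sort by id;
```

With more than one reducer, the output directory should contain one file per reducer, each sorted by id independently of the others.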
Note: The limit clause can greatly reduce the data volume. With limit N, the number of records sent to the reduce side (a single machine) drops to N * (number of map tasks). Otherwise, the data may be too large to produce a result.
Distribute by
Distribute by sends rows to different reducers, and therefore to different output files, based on the specified fields.
insert overwrite local directory '/home/hadoop/out' select * from test distribute by length(name) sort by name;
This query sends rows to different reducers according to the length of name, and each reducer writes its own output file.
length is a built-in function. You can also use other functions, including custom functions.
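As a sketch of distributing by a different built-in function, the query below groups rows by the first character of name using substr. The reducer count of 4 and the output directory '/home/hadoop/out_by_initial' are hypothetical values chosen for illustration.

```sql
-- Use four reducers; the number 4 is an arbitrary illustration.
set mapred.reduce.tasks=4;

-- Rows whose name starts with the same character go to the same reducer.
-- Different initials may still share a reducer, since there are only four.
-- Within each reducer, rows are sorted by name.
-- The directory name is a hypothetical example path.
insert overwrite local directory '/home/hadoop/out_by_initial'
select * from test distribute by substr(name, 1, 1) sort by name;
```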
Cluster by
Cluster by combines the functionality of distribute by with that of sort by.
However, the sort is always ascending; the sort order cannot be specified as ASC or DESC.
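The relationship between the three clauses can be sketched as follows, assuming the same test table with a name column. Cluster by accepts no ASC or DESC keyword, so a different sort direction needs the explicit distribute by plus sort by form.

```sql
-- These two queries distribute and sort rows in the same way.
select * from test cluster by name;
select * from test distribute by name sort by name;

-- To sort in descending order within each reducer,
-- spell out both clauses instead of using cluster by.
select * from test distribute by name sort by name desc;
```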
From http://metooxi.iteye.com/blog/1447621