The role and usage of ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY in Hive
1. ORDER BY
set hive.mapred.mode=nonstrict; (the default)
set hive.mapred.mode=strict;
ORDER BY behaves like ORDER BY in a relational database: it sorts the output by one or more columns.
The difference is that when hive.mapred.mode=strict, a LIMIT clause must also be specified; otherwise the query fails with an error.
hive> SELECT * FROM test ORDER BY id;
FAILED: Error in semantic analysis: 1:28 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'id'
Reason: with ORDER BY, all data is sent to a single machine for the reduce phase, so there is only one reducer. If the data volume is large, that reducer may never produce a result. With LIMIT n, each map task emits at most n records, so the reducer receives only n * (number of map tasks) records, which a single reducer can handle.
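The strict-mode rule can be sketched as the short session below. It reuses the test table and id column from the example above; the limit value 100 is only an illustrative assumption.

```sql
-- Turn on strict mode for the current session.
SET hive.mapred.mode=strict;

-- Rejected in strict mode, because ORDER BY has no LIMIT:
-- SELECT * FROM test ORDER BY id;

-- Accepted: the single reducer only has to order a bounded number of rows.
SELECT * FROM test ORDER BY id LIMIT 100;
```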
2. SORT BY
SORT BY is not affected by whether hive.mapred.mode is strict or nonstrict.
SORT BY only guarantees that the data within each reducer is sorted by the specified column; the output as a whole is not globally sorted.
With SORT BY you can set the number of reducers (set mapred.reduce.tasks=<number>), so that larger volumes of data can be processed.
Merge-sorting the per-reducer outputs afterwards then yields the complete sorted result.
Note: You can use a LIMIT clause to greatly reduce the amount of data. With LIMIT n, the number of records transferred to the reduce side (a single machine) drops to n * (number of map tasks). Without it, the data may be too large to produce a result.
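As a rough illustration of the per-reducer guarantee, the sketch below runs SORT BY with several reducers. The reducer count 3 and the output directory '/tmp/sort_by_out' are assumptions chosen for the example, not values from the original article.

```sql
-- Use 3 reducers for the sort stage (illustrative value).
SET mapred.reduce.tasks=3;

-- Each reducer sorts only the rows it receives, so every output file is
-- ordered by id, but the files taken together are not globally ordered.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sort_by_out'
SELECT * FROM test SORT BY id;
```

With this setting the output directory should contain one file per reducer, each sorted on its own.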
3. DISTRIBUTE BY
DISTRIBUTE BY sends rows to different reducers, and therefore to different output files, according to the specified column or expression.
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/out' SELECT * FROM test DISTRIBUTE BY length(name) SORT BY name;
This query sends rows to different reducers according to the length of name, and each reducer writes its own output file.
length is a built-in function; you can also use other built-in functions or user-defined functions here.
4. CLUSTER BY
CLUSTER BY combines the functionality of DISTRIBUTE BY with that of SORT BY on the same column.
However, the sort is always ascending; you cannot specify ASC or DESC.
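The relationship between the three clauses can be sketched as follows, again assuming the test table with a name column used earlier in this article.

```sql
-- These two queries distribute and sort rows in the same way:
SELECT * FROM test CLUSTER BY name;
SELECT * FROM test DISTRIBUTE BY name SORT BY name;

-- For a descending sort within each reducer, write both clauses explicitly:
SELECT * FROM test DISTRIBUTE BY name SORT BY name DESC;
```

Writing DISTRIBUTE BY and SORT BY separately is also the way to distribute on one column while sorting on another.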