Differences and comparisons of several sorting methods in Hadoop Hive

Source: Internet
Author: User

The role and usage of ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY in Hive

1. ORDER BY

set hive.mapred.mode=nonstrict; (default value)

Set hive.mapred.mode=strict;

ORDER BY behaves like ORDER BY in a relational database: it sorts the output globally by one or more columns.

The difference from a database is that when hive.mapred.mode=strict, a LIMIT clause must be specified; otherwise the query fails with an error.

hive> SELECT * FROM test ORDER BY id;

FAILED: Error in semantic analysis: 1:28 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'id'

Cause: with ORDER BY, all data is sent to a single reducer so that a total order can be produced. If the data volume is large, that one reducer may be unable to produce the result. With LIMIT n, each map task passes on at most n records, so the reducer receives only n * (number of map tasks) records, which a single reducer can handle.
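The effect of LIMIT on the single reducer can be illustrated with a toy model. This is a sketch in Python, not Hive's actual implementation; the function name order_by_with_limit and the sample id values are made up for illustration.

```python
import heapq

def order_by_with_limit(map_splits, n):
    """Toy model of ORDER BY ... LIMIT n (ascending) with one reducer."""
    # Map side: each map task forwards at most its n smallest rows.
    shuffled = []
    for split in map_splits:
        shuffled.extend(heapq.nsmallest(n, split))
    # Reduce side: the single reducer sorts what it received and keeps n rows.
    result = sorted(shuffled)[:n]
    return result, len(shuffled)

# Hypothetical id values held by three map tasks.
splits = [[9, 3, 7, 1], [8, 2, 6], [5, 4, 10, 0]]
top, received = order_by_with_limit(splits, 2)
print(top)       # [0, 1]
print(received)  # 6, that is n * (number of map tasks)
```

Without the limit, the reducer in this model would receive all 11 rows instead of 6, which is the situation that becomes unmanageable at real data volumes.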

2. SORT BY

SORT BY is not affected by whether hive.mapred.mode is strict or nonstrict.

SORT BY only guarantees that the data within each reducer is sorted by the specified column; it does not produce a global order.

With SORT BY you can set the number of reducers (set mapred.reduce.tasks=<number>), so that larger volumes of data can be processed.

The sorted outputs of the reducers can then be merge-sorted to obtain the complete ordered result.
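The per-reducer guarantee and the final merge step can be sketched as follows. This is a Python model under simplifying assumptions: the function sort_by is hypothetical, and round-robin assignment merely stands in for however Hive spreads rows across reducers.

```python
import heapq

def sort_by(rows, num_reducers):
    """Toy model of SORT BY: each reducer sorts only the rows it receives."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: round-robin distribution, for illustration only.
        buckets[i % num_reducers].append(row)
    return [sorted(bucket) for bucket in buckets]

rows = [7, 2, 9, 4, 1, 8]
outputs = sort_by(rows, 2)
print(outputs)  # [[1, 7, 9], [2, 4, 8]]

# Each output is sorted, but concatenating them is not.
# Merging the sorted outputs yields the full order.
merged = list(heapq.merge(*outputs))
print(merged)   # [1, 2, 4, 7, 8, 9]
```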

Note: a LIMIT clause can greatly reduce the amount of data. With LIMIT n, the number of records transferred to the reduce side (a single machine) drops to n * (number of map tasks). Without it, the data may be too large to produce a result.

3. DISTRIBUTE BY

DISTRIBUTE BY sends rows to different reducers (and therefore different output files) according to the specified column.

INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/out' SELECT * FROM test DISTRIBUTE BY length(name) SORT BY name;

This statement distributes rows to different reducers according to the length of name, and each reducer writes its own output file.

length is a built-in function; you can also use other built-in functions or user-defined functions.
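The routing of rows by length(name) can be modeled with a short Python sketch. The function distribute_by and the sample names are hypothetical, and taking the key modulo the reducer count is an assumption standing in for Hive's hash partitioning.

```python
def distribute_by(rows, key, num_reducers):
    """Toy model of DISTRIBUTE BY: equal key values land on the same reducer."""
    reducers = {r: [] for r in range(num_reducers)}
    for row in rows:
        # Assumption: integer key, modulo stands in for the real hash partitioner.
        reducers[key(row) % num_reducers].append(row)
    return reducers

names = ["amy", "bob", "carol", "dave", "erin", "frank"]
parts = distribute_by(names, len, 3)
print(parts)
# {0: ['amy', 'bob'], 1: ['dave', 'erin'], 2: ['carol', 'frank']}
```

Names of the same length always end up together, but nothing here orders the rows inside a reducer; that is the job of SORT BY.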

4. CLUSTER BY

CLUSTER BY combines the function of DISTRIBUTE BY with that of SORT BY on the same column.

The sort is always ascending; you cannot specify the sort direction with ASC or DESC.
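CLUSTER BY can be pictured as distributing and then sorting on one key. The Python sketch below is a simplified model: cluster_by is a hypothetical name, and the modulo partitioning is an assumption rather than Hive's real hash function.

```python
def cluster_by(rows, key, num_reducers):
    """Toy model of CLUSTER BY: distribute on a key, then sort each reducer ascending on it."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: integer key, modulo stands in for hash partitioning.
        reducers[key(row) % num_reducers].append(row)
    # The same key is used for sorting, always in ascending order.
    return [sorted(r, key=key) for r in reducers]

ids = [14, 3, 8, 11, 6, 9, 2]
print(cluster_by(ids, lambda x: x, 3))
# [[3, 6, 9], [], [2, 8, 11, 14]]
```

When a different sort column or a descending order is needed, DISTRIBUTE BY combined with SORT BY gives that control.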
