Functions and usage of order by, sort by, distribute by, and cluster by in Hive

Last Update:2018-12-07 Source: Internet

Author: User


Order by

Order by performs a global sort on the input, so it runs with a single reducer (multiple reducers cannot guarantee a global order).
Because there is only one reducer, the computation takes a long time when the input is large.

set hive.mapred.mode=nonstrict; (default value)

set hive.mapred.mode=strict;

Order by works like ORDER BY in a relational database: it returns the rows ordered by one or more columns.

The difference from a database is that when hive.mapred.mode=strict, a limit clause must be specified; otherwise, the query fails with an error.

hive> select * from test order by id;

FAILED: Error in semantic analysis: In strict mode, if order by is specified, limit must also be specified. Error encountered near token 'id'

Cause: With order by, all data is sent to a single machine for the reduce phase, so there is only one reducer. If the data volume is large, the query may never produce a result. With limit N, each map task emits at most N records, so the single reducer receives only N * (number of map tasks) records and can process them.
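The effect of limit on the single reducer can be modelled outside Hive. The Python sketch below is an illustration only, not Hive's implementation: the function name and the sample splits are made up, and each inner list stands in for the rows read by one map task.

```python
import heapq

def order_by_with_limit(map_splits, n):
    """Model of `order by ... limit n` with a single reducer."""
    # Map side: every map task forwards at most n rows (its n smallest).
    per_map = [heapq.nsmallest(n, split) for split in map_splits]
    # Reduce side: the one reducer receives at most n * len(map_splits) rows.
    reducer_input = [row for rows in per_map for row in rows]
    return sorted(reducer_input)[:n], len(reducer_input)

# Hypothetical data: three map tasks, limit 2.
splits = [[9, 1, 7, 3], [8, 2, 6], [5, 4, 10, 0, 11]]
top, seen = order_by_with_limit(splits, 2)
print(top, seen)  # [0, 1] 6
```

The reducer handles 6 rows instead of all 12, which is the N * (number of map tasks) bound described above.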

Sort by

Sort by does not sort globally. It sorts the data before it enters each reducer.

Therefore, if sort by is used and mapred.reduce.tasks is set to a value greater than 1, sort by only guarantees the order of each reducer's output, not a global order.

Sort by is not affected by whether hive.mapred.mode is strict or nonstrict.

Sort by only orders rows by the specified fields within the same reducer.
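A small Python model makes the per-reducer guarantee concrete. This is a sketch under stated assumptions, not Hive code: the function name is hypothetical, and rows are dealt to reducers round-robin purely for illustration.

```python
def sort_by(rows, num_reducers):
    """Model of `sort by` with several reducers: each reducer sorts only its own rows."""
    # Assumption for illustration: rows are dealt to reducers round-robin.
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket) for bucket in buckets]

outputs = sort_by([5, 3, 8, 1, 9, 2], num_reducers=2)
concatenated = [row for out in outputs for row in out]
print(outputs)                               # [[5, 8, 9], [1, 2, 3]]
print(concatenated == sorted(concatenated))  # False
```

Each reducer's output is sorted, but reading the output files one after another does not give a globally sorted result.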

With sort by, you can set the number of reduce tasks (set mapred.reduce.tasks=<number>) and then merge-sort the reducers' output files to obtain the complete sorted result.
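The merge step happens outside the query. As a minimal sketch, assuming each reducer's output has already been read into a sorted Python list (the data and the function name are hypothetical), a k-way merge restores the global order without sorting everything again:

```python
import heapq

def merge_reducer_outputs(reducer_outputs):
    """Merge lists that are each already sorted into one globally sorted list."""
    return list(heapq.merge(*reducer_outputs))

# Hypothetical output of two reducers, each sorted on its own.
reducer_outputs = [[5, 8, 9], [1, 2, 3]]
merged = merge_reducer_outputs(reducer_outputs)
print(merged)  # [1, 2, 3, 5, 8, 9]
```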

Note: A limit clause can greatly reduce the data volume. With limit N, the number of records sent to the reduce side (a single machine) drops to N * (number of map tasks). Without it, the data may be too large to produce a result.

Distribute by

Distribute by splits the data among different reducers, and therefore different output files, based on the specified fields.

insert overwrite local directory '/home/hadoop/out' select * from test distribute by length(name) sort by name;

This statement sends rows to different reducers according to the length of name, and each reducer writes its own output file.

length is a built-in function. You can also use other built-in functions or user-defined functions.
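The routing can be pictured with a short Python model. It is only a sketch: the function name and sample names are hypothetical, and the reducer index is taken as the key modulo the number of reducers, which is a simplifying assumption rather than Hive's exact partitioner.

```python
from collections import defaultdict

def distribute_by(rows, key, num_reducers):
    """Model of `distribute by key(row)`: equal keys always reach the same reducer."""
    buckets = defaultdict(list)
    for row in rows:
        # Assumption: reducer index = key value modulo the number of reducers.
        buckets[key(row) % num_reducers].append(row)
    return dict(buckets)

names = ["al", "bob", "cy", "dave", "eve", "finn"]  # hypothetical rows
by_reducer = distribute_by(names, key=len, num_reducers=3)
for reducer in sorted(by_reducer):
    print(reducer, by_reducer[reducer])
```

Names of the same length always end up in the same bucket, which corresponds to the same output file.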

Cluster by

In addition to the function of distribute by, cluster by also provides the function of sort by.

However, the sort is always in ascending order; ASC or DESC cannot be specified.
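Hive's documentation describes cluster by as shorthand for distribute by plus sort by on the same columns, so each reducer's output comes out in ascending key order. The Python sketch below models that combination; the function name, the sample ids, and the modulo routing are assumptions made for illustration.

```python
def cluster_by(rows, key, num_reducers):
    """Model of `cluster by`: route rows by key, then sort each reducer's rows ascending."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: reducer index = key value modulo the number of reducers.
        buckets[key(row) % num_reducers].append(row)
    # The same key is used for routing and for the ascending per-reducer sort.
    return [sorted(bucket, key=key) for bucket in buckets]

ids = [7, 2, 9, 4, 1, 6]  # hypothetical key values
clustered = cluster_by(ids, key=lambda x: x, num_reducers=2)
print(clustered)  # [[2, 4, 6], [1, 7, 9]]
```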

From http://metooxi.iteye.com/blog/1447621
