Functions and usage of order by, sort by, distribute by, and cluster by in Hive

Source: Internet
Author: User
Order by

Order by performs a global sort on the input, so it uses only one reducer (multiple reducers cannot guarantee a global order).
Because there is only one reducer, computation takes a long time when the input is large.

set hive.mapred.mode=nonstrict; (the default in Hive 0.x and 1.x)

set hive.mapred.mode=strict;

Order by in Hive works like ORDER BY in a relational database: it outputs rows ordered by one or more columns.

The difference from a database is that when hive.mapred.mode=strict, a limit must be specified; otherwise, the query fails with an error.

hive> select * from test order by id;

FAILED: Error in semantic analysis: In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'id'

Cause: With order by, all data is sent to a single reducer, so only one reduce task runs. If the data volume is large, the query may never produce a result. With limit N, each map task emits at most N records, so the single reducer only has to handle N * (number of map tasks) records.
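
As a minimal sketch of this rule, the query that failed in strict mode is accepted once a limit is added. The table test and its id column come from the example above; the limit value 100 is an arbitrary assumption.

```sql
-- Enable strict mode for this session.
set hive.mapred.mode=strict;

-- Accepted in strict mode because the result is bounded by a limit.
-- 100 is an arbitrary illustrative value.
select * from test order by id limit 100;
```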

 

Sort by

Sort by is not a global sort. It sorts the data before it enters the reducer.

Therefore, if sort by is used and mapred.reduce.tasks > 1 is set, sort by only guarantees that each reducer's output is ordered; it does not guarantee a global order.

Sort by is not affected by whether hive.mapred.mode is strict or nonstrict.

Sort by only sorts data by the specified fields within the same reducer.

With sort by, you can set the number of reduce tasks (set mapred.reduce.tasks=<number>) and then merge-sort the reducers' outputs to obtain the complete sorted result.

Note: A limit clause can greatly reduce the data volume. With limit N, the number of records sent to the reduce side (a single machine) drops to N * (number of map tasks). Otherwise, the data may be too large to produce a result.
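
The per-reducer behavior can be sketched as follows, assuming the same test table with an id column and reusing the /home/hadoop/out directory from this article as an illustrative output path. The reducer count of 3 is arbitrary.

```sql
-- Ask for three reduce tasks (arbitrary illustrative value).
set mapred.reduce.tasks=3;

-- Each reducer sorts only the rows it receives and writes its own file,
-- so every file is ordered by id but the files taken together are not.
-- Note: insert overwrite replaces the existing contents of the directory.
insert overwrite local directory '/home/hadoop/out'
select * from test sort by id;
```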

 

Distribute by

Distribute by divides data among different reducers, and therefore different output files, based on the specified fields.

insert overwrite local directory '/home/hadoop/out' select * from test distribute by length(name) sort by name;

This query sends rows to different reducers according to the length of name, and each reducer writes its own output file.

length is a built-in function. You can also use other built-in functions or user-defined functions.
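
A sketch of how distribute by is typically paired with sort by, assuming the test table has a name column. The sort by columns need not match the distribute by expression, and a sort direction can be given.

```sql
-- Rows whose names have the same length are sent to the same reducer;
-- within each reducer, rows are ordered by name in descending order.
select *
from test
distribute by length(name)
sort by name desc;
```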

Cluster by

In addition to the functionality of distribute by, cluster by also provides the functionality of sort by.

However, the sort is always ascending; you cannot specify ASC or DESC as the sort order.
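
A sketch of the relationship between these clauses, assuming the test table has an id column. The first two queries should behave the same way; the third shows the usual workaround when a descending order is needed.

```sql
-- cluster by id is shorthand for distributing and sorting on the same column.
select * from test cluster by id;
select * from test distribute by id sort by id;

-- cluster by accepts no direction, so spell out both clauses for descending order.
select * from test distribute by id sort by id desc;
```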

 

From http://metooxi.iteye.com/blog/1447621
