distribute by in Hive


The distribute by clause in Hive controls how rows are split from the map side to the reduce side. Hive distributes rows among the reducers according to the column(s) named after distribute by: by default it hashes the column value and uses that hash, modulo the number of reducers, to pick the target reducer.

To test distribute by, the job must run with more than one reducer; with a single reducer, every row lands in the same output file and the effect of distribute by cannot be seen.
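For example, you can pin the reducer count yourself before running the query, using the mapred.reduce.tasks setting that Hive itself suggests in the job output below (a minimal sketch; the target directory is an arbitrary example):

-- force two reducers so the distribution across output files is visible
set mapred.reduce.tasks=2;
insert overwrite local directory '/tmp/distribute_demo'
select * from test09 distribute by id;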

hive> select * from test09;
OK
100 tom
200 mary
300 kate
400 tim
Time taken: 0.061 seconds
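The source does not show the table's DDL; test09 presumably looks something like this (a hypothetical reconstruction, with column names inferred from the queries below):

-- hypothetical schema: a numeric id plus a name
create table test09 (id int, name string);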

hive> insert overwrite local directory '/home/hjl/sunwg/ooo' select * from test09 distribute by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105020924_0070, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0070
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0070
06:12:36,644 Stage-1 map = 0%, reduce = 0%
06:12:37,656 Stage-1 map = 50%, reduce = 0%
06:12:39,673 Stage-1 map = 100%, reduce = 0%
06:12:44,713 Stage-1 map = 100%, reduce = 50%
06:12:46,733 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0070
Copying data to local directory /home/hjl/sunwg/ooo
Copying data to local directory /home/hjl/sunwg/ooo
4 rows loaded to /home/hjl/sunwg/ooo
OK
Time taken: 17.663 seconds

The first run distributes rows by the id field. The two reducer output files look like this:

[hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000000_0
400 tim
200 mary
[hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000001_0
300 kate
100 tom
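Why did 400 and 200 end up together while 300 and 100 went to the other file? Each row's target reducer is derived from a hash of the distribute-by column, taken modulo the number of reducers. You can approximate the assignment with Hive's built-in hash() and pmod() functions (a sketch, assuming the two reducers of the job above; the exact bucketing may differ from the partitioner's internals):

-- approximate reducer index per row: hash(id) modulo the reducer count
select id, name, pmod(hash(id), 2) as reducer from test09;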

For the second run, we distribute by length(id) instead. Because every id in this table has the same length, all of the records should be sent to the same reducer.

hive> insert overwrite local directory '/home/hjl/sunwg/lll' select * from test09 distribute by length(id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105020924_0071, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0071
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0071
06:15:21,430 Stage-1 map = 0%, reduce = 0%
06:15:24,454 Stage-1 map = 100%, reduce = 0%
06:15:31,509 Stage-1 map = 100%, reduce = 50%
06:15:34,539 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0071
Copying data to local directory /home/hjl/sunwg/lll
Copying data to local directory /home/hjl/sunwg/lll
4 rows loaded to /home/hjl/sunwg/lll
OK
Time taken: 20.632 seconds

Check whether the result is as expected:
[hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000000_0
[hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000001_0
100 tom
200 mary
300 kate
400 tim

The attempt_201105020924_0071_r_000000_0 file has no records; all of the rows landed in attempt_201105020924_0071_r_000001_0, exactly as expected.
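The premise behind this run is easy to check: length(id) evaluates to 3 for every row, so the value being hashed is identical across all rows (a quick verification query against the same table):

-- every id is three digits, so length(id) is the same constant for all rows
select id, length(id) from test09;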

Transferred from http://www.oratea.net/?p=626
