Hive Bucket
Within each table or partition, Hive can further organize the data into buckets, which provide a finer-grained division of the data range. Bucketing is applied to a column: Hive hashes the column value and takes the result modulo the number of buckets to determine which bucket a record is stored in.
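The assignment rule can be checked interactively, since `hash` and `pmod` are built-in Hive UDFs. A minimal sketch, assuming a table with 4 buckets (for an int column, Hive's hash is the value itself):

```sql
-- Which bucket does ID = 7 fall into, given 4 buckets?
-- Hive assigns a row to bucket pmod(hash(col), num_buckets).
select pmod(hash(7), 4);
-- hash(7) = 7 for an int, so the row lands in bucket 3
```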
There are two reasons to organize a table (or partition) into buckets:
(1) More efficient query processing. Bucketing imposes extra structure on the table, which Hive can exploit for certain queries. In particular, a join of two tables that are bucketed on the same column (the join column) can be implemented efficiently as a map-side join: rows with equal values of that column are guaranteed to land in corresponding buckets, so only matching buckets need to be joined, which significantly reduces the amount of data involved in the join.
(2) More efficient sampling. When developing and debugging queries against large datasets, it is convenient to run them on a small, representative subset of the data.
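Bucketing is what makes Hive's TABLESAMPLE clause cheap: Hive can read only the sampled bucket files instead of scanning the whole table and filtering. A sketch, assuming a table bucketed on ID into 10 buckets:

```sql
-- Sample roughly 1/10 of the data by reading only bucket 1.
-- Because the table is clustered by ID into 10 buckets, Hive reads
-- a single bucket file rather than scanning the full table.
select count(1)
from Lxw_test1 tablesample (bucket 1 out of 10 on ID);
```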
Test: a join between a 40-million-row table and a table of more than 50 million rows, with skewed join keys producing a Cartesian product; the improvement is obvious.
Create the small table:
CREATE TABLE Lxw_test1 (ID int, name string, date_time string)
clustered by (ID) sorted by (ID) into 10 buckets; -- assuming 10 buckets, a multiple of the large table's 5
Create the large table:
CREATE TABLE Lxw_test2 (ID int, name string, date_time string)
clustered by (ID) sorted by (ID) into 5 buckets;
Ps:
1. The join key of both tables is ID, so both must be sorted by ID, and the number of buckets in the small table must be a multiple of the number of buckets in the large table.
2. For a map-side join, the two tables must be bucketed in the same way. A mapper processing a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket, so it only needs to fetch that bucket (a small fraction of the data in the right table) to perform the join. This optimization does not require the two tables to have the same number of buckets; it also works when one bucket count is a multiple of the other.
3. The data in a bucket can additionally be sorted by one or more columns. The join of each pair of buckets then becomes an efficient merge sort, further improving the efficiency of the map-side join.
Enable bucketed tables:
set hive.enforce.bucketing = true;
Insert 40 million records into the small table:
Insert overwrite table Lxw_test1
select ID, name, null from
woa_all_user_info_his
where pt = '2012-05-28'
limit 40000000;
Insert more than 50 million records into the large table (woa_all_user_info_his contains more than 50 million records):
Insert overwrite table Lxw_test2
select ID, name, date_time from
woa_all_user_info_his
where pt = '2012-05-28';
Set the parameters for the Sort Merge Bucket Map Join:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
PS: Here the bucket columns == join columns == sort columns, fully satisfying the conditions for a Sort Merge Bucket Map Join.
Query:
select /*+ mapjoin(b) */ count(1) from
lxw_test1 a
join Lxw_test2 b on
a.ID = b.ID;
Test results:
Including the data inserts, the Sort Merge Bucket Map Join took about 10 minutes.
A plain join of the same two tables was still running after an hour and finally had to be killed.