Hive Sort Merge Bucket Map Join (Large Table Association)

Source: Internet
Author: User
Hive Bucket
For each table or partition, hive can be further organized into buckets, which means that buckets are more granular data range divisions. Hive is also an organization of buckets for a column. Hive uses a hash of the column values, divided by the number of buckets, to determine which bucket the record is stored in.
There are two reasons to organize the table (or partition) into buckets (buckets):

(1) Get higher query processing efficiency. The bucket adds an extra structure to the table, and Hive can take advantage of this structure when dealing with some queries. Specifically, connecting two tables that divide the buckets on the same column (which contains the connection columns) can be implemented efficiently using the MAP-side join (map-side join). such as join operations. For join operations, two tables have an identical column if both tables are bucket-operated. Then the bucket with the same column values will be saved as a JOIN operation, which can significantly reduce the amount of data to join.

(2) make sampling (sampling) more efficient. When working with large datasets, it can be handy to run queries on a small subset of data in a dataset at the stage of developing and modifying queries.
test: A 40 million and a more than 50 million table join, correlation key data skew, and Cartesian product, the effect is obvious. Create a small table
CREATE TABLE Lxw_test1 (ID int,name string,date_time string)  
clustered by (ID) sorted by (ID) into buckets;  
build a big table
CREATE TABLE Lxw_test2 (ID int,name string,date_time string)  
clustered by (ID) sorted by (ID) into 5 buckets;  

Ps:
1. Two table association keys are IDs, need to be sorted by ID, and the number of buckets in the small table is a multiple of the large list of buckets.
2. For map end connections, two tables divide the buckets in the same way. Processing the mapper of a bucket in the left table knows that the matching rows in the right table are within the corresponding bucket. Therefore, mapper only needs to get the bucket (which is just a small portion of the data stored in the right table) to connect. This optimization method does not necessarily require that two tables must be the same number of buckets, the number of buckets in two tables is a multiple relationship can also
3. The data in a bucket can be sorted in addition to one or more columns. Because of this, the connection to each bucket becomes an efficient merge sort (merge-sort), so you can further improve the efficiency of the map-side connection enable bucket table

Set hive.enforce.bucketing = true;  
insert 40 million records into a small table
  Insert Overwrite table Lxw_test1  
  select Id,name,null from    
  woa_all_user_info_his   
  where pt = ' 2012-05-28 '  
  limit 40000000;  
inserting 5000多万条 records into large tables (5000多万条 records in Woa_all_user_info_his)
Insert Overwrite table Lxw_test2  
 select Id,name,date_time from  
 woa_all_user_info_his  
 where pt = ' 2012-05-28 ';  
set parameters for the sort Merge Bucket Map join
Set hive.optimize.bucketmapjoin = true;  
Set hive.optimize.bucketmapjoin.sortedmerge = true;  
Set Hive.input.format=org.apache.hadoop.hive.ql.io.bucketizedhiveinputformat;  
PS: The situation at this time is the bucket columns==join columns==sort columns, fully equipped with the use of the Sort Merge Bucket Map Join conditions. Enquiry
Select/*+ Mapjoin (b) */COUNT (1) from  
lxw_test1 a   
join Lxw_test2 b on  
a.id = b.id   
Test Results:

Including insert data, the sort Merge Bucket Map join takes about 10 minutes.

If these two tables do ordinary join, take 1 hours, still run not finish, finally had to kill off.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.