Hive Bucket
Within each table or partition, Hive can further organize the data into buckets, which provide a finer-grained division of the data range. Bucketing is applied to a column: Hive hashes the column value and takes the result modulo the number of buckets to determine which bucket a record is stored in.
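The assignment rule can be checked interactively, since `hash` and `pmod` are built-in Hive UDFs. A minimal sketch, assuming a table with 4 buckets (for an int column, Hive's hash is the value itself):

```sql
-- Which bucket does ID = 7 fall into, given 4 buckets?
-- Hive assigns a row to bucket pmod(hash(col), num_buckets).
select pmod(hash(7), 4);
-- hash(7) = 7 for an int, so the row lands in bucket 3
```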
There are two reasons to organize a table (or partition) into buckets:
(1) More efficient query processing. Bucketing imposes extra structure on the table, which Hive can exploit for certain queries. In particular, a join of two tables that are bucketed on the same column (the join column) can be implemented efficiently as a map-side join: rows with equal values of that column are guaranteed to land in corresponding buckets, so only matching buckets need to be joined, which significantly reduces the amount of data involved in the join.
(2) More efficient sampling. When developing and debugging queries against large datasets, it is convenient to run them on a small, representative subset of the data.
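Bucketing is what makes Hive's TABLESAMPLE clause cheap: Hive can read only the sampled bucket files instead of scanning the whole table and filtering. A sketch, assuming a table bucketed on ID into 10 buckets:

```sql
-- Sample roughly 1/10 of the data by reading only bucket 1.
-- Because the table is clustered by ID into 10 buckets, Hive reads
-- a single bucket file rather than scanning the full table.
select count(1)
from Lxw_test1 tablesample (bucket 1 out of 10 on ID);
```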
Test: a join between a 40-million-row table and a table of more than 50 million rows, with skewed join keys producing a Cartesian product; the improvement is obvious.
Create the small table:
CREATE TABLE Lxw_test1 (ID int, name string, date_time string)
clustered by (ID) sorted by (ID) into 10 buckets; -- assuming 10 buckets, a multiple of the large table's 5
Create the large table:
CREATE TABLE Lxw_test2 (ID int, name string, date_time string)
clustered by (ID) sorted by (ID) into 5 buckets;
Ps:
1. The join key of both tables is ID, so both must be sorted by ID, and the number of buckets in the small table must be a multiple of the number of buckets in the large table.
2. For a map-side join, the two tables must be bucketed in the same way. A mapper processing a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket, so it only needs to fetch that bucket (a small fraction of the data in the right table) to perform the join. This optimization does not require the two tables to have the same number of buckets; it also works when one bucket count is a multiple of the other.
3. The data in a bucket can additionally be sorted by one or more columns. The join of each pair of buckets then becomes an efficient merge sort, further improving the efficiency of the map-side join.
Enable bucketed tables:
set hive.enforce.bucketing = true;
Insert 40 million records into the small table:
Insert overwrite table Lxw_test1
select ID, name, null from
woa_all_user_info_his
where pt = '2012-05-28'
limit 40000000;
Insert more than 50 million records into the large table (woa_all_user_info_his contains more than 50 million records):
Insert overwrite table Lxw_test2
select ID, name, date_time from
woa_all_user_info_his
where pt = '2012-05-28';
Set the parameters for the Sort Merge Bucket Map Join:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
PS: Here the bucket columns == join columns == sort columns, fully satisfying the conditions for a Sort Merge Bucket Map Join.
Query:
select /*+ mapjoin(b) */ count(1) from
lxw_test1 a
join Lxw_test2 b on
a.ID = b.ID;
Test results:
Including the data inserts, the Sort Merge Bucket Map Join took about 10 minutes.
A plain join of the same two tables was still running after an hour and finally had to be killed.