Hive Bucket Research document
From Jing
1. Overview of Hive buckets
2. Hive Bucket Usage
2.1 Basic Operations
2.2 Optimization of join in Hive
2.2.1 Common join
2.2.2 Map join
2.2.3 Converting common join to map join
2.2.4 Bucket map join
2.2.5 Sort merge bucket map join
3. Performance Comparison
3.1 Bucket map join vs. sort merge bucket map join
3.1.1 Performance Analysis
3.1.2 Test Statements
3.1.3 Performance Comparison
3.2 Factors to consider for the number of buckets
3.2.1 Performance Analysis
3.2.2 Test Statements
3.2.3 Performance Comparison
3.3 Sorted bucket insert vs. normal bucket insert
3.3.1 Performance Analysis
3.3.2 Test Statements
3.3.3 Performance Comparison
4. Existing deficiencies of sort merge join
4.1 Sort merge join optimization does not work for joins appearing in subqueries
4.2 Sort merge bucket map join does not support joins across multiple partitions of a partitioned table
4.3 Sort merge bucket map join does not support left outer join between partitions of a multi-partition table
4.4 Sort merge bucket map join does not support bucket tables sorted in descending order
5. Reference Documents
1. Overview of Hive buckets
Bucket is an implementation of hash partitioning in Hive. Both tables and table partitions in Hive can be further divided into multiple buckets. By specifying the clustered column when creating a table, the table can be split into multiple buckets by hashing on that column. Bucketing a table can improve the performance of the following operations.
1. GROUP BY
e.g. SELECT user_id, count(time) FROM T GROUP BY user_id;
2. DISTRIBUTE BY
e.g. SELECT user_id, time FROM T DISTRIBUTE BY user_id;
3. DISTRIBUTE BY, SORT BY
e.g. SELECT user_id, time FROM T DISTRIBUTE BY user_id SORT BY user_id, time;
4. JOIN
e.g. SELECT /*+ MAPJOIN(b) */ a.user_id FROM T a JOIN T2 b ON (a.user_id = b.user_id);
2. Hive Bucket Usage
2.1 Basic Operations
1. Create a bucket table
CREATE TABLE dwd_prod_word_list_bucket_part (
  word_id BIGINT COMMENT 'keyword ID',
  word_text STRING COMMENT 'keyword literal'
)
COMMENT 'DWD layer keyword and ID correspondence code table _ bucketed _ partitioned'
PARTITIONED BY (pdate STRING)
CLUSTERED BY (word_id)
SORTED BY (word_id ASC)
INTO BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
The CREATE TABLE statement only defines metadata; the CLUSTERED BY and SORTED BY clauses do not by themselves affect how data is imported. The user must ensure that data is imported correctly, including the bucketing and sorting of the data.
2. Inserting data
Before inserting data, set the parameter hive.enforce.bucketing=true, which forces the number of reducers in Hive to match the number of buckets. If this parameter is not set, the final number of buckets may not match the number declared in the CREATE TABLE statement. Alternatively, the number of reducers can be controlled by setting mapred.reduce.tasks to the number of buckets; the first approach is recommended (a sketch of the alternative follows the insert example below).
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE
dwd_prod_word_list_bucket_part PARTITION (pdate='2013-04-15')
SELECT
  word_id,
  word_text
FROM
  dwd.dwd_prod_word_list
WHERE pdate='2013-04-15'
CLUSTER BY word_id;
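For reference, a minimal sketch of the alternative mentioned above, which sets the reducer count directly instead of relying on hive.enforce.bucketing; the bucket count of 4 used here is illustrative and must match the number declared in the CREATE TABLE statement:
set mapred.reduce.tasks=4;  -- illustrative value; must equal the table's declared bucket count
INSERT OVERWRITE TABLE
dwd_prod_word_list_bucket_part PARTITION (pdate='2013-04-15')
SELECT
  word_id,
  word_text
FROM
  dwd.dwd_prod_word_list
WHERE pdate='2013-04-15'
CLUSTER BY word_id;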
After inserting the data, view the HDFS file directory:
[rigelcitest@tc-crm-rd03.tc.baidu.com ~]$ hadoop fs -ls /app/ecom/rigelci/hive/zj_exercise.db/dwd_prod_word_list_bucket_part/pdate=2013-04-15
Found items
-rw-r--r--   3 rigelci rigelci  322404484 2013-04-25 10:36 /app/ecom/rigelci/hive/zj_exercise.db/dwd_prod_word_list_bucket_part/pdate=2013-04-15/000000_0
-rw-r--r--   3 rigelci rigelci  322433479 2013-04-25 10:36 /app/ecom/rigelci/hive/zj_exercise.db/dwd_prod_word_list_bucket_part/pdate=2013-04-15/000001_0
-rw-r--r--   3 rigelci rigelci  322429728 2013-04-25 10:36 /app/ecom/rigelci/hive/zj_exercise.db/dwd_prod_word_list_bucket_part/pdate=2013-04-15/000002_0
-rw-r--r--   3 rigelci rigelci  322447140 2013-04-25 10:36 /app/ecom/rigelci/hive/zj_exercise.db/dwd_prod_word_list_bucket_part/pdate=2013-04-15/000003_0
...
2.2 Optimization of join in Hive
2.2.1 Common join
The execution of the common join in Hive is illustrated in Figure 1. The mappers read the tables being joined, emit (key, value) pairs, and write them to intermediate files. Hadoop sorts and merges these key-value pairs in the shuffle phase. The reducers take the sorted and merged results and perform the join in the reduce phase. The shuffle phase consumes a lot of CPU because of the sorting and merging, so omitting the shuffle and reduce phases can improve performance.
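As a minimal illustration, a plain join with no map-join hint is executed as the common join described above, at least in Hive versions where automatic map-join conversion is not enabled; the table and column names follow the earlier examples in Section 1:
-- both inputs are shuffled and sorted on user_id, and the reducers perform the match
SELECT a.user_id, b.time
FROM T a JOIN T2 b ON (a.user_id = b.user_id);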
Figure 1. Common join schematic diagram
2.2.2 Map join
The purpose of the map join is to omit the shuffle and reduce phases and complete the join in the map phase. Before the MapReduce task starts, a MapReduce local task is launched; it reads the small table from HDFS into an in-memory hashtable and writes that hashtable to a local hashtable file. When the MapReduce task starts, the local hashtable file is uploaded to the distributed cache. Each mapper fetches the hashtable file from the distributed cache, restores the small table's hashtable into memory, and joins it against the large table.
Figure 2. Map join schematic diagram
2.2.3 Converting common join to map join
With the syntax above, the statement must specify which table is the small table, as in the following SQL. A better solution is to automatically convert a common join to a map join without requiring an explicit hint.
SELECT /*+ MAPJOIN(x) */ * FROM src1 x JOIN src2 y ON x.key = y.key;
Figure 3 shows how a common join is automatically converted to a map join. Because some tables are temporary tables produced by subqueries, the query processor cannot know at compile time which table is the small table; the input file sizes are only known at execution time. At compile time, the query processor therefore generates a conditional task containing a list of tasks, each of which is one join plan. For example, for SELECT * FROM src1 x JOIN src2 y ON x.key = y.key, the common join task is put into the task list first, and in addition two map join tasks, one treating src1 as the large table and one treating src2 as the large table, are added. At execution time, the conditional task knows the size of each input table, even if an input is an intermediate table. If all input tables are too large, the common join is executed; if one table is small enough to load into memory, the conditional task chooses to start the corresponding map join local task. In this way, the common join is automatically converted to a map join.
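A minimal sketch of relying on this automatic conversion rather than a hint; hive.auto.convert.join is the standard switch for this behavior, and the size threshold shown is an illustrative value:
set hive.auto.convert.join=true;               -- let the planner decide at execution time
set hive.mapjoin.smalltable.filesize=25000000; -- tables below this size (bytes) are treated as small (illustrative)
SELECT x.key, y.value
FROM src1 x JOIN src2 y ON (x.key = y.key);    -- no /*+ MAPJOIN */ hint needed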
Figure 3. Converting common join to map join
2.2.4 Bucket map join
The bucket map join execution plan has two steps. A MapReduce local task is again launched before the map/reduce task starts. This local task reads each bucket of the small table from HDFS and converts it into a hashtable file, producing number_buckets hashtable files. When the MapReduce task starts, these hashtable files are placed into the distributed cache and broadcast to the map side of all large-table mappers [1]. The map side of the large table receives the small table's number_buckets hashtables and does not need to merge them into one large hashtable. The map operation over the large table works on number_buckets splits, each tagged with the same bucket tag as the corresponding small-table hashtable, so when performing the probe it only needs to load one small-table hashtable into memory and match the corresponding split of the large table against it. Memory usage is therefore bounded by the size of the largest hashtable of the small table. Unlike a normal map join, the bucket map join does not have to convert the entire small table into a single hashtable in the local task before the map/reduce task; it generates number_buckets hashtables instead, and the map task does not match one big hashtable against the whole large table, but only matches each bucket's split against the hashtable of the corresponding small-table bucket. In terms of performance, the bucket map join is therefore better than the normal map join, especially for joins between large tables. The normal map join is illustrated in Figure 4, and the bucket map join in Figure 5.
Figure 4. Schematic diagram of the normal map join [2]
Figure 5. Schematic diagram of the bucket map join [2]
When the tables being joined satisfy the following three conditions, Hive joins the corresponding buckets of the tables during the map phase of the join.
1. Each table is divided into buckets
2. Bucket column == join column
3. The number of buckets in one table is an integer multiple of the number of buckets in the other table
This feature improves join performance, especially when two large tables are joined, because it is not necessary to scan one table in full and then match it against all the data in every bucket of the other table. However, this optimization is not enabled by default in Hive; the parameter hive.optimize.bucketmapjoin must be set to true to turn it on.
set hive.optimize.bucketmapjoin=true;
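As an illustration of the three conditions above, a hedged sketch with two tables bucketed on the join column, where one bucket count is an integer multiple of the other; the table names and bucket counts are hypothetical:
-- small table: 4 buckets on user_id; large table: 8 buckets on user_id (an integer multiple of 4)
CREATE TABLE t_small (user_id BIGINT, val STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;
CREATE TABLE t_large (user_id BIGINT, val STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS;

set hive.optimize.bucketmapjoin=true;
SELECT /*+ MAPJOIN(a) */ a.user_id, b.val
FROM t_small a JOIN t_large b ON (a.user_id = b.user_id);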
2.2.5 Sort merge bucket map join
On top of the bucket map join, if the data is also sorted on the join column, the sort merge join optimization can be used. Figure 6 shows the sort merge bucket map join. Because the tables are sorted on the join key, it is not necessary to load a table entirely into memory; a portion of the small table's data can be read at a time and matched row by row against the large table. Such a join is not limited by memory size, and a full outer join can also be performed.
Figure 6. Sort merge bucket map join schematic diagram
Hive performs the higher-performance sort merge bucket map join when the tables being joined meet the following three conditions.
1. Each table is divided into buckets
2. Bucket column == join column == sort column
3. The number of buckets in one table equals the number of buckets in the other table
Both the large table and the small table are sorted on the join key, so there is no memory limit at join time. However, this optimization is not enabled by default in Hive; the following parameters must be set to turn it on.
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
With these settings, the small table's data can be read a portion at a time and matched row by row against the large table, so the join is not limited by memory size and a full outer join can also be performed.
3. Performance comparison
3.1 Bucket map join vs. sort merge bucket map join
3.1.1 Performance analysis
The principles of the bucket map join and the sort merge bucket map join were described in Section 2. For the bucket map join, a MapReduce local task is started before the MapReduce task to convert the buckets into hashtables; each hashtable has to be serialized into the local tmp folder and compressed. When the MapReduce task starts, the number_buckets compressed hashtable archives produced earlier are placed into the distributed cache. Three bottlenecks can be seen:
1) The local task runs locally and is memory-intensive; it will fail with an error if the hashtable is larger than the client JVM heap size (see the sketch after this list)
2) Serializing the hashtables to the local tmp folder involves IO
3) The total size of the compressed bucket hashtable files cannot be too large, or placing them into the distributed cache becomes problematic
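If the local task's memory is the concern in point 1, Hive exposes a threshold for it; the following is a minimal sketch, and the parameter value shown is an assumption based on common Hive releases:
-- fail the map join local task if its in-memory hashtable exceeds this fraction of the local JVM heap
set hive.mapjoin.localtask.max.memory.usage=0.90;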
The sort merge bucket map join performs much better than the bucket map join in the join phase; its time is mainly spent on the earlier sorting. Compared with the bucket map join, the sort merge bucket map join has an advantage in memory consumption and CPU usage. Section 3.3 compares the performance of normal bucket inserts and sorted bucket inserts.
3.1.2 Test statements
Bucket map join test statements
----Set up a general bucket table 1, no sort on bucket column
CREATE TABLE dwd_prod_word_list_bucket_test1 (
  word_id BIGINT COMMENT 'keyword ID',
  word_text STRING COMMENT 'keyword literal'
)
COMMENT 'DWD layer keyword and ID correspondence code table _ bucketed'
CLUSTERED BY (word_id) INTO BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

----Set up a general bucket table 2, no sort on bucket column
CREATE TABLE dwd_prod_word_list_bucket_test2 (
  word_id BIGINT COMMENT 'keyword ID',
  word_text STRING COMMENT 'keyword literal'
)
COMMENT 'DWD layer keyword and ID correspondence code table _ bucketed'
CLUSTERED BY (word_id) INTO BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

SET hive.enforce.bucketing = true;

-----Inserting data
INSERT OVERWRITE TABLE dwd_prod_word_list_bucket_test1
SELECT
  word_id,
  word_text
FROM dwd_prod_word_list
WHERE pdate='2013-04-15';

-----Inserting data
INSERT OVERWRITE TABLE dwd_prod_word_list_bucket_test2
SELECT
  word_id,
  word_text
FROM dwd_prod_word_list
WHERE pdate='2013-04-14';

set hive.mapjoin.smalltable.filesize=400000000;
set hive.optimize.bucketmapjoin=true;

----Bucket map join
SELECT nd.word_id
FROM dwd_prod_word_list_bucket_test1 nd
LEFT OUTER JOIN dwd_prod_word_list_bucket_test2 od
ON nd.word_id = od.word_id
WHERE od.word_id IS NULL;
Sort merge bucket map join test statements
----Create bucket table 1, sorted on bucket column
CREATE TABLE dwd_prod_word_list_sort_bucket_test1 (
  word_id BIGINT COMMENT 'keyword ID',
  word_text STRING COMMENT 'keyword literal'
)
COMMENT 'DWD layer keyword and ID correspondence code table _ bucketed _ sorted'
CLUSTERED BY (word_id)
SORTED BY (word_id ASC)
INTO BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

----Create bucket table 2, sorted on bucket column
CREATE TABLE dwd_prod_word_list_sort_bucket_test2 (
  word_id BIGINT COMMENT 'keyword ID',
  word_text STRING COMMENT 'keyword literal'
)
COMMENT 'DWD layer keyword and ID correspondence code table _ bucketed _ sorted'
CLUSTERED BY (word_id)
SORTED BY (word_id ASC)
INTO BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

SET hive.enforce.bucketing = true;

-----Inserting data
INSERT OVERWRITE TABLE dwd_prod_word_list_sort_bucket_test1
SELECT
  word_id,
  word_text
FROM dwd_prod_word_list
WHERE pdate='2013-04-15'
CLUSTER BY word_id;

-----Inserting data
INSERT OVERWRITE TABLE dwd_prod_word_list_sort_bucket_test2
SELECT
  word_id,
  word_text
FROM dwd_prod_word_list
WHERE pdate='2013-04-14'
CLUSTER BY word_id;

set hive.mapjoin.smalltable.filesize=400000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;

----Sort merge bucket map join
SELECT nd.word_id
FROM dwd_prod_word_list_sort_bucket_test1 nd
LEFT OUTER JOIN dwd_prod_word_list_sort_bucket_test2 od
ON nd.word_id = od.word_id
WHERE od.word_id IS NULL;
Common join test statement; the amount of data is large, with each table larger than 30 GB.
SELECT nd.word_id
FROM (SELECT word_id, word_text FROM dwd_prod_word_list WHERE pdate='2013-04-15') nd
LEFT OUTER JOIN (SELECT word_id, word_text FROM dwd_prod_word_list WHERE pdate='2013-04-14') od
ON nd.word_id = od.word_id
WHERE od.word_id IS NULL;
3.1.3 Performance comparison
Data volume
[rigelcitest@tc-crm-rd03.tc.baidu.com rigelcitest]$ hadoop fs -du /app/ecom/rigelci/hive/zj_exercise.db/dwd_prod_word_list
Found 2 items
32212496222  hdfs://szwg-ecomoff-hdfs.dmop.baidu.com:54310/app/ecom/rigelci/hive/zj_exercise.db/dwd_prod_word_list/pdate=2013-04-14
32242452202  hdfs://szwg-eco