Performance analysis of joining a small table with a large table in Hive (repost)

Source: Internet
Author: User
Tags: join

In fact, the usual Hive optimization advice is that when joining multiple tables, the large table should be placed last, because the results of the earlier joins are kept in memory. That said, the original blogger's analysis below is well worth reading.

You often see Hive optimization advice saying that when a small table is joined with a large table, the small table should be written first, so that Hive can perform the join faster. The stated reason is that the small table can be loaded into memory, each record of the large table is then matched against it in memory, and the join is completed that way. This sounds plausible, but it does not hold up under scrutiny.

How small does a table have to be to count as "small"? And what happens if the so-called small table does not fit in memory? I ran a join using two tables with only a few records each, which should certainly qualify as small, yet the reduce execution log still showed writes to disk. In fact, after a reducer receives the output of all the maps, it must sort all of the key-value pairs and merge them into a file on disk. There can be multiple spills to disk, so several temporary files may be produced, but they are eventually merged into a single file; that is, each reducer ends up processing just one file.

I ran an experiment joining a table with a single record against a table with more than 300 million records. Whether the small table was written before or after the join keyword, the execution time was almost identical. Looking at the reduce logs, the logs of the two queries were also nearly the same. If the table on the left of the join really were held in memory while waiting for the table on the right, then putting the 300-million-record table on the left should overflow memory, force disk writes, and make execution much slower than putting the small table first; but that is not what happens, which again shows that the explanation above does not hold.

In fact, the claim that "putting the small table in front of the join improves efficiency" is wrong. The correct statement is: "putting the table whose join keys have fewer duplicate values in front of the join improves the efficiency of the join."

Let's analyze how Hive implements a two-table join under the hood. Since every Hive query, no matter how complex, is ultimately translated into MapReduce jobs, Hive's join implementation should be similar to a MapReduce join. In MapReduce, a join is implemented by combining the join key with a tag indicating whether the record comes from the left or the right side of the join, forming a composite key; the record itself, together with the same tag, becomes the value. In the shuffle phase, records are sorted primarily by the join key within the composite key, and records with the same join key are then sorted by the tag. Partitioning, however, is done on the join key alone, so all records with the same join key end up in the same value list, with the records from the table on the left of the join appearing before the records from the table on the right.
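To make that concrete, here is a minimal Python sketch of the map-side tagging and partitioning described above (the field name "id", the function names, and the hash-based partitioner are illustrative assumptions, not Hive's actual code):

# Sketch of reduce-side join tagging (illustrative, not Hive source code).
def map_side_emit(record, table_tag):
    """Emit (composite_key, value) for one input record.
    table_tag is 0 for the table on the left of the join, 1 for the right."""
    join_key = record["id"]                   # the join column
    composite_key = (join_key, table_tag)     # sort by join key first, then by tag
    value = (table_tag, record)               # keep the tag so reduce knows the source table
    return composite_key, value

def partition(composite_key, num_reducers):
    # Partition on the join key only, so all records with the same join key,
    # from both tables, are sent to the same reducer.
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers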

For example, take a join b on (a.id = b.id). If both table A and table B have one record with id = 3, then A's record gets the composite key (3, 0) and B's record gets the composite key (3, 1). Sorting therefore guarantees that A's record comes before B's record. When the reducer processes id = 3, both records land in the same value list: key = 3, value list = [A's record with id = 3, B's record with id = 3].
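Continuing that example as a small Python sketch (the record strings are made up):

tagged = [((3, 1), "B record, id=3"), ((3, 0), "A record, id=3")]
tagged.sort()    # shuffle sort: by join key first, then by table tag
# tagged is now [((3, 0), 'A record, id=3'), ((3, 1), 'B record, id=3')],
# so the reducer sees key = 3 with value list
# [A's id=3 record, B's id=3 record].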

Now let's look at what the reducer actually does when joining the two tables. The reducer processes all records with the same id; we use v to denote the value list as an array. (A minimal sketch of this loop follows the numbered walkthrough below.)

1) The reducer first reads v[0]. If v[0] turns out to be a record from table B, then there are no records from table A for this key, so nothing will ever be output and no further processing of this id is needed. Reading v[0] took 1 read operation.

2) If v[0] through v[length-1] are all records from table A, then there are no records from table B, and again nothing is output; note, however, that the whole value list has been read, costing length read operations.

3) Suppose table A has 1 record with id = 3 and table B has 10. The reducer reads v[0] and finds it is A's record: 1 read operation. It then reads v[1] and finds it is a B record, so v[0] and v[1] can be joined and output directly: 2 operations so far. At this point the reducer knows that everything from v[1] onward belongs to table B, so it can join v[0] with v[2], v[3], ..., v[10] in turn and output the results: 11 operations in total.

4) Conversely, suppose table A has 10 records with id = 3 and table B has 1. The reducer reads v[0]: an A record, 1 read operation. It reads v[1]: still an A record, 2 reads. And so on; by the time it reads v[9] and finds it is still an A record, it has used 10 read operations. It then reads the last record, v[10], finds it is a B record, and joins v[0] with v[10]: 11 operations. Next, v[1] through v[9] are each joined with v[10]: 20 operations in total.

5) A slightly more complex case: table A has 2 records with id = 3 and table B has 5. The reducer reads v[0]: an A record, 1 read operation. It reads v[1]: still an A record, 2 reads. It then reads v[2]: a B record, so v[0] and v[2] can be joined and output: 3 operations. Next, v[0] is joined with v[3] through v[6] in turn: 7 operations. Then v[1] is joined with v[2] through v[6]: 12 operations in total.

6) Reversing example 5: table A has 5 records with id = 3 and table B has 2. The reducer reads v[0]: an A record, 1 read operation. It reads v[1]: still an A record, 2 reads. And so on until v[4], still an A record: 5 reads. It then reads v[5], finds it is a B record, and joins v[0] with v[5]: 6 operations. Then v[0] is joined with v[6]: 7 operations. Then v[1] is joined with v[5] and v[6]: 9 operations. v[2] is joined with v[5] and v[6]: 11 operations. And so on, until v[4] is joined with v[5] and v[6]: 15 operations in total.

7) In addition, while the reducer reads table A's records, it also counts how many A records share the same key. When that count exceeds hive.skewjoin.key (default 1000000), the key is printed to the reducer's log and marked as a skewed join key.
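Here is a minimal Python sketch of that reduce-side loop, assuming the value list is already sorted so that table A's records come first. The function name, the tagging convention, and the materialized list are my assumptions; only hive.skewjoin.key and its default of 1000000 come from Hive:

SKEW_THRESHOLD = 1000000   # mirrors hive.skewjoin.key

def reduce_join(key, values):
    """Join the value list for one key; each value is (tag, record),
    tag 0 = table A (left of the join), tag 1 = table B (right)."""
    ops = 0
    a_records = []
    output = []
    for i, (tag, record) in enumerate(values):
        ops += 1                                   # one read operation
        if tag == 0:                               # still reading A records
            a_records.append(record)
            if len(a_records) > SKEW_THRESHOLD:
                print("skewed join key:", key)     # Hive logs the key as skewed
        else:                                      # first B record reached
            if not a_records:                      # case 1: only B records, no output
                return output, ops
            b_records = [rec for _, rec in values[i:]]
            for a in a_records:                    # cases 3-6: every A joins every B
                for b in b_records:
                    output.append((a, b))
            # the read of this first B record already covered the first join,
            # so only the remaining pairs count as extra operations
            ops += len(a_records) * len(b_records) - 1
            return output, ops
    return output, ops                             # case 2: only A records, no output

For example, reduce_join(3, [(0, "a1")] + [(1, "b%d" % i) for i in range(10)]) outputs 10 joined pairs using 11 operations, matching case 3; swapping the tables gives 20 operations, matching case 4.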

The final conclusion is: every duplicate join key in the table written on the left side of the join costs one extra operation in the underlying processing.

Suppose a table has 10 million ids and each id has 3 duplicate values on average. Putting that table in front of the join then means more than 30 million operations of this kind, and at that scale you can clearly see the performance difference between writing it first and writing it last.
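In formula form (my notation, derived from the counts above): a key with m matching records in table A and n in table B costs the reducer roughly m + m*n operations, and only the m term depends on which table is written on the left. A quick arithmetic check:

def join_cost(left_dups, right_dups):
    # reads of the left table's records plus one operation per joined pair
    return left_dups + left_dups * right_dups

# The counts from the walkthrough above:
assert join_cost(1, 10) == 11 and join_cost(10, 1) == 20
assert join_cost(2, 5) == 12 and join_cost(5, 2) == 15

# 10 million ids with 3 duplicates each: 30 million read operations are
# spent on the left table's records alone before any join is output.
print(3 * 10_000_000)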
