Performance Analysis of join between small and medium-sized tables in Hive zz

Source: Internet
Author: User

In fact, we recommend that you add large tables to the back when associating multiple tables, because the preceding association results should be placed in the memory. However, the blog post itself is also well written.

We often see some hive optimization suggestions that when small tables are associated with large tables, we should write the small tables in front so that the hive association can be faster, the reason is that small tables can be put into the memory first, and then each record of large tables can be checked in the memory to complete the association query. Such a reason seems reasonable, but careful scrutiny makes no sense.

How many small tables are considered as small tables? What if small tables cannot be stored in the memory? I used two tables with only a few records for join queries. This should be a small table, and there are still disk write operations when viewing reduce execution logs. In fact, after receiving all map output, reduce will certainly have an operation to sort all key-value pairs and merge them into disk files. Writing data to a disk (spill) may occur multiple times. Therefore, multiple temporary files may be generated, but they must be merged into one file, that is, each reduce can process only one file.

I did an experiment. I joined a table with one record and a table with more than 0.3 billion records, no matter whether a small table is placed before or after join, the execution time is almost the same. Let's look at the reduce execution log. The reduce log queried by a record table before or after join is almost the same. If we put the table on the left of the join statement in memory and wait for the table on the right of the join statement to be checked in memory, when the table with more than 0.3 billion records is on the left of the join statement, the memory is certainly unable to accommodate so many records, so it is necessary to write data to the disk. Therefore, the execution time of the memory should be much longer than that of the small table before the join operation, however, this is not the case, which means that the reasons mentioned above are not reasonable.

In fact, it is wrong to say that "putting small tables in front for association can improve efficiency. The correct statement should be "putting tables with fewer duplicate join keys before join can improve join efficiency ."

Analyze how Hive achieves association between two tables at the underlying layer. Because no matter how complicated Hive queries are, they are eventually converted into mapreduce jobs for execution. Therefore, Hive's associated implementation should be similar to mapreduce's associated implementation. For the implementation of association, mapreduce simply uses the Association key and tag on the left or right of join as the key combination ), combine a record and a tag on the left or right of join as a value ). In the shuffle phase of reduce, sort by the associated keys of the key combination. When the associated keys are the same, sort by marker. In the partition segment, only the associated key in the association key is used for the partition segment, so that records with the same association key will be placed in the same value list, at the same time, the record of the table on the left of join is ensured to be before the value list, while the record of the table on the right of join is behind the value list.

For example, A join B ON (. id = B. id). If table A and table B both have 1 record with id = 3, the key combination of the record in Table A is ), the key combination for this record in Table B is (3, 1 ). During sorting, you can ensure that the records of Table A are before the records of Table B. In reduce processing, Put id = 3 in the same value list to form records with key = 3 and value list = [A Table id = 3, records of Table B id = 3]

Next, let's look at what reduce does when two tables are joined. Reduce processes all records with the same id. We use an array to represent the value list.

1) Reduce first reads the first record v [0]. If it finds that v [0] is A record of Table B, it means that no record of Table A will be associated with the output, therefore, you do not need to process this id any more. Read v [0] with one read operation.

2) if we find that all records from Table v [0] to Table v [length-1] are from Table A, it means that no records from Table B will be associated with the output, but note that, the value has been read for length.

3) For example, table A id = 3 has 1 record, and table B id = 3 has 10 records. First, read v [0] And find that it is A record of table A. One read operation is used. Then read v [1] and find that the operation is performed on Table B. At this time, v [0] and v [1] can be directly associated with the output, and two operations are used in total. At this time, reduce knows that table B is followed by records starting from v [1], so we can directly use v [0] in sequence and v [2], v [3]… V [10] performs Association operations and outputs, and uses 11 operations in total.

4) if Table A id = 3 has 10 records, and table B id = 3 has 1 record. First, read v [0] And find that it is A record of table A. One read operation is used. Then, read v [1] and find that the record is still in Table A. Two read operations are used in total. Similarly, when reading table v [9], we found that it was still A record of table A. A total of 10 read operations were used. Then, read the last record v [10] And find that it is a record of Table B. You can associate v [0] with v [10] for output, and use 11 operations in total. Next, you can directly import v [1] ~ V [9] is associated with v [10], and 20 operations are used in total.

5) to be more complex, assume that table A id = 3 has two records, and table B id = 3 has five records. First, read v [0] And find that it is A record of table A. One read operation is used. Then, read v [1] and find that the record is still in Table A. Two read operations are used in total. Then read v [2] and find that it is a record of Table B. At this time, v [0] and v [2] can be directly associated with the output, and three operations are used in total. Next v [0] can be followed by v [3] ~ V [6] is associated with the output, and seven operations are used in total. Next v [1] and then v [2] ~ V [6] is associated with the output, and 12 operations are used in total.

6) Call the example 5. Assume that table A id = 3 has five records, and table B id = 3 has two records. Read v [0] first and find that it is A record of table A. One read operation is used. Then, read v [1] and find that the record is still in Table A. Two read operations are used in total. Similarly, after reading data from Table v [4], we found that the record is still in Table A, and 5 read operations were used in total. Next, read v [5] and find that it is a record of Table B. At this time, v [0] and v [5] can be directly associated with the output, and six operations are used in total. Then v [0] and v [6] are associated with the output, and seven operations are used in total. Then v [1] is associated with v [5] and v [6] respectively, and 9 operations are used in total. V [2] is associated with v [5] and v [6] respectively, and 11 operations are used in total. Similarly, v [4] is associated with v [5] and v [6] respectively, and 15 operations are used in total.

7) Additionally, when reduce detects records in Table A, it also records the number of records of the same key in Table A. When it finds that the number of records of the same key exceeds hive. skewjoin. when the key value (1000000 by default) is used, the key is printed in the reduce log and marked as a skewed Association key.

The final conclusion is that each table written on the left of the association has one duplicate Association Key, and the underlying layer will perform one operation more than once.

Assume that table A has 10 million IDs and each id has 3 duplicate values on average. Therefore, if Table A is placed in the front of the table for association, more than 30 million operations will be performed, at this time, the performance difference will be seen after who wrote it before it.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.