In-depth understanding of Oracle table (5): definitions, principles, algorithms, costs, modes, and bitmaps of Hash Join


For the previous installment, see In-depth understanding of Oracle table (4): the relationship between tables and segments:
http://www.bkjia.com/database/201304/203778.html
Hash Join can only be used for equi-joins, and only under the CBO optimizer. Compared with nested loop join, hash join is better suited to processing large result sets.
In a Hash Join execution plan there is a build table (on which the hash table is built) and a probe table. These are generally not called the inner and outer tables; that terminology belongs to nested loops.
The build table is also called the inner table.
The execution plans of the two are as follows:
Nested loop
  Outer table  -- driving table
  Inner table

Hash join
  Build table (inner table)  -- driving table
  Probe table (outer table)
First, take a look at the following figure to get a general idea of the Hash Join process:
[Figure: overview of the Hash Join process]
Next, let's take a look at Hash Join.
(I) Concept of Hash Join
 
The basic idea of the hash join algorithm is this: take the smaller row source (called the build input; this is the build table mentioned above; call it S, and call the larger row source B) and build a hash table on it that fits in the hash area memory.
Then use the larger row source (the probe input, i.e. the probe table mentioned above) to probe the hash table just built.
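The build-then-probe idea can be sketched in Python (an illustrative simulation, not Oracle's implementation; the row data and column names are made up):

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Build a hash table on the smaller row source S, then probe it
    with each row of the larger row source B (equi-join only)."""
    hash_table = defaultdict(list)          # build phase: hash S on the join key
    for row in build_rows:
        hash_table[row[key]].append(row)

    result = []                             # probe phase: one lookup per B row
    for row in probe_rows:
        for match in hash_table.get(row[key], []):
            result.append({**match, **row})
    return result

# Hypothetical example data:
s = [{"id": 1, "s_val": "a"}, {"id": 3, "s_val": "b"}]
b = [{"id": 1, "b_val": 10}, {"id": 9, "b_val": 20}]
print(hash_join(s, b, "id"))
```

Note that each probe row costs a single hash lookup rather than a scan of S, which is why hash join outperforms nested loops on large inputs.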
If the hash area memory is not large enough, the hash table cannot be held entirely in memory.
In that case Oracle applies a hash function to the join key to split both the build input and the probe input into multiple disjoint partitions,
written Si and Bi respectively. This stage is called the partitioning phase. The corresponding partition pairs, Si and Bi, are then hash-joined; this stage is called the join phase.
If the hash table is too large to be built in memory in one pass, it is split into several partitions written to temporary segments on disk, which adds write cost and reduces efficiency.
As for what counts as a small table: any table whose hash table fits in the PGA can be treated as small. For example, with:
pga_aggregate_target big integer 1073741824
more than 40 MB can be used for the hash area, which typically holds hundreds of thousands of records.
hash_area_size defaults to 2 * sort_area_size, so modifying SORT_AREA_SIZE directly also changes HASH_AREA_SIZE.
If workarea_size_policy = auto, only pga_aggregate_target needs to be set.
Remember, though, that hash_area_size is a session-level parameter. As a rule of thumb, it is sometimes set to about 1.6 times the size of the driving table.
The notion of a driving table applies only to nested loop join and hash join, but hash join does not require an index on the driving table, whereas nested loop join depends heavily on one.
For joins between tables with one or two million rows, hash join usually performs very well.
Still, "many" and "few", "large" and "small" are hard to quantify; each case must be analyzed on its own.
If the hash table built for a partition is still too large after partitioning, Oracle falls back to a nested-loops hash join:
build a hash table on part of Si, read all of Bi and join it against that hash table,
then build a hash table on the next part of Si and again join all of Bi against it, until all of Si has been processed.
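The nested-loops hash join just described can be sketched as follows (a Python simulation over in-memory lists; `max_build_rows` is a made-up stand-in for the hash area limit):

```python
def nested_loops_hash_join(si_rows, bi_rows, key, max_build_rows):
    """Build a hash table over one memory-sized chunk of Si at a time,
    and re-read all of Bi against each chunk until Si is exhausted."""
    result = []
    for start in range(0, len(si_rows), max_build_rows):
        chunk = si_rows[start:start + max_build_rows]   # the part of Si that fits
        hash_table = {}
        for row in chunk:
            hash_table.setdefault(row[key], []).append(row)
        for b_row in bi_rows:                           # Bi is re-read per chunk
            for s_row in hash_table.get(b_row[key], []):
                result.append((s_row[key], b_row[key]))
    return result

si = [{"k": i} for i in range(6)]
bi = [{"k": i} for i in range(4)]
# With max_build_rows=2, Bi is scanned ceil(6/2) = 3 times.
print(nested_loops_hash_join(si, bi, "k", 2))
```

The repeated scans of Bi are exactly the extra cost the later cost section attributes to this fallback.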
 
(II) Principle of Hash Join
 
Consider the following two datasets:
S = { … }
B = { … }
The first step of a hash join is to determine whether the small table (the build input) fits entirely in the hash area memory.
If it does, a hash table is built in memory; this is the simplest hash join.
If it does not, the build input must be partitioned. The number of partitions is the fan-out.
The fan-out is determined by hash_area_size and the cluster size, where cluster size = db_block_size * _hash_multiblock_io_count.
_hash_multiblock_io_count is a hidden parameter; the public HASH_MULTIBLOCK_IO_COUNT parameter is no longer used after 9.0.1. Its current value can be queried as follows:

sys@ORCL> ed
Wrote file afiedt.buf

  1  select a.ksppinm name, b.ksppstvl value, a.ksppdesc description
  2    from x$ksppi a, x$ksppcv b
  3   where a.indx = b.indx
  4*   and a.ksppinm like '%hash_multiblock_io_count%'
sys@ORCL> /

NAME                       VALUE DESCRIPTION
-------------------------- ----- --------------------------------------------------
_hash_multiblock_io_count  0     number of blocks hash join will read/write at once
Oracle applies an internal hash function to the join key to split S and B into multiple partitions.
Assume here that this hash function is the remainder function, i.e. mod(join_column_value, 10).
Ten partitions are generated as follows:
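As a sketch of this partitioning (the original example values were lost from the page, so the keys below are assumptions chosen to be consistent with the bitmap vector {1, 3, 4, 5, 8, 10} and the discarded set {0, 9} discussed next):

```python
def mod10_partition(keys):
    """Split join-key values into 10 partitions using the text's
    example hash function, mod(join_column_value, 10)."""
    partitions = {i: [] for i in range(10)}
    for k in keys:
        partitions[k % 10].append(k)
    return {i: p for i, p in partitions.items() if p}   # keep non-empty partitions

s_parts = mod10_partition([1, 3, 4, 5, 8, 10])   # assumed S join keys
b_parts = mod10_partition([0, 1, 4, 9, 10])      # assumed B join keys
# Only partition pairs (Si, Bi) present on both sides need joining:
common = sorted(set(s_parts) & set(b_parts))
print(common)
```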
 
After partitioning this way, only the corresponding partitions (the so-called partition pairs) need to be joined.
If one partition of a pair is empty, the join for that pair can be skipped.
While reading table S into memory partitions, Oracle records the distinct values of the join key and builds a so-called bitmap vector,
which takes about 5% of the hash area memory. Here it is {1, 3, 4, 5, 8, 10}.
When table B is partitioned, each join-key value is compared against the bitmap vector; rows that do not match are discarded.
In our example, the following values from table B are discarded: {0, 9}.
This process is bitmap vector filtering.
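A minimal simulation of bitmap vector filtering, using the example's values (the bit width of 16 is an arbitrary assumption):

```python
def build_bitmap_vector(build_keys, nbits=16):
    """Mark which slots are occupied by at least one build-input key."""
    bitmap = [False] * nbits
    for k in build_keys:
        bitmap[k % nbits] = True
    return bitmap

def bitmap_filter(probe_keys, bitmap):
    """Discard probe keys whose slot is empty: they cannot possibly join."""
    return [k for k in probe_keys if bitmap[k % len(bitmap)]]

bitmap = build_bitmap_vector([1, 3, 4, 5, 8, 10])   # S's distinct join keys
print(bitmap_filter([0, 1, 4, 9, 10], bitmap))       # 0 and 9 are discarded
```

A set bit does not guarantee a match (collisions are possible), but a clear bit guarantees a non-match, which is what makes early discarding safe.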
After S1 and B1 are joined, each remaining pair Si and Bi is joined in turn.
Here Oracle compares the two partitions of a pair and selects the smaller one as the build input; this is dynamic role swap.
Dynamic role swap occurs for every pair except the first.

(III) Hash Join Algorithm
 
Step 1: Determine whether the small table fits entirely in the hash area memory. If it does, perform an in-memory hash join; if not, go to Step 2.
Step 2: Determine the fan-out (number of partitions):
(Number of partitions) * C <= Favm * M
where C is the cluster size, whose value is DB_BLOCK_SIZE * HASH_MULTIBLOCK_IO_COUNT;
Favm is the fraction of the hash area memory that can be used, usually about 0.8;
M is the size of HASH_AREA_SIZE.
Step 3: Read part of the small table S, using the internal hash function hash_fun_1 to map each join-key value to a partition, and using a second hash function hash_fun_2 to generate another hash value for the key.
This second hash value is used later to build the hash table and is stored together with the join-key value.
Step 4: Create a bitmap vector for the build input.
Step 5: If memory runs out, write partitions to disk.
Step 6: Read the remaining part of the small table S, repeating Step 3 until all of S has been read.
Step 7: Sort the partitions by size and select partitions to build a hash table from (the selection principle is to fit as many partitions in memory as possible).
Step 8: Build the hash table using the hash values computed by hash_fun_2.
Step 9: Read table B, using the bitmap vector to filter out rows that cannot match.
Step 10: Use hash_fun_1 to map each row to its partition and compute its hash_fun_2 value.
Step 11: If the row's partition is in memory, probe the in-memory hash table with the hash_fun_2 value and write join results out.
If the partition is not in memory, write the row to disk together with the corresponding partition of table S.
Step 12: Continue reading table B, repeating Step 10, until all of B has been read.
Step 13: Read each corresponding pair (Si, Bi) and hash-join it; dynamic role swaps occur here.
Step 14: If after partitioning the smallest partition is still larger than memory, a nested-loops hash join takes place.
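The interplay of the two hash functions (hash_fun_1 choosing a partition, hash_fun_2 keying the in-memory hash table) can be sketched like this. This is an illustrative Python simulation; the multiplicative hash used for hash_fun_2 is an arbitrary choice, not Oracle's internal function:

```python
def hash_fun_1(key, fan_out):
    """Maps a join-key value to a partition."""
    return key % fan_out

def hash_fun_2(key):
    """Second hash, stored with the row and used to build and probe
    the in-memory hash table (arbitrary multiplicative hash)."""
    return (key * 2654435761) & 0xFFFFFFFF

def partition(rows, key, fan_out):
    parts = [[] for _ in range(fan_out)]
    for row in rows:
        k = row[key]
        parts[hash_fun_1(k, fan_out)].append((hash_fun_2(k), row))
    return parts

def join_pair(s_part, b_part, key):
    """Hash-join one partition pair (Si, Bi) via hash_fun_2 values,
    with an exact key comparison to rule out hash collisions."""
    table = {}
    for h2, s_row in s_part:
        table.setdefault(h2, []).append(s_row)
    return [(s_row, b_row)
            for h2, b_row in b_part
            for s_row in table.get(h2, [])
            if s_row[key] == b_row[key]]

s_parts = partition([{"id": i} for i in (1, 3, 4, 8)], "id", 4)
b_parts = partition([{"id": i} for i in (3, 4, 9)], "id", 4)
matches = [m for i in range(4) for m in join_pair(s_parts[i], b_parts[i], "id")]
print(sorted(s["id"] for s, _ in matches))
```

Because both sides were partitioned with the same hash_fun_1, a matching pair of rows can only ever meet inside the same partition pair.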
 
(IV) Cost of Hash Join
 
(1) In-Memory Hash Join
Cost(HJ) = Read(S) + build hash table in memory (CPU) + Read(B) + perform in-memory join (CPU)
Ignoring the CPU time:
Cost(HJ) = Read(S) + Read(B)
 
(2) On-Disk Hash Join
From the steps above we can see that:
Cost(HJ) = Cost(HJ1) + Cost(HJ2)
where Cost(HJ1) is the cost of scanning tables S and B and writing the portions that cannot stay in memory back to disk, corresponding to Steps 2 through 12 above,
and Cost(HJ2) is the cost of performing the (possibly nested-loops) hash join, corresponding to Steps 13 and 14.
Cost(HJ1) ≈ Read(S) + Read(B) + Write((S - M) + (B - B*M/S))
During a nested-loops hash join, the entire probe input must be re-read for each chunk of the build input, so:
Cost(HJ2) ≈ Read(S - M) + n * (B - B*M/S), where n is the number of loops the nested-loops hash join requires: n = (S/F)/M.
Generally, if n is greater than 10, hash join performance degrades sharply.
From the formula for n we can see that n is inversely proportional to the fan-out, so increasing the fan-out reduces n.
With hash_area_size fixed, the fan-out can be increased by reducing the cluster size.
It follows that increasing the hash_multiblock_io_count parameter does not necessarily improve hash join performance.
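The cost formulas above can be put into a small calculator (the numbers below are purely illustrative; units are blocks, and CPU time is ignored as in the text):

```python
def hash_join_cost(S, B, M, F):
    """Rough I/O cost per the text's model: S = build size, B = probe size,
    M = hash area memory, F = fan-out."""
    if S <= M:                       # in-memory: Cost(HJ) = Read(S) + Read(B)
        return S + B
    spill_s = S - M                  # build blocks spilled to disk
    spill_b = B - B * M / S          # probe blocks spilled to disk
    hj1 = S + B + spill_s + spill_b  # Read(S) + Read(B) + Write((S-M)+(B-B*M/S))
    n = (S / F) / M                  # nested-loops passes
    hj2 = spill_s + n * spill_b      # Read(S-M) + n * (B - B*M/S)
    return hj1 + hj2

print(hash_join_cost(100, 1000, 200, 10))  # build input fits in memory
print(hash_join_cost(100, 1000, 50, 10))   # build input spills
```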
 
(V) Hash Join Process
 
A complete hash join is as follows:
1. Calculate the number of partitions (buckets) for the small table
An important factor in hash join is the number of partitions (buckets) for the small table.
This number is determined by the hash_area_size, hash_multiblock_io_count, and db_block_size parameters.
Oracle reserves 20% of the hash area for partition header information, hash bitmap information, and the hash table.
The formula is therefore:
Number of buckets = 0.8 * hash_area_size / (hash_multiblock_io_count * db_block_size)
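The formula can be evaluated directly; the parameter values below are illustrative assumptions, not values from the text:

```python
def bucket_count(hash_area_size, hash_multiblock_io_count, db_block_size):
    """Number of buckets = 0.8 * hash_area_size / cluster size, where
    20% of the hash area is reserved for headers, bitmap, and hash table."""
    cluster_size = hash_multiblock_io_count * db_block_size
    return int(0.8 * hash_area_size / cluster_size)

# Assumed settings: 32 MB hash area, 8 KB blocks, multiblock I/O count of 4.
print(bucket_count(32 * 1024 * 1024, 4, 8192))
```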
 
2. Hash calculation
Read the small table's data (call it R) and compute, for each row, two hash values using two hash algorithms chosen for maximum speed (the first hash value and the second hash value).
The partitions keyed by the first hash value make up the hash table.
 
3. store data in hash memory
Each row produced by the hash computation is placed into the bucket selected by its first hash value.
The second hash value is stored within each record.
 
4. Create a hash bitmap
At the same time, a hash bitmap is created, recording for each bucket whether it holds data.
 
5. Spill the part that exceeds memory to disk
If the hash area fills up, the largest partition is written to disk (temporary tablespace).
Any subsequent record belonging to an on-disk partition updates that partition on disk.
This seriously hurts performance, so it should be avoided as much as possible.
Steps 2-5 repeat until the entire small table has been read.
 
6. Sort partitions
To make full use of memory and keep as many partitions resident as possible, Oracle sorts the partitions in memory by size.
 
7. Read the large table and perform hash matching
Next, the large table (call it S) is read.
Each row is read in turn, its hash value is computed, and it is checked against the partitions resident in memory.
If it matches, the join result is returned.
If it matches no in-memory partition, the row is written to a new partition, computed with the same hash algorithm used for R.
The number of new partitions generated from the data of S therefore equals the number of partitions in R's partition set. These new partitions are stored on disk (temporary tablespace).
 
8. Read the large table to the end
Repeat step 7 until all rows of the large table have been read.
 
9. Process the data not yet joined
At this point a large amount of join output has already been produced; what remains on disk are the partitions spilled from both R and S.
 
10. Secondary hash calculation
Take the smallest partition from the R and S partition sets, compute hash values with the second hash function, and build a hash table in memory.
The second hash function is used to achieve a better data distribution.
11. Secondary hash matching
Read the partition data of the other data source (the counterpart of the partition hashed into memory) and match it against the new in-memory hash table, returning join results.
 
12. Complete all hash join Operations
Continue processing the remaining partitions as in steps 9-11 until all partitions are processed.

(VI) Hash Join Modes
In Oracle, hash join has three modes: optimal, one-pass, and multi-pass.
(1) optimal
When the entire hash table generated from the driving result set fits in the PGA's hash area, the join is optimal. The general process:
① Obtain the driving result set from the driving table.
② Create hash buckets in the hash area and group several buckets into a partition. A bitmap list is also generated, with one bit per bucket.
③ Hash the join key of each result-set row to distribute the rows into the buckets of the corresponding partitions.
When this is done, if the key values are fairly unique, the data is spread evenly across the buckets, though some buckets may be empty.
An empty bucket's flag bit in the bitmap is 0, while a bucket with data gets a 1.
④ Scan the second table and hash each row's join key to determine which bucket of which partition should be probed.
Before probing, check that bucket's bit in the bitmap: if it is 0, there is no data and the row is discarded immediately.
⑤ If the bit is 1, do an exact match within the bucket; if it succeeds, the row is returned.
This is the optimal hash join; its cost is essentially the full scans of the two tables.
The figure at the beginning of this article depicts this case.

(2) one-pass
What if the process's PGA is small, or the driving table's result set is large and exceeds the hash area size?
The temporary tablespace is used, and Oracle's handling becomes more involved; the partition concept above matters here.
Think of it as the data going through two hash computations: first to determine its partition, then its bucket.
Suppose the hash area is smaller than the entire hash table but at least as large as one partition; this is the one-pass case.
After the hash table is generated, some partitions stay in memory while the others go to the temporary tablespace on disk.
It is even possible for half of a partition to be in memory and half on disk. The remaining steps are roughly:
① Scan the second table and hash each join key to determine the corresponding partition and bucket.
② Check the bitmap to see whether the bucket has data; if not, discard the row immediately.
③ If it has data and the partition is in memory, probe the bucket for an exact match; a match returns the row, otherwise it is discarded.
④ If the partition is on disk, write the row to disk for later processing, stored by partition and bucket.
⑤ After the second table has been scanned, many partitions produced from the driving table and the probe table remain on disk.
⑥ Since both sides used the same hash algorithm for partition and bucket, only the matching partitions on the two sides need to be compared pairwise.
Oracle also optimizes this comparison: there is no strict driver/driven relationship, and within each partition pair it picks the smaller side as the build input, until all partition pairs on disk have been joined.
Compared with optimal, the extra cost is re-reading the partitions that could not stay in memory once, hence the name one-pass.
As long as memory can hold the next partition, Oracle frees up space for it, and each on-disk partition achieves one pass.
 
(3) multi-pass
This is the most complex and worst-performing hash join.
Here the hash area is too small to hold even one partition. After the driving table is scanned,
perhaps only half of one partition is left in the hash area, with the rest of that partition and all other partitions on disk.
The remaining steps are similar to one-pass; the difference lies in how a partition is handled.
Since only half of a driving-table partition is in memory, when the probe table's corresponding partition is used to probe it,
a non-matching row cannot simply be discarded: it must be kept on disk and joined again against the remaining half of the driving table's partition.
This example assumes half a partition fits in memory; if less fits, the repeated join passes multiply.
When multi-pass occurs, the number of physical reads of partitions rises significantly.
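The three modes can be summarized as a simple classification (a sketch that assumes equal-sized partitions, with thresholds following the descriptions above):

```python
def hash_join_mode(hash_area_size, hash_table_size, fan_out):
    """optimal: the whole hash table fits in the hash area;
    one-pass: at least one whole partition fits;
    multi-pass: not even one partition fits."""
    if hash_table_size <= hash_area_size:
        return "optimal"
    if hash_table_size / fan_out <= hash_area_size:   # one partition fits
        return "one-pass"
    return "multi-pass"

for size in (80, 400, 1600):   # hash table sizes vs. a 100-unit hash area
    print(hash_join_mode(100, size, 8))
```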

(VII) Bitmap of Hash Join
This bitmap records whether each hash partition (bucket) contains values, i.e. which ones hold data.
Its biggest benefit: when a row of the probe input fails to match the in-memory hash table,
the bitmap is checked first to decide whether the unmatched row needs to be written to disk.
Rows that can never match (the corresponding bit in the bitmap shows no data) are not written to disk at all.
 
Summary
① Make sure the small table is the driving table.
② Make sure the tables involved and the join keys have been analyzed (have up-to-date statistics).
③ If the data on the join key is skewed, consider creating a histogram.
④ If possible, increase hash_area_size or pga_aggregate_target.
⑤ Hash join is suitable for joining a small table to a large table and for returning large result sets.
