Oracle table connection

Source: Internet
Author: User

Author: skate
Time: 2010-08-20

 
Table join
 
Table join refers to the association between tables in an SQL statement to retrieve relevant data from one or more tables. If an SQL statement has more than two associated tables,
So what is the order of connection? ORACLE first connects two tables to generate a result set, and then associates the generated result set with the next table.
Process until all tables are connected; finally, the required data is generated.
 
Since ORACLE 6, the optimizer uses four different table Connection Methods:
 
1 NESTED LOOP JOIN)
2. CLUSTER JOIN)
3. SORT and MERGE connections (sort merge join)
4 CARTESIAN JOIN)
 
In ORACLE 7.3
 
5. hash join ).
 
In ORACLE 8
 
6. INDEX JOIN ).
 
These six connection methods have their own technical characteristics. Under certain conditions, they can make full use of efficient performance. However, they also have their own limitations.
It can improve efficiency, but will seriously affect the system performance. Therefore, it is necessary to thoroughly explore the internal operation mechanism of the Connection Mode for performance optimization.

1 nested loop connection
 
Internal processing process of nested loop connections:
 
1) Oracle optimizer selects one of the two tables as the driving table based on the Rule RBO or cost-based CBO and specifies it as the External table.
2) The Oracle optimizer then specifies another table as an internal table.
3) Oracle reads the first row from the External table and compares it with the data in the internal table one by one. All matching records are placed in the result set.
4) Oracle reads the second row from the External table and compares it with the data in the internal table one by one. All matching records are added to the result set.
5) Repeat the preceding steps until all records in the External table are processed.
6) Finally, a result set meeting the requirements is generated.

Nested loop join is the fastest way to extract the first batch of records from the result set. The source table of the driving row (that is, the record being searched, the External table) is small, or the internal row
When the connected columns in the source table have a unique index or a non-unique index that is highly optional, nested loop join is ideal. Nested loop connections have advantages over other connection methods.
You can quickly extract the first batch of records from the result set without waiting for the entire result set to be fully determined. In this way, end users can view the data on the query screen.
The first batch of records, while reading other records at the same time. No matter how conditions or modes of connection are defined, any two row record sources can use nested loop connections, so nested loop connections
The connection is flexible.
 
However,
If the connected columns of the internal row source table (the second table to be read) do not contain indexes, or the indexes are not highly optional, the nested loop connection efficiency is very low.
Other connection methods may be more effective if the number of records in the driver table is very large.
 
You can add HINTS (use_nl) to an SQL statement to force the ORACLE optimizer to generate an execution plan for nested loop connections.

2. CLUSTER JOIN)
 
Cluster connection is actually a special case of nested loop connection. If the two connected source tables are in the cluster, that is, the two tables belong to the same SEGMENT (SEGMENT ),
Then ORACLE can use the cluster connection. The process is as follows: ORACLE reads the first row from the first row source table, and then uses the CLUSTER
Search for records that can be matched. Continue with the above steps to process the second row in the row source table until all records are processed.
The cluster connection efficiency is extremely high, because the two row source tables that participate in the connection are actually in the same physical block. However, there are limits on cluster connection.
Tables cannot be connected to clusters. Therefore, cluster connections are rarely used.

 
3. SORT and MERGE connections (sort merge join)
 
Internal process of sorting and merging connections:
1) The optimizer checks whether the source table of the first row has been sorted. If it has been sorted, it goes to Step 1; otherwise, it goes to step 2.
2) sorting of the first source table
3) The optimizer checks whether the source table of the second row has been sorted. If it has been sorted, it goes to Step 1; otherwise, it goes to step 2.
4) sort the second source table
5) merge two source tables that have already been sorted and generate the final result set.
 
When there is a lack of data selectivity or available indexes, or the source tables of both rows are too large (the selected data exceeds 5% of the number of table records), sort merge connections are more than nested
Loop connection is more efficient. Sort and merge connections require a relatively large temporary memory block for sorting, which will occupy more memory and disk I/O in the temporary tablespace.
You can add HINTS (use_merge) to an SQL statement to force the ORACLE optimizer to generate an execution plan for sorting and merging connections.

4 CARTESIAN JOIN)
 
Cartesian join is a condition in which the SQL statement does not write table connections. The optimizer connects each record of the first table to all records of the second table. If
The number of records in a table is m, and the number of records in the second table is m, m * n records are generated.
Due to Cartesian connections, SQL statements with poor performance are rarely used.

5. Hash Join)
 
Hash Join Overview
A basic idea of the Hash join algorithm is to create
The hash table exists in the hash area Memory and uses a large row sources (called probe input) to detect the hash table created earlier. If
If the hash area memory is not large enough, the hash table cannot be fully stored in the hash area memory. In this case, Oracle uses a hash function
Build input and probe input are divided into multiple unconnected partitions (which are recorded as Si AND Bi respectively). This stage is called the partition stage, and then corresponding partitions are separated,
That is, Si and Bi PERFORM Hash join again. This stage is called the join stage.
 
If the hash table created for a partition is still too large after the partition, oracle uses nested-loops hash join. the so-called nested-loops hash join is to create a hash table for some Si, then read all Bi and the created hash table, and then create a hash table for the remaining Si.
Connect all Bi with the created hash table until all Si connections are complete.
 
The Hash Join algorithm has a limitation that it assumes that the two tables are even on the connection key, that is, each partition has almost the same data. However, the actual data is not
Even, in order to solve this problem well, oracle has introduced several technologies, such as bitmap vector filtering, role swapping, and column charts. The specific meanings of these terms will be detailed later.

Hash Join Principle
Let's use an example to explain the principles of the Hash Join algorithm and the terms mentioned above.
Consider the following two datasets.

S =}
B =}

Based on statistics, the oracle optimizer determines whether a small table (build input) can be fully stored in the Hash area memory. If it can be fully stored in the memory, create a hash table in the memory, which is the simplest hash join.

If not all are stored in the memory, build input must be partitioned. The number of partitions is fan-out. Fan-out is determined by hash_area_size and cluster size. Cluster size is equal to db_block_size * hash_multiblock_io_count, and hash_multiblock_io_count is an implicit parameter in oracle9i. Note that fan-out is not the size of build input/hash_ara_size. That is to say, the partition size determined by oracle may not be completely stored in the hash area memory. Large fan-out leads to many small partitions, which affects performance. Small fan-out leads to a few large partitions, so that each partition cannot be fully stored in the memory, this also affects the performance of hash join.
 
Oracle uses an internal hash function to act on the connection key, and splits S and B into multiple partitions. Here we assume that this hash function is the remainder function, I .e. Mod (join_column_value, 10 ). In this way, ten partitions are generated, as shown in the following table.
 

Partition   B0 B1 B2 B3 B4 B5 B6 B7 B8 B9
Value 0, 0, 10, 10 1, 1, 1, 11 2, 2, 2, 2 3 NULL NULL NULL NULL 8 9, 9
S0 10                  
S1 1, 1                  
S2 Null                    
S3 3, 3                  
S4 4,4, 4,4                    
S5 5                    
S6 NULL                    
S7 NULL                    
S8 8, 8, 8                  
S9 NULL                    


After such a partition, you only need to join the corresponding partitions (that is, partition pairs). If a partition is NULL, the corresponding partition join can be ignored.
 
When reading the S table into a memory partition, oracle records the unique value of the connection key to construct a so-called bitmap vector, which accounts for about 5% of the hash area memory. Here {1, 3, 4, 5, 8, 10 }.

When table B is partitioned, the value on each connection key is compared with the bitmap vector. If not, the record is discarded. In our example, the following data in Table B will be discarded: {0, 9 }. This process is bitmap vector filtering.

After S1 and B1 are connected, connect Si AND Bi. Here, oracle will compare two partitions and select a small one for build input, which means dynamic role swaps, this dynamic role swap occurs in partitions except the first pair of partitions.
 
Hash Join algorithm
Step 1: Determine whether all small tables can be stored in the hash area memory. If yes, perform the memory hash join operation. If not, go to step 2.
Step 1: determine the number of fan-out instances.
(Number of Partitions) * C <= Favm * M
Where C is Cluster size,
Its value is DB_BLOCK_SIZE * HASH_MULTIBLOCK_IO_COUNT; Favm is the percentage of memory available in hash area, usually about 0.8; M is the size of Hash_area_size.
 
Step 2: read some small table S and use the internal hash function (hash_fun_1) to map the connection key value to a partition, at the same time, the hash_fun_2 function is used to generate another hash value for the connection key value. This hash value is used to create a hash table and is stored together with the connection key value.

Step 2: Create a bitmap vector for build input.
Step 2: If there is no space in the memory, write the partition to the disk.
Step 2: Read the remaining part of small table S and repeat Step 3 until all small table S is read.
Step 1: sort the partitions by size and select several partitions to create a hash table (the principle of selecting partitions here is to make the selected data the most ).
Step 2: Create a hash table based on the hash value calculated using the hash_fun_2 function.
Step 1: read Table B and use bitmap vectors to filter bitmap vectors.
Step 1: Use the hash_fun_1 function to map the data to the corresponding partition and calculate the hash value of hash_fun_2.
Step 1: If the partition is in the memory, connect the hash value calculated through the hash_fun_2 function with the existing hash table in the memory, write the result to the disk. If the partition is not in the memory, the corresponding value is put together with the corresponding partition of table S.

Step 2: continue reading table B and repeat Step 2 until table B has finished reading.
Step 2: Read the corresponding (Si, Bi) for hash connection. Dynamic Role SWAps will occur here.
Step 2: If the smallest partition is larger than the memory after the partition, the nested-loop hash join occurs.

Cost of Hash Join
1. In-Memory Hash Join

Cost (HJ) = Read (S) + build hash table in memory (CPU) + Read (B) + Perform In memory Join (CPU)
When the cpu time is ignored
Cost (HJ) = Read (S) + Read (B)
 
2. On-Disk Hash Join
According to the above steps, we can see that
Cost (HJ) = Cost (HJ1) + Cost (HJ2)
 
Among them, the Cost of Cost (HJ1) is to scan tables S and B, and write the parts that cannot be stored in the memory back to the disk, corresponding to the first steps 2nd to 12th.
Cost (HJ2) is the Cost of implementing the nested-loop hash join, corresponding to the previous steps 13th to 14th.
 
Cost (HJ1) is approximately equal to Read (S) + Read (B) + Write (S-M) + (B-B * M/S )).
During the nested-loop hash join operation, the entire probe input must be read for each chunk build input. Therefore
Cost (HJ2) is approximately equal to Read (S-M) + n * (B-B * M/S ))
 
Where n is the number of cycles required for nested-loop hash join. N = (S/F)/M
 
Generally, if n is greater than 10, the performance of hash join will be greatly reduced. From the formula of n, we can see that n is in inverse proportion to Fan-out,
Increase the fan-out value and reduce the value by n. When hash_area_size is fixed, you can reduce the cluster size to increase the fan-out.
 
From this we can see that increasing the value of the hash_multiblock_io_count parameter does not necessarily improve the performance of hash join.

When the two connected tables use equivalent join and the data volume of the tables is large, the Optimizer may use Hash join. Hash connections are based on CBO. Only in the database
When the initialization parameter HASH_JOIN_ENABLED is set to True and a large enough value is set for PGA_AGGREGATE_TARGET, Oracle will use
Greek EDGE connection. HASH_AREA_SIZE is a backward compatible parameter, but HASH_AREA_SIZE should be used in versions earlier than Oracle9i. When the ORDERED prompt is used,
The first table in the FROM clause is used to create a hash table.
 
When a useful index is missing, hash connections are more effective than nested loop connections. Hash connections may also be faster than nested loop connections because hash tables in memory are processed.
It is faster than B _ tree indexes. Hash Join is suitable for joining small tables with large tables and returning large result sets.
 
An in-depth understanding and understanding of oracle table connections is critical to optimizing database performance. Different optimizer selection methods and missing or unified statistical information
The table connection method automatically selected by ORACLE is not necessarily the best because the calculation information is not accurate. When the SQL statement execution efficiency is low, you can use auto trace to execute
Track and analyze data. When a multi-Table connection occurs, you need to carefully analyze whether there are better connection conditions. According to the system features, you can add
HINTS to change the SQL Execution Plan and optimize the performance.
 
6. Index connection
If a group of existing indexes contains all the information required for the query, the optimizer will generate a group of hash tables in the index. Possible range or speed
Quick global scan accesses each index, and the method of scanning selected depends on the conditions available in the WHERE clause. A table contains a large number of columns, but you only want to access
This method is very effective. The more constraints the WHERE clause has, the faster the execution speed. Because when the optimizer evaluates the Optimization Path for query execution
The bundle condition is treated as an option. You must create an index on the appropriate columns (those that meet the entire query) to ensure that the optimizer makes the index connection optional.
. This task usually involves adding an index to a column that does not have an index or that has not previously created a joint index. Connected to the index
The advantage is that quick global scan only supports one single index for the entire query, and index connections can have multiple indexes for the entire query.
 
 
 
 
------ End -----

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.