Oracle and postgresql join method

Source: Internet
Author: User

Oracle and postgresql join method

First, we will introduce several oracle multi-Table Connection Methods:
Nested loop: nested loop, which is actually a dual FOR LOOP
1. The driver table (External table) and search table (internal table) are divided here)
2. To search for a table with an index, you can use an index to find and match the table, improving the efficiency,
3. Use a small table as the driving table
4. The two tables have a small amount of data.

Sometimes the driver table and the search table selected by the oracle optimizer are not suitable. Therefore, we need to specify the search table and the driver table here. the hint and USE_NL prompts are added, but they are not always valid, you can use ordered use_nl (tab1 tab2) to force tab1 as the driving table. Note that all the environments mentioned here are CBO.
Therefore, the nested loop is generally used for connecting tables with indexes, and the indexing is more selective. The internal principle here is to select a drive table (outer) and a search table (inner). Each row of the drive table matches the record of the search table. the less records the driver has, the faster the returned result set.
Cost = outer access cost + (inner access cost * outer cardinality)

According to the formula, cost = 3 + (1*14) = 17
Hash join: a hash join is used to connect a large dataset or a large table to a small table. The optimizer creates a HASH table in the memory using the small table connection field, and then scans the large table, calculate the hash value of the connection field of a large table, and determine whether data can be found in the hash list. Data is returned successfully; otherwise, data is discarded. If the table is very large and cannot be fully put into the memory, the optimizer will divide it into Several partitions and cannot put it into the memory, then it will be written into the temporary segment of the disk, therefore, a large temporary segment is required to improve IO performance.
1. Create a hash table in Memory Based on the table size, and then test the hash table.
2. Use a small table as a hash list and create a hash Based on the connection key.
3. Work under CBO
4. It is recommended that the hash table be fully stored in the memory declared by HASH_AREA_SIZE.


This is not ideal in the actual environment. First, filter unnecessary data in the small data set to generate a hash table, that is, build hash, and put it in hash_area, then scan the big data set to perform a hash operation on the rows and then test the generated hash table in the hash area. Here, there is a bucket, where there is a hash list, and there is not much explanation. Scan the list, data is returned after successful matching; otherwise, data is discarded.


It's onepass,
When the hash area cannot accommodate all partitions but is sufficient to accommodate at least one partition, this is onepass hash join. Partition the hash table and write it to the temporary space on the disk. After partitioning the hash table, use the same hash function to partition the probe table, then the corresponding partitions are joined separately. Note that, except for the first partition, Oracle automatically exchanges build table and probe table based on the partition size starting from the second partition to achieve better performance. Because all partitions can be read only once from the disk, it is called onepass.
1. Scan the second table and PERFORM hash operations on the join key to determine the corresponding partition and bucket.
2. Check bitmap to check whether the bucket has data. If no data exists, discard it directly.
3. If there is data and the partition is in the memory, it enters the corresponding bucket for exact match. If it can match, this row of data will be returned; otherwise, it will be discarded.

4. If partition is on the disk, store the row of data in the disk for temporary storage. The storage format is partition and bucket.
5. When the second table is scanned, the remaining parts are a lot of partitions generated by the driver table and the test table, which are retained on the disk.
6. since the data on both sides adopts the same hash algorithm for partition and bucket, we only need to compare the partition data on both sides in pairs, and oracle also performs optimization during the comparison, if there is no strict relationship between the driver and the driver, he selects a smaller one in the partition pair as the driver until all the partition pairs on the disk are joined.

Multipass
The worst hash join. At this time, the hash area cannot even accommodate a partition. After scanning the driver table, only half of the partitions may be left in the hash area, the other half and other partition are all on the disk. The remaining steps are similar to the onepass price. The difference is the processing of partition.

Since the driver table only has half a partition in the memory, if the partition data corresponding to the test table is not matched, this row cannot be directly discarded and needs to be retained to the disk, join with the remaining half partition of the driver table. The example here is that the memory can have the lower half of the partition. If less is installed, the number of repeated joins will increase, when multipass occurs, the number of physical reads of partition increases significantly.
Cost = (outer access cost * # of hash partitions) + inner access cost
SORT MERGE JOIN
The two table rows are sorted in rows. There is no driver table or search table here. Because the two tables are sorted, you can directly determine the result through the connection condition.
Hash join focuses on CPU consumption, while sort merge join focuses on IO consumption. It can be forcibly used through USE_MERGE (TAB1, TAB2 ).

Trigger rule:
RBO mode, non-equivalent connection (>,> +, <,< =)
Hash_join_enabled = false
Data sources sorted
The algorithm of cost is the same as that of hash join.
Postgresql connection method:
Like oracle, it provides three connection modes: nested loop, hash_join, and merge join.
Nested loop:
Example:
SELECT oid
FROM pg_proc
Order by 1
LIMIT 8;
Create temporary table sample1 (id, junk)
SELECT oid, repeat ('x', 250)
FROM pg_proc
Order by random (); -- add rows in random order
Create temporary table sample2 (id, junk)
SELECT oid, repeat ('x', 250)
FROM pg_class
Order by random (); -- add rows in random order
The created tables have no indexes or statistics.
Explain select sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id)
WHERE sample1.id = 33;
QUERY PLAN
---------------------------------------------------------------------
Nested Loop (cost = 0. 00 .. 253.42 rows = 378 width = 32)
-> Seq Scan on sample1 (cost = 0. 00 .. 220.76 rows = 54 width = 4)
Filter: (id = 33: oid)
-> Materialize (cost = 0. 00 .. 27.95 rows = 7 width = 36)
-> Seq Scan on sample2 (cost = 0. 00 .. 27.91 rows = 7 width = 36)
Filter: (id = 33: oid)
(6 rows)
Source code for sequential scanning of nested connections:
For (I = 0; I <length (outer); I ++)
For (j = 0; j <length (inner); j ++)
If (outer [I] = inner [j])
Output (outer [I], inner [j]);
The principle is the same as that of oracle.
Hash join:
Explain select sample1.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id)
WHERE sample2.id> 33;
QUERY PLAN
----------------------------------------------------------------------
Hash Join (cost = 33. 55 .. 1097.55 rows = 24131 width = 32)
Hash Cond: (sample1.id = sample2.id)
-> Seq Scan on sample1 (cost = 0. 00 .. 194.01 rows = 10701 width = 36)
-> Hash (cost = 27. 91 .. 27.91 rows = 451 width = 4)
-> Seq Scan on sample2 (cost = 0. 00 .. 27.91 rows = 451 width = 4)
Filter: (id> 33: oid)
Hash join source code:
For (j = 0; j <length (inner); j ++)
Hash_key = hash (inner [j]);
Append (hash_store [hash_key], inner [j]);
For (I = 0; I <length (outer); I ++)
Hash_key = hash (outer [I]);
For (j = 0; j <length (hash_store [hash_key]); j ++)
If (outer [I] = hash_store [hash_key] [j])
Output (outer [I], inner [j]);
Merge join:
Explain select sample1.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id );
QUERY PLAN
--------------------------------------------------------------------------
Merge Join (cost = 1005. 10 .. 2097.74 rows = 72392 width = 32)
Merge Cond: (sample2.id = sample1.id)
-> Sort (cost = 94. 90 .. 98.28 rows = 1353 width = 4)
Sort Key: sample2.id
-> Seq Scan on sample2 (cost = 0. 00 .. 24.53 rows = 1353 width = 4)
-> Sort (cost = 910. 20 .. 936.95 rows = 10701 width = 36)
Sort Key: sample1.id
-> Seq Scan on sample1 (cost = 0. 00 .. 194.01 rows = 10701 width = 36)
The location changes of the two tables are not affected.
Merge join source code:
Sort (outer );
Sort (inner );
I = 0;
J = 0;
Save_j = 0;
While (I <length (outer ))
If (outer [I] = inner [j])
Output (outer [I], inner [j]);
If (outer [I] <= inner [j] & j <length (inner ))
J ++;
If (outer [I] <inner [j])
Save_j = j;
Else
I ++;
J = save_j;
ANALYZE sample1;
ANALYZE sample2;
With the statistical information:
Postgres = # explain select sample2.junk
Postgres-# FROM sample1 JOIN sample2 ON (sample1.id = sample2.id );
QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost = 17. 39 .. 139.45 rows = 284 width = 254)
Hash Cond: (sample1.id = sample2.id)
-> Seq Scan on sample1 (cost = 0. 00 .. 110.43 rows = 2343 width = 4)
-> Hash (cost = 13. 84 .. 13.84 rows = 284 width = 258)
-> Seq Scan on sample2 (cost = 0. 00 .. 13.84 rows = 284 width = 258)
Create index I _sample1 on sample1 (id );
Create index I _sample2 on sample2 (id );
After related indexes are created:
Postgres = # explain select sample2.junk
Postgres-# FROM sample1 JOIN sample2 ON (sample1.id = sample2.id)
S-# WHERE sample1.id = 33;
QUERY PLAN
---------------------------------------------------------------------------------
Nested Loop (cost = 0. 00 .. 16.55 rows = 1 width = 254)
-> Index Scan using I _sample1 on sample1 (cost = 0. 00 .. 8.27 rows = 1 width = 4)
Index Cond: (id = 33: oid)
-> Index Scan using I _sample2 on sample2 (cost = 0. 00 .. 8.27 rows = 1 width = 258)
Index Cond: (id = 33: oid)
It is no longer Inner Sequential Scan, but Inner Index Scan

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.