1. Nested Loop Join
Algorithm:
The idea is simple and straightforward: for each tuple r of relation R, compare it directly with each tuple s of relation S on the join field, and keep the pairs that satisfy the join condition. The pseudocode is as follows:
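(A minimal Python sketch standing in for the pseudocode; representing relations as lists of dicts and passing the join field name as a parameter are assumptions for illustration.)

    def nested_loop_join(R, S, field):
        """Join R and S on `field` by comparing every pair of tuples."""
        result = []
        for r in R:                       # outer loop: each tuple of R
            for s in S:                   # inner loop: each tuple of S
                if r[field] == s[field]:  # join condition on the join field
                    result.append((r, s))
        return result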
Cost:
Which table is placed in the inner versus the outer loop has a very important impact on disk I/O overhead. The CPU overhead is relatively small: the cost of comparing tuples already read into memory (in-memory) is O(n * m).
For I/O overhead, under the page-at-a-time assumption, I/O cost = M + M * N, where M and N are the number of pages in the outer and inner relations respectively: M page reads to scan the outer relation once, plus N page reads of the inner relation for each of the M outer pages.
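For example, with hypothetical page counts M = 1,000 for R and N = 100 for S: with R as the outer relation the cost is 1,000 + 1,000 * 100 = 101,000 page reads, while with S as the outer relation it is 100 + 100 * 1,000 = 100,100, which is why the smaller relation usually goes in the outer loop.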
2. Sort-Merge Join
Nested loop join is generally inefficient when both sets are large; sort-merge is much more efficient in that case. In particular, when there is a clustered index on the join field of both collections, sort-merge achieves its best performance.
Algorithm:
The basic idea is also very simple (recall merge sort from data structures). There are two main steps, sketched in code after this list:
A. Sort both sets on the join field.
B. Merge the two sorted sets, comparing tuples from the head of each (a special "partition" step is needed wherever the join field contains duplicate values).
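(A minimal Python sketch under the same assumed representation as above: relations as lists of dicts joined on a single field. The inner while loops implement the "partition" handling for duplicate join keys.)

    def sort_merge_join(R, S, field):
        """Sort both inputs on the join field, then merge them."""
        R = sorted(R, key=lambda t: t[field])   # step A: sort
        S = sorted(S, key=lambda t: t[field])
        result, i, j = [], 0, 0
        while i < len(R) and j < len(S):        # step B: merge
            if R[i][field] < S[j][field]:
                i += 1
            elif R[i][field] > S[j][field]:
                j += 1
            else:
                # "Partition": find the run of equal keys in each input
                # and emit the Cartesian product of the two runs.
                key = R[i][field]
                i_end = i
                while i_end < len(R) and R[i_end][field] == key:
                    i_end += 1
                j_end = j
                while j_end < len(S) and S[j_end][field] == key:
                    j_end += 1
                for r in R[i:i_end]:
                    for s in S[j:j_end]:
                        result.append((r, s))
                i, j = i_end, j_end
        return result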
Cost (mainly I/O overhead):
Two factors dominate the cost of sort-merge: whether the join fields are already sorted, and how many duplicate values the join field contains.
Best case (both inputs already sorted and at least one has no duplicates): O(n + m), a single scan over each of the two sets. (If an index on the join field can be used, obtaining the inputs in sorted order is cheaper still.)
Worst case (neither input is sorted and every value in both columns is identical): O(n * log n + m * log m + n * m), two sorts plus a Cartesian product over all tuples.
3. Hash Join
When both columns contain duplicate values, a hash join is essentially similar in spirit to sort-merge: both partition their inputs. They differ in how: a hash join partitions by hashing (each bucket is a partition), while sort-merge partitions by sorting (each run of duplicate values is a partition).
It is worth noting that the big difference between the hash join and the two algorithms above is also its big limitation: it can only be applied to equality joins (equi-joins). This is mainly due to the deterministic but unordered nature of the hash function and its buckets.
Algorithm:
The basic hash join algorithm runs in two phases, the build (construction) phase and the probe (probing) phase (a code sketch follows the list). As with nested loops, in the graphical execution plan the build input is placed above and the probe input below.
A. Build phase: using a hash function on the join field, build an in-memory hash table over the (smaller) build input; tuples whose keys hash to the same value are chained into that bucket's linked list.
B. Probe phase: probe the hash table with each tuple of the larger set R to complete the join.
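(A minimal Python sketch, again with relations as lists of dicts; the fixed NUM_BUCKETS is an assumption for illustration. Note the re-check of the join condition inside the bucket, since distinct key values can hash to the same bucket.)

    NUM_BUCKETS = 64   # assumed bucket count, for illustration only

    def hash_join(R, S, field):
        """Build a hash table over the smaller input S, then probe with R."""
        # Build phase: hash each tuple of S on the join field into a bucket.
        buckets = [[] for _ in range(NUM_BUCKETS)]
        for s in S:
            buckets[hash(s[field]) % NUM_BUCKETS].append(s)
        # Probe phase: for each tuple r of R, compare r with every tuple
        # in the bucket it hashes to (the most time-consuming part).
        result = []
        for r in R:
            for s in buckets[hash(r[field]) % NUM_BUCKETS]:
                if r[field] == s[field]:   # re-check: a bucket may mix keys
                    result.append((r, s))
        return result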
Cost:
Note that for each tuple r of the large set R, every tuple in the bucket that r hashes to must be compared with r; this is where the algorithm spends most of its time.
The CPU overhead is O(m + n * b), where m is the number of build-input tuples, n the number of probe-input tuples, and b the average number of tuples per bucket.
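For example, with assumed numbers: a build input of m = 10,000 tuples spread over 1,000 buckets gives b = 10 tuples per bucket on average, so probing with n = 1,000,000 tuples costs roughly 10,000 + 1,000,000 * 10 = 10,010,000 comparisons.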
Summary:
All three join methods take two inputs. The basic principles for optimizing them are:
1. Avoid hash joins over large data sets (hash joins suit low-concurrency scenarios; they consume a lot of memory and I/O);
2. Try to convert the join into an efficient merge join or nested loop join. Possible means include table structure design, index tuning, SQL optimization, and business-level design optimization.