I/O Costs of SQL Server Join Algorithms

1. Nested Loop Join

Algorithm:

The idea is simple and straightforward: for each tuple r in relation R, compare r with every tuple in relation S on the JOIN condition field and keep the pairs that qualify. The pseudocode is sketched below.
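
Since the source omits the pseudocode itself, the following minimal Python sketch stands in for it; the relation format (lists of tuples) and the join_condition callback are illustrative assumptions, not the article's own code:

    # Minimal sketch of a tuple-at-a-time nested loop join (illustrative).
    def nested_loop_join(R, S, join_condition):
        result = []
        for r in R:                       # outer loop: scan R once
            for s in S:                   # inner loop: rescan S for every r
                if join_condition(r, s):  # compare on the JOIN field
                    result.append((r, s))
        return result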

Cost:

The order of the inner and outer tables in the join has a significant impact on disk I/O overhead. The CPU overhead, by contrast, is relatively low: once the tuples have been read into memory, the in-memory comparison cost is O(n * m).

For I/O overhead, under the page-at-a-time assumption, I/O cost = M + M * N.

In words: the total I/O overhead is the cost of reading the M pages of R once, plus the cost of reading the N pages of S once for each of those M pages.
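
A quick worked example (with illustrative numbers): if R occupies M = 100 pages and S occupies N = 1,000 pages, then with R as the outer table the cost is 100 + 100 * 1,000 = 100,100 page reads, while with S as the outer table it is 1,000 + 1,000 * 100 = 101,000. Making the smaller relation the outer table is the cheaper order, and the gap widens considerably once several pages of the outer relation can be buffered per pass over the inner relation.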

2. Sort-Merge Join

Nested Loop is generally inefficient when both sets are large. Sort-Merge is much more efficient in that case, and it performs best when both JOIN fields have clustered indexes.

Algorithm:

The basic idea is also very simple (recall merge sort from data structures). There are two main steps:

A. Sort both inputs on the JOIN fields.

B. Merge the two sorted sets, comparing the data columns coming from each side (special "partition" handling is needed when the JOIN field contains duplicate values); see the sketch after this list.
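
A minimal Python sketch of both steps, assuming a single equality key exposed through an illustrative key function:

    # Minimal sketch of a sort-merge join on one equality key (illustrative;
    # key extracts the JOIN field from a tuple).
    def sort_merge_join(R, S, key):
        R, S = sorted(R, key=key), sorted(S, key=key)  # step A: sort both inputs
        result, i, j = [], 0, 0
        while i < len(R) and j < len(S):  # step B: merge the sorted streams
            if key(R[i]) < key(S[j]):
                i += 1
            elif key(R[i]) > key(S[j]):
                j += 1
            else:
                k = key(R[i])
                # "Partition" handling: collect the run of duplicates on each side
                i_end, j_end = i, j
                while i_end < len(R) and key(R[i_end]) == k:
                    i_end += 1
                while j_end < len(S) and key(S[j_end]) == k:
                    j_end += 1
                # Emit the cross product of the two matching partitions
                for r in R[i:i_end]:
                    for s in S[j:j_end]:
                        result.append((r, s))
                i, j = i_end, j_end
        return result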

Cost (I/O overhead):

Two factors determine Sort-Merge overhead: whether the JOIN fields are already sorted, and the number of duplicate values in the JOIN fields.

◆ Best case (both columns are already sorted and at least one column has no duplicate values): O(n + m), since each of the two sets is scanned exactly once. (Better still if both sides can be read through indexes.)

◆ Worst case (neither column is sorted and all values in the two columns are identical): O(n log n + m log m + n * m), i.e. two sorts plus a Cartesian product between all tuples.
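
For a sense of scale (illustrative numbers): with n = m = 10^6 tuples, the best case is on the order of 2 * 10^6 comparisons, while the worst case is dominated by the n * m term, on the order of 10^12.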

3. Hash Join

Hash Join is essentially similar in spirit to how Sort-Merge handles the case where both columns contain duplicate values (partitioning). The difference lies in how the partitions are formed: Hash Join partitions with a hash function (each bucket is a partition), while Sort-Merge partitions by sorting (each run of duplicate values is a partition).

It is worth noting that the big difference between Hash Join and the two algorithms above, which is also its big limitation, is that it applies only to equality joins (for example, R.a = S.b works, but R.a < S.b does not). This is mainly because a hash function and its buckets are deterministic about equality but preserve no ordering.

Algorithm:

The basic Hash Join algorithm consists of the following two steps:

As with the nested loop join, in the execution plan the build input is located above and the probe input below.

The hash join operation is completed in two phases: build and probe.

A. Build input phase: based on the JOIN field, use a hash function h2 to build an in-memory hash table over the smaller set S (each bucket is a linked list of the tuples whose keys hash to the same value).

B. Probe input phase: for each tuple of the larger set R, probe the hash table to complete the join. A sketch of both phases follows.
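
A minimal Python sketch of the two phases, using Python's built-in dict hashing as a stand-in for the article's hash function h2 (function and key names are illustrative):

    from collections import defaultdict

    # Minimal sketch of an in-memory hash join on one equality key (illustrative).
    def hash_join(R, S, key):
        # Build phase: hash the smaller set S into buckets on the JOIN field;
        # tuples whose keys collide share a bucket (a list here, playing the
        # role of the linked list described in the article).
        buckets = defaultdict(list)
        for s in S:
            buckets[key(s)].append(s)
        # Probe phase: for each tuple of the larger set R, compare it with
        # every tuple in the bucket it hashes to. Only equality joins work,
        # since hashing preserves equality but not ordering.
        result = []
        for r in R:
            for s in buckets.get(key(r), []):
                result.append((r, s))
        return result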

Cost:

It is worth noting that for each tuple r in the large set R, r must be compared with every tuple in the bucket that r hashes to; this is the most time-consuming part of the algorithm.

The CPU overhead is O(m + n * B), where B is the average number of tuples per bucket. With a well-distributed hash function and enough buckets, B approaches 1 and the cost approaches O(m + n).

Summary:

All three join methods take two inputs. The basic principles of optimization are as follows:

1. Avoid hash joins for big data (hash join suits low-concurrency workloads, and it consumes a lot of memory and I/O);

2. Convert to the more efficient merge join or nested loop join whenever possible. The available means include table structure design, index design and adjustment, SQL optimization, and business design optimization.
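
As a closing illustration, the three sketches above can be run on the same toy data to check that they produce identical matches (the data and key functions are hypothetical):

    # Toy data: (id, value) tuples joined on the first field.
    R = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    S = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
    key = lambda t: t[0]

    nl = nested_loop_join(R, S, lambda r, s: key(r) == key(s))
    sm = sort_merge_join(R, S, key)
    hj = hash_join(R, S, key)
    assert sorted(nl) == sorted(sm) == sorted(hj)  # all three agree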
