I/O Costs of SQL Server Join Algorithms

1. Nested Loop Join

Algorithm:

The idea is simple and straightforward: for each tuple r in relation R, compare r with every tuple in relation S on the JOIN condition field and keep the pairs that qualify. The pseudocode is sketched below.
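
Since the source omits the pseudocode itself, the following minimal Python sketch stands in for it; the relation format (lists of tuples) and the join_condition callback are illustrative assumptions, not the article's own code:

    # Minimal sketch of a tuple-at-a-time nested loop join (illustrative).
    def nested_loop_join(R, S, join_condition):
        result = []
        for r in R:                       # outer loop: scan R once
            for s in S:                   # inner loop: rescan S for every r
                if join_condition(r, s):  # compare on the JOIN field
                    result.append((r, s))
        return result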

Cost:

The order of the inner and outer tables in the join has a significant impact on disk I/O overhead. The CPU overhead, by contrast, is relatively low: once the tuples have been read into memory, the in-memory comparison cost is O(n * m).

For I/O overhead, under the page-at-a-time assumption, I/O cost = M + M * N.

In words: the total I/O overhead is the cost of reading the M pages of R once, plus the cost of reading the N pages of S once for each of those M pages.
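
A quick worked example (with illustrative numbers): if R occupies M = 100 pages and S occupies N = 1,000 pages, then with R as the outer table the cost is 100 + 100 * 1,000 = 100,100 page reads, while with S as the outer table it is 1,000 + 1,000 * 100 = 101,000. Making the smaller relation the outer table is the cheaper order, and the gap widens considerably once several pages of the outer relation can be buffered per pass over the inner relation.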

2. Sort-Merge Join

Nested Loop is generally inefficient when both sets are large. Sort-Merge is much more efficient in that case, and it performs best when both JOIN fields have clustered indexes.

Algorithm:

The basic idea is also very simple (recall merge sort from data structures). There are two main steps:

A. Sort both inputs on the JOIN fields.

B. Merge the two sorted sets, comparing the data columns coming from each side (special "partition" handling is needed when the JOIN field contains duplicate values); see the sketch after this list.
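
A minimal Python sketch of both steps, assuming a single equality key exposed through an illustrative key function:

    # Minimal sketch of a sort-merge join on one equality key (illustrative;
    # key extracts the JOIN field from a tuple).
    def sort_merge_join(R, S, key):
        R, S = sorted(R, key=key), sorted(S, key=key)  # step A: sort both inputs
        result, i, j = [], 0, 0
        while i < len(R) and j < len(S):  # step B: merge the sorted streams
            if key(R[i]) < key(S[j]):
                i += 1
            elif key(R[i]) > key(S[j]):
                j += 1
            else:
                k = key(R[i])
                # "Partition" handling: collect the run of duplicates on each side
                i_end, j_end = i, j
                while i_end < len(R) and key(R[i_end]) == k:
                    i_end += 1
                while j_end < len(S) and key(S[j_end]) == k:
                    j_end += 1
                # Emit the cross product of the two matching partitions
                for r in R[i:i_end]:
                    for s in S[j:j_end]:
                        result.append((r, s))
                i, j = i_end, j_end
        return result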

Cost (I/O overhead):

Two factors determine Sort-Merge overhead: whether the JOIN fields are already sorted, and the number of duplicate values in the JOIN fields.

◆ Best case (both columns are already sorted and at least one column has no duplicate values): O(n + m), since each of the two sets is scanned exactly once. (Better still if both sides can be read through indexes.)

◆ Worst case (neither column is sorted and all values in the two columns are identical): O(n log n + m log m + n * m), i.e. two sorts plus a Cartesian product between all tuples.
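
For a sense of scale (illustrative numbers): with n = m = 10^6 tuples, the best case is on the order of 2 * 10^6 comparisons, while the worst case is dominated by the n * m term, on the order of 10^12.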

3. Hash Join

Hash Join is essentially similar in spirit to how Sort-Merge handles the case where both columns contain duplicate values (partitioning). The difference lies in how the partitions are formed: Hash Join partitions with a hash function (each bucket is a partition), while Sort-Merge partitions by sorting (each run of duplicate values is a partition).

It is worth noting that the big difference between Hash Join and the two algorithms above, which is also its big limitation, is that it applies only to equality joins (for example, R.a = S.b works, but R.a < S.b does not). This is mainly because a hash function and its buckets are deterministic about equality but preserve no ordering.

Algorithm:

The basic Hash Join algorithm consists of the following two steps:

As with the nested loop join, in the execution plan the build input is located above and the probe input below.

The hash join operation is completed in two phases: build and probe.

A. Build input phase: based on the JOIN field, use a hash function h2 to build an in-memory hash table over the smaller set S (each bucket is a linked list of the tuples whose keys hash to the same value).

B. Probe input phase: for each tuple of the larger set R, probe the hash table to complete the join. A sketch of both phases follows.
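
A minimal Python sketch of the two phases, using Python's built-in dict hashing as a stand-in for the article's hash function h2 (function and key names are illustrative):

    from collections import defaultdict

    # Minimal sketch of an in-memory hash join on one equality key (illustrative).
    def hash_join(R, S, key):
        # Build phase: hash the smaller set S into buckets on the JOIN field;
        # tuples whose keys collide share a bucket (a list here, playing the
        # role of the linked list described in the article).
        buckets = defaultdict(list)
        for s in S:
            buckets[key(s)].append(s)
        # Probe phase: for each tuple of the larger set R, compare it with
        # every tuple in the bucket it hashes to. Only equality joins work,
        # since hashing preserves equality but not ordering.
        result = []
        for r in R:
            for s in buckets.get(key(r), []):
                result.append((r, s))
        return result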

Cost:

It is worth noting that for each tuple r in the large set R, r must be compared with every tuple in the bucket that r hashes to; this is the most time-consuming part of the algorithm.

The CPU overhead is O(m + n * B), where B is the average number of tuples per bucket. With a well-distributed hash function and enough buckets, B approaches 1 and the cost approaches O(m + n).

Summary:

All three join methods take two inputs. The basic principles of optimization are as follows:

1. Avoid hash joins for big data (hash join suits low-concurrency workloads, and it consumes a lot of memory and I/O);

2. Convert to the more efficient merge join or nested loop join whenever possible. The available means include table structure design, index design and adjustment, SQL optimization, and business design optimization.
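
As a closing illustration, the three sketches above can be run on the same toy data to check that they produce identical matches (the data and key functions are hypothetical):

    # Toy data: (id, value) tuples joined on the first field.
    R = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    S = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
    key = lambda t: t[0]

    nl = nested_loop_join(R, S, lambda r, s: key(r) == key(s))
    sm = sort_merge_join(R, S, key)
    hj = hash_join(R, S, key)
    assert sorted(nl) == sorted(sm) == sorted(hj)  # all three agree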
