From search sort to join

Last Update:2015-05-07 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

An article published in "programmer" April 2015 B , under the blog archive. According to the magazine request, in their own blog published the article also must note: This article for CSDN compilation, without permission not reproduced, if necessary reprint please contact market#csdn.net (#换成 @)

First, preface

The semantics of join is to combine the values of the two tables through the attributes, the higher the database paradigm, the more the tables are split, the more tables in the application to be joined. SQL statements that do not involve joins are almost impossible to find in our daily development, even if the keyword containing join is not displayed. When the database system executes, it chooses an explicit and optimal join method to execute. Through the understanding of the join principle can help to understand the database's join selection, and then a deeper understanding of the query plan, understand the same table structure data row number changes, index changes will affect the choice of database join method, so that more targeted performance optimization.

For example, the execution plan for Oracle and MSSQL is an internal connection involving four tables. Showing the table connection for each step in the query plan, you can see a total of three table join operations (interestingly, for similar datasets, two databases choose a different connection method).

Figure 1 Oracle Execution plan

The join principle is sufficiently detailed in the database vendor's documentation and in the classic database system concept, respectively, from the perspective of application and theory. This article attempts to generalize the principle of the three most common join methods by using the most basic lookup sorting algorithm (Search&sort) in the data structure from another perspective: Nested loops join (Nested loop join), sort Merge Join ( Sorted Merge join), hash connection (hash join). In order to understand the relationship between the three ways, the evolution and the principle of the decision of its application of subtle differences. It is indicated that these kinds of tables are the technology of database, but like other database technologies such as database index, it is also the extension of typical data structure.

Figure 2 MSSQL Execution plan

Ii. Examples of data

To make it clear, the two datasets in the example used in this article are relatively simple. For the following two tables.

T_fc

fc_id	Fc_name	Officialsite	Cityid
1	Inter	Www. Inter. It	2
2	Barcelona	http://www.fcbarcelona.com/	8
3	Manchester	Http://www.manutd.com	12
4	Liverpool	http://www.liverpoolfc.com/	11
5	Arsenal	http://www.arsenal.com/	36
6	Acm	http://www.acmilan.com/	2
7	Chelsea	http://www.chelseafc.com/	36
8	Manchester City	http://www.mcfc.co.uk/	12

Chart 3 test Data T_fc

T_city

fc_idCityid	Fc_name	Country	Desc
36	Landon	Britain
11	Liverpool	Britain
12	Manchester	Britain
2	Milan	Italy

Chart 4 test Data t_city

The first table is an entity table that records a few general descriptions of the football club. One of the first records is the team that the author has supported for 17 years: a big country in Milan city! The second table is an entity table of city information, and two tables are associated with the Cityid column, indicating that the club is located in that city. As can be seen from the sample data, different clubs can have the same Cityid, which is the legendary derby, the most famous, the greatest, the hottest, the author has been concerned about the Milan Derby, but also in recent years began to focus on the Manchester derby.

For simple periods, an example is the equivalent connection bar in the simplest inner connection .

SELECT T_fc. fc_id, T_fc. Fc_name, T_fc. Officialsite, T_fc. Cityid, T_city.city_name, T_city.countryfrom  T_fc, T_citywhere T_fc. Cityid = t_city. Cityid

The data is ready, then go into the text and see how to summarize the join from the search Sort of the data structure.

analysis of the principle of table connection from the look-up sorting algorithm

Table connection is mainly to investigate the relationship between the connection column, so as a pointcut, temporarily ignore the connection key outside the column, found to be connected to two tables into two arrays, discuss the problem suddenly looked refreshing and simple a lot. the so-called connection is simply to match the corresponding elements of the connection conditions in the two related columns to find out to pair . Our example is an equivalent condition, and the discussion of this matching condition into a lookup is more straightforward, which is to examine the elements of a set and find the elements of another set whose values are equal.

Chart 5 Association Columns

Did you see it? We're trying to swap the concept. Replace join joinwith Search search. Yes, this is our cut-in point, and it will be back around again.

Retrieval, which is a classic technique in data structures and algorithms. Later, we will find that the "invention" of the database system in the high-up connection, but also from the corresponding classical retrieval algorithm deduction.

1. Nested loops Join (Nested Loop Join )

Look at the two sets of numbers in front, do not waste brain cells, the most stupid way, poor lifting Bai, take one side of the data, a one to take out, see and next door which pair of eyes. Performance in the implementation of the algorithm is actually a double cycle, the loop body is a match to see whether to meet the connection conditions, for this example is the value is equal. That is, take a group of numbers, and then match them in another group to see if there are any equivalent elements.

Suppose we first consider the t_city data, that is, the outer loop is the data that traverses the t_city table. It is written as pseudo-code as follows.

For each Cityid in t_city does beginfor each tuple Cityid in T_fc do beginif T_city.cityid=t_fc. Cityid   output (T_fc. Cityid,t_city.cityid)

The process of this connection is described, the records are selected line by row from t_city , and the t_fc cityid column is found to match the connection condition T_city.cityid=t_fc. Cityid 's record. Find then add an arrow to make a connection (join), how image! See a total of 7 arrows added, corresponding to the 7 join results in the join result set.

Chart 6 Example of nested loops join

This is the stupidest search method for data sets with no rules in the search. Congratulations, this is the stupidest thing you've written. The retrieval pseudo-code is the main idea of nested loops joins (Nested Loop joins) in the database system. In terms of the perimeter loop, the data set t_city is called the external table (Outer table), also known as the driver table (Driving table), and the table T_fc The inner loop is called the internal table (Inner tables).

The implementation scans each row of the external table, for each row of data, scans the internal table for records that meet the join condition as the output result set. The final result set is the result set of each external scan and together. Because you want to compare each of the two sets of data by a double loop to determine if the join condition is met, it is easy to see that this is a relatively expensive method.

For nested Loop join,oracle, Mysql, MSSQL almost all relational databases are supported. Although the implementation is slightly different in detail. Ideas are the same.

According to this principle, it is not difficult to analyze the best external table that the driver table cannot be too large. If the two tables are larger, especially if the outer table is larger, the Inner table will be scanned many times, the complexity of the algorithm increases very quickly, and the total resource consumption increases quickly. At the same time, if the inner table has an index on the join condition, the efficiency of each inner loop is very high and the inner table dataset is not related, so you can navigate to the matching row without scanning the inner table.

2. sorting merge joins (sort merge Join)

Is there a more efficient way to scan than the entire list? Of course, follow the search idea. For a dataset that has no rules, it can only scan all, the time complexity of the retrieval is O (n), and if the recordset is ordered, the lookup becomes simple.

Combined with the above example, if the two sets of numbers are sorted, what will join be like? Immediately the mind floated the following scenario: inserting an ordered set into the corresponding position of another ordered set, so that the final set is also kept in order, that is, to merge two ordered sets into an ordered set. , merge the yellow records in the left t_city table with the green records in the right T_fc .

Figure 7 Merge Sort Example

Is this not the classic merge sort? Yes, it's a merger! Assuming that two sets are from small to large, the main idea of the merge sort is to set two pointers in the ordered set of inputs, starting positions of two sorted sequences, comparing two pointers to elements, selecting relatively small elements into the merged space of the output, and moving the pointer to the next position, Until a collection is empty, put the remaining elements of the other collection into the merged space of the output. Even if you do not understand the internal logic, just look at the loop, you can find and nested loop connection The biggest difference is that the loop is only one layer. Because there is no need to compare each of the two elements of the two collection. That time complexity must have lowered one level.

The above merge sort example diagram is made by small changes. The input is the same, two ordered sets. The merge process in the middle is similar, except that the elements of an ordered set are not inserted into another ordered set, but only in order to establish a link between the elements of the two ordered set that are equal.

Figure 8 Sorting Merge Connections example

Refer to the connection condition T_city.cityid=t_fc. Cityid, According to the t_city table on the left Cityid The above values, one by one on the right side of the T_fc to find, and nested loops join the example diagram, the result is added to meet the conditions of the 7 arrows, And also got 7 join results, but the arrow distribution is still very big difference. The arrows in the sort Merge join are not intersecting, while the arrows in the nested Loop join are much more intersecting. This simple feature actually reflects the difference in principle between the two methods. As an example of an arrow originating from a line citypid=2 in t_city , in the sort Merge join, because the right-hand rowset is sorted so that the first two rows have been scanned for two matching records, and a third row that does not match is hit, the scan ends While in nested Loop join, the rowset on the right is not sequential, and the right rowset must be fully scanned to find a connection condition.

When you observe the key value of a connection in a outer table, each time you compare the connection key value of inner table, you do not have to start with the first element, and you do not have to scan to the last element. Therefore, the time complexity will be lower than the nested Loop join.

According to this principle it is easy to find that using a sort merge connection can achieve high connection efficiency. But the so-called No free lunch, using a sort merge connection two sets to be connected are ordered on the connection column. We know that the cost of sorting large datasets is generally not small, so the optimizer should also refer to other criteria when choosing whether to use sort merge joins. If the input collection itself is ordered (such as a B-tree index on a Join Condition column), or if there is an order by requirement in SQL semantics, the optimizer might consider selecting a sort merge connection if it is required to be ordered on the join column. Generally for large collections if the sort merge connection cannot be used, then another connection method hash connection is considered.

3. Hash Connection (hash join)

or from the search algorithm thinking to consider. The search algorithm has a retrieval method that is in exchange for the time Complexity of O (1) by constructing a little extra space, that is, hash lookup. The analogy of this algorithm to the database is the famous and powerful hash connection. In hash lookup, a hash function is adopted to group the collection of found elements, which makes it possible to have a certain connection between the storage location and the element value. Then for the element to be found, as long as the calculation of its hash value to get the storage location, direct access (hash function is not good choice caused by excessive hash conflict, resulting in the calculation of the location of the positioning accuracy is not enough to discuss).

In the hash join, the same hash function is used to group the join condition columns on the two rowset separately, so as to compare the records within the same group on both sides . Instead of a nested loop join, compare each of the two elements in a two collection. In fact, it can be thought that nested loops are the only case where there is a single grouping in the hash join. The better the hash function, the more uniform the group, the less the hash conflict, the more effective the grouping, the better the performance.

In the example, use the simplest hash function hash (n) =n%10 to divide the join condition column of the two input rowset Cityid into 10 groups. By the definition of the hash function it is known that the join condition satisfies the following derivation:

So the arrows on the left and right side of the graph Partiton the corresponding record are horizontal . In the yellow section 12 and 2 fall into Partition2, then the record in the partition2 of the left yellow on the join is only used to establish a connection with the records in the right partition2.

Figure 10 Hash Join example

In the implementation of the database system, the hash join method usually constructs the hash table through an input set, scans another input set, and locates each element in the hash table by the method of hashing to locate the corresponding position of the hash, that is to find the record that satisfies the connection condition. As in the example, a hash table is established based on the Cityid in the t_city table. Using the same hash function for Cityid in T_fc , the hash value is computed, and the position of the element in the hash table can be found by locating the matching element in that location. This method and the above description of the two datasets are grouped with the same hash function, and then the group checks that the match is logically a meaning.

By analyzing the principle of hash join, we can analyze if the difference between the two datasets is large, if the small data set can be loaded into the memory construct hash Table, and then scan another large data set from disk, Calculate the hash value for the join column in the large data set to get the memory hash table index, and then establish the connection. This performance can be very high, when the real cost of physical IO is to make hash table when the small data set scanning and matching the large data set scan. Of course, it is easy to see a natural limitation of the hash join, that is, the connection condition must be the equivalent condition, because the hash join mechanism needs to calculate the hash value to get the hash table position, to establish the connection. However, hash join does not require the data set to be sorted in advance, nor does it require an index on it.

Iv. Summary

This paper tries to summarize the connection and difference of three kinds of classical sorting algorithms in the realization principle based on the most basic searching sort algorithm in data structure. Interestingly, the simplest way is not to look at the author's wordy description, but to observe the characteristics of the join arrows in the joins, the same two datasets are connected, all t_city on the left T_fc in the right, so the arrows are directed from left to right. However, there is a difference in the characteristics of the arrows: nested Loops (Nested loop join) with arrows that are either on or off in various directions, and that intersect each other; the arrows of the sortMerge Join(sort) are all in the opposite direction, But the arrows with hash joins (hashjoin) are most characteristic and all arrows are horizontal.

The above is only a logical data set on the comparison of pure analysis of the database system of three ways to connect the principle. The actual situation is much more complex, and there is no such method, which is more efficient than other methods. The query optimizer of the database chooses a connection method which combines the data statistics of the Operation object, queries the request of the statement, and chooses an appropriate connection method. Join hint is supported in database systems that support more than one join approach, such as Oracle and MSSQL, enabling the user to require the optimizer to override the Join method that the optimizer chooses by using the specified connection method. It is generally considered that the optimizer is smart enough not to use join Hint.

Finally, based on the principle analysis in the following table on the three kinds of connection in the use of the scene and other details of a more comprehensive comparison and induction:

	Nested loop joins (Nested Loop join)	Sorting merge joins (sort merge Join)	Hash connection (hash join)
Supported common databases	Oracle, Mssql,mysql	Oracle, MSSQL	Oracle, MSSQL
Applicable scenarios	Two smaller datasets with indexes on the Inner table join column	A larger data set, an index on the join field, or a statement that requires the result set to be sorted.	Large data set, no index on join field and no ordering required
Join field expression	Any expression	<, <=, >, >=; not including <>	Only for equivalent connections
Data Set sorting characteristics	does not require	Requirements	does not require
Memory Resource requirements	Less requirements	When you need to sort the datasets, the memory requirements are high.	High
Returns the first query result	The first query results can be returned faster when the index selectivity is high or when the limiting condition is relatively long.	You must get all the results before you can return data.	Because you want to make the resume hash table in memory, the result is slower for the first time.

Finish.

original article. In order to maintain the version of the article consistent, latest, traceable, please specify: reproduced from Idouba

This article link address: from search sort to join

From search sort to join

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

From search sort to join

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support