Introduction
In SQL Server, the Inner Join and Outer Join operations between common tables are performed by the execution engine based on the selected columns and whether the data has an index, the selected data is selectively converted to one of the Loop Join, Merge Join, and Hash Join physical connections. Understanding these three physical connections is the basis for solving performance problems during table connections. The following describes the principles and applicable scenarios of these three connections.
Nested Loop Join)
Loop nested join is the most basic link. As shown in its name, loop nesting is required. nested loop is the only method that supports inequality connection among the three methods, the process of this connection method can be simply described as follows:
Figure 1. Step 1 of loop nested join
Figure 2. Step 2 of loop nested join
From the above two figures, it is not difficult to see that the number of times the nested loop table searches for internal loops is equal to the number of rows in the External Loop. When there are no more rows in the external loop, the nested loop ends. In addition, we can also see that this connection method requires an internal cycle of the table order, that is, there is an index), and the number of rows in the external cycle table is smaller than the number of rows in the internal cycle, otherwise, the query analyzer is more inclined to Hash Join (which will be discussed later in this article ).
Through nested loop connections, we can see that with the increase of data volume, the consumption of performance will increase exponentially. Therefore, when the data volume reaches a certain level, this method is often used by the query analyzer.
The following is an example of a nested loop connection that uses Microsoft's AdventureWorks database:
Figure 3. A simple nested loop connection
In Figure 3, ProductID is indexed, and the Product table in the External table of the loop) contains 870 rows that conform to ProductID = 4688. Therefore, the corresponding SalesOrderDetail table needs to be searched 4688 times. Let's consider another example in the preceding query, as shown in figure 4.
Figure 4. Additional bookmarked search for extra columns
As shown in figure 4, because an UnitPrice column is selected, the connected index cannot overwrite the query, and must be searched by bookmarks, this is also why we need to develop a good habit of selecting only required columns. To solve the problem above, we can both overwrite the index or reduce the required columns to avoid bookmarking. In addition, there are only five ProductID-compliant rows. Therefore, the query analyzer selects bookmarks to search for the rows. If we increase the number of rows that meet the conditions, the query analyzer tends to scan tables. Generally, if the number of rows in a table is more than 1%, the system performs table scan instead of bookmarked search, but this is not absolute.), 5.
Figure 5. Table scan selected by the query Analyzer
It can be seen that the query analyzer selects table scanning for connection at this time, which is much less efficient. Therefore, both good coverage index and Select * are important. In addition, even if Table scanning is involved, it is still ideal. Even worse, when multiple inequalities are used as the join, the query analyzer knows the statistical distribution of each column, but do not know the Joint Distribution of several conditions, resulting in an incorrect execution plan, as shown in figure 6.
Figure 6. Deviations caused by inability to predict the Joint Distribution
As shown in figure 6, we can see that the estimated number of rows is greatly different from the actual number of rows, so we should use the table scan, but the query analyzer selects the bookmark search, this will have a greater impact on performance than table scanning. To what extent? We can compare the default plan of the forced table scan and query analyzer, as shown in 7.
Figure 7. Forced table scan performance is better