2. Table Join
So far, we have been dealing with a single table. Now let's look at joins. The following query retrieves employee information and concatenates FirstName and LastName so that the returned information is friendlier.
SELECT e.[Title], a.[City], c.[LastName] + ', ' + c.[FirstName] AS EmployeeName
FROM [HumanResources].[Employee] e
JOIN [HumanResources].[EmployeeAddress] ed ON e.[EmployeeID] = ed.[EmployeeID]
JOIN [Person].[Address] a ON [ed].[AddressID] = [a].[AddressID]
JOIN [Person].[Contact] c ON e.[ContactID] = c.[ContactID];
Execution Plan
Figure 2-12
There are multiple processing steps in this query, and the overhead of each step differs. Costs accumulate from right to left, step by step, through the execution tree. The three operations with the largest overhead are:
1. Index scan on the Person.Address table, 45%
2. Hash match join between the HumanResources.EmployeeAddress table and the Person.Address table, 28%
3. Clustered index seek on the Person.Contact table, 17%
In the upper-right corner is the index scan of the HumanResources.EmployeeAddress table. Below it is the index scan of the Person.Address table, the operator with the largest overhead. Looking at its tooltip (Figure 2-13), we can see that this is a scan of the index IX_Address_AddressLine1_AddressLine2_City_StateProvinceID_PostalCode. The storage engine traverses 19,614 rows of data to find what we need.
Figure 2-13
As shown in the Output List, the query optimizer needs the AddressID and City columns. The optimizer weighed the selectivity of the table's indexes and columns and decided that traversing this index was the best option. The overhead of traversing those 19,614 rows accounts for 45% of the total cost.
The estimated cost of 0.180413 is just a number calculated internally; the optimizer uses it to measure the relative overhead of each operation. The lower the number, the more efficient the operation.
Hash match join
In the preceding example, the output of the two index scans is combined by hash match join.
Before discussing what hash match join is, we need to understand two concepts: hash calculation (hash algorithm) and hash table.
Hashing is a programming technique that converts data into a symbolic value so that the data can be retrieved more quickly. For example, a single row of data in a table can be converted by a function into a single value that represents the content of that row. This is similar to data encryption, except that a hash value cannot be converted back into the original data.
A hash table is a data structure that divides its elements into "classes" or "buckets" of equal size, allowing quick access to those elements. The hash function determines which bucket an element belongs in. For example, you can take a row of data from a table, convert it into a hash value, and store that hash value in the hash table.
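To make the bucket idea concrete, here is a minimal Python sketch (the rows, the bucket count, and the `bucket_for` function are all invented for illustration, not SQL Server internals):

```python
# A hash function maps each row (here, a tuple) to one of a fixed number
# of buckets, so a lookup only has to search one small bucket instead of
# the whole dataset.
NUM_BUCKETS = 8

def bucket_for(row, key_index=0):
    """Hash the join-key column of a row into a bucket number."""
    return hash(row[key_index]) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]
rows = [(1, "Seattle"), (2, "Bothell"), (3, "Redmond")]
for row in rows:
    buckets[bucket_for(row)].append(row)

# To find the row with key 2, we only scan a single bucket:
matches = [r for r in buckets[bucket_for((2,))] if r[0] == 2]
```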
When a hash match join occurs, SQL Server joins two tables (one large and one small) by hashing the rows of the smaller table and inserting the resulting hash values into a hash table. It then traverses the larger table one row at a time, searching the hash table for matching rows. Hashing the smaller table keeps the hash table small, and hash values can be compared quickly. If both tables are large, a hash match join is very inefficient compared with other join types.
Hash match join is highly efficient for large datasets, especially when one table is much smaller than the other. It is also effective when the tables are not sorted on the join column or have no usable index.
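The build-and-probe behavior described above can be sketched in Python (a simplified model with invented data, not SQL Server's actual implementation):

```python
# Hash match join sketch: build a hash table on the smaller input, then
# probe it once per row of the larger input.
def hash_match_join(small, large, small_key, large_key):
    # Build phase: hash every row of the small input by its join column.
    hash_table = {}
    for row in small:
        hash_table.setdefault(row[small_key], []).append(row)
    # Probe phase: one pass over the large input, one lookup per row.
    result = []
    for row in large:
        for match in hash_table.get(row[large_key], []):
            result.append(match + row)
    return result

# Hypothetical data: (AddressID, City) joined to (EmployeeID, AddressID).
addresses = [(1, "Seattle"), (2, "Bothell")]
employee_addresses = [(100, 1), (101, 2), (102, 1)]
joined = hash_match_join(addresses, employee_addresses, 0, 1)
```

Note that each input is scanned only once, which is why this strategy beats a naive double loop on large, unindexed inputs.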
The appearance of a hash match join may also indicate that a more efficient join method (nested loop or merge) was unavailable, possibly for the following reasons:
1. Missing or incorrect indexes
2. A missing WHERE clause
3. A calculation or conversion on an indexed column in the WHERE clause that invalidates the index
In these cases, the optimizer judged a hash match join to be the most efficient way to join the two tables; however, you may get a more efficient query by adding an index, or by adding a WHERE clause to reduce the data volume.
Clustered Index Seek
Next, the operation that accounts for 17% is a seek on the PK_Contact_ContactID clustered index of the Person.Contact table. For this table, PK_Contact_ContactID is both the primary key and the clustered index.
Figure 2-15
From the Seek Predicates section, we can see that this operation joins directly on the ContactID columns of the HumanResources.Employee table and the Person.Contact table.
Nested Loop Join
Figure 2-16
As the name suggests, the nested loop join uses nested, two-layer loops. In our plan, the result of the clustered index seek (the 17% operator) feeds the outer input. Because both datasets are small, this is a very efficient operation.
As long as the outer dataset is small and the inner dataset is indexed on the join column (the inner one may be large), nested loop join is a very efficient join mechanism. Unless the datasets are too large, this is the join you most want to see.
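A rough Python sketch of the mechanism (the `contacts`/`employees` data are hypothetical, and a dict stands in for the index on the inner input):

```python
# Nested loop join sketch: for each row of the outer input, look up the
# matching rows in the inner input. Modeling the inner index as a dict
# turns the inner "loop" into a fast seek, which is why a small outer
# input plus an indexed inner input is so efficient.
def nested_loop_join(outer, inner_index, outer_key):
    result = []
    for row in outer:                      # outer loop: once per outer row
        for match in inner_index.get(row[outer_key], []):
            result.append(row + match)     # inner "loop": an index seek
    return result

# Inner input indexed by ContactID, like the clustered index on Person.Contact.
contacts = {7: [("Smith", "John")], 8: [("Doe", "Jane")]}
employees = [("Manager", 7), ("Engineer", 8)]
rows = nested_loop_join(employees, contacts, 1)
```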
Compute Scalar
Figure 2-17
Compute Scalar indicates that this operation produces a scalar value, usually the result of a calculation. In this example, the EmployeeName alias combines LastName and FirstName, joined by a comma.
Merge join
In addition to hash match and nested loop joins, there is a third join type called merge join. Run the following query:
SELECT c.CustomerID
FROM Sales.SalesOrderDetail od
JOIN Sales.SalesOrderHeader oh ON od.SalesOrderID = oh.SalesOrderID
JOIN Sales.Customer c ON oh.CustomerID = c.CustomerID
Execution Plan
Figure 2-18
The two tables joined by a merge join must be pre-sorted on the join column. In that case, merge join is an efficient join method. If the joined tables are not pre-sorted on the join columns, the query optimizer will either sort them first and then perform the merge join, or fall back to a less efficient hash match join.
PS: This resembles a classic method for processing data in COBOL; the principle is the same. There are two datasets, both sorted in ascending order by some field. Set two pointers to the current record of the left and right datasets, respectively, and compare the values at the current positions. If the left value is smaller, advance the left pointer by one record; if the right value is smaller, advance the right pointer by one record. If the two values are equal, these two records match, and both pointers advance together. This only covers the case where the join column's values are unique; there may also be many-to-many matches, in which case the merge join must use a buffer. The principle is roughly this, and it avoids the double-layer loop. Therefore, when two pre-sorted tables are joined, merge join should be the most efficient.
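The two-pointer walk described above can be sketched in Python for the simple case of unique join keys (the `orders` and `details` data are invented; no many-to-many buffering is handled):

```python
# Merge join sketch: both inputs must already be sorted by the join key
# (element 0 of each tuple). Each input is scanned exactly once.
def merge_join(left, right):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1                          # left key smaller: advance left
        elif left[i][0] > right[j][0]:
            j += 1                          # right key smaller: advance right
        else:
            result.append(left[i] + right[j])
            i += 1                          # keys match: emit and advance both
            j += 1
    return result

# Hypothetical pre-sorted inputs keyed by SalesOrderID.
orders = [(1, "2024-01-01"), (2, "2024-01-02"), (4, "2024-01-04")]
details = [(1, 19.99), (3, 5.00), (4, 42.00)]
matched = merge_join(orders, details)
```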