TOP 1 is slower than without TOP 1
Problem description:
There is a query as follows. When TOP 1 is removed, the result is returned almost immediately. However, when TOP 1 is added, it usually takes 2-3 seconds before any data comes back. How can this problem be solved?
SELECT TOP 1
    a.invno
FROM a, b
WHERE a.item = b.itemnumber
    AND b.ownercompanycode IS NOT NULL
Cause analysis:
When TOP 1 is used, SQL Server tries to find the first matching record as quickly as possible, which leads it to choose a scan strategy different from the one used without TOP 1. The query optimizer assumes that the first matching record can be found quickly, so it tends to pick a nested loop join; SQL Server chooses the join strategy based on table structure and data statistics. Nested loops generally suit joins where one of the two tables is large and the other is small. If a matching value appears near the front of the scan, the TOP 1 query fits the conditions for a nested loop join well. However, when the matching data appears near the end of the scan, or there is essentially no matching data at all, the nested loop has to scan two large tables, which is clearly unsuitable. It is precisely in this situation that the query with TOP 1 is much slower than the one without it.
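The effect can be illustrated outside SQL Server with a minimal Python sketch (table contents and names below are made up for the illustration, not taken from the question's data): a nested loop that stops at the first match is very cheap when a match sits near the front of the scan, but degenerates into a full scan when no match exists.

```python
def nested_loop_top1(outer_rows, inner_keys):
    """Return (first matching invno, rows examined); mimics TOP 1 with a
    nested loop join that can stop as soon as one match is found."""
    examined = 0
    for row in outer_rows:
        examined += 1
        if row["item"] in inner_keys:   # index seek on the inner table
            return row["invno"], examined
    return None, examined               # no match: the whole input was scanned

# 10,000 hypothetical rows of table a
outer_rows = [{"invno": f"INV{i:05d}", "item": f"ITEM{i:05d}"}
              for i in range(10000)]

# Match near the front of the scan: TOP 1 stops after a handful of rows.
print(nested_loop_top1(outer_rows, {"ITEM00003"}))  # ('INV00003', 4)

# No match at all: every row is examined before giving up.
print(nested_loop_top1(outer_rows, set()))          # (None, 10000)
```

Without TOP 1, the optimizer plans for the full result set and can pick a hash or merge join instead, which is why the two plans diverge.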
Simulation Environment for this problem:
USE tempdb
GO
SET NOCOUNT ON
-- ==============================================
-- Create a test environment
-- ==============================================
RAISERROR('create test environment', 10, 1) WITH NOWAIT
-- Table a
CREATE TABLE [dbo].a (
    [trannumber] [int] IDENTITY(1, 1) NOT NULL,
    [invno] [char](8) NOT NULL,
    [item] [char](15) NULL DEFAULT(''),
    PRIMARY KEY ([trannumber])
)
CREATE INDEX [indexoninvno] ON [dbo].a ([invno])
CREATE INDEX [indexonitem] ON [dbo].a ([item])
CREATE INDEX [indexoniteminnvo] ON [dbo].a ([invno], [item])
GO
-- Table b
CREATE TABLE [dbo].b (
    [itemnumber] [char](15) NOT NULL DEFAULT(''),
    [companycode] [char](4) NOT NULL,
    [ownercompanycode] [char](4) NULL,
    PRIMARY KEY ([itemnumber], [companycode])
)
CREATE INDEX [itemnumber] ON [dbo].b ([itemnumber])
CREATE INDEX [companycode] ON [dbo].b ([companycode])
CREATE INDEX [ownercompanycode] ON [dbo].b ([ownercompanycode])
GO
-- ==============================================
-- Generate Test Data
-- ==============================================
RAISERROR('generate test data', 10, 1) WITH NOWAIT
INSERT [dbo].a ([invno], [item])
SELECT LEFT(NEWID(), 8), RIGHT(NEWID(), 15)
FROM syscolumns a, syscolumns b
INSERT [dbo].b ([itemnumber], [companycode], [ownercompanycode])
SELECT RIGHT(NEWID(), 15), LEFT(NEWID(), 4), LEFT(NEWID(), 4)
FROM syscolumns a, syscolumns b
GO
Speed Test script:
-- ==============================================
-- Query Test
-- ==============================================
RAISERROR('query test', 10, 1) WITH NOWAIT
DECLARE @dt datetime, @id int, @loop int
DECLARE @t TABLE (
    id int IDENTITY,
    [top 1] int,
    [without top] int)
SET @loop = 0
WHILE @loop < 10
BEGIN
    SET @loop = @loop + 1
    RAISERROR('test %d', 10, 1, @loop) WITH NOWAIT
    SET @dt = GETDATE()
    SELECT TOP 1
        a.invno
    FROM a, b
    WHERE a.item = b.itemnumber
        AND b.ownercompanycode IS NOT NULL
    INSERT @t ([top 1]) VALUES (DATEDIFF(ms, @dt, GETDATE()))
    SELECT @id = SCOPE_IDENTITY(), @dt = GETDATE()
    SELECT -- TOP 1
        a.invno
    FROM a, b
    WHERE a.item = b.itemnumber
        AND b.ownercompanycode IS NOT NULL
    UPDATE @t SET [without top] = DATEDIFF(ms, @dt, GETDATE())
    WHERE id = @id
END
SELECT * FROM @t
UNION ALL
SELECT NULL, SUM([top 1]), SUM([without top]) FROM @t
GO
Test Data Change script:
DECLARE @value char(15), @value1 char(15)
SELECT
    @value = LEFT(NEWID(), 15),
    @value1 = LEFT(NEWID(), 15)
UPDATE a
SET item = @value
FROM a
    INNER JOIN (
        SELECT TOP 1
            [trannumber]
        FROM (
            SELECT TOP 20 PERCENT
                [trannumber]
            FROM a
            ORDER BY [trannumber]
        ) aa
        ORDER BY [trannumber] DESC
    ) b
    ON a.[trannumber] = b.[trannumber]
UPDATE b
SET itemnumber = @value
FROM b
    INNER JOIN (
        SELECT TOP 1
            [itemnumber], [companycode]
        FROM (
            SELECT TOP 20 PERCENT
                [itemnumber], [companycode]
            FROM b
            ORDER BY [itemnumber], [companycode]
        ) bb
        ORDER BY [itemnumber] DESC, [companycode] DESC
    ) b1
    ON b.[itemnumber] = b1.[itemnumber]
    AND b.[companycode] = b1.[companycode]
GO
Test description:
1. When the test environment has just been set up, there are no matching rows. In this case, TOP 1 has to scan all the data in both tables. Run the "speed test script" to check the efficiency difference: the query with TOP 1 is obviously slower than the one without.
2. Modify the 20 in the two TOP 20 PERCENT clauses of the "test data change script" so that the matching data appears at the front, middle, and end of the scan, and run the "speed test script" for each case. You will see that when the matching value is near the front of the scan, the query with TOP 1 is faster than the one without; as the matching data moves toward the end of the scan, the difference shrinks and eventually reverses, with TOP 1 becoming the slower one.
Note: after each data change and run of the "speed test script", change @value to @value1 in the UPDATE statements of the "test data change script" and rerun it, so that the matched data is changed back to non-matching values.
Appendix: Join Methods
1. Nested loop join
Nested loop join, also known as "nested iteration", uses one join input as the outer input table (shown as the top input in the graphical execution plan) and the other as the inner (bottom) input table. The outer loop processes the outer input table row by row. For each outer row, the inner loop executes and searches for matching rows in the inner input table.
In the simplest case, the search scans an entire table or index; this is called a "naive nested loop join". If the search uses an index, it is called an "index nested loop join". If the index is built as part of the query plan (and destroyed when the query completes), it is called a "temporary index nested loop join". The query optimizer considers all of these variants.
A nested loop join is particularly effective if the outer input is small and the inner input is large with an index created in advance. In many small transactions (such as those affecting only a small set of rows), index nested loop joins are superior to merge joins and hash joins. In large queries, however, nested loop joins are often not the optimal choice.
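As a rough illustration (plain Python over lists of dicts, not SQL Server's actual implementation), the naive and index variants differ only in how the inner input is searched:

```python
def nested_loop_join(outer, inner, okey, ikey):
    """Naive nested loop join: scan the whole inner input for every outer row."""
    return [(o, i) for o in outer for i in inner if o[okey] == i[ikey]]

def index_nested_loop_join(outer, inner, okey, ikey):
    """Index nested loop join: build an index on the inner input once,
    then do a cheap seek into it for each outer row."""
    index = {}
    for row in inner:
        index.setdefault(row[ikey], []).append(row)
    return [(o, m) for o in outer for m in index.get(o[okey], [])]

small = [{"id": 1}, {"id": 3}]                       # small outer input
large = [{"ref": n % 5, "v": n} for n in range(10)]  # larger inner input
assert nested_loop_join(small, large, "id", "ref") == \
       index_nested_loop_join(small, large, "id", "ref")
```

Both variants produce the same pairs; the index version simply replaces the inner scan with a hash lookup, which is why a small outer input plus an indexed inner input is the sweet spot.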
2. Merge join
A merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. Typically the query optimizer scans an index, if one exists on the appropriate set of columns, or places a sort operator below the merge join. In rare cases there may be multiple equality clauses, but only some of them are used as the merge columns.
Because each input is sorted, the merge join operator gets a row from each input and compares them. For an inner join, for example, the rows are returned if they are equal. If they are not equal, the row with the smaller value is discarded and the next row is obtained from that input. This process repeats until all rows have been processed.
The merge join operation may be a regular one-to-many operation or a many-to-many operation. A many-to-many merge join uses a temporary table to store rows. If there are duplicate values in each input, one of the inputs must be rewound to the start of the duplicate run while each duplicate in the other input is processed.
If a residual predicate is present, all rows that satisfy the merge predicate are evaluated against the residual predicate, and only the rows that satisfy it are returned.
Merge join itself is fast, but it can be an expensive choice if sort operations are required. However, if the data volume is large and pre-sorted data can be obtained from existing B-tree indexes, merge join is often the fastest available join algorithm.
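The compare-and-advance loop described above can be sketched in Python (handling the simpler one-to-many case over pre-sorted lists; the many-to-many rewind is omitted for brevity):

```python
def merge_join(left, right, key):
    """Inner merge join of two inputs sorted on `key`; keys in `left`
    are assumed unique (the one-to-many case)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1                       # discard the smaller row
        elif left[i][key] > right[j][key]:
            j += 1
        else:                            # equal: return the pair
            out.append((left[i], right[j]))
            j += 1                       # left key is unique, advance right
    return out

a = [{"k": 1}, {"k": 2}, {"k": 4}]                              # sorted on k
b = [{"k": 2, "v": "x"}, {"k": 2, "v": "y"}, {"k": 4, "v": "z"}]
print(merge_join(a, b, "k"))   # three matching pairs, in key order
```

Each input is read exactly once, which is why merge join is so cheap when the sort order comes for free from an index.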
3. Hash join
A hash join has two inputs: the build input and the probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
Hash joins are used for many types of set-matching operations: inner join; left, right, and full outer join; left and right semi-join; intersection; union; and difference. In addition, variants of the hash join can perform duplicate removal and grouping, such as SUM(salary) GROUP BY department. These variants use a single input for both the build and probe roles.
The following sections describe the different types of hash join: in-memory hash join, grace hash join, and recursive hash join.
In-memory hash join
The in-memory hash join first scans or computes the entire build input and then builds a hash table in memory, inserting each row into a hash bucket according to the hash value computed on its hash key. If the entire build input is smaller than the available memory, all rows can be inserted into the hash table. The build phase is followed by the probe phase: the entire probe input is scanned or computed one row at a time; for each probe row, the hash key value is computed, the corresponding bucket is scanned, and matches are produced.
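A minimal Python sketch of the build and probe phases (a plain dict stands in for the hash table and its buckets; the table contents are made up):

```python
def hash_join(build, probe, bkey, pkey):
    """In-memory hash join: build a hash table on the (smaller) build
    input, then probe it with each row of the probe input."""
    buckets = {}
    for row in build:                        # build phase
        buckets.setdefault(row[bkey], []).append(row)
    return [(b, p)                           # probe phase
            for p in probe
            for b in buckets.get(p[pkey], [])]

depts = [{"dept": "hr"}, {"dept": "eng"}]                 # build input (smaller)
emps = [{"dept": "eng", "name": "wu"},                    # probe input
        {"dept": "hr", "name": "li"},
        {"dept": "ops", "name": "ko"}]
print(hash_join(depts, emps, "dept", "dept"))  # two matches; "ops" has none
```

Note that the build input is consumed in full before the first probe row is examined, which is why the optimizer wants the smaller input in the build role.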
Grace hash join
If the build input is larger than the available memory, the hash join proceeds in several steps; this is called a "grace hash join". Each step has a build phase and a probe phase. First, the entire build and probe inputs are consumed and partitioned (using a hash function on the hash keys) into multiple files. Using the hash function guarantees that any two joining rows must fall in the same pair of files. Joining two large inputs is thus reduced to multiple smaller instances of the same task. The hash join is then applied to each pair of partitioned files.
Recursive hash join
If the build input is so large that a standard external merge sort of it would require multiple merge levels, then multiple partitioning steps and multiple partitioning levels are required. If only some of the partitions are large, additional partitioning steps are needed only for those partitions. To make all partitioning steps as fast as possible, large asynchronous I/O operations are used so that a single thread can keep multiple disk drives busy.
Note:
If the build input is only slightly larger than the available memory, elements of the in-memory hash join and the grace hash join are combined in a single step, producing a hybrid hash join.
It is not always possible to determine during optimization which kind of hash join will be used. Therefore, SQL Server starts with an in-memory hash join and gradually transitions to a grace hash join and then a recursive hash join, depending on the size of the build input.
If the optimizer incorrectly predicts which of the two inputs is smaller (and therefore which should have been the build input), the build and probe roles are dynamically reversed. Hash join makes sure that the smaller overflow file is used as the build input. This technique is called "role reversal". Role reversal occurs within a hash join only after at least one file has spilled to disk.
Note:
Role reversal occurs independently of any query hints or structure. Role reversal does not appear in the query plan; it is transparent to users.
I have a question about the above.
My problem is that the join query and the same join query with TOP return inconsistent data.
Is there any solution?
The two tables in the join query (
Table a (aid PK, date clustered index, ...)
Table b (bid PK and FK, clicknum clustered index, ...)
) have a one-to-one relationship: aid corresponds to bid.
For the following two queries:
1: SELECT * FROM a INNER JOIN b ON aid = bid
2: SELECT TOP 10 * FROM a INNER JOIN b ON aid = bid
the first 10 rows of query 1 do not match the 10 rows returned by query 2. What are the solutions? How is TOP implemented?