Original: SQL Server - Focus on INNER JOIN and IN Performance Analysis (14)
Preface
This section pulls together knowledge from the earlier parts of the series. Most tutorials and textbooks tell us which construct performs well and which performs worse, but few get to the essence of why. Each article in this series is short, but I consult a lot of material before writing it. Short content, in-depth understanding, always reviewing the basics.
A first look at INNER JOIN and IN performance
We start with a comparative analysis of INNER JOIN and IN. First we create some tables and look at how INNER JOIN behaves.
Create test table 1
CREATE TABLE Table1
(
    ID INT IDENTITY PRIMARY KEY,
    SomeColumn CHAR(4),
    Filler CHAR()   -- the length of Filler was lost in the source text
)
Insert test data
INSERT INTO Table1 (SomeColumn) VALUES (1), (2), (3), (4), (5)
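Create test table 2 and insert data
USE TSQL2012
GO
-- The column list was partly lost in the source text; only IntCol is used below
CREATE TABLE Table2 (IntCol INT)
INSERT INTO Table2 (IntCol) VALUES (1), (2), (2), (3), (4), (5), (5)
Next we join test table 1 and test table 2 on SomeColumn and IntCol
SELECT * FROM Table1 AS b INNER JOIN Table2 AS s ON b.SomeColumn = s.IntCol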
The join returns 7 rows, because test table 2 contains duplicate values and every one of them matches a row in test table 1. Now let's look at the IN query
SELECT * FROM Table1 WHERE SomeColumn IN (SELECT IntCol FROM Table2)
This returns 5 rows, so INNER JOIN and IN really are different. But if test table 2 contains no duplicates and we do not need any of its columns, both queries return the same rows as test table 1. Is there any difference in performance in that case? To find out, we test with a large amount of data.
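The row-count difference can be reproduced in one batch. The following is a minimal sketch, not the author's script: the table variables @T1 and @T2 are hypothetical stand-ins for the two small test tables, loaded with the same values.

```sql
-- Hypothetical table variables mirroring the small test tables
DECLARE @T1 TABLE (SomeColumn INT);
DECLARE @T2 TABLE (IntCol INT);

INSERT INTO @T1 (SomeColumn) VALUES (1), (2), (3), (4), (5);
INSERT INTO @T2 (IntCol) VALUES (1), (2), (2), (3), (4), (5), (5);

-- INNER JOIN: one output row per matching pair,
-- so the duplicated 2 and 5 in @T2 each produce an extra row
SELECT COUNT(*) AS JoinRows
FROM @T1 AS b
INNER JOIN @T2 AS s ON b.SomeColumn = s.IntCol;      -- 7

-- IN: each row of @T1 is returned at most once
SELECT COUNT(*) AS InRows
FROM @T1
WHERE SomeColumn IN (SELECT IntCol FROM @T2);        -- 5

-- Joining to the de-duplicated values gives the same count as IN
SELECT COUNT(*) AS DistinctJoinRows
FROM @T1 AS b
INNER JOIN (SELECT DISTINCT IntCol FROM @T2) AS s
    ON b.SomeColumn = s.IntCol;                      -- 5
```

The last query shows why the two forms are only interchangeable when the joined column has no duplicates, or when the duplicates are removed first.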
Create two test tables
CREATE TABLE BigTable
(
    id INT IDENTITY PRIMARY KEY,
    SomeColumn UNIQUEIDENTIFIER NOT NULL,
    Filler CHAR()   -- the length of Filler was lost in the source text
)
CREATE TABLE SmallerTable
(
    id INT IDENTITY PRIMARY KEY,
    LookupColumn UNIQUEIDENTIFIER NOT NULL,
    SomeArbDate DATETIME DEFAULT GETDATE()
)
Insert 1 million rows into the SomeColumn column of BigTable
INSERT INTO BigTable (SomeColumn) SELECT NEWID() FROM dbo.Nums WHERE n < 1000001
Copy 25% of the BigTable rows into the LookupColumn column of SmallerTable
USE TSQL2012
GO
INSERT INTO SmallerTable (LookupColumn) SELECT DISTINCT SomeColumn FROM BigTable TABLESAMPLE (25 PERCENT)
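TABLESAMPLE selects whole data pages rather than individual rows, so the share that lands in SmallerTable is only approximately 25%. A quick check like the sketch below, which is an addition and not part of the original test, confirms how many rows were actually loaded and that LookupColumn is still unique at this stage. The column aliases are illustrative.

```sql
-- How many rows did the load produce?
SELECT COUNT(*) AS BigRows
FROM dbo.BigTable;

-- SmallRows and DistinctLookups should be equal while the data is unique
SELECT COUNT(*) AS SmallRows,
       COUNT(DISTINCT LookupColumn) AS DistinctLookups
FROM dbo.SmallerTable;
```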
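We test three different scenarios.
(1) Comparing INNER JOIN and IN with no indexes
SELECT BigTable.ID, SomeColumn FROM BigTable WHERE SomeColumn IN (SELECT LookupColumn FROM dbo.SmallerTable)
SELECT BigTable.ID, SomeColumn FROM BigTable INNER JOIN SmallerTable ON dbo.BigTable.SomeColumn = dbo.SmallerTable.LookupColumn
With no indexes, there is no difference between the two queries in query cost or I/O. Now let's add indexes.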
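The text does not show how the cost and I/O figures were collected. One common way, assumed here, is to turn on the session statistics and run both forms in the same batch, with "Include Actual Execution Plan" enabled in SSMS to see the relative query cost. The aliases b and s are illustrative.

```sql
-- Report logical reads and CPU/elapsed time for each statement
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- IN form
SELECT b.ID, b.SomeColumn
FROM dbo.BigTable AS b
WHERE b.SomeColumn IN (SELECT LookupColumn FROM dbo.SmallerTable);

-- INNER JOIN form
SELECT b.ID, b.SomeColumn
FROM dbo.BigTable AS b
INNER JOIN dbo.SmallerTable AS s ON b.SomeColumn = s.LookupColumn;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The logical reads appear on the Messages tab, one entry per table per statement, which makes the two forms easy to compare side by side.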
(2) Comparing INNER JOIN and IN with non-unique nonclustered indexes
CREATE INDEX idx_BigTable_SomeColumn ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn ON SmallerTable (LookupColumn)
With non-unique nonclustered indexes there is a big difference in query cost: INNER JOIN costs about twice as much as IN, while the I/O is almost equal.
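(3) Comparing INNER JOIN and IN with unique nonclustered indexes
-- The index names are reused, so the non-unique indexes from (2) must be dropped first
DROP INDEX idx_BigTable_SomeColumn ON BigTable
DROP INDEX idx_SmallerTable_LookupColumn ON SmallerTable
CREATE UNIQUE INDEX idx_BigTable_SomeColumn ON BigTable (SomeColumn)
CREATE UNIQUE INDEX idx_SmallerTable_LookupColumn ON SmallerTable (LookupColumn)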
Why do the costs become the same once the indexes are unique nonclustered indexes? That is a little puzzling. And doesn't the result in (2) suggest that IN performs better than JOIN, which overturns what we thought we knew? As mentioned in the preface, most tutorials tell us that JOIN performs better than EXISTS, and EXISTS better than IN. Hands-on verification is what counts, and the most we can say is that this ordering holds in general. It is only a general rule, and this series aims to show when you should use EXISTS, when you should use JOIN, and when you should use IN; those topics will be covered one after another. Back to the point: with 1 million rows, INNER JOIN cost twice as much as IN, which is probably not what you expected. With this question in mind, we explore further.
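For reference, the EXISTS form of the same query looks like the sketch below. It is not tested in this section and no performance claim is made for it here; it returns the same rows as the IN form. The aliases b and s are illustrative.

```sql
-- EXISTS form: returns each BigTable row that has at least one match
SELECT b.ID, b.SomeColumn
FROM dbo.BigTable AS b
WHERE EXISTS (SELECT 1
              FROM dbo.SmallerTable AS s
              WHERE s.LookupColumn = b.SomeColumn);
```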
Further discussion of INNER JOIN and IN performance
The 25% of rows copied from BigTable into SmallerTable above are all unique. We now turn part of that data into duplicates: we take one value from the SomeColumn column of BigTable and write it into the LookupColumn column of 10,000 rows in SmallerTable, as follows
-- The GUID is one value from the author's BigTable; NEWID() values differ on every run.
-- A unique index on LookupColumn would reject these duplicates, so it must not be in place.
UPDATE dbo.SmallerTable SET LookupColumn = '0067cb6c-64e1-46cc-b7f2-334a7dd812ff' WHERE ID >= 1 AND ID <= 10000
Now we run the queries against the data containing the 10,000 duplicates
SELECT BigTable.ID, SomeColumn FROM BigTable WHERE SomeColumn IN (SELECT LookupColumn FROM dbo.SmallerTable)
SELECT BigTable.ID, SomeColumn FROM BigTable INNER JOIN SmallerTable ON dbo.BigTable.SomeColumn = dbo.SmallerTable.LookupColumn
The result is as before: the cost of IN is close to half that of INNER JOIN. Next, we remove the duplicate LookupColumn values when reading SmallerTable, so the queries become:
USE TSQL2012
GO
SELECT BigTable.ID, SomeColumn
FROM BigTable
WHERE SomeColumn IN (SELECT LookupColumn FROM dbo.SmallerTable)

-- The derived-table alias was lost in the source text; s is used here
SELECT BigTable.ID, SomeColumn
FROM BigTable
INNER JOIN (SELECT DISTINCT LookupColumn FROM dbo.SmallerTable) AS s ON s.LookupColumn = dbo.BigTable.SomeColumn
This time the result is different from the one above: the query costs are the same, which should make the picture clear. To recap: in the first round of tests, with non-unique nonclustered indexes, INNER JOIN cost close to twice as much as IN, while with unique nonclustered indexes the costs were the same. That was puzzling at first, but continuing the investigation leads to the conclusion below.
Conclusion on INNER JOIN and IN performance: when the column values in the joined table are unique, INNER JOIN and IN cost the same; when the column values in the joined table contain duplicates, IN performs better than INNER JOIN.
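To see which case of the conclusion applies to a given table, the lookup column can be checked for duplicates first. A minimal sketch, using this section's SmallerTable; the alias Occurrences is illustrative:

```sql
-- Lists every LookupColumn value that occurs more than once.
-- An empty result means the column is unique.
SELECT LookupColumn, COUNT(*) AS Occurrences
FROM dbo.SmallerTable
GROUP BY LookupColumn
HAVING COUNT(*) > 1;
```

If this returns rows, a plain INNER JOIN also returns more rows than IN, so the join needs a DISTINCT on the lookup side to stay equivalent.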
Summary
In this section we analyzed the performance of INNER JOIN and IN in detail and arrived at a consistent conclusion. In the next section we start discussing the performance of NOT EXISTS and NOT IN. Short content, in-depth understanding. See you next time, good night.