Original: SQL Server Tuning series advanced (How to index tuning)
Objective
In the previous article we analyzed the role of statistical information in the database, and we have learned how the database uses statistics to control the distribution of the contents of tables in the database. Children's shoes are not clear and can be clicked for reference.
As a tuning series of articles, the index of the database is certainly not less, so this article we began to analyze this piece of content, about the basic knowledge of the index is not intended to further analysis, a search on the web, this article is more focused on some of the actual combat content display, I hope that through this article you crossing can find the right solution in the real scene enough.
For the use of the index, I would like to encounter problems to find a suitable solution can be, do not mess with!!!
This article analyzes the superiority of the index and also shows the negative effect.
Technical preparation
Database version is SQL Server2012, the previous articles with SQL Server2008rt, the content is not very different, using Microsoft's previous case library (Northwind) for analysis, part of the content will also be applied to another Microsoft Case Library AdventureWorks
Trust the Friends of SQL Server, both libraries are not too unfamiliar.
Conceptual understanding
The so-called indexes, like other types of data pages in SQL Server, are fixed 8KB (8192 bytes), stored in the same b-tree structure, and each page in the index B-tree is called an index node. The top node of the B-tree is the root node. The underlying node in the index is called a leaf node. Any index between the root node and the leaf node is collectively known as the intermediate level.
Forget it, it's too cumbersome to describe, and it's an intuitive display structure on Books Online:
The above diagram visually shows the way the b-tree structure, basically similar to the structure of the data page, here is a point to remind that the bottom of the clustered index leaf node is stored as the actual data page. This provides an ultra-fast way for fast data acquisition, and is also an important part of our tuning, which is analyzed in subsequent articles.
Then look at the nonclustered index.
Nonclustered indexes and clustered indexes are also stored in a b-tree structure, but there is a significant difference in what is stored:
- The data rows of the underlying table are not sorted and stored in the order of the nonclustered index keys
- The leaf layer of a nonclustered index is made up of index pages rather than data
Because of the above characteristics, it is obvious that the fastest way to get data is through a clustered index, because its leaf node is the data page, the same leaf node's data page physical order is also stored in the structure order of the clustered index, which results in a data table can only exist a clustered index, And the clustered index occupies much less disk space than the nonclustered index .
For a nonclustered index, the leaf node stores an index row, and the data must be indexed by the address of the data page (the clustered index key or the RID of the heap table) that is recorded by the index row, which is what makes a data table can have multiple nonclustered clustered indexes and requires its own independent storage space.
Both indexes are designed to facilitate fast access to data pages and improve query performance. This is like a book need to add a table of contents.
About the index of a lot of knowledge, the basic content is not too much introduction, do not understand the information can be self-access, online n more.
Here is a brief introduction to the use of tips and precautions, I believe this is also the most concern of friends.
One, the choice of the clustered index
All of the ways to improve query performance using indexes, the first one is the clustered index, it is fast because b-tree is such a superior storage algorithm, b-tree as a balanced fork tree data structure, is all the relational database in the market in the way, Interested students can study this algorithm in depth.
Take a look at the clustered index, because there can only be one in a table, and the data content can be obtained from the leaf node primarily through clustered index lookups, so the SQL Server database system is also trying to facilitate the presence of clustered indexes.
As an example:
Use [Testdb]gocreate TABLE [dbo]. [TestTable] ( [A] [int] PRIMARY KEY not NULL, [B] [varchar] () null) GO
We create a test table, and generally the best design is to add a primary key to this table.
The concept of the primary key, I believe that almost understand the point of the database of children's shoes is not unfamiliar, two basic features: non-repetition, non-empty.
Well, just these two points are used, the meaning of non-repetition is that the selectivity is high, non-empty more can bring the data density high, so SQL Server happily chose the clustered index on the primary key column, and this way in the database has an elegant name: Primary key index.
So when we're done creating this table, SQL Server sets up the clustered index for that table by default.
To avoid duplicate names, SQL Server defaults to a field that has a GUID added to the name. It was really hard.
Of course, the formal way to let us specify this name, the script is as follows:
CREATE TABLE [TestTable3] ( [A] [int] not NULL, [B] [varchar] ( ) NULL CONSTRAINT pk_index PRIMARY KEY ([A])); GO
It looks more elegant.
In fact, the primary purpose of this default approach to SQL Server is to maximize the use of clustered indexes, because we know the benefits of clustered indexes, and it also creates the underlying condition for the formation of nonclustered indexes: the leaf node of a nonclustered index is the key code of the clustered index.
So based on this, we design the table later, do not disappoint the intentions of SQL Server, each table should have a clustered index.
I've seen a lot of people design tables that are stark piles of tables. And this is not serious, serious is a lot of unknown so the nonclustered index is added to the heap table, which is a typical deadlock environment in a large concurrency scenario, which is reproduced later in the article.
Of course, this is not an optimal way, because we know that when we design the table, the primary key is mostly meaningless keys, it is said that many cases are not used as filters when querying, and that the scope of coverage is limited to primary key columns. So the optimal design is to use a federated primary key or a custom clustered index column. Of course, SQL server above the original purpose of this design is to consider the small white table, weigh the pros and cons selected a compromise, such as no special needs, the default way to build a clustered index basically can meet the business scenario.
Then we analyze the nonclustered index
Second, the choice of non-clustered index
From the previous analysis, we can understand the benefits of a clustered index, but it also has the greatest self-restriction: Only one clustered index can exist on a single table.
To make more use of indexes, SQL Server introduces nonclustered indexes, and the nonclustered index entries for a single table can be many, so it's enough to let us see the performance gains of the index.
Above, we know that when a table is given a primary key, SQL Server creates the clustered index by default, but for the creation of nonclustered indexes, SQL Server does not help to build it, and we need to build it manually because it does not know that your nonclustered index is more appropriate to create on that column.
However, it is often a best practice that, as a relational data in order to be complex business entities, the design structure used is generally a pair of one or one-to-many, many-to-many design ideas, and this design structure formed the primary foreign key relationship, we know that the primary key SQL Server automatically create a clustered index, The recommended way to index foreign keys is to manually create nonclustered indexes to speed up mapping relationships between tables.
However, because of the special nature of its storage structure (non-data pages stored by leaf nodes), nonclustered indexes affect the efficiency of reading data, and more often we want to get a portion of the data rather than a single piece of data.
The use of nonclustered indexes is efficient when you get a portion of the data that is covered by a nonclustered index, which is very inefficient if the obtained data is not indexed, that is, when looking through a clustered index, you also need to introduce additional bookmark lookups, because we know that the bookmark lookup under the b-tree structure is: random io, the performance cost of random Io is very large, and for this reason SQL Server discards this way, directly from the table Scan (table seek) or the clustered Index Scan (index seek) to get the data directly.
The above part of the content, I have in the first article in the previous introduction, you can click to view.
It's too cumbersome to describe, as an example to explain:
SELECT orderid,customerid,orderdate from Ordersorder by OrderDate
Very simple query, take a look at the execution plan
Because there is a primary key on the table, a clustered Index Scan (index Scan) is used here, and if there is no clustered index, it must be a table scan.
Let's take a look at a hint hint to see how SQL Server takes advantage of nonclustered indexes.
Here we use fast N hint hint, this hint is very simple is to tell SQL Server to quickly first get out of the first N rows of data, the other data are back ... Maximize the data acquisition efficiency of the top n rows ( Remember: This tip is the best scenario for paged queries, and many business systems have pagination, plus this hint will make the database get the top number of data in the quickest way )
Our subsequent articles will analyze the usefulness of various hint in detail.
To continue the analysis, I want to quickly get to the first 1 rows of data, the script is as follows:
1)
To quickly get to a row of data, SQL Server changed the execution plan, scanned it with a nonclustered index, and had to introduce a bookmark lookup (Key lookup) in order to get the data from the other columns, from which we can see that the bookmark lookup consumes as much as 66%.
We went on to analyze, I want to get the first 10 rows of data, the script is as follows:
Ten)
When we want to get 10 rows, the consumption of bookmark lookup has started to soar, it has soared to 90% .... The reason is very simple, that is, I analyzed the article before the random io ...
Although bookmark lookup affects efficiency, the data we find is only a small part of it, so here SQL Server thinks it is an optimal way to get data using nonclustered index + bookmark lookup.
We go on to analyze, I want to quickly get 20 rows of data, the script is as follows
)
To this, SQL Server has decisively abandoned the nonclustered index + bookmark lookup this way. This is a cheaper way to use a clustered index scan.
After my tests, I found a range of values that SQL Server considers valid for this clustered index:
)
This discriminant threshold is 15 rows, and once the 15 rows of data are exceeded, SQL Server discards the nonclustered index.
We can analyze the valid range of nonclustered indexes from this process: 15 (number of valid lines)/1660 (total number of rows) = 0.009638, that is, 9% of such a quantity, of course, this value is not a fixed value, depending on a variety of factors, such as row type, content distribution, hardware environment, etc.
However, by this value I would like to tell you: the effectiveness of the nonclustered index is actually very narrow, because its coverage is small, which leads to a lot of children's shoes set up a nonclustered index, but in the actual implementation of the basic is useless.
There's a lot more to be said here, and many people mistakenly think that the god horse Non-clustered index to select the type of int is better than the type of varchar, and the last time I saw someone in order to save the phone number as int .... The goal is to find the fast ...
In fact, these views are very superficial understanding ... The best choice for an indexed column is a good one, but it's good to differentiate between column content distribution, which is the only choice: to maximize the selectivity of SQL Server.
An example of extreme points: adding a gender column to a nonclustered index: selectivity is only 50% .... The non-clustered index coverage is small, this index is basically useless ...
Also, pay attention to the order of the index, such as: two columns of value: Last name, first names, when designing the index, please put the surname in front, then the name ... It's like finding your address book the first name, and then looking for it ....
All right... Talk about the more, return to our content.
The non-clustered index above presents a random IO problem, and SQL Server also gives a workaround from version 2005: Inclusive column index
It is simply that the data page to be fetched when the nonclustered index is stored is included in the leaf node.
is to imitate the way of the clustered index, the leaf node of the nonclustered index is also stored in the data page information, of course, because the physical data page only one copy, so the nonclustered index can only copy one copy of its own storage, so that when looking for a nonclustered index can be directly obtained data.
The code is as follows:
Use [northwind]gocreate nonclustered INDEX [orderdateindex] on [dbo]. [Orders] ( [OrderDate] ASC) INCLUDE ( [OrderID], = on ) GO
In this case, this nonclustered index will be used when looking for this column. It also avoids the existence of random io (bookmark lookup), reduces the IO value and improves performance.
Of course, in most business systems, the amount of data obtained by using a nonclustered index is still relatively small, mostly a display detail page, so that the vantage surface of the nonclustered index is fully visible.
So for OLTP business systems, learn to take advantage of nonclustered indexes.
Of course, there are pros and cons, and can not be too much to create a nonclustered index, if the use of too many indexes this is like a table of each column data copy n copies of storage, occupy space, the main thing is that SQL Server needs to maintain the various nonclustered indexes when new data is added. This causes the data to be inserted slower, resulting in more index fragmentation and increased read IO.
Below, let's reproduce the deadlock phenomenon mentioned earlier in the article, which is purely due to the design not being in place.
About this problem brother Gao in the previous article has been introduced, here I borrowed the following its script to reproduce, click this can connect to the high brother of the article.
The script is as follows:
CREATE TABLE Testklup (ClskeyintNotNULL, NlskeyintNotNULL, Cont1intNotNULL, Cont2Char( the) create unique clustered index inx_cls on Testklup (clskey) Create unique nonclustered index Inx_nlcs on Testklup (Nlsk EY) include (cont1) insert INTO TestklupSelect 1,1, -,'AAA'INSERT INTO TestklupSelect 2,2, $,'BBB'INSERT INTO TestklupSelect 3,3, -,'CCC'
Open a thread input query modification
----intset @i=1=1 begin Set cont1=@i where clskey=1 set @[email protected]+ 1 End
In addition, the same thread queries the operation
----char(in)while1=1begin Selectfromwhere nlskey=1End
Originally two operations, one to modify, one to query, SQL Server will automatically well maintain the order of the two, there will be no deadlock situation, but ... But we created an inclusive nonclustered index on it, and the Cont1 column copy into the nonclustered index, so that the modification would need to maintain the nonclustered index columns, and then we had to query with a nonclustered index, which happened to be on two different key values in the same table, which caused a deadlock to occur.
We open the profile to capture this deadlock.
In fact, there are several ways to solve this problem because we know that the problem is that the nonclustered index we create is inappropriate, so that the query and the modification occur on different key values in the two tables.
So one solution is to simply remove the clustered index. This will not result in the presence of additional key locks.
Another way is to tell us that our nonclustered index contains the Cont2 column, and the script is as follows
CREATE nonclustered INDEX [inx_nlskey_incont2] on [dbo]. [Testklup] ([Nlskey] ASC) INCLUDE ([Cont2])
Of course, you can also increase the isolation level or reduce the isolation level, but this is not the recommended method for simple reasons: reducing the isolation level is dirty read and increasing the isolation level affects concurrency.
I hope you crossing in the design of the database do not happen this kind of tragedy. Especially in cases of high concurrency, be cautious and proceed cautiously.
Of course, here also to remind: do not hold a hammer in your hand, see what is nails in your eyes!! Never over-design.
Or that sentence, the appropriate scenario to take the right plan, all can not be arbitrary, but also can not easily listen to others, to practice to the truth.
The knowledge of the index is really too broad .... A little bit of writing is enough space .... Come here first ... I'll add some of the content about the index later.
We should maintain the index timely, timely reconstruction, defragmentation, delete useless indexes and other operations, including the creation of an index of a series of notes and so on.
About this piece of content next article introduction.
About tuning the content is too broad, we put in the future space to introduce, interested in advance attention can be
Third, the question of investigation
At the end of the article, the sun a few days ago in the book to see a more interesting logic, here to share the next for the courtyard friends to ponder, but also to examine the logical ability of T-SQL statements, this problem can be as an interview problem, not too difficult, but fully able to test the ability of T-SQL programming.
The questions are as follows:
--create a reply-to information record table of the CREATE TABLE dbo. Sessions (keycol INT not NULL IDENTITY, app VARCHAR (Ten) not NULL, usr VARCHAR (Ten) not NULL, host VARCHAR (Ten) notNULL, StartTime DATETIME notNULL, Endtime DATETIME notNULL, CONSTRAINT pk_sessions PRIMARY KEY (keycol), CHECK (Endtime>starttime)); GO--Insert part of the test data insert into DBO. Sessionsvalues ('App1','User1','host1','20030212 08:30','20030212'); INSERT into DBO. Sessionsvalues ('App1','User2','host1','20030212 09:30','20030212'); INSERT into DBO. Sessionsvalues ('App1','User3','Host2','20030212 09:31','20030212 11:20'); INSERT into DBO. Sessionsvalues ('App1','User4','Host2','20030212','20030212 12:30pm'); INSERT into DBO. Sessionsvalues ('App1','User5','Host3','20030212 11:35','20030212 12:35'); INSERT into DBO. Sessionsvalues ('app2','User6','Host3','20030212 08:30','20030212'); INSERT into DBO. Sessionsvalues ('app2','User7','Host3','20030212 08:30','20030212'); INSERT into DBO. Sessionsvalues ('app2','User8','Host3','20030212 08:30','20030212');
For a table, ask for the maximum number of concurrent queries out of each application ....
The problem is not difficult, want to test the ability to try ... Again, it is good to review the problem again to do, you can give me the answer message.
Conclusion
If you have any questions, you can leave a message or private messages, and look forward to an in-depth study of your child's shoes with SQL Server. Learn together and progress together.
At the end of the article gives the previous several connections, the following content basically covers our daily write query operation decomposition, it seems necessary to sort out a directory ....
SQL Server Tuning Series Basics
SQL Server Tuning Series Basics (Summary of common operators)
SQL Server Tuning Series Basics (Union operator summary)
SQL Server Tuning Series basics (Parallel operations Summary)
SQL Server Tuning Series basics (Parallel operations Summary chapter II)
SQL Server Tuning Series Basics (Index operations Summary)
SQL Server Tuning Series Basics (subquery operations Summary)
-----------------The following step-by-step article-------------------
SQL Server Tuning Series advanced (how the query optimizer runs)
SQL Server Tuning Series Advanced (query statements run several indicator value monitoring)
SQL Server Tuning series advanced (in-depth profiling statistics)
If you read this blog, feel that you have something to gain, please do not skimp on your " recommendation ".
SQL Server Tuning series advanced (How to index tuning)