SQL Server Index Design (Part 5)

Source: Internet
Author: User
Tags arithmetic

The main considerations for designing SQL Server indexes are as follows:

Examine the columns used in WHERE clauses and join conditions;

Use narrow indexes;

Examine the selectivity of the column;

Examine the data type of the column;

Consider the column order;

Consider the index type (clustered or nonclustered).

First, check the WHERE clause and join condition columns

When a query is submitted to SQL Server, the query optimizer tries to find the best data access mechanism for every table referenced in the query. It handles WHERE clauses and join conditions as follows:

    1. The optimizer identifies the columns used in the WHERE clause and the join conditions.
    2. The optimizer then examines the indexes on these columns.
    3. The optimizer assesses the usefulness of each index by determining the selectivity of the clause from the statistics maintained on the index.
    4. Finally, based on the information gathered in the previous steps, the optimizer estimates which method reads the qualifying rows at the lowest cost.

Consider an example in order to understand the importance of the WHERE clause in the query.

SELECT * from person WHERE Id = 100;

Assume that the Id column of the person table has a clustered index. The statement above has a WHERE clause, so the query optimizer examines the Id column used in it, finds the clustered index on that column, determines from the statistics on the clustered index that the WHERE clause is highly selective, and decides to use the index.
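As a minimal sketch of this scenario, the statements below create a table with a clustered index on Id and run the example query. Only the table name person and the Id column come from the text; the other columns and the index name are hypothetical. SET STATISTICS IO reports the logical reads, which is one way to check whether the index was used.

```sql
-- Hypothetical table; only "person" and "Id" come from the example above.
CREATE TABLE dbo.person (
    Id     int          NOT NULL,
    name   nvarchar(50) NOT NULL,
    gender bit          NOT NULL
);

-- Clustered index on the column used in the WHERE clause.
CREATE UNIQUE CLUSTERED INDEX CIX_person_Id ON dbo.person (Id);

-- Report logical and physical reads for the statements that follow.
SET STATISTICS IO ON;

SELECT * FROM dbo.person WHERE Id = 100;

SET STATISTICS IO OFF;
```

With enough rows in the table, the execution plan for the SELECT would be expected to show a clustered index seek rather than a scan.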

As this description shows, the columns in the WHERE clause help the optimizer choose the best index operation for a query. The same applies to the columns used in a join condition between two tables. The optimizer looks for indexes on the WHERE clause or join condition columns and, if one is available, considers using it to retrieve rows from the table. An index on a column that is frequently used in a WHERE clause or join condition therefore helps the optimizer avoid scanning the base table.

However, when a table is so small that it fits into a single data page, a table scan may be faster than an index lookup. If you have a good index but the optimizer still scans, consider whether this is the reason.

Second, use narrow indexes

You can create indexes on a combination of columns in a table, but for best performance, try to use fewer columns in the index. You should also avoid using columns of the wide data type in the index.

    • Narrow index: the index contains as few columns as possible;
    • Wide data types: data types that occupy a lot of space, such as long CHAR, VARCHAR, NCHAR, and NVARCHAR columns. Unless absolutely necessary, avoid using large columns of these types in an index.

A narrow index fits more rows on an 8KB index page than a wide index, which has the following advantages:

    1. Fewer I/O operations (fewer 8KB pages are read);
    2. More efficient use of the database cache, because SQL Server needs to cache fewer index pages, which reduces the logical reads required for index pages in memory;
    3. Less database storage space.

The following example illustrates the benefits of a narrow index:

In the first test, the index contains only the name column:

In the second test, the index includes two more columns:

The index with the two extra columns needs more logical reads than the single-column index. Why? Because the wider index takes up more space: each index page holds fewer index rows, so more pages have to be read.
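The comparison can be reproduced along the following lines. This is only a sketch: the table, the two extra columns (city, email), and the index names are hypothetical, and the table needs to be filled with a realistic number of rows before the difference in logical reads becomes visible.

```sql
-- Hypothetical demo table; only the "name" column is mentioned in the text.
CREATE TABLE dbo.person_demo (
    Id    int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    name  nvarchar(50)  NOT NULL,
    city  nvarchar(50)  NOT NULL,
    email varchar(100)  NOT NULL
);

-- Narrow index: a single key column.
CREATE NONCLUSTERED INDEX IX_person_demo_name
    ON dbo.person_demo (name);

-- Wide index: the same leading column plus two more key columns.
CREATE NONCLUSTERED INDEX IX_person_demo_name_city_email
    ON dbo.person_demo (name, city, email);

SET STATISTICS IO ON;

-- Scan each index in turn; the Messages output reports the logical reads.
SELECT COUNT(*) FROM dbo.person_demo WITH (INDEX (IX_person_demo_name));
SELECT COUNT(*) FROM dbo.person_demo WITH (INDEX (IX_person_demo_name_city_email));

SET STATISTICS IO OFF;
```

The table hints are used here only to make each query read a specific index, so that the two read counts can be compared directly.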

Third, the selectivity of the index column

Indexes, especially nonclustered indexes, are most useful when the indexed column is highly selective. Selectivity is the proportion of unique values in a column: the higher the proportion of unique values, the higher the selectivity and the more useful the index. If a table has 2000 records and the indexed column has 1980 distinct values, the selectivity of the index is 1980/2000 = 0.99.

As you learned in the earlier parts, a lookup in a nonclustered index is only the beginning. To find the actual data, you still need to perform a lookup into the clustered index for each matching row. Even with a nonclustered index on a heap, you still need several separate physical reads.

If a lookup in a nonclustered index is going to produce many additional lookups in the clustered index, a table scan might be better. The impact can be surprisingly large. As a rule of thumb, if the uniqueness of the indexed column does not reach 90%~95%, the lookup loop created by a nonclustered index is unlikely to be worthwhile. For example, if a gender column is stored as a bit and then indexed, the query optimizer will generally not use that index.

It follows that the selectivity of a primary key is 100%, and that the closer a column's selectivity is to that of the primary key, the more efficient an index on that column will be.

Selectivity is a way to measure how useful an index is. In an extreme case, where a table has 1000 records and the indexed column has only 5 distinct values, the selectivity of the index is very poor (only 0.005), and a full table scan is better than using the index.

Let's calculate the selectivity of an index. The test table is small, so SQL Server may not use the index even though it exists.

  

From the information above, the selectivity of the index on the Fdkeyname column is

110/119 = 0.924

Calculating this by hand is tedious. The following SQL statement computes the selectivity directly:

SELECT CAST(COUNT(DISTINCT fdkeyname) AS float) / CAST(COUNT(*) AS float) FROM Jm_keyword;
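A slightly more defensive variant of the same query is sketched below. It returns the two counts alongside the ratio and uses NULLIF so that an empty table yields NULL instead of a divide-by-zero error. The column aliases are illustrative; the table and column names are the ones used above.

```sql
SELECT
    COUNT(DISTINCT fdkeyname) AS distinct_values,   -- NULLs are not counted
    COUNT(*)                  AS total_rows,
    CAST(COUNT(DISTINCT fdkeyname) AS float)
        / NULLIF(CAST(COUNT(*) AS float), 0) AS selectivity
FROM Jm_keyword;
```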

  

One exception to the selectivity rule is the foreign key: if a column in the table is a foreign key, an index on it is probably useful. Why foreign keys and not other columns? Foreign keys are often the target of joins to the tables they reference. Indexes help join performance regardless of selectivity, because they make merge joins possible. A merge join takes a row from each table and compares them to see whether they match the join condition. Because both tables have indexes on the related columns, the two rows are found very quickly.

Here is an example that illustrates the selectivity problem:

I created an index on the gender column of a person table and then looked at how the query optimizer executes the query:

  

Why doesn't the query optimizer use the index on the gender column to find the data?

The query returns the first 10 rows whose gender is "male". Each value in the gender column matches about 50% of the rows. If SQL Server used the index, it would first find the clustered key (Id) values of the first 10 matching rows in the index, and then look up each row in the clustered index by Id, which is slower than scanning the clustered index directly. The query optimizer therefore ignores this index.

An index lookup can be forced with the table hint WITH (INDEX(index_name)). The following compares the number of reads for the two queries:
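The two variants might look like the sketch below. The table definition, the index name, and the choice of 1 to represent "male" in a bit column are assumptions made for illustration; the original queries are not shown in the text.

```sql
-- Hypothetical table with a low-selectivity gender column.
CREATE TABLE dbo.person_gender_demo (
    Id     int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    name   nvarchar(50) NOT NULL,
    gender bit          NOT NULL    -- assumption: 1 = male, 0 = female
);

CREATE NONCLUSTERED INDEX IX_person_gender_demo_gender
    ON dbo.person_gender_demo (gender);

SET STATISTICS IO ON;

-- Let the optimizer choose the access path.
SELECT TOP (10) *
FROM dbo.person_gender_demo
WHERE gender = 1;

-- Force the nonclustered index with a table hint.
SELECT TOP (10) *
FROM dbo.person_gender_demo WITH (INDEX (IX_person_gender_demo_gender))
WHERE gender = 1;

SET STATISTICS IO OFF;
```

Index hints override the optimizer's judgment, so they are usually best kept for experiments like this one rather than production queries.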

Reads with the forced index:

Reads with the plan chosen by the query optimizer:

As the results show, forcing the index leads to more logical reads but fewer read-ahead reads. Read-ahead reads are performed in parallel with query processing and load pages into the cache in advance. Judging from the plan SQL Server chooses, logical reads appear to cost more time here than read-ahead reads.

Fourth, check the data type of the index column

The data type of the indexed column also matters. For example, an index lookup on an integer key is very fast because the int data type is small and arithmetic on it is cheap. The other integer types (bigint, smallint, tinyint) also work well as index columns, whereas string data types (char, varchar, nchar, nvarchar) require string matching, which is usually more expensive than integer matching.

Suppose you want to create an index and have two candidate columns, one of type int and one of type char(4). Both data types occupy 4 bytes in SQL Server 2008, but you should still prefer the int column for the index. The value 1 in a char(4) column is actually stored as the character 1 followed by 3 spaces, that is, the 4 bytes 0x31, 0x20, 0x20, 0x20. The CPU cannot perform arithmetic on this data directly; the value has to be converted to an integer first. In the int data type, by contrast, the value 1 is stored as 0x00000001, and the CPU can perform arithmetic on it directly.
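The byte layouts of the two representations can be inspected directly by casting each value to varbinary. The column aliases below are illustrative only.

```sql
SELECT
    CAST(CAST('1' AS char(4)) AS varbinary(4)) AS char4_bytes,  -- 0x31202020: '1' plus three spaces
    CAST(1 AS varbinary(4))                    AS int_bytes,    -- 0x00000001
    DATALENGTH(CAST('1' AS char(4)))           AS char4_size,   -- 4 bytes
    DATALENGTH(CAST(1 AS int))                 AS int_size;     -- 4 bytes
```

Both values occupy 4 bytes, but only the int representation can be used in arithmetic without a conversion.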

Fifth, consider the index column order

The index key values are sorted on the first column of the index, and then, within each value of the first column, on the next column.

Suppose we create a composite index on the columns (C1, C2) of a table.

The data in the index is then ordered roughly as follows:

C1 C2
1 1
1 2
2 1
2 2
3 1
3 2
Assume that most of the queries on the table above are similar to the following statements (each filtering on a value such as 1 or 2):
SELECT * FROM table WHERE C2 = 1
SELECT * FROM table WHERE C2 = 1 AND C1 = 1

An index on (C2,C1) benefits both queries, but an index on (C1,C2) is inappropriate because it sorts on C1 first, whereas the first statement needs the data sorted on C2.
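The sketch below sets up the six rows from the listing and creates an index led by C2. The table name t1, the index name, and the filter values are hypothetical. On a table this small the optimizer will probably scan regardless, so the example only shows the shape of the index and of the queries it is meant to serve.

```sql
-- Hypothetical table holding the six rows listed above.
CREATE TABLE dbo.t1 (C1 int NOT NULL, C2 int NOT NULL);

INSERT INTO dbo.t1 (C1, C2)
VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2);

-- Index led by C2: a filter on C2 alone, or on C2 and C1 together,
-- can seek on the leading column.
CREATE NONCLUSTERED INDEX IX_t1_C2_C1 ON dbo.t1 (C2, C1);

SELECT C1, C2 FROM dbo.t1 WHERE C2 = 1;
SELECT C1, C2 FROM dbo.t1 WHERE C2 = 1 AND C1 = 1;

-- By contrast, an index on (C1, C2) is sorted by C1 first, so a filter
-- on C2 alone cannot seek on it and has to scan.
```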

It is like using a phone book, in which entries are sorted by surname and then by given name. If you know only the given name of the person you want to call, this sort order brings no benefit. If you know only that the surname is "Liu", however, the index narrows the search.

Sixth, consider the index type

SQL Server has two main types of index: clustered and nonclustered. Both are B-tree structures. The main difference is that the leaf pages of a clustered index are the data pages of the table, so the data in the table is stored in the order of the clustered index; in effect, the clustered index is the table. When you decide which index type to use, this difference at the leaf level becomes very important.

A table can have only one clustered index, so it should be chosen wisely.

By default, SQL Server creates a clustered index together with the primary key (unless the table already has a clustered index). If you do not want the primary key to be the clustered index, add the NONCLUSTERED keyword when you create the table.

CREATE TABLE mytablekeyexample (Column1 int IDENTITY PRIMARY KEY NONCLUSTERED, Column2 int)

Once an index is created, the only way to change its definition is to drop and rebuild it, so it needs to be right from the start.

If you change the column on which the clustered index is built, SQL Server has to reorder the entire table (because the table is stored in the order of the clustered index).

For a table with a lot of data, changing the clustered index means reordering a large amount of data, so consider the following points:

How long will it take?
Is there enough space? Reordering on a clustered index requires, on average, additional space of about 1.2 times the space the table already occupies. Make sure there is enough room for the operation.
Should you use the SORT_IN_TEMPDB option? If tempdb is on a different physical array from the main database and has enough space, the answer is yes.
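As a sketch of what such a change could look like, the statements below move a clustered index from one column to another, dropping and re-creating it in a single statement and sorting in tempdb. The table, columns, and index name are hypothetical, and the example assumes the clustered index does not back a PRIMARY KEY or UNIQUE constraint.

```sql
-- Hypothetical table with a clustered index on order_id.
CREATE TABLE dbo.orders_demo (
    order_id   int      NOT NULL,
    order_date datetime NOT NULL,
    amount     money    NOT NULL
);

CREATE CLUSTERED INDEX CIX_orders_demo ON dbo.orders_demo (order_id);

-- Check how much space the table currently occupies before rebuilding.
EXEC sp_spaceused N'dbo.orders_demo';

-- Drop and re-create the clustered index on a different column,
-- performing the sort in tempdb instead of the user database.
CREATE CLUSTERED INDEX CIX_orders_demo
    ON dbo.orders_demo (order_date)
    WITH (DROP_EXISTING = ON, SORT_IN_TEMPDB = ON);
```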

  1. When to use a clustered index

A clustered index is useful when the column is often the target of range queries. Such queries typically use BETWEEN or the < and > operators. Queries that use GROUP BY with the MAX, MIN, and COUNT aggregate functions are also good examples of queries that work on ranges and benefit from a clustered index. The clustered index suits these queries because the search can go straight to a specific point in the physical data, read until the end of the range, and then stop, which is very efficient. A clustered index is also an excellent choice when you want the data sorted on the clustered key (ORDER BY).

  2. When not to use a clustered index

There are two scenarios in which you might not want to create a clustered index on a column.

(1) When there is a better place to use it. Do not make a column the clustered index just because it looks suitable (the primary key is the most common culprit); first make sure that no other column is more appropriate.

(2) When a large number of rows will be inserted in non-sequential order. This causes page splits, which consume a lot of time.

For example, consider a trading system that uses values such as

Arxxxx

Glxxxx

Apxxxx

as the primary key with the default clustered index. Because the data is stored in clustered index order and new keys arrive out of order, page splits occur frequently during inserts and can cause brief pauses while data is being entered continuously.

Fortunately, there are ways to avoid this situation:

Choose a clustered key whose values are sequential at insert time. You can create an identity column for this purpose, or use another column that is logically sequential for every incoming transaction.

Choose not to use a clustered index on this table. This is often the best choice in a scenario like this, because inserts into a nonclustered index are generally faster than inserts on the clustered key.
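The first option could be sketched as follows: the business key stays the primary key but is declared NONCLUSTERED, and the clustered index goes on an identity column so that new rows are always appended at the end. All table, column, and constraint names here are hypothetical.

```sql
CREATE TABLE dbo.trade_demo (
    row_id   int IDENTITY(1,1) NOT NULL,   -- sequential at insert time
    trade_no char(6)           NOT NULL,   -- business key such as the prefixed values above
    amount   money             NOT NULL,
    -- The primary key stays on the business key, but is not clustered.
    CONSTRAINT PK_trade_demo PRIMARY KEY NONCLUSTERED (trade_no)
);

-- The clustered index goes on the sequential column instead.
CREATE UNIQUE CLUSTERED INDEX CIX_trade_demo_row_id
    ON dbo.trade_demo (row_id);
```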

When should you use clustered and nonclustered indexes

Action description                        Clustered index   Nonclustered index
Column is often grouped and sorted        should            should
Returning data in a range                 should            should not
One or very few distinct values           should not        should not
A small number of distinct values         should            should not
A large number of distinct values         should not        should
Column is frequently updated              should not        should
Foreign key column                        should            should
Primary key column                        should            should
Index column is frequently modified       should not        should

In fact, we can understand the table above through the earlier examples and the definitions of clustered and nonclustered indexes. Take returning data in a range. Suppose a table has a date column and the clustered index happens to be on that column. A query for all data between January 1, 2010 and January 1, 2013 will be very fast, because the table itself is sorted by date: the clustered index only needs to find the first and last rows of the range. With a nonclustered index, by contrast, you must first look up the page number of each item in the table of contents and then find the content on that page.

  3. Common misconceptions about index use in practice

Some of the pitfalls encountered in practice are listed below:

1. The primary key is the clustered index

This idea is badly mistaken and wastes the clustered index, even though SQL Server by default creates the clustered index on the primary key. Typically, we create an ID column in each table to distinguish the rows, and this ID column increments automatically, usually in steps of 1. Take our office automation system as an example. If you make the ID column the primary key, SQL Server will by default make this column the clustered index. The advantage is that the data is physically sorted by ID in the database, but that is of little use. The speed advantage of a clustered index is obvious, and since each table can have only one clustered index, that index is all the more valuable.

From the definition of a clustered index discussed earlier, its greatest benefit is that it quickly narrows the scope of a query according to the query conditions and avoids a full table scan. In practice, because ID numbers are generated automatically, we do not know the ID of each record, so queries rarely filter on the ID. Using this primary key as the clustered index is therefore a waste of resources. Second, an ID column in which every value is different does not conform to the rule that a clustered index should not be built on a column with a large number of distinct values. Admittedly, this only has a negative effect when users frequently modify records, especially the indexed columns; it does not affect query speed. In the office automation system, whether the home page shows the documents and meetings awaiting the user's sign-off or the user searches for files, every data query depends on two fields: the "date" and the user's own "user name".

Typically, the home page of an office automation system displays the files or meetings that each user has not yet signed for. Our WHERE clause could restrict the query only to items the current user has not signed for, but if the system has been running for a long time and holds a lot of data, a full table scan every time a user opens the home page makes little sense: the vast majority of users have already read the files from more than a month ago, so the scan only adds load to the database. Instead, when a user opens the home page, we can have the database query only the files from the last 3 months that the user has not read, using the "date" field to limit the scan and improve query speed. If your office automation system has been running for 2 years, the home page will theoretically display 8 times faster than before (3 months of data instead of 24).

The word "theoretically" is used because, if your clustered index is still blindly built on the ID primary key, the speedup will not be that large, even if you create a (nonclustered) index on the "date" field. Let's look at the speed of various queries against 10 million rows (of which 250,000 fall within the last 3 months):
(1) Clustered index on the primary key only, with no restriction on the time period:

Time taken: 128470 milliseconds (128 seconds)
(2) Clustered index on the primary key and a nonclustered index on Fariqi:

Time taken: 53763 milliseconds (54 seconds)
(3) Clustered index on the date column (Fariqi):

Time taken: 2423 milliseconds (2 seconds)

Although each statement extracts 250,000 rows, the differences between the cases are enormous, especially when the clustered index is on the date column. In fact, if your database really holds 10 million rows and the primary key (and with it the clustered index) is on the ID column, as in cases 1 and 2 above, the web page simply times out and cannot be displayed at all. This is one of the most important reasons why I gave up using the ID column as the clustered index.

The times above were measured by adding the following before each SELECT statement:
DECLARE @d datetime
SET @d = GETDATE()

and the following after the SELECT statement:
SELECT [Statement execution time (milliseconds)] = DATEDIFF(ms, @d, GETDATE())
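Put together, a timed query might look like the sketch below. The table and column names (Tgongwen, Gid, fariqi, neibuyonghu, title) are the ones that appear later in this article; the 3-month filter is an assumption about what the original test queries looked like, since they are not reproduced here.

```sql
DECLARE @d datetime;
SET @d = GETDATE();

-- Hypothetical test query: documents issued in roughly the last 3 months.
SELECT Gid, fariqi, neibuyonghu, title
FROM Tgongwen
WHERE fariqi > DATEADD(month, -3, GETDATE());

-- Elapsed time in milliseconds since @d was captured.
SELECT [Statement execution time (milliseconds)] = DATEDIFF(ms, @d, GETDATE());
```

Timings taken this way include the time needed to return the rows to the client, so they are best compared between runs made under the same conditions.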

2. Any index will significantly improve query speed

In fact, in the example above, statements 2 and 3 are identical and the indexed field is the same; the only difference is that the former has a nonclustered index on the Fariqi field and the latter a clustered index on it, yet the query speeds are vastly different. So simply creating an index on any field does not necessarily improve query speed. The Fariqi field has 5,003 distinct values in the table of 10 million rows, which makes it a good candidate for a clustered index. In reality, we issue several documents every day, and these documents share the same issue date, which fits the rule for a clustered index: the values should be neither mostly identical nor almost all different. Building an "appropriate" clustered index is therefore very important for query speed.

3. Adding every field whose queries need to be faster to the clustered index will improve query speed

As mentioned above, data queries depend on the "date" field and the user's own "user name". Since both fields are so important, we can combine them into a composite index (compound index).
Many people think that adding any field to the clustered index will improve query speed, while others wonder whether queries on the individual fields of a composite clustered index will be slower. With this question in mind, let's look at the following query speeds (each result set has 250,000 rows; the date column Fariqi is the leading column of the composite clustered index, and the user name column Neibuyonghu comes second):

Query speed: 2513 ms

Query speed: 2516 ms

(3) SELECT Gid, fariqi, neibuyonghu, title FROM Tgongwen WHERE neibuyonghu = 'office'

Query speed: 60280 ms

From this experiment we can see that a query that uses only the leading column of the composite clustered index is about as fast as one that uses all of its columns, and even slightly faster (for the same number of result rows), whereas the index has no effect at all if only a non-leading column of the composite clustered index is used as the query condition. Statements 1 and 2 run at the same speed because they return the same number of rows. If all the columns of the composite index are used and the result set is small, the index can "cover" the query and performance is optimal. Also keep in mind that, whether or not you frequently use the other columns of the clustered index, its leading column must be the most frequently used column.
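The three kinds of query compared in this experiment can be sketched as follows. The index name and the date literal are hypothetical, and the sketch assumes Tgongwen does not already have a clustered index; only the third statement is taken from the text.

```sql
-- Composite clustered index: date first, user name second.
CREATE CLUSTERED INDEX CIX_Tgongwen_fariqi_neibuyonghu
    ON Tgongwen (fariqi, neibuyonghu);

-- Leading column only: the index can be used to seek to the date range.
SELECT Gid, fariqi, neibuyonghu, title
FROM Tgongwen
WHERE fariqi > '20040505';

-- Both columns: the seek is still driven by the leading column.
SELECT Gid, fariqi, neibuyonghu, title
FROM Tgongwen
WHERE fariqi > '20040505' AND neibuyonghu = 'office';

-- Non-leading column only: the index order does not help, so the
-- whole table has to be scanned.
SELECT Gid, fariqi, neibuyonghu, title
FROM Tgongwen
WHERE neibuyonghu = 'office';
```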


Experience with index use not found in other books
1. A clustered index is faster than a primary key that is not the clustered index
Here are the example statements (each extracts 250,000 rows):

Time taken: 3326 ms

Time taken: 4470 ms
Here, the clustered index is nearly 1/4 faster than the primary key that is not a clustered index.

2. Using ORDER BY on the clustered index column is faster than on an ordinary primary key, especially with small data volumes

Time taken: 12936

Time taken: 18843

Here, ORDER BY on the clustered index is about 3/10 faster than ORDER BY on the ordinary primary key. In fact, when the amount of data is very small, sorting on the clustered index column is clearly faster than sorting on a nonclustered index column; when the data volume is large, for example more than 100,000 rows, the difference between the two is not obvious.

3. When a time range on the clustered index column is queried, the search time shrinks in proportion to the share of the table's data that is returned, regardless of how many clustered index columns are used

Time taken: 6343 milliseconds (extracts 1 million rows)

Time taken: 3170 milliseconds (extracts 500,000 rows)

Time taken: 3326 milliseconds (identical to the result in the previous section; when the number of rows returned is the same, the greater-than and equals operators perform the same)

Time taken: 3280 milliseconds

4. A date column does not slow down queries because its values include minutes and seconds

In the following example, the table holds 1 million rows. The 500,000 rows after January 1, 2004 have only two distinct dates, accurate to the day; the 500,000 rows before it have 5,000 distinct dates, accurate to the second.

Time taken: 6390 milliseconds

Time taken: 6453 milliseconds
5. Other points to note
"Water can carry a boat, and it can also overturn it"; the same is true of indexes. Indexes help improve retrieval performance, but too many or inappropriate indexes make the system inefficient, because every index a user adds to a table gives the database more work to do. Too many indexes can even cause index fragmentation.
So we want to build an "appropriate" index system, and the clustered index in particular deserves the greatest care, so that your database can achieve high performance.
Of course, in practice, as a conscientious database administrator, you should also test several schemes to find out which is the most efficient and effective.

Seventh, the importance of clustered indexes and how to select them

In the title of the previous section I wrote: a general-purpose paging stored procedure for both small and massive data volumes. That is because, when applying this stored procedure in the "office automation" system, I found that the third kind of stored procedure behaves as follows with small data volumes:

1. Paging generally takes between 1 and 3 seconds.

2. Querying the last page generally takes 5 to 8 seconds, whether the total is only 3 pages or 300,000 pages.
Although this paging implementation is very fast with very large data volumes, the 1 to 3 seconds it takes on the first few pages is slower than even the first, unoptimized paging method. In the users' words, it is "not even as fast as an Access database", and that impression is enough to make users abandon the system you developed.

I analyzed this and found that the cause is simple but very important: the sort field is not the clustered index!
The title of this article is "Query optimization and paging algorithm scheme." I put these two loosely related topics, "query optimization" and "paging algorithm", together because both depend on one very important thing: the clustered index.
As mentioned in the earlier discussion, the clustered index has two major advantages:
1. It narrows the scope of a query as quickly as possible.
2. It sorts fields as quickly as possible.

The first is used for query optimization, and the second for sorting data when paging.

Only one clustered index can be created per table, which makes the clustered index all the more important. Choosing the clustered index is arguably the most critical factor in achieving both "query optimization" and "efficient paging".
However, making the clustered index column satisfy both the needs of the query column and the needs of the sort column is often contradictory.
In the earlier "index" discussion, I used Fariqi, the date on which the user issued the document, as the leading column of the clustered index, with the date accurate to the day. As mentioned earlier, this gives a great advantage over the ID primary key column for fast queries on a time period.

When paging, however, this clustered index column contains duplicate values, so MAX or MIN cannot be used as the paging reference, and a more efficient sort is impossible. If the ID primary key column is used as the clustered index instead, the clustered index is useless for anything except sorting, which wastes this valuable resource.

To resolve this contradiction, I later added a date column with a default value of GETDATE(). When a user writes a record, the current time is written to this column automatically, accurate to the millisecond. Even so, to rule out the small chance of a collision, create a UNIQUE constraint on this column. Use this date column as the clustered index column.
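A table following this design might be declared as in the sketch below. All names are hypothetical, and the paging query is only one possible way to use the unique date column as a reference point; it is not the stored procedure discussed in this article.

```sql
CREATE TABLE dbo.gongwen_demo (
    Gid        int IDENTITY(1,1) NOT NULL,
    title      nvarchar(200)     NOT NULL,
    -- Filled in automatically when a row is inserted.
    created_at datetime          NOT NULL
        CONSTRAINT DF_gongwen_demo_created_at DEFAULT (GETDATE()),
    CONSTRAINT PK_gongwen_demo PRIMARY KEY NONCLUSTERED (Gid),
    -- The unique constraint doubles as the clustered index.
    CONSTRAINT UQ_gongwen_demo_created_at UNIQUE CLUSTERED (created_at)
);

-- Time-range query served by the clustered index.
SELECT Gid, title, created_at
FROM dbo.gongwen_demo
WHERE created_at >= DATEADD(month, -3, GETDATE());

-- Paging sketch: fetch the next 10 rows after the last value already shown.
DECLARE @last_seen datetime;
SET @last_seen = '20040101';   -- hypothetical reference value

SELECT TOP (10) Gid, title, created_at
FROM dbo.gongwen_demo
WHERE created_at > @last_seen
ORDER BY created_at;
```

Because the column is unique, an insert that collides with an existing timestamp fails rather than creating a duplicate, so the application would need to handle that error.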

With this time-based clustered index column, you can use the column both to query by the time period in which the data was inserted and, as a unique column, to apply MAX or MIN as the reference for the paging algorithm.
After this optimization, I found that paging generally takes a few dozen milliseconds, or even 0 milliseconds, with both large and small data volumes, and that narrowing the query by date range is no slower than before.
Because the clustered index is so important and valuable, I conclude that it must be built on:
1. the field you use most frequently to narrow the scope of a query;
2. the field you use most frequently for sorting.
