Preface
Indexes are one of the most important objects in relational databases. They can significantly reduce disk I/O and logical read consumption and improve the query performance of SELECT statements. However, it is a double-edged sword. Improper use affects performance: it requires additional space to store the index information and requires additional overhead to maintain index synchronization when data is updated.
The index is like a directory in a dictionary. When you want to search for a word, you can first find the corresponding page number range in the directory based on its comparison/pinyin, then find the word in the range. If you do not have this directory (INDEX), you may need to repeat the entire dictionary to find the words you want.
Indexes in SQL Server are stored in B-Tree format, for example:
Create a clustered index to improve performance
RDBMS will face a decline in query performance as data grows. The index is specially designed to solve this problem. Clustered index is the basis of all indexes. Without a data table, it is a heap ). Clustered indexes determine the physical storage of data. Therefore, a table can have only one clustered index. The sys. partitions System View of SQL Server records information about all clustered indexes (their Index_ID is 1 ).
Clustered indexes can contain multiple fields (columns). You should select filtering fields that are frequently involved in most query statements and learn the following in advance:
- A field should contain a large number of non-repeated values. Example: ID card number
- By default, the primary key field will automatically create a clustered index, but this is not necessary. You can manually modify it to a non-clustered index)
- Fields are often used for filtering, that is, they are often used in WHERE, JOIN, order by, and group by statements.
- Fields are often involved IN comparison, that is, they are often involved IN >,<>=, <=, BETWEEN, IN operations.
- The shorter the field length, the better.
In addition, the following rules are recommended for clustered indexes if possible:
- All included fields are set to unique (unique) and not null)
- The shorter the length of a field, the better.
- Each table has a clustered index, and the fields frequently used in the WHERE field are used as the fields of the clustered index.
- Avoid creating clustered indexes on varchar columns.
We will compare the performance of 10 million pieces of data at a time (for the SQL statement generated for the test data, seeAppendix):
SELECT OrderDate, Amount, Refno FROM ordDemo WHERE Refno <3
Execution Plan before index creation:
CREATE CLUSTERED INDEX idx_refno ON ordDemo (refno)
GO
-Execute the same query again
SELECT OrderDate, Amount, Refno FROM ordDemo WHERE Refno <3
GO
Execution Plan after index creation:
By comparison, we can find that the I/O consumption is reduced from 0.379421 to 0.0571991, and from Table Scan processing to Index Seek.
Create a non-clustered index to improve performance
The index mentioned above can effectively improve the query performance, but since a table can only have one clustered index, a clustered index usually cannot contain all necessary columns, therefore, SQL Server allows us to create non-clustered indexes to meet this requirement.
[SQL Server 2005 and earlier versions allow creation of 249 non-clustered indexes; SQL Server 2008 and SQL Server 2012 allow creation of 999 non-clustered indexes]
Generally, when you create a unique key (unique key) on a field, SQL Server automatically creates a non-clustered index on the column. Sys. partitions system tables store information about non-clustered indexes (Index_ID> 1 ).
Before creating a non-clustered index for a table, check whether the table really needs a non-clustered index? Does the table have suitable fields to create non-clustered indexes?
This is because the poor indexing not only does not improve the performance, but also takes extra space to store the index and generate additional I/O operations!
The following rules should be followed when selecting fields for non-clustered indexes are created:
- A field should contain a large number of non-repeated values.
- Fields are often involved in equivalent (=) operations.
- Fields are often used for filtering, that is, they are often used in JOIN, order by, and group by statements.
Let's continue with the previous tests to see how fast the non-clustered index will increase:
SELECT OrderDate FROM ordDemo
WHERE OrderDate = '2011-11-28 20: 29: 00.000'
GO
The execution plan is as follows:
Create a non-clustered index and run the query again:
CREATE NONCLUSTERED INDEX idx_orderdate
on ordDemo(orderdate)
GO
SELECT OrderDate FROM ordDemo
WHERE OrderDate='2011-11-28 20:29:00.000'
GO
The comparison results are obvious. After the non-clustered index is created, the consumption of I/O Cost, CPU Cost, and Operator Cost is greatly reduced.
In our example, because the OrderDate field is not in the clustered index, the previous query is interpreted as an index scan. When we create a non-clustered index on OrderDate, the query will use this index and interpret it as index seek.
As more and more table data is available, the space for storing non-clustered indexes will become larger and will gradually affect the performance. In this case, you can create non-clustered indexes in an independent database file or file group to reduce the I/O operation pressure on the same file.
Reasonable index coverage to improve performance
Execute the following test SQL
SELECT OrderDate,OrderID FROM ordDemo
WHERE OrderDate='2011-11-28 20:29:00.000'
GO
After observing the execution plan, you will find that the query is parsed to index scan instead of the previous index seek? This is because the two indexes we have created do not contain the OrderId field.
Delete the non-clustered Index and re-create it (the OrderId field is also used as the Index field)
CREATE NONCLUSTERED INDEX idx_orderdate_orderId
on ordDemo(orderdate DESC,OrderId ASC)
GO
Execute the query again. The execution plan is shown in figure
The query is unexpectedly parsed as index seek again.
Note:
An index can contain up to 16 fields, and the length of these fields must be less than 900 bytes.
The following types cannot be used as index key fields (text, ntext, image, nvarchar (max), varchar (max), varbinary (max ))
Adjust the including columns to improve the performance.
The concept of indexed fields originated from SQL Server 2005, and SQL Server 2008 and 2012 also provide this function. It allows you to include non-key fields in non-clustered indexes, these fields will not be recorded in the index size (so we will not push the index field ceiling mentioned above ). In addition, these fields can be of any type except text, ntext, and image.
In the previous test case, OrderId is not a keyword segment, because it is not filtered in the WHERE clause, so it is not suitable to use it as the index keyword segment, now we use INCLUDE to create it as an INCLUDE field:
--Delete the previous index
DROP INDEX idx_orderdate_orderId ON ordDemo
GO
--Rebuild index
CREATE NONCLUSTERED INDEX idx_orderdate_Included
on ordDemo (orderdate DESC)
INCLUDE (OrderID)
GO
--Query again
SELECT OrderDate, OrderID FROM ordDemo
WHERE OrderDate = '2011-11-28 20: 29: 00.000'
GO
The execution plan is as follows:
In terms of performance, the optimization results in this section are almost the same as those in the previous section. However, if you use the include column index, you will be less limited, with the decrease of the index keyword segment, the index usage also decreases and the query becomes more efficient.
To sum up the basic principles for distinguishing index keyword fields and fields:
- The fields used in WHERE, order by, group by, and JOIN-ON are applicable to key fields.
- The fields used in SELECT and HAVING are applicable to include fields.
Use filtered index to improve performance
The filter index originated from SQL Server 2008. SQL Server 2012 also provides this function. You can regard it as a non-clustered index with a WHERE clause. Proper use can reduce the storage size and maintenance consumption of indexes and improve query performance.
The general index is to index each piece of data in the entire table, and the filter index only indexes records that meet specific conditions, this specific condition is defined by the WHERE clause when a filter index is created.
You can use a filter index in the following scenarios:
A giant table that contains years of data. In actual use, only the data of the current year is queried.
A table that records product categories, including many types that are no longer used after expiration.
An order table contains the OrderStartDate and OrderEndDate fields. The OrderEndDate is updated when the order is completed. Otherwise, the value is null. You can create a filter index on OrderEndDate so that it can be used when you need to query which orders are not completed.
Some settings are required when you create a filter index:
- ARITHABORT = ON
- CONCAT_NULL_YIELDS_NULL = ON
- QUOTED_IDENTIFIER = ON
- ANSI_WARNINGS = ON
- ANSI_NULLS = ON
- ANSI_PADDING = ON
- NUMERIC_ROUNDABORT = OFF
Let's take a look at the example:
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF
GO
CREATE NONCLUSTERED INDEX idx_orderdate_Filtered
on ordDemo(orderdate DESC)
INCLUDE(OrderId)
WHERE OrderDate = '2011-11-28 20:29:00.000'
GO
SELECT OrderDate,OrderID FROM ordDemo WHERE OrderDate='2011-11-28 20:29:00.000'
GO
I/O consumption is reduced from 0.0078751 in the previous section to 0.003125, and the optimization effect is remarkable.
Use column store indexes to improve performance
So far, we have discussed rowstore indexes. SQL Server 2012 supports column storage indexes.
Row-store indexes store data rows on the data page, and column-store indexes store data columns on the data page. Suppose we have a table (tblEmployee), which includes three columns: empId, FirstName, and LastName. Row/column storage indexes are stored as follows:
Obviously, when you need to search and filter a certain number of column values, the column storage index requires fewer data pages to be accessed, thus reducing the I/O overhead and improving the execution efficiency. Before you decide to use the column storage index, we recommend that you check the three points:
- Can your data table be set to read-only)
- Whether your data table is huge (millions or more)
- If your database is an OLTP, can you switch (ON/OFF) column storage indexes?
If the answer to the above three points is OK, you can start to use the column storage index, but you will be subject to the following restrictions:
- You cannot contain more than 1024 Fields
- The field types can only be:
Int
Big int
Small int
Tiny int
Money
Smallmoney
Bit
Float
Real
Char (n)
Varchar (n)
Nchar (n)
Nvarchar (n)
Date
Datetime
Datetime2
Small datetime
Time
Datetimeoffset (precision <= 2)
Decimal or numeric (precision <= 18)
Okay. Let's test the column storage index:
Run the following code. Based on the output execution plan, you can find that it has used the clustered index (Row-store index) We created earlier ).
SELECT
Refno
,sum(Amount) as SumAmt
,avg(Amount) as AvgAmt
FROM
ordDemo
WHERE
Refno>3
Group By
Refno
Order By
Refno
GO
Next, we will delete the existing Row Storage index and create a column storage index:
DROP INDEX idx_refno ON ordDemo
CREATE NONCLUSTERED COLUMNSTORE INDEX
idx_columnstore_refno
ON ordDemo (Amount,refno)
Execute the same query statement again. The execution plan is as follows:
Through comparison, we can find that the I/O consumption is significantly reduced :)
Note:: Because a column storage index is created, the table is read-only. If you want to restore to a writable state, you must delete this column storage index!
Appendix
SQL code for generating test data:
--Build a table
CREATE TABLE ordDemo (OrderID INT IDENTITY, OrderDate DATETIME, Amount MONEY, Refno INT)
GO
--Insert 100,000 test data
INSERT INTO ordDemo (OrderDate, Amount, Refno)
SELECT TOP 100000
DATEADD (minute, ABS (a.object_id% 50000), CAST ('2011-11-04' AS DATETIME)), ABS (a.object_id% 10), CAST (ABS (a.object_id% 13) AS VARCHAR)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
GO