Database indexes are divided into clustered indexes and non-clustered indexes. Clustered indexes are physical indexes, that is, the physical storage order of data. The leaf node of clustered indexes is the data row itself; non-clustered indexes are logical indexes and can be simply considered as indexes created for clustered indexes. Generally, the key of clustered indexes is the leaf node of non-clustered indexes (when include is not used ).
Selection of Indexes
There is no good choice for the index type. Generally, clustered indexes are required (with special requirements). Non-clustered indexes need to be set up flexibly. Therefore, the index mainly decides to create an index on those columns, especially for clustered indexes.
Clustered Index
Clustered indexes, as the most important indexes, are often ignored. The biggest advantage of clustered indexes is the high efficiency of large-scale data queries, therefore, the selection of clustered index columns often has a significant impact on database performance. To make full use of the advantages of clustered index in large-scale data search, we recommend that you select clustered index columns in the following order.
Select a priority for the clustered index field: time field> columns that will be queried in a wide range> fields with unique values that have practical significance> auto-incrementing column ID
1. time Field: if there is a time column in the table and the time increases in the order of data insertion (the time does not need to be unique to have duplicate values, even if it is a large range of duplicates ), we recommend that you use the time column as the first choice for clustered indexes. Reason: a huge advantage of clustered index is to search for a large range of data, and this advantage will become more and more obvious as the amount of data increases, in general, when we need to query a large data range, we will use the time column circumference as the filter condition. Because the clustered index does not have a bookmarked query and can be continuously scanned, the query speed will be very fast. It is best to insert data in the Time column in sequence so that disk fragments can be minimized, which is a relatively concentrated data storage for continuous data reading.
2. columns that will be queried in a large range: If the table does not have a time field or the time field is not suitable for clustered index, when creating a table, you can select columns that clearly know that a large range of data is frequently filtered, and preferably columns with lower selectivity (that is, columns with more repeated values, gender columns). You can use composite indexes if necessary. Reason: the main advantage of clustered index in data query is range data query, which makes clustered index unique and wastes this advantage.
3. practical fields with unique values: if you cannot find the columns that are suitable for condition 1 and condition 2, set the clustered index column to a unique column, it is best to find the actually meaningful and unique column. For example, the Order table can use the order number as the clustered index, and the order list uses the order number and product number as the combined clustered index. Reason: if we cannot find a suitable time field or a low-selection field, it is our most common choice to build a clustered index with the primary key.
We recommend that you set the unique clustered index as the primary key. Like the encoding method and variable name, we recommend that you use a self-explanatory column name. That is, you can see that the column name is the primary key, saving you the trouble to guess, for example, in the order table, you use an auto-incremental ID column as the primary key, and then create an ordercode column as the order number. When using this table, you have to doubt whether this ordercode is a unique drop, if you have a unique constraint on creating an auto-increment column ID in the order list, you may have such a question. It is still a trivial matter. If you forget to create a constraint on the unique column, maybe someday.ProgramPoor control gives you a big surprise.
4. auto-incrementing column ID: we cannot find the appropriate column in the first 3 conditions. Please use our auto-incrementing column ID, the auto-increment column ID is also the primary key that we use most frequently (by the way, it becomes a clustered index), and can better meet most of our needs. The auto-increment ID column is omnipotent. The Int type only occupies 4 bytes to fully meet the requirements of narrow indexes. The absolute sequential storage can effectively reduce index fragmentation, which fully conforms to our table habits, it is always correct to use an auto-increment ID column as the primary key.
Here, the key columns of clustered indexes are mainly for query consideration. Some Opinions hold that clustered indexes should always be set up as unique columns. I do not agree here. It is true that this is required in some special circumstances, however, in general, it is better to create columns with low selectivity and time columns so that clustered indexes can play a huge advantage in range data search. The SQL Server needs to create a unique identifier for the clustered index to locate duplicate data in the column. The extra overhead is very small, so small that it can completely ignore the advantage of range search.
Of course, when selecting columns, we should try to use narrow indexes as much as possible. The main reason is that we still get the benefits at the price we pay. If there is enough benefit, we can drop anything.Remember, our goal is to use clustered indexes to improve the efficiency of large-scale queries.
Non-clustered Index
Different from clustered indexes, non-clustered indexes can be created multiple times, which brings us great flexibility. After all, clustered indexes cannot satisfy all requirements, we need to rely more on non-clustered indexes. Remember that non-clustered indexes are not chinese cabbage. If you want to create a key as much as you want, creating an index is costly. Any data modification involving the index column will lead to index modification, the more data is indexed, the more overhead the data is stored, deleted, and changed. For non-clustered indexes, our goal isCover as many queries as possible with as few indexes as possible.
Column Selection sequence of non-clustered indexes (composite indexes): these columns are often used as query condition columns> Columns with high selectivity (the higher the selectivity, the better the uniqueness, the best)> columns sorted frequently
1. frequently Used as query condition columns: our queries are ever-changing. When creating an index, you must first consider which columns are frequently used for various queries, the column with a high usage frequency is used as the first column (pilot column) of the composite index. If a query does not use the pilot column in the composite index, this index will not be used in most cases, therefore, to reuse the composite index as much as possible, use more query columns as the first column of the composite index. (This applies to composite indexes of clustered indexes)
2. highly Selective columns: This is simple. Use highly selective columns as the pilot columns as much as possible. If you can filter by the first condition (whatever decision logic is needed =,>, <, like ), as long as the data range can be greatly reduced, it is used as the pilot column.
3. if conditions 1, 2, and 3 cannot be determined, Use columns that are frequently sorted. Many of our operations require sorting before further query, such as group, like and other operations are performed first to complete the next query.
Note: You can place the columns that are frequently returned in the include column of the index, and overwrite as many columns as possible without increasing the index key size. In this way, when you encounter certain queries, when you do not use the pilot column of the composite index, but do not feel it is worthwhile to create an index for it, if the field used for this query is indexed by the composite index, you can perform a non-clustered index scan to complete the query (When a non-clustered index achieves index coverage, non-clustered index scanning is more efficient than clustered index scanning.).
Below are some examples to illustrate
Create Table Orders (ID Int Identity Primary Key , -- The auto-increment column ID is used as the primary key, so that the clustered index is created by default. Ordercode Bigint Not Null , -- Order No. Price Decimal ( 18 , 2 ) Not Null , -- Order amount Userid Int Not Null , -- User ID [ Status ] Int Not Null , -- Order Status Posttime Datetime Not Null , -- Order Time Moblie Char ( 11 ) Not Null , -- User mobile phone number [ Address ] Nvarchar ( 200 ) Not Null -- Harvest address )
Create an order table. According to our habits, the ID of an auto-incrementing column is used as the primary key, and then a clustered index is created. Now let's take a look, for order tables, we generally generate order numbers in a group of rules, instead of simply using auto-increment IDs. Therefore, we have created ordercode for order numbers. Of course, order numbers must be unique, therefore, you need to create a unique constraint. After a few days someone else will use the order table or use this table by yourself, there will inevitably be some doubts. ordercode needs to be unique, is this table unique? First, check whether there is a unique constraint on ordercode. Then, you know that ordercode is unique, but I have to check it again. I forgot to confirm it again after a while. It's very troublesome. Let's look at our primary key ID. He's not doing anything, so he stays there, remove it, so the table changes
Create Table Orders (ordercode Bigint Not Null Primary Key , -- Order No. Price Decimal ( 18 , 2 ) Not Null , -- Order amount Userid Int Not Null , -- User ID [ Status ] Int Not Null , -- Order Status Posttime Datetime Not Null , -- Order Time Moblie Char ( 11 ) Not Null , -- User mobile phone number [ Address ] Nvarchar ( 200 ) Not Null -- Harvest address )
Now, no matter who or when you see this table, there is basically no doubt whether the ordercode is unique. An unremarkable small improvement brings great convenience, so it is necessary to self-interpret the primary key.
Let's take a look at the following queries that may be frequently used:
-- Query 1: All orders within a specified time period Select * From DBO. Orders Where Userid = 1 And Posttime Between ' 2012-6-1 ' And ' 2012-6-30 ' -- Query 2: specify a single user order Select * From DBO. Orders Where Userid = 1 And Ordercode= 22222222222 -- Query 3: orders in a specific State within a specified time period Select * From DBO. Orders Where Userid = 1 And Status = 1 And Posttime Between ' 2012-6-1 ' And ' 2012-6-30 ' -- Query 4: All orders within a specified time period Select * From DBO. Orders Where Posttime Between ' 2012-6-1 ' And ' 2012-6-30 ' And Status = 1
The following indexes may be created for optimal query speed:
Index 1:Create IndexIx_useridposttimeOnDBO. Orders (userid, posttime) index 2:Create IndexIx_useridordercodeOnDBO. Orders (userid, ordercode) index 3:Create IndexIx_useridstatusposttimeOnDBO. Orders (userid, status, posttime) index 4:Create IndexIx_posttimestatusOnDBO. Orders (posttime, status)
In the most pessimistic case, the above four indexes may exist at the same time. It is feasible to create a corresponding index for each query, but the cost is not too high. Don't forget that the index is not a Chinese cabbage. Therefore, we should cover as many queries with as few indexes as possible. Let's take a look at the index above. If only Index 1 is created, only Index 1 and index 3 can benefit from Index 1. Index 1 is not used in index 4 because Index 1 is not used, query 2 because clustered indexes do not require any additional non-clustered indexes. The existence of clustered indexes in index 2 is completely unnecessary. Therefore, index 2 is eliminated first. Looking at index 3, index 3 can overwrite query 1, query 3, and query 4. However, due to the order of index columns, it is basically ineffective when dealing with query 4, although the query 1 is effective but the results are not satisfactory, we made a simple adjustment to index 3 to swap the order of the posttime and status columns. After the modification, index 3 has no effect on the original query 3, in addition, the efficiency of query 1 is maximized.
Modified index 3:Create IndexIx_useridposttimestatusOnDBO. Orders (userid, posttime, status)
Now index 3 can well complete query 1 and query 3, so Index 1 can be deleted now, with only index 3 and index 4 left, we can see that the modified index 3 cannot be used for query 4 due to the pilot column problem. In order to make index 3 used for query 4, we modify index 3 again and put posttime in the first column of the index, keep the order of other columns unchanged
The modified index 3:Create IndexIx_posttimeuseridstatusOnDBO. Orders (posttime, userid, status)
We can see that index 3 can also be effectively used for query 4. However, if the Status column and userid column are exchanged for the pilot column, the query 4 efficiency can be improved, but query 1 will be affected, we consider that the Status column is usually several States, and there are dozens more, which is less selective than the userid. Therefore, we still put the userid column with higher selectivity in front, maximize query 1 and query 3 query efficiency. Let's take a look at the last index 4, index 4 and the effects on query 1 and query 4. Because query 1 already has index 3 available, the effect on query 1 is ignored, currently, only query 4 is available. We can see that query 4, index 3, and index 4 have an effect on query 4. There is no doubt that index 4 has a greater effect on query 4, however, considering the low selectivity of the Status column and the high cost of maintaining an index, index 3 can successfully complete query 4, so index 4 is deleted.
In this way, we have optimized and adjusted the four indexes for the four queries in the orders table, but only the modified index 3 is retained, and the index is changed from four to one, however, the query efficiency is not greatly affected.Use as few indexes as possible to complete as many queries as possible.
In the above demonstration, we used ordercode as the clustered index, and completed the query by better adjusting the non-clustered index. In the case of a large number, this would be OK, now, if the orders table has a large amount of data, execute our query 4. If the returned results reach tens of thousands, hundreds of thousands, or even more, it is very likely that the index will become invalid and a table scan will occur. In this case, unless the index used by our team to query 4 achieves index coverage, this is basically not realistic. So how can we solve this problem? At this time, the clustered index on the Time column begins to show the power. modify our orders table to create the clustered index to the posttime column.
-- Delete the primary key pk_orders of the original clustered Index Alter Table DBO. Orders Drop Constraint Pk_orders -- Create a primary key pk_orders for a non-clustered Index Alter Table DBO. Orders Add Constraint Pk_orders Primary Key Nonclustered (Ordercode) -- Create a clustered index on the posttime Column Create Clustered Index Ix_posttime On DBO. Orders (posttime)
After modification, the bookmarked search problem we are worried about disappears completely, because the index usually fails due to excessive bookmarked searches. When a large number of bookmarked searches occur, it is basically out-of-the-box with range queries, time columns are certainly used for large-scale data queries. Therefore, we recommend that you use time columns for clustered indexes. As to whether order query efficiency will decrease after the clustered index is updated, this can be basically ignored. Generally, the data volume will be small during order query, you can ignore the bookmarked search overhead, for example, query 2.
At this time, we may find that the index 3 we modified above has an embarrassing situation. The posttime of the pilot column of index 3 is now a clustered index, so index 3 is useful for queries 1, 3, and 4, and index 3 is also necessary. In general, we recommend that you do not use clustered indexes as the pilot columns for non-clustered indexes, it is only a suggestion to create an index that is equal to the value of White. If you delete index 3, query 1, query 3, and query 4, what is the efficiency, query 4 efficiency will certainly be improved. query 1 and query 3 need to be tested. Of course, if you can combine clustered indexes, you will not have this problem if you put userid in it.
--Create a clustered index on the posttime and userid columns of the Time columnCreate Clustered IndexIx_posttimeuseridOnDBO. Orders (posttime, userid)
Of course, this kind of modification should be done with caution and flexible modification should be made based on actual needs. Ideally, a dedicated read-only database replication should be established to create a clustered index that is best suited for scope search on the read-only database, create a clustered index that is most conducive to addition, deletion, and modification on the master database. All queries with low real-time requirements are transferred to read-only databases for execution, relatively speaking, queries that require a large range of data filtering do not require good real-time performance, but can only be performed in read-only databases. The primary database must be a small data volume query that requires high real-time performance.
Summary:
Clustered index has the advantage of large-scale data query efficiency, therefore, you need to create a clustered index on a time column with relatively low selectivity and is often used for range queries. (If the selectivity is too low, this is definitely not true for gender columns, too low selective column indexes are equivalent to white ones. For example, if you set an index on a gender column to filter out at least half of the data through a gender column, you will be wrong if the range is greatly reduced, this selectivity is usually directly ignored by the query optimizer, but it is not as fast as a table scan), giving full play to the advantages of clustered index large-scale data query.
Non-clustered indexes should try to use columns with higher selectivity to minimize the returned data volume, improve the index Reuse Rate by using composite indexes, and avoid creating too many useless indexes, if you find that a table has many non-clustered indexes, you can extract and analyze those indexes and then merge them to reduce the number of useless indexes to improve the overall performance.
you need to select an index based on your actual needs. The viewpoint described in this article is applicable to most cases, but a good index cannot be created overnight, in theory, the actual application of the index is often counterproductive. The validity of the index depends on the database statistics and other comprehensive considerations. Therefore, you must check the query plan and check the IO overhead after each index is created, check whether the query optimizer uses indexes as expected.