Objective
An index is a structure that sorts the values of one or more columns in a database table, and uses an index to quickly access specific information in a database table. In popular understanding, database indexing is the index of a dictionary in real life.
Advantages and disadvantages of indexes
Index can avoid full table scan;
Create a system-unique index to guarantee the uniqueness of each row of data;
Greatly improve the speed of data retrieval;
Speed up the link between table and table, especially the table with main and foreign key relationship;
When retrieving data using the ORDER BY and GROUPBY clauses, the time for grouping and sorting can be significantly reduced;
It takes time to create and maintain an index, which increases as the amount of data increases;
Indexes need to occupy physical space, in addition to data tables occupy data space, each index also occupies a certain amount of physical space, if you want to establish a clustered index, then the space will be larger;
When the data in the table is added, deleted and modified, the index should be maintained dynamically, thus reducing the maintenance speed of the data;
Suggestions for creating indexes
There is no need to create indexes on each column, but creating too many indexes can affect performance. Therefore, when creating an index, you should carefully consider which columns you can create an index on, and on which columns you cannot create the index. Here are some summary recommendations:
- The column to which the index should be created:
columns that are often searched, that is, the columns followed by select;
On the column of the primary key;
On the columns that are used to connect the foreign keys frequently;
Create an index on a column that often needs to be searched by scope, because the index is sorted and its specified range is continuous;
The use of the column in the WHERE clause to create an index to speed up the judgment of the condition;
- Indexed columns should not be created:
Columns that are seldom used in a query should not create an index. The creation of words is not worth the candle;
There is no need to create an index for those with very little data, because these columns have very few values, such as the gender column in the employee table, which is typically bit type, it has two values of 0 and 1, but it is not unique, every row, and the index searches for all 0 or 1 rows, causing the search scope to become larger. Therefore, increasing the index does not significantly speed up the retrieval;
columns for text, image, varchar (max), and bit data types should not be indexed because the amount of data in these columns is quite large or the value is small;
You should not create an index when the performance of the modification is far greater than the retrieval performance. From the downside of the index above, you know that modifying performance and retrieving performance are contradictory. When you increase the index, the retrieval performance is improved, but the performance of the modification is reduced. When you reduce the index, you increase the performance of the modification and reduce the retrieval performance. Therefore, the index should not be created when the performance of the modification is far greater than the retrieval performance;
Type of index
Depending on whether the index is the same as the physical order of the data table, the index can be divided into two types: a clustered index with the same physical order of the data table as the index order, and a nonclustered index with a different physical order of the data table than the index order.
- Recommendations for using clustered and nonclustered indexes:
Each table can have only one clustered index, because the physical order of the data in the table is unique
The clustered index is created before any nonclustered indexes are created, because the clustered index alters the physical order of rows in the table, the data navigation is arranged in a certain order, and the order is maintained automatically.
The average size of a clustered index is approximately 5% of the data table, but the actual size of the clustered index often varies depending on the size of the indexed column, so make sure you have enough space to create the clustered index
Principles of database indexing
To better use the index, you need to understand how the index is structured in a database. How is it stored in the database? This is explained in SQL Server as an example.
In a SQL Server system, the smallest unit of space that can be managed is the page, a page that is a 8KB-byte physical space, all of which are separated into such pages in the data file. When inserting data, the data is placed in the data according to time submission. In general, there is no connection between the order in which the data is placed and the logical relationship of the data itself. In addition, because the data is stored continuously, so there is a page to write, and then write to the next page of the situation, which is called page decomposition.
Tables and indexes are stored in pages, which are also contained in one or more partitions. A partition is a unit of user-defined data organization. By default, the table or index has only one partition, which contains all the table and index pages. The partition resides within a single filegroup.
When a table or index uses more than one partition, the data is partitioned horizontally so that the row groups are mapped to individual partitions based on the specified columns. A partition can be placed in one or more filegroups in the database. When you query or update data, the table or index is treated as a single logical entity. As shown
SQL Server tables use two methods to organize data pages in a partition: B-tree and heap. First, the heap.
Heap
Simply put, a heap is a table without a clustered index, and the data in the table is not in any order. The data in our tables are stored on the heap before they are indexed. (Note that the heap here differs from the heap on the. NET CLR)
By default, a heap has one partition. When a heap has multiple partitions, each partition has a heap structure that contains the data for that particular partition. For example, if a heap has four partitions, there are four heap structures, and each partition has a heap structure.
Depending on the data type in the heap, each heap structure will have one or more allocation units to store and manage the data for a particular partition. Each partition in each heap has at least one In_row_data allocation unit. If the heap contains large object (LOB) columns, then each partition of the heap will also have a Lob_data allocation unit. If the heap contains a variable length column that exceeds the 8060-byte size limit, the heap will also have a Row_overflow_data allocation unit for each partition.
SQL Server uses an IAM page to move through the heap. The data pages and rows within the heap are not in any particular order, nor are they linked together. The only logical connection between data pages is the information that is recorded within the IAM page. You can scan the IAM page for a table scan of the heap or a serial read operation to find the extents of the pages that hold the heap. Because IAM represents them in the order in which extents exist in the data file, this means that serial heap scans are continuous along each file. Using an IAM page to set the scan order also means that the rows in the heap are generally not returned in the order in which they were inserted.
Describes how the SQL Server database engine uses an IAM page to retrieve data rows from a heap with a single partition.
B-tree
Clustered and nonclustered indexes are created with b-tree, and each page in index B-tree is called an index node. The top node of the B-tree is called the root node. The underlying node in the index is called a leaf node. Any index level between the root node and the leaf node is collectively referred to as the intermediate level.
Shows the structure of the B-tree
Clustered Index
A clustered index consists of index pages and data pages, index pages holding indexes and pointers to the next level, and data pages for storing data. Leaf nodes contain data pages. The root node and intermediate level nodes contain index pages. Each index page is linked in a two-way link list. In a clustered index, the data is always ordered in ascending order.
A clustered index is explicitly structured in a single partition.
Nonclustered Indexes
Nonclustered indexes have the same b-tree structure as clustered indexes, and the significant difference between them is the following two points:
- The data rows of the underlying table are not sorted and stored in the order of the nonclustered keys.
- The leaf layer of a nonclustered index is made up of index pages rather than data pages.
You can either use a clustered index to define a nonclustered index for a table or view, or you can define a nonclustered index based on the heap. Each index row in a nonclustered index contains a nonclustered key value and a row locator. This locator points to a clustered index or a data row in the heap that contains the key value.
Describes the nonclustered index structure in a single partition.
Characteristics of the Index
The index has two characteristics, that is, a uniqueness index and a conforming index.
A uniqueness index guarantees that all data in the indexed column is unique and does not contain duplicate data. If there is already a primary KEY constraint or uniqueness constraint in the table, SQL server automatically creates a unique index when the table is created or when the table is modified. However, you should never create a unique index as a guaranteed data-only method, but you should create a primary KEY constraint or uniqueness constraint to guarantee it.
A composite index is an index that is created on two or more columns. When searching, when two or more columns are a key value, it is best to create a conforming index on those columns.
Index and full-text indexing
A full-text index differs from a normal index. Normal indexes are maintained with the b-tree structure, while full-text indexing is a special type of markup-based functional index that is created and maintained by the SQL Server full-text engine service. Mainly used to improve the performance of a large number of text data searches.
Say it. Database-Index