SQL Server Index Summary

Source: Internet
Author: User
Tags ssis
If we want to optimize the database, we can use the following five methods to optimize the database system:

If we want to optimize the database, we can use the following five methods to optimize the database system:

1.1.1 Summary
If we want to optimize the database, we can optimize the database system using the following five methods.

1. Computer hardware optimization
2. Application Optimization
3. Database INDEX OPTIMIZATION
4. SQL statement Optimization
5. Transaction Processing Optimization

In this blog post, we will want to talk about the index types and application scenarios in databases. This article takes SQL Server as an example and provides reference value for other technical platforms, just replace the corresponding code!

Indexing allows the database engine to perform faster and targeted data retrieval, rather than simply Full table scan ).

To use valid indexes, we must have an understanding of the index structure, and we know that adding an index to a data table requires the creation and maintenance of an index table, therefore, we need to globally determine whether adding an index can improve the query performance of the database system.

On the physical layer, databases are composed of data files, which can be a file group and then stored on disks. Each file contains many partitions. The size of each partition is 64 KB and consists of eight physical consecutive pages (8 KB for one page ), we know that pages are the basic unit of data storage in SQL Server databases. Disk space allocated to data files (. mdf or. ndf) in the database can be logically divided into pages (from 0 to n consecutive numbers ).

The types stored on the page include data, index, and overflow.

Files and file groups
In SQL Server, the logical object of the file group is used to manage the files that store data.

1.1.2 text

On the physical layer, databases are composed of data files, which can be a file group and then stored on disks. Each file contains many partitions. The size of each partition is 64 KB and consists of eight physical consecutive pages (8 KB for one page ), we know that pages are the basic unit of data storage in SQL Server databases. Disk space allocated to data files (. mdf or. ndf) in the database can be logically divided into pages (from 0 to n consecutive numbers ).

The types stored on the page include data, index, and overflow.

Files and file groups
In SQL Server, the logical object of the file group is used to manage the files that store data.

Database file organization

The top layer is our database. Because the database is composed of one or more file groups, and the file group is a logical group composed of one or more files, therefore, we can scatter file components to different disks so that user data can be moved across multiple devices and multiple I/O operations as much as possible to avoid I/O competition, this balances the I/O load and overcomes the access bottleneck.

Zones and pages
As shown in 2, the file is composed of zones, and the zones are composed of eight physical consecutive pages. Because the partition size is 64 K, therefore, every time a partition file is added, 64 K is added.

File Composition

The data types saved on the page include table data, index data, overflow data, allocation ing, page free space, and index allocation, as shown in:
Data Types saved in table 1

On the data page, data rows are placed in sequence immediately after the page header. The page header contains the ID value, such as the page number or object ID of the object data. The data row holds the actual data; finally, the end of the page is a row offset table. For each row on the page, each row Offset Table contains an entry. Each entry records the distance between the first byte of the corresponding row and the page header, the order of entries in the row Offset Table is the opposite to that of the row on the page.

Data Page

Basic index structure
"Index provides the query speed" is the most basic explanation of the Index. Next we will introduce the composition of the Index so that you can have a deeper understanding of the Index.

An index is a unique structure in a database. Because it stores database information, we need to allocate disk space to it and maintain the index table. Creating an index does not change the data in the table. Instead, it only creates a new data structure pointing to a data table. For example, when we use a dictionary for word search, first, we need to know the start Letter of the query word, then go to the directory page, and then find the specific page of the word. At this time, our directory is the index table, and the Directory item is the index.

Of course, indexes are more complex than dictionary directories, because the database must process insert, delete, update, and other operations, which will cause index changes.

Leaf node

Assume that the data on our disk is physically ordered, data changes will inevitably occur when the database is inserted, deleted, and updated. If we want to save the continuous and orderly data, then we need to move the physical location of the data, which will increase the disk I/O, so that the entire database runs very slowly; the main purpose of using indexes is to make the data logic orderly, make data independent from physical and ordered storage.

To achieve data logic order, the index uses the data structure of the two-way linked list to maintain the data logic order. To insert a new node into two nodes, you only need to modify the node's precursor and successor, you do not need to modify the physical location of the new node.

A two-way linked list (Doubly linked list) is also called a double-Link Table. Each of its data nodes has two pointers pointing to the direct successor and direct precursor respectively. Therefore, starting from any node in the two-way linked list, you can easily access its precursor node and successor node.

Theoretically, it takes O (1) to delete an element from a two-way linked list. If you want to delete an element with a specific keyword, in the worst case, the time complexity is O (n ).

During the deletion process, we only need to connect the front and back nodes of the nodes to be deleted, and then set the front and back nodes of the nodes to be deleted to null.

The Code is as follows:
// Pseudocode
Node. prev. next = node. next;
Node. next. prev = node. prev;
Node. prev = node. next = null;



Indexed leaf nodes and corresponding table data

As shown in figure 4, the index leaf node contains the index value and the corresponding RID (ROWID), and the leaf node is connected in an orderly manner through a two-way linked list. At the same time, the data table is different from the index leaf node, the data in the table is unordered. They are not all stored in the same table block and there is no connection between the blocks.

In general, the index stores the physical address values of specific data.

Index type

We know that there are two types of indexes: clustered index and non-clustered index.
Clustered index: physical storage is sorted by index.
Non-clustered index: physical storage is not sorted by index.

Clustered Index

Data Pages of clustered indexes are stored physically and orderly, and data pages are leaf nodes of clustered indexes. Data Pages are connected through a two-way linked list, and the actual data is stored in the data page. After an index is added to a table, the data in the table is sorted by the index.
Suppose we have a table T_Pet, which contains four fields: animal, name, sex, and age, and uses animal as the index column. The specific SQL code is as follows:
The Code is as follows:
-----------------------------------------------------------
---- Create T_Pet table in tempdb.
-----------------------------------------------------------
USE tempdb
Create table T_Pet
(
Animal VARCHAR (20 ),
[Name] VARCHAR (20 ),
Sex CHAR (1 ),
Age INT
)
Create unique clustered index T_PetonAnimal1_ClterIdx ON T_Pet (animal)

-----------------------------------------------------------
---- Insert data into data table.
-----------------------------------------------------------
The Code is as follows:
DECLARE @ I int
SET @ I = 0
WHILE (@ I <1000000)
BEGIN
Insert into T_Pet (
Animal,
[Name],
Sex,
Age
)
SELECT [dbo]. random_string (11) animal,
[Dbo]. random_string (11) [name],
'F' sex,
Cast (floor (rand () * 5) as int) age
SET @ I = @ I + 1
END
Insert into T_Pet VALUES ('your dark', 'Hello', 'F', 1)
Insert into T_Pet VALUES ('cat', 'Kitty ', 'F', 2)
Insert into T_Pet VALUES ('horse', 'M', 'F', 1)
Insert into T_Pet VALUES ('turtles ', 'ssis', 'F', 4)
Insert into T_Pet VALUES ('dog ', 'tomma', 'F', 2)
Insert into T_Pet VALUES ('donkey', 'yoyo ', 'F', 3)


Clustered Index

As shown in figure 5, the first and second layers from left to right are index pages, the third layer is data pages (leaf nodes), and data pages are connected through a two-way linked list, in addition, the data on the data page is sorted by index. Suppose we want to find the animal Ifcey whose name (name) is Xnnbqba. Here we use animal as the table index, so the database first searches by index, when the index value animal = 'ifcey is found, search the data page (leaf node) of the index to obtain specific data. The specific query statement is as follows:
The Code is as follows:
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON

SELECT animal, [name], sex, age
FROM T_Pet
WHERE animal = 'ifcey'

SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF

When we finish the SQL query plan, place the cursor over "clustered index search", and the information such, we can see an important piece of information: Logical Operation -- Clustered Index Seek. SQL queries directly obtain records based on Clustered indexes, with the fastest query speed.

Query plan

From the query results, we found that the query step is only two steps. First, we quickly find the Index Ifcey through Clustered Index Seek, and then query the leaf node (data page) of the Index to obtain data.

Query execution time: CPU time = 0 ms, occupied time = 1 ms.

Query Result

Now we delete the indexes in the Table and re-execute the query plan. We can see that the Logical Operation has changed to Table Scan. Because the Table contains 1 million rows of data, the query speed is quite slow.

Figure 8 query plan

From the query results, we find that the query step has changed to three steps. First, we use Table Scan to find animal = 'ifcey'. When executing the query, SQL Server automatically analyzes the SQL statement, it also estimates that this query is time-consuming, so the database performs concurrent operations to speed up the query.

Query execution time: CPU time = 329 ms, occupied time = 182 Ms.

Figure 9 query results

Through the above comparison of clustered indexes and none, we found the difference in query performance. If the index database is used, the database first searches for the index, instead of the full table traversal.

Non-clustered Index

Without clustered indexes, the data pages in the table are stored in Heap mode, and the Heap is a table without clustered indexes; heap Storage in SQL Server stores new data rows to the last page.

Non-clustered indexes are physical storage not sorted by indexes. Index leaf pages of non-clustered indexes contain pointers or clustered indexes pointing to specific data rows, no links between data pages are relatively independent pages.

Suppose we have a table T_Pet, which contains four fields: animal, name, sex, and age, and uses animal as a non-index column. The specific SQL code is as follows:
The Code is as follows:
-----------------------------------------------------------
---- Create T_Pet table in tempdb with nonclustered index.
-----------------------------------------------------------
USE tempdb
Create table T_Pet
(
Animal VARCHAR (20 ),
[Name] VARCHAR (20 ),
Sex CHAR (1 ),
Age INT
)
Create unique nonclustered index T_PetonAnimal1_NonClterIdx ON T_Pet (animal)




0 non-clustered Index

Then we need to query the pet information of animal = 'cat' In the table. The specific SQL code is as follows:
The Code is as follows:
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON

SELECT animal, [name], sex, age
FROM T_Pet
WHERE animal = 'cat'

SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF

As shown in, we found that there are two steps at the far right of the query plan: RID and index search. These two search methods are slow compared with Clustered Index search (Clustered Index Seek ).

1. query plan

First, SQL Server looks for the index value, and then searches for the data row based on the RID until the results that meet the query conditions are found.

Query execution time: CPU time = 0 ms, occupied time = 1 ms

2. query results

Non-clustered index of heap table

Because the heap does not contain clustered indexes, leaf nodes of non-clustered indexes contain pointers to specific data rows.

The previous T_Pet table is used as an example. If T_Pet uses the animal column as a non-clustered index, the non-clustered Index Structure of the heap table is shown in:

3. Non-clustered indexes of heap tables

We found that non-clustered indexes are connected through a two-way linked list, and leaf nodes contain pointers to specific data rows.

If we want to find the animal = 'dog 'information, first we traverse the first layer of index, then the database judges that the Dog belongs to the Cat range index, and then traverses the second layer index, then, find the Dog index to obtain the saved pointer information and obtain the data on the corresponding data page based on the pointer information. Next, we will explain it through a specific example.

Now we create the table employees and add non-clustered indexes to the heap table. The specific SQL code is as follows:
The Code is as follows:
USE tempdb
---- Creates a sample table.
Create table employees (
Employee_id numeric not null,
First_name VARCHAR (1000) not null,
Last_name VARCHAR (900) not null,
Date_of_birth DATETIME,
Phone_number VARCHAR (1000) not null,
Junk CHAR (1000 ),
CONSTRAINT employees_pk primary key nonclustered (employee_id)
);

GO now we are looking for information about employees whose employee_id is 29976.
The Code is as follows:
SELECT *
FROM employees
WHERE employee_id = 29976

Shows the query plan:

4. query plan

First, find the index with the index value employee_id = '20160301' and then find the qualified data rows based on the RID. Therefore, the query efficiency of the heap table index is not as good as that of the clustered table, next we will introduce the non-clustered index of the clustered table.

Clustered table non-clustered Index

When a clustered index exists on a table, any leaf node without a clustered index no longer contains the pointer value but the index value of the clustered index.

In the previous T_Pet table, for example, if T_Pet uses the animal column as a non-clustered index, the index table's non-clustered index structure is shown in:

5. Non-clustered index of the index table

We found that non-clustered indexes are connected through a two-way linked list, and the leaf node contains the index value of the index table.

If we want to find the animal = 'dog 'information, first we traverse the first layer of index, then the database judges that the Dog belongs to the Cat range index, and then traverses the second layer index, find the Dog index to obtain the saved index value, and then obtain the data on the corresponding data page based on the index value.

Next, we modify the previous employees table. First, we delete the non-clustered index of the previous heap table and then add the non-clustered index of the index table. The specific SQL code is as follows:
The Code is as follows:
Alter table employees
Drop constraint employees_pk

Alter table employees
Add constraint employees_pk primary key clustered (employee_id)
GO

SELECT * FROM employees
WHERE employee_id = 29976

6. query plan

Index Validity
Each time SQL Server executes a query, it first needs to check whether the query has an execution plan. If not, an execution plan is generated. What is an execution plan? In short, it can help SQL Server develop an optimal query plan. (For details about the query plan, refer to here)

The following describes how to use indexes in SQL Server through a specific example. First, we define a table named testIndex, which contains three fields: testIndex, bitValue, and filler. The specific SQL code is as follows:
The Code is as follows:
-----------------------------------------------------------
---- Index Usefulness sample
-----------------------------------------------------------

Create table testIndex
(
TestIndex int identity (1, 1) constraint PKtestIndex primary key,
BitValue bit,
Filler char (2000) not null default (replicate ('A', 2000 ))
)

Create index XtestIndex_bitValue on testIndex (bitValue)
GO

Insert into testIndex (bitValue)
VALUES (0)
GO 20000 -- runs current batch 20000 times.

Insert into testIndex (bitValue)
VALUES (1)
GO 10 -- puts 10 rows into table with value 1

Then, we query the data rows with bitValue = 0 in the table, and there are 2000 rows with bitValue = 0 in the table.
The Code is as follows:
SELECT *
FROM testIndex
WHERE bitValue = 0

7. query plan

Now we query the data rows whose bitValue is 1.

SELECT * FROM testIndexWHERE bitValue = 1

8. query plan

Now we have noticed that different data queries to the same table actually execute different query plans. What is the cause?

You can use DBCC SHOW_STATISTICS to view the detailed usage of the table index. The specific SQL code is as follows:
The Code is as follows:
Update statistics dbo. testIndex
DBCC SHOW_STATISTICS ('dbo. testindex', 'xtestindex _ bitvalue ')
WITH HISTOGRAM


9 Histogram

Through the histogram above, we know that SQL Server estimates that bitValue = 0 has approximately 19989 rows of data rows, while bitValue = 1 estimates about 21. the SQL Server optimizer estimates the value based on the amount of data, different execution plans are adopted to achieve optimal query performance. Because bitValue = 0 has a large amount of data, SQL Server can only scan clustered indexes to obtain corresponding data rows, bitValue = 1 actually only has 10 rows of data. SQL Server first searches for data rows with bitValue = 1 by using the key, and then links the nested loop to the clustered index to obtain the remaining data rows.

Summary of the complete instance code:
The Code is as follows:
-- ===================================================== ======
-- Author: JKhuang
-- Create date: 04/20/2012
-- Description: Create sample for Clustered and
-- Nonclustered index.
-- ===================================================== ======

-----------------------------------------------------------
---- Create T_Pet table in tempdb with nonclustered index.
-----------------------------------------------------------
USE tempdb
Create table T_Pet
(
Animal VARCHAR (20 ),
[Name] VARCHAR (20 ),
Sex CHAR (1 ),
Age INT
)
Create unique nonclustered index T_PetonAnimal1_NonClterIdx ON T_Pet (animal)
Create unique clustered index T_PetonAnimal1_ClterIdx ON T_Pet (animal)
-----------------------------------------------------------
---- Insert data into data table.
-----------------------------------------------------------
DECLARE @ I int
SET @ I = 0
WHILE (@ I <1000000)
BEGIN
Insert into T_Pet (
Animal,
[Name],
Sex,
Age
)
SELECT [dbo]. random_string (11) animal,
[Dbo]. random_string (11) [name],
'F' sex,
Cast (floor (rand () * 5) as int) age
SET @ I = @ I + 1
END
Insert into T_Pet VALUES ('your dark', 'Hello', 'F', 1)
Insert into T_Pet VALUES ('cat', 'Kitty ', 'F', 2)
Insert into T_Pet VALUES ('horse', 'M', 'F', 1)
Insert into T_Pet VALUES ('turtles ', 'ssis', 'F', 4)
Insert into T_Pet VALUES ('dog ', 'tomma', 'F', 2)
Insert into T_Pet VALUES ('donkey', 'yoyo ', 'F', 3)

SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SELECT animal, [name], sex, age
FROM T_Pet
WHERE animal = 'cat'
SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF

-----------------------------------------------------------
---- Create employees table in tempdb.
-----------------------------------------------------------
Create table employees (

Employee_id numeric not null,
First_name VARCHAR (1000) not null,
Last_name VARCHAR (900) not null,
Date_of_birth DATETIME,
Phone_number VARCHAR (1000) not null,
Junk CHAR (1000 ),
-- PK constraint defaults to clustered
CONSTRAINT employees_pk primary key (employee_id)
);
GO

-----------------------------------------------------------
---- Insert data into data table.
-----------------------------------------------------------
Create view rand_helper as select rnd = RAND ();
GO
---- Generates random string function.
Create function random_string (@ maxlen int) returns varchar (255)
AS BEGIN
DECLARE @ rv VARCHAR (255)
DECLARE @ loop int
DECLARE @ len int
SET @ len = (select cast (rnd * (@ maxlen-3) as int) + 3
FROM rand_helper)
SET @ rv =''
SET @ loop = 0
WHILE @ loop <@ len BEGIN
SET @ rv = @ rv
+ CHAR (CAST (SELECT rnd
FROM rand_helper) * 26 as int) + 97)
IF @ loop = 0 BEGIN
SET @ rv = UPPER (@ rv)
END
SET @ loop = @ loop + 1;
END
RETURN @ rv
END
GO
---- Generates random date function.
Create function random_date (@ mindaysago int, @ maxdaysago int)
Returns varchar (255)
AS BEGIN
DECLARE @ rv datetime
SET @ rv = (SELECT GetDate ()
-Rnd * (@ maxdaysago-@ mindaysago)
-@ Mindaysago
FROM rand_helper)
RETURN @ rv
END
GO
---- Generates random int function.
Create function random_int (@ min int, @ max int) RETURNS INT
AS BEGIN
DECLARE @ rv INT
SET @ rv = (SELECT rnd * (@ max) + @ min
FROM rand_helper)
RETURN @ rv
END
GO
---- Inserts data into employees table.
WITH generator (n)
(
Select 1
Union all
Select n + 1 from generator
Where N <1, 30000
)
Insert into employees (employee_id
, First_name, last_name
, Date_of_birth, phone_number, junk)
Select n employee_id
, [Dbo]. random_string (11) first_name
, [Dbo]. random_string (11) last_name
, [Dbo]. random_date (20*365, 60*365) dob
, 'N'/A' phone
, 'Junk' junk
From generator
OPTION (MAXRECURSION 30000)
-----------------------------------------------------------
---- Index Usefulness sample
-----------------------------------------------------------
Create table testIndex
(
TestIndex int identity (1, 1) constraint PKtestIndex primary key,
BitValue bit,
Filler char (2000) not null default (replicate ('A', 2000 ))
)
Create index XtestIndex_bitValue on testIndex (bitValue)
GO
Insert into testIndex (bitValue)
VALUES (0)
GO 20000 -- runs current batch 20000 times.
Insert into testIndex (bitValue)
VALUES (1)
GO 10 -- puts 10 rows into table with value 1
SELECT filler
FROM testIndex
WHERE bitValue = 1
Update statistics dbo. testIndex
DBCC SHOW_STATISTICS ('dbo. testindex', 'xtestindex _ bitvalue ')
WITH HISTOGRAM

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.