1.1.1 Summary
If you want to optimize the database, we can mainly through the following five ways to optimize the database system.
1. Computer Hardware tuning
2. Application Tuning
3. Database Index Optimization
4. SQL statement Optimization
5. Transaction Processing Tuning
In this blog, we will want to talk about the database index types and use of occasions, this article, for example, SQL Server for other technology platform friends also have reference value, as long as the replacement of the corresponding code on the line!
Indexing enables the database engine to perform faster, targeted data retrieval rather than simply whole table scans (full table scan).
In order to use a valid index, we must have an understanding of the composition of the index, and we know that adding an index to the datasheet necessarily requires creating and maintaining the index table, so we have to measure the ability of the index to improve the query performance of the database system globally.
On a physical level, a database consists of data files that can form filegroups and then be stored on disk. Each file contains many extents, each with a size of 64K consisting of eight physically contiguous pages (one page 8K), and we know that the page is the basic unit of data storage in a SQL Server database. The disk space allocated for the data files (. mdf or. ndf) in the database can be logically divided into pages (numbered consecutively from 0 to N).
The types stored in the page are: data, indexes, and overflows.
Files and Filegroups
In SQL Server, the file that holds the data is managed through the logical object of the filegroup.
1.1.2 Body
On a physical level, a database consists of data files that can form filegroups and then be stored on disk. Each file contains many extents, each with a size of 64K consisting of eight physically contiguous pages (one page 8K), and we know that the page is the basic unit of data storage in a SQL Server database. The disk space allocated for the data files (. mdf or. ndf) in the database can be logically divided into pages (numbered consecutively from 0 to N).
The types stored in the page are: data, indexes, and overflows.
Files and Filegroups
In SQL Server, the file that holds the data is managed through the logical object of the filegroup.
Figure 1 Database File organization
At the top level is our database, since a database is made up of one or more filegroups, and a filegroup is a logical group of one or more files, we can spread the filegroups across a variety of disks, allowing user data to span as many devices as possible, multiple I/O running, avoiding I/O competition, thereby balancing the I/O load, Overcome access bottlenecks.
Areas and Pages
As shown in Figure 2, the file is made up of a zone, and the zone is made up of eight physically contiguous pages, and since the size of the zone is 64K, the increment of 64K is added whenever one area file is added.
Figure 2 File composition
The data types saved in the page are: Table data, index data, overflow data, allocation mapping, page free space, index allocation, etc., as shown in the following illustration:
Page type |
Content |
Data |
When the text in row is set to on, data rows that contain all data except text, ntext, image, nvarchar (max), varchar (max), varbinary (max), and XML data are included. |
Index |
Index entries. |
Text/image |
Large Object data types: text, ntext, image, nvarchar (max), varchar (max), varbinary (max), and XML data. Variable-length data type columns when data rows exceed 8 KB: varchar, nvarchar, varbinary, and sql_variant |
Global Allocation Map, Shared Global allocation map |
Information about whether the zone is allocated. |
Page Free Spaces |
Information about page allocations and the available space for the page. |
Index allocation Map |
Information about the extents used by the tables or indexes in each allocation unit. |
Bulk Changed Map |
Information about the extents modified by the bulk operation from the last BACKUP LOG statement in each allocation unit. |
Differential Changed Map |
Information about the extents in each allocation unit that have changed since the last BACKUP DATABASE statement. |
Table 1 data types saved in the page
On a data page, the data rows are placed in order immediately following the header (header); The page header contains an identity value, such as the object ID of the page number or object data, the data row holds the actual data, and finally, the row offset table at the end of the page, and for each row in the page, each row offset table contains an entry. Each entry records the distance between the first byte of the corresponding row and the page header, and the order of entries in the row offset table is the opposite of the order of the rows in the page.
Figure 3 Data page
the basic structure of the index
"Index provides the speed of the query" This is the most basic explanation of the index, next we will introduce the composition of the index, let you have a more in-depth understanding of the index.
The index is a unique structure in the database, and because it holds the database information, we need to allocate disk space and maintain the index table. Creating an index does not change the data in the table, it simply creates a new data structure that points to the datasheet; For example, usually we use the dictionary to check the word, first we need to know the query word start letter, and then turn to the table of Contents page, and then find the word specific on which page, when our directory is the index table, and the directory item is indexed.
Of course, indexes are more complex than dictionary catalogs because the database has to handle operations such as inserts, deletes, and updates, which cause the index to change.
leaf Node
Let's say the data on our disks is physically ordered, then the database in the INSERT, delete and update operations, will inevitably lead to changes in data, if we want to save the continuous and orderly data, then we need to move the physical location of data, which will increase the disk I/O, so that the entire database running very slowly The main purpose of using indexes is to make the data logically orderly so that the data is stored independently of the physical order.
In order to realize the logical order of the data, the index uses the data structure of the bidirectional linked list to maintain the logic sequence of the database, if you want to insert a new node in two nodes, you need only modify the predecessor and successor of the node, and you do not need to modify the physical location of the new node
Two-way linked list (doubly linked list) is also called a double linked list, which is a kind of linked list, which has two pointers in each data node, pointing to direct successor and direct precursor respectively. Therefore, starting from any node in a two-way list, it is easy to access its predecessor nodes and subsequent nodes.
Theoretically, the time complexity of deleting an element from a two-way list is O (1), and if you want to delete an element with a given keyword, the worst-case time complexity is O (n).
In the process of deletion, we only need to connect the front and rear nodes of the node that will be deleted, and then the front and rear nodes of the node to be deleted are null.
Copy Code code as follows:
Pseudo code
Node.prev.next=node.next;
Node.next.prev=node.prev;
Node.prev=node.next=null;
Figure 4 leaf node and corresponding table data for the index
As shown in Figure 4 above, the Index leaf node contains the index value and the corresponding RID (ROWID). And the leaf nodes are connected in an orderly way through the bidirectional linked list, and we mainly to the data table is different from the index leaf node, the data in the table is unordered stored, they are not all stored in the same table block, and there is no connection between the blocks.
In general, the index holds the physical address value of the specific data.
Types of Indexes
We know that there are two types of indexes: Clustered index and nonclustered index.
Clustered index: Physical storage is sorted by index.
Nonclustered index: Physical storage is not sorted by index.
Clustered Index
The data pages of the clustered index are physically and sequentially stored, and the data pages are the leaf nodes of the clustered index, and the data pages are connected in the form of a two-way linked list, and the actual data is stored in the data page. When we add an index to a table, the data in the table is sorted according to the index.
Suppose we have a table T_pet, which contains four fields: Animal,name,sex and age, and uses animal as the indexed column, with the following SQL code:
Copy Code code as follows:
-----------------------------------------------------------
----Create t_pet table in tempdb.
-----------------------------------------------------------
Use tempdb
CREATE TABLE T_pet
(
Animal VARCHAR (20),
[Name] VARCHAR (20),
Sex CHAR (1),
Age INT
)
CREATE UNIQUE CLUSTERED INDEX t_petonanimal1_clteridx on T_pet (animal)
-----------------------------------------------------------
----Insert data into data table.
-----------------------------------------------------------
Copy Code code as follows:
DECLARE @i int
SET @i=0
while (@i<1000000)
BEGIN
INSERT into T_pet (
Animal
[Name],
Sex
Age
)
SELECT [Dbo].random_string (one) animal,
[Dbo].random_string (one) [name],
' F ' sex,
CAST (Floor (rand () *5) as int) age
SET @i=@i+1
End
INSERT into T_pet VALUES (' Aardark ', ' Hello ', ' F ', 1)
INSERT into T_pet VALUES (' Cat ', ' Kitty ', ' F ', 2)
INSERT into T_pet VALUES (' horse ', ' Ma ', ' F ', 1)
INSERT into T_pet VALUES (' turtles ', ' Sisi ', ' F ', 4)
INSERT into T_pet VALUES (' Dog ', ' tomma ', ' F ', 2)
INSERT into T_pet VALUES (' Donkey ', ' YoYo ', ' F ', 3)
Figure 5 Clustered Index
As shown in Figure 5 above, the first and second layers from left to right are index pages, the third layer is the data page (leaf node), the data pages are connected through the bidirectional linked list, and the data page data is sorted according to the index; Suppose we want to look for the name of the Xnnbqba animal Ifcey, Here we use animal as the index of the table, so the database first looks at the index, and when the index value animal = ' Ifcey is found, it then finds the data page (leaf node) of the index to get the concrete data. The specific query statements are as follows:
Copy Code code as follows:
SET STATISTICS profile on
SET STATISTICS time on
SELECT animal, [name], sex, age
From T_pet
WHERE animal = ' Ifcey '
SET STATISTICS profile off
SET STATISTICS Time off
When we finish executing the SQL query plan, put the mouse pointer over the "clustered index lookup", the following figure appears, and we can see an important message logical operation--clustered Index Seek, SQL queries get records directly from the clustered index, and the query is the fastest.
Figure 6 Query plan
From the query results below, we found that the query step is only 2 steps, first through clustered index seek to quickly find the index Ifcey, and then query the index of the leaf node (data page) to get the data.
Query execution Time: CPU time = 0 milliseconds, elapsed time = 1 milliseconds.
Figure 7 Query Results
Now that we delete the index from the table and rerun the query plan, we can see that logical operation has changed to table Scan, and the query speed is slow because there are 1 million rows of data in the table.
Figure 8 Query plan
From the query results below, we found that the query step became 3 steps, first looking through the table scan animal = ' Ifcey ', SQL Server automatically analyzes the SQL statement when executing the query, and it estimates that our query is time-consuming. So the database makes concurrent operations to speed up the query.
Query execution Time: CPU time = 329 milliseconds, elapsed time = 182 milliseconds.
Figure 9 Query Results
With the above clustered index and no comparison, we found differences in query performance, and if you use the index database to find the index first instead of the aimless full table traversal.
Non-clustered index
In the absence of a clustered index, the data pages in the table are stored in a heap (Heap) Form, and the heap is a table without a clustered index; heap storage in SQL Server stores new data rows to the last page.
Non-clustered indexes are physical stores that are not sorted by index, and the leaf nodes of nonclustered indexes (index leaf pages) contain pointers or clustered indexes that point to specific data rows, and no connections between data pages are relatively separate pages.
Suppose we have a table T_pet, which contains four fields: Animal,name,sex and age, and uses animal as a non indexed column with the following SQL code:
Copy Code code as follows:
-----------------------------------------------------------
----Create T_pet table in tempdb with nonclustered INDEX.
-----------------------------------------------------------
Use tempdb
CREATE TABLE T_pet
(
Animal VARCHAR (20),
[Name] VARCHAR (20),
Sex CHAR (1),
Age INT
)
CREATE UNIQUE nonclustered INDEX t_petonanimal1_nonclteridx on T_pet (animal)
Figure 10 Non-clustered index
Then we want to query the table animal = ' Cat ' pet information, the specific SQL code is as follows:
Copy Code code as follows:
SET STATISTICS profile on
SET STATISTICS time on
SELECT animal, [name], sex, age
From T_pet
WHERE animal = ' Cat '
SET STATISTICS profile off
SET STATISTICS Time off
As the following illustration shows, we find two steps to the far right of the query plan: RID and index lookup. These two lookups are slow (Clustered index Seek) relative to the clustered index lookup.
Figure 11 Query plan
First SQL Server looks for the index value and then finds the data row based on the RID until it finds the result that matches the query criteria.
Query execution Time: CPU time = 0 milliseconds, elapsed time = 1 ms
Figure 12 Query Results
non-clustered index of heap table
Because the heap is a table that does not contain a clustered index, the leaf nodes of the nonclustered index will contain pointers to the concrete rows of data.
Taking the previous T_pet table as an example, assuming that T_pet uses the animal column as a nonclustered index, the nonclustered index structure of its heap table is shown in the following illustration:
Fig. 13 non-clustered index of the heap table
In the diagram above, we find that nonclustered indexes are connected through a two-way linked list, and the leaf nodes contain pointers to specific data rows.
If we are looking for animal = ' Dog ' information, first we traverse the first level index, and then the database to determine the Dog belongs to the Cat range index, then traverse the second tier index, and then find the Dog index to obtain the saved pointer information, according to the pointer information to get the data in the corresponding data page, we will pass the specific example.
Now we create the table employees, and then add a heap table nonclustered index to the table, with the specific SQL code as follows:
Copy Code code as follows:
Use tempdb
----creates a sample table.
CREATE TABLE Employees (
employee_id NUMERIC not NULL,
First_Name VARCHAR (1000) not NULL,
Last_Name VARCHAR (900) not NULL,
Date_of_birth DATETIME,
Phone_number VARCHAR (1000) not NULL,
Junk CHAR (1000),
CONSTRAINT employees_pk PRIMARY KEY nonclustered (employee_id)
);
Go now we look for employee information for employee_id = 29976.
Copy Code code as follows:
SELECT *
From Employees
WHERE employee_id = 29976
The query plan is shown in the following illustration:
Figure 14 Query plan
First, look for the index of index value employee_id = ' 29976 ' and then look for the data rows that match the criteria based on the RID; so the query efficiency of the heap table index is not as good as the clustered table, and then we'll talk about the nonclustered indexes of the clustered tables.
Clustered table non-clustered index
When a clustered index is present on a table, the leaf node of any nonclustered index is no longer the containing pointer value, but the index value of the clustered index.
Taking the previous T_pet table as an example, assuming that T_pet uses the animal column as a nonclustered index, its indexed table nonclustered index structure is shown in the following illustration:
Figure 15 Nonclustered index of Index Table
In the diagram above, we find that nonclustered indexes are connected by two-way linked lists, and that leaf nodes contain index values for index tables.
If we are looking for information about animal = ' Dog ', first we iterate through the first-tier index, then the database judges the Dog as a cat-scoped index, then traverses the second-tier index, then finds the Dog index to get the saved index value, and then gets the data from the corresponding data page based on the index
Next we modify the previous Employees table, first we delete the previous heap table nonclustered index, and then increase the nonclustered index table, the specific SQL code is as follows:
Copy Code code as follows:
ALTER TABLE Employees
DROP CONSTRAINT EMPLOYEES_PK
ALTER TABLE Employees
ADD CONSTRAINT employees_pk PRIMARY KEY CLUSTERED (employee_id)
Go
SELECT * FROM Employees
WHERE employee_id=29976
Figure 16 Query plan
Validity of indexes
Every time SQL Server executes a query, it first checks to see if the query has an execution plan, and if not, generates an execution plan, what is the execution plan? In simple terms, it helps SQL Server develop an optimal query plan. (Please refer to the enquiry plan here)
Here we will illustrate the use of the index in SQL Server with a concrete example, first we define a table Testindex, it contains three fields Testindex,bitvalue and filler, the specific SQL code is as follows:
Copy Code code as follows:
-----------------------------------------------------------
----Index Usefulness sample
-----------------------------------------------------------
CREATE TABLE Testindex
(
Testindex int identity (1,1) constraint Pktestindex primary key,
Bitvalue bit,
Filler char NOT NULL default (replicate (' A ', 2000))
)
CREATE INDEX xtestindex_bitvalue on Testindex (Bitvalue)
Go
INSERT into Testindex (bitvalue)
VALUES (0)
Go 20000--runs the batch 20000 times.
INSERT into Testindex (bitvalue)
VALUES (1)
Go--puts to table with value 1
We then query the table for bitvalue = 0 rows of data, and the table Bitvalue = 0 of the data has 2000 rows.
Copy Code code as follows:
SELECT *
From Testindex
WHERE bitvalue = 0
Figure 17 Query plan
Now we query the data rows for bitvalue = 1.
SELECT *from Testindexwhere bitvalue = 1
Figure 18 Query plan
Now that we've noticed a different query for the same table, what exactly is the reason for executing a very different kind of search plan?
We can see the detailed use of the indexes in the table by using DBCC SHOW_STATISTICS, with the following specific SQL code:
Copy Code code as follows:
UPDATE STATISTICS Dbo.testindex
DBCC show_statistics (' Dbo.testindex ', ' Xtestindex_bitvalue ')
With histogram
Figure 19 Histogram
With the histogram above, we know that SQL Server estimates Bitvalue = 0 rows of rows have about 19989 rows, while Bitvalue = 1 estimates that the 21;SQL server optimizer takes different execution plans based on the amount of data estimates to achieve optimal query performance, Because Bitvalue = 0 data is large, SQL Server can only provide scan clustered index to get the corresponding data row, and Bitvalue = 1 The actual data row only 10 rows, SQL Server first through the key to find Bitvalue = 1 of the data row, The nested loops join to the clustered index to obtain the remaining data rows.
Summarize the full instance code:
Copy Code code as follows:
-- =============================================
--Author:jkhuang
--Create date:04/20/2012
--Description:create Sample for Clustered and
--Nonclustered index.
-- =============================================
-----------------------------------------------------------
----Create T_pet table in tempdb with nonclustered INDEX.
-----------------------------------------------------------
Use tempdb
CREATE TABLE T_pet
(
Animal VARCHAR (20),
[Name] VARCHAR (20),
Sex CHAR (1),
Age INT
)
CREATE UNIQUE nonclustered INDEX t_petonanimal1_nonclteridx on T_pet (animal)
CREATE UNIQUE CLUSTERED INDEX t_petonanimal1_clteridx on T_pet (animal)
-----------------------------------------------------------
----Insert data into data table.
-----------------------------------------------------------
DECLARE @i int
SET @i=0
while (@i<1000000)
BEGIN
INSERT into T_pet (
Animal
[Name],
Sex
Age
)
SELECT [Dbo].random_string (one) animal,
[Dbo].random_string (one) [name],
' F ' sex,
CAST (Floor (rand () *5) as int) age
SET @i=@i+1
End
INSERT into T_pet VALUES (' Aardark ', ' Hello ', ' F ', 1)
INSERT into T_pet VALUES (' Cat ', ' Kitty ', ' F ', 2)
INSERT into T_pet VALUES (' horse ', ' Ma ', ' F ', 1)
INSERT into T_pet VALUES (' turtles ', ' Sisi ', ' F ', 4)
INSERT into T_pet VALUES (' Dog ', ' tomma ', ' F ', 2)
INSERT into T_pet VALUES (' Donkey ', ' YoYo ', ' F ', 3)
SET STATISTICS profile on
SET STATISTICS time on
SELECT animal, [name], sex, age
From T_pet
WHERE animal = ' Cat '
SET STATISTICS profile off
SET STATISTICS Time off
-----------------------------------------------------------
----Create Employees table in tempdb.
-----------------------------------------------------------
CREATE TABLE Employees (
employee_id NUMERIC not NULL,
First_Name VARCHAR (1000) not NULL,
Last_Name VARCHAR (900) not NULL,
Date_of_birth DATETIME,
Phone_number VARCHAR (1000) not NULL,
Junk CHAR (1000),
--PK constraint defaults to clustered
CONSTRAINT employees_pk PRIMARY KEY (employee_id)
);
Go
-----------------------------------------------------------
----Insert data into data table.
-----------------------------------------------------------
CREATE VIEW Rand_helper as SELECT Rnd=rand ();
Go
----generates random string function.
CREATE FUNCTION random_string (@maxlen int) RETURNS VARCHAR (255)
As BEGIN
DECLARE @rv VARCHAR (255)
DECLARE @loop int
DECLARE @len int
SET @len = (SELECT CAST (RND * (@maxlen-3) as INT) +3
From Rand_helper)
SET @rv = '
SET @loop = 0
While @loop < @len BEGIN
SET @rv = @rv
+ CHAR (CAST (SELECT rnd
From Rand_helper) * num as INT) +97)
IF @loop = 0 BEGIN
SET @rv = UPPER (@rv)
End
SET @loop = @loop +1;
End
Return @rv
End
Go
----generates random date function.
CREATE FUNCTION random_date (@mindaysago int, @maxdaysago int)
RETURNS VARCHAR (255)
As BEGIN
DECLARE @rv datetime
SET @rv = (SELECT GetDate ()
-Rnd * (@maxdaysago-@mindaysago)
-@mindaysago
From Rand_helper)
Return @rv
End
Go
----generates random int function.
CREATE FUNCTION random_int (@min int, @max int) RETURNS int
As BEGIN
DECLARE @rv INT
SET @rv = (SELECT rnd * (@max) + @min
From Rand_helper)
Return @rv
End
Go
----inserts data into Employees table.
With generator (n) as
(
Select 1
UNION ALL
Select N + 1 from generator
where N < 30000
)
INSERT into Employees (employee_id
, first_name, last_name
, Date_of_birth, Phone_number, junk)
Select N employee_id
, [Dbo].random_string (one) first_name
, [Dbo].random_string (one) last_name
, [Dbo].random_date (20*365, 60*365) DOB
, ' N/a ' phone
, ' junk ' junk
From generator
OPTION (Maxrecursion 30000)
-----------------------------------------------------------
----Index Usefulness sample
-----------------------------------------------------------
CREATE TABLE Testindex
(
Testindex int identity (1,1) constraint Pktestindex primary key,
Bitvalue bit,
Filler char NOT NULL default (replicate (' A ', 2000))
)
CREATE INDEX xtestindex_bitvalue on Testindex (Bitvalue)
Go
INSERT into Testindex (bitvalue)
VALUES (0)
Go 20000--runs the batch 20000 times.
INSERT into Testindex (bitvalue)
VALUES (1)
Go--puts to table with value 1
SELECT Filler
From Testindex
WHERE Bitvalue = 1
UPDATE STATISTICS Dbo.testindex
DBCC show_statistics (' Dbo.testindex ', ' Xtestindex_bitvalue ')
With histogram