To be honest, I've never used an index in a real project, just know that the index is a fairly important technical point, so I've seen a lot of articles about the differences, classifications, pros and cons of indexes, and how to use indexes. But the most essential thing about the index is what the author has not understood, this article is the author with these questions on MSDN a little summary and a lot of doubts.
1. Table structure
When a developer creates a table in the database, the default is to create a partition for the table, and note that it is a partition. A partition is a data organization unit in which 2 structures can exist, namely, a heap structure or a B-tree structure (index structure), meaning that a partition is either a heap structure or a B-tree structure. To improve performance and ease of management in some ways, we can create partitions ourselves, moving the data horizontally, that is, partitioning the data rows in the behavioral units. Although partitioning divides the data row groups into different places, the table is considered a single logical entity for queries or other operations. The partition contains a heap structure or a B-tree structure that consists of pages, and you might be a little confused about the heap and the B-tree structure at first, but it's okay to first understand the structure of the entire table and then go into the details. For the data pages in the heap structure or B-tree structure, in order to store them more efficiently, and divide 3 kinds of storage units, essentially is the type of the page, they are also a kind of data organization unit. The three types of pages are In_row_data, Lob_data, Row_overllow_data, respectively. The official term for the page type is the allocation unit, which exists because there are small data types and large space-occupying data types in the data page, and for data columns like varchar (max), large storage units are allocated, and small storage units are allocated for normal columns of data. Just like the int type is 4 bytes and the bit type is 1 bytes. Here the composition of a table is finished, as shown.
2. Page type
Now you may be curious about the type of 3 types of pages, from the above why the classification of the introduction can be seen in their classification is based on the size of the space occupied. In_row_data is the most basic type of page that contains all data rows and index rows except LOB data, and the page type is data or index. The Lob_data page type, which is used to store large data objects, including the following data columns text, ntext, varchar (max), nvarchar (max), image, varbinary (max), XML, CLR UDT, The type of the page is text or image. row_overflow_data page type, when this row of data in the In_row_data page type exceeds 8060, the column that occupies the most space on the page is moved to the overflow page, and a pointer to the overflow page is maintained on the original page. Overflow page type and large object page type Management page in the same way, using the IAM page chain to manage, notice the emergence of a term IAM page chain, in a word to describe IAM is the IAM used to track the table of the specified allocation unit. Now we continue to delve into these 3 types of pages, with a deeper understanding or need for SQL instances. I created the student table as follows.
UsetestDbCreate TableStudent (StudentIDint, Studentnamenvarchar(3600), Studentaddressnvarchar(3600), Studentdescriptionnvarchar(Max)) --View the page information that was created, and the last 1 represents all paging information, including IAM paging, data paging, all existing lob paging and row overflow pages, index pagingDBCCInd'testDb','Student',1)--There's nothing at this timeInsert intoStudentValues(1,'John Doe','Hubei','Student')DBCCInd'testDb','Student',1)
after inserting a piece of data and then executing the DBCC IND, it can get each type of page distribution and the file number and page number where they are located, at this point there is no more than nothing but the following results appear.
Pagefid and Pagepid refer to the page ID and page number respectively, Iamfid refers to the file that contains the IAM page that manages the page id,iampid refers to the file number that contains the IAM page that manages the page. Objectid refers to the Student object Id,partitionnumber represents the partition number where the table or index resides, and PartitionID represents the partition ID. The next 4 columns to be introduced are to be paid extra attention.
Iam_chain_type represents the page type, and now we can see the full in_row_data type, in addition to the overflow type and the large object data type.
PageType is also very important, it represents the paging type, 1 represents the data page, 2 is the index page, and 3 is lob_mixed_page,4 is lob_tree_page,10 is the IAM page.
IndexID says The index id,0 represents the heap, 1 represents the clustered index, 2-250 represents the nonclustered index, and the text or image field is greater than 250.
Indexlevel represents the index hierarchy , 0 represents leaf level paging, and greater than 0 represents a non-leaf level, and NULL represents an IAM paging.
From the results we see that when this data is inserted and a new data page and IAM page are added, what is the relationship between the IAM page and the data page? Continue to add data with this problem.
--add data to overflow and insert big Data ObjectsInsert intoStudentValues(2,REPLICATE('Xiao Wang',1700),REPLICATE('Hubei',1700),REPLICATE('ha',80000))DBCCInd'testDb','Student',1)
Now there are 5 new pages added, the data page adds another 114, the overflow page is 93 and its corresponding IAM page, and the large object data adds 109 and IAM pages. The reason for the overflow data is that I added this line in addition to large object data takes up more than 8060 bytes of space, big data objects because 80,000 characters are inserted ' ha ', and the data all use heap structure. Now to understand IAM, which says that IAM is used to keep track of each allocation unit, the meaning of this trace is to record the order of these pages in order to link the pages together. If the allocation unit page information stored by an IAM page exceeds its own size, a new IAM page is created so that the IAM page and the IAM page record the information for the allocation unit page, which is called the IAM page chain.
3. Heap structure and B-tree structure
With a cushion above, now to explore is the heap and B-tree. A heap structure is a data structure that does not contain a clustered index, with at least one In_row_data allocation unit per partition in each heap. If the heap contains large object (LOB) columns, then each partition of the heap will also have a Lob_data allocation unit. If the heap contains variable-length columns that exceed the 8,060-byte row size limit, then each partition of the heap will also have a Row_overflow_data allocation unit. The data pages and rows in the heap are not in any particular order and are not linked together, and the information within the IAM page records the unique logical link between the data pages. Sys.system_internals_allocation_units the columns in the System view first_iam_page the first page of a series of IAM pages that manage the allocated space for a heap in a specific partition, such as a structure diagram.
The next step is the focus of this article, the structure of clustered and nonclustered indexes. The classic way of distinguishing between clustered and nonclustered indexes is that the index is like a dictionary directory, and a clustered index is similar to a phonetic lookup, and a nonclustered index is a search by the radicals. The index is organized by the B-tree structure, so we are now going to associate this specific instance with the B-tree. For a clustered index, the vertex of its B-tree structure is called the root node, and any index level between the root node and the leaf node is considered intermediate, with the most core being that the leaf node contains the data page of the underlying table, the remaining nodes (root and intermediate nodes) contain index pages, index rows exist in the index page, Each index row contains a key-value pair and a pointer. The content of a key-value pair is the index order and the smallest value in the subordinate node to which the row is linked. For example, 2 key-value pairs 5-100 and 6-200, the first indicates that the index row is the 5th, then its subordinate node is the smallest of 100, the range is 100~200, the index row pointer is to the next index page or the data row in the leaf node. Take a look at the clustered index structure that I've truncated on MSDN to get a clear understanding of the essential reasons for using indexing to speed up.
for a nonclustered index, the two most essential difference between it and the clustered index is "table data rows are not sorted and stored in the order of the nonclustered keys and the leaf nodes of the nonclustered index are not data pages but index pages. That is, the nonclustered index does not actually sort the data physically, but an index B-tree structure outside the data table, which is sorted by the nonclustered index column and has pointers to the data rows in the data table in each index node. Index rows in a nonclustered index contain nonclustered key-value pairs and row locators, which are pointers, but sometimes the nonclustered index is created in the presence of a clustered index, and in some cases a nonclustered index is created in the heap structure, so the row locator may be a pointer to the index key in the clustered index. It is also possible to point to a row of data in the heap. Here's another question. Note that the purpose of the index is to sort, and when a nonclustered index is created on the basis of a clustered index, it is possible that the clustered index is a non-unique clustered index, which could cause a sort error, so SQL Server A unique key is generated internally to make all duplicate keys unique, and this four-byte value is not visible to the user and is added only if the clustered key needs to be unique for use in a nonclustered index, and SQL Server retrieves the data row by using the clustered index key stored in the leaf row of the nonclustered index to search the clustered index.
4. Index doubts
when I read an indexed article, I know that the index tool is suitable for columns with a large query size and does not apply to columns that require a lot of updates. When you create a clustered index on a column, because the index page points directly to the data page in the clustered index, the query speed of the clustered index is almost always faster than the query for the nonclustered index. For an update operation, the most essential reason for its slow index query speed is index fragmentation. Index fragmentation can be divided into external fragments and internal fragments, and there are many articles on this in the blog park. The process of the author's learning can be described by understanding-not understanding-understanding-not understanding this feeling, My initial understanding is this.
When a clustered index is created, this is a B-tree structure, assuming that there are 5 data in each index page (8060 bytes). What if a delete operation causes many index bars in the page to no longer be 5, so that there is not much left in the index page, which is internal fragmentation. The internal fragments in the data I looked up were about 2 cases, one being that the new data resulted in paging to the left, and the deletion of some of the data in the original data page caused the remaining space to appear.
If there is a new operation, if there are 3 pages of data, there is now an action to add a new piece of data. The order should be placed on the first page, but now the first page space cannot be accommodated. This causes a new page to be created, but the page is not the same as the logical order between the first and second pages, and the end result is physically disjoint from the first page, which is an external fragment. The most critical to the appearance of external fragments is physical separation rather than continuity.
I thought I understood the index fragment, but I found some essays on the index fragment example, the internal fragment is the new data caused by the index paging, the external fragment is the new data caused by the paging. As shown in the following 2 graphs, the left side is an example of an internal fragment, and on the right is an example diagram of an external fragment. It makes me wonder, on the face of it, that internal and external fragments are new data causing paging, and that the inner fragment is just a different angle from the external fragment, but essentially the same. Later carefully read the articles of the predecessor to know there is still a difference, not only the paging and physically caused the new page and the original page is not contiguous is the external fragment. Soon my understanding became incomprehensible, perhaps I was a bit dead-end, but this place a lot of essays are very vague. I don't understand how this new page is going to allocate the physical space. when will the internal fragments be created continuously with the original page, and when will the external fragments be created with the original page discontinuity?
Think again about the fragmentation in this case, why it slows down. In the internal fragment, there is a lot of space left in the index page, which is what I think is slowing down. I looked up the information on the Internet to explain why the slow discovery is probably the same: obviously only 10 pages of data results are queried in 20 pages, which increases the I/O and also causes the page hit rate to drop, storage must also consume more space. Assuming this is the case, I have 10 pages of data, and the data in each page is reduced after the delete operation, resulting in internal fragmentation. As the general idea is now slowing down, it's true that the data may not have been stored for 10 pages, and maybe 5 pages will suffice. But I used to have 10 pages of data, and now it's still 10 pages of data, why is it slower than the previous full 10-page data? I increasingly want to know what this internal operation is like, if you have Daniel read here also ask you to point out. I do not know why I have this strange doubt, now I can do is to delete the page data, resulting in the case of internal fragmentation of the query speed is really slow? SQL looks like this, and there are 3 more doubts in the test process.
UsetestDbCreate TableStudent (StudentIDint, StudentnameChar( -), StudentaddressChar( -)) Declare @i intSet @i=0 while(@i<3200)begin Insert intoStudentValues(@i,'Wangwu','Hubei') Set @i=@i+1EndSelectIndex_type_desc,--Index Type description: HEAP, CLUSTERED Index, nonclustered index, XML index, PRIMARY XML index, SPATIAL indexIndex_depth,--number of index levels: heap or Lob_data or row_overflow_data allocation unitPage_count,--The total number of index or data pages, for the index represents the total number of index pages in the current level of the B-tree in In_row_data, and for the heap represents the total number of data pages in the In_row_data allocation unitRecord_count,--Total Record CountFragment_count,--In_row_data The number of fragments in the leaf level of the allocation unit. Avg_record_size_in_bytes,--average record size (bytes)Avg_fragment_size_in_pages,--In_row_data The average number of pages of a fragment in the leaf level of the allocation unit. Avg_fragmentation_in_percent,--The logical fragmentation of the index, or the fragmentation of the heap in the In_row_data allocation unit. Avg_page_space_used_in_percent--The average percentage of available data storage space used in all pages. fromSys.dm_db_index_physical_stats (db_id('testDb'),object_id('Student'),NULL,NULL,'sampled')
Create Clustered IndexIndex_studentid onStudent (StudentID)SelectIndex_type_desc,--Index Type description: HEAP, CLUSTERED Index, nonclustered index, XML index,PRIMARYXmlINDEX, SPATIALINDEXIndex_depth,--number of index levels: 1 is a heap or lob_data or row_overflow_data allocation unitPage_count,--The total number of index or data pages, for the index represents the total number of index pages in the current level of the B-tree in In_row_data, and for the heap represents the total number of data pages in the In_row_data allocation unitRecord_count,--Total Record CountFragment_count,--In_row_data The number of fragments in the leaf level of the allocation unit. Avg_record_size_in_bytes,--average record size (bytes)Avg_fragment_size_in_pages,--In_row_data The average number of pages of a fragment in the leaf level of the allocation unit. Avg_fragmentation_in_percent,--The logical fragmentation of the index, or the fragmentation of the heap in the In_row_data allocation unit. Avg_page_space_used_in_percent--The average percentage of available data storage space used in all pages. fromSys.dm_db_index_physical_stats (db_id('testDb'),object_id('Student'),NULL,NULL,'sampled')
(1) A row of data is supposed to be 1004 bytes, a page can hold 8 data, 3,200 should be 400 lines, but the result is 464 pages, this is the first I am puzzled about the place.
(2) After creating the index page Page_count is 458, the number of pages compared to the original 464 is also smaller, this is the second let me very confused.
Declare @i intSet @i=6 while(@i<3200)begin Delete fromStudentwhereStudentID=@i Set @i=@i+8EndGo
(3) The result of the avg_fragmentation_in_persent is 0, which shows that there is no fragmentation, I do not know whether I misunderstood the meaning of this essay or this essay about internal fragments of the place is wrong.
I have a rookie database, if you have ideas to welcome the exchange.
Index structure and confusion in SQL Server