This article describes the principles of fragmentation in SQL Server: the concepts of internal and external fragmentation, the methods for resolving fragmentation, and the fill factor. In a database, a feature that improves one aspect of performance often weakens another, so learning how the database engine works, and weighing those trade-offs against your specific situation, is required knowledge for DBAs and developers.
This article assumes a basic understanding of indexes and of how SQL Server stores data.
In SQL Server, the smallest unit of data storage is the page, and each page can hold 8,060 bytes of row data. Pages are organized in a B-tree structure (a table without a clustered index is a heap, which is not discussed in this article), as shown below:
In the clustered index B-tree, only the leaf nodes actually store the data; the root and intermediate nodes hold only the key values needed to locate the leaf nodes.
Each leaf node is one page, and a page is indivisible. Within a page, the smallest unit in which SQL Server stores data is the table row. When an insert or update makes a leaf node unable to accommodate the new or modified row, a page split occurs, and page splits produce fragmentation.
Understanding external fragmentation
First, the "outside" of understanding the external fragment is relative to the page. External fragmentation refers to fragments that result from paging. For example, I want to insert a row into an existing clustered index, which just causes the existing page space to be unable to accommodate the new row. This results in paging:
Because SQL Server allocates new pages as the data grows, and a clustered index requires rows to be kept in logical order, in many cases the newly split page and the original page are not contiguous on disk.
This is called external fragmentation.
Because a page split moves data between pages, inserts and updates that frequently cause splits can significantly increase IO and degrade performance.
For lookups, external fragmentation has little effect on performance when the search is selective, for example when the WHERE clause narrowly limits the result, or when an unordered result set is returned. However, external fragmentation does hurt performance when a query scans the clustered index and must read logically contiguous pages in order.
In SQL Server, the allocation unit above the page is the extent. An extent holds 8 pages and is the physical unit in which disk space is allocated. When pages are split across extents, a scan has to cross extents more often, and because the data can no longer be read sequentially it cannot benefit from read-ahead, resulting in additional physical reads and increased disk IO.
Understanding internal fragmentation
As with external fragmentation, internal fragmentation is also relative to the page. Let's look at an example:
We create a table in which each row consists of an int (4 bytes), a char(999) (999 bytes), and a varchar that is initially empty (0 bytes), so each row is 1,003 bytes. Eight rows therefore occupy 1003*8 = 8,024 bytes, which, plus some internal overhead, still fits on a single page:
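A minimal sketch of such a table and its initial eight rows (the names PageSplitDemo, CIX_PageSplitDemo, Col1, Col2, Col3 and the exact values are illustrative assumptions; the article's original script is not shown):

CREATE TABLE PageSplitDemo
(
    Col1 INT NOT NULL,            -- 4 bytes
    Col2 CHAR(999) NOT NULL,      -- fixed 999 bytes
    Col3 VARCHAR(1000) NULL       -- starts empty, 0 bytes
);
CREATE CLUSTERED INDEX CIX_PageSplitDemo ON PageSplitDemo (Col1);

-- Eight rows of roughly 1,003 bytes each fit on a single 8 KB page
INSERT INTO PageSplitDemo (Col1, Col2, Col3)
VALUES (1, 'a', ''), (2, 'b', ''), (3, 'c', ''), (4, 'd', ''),
       (5, 'e', ''), (6, 'f', ''), (7, 'g', ''), (8, 'h', '');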
When we then update the Col3 column of an arbitrary row so that the row grows, the enlarged row can no longer fit on the page, resulting in a page split:
After the page split:
And if the newly allocated page is not physically contiguous with the original page, the split also produces external fragmentation.
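A hedged sketch of an update that forces such a split on the assumed PageSplitDemo table (the row chosen and the new value are arbitrary):

-- Growing Col3 from 0 to 1,000 bytes makes the row too large for its current page,
-- so SQL Server splits the page and moves part of the rows to a newly allocated page
UPDATE PageSplitDemo
SET Col3 = REPLICATE('x', 1000)
WHERE Col1 = 3;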
Impact of internal and external fragmentation on query performance
As stated above, the impact of external fragmentation on performance comes mainly from having to scan across more extents, which results in more IO operations.
Internal fragmentation causes the data rows to spread across more pages, increasing the number of pages that must be scanned and reducing query performance.
Let's look at the following example. We artificially insert some data into the table created above to produce internal fragmentation:
By inspecting the fragmentation, we can see that it has reached a relatively high level:
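A common way to measure this is the sys.dm_db_index_physical_stats DMV; the sketch below assumes the PageSplitDemo table from the earlier example:

-- avg_fragmentation_in_percent reflects external (logical) fragmentation;
-- avg_page_space_used_in_percent reflects internal fragmentation (how full each page is)
SELECT index_id,
       avg_fragmentation_in_percent,
       avg_page_space_used_in_percent,
       page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('PageSplitDemo'), NULL, NULL, 'DETAILED');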
By comparing the IO before and after defragmentation, we can see that the number of reads drops significantly:
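A hedged sketch of that before/after comparison, using the assumed table from above; SET STATISTICS IO reports how many reads each query performs:

SET STATISTICS IO ON;

SELECT COUNT(*) FROM PageSplitDemo;        -- note the logical reads while fragmented

ALTER INDEX ALL ON PageSplitDemo REBUILD;  -- defragment the table's indexes

SELECT COUNT(*) FROM PageSplitDemo;        -- the logical reads should now be lower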
Solutions for fragmentation
Essentially all of the solutions are based on rebuilding or reorganizing the index, just in different ways.
1. Drop the index and re-create it
This is not a good approach. The index is unavailable between the drop and the re-create, and the operation causes blocking. Dropping a clustered index also causes the corresponding nonclustered indexes to be rebuilt twice (once when the clustered index is dropped and again when it is re-created). Although this method has drawbacks, it produces the most thorough defragmentation.
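A hedged sketch of this approach, using the assumed index name CIX_PageSplitDemo (a clustered index created through a primary key constraint would need ALTER TABLE ... DROP CONSTRAINT instead):

-- The table sits as a heap between these two statements, and both steps take blocking locks
DROP INDEX CIX_PageSplitDemo ON PageSplitDemo;
CREATE CLUSTERED INDEX CIX_PageSplitDemo ON PageSplitDemo (Col1);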
2. Rebuild the index with the DROP_EXISTING option
To avoid rebuilding the nonclustered indexes twice, you can rebuild the index with CREATE INDEX ... WITH DROP_EXISTING. Because the operation is atomic, the nonclustered indexes are not rebuilt twice, but it can still cause blocking.
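A minimal sketch, again assuming the CIX_PageSplitDemo index:

-- The old index is dropped and the new one built in a single atomic operation,
-- so nonclustered indexes are rebuilt only once
CREATE CLUSTERED INDEX CIX_PageSplitDemo ON PageSplitDemo (Col1)
WITH (DROP_EXISTING = ON);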
3. Rebuild the index with the ALTER INDEX REBUILD statement, as shown above
This statement also rebuilds the index, but it does so dynamically, without dropping and re-creating it, so it is better than the first two methods. It still causes blocking, however. You can reduce the locking by using the ONLINE option, at the cost of a longer rebuild time.
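A minimal sketch, assuming the same index; note that ONLINE = ON is only available in certain editions of SQL Server:

-- Rebuild in place; ONLINE = ON reduces blocking at the cost of a longer rebuild
ALTER INDEX CIX_PageSplitDemo ON PageSplitDemo
REBUILD WITH (ONLINE = ON);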
4. Use ALTER INDEX REORGANIZE
This method does not rebuild the index or allocate new pages; it simply reorganizes the leaf level in place and skips any page that is locked, so it does not cause blocking. At the same time, the result is less thorough than the first three methods.
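A minimal sketch, assuming the same index:

-- Compacts and reorders the leaf-level pages in place, skipping any page that is locked
ALTER INDEX CIX_PageSplitDemo ON PageSplitDemo REORGANIZE;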
Understanding the fill factor
Rebuilding an index solves the fragmentation problem, but a rebuild is not only a hassle; it also causes blocking and affects normal use. For a small amount of data, rebuilding the index is not very expensive, but when the index itself grows very large, the time needed to rebuild it becomes very painful.
This is where the fill factor comes in. By default the fill factor is 0 (0 and 100 mean the same thing), which means a page may be filled to 100%. That is why the updates and inserts shown earlier run out of space and cause page splits. By setting the fill factor, you can control how full each page is allowed to be:
Let's look at an example:
For the table above, I inserted 31 rows, which occupy 4 pages:
After rebuilding the index with a fill factor (a hedged sketch follows this example), the same data is now spread across 5 pages:
Now when I insert another row, no page split occurs:
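A minimal sketch of the fill-factor rebuild referenced above, assuming the same index; 80 is an illustrative value:

-- Rebuild the index leaving roughly 20% free space on each leaf page,
-- so later inserts and updates have room before a page split is needed
ALTER INDEX CIX_PageSplitDemo ON PageSplitDemo
REBUILD WITH (FILLFACTOR = 80);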
The above concept can be illustrated by the figure below:
As you can see, using a fill factor reduces the page splits caused by updates and inserts, but because more pages are needed to hold the same data, lookup performance suffers accordingly.
How to set the value of a fill factor
There is no formula or rule that gives an exact fill factor value. Although a fill factor reduces page splits during updates and inserts, it also requires more pages, which reduces query performance and consumes more disk space. How to weigh this trade-off depends on the specific circumstances.
Based on the read/write ratio of the table, here are the values I consider appropriate:
1. When the read/write ratio is greater than 100:1, do not set a fill factor (100% full).
2. When writes outnumber reads, use a 50%-70% fill.
3. When the read/write ratio falls between the two, use an 80%-90% fill.