SQL Server performance optimization -- wait -- SLEEP_BPROOL_FLUSH

Source: Internet
Author: User

SQL Server performance optimization -- wait -- SLEEP_BPROOL_FLUSH
Preface:

There is a database for historical archiving (referred to as the historical database). After a certain amount of time, the data file has reached more than 700 GB, and it was decided that some data does not need to be retained, this part of data is truncate and more than 600 GB of space is available. That is to say, after the contraction, the database theoretically only has more than 100 GB. To this end, I have rebuilt the clustered index of each table (the number of tables is small, but the number of single tables is still tens of millions) and prepared to shrink it.

However, when the contraction starts, even if the range of each contraction is reduced to 500 MB, the speed is extremely slow and often does not respond for several hours. After viewing the wait information, it is found that a SPID = 18 SESSION (all SPID <= 50 are system sessions) always shows the waiting status as "SLEEP_BPROOL_FLUSH", and the contraction operation is blocked.

For this reason, I think even a small probability event (because this wait type is common, but it is not always noticeable) can be studied as a result.

Description: The environment is SQL Server 2008R2.

Source: http://blog.csdn.net/dba_huangzj/article/details/50455543

Introduction:

Since this has become a problem, it is necessary to first understand the waiting status of SLEEP_BPOOL_FLUSH. In Microsoft official note: https://technet.microsoft.com/zh-cn/library/ms179984 (v = SQL .105). aspx, only a simple description: When a checkpoint suspends a new I/O release to avoid Disk Subsystem flooding. Obviously, this explanation is insufficient. Therefore, I have rummaged through the blogs and other books of Daniel abroad and summarized them as follows:

This waiting state is directly related to the checkpoint process. checkpoint is mainly used in the buffer (BufferPool) in the memory. Data changes (called dirty pages) occur after being loaded to the memory ), after the checkpoint is triggered, the dirty page is written back from the memory to the data file on the disk.

So it is natural to think of Checkpoint. However, from the perspective of behavior characteristics, it means that your disk subsystem may have performance problems.

Checkpoint introduction:

To understand the SLEEP_BPOOL_FLUSH wait type, you must first understand the Checkpoint. It is a system process triggered by the SQL Server background. You can also manually enter the checkpoint to run it.

This process writes the modified pages of the buffer zone to the data file. This is common in backup. One of the important roles of this process is to speed up database recovery in case of exceptions. When the database fails, SQL Server must restore the database to its normal state as much as possible. SQL Server will use transaction logs for redo or undo operations, and re-append the modifications not written to the data file. If the data page is modified but the data file has not been written, SQL Server must redo the modification. If a previous Checkpoint occurs and these dirty pages are written to data files, this step can be skipped to speed up database recovery. :

After a data page is modified by a transaction, the modification is first recorded in the transaction log (in fact, it is not written into the LDF file but in the memory area called log buffer, and then write it to the LDF file of the disk. This process is represented by WRITELOG and LOGBUFFER wait type ). The corresponding data page in the buffer pool in the memory is marked as a dirty page. When the Checkpoint process is triggered, all dirty pages from the last Checkpoint occurrence to the present will be physically written to the data files on the disk, this process does not care about the status of transactions that cause dirty pages (commit, uncommitted, and rollback ).

Generally, checkpoints are automatically run cyclically by SQL Server (one minute by default ). But it does not mean that it is triggered only after one minute. You can set the running cycle, but do not modify it unless you are sure the root cause of the problem is located. Because the Checkpoint will analyze the current IO request, latency, and other conditions for triggering. This avoids unnecessary high IO overhead.

In SQL Server, there are the following types of checkpoints (detailed descriptions of checkpoints will be described in subsequent articles ):

Internal Checkpoint type: it cannot be configured and is automatically triggered under certain circumstances, such as backup. Automatic Checkpoint type: If the SQLServer configuration is not modified, it will be triggered within one minute. This type can be modified at the instance level, but it can only be modified to a value less than or equal to 1 minute. Manual Checkpoint type: the checkpoint command is initiated through SSMS or other clients. For this trigger, you can enter a number of seconds to specify that the checkpoint must be completed within this number of seconds. This operation is performed at the database level. For example, CHECKPOINT 10 indicates that SQL Server will attempt to execute checkpoint within 10 seconds. Details visible: https://technet.microsoft.com/zh-cn/library/ms188748 (v = SQL .105). aspx indirect Checkpoint Type: this is the library-level option introduced by SQLServer 2012. If the value is greater than 0, it will overwrite the default automatic Checkpoint configuration of a specific database. You can use the following command to achieve this:
Alter database [DATABASE name] SET TARGET_RECOVERY_TIME = [number of seconds or minutes]

As mentioned above, SQL Server will analyze the current system pressure. When it deems that there is no need to perform a Checkpoint, it will kill this process, so as to avoid making the disk subsystem worse. When the Checkpoint is killed, it is recorded in the SLEEP_BPOOL_FLUSH wait type information.

Under normal circumstances, the waiting status should be as close as possible to 0.

Reduce SLEEP_BPOOL_FLUSH wait:

Since there is a problem, it should be solved, even if it may not have much performance problems. In this case, we recommend that you check the configuration first. Do not modify the default configuration unless necessary. You can use the following statement to query the configuration value:

select * from sys.configurations where name ='recovery interval (min)'
The value 0 indicates the default configuration. The unit is minute. The smaller the value, the higher the Checkpoint frequency, and the easier it will lead to SLEEP_BPOOL_FLUSH wait. In addition, frequent use of the CHECKPOINT command in transactions can easily trigger such a wait.

In addition to this situation, the performance of the disk subsystem where the data file is located may also be a problem. As mentioned above, the Checkpoint trigger is to write dirty pages in the buffer to the disk. If the current Disk Load is very large, the Checkpoint operation will be killed frequently, resulting in SLEEP_BPOOL_FLUSH waiting.

Regression topic:

I have introduced the meaning and reason of this waiting state. Now let's look at my problem, because the problem still needs to be solved. After check, the default configuration is correct, and the operation I'm performing is to shrink the data file, so the problem should be to shrink it.

There are three potential problems with shrinking data files:

The contraction logic is to move the data to the partition where the data file is located, because the contraction starts from the last partition of the data file, this operation consumes a lot of time and system resources to move all the data. In this process, SQL Server uses a large amount of CPU resources to determine where data can be moved and how much space can be used to move data, at the same time, a large amount of I/O resources are required to read data from data files and write data to new physical addresses. In addition, if the table does not have a clustered index, the non-clustered index will overwrite the non-clustered index information update because the leaf node records the RID information. Note that "each row of a non-clustered Index" is affected. As you can imagine, this is a very overhead operation. Log File growth: regardless of the current recovery mode, SQL Server records each data movement operation, distribution or recovery of each data page and partition, and changes to each index. This kind of record will increase the system resource overhead of the first problem and cause the log file to increase rapidly. One MVP blog introduced the number of log files required for data file shrinking: Snapshot. Many operations are not affected by fragments. This part can be viewed in Microsoft's White Paper: https://technet.microsoft.com/en-us/library/cc966523.aspx. This section describes the dismounting types of fragments and the fragment scenarios that require attention.

Through the previous analysis, you can check the disk (Common SAS disk) where the historical database of the server is located. It can be preliminarily determined that the disk IO performance is a problem. Because the clustered indexes of all tables have been rebuilt before (no heap table exists), it should be that the data density is high enough. This is the biggest headache. It is impossible to change the disk because of the slow contraction. Even if the disk can be changed, the financial process is not generally cumbersome. Let's think about how to make every read/write operation as small as possible. Source: http://blog.csdn.net/dba_huangzj/article/details/50455543

This is a historical database. At the end of the month (when writing this article), there will be a large number of monthly and annual settlement queries. It is unreasonable to shrink files during frequent use, therefore, this operation is run in idle time (not necessarily at night, mainly depending on the system type and operation period ). In addition, the scale of contraction should be as small as possible. To avoid a large statement, you can use the following statement to perform automatic contraction:

Declare @ SQL nvarchar (1024) declare @ size int = 758000 -- current size, MB is measured in declare @ end int = 1024 -- stop range while @ size >=@ end -- until the stop range is reached, the beginset @ SQL = 'dbcc SHRINKFILE (n'' data file name continues until it reaches the stop range. '', '+ cast (@ size as nvarchar (20) +') '-- print @ sqlexec (@ SQL) set @ size = @ size-500end

The comment-out print statement is used to check whether the command to be executed is correct. This is just an example. You can modify or add other functions as needed. In addition, set @ size = @ size-500 indicates that the code is reduced by 500 MB each time. You can also test the code based on the actual situation. It may be 100 MB/time, but the best and fastest, set it to set @ size = @ size-100.

By adjusting the scale of each contraction and arranging idle time running, you can manually run the Checkpoint from time to time. Although the wait status is still (after all, the disk performance is hard to hit), the contraction progress is still satisfactory.

The most important method is to perform it during idle time on the server. After repeated tests, after, the server maintenance job has not run yet, and the user has been off duty, even if each contraction is 100 GB, it only takes more than one hour.

Despite the unsatisfactory results, readers may want to see how to solve the problem thoroughly. However, it is a formal environment and cannot be easily tried or modified. However, in addition to the preceding method, you can choose another method as needed:

Split the data file and move the file to a disk with low load or high performance. However, this operation requires data merging later. Some databases can be taken offline for a short time. You can move the database to a disk with good performance and then attach the database to shrink it. Actually... Not shrinking is not a good thing. Summary:

SLEEP_BPOOL_FLUSH wait is closely related to the SQL Server Checkpoint process, while the Checkpoint is mainly responsible for writing dirty pages to the disk. Before the Checkpoint is triggered, SQL Server analyzes the current Server load. If the disk subsystem is under too much pressure and the Checkpoint is considered to have to be killed, SQL Server records this status to the SLEEP_BPOOL_FLUSH wait state.

In a normal system, the waiting time should not be long, but it may affect the system performance. Running the CHECKPOINT command too frequently or setting the value of "recovery interval" too low may lead to SLEEP_BPOOL_FLUSH wait. The low performance of the Disk Subsystem of data files also results in such waiting information.

Therefore, check the SQL Server configuration, statements, and disk subsystems when you find that the wait status frequently appears or the wait time is long.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.