Detecting and solving latency and blocking I/O problems in SQL Server 2000 SP 4 released on: 7/13/2005 | updated on: 7/13/2005
Robert Dorr
Microsoft Corporation
Abstract:Special columnist Robert Dorr explores how reporting tools in SQL Server 2000 Service Pack 4 significantly reduce the time spent identifying and identifying the root cause of latency and blocking I/O operations.
Content on this page
|
Introduction |
|
Record and report |
|
Performance and scheduled operations |
|
Indexing and Parallelism |
|
Actual Example from Microsoft SQL Server Support |
|
Summary |
|
References |
Introduction
Database management systems such as SQL Server rely on timely file input/output operations. Failure or misconfiguration of hardware, firmware settings, filter drivers, compression, program errors, and other conditions in the I/O path may cause blocking or delay of I/O problems, and it will soon have a negative impact on SQL server performance.
The effects of the above problems on SQL Server vary greatly depending on the problem details, but they usually lead to blocking, lock contention and timeout, long response time, and excessive resource utilization.
Blocking I/O is an I/O Request (usually an I/O Request Packet (IRP) that must be performed by external intervention )). This usually requires a complete system restart or similar operation to solve the problem, and strongly indicates that the hardware is faulty or there is a program error in the I/O path component.
Latency I/O refers to the I/O requests that can be completed without intervention but take longer than expected (likewise, this is usually IRP ). This situation is typically caused by hardware configuration, firmware settings, or filter driver intervention that requires assistance from hardware or software vendors for tracking and resolution.
SQL Server 2000 SP4 contains the database and log file I/O (read and write) logic to detect latency and congestion. When the I/O operation has not been completed for 15 seconds or longer, SQL server detects and reports this situation. The following messages are recorded in the SQL Server Error Log:
2004-11-11 00:21:25.26 spid1 SQL Serverhas encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:/SEDATA/stressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is: x00000000022000".
This message indicates that the current workload requirement exceeds the I/O path or the current system configuration and function, or the I/O path contains software (firmware and driver) that cannot work normally) or hardware components.
The error information recorded provides the following information:
• |
### Occurrences-Number of I/O requests that fail to complete read or write operations within 15 seconds. |
• |
File Information-complete file name, database name, And dbid of the affected file. |
• |
File handle-the operating system handle of the file. You can use the debugger and other utilities to track IRP requests. |
• |
Offset-the offset of the previous blocked or delayed I/O. You can use the debugger and other utilities to track IRP requests. (Note: When recording the message, the I/O may not be blocked or delayed .) |
For more information about SQL Server 2000 I/O mode, seeSQL Server 20I/o Basics.
Back to Top
Record and report
I/O reports and records are executed according to files. Latency and blocking I/O Request detection and reporting are two different operations.
Detection (record) is processed in two locations within SQL Server. The first position is when I/O is actually completed. If the request takes more than 15 seconds, a record operation occurs. The second location is when the writer process is delayed. When the latency writer executes, it includes new operations for checking all pending data and log file I/O requests. If the latency exceeds the 15-second threshold, the record operation will occur.
Reports are executed at intervals of no less than 5 minutes. The next I/O Request to a file is reported. If the record operation has occurred and has elapsed 5 minutes or longer since the previous report, write a new report (the error message shown above) to the error log ).
The threshold value of 15 seconds cannot be adjusted currently. Although this is not recommended, you can use trace flag 830 to completely disable latency and blocking I/O detection. Set startup parameters during SQL server startup-T830You can disable latency/blocking I/O detection. UseDBCC traceon (830,-1)You can disable the SQL server instance that is currently running. DBCC traceon takes effect only when SQL Server is restarted.
NoteA given I/O request that is delayed or blocked will only be reported once. If 10 I/O messages are reported to be delayed, these 10 reports will not happen again. If 15 I/O messages are blocked in the next message, 15 new I/O requests are delayed.
Back to Top
Performance and scheduled operations
Overall system performance may play a key role in I/O processing. When studying reports that delay or block I/O, consider the overall operating status of the system. Excessive load may slow the entire system (including I/O processing. The behavior of the system when a problem occurs may be the key to determining the root cause of the problem. For example, if the CPU usage becomes high or remains high when a problem occurs, it may indicate that a process in the system is consuming so much CPU time, so that it has a negative impact on other processes in various ways.
View performance countersAverage disk SEC/TransferAndAverage disk Queue LengthOrCurrent disk Queue LengthTo obtain specific I/O path information. For exampleAverage disk SEC/TransferUsually less than 15 ms. If this value increases, it may indicate that the I/O subsystem cannot meet the I/O requirements.
Keep in mind that SQL Server makes full use of Windows's asynchronous I/O function and violently expands the disk queue length. Therefore, the above-mentioned performance counters do not have a high value itself.
Back to Top
Indexing and Parallelism
One of the most common scenarios is that a large amount of I/O bursts occur because of the loss of indexes and the resulting pressure on the I/O system caused by scanning, hashing, and sorting. Running "index turning wizard" Once helps solve the system's I/O pressure. If you can add an index to help you avoid table scanning, sorting, or hashing, the system has the following advantages:
• |
Reduces the physical I/O required to complete the operation, which is directly equivalent to improving the query performance. |
• |
Only a small number of pages in the data cache must be turned around, so those pages in the cache can always be related to active queries. |
• |
Avoid unnecessary sorting and hashing |
• |
Reduce tempdb utilization and contention |
• |
Reduces resource utilization and/or parallel operations. SQL Server cannot ensure that the server considers parallel query execution and system load when determining whether to parallelize the query. Therefore, you 'd better optimize all the queries for serial execution. In Q/Max degree of ParallelismSet to 1 to forcibly adjust the worst case where no parallel plan is received from the server. If it is confirmed in the test environment that the query can be executed in serial mode efficiently, parallel plans in the production environment can provide unexpected performance improvements. However, in many cases, SQL Server chooses parallel execution because the absolute amount of data to be traversed is too large. This data volume is usually directly affected by the index. For example, if an index is lost, a large number of sorting operations may be performed. We can easily see how the response speed of multiple auxiliary processes that execute sorting operations is faster than processing sorting in serial mode, but we need to know, this operation may significantly increase the pressure on the I/O system. When multiple secondary processes run concurrently, large read requests from multiple secondary processes may cause I/O bursts and improve CPU utilization. In many cases, if an index is added or other adjustment operations occur, you can adjust the query to make it run faster and use fewer resources. This not only improves the performance of related queries, but also improves the overall performance of the system. |
Back to Top
Actual Example from Microsoft SQL Server Support
Microsoft SQL Server and platforms escalation support have processed the following solutions designed to provide a reference framework, it also helps establish expectations about latency and congestion I/O conditions and how the system may be affected. There is no special hardware or driver assembly that brings any special or higher risk to other software and hardware; in this respect, all systems are the same.
Example1-Blocking45Log write operations in seconds
An attempted SQL server log file write operation is periodically blocked for 45 seconds. This log write operation cannot be completed in time, resulting in blocking, resulting in 30 seconds of client query timeout.
The request is submitted and blocked (log writing is suspended). As a result, the query continues to occupy the lock and the incoming requests from other clients are blocked. Other clients start to time out and make the problem complex, because the application is not designed to roll back unresolved transactions when a timeout occurs. This will cause hundreds of unresolved transactions to occupy locks and cause serious blocking. (For more information about transaction processing and blocking, seeINF: understanding and resolving sqlserver 7.0 or 2000 blocking Problems). The application uses a connection pool to maintain the web site. Therefore, as more connections are blocked, the Web site creates more connections, and these connections are blocked, this cycle will continue.
The log write operation will be completed in about 45 seconds, but hundreds of connections have accumulated so far, leading to blocking problems, it takes several minutes for SQL Server and applications to be restored. When combined with Application Problems, latency I/O conditions will have a very negative impact on the system.
Solution:This is due to the delay in I/O requests in the HBA driver. The computer has multiple HbA cards with Failover Support. The failover timeout value is set to 45 seconds. When an HbA lags behind or does not communicate with the San within 45 seconds or longer, the I/O request is routed to the second HbA for processing and will be completed soon. Recommended failover for hardware products is set to 5 seconds to avoid such latency.
If SQL Server 2000 SP4 has a new function to automatically report the problem, we can quickly learn it during the troubleshooting process, the basic problem is that I/O operations are blocked or delayed due to external issues of SQL Server. In fact, we spent a lot of time solving a problem that was initially presented as a common performance problem.
Example2-Filter driver intervention
Many anti-virus software and backup products use the I/O filter driver. These filter drivers are part of the I/O Request stack and can access IRP requests. Microsoft technical support has encountered various problems-from blocking I/O errors to latencies in the filter driver implementation.
Among them, Microsoft SQL Server Technical Support Department encountered a situation that involves the filter driver used for backup processing (this process can back up files that are open at the time of backup. The system administrator mistakenly included the SQL server data file directory in the file backup selection. When a backup occurs, it tries to collect consistent images of files at the beginning of the backup. When this operation is completed, it will delay subsequent I/O requests so that they can be completed one by one when the software processes them.
When the backup starts, the performance of SQL Server decreases sharply because I/O for SQL Server is forced to be completed one by one. To make this problem more complex, the characteristics of a single I/O logic make I/O generally unable to be executed asynchronously. Therefore, when SQL Server expects to send an I/O Request and continue to work, the ums auxiliary process blocks reading or writing calls until I/O is complete. The SQL Server pre-read function is actually disabled by the filter driver. Moreover, even when the backup is complete, another program error in the filter driver keeps the single I/O behavior unchanged. The only way to restore SQL server performance is to shut down the database and re-open it or restart SQL Server to release and re-obtain the file handle when the current filter driver interaction is not ready.
Solution: Remove the data files of SQL Server from the file backup process, and solve the program error that causes the files to be placed in single I/O mode in the filter driver.
At this time, if we already have the SQL Server 2000 SP4 function to report delayed I/O operations, we can quickly know what the basic problem is during the troubleshooting process.
Example3-Hidden errors
Many high-end systems have multi-channel I/O paths and similar tools used to handle load balancing. Microsoft SQL Server technical support has seen the use of such software. Although I/O Requests fail, the software correctly handles the error and has executed numerous retries. I/O is blocked, and SQL Server cannot complete the specified operation. Similar to the log write status described above, many bad system behaviors occur after such a situation has a negative impact on the system.
Solution:In similar cases, restarting SQL server can alleviate the problem to some extent. However, you sometimes need to restart Windows to restore the process to normal. Of course, the program errors in the I/O subsystem will eventually be solved by the I/O supplier.
SQL Server 2000 SP4's new automatic reporting of such conditions makes it easier to detect similar problems. We can not only see the overall performance decline of the entire server, but also understand the nature of the problem through the new messages recorded by SP4, and it is likely that the problem is out of the SQL Server.
Example4-Remote Storage/Images/RaidDriverTool
Many systems use images or similar technologies to help prevent data loss. Some systems are software-based, while others are hardware-based. Microsoft SQL Server's Technical Support Department often encounters a situation related to these systems that results in increased latency.
When I/O for an image must be successfully completed before I/O operations are deemed to have been completed, this will obviously increase the overall I/O time. Network Delay and retry may be an adverse factor for remote image installation. When a drive failure occurs and the raid subsystem is regenerated, the I/O throughput may be affected.
Solution:In similar cases, we generally recommend that you use strict configuration settings (which varies with vendors and devices) to reduce image latency and raid regeneration.
RAID system overhead and latency may cause I/O to slow down, and SQL Server cannot do anything about it. Like any other application, it is the client for raid hardware and drivers. When this type of problem causes excessive server speed reduction, the new latency and blocking I/O reporting feature in SP4.
Example5-Compression
Microsoft does not support SQL Server 7.0 or 2000 data and log files on the compressed drive. NTFS compression is insecure, not only because it breaks the wal protocol, but also because it requires more processing for each I/O Request. Asynchronous I/O is disabled for compression, so that all SQL server I/O with affected data or log files are executed synchronously.
Solution: In this case, we always recommend that customers extract their data and log files.
NTFS compression may slow I/O, and SQL Server cannot do anything about it. Like any other user-mode application, it is the client of the file system. When compression adversely affects SQL server I/O operations, the new latency and blocking I/O reporting feature in SP4.
Additional data points
The wait type information provided in the system process may help diagnose I/O bottlenecks. The buffer I/O latch wait type and write log wait are key indicators for investigating I/O path performance. Microsoft Knowledge Base822101: The waittype and lastwaittype fields in the sysprocesses tableThis section describes the waiting types and detailed the I/O waiting types related to diagnostic latency or blocked I/O conditions.
Back to Top
Summary
Although blocking and latency I/O problems are rare in SQL Server deployment, they are historically hard to solve once they occur. Because the root cause of this type of problem usually exists in drivers or hardware devices, it may take a lot of time to investigate and solve this type of problem and require professional skills beyond the capabilities of typical database administrators. Using new tools in SQL Server 2000 SP4 can significantly reduce the time required to solve such problems, and at least provide DBA with the correct direction.
Back to Top
References
231619: How to Use the sqliostress utility to stress a disk subsystem such as SQL Server
826433: PRB: Additional SQL Server diagnostics added to detect unreported I/O Problems
230785: INF: SQL Server 7.0 and SQL Server 2000 logging and data storage algorithms extend data reliability
SuitableSender'sSQL Server
Robert DorrHe has been a senior SQL Server upgrading engineer in the Microsoft SQL Server Technical Support Department since 1994. Robert has published various articles on the SQL Server engine on msdn and Microsoft technet. He is also the author of SQL pass demonstrations and key SQL Server utilities (read80trace and sqliostress. Robert lives with his wife and children on the outskirts of Dallas, Texas.
Go to the original English page