"Problem description"
We may find this message in the error log of the database:
A time-out occurred while waiting for buffer latch--type 4, BP 000000097bfdedc0, page 1:19239, stat 0xc00009, database I D:5, allocation unit id:72057615247867904, task 0x0000000005e594c8:0, waittime, flags 0x1018, owning task 0x000000 0000169dc8. Continuing to wait.
Our question is, what is the meaning of this error, under what circumstances will the above error be reported, and how to resolve it?
"Introduction to the Background"
Latch is a data structure used internally by SQL Server to synchronize resource access. Similar to the operating system's critical section or ReaderWriterLock. Latch protects the resources that they want to protect, making access synchronized and orderly. For example, when a thread obtains the exclusive use of the latch of a resource, other threads must wait if they need to access the latch.
In terms of large, there are two kinds of Latch, one is called buffer Latch, the other is called I/O Latch.
Let's start with a brief introduction to I/O Latch. When SQL Server reads a page from a hard disk, it reserves the space for the page in memory. And a bit buf_io in the reserved space is set to 1. If the data is read and written from the hard disk, the bit is set to 0. During the reading of the page from the hard disk, the other threads that also need access to the page are, of course, waiting for the type pageiolatch_sh until read-write is complete and buf_io is set to 0. So if we see a lot of pageiolatch_sh waiting, we can basically conclude that the problem is on disk performance.
Another Latch, called Buffer Latch, is used to protect the data structures in memory, such as index, data pages, and non-leaf pages in the B-tree. When a process needs to read a data page in memory, the process first gets the buffer Latch on that data page. There are various types of latch, including exclusive latch (PAGELATCH_EX) and shared latch (PAGELATCH_SH).
Here's a demonstration of why we need latch. As shown, we have stored two records on page 100.
If there is no latch lock, a process on page 100, insert the following data: Insert VALUES (3, 300), the result is as follows:
At this point, another process to insert the following data on page 100: Insert VALUES (4, 400), because there is no latch lock, so the previous data will be overwritten. Cause data insertion problems.
The right thing to do is to have an exclusive latch lock when the first thread is operating. The second thread must wait until the first thread operation is complete. As shown in the following:
The first record is inserted, then the M_freedata (value 141) and the row pointer (value 126) are modified, while the second INSERT statement waits for the latch state. After the first record is inserted, the exclusive latch lock is released.
The second record begins to be inserted, and the latch lock is also exclusive during insertion to prevent other processes from modifying or reading the page. When finished, the latch lock is also released. The final result. Because of the latch lock, the data can be inserted in an orderly manner.
" LATCH Application Mode "
Latch has the following modes when applying,
- Kp–keep Latch guarantees that the referenced structure cannot be destroyed
- Sh–shared Latch, you need to read the data page
- Up–update Latch You need to change the data page
- Ex–exclusive Latch Exclusive mode, mainly used to write data pages when the need
- Dt–destroy Latch is required to destroy the referenced data structure.
The following table shows the compatibility modes for various latch applications:
Y indicates compatibility, and if two threads are going to read a page, they will go to the SH lock because the SH lock is compatible, and two threads will not interfere with each other. and n indicates that it is incompatible and must wait. Until a previous latch is released.
" LATCH wait Type "
There are three main types of latch waiting.
- Buffer (BUF) Latch. Used to protect an index or data page. Also includes PFS, GAM, Sgam, and IAM data pages. Wait type is pagelatch_* mode
- Non-buffer (NON-BUF) Latch, in addition to the above data structure, other memory structures. Wait type is latch_* mode
- IO Latch, protection of data from disk to page read and write process, wait type is pageiolatch_* mode
We can query the following statement to understand the specific wait type:
SELECT * from Sys.dm_os_wait_stats WHERE wait_type like '%latch% '
"Information Interpretation"
Let's read this piece of information.
A time-out occurred while waiting for buffer latch--type 4, BP 000000097bfdedc0, page 1:19239, stat 0xc00009, database I D:5, allocation unit id:72057615247867904, task 0x0000000005e594c8:0, waittime, flags 0x1018, owning task 0x000000 0000169dc8. Continuing to wait.
Here, type 4 is the kind of wait. With the following query, we can know that the type 4 latch is ex mode.
SELECT * from sys.dm_xe_map_values where name = ' Latch_mode '
BP is a pointer to the BUF structure, and without a memory dump, we have no way to get more useful information from this pointer. Page (1:19,239) is the number of waiting pages. The Stat 0xc00009 is the bit information of the BUF structure. The 0x000004 bit is the buf_io bit we mentioned above. 0xc00009 & 0x000004 = 0, indicating that the page has been fully imported into the BUF structure in memory. The Allocation unit ID is the corresponding allocation unit. A table of a database can have multiple allocation units, usually with the same number of allocation units and the number of indexes on the table.
Using some of the following column queries, we can get their corresponding table names by allocation the Unit ID. (Of course, we can also get the table name through the page's specific information.) With DBCC PAGE command)
-The container_id in the results is partition_id.
SELECT * from sys.allocation_units where allocation_unit_id= ' 72057615247867904 '
--Get object_id
SELECT * from sys.partitions where partition_id = xxxx
--Get Name
SELECT * from sys.objects where object_id = yyyy
Waittime is 300, the unit is seconds, that is, we apply for an exclusive latch lock on page (1:19,239), but applied for 300 seconds (5 minutes), or did not apply.
Here the flags are used to determine whether the latch is superlatch, and for the analysis latch Wait is not useful, so we do not further specific analysis.
The task address and the owning task address are important in memory dump analysis. Especially the owning task, which is the task that blocks other processes from acquiring latch locks.
"How To fix"
Latch application is due to being occupied by other processes. So we need to find out which process occupies the latch that has not been released. Then take the corresponding method to solve. Typically, this is due to disk performance, and if you need to know the exact cause, we may need to analyze the memory dump. For the first latch timeout error, SQL Server generates a memory dump. For subsequent latch Timeout, an error message is reported in the SQL Server error log, but no memory dumps are generated. If you want to change this behavior, that is, require each encounter latch Timeout, to generate a memory dump, we can use Traceflag 838, such as the following command:
DBCC TRACEON (838,-1)
Because it takes a little time to generate a memory dump, it is possible that the latch wait phenomenon disappears when a memory dump is generated, and there is not enough information in the memory dump. Therefore, we may have to crawl memory dumps multiple times to analyze the problem.
"Case Study"
Here is a case study. In the database error log, we found the following error and generated a memory dump.
A time-out occurred while waiting for buffer latch--type 2, BP 0000000088fbfa40, page 1:153568, stat 0xc0010b, database Id:10, allocation unit id:72057594047758336, task 0x000000000c8e2988:0, waittime, flags 0x1a, owning task 0x000000 0005a2f048. Continuing to wait.
The following is an analysis of memory dumps, for the sake of simplicity, the irrelevant call stack is omitted. As we can see, the following process, when acquiring buffer latch, is unable to get the latch lock in time, generating the above error message and generating a memory dump.
0a 00000026 ' ebe09350 00000000 ' 0215d1b8 sqlservr! Cdmpdump::D ump+0x7c
0b 00000026 ' Ebe093a0 00000000 ' 0215dd0c sqlservr! Sqldumperlibraryinvoke+0x1a0
0c 00000026 ' Ebe093d0 00000000 ' 021503f7 sqlservr! Cimagehelper::D ominidump+0x3d4
0d 00000026 ' Ebe095d0 00000000 ' 0206af82 sqlservr!stacktrace+0x82b
0e 00000026 ' Ebe0ab20 00000000 ' 02069f69 sqlservr! Latchbase::D umpontimeoutifneeded+0x19a
0f 00000026 ' Ebe0abf0 00000000 ' 008e7bcc sqlservr! Latchbase::P rintwarning+0x205
Ten 00000026 ' Ebe0ace0 00000000 ' 000e7a29 sqlservr! Latchbase::suspend+0xd12
00000026 ' ebe0b950 00000000 ' 000a0689 sqlservr! Latchbase::acquireinternal+0x1ff
00000026 ' Ebe0b9f0 00000000 ' 000a1792 sqlservr! buf::acquirelatch+0x8d
00000026 ' Ebe0bd10 00000000 ' 000a166c sqlservr! Bpool::get+0xc7
00000026 ' Ebe0bd70 00000000 ' 000a35a0 sqlservr! Pageref::fix+0xbc
00000026 ' Ebe0bdd0 00000000 ' 000a320a sqlservr! Btreemgr::seek+0x44d
00000026 ' Ebe0bff0 00000000 ' 000a2e5d sqlservr! btreemgr::gethpageidwithkey+0x20a
00000026 ' ebe0c070 00000000 ' 000a38e8 sqlservr! Indexpagemanager::getpagewithkey+0xbd
In this case, the owning task address in the error message 0x0000000005a2f048 the analysis of the problem is very important. We can find the process number corresponding to the owning task by owning the address of the task, combined with the memory dump analysis. Its call stack is as follows:
0:175> KM
# CHILD-SP RetAddr Call Site
XX 00000026 ' F027fa68 000007fe ' FCD910DC ntdll! Zwwaitforsingleobject+0xa
00000026 ' F027fa70 00000000 ' 00083b1a kernelbase! waitforsingleobjectex+0x79
00000026 ' F027fb10 00000000 ' 00082d76 sqlservr! sos_scheduler::switchcontext+0x26d
00000026 ' F027ffa0 00000000 ' 00082700 sqlservr! Sos_scheduler::suspendnonpreemptive+0xca
00000026 ' F027ffe0 00000000 ' 000829ac sqlservr! sos_scheduler::suspend+0x2d
00000026 ' f0280010 00000000 ' 000e7e46 sqlservr! Eventinternal<spinlock<153,1,0> >::wait+0x1a8
00000026 ' f0280060 00000000 ' 000e7a29 sqlservr! latchbase::suspend+0x599
00000026 ' F0280cd0 00000000 ' 00645AAC sqlservr! Latchbase::acquireinternal+0x1ff
00000026 ' F0280d70 00000000 ' 00646a49 sqlservr! sqlserverlogmgr::growafile+0x7c
00000026 ' f0281150 00000000 ' 000cb38b sqlservr! Sqlserverlogmgr::reserveandappend+0x1fb
0a 00000026 ' f0281290 00000000 ' 000cbe89 sqlservr! xdesrmreadwrite::generatelogrec+0x567
0b 00000026 ' F02813d0 00000000 ' 000cbc23 sqlservr! xdesrmreadwrite::logbeginxact+0x243
0c 00000026 ' f0281530 00000000 ' 000cbc3f sqlservr! Xdesrmreadwrite::makeactive+0x64
0d 00000026 ' f0281560 00000000 ' 001746c1 sqlservr! xdesrmreadwrite::generatelogrec+0x87
0e 00000026 ' F02816a0 00000000 ' 00174860 sqlservr! Pageref::modifyrow+0xce0
0f 00000026 ' F02818f0 00000000 ' 00176749 sqlservr! Pageref::modifycolumnsinternal+0x1b3
Ten 00000026 ' F0281b50 00000000 ' 00175f71 sqlservr! Indexpageref::modify+0x2fd3
The above call stack shows that this is the growth of the database log file. Because the database log does not grow in time, it causes the thread to have a latch lock for a long time without releasing it, while other processes cannot get the corresponding latch lock in time. But look at the log file's growth setting, each time it grows to 300M. is not growing as a percentage. There seems to be no big problem.
Further review the database's error log, before the warning occurs, SQL Server has a related disk performance warning:
SQL Server has encountered occurrence (s) of I/O requests taking longer than [seconds] on file [F:\sqldata\ LOGS\XXXX_LOG.LDF] in database [XXXX] (10). The OS file handle is 0x0000000000000a2c. The offset of the latest long I/O is:0x000004e6511e00
By collecting disk performance information, we found that at the point of the problem, Avg. Disk Sec/read reached 1.444. Usually this value is about 0.03 more normal. The value of 1.444 is obviously quite biased.
Conclusion
Near this point, disk pressure becomes larger, so disk performance becomes worse. Causes the log file to grow 300M cannot be completed in time. Therefore, the latch lock that the thread owns cannot be freed, while the other process waits for the latch to cause a timeout. The following solutions are available for this issue:
- Replace the better disk. Even under stressful conditions, performance does not become particularly poor.
- Modify the log file growth to a range of 50M per increase. Expect log file growth to be completed quickly.
- Pre-expanding the log file. This way, there will be no growth of log files.
Because the disk cannot be replaced in a short time, the third option is to pre-scale the log file. After adopting this scheme, no more latch timeout occurred.
Of course, for each latch timeout phenomenon, the reasons may vary, but most often are due to disk performance or operating system performance. Sometimes we do not need to analyze memory dumps, but instead look at the operating system's log and database error logs, and may also find the cause.
"References"
- Diagnosing and resolving Latch contention on SQL server:http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/07/05/ Diagnosing-and-resolving-latch-contention-on-sql-server.aspx
- Q&a on latches in the SQL Server Engine http://blogs.msdn.com/b/psssql/archive/2009/07/08/ Q-a-on-latches-in-the-sql-server-engine.aspx
- debugging that latch timeout http://troubleshootingsql.com/2011/08/26/debugging-that-latch-timeout/
Reference Document:
Http://blogs.msdn.com/b/apgcdsd/archive/2012/08/13/buffer-latch-timeout.aspx
Resolution of Buffer Latch timeout