https://troubleshootingsql.com/tag/stack-dump/
Debugging that latch timeout
Posted on August 26, 2011 by Amit Banerjee
My last post on debugging an assertion didn't have any cool debugging tips, since there is not much you can do with an assertion dump unless you have access to private symbols and sometimes even to the source code. In this post, I am not going to disappoint, and I will show you some more cool things the Windows Debugger can do for you with public symbols for a latch timeout issue.
When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of its type generates a mini-dump. If there are further occurrences of the same latch timeout, they are reported as error messages in the SQL errorlog.
Buffer latch timeouts are typically reported using errors 844 and 845. The common reasons for such errors are documented in a KB article. For a non-buffer latch timeout, you will get an 847 error.
Error # | Error message template (from sys.messages)
844 | Time out occurred while waiting for buffer latch -- type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.
845 | Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d.
846 | A time-out occurred while waiting for buffer latch -- type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.
847 | Timeout occurred while waiting for latch: class '%ls', id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.
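If you want to double-check these templates against the build you are running, you can pull them straight out of sys.messages. This query is just a quick sanity check I am adding here (1033 is the English language id):

SELECT message_id, severity, text
FROM sys.messages
WHERE message_id IN (844, 845, 846, 847)
  AND language_id = 1033;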
This is what you will see in the SQL errorlog when a latch timeout occurs.
spid148 Time out occurred while waiting for buffer latch -- type 4, bp 0000000832fe1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003c4f2e8 : 0, waittime, flags 0x1a, owning task 0x0000000003c129b8. Continuing to wait.
spid148 **Dump thread – spid = 148, PSS = 0x000000044dc17bd0, EC = 0x000000044dc17be0
spid148 ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txt
spid148 * Latch timeout
spid148 * Input Buffer bytes –
spid148 *    DBCC CHECKDB with ALL_ERRORMSGS
External dump process returned no errors.
I have only pasted the relevant portion from the errorlog for brevity. As I outlined in my previous blog posts on similar topics, there is a large amount of due diligence that can be done with the help of the Windows event logs and the SQL Server errorlogs before you start up the Windows Debugger to analyze the memory dump on your system. The first few obvious things you will notice are that SPID 148 encountered the issue and was performing a CHECKDB on database ID 120. The timeout occurred while waiting for a buffer latch on a page (the page ID is available in the message above). I don't see a "Timeout waiting for external dump process" message in the SQL errorlog, which means I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.
Latch timeouts are typically victims of either a system-related issue (hardware, drivers or the operating system) or a previous error encountered by SQL Server. So the next obvious action item is to look into the SQL errorlogs and find out whether there were any additional errors prior to the latch timeout issue. I see a number of OS error 1450 messages reported by the same SPID 148 for the same file handle but different offsets.
spid148 The operating system returned error 1450 (Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.
Additionally, I see that prior to and after (within 5-15 minutes of) the latch timeout issue, multiple other SPIDs reported the same 1450 error message for different offsets, but again on the same file.
spid137 The operating system returned error 1450 (Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.
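If you want to pull every occurrence of that 1450 message out of the errorlog without scrolling through it, the undocumented xp_readerrorlog procedure can filter on a search string. Since it is undocumented, treat the parameter list below (log number, log type, search strings) as a rough sketch whose behavior may vary between builds:

-- 0 = current errorlog, 1 = SQL Server errorlog (2 would be the Agent log)
EXEC master.dbo.xp_readerrorlog 0, 1, N'1450', N'operating system returned';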
I also see the latch timeout message being reported at regular intervals for the same page and the same database.
spid148 Time out occurred while waiting for buffer latch -- type 4, bp 0000000832fe1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003c4f2e8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003c129b8. Continuing to wait.
Notice the waittime above: it has climbed all the way to 82800!! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (yes, I do use Bing!!). These are the relevant posts related to the above issue.
http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx
By now it is quite clear that the issue is related to a possible sparse file problem caused by file fragmentation. Now it is time for me to check whether there are any threads in the dump waiting on FCB::SyncWritePreemptive calls.
Use the location provided in the errorlog snippet reporting the latch timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).
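You can open it through File > Open Crash Dump in WinDbg, or straight from the command line if windbg.exe is on your path; the dump path below is simply the one reported in this errorlog:

windbg -z "D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp"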
Now if you load the dump using WinDBG, you'll see the following information:
Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available
Comment: 'Stack Trace'
Comment: 'Latch timeout'
This dump file has an exception of interest stored in it.
The above tells you that this is a mini-dump for a latch timeout condition, along with the location from which you loaded the dump. Then I use the following command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.
.sympath srv*d:\publicsymbols*http://msdl.microsoft.com/download/symbols
Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+S (File > Symbol File Path) and providing the complete string above (without .sympath), checking the Reload checkbox and clicking OK. The only difference is that symbols for all the loaded modules in the dump will then be downloaded from the Microsoft symbol server, where available.
.reload /f sqlservr.exe
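On more recent debugger versions, .symfix can do the same job in one step: it points the symbol path at the Microsoft symbol server and uses the directory you pass it (the same local cache as above) as the downstream store.

.symfix+ d:\publicsymbols
.reload /f sqlservr.exe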
The next thing to do is verify that the symbols were loaded correctly using the lmvm sqlservr command. If the symbols were loaded correctly, you should see output like the following. Note the (pdb symbols) tag and the path pointing to the local symbol cache.
0:005> lmvm sqlservr
start             end                 module name
00000000`01000000 00000000`03668000   sqlservr T (pdb symbols)   d:\publicsymbols\sqlservr.pdb\2a3969d78ee24fd494837af090f5edbc2\sqlservr.pdb
If the symbols were not loaded, then you will see output like the one shown below.
0:005> lmvm sqlservr
start             end                 module name
00000000`01000000 00000000`03668000   sqlservr   (export symbols)   sqlservr.exe
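If you end up in that second state (export symbols only), turning on noisy symbol loading before the reload will usually tell you why the PDB could not be found or was rejected:

0:005> !sym noisy
0:005> .reload /f sqlservr.exe
0:005> !sym quiet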
I will use the !findstack command to locate all the threads which have the function call FCB::SyncWritePreemptive on their callstack.
0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0
Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match
0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3              ret
0:081> kL100
ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites
You can get the full callstacks containing the function that you are searching for using the command: !findstack sqlservr!FCB::SyncWritePreemptive 3
If you switch to the thread that ended up raising the latch timeout warning, you will find that it, too, was performing a CHECKDB.
0:074> .ecxr
0:074> kL100
kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stacktrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch
...
...
sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute
If you cannot find the thread which raised the latch timeout warning, you can print out all the callstacks in the dump using ~*kL100 and then search for the latch warning frames shown above (sqlservr!LatchBase::PrintWarning). It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation, as reported in the SQL errorlog in the input buffer for the latch timeout dump.
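An alternative to eyeballing every stack is to point !findstack at one of the frames from the warning stack itself, for example:

0:074> !findstack sqlservr!LatchBase::PrintWarning 0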
In case you are not able to identify the issue right off the bat, you need to check the build that you are on and look for issues that were addressed related to latch timeouts for the SQL Server release that you are using. The Symptoms section of such a KB article will have a sufficient amount of information for you to compare against your current symptoms and scenario and determine whether the KB article you are looking at is applicable in your case.
Now is the time when you need some context about the operations that were occurring on the server to determine what the actual issue is. Based on what I heard from the system administrators, a CHECKDB was being executed on the database while the application was executing DML operations against it. Additionally, the volume on which the database files reside has fragmentation, and the database in question is large (>750 GB).
Based on the two MSDN blog posts that I mentioned above, it is quite clear that it is possible to run into sparse file limitations when there is a high amount of fragmentation on the disks, or when there are a large number of concurrent changes occurring on the database while a CHECKDB is executing on it. A number of Windows and SQL Server updates, along with workarounds for running CHECKDB on such databases, are mentioned in the second blog post above. On a separate note, this is not an issue with CHECKDB! It is a limitation that you are hitting with sparse files at the OS layer. Remember that CHECKDB, starting from SQL Server 2005, creates an internal snapshot (a sparse file) to execute the consistency check. Paul Randal's tweet made me add this line to call this out explicitly!
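One of the workarounds discussed in those posts is to take the internal snapshot out of the picture altogether. The command below is only an illustrative sketch (the database name is a placeholder, and TABLOCK blocks concurrent DML and skips some of the checks), so read the second blog post above before using it on a production system:

-- TABLOCK makes CHECKDB use table locks instead of an internal
-- database snapshot, so no sparse file is created.
DBCC CHECKDB (N'YourLargeDatabase') WITH ALL_ERRORMSGS, NO_INFOMSGS, TABLOCK;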
As always... if you are still stuck, contact Microsoft CSS with the mini-dump file, the SQL errorlog and the Windows event logs. It is quite possible that CSS will ask you to collect additional data, as most latch timeout issues are generally an after-effect of a previous issue. In this case, it was the OS error 1450.
Well... that's it for today! Hope this is helpful the next time you encounter a latch timeout issue.