The dead loop in win32k

Citation note: Zhang Pei. Original: www.YiiYee.cn/blog

This was the first issue I worked on at my new company. A frontline engineer came to me and asked whether I had any interest in looking at an urgent issue. He was organizing a team of people to help him, and I became one of them.

The problem was indeed urgent and had already affected the production line. It was the Qingming holiday at the time, so many people faced the prospect of working overtime through it. In fact, the problem had first been reported several months earlier, but the frontline engineers had handled it with a policy of appeasement, using various workarounds to keep the occurrence rate at a level the customer could accept. That policy worked for a long time, but recently it suddenly failed, and the failure rate soared to 20%. The frontline engineer was really worried.

Problem characterization

Having a whole team debug the same problem was something I had never experienced before. The initial idea, of course, is that more people means more power. In practice, though, you can also end up with too many cooks spoiling the broth.

There was disagreement over how to characterize the issue. According to the frontline engineer, there is no blue screen when the problem occurs. Colleagues who had debugged this issue earlier said that when the problem occurs, only one process and one thread are alive, all other processes are blocked (except for the idle process), and, most crucially, it is still possible to single-step in WinDbg.

It is completely incorrect to characterize the issue as a system crash, BSOD, or system exception. If it were one of those, the CPU would be stopped. But here the CPU is alive and can be single-stepped. Moreover, there is no blue screen, so calling it a BSOD is even more wrong.

So what kind of problem is it? From the description, the system is alive but has stopped responding. I therefore characterized the issue as a system software hang. There are several possible causes, such as a system process suddenly misbehaving so that the UI stops responding. Or, as in this case, the kernel itself is trapped inside a dead loop and cannot handle any other work.

Analyzing the problem

On the first day I only got the dump file. Only one system was available for live debugging, so it was hard to coordinate its use. Once I had the dump file, the first thing I did was look at the only live thread. In this kind of issue the problem is very concentrated, so I was confident I would find useful clues quickly.

    ChildEBP RetAddr  Args to Child
    85846dd0 8fd95b75 ffffffff 85846ef0 nt!KeClockInterruptNotify+0x28a
    (Inline) -------------------------------- hal!HalpTimerClockInterruptEpilogCommon+0xa
    85846de0 00000000 000000d1 00000000 hal!HalpTimerClockInterruptCommon+0x3e
    85846de0 00000000 000000d1 00000000 hal!HalpTimerClockInterrupt+0x1cb
    85846ef0 80c66050 8584762c 00000002 win32k!ENUMAREAS::ENUMAREAS+0xb9
    8584711c 80da5d98 80da5d98 00000000 win32k!bSpBltScreenToScreen+0x2f8
    858474dc 80da5d98 80da5d98 00000000 win32k!SpBitBlt+0x2bc
    85847510 80da5d98 80da5d98 00000000 win32k!SpCopyBits+0x27
    85847654 000006ce 000000a0 00000027 win32k!NtGdiBitBltInternal+0xa39
    85847700 80d440a8 80c7d5e0 80c7d598 win32k!zzzBltValidBits+0xc6557
    85847768 85847b18 85847ad0 80dac008 win32k!xxxEndDeferWindowPosEx+0x20b
    858477a8 00000000 8312c5a0 80dac008 win32k!xxxProcessDesktopRecalc+0x10b
    858477e0 d4b8b27d 85847d00 80dac008 win32k!xxxProcessEventMessage+0x7a
    ...
    85847d3c 0118fb90 00000000 00000000 nt!KiSystemServicePostCall

The top of the call stack contains clock-interrupt handling functions. When a hardware interrupt occurs, the CPU preempts the currently running thread and switches to the ISR. That is perfectly normal, so no "system exception" is happening here. Once the interrupt-related frames are set aside, the innermost function on the call stack is win32k!ENUMAREAS::ENUMAREAS.

I read through the assembly code of this function and found a dead loop in it! Of course, not every loop is inherently a "dead loop"; a loop only becomes one when it is in a particular state. The assembly shows that win32k is executing inside a loop, and this loop is entered only once. My colleague gave me a useful piece of information: every time the problem occurs, this same function is executing. Putting the two together, it can be concluded that this loop is a "dead loop."

But when I brought this finding to the team for discussion, I got the cold shoulder. No one believed it or was willing to discuss this dead loop with me. A colleague from the debug team did not look at my analysis at all, but asked me the same question twice: are you sure this is a dead loop? It left me quite depressed.

The next day, when the live debug environment was set up again and I could get hands-on, the first thing I did was spend some time in this function to confirm that it really was looping through the four assembly instructions I had found. Sure enough, no matter how many times I pressed F10, the instruction pointer never left those four instructions (except when a clock interrupt preempted it). I was greatly relieved, because I had still been afraid that it might not be a dead loop after all, which would have been humiliating.

The logic of the dead loop is simple: just four assembly instructions. Two jump instructions jump back and forth to each other, which is a typical while loop.

    901b7b0f 394104  cmp  dword ptr [ecx+4],eax
    901b7b12 7e5e    jle  win32k!ENUMAREAS::ENUMAREAS+0xb6 (901b7b72)
    901b7b72 034908  add  ecx,dword ptr [ecx+8]
    901b7b75 eb98    jmp  win32k!ENUMAREAS::ENUMAREAS+0x53 (901b7b0f)
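The claim that the two jumps target each other can be checked by hand from the encoded bytes. The following C sketch is an illustration, not part of the original analysis, and the function name short_jump_target is made up here. It relies on the standard x86 encoding of two-byte short jumps (7e is jle rel8, eb is jmp rel8), where the target is the address of the next instruction plus a signed 8-bit displacement.

```c
#include <stdint.h>

/* Target of an x86 two-byte short jump (one opcode byte followed by a
   signed 8-bit displacement): address of the next instruction + rel8.
   The function name is an assumption made for this sketch. */
uint32_t short_jump_target(uint32_t insn_addr, uint8_t rel8)
{
    /* Sign-extend the displacement without relying on
       implementation-defined integer conversions. */
    int32_t disp = (rel8 >= 0x80) ? (int32_t)rel8 - 256 : (int32_t)rel8;

    /* Unsigned arithmetic wraps modulo 2^32, like a 32-bit address. */
    return insn_addr + 2u + (uint32_t)disp;
}
```

Feeding in the two jumps from the listing, short_jump_target(0x901b7b12, 0x5e) gives 0x901b7b72 (the add instruction) and short_jump_target(0x901b7b75, 0x98) gives 0x901b7b0f (the cmp instruction), so the four instructions do form a closed loop.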

It takes only a few minutes to translate these four instructions back into C. Because we already have the private symbols for win32k, the result is very readable.

    while (p->yBottom <= this->yBoundsTop) {
        p = (SPRITESTATE *)((char *)p + p->sizeScan);
    }

This is an ordinary loop. But if p->sizeScan is 0 in the loop body, p never advances, the loop condition never changes, and the loop becomes a dead loop. That is the cause.
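To see why a zero stride stalls the loop, here is a minimal user-mode sketch in C. The names SCAN_RECORD, walk_scans, demo_healthy and demo_stuck, the step cap, and the yTop values are all assumptions made for illustration; none of this is win32k code. Only the field order (yBottom at offset 4, the stride at offset 8) and the shape of the loop follow the disassembly.

```c
#include <stdint.h>

/* Hypothetical stand-in for the record the loop walks. */
typedef struct SCAN_RECORD {
    int32_t  yTop;
    int32_t  yBottom;
    uint32_t sizeScan;   /* byte distance to the next record */
} SCAN_RECORD;

/* Same loop shape as the decompiled code, plus a step cap so that the
   demo terminates. Returns the number of steps taken, or -1 if the cap
   was reached (the uncapped loop would have kept spinning). */
int walk_scans(const SCAN_RECORD *p, int32_t yBoundsTop, int max_steps)
{
    int steps = 0;
    while (p->yBottom <= yBoundsTop) {
        if (steps == max_steps)
            return -1;
        p = (const SCAN_RECORD *)((const char *)p + p->sizeScan);
        steps++;
    }
    return steps;
}

/* Healthy data: the second record ends below the bound, so the loop exits. */
int demo_healthy(void)
{
    SCAN_RECORD recs[2] = {
        { 0,    1200, (uint32_t)sizeof(SCAN_RECORD) },
        { 1200, 2000, (uint32_t)sizeof(SCAN_RECORD) },
    };
    return walk_scans(recs, 1780, 1000);
}

/* Values seen in the hang: yBottom = 1200, yBoundsTop = 1780, stride 0.
   The yTop value is made up. */
int demo_stuck(void)
{
    SCAN_RECORD rec = { 0, 1200, 0 };
    return walk_scans(&rec, 1780, 1000);
}
```

Here demo_healthy() leaves the loop after one step, while demo_stuck() keeps re-reading the same record until the cap is hit and returns -1.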

Guess, guess, and guess

Every senior debug engineer shares one skill: guessing. When debugging closed-source code, guessing is the ultimate weapon. Of course, you cannot guess wildly, or you end up with nothing but baseless suspicion.

I found that when the dead loop occurs, yBottom is 1200 and yBoundsTop is 1780. The value 1200 caught my eye because the target platform's resolution is 1800x1200. The variable names revealed by the private symbols added to my confidence. So my first guess was that 1200 is the height of the screen.

What about 1780? Given the win32k class name ENUMAREAS and the variable name yBoundsTop, I guessed that it is the y-coordinate of the top-left corner of a window.

Keep guessing. Looking closely at the function names on the call stack gives a rough idea of what the thread is doing: BitBlt is a drawing operation performed through the GDI interface.

Now a reasonable guess can be made: when the problem occurs, a window has been moved outside the screen, and when that window tries to refresh its UI, there is some probability that win32k falls into a dead loop.

The next step was to propose a workaround based on this conjecture: avoid moving any window outside the screen during the test.

The frontline engineer was still uneasy after hearing this and added one more item to the workaround: avoid minimizing any window.

When the workaround was reported to the customer, the customer tested it immediately, and over thousands of test runs the problem never occurred again. A month later, the issue still has not recurred.
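The geometric part of this guess can be written as a small predicate. This is only a sketch of the conjecture: the Rect type and the is_fully_offscreen function are hypothetical names introduced here, and nothing is known about how win32k itself performs such a test.

```c
#include <stdbool.h>

/* Hypothetical rectangle type for this sketch (not the Win32 RECT). */
typedef struct {
    int left, top, right, bottom;   /* right and bottom are exclusive */
} Rect;

/* True when the window rectangle shares no pixel with a screen of the
   given width and height whose origin is at (0, 0). */
bool is_fully_offscreen(Rect win, int screen_w, int screen_h)
{
    return win.right <= 0 || win.bottom <= 0 ||
           win.left >= screen_w || win.top >= screen_h;
}
```

On an 1800x1200 screen, any window whose top edge is at y = 1780 satisfies the predicate, because 1780 is already past the last visible row.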

Technical details

When the facts were established, everyone was amazed. win32k is a very stable OS module, so if this really is a win32k bug, its impact must be far-reaching. The issue has now been reported to Microsoft; Microsoft engineers are still analyzing it and have acknowledged that it is a system bug. However, they are more inclined to think that it is not a bug in win32k itself, but that the parameters passed to win32k were corrupted. We will have to wait for the final conclusion.

If I were the Microsoft engineer, I would pull up the implementation of each relevant function on the call stack and analyze it carefully. Comparing the failing case against the normal case would show whether the issue is caused by the code logic or by a bad parameter. If it is a bad parameter, the next step is to determine whether it was already bad when it was passed in or was corrupted afterwards, and so narrow the problem down step by step.

    while (p->yBottom <= this->yBoundsTop) {
        p = (SPRITESTATE *)((char *)p + p->sizeScan);
    }

Here, this is of type ENUMAREAS, which is partially defined as follows:

    win32k!ENUMAREAS
       +0x000 iDir
       +0x004 xBoundsLeft
       +0x008 yBoundsTop
       +0x00c xBoundsRight
       +0x010 yBoundsBottom

p points to a structure of type SPRITESTATE, whose first few fields are defined as follows:

    _SPRITESTATE
       +0x000 yTop
       +0x004 yBottom
       +0x008 sizeScan

My interpretation: p points into an array of variable-length SPRITESTATE structures, where the member sizeScan gives the distance in bytes to the next structure. When the problem occurs, p has reached the last structure in the array, whose sizeScan is 0. Because the loop has no guard against this case, p stays on that last structure forever, and this dead loop, remarkably, hangs the whole system.
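If this interpretation is right, a defensive version of the walk would treat a zero stride as an error instead of retrying forever. The sketch below shows the idea in user-mode C. It is not how win32k is or should be written, and ScanHeader, WalkResult, find_scan and put_scan are hypothetical names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header of one variable-length record. */
typedef struct {
    int32_t  yTop;
    int32_t  yBottom;
    uint32_t sizeScan;   /* bytes from this record to the next one */
} ScanHeader;

typedef enum { WALK_FOUND, WALK_ZERO_STRIDE, WALK_OUT_OF_BOUNDS } WalkResult;

/* Walk the records in buf[0..len) until one has yBottom > yBoundsTop.
   Unlike the decompiled loop, stop with an error on a zero stride or
   when the next header would fall outside the buffer. */
WalkResult find_scan(const unsigned char *buf, size_t len,
                     int32_t yBoundsTop, size_t *found_offset)
{
    size_t off = 0;
    for (;;) {
        ScanHeader h;
        if (len - off < sizeof h)
            return WALK_OUT_OF_BOUNDS;
        memcpy(&h, buf + off, sizeof h);   /* avoids unaligned reads */
        if (h.yBottom > yBoundsTop) {
            *found_offset = off;
            return WALK_FOUND;
        }
        if (h.sizeScan == 0)
            return WALK_ZERO_STRIDE;       /* the dead-loop case */
        if (h.sizeScan > len - off)
            return WALK_OUT_OF_BOUNDS;
        off += h.sizeScan;
    }
}

/* Demo helper: write one header at `off` and return the next offset. */
size_t put_scan(unsigned char *buf, size_t off,
                int32_t yTop, int32_t yBottom, uint32_t sizeScan)
{
    ScanHeader h = { yTop, yBottom, sizeScan };
    memcpy(buf + off, &h, sizeof h);
    return off + sizeScan;
}
```

With a last record built as put_scan(buf, 0, 0, 1200, 0) and a bound of 1780, find_scan returns WALK_ZERO_STRIDE instead of spinning. Whether the real fix belongs in the loop or in whatever produced the zero stride is exactly the question the Microsoft engineers were still investigating.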
