Citation note: Zhang Pei. Original: www.YiiYee.cn/blog
This was the first issue I handled at my new company. A front-line engineer came to me, said there was an urgent issue, and asked whether I was interested in taking a look. He was organizing a team of people to help him, and I became one of them.
The problem was indeed urgent and had already affected the production line. It was the Qingming holiday at the time, so a lot of people faced the prospect of working overtime through it. In fact, the issue had been reported several months earlier, but the front-line engineers had handled it with a policy of appeasement, using various workarounds to bring the failure rate down to a level the customer would accept. That policy worked for a long time, but it had recently stopped working and the failure rate had soared to 20%. The front-line engineer was genuinely worried.
Problem characterization
Having a whole team debug the same problem was something I had not experienced before. The idea, of course, is that more people means more power. In practice, though, you can run into the too-many-cooks predicament.
We disagreed on how to characterize the problem. According to the front-line engineer's description, there is no blue screen when the problem occurs. Colleagues who had debugged this issue before said that when it occurs, only one process and one thread are still running and every other process is blocked (except for the Idle process). Most crucially, it is still possible to single-step in WinDbg.
Some people characterized the issue as a system crash, a BSOD, or a system exception, which is completely wrong. If it were any of those, the CPU would have to be halted. But the CPU is alive and can be single-stepped. And since there is no blue screen, calling it a BSOD is even further off.
So what kind of problem is it? From the description, the system is alive and has simply stopped responding, so I characterized the issue as a system software hang. Several situations can cause this: a system process may suddenly run out of control and leave the user interface unresponsive, or, as in this case, the kernel itself may be trapped in an infinite loop (a "dead loop") with no chance to handle anything else.
Analyzing the problem
On the first day all I had was a dump file. Only one system was available for live debugging, so sharing it was difficult to coordinate. Once I had the dump file, the first thing I did was look at the only live thread. With this kind of issue the trouble spots are highly concentrated, so I was confident I could find useful clues quickly.
    ChildEBP RetAddr Args to child
    85846dd0 8fd95b75 ffffffff 85846ef0 nt!KeClockInterruptNotify+0x28a
    (Inline) -------------------------------- hal!HalpTimerClockInterruptEpilogCommon+0xa
    85846de0 00000000 000000d1 00000000 hal!HalpTimerClockInterruptCommon+0x3e
    85846de0 00000000 000000d1 00000000 hal!HalpTimerClockInterrupt+0x1cb
    85846ef0 80c66050 8584762c 00000002 win32k!ENUMAREAS::ENUMAREAS+0xb9
    8584711c 80da5d98 80da5d98 00000000 win32k!bSpBltScreenToScreen+0x2f8
    858474dc 80da5d98 80da5d98 00000000 win32k!SpBitBlt+0x2bc
    85847510 80da5d98 80da5d98 00000000 win32k!SpCopyBits+0x27
    85847654 000006ce 000000a0 00000027 win32k!NtGdiBitBltInternal+0xa39
    85847700 80d440a8 80c7d5e0 80c7d598 win32k!zzzBltValidBits+0xc6557
    85847768 85847b18 85847ad0 80dac008 win32k!xxxEndDeferWindowPosEx+0x20b
    858477a8 00000000 8312c5a0 80dac008 win32k!xxxProcessDesktopRecalc+0x10b
    858477e0 d4b8b27d 85847d00 80dac008 win32k!xxxProcessEventMessage+0x7a
    ...
    85847d3c 0118fb90 00000000 00000000 nt!KiSystemServicePostCall
The top of the call stack contains the clock interrupt handling functions. When a hardware interrupt occurs, the CPU preempts the currently running thread and hands control to the ISR, so this is perfectly normal; no so-called "system exception" has occurred. With the interrupt-handling frames set aside, the function the thread was actually executing is win32k!ENUMAREAS::ENUMAREAS.
I looked through the assembly code of this function and found an infinite loop! Of course, no loop is inherently infinite; it only becomes one in some special state. The assembly shows win32k running in a loop that is entered only once. A colleague gave me a useful piece of information: every time the problem occurs, this same function is seen running. Combined with that, I could conclude that this loop is a "loop of death."
But when I brought this finding to the team for discussion, I got the cold shoulder. No one believed it or was willing to discuss the loop with me. A colleague on the debug team did not look at my analysis at all, but asked me the same question twice: are you sure this is an infinite loop? It left me quite depressed.
The next day, when the live debug environment was set up again and I could get hands-on with it, the first thing I did was spend some time in this function to confirm that it really was looping through the 4 lines of assembly I had found. It turned out that no matter how many times I pressed F10, the instruction pointer never left those 4 lines (except when a clock interrupt preempted them). I was greatly relieved, because I had been quite afraid of being humiliated if it turned out not to be an infinite loop.
The logic of the loop is very simple, just 4 lines of assembly. Two jump instructions jump back and forth to each other, which is a typical while loop.
    901b7b0f 394104  cmp  dword ptr [ecx+4],eax
    901b7b12 7e5e    jle  win32k!ENUMAREAS::ENUMAREAS+0xb6 (901b7b72)
    901b7b72 034908  add  ecx,dword ptr [ecx+8]
    901b7b75 eb98    jmp  win32k!ENUMAREAS::ENUMAREAS+0x53 (901b7b0f)
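As a cross-check that the two jumps really point at each other, the targets can be recomputed from the raw bytes in the listing. Both are 2-byte short jumps (opcode plus a signed 8-bit displacement), and the displacement is relative to the address of the following instruction. The helper name `short_jump_target` below is made up for this sketch; only the addresses and bytes come from the disassembly.

```c
#include <stdint.h>

/* Target of a 2-byte x86 short jump (one opcode byte + rel8).
   The displacement is signed and is added to the address of the
   instruction that follows the jump, i.e. insn_addr + 2. */
uint32_t short_jump_target(uint32_t insn_addr, uint8_t rel8)
{
    int32_t disp = (int8_t)rel8;              /* sign-extend the displacement */
    return insn_addr + 2u + (uint32_t)disp;   /* wraps modulo 2^32 like a 32-bit EIP */
}
```

With the listing's values, `short_jump_target(0x901b7b12, 0x5e)` gives 0x901b7b72 (the `add`), and `short_jump_target(0x901b7b75, 0x98)` gives 0x901b7b0f (the `cmp`), so the four instructions form a closed cycle for as long as the `jle` is taken.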
It takes a few minutes to translate the 4 lines of assembly back into C. Because we have the win32k private symbols, the result is very readable.
    while (p->yBottom <= this->yBoundsTop) {
        p = (SPRITESTATE *)((char *)p + p->sizeScan);
    }
This is an ordinary loop. But if p->sizeScan is 0 inside the loop body, p never advances, the condition never changes, and the loop never ends. That is the cause.
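The stall is easy to reproduce outside the kernel. The following is a minimal user-mode sketch, not win32k code: the `Scan` record, its field types, and the helper names are assumptions made for illustration, and a step budget is added so the demonstration terminates where the real loop would not. The values 1200 and 1780 are the ones observed in the debugger.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record with the three fields named in the article.
   The 32-bit field types are an assumption. */
typedef struct {
    int32_t  yTop;
    int32_t  yBottom;
    uint32_t sizeScan;   /* byte distance to the next record */
} Scan;

/* Same loop shape as the decompiled code, plus a step budget.
   Returns the number of advances taken before the loop condition
   failed, or -1 if the budget ran out. */
int advance_past(const unsigned char *buf, int32_t yBoundsTop, int max_steps)
{
    const unsigned char *p = buf;
    int steps = 0;
    for (;;) {
        Scan s;
        memcpy(&s, p, sizeof s);               /* read the current record */
        if (!(s.yBottom <= yBoundsTop)) return steps;
        if (steps == max_steps) return -1;     /* the real loop has no such exit */
        p += s.sizeScan;                       /* zero means p does not move */
        steps++;
    }
}

/* Second record ends below yBoundsTop, so the loop exits after one advance. */
int demo_healthy(void)
{
    Scan a[2] = { {0, 600, sizeof(Scan)}, {600, 2000, sizeof(Scan)} };
    return advance_past((const unsigned char *)a, 1780, 1000);
}

/* Second record has yBottom 1200 <= 1780 and sizeScan 0: stuck. */
int demo_stuck(void)
{
    Scan a[2] = { {0, 600, sizeof(Scan)}, {600, 1200, 0} };
    return advance_past((const unsigned char *)a, 1780, 1000);
}
```

Here `demo_healthy()` returns 1, while `demo_stuck()` exhausts its budget and returns -1, which is the situation the kernel thread was in, minus the budget.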
Guessing, and more guessing
Every senior debug engineer shares one skill: guessing. When debugging closed-source code, guessing is the perfect weapon. Of course, you must not guess wildly, or you end up convicting the innocent; a guess must not turn into mere suspicion.
I found that when the infinite loop occurs, yBottom is 1200 and yBoundsTop is 1780. The 1200 caught my attention because the target platform's resolution is 1800x1200, and the variable name shown by the private symbols supports the guess. So the first guess is that 1200 is the height of the screen.
What is 1780, then? Combining the win32k class name ENUMAREAS with the variable name yBoundsTop, I guessed it is the y coordinate of the top-left corner of a window.
Keep guessing. The function names in the call stack tell us roughly what is going on: BitBlt is a drawing operation performed through the GDI interface.
That gives a reasonable conjecture: when the problem occurs, a window has been moved outside the screen, and when that window tries to refresh its UI, there is some probability that win32k gets stuck.
The next step was to propose a workaround based on this conjecture: during testing, avoid any operation that moves a window outside the screen.
The front-line engineer was still not at ease after hearing this, and added one more item to the workaround: avoid minimizing any window.
When the workaround was reported to the customer, they tested it immediately, and over thousands of test runs the problem did not occur again. In the month since, the issue has not recurred.
Technical details
Once the facts were established, everyone was surprised. win32k is a very stable OS module, so if this really is a win32k bug, its impact must be far-reaching. The issue has been reported to Microsoft; their engineers are still analyzing it and have acknowledged that it is a system bug. However, they are more inclined to think that it is not a bug in win32k itself but that the parameters passed to win32k were corrupted. We will have to wait for the final result.
If I were the Microsoft engineer, I would take the implementation of each relevant function on the call stack and analyze it carefully: compare the failing case with the normal case and determine whether the issue is caused by the code logic or by abnormal parameters. If abnormal parameters are the cause, the next step is to determine whether they were already wrong when passed in or were corrupted afterwards, and narrow it down step by step.
    while (p->yBottom <= this->yBoundsTop) {
        p = (SPRITESTATE *)((char *)p + p->sizeScan);
    }
The type of this is ENUMAREAS; part of its definition is as follows:
    win32k!ENUMAREAS
       +0x000 iDir
       +0x004 xBoundsLeft
       +0x008 yBoundsTop
       +0x00c xBoundsRight
       +0x010 yBoundsBottom
The structure that p points to is SPRITESTATE; the beginning of its definition is as follows:
    _SPRITESTATE
       +0x000 yTop
       +0x004 yBottom
       +0x008 sizeScan
My view is this: p points into an array of variable-length structures of type SPRITESTATE, in which sizeScan gives the distance in bytes from the current structure to the next one. When the problem occurs, p has reached the last structure in the array, whose sizeScan member is 0. Because the loop has no handling for this abnormal case, p stays on the last structure forever, spinning on it and bringing the whole system to a standstill.
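To make the "no handling for the abnormal case" point concrete, here is a sketch of what a guarded version of such a walk could look like in ordinary user-mode C. It is not the win32k implementation and not a proposed patch: the struct names, the 32-bit field types, the `WalkResult` codes, and the explicit buffer length are all assumptions for illustration. Only the field names and offsets come from the structure dumps, and the static assertions check the sketch against them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Leading fields as listed in the dumps; field types are assumed. */
typedef struct {
    int32_t iDir;            /* +0x000 */
    int32_t xBoundsLeft;     /* +0x004 */
    int32_t yBoundsTop;      /* +0x008 */
    int32_t xBoundsRight;    /* +0x00c */
    int32_t yBoundsBottom;   /* +0x010 */
} EnumAreasPrefix;

typedef struct {
    int32_t  yTop;           /* +0x000 */
    int32_t  yBottom;        /* +0x004, the cmp reads [ecx+4] */
    uint32_t sizeScan;       /* +0x008, the add reads [ecx+8] */
} ScanPrefix;

_Static_assert(offsetof(EnumAreasPrefix, yBoundsTop) == 0x008, "offset from dump");
_Static_assert(offsetof(ScanPrefix, yBottom) == 0x004, "offset from dump");
_Static_assert(offsetof(ScanPrefix, sizeScan) == 0x008, "offset from dump");

typedef enum { WALK_FOUND, WALK_ZERO_SIZE, WALK_OUT_OF_BOUNDS } WalkResult;

/* Walk variable-length records until one has yBottom > yBoundsTop.
   Unlike the original loop, a zero step or a step past the end of the
   buffer is reported instead of being followed. */
WalkResult find_scan_below(const unsigned char *buf, size_t len,
                           int32_t yBoundsTop, size_t *offset_out)
{
    size_t off = 0;
    for (;;) {
        ScanPrefix s;
        if (len < sizeof s || off > len - sizeof s) return WALK_OUT_OF_BOUNDS;
        memcpy(&s, buf + off, sizeof s);
        if (s.yBottom > yBoundsTop) { *offset_out = off; return WALK_FOUND; }
        if (s.sizeScan == 0) return WALK_ZERO_SIZE;       /* would spin forever */
        if (s.sizeScan > len - off) return WALK_OUT_OF_BOUNDS;
        off += s.sizeScan;
    }
}
```

Whether the zero sizeScan is itself the bug, or only a symptom of corrupted input as Microsoft suspects, a check like this would only turn the hang into a detectable error; it would not explain where the bad value came from.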