Windbg + SOS: Analysis of Web Server CPU (100%) Instances

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

In the afternoon, a friend on MSN sent a dump file, saying that the CPU usage of the Web server is at 100%. The problem could not be found. Let's take a look, let's upload the dump file and find out where the problem is.

Framework2.0 is a 2 k Windows OS.

After loading the dump file, load the SOS extension debugging module of Version 2.0:

. Load c: \ windows \ Microsoft. NET \ framework \ v2.0.50727 \ SOS. dll

Habitually list managed threads:

! Threads

The following error is reported:

Unable to load Image

C: \ windows \ Assembly \ nativeimages_v2.0.50727_64 \ mscorlib \ 339a2c337cdafd47f2ba740bb03f9718 \ mscorlib. Ni. dll, Win32 Error 2

* ** Warning: unable to verify checksum for mscorlib. Ni. dll

Lm and below, I found that the symbol file was not loaded, so. Reload, the path of the symbol file I configured is as follows:

C: \ windows \ symbols; C: \ Program Files \ Microsoft Visual studio8

\ SDK \ V2.0 \ symbols; SRV * E: \ Bak \ symbols * http://msdl.microsoft.com/download/symbols

The network traffic is reduced to 30 kb/s. At this time, the symbolic file is being downloaded from the symbol file server. After a while, all the required symbolic files have been downloaded:

First, let's look at the situation in the thread pool:

0: 000>! Threadpool

CPU utilization 100%

Worker thread: Total: 1 running: 0 idle: 1 maxlimit: 100 minlimit: 1

Work request in queue: 0

--------------------------------------

Number of Timers: 8

--------------------------------------

Completion port thread: Total: 1 free: 1 maxfree: 2 Currentlimit: 1 maxlimit: 100 minlimit: 1

Er, the CPU usage is indeed 100%.

Check the CPU time usage:

0: 000>. Time

Debug session time: Thu Jun 12 08:59:14. 0002008 (GMT + 8)

System uptime: 2 days 1:01:08. 081

Process uptime: 2 days 0:30:27. 000

Kernel Time: 0 days 0:00:22. 000

User time: 0 days 0:09:40. 000

Well, the main reason is that the user's application occupies the process time. Then let's take a look at the thread that used all the CPU time of the user mode:

0: 000>! Runaway

User Mode Time

Thread time

12: b9c 0 days 0:06:27. 109

2: b54 0 days 0:00:50. 468

15: f98 0 days 0:00:09. 937

10: b94 0 days at 0:00:03. 531

5: b6c 0 days 0:00:01. 937

11: b98 0 days 0:00:01. 671

3: b58 0 days 0:00:01. 171

0: b44 0 days at 0:00:00. 640

9: B84 0 days 0:00:00. 265

7: b7c 0 days 0:00:00. 203

14: 388 0 days 0:00:00. 000

13: b68 0 days 0:00:00. 000

8: b80 0 days 0:00:00. 000

6: b78 0 days 0:00:00. 000

4: b5c 0 days 0:00:00. 000

1: B50 0 days 0:00:00. 000

User time is a total of 0 days 0:09:40. 000 so much time, thread 12 takes up so much time for 0 days 0:06:27. 109, okay, let's see what we did in private:

0: 000> ~ 12 s

Eax = 000007ff EBX = 00000000 ECx = 0130d894 edX = 013055ac ESI = 013055ac EDI = dbb83133

EIP = 051dc0be ESP = 06a8ef48 EBP = 06a8efb4 iopl = 0 NV up EI PL NZ AC PE NC

Cs = 001b Ss = 0023 DS = 0023 es = 0023 FS = 003b GS = 0000 EFL = 00000216

051dc0be 3bf8 cmp edi, eax

After switching, view the call Stack:

0: 012> K

Childebp retaddr

Warning: frame IP not in any known module. Following frames may be wrong.

06a8efb4 051d4b92 0x51dc0be

06a8f068 0127d1f 0x51d4b92

06a8f080 79e7c74b mscorlib_ni + 0x2f0d1f

06a8f090 79e7c6cc mscorwks! Calldescrworker + 0x33

06a8f110 79e7c8e1 mscorwks! Calldescrworkerwithhandler + 0xa3

06a8f24c 79e7c783 mscorwks! Methoddesc: calldescr + 0x19c

06a8f268 79e7c90d mscorwks! Methoddesc: calltargetworker + 0x1f

06a8f27c 79eb300f mscorwks! Methoddesccallsite: call_retargslot + 0x18

06a8f448 79eb2f31 mscorwks! Executecodewithguaranteedcleanuphelper + 0x9b

06a8f4f8 034f3ff7

Mscorwks! Reflectioninvocation: executecodewithguaranteedcleanup + 0xf9

06a8f510 034f3ede mscorlib_ni + 0x2b3ff7

06a8f528 03530c68 mscorlib_ni + 0x2b3ede

06a8f540 79e7c74b mscorlib_ni + 0x2f0c68

06a8f550 79e7c6cc mscorwks! Calldescrworker + 0x33

06a8f5d0 79e7c8e1 mscorwks! Calldescrworkerwithhandler + 0xa3

06a8f710 79e7c783 mscorwks! Methoddesc: calldescr + 0x19c

06a8f72c 79e7c90d mscorwks! Methoddesc: calltargetworker + 0x1f

06a8f740 79fc58cd mscorwks! Methoddesccallsite: call_retargslot + 0x18

06a8f928 79ef3207 mscorwks! Threadnative: kickoffthread_worker + 0x190

06a8f93c 79ef31a3 mscorwks! Thread: doadcallback + 0x32a

Until calldescrworker, no exception was found. The first part of the call stack of a standard worker thread that initializes and executes a specific task is normal. Then there are two calls at the top of the stack:

06a8efb4 051d4b92 0x51dc0be

06a8f068 0127d1f 0x51d4b92

This does not show what the method is. Well, use SOS to list the managed method call Stack:

0: 012>! Clrstack

OS thread ID: 0xb9c (12)

ESP EIP

06a8ef48 051dc0be rswebgis. rsprotocolclass. getdatalong (byte [], int32)

06a8ef68 051dbf74 rswebgis. rsprotocolclass. parsepackage (byte [])

06a8efbc 051d4b92 rswebgis. rsprotocolclass. doanalysiswork ()

06a8f070 01_d1f system. Threading. threadhelper. threadstart_context (system. Object)

06a8f078 034f40ab system. Threading. executioncontext. runtrycode (system. Object)

06a8f49c 79e7c74b [helpermethodframe_protectobj: 06a8f49c]

System. runtime. compilerservices. runtimehelpers. executecodewithguaranteedcleanup (trycode, cleanupcode, system. Object)

06a8f504 034f3ff7

System. Threading. executioncontext. runinternal (system. Threading. executioncontext, system. Threading. contextcallback, system. Object)

06a8f51c 034f3ede

System. Threading. executioncontext. Run (system. Threading. executioncontext, system. Threading. contextcallback, system. Object)

06a8f534 03530c68 system. Threading. threadhelper. threadstart ()

06a8f760 79e7c74b [gcframe: 06a8f760]

06a8fa50 79e7c74b [contexttransitionframe: 06a8fa50]

At this time, the top three call estimates are the key to the problem. Based on past experience, there are several common performance debugging scenarios:

L high CPU usage, low memory usage

L high CPU usage. memory usage of a module or thread is outrageous

L CPU and memory usage are not high, that is, webpage access is particularly slow

The general causes of these problems and a common analysis routine are as follows. It is useful to know where the problem may occur and how to judge the problem at this time. :)

From the above call stack, there should be a problem with the program written in the code. Okay, is my mother-in-law bullied?

Syntax for separating a module from a dump file:

! Savemodule <base address> <FILENAME>

<Base address> there are several possible solutions. You can run the dumpdomain command to check which modules are loaded in the appdomain. The following is the screenshot! A record of dumpdomain. Based on the above stack display, you can find the rswebgis module:

Assembly: 001de3d0 [c: \ winnt \ Microsoft. NET \ framework \ v2.0.50727 \ temporary

ASP. netfiles \ rswebgis \ cbfaa214 \ c8518bb9 \ Assembly \ dl3 \ b000bb67 \

2dae9f2e_ab93c801 \ rswebgis. dll

Classloader: 001dc518

Securitydescriptor: 001de348

Module name

05860010 c: \ winnt \ Microsoft. NET \ framework \ v2.0.50727 \ temporary

ASP. NET files \ rswebgis \ cbfaa214 \ c8518bb9 \ Assembly \ dl3 \ b000bb67 \

2dae9f2e_ab93c801 \ rswebgis. dll

Here, 05860010 is the base address of the rswebgis module. Of course, this base address can also be obtained by using the LM command to view the loaded module.

Okay, strip it out and save it to drive C:

0: 000>! Savemodule 05880000 c: \ rswebgis. dll. dll

4 sections in file

SECTION 0-Va = 1000, vasize = e82da9, fileaddr = 400, filesize = e82e00

Section 1-Va = e84000, vasize = 24d24, fileaddr = e83200, filesize = ec00

Section 2-Va = ea9000, vasize = 5a8, fileaddr = e91e00, filesize = 600

Section 3-Va = eaa000, vasize = c183c, fileaddr = e92400, filesize = c1a00

A total of four sections are left empty.

Copy reflector decompilation, first find

06a8ef48 051dc0be rswebgis. rsprotocolclass. getdatalong (byte [], int32)

This implementation:

Private int getdatalong (byte [] bytes, int count)

{

If (count> (bytes. length-1) | (count> (bytes. length-2) | (count> (bytes. length-3) | (count> (bytes. length-4) | (count <0 ))

{

Return 0;

}

Return bitconverter. toint32 (bytes, count );

}

My day, how are there so many parentheses? I think it's a blind eye. At the very beginning, it was estimated that the problem was located in this place, where a logic error occurred. After a long time, I couldn't find a reason, and there was no problem with this judgment, A few parentheses may cause a slight deviation from the normal judgment logic, but it will not be 100% of the CPU usage. Then I asked zhaochangj's brother about the program code for this method. The result is:

# Region obtains a long integer from the specified position in the byte array.

/// <Summary>

/// Obtain the data packet size

/// </Summary>

/// <Param name = "bytes"> specified data packet </param>

/// <Param name = "count"> Start position </param>

Private int getdatalong (byte [] bytes, int count)

{

If (count> bytes. Length-1 | count> bytes. Length-2 | count> bytes. Length-3

| Count> bytes. Length-4 | count <0)

It seems that this reflector will be a little short-circuited when dealing with such many parentheses.

Well, since there is no problem, let's look at how the above calls this:

Private void parsepackage (byte [] bytedata)

{

Int datalong;

Int length = bytedata. length;

For (INT I = 0; I <length; I + = datalong)

{

Datalong = This. getdatalong (bytedata, I );

If (I + datalong) <= length) & (datalong> 0 ))

{

Long num4;

Int num5;

Int num6;

String STR;

Byte [] buffer;

Byte [] buffer2;

This. copybytes (bytedata, I, datalong, out buffer2 );

Int start = 0;

If (buffer2.length-Start)> = 15) & (this. tcpmsgheadproc (buffer2, start,

Out num4, out num5, out STR, out num6, out buffer) = 0l ))

{

This. m_nobackcount = 0;

This. tcpdataproc (num4, num5, STR, num6, buffer );

}

Er, you can see the for loop. Well, you need to pay special attention to the fact that this situation may often be an endless loop in a specific situation. Of course, it is not all.

I am wondering whether the defined datalong variable is assigned a value for the first time. In this case, the compiler prompts an error, in addition, I + = datalong is the first time I saw it in the for loop... On MSN, I asked zhaochangj about the code and sent the source code of this function:

While (offset <bytestotal)

{

Int length = getdatalong (bytedata, offset );

Byte [] tmpmsg;

If (Offset + Length <= bytestotal) & (length> 0 ))

{

// Omitted

Finally, the problem is found: If the returned value of the getdatalong method is 0, the loop will continue, that is, for (INT I = 0; I <length; I + = datalong) in the program, if datalong returns 0, that is, when formatting a byte string with a length less than 4, the loop will continue.

Sum: It only describes the general idea of finding a problem with windbg in the case of slow CPU, memory, and access speed after the website is released, you may have specific problems in specific situations.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Windbg + SOS: Analysis of Web Server CPU (100%) Instances

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Windbg + SOS: Analysis of Web Server CPU (100%) Instances

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support