Yesterday I took care of an urgent issue. A customer complained that our product's upgrade system was slow and that CPU usage was very high. The customer has a short temper and threatened to return the product if the problem was not resolved as soon as possible, which put a lot of pressure on us, and the task of solving the issue was assigned to me. The customer is a South African company, so because of the time difference I had to work on the problem remotely from home. At 8:30 in the evening I connected to the customer's machine with GoToMeeting to look at our system. After about four hours of poking around I finally reproduced the high CPU in the early hours of the morning, so I hurried to grab the logs, capture packets with Wireshark, and save the relevant output of the system with gcore, gstack, strace, and top. Analyzing these files the next day, I found the product bug: the author of the code had allocated a 10K buffer and assumed that 10K would always be enough. In 99% of cases it is, but in the remaining 1% the buffer fills up and the program then falls into a dead loop, driving the CPU up. Once the problem was found, the fix was easy; the customer's temper eased, and the fix could be delivered quickly. Reflecting on how the problem was solved, I think the analysis process is reusable and worth summarizing.
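For illustration, here is a minimal C sketch of that kind of bug (not the actual product code; all names are made up): a fixed-size buffer fills up, nothing drains the file descriptor any more, but the loop keeps polling it with a short timeout, so poll() returns immediately every time and the thread spins at close to 100% CPU.

#include <poll.h>
#include <unistd.h>

#define BUF_SIZE (10 * 1024)          /* the "10K is always enough" assumption */

static char buf[BUF_SIZE];
static size_t used;

static void event_loop(int fd)        /* hypothetical reader loop */
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        /* the fd stays readable, so poll() returns almost immediately */
        if (poll(&pfd, 1, 10) <= 0)
            continue;
        if (used >= BUF_SIZE)
            continue;                 /* buffer full: nothing is read, nothing
                                         drains the fd, the loop just spins */
        ssize_t n = read(fd, buf + used, BUF_SIZE - used);
        if (n > 0)
            used += (size_t)n;
    }
}

The fix is either to drain or grow the buffer, or to stop polling for POLLIN while there is no room to read into.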
1. Use the top command to see which process is consuming high CPU
The gateway process 14094 consumes up to 891% CPU; this figure is the sum of the CPU consumed by each thread in the process.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14094 root 0 315m 10m 7308 S 891 2.2 1:49.01 gateway
20642 root 0 17784 4148 2220 S 0.5 0.8 2:39.96 microdasys
1679 root 0 10984 1856 1556 R 0.3 0.4 0:22.21 sshd
22563 root 0 2424 1060 R 0.3 0.2 0:00.03 top
1 root 0 2156 492 460 S 0.0 0.1 0:01.59 init
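If the top output needs to be saved as evidence for later analysis, top can also be run in batch mode; a possible invocation (iteration count, interval and file name are just examples) is:
#top -b -n 3 -d 10 > top.log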
2. Use the top -H -p PID command to see the percentage of CPU consumed by each thread in the process
#top -H -p 14094
You can see 107 threads in top, but only the following 9 threads occupy a lot of CPU, with thread 14086 being the main cause of the high CPU usage.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14086 root 0 922m 914m 538m R 101 10.0 21:35.46 gateway
14087 root 0 922m 914m 538m R 101 10.0 10:50.22 gateway
14081 root 0 922m 914m 538m S 10.0 8:57.36 gateway
14082 root 0 922m 914m 538m R 10.0 11:51.92 gateway
14089 root 0 922m 914m 538m R 10.0 21:21.77 gateway
14092 root 0 922m 914m 538m R 10.0 19:55.47 gateway
14094 root 0 922m 914m 538m R 10.0 21:02.21 gateway
14083 root 0 922m 914m 538m R 10.0 21:32.39 gateway
14088 root 0 922m 914m 538m R 10.0 11:23.12 gateway
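As a cross-check (or on machines where top -H is not available), per-thread CPU usage can also be listed with ps; a possible invocation:
#ps -L -p 14094 -o pid,tid,pcpu,stat,comm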
3. Use the gstack command to view the function call stack of each thread in the process
#gstack 14094 > gstack.log
Look for thread ID 14086 in gstack.log. Because the full stack would expose details of our functions, only the top two frames are shown here; thread ID 14086 corresponds to thread number 37.
Thread 0x4696ab90 (LWP 14086):
#0 0x40000410 in __kernel_vsyscall ()
#1 0x40241f33 in poll () from /lib/i686/nosegneg/libc.so.6
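A single stack snapshot can be misleading, so it is worth taking several samples a few seconds apart and checking whether the busy threads are stuck in the same place each time; a small sketch (file names are arbitrary):
#for i in 1 2 3; do gstack 14094 > gstack.$i.log; sleep 5; done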
4. Use the gcore command to dump the process image and memory context
#gcore 14094
This command generates a core file core.14094
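If the standalone gcore script is not installed on the customer machine, the same kind of dump can be produced from inside gdb, which attaches to the process, writes the core, and then lets it continue; a sketch:
#gdb -p 14094
(gdb) gcore core.14094
(gdb) detach
(gdb) quit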
5. Use the strace command to view system calls and the time spent in them
#strace -T -r -c -p 14094
The -c parameter displays summary statistics; remove it to see the time spent in and the return value of each individual system call (see the example after the table below).
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99   22.683879        3385      6702           poll
  0.00    0.001132           0      6702           gettimeofday
  0.00    0.000127           1       208       208 accept
  0.00    0.000022           1                     read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           close
  0.00    0.000000           0                     time
  0.00    0.000000           0         2           stat64
  0.00    0.000000           0         4           clock_gettime
  0.00    0.000000           0         7           send
  0.00    0.000000           0                     recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00   22.685160                 13652       218 total
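For the per-call view mentioned above, a possible invocation (output file name is just an example) is:
#strace -T -tt -p 14094 -o strace.log
Here -T prints the time spent in each system call and -tt adds microsecond timestamps, which would make a tight poll() loop show up as calls returning back to back.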
6. Debug the core file with gdb and switch to thread 37
The core file generated by gcore is almost the same as the one produced by an actual core dump, except that some dynamic debugging is not possible in gdb.
#gdb gateway core.14094
(gdb) thread 37
[Switching to Thread 0x4696ab90 (LWP 14086)]
#0 0x40000410 in __kernel_vsyscall ()
(gdb) where
#0 0x40000410 in __kernel_vsyscall ()
#1 0x40241f33 in poll () from /lib/i686/nosegneg/libc.so.6
In gdb you can then work through the detailed function stack, print variable values, and combine that with the source code to analyze why the poll() loop consumes so much CPU.
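When the same inspection has to be repeated (or scripted on a remote machine), gdb can also be driven non-interactively; a possible invocation that prints the backtrace of thread 37 together with local variables:
#gdb -batch -ex 'thread 37' -ex 'bt full' gateway core.14094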
Because the code is company intellectual property, I will not do a detailed analysis here; what matters is the analysis process and the use of these commands.
The process is: process ID -> thread ID -> thread function call stack -> system call time and call statistics -> source code analysis. That, in outline, is the approach to solving high-CPU problems on Linux.