Yesterday morning, while checking the Zabbix monitoring interface, I found that the process count and the 1-minute load on one of the servers had reached alarming values. The default Zabbix triggers fire when the 5-minute average of the process count exceeds 1000 and when the 5-minute average of the 1-minute system load exceeds 5.
First, the hardware and software information of the server is listed:
Server hardware: Dell PowerEdge R720, 2 x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz; 62.87 GB memory; PERC H710, SAS RAID 5
Server operating system: Ubuntu 14.04 LTS, kernel: 3.13.0-24-generic
Figure 1: Zabbix alarm information (Zabbix is configured to send SMS alerts through the web SMS gateway for problems at or above the critical (Average) severity level):
[Image: http://s3.51cto.com/wyfs02/M00/72/8D/wKioL1XmpgaRX1PAAACifLJCMMA734.jpg]
Figure 2: Alarm conditions on the Zabbix triggers
[Image: http://s3.51cto.com/wyfs02/M01/72/8D/wKioL1XmpgahGPFWAACcmwlVqwM222.jpg]
Soon afterwards, colleagues reported that the server was responding very slowly and that some application pages took a long time to open. The server room is far away and I had no console login information (the customer holds it), so I could only log in to the Linux system over SSH with Xshell. Running top showed more than 5,800 processes on the system and load averages (1, 5, and 15 minutes) all above 5400. Strangely, CPU, memory, and disk I/O usage were not high. Common sense says that with a load this high the CPU and I/O should already be exhausted, but top showed nothing of the kind.
Figure 3.1: top output on the Linux system
[Image: http://s3.51cto.com/wyfs02/M02/72/8D/wKioL1XmpgehdzfMAAQ3Vi1mlFY582.jpg]
Figure 3.2: iostat, vmstat, and similar commands show that disk I/O is not high, yet the system load is very high
[Image: http://s3.51cto.com/wyfs02/M00/72/8D/wKioL1XmpgeQKY4TAAGDiS3di9Q470.jpg]
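To put the load figures from top in perspective, the same numbers can be read directly from /proc/loadavg and compared with the number of logical CPUs. The following is a minimal Python sketch, assuming a Linux /proc filesystem; the function names are illustrative and are not part of any tool used in this investigation, and the sample line in the comment is hypothetical.

```python
import os


def parse_loadavg(text):
    """Parse the single line found in /proc/loadavg.

    The line looks like (hypothetical values):
        5432.10 5410.55 5398.02 3/5821 24999
    i.e. the 1/5/15-minute load averages, running/total scheduling
    entities, and the most recently created PID.
    """
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    running, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "running": running,
        "total": total,
    }


def load_per_cpu(load1, cpu_count):
    """Return the 1-minute load divided by the number of logical CPUs."""
    return load1 / cpu_count


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        info = parse_loadavg(f.read())
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    print(info)
    print("1-minute load per logical CPU:", load_per_cpu(info["load1"], cpus))
```

A per-CPU value far above 1 combined with idle CPUs, as seen here, suggests that the load is made up of waiting tasks rather than running ones.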
A developer said that the Java process with PID 14898 had never been allocated that much memory, so the process must have a problem, and asked the other developer responsible for the program to check it. I thought that a single Java process could not push the system load this high on its own; there had to be another important cause that had not been found yet.
As usual, I wanted to see which programs were producing so many processes, so I ran ps -ef | more to list every process on the system. The command could not complete: it printed one screen of output and then hung. I quickly duplicated the SSH session and ran plain ps to see whether that would work. Its output contained a large number of ps processes, so at least ps could still run as long as it was not given many options.
I picked the PID of one of those ps processes and inspected it with top -p pid. It was a ps command run by the zabbix user. My heart sank: had one of the Zabbix scripts I wrote earlier left ps commands that never exited, and was that what had driven the 1-minute load so high?
Figure 4: Process information for the zabbix user, viewed with top -u zabbix
[Image: http://s3.51cto.com/wyfs02/M01/72/8D/wKioL1XmpgiST529AAUgsSOsIuc266.jpg]
Earlier, a colleague had phoned to ask what to do about a Java process that could not be terminated, which left one application with two Java processes. I told him to try sudo, but that did not help: the process could not be killed even with root privileges. I was at home rather than at the office at the time, so I did not look into the cause. Now, when I tried to kill the zabbix user's ps processes with kill, I ran into the same problem: they could not be killed at all, regardless of the signal used (TERM or KILL). On closer inspection, all of these processes were in state D (shown in the S column of top).
About the D state: D is one of the process state codes (PROCESS STATE CODES in the ps manual), described there as "uninterruptible sleep (usually IO)". The I/O involved can be of many kinds: memory, hard disk, network, or a bus. A process in state D does not respond to any signal while it remains in that state, so even the root user cannot kill it; if the wait never ends, the only way to get rid of the process is to restart the server. But the root cause had not been found yet, so it was too early to draw conclusions.
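On Linux, the load average counts tasks in uninterruptible sleep as well as runnable ones, which is why thousands of D-state processes can push the load into the thousands while the CPUs stay mostly idle. A quick way to confirm this is to count processes by state. The sketch below is a minimal Python example, assuming a Linux /proc filesystem; it reads only /proc/<pid>/stat and never touches cmdline. The function names and the `proc_root` parameter are illustrative, not part of any tool used in this investigation.

```python
import os
from collections import Counter


def parse_stat(text):
    """Extract (pid, comm, state) from the content of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the LAST ')' instead of on whitespace.
    """
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def count_states(proc_root="/proc"):
    """Count processes per state code (R, S, D, Z, ...)."""
    counts = Counter()
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                _, _, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile or entry is unreadable
        counts[state] += 1
    return counts


if __name__ == "__main__" and os.path.isdir("/proc"):
    for state, n in sorted(count_states().items()):
        print(state, n)
```

If the count for D is close to the load average, the load is coming from blocked processes rather than from CPU demand.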
I checked the scripts and command lines configured in the Zabbix user parameters. Most of them call ps, but none of them had a problem, so I tentatively ruled out the Zabbix scripts and command lines as the cause, as shown in Figure 5:
Figure 5: Zabbix user parameter command lines:
[Image: http://s3.51cto.com/wyfs02/M01/72/91/wKiom1Xmo-eBWjreAAFY22bvBkY803.jpg]
The only option left was to use strace to find out where ps -ef gets stuck. The command is: strace -o strace_psef.log -f ps -ef
Figure 6: Using strace to see the system calls made and the signals received while the command runs
[Image: http://s3.51cto.com/wyfs02/M02/72/91/wKiom1Xmo-izewYuAAU2Uvs4Oa8422.jpg]
The trace showed that ps stalled (waiting) when it opened the file /proc/24602/cmdline. Running cat /proc/24602/cmdline confirmed this: the command did not complete and hung as well. Along the way I found that commands such as netstat and ps -p 24602 -o cmd did not work either, and even ps -p 1 -o cmd displayed nothing. All of these commands read files under /proc, so it seemed very likely that something under /proc was the problem. To keep looking for the cause, I ran top -p 24602 to find the process name (besides top, pstree also still worked).
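When a traced command hangs, the interesting part of the strace log is usually its tail: the last call has no return value yet, and the most recent /proc path mentioned before it points at the process being examined. The following minimal Python sketch summarizes a log in that way. It assumes the usual strace text format, treats a line without " = " (or one marked "<unfinished") as incomplete, which is only a heuristic, and uses illustrative function names; the log file name is the one passed to strace -o above.

```python
import os
import re

# Quoted /proc paths as they appear in strace arguments.
PROC_PATH = re.compile(r'"(/proc/[^"]*)"')


def summarize_strace(log_text):
    """Return (last_line, last_proc_path, unfinished) for an strace log.

    last_proc_path is the most recent quoted /proc path seen anywhere in
    the log; unfinished is True when the final line has no return value.
    """
    last_line = None
    last_path = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        last_line = line
        found = PROC_PATH.findall(line)
        if found:
            last_path = found[-1]
    unfinished = last_line is not None and (
        "<unfinished" in last_line or " = " not in last_line
    )
    return last_line, last_path, unfinished


if __name__ == "__main__" and os.path.exists("strace_psef.log"):
    with open("strace_psef.log", errors="replace") as f:
        line, path, unfinished = summarize_strace(f.read())
    print("last line:      ", line)
    print("last /proc path:", path)
    print("unfinished:     ", unfinished)
```

Because a hung strace has to be interrupted from another session, the log should be read after the trace has been stopped or from a second terminal.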
Figure 7.1: ps -p 24602 -o cmd
[Image: http://s3.51cto.com/wyfs02/M00/72/8D/wKioL1XmpgrDJoSbAABI4qi5TAk008.jpg]
Figure 7.2: ps -p 1 -o cmd displays nothing
[Image: http://s3.51cto.com/wyfs02/M01/72/91/wKiom1Xmo-ixaLNWAABj1hKECJg510.jpg]
Figure 7.3: A side issue: the pts number does not go back to 0
[Image: http://s3.51cto.com/wyfs02/M02/72/8D/wKioL1XmpgqCPDWFAABknKg6YKM778.jpg]
Running top -p 24602 showed that the process with PID 24602 was also in state D.
Figure 8: Finding the process name with top -p 24602
[Image: http://s3.51cto.com/wyfs02/M02/72/8D/wKioL1XmpgqS8KGUAAFQYoXrjKA670.jpg]
However, the c key in top could not display the command line of this process either, so all I could do was go into the /proc directory and try my luck at finding some of the fds (file descriptors) the process was using. Luck was on my side: using cp, I copied out the fd entries from the task directory under /proc/24602, as shown in Figure 9:
Figure 9: Using the fds to find out which application the process belongs to
[Image: http://s3.51cto.com/wyfs02/M00/72/8D/wKioL1XmpguBmHMZAAReY9CA1hM946.jpg]
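The idea behind the fd trick is that each entry under a process's fd directory is a symbolic link to the file, socket, or pipe it has open, so the link targets (log files, jar files, working directories) often give away which application the process belongs to. The sketch below reads those links with os.readlink from /proc/<pid>/fd, which is a different route from the cp of the task directory used above; whether it would have worked on the affected server is untested. The function name and the `proc_root` parameter are illustrative, and inspecting another user's process normally requires root.

```python
import os


def fd_targets(pid, proc_root="/proc"):
    """Map each open file descriptor of a process to its link target."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[name] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
    return targets


if __name__ == "__main__":
    own_fd_dir = os.path.join("/proc", str(os.getpid()), "fd")
    if os.path.isdir(own_fd_dir):
        # Demonstrate on this script's own process.
        for fd, target in sorted(fd_targets(os.getpid()).items()):
            print(fd, "->", target)
```

For a stuck process, replacing os.getpid() with its PID (24602 in this case) and looking for application-specific paths in the output serves the same purpose as reading its command line.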
The application at fault turned out to be the very one my colleague had asked me about the day before, and the one the developer had said was problematic. After last night's restart, the memory consumption of this process was still very high. The preliminary judgment is that the program itself is at fault, although the I/O problem still has to be checked. Oddly, the Ubuntu system logs contain no useful information, and the Java program produces so many logs, with such mixed content, that it is better left to the more specialized developers.
So far the culprit has been found, but the reason it led to an I/O problem has not. I started writing this article this morning, and while I was writing, the developers were asked to analyze the code. I have just checked with them: the preliminary finding is that one module in the program contains an infinite loop (it keeps querying a remote database). The specific changes and in-depth testing will take some time.
Figure 10.1: Last night the program (PID 2988 in the figure) used about 1.4 GB of memory, which is reasonable, but this morning it was at 11 GB (13 GB and 12 GB were recorded earlier)
[Image: http://s3.51cto.com/wyfs02/M01/72/8D/wKioL1XmpgyijAEvAAZWvvgvydc911.jpg]
Figure 10.2: As of now, it is at 13 GB again.
[Image: http://s3.51cto.com/wyfs02/M02/72/91/wKiom1Xmo-uBYqwtAATvEsfHwXE406.jpg]
Figure 11: The overall troubleshooting approach:
[Image: http://s3.51cto.com/wyfs02/M02/72/91/wKiom1Xmo-vClLjfAAEVMAAytwg071.jpg]
The lessons learned from this incident are as follows:
1. A process's command line can be identified through its fds, in addition to top -c, pstree -a, ps -ef, and so on.
2. strace can be used to analyze where a process is waiting.
3. Understand the D state of a process.
Tags: strace command usage, process state D, Linux process analysis, Linux process debugging, debugging programs with strace
--end--
This article is from the "Communication, My Favorites" blog; please keep this source link when reposting: http://dgd2010.blog.51cto.com/1539422/1690817
Troubleshooting a 1-minute load of 5000+ on a Linux system