Today, while dealing with an abnormally high load (1000+) on a machine, I ran into a situation I had never seen before: an unusually stubborn process. I tried every way I could think of to kill it, and none of them worked. In the end, with some help from Google, I got rid of the stubborn process that had frustrated me so much.
1. Description of the problem:
System: Kernel 2.6.32.43
Machines: A (web), B (web + NFS)
The load on the machine is very high, but logins work normally and the system still responds quickly.
Analysis Process:
1. Looking at top, CPU and memory appear normal, but swap usage is too high.
A machine:/usr/local # top
top - 11:01:29 up 1051 days, 16:55, 3 users, load average: 1694.36, 1694.26, 1679.68
Tasks: 5367 total, 1 running, 5366 sleeping, 0 stopped, 0 zombie
Cpu(s): 9.1%us, 19.6%sy, 0.0%ni, 50.0%id, 21.3%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 8049196k total, 7985004k used, 64192k free, 4080k buffers
Swap: 2104504k total, 2067308k used, 37196k free, 3381972k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 1896 root  20  0 16840  15m  468 S    6  0.2 95:41.71 sap1002
 9393 root  20  0  268m 4572  252 S    6  0.1  3650:16 newoctopusd
27609 root  20  0  9648 5300  876 R    4  0.1  0:00.55 top
13737 root  20  0 61072  58m  58m S    1  0.7  0:02.77 sqm_agent
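A load this high next to a half-idle CPU suggests that the load is not coming from running tasks: on Linux the load average also counts tasks in uninterruptible sleep (state D). Below is a minimal sketch for tallying processes by state. The function name count_states is made up for illustration, and the usage line assumes the procps version of ps, where stat= prints the state column without a header.

```shell
# Tally processes by the first letter of their state (R, S, D, Z, ...).
# Reads one ps STAT value per line on stdin and prints "STATE COUNT" lines.
count_states() {
    awk 'NF { n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# Typical use on a live system (assumes procps ps):
#   ps axo stat= | count_states
```

A large D count next to a handful of R entries would point at blocked I/O rather than CPU pressure.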
2. Check memory usage with free -m, mainly to look at swap.
A machine:/usr/local # free -m
             total       used       free     shared    buffers     cached
Mem:          7860       6611       1249          0         10       2134
-/+ buffers/cache:       4466       3394
Swap:         2055       2045         10
3. Look for the culprit: in top, press f and then p to show the SWAP column and find the programs using the most swap. Each httpd uses about 4M, which does not seem like much.
  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ SWAP COMMAND
 5135 root  20  0  266m  788  432 S    6  0.0 0:26.68 265m newoctopusd
 5082 root  20  0 61072  58m  58m S    1  0.7 0:02.62 1276 sqm_agent
16796 root  20  0  5796 1484  880 R    1  0.0 0:00.09 4312 top
 5186 root  20  0 30616  21m  472 S    0  0.3 0:01.21 9112 dnsagent
 5831 root  20  0  5288 2060 1320 S    0  0.0 0:00.06 3228 sshd
    1 root  20  0   788  304  256 S    0  0.0 0:29.38  484 init
    2 root  20  0     0    0    0 S    0  0.0 0:00.00    0 kthreadd
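Note that in the rows above the SWAP value equals VIRT minus RES (for example 5796 - 1484 = 4312 for top itself), so this column is only a rough upper bound on what is really swapped out. As a cross-check, and assuming the kernel exposes Swap: lines in /proc/<pid>/smaps, the swap of a single process could be summed with a small helper. The name sum_swap_kb is hypothetical.

```shell
# Sum the "Swap:" lines of smaps-formatted text read from stdin.
# Prints the total in kB (0 if there are no such lines).
sum_swap_kb() {
    awk '/^Swap:/ { kb += $2 } END { print kb + 0 }'
}

# Typical use for one process (needs permission to read the file):
#   sum_swap_kb < /proc/5135/smaps
```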
4. Since the machine had been handed over from another department, I had assumed httpd was not the problem, but I ran a quick check anyway and was stunned: 502 processes.
ps axu |grep http|wc -l
502
This was bound to cause trouble. Why had so many processes been started?
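One caveat about the count above: grep http also matches the grep process itself and any other command line containing "http", so the figure can be slightly off. A hedged sketch of a tighter count follows, using the bracket trick so the pattern does not match its own command line. The name count_httpd is a hypothetical helper.

```shell
# Count the lines of ps output on stdin that mention httpd.
# The regex [h]ttpd matches "httpd" but not the literal text "[h]ttpd",
# so the grep process's own line in the listing is not counted.
count_httpd() {
    grep -c '[h]ttpd'
}

# Typical use:
#   ps axu | count_httpd
```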
5. So I confidently ran killall httpd followed by /usr/local/apache2/bin/apachectl -k start, and waited for the resources to be released. Startup failed: the port was still in use. Looking at the httpd processes again, I found one stubborn process left over. Fine, keep it simple: kill it, then kill -9. I did not believe there could still be a problem after that, but the result was the same: the port was still in use.
ps -ef |grep http
nobody 16295 1 0 Nov24 ? 00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root 29211 3398 0 11:02 pts/3 00:00:00 grep http
kill 16295
ps -ef |grep http
nobody 16295 1 0 Nov24 ? 00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root 29625 3398 0 11:02 pts/3 00:00:00 grep http
kill -9 16295
ps -ef |grep http
nobody 16295 1 0 Nov24 ? 00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root 30112 3398 0 11:02 pts/3 00:00:00 grep http
kill -TERM 16295
ps -ef |grep http
nobody 16295 1 0 Nov24 ? 00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root 30112 3398 0 11:03 pts/3 00:00:00 grep http
Is something holding it up? Check the state of the process to see what it is waiting on:
ps axopid,comm,wchan | grep 16295
So this is where the problem lies. A Google search confirmed that it is an NFS bug in kernels before 2.6.33.1:
nfs_wait_bit_uninterruptible:
https://bugzilla.kernel.org/show_bug.cgi?id=15552
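A process that ignores even kill -9 is typically in uninterruptible sleep (state D): the signal is only acted on once the kernel call it is blocked in returns. Below is a small sketch for listing such processes together with the kernel function they are waiting in. The name list_uninterruptible is a hypothetical helper, and the wchan:32 width in the usage line is an arbitrary choice to avoid truncating long function names.

```shell
# Print the lines of "ps axo pid,stat,wchan,comm" style output (stdin)
# whose second column, the process state, starts with D.
# The header line is dropped automatically because "STAT" does not match.
list_uninterruptible() {
    awk '$2 ~ /^D/'
}

# Typical use on a live system (assumes procps ps):
#   ps axo pid,stat,wchan:32,comm | list_uninterruptible
```

If most of the listed processes share an NFS-related wait function, the next thing to check is the NFS server rather than the processes themselves.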
Verify:
Disk status on machine A:
A machine:/usr/local # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             9.9G  1.8G  7.6G  20% /
udev                  3.9G  296K  3.9G   1% /dev
/dev/sda3              20G   13G  6.4G  67% /usr/local
/dev/sda4             103G   28G   70G  29% /data
B machine:/xx/htdocs  103G   30G   68G  31% /xx/admin/htdocs
Accessing the mounted directory did not work: it hung for a long time without responding.
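Touching a hung NFS mount directly can leave the current shell stuck as well. A minimal sketch of a safer probe runs stat in the background and gives up after a few seconds. The name probe_dir, the default of 5 seconds, and the marker-file approach are all assumptions made for this illustration. If the mount is hung, the background stat stays blocked, but the calling shell gets control back.

```shell
# Probe a directory without blocking the calling shell.
# Usage: probe_dir DIR [SECONDS]
# Returns 0 if stat on DIR finished within the limit, 1 otherwise.
probe_dir() {
    dir=$1
    limit=${2:-5}
    marker=$(mktemp) || return 2
    # Background subshell: record stat's exit status once it completes.
    ( stat "$dir" >/dev/null 2>&1; echo "$?" > "$marker" ) >/dev/null 2>&1 &
    waited=0
    # The marker stays empty until the background stat has returned.
    while [ ! -s "$marker" ]; do
        if [ "$waited" -ge "$limit" ]; then
            echo "no answer from $dir after ${limit}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    rm -f "$marker"
    # "answered" only means the filesystem responded, even if with an error.
    echo "$dir answered"
}

# Typical use:
#   probe_dir /xx/admin/htdocs 5
```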
Logging in to machine B, the remote NFS server, I found that its load was extremely high and that it had basically stopped responding, so I rebooted it.
After rebooting machine B, the load on machine A also returned to normal.
Checking for the stubborn process 16295 again, I found that it had disappeared.
root 18012 1 1 11:14 ? 00:00:01 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody 18168 18012 0 11:14 ? 00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody 18169 18012 0 11:14 ? 00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody 18171 18012 0 11:14 ? 00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody 18173 18012 0 11:14 ? 00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody 18175 18012 0 11:14 ? 00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
The unkillable cockroach: why killing a process had no effect during troubleshooting.