The unkillable cockroach: troubleshooting why kill has no effect on a process

Source: Internet
Author: User
Tags: disk usage

Today, while dealing with an abnormally high load (1000+) on a machine, I ran into a situation I had never seen before: an unusually stubborn process. I tried every method I could think of to kill it, but nothing worked. In the end, with Google's guidance, I got rid of the die-hard that had frustrated me so much.

1. Description of the problem:
System: Kernel 2.6.32.43
Machines: A (web), B (web + NFS)
The load on machine A is very high, but logins work normally and the machine still responds quickly.

Analysis process:
1. top shows that CPU and memory are normal, but swap usage is too high.

A machine:/usr/local # top
top - 11:01:29 up 1051 days, 16:55,  3 users,  load average: 1694.36, 1694.26, 1679.68
Tasks: 5367 total,   1 running, 5366 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.1%us, 19.6%sy,  0.0%ni, 50.0%id, 21.3%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8049196k total,  7985004k used,    64192k free,     4080k buffers
Swap:  2104504k total,  2067308k used,    37196k free,  3381972k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1896 root      20   0 16840  15m  468 S    6  0.2  95:41.71 sap1002
 9393 root      20   0  268m 4572  252 S    6  0.1   3650:16 newoctopusd
27609 root      20   0  9648 5300  876 R    4  0.1   0:00.55 top
13737 root      20   0 61072  58m  58m S    1  0.7   0:02.77 sqm_agent
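A load average above 1600 with the CPU half idle looks contradictory, but on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones. The following is a minimal sketch for tallying processes by state. The function name state_tally is made up for this illustration, and it assumes only the one-column output of `ps axo stat`.

```shell
# Tally processes by the first letter of their state (R, S, D, Z, ...).
# Reads the output of `ps axo stat` on stdin; the first line is the header.
# state_tally is a hypothetical helper name, not a standard tool.
state_tally() {
    awk 'NR > 1 { count[substr($1, 1, 1)]++ }
         END { for (s in count) printf "%s %d\n", s, count[s] }' | sort
}

# Usage on a live system:
#   ps axo stat | state_tally
```

A large D count next to a small R count points at processes blocked on I/O rather than at CPU pressure.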

2. Use free -m to check memory usage, mainly to see how much swap is in use.

A machine:/usr/local # free -m
             total       used       free     shared    buffers     cached
Mem:          7860       6611       1249          0         10       2134
-/+ buffers/cache:       4466       3394
Swap:         2055       2045         10

3. To find the culprit, press f and then p in top to show the SWAP column and look for the programs using the most swap. Each httpd uses about 4 MB, which does not seem like much.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND
 5135 root      20   0  266m  788  432 S    6  0.0   0:26.68 265m newoctopusd
 5082 root      20   0 61072  58m  58m S    1  0.7   0:02.62 1276 sqm_agent
16796 root      20   0  5796 1484  880 R    1  0.0   0:00.09 4312 top
 5186 root      20   0 30616  21m  472 S    0  0.3   0:01.21 9112 dnsagent
 5831 root      20   0  5288 2060 1320 S    0  0.0   0:00.06 3228 sshd
    1 root      20   0   788  304  256 S    0  0.0   0:29.38  484 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00    0 kthreadd
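In the listing above, each SWAP value matches VIRT minus RES (for example 5796 - 1484 = 4312 for top), which suggests this version of top derives the column rather than measuring swapped-out pages. On newer kernels, /proc/<pid>/status carries a VmSwap line that reports swap per process; according to the proc(5) man page this field appeared in Linux 2.6.34, so it would not be available on the 2.6.32.43 kernel described here. A rough sketch under that assumption, with swap_by_pid being a name invented for this example:

```shell
# List per-process swap usage in kB, largest first, from /proc/<pid>/status.
# Assumes a kernel that reports VmSwap; swap_by_pid is a hypothetical helper.
# The optional argument (default /proc) exists only to make testing easy.
swap_by_pid() {
    proc_root=${1:-/proc}
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        pid=${status%/status}
        pid=${pid##*/}
        awk -v pid="$pid" '
            /^Name:/   { name = $2 }
            /^VmSwap:/ { swap = $2 }
            END { if (swap > 0) printf "%d %s %s\n", swap, pid, name }
        ' "$status" 2>/dev/null
    done | sort -rn
}

# Usage on a live system (output columns: swap_kB pid name):
#   swap_by_pid | head
```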

4. Since the machine had been handed over from another department, I assumed httpd was not the problem, but I ran a quick count anyway and was stunned: 502 processes.

ps axu |grep http|wc -l
502

What is going on here? Why have so many processes been started?
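Note that `ps axu |grep http|wc -l` also counts the grep process itself, plus anything else whose command line happens to contain "http". A small sketch that counts only processes whose executable is httpd, assuming the usual 11-column `ps aux` layout where the command starts in column 11; count_httpd is a name invented here.

```shell
# Count processes whose executable path ends in "httpd".
# Reads `ps aux` output on stdin; column 11 is the start of COMMAND.
# count_httpd is a hypothetical helper name.
count_httpd() {
    awk '$11 ~ /httpd$/ { n++ } END { print n + 0 }'
}

# Usage on a live system:
#   ps aux | count_httpd
```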

5. So I confidently ran killall httpd, waited for the resources to be released, and ran /usr/local/apache2/bin/apachectl -k start. The start failed because the port was still in use.
I then looked at the httpd processes and found one stubborn leftover. Fine, keep it simple: kill -9, which I was sure could not fail, followed by /usr/local/apache2/bin/apachectl -k start again. The result was still a port-in-use error.

ps -ef |grep http
nobody   16295     1  0 Nov24 ?        00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root     29211  3398  0 11:02 pts/3    00:00:00 grep http
kill 16295
ps -ef |grep http
nobody   16295     1  0 Nov24 ?        00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root     29625  3398  0 11:02 pts/3    00:00:00 grep http
kill -9 16295
ps -ef |grep http
nobody   16295     1  0 Nov24 ?        00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root     30112  3398  0 11:02 pts/3    00:00:00 grep http
kill -TERM 16295
ps -ef |grep http
nobody   16295     1  0 Nov24 ?        00:00:08 /usr/local/httpd-2.2.19/bin/httpd -k start
root     30112  3398  0 11:03 pts/3    00:00:00 grep http

Is something holding it back? Next, check the state of the process to see what the reason is:

ps axo pid,comm,wchan | grep 16295

So this is where the problem lies. A Google search confirmed that it is an NFS bug in kernels before 2.6.33.1.
nfs_wait_bit_uninterruptible:
https://bugzilla.kernel.org/show_bug.cgi?id=15552
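A process in uninterruptible sleep (state D) does not act on signals, SIGKILL included, until the kernel call it is blocked in returns, which would explain why kill -9 appeared to be ignored. The command output is not shown above, so the sketch below assumes the WCHAN column for PID 16295 read nfs_wait_bit_uninterruptible. It groups all D-state processes by the kernel function they are waiting in; blocked_by_wchan is a made-up helper name, and the input is assumed to be the output of `ps axo pid,stat,wchan:32,comm`.

```shell
# Group processes in uninterruptible sleep (STAT starting with D)
# by the kernel function they are waiting in, most common first.
# Reads `ps axo pid,stat,wchan:32,comm` on stdin (first line is the header).
# blocked_by_wchan is a hypothetical helper name.
blocked_by_wchan() {
    awk 'NR > 1 && $2 ~ /^D/ { count[$3]++ }
         END { for (w in count) printf "%d %s\n", count[w], w }' | sort -rn
}

# Usage on a live system:
#   ps axo pid,stat,wchan:32,comm | blocked_by_wchan
```

If nearly every blocked process is waiting in an nfs_* function, the NFS server or the network path to it is the place to look next.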

Verify:

Disk usage on the machine:

A machine:/usr/local # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             9.9G  1.8G  7.6G  20% /
udev                  3.9G  296K  3.9G   1% /dev
/dev/sda3              20G   13G  6.4G  67% /usr/local
/dev/sda4             103G   28G   70G  29% /data
B machine:/xx/htdocs  103G   30G   68G  31% /xx/admin/htdocs

Accessing the mounted directory fails: it is unreachable and the command hangs for a long time.
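Because touching a path on an unresponsive NFS mount can hang the shell, it can help to list the NFS mounts first without accessing them. Reading the mount table from /proc/mounts does not touch the mounted filesystems. A minimal sketch, assuming the standard /proc/mounts layout (device, mount point, type, options); nfs_mounts is a name invented here, and the optional file argument exists only for testing.

```shell
# Print "source mountpoint" for every NFS mount in the mount table.
# Matches types nfs and nfs4 only, not the nfsd pseudo filesystem.
# nfs_mounts is a hypothetical helper name.
nfs_mounts() {
    awk '$3 ~ /^nfs4?$/ { print $1, $2 }' "${1:-/proc/mounts}"
}

# Usage on a live system:
#   nfs_mounts
```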

Logging in to machine B, the remote NFS server, I found that its load was extremely high and that it had basically stopped responding, so I restarted it.
After machine B (the NFS server) was restarted, the load on machine A also returned to normal.
Checking for the stubborn process 16295 showed that it had disappeared.

root     18012     1  1 11:14 ?        00:00:01 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody   18168 18012  0 11:14 ?        00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody   18169 18012  0 11:14 ?        00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody   18171 18012  0 11:14 ?        00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody   18173 18012  0 11:14 ?        00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
nobody   18175 18012  0 11:14 ?        00:00:00 /usr/local/httpd-2.2.19/bin/httpd -k start
