Normally, monitoring whether a program's process exists is enough. This time, however, we hit a different problem: a program developed in-house deadlocked while its process was still running. The impact was widespread. Worse, I could not tell where the problem was; colleagues from the testing team had to track it down for me. As the operations engineer, I lost face.
To avoid a repeat, we analyzed the deadlocked process and found that it occupies 100% of the CPU when deadlocked, compared with less than 10% normally. So I decided to write a Nagios plugin that monitors the resources the program uses, including CPU and memory.
1. Shell script requirements:
The cpu and mem thresholds are configurable. If the resource usage of a process exceeds a threshold, an alarm is triggered.
The script checks whether each process exists. If a process does not exist, an alarm is triggered.
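The two requirements above can be sketched as small helper functions. This is only a minimal illustration, not the script shown later: the names exceeds_threshold, process_exists and check_one are hypothetical, awk is used for the decimal comparison as an alternative to bc, and the process check assumes a ps that supports "-e -o comm=".

```shell
#!/bin/bash
# Hypothetical helpers; the function names are illustrative and do not
# appear in the original script.

# Succeed (return 0) when usage $1 is strictly greater than threshold $2.
# awk does the comparison so that decimal values such as 9.4 work.
exceeds_threshold() {
    awk -v use="$1" -v crit="$2" 'BEGIN { exit !((use + 0) > (crit + 0)) }'
}

# Succeed (return 0) when a running process has exactly the given name.
# Matching the whole command name avoids the false positives that a
# substring grep over "ps -ef" can produce.
process_exists() {
    ps -e -o comm= | grep -Fxq -- "$1"
}

# Check one process: $1 = name, $2 = measured cpu usage, $3 = cpu threshold.
# Returns 2 when the process is down or over the threshold, 0 otherwise.
check_one() {
    name="$1"
    cpu_use="$2"
    cpu_crit="$3"
    if ! process_exists "$name"; then
        echo "$name is down"
        return 2
    elif exceeds_threshold "$cpu_use" "$cpu_crit"; then
        echo "$name cpu ${cpu_use}% exceeds ${cpu_crit}%"
        return 2
    fi
    echo "$name cpu ${cpu_use}% is within ${cpu_crit}%"
    return 0
}
```

For example, "exceeds_threshold 9.4 5" succeeds, while "exceeds_threshold 5.6 50" fails. Note that on Linux the command name reported by ps is limited to 15 characters, so longer program names would need a different match.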
2. The script's execution results are as follows:
1. If the input format is incorrect, the help information is output.
[root@center230 libexec]# sh component_resource.sh
Usage parament:
    component_resource.sh [--cpu] [--mem]
Example:
    component_resource.sh --cpu 50 --mem 50
2. If no threshold is exceeded, the resource usage is output and the exit value is 0.
[root@center230 libexec]# sh component_resource.sh --cpu 50 --mem 50
VueSERVER_cpu_use=5.6% VueCache_cpu_use=1.9% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%; VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%
[root@center230 libexec]# echo $?
0
3. If a threshold is exceeded, the resource usage is output and the exit value is 2.
[root@center230 libexec]# sh component_resource.sh --cpu 5 --mem 5
VueSERVER_cpu_use=9.4% VueCache_cpu_use=0.0% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%; VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%
[root@center230 libexec]# echo $?
2
4. If a process does not exist, the script outputs the processes that are down, followed by the resource usage of the processes that are still running. The exit value is 2.
[root@yckj scripts]# sh component_resource.sh --cpu 50 --mem 50
Current VueDaemon VueCenter VueAgent VueCache VueSERVER is down.
[root@yckj scripts]# echo $?
2
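The exit values matter because Nagios derives the service state from the plugin's return code: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The script in this article only uses 0 and 2. As a rough sketch of that convention (the report function and the STATE_* variable names are my own, not part of the script):

```shell
#!/bin/bash
# Standard Nagios plugin return codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Hypothetical helper: print a one-line status and return the matching code.
# $1 is the state name; the remaining arguments form the message.
# Any unrecognized state name is reported as UNKNOWN.
report() {
    state="$1"
    shift
    case "$state" in
        OK)       code=$STATE_OK ;;
        WARNING)  code=$STATE_WARNING ;;
        CRITICAL) code=$STATE_CRITICAL ;;
        *)        state=UNKNOWN; code=$STATE_UNKNOWN ;;
    esac
    echo "$state - $*"
    return "$code"
}
```

A plugin would typically end with something like: report CRITICAL "VueSERVER_cpu_use=9.4%"; exit $?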
3. the Shell script code is as follows:
[root@center230 libexec]# cat component_resource.sh
#!/bin/bash
# author: yangrong
# date: 2014-05-20
# mail: 10286460@qq.com
# Note: bash is required, because the script uses arrays and [[ ]].

#pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER VUEConnector Myswitch Slirpvde)
pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER)

#### Obtain the cpu and mem thresholds ####
case $1 in
    --cpu) cpu_crit=$2 ;;
    --mem) mem_crit=$2 ;;
esac
case $3 in
    --cpu) cpu_crit=$4 ;;
    --mem) mem_crit=$4 ;;
esac

### Check the arguments: var=1 means invalid input, var=0 means normal ###
### Invalid input: the same option given twice, or not exactly 4 arguments ###
if [[ "$1" == "$3" ]]; then
    var=1
elif [ $# -ne 4 ]; then
    var=1
else
    var=0
fi

### Print the help information ###
if [ $var -eq 1 ]; then
    echo "Usage parament:"
    echo "    $0 [--cpu] [--mem]"
    echo ""
    echo "Example:"
    echo "    $0 --cpu 50 --mem 50"
    # Exit with 3 (Nagios UNKNOWN); a bare "exit" would return 0 here.
    exit 3
fi

### Collect the processes that are not running in NotExist ###
num=$(( ${#pragrom_list[@]} - 1 ))
NotExist=""
for digit in $(seq 0 $num)
do
    a=$(ps -ef | grep -v grep | grep "${pragrom_list[$digit]}" | wc -l)
    if [ "$a" -eq 0 ]; then
        NotExist="$NotExist ${pragrom_list[$digit]}"
        # Drop the missing process so that it is skipped in the usage check
        unset "pragrom_list[$digit]"
    fi
done
#echo "pragrom_list=${pragrom_list[@]}"

#### Compare the resource usage of each process with the thresholds ####
cpu_use_all=""
mem_use_all=""
compare_cpu_temp=0
compare_mem_temp=0
for n in "${pragrom_list[@]}"
do
    # In top's default layout, column 9 is %CPU and column 10 is %MEM
    cpu_use=$(top -b -n1 | grep "$n" | awk '{print $9}')
    mem_use=$(top -b -n1 | grep "$n" | awk '{print $10}')
    if [[ $cpu_use == "" ]]; then
        cpu_use=0
    fi
    if [[ $mem_use == "" ]]; then
        mem_use=0
    fi
    # bc prints 1 when the comparison is true, 0 otherwise
    compare_cpu=$(echo "$cpu_use > $cpu_crit" | bc)
    compare_mem=$(echo "$mem_use > $mem_crit" | bc)
    if [[ $compare_cpu == 1 ]]; then
        compare_cpu_temp=1
    fi
    if [[ $compare_mem == 1 ]]; then
        compare_mem_temp=1
    fi
    cpu_use_all="${n}_cpu_use=${cpu_use}% ${cpu_use_all}"
    mem_use_all="${n}_mem_use=${mem_use}% ${mem_use_all}"
done

### If NotExist is not empty, at least one process is down. The exit value is 2.
if [[ "$NotExist" != "" ]]; then
    echo -e "Current ${NotExist} is down.$cpu_use_all;$mem_use_all"
    exit 2
### If the cpu flag is 1, some process exceeds the cpu threshold. The exit value is 2.
elif [[ "$compare_cpu_temp" == 1 ]]; then
    echo -e "$cpu_use_all;$mem_use_all"
    exit 2
### If the mem flag is 1, some process exceeds the mem threshold. The exit value is 2.
elif [[ $compare_mem_temp == 1 ]]; then
    echo -e "$cpu_use_all;$mem_use_all"
    exit 2
### Otherwise everything is normal: output the cpu and memory usage and exit with 0.
else
    echo -e "$cpu_use_all;$mem_use_all"
    exit 0
fi
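The script above runs top twice for every process and matches names with a substring grep. One possible refinement, shown here only as a sketch, is to take a single top snapshot and look each process up by its exact command name. The functions usage_from_top and summarize are hypothetical, and the sketch assumes top's default batch layout, where column 9 is %CPU, column 10 is %MEM and column 12 is COMMAND.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# Read "top -b -n1" output on stdin and print "<cpu> <mem>" for the process
# whose COMMAND column equals $1. Values of several instances with the same
# name are added up; a process that is not listed yields "0.0 0.0".
usage_from_top() {
    awk -v name="$1" '
        $12 == name { cpu += $9; mem += $10 }
        END { printf "%.1f %.1f\n", cpu, mem }
    '
}

# Print one usage line per process name, reusing a single snapshot.
# $1 is the captured top output; the remaining arguments are process names.
summarize() {
    snapshot="$1"
    shift
    for n in "$@"; do
        usage=$(printf '%s\n' "$snapshot" | usage_from_top "$n")
        # "${usage% *}" is the cpu value, "${usage#* }" is the mem value
        echo "${n}_cpu_use=${usage% *}% ${n}_mem_use=${usage#* }%"
    done
}
```

It could be called as: summarize "$(top -b -n1)" VueDaemon VueCenter VueAgent VueCache VueSERVER. Whether summing the usage of several instances is the right behavior depends on how the monitored programs are deployed.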
4. Postscript:
I have been writing more and more shell scripts recently, and sooner or later I have to modify scripts I wrote earlier, which takes a while to understand again.
To make later maintenance easier, for yourself and for others, every function and every section of a script should be commented.