Requirements: Use the shell to customize a variety of personalized alarm tools, but the need for unified management, standardized management.
Idea: Specify a script package that contains the main program, subroutine, configuration file, mail engine, output log, and so on.
Main program: As the entire script portal, is the lifeblood of the entire system.
Configuration file: is a control center that uses it to switch individual subroutines, specifying each associated log file.
Subroutine: This is the real monitoring script, used to monitor each indicator.
Mail Engine: It is implemented by a Python program that defines the server to which the message is sent, the person who sent it, and the sender's password
Output log: The entire monitoring system should have a log output.
Requirements: Our machine roles are varied, but the same monitoring system is deployed on all machines, and the entire program framework is consistent, regardless of the role of all machines, depending on the role of the different configuration files.
Program Architecture:
Under Bin is the main program
Conf is the configuration file
Shares is the various monitoring scripts
Mail engine under Mail
Log is the journal.
Main script
#!/bin/bash#是否发送邮件的开关export send=1#过滤ip地址export addr=`/sbin/ifconfig |grep -A1 "ens33: "|awk ‘/inet/ {print $2}‘`dir=`pwd`#只需要最后一级目录名last_dir=`echo $dir|awk -F‘/‘ ‘{print $NF}‘`#下面的判断目的是,保证执行脚本的时候,我们在bin目录里,不然监控脚本、邮件和日志很有可能找不到if [ $last_dir == "bin" ] || [ $last_dir == "bin/" ]; then conf_file="../conf/mon.conf"else echo "you shoud cd bin dir" exitfiexec 1>>../log/mon.log 2>>../log/err.logecho "`date +"%F %T"` load average"/bin/bash ../shares/load.sh#先检查配置文件中是否需要监控502if grep -q ‘to_mon_502=1‘ $conf_file; then export log=`grep ‘logfile=‘ $conf_file |awk -F ‘=‘ ‘{print $2}‘ |sed ‘s/ //g‘` /bin/bash ../shares/502.shfi
Configuration file
Vim mon/conf/mon.conf
##to config the options if to monitor##定义mysql的服务器地址、端口以及user、passwordto_mon_cdb=0 ##0 or 1, default 0,0 not monitor, 1 monitordb_ip=10.20.3.13db_port=3315db_user=usernamedb_pass=passwd##httpd 如果是1则监控,为0不监控to_mon_httpd=0##php 如果是1则监控,为0不监控to_mon_php_socket=0##http_code_502 需要定义访问日志的路径to_mon_502=1logfile=/data/log/xxx.xxx.com/access.log##request_count 定义日志路径以及域名to_mon_request_count=0req_log=/data/log/www.discuz.net/access.logdomainname=www.discuz.net
Monitoring Project
1 Load Monitoring
vim/mon/shares/load.sh
#! /bin/bashload=`uptime |awk -F ‘average:‘ ‘{print $2}‘|cut -d‘,‘ -f1|sed ‘s/ //g‘ |cut -d. -f1`if [ $load -gt 10 ] && [ $send -eq "1" ]then echo "$addr `date +%T` load is $load" >../log/load.tmp /bin/bash ../mail/mail.sh [email protected] "$addr\_load:$load" "`cat ../log/load.tmp`"fiecho "`date +%T` load is $load"
2 502 Error Monitoring
Vim mon/shares/502.sh
#! /bin/bashd=`date -d "-1 min" +%H:%M`c_502=`grep :$d: $log |grep ‘ 502 ‘|wc -l`if [ $c_502 -gt 10 ] && [ $send == 1 ]; then echo "$addr $d 502 count is $c_502">../log/502.tmp /bin/bash ../mail/mail.sh [email protected] "$addr\_502 $c_502" "`cat ../log/502.tmp`"fiecho "`date +%T` 502 $c_502"
3 HDD Monitoring
Vim mon/shares/disk.sh
#! /bin/bashrm -f ../log/disk.tmpfor r in `df -h |awk -F ‘[ %]+‘ ‘{print $5}‘|grep -v Use`do if [ $r -gt 90 ] && [ $send -eq "1" ]then echo "$addr `date +%T` disk useage is $r" >>../log/disk.tmpfiif [ -f ../log/disk.tmp ]then df -h >> ../log/disk.tmp /bin/bash ../mail/mail.sh [email protected] "$addr\_disk$r" "` cat ../log/disk.tmp`" echo "`date +%T` disk useage is nook"else echo "`date +%T` disk useage is ok"fi
Alarm Script
Vim mail.sh
log=$1 //给变量log赋值t_s=`date +%s` //定义当前绝对时间戳,单位为秒t_s2=`date -d "2 hours ago" +%s` //定义上次告警时间戳,用作第一次发现异常if [ ! -f /tmp/$log ]then echo $t_s2 > /tmp/$logfi //将上次告警时间戳写入日志,用作第一次发现异常t_s2=`tail -1 /tmp/$log|awk ‘{print $1}‘` //从日志取出上次告警时间戳echo $t_s >> /tmp/$log //将本次告警时间戳追加到日志v=$[$t_s-$t_s2] //计算两次告警相隔时间if [ $v -gt 3600 ]then ../mail/mail.py $1 $2 "$3" //当相隔超过一小时,发送邮件,提醒有故障发生。 echo "0" > /tmp/$log.txtelse if [ ! -f /tmp/$log.txt ] then echo "0" > /tmp/$log.txt fi nu=`cat /tmp/$log.txt` nu2=$[$nu+1] echo $nu2 > /tmp/$log.txt if [ $nu2 -gt 10 ] then ../mail/mail.py $1 "trouble continue 10 min $2 " "$3" echo "0" > /tmp/$log.txt fi fi //相隔小于一小时,启动计数器,计数器大于10发送邮件,提醒故障持续了10分钟。
Send mail Script
Vim mail.py
#!/usr/bin/env python#-*-coding:utf-8-*-import os,sysreload (SYS) sys.setdefaultencoding (' UTF8 ') import getoptimport Smtplibfrom email. Mimetext Import mimetextfrom Email. Mimemultipart Import mimemultipartfrom subprocess import *def sendqqmail (Username,password,mailfrom,mailto,subject, Content): Gserver = ' smtp.qq.com '//More practical changes Gport = Try:msg = Mimetext (Unicode (content). Encode (' U Tf-8 ') msg[' from '] = Mailfrom msg["to"] = mailto msg[' reply-to '] = mailfrom msg[' Subject '] = Subject SMTP = Smtplib. SMTP (Gserver, Gport) smtp.set_debuglevel (0) Smtp.ehlo () Smtp.login (Username,password) Smtp.sen DMail (Mailfrom, mailto, msg.as_string ()) Smtp.close () except Exception,err:print "Send Mail failed. Error:%s "% errdef Main (): To=sys.argv[1] subject=sys.argv[2] content=sys.argv[3]# #定义QQ邮箱的账号和密码, you need to change your account and password Sendqqmail (' [email protected] ', ' aaaaaaaaaa ', ' [email protected] ', to,subject,content) If __name__ = = "__main__": Main ()
Two questions
1 Alarm Convergence Script principle
First of all understand a premise, we are to find the problem will be alert email notification, then when the alarm script is not loaded, said the name of the current monitoring is normal. Or an exception occurred before, but it has been restored.
The purpose of our script is to prevent problems from appearing and to be alerted frequently during the repair period. Then we are going to judge whether this is a new anomaly or a fault that has been discovered. We can define a time threshold of one hour. For example, the distance from the last problem occurred, that is, the time of the last alarm is greater than one hour, we think is a new anomaly, or is an unresolved failure.
Specifically how to achieve, we can use the idea of logging, each alarm, we record an absolute timestamp, the next time the alarm arrives, with the current time stamp minus the last time stamp of the alarm, and then if the difference is determined to determine the alarm action.
In the specific script defined T2 as the last alarm timestamp, T1 is the current timestamp, T1-T2 is our judgment value. T1 can be obtained directly from the command, t1= ' date +%s ' T2 to be obtained from the log. But there is also a problem, for the first alarm, where the T2 from, and simply, we ourselves define a time stamp that satisfies the new exception condition, such as two hours ago.
T2= ' date-d ' hours ago ' +%s '
For a problem that has been found to be solved, that is, the corresponding time stamp difference of less than one hour, if our script is executed one minute, we can make a counter, each discovery, count once, when the counter is more than 10 o'clock, that is, the exception lasted 10 minutes, this time send mail once, Empty the timer at the same time.
2 $ $ $ pass parameter delivery problem
Because the mail.py script decides to send the message must take three parameters, the recipient mailbox, the subject, the content. So the mail.sh call mail.py must pass to it three parameters, then where does the three parameters come from, from the monitoring item script, such as load.sh
The delivery order is load.sh---->mail.sh----->mail.py
Then explain mail.sh the first line, log=$1, is a variable assignment statement, at this time is the load.sh pass over the mailbox name.
In the program I see the back of the time stamp is saved in the/tmp/$log, the copy statement is to give the log file a meaningful name, easy to manage later. Of course, you can give other assignments, like Log=time.txt.
Linux Learning Summary (62) Shell script 5-monitoring system development