I. Introduction
The company's important internal services have been monitored with the Zabbix deployment introduced earlier, but several recent incidents have prompted us to improve our monitoring approach.
During the National Day (October 1) holiday, we received more than 300 alert messages in a single night, all reporting hosts dropping off the network and reconnecting. From the alert content, I first suspected that the network was unstable. However, the person on duty (from the business department) reported that the service applications ran normally that night, so a network fault could be ruled out. I then suspected that the monitoring system itself was short on performance because of its MySQL database; a look at the database service would have shown immediately whether it was the cause. The maintenance staff could not maintain the service remotely at the time (they had only their mobile phones during the holiday), and the problem did not justify on-site maintenance during the holiday. I remember thinking how convenient it would be to refresh a web page and see the state of a specific service.
A few days ago, we received a report that the business system could not be accessed, yet the maintenance staff's mobile phones had not received an alarm. After troubleshooting, I found that the monitoring service had been stopped manually. It is therefore essential to monitor the monitoring service itself.
1.1 Introduction to Frigga
Frigga is an open-source process monitoring tool from Xiaomi (https://github.com/xiaomi-sa/frigga). It is a simple, highly scalable process monitoring framework. It is based on the open-source tool god, with a web interface and an RPC interface added to meet the service management needs of large clusters.
In Norse mythology, Frigga is the wife of Odin; she presides over marriage and the family and weaves the clouds.
1.2 Frigga Functions
Integrates with god, which acts as the supervisor for the monitored programs
Uses a client/server (C/S) architecture and integrates with multiple authentication methods to support operations management of large clusters
Provides API interfaces for the basic functions to make extension easier
Supports a standalone web interface for god for easy viewing and management
Supports log viewing
Supports adding custom XML-RPC interfaces for secondary development
1.3 Environment dependencies of Frigga
Ruby 1.9.3 and bundle
II. Install and configure Frigga
2.1 Configure the YUM sources
Configure the system's yum sources as follows:
wget http://mirrors.163.com/.help/CentOS6-Base-163.repo
mv CentOS6-Base-163.repo /etc/yum.repos.d/
wget http://mirrors.ustc.edu.cn/fedora/epel/6/x86_64/epel-release-6-8.noarch.rpm
wget http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.3-1.el6.rf.x86_64.rpm
rpm -ivh epel-release-6-8.noarch.rpm
rpm -ivh rpmforge-release-0.5.3-1.el6.rf.x86_64.rpm
rpm -ivh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-remi
2.2 Install Frigga
Install Frigga and its dependency packages as follows:
yum install git                                # Install git
yum -y install ruby gems                       # Install ruby
curl -L https://get.rvm.io | bash -s stable    # Install RVM, used to upgrade ruby
rvm install ruby-1.9.3                         # You may need to log in again before this step
gem install bundle                             # Install bundle
cd /opt/                                       # Install Frigga
git clone https://github.com/xiaomi-sa/frigga.git
cd /opt/frigga
./script/run.rb start                          # Start Frigga
god status                                     # View the startup status
2.3 Configure Frigga
By default, Frigga serves its web interface on port 9001 once it has started. The user name and password used for authentication are stored in /opt/frigga/conf/frigga.yml, as shown below:
cat /opt/frigga/conf/frigga.yml
---
port: 9001
http_auth: ["admin", "password"]
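As a quick sanity check of these settings, the sketch below parses a frigga.yml-style document and builds (without sending) an HTTP request that carries the configured credentials. The helper name build_status_request, the host 127.0.0.1, and the path / are assumptions made for illustration; they are not taken from Frigga's documentation.

```ruby
require 'yaml'
require 'net/http'
require 'uri'

# Sample content mirroring /opt/frigga/conf/frigga.yml.
SAMPLE_CONFIG = <<YAML
---
port: 9001
http_auth: ["admin", "password"]
YAML

# Hypothetical helper: builds a GET request with HTTP basic auth.
# The default host and path are assumptions; adjust them to your setup.
def build_status_request(config, host = '127.0.0.1', path = '/')
  user, password = config.fetch('http_auth')
  uri = URI::HTTP.build(:host => host, :port => config.fetch('port'), :path => path)
  request = Net::HTTP::Get.new(uri.request_uri)
  request.basic_auth(user, password)
  [uri, request]
end

config = YAML.load(SAMPLE_CONFIG)
uri, request = build_status_request(config)
puts uri
```

To actually send the request, something like Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) } could be used on a host where Frigga is running.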
2.4 Configure SSH monitoring
Configuration files end with the .god suffix and are stored in the gods folder. The following example uses the ssh service on CentOS 6.4. If the sshd process is started or restarted five times within five minutes, it is treated as flapping and changed to unmonitored. After 10 minutes, monitoring is tried again. If this happens five times within two hours, god gives up on the process completely.
cd /opt/frigga/gods
vim sshd.god
God.watch do |w|
  w.name = 'sshd'
  w.start = "/etc/init.d/sshd start"
  w.stop = "/etc/init.d/sshd stop"
  w.restart = "/etc/init.d/sshd restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/sshd.pid'
  w.behavior(:clean_pid_file)
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  # lifecycle
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
Make sure the path to the sshd command is written correctly. After completing the preceding settings, enter the following command to load the sshd monitoring:
god load sshd.god
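The flapping rule in sshd.god (times = 5, within = 5 minutes) can be pictured with a small model. This is a simplified sketch for illustration only, not god's actual implementation; the function name flapping? and the use of plain numbers of seconds as timestamps are assumptions.

```ruby
# Simplified model of a flapping check (not god's real code).
# transitions: timestamps, in seconds, at which the process entered
#              the start or restart state
# times:       number of transitions that counts as flapping
# within:      length of the time window, in seconds
def flapping?(transitions, times, within)
  return false if transitions.size < times
  recent = transitions.sort.last(times)
  (recent.last - recent.first) < within
end

# Five restarts within four minutes: treated as flapping.
puts flapping?([0, 60, 120, 180, 240], 5, 300)   # prints "true"

# Five restarts spread over almost seven minutes: not flapping.
puts flapping?([0, 100, 200, 300, 400], 5, 300)  # prints "false"
```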
2.5 Configure Apache monitoring
God.watch do |w|
  w.name = 'apache'
  w.start = "/etc/init.d/httpd start"
  w.stop = "/etc/init.d/httpd stop"
  w.restart = "/etc/init.d/httpd restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/httpd/httpd.pid'
  w.behavior(:clean_pid_file)
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  # lifecycle
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
2.7 Configure MySQL monitoring
God.watch do |w|
  w.name = 'mysql'
  w.start = "/etc/init.d/mysqld start"
  w.stop = "/etc/init.d/mysqld stop"
  w.restart = "/etc/init.d/mysqld restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/mysqld/mysqld.pid'
  w.behavior(:clean_pid_file)
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  # lifecycle
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
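The sshd, Apache, and MySQL watches differ only in the service name, the init script, and the pid file, so copy-and-paste slips in the start, stop, and restart commands are easy to make. One possible way to avoid them is to generate the files from a single template. The sketch below is an illustration under that assumption; the helper render_watch and the SERVICES table are hypothetical and are not part of Frigga or god.

```ruby
# Hypothetical helper: renders the text of a .god file for a service
# managed through an init.d script.
def render_watch(name, init_script, pid_file)
  <<-GOD
God.watch do |w|
  w.name = '#{name}'
  w.start = "#{init_script} start"
  w.stop = "#{init_script} stop"
  w.restart = "#{init_script} restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '#{pid_file}'
  w.behavior(:clean_pid_file)
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
  GOD
end

# Service name, init script, and pid file, as used in this article.
SERVICES = [
  ['sshd',   '/etc/init.d/sshd',   '/var/run/sshd.pid'],
  ['apache', '/etc/init.d/httpd',  '/var/run/httpd/httpd.pid'],
  ['mysql',  '/etc/init.d/mysqld', '/var/run/mysqld/mysqld.pid']
]

# Map each target file name to its rendered content.
watches = {}
SERVICES.each do |name, script, pid|
  watches["#{name}.god"] = render_watch(name, script, pid)
end

puts watches.keys.join(', ')
```

Each rendered string could then be written into the gods folder and loaded with god load, in the same way as sshd.god.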
This article is from the "virtual reality" blog. Please keep this source link: http://waringid.blog.51cto.com/65148/1313557