Precautions for System O&M Monitoring

Source: Internet
Author: User
Tags website server
Many enterprise information systems now have their own monitoring platforms and monitoring methods. Whichever approach is used to implement real-time monitoring and fault alarms, most fall into one of two categories: centralized monitoring and distributed monitoring. Drawing on the monitoring problems encountered at the author's company, this article summarizes some experience and offers suggestions on monitoring platforms for your reference. Criticism and corrections are welcome.


To ensure stable operation after a system goes live, a reliable and sustainable monitoring mechanism is required for server hardware resources, performance, bandwidth, ports, processes, and services, so that performance bottlenecks and security risks are revealed promptly. We also need a sense of crisis: an understanding of which serious problems may occur on the server and how to handle them quickly. Examples include lost database data, logs growing too large, and intruders breaking into the database.

I. Preparations before going online
1. Back up data first. Define a regular backup policy covering all the data you consider important, and regularly check that the backups are valid and complete.
2. Rotate logs. Whichever rotation method you use, the goal is to control log growth and avoid filling the disk.
3. Put basic security measures in place, such as an iptables firewall and DenyHosts to prevent remote brute-force cracking.
4. Restrict MySQL remote login permissions, and so on.
5. Monitor servers and network elements.
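
The regular backup check in item 1 could be sketched roughly as follows. This is a minimal illustration, not a complete verification: it only confirms that a backup file exists, is not empty, and is recent. The function names, the 24-hour age limit, and the example path are assumptions made for this sketch; a real check would also test that the backup can be restored.

```python
import os
import time


def backup_looks_valid(size_bytes, mtime, now, max_age_hours=24.0, min_size_bytes=1):
    """Return True if a backup of this size and modification time is non-empty and recent."""
    age_hours = (now - mtime) / 3600.0
    return size_bytes >= min_size_bytes and age_hours <= max_age_hours


def check_backup_file(path, max_age_hours=24.0, min_size_bytes=1):
    """Check a backup file on disk; a missing or unreadable file counts as a failure."""
    try:
        info = os.stat(path)
    except OSError:
        return False
    return backup_looks_valid(
        info.st_size, info.st_mtime, time.time(), max_age_hours, min_size_bytes
    )


if __name__ == "__main__":
    # Hypothetical path, used only for illustration.
    print(check_backup_file("/backup/db/latest.sql.gz"))
```

Running a check like this from a scheduled job and alerting when it returns False turns "regularly check your backup" into something that does not depend on memory.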

II. Monitoring policy
1. Define an alarm priority policy
Generally, a monitored result is either success or failure, such as a Ping failure, a webpage access error, or a socket connection failure. Any such failure is the highest-priority alarm. You can also monitor the returned latency and content, such as the Ping round-trip latency, the time taken to load a webpage, and the content returned by the page, and use these results to define custom alarm conditions. For example, the Ping latency is normally 10-30 ms; when it exceeds a set threshold, the network or the server may be in trouble and responding slowly, so check whether traffic or server CPU usage is too high.
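
The two-tier policy described above can be expressed as a small classification function. This is a sketch under assumptions: the `Priority` names and the `classify_ping` function are invented for illustration, and the warning threshold is passed in by the caller because the appropriate value depends on the network being monitored.

```python
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    OK = 0
    WARNING = 1   # check succeeded but the result looks abnormal
    CRITICAL = 2  # check failed outright: highest priority


def classify_ping(latency_ms: Optional[float], warn_above_ms: float) -> Priority:
    """Map one Ping result to an alarm priority.

    latency_ms is None when no reply was received.
    """
    if latency_ms is None:
        return Priority.CRITICAL
    if latency_ms > warn_above_ms:
        return Priority.WARNING
    return Priority.OK


if __name__ == "__main__":
    # 100 ms is a hypothetical threshold chosen only for this example.
    for sample in (18.0, 240.0, None):
        print(sample, classify_ping(sample, warn_above_ms=100.0).name)
```

The same shape applies to webpage checks: an access error maps to the highest priority, while a slow response or unexpected content maps to a lower-priority custom condition.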
2. Define alarm information content standards
When a server or application fails, the alarm can carry many pieces of information: the name of the affected service, the server IP address, the monitored line, the error level, the error message, and the time of occurrence. Defining the alarm content and format in advance makes alarms consistent and readable. This matters most when alerts are received as text messages: a text message can hold at most 70 characters, and it is hard to describe a fault fully in that space unless the format is agreed beforehand. For example, a message such as "the live video broadcast server 10.0.211.65 failed monitoring on the telecommunications line on January 18" makes the fault immediately clear.
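
One way to enforce such a standard is to build every alarm through a single formatting function that puts the most important fields first and cuts the message at the 70-character limit. The field order, the `format_alarm` name, the short level tag, and the timestamp used in the example are all assumptions for this sketch.

```python
from datetime import datetime

SMS_LIMIT = 70  # maximum characters in one text message, as stated above


def format_alarm(service, ip, line, level, error, when, limit=SMS_LIMIT):
    """Build a fixed-order alarm string that never exceeds the limit.

    The identifying fields come first, so if the text has to be cut,
    only the tail of the free-form error message is lost.
    """
    stamp = when.strftime("%m-%d %H:%M")
    text = f"[{level}] {service} {ip} {line} {stamp} {error}"
    return text[:limit]


if __name__ == "__main__":
    # The time of day here is a made-up value for illustration.
    print(format_alarm("live-video", "10.0.211.65", "telecom", "CRIT",
                       "ping timeout", datetime(2024, 1, 18, 9, 30)))
```

Because every alarm passes through the same function, the recipient always finds the level, service, and address in the same position.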
3. Receive summary reports by email
I receive a summary report email from the website and server monitoring system every day. It takes two or three minutes to read and gives a general picture of the state of the website and servers.
4. Centralized monitoring and distributed monitoring
Active (centralized) monitoring requires no code or programs on the monitored server, which makes it secure and convenient, but it misses much detail: disk usage, CPU usage, and network traffic cannot be obtained. That detail is very useful. Excessive CPU usage suggests a problem with the website or program, and excessive traffic may indicate an attack.
Passive (distributed) monitoring commonly uses SNMP (Simple Network Management Protocol), through which you can monitor most of what you are interested in. Most operating systems support SNMP, and it is convenient and secure to enable and manage. Its disadvantages are that it consumes bandwidth and some CPU and memory, and it cannot monitor effectively when CPU usage or network traffic is already high.
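
The difference between the two approaches can be illustrated with two small functions. The first is an external probe of the kind a centralized monitor runs: it needs nothing installed on the target but only learns whether a port answers. The second reads a local detail (disk usage) that only something running on the server can see. Note that the second function is a plain standard-library stand-in for what an agent would report, not an SNMP query, and both function names are assumptions for this sketch.

```python
import shutil
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Active check from outside: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_used_percent(path="/"):
    """Local check: percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


if __name__ == "__main__":
    print("web port reachable:", tcp_port_open("127.0.0.1", 80))
    print("disk used: %.1f%%" % disk_used_percent("/"))
```

In practice the two are combined: external probes detect outages, and locally collected metrics explain them.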
5. Define primary and secondary failures
For services monitored on the same server, define one primary monitoring object. When the primary object fails, send alarms only for it and pause the secondary objects. For example, take Ping as the primary object. A Ping timeout means the server has crashed or the network is disconnected. In that case, send only the server Ping alarm and keep monitoring Ping; there is no need to keep monitoring and alerting on the other services. This greatly reduces the number of alarm messages and makes monitoring more reasonable and efficient.
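
The suppression rule above fits in a few lines. This is a sketch that assumes check results arrive as a dictionary of check name to pass/fail; the `alarms_to_send` name and the check names are illustrative.

```python
def alarms_to_send(results, primary="ping"):
    """Decide which failed checks should raise an alarm.

    results maps a check name to True (passed) or False (failed).
    If the primary check failed, only the primary alarm is returned,
    because the secondary failures are consequences of it.
    """
    if not results.get(primary, True):
        return [primary]
    return [name for name, ok in results.items() if not ok]


if __name__ == "__main__":
    # Server down: three checks fail, but only one alarm goes out.
    print(alarms_to_send({"ping": False, "http": False, "mysql": False}))
    # Server up, one service broken: that service is reported.
    print(alarms_to_send({"ping": True, "http": False, "mysql": True}))
```

With this rule, a crashed server produces one message instead of one per monitored service.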

6. Standardized deployment of local monitoring scripts
Deploy locally installed monitoring scripts in a unified, standardized way and record them in the KM system.
7. Implement self-repair for fault-tolerant businesses
For businesses that can tolerate faults, centrally deploy self-repair scripts, and have each script recheck the fault after repair no more than three times.
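
A bounded self-repair loop might look like the sketch below. It assumes the caller supplies two callables, `check` (returns True when the service is healthy) and `repair` (for example, a restart); these names and the return shape are assumptions for illustration. The limit of three follows the rule stated above.

```python
import time


def self_repair(check, repair, max_attempts=3, wait_seconds=0.0):
    """Try to repair a failed service, rechecking after each attempt.

    Returns (healthy, attempts_used). After max_attempts unsuccessful
    repairs the function gives up so that a human can be alerted.
    """
    if check():
        return True, 0
    for attempt in range(1, max_attempts + 1):
        repair()
        time.sleep(wait_seconds)  # give the service time to come back
        if check():
            return True, attempt
    return False, max_attempts


if __name__ == "__main__":
    state = {"repairs": 0}

    def check():
        return state["repairs"] >= 2  # simulated: healthy after two repairs

    def repair():
        state["repairs"] += 1

    print(self_repair(check, repair))
```

Capping the attempts matters: an unbounded restart loop can hide a persistent fault and add load to a server that is already struggling.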
8. Classify the monitored business systems
Classify business systems as level 1, level 2, or level 3; each level generates its own alerts.
9. Monitoring scope and objectives
Implement comprehensive monitoring and management of IT resources such as load balancing devices, network devices, servers, storage devices, security devices, databases, middleware, and application software. At the same time, automatically collect, filter, correlate, and analyze the fault events generated by the various management functions to enable early warning and quick fault location. Monitor the performance of IT resources such as networks and business applications, and provide performance reports and trend reports regularly as a sound basis for performance optimization and future capacity expansion.
Normally, the monitored objects can be divided as follows:
1. Server monitoring: mainly CPU load, memory usage, disk usage, logged-in users, process status, and NIC status.
2. Application monitoring: mainly the service status, throughput, and response time of the application. Different applications require different monitored objects, which are not listed here.
3. Database monitoring: listing it separately is enough to show its importance. It generally covers database status, usage of tables or tablespaces, deadlocks, error logs, and performance information.
4. Network monitoring: mainly current network conditions and network traffic.
These four items are the most basic and must be understood to keep a website running normally. Only then can we, as the saying goes, "devise strategies in the command tent and win battles a thousand miles away".