I. Background of Platform Construction
As the business has grown, the company's business systems and the number of systems running online have steadily increased. In the past, faults, potential risks, and security risks were discovered through manual inspection. That approach has become less and less efficient, while the workload and pressure on O&M personnel keep rising. To detect faults sooner, to make system maintenance more professional, standardized, and methodical, and to free O&M personnel from repetitive work for more meaningful tasks, we urgently need new monitoring methods and tools that help O&M engineers solve these problems.
II. Construction Objectives
To ensure the stability of the company's own software platform, the monitoring platform automatically monitors the online platform with sensibly chosen monitoring granularity and monitoring objects. Potential problems should be resolved as early as possible, while they are still in the bud, so as to improve the overall integration capability of the IT support departments and the operating quality of the delivered systems.
The ultimate goals of building the monitoring platform are as follows:
1. Discover potential problems promptly and reduce passive maintenance;
2. Provide an intuitive reference for platform performance optimization;
3. Improve the professionalism and standardization of system maintenance;
4. Improve user experience and reduce service downtime.
III. Functions and Content of the Monitoring Platform
1. Centralized Monitoring and Management
The platform collects and processes alarm information from across the system and analyzes root causes, helping O&M personnel identify the cause of a fault and quickly locate the fault point. It manages networks, hosts, databases, and applications, covering system software and hardware configuration information, system performance indicators, fault alarms, and log management.
Specific implementation: local shell scripts and an information collection engine archive logs, so that system information and exception logs are stored centrally on the monitoring platform for analysis, alarm generation, and report generation.
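The collection step can be illustrated with a minimal sketch. It assumes that an "exception log" is any line containing ERROR, FATAL, or Exception; the pattern, the LogRecord type, and the function names are hypothetical and are not taken from the platform itself.

```python
import re
from dataclasses import dataclass

# Assumption: which lines count as exceptions is not defined in the text.
EXCEPTION_PATTERN = re.compile(r"ERROR|FATAL|Exception")


@dataclass
class LogRecord:
    host: str      # machine the log came from
    source: str    # log file name
    line_no: int   # 1-based line number inside the log
    text: str      # the log line without its trailing newline


def collect_exceptions(host, source, lines):
    """Return one LogRecord for every line that looks like an exception."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if EXCEPTION_PATTERN.search(line):
            records.append(LogRecord(host, source, line_no, line.rstrip("\n")))
    return records


def should_alarm(records, threshold=1):
    """Signal an alarm once the number of exception lines reaches the threshold."""
    return len(records) >= threshold
```

In a real deployment the returned records would be shipped to central storage on the monitoring platform; here they are simply returned so the filtering logic can be checked in isolation.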
2. Unified monitoring management interface and multiple alarm methods
The real-time status of networks, systems, databases, and applications is shown centrally in a well-organized graphical interface, and alarms are delivered by SMS, email, or page.
3. Customize alarm priority policies
Generally, a monitored result is either success or failure, for example a Ping failure, a webpage access error, or a Socket connection failure. Such a failure is the highest-priority alarm. You can also monitor the returned latency and content, such as the Ping round-trip latency, the time taken to access a webpage, and the content retrieved from the webpage, and use these results to customize alarm conditions. For example, the return latency of Ping monitoring is normally between 10 and 30 ms. When the latency exceeds the configured threshold, the network or server may be having problems that slow the network response, so check whether the traffic or the server CPU usage is too high.
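The priority policy described here can be sketched as follows. The level names "critical" and "warning", the CheckResult type, and the function name are assumptions made for illustration. The latency threshold is deliberately a required parameter, because the text does not give a usable value for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    name: str                           # e.g. "ping", "http", "socket"
    success: bool                       # did the check succeed at all?
    latency_ms: Optional[float] = None  # measured latency, if any


def alarm_priority(result, latency_threshold_ms):
    """Map a check result to 'critical', 'warning', or None (no alarm)."""
    if not result.success:
        # An outright failure is always the highest-priority alarm.
        return "critical"
    if result.latency_ms is not None and result.latency_ms > latency_threshold_ms:
        # The check succeeded but was slower than the configured limit.
        return "warning"
    return None
```

For example, with a hypothetical threshold of 100 ms, a successful Ping at 20 ms raises no alarm, a successful Ping at 250 ms yields "warning", and a failed Ping yields "critical" regardless of latency.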
4. Customize alarm information content standards
When a server or application fails, the alarm can carry many pieces of information, such as the name of the affected service, the IP address of the server, the monitored line, the error level, the error message, and the time of occurrence. Defining the alarm content and its standards in advance makes alarms consistent and readable. This matters most for alerts received as text messages: a text message can contain at most 70 characters, and it is hard to describe a fault fully within that limit unless the format is fixed beforehand. For example, "Live video broadcast server 10.0.211.65: telecommunications line monitoring failed on January 18" makes the fault clear at a glance.
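A fixed message template is one way to stay within the 70-character limit. The sketch below is only an illustration: the field order, the bracketed level tag, and the trailing "..." on truncated messages are assumptions, not the platform's actual standard.

```python
MAX_SMS_CHARS = 70  # limit stated in the text


def format_alarm(service, ip, line, level, error, when):
    """Build a fixed-order alarm message and cut it to the SMS limit."""
    message = f"[{level}] {service} {ip} {line}: {error} @ {when}"
    if len(message) > MAX_SMS_CHARS:
        # Keep the most identifying fields (level, service, IP) at the front
        # so that truncation only ever costs detail from the tail.
        message = message[:MAX_SMS_CHARS - 3] + "..."
    return message
```

Putting the level, service, and IP address first means the receiver can still tell which system is affected even when a long error message is cut off.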
5. SMS alerts
At present, the platform automatically sends alarm text messages to the mobile phone of the responsible O&M engineer through the 139 email alarm function, routed by business and owner.
The platform can also call a third-party API to send alerts. The third-party API needs only a mobile phone number and the text message content to deliver the message immediately.
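Calling such an API might look like the sketch below. The endpoint URL, the JSON field names "phone" and "text", and both function names are hypothetical; a real SMS provider defines its own URL, parameters, and authentication, which are not described in the text.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the provider's documented URL.
SMS_API_URL = "https://sms.example.invalid/send"


def build_sms_request(phone, text, url=SMS_API_URL):
    """Prepare (but do not send) a JSON POST carrying the number and message."""
    body = json.dumps({"phone": phone, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_sms(phone, text, timeout=5):
    """Send the request and return the HTTP status code."""
    request = build_sms_request(phone, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the payload easy to inspect and test without any network access.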
6. Receive summary reports by email
Reading the summary report takes only two or three minutes and gives the overall status of the websites and servers.
7. Monitoring Management Standards
The platform monitors and manages network running status, system service quality, and fault alarms in real time.
8. Rich data report analysis functions
Based on the functions above, the system generates standard-format reports according to work requirements, and reports can be generated and adjusted by condition to meet the needs of IT system management and auditing.
IV. Platform Monitoring Objects
No. | Type            | Monitoring Scope      | Remarks
1   | Network         | Switches, routers, F5 | Monitor network device performance indicators; alarm when an indicator exceeds its limit
2   | Host            | Linux and Windows     | Monitor server performance indicators; alarm when an indicator exceeds its limit
3   | Middleware      | Nginx and Tomcat      | Monitor middleware performance indicators; alarm when an indicator exceeds its limit
4   | Streaming Media | Wowza, Nginx          | Monitor streaming media performance indicators; alarm when an indicator exceeds its limit
5   | Database        | MySQL                 | Monitor database performance indicators; alarm when an indicator exceeds its limit
V. Platform Architecture Design
1. Logical topology of the platform architecture
The platform design architecture is shown in Figure 5.1.
The platform monitors devices through unified monitoring and centralized display. The monitoring server collects information from engines deployed on the monitored objects, the Report Server filters, processes, and sorts that information, and a unified portal displays the results and sends SMS alarms.
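The data flow through these three stages can be sketched as a small pipeline. Everything here is illustrative: the Event type, the three severity levels, the rule that informational events are filtered out, and the rule that only critical events go to SMS are all assumptions, since the text does not specify them.

```python
from dataclasses import dataclass

# Assumed severity levels, most severe first.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


@dataclass
class Event:
    source: str    # monitored object that produced the event
    severity: str  # one of the keys of SEVERITY_ORDER
    message: str


def collect(engines):
    """Monitoring server: pull events from every engine (a callable here)."""
    events = []
    for engine in engines:
        events.extend(engine())
    return events


def process(events):
    """Report server: drop informational events, sort most severe first."""
    kept = [e for e in events if e.severity != "info"]
    return sorted(kept, key=lambda e: SEVERITY_ORDER[e.severity])


def publish(events):
    """Portal: return the lines to display and the subset to send by SMS."""
    display = [f"{e.severity.upper()} {e.source}: {e.message}" for e in events]
    sms = [line for e, line in zip(events, display) if e.severity == "critical"]
    return display, sms
```

Each engine is modelled as a plain function that returns a list of events, so the stages can be chained as publish(process(collect(engines))) and exercised without any deployed agents.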
2. Availability principles
Deploying the monitoring and management software should not require major changes to the original system structure or security policies, and should keep the impact on the original system's performance to a minimum. It must not affect or interfere with the normal operation of the production system, and it should consume as few system and network resources as possible.