IT System Monitoring Solution Design

Source: Internet
Author: User

I. Background of Platform Construction

As the business has grown, the company's business systems and the number of online systems have steadily increased. In the past, system faults, potential risks, and security risks were discovered through manual inspection. That approach has become less and less efficient, while the workload and pressure on O&M personnel keep rising. To detect faults sooner, to make system maintenance more professional, standardized, and systematic, and to free O&M personnel from repetitive work for more meaningful tasks, we urgently need new monitoring methods and tools that help O&M engineers solve these problems.

II. Construction Objectives

To keep the company's own software platform stable, the monitoring platform automatically monitors the online platform with appropriately chosen monitoring granularity and monitoring objects. Potential problems should be found and eliminated as early as possible, which improves the overall integrated capability of the IT support department and the operating quality of the delivered systems.

The ultimate goal of building a monitoring platform is as follows:

1. Discover potential problems promptly and avoid passive maintenance;

2. Provide an intuitive reference for platform performance optimization;

3. Improve the professionalism and standardization of system maintenance;

4. Improve user experience and reduce service downtime.

III. Functions and Content of the Monitoring Platform

1. Centralized Monitoring and Management

The platform collects and processes alarm information from all systems and analyzes root causes, helping O&M personnel identify the cause of a fault and quickly locate the fault point. It centrally manages networks, hosts, databases, and applications, covering system software and hardware configuration information, system performance indicators, fault alarms, and log management.

Specific implementation: local shell scripts and an information collection engine archive logs, so that system information and exception logs are stored centrally on the monitoring platform for analysis, alarm generation, and report generation.
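The collection step can be illustrated with a minimal sketch. It assumes logs are plain text lines and that exception lines can be recognized by keywords; the names `LogRecord`, `extract_exceptions`, and `EXCEPTION_KEYWORDS`, as well as the sample log lines, are hypothetical and not part of the original design.

```python
from dataclasses import dataclass

# Hypothetical keywords that mark a log line as an exception worth forwarding.
EXCEPTION_KEYWORDS = ("ERROR", "FATAL", "Exception")


@dataclass
class LogRecord:
    host: str    # server the line came from
    source: str  # log file name
    line: str    # the log line itself, without trailing newline


def extract_exceptions(host, source, lines):
    """Return one record for each log line that contains an exception keyword."""
    return [
        LogRecord(host, source, line.strip())
        for line in lines
        if any(keyword in line for keyword in EXCEPTION_KEYWORDS)
    ]


# Illustrative input only; a real engine would read the archived log files.
sample = [
    "10:00:01 INFO service started\n",
    "10:00:05 ERROR connection refused\n",
]
records = extract_exceptions("10.0.211.65", "app.log", sample)
```

The resulting records are what a collection engine would forward to the central platform; how they are transported and stored is left open here.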

2. Unified monitoring management interface and multiple alarm methods

The real-time status of networks, systems, databases, and applications is shown centrally in a well-organized graphical interface, and alarms are delivered by SMS, email, or on-page notification.

3. Customize alarm priority policies

Generally, a monitoring check either succeeds or fails, for example a Ping failure, a webpage access error, or a Socket connection failure. Such a failure is the highest-priority alarm. Beyond success and failure, you can also monitor the returned latency and content, such as the Ping latency, the time taken to load a webpage, and the content returned by the webpage, and use these results to define custom alarm conditions. For example, Ping latency is normally 10-30 ms. When the latency exceeds a defined threshold, the network or server may have a problem that slows network response; check whether traffic or server CPU usage is too high.
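The priority policy described above can be sketched as follows. This is only an illustration under stated assumptions: the `Priority` levels and the function `classify_ping` are hypothetical names, and the latency threshold is passed in by the caller because the text does not give a concrete value.

```python
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2  # highest priority: the check itself failed


def classify_ping(latency_ms: Optional[float], warn_threshold_ms: float) -> Priority:
    """Map a Ping result to an alarm priority.

    latency_ms is None when the Ping failed outright.
    warn_threshold_ms is an assumed, caller-supplied limit.
    """
    if latency_ms is None:
        return Priority.CRITICAL
    if latency_ms > warn_threshold_ms:
        return Priority.WARNING
    return Priority.OK
```

The same pattern extends to webpage load time or returned content: a hard failure maps to the highest priority, and a result outside its normal range maps to a lower-priority warning.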

4. Customize alarm information content standards

When a server or application fails, an alarm can carry many pieces of information, such as the name of the affected service, the IP address of the server, the monitored line, the error level, the error message, and the time of occurrence. Defining the alarm content and its standards in advance keeps alarms consistent and readable. This matters most when alerts are received as text messages: a text message can contain at most 70 characters, and it is difficult to describe a fault fully within that limit, so the content standard must be defined beforehand. For example, "Monitoring of live video broadcast server 10.0.211.65 failed on the telecommunications line on January 18" makes the fault clear at a glance.
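A predefined content standard can be expressed as a fixed template. The sketch below is one possible layout, not the format used by the platform; the field order, the function name `format_alarm`, and the truncation with "..." are assumptions. Only the 70-character limit comes from the text.

```python
SMS_MAX_CHARS = 70  # limit stated in the text


def format_alarm(service, ip, line, level, error, time_str):
    """Build an alarm message in a fixed field order, capped at 70 characters."""
    message = f"[{level}] {service} {ip} {line}: {error} at {time_str}"
    if len(message) > SMS_MAX_CHARS:
        # Keep the leading fields (level, service, IP) and mark the cut.
        message = message[: SMS_MAX_CHARS - 3] + "..."
    return message


text = format_alarm(
    "live-video", "10.0.211.65", "telecom", "CRIT", "ping failed", "01-18 10:00"
)
```

Putting the level, service, and IP address first means the most important fields survive even when a long error message has to be cut.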

5. SMS warning

At present, the platform automatically sends alarm text messages to the mobile phone of the responsible O&M engineer, selected by business and owner, through the alarm function of 139 email.

Alternatively, a third-party API can be called to send system alerts. Such an API only needs a mobile phone number and the text message content as variables to deliver the message immediately.
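Calling a third-party SMS API might look like the sketch below. The endpoint URL, the JSON field names `phone` and `content`, and the phone number are placeholders; a real provider defines its own URL, parameters, and authentication. The request is only built here, not sent.

```python
import json
from urllib import request


def build_sms_request(api_url, phone, content):
    """Build (but do not send) a POST request carrying the two required variables."""
    payload = json.dumps({"phone": phone, "content": content}).encode("utf-8")
    return request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and number, for illustration only.
req = build_sms_request(
    "https://sms.example.invalid/send", "13800000000", "[CRIT] web-01 ping failed"
)
# Sending would be: request.urlopen(req, timeout=5)
```

Keeping request construction separate from sending makes it easy to test the message content without contacting the provider.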

6. Receive summary reports by email

Reading the summary report takes only two or three minutes and gives the overall status of websites and servers.
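A summary email of this kind could be assembled as in the following sketch. The result format (a list of target and status pairs), the function name `build_summary_email`, the subject line, and the addresses are all assumptions made for illustration; actual delivery (for example through `smtplib`) is omitted.

```python
from collections import Counter
from email.message import EmailMessage


def build_summary_email(results, sender, recipient):
    """Summarize (target, status) pairs into a short plain-text email."""
    counts = Counter(status for _, status in results)
    lines = [f"Total checks: {len(results)}"]
    lines += [f"{status}: {counts[status]}" for status in sorted(counts)]
    failed = [target for target, status in results if status != "ok"]
    if failed:
        lines.append("Needs attention: " + ", ".join(failed))

    msg = EmailMessage()
    msg["Subject"] = "Monitoring summary"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines))
    return msg


# Illustrative data only.
results = [("web-01", "ok"), ("db-01", "ok"), ("stream-01", "failed")]
msg = build_summary_email(results, "monitor@example.invalid", "ops@example.invalid")
```

Listing the failing targets at the end keeps the report short enough to read in the two or three minutes the text mentions.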

7. Monitoring Management Standards

The platform monitors and manages network running status, system service quality, and fault alarms in real time.

8. Rich data report analysis functions

Based on the functions above, the system generates standard-format reports according to work requirements and can produce or adjust reports by condition, meeting the needs of IT system management and auditing.

IV. Platform Monitoring Objects

No.  Type             Monitoring Scope        Remarks
1    Network          Switches, routers, F5   Monitor network device performance indicators; alarm when an indicator exceeds its limit
2    Host             Linux and Windows       Monitor server performance indicators; alarm when an indicator exceeds its limit
3    Middleware       Nginx and Tomcat        Monitor middleware performance indicators; alarm when an indicator exceeds its limit
4    Streaming Media  Wowza, Nginx            Monitor streaming media performance indicators; alarm when an indicator exceeds its limit
5    Database         MySQL                   Monitor database performance indicators; alarm when an indicator exceeds its limit
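The monitoring objects listed above could be held in a simple registry so that checks can be looked up by type. This is a sketch only; the dictionary layout and the helper `types_for` are hypothetical, while the types and products are taken from the list.

```python
# Monitoring objects from the list above, keyed by type.
MONITORED_OBJECTS = {
    "Network": ["Switch", "Router", "F5"],
    "Host": ["Linux", "Windows"],
    "Middleware": ["Nginx", "Tomcat"],
    "Streaming Media": ["Wowza", "Nginx"],
    "Database": ["MySQL"],
}


def types_for(product):
    """Return every monitoring type under which a product appears."""
    return [
        obj_type
        for obj_type, products in MONITORED_OBJECTS.items()
        if product in products
    ]
```

For instance, `types_for("Nginx")` returns both "Middleware" and "Streaming Media", reflecting that Nginx is monitored in two roles.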

V. Platform Architecture Design

1. Logical topology of the platform architecture

The platform design architecture is shown in Figure 5.1.

The platform monitors devices through unified monitoring and centralized display. The monitoring server collects information from engines deployed on the monitored objects; the report server filters, processes, and sorts that information; and a unified portal displays the results and sends SMS alarms.
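The data flow just described (engines, monitoring server, report server, portal) can be sketched in a few lines. Everything here is an assumption for illustration: engines are modeled as plain callables, samples as dictionaries with a numeric `severity`, and the function names `collect` and `report` are hypothetical.

```python
def collect(engines):
    """Monitoring server: gather samples from every deployed engine."""
    samples = []
    for engine in engines:
        samples.extend(engine())
    return samples


def report(samples, min_severity):
    """Report server: drop low-severity samples, then sort most severe first."""
    kept = [s for s in samples if s["severity"] >= min_severity]
    return sorted(kept, key=lambda s: s["severity"], reverse=True)


# Two stand-in engines returning illustrative samples.
def engine_a():
    return [{"target": "web-01", "severity": 1}]


def engine_b():
    return [{"target": "db-01", "severity": 3}, {"target": "sw-01", "severity": 0}]


alarms = report(collect([engine_a, engine_b]), min_severity=1)
```

The sorted list is what a portal would display and what an SMS sender would consume, starting with the most severe entry.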

2. Availability principles

Deploying the monitoring and management software should not require major changes to the original system structure or security policies, and its impact on the performance of the original systems should be minimized. It must not affect or interfere with the normal operation of production systems, and it should consume as few system and network resources as possible.
