IT system monitoring solution design

Last Update: 2014-05-24 Source: Internet

Author: User


I. Background of Platform Construction

As the business has grown, the company runs more and more business systems, and the number of online systems keeps increasing. In the past, system faults, potential risks, and security risks were found by manual inspection. That approach has become less and less efficient, while the workload and pressure on O&M personnel keep rising. To detect faults sooner, to make system maintenance more professional, standardized, and systematic, and to free O&M personnel from repetitive work for more meaningful tasks, we urgently need new monitoring methods and tools that help O&M engineers solve these problems.

II. Construction Objectives

To keep the company's own software platform stable, the monitoring platform automatically monitors the online platform, with the monitoring granularity and the monitored objects set sensibly. Potential problems should be found and eliminated as early as possible, which improves the overall integration capability of the IT support departments and the operating quality of the delivered systems.

The ultimate goals of building the monitoring platform are as follows:

1. Promptly discover potential problems and reduce passive maintenance;

2. Provide an intuitive reference for platform performance optimization;

3. Improve the professionalism and standardization of system maintenance;

4. Improve user experience and reduce service downtime.

III. Functions and content of the monitoring platform

1. Centralized Monitoring and Management

The platform collects and processes alarm information from all systems and analyzes root causes, helping O&M personnel identify the cause of a fault and quickly locate the fault point. It manages networks, hosts, databases, and applications, covering system software and hardware configuration information, system performance indicators, fault alarms, and log management.

Specific implementation: local shell scripts and an information collection engine archive logs, so that system information and exception logs are stored centrally on the monitoring platform for analysis, alarm generation, and report generation.
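The collection step can be pictured with a minimal sketch. It assumes that "exception logs" means lines containing markers such as ERROR, FATAL, or Exception; the pattern, the function name `extract_exceptions`, and the record layout are illustrative assumptions, not part of the original design.

```python
import re
from typing import Iterable, List, Tuple

# Assumed markers of an abnormal log line; tune these per application.
EXCEPTION_PATTERN = re.compile(r"ERROR|FATAL|Exception")


def extract_exceptions(host: str, lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (host, line_number, text) for every line that looks abnormal."""
    return [
        (host, number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if EXCEPTION_PATTERN.search(line)
    ]


if __name__ == "__main__":
    sample = ["INFO service started\n", "ERROR disk full\n"]
    for record in extract_exceptions("web1", sample):
        print(record)
```

Each record carries the host name, so the central platform can store records from many machines in one place and still tell them apart.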

2. Unified monitoring management interface and multiple alarm methods

The real-time status of networks, systems, databases, and applications is shown centrally in a well-organized graphical interface, and alarms are delivered by SMS, email, or page.

3. Customize alarm priority policies

Generally, a monitored result is either success or failure: for example, a Ping failure, a webpage access error, or a Socket connection failure. A failure of this kind is the highest-priority alarm. You can also monitor the returned latency and content, such as the Ping round-trip latency, the time taken to load a webpage, and the content retrieved from the webpage, and use these results to define custom alarm conditions. For example, Ping latency is normally between 10 and 30 ms. When the latency exceeds the threshold you set, the network or the server may be in trouble and responding slowly, so check whether traffic is too high or the server CPU load is too high.
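The priority policy above can be sketched as a small classification function. This is only an illustration: the names `Priority`, `CheckResult`, and `classify` are hypothetical, and the latency threshold is passed in as a parameter because the text does not give a concrete value.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class CheckResult:
    name: str                           # e.g. "ping", "http", "socket"
    success: bool                       # did the check succeed at all?
    latency_ms: Optional[float] = None  # response time, if it was measured


def classify(result: CheckResult, warn_latency_ms: float) -> Priority:
    """Map one check result to an alarm priority."""
    if not result.success:
        # Outright failure is always the highest-priority alarm.
        return Priority.CRITICAL
    if result.latency_ms is not None and result.latency_ms > warn_latency_ms:
        # The check succeeded, but slower than the chosen threshold.
        return Priority.WARNING
    return Priority.OK


if __name__ == "__main__":
    # 100 ms is an arbitrary example threshold, not a value from the article.
    print(classify(CheckResult("ping", True, 250.0), warn_latency_ms=100.0).name)
```

Keeping the threshold as an argument lets each monitored item carry its own limit, for instance a tighter one for Ping than for a full webpage load.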

4. Customize alarm information content standards

When a server or application fails, an alarm can carry many pieces of information, such as the name of the affected service, the IP address of the server, the monitored line, the error level, the error message, and the time of occurrence. Defining the alarm content and its format in advance makes alarms consistent and readable. This matters most when alerts are received as text messages: a text message can contain at most 70 characters, and it is hard to describe a fault fully within 70 characters unless the content standard is defined beforehand. For example, "the live video broadcast server 10.0.211.65 failed to monitor the telecommunications line at on January 18," makes the cause of the failure clear at a glance.
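One way to enforce such a content standard is a single formatting function with a fixed field order and a hard length cap. The sketch below is an assumption about how that could look; the field order, the `format_alarm` name, and the sample time of day are illustrative (the example in the text omits the time).

```python
from datetime import datetime

SMS_MAX_CHARS = 70  # per-message limit stated in the text


def format_alarm(service: str, ip: str, line: str, level: str,
                 error: str, when: datetime) -> str:
    """Build one alarm line in a fixed field order, trimmed to the SMS limit."""
    text = f"[{level}] {service} {ip} {line}: {error} @ {when:%m-%d %H:%M}"
    if len(text) > SMS_MAX_CHARS:
        # Cut the tail and mark the cut so the reader knows text is missing.
        text = text[:SMS_MAX_CHARS - 3] + "..."
    return text


if __name__ == "__main__":
    # The time of day here is a placeholder chosen for the example.
    print(format_alarm("live-video", "10.0.211.65", "telecom", "CRIT",
                       "monitoring failed", datetime(2014, 1, 18, 9, 30)))
```

Putting the level, service, and IP first means the most useful fields survive even when a long error message is truncated.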

5. SMS warning

At present, the platform can automatically send alarm text messages to the mobile phone of the responsible O&M engineer through the 139 email alarm function, routed by business and owner.

The platform can also call a third-party API to send system alerts. Such an API only needs a mobile phone number and the text message content as variables to deliver the message immediately.
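A call to such an API might look like the following sketch. The endpoint URL and the JSON field names `phone` and `content` are hypothetical placeholders; a real SMS gateway defines its own URL, parameters, and authentication.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the real gateway's URL and field names.
SMS_API_URL = "https://sms.example.com/send"


def build_sms_request(phone: str, content: str) -> urllib.request.Request:
    """Prepare a POST carrying the two variables the text mentions."""
    payload = json.dumps({"phone": phone, "content": content}).encode("utf-8")
    return urllib.request.Request(
        SMS_API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_sms(phone: str, content: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_sms_request(phone, content)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending makes the payload easy to inspect without touching the network.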

6. Receive summary reports by email

With the summary report, it takes only two or three minutes to get the overall status of the website and servers.
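As an illustration of what a summary report could contain, the sketch below builds a plain-text email that lists problems first. The status labels, the function name `build_summary_email`, and the report layout are assumptions; the article does not describe the actual report format.

```python
from email.message import EmailMessage
from typing import Dict


def build_summary_email(statuses: Dict[str, str], sender: str,
                        recipient: str) -> EmailMessage:
    """Build a plain-text summary with problem items listed before healthy ones."""
    ok_count = sum(1 for status in statuses.values() if status == "OK")
    lines = [f"{ok_count}/{len(statuses)} checks OK", ""]
    # Sort so that anything not "OK" comes first, then alphabetically by name.
    for name, status in sorted(statuses.items(),
                               key=lambda item: (item[1] == "OK", item[0])):
        lines.append(f"{status:<8} {name}")

    message = EmailMessage()
    message["Subject"] = f"Monitoring summary: {len(statuses) - ok_count} problem(s)"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(lines))
    return message
```

The resulting message could be handed to `smtplib.SMTP.send_message` for delivery; the mail server details depend on the environment and are left out here.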

7. Monitoring Management Standards

Monitors and manages network running status, system service quality, and fault alarms in real time.

8. Rich data report analysis functions

Based on the functions above, the system can generate reports in standard formats according to work requirements, and can generate and adjust reports by condition to meet the needs of IT system management and auditing.

IV. Platform monitoring objects

| No. | Type | Monitoring Scope | Remarks |
|-----|------|------------------|---------|
| 1 | Network | Switches, routers, F5 | Monitor network device performance indicators; alarm when an indicator exceeds its limit |
| 2 | Host | Linux and Windows | Monitor server performance indicators; alarm when an indicator exceeds its limit |
| 3 | Middleware | Nginx and Tomcat | Monitor middleware performance indicators; alarm when an indicator exceeds its limit |
| 4 | Streaming Media | Wowza, Nginx | Monitor streaming media performance indicators; alarm when an indicator exceeds its limit |
| 5 | Database | MySQL | Monitor database performance indicators; alarm when an indicator exceeds its limit |

V. Platform Architecture Design

1. Logical topology of the platform architecture

The platform design architecture is shown in Figure 5.1.

The platform monitors devices through unified monitoring and centralized display. The monitoring server collects information through engines deployed on the monitored objects; the Report Server filters, processes, and sorts that information; and a unified portal displays the results and sends SMS alarms.
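The data flow described here (engines, monitoring server, Report Server) can be sketched in a few lines. This is a simplified model under stated assumptions: each engine is represented as a callable returning metric values, and the Report Server's filtering is reduced to "keep values above a limit, worst first". The names `Sample`, `collect`, and `report` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    host: str
    metric: str
    value: float


def collect(engines: Dict[str, Callable[[], Dict[str, float]]]) -> List[Sample]:
    """Monitoring server: poll the engine deployed on each monitored object."""
    return [
        Sample(host, metric, value)
        for host, engine in engines.items()
        for metric, value in engine().items()
    ]


def report(samples: List[Sample], limits: Dict[str, float]) -> List[Sample]:
    """Report server: keep samples over their limit, largest overshoot first."""
    over = [s for s in samples
            if s.metric in limits and s.value > limits[s.metric]]
    return sorted(over, key=lambda s: s.value - limits[s.metric], reverse=True)


if __name__ == "__main__":
    engines = {
        "web1": lambda: {"cpu": 95.0, "mem": 40.0},
        "db1": lambda: {"cpu": 85.0},
    }
    for alarm in report(collect(engines), {"cpu": 80.0, "mem": 90.0}):
        print(alarm)
```

The list returned by `report` is what the unified portal would display and what the SMS alarm step would consume.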

2. Availability principles

Deploying the monitoring and management software should not require major changes to the original system structure or security policies, and should keep the impact on the original system's performance to a minimum. It must not affect or interfere with the normal operation of the production system, and it should consume as few system and network resources as possible.

