O&M Monitoring: Selection and Design Ideas

Source: Internet
Author: User
Tags rrdtool grafana

There is a saying in the O&M industry: "No monitoring, no O&M." That is no exaggeration. Monitoring is often called the "third eye": without it, both basic O&M and business O&M are working blind. Monitoring is therefore the foundation of O&M. Now that DevOps is so popular, it is all the more necessary to back up your work with monitoring data. Some people say that O&M always ends up taking the blame; with monitoring in place and enough data in hand, does O&M still need to take it? As an O&M engineer, building a monitoring system is your first job.

Before we start, let's take a global view of how O&M monitoring tools are selected and how an O&M monitoring platform is built. If you are new to O&M, this column is well suited to you. If you have worked in O&M for many years, it can still help you broaden your thinking and perspective.

I. Common O&M monitoring tools

There are many O&M monitoring tools available today. Which is good, which is not, and which suits you? You can only answer that once you know what each tool's features are, so let's start there.

1. Cacti

Cacti is a set of graphical analysis tools for network traffic monitoring, built on PHP, MySQL, SNMP, and rrdtool.
Simply put, Cacti is a PHP program. It uses SNMP to obtain information from remote network devices (in practice, through the snmpget and snmpwalk commands of the net-snmp package), plots it with rrdtool, and displays the graphs through PHP pages. We can use it to show the status or performance trend of a monitored object over a period of time.
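
As a rough illustration of what happens between polling and plotting: traffic figures obtained over SNMP are usually cumulative counters (the IF-MIB object ifInOctets is a typical one, although the article does not name it), so a graphing tool has to turn two successive samples into a rate. The sketch below shows that step in Python. The function names are hypothetical, the 32-bit wrap handling is an assumption about the counter type, and this is not Cacti's or rrdtool's actual code.

```python
def counter_rate(prev_value, curr_value, interval_seconds, counter_bits=32):
    """Per-second rate between two samples of an ever-increasing counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter wrapped past its maximum between the two polls
        # (assumes at most one wrap and no device reboot).
        delta += 2 ** counter_bits
    return delta / interval_seconds


def octets_to_bits_per_second(prev_octets, curr_octets, interval_seconds):
    """Convert two octet-counter samples into a bits-per-second rate."""
    return counter_rate(prev_octets, curr_octets, interval_seconds) * 8
```

With a 300-second polling interval, two samples of 0 and 37500 octets correspond to 1000 bits per second.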

Cacti is a very old monitoring tool. It is really a traffic monitoring tool, best suited to precise traffic monitoring. However, it has many disadvantages: it is hard to figure out, it does not support distributed monitoring, and it has no alarm function, so its user base keeps shrinking.

2. Nagios

Nagios is a free, open-source network monitoring tool that can effectively monitor the status of Windows, Linux, and UNIX hosts, network devices such as switches and routers, and printers. When a system or service becomes abnormal, it immediately sends an email or text message alert to the O&M staff; once the status recovers, it sends a recovery notification by email or text message.

Nagios's main features are monitoring and alarming. Its strongest feature is the alarm function, which supports multiple alarm methods. Its disadvantages are that it has no powerful data collection mechanism and its graphing is basic. As the number of monitored hosts grows, adding hosts also becomes troublesome: all configuration lives in text files, with no web-based management or configuration, which is error-prone and hard to maintain.
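
To make the monitoring-and-alarm model concrete: Nagios checks are carried out by small plug-in programs that report state through their exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and one line of output. The following is a minimal sketch of such a check in Python. The disk check itself, the thresholds, and the function names are illustrative assumptions, not something taken from the article.

```python
import shutil

# Exit codes used by the Nagios plug-in convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (status code, message) pair."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def run_check(path="/"):
    """Measure disk usage for path, print one status line, return the code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, message = evaluate_disk(usage.used / usage.total * 100)
    print(message)
    return code

# A real plug-in script would end with: sys.exit(run_check())
```

Because each check is just a program with an exit code, adding a new kind of check is simple; the maintenance burden described above comes from wiring every host and service into the text configuration by hand.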

3. Zabbix

Zabbix is an enterprise-level open-source solution that provides distributed system monitoring and network monitoring through a web interface. Zabbix can monitor a wide range of network parameters to help keep server systems running safely, and it provides a powerful notification mechanism so that O&M staff can quickly locate and solve problems.

Zabbix consists of two parts: the Zabbix server and the optional Zabbix agent. The Zabbix server monitors remote server and network status and collects data through SNMP, the Zabbix agent, ping, port checks, and other methods. It can run on Linux, Solaris, HP-UX, AIX, FreeBSD, OpenBSD, OS X, and other platforms.

Zabbix solves both Cacti's lack of alarms and Nagios's lack of web-based configuration, and it also supports distributed deployment. This made it popular quickly, and Zabbix has become the most popular O&M monitoring platform among small and medium-sized enterprises.

Of course, Zabbix also has shortcomings. It consumes a lot of resources, and when many hosts are monitored, monitoring timeouts and alarm timeouts may occur. There are several ways to address this, such as improving hardware performance or changing the Zabbix monitoring mode.

4. Ganglia

Ganglia is a scalable distributed monitoring system designed for HPC clusters. It monitors and displays status information for the nodes in a cluster. The gmond daemon running on each node collects data on CPU, memory, disk utilization, I/O load, and network traffic; the data is then aggregated by the gmetad daemon and stored with rrdtool, and the historical data is finally presented as curves on PHP pages.

The Ganglia monitoring system consists of gmond, gmetad, and webfrontend. gmond is installed on each client whose data is to be collected, gmetad is the server, and webfrontend is a PHP web UI. Ganglia collects data through gmond and displays it in webfrontend.
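
The aggregation step between the per-node collectors and the central server can be pictured with a small Python sketch. It only illustrates the idea of rolling per-node readings up into cluster-level figures; the data layout and the function name are assumptions and do not reflect Ganglia's actual data format or code.

```python
from statistics import mean


def summarize_cluster(node_reports):
    """Roll per-node metric readings up into per-metric cluster summaries.

    node_reports maps a hostname to a dict of metric name -> numeric value.
    """
    values_by_metric = {}
    for metrics in node_reports.values():
        for name, value in metrics.items():
            values_by_metric.setdefault(name, []).append(value)
    return {
        name: {"hosts": len(values), "sum": sum(values), "mean": mean(values)}
        for name, values in values_by_metric.items()
    }


# Hypothetical readings from two nodes.
reports = {
    "node1": {"cpu_user": 10.0, "mem_free": 2048.0},
    "node2": {"cpu_user": 30.0},
}
summary = summarize_cluster(reports)
```

Here summary["cpu_user"] covers two hosts while summary["mem_free"] covers one, since only node1 reported that metric.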

Ganglia's main strength is collecting data and presenting it centrally. It can aggregate all data into a single interface for display, and it supports multiple data interfaces, so monitoring can be extended in many ways. Most importantly, Ganglia's data collection is very lightweight: the gmond program on the client consumes almost no system resources, which makes up for Zabbix's heavy resource consumption.

Finally, Ganglia is particularly convenient for monitoring big data platforms. A single configuration file is enough to enable Ganglia monitoring of Hadoop and Spark, with nearly a thousand monitoring metrics, which fully meets the monitoring needs of a big data platform.

5. Centreon

Centreon is a powerful distributed IT monitoring system that monitors networks, operating systems, and applications through third-party components. First, it is open source, so we can use it for free. Second, it uses a Nagios-like monitoring engine as its underlying monitoring software; the engine regularly writes monitoring data to a database through the ndoutils module, and Centreon reads that data from the database in real time and displays it in a web interface. Finally, Centreon's web interface lets us manage and configure hosts with one click. In other words, Centreon can be seen as a configuration management tool for Nagios: the host and service configuration that Nagios requires you to do by hand can be completed easily through Centreon's web interface.

Centreon's strengths are one-click configuration and management and support for distributed monitoring. Everything Nagios can do can be achieved through Centreon. Centreon can also be integrated with Ganglia: it takes the data collected by Ganglia and uses it for automatic host monitoring and alarming.

6. Prometheus

Prometheus is an open-source system monitoring and alarm framework. It suits both machine-centric monitoring of servers and other hardware metrics and monitoring of highly dynamic service-oriented architectures. For the now-popular microservices, Prometheus's multi-dimensional data collection and its query language for filtering data are very powerful. Prometheus is designed for reliability: when a service fails, it lets you quickly locate and diagnose the problem.
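
"Multi-dimensional" here means that every sample carries a metric name plus a set of key/value labels, and queries filter on those labels. The Python sketch below imitates that model with plain dictionaries. It is only a toy illustration of the idea, loosely corresponding to a PromQL selector such as http_requests_total{status="500"}; the metric, labels, and the select function are made up for the example.

```python
def select(samples, metric_name, **label_matchers):
    """Return the samples of one metric whose labels equal all the matchers."""
    return [
        s for s in samples
        if s["name"] == metric_name
        and all(s["labels"].get(k) == v for k, v in label_matchers.items())
    ]


# Hypothetical samples: one metric, sliced by service and status labels.
samples = [
    {"name": "http_requests_total", "labels": {"service": "cart", "status": "200"}, "value": 120.0},
    {"name": "http_requests_total", "labels": {"service": "cart", "status": "500"}, "value": 3.0},
    {"name": "http_requests_total", "labels": {"service": "pay", "status": "500"}, "value": 1.0},
]

errors = select(samples, "http_requests_total", status="500")
cart_errors = select(samples, "http_requests_total", service="cart", status="500")
```

The same metric can thus be viewed per service, per status code, or both, without defining a separate metric for each combination.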

7. Grafana

Grafana is an open-source metrics analysis and visualization suite. In layman's terms, Grafana is a visualization platform that displays our monitoring data through a variety of attractive dashboards. If you think Zabbix's graphing interface is not good enough, you can use Grafana for display instead. Grafana also supports many data sources, including Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch, and KairosDB.

8. Comparison chart

II. Unified O&M monitoring platform design ideas

Building an O&M monitoring platform is not simply a matter of downloading an open-source tool and setting it up. It requires integration and secondary development based on your monitoring environment and its characteristics, so that the platform fully matches your own needs. The following describes how to design an O&M monitoring platform.

To build an intelligent O&M monitoring platform, you must focus on operation monitoring and fault alarms, and bring the network, hardware, software, and database resources of all business systems into one unified platform. The platform should hide the differences between management tools and between data collection methods, and provide unified management, normalization, processing, display, login, and permission control for the different data sources, ultimately achieving standardized, automated, and intelligent O&M management.

The intelligent O&M monitoring platform can be divided into six layers and three modules. From bottom to top, the layers are:

Data collection layer: Located at the bottom, it collects network data, business system data, database data, and operating system data, then normalizes and stores the collected data.
Data presentation layer: Located on the second layer, it is a web interface that displays the data obtained from the data collection layer in a unified way, as line graphs, bar charts, pie charts, and so on. The graphs help O&M staff understand the running status and trend of a host or network over a period of time, and serve as a basis for troubleshooting.
Data extraction layer: Located on the third layer, it normalizes and filters the data obtained from the data collection layer and passes the required data to the monitoring and alarm module. It is the link between data collection and the monitoring and alarm module.
Alarm rule configuration layer: Located on the fourth layer, it sets alarm rules, alarm thresholds, alarm contacts, and alarm methods based on the data obtained from the third layer.
Alarm event generation layer: Located on the fifth layer, it records alarm events in real time, stores alarm results in a database for later use, and produces analysis reports for calculating the failure rate and fault trend over a period of time.
User display management layer: Located at the top, it is a web interface that displays monitoring statistics and alarm results in a unified way and manages multiple users and permissions, providing unified user and permission control.

Functionally, these six layers are grouped into three modules: the data collection module, the data extraction module, and the monitoring and alarm module. Each module provides the following functions:

Data collection module: Collects basic data and displays it as graphs. Data can be collected in many ways, including SNMP, an agent, or custom scripts. Common data collection tools include Cacti and Ganglia.
Data extraction module: Filters the collected data and passes the required data from the data collection module to the monitoring and alarm module. Data can be extracted through the interfaces provided by the data collection module or through custom scripts.
Monitoring and alarm module: Sets monitoring scripts, alarm rules, alarm thresholds, and alarm contacts, displays alarm results centrally, and records alarm history. Common monitoring and alarm tools include Nagios and Centreon.
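
The division of labor between the three modules can be sketched as three small functions chained together. This is a deliberately simplified Python illustration under assumed names and data shapes (Sample, collect, extract, evaluate); a real platform would put tools such as Ganglia and Nagios behind these steps rather than in-process function calls.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    host: str
    metric: str
    value: float


def collect(raw_readings):
    """Data collection module: normalize raw (host, metric, value) readings."""
    return [Sample(host, metric, float(value)) for host, metric, value in raw_readings]


def extract(samples, wanted_metrics):
    """Data extraction module: pass on only the metrics the alarm side needs."""
    return [s for s in samples if s.metric in wanted_metrics]


def evaluate(samples, thresholds):
    """Monitoring and alarm module: compare each sample with its threshold."""
    alarms = []
    for s in samples:
        limit = thresholds.get(s.metric)
        if limit is not None and s.value > limit:
            alarms.append(f"{s.host}: {s.metric}={s.value} exceeds {limit}")
    return alarms


# Hypothetical readings; note the value that arrives as a string.
raw = [("web1", "load1", "7.5"), ("web1", "disk_used_pct", 40), ("db1", "load1", 1.2)]
alarms = evaluate(extract(collect(raw), {"load1"}), {"load1": 5.0})
```

Keeping extraction as its own step is what lets the collection side and the alarm side be different tools: only the extraction code has to know both formats.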

With the general design of the O&M monitoring platform in mind, let's look in detail at how to implement such an intelligent O&M monitoring system with software.

The platform topology that follows from this design has three main components: the data collection module, the monitoring and alarm module, and the data extraction module, which carries data between the other two. The data collection module can consist of one or more data collection servers. Each data collection server collects data metrics directly from the server group, normalizes their format, and stores them locally. The monitoring and alarm module obtains the data it needs from the data collection servers through the data extraction module, then applies the configured alarm thresholds and alarm contacts to generate real-time alarms. SMS and email alarms are supported, and the alarm methods can be extended with plug-ins or custom scripts. With that, a complete monitoring and alarm platform is essentially in place.

III. Enterprise O&M monitoring platform selection

1. Zabbix as the monitoring platform for small and medium-sized enterprises

Zabbix is a comprehensive O&M monitoring platform that integrates data collection, data display, data extraction, monitoring and alarm configuration, and user display.

Zabbix is quick to learn and powerful. It can be put to use quickly to meet the monitoring and alarm requirements of small and medium-sized enterprises, so it is the preferred O&M monitoring platform for them. However, when Zabbix monitors a large number of servers, problems such as inaccurate monitoring data and alarm timeouts may appear. This is because Zabbix places high demands on server performance: once the number of monitored servers exceeds 500, monitoring performance drops sharply. At that point you need a distributed monitoring deployment and a more powerful monitoring server.

In terms of reliability, if the Zabbix agent on a client fails, the data it would have collected is lost. The Zabbix server is also a single point of failure, so you may need to set up high availability (HA) for the Zabbix server to protect the data and keep monitoring available.

2. Ganglia + Centreon as the monitoring platform for large Internet enterprises

Combining open-source monitoring software with secondary development is the basic strategy large Internet enterprises use to build a monitoring platform. For complex monitoring that spans massive numbers of servers and multiple business systems, no single piece of software can meet all of an enterprise's monitoring requirements. Combining several open-source monitoring tools and extending them through secondary development is therefore where the monitoring platform ends up.

Ganglia is recommended because its client software uses very few system resources and has many extension plug-ins, which makes extending monitoring easy. It also integrates with Centreon, a professional web monitoring platform. We therefore recommend the Ganglia + Centreon combination for monitoring massive numbers of servers; together they cover data collection, data display, data extraction, monitoring and alarm configuration, and user display.

IV. Evolution of our O&M monitoring platform

This section is a summary of experience. Based on how our monitoring platform has evolved over the years, I have summarized the ideas and strategies for building a monitoring platform at different stages and with different numbers of machines.

1. Fewer than 100 machines

In this period, the number of machines is small and the monitoring requirements are simple. Monitoring is mainly used to report problems so they can be located and solved quickly. As a rough summary, the monitoring platform needs the following features:

(1) Easy to deploy and easy to use
(2) Stable operation without faults
(3) Alarms can be sent by email or SMS

Given these requirements, you can use popular open-source monitoring software such as Nagios, Cacti, Zabbix, or Ganglia. Popular open-source products are well documented and quick to pick up, and there is plenty of prior experience to draw on, so problems are easy to solve.

At first we chose Nagios, because it was the first to become popular. Later, because adding hosts and services was inconvenient, we switched to Zabbix. At this stage, Zabbix is probably the best choice.

2. Between 200 and 1000 machines

In this phase, as the number of machines grows, the monitoring requirements become more complex. Monitoring is still mainly used for notifications and alarms, to discover problems and prevent the same problems from recurring. To match the characteristics of this phase, we did the following work on the monitoring platform:

(1) Classified monitoring content: With more machines, there is more to monitor, so we classified monitoring by purpose into basic system monitoring data, network monitoring data, and business monitoring data.

(2) Full-coverage monitoring: All machines are monitored, covering both software and hardware. Hardware monitoring focuses on hardware performance and faults. Software monitoring, in addition to the basic monitoring data mentioned in item (1), adds business logic monitoring to cover business processes as fully as possible. A large number of custom checks reduce and eliminate recurring problems and keep the business running stably.

(3) Multiple alarm methods so that no alarm goes unhandled: All monitoring items are classified by importance and urgency, and notifications are sent by email, SMS, or phone. Each monitoring item is assigned to a specific person so that every alarm is handled by someone, and important services use continuous notification.
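
The routing described in item (3) can be expressed as a small lookup. The Python sketch below is one possible shape for it; the severity names, the channel mapping, the fallback owner, and the function name are all assumptions made for illustration, not the configuration we actually ran.

```python
# Hypothetical mapping from severity to notification channels.
CHANNELS_BY_SEVERITY = {
    "low": ["email"],
    "medium": ["email", "sms"],
    "high": ["email", "sms", "phone"],
}


def plan_notification(check_name, severity, owners, fallback_owner="on-call"):
    """Decide who is notified, through which channels, and whether to repeat."""
    if severity not in CHANNELS_BY_SEVERITY:
        raise ValueError(f"unknown severity: {severity}")
    return {
        # Every check resolves to a person, so no alarm is left unowned.
        "owner": owners.get(check_name, fallback_owner),
        "channels": list(CHANNELS_BY_SEVERITY[severity]),
        # Important services keep notifying until someone acknowledges.
        "repeat_until_acknowledged": severity == "high",
    }


owners = {"payment_api": "alice"}
plan = plan_notification("payment_api", "high", owners)
```

The fallback owner is the design point worth noting: a check that nobody has claimed still lands with a named recipient instead of being silently dropped.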

The difficulty in this phase is handling alarm information. As more machines and services are monitored, the volume of alarms grows explosively, and receiving thousands of alarm emails a day becomes common. With that many emails, alarms lose their meaning, because we cannot read every one, and many of them are unnecessary. For example, sending an alert email whenever the system load rises briefly is completely unnecessary.

In this phase, therefore, we mainly tuned the monitoring and alarm policies to minimize unnecessary alarm emails. For system load monitoring, for example, you can alarm only after the load has exceeded the threshold for several consecutive checks, that is, after the condition has persisted for some time. After this tuning, the volume of alarms dropped sharply, to at most a few dozen messages a day, a volume at which no alert gets overlooked.
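
The "several consecutive checks" policy is easy to state in code. Below is a minimal Python sketch of it; the class name and parameters are hypothetical, and monitoring tools generally offer an equivalent setting of their own, so this only shows the logic.

```python
class ConsecutiveBreachAlarm:
    """Fire only when a value stays above a threshold for N checks in a row."""

    def __init__(self, threshold, required_breaches):
        if required_breaches < 1:
            raise ValueError("required_breaches must be at least 1")
        self.threshold = threshold
        self.required_breaches = required_breaches
        self._streak = 0

    def observe(self, value):
        """Record one check result; return True when the alarm should fire."""
        if value > self.threshold:
            self._streak += 1
        else:
            # Any normal reading resets the streak, so brief spikes are ignored.
            self._streak = 0
        # Fire exactly once, at the moment the streak reaches the required length.
        return self._streak == self.required_breaches


alarm = ConsecutiveBreachAlarm(threshold=5.0, required_breaches=3)
fired = [alarm.observe(load) for load in [6, 2, 6, 7, 8, 9, 1]]
```

In this run the isolated spike at the start never fires, and the sustained run of 6, 7, 8 fires once on its third reading; the following 9 does not fire again, which keeps a long incident from producing one email per check.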

3. More than 1000 machines

As the business kept growing, so did the demand for servers. Once we had more than 1000 servers, the monitoring situation changed, and many strange problems appeared. Some of them are:

(1) Untimely alarms

Once we had more than 1000 servers, our Zabbix often went on strike. Sometimes monitoring data could not be displayed in time, and sometimes alarms were delayed. Alarm delay is the worst of these: online services must be available 24x7, yet the monitoring system would report a fault one or several hours after it had occurred. What is the point of monitoring then? Timeliness is the first requirement of a monitoring system, so this had to be solved.

How did we solve it? Besides optimizing Zabbix itself, for example by deploying distributed proxies and enabling Zabbix active mode, we also scaled out and optimized data collection: we stopped using Zabbix to collect basic data and used Ganglia instead, while Zabbix remained in use for business data. Sharing the data collection load this way greatly reduced the pressure on Zabbix, and the accuracy and timeliness of data collection returned to normal.

(2) Single point of failure (SPOF) in the alarm system

Because of the large number of servers and the rapid growth of collected data, the monitoring server once went down unexpectedly, and it took an hour to get the system running again. That hour was a nightmare for O&M.

After that outage, we deployed distributed, highly available monitoring servers to avoid a SPOF, and we back up monitoring data remotely. When the monitoring server fails, the standby monitoring system takes over automatically, and monitoring data is saved and synchronized automatically.
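
The automatic switch to a standby can be reduced to one decision: which monitoring server has been heard from recently enough to be trusted. The Python sketch below shows that decision only, using heartbeat timestamps; the function name, the timeout, and the server names are assumptions, and a real deployment would rely on established HA tooling rather than this code.

```python
def choose_active(last_heartbeat, now, timeout_seconds, preference):
    """Return the first preferred server whose heartbeat is recent enough.

    last_heartbeat maps server name -> timestamp (seconds) of its last heartbeat.
    Returns None when no server has reported within the timeout.
    """
    for server in preference:
        seen = last_heartbeat.get(server)
        if seen is not None and now - seen <= timeout_seconds:
            return server
    return None


# Hypothetical situation: the primary has been silent for 60 seconds.
active = choose_active(
    {"primary": 100.0, "standby": 158.0},
    now=160.0,
    timeout_seconds=30.0,
    preference=["primary", "standby"],
)
```

With these timestamps the standby is chosen. Listing the primary first in the preference order means control returns to it once its heartbeats resume.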

(3) Alarm requirements the monitoring system cannot meet

As the business grew, customers became more demanding about business stability. To keep the business systems running stably, business logic monitoring was required: monitoring the operating logic of the business system and raising an alarm when that logic fails. There is obviously no ready-made tool or code for monitoring business logic; it can only be developed against the business logic itself. By improving the business logic interfaces and having them report data, and through several rounds of secondary development on Zabbix, we implemented business logic monitoring.

Finally, the O&M monitoring platform is an indispensable part of O&M work. How do you build one that suits you? Every company has different requirements and every O&M team has different pain points, but whatever the requirements, there is a lot that O&M can do. On the road to O&M monitoring, let's move forward together.

