With the vigorous development of DevOps, cloud computing, microservices, and containers, what kind of architecture and technical solutions are best suited to such large and complex monitoring needs?
With the rapid development of cloud computing and the Internet, many applications must span different network terminals and integrate extensively with third-party services (such as payment, login, and navigation), so IT system architecture is becoming more and more complex. Rapidly iterating product requirements and the demand for a good user experience require IT operations managers to keep core business stable and available at all times, and the following pain points in enterprise operations urgently need to be resolved:
1. Operations should be business-oriented: care not only about the running status of individual IT resources, but also about the health of the entire business system.
2. If the enterprise uses a large number of APIs and modular applications, the performance changes and metrics of each interface need attention.
3. Operations supervisors and enterprise management particularly need a large-screen monitoring dashboard.
4. Operations staff need to review weekly and monthly trend reports, but exporting data from traditional operations tools is difficult.
5. Faulty nodes need to be located quickly to reduce the losses caused by business interruption.
With the gradual adoption and vigorous development of concepts such as DevOps, cloud computing, microservices, and containers, there are more and more machines, more and more applications, smaller and smaller services, and increasingly diverse application runtime environments; containers, virtual machines, and physical machines all differ from one another. Faced with hundreds or thousands of virtual machines and containers and dozens of kinds of objects to monitor, can the existing monitoring system cope? How can metric data from containers, virtual machines, physical machines, network equipment, and middleware be collected, analyzed, and alerted on quickly and completely with a single solution? What kind of architecture and technical solutions are best suited to such large and complex monitoring needs?
1. Analysis of the unified monitoring platform architecture
To review first, the unified monitoring platform consists of seven roles: monitoring source, data collection, data storage, data analysis, data display, early warning center, and CMDB (Enterprise Software and Hardware Asset Management).
Monitoring source:
By level, monitoring sources can be roughly divided into three layers: the business application layer, the middleware layer, and the infrastructure layer. The business application layer mainly includes application software, the enterprise message bus, and so on. The middleware layer includes various system software such as databases, caches, and configuration centers. The infrastructure layer mainly includes physical machines, virtual machines, containers, network devices, storage devices, and so on.
Data collection:
With so many data sources, data collection is no easy task. By type of metric, collection can be divided into business metrics, application metrics, system software metrics, and system metrics. Application monitoring metrics include availability, exceptions, throughput, response time, current number of waiting requests, resource occupancy rate, request volume, log size, performance, queue depth, number of threads, service calls, access volume, service availability, and so on. Business monitoring metrics include large-value transactions, transaction regions, transaction details, number of requests, response time, number of responses, and so on. System monitoring metrics include CPU load, memory load, disk load, network I/O, disk I/O, number of TCP connections, process count, and so on.
By collection method, it can usually be divided into interface collection, client agent collection, and active polling through network protocols (HTTP, SNMP, etc.).
Data storage:
The collected data is generally stored in a file system (such as HDFS), an index system (such as Elasticsearch), a metrics database (such as InfluxDB), a message queue (such as Kafka, for temporary storage or buffering of messages), or a relational database (such as MySQL).
Data analysis:
The collected data then has to be processed. There are two types of processing: real-time processing and batch processing. The techniques include Map/Reduce computation, full-text log retrieval, stream computing, metric computation, and so on. The key is to choose the computation method that fits each scenario's requirements.
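The difference between the two processing types can be illustrated with a minimal sketch. The function and class names and the sample values are assumptions made for illustration; a real platform would use a batch or streaming framework instead of in-memory lists.

```python
from collections import deque
from statistics import mean


def batch_average(samples):
    """Batch processing: compute one result over a complete, stored data set."""
    return mean(samples)


class SlidingWindowAverage:
    """Streaming processing: update the result as each new sample arrives."""

    def __init__(self, size):
        # A bounded deque drops the oldest sample automatically.
        self.window = deque(maxlen=size)

    def add(self, value):
        self.window.append(value)
        return mean(self.window)


cpu_samples = [10, 20, 30, 40]  # made-up CPU usage percentages
stream = SlidingWindowAverage(size=2)
streamed = [stream.add(v) for v in cpu_samples]
```

The batch function needs all the data up front, while the sliding window produces an updated value after every sample, which is what real-time dashboards and alerts typically rely on.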
Data presentation:
Graphically display the processed results. In the multi-screen era, cross-device support is essential.
Early warning:
If problems are discovered during data processing, the platform needs to perform anomaly analysis and risk estimation, and then trigger events or raise alarms.
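As one simple illustration of an alerting rule, the sketch below raises an alarm only when the most recent samples all exceed a threshold. The rule itself, the function name, and the default of three consecutive samples are assumptions; the article does not prescribe a specific detection method.

```python
def check_threshold(values, threshold, min_consecutive=3):
    """Return True if `values` ends with at least `min_consecutive`
    samples that are strictly above `threshold`."""
    streak = 0
    for v in reversed(values):
        if v > threshold:
            streak += 1
        else:
            break
    return streak >= min_consecutive


# Made-up CPU usage percentages, oldest first.
should_alert = check_threshold([50, 91, 95, 97], threshold=90)
```

Requiring several consecutive breaches is a common way to avoid alerting on a single spike.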
CMDB (Enterprise Software and Hardware Asset Management):
The CMDB is a very important part of the unified monitoring platform. Although there are many types of monitoring sources, they are all related. For example, applications run in an operating environment, and their normal operation depends on network and storage devices. An application may also depend on other applications (business dependencies), and if any link in the chain fails, the application becomes unavailable. In addition to storing hardware and software assets, the CMDB therefore also needs to store the relationships between assets. If an asset fails, these relationships make it possible to quickly determine which other assets will be affected, and then resolve the problems one by one.
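The impact analysis described above can be sketched as a graph traversal over the stored relationships. The asset names and the `DEPENDS_ON` structure are hypothetical; a real CMDB would hold these relationships in its own data model.

```python
from collections import defaultdict, deque

# Hypothetical "depends on" relationships as a CMDB might record them.
DEPENDS_ON = {
    "order-app": ["payment-app", "mysql-01"],
    "payment-app": ["mysql-01", "switch-01"],
    "mysql-01": ["storage-01"],
}


def affected_assets(failed, depends_on):
    """Return every asset that directly or indirectly depends on `failed`."""
    # Invert the edges: asset -> assets that depend on it.
    dependents = defaultdict(list)
    for asset, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(asset)

    # Breadth-first search upward through the dependents.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


impacted = sorted(affected_assets("storage-01", DEPENDS_ON))
```

Here a failure of the storage device propagates to the database and then to both applications, while a failure of the switch would affect only the two applications.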
That concludes the review. Now to the main topic: system monitoring.
2. The technology stack for system monitoring
Part of the technology stack for system monitoring is shown in the following figure. There are many monitoring technologies, and it is impossible to list them all here, so only some classic and popular open source technologies are selected.
System monitoring differs from log monitoring: many open source software packages already cover data collection, data storage, data display, and event alerting on their own. Those packages are therefore not explained here for the time being. The main focus is on how to build a unified system monitoring platform.
Data collection:
Data collection for system monitoring generally takes one of two forms: active collection and client collection. Active collection is carried out remotely through SNMP, SSH, Telnet, IPMI, JMX, and similar means. Client collection requires deploying a client on every monitored host, which collects data and sends it to a remote server.
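A minimal sketch of the client collection side, using only the Python standard library, might look like the following. The metric names and the JSON payload format are assumptions, and the transport to the receiving server is left out; active collection over SNMP or JMX would need protocol-specific libraries and is not shown.

```python
import json
import os
import shutil
import socket
import time


def collect_sample(path="/"):
    """Collect one sample of basic host metrics using only the standard library."""
    disk = shutil.disk_usage(path)
    used_percent = round(disk.used / disk.total * 100, 2) if disk.total else 0.0
    sample = {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "disk_used_percent": used_percent,
    }
    # os.getloadavg() exists only on Unix-like systems and may fail there too.
    try:
        sample["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return sample


def encode_sample(sample):
    """Serialize a sample as JSON bytes, ready to send to a (hypothetical) receiver."""
    return json.dumps(sample, sort_keys=True).encode("utf-8")


payload = encode_sample(collect_sample())
```

A real agent would run `collect_sample` on a fixed interval and send each payload to the server, or to a buffer in front of it.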
Data buffer:
As with log monitoring, when the volume of monitoring data is massive, network pressure and data processing bottlenecks have to be considered. A data buffer layer can be added before data storage: the collected data is first placed in a message queue, and is then read from the distributed queue and stored. If the amount of data is not large, this layer can be omitted.
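The buffering idea can be sketched with an in-process queue standing in for a real message queue such as Kafka. The function names, the buffer size, and the drop-when-full policy are assumptions chosen to keep the example self-contained.

```python
import queue

# In-process stand-in for a distributed message queue.
buffer = queue.Queue(maxsize=10000)


def produce(sample):
    """Collector side: enqueue without blocking; report False if the buffer is full."""
    try:
        buffer.put_nowait(sample)
        return True
    except queue.Full:
        return False


def drain_batch(max_items=500):
    """Storage side: pull up to `max_items` samples so they can be written in one batch."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    return batch


for i in range(5):
    produce({"metric": "cpu", "value": i})
first_batch = drain_batch(max_items=3)
```

Decoupling the two sides this way lets collectors keep running during storage slowdowns, and lets the storage writer group many points into a single write.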
Data storage:
System monitoring data is usually stored in a time series database. A time series database is designed to handle data that carries a time tag (data that changes in time order, that is, time-serialized data); such data is also called time series data. InfluxDB and OpenTSDB are among the leaders.
OpenTSDB is a distributed, scalable time series database that uses HBase to store all time series data (no downsampling required). It can collect metrics from large-scale clusters (including the network devices, operating systems, and applications in the cluster), then store, index, and serve them, making the data easier to understand, for example through web-based graphs. It is implemented in Java, which is a boon for developers with a Java background, but its dependence on HBase may discourage some, since HBase has to be maintained first.
InfluxDB is an emerging time series database written in Go with no external dependencies. It is developing rapidly, and the latest version at the time of writing is 1.2. It provides an SQL-like query syntax, is easy to install, and can run as a single node. Although it has clustering capability, that feature is not open source (single-node performance can nevertheless basically meet enterprise needs). It also provides an HTTP API that is easy to call and wrap, which makes it very friendly to developers who want to build data processing and display on top of InfluxDB.
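As a rough illustration of how the HTTP API can be wrapped, the sketch below formats a point in InfluxDB line protocol and builds a POST request for the 1.x `/write` endpoint without sending it. The server address, the database name `monitoring`, and the measurement and tag names are assumptions, and the formatter deliberately skips the escaping rules of line protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request


def to_line_protocol(measurement, tags, fields, timestamp_s):
    """Format one point in InfluxDB line protocol.

    Simplification: assumes names and tag values contain no spaces, commas,
    or equals signs, and that all field values are numeric (written as floats).
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} {field_part} {timestamp_s}"


def build_write_request(base_url, database, lines):
    """Build (but do not send) a POST to the InfluxDB 1.x /write endpoint."""
    query = urlencode({"db": database, "precision": "s"})
    body = "\n".join(lines).encode("utf-8")
    return Request(f"{base_url}/write?{query}", data=body, method="POST")


line = to_line_protocol("cpu_load", {"host": "server01"}, {"value": 0.64}, 1434055562)
request = build_write_request("http://localhost:8086", "monitoring", [line])
```

Passing `request` to `urllib.request.urlopen` would perform the write against a running server; several lines can be sent in one request body to write points in batches.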
Data presentation:
When it comes to graphical presentation of time series data, Grafana is a tool that has to be mentioned. Grafana is open source software for querying and displaying time series data. It provides flexible and rich graphing options, can mix multiple styles, and has a full-featured metrics dashboard and graph editor. It connects to many data stores, such as Graphite, Elasticsearch, CloudWatch, Prometheus, and InfluxDB, for data query and graph display. Some open source monitoring software, such as Zabbix, Graphite, and Prometheus, has its own graphical display capabilities, but it is still generally recommended to replace their pages with Grafana, which shows how good Grafana is.
Of course, Grafana's data comes from the time series database. In practice, part of the data in the reports you want to view may also come from business systems, which is something Grafana and other monitoring software cannot handle out of the box. Extending Grafana is one option. The other is to build the charts yourself according to your own needs: calculate and analyze the time series data, combine it with business data, and display the result with an open source front-end charting framework such as ECharts. This is where InfluxDB's advantages show, because its external HTTP API is well suited to building independent graphical pages.
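The second approach can be sketched as a function that takes a result in the JSON shape returned by the InfluxDB 1.x `/query` endpoint, joins it with business data by timestamp, and produces an ECharts option object. The sample response, the `orders_by_time` data, and the column name `mean` are made up for illustration; a real page would fetch both data sets from their respective systems.

```python
def to_echarts_option(influx_response, business_by_time, title):
    """Combine one InfluxDB series with business data into an ECharts option dict."""
    series = influx_response["results"][0]["series"][0]
    time_idx = series["columns"].index("time")
    value_idx = series["columns"].index("mean")
    times = [row[time_idx] for row in series["values"]]
    load = [row[value_idx] for row in series["values"]]
    # Business values are matched by timestamp; missing points become None.
    orders = [business_by_time.get(t) for t in times]
    return {
        "title": {"text": title},
        "tooltip": {"trigger": "axis"},
        "legend": {"data": ["CPU load", "Orders"]},
        "xAxis": {"type": "category", "data": times},
        "yAxis": [
            {"type": "value", "name": "CPU load"},
            {"type": "value", "name": "Orders"},
        ],
        "series": [
            {"name": "CPU load", "type": "line", "data": load},
            {"name": "Orders", "type": "bar", "yAxisIndex": 1, "data": orders},
        ],
    }


# Hand-written sample data standing in for real query results.
sample_response = {"results": [{"series": [{
    "name": "cpu_load",
    "columns": ["time", "mean"],
    "values": [["2017-01-01T00:00:00Z", 0.5], ["2017-01-01T00:01:00Z", 0.9]],
}]}]}
orders_by_time = {"2017-01-01T00:00:00Z": 120}
option = to_echarts_option(sample_response, orders_by_time, "Load vs. orders")
```

The resulting dictionary can be serialized to JSON and passed to `setOption` on an ECharts instance in the browser.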