DockOne WeChat Share (107): SRE Engineering Practice: Alarms Based on Time Series Data Storage

Editor's note: In building an intelligent operations platform, operations monitoring and fault alarming are two important and unavoidable parts. This share first introduces the SRE concept and then mainly covers the engineering practice of alarming based on time series data storage.

SRE Alarm Introduction

Today's topic is the SRE practice of alarming based on time series data. Since the practice is built on time series, let me first briefly introduce what time series data is.

Time series data is a sequence of data points ordered by time, usually sampled at equal time intervals. In the simplest terms, time series storage holds data whose format includes a timestamp field. Queries on time series data always filter by a time range, and the query results always include the timestamp field.
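To make the definition concrete, here is a minimal sketch in Python of timestamped samples and a time-range query. The sample values, the 15-second interval, and the helper name query_range are assumptions chosen for illustration; they do not come from any particular time series database.

```python
from datetime import datetime, timedelta

# Hypothetical samples: (timestamp, value) pairs taken every 15 seconds.
start = datetime(2017, 2, 21, 20, 0, 0)
samples = [(start + timedelta(seconds=15 * i), 0.1 * i) for i in range(8)]


def query_range(series, begin, end):
    """Return the samples whose timestamp falls within [begin, end]."""
    return [(ts, value) for ts, value in series if begin <= ts <= end]


# Every query carries a time range, and every result row keeps its timestamp.
window = query_range(
    samples,
    start + timedelta(seconds=30),
    start + timedelta(seconds=60),
)
for ts, value in window:
    print(ts.isoformat(), round(value, 2))
```

A real time series store indexes and compresses the data by time, so the range filter does not have to scan every sample as this list comprehension does.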

Monitoring data largely has the characteristics of time series data: to handle complex monitoring data formats, a time field is added to every data point. Unlike a traditional relational database, time series storage optimizes storage, query, and presentation to achieve very high data compression and excellent query performance. This matters especially for IoT applications that must handle massive volumes of time series data.

Over more than 10 years of development, Google's monitoring system has moved from the traditional probe model and graphical trend display to a new model of monitoring and alarming based on time series data. This model treats collecting time series information as the primary task of the monitoring system and provides a language for manipulating time series. Using that language to turn data into charts and alarms replaces the earlier probe scripts.

Monitoring and alarming are two inseparable parts. Our company's CTO, Chaudesaigues, previously gave a share on monitoring practice based on time series data, so this share does not repeat the monitoring material. Interested readers can refer to his earlier article.

The operations team uses a monitoring system to understand the runtime status of application services and to ensure service availability and stability. The monitoring system usually also provides dashboards that display the metrics of running services. The various line charts look interesting, but the monitoring system is most valuable when a service behaves abnormally or a metric exceeds its configured threshold: the operations team receives an alarm message, intervenes in time, and restores the service to its normal state.

The SRE team holds that a monitoring system should not rely on people to analyze alarm information; the system should analyze it automatically. Alarms should be actionable, and their goal is to resolve a problem that has already occurred or to prevent one from occurring.

Monitoring and alerting

Monitoring and alerting allow the system to notify us proactively when a failure occurs or is imminent. When the system cannot fix a problem automatically, a person must investigate the alarm, determine whether there is a real fault, take steps to mitigate it, analyze the symptoms, and finally find the cause of the failure. The monitoring system should provide fault information from two angles: the symptom and the cause.

Black box monitoring and white box monitoring

Black box monitoring monitors system behavior by testing what is externally visible to users. It is symptom-oriented: it reports problems that are happening right now and alerts employees to emergencies. Black box monitoring is powerless against problems that have not yet occurred but are about to.

White box monitoring relies on metrics exposed from inside the system. This includes log analysis, the monitoring interface provided by the Java Virtual Machine, or an HTTP interface that lists internal statistics. By analyzing the values of these internal metrics, white box monitoring can detect impending problems. White box monitoring is sometimes symptom-oriented and sometimes cause-oriented, depending on the information it provides.

Google's SRE relies heavily on white-box monitoring.

Several principles for setting alarms

Normally, we should not issue an alert just because "something seems to be a problem."

Handling an urgent alarm takes up an employee's valuable time. If the employee is at work, handling the alarm interrupts the task at hand; if the employee is at home, it disrupts personal life. Frequent alarms lead to a "cry wolf" effect: employees start to doubt that alarms are valid, ignore them, and may even miss real failures.

Guidelines for setting alarm rules:
    • Every alert issued must be real, urgent, important, and actionable.
    • Alarm rules should reflect problems that are occurring in your service or are about to occur.
    • Classify problems clearly: availability of basic functions, response time, data correctness, and so on.
    • Alarm on symptoms, and include as much detail about the symptom and its likely cause as possible; do not alarm on causes directly.


Effective alarm based on time series data

Traditional monitoring runs a script on the server, stores the return value for graphical display, and checks the return value to decide whether to raise an alarm. Internally, Google uses Borgmon as its monitoring and alarm platform.

Outside Google, we can use Prometheus as a tool for monitoring and alarming based on time series data, and so put into practice the white box monitoring approach advocated by SRE.

Monitoring Alarm Platform Architecture diagram:

Monitoring Alarm Components


    • cAdvisor gives users a way to understand the resource usage and performance characteristics of running containers. cAdvisor is a program that runs in the background and collects, aggregates, processes, and exports information about running containers.
      Link: https://github.com/google/cadvisor

    • Prometheus is an open source system monitoring and alarm toolset developed by SoundCloud. Prometheus collects container runtime information from the cAdvisor HTTP interface, stores it in its internal storage, and uses PromQL to query and display the time series data and to define alarms. Alarm information is pushed to Alertmanager.
      Link: https://prometheus.io/

    • Alertmanager handles the alarms sent by the Prometheus service and performs deduplication, grouping, routing, silencing, and noise reduction.
      Link: https://prometheus.io/docs/alerting/alertmanager/

    • Alerta is a user-friendly alarm visualization tool for displaying and managing the alarm data pushed from Alertmanager.
      Link: http://alerta.io/
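The deduplication and grouping steps mentioned for Alertmanager can be pictured with a short Python sketch. This is only an illustration of the idea, not Alertmanager's implementation: the alert dictionaries, the label names, and the helper functions fingerprint, deduplicate, and group_by are all hypothetical.

```python
from collections import defaultdict


def fingerprint(alert):
    """Identify an alert by its sorted label set (an assumed identity rule)."""
    return tuple(sorted(alert["labels"].items()))


def deduplicate(alerts):
    """Keep only the first occurrence of each distinct label set."""
    seen = {}
    for alert in alerts:
        seen.setdefault(fingerprint(alert), alert)
    return list(seen.values())


def group_by(alerts, label):
    """Bucket alerts by the value of one label so each bucket is notified once."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["labels"].get(label, "")].append(alert)
    return dict(groups)


# Hypothetical incoming alerts; the second one repeats the first.
incoming = [
    {"labels": {"alertname": "dev_cpu_high", "group": "dev"}},
    {"labels": {"alertname": "dev_cpu_high", "group": "dev"}},
    {"labels": {"alertname": "ops_cpu_high", "group": "ops"}},
]

unique = deduplicate(incoming)
grouped = group_by(unique, "group")
for name, members in grouped.items():
    print(name, [a["labels"]["alertname"] for a in members])
```

Grouping by a label such as the owning team is one plausible routing key; a real deployment would choose the grouping labels in the Alertmanager configuration file.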


Build a test environment

To make testing easier, we run the components above in containers on a test server whose address is 192.168.1.188.
    1. Launch two nginx containers and give them different labels: one identifies an application belonging to the dev group, the other an application belonging to the ops group.
    2. Start the cAdvisor container, mapped to port 8080.
    3. Start the Alertmanager container, mapped to port 9093, and set the Alerta address as the webhook notification address in its configuration file.
    4. Start the Prometheus container, mapped to port 9090, and pass the Alertmanager address in the container command via the "-alertmanager.url" flag.
    5. Start MongoDB as the database for Alerta.
    6. Start Alerta, mapped to port 8181.


Container run:

Application metrics collection

cAdvisor natively provides an HTTP interface that exposes the metrics Prometheus needs to collect. We can view them at http://192.168.1.188:8080/metrics.
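The metrics endpoint returns plain text, one sample per line, with optional labels in braces. As a rough sketch of what a collector does with that text, the Python below parses a small hand-written example. The sample lines and values are made up for illustration, and the parser handles only the simple cases shown here, not the full exposition format.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical excerpt of a /metrics response (values are invented).
EXAMPLE = """\
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container_label_dataman_group="dev",container_label_dataman_service="web"} 12.5
container_cpu_usage_seconds_total{container_label_dataman_group="ops",container_label_dataman_service="web"} 3.25
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels["container_label_dataman_group"], value)
```

Prometheus does this parsing itself on every scrape; the sketch is only meant to show that each line becomes one labeled sample in a time series.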

Configure the cAdvisor address as a target in the Prometheus configuration file; the status of the targets can then be viewed on the Prometheus web page.

On the graph page of Prometheus, the collected data can be queried and displayed graphically.


Alarm rule Configuration

We configure alarm rules for the CPU usage of the container application, as follows:

In the figure, alarm rules for the application containers are defined for the dev group and the ops group. The format of an alarm rule is:
    • "ALERT" is the name of the alarm rule. The name cannot contain spaces; use underscores to join words.
    • "IF" is the query expression over the data. The expression queries the metric "container_cpu_usage_seconds_total" where the label "container_label_dataman_service" equals "web" and the label "container_label_dataman_group" equals "dev", and uses the function irate() to compute the per-second rate of increase of CPU usage time from the most recent samples in the previous 5 minutes. Put simply, it calculates the fraction of CPU time in use. The expressions in the two alarm rules differ slightly in order to distinguish the two groups of applications.
    • "FOR" is how long the alarm condition must hold. Once it has held for more than 1 minute, the alarm changes from the "PENDING" state to "FIRING" and is handed to Alertmanager for processing.
    • "LABELS" is custom data, where we specify the severity level of the alarm and the value of the "IF" expression to display.
    • "ANNOTATIONS" is also custom data, where we describe the symptom and the cause of the alarm.
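The interplay of the rate expression and the hold duration in the rule format above can be sketched in Python. This is a simplified model rather than Prometheus code: the class AlertRule, the 0.5 threshold, the 15-second sampling interval, and the synthetic counter are assumptions for illustration, and the irate helper mirrors only the basic idea of taking the last two samples.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, counter) samples."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        return None
    # Simplification: a counter that went down is treated as restarted from zero.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


class AlertRule:
    """Threshold rule that must hold for for_seconds before it fires."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.active_since = None
        self.state = "inactive"

    def evaluate(self, now, value):
        if value is None or value <= self.threshold:
            # Condition no longer holds: reset the timer.
            self.active_since = None
            self.state = "inactive"
        else:
            if self.active_since is None:
                self.active_since = now
            held = now - self.active_since
            self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


# Synthetic CPU counter growing by 0.9 CPU-seconds per second, sampled every 15 s.
counter = [(t, 0.9 * t) for t in range(0, 91, 15)]
rule = AlertRule(threshold=0.5, for_seconds=60)
states = [
    rule.evaluate(t, irate(counter[: i + 1]))
    for i, (t, _) in enumerate(counter)
]
print(states)
```

The alarm stays pending while the condition is true but has not yet lasted the full minute, which filters out short spikes before anything is sent on for notification.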


Trigger Alarm

We use the stress tool to put load on the CPU of the two containers so that their CPU utilization exceeds the alarm threshold. On the Prometheus page we can see the alarms that were generated.

On the Alertmanager page, we can see the alarms sent from Prometheus.

We can also see that Alertmanager pushed the alarm messages to Alerta.

Alarm message Display

Alerta stores and displays the alarms it receives.

Select an alarm to open its details. On the details page you can acknowledge (ack) the alarm, close it, and perform other operations.

After an alarm is over, you can view its history in Alerta, where it appears in the closed state.

Conclusion

Here we briefly introduced how to use cAdvisor, Prometheus, Alertmanager, and Alerta to implement the time-series-data-based alarm practice described by Google SRE. Alarming on performance metrics is the most basic form of alarming; in later shares we will also describe how to configure and collect an application's internal metrics and set up monitoring alarms for them.

Monitoring an application system is a complex process that needs continual adjustment to reflect service health and quality, and we need to keep absorbing SRE's operations concepts and putting them into practice. SRE can be seen as a concrete implementation of DevOps in operations: it covers concepts and culture as well as specific operations and engineering practices such as monitoring and alarming.

More and more companies in China are now paying attention to how SRE provides ongoing support for a project throughout its life cycle. How to make SRE ideas take root locally, and how each company can find the SRE path that suits it, are questions Shurenyun is still exploring. We will keep sharing our experience and hope we can all learn from SRE to continuously improve the level of enterprise operations and engineering practice. Thank you!

Q&A

Q: When an alarm message is received, can the system automatically resolve the problem the alarm reports, or does the problem need to be solved manually? Thanks.

A: It depends on the situation. In a good mechanism, every alarm that is issued should be about a new problem; a feedback mechanism then ensures that problems of the same kind either no longer occur or are resolved by the monitoring system itself.
Q: Have you considered the InfluxDB family of solutions? The latest version of Grafana also has a good alarm mechanism. Have you tried it?

A: We have considered and tried InfluxDB's TICK stack, which makes it very convenient to implement the complete flow of data collection, storage, processing, and presentation. By comparison, we found that Prometheus is more in line with Google SRE's ideas about monitoring and that its community is very active, so we turned to the Prometheus solution. Grafana now offers powerful visual configuration of alarm rules, which is a good enhancement for a tool that used to be only for display. This has been a great inspiration to us, and we are also studying it.
Q: What syntax is used to configure alarm rules, and can the configuration be visualized?

A: In Prometheus, alarm rules are described in a configuration file. You can build your own visualization on top of it.
Q: How do you handle large data volumes, for example a million machines with 500 metrics each, or 10,000 data points per minute? How is the data stored, and how can it be queried quickly? What architecture and hardware are needed?

A: In short, Prometheus can be split into groups to support large clusters, but beyond a certain scale the answer has to come from practice.
Q: Have you considered or practiced intelligent early warning in monitoring alarms, for example using machine learning on historical monitoring data to predict problems?

A: This is not the approach SRE recommends. Alarms should be simple; advanced features blur the real intent.
Q: How many hosts and containers are deployed in this setup, and how often is monitoring data collected?

A: This share uses a test environment, so the scale is small. Prometheus collects data from cAdvisor at a fixed interval; the collection interval here is 5s.
Q: How is cAdvisor's data collection performance? Does it consume a lot of host resources?

A: Performance is good. If you are worried about resource consumption, you can set resource limits when starting the container.
Q: An app's own business logic also has data to monitor, such as counters and gauges, which traditional tools like Zabbix can collect. As I understand it, cAdvisor collects data about containers. Is it possible to combine monitoring of the app itself with monitoring of the container?

A: In a follow-up topic we will cover alarming on application monitoring. Prometheus works by fetching data from exporters and then querying and analyzing the stored time series data with PromQL, so it can combine monitoring of the app itself with container monitoring.
The above content is compiled from the group share on the evening of February 21, 2017. The speaker, Dou Zhongqiang, is a research engineer at Shurenyun. He has many years of operations development experience, is familiar with configuration management, continuous integration, and related technologies and practices, and is currently responsible for the research and development of the monitoring and alarm components of the Shurenyun platform. DockOne organizes a technical share every week. Interested readers are welcome to add the WeChat ID Liyingjiesz to join the group, and you can leave us a message about topics you would like to hear or share.