A complete monitoring ecosystem has three links: monitoring, analysis, and control. The early-warning (alerting) platform is part of the analysis link: it applies rules to the monitoring data and generates alert logs for the control system, so it connects the other two links. The data collected by a monitoring platform is typical time series data, and designing a flexible, controllable alert engine for that data is the alerting platform's first task. Drawing on the author's experience, this article explores the architecture of a micro-kernel alert engine for time series data, in the hope that it resonates with readers working on similar problems.
With the rise of the mobile Internet, the industrial Internet, IoT, and edge computing, the volume of time series data has exploded over the last two years. According to the rankings published by DB-Engines, time series databases show the strongest growth trend of all database categories.
Figure: Database trends by category over the last two years
In the top 10 ranking of time series databases, the semi-open-source InfluxDB, the benchmark for the new generation of time series databases, leads in overall score, so it is a natural first choice for applications that need to store time series data.
Figure: Overall score ranking of time series databases
The monitoring system of the Teld cloud platform also stores its monitoring data in InfluxDB. The InfluxDB ecosystem offers Kapacitor as an alerting system, but given our requirements for flexibility and control, functional extensibility, and business adaptability, we chose to design our own micro-kernel alert engine, which consists of the following three parts:
I. Data capture
Every data processing system starts from its data sources, so the first requirement is extensible data source management that can fetch data from time series databases, relational databases, NoSQL databases, Web APIs, and so on. A data source entity can typically be described by its data center, data source type, connection address, database name, port, user name, password, and so on. When the alert engine starts, it dynamically loads the configured data sources.
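A minimal sketch of such data source management, assuming a registry keyed by data source type. The field names, the `register_source_type` decorator, and the in-memory "driver" are all hypothetical and stand in for real database clients:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class DataSource:
    """One configured data source (field names are illustrative)."""
    name: str
    data_center: str
    source_type: str   # e.g. "influxdb", "mysql", "webapi"
    address: str
    port: int
    database: str
    username: str
    password: str

# A query function takes a query string and returns a list of row dicts.
QueryFn = Callable[[str], List[dict]]

# Maps a source type to a factory that builds a query function for it.
_FACTORIES: Dict[str, Callable[[DataSource], QueryFn]] = {}

def register_source_type(source_type: str):
    """Decorator that registers a factory for one data source type."""
    def decorator(factory):
        _FACTORIES[source_type] = factory
        return factory
    return decorator

@register_source_type("memory")
def _memory_factory(source: DataSource) -> QueryFn:
    # Stand-in for a real driver: returns canned rows so the sketch runs offline.
    rows = [{"host": "web-01", "cpu": 93.0}]
    return lambda query: list(rows)

def load_sources(configs: List[dict]) -> Dict[str, QueryFn]:
    """Build a query function for every configured source at engine start-up."""
    loaded = {}
    for cfg in configs:
        source = DataSource(**cfg)
        if source.source_type not in _FACTORIES:
            raise ValueError(f"unsupported data source type: {source.source_type}")
        loaded[source.name] = _FACTORIES[source.source_type](source)
    return loaded

sources = load_sources([{
    "name": "monitor", "data_center": "dc1", "source_type": "memory",
    "address": "localhost", "port": 0, "database": "metrics",
    "username": "reader", "password": "secret",
}])
print(sources["monitor"]("SELECT cpu FROM host_metrics"))
```

Supporting a new kind of source then means registering one more factory, without touching the loading code.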
II. Rule evaluation
The alert engine is essentially a rule engine, so its rules need to be highly descriptive and abstract. Almost every alert engine invents its own expression specification and describes alert rules by combining expressions. In my view, such expression languages are closed and have a learning threshold; when a rule needs several expressions combined, it becomes hard to write and hard to understand. What do developers find easiest to master? The answer is SQL, SQL, SQL.
The author believes that a typical SQL-based rule engine has the following structure:
- SELECT data FROM table
- WHERE filter
- THEN action
With data sources in place, SELECT data FROM table is easy to implement. WHERE filter corresponds to the validation rules described in this section, and THEN action is covered in the next section.
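The SELECT / WHERE / THEN structure can be sketched as a small rule object. This is an illustration under assumptions, not the engine's actual design: `AlertRule`, `evaluate`, and `fake_query` are hypothetical names, and the query function stands in for a real data source.

```python
from dataclasses import dataclass
from typing import Callable, List

Row = dict

@dataclass
class AlertRule:
    name: str
    select: str                        # SQL text sent to the data source
    where: Callable[[Row], bool]       # validation rule applied to each row
    then: Callable[[str, Row], None]   # action run for each violating row

def evaluate(rule: AlertRule, query: Callable[[str], List[Row]]) -> int:
    """Run one rule: fetch rows, filter them, act on every match."""
    hits = 0
    for row in query(rule.select):
        if rule.where(row):
            rule.then(rule.name, row)
            hits += 1
    return hits

# Hypothetical in-memory data source standing in for a real query.
def fake_query(sql: str) -> List[Row]:
    return [{"host": "web-01", "cpu": 93.0}, {"host": "web-02", "cpu": 41.5}]

fired = []
rule = AlertRule(
    name="high-cpu",
    select="SELECT host, cpu FROM host_metrics WHERE time > now() - 5m",
    where=lambda row: row["cpu"] > 90,
    then=lambda name, row: fired.append((name, row["host"])),
)
hit_count = evaluate(rule, fake_query)
print(hit_count, fired)
```

Keeping the three parts separate means the query text, the validation rule, and the action can each be replaced independently.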
Each time series curve is described by a set of tags, so validation rules can be classified along two dimensions: the number of tags, and whether data is present or absent.
If the alert engine provides default implementations of the validation rules above, they can cover more than 90% of scenarios. For custom scenarios, an extension mechanism that supports custom development and dynamic loading of check-rule plug-ins can then cover essentially all business requirements.
Because different SQL statements return differently shaped results, a unified in-memory model is needed to represent the result set, and the DataTable is an excellent choice. Much of the work in implementing validation rules is therefore data transformation; once the data is in a unified model, the DataTable's built-in methods make rule evaluation straightforward.
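To illustrate the idea of a unified result model, here is a small Python stand-in for a DataTable. It is a sketch only: the `Table` class and both converters are hypothetical, the series layout (tags, columns, values) is assumed to resemble what InfluxDB 1.x returns from its HTTP query API, and all series are assumed to share the same columns.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence

@dataclass
class Table:
    """Minimal stand-in for a DataTable: named columns plus rows of values."""
    columns: List[str]
    rows: List[List[Any]] = field(default_factory=list)

    def column(self, name: str) -> List[Any]:
        idx = self.columns.index(name)
        return [row[idx] for row in self.rows]

def from_influx_series(series: List[Dict[str, Any]]) -> Table:
    """Flatten tagged series into one table, with tags as leading columns."""
    tag_keys = sorted({key for s in series for key in s.get("tags", {})})
    value_cols = list(series[0]["columns"]) if series else []
    table = Table(columns=tag_keys + value_cols)
    for s in series:
        tags = [s.get("tags", {}).get(key) for key in tag_keys]
        for values in s["values"]:
            table.rows.append(tags + list(values))
    return table

def from_sql_rows(columns: Sequence[str], rows: Sequence[Sequence[Any]]) -> Table:
    """Wrap a relational result set (column names plus tuples) in the same model."""
    return Table(columns=list(columns), rows=[list(r) for r in rows])

series = [
    {"name": "cpu", "tags": {"host": "web-01"}, "columns": ["time", "mean"],
     "values": [["2020-01-01T00:00:00Z", 93.0]]},
    {"name": "cpu", "tags": {"host": "web-02"}, "columns": ["time", "mean"],
     "values": [["2020-01-01T00:00:00Z", 41.5]]},
]
influx_table = from_influx_series(series)
sql_table = from_sql_rows(["host", "mean"], [("db-01", 12.0)])

# The same check works on either table because both share one model.
print(max(influx_table.column("mean")) > 90, max(sql_table.column("mean")) > 90)
```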
Judging the no-data condition is the hard part. The reporting of monitoring data is easily disturbed by fluctuations in the monitoring system, the network, and so on, so relying on a single check rule easily produces false positives. For the no-data case we therefore let the main check rule depend on several other check rules. These dependent rules are also written in SQL, and their results are injected as parameters into the main check rule's judgment, which reduces misjudgments and improves accuracy.
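The dependent-rule idea can be sketched as follows. The interpretation is an assumption: a missing-data alert is raised only when every dependent query still returns rows, meaning the monitoring pipeline itself looks healthy. The function name, the SQL text, and the measurement names are hypothetical.

```python
from typing import Callable, Dict, List

Query = Callable[[str], List[dict]]

def no_data_alert(query: Query, main_sql: str, dependent_sqls: Dict[str, str]) -> bool:
    """Return True only if the main query is empty and all dependents have data."""
    if query(main_sql):
        return False  # data is arriving, so there is nothing to report
    # Main query is empty: run the dependent rules and use their results
    # as inputs to the final decision.
    dependents = {name: query(sql) for name, sql in dependent_sqls.items()}
    return all(rows for rows in dependents.values())

def make_query(results: Dict[str, List[dict]]) -> Query:
    """Hypothetical data source: returns canned rows per SQL string."""
    return lambda sql: results.get(sql, [])

MAIN = "SELECT * FROM app_heartbeat WHERE host = 'web-01' AND time > now() - 5m"
DEPS = {"collector_alive": "SELECT * FROM collector_heartbeat WHERE time > now() - 5m"}

healthy_pipeline = make_query({DEPS["collector_alive"]: [{"ok": 1}]})
broken_pipeline = make_query({})

real_alert = no_data_alert(healthy_pipeline, MAIN, DEPS)
suppressed = no_data_alert(broken_pipeline, MAIN, DEPS)
print(real_alert, suppressed)
```

With the collector reporting, the missing application data is treated as a real alert; when the collector is silent too, the alert is suppressed because the gap is more likely a monitoring fault.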
III. Executing actions
After an alert is triggered, two main types of action may be performed: notifications and commands.
Notifications can be sent by e-mail, text message, DingTalk, and so on. As the system grows, a failure can easily cause an alert storm, in which a large number of alert messages arrive within a short time. Alert rules therefore need a severity level, a convergence interval, and a sending time window, combined with root cause analysis, so that operations staff receive only meaningful alerts.
Commands are operations instructions issued to the control system to carry out routine operations, such as restarting a process, recycling an application pool, capturing a dump, or dumping logs, so that losses are stopped promptly and the system does not deteriorate further.
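A convergence interval can be sketched as a per-alert suppression window. This is a minimal illustration under assumptions: the `AlertConverger` class, the severity names, and the interval values are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from typing import Dict, Tuple

class AlertConverger:
    """Suppress repeats of the same alert within a per-level interval (seconds)."""

    def __init__(self, intervals: Dict[str, float]):
        self.intervals = intervals
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def should_send(self, rule: str, target: str, level: str, now: float) -> bool:
        key = (rule, target)
        last = self._last_sent.get(key)
        interval = self.intervals.get(level, 0.0)
        if last is not None and now - last < interval:
            return False  # still inside the convergence window
        self._last_sent[key] = now
        return True

converger = AlertConverger({"critical": 60, "warning": 600})
decisions = [
    converger.should_send("high-cpu", "web-01", "critical", now=t)
    for t in (0, 30, 61)
]
print(decisions)
```

Only the first alert and the one after the 60-second window are sent; the repeat at 30 seconds is dropped.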
Another factor to consider for actions is who gets notified, that is, the recipients of an alert rule. A rule can have a static default recipient, or its recipients can be computed dynamically, for example from the machine, process, and other details of the alert that actually fired. This way a single alert rule can be distributed flexibly to different recipients depending on the situation.
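Static and dynamic recipients can be combined with a simple fallback. The sketch below assumes a resolver function that maps alert details to recipients; `resolve_recipients`, `by_host`, and the ownership table are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Resolver = Callable[[dict], List[str]]

def resolve_recipients(alert: dict, default: List[str],
                       resolver: Optional[Resolver] = None) -> List[str]:
    """Use dynamically computed recipients if there are any, else the default."""
    if resolver is not None:
        dynamic = resolver(alert)
        if dynamic:
            return dynamic
    return list(default)

# Hypothetical ownership table: who is responsible for which machine.
OWNERS: Dict[str, List[str]] = {"web-01": ["alice"], "db-01": ["bob", "carol"]}

def by_host(alert: dict) -> List[str]:
    return OWNERS.get(alert.get("host", ""), [])

known = resolve_recipients({"host": "db-01"}, ["oncall"], by_host)
unknown = resolve_recipients({"host": "cache-09"}, ["oncall"], by_host)
print(known, unknown)
```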
IV. Summary
A micro-kernel alert engine for time series data needs to be extensible and support dynamic loading, parse rules written in SQL, and drive monitoring data smoothly through capture, judgment, and action.
V. Teld Cloud Computing and Big Data WeChat official account
1. Official account name: Teld Cloud Computing and Big Data