Unified alarm platform and alarm Platform

Last Update:2016-12-22 Source: Internet

Author: User

I. Summary

As the monitoring service grew rapidly, the kinds of alarms multiplied and alarm records could not be queried, so a platform was needed to display and query alarms. The "unified alarm platform", AMC (Alert Messages Center), was built for this purpose.

AMC can be used through interface calls and through front-end configuration. It supports four alarm channels: RTX, SMS, WeChat, and email. It also supports automatic completion of module information, alarm convergence, and query and display of historical records. Alert recipients and alerting machines can be associated with CMDB to obtain the relevant information, or configured independently. Temporary blocking is also supported.

II. Access description

1. Register a project

If a project needs to send an alarm, you can consider connecting to AMC.

Developers of this project first register the project information on the AMC front-end page, and then call the alarm interface in their own project code to send an alarm.

The following information is required for the registered project:

Project name (required): consists of a) a globally unique English identifier and b) a Chinese name.
Project applicant: recorded when the project is registered and not used for any other purpose.
Third-level business (required): select one entry.
Alarm switch: a per-project switch that controls whether the project's alarm messages are sent out when a fault occurs.
Alert recipient: the business tree owner, the machine owner, a custom list, or recipients specified through the interface.
Alarm convergence period: the default is 60 seconds. Within one period, only the first of a series of repeated alarms is sent and the rest are filtered. Convergence (filtering) can be turned off in the front end.
Alarm convergence rules.
Alert method: RTX, SMS, WeChat, and email. Select at least one.

After registration, the project receives a globally unique project key, the appKey, which is a required parameter of the alarm interface.
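The convergence period described in the registration fields can be sketched as a small filter: the first alarm for a key is sent, and repeats inside the period are dropped. This is a minimal illustration, not AMC's implementation. The class name, the `should_send` method, and the idea of keying on a string such as appKey plus alarm type are assumptions made for the example.

```python
import time


class ConvergenceFilter:
    """Let the first alarm of each key through, then suppress repeats
    until the convergence period has elapsed."""

    def __init__(self, period_seconds=60):
        # 60 seconds is the default period mentioned in the text;
        # 0 disables convergence entirely.
        self.period = period_seconds
        self._window_start = {}  # key -> time the current period began

    def should_send(self, key, now=None):
        now = time.time() if now is None else now
        if self.period <= 0:
            return True
        start = self._window_start.get(key)
        if start is None or now - start >= self.period:
            self._window_start[key] = now  # a new period starts here
            return True
        return False  # repeat inside the current period


alarm_filter = ConvergenceFilter(period_seconds=60)
```

Passing `now` explicitly makes the behaviour easy to check without waiting for real time to pass.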

2. Alert type

A project usually has several alarm types (different exceptions). AMC registers an alarm type the first time an alarm of that type arrives, so you do not need to register exceptions in advance.

During development, the project assigns each exception a string that is unique within the project, called the "alarm type". The combination appKey + alarm type is therefore globally unique.

If the logic of a project is very simple (a script, for example) and there is no need to distinguish exception types, or the project has just started and has only one exception for now, you can omit the alarm type when sending an alarm. AMC then uses Default as the default value.

Because an alarm type is always qualified by the project key, it is a globally unique alarm type.

For example, cpu and memory in basic monitoring are both alarm types in this system.
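The rule "appKey + alarm type is globally unique, with a default type when none is given" can be sketched as follows. The function names, the ":" separator, the exact spelling of the default value, and the in-memory set standing in for AMC's storage are all assumptions made for illustration.

```python
DEFAULT_ALARM_TYPE = "Default"  # assumed spelling of the default value


def alarm_type_key(app_key, alarm_type=None):
    """Return the globally unique key formed from appKey + alarm type."""
    if not alarm_type:
        alarm_type = DEFAULT_ALARM_TYPE
    # The ":" separator is an illustrative choice, not taken from AMC.
    return f"{app_key}:{alarm_type}"


registered = set()  # stands in for AMC's table of known alarm types


def register_on_first_alarm(app_key, alarm_type=None):
    """Register the alarm type the first time it is seen.

    Returns True if this call created the registration."""
    key = alarm_type_key(app_key, alarm_type)
    if key in registered:
        return False
    registered.add(key)
    return True
```

A script with a single exception would simply call `register_on_first_alarm(app_key)` and fall under the default type.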

3. Machine dimension

Each alarm message also carries machine information. Based on this information, AMC can:

Check whether alarms are enabled for the machine in CMDB. This is a master switch with the highest priority.
Use the machine's business information in CMDB to determine whether it is an online machine or a non-online machine. Different machine roles have different convergence rules.

III. Alarm interface design

1. redis queue

There are two types of queues:

Queues to be processed: tentatively 8 queues, in_list_01 ... in_list_08. The alarm interface chooses a queue by consistent hashing on the alarm type and writes the data to it; the backend processes then consume it.
Processed queue: one queue, tentatively named his_msg_list. After the backend process has handled a task, it writes the result to this queue. Logstash retrieves the data every minute and writes it to elasticsearch.
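How an alarm type might be mapped to one of the eight pending queues can be sketched with a small consistent hash ring. The source does not say which hash function or ring layout AMC uses, so the `HashRing` class, the md5-based hash, and the number of virtual points per queue are assumptions.

```python
import bisect
import hashlib

QUEUES = [f"in_list_{i:02d}" for i in range(1, 9)]  # in_list_01 .. in_list_08


def _hash(value):
    # md5 is used only as a stable, well-spread hash, not for security.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=64):
        # Each queue is placed on the ring at `replicas` virtual points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First virtual point clockwise from the key's hash, wrapping around.
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(QUEUES)
queue_name = ring.node_for("cpu")  # the same alarm type always maps here
```

Because the mapping depends only on the alarm type, every alarm of one type lands in the same queue, which is what lets a single backend worker own it.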

2. Task data

The data written by the alarm interface is called "task data".

Task data must be serialized: use the msgpack API to serialize the task information into binary data, and then write the binary data to redis.

The serialized fields, in order (PHP, pseudocode):

msgpack_pack(array(
    'appkey'    => $app_key,      // string
    'content'   => $content,      // string
    'armtype'   => $alarm_type,   // string
    'isfadeout' => $is_fadeout,   // integer, 1 or 0
    'timestamp' => $timestamp,    // timestamp
    'alarmup'   => $uip,          // unsigned integer IP address of the alarming machine
    'remoteip'  => $remoteIp,     // unsigned integer IP address of the peer that called the interface
    'otheruser' => $other_users,  // user ID list; multiple IDs are separated by semicolons
));
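As a rough illustration of how a caller might assemble these fields before serialization, the sketch below builds the task as a Python dict with the same keys and order as the pseudocode, converting IP addresses to unsigned integers. The helper names are hypothetical. The msgpack step is left out because Python has no msgpack module in its standard library; with the third-party msgpack package the dict would typically be passed to `msgpack.packb`.

```python
import ipaddress
import time


def ip_to_uint(ip):
    """Convert a dotted IPv4 string to its unsigned 32-bit integer form."""
    return int(ipaddress.IPv4Address(ip))


def build_task(app_key, content, alarm_type, alarm_ip, remote_ip,
               other_users=(), is_fadeout=0, timestamp=None):
    # Dict insertion order follows the field order of the pseudocode.
    return {
        "appkey": app_key,
        "content": content,
        "armtype": alarm_type,
        "isfadeout": 1 if is_fadeout else 0,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "alarmup": ip_to_uint(alarm_ip),
        "remoteip": ip_to_uint(remote_ip),
        "otheruser": ";".join(other_users),  # IDs separated by semicolons
    }
```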

3. Message data

The backend program processes a task and produces a piece of "message data".

The format is JSON, so that logstash can write the data directly to elasticsearch.

A piece of data contains the following fields. For the meanings of each field, see the notes in the mysql create table statement below:

app_id, number, project id
app_key, string, project key
app_name, string, project identifier (English name)
alarm_id, number, alarm type id
alarm_type, string, alarm type
ip, string, IP address (dotted notation)
content, string, alarm content
occur_time, string, YYYY-MM-DD hh:mm:ss, fault timestamp
result_code, number, processing result status code
result, string, processing result description
send_time, string, YYYY-MM-DD hh:mm:ss, message sending timestamp
send_by, string, message sending channel
send_to, string, message recipient
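A message with these fields could be produced along the following lines. This is only a sketch: the `build_message` helper and its parameter list are assumptions, and writing one JSON document per line is a common convention for log shippers rather than something the source specifies.

```python
import ipaddress
import json
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD hh:mm:ss


def build_message(app_id, app_key, app_name, alarm_id, alarm_type,
                  ip_uint, content, occur_time, result_code, result,
                  send_time, send_by, send_to):
    message = {
        "app_id": app_id,
        "app_key": app_key,
        "app_name": app_name,
        "alarm_id": alarm_id,
        "alarm_type": alarm_type,
        # Task data stores the IP as an unsigned integer; the message
        # uses the dotted string form.
        "ip": str(ipaddress.IPv4Address(ip_uint)),
        "content": content,
        "occur_time": occur_time.strftime(TIME_FORMAT),
        "result_code": result_code,
        "result": result,
        "send_time": send_time.strftime(TIME_FORMAT),
        "send_by": send_by,
        "send_to": send_to,
    }
    # Serialize to a single JSON string.
    return json.dumps(message, ensure_ascii=False)
```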

IV. Alarm backend design

1. Producer

The number of threads is configured to match the number of redis queues, and each thread works on one redis queue. Combined with the consistent hash on the alarm type, this means that all alarms of a given type are handled by the same thread, so in-memory data at the second level and below does not need to be locked, which greatly speeds up processing.

The producer obtains the latest IP address, service tree, data center, and other information from CMDB at regular intervals, and loads configuration from the AMC database at regular intervals. This reduces database pressure, and passing data through queues reduces blocking.

It determines from the in-memory data whether an alarm is required and, if so, pushes it to the consumer.

It periodically restores the abnormal/normal status of alarm records.
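The "one thread per queue, no locks" idea can be sketched with the standard library. Here `queue.Queue` objects stand in for the redis lists, and each worker keeps a private dict that no other thread touches. The function names and the per-type counter are illustrative assumptions, not AMC's actual logic.

```python
import queue
import threading


def worker(tasks, state):
    """Drain one queue; `state` is touched by this thread only."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: stop the worker
            break
        alarm_type = task["armtype"]
        state[alarm_type] = state.get(alarm_type, 0) + 1


def run(tasks_per_queue):
    """tasks_per_queue: one list of task dicts per queue."""
    queues = [queue.Queue() for _ in tasks_per_queue]
    states = [{} for _ in tasks_per_queue]  # one private dict per thread
    threads = [threading.Thread(target=worker, args=(q, s))
               for q, s in zip(queues, states)]
    for t in threads:
        t.start()
    for q, tasks in zip(queues, tasks_per_queue):
        for task in tasks:
            q.put(task)
        q.put(None)
    for t in threads:
        t.join()
    return states
```

Since a given alarm type always arrives on the same queue, its state lives in exactly one thread and never needs a lock.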

2. Consumer

The consumer is multi-threaded: it receives alert messages from the producer in real time, pushes them to users, and writes them to the database and to redis, where logstash collects them.

V. Temporary blocking design

1. Add temporary blocking

Alerts can be blocked for a specified period as required. Alerts are disabled as soon as the blocking time starts and enabled again when it ends. Each blocking operation generates a blocking record, which stores the operator and the reason for blocking to facilitate auditing.
NOTE: With the launch of the temporary blocking function, the permanent disable function is retired. The maximum blocking time for a record is 10 days. The stated rationale is that AMC alarms generally go out by RTX/email, so a 7-day period extended by a rest day is still covered by 10 days. Whether the CMDB alarm switch is also removed still needs to be checked.
WARNING: The blocking start time must be no earlier than the next minute after the current time. The end time must be at least one minute after the start time and no more than 10 days after it.
IMPORTANT: AMC fetches the switch once per minute rather than in real time, so the disable at the start of blocking and the enable at the end are each applied one minute early. This keeps the switch state in AMC close to the intended times.
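The time rules for a blocking record can be checked with a few comparisons. The sketch below is one reading of those rules: "the next minute" is taken to mean the next whole-minute boundary after the current time, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_BLOCK = timedelta(days=10)
ONE_MINUTE = timedelta(minutes=1)


def window_errors(start, end, now):
    """Return a list of rule violations; an empty list means valid."""
    errors = []
    next_minute = now.replace(second=0, microsecond=0) + ONE_MINUTE
    if start < next_minute:
        errors.append("start must not be earlier than the next minute")
    if end < start + ONE_MINUTE:
        errors.append("end must be at least one minute after start")
    if end > start + MAX_BLOCK:
        errors.append("end must be within 10 days of start")
    return errors


def effective_window(start, end):
    """Shift both ends one minute earlier, matching the once-per-minute
    switch refresh described in the text."""
    return start - ONE_MINUTE, end - ONE_MINUTE
```

For example, with the current time at 10:00:30, the earliest valid start is 10:01:00, and a window of 10:01 to 11:00 would take effect from 10:00 to 10:59.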

2. Shielding type

AMC has a three-level hierarchy: project -> type -> IP address. Blocking is available at each level:
- By project
- By alarm type within a project
- By IP address within an alarm type of a project
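The three project-level blocking types above can be modelled as one rule shape in which unset fields match anything. This is a minimal sketch: the `BlockRule` class and `is_blocked` function are hypothetical, and the time window of each record is left out to keep the matching logic visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlockRule:
    project: str
    alarm_type: Optional[str] = None  # None: applies to every type
    ip: Optional[str] = None          # None: applies to every IP


def is_blocked(rules, project, alarm_type, ip):
    """Return True if any rule covers this project/type/IP triple."""
    for rule in rules:
        if rule.project != project:
            continue
        if rule.alarm_type is not None and rule.alarm_type != alarm_type:
            continue
        if rule.ip is not None and rule.ip != ip:
            continue
        return True
    return False
```

A rule with only `project` set blocks the whole project, while a rule with all three fields set blocks a single IP address for one alarm type.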

For alarms that AMC generates per machine, blocking is also available:

By machine IP address: only the given IP address is blocked; it is not expanded to the machine's ip1-ip5.
By business tree: all machines under the service are blocked (online/test machines and in-use/out-of-use machines can be distinguished), including ip1-ip5.
By data center: all machines in the IDC room are blocked, including ip1-ip5.

3. Unblocking (recovering alerts)

Because the permanent switch has been retired, an interface is needed to cancel blocking and re-enable alerts.

Provide an interface to cancel an entire blocking record.
Provide a finer-grained interface for partial cancellation.

Because the permanent switch has been retired and the IP addresses under a service tree or data center are updated in real time, the four blocking methods can only be canceled by the mask id of the original record.
To cancel a batch of ip/appId/alarmId/statusId values, modify the valid value in the mask table and remove the values to be canceled.

