Alarm System Concept

* Significance of the alarm system
* Alarm Method
* Trigger conditions of the alarm system
* Access method
* Application scenarios

# Significance of the alarm system ##

> Our existing system has no monitoring of the production environment. User complaints come in every day, and developers do not have a full picture of how their own systems are running. In an SOA architecture, interfaces are everywhere, so when a single business function fails it is hard to find out which module caused the problem. Developers end up handling problems passively, users discover faults immediately and complain, and the experience is poor.

> Imagine a scenario where a user places an order on the front-end page and, after the payment succeeds, the page still shows the order as unpaid. While the user is anxiously waiting and about to complain, our customer service has already contacted them by phone or text message: "The system has encountered an exception. We are handling the problem for you urgently. Please wait for the notification." The user will feel much more reassured.

Conclusion: Detect system exceptions before the user does, and handle them in a timely manner.


# Alarm Method ##

> Notifications are sent by SMS or email, depending on the exception level.
> A weekly alert summary is provided.
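
The routing described above can be sketched as follows. This is a minimal illustration, not the platform's actual configuration: the level names, the channel assigned to each level, and the helper names `channels_for` and `weekly_summary` are all assumptions made for the example.

```python
from collections import Counter
from enum import Enum


class Level(Enum):
    INFO = "info"
    ERROR = "error"
    FATAL = "fatal"


# Hypothetical policy: the article only says the channel depends on the level.
# Here the most severe level goes out by SMS as well as email.
CHANNELS_BY_LEVEL = {
    Level.INFO: ("email",),
    Level.ERROR: ("email",),
    Level.FATAL: ("sms", "email"),
}


def channels_for(level: Level) -> tuple:
    """Return the notification channels configured for an exception level."""
    return CHANNELS_BY_LEVEL[level]


def weekly_summary(alerts):
    """Count the week's alerts per (level, errorcode) pair, most frequent first.

    `alerts` is an iterable of (Level, errorcode) tuples.
    """
    counts = Counter((level.value, errorcode) for level, errorcode in alerts)
    return counts.most_common()
```

Keeping the policy in a plain mapping means the level-to-channel rules can be changed without touching the code that sends the notifications.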

# Trigger conditions of the alarm system ##

1. All fatal-level errors trigger an alarm.
2. Error-level entries trigger an alarm based on a specified errorcode.
3. Info-level entries trigger an alarm based on the frequency of a specified errorcode.
4. An alarm is triggered based on the response time threshold of a specified interface.
5. Custom triggers based on business scenarios.
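
As a rough sketch of how the first four conditions might be evaluated against a stream of log events: the `LogEvent` fields, the `TriggerRules` class, the sliding-window length, and the returned reason strings below are all hypothetical, chosen only to make the rules concrete. Custom business rules (condition 5) are left out.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEvent:
    timestamp: float                 # seconds
    level: str                       # "info", "error" or "fatal"
    errorcode: Optional[str] = None
    interface: Optional[str] = None
    response_ms: Optional[float] = None


class TriggerRules:
    def __init__(self, error_codes, info_frequency, response_thresholds_ms,
                 window_seconds=60.0):
        self.error_codes = set(error_codes)
        # errorcode -> number of occurrences within the window that raises an alarm
        self.info_frequency = dict(info_frequency)
        # interface name -> maximum acceptable response time in milliseconds
        self.response_thresholds_ms = dict(response_thresholds_ms)
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)

    def check(self, event: LogEvent) -> Optional[str]:
        """Return the reason an alarm should fire, or None if no rule matches."""
        # Rule 1: every fatal-level error fires.
        if event.level == "fatal":
            return "fatal"
        # Rule 2: error-level entries fire on a specified errorcode.
        if event.level == "error" and event.errorcode in self.error_codes:
            return "errorcode"
        # Rule 3: info-level entries fire once a code is seen often enough.
        if event.level == "info" and event.errorcode in self.info_frequency:
            times = self._recent[event.errorcode]
            times.append(event.timestamp)
            # Drop occurrences that have fallen out of the sliding window.
            while times and event.timestamp - times[0] > self.window_seconds:
                times.popleft()
            if len(times) >= self.info_frequency[event.errorcode]:
                return "frequency"
        # Rule 4: a specified interface responded slower than its threshold.
        limit = self.response_thresholds_ms.get(event.interface)
        if (limit is not None and event.response_ms is not None
                and event.response_ms > limit):
            return "slow_response"
        return None
```

Returning a reason rather than a boolean lets the notification step say which rule fired.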

# Access method ##

1. Analysis is based on the logs that each service provider produces from the date it is connected to the system.
2. The platform determines the contact group based on appid.
3. The platform specifies the errorcode, which maps to the email recipients and text message recipients.
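
One way to picture the two lookups is shown below. The article does not say how the contact group and the errorcode recipients combine, so this sketch assumes they are simply merged; every appid, group name, errorcode, address, and phone number here is a made-up placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Recipients:
    emails: list = field(default_factory=list)
    phones: list = field(default_factory=list)


# Hypothetical platform configuration.
CONTACT_GROUP_BY_APPID = {"order-service": "order-team"}
GROUP_RECIPIENTS = {
    "order-team": Recipients(["order-team@example.com"], ["555-0100"]),
}
ERRORCODE_RECIPIENTS = {
    "E1001": Recipients(["dba@example.com"], ["555-0100", "555-0101"]),
}


def resolve_recipients(appid: str, errorcode: str) -> Recipients:
    """Merge the appid's contact group with the errorcode's own recipients."""
    group = CONTACT_GROUP_BY_APPID.get(appid)
    base = GROUP_RECIPIENTS.get(group, Recipients())
    extra = ERRORCODE_RECIPIENTS.get(errorcode, Recipients())
    # dict.fromkeys removes duplicates while keeping the original order.
    emails = list(dict.fromkeys(base.emails + extra.emails))
    phones = list(dict.fromkeys(base.phones + extra.phones))
    return Recipients(emails, phones)
```

An unknown appid or errorcode resolves to an empty `Recipients`, so a real system would probably want a fallback contact for that case.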

# Application scenarios ##

1. Some time ago, the CPU load of the production cluster stayed high, and PHP process usage stayed high as well. Troubleshooting on the servers showed that the PHP programs were timing out on connections to external resources, but because of the complexity of the system there was no way to determine which services (interfaces, data, and distributed files) a given project called. After half a day of unsuccessful troubleshooting, the service recovered on its own. The cause turned out to be that project A had changed the memcache server it connected to, and the slow memcache responses caused occasional timeouts in the interfaces provided by the project.

2. The order payment process occasionally fails, and we occasionally receive user complaints about it. There are two kinds of failure:
   1. The payment provider's system is faulty and the callback is delayed.
   2. We receive the payment callback result, but our payment script fails.

   To help with this, the payment script writes text logs and database logs while processing a payment, but it does not actively notify the developers, which leaves them in a passive position. In addition, when a developer investigates a problem, the server logs and data records are available, but the error scenario cannot be reproduced.

3. Database connection errors, failures to write to the queue, and failures to write important log records (to files or to the database).
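
In the last two scenarios the failure is recorded somewhere but nobody is told about it. A minimal sketch of closing that gap is a wrapper that reports the exception, together with the inputs that caused it, before re-raising it. The decorator name `alert_on_failure` and the shape of the report are assumptions for illustration, and `notify` stands in for whatever actually sends the email or text message.

```python
import functools
import traceback


def alert_on_failure(notify):
    """Build a decorator that reports any exception to `notify` and re-raises it.

    `notify` is any callable that accepts one dict describing the failure.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Capture the inputs as well as the error, so the failing
                # call can be examined and replayed later.
                notify({
                    "function": func.__name__,
                    "args": repr(args),
                    "kwargs": repr(kwargs),
                    "error": repr(exc),
                    "traceback": traceback.format_exc(),
                })
                raise
        return wrapper
    return decorator
```

Wrapping a payment callback handler, a queue writer, or a log writer this way leaves its behavior unchanged for callers, because the exception still propagates; the only difference is that the developers hear about it immediately.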

This article is from the "architecture technology Digest" blog, please be sure to keep this source http://wufaliang.blog.51cto.com/3160882/1543599
