Practical experience of operation and maintenance fault handling

Source: Internet
Author: User

Operations engineers inevitably run into all kinds of failures. Keeping things controllable is one of the ultimate goals an operations team pursues.

That includes keeping faults controllable, from which the following sub-goals are derived:


1. Reduce the probability of failure

Hein's law: behind every serious accident there are inevitably minor accidents, warning signs, and latent hazards.

Let the data speak: tally how the causes of anomalies are distributed:

    • Network and hardware
    • Cooperation with staff from other departments
    • Program code
    • Architectural design flaws
    • Database
    • Deployment errors
    • Human error
    • Other

Accumulate this data over a period of time and generate a percentage distribution chart, so that a sudden surge in any one cause can be spotted promptly.
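As a rough illustration of that tally, the sketch below counts cause labels and compares two periods. The function names, the sample data, and the 15-point threshold are all assumptions made for the example, not something the article prescribes.

```python
from collections import Counter


def cause_distribution(causes):
    """Return {cause: percentage of all anomalies} for a list of cause labels."""
    counts = Counter(causes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cause: 100.0 * n / total for cause, n in counts.items()}


def spiking_causes(baseline, current, threshold_points=15.0):
    """Causes whose share rose by more than threshold_points percentage points.

    threshold_points is an arbitrary illustrative value; tune it to your data.
    """
    return sorted(
        cause
        for cause, pct in current.items()
        if pct - baseline.get(cause, 0.0) > threshold_points
    )


if __name__ == "__main__":
    # Hypothetical anomaly records, one cause label per incident.
    last_month = ["Database"] * 2 + ["Deployment errors"] * 4 + ["Human error"] * 4
    this_month = ["Database"] * 6 + ["Deployment errors"] * 2 + ["Human error"] * 2

    baseline = cause_distribution(last_month)
    current = cause_distribution(this_month)
    for cause, pct in sorted(current.items(), key=lambda kv: -kv[1]):
        print(f"{cause:20s} {pct:5.1f}%")
    print("Spiking:", spiking_causes(baseline, current))
```

Comparing shares rather than raw counts keeps a busy month from looking like a surge in every category at once.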

In general, code releases and operational changes (such as adding machines, data migrations, IP changes, etc.) are the two major triggers of failures. So we should abstract the objects being operated on, reduce human intervention, streamline operational processes, and reduce complexity. Every company's team has its own processes and steps, so no single approach generalizes; this needs cooperation from the entire company, not only the operations department.

2. Rapid detection of faults

Basic system monitoring    Basic business monitoring    Advanced business monitoring
-----------------------    -------------------------    ----------------------------
Machine liveness           Port availability            Real-time online users
Network connectivity       Process liveness             Service timeouts
CPU                        Log monitoring               Data consistency
Memory                     curl availability            Key component availability
Disk                       check_http                   Capacity monitoring

A typical ops team can cover basic system monitoring and basic business monitoring; advanced business monitoring is the real measure of an ops team.
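Two of the basic business checks, port availability and an HTTP probe in the spirit of curl or check_http, can be sketched with the Python standard library alone. The function names, timeouts, and the localhost targets are assumptions for illustration; a real deployment would more likely use an existing monitoring system.

```python
import http.client
import socket
import urllib.error
import urllib.request


def port_available(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url, timeout=3.0):
    """Return True if a GET on url answers with a 2xx status within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTP errors (4xx/5xx), refused connections, timeouts, and
        # malformed URLs are all treated as a failed check.
        return False


if __name__ == "__main__":
    # Hypothetical targets; replace with the services you actually run.
    print("ssh port open:", port_available("127.0.0.1", 22))
    print("local web ok: ", http_ok("http://127.0.0.1:80/"))
```

Both checks return a plain boolean, so deciding who gets notified stays separate from the probing itself.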

Alarm messages should be layered and classified, with redundant information filtered out, and then routed precisely to the person responsible for each application.
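One possible shape for that pipeline is sketched below: severity levels stand in for layering, an (application, check) key is used to drop repeats, and an ownership table does the routing. The `Alert` fields, the `OWNERS` table, and the severity names are hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass

# Hypothetical ownership table: application -> responsible person.
OWNERS = {"payments": "alice", "search": "bob"}
DEFAULT_OWNER = "oncall"  # fallback when an application has no listed owner

# Lower number means more severe.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Alert:
    app: str
    check: str
    severity: str
    message: str


def route_alerts(alerts, min_severity="warning"):
    """Drop low-severity and duplicate alerts, then group the rest by owner."""
    cutoff = SEVERITY_ORDER[min_severity]
    seen = set()
    routed = {}
    for alert in alerts:
        # Unknown severities are treated as below the threshold.
        if SEVERITY_ORDER.get(alert.severity, cutoff + 1) > cutoff:
            continue
        key = (alert.app, alert.check)
        if key in seen:
            continue  # redundant repeat of the same app/check pair
        seen.add(key)
        owner = OWNERS.get(alert.app, DEFAULT_OWNER)
        routed.setdefault(owner, []).append(alert)
    # Most severe alerts first within each owner's list.
    for owner_alerts in routed.values():
        owner_alerts.sort(key=lambda a: SEVERITY_ORDER[a.severity])
    return routed
```

Calling `route_alerts(alerts)` returns a dict mapping each owner to the alerts they should see, so a notifier only has to iterate over it.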

3. Fast handling of faults

Fault handling can be divided into three sub-steps: response, location, and repair.
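To see which of the three sub-steps dominates an incident, it can help to record a timestamp at each transition and take the differences. The sketch below assumes four timestamps per incident (alert raised, acknowledged, cause located, repaired); these field names are illustrative and not taken from the article.

```python
from datetime import datetime, timedelta


def phase_durations(alerted_at, acknowledged_at, located_at, repaired_at):
    """Split one incident into response, location, and repair durations."""
    stamps = [alerted_at, acknowledged_at, located_at, repaired_at]
    if any(later < earlier for earlier, later in zip(stamps, stamps[1:])):
        raise ValueError("timestamps must be in chronological order")
    return {
        "response": acknowledged_at - alerted_at,
        "location": located_at - acknowledged_at,
        "repair": repaired_at - located_at,
    }


if __name__ == "__main__":
    # Hypothetical incident timeline.
    start = datetime(2024, 1, 1, 3, 0)
    phases = phase_durations(
        start,
        start + timedelta(minutes=5),
        start + timedelta(minutes=25),
        start + timedelta(minutes=40),
    )
    for name, duration in phases.items():
        print(f"{name:9s} {duration}")
    print("slowest phase:", max(phases, key=phases.get))
```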

The speed of response depends on how the operations team divides labor and responsibility. In theory the team needs to respond 7x24; actually holding each colleague to that requires certain incentives and penalties, which I won't go into here.

Locating a fault depends on the team's experience being passed on and shared. That calls for an operations fault manual recording typical faults and how to handle them, along with regular fault drills and a range of contingency plans.

The speed of repair is largely determined by how complete the tooling is for data repair, rollback, traffic switching, machine switching, and so on.

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
