Operations engineers inevitably run into all kinds of failures. Keeping systems controllable is one of the ultimate goals of an operations team, and that includes keeping faults controllable.
Three sub-goals follow from this:
1. Reduce the probability of failure
Hain's law: behind every serious accident there are minor accidents, warning signs, and latent hazards that came before it.
Let the data speak: record the cause of each anomaly and tally how the causes are distributed:
- Network and hardware
- Cooperation with staff from other departments
- Program code
- Architectural design flaws
- Database
- Deployment errors
- Human error
- Other
Accumulate this data over a period of time and chart the percentage distribution, so that a sudden surge in any one cause can be spotted promptly.
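The tallying described above can be sketched in a few lines of Python. This is only an illustration: the incident lists, the function names `cause_distribution` and `find_bursts`, and the 15-point threshold are all assumptions made for the example, not part of any particular tool.

```python
from collections import Counter


def cause_distribution(causes):
    """Return {cause: percentage of all incidents}, most frequent first."""
    counts = Counter(causes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cause: 100.0 * n / total for cause, n in counts.most_common()}


def find_bursts(baseline, recent, threshold_points=15.0):
    """Return causes whose share in `recent` exceeds their share in
    `baseline` by more than `threshold_points` percentage points.

    The threshold is an arbitrary assumption; tune it to your own data.
    """
    base = cause_distribution(baseline)
    now = cause_distribution(recent)
    return sorted(
        cause for cause, pct in now.items()
        if pct - base.get(cause, 0.0) > threshold_points
    )


# Hypothetical incident logs: one cause label per recorded incident.
baseline = (["Human error"] * 2 + ["Deployment errors"] * 3
            + ["Database"] * 2 + ["Other"] * 3)
recent = ["Deployment errors"] * 6 + ["Database"] * 2 + ["Human error"] * 2

for cause, pct in cause_distribution(recent).items():
    print(f"{cause}: {pct:.0f}%")
print("Burst:", find_bursts(baseline, recent))
```

Comparing percentage shares rather than raw counts keeps the comparison meaningful when the two periods contain different numbers of incidents.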
In general, code releases and operational changes (such as adding machines, data migrations, and IP changes) are the two main triggers of failures. We should therefore abstract the objects being operated on, reduce manual intervention, streamline operational procedures, and reduce complexity. Every company's team has its own processes and steps, so there is no one-size-fits-all answer; this takes cooperation across the whole company, not just the operations department.
2. Rapid detection of faults
| Basic system monitoring | Basic business monitoring | Advanced business monitoring |
| --- | --- | --- |
| Machine survival | Ports available | Live online people |
| Network connectivity | Process survival | Service timeout |
| CPU | Log monitoring | Data consistency |
| Memory | Curl available | Key components available |
| Disk | check_http | Capacity monitoring |
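Two of the basic business checks in the table, port availability and an HTTP check, can be sketched with the Python standard library alone. The function names `port_available` and `http_ok`, the 2-second timeouts, and the host and URL in the usage lines are assumptions for illustration; a real deployment would more likely rely on an existing monitoring system.

```python
import socket
import urllib.error
import urllib.request


def port_available(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


def http_ok(url, timeout=2.0):
    """Return True if the URL answers with a 2xx or 3xx status within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers HTTP errors and network failures;
        # ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    # Hypothetical targets; replace with your own services.
    print("port 80 open:", port_available("127.0.0.1", 80))
    print("http ok:", http_ok("http://127.0.0.1/health"))
```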
A typical ops team can cover basic system monitoring and basic business monitoring; advanced business monitoring is the real measure of an ops team.
Alerts should be layered and classified, redundant information filtered out, and each alert delivered precisely to the person responsible for the affected application.
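As a rough sketch of layering, classifying, filtering, and routing alerts, the example below drops alerts under a minimum level, removes duplicates, and groups what remains by the owner of each application. The `Alert` fields, the level names, the `OWNERS` mapping, and the fallback `ops-oncall` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    app: str        # application that raised the alert
    level: str      # layer: "critical", "warning", or "info"
    category: str   # class: e.g. "system" or "business"
    message: str


# Hypothetical mapping from application to responsible person.
OWNERS = {"payment": "alice", "search": "bob"}
DEFAULT_OWNER = "ops-oncall"
LEVEL_ORDER = {"critical": 0, "warning": 1, "info": 2}


def route_alerts(alerts, min_level="warning"):
    """Drop low-level and duplicate alerts, then group the rest by owner."""
    cutoff = LEVEL_ORDER[min_level]
    seen = set()
    routed = {}
    # Most severe first, so a duplicate never hides a more severe alert.
    for alert in sorted(alerts, key=lambda a: LEVEL_ORDER[a.level]):
        if LEVEL_ORDER[alert.level] > cutoff:
            continue  # below the minimum level: filtered as noise
        key = (alert.app, alert.category, alert.message)
        if key in seen:
            continue  # redundant repeat of an alert already routed
        seen.add(key)
        owner = OWNERS.get(alert.app, DEFAULT_OWNER)
        routed.setdefault(owner, []).append(alert)
    return routed


if __name__ == "__main__":
    alerts = [
        Alert("search", "info", "system", "cpu 60%"),
        Alert("payment", "critical", "business", "order timeout"),
        Alert("payment", "critical", "business", "order timeout"),
        Alert("cache", "warning", "system", "disk 90%"),
    ]
    for owner, items in route_alerts(alerts).items():
        print(owner, [a.message for a in items])
```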
3. Fast handling of faults
Fault handling breaks down into three steps: response, location, and repair.
Response speed depends on how work and responsibilities are divided within the operations team. In theory the team must provide 7x24 response; making that real for every individual engineer takes suitable incentives and penalties, which will not be elaborated here.
Locating a fault depends on the team passing on and sharing its experience. This calls for an operations fault manual that records typical faults and how to handle them, as well as regular fault drills and prepared response plans.
Repair speed is largely determined by how complete the tooling is, for example tools for data repair, rollback, traffic switching, and machine switching.
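To make the rollback tool concrete, here is a deliberately minimal sketch that only tracks which version is live and steps back to the previous one. The class name `ReleaseHistory` and the version strings are assumptions; a real rollback tool would also have to redeploy artifacts and restore configuration and data.

```python
class ReleaseHistory:
    """Track deployed versions so the previous one can be restored quickly."""

    def __init__(self):
        self._versions = []

    @property
    def current(self):
        """The version currently live, or None if nothing was deployed."""
        return self._versions[-1] if self._versions else None

    def deploy(self, version):
        """Record a new version as the live one."""
        self._versions.append(version)

    def rollback(self):
        """Discard the live version and return the one restored in its place."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._versions.pop()
        return self._versions[-1]


if __name__ == "__main__":
    history = ReleaseHistory()
    history.deploy("v1.0")   # hypothetical version labels
    history.deploy("v1.1")
    print("live:", history.current)
    print("rolled back to:", history.rollback())
```

Keeping the history explicit means the rollback target is never a judgment call made under pressure during an incident.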
Practical experience of operation and maintenance fault handling