A few days ago, netizens posted on Weibo that Meituan takeout order payments were delayed, that some users were still prompted to pay after they had already paid, and that group-buy pages were not displaying properly. Meituan's servers had suffered a large-scale outage.
Netizens' comments:
1. This made the trending list and served everyone a bug for lunch.
2. Running into this right at lunchtime is really dispiriting. And it was not an isolated case: a quick search on Weibo turned up plenty of people reporting the same problem with Meituan Waimai.
3. I tried paying several times in a row; the money went out, but no order was created and no refund came back.
4. I tried to reach Meituan Waimai customer service but could not get through, whether online or by phone.
Meituan's response:
1. At 12:16 on the day of the incident, Meituan replied on Weibo: "The order problem has been fixed, the order problem has been fixed, the order problem has been fixed."
2. At 12:28 the app was still down. At 12:43, Meituan responded on Weibo: after emergency repairs, service had been restored; duplicate payments would be refunded, and orders that could not be fulfilled during the system failure could be cancelled and refunded at no fault to the user.
Outcome:
1. Some netizens who placed duplicate orders have received refunds and apology red envelopes from Meituan;
2. The incident was a year-end disaster for Meituan's engineers, and it is likely to cost the programmers involved their generous year-end bonus.
In fact, this is not the first time Meituan has run into a similar problem. It is understood that on December 5, Meituan Waimai's servers also crashed. At lunchtime, users who had ordered meals and tried to check their order progress saw pages showing either "system processing exception" or "order does not exist", so they could not track their orders.
Users' complaints after that incident:
1. Are Meituan's programmers fixing bugs on empty stomachs and missing them, or is one about to be sacrificed to the heavens? Isn't it winter break yet?
2. Going hungry this time was thanks to the Meituan programmers' final push at year end. After Baofeng Storm Video and Xiami Music, another programmer is about to be sacrificed to the heavens.
So, for those of us fighting on the front line of operations, what lessons can be learned from the Meituan Waimai incident?
This article analyzes the construction process and design principles of an automated operations system, covering problem discovery, root cause analysis, and problem solving in day-to-day operations.
Problem Discovery: Complex Business Processes
When a user places an order in the Meituan Waimai app, many technical modules are involved: the user places the order, the system sends it to the merchant, the merchant prepares the food, delivery begins, and finally the user receives the goods, such as a still-hot bento. The whole process needs to be completed within about half an hour.
Figure 1: Meituan Waimai's technology system from the user's perspective
Root Cause Analysis: Problems That Need to Be Solved
Figure 4: Developers' daily monitoring pain points
We often encounter problems that plague developers in our daily business operations, as shown in Figure 4.
Four main pain points:
1. The company has several monitoring systems, each responsible for its own area of problem location, but they are not linked to one another, so developers have to carry parameters from one system to another while troubleshooting, which lowers the efficiency of locating problems.
2. Event notifications and alarm events flood developers' IM; we have to spend a lot of energy configuring and tuning alarm thresholds and levels so that false positives do not pile up.
3. Developers receive all kinds of alarms and usually troubleshoot based on their own experience; that troubleshooting experience could be standardized.
4. Our code contains a large number of degradation and rate-limiting switches that perform protective actions when a service misbehaves. These switches iterate quickly along with the product, and we are not sure whether they are still effective.
Problem Solving One: Use automation to improve operational efficiency; there are two paths to triggering service protection.
Figure 5: Core construction objectives of the automated business operations system
First, when users receive a diagnostic alarm, they can go directly to the business dashboard the alarm may affect.
Second, users can also drill down from the diagnostic alarm into the corresponding core link to see the root cause of the anomaly, which guides them in deciding whether to trigger the corresponding service protection plan.
Problem Solving Two: Product development, building the core products, and the relationships between the product modules.
1. Business dashboard forecast alarms, core-link WD diagnostic alarms, and the alarm events already collected across dimensions: if these can be further statistically analyzed, they can help developers perceive potential service problems from a more macroscopic angle ahead of time, effectively giving the service an advance health check (see the sketch after Figure 6).
2. Analyze the status of services on the core link, help developers locate the final problem node, and suggest which service protection plan to trigger.
3. We need to run full-link load tests regularly to continuously verify that problem diagnosis and service protection are effective. During load testing we can see the health of services under each scenario and achieve effective capacity planning for the service nodes.
Figure 6: Business monitoring and operations architecture
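As an illustration of the statistical analysis mentioned in point 1, here is a minimal sketch, assuming a hypothetical AlarmEvent shape and a simple count-based rule, of aggregating collected alarm events per service to flag candidates for an advance health check; the real analysis would use far richer dimensions.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AlarmHealthCheck {

    // Hypothetical shape of an alarm event collected from the various monitoring systems.
    public record AlarmEvent(String serviceName, String dimension, long timestampMillis) {}

    // Count recent alarm events per service; services whose count reaches the
    // threshold are flagged for an advance health check.
    public static List<String> servicesNeedingCheck(List<AlarmEvent> events,
                                                    long sinceMillis,
                                                    long threshold) {
        Map<String, Long> countsByService = events.stream()
                .filter(e -> e.timestampMillis() >= sinceMillis)
                .collect(Collectors.groupingBy(AlarmEvent::serviceName, Collectors.counting()));

        return countsByService.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```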
Problem Solving Three: Business monitoring dashboard and extension capabilities
1. When business indicators fluctuate abnormally, events can be tagged manually or automatically based on analysis of the underlying monitoring data, telling developers what caused the fluctuation and keeping users quickly informed (see the sketch after Figure 7).
2. Timestamps and event types quickly guide developers into the other monitoring systems, improving troubleshooting efficiency.
Figure 7: Business monitoring dashboard and extension capabilities
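A minimal sketch of such an event tag, with hypothetical field names: the timestamp, event type, and a deep link let a developer jump straight from the tagged fluctuation into the relevant monitoring system.

```java
import java.time.Instant;

// Hypothetical tag attached to a fluctuation in a business indicator.
public class IndicatorEventTag {
    private final Instant timestamp;
    private final String eventType;       // e.g. "release", "network-jitter", "dependency-failure"
    private final String cause;           // what caused the fluctuation, filled in manually or automatically
    private final String monitoringLink;  // deep link into the relevant monitoring system

    public IndicatorEventTag(Instant timestamp, String eventType,
                             String cause, String monitoringLink) {
        this.timestamp = timestamp;
        this.eventType = eventType;
        this.cause = cause;
        this.monitoringLink = monitoringLink;
    }

    // Rendered next to the business-indicator curve so the information
    // reaches developers and users quickly.
    public String render() {
        return String.format("[%s] %s: %s (details: %s)",
                timestamp, eventType, cause, monitoringLink);
    }
}
```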
Problem Solving Four: Core-link product construction path
1. We need to give each service node on the core link a health score and use a scoring model to define which links have serious problems.
2. Here we use a service's various indicators to draw a "problem portrait" of it; the indicators in the portrait also carry different weights. For example, a failure-rate alarm, a TP99 alarm, or a large volume of exception logs each add heavily weighted points (a scoring sketch follows Figure 8).
Figure 8: Core-link product construction path
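Here is a minimal sketch of such a weighted problem portrait; the indicator names and weights are illustrative rather than the actual model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative weighted scoring for a service node on the core link.
public class NodeProblemScore {

    private static final Map<String, Double> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("failureRateAlarm", 40.0);   // failure-rate alarm: high weight
        WEIGHTS.put("tp99Alarm", 30.0);          // TP99 latency alarm: high weight
        WEIGHTS.put("exceptionLogSpike", 20.0);  // large volume of exception logs
        WEIGHTS.put("threadPoolSaturation", 10.0);
    }

    // Each active indicator adds its weighted points to the node's problem score.
    public static double problemScore(Map<String, Boolean> activeIndicators) {
        double score = 0.0;
        for (Map.Entry<String, Double> entry : WEIGHTS.entrySet()) {
            if (activeIndicators.getOrDefault(entry.getKey(), false)) {
                score += entry.getValue();
            }
        }
        return score;
    }

    // The health score is simply the complement; a low value marks a serious problem link.
    public static double healthScore(Map<String, Boolean> activeIndicators) {
        return Math.max(0.0, 100.0 - problemScore(activeIndicators));
    }
}
```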
Problem Solving Five: Core functions of the Service Protection & Failure Drill module
1. Degradation switches: because the business grows rapidly, there can be hundreds of degradation switches in the code, and when a business exception occurs, degradation has to be performed manually.
2. Rate-limit switches: some specific business scenarios need corresponding rate-limiting protection. For example, single-machine rate limiting mainly protects a server's own resources, while cluster rate limiting mainly protects underlying storage resources such as the DB or cache; other rate-limiting requirements are meant to ensure the system is effectively protected when abnormal traffic appears.
3. Hystrix automatic circuit breaking: simple indicators such as exception counts and thread counts can be monitored to quickly keep our services' health from deteriorating sharply (a combined sketch follows Figure 9).
Figure 9: Core functions of the Service Protection & Failure Drill module
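To make the three mechanisms concrete, here is a minimal sketch around a hypothetical recommendation dependency on the ordering path. Hystrix is the circuit breaker named above; the degradation switch and the single-machine rate limiter (Guava's RateLimiter, used here purely as an illustration) stand in for what a config center and the real rate-limit component would provide.

```java
import com.google.common.util.concurrent.RateLimiter;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// A sketch only: the recommendation dependency, switch, and limit values are invented.
public class RecommendationCommand extends HystrixCommand<String> {

    // Hypothetical manual degradation switch; in practice it would be driven by a config center.
    public static volatile boolean DEGRADE_RECOMMENDATION = false;

    // Single-machine rate limit, via Guava's RateLimiter purely as an illustration.
    private static final RateLimiter LIMITER = RateLimiter.create(200.0); // permits per second

    private final long userId;

    public RecommendationCommand(long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        if (DEGRADE_RECOMMENDATION || !LIMITER.tryAcquire()) {
            // Manual degradation or local rate limiting: skip the remote call entirely.
            return defaultRecommendations();
        }
        return "recommendations-for-" + userId; // placeholder for the real remote call
    }

    @Override
    protected String getFallback() {
        // Hystrix calls this automatically when run() fails, times out,
        // or the circuit breaker has opened.
        return defaultRecommendations();
    }

    private String defaultRecommendations() {
        return "[]"; // a safe, empty result
    }
}
```

Executing new RecommendationCommand(userId).execute() returns the safe default whenever the switch is on, the local limiter rejects the request, or the circuit breaker is open.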
Problem Solving Six: Improving the benefits of full-link load testing
1. We regularly organize full-link load tests of the takeout system, and each one requires the cooperation of many people. If load tests could be run for a single scenario, the cost of organizing them would drop dramatically.
2. During full-link load testing, failure drills are run against the test traffic for different scenarios; while the failure is being injected, we verify that the service protection plan kicks in and protects the services as expected.
Figure 10: Benefits of improving full-link load testing
The Road to Automation
The sections above introduced the core functions our business-oriented operations system needs; the following highlights where our automation work has mainly focused.
1. Automatic detection of anomaly points
Anomaly points are detected by analyzing historical data, and alarm thresholds are calculated and set automatically.
Figure 11: Detection of anomaly points
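The text does not spell out the detection algorithm; as one hedged illustration, a simple mean-plus-k-sigma rule over historical samples could produce the automatically calculated threshold.

```java
import java.util.List;

// A minimal sketch: a mean-plus-k-sigma rule stands in for the actual detection algorithm.
public class ThresholdCalculator {

    public static double upperThreshold(List<Double> history, double k) {
        double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = history.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        return mean + k * Math.sqrt(variance); // values above this are flagged as anomaly points
    }

    public static boolean isAnomaly(double currentValue, List<Double> history) {
        return currentValue > upperThreshold(history, 3.0); // classic three-sigma rule
    }
}
```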
2. Automatic triggering of service protection
By diagnosing anomalies precisely from the various monitoring indicators, and associating the identified anomaly scenarios with our service protection plans ahead of time, the triggering of service protection plans can be automated.
Figure 12: Anomaly detection and service protection linkage
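A minimal sketch of pre-associating diagnosed anomaly scenarios with protection plans so they can be triggered automatically; the scenario names and actions are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative mapping from diagnosed anomaly scenarios to service protection plans.
public class ProtectionPlanDispatcher {

    private final Map<String, Runnable> plans = new HashMap<>();

    public ProtectionPlanDispatcher() {
        // Registered in advance, before any incident happens.
        plans.put("CACHE_CLUSTER_FAILURE", () -> System.out.println("degrade: read from DB with rate limit"));
        plans.put("DOWNSTREAM_TIMEOUT_SPIKE", () -> System.out.println("open circuit breaker, return defaults"));
        plans.put("TRAFFIC_SURGE", () -> System.out.println("enable cluster-level rate limiting"));
    }

    // Called after the diagnosis step has identified the anomaly scenario.
    public boolean trigger(String diagnosedScenario) {
        Runnable plan = plans.get(diagnosedScenario);
        if (plan == null) {
            return false; // unknown scenario: leave the decision to a human
        }
        plan.run();
        return true;
    }
}
```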
3. Automation of the load-test plan
We need to prepare the following:
Based on transformed real traffic, basic data construction, data desensitization, data verification, and so on should be completed ahead of time through scheduled tasks as much as possible.
During the traffic replay phase, we can trigger the failure plan for a typical fault scenario (for example, a Tair fault).
At the same time, we can use the relationship data of the core link to accurately locate the problem nodes strongly related to the fault scenario.
Based on the service protection relationships established in advance for typical fault scenarios, we automatically trigger the corresponding service protection plans.
Throughout the process, we finally need to confirm that performance in each environment meets our expectations, which requires corresponding monitoring log output at each stage and an automatically generated final load-test report (a pipeline sketch follows Figure 13).
Figure 13: Automation of the load-test plan
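A minimal sketch of how the stages above could be orchestrated, with stage-level log output and a placeholder for the final report; the stage names follow the text, and the actions are stubs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative orchestration of the load-test stages described above.
public class LoadTestPipeline {

    public static void main(String[] args) {
        Map<String, Runnable> stages = new LinkedHashMap<>();
        stages.put("prepare data (build, desensitize, verify)", () -> {});
        stages.put("replay transformed real traffic", () -> {});
        stages.put("inject fault for the target scenario (e.g. Tair failure)", () -> {});
        stages.put("auto-trigger the associated service protection plan", () -> {});
        stages.put("collect stage metrics and generate the final report", () -> {});

        for (Map.Entry<String, Runnable> stage : stages.entrySet()) {
            System.out.println("[load-test] start: " + stage.getKey()); // stage-level log output
            stage.getValue().run();
            System.out.println("[load-test] done:  " + stage.getKey());
        }
    }
}
```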
Conclusion: Where Automation Construction Goes Next
In building the whole operations system, only by locating the root problem node more precisely and diagnosing the root cause can we gradually automate certain operations (such as triggering degradation switches or expanding clusters). As shown in Figure 14, we will keep investing in refining these capabilities. The goal is that when an anomaly in any dimension is detected, we can determine, looking upward, which business indicators and aspects of the user experience may be affected, and, looking downward, use full-link load testing to do accurate capacity planning and save resources.
Figure 14: Where automation construction goes next