The behind-the-scenes heroes of Double 11: the value of next-generation operations and maintenance

Source: Internet
Author: User

"Double Eleven" has just ended. The most nervous people are not the shop owners tallying their stock, nor the shoppers watching big-ticket items and waiting for the flash sales, but the operations and maintenance staff behind online shopping. What worries them most: network interruptions, stalled applications, slow responses, server outages...

Double Eleven is the top priority for an e-commerce IT department. Before the promotion, operations staff must prepare multiple contingency plans well in advance, and they work under constant tension through hundreds of simulation drills. Nobody knows how many sleepless nights they spend behind the scenes. A few years ago, it was normal for a flash sale to bring the servers down. Now, with hundreds of thousands of orders per second, the servers hold up. That is undoubtedly thanks to strong technology and the sleepless nights of the operations staff.

The seemingly simple Double Eleven involves coordinating and testing the entire commercial infrastructure, including payment, architecture, databases, networks, operations and maintenance, power, customer service, and logistics.

Those years of Double Eleven promotions

Tmall's Double Eleven first started in 2009, when the site was still called Taobao Mall. GMV for the day was only tens of millions, and there was no concept of a midnight rush. Before the promotion, engineers basically judged from experience: looking at the current server load and the application's current RT (response time) and QPS (queries per second), they estimated how much capacity each server could support, and then a few people settled on how many servers to add to the core applications. In truth nobody was really sure, so when in doubt they simply applied for more servers. Business volume at this stage was still small, so this approach got by.

In the following years, as the Tmall brand grew, Double Eleven volumes surged year after year, and the original operations approach no longer worked. The business was developing rapidly, the number of back-end applications had grown greatly, and the call chains between application systems had become complicated. How many resources should be prepared before the promotion? It could no longer be decided on a hunch: a request for too many resources would be refused, and a request for too few carried real risk. Online load testing solved this problem. For example, a single server could be taken directly from the production environment and load-tested, either by replaying simulated traffic or by directing several times the normal traffic at it, and the maximum load capacity of a single server could be calculated from the results. The numbers then justified the expansion request.

Even with capacity planning done, traffic at the midnight peak could still exceed expectations and overwhelm the system. So rate limiting and degradation were introduced. Rate limiting sets a maximum threshold for each application; once the threshold is exceeded, new requests are rejected immediately. This protects the application and avoids an avalanche. Degradation means that, because there are so many applications, some non-core functions can be switched off during the promotion to give the main transaction flow as much capacity as possible.

Load testing at that stage was not fully accurate. Its main limitation was that each application was tested in isolation, yet applications depend on one another, especially on shared service centers that nearly every application calls. What could be done about that?

In the next few years, a new load-testing tool was developed: full-link stress testing. This was a brand-new approach to capacity planning. It generates a large volume of simulated traffic directly in the production environment, so that every link in the chain is put under pressure, with a corresponding monitoring system to find where the bottlenecks are and optimize them quickly. This process is automated.
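
The single-server load-test arithmetic described above can be sketched as follows. This is a minimal illustration, not the method any particular team used: the function names, the `headroom` safety margin, and the example numbers are all assumptions introduced here.

```python
import math


def servers_needed(expected_peak_qps, single_server_max_qps, headroom=0.3):
    """Estimate how many servers an application needs at peak.

    single_server_max_qps is the maximum load one production server
    sustained in a load test. headroom (a hypothetical safety margin)
    is the fraction of that capacity deliberately left unused.
    """
    if single_server_max_qps <= 0:
        raise ValueError("single_server_max_qps must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = single_server_max_qps * (1 - headroom)
    return math.ceil(expected_peak_qps / usable_qps)


def servers_to_add(expected_peak_qps, single_server_max_qps,
                   current_servers, headroom=0.3):
    """Size of the expansion request; never negative."""
    needed = servers_needed(expected_peak_qps, single_server_max_qps, headroom)
    return max(0, needed - current_servers)


if __name__ == "__main__":
    # Illustrative numbers only: 100,000 QPS expected at peak,
    # 2,000 QPS measured on one server, 60 servers already running.
    print(servers_to_add(100_000, 2_000, current_servers=60, headroom=0.5))
```

Note that this only covers one application in isolation, which is exactly the limitation that full-link stress testing was meant to address.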

Clearly, automated operations and maintenance is the trend of the times.

The tactics behind the midnight rush

Today's e-commerce Double Eleven promotions still follow the midnight flash-sale model. For application stability, getting through the first 15 minutes, or even just the first few minutes, is the core assurance task. Experts in the operations field offer the following suggestions:

a. Capacity planning. Run load tests in the production environment as far as possible; only after a load test can you be confident in the numbers.

b. Critical applications need to support rate limiting. Midnight traffic is likely to exceed expectations, and only a rate limit can protect the application itself; otherwise an avalanche chain reaction will occur.

c. Degrade non-core features. Every Double Eleven consumes a lot of resources, and most of them go to the core transaction applications, so degrading non-core functions is acceptable to some extent.

d. Emergency plan. Prepare in advance for possible abnormal conditions.
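
Suggestions b and c can be sketched together in a few lines. This is a simplified, single-process illustration under stated assumptions: the class names, the fixed one-second window, and the feature names are hypothetical, and a production limiter would normally be distributed and thread-safe.

```python
import time


class ThresholdLimiter:
    """Fixed-window limiter: at most max_per_second requests are
    accepted in each one-second window; the rest are rejected at once."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self._clock = clock      # injectable so the limiter can be tested
        self._window = None      # index of the current one-second window
        self._count = 0          # requests accepted in that window

    def allow(self):
        window = int(self._clock())
        if window != self._window:
            # A new second has started: reset the counter.
            self._window = window
            self._count = 0
        if self._count >= self.max_per_second:
            return False
        self._count += 1
        return True


class DegradeSwitch:
    """On/off switches for non-core features during a promotion."""

    def __init__(self, non_core_features):
        self._enabled = {name: True for name in non_core_features}

    def degrade(self, name):
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = False

    def restore(self, name):
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = True

    def is_enabled(self, name):
        # Features not registered as non-core are treated as always on.
        return self._enabled.get(name, True)


def handle_request(feature, limiter, switches):
    """Return 'degraded', 'rejected', or 'served' for one request."""
    if not switches.is_enabled(feature):
        return "degraded"    # switched off; consumes no capacity
    if not limiter.allow():
        return "rejected"    # over the threshold; fail fast
    return "served"
```

Checking the degrade switch before the limiter means that requests for switched-off features do not use up the threshold reserved for the main transaction flow.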

The Double Eleven promotion is the most typical elasticity scenario

Elasticity is the biggest advantage of cloud computing, and a big promotion is the most typical elasticity scenario.

With the spread of cloud computing, especially public clouds, today's operations staff basically no longer need to attend to underlying facilities such as the machine room, network, and operating system. After continuous practice, today's e-commerce platforms have adopted elastic, scalable cloud computing platforms, together with distributed data and efficient CDN distribution for load balancing, to avoid collapse in the early hours of Double Eleven. Operations staff can shift more of their energy to rapid launches and rapid iteration in support of business development.

Traffic during a big promotion is on a completely different order of magnitude from everyday traffic. On-demand cloud resources can be fully used to meet the demand for capacity expansion, and the costs at stake are huge. Besides expansion, an emergency plan is of course also necessary: sort out the abnormal conditions that may occur on the day and rehearse them in advance.

Last year, only ten minutes after Tmall's Double Eleven opened, the world payment record was broken again. According to data released by Alipay, at 0:39:39 Alipay's payment peak reached 120,000 transactions per second, 1.4 times the previous year's peak and a new record. Among payment methods, Huabei and Yu'ebao proved very popular with shoppers, accounting for 29% and 18% of payments respectively.

Able to withstand huge transaction volumes and lightning-fast flash sales, with a technical system that holds up and stable liquidity and yields... Only what passes the ultimate test of Double Eleven can be regarded as the real thing!

Intelligent operation and maintenance can be realized by means of data and algorithms.

Operations and maintenance has developed through standardization, tooling, and automation, and is now entering the early stage of intelligence. Each stage represents a substantial increase in productivity and efficiency, and the overall trend is inevitable. Operations in the intelligent era is not about putting operations staff out of work. Rather, it answers a strong demand for greater efficiency: quickly locating problems and their root causes in a complicated environment, and even predicting faults so that failures are avoided and application stability is guaranteed.

Intelligent operations and maintenance can be realized by means of data (operations data) and algorithms. First, operations capability does not jump directly to the intelligent stage. It must first be standardized, tooled, and automated; only highly mature automation provides the basic capability. Second is data accumulation, which requires a large amount of operations data: log data, network packet captures, database data, and so on. Daily operations work also produces annotated data. For example, after a failure, the operations staff record how it was handled, and this record is fed back into the system, which in turn raises the level of operations. Finally there is the algorithm: deciding which type of algorithmic model to use, and optimizing it continuously.

In its operations department, Tianhong Fund wants to monitor the usage of the application system's basic resources through real-time monitoring of server performance logs. A client agent collects the CPU and memory usage of servers and cluster components, and the running status of resources is displayed visually.

Reportedly, the Tianhong Fund cloud log platform project has begun internal rollout and has been recognized by users since the system went into official operation. Its specific value to users is reflected in the following aspects:

Operations staff: The data desensitization function frees up manpower; the collection resource control function prevents the Agent program from affecting servers and applications, effectively avoiding catastrophic failures.

R&D staff: The log query function makes querying log files quick and easy; call chain analysis helps developers quickly locate a fault and its cause, and helps the R&D team optimize system code and manage the architecture.

Service personnel: The monitoring alarm function can detect service faults in time, minimize the fault response time, and improve the user service experience.

Managers: Intelligent operations and maintenance tracks the running status of service resources in real time, and can predict the cluster water level and suggest resource expansion.
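
As a rough illustration of what predicting a cluster water level might involve, the sketch below fits a straight line to recent utilization samples and extrapolates it. The source does not say which algorithm the platform uses; the least-squares trend, the function names, and the 0.8 threshold are all assumptions made for this example.

```python
def fit_line(samples):
    """Least-squares line through evenly spaced samples.

    samples[i] is the utilization (0.0 to 1.0) observed at step i.
    Returns (slope, intercept).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_water_level(samples, steps_ahead):
    """Extrapolate the fitted trend steps_ahead steps past the last sample."""
    slope, intercept = fit_line(samples)
    return intercept + slope * (len(samples) - 1 + steps_ahead)


def suggest_expansion(samples, steps_ahead, threshold=0.8):
    """True if the forecast reaches the (hypothetical) alert threshold."""
    return forecast_water_level(samples, steps_ahead) >= threshold


if __name__ == "__main__":
    # Illustrative hourly CPU utilization samples, not real data.
    history = [0.40, 0.45, 0.50, 0.55]
    print(forecast_water_level(history, steps_ahead=4))
    print(suggest_expansion(history, steps_ahead=6))
```

A straight-line trend is the simplest possible model and ignores daily cycles and promotion spikes, so a real system would likely need something more robust.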

In closing

In addition to the above, every operations team also needs to prepare an on-duty plan in advance, planning in detail for the emergencies that may occur on the day of Double Eleven and the key points to watch in each time period. In short, every year's Double Eleven is a test. Details determine success or failure, so operations staff must attend to every detail. Only with enough drills and preparation can a team cope with each year's Double Eleven promotion.

As of 0:00 on November 12, Tmall's 2017 "Double Eleven" transaction volume settled at RMB 168.269 billion. What technical systems lie behind these incredible numbers, with sales, peak transaction rates, and peak payment rates constantly setting new records? Intelligence is gradually entering every part of the IT industry and even of social life. In the future, big data correlation analysis and machine learning will give the operations system artificial intelligence, providing intelligent assurance from fault prevention to fault location to closed-loop fault resolution. Perhaps by then, operations engineers will also be able to enjoy Double Eleven at ease and simply shop, shop, shop!
