Analysis of the challenges faced by the Flying Open Platform

Last Update:2014-08-12 Source: Internet

Author: User

Keywords: large-scale distributed system, Flying Open Platform

1. The challenges of testing Flying

The Flying Open Platform is built on the Flying large-scale distributed computing system (Flying for short). Flying aims to assemble thousands of PCs into a single "supercomputer" that provides general-purpose computing, task scheduling, and other services to a wide variety of open services and cloud applications. Two characteristics define the platform, generality and large scale, and both give rise to a series of testing challenges.

Challenge one: The conflict between the complexity of platform software and the pace of Internet releases. Flying contains many complex distributed modules, and the protocol dependencies between modules multiply that complexity. Under a traditional software development process, it typically takes one to two years to release a stable version of reliable quality. A release rhythm that slow clearly cannot meet the needs of the upper-level open services and rapidly developing cloud applications.

Challenge two: A general-purpose platform supports many different applications, which causes an explosion in the number of test cases. For Flying, different data volumes, usage scenarios, cluster sizes, and request loads can exercise completely different code paths, so the pressure points on the system differ as well. Whether we try to cover every way that every application uses Flying, or enumerate the combinations of module interfaces from the design, the test case design does not converge. When test case design reaches a dead end, can we find another way?

Challenge three: Problems of large-scale production clusters have to be exposed on small-scale test clusters. In Alibaba's data centers, a Flying production cluster is composed of thousands of physical machines. For cost reasons, a test cluster is usually no more than one-tenth the size of a production cluster. Statistics show that distributed environments of 100 and 1,000 machines differ greatly in hardware and software failure rates, pressure bottlenecks, data volume, and network performance. Conventional test methods make it difficult to find large-scale problems on small clusters.

Next, we describe how Flying's testing practice currently copes with these three challenges.

2. Tiered testing and continuous integration

At present, the bottom-layer modules of Flying are released every six months, while the upper-layer modules are released more frequently, as often as every three weeks. This release rhythm relies mainly on layered testing and continuous integration. By test level, Flying testing is divided into unit testing, functional testing, system testing, integration testing, and end-to-end (E2E) testing. To make the quality of a new version converge faster, almost every member of the Flying team, developer or tester, takes part in these types of testing.

Generally speaking, a product only undergoes functional testing and system testing against its external interfaces. Because the Flying modules are themselves distributed, however, each module has the complexity of a traditional software product. The module team is therefore responsible not only for unit testing but also for functional testing and system testing of its module. Within a module team, developers are responsible for unit testing and also take on functional testing and system testing of local features, while testers focus more on test design and module-level system testing.

Flying has an integration testing team that is independent of the module teams, with two main responsibilities. First, it runs regression test sets under continuous integration to ensure that each module change integrates and works well in the system; if a change causes a quality regression that cannot be repaired in a short time, the change is immediately disabled or rolled back. To support continuous integration, the regression test sets at every level are automated as far as possible and run at a defined frequency, and module changes are designed to be easy to disable or roll back. Second, the integration testing team carries out platform-level system testing: it subjects Flying to every kind of torture it can devise in order to examine the functionality, performance, and capacity of the underlying modules, as well as system stability and service availability in extreme and typical application scenarios.

Before a new version goes online, the open service teams typically run E2E tests against the version that has passed integration testing. E2E testing is responsible for understanding the specific requirements of the application. These cover not only interface functionality but also data volume, machine scale, throughput, latency, the daily or weekly business volume curve, and so on. E2E testing typically constructs near-complete scenarios that simulate or reproduce real data conditions and load characteristics as closely as possible, and it must pass a long period of stability testing. Some upper-level applications also keep a standing staging environment so that E2E validation can be done at any time. In general, a version can be deployed to an application's production cluster only after it passes the final E2E test.

To ensure the quality of the testing itself, each layer measures test coverage differently. Unit testing uses coverage tools to measure line coverage and branch coverage, and functional testing generally measures function point coverage; both are well known. In addition, we designed a special coverage metric, log coverage, for system testing, integration testing, and E2E testing. The log coverage tool judges the adequacy of testing by measuring how much of the log output is produced during a test run. By comparing the logs of a production cluster with the logs of a test run, we find the places we have not yet tested. By checking for error logs in the code that were never printed, we also learn how much exception-handling logic remains untested.
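The log coverage idea can be illustrated with a minimal Python sketch. It is not the Flying tool itself. It assumes a hypothetical convention in which every log statement in the source code carries a stable ID such as [LOG-0042], so that statements declared in the code can be matched against lines seen in test logs and production logs. The function names and the ID format are illustrative assumptions.

```python
import re

# Assumption: each log statement carries a stable ID like "[LOG-0042]".
LOG_ID = re.compile(r"\[(LOG-\d+)\]")


def ids_in(lines):
    """Collect the set of log-statement IDs found in an iterable of text lines."""
    found = set()
    for line in lines:
        found.update(LOG_ID.findall(line))
    return found


def log_coverage(source_lines, test_log_lines, production_log_lines=()):
    """Compare log statements declared in source with those actually printed."""
    declared = ids_in(source_lines)
    tested = ids_in(test_log_lines) & declared
    in_production = ids_in(production_log_lines) & declared
    ratio = len(tested) / len(declared) if declared else 1.0
    return {
        # Fraction of declared log statements printed at least once in tests.
        "coverage": ratio,
        # Statements never printed in tests (often untested error paths).
        "never_printed_in_test": sorted(declared - tested),
        # Statements that production has hit but tests have not.
        "seen_in_production_only": sorted(in_production - tested),
    }


if __name__ == "__main__":
    source = [
        'LOG_INFO("[LOG-0001] chunk written")',
        'LOG_ERROR("[LOG-0002] replica lost")',
        'LOG_ERROR("[LOG-0003] lease expired")',
    ]
    test_logs = ["12:00:01 [LOG-0001] chunk written"]
    prod_logs = ["09:30:12 [LOG-0003] lease expired"]
    print(log_coverage(source, test_logs, prod_logs))
```

The "seen_in_production_only" list corresponds to the production-versus-test comparison described above, and "never_printed_in_test" corresponds to the check for error logs that were never printed.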

3. Exploratory testing and gray-box testing based on monitoring

As mentioned in Challenge two, the explosion of test cases has been a constant struggle for us. We found that black-box test design cannot enumerate every situation, and even if we could design a sufficiently complete set of test cases, we would not have enough machines, people, or time to run them.

Fortunately, we have exploratory testing, and the monitoring system has become our guide for where to explore. Flying has a detailed monitoring system that tracks metrics for the entire cluster. These include OS-level metrics, but most are statistics that the Flying modules report about themselves by calling the monitoring system API. The statistics drive monitoring and alerting on the online system, and they also provide a basis for exploratory testing. A tester can explore by changing test parameters and watching how the metrics respond, so that a single stress test case can be run in a variety of extreme scenarios that are close to real applications. In general, testers are more likely to find hidden bugs by following metrics that behave abnormally when the load changes or as time passes. In platform-level system testing, monitoring the error logs inside the modules also gives good results.
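One simple way to spot metrics that behave abnormally over time is to compare each metric's recent average with its average at the start of the run. The Python sketch below assumes metric samples have already been collected into plain lists. The function name, window size, and threshold are hypothetical choices for illustration, not part of the Flying monitoring API.

```python
from statistics import mean


def drifting_metrics(samples_by_metric, window=5, threshold=0.5):
    """Flag metrics whose recent average moved away from their initial average.

    samples_by_metric maps a metric name to a time-ordered list of numbers.
    A metric is flagged when the relative change between the first `window`
    samples and the last `window` samples exceeds `threshold` (0.5 = 50%).
    """
    flagged = {}
    for name, samples in samples_by_metric.items():
        if len(samples) < 2 * window:
            continue  # not enough data to compare two windows
        baseline = mean(samples[:window])
        recent = mean(samples[-window:])
        if baseline == 0:
            change = float("inf") if recent != 0 else 0.0
        else:
            change = (recent - baseline) / abs(baseline)
        if abs(change) > threshold:
            flagged[name] = change
    return flagged


if __name__ == "__main__":
    run = {
        "master_memory_mb": [100, 100, 100, 100, 100, 200, 200, 200, 200, 200],
        "requests_per_s": [1000] * 10,
    }
    print(drifting_metrics(run))
```

A flagged metric is only a hint about where to explore next; deciding whether the drift is a bug still depends on the tester's understanding of the module.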

Exploratory testing is important, but exploration alone is not enough. A tester who does not understand the internal logic of the system tends to fall into one of two traps. In the first, the tester verifies only the designed scenarios; other unusual phenomena cannot be explained and are not part of the validation criteria, so many hidden problems end up breaking out online. In the second, the tester explores aimlessly, like a headless fly, and wastes time without achieving good results. In Flying testing, we require testers to join the developers in discussions from the design stage, or even from the requirements stage, because testers need to understand the design principles of the system.

Once testers understand how the distributed system works, they know how to do effective gray-box testing. For example, they know at which key points request pressure is most effective, and at which moments it is easiest and most thorough to assert on test results. They even know how to inject code into a module's process to simulate protocol packet loss, a problem that occurs with low probability and is hard to reproduce.
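Packet loss injection of this kind can be sketched in Python as a wrapper around a module's send function that silently drops a configurable fraction of messages. The class name and its parameters are hypothetical; the article does not describe how Flying actually injects such faults.

```python
import random


class LossyChannel:
    """Wraps a send callable and silently drops a fraction of messages."""

    def __init__(self, send, drop_probability, seed=None):
        if not 0.0 <= drop_probability <= 1.0:
            raise ValueError("drop_probability must be between 0 and 1")
        self._send = send
        self._drop_probability = drop_probability
        # A private, seedable generator makes a failing run reproducible.
        self._rng = random.Random(seed)
        self.dropped = 0
        self.delivered = 0

    def send(self, message):
        """Deliver the message, or drop it; return True if it was delivered."""
        if self._rng.random() < self._drop_probability:
            self.dropped += 1
            return False
        self._send(message)
        self.delivered += 1
        return True


if __name__ == "__main__":
    received = []
    channel = LossyChannel(received.append, drop_probability=0.01, seed=42)
    for sequence_number in range(10_000):
        channel.send(sequence_number)
    print(channel.delivered, channel.dropped)
```

Fixing the seed lets a tester replay the exact sequence of drops that exposed a bug, which matters for failures that otherwise occur too rarely to reproduce.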

A typical example comes from the early days, when we were developing a table-structured storage engine in-house. The test program always passed its final consistency validation, yet when the business team ran E2E tests with us, data was read incorrectly with very low probability. The testers were baffled. They finally discovered that each write modifies the data in three places. Because of timing and locking problems, the write returned success after only two places had been modified, and the third place, in memory, was refreshed to the correct value only a short while later. Had the testers known about this design, they would have built real-time data consistency checks into the write path, and the problem would not have had to wait for the more expensive E2E test, where it was far less efficient to solve.
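The difference between final validation and real-time validation can be shown with a toy Python model. ThreeCopyStore below is a hypothetical stand-in for the storage engine, not its real design: in buggy mode it acknowledges a write after updating two copies and refreshes the third only when flush() is called.

```python
class ThreeCopyStore:
    """Toy model of a store that keeps every value in three places."""

    def __init__(self, buggy=False):
        self.copies = [{}, {}, {}]
        self.buggy = buggy
        self._pending = []  # updates not yet applied to the third copy

    def write(self, key, value):
        self.copies[0][key] = value
        self.copies[1][key] = value
        if self.buggy:
            # Bug: success is returned before the third copy is updated.
            self._pending.append((key, value))
        else:
            self.copies[2][key] = value
        return True

    def flush(self):
        """Simulate the delayed refresh of the third copy."""
        for key, value in self._pending:
            self.copies[2][key] = value
        self._pending.clear()

    def read_all(self, key):
        return [copy.get(key) for copy in self.copies]


def write_and_verify(store, key, value):
    """Write, then immediately check every copy; return indexes of stale copies."""
    store.write(key, value)
    return [i for i, seen in enumerate(store.read_all(key)) if seen != value]


if __name__ == "__main__":
    store = ThreeCopyStore(buggy=True)
    print("stale right after write:", write_and_verify(store, "row1", "v2"))
    store.flush()
    print("copies at final validation:", store.read_all("row1"))
```

In this model a check that runs only at the end sees three identical copies, while the check made immediately after the acknowledged write catches the stale third copy.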

4. Long-running stability testing with background pressure and random fault simulation

As mentioned in Challenge three, it is difficult to find the problems of large-scale production clusters on small test clusters. We have identified two main causes:

1. The small probability of failure on a single machine is amplified by the number of machines in a large cluster. The hardware failure rate of the whole cluster rises linearly with cluster size, and the probability that several failures occur at the same time increases greatly.

2. Increasing the number of machines shifts the pressure points of the Flying modules. In the case of Elastic Compute Service (ECS), at a scale of 300 to 600 machines we found that the file system master, which we had originally worried about, did not become a QPS bottleneck; the naming service responsible for lock file coordination became the first bottleneck instead.
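The first cause can be made concrete with a short calculation. The Python sketch below assumes that machines fail independently with the same probability, which is a simplification, and the per-machine failure probability of 0.001 in the usage example is a made-up figure rather than a measured one.

```python
from math import comb


def expected_failures(n_machines, p_fail):
    """Expected number of failed machines; grows linearly with cluster size."""
    return n_machines * p_fail


def prob_at_least(n_machines, p_fail, k):
    """Probability that at least k machines fail in the same period.

    Assumes independent failures, so the failure count is binomial.
    """
    below_k = sum(
        comb(n_machines, i) * p_fail**i * (1 - p_fail) ** (n_machines - i)
        for i in range(k)
    )
    return 1 - below_k


if __name__ == "__main__":
    p = 0.001  # hypothetical per-machine failure probability for one period
    for n in (100, 1000):
        print(
            n,
            expected_failures(n, p),
            prob_at_least(n, p, 1),
            prob_at_least(n, p, 2),
        )
```

Under these assumptions, going from 100 to 1,000 machines raises the chance of at least one failure in a period from roughly 9.5% to roughly 63%, and the chance of two or more simultaneous failures from about 0.5% to about 26%. This is why fault combinations that almost never appear on a small test cluster are routine in production.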

In addition, the number of possible failure combinations makes the cost of designing and implementing complete fault-tolerance tests too high.

To overcome these difficulties, our testing practice has gradually built up a long-running stability test scheme with background pressure and random fault simulation.

Background pressure mainly means read and write load on the underlying modules, usually simulating one class of application scenario. Using divide-and-conquer and system emulation, we implemented lightweight load tools so that the master and slave machines of each module receive access pressure and connection counts similar to those on a large-scale production cluster. We also consume additional resources such as CPU, memory, and network to simulate a busy production environment in which machine resources are tight. Fault simulation has two parts. First, we use software tools to simulate as many kinds of hardware failure as possible, for example disk errors (bad disks, read-only disks, and so on), machine crashes and restarts, network disconnection, switch restarts, and restarts or hangs of a module's main process. Second, these fault operations are combined at random according to preset proportions. Under this background pressure and these random faults, we run the simulated upper-level applications or jobs continuously for a long time (at least 7x24 hours) and use the online monitoring system to check whether Flying is behaving normally.
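Combining faults at random according to preset proportions can be sketched in Python as a schedule generator. The fault names, weights, and function signature below are illustrative assumptions; the sketch only produces a plan of (time, fault, target) entries and leaves the actual fault execution to whatever tooling is available.

```python
import random

# Hypothetical proportions; a real run would tune these per scenario.
FAULT_WEIGHTS = {
    "disk_error": 30,
    "machine_down": 10,
    "machine_restart": 20,
    "network_break": 15,
    "switch_restart": 5,
    "process_restart": 15,
    "process_hang": 5,
}


def build_fault_schedule(targets, duration_s, mean_interval_s,
                         weights=FAULT_WEIGHTS, seed=None):
    """Return a time-ordered list of (time_s, fault, target) tuples.

    Faults are drawn in proportion to `weights`, and the gaps between them
    follow an exponential distribution with the given mean.
    """
    rng = random.Random(seed)  # seedable so a run can be replayed
    kinds = list(weights)
    proportions = [weights[kind] for kind in kinds]
    schedule = []
    t = rng.expovariate(1.0 / mean_interval_s)
    while t < duration_s:
        fault = rng.choices(kinds, weights=proportions, k=1)[0]
        target = rng.choice(targets)
        schedule.append((t, fault, target))
        t += rng.expovariate(1.0 / mean_interval_s)
    return schedule


if __name__ == "__main__":
    nodes = [f"node{i:03d}" for i in range(50)]
    week = 7 * 24 * 3600  # the 7x24-hour run mentioned above
    plan = build_fault_schedule(nodes, week, mean_interval_s=600, seed=2014)
    for when, fault, target in plan[:5]:
        print(f"{when:10.1f}s  {fault:16s} {target}")
```

Recording the seed with each run means that the fault sequence from a failing 7x24-hour test can be regenerated during diagnosis.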

In short, background pressure addresses the shift of pressure points, long running time compensates for the low probability of failures on a small cluster, and the random combination of simulated faults keeps down the cost of designing and implementing fault-tolerance tests.

Practice has shown that many important bugs in Flying were found through this kind of testing. Its drawback is that investigating a problem takes a long time and requires testers to have a deep understanding of the system and strong diagnostic skills.

5. Concluding remarks

Testing large-scale distributed systems is a challenging task. Although we keep running tests at every level, accumulating practical experience, and inventing new test methods, the limits of test conditions and the complexity of production environments mean that testing alone cannot eliminate all software problems. This article covers only some of our thinking and practice on testing Flying in the lab. A complete quality assurance system for Flying needs both testing in the lab and testing in production, and that is the direction in which Flying testing continues to work.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
