A Closer Look at Fuxi: Scheduling and Performance Optimization for a 5,000-Node Cluster

Source: Internet
Author: User
Keywords Aliyun Fuxi

The 5K project was a milestone for the Apsara platform: the system made a leap in scale, performance, and fault tolerance and reached a world-leading level. Fuxi, the platform's distributed scheduling system, can support 5,000 nodes in a single cluster with 10,000 running jobs, and completed a 100 TB Terasort in 30 minutes, twice the performance of the Sort Benchmark world record then held by Yahoo!.

About Fuxi

Apsara (Feitian, literally "flying") is Alibaba's cloud computing platform, and its distributed scheduling system is named Fuxi after a figure from ancient Chinese mythology. Fuxi is mainly responsible for managing cluster machine resources and scheduling concurrent computing tasks. It currently supports offline data processing (DAG jobs) and online services (Service), and provides stable, efficient, and secure resource management and task scheduling for upper-tier distributed applications such as ODPS, OSS, and OTS. It gives Alibaba Group a powerful computing engine for building its first platform for data sharing.

Fuxi uses a master/slave (M/S) architecture (as shown in Figure 1). The system has a cluster control center called Fuxi Master, and every other machine runs a daemon called Fuxi Agent. Besides managing the tasks running on its node, the daemon collects the node's resource usage and reports it to the control center. A heartbeat mechanism between the control center and each Fuxi Agent monitors node health. When a user submits a task to Fuxi Master, Fuxi Master picks an available node and starts the task's master process, AppMaster, on it. AppMaster then submits resource requests to Fuxi Master; once Fuxi Master has allocated resources, AppMaster notifies the Fuxi Agent on the corresponding node to start the task's workers. Fuxi is a multi-task scheduling system, and the control center, Fuxi Master, arbitrates among tasks, supporting priorities, resource quotas, and preemption.

  

With Fuxi, users can run common MapReduce tasks and can also host online services, which covers the needs of different application scenarios. Multiple users can share one cluster: Fuxi supports configuring resource quotas per group, limiting the computing resources each user group may use. Urgent tasks, such as important data reports, can be given a higher priority so that they get computing resources first.

5K Challenges

During the 5K project we saw that every step of a large-scale cloud computing platform, from design to implementation, may hide a performance "trap". There are three main reasons. The first is the scale amplification effect: when the system grows to thousands of nodes, a step that was not a bottleneck before but whose cost is proportional to scale has its impact magnified. The second is the barrel effect: often 99% of the system has already been optimized and finishing the remaining 1% looks like mere icing on the cake, yet that 1% may well be the fatal bottleneck for system performance. The third is long-path module dependencies: handling some requests spans multiple modules (including external ones), and unstable performance in an external module can ultimately affect the performance and stability of the whole request.

The 5K project was an all-out campaign that brought the Fuxi system technical challenges in scale, performance, stability, operations, and other areas, including the following performance "traps".

Communication message DDoS: In a 5,000-node cluster, the number of RPC requests between processes grows with scale, and the total number of requests on the network can reach 10,000 QPS. This easily congests single-point processes in the system with messages and causes severe request timeouts. Message processing also suffers from head-of-line (HoL) blocking.

Key function OPS: Fuxi Master is the central node for resource scheduling, and the OPS (operations per second) of its key internal scheduling functions must meet a very high standard. Otherwise, through the barrel effect, it can drag down the scheduling performance of the whole cluster.

Failover's dependence on external modules: Fuxi Master has a user-transparent failover function whose recovery process relies on the checkpoint written to Nuwa (note: Nuwa is the coordination system of the Apsara platform and provides services such as naming). The overall recovery speed is therefore affected by Nuwa's access speed.

We did a great deal of optimization work on Fuxi to avoid these performance "traps", covering architecture design, implementation details, and module dependencies. In each case we looked past the symptom to the underlying cause, starting from low-level performance analysis and tracking down the bottleneck step by step. The sections below walk through the optimization process with concrete examples.

Fuxi Optimization in Practice

Communication performance optimization

Early in the 5K project, when we tested large numbers of concurrent jobs, we found that jobs took longer to run once their number exceeded 1,000. Monitoring curves and logs showed that a large number of resource request messages from AppMaster to Fuxi Master were timing out: AppMaster was slow to obtain resources, and resource request processing latency was very high.

The total time from a message's arrival at the Fuxi Master process to the end of its processing consists mainly of the waiting time in the queue and the actual processing time, so high latency can have only two causes: the OPS of message processing itself has dropped, or messages are piling up in the pending queue without being handled in time. Following this line of reasoning, we found that the key resource scheduling function in Fuxi Master did not account for most of the message processing latency, which left the message backlog as the only culprit. After plotting the backlog curve of the resource scheduling message queue in Fuxi Master, we found that as the number of jobs increased, requests piled up in large numbers (as shown in Figure 2) and the time to handle each request rose sharply.

  

Why do so many messages pile up in the Fuxi Master queue? In the Fuxi system, both the Fuxi Agent daemon and AppMaster need to query resource state from Fuxi Master, which is in charge of resource scheduling. Their communication strategy is periodic polling, by default one query per second. Polling was chosen mainly for its simplicity: it copes well with network failures, and message delivery is natural and regular. In a 5,000-node cluster, however, this strategy has to be tuned; otherwise the flood of requests amounts to a "DDoS attack" on Fuxi Master and leaves it unable to serve.

Having located the message backlog problem, we immediately added flow control to the communication strategy. The algorithm is simple and effective. The sender checks whether the response to its previous request has come back. If it has, Fuxi Master is processing requests smoothly, and the sender issues its next query after a short interval. Conversely, if the previous request timed out, Fuxi Master is busy (for example, a task has just released a large amount of resources that must be processed), and the sender waits longer before sending its next request. This adaptive flow control effectively solved the message backlog in Fuxi Master.
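The adaptive flow control described above can be sketched roughly as follows. This is a minimal illustration, not Fuxi's actual code: the class name AdaptivePoller, the doubling back-off, and the 30-second cap are assumptions made for the example. Only the one-second default interval and the rule "short wait after a reply, longer wait after a timeout" come from the text.

```python
from dataclasses import dataclass


@dataclass
class AdaptivePoller:
    """Chooses the delay before the next poll from the outcome of the last one."""

    min_interval: float = 1.0   # seconds; the default polling period in the article
    max_interval: float = 30.0  # hypothetical upper bound on the back-off
    backoff: float = 2.0        # hypothetical multiplier applied after a timeout

    def __post_init__(self) -> None:
        self.interval = self.min_interval

    def next_interval(self, last_request_timed_out: bool) -> float:
        """Return how long to wait before sending the next request."""
        if last_request_timed_out:
            # The master looks busy: wait longer, up to the cap.
            self.interval = min(self.interval * self.backoff, self.max_interval)
        else:
            # The last reply arrived in time: return to the short interval.
            self.interval = self.min_interval
        return self.interval


poller = AdaptivePoller()
outcomes = [False, True, True, True, False]  # True means the request timed out
print([poller.next_interval(o) for o in outcomes])  # [1.0, 2.0, 4.0, 8.0, 1.0]
```

Because each sender slows down on its own, the load on the master drops exactly when it is overloaded, with no extra coordination messages.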

We also solved the head-of-line (HoL) blocking problem for Fuxi Master messages. AppMaster has to communicate with Fuxi Master to obtain resource scheduling results, and with Fuxi Agents to start and stop workers. Because Fuxi Agents far outnumber Fuxi Master, in extreme cases, if AppMaster handles all of these messages in the same thread pool, a Fuxi Master message gets stuck behind a large number of Fuxi Agent messages. We profiled the whole message path, from sending through processing, and confirmed the head-of-line blocking. When a task has many workers, AppMaster has to talk to more Fuxi Agents, and we observed that the time AppMaster needed to obtain resources grew noticeably. To fix this, we added a dedicated-thread feature to the communication component to achieve a QoS effect, and applied it to AppMaster's handling of Fuxi Master messages. As shown in Figure 3, Fuxi Master messages use a thread pool of their own, while all other messages share another.
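The idea of separate thread pools can be illustrated with Python's standard concurrent.futures module. The sketch below is only a model of the technique: MessageDispatcher, the sender labels, and the handler functions are hypothetical names, and the pool sizes are arbitrary. Agent handlers are deliberately held on an event to simulate a backlog, yet the master message is still processed at once because it never queues behind them.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MessageDispatcher:
    """Routes each message to a thread pool chosen by its sender."""

    def __init__(self, master_threads: int = 1, other_threads: int = 2) -> None:
        # Dedicated pool: only Fuxi Master messages run here.
        self.master_pool = ThreadPoolExecutor(max_workers=master_threads)
        # Shared pool: Fuxi Agent messages and everything else.
        self.shared_pool = ThreadPoolExecutor(max_workers=other_threads)

    def dispatch(self, sender: str, handler, *args):
        pool = self.master_pool if sender == "fuxi_master" else self.shared_pool
        return pool.submit(handler, *args)

    def shutdown(self) -> None:
        self.master_pool.shutdown(wait=True)
        self.shared_pool.shutdown(wait=True)


release = threading.Event()


def handle_agent_message(worker_id: int) -> str:
    release.wait()  # simulates a slow agent message occupying a thread
    return f"worker {worker_id} started"


def handle_master_message(grant: str) -> str:
    return f"received {grant}"


dispatcher = MessageDispatcher()
agent_futures = [
    dispatcher.dispatch("fuxi_agent", handle_agent_message, i) for i in range(100)
]
master_future = dispatcher.dispatch(
    "fuxi_master", handle_master_message, "grant for 10 workers"
)
# The master message completes even though every agent message is still stuck.
print(master_future.result(timeout=5))
print(sum(f.done() for f in agent_futures), "agent messages finished so far")
release.set()          # let the simulated agent backlog drain
dispatcher.shutdown()
```

With a single shared pool, the master message would sit behind all 100 agent messages and the five-second wait would expire.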

  

These two optimizations significantly reduced communication pressure inside the Fuxi system and improved communication efficiency. Resource request communication between AppMaster and Fuxi Master became smoother, so tasks are assigned resources and start running quickly after submission, which speeds up task processing when many tasks run concurrently. For example, users issuing SQL queries over massive data through the ODPS client get faster query processing as a result.

Key function optimization

In the 5K project we also paid close attention to the performance of key functions in the system, where "traps" may likewise be hidden. A key operation in Fuxi Master's resource scheduling is to compare a node's free resources against all resource requests queued on that node in order to decide which task gets the resources. The number of calls to this function is proportional to the number of machines and the number of requests, so its speed has a decisive effect on Fuxi Master's scheduling OPS.

Fuxi schedules resources along multiple dimensions, such as memory, CPU, network, and disk. All resources and requests are represented as multidimensional key-value pairs, for example {mem:10, cpu:50, net:40, disk:60}. Whether a free resource satisfies a request can therefore be abstracted as a comparison of multidimensional vectors, R = [r1, r2, r3, r4] > Q = [q1, q2, q3, q4], where 1, 2, 3, 4 index the dimensions; R > Q holds only when R is greater than Q in every dimension. The time complexity of this operation is determined by the number of comparisons. In the best case a single comparison gives the answer: checking whether [1,10,10,10] is greater than [2,1,1,1] fails on the first dimension. In the worst case D comparisons are needed (D being the number of dimensions): checking whether [10,10,10,1] is greater than [1,1,1,10] takes 4 comparisons. Because resource scheduling happens at high frequency, this comparison has to be optimized.
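The baseline comparison and its best and worst cases can be reproduced with a few lines of Python. The function name satisfies is hypothetical, and the sketch treats a free amount equal to the requested amount as sufficient (>=), which is an assumption; the article writes the relation simply as ">".

```python
from typing import Sequence, Tuple


def satisfies(free: Sequence[int], request: Sequence[int]) -> Tuple[bool, int]:
    """Compare dimension by dimension, stopping at the first shortfall.

    Returns (fits, number_of_comparisons_performed).
    """
    comparisons = 0
    for f, q in zip(free, request):
        comparisons += 1
        if f < q:  # this dimension cannot be satisfied, so stop early
            return False, comparisons
    return True, comparisons


print(satisfies([1, 10, 10, 10], [2, 1, 1, 1]))      # (False, 1): best case
print(satisfies([10, 10, 10, 1], [1, 1, 1, 10]))     # (False, 4): worst case
print(satisfies([10, 50, 40, 60], [5, 20, 40, 10]))  # (True, 4)
```

The order in which dimensions are checked decides how early a failing comparison stops, which is what the optimization in the next paragraph exploits.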

Profiling of free resources and requests at runtime showed that, when resources are generally sufficient, the dimension with the largest value in a request is usually the hardest one to satisfy. For resource scheduling we therefore adopted an optimization based on a primary key. The dimension holding the maximum value of each resource request is defined as that vector's primary key. When a free resource becomes available, it is first compared with the request on the primary-key dimension, and the other dimensions are compared only if the primary key is satisfied. In addition, we track the minimum primary-key value over all requests queued on a node; if the free resource is below that minimum, it need not be compared against any of the requests. The primary-key algorithm greatly reduces the number of vector comparisons during scheduling and brought Fuxi Master's scheduling time down to a few milliseconds. Note that a resource request does not change once submitted, so the overhead of computing its primary key is negligible.

Optimizing Fuxi Master's key scheduling performance improved the system's ability to scale. Users can manage larger clusters and run more computing tasks on the Apsara platform, making full use of the cost advantage of a cloud computing platform.
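One possible reading of the primary-key scheme is sketched below. All names (Request, NodeQueue, first_match) are hypothetical. The sketch also makes two assumptions the article does not spell out: that all dimensions are expressed on a comparable scale, and that "the free resource is below the minimum" means the largest free dimension is smaller than the smallest primary-key value in the queue.

```python
from typing import Dict, Iterable, Optional

Vector = Dict[str, int]


class Request:
    """A resource request whose primary key is computed once, at creation."""

    def __init__(self, name: str, demand: Vector) -> None:
        self.name = name
        self.demand = demand
        # Primary key: the dimension with the largest requested value.
        self.key_dim = max(demand, key=demand.get)
        self.key_value = demand[self.key_dim]


def fits(free: Vector, req: Request) -> bool:
    """Check the primary-key dimension first, then the remaining ones."""
    if free.get(req.key_dim, 0) < req.key_value:
        return False  # the hardest dimension fails, so skip the rest
    return all(
        free.get(dim, 0) >= amount
        for dim, amount in req.demand.items()
        if dim != req.key_dim
    )


class NodeQueue:
    """Requests queued on one node, plus the smallest primary-key value."""

    def __init__(self, requests: Iterable[Request]) -> None:
        self.requests = list(requests)
        self.min_key_value = min((r.key_value for r in self.requests), default=0)

    def first_match(self, free: Vector) -> Optional[Request]:
        # Shortcut: if even the largest free dimension is below the smallest
        # primary-key value, no queued request can be satisfied.
        if not self.requests or max(free.values(), default=0) < self.min_key_value:
            return None
        for req in self.requests:
            if fits(free, req):
                return req
        return None


queue = NodeQueue([
    Request("job-a", {"mem": 10, "cpu": 50, "net": 40, "disk": 60}),
    Request("job-b", {"mem": 30, "cpu": 5, "net": 5, "disk": 5}),
])
print(queue.first_match({"mem": 20, "cpu": 20, "net": 20, "disk": 20}))  # None
match = queue.first_match({"mem": 40, "cpu": 10, "net": 10, "disk": 10})
print(match.name)  # job-b
```

In the first query the shortcut rejects the whole queue with one comparison; in the second, job-a is rejected after checking only its primary key (disk).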

Module Dependency Optimization

Fuxi Master supports failover: after a restart it reads all task description files (checkpoints) from Nuwa so that user tasks can continue running. Because the Nuwa service at the time did not persist file contents on the server side, Fuxi Master wrote each checkpoint back to Nuwa after reading it, and the performance of this write-back depended on the Nuwa module. On the 5,000-node cluster, the sharply increased load on the naming service degraded Nuwa's server-side write-back performance. The slowdown propagated through the module dependency to Fuxi Master and hurt recovery performance. In one test, a single checkpoint write-back took 70 seconds, which greatly reduced the availability of the Fuxi system.

We optimized Fuxi Master's failure recovery. From Fuxi Master's point of view, the checkpoint content it has just read during recovery has not changed on the Nuwa server, so there is no need to write it back to the server after reading it. It is enough to notify the local Nuwa agent and let the agent act on its behalf: the agent is responsible for pushing the locally cached file contents to the server when the server restarts after downtime. Working with the Nuwa team, we added a new interface to the Nuwa API that writes only locally, so that Fuxi Master avoids the performance risk of writing checkpoints back during failure recovery. After the optimization, on a 5,000-node cluster under a 5,000-task concurrent test, checkpoint handling in a single recovery takes only 18 seconds (most of it spent on the single read). This shows that in a distributed system a dependency on an external module can be a performance "trap" even if it is only a single RPC request, and that design and implementation should keep such dependencies off the critical path wherever possible.

Failure recovery is what lets a distributed system guarantee availability. After the optimization, Fuxi Master's fast failure recovery improves the availability and stability of the Apsara computing platform, masks hardware failures, and keeps users' work unaffected.

Engineering Experience

There is no shortcut to high-quality code, and process alone cannot guarantee it. It comes down to taking the work seriously: careful authors, careful reviewers, and careful testing.

Any item, whether a bug fix or a new feature, must be discussed thoroughly before the code is written; review cannot replace that discussion. In the discussion the author needs to answer two questions: Is this solution really feasible? What are its side effects? These discussions should be tracked in tools such as a wiki or Bugfree.

Take small steps and submit code for review early. Many problems can be caught at this stage, avoiding the higher cost of finding them only in testing.

The code reviewer bears half the responsibility for an item, so a review is more than skimming the text. The checklist I use covers: whether the code accurately reflects the scenarios discussed beforehand, whether there are deadlocks or performance traps, whether modules are properly encapsulated, whether function and variable names follow conventions, whether the log format follows conventions, and whether comments are sufficient. It is common for a piece of code to go through about 10 review iterations.

Always validate with targeted tests.

Associate the corresponding bug and review IDs with each code commit for later traceability.

Summary

This article has shared some of our experience from the 5K project. The Fuxi system went through a great deal of meaningful optimization and technical exploration during the project, and those of us who took part gained a lot. Performance is part of functionality: it is vital to the system rather than icing on the cake. The 5K project is only the beginning for Alibaba Cloud's computing platform technology, which will continue to develop toward larger scale and richer computing models, building a reliable cloud computing engine for users, further reducing costs, and unlocking the value of data.
