Unified scheduling system
Sigma, which was built in 2011, is the scheduling system that serves Alibaba's online business. A complete, scheduling-centric cluster management system has been built around Sigma.
Cluster management and scheduling system Sigma architecture diagram
Sigma coordinates three layers of decision making: Alikenel, SigmaSlave, and SigmaMaster. Alikenel is deployed on every physical machine and enhances the kernel. It flexibly adjusts priorities and policies for resource allocation and time-slice allocation; decisions such as delaying a task, preempting a task's time slice, or evicting an unreasonable preemption are made locally according to rules configured by the upper layers. SigmaSlave handles container CPU allocation and emergency scenarios on its own machine. Because the local Slave decides and responds quickly when a delay-sensitive task suffers interference, it avoids the business loss that a slow global decision would cause. SigmaMaster is the central brain: it has the global view and performs resource scheduling, allocation, and algorithm optimization for container deployment across a large number of physical machines.
The architecture as a whole is designed around the final (desired) state. After a request is received, its data is written to the persistent storage layer. The scheduler identifies scheduling requirements and allocates resource locations, and the Slave identifies state changes and drives local allocation and deployment. As a result, the system coordinates well and achieves eventual consistency. We started building the scheduling system in 2011, rewrote it in Go in 2016, and made it compatible with the Kubernetes API in 2017. We hope to build and develop it together with the wider ecosystem.
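The flow described here (persist the request, let the scheduler assign a location, let the node-local agent act on the state change) can be illustrated with a minimal sketch. It is not Sigma's actual code: `Store`, `ContainerSpec`, `schedule`, and `reconcile_node` are hypothetical names, the store is an in-memory stand-in for the persistent layer, and the "most free CPUs" placement rule is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ContainerSpec:
    """Desired state for one container, persisted when the request arrives."""
    name: str
    cpus: int
    node: Optional[str] = None  # filled in later by the scheduler
    running: bool = False       # flipped later by the node-local agent


class Store:
    """Hypothetical in-memory stand-in for the persistent storage layer."""

    def __init__(self) -> None:
        self.specs: Dict[str, ContainerSpec] = {}

    def submit(self, spec: ContainerSpec) -> None:
        self.specs[spec.name] = spec


def schedule(store: Store, free_cpus: Dict[str, int]) -> None:
    """Central scheduler pass: give a node to every spec that has none."""
    for spec in store.specs.values():
        if spec.node is not None:
            continue
        candidates = [n for n, free in free_cpus.items() if free >= spec.cpus]
        if not candidates:
            continue  # stays pending; a later pass may succeed
        # Assumed rule: choose the candidate with the most free CPUs.
        node = max(candidates, key=lambda n: free_cpus[n])
        free_cpus[node] -= spec.cpus
        spec.node = node


def reconcile_node(store: Store, node: str) -> List[str]:
    """Node-local agent pass: start what is assigned here but not running."""
    started = []
    for spec in store.specs.values():
        if spec.node == node and not spec.running:
            spec.running = True
            started.append(spec.name)
    return started
```

Each pass only compares recorded state with desired state, so it can be repeated safely: a request that cannot be placed stays pending, and a second reconcile on the same node starts nothing new. That repeatability is what lets the components converge independently.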
Hybrid deployment architecture
Alibaba began promoting the hybrid deployment architecture in 2014 and has now deployed it internally on a large scale. Online services are long-running, delay-sensitive tasks with complex rules and policies. Computing tasks are short-lived, require high-throughput scheduling at large scale, carry different priorities, and are not sensitive to delay. Because the two kinds of scheduling differ so much, the hybrid deployment architecture runs the two schedulers in parallel: a single physical machine can be scheduled by both Sigma and Fuxi, which unifies the underlying infrastructure. Sigma launches PouchContainer containers through SigmaAgent. Fuxi also claims resources on the same physical machine and starts its own computing tasks. All online tasks run in PouchContainer containers; PouchContainer is responsible for allocating server resources and running the online tasks. Offline tasks fill the resources left blank so that the physical machine is fully utilized, which completes the hybrid deployment of the two kinds of tasks.
Alibaba Hybrid Deployment Architecture
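The idea that offline tasks fill the space left blank by online containers can be sketched as a simple admission check. This is an illustration only and not how Fuxi or Sigma decide in practice: `fill_with_offline` is a hypothetical function, the first-fit order is an assumption, and `headroom` is an assumed safety margin that does not come from the text.

```python
from typing import List, Tuple


def fill_with_offline(
    capacity_cpus: float,
    online_usage_cpus: float,
    offline_jobs: List[Tuple[str, float]],
    headroom: float = 0.1,
) -> List[str]:
    """Admit offline jobs into the CPU left idle by online containers.

    `headroom` is the fraction of capacity kept free so that online tasks
    can burst. It is an assumed margin, not a figure from the article.
    """
    budget = capacity_cpus * (1.0 - headroom) - online_usage_cpus
    admitted = []
    for name, cpus in offline_jobs:
        # First-fit: take each job in order if it still fits the idle budget.
        if cpus <= budget:
            admitted.append(name)
            budget -= cpus
    return admitted
```

For example, on a 64-CPU machine where online containers use 6.4 CPUs, `fill_with_offline(64, 6.4, [("a", 30), ("b", 30), ("c", 20)])` admits jobs `a` and `c` and leaves `b` waiting.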
Key technologies of hybrid deployment
Key technologies for kernel resource isolation
- For CPU hyperthreading (HT) isolation, the Noise Clean kernel feature was implemented to resolve contention for hyperthread resources.
- For CPU scheduling isolation, the Task Preempt feature was added to CFS to raise the scheduling priority of online tasks.
- For CPU cache isolation, CAT is used to isolate the L3 cache (LLC) between online and offline tasks (Broadwell and above).
- For memory isolation, cgroup isolation and OOM priority are used; for bandwidth isolation, Bandwidth Control reduces the offline quota.
- For memory elasticity, the effect of hybrid deployment is improved without adding memory: when online tasks are idle, offline tasks may exceed their memcg limit; when online tasks need the memory, offline tasks release it promptly.
- For network QoS isolation, management and control traffic is marked gold, online traffic silver, and offline traffic bronze, so that bandwidth is guaranteed.
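The memory elasticity item above amounts to a small piece of arithmetic: the offline limit grows while online tasks are idle and shrinks when they need memory back. The sketch below shows only that arithmetic. It does not touch the kernel or memcg, the function names are hypothetical, and `safety_gb` is an assumed reserve that the text does not mention.

```python
def offline_limit_gb(total_gb: float, online_used_gb: float,
                     safety_gb: float = 2.0) -> float:
    """Memory offline tasks may use right now.

    The limit follows whatever online tasks are not using, minus an
    assumed safety reserve, so it rises when online tasks are idle.
    """
    return max(0.0, total_gb - online_used_gb - safety_gb)


def memory_to_release_gb(offline_used_gb: float, limit_gb: float) -> float:
    """How much offline tasks must give back once the limit has shrunk."""
    return max(0.0, offline_used_gb - limit_gb)


# Online tasks idle: offline tasks may grow well beyond a static share.
idle_limit = offline_limit_gb(total_gb=128, online_used_gb=20)

# Online tasks ramp up: the limit shrinks and offline tasks must release.
busy_limit = offline_limit_gb(total_gb=128, online_used_gb=90)
release = memory_to_release_gb(offline_used_gb=100, limit_gb=busy_limit)
```

With these assumed numbers, the offline limit is 106 GB while online tasks are idle, drops to 36 GB when they ramp up, and offline tasks holding 100 GB would have to release 64 GB.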
Key technologies in online cluster management
- Profile each application's memory, CPU, network, disk, and network I/O capacity to learn its characteristics, its resource specification requirements, and its real-time resource usage at different times; then correlate specifications and time across applications for overall scheduling optimization.
- Affinity, mutual exclusion, and task priority allocation: identify which applications, when placed together, consume less overall computing power and achieve relatively high throughput, which indicates a certain affinity between them.
- Different scenarios call for different strategies. For Double 11 Singles Day the strategy is stability first: a spread ("tile") strategy that uses all resources so that every part of the resource layer sits at the lowest water level. In daily operation the strategy is utilization first: resources already in use are filled to the highest water level, which leaves a large amount of completely free resources for large-scale computing.
- Applications shrink automatically, scale vertically, and use time-division multiplexing.
- Rapid scale-out and scale-in of the entire site, elastic memory technology, and so on.
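The two strategies in the list above (stability first versus utilization first) correspond to the familiar "spread" and "pack" placement rules. The sketch below illustrates that difference only. `pick_node` and its parameters are hypothetical, and a real scheduler would weigh many more dimensions than a single resource number.

```python
from typing import Dict, Optional


def pick_node(used: Dict[str, float], capacity: Dict[str, float],
              request: float, strategy: str) -> Optional[str]:
    """Choose a node for `request` units of a single resource.

    "stability":   spread, prefer the least utilized node so that every
                   node stays at a low water level.
    "utilization": pack, prefer the most utilized node that still fits,
                   which keeps whole nodes free for large computing jobs.
    """
    fits = [n for n in used if used[n] + request <= capacity[n]]
    if not fits:
        return None

    def utilization(node: str) -> float:
        return used[node] / capacity[node]

    if strategy == "stability":
        return min(fits, key=utilization)
    if strategy == "utilization":
        return max(fits, key=utilization)
    raise ValueError(f"unknown strategy: {strategy}")
```

Given nodes `a`, `b`, and `c` with 10, 40, and 0 of 64 units in use, a request for 8 units goes to `c` under the stability strategy and to `b` under the utilization strategy, which leaves `c` entirely free.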
Hybrid deployment - introducing computing tasks to improve daily resource efficiency
Hybrid deployment introduces computing tasks into online service clusters to improve everyday resource efficiency. After offline tasks were introduced, average CPU utilization rose from 10% to more than 40%, while the latency of delay-sensitive services was affected by less than 5%, which is entirely acceptable. Our hybrid deployment clusters have now reached a scale of thousands of machines and have been validated on the core transaction path. This optimization saves more than 30% of servers in daily operation. This year we will expand the deployment tenfold to realize the benefits at scale.
Hybrid deployment - time-division multiplexing to further improve resource efficiency
Time-division multiplexing improves resource efficiency further. The curve in the image above is the traffic curve of one of our applications. It is very regular: the left side is the evening trough period and the right side is the daytime peak period. Regular hybrid deployment uses the resources in the blue shaded part of the figure and raises utilization to 40%. Elastic time-division multiplexing uses the application profile to find the application's traffic trough, shrinks the application during that period to release a large amount of memory and CPU, and schedules more computing tasks onto the freed resources. With this technology, average CPU utilization rises to more than 60%.
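As a rough illustration of elastic time-division multiplexing, the sketch below takes an hourly traffic profile, computes how many instances the application needs in each hour, and totals the instance-hours that shrinking during the trough would free for computing tasks. The function names, the per-instance capacity `qps_per_instance`, and the sample profile are all assumptions made for the example.

```python
import math
from typing import List


def instances_needed(hourly_qps: List[float], qps_per_instance: float,
                     min_instances: int = 1) -> List[int]:
    """Instances required in each hour to serve the profiled traffic."""
    return [max(min_instances, math.ceil(q / qps_per_instance))
            for q in hourly_qps]


def freed_instance_hours(hourly_qps: List[float], qps_per_instance: float,
                         min_instances: int = 1) -> int:
    """Instance-hours released compared with staying at peak size all day."""
    plan = instances_needed(hourly_qps, qps_per_instance, min_instances)
    peak = max(plan)
    return sum(peak - n for n in plan)


# Assumed profile: two trough hours, a ramp, then the daytime peak.
profile = [100.0, 100.0, 400.0, 1000.0]
plan = instances_needed(profile, qps_per_instance=100.0)
freed = freed_instance_hours(profile, qps_per_instance=100.0)
```

For this assumed profile the plan is 1, 1, 4, and 10 instances, and shrinking outside the peak frees 24 instance-hours that can be handed to computing tasks.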
PouchContainer container and containerization progress
Comprehensive containerization is the key technology for improving operation and maintenance capabilities and unifying scheduling. First, an introduction to PouchContainer, Alibaba's internal container product. It has been under development and in production since 2011. It was originally based on LXC, and in early 2015 it began to adopt Docker's image functionality and to become compatible with container standards. Alibaba's container is distinctive: it is combined with the Alibaba kernel, which greatly improves security isolation. It is currently deployed within Alibaba Group at a scale of millions of containers.
Let's look at PouchContainer's development path. Previously we used virtual-machine virtualization. Moving from virtualization to container technology posed many challenges for the operation and maintenance system, and migrating that system carries a large technical cost. We made the migration seamless from the perspective of Alibaba's internal operations and applications: containers have an independent IP, SSH login, an independent file system, and resource isolation with visible resource usage. After 2015, Alibaba adopted the container standards, formed a new container product, PouchContainer, and integrated it into the entire operation and maintenance system.
PouchContainer positioning
PouchContainer provides very good isolation. It is a rich container: you can log in to it, see how many resources the processes inside it occupy and how many processes there are, and run many processes in it; if a process dies, the container does not go down. Compatibility is also very good: old kernel versions are supported, which helps a great deal with legacy infrastructure. After large-scale validation across millions of deployed containers, we developed a P2P image distribution mechanism that greatly improves distribution efficiency. PouchContainer is also compatible with more industry standards, promotes the development of standards, and supports RunC, RunV, RunLXC, and others. Tested by millions of containers, it is stable and efficient, making it the best choice for comprehensive containerization.
PouchContainer architecture diagram
PouchContainer's structure is relatively clear; the diagram shows how Pouchd interacts with kubelet, swarm, and Sigma. On the storage side, we worked with the industry to build the CSI standard, and PouchContainer supports distributed storage such as Ceph and Pangu. On the network side, it uses lxcfs to enhance isolation and supports multiple standards.
At present, PouchContainer covers most of Alibaba's business units. In 2017 it reached million-level deployment: online business is 100% containerized, and computing tasks have begun to be containerized. It has leveled the operation and maintenance costs of heterogeneous platforms, and it covers run modes, multiple programming languages, and the DevOps system. PouchContainer covers almost all of Alibaba's business segments, including Ant Financial, transactions, middleware, and more.
PouchContainer's open-sourcing was announced on October 10, 2017, it was officially open-sourced on November 19, and the first major release is planned for March 2018. We hope that open-sourcing PouchContainer will advance the container field and the maturity of its standards, and give the industry a differentiated and competitive technology choice. It makes it easy for traditional IT companies to keep using their existing assets, so that old infrastructure can also enjoy the benefits of cloud-native technology, and it lets newer IT companies enjoy stability at scale and multi-standard compatibility.
Cloud architecture
Double 11 Singles Day cloud-based architecture operation and maintenance system
Cloud-based architecture operation and maintenance system
The clusters are divided into online task clusters, computing task clusters, and ECS clusters. Basic operation and maintenance systems, such as resource management, single-machine operation and maintenance, status management, command channels, and monitoring and alerting, have all been connected. For the Double 11 Singles Day scenario, we mark out a separate area on the cloud that interconnects with the other scenarios. In this interconnected area, Sigma can request resources from the computing cluster's servers to produce Pouch containers, or call the cloud's open API to request ECS instances and produce containers on them. In everyday scenarios, Fuxi can request resources from Sigma and create the containers it needs.
In the Double 11 Singles Day scenario, large-scale online services are built on containers using the large-scale operation and maintenance system, including hybrid deployment at the business layer. Each cluster runs online services, stateful services, and big data analysis. Alibaba Cloud's dedicated clusters also run online services and stateful data services, achieving "datacenter as a computer": multiple data centers are managed like a single computer, so that resources for the business can be scheduled across multiple different platforms. Building a hybrid cloud lets us obtain the servers we need at a very low cost.
With the server scale in place, time-division multiplexing and hybrid deployment then greatly improve resource utilization. This truly enables elastic, smooth resource multiplexing and flexible task deployment, so that business capacity targets are met with the least server time and the best efficiency. With this cloud-based architecture, we reduced the cost of new IT resources for Double 11 Singles Day by 50% and daily IT costs by 30%. This has brought a burst of technical value in cluster management and scheduling, and it makes the widespread adoption of container and scheduling technology inevitable.
The Alibaba scheduling system team is dedicated to building a globally efficient scheduling and cluster management system, and to delivering optimal cloud solutions through enterprise-class containers and container platforms. We look forward to working with industry colleagues to reduce IT costs across the industry and accelerate innovation.