Summary
To meet the needs of compute-intensive businesses such as rendering and gene sequencing, Ucloud has launched a "Compute Factory" product that lets users quickly create large numbers of computing resources (virtual machines). Behind the product is a computing resource management system built on Mesos. This article introduces the architecture of that system, how Mesos is used at Ucloud, the problems we encountered, and our solutions.
Business requirements
Our requirements fall into two areas:
1. Support both virtual machines and containers. In the midst of the "containerization" wave, why do we still need virtual machines? First, some businesses have strict security isolation requirements; containers are convenient, but they do not yet provide the same level of isolation as virtual machines. Second, some business software cannot run on Linux at all; most film and animation rendering software, for example, is Windows-only.
2. Integrate multiple regions and multiple data centers. Our resources come from a number of partners whose idle machines are scattered across multiple data centers in multiple regions. The platform needs to support global scheduling while keeping operation and maintenance costs as low as possible.
Simply put, we need a platform that encapsulates the computing resources of multiple data centers and supports multiple forms of resource usage, such as virtual machines and containers.
Figure 1: Requirements for the compute resource management platform
When it comes to virtual machines, the first candidates that come to mind are Ucloud's own UHost and the open source OpenStack. However, both systems are built for large public clouds and are oriented primarily toward virtual machines; they carry a wide range of functions, modules, and operational procedures that cost a lot to run, and our business does not need most of that functionality.
In the end, we chose to implement this platform based on Mesos.
Why Choose Mesos
Mesos is an open source distributed resource management framework under the Apache Software Foundation; it acts as the kernel of a distributed system.
With Mesos, a data center is no longer a collection of individual servers but a pool of shared resources: CPU cores, memory, storage, GPUs, and so on. If you think of a data center as an operating system, Mesos is its kernel.
We chose Mesos because it is highly extensible and simple enough.
As a kernel, Mesos provides only the most basic functions: resource management, task management, scheduling, and so on, and each function is implemented in a modular way that is easy to customize. Architecturally, the Master and Agent modules handle all resource-related work, while users only need to implement a Framework and an Executor for their own business logic. This is what allows us to encapsulate computing resources into different forms, such as virtual machines and containers.
Using Mesos to orchestrate containers is a scheme many vendors have already adopted, and documentation for it is plentiful. Managing virtual machines with Mesos in a production environment, however, has little precedent in the industry. The remainder of this article shares Ucloud's ideas and practical experience in managing virtual machines with Mesos.
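To make that division of labor concrete, here is a minimal Go sketch of the Executor side: the same task-launching interface can be backed by a virtual machine implementation or a container implementation. The type and method names are our own simplification for illustration, not the actual Mesos executor API.

```go
package main

import "fmt"

// TaskSpec is a simplified description of a piece of work handed to an Agent.
type TaskSpec struct {
	ID      string
	CPUs    float64
	MemMB   float64
	Payload string // business-specific config (a VM definition, a container image, ...)
}

// Executor is the piece the user supplies: it turns a TaskSpec into a
// concrete form of computing resource.
type Executor interface {
	Launch(t TaskSpec) error
}

// vmExecutor would call Libvirt to create a virtual machine.
type vmExecutor struct{}

func (vmExecutor) Launch(t TaskSpec) error {
	fmt.Println("create virtual machine for task", t.ID)
	return nil
}

// containerExecutor would call the container runtime instead.
type containerExecutor struct{}

func (containerExecutor) Launch(t TaskSpec) error {
	fmt.Println("start container for task", t.ID)
	return nil
}

func main() {
	task := TaskSpec{ID: "demo", CPUs: 2, MemMB: 4096, Payload: "..."}
	for _, e := range []Executor{vmExecutor{}, containerExecutor{}} {
		_ = e.Launch(task)
	}
}
```

Mesos itself never needs to know what the task produces; the Executor decides whether it becomes a virtual machine or a container.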
Mesos Introduction
Mesos uses a master-agent architecture. The Master is responsible for overall resource scheduling and provides APIs to the outside world. An Agent is deployed on every machine and is responsible for invoking Executors to run tasks, reporting status to the Master, and so on.
Mesos provides a two-level scheduling model:
1. The Master schedules resources among the Frameworks;
2. Each Framework schedules resources for its own business.
The overall architecture is as follows:
Figure 2: Mesos two-level scheduling architecture
Architecture Design
Overall architecture
On top of Mesos's two-level scheduling model, the overall architecture of the platform is as follows:
Figure 3: Overall architecture of the Mesos-based resource management platform
The structure is as follows:
1. Each IDC runs one or more Mesos clusters;
2. Each Mesos cluster has a Cluster Server, which interacts with the Mesos Master and the Frameworks and is responsible for in-cluster scheduling, state collection, and task distribution;
3. A Mesos cluster runs multiple Frameworks, each responsible for one kind of business; for example, the VM Scheduler Framework manages virtual machines and the Marathon Framework manages Docker tasks;
4. The VM Framework's Executor is implemented on top of Libvirt and performs virtual machine creation, restart, deletion, and other operations;
5. All Cluster Servers report to a single API Server, pushing status upward and pulling tasks down;
6. The API Server handles the main business logic, including inter-cluster scheduling and resource and task management;
7. The API Gateway exposes APIs to the Ucloud console.
HTTP-based communication
All communication within the system is based on HTTP.
First, Mesos's internal communication relies on the Libprocess library, which implements the Actor pattern in C++. Each actor listens on an HTTP endpoint; sending a message to an actor means serializing the message body into an HTTP request and sending it to that actor.
Second, the business components such as the API Server and Cluster Server also expose their services through RESTful APIs.
The advantages of HTTP are that it is simple and reliable, and easy to develop, debug, and extend.
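As a simple illustration of this style, the Go sketch below shows a Cluster Server reporting its state to the API Server over HTTP. The endpoint path and payload fields are hypothetical, chosen only to convey the shape of the communication.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// ClusterStatus is a hypothetical payload a Cluster Server might report.
type ClusterStatus struct {
	ClusterID  string  `json:"cluster_id"`
	TotalCPUs  float64 `json:"total_cpus"`
	UsedCPUs   float64 `json:"used_cpus"`
	ReportedAt int64   `json:"reported_at"`
}

// reportStatus POSTs the status as JSON to the API Server.
func reportStatus(apiServer string, s ClusterStatus) error {
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(apiServer+"/v1/cluster/status", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	s := ClusterStatus{ClusterID: "idc-1-cluster-1", TotalCPUs: 1024, UsedCPUs: 520, ReportedAt: time.Now().Unix()}
	if err := reportStatus("http://api-server.internal:8080", s); err != nil {
		fmt.Println("report failed:", err)
	}
}
```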
VM Scheduler
For Docker containers, we use the Marathon Framework for management. For virtual machines, we use the VM Scheduler Framework we developed.
The VM Scheduler receives resource offers from the Master; each offer describes the resources available on one Agent. When a virtual machine task needs to be performed, the Cluster Server sends the details of the task to the VM Scheduler.
Tasks fall into two categories:
1. Creating or deleting a virtual machine. The task carries the virtual machine's configuration, including image, network, storage, and so on. Based on this information, the VM Scheduler matches a resource offer that meets the requirements, generates a Task, and submits it to the Mesos Master for execution;
2. Operating an existing virtual machine, such as powering it on or off, rebooting it, or creating an image from it. In this case the VM Scheduler communicates with the VM Executor through a Framework Message, telling it which operation to perform (see the sketch after this list).
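The split between the two categories can be sketched as follows in Go; the field names and action strings are illustrative assumptions, not the actual VM Scheduler code.

```go
package main

import "fmt"

// VMTask is an illustrative task message from the Cluster Server.
// "create"/"delete" leads to a new Mesos Task; other actions (power on/off,
// reboot, make image) are forwarded to the running VM Executor as a
// Framework Message.
type VMTask struct {
	Action string // "create", "delete", "power_on", "power_off", "reboot", "make_image"
	VMID   string
	CPUs   float64
	MemMB  float64
	Image  string
}

func handleTask(t VMTask) {
	switch t.Action {
	case "create", "delete":
		// Match a resource offer and submit a Mesos Task to the Master
		// (offer matching is elided here).
		fmt.Printf("build Mesos Task for %s of %s\n", t.Action, t.VMID)
	default:
		// Send a Framework Message to the VM Executor that owns this VM.
		fmt.Printf("framework message %q to executor of %s\n", t.Action, t.VMID)
	}
}

func main() {
	handleTask(VMTask{Action: "create", VMID: "vm-1", CPUs: 48, MemMB: 96 * 1024, Image: "win2012-render"})
	handleTask(VMTask{Action: "reboot", VMID: "vm-1"})
}
```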
VM Executor
A Task is the smallest unit of resource allocation in Mesos. The Master tells the Agent which Tasks to run, and the Agent reports the status of each Task back to the Master. Based on the Task's information, the Agent downloads and launches the required Executor and then passes it the concrete task description.
The VM Executor is the Executor we developed to manage the virtual machine life cycle; it implements virtual machine creation, deletion, power on/off, image creation, and other functions.
After the VM Executor starts, it dynamically generates the configuration file the virtual machine needs based on the Task description and then calls Libvirt to create the virtual machine. When it receives a Framework Message from the VM Scheduler, it calls Libvirt to perform the requested operation, such as powering the machine on or off.
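For illustration, the Go sketch below renders a heavily reduced libvirt domain definition from a task description using text/template. A real domain XML carries far more detail (network interfaces, CPU model, graphics, and so on), and this is not the actual VM Executor code.

```go
package main

import (
	"os"
	"text/template"
)

// VMSpec is an illustrative subset of what a Task description might carry.
type VMSpec struct {
	Name      string
	VCPUs     int
	MemoryMiB int
	DiskPath  string
}

// domainTmpl is a heavily reduced libvirt domain definition.
const domainTmpl = `<domain type='kvm'>
  <name>{{.Name}}</name>
  <memory unit='MiB'>{{.MemoryMiB}}</memory>
  <vcpu>{{.VCPUs}}</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='file' device='disk'>
      <source file='{{.DiskPath}}'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
`

func main() {
	spec := VMSpec{Name: "render-01", VCPUs: 48, MemoryMiB: 96 * 1024, DiskPath: "/data/vm/render-01.qcow2"}
	t := template.Must(template.New("domain").Parse(domainTmpl))
	// The generated XML would then be handed to libvirt (for example via
	// virDomainDefineXML) to actually define and start the virtual machine.
	if err := t.Execute(os.Stdout, spec); err != nil {
		panic(err)
	}
}
```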
State management is a key part of implementing virtual machine management. From Mesos we can only obtain the status of the Task: RUNNING indicates that the virtual machine was created successfully, FAILED indicates that creation failed, and FINISHED indicates that the virtual machine was destroyed successfully. Beyond that, however, a virtual machine also has states such as "powering on", "shutting down", "shut down", and "creating image". We synchronize these states to the VM Scheduler through a heartbeat between the VM Executor and the VM Scheduler. The scheduler checks the state and, if it has changed, sends a state-update message to the Cluster Server, which forwards it to the API Server, where it is finally written to the database.
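A minimal sketch of such a heartbeat, assuming a simple JSON payload over HTTP; the state names, endpoint, and interval are illustrative, not the actual protocol between the VM Executor and the VM Scheduler.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// VMState is an illustrative set of virtual machine states beyond what the
// Mesos task states (RUNNING / FAILED / FINISHED) can express.
type VMState string

const (
	StatePoweringOn    VMState = "powering_on"
	StateRunning       VMState = "running"
	StateShuttingDown  VMState = "shutting_down"
	StateShutDown      VMState = "shut_down"
	StateCreatingImage VMState = "creating_image"
)

type heartbeat struct {
	VMID  string  `json:"vm_id"`
	State VMState `json:"state"`
	TS    int64   `json:"ts"`
}

// sendHeartbeats periodically reports the current VM state to the scheduler,
// which compares it with the last known state and propagates any change.
func sendHeartbeats(schedulerURL, vmID string, current func() VMState) {
	client := &http.Client{Timeout: 3 * time.Second}
	for range time.Tick(10 * time.Second) {
		hb := heartbeat{VMID: vmID, State: current(), TS: time.Now().Unix()}
		body, _ := json.Marshal(hb)
		resp, err := client.Post(schedulerURL+"/v1/heartbeat", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("heartbeat failed: %v", err)
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	// Hypothetical scheduler address; in practice it would come from the task description.
	sendHeartbeats("http://vm-scheduler.internal:9090", "vm-123", func() VMState { return StateRunning })
}
```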
Scheduling of virtual machines
First, let's look at how a Task is scheduled in Mesos:
Figure 4: Mesos resource scheduling process
In the figure above:
1. The Agent reports its own resources to Master;
2. Based on the Dominant Resource Fairness (DRF) scheduling algorithm, the Master sends these resources to Framework 1 as a resource offer;
3. Framework 1 tells the Master, according to its own business logic, that it intends to launch two tasks with these resources;
4. Master notifies the Agent to start these two tasks.
For virtual machines, scheduling is divided into two parts:
1. Cluster selection. By default, the API Server selects a cluster with sufficient resources from the registered clusters based on the resource requirements and assigns the requirements to that cluster. In addition, dedicated clusters can be designated for particular companies, projects, or other dimensions;
2. In-cluster scheduling. The Cluster Server obtains the resource requirements from the API Server, for example 200 cores, and creates a "resource plan" based on the cluster's current resource usage in Mesos, for example splitting the 200 cores into four 48-core virtual machines and one 8-core virtual machine (see the sketch after this list). It then tells the Framework to create the corresponding five tasks according to the plan.
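A minimal sketch of such a "resource plan" computation, assuming a fixed set of preferred VM sizes and a greedy strategy; the real Cluster Server also takes the cluster's current Mesos resource usage into account, which is elided here.

```go
package main

import "fmt"

// planVMs greedily splits a total core requirement into VM sizes,
// preferring the largest size first. For 200 cores with sizes
// {48, 32, 16, 8} it yields four 48-core VMs plus one 8-core VM.
func planVMs(totalCores int, sizes []int) []int {
	var plan []int
	remaining := totalCores
	for _, size := range sizes {
		for remaining >= size {
			plan = append(plan, size)
			remaining -= size
		}
	}
	if remaining > 0 {
		// Round the leftover up to the smallest available size.
		plan = append(plan, sizes[len(sizes)-1])
	}
	return plan
}

func main() {
	fmt.Println(planVMs(200, []int{48, 32, 16, 8})) // [48 48 48 48 8]
}
```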
Resource identification
Beyond CPU, memory, and disk capacity, servers differ in other ways as well. For example, sometimes a business requires a specific CPU model, sometimes it requires an SSD, and so on. To support scheduling along more dimensions, we use Mesos's Resource and Attribute mechanisms to identify different resources.
Resource is the Mesos concept for anything a user needs to consume. By default the Agent automatically registers five resources: cpus, gpus, mem, ports, and disk. Additional resources can be specified through parameters when the Agent starts.
An Attribute is a key-value label that identifies a property of the Agent; it, too, can be specified through parameters at startup.
Used flexibly, Resource and Attribute can identify many more kinds of resources and satisfy scheduling needs along various dimensions. For example, a Resource can specify SSD capacity or CPU model, while an Attribute can mark the rack location, whether the machine has an external IP, whether hyper-threading is supported, and so on. After receiving a resource offer, the Framework matches it against the tasks waiting to run: it uses the Resources to decide whether the capacity is sufficient and the Attributes to decide whether the other dimensions are satisfied, and then decides whether to use this offer to create the task.
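For illustration, here is a Go sketch of filtering an offer along both dimensions; the resource and attribute names (ssd, rack, external_ip, hyperthreading) follow the examples above, and the structs are simplified stand-ins rather than the real Mesos offer types.

```go
package main

import "fmt"

// Offer is a simplified stand-in for a Mesos resource offer, carrying both
// quantitative Resources and key-value Attributes of the Agent.
type Offer struct {
	AgentID   string
	Resources map[string]float64 // e.g. "cpus", "mem", "ssd" (GB)
	Attrs     map[string]string  // e.g. "rack", "external_ip", "hyperthreading"
}

// Requirement describes what a pending task needs from an offer.
type Requirement struct {
	MinResources map[string]float64
	WantAttrs    map[string]string
}

// accept checks capacity via Resources first, then the other dimensions via Attributes.
func accept(o Offer, r Requirement) bool {
	for name, min := range r.MinResources {
		if o.Resources[name] < min {
			return false
		}
	}
	for key, want := range r.WantAttrs {
		if o.Attrs[key] != want {
			return false
		}
	}
	return true
}

func main() {
	o := Offer{
		AgentID:   "agent-7",
		Resources: map[string]float64{"cpus": 56, "mem": 256 * 1024, "ssd": 800},
		Attrs:     map[string]string{"rack": "A-12", "external_ip": "true", "hyperthreading": "true"},
	}
	r := Requirement{
		MinResources: map[string]float64{"cpus": 48, "ssd": 500},
		WantAttrs:    map[string]string{"external_ip": "true"},
	}
	fmt.Println("use this offer:", accept(o, r)) // true
}
```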
Image, storage, and network management
The platform provides a set of base images, and users can also create their own images from their existing virtual machines. These image files are stored centrally in a GlusterFS-based distributed storage service, which is mounted on every physical machine.
Some business scenarios require several VMs to share the same storage, so we built a user storage service, also based on GlusterFS, which can be mounted automatically when a virtual machine is created, according to the user's configuration.
On the network side, each user can create multiple subnets, with network isolation between subnets. When creating a virtual machine, you need to specify which subnet to use.
Other issues
In the process of using Mesos, we also encountered some other problems.
Issue One: Marathon leader election failures
When machine load was relatively high, especially I/O load, we found that the Marathon cluster would sometimes fail to elect a leader.
We suspected that an unstable network between the Marathon nodes and ZooKeeper was triggering a bug in Marathon or Mesos, so we reproduced the problem by using iptables to actively block the leader's ZooKeeper port.
By adding extra logging around leader election in the Marathon code, we pinpointed the problem: the stop() method of the Mesos Driver failed to make the blocking start() method return.
Since all of our programs are started by a daemon, we adopted the simplest solution: modify the Marathon code so that the process kills itself when a ZooKeeper exception occurs; the daemon then restarts it.
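The pattern itself is generic; the Go sketch below shows the idea (the actual change was made inside Marathon's own code): exit on an unrecoverable ZooKeeper error and let the supervising daemon restart the process, which then rejoins the cluster and re-runs leader election.

```go
package main

import (
	"errors"
	"log"
	"os"
)

// errZKSession is a stand-in for an unrecoverable ZooKeeper session error.
var errZKSession = errors.New("zookeeper session lost")

// watchZK represents whatever loop keeps the ZooKeeper connection alive;
// here it fails immediately to show the exit path.
func watchZK() error {
	return errZKSession
}

func main() {
	if err := watchZK(); err != nil {
		// "Commit suicide" on a fatal ZooKeeper error; the supervising
		// daemon restarts the process afterwards.
		log.Printf("fatal zookeeper error: %v, exiting", err)
		os.Exit(1)
	}
}
```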
Issue Two: go-marathon problems
Our services are developed in Golang and interact with Marathon through the go-marathon library. We ran into several problems with it:
1. It did not support multiple Marathon nodes, so we created a branch that adds multi-node support by actively probing the nodes (the upstream library also added this feature in v5.0);
2. It uses an http.Client with a Timeout set, which caused timeouts when subscribing to SSE after the client is initialized. We changed it so that the normal HTTP API and SSE no longer share the same http.Client, and the http.Client used for SSE does not set a Timeout;
3. When a network exception occurred, go-marathon method calls could get stuck, so we wrapped all of our go-marathon calls with timeout control (see the sketch after this list).
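The last two workarounds boil down to two generic Go patterns, sketched below: keep separate http.Client instances (one with a Timeout for ordinary API calls, one without a client-level Timeout for the long-lived SSE stream), and wrap library calls in a timeout so a stuck call cannot block the caller. The go-marathon wiring itself is omitted; the snippet only shows the patterns.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// apiClient is used for ordinary request/response calls: a Timeout is safe here.
var apiClient = &http.Client{Timeout: 10 * time.Second}

// sseClient is used for the long-lived SSE event stream: a client-level
// Timeout would kill the stream, so only a header timeout is set.
var sseClient = &http.Client{
	Transport: &http.Transport{ResponseHeaderTimeout: 10 * time.Second},
}

// callWithTimeout runs fn but gives up once the context deadline passes, so a
// stuck library call cannot block the caller forever. Note the goroutine
// itself may linger until fn returns.
func callWithTimeout(ctx context.Context, fn func() error) error {
	done := make(chan error, 1)
	go func() { done <- fn() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	_ = apiClient
	_ = sseClient
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err := callWithTimeout(ctx, func() error {
		time.Sleep(5 * time.Second) // simulate a stuck marathon call
		return errors.New("unreachable in this demo")
	})
	fmt.Println(err) // context deadline exceeded
}
```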
Conclusion
Mesos is widely used at Ucloud: externally it powers products such as Compute Factory and UDocker, and internally it supports the company's intranet virtual machine management platform. Through continuous practice, our understanding and mastery of Mesos keeps deepening.