Mesos: Persistent Storage, Fault Tolerance, and Resource Allocation


Problems with persistent storage


As I discussed in the previous article, the main benefit of using Mesos is that you can run multiple types of applications (with tasks scheduled and launched through frameworks) on the same set of compute nodes. These tasks are abstracted from the actual nodes by isolation modules (currently several kinds of container technology), so they can be moved and restarted on other nodes as needed.

So let's consider a question: how does Mesos handle persistent storage? If I am running a database job, how does Mesos ensure that, when the task is scheduled, the node it is assigned to has access to the data it needs? In Hindman's example, the Hadoop Distributed File System (HDFS) is used as Mesos's persistence layer. This is a common use of HDFS, and it is also how the executor and configuration data assigned to a given task are frequently passed down to the slaves. In fact, Mesos's persistent storage can sit on many kinds of file systems; HDFS is just one of them, but it is the one Mesos uses most often, and it gives Mesos a kinship with high-performance computing. Mesos actually has several options for handling the persistent storage problem:

Distributed file system. As mentioned above, Mesos can use a distributed file system (such as HDFS or Lustre) to ensure that the data can be reached from every node in the Mesos cluster. The drawback of this approach is network latency; for some applications, a networked file system may simply not be suitable.

A local file system replicated by the data store. Another approach is to use application-level replication to make the data accessible from multiple nodes. Data stores that provide replication include NoSQL databases such as Cassandra and MongoDB. The advantage of this approach is that you no longer have to worry about network latency. The disadvantage is that you must configure Mesos so that specific tasks run only on the nodes that hold the replicated data, because you do not want every node in the data center to replicate the same data. To do this, a framework can statically reserve specific nodes for the replicated data store.


A local file system without replication. You can also keep the persisted data on the file system of a specific node and reserve that node for the application. As with the previous option, nodes can be statically reserved for a given application, but here only a single node is reserved rather than a set of nodes. The latter two options are obviously not ideal, because they essentially create static partitions; however, they are needed in special cases where no latency can be tolerated or the application cannot replicate its data store.
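To make the static reservation mentioned in the last two options more concrete, here is a hedged sketch of what it can look like in practice: a Mesos slave can be started with its resources tagged for a particular role, so that only frameworks registered under that role are offered them. The host names, role name, and values below are hypothetical, not taken from the article.

    # Hypothetical example: start a slave whose CPU, memory, and disk are
    # statically reserved for the "storage" role; only frameworks registered
    # with that role will receive offers for these resources.
    mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
                --resources="cpus(storage):8;mem(storage):16384;disk(storage):409600"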

The Mesos project is still under development and regularly adds new features. I have found two new features that can help solve the persistent storage problem:

Dynamic reservation. A framework can use this feature to keep hold of reserved resources, such as persistent storage, so that when it needs to launch another task the resource offer is sent only to that framework. This can be combined with the configurations above, in which a framework reserves a single node or a set of nodes with access to the persistent data store. More information about this proposed feature can be found here.

Persistent volumes. This feature creates a volume as part of a task launched on a slave node, and the volume persists after the task completes. Mesos then gives follow-up tasks, launched by the same framework on the node or nodes that can reach the persistent volume, access to the same data. More information about this proposed feature can be found here.
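To make these two proposals more concrete, here is a hedged sketch (written as plain Python dictionaries) of roughly what the RESERVE and CREATE operations can look like when a framework accepts an offer. The exact message layout differs between Mesos versions, so the field names are illustrative rather than authoritative, and the role, principal, volume ID, and sizes are made up.

    # Illustrative only: approximate shape of the offer operations a framework
    # might send back when accepting an offer, to (1) dynamically reserve disk
    # for its role and (2) create a persistent volume on that reservation.
    # Check the Mesos documentation for your version for the exact fields.
    reserve_op = {
        "type": "RESERVE",
        "reserve": {
            "resources": [{
                "name": "disk",
                "type": "SCALAR",
                "scalar": {"value": 2048},                    # MB, hypothetical
                "role": "database",                           # hypothetical role
                "reservation": {"principal": "db-principal"}, # hypothetical principal
            }]
        },
    }

    create_volume_op = {
        "type": "CREATE",
        "create": {
            "volumes": [{
                "name": "disk",
                "type": "SCALAR",
                "scalar": {"value": 2048},
                "role": "database",
                "reservation": {"principal": "db-principal"},
                "disk": {
                    "persistence": {"id": "db-volume-1"},     # hypothetical volume id
                    "volume": {"container_path": "data", "mode": "RW"},
                },
            }]
        },
    }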

Fault tolerance

Next, let's talk about how Mesos provides fault tolerance at each layer of its stack. Frankly, one of the advantages of Mesos is that fault tolerance is designed into the architecture and implemented in a scalable, distributed way.

Master. The master's fault tolerance is achieved through its failure-handling mechanism and the design of the architecture itself.


First, Mesos uses a hot-standby design for the set of master nodes. As Tomas Barton's illustration above shows, one master node runs in the cluster alongside several standby nodes, all monitored by the open-source coordination service ZooKeeper. ZooKeeper watches every node in the master set and, if the active master fails, manages the election of a new master. The recommended total is five master nodes; in practice, a production environment should run at least three. Mesos designs the master to hold only soft state, which means that when a master node fails, its state can be rebuilt quickly on the newly elected master. The authoritative state actually lives in the framework schedulers and the set of slave nodes. When a new master is elected, ZooKeeper notifies the frameworks and the slave nodes so that they register with the new master, and the new master then reconstructs its internal state from the information sent by the frameworks and the slaves.
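As a concrete, hedged illustration of this setup: the masters, the slaves, and the framework schedulers are all pointed at the same ZooKeeper ensemble, and the masters are given a quorum size. The host names and paths below are hypothetical.

    # Hypothetical three-master setup coordinated through ZooKeeper.
    # Each master is started with the same zk:// URL and a quorum of 2
    # (a majority of the three masters).
    mesos-master --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos --quorum=2 --work_dir=/var/lib/mesos

    # Slaves (and framework schedulers) use the same zk:// URL to find
    # whichever master is currently the elected leader.
    mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos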

The framework scheduler. Fault tolerance for the framework scheduler is achieved by having the framework register two or more schedulers with the master. When a scheduler fails, the master notifies another scheduler to take over. Note that the framework itself is responsible for implementing the mechanism by which its schedulers share state.
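As a hedged sketch of what this involves on the framework side: the FrameworkInfo a scheduler registers with carries a failover timeout and, after the first registration, a framework ID; a standby scheduler re-registers with the same ID to take over. The names and values below are hypothetical.

    # Illustrative FrameworkInfo fields relevant to scheduler failover
    # (field names follow the Mesos FrameworkInfo message; values are made up).
    framework_info = {
        "name": "example-framework",     # hypothetical framework name
        "user": "svc-user",              # hypothetical user to run tasks as
        "failover_timeout": 3600.0,      # seconds the master keeps this framework's
                                         # tasks alive while waiting for a new scheduler
        # "id": assigned by the master on first registration; a standby
        # scheduler re-registers with this same id to take over the tasks.
    }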

Slave. Mesos implements slave recovery, which allows executors/tasks to keep running when the slave process on a node fails and lets the restarted slave process reconnect to the executors/tasks running on that node. While tasks execute, the slave checkpoints metadata about the tasks to local disk. If the slave process fails, the tasks keep running; when the master restarts the slave process (because it has stopped responding to messages), the restarted process uses the checkpointed data to recover its state and reconnect with the executors/tasks.
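A hedged sketch of the pieces involved: the framework opts in to checkpointing through its FrameworkInfo, and the slave is started so that after a restart it recovers its checkpointed state and reconnects to the still-running executors. The flag values below are examples rather than recommendations.

    # The framework sets FrameworkInfo.checkpoint = true, so the slave writes
    # task and executor metadata to local disk as tasks run.
    #
    # On restart, the slave recovers that checkpointed state and reconnects
    # to the executors that kept running while it was down.
    mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
                --recover=reconnect \
                --strict=true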

Quite different from the above is a failure of the compute node itself, including the slave running on it. Here, the master is responsible for monitoring the state of all slave nodes.


When the compute node/slave node fails to respond to several consecutive messages, the master removes the node from the list of available resources and attempts to shut it down.


The master then reports the executor/task failures to the framework scheduler that assigned the tasks and allows the scheduler to handle them according to its configured policy. Typically, the framework restarts the tasks on a new slave node, assuming it receives and accepts a suitable resource offer from the master.

Executor/task. Similar to compute node/slave failure, the master reports an executor/task failure to the framework scheduler that assigned the task and allows the scheduler to handle the failure according to its configured policy. Typically, the framework restarts the task on a new slave node after receiving and accepting a suitable resource offer from the master.

Resource allocation in Mesos

An important part of what makes Apache Mesos a top datacenter resource manager is its ability, like a traffic officer, to direct many different types of applications at once. In this article we will dig into Mesos resource allocation and discuss how Mesos balances fair resource sharing against the demands of the applications it serves. Before starting, readers who have not yet read the earlier parts of this series may want to do so first: the first is an overview of Mesos, the second describes its two-level architecture, and the third covers data storage and fault tolerance.

We will explore the Mesos resource allocation module to see how it decides which resource offers to send to which frameworks, and how it reclaims resources when necessary. Let's first look at the Mesos task-scheduling process:


As we know from the earlier description of the two-level architecture, the Mesos master first collects information about available resources from the slave nodes and then presents those resources, in the form of resource offers, to the frameworks registered with it.

A framework can accept or reject a resource offer based on whether it satisfies the resource constraints of its tasks. Once an offer is accepted, the framework works with the master to schedule the tasks and run them on the corresponding slave nodes in the data center.
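To make the offer cycle concrete, below is a minimal sketch in Python of the decision a framework scheduler makes for each offer. It is simplified pseudologic rather than the actual Mesos scheduler API, and the task list and resource-matching helper are hypothetical.

    # Simplified sketch of a framework scheduler's offer handling:
    # accept an offer if pending tasks fit into it, otherwise decline it.

    PENDING_TASKS = [
        {"name": "task-1", "cpus": 1.0, "mem_mb": 4096},   # hypothetical tasks
        {"name": "task-2", "cpus": 1.0, "mem_mb": 4096},
    ]

    def fits(offer, task):
        """True if the offer still contains enough CPU and memory for the task."""
        return offer["cpus"] >= task["cpus"] and offer["mem_mb"] >= task["mem_mb"]

    def handle_offer(offer):
        """Pick the pending tasks that fit this offer and accept it with them
        (the master then launches them on the offering slave); otherwise decline."""
        chosen = []
        for task in list(PENDING_TASKS):
            if fits(offer, task):
                chosen.append(task)
                PENDING_TASKS.remove(task)
                offer = {"cpus": offer["cpus"] - task["cpus"],
                         "mem_mb": offer["mem_mb"] - task["mem_mb"]}
        if chosen:
            return ("ACCEPT", chosen)   # tasks to launch on the offered slave
        return ("DECLINE", [])          # resources go back to the allocator

    print(handle_offer({"cpus": 4.0, "mem_mb": 8192}))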

How resource offers are made is decided by the resource allocation module, which lives in the master. The allocation module determines the order in which frameworks receive resource offers, while ensuring that resources are shared equitably among inherently greedy frameworks. In homogeneous environments, such as a Hadoop cluster, one of the most widely used fair-share allocation algorithms is max-min fairness. Max-min fairness maximizes the minimum allocation given to any user, ensuring that every user receives a fair share of the resources it needs; for a simple illustration of how it works, see Example 1 on the max-min fair share algorithm page. As noted, this usually works well in a homogeneous environment, where resource demands fluctuate little and the resource types involved are CPU, memory, network bandwidth, and I/O. Resource allocation becomes harder, however, when you are scheduling resources across a data center with heterogeneous resource demands. For example, what is a fair allocation when each of user A's tasks requires 1 CPU core and 4 GB of memory, while each of user B's tasks requires 3 CPU cores and 1 GB of memory? When user A's tasks are memory-intensive and user B's tasks are CPU-intensive, how can a bundle of resources be divided fairly between them?
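For readers who want to see the homogeneous case concretely, here is a minimal Python sketch of max-min fairness for a single resource type (in the style of progressive filling). The capacity and demands are made-up numbers, and this illustrates the general algorithm rather than any Mesos code.

    # Minimal max-min fairness for a single resource type: small demands are
    # fully satisfied, and the remaining capacity is split evenly among the rest.
    def max_min_fair(capacity, demands):
        allocation = {user: 0.0 for user in demands}
        unsatisfied = dict(demands)
        remaining = float(capacity)
        while unsatisfied and remaining > 1e-9:
            share = remaining / len(unsatisfied)             # equal split of what is left
            for user, demand in list(unsatisfied.items()):
                grant = min(share, demand - allocation[user])
                allocation[user] += grant
                remaining -= grant
                if allocation[user] >= demand - 1e-9:        # demand fully met
                    del unsatisfied[user]
        return allocation

    # Hypothetical example: 10 units of capacity and three users demanding 2, 4, and 10.
    # The result is A = 2, B = 4, C = 4.
    print(max_min_fair(10, {"A": 2, "B": 4, "C": 10}))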

Because Mesos is dedicated to managing resources in heterogeneous environments, it implements a pluggable resource-allocation-module architecture that lets users implement the allocation policy and algorithm best suited to their needs. For example, a user could implement weighted max-min fairness so that a specified framework receives more resources than the others. Out of the box, Mesos ships with a strict-priority resource allocation module and a modified fair-share resource allocation module. The strict-priority module implements an algorithm that gives a framework priority so that it always receives and can accept offers sufficient to meet its task requirements. This guarantees resources for critical applications, at the cost of limiting dynamic resource sharing in Mesos and potentially starving other frameworks.

For these reasons, most users default to DRF (Dominant Resource Fairness), Mesos's modified fair-share algorithm, which is better suited to heterogeneous environments.

DRF, like Mesos itself, comes out of Berkeley's AMPLab and is implemented as Mesos's default resource allocation policy.

Readers can find DRF's original paper here and here. In this article I will summarize the main points and provide some examples that should make DRF clearer. Let's get started.

The goal of DRF is to ensure that each user, that is, each framework in Mesos, receives a fair share of the resource it needs most in a heterogeneous environment. To understand DRF, we need the concepts of dominant resource and dominant share. A framework's dominant resource is the resource type it needs most (CPU, memory, and so on), measured as a share of the total resources available in the offer. For example, the dominant resource of a framework running compute-intensive tasks is CPU, while the dominant resource of a framework whose tasks rely on in-memory computation is memory. As resources are allocated to frameworks, DRF tracks each framework's share of each resource type; the highest of these shares is the framework's dominant share. The DRF algorithm uses the dominant shares of all registered frameworks to ensure that each framework receives a fair share of its dominant resource.

Is the concept too abstract? Let's illustrate with an example. Suppose we have a resource offer consisting of 9 CPU cores and 18 GB of RAM. Each of framework 1's tasks requires (1 CPU, 4 GB RAM), and each of framework 2's tasks requires (3 CPUs, 1 GB RAM).

Each of framework 1's tasks consumes 1/9 of the total CPU and 2/9 of the total memory, so framework 1's dominant resource is memory. Similarly, each of framework 2's tasks consumes 1/3 of the total CPU and 1/18 of the total memory, so framework 2's dominant resource is CPU. DRF tries to give each framework an equal dominant share of its dominant resource. In this example, DRF coordinates with the frameworks to allocate resources as follows: framework 1 runs three tasks for a total allocation of (3 CPUs, 12 GB RAM), and framework 2 runs two tasks for a total allocation of (6 CPUs, 2 GB RAM).

At this point each framework ends up with the same dominant share, 2/3 (about 67%), of its dominant resource (memory for framework 1, CPU for framework 2), and after these allocations there are not enough free resources to run additional tasks. Note that if framework 1 only needed to run two tasks, the remaining resources would be offered to framework 2 and any other registered frameworks.
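A quick sanity check of the arithmetic above, as a minimal sketch (the numbers are exactly those of the example):

    # Dominant-share arithmetic for the example: 9 CPUs and 18 GB in total;
    # framework 1 runs 3 tasks of (1 CPU, 4 GB), framework 2 runs 2 tasks of (3 CPUs, 1 GB).
    total_cpu, total_mem = 9.0, 18.0
    f1_cpu, f1_mem = 3 * 1, 3 * 4    # (3 CPUs, 12 GB)
    f2_cpu, f2_mem = 2 * 3, 2 * 1    # (6 CPUs, 2 GB)
    f1_dominant = max(f1_cpu / total_cpu, f1_mem / total_mem)   # memory share: 12/18
    f2_dominant = max(f2_cpu / total_cpu, f2_mem / total_mem)   # CPU share: 6/9
    print(round(f1_dominant, 2), round(f2_dominant, 2))         # 0.67 0.67, i.e. about 67% each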


So how does DRF compute these results? As mentioned earlier, the DRF allocation module tracks the resources allocated to each framework and each framework's dominant share. At each step, DRF sends a resource offer to the framework with the lowest dominant share among all frameworks that have tasks to run. The framework accepts the offer if it contains enough resources to run one of its tasks. Using the example from the DRF paper cited above, let's walk through the DRF algorithm step by step. For simplicity, the example ignores the fact that resources are returned to the pool when short tasks finish, and it assumes that each framework has an unlimited number of tasks to run and that every resource offer is accepted.

Recall the example above: the offer consists of 9 CPU cores and 18 GB of memory, each of framework 1's tasks requires (1 CPU, 4 GB RAM), and each of framework 2's tasks requires (3 CPUs, 1 GB RAM). Each framework 1 task consumes 1/9 of the total CPU and 2/9 of the total memory, so framework 1's dominant resource is memory. Each framework 2 task consumes 1/3 of the total CPU and 1/18 of the total memory, so framework 2's dominant resource is CPU.


Each row in the table above provides the following information:

Framework chosen -- the framework that receives the latest resource offer.

Resource shares -- the total resources (CPU and memory) the framework has accepted at a given time, as a fraction of the total available.

Dominant share -- the framework's share of its dominant resource at a given time, as a fraction of the total.

Dominant share % -- the framework's dominant share expressed as a percentage.

CPU allocation -- the total CPU accepted by all frameworks at a given time.

RAM allocation -- the total memory accepted by all frameworks at a given time.

Note that in each row the lowest dominant share is shown in bold so it is easy to spot.

Initially, both frameworks have a dominant share of 0%. We assume DRF picks framework 2 first; we could just as well have assumed framework 1, and the end result would be the same.

Framework 2 receives the offer and runs a task, so its dominant resource becomes CPU and its dominant share rises to 33%.

Since framework 1's dominant share is still 0%, it receives the next offer and runs a task, bringing its dominant share (of memory) to 22%.

Because framework 1 still has the lower dominant share, it receives the next offer as well and runs another task, raising its dominant share to 44%.

DRF then sends a resource offer to framework 2, because it now has the lower dominant share.

The process continues until no new task can be run with the available resources; in this case, the CPU resources are saturated.

The process then repeats with a new set of resource offers.
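The following is a minimal Python sketch of this walkthrough. It illustrates the DRF idea from the paper rather than Mesos's actual allocator code, and it makes the same simplifying assumptions as the walkthrough: no resources are returned, every offer is accepted, and ties in dominant share are broken arbitrarily (the text picked framework 2 first, but the end result is the same).

    # Minimal DRF walkthrough: repeatedly offer to the framework with the lowest
    # dominant share until no framework's next task fits the remaining resources.
    TOTAL = {"cpus": 9.0, "mem": 18.0}
    TASKS = {"framework1": {"cpus": 1.0, "mem": 4.0},   # per-task demand
             "framework2": {"cpus": 3.0, "mem": 1.0}}

    allocated = {fw: {"cpus": 0.0, "mem": 0.0} for fw in TASKS}
    remaining = dict(TOTAL)

    def dominant_share(fw):
        """Highest share of any single resource type held by the framework."""
        return max(allocated[fw][r] / TOTAL[r] for r in TOTAL)

    while True:
        # Only frameworks whose next task still fits can use an offer.
        candidates = [fw for fw in TASKS
                      if all(remaining[r] >= TASKS[fw][r] for r in TOTAL)]
        if not candidates:
            break                          # some resource is saturated; stop
        fw = min(candidates, key=dominant_share)
        for r in TOTAL:                    # the framework accepts and launches one task
            allocated[fw][r] += TASKS[fw][r]
            remaining[r] -= TASKS[fw][r]
        print(fw, allocated[fw], "dominant share = %.0f%%" % (100 * dominant_share(fw)))

    # Ends with framework1 holding (3 CPUs, 12 GB) and framework2 holding (6 CPUs, 2 GB):
    # both dominant shares are about 67%, and the CPUs are saturated.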

Note that a resource allocation module can implement weighted DRF to favor one framework or a set of frameworks, and, as mentioned earlier, custom modules can be created to provide organization-specific allocation policies.

In general, most tasks today are short-lived, and Mesos can simply wait for them to finish and reallocate the resources. However, a cluster can also be occupied by long-running tasks, for example ones belonging to a hung job or a misbehaving framework.

It is worth noting that, in order to free resources quickly enough, the resource allocation module has the ability to revoke tasks. Mesos first tries to revoke a task gracefully: it sends a request to the executor to kill the specified task and gives the executor a grace period to clean up. If the executor does not respond to the request, the allocation module kills the executor and all of its tasks.

An allocation policy can protect specified tasks from revocation by giving a framework a guaranteed allocation. If a framework is below its guaranteed allocation, Mesos will not kill that framework's tasks.

There is more to learn about Mesos resource allocation, but I will stop here. Next, I am going to talk about something different: the Mesos community. I believe this is an important topic to consider, since open source consists not only of technology but also of community.

After the community article, I will write a step-by-step tutorial on installing Mesos and creating and using a framework. After that hands-on piece, I will come back to some more in-depth topics, such as how frameworks interact with the master and how Mesos works across multiple data centers.
