Mesos: A cluster scheduling and management system to disrupt big data analytics


As described in the "Mesos: Motivation" section earlier, the main goal of Mesos is to manage cluster resources across different frameworks (or application stacks). For example, consider a business that needs to run Hadoop, Storm, and Spark simultaneously on the same physical cluster. Existing schedulers cannot achieve such fine-grained resource sharing across frameworks. Hadoop's YARN scheduler is a central scheduler that allows multiple frameworks to run in a single cluster, but it becomes difficult to apply framework-specific algorithms or scheduling strategies, because there is only one scheduling algorithm shared by all frameworks. For example, MPI uses gang scheduling, while Spark uses delay scheduling; running both on the same cluster leads to conflicting supply and demand for resources. Another option is to physically partition the cluster into smaller clusters and run the different frameworks independently on them. Yet another way is to assign a set of virtual machines to each framework. However, as Regola and Ducom point out, virtualization is considered a performance bottleneck, especially in high-performance computing (HPC) systems. This is the scenario where Mesos fits: it allows users to manage cluster resources across frameworks.

Mesos is a two-level scheduler. At the first level, Mesos offers a certain amount of resources (in the form of containers) to each framework. At the second level, once a framework has received the resources, it runs its own scheduling algorithm to assign its tasks to the resources offered by Mesos. Compared with a central scheduler such as Hadoop YARN, it may not use cluster resources quite as efficiently, but it provides flexibility; for example, multiple instances of a framework can run in the same cluster, which is not possible with any of the existing schedulers. Even Hadoop YARN is only just attempting to support third-party frameworks such as MPI on the same cluster. More importantly, as new frameworks appear, such as Samza, recently open-sourced by LinkedIn, Mesos allows them to be deployed onto an existing cluster in an experimental way, coexisting peacefully with the other frameworks.

Mesos components

The key components of Mesos are its master and slave daemons, which, as shown in Figure 2.5, run on the master and slave nodes of the Mesos cluster, respectively. Each framework hosted on a slave node consists of two parts: an executor process and a scheduler process. A slave node publishes a list of its available resources to the master node, in the form of a list such as <2 CPUs, 8 GB memory>. The master node invokes the allocation module, which decides how many resources to offer to each framework according to the configured policy, and the master then offers those resources to the framework's scheduler. The framework scheduler accepts the offer (or rejects it if it does not meet its requirements) and sends back the list of tasks it needs to run, together with the resources each task needs. The master node sends the tasks and their resource requirements to the slave node, which hands them to the framework's executor, which is responsible for launching the tasks. The remaining resources in the cluster are free to be offered to other frameworks. The offer process then repeats whenever existing tasks complete and resources in the cluster become available again. It is important to note that a framework never describes how many resources it needs; instead, it can reject offers that do not satisfy its requirements. To make this process more efficient, Mesos allows a framework to set its own filters, which the master node always checks before making an offer. In practice, a framework can also use delay scheduling, waiting for some time until it is offered the nodes that hold the data it needs to compute on.

Figure 2.5 Architecture of Mesos

Once resources are offered, they are committed to the framework immediately; it may take some time for the framework to respond to the offer. This ensures that the resources are locked and are available to the framework as soon as it accepts the offer. If the framework does not respond for a long time, the resource manager has the right to revoke the offer.
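
The two-level offer cycle described above can be made concrete with a small sketch. The following Python fragment only illustrates the protocol (offer, accept-or-decline, launch); the class and method names (Offer, FrameworkScheduler, Master, resource_offer) are hypothetical and do not correspond to the official Mesos bindings.

```python
# Illustrative simulation of the Mesos two-level resource-offer cycle.
# All names here (Offer, FrameworkScheduler, Master) are hypothetical;
# they model the protocol, not the real Mesos API.

from dataclasses import dataclass

@dataclass
class Offer:
    slave_id: str
    cpus: float
    mem_mb: int

class FrameworkScheduler:
    """Second level: the framework decides what to do with each offer."""

    def __init__(self, task_cpus, task_mem_mb, pending_tasks):
        self.task_cpus = task_cpus
        self.task_mem_mb = task_mem_mb
        self.pending = pending_tasks

    def resource_offer(self, offer):
        # Reject offers that do not satisfy the framework's requirements.
        if not self.pending or offer.cpus < self.task_cpus or offer.mem_mb < self.task_mem_mb:
            return []  # decline: resources go back to the pool
        task = self.pending.pop(0)
        return [(task, self.task_cpus, self.task_mem_mb)]  # accept: launch one task

class Master:
    """First level: Mesos offers slave resources to registered frameworks."""

    def __init__(self, frameworks):
        self.frameworks = frameworks

    def offer_resources(self, free_resources):
        for slave_id, (cpus, mem_mb) in free_resources.items():
            for fw in self.frameworks:  # allocation policy: simple round-robin here
                launched = fw.resource_offer(Offer(slave_id, cpus, mem_mb))
                for task, c, m in launched:
                    cpus, mem_mb = cpus - c, mem_mb - m
                    print(f"launch {task} on {slave_id} with {c} cpus / {m} MB")

# Example: two frameworks sharing one 4-CPU / 8 GB slave.
spark = FrameworkScheduler(2, 2048, ["spark-stage-0"])
storm = FrameworkScheduler(1, 1024, ["storm-bolt-0"])
Master([spark, storm]).offer_resources({"slave-1": (4, 8192)})
```

In this toy run the first framework accepts part of the slave's resources and the remainder is immediately re-offered to the second framework, which is exactly the fine-grained sharing the text describes.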

Resource allocation

The resource allocation module is pluggable. At present there are two implementations; one of them is the Dominant Resource Fairness (DRF) policy proposed by Ghodsi et al. (2011). The fair scheduler in Hadoop (https://issues.apache.org/jira/browse/HADOOP-3746) allocates resources at the granularity of fixed-size node partitions (also known as slots). This can be inefficient, especially in heterogeneous computing environments with modern multicore processors. DRF is a generalization of max-min fairness to heterogeneous resources. Note that max-min fairness is a common algorithm with many variants, such as round-robin and weighted fair queueing, but it is usually applied to homogeneous resources. The DRF algorithm applies the max-min policy to each user's dominant resource (the dominant resource of a CPU-intensive job is CPU, while the dominant resource of an I/O-intensive job is bandwidth). Some interesting properties of the DRF algorithm are listed below, followed by a small worked sketch of the allocation loop:

    • It is fair, and one guarantee that attracts users is that no user is worse off than if all resources were statically and evenly partitioned among the users.

    • There is no benefit for users to misrepresent resource requirements.

    • It is Pareto efficient, in the sense that system resource utilization is maximized subject to the allocation constraints.
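
The sketch below walks through DRF's core loop, following the description in Ghodsi et al. (2011): repeatedly give a task's worth of resources to the user with the lowest dominant share. The function names and the simple loop structure are illustrative, not the Mesos allocation-module implementation.

```python
# A minimal sketch of Dominant Resource Fairness (DRF), following the
# Ghodsi et al. (2011) description. Names and loop structure are illustrative.

def drf_allocate(total, demands, rounds):
    """total: resource -> capacity; demands: user -> per-task demand dict."""
    allocated = {u: {r: 0.0 for r in total} for u in demands}
    used = {r: 0.0 for r in total}
    for _ in range(rounds):
        # Dominant share of a user = max over resources of allocated / total.
        def dominant_share(u):
            return max(allocated[u][r] / total[r] for r in total)
        # Pick the user with the lowest dominant share (max-min on dominant shares).
        user = min(demands, key=dominant_share)
        demand = demands[user]
        # Stop once the cluster cannot fit another task for that user.
        if any(used[r] + demand[r] > total[r] for r in total):
            break
        for r in total:
            allocated[user][r] += demand[r]
            used[r] += demand[r]
    return allocated

# Example from the DRF paper: 9 CPUs and 18 GB of memory; user A's tasks need
# <1 CPU, 4 GB> (memory-dominant), user B's tasks need <3 CPUs, 1 GB> (CPU-dominant).
total = {"cpu": 9, "mem": 18}
demands = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
print(drf_allocate(total, demands, rounds=20))
# Converges to 3 tasks for A and 2 for B: both dominant shares equal 2/3.
```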

Frameworks can query, through an API call, the amount of resources they are guaranteed to be allocated. This is useful when Mesos has to kill some user tasks: as long as a framework's allocation stays within its guaranteed bounds, its processes will not be killed by Mesos; if it exceeds the threshold, Mesos may kill its processes.
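
The revocation rule above amounts to a simple check. The sketch below is only an illustration of that policy; the dictionary fields and the one-CPU-per-task simplification are made up, and this is not the Mesos source.

```python
# Illustrative check of the guaranteed-allocation rule described above.
# Field names are hypothetical; assumes one CPU per task for simplicity.

def revocable_tasks(frameworks):
    """Return tasks Mesos may kill: only frameworks above their guarantee are candidates."""
    victims = []
    for fw in frameworks:
        if fw["usage_cpus"] > fw["guarantee_cpus"]:
            # Tasks using resources beyond the guarantee may be revoked.
            victims.extend(fw["tasks"][fw["guarantee_cpus"]:])
        # Frameworks within their guarantee are left untouched.
    return victims

frameworks = [
    {"name": "spark", "guarantee_cpus": 4, "usage_cpus": 6,
     "tasks": ["t0", "t1", "t2", "t3", "t4", "t5"]},
    {"name": "mpi", "guarantee_cpus": 4, "usage_cpus": 3,
     "tasks": ["m0", "m1", "m2"]},
]
print(revocable_tasks(frameworks))  # ['t4', 't5'] -- only the over-guarantee tasks
```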

Isolation

Mesos provides isolation using Linux or Solaris containers. Traditional hypervisor-based virtualization technologies, such as the kernel-based virtual machine (KVM), Xen (Barham et al. 2003), or VMware, consist of a virtual machine monitor running on the host that provides full hardware emulation to each virtual machine. In this approach each virtual machine runs its own operating system and is completely isolated from the others. Linux containers are a form of operating-system-level virtualization, which partitions the physical machine's resources by isolating user-space instances. In essence, this approach eliminates the guest operating systems required by hypervisor-based virtualization: a hypervisor works at the hardware abstraction layer, while OS-level virtualization works at the system-call layer. The abstraction presented to users, however, is that each user-space entity runs its own dedicated, independent operating system. Different implementations of OS-level virtualization differ slightly: Linux-VServer builds on chroot, while OpenVZ builds on kernel namespaces. Mesos uses LXC, which handles resource management through cgroups (control groups) and uses kernel namespaces for isolation; a small sketch of cgroup-style limits follows the benchmark results below. Xavier et al. (2013) carried out a detailed performance evaluation, with the following results: [1]

    • In the Linpack benchmark of CPU performance (Dongarra 1987), LXC performs better than Xen.

    • In the STREAM memory benchmark, Xen's overhead is significantly higher than LXC's (close to 30%), while LXC provides near-native performance.

    • In the IOzone disk I/O benchmark, LXC achieves near-native performance for read, re-read, write, and re-write operations, while Xen incurs significant overhead.

    • In the NetPIPE network bandwidth benchmark, LXC is close to native performance, while Xen's overhead is almost 40%.

    • Because it does not run a separate guest operating system, LXC provides weaker isolation than Xen in the Isolation Benchmark Suite (IBS) tests. A particular test known as a fork bomb (which repeatedly creates child processes) showed that LXC could not limit the number of child processes being created.
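
To make the resource-management side of LXC concrete, the sketch below shows how CPU and memory limits can be imposed through the cgroup filesystem, the same kernel mechanism this kind of isolation relies on. The group name "demo_task" and the limit values are made-up examples, and the paths assume a cgroups v1 hierarchy mounted under /sys/fs/cgroup.

```python
# Illustration of cgroups v1 limits of the kind an LXC-based slave applies.
# Group name and limits are arbitrary example values. Requires root.

import os

CPU_SHARES = 512              # relative CPU weight (default is 1024)
MEM_LIMIT = 2 * 1024 ** 3     # 2 GB memory cap

def write(path, value):
    with open(path, "w") as f:
        f.write(str(value))

def limit_task(pid, group="demo_task"):
    cpu_dir = f"/sys/fs/cgroup/cpu/{group}"
    mem_dir = f"/sys/fs/cgroup/memory/{group}"
    os.makedirs(cpu_dir, exist_ok=True)   # creating a directory creates the cgroup
    os.makedirs(mem_dir, exist_ok=True)
    write(f"{cpu_dir}/cpu.shares", CPU_SHARES)            # CPU weight
    write(f"{mem_dir}/memory.limit_in_bytes", MEM_LIMIT)  # memory cap
    # Moving the task's PID into both groups makes the limits take effect.
    write(f"{cpu_dir}/tasks", pid)
    write(f"{mem_dir}/tasks", pid)

if __name__ == "__main__":
    limit_task(os.getpid())
```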

Fault tolerance

Mesos provides fault tolerance for the master node by running multiple masters in a hot-standby configuration coordinated by ZooKeeper (Hunt et al. 2010), and electing a new master as soon as the current one crashes. The state of the master consists of three parts: the list of active slave nodes, the list of active frameworks, and the list of running tasks. A new master can reconstruct this state from the information held by the slave nodes and the framework schedulers. Mesos also reports failures of framework executors and tasks to the corresponding framework, which can handle them independently according to its own policy. Mesos further allows a framework to register multiple schedulers, so that if the main scheduler fails, Mesos can connect to a standby scheduler. However, it is the framework's responsibility to keep the state of its different schedulers synchronized.
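
The master election described above follows ZooKeeper's standard leader-election recipe, sketched below with the third-party kazoo Python client rather than Mesos's own C++ integration. The connection string, election path, and identifier are placeholders, so this only illustrates the hot-standby pattern, not the actual Mesos code.

```python
# Hot-standby master election via ZooKeeper, sketched with the kazoo client.
# The hosts string, election path, and identifier are placeholder values.

import socket
from kazoo.client import KazooClient

def act_as_master():
    # Only the elected leader reaches this point; standby masters block in run().
    print("Elected leader; rebuilding state from slaves and framework schedulers...")
    # ... serve resource offers until this process dies or loses its session ...

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Every master candidate runs the same election; ZooKeeper picks one leader.
election = zk.Election("/mesos/master-election", identifier=socket.gethostname())
election.run(act_as_master)  # blocks until this candidate wins, then calls the function
```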

[1] Readers familiar with the UNIX operating system will recall that chroot is a command that changes the apparent root directory of the current process and its children, creating an environment known as a "chroot jail" that provides file-level isolation.
