Summary of research on cluster scheduling technology


1 Introduction

What is scheduling? In my understanding, scheduling was originally about time. Time, being the only irreversible resource, is generally divided into time slices for use. For a computer, because the CPU is so much faster than everything else, CPU time is scheduled in slices so that multiple tasks appear to run on the same CPU. This concurrency is an illusion: at any given moment the CPU is running a single task.
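
As a toy illustration (the task names and quantum are invented for the example), the following round-robin sketch interleaves several tasks on one CPU, one time slice at a time:

```python
from collections import deque

def round_robin(tasks, quantum=2):
    """Interleave tasks on one CPU: each task runs for at most `quantum`
    time units, then goes to the back of the queue if work remains."""
    queue = deque(tasks.items())              # (name, remaining_time)
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        slice_used = min(quantum, remaining)
        timeline.append((name, slice_used))   # the CPU runs ONE task at a time
        if remaining > slice_used:
            queue.append((name, remaining - slice_used))
    return timeline

print(round_robin({"A": 5, "B": 3, "C": 1}))
# [('A', 2), ('B', 2), ('C', 1), ('A', 2), ('B', 1), ('A', 1)]
```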


Later, to get more work done in the same amount of time, many things had to happen simultaneously. When multiple people or multiple processors cooperate on a common goal, a coordinator is needed. This is a distributed system; within a single data center or a small region, it is a cluster. When a distributed system runs multiple tasks, those tasks inevitably compete for its resources, and the scheduling problem evolves into one of resource scheduling (sharing and allocation).

Meanwhile, multicore computers developed, producing the same problems as distributed systems.

This article summarizes the scheduling mechanisms that appear in single-machine operating systems, C/S, B/S, P2P, distributed systems, and the network layer, focusing on the three stages in the development of cluster scheduling: monolithic scheduling, two-level scheduling, and shared-state scheduling, and comparing the advantages and disadvantages of the three.

2 Cluster scheduling

2.1 Monolithic schedulers: MapReduce

Monolithic scheduling: the scheduling policy is implemented in a single code module, in a single instance, with no parallelism. It is common in the HPC (high-performance computing) world.


Figure 1: Hadoop 1 / MapReduce's monolithic scheduling architecture

A master process called the JobTracker is the central scheduler for all MapReduce tasks (similar to the HPC schedulers sge_qmaster in Grid Engine or pbs_server in Torque). Each node runs a TaskTracker process that manages the tasks on that node (similar to sge_execd in Grid Engine or pbs_mom in Torque on HPC clusters). Each TaskTracker communicates with the JobTracker and accepts its control. Like most resource managers, the JobTracker supports two scheduling policies: Capacity and Fair.

In the JobTracker, resource scheduling and job management are both done in a single process. The disadvantage of this design is poor scalability: first, cluster size is limited; second, new scheduling policies are hard to integrate into the existing code. For example, if a scheduler that previously supported only MapReduce jobs must now also support streaming jobs, embedding the streaming-job scheduling policy into the central scheduler is a difficult task.
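
The single-process design can be caricatured in a few lines. Below is a hedged sketch, with invented names, of a JobTracker-style scheduler: every placement decision flows through one central loop, which is exactly why new policies are hard to bolt on and why the one instance becomes a bottleneck:

```python
# A minimal model of a monolithic (JobTracker-style) scheduler: one
# process, one code path, scheduling every task in the cluster.
# All names here are invented for illustration.

class MonolithicScheduler:
    def __init__(self, slots_per_node):
        self.free_slots = dict(slots_per_node)   # node -> free task slots
        self.pending = []                         # queued tasks

    def submit(self, job_id, n_tasks):
        self.pending += [(job_id, i) for i in range(n_tasks)]

    def schedule(self):
        """Single central loop: every placement decision happens here,
        so every new job type or policy must be patched into this code."""
        placements = []
        for node, free in self.free_slots.items():
            while free and self.pending:
                placements.append((self.pending.pop(0), node))
                free -= 1
            self.free_slots[node] = free
        return placements

s = MonolithicScheduler({"node1": 2, "node2": 1})
s.submit("wordcount", 4)
print(s.schedule())   # three tasks placed, one stays pending
```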

2.2 Statically partitioned schedulers

This is also the common form of scheduling in cloud computing: each scheduler takes full control over a collection of resources and is deployed on a dedicated, statically partitioned subset of the cluster.

In other words, the cluster is divided into fixed parts, each supporting a different kind of workload.
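
A minimal sketch of static partitioning, with invented node and framework names: each framework is bound up front to its own slice of the cluster, so one partition can sit idle while another is saturated:

```python
# Static partitioning, sketched: nodes are bound to frameworks up front.
# A framework can only use its own partition, even when others are idle.
partitions = {
    "mapreduce": ["node1", "node2", "node3"],
    "mpi":       ["node4", "node5"],
}

def nodes_for(framework):
    # No sharing: requests outside the partition are simply impossible.
    return partitions[framework]

print(nodes_for("mpi"))   # ['node4', 'node5'] even if node1-3 are idle
```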

2.3 Two-level scheduling

A straightforward solution to the problems of static partitioning is two-level scheduling: a central coordination component decides how many resources each subcluster should receive, dynamically adjusting the resources allocated to each framework scheduler.

Each framework scheduler does not know the resource usage of the whole cluster; it passively receives resources. The master pushes only the available resources to the frameworks, and each framework chooses to use or decline them. Once a framework (such as a JobTracker) receives new resources, it further allocates them to its internal applications (individual MapReduce jobs), forming two levels of scheduling.

Two-level schedulers have two drawbacks: a framework cannot see the real-time resource usage of the whole cluster, and the design relies on pessimistic, fine-grained locking of resources.
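
The pattern just described can be sketched abstractly (invented names; Mesos-specific details and error handling are omitted): the master offers free resources to one framework at a time, effectively holding a lock, and each framework accepts or declines without ever seeing the rest of the cluster:

```python
# The two-level pattern in miniature: the master offers resources to ONE
# framework at a time (a pessimistic lock), and the framework accepts or
# declines without seeing the rest of the cluster. Names are invented.

def master_loop(free_resources, frameworks):
    placements = []
    for offer in free_resources:          # offer = (node, cpus, mem)
        for fw in frameworks:             # offered serially, never in parallel
            accepted = fw.on_offer(offer) # second-level (framework) decision
            if accepted:
                placements.append((fw.name, offer, accepted))
                break                     # resource stays locked until answered
    return placements

class GreedyFramework:
    def __init__(self, name, cpus_needed):
        self.name, self.needed = name, cpus_needed
    def on_offer(self, offer):
        node, cpus, mem = offer
        if self.needed and cpus >= 1:
            self.needed -= 1
            return f"task-on-{node}"      # accept: run one task here
        return None                       # decline: wait for a better offer

fws = [GreedyFramework("fw1", 1), GreedyFramework("fw2", 1)]
print(master_loop([("n1", 2, 4), ("n2", 1, 2)], fws))
# [('fw1', ('n1', 2, 4), 'task-on-n1'), ('fw2', ('n2', 1, 2), 'task-on-n2')]
```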

2.3.1 YARN

YARN is billed as the Apache Hadoop next-generation compute platform, and it is the biggest difference between Hadoop 1 and Hadoop 2.


The basic idea of Hadoop 2 (MRv2) is to split the JobTracker's functions into two separate processes: a global ResourceManager for resource management, and a per-application ApplicationMaster for monitoring and scheduling. An application can be a single MapReduce job or a DAG of jobs.

The Hadoop 2 (MRv2) API is backward compatible: existing MapReduce tasks can run on Hadoop 2 (MRv2) after recompilation.

Figure 2: YARN's "two-level" scheduling architecture

In YARN's design, a cluster can have multiple ApplicationMasters, and each ApplicationMaster can own multiple containers (for example, the figure shows two ApplicationMasters, red and blue; red has three containers and blue has one). The key point is that the ApplicationMaster is not part of the ResourceManager, which reduces the load on the central scheduler, and each ApplicationMaster can dynamically adjust the containers under its control.

The ResourceManager is a pure scheduler: it neither monitors nor tracks the execution state of applications, nor is it responsible for restarting failed tasks. Its sole purpose is to manage the available resources (at container granularity) among multiple applications, and it is the ultimate authority on resource allocation. If the ResourceManager is the master, the NodeManagers are the slaves. The ResourceManager supports pluggable scheduling policies: the CapacityScheduler and the FairScheduler are such plugins.

The ApplicationMaster is responsible for job submission: it negotiates with the ResourceManager to obtain resources in the form of containers (negotiating containers appropriate to its application's needs) and then tracks the running state of the application's processes. ApplicationMasters are application-specific; different ApplicationMasters can be written for different applications. For example, YARN includes a distributed shell framework that runs a shell script on multiple nodes of the cluster. The ApplicationMaster also provides automatic restart of failed tasks. It can be understood as an interface library that an application implements itself.

ApplicationMasters request and manage containers. A container specifies how many resources (memory, CPU, and so on) an application may use on a single host, similar to a resource pool in an HPC scheduler. Once an ApplicationMaster obtains resources from the ResourceManager, it contacts a NodeManager to launch a particular task. For example, with the MapReduce framework these tasks are mapper and reducer processes; other frameworks launch different processes.

The NodeManager is the per-machine agent: it manages the containers on its machine, monitors the available resources (CPU, memory, disk, network), and reports resource status to the ResourceManager.
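
The division of labor among these three components can be sketched as a toy model. This is not the real YARN API (the actual AM/RM protocol is far richer); the classes and names below are invented purely to show who decides what:

```python
# A toy model of YARN's split of responsibilities -- NOT the real YARN
# API. The ResourceManager only hands out containers, the
# ApplicationMaster decides what runs in them, and the NodeManager
# launches processes and tracks node resources.

class NodeManager:
    def __init__(self, node, mem_mb):
        self.node, self.free_mem = node, mem_mb
    def launch(self, container, command):
        print(f"[{self.node}] container {container}: running {command!r}")

class ResourceManager:
    """Pure scheduler: allocates containers, never tracks task state."""
    def __init__(self, node_managers):
        self.nms = node_managers
    def allocate(self, mem_mb):
        for nm in self.nms:                        # first fit, for simplicity
            if nm.free_mem >= mem_mb:
                nm.free_mem -= mem_mb
                return nm, f"container_{nm.node}_{nm.free_mem}"
        return None, None

class ApplicationMaster:
    """Per-application: negotiates containers, then drives its own tasks."""
    def __init__(self, rm):
        self.rm = rm
    def run_task(self, command, mem_mb):
        nm, container = self.rm.allocate(mem_mb)   # 1. negotiate with the RM
        if nm:
            nm.launch(container, command)          # 2. start the task via the NM
        return container

rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 1024)])
am = ApplicationMaster(rm)
am.run_task("mapper.py", 1024)    # e.g. a MapReduce mapper process
am.run_task("reducer.py", 1024)
```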

On the surface, YARN is also two-level scheduling: resource requests flow from ApplicationMasters to a central global scheduler, which allocates resources across the cluster's nodes according to application needs. But a YARN ApplicationMaster provides only a task management service, not a real second-level scheduler, so YARN is essentially still a monolithic scheduling architecture. So far, YARN supports the scheduling of only one resource type (memory).


Figure 3: YARN architecture

2.3.2 Mesos

Mesos is an open-source platform for fine-grained resource sharing between multiple diverse cluster computing frameworks.

Many distributed computing frameworks have emerged: Hadoop, Giraph, MPI, and so on, each managing its own compute cluster. These frameworks often divide work into small tasks, which increases cluster utilization and keeps computation close to the data. But the frameworks were developed independently, and resources cannot be shared between them.


We want to run multiple application frameworks on the same cluster. Mesos does this by providing a common resource-sharing layer on top of which several different application frameworks can run.

But we do not want to achieve this by simple static partitioning.


The biggest benefit of Mesos is increased cluster utilization: production and experimental environments are well isolated, and multiple frameworks run concurrently. Second, data can be shared across frameworks. Third, maintenance costs are reduced. The biggest challenge for Mesos is supporting a large number of application frameworks, because each framework has different scheduling requirements: programming model, communication paradigm, task dependencies, and data placement. In addition, the Mesos scheduling system must scale to thousands of nodes running millions of tasks. And because every task in the cluster depends on Mesos, the scheduling system must be fault-tolerant and highly available.

Mesos's design decision (design philosophy): do not build a centralized, all-encompassing scheduler that takes application requirements, available resources, and organizational policy into a global scheduling strategy for all tasks. Instead, delegate scheduling to the application frameworks (the functions of scheduling and execution are handed to the frameworks). Mesos claims that such a design may not achieve globally optimal scheduling, but it works surprisingly well in practice, letting frameworks meet their goals nearly perfectly. It also claims two further advantages: the frameworks can evolve independently, and Mesos itself stays simple.

The main components of Mesos are the master daemon, the slave daemons, and the Mesos applications (also called frameworks) that run on top of the slaves. The master decides how many resources to offer each framework according to a policy (fair scheduling, priority scheduling, etc.); its modular architecture supports a variety of policies. A resource offer is an abstract representation of resources that lets a framework instantiate and run tasks on cluster nodes; each resource offer is a list of free resources distributed across multiple nodes. Mesos decides how many resources can be offered to each framework according to its allocation policy (such as fair scheduling), and the framework decides which of the offered resources to accept and which tasks to run on them. A framework running on Mesos consists of two parts: a scheduler, which registers with the Mesos master, and an executor agent, which runs on the slaves. The master decides how many resources to offer each registered framework; the framework's scheduler decides which of those resources to use. Once scheduling is complete, the framework's scheduler tells Mesos which resources it accepts, and Mesos launches the framework's tasks on the corresponding slaves. Mesos works best when tasks are small and short-lived, so each task frequently releases the resources it holds.


In Mesos, a central resource allocator dynamically partitions the cluster, allocating resources to the different scheduler frameworks. Resources are handed out between frameworks in the form of "offers", which represent the currently available resources. To avoid conflicting requests for the same resource from different frameworks, the allocator offers a given resource to only one framework at a time; during a scheduling decision it essentially plays the role of a lock. Concurrency control in Mesos scheduling is therefore a pessimistic strategy.

The master uses the resource-offer mechanism to share resources across multiple frameworks at fine granularity. Each resource offer is a list of free resources spread across multiple slaves. The master decides how many resources to offer each framework based on fairness, priority, or a custom policy supplied as a pluggable module.

Reject mechanism: a framework may reject the resources Mesos offers. To keep the interface simple, Mesos does not let a framework specify constraints on its resource requirements; instead, the framework rejects any offer that does not meet its needs and waits. Mesos claims that this reject mechanism can support arbitrarily complex resource constraints while preserving scalability and simplicity.

One problem with the reject mechanism is that a framework may wait a long time before receiving an offer that meets its needs, and because Mesos does not know the frameworks' requirements, it may send the same offer to several frameworks in turn. Hence the filter mechanism: a framework uses filters to describe the kinds of resources it wants to be offered (a framework can set a filter indicating that it will always reject a certain class of resources). A framework therefore needs no visibility into the whole cluster; it only sees the nodes in its offers. The drawback of this strategy is that it cannot support preemption or policies based on whole-cluster state: a framework does not know what resources have been allocated to other frameworks. Mesos does provide a resource-reservation strategy to support gang scheduling; for example, a framework can specify a whitelist of nodes it can run on. Is this not just dynamic cluster partitioning? Mesos further explains that the filter is merely a performance optimization of the resource-offer model: the framework still makes the final decision about which tasks run on which nodes.


Mesos task scheduling process (a sketch of this cycle follows the list):

1. Slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master's allocation module then decides, according to its policy, that framework 1 should be offered all available resources.

2. The master sends the available resources on slave 1 to framework 1 in the form of a resource offer.

3. The framework's scheduler replies to the master: run two tasks on the slave, the first using <2 CPUs, 1 GB RAM> and the second using <1 CPU, 2 GB RAM>.

4. Finally, the master sends the tasks to the slave, which assigns the appropriate resources to the framework's executor; the executor then launches the two tasks. Since 1 CPU and 1 GB of memory on slave 1 remain unallocated, the allocation module can now offer them to framework 2.
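
The four steps above can be replayed as a small sketch. This is a toy model of the offer cycle, not the real Mesos API; all class, framework, and task names are invented:

```python
# The four steps above as a toy model of the Mesos offer protocol
# (not the real Mesos API). Resources are (cpus, mem_gb) pairs.

class Slave:
    def __init__(self, name, cpus, mem):
        self.name, self.free = name, (cpus, mem)

class Framework:
    def __init__(self, name, wanted):
        self.name, self.wanted = name, wanted       # list of (task, (cpus, mem))
    def resource_offer(self, offer):
        _, (cpus, mem) = offer
        accepted = []
        for task, (c, m) in list(self.wanted):      # accept what fits, decline the rest
            if c <= cpus and m <= mem:
                accepted.append((task, (c, m)))
                self.wanted.remove((task, (c, m)))
                cpus, mem = cpus - c, mem - m
        return accepted

class Master:
    def __init__(self, slaves, frameworks):
        self.slaves, self.frameworks = slaves, frameworks
    def offer_cycle(self):
        for slave in self.slaves:                   # step 1: slave reports free resources
            for fw in self.frameworks:              # allocation module picks a framework
                offer = (slave.name, slave.free)    # step 2: send the resource offer
                tasks = fw.resource_offer(offer)    # step 3: framework replies with tasks
                if tasks:
                    for task, (c, m) in tasks:      # step 4: launch tasks, deduct resources
                        slave.free = (slave.free[0] - c, slave.free[1] - m)
                        print(f"{fw.name}: launch {task} on {slave.name} <{c} CPUs, {m} GB>")
                    break                           # leftover resources re-offered next cycle

fw1 = Framework("framework1", [("task1", (2, 1)), ("task2", (1, 2))])
fw2 = Framework("framework2", [("task3", (1, 1))])
Master([Slave("slave1", 4, 4)], [fw1, fw2]).offer_cycle()
# framework1 launches task1 <2 CPUs, 1 GB> and task2 <1 CPUs, 2 GB>;
# the remaining <1 CPU, 1 GB> can be offered to framework2 in the next cycle.
```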

In addition, the Mesos master's allocation module is pluggable, and ZooKeeper is used to implement failover of the Mesos master.

Mesos's API (figure omitted)


2.4 Shared-state scheduling

In shared-state scheduling, every scheduler has access to the entire cluster state, and optimistic concurrency control handles the case where multiple schedulers update that state simultaneously. Shared-state scheduling solves the two problems of two-level scheduling: the limited parallelism of pessimistic concurrency control, and the frameworks' limited visibility into cluster resources.

The cost of optimistic concurrency control is that when the optimistic assumption fails, the scheduling work must be redone.

To overcome the two drawbacks of two-level scheduling (the focus of the Omega paper), Google developed the next-generation resource management system Omega. Omega is a shared-state scheduler: it reduces the centralized resource-scheduling module of a two-level scheduler to persistent shared data (state) plus validation code for that data, where the "shared data" is in fact the real-time resource usage information of the entire cluster. Once shared data is introduced, concurrent access to it becomes the core of the system design. Omega uses multi-version concurrency control (MVCC, also called "optimistic locking") from traditional databases, which greatly increases its concurrency. There is no central resource allocator in Omega; each scheduler makes its own resource allocation decisions.

2.4.1 Omega

The disadvantage of monolithic scheduling is that it is hard to add scheduling policies and specialized implementations, and it cannot scale as the cluster grows. Two-level scheduling does provide flexibility and parallelism, but in practice its resource visibility is conservative and its locking pessimistic, making it a poor fit for critical jobs and for jobs that need access to the whole cluster's resources. Omega's solution is a new parallel scheduling architecture: shared state with lock-free optimistic concurrency control, for scalability and performance.

There is no central resource allocator in Omega; all allocation decisions are made by the application schedulers themselves. Omega maintains a master copy of the resource allocation state, called the cell state. Each application scheduler keeps a local, frequently refreshed copy of the cell state that it uses to make scheduling decisions. A scheduler can see all the resources in the cluster and, subject to permissions and priorities, claim whatever resources it wants. When a scheduler has decided on a resource assignment, it updates the shared cell state atomically: most of the time the commit succeeds (this is the optimistic assumption), and when a conflict occurs the scheduling decision fails transactionally. Whether the commit succeeds or fails, the scheduler re-syncs its local cell state with the shared cell state and, if necessary, restarts the scheduling process.
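
A minimal sketch of this optimistic commit cycle, with invented names: each scheduler plans against its local copy of the cell state and commits with a version check; a conflicting commit fails transactionally, after which the scheduler re-syncs and retries:

```python
# Omega-style optimistic concurrency, sketched (invented names): plan
# against a local snapshot, commit with a version check, retry on conflict.
import copy

class CellState:
    def __init__(self, free):
        self.version = 0
        self.free = free                       # node -> free CPUs

    def snapshot(self):
        return self.version, copy.deepcopy(self.free)

    def commit(self, based_on_version, claims):
        """Atomically apply `claims` (node -> cpus) iff nobody else
        committed since the snapshot; otherwise fail transactionally."""
        if based_on_version != self.version:
            return False                       # optimistic assumption failed
        if any(self.free.get(n, 0) < c for n, c in claims.items()):
            return False                       # resources gone: fail as a unit
        for node, cpus in claims.items():
            self.free[node] -= cpus
        self.version += 1
        return True

def schedule(cell, cpus_needed):
    while True:                                # retry loop: re-sync, re-plan
        version, local = cell.snapshot()       # local copy of the cell state
        claims = {}
        for node, free in local.items():       # plan with full cluster visibility
            take = min(free, cpus_needed - sum(claims.values()))
            if take:
                claims[node] = take
            if sum(claims.values()) == cpus_needed:
                break
        if cell.commit(version, claims):       # usually succeeds on first try
            return claims

cell = CellState({"n1": 4, "n2": 2})
print(schedule(cell, 5))                       # e.g. {'n1': 4, 'n2': 1}
```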

Omega's schedulers run fully in parallel and never wait for one another. To avoid starvation caused by conflicts, an Omega scheduler uses incremental commits: accept everything except the conflicting changes, which avoids resource hoarding. If all-or-nothing semantics are required, gang scheduling can be used (either all of a job's tasks are scheduled together or none are, and the scheduler must try to schedule the entire job again). Gang scheduling waits for all resources to be ready before committing the whole job, which can cause resources to pile up idle.

Each application scheduler can implement its own scheduling policy, but the schedulers must agree on conventions for resource allocation and task priority. In two-level scheduling this agreement comes easily from the central resource manager. In Omega, strict fairness is not a critical requirement; what matters is meeting business needs, so each application scheduler is instead given a resource cap and a task submission limit. This point deserves further discussion.


2.5 Comparative analysis

The main goal of cluster scheduling is to improve cluster utilization and efficiency.


Monolithic schedulers use a single central scheduling algorithm for all jobs. The disadvantages are that new scheduling policies are hard to add and that the scheduler cannot scale with the growth of the cluster.

Two-level schedulers use a dynamic resource manager to hand compute or storage resources to multiple parallel scheduling frameworks, each of which holds a subset of the cluster's resources. Why a "dynamic" resource manager? It is dynamic relative to static cluster partitioning: instead of statically dividing the cluster into zones that serve different applications, the resource manager performs that partitioning on the fly. Because two-level scheduling cannot handle hard-to-place critical jobs and cannot make decisions based on the state of the entire cluster, Google introduced the following architecture.

Shared-state schedulers use an optimistic concurrency control algorithm; two-level schedulers are, by contrast, essentially pessimistic. This architecture is used in Omega, Google's next-generation scheduling system.

In Omega's view, Mesos's offer mechanism is essentially dynamic filtering: the Mesos master shows each application framework only a subset of the resource pool. That subset could of course be expanded to the full set, approaching shared state, but the interface would still embody a pessimistic strategy. This is open to debate.

In Omega's view, a YARN ApplicationMaster provides only a task management service, not a true second-level scheduler. Furthermore, YARN so far supports only one resource type. And although a YARN ApplicationMaster can request resources on a particular node, the details of that policy are unclear.

3 Future work

Resource-aware scheduling.

References

1. Omega: flexible, scalable schedulers for large compute clusters.

2. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.

3. Apache Hadoop YARN: Yet Another Resource Negotiator.

4. Multi-agent Cluster Scheduling for Scalability and Flexibility.

5. http://mesos.apache.org/

