Task Scheduler for Spark

Source: Internet
Author: User
Tags: random shuffle, spark, rdd

This article attempts to walk through, at the source-code level, how Spark handles task scheduling and resource allocation.

Start with Executor and SchedulerBackend. An Executor is the process that actually runs tasks: it owns a certain number of CPU cores and an amount of memory, performs computation in threads, and is the smallest unit a resource management system can hand out. SchedulerBackend is an interface Spark provides that defines how executor-related events are handled, including: when a new executor registers, record its information, add its cores to the global resource count, and do a makeOffers; when an executor reports a status update and a task has finished, reclaim its core and do a makeOffers; plus other events such as stopping or removing an executor. The rest of this article unfolds from makeOffers.
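The event flow above can be sketched as a toy model. This is a minimal Python sketch with hypothetical names (Backend, make_offers, etc.), not Spark's actual CoarseGrainedSchedulerBackend: the point is only that every resource-changing event ends in another offer round.

```python
class Backend:
    def __init__(self):
        self.free_cores = {}   # executor id -> currently free cores
        self.total_cores = 0
        self.offers_made = 0

    def register_executor(self, exec_id, cores):
        # A new executor checks in: record it, grow the global core
        # count, and immediately try to hand work out.
        self.free_cores[exec_id] = cores
        self.total_cores += cores
        self.make_offers()

    def launch_task(self, exec_id, cpus_per_task=1):
        # Launching a task consumes cores on that executor.
        self.free_cores[exec_id] -= cpus_per_task

    def status_update(self, exec_id, finished, cpus_per_task=1):
        # A finished task gives its cores back, which triggers another offer.
        if finished:
            self.free_cores[exec_id] += cpus_per_task
            self.make_offers()

    def remove_executor(self, exec_id):
        self.total_cores -= self.free_cores.pop(exec_id)

    def make_offers(self):
        # In Spark this would build WorkerOffers and call the scheduler's
        # resourceOffers; here we only count the rounds.
        self.offers_made += 1
```

Registering an executor, finishing a task, and removing an executor each adjust the core bookkeeping; the first two also trigger a fresh offer round.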
The purpose of makeOffers is, whenever resources change, to call the scheduler's resourceOffers method, trigger one round of assignment over the existing tasks, and finally launch the new tasks. The global scheduler here is TaskScheduler, whose implementation is TaskSchedulerImpl; it can plug into various SchedulerBackend implementations, including standalone, YARN, and Mesos. When doing makeOffers, the SchedulerBackend packages the available executor resources into a list of WorkerOffers and hands it to the scheduler; that is, per worker, it tells the scheduler which worker it is and what resources it holds. With these cluster resources in hand, the scheduler walks through the tasks that have been submitted and decides, based on locality, how to launch them.
In TaskScheduler, the resourceOffers method first priority-sorts the submitted task sets; there are currently two sorting algorithms: FIFO or FAIR. With this ordered list of runnable tasks in hand, the next step is to properly match the worker resources handed over by the SchedulerBackend against these tasks. Before allocation, to avoid always assigning tasks to the first few workers, the WorkerOffer list is first randomly shuffled. Then the tasks are traversed against the worker resources: tasks whose resources are "not enough" or whose "character does not match" are skipped, and the remaining tasks are formally launched. Note that "enough" is easy to judge here: TaskScheduler configures how many CPUs each task needs to start (default 1), so allocation is just a size check plus a subtraction as we traverse. Whether the "character fits" depends on each task's locality setting.
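The allocation loop just described can be sketched in a few lines. This is a simplified Python model (the function name and data shapes are hypothetical, and locality is omitted), showing only the shuffle-then-greedy-assign part:

```python
import random

CPUS_PER_TASK = 1  # per-task CPU requirement, default 1 in Spark

def resource_offers(offers, tasksets):
    """offers: list of (worker_id, free_cores);
    tasksets: task lists, already priority-sorted (FIFO or FAIR)."""
    random.shuffle(offers)  # avoid always filling the first few workers
    free = {worker: cores for worker, cores in offers}
    launched = []
    for taskset in tasksets:
        for task in taskset:
            # "Enough" is just a size check: find any worker with capacity,
            # subtract the per-task CPU count, and record the launch.
            for worker in free:
                if free[worker] >= CPUS_PER_TASK:
                    free[worker] -= CPUS_PER_TASK
                    launched.append((task, worker))
                    break
    return launched
```

With two workers holding one free core each, a three-task set gets exactly two tasks launched; the third waits for the next offer round.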
There are five kinds of task locality, ranked by priority: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY. Best is the same process, next best the same node (that is, the same machine), then the same rack, and failing that, anywhere will do. A task carries its own locality preference; what happens if no resource with the desired locality is available? Spark has a spark.locality.wait parameter, 3000 ms by default. For the process, node, and rack levels, this value is used by default as the wait time for a locality-matching resource. So once a task insists on locality, it can trigger delay scheduling.
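Delay scheduling can be illustrated with a toy calculation (hypothetical function, not Spark's TaskSetManager logic): each fully elapsed wait window of spark.locality.wait relaxes the allowed locality by one level, from PROCESS_LOCAL down toward ANY.

```python
# NO_PREF tasks have no preference and skip the waiting game entirely,
# so only the four "real" levels appear here.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]
LOCALITY_WAIT_MS = 3000  # spark.locality.wait default

def allowed_level(wait_start_ms, now_ms):
    """How far locality may be relaxed after waiting since wait_start_ms."""
    elapsed = now_ms - wait_start_ms
    # Each full wait window unlocks the next (worse) locality level,
    # capped at ANY.
    idx = min(elapsed // LOCALITY_WAIT_MS, len(LEVELS) - 1)
    return LEVELS[int(idx)]
```

At time zero only PROCESS_LOCAL offers are acceptable; after 3000 ms a NODE_LOCAL offer becomes acceptable too, and after enough waiting the task runs anywhere.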
At this point we have a general picture of how tasks are assigned and resources used. In fact, TaskScheduler's resourceOffers also triggers TaskSetManager's resourceOffer method, which examines the task's locality and ultimately notifies the DAGScheduler that the task has been launched. The names of these classes and their call relationships can look rather chaotic, so let me briefly untangle them.
This starts with how Spark cuts the DAG. Spark RDDs are strung together into a DAG through their transformation and action operations. Calling an action triggers the submission of the DAG and the execution of the whole job. Once triggered, the DAG is sliced by the DAGScheduler, the globally unique stage-oriented DAG scheduler, into multiple small DAGs, the stages, cut along shuffle boundaries. RDDs connected only by narrow dependencies all belong to one stage; such a stage corresponds to ShuffleMapTasks, with parallelism equal to the partition count of the respective RDDs. When a wide-dependency operation is encountered, it is cut into its own stage; that final stage corresponds to ResultTasks, with parallelism equal to the partition count of the resulting RDD. ShuffleMapTask and ResultTask can be loosely understood as the map and reduce of classic MapReduce, and the boundary between them is essentially the shuffle; before a shuffle, many map-like operations can be pipelined within a partition.

Each stage corresponds to multiple ShuffleMapTasks or multiple ResultTasks; the set of tasks in a stage is gathered into a TaskSet class, and a TaskSetManager manages the running state and locality handling of those tasks (such as triggering delay scheduling when needed). The TaskSetManager lives at the Spark level and manages its own tasks, i.e. task threads; this layer is decoupled from the underlying resource management. The TaskSetManager resourceOffer method mentioned above is where tasks interact with the underlying resources, and the coordinator of this interaction is the global TaskScheduler. TaskScheduler plugs into the different SchedulerBackend implementations (such as Mesos, YARN, standalone) and thereby interfaces with different resource management systems. Those systems, for their part, are responsible for processes: how many processes run on each worker, and how many resources each process is allocated.
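The stage-cutting rule above can be sketched in miniature. This is a toy Python model (the RDD representation and function name are invented for illustration, not Spark's API): walk back from the final RDD and start a new stage at every shuffle (wide) dependency, while narrow dependencies stay inside the current stage.

```python
def cut_stages(final_rdd):
    """final_rdd: dict with keys 'name' and 'deps', where 'deps' is a
    list of (parent_rdd, is_shuffle) pairs. Returns stages as lists of
    RDD names, parents before the final (result) stage."""
    stages = []

    def visit(rdd, current_stage):
        current_stage.append(rdd["name"])
        for parent, is_shuffle in rdd["deps"]:
            if is_shuffle:
                # Wide dependency: the parent lineage becomes its own stage.
                parent_stage = []
                visit(parent, parent_stage)
                stages.append(parent_stage)
            else:
                # Narrow dependency: parent stays in the current stage.
                visit(parent, current_stage)

    result_stage = []
    visit(final_rdd, result_stage)
    stages.append(result_stage)
    return stages
```

For a lineage a -(narrow)-> b -(shuffle)-> c, this yields two stages: one containing b and a (the map side), and a result stage containing c.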
So the two layers are clear. Spark itself computes with thread-level tasks within the framework: each stage has a TaskSet, itself a small DAG, which can be thrown into the global pool of available resources to run. Below Spark, the resource management layer controls process-level executors; it cares neither about how tasks are placed nor about task running state, since that is the TaskSetManager's business. The coordinators between the two layers are TaskScheduler and the SchedulerBackend implementations.
The SchedulerBackend implementations, local mode aside, divide into fine-grained and coarse-grained kinds. Fine-grained is implemented only by Mesos (Mesos supports both granularities); coarse-grained implementations include YARN, Mesos, and standalone. Take coarse-grained standalone mode: each physical machine is a worker, with a total amount of CPU and memory the worker can use; at start time you can specify how many executors, i.e. processes, each worker brings up, and how much CPU and memory each executor gets. In my view, the main difference between coarse and fine granularity is that a coarse-grained executor is a long-running process, so compute threads can keep being handed to it, but its CPU and memory are much more easily wasted. Fine-grained resources can be reused and support preemption and other measures that are harsher but improve resource utilization. These two concepts were first proposed in an AMPLab paper and implemented in Mesos; AMPLab has many papers in this field, including the Mesos DRF algorithm and the Sparrow scheduler. So in standalone mode, given the number of partitions of the RDD and the number of CPUs each task needs, it is easy to calculate the load and resource consumption per physical machine, and even to know how many batches a TaskSet must be split into to run a stage through.
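That last calculation is simple arithmetic, shown here as a small sketch (hypothetical helper, not a Spark API): a stage has one task per partition, the cluster offers total_cores / cpus_per_task concurrent task slots, and the stage needs ceil(tasks / slots) waves to finish.

```python
import math

def batches_per_stage(num_partitions, total_cores, cpus_per_task=1):
    """How many waves of tasks a stage needs to run all its partitions,
    given the cluster's total cores and the per-task CPU requirement."""
    tasks = num_partitions                 # one task per partition
    slots = total_cores // cpus_per_task   # concurrent task slots
    return math.ceil(tasks / slots)
```

For example, a stage over an RDD with 100 partitions, on a cluster with 40 total cores and 1 CPU per task, runs in 3 batches.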
