Spark Core Runtime Analysis: DAGScheduler, TaskScheduler, SchedulerBackend

This article walks through the main layers of the Spark runtime, tying together the runtime components and the execution flow.

DAGScheduler

A job consists of multiple stages, and a stage consists of multiple tasks of the same kind. Tasks come in two types, ShuffleMapTask and ResultTask; dependencies come in two types, ShuffleDependency (wide) and NarrowDependency.

Stage boundaries are determined by splitting the RDD lineage at wide (shuffle) dependencies.

The DAGScheduler maintains waiting and active jobs, as well as waiting, active, and failed stages, together with the mappings between stages and jobs.
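
As a rough picture of this bookkeeping, the toy sketch below mirrors the shape of the DAGScheduler's internal state (the real fields live inside org.apache.spark.scheduler.DAGScheduler; ToyJob and ToyStage are stand-ins invented for the illustration):

import scala.collection.mutable.{HashMap, HashSet}

// Stand-ins for Spark's internal ActiveJob / Stage classes, only to show the shape of the bookkeeping.
case class ToyJob(jobId: Int)
case class ToyStage(stageId: Int, jobId: Int)

class ToyDAGSchedulerState {
  val activeJobs      = new HashSet[ToyJob]              // jobs currently running
  val waitingStages   = new HashSet[ToyStage]            // stages whose parents have not finished yet
  val runningStages   = new HashSet[ToyStage]            // stages currently executing
  val failedStages    = new HashSet[ToyStage]            // stages waiting to be resubmitted
  val jobIdToStageIds = new HashMap[Int, HashSet[Int]]   // which stages belong to which job
}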

Main functions

  1. Receiving job submissions is the primary entry point: submitJob(rdd, ...) or runJob(rdd, ...). Both methods are called from SparkContext (see the usage sketch after this list).
    • A final stage is generated and submitted; the scheduler then checks whether this stage has parent stages that are not yet complete. If so, the parents are submitted first and the stage waits, and so on recursively. The result: the DAGScheduler ends up with a set of waiting stages and one running stage.
    • When a running stage is submitted, the type of task in the stage is determined and a set of task descriptions, a TaskSet, is generated.
    • TaskScheduler.submitTasks(taskSet, ...) is called to hand the task descriptions to the TaskScheduler, which allocates resources for this TaskSet and triggers execution, depending on the amount of available resources and the trigger conditions.
    • After the job is submitted, the DAGScheduler asynchronously returns a JobWaiter object, which can report the job's running state, cancel the job, and process and return the result after successful completion.
  2. Handling TaskCompletionEvent
    • If the task executed successfully, it is removed from the corresponding stage and some accounting is done:
      • If the task is a ResultTask, the accumulator counter is incremented, the task is marked as finished in the job, and the job's finish count goes up by one. If the finish count equals the number of partitions, the stage is complete: it is marked as finished, removed from the running stages, and some cleanup work is done.
      • If the task is a ShuffleMapTask, the accumulator counter is incremented and an output location is added to the stage. The output location is a MapStatus, the return value of a completed ShuffleMapTask, containing location information and block sizes (optionally compressed or uncompressed). Stage completion is checked as well; when the stage completes, its shuffleId and location information are registered with the MapOutputTracker. The stage's output locations are then checked for nulls: a null means some task failed and the entire stage is resubmitted; otherwise the next stage is taken from the waiting stages and submitted.
    • If the task is a resubmission, it is added back to the corresponding stage.
    • If the task is a fetch failure, the corresponding stage is immediately marked as finished and removed from the running stages. If retries are not allowed, the entire stage is aborted; otherwise the entire stage is resubmitted. In addition, the location and map task information related to this fetch are removed from the stage and deregistered from the MapOutputTracker. Finally, if the BlockManagerId of this fetch is not empty, an ExecutorLost is triggered so that the next shuffle runs on a different executor.
    • Other task states are handled by the TaskScheduler, such as Exception, TaskResultLost, CommitDenied, and so on.
  3. Other job-related operations include: cancel job, cancel stage, resubmit failed stage, and more.
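
As a concrete, user-facing illustration of this entry point, the snippet below runs a small two-stage job: reduceByKey introduces a shuffle (wide) dependency, so the DAGScheduler splits the lineage into a shuffle map stage and a final result stage before handing TaskSets to the TaskScheduler. Only the public SparkContext API is used; application and variable names are made up for the example.

import org.apache.spark.{SparkConf, SparkContext}

object DagDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-demo").setMaster("local[2]"))

    // map is a narrow dependency; reduceByKey adds a shuffle (wide) dependency,
    // so the DAGScheduler splits this job into two stages.
    val counts = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // runJob is the entry point described above: it hands the RDD and a function over
    // each partition to the DAGScheduler, which builds and submits the stages.
    val perPartition = sc.runJob(counts, (iter: Iterator[(String, Int)]) => iter.toArray)
    println(perPartition.flatten.mkString(", "))

    sc.stop()
  }
}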

Other Functions
1. cacheLocs and preferred locations

The DAGScheduler keeps a map of cached partition locations per RDD and uses it, together with each RDD's preferred locations, when computing where tasks should run:

private val cacheLocs = new HashMap[Int, Array[Seq[TaskLocation]]]
TaskScheduler

Maintains the correspondence between tasks and executors and between executors and physical resources, and tracks queued and running tasks.

Maintains a task queue internally, scheduling tasks according to FIFO or fair policies.
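
Both the scheduling policy and the per-task CPU requirement are user-configurable; a minimal sketch using the standard configuration keys:

import org.apache.spark.SparkConf

// FIFO is the default scheduling mode; FAIR enables fair scheduling across pools.
val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.task.cpus", "1")   // number of CPU cores each task requests (default 1)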

TaskScheduler itself is an interface, and Spark provides only one implementation, TaskSchedulerImpl; in theory, task scheduling can be customized. Here are the main methods of the TaskScheduler interface:

def start(): Unit
def postStartHook() { }
def stop(): Unit
def submitTasks(taskSet: TaskSet): Unit
def cancelTasks(stageId: Int, interruptThread: Boolean)
def setDAGScheduler(dagScheduler: DAGScheduler): Unit
def executorHeartbeatReceived(execId: String, taskMetrics: Array[(Long, TaskMetrics)],
    blockManagerId: BlockManagerId): Boolean

Main functions

  1. submitTasks(taskSet) receives the TaskSets submitted by the DAGScheduler.
    • A TaskSetManager is created for the tasks and added to the task queue. The TaskSetManager tracks the execution status of each task and maintains much of the task-specific bookkeeping.
    • A resource offer round is triggered.
      • First, the TaskScheduler matches the available resources against the task queue at hand (taking priority, locality, and so on into account) and assigns eligible executors to the TaskSetManager.
      • Then the resulting task descriptions are handed to the SchedulerBackend, whose launchTasks(tasks) is called to trigger execution of the tasks on the executors. The task descriptions are serialized and sent to the executors; each executor extracts the task information and calls the task's run() method to perform the computation.
  2. cancelTasks(stageId) cancels the tasks of a stage.
    • It calls the SchedulerBackend method killTask(taskId, executorId, ...). The mapping between taskId and executorId is already maintained inside the TaskScheduler.
  3. resourceOffers(offers: Seq[WorkerOffer]) is a very important method. The caller is the SchedulerBackend: it hands the spare worker resources it holds to the TaskScheduler, which allocates a reasonable amount of CPU and memory to the queued tasks according to the scheduling policy and passes the resulting list of task descriptions back to the SchedulerBackend (see the toy sketch at the end of this section).
    • From the worker offers it collects the executor-to-host correspondence, the active executors, rack information, and so on.
    • The list of worker offers is shuffled randomly, and the TaskSets in the task queue are sorted according to the scheduling policy.
    • It then iterates through each TaskSet, offering it the available CPU cores in the priority order of process locality, node (worker) locality, and rack locality, to see whether tasks can be launched:
      • By default one task requires one CPU; this is controlled by the parameter spark.task.cpus=1.
      • Resources are allocated to the TaskSet and the placement logic is checked, ending in TaskSetManager.resourceOffer(execId, host, maxLocality).
      • If the conditions in that method are satisfied, the final task description is generated and the DAGScheduler's taskStarted(task, info) method is called. This notification causes the DAGScheduler to re-check whether waiting stages can now be submitted; once the tasks of a stage have been assigned resources, they are submitted for execution immediately.
  4. statusUpdate(taskId, taskState, data) is another very important method. The caller is the SchedulerBackend; its purpose is to let the SchedulerBackend report the execution status of tasks so that the TaskScheduler can make decisions.
    • If the state is TaskLost, the executor corresponding to the task is found and removed from the active executors, so that this executor is not assigned further tasks that would keep failing.
    • Task completion covers four states: finished, killed, failed, and lost. Only finished means successful completion; the other three are failures.
    • When a task finishes successfully, TaskResultGetter.enqueueSuccessfulTask(taskSet, tid, data) is called; otherwise TaskResultGetter.enqueueFailedTask(taskSet, tid, state, data) is called. TaskResultGetter maintains a thread pool internally that asynchronously fetches and deserializes task results. By default four threads do this, controlled by the parameter spark.resultGetter.threads=4.

How TaskResultGetter handles task results:

    • For a successful task, if the TaskResult carries the result data directly, it is deserialized to obtain the result; if not, blockManager.getRemoteBytes(blockId) is called to fetch it remotely. If the remotely fetched data is empty, TaskScheduler.handleFailedTask is called to report that the task finished but its data was lost. Otherwise, once the data has been retrieved, the BlockManagerMaster is notified to remove the block, and TaskScheduler.handleSuccessfulTask is called to report that the task succeeded and to pass the result data back.
    • For a failed task, the reason for the failure is parsed from the data and TaskScheduler.handleFailedTask is called to report that the task failed and why.
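
To make the resourceOffers flow more concrete, here is a self-contained toy version of the allocation loop that can be pasted into a Scala REPL. The class names mirror Spark's internal WorkerOffer and TaskDescription, but the logic is heavily simplified (no locality levels, no scheduling pools); treat it as an illustration of "hand out queued tasks over free cores", not as the real TaskSchedulerImpl.resourceOffers.

import scala.collection.mutable

// Toy model: distribute queued task ids over the free cores offered by the workers.
case class WorkerOffer(executorId: String, host: String, cores: Int)
case class TaskDescription(taskId: Long, executorId: String)

val cpusPerTask = 1  // corresponds to spark.task.cpus

def resourceOffers(offers: Seq[WorkerOffer], queuedTasks: Seq[Long]): Seq[TaskDescription] = {
  val freeCores = mutable.Map(offers.map(o => o.executorId -> o.cores): _*)
  val assigned  = mutable.ArrayBuffer.empty[TaskDescription]
  val pending   = mutable.Queue(queuedTasks: _*)

  // Round-robin over the offers, launching one task per cpusPerTask free cores,
  // until the queue is empty or no executor has enough cores left.
  var launchedSomething = true
  while (pending.nonEmpty && launchedSomething) {
    launchedSomething = false
    for (offer <- offers if pending.nonEmpty && freeCores(offer.executorId) >= cpusPerTask) {
      assigned += TaskDescription(pending.dequeue(), offer.executorId)
      freeCores(offer.executorId) -= cpusPerTask
      launchedSomething = true
    }
  }
  assigned
}

// Five queued tasks, two executors with 2 + 1 free cores: only three tasks launch in this round.
println(resourceOffers(Seq(WorkerOffer("exec-1", "hostA", 2), WorkerOffer("exec-2", "hostB", 1)),
                       queuedTasks = Seq(0L, 1L, 2L, 3L, 4L)))
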
SchedulerBackend

Sitting below the TaskScheduler, the SchedulerBackend is used to interface with different resource management systems. SchedulerBackend is an interface; the main methods that need to be implemented are as follows:

def start(): Unit
def stop(): Unit
// Key method: the SchedulerBackend hands the resources it currently has available to the TaskScheduler,
// which assigns them to the queued tasks according to the scheduling policy and returns a batch of
// runnable task descriptions. The SchedulerBackend is then responsible for launchTask, i.e. it finally
// pushes the tasks onto the executor model, where the executor's thread pool runs the task's run().
def reviveOffers(): Unit
def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException

Coarse granularity: the process-resident mode, typically represented by standalone mode, Mesos coarse-grained mode, and YARN.

Fine granularity: Mesos fine-grained mode.

The coarse-grained mode, represented by CoarseGrainedSchedulerBackend, is discussed here for easier understanding.

It maintains executor-related information (including executor address, communication port, host, total number of cores, and number of free cores): how many executors are currently registered, how many remain, how many cores are free, and so on.
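
A minimal sketch of this per-executor bookkeeping (field and class names are illustrative, not the real ones):

import scala.collection.mutable

// Illustrative per-executor record, keyed by executorId.
case class ExecutorInfo(host: String, port: Int, totalCores: Int, var freeCores: Int)

val executorInfo = mutable.Map.empty[String, ExecutorInfo]
def totalRegisteredExecutors: Int = executorInfo.size
def totalFreeCores: Int = executorInfo.values.map(_.freeCores).sum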

Main functions

  1. The driver side primarily listens for and handles the following events through an actor:
    • RegisterExecutor(executorId, hostPort, cores, logUrls). This is how executors join: typically an executor launched (or relaunched) by a worker triggers this registration. CoarseGrainedSchedulerBackend records these executors and updates its internal resource information, such as increasing the total number of cores. Finally it calls makeOffers(), which hands the resources on hand to the TaskScheduler for allocation, gets task descriptions back, and launches those tasks. This makeOffers() call appears in every event related to resource changes, as the sketch after this list shows.
    • StatusUpdate(executorId, taskId, state, data). The state callback for a task. First TaskScheduler.statusUpdate is called to report upward. Then, if the task has finished, its cores are returned to the executor's free cores and makeOffers() is called again.
    • ReviveOffers. This event is sent directly to the SchedulerBackend by anyone requesting resources; it simply triggers makeOffers().
    • KillTask(taskId, executorId, interruptThread). This KillTask event is forwarded to the executor's actor, and the executor handles the KillTask event.
    • StopExecutors. Notifies every executor to handle the StopExecutor event.
    • RemoveExecutor(executorId, reason). The executor is removed from the maintained information and its resources are counted as lost. Then TaskScheduler.executorLost() is called to notify the upper layer that a batch of resources can no longer be used. The TaskScheduler escalates the executorLost event further to the DAGScheduler, because the DAGScheduler cares about the output locations of shuffle tasks. The DAGScheduler tells the BlockManager that the executor is unusable and removes it, walks through all the shuffle output information of its stages, removes this executor from it, re-registers the updated shuffle output information with the MapOutputTracker, and finally cleans up the local cacheLocs map.
  2. The implementation of reviveOffers(): it directly calls the makeOffers() method, obtains a batch of executable task descriptions, and calls launchTasks.
  3. launchTasks(tasks: Seq[Seq[TaskDescription]]) method:
    • Iterates through each task description, serializes it into binary, and sends a message for the task to its corresponding executor.
      • If the serialized task is too large, exceeding the Akka frame size limit (default 10 MB, minus roughly 200 KB that Akka reserves), an error is raised, the entire TaskSet is aborted, and a reminder is printed to increase the Akka frame size.
      • If the size of the binary data is acceptable, it is sent to the executor's actor, which handles the LaunchTask(serializedTask) event.
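
The sketch below ties the event list above together: a toy driver-side handler that updates its resource bookkeeping and calls makeOffers() on every resource-related event. It is purely illustrative (plain Scala, no actor framework, invented names); the real logic lives in CoarseGrainedSchedulerBackend's driver actor.

import scala.collection.mutable

// Toy driver-side events, mirroring a subset of the messages described above.
sealed trait DriverEvent
case class RegisterExecutor(executorId: String, host: String, cores: Int) extends DriverEvent
case class StatusUpdate(executorId: String, taskId: Long, finished: Boolean, cpusFreed: Int) extends DriverEvent
case object ReviveOffers extends DriverEvent
case class RemoveExecutor(executorId: String, reason: String) extends DriverEvent

class ToyCoarseGrainedBackend {
  private val freeCores = mutable.Map.empty[String, Int]

  def receive(event: DriverEvent): Unit = event match {
    case RegisterExecutor(id, host, cores) =>
      freeCores(id) = cores                 // a new executor registers its resources
      makeOffers()
    case StatusUpdate(id, _, finished, cpusFreed) =>
      if (finished) freeCores(id) = freeCores.getOrElse(id, 0) + cpusFreed  // give the task's cores back
      makeOffers()
    case ReviveOffers =>
      makeOffers()
    case RemoveExecutor(id, reason) =>
      freeCores.remove(id)                  // this executor's resources are gone
      println(s"lost executor $id: $reason")
  }

  // Stand-in for the real makeOffers(): hand the free resources to the TaskScheduler,
  // get task descriptions back, and launch them on the executors.
  private def makeOffers(): Unit = println(s"offering free cores: $freeCores")
}

val backend = new ToyCoarseGrainedBackend
backend.receive(RegisterExecutor("exec-1", "hostA", cores = 4))
backend.receive(ReviveOffers)
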
Executor

Executor is the process model in Spark. It can be adapted to different resource management systems and works together with a SchedulerBackend.

Internally it has a thread pool, a map of running tasks, and an actor that receives the events mentioned above, sent by the SchedulerBackend.

Event handling

    1. LaunchTask. Based on the task description, a TaskRunner thread is generated, put into the running tasks map, and executed by the thread pool.
    2. KillTask. The thread object is taken from the running tasks map and its kill method is called.
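
A self-contained toy version of this executor-side model, showing the thread pool, the running-tasks map, and the kill path (ToyExecutor and ToyTaskRunner are invented names; the real classes are Executor and its inner TaskRunner):

import java.util.concurrent.{ConcurrentHashMap, Executors}

class ToyExecutor {
  private val threadPool   = Executors.newCachedThreadPool()
  private val runningTasks = new ConcurrentHashMap[Long, ToyTaskRunner]()

  // LaunchTask: wrap the task body in a runner, register it, and hand it to the thread pool.
  def launchTask(taskId: Long, body: () => Unit): Unit = {
    val runner = new ToyTaskRunner(taskId, body)
    runningTasks.put(taskId, runner)
    threadPool.execute(runner)
  }

  // KillTask: look the runner up in the running-tasks map and ask it to stop.
  def killTask(taskId: Long): Unit = {
    val runner = runningTasks.get(taskId)
    if (runner != null) runner.kill()
  }

  private class ToyTaskRunner(taskId: Long, body: () => Unit) extends Runnable {
    @volatile private var killed = false
    def kill(): Unit = killed = true
    override def run(): Unit =
      try { if (!killed) body() }              // the real TaskRunner deserializes the task and calls task.run()
      finally { runningTasks.remove(taskId) }
  }
}

val exec = new ToyExecutor
exec.launchTask(1L, () => println("running task 1"))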
