"Spark Core" TaskScheduler source code and task submission principle Analysis 2

Introduction

The previous section, "TaskScheduler Source Code and Task Submission Principle Analysis 1", introduced how the TaskScheduler is created. This section continues from the submitMissingTasks function covered in "Stage Generation and Stage Source Analysis" and describes how tasks are created and distributed.

The submitMissingTasks function in DAGScheduler

If all parent stages of a stage have been computed or exist in the cache, the DAGScheduler calls submitMissingTasks to submit the tasks contained in that stage.
submitMissingTasks is responsible for creating the new tasks.
Spark divides the tasks run by executors into two types: ShuffleMapTask and ResultTask.
Each stage generates its tasks according to its isShuffleMap flag. If the flag is true, the stage is a ShuffleMapStage: its output passes through the shuffle to become the input of the next stage, so ShuffleMapTasks are created. Otherwise the stage is a ResultStage, so ResultTasks are created, and the stage's results are output back to the Spark program. Finally, the tasks are submitted through taskScheduler.submitTasks.

Calculation process

submitMissingTasks proceeds as follows:

  1. Get the partitions of the RDD that still need to be computed. For a shuffle-type stage, check whether each partition's result is already cached in the stage; for the final, result-type stage, check whether the job has already finished computing the partition.
  2. Serialize the task binary. Executors obtain it through a broadcast variable, and each task deserializes it before it runs, so tasks running on different executors are isolated and do not affect one another.
  3. Generate one task for each partition that needs to be computed: a ShuffleMapTask for a shuffle-type stage, and a ResultTask for a result-type stage.
  4. Make sure the tasks can be serialized. Because different cluster managers have different TaskSchedulers, checking here simplifies the logic and guarantees that the tasks in the TaskSet are serializable.
  5. Submit the TaskSet through the TaskScheduler.
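
Steps 1 and 3 above can be illustrated with a small, self-contained sketch. The types below (Stage, ShuffleMapTask, ResultTask) are simplified stand-ins invented for this example, not Spark's real classes, and the sketch assumes that a result job computes every partition, so the output index equals the partition index.

```scala
// Simplified stand-ins for illustration only; these are NOT the real Spark classes.
object TaskCreationSketch {
  sealed trait Task { def stageId: Int; def partition: Int }
  final case class ShuffleMapTask(stageId: Int, partition: Int) extends Task
  final case class ResultTask(stageId: Int, partition: Int, outputId: Int) extends Task

  final case class Stage(id: Int, isShuffleMap: Boolean, numPartitions: Int)

  /** Build one task for every partition that has not been computed yet. */
  def createTasks(stage: Stage, finished: Set[Int]): Seq[Task] = {
    // Step 1: keep only the partitions that still need to be computed.
    val partitionsToCompute = (0 until stage.numPartitions).filterNot(p => finished.contains(p))
    // Step 3: the stage type decides the task type.
    if (stage.isShuffleMap) {
      partitionsToCompute.map(p => ShuffleMapTask(stage.id, p))
    } else {
      // Assumption: the job covers all partitions, so outputId equals the partition index.
      partitionsToCompute.map(p => ResultTask(stage.id, p, outputId = p))
    }
  }
}
```

Calling `TaskCreationSketch.createTasks(Stage(1, isShuffleMap = true, numPartitions = 3), Set(1))` yields ShuffleMapTasks for partitions 0 and 2 only, because partition 1 is treated as already computed.
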

Code excerpts

The following excerpt from submitMissingTasks creates the tasks according to whether the stage is a ShuffleMapStage; the comments describe the parameters:

    val tasks: Seq[Task[_]] = if (stage.isShuffleMap) {
      partitionsToCompute.map { id =>
        val locs = getPreferredLocs(stage.rdd, id)
        val part = stage.rdd.partitions(id)
        // stage.id: the ID of the stage
        // taskBinary: described in detail below
        // part: the partition of the RDD that the task works on
        // locs: the preferred locations for running the task
        new ShuffleMapTask(stage.id, taskBinary, part, locs)
      }
    } else {
      val job = stage.resultOfJob.get
      partitionsToCompute.map { id =>
        val p: Int = job.partitions(id)
        val part = stage.rdd.partitions(p)
        val locs = getPreferredLocs(stage.rdd, p)
        // p: the index of the partition the data is read from
        // id: the index of the output partition, i.e. the reduce ID
        new ResultTask(stage.id, taskBinary, part, locs, id)
      }
    }

About the taskBinary parameter: it is the broadcast form of the serialized RDD and its ShuffleDependency.
The RDD and its dependencies are serialized here and deserialized on the executor before each task runs. Every task therefore works on its own copy, which provides better isolation between tasks.
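
The isolation effect can be shown with plain Java serialization. This is only a sketch: it assumes a mutable `Array[Int]` as a stand-in for the serialized stage state, and the helper names `serialize` and `deserialize` are invented for this example. Spark itself uses its own closure serializer and broadcast mechanism.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical helpers for illustration; not Spark's serializer API.
object TaskBinarySketch {
  /** Serialize an object once, as the driver does for the task binary. */
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.toByteArray
  }

  /** Deserialize a fresh copy, as each task does before it runs. */
  def deserialize[T](binary: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(binary))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

If two "tasks" each call `deserialize[Array[Int]]` on the same bytes, a write by one task to its array is not visible to the other, because each call produces an independent copy.
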

Here is the part of submitMissingTasks that submits the tasks:

    if (tasks.size > 0) {
      logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
      stage.pendingTasks ++= tasks
      logDebug("New pending tasks: " + stage.pendingTasks)
      taskScheduler.submitTasks(
        new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    } else {
      // Because we posted SparkListenerStageSubmitted earlier, we should mark
      // the stage as completed here in case there are no tasks to run
      markStageAsFinished(stage, None)
      logDebug("Stage " + stage + " is actually done; %b %d %d".format(
        stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
    }

The submitTasks function in TaskSchedulerImpl

submitTasks works as follows:

  1. The tasks are wrapped in a TaskSetManager (TaskSetManager is not thread-safe, so the source synchronizes here).
  2. The TaskSetManager instance is added to the scheduling pool through SchedulableBuilder (of which there are two kinds, FIFOSchedulableBuilder and FairSchedulableBuilder) and waits there to be scheduled.
  3. Submitting the tasks also starts a timer. As long as no task has been launched, the timer keeps logging a warning; once a task has been launched, the timer cancels itself.
  4. The backend's reviveOffers function is called. It sends a ReviveOffers message to the backend's DriverActor instance, and DriverActor calls makeOffers when it receives the ReviveOffers message.

    override def submitTasks(taskSet: TaskSet) {
      val tasks = taskSet.tasks
      logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
      this.synchronized {
        val manager = createTaskSetManager(taskSet, maxTaskFailures)
        activeTaskSets(taskSet.id) = manager
        schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

        if (!isLocal && !hasReceivedTask) {
          starvationTimer.scheduleAtFixedRate(new TimerTask() {
            override def run() {
              if (!hasLaunchedTask) {
                logWarning("Initial job has not accepted any resources; " +
                  "check your cluster UI to ensure that workers are registered " +
                  "and have sufficient resources")
              } else {
                this.cancel()
              }
            }
          }, STARVATION_TIMEOUT, STARVATION_TIMEOUT)
        }
        hasReceivedTask = true
      }
      backend.reviveOffers()
    }

TaskSetManager scheduling

Once a stage has been determined, a corresponding TaskSet (a set of tasks) is generated, and each TaskSet gets its own TaskSetManager. Stages are submitted to the scheduling pool by tracing back from the current stage to the earliest missing stage. In the pool, the TaskSetManagers are sorted by job ID, so the TaskSetManagers of the job submitted first are scheduled first; within a job, the TaskSetManager with the smaller ID is scheduled first. A TaskSetManager whose parent stage has not finished executing is not submitted to the scheduling pool.
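
The ordering described above can be sketched as a two-level comparison. The `TaskSetInfo` class below is a hypothetical, simplified view of a TaskSetManager that keeps only the two fields used for ordering; the sketch assumes that the "smaller ID" within a job refers to the stage ID.

```scala
// Hypothetical, simplified model; not Spark's TaskSetManager or scheduling pool.
object FifoOrderSketch {
  final case class TaskSetInfo(jobId: Int, stageId: Int)

  // Earlier jobs first; within one job, lower stage IDs first (assumption, see above).
  val fifoOrdering: Ordering[TaskSetInfo] =
    Ordering.by((t: TaskSetInfo) => (t.jobId, t.stageId))

  /** Return the queue in the order in which the task sets would be offered resources. */
  def sortFifo(queue: Seq[TaskSetInfo]): Seq[TaskSetInfo] = queue.sorted(fifoOrdering)
}
```

For a queue holding (job 2, stage 5), (job 1, stage 4) and (job 1, stage 3), `sortFifo` puts both task sets of job 1 ahead of job 2, with stage 3 before stage 4.
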

reviveOffers function code

The following is the reviveOffers function of CoarseGrainedSchedulerBackend:

    override def reviveOffers() {
      driverActor ! ReviveOffers
    }

When DriverActor receives the ReviveOffers message, it calls the makeOffers handler function.

The makeOffers function of DriverActor

The processing logic of the makeOffers function is:

  1. Find idle executors. Tasks are distributed randomly, so that they are spread across the executors as evenly as possible.
  2. If there are idle executors, send tasks from the task list to the chosen executors with launchTasks.

SchedulerBackend (here actually CoarseGrainedSchedulerBackend) is responsible for distributing newly created tasks to the executors. As the launchTasks code shows, each TaskDescription must be serialized before the LaunchTask message is sent.

    // Make fake resource offers on all executors
    def makeOffers() {
      launchTasks(scheduler.resourceOffers(executorDataMap.map { case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
      }.toSeq))
    }

The resourceOffers function in TaskSchedulerImpl

Tasks are distributed randomly to the executors; the resource allocation itself is handled by the resourceOffers function.
As mentioned for submitTasks above, TaskSchedulerImpl hands each set of tasks to a new TaskSetManager instance for management, and SchedulableBuilder sorts all TaskSetManagers according to the chosen scheduling policy. Inside TaskSchedulerImpl's resourceOffers function, the resourceOffer function of the currently selected TaskSetManager is called, and it returns a TaskDescription containing the serialized task data. Finally, SchedulerBackend distributes these TaskDescriptions to the ExecutorBackends for execution.

resourceOffers mainly does three things:

  1. Randomly shuffle the workers to pick the ones that will run tasks.
  2. Have TaskSetManager match tasks to workers; each match is packaged into a TaskDescription and returned.
  3. Return the mapping worker -> Array[TaskDescription].

    /**
     * Called by cluster manager to offer resources on slaves. We respond by asking our active task
     * sets for tasks in order of priority. We fill each node with tasks in a round-robin manner so
     * that tasks are balanced across the cluster.
     */
    def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
      // Mark each slave as alive and remember its hostname
      // Also track if new executor is added
      var newExecAvail = false
      // Walk through the resources offered by the workers and update the executor mappings
      for (o <- offers) {
        executorIdToHost(o.executorId) = o.host
        activeExecutorIds += o.executorId
        if (!executorsByHost.contains(o.host)) {
          executorsByHost(o.host) = new HashSet[String]()
          executorAdded(o.executorId, o.host)
          newExecAvail = true
        }
        for (rack <- getRackForHost(o.host)) {
          hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
        }
      }

      // Pick workers in random order so that tasks do not pile up on one machine
      // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
      val shuffledOffers = Random.shuffle(offers)
      // Build a list of tasks to assign to each worker.
      // One task list per worker
      val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
      val availableCpus = shuffledOffers.map(o => o.cores).toArray
      // getSortedTaskSetQueue returns the task sets in scheduling order
      val sortedTaskSets = rootPool.getSortedTaskSetQueue
      for (taskSet <- sortedTaskSets) {
        logDebug("parentName: %s, name: %s, runningTasks: %s".format(
          taskSet.parent.name, taskSet.name, taskSet.runningTasks))
        if (newExecAvail) {
          taskSet.executorAdded()
        }
      }

      // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
      // of locality levels so that it gets a chance to launch local tasks on all of them.
      // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
      // Walk through the shuffled workers; through resourceOffer, each TaskSetManager gives the
      // worker its most local task. The locality level is determined from the current wait time.
      // Locality falls mainly into four classes: PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY.
      // 1. Loop over sortedTaskSets and, for each TaskSet, over its locality levels.
      // 2. More local levels have higher priority; if nothing can be launched
      //    (launchedTask is false), move on to the next locality level.
      // 3. The offer list is traversed several times (encapsulated in resourceOfferSingleTaskSet),
      //    because one taskSet.resourceOffer call takes only one core rather than using up
      //    all cores at once; this helps spread a TaskSet's tasks evenly across the workers.
      // 4. Only when no suitable task is found for any worker's offer at this TaskSet and
      //    locality level does the loop jump to the next locality level.
      var launchedTask = false
      for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
        do {
          launchedTask = resourceOfferSingleTaskSet(
            taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
        } while (launchedTask)
      }

      if (tasks.size > 0) {
        hasLaunchedTask = true
      }
      return tasks
    }
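
The "one core per pass" behaviour described in the comments above can be sketched in isolation. This is a simplified model under stated assumptions: `Offer` and `assign` are names invented for this example, tasks are plain strings, executor IDs are assumed to be unique, and locality levels are ignored entirely.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical model of round-robin assignment; it ignores locality and is not Spark code.
object RoundRobinSketch {
  final case class Offer(executorId: String, cores: Int)

  /** Hand out pending tasks one core at a time across the shuffled offers. */
  def assign(offers: Seq[Offer], pending: Seq[String], rng: Random): Map[String, Seq[String]] = {
    val shuffled = rng.shuffle(offers)                 // avoid always favouring the same workers
    val freeCores = shuffled.map(_.cores).toArray      // cores still available per offer
    val assigned = shuffled.map(_ => ArrayBuffer.empty[String])
    val queue = pending.iterator
    var launched = true
    // Repeat passes over all offers until a full pass launches nothing.
    while (launched && queue.hasNext) {
      launched = false
      for (i <- shuffled.indices) {
        if (queue.hasNext && freeCores(i) > 0) {
          assigned(i) += queue.next()                  // each pass takes at most one core per offer
          freeCores(i) -= 1
          launched = true
        }
      }
    }
    shuffled.map(_.executorId).zip(assigned.map(_.toList)).toMap
  }
}
```

With two offers of four cores each and only two pending tasks, every executor receives exactly one task, whereas filling one offer completely before moving to the next would place both tasks on the same executor.
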

Taskdescription Code:

    private[spark] class TaskDescription(
        val taskId: Long,
        val attemptNumber: Int,
        val executorId: String,
        val name: String,
        val index: Int,    // Index within this task's TaskSet
        _serializedTask: ByteBuffer)
      extends Serializable {

      // Because ByteBuffers are not serializable, wrap the task in a SerializableBuffer
      private val buffer = new SerializableBuffer(_serializedTask)

      def serializedTask: ByteBuffer = buffer.value

      override def toString: String = "TaskDescription(TID=%d, index=%d)".format(taskId, index)
    }

The launchTasks function of DriverActor

launchTasks works as follows:

  1. It serializes each TaskDescription returned by the resourceOffers function.
  2. It sends a LaunchTask message wrapping the serialized task to the executor's actor (executorActor).

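Akka limits the size of a message (the frame size). If the serialized task is too large to fit, it is not sent; instead, its TaskSet is aborted with an error message that suggests increasing spark.akka.frameSize or using broadcast variables for large values.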
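
The size check can be sketched on its own as follows. The object name, the `check` function, and both constants are assumptions made for this example; they are not values read from a Spark configuration.

```scala
import java.nio.ByteBuffer

// Hypothetical sketch of the frame-size guard; names and limits are assumptions.
object FrameSizeSketch {
  val frameSizeBytes: Int = 10 * 1024 * 1024   // assumed frame size for this sketch
  val reservedBytes: Int = 200 * 1024          // assumed space reserved for message overhead

  /** Right(buffer) if the serialized task fits into one message, Left(error) otherwise. */
  def check(taskId: Long, serialized: ByteBuffer): Either[String, ByteBuffer] = {
    val maxAllowed = frameSizeBytes - reservedBytes
    if (serialized.limit() >= maxAllowed) {
      Left(s"Serialized task $taskId was ${serialized.limit()} bytes, " +
        s"which exceeds max allowed: $maxAllowed bytes")
    } else {
      Right(serialized)
    }
  }
}
```

A caller would send the task only for a `Right` result and abort the task set with the message carried by a `Left`, which mirrors the two branches of the code below.
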

    // Launch tasks returned by a set of resource offers
    def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
      for (task <- tasks.flatten) {
        val ser = SparkEnv.get.closureSerializer.newInstance()
        val serializedTask = ser.serialize(task)
        if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
          val taskSetId = scheduler.taskIdToTaskSetId(task.taskId)
          scheduler.activeTaskSets.get(taskSetId).foreach { taskSet =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
                "spark.akka.frameSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
                AkkaUtils.reservedSizeBytes)
              taskSet.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        } else {
          val executorData = executorDataMap(task.executorId)
          executorData.freeCores -= scheduler.CPUS_PER_TASK
          executorData.executorActor ! LaunchTask(new SerializableBuffer(serializedTask))
        }
      }
    }

When reprinting, please credit the author, Jason Ding, and cite the source.
GitCafe blog home page (http://jasonding1354.gitcafe.io/)
GitHub blog home page (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search Google for jasonding1354 to find my blog home page.

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
