Lesson 36: Spark TaskScheduler — spark-shell run-log walkthrough, TaskScheduler and SchedulerBackend, FIFO and FAIR scheduling, and the task-runtime locality algorithm


& When a task fails to execute, it is retried; the default maximum number of task attempts is 4.

& def this(sc: SparkContext) = this(sc, sc.conf.getInt("spark.task.maxFailures", 4))  (TaskSchedulerImpl)
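As a minimal sketch (not from the lesson itself), the retry limit can be raised when building the SparkConf; the application name and the value 8 below are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("RetryConfigExample")          // hypothetical application name
  .set("spark.task.maxFailures", "8")        // allow up to 8 attempts instead of the default 4
val sc = new SparkContext(conf)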

(2) Adding a TaskSetManager

The SchedulerBuilder#addTaskSetManager method (whose implementation differs between FIFO and FAIR, depending on the SchedulingMode) determines the scheduling order of TaskSetManagers; each TaskSetManager's locality awareness then determines on which ExecutorBackend each task actually runs.

& The default scheduling mode is FIFO. Spark applications currently support two scheduling modes, FIFO and FAIR, which can be specified via spark.scheduler.mode (for example in spark-env.sh).

// Default scheduler is FIFO  (TaskSchedulerImpl)
private val schedulingModeConf = conf.get("spark.scheduler.mode", "FIFO")

// 1. Call addTaskSetManager
def addTaskSetManager(manager: Schedulable, properties: Properties)


FIFO mode, addTaskSetManager:

override def addTaskSetManager(manager: Schedulable, properties: Properties) {
  rootPool.addSchedulable(manager)
}


The TaskSetManager is appended directly to the tail of the scheduling pool's schedulableQueue:

override def addSchedulable(schedulable: Schedulable) {
  require(schedulable != null)
  schedulableQueue.add(schedulable)
  schedulableNameToSchedulable.put(schedulable.name, schedulable)
  schedulable.parent = this
}


FAIR mode, addTaskSetManager:

override def addTaskSetManager(manager: Schedulable, properties: Properties) {
  var poolName = DEFAULT_POOL_NAME
  // Start from the default pool under the root node
  var parentPool = rootPool.getSchedulableByName(poolName)
  if (properties != null) {
    // Look up the parent pool named by the fair-scheduler property
    poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
    parentPool = rootPool.getSchedulableByName(poolName)
    if (parentPool == null) {
      // If the parent pool does not exist yet, create it from the default configuration
      parentPool = new Pool(poolName, DEFAULT_SCHEDULING_MODE, DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
      // The new pool becomes a child of the root node
      rootPool.addSchedulable(parentPool)
      logInfo("Created pool %s, schedulingMode: %s, minShare: %d, weight: %d".format(
        poolName, DEFAULT_SCHEDULING_MODE, DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT))
    }
  }
  // As with FIFO, each parent pool holds a queue; the TaskSetManager joins at the tail
  parentPool.addSchedulable(manager)
  logInfo("Added task set " + manager.name + " tasks to pool " + poolName)
}
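From the user side, a minimal sketch of using FAIR mode might look as follows; the pool name "production" and the application name are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("FairSchedulingExample")
  .set("spark.scheduler.mode", "FAIR")   // jobs are scheduled through the FAIR rootPool
val sc = new SparkContext(conf)

// Jobs submitted from this thread go into the "production" pool; as shown above,
// addTaskSetManager creates the pool with default settings if it does not exist yet.
sc.setLocalProperty("spark.scheduler.pool", "production")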


(3) CoarseGrainedSchedulerBackend resource allocation

The CoarseGrainedSchedulerBackend#reviveOffers method sends a ReviveOffers message to DriverEndpoint. ReviveOffers itself is an empty case object; it merely triggers the underlying scheduling. A ReviveOffers message is sent whenever a task is submitted or resources change; in particular, every time a stage is submitted, resources are requested and a ReviveOffers message is sent.

// 2. The method that is called:
override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}

& ReviveOffers acts as a trigger that fires whenever resources change.

& TaskScheduler is responsible for assigning compute resources to tasks (from the cluster resources the application obtained from Master at startup) and, based on the locality principle, determining on which ExecutorBackend each task runs.


(4) Receiving the ReviveOffers message and allocating resources

DriverEndpoint receives the ReviveOffers message and routes it to the makeOffers method. In makeOffers, all WorkerOffers available for computation are prepared first (they represent the available-core information of all executors the application obtained from Master).

CoarseGrainedSchedulerBackend.DriverEndpoint#receive

override def receive: PartialFunction[Any, Unit] = {
  // part of the code omitted
  case ReviveOffers =>
    makeOffers()
}


// Logically, make all executors the providers of compute resources
private def makeOffers() {
  // Filter out executors that have died
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  // Generate a sequence of WorkerOffers describing every alive executor
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  // resourceOffers produces a two-dimensional array of resource assignments,
  // based on which the tasks are loaded and launched for execution
  launchTasks(scheduler.resourceOffers(workOffers))   // 3, 4
}


(a) The resourceOffers method

TaskSchedulerImpl#resourceOffers is called to allocate compute resources to each task. Its input is the cores available on each ExecutorBackend machine, and its output is a two-dimensional array of TaskDescription, in which it is determined on which ExecutorBackend each task will actually run.

// 3. Call resourceOffers: the input is a list of executor offers, the output is a
// two-dimensional array of TaskDescription
def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // Mark each slave node as alive and record its hostname;
  // also track whether a new slave node has been added
  var newExecAvail = false
  for (o <- offers) {
    executorIdToHost(o.executorId) = o.host
    executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
    if (!executorsByHost.contains(o.host)) {
      // A new executor has joined on this slave node
      executorsByHost(o.host) = new HashSet[String]()
      // Notify DAGScheduler that an executor has been added
      executorAdded(o.executorId, o.host)
      // Mark that a new executor is available
      newExecAvail = true
    }
    // Update the rack information for the host
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }

  // Randomly shuffle the offers (round-robin manner) so that tasks are not
  // concentrated on a few machines
  val shuffledOffers = Random.shuffle(offers)
  // Build a list of task assignments for each worker
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
  val availableCpus = shuffledOffers.map(o => o.cores).toArray
  // Get the TaskSetManagers ordered by the scheduling policy
  val sortedTaskSets = rootPool.getSortedTaskSetQueue
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) {
      // A new slave executor is available, so the TaskSetManager
      // needs to recompute its locality levels
      taskSet.executorAdded()
    }
  }

  // Allocate resources to the TaskSetManagers obtained from rootPool. Allocation follows
  // the locality principle, preferring PROCESS_LOCAL, then NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
  var launchedTask = false
  for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {   // 7
    do {
      launchedTask = resourceOfferSingleTaskSet(
        taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
    } while (launchedTask)
  }

  if (tasks.size > 0) {
    hasLaunchedTask = true
  }
  return tasks
}


Figure 36-1 The relationship between workers, tasks, and availableCpus

& The TaskDescription records on which ExecutorBackend the task will run. The algorithm that decides which ExecutorBackend a task runs on is implemented in TaskSetManager's resourceOffer method.


// Allocate resources within a single TaskSetManager
private def resourceOfferSingleTaskSet(
    taskSet: TaskSetManager,
    maxLocality: TaskLocality,
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int],
    tasks: Seq[ArrayBuffer[TaskDescription]]): Boolean = {
  var launchedTask = false
  // Traverse the currently available executors in order
  for (i <- 0 until shuffledOffers.size) {
    // Get the executor id and hostname
    val execId = shuffledOffers(i).executorId
    val host = shuffledOffers(i).host
    // Make sure the available cores are no fewer than the minimum required per task;
    // CPUS_PER_TASK defaults to 1
    if (availableCpus(i) >= CPUS_PER_TASK) {
      try {
        // Ask the TaskSetManager for a task at (at most) the given locality level,
        // and record the correspondence between the executor and the task
        for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
          tasks(i) += task
          val tid = task.taskId
          taskIdToTaskSetManager(tid) = taskSet
          taskIdToExecutorId(tid) = execId
          executorIdToTaskCount(execId) += 1
          executorsByHost(host) += execId
          availableCpus(i) -= CPUS_PER_TASK
          assert(availableCpus(i) >= 0)
          launchedTask = true
        }
      } catch {
        // The task could not be serialized
        case e: TaskNotSerializableException =>
          logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
          return launchedTask
      }
    }
  }
  // Return whether any task was allocated resources
  return launchedTask
}
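To make the flow of the two methods above concrete, here is a standalone toy model (not Spark code; all names are invented) of the idea: shuffle the offers, then sweep over the executors assigning one task per free slot until nothing more can be launched. It can be pasted into a Scala REPL.

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

case class Offer(execId: String, var freeCores: Int)

def assign(offers: Seq[Offer], numTasks: Int, cpusPerTask: Int = 1): Map[String, ArrayBuffer[Int]] = {
  // Shuffle the offers so tasks are not always packed onto the same executors
  val shuffled = Random.shuffle(offers)
  val assignments = shuffled.map(o => o.execId -> ArrayBuffer.empty[Int]).toMap
  var nextTask = 0
  var launched = true
  // Keep sweeping over the executors until no task can be launched any more
  while (launched && nextTask < numTasks) {
    launched = false
    for (o <- shuffled if nextTask < numTasks && o.freeCores >= cpusPerTask) {
      assignments(o.execId) += nextTask   // "assign" the task to this executor
      o.freeCores -= cpusPerTask          // bookkeeping, like availableCpus(i) -= CPUS_PER_TASK
      nextTask += 1
      launched = true
    }
  }
  assignments
}

// Example: three executors with different numbers of free cores, six tasks to place
println(assign(Seq(Offer("exec-1", 2), Offer("exec-2", 1), Offer("exec-3", 3)), numTasks = 6))

The real implementation differs in that it additionally walks through the locality levels of each TaskSetManager, but the round-robin bookkeeping over shuffled offers is the same idea.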

The idea behind the resourceOffers algorithm

resourceOffers determines on which ExecutorBackend each task runs. The algorithm works as follows:

a) Random.shuffle reshuffles the compute resources so that tasks are spread out and the computational load is balanced.

b) An ArrayBuffer of TaskDescription is declared for each ExecutorBackend, sized according to its number of cores.

c) If a new ExecutorBackend has been assigned to our job, executorAdded is called so that the TaskSetManager re-evaluates its locality levels against the full set of available compute resources.

& The data locality levels, ordered from highest to lowest priority, are: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY. NO_PREF means the task has no locality preference, and RACK_LOCAL means rack-level locality.
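A self-contained sketch (not Spark's own TaskLocality class; names mirror it for readability) of how this ordering is used: levels behave like an enumeration in which a smaller value means better locality, and an allowed level is clamped so it is never looser than what the caller permits, as the resourceOffer code further below does with allowedLocality and maxLocality.

object LocalityLevel extends Enumeration {
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
}
import LocalityLevel._

// Clamp an allowed level so it never exceeds (is looser than) the caller's maximum
def clamp(allowed: LocalityLevel.Value, max: LocalityLevel.Value): LocalityLevel.Value =
  if (allowed > max) max else allowed

println(clamp(ANY, NODE_LOCAL))   // NODE_LOCAL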

d) The loop below tracks the highest available locality level (see note 7).

for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {

Each task is computed with one thread by default.

// Executing a task requires one core by default, i.e. one thread
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
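A minimal sketch of changing this, assuming each task should reserve 2 cores (useful when a task itself runs multiple threads); the value 2 is illustrative.

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.task.cpus", "2")   // CPUS_PER_TASK above then becomes 2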

e) TaskSetManager#resourceOffer finally determines on which ExecutorBackend, and at which concrete locality level, each task runs.

// 3. Call TaskSetManager#resourceOffer
def resourceOffer(
    execId: String,
    host: String,
    maxLocality: TaskLocality.TaskLocality): Option[TaskDescription] = {
  // Only a non-zombie TaskSet, i.e. one that can still submit tasks, is considered
  if (!isZombie) {
    val curTime = clock.getTimeMillis()
    // Start from the maximum locality level the caller allows
    var allowedLocality = maxLocality
    if (maxLocality != TaskLocality.NO_PREF) {
      // Recompute the locality level allowed at this moment; because of delay
      // scheduling, the current locality depends on how long we have been waiting
      allowedLocality = getAllowedLocalityLevel(curTime)
      if (allowedLocality > maxLocality) {
        // The level obtained from delay scheduling may not be looser than maxLocality
        allowedLocality = maxLocality
      }
    }
    // Dequeue a task according to the locality level. index is the task's position in
    // the TaskSet, taskLocality its locality level, and speculative indicates a
    // speculative copy of a task that has already been scheduled elsewhere.
    dequeueTask(execId, host, allowedLocality) match {
      case Some((index, taskLocality, speculative)) =>
        // Found a task for this executor (equivalently, an executor for this TaskSet's
        // task); register the necessary information before returning it
        val task = tasks(index)
        // Create a task id
        val taskId = sched.newTaskId()
        // Do various bookkeeping
        copiesRunning(index) += 1
        // Number of attempts made so far for this task
        val attemptNum = taskAttempts(index).size
        // Instantiate the task's meta information
        val info = new TaskInfo(taskId, index, attemptNum, curTime,
          execId, host, taskLocality, speculative)
        taskInfos(taskId) = info
        taskAttempts(index) = info :: taskAttempts(index)
        // Update the locality index used by the delay-scheduling policy;
        // NO_PREF does not affect the delay-scheduling variables
        if (maxLocality != TaskLocality.NO_PREF) {
          currentLocalityIndex = getLocalityIndex(taskLocality)
          lastLaunchTime = curTime
        }
        // Serialize and return the task
        val startTime = clock.getTimeMillis()
        val serializedTask: ByteBuffer = try {
          Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, ser)
        } catch {
          // If the task cannot be serialized, abort the whole TaskSet
          case NonFatal(e) =>
            val msg = s"Failed to serialize task $taskId, not attempting to retry it."
            logError(msg, e)
            abort(s"$msg Exception during serialization: $e")
            throw new TaskNotSerializableException(e)
        }
        // Warn if the serialized task exceeds the recommended size
        if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
            !emittedTaskSizeWarning) {
          emittedTaskSizeWarning = true
          logWarning(s"Stage ${task.stageId} contains a task of very large size " +
            s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
            s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
        }
        // Add the task to the running-task queue and log the launch
        addRunningTask(taskId)
        val taskName = s"task ${info.id} in stage ${taskSet.id}"
        logInfo(s"Starting $taskName (TID $taskId, $host, partition ${task.partitionId}, " +
          s"$taskLocality, ${serializedTask.limit} bytes)")
        // Tell the higher-level scheduler, DAGScheduler, that the task has started
        sched.dagScheduler.taskStarted(task, info)
        // Return a Some that wraps the TaskDescription
        return Some(new TaskDescription(taskId = taskId, attemptNumber = attemptNum,
          execId, taskName, index, serializedTask))
      case _ =>
    }
  }
  None
}

& DAGScheduler considers preferred locations from the data (storage) level, while TaskScheduler considers locality from the point of view of the specific task being computed.

f) launchTasks then sends each task to its ExecutorBackend for execution (see note 4 in the code under (4)).

// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    // Serialize every task
    val serializedTask = ser.serialize(task)
    // Check the serialized size against the Akka frame-size limit
    if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
            "spark.akka.frameSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit,
            akkaFrameSize, AkkaUtils.reservedSizeBytes)
          // The task size exceeds the limit, so the TaskSet is discarded
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      // Otherwise the task size meets the requirement; update the executor's bookkeeping
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      // Send the serialized task to the executor
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}

& On task size limits: when a task is shipped, the Akka frame size is 128 MB and the Akka reserved size is 200 KB. If the serialized task is greater than or equal to 128 MB - 200 KB, the task is discarded directly (the TaskSet is aborted); if it is smaller, CoarseGrainedSchedulerBackend's launchTasks sends it to the specific ExecutorBackend.
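A minimal sketch of the advice in the error message above: ship a large lookup table as a broadcast variable instead of capturing it in each task's closure. The data and names here are illustrative, and sc is an existing SparkContext.

val bigTable: Map[Int, String] = (1 to 100000).map(i => i -> s"value-$i").toMap
val bigTableBc = sc.broadcast(bigTable)   // shipped to each executor once, not once per task

val rdd = sc.parallelize(1 to 100)
val resolved = rdd.map(i => bigTableBc.value.getOrElse(i, "missing")).collect()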

At this point the driver-side processing is complete. The next section explains how the ExecutorBackend side receives and processes the task (see the diagram from lesson 35).



Description

This article is based on notes from lesson 36 of the IFM course at DT Big Data DreamWorks.

