Spark 2.0 Source Learning (9): Job Submission and Task Splitting

"Spark2.0 Source Learning" -9.job submission and task splittingIn the previous section of the client load, Spark's Driverrunner has started to execute the user task class (for example: Org.apache.spark.examples.SparkPi), which we begin to analyze for the user task Class (or task code) first, the overall preview           Expands on the previous diagram to increase the related interaction of task execution
Code: the user-written code.
RDD: resilient distributed dataset. Code written against the SparkContext and RDD APIs is readily converted into the RDD data structure (the conversion details are described below).
DAGScheduler: the directed acyclic graph scheduler. It encapsulates the RDD as a JobSubmitted object and posts it into the EventLoop queue (implementation class: DAGSchedulerEventProcessLoop).
EventLoop: continuously scans for unhandled JobSubmitted objects and hands each one back to DAGScheduler.
DAGScheduler: processes the JobSubmitted event, eventually converting the RDD into an executable TaskSet and committing the TaskSet to TaskScheduler.
TaskScheduler: creates a TaskSetManager object from the TaskSet, puts it into the SchedulableBuilder data pool (Pool), and sends the DriverEndpoint a consume (ReviveOffers) instruction.
DriverEndpoint: after receiving the ReviveOffers instruction, distributes the tasks in the TaskSet to Executors according to the relevant rules.
Executor: starts a TaskRunner to execute a task.

II. Converting code into the initial RDDs

Our user code creates Spark's context (SparkContext) by invoking the Spark API (for example: SparkSession.builder.appName("Spark Pi").getOrCreate()). Calling a transformation method (such as parallelize() or map()) creates (or decorates) the Spark data structure (RDD); calling an action method (such as reduce()) submits the last encapsulated RDD as a job into the queue to be dispatched (DAGSchedulerEventProcessLoop) for subsequent asynchronous processing. If action operations are invoked multiple times, the encapsulated RDDs are submitted as multiple jobs.
The process is as follows:

ExecuteEnv (execution environment): can be a MainClass submitted via spark-submit, or the spark-shell script. MainClass: the code will create or get a SparkContext. spark-shell: creates a SparkContext by default.
RDD (resilient distributed dataset) creation: an RDD can be created directly (e.g. sc.parallelize(1 until n, slices)) or read from elsewhere (e.g. sc.textFile("README.md")), etc.
Transformation: the RDD API can repeatedly wrap an existing RDD into a new RDD; the decorator design pattern is used here. Below is a partial decorator class diagram.
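The decorator idea above can be illustrated outside of Spark. The following is a minimal Python sketch (all class and method names are hypothetical stand-ins, not Spark's actual API): each transformation wraps the parent dataset lazily, and nothing is computed until an action is called.

```python
# Minimal sketch of the RDD decorator idea: transformations wrap the
# parent lazily; only an action (collect) triggers computation.
# All names here are illustrative, not Spark's real classes.

class SketchRDD:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent, self.fn, self.data = parent, fn, data

    def map(self, fn):
        # Transformation: returns a new RDD decorating this one;
        # no work happens yet.
        return SketchRDD(parent=self, fn=fn)

    def collect(self):
        # Action: walk the decorator chain down to the source data,
        # then apply each recorded function on the way back up.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

def parallelize(seq):
    return SketchRDD(data=seq)

rdd = parallelize(range(4)).map(lambda x: x * 2).map(lambda x: x + 1)
print(rdd.collect())  # [1, 3, 5, 7]
```

Each map() call only records a function; the whole chain runs once, when collect() is invoked.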
Action: invoking an action method of the RDD (collect, reduce, lookup, save) triggers the DAGScheduler job submission.
DAGScheduler: creates a message named JobSubmitted and posts it into DAGSchedulerEventProcessLoop's blocking message queue (LinkedBlockingDeque).
DAGSchedulerEventProcessLoop: a thread named "dag-scheduler-event-loop" continuously consumes the message queue and calls back the JobWaiter.
DAGScheduler: prints the job execution result.
JobSubmitted: the relevant code is as follows (where jobId is a DAGScheduler global incrementing ID):
eventProcessLoop.post(JobSubmitted(
  jobId, rdd, func2, partitions.toArray, callSite, waiter,
  SerializationUtils.clone(properties)))
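The post-and-consume pattern above (a blocking queue drained by one dedicated thread) can be sketched in Python. The class below is an illustrative stand-in for DAGSchedulerEventProcessLoop, not Spark code; queue.Queue plays the role of the LinkedBlockingDeque.

```python
import queue
import threading

# Sketch of an event-process loop: events are posted to a blocking queue
# and consumed by a single dedicated thread, much as the
# "dag-scheduler-event-loop" thread drains JobSubmitted messages.
# All names are illustrative.

class EventLoopSketch:
    def __init__(self, handler):
        self.events = queue.Queue()   # stands in for LinkedBlockingDeque
        self.handler = handler
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def post(self, event):            # producer side: non-blocking enqueue
        self.events.put(event)

    def _run(self):                   # consumer thread body
        while True:
            event = self.events.get()
            if event is None:         # sentinel to stop the loop
                break
            self.handler(event)

handled = []
loop = EventLoopSketch(handled.append)
loop.start()
loop.post(("JobSubmitted", 0))
loop.post(("JobSubmitted", 1))
loop.post(None)
loop.thread.join()
print(handled)  # [('JobSubmitted', 0), ('JobSubmitted', 1)]
```

The producer (the action call) returns immediately after post(); the consumer thread processes events strictly in arrival order.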
Final example:

The final conversion produces four layers of RDDs, each layer depending on the one above it; the ShuffledRDD is encapsulated as a job and deposited into DAGSchedulerEventProcessLoop to be processed. If the sample code above contains several action calls, a corresponding number of ShuffledRDDs will be created and deposited into DAGSchedulerEventProcessLoop.

III. Decomposing the RDD into the task set to be executed (TaskSet)

After the job is submitted, DAGScheduler resolves the RDD hierarchy into the corresponding stages while maintaining the relationship between the job and its stages. The topmost stage is decomposed into multiple tasks according to its missing partitions (findMissingPartitions), and those tasks are encapsulated as a TaskSet and submitted to TaskScheduler. The non-topmost stages are deposited into a list for later processing (waitingStages += stage). The flow is as follows:
In DAGSchedulerEventProcessLoop, the thread "dag-scheduler-event-loop" processes the JobSubmitted message and calls DAGScheduler.handleJobSubmitted. First, the stage family is created according to the RDD dependency relationships; stages are divided into two classes: ShuffleMapStage and ResultStage.
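The stage-splitting rule can be sketched with a toy lineage model (the tuple representation below is hypothetical, not Spark's internal classes): cut the lineage at each shuffle (wide) dependency; the stage containing the final RDD becomes the ResultStage and every stage above a boundary becomes a ShuffleMapStage.

```python
# Sketch: split an RDD lineage into stages at shuffle boundaries.
# An RDD is modeled as (name, dependency_kind, parent); "shuffle"
# marks a wide dependency. Illustrative only, not Spark's classes.

def build_stages(final_rdd):
    stages, current = [], []
    rdd = final_rdd
    while rdd is not None:
        name, dep, parent = rdd
        current.append(name)
        if dep == "shuffle":
            # A shuffle dependency closes the current stage; the RDDs
            # above the boundary go into a new (parent) stage.
            stages.append(current)
            current = []
        rdd = parent
    stages.append(current)
    # stages[0] contains the final RDD -> ResultStage; the rest are
    # ShuffleMapStages, listed from the bottom of the lineage upward.
    kinds = ["ResultStage"] + ["ShuffleMapStage"] * (len(stages) - 1)
    return list(zip(kinds, stages))

source = ("textFile", "none", None)
mapped = ("map", "narrow", source)
shuffled = ("shuffledRDD", "shuffle", mapped)
final = ("mapValues", "narrow", shuffled)
print(build_stages(final))
# [('ResultStage', ['mapValues', 'shuffledRDD']),
#  ('ShuffleMapStage', ['map', 'textFile'])]
```

Narrow dependencies (map, filter) stay inside one stage; only the shuffle forces a new one.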

Update the jobId and stageId relationship map. Create an ActiveJob, call LiveListenerBus, and send the SparkListenerJobStart event. Find the top stage to commit; the lower stages are deposited into waitingStages for subsequent processing. Call OutputCommitCoordinator for stageStart() processing. Call LiveListenerBus, sending the SparkListenerStageSubmitted event. Call SparkContext's broadcast method to get the broadcast object. Create the corresponding tasks based on the stage type: a stage is divided into several tasks according to findMissingPartitions, and the task types are ShuffleMapTask and ResultTask.

Encapsulate the tasks as a TaskSet and call taskScheduler.submitTasks(taskSet) for task scheduling. The key code is as follows:

taskScheduler.submitTasks(new TaskSet(
  tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
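The stage-to-tasks step can be sketched as follows: one task per missing partition, with the task class chosen by the stage type. The dictionary/tuple shapes below are illustrative stand-ins, not Spark's TaskSet.

```python
# Sketch: decompose a stage into tasks, one per missing partition,
# then bundle them as a task set. Illustrative, not Spark's classes.

def make_task_set(stage_type, num_partitions, computed, stage_id, job_id):
    task_class = ("ShuffleMapTask" if stage_type == "ShuffleMapStage"
                  else "ResultTask")
    # findMissingPartitions: only partitions without a computed
    # result still need a task.
    missing = [p for p in range(num_partitions) if p not in computed]
    tasks = [(task_class, stage_id, p) for p in missing]
    return {"tasks": tasks, "stage_id": stage_id, "job_id": job_id}

ts = make_task_set("ShuffleMapStage", 4, computed={1}, stage_id=0, job_id=0)
print(ts["tasks"])
# [('ShuffleMapTask', 0, 0), ('ShuffleMapTask', 0, 2), ('ShuffleMapTask', 0, 3)]
```

This is why a resubmitted stage (after a partial failure) produces fewer tasks: already-computed partitions are skipped.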
IV. Encapsulating the TaskSet as a TaskSetManager and submitting it to the Driver

TaskScheduler encapsulates the TaskSet as a TaskSetManager (new TaskSetManager(this, taskSet, maxTaskFailures, blacklistTrackerOpt)), puts it into the in-process task pool, and sends the DriverEndpoint a consume (ReviveOffers) instruction. DAGScheduler submits the TaskSet to the TaskScheduler implementation class, here TaskSchedulerImpl. TaskSchedulerImpl creates a TaskSetManager to manage the TaskSet; the key code is as follows:
new TaskSetManager(this, taskSet, maxTaskFailures, blacklistTrackerOpt)
The TaskSetManager is added to the SchedulableBuilder task pool (Pool), and SchedulerBackend's implementation class, in standalone mode StandaloneSchedulerBackend, sends the ReviveOffers instruction to DriverEndpoint.

V. The Driver converts the TaskSetManager into TaskDescriptions and dispatches the tasks to the Executors

The Driver accepts the consume instruction, matches all pending TaskSetManagers against the executor resources registered in the Driver, and finally one TaskSetManager yields multiple TaskDescription objects. Following each TaskDescription, a LaunchTask instruction is sent to the corresponding Executor.
When the Driver receives the ReviveOffers (request consumption) instruction, it first obtains the executor resource information (WorkerOffer) from the executorDataMap cache. The key code is as follows:
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map { case (id, executorData) =>
  new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toIndexedSeq
Then TaskScheduler is called for resource matching. The method is defined as follows:
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized { .. }
The WorkerOffer resources are shuffled (val shuffledOffers = Random.shuffle(offers)), the pending TaskSetManagers are taken from the pool (val sortedTaskSets = rootPool.getSortedTaskSetQueue), and sortedTaskSets is looped over, matching each one against the shuffledOffers loop: if shuffledOffers(i) has sufficient CPU resources (if (availableCpus(i) >= CPUS_PER_TASK)), the TaskSetManager is called to create a TaskDescription object (taskSet.resourceOffer(execId, host, maxLocality)), eventually creating multiple TaskDescriptions. TaskDescription is constructed as follows:
new TaskDescription(
  taskId,
  attemptNum,
  execId,
  taskName,
  index,
  sched.sc.addedFiles,
  sched.sc.addedJars,
  task.localProperties,
  serializedTask)
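The matching loop described above (shuffle the offers, then greedily assign tasks while an offer still has CPUs free) can be sketched in Python. CPUS_PER_TASK and the offer/task shapes below are illustrative stand-ins for Spark's internals.

```python
import random

# Sketch of the resourceOffers matching loop: shuffle the worker
# offers, then for each pending task set, hand out tasks to each
# offer while that offer still has CPUs available. Illustrative only.

CPUS_PER_TASK = 1

def resource_offers(offers, sorted_task_sets, rng=random):
    shuffled = offers[:]                  # Random.shuffle(offers)
    rng.shuffle(shuffled)
    available = [cores for _host, cores in shuffled]
    assignments = []                      # (host, task) pairs
    for tasks in sorted_task_sets:        # rootPool.getSortedTaskSetQueue
        pending = list(tasks)
        for i, (host, _cores) in enumerate(shuffled):
            # availableCpus(i) >= CPUS_PER_TASK
            while pending and available[i] >= CPUS_PER_TASK:
                assignments.append((host, pending.pop(0)))
                available[i] -= CPUS_PER_TASK
    return assignments

rng = random.Random(0)
assigned = resource_offers([("host-a", 2), ("host-b", 1)],
                           [["t0", "t1", "t2", "t3"]], rng)
print([t for _h, t in assigned])  # ['t0', 't1', 't2']
```

With three free cores across the two offers, only three of the four tasks are assigned; t3 stays pending until a later ReviveOffers round frees resources. Shuffling the offers spreads load across executors instead of always filling the first one.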
If taskDescriptions is not empty, loop over taskDescriptions, serialize each TaskDescription object, and send the LaunchTask instruction to the ExecutorEndpoint. The key code is as follows:
for (task <- taskDescriptions.flatten) {
  val serializedTask = TaskDescription.encode(task)
  val executorData = executorDataMap(task.executorId)
  executorData.freeCores -= scheduler.CPUS_PER_TASK
  executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
}
