Stage is the physical unit of execution that Spark schedules. Below is the Stage source from Spark 1.6 (org.apache.spark.scheduler.Stage):
package org.apache.spark.scheduler

import scala.collection.mutable.HashSet

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.CallSite

/**
 * A stage is a set of parallel tasks all computing the same function that need to run as part
 * of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run
 * by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
 * DAGScheduler runs these stages in topological order.
 *
 * A stage consists of a set of identical tasks; the boundary between stages is the shuffle.
 *
 * Each Stage can either be a shuffle map stage, in which case its tasks' results are input for
 * other stage(s), or a result stage, in which case its tasks directly compute a Spark action
 * (e.g. count(), save(), etc.) by running a function on an RDD. For shuffle map stages, we also
 * track the nodes that each output partition is on.
 *
 * Stages are therefore divided into ShuffleMapStage and ResultStage.
 *
 * Each Stage also has a firstJobId, identifying the job that first submitted the stage. When FIFO
 * scheduling is used, this allows stages from earlier jobs to be computed first or recovered
 * faster on failure.
 *
 * A stage can be retried: a single stage can be re-executed in multiple attempts due to fault
 * recovery. In that case, the Stage object will track multiple StageInfo objects to pass to
 * listeners or the web UI. The latest one will be accessible through latestInfo.
 *
 * @param id Unique stage ID
 * @param rdd RDD that this stage runs on: for a shuffle map stage, it's the RDD we run map tasks
 *   on, while for a result stage, it's the target RDD that we ran an action on
 * @param numTasks Total number of tasks in stage; result stages in particular may not need to
 *   compute all partitions, e.g. for first(), lookup(), and take().
 * @param parents List of stages that this stage depends on (through shuffle dependencies).
 * @param firstJobId ID of the first job this stage was part of, for FIFO scheduling.
 * @param callSite Location in the user program associated with this stage: either where the
 *   target RDD was created, for a shuffle map stage, or where the action for a result stage
 *   was called.
 */
private[scheduler] abstract class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val parents: List[Stage],   // parents is the list of parent stages; it expresses the DAG's connection relationships
    val firstJobId: Int,
    val callSite: CallSite)
  extends Logging {

  val numPartitions = rdd.partitions.length

  /** Set of jobs that this stage belongs to. */
  val jobIds = new HashSet[Int]

  val pendingPartitions = new HashSet[Int]

  /** The ID to use for the next new attempt for this stage. */
  private var nextAttemptId: Int = 0

  val name: String = callSite.shortForm
  val details: String = callSite.longForm

  private var _internalAccumulators: Seq[Accumulator[Long]] = Seq.empty

  /** Internal accumulators shared across all tasks in this stage. */
  def internalAccumulators: Seq[Accumulator[Long]] = _internalAccumulators

  /**
   * Re-initialize the internal accumulators associated with this stage.
   *
   * This is called every time the stage is submitted, *except* when a subset of tasks
   * belonging to this stage has already finished. Otherwise, reinitializing the internal
   * accumulators here again would override the partial values from the finished tasks.
   */
  def resetInternalAccumulators(): Unit = {
    _internalAccumulators = InternalAccumulator.create(rdd.sparkContext)
  }

  /**
   * Pointer to the [StageInfo] object for the most recent attempt. This needs to be initialized
   * here, before any attempts have actually been created, because the DAGScheduler uses this
   * StageInfo to tell SparkListeners when a job starts (which happens before any stage attempts
   * have been created).
   */
  private var _latestInfo: StageInfo = StageInfo.fromStage(this, nextAttemptId)

  /**
   * Set of stage attempt IDs that have failed with a FetchFailure. We keep track of these
   * failures in order to avoid endless retries if a stage keeps failing with a FetchFailure.
   * We keep track of each attempt ID that has failed to avoid recording duplicate failures if
   * multiple tasks from the same stage attempt fail (SPARK-5945).
   */
  private val fetchFailedAttemptIds = new HashSet[Int]

  private[scheduler] def clearFailures(): Unit = {
    fetchFailedAttemptIds.clear()
  }

  /**
   * Check whether we should abort the failedStage due to multiple consecutive fetch failures.
   *
   * This method updates the running set of failed stage attempts and returns
   * true if the number of failures exceeds the allowable number of failures.
   *
   * In other words: check whether the current failed stage should be abandoned.
   */
  private[scheduler] def failedOnFetchAndShouldAbort(stageAttemptId: Int): Boolean = {
    fetchFailedAttemptIds.add(stageAttemptId)
    fetchFailedAttemptIds.size >= Stage.MAX_CONSECUTIVE_FETCH_FAILURES
  }

  /**
   * Creates a new attempt for this stage by creating a new StageInfo with a new attempt ID.
   * This is how a stage is retried.
   */
  def makeNewStageAttempt(
      numPartitionsToCompute: Int,
      taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit = {
    _latestInfo = StageInfo.fromStage(
      this, nextAttemptId, Some(numPartitionsToCompute), taskLocalityPreferences)
    nextAttemptId += 1
  }

  /** Returns the StageInfo for the most recent attempt for this stage. */
  def latestInfo: StageInfo = _latestInfo

  override final def hashCode(): Int = id

  override final def equals(other: Any): Boolean = other match {
    case stage: Stage => stage != null && stage.id == id
    case _ => false
  }

  /** Returns the sequence of partition IDs that are missing (i.e. still need to be computed). */
  def findMissingPartitions(): Seq[Int]
}

private[scheduler] object Stage {
  // The number of consecutive failures allowed before a stage is aborted (retry at most 4 times).
  val MAX_CONSECUTIVE_FETCH_FAILURES = 4
}
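The retry bookkeeping above is small enough to model on its own. Below is a minimal standalone sketch (not Spark code; the names SimpleStage and RetryDemo are made up for illustration) that mirrors how makeNewStageAttempt hands out attempt IDs and how failedOnFetchAndShouldAbort accumulates fetch-failed attempt IDs until the MAX_CONSECUTIVE_FETCH_FAILURES threshold aborts the stage:

// Standalone sketch of the attempt/retry logic; SimpleStage and RetryDemo are hypothetical names.
import scala.collection.mutable.HashSet

object SimpleStage {
  // Same threshold as Stage.MAX_CONSECUTIVE_FETCH_FAILURES in the source above.
  val MaxConsecutiveFetchFailures = 4
}

class SimpleStage(val id: Int) {
  private var nextAttemptId: Int = 0
  private val fetchFailedAttemptIds = new HashSet[Int]

  // Mirrors makeNewStageAttempt: every resubmission of the stage gets a fresh attempt ID.
  def makeNewAttempt(): Int = {
    val attempt = nextAttemptId
    nextAttemptId += 1
    attempt
  }

  // Mirrors failedOnFetchAndShouldAbort: record the failed attempt ID (duplicates are ignored
  // by the set) and abort once the number of distinct failed attempts reaches the threshold.
  def failedOnFetchAndShouldAbort(attemptId: Int): Boolean = {
    fetchFailedAttemptIds.add(attemptId)
    fetchFailedAttemptIds.size >= SimpleStage.MaxConsecutiveFetchFailures
  }
}

object RetryDemo extends App {
  val stage = new SimpleStage(id = 0)
  var abort = false
  while (!abort) {
    val attempt = stage.makeNewAttempt()
    // Pretend every attempt hits a fetch failure; the stage aborts after the fourth one.
    abort = stage.failedOnFetchAndShouldAbort(attempt)
    println(s"attempt $attempt failed, abort = $abort")
  }
}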
ShuffleMapStage and ResultStage
As shown in the figure:
ShuffleMapStage is an intermediate stage in a job's execution, while ResultStage is the final stage of the job.
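To make the boundary concrete, here is a small self-contained example against the public RDD API (a hypothetical word-count job run in local mode). reduceByKey introduces a shuffle dependency, so everything up to the map side of the shuffle is scheduled as a ShuffleMapStage, and the work triggered by the collect() action runs as the ResultStage:

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stage-boundary-demo").setMaster("local[2]"))

    val counts = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1)) // narrow dependency: stays in the ShuffleMapStage
      .reduceByKey(_ + _)     // shuffle dependency: stage boundary here
      .collect()              // action: the post-shuffle work runs as the ResultStage

    println(counts.mkString(", "))
    sc.stop()
  }
}

Running this job and opening the Spark UI shows the two stages: the map side before the shuffle and the final stage that produces the collected result.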