Stage partitioning algorithm


Summary of the stage partitioning algorithm

    1. A finalStage is created from the last RDD of the job.

    2. Starting from finalStage, the scheduler works backwards through the RDD dependencies.

    3. Each wide (shuffle) dependency marks the boundary of a new stage.

    4. Stages are submitted recursively, starting with the parent stages.

The source code is in the org.apache.spark.scheduler package.

The stage partitioning algorithm is implemented by the submitStage and getMissingParentStages methods of DAGScheduler.

Step one: Use the last RDD, the one that triggered the job, to create finalStage by passing it to the newStage method

    var finalStage: Stage = null

    // Create a Stage object and add the stage to the DAGScheduler
    finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)

Step two: Create a job from finalStage; finalStage is the last stage of the job

    val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)

Step three: Add the job to the in-memory cache

    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.resultOfJob = Some(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))

Step four: Submit finalStage using the submitStage method

    submitStage(finalStage)

submitStage calls the getMissingParentStages method to get the parent stages of the current stage that are not yet available

    val missing = getMissingParentStages(stage).sortBy(_.id)

Inside getMissingParentStages, first push the last RDD of the stage onto the stack.

    waitingForVisit.push(stage.rdd)

Then loop, calling the locally defined visit() method on each popped RDD

    while (!waitingForVisit.isEmpty) {
      visit(waitingForVisit.pop())
    }

Within the visit() method, traverse the dependencies of the RDD

    for (dep <- rdd.dependencies)

If it is a narrow dependency, push the parent RDD onto the stack

    case narrowDep: NarrowDependency[_] =>
      waitingForVisit.push(narrowDep.rdd)

If it is a wide (shuffle) dependency, a new stage is created from the parent RDD, with isShuffleMap set to true. (The last stage is not a shuffle map stage by default.) Every stage other than finalStage is a shuffle map stage.

    case shufDep: ShuffleDependency[_, _, _] =>
      val mapStage = getShuffleMapStage(shufDep, stage.jobId)
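
The traversal described above can be sketched on a toy lineage. This is only an illustration under stated assumptions: Node, Narrow, Wide and parentStageRdds are hypothetical stand-ins invented for this example, not Spark classes, and the sketch ignores caching and stage availability.

```scala
import scala.collection.mutable

object StageBoundaries {
  // Hypothetical stand-ins for RDD and Dependency (not the Spark API).
  sealed trait Dep
  final case class Narrow(parent: Node) extends Dep
  final case class Wide(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Walks back from `last` with an explicit stack. Narrow dependencies stay
  // in the same stage; a wide dependency marks a parent stage and is not followed.
  def parentStageRdds(last: Node): List[String] = {
    val parents = mutable.LinkedHashSet[String]()
    val visited = mutable.HashSet[String]()
    val waitingForVisit = new mutable.Stack[Node]()
    waitingForVisit.push(last)
    while (waitingForVisit.nonEmpty) {
      val node = waitingForVisit.pop()
      if (visited.add(node.name)) {
        node.deps.foreach {
          case Narrow(p) => waitingForVisit.push(p) // same stage: keep walking
          case Wide(p)   => parents += p.name       // stage boundary: stop here
        }
      }
    }
    parents.toList
  }

  // Example lineage: lines -> mapped -(shuffle)-> reduced -> filtered
  val lines    = Node("lines")
  val mapped   = Node("mapped", List(Narrow(lines)))
  val reduced  = Node("reduced", List(Wide(mapped)))
  val filtered = Node("filtered", List(Narrow(reduced)))
}
```

Calling StageBoundaries.parentStageRdds(StageBoundaries.filtered) yields List("mapped"): the walk passes through the narrow dependency on reduced and stops at the shuffle, so mapped is the last RDD of the single parent stage.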


Back in submitStage, the result decides what happens next:

    if (missing == Nil) {
      // No missing parent stage: submit the tasks of this stage
      logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
      submitMissingTasks(stage, jobId.get)
    } else {
      // Recursively call submitStage to submit each missing parent stage
      for (parent <- missing) {
        submitStage(parent)
      }
      // and put the current stage in the queue of stages waiting to run
      waitingStages += stage
    }
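
To see which stages start first, the recursion can be modelled with a toy stage graph. This is a rough sketch, assuming a hypothetical ToyStage type whose missingParents list is fixed up front; in Spark the list is recomputed by getMissingParentStages, and waiting stages are picked up again once their parents finish, which is not modelled here.

```scala
import scala.collection.mutable

object SubmitOrder {
  // Hypothetical toy stage: an id plus the parent stages that are not yet available.
  final case class ToyStage(id: Int, missingParents: List[ToyStage] = Nil)

  // Returns (ids of stages whose tasks start immediately, ids of stages left waiting).
  def submit(finalStage: ToyStage): (List[Int], Set[Int]) = {
    val started = mutable.ListBuffer[Int]()
    val waiting = mutable.LinkedHashSet[Int]()

    def submitStage(stage: ToyStage): Unit = {
      if (!waiting(stage.id) && !started.contains(stage.id)) {
        val missing = stage.missingParents.sortBy(_.id)
        if (missing.isEmpty) {
          started += stage.id          // no missing parents: run its tasks now
        } else {
          missing.foreach(submitStage) // parents first
          waiting += stage.id          // then park the current stage
        }
      }
    }

    submitStage(finalStage)
    (started.toList, waiting.toSet)
  }

  // Example: stages 0 and 1 feed stage 2, which feeds the final stage 3.
  val s0 = ToyStage(0)
  val s1 = ToyStage(1)
  val s2 = ToyStage(2, List(s0, s1))
  val s3 = ToyStage(3, List(s2))
}
```

SubmitOrder.submit(SubmitOrder.s3) returns (List(0, 1), Set(2, 3)): only the stages without missing parents start, and every stage that depends on them waits.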


/*
 * Method for submitting a stage
 */

    private def submitStage(stage: Stage) {
      val jobId = activeJobForStage(stage)
      if (jobId.isDefined) {
        logDebug("submitStage(" + stage + ")")
        if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
          // Call the getMissingParentStages method to get the missing parent stages of the current stage
          val missing = getMissingParentStages(stage).sortBy(_.id)
          logDebug("missing: " + missing)
          if (missing == Nil) {
            logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
            submitMissingTasks(stage, jobId.get)
          } else {
            for (parent <- missing) {
              submitStage(parent)
            }
            waitingStages += stage
          }
        }
      } else {
        abortStage(stage, "No active job for stage " + stage.id)
      }
    }



/*
 * Method for getting the missing parent stages of a stage
 */

    private def getMissingParentStages(stage: Stage): List[Stage] = {
      val missing = new HashSet[Stage]
      val visited = new HashSet[RDD[_]]
      // We are manually maintaining a stack here to prevent StackOverflowError
      // caused by recursively visiting
      val waitingForVisit = new Stack[RDD[_]]
      def visit(rdd: RDD[_]) {
        if (!visited(rdd)) {
          visited += rdd
          if (getCacheLocs(rdd).contains(Nil)) {
            // Traverse the parent dependencies of the RDD
            for (dep <- rdd.dependencies) {
              dep match {
                case shufDep: ShuffleDependency[_, _, _] =>
                  val mapStage = getShuffleMapStage(shufDep, stage.jobId)
                  if (!mapStage.isAvailable) {
                    missing += mapStage
                  }
                case narrowDep: NarrowDependency[_] =>
                  waitingForVisit.push(narrowDep.rdd)
              }
            }
          }
        }
      }
      // First, push the last RDD of the stage onto the stack
      waitingForVisit.push(stage.rdd)
      // Then loop, calling the locally defined visit() method
      while (!waitingForVisit.isEmpty) {
        visit(waitingForVisit.pop())
      }
      missing.toList
    }
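
The comment about StackOverflowError refers to very long lineages. The sketch below is a hedged illustration of that point only: ToyRdd, chain and countReachable are hypothetical names made up for this example, and the lineage contains nothing but narrow dependencies. With an explicit stack the traversal depth is limited by heap space, not by the call stack.

```scala
import scala.collection.mutable

object DeepLineage {
  // Hypothetical toy RDD; a plain class, so sets hash it by identity.
  final class ToyRdd(val id: Int, val parents: List[ToyRdd])

  // Builds a linear lineage of `length` RDDs and returns the last one.
  def chain(length: Int): ToyRdd =
    (1 until length).foldLeft(new ToyRdd(0, Nil)) { (prev, i) =>
      new ToyRdd(i, List(prev))
    }

  // Counts the RDDs reachable from `last` without any recursive call.
  def countReachable(last: ToyRdd): Int = {
    val visited = mutable.HashSet[ToyRdd]()
    val waitingForVisit = new mutable.Stack[ToyRdd]()
    waitingForVisit.push(last)
    while (waitingForVisit.nonEmpty) {
      val rdd = waitingForVisit.pop()
      if (visited.add(rdd)) {
        rdd.parents.foreach(p => waitingForVisit.push(p))
      }
    }
    visited.size
  }
}
```

For example, DeepLineage.countReachable(DeepLineage.chain(200000)) visits all 200000 RDDs in a plain loop. A visit() that called itself once per parent would need one call-stack frame per RDD in the chain, which is the failure mode the source comment is guarding against.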

