Stage partitioning algorithm

Last Update: 2017-05-05 Source: Internet

Author: User

Summary of the stage partitioning algorithm

A final stage (finalStage) is created from the last RDD of the job
The algorithm works backward from finalStage through the RDD dependencies
Every wide (shuffle) dependency marks the start of a new stage
Stages are submitted recursively, parent stages first
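
The four points above can be illustrated on a toy lineage. The following is only a sketch under stated assumptions: Node, Narrow, Wide and splitIntoStages are hypothetical names invented for this example and are not part of Spark. It walks backward from the last node and cuts a new stage at every wide dependency.

```scala
import scala.collection.mutable

object StageSplitSketch {
  // Hypothetical toy model of an RDD lineage; these are not Spark's classes.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Wide(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Walk backward from `last`. Nodes reached through narrow dependencies stay
  // in the current stage; the parent of a wide dependency starts a new stage.
  def splitIntoStages(last: Node): List[Set[String]] = {
    val stages = mutable.ListBuffer.empty[Set[String]]
    val stageRoots = mutable.Queue(last)      // last node of each stage still to explore
    val seenRoots = mutable.Set(last.name)
    while (stageRoots.nonEmpty) {
      val root = stageRoots.dequeue()
      val members = mutable.Set.empty[String]
      var stack = List(root)                  // explicit stack, as in the real scheduler
      while (stack.nonEmpty) {
        val node = stack.head
        stack = stack.tail
        if (members.add(node.name)) {
          node.deps.foreach {
            case Narrow(p) => stack = p :: stack
            case Wide(p)   => if (seenRoots.add(p.name)) stageRoots.enqueue(p)
          }
        }
      }
      stages += members.toSet
    }
    stages.toList
  }

  def main(args: Array[String]): Unit = {
    val lines  = Node("textFile")
    val pairs  = Node("map", List(Narrow(lines)))
    val counts = Node("reduceByKey", List(Wide(pairs)))
    val result = Node("filter", List(Narrow(counts)))
    // The final stage comes first, followed by the stage that feeds its shuffle.
    splitIntoStages(result).foreach(println)
  }
}
```

In this toy lineage the final stage contains filter and reduceByKey, and the parent stage contains map and textFile, because the only wide dependency sits between map and reduceByKey.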

Source: the org.apache.spark.scheduler package

The stage partitioning algorithm is implemented by the submitStage and getMissingParentStages methods of DAGScheduler

Step one: Take the last RDD, the one that triggered the job, and pass it to the newStage method to create finalStage

    var finalStage: Stage = null

    // Create a Stage object and register the stage with the DAGScheduler
    finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)

Step two: Create the job from finalStage. The last stage of the job is, naturally, finalStage

    val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)

Step three: Add the job to the in-memory cache and post a job-start event to the listener bus

    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.resultOfJob = Some(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))

Step four: Submit finalStage with the submitStage method

    submitStage(finalStage)

submitStage calls the getMissingParentStages method to get the parent stages of the current stage that are not yet available

    val missing = getMissingParentStages(stage).sortBy(_.id)

Inside getMissingParentStages, the last RDD of the stage is first pushed onto a stack

    waitingForVisit.push(stage.rdd)

A while loop then pops each RDD and passes it to the locally defined visit() method

    while (!waitingForVisit.isEmpty) {
      visit(waitingForVisit.pop())
    }

Within the visit() method, traverse the dependencies of the RDD

    for (dep <- rdd.dependencies)

If the dependency is narrow, push the RDD it depends on onto the stack

    case narrowDep: NarrowDependency[_] =>
      waitingForVisit.push(narrowDep.rdd)

If the dependency is wide (a shuffle dependency), a new stage is created from the RDD it depends on, with isShuffleMap set to true. By default the last stage is not a shuffle map stage; every stage other than finalStage is a shuffle map stage.

    case shufDep: ShuffleDependency[_, _, _] =>
      val mapStage = getShuffleMapStage(shufDep, stage.jobId)

Once the missing parent stages are known, submitStage either runs the stage or defers it

    if (missing == Nil) {
      // No parent stage still needs to run, so submit this stage's tasks
      logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
      submitMissingTasks(stage, jobId.get)
    } else {
      // Recursively call submitStage to submit each missing parent stage
      for (parent <- missing) {
        submitStage(parent)
      }
      // and put the current stage in the queue of stages waiting to run
      waitingStages += stage
    }
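
The recursive "parents first" behaviour can be sketched on a toy model. This is an illustration only: ToyStage and runOrder are hypothetical names, and the sketch assumes that a stage finishes the moment its tasks are submitted, which the real scheduler does not do (it reacts to task completion events instead).

```scala
import scala.collection.mutable

object SubmitOrderSketch {
  // Hypothetical toy stage: an id plus the parent stages it depends on.
  final case class ToyStage(id: Int, parents: List[ToyStage] = Nil)

  // Order in which stages would run, assuming each stage completes
  // immediately once its tasks are submitted (a simplifying assumption).
  def runOrder(finalStage: ToyStage): List[Int] = {
    val finished = mutable.LinkedHashSet.empty[Int]
    val waiting = mutable.LinkedHashSet.empty[ToyStage]

    def submit(stage: ToyStage): Unit = {
      if (!finished(stage.id) && !waiting(stage)) {
        val missing = stage.parents.filterNot(p => finished(p.id)).sortBy(_.id)
        if (missing.isEmpty) {
          finished += stage.id        // stands in for submitMissingTasks
        } else {
          missing.foreach(submit)     // submit the parents first
          waiting += stage            // park the current stage
        }
      }
    }

    submit(finalStage)
    // Stand-in for the scheduler resubmitting waiting stages after parents complete.
    while (waiting.nonEmpty) {
      val ready = waiting.toList
      waiting.clear()
      ready.foreach(submit)
    }
    finished.toList
  }

  def main(args: Array[String]): Unit = {
    val s0 = ToyStage(0)
    val s1 = ToyStage(1)
    val s2 = ToyStage(2, List(s0, s1))
    val s3 = ToyStage(3, List(s2))
    println(runOrder(s3))
  }
}
```

Submitting the final stage s3 first reaches s0 and s1, which have no parents and run immediately; s2 and s3 are parked in the waiting set and run only after their parents have finished.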

* Method for submitting a stage

    private def submitStage(stage: Stage) {
      val jobId = activeJobForStage(stage)
      if (jobId.isDefined) {
        logDebug("submitStage(" + stage + ")")
        if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
          // Call getMissingParentStages to get the missing parent stages of the current stage
          val missing = getMissingParentStages(stage).sortBy(_.id)
          logDebug("missing: " + missing)
          if (missing == Nil) {
            logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
            submitMissingTasks(stage, jobId.get)
          } else {
            for (parent <- missing) {
              submitStage(parent)
            }
            waitingStages += stage
          }
        }
      } else {
        abortStage(stage, "No active job for stage " + stage.id)
      }
    }

* Method for getting the missing parent stages of a stage

    private def getMissingParentStages(stage: Stage): List[Stage] = {
      val missing = new HashSet[Stage]
      val visited = new HashSet[RDD[_]]
      // We are manually maintaining a stack here to prevent StackOverflowError
      // caused by recursively visiting
      val waitingForVisit = new Stack[RDD[_]]
      def visit(rdd: RDD[_]) {
        if (!visited(rdd)) {
          visited += rdd
          if (getCacheLocs(rdd).contains(Nil)) {
            // Traverse the parent dependencies of the RDD
            for (dep <- rdd.dependencies) {
              dep match {
                case shufDep: ShuffleDependency[_, _, _] =>
                  val mapStage = getShuffleMapStage(shufDep, stage.jobId)
                  if (!mapStage.isAvailable) {
                    missing += mapStage
                  }
                case narrowDep: NarrowDependency[_] =>
                  waitingForVisit.push(narrowDep.rdd)
              }
            }
          }
        }
      }
      // First, push the last RDD of the stage onto the stack
      waitingForVisit.push(stage.rdd)
      // Then loop, calling the locally defined visit() method
      while (!waitingForVisit.isEmpty) {
        visit(waitingForVisit.pop())
      }
      missing.toList
    }
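
The comment about maintaining a stack manually can be made concrete with a small sketch. Everything here is an assumption made for illustration: ToyRdd, shuffleBoundaries and deepChain are hypothetical names, and the sketch leaves out the cache-location and isAvailable checks of the real method. It shows that an explicit stack walks a very long chain of narrow dependencies without consuming call-stack depth, stopping at each shuffle boundary.

```scala
import scala.collection.mutable

object MissingParentsSketch {
  // Hypothetical toy lineage node. A plain class is used on purpose so that
  // equality and hashing do not recurse through a long parent chain.
  final class ToyRdd(val id: Int,
                     val narrowParents: List[ToyRdd],
                     val shuffleParents: List[ToyRdd])

  // Ids of the RDDs that sit just across a shuffle boundary from `last`.
  def shuffleBoundaries(last: ToyRdd): Set[Int] = {
    val found = mutable.Set.empty[Int]
    val visited = mutable.Set.empty[Int]
    var waitingForVisit = List(last)            // explicit stack instead of recursion
    while (waitingForVisit.nonEmpty) {
      val rdd = waitingForVisit.head
      waitingForVisit = waitingForVisit.tail
      if (visited.add(rdd.id)) {
        // Shuffle dependency: record the boundary and do not walk past it
        rdd.shuffleParents.foreach(p => found += p.id)
        // Narrow dependency: same stage, so keep walking
        waitingForVisit = rdd.narrowParents ::: waitingForVisit
      }
    }
    found.toSet
  }

  // Builds a chain: node 0 feeds node 1 through a shuffle, then nodes
  // 2..depth each depend narrowly on the previous node.
  def deepChain(depth: Int): ToyRdd = {
    val source = new ToyRdd(0, Nil, Nil)
    var current = new ToyRdd(1, Nil, List(source))
    for (i <- 2 to depth) current = new ToyRdd(i, List(current), Nil)
    current
  }

  def main(args: Array[String]): Unit = {
    // 100000 narrow hops are handled with constant call-stack depth.
    println(shuffleBoundaries(deepChain(100000)))
  }
}
```

A directly recursive visit over the same chain would need one call frame per narrow dependency, which is the situation the comment in getMissingParentStages is guarding against.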

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
