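Summary of the stage partitioning algorithm
The last RDD of the job, the one that triggered it, is used to create the final stage (finalStage)
Starting from finalStage, the scheduler works backwards through the RDD dependencies
Every wide (shuffle) dependency marks the boundary of a new stage
Stages are submitted recursively, parent stages first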
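To illustrate the summary above, here is a minimal sketch of cutting a lineage into stages at wide dependencies. It is a toy model, not Spark code: Node, Narrow, Wide, stageOf and divide are hypothetical names invented for this example, and it ignores everything else the real scheduler keeps track of.

```scala
object StageDivisionSketch {
  // Hypothetical toy model of an RDD lineage; these are not Spark classes.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Wide(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Walk back from `last` through narrow dependencies only. Returns the names
  // of the nodes in the same stage, plus the parents reached through wide
  // dependencies, which is where other stages begin.
  def stageOf(last: Node): (Set[String], List[Node]) = {
    var members = Set.empty[String]
    var boundaries = List.empty[Node]
    var stack = List(last)
    while (stack.nonEmpty) {
      val node = stack.head
      stack = stack.tail
      if (!members(node.name)) {
        members += node.name
        node.deps.foreach {
          case Narrow(p) => stack = p :: stack
          case Wide(p)   => boundaries = p :: boundaries
        }
      }
    }
    (members, boundaries)
  }

  // Divide the whole lineage into stages, starting from the final node.
  // Each new stage is prepended, so the final stage ends up last in the result.
  def divide(finalNode: Node): List[Set[String]] = {
    var stages = List.empty[Set[String]]
    var pending = List(finalNode)
    var seen = Set.empty[String]
    while (pending.nonEmpty) {
      val start = pending.head
      pending = pending.tail
      if (!seen(start.name)) {
        seen += start.name
        val (members, boundaries) = stageOf(start)
        stages = members :: stages
        pending = boundaries ::: pending
      }
    }
    stages
  }
}
```

For a lineage textFile -> map -> reduceByKey -> mapValues, where only reduceByKey depends on its parent through a wide dependency, divide returns two stages: one holding textFile and map, and a final one holding reduceByKey and mapValues.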
Source: the DAGScheduler class in the org.apache.spark.scheduler package
The stage partitioning algorithm is implemented by the submitStage and getMissingParentStages methods
// Step one: use the last RDD, the one that triggered the job, to create finalStage by passing it to the newStage method
var finalStage: Stage = null
// Create a Stage object and register the stage with the DAGScheduler
finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
// Step two: create a job from finalStage; the last stage of the job is, of course, finalStage
val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
// Step three: add the job to the in-memory caches
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.resultOfJob = Some(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
  SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
// Step four: try to submit finalStage with the submitStage method
submitStage(finalStage)
// Inside submitStage: call getMissingParentStages to get the missing parent stages of the current stage
val missing = getMissingParentStages(stage).sortBy(_.id)
// Inside getMissingParentStages: first, push the last RDD of the stage onto the stack
waitingForVisit.push(stage.rdd)
// Then loop, calling the internally defined visit() method on each RDD popped from the stack
while (!waitingForVisit.isEmpty) {
  visit(waitingForVisit.pop())
}
// Within the visit() method, iterate over the dependencies of the RDD
for (dep <- rdd.dependencies)
// If it is a narrow dependency, push the RDD it depends on onto the stack
case narrowDep: NarrowDependency[_] =>
  waitingForVisit.push(narrowDep.rdd)
// If it is a wide dependency, a new stage is created for the RDD it depends on, with isShuffleMap set to true
// (by default the last stage is not a shuffle map stage;
// every stage except finalStage is a shuffle map stage)
case shufDep: ShuffleDependency[_, _, _] =>
  val mapStage = getShuffleMapStage(shufDep, stage.jobId)
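The traversal described above can be sketched without Spark. ToyRdd, NarrowDep, ShuffleDep and the mapOutputAvailable flag are hypothetical stand-ins invented for this example (the flag plays the role of mapStage.isAvailable), and the cache-location check of the real method is left out.

```scala
import scala.collection.mutable

object MissingParentsSketch {
  // Hypothetical toy types; these are not Spark classes.
  sealed trait Dep { def rdd: ToyRdd }
  final case class NarrowDep(rdd: ToyRdd) extends Dep
  final case class ShuffleDep(rdd: ToyRdd, mapOutputAvailable: Boolean) extends Dep
  final case class ToyRdd(id: Int, dependencies: List[Dep] = Nil)

  // Returns the ids of the parent RDDs behind shuffle dependencies whose
  // output is not available yet, i.e. where a missing parent stage ends.
  def missingParents(last: ToyRdd): List[Int] = {
    val missing = mutable.LinkedHashSet.empty[Int]
    val visited = mutable.HashSet.empty[Int]
    val waitingForVisit = mutable.Stack.empty[ToyRdd]

    def visit(rdd: ToyRdd): Unit = {
      if (!visited(rdd.id)) {
        visited += rdd.id
        rdd.dependencies.foreach {
          case ShuffleDep(parent, available) =>
            // Wide dependency: stage boundary, do not walk past it.
            if (!available) missing += parent.id
          case NarrowDep(parent) =>
            // Narrow dependency: same stage, keep walking backwards.
            waitingForVisit.push(parent)
        }
      }
    }

    waitingForVisit.push(last)
    while (waitingForVisit.nonEmpty) {
      visit(waitingForVisit.pop())
    }
    missing.toList.sorted
  }
}
```

Only parents behind a shuffle dependency can become missing stages; parents behind narrow dependencies are simply pushed onto the stack and visited as part of the same stage.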
// Back in submitStage: check whether any parent stage is missing
if (missing == Nil) {
  // No missing parent stage: submit the tasks of the current stage
  logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
  submitMissingTasks(stage, jobId.get)
} else {
  // Otherwise, recursively call submitStage to submit each missing parent stage
  for (parent <- missing) {
    submitStage(parent)
  }
  // And put the current stage into the set of stages waiting to run
  waitingStages += stage
}
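The recursive submission above can be tried out on a toy model. ToyStage and ToyScheduler are hypothetical names invented for this sketch; "running" here simply means that the stage's tasks would have been handed over for execution, and nothing is actually executed.

```scala
import scala.collection.mutable

object SubmitStageSketch {
  // Hypothetical toy stage: an id plus its direct parent stages.
  final case class ToyStage(id: Int, parents: List[ToyStage] = Nil)

  final class ToyScheduler {
    // Ids of stages whose output is complete. This sketch never fills it
    // itself; a caller can add ids to simulate finished stages.
    val finished = mutable.HashSet.empty[Int]
    // LinkedHashSet keeps insertion order, so the submit order is visible.
    val running = mutable.LinkedHashSet.empty[Int]
    val waiting = mutable.LinkedHashSet.empty[Int]

    def submitStage(stage: ToyStage): Unit = {
      if (!waiting(stage.id) && !running(stage.id)) {
        val missing = stage.parents.filterNot(p => finished(p.id)).sortBy(_.id)
        if (missing.isEmpty) {
          // No missing parents: this stage can run now.
          running += stage.id
        } else {
          // Submit the parents first, then park this stage.
          missing.foreach(submitStage)
          waiting += stage.id
        }
      }
    }
  }
}
```

With a chain stage 0 <- stage 1 <- stage 2, submitting stage 2 ends with stage 0 running and stages 1 and 2 waiting, which matches the parent-first order described in the text.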
/*
 * Method for submitting a stage
 */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Call getMissingParentStages to get the missing parent stages of the current stage
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing == Nil) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id)
  }
}
/*
 * Method for getting the missing parent stages of a stage
 */
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      if (getCacheLocs(rdd).contains(Nil)) {
        for (dep <- rdd.dependencies) {
          // Iterate over the parent dependencies of the RDD
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.jobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  // First, push the last RDD of the stage onto the stack
  waitingForVisit.push(stage.rdd)
  // Then loop, calling the internally defined visit() method
  while (!waitingForVisit.isEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
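The comment about maintaining the stack manually is worth a small experiment. The sketch below uses a hypothetical ChainNode type (not a Spark class) to build a long chain of narrow dependencies and walks it once with an explicit stack and once with plain recursion. How deep the recursive version can go depends on the JVM stack size, so the depth at which it fails is not stated here.

```scala
object ExplicitStackSketch {
  // Hypothetical toy node with at most one parent; not a Spark class.
  final class ChainNode(val id: Int, val parent: Option[ChainNode])

  // Build a chain of n nodes: 0 <- 1 <- ... <- n-1, returning the last node.
  def chain(n: Int): ChainNode = {
    var node = new ChainNode(0, None)
    for (i <- 1 until n) node = new ChainNode(i, Some(node))
    node
  }

  // Explicit stack: the pending work lives on the heap, so the depth of the
  // lineage does not grow the call stack.
  def countIterative(last: ChainNode): Int = {
    var count = 0
    var stack = List(last)
    while (stack.nonEmpty) {
      val node = stack.head
      stack = stack.tail
      count += 1
      node.parent.foreach(p => stack = p :: stack)
    }
    count
  }

  // Plain recursion: one call frame per node, so a long enough chain may
  // throw StackOverflowError, depending on the JVM stack size.
  def countRecursive(node: ChainNode): Int =
    1 + node.parent.map(countRecursive).getOrElse(0)
}
```

Calling countIterative(chain(200000)) visits every node in a loop, while countRecursive on the same chain needs 200000 nested calls.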