This article analyzes the main layers of the Spark runtime, combining the runtime components with the execution process.
DAGScheduler
A job consists of multiple stages, and a stage consists of multiple tasks of the same kind. Tasks divide into ShuffleMapTask and ResultTask; dependencies divide into ShuffleDependency and NarrowDependency.
Jobs are split into stages at wide (shuffle) dependencies.
DAGScheduler maintains waiting jobs and active jobs, maintains waiting stages, active stages, and failed stages, and maps their relationships to jobs.
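For example, a minimal job sketch (assuming an existing `SparkContext` named `sc`) showing where the stage boundary falls:

```scala
// Two stages result: the map side of reduceByKey runs as ShuffleMapTasks,
// the result side as ResultTasks, split at the ShuffleDependency.
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(word => (word, 1))   // narrow dependency: stays in the first stage
  .reduceByKey(_ + _)       // wide (shuffle) dependency: starts a new stage
  .collect()                // action: submits one job with two stages
```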
Main functions
- Acts as the primary entry point for job submission, via `submitJob(rdd, ...)` or `runJob(rdd, ...)`. Both methods are called from `SparkContext`.
- Generates a stage and submits it, first checking whether the stage has any unfinished parent stage; if so, the parent stages are submitted and this stage waits, and so on recursively. The result: DAGScheduler ends up with some waiting stages and one running stage.
- After the running stage is submitted, the type of task in the stage is determined and a set of task descriptions, a `TaskSet`, is generated.
- Calls `TaskScheduler.submitTasks(taskSet, ...)` to hand the task descriptions to the TaskScheduler. TaskScheduler allocates resources to this TaskSet and triggers execution, depending on the amount of resources available and the trigger conditions.
- After the job is submitted, `DAGScheduler` asynchronously returns a `JobWaiter` object, which can report the job's running state, cancel the job, and process and return the result after successful execution.
- Handles `TaskCompletionEvent`s (see the sketch after this list):
  - If the task executed successfully, it is subtracted from the corresponding stage and some counting work is done:
    - If the task is a ResultTask, the `Accumulator` counter is incremented, the task is marked finished in the job, and the job's total finished count is incremented. If the number of finished tasks equals the number of partitions, the stage is complete: it is marked finished, removed from the running stages, and some cleanup work is done on it.
    - If the task is a ShuffleMapTask, the `Accumulator` counter is incremented and an output location, a `MapStatus`, is added to the stage. `MapStatus` is returned when a `ShuffleMapTask` completes and contains location information and block sizes (optionally compressed or uncompressed). Stage completion is also checked; on completion, the stage's shuffle ID and location information are registered with `MapOutputTracker`. The stage's output locations are then checked for nulls: a null means some task failed, so the entire stage is resubmitted; otherwise, the next stage is taken from the waiting stages and submitted.
  - If the task was resubmitted, it is added back to the corresponding stage.
  - If the task hit a fetch failure, the corresponding stage is immediately marked complete and removed from the running stages. If no further retries are allowed, the entire stage is aborted; otherwise the entire stage is resubmitted. In addition, the location and map task information related to this fetch is removed from the stage and deregistered from `MapOutputTracker`. Finally, if the `BlockManagerId` of this fetch is not empty, an `ExecutorLost` is performed once so that the next shuffle executes on another executor.
  - Other task states, such as exceptions, TaskResultLost, and CommitDenied, are handled by `TaskScheduler`.
- Other job-related operations include: cancel job, cancel stage, resubmit a failed stage, and so on.
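A condensed, self-contained sketch of the completion handling above; the names mirror Spark's, but this is illustrative, not the real `DAGScheduler` code:

```scala
// Illustrative model of TaskCompletionEvent dispatch, not Spark's actual code.
sealed trait CompletedTask
final case class ResultTaskDone(partition: Int) extends CompletedTask
final case class ShuffleMapTaskDone(partition: Int, location: String) extends CompletedTask

final class StageState(numPartitions: Int) {
  private var finished = 0
  // One MapStatus-like output location per partition; None means "lost".
  private val outputLocs = Array.fill[Option[String]](numPartitions)(None)

  def onCompletion(event: CompletedTask): Unit = event match {
    case ResultTaskDone(_) =>
      finished += 1                      // accumulator / finished-count bookkeeping
      if (finished == numPartitions) {
        // mark the stage finished, remove it from running stages, clean up
      }
    case ShuffleMapTaskDone(p, loc) =>
      outputLocs(p) = Some(loc)          // record the MapStatus-like location
      finished += 1
      if (finished == numPartitions) {
        if (outputLocs.contains(None)) {
          // some map output was lost: resubmit the entire stage
        } else {
          // register shuffle id + locations with MapOutputTracker,
          // then submit the next stage from the waiting stages
        }
      }
  }
}
```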
Other Functions
1. cacheLocs and preferred locations
```scala
private val cacheLocs = new HashMap[Int, Array[Seq[TaskLocation]]]
```
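The lookup order for a task's preferred locations is roughly: cached locations first, then the RDD's own preferred locations, then a recursive walk up narrow dependencies. A simplified sketch under those assumptions (the `SimpleRdd` trait and `narrowParents` are hypothetical stand-ins for DAGScheduler's internals):

```scala
import scala.collection.mutable.HashMap

// Hypothetical, simplified stand-ins; not DAGScheduler's real types.
final case class TaskLocation(host: String)
trait SimpleRdd {
  def id: Int
  def preferredLocations(partition: Int): Seq[TaskLocation]
  def narrowParents: Seq[SimpleRdd]
}

val cacheLocs = new HashMap[Int, Array[Seq[TaskLocation]]]

def getPreferredLocs(rdd: SimpleRdd, partition: Int): Seq[TaskLocation] = {
  val cached = cacheLocs.get(rdd.id).map(_(partition)).getOrElse(Nil)
  if (cached.nonEmpty) cached                  // 1. partition already cached
  else {
    val own = rdd.preferredLocations(partition)
    if (own.nonEmpty) own                      // 2. e.g. HDFS block hosts
    else rdd.narrowParents.iterator            // 3. walk up narrow deps
      .map(getPreferredLocs(_, partition))     //    (simplified: real code maps
      .find(_.nonEmpty)                        //    partition indices through
      .getOrElse(Nil)                          //    the dependency)
  }
}
```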
TaskScheduler
Maintains the correspondence between tasks and executors and between executors and physical resources, for both queued and running tasks.
Maintains a task queue internally, scheduling tasks according to FIFO or fair policies.
`TaskScheduler` itself is an interface, and Spark implements only one, `TaskSchedulerImpl`; in theory, task scheduling can be customized. Here are the main methods of the `TaskScheduler` interface:
```scala
def start(): Unit
def postStartHook() { }
def stop(): Unit
def submitTasks(taskSet: TaskSet): Unit
def cancelTasks(stageId: Int, interruptThread: Boolean)
def setDAGScheduler(dagScheduler: DAGScheduler): Unit
def executorHeartbeatReceived(
    execId: String,
    taskMetrics: Array[(Long, TaskMetrics)],
    blockManagerId: BlockManagerId): Boolean
```
Main functions
- `submitTasks(taskSet)`: receives the tasks submitted by `DAGScheduler` (a sketch follows this item).
  - Creates a `TaskSetManager` for the tasks and adds it to the task queue. `TaskSetManager` tracks the execution status of each task and maintains a lot of task-specific information.
  - Triggers a round of resource offers.
    - First, `TaskScheduler` matches the available resources against the task queue at hand (by priority, locality, etc.) and assigns eligible executors to the `TaskSetManager`.
    - Then the resulting task descriptions are handed to `SchedulerBackend`, whose `launchTasks(tasks)` is called to trigger execution of the tasks on the executors. Each task description is serialized and sent to an executor; the executor extracts the task information and calls the task's `run()` method to perform the computation.
- `cancelTasks(stageId)`: cancels the tasks of a stage by calling `SchedulerBackend`'s `killTask(taskId, executorId, ...)` method. The task IDs and executor IDs are already maintained inside `TaskScheduler`.
-
Resourceoffer (offers:seq[workers])
, which is a very important method, the caller is schedulerbacnend
, and uses the underlying resource Schedulerbackend
assigns the spare workers resource to TaskScheduler
to allocate a reasonable amount of CPU and memory resources to the queued task according to the scheduling policy. The task description list is then passed back to schedulerbackend
  - From the worker offers, collects the executor-to-host correspondence, active executors, rack information, and so on.
  - The worker offer list is shuffled randomly, and the task sets in the task queue are sorted according to the scheduling policy.
  - Iterates through each task set in the priority order of process-local, node-local, and rack-local (and finally any), offering it the available CPU cores to see whether tasks can be launched.
  - By default, one task requires one CPU core; this is set by the parameter `spark.task.cpus=1`.
  - The logic that verifies whether resources can be allocated to a task set ends up in `TaskSetManager`'s `resourceOffer(execId, host, maxLocality)` method. If it is satisfied, the final task description is generated and `DAGScheduler`'s `taskStarted(task, info)` method is called to notify `DAGScheduler`; this triggers `DAGScheduler` to run its submit-missing-stage logic each time, so that once a stage's tasks have been assigned resources they are immediately submitted for execution.
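A condensed sketch of the offer-matching loop described above (simplified types and locality handling; not the real `resourceOffers` implementation):

```scala
import scala.util.Random

// Simplified stand-ins for WorkerOffer / TaskDescription / TaskSetManager.
case class WorkerOffer(executorId: String, host: String, var cores: Int)
case class TaskDescription(taskId: Long, executorId: String)

// One task set's view: try to place one task on (execId, host) at a locality level.
trait OfferableTaskSet {
  def resourceOffer(execId: String, host: String, locality: String): Option[TaskDescription]
}

val cpusPerTask = 1 // "spark.task.cpus" defaults to 1

def resourceOffers(offers: Seq[WorkerOffer],
                   sortedTaskSets: Seq[OfferableTaskSet]): Seq[TaskDescription] = {
  val shuffled = Random.shuffle(offers)         // spread load across workers
  val launched = Seq.newBuilder[TaskDescription]
  for (ts <- sortedTaskSets;                    // queue sorted by FIFO/FAIR policy
       locality <- Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY");
       offer <- shuffled if offer.cores >= cpusPerTask) {
    ts.resourceOffer(offer.executorId, offer.host, locality).foreach { task =>
      launched += task
      offer.cores -= cpusPerTask                // account for the consumed CPU
    }
  }
  launched.result()                             // handed back to SchedulerBackend
}
```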
- `statusUpdate(taskId, taskState, data)`: another very important method. The caller is `SchedulerBackend`, and its purpose is to let `SchedulerBackend` report task execution status to `TaskScheduler`, which makes decisions accordingly (see the sketch after this item):
  - If the state is TaskLost, the task's executor is found and removed from the active executors, preventing this executor from being assigned other tasks that would keep failing.
  - A finished task has one of four states: finished, killed, failed, or lost. Only finished means successful completion; the other three are failures.
  - If the task finished successfully, `TaskResultGetter.enqueueSuccessfulTask(taskSet, tid, data)` is called; otherwise `TaskResultGetter.enqueueFailedTask(taskSet, tid, state, data)` is called. `TaskResultGetter` internally maintains a thread pool responsible for asynchronously fetching task results and deserializing them. By default it uses four threads, set by the parameter `spark.resultGetter.threads=4`.
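A minimal sketch of this dispatch; the state names mirror `TaskState`, while `ResultGetter` is a stand-in interface for `TaskResultGetter`:

```scala
// Simplified model of TaskScheduler.statusUpdate's decision making.
object TaskState extends Enumeration {
  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value
}

trait ResultGetter {
  def enqueueSuccessfulTask(tid: Long, data: Array[Byte]): Unit
  def enqueueFailedTask(tid: Long, state: TaskState.Value, data: Array[Byte]): Unit
}

def statusUpdate(tid: Long, state: TaskState.Value, data: Array[Byte],
                 getter: ResultGetter): Unit = {
  if (state == TaskState.LOST) {
    // remove the task's executor from the active set so it gets no more tasks
  }
  state match {
    case TaskState.FINISHED => getter.enqueueSuccessfulTask(tid, data) // async fetch
    case TaskState.FAILED | TaskState.KILLED | TaskState.LOST =>
      getter.enqueueFailedTask(tid, state, data)                       // parse reason
    case _ => () // still launching/running: nothing to decide yet
  }
}
```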
`TaskResultGetter`'s logic for handling task results (sketched below):
- For a successful task: if the `TaskResult` carries the result data directly, it is deserialized immediately to obtain the result; if not, `blockManager.getRemoteBytes(blockId)` is called to fetch it remotely. If the remotely fetched data is empty, `TaskScheduler.handleFailedTask` is called to report that the task completed but its data was lost. Otherwise, once the data is retrieved, `BlockManagerMaster` is notified to remove the block, and `TaskScheduler.handleSuccessfulTask` is called to report that the task succeeded and to pass the result data back.
- For a failed task: the failure reason is parsed from the data, and `TaskScheduler.handleFailedTask` is called to report that the task failed and why.
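A hedged sketch of the direct-versus-remote branch for successful tasks; `DirectResult` and `IndirectResult` are simplified stand-ins for Spark's result types:

```scala
// Simplified stand-ins for DirectTaskResult / IndirectTaskResult.
sealed trait TaskResult
final case class DirectResult(bytes: Array[Byte]) extends TaskResult
final case class IndirectResult(blockId: String) extends TaskResult

trait SimpleBlockManager { def getRemoteBytes(blockId: String): Option[Array[Byte]] }

def handleSuccess(result: TaskResult, blockManager: SimpleBlockManager): Unit =
  result match {
    case DirectResult(bytes) =>
      () // small result shipped inline: just deserialize `bytes`
    case IndirectResult(blockId) =>
      blockManager.getRemoteBytes(blockId) match {
        case None =>
          () // remote fetch came back empty: handleFailedTask("result lost")
        case Some(bytes) =>
          () // remove the block via BlockManagerMaster, then handleSuccessfulTask
      }
  }
```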
SchedulerBackend
Sits below `TaskScheduler` and is used to interface with different resource management systems. `SchedulerBackend` is an interface; the main methods that need to be implemented are as follows:
```scala
def start(): Unit
def stop(): Unit
// Important method: SchedulerBackend hands its available resources to TaskScheduler,
// which allocates them to the queued tasks according to the scheduling policy and
// returns a batch of runnable task descriptions. SchedulerBackend then performs
// launchTask, i.e. finally pushes each task onto the executor model, where the
// executor's thread pool executes the task's run().
def reviveOffers(): Unit
def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
  throw new UnsupportedOperationException
```
Coarse-grained: the resident-process pattern, typically represented by standalone mode, Mesos coarse-grained mode, and YARN.
Fine-grained: Mesos fine-grained mode.
For easier understanding, only the coarse-grained pattern is discussed here: `CoarseGrainedSchedulerBackend`.
Maintains executor-related information (including executor address, communication port, host, total number of cores, number of remaining cores): how many executors are registered, how many remain, how many cores are free, and so on.
Main functions
- The driver side primarily listens for and handles the following events through its actor:
  - `RegisterExecutor(executorId, hostPort, cores, logUrls)`. This is where executors come from; usually a worker starting up or restarting triggers executor registration. `CoarseGrainedSchedulerBackend` records these executors and updates its internal resource information, such as increasing the total number of cores. It then calls `makeOffers()`, which hands the resources at hand to `TaskScheduler` for one round of allocation, gets task descriptions back, and launches those tasks. This `makeOffers()` call appears in every event related to resource changes, as shown below.
  - `StatusUpdate(executorId, taskId, state, data)`. The state callback for a task. First, the status is escalated by calling `TaskScheduler.statusUpdate`. Then, if the task is finished, the freed cores are returned to the executor and `makeOffers()` is called once.
  - `ReviveOffers`. This event is sent directly by others to the `SchedulerBackend` to request a `makeOffers()` round.
  - `KillTask(taskId, executorId, interruptThread)`. This event is sent on to the executor's actor, and the executor handles the `KillTask` event.
  - `StopExecutors`. Notifies each executor to handle the `StopExecutor` event.
  - `RemoveExecutor(executorId, reason)`. Removes this executor and the resources it held from the bookkeeping, then calls the `TaskScheduler.executorLost()` method to notify the upper layer that a group of resources can no longer be used. `TaskScheduler` escalates the `executorLost` event further to `DAGScheduler`, because `DAGScheduler` cares about the output locations of shuffle tasks. `DAGScheduler` tells `BlockManager` that this executor is unusable and removes it, then traverses all the stages' shuffle output information, removes this executor from it, and registers the updated shuffle output information with `MapOutputTracker`. Finally, it cleans up the local `cacheLocs` map.
- The implementation of the `reviveOffers()` method: it directly calls the `makeOffers()` method, obtains a batch of executable task descriptions, and calls `launchTasks`.
- The `launchTasks(tasks: Seq[Seq[TaskDescription]])` method (see the sketch below):
  - Iterates through each task description, serializes it to binary, and sends the task message to the corresponding executor.
  - If the binary payload is too large, over roughly 9.2 MB (the default Akka frame size of 10 MB minus what Akka reserves, 200 KB by default), it raises an error, aborts the entire task set, and prints a reminder to increase the Akka frame size.
  - If the binary data size is acceptable, it is sent to the executor's actor, which handles the `LaunchTask(serializedTask)` event.
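A minimal sketch of the size check described above; the frame-size numbers follow the text, and the serializer and send function are placeholder parameters:

```scala
import java.nio.ByteBuffer

// Simplified stand-in for Spark's TaskDescription.
case class TaskDescription(taskId: Long, executorId: String)

val akkaFrameSize = 10 * 1024 * 1024 // default 10 MB frame
val reserved      = 200 * 1024       // headroom Akka keeps for itself

def launchTasks(tasks: Seq[TaskDescription],
                serialize: TaskDescription => ByteBuffer,
                sendToExecutor: (String, ByteBuffer) => Unit): Unit =
  for (task <- tasks) {
    val ser = serialize(task)
    if (ser.limit() >= akkaFrameSize - reserved) {
      // abort the whole task set and log a hint to raise the Akka frame size
    } else {
      sendToExecutor(task.executorId, ser) // executor actor gets LaunchTask(ser)
    }
  }
```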
Executor
The executor is Spark's process model; it can run on different resource management systems and works together with `SchedulerBackend`. Inside it there is a thread pool, a running-tasks map, and an actor that receives the events sent by `SchedulerBackend` mentioned above.
Event handling
- `launchTask`. Based on the task description, generates a `TaskRunner`, puts it into the running-tasks map, and executes the `TaskRunner` with the thread pool.
- `killTask`. Takes the `TaskRunner` object from the running-tasks map and calls its kill method (both handlers are sketched below).
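A simplified sketch of these two handlers; `TaskRunner` here is a bare `Runnable` stand-in for Spark's class:

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors}

// Bare-bones stand-in for Spark's TaskRunner: deserialize, run, report status.
class TaskRunner(val taskId: Long, body: () => Unit) extends Runnable {
  @volatile private var killed = false
  def kill(): Unit = killed = true // real code also interrupts the task thread
  override def run(): Unit = if (!killed) body()
}

val threadPool   = Executors.newCachedThreadPool()
val runningTasks = new ConcurrentHashMap[Long, TaskRunner]()

def launchTask(taskId: Long, body: () => Unit): Unit = {
  val runner = new TaskRunner(taskId, body)
  runningTasks.put(taskId, runner) // track it in the running-tasks map
  threadPool.execute(runner)       // the pool actually runs the task's run()
}

def killTask(taskId: Long): Unit =
  Option(runningTasks.get(taskId)).foreach(_.kill())
```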
That completes the full text :)
Spark Core Runtime Analysis: DAGScheduler, TaskScheduler, SchedulerBackend