Several basic concepts:
(1) Job: a parallel computation consisting of multiple tasks, spawned by an action (e.g., collect or save).
(2) Stage: the scheduling unit of a job; each job is divided into one or more stages.
(3) Task: a unit of work that is sent to an executor.
(4) TaskSet: a set of tasks that are related to each other and have no shuffle dependencies among them.
An application is comprised of a driver program and multiple jobs; a job is made up of multiple stages; and a stage consists of several tasks that have no shuffle dependency between them, as illustrated in the sketch below.
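A minimal Scala sketch of where these concepts appear in user code (the input path and the local master URL are placeholders): the collect() action spawns a job, the reduceByKey shuffle splits the job into two stages, and each stage runs one task per partition.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
        val sc   = new SparkContext(conf)

        val pairs = sc.textFile("input.txt")      // transformations only: no job yet
                      .flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
        val counts = pairs.reduceByKey(_ + _)     // shuffle dependency: stage boundary

        counts.collect().foreach(println)         // action: spawns one job with two stages;
                                                  // each stage runs one task per partition
        sc.stop()
      }
    }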
How a Spark application runs:
(1) In simple terms:
The driver requests resources from the cluster; the cluster allocates the resources and starts the executors. The driver then sends the Spark application's code and files to the executors. Tasks run on the executors and, once finished, either return their results to the driver or write them to the outside world.
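A hedged sketch of that flow from the application's side (the executor count, memory size, core count, and output path are illustrative values, not requirements): the configuration carries the resource request, the reduce action brings a result back to the driver, and saveAsTextFile writes to external storage instead.

    import org.apache.spark.{SparkConf, SparkContext}

    object ResourceDemo {
      def main(args: Array[String]): Unit = {
        // The resource request travels with the application's configuration;
        // the cluster manager uses it to allocate and start executors.
        val conf = new SparkConf()
          .setAppName("ResourceDemo")
          .set("spark.executor.instances", "4")   // ask for 4 executors
          .set("spark.executor.memory", "2g")     // 2 GB per executor
          .set("spark.executor.cores", "2")       // 2 cores per executor
        val sc = new SparkContext(conf)

        val data = sc.parallelize(1L to 1000000L)

        // Tasks run on the executors and the result returns to the driver ...
        val total = data.map(_ * 2).reduce(_ + _)
        println(total)

        // ... or the output is written to the outside world (placeholder path).
        data.map(_ * 2).saveAsTextFile("hdfs:///tmp/spark-demo-output")

        sc.stop()
      }
    }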
(2) In more detail:
The application is submitted and a SparkContext is built. The DAG is constructed and submitted to the scheduler for parsing, where it is parsed into stages. The stages are submitted to the cluster and dispatched by the cluster's task manager, and the cluster starts the Spark executors. The driver sends the code and files to the executors, and the executors perform the operations needed to complete their tasks. A block tracker on the driver records the data blocks that the executors generate on each node. When the tasks finish running, the data is written to HDFS or to another type of storage.
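One way to see the DAG that gets handed to the scheduler is the lineage printed by toDebugString. The fragment below is a sketch; it assumes an existing SparkContext named sc, and the HDFS paths are placeholders. The shuffle introduced by reduceByKey shows up as the boundary between two stages.

    // The lineage that the scheduler parses into stages can be printed with
    // toDebugString; the indentation changes at the shuffle boundary.
    val pairs  = sc.textFile("hdfs:///data/logs")           // placeholder input path
                   .map(line => (line.take(10), 1))
    val counts = pairs.reduceByKey(_ + _)

    println(counts.toDebugString)   // shows a ShuffledRDD on top of the map/read chain,
                                    // i.e. two stages separated by the shuffle

    counts.saveAsTextFile("hdfs:///data/log-counts")        // final write to HDFS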
(3) A more comprehensive view:
The Spark application performs a series of transformations and finally triggers a job through an action. After the job is submitted, the SparkContext builds the DAG from the RDD dependency relationships and submits it to the DAGScheduler for parsing. Parsing works backwards from the final RDD, using shuffle dependencies as stage boundaries, and builds the stages, which themselves have dependencies on one another. In other words, the process is to parse the DAG, divide it into stages, and compute the dependencies between the stages. Each stage's TaskSet is then submitted to the underlying scheduler (in Spark, the TaskScheduler), which generates a TaskSetManager and finally submits the tasks to the executors for computation. The executors compute with multiple threads; when a task finishes, the result is fed back to the TaskSetManager, then to the TaskScheduler, and then back to the DAGScheduler. After all tasks have finished running, the data is written out.
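A small sketch of how the size of each stage's TaskSet relates to partitioning (it assumes an existing SparkContext named sc; the numbers are arbitrary): each stage produces one task per partition, so controlling the partition count controls how many tasks the TaskScheduler hands out.

    // Each stage's TaskSet contains one task per partition, so the partition
    // counts below determine how many tasks the TaskScheduler dispatches.
    val rdd = sc.parallelize(1 to 100, numSlices = 8)            // stage 0: 8 tasks
    val grouped = rdd.map(i => (i % 4, i))
                     .reduceByKey(_ + _, numPartitions = 4)      // stage 1: 4 tasks

    println(grouped.getNumPartitions)   // 4
    grouped.count()                     // the action submits the job to the DAGScheduler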
(4) A more in-depth understanding:
After the application is submitted, an action triggers the job: the SparkContext is built, the DAG is constructed and submitted to the DAGScheduler, the stages are built, and each stage's TaskSet is submitted to the TaskScheduler, which builds a TaskSetManager. The tasks are then submitted to the executors to run. After an executor finishes a task, it reports the completion to the SchedulerBackend, which passes the completion information on to the TaskScheduler. The TaskScheduler notifies the TaskSetManager, which removes the task and proceeds to the next one. At the same time, the TaskScheduler inserts the completed result into a success queue, acknowledges that it was received, and notifies the TaskSetManager that the task was processed successfully. After all tasks are completed, the TaskSetManager feeds the results back to the DAGScheduler. If the task is a ResultTask, the result is handed to the JobListener; if it is not a ResultTask (i.e., it is a ShuffleMapTask), the result is saved.
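These completion events are internal to the scheduler, but the public SparkListener API exposes the same feedback path from the outside. The sketch below (assuming an existing SparkContext named sc) simply prints task, stage, and job completions as they flow back.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd,
      SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Print the completion events as they flow back through the scheduler;
    // assumes an existing SparkContext named sc.
    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println(s"task ${taskEnd.taskInfo.taskId} finished in stage ${taskEnd.stageId}")
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
        println(s"stage ${stageCompleted.stageInfo.stageId} completed")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
    })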
Operating framework for Spark applications