"Spark" spark application execution mechanism


Spark Application Concepts

A Spark application is a program submitted by the user. It can execute in local, Standalone, YARN, or Mesos mode. Depending on whether the driver program runs inside the cluster, a Spark application can run in cluster mode or client mode.
Here are some of the basic concepts involved in a Spark application:

  • Application: a Spark application. After the user submits it, Spark allocates resources to the app and transforms and executes the program. An application contains one driver program and several executors.
  • SparkContext: the entry point of a Spark application, responsible for scheduling the various computing resources and coordinating the executors on each worker node.
  • Driver Program: runs the application's main() function and creates the SparkContext (a minimal sketch follows this list).
  • RDD Graph: the RDD is Spark's core data structure and can be manipulated through a series of operators (mainly transformations and actions). When an RDD encounters an action operator, all the preceding operators are formed into a directed acyclic graph (DAG), which then becomes a job in Spark and is submitted to the cluster for execution. An application can contain multiple jobs.
  • Executor: a process that runs on a worker node on behalf of an application. It runs tasks and keeps data in memory or on disk. Each application requests its own executors to handle its tasks.
  • Worker Node: any node in the cluster that can run application code, running one or more executor processes.
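For concreteness, here is a minimal sketch of a driver program in Scala (the object name SimpleApp and the app name are hypothetical; it assumes Spark's standard Scala API):

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        // The driver runs main() and creates the SparkContext,
        // which requests executors from the cluster for this application.
        val conf = new SparkConf().setAppName("SimpleApp")
        val sc = new SparkContext(conf)

        // Operations on RDDs are shipped to the executors as tasks.
        val total = sc.parallelize(1 to 1000).sum()
        println(s"sum = $total")

        sc.stop() // releases the application's executors
      }
    }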

The following concepts describe the components involved while a Spark application runs:

  • Job: an operation triggered on the RDD graph, usually by a Spark action operator; the job is submitted to Spark through the runJob method of SparkContext.
  • Stage: each job is sliced into stages at the RDD's wide (shuffle) dependencies; each stage contains a set of identical tasks, called a TaskSet.
  • Task: a task corresponds to one partition and executes the operators of its stage on that partition of the RDD; each task is then placed in an executor's thread pool for execution (see the sketch after this list).
  • DAGScheduler: builds a stage-based DAG from the job and submits each stage to the TaskScheduler.
  • TaskScheduler: submits the TaskSet to the worker node cluster for execution and returns the results.
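As a sketch of the task/partition correspondence (assuming an existing SparkContext named sc), each stage runs one task per partition:

    val rdd = sc.parallelize(1 to 100, 4) // 4 partitions
    println(rdd.getNumPartitions)         // 4: a stage over this RDD runs 4 tasks
    val doubled = rdd.map(_ * 2)          // narrow dependency, still 4 partitions
    println(doubled.count())              // action: submits one job of one stage, 4 tasks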

Spark execution mechanism overview

Spark application transformation

An RDD action operator triggers job submission. A job submitted to Spark generates an RDD DAG, which the DAGScheduler converts into a DAG of stages; each stage produces a corresponding set of tasks, and the TaskScheduler distributes the tasks to executors for execution. Each task corresponds to a block of data and processes it with the user-defined function.
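A hedged word-count sketch of this flow (the input path is a placeholder, and sc is assumed to be an existing SparkContext):

    val words = sc.textFile("hdfs:///tmp/input.txt") // placeholder path
      .flatMap(_.split(" "))
      .map(w => (w, 1))                   // narrow transformations: pipelined in one stage
    val counts = words.reduceByKey(_ + _) // wide (shuffle) dependency: a stage boundary
    counts.collect()                      // action: the DAGScheduler submits two stages

Here reduceByKey introduces a shuffle, so the DAGScheduler splits the DAG into two stages at that point.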

The underlying implementation of Spark execution

In Spark's underlying implementation, data is managed through RDDs. An RDD contains a set of data blocks distributed across different nodes. When a Spark application operates on an RDD, the scheduler distributes the tasks containing those operations to the designated machines, and the compute nodes execute the tasks in a multithreaded manner. When an operation completes, one RDD is converted into another RDD, so the user's operations execute in sequence. To keep the system from running out of memory, Spark uses lazy evaluation: only when the accumulated operations reach an action operator is execution of the whole operation sequence triggered, and intermediate results are not allocated separate memory but are pipelined over the same block of data.
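A small sketch of this lazy, pipelined execution (assuming an existing SparkContext sc):

    val nums = sc.parallelize(1 to 1000000)
    val pipeline = nums.map(_ + 1).filter(_ % 2 == 0) // nothing executes yet;
                                                      // only the operator DAG is recorded
    // The action triggers the whole sequence; map and filter are pipelined
    // over each data block without materializing intermediate RDDs.
    val evens = pipeline.count()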

Spark implements distributed computing and task processing by distributing, tracking, and executing tasks, and finally aggregating the results, to complete the computation of a Spark application.
Block management for RDDs is done through the BlockManager, which abstracts data into blocks stored in memory or on disk; if a block is not present on the local node, it can also be fetched from a remote node to the local machine.
Each executor on a compute node creates a thread pool and uses it to execute the tasks that need to be performed concurrently.
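A sketch of how an application can influence where the BlockManager keeps an RDD's blocks (assuming an existing SparkContext sc; expensive is a placeholder for costly work):

    import org.apache.spark.storage.StorageLevel

    def expensive(i: Int): Int = i * i // placeholder for a costly computation

    val cached = sc.parallelize(1 to 1000)
      .map(expensive)
      .persist(StorageLevel.MEMORY_AND_DISK) // blocks spill to disk when memory is full

    cached.count() // the first action computes the blocks and stores them
    cached.count() // later actions read the blocks back instead of recomputing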

How the application is submitted and executed

An application can be submitted in one of the following two ways (a submission sketch follows the driver description below):
* The driver process runs on the client side, which manages and monitors the application (client mode)
* The Master node designates a worker node to start the driver, which then monitors the entire application (cluster mode)

The driver process is the application's master process. It is responsible for parsing the application, slicing it into stages, and scheduling tasks to executors for execution; it contains the DAGScheduler and other important objects.
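As a hedged illustration of the two submission modes (host names, class name, and jar are placeholders):

    // Client mode: the driver runs in the process that invoked spark-submit:
    //   spark-submit --master spark://master:7077 --deploy-mode client \
    //     --class demo.SimpleApp app.jar
    //
    // Cluster mode: the Master designates a worker node to launch the driver:
    //   spark-submit --master spark://master:7077 --deploy-mode cluster \
    //     --class demo.SimpleApp app.jar

    // At runtime the chosen mode can be read back from the configuration:
    val deployMode = sc.getConf.get("spark.submit.deployMode", "client")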
The following is a detailed description:

The driver process runs on the client


The execution process for this approach is as follows:

  1. The user starts the client, and the client runs the user program, starting the driver process. Components such as the DAGScheduler are started or instantiated inside the driver. The client's driver registers with the Master.
  2. The worker registers with the Master, and the Master commands the worker to start executors. The worker launches the ExecutorBackend process by creating an ExecutorRunner thread.
  3. After the ExecutorBackend starts, it registers with the SchedulerBackend inside the client's driver process so that the driver can locate the compute resources. The driver's DAGScheduler parses the RDD DAG in the application and generates the corresponding stages; the TaskSet contained in each stage is assigned to executors by the TaskScheduler. The executor starts a thread pool to execute tasks in parallel.

The driver process runs on the worker node


The execution process for this approach is as follows:

  1. The user starts the client, and the client submits the application to the Master.
  2. The Master schedules the application and designates a worker node to start the driver (that is, the SchedulerBackend) for the app. After the worker receives the Master's command, it creates a DriverRunner thread, and the SchedulerBackend process is started within that thread. The driver acts as the master process for the entire job. The Master then directs the other workers to start executors (that is, ExecutorBackend processes), which provide the compute resources. The process is similar to the above: the worker creates an ExecutorRunner thread, and the ExecutorRunner starts the ExecutorBackend process.
  3. After the ExecutorBackend starts, it registers with the driver's SchedulerBackend so that the driver obtains compute resources to schedule and dispatch tasks to the compute nodes for execution. The SchedulerBackend process contains the DAGScheduler, which slices the RDD DAG into stages, generates TaskSets, and schedules and dispatches tasks to executors. The TaskSet for each stage is stored in the TaskScheduler, which distributes the tasks to executors, where they are executed in parallel by multiple threads.

When reprinting, please credit the author, Jason Ding, and the source:
GitCafe Blog Homepage (http://jasonding1354.gitcafe.io/)
GitHub Blog Homepage (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jianshu Homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to find my blog homepage

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.

"Spark" spark application execution mechanism

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.