"Spark" spark application execution mechanism


Spark Application Concepts

A Spark application is a program submitted by the user. It can execute in local, Standalone, YARN, or Mesos mode. Depending on whether the driver program runs inside the cluster, a Spark application can run in cluster mode or client mode.
Here are some of the basic concepts involved in a Spark application:

  • Application: a Spark application. After the user submits it, Spark allocates resources to the app and transforms and executes the program. An application contains one driver program and several executors.
  • SparkContext: the entry point of a Spark application, responsible for scheduling the various computing resources and coordinating the executors on each worker node.
  • Driver Program: runs the application's main() function and creates the SparkContext (a minimal sketch follows this list).
  • RDD Graph: the RDD is Spark's core data structure and can be manipulated through a series of operators (mainly transformations and actions). When an RDD encounters an action operator, all the preceding operators are formed into a directed acyclic graph (DAG), which then becomes a job in Spark and is submitted to the cluster for execution. An application can contain multiple jobs.
  • Executor: a process that runs on a worker node on behalf of an application. It runs tasks and keeps data in memory or on disk. Each application requests its own executors to handle its tasks.
  • Worker Node: any node in the cluster that can run application code, running one or more executor processes.
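For concreteness, here is a minimal sketch of a driver program in Scala (the object name SimpleApp and the app name are hypothetical; it assumes Spark's standard Scala API):

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        // The driver runs main() and creates the SparkContext,
        // which requests executors from the cluster for this application.
        val conf = new SparkConf().setAppName("SimpleApp")
        val sc = new SparkContext(conf)

        // Operations on RDDs are shipped to the executors as tasks.
        val total = sc.parallelize(1 to 1000).sum()
        println(s"sum = $total")

        sc.stop() // releases the application's executors
      }
    }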

The following concepts describe the components involved while a Spark application runs:

  • Job: an operation triggered on the RDD graph, usually by a Spark action operator; the job is submitted to Spark through the runJob method of SparkContext.
  • Stage: each job is sliced into stages at the RDD's wide (shuffle) dependencies; each stage contains a set of identical tasks, called a TaskSet.
  • Task: a task corresponds to one partition and executes the operators of its stage on that partition of the RDD; each task is then placed in an executor's thread pool for execution (see the sketch after this list).
  • DAGScheduler: builds a stage-based DAG from the job and submits each stage to the TaskScheduler.
  • TaskScheduler: submits the TaskSet to the worker node cluster for execution and returns the results.
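As a sketch of the task/partition correspondence (assuming an existing SparkContext named sc), each stage runs one task per partition:

    val rdd = sc.parallelize(1 to 100, 4) // 4 partitions
    println(rdd.getNumPartitions)         // 4: a stage over this RDD runs 4 tasks
    val doubled = rdd.map(_ * 2)          // narrow dependency, still 4 partitions
    println(doubled.count())              // action: submits one job of one stage, 4 tasks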

Spark execution mechanism overview

Spark application transformation

An RDD action operator triggers job submission. A job submitted to Spark generates an RDD DAG, which the DAGScheduler converts into a DAG of stages; each stage produces a corresponding set of tasks, and the TaskScheduler distributes the tasks to executors for execution. Each task corresponds to a block of data and processes it with the user-defined function.
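A hedged word-count sketch of this flow (the input path is a placeholder, and sc is assumed to be an existing SparkContext):

    val words = sc.textFile("hdfs:///tmp/input.txt") // placeholder path
      .flatMap(_.split(" "))
      .map(w => (w, 1))                   // narrow transformations: pipelined in one stage
    val counts = words.reduceByKey(_ + _) // wide (shuffle) dependency: a stage boundary
    counts.collect()                      // action: the DAGScheduler submits two stages

Here reduceByKey introduces a shuffle, so the DAGScheduler splits the DAG into two stages at that point.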

The underlying implementation of Spark execution

In Spark's underlying implementation, data is managed through RDDs. An RDD contains a set of data blocks distributed across different nodes. When a Spark application operates on an RDD, the scheduler distributes the tasks containing those operations to the designated machines, and the compute nodes execute the tasks in a multithreaded manner. When an operation completes, one RDD is converted into another RDD, so the user's operations execute in sequence. To keep the system from running out of memory, Spark uses lazy evaluation: only when the accumulated operations reach an action operator is execution of the whole operation sequence triggered, and intermediate results are not allocated separate memory but are pipelined over the same block of data.
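A small sketch of this lazy, pipelined execution (assuming an existing SparkContext sc):

    val nums = sc.parallelize(1 to 1000000)
    val pipeline = nums.map(_ + 1).filter(_ % 2 == 0) // nothing executes yet;
                                                      // only the operator DAG is recorded
    // The action triggers the whole sequence; map and filter are pipelined
    // over each data block without materializing intermediate RDDs.
    val evens = pipeline.count()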

Spark implements distributed computing and task processing by distributing, tracking, and executing tasks, and finally aggregating the results, to complete the computation of a Spark application.
Block management for RDDs is done through the BlockManager, which abstracts data into blocks stored in memory or on disk; if a block is not present on the local node, it can also be fetched from a remote node to the local machine.
Each executor on a compute node creates a thread pool and uses it to execute the tasks that need to be performed concurrently.
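A sketch of how an application can influence where the BlockManager keeps an RDD's blocks (assuming an existing SparkContext sc; expensive is a placeholder for costly work):

    import org.apache.spark.storage.StorageLevel

    def expensive(i: Int): Int = i * i // placeholder for a costly computation

    val cached = sc.parallelize(1 to 1000)
      .map(expensive)
      .persist(StorageLevel.MEMORY_AND_DISK) // blocks spill to disk when memory is full

    cached.count() // the first action computes the blocks and stores them
    cached.count() // later actions read the blocks back instead of recomputing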

How the application is submitted and executed

An application can be submitted in one of the following two ways (a submission sketch follows the driver description below):
* The driver process runs on the client side, which manages and monitors the application (client mode)
* The Master node designates a worker node to start the driver, which then monitors the entire application (cluster mode)

The driver process is the application's master process. It is responsible for parsing the application, slicing it into stages, and scheduling tasks to executors for execution; it contains the DAGScheduler and other important objects.
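As a hedged illustration of the two submission modes (host names, class name, and jar are placeholders):

    // Client mode: the driver runs in the process that invoked spark-submit:
    //   spark-submit --master spark://master:7077 --deploy-mode client \
    //     --class demo.SimpleApp app.jar
    //
    // Cluster mode: the Master designates a worker node to launch the driver:
    //   spark-submit --master spark://master:7077 --deploy-mode cluster \
    //     --class demo.SimpleApp app.jar

    // At runtime the chosen mode can be read back from the configuration:
    val deployMode = sc.getConf.get("spark.submit.deployMode", "client")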
The following is a detailed description:

The driver process runs on the client


The execution process for this approach is as follows:

  1. The user starts the client, and the client runs the user program, starting the driver process. Components such as the DAGScheduler are started or instantiated inside the driver. The client's driver registers with the Master.
  2. The worker registers with the Master, and the Master commands the worker to start executors. The worker launches the ExecutorBackend process by creating an ExecutorRunner thread.
  3. After the ExecutorBackend starts, it registers with the SchedulerBackend inside the client's driver process so that the driver can locate the compute resources. The driver's DAGScheduler parses the RDD DAG in the application and generates the corresponding stages; the TaskSet contained in each stage is assigned to executors by the TaskScheduler. The executor starts a thread pool to execute tasks in parallel.

The driver process runs on the worker node


The execution process for this approach is as follows:

  1. The user starts the client, and the client submits the application to the Master.
  2. The Master schedules the application and designates a worker node to start the driver (that is, the SchedulerBackend) for the app. After the worker receives the Master's command, it creates a DriverRunner thread, and the SchedulerBackend process is started within that thread. The driver acts as the master process for the entire job. The Master then directs the other workers to start executors (that is, ExecutorBackend processes), which provide the compute resources. The process is similar to the above: the worker creates an ExecutorRunner thread, and the ExecutorRunner starts the ExecutorBackend process.
  3. After the ExecutorBackend starts, it registers with the driver's SchedulerBackend so that the driver obtains compute resources to schedule and dispatch tasks to the compute nodes for execution. The SchedulerBackend process contains the DAGScheduler, which slices the RDD DAG into stages, generates TaskSets, and schedules and dispatches tasks to executors. The TaskSet for each stage is stored in the TaskScheduler, which distributes the tasks to executors, where they are executed in parallel by multiple threads.

When reprinting, please credit the author, Jason Ding, and the source:
GitCafe Blog Homepage (http://jasonding1354.gitcafe.io/)
GitHub Blog Homepage (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jianshu Homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to find my blog homepage

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.

"Spark" spark application execution mechanism

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.