Operating framework for Spark applications


Several basic concepts:

(1) Job: a parallel computation consisting of multiple tasks, typically spawned by an action.

(2) Stage: the scheduling unit of a job; each job is divided into stages.

(3) Task: a unit of work that is sent to an executor.

(4) TaskSet: a set of related tasks with no shuffle dependencies among them.

An application consists of a driver program and multiple jobs. A job is made up of multiple stages, and a stage consists of several tasks that have no shuffle dependencies among them.
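
To make these concepts concrete, here is a minimal sketch (the object name, local master, and input data are illustrative assumptions) of how a single action spawns one job, and how a shuffle dependency splits that job into two stages, each of which is a set of shuffle-free tasks:

```scala
import org.apache.spark.sql.SparkSession

object JobStageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JobStageDemo")
      .master("local[*]")          // run locally just for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Narrow transformation only: stays inside one stage.
    val pairs = words.map(w => (w, 1))

    // reduceByKey introduces a shuffle dependency, so the job triggered by
    // the action below is split into two stages at this boundary.
    val counts = pairs.reduceByKey(_ + _)

    // collect() is the action: it spawns a job made of the two stages, each
    // stage being a TaskSet of shuffle-free tasks sent to the executors.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```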

How a Spark application runs:

(1) In simple terms:

The driver requests resources from the cluster; the cluster allocates the resources and starts the executors. The driver then sends the Spark application's code and files to the executors. Tasks run on the executors and, when they finish, return their results to the driver or write them out to external storage.
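
As a rough illustration of that resource request, the sketch below shows how an application can declare the executors it wants through standard Spark configuration keys; the object name, the concrete values, and the final computation are illustrative assumptions, and the application would normally be packaged and launched with spark-submit against a cluster manager:

```scala
import org.apache.spark.sql.SparkSession

object ResourceRequestDemo {
  def main(args: Array[String]): Unit = {
    // The driver asks the cluster manager for executors via standard
    // configuration keys; the concrete numbers here are illustrative.
    val spark = SparkSession.builder()
      .appName("ResourceRequestDemo")
      .config("spark.executor.instances", "4")  // request 4 executors
      .config("spark.executor.cores", "2")      // 2 cores per executor
      .config("spark.executor.memory", "4g")    // 4 GB per executor
      .getOrCreate()

    // Once the executors are up, the driver ships the application code to
    // them, tasks run there, and the action's result comes back to the driver.
    val total = spark.sparkContext.parallelize(1 to 100).sum()
    println(total)

    spark.stop()
  }
}
```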

(2) In more detail:

The application is submitted, a SparkContext is built, and the DAG is constructed and submitted to the scheduler for parsing. It is parsed into stages and submitted to the cluster, where it is scheduled by the cluster's task manager and the cluster starts the Spark executors. The driver sends the code and files to the executors, which perform the various operations needed to complete the tasks. A block tracker on the driver records the data blocks that the executors generate on each node. When the tasks finish running, the data is written to HDFS or to other kinds of databases.
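
A minimal sketch of the ends of that flow in code, building a SparkContext and writing the final output to HDFS when the action fires; the HDFS URIs, the application name, and the filter condition are placeholder assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WriteToHdfsDemo {
  def main(args: Array[String]): Unit = {
    // Building the SparkContext, as described above.
    val conf = new SparkConf().setAppName("WriteToHdfsDemo")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("hdfs://namenode:8020/input/logs")     // placeholder input path
    val errors = lines.filter(_.contains("ERROR"))                 // narrow transformation

    // The action: tasks run on the executors and the results are written to HDFS.
    errors.saveAsTextFile("hdfs://namenode:8020/output/errors")    // placeholder output path

    sc.stop()
  }
}
```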

(3) In full detail:

The Spark application performs various transformations and finally triggers a job through an action. After submission, the SparkContext builds the DAG from the RDD dependency relationships and submits it to the DAGScheduler for parsing. Parsing proceeds in reverse, with shuffles as the boundaries, to build the stages, and the stages themselves also have dependencies on each other. In other words, the DAGScheduler parses the DAG, divides it into stages, and computes the dependencies between those stages. Each stage is then submitted as a TaskSet to the underlying scheduler, which in Spark is the TaskScheduler; a TaskSetManager is created for it, and the tasks are finally submitted to the executors, which compute them with multiple threads. After computation, the results are reported back to the TaskSetManager, then to the TaskScheduler, and then to the DAGScheduler. Once everything has finished running, the data is written out.
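
One way to see this stage division is to print an RDD's lineage before triggering the action. In the spark-shell sketch below (the pipeline itself is illustrative), each shuffle-producing transformation marks a boundary where the DAGScheduler will cut a new stage:

```scala
// Run in spark-shell; `sc` is the SparkContext it provides.
val rdd = sc.parallelize(1 to 1000)
  .map(x => (x % 10, x))         // narrow: same stage
  .reduceByKey(_ + _)            // shuffle: stage boundary
  .map { case (k, v) => (v, k) } // narrow: same stage as the one above
  .sortByKey()                   // another shuffle: another stage boundary

// Prints the lineage; ShuffledRDD entries show where stages will be split
// when an action submits the job to the DAGScheduler.
println(rdd.toDebugString)

// Trigger the job: the DAGScheduler builds the stages, each stage becomes a
// TaskSet handed to the TaskScheduler, and the executors run the tasks.
rdd.count()
```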

(4) In even more depth:

After the application is submitted, the action is triggered, the SparkContext is built, and the DAG is constructed and submitted to the DAGScheduler, which builds the stages and submits each stage's TaskSet to the TaskScheduler, which in turn builds a TaskSetManager. The tasks are then submitted to the executors to run. After an executor finishes a task, it reports the completion to the SchedulerBackend, which passes the completion information on to the TaskScheduler. The TaskScheduler notifies the TaskSetManager, which removes the task and moves on to the next one. At the same time, the TaskScheduler inserts the completed result into a success queue and returns an acknowledgement once it is added, then informs the TaskSetManager that the task was processed successfully. After all tasks are completed, the TaskSetManager feeds the results back to the DAGScheduler. If the task is a ResultTask, the result is handed to the JobListener; if it is not a ResultTask (i.e. a ShuffleMapTask), the result is stored.
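
The internal TaskSetManager/SchedulerBackend handoff is not exposed directly, but Spark's public listener API lets you observe the same completion flow from the outside. The sketch below (assuming an existing SparkContext `sc`, e.g. in spark-shell; the final computation is illustrative) logs task, stage, and job completion events as they happen:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Register a listener that is called as tasks, stages, and jobs complete.
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"Task ${taskEnd.taskInfo.taskId} in stage ${taskEnd.stageId} finished: ${taskEnd.reason}")

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stage.stageInfo.stageId} completed with ${stage.stageInfo.numTasks} tasks")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
})

// Any action now produces completion events as its tasks and stages finish.
sc.parallelize(1 to 100).map(_ * 2).reduce(_ + _)
```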
