Spark Learning Notes 1: Understanding Application, Driver, Job, Task, and Stage


I read Spark's original paper and related materials, picked up some of the frequently used Spark terminology, and recorded it here.

1. Application

An application is a program submitted to Spark with spark-submit, for example SparkPi, the pi-calculation program in the Spark examples. An application usually consists of three parts: taking data from a data source (such as HDFS) to form an RDD, computing on that RDD through transformations and actions, and outputting the results to the console or to external storage (for example, collect brings results back to the driver so they can be printed to the console).
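To make those three parts concrete, here is a minimal sketch of a word-count application. The object name and the HDFS path are placeholders I chose for illustration, not something from the original notes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountApp")
    val sc   = new SparkContext(conf)

    // Part 1: take data from a data source (HDFS) to form an RDD.
    val lines = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path

    // Part 2: compute through transformations (these are lazy).
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Part 3: an action outputs the results, here to the console.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```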

2. Driver

The driver in Spark feels similar in function to the ApplicationMaster in YARN: its main job is scheduling and coordination with the executors and the cluster manager. There are two deploy modes, client and cluster. In client mode the driver runs on the machine from which the application is submitted, while in cluster mode one of the machines in the cluster is chosen to start the driver. A diagram on the Spark website gives a good overview of the driver's functionality.
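For illustration, submitting the SparkPi example in the two deploy modes might look like the commands below; the examples jar path and version number depend on your Spark installation:

```
# Client mode: the driver runs on the machine that runs spark-submit.
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.12-3.5.0.jar 100

# Cluster mode: the driver is started on a machine inside the cluster.
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.12-3.5.0.jar 100
```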

3. Job

A job in Spark is not the same as a job in MapReduce (MR). In MR a job is essentially one map and reduce pass, while in Spark it is quite different: every action operator triggers a job, for example count, first, and so on.
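As a sketch (assuming an existing SparkContext named sc, as in the spark-shell), each action below triggers its own job, which shows up as a separate entry in the Spark UI:

```scala
val rdd = sc.parallelize(1 to 1000).map(_ * 2) // transformation: lazy, no job yet

val n     = rdd.count()   // action: triggers job 0
val first = rdd.first()   // action: triggers job 1
val all   = rdd.collect() // action: triggers job 2
```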

4. Task

The task is the smallest execution unit in Spark. An RDD typically has multiple partitions, and the computation of each partition on an executor is one task.
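A sketch of the partition-to-task relationship, again assuming a SparkContext named sc:

```scala
// Create an RDD with 4 partitions explicitly.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
println(rdd.getNumPartitions) // 4

// The stage run for this action consists of 4 tasks,
// one per partition, scheduled onto the executors.
val sum = rdd.map(_ + 1).reduce(_ + _)
```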

5. Stage

The stage concept is specific to Spark. Typically a job is divided into a certain number of stages, which are executed in order. To understand how stages are split, it helps to first know the concepts of narrow dependency and wide dependency from the Spark paper. They are actually easy to tell apart: look at how the data in a parent RDD's partitions flows into the child RDD. If each parent partition feeds at most one child partition, it is a narrow dependency; otherwise it is a wide dependency. The boundaries of wide dependencies are the dividing points between stages. Two diagrams from the Spark paper show narrow and wide dependencies and the division into stages clearly.

As to why jobs are divided this way, it is mainly because of differences in fault-tolerance recovery and in processing performance (a wide dependency requires a shuffle).
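As a sketch of stage division (assuming a SparkContext named sc and a hypothetical input path), the job below is split into two stages at the reduceByKey shuffle boundary:

```scala
val words = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path
              .flatMap(_.split(" "))             // narrow dependency
              .map(word => (word, 1))            // narrow dependency

// reduceByKey introduces a wide (shuffle) dependency, so the stage
// above ends here and a new stage begins after the shuffle.
val counts = words.reduceByKey(_ + _)

// The action triggers the whole job: stage 0 (flatMap/map) runs first,
// then stage 1 reads the shuffled data and produces the result.
counts.collect()
```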

There is much more to Spark's terminology, and my understanding may not be complete yet, but that is all for now.
