I read Spark's original paper and some related materials, and here I record my understanding of the terminology that comes up most often in Spark.
1. Application
An application is a program submitted to Spark with spark-submit, such as the SparkPi example (which computes pi) in Spark's examples. An application usually consists of three parts: reading data from a data source (such as HDFS) to form an RDD, computing on the RDD through transformations and actions, and outputting the results to the console or to external storage (for example, collect brings the results back to the console).
2. Driver
The driver in Spark plays a role similar to the ApplicationMaster in YARN: its main job is scheduling and coordinating with the executors and the cluster manager. There are two deploy modes, client and cluster. In client mode the driver runs on the machine from which the job is submitted, while in cluster mode a machine inside the cluster is chosen to launch the driver. A diagram on the Spark website gives a good overview of the driver's role.
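As a concrete illustration, the deploy mode is chosen with the --deploy-mode flag when submitting the SparkPi example via spark-submit. The master URL and jar path below are placeholders for this sketch; adjust them to your installation:

```shell
# Client mode: the driver runs on the machine where spark-submit is invoked.
# (Master URL and jar path are placeholders, not real values.)
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master-host:7077 \
  --deploy-mode client \
  examples/jars/spark-examples.jar 100

# Cluster mode: the driver is launched on one of the cluster's machines instead.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  examples/jars/spark-examples.jar 100
```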
3. Job
A job in Spark is not the same as a job in MapReduce. In MapReduce a job is essentially a map phase plus a reduce phase. A Spark job is quite different: each action operator, such as count or first, triggers one job.
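As a rough mental model (pure Python, not the Spark API), transformations only record work to be done, and each action kicks off a job that actually runs the recorded lineage:

```python
# Toy model of Spark's lazy evaluation: transformations are recorded,
# and only an action (count, first, ...) triggers a "job" that runs them.
# This is a conceptual sketch, not Spark's actual implementation.

class ToyRDD:
    def __init__(self, data, pending=None):
        self.data = data
        self.pending = pending or []   # recorded transformations, not yet run
        self.jobs_run = 0              # how many jobs (actions) have executed

    def map(self, f):
        # Transformation: lazy, just returns a new ToyRDD with f recorded.
        return ToyRDD(self.data, self.pending + [f])

    def _compute(self):
        # Running the recorded lineage counts as one job.
        self.jobs_run += 1
        out = self.data
        for f in self.pending:
            out = [f(x) for x in out]
        return out

    def count(self):                   # action: triggers a job
        return len(self._compute())

    def first(self):                   # action: triggers another job
        return self._compute()[0]

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10)
print(rdd.count())     # → 3   (first action, job 1)
print(rdd.first())     # → 10  (second action, job 2)
print(rdd.jobs_run)    # → 2   (two actions, two jobs)
```

Note that calling map alone runs nothing; the two actions each trigger their own job, which is exactly why chaining many actions on an uncached RDD recomputes the lineage each time.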
4. Task
The task is the smallest execution unit in Spark. An RDD is typically divided into partitions, and the computation of each partition on an executor is one task.
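Again as a pure-Python sketch (not the Spark scheduler), the number of tasks launched for a stage matches the number of partitions, with each partition's computation handed out as one task; a thread pool stands in for the executors here:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy illustration: an "RDD" with 4 partitions yields 4 tasks for one stage.
# The thread pool plays the role of the executors.
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

def run_task(partition):
    # One task = apply the stage's work (here: squaring) to one partition.
    return [x * x for x in partition]

with ThreadPoolExecutor(max_workers=2) as executors:
    results = list(executors.map(run_task, partitions))

print(len(partitions))   # → 4, one task per partition
print(results[0])        # → [1, 4]
```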
5. Stage
The stage is a concept specific to Spark. A job is typically divided into a number of stages, which are executed in order. To understand how stages are divided, one first needs the concepts of narrow dependency and wide dependency from the Spark paper. They are easy to distinguish: look at whether the data in a partition of the parent RDD flows into more than one partition of the child RDD. If each parent partition feeds only one child partition, the dependency is narrow; otherwise it is wide. The boundary of a wide dependency is where a stage is split. Two diagrams in Spark's paper illustrate narrow versus wide dependencies and stage division clearly.
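The splitting rule can be sketched in plain Python (a conceptual model, not Spark's DAG scheduler, and the operator names are only illustrative): walk the lineage and start a new stage at every wide dependency, since that is where a shuffle must happen:

```python
# Toy stage division: a lineage is a list of (operator, dependency_type)
# pairs; a wide dependency (shuffle boundary) starts a new stage.
# Simplified sketch: Spark actually splits around the shuffle itself.

def split_into_stages(lineage):
    stages = [[]]
    for op, dep in lineage:
        if dep == "wide" and stages[-1]:
            stages.append([])      # shuffle boundary: begin a new stage
        stages[-1].append(op)
    return stages

lineage = [
    ("map", "narrow"),
    ("filter", "narrow"),
    ("reduceByKey", "wide"),       # shuffles: stage boundary here
    ("map", "narrow"),
    ("groupByKey", "wide"),        # shuffles again: another boundary
]

stages = split_into_stages(lineage)
print(len(stages))                 # → 3
for s in stages:
    print(s)
# → ['map', 'filter']
# → ['reduceByKey', 'map']
# → ['groupByKey']
```

Chains of narrow dependencies stay inside one stage and can be pipelined partition by partition, which is exactly why the boundary matters for performance.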
As for why jobs are divided this way, it is mainly because narrow and wide dependencies differ in fault-tolerant recovery and in processing performance (a wide dependency requires a shuffle).
This is all I know about Spark's terminology for now; my understanding may not be complete, but that's it for this note.
Spark Learning Notes 1: Understanding application, driver, job, task, stage