This article explains how the following five components connect and cooperate with one another to form the running architecture of a Spark application.
1. Driver
2. Master
3. Worker
4. Executor
5. Task
The Driver is a process. The Spark program we submit is written against the Driver and executed by the Driver process. The Driver may run on a node of the Spark cluster, or on the machine from which you submit the Spark program, depending on Spark's deploy mode. In other words (in client mode), whichever machine we submit and run the program from is the one Spark uses in the Driver role.
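To make "the program runs in the Driver" concrete, here is a minimal sketch of a Spark application in Scala. The class name, app name, and Master URL are illustrative assumptions, not anything from this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // Everything in main() runs inside the Driver process.
    // "spark://node1:7077" is a placeholder standalone Master URL.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://node1:7077")
    val sc = new SparkContext(conf)

    // Transformations declared here are only planned by the Driver;
    // the actual computation happens later, in Executor task threads.
    val data = sc.parallelize(1 to 100)
    println(data.count()) // count() is an action: it triggers a job

    sc.stop()
  }
}
```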
In some companies, the production cluster has a dedicated machine for submitting programs; some use the cluster's master node directly as the Driver to run programs; and in other setups one of the cluster's nodes is chosen to act as the Driver node.
The Master is also a process. We can use the simple jps command to see which processes are running on each node of the cluster we have set up. The Master is mainly responsible for resource scheduling and allocation, as well as cluster monitoring.
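For example, on a standalone cluster, jps output on the master machine and on a worker machine might look like this ("Master" and "Worker" are the JVM process names Spark's standalone scripts actually launch; the PIDs are made up):

```
$ jps        # on the master node
12345 Master
12390 Jps

$ jps        # on a worker node
23456 Worker
23501 Jps
```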
The Worker is also a process, with two main responsibilities: first, it uses its own memory to store partition data of RDDs; second, it starts other processes and threads to process and compute those RDD partitions.
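As a sketch of the first responsibility: persisting an RDD keeps its computed partitions in memory on the worker side (strictly, inside the Executor JVMs running on the Worker nodes). The HDFS path is a placeholder, and `sc` is assumed to be an existing SparkContext.

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

// Ask Spark to keep the computed partitions in memory on the
// worker side, so later jobs can reuse them without re-reading HDFS.
lines.persist(StorageLevel.MEMORY_ONLY)

println(lines.count()) // first action: computes and caches the partitions
println(lines.count()) // second action: served from the cached partitions
```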
The Executor is also a process, and one Executor process runs multiple task threads. The Executor and its tasks are responsible for computing RDD partitions in parallel, that is, for executing the RDD operators in our program, such as map, flatMap, and reduceByKey.
A Task is a thread, mainly responsible for the actual execution of an operator's work on a partition.
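To see this division of labor, consider the classic word count, sketched below under the assumption of an existing SparkContext `sc` and a placeholder input path. Each operator is ultimately executed, one partition per task, by task threads inside the Executor processes.

```scala
// Assumes `sc` is an existing SparkContext; the path is a placeholder.
val counts = sc.textFile("hdfs://namenode:9000/input/words.txt")
  .flatMap(line => line.split(" ")) // runs as task threads in Executors
  .map(word => (word, 1))           // executed inside the same tasks
  .reduceByKey(_ + _)               // shuffle: triggers a new batch of tasks

counts.take(10).foreach(println)    // the action that starts the job
```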
Having briefly introduced each component, how do these components call and cooperate with each other to complete the run of a Spark application? The steps below, together with the architecture figure, give an intuitive explanation.
Here we analyze the Spark running architecture by building a minimal test cluster of 3 nodes: 1 Master and 2 Workers. Whether a machine acts as a Master or a Worker node is specified in the configuration files when setting up the environment, as sketched below.
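As a sketch of such a standalone setup (hostnames node1 to node3 are assumptions): the Master host goes into conf/spark-env.sh, the Worker machines are listed in conf/workers (named conf/slaves in older Spark versions), and the cluster is started with the bundled scripts.

```
# conf/spark-env.sh (on every node)
SPARK_MASTER_HOST=node1

# conf/workers -- one Worker hostname per line
node2
node3

# start the whole cluster from the master node:
#   sbin/start-all.sh
# later, submit an application against this cluster:
#   bin/spark-submit --master spark://node1:7077 --class MyApp myapp.jar
```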
A. What does the Driver process do to initialize each time it starts? First, it sends a request to the Master to register the Spark application; that is, it lets the Master know that a new Spark application wants to run.
B. After receiving the registration request for the Spark application, the Master sends it on to the Workers for resource scheduling and allocation. This also shows that resources are scheduled by the Master and actually provided by the Workers, which host the Executors.
C. After receiving the Master's request, the Worker starts an Executor for the Spark application and allocates resources to it.
D. After the Executor starts and its resources are allocated, it registers itself back with the Driver (reverse registration), so the Driver knows which Executors are serving it.
E. Once the Executors have registered with the Driver, the formal execution of our Spark application can begin. The first step is to create the initial RDD: read the data source, and then execute the subsequent series of operators. The contents of an HDFS file are read onto multiple Worker nodes to form a distributed dataset in memory, which is the initial RDD.
F. At this point, the Driver forms the corresponding tasks according to the operators in the job and submits them to the Executors, which assign each task to a thread for computation.
G. Each task then fetches the data it is responsible for (that is, the partition of the initial RDD created in the first step) and applies the specified operator to that partition's data, forming a partition of a new RDD; with that, one round of the big loop is finished.
The partition data of the new RDD then forms a new batch of tasks through the Driver, which are again submitted to the Executors for execution. This repeats until all operators have been executed; the sketch below shows this loop from the Driver's point of view.
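One way to observe this loop from the Driver side is to print an RDD's lineage with toDebugString: the indentation levels mark shuffle boundaries, i.e. the points at which the Driver submits a fresh batch of tasks. The snippet continues the hypothetical word count from earlier; the sample output shape is abbreviated.

```scala
// The reduceByKey shuffle splits the job into two batches of tasks,
// visible as two indentation levels in the lineage printout.
println(counts.toDebugString)
// (2) ShuffledRDD[4] at reduceByKey ...
//  +-(2) MapPartitionsRDD[3] at map ...
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  hdfs://... HadoopRDD[0] at textFile ...

// Each action kicks off this cycle of task batches again.
counts.saveAsTextFile("hdfs://namenode:9000/output/word-counts")
```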