This article explains how the following five components connect and cooperate with one another to form the running architecture of a Spark application.
1. Driver
2. Master
3. Worker
4. Executor
5. Task
The Driver is a process. The Spark program we submit is written against the Driver and executed by the Driver process. The Driver may run on a node of the Spark cluster, or on the machine from which you submit the Spark program, depending on Spark's deploy mode. In other words (in client mode), whichever machine we submit and run the program from is the one Spark uses in the Driver role.
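To make "the program runs in the Driver" concrete, here is a minimal sketch of a Spark application in Scala. The class name, app name, and Master URL are illustrative assumptions, not anything from this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // Everything in main() runs inside the Driver process.
    // "spark://node1:7077" is a placeholder standalone Master URL.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://node1:7077")
    val sc = new SparkContext(conf)

    // Transformations declared here are only planned by the Driver;
    // the actual computation happens later, in Executor task threads.
    val data = sc.parallelize(1 to 100)
    println(data.count()) // count() is an action: it triggers a job

    sc.stop()
  }
}
```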
In some companies, the production cluster has a dedicated machine for submitting programs; some use the cluster's master node directly as the Driver to run programs; and in other setups one of the cluster's nodes is chosen to act as the Driver node.
The Master is also a process. We can use the simple jps command to see which processes are running on each node of the cluster we have set up. The Master is mainly responsible for resource scheduling and allocation, as well as cluster monitoring.
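For example, on a standalone cluster, jps output on the master machine and on a worker machine might look like this ("Master" and "Worker" are the JVM process names Spark's standalone scripts actually launch; the PIDs are made up):

```
$ jps        # on the master node
12345 Master
12390 Jps

$ jps        # on a worker node
23456 Worker
23501 Jps
```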
The Worker is also a process, with two main responsibilities: first, it uses its own memory to store partition data of RDDs; second, it starts other processes and threads to process and compute those RDD partitions.
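As a sketch of the first responsibility: persisting an RDD keeps its computed partitions in memory on the worker side (strictly, inside the Executor JVMs running on the Worker nodes). The HDFS path is a placeholder, and `sc` is assumed to be an existing SparkContext.

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

// Ask Spark to keep the computed partitions in memory on the
// worker side, so later jobs can reuse them without re-reading HDFS.
lines.persist(StorageLevel.MEMORY_ONLY)

println(lines.count()) // first action: computes and caches the partitions
println(lines.count()) // second action: served from the cached partitions
```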
The Executor is also a process, and one Executor process runs multiple task threads. The Executor and its tasks are responsible for computing RDD partitions in parallel, that is, for executing the RDD operators in our program, such as map, flatMap, and reduceByKey.
A Task is a thread, mainly responsible for the actual execution of an operator's work on a partition.
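To see this division of labor, consider the classic word count, sketched below under the assumption of an existing SparkContext `sc` and a placeholder input path. Each operator is ultimately executed, one partition per task, by task threads inside the Executor processes.

```scala
// Assumes `sc` is an existing SparkContext; the path is a placeholder.
val counts = sc.textFile("hdfs://namenode:9000/input/words.txt")
  .flatMap(line => line.split(" ")) // runs as task threads in Executors
  .map(word => (word, 1))           // executed inside the same tasks
  .reduceByKey(_ + _)               // shuffle: triggers a new batch of tasks

counts.take(10).foreach(println)    // the action that starts the job
```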
Having briefly introduced each component, how do these components call and cooperate with each other to complete the run of a Spark application? The steps below, together with the architecture figure, give an intuitive explanation.
Here we analyze the Spark running architecture by building a minimal test cluster of 3 nodes: 1 Master and 2 Workers. Whether a machine acts as a Master or a Worker node is specified in the configuration files when setting up the environment, as sketched below.
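As a sketch of such a standalone setup (hostnames node1 to node3 are assumptions): the Master host goes into conf/spark-env.sh, the Worker machines are listed in conf/workers (named conf/slaves in older Spark versions), and the cluster is started with the bundled scripts.

```
# conf/spark-env.sh (on every node)
SPARK_MASTER_HOST=node1

# conf/workers -- one Worker hostname per line
node2
node3

# start the whole cluster from the master node:
#   sbin/start-all.sh
# later, submit an application against this cluster:
#   bin/spark-submit --master spark://node1:7077 --class MyApp myapp.jar
```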
A. What does the Driver process do to initialize each time it starts? First, it sends a request to the Master to register the Spark application; that is, it lets the Master know that a new Spark application wants to run.
B. After receiving the registration request for the Spark application, the Master sends it on to the Workers for resource scheduling and allocation. This also shows that resources are scheduled by the Master and actually provided by the Workers, which host the Executors.
C. After receiving the Master's request, the Worker starts an Executor for the Spark application and allocates resources to it.
D. After the Executor starts and its resources are allocated, it registers itself back with the Driver (reverse registration), so the Driver knows which Executors are serving it.
E. Once the Executors have registered with the Driver, the formal execution of our Spark application can begin. The first step is to create the initial RDD: read the data source, and then execute the subsequent series of operators. The contents of an HDFS file are read onto multiple Worker nodes to form a distributed dataset in memory, which is the initial RDD.
F. At this point, the Driver forms the corresponding tasks according to the operators in the job and submits them to the Executors, which assign each task to a thread for computation.
G. Each task then fetches the data it is responsible for (that is, the partition of the initial RDD created in the first step) and applies the specified operator to that partition's data, forming a partition of a new RDD; with that, one round of the big loop is finished.
The partition data of the new RDD then forms a new batch of tasks through the Driver, which are again submitted to the Executors for execution. This repeats until all operators have been executed; the sketch below shows this loop from the Driver's point of view.
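One way to observe this loop from the Driver side is to print an RDD's lineage with toDebugString: the indentation levels mark shuffle boundaries, i.e. the points at which the Driver submits a fresh batch of tasks. The snippet continues the hypothetical word count from earlier; the sample output shape is abbreviated.

```scala
// The reduceByKey shuffle splits the job into two batches of tasks,
// visible as two indentation levels in the lineage printout.
println(counts.toDebugString)
// (2) ShuffledRDD[4] at reduceByKey ...
//  +-(2) MapPartitionsRDD[3] at map ...
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  hdfs://... HadoopRDD[0] at textFile ...

// Each action kicks off this cycle of task batches again.
counts.saveAsTextFile("hdfs://namenode:9000/output/word-counts")
```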