Spark on YARN Notes


I have been using this Hadoop stack for a while without a systematic understanding of the Hadoop ecosystem, which makes it hard to find the root cause when problems come up; I always end up searching Google for relevant information. So I have decided it is worth spending some time understanding the principles and concepts behind the components I use day to day.

Most components in the Hadoop ecosystem rely on YARN for resource management and task assignment, and allocating resources sensibly is directly tied to how efficiently a job executes, and can even determine whether it succeeds or fails. Spark is now the mainstream big data computing framework, so getting a clear understanding of the YARN architecture and of how Spark combines with YARN is a good starting point for learning.

The Architecture of YARN

YARN is an abbreviation of Yet Another Resource Negotiator. It is mainly used to manage the resources of a distributed cluster, and understanding YARN comes down to understanding four abstract components: ResourceManager, ApplicationMaster, NodeManager, and Container.

ResourceManager: Can be understood as the head of the entire distributed cluster; it is responsible for managing resource allocation across the whole cluster.

ApplicationMaster: A cluster can run multiple applications at the same time (MapReduce jobs, Spark applications, and so on), and every application that requests resources through YARN gets its own ApplicationMaster. It is responsible for requesting resources from the ResourceManager and for coordinating the execution of specific tasks with the NodeManagers.

NodeManager: Acts as a slave (worker) of the ResourceManager; it starts containers, manages the resources of its node, and reports resource usage back to the ResourceManager.

Container: A bundle of resources including memory, CPU, network, disk, and possibly GPUs in the future. Each container identifies a subset of the resources on a physical machine and assigns them to the processes running inside it; each container can run multiple processes, and its resource sizes are configurable. You can therefore think of a container as a small virtual machine that actually performs the computational work.
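To make "configurable" concrete, here is a minimal yarn-site.xml sketch; the property names are standard YARN settings, but the values are placeholder assumptions and should be tuned to the actual cluster.

    <!-- yarn-site.xml sketch: resources a NodeManager can carve into containers. -->
    <!-- Values below are placeholder assumptions, not recommendations. -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>16384</value> <!-- total memory (MB) on this node for containers -->
    </property>
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>8</value> <!-- total virtual cores on this node for containers -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value> <!-- largest single container the scheduler will grant -->
    </property>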

The following diagram shows the structure of YARN and how these components work together.

In this figure, two clients have each submitted an application. The ApplicationMaster of the purple application runs on the first node, while the ApplicationMaster of the brown application has been assigned to the second node. Both request resources from the ResourceManager: the ResourceManager grants the purple application a container on the second node, and grants the brown application a total of three containers on the first and third nodes. Each container reports task status to its ApplicationMaster, and each ApplicationMaster communicates with the ResourceManager. The NodeManagers manage container startup and reclamation while reporting node status to the ResourceManager.

Spark Architecture

In the Spark cluster architecture shown in the figure above, the driver program is the process that runs the user's logic code, maintains the Spark runtime context, communicates with the cluster manager, and requests the resources needed to run the Spark application.

The cluster manager, as the name implies, is the cluster administrator: it manages resource allocation across the entire cluster.

A worker node is a slave of the cluster, that is, a physical machine that actually runs executors. A worker can start multiple executors, and each executor, which is in fact a JVM process, has a fixed, configurable amount of resources.
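As a minimal sketch of how those fixed executor resources are typically declared (the application name and values here are illustrative assumptions, and on YARN the master is usually supplied via spark-submit rather than in code), a PySpark session can be configured like this:

    # Minimal PySpark sketch: pinning executor resources at session creation.
    # The app name and resource values are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("yarn-notes-demo")               # hypothetical app name
        .config("spark.executor.instances", "4")  # how many executors to start
        .config("spark.executor.cores", "2")      # cores per executor
        .config("spark.executor.memory", "4g")    # heap per executor (each is one JVM)
        .getOrCreate()
    )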

Based on the logic code in the driver program, Spark builds a DAG out of the transformation and action operations, then decomposes the DAG into stages and each stage into tasks. The driver requests resources for specific tasks from the cluster manager; the cluster manager communicates with the worker nodes to determine which idle resources are available, and then tells the driver about those worker nodes. The driver then serializes the operations to be performed, together with their closures (the actual method code), and ships them to the worker nodes, where appropriate executors are chosen to execute the tasks.
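As a minimal illustration of transformations versus actions, a classic word count (reusing the spark session from the sketch above; the input path is a placeholder) builds a DAG that Spark splits into two stages, because reduceByKey introduces a shuffle:

    # Transformations are lazy and only extend the DAG; the final action
    # triggers the job, which Spark splits into stages at the shuffle.
    rdd = spark.sparkContext.textFile("hdfs:///placeholder/input")  # placeholder path
    words = rdd.flatMap(lambda line: line.split())   # transformation (narrow)
    pairs = words.map(lambda word: (word, 1))        # transformation (narrow)
    counts = pairs.reduceByKey(lambda a, b: a + b)   # wide: shuffle begins a new stage
    print(counts.collect())                          # action: submits the two-stage job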

Spark on YARN

Once you understand the architecture of YARN and the architecture of a Spark cluster, it is easy to see how Spark runs on YARN.

Client Mode

Depending on the deployment mode there are two different forms: client mode and cluster mode. Client mode comes first; its architecture is shown in the figure below.

Here the Spark driver plays the same role as the driver program in the Spark architecture, but it runs on the client, so its resource usage has nothing to do with YARN. The YARN ApplicationMaster and the YARN ResourceManager together play the role of the cluster manager in the Spark architecture: the ApplicationMaster requests resources from the ResourceManager, the ResourceManager reports idle nodes back to the ApplicationMaster, and the ApplicationMaster then asks the NodeManagers to start the executors. Spark's worker nodes correspond to YARN containers: even if YARN manages a cluster of only 3 servers, each server can start many containers, so from Spark's point of view the cluster has many worker nodes, not just 3.

Cluster Mode

The architecture of cluster mode is shown in the figure below.

In fact, regardless of client mode or cluster mode, the actual job always runs on the cluster; the only difference is whether the driver program runs on the cluster or on the client. As the cluster mode architecture diagram shows, both the Spark driver and the ApplicationMaster run inside containers allocated by YARN. This is an advantage when the client is not a machine in the cluster but a remote machine, because it cuts down a lot of data-transfer time. Moreover, after the client submits the task it merely prints status information to stdout, so disconnecting the client does not affect the execution of the task.

Neither client mode nor cluster mode is absolutely better; the choice depends on the actual situation. If the client is a remote machine, cluster mode is recommended. If the client is one of the machines in the cluster, client mode can be used, because you can then watch the task run directly on the client. For example, when using Airflow as the scheduling system, client mode lets you read the full log directly on the Airflow web page without having to log in to the ApplicationMaster web UI.

The mode can be selected with the --deploy-mode parameter when the task is submitted.
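For concreteness, a typical submission might look like the sketch below; the application file and resource values are illustrative assumptions.

    # Submit to YARN; switch --deploy-mode between "cluster" and "client".
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-cores 2 \
      --executor-memory 4g \
      my_app.py  # placeholder application file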
