Spark on YARN
YARN Overview
What is YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop resource manager: a general-purpose resource management system that provides unified resource management and scheduling for upper-level applications. Its introduction brings great benefits to cluster utilization, unified resource management, and data sharing.
YARN's place in the Hadoop ecosystem
Background: why YARN emerged
With the rapid development of the Internet, the disk-based offline computing framework MapReduce could no longer meet application requirements, and a number of new computing frameworks emerged to handle various scenarios, including in-memory, streaming, and iterative computing frameworks. MRv1, however, could not support multiple computing frameworks coexisting.
YARN Basic Architecture
ResourceManager (RM)
ResourceManager is responsible for the unified management and scheduling of cluster resources and takes over the role of the JobTracker; the whole cluster has only one. In general, RM has the following functions:
- 1. Handling client requests
- 2. Starting and monitoring ApplicationMasters
- 3. Monitoring NodeManagers
- 4. Allocating and scheduling resources
NodeManager (NM)
NodeManager manages a single node in the YARN cluster. It provides services on each node, from overseeing the lifetime of a container to monitoring resources and tracking node health. MRv1 managed the execution of Map and Reduce tasks through slots, while NodeManager manages abstract containers that represent the per-node resources available to a particular application. NM has the following responsibilities:
- 1. Managing resources on a single node
- 2. Handling commands from the ResourceManager
- 3. Handling commands from the ApplicationMaster
ApplicationMaster (AM)
Each application has one ApplicationMaster that is responsible for managing that application. The ApplicationMaster negotiates resources from the ResourceManager and, through the NodeManagers, monitors container execution and resource usage (CPU, memory, etc.). Note that although the resource types are currently fairly traditional (CPU cores, memory), new resource types (such as specific processing units or dedicated processing devices) may be supported in the future. AM has the following responsibilities:
- 1. Splitting the input data
- 2. Requesting resources for the application and assigning them to internal tasks
- 3. Task monitoring and fault tolerance
Container
Container is the resource abstraction in YARN. It encapsulates multidimensional resources on a node, such as memory, CPU, disk, and network. When an AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers. YARN assigns a Container to each task, and the task can use only the resources described in that Container. Container has the following role:
- It abstracts the task runtime environment, encapsulating multidimensional resources such as CPU and memory, along with environment variables, launch commands, and other task-related information
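In Spark on YARN, the container resources described above are requested through spark-submit flags: each executor runs in its own YARN container sized by these options. A minimal sketch (the memory and core values, class name, and jar path are illustrative only):

```shell
# Request 4 YARN containers, each with 2 GB of memory and 2 cores,
# for the Spark executors (values are illustrative).
./bin/spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class org.example.MyApp \
  myapp.jar
```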
Spark on YARN runtime architecture
A review of the basic Spark workflow
With SparkContext as the program's entry point, Spark creates two levels of scheduling during SparkContext initialization: the DAGScheduler (job scheduler) and the TaskScheduler (task scheduler). The job scheduling module is a high-level, stage-oriented scheduler: it computes multiple scheduling stages for each Spark job (usually split at shuffle boundaries), then builds a concrete set of tasks for each stage (usually taking data locality into account) and submits them to the task scheduling module as TaskSets for execution. The task scheduling module is responsible for actually launching tasks, monitoring them, and reporting on their execution.
YARN Standalone / YARN Cluster
- yarn-standalone is the name used in 0.9 and earlier versions; from 1.0 onward it was renamed yarn-cluster
- yarn-cluster (YarnClusterScheduler)
- The driver and the AM run together in the cluster; the client runs separately
- ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] [app options]
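A concrete cluster-mode submission might look like the following (the application class, jar path, and resource sizes are hypothetical placeholders):

```shell
# Hypothetical application class and jar; replace with your own.
./bin/spark-submit \
  --class org.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 2g \
  myapp.jar arg1 arg2
```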
The Spark driver starts as an ApplicationMaster in the YARN cluster. Each job the client submits to the ResourceManager is assigned a unique ApplicationMaster on a worker node of the cluster, and that ApplicationMaster manages the entire life cycle of the application. Because the driver program runs inside YARN, there is no need to start a Spark master/client in advance, and the application's results cannot be displayed on the client (they can be viewed in the history server).
YARN Client
- yarn-client (YarnClientClusterScheduler)
- The client and the driver run together (locally); the AM is used only to request resources
- ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client [options] [app options]
In yarn-client mode, the driver runs on the client and obtains resources from the RM through the ApplicationMaster. The local driver is responsible for interacting with all the executor containers and aggregating the final results. Closing the terminal is equivalent to killing the Spark application. In general, use this mode when the result of the run needs to be returned to the terminal.
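Because the driver stays on the client in this mode, interactive shells such as spark-shell must use it (the REPL driver has to run locally to interact with the user):

```shell
# Start an interactive Spark shell on YARN; client deploy mode
# is required (and is the default) for interactive sessions.
./bin/spark-shell --master yarn --deploy-mode client
```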
How to choose
- If you need results returned to the client, use yarn-client mode
- If results are stored to HDFS, yarn-cluster mode is recommended
Additional configuration and considerations
How to change the default configuration
- $SPARK_HOME/conf/spark-defaults.conf: each app uses this configuration when it is submitted
- --conf prop=value: specifies per-app parameters for an individual app
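For example, defaults set in spark-defaults.conf can be overridden for a single app with --conf at submit time (the property values, queue name, class, and jar below are illustrative):

```shell
# $SPARK_HOME/conf/spark-defaults.conf (applies to every app):
#   spark.executor.memory   2g
#   spark.yarn.queue        default

# Per-app override at submit time:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.yarn.queue=analytics \
  --class org.example.MyApp \
  myapp.jar
```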
Environment variables
- $SPARK_HOME/conf/spark-defaults.conf: each app uses this configuration when it is submitted
- spark.yarn.appMasterEnv.[EnvironmentVariableName]
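For instance, environment variables can be passed to the ApplicationMaster process like this (the variable names and values are illustrative, as are the class and jar):

```shell
# Each spark.yarn.appMasterEnv.* property sets one environment
# variable in the ApplicationMaster's environment.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/jdk8 \
  --conf spark.yarn.appMasterEnv.MY_ENV=some-value \
  --class org.example.MyApp \
  myapp.jar
```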
Related configuration
Special attention
- In cluster mode, yarn.nodemanager.local-dirs is used by both the Spark executors and the Spark driver, and spark.local.dir is ignored
- In client mode, the Spark executors use yarn.nodemanager.local-dirs, while the Spark driver uses spark.local.dir
- --files and --archives support the # syntax for mapping a file (e.g. on HDFS) to a different name as seen by the application
- --jars: comma-separated list of jars to distribute to the driver and executor classpaths
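A sketch combining these flags (all paths and names below are hypothetical); the part after # is the name the application sees:

```shell
# app.conf is shipped from HDFS and visible to the app as "app.conf";
# deps.zip is unpacked on each node under the directory "deps".
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files hdfs:///config/app.conf#app.conf \
  --archives deps.zip#deps \
  --jars lib1.jar,lib2.jar \
  --class org.example.MyApp \
  myapp.jar
```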
The above is the main content of this section, which the blogger put together for everyone during his own learning process. I hope it gives you some guidance; if you found it useful, please give it a like, and if not, please forgive me. If there are mistakes, please point them out. Follow the blogger to get updates as soon as they are published, thank you! Reprints are also welcome, but the original address must be clearly marked in the post, and the blogger reserves the right of final interpretation!