Spark on YARN


YARN Overview

What is YARN?

Apache Hadoop YARN (Yet Another Resource Negotiator) is Hadoop's resource manager: a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications. Its introduction brings great benefits to cluster utilization, unified resource management, and data sharing.

YARN's place in the Hadoop ecosystem

Background: why YARN emerged

With the rapid development of the Internet, the disk-based offline computing framework MapReduce could no longer meet application requirements, and a number of new computing frameworks emerged to handle various scenarios, including in-memory computing frameworks, streaming computing frameworks, and iterative computing frameworks. MRv1, however, could not support multiple computing frameworks coexisting on the same cluster.

YARN Basic Architecture

ResourceManager (RM)

ResourceManager is responsible for the unified management and scheduling of cluster resources, taking over the role of the JobTracker; there is only one active RM in the whole cluster. In general, the RM has the following functions:

    • Handling client requests
    • Starting and monitoring the ApplicationMaster
    • Monitoring NodeManagers
    • Resource allocation and scheduling

NodeManager (NM)

NodeManager manages a single node in the YARN cluster. It provides services for its node, from overseeing the lifetime of containers to monitoring resources and tracking node health. MRv1 managed the execution of Map and Reduce tasks through slots, while NodeManager manages abstract containers, which represent the per-node resources available to a particular application. The NM has the following functions:

    • Managing the resources on a single node
    • Handling commands from the ResourceManager
    • Handling commands from the ApplicationMaster

ApplicationMaster (AM)

Each application has one ApplicationMaster, which is responsible for managing that application. The ApplicationMaster negotiates resources from the ResourceManager and, working with the NodeManagers, monitors container execution and resource usage (CPU, memory, etc.). Note that although current resource types are fairly traditional (CPU cores, memory), new resource types (specialized processing units or dedicated hardware) will be supported in the future. The AM has the following functions:

    • Splitting the input data
    • Requesting resources for the application and assigning them to internal tasks
    • Task monitoring and fault tolerance

Container

Container is the resource abstraction in YARN. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. When the AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers. YARN assigns a Container to each task, and a task can only use the resources described in its Container. Container has the following function:

    • Abstracting the task runtime environment: it encapsulates multi-dimensional resources such as CPU and memory, together with environment variables, startup commands, and other task-related information
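As a sketch of how container resources are requested in practice, the dimensions a Container encapsulates map onto spark-submit flags (the values below are illustrative, not recommendations):

```shell
# Each executor runs inside a YARN container; the container's memory
# and vcores come from the flags below (values are only examples).
#   --executor-memory : memory per executor container
#   --executor-cores  : vcores per executor container
#   --num-executors   : number of executor containers to request
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  --class path.to.your.Class \
  your-app.jar
```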

Spark on YARN: runtime architecture

Review: the basic Spark workflow

With SparkContext as the program's entry point, Spark creates two levels of scheduling during SparkContext initialization: the DAGScheduler (job scheduling) and the TaskScheduler (task scheduling). The job scheduling module is a high-level, stage-based scheduler: it computes multiple scheduling stages for each Spark job (usually split at shuffle boundaries), then builds a concrete set of tasks for each stage (usually taking data locality into account), and submits them to the task scheduling module as TaskSets for execution. The task scheduling module is responsible for actually launching tasks, monitoring them, and reporting on their progress.

YARN Standalone / YARN Cluster

    • yarn-standalone was the name used in Spark 0.9 and earlier; starting with 1.0 it was renamed yarn-cluster
    • yarn-cluster (YarnClusterScheduler)
    • The driver and the AM run together in the cluster; the client is separate
    • ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]

YARN Standalone / YARN Cluster

In yarn-cluster mode, the Spark driver first starts as an ApplicationMaster in the YARN cluster. Each job the client submits to the ResourceManager is assigned a unique ApplicationMaster on a worker node of the cluster, and that ApplicationMaster manages the application's entire life cycle. Because the driver program runs inside YARN, there is no need to start a Spark master/client in advance, and the application's results are not displayed on the client (they can be viewed in the history server).
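As a concrete illustration, the SparkPi example that ships with Spark can be submitted in yarn-cluster mode like this (the example jar's path varies by Spark version, so the path below is an assumption):

```shell
# Submit the bundled SparkPi example in yarn-cluster mode.
# The driver runs inside the AM on the cluster, so the computed value
# of Pi appears in the driver's logs (history server / yarn logs),
# not on this terminal.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  lib/spark-examples*.jar \
  10
```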

YARN Client

    • yarn-client (YarnClientClusterScheduler)
    • The client and the driver run together (locally); the AM is used only to request resources
    • ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client [options] <app jar> [app options]

YARN Client

In yarn-client mode, the driver runs on the client and obtains resources from the RM through the ApplicationMaster. The local driver is responsible for interacting with all executor containers and aggregating the final results. Closing the terminal is equivalent to killing the Spark application. In general, use this mode when the results only need to be returned to the terminal.
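The same SparkPi example in client mode shows the difference (again, the example jar's path is an assumption and varies by Spark version):

```shell
# Client mode: the driver runs in this shell, so SparkPi's result
# ("Pi is roughly ...") prints directly to the terminal, and closing
# the terminal kills the application.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  lib/spark-examples*.jar \
  10
```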

How to choose

    • If you need the results returned to the client, use yarn-client mode
    • If the results are written to HDFS, yarn-cluster mode is recommended

Additional Configuration and considerations

How to change the default configuration

    • SPARK_HOME/conf/spark-defaults.conf: every app picks up this configuration when it is submitted
    • --conf prop=value: specifies parameters for an individual app
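A minimal illustration of the two mechanisms (the property names are real Spark settings; the values are only examples):

```shell
# Cluster-wide defaults, in SPARK_HOME/conf/spark-defaults.conf
# (picked up by every submitted app):
#   spark.master          yarn
#   spark.executor.memory 2g
#
# Per-app override at submit time; --conf takes precedence over
# the values in spark-defaults.conf:
./bin/spark-submit --conf spark.executor.memory=4g ...
```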

Environment variables

    • spark.yarn.appMasterEnv.[EnvironmentVariableName]: sets an environment variable for the ApplicationMaster
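For example, an environment variable can be passed to the AM at submit time (TZ is just an illustrative variable name):

```shell
# Set an environment variable in the ApplicationMaster's environment.
# In cluster mode the driver runs inside the AM, so this is also the
# driver's environment; in client mode it affects only the AM.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.TZ=UTC \
  --class path.to.your.Class \
  your-app.jar
```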

Related configuration

Special attention

    • In cluster mode, both Spark executors and the Spark driver use yarn.nodemanager.local-dirs for scratch space; spark.local.dir is ignored
    • In client mode, Spark executors use yarn.nodemanager.local-dirs, while the Spark driver uses spark.local.dir
    • --files and --archives support the # syntax for renaming: the name after # is the name the file appears under in the container's working directory
    • --jars: distributes extra jars and adds them to the driver and executor classpaths
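A sketch of the # syntax and --jars together (file and jar names below are hypothetical):

```shell
# --files localname#containername: the file on the left is uploaded
# and appears in each container's working directory under the name
# after '#'. --jars takes a comma-separated list of extra jars.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /local/path/app.conf#app.conf \
  --jars extra-lib1.jar,extra-lib2.jar \
  --class path.to.your.Class \
  your-app.jar
```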

The above is the main content of this section. These notes come from the blogger's own learning process; I hope they offer you some guidance. If you find them useful, please show your support; if not, please bear with me, and do point out any mistakes. Follow the blogger to get updates as soon as they are posted. Reprinting is also welcome, but please clearly mark the original address in a prominent position; the blogger reserves the right of final interpretation.
