Spark on YARN
YARN Overview
What is YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop resource manager: a general-purpose resource management system that provides unified resource management and scheduling for upper-level applications. Its introduction brings great benefits to cluster utilization, unified resource management, and data sharing.
YARN's place in the Hadoop ecosystem
Background: why YARN emerged
With the rapid development of the Internet, the disk-based offline computing framework MapReduce could no longer meet application requirements, and a number of new computing frameworks emerged to handle various scenarios, including in-memory, streaming, and iterative computing frameworks. MRv1, however, could not support multiple computing frameworks coexisting.
YARN Basic Architecture
ResourceManager (RM)
ResourceManager is responsible for the unified management and scheduling of cluster resources and takes over the role of the JobTracker; the whole cluster has only one. In general, RM has the following functions:
- 1. Handling client requests
- 2. Starting and monitoring ApplicationMasters
- 3. Monitoring NodeManagers
- 4. Allocating and scheduling resources
NodeManager (NM)
NodeManager manages a single node in the YARN cluster. It provides services on each node, from overseeing the lifetime of a container to monitoring resources and tracking node health. MRv1 managed the execution of Map and Reduce tasks through slots, while NodeManager manages abstract containers that represent the per-node resources available to a particular application. NM has the following responsibilities:
- 1. Managing resources on a single node
- 2. Handling commands from the ResourceManager
- 3. Handling commands from the ApplicationMaster
ApplicationMaster (AM)
Each application has one ApplicationMaster that is responsible for managing that application. The ApplicationMaster negotiates resources from the ResourceManager and, through the NodeManagers, monitors container execution and resource usage (CPU, memory, etc.). Note that although the resource types are currently fairly traditional (CPU cores, memory), new resource types (such as specific processing units or dedicated processing devices) may be supported in the future. AM has the following responsibilities:
- 1. Splitting the input data
- 2. Requesting resources for the application and assigning them to internal tasks
- 3. Task monitoring and fault tolerance
Container
Container is the resource abstraction in YARN. It encapsulates multidimensional resources on a node, such as memory, CPU, disk, and network. When an AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers. YARN assigns a Container to each task, and the task can use only the resources described in that Container. Container has the following role:
- It abstracts the task runtime environment, encapsulating multidimensional resources such as CPU and memory, along with environment variables, launch commands, and other task-related information
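In Spark on YARN, the container resources described above are requested through spark-submit flags: each executor runs in its own YARN container sized by these options. A minimal sketch (the memory and core values, class name, and jar path are illustrative only):

```shell
# Request 4 YARN containers, each with 2 GB of memory and 2 cores,
# for the Spark executors (values are illustrative).
./bin/spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class org.example.MyApp \
  myapp.jar
```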
Spark on YARN runtime architecture
A review of the basic Spark workflow
With SparkContext as the program's entry point, Spark creates two levels of scheduling during SparkContext initialization: the DAGScheduler (job scheduler) and the TaskScheduler (task scheduler). The job scheduling module is a high-level, stage-oriented scheduler: it computes multiple scheduling stages for each Spark job (usually split at shuffle boundaries), then builds a concrete set of tasks for each stage (usually taking data locality into account) and submits them to the task scheduling module as TaskSets for execution. The task scheduling module is responsible for actually launching tasks, monitoring them, and reporting on their execution.
YARN Standalone / YARN Cluster
- yarn-standalone is the name used in 0.9 and earlier versions; from 1.0 onward it was renamed yarn-cluster
- yarn-cluster (YarnClusterScheduler)
- The driver and the AM run together in the cluster; the client runs separately
- ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] [app options]
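A concrete cluster-mode submission might look like the following (the application class, jar path, and resource sizes are hypothetical placeholders):

```shell
# Hypothetical application class and jar; replace with your own.
./bin/spark-submit \
  --class org.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 2g \
  myapp.jar arg1 arg2
```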
The Spark driver starts as an ApplicationMaster in the YARN cluster. Each job the client submits to the ResourceManager is assigned a unique ApplicationMaster on a worker node of the cluster, and that ApplicationMaster manages the entire life cycle of the application. Because the driver program runs inside YARN, there is no need to start a Spark master/client in advance, and the application's results cannot be displayed on the client (they can be viewed in the history server).
YARN Client
- yarn-client (YarnClientClusterScheduler)
- The client and the driver run together (locally); the AM is used only to request resources
- ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client [options] [app options]
In yarn-client mode, the driver runs on the client and obtains resources from the RM through the ApplicationMaster. The local driver is responsible for interacting with all the executor containers and aggregating the final results. Closing the terminal is equivalent to killing the Spark application. In general, use this mode when the result of the run needs to be returned to the terminal.
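Because the driver stays on the client in this mode, interactive shells such as spark-shell must use it (the REPL driver has to run locally to interact with the user):

```shell
# Start an interactive Spark shell on YARN; client deploy mode
# is required (and is the default) for interactive sessions.
./bin/spark-shell --master yarn --deploy-mode client
```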
How to choose
- If you need results returned to the client, use yarn-client mode
- If results are stored to HDFS, yarn-cluster mode is recommended
Additional configuration and considerations
How to change the default configuration
- $SPARK_HOME/conf/spark-defaults.conf: each app uses this configuration when it is submitted
- --conf prop=value: specifies per-app parameters for an individual app
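For example, defaults set in spark-defaults.conf can be overridden for a single app with --conf at submit time (the property values, queue name, class, and jar below are illustrative):

```shell
# $SPARK_HOME/conf/spark-defaults.conf (applies to every app):
#   spark.executor.memory   2g
#   spark.yarn.queue        default

# Per-app override at submit time:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.yarn.queue=analytics \
  --class org.example.MyApp \
  myapp.jar
```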
Environment variables
- $SPARK_HOME/conf/spark-defaults.conf: each app uses this configuration when it is submitted
- spark.yarn.appMasterEnv.[EnvironmentVariableName]
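For instance, environment variables can be passed to the ApplicationMaster process like this (the variable names and values are illustrative, as are the class and jar):

```shell
# Each spark.yarn.appMasterEnv.* property sets one environment
# variable in the ApplicationMaster's environment.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/jdk8 \
  --conf spark.yarn.appMasterEnv.MY_ENV=some-value \
  --class org.example.MyApp \
  myapp.jar
```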
Related configuration
Special attention
- In cluster mode, yarn.nodemanager.local-dirs is used by both the Spark executors and the Spark driver, and spark.local.dir is ignored
- In client mode, the Spark executors use yarn.nodemanager.local-dirs, while the Spark driver uses spark.local.dir
- --files and --archives support the # syntax for mapping a file (e.g. on HDFS) to a different name as seen by the application
- --jars: comma-separated list of jars to distribute to the driver and executor classpaths
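A sketch combining these flags (all paths and names below are hypothetical); the part after # is the name the application sees:

```shell
# app.conf is shipped from HDFS and visible to the app as "app.conf";
# deps.zip is unpacked on each node under the directory "deps".
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files hdfs:///config/app.conf#app.conf \
  --archives deps.zip#deps \
  --jars lib1.jar,lib2.jar \
  --class org.example.MyApp \
  myapp.jar
```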
The above is the main content of this section, which the blogger put together for everyone during his own learning process. I hope it gives you some guidance; if you found it useful, please give it a like, and if not, please forgive me. If there are mistakes, please point them out. Follow the blogger to get updates as soon as they are published, thank you! Reprints are also welcome, but the original address must be clearly marked in the post, and the blogger reserves the right of final interpretation!