Introduction to the Principles of YARN


Outline: Hadoop architecture (1.x and 2.x); the background that produced YARN; YARN's basic architecture and principles.

Introduction to the 1.x architecture of Hadoop

In 1.x there can be only one NameNode. Although the SecondaryNameNode keeps a synchronized backup of the NameNode's data, there is always a certain delay, so if the NameNode fails while some data has not yet been synchronized to the SecondaryNameNode, that data may be lost.

In 1.x, HDFS consists of two layers:

Namespace

• Contains information about directories, files, and blocks.

• Supports namespace-related file system operations, such as creating, deleting, modifying, and listing files and directories.

Block Storage Service, which has two parts:

• Block management (implemented in the NameNode):

Provides registration of DataNodes as cluster members and periodically checks their health through heartbeats.

Processes block reports and maintains the locations of the blocks.

Supports operations on blocks, such as creating and deleting blocks and obtaining the storage locations of a block (illustrated in the sketch after this list).

Handles the replication of blocks and manages the placement of the replicas.

• Storage: provided by the DataNodes, which store the blocks on local disks and serve read and write operations.
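To make the block-management role concrete, here is a minimal sketch using the standard HDFS Java API (the path argument is a placeholder supplied by the caller, and the Hadoop client jars and cluster configuration are assumed to be on the classpath); it asks the NameNode for the storage locations of every block of a file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS etc. from core-site.xml/hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]); // placeholder: an HDFS path such as /user/foo/data.txt
        FileStatus status = fs.getFileStatus(file);
        // The NameNode's block-management layer answers this query with the
        // DataNodes that hold each block of the file.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```

The division of labor shows through the API: the location query is served by the NameNode's block management, while the actual reads and writes go to the DataNodes it names.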

Disadvantages:

1. Poor scalability

2. Poor reliability (the NameNode is a single point of failure)

3. Low resource utilization

4. No support for multiple computing frameworks

Introduction to the 2.x architecture of Hadoop

In 2.x, the change in HDFS is mainly reflected in enhanced scalability and availability at the NameNode level: multiple NameNodes can now be deployed at the same time. These NameNodes are independent of one another, that is, they do not need to coordinate with each other. Each DataNode registers with all of the NameNodes as a common storage node for them, sends heartbeats and block reports to all of these NameNodes, and handles the commands any of them sends to it.

Block pools (block pool)

A block pool consists of the set of blocks that belong to a single namespace (NameNode), and the blocks of all block pools in the cluster are stored on the DataNodes. Each block pool is managed independently of the others, so a namespace (NameNode) does not need to coordinate with the block pools of other namespaces when generating block IDs for new blocks, and even if one namespace (NameNode) goes down, the blocks on the DataNodes do not become inaccessible, because the other namespaces (NameNodes) still manage their own block pools on those DataNodes.

A namespace and its block pool together are called a namespace volume. It is a self-contained unit of management: when a NameNode/namespace is deleted, the corresponding block pool stored on the DataNodes is also deleted, and each namespace volume is upgraded as a whole during a cluster upgrade.
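As a sketch of how independent NameNodes coexist, federation is configured in hdfs-site.xml roughly as follows; the nameservice IDs ns1/ns2 and the hostnames are placeholders, while the property names are the real federation settings:

```xml
<!-- hdfs-site.xml fragment (sketch): two independent NameNodes in one cluster.
     DataNodes read dfs.nameservices and register with every NameNode listed. -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>
```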

Cluster ID (clusterid)

A cluster ID was added to identify all the nodes that belong to one cluster. When formatting additional NameNodes, you can specify the same cluster ID to add them to an existing cluster.
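For illustration, the formatting commands look roughly like this (a sketch following the Apache federation guide; <cluster_id> is a placeholder):

```
# Format the first NameNode; supply a cluster ID (one is generated if omitted).
hdfs namenode -format -clusterId <cluster_id>

# Format an additional NameNode into the same cluster by reusing that ID.
hdfs namenode -format -clusterId <cluster_id>
```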

YARN's basic architecture

YARN is Hadoop's resource manager, a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications. It brings great benefits in cluster utilization, unified resource management, and data sharing.

Application scenarios

As a universal, unified resource management system, YARN supports:

1. Long-running applications

2. Short-running applications

The advantages of YARN

YARN greatly reduces the resource consumption of the JobTracker (now the ResourceManager), and it is safer and more elegant to let a separate program monitor the status of each job's subtasks (tasks). In YARN, the ApplicationMaster is a replaceable part: users can write their own AppMaster for different programming models, allowing more types of programming models to run in a Hadoop cluster (see, for example, the mapred-site.xml configuration in the official Hadoop YARN configuration templates). Representing resources by their memory size (the version of YARN current at the time did not take CPU usage into account) is more reasonable than the old count of remaining slots.

In the old framework, a heavy burden on the JobTracker was monitoring the health of the tasks under each job. This work is now handed to the ApplicationMaster, while a module in the ResourceManager called the ApplicationsManager (note: not ApplicationMaster) monitors the health of each ApplicationMaster and restarts it on another machine if a problem occurs.

The Container is the mechanism YARN proposed for future resource isolation. It appears to draw on the work of Mesos; for now it provides only isolation of Java virtual machine memory, but the Hadoop team's design should be able to support the scheduling and control of more kinds of resources. Since resources are expressed as an amount of memory, there is no longer the awkward situation in which fixed map slots and reduce slots partitioned the cluster's resources and left some of them idle.

YARN's core idea is to split up the responsibilities of the JobTracker and TaskTracker. It consists of the following major components:

A. A global resource manager, the ResourceManager
B. A per-node agent of the ResourceManager, the NodeManager
C. An ApplicationMaster representing each application
D. Multiple Containers per ApplicationMaster, each running on a NodeManager

ResourceManager (RM)

The RM is a global resource manager responsible for resource management and allocation across the whole system. It consists primarily of two components: the Scheduler and the Applications Manager (ASM).

The Scheduler allocates the system's resources to the running applications according to capacity, queues, and other constraints (such as a limit on the number of jobs per queue). Note that the Scheduler is a "pure scheduler": it no longer performs any application-specific work, such as monitoring or tracking an application's execution status, or restarting tasks that fail because of application errors or hardware faults; that is handled by the application's own ApplicationMaster. The Scheduler allocates resources purely according to each application's resource requirements, and the unit of allocation is an abstraction called the resource container (Container), a dynamic resource allocation unit that packages memory, CPU, disk, network, and other resources together in order to limit the amount of resources each task may use. In addition, the Scheduler is a pluggable component; users can design a new scheduler according to their own needs, and YARN ships with several directly usable schedulers, such as the Fair Scheduler and the Capacity Scheduler.
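The scheduler in use is selected in yarn-site.xml. A minimal sketch, using the real property name and the stock scheduler classes that ship with YARN:

```xml
<!-- yarn-site.xml fragment (sketch): choosing a pluggable scheduler. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <!-- Swap in ...scheduler.fair.FairScheduler to use the Fair Scheduler instead. -->
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
</configuration>
```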
The Applications Manager is responsible for managing all applications in the system, including accepting application submissions, negotiating resources with the Scheduler to start the ApplicationMaster, monitoring the ApplicationMaster's running state, and restarting it on failure.

ApplicationMaster (AM)

Each application submitted by a user contains an AM. Its duties include negotiating with the RM Scheduler for resources (expressed as Containers), further assigning the obtained resources to the application's internal tasks (the second level of resource allocation), communicating with NMs to start and stop tasks, monitoring the running state of all tasks, and requesting resources again to restart a task when it fails. YARN currently ships with two AM implementations: one is distributedshell, an example program demonstrating how to write an AM, which can request a number of Containers to run a shell command or shell script in parallel; the other is MRAppMaster, the AM that runs MapReduce applications. Note: the RM is only responsible for monitoring the AM and starting it again when it fails; the RM is not responsible for fault tolerance of the tasks inside the AM, which is the AM's own job.

NodeManager (NM)

The NM is the resource and task manager on each node. On the one hand, it periodically reports the node's resource usage and the running state of each Container to the RM; on the other hand, it receives and processes requests from AMs, such as starting and stopping Containers.

Container

The Container is the resource abstraction in YARN. It encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk, and network. When an AM requests resources from the RM, the resources the RM returns to the AM are expressed as Containers. YARN assigns a Container to each task, and a task can use only the resources described in its Container. Notes: 1. A Container differs from the slot in MRv1: it is a dynamic resource-partitioning unit, generated dynamically according to the application's needs. 2. At present YARN supports only CPU and memory resources, and it uses a lightweight resource isolation mechanism, cgroups, for resource isolation.

YARN's resource management and execution framework follows the master/slave paradigm. The slave, the NodeManager (NM), runs on each node, monitors it, and reports resource availability to the master, the cluster's ResourceManager (RM); the RM ultimately allocates resources to all the applications in the system. The execution of a particular application is controlled by its ApplicationMaster, which is responsible for splitting the application into multiple tasks and negotiating with the ResourceManager for the resources required to execute them; once resources have been allocated, the ApplicationMaster works with the NodeManagers to schedule, execute, and monitor the individual application tasks (the sketch below shows this AM-side flow in miniature). It is worth noting that YARN's different service components communicate through an event-driven asynchronous concurrency mechanism, which simplifies the design of the system.
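A minimal sketch of that AM-side flow, using Hadoop's YARN client libraries (AMRMClient and NMClient): register with the RM, ask the RM Scheduler for a Container, then ask the NM on the granted node to launch a command in it. This is an illustration under assumptions, not a complete AM: it must itself be launched by YARN as the AM of a submitted application, and it omits the error handling, security tokens, and completed-container tracking that a real AM such as distributedshell performs.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.util.Records;

public class MiniAppMaster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();

        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // Register this AM with the ResourceManager (host/port/tracking URL omitted).
        rmClient.registerApplicationMaster("", 0, "");

        // First level of allocation: ask the RM Scheduler for one Container
        // with 512 MB of memory and 1 virtual core.
        Resource capability = Resource.newInstance(512, 1);
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        int launched = 0;
        while (launched < 1) {
            // Heartbeat to the RM; granted Containers arrive asynchronously.
            AllocateResponse response = rmClient.allocate(0.1f);
            for (Container container : response.getAllocatedContainers()) {
                // Second level of allocation: the AM decides what runs in the
                // Container and asks that node's NodeManager to launch it.
                ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
                ctx.setCommands(Collections.singletonList("sleep 10"));
                nmClient.startContainer(container, ctx);
                launched++;
            }
            Thread.sleep(1000);
        }

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
    }
}
```

The two halves of the loop mirror the two-level allocation described above: the RM hands the AM Containers, and the AM alone decides what to run in them and tells the NMs to do it.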

Source: http://baike.baidu.com/link?url=89uk7XaDK00jje0PXqDI_Gwumyzeavnqfcalsc7b01phnloqhcrzg-w27ugs5ynrbb1mtf2jka97pz5imhx-q_
