Apache Hadoop YARN: Background and overview


Apache Hadoop YARN (YARN = Yet Another Resource Negotiator) has been a sub-project of Apache Hadoop since August 2012. Since then, Apache Hadoop has consisted of the following four sub-projects:

    • Hadoop Common: core libraries and services shared by the other parts
    • Hadoop HDFS: distributed storage system
    • Hadoop MapReduce: open-source implementation of the MapReduce model
    • Hadoop YARN: a new generation of Hadoop data processing framework

In summary, the purpose of Hadoop YARN is to take Hadoop's data processing capabilities beyond MapReduce. As is well known, Hadoop HDFS is Hadoop's data storage layer and Hadoop MapReduce is its processing layer. However, MapReduce can no longer meet today's wide range of data processing needs, such as real-time/near-real-time computation and graph processing. Hadoop YARN provides a more general framework for resource management and distributed applications. Within this framework, users can implement customized data processing applications according to their own needs, and Hadoop MapReduce becomes just one application running on YARN. We will see that MPI, graph processing, online services, and so on (for example Spark, Storm, and HBase) can run as YARN applications just like Hadoop MapReduce. The following sections describe the traditional Hadoop MapReduce architecture and the next-generation Hadoop YARN architecture.

The traditional Apache Hadoop MapReduce architecture

The traditional Apache Hadoop MapReduce system consists of a JobTracker and TaskTrackers. The JobTracker is the master, of which there is only one; the TaskTrackers are slaves, one deployed on each node.

Figure 1: Apache Hadoop MapReduce system architecture

The JobTracker is responsible for resource management (by managing the TaskTracker nodes), tracking resource consumption and release, and job lifecycle management (scheduling each task of a job, tracking task progress, providing fault tolerance for tasks, and so on). The TaskTracker's responsibilities are simpler: starting and stopping tasks assigned by the JobTracker, and periodically reporting task progress and status information back to it.
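To make the division of labor concrete, here is a toy model (plain Python, not actual Hadoop code; all class and method names are illustrative) of the JobTracker/TaskTracker split. It shows the coupling that YARN later separates: the single master both tracks cluster resources (slots) and schedules tasks, while slaves only run tasks and send heartbeats.

```python
class TaskTracker:
    """Slave node: runs tasks and reports progress (toy model)."""
    def __init__(self, node_id, slots):
        self.node_id = node_id
        self.free_slots = slots
        self.running = {}          # task_id -> progress in [0.0, 1.0]

    def start_task(self, task_id):
        assert self.free_slots > 0
        self.free_slots -= 1
        self.running[task_id] = 0.0

    def heartbeat(self):
        """Periodic report sent back to the JobTracker."""
        return {"node": self.node_id,
                "free_slots": self.free_slots,
                "progress": dict(self.running)}

class JobTracker:
    """Single master: resource management AND job scheduling (toy model)."""
    def __init__(self):
        self.trackers = {}

    def register(self, tracker):
        self.trackers[tracker.node_id] = tracker

    def schedule(self, task_id):
        # Pick any tracker with a free slot (real Hadoop also weighs
        # data locality, which this sketch omits).
        for t in self.trackers.values():
            if t.free_slots > 0:
                t.start_task(task_id)
                return t.node_id
        return None  # cluster full

jt = JobTracker()
tt = TaskTracker("node-1", slots=2)
jt.register(tt)
assigned = jt.schedule("map-0001")
```

Because one process carries both responsibilities, every heartbeat and every scheduling decision in a large cluster flows through the single JobTracker, which is exactly the pressure point discussed below.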

Apache Hadoop YARN architecture

YARN's most basic idea is to split the JobTracker's two main responsibilities, resource management and job scheduling/management, into two separate roles: a global ResourceManager, and a per-application ApplicationMaster. The ResourceManager, together with a NodeManager on each node, forms a new general-purpose system for managing applications in a distributed manner.

Figure 2: Apache Hadoop YARN architecture

The ResourceManager is the ultimate authority that arbitrates resources among the applications in the system. Each application's ApplicationMaster is responsible for negotiating resources with the ResourceManager and working with the NodeManagers to execute and monitor tasks. The ResourceManager has a pluggable scheduler that allocates resources to individual applications subject to constraints such as capacities and queues. It is a pure scheduler: it neither manages nor tracks application status, nor restarts tasks that fail due to hardware errors or application problems. The scheduler performs scheduling purely according to applications' resource requirements, and the unit it schedules is an abstract resource Container, which bundles resource elements such as memory, CPU, network, and disk.
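The "pure scheduler" idea can be sketched in a few lines. This is an illustrative model, not the real YARN scheduler API: it matches abstract resource Containers (here simplified to memory and vcores) against capacity limits and does nothing else, i.e. no task tracking and no failure handling.

```python
from dataclasses import dataclass

@dataclass
class Container:
    """Abstract resource unit (simplified to memory and vcores)."""
    memory_mb: int
    vcores: int

class PureScheduler:
    """Grants or denies container requests against capacity; nothing more."""
    def __init__(self, total_memory_mb, total_vcores):
        self.free_memory = total_memory_mb
        self.free_vcores = total_vcores

    def allocate(self, request):
        """Grant the container if capacity allows, else return None."""
        if (request.memory_mb <= self.free_memory
                and request.vcores <= self.free_vcores):
            self.free_memory -= request.memory_mb
            self.free_vcores -= request.vcores
            return request
        return None

    def release(self, container):
        """Return a finished container's resources to the pool."""
        self.free_memory += container.memory_mb
        self.free_vcores += container.vcores

sched = PureScheduler(total_memory_mb=8192, total_vcores=4)
granted = sched.allocate(Container(memory_mb=2048, vcores=1))
denied = sched.allocate(Container(memory_mb=16384, vcores=1))
```

Note what is absent: the scheduler never learns whether a granted container's task succeeded. That responsibility belongs to the ApplicationMaster, as described below.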

The NodeManager is the per-node slave. It is responsible for launching the applications' containers, monitoring their resource usage (memory, CPU, network, disk), and reporting overall resource usage to the ResourceManager.

Each application's ApplicationMaster is responsible for negotiating appropriate resource containers from the ResourceManager's scheduler, tracking their status, and monitoring progress. From the system's point of view, the ApplicationMaster itself runs as an ordinary container.
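The negotiation between an ApplicationMaster and the ResourceManager is incremental: the ResourceManager may grant only part of a request in each allocation round, so the ApplicationMaster loops until it has what it needs. The sketch below models that loop with made-up names (this is not the real YARN client API, and `StubResourceManager` is a stand-in that grants at most two containers per round).

```python
def negotiate(resource_manager, wanted, max_rounds=10):
    """ApplicationMaster side: accumulate containers over rounds."""
    granted = []
    for _ in range(max_rounds):
        # Ask only for what is still missing.
        granted.extend(resource_manager.allocate(wanted - len(granted)))
        if len(granted) >= wanted:
            break
    return granted

class StubResourceManager:
    """Stand-in RM that grants at most 2 containers per round."""
    def __init__(self, capacity):
        self.capacity = capacity

    def allocate(self, n):
        grant = min(n, 2, self.capacity)
        self.capacity -= grant
        return ["container"] * grant

rm = StubResourceManager(capacity=5)
containers = negotiate(rm, wanted=5)
```

In real YARN this exchange is piggybacked on the ApplicationMaster's heartbeat to the ResourceManager, and the granted containers are then launched via the NodeManagers.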

Summary

Because of MapReduce's limitations as a computational model, Hadoop implemented YARN as a more general resource management system, with MapReduce as just one application on it. Applications for various computational models can be implemented on YARN to meet business needs. In addition, because YARN splits up the JobTracker's main work, the pressure on the master is greatly reduced (the ResourceManager bears a much smaller workload than the JobTracker did), so a YARN system can support a larger cluster.

Reprint address: http://blog.csdn.net/liangliyin/article/details/20729281


