Introduction to basic terms in YARN/MRv2


YARN/MRv2 is the next-generation MapReduce framework (see Hadoop 0.23.0). It is a complete departure from the original MapReduce framework and improves on it in extensibility, fault tolerance, and generality; reportedly it has been rewritten from scratch in more than 150,000 lines of code. This article introduces the meaning of the basic terms in YARN/MRv2 and helps interested programmers gain a preliminary understanding of YARN.

(1) YARN

YARN is the name of the next-generation MapReduce framework; for ease of memory it is commonly called MRv2 (MapReduce version 2). It is no longer a traditional MapReduce framework, and in fact is not tied to MapReduce at all: it is a generic runtime on which users can write their own computational frameworks and run them. A user-written framework is a client-side library that is packaged with the application when a job is submitted; a minimal client-side submission sketch follows the component list below. YARN provides the following components:

<1> Resource management: covering both application management and machine resource management

<2> Two-tier resource scheduling

<3> Fault tolerance: fault tolerance is designed into every component

<4> Scalability: the framework can scale to tens of thousands of nodes
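
To make the "client library" point above concrete, here is a minimal, hedged sketch of submitting an application with the org.apache.hadoop.yarn.client.api.YarnClient API from later Hadoop 2.x releases (the 0.23-era client protocol differed in detail); the application name, AM launch command, queue, and resource sizes are illustrative placeholders, not values from this article:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
    yarnClient.start();

    // Ask the RM for a new application id and an empty submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("my-framework-demo");   // placeholder name

    // Describe how to launch the framework's ApplicationMaster (hypothetical AM class).
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java com.example.MyApplicationMaster"
        + " 1>/tmp/am.stdout 2>/tmp/am.stderr"));
    ctx.setAMContainerSpec(amContainer);

    // Resources for the AM container itself: 1 GB of memory, 1 virtual core (illustrative).
    ctx.setResource(Resource.newInstance(1024, 1));
    ctx.setQueue("default");

    ApplicationId appId = yarnClient.submitApplication(ctx);   // hands the app to the RM
    System.out.println("Submitted application " + appId);
  }
}
```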

The better-known computational frameworks at present include:

MapReduce: Google's computing framework, widely used for large-scale data processing at Internet companies. Its drawbacks include the lack of support for DAG jobs, iterative computation, and so on.

Apache Giraph: a graph-processing framework based on the BSP (bulk-synchronous parallel) model; it can be used for iterative algorithms such as PageRank, shared connections, and personalization-based popularity.

Apache Hama: a distributed computing framework based on the BSP model that can be used for large-scale scientific computing such as matrix, graph, and network algorithms. It was inspired by Google's Pregel, but unlike Pregel, Hama is a more general framework and is not limited to graph algorithms.

Open MPI: a library for high-performance computing, typically used in HPC settings. It offers better performance and more user control than MapReduce, but programming with it is complex and its fault tolerance is poor; in practice, MPI and MapReduce are used for different kinds of applications.

HBase: the Hadoop database, a highly reliable, high-performance, column-oriented, scalable distributed storage system modeled after Google's BigTable. It has become increasingly popular in recent years and is gradually replacing Cassandra (at Hadoop in China 2011, Facebook engineers said they had already given up on Cassandra and switched to HBase).

These frameworks each have their strengths and are in use at various Internet companies. Deploying each of them separately is cumbersome; with YARN available, they can all be deployed uniformly in a YARN environment. At present only MapReduce runs on YARN, while several others are being ported; for details see:

  • Apache Hadoop MapReduce, of course! – https://issues.apache.org/jira/browse/MAPREDUCE-279
  • Spark – https://github.com/mesos/spark-yarn/
  • Apache Hama – https://issues.apache.org/jira/browse/HAMA-431
  • Apache Giraph – https://issues.apache.org/jira/browse/GIRAPH-13
  • Open MPI – https://issues.apache.org/jira/browse/MAPREDUCE-2911
  • Generic coprocessors for Apache HBase – https://issues.apache.org/jira/browse/HBASE-4047
  • Apache HBase deployment using YARN – https://issues.apache.org/jira/browse/HBASE-4329

(2) ResourceManager

Abbreviation "RM".

The most fundamental design idea of MRv2 is to split the JobTracker's two main functions, resource management and job scheduling/monitoring, into two separate processes. The solution has two components: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An "application" here means a single MapReduce job or a DAG job. The RM, together with the NodeManagers (NM, one per node), forms the data computing framework. The RM is the ultimate authority that allocates resources among all applications in the system. The AM is in effect a framework-specific library whose job is to negotiate with the RM for the resources the application needs and to work with the NMs to execute and monitor the application's tasks.

The RM consists of two components:

Scheduler

Applications Manager (ApplicationsManager, ASM)

The Scheduler allocates resources to the running applications subject to constraints such as capacities and queues (for example, allocating a certain amount of resources to each queue, or running at most a certain number of jobs per queue). It is a "pure scheduler": it is no longer responsible for monitoring or tracking the execution status of applications, nor for restarting tasks that fail because of application errors or hardware faults. The Scheduler schedules purely according to each application's resource requirements, which are expressed through the abstract notion of a "resource container" (Container); a container encapsulates resources such as memory, CPU, disk, and network, and limits the amount of resources each task may use.

The Scheduler has a pluggable policy plug-in that is responsible for partitioning cluster resources among the queues and applications. The existing MapReduce schedulers, such as the Capacity Scheduler and the Fair Scheduler, can be used as this plug-in.
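
For example, which scheduler plug-in the RM loads is controlled by a single configuration property. A hedged sketch follows, assuming the Hadoop 2.x property name yarn.resourcemanager.scheduler.class; in practice this is set in yarn-site.xml on the ResourceManager node rather than in code, and the programmatic form below only illustrates the property and its value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerConfigExample {
  public static void main(String[] args) {
    // Normally this property lives in yarn-site.xml; set here only for illustration.
    Configuration conf = new YarnConfiguration();
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
    // The Capacity Scheduler alternative would be:
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
  }
}
```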

(3) NodeManager

Referred to as "NM".

The NM is the framework agent on each node. It is mainly responsible for launching the containers required by applications, monitoring their resource usage (memory, CPU, disk, network, etc.), and reporting that usage to the Scheduler.

Bottom line: "NM is primarily used to manage tasks and resources on a node."

(4) ApplicationsManager

Abbreviation "ASM".

The ASM is mainly responsible for accepting submitted jobs, negotiating the first container in which to run the AM, and restarting the AM's container when it fails.

Bottom line: "ASM is primarily used to manage AM".

(5) ApplicationMaster

short for "AM".

The AM is mainly responsible for negotiating appropriate containers with the Scheduler, tracking the status of those containers, and monitoring their progress.

Bottom line: "AM is primarily used to manage its corresponding applications, such as MapReduce jobs, dag jobs, and so on."

(6) Container

A container encapsulates machine resources such as memory, CPU, disk, and network. Each task is assigned a container, can only execute within that container, and may use no more than the resources the container encapsulates.
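
A minimal sketch of how such a resource capability is expressed with the Hadoop 2.x records API; note that in that API only memory (in MB) and virtual cores are first-class dimensions, and the numbers below are purely illustrative:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerCapabilityExample {
  public static void main(String[] args) {
    // A container capability: 2048 MB of memory and 2 virtual cores (illustrative values).
    Resource capability = Resource.newInstance(2048, 2);

    // An AM wraps the capability in a request; null nodes/racks means "anywhere".
    ContainerRequest ask =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    System.out.println("Asking for: " + ask.getCapability());
  }
}
```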

How do I deploy a computing framework (MapReduce, Hama, Giraph, etc.) on YARN?

A: You need to write an ApplicationMaster for it (see the sketch below).
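
At a minimum, such an ApplicationMaster registers with the RM, negotiates containers, and unregisters when finished. A hedged skeleton using the Hadoop 2.x AMRMClient API; the host/port/tracking-URL arguments and the single 1 GB container request are placeholders:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinimalApplicationMaster {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();

    // 1. Register this AM with the ResourceManager (host/port/tracking URL are placeholders).
    rm.registerApplicationMaster("", 0, "");

    // 2. Negotiate containers: ask for one container with 1 GB / 1 vcore (illustrative).
    rm.addContainerRequest(new ContainerRequest(
        Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

    // 3. Heartbeat until the RM grants the container; a real AM would then use
    //    NMClient to launch work in it and track task completion.
    boolean granted = false;
    while (!granted) {
      AllocateResponse response = rm.allocate(0.1f);   // progress value is arbitrary here
      for (Container c : response.getAllocatedContainers()) {
        System.out.println("Granted container " + c.getId() + " on " + c.getNodeId());
        granted = true;
      }
      Thread.sleep(1000);
    }

    // 4. Tell the RM this application finished successfully.
    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
  }
}
```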

Resources

(1) Yahoo claims to be a huge contributor to Apache Hadoop: http://oss.org.cn/?action-viewnews-itemid-62734

(2) The Next Generation of Apache Hadoop MapReduce: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/

(3) The Next Generation of Apache Hadoop MapReduce – The Scheduler: http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/

(4) Apache Hadoop NextGen MapReduce (YARN): http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html

Reprinted from Dong's Blog

