Introduction to basic terms in YARN/MRv2


YARN/MRv2 is the next-generation MapReduce framework (see Hadoop 0.23.0). It is a complete departure from the original MapReduce framework and improves on it in extensibility, fault tolerance, and generality; reportedly it has been rewritten from scratch in more than 150,000 lines of code. This article introduces the meaning of the basic terms in YARN/MRv2 and helps interested programmers gain a preliminary understanding of YARN.

(1) YARN

YARN is the name of the next-generation MapReduce framework; for ease of memory it is commonly called MRv2 (MapReduce version 2). It is no longer a traditional MapReduce framework, and in fact is not tied to MapReduce at all: it is a generic runtime on which users can write their own computational frameworks and run them. A user-written framework is a client-side library that is packaged with the application when a job is submitted; a minimal client-side submission sketch follows the component list below. YARN provides the following components:

<1> Resource management: covering both application management and machine resource management

<2> Two-tier resource scheduling

<3> Fault tolerance: fault tolerance is designed into every component

<4> Scalability: the framework can scale to tens of thousands of nodes
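
To make the "client library" point above concrete, here is a minimal, hedged sketch of submitting an application with the org.apache.hadoop.yarn.client.api.YarnClient API from later Hadoop 2.x releases (the 0.23-era client protocol differed in detail); the application name, AM launch command, queue, and resource sizes are illustrative placeholders, not values from this article:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
    yarnClient.start();

    // Ask the RM for a new application id and an empty submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("my-framework-demo");   // placeholder name

    // Describe how to launch the framework's ApplicationMaster (hypothetical AM class).
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java com.example.MyApplicationMaster"
        + " 1>/tmp/am.stdout 2>/tmp/am.stderr"));
    ctx.setAMContainerSpec(amContainer);

    // Resources for the AM container itself: 1 GB of memory, 1 virtual core (illustrative).
    ctx.setResource(Resource.newInstance(1024, 1));
    ctx.setQueue("default");

    ApplicationId appId = yarnClient.submitApplication(ctx);   // hands the app to the RM
    System.out.println("Submitted application " + appId);
  }
}
```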

The better-known computational frameworks at present include:

MapReduce: Google's computing framework, widely used for large-scale data processing at Internet companies. Its drawbacks include the lack of support for DAG jobs, iterative computation, and so on.

Apache Giraph: a graph-processing framework based on the BSP (bulk-synchronous parallel) model; it can be used for iterative algorithms such as PageRank, shared connections, and personalization-based popularity.

Apache Hama: a distributed computing framework based on the BSP model that can be used for large-scale scientific computing such as matrix, graph, and network algorithms. It was inspired by Google's Pregel, but unlike Pregel, Hama is a more general framework and is not limited to graph algorithms.

Open MPI: a library for high-performance computing, typically used in HPC settings. It offers better performance and more user control than MapReduce, but programming with it is complex and its fault tolerance is poor; in practice, MPI and MapReduce are used for different kinds of applications.

HBase: the Hadoop database, a highly reliable, high-performance, column-oriented, scalable distributed storage system modeled after Google's BigTable. It has become increasingly popular in recent years and is gradually replacing Cassandra (at Hadoop in China 2011, Facebook engineers said they had already given up on Cassandra and switched to HBase).

These frameworks each have their strengths and are in use at various Internet companies. Deploying each of them separately is cumbersome; with YARN available, they can all be deployed uniformly in a YARN environment. At present only MapReduce runs on YARN, while several others are being ported; for details see:

  • Apache Hadoop MapReduce, of course! – https://issues.apache.org/jira/browse/MAPREDUCE-279
  • Spark – https://github.com/mesos/spark-yarn/
  • Apache Hama – https://issues.apache.org/jira/browse/HAMA-431
  • Apache Giraph – https://issues.apache.org/jira/browse/GIRAPH-13
  • Open MPI – https://issues.apache.org/jira/browse/MAPREDUCE-2911
  • Generic coprocessors for Apache HBase – https://issues.apache.org/jira/browse/HBASE-4047
  • Apache HBase deployment using YARN – https://issues.apache.org/jira/browse/HBASE-4329

(2) ResourceManager

Abbreviation "RM".

The most fundamental design idea of MRv2 is to split the JobTracker's two main functions, resource management and job scheduling/monitoring, into two separate processes. The solution has two components: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An "application" here means a single MapReduce job or a DAG job. The RM, together with the NodeManagers (NM, one per node), forms the data computing framework. The RM is the ultimate authority that allocates resources among all applications in the system. The AM is in effect a framework-specific library whose job is to negotiate with the RM for the resources the application needs and to work with the NMs to execute and monitor the application's tasks.

The RM consists of two components:

Scheduler

Applications Manager (ApplicationsManager, ASM)

The Scheduler allocates resources to the running applications subject to constraints such as capacities and queues (for example, allocating a certain amount of resources to each queue, or running at most a certain number of jobs per queue). It is a "pure scheduler": it is no longer responsible for monitoring or tracking the execution status of applications, nor for restarting tasks that fail because of application errors or hardware faults. The Scheduler schedules purely according to each application's resource requirements, which are expressed through the abstract notion of a "resource container" (Container); a container encapsulates resources such as memory, CPU, disk, and network, and limits the amount of resources each task may use.

The Scheduler has a pluggable policy plug-in that is responsible for partitioning cluster resources among the queues and applications. The existing MapReduce schedulers, such as the Capacity Scheduler and the Fair Scheduler, can be used as this plug-in.
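
For example, which scheduler plug-in the RM loads is controlled by a single configuration property. A hedged sketch follows, assuming the Hadoop 2.x property name yarn.resourcemanager.scheduler.class; in practice this is set in yarn-site.xml on the ResourceManager node rather than in code, and the programmatic form below only illustrates the property and its value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerConfigExample {
  public static void main(String[] args) {
    // Normally this property lives in yarn-site.xml; set here only for illustration.
    Configuration conf = new YarnConfiguration();
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
    // The Capacity Scheduler alternative would be:
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
  }
}
```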

(3) NodeManager

Referred to as "NM".

The NM is the framework agent on each node. It is mainly responsible for launching the containers required by applications, monitoring their resource usage (memory, CPU, disk, network, etc.), and reporting that usage to the Scheduler.

Bottom line: "NM is primarily used to manage tasks and resources on a node."

(4) ApplicationsManager

Abbreviation "ASM".

The ASM is mainly responsible for accepting submitted jobs, negotiating the first container in which to run the AM, and restarting the AM's container when it fails.

Bottom line: "ASM is primarily used to manage AM".

(5) ApplicationMaster

short for "AM".

The AM is mainly responsible for negotiating appropriate containers with the Scheduler, tracking the status of those containers, and monitoring their progress.

Bottom line: "AM is primarily used to manage its corresponding applications, such as MapReduce jobs, dag jobs, and so on."

(6) Container

A container encapsulates machine resources such as memory, CPU, disk, and network. Each task is assigned a container, can only execute within that container, and may use no more than the resources the container encapsulates.
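
A minimal sketch of how such a resource capability is expressed with the Hadoop 2.x records API; note that in that API only memory (in MB) and virtual cores are first-class dimensions, and the numbers below are purely illustrative:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerCapabilityExample {
  public static void main(String[] args) {
    // A container capability: 2048 MB of memory and 2 virtual cores (illustrative values).
    Resource capability = Resource.newInstance(2048, 2);

    // An AM wraps the capability in a request; null nodes/racks means "anywhere".
    ContainerRequest ask =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    System.out.println("Asking for: " + ask.getCapability());
  }
}
```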

How do I deploy a computing framework (MapReduce, Hama, Giraph, etc.) on YARN?

A: You need to write an ApplicationMaster for it (see the sketch below).
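
At a minimum, such an ApplicationMaster registers with the RM, negotiates containers, and unregisters when finished. A hedged skeleton using the Hadoop 2.x AMRMClient API; the host/port/tracking-URL arguments and the single 1 GB container request are placeholders:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinimalApplicationMaster {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();

    // 1. Register this AM with the ResourceManager (host/port/tracking URL are placeholders).
    rm.registerApplicationMaster("", 0, "");

    // 2. Negotiate containers: ask for one container with 1 GB / 1 vcore (illustrative).
    rm.addContainerRequest(new ContainerRequest(
        Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

    // 3. Heartbeat until the RM grants the container; a real AM would then use
    //    NMClient to launch work in it and track task completion.
    boolean granted = false;
    while (!granted) {
      AllocateResponse response = rm.allocate(0.1f);   // progress value is arbitrary here
      for (Container c : response.getAllocatedContainers()) {
        System.out.println("Granted container " + c.getId() + " on " + c.getNodeId());
        granted = true;
      }
      Thread.sleep(1000);
    }

    // 4. Tell the RM this application finished successfully.
    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
  }
}
```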

Resources

(1) Yahoo claims to be a huge contributor to Apache Hadoop: http://oss.org.cn/?action-viewnews-itemid-62734

(2) The Next Generation of Apache Hadoop MapReduce: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/

(3) The Next Generation of Apache Hadoop MapReduce – The Scheduler: http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/

(4) Apache Hadoop NextGen MapReduce (YARN): http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html

Reprinted from Dong's Blog

