The next generation of MapReduce for Apache Hadoop: YARN


The Hadoop project I worked on previously was based on version 0.20.2; after looking into the documentation, I learned that it uses the original Map/Reduce model.

Official note:
1.1.x - current stable version, 1.1 release
1.2.x - current beta version, 1.2 release
2.x.x - current alpha version
0.23.x - similar to 2.x.x but missing NN HA
0.22.x - does not include security
0.20.203.x - legacy stable version
0.20.x - legacy version
Description:
0.20/0.22/1.1/CDH3 series - original Map/Reduce model, stable versions
0.23/2.x/CDH4 series - YARN model, new

Opening the Hadoop website once more, I will try to translate the chapter that introduces YARN, both to learn the material and to improve my foreign-language skills.


The next generation of MapReduce for Apache Hadoop: YARN

In the hadoop-0.23 release, MapReduce underwent a complete overhaul, resulting in what we now call MapReduce 2.0 (MRv2), or YARN.

The basic idea of MRv2 is to split the JobTracker's two main functions, resource management and job scheduling/monitoring, into separate daemon processes. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical MapReduce sense or a directed acyclic graph (DAG) of jobs.

The ResourceManager (RM) and each node's subordinate module, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system.

The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.




The ResourceManager has two main components: the Scheduler and the ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints such as capacities and queues. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status during execution. It also offers no guarantees about restarting tasks that fail due to application errors or hardware failures. The Scheduler performs its scheduling function based on the applications' resource requirements; it does so using the abstract notion of a resource container (Container), which incorporates elements such as memory, CPU, disk, and network. In the first version, only memory is supported.
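To make the container abstraction concrete, here is a minimal sketch in plain Java. This is not the actual YARN API; the class and method names are invented for illustration. The idea is simply that a container is a bundle of resource demands, and the scheduler grants it only if a node has the headroom:

```java
// Minimal sketch of the resource-container idea described above.
// NOT the real YARN API; all names here are hypothetical.
public class ResourceContainer {
    private final int memoryMb;     // in the first MRv2 release, memory is the only enforced dimension
    private final int virtualCores; // CPU came later; disk and network were not yet modeled

    public ResourceContainer(int memoryMb, int virtualCores) {
        this.memoryMb = memoryMb;
        this.virtualCores = virtualCores;
    }

    public int getMemoryMb() { return memoryMb; }

    public int getVirtualCores() { return virtualCores; }

    // A scheduler would grant this container on a node only if the node
    // still has enough free memory and cores to host it.
    public boolean fitsOn(int freeMemoryMb, int freeCores) {
        return memoryMb <= freeMemoryMb && virtualCores <= freeCores;
    }

    public static void main(String[] args) {
        ResourceContainer c = new ResourceContainer(1024, 1);
        System.out.println(c.fitsOn(4096, 4)); // true: the node can host this container
    }
}
```

Note how the "pure scheduler" framing shows up here: the container carries only resource demands, no notion of task status or restart policy.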

The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, and so on. The current MapReduce schedulers, such as the CapacityScheduler and the FairScheduler, are examples of such plug-ins.

The CapacityScheduler supports hierarchical queues (Hierarchical Queues) to allow more predictable sharing of cluster resources.
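To see why hierarchical queues make sharing predictable: each queue is assigned a percentage of its parent, so a queue's absolute cluster share is the product of the capacities along its path from the root. Here is a simplified sketch of that arithmetic (the queue names and percentages are invented, and this is not the real CapacityScheduler code):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of hierarchical queue capacities, as used conceptually
// by the CapacityScheduler. Queue names and numbers are illustrative only.
public class QueueCapacity {
    // A queue's absolute share of the cluster is the product of the
    // per-level capacity percentages along its path from the root.
    static double absoluteShare(Map<String, Double> capacityPercent, String queuePath) {
        double share = 1.0;
        StringBuilder prefix = new StringBuilder();
        for (String part : queuePath.split("\\.")) {
            if (prefix.length() > 0) prefix.append('.');
            prefix.append(part);
            share *= capacityPercent.get(prefix.toString()) / 100.0;
        }
        return share;
    }

    public static void main(String[] args) {
        Map<String, Double> cap = new HashMap<>();
        cap.put("root", 100.0);
        cap.put("root.engineering", 60.0);        // hypothetical queues
        cap.put("root.engineering.build", 50.0);
        cap.put("root.marketing", 40.0);
        // engineering.build gets 100% * 60% * 50% = 30% of the cluster
        System.out.println(absoluteShare(cap, "root.engineering.build"));
    }
}
```

Because every level sums to a known budget, an organization can predict its worst-case share no matter what sibling queues submit.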

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container when it fails for some reason.

The NodeManager is the per-machine framework agent responsible for containers: it monitors their resource usage (CPU, memory, disk, network) and reports the same to the ResourceManager/Scheduler.
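The kind of information a NodeManager reports can be pictured as a small per-node status record; the scheduler mainly needs the headroom left on each node. The sketch below is illustrative only, with invented field names, and is not the actual YARN heartbeat protocol:

```java
// Sketch of the per-node status a NodeManager might report to the
// ResourceManager/Scheduler. Field names are hypothetical, not the
// real YARN heartbeat message.
public class NodeStatusReport {
    final String nodeId;
    final int usedMemoryMb;
    final int totalMemoryMb;
    final double cpuUtilization; // fraction in [0.0, 1.0]

    NodeStatusReport(String nodeId, int usedMemoryMb, int totalMemoryMb, double cpuUtilization) {
        this.nodeId = nodeId;
        this.usedMemoryMb = usedMemoryMb;
        this.totalMemoryMb = totalMemoryMb;
        this.cpuUtilization = cpuUtilization;
    }

    // The scheduler primarily cares about remaining headroom when
    // deciding whether new containers fit on this node.
    int availableMemoryMb() {
        return totalMemoryMb - usedMemoryMb;
    }

    public static void main(String[] args) {
        NodeStatusReport r = new NodeStatusReport("node-1:45454", 6144, 8192, 0.4);
        System.out.println(r.availableMemoryMb()); // 2048 MB of headroom
    }
}
```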

The per-application ApplicationMaster is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring progress.

The next generation of MapReduce, MRv2, maintains API compatibility with the previous stable release (hadoop-1.x). This means that existing MapReduce jobs do not need to be changed and should run unchanged on MRv2 with just a recompile.


The original text from the website follows below; if any of the translation is inappropriate, please let me know, thank you!

Apache Hadoop NextGen MapReduce (YARN)

MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager has two main components: Scheduler and ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource Container which incorporates elements such as memory, CPU, disk, network etc. In the first version, only memory is supported.

The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.

The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources.

The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.

The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.


