The New Hadoop MapReduce Framework (YARN) in Detail

Last Update:2017-02-27 Source: Internet

Author: User

Tags resource hadoop mapreduce

Introduction to the Hadoop MapReduce V2 (YARN) framework

Problems with the original Hadoop MapReduce framework

Hadoop is a well-known open source framework for distributed file storage and processing, widely used in the industry for big data storage and distributed processing. This article does not introduce Hadoop itself; readers can refer to the official Hadoop introduction. Those who have used or studied the old Hadoop framework (0.20.0 and earlier) should be familiar with the original MapReduce architecture, shown in the following figure:

Figure 1. The original Hadoop MapReduce architecture

The figure above shows the flow and design of the original MapReduce framework:

First, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates with the machines in the cluster through heartbeats, decides which tasks should run on which machines, and manages all job failures, restarts, and so on.

A TaskTracker runs on every machine in the MapReduce cluster. Its main job is to monitor the resources of its own machine.

The TaskTracker also monitors the health of the tasks running on its machine. It sends this information to the JobTracker through heartbeats, and the JobTracker collects it to decide which machines the tasks of newly submitted jobs should run on. The dotted arrows in the figure represent the sending and receiving of these messages.
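The heartbeat-driven flow described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Hadoop code: the names `Heartbeat`, `ToyJobTracker`, `submit_job`, and `on_heartbeat` are hypothetical, and a real JobTracker tracks far more state than a pending queue and an assignment table.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    """Status report from one TaskTracker (hypothetical shape, not Hadoop's)."""
    tracker: str
    free_slots: int
    failed_tasks: list = field(default_factory=list)


class ToyJobTracker:
    """Single central coordinator: it alone holds the queue and all assignments."""

    def __init__(self):
        self.pending = []      # tasks waiting for a machine
        self.assignments = {}  # task name -> tracker name

    def submit_job(self, tasks):
        # A job is modelled as a plain list of task names.
        self.pending.extend(tasks)

    def on_heartbeat(self, hb):
        # Tasks reported as failed go back to the queue to be restarted.
        for task in hb.failed_tasks:
            self.assignments.pop(task, None)
            self.pending.append(task)
        # Hand out at most as many pending tasks as the tracker has free slots.
        assigned = []
        while self.pending and len(assigned) < hb.free_slots:
            task = self.pending.pop(0)
            self.assignments[task] = hb.tracker
            assigned.append(task)
        return assigned


jt = ToyJobTracker()
jt.submit_job(["map-0", "map-1", "reduce-0"])
first = jt.on_heartbeat(Heartbeat("node-a", free_slots=2))
second = jt.on_heartbeat(Heartbeat("node-b", free_slots=2))
retry = jt.on_heartbeat(Heartbeat("node-a", free_slots=1, failed_tasks=["map-0"]))
```

Every scheduling decision and every failure in this sketch passes through the one `ToyJobTracker` object, which mirrors the centralized role the JobTracker plays in the original design.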

The original MapReduce architecture is simple and straightforward. In its first few years it produced many success stories and won broad support and recognition from the industry. However, as distributed clusters and their workloads grew, the problems of the original framework gradually surfaced. The main issues are the following:

The JobTracker is the centralized processing point of MapReduce, so it is a single point of failure.

The JobTracker takes on too many tasks, which consumes too many resources. When there are very many MapReduce jobs, the memory overhead becomes large, which in turn increases the risk of JobTracker failure. This is why the industry generally concludes that the old Hadoop MapReduce can support at most about 4,000 nodes.

On the TaskTracker side, using the number of map/reduce tasks as the measure of resources is too simple, because it ignores CPU and memory usage. If two memory-intensive tasks are scheduled onto the same machine, an out-of-memory (OOM) error can easily occur.

On the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. If the system has only map tasks or only reduce tasks at a given time, the other kind of slot sits idle and resources are wasted. This is the cluster resource utilization problem.

At the source code level, the code is very hard to read. A single class often does too many things and runs to more than 3,000 lines, so class responsibilities are unclear, which makes bug fixes and version maintenance harder.

From an operational standpoint, the original Hadoop MapReduce framework forces a system-level upgrade for any change, important or not, such as bug fixes, performance improvements, and new features. Worse still, it forces every client of the distributed cluster to update at the same time, regardless of the user's preferences. These updates make users spend a lot of time verifying that their existing applications still work with the new version of Hadoop.
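The two TaskTracker-side problems in the list above (counting slots instead of measuring CPU and memory, and the fixed map/reduce split) can be made concrete with a toy model. This is a minimal sketch with assumed numbers: `Task`, `SlotNode`, and `try_assign` are hypothetical names, and the slot counts and memory sizes are chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kind: str       # "map" or "reduce"
    memory_mb: int  # real memory demand; the slot model never looks at it


class SlotNode:
    """A node that accounts for resources only as fixed map/reduce slot counts."""

    def __init__(self, map_slots, reduce_slots, memory_mb):
        self.free = {"map": map_slots, "reduce": reduce_slots}
        self.memory_mb = memory_mb
        self.running = []

    def try_assign(self, task):
        # Admission depends only on a free slot of the matching kind.
        if self.free[task.kind] == 0:
            return False
        self.free[task.kind] -= 1
        self.running.append(task)
        return True

    def memory_demand(self):
        return sum(t.memory_mb for t in self.running)


# Problem 1: two memory-hungry tasks both fit by slot count, but not by memory.
node = SlotNode(map_slots=2, reduce_slots=2, memory_mb=4096)
accepted = [node.try_assign(Task(f"map-{i}", "map", 3000)) for i in range(2)]
oversubscribed = node.memory_demand() > node.memory_mb

# Problem 2: a map-only workload cannot use the idle reduce slots.
third_map_accepted = node.try_assign(Task("map-2", "map", 100))
idle_reduce_slots = node.free["reduce"]
```

In this model both large tasks are admitted even though their combined demand exceeds the node's memory, and a small third map task is rejected while two reduce slots remain unused.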
