Hadoop YARN, the New MapReduce Framework, in Detail

Source: Internet
Author: User
Tags resource hadoop mapreduce

Introduction to the Hadoop MapReduce V2 (YARN) framework

Problems with the original Hadoop MapReduce framework

Among the industry's large-scale data storage and distributed processing systems, Hadoop is a familiar open source framework for distributed file storage and processing. This article does not repeat an introduction to Hadoop itself; readers can refer to the official Hadoop overview. Readers who have used or studied the old Hadoop framework (version 0.20.0 and earlier) should be familiar with the original MapReduce architecture, shown below:

Figure 1. Original Hadoop MapReduce architecture

The figure above clearly shows the program flow and design ideas of the original MapReduce:

First, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates regularly with the machines in the cluster (heartbeats), decides which programs should run on which machines, and manages all job failures, restarts, and similar operations.

The TaskTracker is a component present on every machine in the MapReduce cluster; its main job is to monitor the resources of its own machine.

The TaskTracker also monitors the health of the tasks running on its machine. It sends this information to the JobTracker through heartbeats, and the JobTracker collects it to decide which machines the tasks of a newly submitted job should be assigned to. The dotted arrows in the figure above represent the sending and receiving of these messages.
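The heartbeat-driven flow described above can be sketched in a few lines of Python. This is only a toy model to make the division of labor concrete, not Hadoop's actual code: the class names, the fields of the heartbeat message, and the first-come scheduling policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    """Status report a TaskTracker sends to the JobTracker (hypothetical fields)."""
    tracker: str
    free_map_slots: int
    free_reduce_slots: int
    healthy: bool = True


@dataclass
class JobTracker:
    """Toy central scheduler: it alone tracks every node and every pending task."""
    pending_maps: list = field(default_factory=list)
    pending_reduces: list = field(default_factory=list)
    assignments: dict = field(default_factory=dict)  # task name -> tracker name

    def submit_job(self, job_id, n_maps, n_reduces):
        # The JobClient hands the whole job to the JobTracker.
        self.pending_maps += [f"{job_id}-m{i}" for i in range(n_maps)]
        self.pending_reduces += [f"{job_id}-r{i}" for i in range(n_reduces)]

    def on_heartbeat(self, hb):
        """Answer a heartbeat with the tasks that tracker should launch."""
        if not hb.healthy:
            return []
        launched = []
        for queue, free in ((self.pending_maps, hb.free_map_slots),
                            (self.pending_reduces, hb.free_reduce_slots)):
            for _ in range(min(free, len(queue))):
                task = queue.pop(0)
                self.assignments[task] = hb.tracker
                launched.append(task)
        return launched


jt = JobTracker()
jt.submit_job("job1", n_maps=3, n_reduces=1)
first = jt.on_heartbeat(Heartbeat("node-a", free_map_slots=2, free_reduce_slots=1))
second = jt.on_heartbeat(Heartbeat("node-b", free_map_slots=2, free_reduce_slots=1))
print(first)   # ['job1-m0', 'job1-m1', 'job1-r0']
print(second)  # ['job1-m2']
```

Note that every scheduling decision and every piece of job state passes through the single JobTracker object, which is the design property the rest of this section criticizes.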

As we can see, the original MapReduce architecture is simple and straightforward. In its first few years it produced many success stories and won broad support and recognition from the industry. But as distributed clusters and their workloads grew, the problems of the original framework gradually surfaced. The main issues are as follows:

The JobTracker is the centralized processing point of MapReduce, which makes it a single point of failure.

The JobTracker takes on too many tasks, which leads to excessive resource consumption. When there are very many MapReduce jobs, the memory overhead becomes large, which in turn increases the risk of the JobTracker failing. This is why the industry generally concludes that the old Hadoop MapReduce can support an upper limit of only 4,000 node hosts.

On the TaskTracker side, using the number of map/reduce tasks as the representation of resources is too simplistic, because it does not take CPU or memory usage into account. If two tasks with large memory consumption are scheduled onto the same machine, an out-of-memory (OOM) error can easily occur.

On the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. If the system has only map tasks or only reduce tasks at a given moment, resources are wasted; this is the cluster resource utilization problem.

At the source code level, the code is very difficult to read, often because a single class does too many things and runs to more than 3,000 lines. As a result, class responsibilities are unclear, which makes bug fixing and version maintenance harder.

From an operational standpoint, the current Hadoop MapReduce framework forces a system-level upgrade for any change, important or not, such as bug fixes, performance improvements, and new features. Worse still, it forces every client of the distributed cluster to update at the same time, regardless of the users' preferences. These updates make users waste a lot of time verifying that their existing applications still work with the new version of Hadoop.
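The two TaskTracker-side slot problems listed above (slots that ignore memory, and the fixed split between map and reduce slots) can be illustrated with a small Python sketch. It is a simplified model under stated assumptions rather than Hadoop's scheduler: the SlotTaskTracker class, the slot counts, and the memory figures are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kind: str       # "map" or "reduce"
    memory_mb: int  # real peak memory; the slot scheduler never looks at this


@dataclass
class SlotTaskTracker:
    """Toy node with a fixed map/reduce slot split (all numbers are made up)."""
    map_slots: int
    reduce_slots: int
    physical_memory_mb: int

    def schedule(self, tasks):
        """Fill slots purely by task kind, the way a slot counter would."""
        free = {"map": self.map_slots, "reduce": self.reduce_slots}
        running = []
        for task in tasks:
            if free[task.kind] > 0:
                free[task.kind] -= 1
                running.append(task)
        return running, free


node = SlotTaskTracker(map_slots=2, reduce_slots=2, physical_memory_mb=4096)

# Problem 1: two memory-hungry maps both fit by slot count, but not by memory.
heavy = [Task("m0", "map", 3000), Task("m1", "map", 3000)]
running, free = node.schedule(heavy)
memory_needed = sum(t.memory_mb for t in running)
oom_risk = memory_needed > node.physical_memory_mb

# Problem 2: a map-only workload leaves every reduce slot idle while maps wait.
map_only = [Task(f"m{i}", "map", 512) for i in range(4)]
running2, free2 = node.schedule(map_only)
waiting = len(map_only) - len(running2)
idle_fraction = free2["reduce"] / (node.map_slots + node.reduce_slots)

print(oom_risk, memory_needed)                # True 6000
print(waiting, free2["reduce"], idle_fraction)  # 2 2 0.5
```

In the first case the node accepts 6000 MB of work on 4096 MB of memory because only slot counts are checked; in the second, half of the node's slots sit idle while two map tasks wait.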
