Hadoop Tutorial (V): 1.x MapReduce Process Diagram

Last Update: 2017-02-27 Source: Internet

Author: User

The official shuffle architecture diagram

This diagram shows, at a high level, where the data flows during the shuffle and the principles behind it.
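The data flow through the shuffle can be sketched in a few lines of Python. This is a toy, single-process word count, not Hadoop code: the function names (`map_phase`, `partition`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the partition function is a simple stand-in for Hadoop's default hash partitioning on the key.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def partition(key, num_reducers):
    """Choose a reducer for a key (toy deterministic hash, an assumption)."""
    return sum(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Shuffle: route each pair to its partition, then sort and group by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order, values grouped per key.
    return [sorted(bucket.items()) for bucket in buckets]

def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return [(key, sum(values)) for key, values in grouped]

def word_count(lines, num_reducers=2):
    """Run map, shuffle, and reduce; return one result list per reducer."""
    partitions = shuffle(map_phase(lines), num_reducers)
    return [reduce_phase(p) for p in partitions]
```

Calling `word_count(["a b a", "b c"])` returns one list per reducer. All pairs with the same key land in the same list, which is the guarantee the shuffle provides to the reduce side.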

A more detailed architecture diagram

This diagram explains the details of MapReduce from the perspective of the JobTracker and the TaskTracker.

The figure above clearly shows the flow and design of the original MapReduce framework:

1 First, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates with the machines in the cluster through heartbeats, decides which machines each program should run on, and manages job failures, restarts, and so on.

2 A TaskTracker runs on every machine in the MapReduce cluster. Its main job is to monitor the resources of its own machine.

3 The TaskTracker also monitors the status of the tasks running on its machine. It sends this information to the JobTracker through heartbeats, and the JobTracker uses the collected information to decide which machines the tasks of a newly submitted job should run on. The dotted arrows in the figure above represent the sending and receiving of these messages.
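The heartbeat-driven assignment described in the steps above can be sketched as follows. This is a simplified model under stated assumptions, not the real JobTracker: the `Heartbeat` and `JobTracker` classes, their fields, and the task naming scheme are all hypothetical, and the scheduler is a plain first-come, first-served queue.

```python
from dataclasses import dataclass, field

@dataclass
class Heartbeat:
    """Status report a TaskTracker sends to the JobTracker (hypothetical shape)."""
    tracker: str
    free_map_slots: int
    free_reduce_slots: int

@dataclass
class JobTracker:
    """Central scheduler: queues pending tasks and hands them out on heartbeats."""
    pending_maps: list = field(default_factory=list)
    pending_reduces: list = field(default_factory=list)
    assignments: dict = field(default_factory=dict)  # task name -> tracker name

    def submit_job(self, job, num_maps, num_reduces):
        """Split a submitted job into map and reduce tasks and queue them."""
        self.pending_maps += [f"{job}-m{i}" for i in range(num_maps)]
        self.pending_reduces += [f"{job}-r{i}" for i in range(num_reduces)]

    def on_heartbeat(self, hb):
        """Reply to a heartbeat with as many tasks as the tracker has free slots."""
        maps = self.pending_maps[:hb.free_map_slots]
        reduces = self.pending_reduces[:hb.free_reduce_slots]
        del self.pending_maps[:len(maps)]
        del self.pending_reduces[:len(reduces)]
        for task in maps + reduces:
            self.assignments[task] = hb.tracker
        return maps + reduces
```

In this sketch, a client calls `submit_job("job1", 3, 1)`, and each later `on_heartbeat(Heartbeat("node-a", 2, 1))` call plays the role of one dotted arrow in the figure: the tracker reports its free slots and receives tasks in the reply. Note that all state lives in the single `JobTracker` object, which is the centralization the article goes on to criticize.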

The original MapReduce architecture is simple and straightforward. In its first few years it produced many success stories and won broad support and recognition in the industry. However, as distributed clusters and their workloads grew, the problems of the original framework gradually surfaced. The main issues are the following:

1 The JobTracker is the centralized processing point of MapReduce, so it is a single point of failure.

2 The JobTracker takes on too many tasks, which consumes too many resources. When there are many MapReduce jobs, it incurs a large memory overhead, which also raises the risk of a JobTracker failure. This is why the industry generally concluded that the old Hadoop MapReduce can only scale to about 4,000 nodes.

3 On the TaskTracker side, using the number of map/reduce tasks to represent resources is too simplistic, because it does not take CPU or memory usage into account. If two memory-hungry tasks are scheduled on the same machine, an out-of-memory (OOM) error is likely.

4 On the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. When the system has only map tasks or only reduce tasks, the slots of the other kind sit idle and resources are wasted. This is the cluster resource utilization problem mentioned earlier.

5 At the source code level, the code is very hard to read, often because a single class does too many things and runs to more than 3,000 lines. The responsibilities of each class are unclear, which makes bug fixing and version maintenance harder.

6 From an operational standpoint, the current Hadoop MapReduce framework forces a system-level upgrade for any change, important or not (such as bug fixes, performance improvements, and new features). Worse still, it forces every client of the distributed cluster to update at the same time, regardless of the user's preferences. These updates make users waste a lot of time verifying that their existing applications work with the new version of Hadoop.
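Issues 3 and 4 in the list above both come from the slot-based resource model, which a short sketch can make concrete. The `Task` and `SlotNode` classes, the helper `memory_overcommit_mb`, and all numbers below are hypothetical and chosen only for illustration; the point is that placement looks at slot counts and never at memory.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    kind: str        # "map" or "reduce"
    memory_mb: int   # memory the task really needs (invisible to the scheduler)

@dataclass
class SlotNode:
    """A node that tracks only slot counts, in the style of Hadoop 1.x."""
    map_slots: int
    reduce_slots: int
    memory_mb: int   # physical memory; never consulted when placing tasks

    def place(self, tasks):
        """Accept each task while a free slot of the matching kind remains."""
        free = {"map": self.map_slots, "reduce": self.reduce_slots}
        accepted = []
        for task in tasks:
            if free[task.kind] > 0:
                free[task.kind] -= 1
                accepted.append(task)
        return accepted, free

def memory_overcommit_mb(node, accepted):
    """Memory demanded beyond what the node has (positive means OOM risk)."""
    return sum(t.memory_mb for t in accepted) - node.memory_mb
```

With a node defined as `SlotNode(map_slots=2, reduce_slots=2, memory_mb=4096)` and three pending map tasks that need 3000, 3000, and 500 MB, `place` accepts the two large tasks because two map slots are free. The node is then overcommitted on memory (issue 3), the small third task waits, and both reduce slots stay idle because no reduce task exists to use them (issue 4).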

