Hadoop Notes: Why MapReduce V2 (YARN)?

Preface:

I haven't written a blog post in a while (I've noticed this is the most common opening of my posts, but this time the gap really was long). A lot has been going on recently, hence the delay.

I now plan to start a new series called Hadoop Notes. The articles in it are not organized in an entry-intermediate-advanced order; if you want a book that takes you from the basics to the depths, I recommend Hadoop: The Definitive Guide.

Today I want to write about the difference between MapReduce V2 (MapReduce NextGen, or YARN) and the original MapReduce. While studying YARN recently I read many people's blogs, and each introduction has its strengths. But one very important question is why we need YARN at all. After reading most of them, YARN does not seem special; it merely splits the previous design into several pieces. Only by thinking it through carefully can you see what makes it worthwhile.

Understanding this article requires some familiarity with the old MapReduce framework. If you have questions about Hadoop or big data processing architectures and applications, you can reach me via the contact information below.

Copyright: This article was published by leftnoteasy at http://leftnoteasy.cnblogs.com. It may be quoted in part or in full, but please cite the source. You can reach me at wheeleast (AT) gmail.com, or add me on Weibo: http://weibo.com/leftnoteasy

 

Why YARN: Old MapReduce, are you still up to the job?

When I first looked at YARN, the question I had to ask was: why redesign such a mature architecture at all?

"The Apache hadoop map-Reduce framework is showing it's age, clearly ",CommunityThe yarn design document "MapReduce_NextGen-Architecture.

The current MapReduce framework has run into many problems: excessive memory consumption, an unreasonable threading model, and shortcomings in scalability, stability, and performance as the cluster grows. This is why today's Hadoop clusters have been stuck at the scale of roughly 3,000 machines that Yahoo announced.

To overcome these problems, the architecture had to be rethought. Below we compare the old and new MapReduce architectures.

 

YARN Design Requirements

The design requirements are not listed here to pad the word count. I increasingly find that design requirements are the soul of software development. Yes, requirements change often, maddeningly so, but a good set of design requirements makes every subsequent step much clearer. From the YARN design document:

Top requirements:

· Reliability

· Availability

· Scalability - clusters of 10,000 nodes and 200,000 cores

· Backward compatibility - ensure customers' Map-Reduce applications can run unchanged in the next version of the framework. Also implies forward compatibility.

· Evolution - ability for customers to control upgrades to the grid software stack.

· Predictable latency - a major customer concern.

· Cluster Utilization

The second tier of requirements (lower priority):

· Support for alternate programming paradigms to Map-Reduce

· Support for limited, short-lived services

A few comments on these requirements. First, very large clusters must be supported; the figure of 200,000 cores is indeed striking.

Next, backward compatibility. People have written a great many programs against the old framework; if those old programs were not supported, existing users would be reluctant to move to the new version.

In addition, cluster resources should be utilized as fully as possible.

It is worth highlighting the support for computing models other than MapReduce, which shows Hadoop's determination to become the leader in this field. Previously, Hadoop was essentially synonymous with MapReduce: what MapReduce couldn't do, Hadoop basically couldn't do either, and MapReduce can only do a limited range of things (see the "disadvantages of hadoop" section in one of my earlier articles). Simply put, Hadoop is powerful when the data is huge but the computation is simple; it struggles when the computation logic is complex, such as algorithms that must iterate until convergence. If other computing models can be added to Hadoop, then in scenarios where the data volume is not that large but the computation is heavy, picky customers will no longer have grounds to grumble.

The design document does not say much more about the support for limited, short-lived services.

 

The old MapReduce design

Below is the design diagram of the old framework, followed by a brief introduction:

1. First, the user program (Client Program) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates regularly with the machines in the cluster (heartbeat), decides which programs should run on which machines, and manages all operations such as job failure and retry.

2. The TaskTracker is the part of the MapReduce framework that runs on every machine in the cluster. Its main job is to monitor the resources of its own machine, where "resources" means "how many map tasks and how many reduce tasks can be started on this machine" (the per-machine upper limit of map/reduce tasks is configured when the cluster is set up). The TaskTracker also monitors the health of the tasks currently running on its machine.

The TaskTracker sends this information to the JobTracker through the heartbeat, and the JobTracker uses it to decide which machines a newly submitted job should run on. In the diagram, the paper-clip-like arrows represent this exchange of messages.
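To make the client-to-JobTracker flow concrete, here is a minimal sketch of submitting a job through the classic org.apache.hadoop.mapred API. The class name and the use of identity mapper/reducer defaults are my own illustration, not from the original post:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class OldApiJobSubmit {
    public static void main(String[] args) throws IOException {
        // JobConf describes the job; JobClient ships it to the JobTracker.
        JobConf conf = new JobConf(OldApiJobSubmit.class);
        conf.setJobName("old-api-demo");

        // No mapper/reducer set: the old API falls back to its identity
        // implementations, which is enough to show the submission path.
        // With the default TextInputFormat, keys are LongWritable offsets
        // and values are Text lines.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Blocks until the JobTracker reports completion; task progress
        // flows back over the same heartbeat channel described above.
        JobClient.runJob(conf);
    }
}
```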

Simple enough? The whole architecture can be summarized in those two points. But when you look at the code, it is very hard to read, because a single class often runs to more than 3,000 lines; a class that does that many things has unclear responsibilities. Beyond readability, in my understanding this design has at least the following problems:

1. The JobTracker is a single point of failure for MapReduce. That alone is unsettling.

2. The JobTracker does too much work, which leads to excessive resource consumption. When there are too many MapReduce jobs, the memory overhead becomes large, which in turn increases the risk of the JobTracker failing.

3. On the TaskTracker side, representing resources simply as the number of map/reduce tasks, without considering CPU or memory usage, is too crude. If two tasks with large memory footprints are scheduled onto the same machine, an OOM is likely.

4. On the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. If at some moment the system has only map tasks or only reduce tasks, resources are wasted; this is the cluster utilization problem mentioned earlier.

Of the four problems above, all but the first have been solved in YARN. The sketch below makes problems 3 and 4 concrete.
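The two configuration keys below are the real MRv1 settings each TaskTracker read at start-up; the surrounding class is only an illustration I have added:

```java
import org.apache.hadoop.conf.Configuration;

public class SlotModelSketch {
    public static void main(String[] args) {
        // Under MRv1 each TaskTracker advertised a fixed number of slots,
        // read once from mapred-site.xml when the daemon started
        // (both defaulted to 2).
        Configuration conf = new Configuration();
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

        // Problem 3: a "slot" says nothing about memory, so two
        // memory-hungry tasks can land on the same node and trigger an OOM.
        // Problem 4: slots are typed, so a map-only workload leaves every
        // reduce slot idle no matter how busy the cluster is.
        System.out.printf("map slots: %d, reduce slots: %d%n", mapSlots, reduceSlots);
    }
}
```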

 

 

Now let's look at the design diagram of MapReduce V2:

The first thing to notice is that the JobTracker and TaskTracker are gone, replaced by the ResourceManager, the ApplicationMaster, and the NodeManager. Let me explain each in detail.

The ResourceManager is a central service. Its job is to schedule and start the ApplicationMaster that belongs to each job, and to monitor whether that ApplicationMaster is still alive.

Careful readers will have noticed that something is missing: monitoring and restarting the tasks inside a job is no longer mentioned. That is exactly why the ApplicationMaster exists.

In the diagram, the MPI master and the MR master are the ApplicationMasters of an MPI job and a MapReduce job respectively. Note that there is one ApplicationMaster per job (not per machine), and it can run on machines other than the one hosting the ResourceManager. In the old framework, the JobTracker monitored the running status of the tasks inside each job; this work is now handed to the ApplicationMaster, while a module inside the ResourceManager, the ApplicationsManager, monitors the ApplicationMasters themselves and restarts one on another machine if it fails.

Design advantage 1: This design greatly reduces the resource consumption of the JobTracker (now the ResourceManager), and distributing the code that monitors the status of each job's subtasks (tasks) is both safer and more elegant.

Furthermore, in the new version the ApplicationMaster is a pluggable part: users can write their own ApplicationMaster for different programming models, so that more kinds of programming models can run in a Hadoop cluster.

Design advantage 2: Support for different programming models

Design advantage 3: Resources are represented as memory (the current YARN version does not yet take CPU into account), which is more sensible than counting remaining slots; the sketch below shows what such a request looks like.

Design advantage 4: Since resources are expressed as an amount of memory, there is no longer the awkward situation of cluster resources sitting idle because of the rigid map slot / reduce slot split.
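As a rough sketch of what advantages 1 and 3 look like in code, here is how an ApplicationMaster talks to the ResourceManager using the Hadoop 2.x AMRMClient API. This is a later, stabilized form of the interface than the 0.23 snapshot this post describes, so treat the exact calls as illustrative:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSketch {
    public static void main(String[] args) throws Exception {
        // The per-job ApplicationMaster registers itself with the
        // ResourceManager; the RM only hands out resources and watches
        // this process, while task-level monitoring stays in here.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // Resources are requested as memory (plus vcores in later
        // versions), not as typed map/reduce slots.
        Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcore */);
        rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // A real ApplicationMaster would now loop on allocate(), launch
        // the granted containers through an NMClient, and finally call
        // unregisterApplicationMaster when the job is done.
    }
}
```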

There are in fact many more elegant designs in YARN, which I will write about bit by bit later.

 

Summary:

YARN was added in Hadoop 0.23. This change turns Hadoop from a fighter who knows a single "Black Tiger" move into a Shaolin school with a full repertoire, and learning and understanding YARN will be part of the basic toolkit of any qualified Hadooper.

 

References:

The official YARN design document: https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf
