YARN (MapReduce V2)

Let's begin with the limitations of MapReduce V1:
  • The JobTracker is a single point of failure and a bottleneck. The JobTracker in MapReduce V1 is responsible for job distribution, management, and scheduling, and must also maintain heartbeat communication with every node in the cluster to track each machine's health and resource status. Clearly, the single JobTracker carries too many responsibilities. As the cluster size and the number of submitted jobs grow, the JobTracker's workload grows rapidly with them, quickly consuming its memory and network bandwidth. The end result is that the JobTracker becomes the bottleneck of the cluster, the center of all cluster jobs, and its single greatest point of risk.
  • On the TaskTracker side, because job allocation information is too coarse, multiple resource-intensive or long-running tasks may be assigned to the same node, causing single-node failures or long job wait times.
  • Job latency is too high. Before a job runs in MapReduce V1, the TaskTracker must report its resource status and job status; the JobTracker assigns work based on that information, and a task starts only after the TaskTracker receives it. The communication delays make job startup too slow, and the most visible consequence is that small jobs cannot complete promptly.
  • The programming framework is not flexible enough. Although MapReduce V1 lets you customize the processing functions and objects for each stage, the framework still constrains the programming model and resource allocation.

To address these problems, the MapReduce designers proposed the next-generation Hadoop MapReduce framework, officially called MRv2/YARN.

YARN (MapReduce V2) design requirements
  • Reliability.
  • Availability.
  • Scalability. Clusters should scale to 10,000 nodes and 200,000 cores.
  • Backward compatibility. Programs written for MapReduce V1 must run on MapReduce V2.
  • Evolution. Allow users to control software upgrades within the cluster.
  • Predictable latency. Improve the response and processing speed of small jobs.
  • Cluster utilization. For example, map tasks and reduce tasks should share resources.
  • Support for frameworks other than the MapReduce programming model, widening the audience for MapReduce V2.
  • Support for limited, short-lived services.
Main ideas and architecture of YARN (MapReduce V2)

Considering the design requirements of MapReduce V2 and the problems highlighted in MapReduce V1, especially the JobTracker's single-point bottleneck (which affects the reliability, availability, and scalability of Hadoop clusters), the central design idea of MapReduce V2 is to split the two major duties of the JobTracker: cluster resource management and job management. Cluster resource management is handled by a global ResourceManager; job management is handled by a per-job ApplicationMaster; and the TaskTracker evolves into the NodeManager. The global ResourceManager and the per-node NodeManagers together form the data computation framework, in which the ResourceManager is the ultimate arbiter of resources for the entire cluster. The per-job ApplicationMaster is effectively a framework-specific library responsible for two tasks: negotiating resources with the ResourceManager, and working with the NodeManagers to execute and monitor the job's tasks.

Figure 1 Structure of MapReduce V2

(1) ResourceManager

The ResourceManager comprises two components with distinct functions: the scheduler and the ApplicationsManager. The scheduler allocates resources to running applications subject to the cluster's capacity, queues, and resource limits. Despite its name, it is responsible only for resource allocation; it does not monitor the running status of each application, and it does not restart tasks after task, application, or hardware failures. The scheduler performs scheduling based on each application's resource requirements, using resource containers that bundle together a node's memory, CPU, disk, and other resources. The ApplicationsManager is responsible for accepting job submissions, negotiating the first resource container in which each application's ApplicationMaster runs, and reallocating containers to restart ApplicationMasters that have failed.

(2) NodeManager

The NodeManager is the per-node framework agent. It launches application containers, monitors their resource usage (CPU, memory, disk, and network bandwidth), and reports this information to the scheduler. The ApplicationMaster negotiates resource containers from the scheduler and tracks the status of these containers and the application's execution.

Every node in the cluster runs a NodeManager, whose main responsibilities are:

A. Launch the containers that the scheduler has allocated to an application.

B. Ensure that started containers do not exceed the allocated resources.

C. Set up each task's container environment, including binary executables and JAR files.

D. Provide a simple per-node service for managing local storage resources.

An application can keep using local storage even beyond what it has requested from the ResourceManager. For example, MapReduce uses this service to store the intermediate output of map tasks and to shuffle it to reduce tasks.

(3) ApplicationMaster

There is exactly one ApplicationMaster per application. It has the following responsibilities (a code sketch follows the list):

A. Negotiate resources with the scheduler.

B. Work with the NodeManagers to run the application's component tasks in the appropriate containers and monitor their execution.

C. If a container fails, request replacement resources from the scheduler.

D. Calculate the application's resource requirements and translate them into protocol messages the scheduler can recognize.

E. When the ApplicationMaster itself fails, the ApplicationsManager restarts it; the ApplicationMaster then recovers the application from its previously saved execution state.
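
To make responsibilities A and D concrete, here is a minimal sketch of an ApplicationMaster registering and requesting containers. It assumes the Hadoop 2.x YARN client API (org.apache.hadoop.yarn.client.api), which is the eventual public form of the design described here; the host name, memory size, and priority are illustrative values, not part of the original design text.

import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRegistrationSketch {
  public static void main(String[] args) throws Exception {
    // Register this ApplicationMaster with the ResourceManager (responsibility A).
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();
    rmClient.registerApplicationMaster("am-host", 0, "");  // illustrative host, no tracking URL

    // Responsibility D: express needs in the scheduler's terms:
    // <priority, (host, rack, *), memory, #containers>.
    Resource capability = Resource.newInstance(1024, 1);   // 1 GB, 1 vcore per container
    Priority priority = Priority.newInstance(1);
    for (int i = 0; i < 4; i++) {                          // four containers on any host
      rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // The allocate() heartbeat that completes the negotiation (and re-requests
    // resources after container failures, responsibility C) is sketched in the
    // "Scheduling" section below.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}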

 

Figure 2 ApplicationMaster component event flow

1) The event dispatcher component manages the other components inside the ApplicationMaster and generates events for them.

2) The container allocation component translates task resource requests into ApplicationMaster resource requests for the scheduler and negotiates resources with the ResourceManager.

3) The client service component reports job status, counters, execution progress, and other information to the Hadoop MapReduce user.

4) The task listener component receives heartbeat information sent by map and reduce tasks.

5) The task component receives the heartbeat and status-update information produced by map and reduce tasks.

6) The container launch component starts containers via the NodeManagers.

7) The job history event handler writes a job's history events to HDFS.

8) The job component maintains the state of the job and its components.

(4) Resource container

In MapReduce V2, system resources are organized by dividing each node's available resources into parts, each encapsulated as a unit the system calls a container (for example, a fixed-size slice of memory, a number of CPU cores, network bandwidth, or a block of disk space). In the initial design of MapReduce V2, "resource" means memory: each node's memory is carved into containers of a fixed size (for example 512 MB or 1 GB), rather than being organized into map pools and reduce pools as in MapReduce V1. An ApplicationMaster can request any number of containers whose sizes are integer multiples of the minimum memory unit. Because every node's memory is divided into containers of identical size and status, containers can be reused interchangeably as tasks execute, which improves utilization and avoids the MapReduce V1 problems of jobs bottlenecking on the reduce pool and of resources that cannot be swapped between pools. A resource container's main responsibility is to run the work that an ApplicationMaster submits to it and to hold or transmit the data that work needs stored and transmitted.
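
To make the arithmetic concrete, here is a minimal sketch; the node size and container unit below are assumed values for illustration, not fixed by the design.

public class ContainerMath {
  static final int CONTAINER_MB = 512;        // assumed minimum allocation unit

  public static void main(String[] args) {
    int nodeMemoryMb = 8 * 1024;              // assumed memory offered by one node
    int containersPerNode = nodeMemoryMb / CONTAINER_MB;  // 16 uniform containers

    // An application may ask for any integer multiple of the unit;
    // requests are rounded up to the next multiple.
    int taskNeedMb = 1200;
    int grantedMb = ((taskNeedMb + CONTAINER_MB - 1) / CONTAINER_MB) * CONTAINER_MB;  // 1536

    System.out.printf("node holds %d containers; a %d MB request is granted %d MB%n",
        containersPerNode, taskNeedMb, grantedMb);
  }
}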

MapReduce V2 design details

The previous section described the main design ideas and architecture of MapReduce V2 and the main responsibilities of each part. This section covers the design in more detail.

1. Resource Negotiation

The ApplicationMaster requests resource containers using a resource-requirement description and may ask for containers on specific machine nodes; it may also request multiple resource containers on the same machine. All resource requests are subject to the application's capacity limits and its queue's capacity limits. To help the scheduler allocate cluster containers intelligently, the ApplicationMaster computes the application's resource requirements and encapsulates them in protocol messages the scheduler can recognize, of the form <priority, (host, rack, *), memory, #containers>. Taking MapReduce as an example, the ApplicationMaster analyzes the input splits, inverts them into a table keyed by host, and sends that to the ResourceManager; as execution progresses, it also sends the changes the application needs in its resource containers. After parsing an ApplicationMaster's request, the scheduler tries its best to grant the requested resources; if resources on a requested machine are unavailable, it can instead grant resources on the same rack or on other machines. In some cases, because the entire cluster is very busy, the resources an ApplicationMaster receives may not be the most suitable; it can then reject them and request a reallocation.
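
Here is a small model of that request tuple and of inverting input splits into a host-keyed table. ResourceAsk and the host names are illustrative stand-ins for this sketch, not the real protocol types.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the tuple <priority, location, memory, #containers>.
record ResourceAsk(int priority, String location, int memoryMb, int numContainers) {}

public class NegotiationSketch {
  public static void main(String[] args) {
    // Pretend input splits, each annotated with the host holding its block.
    List<String> splitHosts = List.of("host1", "host1", "host2", "host3", "host1");

    // Invert the splits into a host-keyed table: host -> containers wanted there.
    Map<String, Integer> perHost = new HashMap<>();
    for (String h : splitHosts) perHost.merge(h, 1, Integer::sum);

    List<ResourceAsk> asks = new ArrayList<>();
    perHost.forEach((host, n) -> asks.add(new ResourceAsk(1, host, 1024, n)));
    // A wildcard entry lets the scheduler fall back to any machine if the
    // preferred hosts are busy (the "*" in the tuple).
    asks.add(new ResourceAsk(1, "*", 1024, splitHosts.size()));

    asks.forEach(System.out::println);
  }
}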

As this negotiation process shows, resources in MapReduce V2 come from a unified pool of containers rather than from map and reduce pools, so an ApplicationMaster can request exactly the amount of resources it needs instead of being stalled because resources of the required type are unavailable. Note that the scheduler does not let an ApplicationMaster request resources without limit: it bounds each request using application, user, queue, and resource limits, which keeps cluster resources from being wasted.

2. Scheduling

The scheduler collects the resource requests of all running applications and builds a global resource-allocation plan. It allocates resources subject to application-specific constraints (such as suitable machines) and global constraints (such as total queue resources, queue limits, and user limits). The scheduler uses a notion similar to capacity scheduling, with capacity guarantees as the basic policy for allocating resources among competing applications. Its scheduling steps are as follows (a minimal sketch follows the list):

1) Select the "minimum service" queue in the system. This queue can be the queue with the longest waiting time, or the queue with the largest ratio of waiting time to allocated resources.

2) Select the highest-priority job from that queue.

3) Satisfy the selected job's resource requests.
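
A minimal sketch of these three steps, with invented Queue and Job types; a production scheduler (such as the Capacity Scheduler) layers many refinements on top of this skeleton.

import java.util.Comparator;
import java.util.List;

public class SchedulerLoopSketch {
  record Job(int priority, int pendingMb) {}
  record Queue(String name, long waitingMs, int allocatedMb, List<Job> jobs) {
    // "Least served": largest ratio of waiting time to allocated resources.
    double hunger() { return (double) waitingMs / Math.max(1, allocatedMb); }
  }

  static void scheduleOnce(List<Queue> queues, int freeMb) {
    // 1) Pick the least-served queue.
    Queue q = queues.stream().max(Comparator.comparingDouble(Queue::hunger)).orElseThrow();
    // 2) Pick the highest-priority job in that queue.
    Job j = q.jobs().stream().max(Comparator.comparingInt(Job::priority)).orElseThrow();
    // 3) Satisfy as much of its request as the free resources allow.
    int grant = Math.min(freeMb, j.pendingMb());
    System.out.printf("granting %d MB to a priority-%d job in queue %s%n",
        grant, j.priority(), q.name());
  }

  public static void main(String[] args) {
    List<Queue> queues = List.of(
        new Queue("prod", 9_000, 4_096, List.of(new Job(2, 2_048), new Job(1, 512))),
        new Queue("adhoc", 3_000, 8_192, List.of(new Job(1, 1_024))));
    scheduleOnce(queues, 6_144);  // 6 GB currently free in the cluster
  }
}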

In MapReduce V2, the ApplicationMaster uses a single interface to request resources from the scheduler:

Response allocate(List<ResourceRequest> ask, List<Container> release)

The ApplicationMaster uses the ResourceRequest list to ask for specific resources, and the Container list to tell the scheduler which resource containers it is releasing.

On receiving an ApplicationMaster's request, the scheduler answers according to its global plan and the various limits. The reply carries three main kinds of information: the list of newly allocated resource containers; the statuses of the application's containers that completed tasks since the last interaction between the ApplicationMaster and the ResourceManager; and the amount of resources currently available to the application in the cluster. The ApplicationMaster can use the container statuses to respond to failed tasks, and the available-resource figure as a reference for its next request; for example, it can use this information to split its requests sensibly between map and reduce tasks and thereby prevent deadlock (most obviously, the case where reduce requests occupy all available resources).
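
In the Hadoop 2.x client library this exchange surfaced as the AMRMClient allocate() heartbeat. Below is a hedged sketch of one heartbeat, continuing the registration sketch shown earlier; the failure handling is illustrative, not the library's built-in behavior.

import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AllocateLoopSketch {
  static void heartbeat(AMRMClient<ContainerRequest> rmClient) throws Exception {
    // One allocate() call both sends pending asks/releases and fetches the reply.
    AllocateResponse reply = rmClient.allocate(0.5f /* reported progress */);

    // 1) Newly granted containers: hand them to the container-launch component.
    List<Container> granted = reply.getAllocatedContainers();
    System.out.println("granted containers: " + granted.size());

    // 2) Statuses of containers that finished since the last heartbeat;
    //    the AM reacts to failures here, e.g. by re-adding a ContainerRequest.
    for (ContainerStatus done : reply.getCompletedContainersStatuses()) {
      if (done.getExitStatus() != 0) {
        System.out.println("container failed, will re-request: " + done.getContainerId());
      }
    }

    // 3) Headroom: how much the cluster could still give this application.
    //    Useful for balancing map vs. reduce asks and avoiding the reduce deadlock.
    System.out.println("available headroom: " + reply.getAvailableResources());
  }
}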

3. Resource Monitoring

The scheduler periodically collects information from the NodeManagers about how allocated resources are being used. It also marks the containers of completed tasks as available for reallocation.

4. Application Submission

An application is submitted in the following steps (a sketch follows the list):

1) The user submits a job to the ApplicationsManager. Specifically, after the user submits the job, a new application ID is assigned, the application definition is packaged into the user's application cache directory on HDFS, and finally the application is submitted to the ApplicationsManager.

2) The ApplicationsManager accepts the submission.

3) The ApplicationsManager negotiates with the scheduler for the first resource container needed to run the application's ApplicationMaster, and launches the ApplicationMaster there.

4) The ApplicationsManager returns the details of the launched application to the user so that the user can monitor its progress.
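
Here is a sketch of this flow using the Hadoop 2.x YarnClient API, which is how the submission steps were eventually exposed; the application name is made up, and staging the application definition on HDFS is elided.

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // Step 1: ask for a new application ID.
    YarnClientApplication app = yarn.createApplication();
    ApplicationId appId = app.getNewApplicationResponse().getApplicationId();

    // Steps 1-2: fill in the application definition (AM launch command, queue,
    // name, resources staged on HDFS) and hand it to the ApplicationsManager.
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("example-app");  // made-up name
    yarn.submitApplication(ctx);

    // Step 4: the client can now poll for progress using the returned ID.
    System.out.println("submitted " + appId + ", state: "
        + yarn.getApplicationReport(appId).getYarnApplicationState());
  }
}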

5. ApplicationsManager Components

The ApplicationsManager starts all ApplicationMasters in the system and manages their lifecycles. After an ApplicationMaster starts, the ApplicationsManager monitors it through the heartbeats it sends periodically, ensuring its availability and restarting it if it fails.

To accomplish these tasks, the ApplicationsManager contains the following components:

1) The scheduling/negotiation component negotiates with the scheduler for the resource containers that ApplicationMasters need.

2) The ApplicationMaster container management component starts and stops ApplicationMaster containers by communicating with the NodeManagers.

3) The ApplicationMaster monitoring component tracks ApplicationMaster status, keeps them available, and restarts them when necessary.

6. MapReduce V2 job execution process

This section walks through job execution in MapReduce V2. The flowchart below shows only the main flow; some feedback paths and heartbeat communications are omitted.

Figure 3 MapReduce V2 job execution process

Step 1: The MapReduce framework receives a job submitted by the user, assigns it a new application ID, packages the application definition and uploads it to the user's application cache directory on HDFS, and then submits the application to the ApplicationsManager.

Step 2: The ApplicationsManager negotiates with the scheduler for the first resource container, which will run the ApplicationMaster.

Step 3: The ApplicationsManager launches the ApplicationMaster in the obtained resource container.

Step 4: The ApplicationMaster calculates the resources the application needs and sends resource requests to the scheduler.

Step 5: The scheduler allocates suitable resource containers to the ApplicationMaster based on its resource requests and the statistics on available resources.

Step 6: The ApplicationMaster communicates with the NodeManagers of the assigned containers, handing them the job's launch details and resource-usage instructions.

Step 7: The NodeManagers start the containers and run the tasks (see the sketch after step 9).

Step 8: The ApplicationMaster monitors the execution of tasks in the containers.

Step 9: The ApplicationMaster reports the job's execution status and completion.
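
Steps 6 and 7 correspond to what the Hadoop 2.x client library exposes as NMClient; below is a minimal sketch, with an invented task launch command.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class LaunchSketch {
  // "granted" is a container obtained from the allocate() heartbeat sketched earlier.
  static void launch(Container granted) throws Exception {
    NMClient nm = NMClient.createNMClient();
    nm.init(new YarnConfiguration());
    nm.start();

    // Step 6: tell the container's NodeManager what to run inside it.
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java MyTask 1>stdout 2>stderr"));  // made-up task command

    // Step 7: the NodeManager starts the container and runs the task.
    nm.startContainer(granted, ctx);
  }
}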

7. MapReduce V2 system availability guarantees

System availability here means the availability of each MapReduce V2 component, that is, ensuring that a component can quickly recover and resume service after a failure; in particular, the availability of the ResourceManager and the ApplicationMasters. First, consider how MapReduce V2 keeps MapReduce applications and their ApplicationMasters available. As mentioned earlier, the ApplicationsManager inside the ResourceManager monitors ApplicationMaster execution. When an ApplicationMaster fails, the ApplicationsManager merely restarts it; recovering the specific MapReduce job is up to the ApplicationMaster, which has three options: completely restart the job; restart only the unfinished map and reduce tasks; or resume exactly from the tasks that were running when the ApplicationMaster failed. The first option is expensive and repeats work. The second works well, although part of the reduce work may still be repeated. The third is ideal, resuming from the failure point with no repeated work, but it demands too much of the system. MapReduce V2 chooses the second option. Concretely, the ApplicationMaster writes a log while supervising the execution of map and reduce tasks, recording which have completed; when recovering a job, it analyzes the log and restarts the unfinished tasks.
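
A toy sketch of that second recovery option: append one line per completed task while running, and replay the log on restart to find what still needs to run. The log location, format, and task IDs are invented for illustration.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecoverySketch {
  // During normal execution the AM appends one line per finished task.
  static void markDone(Path log, String taskId) throws IOException {
    Files.writeString(log, taskId + System.lineSeparator(),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  // On restart, the new AM replays the log and reruns only what is missing.
  static Set<String> tasksToRerun(Path log, List<String> allTasks) throws IOException {
    Set<String> done = Files.exists(log)
        ? new HashSet<>(Files.readAllLines(log)) : new HashSet<>();
    Set<String> todo = new HashSet<>(allTasks);
    todo.removeAll(done);  // completed map/reduce tasks are not repeated
    return todo;
  }

  public static void main(String[] args) throws IOException {
    Path log = Path.of("job_0001_done.log");  // invented log location
    markDone(log, "map_003");
    System.out.println(tasksToRerun(log, List.of("map_001", "map_002", "map_003")));
  }
}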

Next, consider how MapReduce V2 keeps the ResourceManager available. While running, the ResourceManager uses ZooKeeper to save its resource-management state, including the ApplicationsManager's processes, the queue definitions, the resource allocations, and the list of NodeManagers. After a failure, the ResourceManager restores itself from this saved state.
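
A hedged sketch of saving and recovering such state with the plain ZooKeeper client; the quorum address, znode path, and payload are invented, and a real ResourceManager store is far more structured.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RmStateSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {});  // assumed quorum

    byte[] state = "queueDefs+allocations+nodeList".getBytes();  // stand-in payload
    String path = "/rmstore/app_0001";                           // invented znode

    // Ensure the parent znode exists before writing under it.
    if (zk.exists("/rmstore", false) == null) {
      zk.create("/rmstore", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Save state while running; a restarted RM reads it back to recover.
    try {
      zk.create(path, state, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      zk.setData(path, state, -1);  // -1: skip the version check
    }
    byte[] recovered = zk.getData(path, false, null);
    System.out.println(new String(recovered));
    zk.close();
  }
}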

8. Advantages of MapReduce V2

1) The JobTracker's duties are distributed. The ResourceManager handles resource management, while ApplicationMasters spread across the cluster's nodes handle starting, running, and checking jobs. This largely removes the single-point bottleneck and single-point risk of the MapReduce V1 JobTracker, greatly improving cluster scalability and availability.

2) In MapReduce V2, the ApplicationMaster is user-customizable, so you can write your own ApplicationMaster for your programming model. This greatly widens the applicability of MapReduce V2.

3) ZooKeeper is used for ResourceManager failover. When the ResourceManager fails, a standby ResourceManager starts quickly from the cluster state saved in ZooKeeper. MapReduce V2 also supports application checkpoints, so an ApplicationMaster can restart quickly from state saved on HDFS. Together, these two measures greatly improve the availability of MapReduce V2.

4) Cluster resources are organized uniformly as resource containers, unlike the map and reduce pools of MapReduce V1. Whenever a task requests resources, the scheduler assigns it whatever resources are available in the cluster, regardless of type. This greatly improves cluster resource utilization.

 

