Learn the difference between MapReduce V1 (the original MapReduce) and MapReduce V2 (YARN). To do that, we first need to understand how MapReduce V1 works and the ideas behind its design.
First, take a look at the execution diagram of MapReduce V1.
The components of MapReduce V1 and their roles are:
Client: writes the MapReduce code, configures the job, and submits it.
JobTracker: the core of the whole MapReduce framework, similar in role to the DispatcherServlet in Spring MVC. It initializes jobs, assigns jobs, maintains heartbeat communication with the TaskTrackers, and coordinates the task nodes.
TaskTracker: a MapReduce cluster can have many TaskTrackers. Each one processes the data assigned to it by the JobTracker (the tasks of the map and reduce phases) and maintains heartbeat communication with the JobTracker.
The architecture diagram above is cluttered; the design can be summarized as follows:
1. The client writes the MapReduce business code and submits the job to the JobTracker; the JobTracker manages multiple TaskTracker nodes and distributes tasks to them.
2. Every machine in the MapReduce cluster runs a TaskTracker, which is responsible for monitoring the local resource status.
3. The TaskTracker maintains heartbeat communication with the JobTracker to report the local resource status, and writes the result output to HDFS once the MapReduce task completes.
In short, the whole MapReduce execution model can be understood like this: the client writes the code and submits the job to the central JobTracker; the JobTracker distributes the tasks to the task nodes and manages them; the local TaskTracker monitors the node and reports its resource and health status back to the JobTracker; and each task node outputs its result once its assigned task has finished.
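The V1 flow described above can be sketched as a toy simulation. This is not Hadoop's actual API; all class and method names here are invented purely to illustrate the single-master pattern: one JobTracker splits a job into tasks, hands them to TaskTrackers, and collects "heartbeat" reports.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy simulation of the MRv1 flow (not the Hadoop API): a single
// JobTracker distributes map/reduce tasks to TaskTrackers, which
// report completion back via "heartbeat" calls.
public class MRv1Sketch {

    static class JobTracker {
        final Queue<String> pending = new ArrayDeque<>();
        final List<String> finished = new ArrayList<>();

        // The client submits a job; the JobTracker splits it into tasks.
        void submitJob(String job, int numMapTasks) {
            for (int i = 0; i < numMapTasks; i++) pending.add(job + "-map-" + i);
            pending.add(job + "-reduce-0");
        }

        // Distribute pending tasks round-robin across the trackers.
        void schedule(List<TaskTracker> trackers) {
            int i = 0;
            while (!pending.isEmpty()) {
                trackers.get(i++ % trackers.size()).run(pending.poll(), this);
            }
        }

        // Heartbeat handler: a TaskTracker reports a finished task.
        void heartbeat(String host, String task) { finished.add(task + "@" + host); }
    }

    static class TaskTracker {
        final String host;
        TaskTracker(String host) { this.host = host; }

        // Run the assigned task locally, then "heartbeat" the result back.
        void run(String task, JobTracker jt) {
            jt.heartbeat(host, task);
        }
    }

    public static void main(String[] args) {
        JobTracker jt = new JobTracker();
        List<TaskTracker> nodes = List.of(new TaskTracker("node1"), new TaskTracker("node2"));
        jt.submitJob("wordcount", 3);   // 3 map tasks + 1 reduce task
        jt.schedule(nodes);
        System.out.println(jt.finished.size() + " tasks finished");
    }
}
```

Note that every submission and every heartbeat passes through the one JobTracker object, which is exactly what makes it a bottleneck in the next section.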
As a result, the performance problems of MapReduce V1 are obvious:
1. Because MapReduce V1 uses a single master (the JobTracker) to manage all of the distributed task nodes, every client job passes through the JobTracker, which must also maintain communication with each node. As the number of submitted jobs grows, the JobTracker's memory and bandwidth consumption keeps increasing and performance drops sharply; the JobTracker becomes the performance bottleneck of the whole distributed cluster.
2. On the TaskTracker side, the job assignment information is too coarse. Several resource-hungry or long-running tasks can easily end up assigned to the same node, leading to a single point of failure or long job wait times.
3. Job latency is high. Before the JobTracker can distribute a job, the TaskTrackers must report their local resource and execution status, and the JobTracker schedules based on that information. The communication delay in this round trip significantly reduces the efficiency of the whole MapReduce run.
YARN's design idea
Having covered the flaws of MapReduce V1, let's look at what MapReduce V2 improves.
To address the single-point bottleneck that the JobTracker can become as the volume of submitted client jobs grows, YARN's design separates the JobTracker's two main responsibilities: cluster resource management and job management. The separated cluster resource management is handled by a global ResourceManager, and the separated job management is handled by a per-job ApplicationMaster; the TaskTracker becomes the NodeManager.
In this way, the computation framework = global ResourceManager + local NodeManagers. The ApplicationMaster for a job is a framework-specific library responsible for two tasks: 1) communicating with the ResourceManager to obtain resources, and 2) working with the NodeManagers to execute and monitor the job's tasks.
Now let's walk through YARN's execution flow:
1. The MapReduce framework receives a user-submitted job, assigns it a new application ID, packages the application definition into the user's application cache directory on HDFS, and submits the application to the ApplicationsManager.
2. The ApplicationsManager negotiates with the Scheduler to obtain the first resource container, which is needed to run the ApplicationMaster.
3. The ApplicationsManager launches the ApplicationMaster in the container it obtained.
4. The ApplicationMaster calculates the resources the application needs and sends a resource request to the Scheduler.
5. The Scheduler allocates suitable resource containers to the ApplicationMaster based on its own view of the available resources and on the ApplicationMaster's request.
6. The ApplicationMaster communicates with the NodeManagers of the assigned containers and hands them the job configuration and resource usage instructions.
7. The NodeManager starts the container and runs the task.
8. The ApplicationMaster monitors the execution of the tasks in the containers.
9. The ApplicationMaster reports the job's execution progress and completion status.
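The nine steps above can be sketched as another toy simulation (again not Hadoop's actual API; every name below is invented for the sketch): the ResourceManager first grants one container to host the ApplicationMaster, and the AM then requests worker containers and drives the NodeManager itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy walk-through (not the Hadoop API) of steps 1-9 above: the
// ResourceManager grants a first container for the ApplicationMaster,
// which then requests worker containers and drives the NodeManager.
public class YarnFlowSketch {

    static final List<String> LOG = new ArrayList<>();

    static class ResourceManager {
        int nextContainer = 0;
        // Scheduler side: grant containers (capacity is unbounded in this toy).
        List<Integer> allocate(int n) {
            List<Integer> granted = new ArrayList<>();
            for (int i = 0; i < n; i++) granted.add(nextContainer++);
            return granted;
        }
    }

    static class NodeManager {
        // Step 7: the NodeManager starts the container and runs the task.
        void launch(int containerId, String task) {
            LOG.add("container " + containerId + " ran " + task);
        }
    }

    static class ApplicationMaster {
        // Steps 4-8: compute needs, request containers, hand tasks to the
        // NodeManager, then report the final status.
        String run(ResourceManager rm, NodeManager nm, List<String> tasks) {
            List<Integer> containers = rm.allocate(tasks.size()); // steps 4-5
            for (int i = 0; i < tasks.size(); i++) {
                nm.launch(containers.get(i), tasks.get(i));       // steps 6-7
            }
            return "SUCCEEDED";                                   // step 9
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        NodeManager nm = new NodeManager();

        // Steps 1-3: submit the app; the first granted container hosts the AM.
        int amContainer = rm.allocate(1).get(0);
        LOG.add("container " + amContainer + " hosts the ApplicationMaster");

        String status = new ApplicationMaster().run(rm, nm,
                List.of("map-0", "map-1", "reduce-0"));
        LOG.forEach(System.out::println);
        System.out.println("final status: " + status);
    }
}
```

The key contrast with the V1 sketch is that per-job scheduling happens inside the ApplicationMaster, not inside the global ResourceManager.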
Compared with MapReduce V1, YARN's advantages are:
1. The JobTracker is split up: resource management is handed to the ResourceManager, while starting, running, and monitoring jobs becomes the responsibility of the ApplicationMasters distributed across the task nodes. This removes the single-point bottleneck and improves the scalability of the cluster.
2. The ApplicationMaster in YARN is a user-customizable component, so users can write their own ApplicationMaster for a new programming model, which makes YARN far less rigid than MapReduce V1.
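The second advantage can be illustrated with a small sketch (invented names, not Hadoop's API): because the ApplicationMaster is just a pluggable component behind a common contract, completely different computation models can share one framework-agnostic ResourceManager.

```java
import java.util.List;

// Toy illustration (not the Hadoop API) of YARN's pluggability: the
// ApplicationMaster is user-replaceable, so different frameworks can
// share one ResourceManager. All names below are invented.
public class PluggableAmSketch {

    interface ApplicationMaster {
        String execute(int containers);   // run the framework's own logic
    }

    // The MapReduce framework ships one AM implementation...
    static class MapReduceAm implements ApplicationMaster {
        public String execute(int containers) {
            return "mapreduce ran " + containers + " map/reduce containers";
        }
    }

    // ...and a user can plug in a completely different computation model.
    static class IterativeGraphAm implements ApplicationMaster {
        public String execute(int containers) {
            return "graph job iterated over " + containers + " containers";
        }
    }

    // The "ResourceManager" only hands out containers; it knows nothing
    // about what the application does with them.
    static int grantContainers(int requested) { return requested; }

    public static void main(String[] args) {
        List<ApplicationMaster> apps = List.of(new MapReduceAm(), new IterativeGraphAm());
        for (ApplicationMaster am : apps) {
            System.out.println(am.execute(grantContainers(2)));
        }
    }
}
```

In MapReduce V1, by contrast, the job-management logic was hard-wired into the JobTracker, so the cluster could only ever run MapReduce jobs.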
The workflow of MapReduce, and the next-generation MapReduce: YARN