Hadoop: The Definitive Guide - Summary of Chapter 6: How MapReduce Works


 

1. Anatomy of a MapReduce job run

1). Classic MapReduce -- MapReduce 1.0

There are four independent entities in the whole process:

  • Client: submits the MapReduce job.
  • JobTracker: coordinates the job run.
  • TaskTracker: runs the tasks that the job has been split into.
  • HDFS: used to share job files among the other entities.

The overall running process is shown in the following figure:

A. Job submission

JobClient's runJob() is a convenience method that creates a JobClient instance and calls its submitJob() method. After submitting the job, runJob() polls the job's progress every second to monitor its status.

The job submission process implemented by JobClient's submitJob() method:

  • Asks the JobTracker for a new job ID.
  • Checks the output specification of the job.
  • Computes the input splits for the job.
  • Copies the resources needed to run the job (the job JAR file, the configuration file, and the computed input splits) to the JobTracker's filesystem, in a directory named after the job ID.
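
As a minimal sketch of this path through the classic (old) API, assuming a word-count-style job whose mapper and reducer classes are defined elsewhere (the class and job names here are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ClassicSubmit {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ClassicSubmit.class);
            conf.setJobName("word-count");              // hypothetical job name
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            // conf.setMapperClass(MyMapper.class);     // assumed to be defined elsewhere
            // conf.setReducerClass(MyReducer.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // runJob() creates a JobClient, calls submitJob(), and then polls
            // the job's progress every second until it completes or fails.
            JobClient.runJob(conf);
        }
    }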

B. Job Initialization

When the JobTracker receives the submitted job, it puts the call into an internal queue from which the job scheduler picks it up and initializes it. Initialization involves creating an object to represent the running job; this object encapsulates the job's tasks and the bookkeeping information used to track the tasks' status and progress.

C. Task assignment

Each TaskTracker runs a simple loop that periodically sends heartbeats to the JobTracker, telling the JobTracker that it is alive and acting as a channel for messages. A TaskTracker has a fixed number of slots for map tasks and for reduce tasks; when assigning a map task, the scheduler ideally follows data locality, preferring a data-local task and then a rack-local one.

D. Task execution

Step 1: the TaskTracker copies the job JAR file to its local filesystem. Step 2: the TaskTracker creates a local working directory and un-jars the JAR file into it. Step 3: the TaskTracker creates a TaskRunner instance to run the task.

Streaming and Pipes both run special map and reduce tasks. Streaming allows the map and reduce functions to be written in other languages and communicates with the user's process over standard input and output; Pipes communicates with a C++ process over a socket.

E. Progress and status updates

A job's status includes the state of the job (e.g. running, successfully completed, failed), the progress of its maps and reduces, the values of the job's counters, and a status message or description. Status updates propagate up through the MapReduce 1 system as shown in the figure below: from the child JVM to its TaskTracker, from the TaskTracker to the JobTracker via heartbeats, and from the JobTracker to the polling JobClient.
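
From the client side, a minimal sketch of watching these updates through the old API (the class name ProgressWatcher is illustrative; JobClient, RunningJob, and Counters are standard old-API types):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class ProgressWatcher {
        // conf is assumed to be a fully configured JobConf (see the submission sketch above)
        public static void submitAndWatch(JobConf conf) throws Exception {
            JobClient client = new JobClient(conf);
            RunningJob running = client.submitJob(conf);  // non-blocking submit

            while (!running.isComplete()) {
                System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
                        running.mapProgress() * 100, running.reduceProgress() * 100);
                Thread.sleep(1000);                       // poll roughly once per second
            }
            System.out.println("Counters:\n" + running.getCounters());
        }
    }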

F. Job completion

When the JobTracker receives notification that the last task of a job has completed, it changes the job's status to "successful". When the JobClient learns this while polling, it returns from the runJob() method.

 

2). Yarn (MapReduce 2.0)

YARN is available from Hadoop 0.23 and 2.0 onwards and improves on the MapReduce 1 architecture.

Compared with MapReduce 1.0, MRv2 splits the JobTracker's two main functions, resource management and job scheduling/monitoring, into separate entities: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single classic MapReduce job or a DAG of such jobs.

The ResourceManager and the NodeManager (NM) running on each slave node form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resource allocation among all the applications in the system. It has two main components: the Scheduler and the ApplicationsManager.

The per-application ApplicationMaster (AM) is, in effect, a framework-specific library. Its job is to negotiate resources with the ResourceManager and to work with the NodeManagers to run and monitor the tasks.

Overall:

 

In summary, there are five independent entities in Hadoop YARN:

  • Client: submits the MapReduce job.
  • YARN ResourceManager: manages and allocates the compute resources of the cluster.
  • YARN NodeManager: launches and monitors the containers (the unit of local compute resource) on its machine.
  • MapReduce application master: coordinates the tasks running under a MapReduce job. The application master and the MapReduce tasks all run in containers, which are scheduled by the ResourceManager (RM) and managed by the NodeManagers (NM).
  • HDFS: used to share job files among the other entities.

Overall:

A. Job submission

Job submission is similar to MapReduce 1.0. When mapreduce.framework.name is set to yarn in the configuration file, the MapReduce 2 implementation of the ClientProtocol interface is activated. The RM generates a new job ID (from the YARN point of view, an application ID; step 2), then the job client computes the input splits and copies the job resources (including the job JAR file, configuration file, and split information) to HDFS (step 3), and finally submits the job to the RM by calling submitApplication() (step 4).
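
A minimal sketch of the same path through the new API; setting mapreduce.framework.name in code is only for illustration (it normally lives in mapred-site.xml), and the mapper/reducer setup is assumed to happen elsewhere:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class YarnSubmit {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.framework.name", "yarn");    // select the YARN runtime

            Job job = Job.getInstance(conf, "word-count");   // hypothetical job name
            job.setJarByClass(YarnSubmit.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // submit() obtains an application ID from the RM, writes the job
            // resources (JAR, configuration, split information) to HDFS, and
            // then calls submitApplication() -- steps 2-4 above.
            job.submit();
        }
    }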

 

B. Job Initialization

The RM receives the submitApplication() call described above and hands the request to its scheduler. The scheduler allocates a container, and the RM then launches the application master process in it, under the management of a NodeManager (steps 5a and 5b). The MapReduce application master (main class MRAppMaster) initializes a number of bookkeeping objects to track the job's progress and to collect the progress and completion reports of the tasks (step 6); it then retrieves the input splits computed by the client.

From here on, things differ from MapReduce 1.0: the application master decides how to run the MapReduce job. If the job is small enough to run in the same JVM and on the same node as the application master itself, it is run in uber mode (see the source code for the exact criteria).
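
A minimal sketch of the properties that control this decision, assuming the Hadoop 2.x names and their documented defaults (verify them against your release's mapred-default.xml):

    import org.apache.hadoop.conf.Configuration;

    public class UberConfig {
        static void allowUberJobs(Configuration conf) {
            conf.setBoolean("mapreduce.job.ubertask.enable", true); // opt in to uber mode
            conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // "small" means at most 9 maps...
            conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // ...and at most 1 reduce
            // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size
        }
    }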

 

C. Task assignment

If the job does not run in uber mode, the application master requests containers from the RM for all the map and reduce tasks. All requests are piggy-backed on heartbeat calls, which also carry other information, such as the data locality of the map inputs and the hosts and racks of the splits. This helps the scheduler make its decisions: it tries to honour data locality, or failing that rack locality, when allocating a container.

Unlike MapReduce 1.0, where the fixed numbers of map and reduce slots limit resource utilization, resources in YARN are more flexible: the minimum and maximum allocations can be set in the configuration file. For example, yarn.scheduler.capacity.minimum-allocation-mb can set the minimum request to 1 GB and yarn.scheduler.capacity.maximum-allocation-mb the maximum to 10 GB, so a task may request memory anywhere between 1 GB and 10 GB.
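
On the job side, the per-task memory request is normally expressed with mapreduce.map.memory.mb and mapreduce.reduce.memory.mb; a minimal sketch (the values are illustrative and must lie between the scheduler's minimum and maximum allocation):

    import org.apache.hadoop.conf.Configuration;

    public class MemoryRequest {
        static void requestTaskMemory(Configuration conf) {
            conf.setInt("mapreduce.map.memory.mb", 2048);    // 2 GB per map container
            conf.setInt("mapreduce.reduce.memory.mb", 4096); // 4 GB per reduce container
        }
    }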

 

D. Task execution

After a container has been assigned to a task, the application master contacts the NodeManager to start the container. The task is executed by a main class called YarnChild, but before it can run the map or reduce task it first localizes the resources it needs (the job configuration, the JAR file, and any files from the distributed cache). PS: YarnChild runs in a dedicated JVM.

Streaming and Pipes run in the same way as in MapReduce 1.0.

 

E. Progress and status updates

When running under YARN, each task and container reports its progress and status to the application master, which maintains an aggregate view of the job. The client polls the application master every second to receive the latest updates; the same information can also be viewed through the Web UI.

F. Job completion

The client also polls every five seconds to check whether the job is complete; this happens inside the waitForCompletion() method of the Job class, which returns only after the job has finished. The polling interval can be set with the configuration property mapreduce.client.completion.pollinterval.
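
A minimal sketch of this call; the explicit poll-interval setting and the job name are illustrative only (the interval is normally left to the configuration file):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class WaitForJob {
        static boolean runAndWait(Configuration conf) throws Exception {
            conf.setLong("mapreduce.client.completion.pollinterval", 5000); // 5 s (the default)
            Job job = Job.getInstance(conf, "example");   // hypothetical job name
            // ... mapper, reducer, and input/output paths would be set here ...
            return job.waitForCompletion(true);           // true = print progress to the console
        }
    }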

 

2. Failure

1) Classic MapReduce --- MapReduce1.0

A. Task failure

First case: the user code in a map or reduce task throws a runtime exception. The child JVM reports the error back to its parent TaskTracker before it exits, and the error is also written to the error log. The TaskTracker marks the task attempt as failed and frees up a slot to run another task.

Second case: the child JVM exits suddenly. The TaskTracker notices that the process has exited and marks the task attempt as failed.

When the JobTracker is notified of a failed task attempt through a heartbeat, it reschedules the task. By default, if a task attempt fails four times it is not retried again (the limit can be changed with mapred.map.max.attempts for map tasks and mapred.reduce.max.attempts for reduce tasks), and the whole job is marked as failed.
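
A minimal old-API sketch; the two JobConf setters write the mapred.map.max.attempts and mapred.reduce.max.attempts properties mentioned above (the value 8 is illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class RetryConfig {
        static void allowMoreAttempts(JobConf conf) {
            conf.setMaxMapAttempts(8);      // default is 4 attempts per map task
            conf.setMaxReduceAttempts(8);   // default is 4 attempts per reduce task
        }
    }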

 

B. TaskTracker failure

If a TaskTracker fails, by crashing or by running very slowly, it stops sending heartbeats to the JobTracker. The JobTracker notices this and removes the TaskTracker from its pool of TaskTrackers eligible for task scheduling.

Even if a TaskTracker has not failed, it may be blacklisted if the number of tasks failing on it is significantly higher than the cluster's average task failure rate. A blacklisted TaskTracker can be removed from the blacklist by restarting it.

 

C. JobTracker failure

JobTracker failure is the most serious case; when it happens, the jobs have to be resubmitted.

 

2). Failures in YARN

A. Task failure

The situation is similar to MapReduce 1.0. When a task attempt fails, the application master is notified and marks the attempt as failed. The job as a whole fails when the task failure rate exceeds mapreduce.map.failures.maxpercent (for map tasks) or mapreduce.reduce.failures.maxpercent (for reduce tasks).

 

B. Application Master failure

Similar to the above, a failed application master is marked as failed. The RM detects the AM failure, instantiates a new AM, and has a NodeManager start it in a new container. If yarn.app.mapreduce.am.job.recovery.enable is set to true, the new AM can recover the state of the tasks already run by the failed AM so that they do not have to be rerun; by default recovery is disabled, and an AM failure causes all of the job's tasks to be rerun.

The client polls the AM for status; if the AM has failed, the client asks the RM for the address of the new AM instance and resumes polling there.

 

C. Node Manager failure

When a NodeManager fails, it stops sending heartbeats to the RM, and the RM removes it from its pool of available nodes. The heartbeat expiry interval can be set with yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms; the default is 10 minutes.

A NodeManager may also be blacklisted if the number of application failures on it is too high; in that case the AM runs subsequent tasks on different nodes.

 

D. ResourceManager failure

RM failure is the most serious: without the RM, the whole system is paralyzed. After a failure, a new RM instance can be brought up from a saved checkpoint of its state. See the Definitive Guide and the Hadoop website for the detailed configuration.

 

 

3. Job Scheduling

1). FIFO Scheduler

This is Hadoop's default scheduler. Jobs run in FIFO order, and a priority can be set for each job; however, the FIFO scheduler does not support preemption, so a higher-priority job submitted later can still be blocked by a lower-priority job that arrived earlier.

2) Fair Scheduler

The goal of the Fair Scheduler is to give every user a fair share of the cluster's resources. When multiple jobs have been submitted, idle task slots are allocated on the principle of "sharing the cluster fairly among users", so a user's short job completes in a reasonable time even while another user's long job is running.

Jobs are placed in pools, and by default each user gets their own pool. A user who submits more jobs than another does not thereby obtain more of the cluster's resources.

In addition, the Fair Scheduler supports preemption: if a pool has not received its fair share of the cluster for a period of time, the scheduler kills tasks in the pools that are running over capacity and gives the freed slots to the pool running under capacity.

3). Capacity Scheduler

With the Capacity Scheduler, the cluster's resources are divided among a number of queues, each with an allocated capacity. Within each queue, jobs are scheduled with FIFO scheduling.

 

4. Shuffle and sort

When a Hadoop job runs, MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort and transfers the map outputs to the reducers is known as the shuffle. In many ways, the shuffle is the heart of MapReduce.

The description below glosses over some details; note that this part of the framework has been reworked in newer Hadoop versions.

1). The map side

Each map task has a circular memory buffer. When the buffer fills up to a certain threshold (80% by default), a spill thread starts writing its contents to disk. Before writing, the spill thread divides the data into partitions corresponding to the reducers they will ultimately be sent to, and within each partition it sorts the data by key (Hadoop 2.0 uses quicksort here). After the spill thread has processed the last batch of map output, a merger combines the spill files into a single partitioned, sorted output file. If a combiner has been set, the combine function runs as well, merging records with the same key locally.
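
A minimal sketch of the map-side knobs just described, using the new API; the io.sort.* names are the classic ones (newer releases alias them as mapreduce.task.io.sort.*), and the combiner line is commented out because it assumes a reducer class defined elsewhere:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class MapSideShuffle {
        static Job setUpMapSide(Configuration conf) throws Exception {
            conf.setInt("io.sort.mb", 200);            // size of the circular in-memory buffer, in MB
            conf.set("io.sort.spill.percent", "0.80"); // start spilling when the buffer is 80% full
            Job job = Job.getInstance(conf, "shuffle-demo");  // hypothetical job name
            job.setPartitionerClass(HashPartitioner.class);   // decides which reducer each record goes to
            // job.setCombinerClass(MyReducer.class);  // optional: merge records with equal keys locally
            return job;
        }
    }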

 

2). The reduce side

When a reduce task runs, its fetch (copier) threads retrieve the relevant partition of each map output over HTTP. After the map outputs have been copied, the reducer starts the sort (more properly, merge) phase, merging the copied files and spilling to the local disk as needed. (PS: in YARN, the data transfer between map and reduce uses Netty and Java NIO; see the source code for details.)

Note that the goal of each merge round is to merge the minimum number of files required so that the final round merges exactly the merge factor. For example, with 40 files and a merge factor of 10, the merge does not simply do four rounds of 10 files each to produce 4 files. Instead, the first round merges only 4 files, and the following three rounds merge the full 10 files each; the 4 merged files plus the 6 files that are still unmerged then make exactly 10 inputs for the final round (see the figure). This does not change the number of rounds; it is an optimization that minimizes the amount of data written to disk, because the final round always feeds straight into the reduce, so the segments (from memory and disk) merged in that round never have to be written out again.
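
A minimal sketch of the first-round sizing rule implied by this example, written to match the behaviour described above rather than copied from Hadoop's merger:

    public class MergePlan {
        // How many files should the first round merge so that the final
        // round merges exactly `factor` segments?
        static int firstRoundSize(int numFiles, int factor) {
            int mod = (numFiles - 1) % (factor - 1);
            return mod == 0 ? factor : mod + 1;
        }

        public static void main(String[] args) {
            // 40 spill files, merge factor 10: first round merges only 4 files,
            // three full rounds of 10 follow, and the final round feeds
            // 4 merged + 6 unmerged = 10 streams straight into the reduce.
            System.out.println(firstRoundSize(40, 10));   // prints 4
        }
    }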

 

The overall flow of data from the map side to the reduce side is as follows:

 

3). Configuration optimization

The general principle of tuning is to give the shuffle process as much memory as possible. On the map side, the best performance is obtained by avoiding multiple spills to disk (see the io.sort.* settings, in particular io.sort.mb). On the reduce side, the best performance is obtained when all of the intermediate data can stay in memory; by default this does not happen, because all of the memory is generally reserved for the reduce function. (To change this, configure mapred.inmem.merge.threshold and mapred.job.reduce.input.buffer.percent.)
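
A minimal sketch of the reduce-side settings for a reduce function with light memory requirements, keeping the map outputs in memory during the reduce (old property names; newer releases use the mapreduce.reduce.* equivalents):

    import org.apache.hadoop.conf.Configuration;

    public class ReduceSideTuning {
        static void keepMapOutputsInMemory(Configuration conf) {
            conf.setInt("mapred.inmem.merge.threshold", 0);            // disable the segment-count merge trigger
            conf.set("mapred.job.reduce.input.buffer.percent", "1.0"); // let map outputs stay in the heap during the reduce
        }
    }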

 

 

 

 

5. Task execution

1). The task execution environment

Hadoop provides information about the runtime environment to map and reduce tasks. For example, a map task can find out the name of the file it is processing. This is done by implementing the configure() method in the mapper or reducer, through which the task is handed the job configuration.

For Streaming programs, Hadoop sets the job configuration parameters as environment variables of the Streaming process.
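
A minimal old-API sketch of the configure() mechanism; the class name EnvAwareMapper is illustrative, and map.input.file is the property holding the path of the file the map task is processing:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class EnvAwareMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private String inputFile;

        @Override
        public void configure(JobConf job) {
            // map.input.file holds the path of the file this map task is processing
            inputFile = job.get("map.input.file");
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            output.collect(new Text(inputFile), value);   // tag each record with its source file
        }
    }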

 

2). Speculative execution

Speculative execution addresses the problem of tasks that run abnormally slowly: Hadoop launches a speculative duplicate for a task whose progress falls noticeably behind the average. If the original task finishes first, the speculative task is killed; conversely, if the speculative task finishes first, the original task is killed.

Speculative execution is enabled by default. The following properties control whether it is turned on:
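
A minimal sketch of turning speculative execution off for a single job, assuming the classic property names (newer releases use mapreduce.map.speculative and mapreduce.reduce.speculative):

    import org.apache.hadoop.conf.Configuration;

    public class SpeculationConfig {
        // Both flags default to true; turn them off for tasks where duplicate
        // execution is undesirable (e.g. tasks with external side effects).
        static void disableSpeculation(Configuration conf) {
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        }
    }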

3). Output Committers

Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed or fail cleanly. The behavior is implemented by an OutputCommitter, which in the new API is obtained from the OutputFormat interface.

The following is the OutputCommitter API:

    public abstract class OutputCommitter {
        public abstract void setupJob(JobContext jobContext) throws IOException;

        public void commitJob(JobContext jobContext) throws IOException {
        }

        public void abortJob(JobContext jobContext, JobStatus.State state)
                throws IOException {
        }

        public abstract void setupTask(TaskAttemptContext taskContext)
                throws IOException;

        public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
                throws IOException;

        public abstract void commitTask(TaskAttemptContext taskContext)
                throws IOException;

        public abstract void abortTask(TaskAttemptContext taskContext)
                throws IOException;
    }

The setupJob() method is called before the job runs. If the job succeeds, the commitJob() method is called; in the default file-based implementation it creates the _SUCCESS marker file in the output directory. If the job fails, abortJob() is called and the job's output files are deleted. Similarly, commitTask() is called when a task completes successfully, and abortTask() is called when it fails, deleting the task attempt's output.

 

 

4). Task JVM Reuse

With JVM reuse enabled, multiple tasks of the same job can run in one JVM, saving the time needed to start and tear down a JVM for every task. It is disabled by default in MapReduce 1 and is not available in MapReduce 2 on Hadoop 2.0.
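
A minimal MapReduce 1 sketch; the JobConf setter writes the mapred.job.reuse.jvm.num.tasks property, where -1 means a JVM may be reused by any number of tasks of the same job:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuse {
        static void enableJvmReuse(JobConf conf) {
            conf.setNumTasksToExecutePerJvm(-1);   // default is 1 (a fresh JVM per task)
        }
    }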

 

5). Skip bad records

Large datasets often contain a few corrupt records. If processing such a record makes the job crash, the whole job fails. If the bad records do not significantly affect the result, we can skip them so that the job can complete.

In general, skipping mode is turned on for a task only after it has failed twice. For a task that keeps failing on a particular record, the TaskTracker (or NodeManager) runs the following sequence of task attempts:

A. The task fails.

B. The task fails again.

C. Skipping mode is turned on. The task fails, but the failed record is recorded by the TaskTracker (or NodeManager).

D. Skipping mode is still enabled. This time the task succeeds, skipping the bad record that caused the previous attempt to fail.

Skipping mode is off by default; it can be enabled per job with the SkipBadRecords class.
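
A minimal sketch of enabling it through SkipBadRecords (the skip limits shown are illustrative; non-zero values switch skipping mode on):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkipBadRecordsConfig {
        static void enableSkipping(JobConf conf) {
            SkipBadRecords.setMapperMaxSkipRecords(conf, 1);    // tolerate 1 bad record per failure window
            SkipBadRecords.setReducerMaxSkipGroups(conf, 1);    // tolerate 1 bad key group
            SkipBadRecords.setAttemptsToStartSkipping(conf, 2); // begin skipping after 2 failed attempts
        }
    }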

 

 

 

 
