MapReduce Working Mechanism

MapReduce task execution process

Figure 5 shows the detailed execution flow of a MapReduce job.

Figure 5 MapReduce job execution flowchart

1. Write the MapReduce code on the client, configure the job, and start the job.

Note that after a MapReduce job is submitted to Hadoop, it enters a fully automated execution process. During this process, the user can only monitor the job's execution status and, if necessary, forcibly terminate it; the user cannot otherwise intervene in how the job runs. Therefore, all required parameters must be configured before the job is submitted.
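
As an illustration, here is a minimal driver sketch using the classic org.apache.hadoop.mapred API described in this article; the job name, input/output paths, and the use of the identity mapper and reducer are placeholder choices, not part of the original text:

    // Minimal driver sketch for the classic (pre-YARN) mapred API.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ExampleDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ExampleDriver.class);
            conf.setJobName("example-job");

            // Everything must be configured here: once the job is running,
            // the user can only monitor it or kill it.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(2);               // mapred.reduce.tasks (see step 6)

            FileInputFormat.setInputPaths(conf, new Path("/example/input"));
            FileOutputFormat.setOutputPath(conf, new Path("/example/output"));

            // Asks the JobTracker for a job ID, copies the JAR, configuration,
            // and input splits to HDFS, calls submitJob(), and then polls the
            // job's progress until it finishes (steps 2-4 below).
            JobClient.runJob(conf);
        }
    }

JobClient.runJob() both submits the job and monitors its progress until completion.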

2. Request a job ID from the JobTracker.

3. Copy the job's resource files.

The JobClient copies the resources required to run the job, including the job JAR file, the configuration file, and the computed input splits, to HDFS, into a directory created specifically for this job and named after the job ID. The JAR file is stored with a high replication factor (10 copies by default, controlled by the mapred.submit.replication property), so that many copies are available across the cluster for TaskTrackers to access when running the job. The input split information tells the JobTracker how many map tasks should be started for the job.

4. Submit a job

The JobClient calls the submitJob() method on the JobTracker object to submit the job and tell the JobTracker that the job is ready to run.

5. Initialization

After receiving the submitJob() call, the JobTracker places it in an internal queue and hands it to the job scheduler for scheduling and initialization.

6. Retrieve the input splits

To build the list of tasks to run, the job scheduler first obtains the input split information computed by the JobClient from the shared file system and creates one map task for each split. The number of reduce tasks is determined by the mapred.reduce.tasks property of the JobConf, and the scheduler then creates that many reduce tasks to run.

7. The JobTracker assigns tasks.

Communication between the TaskTracker and the JobTracker, including task assignment, happens through the heartbeat mechanism.

The TaskTracker runs a simple loop that periodically sends a heartbeat to the JobTracker, telling the JobTracker that it is still alive and whether it is ready to run new tasks (that is, whether its numbers of running map and reduce tasks are below their upper limits). If it is ready, the JobTracker assigns it a task and returns the assignment information in the heartbeat response. When the TaskTracker finds a new task in the returned heartbeat information, it adds the map or reduce task to the corresponding task slot.

Note that when the JobTracker assigns a map task to a TaskTracker, it localizes the map task's data as far as possible to save network bandwidth: based on the TaskTracker's network location, it chooses an input split whose data is closest to that TaskTracker. In the best case, the split's data is on the TaskTracker's own node.

8. The TaskTracker retrieves the job resources.

The TaskTracker copies the data, configuration information, and program code required to run the task from HDFS to its local disk.

9. Launch the task

10. Run

After the TaskTracker has obtained the job resources, it creates a local working directory for the job and unpacks the contents of the JAR file into that directory. It then creates a new TaskRunner instance to run the task; the TaskRunner launches a new JVM to run each task.

11. Write the output to HDFS.

The above is the overall process by which MapReduce completes a job. However, it does not show how the client learns the job's running progress and status, which is briefly described below.

Each task of a MapReduce job has a set of counters that count the progress events during task execution. When a task needs to report progress, it sets a flag indicating that a status change should be sent to the TaskTracker; a separate listening thread detects the flag and notifies the TaskTracker of the task's current status. Meanwhile, the TaskTracker encapsulates the task status in the heartbeat it sends to the JobTracker every five seconds to report how task execution is going. Through this heartbeat mechanism, the statistics of all TaskTrackers are collected at the JobTracker, which combines them into a global view of all running jobs and their statuses. Finally, the JobClient obtains the latest job progress by polling the JobTracker.
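
To illustrate the reporting side, here is a minimal sketch with the classic mapred API; the counter group "Example" and counter name "EMPTY_LINES" are made up for this example. A map function can update counters and report progress through the Reporter handle it receives:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ProgressAwareMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            if (value.getLength() == 0) {
                // Counter updates are aggregated at the JobTracker via the heartbeats.
                reporter.incrCounter("Example", "EMPTY_LINES", 1);
            }
            output.collect(value, new LongWritable(1));

            // Marks the task as making progress; the TaskTracker notices the
            // status change and forwards it to the JobTracker in its heartbeat.
            reporter.progress();
        }
    }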

When the JobTracker receives notification that the last task of the job has completed, it sets the job status to "successful". When the JobClient next queries the status, it learns that the job has completed and prints a message to notify the user. Finally, the JobTracker cleans up its state for the job and instructs the TaskTrackers to do the same (for example, by deleting the intermediate map output).

Note that a map task writes its result to the local disk instead of HDFS. The map output is only an intermediate result that still has to be processed by the reduce tasks; once processed, it is no longer needed and is usually deleted. Data on HDFS is replicated by default, so storing the intermediate output there would waste storage and network bandwidth.

Error Handling Mechanism

Faults that occur during MapReduce job execution fall into two categories: hardware faults and task execution failures.

Hardware faults

In a Hadoop cluster there is only one JobTracker, so the JobTracker itself is a single point of failure. How can this be addressed? One approach is a master-slave deployment: start one or more standby JobTracker nodes alongside the JobTracker master node. When the master node fails, an election algorithm selects a new master from the standby JobTracker nodes.

Besides a JobTracker failure, the other kind of machine fault is a TaskTracker failure. TaskTracker faults are relatively common, and MapReduce generally handles them by re-executing the affected tasks.

In a Hadoop cluster, under normal circumstances the TaskTracker constantly communicates with the JobTracker through the heartbeat mechanism. If a TaskTracker fails or runs very slowly, it stops sending heartbeats to the JobTracker, or sends them only rarely. If a TaskTracker has not communicated with the JobTracker for a certain period (one minute by default), the JobTracker removes it from the set of TaskTrackers waiting for task scheduling and requires that the tasks on that TaskTracker be rescheduled immediately. If the job is still in the map phase, the JobTracker asks other TaskTrackers to re-execute all map tasks originally run by the faulty TaskTracker. If the job is in the reduce phase, the JobTracker asks other TaskTrackers to re-execute only the reduce tasks that the faulty TaskTracker had not completed. For example, if a TaskTracker had completed two of its three assigned reduce tasks, only the third, unfinished reduce task needs to be re-executed, because a completed reduce task has already written its data to HDFS. For map tasks, however, even the completed ones must be re-executed: their output lives on the failed node's local disk, so the reduce tasks may no longer be able to fetch it.

Task failures

In practice, MapReduce jobs also encounter task failures caused by defects in user code or by process crashes. A defect in user code may cause it to throw an exception during execution; in that case the task JVM process exits automatically, sends an error message to the TaskTracker parent process, and writes the error to the log file, and the TaskTracker marks that task attempt as failed. If a task fails because its process crashed, the TaskTracker's listener notices that the process has exited and likewise marks the task attempt as failed. For a program stuck in an endless loop or running for too long, the TaskTracker receives no progress updates, so it marks the task attempt as failed and kills the corresponding process.

In each of the cases above, after the TaskTracker marks a task attempt as failed, it decrements its running-task counter so that it can ask the JobTracker for a new task, and it reports the local failure to the JobTracker through the heartbeat mechanism. Upon receiving the failure notification, the JobTracker resets the task status and adds the task back to the scheduling queue so that it can be assigned again (the JobTracker tries to avoid re-assigning a failed task to the TaskTracker on which it failed). If a task has been attempted four times (the number is configurable) and still has not completed, it is not retried again and the entire job fails.
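
For reference, the retry limit can be adjusted per job; a sketch, where conf is the JobConf from the driver sketch earlier and 6 is only an example value:

    // Equivalent to setting mapred.map.max.attempts and mapred.reduce.max.attempts.
    conf.setMaxMapAttempts(6);       // allow up to 6 attempts per map task
    conf.setMaxReduceAttempts(6);    // allow up to 6 attempts per reduce task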

Job Scheduling Mechanism

Before version 0.19.0, user jobs on a Hadoop cluster were scheduled with a first-in-first-out (FIFO) algorithm: jobs ran in the order they were submitted, and each running job used the entire cluster, so a job could use the cluster's services only when its turn came. Although the FIFO scheduler supports priorities, it does not support priority preemption, so this single-user scheduling algorithm does not fit the goal of using parallel computing in the cloud to serve many users. Starting from version 0.19.0, Hadoop provides schedulers that can serve multiple users at the same time and share cluster resources fairly, namely the Fair Scheduler and the Capacity Scheduler.

 

Fair Scheduler

Fair scheduling is a way of allocating resources to jobs. Its goal is to let all submitted jobs obtain an equal share of cluster resources over time, so that users share the cluster fairly. Concretely, when only one job is running in the cluster, it uses the entire cluster; when other jobs are submitted, the system allocates the idle task slots of the TaskTracker nodes to the new jobs and ensures that each job gets a roughly equal amount of CPU time.

The fair scheduler organizes jobs into pools and divides resources fairly among the pools. By default each user has an independent pool, so every user obtains an equal share of cluster resources regardless of how many jobs they submit. Within each pool, capacity is shared fairly among the running jobs. In addition to fair sharing, the fair scheduler lets you set a minimum share for a pool, guaranteeing that specific users, groups, or production applications always receive sufficient resources. A pool with a minimum share receives at least that share whenever it contains jobs; if the minimum share exceeds what the pool's jobs need, the excess is split among the other pools.

Under normal operation, when a new job is submitted, the fair scheduler waits for tasks in the running jobs to finish and gives the freed slots to the new job. The fair scheduler also supports preemption: if a new job does not receive its fair share within a configurable timeout, the scheduler allows it to preempt tasks of running jobs in order to obtain the resources it needs. A job may also preempt if the resources it has received within the timeout are less than half of its fair share. When choosing which tasks to kill, the fair scheduler picks the most recently started tasks among all running tasks, so that the wasted computation is relatively small. Because Hadoop jobs tolerate losing tasks, preemption does not cause the preempted job to fail; it merely takes longer to finish.

Finally, the fair scheduler can limit the number of concurrent jobs per user and per pool. This limit is useful when a user submits hundreds of jobs at once, or in general when many jobs run concurrently, to prevent intermediate data from filling up the cluster's disk space. Jobs beyond the limit wait in the scheduler's queue until earlier jobs finish; the fair scheduler then chooses which waiting job to run based on job priority and submission time.

 

Capacity Scheduler

The capacity scheduler divides resources by queue. Each queue has a lower and an upper limit on resource usage, and a resource usage limit can be set for each user. Spare resources of one queue can be shared with other queues and reclaimed when needed. Administrators can restrict the resource usage of a single queue, user, or job. Resource-intensive jobs are supported, and such jobs can be allocated multiple slots (a distinguishing feature of this scheduler). Job priorities are supported, but resource preemption is not.

The relationship between users, queues, and jobs should be clarified here. Hadoop manages resources by queue, each queue is allocated a certain amount of resources, and a user can submit jobs only to one or more specific queues. Queue management covers two aspects:

1. User permission management. Hadoop's user management module is based on the mapping between operating system users and user groups, allowing an operating system user or user group to correspond to one or more queues; an administrator user can also be configured for each queue. Queue information, including the queue names and whether permission management is enabled, is configured in the mapred-site.xml file and does not support dynamic reloading. Queue permissions are configured in the mapred-queue-acls.xml file, which specifies the permissions of a user or user group within a queue, including the permission to submit jobs and the permission to manage jobs.

2. System resource management. Administrators can configure the resources available to each queue and each user, which gives the scheduler its basis for decisions. This information is configured in the scheduler's own configuration file (such as capacity-scheduler.xml).
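
As an illustration, a user directs a job to a queue from the job configuration; in this sketch conf is the JobConf from the driver sketch earlier, and the queue name "research" is hypothetical and would have to be defined by the administrator in mapred-site.xml:

    // Equivalent to setting mapred.job.queue.name; the default queue is named "default".
    conf.setQueueName("research");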

The configuration of these two schedulers is not covered in detail here; the reference links at the end of this article describe the configuration methods if you are interested.

Shuffle and sorting

In the MapReduce process, so that the reduce tasks can process the map results in parallel, the map output must be partitioned and sorted before it is delivered to the corresponding reduce task. This process of sorting the map output and handing it to reduce is called shuffle (the part inside the red box of the MapReduce flowchart). The shuffle process is the core of MapReduce, and its performance directly determines the performance of the whole MapReduce job.

In general, the shuffle process spans both the map side and the reduce side. On the map side, the map results are partitioned, sorted, and spilled to disk; outputs belonging to the same partition are then merged and written to disk, and the results are sent to the corresponding reduce tasks according to their partitions. The reduce side merges the outputs for its partition received from the different maps, sorts the merged results, and finally hands them to the reduce function.
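
The partition step decides which reduce task receives each map output record; the default is a hash partitioner. Below is a minimal sketch of a custom partitioner with the classic mapred API, assuming map output keys of type Text and values of type LongWritable (placeholder choices for this example):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstCharPartitioner implements Partitioner<Text, LongWritable> {

        public void configure(JobConf job) {
            // Nothing to configure in this example.
        }

        // Records that receive the same partition number are merged and sorted
        // together and handed to the same reduce task.
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0;
            }
            return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be registered in the driver with conf.setPartitionerClass(FirstCharPartitioner.class).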

The following link explains the shuffle process in detail through an example, so it is not elaborated further here.

http://www.slideshare.net/snakebbf/hadoop-mapreduce-12716482

Shuffle process optimization

This section briefly describes how the shuffle process can be optimized through Hadoop parameter configuration. In a task, I/O generally consumes the most time. On the map side, a spill to disk happens during the shuffle stage whenever the in-memory buffer exceeds its threshold; setting the io.sort.* properties appropriately, in particular increasing io.sort.mb, reduces the number of spills. On the reduce side, keeping the copied map outputs directly in memory can also improve performance, because it saves up to two I/O passes for that data (provided enough memory is left for the reduce task to execute). Therefore, when the reduce function needs very little memory, setting mapred.inmem.merge.threshold to 0 and mapred.job.reduce.input.buffer.percent (0.0 by default) to 1.0 (or a slightly lower value) reduces I/O and improves shuffle performance.
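
In code form, the tuning described above can be applied per job; this is a sketch only, the values are illustrative, and conf is the JobConf from the driver sketch earlier:

    // Map side: a larger sort buffer means fewer spills to disk.
    conf.setInt("io.sort.mb", 200);                                  // default is 100 (MB)
    // Reduce side: keep copied map outputs in memory when the reduce
    // function itself needs little memory.
    conf.setInt("mapred.inmem.merge.threshold", 0);
    conf.setFloat("mapred.job.reduce.input.buffer.percent", 1.0f);   // default is 0.0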

Task execution strategies

This section introduces some of the strategies Hadoop uses when executing tasks, to give a deeper view of the execution details of MapReduce tasks.

Speculative execution

Speculative execution means that once all the tasks of a job have started running, the JobTracker computes the average progress of all tasks. If the TaskTracker node where a task runs is slower than the overall average, for example because of a low hardware configuration or high CPU load, the JobTracker starts a backup copy of that task on another node. Whichever of the original task and the backup task finishes first is kept, and the other is killed. This is why the JobTracker web page sometimes shows tasks that were killed even though the job completed successfully.

MapReduce splits a job into many small tasks and processes them in parallel to improve efficiency, so that the overall job execution time is less than it would be with sequential execution. Obviously, a slow task then becomes the bottleneck: a single straggler can greatly extend the completion time of the entire job. Speculative execution exists to avoid this situation.

Speculative execution is enabled by default. It has an obvious limitation: if a task is slow because of a defect in the code, the backup task started by this mechanism has the same defect and does not solve the problem. In addition, because speculative execution starts extra tasks, it inevitably increases the load on the cluster. Speculative execution can be enabled or disabled separately for map and reduce tasks through the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution properties.
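
For example, speculative execution can be switched off per job; a sketch, where conf is the JobConf from the driver sketch earlier:

    // Equivalent to mapred.map.tasks.speculative.execution and
    // mapred.reduce.tasks.speculative.execution, both true by default.
    conf.setMapSpeculativeExecution(false);
    conf.setReduceSpeculativeExecution(false);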

Task JVM reuse

Both map tasks and reduce tasks run in Java virtual machines (JVMs) on TaskTracker nodes. When a TaskTracker is assigned a task, it normally starts a new JVM locally to run it. For jobs with a large number of small input files, starting a fresh JVM for every map task clearly leaves room for improvement: if, after a very short task completes, the JVM is reused for subsequent tasks, the start-up time of a new JVM is saved. This is called task JVM reuse. Note that although multiple tasks may run on the same TaskTracker at the same time, each running task still has its own independent JVM.

The property controlling JVM reuse is mapred.job.reuse.jvm.num.tasks, which defines the maximum number of tasks run by a single JVM. Its default value is 1, meaning each JVM runs one task. Setting the property to a value greater than 1 enables JVM reuse; setting it to -1 means the number of tasks sharing a JVM is unlimited.
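
A sketch of enabling JVM reuse from the job configuration, where conf is the JobConf from the driver sketch earlier and 10 is only an example value:

    // Sets mapred.job.reuse.jvm.num.tasks; -1 means an unlimited number of tasks per JVM.
    conf.setNumTasksToExecutePerJvm(10);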

Skipping bad records

MapReduce jobs process very large datasets, and when writing a MapReduce program the user may not anticipate every data format and field in the dataset (especially bad records). User code may therefore crash when processing a particular record. Even though MapReduce has an error handling mechanism, a failure caused by such a code defect recurs every time the task is re-executed, so after four attempts (the default maximum) the task, and eventually the whole job, fails. Re-running a task that fails because of bad data is therefore pointless. On the other hand, locating the bad record in a huge dataset and then adding handling code or removing the record is very difficult, and there is no guarantee that other bad records do not exist. The best approach is to skip the bad record during the current task attempt (given the huge dataset, ignoring such a small number of bad records is acceptable) and continue running. This is Hadoop's skipping mode.

When skipping mode is enabled, if a task fails twice in a row, it reports to the TaskTracker which record it was processing; the TaskTracker then reruns the task and skips the record reported by the previous attempt. Skipping mode detects and ignores only one bad record per task attempt, so it is suitable only for occasional bad records. Increasing the maximum number of task attempts (mapred.map.max.attempts and mapred.reduce.max.attempts) increases the number of bad records that skipping mode can detect and ignore. Skipping mode is disabled by default; the SkipBadRecords class enables it for map and reduce tasks separately.
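
A sketch of turning on skipping mode from the job configuration, where conf is the JobConf from the driver sketch earlier and the numbers are illustrative:

    // Uses org.apache.hadoop.mapred.SkipBadRecords.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);   // begin skipping after 2 failed attempts
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);      // tolerate at most 1 bad record per map task
    SkipBadRecords.setReducerMaxSkipGroups(conf, 1);      // tolerate at most 1 bad group per reduce task
    conf.setMaxMapAttempts(8);                            // extra attempts give skipping mode room to work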

Task execution environment

How do map and reduce tasks know their execution environment and required parameters at run time? After the JobTracker assigns a task to a TaskTracker, the TaskTracker copies the required resources from HDFS to the local machine and runs the mapper or reducer task as a child process in a separate JVM. The execution environment of a map or reduce task is therefore inherited directly from its parent TaskTracker. Figure 7 lists the local parameters used when each task is executed.

Figure 7 Local parameters of a task

When a job starts, the TaskTracker creates the job's directories and a local cache according to the configuration file. The TaskTracker's local directory is ${mapred.local.dir}/taskTracker/, and it contains two subdirectories:

• The distributed cache directory of the job, whose path appends archive/ to the local directory.

• The local job directory, whose path appends jobcache/$jobid to the local directory. It holds the shared directory for job execution (a space each task can use as scratch space and for sharing files between tasks, exposed to the user through the job.local.dir parameter), the directory containing the job's JAR file and its unpacked contents, an XML file with the localized job configuration, and one directory per task ID. Each task directory in turn contains the task's localized configuration file, the output directory for intermediate results, the task's current working directory, and the task's temporary directory.

Note that multiple instances of the same task must not try to write to the same file. That would cause two problems: first, if a task fails and is retried, the old file of the first attempt would have to be deleted first; second, with speculative execution, the two instances of the same task would write to the same file. Hadoop solves this by writing task output to a temporary folder specific to the task attempt, ${mapred.output.dir}/_temporary/${mapred.task.id}. If the task attempt succeeds, the contents of this directory (the task output) are copied to the job's output directory (${mapred.output.dir}); if it fails and is retried, the output of the first attempt is simply discarded. Likewise, during speculative execution the backup task and the original task have different working directories and different temporary output folders; only the task that completes first promotes the contents of its working directory to the output directory, and the other task's working directory is discarded.
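
For a task that writes extra output files directly, the safe place is this temporary working directory. A sketch with the classic mapred API, meant to run inside a map or reduce method, where job is the task's JobConf and the file name is made up:

    // Uses org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, and
    // org.apache.hadoop.mapred.FileOutputFormat.
    // Resolves to ${mapred.output.dir}/_temporary/${mapred.task.id}; files written here
    // are promoted to the job output directory only if this task attempt succeeds.
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    FileSystem fs = workDir.getFileSystem(job);
    fs.create(new Path(workDir, "side-file.txt")).close();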

Reference: Hadoop, 2nd edition

 
