The working process of the MapReduce program


Transferred from: http://www.aboutyun.com/thread-15494-1-2.html

Guiding questions:
1. What is the structure of the HDFS framework?
2. What is the read and write process for HDFS files?
3. What is the structure of the MapReduce framework?
4. What is the working principle of MapReduce?
5. What are the shuffle stage and the sort stage?

I remember that 2.5 years ago we set up a Hadoop pseudo-distributed cluster, installed Eclipse, and ran wordcount.java successfully; after that, my pace of learning Hadoop slowed down considerably, and I believe many readers have had the same experience. I have never been very clear about how an MR program works (specifically in the Hadoop 1.x version), so here I focus on summarizing the basics needed for MR programming. Since MapReduce operates on top of HDFS, in order to understand MapReduce (which solves the problem of distributed computing) in depth, the first step is to understand HDFS (which solves the problem of distributed storage).

I. HDFS Framework Composition
HDFS uses a master/slave architecture: an HDFS cluster consists of one NameNode (the master node) and multiple DataNodes (slave nodes), and it provides an access interface for applications. The NameNode, DataNodes, and client are explained below; a minimal client sketch follows the list:

    • The NameNode is responsible for managing and maintaining the file system namespace, for controlling client file operations (such as opening, closing, and renaming files or directories), and for managing and allocating concrete storage tasks (such as determining the mapping of data blocks to specific DataNodes);
    • The DataNodes handle the file system clients' read and write requests and provide the actual storage of file data;
    • The client generally refers to an application that accesses HDFS through its interface, or to the HDFS web service (which lets users check the health of HDFS through a browser), etc.
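
For concreteness, here is a minimal sketch, assuming the Hadoop 1.x Java API and a hypothetical /user/hadoop directory, of a client talking to HDFS through the FileSystem abstraction; metadata calls such as exists() and listStatus() are answered by the NameNode, while actual file contents would come from DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads fs.default.name from core-site.xml
            FileSystem fs = FileSystem.get(conf);       // client handle backed by the NameNode
            Path dir = new Path("/user/hadoop");        // hypothetical directory
            if (fs.exists(dir)) {
                for (FileStatus status : fs.listStatus(dir)) {
                    // metadata (path, length) comes from the NameNode
                    System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
                }
            }
        }
    }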


1. Reading a file
The read flow, in which the client interacts with the NameNode and DataNodes, is as follows (a minimal read sketch follows the list):

    • The client initiates an RPC request to the remote NameNode; (1)
    • The NameNode returns some or all of the block list for the file; for each block it also returns the addresses of the DataNodes that hold a copy of that block; (2)
    • The client chooses the DataNode closest to it to read each block; if the client itself is a DataNode, the data is read directly from the local node; (3)
    • After reading the current block, the client closes the connection to the current DataNode and finds the nearest DataNode for the next block; (4)
    • When the block list has been read but the file is not yet finished, the client asks the NameNode for the next batch of blocks; (5)
    • Each block is checksum verified; if an error occurs while reading from a DataNode, the client notifies the NameNode and continues reading that block from the next nearest DataNode holding a replica. When the client has finished reading, it closes the data stream. (6)
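
As a sketch of the read path, the following Hadoop 1.x snippet (the file path is a made-up example) opens a file and streams it to stdout; the NameNode lookup and the block-by-block reads from DataNodes described above all happen inside open() and the subsequent reads:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // RPC handle to the NameNode
            InputStream in = null;
            try {
                in = fs.open(new Path("/user/hadoop/input/sample.txt")); // hypothetical path
                IOUtils.copyBytes(in, System.out, 4096, false);          // blocks fetched from DataNodes
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }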


2. Writing a file
The write flow, in which the client interacts with the NameNode and DataNodes, is as follows (a minimal write sketch follows the list):

    • The client initiates an RPC request to the remote NameNode; (1)
    • The NameNode checks whether the file to be created already exists and whether the creator has permission to operate on it; if the checks pass, the file is created, otherwise an exception is thrown to the client; (2)
    • When the client begins writing the file, the client library (i.e. DFSOutputStream) splits the file into packets and writes them to an internal "data queue"; it then asks the NameNode for a new block and obtains a list of suitable DataNodes to store the replicas (3 by default); the size of each list depends on the replication setting on the NameNode; (3)
    • Packets are written in a streaming fashion to the first DataNode, which forwards them to the next DataNode in the pipeline, and so on to the last DataNode, so the data is written in a pipelined manner (assuming a replication factor of 3, the pipeline consists of 3 DataNodes, i.e. a pipeline of DataNodes); (4)
    • When the last DataNode has stored a packet, it returns an acknowledgement packet that is passed back up the pipeline to the client; the client library (i.e. DFSOutputStream) maintains an "ack queue" and removes the corresponding packet from it once acknowledgements have been received from all DataNodes; (5)
    • If a DataNode fails, it is removed from the current pipeline, the remaining data continues to be written through the remaining DataNodes in pipelined fashion, and the NameNode assigns a new DataNode so as to maintain the configured replication count. When the client has finished writing, it closes the data stream. (6)
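
A corresponding write sketch (again with a made-up path) is shown below; the packet queue, the DataNode pipeline, and the acknowledgement handling described above are all managed internally by the output stream returned from create():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);                   // RPC to the NameNode
            FSDataOutputStream out =
                    fs.create(new Path("/user/hadoop/output/hello.txt")); // hypothetical path
            try {
                // data is split into packets and pushed down the DataNode pipeline
                out.write("hello hdfs\n".getBytes("UTF-8"));
            } finally {
                out.close();   // waits for outstanding acknowledgements
            }
        }
    }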


Note: the default HDFS block size is 64 MB, and HDFS provides two special file formats, SequenceFile and MapFile.
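
As an illustration of one of these formats, the following sketch (the path and records are invented for the example) writes a few key/value pairs into a SequenceFile using the Hadoop 1.x API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/hadoop/data/example.seq");   // hypothetical path
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, path, IntWritable.class, Text.class);
            try {
                for (int i = 0; i < 5; i++) {
                    // each record is a key/value pair
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            } finally {
                writer.close();
            }
        }
    }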

II. MapReduce Framework Composition
The main components of the MapReduce framework and their relationship to each other are as follows:
The process above involves 4 entities, whose functions are as follows:

    • Client: submits MapReduce jobs, e.g. a written MR program or a command executed from the CLI;
    • JobTracker: coordinates the running of jobs; in essence, a manager;
    • TaskTracker: runs the tasks into which the job has been divided; in essence, a worker;
    • HDFS: an abstract file system used for shared storage across the cluster.


Intuitively, the NameNode is a metadata repository, much like the registry in Windows. The SecondaryNameNode can be seen as a backup of the NameNode, and the DataNodes can be seen as the storage for the partitioned job data. In a typical 3-node Hadoop distributed cluster, the master runs the NameNode, SecondaryNameNode, and JobTracker, and the other 2 slaves each run a TaskTracker and a DataNode; the TaskTrackers need to run on the DataNodes of HDFS.
The classes or processes mentioned above function as follows:

  • Mapper and Reducer
    The core components of a Hadoop-based MapReduce application are a Mapper implementation, a Reducer implementation, and a driver program that creates the JobConf.
  • JobTracker
    The JobTracker is the master and is normally deployed on a separate machine. It receives jobs, is responsible for assigning each of a job's subtasks (tasks) to run on TaskTrackers, monitors them, and restarts any task that fails.
  • TaskTracker
    The TaskTracker is a slave service running on multiple nodes. It actively communicates with the JobTracker through heartbeats, receives tasks, and is responsible for executing each of them.
  • JobClient
    After the client submits a job, the JobClient uploads some files to HDFS, such as the job's jar package (containing the application and its configuration parameters), and submits their path to the JobTracker. The JobTracker then creates each task (i.e. MapTask and ReduceTask) and dispatches them to the TaskTrackers for execution.
  • JobInProgress
    After the JobClient submits a job, the JobTracker creates a JobInProgress to track and schedule the job and adds it to the job queue. Based on the input data set defined in the submitted job jar (broken down into FileSplits), the JobInProgress creates a corresponding batch of JobTracker-side TaskInProgress objects for monitoring and scheduling the tasks.
  • TaskInProgress
    The JobTracker launches each task through one of these TaskInProgress objects by serializing the Task object (i.e. a MapTask or ReduceTask) to the appropriate TaskTracker; the TaskTracker then creates its own corresponding TaskInProgress for monitoring and scheduling that MapTask or ReduceTask.
  • MapTask and ReduceTask
    The Mapper reads the <key1, value1> pairs from the input data defined in the job jar and generates intermediate <key2, value2> pairs. If a combiner is defined, the MapTask calls it after the Mapper finishes to merge the values of identical keys and reduce the output size. Once all MapTasks have finished, the ReduceTask process calls the Reducer to produce the final <key3, value3> results; the detailed process can be found in [4]. A minimal Mapper/Reducer sketch follows this list.
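
The sketch below shows a minimal WordCount-style Mapper and Reducer written against the Hadoop 1.x (org.apache.hadoop.mapred) API; the class names are illustrative, and it follows the <key1, value1> -> <key2, value2> -> <key3, value3> flow described above:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountClasses {

        // <key1,value1> = <byte offset, line of text>; emits <key2,value2> = <word, 1>
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // receives <key2, list(value2)> after the shuffle; emits <key3,value3> = <word, count>
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }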


III. How MapReduce Works
The working process of an entire MapReduce job is as follows:
1. Job submission
The JobClient's submitJob() method implements the job submission process as follows (a minimal driver sketch follows this list):

    • Request a new job ID from the JobTracker via getNewJobId(); (2)
    • Check the output specification of the job (for example, an exception is thrown if no output directory is specified or the output directory already exists);
    • Compute the input splits for the job (an exception is thrown if the splits cannot be computed, e.g. because the input path does not exist);
    • Copy the resources needed to run the job (such as the job jar file, the configuration file, and the computed input splits) to a directory named after the job ID (multiple replicas are kept in the cluster for TaskTrackers to access); (3)
    • Tell the JobTracker that the job is ready for execution by calling its submitJob() method. (4)
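
The following driver sketch (the input/output paths are placeholders, and WordCountClasses refers to the Mapper/Reducer sketch above) shows what client-side job submission looks like with the Hadoop 1.x JobConf/JobClient API; JobClient.runJob() performs the submission steps listed above and then polls progress until the job finishes:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            conf.setMapperClass(WordCountClasses.Map.class);
            conf.setCombinerClass(WordCountClasses.Reduce.class);  // optional combiner
            conf.setReducerClass(WordCountClasses.Reduce.class);

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/wc/in"));   // hypothetical
            FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/wc/out")); // hypothetical

            // submits the job to the JobTracker and polls progress until completion
            JobClient.runJob(conf);
        }
    }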


2. Job initialization

    • When the JobTracker receives the call to its submitJob() method, it places the call into an internal queue that is handled by the job scheduler (such as the FIFO scheduler, the Capacity Scheduler, or the Fair Scheduler); (5)
    • Initialization mainly consists of creating an object to represent the running job, which encapsulates its tasks and bookkeeping information so that the status and progress of each task can be tracked; (5)
    • To create the list of tasks to run, the job scheduler first retrieves from HDFS the input split information computed by the JobClient (6), then creates one MapTask per split and the configured number of ReduceTasks. (Tasks are assigned IDs at this point; note the distinction between the job ID and the task IDs.)


3. Task assignment

    • TaskTrackers communicate with the JobTracker regularly through "heartbeats", mainly to tell the JobTracker that they are still alive and whether they are ready to run new tasks; (7)
    • Before the JobTracker selects a task for a TaskTracker, it must first select, via the job scheduler, the job the task belongs to;
    • A TaskTracker has a fixed number of task slots for MapTasks and ReduceTasks (the exact number is determined by the TaskTracker's core count and memory size). The JobTracker fills the TaskTracker's MapTask slots first and then assigns ReduceTasks to it;
    • For a MapTask, the JobTracker chooses a TaskTracker that is closest (by network distance) to the task's input split file. For a ReduceTask, data locality is not considered, so there is no particular criterion for choosing a TaskTracker.


4. Task execution

    • After a TaskTracker is assigned a task, it copies the job's jar file from HDFS to the TaskTracker's local file system (the jar is localized so a JVM can be started). The TaskTracker also copies any files the application needs from the distributed cache to the local disk; (8)
    • The TaskTracker creates a new local working directory for the task and unpacks the contents of the jar file into this directory;
    • The TaskTracker starts a new JVM (9) to run each task (MapTask or ReduceTask), so that the client's MapReduce code cannot affect the TaskTracker daemon (for example, by causing it to crash or hang);
    • The child process communicates with the parent process through the umbilical interface, informing the parent of its progress every few seconds until the task completes.


5. Progress and status updates
A job and each of its tasks carry status information, including the run state of the job or task, the progress of the map and reduce phases, counter values, and status messages or descriptions (which can be set by user code). How does this status information change while the job runs, and how is it communicated back to the client?

    • While a task is running, it keeps track of its progress (i.e. the percentage of the task completed). For a MapTask, progress is the proportion of the input that has been processed. For a ReduceTask, the situation is slightly more complex, but the system still estimates the proportion of the reduce input that has been processed;
    • This information is propagated from the child JVM -> TaskTracker -> JobTracker at regular intervals. The JobTracker aggregates it into a global view of all running jobs and their task statuses, which can be viewed through the web UI. Meanwhile, the JobClient polls the JobTracker every second for updates and prints them to the console. A small sketch of user-set counters and status messages follows this list.
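
As a small illustration, the Mapper sketch below (the types and counter names are made up for the example) uses the Hadoop 1.x Reporter interface to bump a counter and set a status message; these values travel the child JVM -> TaskTracker -> JobTracker path described above and appear in the web UI:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReportingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            if (value.toString().trim().isEmpty()) {
                // counter values are aggregated by the TaskTracker and JobTracker
                reporter.incrCounter("ReportingMapper", "EMPTY_LINES", 1);
                return;
            }
            // a human-readable status message attached to this task attempt
            reporter.setStatus("processing offset " + key.get());
            output.collect(new Text(value.toString()), new IntWritable(1));
        }
    }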


6. Job completion
When the JobTracker receives notification that a job's last task has completed, it sets the job's status to "successful". When the JobClient next polls for the status, it learns that the job has finished successfully, prints a message to inform the user, and finally returns from the runJob() method.

IV. Shuffle Stage and Sort Stage
The shuffle stage is the process that begins with the map output; it includes the system sorting that output and transferring it to the reduce side as input. The sort stage refers to sorting the keys output by the map side. Different maps may output the same key, and identical keys must be sent to the same reduce for processing. The shuffle stage can be divided into the map-side shuffle and the reduce-side shuffle. The shuffle and sort stages work as follows:
The previous section described how MapReduce works from the perspective of its physical entities; this section describes how it works from the perspective of its logical entities:
1. Map-side shuffle

    • When the map function starts producing output, it does not simply write the data to disk, because frequent disk operations would seriously degrade performance. The process is more involved: the data is first written to a buffer in memory and pre-sorted there to improve efficiency;
    • Each MapTask has a circular memory buffer into which it writes its output (100 MB by default). When the amount of data in the buffer reaches a threshold (80% by default), a background thread starts writing the contents of the buffer to disk (the spill phase). While the spill is in progress, map output continues to be written into the buffer, but if the buffer fills up during this time the map blocks until the spill completes;
    • Before writing to disk, the thread first divides the data into partitions corresponding to the reducers the data will ultimately be sent to. Within each partition, the background thread sorts by key (quicksort), and if a combiner (i.e. a "mini reducer") is defined, it is run on the sorted output (a custom partitioner sketch follows this list);
    • A new spill file is created each time the memory buffer reaches the spill threshold, so by the time the MapTask writes its last output record there may be several spill files. Before the MapTask completes, the spill files are merged into a single index file and a single data file (the sort stage);
    • After the spill files are merged, the map deletes all temporary spill files and informs the TaskTracker that the task is complete. As soon as one MapTask finishes, the ReduceTasks start copying its output (the copy phase);
    • The map's output file is placed on the local disk of the TaskTracker that ran the MapTask, and it is the input required by the TaskTracker that runs the ReduceTask; the reduce output, however, is not kept locally and is generally written to HDFS (the reduce phase).
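
The sketch below shows what a custom partitioner can look like with the Hadoop 1.x API; the "first letter" rule is invented for illustration (the default is HashPartitioner), but it demonstrates how map output is divided into partitions so that identical keys always reach the same reducer:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // no configuration needed for this example
        }

        // keys with the same first letter (and therefore all identical keys) map to the
        // same partition, i.e. the same ReduceTask
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            int firstChar = s.isEmpty() ? 0 : s.charAt(0);
            return (firstChar & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // In the driver: conf.setPartitionerClass(FirstLetterPartitioner.class);
    //                conf.setNumReduceTasks(4);   // hypothetical number of reducers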


2. Reduce-side shuffle

    • Copy phase: the reduce process starts several data copy threads that request, over HTTP, the output files from the TaskTrackers that ran the MapTasks.
    • Merge stage: the data copied from the map side is first placed in a memory buffer. Merging takes 3 forms: memory-to-memory, memory-to-disk, and disk-to-disk. The first form is not enabled by default; the second runs (as a spill phase) until copying ends, and then the third, disk-to-disk merging, is used to generate the final file.
    • Reduce phase: the final file may reside on disk or in memory, but by default it is on disk. Once the reduce input file is determined, the whole shuffle is finished; reduce then executes and writes its results to HDFS.


V. Other
HDFS and MapReduce are the infrastructure of Hadoop. Beyond what is explained above, topics such as MapReduce's fault-tolerance mechanisms, task JVM reuse, and the job schedulers have not been summarized here. With a thorough understanding of how MapReduce works, one can get on with real MapReduce programming. The plan is to finish working through the Hadoop examples, then read "Mahout in Action" while studying "Hadoop Technology Insider: In-Depth Analysis of the YARN Architecture Design and Implementation Principles" in parallel, and formally step through the door into Hadoop 2.x.

Reference documents:
[1] Hadoop: The Definitive Guide (2nd Edition)
[2] Hadoop Application Development Technology Explained
[3] Hadoop 0.18 documentation: http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_design.html
[4] WordCount source analysis: http://blog.csdn.net/recommender_system/article/details/42029311
[5] Multi-way merge sort for external sorting: http://blog.chinaunix.net/uid-25324849-id-2182916.html
