My opinion on the execution process of MapReduce


We all know that Hadoop is mainly used for offline computing and consists of two parts: HDFS and MapReduce. HDFS is responsible for storing files, and MapReduce is responsible for computing over the data. When running a MapReduce program you need to supply an input file URI and an output file URI; in general, both of these point to locations on HDFS. A MapReduce computation is divided into two stages, the map phase and the reduce phase. The map phase is responsible for splitting up the input files and produces key:value pairs as its result, and the reduce phase summarizes the results of the many map tasks. So let's walk through the whole computation and see what the MapReduce execution process looks like, using the simple WordCount program as an example. This is only the general flow; the finer details and the reliability guarantees are not covered here.

In the MapReduce framework there are two kinds of nodes, the JobTracker and the TaskTracker, which are the control node and the worker nodes respectively. The former is the core of the system: as a single point it controls resource allocation and load balancing. A job is submitted to the JobTracker, which is responsible for requesting resources and scheduling tasks for the entire job and decides on which TaskTrackers each of the job's map and reduce tasks will run. The TaskTracker is a worker node: it executes the map and reduce processes and reports its own status and the state of its running tasks to the JobTracker in real time.
I think of the MapReduce programming model as an interface with multiple callbacks: you set up the callbacks, and the framework completes the whole process according to its model, invoking the different callback functions at the different stages via reflection.

Before going through the steps, let's be clear about what the input to the whole MapReduce computation is and what the output is. The input is one or more files stored in HDFS, and the output is a set of key-value pairs written back to HDFS.
1. Partitioning the input
The processes that execute map and reduce are called tasks and are launched by the TaskTracker. Because the input data needs to be handed to multiple TaskTrackers for computation, the first thing to do is to partition the input data, so that the TaskTracker processes scattered across many nodes can do the map work in parallel. This step is called split: one or more files are divided into multiple partitions, which the JobTracker then hands over to map programs for processing. The split step is handled by the InputFormat abstract class; we can extend this class to implement our own partitioning method. The class defines two functions:

public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;

The getSplits function divides the MapReduce program's input files (the file information is stored in the JobContext and set when the job is submitted) into logical partitions. The information for each partition includes the path of the file it belongs to, its length, its offset, and so on, which also means that a logical partition cannot span multiple physical files. The createRecordReader function creates a RecordReader object on a partition, which reads the contents of that partition and feeds them to map for processing. Several InputFormat implementations are provided by the framework, including FileInputFormat for splitting files, DBInputFormat for database input, and so on. We generally use FileInputFormat for partitioning: before submitting the job you can add input files by calling the addInputPath function and set the output path through the setOutputPath function. Let's say we get 3 partitions here.
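As a concrete illustration, here is a minimal job-driver sketch (not from the original post; imports omitted, as in the other snippets here) showing where addInputPath and setOutputPath fit. It assumes the stock Hadoop WordCount classes (TokenizerMapper, IntSumReducer), and the HDFS paths are illustrative only:

// Minimal driver sketch, assuming the stock Hadoop WordCount example classes.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);   // contains the map function shown below
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/input"));     // file(s) on HDFS to split
FileOutputFormat.setOutputPath(job, new Path("/output"));  // results written back to HDFS
System.exit(job.waitForCompletion(true) ? 0 : 1);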
2. Dividing the contents of each partition before map
After partitioning, the JobTracker starts a map task (referred to as a mapper) to handle each partition: each partition is handled by one mapper, and if the number of partitions is greater than the number of TaskTrackers, the tasks have to be assigned according to the actual runtime conditions. Once a mapper has its partition, it still needs a RecordReader (you can supply your own RecordReader via InputFormat's createRecordReader interface) to read data from the partition and produce the key:value pairs that form the input of the map function. In the simple WordCount program we do not customize the InputFormat or RecordReader; FileInputFormat and LineRecordReader are used to produce the map input, so the key of the map is the byte offset of the line and the value is the contents of the line. In WordCount, the map function is as follows:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}
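The word and one objects referenced above are fields of the mapper class; in the stock Hadoop WordCount example (not quoted from the original post) they are declared like this:

private final static IntWritable one = new IntWritable(1);  // the constant 1 emitted for every word
private Text word = new Text();                             // reused holder for the current word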

Each line is simply split into multiple words used as keys, and the value for each word is 1; the map here is just a splitting function.
3. Spill during the map process
When the map function executes it produces a large number of key:value pairs, which are stored in a memory buffer in order (the order in which the map function processed them). But the memory buffer has a size limit, 100 MB by default, and if there is too much content in the buffer its contents need to be swapped out to disk. This step is called spill. In order not to block the execution of the map function while the buffer is being flushed, the buffer is not written out only once it is completely full; instead the io.sort.spill.percent configuration item controls when to swap to disk. This configuration item is 0.8 by default, which means that when the memory buffer reaches 80 MB it is spilled to disk while the remaining 20 MB can continue to be written to by the map program. Since map writes to disk based on the size of the memory buffer, the map results may end up in multiple files; if the map result set is small enough to fit entirely in memory, then only one write to disk happens at the end.
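As a hedged illustration of the knobs mentioned above (names taken from the classic io.sort.* family used in older Hadoop releases; newer releases moved them under mapreduce.task.io.sort.*, so treat the exact keys as assumptions to check against your version), the buffer size and spill threshold could be set like this:

Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 100);                 // size of the in-memory sort buffer, in MB
conf.setFloat("io.sort.spill.percent", 0.80f);  // start spilling to disk when the buffer is 80% full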
4. Partitioning and merging during spill
We said earlier that the result of a mapper is a set of key:value pairs, for example "hello":1 and "world":1 in WordCount. When a memory buffer is written to disk it is quite likely to contain multiple "hello":1 results (because the word hello appears many times in that partition). To save disk space (and also to save bandwidth later) these need to be merged, but the MapReduce framework does not know how to merge them, so a callback can be set here; this step is called the combiner. In addition, the map results eventually need to be handed to reduce, but how do we know which keys go to which reducer? This is where the keys are partitioned again, by the Partitioner. So every time the contents of the memory buffer are written to disk, sort, combine, and partition operations are performed. Partitioning is generally done by hashing (is the number of reduce tasks here set in the configuration, or through the MapReduce job setup?):
public int getPartition(K key, V value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
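For completeness, here is a hedged sketch of how the combiner and partitioner from this step might be wired into a job driver like the one sketched in step 1; IntSumReducer and HashPartitioner are the stock Hadoop classes, and setNumReduceTasks answers the question above: the reducer count is set on the job itself.

job.setCombinerClass(IntSumReducer.class);       // merges "hello":1, "hello":1 -> "hello":2 within each spill
job.setPartitionerClass(HashPartitioner.class);  // the getPartition shown above; hash partitioning is already the default
job.setNumReduceTasks(4);                        // e.g. 4 reducers -> 4 partitions in every spill file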

So after map execution we get multiple disk files that have been sorted, combined, and partitioned. Assuming we have 4 reducers, each of these files may contain 4 partitions, and each partition contains the collection of key:value pairs that, after merging, needs to be handed to one specific reducer for processing.
5. Merging between disk files
After obtaining these disk files, we still need to merge them so that the number of reads the reducers have to make is reduced. The merge here is done per partition: for example, all of the partitions destined for the first reducer are merged into one overall partition. The merge process performs sorting and the combiner again, and this step of merging multiple files into a single result file is called a group merge.
Once this is done, all of the map-side work is complete. You can see that the split operation is performed before map executes, then a RecordReader is used to extract the key and value required by the map function from each partition. The results of each map function are first written to a memory buffer and then spilled to disk when the buffer reaches its threshold; sorting, the combiner, and the partitioner run when the disk is written. After all of the input has been processed, the multiple spill files are sorted and combined per partition, resulting in one file containing multiple partitions, which is the input of reduce, where each reducer's input is contained in one partition. It is also important to note that the disk operations here all happen on the local disk of the TaskTracker node where the map process runs.
The next step is the execution of the reduce tasks. The input to reduce is the multi-way merged result of the map side, and each reducer processes one part of every mapper's result set. Which part of each mapper goes to which reducer was already decided by the partitioning when the mapper wrote to disk, and the mapper's results are stored on the local disk rather than on HDFS. I think the reason is that these are intermediate results and there is no need to preserve them. In addition, unlike the reducers, when mappers are assigned the framework follows the strategy of "moving computation rather than moving data" and localizes the data as much as possible: in the first step, split (most of the time one block of a large file becomes one split), the mapper is started, as far as possible, on a DataNode that holds a copy of that block (and which is also a TaskTracker). In this way the whole computation of that mapper does not need to fetch its input data over the network, which greatly saves bandwidth. But the same is not true for reduce, because it needs to read data from multiple nodes (one piece from each mapper), and even if these intermediate results were stored on HDFS, the inputs of a reducer would not all be on the same DataNode; so the main reason should still be the first one. Also, because the reliability of this intermediate data is not guaranteed, if a mapper's disk is lost after the mapper has finished, that data has to be rescheduled and recomputed by the JobTracker, or the entire job fails (personal conjecture; I have not confirmed the actual strategy).
6. Reduce reads its input
The first step of reduce is, of course, to read the input data, which needs to be fetched from multiple nodes, one portion from each node. This step is similar to reading a split on the map side, except that a split is most likely local while reduce needs to read from other nodes. The data is transferred over the HTTP protocol; it is stored on the TaskTracker that ran the mapper and is managed by that TaskTracker.
7. Merging the reduce inputs
Think about it, though: not all mappers finish at the same time, because of differences in input size and node configuration, and reduce needs the results of all mapper operations, so it has to wait until all mappers are complete. This means the mapper operations and the reduce operation itself cannot run in parallel. But even without running the reduce function, the output data of finished mappers can already be read in, so whenever a map operation on some TaskTracker ends, every reducer reads from that node the part of the input data it needs to process. The data read at this time is kept in memory, because reduce has not yet started, so the maximum amount of memory available here depends on the JVM's heap configuration. In this way data is read once each time a mapper finishes, but the data read from different mappers may contain duplicate keys, which also need to be merged, and when the buffered input data reaches a threshold it must be written to local disk. The process is therefore to continuously read input data from the mappers into memory, and when memory reaches the configured threshold the data in memory is sorted, merged, and written to a disk file. This is similar to writing the memory buffer to disk in the map operation, except that no partitioning is needed. If the input of a reducer is large, there may be multiple input files on disk, and after all the input has been read, the disk files need to be sorted and merged once more, producing one large input file saved on the local disk.
8. Executing the reduce callback
After the previous step, once all map operations have finished and their results have been copied over, the reduce process starts. During the reduce operation the key:value pairs in the input file are read continuously, the reduce function is invoked, and its output is written to HDFS. The input read by the reduce process may come from disk or from memory (depending on the size of the data after all the input files were merged).
From the process above it can be seen that the execution of MapReduce involves multiple sort, merge, and partition operations. First the input data is partitioned (split); then the map results are sorted, combined, and partitioned when they are written from the memory buffer to disk; then, after a map task completes, all of its output files are sorted and merged per partition. When reduce reads its input data, if there is not enough memory the data is spilled to disk, performing sort and merge operations again, and finally the multiple input files are sorted and merged once more. No wonder people say that the core of MapReduce is sorting. Furthermore, the part of the process from the map results to the input of reduce is also called the shuffle, which covers the sort, merge, and partition performed while the map side writes to disk and the merge operations performed again on the reduce side.
The above is my understanding of the MapReduce model gained while learning Hadoop; if anything is wrong, please correct me ~ ~
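A hedged sketch of the shuffle-side memory knobs described above (names taken from the old MRv1 mapred.* configuration family; treat the exact keys and values as assumptions to verify against your Hadoop version):

Configuration conf = new Configuration();
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // share of the reduce JVM heap used to hold fetched map outputs
conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);        // once the shuffle buffer is this full, merge and spill to local disk
conf.setInt("mapred.inmem.merge.threshold", 1000);               // or merge once this many map outputs have been buffered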
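The post only showed the map function; for symmetry, here is the reduce function of the same WordCount example, mirroring the stock Hadoop IntSumReducer (a sketch, not quoted from the original post):

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();                        // add up the 1s emitted by the mappers for this word
  }
  context.write(key, new IntWritable(sum));  // the final count, written to HDFS
}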
