Original article: http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
Keyword:
FileSplit: a section of an input file; the unit of input handed to a single map task.
Introduction:
This document describes how map and reduce operations are carried out in Hadoop. If you are not familiar with Google's MapReduce model, read the MapReduce paper first: http://labs.google.com/papers/mapreduce.html
Map
Since map operates on the input file set in parallel, its first step is to split the file set into FileSplits. Even a single file, if it is large enough to hurt processing efficiency, is divided into several splits. Note that the splitting step knows nothing about the internal logical structure of the input files; for example, text files organized by line boundaries are split at arbitrary byte boundaries. You can therefore define your own input format to control how splits are handled (a sketch follows below), or use the standard ones Hadoop provides. Each FileSplit is processed by a new map task.
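As one sketch of taking control of splitting (the class name here is invented for illustration), an input format in the old org.apache.hadoop.mapred API can refuse to split files at all by overriding isSplitable, so that each file becomes exactly one FileSplit and therefore one map task:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Each input file is handed to a single map task, however large it is.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // never split a file into multiple FileSplits
        }
    }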
When an individual map task starts, it opens a new output writer for each configured reduce task. It then reads its FileSplit using the RecordReader obtained from the specified InputFormat. The InputFormat parses the input and generates key-value pairs; it is also responsible for handling records that straddle the FileSplit boundary. For example, TextInputFormat reads the last line of a FileSplit past the split boundary, and, when it is not reading the first split of a file, it ignores the content up to the first newline, since that partial line belongs to the previous split.
The InputFormat class is not required to generate both meaningful keys and meaningful values. For example, the default TextInputFormat uses the content of each line of the input text as the value and the byte offset of that line as the key; most applications use only the line and ignore the offset.
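As a concrete illustration (the file contents here are made up), suppose a split begins with the two lines:

    hello world
    goodbye world

TextInputFormat delivers them to the Mapper as the pairs (0, "hello world") and (12, "goodbye world"): 12 is the byte offset of the second line, that is, the 11 bytes of the first line plus its newline.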
The key-value pairs read from the RecordReader are passed to the Mapper configured by the user. The user-supplied Mapper class may perform any operation on each pair and then call the OutputCollector.collect method to emit key-value pairs of its own choosing. The output it generates must use one fixed key class and one fixed value class, because the map output is written to disk as a SequenceFile, a format that stores the type information once per file and requires all records to be of the same type (if you want to output different data structures, subclass a common base type). The map input and output key-value types need not be related to each other.
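As an illustration, here is a sketch of the Mapper from the classic word count example, in the old org.apache.hadoop.mapred API that matches the OutputCollector interface described above; it takes the (offset, line) pairs produced by TextInputFormat and emits a (word, 1) pair for each token:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        // Note that the input types (LongWritable, Text) are unrelated
        // to the output types (Text, IntWritable).
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one); // emit (word, 1)
            }
        }
    }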
As the Mapper's outputs are collected, they are partitioned among the output files by the Partitioner class. By default this is HashPartitioner, which distributes pairs by the hashCode produced by the key class's hash function (so a good hash function is needed to balance the load across the reduce tasks); see the MapTask class for details. N input files generate M map tasks to be run, and each map task produces one output file per configured reduce task. Each output file is targeted at a specific reduce task, and the map output pairs from all of the map tasks are routed so that all the key-value pairs for a given key end up in files destined for the same reduce task. Consequently, all the key-value pairs for a given key are processed by one specific reduce task.
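When the default hash routing is not appropriate, a custom Partitioner can be supplied. The following is a minimal sketch (the class name and routing rule are invented for illustration) that sends all words starting with the same character to the same reduce task; it would be enabled with JobConf.setPartitionerClass:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical example: route words by their first character so that
    // related words appear together in one reduce task's output file.
    public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // no configuration needed
        }

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            int first = key.getLength() > 0 ? key.charAt(0) : 0;
            return (first & Integer.MAX_VALUE) % numPartitions;
        }
    }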
Combine
When the map operation outputs its key-value pairs, they are initially held in memory. For reasons of performance and efficiency, it is sometimes advantageous to take advantage of this by supplying a combiner, a class that performs a reduce-type function on the map side. If a combiner is configured, the map key-value pairs are not immediately written to the output; instead they are collected in lists, one list per key. When a certain number of key-value pairs has been written, the buffer is flushed by passing all the values of each key to the combiner's reduce method, which outputs key-value pairs just as if they had been produced by the original map operation.
For example, the word count program among the Hadoop examples has a map operation that outputs a (word, 1) key-value pair for each word, and counting the words in the input can be sped up substantially by a combiner. The combiner collects and processes the lists in memory, one list per word; when a certain number of pairs has been output, its reduce method is called with each unique word as the key and the list of ones as the value iterator. The combiner then outputs (word, count-in-this-part-of-the-input) key-value pairs. From the viewpoint of the reduce operation these pairs contain the same information as the raw map output, but far less data is written to and read from disk. A sketch of this reduce function follows.
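Here is a sketch of the word count reduce function, again in the old org.apache.hadoop.mapred API; because summing counts is associative, the same class can be registered both as the combiner (with JobConf.setCombinerClass) and as the final reducer:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once per unique key, with an iterator over all of
        // that key's values.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get(); // add up the partial counts
            }
            output.collect(key, new IntWritable(sum)); // emit (word, total)
        }
    }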
Reduce
When a reduce task starts, its input is scattered across the nodes in the map output files. In distributed mode these files first need to be copied to the local file system in a copy phase; see the ReduceTaskRunner class for details.
Once all of the data is available locally, it is appended into one file in an append phase. That file is then merge-sorted so that the key-value pairs for a given key are contiguous (the sort phase). This makes the actual reduce operation simple: the file is read sequentially, and the values are passed to the reduce method with an iterator that reads the input file until the next key is reached. See the ReduceTask class for details.
Finally, the output consists of one output file per executed reduce task. The format of these files can be specified with JobConf.setOutputFormat; if a SequenceFile output format is used, the output key class and value class must both be specified.
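Putting the pieces together, a minimal driver sketch might look like the following (WordCountMapper and WordCountReducer are the classes sketched above; the job name and the use of command-line arguments for paths are illustrative assumptions):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // One fixed key class and value class for the job's output.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCountMapper.class);
            conf.setCombinerClass(WordCountReducer.class); // map-side combiner
            conf.setReducerClass(WordCountReducer.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class); // JobConf.setOutputFormat

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }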