1. For the map input, the input data is first cut into equal-sized splits, and a map worker is created for each split. The split size is not set arbitrarily: it is normally the same as the HDFS block size, which defaults to 64 MB. The largest split that can be stored on a single node is one HDFS block; when a split is larger than an HDFS block, data must be transferred between nodes, which consumes network bandwidth.
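The splitting described above can be sketched as simple byte-based slicing (an assumption for illustration; real Hadoop splitting also respects record boundaries):

```python
# Sketch of dividing input data into fixed-size splits (assumption:
# plain byte-based splitting; real Hadoop also keeps records whole).
BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size, 64 MB

def make_splits(total_bytes, split_size=BLOCK_SIZE):
    """Return (offset, length) pairs; one map worker is created per split."""
    splits = []
    offset = 0
    while offset < total_bytes:
        length = min(split_size, total_bytes - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 160 MB input yields three splits: 64 MB, 64 MB, and 32 MB.
print(make_splits(160 * 1024 * 1024))
```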
2. Each map worker calls the user-written map function to process its split and writes the results to local storage (not HDFS).
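A minimal map function in the spirit of the maximum-temperature example used below (assumption: each input line is a simplified "year,temperature" format, not the raw NCDC records used in the book):

```python
# Hypothetical map function: parse one line into a (year, temperature) pair.
def map_fn(line):
    year, temp = line.strip().split(",")
    return (year, int(temp))

records = ["1950,0", "1950,20", "1950,10"]
print([map_fn(r) for r in records])  # [('1950', 0), ('1950', 20), ('1950', 10)]
```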
3. The map output may then go through a combiner. The combiner mainly reduces the amount of data transferred between map and reduce; it is not a required step. The maximum-temperature example from "Hadoop: The Definitive Guide" illustrates this.
First map output:
(1950, 0)
(1950, 20)
(1950, 10)
Second map output:
(1950, 25)
(1950, 15)
If the combiner is not called, the map output is transferred to reduce as-is, and reduce receives the following data as input:
(1950, [0, 20, 10, 25, 15])
If the combiner is called, the map output is first processed locally on each map worker (computing that map's maximum temperature) and only then sent to reduce:
First map combined:
(1950, 20)
Second map combined:
(1950, 25)
At this point, reduce uses the following data as input, which reduces the amount of data transferred between map and reduce:
(1950, [20, 25])
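The combiner step above can be sketched as a local reduction on each map worker, keeping only the maximum temperature per year before anything crosses the network:

```python
# Sketch of a max-temperature combiner: reduce each map worker's output
# locally so only one record per year is sent to reduce.
from collections import defaultdict

def combine(map_output):
    best = defaultdict(lambda: float("-inf"))
    for year, temp in map_output:
        best[year] = max(best[year], temp)
    return list(best.items())

first_map = [("1950", 0), ("1950", 20), ("1950", 10)]
second_map = [("1950", 25), ("1950", 15)]
print(combine(first_map))   # [('1950', 20)]
print(combine(second_map))  # [('1950', 25)]
```

Note that a combiner is only safe for operations like max that give the same final answer whether or not intermediate values are pre-aggregated.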
4. The combiner output (or the map output, if no combiner is used) then goes through the shuffle. The shuffle uses a partition operation to route records from map to reduce so that each reduce processes all records with the same key. The partition function can be customized; the default partition function hashes the key so that records with the same key are mapped to the same reduce.
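The default partitioning idea can be sketched as hashing the key modulo the number of reduce workers (Hadoop's HashPartitioner uses the key's hashCode(); Python's built-in hash() stands in for it here):

```python
# Sketch of the default partition function: within one run, the same key
# always hashes to the same partition, so all of its records reach the
# same reduce worker.
def partition(key, num_reduces):
    return hash(key) % num_reduces

num_reduces = 4
records = [("1950", 20), ("1950", 25), ("1949", 11)]
# Every "1950" record lands in the same partition:
parts = {key: partition(key, num_reduces) for key, _ in records}
print(parts)
```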
5. Reduce calls the user-defined reduce function to process the data, and the output is stored in HDFS.
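For the maximum-temperature example, the reduce step can be sketched as taking a key and all of its grouped values and emitting the final result (which Hadoop would then write to HDFS):

```python
# Sketch of the reduce function: after the shuffle groups values by key,
# reduce emits the maximum temperature for each year.
def reduce_fn(year, temps):
    return (year, max(temps))

print(reduce_fn("1950", [20, 25]))  # ('1950', 25)
```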
[Hadoop] MapReduce Principles in Brief