MapReduce (12): Merging data in the map and reduce phases

Source: Internet
Author: User

When data is processed in the map phase, memory is limited, so the data is written to files; how many files are produced depends on the amount of data. Each file is divided into partitions, one per reduce task, and the records within each partition are arranged in key order. When the map finishes, its files are merged into a single file: the data for the same partition is merged across all of the files, and the merged records of each partition are again arranged in key order. In the reduce phase, each reduce task fetches its own partition from every map and, depending on the amount of data, keeps it in memory or writes it to a file, so each map's output becomes one file or one memory segment. Finally, the in-memory and on-disk data are merged to compute the final result, using the same merge method as on the map side. Both the map and reduce phases therefore contain a step that merges data from multiple files or multiple memory segments into a single sorted output.
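The layout of a single map-side file described above can be sketched as follows. This is a minimal illustration, not Hadoop's spill code: the class `SpillSortSketch`, the record type `Rec`, and the use of `String` keys are assumptions made for the example. The partition formula is the one used by Hadoop's default `HashPartitioner`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: orders buffered map output the way one spill file is laid out.
class SpillSortSketch {
    /** One buffered map output record plus the partition (reduce task) it belongs to. */
    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    /** Same formula as Hadoop's default HashPartitioner. */
    static int partitionFor(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    /** Sorts records by partition first, then by key within each partition. */
    static List<Rec> sortForSpill(List<String[]> keyValues, int numReduces) {
        List<Rec> out = new ArrayList<>();
        for (String[] kv : keyValues) {
            out.add(new Rec(partitionFor(kv[0], numReduces), kv[0], kv[1]));
        }
        out.sort(Comparator.comparingInt((Rec r) -> r.partition)
                .thenComparing((Rec r) -> r.key));
        return out;
    }
}
```

Because every file is ordered this way, the data for one partition is a contiguous, key-sorted run in each file, which is what makes the later merge possible.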



As shown above, the map phase has to merge data from multiple files. Reduce task 1 then fetches partition 1 from every map, reduce task 2 fetches partition 2 from every map, and reduce task 3 fetches partition 3 from every map. For simplicity, the fetches that reduce task 3 makes from the maps are not marked. Once the data has been fetched, it is kept in memory or written to a file depending on its size; the partition data from the different maps is then merged, and the final result is output after the reduce computation.

Before merging, each piece of data, whether it lives in memory or in a file, is wrapped in a Segment, which is responsible for reading it. Segment provides two constructors: one builds an instance that reads from a file, and the other builds an instance that reads from memory. One reader implementation is IFile.InMemoryReader, which is constructed over a byte array and then reads its records from that array.
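Reading records back from a byte array can be sketched roughly as below. This is not Hadoop's InMemoryReader: the class name `InMemoryReaderSketch`, the `encode` helper, and the record layout (a 4-byte length followed by UTF-8 bytes for the key, then the same for the value) are all assumptions chosen to keep the example short. Hadoop's actual record format is not reproduced here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a reader that walks key/value records stored in a byte array.
class InMemoryReaderSketch {
    private final ByteBuffer buf;

    InMemoryReaderSketch(byte[] data) {
        this.buf = ByteBuffer.wrap(data);
    }

    /** Returns the next {key, value} pair, or null once the array is exhausted. */
    String[] next() {
        if (!buf.hasRemaining()) {
            return null;
        }
        String key = readField();
        String value = readField();
        return new String[] {key, value};
    }

    /** Reads one length-prefixed field (assumed layout: int length, then bytes). */
    private String readField() {
        byte[] raw = new byte[buf.getInt()];
        buf.get(raw);
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Helper for the example: encodes pairs in the assumed layout. */
    static byte[] encode(String[][] pairs) {
        int size = 0;
        for (String[] kv : pairs) {
            for (String field : kv) {
                size += 4 + field.getBytes(StandardCharsets.UTF_8).length;
            }
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (String[] kv : pairs) {
            for (String field : kv) {
                byte[] raw = field.getBytes(StandardCharsets.UTF_8);
                out.putInt(raw.length);
                out.put(raw);
            }
        }
        return out.array();
    }
}
```

A caller would loop on `next()` until it returns null, which is the same pull-style access a merge needs from every segment regardless of whether the bytes came from memory or from a file.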

Once all of the segments have been constructed, they are put into a MergeQueue. MergeQueue extends the abstract class PriorityQueue, so when segments are placed in it they are ordered by the value of their first key.


MergeQueue also implements the RawKeyValueIterator interface and is responsible for reading key/value data from the segments. After each key is read, the segment it came from is immediately re-sorted within the queue according to its new first key. The key/value pairs read from MergeQueue are therefore always returned in key order across all of the segments.
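The read loop described above is a k-way merge driven by a priority queue. The sketch below illustrates the idea with `java.util.PriorityQueue` rather than Hadoop's own PriorityQueue class; `MergeQueueSketch`, its nested `Segment`, and the use of in-memory `String` keys are assumptions for the example. Here a segment is removed and re-inserted after each read, which is a simple way to restore the ordering by first key.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of merging several key-sorted segments through a priority queue.
class MergeQueueSketch {
    /** A sorted run of keys with a cursor on its current first key. */
    static final class Segment {
        private final Iterator<String> it;
        String current;

        Segment(List<String> sortedKeys) {
            this.it = sortedKeys.iterator();
            advance();
        }

        /** Moves to the next key; returns false when the segment is exhausted. */
        boolean advance() {
            current = it.hasNext() ? it.next() : null;
            return current != null;
        }
    }

    /** Returns all keys from all runs in key order. */
    static List<String> merge(List<List<String>> sortedRuns) {
        // Segments are ordered by the key their cursor currently points at.
        PriorityQueue<Segment> queue =
                new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (List<String> run : sortedRuns) {
            Segment s = new Segment(run);
            if (s.current != null) {
                queue.add(s); // empty runs never enter the queue
            }
        }
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Segment s = queue.poll();  // segment with the smallest first key
            out.add(s.current);
            if (s.advance()) {
                queue.add(s);          // re-insert under its new first key
            }
        }
        return out;
    }
}
```

With k segments, each read costs O(log k) for the queue adjustment, so the merge never has to compare a key against every segment.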

To keep the number of files being merged at once from becoming too large, MergeQueue checks during merging whether the number of files exceeds a threshold. If it does, some of the files are first merged into a single file so that the total number of files no longer exceeds the threshold.
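The threshold check can be sketched as a loop of intermediate merges. This is a simplified illustration under stated assumptions, not Hadoop's pass-planning logic: `MergeFactorSketch`, the `factor` parameter, and the choice to merge the smallest runs first are all made up for the example, and sorted runs are held as in-memory lists instead of files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merge runs in batches until at most `factor` runs remain.
class MergeFactorSketch {
    /** Merges several key-sorted runs into one key-sorted run. */
    static List<String> mergeRuns(List<List<String>> runs) {
        // Queue entries are {run index, position in run}, ordered by the key they point at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                queue.add(new int[] {i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            List<String> run = runs.get(top[0]);
            out.add(run.get(top[1]));
            if (top[1] + 1 < run.size()) {
                queue.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }

    /** Performs intermediate merges until the number of runs is at most factor. */
    static List<List<String>> reduceToFactor(List<List<String>> input, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        List<List<String>> runs = new ArrayList<>(input);
        while (runs.size() > factor) {
            // Assumption of this sketch: merge the smallest runs first.
            runs.sort(Comparator.comparingInt((List<String> r) -> r.size()));
            // Merge only as many runs as needed, and never more than factor at once.
            int take = Math.min(factor, runs.size() - factor + 1);
            List<List<String>> batch = new ArrayList<>(runs.subList(0, take));
            runs.subList(0, take).clear();
            runs.add(mergeRuns(batch));
        }
        return runs;
    }
}
```

After `reduceToFactor` returns, the remaining runs can be read through a single final merge, for example `mergeRuns(reduceToFactor(runs, 3))`.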

If the number of files to merge is already below the threshold, MergeQueue simply returns itself, since it already provides access to all of the segments in key order.

