A summary of the shuffle process in MapReduce

Source: Internet
Author: User
Tags: shuffle

The shuffle process in MapReduce is divided into two parts: the map side and the reduce side.

Map side:

1. (Hash partitioner) After the map function executes, each key is hashed and the hash is taken modulo the number of reduce tasks to obtain a partition number. The partition number determines which reduce task will process that key-value pair.

2. (Sort, combiner) The serialized bytes of each key-value pair and its partition number are written to an in-memory buffer (100 MB in size, with a spill threshold of 0.8). When buffer usage exceeds 100 * 0.8 = 80 MB, the buffer is spilled (overflowed) to disk. Before the spill, the serialized records in the buffer are sorted by partition number and key, and key-value pairs with the same key are combined (if a combiner is configured).

3. (Merge) If the map task has produced multiple spill files, the files are merged again: key-value pairs with the same key across the spill files are combined into the merged output, and the spill files are deleted. (Note: the map-side output is stored on the local disk, not on HDFS. If there is only one spill file, the map-side shuffle ends at this point without a further merge.)
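
The three map-side steps above can be sketched as a small simulation. This is only an illustrative toy under stated assumptions, not Hadoop's implementation: the names `stable_hash`, `partition`, `sort_and_combine`, and `map_side_shuffle` are hypothetical, each record is pretended to occupy 10 MB so that the 80 MB threshold is reached quickly, the combiner is assumed to be a simple sum, and spills are kept in Python lists rather than files on local disk.

```python
from itertools import groupby

NUM_REDUCERS = 3        # assumed number of reduce tasks
BUFFER_CAPACITY = 100   # buffer size in MB, as in the text
SPILL_THRESHOLD = 0.8   # spill once usage exceeds 100 * 0.8 = 80 MB


def stable_hash(key: str) -> int:
    """Deterministic toy hash (a 31-multiplier string hash), kept non-negative."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h & 0x7FFFFFFF


def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Step 1: hash the key, then take it modulo the number of reducers."""
    return stable_hash(key) % num_reducers


def sort_and_combine(records):
    """Sort (partition, key, value) records and sum the values of equal keys."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    return [
        (p, k, sum(v for _, _, v in group))
        for (p, k), group in groupby(ordered, key=lambda r: (r[0], r[1]))
    ]


def map_side_shuffle(pairs, record_size=10):
    """Steps 2 and 3: buffer records, spill past the threshold, merge spills."""
    buffer, spills, used = [], [], 0
    for key, value in pairs:
        buffer.append((partition(key), key, value))
        used += record_size
        if used > BUFFER_CAPACITY * SPILL_THRESHOLD:
            spills.append(sort_and_combine(buffer))  # sorted, combined spill
            buffer, used = [], 0
    if buffer:
        spills.append(sort_and_combine(buffer))      # final flush
    if len(spills) == 1:
        return spills[0]                             # single spill: no merge
    # Several spills: merge them and combine equal keys once more.
    return sort_and_combine([r for spill in spills for r in spill])
```

Calling `map_side_shuffle([(w, 1) for w in "a b a c b a d a b c a b".split()])` triggers one spill after the ninth record and a final flush for the rest, and the merged result holds one combined record per key, ordered by partition number and then by key.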


Reduce side:

1. (Copy) The required files are copied from each map side over HTTP.

2. (Merge) The files fetched from each map side are merged in memory. (This memory is not limited to 100 MB; it is sized from the JVM heap, because the reduce function is not yet running at this point and the heap can be devoted to the merge.) Key-value pairs with the same key across the files are merged, and the result is kept in memory or written to local disk as input to the reduce function.
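
The reduce-side merge can be illustrated with a k-way merge of already sorted runs. This is a minimal sketch under assumptions, not Hadoop's code: the function name `reduce_side_merge` is hypothetical, each fetched map output is modeled as an in-memory list of (key, value) pairs already sorted by key, and the HTTP copy step is left out.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_side_merge(map_outputs):
    """Merge sorted (key, value) runs and group the values of each key.

    Each element of map_outputs stands for the data fetched from one
    map task and must already be sorted by key.
    """
    # heapq.merge streams the runs in key order without re-sorting them.
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    # Adjacent records with the same key form the input of one reduce call.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]
```

For example, merging the runs `[("a", 3), ("c", 1)]`, `[("a", 2), ("b", 4)]`, and `[("b", 1)]` yields one entry per key, with all values for `"a"` gathered together before those for `"b"` and `"c"`, which is the shape of input the reduce function expects.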
