The shuffle process in MapReduce is split between two sides: the map side and the reduce side.
Map side:
1. (Hash partitioner) After the map function runs, each output key is hashed, and the hash is taken modulo the number of reduce tasks to get a partition number. Every key-value pair with the same key therefore lands in the same partition and is processed by the same reduce task.
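The partitioning step can be sketched as follows. This is an illustrative Python sketch, not Hadoop's actual (Java) code; the function name is made up, and the sign-bit mask mirrors what Hadoop's default `HashPartitioner` does with `key.hashCode() & Integer.MAX_VALUE`.

```python
def hash_partition(key, num_reduces):
    """Illustrative stand-in for Hadoop's default hash partitioner."""
    # Mask off the sign bit so the modulus is never negative, then take
    # the hash modulo the number of reduce tasks to get a partition number.
    return (hash(key) & 0x7FFFFFFF) % num_reduces
```

Because the partition depends only on the key's hash, all values for a given key are guaranteed to reach the same reduce task.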
2. (Sort/combine) The serialized key-value pair, together with its partition number, is written to an in-memory buffer (default size 100 MB, spill threshold 0.8). When the buffer fills past 100 MB × 0.8 = 80 MB, a spill to disk begins: the buffered records are sorted by partition number and key, and records sharing the same key are combined (if a combiner is configured) before being written out.
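A minimal sketch of what happens at spill time, under the assumption of a summing combiner (function and record layout are illustrative, not Hadoop's internals):

```python
from itertools import groupby

def spill(records):
    """Sort buffered (partition, key, value) records and combine same-key values.

    Illustrative sketch of a map-side spill with a summing combiner.
    """
    # Sort by (partition, key) so each partition's records are contiguous
    # and in key order, as they would be in a real spill file.
    records.sort(key=lambda r: (r[0], r[1]))
    out = []
    for (part, key), group in groupby(records, key=lambda r: (r[0], r[1])):
        # The combiner collapses records with the same key, shrinking the file.
        out.append((part, key, sum(v for _, _, v in group)))
    return out
```

For example, `spill([(0, "b", 1), (0, "a", 1), (0, "b", 1)])` yields the sorted, combined list `[(0, "a", 1), (0, "b", 2)]`.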
3. (Merge) If the map task produced multiple spill files, they are merged into a single sorted output file, again combining records that share the same key, and the spill files are then deleted. (Note: the map-side output file is stored on the local disk, not on HDFS. If there is only one spill file, the map-side shuffle ends here without a merge pass.)
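Since each spill file is already sorted, the merge is a streaming k-way merge. A sketch in Python, assuming `(key, value)` records and a summing combiner (Hadoop's real multi-way merge is played here by `heapq.merge`):

```python
import heapq
from itertools import groupby

def merge_spills(spill_files):
    """Merge several key-sorted spill files into one sorted, combined list."""
    # Each spill file is already sorted by key, so a streaming
    # k-way merge yields a globally sorted record sequence.
    merged = heapq.merge(*spill_files)
    # Re-apply the combiner across files: the same key may appear in
    # more than one spill file.
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(merged, key=lambda kv: kv[0])]
```

For example, merging `[("a", 1), ("b", 2)]` with `[("a", 3), ("c", 1)]` gives `[("a", 4), ("b", 2), ("c", 1)]`.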
Reduce side:
1. (Copy) The reduce task fetches its partition of each map task's output over HTTP.
2. (Merge) The files fetched from the map tasks are merged, grouping records with the same key across all map outputs. The memory available for this merge is not just 100 MB but most of the JVM heap, because the reduce function has not started running yet. The merged result, kept in memory or spilled to local disk, becomes the input to the reduce function.
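The reduce-side merge can be sketched the same way: key-sorted map outputs are merged, and each key's full list of values is handed to a user-supplied reduce function. All names here are illustrative, not Hadoop's API.

```python
import heapq
from itertools import groupby

def reduce_side(map_outputs, reduce_fn):
    """Merge key-sorted map outputs and apply reduce_fn to each key's values."""
    merged = heapq.merge(*map_outputs)  # each map output is already key-sorted
    results = []
    for key, grp in groupby(merged, key=lambda kv: kv[0]):
        # The reduce function sees every value for this key, from all maps.
        results.append(reduce_fn(key, [v for _, v in grp]))
    return results
```

A word-count style usage: `reduce_side([[("a", 1), ("b", 1)], [("a", 1)]], lambda k, vs: (k, sum(vs)))` produces `[("a", 2), ("b", 1)]`.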
This completes the shuffle process in MapReduce.