A shuffle-centric comparative analysis of the MapReduce process in Hadoop and Spark
The map-shuffle-reduce process in MapReduce and Spark
MapReduce process analysis (MapReduce uses a sort-based shuffle)
The input split assigned to the task is parsed into k/v pairs, which are then processed by map().
After map() processes a record, execution enters the collect stage, which gathers the emitted k/v pairs and stores them in an in-memory ring buffer.
When the data in the ring buffer reaches a threshold (a spill can also happen before the threshold is reached; either way, data in memory is written to disk), the buffered data is transferred to disk by the SpillThread. It is important to note that before the transfer, the records are sorted with quicksort, first by partition number and then by key; note that this sort happens before anything is written to disk. The sorted data is then written, partition by partition, into a spill.out file (one spill.out file may hold data for more than one partition, because one map task may spill several times). Note also that if a combiner is set, each partition's data is aggregated by the combiner before being written to the file. Each spill.out file also has a corresponding SpillRecord structure (the spill.out index).
The final phase on the map side is the merge: it merges the spill.out files into one large file (which also gets a corresponding index file). The merge itself is simple: for each partition, the data belonging to that partition across all spill.out files is merged together. (This is the first aggregation.) The (partition, key) spill ordering is sketched below.
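As a minimal sketch of that ordering (illustrative Scala, not Hadoop's actual Java implementation; all names here are hypothetical), sorting the buffered records by partition number and then by key makes each partition's data contiguous and internally sorted in the spill file:

    object SpillSort extends App {
      // Illustrative only: models the map-side spill ordering, not Hadoop's code.
      case class Record(partition: Int, key: String, value: String)

      // Records accumulated in the ring buffer since the last spill.
      val buffered = Seq(
        Record(1, "b", "1"), Record(0, "c", "1"), Record(0, "a", "1")
      )

      // Quicksort by (partition number, key): the order in which records land in
      // a spill.out file, so each partition is contiguous and internally sorted.
      val spillOrder = buffered.sortBy(r => (r.partition, r.key))

      spillOrder.foreach(println)
      // Record(0,a,1)  Record(0,c,1)  Record(1,b,1)
    }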
Shuffle stage. The first thing to note is that the shuffle phase has two threshold settings. First, when result data is fetched from a map task, it is placed either in memory or on disk depending on its size (the size of the file.out data); this threshold is configured completely separately from the map-phase one. Second, there are thresholds on how many files memory and disk may each hold; once a threshold is exceeded, the files are merged, i.e. small files are combined into large ones. The shuffle process:
1) Obtain the list of completed map tasks.
2) Copy the data remotely (via HTTP GET); depending on its size, the data is placed either in memory or on disk.
3) When the in-memory or on-disk files become numerous, they are merged. (This is the second aggregation.)
A sort is required before reduce, but the two phases are pipelined: the sort builds a min-heap over the sorted runs in memory and on disk and holds an iterator rooted at the top of the heap, and the reduce task passes each group of records sharing the same key to the reduce() function in the form of an iterator.
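A minimal sketch of that k-way merge (illustrative Scala with hypothetical names, not Hadoop's implementation): each run is one sorted segment, and a min-heap over the run heads yields records in global key order:

    import scala.collection.mutable.PriorityQueue

    object MergeSketch extends App {
      // Each "run" models one sorted spill segment, in memory or on disk.
      val runs: Seq[Iterator[(String, Int)]] = Seq(
        Iterator("a" -> 1, "c" -> 2),
        Iterator("a" -> 3, "b" -> 1)
      )

      // PriorityQueue is a max-heap, so reverse the key ordering to get a min-heap.
      implicit val ord: Ordering[((String, Int), Int)] =
        Ordering.by[((String, Int), Int), String](_._1._1).reverse

      val heap = PriorityQueue.empty[((String, Int), Int)]
      runs.zipWithIndex.foreach { case (it, i) =>
        if (it.hasNext) heap.enqueue((it.next(), i)) // seed with each run's head
      }

      while (heap.nonEmpty) {
        val ((k, v), i) = heap.dequeue()  // root = globally smallest remaining key
        println(s"$k -> $v")              // real MapReduce groups equal keys for reduce()
        if (runs(i).hasNext) heap.enqueue((runs(i).next(), i))
      }
    }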
Spark shuffle process analysis (hash-based shuffle)
The RDD is the most obvious difference between Spark and Hadoop (as a data structure); RDDs in Spark have many features that are not covered here.
The shuffle processes of Spark and Hadoop are similar, and Spark likewise has an aggregation operation before and after the shuffle. But there are also clear differences: Hadoop's shuffle consists of distinct phases: map(), spill, merge, shuffle, sort, reduce(), and so on, executed in that fixed order, which makes it push-style. Spark is different: because its shuffle is operator-driven and lazily executed, it is pull-style.
The second obvious difference between the two is that Spark's shuffle is hash-based while Hadoop's is sort-based. Here is a brief introduction to Spark's shuffle:
1. Because it is operator-driven, Spark's shuffle consists mainly of two stages: shuffle write and shuffle read.
2. The entire execution of a ShuffleMapTask is the shuffle write phase.
3. Spark's shuffle process begins by sending each record in the map's output to the corresponding bucket (buffer); which bucket is determined by hashing the key (each bucket corresponds to one final reducer, that is, under hash-based shuffle the data is automatically routed into the bucket of its reducer). After that, the data in each bucket is continuously written to local disk, forming a ShuffleBlockFile, or simply a FileSegment. That is the whole ShuffleMapTask process. Afterwards, each reducer fetches its own FileSegments and enters the shuffle read phase.
4. It is important to note that a reducer does not start fetching data until all ShuffleMapTasks have finished, because the ShuffleMapTasks may not all belong to the same stage, and a stage is executed only after its parent stage has completed; so the fetch does not start as soon as a FileSegment is generated.
5. It should also be noted that freshly fetched FileSegments are stored in a softBuffer; Spark caps this buffer at spark.reducer.maxMbInFlight, which defaults to 48MB.
6. The data produced by reduce is placed in memory + disk (spilling according to the relevant policy).
7. Once fetching starts, records are processed (reduced) as they are fetched. MapReduce's shuffle phase also applies combine() while fetching, but combine() only handles part of the data; MapReduce cannot do full fetch-and-reduce processing, because it must wait until all data has been shuffled and sorted before starting reduce(), so that the records entering reduce() are ordered. Spark, however, does not require shuffle data to be globally ordered, so it does not have to wait for all data to finish shuffling before processing. To process while shuffling, with records flowing in unordered, an aggregating data structure such as a HashMap can be used, as in the sketch after this list.
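A minimal Scala sketch of both ideas (names are hypothetical; Spark's actual code uses HashPartitioner and its own AppendOnlyMap-style structures): hashing a key to its reducer's bucket on the write side, and HashMap aggregation over unordered fetched records on the read side:

    import scala.collection.mutable

    object HashShuffleSketch extends App {
      // Shuffle write side: the bucket (target reducer) for a record is just a
      // hash of its key.
      def bucketFor(key: String, numReducers: Int): Int =
        math.abs(key.hashCode) % numReducers

      // Shuffle read side: aggregate while fetching. FileSegments arrive in no
      // particular order, so an order-insensitive HashMap suffices; no global sort.
      def aggregate(fetched: Iterator[(String, Int)]): mutable.HashMap[String, Int] = {
        val acc = mutable.HashMap.empty[String, Int]
        fetched.foreach { case (k, v) => acc(k) = acc.getOrElse(k, 0) + v }
        acc
      }

      println(bucketFor("a", 4))                                  // bucket for key "a"
      println(aggregate(Iterator("a" -> 1, "b" -> 2, "a" -> 3)))  // HashMap(a -> 4, b -> 2)
    }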
Comparison of hash-based and sort-based shuffle
Hash-based shuffle, as the name implies, does no sort operation while writing shuffle data; instead, records are hashed, and each reduce partition's data is written to its own disk file. The problem with this is that when there are many reduce partitions, a large number of disk files (Map x Reduce) is generated. If the number of files is especially large, file read/write performance suffers considerably; in addition, the many file handles open at the same time, and the temporary memory that serialization, compression, and similar operations must allocate, can quickly balloon to an unacceptable level, putting great pressure on memory usage and GC. This is especially true when executor memory is relatively small, as in Spark-on-YARN mode. But there is a way to improve it:
ShuffleMapTasks that execute consecutively on the same core can share one output file, a ShuffleFile. The first ShuffleMapTask to execute forms ShuffleBlock i; ShuffleMapTasks executed afterwards append their output data directly, forming ShuffleBlock i', and each ShuffleBlock is called a FileSegment. The reducers of the next stage then simply fetch from the entire ShuffleFile. In this case, the total number of shuffle files becomes C x R.
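A quick illustration of why this consolidation matters (the job sizes below are made up):

    object FileCounts extends App {
      val numMaps = 1000    // hypothetical M: map tasks
      val numReducers = 500 // hypothetical R: reduce partitions
      val numCores = 16     // hypothetical C: cores running ShuffleMapTasks

      println(numMaps * numReducers)  // 500000 shuffle files without consolidation (M x R)
      println(numCores * numReducers) // 8000 shuffle files with consolidation (C x R)
    }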
Sort-based shuffle is a ShuffleManager introduced experimentally in Spark 1.1 (that is, some of its features and interfaces are still evolving). When writing partition data, it first sorts the data in different ways depending on the actual situation; at the very least it sorts by reduce partition, so that all the data a single map task shuffles to the various reduce partitions can be written into one external disk file, with simple offset markers indicating where each reduce partition's data lies within that file. A map task thus only needs to generate one shuffle file, avoiding the large number of files the HashShuffleManager can run into. This process is similar to the MapReduce process. A sketch of this write path follows.
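A minimal model of that write path (illustrative Scala with hypothetical names, not SortShuffleManager's actual code): one data layout per map task plus per-partition offsets, instead of one file per (map, reduce) pair:

    object SortShuffleSketch extends App {
      case class Rec(partition: Int, key: String, value: String)

      val records = Seq(Rec(2, "x", "1"), Rec(0, "y", "1"), Rec(1, "z", "1"))

      // At minimum, sort by reduce partition id; whether keys are also sorted
      // within a partition depends on the operator.
      val sorted = records.sortBy(_.partition)

      // Record where each partition starts in the single output "file"; a
      // reducer later reads only its own offset range.
      val offsets = sorted.zipWithIndex.collect {
        case (rec, idx) if idx == 0 || sorted(idx - 1).partition != rec.partition =>
          rec.partition -> idx
      }.toMap

      println(sorted)  // all of partition 0, then 1, then 2
      println(offsets) // Map(0 -> 0, 1 -> 1, 2 -> 2): start offset of each partition
    }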
Which of the two performs better depends on the combination of memory, sorting, file operations, and other factors.
For shuffle operations that do not need sorting, such as repartition, if the number of files is not especially large, HashShuffleManager faces only a small memory problem, while SortShuffleManager still has to sort by partition; in that case HashShuffleManager is clearly more efficient.
For shuffle operations that would need a map-side sort anyway, such as reduceByKey, HashShuffleManager does not sort while writing the data but must still sort in another step; SortShuffleManager, by contrast, can fold the writing and sorting together, so it can still be faster even when HashShuffleManager has no memory-usage issues.
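For reference, a minimal runnable Spark example of the two operator families being compared (the app setup is illustrative; repartition and reduceByKey are real Spark RDD operations):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on old Spark, redundant on newer versions

    object ShuffleOps extends App {
      val sc = new SparkContext(new SparkConf().setAppName("shuffle-ops").setMaster("local[2]"))

      // repartition: pure redistribution with no ordering requirement, the case
      // where hash-based shuffle has nothing extra to do.
      val redistributed = sc.parallelize(1 to 100).repartition(8)

      // reduceByKey: map-side combining/sorting can be folded into the shuffle
      // write, the case where sort-based shuffle can come out ahead.
      val counts = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3)).reduceByKey(_ + _)

      counts.collect().foreach(println) // (a,4) (b,2)
      sc.stop()
    }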
The difference between shuffle in Hadoop and shuffle in Spark