Spark Tech Insider: Sort-Based Shuffle Implementation Analysis


In Spark 1.2.0, an important change to Spark core is that the default shuffle implementation is switched from hash-based shuffle to sort-based shuffle: the default value of spark.shuffle.manager changes from hash to sort. The corresponding implementation classes are org.apache.spark.shuffle.hash.HashShuffleManager and org.apache.spark.shuffle.sort.SortShuffleManager, respectively.

The choice between the two is made in org.apache.spark.SparkEnv:

    // Let the user specify short names for shuffle managers
    val shortShuffleMgrNames = Map(
      "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
      "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
    val shuffleMgrName = conf.get("spark.shuffle.manager", "sort") // get the shuffle manager type; "sort" is the default
    val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

So why did sort-based shuffle replace hash-based shuffle as the default option?

As mentioned earlier, with hash-based shuffle each mapper needs to write one file per reducer for that reducer to read, so M*R files are produced in total. If the numbers of mappers and reducers are large, the number of files becomes enormous. One of the design goals of hash-based shuffle was to avoid unnecessary sorting (a common criticism of Hadoop MapReduce is that it sorts even where the result does not need to be sorted, causing unnecessary overhead). But when processing very large datasets, hash-based shuffle incurs heavy disk I/O and memory consumption, which undoubtedly hurts performance. Hash-based shuffle has been continuously optimized; as mentioned, the file consolidation introduced in Spark 0.8.1 alleviates this problem to some extent. To solve it more thoroughly, Spark 1.1 introduced sort-based shuffle. With it, each shuffle map task no longer generates a separate file for each reducer; instead it writes all of its results to a single data file and generates an index file from which reducers can locate the data they need to process. The direct benefit of avoiding large numbers of files is lower memory usage and the low latency of sequential disk I/O. Lower memory usage reduces the risk and frequency of GC, and fewer files avoids the pressure of writing many files to the system at the same time.
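To put rough numbers on it (a hypothetical workload, not a measurement): with 1,000 map tasks and 1,000 reducers, hash-based shuffle writes 1,000,000 shuffle files, while sort-based shuffle writes 2,000, one data file plus one index file per map task.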

And judging by the tests of its author Reynold Xin, sort-based shuffle beats hash-based shuffle in both speed and memory usage: "Sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing."

Performance data from: https://issues.apache.org/jira/browse/spark-3280


A shuffle map task sorts its output by the partition ID that each key maps to; keys belonging to the same partition are not sorted among themselves. This restraint matters because sorting is a net loss for operations that do not need it; recall that Spark originally chose hash-based shuffle over a sort-based approach precisely to avoid the performance penalty Hadoop MapReduce pays for sorting every computation. For operations that do require a sort, such as sortByKey, the sort is still performed by the reducer in Spark 1.2.0.
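As a rough illustration of "order by partition ID only", here is a minimal sketch (not Spark's actual code; the record set and the number of reducers are made up) that orders records by the partition their key hashes to while leaving keys within a partition in their original order:

    // A minimal sketch, not Spark's implementation: records are ordered only by
    // the partition their key maps to; keys inside a partition stay unsorted.
    import org.apache.spark.HashPartitioner

    object PartitionIdSortSketch {
      def main(args: Array[String]): Unit = {
        val partitioner = new HashPartitioner(4) // 4 hypothetical reducers
        val records = Seq("banana" -> 1, "apple" -> 2, "cherry" -> 3, "date" -> 4)
        // Sort by partition ID only; the relative order of keys within a
        // partition is preserved because sortBy is stable.
        val byPartition = records.sortBy { case (k, _) => partitioner.getPartition(k) }
        byPartition.foreach { case (k, v) =>
          println(s"partition=${partitioner.getPartition(k)} key=$k value=$v")
        }
      }
    }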

If there is not enough memory during this process, the already-sorted contents are spilled to external storage. At the end, these spill files are merge-sorted together.

To make it easy for downstream tasks to fetch the partitions they need, an index file is generated that records the location of each partition within the data file. Of course, org.apache.spark.storage.BlockManager needs a corresponding implementation to support this new addressing scheme.
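To make the idea concrete, here is a minimal sketch of such an index file, under the assumption that it simply stores cumulative byte offsets, one 8-byte Long per partition plus a trailing Long for the end of the data file; this illustrates the addressing idea, not Spark's exact on-disk format:

    import java.io.{DataInputStream, DataOutputStream, File, FileInputStream, FileOutputStream}

    object IndexFileSketch {
      // Hypothetical layout: cumulative offsets, one 8-byte Long per partition,
      // plus a trailing Long marking the end of the data file.
      def writeIndex(indexFile: File, partitionLengths: Array[Long]): Unit = {
        val out = new DataOutputStream(new FileOutputStream(indexFile))
        try {
          var offset = 0L
          out.writeLong(offset)
          for (length <- partitionLengths) {
            offset += length
            out.writeLong(offset)
          }
        } finally out.close()
      }

      // A reducer that wants `partitionId` reads two consecutive offsets and
      // fetches exactly that byte range from the single data file.
      def readPartitionRange(indexFile: File, partitionId: Int): (Long, Long) = {
        val in = new DataInputStream(new FileInputStream(indexFile))
        try {
          in.skipBytes(partitionId * 8) // each offset is an 8-byte Long
          (in.readLong(), in.readLong())
        } finally in.close()
      }
    }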


The core logic lives in the class org.apache.spark.shuffle.sort.SortShuffleWriter. A brief analysis of its implementation follows:

1) For each partition, a scala.Array is created to store the key/value pairs belonging to it. Each incoming key/value pair is inserted into the scala.Array of the appropriate partition.

2) If the size of such a scala.Array exceeds a threshold, the in-memory data is spilled to external storage. At the beginning of the spill file, the partition ID and the number of pairs held in the file are recorded.

3) Finally, all the files spilled to external storage are merge-sorted (see the sketch after this list). The number of files opened at the same time should not be too large, since that consumes a lot of memory and increases the risk of OOM and GC pressure, nor too small, since that hurts performance and increases the latency of the computation. In general, opening 10 to 100 files at a time is recommended.

4) When the final data file is generated, the index file is generated at the same time. As mentioned earlier, this index file records the range of each partition within the data file.
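For step 3, here is a minimal sketch of a k-way merge of spilled runs using a priority queue, assuming each run is already ordered by partition ID; the iterator and record types are made up for illustration and are not Spark's internal ones:

    import scala.collection.mutable

    object MergeSpillsSketch {
      type Record = (Int, (String, Int)) // (partition ID, (key, value)), a made-up record shape

      // Merge several spill iterators, each already sorted by partition ID,
      // into one stream that is globally sorted by partition ID.
      def mergeSpills(spills: Seq[Iterator[Record]]): Iterator[Record] = {
        // The queue always surfaces the iterator whose head record has the smallest partition ID.
        val heap = mutable.PriorityQueue.empty[BufferedIterator[Record]](
          Ordering.by[BufferedIterator[Record], Int](_.head._1).reverse)
        spills.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

        new Iterator[Record] {
          override def hasNext: Boolean = heap.nonEmpty
          override def next(): Record = {
            val it = heap.dequeue()
            val record = it.next()
            if (it.hasNext) heap.enqueue(it) // re-insert with its new head record
            record
          }
        }
      }
    }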

Of course, you may have a question: hash-based shuffle already partitions its output according to the key's org.apache.spark.HashPartitioner, writing a separate partition for each reducer, and when spark.shuffle.consolidateFiles is enabled, a subsequent shuffle map task running on the same core appends its results to the file produced by the previous task. In other words, the per-partition logic of sort-based shuffle could seemingly be folded into hash-based shuffle, so why re-implement a shuffle writer? I think there are the following reasons:

    • The shuffle mechanism is one of the core mechanisms of any computation engine of this kind, so large changes to it carry very high risk. For example, the seemingly simple consolidation mechanism was introduced in 0.8.1, yet as of 1.2.0 it is still not the default option.
    • If the sort logic were grafted onto hash-based shuffle, the so-called improvement might affect existing Spark applications that are already stable. As things stand, an application whose hash-based shuffle performance is exactly as expected can migrate to Spark 1.2.0 seamlessly by changing only one configuration setting (see the snippet after this list).
    • As a general-purpose computing platform, its test cases will never cover every scenario, so the choice is left to the user.
    • The sort-based mechanism is itself still in a phase of continuous improvement; many optimizations and functional enhancements are ongoing, so expect it to be further polished in later versions.
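For completeness, here is a minimal sketch of that one-setting migration, keeping the pre-1.2.0 behavior after an upgrade; the application name is made up:

    import org.apache.spark.{SparkConf, SparkContext}

    object HashShuffleApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("HashShuffleApp")          // hypothetical application name
          .set("spark.shuffle.manager", "hash")  // keep the pre-1.2.0 behavior explicitly
        val sc = new SparkContext(conf)
        // ... the rest of the application is unchanged ...
        sc.stop()
      }
    }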

