Spark Performance Optimization | Shuffle Tuning

This blog post covers shuffle tuning.
1. Adjust the buffer size on the map side
While a Spark job runs, if the map side of a shuffle processes a large amount of data while the map-side buffer stays at a fixed, small size, the buffered data may be spilled to disk files frequently, which hurts performance badly. Increasing the map-side buffer avoids frequent disk I/O and thereby improves the overall performance of the Spark job.

The map-side buffer defaults to 32KB. If each task processes 640KB of data, 640 / 32 = 20 spill writes occur; if each task processes 64000KB of data, 64000 / 32 = 2000 spill writes occur. This has a serious impact on performance.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "64k") // default 32k; the value is read in KiB if no unit is given

2. Adjust the size of the data buffer on the reduce side
During a Spark shuffle, the buffer size of the reduce task determines how much data the reduce task can buffer, that is, how much data it can pull in each fetch. If memory resources are sufficient, increasing the buffer size reduces the number of fetches and hence the number of network transfers, thereby improving performance.

The size of the reduce-side fetch buffer can be set via the spark.reducer.maxSizeInFlight parameter; the default is 48MB.
val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96m") // default 48m; the value is read in MiB if no unit is given

3. Adjust the number of retries to pull data on the reduce side
During a Spark shuffle, when a reduce task pulls its own data, it automatically retries fetches that fail due to network problems and other transient causes. For jobs that include particularly time-consuming shuffle operations, it is advisable to increase the maximum number of retries (for example, to 60) so that fetches are not abandoned because of JVM full GC pauses or network instability. In practice, for shuffles over extremely large data volumes (billions to tens of billions of records), adjusting this parameter can greatly improve stability.

The retry count can be set through the spark.shuffle.io.maxRetries parameter, which is the maximum number of retry attempts. If the fetch still fails after the given number of attempts, the job may fail. The default is 3.
val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "6") // default 3

4. Adjust the wait interval for the reduce side to pull data
During a Spark shuffle, when a reduce task's fetch fails due to network problems and other transient causes, it automatically retries, waiting a certain interval after each failure before the next attempt. Increasing this interval (for example, to 60s) improves the stability of the shuffle.

The wait interval between fetch retries can be set via the spark.shuffle.io.retryWait parameter; the default is 5s.
val conf = new SparkConf()
  .set("spark.shuffle.io.retryWait", "60s") // default 5s

5. Adjust the SortShuffle sort operation threshold
With SortShuffleManager, if the number of shuffle reduce tasks is below a certain threshold, the shuffle write skips sorting and writes the data in the manner of the unoptimized HashShuffleManager, except that at the end all the temporary disk files produced by each task are merged into a single file and a separate index file is created.

When you use SortShuffleManager and genuinely do not need sorting, it is advisable to raise this threshold above the number of shuffle read tasks; the map side then skips sorting, saving its overhead. A large number of disk files are still produced along the way, however, so shuffle write performance still leaves room for improvement. The threshold can be set through the spark.shuffle.sort.bypassMergeThreshold parameter; the default value is 200.
val conf = new SparkConf()
  .set("spark.shuffle.sort.bypassMergeThreshold", "400") // default 200

That wraps up this share on shuffle tuning.