Shuffle Tuning Parameters
new SparkConf().set("spark.shuffle.consolidateFiles", "true")
spark.shuffle.consolidateFiles: whether to merge shuffle output files; default false. When enabled, map tasks (the MapPartitionsRDD side) writing data for the next stage's reduce tasks (ResultTask) share a consolidated pool of output files instead of each creating its own, which cuts the number of shuffle files. (See the shuffle internals notes for the difference between enabling and not enabling it.)
spark.reducer.maxSizeInFlight: the reduce task's fetch buffer size, default 48m. A larger buffer lets each fetch pull more data at once, reducing the number of fetch round trips.
spark.shuffle.file.buffer: the map task's disk-write buffer size, default 32k. A larger buffer reduces the number of spills to disk, improving write performance.
spark.shuffle.io.maxRetries: the maximum number of retries for a failed shuffle fetch, default 3. Raising it guards against fetch failures that occur while the remote executor's task threads are stalled by a minor GC or full GC: with the defaults of 3 retries at 5 seconds apiece, the fetch gives up after only 15 seconds, so a full GC lasting a minute would outlast the retries and Spark would report a fetch failure, which can even crash the application.
spark.shuffle.io.retryWait: the wait between failed-fetch retries, default 5s; see above.
spark.shuffle.memoryFraction: the fraction of executor memory used for reduce-side aggregation, default 0.2; data beyond this fraction spills to disk. Increasing it gives the reduce side more memory and reduces the number of spills.
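Putting the parameters above together, they can all be set on one SparkConf before the SparkContext is created. A minimal sketch follows; the values are illustrative assumptions for discussion, not recommendations, and the app name is hypothetical:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune each against your own workload.
val conf = new SparkConf()
  .setAppName("shuffle-tuning-example")          // hypothetical app name
  .set("spark.shuffle.consolidateFiles", "true") // default false: merge map-side shuffle files
  .set("spark.reducer.maxSizeInFlight", "96m")   // default 48m: larger fetch buffer, fewer fetches
  .set("spark.shuffle.file.buffer", "64k")       // default 32k: larger write buffer, fewer spills
  .set("spark.shuffle.io.maxRetries", "10")      // default 3: survive longer GC pauses
  .set("spark.shuffle.io.retryWait", "10s")      // default 5s between fetch retries
  .set("spark.shuffle.memoryFraction", "0.3")    // default 0.2 for reduce-side aggregation
```

The same keys can instead be passed as `--conf` options to spark-submit or placed in spark-defaults.conf if the settings should apply cluster-wide rather than per application.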
Performance tuning for Spark's newer optimized shuffle