Spark Performance Optimization: Shuffle Tuning


Tuning Overview

The performance of most Spark jobs is consumed mainly in the shuffle phase, because this phase involves a large amount of disk I/O, serialization, and network data transfer. Therefore, if you want to push a job's performance to a higher level, it is necessary to tune the shuffle process. However, it is also important to note that the main factors affecting the performance of a Spark job are code quality, resource parameters, and data skew; shuffle tuning accounts for only a fraction of the overall performance tuning of a Spark job. We must therefore keep the basic principles of tuning in mind and never pursue trivial details at the expense of the essentials. Below we explain the shuffle principle in detail, describe the relevant parameters, and give tuning recommendations for each parameter.

Overview of ShuffleManager Development

In Spark's source code, the component responsible for executing, computing, and processing the shuffle process is primarily the ShuffleManager. As Spark has evolved, the ShuffleManager has been iterated on continuously and has become increasingly advanced.

Prior to Spark 1.2, the default shuffle compute engine was HashShuffleManager. This ShuffleManager has a very serious drawback: it produces a large number of intermediate disk files, and the resulting disk I/O operations degrade performance.

Therefore, starting with Spark 1.2, the default ShuffleManager was changed to SortShuffleManager. Compared with HashShuffleManager, SortShuffleManager brings certain improvements. The main one is that although each task still produces quite a few temporary disk files during the shuffle operation, all the temporary files are eventually merged into a single disk file, so each task ends up with only one disk file. When the shuffle read tasks of the next stage pull their data, they only need to read the relevant portion of each disk file according to the index.

Below we analyze the principles of HashShuffleManager and SortShuffleManager in detail.

Unoptimized HashShuffleManager

First, let us define a hypothetical premise: each executor has only one CPU core, which means that no matter how many task threads are allocated to the executor, only one task thread can execute at a time.

Let's start with shuffle write. The shuffle write phase mainly occurs after one stage finishes its computation, so that the next stage can execute shuffle-class operators (such as reduceByKey): the data processed by each task is "classified" by key. This "classification" means executing a hash algorithm on each key, so that identical keys are written to the same disk file, and each disk file belongs to exactly one task of the downstream stage. Data is first written to a memory buffer, and only when the buffer fills up is it spilled to the disk file.
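To make the "classification" concrete, here is a minimal sketch modeled on the behavior of Spark's HashPartitioner (not the actual source): the key's hash, taken modulo the number of downstream tasks, selects the disk file a record is routed to.

    // Sketch: which downstream task (and therefore which disk file) a key goes to.
    // nonNegativeMod accounts for the fact that hashCode can be negative.
    def nonNegativeMod(x: Int, mod: Int): Int = {
      val raw = x % mod
      if (raw < 0) raw + mod else raw
    }

    def targetPartition(key: Any, numReduceTasks: Int): Int =
      nonNegativeMod(key.hashCode, numReduceTasks)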

So how many disk files does each task executing shuffle write create for the next stage? It is simple: as many disk files as there are tasks in the next stage. For example, if the next stage has a total of 100 tasks, then each task in the current stage creates 100 disk files. If the current stage has 50 tasks spread over 10 executors, with each executor running 5 tasks, then 500 disk files are created on each executor and 5,000 disk files across all executors. The number of disk files produced by the shuffle write operation is thus staggering.
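As a sanity check, the arithmetic behind these numbers:

    // File counts for the unoptimized HashShuffleManager, using the figures above.
    val tasksPerExecutor = 5
    val numExecutors     = 10
    val nextStageTasks   = 100
    val filesPerExecutor = tasksPerExecutor * nextStageTasks   // 5 * 100  = 500
    val totalFiles       = filesPerExecutor * numExecutors     // 500 * 10 = 5000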

Now for shuffle read. Shuffle read is usually the first thing that happens at the beginning of a stage. At this point, each task of the stage needs to pull all the identical keys in the computation results of the previous stage over the network from every node to its own node, and then aggregate or join them by key. Because during shuffle write each task created one disk file for every task of the downstream stage, during shuffle read each task only needs to pull, from every node hosting an upstream task, the disk files that belong to it.

Shuffle read pulls and aggregates at the same time. Each shuffle read task has its own buffer, and each pull fetches only as much data as the buffer can hold; the data is then aggregated through a map in memory. After one batch of data is aggregated, the next batch is pulled into the buffer for aggregation, and so on, until all the data has been pulled and the final result is obtained.
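Conceptually, the pull-and-aggregate loop looks like the sketch below. The real read path (e.g. ExternalAppendOnlyMap) also handles spilling to disk; this only shows the in-memory map aggregation described above.

    import scala.collection.mutable

    // Each element of `batches` is one buffer-sized batch pulled over the network.
    def aggregateWhilePulling[K, V](batches: Iterator[Seq[(K, V)]],
                                    combine: (V, V) => V): mutable.Map[K, V] = {
      val agg = mutable.Map.empty[K, V]
      for (batch <- batches; (k, v) <- batch)
        agg(k) = agg.get(k).map(combine(_, v)).getOrElse(v)  // merge into the running aggregate
      agg
    }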



The Optimized HashShuffleManager

The optimization here refers to setting the parameter spark.shuffle.consolidateFiles. Its default value is false; setting it to true enables the optimization mechanism. Generally speaking, if you use HashShuffleManager, it is recommended to turn this option on.
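Enabling it is a one-line configuration change; for example (Spark 1.x, since the parameter was later removed along with HashShuffleManager):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.manager", "hash")           // use HashShuffleManager
      .set("spark.shuffle.consolidateFiles", "true")  // enable the consolidate mechanism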

When the consolidate mechanism is enabled, a task no longer creates one disk file for every task of the downstream stage during shuffle write. Instead, the concept of a shuffleFileGroup appears: each shuffleFileGroup corresponds to a batch of disk files, and the number of disk files equals the number of tasks in the downstream stage. An executor can run as many tasks in parallel as it has CPU cores. Each task in the first batch of parallel tasks creates a shuffleFileGroup and writes its data to the corresponding disk files.

When the executor's CPU cores finish one batch of tasks and begin the next, the new batch of tasks reuses the existing shuffleFileGroups, including their disk files. That is, a task now writes its data to existing disk files rather than creating new ones. The consolidate mechanism thus lets different tasks reuse the same batch of disk files, effectively merging the disk output of multiple tasks and drastically reducing the number of disk files, which improves shuffle write performance.

Assume the second stage has 100 tasks and the first stage has 50 tasks spread over 10 executors, each running 5 tasks. With the original, unoptimized HashShuffleManager, each executor produces 500 disk files and all executors together produce 5,000. After optimization, the number of disk files created per executor is: number of CPU cores × number of tasks in the next stage. With one core per executor, each executor now creates only 100 disk files, and all executors together create only 1,000.
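Again as a sanity check, the consolidated counts follow directly from that formula:

    // File counts with the consolidate mechanism, using the same figures.
    val coresPerExecutor        = 1    // the one-core-per-executor premise above
    val nextStageTaskCount      = 100
    val consolidatedPerExecutor = coresPerExecutor * nextStageTaskCount  // 1 * 100 = 100
    val consolidatedTotal       = consolidatedPerExecutor * 10           // 100 * 10 = 1000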

The Operating Mechanism of SortShuffleManager

SortShuffleManager has two operating mechanisms: the normal mechanism and the bypass mechanism. The bypass mechanism is enabled when the number of shuffle read tasks is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default).
The Normal Operating Mechanism

In this mode, data is first written into an in-memory data structure; depending on the shuffle operator, different data structures may be chosen. For an aggregation-class shuffle operator such as reduceByKey, a Map structure is chosen, and the data is aggregated through the map while being written into memory. For an ordinary shuffle operator such as join, an Array structure is used and records are written into memory directly. Then, after each record enters the in-memory structure, a check determines whether a critical threshold has been reached; if so, an attempt is made to spill the in-memory data to disk and then empty the data structure.
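At the API level, the distinction maps onto the operators as in this runnable toy example (internally, the two roles are played by structures such as PartitionedAppendOnlyMap and PartitionedPairBuffer):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-demo").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val counts = pairs.reduceByKey(_ + _)  // aggregation operator: Map structure, aggregate while writing
    val joined = pairs.join(pairs)         // ordinary shuffle operator: Array structure, write directly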

Before being spilled to the disk file, the data already in the memory structure is sorted by key. After sorting, the data is written to the disk file in batches. The default batch size is 10,000 records; that is, sorted data is written to disk in batches of 10,000 records each. The writes go through Java's BufferedOutputStream, a buffered output stream that first buffers data in memory and writes to the disk file only once the memory buffer is full, which reduces the number of disk I/O operations and improves performance.
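The batched write can be sketched as follows; the file format and record type here are invented for illustration, but the 10,000-record batches and the BufferededOutputStream-style buffering are as described above:

    import java.io.{BufferedOutputStream, FileOutputStream}

    def spillSorted(sorted: Seq[(String, Int)], path: String, batchSize: Int = 10000): Unit = {
      val out = new BufferedOutputStream(new FileOutputStream(path))
      try {
        sorted.grouped(batchSize).foreach { batch =>
          batch.foreach { case (k, v) => out.write(s"$k\t$v\n".getBytes("UTF-8")) }
          out.flush()  // push each 10,000-record batch to the OS via buffered writes
        }
      } finally out.close()
    }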

While a task writes all its data into the memory data structure, multiple disk spills occur and several temporary files are generated. Finally, all the earlier temporary disk files are merged; this is the merge process, which reads the data from every temporary file and writes it, in turn, into a single final disk file. Because a task corresponds to only one disk file, meaning that all the data the task has prepared for the tasks of the downstream stage lives in that one file, a separate index file is also written, identifying the start offset and end offset of each downstream task's data within the file.
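The index idea can be sketched like this (a simplification; Spark's actual index file similarly stores cumulative byte offsets, one per partition boundary):

    import java.io.{DataOutputStream, FileOutputStream}

    // segmentLengths(i) = number of bytes written for downstream task i.
    def writeIndex(segmentLengths: Array[Long], indexPath: String): Unit = {
      val out = new DataOutputStream(new FileOutputStream(indexPath))
      try {
        var offset = 0L
        out.writeLong(offset)      // start offset of partition 0
        for (len <- segmentLengths) {
          offset += len
          out.writeLong(offset)    // end of this partition = start of the next
        }
      } finally out.close()
    }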

Because SortShuffleManager has this disk-file merge step, it greatly reduces the number of files. For example, if the first stage has 50 tasks spread over 10 executors (each running 5 tasks) and the second stage has 100 tasks, then, since each task ultimately has only one disk file, there are only 5 disk files per executor and only 50 disk files across all executors.

The Bypass Operating Mechanism

The bypass mechanism is triggered when both of the following conditions hold:

1. The number of shuffle map tasks is less than the value of the spark.shuffle.sort.bypassMergeThreshold parameter.
2. The operator is not an aggregation-class shuffle operator (such as reduceByKey).

In this case, each task creates a temporary disk file for each downstream task, hashes the data by key, and writes each key to the corresponding disk file according to its hash value. Of course, writing to a disk file still goes through a memory buffer first, which is spilled to the disk file once full. Finally, all the temporary disk files are merged into a single disk file, and a separate index file is created.
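A condensed sketch of that write path, with serialization and error handling stripped out (the text format and file names are illustrative only):

    import java.io.{BufferedOutputStream, File, FileOutputStream}

    def bypassWrite(records: Iterator[(String, Int)], numPartitions: Int, dir: File): File = {
      val tmpFiles = Array.tabulate(numPartitions)(i => new File(dir, s"tmp-$i"))
      val outs = tmpFiles.map(f => new BufferedOutputStream(new FileOutputStream(f)))
      records.foreach { case (k, v) =>
        val p = Math.floorMod(k.hashCode, numPartitions)  // route by key hash, as in HashShuffleManager
        outs(p).write(s"$k\t$v\n".getBytes("UTF-8"))
      }
      outs.foreach(_.close())
      val data = new File(dir, "shuffle.data")
      val merged = new FileOutputStream(data)
      tmpFiles.foreach { f =>            // concatenate in partition order, without sorting
        java.nio.file.Files.copy(f.toPath, merged)
        f.delete()
      }
      merged.close()
      data
    }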

The disk write mechanism of this process is in fact identical to that of the unoptimized HashShuffleManager, in that a staggering number of disk files are created; it is just that they are merged into one file at the end. The small number of final disk files therefore gives this mechanism better shuffle read performance than the unoptimized HashShuffleManager.

This mechanism differs from the normal SortShuffleManager mechanism in two ways: first, the disk write mechanism is different; second, no sorting is performed. In other words, the biggest benefit of enabling this mechanism is that the shuffle write process does not need to sort the data, saving that part of the performance overhead.

Shuffle-Related Parameter Tuning

The following are the main parameters in the shuffle process. Their functions, default values, and tuning recommendations based on practical experience are explained in detail below.

spark.shuffle.file.buffer Default value: 32k
Parameter description: This parameter sets the buffer size of the shuffle write task's BufferedOutputStream. Before data is written to the disk file, it is written to this buffer, which is spilled to disk once full.
Tuning recommendations: If the job has sufficient memory available, this parameter can be increased appropriately (for example, to 64k), which reduces the number of spills to disk during shuffle write and therefore the number of disk I/O operations, improving performance. In practice, adjusting this parameter sensibly has been found to improve performance by 1%~5%.
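For example, this and the other shuffle parameters below can be set on the SparkConf before the context is created (the value here is the one suggested above; the same keys can also be passed via spark-submit --conf):

    import org.apache.spark.SparkConf

    val tunedConf = new SparkConf()
      .set("spark.shuffle.file.buffer", "64k")  // double the 32k default, as suggested above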

spark.reducer.maxSizeInFlight Default value: 48m
Parameter description: This parameter sets the buffer size of the shuffle read task, which determines how much data can be pulled at a time.
Tuning recommendations: If the job has sufficient memory available, this parameter can be increased appropriately (for example, to 96m), which reduces the number of pulls and therefore the number of network transfers, improving performance. In practice, adjusting this parameter sensibly has been found to improve performance by 1%~5%.

spark.shuffle.io.maxRetries Default value: 3
Parameter description: When a shuffle read task pulls its own data from the node hosting the shuffle write task, a pull that fails because of a network anomaly is retried automatically. This parameter is the maximum number of retries. If the pull still fails after this many attempts, the job may fail.
Tuning recommendations: For jobs containing particularly time-consuming shuffle operations, it is recommended to increase the maximum number of retries (for example, to 60) to avoid pull failures caused by the JVM's full GC or network instability. In practice, adjusting this parameter has been found to greatly improve stability for shuffles over very large data volumes (billions to tens of billions of records).

spark.shuffle.io.retryWait Default value: 5s
Parameter description: As explained above, this parameter is the wait interval between retries of a data pull; the default is 5s.
Tuning recommendations: It is recommended to increase the interval (for example, to 60s) to improve the stability of the shuffle operation.

spark.shuffle.memoryFraction Default value: 0.2
Parameter description: This parameter is the proportion of executor memory allocated to shuffle read tasks for aggregation operations; the default is 20%.
Tuning recommendations: This parameter is also explained in the resource-parameter tuning material. If memory is plentiful and persistence operations are rarely used, it is recommended to increase this proportion, giving shuffle read more memory for aggregation and avoiding the frequent disk reads and writes that occur when aggregation runs out of memory. In practice, adjusting this parameter sensibly has been found to improve performance by about 10%.

spark.shuffle.manager Default value: sort
Parameter description: This parameter sets the type of ShuffleManager. From Spark 1.5 on there are three options: hash, sort, and tungsten-sort. HashShuffleManager was the default before Spark 1.2; Spark 1.2 and later default to SortShuffleManager. tungsten-sort is similar to sort, but uses the off-heap memory management mechanism of the Tungsten project, making memory use more efficient.
Tuning recommendations: SortShuffleManager sorts data by default, so if your business logic needs that sorting mechanism, you can use the default SortShuffleManager. If your business logic does not need the data sorted, it is recommended to use the parameters discussed below to avoid sorting via the bypass mechanism or the optimized HashShuffleManager, while also obtaining better disk read/write performance. Note that tungsten-sort should be used with caution, as some bugs have previously been found in it.

spark.shuffle.sort.bypassMergeThreshold Default value: 200
Parameter description: When the ShuffleManager is SortShuffleManager, if the number of shuffle read tasks is less than this threshold (default 200), the shuffle write process does not sort; instead the data is written in the unoptimized HashShuffleManager style, except that all the temporary disk files produced by each task are finally merged into one file and a separate index file is created.
Tuning recommendations: When using SortShuffleManager, if sorting is indeed not needed, it is recommended to set this parameter somewhat larger than the number of shuffle read tasks. The bypass mechanism is then enabled automatically and the map side does not sort, reducing the sorting overhead. However, a large number of temporary disk files are still produced in this mode, so shuffle write performance remains to be improved.
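For instance, if the largest shuffle in the job has 300 read tasks, a setting just above that would enable the bypass path (300 and 500 here are illustrative numbers, not from the text):

    import org.apache.spark.SparkConf

    // Raise the threshold above the job's shuffle read task count (e.g. 300)
    // so the bypass mechanism kicks in and map-side sorting is skipped.
    val bypassConf = new SparkConf()
      .set("spark.shuffle.sort.bypassMergeThreshold", "500")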

spark.shuffle.consolidateFiles Default value: false
Parameter description: This parameter is effective only when HashShuffleManager is used. If set to true, the consolidate mechanism is enabled and the shuffle write output files are heavily merged. When the number of shuffle read tasks is particularly large, this can greatly reduce disk I/O overhead and improve performance.
Tuning recommendations: If you indeed do not need SortShuffleManager's sorting mechanism, then besides the bypass mechanism, you can also try manually setting the spark.shuffle.manager parameter to hash, using HashShuffleManager with the consolidate mechanism enabled. In practice, its performance has been found to be 10%~30% higher than that of SortShuffleManager with the bypass mechanism enabled.
