Spark Resource parameter tuning

Resource parameter tuning

Once you understand the fundamentals of how a Spark job runs, the resource-related parameters are easy to understand. So-called Spark resource parameter tuning means adjusting the parameters that govern the resources Spark uses while running, in order to improve resource utilization and thereby the performance of the Spark job. The following are the main resource parameters in Spark; each corresponds to a part of the job's operating principle, and a reference value for tuning is given for each.

num-executors
    • Parameter description: This parameter sets the total number of executor processes used to run the Spark job. When the driver requests resources from the YARN cluster manager, YARN starts the corresponding number of executor processes on the cluster's worker nodes according to this setting. The parameter is very important: if it is not set, only a small number of executor processes are started by default, and the Spark job runs very slowly.
    • Parameter tuning recommendations: For most Spark jobs, 50~100 executor processes is appropriate; both too few and too many are problematic. With too few, cluster resources cannot be fully used; with too many, most queues cannot provide sufficient resources.
executor-memory
    • Parameter description: This parameter sets the amount of memory for each executor process. The size of executor memory often directly determines Spark job performance, and it is also directly related to the common JVM OOM exceptions.
    • Parameter tuning recommendations: 4g~8g of memory per executor process is usually appropriate. This is only a reference value; the actual setting depends on your department's resource queue. Check the maximum memory limit of your team's resource queue: num-executors times executor-memory must not exceed the queue's maximum memory. In addition, if you share the queue with others on the team, it is best that the total memory you request stays within 1/3~1/2 of the queue's maximum total memory, so that your Spark job does not occupy all of the queue's resources and block your colleagues' jobs.
executor-cores
    • Parameter description: This parameter sets the number of CPU cores for each executor process. It determines how many task threads each executor process can execute in parallel: since each CPU core can execute only one task thread at a time, the more CPU cores an executor process has, the faster it can finish all the task threads assigned to it.
    • Parameter tuning recommendations: 1~4 CPU cores per executor is usually appropriate. This also depends on your department's resource queue: look at the queue's maximum CPU core limit and, given the number of executors you set, work out how many CPU cores each executor process can be allocated. Likewise, if you share the queue with others, it is best that num-executors * executor-cores stays within roughly 1/3~1/2 of the queue's total CPU cores, so other colleagues' jobs are not affected (a back-of-the-envelope check is sketched below).
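The memory and core budgets above are simple arithmetic: total memory = num-executors * executor-memory and total cores = num-executors * executor-cores, both kept within the queue limit and ideally within 1/3~1/2 of a shared queue. A minimal sketch of that check in Scala, using made-up queue limits and request sizes that are not from the article:

    object ResourceBudgetSketch {
      def main(args: Array[String]): Unit = {
        // Assumed (hypothetical) limits of the shared YARN resource queue.
        val queueMemoryGb = 1000
        val queueCores    = 400

        // The requested allocation (--num-executors, --executor-memory, --executor-cores).
        val numExecutors     = 50
        val executorMemoryGb = 8
        val executorCores    = 4

        val totalMemoryGb = numExecutors * executorMemoryGb   // 400 GB requested
        val totalCores    = numExecutors * executorCores      // 200 cores requested

        // Stay within roughly half of a shared queue, per the guidance above.
        val memoryOk = totalMemoryGb <= queueMemoryGb / 2
        val coresOk  = totalCores    <= queueCores / 2
        println(s"memory request $totalMemoryGb GB within budget: $memoryOk")
        println(s"core request $totalCores cores within budget: $coresOk")
      }
    }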
driver-memory
    • Parameter description: This parameter sets the memory of the driver process.
    • Parameter tuning recommendations: Driver memory usually does not need to be set, or around 1G should be enough. The only thing to note is that if you use the collect operator to pull all of an RDD's data back to the driver for processing, the driver memory must be large enough, otherwise an OOM (memory overflow) error will occur (see the sketch below).
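To illustrate the collect caution, here is a minimal Scala sketch; the input path, output path, and the 100-record sample size are hypothetical, not from the article:

    import org.apache.spark.{SparkConf, SparkContext}

    object CollectSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("collect-sketch"))
        val lines = sc.textFile("hdfs:///tmp/some-large-input")   // hypothetical input

        // collect() materializes the entire RDD in the driver JVM, so it only works
        // if --driver-memory can hold all of the data; otherwise the driver throws OOM.
        // val everything = lines.collect()

        // Safer alternatives when the full dataset is not needed on the driver:
        val sample = lines.take(100)                         // pull only a bounded sample
        lines.saveAsTextFile("hdfs:///tmp/some-output")      // or keep the processing distributed
        println(s"sampled ${sample.length} lines")
        sc.stop()
      }
    }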
spark.default.parallelism
    • Parameter description: This parameter sets the default number of tasks per stage. It is extremely important: if it is not set, it may directly affect the performance of your Spark job.
    • Parameter tuning recommendations: 500~1000 default tasks per stage is appropriate for many Spark jobs. A common mistake is not to set this parameter at all, in which case Spark derives the number of tasks from the number of blocks in the underlying HDFS files, with one task per HDFS block by default. That default is usually far too small (for example, a few dozen tasks), and if the task count is too small, the executor parameters you set earlier are wasted. Imagine: no matter how many executor processes, how much memory, and how many cores you have, if there are only 1 or 10 tasks, then 90% of the executor processes may have no task to execute at all, which is a waste of resources. The setting principle recommended on the Spark website is therefore to make this parameter about 2~3 times num-executors * executor-cores; for example, if the executors have 300 CPU cores in total, then setting 1000 tasks is fine, and the Spark cluster's resources can be fully used (see the sketch below).
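As a sketch of this rule of thumb in code, the following derives spark.default.parallelism from assumed executor figures; the numbers mirror the 300-core example above but are otherwise illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismSketch {
      def main(args: Array[String]): Unit = {
        val numExecutors  = 100
        val executorCores = 3                                  // 100 * 3 = 300 cores in total
        val parallelism   = numExecutors * executorCores * 3   // 2~3x the total core count, ~900 tasks

        val conf = new SparkConf()
          .setAppName("parallelism-sketch")
          .set("spark.default.parallelism", parallelism.toString)
        val sc = new SparkContext(conf)
        // ... job code ...
        sc.stop()
      }
    }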
spark.storage.memoryFraction
    • Parameter description: This parameter sets the proportion of executor memory that can be used for persisted RDD data; the default is 0.6. That is, by default 60% of an executor's memory can be used to hold persisted RDD data. Depending on the persistence level you choose, if there is not enough memory the data may not be persisted at all, or it may be written to disk.
    • Parameter tuning recommendations: If the Spark job has many RDD persistence operations, this value can be increased appropriately, so that the persisted data fits in memory; this avoids the situation where memory is insufficient to cache all the data and it can only be written to disk, which reduces performance. However, if the job has many shuffle operations and few persistence operations, it is better to reduce this value. In addition, if you find the job runs slowly because of frequent GC (the job's GC time can be observed in the Spark Web UI), which means the tasks do not have enough memory to execute user code, it is also recommended to lower this value (a cache-heavy example is sketched below).
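A minimal sketch of a cache-heavy job that raises this fraction and persists an RDD, assuming a hypothetical input path and a Spark 1.x-style configuration in which spark.storage.memoryFraction is honored:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object StorageFractionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("storage-fraction-sketch")
          .set("spark.storage.memoryFraction", "0.7")   // raised from the 0.6 default for a cache-heavy job
        val sc = new SparkContext(conf)

        val lookup = sc.textFile("hdfs:///tmp/lookup-data")      // hypothetical input reused many times
          .persist(StorageLevel.MEMORY_AND_DISK)                 // blocks that do not fit in memory spill to disk

        println(lookup.count())   // the first action materializes and caches the RDD
        println(lookup.count())   // later actions read from the cache
        sc.stop()
      }
    }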
spark.shuffle.memoryFraction
    • Parameter description: This parameter sets the proportion of executor memory that a task can use for aggregation after pulling the output of the previous stage's tasks during shuffle; the default is 0.2. That is, by default only 20% of executor memory is available for this operation. If the memory used during shuffle aggregation exceeds this 20% limit, the excess data is spilled to disk files, which can greatly degrade performance.
    • Parameter tuning recommendations: If the Spark job has few RDD persistence operations and many shuffle operations, it is recommended to reduce the memory fraction for persistence and increase the memory fraction for shuffle, so that when there is too much data during shuffle the job does not run out of memory and have to spill to disk, which reduces performance (see the sketch below). In addition, if you find the job runs slowly because of frequent GC, which means the tasks do not have enough memory to execute user code, it is also recommended to lower this value.
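Conversely, for a shuffle-heavy job with little caching, a sketch of shifting memory from the storage fraction to the shuffle fraction; the paths and values are illustrative and again assume a Spark 1.x-style configuration:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleFractionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("shuffle-fraction-sketch")
          .set("spark.storage.memoryFraction", "0.4")   // below the 0.6 default: few persisted RDDs
          .set("spark.shuffle.memoryFraction", "0.4")   // above the 0.2 default: more room before spilling
        val sc = new SparkContext(conf)

        val counts = sc.textFile("hdfs:///tmp/events")            // hypothetical input
          .map(line => (line.split("\t")(0), 1L))                 // key on the first tab-separated field
          .reduceByKey(_ + _)                                     // the shuffle aggregation that benefits here

        counts.saveAsTextFile("hdfs:///tmp/event-counts")
        sc.stop()
      }
    }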

There is no fixed value for these resource parameters. You need to set the parameters above reasonably according to your own situation (including the number of shuffle operations in the Spark job, the number of RDD persistence operations, and the job's GC as shown in the Spark Web UI), with reference to the principles and tuning recommendations given in this article.

Resource parameter reference example

Here is an example of a spark-submit command that you can use as a reference and adjust according to your actual situation:

./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3 \

Reference: http://tech.meituan.com/spark-tuning-basic.html
