Spark Core Secrets (14): 10 Major Spark Performance Optimization Problems and Their Solutions

Source: Internet
Author: User

Problem 1: Inappropriate number of reduce tasks

Solution:

Adjust the default configuration to the actual workload by modifying the parameter spark.default.parallelism. Typically, the number of reduce tasks is set to 2-3 times the number of cores. If the number is too large, many small tasks are created and task-launch overhead grows; if it is too small, each task runs slowly. Therefore, set spark.default.parallelism to a reasonable value.
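As a sketch, for a hypothetical cluster with 32 total executor cores, the 2-3x rule of thumb would suggest a value like:

```properties
# spark-defaults.conf -- example values for an assumed 32-core cluster
# 2-3x the total core count; 96 = 3 x 32
spark.default.parallelism  96
```

The same property can also be passed per job with `spark-submit --conf spark.default.parallelism=96`.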

Problem 2: Long disk I/O time during shuffle

Solution:

Set spark.local.dir to a comma-separated list of directories on multiple disks, preferably fast ones, so that shuffle I/O is spread across devices; the added I/O bandwidth improves shuffle performance.
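A minimal sketch, assuming three separate physical disks mounted at the example paths below:

```properties
# spark-defaults.conf -- paths are illustrative; use one directory per physical disk
spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark
```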

Problem 3: Large numbers of map and reduce tasks, producing many small shuffle files

Solution:

Merge shuffle intermediate files by setting spark.shuffle.consolidateFiles to true; the number of shuffle files then scales with the number of reduce tasks rather than with map tasks times reduce tasks.
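Without consolidation, the hash shuffle writes roughly one file per (map task, reduce task) pair, i.e. M x R files; with consolidation, files are reused across map tasks on the same core, so the count is roughly cores x R. The setting itself is a one-liner:

```properties
# spark-defaults.conf -- consolidate hash-shuffle output files
spark.shuffle.consolidateFiles  true
```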

Problem 4: Long serialization time and large serialized results

Solution:

Spark defaults to the JDK's built-in ObjectOutputStream, which produces large serialized results and long CPU processing time. You can set spark.serializer to org.apache.spark.serializer.KryoSerializer instead.

In addition, if a shared data structure is very large, it is best to distribute it as a broadcast variable rather than shipping a copy with every task.
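A configuration sketch for switching to Kryo (the buffer value is an example, not a universal recommendation):

```properties
# spark-defaults.conf -- use Kryo instead of Java serialization
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# optional: raise the max buffer if individual serialized objects are large
spark.kryoserializer.buffer.max  128m
```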

Problem 5: Processing a single record is expensive

Solution:

Replace map with mapPartitions: mapPartitions computes once per partition, while map computes once per record in the partition, so per-record setup costs can be amortized across the whole partition.
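The idea can be sketched in plain Python without Spark; the `setup` function below is a hypothetical stand-in for an expensive per-record cost such as opening a database connection:

```python
setup_calls = 0

def setup():
    """Stand-in for an expensive resource, e.g. a DB connection."""
    global setup_calls
    setup_calls += 1
    return {"conn": "fake"}

def process(ctx, record):
    return record * 2

partitions = [[1, 2, 3], [4, 5, 6]]  # two partitions of three records

# map-style: pay the setup cost for every record -> 6 setups
per_record = [process(setup(), r) for part in partitions for r in part]
map_setups = setup_calls

# mapPartitions-style: pay the setup cost once per partition -> 2 setups
setup_calls = 0
per_partition = []
for part in partitions:
    ctx = setup()  # once per partition, shared by all its records
    per_partition.extend(process(ctx, r) for r in part)

print(per_record == per_partition)  # True: same results
print(map_setups, setup_calls)      # 6 vs 2 setup calls
```

The results are identical; only the number of setup calls changes.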

Problem 6: collect is slow when outputting a large number of results

Solution:

In the collect implementation, all results are gathered into an in-memory array on the driver. Instead, write the results directly to a distributed file system and then inspect the files there.
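The contrast can be sketched without Spark: collect-style code materializes every record in one driver-side list, while saveAsTextFile-style code streams each partition to its own part file:

```python
import os
import tempfile

partitions = [[1, 2, 3], [4, 5, 6]]  # toy stand-in for RDD partitions

# collect-style: everything lands in one in-memory list on the "driver"
collected = [r for part in partitions for r in part]

# saveAsTextFile-style: each partition streams to its own part file
out_dir = tempfile.mkdtemp()
for i, part in enumerate(partitions):
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        for r in part:
            f.write(f"{r}\n")

print(sorted(os.listdir(out_dir)))  # ['part-00000', 'part-00001']
```

In real Spark the per-partition writes happen on the executors, so the driver never holds the full result set.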

Problem 7: Skewed task execution speed

Solution:

If the data is skewed, the partition key is usually poorly chosen; consider a different parallel processing strategy and add an intermediate aggregation step. If the skew is on the worker side, for example executors on certain workers run slowly, set spark.speculation=true so that persistently slow tasks are speculatively re-executed on other nodes.
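A configuration sketch enabling speculation; the multiplier and quantile below are Spark's documented defaults, shown here only to make the knobs visible:

```properties
# spark-defaults.conf -- re-launch stragglers on other nodes
spark.speculation             true
# a task is a straggler if it runs 1.5x slower than the median...
spark.speculation.multiplier  1.5
# ...once 75% of tasks in the stage have finished
spark.speculation.quantile    0.75
```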


Problem 8: Many empty or small tasks generated after multi-step RDD operations

Solution:

Use coalesce or repartition to reduce the number of partitions in the RDD.
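Conceptually, coalesce merges many small or empty partitions into fewer, fuller ones. A plain-Python sketch of the idea (real coalesce assigns partitions by locality rather than round-robin, so this is only an illustration):

```python
def coalesce(partitions, num_partitions):
    """Merge input partitions into num_partitions buckets, round-robin."""
    buckets = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        buckets[i % num_partitions].extend(part)
    return buckets

small = [[1], [], [2], [], [3], [4]]  # many tiny/empty partitions
merged = coalesce(small, 2)
print(merged)  # [[1, 2, 3], [4]]
```

Six tasks (two of them empty) become two reasonably sized ones.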

Problem 9: Low Spark Streaming throughput

Solution:

You can set spark.streaming.concurrentJobs to a value greater than 1 so that multiple jobs run concurrently.
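A configuration sketch; the value 4 is an example, not a recommendation:

```properties
# spark-defaults.conf -- allow more than one streaming job at a time (default: 1)
spark.streaming.concurrentJobs  4
```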

Problem 10: Spark Streaming suddenly slows down, with frequent task delays and blocking

Solution:

This happens because the job launch interval is set too short, so each job cannot finish within its allotted interval; in other words, the batch windows are created too densely.
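The stability condition behind this advice can be stated in one line: a streaming job only keeps up if the average batch processing time fits within the batch interval. A trivial Python check (the numbers are made up for illustration):

```python
def is_stable(batch_interval_s, avg_processing_time_s):
    """A streaming job keeps up only if each batch finishes within its interval."""
    return avg_processing_time_s <= batch_interval_s

print(is_stable(2.0, 3.5))  # False: 2s batches that take 3.5s pile up into a backlog
print(is_stable(5.0, 3.5))  # True: lengthening the interval lets each batch finish
```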

