Problem 1: The number of reduce tasks is inappropriate
Solution:
Adjust the default to match the actual workload by modifying the parameter spark.default.parallelism. A common rule of thumb is to set the number of reduce tasks to 2-3 times the number of cores. If the number is too large, it creates many small tasks and increases task-launch overhead; if it is too small, each task handles too much data and the job runs slowly. So tune the reduce task count via spark.default.parallelism.
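A minimal sketch of the setting above. The cluster size (50 executors with 4 cores each) is an illustrative assumption; substitute your own numbers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed cluster shape: 50 executors x 4 cores (illustrative only).
val totalCores = 50 * 4

// Rule of thumb from the text: parallelism = 2-3x total cores.
val conf = new SparkConf()
  .setAppName("parallelism-tuning")
  .set("spark.default.parallelism", (totalCores * 3).toString)
val sc = new SparkContext(conf)

// The same value can also be passed per shuffle operation, e.g.:
// rdd.reduceByKey(_ + _, totalCores * 3)
```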
Problem 2: Long disk I/O time during shuffle
Solution:
Set spark.local.dir to directories on multiple disks, preferably fast ones, so that shuffle I/O is spread across devices and shuffle performance improves through added I/O bandwidth;
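A sketch of spreading shuffle spill directories across disks. The mount paths are assumptions; point them at directories on physically different devices:

```scala
import org.apache.spark.SparkConf

// Comma-separated list of local dirs on different disks (paths assumed).
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")
```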
Problem 3: Large numbers of map and reduce tasks produce too many small shuffle files
Solution:
Merge shuffle intermediate files by setting spark.shuffle.consolidateFiles to true; the number of shuffle files then scales with the number of reduce tasks rather than with map tasks x reduce tasks;
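A sketch of enabling consolidation (this flag applies to the older hash-based shuffle; it was removed in later Spark releases, where sort-based shuffle is the default):

```scala
import org.apache.spark.SparkConf

// Merge per-map-task shuffle outputs into per-core file groups.
val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")
```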
Problem 4: Long serialization time and large serialized results
Solution:
By default Spark uses the JDK's built-in ObjectOutputStream, which produces large serialized output and long CPU processing times. You can set spark.serializer to org.apache.spark.serializer.KryoSerializer instead.
In addition, if a large object must be shared across many tasks, it is best to distribute it as a broadcast variable rather than shipping it inside each task's closure.
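The two fixes above, sketched together. The `loadLookup` helper and the HDFS path are hypothetical placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch to Kryo serialization.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Hypothetical loader for a large read-only lookup table.
def loadLookup(): Map[String, Int] = ???

// Broadcast once instead of serializing the table into every task closure.
val lookupBc = sc.broadcast(loadLookup())

val rdd = sc.textFile("hdfs:///input")           // path is an assumption
val resolved = rdd.map(line => lookupBc.value.getOrElse(line, -1))
```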
Problem 5: Processing a single record is expensive
Solution:
Replace map with mapPartitions: mapPartitions runs once per partition, so per-record setup costs are paid only once per partition, whereas map is invoked for every record in the partition;
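A sketch of amortizing expensive setup with mapPartitions. `createConnection` and `lookup` are hypothetical helpers standing in for any costly per-task resource:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical expensive resource and per-record operation.
def createConnection(): java.sql.Connection = ???
def lookup(conn: java.sql.Connection, r: String): String = ???

def enrich(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { records =>
    val conn = createConnection()   // opened once per partition, not per record
    records.map(r => lookup(conn, r))
  }

// The map equivalent would pay the setup cost for every single record:
// rdd.map { r => val conn = createConnection(); lookup(conn, r) }
```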
Problem 6: collect is slow when the result set is large
Solution:
In the source, collect pulls all results into a single in-memory array on the driver. For large outputs, write the results directly to the distributed file system instead, then inspect the files there;
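A sketch of writing to HDFS instead of collecting to the driver. The output path is an assumption:

```scala
import org.apache.spark.rdd.RDD

def persistResults(results: RDD[String]): Unit = {
  // results.collect() would materialize everything in one driver-side array.
  // Writing to the distributed file system keeps the output on the executors:
  results.saveAsTextFile("hdfs:///output/results")   // path is an assumption
}

// Inspect afterwards from the shell, e.g.:
//   hdfs dfs -cat /output/results/part-* | head
```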
Problem 7: Skewed task execution speed
Solution:
If the data is skewed, the partition key is usually poorly chosen; consider a different partitioning scheme, or add an intermediate aggregation step to reduce the skewed data before the final shuffle. If the skew is on the worker side, for example executors on certain nodes run slowly, set spark.speculation=true so that slow-running tasks are speculatively re-launched on other nodes;
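A sketch of the speculation setting. The tuning knobs shown are optional, and the values are the illustrative defaults:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  // Optional knobs (values shown are defaults, for illustration):
  .set("spark.speculation.multiplier", "1.5")  // how much slower than median counts as "slow"
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks done before checking
```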
Problem 8: Many empty or tiny tasks are generated after a multi-step RDD pipeline
Solution:
Use coalesce or repartition to reduce the number of partitions in the RDD;
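A sketch of compacting partitions after heavy filtering. The target of 100 partitions is an illustrative value:

```scala
import org.apache.spark.rdd.RDD

def compact(bigRdd: RDD[String]): RDD[String] = {
  val filtered = bigRdd.filter(_.nonEmpty)  // may leave many near-empty partitions
  filtered.coalesce(100)                    // merge without a shuffle (100 is illustrative)
  // Use filtered.repartition(100) instead when a full shuffle is acceptable
  // and you want the data rebalanced evenly.
}
```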
Problem 9: Spark Streaming throughput is low
Solution:
Increase spark.streaming.concurrentJobs so that more than one streaming job can run at a time;
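A sketch of the setting. The value 4 is illustrative; tune it against the available cores and remember that concurrent jobs may process batches out of order:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.concurrentJobs", "4")  // 4 is an illustrative value
```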
Problem 10: Spark Streaming suddenly slows down, with frequent task delays and blocking
Solution:
This happens when the job launch interval (batch interval) is set too short, so each job cannot finish before the next one starts; in other words, the batch windows are spaced too densely;
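A sketch of choosing a batch interval. The 10-second value is illustrative; pick an interval larger than the typical per-batch processing time reported in the streaming UI:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-interval")

// Batch interval must exceed typical processing time per batch,
// otherwise batches queue up and tasks are delayed. 10s is illustrative.
val ssc = new StreamingContext(conf, Seconds(10))
```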
Spark Core Secrets #14: Ten major performance-optimization problems in Spark and their solutions