Spark JVM Tuning: Executor Off-Heap Memory and Connection Wait Timeout (Spark Performance Optimization)

Source: Internet
Author: User
Tags: shuffle
Executor off-heap memory

Sometimes, when a Spark job processes a particularly large amount of data (hundreds of millions of records), it may run fine one time and then occasionally fail with errors such as shuffle file cannot find, executor lost, task lost, or out of memory (memory overflow).

The likely cause is that the executor's off-heap memory is insufficient, so the executor overflows while running and dies. Tasks in a subsequent stage then try to fetch the shuffle map output files from that executor, but the executor has already gone down and its associated BlockManager is missing, so errors such as shuffle output file not found, resubmitting task, and executor lost appear, and the Spark job eventually collapses completely.

In this case, consider increasing the executor's off-heap memory; that may be enough to avoid the error. In addition, raising the off-heap memory to a larger value can also bring a certain degree of performance improvement.

While an executor is running, it may suddenly exhaust its off-heap memory, hit an OOM, and die. Its BlockManager is then gone and the shuffle data it held is lost.

If at this point an executor from stage0 has died and its BlockManager is gone, a stage1 task can still obtain from the driver the address of the data it needs, but when it actually goes to that executor's BlockManager to fetch the data, it gets nothing. If you run the job (jar) via spark-submit in client mode (standalone client or yarn-client), the log printed on the local machine will show:
Shuffle output file not found ...
The DAGScheduler then keeps resubmitting the task (resubmitting task); after it fails repeatedly several times, the Spark job collapses completely.


By default, this off-heap memory is capped at about 300 MB. In real projects that process large volumes of data, this default often causes problems: the Spark job crashes repeatedly and cannot finish. In that case, raise this parameter to at least 1 GB (1024 MB), or even 2 GB or 4 GB. Raising it usually avoids some JVM OOM problems and, at the same time, gives the whole Spark job a significant performance boost.
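As a minimal sketch, the off-heap memory can be raised with the --conf option on spark-submit; the 2048 MB value, class name, and jar path here are illustrative, borrowed from the example later in this article:

/usr/local/spark/bin/spark-submit \
  --class com.ibeifeng.sparkstudy.WordCount \
  --master yarn-cluster \
  --conf spark.yarn.executor.memoryoverhead=2048 \
  /usr/local/spark/spark.jar

Note that spark.yarn.executor.memoryoverhead applies when running on YARN; a standalone cluster uses a different setting.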



A second problem: sometimes the remote executor gives no response and a network connection cannot be established. Spark's default network connection timeout is 60 s; if the connection cannot be established within 60 s, the fetch is declared failed.

A typical symptom: occasionally an error reports that some file, identified by a long UUID-like string of file IDs (DSFSFD-2342VS--SDF--SDFSD), was not found. The file is lost.

In this case, it is most likely that the executor holding the data is in a JVM GC pause, so the connection cannot be established when you try to pull the data; after the default 60 s, the fetch is declared failed.
If this error occurs several times, with the data fetch failing each time, it may cause the Spark job to collapse. It may also cause the DAGScheduler to resubmit the stage several times and the TaskScheduler to resubmit tasks over and over, greatly prolonging the running time of the Spark job.
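As a sketch, the connection ack timeout can be raised from its 60 s default via --conf; the 300 s value, class name, and jar path are illustrative, taken from the example later in this article:

/usr/local/spark/bin/spark-submit \
  --class com.ibeifeng.sparkstudy.WordCount \
  --master yarn-cluster \
  --conf spark.core.connection.ack.wait.timeout=300 \
  /usr/local/spark/spark.jar

With a longer timeout, a fetch can survive a long GC pause on the remote executor instead of being declared failed after 60 s.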


You can consider adjusting the timeout length of the connection.
/usr/local/spark/bin/spark-submit \
  --class com.ibeifeng.sparkstudy.WordCount \
  --num-executors \
  --driver-memory 6g \
  --executor-memory 6g \
  --executor-cores 3 \
  --master yarn-cluster \
  --queue root.default \
  --conf spark.yarn.executor.memoryoverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  /usr/local/spark/spark.jar \
  ${1}

(The spark.yarn.executor.memoryoverhead setting is for YARN; a standalone cluster does not use it.)
Inside the spark-submit script, you must add these configurations with the --conf option. Pay attention: do not set them in your Spark job code with new SparkConf().set(); setting them that way has no effect. Be sure to set them in the spark-submit script.
