Adjusting executor off-heap memory
Sometimes, if your Spark job processes a particularly large amount of data (hundreds of millions of records), a job that normally runs fine will occasionally fail with errors such as "shuffle file cannot find", "executor lost", "task lost", or "out of memory" (memory overflow).
The likely cause is that the executor's off-heap memory is insufficient, so the executor overflows while running. Tasks in a subsequent stage then try to fetch the shuffle map output files from that executor, but the executor has already died and its associated BlockManager is gone, so you may see "shuffle output file not found", "resubmitting task", and "executor lost", and the Spark job collapses completely.
In this case, you can consider increasing the executor's off-heap memory, which may avoid the error. In addition, raising off-heap memory to a fairly large value also brings a certain degree of performance improvement.
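To make the two memory regions concrete, here is a minimal sketch of the relevant spark-submit flags (YARN mode assumed; the jar name and the values are illustrative placeholders, not this article's actual job):

```shell
# Sketch only: --executor-memory sizes the executor's JVM heap, while
# spark.yarn.executor.memoryOverhead (the YARN-mode property, in MB) sizes
# the off-heap portion of the container discussed in this section.
/usr/local/spark/bin/spark-submit \
  --master yarn-cluster \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your-app.jar
```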
While running, the executor suddenly runs out of memory: off-heap memory is insufficient, it may OOM, and it dies. Its BlockManager is then gone, and the data it held is lost.
If at this point stage 0's executor is dead and its BlockManager is gone, then when a task in one of stage 1's executors asks the driver for the location of the data it needs and goes to the dead executor's BlockManager to fetch it, the data cannot be retrieved. If you ran the job (jar) with spark-submit, the client (standalone client or yarn-client) will print a log on the local machine:
Shuffle output file not found ...
The DAGScheduler resubmits the task, and it keeps failing. After failing repeatedly and reporting the error several times, the whole Spark job crashes.
By default, this off-heap memory limit is around 300 MB. Later, in real projects that process genuinely large data, this causes problems: the Spark job crashes repeatedly and cannot run. At that point, raise this parameter to at least 1 GB (1024 MB), or even 2 GB or 4 GB. Raising this parameter usually avoids certain JVM OOM problems and at the same time gives the overall Spark job a noticeable performance boost.
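The property behind this adjustment is spark.yarn.executor.memoryOverhead (the YARN-mode name; the value is in MB), passed on the spark-submit command line. A sketch:

```shell
# Fragment of a spark-submit call: raise executor off-heap memory.
# 1024, 2048, or 4096 correspond to the 1G / 2G / 4G suggested above.
--conf spark.yarn.executor.memoryOverhead=2048
```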
Another situation: occasionally you see that some file is not found, file lost, where the file name is a long string of IDs, a UUID (DSFSFD-2342VS--SDF--SDFSD).
In this case, it is most likely that the executor holding that data is in the middle of a JVM GC. So when you try to pull the data, no response comes back and the network connection cannot be established. Spark's default network connection timeout is 60 s; if the connection cannot be established within 60 s, the fetch is declared a failure.
If this error occurs several times, with the data fetch failing each time, the Spark job may collapse. It may also cause the DAGScheduler to resubmit the stage several times and the TaskScheduler to resubmit the tasks several times, which greatly prolongs the running time of our Spark job.
In that case, you can consider increasing the connection wait timeout.
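The timeout in question is spark.core.connection.ack.wait.timeout. Extending it from the default 60 s to, say, 300 s gives a remote executor stuck in a long GC pause time to respond before the fetch gives up (the value here is illustrative):

```shell
# Fragment of a spark-submit call: wait up to 300 s (instead of 60 s) for an
# ack, so a fetch does not fail just because the remote JVM is in a GC pause.
--conf spark.core.connection.ack.wait.timeout=300
```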
/usr/local/spark/bin/spark-submit \
  --class com.ibeifeng.sparkstudy.WordCount \
  --num-executors \
  --driver-memory 6g \
  --executor-memory 6g \
  --executor-cores 3 \
  --master yarn-cluster \
  --queue root.default \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  /usr/local/spark/spark.jar \
  ${1}

(The memoryOverhead property shown is the YARN one; a cluster not running on YARN would be standalone, where the setting differs.)
Inside the spark-submit script, you must add this configuration with the --conf option. Pay attention: do not set it in your Spark job code with new SparkConf().set() — setting it that way is useless. Be sure to set it in the spark-submit script.