Spark's shuffle uses Netty for network transmission under the hood, and Netty allocates off-heap (direct) memory during transmission, so the shuffle path consumes off-heap memory.
In some cases you therefore need to increase the executor's off-heap memory size.
Typical exceptions when it is too small:
shuffle file cannot find, executor lost, task lost, out of memory
This problem generally arises in two situations:
1. The executor has died, so the BlockManager on that executor is gone as well; the shuffle map output files can no longer be found, and the reduce side cannot pull its data.
2. The executor has not died, but something went wrong while it was pulling data.
In either case, consider increasing the executor's off-heap memory. That may be enough to avoid the error; in addition, when the off-heap memory is reasonably large, it can also bring some performance improvement.
Here is how it plays out: an executor is running along, suddenly runs short of off-heap memory, possibly hits an OOM, and dies. Its BlockManager dies with it, and the data it held is lost.
If at that point an executor from Stage 0 is dead and its BlockManager is gone, then a task in a Stage 1 executor gets the address of its input data from the driver's MapOutputTracker, goes to the other executor's BlockManager to fetch it, and finds nothing there.
If the job (JAR) was submitted with spark-submit in client mode (standalone-client or yarn-client), the log is printed on the submitting machine:
shuffle output file not found ...
The DAGScheduler resubmits the task, it fails again; after the error has repeated several times, the whole Spark job collapses.
--conf spark.yarn.executor.memoryOverhead=2048
Add this configuration with --conf inside the spark-submit script, and pay close attention: do not set it in your Spark job code with new SparkConf().set() - set that way, it has no effect. It must be set in the spark-submit script.
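As a sketch, a spark-submit script carrying this flag might look like the following; the class name, jar path, and resource sizes are illustrative placeholders, not values from the original text:

```shell
# Hypothetical spark-submit script; only the memoryOverhead flag is the
# point here - class name, jar path, and resource sizes are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 50 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.MySparkJob \
  my-spark-job.jar
```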
spark.yarn.executor.memoryOverhead (as the name suggests, this key applies to the YARN submission mode)
By default, this off-heap memory is limited to 10% of each executor's memory (with a floor of 384 MB). In real projects, when we actually process big data, this is exactly where problems appear, causing the Spark job to crash repeatedly and refuse to run; the parameter is then raised to at least 1 GB (1024 MB), or even 2 GB or 4 GB.
Raising this parameter usually avoids some JVM OOM problems and, at the same time, can give the whole Spark job's performance a noticeable boost.
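To make the default concrete, here is a small shell calculation of the 10%-with-384 MB-floor rule for a hypothetical 6 GiB executor; the executor size is an assumption for illustration:

```shell
# Default off-heap overhead: max(10% of executor memory, 384 MB).
# Hypothetical executor size of 6 GiB (6144 MiB).
EXECUTOR_MEMORY_MB=6144
OVERHEAD_MB=$(( EXECUTOR_MEMORY_MB / 10 ))   # 10% -> 614 MB
if [ "$OVERHEAD_MB" -lt 384 ]; then
  OVERHEAD_MB=384                            # floor of 384 MB
fi
echo "default memoryOverhead = ${OVERHEAD_MB} MB"
```

The result, roughly 614 MB here, is well below the 1024-4096 MB range recommended above, which is why the default so often proves too small on large datasets.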
Adjusting the connection wait timeout.
An executor first tries to fetch a block from its local BlockManager. If the local BlockManager does not have it, it uses the TransferService to connect remotely to the BlockManager of the executor on another node, attempting to establish a network connection and pull the data.
Meanwhile, tasks on the remote side may be creating objects that are very large and very numerous, frequently filling the JVM heap and triggering garbage collection. The fetch may land exactly while that executor's JVM is in garbage collection - a classic JVM tuning concern.
While garbage collection is in progress, all worker threads stop; the Spark executor stops working and cannot provide a response.
With no response, no network connection can be established. Spark's default connection timeout is 60s;
if the connection cannot be established within 60s, it is declared failed.
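The failure mode is plain timeout semantics, nothing Spark-specific. This shell sketch scales it down, using the coreutils timeout command in place of Spark's 60s connection timeout and a sleep in place of the GC-paused peer (the 1s/5s durations are arbitrary):

```shell
# A "peer" stuck in a long GC pause (simulated by sleep) never responds,
# so the caller gives up when its deadline (scaled down to 1s) expires.
timeout 1 sleep 5
echo "exit code: $?"   # GNU timeout exits with 124 when the deadline hits
```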
You then run into a situation that happens only occasionally, with no pattern at all: some file - a long file ID, a UUID (DSFSFD-2342VS--SDF--SDFSD) - not found, file lost.
In this case, it is very likely that the executor holding that data was in a JVM GC, so the connection could not be established when the data was pulled; after the default 60s, it was simply declared failed.
After this error occurs a few times, a few data pulls fail, and the Spark job may collapse. It may also cause the DAGScheduler to resubmit stages repeatedly, and the TaskScheduler to resubmit tasks over and over, greatly prolonging the running time of the Spark job.
You can consider increasing the connection timeout.
--conf spark.core.connection.ack.wait.timeout=300
Set this in the spark-submit script; remember, it is not set with new SparkConf().set().
spark.core.connection.ack.wait.timeout (spark core, connection, ack, wait timeout: how long to wait before an unestablished connection times out).
Raising this value usually avoids some of these occasional file-pull failures, the "so-and-so file lost" errors.
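Putting the two settings from this section together, a spark-submit script might carry both flags; as before, everything apart from the two --conf entries is an illustrative placeholder:

```shell
# Hypothetical script combining both tuning flags from this section;
# class name, jar path, and resource sizes are placeholders only.
spark-submit \
  --master yarn \
  --num-executors 50 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  --class com.example.MySparkJob \
  my-spark-job.jar
```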
Why discuss these two parameters here?
Because in practice, when really processing large data - not tens of millions or millions of rows, but hundreds of millions, billions, tens of billions - it is easy to run into executor off-heap memory problems and GC-induced connection timeouts:
file not found, executor lost, task lost.
Adjusting the two parameters above is quite helpful in those situations.