Spark Performance Tuning: Adjust Executor Off-Heap Memory

Source: Internet
Author: User
Tags: ack, shuffle
Adjust Executor Off-Heap Memory


Spark's underlying shuffle transport uses Netty. During network transfers Netty allocates memory outside the JVM heap, so the shuffle makes use of off-heap memory.


When do you need to adjust the executor's off-heap memory size?
When exceptions such as these occur:
shuffle file cannot find, executor lost, task lost, out of memory


There are generally two situations in which this problem occurs:
1. The executor has died, and the block manager on that executor has died with it, so the corresponding shuffle map output files cannot be found and the reduce side cannot pull the data.
2. The executor did not die, but a problem occurred while it was pulling the data.




In these cases you can consider increasing the executor's off-heap memory. It may be enough to avoid the error; in addition, when the off-heap memory is relatively large,
it sometimes also brings some improvement in performance.


While an executor is running, it may suddenly run out of memory: the off-heap memory is not enough, the executor may OOM and die. Its block manager is then gone as well,
and the data it held is lost.


If at this point a Stage 0 executor has died and its block manager is gone, then a Stage 1 task can still obtain the address of its input data
from the driver's MapOutputTracker, but when it actually goes to the other executor's block manager to fetch the data,
it cannot get it.


At this point, if the job (JAR) was submitted with spark-submit in client mode (standalone client or yarn-client),
the log is printed on the submitting machine:


shuffle output file not found ...
The DAGScheduler keeps resubmitting the task; after it fails repeatedly and the error is reported several times,


the whole Spark job crashes.


--conf spark.yarn.executor.memoryOverhead=2048


Add this configuration in the spark-submit script using --conf. Pay attention, and remember:
do not set it in your Spark job code with new SparkConf().set(); set that way, it has no effect.
Be sure to set it in the spark-submit script.
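
For example, a spark-submit invocation might look like the sketch below; the class name, JAR path, and resource sizes are placeholders chosen for illustration, not values from the original job:

spark-submit \
  --class com.example.MySparkJob \
  --master yarn \
  --deploy-mode client \
  --num-executors 20 \
  --executor-memory 6g \
  --executor-cores 3 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  /path/to/my-spark-job.jar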


spark.yarn.executor.memoryOverhead (as the name suggests, this parameter applies to the YARN submission mode)


By default, this off-heap memory limit is 10% of each executor's memory (with a floor of 384 MB). In real projects, when we actually process big data,
problems tend to appear here and cause the Spark job to crash repeatedly and fail to run; in that case, raise this parameter to at least 1 GB (1024 MB),
or even 2 GB or 4 GB.
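
As a worked example with assumed numbers: for an executor launched with --executor-memory 6g, the default overhead is max(384 MB, 10% × 6144 MB) ≈ 614 MB. Setting spark.yarn.executor.memoryOverhead=2048 raises the off-heap allowance to 2 GB, so the container YARN requests for each executor grows to roughly 6 GB + 2 GB = 8 GB.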


Raising this parameter usually avoids some of these JVM OOM problems, and at the same time can give the overall performance of the Spark job
a noticeable boost.


Adjust the Connection Wait Timeout


When an executor needs a block of data, it first tries to get it from its own locally associated BlockManager.


If the local BlockManager does not have the block, the executor uses the TransferService to connect remotely to the BlockManager
of an executor on another node and fetch it from there.


That is, it attempts to establish a remote network connection and pull the data.






Meanwhile, the tasks running on that remote executor may be creating objects that are very large and very numerous,


which frequently fills the JVM heap and triggers garbage collection.
The fetch may then arrive just while that executor's JVM is in the middle of garbage collection.


JVM Tuning: Garbage collection


While garbage collection is in progress, all worker threads stop;
the Spark executor stops working and cannot respond.


At that moment there is no response and the network connection cannot be established. Spark's default connection timeout is 60 s;
if the connection cannot be established within 60 s, it fails.
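
To make the timing concrete (the GC duration here is an assumed number for illustration): if a full GC on the remote executor takes 90 s, a fetch that starts just as the pause begins gives up after the default 60 s and fails; with the timeout raised to 300 s, the executor comes out of GC well before the deadline and the fetch can still succeed.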


You then occasionally run into a situation, with no obvious pattern: some file, identified by a long string of IDs, a
UUID (DSFSFD-2342VS--SDF--SDFSD), is not found. The file is lost.


In this case, it is very likely that the executor holding that data is in a JVM GC, so when you try to pull the data the connection cannot be established;
after the default 60 s it is simply declared failed.


After the error occurs a number of times, and the data fails to be pulled several times, the Spark job may crash. It may also cause the DAGScheduler
to resubmit stages and the TaskScheduler to resubmit tasks over and over again, greatly prolonging the running time of our Spark job.


In this case, you can consider increasing the connection timeout.


--conf spark.core.connection.ack.wait.timeout=300


As before, set this in the spark-submit script; do not set it with new SparkConf().set() in your job code.


spark.core.connection.ack.wait.timeout (Spark core, connection, ack, wait timeout:
how long to wait for an acknowledgement when the connection cannot be established)
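
As a sketch, the two settings discussed in this article can sit together in the same spark-submit script; the class name, JAR path, and executor sizing below are the same illustrative placeholders as in the earlier example:

spark-submit \
  --class com.example.MySparkJob \
  --master yarn \
  --deploy-mode client \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  /path/to/my-spark-job.jar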


Setting this value relatively large can usually avoid some of the occasional file-pull failures and "such-and-such file lost" errors.


Why are we talking about these two parameters here?


Because in practice, when you really process large volumes of data (not millions or tens of millions of records, but hundreds of millions, billions, or tens of billions),
it is easy to run into executor off-heap memory problems and GC-induced connection timeouts:
file not found, executor lost, task lost.


It is also helpful to adjust the above two parameters.
