Spark's shuffle uses Netty for network transmission under the hood, and Netty allocates off-heap (direct) memory during transmission, so the shuffle path consumes off-heap memory.
In some cases you therefore need to increase the executor's off-heap memory size.
Typical exceptions when it is too small:
shuffle file cannot find, executor lost, task lost, out of memory
This problem generally arises in two situations:
1. The executor has died, so the BlockManager on that executor is gone as well; the shuffle map output files can no longer be found, and the reduce side cannot pull its data.
2. The executor has not died, but something went wrong while it was pulling data.
In either case, consider increasing the executor's off-heap memory. That may be enough to avoid the error; in addition, when the off-heap memory is reasonably large, it can also bring some performance improvement.
Here is how it plays out: an executor is running along, suddenly runs short of off-heap memory, possibly hits an OOM, and dies. Its BlockManager dies with it, and the data it held is lost.
If at that point an executor from Stage 0 is dead and its BlockManager is gone, then a task in a Stage 1 executor gets the address of its input data from the driver's MapOutputTracker, goes to the other executor's BlockManager to fetch it, and finds nothing there.
If the job (JAR) was submitted with spark-submit in client mode (standalone-client or yarn-client), the log is printed on the submitting machine:
shuffle output file not found ...
The DAGScheduler resubmits the task, it fails again; after the error has repeated several times, the whole Spark job collapses.
--conf spark.yarn.executor.memoryOverhead=2048
Add this configuration with --conf inside the spark-submit script, and pay close attention: do not set it in your Spark job code with new SparkConf().set() - set that way, it has no effect. It must be set in the spark-submit script.
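As a sketch, a spark-submit script carrying this flag might look like the following; the class name, jar path, and resource sizes are illustrative placeholders, not values from the original text:

```shell
# Hypothetical spark-submit script; only the memoryOverhead flag is the
# point here - class name, jar path, and resource sizes are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 50 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.MySparkJob \
  my-spark-job.jar
```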
spark.yarn.executor.memoryOverhead (as the name suggests, this key applies to the YARN submission mode)
By default, this off-heap memory is limited to 10% of each executor's memory (with a floor of 384 MB). In real projects, when we actually process big data, this is exactly where problems appear, causing the Spark job to crash repeatedly and refuse to run; the parameter is then raised to at least 1 GB (1024 MB), or even 2 GB or 4 GB.
Raising this parameter usually avoids some JVM OOM problems and, at the same time, can give the whole Spark job's performance a noticeable boost.
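To make the default concrete, here is a small shell calculation of the 10%-with-384 MB-floor rule for a hypothetical 6 GiB executor; the executor size is an assumption for illustration:

```shell
# Default off-heap overhead: max(10% of executor memory, 384 MB).
# Hypothetical executor size of 6 GiB (6144 MiB).
EXECUTOR_MEMORY_MB=6144
OVERHEAD_MB=$(( EXECUTOR_MEMORY_MB / 10 ))   # 10% -> 614 MB
if [ "$OVERHEAD_MB" -lt 384 ]; then
  OVERHEAD_MB=384                            # floor of 384 MB
fi
echo "default memoryOverhead = ${OVERHEAD_MB} MB"
```

The result, roughly 614 MB here, is well below the 1024-4096 MB range recommended above, which is why the default so often proves too small on large datasets.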
Adjusting the connection wait timeout.
An executor first tries to fetch a block from its local BlockManager. If the local BlockManager does not have it, it uses the TransferService to connect remotely to the BlockManager of the executor on another node, attempting to establish a network connection and pull the data.
Meanwhile, tasks on the remote side may be creating objects that are very large and very numerous, frequently filling the JVM heap and triggering garbage collection. The fetch may land exactly while that executor's JVM is in garbage collection - a classic JVM tuning concern.
While garbage collection is in progress, all worker threads stop; the Spark executor stops working and cannot provide a response.
With no response, no network connection can be established. Spark's default connection timeout is 60s;
if the connection cannot be established within 60s, it is declared failed.
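The failure mode is plain timeout semantics, nothing Spark-specific. This shell sketch scales it down, using the coreutils timeout command in place of Spark's 60s connection timeout and a sleep in place of the GC-paused peer (the 1s/5s durations are arbitrary):

```shell
# A "peer" stuck in a long GC pause (simulated by sleep) never responds,
# so the caller gives up when its deadline (scaled down to 1s) expires.
timeout 1 sleep 5
echo "exit code: $?"   # GNU timeout exits with 124 when the deadline hits
```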
You then run into a situation that happens only occasionally, with no pattern at all: some file - a long file ID, a UUID (DSFSFD-2342VS--SDF--SDFSD) - not found, file lost.
In this case, it is very likely that the executor holding that data was in a JVM GC, so the connection could not be established when the data was pulled; after the default 60s, it was simply declared failed.
After this error occurs a few times, a few data pulls fail, and the Spark job may collapse. It may also cause the DAGScheduler to resubmit stages repeatedly, and the TaskScheduler to resubmit tasks over and over, greatly prolonging the running time of the Spark job.
You can consider increasing the connection timeout.
--conf spark.core.connection.ack.wait.timeout=300
Set this in the spark-submit script; remember, it is not set with new SparkConf().set().
spark.core.connection.ack.wait.timeout (spark core, connection, ack, wait timeout: how long to wait before an unestablished connection times out).
Raising this value usually avoids some of these occasional file-pull failures, the "so-and-so file lost" errors.
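Putting the two settings from this section together, a spark-submit script might carry both flags; as before, everything apart from the two --conf entries is an illustrative placeholder:

```shell
# Hypothetical script combining both tuning flags from this section;
# class name, jar path, and resource sizes are placeholders only.
spark-submit \
  --master yarn \
  --num-executors 50 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  --class com.example.MySparkJob \
  my-spark-job.jar
```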
Why discuss these two parameters here?
Because in practice, when really processing large data - not tens of millions or millions of rows, but hundreds of millions, billions, tens of billions - it is easy to run into executor off-heap memory problems and GC-induced connection timeouts:
file not found, executor lost, task lost.
Adjusting the two parameters above is quite helpful in those situations.