Adjusting executor off-heap memory
Sometimes, if your Spark job processes a particularly large amount of data (hundreds of millions of records), a job that normally runs fine will occasionally fail with errors such as "shuffle file cannot find", "executor lost", "task lost", or "out of memory" (memory overflow).
The likely cause is that the executor's off-heap memory is insufficient, so the executor overflows while running. Tasks in a subsequent stage then try to fetch the shuffle map output files from that executor, but the executor has already died and its associated BlockManager is gone, so you may see "shuffle output file not found", "resubmitting task", and "executor lost", and the Spark job collapses completely.
In this case, you can consider increasing the executor's off-heap memory, which may avoid the error. In addition, raising off-heap memory to a fairly large value also brings a certain degree of performance improvement.
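To make the two memory regions concrete, here is a minimal sketch of the relevant spark-submit flags (YARN mode assumed; the jar name and the values are illustrative placeholders, not this article's actual job):

```shell
# Sketch only: --executor-memory sizes the executor's JVM heap, while
# spark.yarn.executor.memoryOverhead (the YARN-mode property, in MB) sizes
# the off-heap portion of the container discussed in this section.
/usr/local/spark/bin/spark-submit \
  --master yarn-cluster \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your-app.jar
```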
While running, the executor suddenly runs out of memory: off-heap memory is insufficient, it may OOM, and it dies. Its BlockManager is then gone, and the data it held is lost.
If at this point stage 0's executor is dead and its BlockManager is gone, then when a task in one of stage 1's executors asks the driver for the location of the data it needs and goes to the dead executor's BlockManager to fetch it, the data cannot be retrieved. If you ran the job (jar) with spark-submit, the client (standalone client or yarn-client) will print a log on the local machine:
Shuffle output file not found ...
The DAGScheduler resubmits the task, and it keeps failing. After failing repeatedly and reporting the error several times, the whole Spark job crashes.
By default, this off-heap memory limit is around 300 MB. Later, in real projects that process genuinely large data, this causes problems: the Spark job crashes repeatedly and cannot run. At that point, raise this parameter to at least 1 GB (1024 MB), or even 2 GB or 4 GB. Raising this parameter usually avoids certain JVM OOM problems and at the same time gives the overall Spark job a noticeable performance boost.
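The property behind this adjustment is spark.yarn.executor.memoryOverhead (the YARN-mode name; the value is in MB), passed on the spark-submit command line. A sketch:

```shell
# Fragment of a spark-submit call: raise executor off-heap memory.
# 1024, 2048, or 4096 correspond to the 1G / 2G / 4G suggested above.
--conf spark.yarn.executor.memoryOverhead=2048
```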
Another situation: occasionally you see that some file is not found, file lost, where the file name is a long string of IDs, a UUID (DSFSFD-2342VS--SDF--SDFSD).
In this case, it is most likely that the executor holding that data is in the middle of a JVM GC. So when you try to pull the data, no response comes back and the network connection cannot be established. Spark's default network connection timeout is 60 s; if the connection cannot be established within 60 s, the fetch is declared a failure.
If this error occurs several times, with the data fetch failing each time, the Spark job may collapse. It may also cause the DAGScheduler to resubmit the stage several times and the TaskScheduler to resubmit the tasks several times, which greatly prolongs the running time of our Spark job.
In that case, you can consider increasing the connection wait timeout.
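The timeout in question is spark.core.connection.ack.wait.timeout. Extending it from the default 60 s to, say, 300 s gives a remote executor stuck in a long GC pause time to respond before the fetch gives up (the value here is illustrative):

```shell
# Fragment of a spark-submit call: wait up to 300 s (instead of 60 s) for an
# ack, so a fetch does not fail just because the remote JVM is in a GC pause.
--conf spark.core.connection.ack.wait.timeout=300
```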
/usr/local/spark/bin/spark-submit \
  --class com.ibeifeng.sparkstudy.WordCount \
  --num-executors \
  --driver-memory 6g \
  --executor-memory 6g \
  --executor-cores 3 \
  --master yarn-cluster \
  --queue root.default \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  /usr/local/spark/spark.jar \
  ${1}

(The memoryOverhead property shown is the YARN one; a cluster not running on YARN would be standalone, where the setting differs.)
Inside the spark-submit script, you must add this configuration with the --conf option. Pay attention: do not set it in your Spark job code with new SparkConf().set() — setting it that way is useless. Be sure to set it in the spark-submit script.