Error message:
WARN TaskSetManager: Lost task 132.0 in stage 2.0 (TID 5951, spark047207): java.io.FileNotFoundException: /data1/spark/tmp/blockmgr-5363024d-29a4-4f6f-bf87-127b95669c7c/1c/temp_shuffle_7dad1a33-286f-47d2-8506-da0a02e22c10
On Spark 1.6, Mesos coarse-grained mode was used with MesosExternalShuffleService enabled in order to use dynamic executor allocation. It turned out that once a job had been running for more than about 2 minutes, it failed with the shuffle-file-not-found error (java.io.FileNotFoundException).
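As a rough sketch of the setup described above, the snippet below collects the settings that, to my knowledge, enable this combination in Spark 1.6 and renders them as spark-submit arguments. The helper name to_submit_args is made up for illustration, and the snippet assumes the external shuffle service is already running on every Mesos agent.

```python
# Settings that enable dynamic allocation on Mesos coarse-grained mode.
# Assumption: the external shuffle service is already running on each agent.
DYNAMIC_ALLOCATION_CONF = {
    "spark.mesos.coarse": "true",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments.

    Hypothetical helper, not part of Spark.
    """
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(DYNAMIC_ALLOCATION_CONF)))
```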
The Spark driver connects to MesosExternalShuffleService through MesosExternalShuffleClient, and when that connection is closed, the service clears all data related to that driver.
In this version, regardless of whether the driver is still alive, the MesosExternalShuffleClient connection is dropped after spark.shuffle.io.connectionTimeout (or spark.network.timeout) because it is idle. The shuffle files are then deleted as well.
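The behaviour above can be illustrated with a toy model. This is only a sketch, not Spark's code: ToyShuffleService and effective_idle_timeout_s are hypothetical names, timeouts are plain numbers of seconds rather than strings such as "120s", and the 120-second fallback is the documented default of spark.network.timeout, which would match the 2 minutes observed.

```python
DEFAULT_NETWORK_TIMEOUT_S = 120  # documented default of spark.network.timeout


def effective_idle_timeout_s(conf):
    """Resolve the shuffle connection idle timeout, in seconds.

    spark.shuffle.io.connectionTimeout wins if set; otherwise fall back to
    spark.network.timeout, then to the default.
    """
    network = conf.get("spark.network.timeout", DEFAULT_NETWORK_TIMEOUT_S)
    return conf.get("spark.shuffle.io.connectionTimeout", network)


class ToyShuffleService:
    """Hypothetical stand-in for the shuffle service; not Spark's actual code."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.last_seen = {}  # driver id -> time of last message on its connection
        self.files = {}      # driver id -> shuffle files kept for that driver

    def register(self, driver_id, now):
        self.last_seen[driver_id] = now
        self.files.setdefault(driver_id, set())

    def add_file(self, driver_id, name):
        self.files[driver_id].add(name)

    def tick(self, now):
        """Close idle connections and delete the files of the affected drivers."""
        for driver_id, last in list(self.last_seen.items()):
            if now - last >= self.idle_timeout_s:
                del self.last_seen[driver_id]
                self.files.pop(driver_id, None)

    def open_file(self, driver_id, name):
        if name not in self.files.get(driver_id, set()):
            raise FileNotFoundError(name)
        return name


if __name__ == "__main__":
    service = ToyShuffleService(effective_idle_timeout_s({}))
    service.register("driver-1", now=0)
    service.add_file("driver-1", "temp_shuffle_a")
    service.tick(now=60)   # still within the timeout, file kept
    service.tick(now=121)  # idle too long: connection dropped, file deleted
    try:
        service.open_file("driver-1", "temp_shuffle_a")
    except FileNotFoundError as err:
        print("lost shuffle file:", err)
```

Note that the driver in this model never exits. The file disappears purely because the connection sat idle, which is the point of the bug.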
MesosExternalShuffleService is therefore unusable in the current version, so the dynamic executor release feature cannot be used.
This issue is expected to be fixed in Spark 2.0:
https://issues.apache.org/jira/browse/SPARK-12583
Spark 1.6 also has another bug: when an executor uses too much memory it is killed outright, so the job reports a lost executor instead of an error indicating OOM.