Application recovery when the Spark master or Spark workers go down
First, the five scenarios:
1, The Spark master process is down before the application is submitted.
2, The Spark master goes down while the application is executing.
3, All Spark workers are down before the task is submitted.
4, A Spark worker goes down while the application is executing.
5, All Spark workers go down while the application is executing.
1, The Spark master process is down before the application is submitted.
The application cannot be submitted at all, so application recovery is not a concern.
2, The Spark master goes down while the application is executing.
Normal execution of the application is not affected: the work is done on the workers, and the results are returned directly by the workers.
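The post does not cover it, but if losing the master during execution is still a concern, standalone mode supports standby masters coordinated through ZooKeeper. A sketch of the usual spark-env.sh setup; the ZooKeeper hosts are placeholders:

```shell
# spark-env.sh on every master node -- zk1/zk2 are placeholder hosts
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```

With this, a standby master takes over leadership when the active master dies, and running applications keep working against the new master.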
3, All Spark workers are down before the task is submitted.
The error message is as follows; after the workers are restarted, the application returns to normal.
17/01/04 19:31:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
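This same warning also appears when workers are up but the application asks for more than the cluster has free. A minimal sketch of capping the request so it fits; the values are illustrative, not from the original post:

```properties
# spark-defaults.conf -- illustrative values, adjust to the cluster's free capacity
spark.cores.max        2
spark.executor.memory  512m
```

If the warning persists with workers registered, compare these limits against the free cores and memory shown in the master UI.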
4, A Spark worker goes down while the application is executing.
The error message is as follows:
17/01/04 19:41:50 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
The driver removes the lost RPC client and checks whether the DAG has lost data on its critical path; any lost partitions are recomputed. The failed executor is removed from BlockManagerMaster, and its tasks are redistributed to executors on the other workers.
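How long the driver keeps retrying before giving up is bounded by standard configuration; a hedged sketch (the property name is stock Spark config, the value is illustrative):

```properties
# spark-defaults.conf -- illustrative value
# number of failures of a single task before the job is aborted (default 4)
spark.task.maxFailures  8
```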
5, All Spark workers go down while the application is executing.
The error message is as follows:
17/01/04 19:34:16 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
The application enters a stalled state, waiting for workers to register.
After a worker is restarted, the executors re-register: the unavailable executor is removed and a new one is enrolled.
CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.91.128:55126) with ID 1
After the worker starts, the application returns to normal, but the following error is still logged. (This bug was fixed later, in version 2.1.0.)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler
Common errors and solutions:

1, WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The current cluster cannot satisfy the resources requested by the application. Resources fall into two classes: cores and RAM. Cores are the executor slots available on each worker; RAM is the free memory required to run the application.
Workaround: do not request more resources than the cluster has free, and shut down applications that have already finished executing.

2, Application isn't using all of the cores: how to set the cores used by a Spark app
Set the cores available to each app in spark-env.sh via spark.deploy.defaultCores, or per application via spark.cores.max.

3, Spark executor OOM: how to set memory parameters on Spark
An OOM means too much data is being held in the memory heap.
1) Increase the job's parallelism, i.e. increase the number of partitions. Breaking a large dataset into smaller pieces reduces the amount of data loaded into memory at any one time; this is determined by the InputFormat and getSplits.
2) spark.storage.memoryFraction controls the split of executor memory between cached RDDs and running tasks. If the shuffle is small and needs only a little memory, lower this ratio. The default is 0.6; it should not exceed the old-generation size, and setting it larger than needed is a waste.
3) If that is still not enough, increase spark.executor.memory. Changing executor memory requires a restart.

4, Shark server / long-running application metadata cleanup
The metadata of a Spark program is otherwise kept in memory indefinitely. Set spark.cleaner.ttl to prevent OOM; this mainly appears in Spark Streaming and Shark server.
export SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200"

5, Class not found: classpath issues
Problem 1: a jar is missing from the classpath. Problem 2: jar conflicts, i.e. different versions of the same jar.
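The memory remedies above can be sketched in one SparkConf, assuming the 1.x-era static memory settings the post describes; the app name, path, and values are illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; spark.storage.memoryFraction belongs to the
// legacy (pre-1.6 style) memory manager discussed in the post.
val conf = new SparkConf()
  .setAppName("oom-tuning-sketch")
  .set("spark.executor.memory", "4g")          // remedy 3: larger heap (restart needed)
  .set("spark.storage.memoryFraction", "0.4")  // remedy 2: shrink the RDD cache share
val sc = new SparkContext(conf)
// remedy 1: more partitions => less data per task held in memory at once
val lines = sc.textFile("hdfs:///path/to/input", 400)
```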
Solution 1: package all dependent jars into a fat jar, then manually set the dependency dir specified on each machine:
val conf = new SparkConf().setAppName(appName).setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))
Solution 2: put the required jar packages into the default classpath and distribute them to each worker node.

About performance optimization: the first improvement is sort-based shuffle. This feature greatly reduces the memory footprint of shuffle for very large jobs, letting more memory go to sorting. The second is a new Netty-based network module that replaces the original NIO network module. The new module improves network transfer performance, and it manages its memory outside the JVM's GC, reducing GC frequency. The third is an external shuffle service that is independent of the Spark executor. With it, even while an executor is in GC, other nodes can still fetch shuffle data from it, so network transfer itself is not affected by GC. In some past competitions, the software could not reach the hardware bottleneck, with hardware utilization sometimes below 10%. This time our entry saturated 3 GB/s of disk bandwidth during the map phase, hitting the bottleneck of the eight SSDs on these VMs, and reached 1.1 GB/s network utilization during reduce, close to the physical limit.
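The external shuffle service mentioned above is switched on per worker with a single standard property; a minimal sketch:

```properties
# spark-defaults.conf -- serve shuffle files from a process outside the executor,
# so other nodes can fetch shuffle data even while that executor is in GC
spark.shuffle.service.enabled  true
```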