Spark FAQ Rollup

Source: Internet
Author: User
Tags: shuffle, sort

Spark Master and Worker crashes: application recovery issues

First, consider five scenarios:

1. The Spark Master process crashes.

2. The Spark Master crashes during execution.

3. All Spark Workers crash before the task is submitted.

4. A Spark Worker crashes while the application is executing.

5. All Spark Workers crash while the application is executing.


1. The Spark Master process crashes.

The application cannot be submitted at all, so there is no application recovery to consider.

2. The Spark Master crashes during execution.

This does not affect normal execution of the application: the work is done on the Workers, and the results are returned by the Workers directly.

3. All Spark Workers crash before the task is submitted.

The error message is as follows; after the Workers are restarted, the application returns to normal.

17/01/04 19:31:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

4. A Spark Worker crashes while the application is executing.

The error message is as follows:

17/01/04 19:41:50 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

When the remote RPC client disassociates, the driver removes that executor, checks whether any data on the DAG's critical path has been lost, and recomputes whatever was lost. The failed executor is removed from BlockManagerMaster, and its tasks are redistributed to executors on the other Workers.
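A minimal sketch (not from the original post) of one way to soften the recomputation cost described above: persisting with a replicated storage level keeps a second copy of each cached partition, so losing a single executor still leaves one copy available. It assumes an existing SparkContext named sc, and the input path is a placeholder.

import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext named sc; the input path is a placeholder.
// The "_2" storage levels keep each cached partition on two executors, so
// losing one executor does not immediately force the lineage on the critical
// path to be recomputed for those partitions.
val cached = sc.textFile("hdfs:///path/to/input")
  .map(_.toUpperCase)
  .persist(StorageLevel.MEMORY_AND_DISK_2)

cached.count()  // materialize the replicated cache up front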

5. All Spark Workers crash while the application is executing.

The error message is as follows:

17/01/04 19:34:16 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

The application stalls, waiting for a Worker to register.

After a Worker is restarted, the executors re-register:

The unavailable executor is removed and a new one is registered.

CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0

CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.91.128:55126) with ID 1

After the Worker starts, the application returns to normal, but the following error still appears. (This bug was fixed in version 2.1.0.)

org.apache.spark.SparkException: Could not find CoarseGrainedScheduler
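The executor re-registration described above can also be watched from the driver. A small sketch (not from the original write-up), assuming an existing SparkContext named sc:

// Assumes an existing SparkContext named sc.
// getExecutorMemoryStatus maps the "host:port" of each registered block
// manager (the driver plus every live executor) to (max memory, remaining
// memory), so the number of entries shrinks when Workers die and grows back
// when their executors re-register.
def registeredBlockManagers(sc: org.apache.spark.SparkContext): Set[String] =
  sc.getExecutorMemoryStatus.keySet.toSet

while (true) {
  println(s"registered: ${registeredBlockManagers(sc).mkString(", ")}")
  Thread.sleep(10000)  // poll every 10 seconds
}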



1. WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

The cluster's currently available resources cannot satisfy what the application requested. Resources fall into two classes: cores and RAM. Cores represent the executor slots available to run your application; RAM is the free memory required on each worker to execute it. Workaround: do not request more than the free available resources, and shut down applications that have already finished executing.

2. Application isn't using all of the cores: how to set the cores used by a Spark app

Set the cores available to each app in spark-env.sh via spark.deploy.defaultCores, or set spark.cores.max.

3. Spark executor OOM: how to set memory parameters on Spark

OOM means too much is being held in the memory heap.

Solution 1: increase job parallelism, i.e. increase the number of partitions. Splitting a big dataset into smaller pieces reduces the amount of data loaded into memory at any one time; the split is determined by the InputFormat's getSplits.

Solution 2: spark.storage.memoryFraction controls the share of executor memory used for RDD storage versus running tasks. If the shuffle is small and needs little memory, this fraction can be raised. The default is 0.6; it cannot exceed the JVM old generation, and setting it too high is a waste.

Solution 3: spark.executor.memory. If the above is still not enough, increase the executor memory; changing it requires a restart.

4. Shark server / long-running application metadata cleanup

The metadata of a Spark program is kept in memory indefinitely; set spark.cleaner.ttl to prevent OOM. This mainly shows up in Spark Streaming and Shark server. For example:

export SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200"

5. Class not found: classpath issues

Problem 1: a jar is missing from the classpath. Problem 2: jar conflicts, i.e. different versions of the same jar.

Solution 1: package all dependent jars into a fat jar, or manually place the dependency directory on each machine, and register the jar on the SparkConf:

val conf = new SparkConf().setAppName(appName).setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))

Solution 2: put the required jars into the default classpath and distribute them to each worker node.

About performance optimization: the first item is sort-based shuffle, which greatly reduces the memory footprint of the shuffle for very large jobs, allowing more data to be sorted in the available memory. The second is a new Netty-based network module that replaces the original NIO module; it improves network transfer performance and manages its memory outside the JVM's GC, reducing GC frequency. The third is an external shuffle service that runs independently of the Spark executors, so other nodes can still fetch shuffle data from an executor that is in GC, and the network transfers themselves are not affected by GC.

In past competitions the software often could not reach the hardware bottleneck, sometimes using less than 10% of the hardware. This time our entry saturated 3 GB/s of disk bandwidth during the map phase, hitting the limit of the eight SSDs on those VMs, and reached 1.1 GB/s of network utilization during the reduce phase, close to the physical limit.
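As a hedged illustration of the settings mentioned above: spark.cores.max, spark.executor.memory and spark.storage.memoryFraction are standard Spark properties, but the application name, the values and the partition count below are placeholders, not taken from the original.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only (not from the original post): cap the app at
// 4 cores cluster-wide and give each executor 2 GB of heap.
// spark.storage.memoryFraction only applies to the legacy (pre-1.6)
// memory manager discussed in the text.
val conf = new SparkConf()
  .setAppName("spark-faq-demo")
  .set("spark.cores.max", "4")
  .set("spark.executor.memory", "2g")
  .set("spark.storage.memoryFraction", "0.5")

val sc = new SparkContext(conf)

// Item 3, solution 1: more, smaller partitions keep less data in memory
// per task. The path and the partition count of 200 are placeholders.
val lines = sc.textFile("hdfs:///path/to/input", 200)
println(lines.partitions.length)

The external shuffle service mentioned in the performance notes is switched on with the spark.shuffle.service.enabled property set to true (in standalone mode the Worker starts the service when that property is set); it is not shown in the sketch above.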
