This article describes the main steps for migrating jobs from Spark standalone mode to Spark on YARN.
1. Code recompilation
The previous Spark standalone project was built against Spark 1.5.2, while Spark on YARN runs Spark 2.0.1, so the original code needs to be recompiled; we recommend using Maven to build the project. The pom.xml dependencies below will automatically pull in the jar packages required for the currently deployed version.
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-reflect</artifactId>
        <version>2.11.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.scalikejdbc</groupId>
        <artifactId>scalikejdbc_2.11</artifactId>
        <version>2.2.1</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.35</version>
    </dependency>
</dependencies>
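Besides updating the dependencies, recompiling against 2.0.1 usually also means touching the job's entry point, since Spark 2.x consolidates the old SparkContext/SQLContext behind SparkSession. A minimal sketch of what that can look like (class name, app name and query are placeholders, not taken from the original project):

import org.apache.spark.sql.SparkSession

object ExampleJob {
  def main(args: Array[String]): Unit = {
    // Spark 2.x entry point: SparkSession wraps the old SparkContext/SQLContext
    val spark = SparkSession.builder()
      .appName("example-job")
      .getOrCreate()

    val sc = spark.sparkContext              // still available for existing RDD code
    val df = spark.sql("SELECT 1 AS ok")     // DataFrame/SQL code goes through the session
    df.show()

    spark.stop()
  }
}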
2. Parameter change
Because Spark standalone and Spark on YARN use different resource management mechanisms, some parameters need to be adjusted for jobs to run on Spark on YARN. Only the parts that need to be modified are described here; parameters that are not mentioned do not need to be changed.
--master
Change this to yarn. In addition, the --deploy-mode parameter can be either client or cluster. The difference is that in client mode the driver runs on the machine that submits the job, and the job's logs are printed directly on that client, which makes debugging easier; in cluster mode the driver is placed on an arbitrary machine, which suits stable production runs once debugging is finished. (The code-side counterpart to this is sketched right after this parameter list.)
--num-executors
Discard the --total-executor-cores flag used in standalone mode and use --num-executors instead.
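As mentioned under --master, the code side of these parameter changes is mostly about what not to do. A minimal fragment (runnable in spark-shell), assuming the standalone job hard-coded its master URL or core count; the values below are hypothetical:

import org.apache.spark.SparkConf

// Standalone-era jobs often pinned these inside the jar:
//   new SparkConf().setMaster("spark://<standalone-master>:7077").set("spark.cores.max", "200")
// For Spark on YARN, leave them out so --master yarn, --deploy-mode and
// --num-executors from the submit script take effect:
val conf = new SparkConf().setAppName("yarn-ready-job")   // no setMaster, no spark.cores.max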
3. Task execution directory
The Spark 2.0.1 client is installed on hadoop001 under /opt/spark2. To keep the migration smooth, the client path of the original Spark 1.5.2 installation is left unchanged; we suggest running new jobs on YARN.
4. Job history queries
The Spark job history server is deployed on hadoop003, so the history UI address is hadoop003.dx.momo.com:18080.
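Note that a job only shows up in the history server if event logging is enabled. This is normally configured once in spark-defaults.conf on the client; the sketch below shows the equivalent settings on the session builder, with a placeholder log directory (it must match whatever directory the history server on hadoop003 actually reads from):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("history-enabled-job")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-history")   // placeholder, confirm with the cluster admins
  .getOrCreate()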
As a comparison, here are the submission scripts for the same job in standalone mode and in Spark on YARN mode:
- Standalone mode
bin/spark-submit --class spark.mllib.als --master spark://spark002.dx.momo.com:6066 --deploy-mode cluster --executor-memory 4g --total-executor-cores 200 --queue data
- Spark on YARN mode
bin/spark-submit --class spark.mllib.als --master yarn --deploy-mode cluster --executor-memory 4g --num-executors 200 --queue data
5. Common Spark on YARN interfaces
Below are a few simple Spark on YARN interfaces that make it easier to learn and use Spark.
- spark-shell: the most commonly used way to debug jobs; convenient for debugging and tuning code
bin/spark-shell --master yarn --deploy-mode client --executor-memory 2g --num-executors 20 --executor-cores 1 --queue data
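For example, a quick interactive check inside spark-shell might look like the following (the HDFS path is a placeholder; the `spark` session is predefined by the shell):

val lines = spark.read.textFile("hdfs:///tmp/sample.log")       // Dataset[String]
println(lines.count())                                          // total line count
lines.filter(_.contains("ERROR")).show(10, truncate = false)    // peek at error lines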
- spark-sql: commonly used for debugging SQL statements
bin/spark-sql --master yarn --deploy-mode client --executor-memory 2g --num-executors 20 --executor-cores 1 --queue data
6. Possible errors after migration
Container killed by YARN for exceeding memory limits. 3.0 GB of 3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
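Before the analysis below, here is a rough illustration of the memory accounting behind this message, assuming Spark 2.0's default overhead rule of max(384 MB, 10% of executor memory); the figures are hypothetical, not taken from the failing job:

// Rough sketch of how the executor container request is sized on YARN
val executorMemoryMb  = 2048                                        // e.g. --executor-memory 2g
val defaultOverheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)
println(s"default container request ~ ${executorMemoryMb + defaultOverheadMb} MB")   // 2432 MB

// Raising spark.yarn.executor.memoryOverhead (or --executor-memory) lifts the limit
// that YARN enforces when it kills containers for exceeding physical memory:
val boostedOverheadMb = 1024
println(s"boosted container request ~ ${executorMemoryMb + boostedOverheadMb} MB")   // 3072 MB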
Analysis: because YARN monitors the resource usage of each Spark container (standalone mode does not), this error is likely to appear once a job has been migrated. The fix is to increase spark.yarn.executor.memoryOverhead or the executor memory (spark.yarn.executor.memoryOverhead defaults to 10% of the executor memory). Add --conf spark.yarn.executor.memoryOverhead=* (for example 1024) to the job submission script, or increase --executor-memory, to make sure the job runs reliably in the new environment.
The above is a migration guide intended to help everyone migrate their jobs smoothly; follow-up posts will continue to share notes on job debugging and tuning.