You are welcome to reprint this article; please credit the source (huichiro).
Summary
YARN in Hadoop 2 is a management platform for distributed computing resources. Thanks to its clean model abstraction, it is well placed to become a de facto standard for distributed resource management. Its main responsibility is to manage distributed computing clusters and to allocate the computing resources within them.
YARN also provides a good implementation standard for application development, and Spark supports deployment on YARN. This article analyzes in detail how Spark is deployed on the YARN platform.
Review of the Spark standalone deployment mode
A brief sketch of the computing modules in a Spark standalone cluster shows that the cluster is composed of four different kinds of JVM processes:
- The master manages the entire cluster; both the driver application and the workers must register with it.
- A worker manages the computing resources on a single node, for example by starting the corresponding executors.
- An executor carries out the actual execution of each stage of an RDD computation.
- The driver application contains the SchedulerBackend, whose concrete type varies with the deployment mode.
In other words, the master manages resources at the process level, while the SchedulerBackend schedules at the thread level.
Startup sequence diagram
Basic architecture and workflow of YARN
The basic architecture of YARN consists of three functional modules: 1) the ResourceManager (RM); 2) the NodeManager (NM); 3) the ApplicationMaster (AM).
Job submission
- The user submits an application to the ResourceManager through a client. The ResourceManager allocates an appropriate container based on the request and starts the ApplicationMaster in that container on the designated NodeManager.
- After the ApplicationMaster starts, it registers itself with the ResourceManager.
- For the user's tasks, the ApplicationMaster negotiates with the ResourceManager to obtain the containers needed to run them. Once the allocation succeeds, the ApplicationMaster sends the tasks to the designated NodeManagers.
- The NodeManagers start the corresponding containers and run the user tasks.
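The four steps above can be sketched as a toy simulation. This is purely illustrative and is not the real YARN API (an actual client and ApplicationMaster would use Hadoop classes such as YarnClient, AMRMClient, and NMClient); it only encodes the ordering of the handshake as a list of events:

```java
import java.util.ArrayList;
import java.util.List;

// Toy, self-contained simulation of the YARN job-submission protocol
// described above. All names here are illustrative, not Hadoop APIs.
public class YarnFlowSketch {
    public static List<String> submit(int containersNeeded) {
        List<String> events = new ArrayList<>();
        // 1. Client submits the application; the RM allocates a container
        //    on some NodeManager and starts the ApplicationMaster in it.
        events.add("RM: start ApplicationMaster in container-0");
        // 2. The ApplicationMaster registers itself with the RM.
        events.add("AM: registered with RM");
        // 3. The AM negotiates with the RM for the containers its tasks need.
        for (int i = 1; i <= containersNeeded; i++) {
            events.add("RM: grant container-" + i);
        }
        // 4. The AM hands the tasks to the NodeManagers, which start the
        //    containers and run them.
        for (int i = 1; i <= containersNeeded; i++) {
            events.add("NM: launch task in container-" + i);
        }
        return events;
    }
}
```

The point of the sketch is only the ordering: the AM must exist and register before any task container can be requested or launched.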
Example
That is a lot of description. To put it bluntly, writing a YARN application mainly means implementing a Client and an ApplicationMaster. For a concrete example, see simple-yarn-app.
Spark on YARN
Combining the Spark standalone deployment mode with the requirements of the YARN programming model, the following table compares Spark standalone with Spark on YARN.
| Standalone | YARN | Notes |
| --- | --- | --- |
| Client | Client | For standalone, see the org.apache.spark.deploy package |
| Master | ApplicationMaster | |
| Worker | ExecutorRunnable | |
| Scheduler | YarnClusterScheduler | |
| SchedulerBackend | YarnClusterSchedulerBackend | |
The purpose of the above table is to work out why these changes are needed and what the mapping is between these classes and standalone mode. During code reading, the analysis focuses on ApplicationMaster, YarnClusterSchedulerBackend, and YarnClusterScheduler.
In general, the class name used to start the ApplicationMaster is specified explicitly in the client, as shown in the following code from simple-yarn-app:
```java
ContainerLaunchContext amContainer =
    Records.newRecord(ContainerLaunchContext.class);
amContainer.setCommands(
    Collections.singletonList(
        "$JAVA_HOME/bin/java" +
        " -Xmx256M" +
        " com.hortonworks.simpleyarnapp.ApplicationMaster" +
        " " + command +
        " " + String.valueOf(n) +
        " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout" +
        " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"
    )
);
```
In Spark's yarn.Client, however, the class name of the ApplicationMaster is not specified directly. It is encapsulated in ClientArguments, which actually carries the name of the startup class; the default value of amClass set in the ClientArguments constructor is org.apache.spark.deploy.yarn.ApplicationMaster.
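To make the role of amClass concrete, the following hypothetical helper (not taken from Spark's source; all names here are made up for illustration) assembles an AM launch command in the same shape as the simple-yarn-app snippet above, with the AM class name passed in as a parameter the way ClientArguments carries it:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how a client might assemble the command line
// that launches the ApplicationMaster. Only plain string handling is
// used here; a real client would put this into a ContainerLaunchContext.
public class AmCommandBuilder {
    // amClass plays the role of ClientArguments.amClass; in Spark 0.9.x
    // its default is org.apache.spark.deploy.yarn.ApplicationMaster.
    public static String buildCommand(String amClass, String userClass,
                                      String jar, int numWorkers) {
        List<String> parts = Arrays.asList(
            "$JAVA_HOME/bin/java",
            "-server", "-Xmx4096m",
            amClass,
            "--class", userClass,
            "--jar", jar,
            "--num-workers", String.valueOf(numWorkers),
            "1>", "<LOG_DIR>/stdout",
            "2>", "<LOG_DIR>/stderr");
        return String.join(" ", parts);
    }
}
```

Swapping in a different amClass is all it takes to launch a different ApplicationMaster implementation, which is exactly the indirection ClientArguments provides.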
Deployment example
Deploy SparkPi on YARN with the following command:
```shell
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.5-alpha.jar \
  ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
  --class org.apache.spark.examples.SparkPi \
  --args yarn-standalone \
  --num-workers 3 \
  --master-memory 4g \
  --worker-memory 2g \
  --worker-cores 1
```
The output log shows that when the client submits the application, the AM class is specified as org.apache.spark.deploy.yarn.ApplicationMaster:
```
13/12/29 23:33:25 INFO Client: Command for starting the Spark ApplicationMaster: $JAVA_HOME/bin/java -server -Xmx4096m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --args 'yarn-standalone' --worker-memory 2048 --worker-cores 1 --num-workers 3 1> /stdout 2> /stderr
```
Summary
When a Spark application is submitted to YARN, its resource request is made all at once: the number of executors the application needs is computed up front. If the whole cluster can satisfy the request at that moment, the application is launched; otherwise it waits. Moreover, if new nodes join the cluster later, a running application cannot make use of the new resources; there is no rebalance mechanism of the kind Storm provides.
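This all-or-nothing request can be made concrete with the numbers from the SparkPi example above (4 GB for the master plus 3 workers at 2 GB each). The following is a minimal back-of-the-envelope sketch with hypothetical method names, not code from Spark or YARN:

```java
// Hypothetical sketch of the one-shot resource request: all memory for
// the AM (master) and every worker must be claimed up front.
public class OneShotRequest {
    public static int totalMemoryMb(int masterMemMb, int workerMemMb,
                                    int numWorkers) {
        return masterMemMb + workerMemMb * numWorkers;
    }

    // True only if the cluster can satisfy the whole request at once;
    // otherwise the application waits (there is no partial start).
    public static boolean canLaunch(int freeClusterMemMb, int masterMemMb,
                                    int workerMemMb, int numWorkers) {
        return freeClusterMemMb
            >= totalMemoryMb(masterMemMb, workerMemMb, numWorkers);
    }
}
```

For the example command, 4096 + 3 × 2048 = 10240 MB must be free before the job starts; a cluster with only 8 GB free would leave it waiting even though two of the three workers could run.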
References
- Running Spark on YARN, http://spark.apache.org/docs/0.9.1/running-on-yarn.html
- Getting Started Writing YARN Applications, http://hortonworks.com/blog/getting-started-writing-yarn-applications/
- Dong Xicheng, "Hadoop Technology Insider: In-depth Analysis of YARN Architecture Design and Implementation Principles"
- Application development process, http://my.oschina.net/u/1434348/blog/193374 (strongly recommended!)