Apache Spark Source Code Reading: Spark on YARN


You are welcome to reprint this article; please indicate the source, huichiro.

Overview

YARN in Hadoop 2 is a management platform for distributed compute resources. Thanks to its clean model abstraction, it is well positioned to become the de facto standard for distributed compute-resource management. Its main responsibility is to manage a distributed compute cluster and to manage and allocate the compute resources within it.

YARN also provides a clear implementation standard for application development, and Spark supports deployment on YARN. This article analyzes in detail how Spark is deployed on the YARN platform.

Review of the Spark standalone deployment mode

The compute modules in a Spark standalone cluster can be summarized briefly: the entire cluster is composed mainly of four kinds of JVM processes.

  1. The master is responsible for managing the entire cluster; both the driver application and the workers register themselves with the master.
  2. A worker is responsible for managing the compute resources on one node, for example starting the corresponding executors.
  3. The tasks of each stage of an RDD are actually executed on the executors.
  4. The driver application hosts the SchedulerBackend, whose concrete type varies with the deployment mode.

In other words, the master manages resources at the process level, while the SchedulerBackend works at the thread level.
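
To make the role of the SchedulerBackend concrete, here is a minimal driver-application sketch against the Spark 0.9-era Scala API (the host and port are hypothetical). The master URL handed to SparkContext is what decides which SchedulerBackend the driver creates: a spark:// URL selects the standalone backend, while yarn-standalone (used later in this article) selects the YARN one.

    import org.apache.spark.SparkContext

    // Minimal standalone-mode driver application (Spark 0.9-era API).
    // "spark://master:7077" is a hypothetical master URL; the URL scheme
    // determines which SchedulerBackend the driver instantiates.
    object StandaloneDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("spark://master:7077", "StandaloneDemo")
        // The tasks of each stage run inside executors started by the workers.
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }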

Startup sequence diagram

Basic architecture and workflow of YARN

YARN's basic architecture consists of three functional modules: 1) the RM (ResourceManager), 2) the NM (NodeManager), and 3) the AM (ApplicationMaster).

Job submission
  1. The user submits an application to the ResourceManager through the client. The ResourceManager allocates an appropriate container for the request and starts the ApplicationMaster inside that container on the designated NodeManager.
  2. Once started, the ApplicationMaster registers itself with the ResourceManager.
  3. To run the user's tasks, the ApplicationMaster negotiates with the ResourceManager for the containers they require; once they are obtained, it hands the tasks to the designated NodeManagers.
  4. The NodeManagers start the corresponding containers and run the user tasks.

Example

Everything above boils down to this: writing a YARN application mainly means implementing a Client and an ApplicationMaster. For more information, see simple-yarn-app. A minimal sketch of the ApplicationMaster side is shown below.
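
To give a feel for steps 2 through 4, here is a hedged sketch of the ApplicationMaster side in Scala, written against the Hadoop 2.x AMRMClient API in the spirit of simple-yarn-app. It is an illustration under those assumptions, not Spark's actual ApplicationMaster:

    import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
    import org.apache.hadoop.yarn.client.api.AMRMClient
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import org.apache.hadoop.yarn.util.Records

    // Sketch of an ApplicationMaster (not Spark's): register with the RM,
    // ask for one container, wait for the grant, then deregister.
    object SketchApplicationMaster {
      def main(args: Array[String]): Unit = {
        val amClient = AMRMClient.createAMRMClient[ContainerRequest]()
        amClient.init(new YarnConfiguration())
        amClient.start()

        // Step 2: register this AM with the ResourceManager.
        amClient.registerApplicationMaster("", 0, "")

        // Step 3: negotiate a container (1 GB, 1 core) for a user task.
        val capability = Records.newRecord(classOf[Resource])
        capability.setMemory(1024)
        capability.setVirtualCores(1)
        val priority = Records.newRecord(classOf[Priority])
        priority.setPriority(0)
        amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority))

        // Poll until the RM grants the container. A real AM would now use
        // NMClient to start the container on its NodeManager (step 4).
        var granted = 0
        while (granted < 1) {
          granted += amClient.allocate(0.0f).getAllocatedContainers.size()
          Thread.sleep(100)
        }

        amClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
      }
    }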

Spark on Yarn

Combining the Spark standalone deployment model with the requirements of the YARN programming model, the following table compares Spark standalone with Spark on YARN.

Standalone          Yarn                           Notes
Client              Client                         For the standalone classes, see the org.apache.spark.deploy package
Master              ApplicationMaster
Worker              ExecutorRunnable
Scheduler           YarnClusterScheduler
SchedulerBackend    YarnClusterSchedulerBackend

The purpose of the table is to make clear why these changes are needed and how each piece maps onto the standalone mode. During code reading, the analysis focuses on ApplicationMaster, YarnClusterScheduler, and YarnClusterSchedulerBackend.

In general, the class used to start the ApplicationMaster is named explicitly in the client, as the following code shows:

    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        Collections.singletonList(
            "$JAVA_HOME/bin/java" +
            " -Xmx256M" +
            " com.hortonworks.simpleyarnapp.ApplicationMaster" +
            " " + command +
            " " + String.valueOf(n) +
            " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout" +
            " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"
            )
        );

In Spark's yarn.Client, however, the class name of the ApplicationMaster is not written out directly. It is encapsulated in ClientArguments, where the startup class is actually named: the amClass field is given the default value org.apache.spark.deploy.yarn.ApplicationMaster in the constructor.
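
As a rough sketch of that arrangement (simplified, not Spark's verbatim source; only the fields relevant here are shown), the idea looks like this:

    // Simplified sketch of the ClientArguments idea: the AM class name is an
    // ordinary field with a default, and yarn.Client splices args.amClass into
    // the java command line it places in the AM's ContainerLaunchContext.
    object ClientSketch {
      class ClientArguments(args: Array[String]) {
        var userJar: String = null
        var userClass: String = null
        var amMemory: Int = 512
        // Default startup class, as described above.
        var amClass: String = "org.apache.spark.deploy.yarn.ApplicationMaster"
      }

      // Build the AM start command from amClass, as yarn.Client does.
      def amCommand(args: ClientArguments): String =
        "$JAVA_HOME/bin/java -server -Xmx" + args.amMemory + "m " +
          args.amClass + " --class " + args.userClass + " --jar " + args.userJar
    }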

Example: deploying SparkPi

The following shows the concrete steps for deploying SparkPi on YARN.

    $ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.5-alpha.jar \
        ./bin/spark-class org.apache.spark.deploy.yarn.Client \
          --jar examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
          --class org.apache.spark.examples.SparkPi \
          --args yarn-standalone \
          --num-workers 3 \
          --master-memory 4g \
          --worker-memory 2g \
          --worker-cores 1

The output log shows that when the client submits the request, the AM is specified as org.apache.spark.deploy.yarn.ApplicationMaster:

    13/12/29 23:33:25 INFO Client: Command for starting the Spark ApplicationMaster: $JAVA_HOME/bin/java -server -Xmx4096m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --args 'yarn-standalone' --worker-memory 2048 --worker-cores 1 --num-workers 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

Summary

When a Spark application is submitted, its resource request is made in one shot: the number of executors the application needs is computed up front. If the cluster can satisfy the request at that moment, the application is launched; otherwise it waits. If new nodes are later added to the cluster, a running application cannot make use of the new resources; Spark lacks a rebalancing mechanism here, whereas Storm has one.

References

  1. Launching Spark on YARN, http://spark.apache.org/docs/0.9.1/running-on-yarn.html
  2. Getting Started Writing YARN Applications, http://hortonworks.com/blog/getting-started-writing-yarn-applications/
  3. Dong Xicheng, Hadoop Technology Insider: In-depth Analysis of YARN Architecture Design and Implementation Principles
  4. Application Development Process, http://my.oschina.net/u/1434348/blog/193374 (strongly recommended!)
