Apache Spark Source Code Reading: Spark on YARN


You are welcome to reprint this article; please indicate the source, huichiro.

Overview

YARN in Hadoop 2 is a management platform for distributed compute resources. Thanks to its clean model abstraction, it is well positioned to become the de facto standard for distributed compute-resource management. Its main responsibility is to manage a distributed compute cluster and to manage and allocate the compute resources within it.

YARN also provides a clear implementation standard for application development, and Spark supports deployment on YARN. This article analyzes in detail how Spark is deployed on the YARN platform.

Review of the Spark standalone deployment mode

The compute modules in a Spark standalone cluster can be summarized briefly: the entire cluster is composed mainly of four kinds of JVM processes.

  1. The master is responsible for managing the entire cluster; both the driver application and the workers register themselves with the master.
  2. A worker is responsible for managing the compute resources on one node, for example starting the corresponding executors.
  3. The tasks of each stage of an RDD are actually executed on the executors.
  4. The driver application hosts the SchedulerBackend, whose concrete type varies with the deployment mode.

In other words, the master manages resources at the process level, while the SchedulerBackend works at the thread level.
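
To make the role of the SchedulerBackend concrete, here is a minimal driver-application sketch against the Spark 0.9-era Scala API (the host and port are hypothetical). The master URL handed to SparkContext is what decides which SchedulerBackend the driver creates: a spark:// URL selects the standalone backend, while yarn-standalone (used later in this article) selects the YARN one.

    import org.apache.spark.SparkContext

    // Minimal standalone-mode driver application (Spark 0.9-era API).
    // "spark://master:7077" is a hypothetical master URL; the URL scheme
    // determines which SchedulerBackend the driver instantiates.
    object StandaloneDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("spark://master:7077", "StandaloneDemo")
        // The tasks of each stage run inside executors started by the workers.
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }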

Startup sequence diagram

Basic architecture and workflow of YARN

YARN's basic architecture consists of three functional modules: 1) the RM (ResourceManager), 2) the NM (NodeManager), and 3) the AM (ApplicationMaster).

Job submission
  1. The user submits an application to the ResourceManager through the client. The ResourceManager allocates an appropriate container for the request and starts the ApplicationMaster inside that container on the designated NodeManager.
  2. Once started, the ApplicationMaster registers itself with the ResourceManager.
  3. To run the user's tasks, the ApplicationMaster negotiates with the ResourceManager for the containers they require; once they are obtained, it hands the tasks to the designated NodeManagers.
  4. The NodeManagers start the corresponding containers and run the user tasks.

Example

Everything above boils down to this: writing a YARN application mainly means implementing a Client and an ApplicationMaster. For more information, see simple-yarn-app. A minimal sketch of the ApplicationMaster side is shown below.
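
To give a feel for steps 2 through 4, here is a hedged sketch of the ApplicationMaster side in Scala, written against the Hadoop 2.x AMRMClient API in the spirit of simple-yarn-app. It is an illustration under those assumptions, not Spark's actual ApplicationMaster:

    import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
    import org.apache.hadoop.yarn.client.api.AMRMClient
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import org.apache.hadoop.yarn.util.Records

    // Sketch of an ApplicationMaster (not Spark's): register with the RM,
    // ask for one container, wait for the grant, then deregister.
    object SketchApplicationMaster {
      def main(args: Array[String]): Unit = {
        val amClient = AMRMClient.createAMRMClient[ContainerRequest]()
        amClient.init(new YarnConfiguration())
        amClient.start()

        // Step 2: register this AM with the ResourceManager.
        amClient.registerApplicationMaster("", 0, "")

        // Step 3: negotiate a container (1 GB, 1 core) for a user task.
        val capability = Records.newRecord(classOf[Resource])
        capability.setMemory(1024)
        capability.setVirtualCores(1)
        val priority = Records.newRecord(classOf[Priority])
        priority.setPriority(0)
        amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority))

        // Poll until the RM grants the container. A real AM would now use
        // NMClient to start the container on its NodeManager (step 4).
        var granted = 0
        while (granted < 1) {
          granted += amClient.allocate(0.0f).getAllocatedContainers.size()
          Thread.sleep(100)
        }

        amClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
      }
    }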

Spark on Yarn

Combining the Spark standalone deployment model with the requirements of the YARN programming model, the following table compares Spark standalone with Spark on YARN.

Standalone          Yarn                           Notes
Client              Client                         For the standalone classes, see the org.apache.spark.deploy package
Master              ApplicationMaster
Worker              ExecutorRunnable
Scheduler           YarnClusterScheduler
SchedulerBackend    YarnClusterSchedulerBackend

The purpose of the table is to make clear why these changes are needed and how each piece maps onto the standalone mode. During code reading, the analysis focuses on ApplicationMaster, YarnClusterScheduler, and YarnClusterSchedulerBackend.

In general, the class used to start the ApplicationMaster is named explicitly in the client, as the following code shows:

    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        Collections.singletonList(
            "$JAVA_HOME/bin/java" +
            " -Xmx256M" +
            " com.hortonworks.simpleyarnapp.ApplicationMaster" +
            " " + command +
            " " + String.valueOf(n) +
            " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout" +
            " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"
            )
        );

In Spark's yarn.Client, however, the class name of the ApplicationMaster is not written out directly. It is encapsulated in ClientArguments, where the startup class is actually named: the amClass field is given the default value org.apache.spark.deploy.yarn.ApplicationMaster in the constructor.
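
As a rough sketch of that arrangement (simplified, not Spark's verbatim source; only the fields relevant here are shown), the idea looks like this:

    // Simplified sketch of the ClientArguments idea: the AM class name is an
    // ordinary field with a default, and yarn.Client splices args.amClass into
    // the java command line it places in the AM's ContainerLaunchContext.
    object ClientSketch {
      class ClientArguments(args: Array[String]) {
        var userJar: String = null
        var userClass: String = null
        var amMemory: Int = 512
        // Default startup class, as described above.
        var amClass: String = "org.apache.spark.deploy.yarn.ApplicationMaster"
      }

      // Build the AM start command from amClass, as yarn.Client does.
      def amCommand(args: ClientArguments): String =
        "$JAVA_HOME/bin/java -server -Xmx" + args.amMemory + "m " +
          args.amClass + " --class " + args.userClass + " --jar " + args.userJar
    }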

Example: deploying SparkPi

The following shows the concrete steps for deploying SparkPi on YARN.

    $ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.5-alpha.jar \
        ./bin/spark-class org.apache.spark.deploy.yarn.Client \
          --jar examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
          --class org.apache.spark.examples.SparkPi \
          --args yarn-standalone \
          --num-workers 3 \
          --master-memory 4g \
          --worker-memory 2g \
          --worker-cores 1

The output log shows that when the client submits the request, the AM is specified as org.apache.spark.deploy.yarn.ApplicationMaster:

    13/12/29 23:33:25 INFO Client: Command for starting the Spark ApplicationMaster: $JAVA_HOME/bin/java -server -Xmx4096m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --args 'yarn-standalone' --worker-memory 2048 --worker-cores 1 --num-workers 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

Summary

When a Spark application is submitted, its resource request is made in one shot: the number of executors the application needs is computed up front. If the cluster can satisfy the request at that moment, the application is launched; otherwise it waits. If new nodes are later added to the cluster, a running application cannot make use of the new resources; Spark lacks a rebalancing mechanism here, whereas Storm has one.

References

  1. Launching Spark on YARN, http://spark.apache.org/docs/0.9.1/running-on-yarn.html
  2. Getting Started Writing YARN Applications, http://hortonworks.com/blog/getting-started-writing-yarn-applications/
  3. Dong Xicheng, Hadoop Technology Insider: In-depth Analysis of YARN Architecture Design and Implementation Principles
  4. Application Development Process, http://my.oschina.net/u/1434348/blog/193374 (strongly recommended!)
