Spark on YARN Completely Demystified (DT Big Data Dream Factory)


Contents:

1. Hadoop YARN's workflow, demystified;

2. Spark on YARN's two run modes, in practice;

3. Spark on YARN's workflow, demystified;

4. Spark on YARN's internals, demystified;

5. Spark on YARN best practices;

Resource Management Framework: YARN

Mesos is also a resource management framework for distributed clusters; it is not tied to big data as such, but it can manage the resources of a big data cluster.

========== Hadoop YARN Parsing ==========

1. YARN is the resource manager introduced in Hadoop. It is responsible for managing and allocating the computing resources of a distributed (big data) cluster; on top of YARN we can run multiple computing frameworks on the same big data cluster, such as Spark, MapReduce, Storm, etc.;

2. YARN's basic workflow is shown below:

[Figure: YARN's basic workflow]

The Resource Manager manages the resources of the entire cluster (primarily memory and CPU, possibly also disk I/O and network I/O);

A Node Manager on each machine manages that machine's resources and reports the current machine's resources to the Resource Manager;

The Resource Manager knows how to manage the cluster because every Node Manager reports its machine's resources to it.

When a client submits a program to the Resource Manager, the Resource Manager launches a process on a specific machine to run the submitted program, based on its own knowledge of the cluster's resource status. The Resource Manager is the master of the entire cluster.

The Resource Manager does not start the process on a specific machine itself; it only commands the Node Manager to start the process.

The App Master is the driver of the currently submitted program. When it starts, it requests resources from the Resource Manager, and the Resource Manager grants resources to the App Master according to the cluster's resource situation.

A Container is the unit that encapsulates resources. A Container reports its resource information to the Node Manager, and reports its computation information to the App Master.

=====> Summary:

3. The client submits a program to the Resource Manager, which accepts the application and, based on the status of the cluster's resources, decides on which specific node to launch the task scheduler driver (App Master) of the currently submitted application. Having decided, it commands that node's Node Manager to start a new JVM process running the Driver part of the program. When the App Master starts, it first registers with the Resource Manager, indicating that it is responsible for running the current program; it then downloads the jars and other resources the current application depends on, and on that basis decides what concrete resources to request from the Resource Manager. On receiving the App Master's resource request, the Resource Manager allocates resources to satisfy the request as fully as possible, and sends the metadata of those resources to the App Master. Based on that metadata, the App Master sends instructions to the Node Managers on the specific machines, telling them to launch the concrete Containers; after launching, each Container must register with the App Master. Once the App Master has obtained Containers for the computation, it schedules tasks and computes until the job finishes. It should be added that if the Resource Manager cannot fully satisfy the requested allocation the first time, then later, whenever it discovers newly available resources in the cluster, it proactively sends the metadata of those newly available resources to the App Master, providing more resources for the current program's run;

Additional notes:

1) For Hadoop MapReduce computations, Containers cannot be reused; with Spark on YARN, Containers can be reused;

2) The concrete destruction of a Container is decided by the Application Master;
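
You can observe this workflow from the outside with the standard YARN command-line client, which lists the applications the Resource Manager is tracking. A minimal sketch (the application id is a placeholder to fill in from the list output):

# list the applications the Resource Manager currently knows about
yarn application -list

# show one application's state, progress, and tracking URL
yarn application -status <application id>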

========== Spark on YARN: Two Modes of Operation ==========

Start YARN: start-yarn.sh


Visit master:8088 to watch YARN's web console.

Start Hadoop's history server:

mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Visit master:19888 to watch the history server's console.

You do not need to start the Spark cluster at this point; just start YARN. YARN's Resource Manager plays the same role as the Master does in Spark standalone mode!
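
A quick way to confirm which daemons are actually running is jps, which ships with the JDK. A sketch (the exact daemon list depends on what else you started, e.g. HDFS):

# on the master node you should see ResourceManager, but no Spark Master
jps

# on a worker node you should see NodeManager, but no Spark Worker
jps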

1. Spark on YARN has two run modes, and the only deciding factor is where the task scheduler driver of the current application runs:

1) Cluster: the driver runs in a JVM process on one of the machines inside the YARN cluster;

2) Client: the driver runs on the client machine from which the program is submitted;

It is important to note that no matter which mode is used, any machine that runs Spark code must have Spark installed!

2. Running SparkPi

1) Client mode: the running process's information can be seen directly at the command terminal, which is especially convenient for testing;

./spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 50000


If YARN has not been started, the submission keeps retrying to connect to it, as the log below shows. Note that the Spark cluster itself was never started: a Spark standalone cluster, made up of its Master and Workers, is really just a resource management framework, and YARN replaces it in that role.

16/02/13 21:04:07 INFO ipc.Client: Retrying connect to server: master/192.168.145.131:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/02/13 21:04:08 INFO ipc.Client: Retrying connect to server: master/192.168.145.131:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/02/13 21:04:09 INFO ipc.Client: Retrying connect to server: master/192.168.145.131:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/02/13 21:04:10 INFO ipc.Client: Retrying connect to server: master/192.168.145.131:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/02/13 21:04:11 INFO ipc.Client: Retrying connect to server: master/192.168.145.131:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/02/13 21:04:12 INFO ipc.Client: Retrying connect to server: master/192.168.145.131:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/02/13 21:04:13 INFO ipc.Client: Retrying connect to server: master/192.168.145.131:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

Demystified: in standalone mode, starting the Spark cluster means starting its Master and Workers, which is really starting a resource manager. The actual job computation has no direct relationship with the cluster's resource manager, so a Spark job does not truly run "inside" the Spark cluster you started; it runs in JVM processes, and the only requirement is that Spark is installed on the machines hosting those JVMs!!!
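
If you hit the retries above, a quick connectivity check against the Resource Manager's client port tells you whether YARN is up at all. A sketch (8032 is YARN's default yarn.resourcemanager.address port; having nc installed is an assumption):

# exits successfully only if something is listening on the Resource Manager port
nc -z master 8032 && echo "Resource Manager reachable"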

2) Cluster mode: the driver runs inside the YARN cluster, so its process information does not appear at the command terminal; instead you follow the run through YARN's web console and the history server. This is the mode generally used in production;

./spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 50000
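
In practice you normally also tell YARN how many resources the application may claim. A minimal sketch using standard spark-submit options for YARN (the concrete values and the queue name are assumptions to adapt to your cluster):

# cluster-mode submission with explicit executor sizing and a target queue
./spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --queue default \
  ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 50000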

3. The relationship between the driver and the Application Master in Spark on YARN mode:

1) Cluster: the driver runs inside the Application Master process, and we view the current Spark program's running information through the web console on port 8088 exposed by Hadoop.

2) Client: the driver runs on the machine from which the code is submitted; the Application Master still runs in the cluster, but it is only responsible for requesting resources and launching Executors. In this case the Executors do not register with the Application Master, they register with the driver!!!
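
This difference also shows up when sizing the processes: in cluster mode the Application Master hosts the driver, so the driver settings size it, while in client mode the Application Master is a separate, smaller process with its own setting. A sketch (standard Spark-on-YARN options; the memory values are assumptions):

# cluster mode: the Application Master hosts the driver, so size it with --driver-memory
./spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster \
  --driver-memory 2g ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 50000

# client mode: the driver is local; the resource-requesting Application Master is sized separately
./spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client \
  --conf spark.yarn.am.memory=1g ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 50000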

========== Spark on YARN Best Practices ==========

1. In Spark on YARN mode, Hadoop YARN's configuration yarn.nodemanager.local-dirs overrides Spark's spark.local.dir!!! (See the configuration sketch after this list.)

2. In a real production environment, cluster mode is generally used, and we obtain all of the final running information through the history server;

3. If you want to view the run's log information directly, you can use the following command:

yarn logs -applicationId <app id>
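
Two follow-ups on these practices. First, yarn logs reads aggregated logs, so it only works when log aggregation is enabled on the cluster (yarn.log-aggregation-enable set to true in yarn-site.xml); otherwise the logs remain in each Node Manager's local log directories. Second, as flagged in point 1, scratch space must be configured on the YARN side rather than through spark.local.dir. A sketch (the property name is YARN's; the paths are assumptions for your machines):

# in yarn-site.xml on every Node Manager (XML shown as a comment for reference):
#   <property>
#     <name>yarn.nodemanager.local-dirs</name>
#     <value>/data1/yarn/local,/data2/yarn/local</value>
#   </property>
# restart the Node Manager for the change to take effect
yarn-daemon.sh stop nodemanager
yarn-daemon.sh start nodemanager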

Teacher Liaoliang's card:

China's "first person of Spark"

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [Email protected]


This article is from the "A Flower Proud of the Cold" blog; reprinting is declined!
