The working process of a Spark program on YARN


I. Understanding YARN

YARN is a product of Hadoop 2.x. Its most basic design idea is to split the two main functions of the JobTracker, namely resource management and job scheduling/monitoring, into two separate processes. Before detailing the working process of a Spark program, here is a brief introduction to YARN: it acts as an operating system for Hadoop, supporting not only the MapReduce computing framework but also streaming, iterative, and MPI parallel computing frameworks, and it is implemented using an event-driven mechanism.

The YARN architecture diagram is as follows:

1. ResourceManager

The ResourceManager is similar to the JobTracker and includes two main components: the Scheduler and the ApplicationManager. They are described separately as follows:

    • The primary function of the Scheduler is to allocate resources to the running applications; it performs pure scheduling based on the applications' resource requests. The Scheduler works with the abstract concept of a container, which bundles resources such as memory, CPU, disk, and network.
    • The main function of the ApplicationManager is to accept submitted jobs, to negotiate the first container for executing each application's ApplicationMaster, and to restart the ApplicationMaster's container when it fails. Each application's ApplicationMaster is then responsible for negotiating resource containers with the Scheduler, tracking their status, and monitoring progress; see the sketch after this list.
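
To make the negotiation between an ApplicationMaster and the Scheduler concrete, below is a minimal Scala sketch against the Hadoop 2.x AMRMClient API. The memory size, vcore count, priority, and the empty host and tracking-URL strings are illustrative assumptions, not values from the original article:

    import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
    import org.apache.hadoop.yarn.client.api.AMRMClient
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    object AmSketch {
      def main(args: Array[String]): Unit = {
        val amrm = AMRMClient.createAMRMClient[ContainerRequest]()
        amrm.init(new YarnConfiguration())
        amrm.start()
        // Register this ApplicationMaster with the ResourceManager.
        amrm.registerApplicationMaster("", 0, "")
        // Ask the Scheduler for one container: 1024 MB of memory, 1 vcore.
        val capability = Resource.newInstance(1024, 1)
        amrm.addContainerRequest(
          new ContainerRequest(capability, null, null, Priority.newInstance(0)))
        // Poll for granted containers; the float argument reports our progress.
        val allocated = amrm.allocate(0.1f).getAllocatedContainers
        println(s"Scheduler granted ${allocated.size} container(s)")
        amrm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
      }
    }

In a real ApplicationMaster the allocate() call runs in a heartbeat loop; a single call is shown here only to keep the sketch short.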

2. NodeManager

The NodeManager is similar to the TaskTracker. Its main functions are to launch containers, to monitor the containers' resources (CPU, memory, disk, network, and so on), and to report this information to the ResourceManager.
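
After the Scheduler grants a container, it is the NodeManager hosting that container which actually launches it. The following minimal Scala sketch shows that hand-off with the Hadoop 2.x NMClient API; the allocatedContainer parameter and the echo command are hypothetical placeholders:

    import java.util.Collections
    import org.apache.hadoop.yarn.api.records.{Container, ContainerLaunchContext, LocalResource}
    import org.apache.hadoop.yarn.client.api.NMClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    object NmSketch {
      // Ask the NodeManager hosting a granted container to launch it.
      def launchContainer(allocatedContainer: Container): Unit = {
        val nm = NMClient.createNMClient()
        nm.init(new YarnConfiguration())
        nm.start()
        // The launch context carries the command line (and, in real use,
        // local resources and environment variables) for the container.
        val ctx = ContainerLaunchContext.newInstance(
          Collections.emptyMap[String, LocalResource](),
          Collections.emptyMap[String, String](),
          Collections.singletonList("echo hello-from-container"),
          null, null, null)
        nm.startContainer(allocatedContainer, ctx)
      }
    }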

II. The Spark basic framework

A Spark application consists of a driver program and multiple jobs. A job is made up of multiple stages, and a stage consists of a group of tasks that have no shuffle dependencies among them. The Spark basic framework is as follows:

A Spark application is inseparable from two parts, the SparkContext and the executors: the executors are responsible for executing tasks, the machine that runs an executor is called a worker node, and the SparkContext, created by the user program, communicates with the executors through the resource scheduling module.
In detail, with the SparkContext as the overall entry point of the running application, Spark creates two levels of scheduling modules during SparkContext initialization: the DAGScheduler for job scheduling and the TaskScheduler for task scheduling.
The job scheduling module is a high-level, stage-oriented scheduling module: for each Spark job it computes the multiple scheduling stages (usually divided at shuffle boundaries), builds a concrete set of tasks for each stage (usually taking data locality into account), and then submits them to the task scheduling module as TaskSets for actual execution. The task scheduling module is responsible for actually launching the tasks and for monitoring and reporting their execution, as sketched below.
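
To illustrate how the DAGScheduler divides a job at shuffle boundaries, consider this minimal word-count sketch (the input path and application name are assumptions). The flatMap and map transformations have narrow dependencies and share one stage, while reduceByKey introduces a shuffle and therefore a second stage:

    import org.apache.spark.{SparkConf, SparkContext}

    object StageDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("StageDemo"))
        val counts = sc.textFile("hdfs:///tmp/input.txt")  // hypothetical input path
          .flatMap(_.split(" "))
          .map(word => (word, 1))   // narrow dependencies: stays in the same stage
          .reduceByKey(_ + _)       // shuffle dependency: a new stage begins here
        // Print the RDD lineage; indentation marks the stage (shuffle) boundaries.
        println(counts.toDebugString)
        counts.collect()            // the action submits one job to the DAGScheduler
        sc.stop()
      }
    }

Running this with spark-submit (which supplies the master URL) prints the lineage of counts, in which the indentation produced by toDebugString marks the shuffle boundary between the two stages.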

III. RDDs and how they are computed (transformations and actions)
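
Transformations (such as map and filter) are lazy: they only record the RDD lineage. Actions (such as count or collect) are what actually trigger computation. A minimal sketch of this distinction, with an illustrative application name and data:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddDemo"))
        val nums = sc.parallelize(1 to 10)
        // Transformations are lazy: they only record lineage, nothing runs yet.
        val squares = nums.map(n => n * n)
        val evens   = squares.filter(_ % 2 == 0)
        // An action forces evaluation and returns a result to the driver.
        println(evens.count())
        sc.stop()
      }
    }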

IV. DAGScheduler job scheduling and TaskScheduler task scheduling

V. The Spark working process


