Spark Kernel Architecture


SparkContext creation: at the top level the DAGScheduler, and underneath it the TaskScheduler and SchedulerBackend.

Application = Driver + Executors
A Spark program is divided into two parts: the Driver and the Executors.
The Driver drives the Executors.
Driver part of the code: SparkConf + SparkContext.
Executor part: the concrete computation.
Executor part of the code: textFile, flatMap, map, etc.
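
A minimal sketch of this split (the input path is hypothetical, used only for illustration): SparkConf and SparkContext live in the driver, while the functions passed to textFile, flatMap and map are shipped to the executors and run there.

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverExecutorSketch {
      def main(args: Array[String]): Unit = {
        // Driver side: SparkConf + SparkContext are created in the driver process.
        val conf = new SparkConf().setAppName("DriverExecutorSketch")
        val sc   = new SparkContext(conf)

        // Executor side: the functions passed to textFile/flatMap/map are shipped
        // to executors and run there against partitions of the data.
        // "input.txt" is a hypothetical path used only for illustration.
        val counts = sc.textFile("input.txt")
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))

        // An action brings results back to the driver.
        counts.take(10).foreach(println)

        sc.stop()
      }
    }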

Cluster Manager
A service that acquires resources for the cluster from the outside; in other words, a resource allocator.

Running the Spark application does not depend on the Cluster Manager: once registration succeeds, the resources have already been allocated through the Cluster Manager, and the running application no longer needs the Cluster Manager (which is pluggable) to participate. This is a coarse-grained resource allocation approach.
An application can contain multiple jobs.

The Worker (node) does not run application code itself. It manages the current node's CPU, memory, and other resource usage, receives the Master's resource-allocation instructions, and uses an ExecutorRunner to launch the process that does the actual work.
The Worker itself is a process. It does not run the program's code; it manages the node's resources and, on receiving the Master's specific instructions, allocates the concrete compute resource, the Executor (in a new process). Tasks then execute in parallel as threads inside that Executor.

ExecutorRunner
Runs on the Worker; it is the proxy that remotely launches the process in which the executor's threads run.
The Worker itself does not report the current node's memory and CPU to the Master; the Worker-to-Master heartbeat carries only the worker_id, with no resource information inside.
The Master allocates the Worker's resources and can then adjust them dynamically.

Executor: tasks read data from memory or disk.
An Executor is an object inside a process running on a Worker; it executes tasks for the current application through a thread pool, which provides concurrency and thread reuse.
A Worker implicitly assumes that the current program gets one Executor.
If the number of cores is not configured, the Executor takes all of them exclusively, so as long as one job has not finished executing, the next task gets no resources.
Note: (1) The Worker is the foreman; the Cluster Manager is the project manager.
(2) The Worker does not report its resources to the Master, except when a failure occurs.
A job is triggered by an action.
Job -> DAG -> Stage -> Task
Inside a stage, the computation logic is exactly the same; only the data being computed differs.
A job is a parallel computation containing a series of tasks, typically triggered by an action: a series of RDD operations is executed when the action triggers the job.
An application can have more than one job, because it can contain different actions; usually one action corresponds to one job (checkpoint also generates jobs). runJob produces a DAG; one DAG contains multiple stages, one stage contains multiple tasks, and stages are divided at shuffle boundaries.
By default a job has one executor on each node.
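
A small hedged sketch of these points, reusing the hypothetical sc and input path from the skeleton above: reduceByKey introduces a shuffle, so each job's DAG is split into two stages, and each action triggers its own job.

    // Assumes an existing SparkContext `sc`, as in the sketch above.
    val words = sc.textFile("input.txt")   // hypothetical path
      .flatMap(_.split(" "))
      .map((_, 1))

    // reduceByKey introduces a shuffle, so the DAG for each job below is split
    // into two stages: a ShuffleMapTask stage before the shuffle and a
    // ResultTask stage after it.
    val counts = words.reduceByKey(_ + _)

    // Each action triggers its own job.
    val distinctWords = counts.count()   // job 1
    val top           = counts.take(5)   // job 2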
Spark's advantage does not come only from being memory-based, but also from its scheduling, fault tolerance, and other characteristics.

Typically an action is a job

Two tasks, two executors.

A Spark program can run in two deploy modes: client and cluster.
The default is client mode, in which you can see the log information. Generally a dedicated node is used for submission; it must be in the same network environment as the cluster, and its configuration should be consistent with the workers'.
In production, because the driver has frequent network interactions and consumes memory and CPU, it is generally not recommended to run the driver on the Master (and a production Spark cluster is not submitted to from an IDE such as IDEA); that is, the machine that submits the Spark job should not be the Master.

Submission of the Spark program:
A machine dedicated to submitting Spark programs: this machine is generally in the same network environment as the Spark cluster (because the driver communicates frequently with the executors), and its configuration is consistent with that of the Workers.
The application (with its various external dependencies, such as *.so files and JARs) is run with spark-submit, which lets you configure various parameters at runtime, such as memory and cores. In a real production environment, a shell script is usually written to automate the configuration and submission. Of course, Spark must be installed on this machine, but the Spark installed here does not itself belong to the cluster!

Submission of the Spark task:
Driver (its core is the SparkContext): first create a SparkConf, then create the SparkContext on top of it.
RPC is implemented with Akka and Netty.

SparkContext: creates the DAGScheduler, TaskScheduler, and SchedulerBackend. During instantiation it registers the current program with the Master; the Master accepts the registration and, if there is no problem, assigns an AppID to the current program and allocates compute resources.

In general, when a job is triggered by an action, the SparkContext uses the DAGScheduler to divide the job's DAG into different stages. Within each stage is a series of tasks that are identical in business logic but handle different data; together they constitute a TaskSet.

The TaskScheduler and SchedulerBackend are responsible for running the concrete tasks (respecting data locality).

Spark Cluster

Master: accepts the user-submitted program and sends instructions to the Workers to allocate compute resources for the current program. Each Worker node implicitly assumes the current program will be allocated one Executor, and tasks execute concurrently through the thread pool inside that Executor.

The amount of memory and CPU resources a Spark program gets on a node depends on:
1. spark-env.sh and spark-defaults.conf
2. Parameters provided to spark-submit
3. Parameters set via SparkConf in the program
When these conflict, the precedence is 3 > 2 > 1.
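
A sketch of level 3, assuming we want to pin executor memory and cap cores from inside the program (the concrete values are illustrative): anything set here on the SparkConf overrides the corresponding spark-submit flags and the config files.

    import org.apache.spark.{SparkConf, SparkContext}

    // Level 3 (highest precedence): values set in SparkConf inside the program
    // override spark-submit flags (level 2) and spark-env.sh / spark-defaults.conf
    // (level 1). The concrete values here are illustrative only.
    val conf = new SparkConf()
      .setAppName("PrecedenceSketch")
      .set("spark.executor.memory", "2g")
      .set("spark.cores.max", "4")   // cap total cores so one app does not take them all

    val sc = new SparkContext(conf)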

Worker Node
The Worker process, through an ExecutorRunner object instance acting as a proxy, remotely starts the ExecutorBackend process, inside which there is the Executor and its thread pool (ThreadPool).
A task is actually encapsulated by a TaskRunner on the Worker; a thread from the ThreadPool then executes the task, and the thread is returned to the pool once execution is done.
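
A deliberately simplified model of this idea, not Spark's actual internals: each task is wrapped in a Runnable (Spark's real wrapper is TaskRunner) and handed to a thread pool, so threads are reused across tasks instead of being created per task.

    import java.util.concurrent.{Executors, TimeUnit}

    // Conceptual sketch only: a "task" wrapper submitted to a reusable thread pool.
    object ThreadPoolSketch {
      final class TaskRunnerSketch(taskId: Int) extends Runnable {
        override def run(): Unit =
          println(s"running task $taskId on ${Thread.currentThread().getName}")
      }

      def main(args: Array[String]): Unit = {
        val pool = Executors.newFixedThreadPool(4) // e.g. 4 cores -> 4 reusable threads
        (1 to 10).foreach(id => pool.execute(new TaskRunnerSketch(id)))
        pool.shutdown()
        pool.awaitTermination(1, TimeUnit.MINUTES)
      }
    }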

The tasks in the last stage, called ResultTasks, produce the result of the job; the tasks in the earlier stages are ShuffleMapTasks, which prepare the data for the next stage, equivalent to mappers in MapReduce.

The entire Spark program runs like this: the DAGScheduler divides the job into different stages and submits TaskSets to the TaskScheduler, which then submits the tasks to Executors for execution (honoring data locality). Each task computes one partition of an RDD, applying to that partition the series of functions defined within the same stage, and so on, until the entire program finishes running.
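
A hedged illustration of the task-per-partition point, again assuming an existing SparkContext sc: mapPartitionsWithIndex makes visible which partition, and therefore which task, handles each element.

    // Assumes an existing SparkContext `sc`. Each task processes one partition;
    // mapPartitionsWithIndex exposes the task-per-partition mapping.
    val rdd = sc.parallelize(1 to 8, numSlices = 4)   // 4 partitions -> 4 tasks

    val tagged = rdd.mapPartitionsWithIndex { (partitionId, elements) =>
      elements.map(x => s"partition $partitionId processed element $x")
    }

    tagged.collect().foreach(println)   // the action triggers the job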

Summary:
1. On the submission node, spark-submit starts the Driver, whose core is the SparkContext.
2. The SparkContext creates the DAGScheduler, TaskScheduler, and SchedulerBackend, and registers the current program with the Master.
3. The Master accepts the registration, assigns an AppID and compute resources, and sends instructions to the Workers to allocate compute resources to the user-submitted program; by default a Worker starts one Executor per program.
4. The Worker process, through an ExecutorRunner object instance acting as a proxy, remotely starts the ExecutorBackend; inside the ExecutorBackend is the Executor, which wraps each task in a TaskRunner and fetches a thread from the ThreadPool to run it.
5. The DAGScheduler divides the job into stages; the tasks inside a stage form a TaskSet, and the TaskScheduler and SchedulerBackend are responsible for executing the TaskSets.
6. Each task computes one partition of an RDD; when a task finishes, its thread is recycled for the next task.
7. This loops until the tasks in the last stage, the ResultTasks, complete and produce the job's result (the tasks in the earlier stages are ShuffleMapTasks, which prepare data for the next stage), at which point the entire program has finished running.
