Spark development: an elaboration of the Spark kernel

Core
1. Introducing the core of Spark

The cluster mode discussed here is standalone.
Driver: the machine from which we submit the Spark program we wrote; the most important thing the Driver does is create a SparkContext.
Application: the program we wrote, that is, the class that creates the SparkContext.
spark-submit: the program used to submit an application to the Spark cluster. spark-submit is, in fact, an Akka actor that inherits the Actor model; if it did not, it could not communicate with the Master, and we could not register our application with the Master.
SparkContext: during the creation of the SparkContext, three of the most important things happen: it creates the DAGScheduler (the directed acyclic graph scheduler), it creates the TaskScheduler, and it creates the TaskSchedulerBackend that matches the TaskScheduler (see the sketch after these definitions).
DAGScheduler: DAG stands for directed acyclic graph. After the program is written, its various operators are handed to the DAGScheduler for overall scheduling; each application, when it runs, is divided into several stages by the DAGScheduler using its stage-splitting algorithm.
When the DAGScheduler has produced the task information for a stage, it hands it to the TaskScheduler for the concrete dispatch, packaging a batch of tasks into a TaskSet to be executed.
TaskScheduler: the TaskScheduler organizes and dispatches task execution.
When the Executors on the Workers start, they actively reverse-register with the Driver. Once the Driver has received the reverse-registrations of all Executors (the full group), it begins loading data, creating RDDs, and handing the various operators to the DAGScheduler. So how does the Driver know that it has received the whole group of Executors? Recall that when the Master received the Driver's registration request, it worked out the resource assignment, notified each Worker to accept its share, and each Worker acknowledged the assignment back to the Master. The Master then told the Driver which Workers had accepted and handed this assignment plan to the Driver; with that plan, the Driver can tell whether the whole group of Executors has arrived.
Master: the Master is mainly responsible for cluster monitoring and for allocating the resources an application runs with. When allocating resources, the Master has two distribution strategies: one that spreads applications across the Workers (spreadOut) and one that does not. The Master is itself an Akka actor; when it receives the registration notice sent by the Driver, it works out how the required resources should be handed to the Workers, which in practice means telling the Workers to start Executor processes.
TaskRunner: when a task is assigned, the Executor takes a thread from its thread pool, wraps the task in a TaskRunner, and executes the concrete operators such as flatMap, map, reduceByKey, and so on.
There are in fact two kinds of tasks, ShuffleMapTask and ResultTask. A ResultTask is a task of the final stage, the one that executes the action; all the other tasks are ShuffleMapTasks.
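A minimal sketch that ties these terms together, assuming Spark's Scala API and a standalone cluster; the master URL, app name, and input path are illustrative placeholders. The flatMap and map operators run as ShuffleMapTasks, reduceByKey introduces the shuffle boundary where the DAGScheduler splits stages, and the final collect is the action whose final-stage tasks are ResultTasks.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Driver code: create the SparkConf and SparkContext (which in turn
        // creates the DAGScheduler, TaskScheduler and TaskSchedulerBackend).
        val conf = new SparkConf()
          .setAppName("WordCount")                  // placeholder app name
          .setMaster("spark://master:7077")         // placeholder standalone Master URL
        val sc = new SparkContext(conf)

        val lines  = sc.textFile("hdfs:///input/words.txt")  // placeholder input path
        val words  = lines.flatMap(_.split(" "))              // runs as ShuffleMapTasks
        val pairs  = words.map(w => (w, 1))
        val counts = pairs.reduceByKey(_ + _)                 // shuffle => stage boundary

        val result = counts.collect()  // action: triggers the job; final-stage tasks are ResultTasks
        result.take(10).foreach(println)

        sc.stop()                      // deregister from the Master and release resources
      }
    }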

Spark Kernel:
Application = Driver + Executor
The Driver's code = SparkConf + SparkContext
An Executor is an object inside a process on a Worker; it executes tasks concurrently through a thread pool, and its threads are reused across tasks.
While the application is running, execution does not depend on the Cluster Manager.
A Worker is the node-management process: it does not run the application's code itself; it manages the resources of the current node and, on the Master's instruction, allocates the concrete compute resource, an Executor (launched in a new process).
A Worker does not include resource information when it sends its heartbeat; the Master knows the resource situation from the allocations it has made.
A Job contains a series of tasks computed in parallel and is generally triggered by an action; an action does not produce an RDD.
Inside a stage: the computing logic is identical, only the data (the partitions) differ.
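A short sketch of the transformation/action distinction above, assuming an existing SparkContext `sc` (as in the earlier sketch); the numbers are made up for illustration. The map call lazily returns a new RDD and runs nothing, while count is an action that triggers a Job and returns a plain value rather than an RDD; the four partitions become four parallel tasks that run the same computing logic over different data.

    // Assumes `sc` is an existing SparkContext.
    val nums    = sc.parallelize(1 to 1000, numSlices = 4)  // 4 partitions => 4 parallel tasks
    val doubled = nums.map(_ * 2)    // transformation: lazily builds a new RDD, no job yet

    val total   = doubled.count()    // action: triggers a Job, returns a Long (not an RDD)
    println(s"count = $total")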

Why can't we use the IDEA integrated development environment to submit Spark programs to the Spark cluster?
1. Memory and core limitations: by default, the Spark program's driver runs on the machine that submits the program, so if the IDE submits it, the IDE machine would have to be very powerful.
2. The driver directs the Workers and communicates with them frequently; if the IDE and the Spark cluster are not on the same network, there will be lost tasks, slow execution, and many other unnecessary problems.
3. It is not secure.
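Instead of submitting from the IDE, the usual practice is to package the application into a jar and submit it with spark-submit from a machine on the cluster's network; the class name, jar path, master URL, and resource sizes below are illustrative placeholders.

    spark-submit \
      --class com.example.WordCount \
      --master spark://master:7077 \
      --deploy-mode cluster \
      --executor-memory 2g \
      --total-executor-cores 8 \
      /path/to/wordcount.jar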

Flowchart:

1. SparkContext connects to the Master, registers with it, and requests resources (CPU cores and memory).
2. Based on SparkContext's resource request and the information reported in the Workers' heartbeat cycles, the Master decides which Worker(s) to allocate resources on, acquires the resources on those Workers, and starts StandaloneExecutorBackend there.
3. StandaloneExecutorBackend registers with SparkContext.
4. SparkContext sends the application code to StandaloneExecutorBackend. SparkContext also parses the application code, builds the DAG, and submits it to the DAGScheduler, which splits it into stages (a Job is spawned whenever an action is encountered; each Job contains one or more stages, and stage boundaries are typically created where external data is read or before a shuffle). The stages (as TaskSets) are then submitted to the TaskScheduler, which is responsible for assigning the tasks to the appropriate Workers and finally hands them to StandaloneExecutorBackend for execution.
5. StandaloneExecutorBackend creates an Executor thread pool, starts executing tasks, and reports back to SparkContext until the tasks finish.
6. When all tasks have completed, SparkContext deregisters from the Master and releases the resources.
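A compact sketch, under the same assumptions as before (placeholder master URL and input path), that maps driver code onto the numbered steps above: creating the SparkContext covers steps 1-3, the action covers steps 4-5, and sc.stop() is step 6.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("FlowDemo")                          // placeholder app name
      .setMaster("spark://master:7077")                // placeholder Master URL
    val sc = new SparkContext(conf)                    // steps 1-3: register with the Master,
                                                       // Executors launched, backends register back

    val logs   = sc.textFile("hdfs:///input/app.log")  // placeholder path; nothing runs yet
    val errors = logs.filter(_.contains("ERROR"))      // transformation only, still no job

    val n = errors.count()                             // steps 4-5: DAG built, stages/TaskSets
                                                       // submitted, tasks run in executor thread pools
    println(s"error lines: $n")

    sc.stop()                                          // step 6: deregister from the Master, release resources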
