Inside the Spark Runtime: The TaskScheduler Mechanism (DT Big Data DreamWorks)

Source: Internet
Author: User

Content:

1. How TaskScheduler works;

2. TaskScheduler source code analysis;

A stage contains a set of tasks that are computed in parallel; their logic is exactly the same, but each task processes a different partition of the data.

The DAGScheduler submits each stage to the TaskScheduler as a TaskSet.


========== TaskScheduler Working Principle Demystified ==========

1. When DAGScheduler submits a TaskSet to the underlying scheduler, it programs against the TaskScheduler interface. This follows the object-oriented principle of depending on abstractions rather than concrete implementations, which makes the underlying resource scheduler pluggable: Spark can run under many resource managers, such as Standalone, YARN, Mesos, Local, EC2, or other custom resource schedulers. In Standalone mode, the concrete implementation we focus on is TaskSchedulerImpl;
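The pluggability described above can be illustrated with a minimal sketch. The names below are hypothetical stand-ins, not Spark's real classes; the point is only that the caller depends on the trait, so any backend can be swapped in.

```scala
// DAGScheduler-style code depends only on this abstraction,
// so any resource scheduler can be plugged in underneath.
trait TaskScheduler {
  def submitTasks(taskSet: Seq[String]): String
}

// A stand-in for TaskSchedulerImpl in Standalone mode.
class LocalTaskScheduler extends TaskScheduler {
  def submitTasks(taskSet: Seq[String]): String =
    s"running ${taskSet.size} tasks locally"
}

// Written against the trait, not the concrete class: swapping in a
// YARN- or Mesos-style scheduler requires no change to this caller.
def submitStage(scheduler: TaskScheduler, tasks: Seq[String]): String =
  scheduler.submitTasks(tasks)

val result = submitStage(new LocalTaskScheduler, Seq("task0", "task1"))
println(result)
```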

2. When SparkContext is instantiated, it calls createTaskScheduler to create TaskSchedulerImpl and SparkDeploySchedulerBackend:

case SPARK_REGEX(sparkUrl) =>
  val scheduler = new TaskSchedulerImpl(sc)
  val masterUrls = sparkUrl.split(",").map("spark://" + _)
  val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
  scheduler.initialize(backend)
  (backend, scheduler)

In TaskSchedulerImpl's initialize method, the SparkDeploySchedulerBackend is passed in and held by TaskSchedulerImpl. When TaskSchedulerImpl's start method is called, it calls backend.start, and it is inside that start method that the application is eventually registered with the Master;
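The initialize/start handshake just described can be sketched as follows. These are hypothetical demo classes, not Spark's real ones; they only show the delegation order: initialize hands over the backend, and the scheduler's start delegates to backend.start, where registration would happen.

```scala
class DemoBackend {
  var started = false
  // The real backend registers the application with the Master here.
  def start(): Unit = { started = true }
}

class DemoTaskScheduler {
  private var backend: Option[DemoBackend] = None
  // initialize receives and holds the backend...
  def initialize(b: DemoBackend): Unit = { backend = Some(b) }
  // ...and start delegates to backend.start.
  def start(): Unit = backend.foreach(_.start())
}

val backend = new DemoBackend
val scheduler = new DemoTaskScheduler
scheduler.initialize(backend)
scheduler.start()
println(backend.started) // the backend was started via the scheduler
```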

3. TaskScheduler's core job is to submit TaskSets to the cluster for execution and report the results:

1) It creates and maintains a TaskSetManager for each TaskSet and tracks each task's locality and error information;

2) When it encounters a straggler task, it re-launches the task on another node for retry;

3) TaskScheduler must report execution status back to DAGScheduler, including events such as fetch failed when shuffle output is lost;
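The kinds of status events reported back to DAGScheduler can be sketched as a small ADT. This is a simplified, hypothetical model; Spark's real TaskEndReason hierarchy is much richer.

```scala
// Simplified model of task-end reasons reported to DAGScheduler.
sealed trait TaskEndReason
case object TaskSuccess extends TaskEndReason
// Shuffle output was lost on the map side; the stage must be resubmitted.
case class FetchFailed(shuffleId: Int, mapId: Int) extends TaskEndReason

def report(reason: TaskEndReason): String = reason match {
  case TaskSuccess       => "task succeeded"
  case FetchFailed(s, m) => s"fetch failed: shuffle $s map $m output lost, resubmit stage"
}

println(report(TaskSuccess))
println(report(FetchFailed(1, 2)))
```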

4. TaskScheduler internally holds a SchedulerBackend; in Standalone mode, the concrete implementation is SparkDeploySchedulerBackend;

5. When SparkDeploySchedulerBackend starts, it constructs an AppClient instance, and when that instance starts it creates the ClientEndpoint message loop body. On startup, ClientEndpoint registers the current application with the Master. Meanwhile, SparkDeploySchedulerBackend's parent class, CoarseGrainedSchedulerBackend, instantiates a message loop body of type DriverEndpoint at start time (this is the driver, in the classic sense, when we run a program). SparkDeploySchedulerBackend is specifically responsible for collecting resource information from the Workers: when an ExecutorBackend starts, it sends a RegisterExecutor message to DriverEndpoint to register itself. At that point SparkDeploySchedulerBackend knows what computing resources the current application has, and TaskScheduler runs tasks concretely through the computing resources owned by SparkDeploySchedulerBackend;
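The RegisterExecutor handshake can be sketched as follows. The classes and fields here are hypothetical simplifications of the real driver-side endpoint: each starting executor announces itself, the endpoint records its resources in memory, and the scheduler later draws on that total when placing tasks.

```scala
import scala.collection.mutable

// Message an ExecutorBackend sends when it starts (simplified).
case class RegisterExecutor(executorId: String, host: String, cores: Int)

class DemoDriverEndpoint {
  // In-memory record of registered executors and their resources.
  val executors = mutable.Map.empty[String, RegisterExecutor]
  def receive(msg: RegisterExecutor): Unit = executors(msg.executorId) = msg
  // The scheduler consults totals like this when deciding where tasks can run.
  def totalCores: Int = executors.values.map(_.cores).sum
}

val driver = new DemoDriverEndpoint
driver.receive(RegisterExecutor("0", "worker-1", 4))
driver.receive(RegisterExecutor("1", "worker-2", 8))
println(driver.totalCores) // 12
```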

6. SparkContext, DAGScheduler, TaskSchedulerImpl, and SparkDeploySchedulerBackend are each instantiated only once, when the application starts, and these objects exist for the entire lifetime of the application;

Big summary: When SparkContext is instantiated, it calls createTaskScheduler to create TaskSchedulerImpl and SparkDeploySchedulerBackend. During SparkContext's startup, TaskSchedulerImpl's start method is called; that start calls SparkDeploySchedulerBackend's start, which creates an AppClient object and calls the AppClient object's start method. Inside that start method, a ClientEndpoint is created. When the ClientEndpoint is created, a Command is passed in that names the entry class of the executor process to be launched for the current application: CoarseGrainedExecutorBackend. ClientEndpoint then starts and registers the current application with the Master via tryRegisterAllMasters. When the Master receives the registration, if the program can run, it generates a job ID for the program and allocates computing resources through schedule(). The concrete allocation is determined by the application's run configuration: memory, cores, and so on. Finally, the Master sends instructions to the Workers. When a Worker allocates computing resources for the current application, it creates an ExecutorRunner; internally, ExecutorRunner uses a thread to build a ProcessBuilder and launch a separate JVM process. The class whose main method is loaded when that JVM process starts is exactly the class named in the Command passed in when the ClientEndpoint was created: CoarseGrainedExecutorBackend.
In the JVM launched by ProcessBuilder, CoarseGrainedExecutorBackend is loaded and its main method is invoked. In main, CoarseGrainedExecutorBackend is itself instantiated as a message loop body. On instantiation, via its onStart callback, it sends RegisterExecutor to DriverEndpoint to register the current CoarseGrainedExecutorBackend. DriverEndpoint receives the registration and saves it in an in-memory data structure inside the SparkDeploySchedulerBackend instance, and with that the driver has obtained its computing resources.
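The "launch a separate JVM" step above can be sketched as follows. This is an illustrative assembly of the executor command line; the flags and the jar path are hypothetical, not Spark's real ones. Only the main-class name follows the text: it was fixed earlier, when the Command for the ClientEndpoint was created.

```scala
// Build the command line for the separate executor JVM (illustrative flags).
def executorCommand(mainClass: String, driverUrl: String, executorId: Int): Seq[String] =
  Seq("java", "-cp", "app.jar", mainClass,
      "--driver-url", driverUrl, "--executor-id", executorId.toString)

val cmd = executorCommand(
  "org.apache.spark.executor.CoarseGrainedExecutorBackend",
  "spark://driver@host:7077", 0)

// An ExecutorRunner-style component would hand this to java.lang.ProcessBuilder
// in a worker thread:
//   new ProcessBuilder(cmd: _*).start()
println(cmd.mkString(" "))
```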


Homework:

Draw a flowchart of the big summary above.

Liaoliang Teacher's card:

The "first person" of Spark in China

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [Email protected]


This article is from the "A Flower Proud of the Cold" blog; reprinting is declined!

