Spark Streaming Source Code Interpretation (Lesson 7): Inside the Implementation of JobScheduler, and Some Deeper Thinking


The content of this lecture:

A. JobScheduler inside implementation
B. JobScheduler deeper thinking

Note: This lecture is based on Spark 1.6.1 (the latest version of Spark as of May 2016).

Review of the Previous Lesson

In the last lesson, we took the JobGenerator class as the focal point, extended outward from it to demystify dynamic job generation, and summarized the three cores of dynamic job generation:

A. JobGenerator: responsible for generating jobs

B. JobScheduler: responsible for scheduling jobs

C. ReceiverTracker: responsible for the metadata of the input data

That lesson also gave a diagram of the dynamic job generation process (not reproduced here).

This Lesson

From the last lesson we know:

JobScheduler is the center of all job scheduling in Spark Streaming. It has two important members: JobGenerator, responsible for generating jobs, and ReceiverTracker, responsible for recording information about the input data sources.

Starting JobScheduler causes ReceiverTracker and JobGenerator to start. Starting ReceiverTracker causes the receivers running on the executor side to start and receive data, and ReceiverTracker records the metadata of the data received by the receivers.

Once JobGenerator starts, it calls DStreamGraph every batchDuration to generate the RDD graph and the corresponding jobs.

The thread pool in JobScheduler runs the encapsulated JobSet objects (batch time, jobs, and the metadata of the data sources). The business logic is encapsulated in each job; when a job runs, the action on the last RDD is triggered, and the real job is then scheduled on the Spark cluster by DAGScheduler.

So it can be said that JobScheduler is the core of the entire Spark Streaming scheduling, and its position is equivalent to that of DAGScheduler in Spark Core. We must therefore master JobScheduler thoroughly.

First, there is a diagram tracing JobScheduler's important methods step by step through the source (not reproduced here).

Now let's walk through the code and demystify JobScheduler's internals.

In the previous lessons, when developing Spark Streaming applications we applied various transformations and action-level operations to DStreams. These operations, together with the dependencies between DStreams, constitute the DStream graph. As time goes by, the DStream graph generates an RDD DAG at every batchInterval, and a job is then executed. The DStream graph is the logical level, while the RDD DAG is the physical execution level; DStream is the spatial dimension, and the spatial dimension plus time constitutes the temporal dimension.
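For example, a minimal word-count program like the following (a sketch for illustration; the socket source on localhost:9999 is a placeholder) builds the DStream graph in the spatial dimension, while the actual RDDs come into being batch by batch along the time dimension:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamGraphDemo {
      def main(args: Array[String]): Unit = {
        // At least two threads: one receives data, the other processes jobs
        val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamGraphDemo")
        val ssc = new StreamingContext(conf, Seconds(5)) // batchDuration = 5 seconds

        // Transformations only define the DStream graph (logical level)
        val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

        // The action-level operation registers an output stream; every batch,
        // the graph is instantiated as an RDD DAG (physical level) and run as a job
        counts.print()

        ssc.start()            // starts JobScheduler on a new thread
        ssc.awaitTermination()
      }
    }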

At this point, JobScheduler is what physically runs the logical-level jobs on Spark Core. JobGenerator generates the logical-level jobs, and JobScheduler runs them through its thread pool. JobScheduler is instantiated when StreamingContext is instantiated, and StreamingContext's start method starts it on a new thread.

The code in the curly braces executes as an anonymous function inside that new thread. At runtime a Spark Streaming application needs at least two threads: one loops to receive data, and at least one processes the resulting jobs. The new thread here, which runs scheduler.start(), is opened at the scheduling level; it has no connection to however many threads were specified when the Spark Streaming program was launched.
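For reference, here is a simplified excerpt of what StreamingContext.start does in Spark 1.6.x (abridged; see the actual source for state checks and error handling):

    // Inside StreamingContext.start (Spark 1.6.x, abridged)
    ThreadUtils.runInNewThread("streaming-start") {
      sparkContext.setCallSite(startSite.get)
      sparkContext.clearJobGroup()
      sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
      scheduler.start() // JobScheduler starts on the "streaming-start" thread
    }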

Each thread has its own local properties; the properties set above (such as the call site and the job group) apply to the new thread only and do not affect the main thread.

This style of writing in the source code is well worth learning from. When reading the source, treat Spark as an ordinary application: from the JVM's point of view, Spark is simply a distributed application. Separating the scheduling thread from the job-processing threads in your own projects likewise makes them easier to maintain and optimize.

When JobScheduler is instantiated, it instantiates JobGenerator and the job-executor thread pool.
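The relevant member definitions in Spark 1.6.x look essentially like this (simplified excerpt):

    // Inside JobScheduler (Spark 1.6.x, abridged)
    private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
    private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
    private val jobExecutor =
      ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
    private val jobGenerator = new JobGenerator(this)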

By default the thread pool contains a single thread. Of course, the default number of threads can be changed via the spark.streaming.concurrentJobs property in SparkConf (or in the Spark configuration file); increasing it can, to a certain extent, improve job execution efficiency. This is also a performance-tuning technique, especially when a program contains multiple jobs per batch.
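For example (a sketch; whether more concurrency actually helps depends on the application):

    // Run up to two batch jobs concurrently instead of the default one
    val conf = new SparkConf()
      .setAppName("MyStreamingApp") // placeholder name
      .set("spark.streaming.concurrentJobs", "2")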

JobGenerator is thus created when JobScheduler is instantiated; ReceiverTracker, by contrast, is only created when JobScheduler's start method runs (in 1.6.1 it is declared as a var and instantiated inside start).

Take a look at the JobScheduler.start code:
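Abridged from the 1.6.x source:

    def start(): Unit = synchronized {
      if (eventLoop != null) return // scheduler has already been started

      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
        override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
      }
      eventLoop.start()

      // ... attach rate controllers of input streams (elided) ...

      listenerBus.start(ssc.sparkContext)
      receiverTracker = new ReceiverTracker(ssc)
      inputInfoTracker = new InputInfoTracker(ssc)
      receiverTracker.start()
      jobGenerator.start()
    }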

An EventLoop is also instantiated when JobGenerator's start method is called.

The onStart method is called back inside EventLoop's start method; in general, onStart executes preparatory code. Although JobScheduler does not override the onStart method, the Spark Streaming framework clearly keeps this hook for code extensibility, which is something worth learning from when developing your own projects.
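EventLoop.start in Spark 1.6.x is essentially:

    def start(): Unit = {
      if (stopped.get) {
        throw new IllegalStateException(name + " has already been stopped")
      }
      // Call onStart before starting the event thread to make sure it happens before onReceive
      onStart()
      eventThread.start()
    }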

DStream action-level operations all ultimately call the foreachRDD method, which vividly shows that operations on a DStream are in fact operations on RDDs.

The foreachFunc in the code above is a further encapsulation of the DStream action-level method; the foreachRDD method, in turn, creates a new ForEachDStream.
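In 1.6.x, foreachRDD essentially does the following (simplified):

    def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit = ssc.withScope {
      foreachRDD(foreachFunc, displayInnerRDDOps = true)
    }

    private def foreachRDD(
        foreachFunc: (RDD[T], Time) => Unit,
        displayInnerRDDOps: Boolean): Unit = {
      new ForEachDStream(this,
        context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
    }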

Note (from the source comment): this function is applied to each RDD in this DStream. This is an output operation, so this DStream will be registered as an output stream and thereby materialized.

One of the most important methods in ForEachDStream is generateJob. Considering the time dimension and the action level, a job is generated via generateJob for every batch duration. foreachFunc(rdd, time) is the final operation on the DStream; new Job(time, jobFunc) merely adds the encapsulation of the time dimension on top of the RDD. The Job here is just an ordinary object that represents a Spark computation; the real job is triggered only when the Job's run method is called. The rdd in foreachFunc(rdd, time) is in fact determined by the last DStream in the DStreamGraph.

Take a look at the ForEachDStream.generateJob code:
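From the 1.6.x source:

    override def generateJob(time: Time): Option[Job] = {
      parent.getOrCompute(time) match {
        case Some(rdd) =>
          val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
            foreachFunc(rdd, time)
          }
          Some(new Job(time, jobFunc))
        case None => None
      }
    }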

The job is generated through ForEachDStream's generateJob, and it is worth noting that ForEachDStream is the only DStream subclass that overrides the generateJob method.

Now think about who calls ForEachDStream's generateJob method. It is, of course, JobGenerator. ForEachDStream's generateJob method is static, logical-level code; it needs JobGenerator in order to really run and become physical-level execution.

Now take a look at the JobGenerator code. JobGenerator has a timer and a message loop, EventLoop. Based on the batchInterval, the timer keeps sending GenerateJobs messages to the EventLoop, which leads to the execution of the processEvent method and then the generateJobs method.
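The timer and the event dispatch in 1.6.x look essentially like this (simplified):

    private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
      longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

    private def processEvent(event: JobGeneratorEvent) {
      event match {
        case GenerateJobs(time) => generateJobs(time)
        case ClearMetadata(time) => clearMetadata(time)
        case DoCheckpoint(time, clearCheckpointDataLater) =>
          doCheckpoint(time, clearCheckpointDataLater)
        case ClearCheckpointData(time) => clearCheckpointData(time)
      }
    }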

Code of the generateJobs method:
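Abridged from the 1.6.x source:

    private def generateJobs(time: Time) {
      // Set the SparkEnv in this thread, so that job generation code can access it
      SparkEnv.set(ssc.env)
      Try {
        jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
        graph.generateJobs(time) // generate jobs using allocated blocks
      } match {
        case Success(jobs) =>
          val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
          jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
        case Failure(e) =>
          jobScheduler.reportError("Error generating jobs for time " + time, e)
      }
      eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
    }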

The code of the graph.generateJobs(time) method:
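From the 1.6.x DStreamGraph source (simplified):

    def generateJobs(time: Time): Seq[Job] = {
      val jobs = this.synchronized {
        outputStreams.flatMap { outputStream =>
          val jobOption = outputStream.generateJob(time)
          jobOption.foreach(_.setCallSite(outputStream.creationSite))
          jobOption
        }
      }
      jobs
    }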

The outputStream in outputStream.generateJob(time) is precisely the ForEachDStream discussed above, so the generateJob(time) being called is ForEachDStream's generateJob(time) method.

This is where the time dimension calls into the spatial dimension; the combination of time and space thus turns into physical execution.

Returning to the generateJobs method of JobGenerator shown above:

After the jobs are generated via graph.generateJobs, they are encapsulated into a JobSet and submitted to JobScheduler: JobSet(time, jobs, streamIdToInputInfos), where streamIdToInputInfos is the metadata of the received data.

A JobSet represents the batch of jobs within one batch duration. It is an ordinary object that contains information such as the jobs not yet submitted, the submission time, and the processing start and end times.

After the JobSet is submitted to JobScheduler, it is put into the jobSets data structure, jobSets.put(jobSet.time, jobSet), so JobScheduler holds the JobSet of every batch; its jobs are then executed in the thread pool.
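submitJobSet in the 1.6.x source (simplified):

    def submitJobSet(jobSet: JobSet) {
      if (jobSet.jobs.isEmpty) {
        logInfo("No jobs added for time " + jobSet.time)
      } else {
        listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
        jobSets.put(jobSet.time, jobSet)
        jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
        logInfo("Added jobs for time " + jobSet.time)
      }
    }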

When a job is put into the thread pool, it is wrapped in a JobHandler. JobHandler is an implementation of the Runnable interface.
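Abridged from the 1.6.x source (local-property bookkeeping elided):

    private class JobHandler(job: Job) extends Runnable with Logging {
      def run() {
        var _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
          // Disable checks for existing output directories in jobs launched by the
          // streaming scheduler, since output may go to an existing directory
          // during checkpoint recovery (see SPARK-4835)
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            job.run() // triggers the DStream action-level function
          }
          _eventLoop = eventLoop
          if (_eventLoop != null) {
            _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
          }
        }
      }
    }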

The main code is job.run(); as said earlier, job.run() is what triggers the DStream action-level method.

JobStarted and JobCompleted messages are posted before and after job.run(). When JobScheduler receives these two messages, it merely records the times and notifies listeners that the job has started or finished; it does not do much else.

Note:
1. DT Big Data Dream Factory public account: Dt_spark
2. Spark expert: Liaoliang
3. Sina Weibo: Http://www.weibo.com/ilovepains
