Spark Version Customization (7): Spark Streaming Source Code Interpretation: JobScheduler Inner Implementation and In-Depth Thinking


Contents of this issue:

1. JobScheduler inner implementation

2. JobScheduler in-depth thinking

Abstract: JobScheduler is the core of Spark Streaming's entire scheduling; its role is equivalent to that of DAGScheduler in the scheduling center of Spark Core.

First, JobScheduler Inner Implementation

Q: Where is JobScheduler created?

A: JobScheduler is created when StreamingContext is instantiated, as can be seen at line 183 of the StreamingContext source:

private[streaming] val scheduler = new JobScheduler(this)

Q: Why does Spark streaming set two threads?
A: The two threads specified by Setmaster refer to at least two threads when the program is running. A thread is used to receive data, which requires constant looping. The other is the processing thread, which is the number of threads we specify for job processing. As shown in the Start () method of StreamingContext:

def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()

          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          // Spark Streaming internally starts this new thread to schedule the entire job
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
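To make the two-thread requirement concrete, here is a minimal, hedged sketch of a driver program (the app name, host, and port are illustrative, not from the original lecture). With local[2], one thread is permanently occupied by the socket receiver and the other processes each batch; with local[1] the receiver would starve job processing and no output would ever appear.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoThreadsSketch {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, the rest for batch processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("TwoThreadsSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The socket receiver occupies one thread for the lifetime of the application
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print() // handled by the remaining thread(s) once per batch

    ssc.start()            // internally calls scheduler.start() on the "streaming-start" thread
    ssc.awaitTermination()
  }
}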

Now enter the JobScheduler source code:

JobScheduler is responsible for jobs at the logical level and runs them at the physical level on top of Spark.

Q: Why would you want to raise the degree of parallelism (spark.streaming.concurrentJobs, default 1)?

A: An application can have multiple output operations, which produces multiple jobs within a single batchDuration. Those jobs do not have to wait for one another, so raising this value lets them execute concurrently. Across different batches the thread pool also has multiple threads, so jobs of different batches can run concurrently as well. (A small configuration sketch follows the listing below.)

/**
 * This class schedules the jobs to be run on Spark. It uses the JobGenerator to generate
 * the jobs and runs them using a thread pool.
 */
private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {

  // The JobSet collection continuously stores the received jobs
  private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]

  // Degree of parallelism for running jobs, default 1; to change it, set this value
  // in spark-conf or in the application
  private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)

  // The thread pool that turns the logical-level job into physical-level execution,
  // built with newDaemonFixedThreadPool
  private val jobExecutor =
    ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

  // Instantiation of JobGenerator
  private val jobGenerator = new JobGenerator(this)
  val clock = jobGenerator.clock
  val listenerBus = new StreamingListenerBus()

  // The following are instantiated only when JobScheduler starts.
  // eventLoop not being null means the scheduler has been started and not stopped
  var receiverTracker: ReceiverTracker = null
  // A tracker to track all the input stream information as well as processed record number
  var inputInfoTracker: InputInfoTracker = null

  private var eventLoop: EventLoop[JobSchedulerEvent] = null

  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start(ssc.sparkContext)
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)
    receiverTracker.start()
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }
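As a hedged illustration of the spark.streaming.concurrentJobs setting discussed above (the master URL and the value 2 are assumptions for the example, and raising the value only helps when a batch's output jobs are independent of one another), the property can be set on the SparkConf before the StreamingContext is created, for example inside the driver's main method:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative configuration only
val conf = new SparkConf()
  .setMaster("local[4]")                       // extra threads for the receiver and concurrent jobs
  .setAppName("ConcurrentJobsSketch")
  .set("spark.streaming.concurrentJobs", "2")  // allow two jobs of the same batch to run in parallel
val ssc = new StreamingContext(conf, Seconds(5))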
Second, JobScheduler In-Depth Thinking

Below we work backwards from the application's output method print() to trace how the job is generated:

1. Click the print() method in the application to jump into DStream's print() method:

def print(): Unit = ssc.withScope {
  print(10)
}

2. Click the print(num) call shown above to go one level deeper:

/**
 * Print the first num elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and there materialized.
 */
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}

From the highlighted code we can conclude that Spark Streaming's logical-level operations are ultimately executed as operations on the RDDs generated for each batch.

3. Click foreachRDD in the code above to enter the private foreachRDD method:

/**
 * Apply a function to each RDD in this DStream. This is an output operator, so
 * 'this' DStream will be registered as an output stream and therefore materialized.
 * @param foreachFunc foreachRDD function
 * @param displayInnerRDDOps Whether the detailed callsites and scopes of the RDDs generated
 *                           in the `foreachFunc` to be displayed in the UI. If `false`, then
 *                           only the scopes and callsites of `foreachRDD` will override those
 *                           of the RDDs on display.
 */
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}

4. Click ForEachDStream to enter the ForEachDStream class and find its generateJob method:

/**
 * An internal DStream used to represent output operations like DStream.foreachRDD.
 * @param parent             Parent DStream
 * @param foreachFunc        Function to apply on each RDD generated by the parent DStream
 * @param displayInnerRDDOps Whether the detailed callsites and scopes of the RDDs generated
 *                           by `foreachFunc` will be displayed in the UI; only the scope and
 *                           callsite of `DStream.foreachRDD` will be displayed.
 */
private[streaming]
class ForEachDStream[T: ClassTag] (
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[Unit]] = None

  // Jobs are generated continuously, once per batch interval
  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          // rdd is the RDD for this batch time; since this is the output operation it is the
          // last RDD, so finding where ForEachDStream's generateJob is called tells us where
          // the job is finally generated
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
}
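To connect this back to user code: the (rdd, time) closure that an application passes to foreachRDD is exactly the foreachFunc that ForEachDStream wraps into jobFunc for every batch. A hedged sketch, assuming lines is a DStream[String] as in the earlier example:

lines.foreachRDD { (rdd, time) =>
  // This closure becomes ForEachDStream's foreachFunc; generateJob wraps it into a
  // jobFunc per batch time, and nothing runs until that job is executed
  val count = rdd.count() // the RDD action that physically triggers the batch computation
  println(s"Batch at $time contained $count records")
}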

5. In the previous lecture we derived the following call chain:

StreamingContext.start --> JobScheduler.start --> ReceiverTracker.start() --> JobGenerator.start() --> EventLoop --> processEvent() --> generateJobs() --> jobScheduler.receiverTracker.allocateBlocksToBatch(time) --> graph.generateJobs(time)

The final call, graph.generateJobs, is a method on DStreamGraph; enter it:

def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    // At this point the outputStream is a ForEachDStream
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}

private val outputStreams = new ArrayBuffer[DStream[_]]()

By examining DStream's subclass hierarchy together with the generateJob method of ForEachDStream shown above, we can conclude that among DStream's subclasses, only ForEachDStream overrides DStream's generateJob.
The final conclusion: the real Job is generated by ForEachDStream's generateJob. At this point the job is at the logical level; it is actually invoked at the physical level inside JobGenerator, in its generateJobs method:
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated blocks
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}

Next, enter the JobScheduler.submitJobSet method:

// The thread pool that turns the logical-level job into physical-level execution
private val jobExecutor =
  ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
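The following self-contained sketch is not Spark source; all names in it are invented. It only illustrates the idea behind submitJobSet: a logical job is a batch time plus a deferred closure, and it becomes physical work only when a fixed thread pool runs it, much as JobHandler instances run on the streaming-job-executor pool.

import java.util.concurrent.{Executors, TimeUnit}

// A logical job: nothing executes until run() is invoked
case class LogicalJob(batchTimeMs: Long, func: () => Unit) {
  def run(): Unit = func()
}

object JobPoolSketch {
  def main(args: Array[String]): Unit = {
    // Analogous to newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
    val jobExecutor = Executors.newFixedThreadPool(1)

    val jobs = Seq(
      LogicalJob(1000L, () => println("batch 1000: the RDD action would fire here")),
      LogicalJob(2000L, () => println("batch 2000: the RDD action would fire here"))
    )

    // Analogous to jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    jobs.foreach(job => jobExecutor.execute(new Runnable {
      override def run(): Unit = job.run()
    }))

    jobExecutor.shutdown()
    jobExecutor.awaitTermination(10, TimeUnit.SECONDS)
  }
}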
   

At this point the entire process of job generation and execution is clear. To summarize:

From the previous lecture we know that JobScheduler contains two core components, JobGenerator and ReceiverTracker, responsible respectively for generating jobs and receiving data from the sources.

When ReceiverTracker starts, it launches the Receivers running on the executors to receive data, and it records the metadata of the data the Receivers receive.

When JobGenerator starts, it triggers, every batchDuration, a call to DStreamGraph to generate the RDD graph and the corresponding jobs.

The thread pool in JobScheduler then executes the submitted JobSet (which encapsulates the batch time, the jobs, and the metadata of the input sources). The business logic is wrapped inside each job, and running a job triggers the action on the last RDD of the batch;

from there the job is actually scheduled on the Spark cluster by DAGScheduler.
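To tie the summary together, here is a hedged end-to-end word-count sketch (host, port, and batch interval are illustrative), with comments mapping each line to the components discussed above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PipelineSummarySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PipelineSummarySketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // batchDuration = 5 seconds

    // ReceiverTracker starts a receiver on an executor for this input stream and
    // records the metadata of the blocks it receives
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // print() registers a ForEachDStream as an output stream; every batchDuration,
    // JobGenerator calls DStreamGraph.generateJobs(time), and the resulting job wraps
    // a take() on this batch's final RDD
    counts.print()

    ssc.start()            // JobScheduler.start() -> ReceiverTracker.start(), JobGenerator.start()
    ssc.awaitTermination() // each job runs on the streaming-job-executor pool, and the RDD
                           // action inside it is scheduled on the cluster by DAGScheduler
  }
}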

Special thanks to teacher Liaoliang for his unique explanation.

Teacher Liaoliang's contact card:

The first person of Spark in China

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

QQ: 1740415547

YY classroom: live teaching daily at 20:00, channel 68917580

