Contents of this issue:
1. JobScheduler inner workings
2. JobScheduler deep thinking
JobScheduler is the scheduling core of Spark Streaming; its importance is comparable to that of the DAGScheduler, the scheduling center of Spark Core!
Every batch duration, JobGenerator dynamically generates a JobSet and submits it to JobScheduler. When JobScheduler receives the JobSet, how does it handle it?
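As background: the generateJobs method shown below is not called directly. In the Spark 1.x JobGenerator, a RecurringTimer posts a GenerateJobs event once per batch duration, and the event loop then invokes generateJobs(time). A sketch of that timer (simplified from the Spark 1.x source):

// JobGenerator (Spark 1.x, sketch): fires once per batchDuration and posts a
// GenerateJobs event; the event loop dispatches it to generateJobs(time).
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")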
Create Job
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated blocks
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
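The DoCheckpoint event posted at the end of generateJobs goes back to JobGenerator's own event loop. A sketch of that dispatch, based on the Spark 1.x source, showing where GenerateJobs and DoCheckpoint end up:

// JobGenerator.processEvent (Spark 1.x, sketch): the event loop dispatches
// GenerateJobs to generateJobs(time) and DoCheckpoint to doCheckpoint(...).
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}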
Processing the resulting JobSet
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
This generates a new JobHandler for each job and hands it to jobExecutor to run.
The key processing logic here is jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job))): each job is wrapped in a new JobHandler and executed on the jobExecutor thread pool.
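How many jobs can run at once depends on the size of that thread pool. A sketch based on the Spark 1.x JobScheduler (the exact form varies across versions): jobExecutor is a fixed-size daemon pool sized by the spark.streaming.concurrentJobs configuration, which defaults to 1.

// JobScheduler (Spark 1.x, sketch): the thread pool that runs JobHandlers.
// With the default spark.streaming.concurrentJobs = 1, the jobs of each batch
// are executed one at a time, in submission order.
private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor =
  ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

Raising spark.streaming.concurrentJobs lets jobs from different batches overlap, at the cost of weaker ordering guarantees between batches.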
Let's take a look at JobHandler's main processing logic for the job:
var _eventLoop = eventLoop
if (_eventLoop != null) {
  _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
  // Disable checks for existing output directories in jobs launched by the streaming
  // scheduler, since we may need to write output to an existing directory during checkpoint
  // recovery; see SPARK-4835 for more details.
  PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
    job.run()
  }
  _eventLoop = eventLoop
  if (_eventLoop != null) {
    _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
  }
}
In other words, besides recording some state, the most important thing JobHandler does is call job.run()! This matches our earlier analysis of how DStreams generate RDD instances: ForEachDStream.generateJob(time) defines the job's processing logic, i.e. it defines job.func. And here in JobHandler, job.run() is actually called, which triggers the real execution of job.func!
def run() {
  _result = Try(func())
}
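To connect this back to the DStream analysis, here is a simplified sketch of ForEachDStream.generateJob from the Spark 1.x source (call-site bookkeeping elided): the func that Job.run() invokes is exactly the closure wrapping the output operation foreachFunc.

// ForEachDStream (Spark 1.x, simplified sketch): generateJob wraps the output
// operation foreachFunc into jobFunc; Job.run() above invokes this closure,
// which is what finally triggers the RDD action on the cluster.
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => foreachFunc(rdd, time)
      Some(new Job(time, jobFunc))
    case None => None
  }
}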
Reference Blog: http://lqding.blog.51cto.com/9123978/1773391
Note:
Source: DT_Big Data Dream Factory (Spark release version customization)
For more exclusive content, please follow the public WeChat account: DT_Spark
If you are interested in big data and Spark, you are welcome to attend the permanent free public Spark classes given by teacher Liaoliang every night; YY room number: 68917580
This article is from the "DT_Spark Big Data DreamWorks" blog; please be sure to keep this source: http://18610086859.blog.51cto.com/11484530/1775258
(Version Customization) Lesson 7: Spark Streaming Source Code Interpretation: JobScheduler Inner Workings and Deep Thinking