Contents of this issue:
Understanding the overall architecture and operating mechanism of Spark Streaming jobs is critical to mastering Spark Streaming. We know that in a typical Spark application, an RDD action triggers a job. So how are jobs triggered in Spark Streaming? When we write a Spark Streaming program, we set a batchDuration, and a job is triggered automatically every batchDuration interval. The Spark Streaming framework provides a timer for this: as soon as the interval elapses, the program is submitted to Spark and runs as a Spark job.
This involves two different notions of "job":
Each batchInterval produces a specific job. The job here is not the job referred to in Spark Core; it is only the DAG of RDDs generated from the DStreamGraph and, from a Java perspective, is equivalent to a Runnable instance. To actually run, the job must be submitted to the JobScheduler, which takes a separate thread from its thread pool and submits the job to the cluster there (in fact, it is the RDD action executed in that thread that triggers the real job). Why use a thread pool?
a) Jobs are generated continuously, so a thread pool is needed for efficiency; this is similar to how tasks are executed through a thread pool inside an Executor;
b) The jobs may be configured with the FAIR scheduling policy, which also requires multi-threading support.
The Spark job submitted by the streaming job itself. From this point of view, there is no difference between this job and a job in Spark Core.
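The first notion of job above (a Runnable handed to the JobScheduler's thread pool) can be illustrated with a minimal sketch. This is not Spark's actual code: StreamingJob and SimpleJobScheduler are hypothetical names, and a plain JDK thread pool stands in for the internal job executor. The point is only that the batch "job" is deferred work that runs when a pool thread invokes it.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch, not Spark's classes: a streaming "job" is a deferred
// computation; the scheduler hands it to a thread pool, and only when a pool
// thread runs the body does the real (RDD-action-like) work happen.
final case class StreamingJob(time: Long, body: () => Unit) extends Runnable {
  override def run(): Unit = body()
}

final class SimpleJobScheduler(numThreads: Int) {
  private val pool = Executors.newFixedThreadPool(numThreads)
  def submit(job: StreamingJob): Unit = pool.execute(job)
  def shutdown(): Unit = {
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}

val completed = new AtomicInteger(0)
val scheduler = new SimpleJobScheduler(2)
// One job per (simulated) batch interval; the "action" is just a counter here.
(1 to 5).foreach(t => scheduler.submit(StreamingJob(t, () => completed.incrementAndGet())))
scheduler.shutdown()
println(completed.get()) // all 5 batch jobs ran
```

The multi-threaded pool mirrors reason a) above: batches keep arriving, so jobs must not queue behind a single thread.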
Let's look at the job run process:
1. First instantiate SparkConf and set the runtime parameters.
val conf = new SparkConf().setAppName("UpdateStateByKeyDemo")
2. Instantiate StreamingContext, setting the batchDuration interval that controls the frequency of job generation; this creates the entry point for Spark Streaming execution.
val ssc = new StreamingContext(conf, Seconds(20))
3. While StreamingContext is being instantiated, JobScheduler and JobGenerator are instantiated.
Line 183 of StreamingContext.scala:
private[streaming] val scheduler = new JobScheduler(this)
Line 50 of JobScheduler.scala:
private val jobGenerator = new JobGenerator(this)
4. StreamingContext calls the start method.
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
5. Within StreamingContext.start(), start the JobScheduler.
scheduler.start()
JobScheduler.start() instantiates an EventLoop and calls eventLoop.start() to begin the message loop.
JobScheduler.start() also constructs the ReceiverTracker and calls the start methods of JobGenerator and ReceiverTracker:
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
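The EventLoop started above can be sketched in plain Scala. SimpleEventLoop below is a hypothetical stand-in, not Spark's EventLoop class: a dedicated thread drains a blocking queue and dispatches each event to onReceive, so posting an event never blocks on its processing.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the event-loop pattern JobScheduler relies on.
// A single consumer thread takes events off a queue; a None acts as a
// poison pill that ends the loop.
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[Option[E]]()
  private val thread = new Thread(name) {
    override def run(): Unit = {
      var running = true
      while (running) queue.take() match {
        case Some(event) => onReceive(event)
        case None        => running = false // poison pill: stop the loop
      }
    }
  }
  protected def onReceive(event: E): Unit
  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(Some(event))
  def stop(): Unit = { queue.put(None); thread.join() }
}

sealed trait SchedulerEvent
case class JobStarted(id: Int) extends SchedulerEvent

val seen = ArrayBuffer.empty[Int]
val loop = new SimpleEventLoop[SchedulerEvent]("JobScheduler") {
  override protected def onReceive(event: SchedulerEvent): Unit = event match {
    case JobStarted(id) => seen += id
  }
}
loop.start()
(1 to 3).foreach(i => loop.post(JobStarted(i)))
loop.stop() // drains the queue, then joins the consumer thread
println(seen.mkString(",")) // 1,2,3
```

Because there is a single consumer thread and a FIFO queue, events are processed strictly in posting order, which is why the real JobScheduler can use one event loop to serialize its state changes.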
6. After starting, JobGenerator continuously generates jobs based on batchDuration:
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
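The timer that drives generateJobs at every batch boundary can be sketched as follows. This is a hypothetical model, not Spark's RecurringTimer: a scheduled executor fires once per batchDuration and emits a GenerateJobs(time) event whose batch time is aligned to multiples of the interval.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the timer behind JobGenerator (not Spark code):
// every batchDuration a GenerateJobs(time) event is produced.
case class GenerateJobs(time: Long)

val batchDurationMs = 50L
val batches = ArrayBuffer.empty[GenerateJobs]
val produced = new CountDownLatch(4)

val timer = Executors.newSingleThreadScheduledExecutor()
timer.scheduleAtFixedRate(new Runnable {
  private var batchTime = 0L
  override def run(): Unit = {
    batchTime += batchDurationMs // next batch boundary
    batches.synchronized { batches += GenerateJobs(batchTime) }
    produced.countDown()
  }
}, batchDurationMs, batchDurationMs, TimeUnit.MILLISECONDS)

produced.await() // wait until four batches have been generated
timer.shutdownNow()

// Batch times are multiples of the batch duration, independent of wall-clock jitter.
val firstFour = batches.synchronized(batches.take(4).map(_.time).toList)
println(firstFour.mkString(","))
```

Each tick corresponds to one JobSet in the real JobGenerator; the interval you pass to Seconds(...) when constructing StreamingContext is what determines this tick rate.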
7. ReceiverTracker first starts the receivers in the Spark cluster (it actually starts ReceiverSupervisors in the Executors). After a Receiver receives data, it stores the data on the Executor via the ReceiverSupervisor and sends the data's metadata to the ReceiverTracker on the Driver. Internally, ReceiverTracker manages the received metadata through ReceivedBlockTracker.
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
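The metadata bookkeeping described in step 7 can be modeled with a small sketch. SimpleReceivedBlockTracker and BlockMeta are hypothetical names, not Spark's ReceivedBlockTracker: receivers report stored blocks' metadata to the driver, and each batch boundary binds the pending blocks to that batch so job generation knows which data to process.

```scala
import scala.collection.mutable

// Hypothetical sketch of driver-side block metadata tracking (not Spark code).
case class BlockMeta(streamId: Int, blockId: String, numRecords: Long)

class SimpleReceivedBlockTracker {
  private val unallocated = mutable.ArrayBuffer.empty[BlockMeta]
  private val byBatch = mutable.Map.empty[Long, Seq[BlockMeta]]

  // Called when a ReceiverSupervisor reports a block it stored on an Executor.
  def addBlock(meta: BlockMeta): Unit = synchronized { unallocated += meta }

  // Called at each batch boundary: all pending blocks belong to this batch.
  def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
    byBatch(batchTime) = unallocated.toList
    unallocated.clear()
  }

  def blocksOfBatch(batchTime: Long): Seq[BlockMeta] = synchronized {
    byBatch.getOrElse(batchTime, Seq.empty)
  }
}

val tracker = new SimpleReceivedBlockTracker
tracker.addBlock(BlockMeta(0, "input-0-1", 100))
tracker.addBlock(BlockMeta(0, "input-0-2", 50))
tracker.allocateBlocksToBatch(1000L)
println(tracker.blocksOfBatch(1000L).map(_.numRecords).sum) // 150
```

This is the same allocateBlocksToBatch step that generateJobs calls before building the JobSet: data received between two ticks is frozen into the batch at the tick.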
Two. Spark Streaming's fault-tolerance mechanism:
We know that the relationship between DStream and RDD is that RDDs are created continually over time; operations on a DStream are really operations on the RDDs at fixed times. So, in a sense, Spark Streaming's DStream-based fault tolerance is actually fault tolerance for each individual RDD, and this is the genius of Spark Streaming's design.
The fault tolerance of Spark Streaming should be considered from two angles:
Recovery of the Driver from run-time failure
Checkpoints record the state of the Driver at runtime; after a failure, the checkpoint is read and the Driver state is restored.
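The write-then-restore cycle above can be sketched minimally. DriverState and the text encoding here are hypothetical, not Spark's Checkpoint format: the point is only that the driver persists its bookkeeping durably, and a restarted driver reads it back instead of starting from scratch.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical driver bookkeeping (not Spark's Checkpoint class).
case class DriverState(lastBatchTime: Long, pendingBatches: List[Long])

// Persist the state as a single line: "<lastBatchTime>;<t1>,<t2>,..."
def writeCheckpoint(file: File, state: DriverState): Unit = {
  val w = new PrintWriter(file)
  try w.print(s"${state.lastBatchTime};${state.pendingBatches.mkString(",")}")
  finally w.close()
}

// What a restarted driver would do on recovery: read and parse the state back.
def readCheckpoint(file: File): DriverState = {
  val src = Source.fromFile(file)
  val text = try src.mkString finally src.close()
  val sep = text.indexOf(';')
  val pendingStr = text.substring(sep + 1)
  val pending =
    if (pendingStr.isEmpty) Nil else pendingStr.split(",").map(_.toLong).toList
  DriverState(text.substring(0, sep).toLong, pending)
}

val file = File.createTempFile("checkpoint", ".txt")
writeCheckpoint(file, DriverState(2000L, List(2500L, 3000L)))
val recovered = readCheckpoint(file)
file.delete()
println(recovered == DriverState(2000L, List(2500L, 3000L))) // true
```

In real Spark Streaming the checkpoint additionally captures the DStreamGraph and configuration, and is written to a fault-tolerant filesystem such as HDFS rather than a local temp file.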
Recovery of a specific job from run-time failure
This involves failure recovery of the Receiver as well as recovery from RDD computation failures. The Receiver can be protected by writing a WAL (write-ahead log). Fault tolerance for RDDs is provided by Spark Core and is based on the properties of RDDs themselves; its mechanisms are mainly of two kinds:
01. Based on checkpoint:
Between stages there are wide dependencies, which produce shuffle operations; the lineage chain becomes too complex and lengthy, so at this point a checkpoint is needed.
02. Based on lineage (descent):
In general, Spark chooses lineage-based fault tolerance because checkpointing large datasets is expensive. Considering RDD dependencies, each stage is internally narrow-dependent, so within a stage lineage-based fault tolerance is generally used; it is convenient and efficient.
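The two recovery paths can be contrasted with a small model. Source, Mapped, and Checkpointed are hypothetical types, not Spark classes: lineage recovery replays the parent chain to rebuild lost data, while checkpoint recovery reads materialized data and truncates the chain.

```scala
// Hypothetical model of the two fault-tolerance mechanisms (not Spark code).
sealed trait Dataset { def compute(): Seq[Int] }

case class Source(data: Seq[Int]) extends Dataset {
  def compute(): Seq[Int] = data
}
case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
  // Lineage-based recovery: rerun f over the recomputed parent.
  def compute(): Seq[Int] = parent.compute().map(f)
}
case class Checkpointed(materialized: Seq[Int]) extends Dataset {
  // Checkpoint-based recovery: read saved data; no parent chain to replay.
  def compute(): Seq[Int] = materialized
}

// Same logical result, two recovery strategies:
val viaLineage = Mapped(Mapped(Source(Seq(1, 2, 3)), _ * 2), _ + 1)
val viaCheckpoint = Mapped(Checkpointed(Seq(2, 4, 6)), _ + 1)
println(viaLineage.compute().mkString(","))    // 3,5,7
println(viaCheckpoint.compute().mkString(",")) // 3,5,7
```

The trade-off mirrors the summary below: replaying a short narrow-dependency chain is cheap, while a checkpoint pays storage cost once to avoid replaying a long or shuffle-crossing chain.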
Summary: within a stage, rely on lineage; between stages, use checkpoint.
Note:
1. DT Big Data Dream Factory WeChat public account: Dt_spark
2. IMF 8 pm big data hands-on YY live channel number: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains
Lesson 3: Thoroughly understanding Spark Streaming: decrypting the Spark Streaming job operation mechanism and architecture, advanced topics on jobs and fault tolerance