Key points in this issue:
1. Explore the Spark Streaming job architecture and operating mechanism
2. Examine the Spark Streaming fault-tolerance mechanism
We have in fact discussed Spark Streaming in a previous post: Spark Streaming is a sub-framework that runs on top of Spark Core. Below we use a simple example to explore the Spark Streaming operating mechanism and architecture.
- Spark Streaming operating mechanism and architecture
    // Sina Weibo: http://weibo.com/ilovepains/
    SparkConf conf = new SparkConf().setMaster("spark://master:7077").setAppName("WordCountOnline");
    JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
    JavaReceiverInputDStream<String> lines = jsc.socketTextStream("Master", 9999);

    // Split every received line into words
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String line) throws Exception {
            return Arrays.asList(line.split(" "));
        }
    });

    // Turn every word into a (word, 1) pair
    JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String word) throws Exception {
            return new Tuple2<String, Integer>(word, 1);
        }
    });

    // Sum the counts for each word within the batch
    JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) throws Exception {
            return v1 + v2;
        }
    });

    wordsCount.print();
    jsc.start();
    jsc.awaitTermination();
    jsc.close();
This is a Spark Streaming word count example.
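To make the per-batch logic concrete, here is a minimal plain-Java sketch of what the job computes for a single batch of lines. It does not use Spark at all; `BatchWordCount` and `countWords` are names made up for this illustration, and a sorted map is used only so the printed order is stable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the work done on ONE batch of received lines.
// BatchWordCount and countWords are illustrative names, not Spark APIs.
class BatchWordCount {
    static Map<String, Integer> countWords(List<String> linesInBatch) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : linesInBatch) {
            // Same split as the flatMap step in the example above
            for (String word : line.split(" ")) {
                // Equivalent of mapToPair (word, 1) followed by reduceByKey (v1 + v2)
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("spark streaming", "spark core");
        System.out.println(countWords(batch)); // prints {core=1, spark=2, streaming=1}
    }
}
```

In the real job the same computation is repeated for every 5-second batch, and the work is distributed across the cluster instead of running in one loop.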
In a Spark Streaming program, StreamingContext is the entry point for all of the application's functionality and the core of program scheduling. Let's look at part of the StreamingContext initialization source:
    // StreamingContext.scala, line 183
    private[streaming] val scheduler = new JobScheduler(this)
We can see that when the StreamingContext is constructed, it creates the JobScheduler. The JobScheduler in turn creates the JobGenerator and also declares the receiverTracker variable, as follows:
    // JobScheduler.scala, line 50
    private val jobGenerator = new JobGenerator(this)
    val clock = jobGenerator.clock
    val listenerBus = new StreamingListenerBus()

    // These two are created only when scheduler starts.
    // eventLoop not being null means the scheduler has been started and not stopped
    var receiverTracker: ReceiverTracker = null
Let's look at some of the source code behind jsc.socketTextStream("Master", 9999), which creates the DStream:
    // StreamingContext.scala, line 327
    def socketTextStream(
        hostname: String,
        port: Int,
        storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
      ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
      socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
    }

    // StreamingContext.scala, line 345
    def socketStream[T: ClassTag](
        hostname: String,
        port: Int,
        converter: (InputStream) => Iterator[T],
        storageLevel: StorageLevel
      ): ReceiverInputDStream[T] = {
      new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
    }
From the above we can see that socketTextStream in StreamingContext delegates to socketStream, which finally creates a SocketInputDStream. So let's look at SocketInputDStream:
    private[streaming]
    class SocketInputDStream[T: ClassTag](
        ssc_ : StreamingContext,
        host: String,
        port: Int,
        bytesToObjects: InputStream => Iterator[T],
        storageLevel: StorageLevel
      ) extends ReceiverInputDStream[T](ssc_) {

      def getReceiver(): Receiver[T] = {
        new SocketReceiver(host, port, bytesToObjects, storageLevel)
      }
    }
SocketInputDStream defines getReceiver, the method that supplies the receiver for incoming data. Of course, everything we have seen so far is method definition or object initialization; nothing has actually started executing.
Now let's look at jsc.start(), the method that starts program execution:
    def start(): Unit = synchronized {
      state match {
        case INITIALIZED =>
          startSite.set(DStream.getCreationSite())
          StreamingContext.ACTIVATION_LOCK.synchronized {
            StreamingContext.assertNoOtherContextIsActive()
            try {
              validate()

              // Start the streaming scheduler in a new thread, so that thread local properties
              // like call sites and job groups can be reset without affecting those of the
              // current thread.
              ThreadUtils.runInNewThread("streaming-start") {
                sparkContext.setCallSite(startSite.get)
                sparkContext.clearJobGroup()
                sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
                scheduler.start()
              }
              state = StreamingContextState.ACTIVE
            } catch {
              case NonFatal(e) =>
                logError("Error starting the context, marking it as stopped", e)
                scheduler.stop(false)
                state = StreamingContextState.STOPPED
                throw e
            }
            StreamingContext.setActiveContext(this)
          }
          shutdownHookRef = ShutdownHookManager.addShutdownHook(
            StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
          // Registering Streaming Metrics at the start of the StreamingContext
          assert(env.metricsSystem != null)
          env.metricsSystem.registerSource(streamingSource)
          uiTab.foreach(_.attach())
          logInfo("StreamingContext started")
        case ACTIVE =>
          logWarning("StreamingContext has already been started")
        case STOPPED =>
          throw new IllegalStateException("StreamingContext has already been stopped")
      }
    }
We can see that jsc.start() actually does a lot of work, but we focus on one call: scheduler.start()
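The start method is essentially a guarded state transition: a context can be started once, a second start is harmless, and a stopped context cannot be restarted. The following is a minimal Java sketch of that idea only. `ContextLifecycle` is a hypothetical class written for this post; the three state names and the exception message are taken from the source, everything else is an assumption.

```java
// Minimal model of the three-state lifecycle that start() enforces.
// ContextLifecycle is a hypothetical class, not part of Spark.
class ContextLifecycle {
    enum State { INITIALIZED, ACTIVE, STOPPED }

    private State state = State.INITIALIZED;

    // Returns true only if this call actually performed the start.
    synchronized boolean start() {
        switch (state) {
            case INITIALIZED:
                state = State.ACTIVE;   // the real code starts the scheduler at this point
                return true;
            case ACTIVE:
                return false;           // the real code only logs a warning
            default:
                throw new IllegalStateException("StreamingContext has already been stopped");
        }
    }

    synchronized void stop() {
        state = State.STOPPED;
    }

    synchronized State currentState() {
        return state;
    }

    public static void main(String[] args) {
        ContextLifecycle ctx = new ContextLifecycle();
        System.out.println(ctx.start()); // prints true
        System.out.println(ctx.start()); // prints false
        ctx.stop();
        try {
            ctx.start();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints StreamingContext has already been stopped
        }
    }
}
```

Making both methods synchronized mirrors the `synchronized` on the original start method, so two threads cannot both observe INITIALIZED and start the scheduler twice.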
    def start(): Unit = synchronized {
      if (eventLoop != null) return // scheduler has already been started

      logDebug("Starting JobScheduler")
      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
      }
      eventLoop.start()

      // attach rate controllers of input streams to receive batch completion updates
      for {
        inputDStream <- ssc.graph.getInputStreams
        rateController <- inputDStream.rateController
      } ssc.addStreamingListener(rateController)

      listenerBus.start(ssc.sparkContext)
      // JobScheduler.scala
      receiverTracker = new ReceiverTracker(ssc)
      inputInfoTracker = new InputInfoTracker(ssc)
      receiverTracker.start()
      // JobScheduler.scala, line 83
      jobGenerator.start()
      logInfo("Started JobScheduler")
    }
We can now see that the ReceiverTracker is created in JobScheduler's start method, which then calls the tracker's own start method:
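Both JobScheduler and JobGenerator hand their events to an event loop with onReceive and onError callbacks. The sketch below shows the general pattern in plain Java: one background thread drains a queue and dispatches each event. `SimpleEventLoop` and `EventLoopDemo` are hypothetical classes written for illustration, and the sketch makes no claim to match Spark's EventLoop implementation in detail.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified event loop: a single daemon thread takes events off a queue
// and passes each one to onReceive. Hypothetical stand-in, not Spark code.
abstract class SimpleEventLoop<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Thread thread;
    private volatile boolean stopped = false;

    SimpleEventLoop(String name) {
        thread = new Thread(() -> {
            try {
                while (!stopped) {
                    E event = queue.take();       // blocks until an event is posted
                    try {
                        onReceive(event);
                    } catch (RuntimeException e) {
                        onError(e);               // a failing handler must not kill the loop
                    }
                }
            } catch (InterruptedException e) {
                // stop() interrupts the thread; leave the loop quietly
            }
        }, name);
        thread.setDaemon(true);
    }

    void start() { thread.start(); }

    // Called from any thread; the event is handled later on the loop thread.
    void post(E event) { queue.add(event); }

    void stop() throws InterruptedException {
        stopped = true;
        thread.interrupt();
        thread.join();
    }

    protected abstract void onReceive(E event);

    protected abstract void onError(Throwable e);
}

class EventLoopDemo {
    public static void main(String[] args) throws InterruptedException {
        List<String> handled = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(2);
        SimpleEventLoop<String> loop = new SimpleEventLoop<String>("JobScheduler") {
            @Override protected void onReceive(String event) { handled.add(event); done.countDown(); }
            @Override protected void onError(Throwable e) { System.err.println("Error in event loop: " + e); }
        };
        loop.start();
        loop.post("JobStarted");
        loop.post("JobCompleted");
        done.await();                 // wait until both events were handled
        loop.stop();
        System.out.println(handled);  // prints [JobStarted, JobCompleted]
    }
}
```

Because a single thread consumes the queue, events are processed one at a time in the order they were posted, which keeps the handler code free of its own locking.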
    // ReceiverTracker.scala, line 149
    def start(): Unit = synchronized {
      if (isTrackerStarted) {
        throw new SparkException("ReceiverTracker already started")
      }

      if (!receiverInputStreams.isEmpty) {
        endpoint = ssc.env.rpcEnv.setupEndpoint(
          "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
        if (!skipReceiverLaunch) launchReceivers()
        logInfo("ReceiverTracker started")
        trackerState = Started
      }
    }

    // ReceiverTracker.scala, line 413
    private def launchReceivers(): Unit = {
      val receivers = receiverInputStreams.map(nis => {
        val rcvr = nis.getReceiver()
        rcvr.setReceiverId(nis.id)
        rcvr
      })

      runDummySparkJob()

      logInfo("Starting " + receivers.length + " receivers")
      endpoint.send(StartAllReceivers(receivers))
    }
Now we can see that when StreamingContext runs its start method, it calls JobScheduler's start method. Inside JobScheduler.start, the ReceiverTracker is created and started. When the ReceiverTracker starts, it eventually uses RPC to tell the executor processes on the workers to begin receiving data, and the metadata about the received data is reported back to the driver.
Next, we go back to line 83 of JobScheduler.scala and look at the jobGenerator.start() method:
    // JobGenerator.scala
    def start(): Unit = synchronized {
      if (eventLoop != null) return // generator has already been started

      // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
      // See SPARK-10125
      checkpointWriter

      eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
        override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = {
          jobScheduler.reportError("Error in job generator", e)
        }
      }
      eventLoop.start()

      if (ssc.isCheckpointPresent) {
        restart()
      } else {
        startFirstTime()
      }
    }
At this point we have covered how Spark Streaming starts the ReceiverTracker to receive data and how the JobGenerator generates jobs that run on the cluster.
Of course, reading the source we can also see that thread pools are used in many places. The author believes the biggest advantage is that a pool avoids the cost of creating a new thread for every task and achieves a high degree of thread reuse (the same reasoning as a database connection pool).
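The thread reuse argument can be checked with a small experiment using the standard `java.util.concurrent` API. This is only an illustration of the general idea, not code from Spark; `ThreadReuseDemo` and `runTasks` are names invented for this sketch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ThreadReuseDemo {
    // Runs `tasks` tiny tasks on a fixed pool of `poolSize` threads and returns
    // the names of the threads that actually executed them.
    static Set<String> runTasks(int poolSize, int tasks) throws InterruptedException {
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> threadNames.add(Thread.currentThread().getName()));
        }
        pool.shutdown();                              // accept no new tasks, finish queued ones
        pool.awaitTermination(10, TimeUnit.SECONDS);  // wait for the queued tasks to complete
        return threadNames;
    }

    public static void main(String[] args) throws InterruptedException {
        Set<String> names = runTasks(2, 100);
        // 100 tasks, but never more than 2 distinct worker threads
        System.out.println("100 tasks ran on " + names.size() + " thread(s): " + names);
    }
}
```

However many tasks are submitted, the number of distinct thread names never exceeds the pool size, so thread creation cost is paid at most `poolSize` times instead of once per task.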
- Spark Streaming fault-tolerance mechanism:
Under the hood, a Spark Streaming DStream is a collection of RDDs. Because of this, its fault tolerance relies mainly on two mechanisms: checkpointing, and recovery based on lineage (the record of how each RDD was derived). Of course, if the lineage chain becomes too long and complex, a checkpoint is needed.
This follows from RDD dependencies. If the dependencies within a stage are narrow, lineage-based fault tolerance is generally used, since it is convenient and efficient. If there is a wide dependency between stages, which generally produces a shuffle, then we need to consider checkpointing.
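The difference between the two mechanisms can be shown with a toy model in plain Java. Lineage recovery replays the chain of transformations from the original data, while a checkpoint saves the result and cuts the chain so there is nothing left to replay. `ToyRdd` and all of its members are hypothetical and exist only for this sketch; it is not Spark's RDD API and it ignores partitioning, shuffles, and real storage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model of lineage-based recovery versus checkpointing. Not Spark code.
class ToyRdd {
    private ToyRdd parent;                       // lineage: the dataset this one was derived from
    private Function<Integer, Integer> mapper;   // how each element is derived from the parent
    private List<Integer> saved;                 // source data, or data written by checkpoint()

    private ToyRdd(ToyRdd parent, Function<Integer, Integer> mapper, List<Integer> saved) {
        this.parent = parent;
        this.mapper = mapper;
        this.saved = saved;
    }

    static ToyRdd of(Integer... values) {
        return new ToyRdd(null, null, Arrays.asList(values));
    }

    ToyRdd map(Function<Integer, Integer> f) {
        return new ToyRdd(this, f, null);
    }

    // Lineage recovery: walk back to the nearest saved data and replay every step.
    List<Integer> compute() {
        if (saved != null) {
            return saved;
        }
        List<Integer> out = new ArrayList<>();
        for (Integer x : parent.compute()) {
            out.add(mapper.apply(x));
        }
        return out;
    }

    // Checkpoint: keep the computed result and drop the lineage behind it.
    void checkpoint() {
        saved = compute();
        parent = null;
        mapper = null;
    }

    // Number of transformations that would have to be replayed on recovery.
    int lineageDepth() {
        return parent == null ? 0 : 1 + parent.lineageDepth();
    }
}

class LineageDemo {
    public static void main(String[] args) {
        ToyRdd result = ToyRdd.of(1, 2, 3).map(x -> x + 1).map(x -> x * 10);
        System.out.println(result.compute());      // prints [20, 30, 40]
        System.out.println(result.lineageDepth()); // prints 2
        result.checkpoint();
        System.out.println(result.lineageDepth()); // prints 0
        System.out.println(result.compute());      // prints [20, 30, 40]
    }
}
```

The longer the chain, the more work a replay costs, which is the intuition behind checkpointing long lineages and results that sit behind an expensive shuffle.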
Decrypting the Spark Streaming operating mechanism and architecture, advanced: jobs and fault tolerance (part 3)