Lesson 11: Spark Streaming's ReceiverTracker on the driver: architecture design and concrete implementation, interpreted from the source code


In the last lesson we looked at how the Receiver continuously receives data and reports the metadata of the received data to the ReceiverTracker. In this lesson we look at the ReceiverTracker's concrete functions and their implementation.

The main functions of the ReceiverTracker are as follows (a simplified sketch of the message protocol behind them follows the list):

    1. Start the receivers on the executors.

    2. Stop the receivers.

    3. Update the rate at which each receiver receives data (i.e., rate limiting).

    4. Continuously monitor the receivers and restart a receiver whenever it stops running. This is the receiver fault-tolerance function.

    5. Accept receiver registration.

    6. Use the ReceivedBlockTracker to manage the metadata of the data received by the receivers.

    7. Report error messages sent by the receivers.
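
All of these functions are driven by messages exchanged between the receivers (on the executors) and the ReceiverTracker (on the driver). The following is a minimal, self-contained sketch modeled on that message protocol; the case-class names mirror Spark's internal ReceiverTrackerMessage types, but the fields are simplified for illustration and this is not the actual Spark source.

sealed trait TrackerMessage

// Messages sent from the receiver side (executor) to the tracker (driver).
case class RegisterReceiver(streamId: Int, typ: String, host: String, executorId: String) extends TrackerMessage
case class AddBlock(streamId: Int, numRecords: Long, blockId: String) extends TrackerMessage
case class ReportError(streamId: Int, message: String, error: String) extends TrackerMessage
case class DeregisterReceiver(streamId: Int, message: String, error: String) extends TrackerMessage

// Local messages the tracker sends to its own endpoint.
case class StartAllReceivers(streamIds: Seq[Int]) extends TrackerMessage
case class RestartReceiver(streamId: Int) extends TrackerMessage
case class UpdateReceiverRateLimit(streamId: Int, newRate: Long) extends TrackerMessage

object TrackerMessageDemo extends App {
  // Handling a message is just pattern matching on the sealed trait.
  def handle(msg: TrackerMessage): String = msg match {
    case RegisterReceiver(id, typ, host, _) => s"receiver $id ($typ) registered on $host"
    case AddBlock(id, n, blockId)           => s"stream $id reported block $blockId with $n records"
    case ReportError(id, m, _)              => s"stream $id reported error: $m"
    case DeregisterReceiver(id, m, _)       => s"receiver $id deregistered: $m"
    case StartAllReceivers(ids)             => s"starting receivers for streams ${ids.mkString(",")}"
    case RestartReceiver(id)                => s"restarting receiver $id"
    case UpdateReceiverRateLimit(id, r)     => s"stream $id rate limit updated to $r records/sec"
  }

  println(handle(AddBlock(0, 100L, "input-0-0")))
}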


The ReceiverTracker manages a message-communication body, the ReceiverTrackerEndpoint, which is used to communicate with the receivers and with the ReceiverTracker itself.

In the ReceiverTracker's start method, the ReceiverTrackerEndpoint is instantiated and the receivers are started on the executors:

/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}

Starting the receivers actually means the ReceiverTracker sends a local StartAllReceivers message to the ReceiverTrackerEndpoint; the ReceiverTrackerEndpoint then wraps each receiver into an RDD and submits it to the cluster to run as a job.

endpoint.send(StartAllReceivers(receivers))

The endpoint here is a reference to the ReceiverTrackerEndpoint.
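
How does the ReceiverTrackerEndpoint turn a receiver into a running task? Roughly speaking, it wraps each receiver into a single-partition RDD and submits it with SparkContext.submitJob, so the receiver runs as a long-lived task on an executor. Below is a minimal, self-contained sketch of that idea; FakeReceiver, startReceiverFunc and the object name StartReceiverSketch are stand-ins of mine, not Spark's code, and the real ReceiverTrackerEndpoint additionally computes preferred locations for the receiver, starts a ReceiverSupervisor inside the task, and sends itself a RestartReceiver message when the returned future completes unexpectedly.

import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import scala.concurrent.Await
import scala.concurrent.duration._

object StartReceiverSketch {
  // Stand-in for a real Receiver object; it must be serializable so it can be shipped to an executor.
  case class FakeReceiver(streamId: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("receiver-sketch"))

    // Wrap the single receiver into a one-partition RDD, in the spirit of
    // ssc.sc.makeRDD(Seq(receiver), 1) on the driver side.
    val receiverRDD = sc.makeRDD(Seq(FakeReceiver(streamId = 0)), numSlices = 1)
    receiverRDD.setName("Receiver 0")

    // The function run inside the task on the executor; in Spark it starts a
    // ReceiverSupervisor and blocks until the receiver terminates. Here we only simulate that step.
    val startReceiverFunc = (iterator: Iterator[FakeReceiver]) => {
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        println(s"starting receiver ${receiver.streamId} on an executor")
      }
    }

    // submitJob returns a future; the driver watches its completion to decide
    // whether the receiver ended normally or needs to be restarted.
    val future = sc.submitJob[FakeReceiver, Unit, Unit](
      receiverRDD, startReceiverFunc, Seq(0), (_, _) => (), ())

    Await.ready(future, 1.minute)
    sc.stop()
  }
}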


After a receiver starts, it registers itself with the ReceiverTracker; only when the registration succeeds is the receiver considered officially started.

override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}

When the receiver side receives data, it needs to write the data to the BlockManager and report the block's metadata to the ReceiverTracker:

/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}


When the ReceiverTracker receives the metadata, it starts a thread from a thread pool to write the metadata (batched into the write-ahead log when WAL batching is enabled):

case AddBlock(receivedBlockInfo) =>
  if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
    walBatchingThreadPool.execute(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        if (active) {
          context.reply(addBlock(receivedBlockInfo))
        } else {
          throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
        }
      }
    })
  } else {
    context.reply(addBlock(receivedBlockInfo))
  }
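
Whether the metadata actually goes through the write-ahead-log path above depends on configuration. Below is a minimal sketch of enabling the driver-side WAL; the config key spark.streaming.receiver.writeAheadLog.enable and the requirement to set a checkpoint directory are standard, while the batching behaviour (governed, as far as I know, by spark.streaming.driver.writeAheadLog.allowBatching, enabled by default in recent versions) may vary across Spark releases. The checkpoint path here is a placeholder; in production it should point to a fault-tolerant file system such as HDFS.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("wal-config-sketch")
      // Enable the write-ahead log so received data and block metadata are logged.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(2))
    // The WAL files live under the checkpoint directory, so a checkpoint directory must be set.
    ssc.checkpoint("/tmp/streaming-checkpoint")

    // ... define input streams and output operations here, then ssc.start() / ssc.awaitTermination() ...
  }
}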

The metadata of the received data is managed by the ReceivedBlockTracker.

The metadata is ultimately written into streamIdToUnallocatedBlockQueues, a map in which each stream id corresponds to a queue of that stream's not-yet-allocated data blocks.

private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
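
To make the role of this structure concrete, here is a small self-contained sketch (not the Spark source) of how such a per-stream queue map behaves: blocks are appended to the owning stream's queue as AddBlock messages arrive, and the whole queue is drained when a batch is allocated. BlockInfo, addBlock and allocateAll are simplified stand-ins for ReceivedBlockInfo and the corresponding ReceivedBlockTracker methods.

import scala.collection.mutable

object UnallocatedBlockQueuesSketch extends App {
  // Simplified stand-in for ReceivedBlockInfo: just a stream id and a block id.
  case class BlockInfo(streamId: Int, blockId: String)

  type BlockQueue = mutable.Queue[BlockInfo]
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, BlockQueue]

  private def queueFor(streamId: Int): BlockQueue =
    streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new BlockQueue)

  // Called (conceptually) for each AddBlock the tracker receives.
  def addBlock(info: BlockInfo): Unit = queueFor(info.streamId) += info

  // Called (conceptually) when a batch is allocated: drain every stream's queue.
  def allocateAll(streamIds: Seq[Int]): Map[Int, List[BlockInfo]] =
    streamIds.map(id => id -> queueFor(id).dequeueAll(_ => true).toList).toMap

  addBlock(BlockInfo(0, "input-0-0"))
  addBlock(BlockInfo(0, "input-0-1"))
  addBlock(BlockInfo(1, "input-1-0"))
  println(allocateAll(Seq(0, 1)))   // each stream's blocks, now assigned to one batch
  println(allocateAll(Seq(0, 1)))   // queues are empty until new blocks arrive
}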


Whenever Spark Streaming triggers a job, the blocks waiting in these queues are allocated to a batch, and the allocation is recorded in the timeToAllocatedBlocks data structure.

private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
...
def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
  if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
    val streamIdToBlocks = streamIds.map { streamId =>
      (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
    }.toMap
    val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
    if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
      timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
      lastAllocatedBatchTime = batchTime
    } else {
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  } else {
    // This situation occurs when:
    // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
    // possibly processed batch job or half-processed batch job need to be processed again,
    // so the batchTime will be equal to lastAllocatedBatchTime.
    // 2. Slow checkpointing makes recovered batch time older than WAL recovered
    // lastAllocatedBatchTime.
    // This situation will only occurs in recovery time.
    logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
  }
}

From this we can see that a single batch can contain data from multiple streams.
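
To illustrate that point, here is a toy model (again, not the Spark source) of the AllocatedBlocks idea: one batch time maps to the blocks of every input stream in that batch, and each input DStream later asks for its own stream's blocks to build the RDD for that batch. BlockInfo and getBlocksOfStream are simplified stand-ins for the real types.

import scala.collection.mutable

object AllocatedBlocksSketch extends App {
  case class BlockInfo(streamId: Int, blockId: String)
  case class AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[BlockInfo]]) {
    def getBlocksOfStream(streamId: Int): Seq[BlockInfo] =
      streamIdToAllocatedBlocks.getOrElse(streamId, Seq.empty)
  }

  val timeToAllocatedBlocks = new mutable.HashMap[Long, AllocatedBlocks]

  // One batch (at time 1000) holding blocks from two different input streams.
  timeToAllocatedBlocks(1000L) = AllocatedBlocks(Map(
    0 -> Seq(BlockInfo(0, "input-0-0"), BlockInfo(0, "input-0-1")),
    1 -> Seq(BlockInfo(1, "input-1-0"))
  ))

  // When the job for batch 1000 is generated, each receiver input stream asks
  // for its own stream's blocks and builds an RDD (e.g. a BlockRDD) from them.
  println(timeToAllocatedBlocks(1000L).getBlocksOfStream(0).map(_.blockId))
}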


Every time a streaming job finishes running:

private def handleJobCompletion(job: Job, completedTime: Long) {
  val jobSet = jobSets.get(job.time)
  jobSet.handleJobCompletion(job)
  job.setEndTime(completedTime)
  listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
  logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
  if (jobSet.hasCompleted) {
    jobSets.remove(jobSet.time)
    jobGenerator.onBatchCompletion(jobSet.time)
    logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
      jobSet.totalDelay / 1000.0, jobSet.time.toString,
      jobSet.processingDelay / 1000.0
    ))
    listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
  }
  ...
}

The JobScheduler invokes the handleJobCompletion method, which eventually triggers:

jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)


The maxRememberDuration here is the maximum amount of time that the RDDs generated by a DStream at each point in time are retained.
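
This retention window is not something you normally compute by hand; a user can lengthen it through the public StreamingContext.remember API, which, as far as I can tell, feeds into the maxRememberDuration used in the cleanup call above. A minimal sketch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object RememberDurationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("remember-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Keep the generated RDDs (and hence their block metadata) around for at
    // least 5 minutes, instead of the minimum the DStream graph requires.
    // Older batches become eligible for the cleanup path shown above.
    ssc.remember(Minutes(5))

    // ... define streams, then ssc.start() / ssc.awaitTermination() ...
  }
}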

def cleanupOldBatches(cleanupThreshTime: Time, waitForCompletion: Boolean): Unit = synchronized {
  require(cleanupThreshTime.milliseconds < clock.getTimeMillis())
  val timesToCleanup = timeToAllocatedBlocks.keys.filter { _ < cleanupThreshTime }.toSeq
  logInfo("Deleting batches " + timesToCleanup)
  if (writeToLog(BatchCleanupEvent(timesToCleanup))) {
    timeToAllocatedBlocks --= timesToCleanup
    writeAheadLogOption.foreach(_.clean(cleanupThreshTime.milliseconds, waitForCompletion))
  } else {
    logWarning("Failed to acknowledge batch clean up in the Write Ahead Log.")
  }
}

And finally:

listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))

This call eventually reaches the following code:

case batchCompleted: StreamingListenerBatchCompleted =>
  listener.onBatchCompleted(batchCompleted)

... all the way down ...

/**
 * A RateController that sends the new rate to receivers via the receiver tracker.
 */
private[streaming] class ReceiverRateController(id: Int, estimator: RateEstimator)
    extends RateController(id, estimator) {
  override def publish(rate: Long): Unit =
    ssc.scheduler.receiverTracker.sendRateUpdate(id, rate)
}

/** Update a receiver's maximum ingestion rate */
def sendRateUpdate(streamUID: Int, newRate: Long): Unit = synchronized {
  if (isTrackerStarted) {
    endpoint.send(UpdateReceiverRateLimit(streamUID, newRate))
  }
}

case UpdateReceiverRateLimit(streamUID, newRate) =>
  for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
    eP.send(UpdateRateLimit(newRate))
  }

The rate-control message is finally sent to the receiver, and the receiver adjusts the rate at which it receives data through its BlockGenerator.

case UpdateRateLimit(eps) =>
  logInfo(s"Received a new rate limit: $eps.")
  registeredBlockGenerators.foreach { bg =>
    bg.updateRate(eps)
  }
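
Under the hood, Spark's BlockGenerator extends a RateLimiter that wraps Guava's RateLimiter. The following is a simplified stand-in of my own (not the Spark source, though modeled on it) showing what such an updateRate amounts to: change the permits-per-second bound, capped by a maximum rate in the spirit of spark.streaming.receiver.maxRate. It assumes Guava is on the classpath; SimpleRateLimiter and its method bodies are a sketch, not the library API.

import com.google.common.util.concurrent.{RateLimiter => GuavaRateLimiter}

// waitToPush() blocks so that at most `rate` records per second are accepted,
// and updateRate() changes that bound when the driver pushes a new limit.
class SimpleRateLimiter(initialRate: Double, maxRateLimit: Long) {
  private val limiter = GuavaRateLimiter.create(initialRate)

  def waitToPush(): Unit = limiter.acquire()   // blocks until a permit is available

  def updateRate(newRate: Long): Unit = if (newRate > 0) {
    val effective = if (maxRateLimit > 0) math.min(newRate, maxRateLimit) else newRate
    limiter.setRate(effective.toDouble)
  }
}

object RateLimitDemo extends App {
  val rl = new SimpleRateLimiter(initialRate = 1000.0, maxRateLimit = 500)
  rl.updateRate(2000)      // capped to 500 records/sec by maxRateLimit
  (1 to 5).foreach { i => rl.waitToPush(); println(s"pushed record $i") }
}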




This article is from the "Ding Dong" blog; please be sure to keep the source: http://lqding.blog.51cto.com/9123978/1774994

