Lesson 9: Spark Streaming Source Code Interpretation: The Subtle Driver-Side Implementation of the Receiver's Full Life Cycle


In Spark Streaming, a ReceiverInputDStream corresponds to a real Receiver, which is used to receive data. There can be many Receivers, running on different worker nodes, and all of them are managed by the ReceiverTracker.
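To make the roles concrete, here is a minimal custom receiver, closely following the custom receiver example in the Spark documentation (the class name MyReceiver and the host/port parameters are just for illustration). An object like this is what ReceiverTracker will eventually ship to an executor and start:

import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative receiver: reads lines from a socket and hands them to Spark via store().
class MyReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Start a background thread so that onStart() returns quickly.
    new Thread("MyReceiver thread") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to do: the receiving thread checks isStopped() and exits on its own.
  }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val lines = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
      while (!isStopped() && lines.hasNext) {
        store(lines.next())   // hand each record to the ReceiverSupervisor for block storage
      }
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    } finally {
      if (socket != null) socket.close()
    }
  }
}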

In the start method of ReceiverTracker, a message communication endpoint, ReceiverTrackerEndpoint, is created:

/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}


It then calls the launchReceivers() method:

/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }

  runDummySparkJob()

  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}

In the code above, the receivers are first obtained from the ReceiverInputDStreams. Each InputDStream corresponds to one receiver, and a Spark Streaming program can have multiple InputDStreams.
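For example, in a driver program like the hypothetical one below (not part of the post's source excerpts), each socketTextStream call creates its own ReceiverInputDStream, so the ReceiverTracker will launch two receivers:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoReceiversDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwoReceiversDemo")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Two ReceiverInputDStreams => ReceiverTracker starts two receivers,
    // ideally on two different executors (see runDummySparkJob below).
    val lines1 = ssc.socketTextStream("host1", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val lines2 = ssc.socketTextStream("host2", 9999, StorageLevel.MEMORY_AND_DISK_SER)

    lines1.union(lines2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}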

One thing worth explaining is the call to runDummySparkJob(). As the name suggests, it runs a dummy job whose main purpose is to allow the receivers to be spread across different workers as evenly as possible.

You might ask: doesn't the Master already know which workers are in the system? Can't the executors on those workers be used directly? The problem is that an executor on a worker may have gone down without the Master knowing yet, and a receiver could then be assigned to an executor that can no longer run it. After runDummySparkJob() has been executed, the executors obtained through the BlockManager are guaranteed to be alive at that moment.

How is this achieved?

private def runDummySparkJob(): Unit = {
  if (!ssc.sparkContext.isLocal) {
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  assert(getExecutors.nonEmpty)
}

private def getExecutors: Seq[ExecutorCacheTaskLocation] = {
  if (ssc.sc.isLocal) {
    val blockManagerId = ssc.sparkContext.env.blockManager.blockManagerId
    Seq(ExecutorCacheTaskLocation(blockManagerId.host, blockManagerId.executorId))
  } else {
    ssc.sparkContext.env.blockManager.master.getMemoryStatus.filter { case (blockManagerId, _) =>
      blockManagerId.executorId != SparkContext.DRIVER_IDENTIFIER // Ignore the driver location
    }.map { case (blockManagerId, _) =>
      ExecutorCacheTaskLocation(blockManagerId.host, blockManagerId.executorId)
    }.toSeq
  }
}


The StartAllReceivers message is then sent to the ReceiverTrackerEndpoint. After receiving the message, it does the following:

case StartAllReceivers(receivers) =>
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    startReceiver(receiver, executors)
  }

Inside the for loop, the corresponding executors are assigned for each receiver, and the startReceiver method is called.
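To illustrate what the scheduling step (schedulingPolicy.scheduleReceivers above) produces, here is a deliberately simplified round-robin sketch with made-up names; the real ReceiverSchedulingPolicy also takes preferred locations and load balancing into account. The point is only the shape of the result: a map from streamId to candidate executors.

// Simplified stand-in for ReceiverSchedulingPolicy.scheduleReceivers:
// assign executors to receivers round-robin and return streamId -> locations.
def scheduleRoundRobin(streamIds: Seq[Int], executors: Seq[String]): Map[Int, Seq[String]] = {
  require(executors.nonEmpty, "no live executors to schedule receivers on")
  streamIds.zipWithIndex.map { case (streamId, i) =>
    streamId -> Seq(executors(i % executors.size))
  }.toMap
}

// scheduleRoundRobin(Seq(0, 1, 2), Seq("exec-1", "exec-2"))
// => Map(0 -> Seq("exec-1"), 1 -> Seq("exec-2"), 2 -> Seq("exec-1"))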

The receiver is started as a job! At this point you may be wondering: where are the RDD and the job? First, in the startReceiver method, the receiver is wrapped into an RDD:

val receiverRDD: RDD[Receiver[_]] =
  if (scheduledLocations.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    val preferredLocations = scheduledLocations.map(_.toString).distinct
    ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
  }

There is only one piece of "data" in this RDD, and that piece of data is the receiver object itself. Because the receiver object will be shipped to a remote executor through the job, it must be serializable:

abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable
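A practical consequence of this serialization requirement (an illustration of my own, not from the original post): keep only serializable constructor parameters as fields, and open non-serializable resources such as sockets or database connections on the executor inside onStart(), never in the constructor.

import java.net.Socket
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical example of the rule above.
class SafeSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // A field like `private val socket = new Socket(host, port)` here would be captured
  // when the receiver is serialized on the driver and break the job submission.
  @transient private var socket: Socket = _   // created lazily on the executor instead

  override def onStart(): Unit = {
    socket = new Socket(host, port)
    // ... start a thread that reads from the socket and calls store(...)
  }

  override def onStop(): Unit = {
    if (socket != null) socket.close()
  }
}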

After being wrapped into an RDD, the RDD is submitted to the cluster to run:

val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
  receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
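To make the five arguments easier to read, the approximate signature of SparkContext.submitJob is quoted below from memory of the Spark API (check the API docs for the exact form). The receiver RDD has a single partition, hence Seq(0); and because the job produces no meaningful result, the result handler and result function are no-ops.

// Approximate signature of SparkContext.submitJob:
//
//   def submitJob[T, U, R](
//       rdd: RDD[T],                        // here: the one-element RDD holding the receiver
//       processPartition: Iterator[T] => U, // here: startReceiverFunc, run on the executor
//       partitions: Seq[Int],               // here: Seq(0), the RDD's only partition
//       resultHandler: (Int, U) => Unit,    // here: a no-op, (_, _) => Unit
//       resultFunc: => R                    // here: ()
//   ): SimpleFutureAction[R]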

When the task arrives at the executor, the receiver is taken out of the RDD and startReceiverFunc is executed on it:

val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    if (TaskContext.get().attemptNumber() == 0) {
      val receiver = iterator.next()
      assert(iterator.hasNext == false)
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }


Inside this function, a ReceiverSupervisorImpl object is created. It is used to manage the concrete receiver.
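Conceptually, the supervisor is where a receiver's store(...) calls end up: it buffers records into blocks (via a BlockGenerator, flushed every spark.streaming.blockInterval, 200 ms by default), persists each block through a ReceivedBlockHandler (BlockManager or write-ahead log), and reports it to the driver-side ReceiverTracker. The toy, self-contained sketch below (all names hypothetical) shows only that flow, not the real ReceiverSupervisorImpl code:

import scala.collection.mutable.ArrayBuffer

object SupervisorSketch {
  final case class Block(id: Long, records: Seq[String])

  // Toy stand-in for ReceiverSupervisorImpl: buffer records, cut blocks, report them.
  class ToySupervisor(reportToDriver: Block => Unit) {
    private val buffer = ArrayBuffer.empty[String]
    private var nextBlockId = 0L

    // Receiver.store(record) would end up in something like this.
    def pushSingle(record: String): Unit = synchronized { buffer += record }

    // In Spark this is driven by the BlockGenerator's timer.
    def cutBlock(): Unit = synchronized {
      if (buffer.nonEmpty) {
        val block = Block(nextBlockId, buffer.toList)
        nextBlockId += 1
        buffer.clear()
        // The real code stores the block and then sends AddBlock(...) to the ReceiverTracker.
        reportToDriver(block)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val supervisor = new ToySupervisor(b => println(s"reported block ${b.id} with ${b.records.size} records"))
    supervisor.pushSingle("a")
    supervisor.pushSingle("b")
    supervisor.cutBlock()   // prints: reported block 0 with 2 records
  }
}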

First, it registers the receiver with the ReceiverTracker:

override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}

If the registration succeeds, the receiver is started:

def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}


Back in ReceiverTracker's startReceiver method: if the receiver fails to start, a RestartReceiver message is sent to the ReceiverTrackerEndpoint.

future.onComplete {
  case Success(_) =>
    if (!shouldStartReceiver) {
      onReceiverJobFinish(receiverId)
    } else {
      logInfo(s"Restarting Receiver $receiverId")
      self.send(RestartReceiver(receiver))
    }
  case Failure(e) =>
    if (!shouldStartReceiver) {
      onReceiverJobFinish(receiverId)
    } else {
      logError("Receiver has been stopped. Try to restart it.", e)
      logInfo(s"Restarting Receiver $receiverId")
      self.send(RestartReceiver(receiver))
    }
}(submitJobThreadPool)


A new executor is then selected for the receiver and the receiver is run again, and this repeats until the receiver starts successfully.
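For completeness, a simplified paraphrase of how ReceiverTrackerEndpoint handles the RestartReceiver message (see ReceiverTracker.scala in the Spark source for the exact code): it prefers the executors the receiver was previously scheduled on if any are still alive, otherwise asks the scheduling policy to reschedule against the currently live executors, and then calls startReceiver again, closing the start/fail/restart loop.

// Simplified paraphrase of the RestartReceiver handling (not the exact source):
case RestartReceiver(receiver) =>
  val oldScheduledExecutors = getStoredScheduledExecutors(receiver.streamId)
  val scheduledLocations =
    if (oldScheduledExecutors.nonEmpty) {
      oldScheduledExecutors          // reuse still-alive executors from the previous attempt
    } else {
      schedulingPolicy.rescheduleReceiver(
        receiver.streamId, receiver.preferredLocation, receiverTrackingInfos, getExecutors)
    }
  // Submit the receiver as a new job; if it fails again, the onComplete callback
  // shown above sends another RestartReceiver, so the loop continues until success.
  startReceiver(receiver, scheduledLocations)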


Note:

1. DT Big Data Dream Factory WeChat public account: DT_Spark
2. IMF 8 PM big data hands-on practice, YY live channel number: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains


This article is from the "Ding Dong" blog; please keep this source when reposting: http://lqding.blog.51cto.com/9123978/1773912

