Demystifying Spark Streaming's Runtime Mechanism and Architecture: Jobs and Fault Tolerance (Part 3)


Key points in this issue:
1. Exploring the Spark Streaming job architecture and operating mechanism
2. Probing the Spark Streaming fault-tolerance mechanism

We have in fact already discussed Spark Streaming in the previous post: Spark Streaming is a sub-framework running on top of Spark Core. Below, we use a simple example to explore Spark Streaming's operating mechanism and architecture.

    1. Spark Streaming operating mechanism and architecture
// Sina Weibo: http://weibo.com/ilovepains/
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;

SparkConf conf = new SparkConf().setMaster("spark://Master:7077").setAppName("WordCountOnline");
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
JavaReceiverInputDStream<String> lines = jsc.socketTextStream("Master", 9999);
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String line) throws Exception {
        return Arrays.asList(line.split(" "));
    }
});
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String word) throws Exception {
        return new Tuple2<String, Integer>(word, 1);
    }
});
JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
wordsCount.print();
jsc.start();
jsc.awaitTermination();
jsc.close();

This is a simple Spark Streaming word-count example.

In a Spark Streaming program, the StreamingContext is the starting point of all the application's functionality and the core of program scheduling. Let's look at the initialization part of the StreamingContext source code:

// StreamingContext.scala, line 183
private[streaming] val scheduler = new JobScheduler(this)

We can see that when the StreamingContext is built, it initializes a JobScheduler; inside the JobScheduler a JobGenerator is initialized, and the receiverTracker variable is also defined, as follows:

// JobScheduler.scala, line 50
private val jobGenerator = new JobGenerator(this)
val clock = jobGenerator.clock
val listenerBus = new StreamingListenerBus()

// These two are created only when scheduler starts.
// eventLoop not being null means the scheduler has been started and not stopped
var receiverTracker: ReceiverTracker = null

Let's look at some of the source code behind jsc.socketTextStream("Master", 9999), which creates the DStream:

// StreamingContext.scala, line 327
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}

// StreamingContext.scala, line 345
def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}

From the above we can see that StreamingContext's socketTextStream delegates to the socketStream method, and the final call creates a SocketInputDStream. So let's look at SocketInputDStream:

private[streaming]
class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}

The getReceiver method, which determines how data is received, is defined in SocketInputDStream. Of course, everything we have seen so far is still at the stage of method definition and object initialization; nothing has actually started executing yet.
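To make the receiver contract concrete, here is a minimal sketch of a custom receiver modeled on SocketReceiver (the class SimpleSocketReceiver and its details are our own illustration, not Spark source): onStart spawns a reading thread, each record is handed to Spark Streaming via store(), and restart() asks the framework to re-establish the connection.

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative only: a stripped-down receiver in the spirit of SocketReceiver.
class SimpleSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  // Called on the executor when the receiver starts: spawn the reading thread
  // so that onStart itself does not block.
  def onStart(): Unit = {
    new Thread("Simple Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to clean up here: the reading thread exits once isStopped() is true.
  def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand each record to Spark Streaming for storage/replication
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    }
  }
}

Such a receiver would be plugged in with ssc.receiverStream(new SimpleSocketReceiver("Master", 9999)); providing this object is exactly the role SocketInputDStream.getReceiver plays for socketTextStream.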

Now let's look at jsc.start(), the method that actually starts program execution:

def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()

          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}

We can see that jsc.start() actually does a great deal of work, but here we focus on scheduler.start():

def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()

  // JobScheduler.scala, line 83
  jobGenerator.start()
  logInfo("Started JobScheduler")
}

We can now see that the receiverTracker is initialized in the start method of JobScheduler, which then calls its start method:

// ReceiverTracker.scala, line 149
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}

// ReceiverTracker.scala, line 413
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }

  runDummySparkJob()

  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}

Now we can see the whole chain: when StreamingContext executes its start method, it calls the start method of JobScheduler; in JobScheduler's start method, the ReceiverTracker is initialized and its start method executed; and when ReceiverTracker's start method runs, it ultimately notifies the executor processes on the workers, via RPC, to start receiving data and to report the metadata back to the driver.

Next we go back to JobScheduler.scala line 83 and look at the jobGenerator.start() method:

// JobGenerator.scala
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}

At this point we have covered how Spark Streaming starts the ReceiverTracker to receive data and generates jobs through the JobGenerator, to be run on the cluster.

Of course, we can also see heavy use of thread pools throughout the source code. In the author's view, their biggest advantage is that they avoid the cost of repeatedly creating new threads and achieve a high degree of thread reuse (the same reasoning as a database connection pool).
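As a minimal sketch of that reasoning (plain JVM code, not Spark source): a fixed pool pays the thread-creation cost once and then reuses the same few threads across many tasks.

import java.util.concurrent.{Executors, TimeUnit}

object ThreadPoolSketch {
  def main(args: Array[String]): Unit = {
    // Four threads are created once and reused for all sixteen tasks.
    val pool = Executors.newFixedThreadPool(4)
    (1 to 16).foreach { i =>
      pool.submit(new Runnable {
        override def run(): Unit =
          println(s"task $i ran on ${Thread.currentThread().getName}")
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}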

    2. Spark Streaming fault-tolerance mechanism

At bottom, Spark Streaming is just a sequence of RDDs, and its fault tolerance builds on that fact. There are two main mechanisms: one is checkpointing, and the other is lineage-based recovery. Of course, if the lineage chain becomes too complex and lengthy, checkpointing is needed.
Because of RDD dependencies, when the stages involve only narrow dependencies, lineage-based fault tolerance is generally used, which is convenient and efficient. When there are wide dependencies between stages, and wide dependencies generally produce shuffle operations, then checkpointing should be considered.
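As a minimal sketch of the checkpoint side (the checkpoint directory hdfs://Master:9000/checkpoint is a placeholder assumption, and CheckpointSketch is our own illustration), the standard pattern is to enable checkpointing on the context and use StreamingContext.getOrCreate, so that after a driver failure the context is rebuilt from checkpoint data instead of being recreated from scratch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  // Placeholder path: in production this should live on a fault-tolerant
  // file system such as HDFS.
  val checkpointDir = "hdfs://Master:9000/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("WordCountOnline").setMaster("spark://Master:7077")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Enables metadata checkpointing; DStream.checkpoint(interval) can additionally
    // be used to cut long lineage chains for stateful operations.
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("Master", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a restart after failure, getOrCreate recovers from the checkpoint
    // rather than calling createContext again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}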
