Contents of this issue:
Understanding the overall architecture and operating mechanism of Spark Streaming jobs is critical to mastering Spark Streaming. We know that in a typical Spark application, an RDD action triggers a job. So how are jobs triggered in Spark Streaming? When we write a Spark Streaming program, we set a batchDuration, and a job is triggered automatically every batchDuration interval. The Spark Streaming framework provides a timer for this: as soon as the interval elapses, the program is submitted to Spark and runs as a Spark job.
This involves two different notions of "job":
Each batchInterval produces a specific job. The job here is not the job referred to in Spark Core; it is only the DAG of RDDs generated from the DStreamGraph and, from a Java perspective, is equivalent to a Runnable instance. To actually run, the job must be submitted to the JobScheduler, which takes a separate thread from its thread pool and submits the job to the cluster there (in fact, it is the RDD action executed in that thread that triggers the real job). Why use a thread pool?
a) Jobs are generated continuously, so a thread pool is needed for efficiency; this is similar to how tasks are executed through a thread pool inside an Executor;
b) The jobs may be configured with the FAIR scheduling policy, which also requires multi-threading support.
The Spark job submitted by the streaming job itself. From this point of view, there is no difference between this job and a job in Spark Core.
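The first notion of job above (a Runnable handed to the JobScheduler's thread pool) can be illustrated with a minimal sketch. This is not Spark's actual code: StreamingJob and SimpleJobScheduler are hypothetical names, and a plain JDK thread pool stands in for the internal job executor. The point is only that the batch "job" is deferred work that runs when a pool thread invokes it.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch, not Spark's classes: a streaming "job" is a deferred
// computation; the scheduler hands it to a thread pool, and only when a pool
// thread runs the body does the real (RDD-action-like) work happen.
final case class StreamingJob(time: Long, body: () => Unit) extends Runnable {
  override def run(): Unit = body()
}

final class SimpleJobScheduler(numThreads: Int) {
  private val pool = Executors.newFixedThreadPool(numThreads)
  def submit(job: StreamingJob): Unit = pool.execute(job)
  def shutdown(): Unit = {
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}

val completed = new AtomicInteger(0)
val scheduler = new SimpleJobScheduler(2)
// One job per (simulated) batch interval; the "action" is just a counter here.
(1 to 5).foreach(t => scheduler.submit(StreamingJob(t, () => completed.incrementAndGet())))
scheduler.shutdown()
println(completed.get()) // all 5 batch jobs ran
```

The multi-threaded pool mirrors reason a) above: batches keep arriving, so jobs must not queue behind a single thread.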
Let's look at the job run process:
1. First instantiate SparkConf and set the runtime parameters.
val conf = new SparkConf().setAppName("UpdateStateByKeyDemo")
2. Instantiate StreamingContext, setting the batchDuration interval that controls the frequency of job generation; this creates the entry point for Spark Streaming execution.
val ssc = new StreamingContext(conf, Seconds(20))
3. While StreamingContext is being instantiated, JobScheduler and JobGenerator are instantiated.
Line 183 of StreamingContext.scala:
private[streaming] val scheduler = new JobScheduler(this)
Line 50 of JobScheduler.scala:
private val jobGenerator = new JobGenerator(this)
4. StreamingContext calls the start method.
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
5. Within StreamingContext.start(), start the JobScheduler.
scheduler.start()
JobScheduler.start() instantiates an EventLoop and calls eventLoop.start() to begin the message loop.
JobScheduler.start() also constructs the ReceiverTracker and calls the start methods of JobGenerator and ReceiverTracker:
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
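The EventLoop started above can be sketched in plain Scala. SimpleEventLoop below is a hypothetical stand-in, not Spark's EventLoop class: a dedicated thread drains a blocking queue and dispatches each event to onReceive, so posting an event never blocks on its processing.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the event-loop pattern JobScheduler relies on.
// A single consumer thread takes events off a queue; a None acts as a
// poison pill that ends the loop.
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[Option[E]]()
  private val thread = new Thread(name) {
    override def run(): Unit = {
      var running = true
      while (running) queue.take() match {
        case Some(event) => onReceive(event)
        case None        => running = false // poison pill: stop the loop
      }
    }
  }
  protected def onReceive(event: E): Unit
  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(Some(event))
  def stop(): Unit = { queue.put(None); thread.join() }
}

sealed trait SchedulerEvent
case class JobStarted(id: Int) extends SchedulerEvent

val seen = ArrayBuffer.empty[Int]
val loop = new SimpleEventLoop[SchedulerEvent]("JobScheduler") {
  override protected def onReceive(event: SchedulerEvent): Unit = event match {
    case JobStarted(id) => seen += id
  }
}
loop.start()
(1 to 3).foreach(i => loop.post(JobStarted(i)))
loop.stop() // drains the queue, then joins the consumer thread
println(seen.mkString(",")) // 1,2,3
```

Because there is a single consumer thread and a FIFO queue, events are processed strictly in posting order, which is why the real JobScheduler can use one event loop to serialize its state changes.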
6. After starting, JobGenerator continuously generates jobs based on batchDuration:
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
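The timer that drives generateJobs at every batch boundary can be sketched as follows. This is a hypothetical model, not Spark's RecurringTimer: a scheduled executor fires once per batchDuration and emits a GenerateJobs(time) event whose batch time is aligned to multiples of the interval.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the timer behind JobGenerator (not Spark code):
// every batchDuration a GenerateJobs(time) event is produced.
case class GenerateJobs(time: Long)

val batchDurationMs = 50L
val batches = ArrayBuffer.empty[GenerateJobs]
val produced = new CountDownLatch(4)

val timer = Executors.newSingleThreadScheduledExecutor()
timer.scheduleAtFixedRate(new Runnable {
  private var batchTime = 0L
  override def run(): Unit = {
    batchTime += batchDurationMs // next batch boundary
    batches.synchronized { batches += GenerateJobs(batchTime) }
    produced.countDown()
  }
}, batchDurationMs, batchDurationMs, TimeUnit.MILLISECONDS)

produced.await() // wait until four batches have been generated
timer.shutdownNow()

// Batch times are multiples of the batch duration, independent of wall-clock jitter.
val firstFour = batches.synchronized(batches.take(4).map(_.time).toList)
println(firstFour.mkString(","))
```

Each tick corresponds to one JobSet in the real JobGenerator; the interval you pass to Seconds(...) when constructing StreamingContext is what determines this tick rate.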
7. ReceiverTracker first starts the receivers in the Spark cluster (it actually starts ReceiverSupervisors in the Executors). After a Receiver receives data, it stores the data on the Executor via the ReceiverSupervisor and sends the data's metadata to the ReceiverTracker on the Driver. Internally, ReceiverTracker manages the received metadata through ReceivedBlockTracker.
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
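The metadata bookkeeping described in step 7 can be modeled with a small sketch. SimpleReceivedBlockTracker and BlockMeta are hypothetical names, not Spark's ReceivedBlockTracker: receivers report stored blocks' metadata to the driver, and each batch boundary binds the pending blocks to that batch so job generation knows which data to process.

```scala
import scala.collection.mutable

// Hypothetical sketch of driver-side block metadata tracking (not Spark code).
case class BlockMeta(streamId: Int, blockId: String, numRecords: Long)

class SimpleReceivedBlockTracker {
  private val unallocated = mutable.ArrayBuffer.empty[BlockMeta]
  private val byBatch = mutable.Map.empty[Long, Seq[BlockMeta]]

  // Called when a ReceiverSupervisor reports a block it stored on an Executor.
  def addBlock(meta: BlockMeta): Unit = synchronized { unallocated += meta }

  // Called at each batch boundary: all pending blocks belong to this batch.
  def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
    byBatch(batchTime) = unallocated.toList
    unallocated.clear()
  }

  def blocksOfBatch(batchTime: Long): Seq[BlockMeta] = synchronized {
    byBatch.getOrElse(batchTime, Seq.empty)
  }
}

val tracker = new SimpleReceivedBlockTracker
tracker.addBlock(BlockMeta(0, "input-0-1", 100))
tracker.addBlock(BlockMeta(0, "input-0-2", 50))
tracker.allocateBlocksToBatch(1000L)
println(tracker.blocksOfBatch(1000L).map(_.numRecords).sum) // 150
```

This is the same allocateBlocksToBatch step that generateJobs calls before building the JobSet: data received between two ticks is frozen into the batch at the tick.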
Two. Spark Streaming's fault-tolerance mechanism:
We know that the relationship between DStream and RDD is that RDDs are created continually over time; operations on a DStream are really operations on the RDDs at fixed times. So, in a sense, Spark Streaming's DStream-based fault tolerance is actually fault tolerance for each individual RDD, and this is the genius of Spark Streaming's design.
The fault tolerance of Spark Streaming should be considered from two angles:
Recovery of the Driver from run-time failure
Checkpoints record the state of the Driver at runtime; after a failure, the checkpoint is read and the Driver state is restored.
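The write-then-restore cycle above can be sketched minimally. DriverState and the text encoding here are hypothetical, not Spark's Checkpoint format: the point is only that the driver persists its bookkeeping durably, and a restarted driver reads it back instead of starting from scratch.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical driver bookkeeping (not Spark's Checkpoint class).
case class DriverState(lastBatchTime: Long, pendingBatches: List[Long])

// Persist the state as a single line: "<lastBatchTime>;<t1>,<t2>,..."
def writeCheckpoint(file: File, state: DriverState): Unit = {
  val w = new PrintWriter(file)
  try w.print(s"${state.lastBatchTime};${state.pendingBatches.mkString(",")}")
  finally w.close()
}

// What a restarted driver would do on recovery: read and parse the state back.
def readCheckpoint(file: File): DriverState = {
  val src = Source.fromFile(file)
  val text = try src.mkString finally src.close()
  val sep = text.indexOf(';')
  val pendingStr = text.substring(sep + 1)
  val pending =
    if (pendingStr.isEmpty) Nil else pendingStr.split(",").map(_.toLong).toList
  DriverState(text.substring(0, sep).toLong, pending)
}

val file = File.createTempFile("checkpoint", ".txt")
writeCheckpoint(file, DriverState(2000L, List(2500L, 3000L)))
val recovered = readCheckpoint(file)
file.delete()
println(recovered == DriverState(2000L, List(2500L, 3000L))) // true
```

In real Spark Streaming the checkpoint additionally captures the DStreamGraph and configuration, and is written to a fault-tolerant filesystem such as HDFS rather than a local temp file.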
Recovery of a specific job from run-time failure
This involves failure recovery of the Receiver as well as recovery from RDD computation failures. The Receiver can be protected by writing a WAL (write-ahead log). Fault tolerance for RDDs is provided by Spark Core and is based on the properties of RDDs themselves; its mechanisms are mainly of two kinds:
01. Based on checkpoint:
Between stages there are wide dependencies, which produce shuffle operations; the lineage chain becomes too complex and lengthy, so at this point a checkpoint is needed.
02. Based on lineage (descent):
In general, Spark chooses lineage-based fault tolerance because checkpointing large datasets is expensive. Considering RDD dependencies, each stage is internally narrow-dependent, so within a stage lineage-based fault tolerance is generally used; it is convenient and efficient.
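The two recovery paths can be contrasted with a small model. Source, Mapped, and Checkpointed are hypothetical types, not Spark classes: lineage recovery replays the parent chain to rebuild lost data, while checkpoint recovery reads materialized data and truncates the chain.

```scala
// Hypothetical model of the two fault-tolerance mechanisms (not Spark code).
sealed trait Dataset { def compute(): Seq[Int] }

case class Source(data: Seq[Int]) extends Dataset {
  def compute(): Seq[Int] = data
}
case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
  // Lineage-based recovery: rerun f over the recomputed parent.
  def compute(): Seq[Int] = parent.compute().map(f)
}
case class Checkpointed(materialized: Seq[Int]) extends Dataset {
  // Checkpoint-based recovery: read saved data; no parent chain to replay.
  def compute(): Seq[Int] = materialized
}

// Same logical result, two recovery strategies:
val viaLineage = Mapped(Mapped(Source(Seq(1, 2, 3)), _ * 2), _ + 1)
val viaCheckpoint = Mapped(Checkpointed(Seq(2, 4, 6)), _ + 1)
println(viaLineage.compute().mkString(","))    // 3,5,7
println(viaCheckpoint.compute().mkString(",")) // 3,5,7
```

The trade-off mirrors the summary below: replaying a short narrow-dependency chain is cheap, while a checkpoint pays storage cost once to avoid replaying a long or shuffle-crossing chain.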
Summary: within a stage, rely on lineage; between stages, use checkpoint.
Note:
1. DT Big Data Dream Factory WeChat public account: Dt_spark
2. IMF 8 pm big data hands-on YY live channel number: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains
Lesson 3: Thoroughly understanding Spark Streaming: decrypting the Spark Streaming job operation mechanism and architecture, advanced topics on jobs and fault tolerance