Lesson 3: Understanding Spark Streaming Thoroughly: Demystifying the Spark Streaming Operation Mechanism and Architecture (Job Execution and Fault Tolerance)

Source: Internet
Author: User

Contents of this issue:

    • Demystifying the Spark Streaming job architecture and operating mechanism

    • Demystifying the Spark Streaming fault-tolerance architecture and operating mechanism


Understanding the architecture and operating mechanism of Spark Streaming jobs is critical to mastering Spark Streaming. We know that for a typical Spark application, an RDD action triggers a job. So how are jobs triggered in Spark Streaming? When we write a Spark Streaming program, we set the batchDuration, and a job is automatically triggered every batchDuration interval. The Spark Streaming framework provides a timer for this: as soon as the interval elapses, the program is submitted to Spark and runs as a Spark job.
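To make this concrete, here is a minimal Spark Streaming driver sketch (the socket source, host, and port are hypothetical and not from the original lesson) showing that the batchDuration passed to StreamingContext controls how often a job is generated and submitted:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchDurationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchDurationDemo")
    // Every 20 seconds the framework's timer fires and a new job is generated for that batch.
    val ssc = new StreamingContext(conf, Seconds(20))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // an output operation; without one, no job is generated for the batch

    ssc.start()            // internally starts the JobScheduler and JobGenerator
    ssc.awaitTermination()
  }
}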


This involves two different notions of a job:

    1. Each batchInterval produces a job. The job here is not the job referred to in Spark Core: it is only the DAG of RDDs generated from the DStreamGraph and, from a Java perspective, is equivalent to an instance of the Runnable interface. To run this job it must be submitted to the JobScheduler, which uses a separate thread from its thread pool to submit the job to the cluster (in fact, it is the RDD action executed inside that thread that triggers the real Spark job). Why use a thread pool here? (A simplified sketch follows this list.)

      a) Jobs are generated continuously, so a thread pool is needed for efficiency; this is similar to how an Executor runs tasks through a thread pool;

      b) The jobs may be run under the FAIR scheduling mode, which also requires multi-threading support;

    2. The Spark job submitted by the job described above. From this point of view, there is no difference between this job and a job in Spark Core.
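The following is a simplified, hypothetical sketch of this idea (it is not Spark's actual JobScheduler): each batch produces a "job" that is just a function to run, a thread pool submits it, and the real Spark job is only triggered when the RDD action inside that function executes.

import java.util.concurrent.Executors

// Hypothetical stand-in for a Spark Streaming job: a batch time plus the work to run.
case class StreamingJob(time: Long, func: () => Unit) {
  def run(): Unit = func() // analogous to java.lang.Runnable.run()
}

// Hypothetical stand-in for the JobScheduler's job executor thread pool.
class SimpleJobScheduler(numThreads: Int) {
  private val jobExecutor = Executors.newFixedThreadPool(numThreads)

  def submitJob(job: StreamingJob): Unit = {
    jobExecutor.submit(new Runnable {
      // The RDD action inside func() is what triggers the real Spark job on the cluster.
      override def run(): Unit = job.run()
    })
  }

  def stop(): Unit = jobExecutor.shutdown()
}

Multiple threads also allow several batches' jobs to be in flight at once, which is what makes the FAIR scheduling mode mentioned above meaningful.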


Let's look at the job run process:

1. First, instantiate SparkConf and set the runtime parameters.

val conf = new SparkConf().setAppName("UpdateStateByKeyDemo")

2. Instantiate StreamingContext, setting the batchDuration interval to control the frequency of job generation; this creates the entry point for Spark Streaming execution.

val ssc = new StreamingContext(conf, Seconds(20))

3. While StreamingContext is being instantiated, the JobScheduler and JobGenerator are instantiated.

Line 183 of StreamingContext.scala:

private[streaming] val scheduler = new JobScheduler(this)

Line 50 of JobScheduler.scala:

private val jobGenerator = new JobGenerator(this)

4. StreamingContext calls the start method.

def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()

          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}

5. Inside StreamingContext.start(), the JobScheduler is started.

scheduler.start()

JobScheduler.start() instantiates an EventLoop and calls eventLoop.start() to run the message loop.

JobScheduler.start() also constructs the ReceiverTracker and calls the start methods of JobGenerator and ReceiverTracker:

def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
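The EventLoop used above is essentially a dedicated thread draining a queue of events and dispatching them to a handler; in JobScheduler the handler is processEvent, which reacts to events such as JobStarted, JobCompleted, and ErrorReported. A simplified sketch of that pattern (not Spark's actual EventLoop) looks like this:

import java.util.concurrent.LinkedBlockingQueue

// Simplified event-loop pattern: events are posted to a queue and processed on one thread.
class SimpleEventLoop[E](name: String)(handler: E => Unit) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    override def run(): Unit = {
      try {
        while (!stopped) {
          val event = queue.take() // blocks until an event arrives
          handler(event)
        }
      } catch {
        case _: InterruptedException => // stopped while waiting for an event
      }
    }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = { stopped = true; thread.interrupt() }
}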

6. After starting, the JobGenerator continuously generates jobs, one per batchDuration.

/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
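Conceptually, generateJobs is driven by a recurring timer that fires every batchDuration and posts a GenerateJobs event for the new batch time (Spark implements this with its own RecurringTimer utility). The following is only a simplified, hypothetical illustration of that timer idea, not Spark's implementation:

// Simplified illustration of a recurring timer: every periodMs, invoke the callback
// with the batch time (in JobGenerator the callback posts a GenerateJobs event).
class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
  @volatile private var stopped = false

  private val thread = new Thread("simple-recurring-timer") {
    override def run(): Unit = {
      // Align the first batch time to a multiple of the period.
      var nextTime = (System.currentTimeMillis() / periodMs + 1) * periodMs
      while (!stopped) {
        val sleepMs = nextTime - System.currentTimeMillis()
        if (sleepMs > 0) Thread.sleep(sleepMs)
        callback(nextTime) // e.g. generate the jobs for this batch time
        nextTime += periodMs
      }
    }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true }
}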


7. The ReceiverTracker first starts the Receivers in the Spark cluster (it actually starts the ReceiverSupervisor on the Executor). After a Receiver receives data, it stores the data on the Executor via the ReceiverSupervisor and sends the metadata of that data to the ReceiverTracker on the Driver. The ReceiverTracker manages the received metadata internally through the ReceivedBlockTracker.

/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
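For context, this is the data path a user-defined receiver goes through: calling store() hands records to the ReceiverSupervisor, which stores them on the Executor and reports the block metadata to the driver-side ReceiverTracker. A minimal custom receiver sketch (the data source is hypothetical):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom receiver with a hypothetical in-process data source.
class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  override def onStart(): Unit = {
    new Thread("dummy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          // store() hands the record to the ReceiverSupervisor, which writes it to the
          // executor's storage and reports the block metadata to the ReceiverTracker.
          store("record-" + System.currentTimeMillis())
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to clean up; the loop above exits once isStopped() becomes true.
  }
}

Such a receiver would be plugged in with ssc.receiverStream(new DummyReceiver), after which the flow described in step 7 applies.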


Part Two: The Spark Streaming fault-tolerance mechanism

We know that the relationship between a DStream and RDDs is that RDDs are constantly created as time passes, and a DStream operation is an operation on the RDD of a given batch time. So, in a sense, Spark Streaming's DStream-based fault tolerance is really fault tolerance applied to each of those RDDs, and this is the ingenuity of Spark Streaming.
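foreachRDD makes this per-batch relationship explicit. A short illustration (it reuses the hypothetical ssc and socket source from the sketch near the beginning of this article):

// Each batch interval yields a new RDD; the DStream operation runs on that batch's RDD,
// and fault tolerance is therefore applied per RDD.
val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
lines.foreachRDD { (rdd, time) =>
  println(s"Batch $time contains ${rdd.count()} records")
}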

Spark Streaming's fault tolerance should be considered from two angles:

    1. Recovery from Driver failures at run time

      Use checkpointing to record the Driver's runtime state; after a failure, read the checkpoint and restore the Driver's state. (A minimal sketch of this appears after the summary at the end of this section.)

    2. Recovery from the failure of a specific job run

      This involves failure recovery for the Receiver as well as recovery from RDD computation failures. The Receiver can use a write-ahead log (WAL). Fault tolerance for RDDs is provided by Spark Core and is based on the nature of the RDD itself; it consists of two main mechanisms:

  01. Checkpoint-based fault tolerance;

Between stages there are wide dependencies, which produce shuffle operations; when the lineage chain becomes too complex and lengthy, a checkpoint is needed.

  02. Lineage-based fault tolerance:

  In general, Spark prefers lineage-based fault tolerance, because checkpointing large datasets is expensive. Considering RDD dependencies, each stage internally contains only narrow dependencies, so lineage-based fault tolerance is generally used within a stage; it is convenient and efficient. (See the sketch below.)
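A minimal sketch of RDD-level checkpointing (the checkpoint path and data are hypothetical): the checkpoint cuts the lineage after a wide dependency, while the narrow-dependency steps after it are still recovered by lineage recomputation.

import org.apache.spark.{SparkConf, SparkContext}

object RddCheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("RddCheckpointDemo"))
    sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoint") // hypothetical path

    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))
    val shuffled = pairs.reduceByKey(_ + _) // wide dependency: shuffle, lineage grows

    shuffled.checkpoint() // persist this RDD's data; recovery restarts here, not from the full lineage
    shuffled.count()      // the action triggers the job and the checkpoint write

    // Narrow-dependency steps after the checkpoint are recovered by lineage recomputation.
    val doubled = shuffled.mapValues(_ * 2)
    println(doubled.collect().mkString(", "))
    sc.stop()
  }
}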

Summary: use lineage within a stage, and use checkpoint between stages.
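To make point 1 above concrete, here is a minimal sketch of Driver fault tolerance using StreamingContext.getOrCreate (the checkpoint directory and socket source are hypothetical); it also shows the configuration flag that enables the Receiver write-ahead log mentioned in point 2:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DriverRecoveryDemo {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("DriverRecoveryDemo")
      // Enable the receiver write-ahead log so received data can also be recovered.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(20))
    ssc.checkpoint(checkpointDir) // record DStream/driver metadata for recovery

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // If a checkpoint exists, restore the Driver state from it; otherwise create a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}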




Note:

1. DT Big Data Dream Factory WeChat public account: Dt_spark
2. IMF Big Data hands-on practice at 8 PM, YY live channel number: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains

