Apache Spark Source Code Reading 4: DStream Real-Time Stream Data Processing

Source: Internet
Author: User

You are welcome to reprint this article; please credit the source, huichiro.

Spark Streaming can process streaming data at near real-time speed. Its processing model differs from the typical one-record-at-a-time stream processing model, which gives Spark Streaming very high processing speed and higher throughput than Storm.

This article briefly analyzes the Spark Streaming processing model, the initialization process of a Spark Streaming application, and the processing steps that follow once external data is received.

Overview of streaming data features

Compared with an ordinary file data source (i.e., one with fixed content), so-called stream data has the following characteristics:

  1. Data is constantly changing
  2. Data cannot be rolled back
  3. Data continues to flow
DStream

If the processing logic of Spark Streaming had to be summarized in one sentence, it would be "persist, discretize, and batch-process continuous data".

Let's take a closer look at the reasons for this.

  • Data persistence: temporarily store the data received over the network, so that the events can be replayed if an error occurs.
  • Discretization: the stream of data never ends; as the line in Zhou Xingchi's comedy goes, "my admiration is like the water of the Yellow River, flowing on out of control". Since the stream can never be exhausted, it is split up by time: with a one-minute interval, for example, all data collected within that minute is stored together as one batch.
  • Batch processing: process the persisted data batch by batch, reusing the existing RDD processing machinery.

DStream is another layer of encapsulation over RDD. If you open DStream.scala and RDD.scala side by side, you will find that almost every operation on RDD has a corresponding definition on DStream.

Operations acting on a DStream fall into two categories (illustrated in the sketch after this list):

  1. Transformations
  2. Output operations, which materialize results; the currently supported output operations include print, saveAsObjectFiles, saveAsTextFiles, and saveAsHadoopFiles.
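As a quick illustration, here is a minimal sketch, assuming lines is an already created DStream[String] (for example obtained from socketTextStream); the output path prefix is made up for the example:

  // Transformations (lazy): each one builds a new DStream from an existing one
  val words   = lines.flatMap(_.split(" "))
  val lengths = words.map(_.length)

  // Output operations (eager): these are what actually get scheduled every batch interval
  lengths.print()
  lengths.saveAsTextFiles("word-lengths")   // hypothetical output path prefix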
DStreamGraph

Where there is input, there must be output; without output, all the preceding operations are meaningless. How are the inputs bound to the outputs? The answer lies in DStreamGraph, which records both the input streams and the output streams.

  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  private val outputStreams = new ArrayBuffer[DStream[_]]()
  var rememberDuration: Duration = null
  var checkpointInProgress = false

Elements are added to outputStreams automatically whenever an output-type operation is applied to a DStream.

An important difference between an output stream and an input stream is that the output stream overrides generateJob.
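For reference, the override in ForEachDStream (the DStream produced by output operations) looks roughly like the following; this is abbreviated from the Spark 1.x source, so the exact code in your version may differ in detail:

  // ForEachDStream.generateJob (abbreviated): wrap the user's foreach function,
  // applied to this batch's RDD, into a Job that the JobScheduler can run.
  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => foreachFunc(rdd, time)
        Some(new Job(time, jobFunc))
      case None => None
    }
  }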

Initialization Process

StreamingContext

StreamingContext is the entry point of Spark Streaming initialization. Its main task is to create a JobScheduler based on the input parameters.

Set the input stream

If the stream data source is a socket, use socketStream; if the data source is a set of constantly changing files, use fileStream.

Submit for running

StreamingContext.start()
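Putting the initialization steps together, a minimal driver program looks like the sketch below; the application name, master URL, host, port, and one-minute batch interval are arbitrary choices for illustration:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StreamingDemo {
    def main(args: Array[String]): Unit = {
      // StreamingContext is the entry point; the batch interval drives the recurring timer
      val conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(60))

      // Set the input stream: here, a socket source
      val lines = ssc.socketTextStream("localhost", 9999)

      // At least one output operation is required, otherwise start() has nothing to schedule
      lines.count().print()

      // Submit for running: start the receivers and the JobGenerator, then block
      ssc.start()
      ssc.awaitTermination()
    }
  }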

 

Data Processing

Take socketStream as an example, where the data comes from a socket.

SocketInputDStream starts a thread that receives data through its receive function.

  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      logInfo("Stopped receiving")
      restart("Retrying connecting to " + host + ":" + port)
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        restart("Error receiving data", t)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }

The received data is stored first, and storing it eventually calls functions in BlockManager.scala. How does the BlockManager get handed over? It is passed in through SparkEnv; pay attention to the input parameters of the StreamingContext constructor.

Processing Timer

Storing the data is triggered by the socket, but what triggers the actual processing of the stored data?

Recall that a time parameter is specified when the StreamingContext is initialized. This parameter is used to construct the corresponding RecurringTimer, and every time the timer fires, the generateJobs function is called.

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventActor ! GenerateJobs(new Time(longTime)), "JobGenerator")

Event processing functions

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time) => doCheckpoint(time)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

generateJobs

  private def generateJobs(time: Time) {
    SparkEnv.set(ssc.env)
    Try(graph.generateJobs(time)) match {
      case Success(jobs) =>
        val receivedBlockInfo = graph.getReceiverInputStreams.map { stream =>
          val streamId = stream.id
          val receivedBlockInfo = stream.getReceivedBlockInfo(time)
          (streamId, receivedBlockInfo)
        }.toMap
        jobScheduler.submitJobSet(JobSet(time, jobs, receivedBlockInfo))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventActor ! DoCheckpoint(time)
  }

From generateJobs, generateJob is called, and the call chain eventually reaches Job.run, inside which sc.runJob is invoked; the detailed call path is not listed here.

  private class JobHandler(job: Job) extends Runnable {
    def run() {
      eventActor ! JobStarted(job)
      job.run()
      eventActor ! JobCompleted(job)
    }
  }

The jobFunc used in Job.run() is defined in DStream.generateJob:

  private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) => {
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      }
      case None => None
    }
  }

In this process, DStreamGraph plays a key role, very similar to the graph in Trident/Storm.

During generateJob, the DStream generates its corresponding RDD by calling the compute function, and SparkContext then converts the RDD-based abstraction into multiple stages for execution.
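To see what a compute implementation looks like, here is the body of MappedDStream, abbreviated from the Spark 1.x source (details may differ between versions); every transformed DStream follows the same pattern of asking its parent for this batch's RDD and applying the corresponding RDD operation:

  // MappedDStream.compute (abbreviated): build this batch's RDD
  // from the parent DStream's RDD for the same batch time.
  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.map[U](mapFunc))
  }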

The important conversion inside StreamingContext is from DStream to RDD, while the important conversion inside SparkContext is from RDD to stages and tasks. Note the implementations of the getOrCompute and compute functions in these two different abstract classes.
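For orientation, DStream.getOrCompute is roughly structured as follows; this is an abbreviated sketch rather than the exact source, and the real method also handles checkpointing and batch-time validation:

  // DStream.getOrCompute (abbreviated sketch): return the cached RDD for this
  // batch time, or call compute() to build it and remember it for reuse.
  private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
    generatedRDDs.get(time).orElse {
      compute(time).map { newRDD =>
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
        }
        generatedRDDs.put(time, newRDD)
        newRDD
      }
    }
  }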

Summary

This article was written somewhat hastily, and its content is not as rich as it could be. We will try to flesh out the specific call paths when time permits.

This article does not cover the fault-tolerance mechanism; that will be explained in a separate article.
