You are welcome to reprint this article; please credit the source, huichiro. Thank you.
To ensure reliable results, all input data must be processed exactly once: no record may be counted more than once and none may be missed. In Spark Streaming's processing mechanism this is relatively easy to reason about, because data is not computed multiple times. So how does Spark Streaming guarantee that, even if a processing node is restarted, the data will still be processed after the restart?
Environment Setup
To get an intuitive feel, let's first run a simple Spark Streaming example. Before starting, make sure OpenBSD Netcat is installed.
Run Netcat
nc -lk 9999
Run spark-shell
SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=10000 MASTER=local-cluster[2,2,1024] bin/spark-shell
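Here MASTER=local-cluster[2,2,1024] starts a pseudo-distributed local cluster with 2 worker processes, 2 cores per worker, and 1024 MB of memory per worker, while spark.cleaner.ttl sets, in seconds, how long old metadata is retained before the periodic cleaner removes it. The exact values are not important for this experiment.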
Enter the following content in spark-shell:
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCount = pairs.reduceByKey(_ + _)
wordCount.print()
ssc.start()
ssc.awaitTermination()
After ssc.start() is executed, type some text on the nc side and press Enter; spark-shell will display the word-count results.
Data Receiving Process
Let's look at the code implementation from two perspectives: the control plane and the data plane.
The control plane of Spark Streaming's data receiving process is roughly shown in the figure below. Briefly, the steps are as follows:
- Data is actually received in the SocketReceiver.receive function; the received data is put into BlockGenerator.currentBuffer.
- BlockGenerator contains a recurring timer whose handler is updateCurrentBuffer; updateCurrentBuffer packages the data currently in the buffer into a new Block and puts it into the blocksForPush queue.
- BlockGenerator also contains a blockPushingThread, whose responsibility is to continuously take members out of the blocksForPush queue and pass them, via the pushArrayBuffer function, to BlockManager, which stores the data in MemoryStore.
- pushArrayBuffer also reports the ID of the block just stored by BlockManager to ReceiverTracker, which appends the stored block ID to the queue of the corresponding stream ID.
The overall call chain is: SocketReceiver.receive -> Receiver.store -> pushSingle -> BlockGenerator.updateCurrentBuffer -> BlockGenerator.keepPushingBlocks -> pushArrayBuffer -> ReceiverTracker.addBlocks
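To make the interplay between the recurring timer and the pushing thread more concrete, here is a simplified, self-contained sketch of the mechanism. It is not the actual Spark source; names such as BlockGeneratorSketch and storeBlock are made up for illustration.

import java.util.concurrent.ArrayBlockingQueue
import java.util.{Timer, TimerTask}
import scala.collection.mutable.ArrayBuffer

object BlockGeneratorSketch {
  case class Block(time: Long, data: Array[Any])

  @volatile private var currentBuffer = new ArrayBuffer[Any]
  private val blocksForPush = new ArrayBlockingQueue[Block](10)

  // Receiver side: every record handed in (Receiver.store -> pushSingle) ends up here.
  def pushSingle(record: Any): Unit = synchronized { currentBuffer += record }

  // Recurring timer: swap out the buffer and package it into a Block that is
  // queued for pushing (this plays the role of updateCurrentBuffer).
  private val timer = new Timer("block-interval-timer", true)
  timer.schedule(new TimerTask {
    def run(): Unit = {
      val newBlock = BlockGeneratorSketch.synchronized {
        val old = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        Block(System.currentTimeMillis, old.toArray)
      }
      if (newBlock.data.nonEmpty) blocksForPush.put(newBlock)
    }
  }, 200, 200) // block interval in milliseconds

  // Pushing thread: drain the queue and hand each block over for storage
  // (this plays the role of keepPushingBlocks -> pushArrayBuffer).
  private val blockPushingThread = new Thread("block-pushing-thread") {
    override def run(): Unit = while (true) storeBlock(blocksForPush.take())
  }
  blockPushingThread.start()

  // Stand-in for BlockManager.put plus the report to ReceiverTracker.
  private def storeBlock(block: Block): Unit =
    println(s"stored block of ${block.data.length} records generated at ${block.time}")
}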
The pushArrayBuffer function is defined as follows:
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    optionalMetadata: Option[Any],
    optionalBlockId: Option[StreamBlockId]
  ) {
  val blockId = optionalBlockId.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  blockManager.put(blockId, arrayBuffer.asInstanceOf[ArrayBuffer[Any]],
    storageLevel, tellMaster = true)
  logDebug("Pushed block " + blockId + " in " + (System.currentTimeMillis - time) + " ms")
  reportPushedBlock(blockId, arrayBuffer.size, optionalMetadata)
}
Data Structure Change Process
One reason Spark Streaming processes data efficiently is that analysis is performed on data in batches. So how does a batch of data get assembled? Put another way: at any given instant, the data received is a single record, i.e. at most a <t, data> tuple, yet in runJob data is extracted and analyzed in batches. When is the batch actually assembled?
The figure below outlines how a newly received message, after being received by SocketReceiver, goes through a series of steps and ends up in BlockManager, while ReceiverTracker records the corresponding metadata.
- First, the new message is put into BlockGenerator.currentBuffer.
- When the timer fires, all the data in currentBuffer is packaged into a Block and placed into an ArrayBlockingQueue, a data structure that supports FIFO access.
- keepPushingBlocks takes each Block (which contains both the timestamp and the received raw data) and hands it to BlockManager for storage; it also notifies ReceiverTracker which blocks have been stored in BlockManager.
- ReceiverTracker puts the blocks that each stream has received but not yet processed into receivedBlockInfo, which is a HashMap. In the subsequent generateJobs, data is extracted from receivedBlockInfo to generate the corresponding RDD (a simplified sketch of this bookkeeping follows the list).
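A minimal sketch, with made-up names (ReceiverTrackerSketch, ReceivedBlockInfoSketch), of the bookkeeping described in the last bullet: one queue of block metadata per stream, filled by the receiving side and drained by the processing side. This is an illustration only, not the actual Spark code.

import scala.collection.mutable

case class ReceivedBlockInfoSketch(streamId: Int, blockId: String, numRecords: Long)

class ReceiverTrackerSketch {
  // streamId -> blocks that have been stored by BlockManager but not yet
  // handed out to a generated job
  private val receivedBlockInfo =
    new mutable.HashMap[Int, mutable.Queue[ReceivedBlockInfoSketch]]

  // Called after a block has been stored (pushArrayBuffer -> addBlocks in Spark).
  def addBlock(info: ReceivedBlockInfoSketch): Unit = synchronized {
    receivedBlockInfo.getOrElseUpdate(info.streamId,
      new mutable.Queue[ReceivedBlockInfoSketch]) += info
  }

  // Called from the processing side: hand out, and forget, everything received
  // so far for the given stream.
  def getReceivedBlockInfo(streamId: Int): Array[ReceivedBlockInfoSketch] = synchronized {
    receivedBlockInfo.get(streamId) match {
      case Some(queue) => queue.dequeueAll(_ => true).toArray
      case None        => Array.empty[ReceivedBlockInfoSketch]
    }
  }
}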
Data Processing Process
The most important function in the data processing stage is generateJobs, which triggers the following call chain (the code for each intermediate step is not listed one by one):
- JobGenerator.generateJobs -> DStreamGraph.generateJobs -> DStream.generateJob -> getOrCompute -> compute, which generates the RDD
- The Job then calls job.func when it runs
The JobGenerator.generateJobs function is defined as follows:
private def generateJobs(time: Time) {
  SparkEnv.set(ssc.env)
  Try(graph.generateJobs(time)) match {
    case Success(jobs) =>
      val receivedBlockInfo = graph.getReceiverInputStreams.map { stream =>
        val streamId = stream.id
        val receivedBlockInfo = stream.getReceivedBlockInfo(time)
        (streamId, receivedBlockInfo)
      }.toMap
      jobScheduler.submitJobSet(JobSet(time, jobs, receivedBlockInfo))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventActor ! DoCheckpoint(time)
}
Now let's see how the data processing phase hooks up with the data stored during the receiving phase described above.
Suppose the last RDD was processed at time t1 and the current time is t2. Which blocks received in the interval (t1, t2] have not yet been processed?
You can probably guess the answer: all the unprocessed blocks are kept in ReceiverTracker's receivedBlockInfo.
In generateJob, each DStream calls getReceivedBlockInfo. You may object that this is not the receivedBlockInfo held by ReceiverTracker. Don't worry; look at how getReceivedBlockInfo is defined in the data input source ReceiverInputDStream. The code is as follows.
private[streaming] def getReceivedBlockInfo(time: Time) = {
  receivedBlockInfo(time)
}
So where does receivedBlockInfo(time) get filled in? The answer lies in the compute implementation of ReceiverInputDStream.
override def compute(validTime: Time): Option[RDD[T]] = {
  // If this is called for any time before the start time of the context,
  // then this returns an empty RDD. This may happen when recovering from a
  // master failure
  if (validTime >= graph.startTime) {
    val blockInfo = ssc.scheduler.receiverTracker.getReceivedBlockInfo(id)
    receivedBlockInfo(validTime) = blockInfo
    val blockIds = blockInfo.map(_.blockId.asInstanceOf[BlockId])
    Some(new BlockRDD[T](ssc.sc, blockIds))
  } else {
    Some(new BlockRDD[T](ssc.sc, Array[BlockId]()))
  }
}
Here we finally see ReceiverTracker.getReceivedBlockInfo being called; in other words, the data stored during the receiving phase is connected to the input of the current processing phase.
The function call path continues from generateJobs down to job submission on SparkContext. Note that it is the operations registered as output streams in DStreamGraph that cause SparkContext.runJob to be invoked. Let's take the print function as an example and look at the call process.
def print() {
  def foreachFunc = (rdd: RDD[T], time: Time) => {
    val first11 = rdd.take(11)
    println("-------------------------------------------")
    println("Time: " + time)
    println("-------------------------------------------")
    first11.take(10).foreach(println)
    if (first11.size > 10) println("...")
    println()
  }
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
}
Note: RDD.take triggers a runJob call. If you don't believe it, take a look at the part of its definition that calls runJob:
val left = num - buf.size
val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
res.foreach(buf ++= _.take(num - buf.size))
partsScanned += numPartsToTry
Summary of the Data Processing Process
- Use the batch time as the key to retrieve all block IDs added before that time.
- When the job is actually submitted and run, the block fetcher inside the RDD uses each block ID as the key to obtain the real data, i.e. the raw data received from the socket, from BlockManagerMaster (see the sketch after this list).
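As a rough illustration of these two steps, under the assumption of a simple in-memory store (BatchAssemblySketch, blockIdsForBatch, and the other names here are made up, not Spark APIs):

object BatchAssemblySketch {
  case class BlockMeta(blockId: String, reportedAt: Long)

  // Metadata reported by the receiving side as blocks were stored.
  val reportedBlocks = Seq(
    BlockMeta("input-0-1", 1000L),
    BlockMeta("input-0-2", 2000L),
    BlockMeta("input-0-3", 3500L))

  // Step 1: time is the key; collect every block ID reported at or before the batch time.
  def blockIdsForBatch(batchTime: Long): Seq[String] =
    reportedBlocks.filter(_.reportedAt <= batchTime).map(_.blockId)

  // Step 2: when the job actually runs, each block ID is resolved to the raw
  // bytes the receiver stored (BlockManagerMaster plays this role in Spark).
  def fetchBlocks(blockIds: Seq[String], blockStore: Map[String, Array[Byte]]): Seq[Array[Byte]] =
    blockIds.flatMap(blockStore.get)

  def main(args: Array[String]): Unit = {
    val ids = blockIdsForBatch(batchTime = 3000L) // picks input-0-1 and input-0-2
    println(ids)
  }
}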
Fault Tolerance Processing
At the end of the JobGenerator.generateJobs function, a DoCheckpoint notification is sent, which causes the corresponding actor to write DStreamCheckpointData to a file on HDFS. Let's look at why checkpoint data needs to be written and what it contains.
In the data processing section we saw that generateJobs produces multiple jobs, which are sent to the cluster through the SparkContext.runJob interface for actual execution.
Problem 1: suppose a worker crashes at t2 and is not fully recovered until t3. Because of the failure, the jobs generated by the last generateJobs may not have been completely processed (some finished, some did not), so they need to be submitted again. This may cause the same batch of data to be processed repeatedly, so the exactly-once semantics cannot be achieved.
Problem 2: no new data is received within the interval (t2, t3), so anything sent to the receiver during that window is simply lost. This is why Spark Streaming's SocketReceiver is better suited to act as the client side rather than the server side: the data it reads should live in an in-memory database or a cache queue with a redundancy and backup mechanism, such as Kafka. Spark Streaming itself cannot solve problem 2, and pursuing it further leads to the topic of load balancing.
Checkpoint Data
What member variables does Checkpoint contain? Let's take a look at its definition.
val master = ssc.sc.master
val framework = ssc.sc.appName
val sparkHome = ssc.sc.getSparkHome.getOrElse(null)
val jars = ssc.sc.jars
val graph = ssc.graph
val checkpointDir = ssc.checkpointDir
val checkpointDuration = ssc.checkpointDuration
val pendingTimes = ssc.scheduler.getPendingTimes().toArray
val delaySeconds = MetadataCleaner.getDelaySeconds(ssc.conf)
val sparkConfPairs = ssc.conf.getAll
generatedRDDs is contained inside graph, so don't panic when you notice that generatedRDDs does not seem to be saved separately.
The checkpoint data is written to HDFS by CheckpointWriteHandler and read back by CheckpointReader. CheckpointReader is used during restart to determine whether this is a clean first start or a restart caused by an error; the decision is based on the cp variable.
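A minimal sketch of this restart decision, using entirely hypothetical helper names (the real logic lives in CheckpointReader and StreamingContext.getOrCreate):

import java.io.File

object RestartDecisionSketch {
  case class CheckpointSketch(checkpointDir: String)

  // Stand-in for CheckpointReader: the real code deserializes Checkpoint objects
  // from HDFS; here we only check whether any checkpoint files are present.
  def readCheckpoint(dir: String): Option[CheckpointSketch] = {
    val files = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    if (files.nonEmpty) Some(CheckpointSketch(dir)) else None
  }

  // If checkpoint data is readable, this is a restart after a failure and the
  // context is rebuilt from it; otherwise it is a clean first start.
  def startContext(dir: String)(createFresh: () => Unit)(restore: CheckpointSketch => Unit): Unit =
    readCheckpoint(dir) match {
      case Some(cp) => restore(cp)   // cp is defined: restore the graph and pending batch times
      case None     => createFresh() // cp is empty: brand-new StreamingContext
    }
}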
To have the corresponding checkpoint data checked and loaded automatically after a restart, you cannot simply call new StreamingContext when creating the StreamingContext; you must use the getOrCreate function instead. Sample code follows.
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()
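Note that checkpointDirectory should point to fault-tolerant storage such as HDFS, so that the checkpoint data itself survives node failures; on restart, getOrCreate reads it back through the CheckpointReader path described above.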
Summary
The two figures used in this article to describe the data receiving process were drawn with TikZ; they contain a wealth of information and are meant to help you understand the internal processing mechanism of Spark Streaming. You may wish to read them alongside the source code.
If there are any mistakes or inaccuracies, criticism and corrections are welcome.