Contents of this issue
1. ReceivedBlockTracker fault-tolerant security
2. DStreamGraph and JobGenerator fault-tolerant security
In the era of stream processing, data that cannot be processed in real time quickly loses its value. Spark Streaming is therefore highly attractive and has strong development prospects; combined with Spark's ecosystem, a streaming job can easily call on other powerful frameworks such as Spark SQL and MLlib, which gives it a distinct edge.
At runtime, Spark Streaming is not so much a streaming framework on top of Spark Core as it is one of the most complex applications built on Spark Core. If you can master Spark Streaming, this most complex of applications, then other complex applications become easy to handle. Choosing Spark Streaming as the starting point for a custom version is also the general trend.
At the data level, the ReceivedBlockTracker records the metadata of the entire Spark Streaming application.
At the scheduling level, DStreamGraph and JobGenerator are the heart of Spark Streaming scheduling, recording how far the current schedule has progressed along with the business logic it carries.
Let's start from the ReceiverTracker angle.
If the WAL is enabled, addBlock writes the block metadata to the WAL and then adds it to the ReceivedBlockQueue:
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    } else {
      logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
    }
    writeResult
  } catch {
    case NonFatal(e) =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
Here is a look at the writeToLog method:
private def writeToLog(record: ReceivedBlockTrackerLogEvent): Boolean = {
  if (isWriteAheadLogEnabled) {
    logTrace(s"Writing record: $record")
    try {
      writeAheadLogOption.get.write(ByteBuffer.wrap(Utils.serialize(record)),
        clock.getTimeMillis())
      true
    } catch {
      case NonFatal(e) =>
        logWarning(s"Exception thrown while writing record: $record to the WriteAheadLog.", e)
        false
    }
  } else {
    true
  }
}
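The pattern here, log first and only then mutate in-memory state, can be sketched with a toy in-memory write-ahead log. All names below (ToyWriteAheadLog, BlockEvent) are hypothetical stand-ins for illustration, not Spark APIs:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

// Hypothetical event type standing in for ReceivedBlockTrackerLogEvent.
case class BlockEvent(streamId: Int, blockId: String) extends Serializable

// Toy in-memory WAL: serialize the event and append it; only if the append
// succeeds is the caller allowed to update its in-memory bookkeeping.
class ToyWriteAheadLog {
  private val entries = ArrayBuffer[Array[Byte]]()

  def write(event: BlockEvent): Boolean = {
    try {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(event)
      oos.close()
      entries += bos.toByteArray // a durable log would append to disk/HDFS here
      true
    } catch {
      case NonFatal(_) => false
    }
  }

  // Replay all logged events, as done during driver recovery.
  def replay(): Seq[BlockEvent] = entries.map { bytes =>
    new ObjectInputStream(new ByteArrayInputStream(bytes))
      .readObject().asInstanceOf[BlockEvent]
  }.toSeq
}
```

Usage mirrors addBlock: write first, and only on a true result update the queue, so a recovered driver can replay the log and rebuild exactly the state that was acknowledged.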
Next, when the JobScheduler generates a job, take a look at allocateBlocksToBatch, which allocates the received blocks to a batch:
def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
  if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
    val streamIdToBlocks = streamIds.map { streamId =>
      (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
    }.toMap
    val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
    if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
      timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
      lastAllocatedBatchTime = batchTime
    } else {
      logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
    }
  } else {
    logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
  }
}
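The allocation logic, drain every stream's queue into a snapshot keyed by batch time, but only commit it after the WAL write succeeds, can be sketched as follows. ToyBlockTracker and the logWriteOk flag are hypothetical simplifications, not Spark APIs:

```scala
import scala.collection.mutable

// Hypothetical miniature of the allocation step.
class ToyBlockTracker(streamIds: Seq[Int]) {
  private val queues =
    mutable.Map(streamIds.map(id => id -> mutable.Queue[String]()): _*)
  private val timeToBlocks = mutable.Map[Long, Map[Int, Seq[String]]]()
  private var lastAllocatedBatchTime: Long = -1L

  def addBlock(streamId: Int, blockId: String): Unit = queues(streamId) += blockId

  def allocateBlocksToBatch(batchTime: Long, logWriteOk: Boolean = true): Unit = {
    if (batchTime > lastAllocatedBatchTime) {
      // Drain every queue into an immutable snapshot for this batch.
      val snapshot = streamIds.map(id => id -> queues(id).dequeueAll(_ => true).toList).toMap
      if (logWriteOk) { // stands in for writeToLog(BatchAllocationEvent(...))
        timeToBlocks(batchTime) = snapshot
        lastAllocatedBatchTime = batchTime
      }
    } // an older or repeated batch time is ignored, as in the real tracker
  }

  def blocksFor(batchTime: Long): Map[Int, Seq[String]] =
    timeToBlocks.getOrElse(batchTime, Map.empty)
}
```

The lastAllocatedBatchTime guard is what makes replayed allocation events idempotent during WAL recovery: re-delivering an allocation for an already-allocated batch time changes nothing.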
Although writeToLog is called on every allocation, the method itself (shown above) first checks whether the WAL is enabled via isWriteAheadLogEnabled, and simply returns true when it is not.
Once the write completes, the ReceiverTracker part of fault tolerance, that is, the data level, is done.
Next, look at fault tolerance from the job-generation perspective, that is, the scheduling level.
Each time a job is generated, based on the fixed batchInterval, the JobGenerator calls doCheckpoint against the DStreamGraph:
private def doCheckpoint(time: Time, clearCheckpointDataLater: Boolean) {
  if (shouldCheckpoint && (time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)) {
    logInfo("Checkpointing graph for time " + time)
    ssc.graph.updateCheckpointData(time)
    checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
  }
}
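The guard in doCheckpoint only fires when the time elapsed since the graph's zero time is an exact multiple of the checkpoint duration. Modeled with plain millisecond Longs (shouldCheckpointAt is a hypothetical helper mirroring Time.isMultipleOf, not a Spark API):

```scala
// Checkpoint only when (time - zeroTime) is an exact multiple of the
// checkpoint interval, mirroring (time - graph.zeroTime).isMultipleOf(...).
def shouldCheckpointAt(time: Long, zeroTime: Long, checkpointDuration: Long): Boolean =
  (time - zeroTime) % checkpointDuration == 0
```

So with a zero time of 0 ms and a checkpoint duration of 5000 ms, batches at 5000 and 10000 trigger a checkpoint while a batch at 12000 does not.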
doCheckpoint in turn calls updateCheckpointData, which refreshes the checkpoint data for the given time:
// DStreamGraph
def updateCheckpointData(time: Time) {
  logInfo("Updating checkpoint data for time " + time)
  this.synchronized {
    outputStreams.foreach(_.updateCheckpointData(time))
  }
  logInfo("Updated checkpoint data for time " + time)
}

// DStream
private[streaming] def updateCheckpointData(currentTime: Time) {
  logDebug("Updating checkpoint data for time " + currentTime)
  checkpointData.update(currentTime)
  dependencies.foreach(_.updateCheckpointData(currentTime))
  logDebug("Updated checkpoint data for time " + currentTime + ": " + checkpointData)
}
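The recursion above, where each DStream updates its own checkpoint data and then forwards the call along its dependencies, can be sketched with a toy lineage. ToyDStream is a hypothetical stand-in for DStream, not a Spark class:

```scala
import scala.collection.mutable

// Hypothetical miniature of the recursive update: each node records the
// checkpointed time for itself (checkpointData.update in Spark), then
// forwards the call to its dependencies.
class ToyDStream(val name: String, val dependencies: Seq[ToyDStream]) {
  val checkpointedTimes = mutable.ArrayBuffer[Long]()

  def updateCheckpointData(currentTime: Long): Unit = {
    checkpointedTimes += currentTime
    dependencies.foreach(_.updateCheckpointData(currentTime))
  }
}
```

Starting the call at the output streams therefore reaches every upstream DStream in the lineage, so the whole graph's state for that batch time ends up in the checkpoint.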
In summary, the ReceivedBlockTracker provides fault tolerance at the data level, by way of the WAL,
while DStreamGraph and JobGenerator provide it at the scheduling level, by way of checkpointing.
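Both driver fault-tolerance mechanisms must be enabled by the application itself. A minimal configuration sketch (the checkpoint directory path and app name are illustrative; the stream definitions are elided):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///spark/streaming/checkpoint" // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("DriverFaultToleranceDemo")
    // Enable the receiver-side write-ahead log used by ReceivedBlockTracker.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(5))
  // Metadata checkpointing for DStreamGraph / JobGenerator recovery.
  ssc.checkpoint(checkpointDir)
  // ... define input streams and output operations here ...
  ssc
}

// On restart, recover the driver state from the checkpoint if one exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```

With this in place, a restarted driver rebuilds the DStreamGraph and pending batches from the checkpoint, and the ReceivedBlockTracker replays its WAL to recover block metadata.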
Note:
Data from: DT Big Data Dream Factory (Spark release version customization)
Spark Version Customization, Day 13: Driver Fault Tolerance