Analysis of Spark Streaming principles
Data Receiving Process
When a StreamingContext is instantiated, it needs a SparkContext; it then uses the Spark master URL to connect to the Spark engine and obtain executors.
After instantiation, you first specify how data will be received, for example:

val lines = ssc.socketTextStream("localhost", 9999)

This receives text data from a socket. Behind this call is a ReceiverInputDStream implementation, which contains a Receiver that receives the data and turns it into RDDs in memory.
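A minimal sketch of this setup, assuming a local master, a two-second batch interval, and word-count logic purely for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Master URL and app name go into SparkConf; StreamingContext
    // builds the SparkContext internally from it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")
    val ssc  = new StreamingContext(conf, Seconds(2)) // batch interval is assumed

    // ReceiverInputDStream backed by a socket Receiver
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()            // starts the receivers and the scheduler
    ssc.awaitTermination()
  }
}
```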
ReceiverInputDStream has one method that subclasses must implement:

def getReceiver(): Receiver[T]

Through this method the worker nodes obtain the Receiver, and the received data is distributed across the workers. A Receiver runs locally and stores the data it receives on that node, so when starting a streaming application, make sure the number of allocated cores is greater than the number of Receivers, leaving CPUs free for scheduling the computing tasks.
Receiver subclasses must implement

def onStart()
def onStop()

which define how the receiver is initialized, how received data is stored, and how resources are released at the end. Receiver provides a series of store() interfaces, such as store(ByteBuffer), store(Iterator), and so on. These store interfaces are backed by a ReceiverSupervisor that is initialized on the worker node and carries out the actual storage. The ReceiverSupervisor also monitors the Receiver, for example whether it has started, stopped, or restarted, and whether it has reported errors.
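To illustrate the onStart()/onStop()/store() contract, here is a hedged sketch of a custom socket-based Receiver; the class name, fields, and the blocking read loop are assumptions for the example, not code taken from the text above:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom Receiver: reads lines from a socket and hands them to the
// framework via store(), which is carried out on the worker node by the
// ReceiverSupervisor together with the BlockManager.
class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns quickly.
    new Thread("LineReceiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing extra to release; the reading thread exits once isStopped() is true.
  }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)              // storage is delegated to the ReceiverSupervisor
        line = reader.readLine()
      }
      restart("Trying to connect again")
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}
```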
ReceiverSupervisor stores the data with the help of the BlockManager. The data is stored in the form of RDD blocks, with the storage policy chosen according to the StorageLevel: by default it is serialized and kept in memory, and if it does not fit in memory it is written to the executor's disk. The intermediate results of computed RDDs use the default policy of serialized in-memory storage only.
When the ReceiverSupervisor performs a putBlock operation, it has the BlockManager store the data and then sends an AddBlock message to the ReceiverTracker. Inside the ReceiverTracker, a ReceivedBlockTracker maintains the information of all blocks received by the workers, i.e. the BlockInfo, so the AddBlock information is recorded in the ReceivedBlockTracker. Later, when computation is needed, the ReceiverTracker obtains the corresponding block list from the ReceivedBlockTracker by streamId.
RateLimiter helps control the Receiver's receiving speed, configured through the spark.streaming.receiver.maxRate parameter.
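For example, the cap can be set through SparkConf; the 1000 records per second figure below is purely illustrative:

```scala
import org.apache.spark.SparkConf

// Limit each Receiver to at most 1000 records per second (illustrative value).
val conf = new SparkConf()
  .setAppName("RateLimitedApp")
  .set("spark.streaming.receiver.maxRate", "1000")
```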
As for data sources, the basic ones are files, sockets, Akka actors, and RDD queues; advanced sources include Twitter, Kafka, and Flume. Developers can also implement custom data sources.
Task Scheduling
JobScheduler is initialized in the StreamingContext. When the context is started, it triggers the scheduler's start, and the scheduler's start in turn starts the ReceiverTracker and the JobGenerator. These two classes are the core of task scheduling: the former starts the Receivers on the workers to receive data and exposes an interface that returns a batch's block list by streamId; the latter generates task descriptions based on the data and the time.
JobScheduler contains a thread pool for scheduling job execution. The spark.streaming.concurrentJobs parameter controls the job concurrency; the default is 1, meaning jobs are executed one at a time.
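A hedged example of raising that concurrency; note that running jobs concurrently relaxes the strict batch-by-batch ordering:

```scala
import org.apache.spark.SparkConf

// Allow up to 2 streaming jobs to run at the same time instead of the default 1.
val conf = new SparkConf()
  .setAppName("ConcurrentJobsApp")
  .set("spark.streaming.concurrentJobs", "2")
```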
Jobs come from the JobSet generated by the JobGenerator. The JobGenerator generates jobs and performs checkpoint (cp) operations according to the time.
JobGenerator's job generation logic:
- Call the allocateBlocksToBatch method of the ReceiverTracker to allocate the blocks for this batch, i.e. to prepare the data.
- Indirectly call the generateJob(time) method of the DStream to produce the executable RDD.
DStream turns the batch into an executable RDD via getOrCompute(time) (a simplified sketch follows this list):
- If the RDD for this time point has already been generated, take it from the in-memory hashmap; otherwise go to the next step.
- If the time is an integer multiple of the batch interval, go to the next step; otherwise this time point is not valid.
- Call the compute method of the DStream subclass to obtain the RDD; it may be a single RDD or a list of RDDs.
- Call persist on each RDD to apply the default storage policy, and call checkpoint to set the cp policy.
- Call SparkContext.runJob(rdd, emptyFunction), wrapping this whole call into a function that is used to construct the Job class; runJob is what gets triggered on the executors later.
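A simplified paraphrase of that getOrCompute flow is sketched below. This is not the exact Spark source; generatedRDDs, storageLevel, checkpointDuration, and zeroTime stand for internal DStream state and are referenced here only to mirror the steps above:

```scala
// Simplified sketch of DStream.getOrCompute(time), following the steps above.
private def getOrCompute(time: Time): Option[RDD[T]] = {
  // 1. Reuse the RDD if this time point has already been generated.
  generatedRDDs.get(time).orElse {
    // 2. Only time points aligned with the batch interval are valid.
    if (isTimeValid(time)) {
      // 3. The subclass-specific compute() builds the RDD for this batch.
      compute(time).map { rdd =>
        // 4. Apply the default storage policy and, if configured, the cp policy.
        if (storageLevel != StorageLevel.NONE) rdd.persist(storageLevel)
        if (checkpointDuration != null &&
            (time - zeroTime).isMultipleOf(checkpointDuration)) {
          rdd.checkpoint()
        }
        generatedRDDs.put(time, rdd)
        rdd
      }
    } else {
      None
    }
  }
}
```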
After the JobGenerator successfully generates a job, it calls JobScheduler.submitJobSet(JobSet), and the JobScheduler submits all jobs in the JobSet through the thread pool. After this call, the JobGenerator sends a DoCheckpoint message. Note that the cp here is the driver's metadata cp, not the cp of the RDDs themselves. If the time is right, the cp operation is triggered, and the CheckpointWriter class completes write(streamingContext, time).
When the JobScheduler submits a job, the job's run() method is triggered, and the JobScheduler then handles JobCompleted(job). If the job ran successfully, it calls handleJobCompletion(Job) on the JobSet to do some timing and bookkeeping work. If the entire JobSet has completed, it calls the onBatchCompletion(time) method of the JobGenerator; the JobGenerator then performs clearMetadata, and the JobScheduler prints the output. If the job failed, the JobScheduler reports the error, and finally an exception is thrown in the context.
More Notes
Transform: it can interact with external RDDs, for example joining with a dimension table.
UpdateStateByKey: generates a StateDStream, used for example for incremental computation.
Each batch must cogroup its incremental RDD with the state RDD and then apply the update function. The cogroup of two RDDs has some overhead: RDD[(K, V)] and RDD[(K, U)] are combined into RDD[(K, (List[V], List[U]))], where List[U] generally has size 1 and is interpreted as the old value, i.e. RDD[(K, (batchValueList, Option[oldValue]))]. After the update function is applied, it becomes RDD[(K, newValue)]. The operations between batches are strictly ordered, i.e. the incremental merges run batch after batch with no concurrency between them; what can be increased is the number of partitions of the incremental RDD, so that the computation within each incremental merge runs in parallel.
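A hedged usage sketch of updateStateByKey as a running count, reusing the ssc and lines values from the earlier sketch; the update function and cp directory below are assumptions for illustration (stateful streams require a cp directory):

```scala
// Running count per key: newValues holds this batch's values, runningCount the old state.
def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

ssc.checkpoint("/tmp/streaming-cp")   // path assumed; required for stateful streams

val runningCounts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .updateStateByKey(updateCount _)

runningCounts.print()
```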
Window: a sliding-window operation defined by three parameters: batch size, window length, and slide interval. It combines the RDDs of multiple batches into one computation.
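For example, a word count over a 30-second window that slides every 10 seconds (both durations are illustrative and must be multiples of the batch interval; lines is reused from the earlier sketch):

```scala
import org.apache.spark.streaming.Seconds

// Counts over the last 30 seconds, recomputed every 10 seconds.
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()
```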
ForeachRDD: this is an output operation, and a somewhat special one.
```scala
/**
 * Apply a function to each RDD in this DStream. This is an output operator, so
 * 'this' DStream will be registered as an output stream and therefore materialized.
 */
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit) {
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc, false)).register()
}
```
Integration with Spark SQL and DataFrames: the control logic is similar and is typically done inside foreachRDD; a sketch is shown below.
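A hedged sketch of that integration, converting each batch's RDD to a DataFrame inside foreachRDD; the Record case class, the temp view name, and the reuse of lines from the earlier sketch are assumptions:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Record(word: String)

val words = lines.flatMap(_.split(" "))

words.foreachRDD { (rdd: RDD[String], time) =>
  // Reuse (or lazily create) a SparkSession on the driver for each batch.
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  val df = rdd.map(w => Record(w)).toDF()
  df.createOrReplaceTempView("words")
  spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
}
```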
Cache
For window operations, the received data is persisted in memory by default.
For Flume and Kafka sources, the received data is replicated into two copies by default.
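By way of example, the default can be overridden by passing an explicit StorageLevel when a receiver-based input stream is created (ssc is reused from the earlier sketch):

```scala
import org.apache.spark.storage.StorageLevel

// Keep a single serialized copy in memory, spilling to disk, instead of the
// replicated default used by receiver-based sources.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER)
```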
Checkpoint
The result RDDs of stateful streaming computations are checkpointed (cp) to HDFS. The original documentation reads as follows:
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
The cp interval can also be set, so that one cp operation covers multiple batches.
Cp operations are synchronous.
For simple streaming jobs without stateful operations, checkpointing can be disabled.
The driver metadata also has a cp policy: when driver cp is performed, the StreamingContext object is written to reliable storage.
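A hedged sketch that wires both kinds of cp together; the HDFS path, batch interval, and the body of createContext are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/user/spark/streaming-cp" // path assumed

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc  = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint(checkpointDir)   // enables driver metadata cp and a home for RDD cp
  val lines = ssc.socketTextStream("localhost", 9999)
  // ... build the rest of the DStream graph here ...
  ssc
}

// On restart, the driver recovers the StreamingContext from cp data if it exists,
// otherwise it builds a fresh context with createContext().
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```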