Analysis of Spark Streaming principles

Data Receiving Process

When a StreamingContext is instantiated, you need to pass in a SparkContext and specify the Spark master URL so it can connect to the Spark engine and obtain executors.

After instantiation, you must first specify how data is received, for example:

val lines = ssc.socketTextStream("localhost", 9999)

In this way, text data is received from a socket. This step is backed by a ReceiverInputDStream implementation, which contains a Receiver that receives the data and turns it into RDDs in memory.
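
For orientation, a minimal end-to-end setup might look like the following sketch; the application name, master URL, port, and batch interval are illustrative choices, not taken from this article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative configuration: at least 2 local cores so one can run the Receiver
// while the other handles computation (see the note on cores below).
val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch interval (illustrative)

val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()             // starts the Receiver and the scheduling machinery
ssc.awaitTermination()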

ReceiverInputDStream has a method that subclasses must implement:

def getReceiver(): Receiver[T]

Through this method the Receiver is obtained and distributed to the worker nodes, where it receives the data.

For local runs, the data received by the Receiver is stored locally. Therefore, when starting a streaming application, make sure the number of allocated cores is greater than the number of Receivers, so that CPU remains available for scheduling computing tasks.

Receiver subclasses must implement:

def onStart()
def onStop()

These define how the data receiver is initialized, how the received data is stored, and how resources are released at the end.
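
As a sketch only, a minimal custom Receiver might look like the following; the socket-based source, class name, and threading details are illustrative assumptions rather than code from this article:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that reads lines from a socket and hands them to store().
class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    // Start a separate thread so onStart() returns quickly; the framework
    // (via ReceiverSupervisor) manages the receiver's lifecycle.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to release here: the reading thread checks isStopped() and exits on its own.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)                      // handed off to the ReceiverSupervisor for storage
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

Such a receiver would be plugged into a stream with ssc.receiverStream(new LineReceiver("localhost", 9999)).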

Receiver provides a series of store() interfaces, such as store(ByteBuffer), store(Iterator), and so on. These store calls are backed by a ReceiverSupervisor, which is initialized on the worker node and completes the actual storage work. The ReceiverSupervisor also monitors the Receiver, for example whether it has started, stopped, restarted, or reported an error.

With the help of the BlockManager, the ReceiverSupervisor stores the data in the form of RDDs, selecting a storage policy according to the StorageLevel. By default, received data is serialized and kept in memory; if it does not fit in memory, it is written to the executor's disk. The intermediate results of computed RDDs, by contrast, default to serialized in-memory storage only.

When performing a putBlock operation, the ReceiverSupervisor has the BlockManager store the data and then sends an AddBlock message to the ReceiverTracker. Inside the ReceiverTracker, a ReceivedBlockTracker maintains the information about all blocks received by the workers, that is, the BlockInfo, so the AddBlock information is stored in the ReceivedBlockTracker. Later, when computation is required, the ReceiverTracker obtains the corresponding block list from the ReceivedBlockTracker according to the streamId.

A RateLimiter helps control the Receiver's ingestion speed, configured through the spark.streaming.receiver.maxRate parameter.
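
For example, the cap might be set through the configuration as in this sketch (the value of 1000 records per second is purely illustrative):

import org.apache.spark.SparkConf

// Limit each Receiver to at most 1000 records per second (illustrative value).
val conf = new SparkConf()
  .setAppName("RateLimitedApp")
  .set("spark.streaming.receiver.maxRate", "1000")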

As for data sources, the basic ones include files, sockets, Akka actors, and queues of RDDs. Advanced sources include Twitter, Kafka, and Flume. Developers can also implement custom data sources.

Task Scheduling

The JobScheduler is initialized in the context. When the context is started, it triggers the start of the scheduler.

The scheduler's start in turn triggers the ReceiverTracker and the JobGenerator; these two classes are the core of task scheduling. The former starts Receivers on the workers to receive data and exposes an interface that returns the block addresses of a batch for a given streamId. The latter generates task descriptions based on data and time.

The JobScheduler contains a thread pool for scheduling job execution. The spark.streaming.concurrentJobs parameter controls job concurrency; its default value is 1, meaning jobs are taken and executed one at a time.
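
If more concurrency is wanted, that knob can be raised as in this sketch; the value is arbitrary, and running jobs from different batches concurrently has ordering implications:

import org.apache.spark.SparkConf

// Allow up to 4 jobs (from different batches) to run concurrently (illustrative value).
val conf = new SparkConf().set("spark.streaming.concurrentJobs", "4")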

Jobs come from the JobSet produced by the JobGenerator, which generates jobs and performs checkpoints (cp) according to the time.

JobGenerator's job generation logic:
- Call ReceiverTracker's allocateBlocksToBatch method to allocate blocks to this batch, that is, to prepare the data.
- Indirectly call DStream's generateJob(time) method to produce the executable RDD.

DStream produces the executable RDD in getOrCompute(time):
- If the RDD for this time point has already been generated, take it from the in-memory hashmap; otherwise, go to the next step.
- If the time is an integer multiple of the batch interval, go to the next step; otherwise, this time point is not valid.
- Call the DStream subclass's compute method to obtain the RDD; it may be a single RDD or a list of RDDs.
- Call the persist method on each RDD to set the default storage policy, and call the checkpoint method to set the cp policy.
- Wrap the call SparkContext.runJob(rdd, emptyFunction) into a function and use it to generate a Job object; the runJob will be triggered on the executors later.

After the jobs are successfully generated, the JobGenerator calls JobScheduler.submitJobSet(JobSet), and the JobScheduler submits all jobs in the JobSet using the thread pool. After this call, the JobGenerator sends a DoCheckpoint message. Note that the cp here is the driver's metadata checkpoint, not the checkpoint of the RDDs themselves. If the time is appropriate, the cp operation is triggered, and the CheckpointWriter class completes write(streamingContext, time).

When a job is submitted, the JobScheduler triggers the job's run() method and then handles JobCompleted(job). If the job runs successfully, it calls JobSet's handleJobCompletion(Job) to do some timing and bookkeeping work. If the entire JobSet is complete, it calls JobGenerator's onBatchCompletion(time) method; the JobGenerator then does clearMetadata, and the JobScheduler prints the output. If the job fails, the JobScheduler reports the error, and finally an exception is thrown in the context.

Further Notes

Transform: allows interaction with external RDDs, for example joining against a dimension table.
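
A sketch of that pattern, assuming an existing StreamingContext ssc; the dimension-table contents, host, port, and record format are made up for illustration:

// Static dimension table: id -> name (illustrative data).
val dimTable = ssc.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))

// Stream of (id, payload) pairs parsed from socket lines such as "1,click".
val events = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(parts => (parts(0).toInt, parts(1)))

// transform exposes each batch RDD, so it can be joined with the static dimension table.
val enriched = events.transform(rdd => rdd.join(dimTable))   // DStream[(Int, (String, String))]
enriched.print()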

UpdateStateByKey: generates a StateDStream, used for example in incremental computation.

Each batch cogroups its incremental RDD with the state RDD and then runs the update function. The cogroup of the two RDDs has some overhead: RDD[K, V] and RDD[K, U] are combined into RDD[K, (List[V], List[U])], where List[U] generally has size 1 and is interpreted as the old value, giving RDD[K, (batchValueList, Option[oldValue])]. After the update function is applied, the result becomes RDD[K, newValue]. The operations between batches are strictly ordered, that is, incremental merges are not concurrent across batches. The number of partitions of the incremental RDD can be increased, so that the computation within each incremental merge runs concurrently.
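
A minimal sketch of incremental word counting with updateStateByKey, assuming an existing StreamingContext ssc; the checkpoint path and socket source are illustrative, and checkpointing must be enabled for stateful operations (see the Checkpoint section below):

// Stateful operations require a checkpoint directory (illustrative path).
ssc.checkpoint("/tmp/streaming-checkpoint")

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// For each key, merge this batch's values with the previous state: a running count.
val counts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], oldValue: Option[Int]) =>
  Some(newValues.sum + oldValue.getOrElse(0))
}
counts.print()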

Window: a sliding window operation determined by three parameters: batch size, window length, and sliding interval. This operation combines the RDDs of multiple batches into one computation.
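
For instance, a windowed count over the last 30 seconds, sliding every 10 seconds; the durations are illustrative and must be multiples of the batch interval, and pairs is the DStream[(String, Int)] from the sketch above:

import org.apache.spark.streaming.Seconds

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // aggregate values that fall inside the window
  Seconds(30),                 // window length
  Seconds(10))                 // sliding interval
windowedCounts.print()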

ForeachRDD: this is an output operation, and it is somewhat special.

  /**
   * Apply a function to each RDD in this DStream. This is an output operator, so
   * 'this' DStream will be registered as an output stream and therefore materialized.
   */
  def foreachRDD(foreachFunc: (RDD[T], Time) => Unit) {
    new ForEachDStream(this, context.sparkContext.clean(foreachFunc, false)).register()
  }
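
A typical hedged usage sketch: push each batch to an external system, creating the client per partition rather than per record. ConnectionPool here is a hypothetical helper, not part of Spark, and lines is the DStream[String] from the setup sketch above:

lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // ConnectionPool is a placeholder for whatever external client the application uses.
    val connection = ConnectionPool.getConnection()
    partition.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)
  }
}
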
Integration with Spark SQL and DataFrames

Example

The control logic is similar to the foreachRDD pattern described above.
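
A sketch of that pattern under the older SQLContext API; the case class, table name, and query are illustrative (in Spark 2.x a SparkSession would be used instead), and lines is the DStream[String] from the setup sketch above:

import org.apache.spark.sql.SQLContext

case class Word(text: String)

val words = lines.flatMap(_.split(" "))

words.foreachRDD { rdd =>
  // Reuse (or lazily create) a singleton SQLContext for the RDD's SparkContext.
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Turn the batch RDD into a DataFrame, register it as a table, and query it.
  val wordsDataFrame = rdd.map(Word(_)).toDF()
  wordsDataFrame.registerTempTable("words")
  sqlContext.sql("select text, count(*) as total from words group by text").show()
}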

Cache

For window operations, the received data is persisted in memory by default.

For Flume and Kafka sources, the received data is replicated into two copies by default.

Checkpoint

The result RDDs of stateful stream computations are checkpointed to HDFS. The original documentation reads as follows:

Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.

The cp interval can also be set, so that one cp operation covers multiple batches.
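
For example, the interval for a specific stateful DStream could be set as in this one-line sketch; 10 seconds is an arbitrary multiple of the batch interval, and counts is the stateful DStream from the updateStateByKey sketch above:

import org.apache.spark.streaming.Seconds

counts.checkpoint(Seconds(10))   // checkpoint this DStream every 10 seconds instead of every batch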

Cp operations are synchronous.

For a simple streaming job without stateful operations, checkpointing can be disabled.

The driver metadata also has its own cp policy. When driver cp is performed, the StreamingContext object is written to reliable storage.
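
A sketch of driver metadata checkpointing with StreamingContext.getOrCreate; the checkpoint directory, application name, and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path on reliable storage

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // enables metadata cp (and data cp for stateful DStreams)
  // ... define the DStream graph here ...
  ssc
}

// On a fresh start this builds a new context; after a driver failure the context
// is reconstructed from the checkpoint data instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()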

 
