1. Why introduce backpressure
By default, Spark Streaming receives data through its Receivers at whatever rate the producers generate it. During computation this can lead to the situation Batch processing time > Batch interval, where Batch processing time is the time actually spent computing a batch and Batch interval is the batch duration configured for the Streaming application. In other words, Spark Streaming receives data faster than Spark can take it off the queue: processing capacity falls short, and the data received within one interval cannot be fully processed within that interval. If this condition persists for too long, data accumulates in memory and can cause a memory overflow on the Executor hosting the Receiver, among other problems (if the configured StorageLevel includes disk, data that does not fit in memory spills to disk, which increases latency).
In versions before Spark 1.5, a user who wanted to throttle the Receiver could set the static configuration parameter "spark.streaming.receiver.maxRate". Capping the receive rate this way can keep it within the current processing capacity and prevent memory overflow, but it introduces other problems. For example, if producers generate data faster than maxRate and the cluster can also process faster than maxRate, the cap leads to reduced resource utilization. To better match the data receive rate to the cluster's processing capacity, Spark Streaming introduced the back-pressure mechanism in v1.5, which dynamically adjusts the receive rate to fit what the cluster can process.
2. Backpressure
Spark Streaming backpressure dynamically adjusts the Receiver data reception rate based on the job execution information fed back by the JobScheduler. It is switched on through the property "spark.streaming.backpressure.enabled"; the default value is false, i.e. the mechanism is not enabled.
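Enabling it is a one-line configuration change; a minimal sketch follows. The initialRate line is an optional companion setting available in later Spark releases (stated here as my understanding, not from the original text):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Turn on dynamic rate control; the default is false.
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional companion setting (later Spark versions): bound the very first
  // batch, before any completed-batch feedback exists. The value is an example.
  .set("spark.streaming.backpressure.initialRate", "1000")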
2.1 Streaming architecture (figure omitted here; see the Streaming data reception process documentation and the Streaming source parsing notes)
2.2 The backpressure execution flow is as follows:
On top of the original architecture, a new component, RateController, is added. It listens for the OnBatchCompleted event and extracts the processingDelay and schedulingDelay information from it. A RateEstimator then estimates the maximum processing rate from this information, and finally, for receiver-based input streams, the new rate is passed via ReceiverTracker and ReceiverSupervisorImpl down to BlockGenerator (which extends RateLimiter).
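To make the estimator's role concrete, here is a deliberately naive sketch of the estimator contract. The compute signature mirrors the call shown in the RateController source in section 3.1; Spark's real implementation is the PID-based PIDRateEstimator, and NaiveRateEstimator below is a hypothetical illustration only:

// Contract: given the timing statistics of a completed batch, propose a new
// maximum ingestion rate (records/sec), or None if no estimate is possible.
trait RateEstimator extends Serializable {
  def compute(time: Long, elements: Long,
              processingDelay: Long, schedulingDelay: Long): Option[Double]
}

// Hypothetical, simplified estimator: use the throughput the cluster just
// demonstrated, i.e. records processed per second of processing time.
class NaiveRateEstimator extends RateEstimator {
  override def compute(time: Long, elements: Long,
                       processingDelay: Long, schedulingDelay: Long): Option[Double] =
    if (processingDelay > 0 && elements > 0)
      Some(elements.toDouble / (processingDelay.toDouble / 1000.0)) // delays are in ms
    else
      None
}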
3. Backpressure source parsing
3.1 RateController class hierarchy
RateController extends StreamingListener and is used to handle the BatchCompleted event. The core code is:
/**
 * A StreamingListener that receives batch completion updates, and maintains
 * an estimate of the speed at which this stream should ingest messages,
 * given an estimate computation from a `RateEstimator`
 */
private[streaming] abstract class RateController(val streamUID: Int, rateEstimator: RateEstimator)
    extends StreamingListener with Serializable {

  ......

  /**
   * Compute the new rate limit and publish it asynchronously.
   */
  private def computeAndPublish(time: Long, elems: Long, workDelay: Long, waitDelay: Long): Unit =
    Future[Unit] {
      val newRate = rateEstimator.compute(time, elems, workDelay, waitDelay)
      newRate.foreach { s =>
        rateLimit.set(s.toLong)
        publish(getLatestRate())
      }
    }

  def getLatestRate(): Long = rateLimit.get()

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val elements = batchCompleted.batchInfo.streamIdToInputInfo

    for {
      processingEnd <- batchCompleted.batchInfo.processingEndTime
      workDelay <- batchCompleted.batchInfo.processingDelay
      waitDelay <- batchCompleted.batchInfo.schedulingDelay
      elems <- elements.get(streamUID).map(_.numRecords)
    } computeAndPublish(processingEnd, elems, workDelay, waitDelay)
  }
}
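Two details of this code are worth noting: computeAndPublish wraps the estimation in a Future so that computing and publishing a new rate never blocks the listener bus, and the for-comprehension over the batch's Option fields silently skips any batch whose timing information or record count is missing.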
3.2 RateController registration
When the JobScheduler starts, it extracts the RateController of every InputDStream registered in the DStreamGraph and registers each one with the ListenerBus for monitoring. The code for this section is as follows:
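The relevant excerpt, as it appears in JobScheduler.start in the Spark 1.5+ source (surrounding start-up steps omitted here):

// In JobScheduler.start(): attach rate controllers of input streams
// to receive batch completion updates.
for {
  inputDStream <- ssc.graph.getInputStreams
  rateController <- inputDStream.rateController
} ssc.addStreamingListener(rateController)

listenerBus.start(ssc.sparkContext)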