Spark Streaming Backpressure Analysis

Source: Internet
Author: User


1. Why introduce backpressure

By default, Spark Streaming receives data through a Receiver at the rate the producer generates it, so during computation the situation Batch processing time > Batch interval can arise, where Batch processing time is the time actually spent computing a batch and Batch interval is the batch interval configured for the Streaming application. This means Spark Streaming's data receive rate is higher than the rate at which Spark removes data from the queue; in other words, processing capacity is too low to fully handle the current receive rate within the configured interval. If this condition persists for too long, data accumulates in memory, which can cause memory overflow on the Executor hosting the Receiver, among other problems (if the configured StorageLevel includes disk, data that does not fit in memory spills to disk, increasing latency). In versions before Spark 1.5, a user who wanted to restrict the Receiver's receive rate could set the static configuration parameter "spark.streaming.receiver.maxRate". Capping the receive rate to match current processing capacity does prevent memory overflow, but it introduces other problems: for example, when both the producer's production rate and the cluster's processing capacity are higher than maxRate, resource utilization declines. To better match the data receive rate to the cluster's processing capability, Spark Streaming introduced the backpressure mechanism in v1.5, which dynamically controls the data receive rate to match the cluster's processing capacity.
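For example, the pre-1.5 static cap is set through SparkConf (a sketch; the value is illustrative and must be tuned by hand):

```scala
import org.apache.spark.SparkConf

// Static, pre-backpressure rate cap: at most 1000 records/sec per receiver.
// The value is illustrative; nothing adjusts it when cluster load changes.
val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "1000")
```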

2. Backpressure

Spark Streaming backpressure dynamically adjusts the Receiver's data receive rate according to job-execution feedback from the JobScheduler. Whether the backpressure mechanism is enabled is controlled by the property "spark.streaming.backpressure.enabled"; the default value is false, i.e. not enabled.
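Enabling it is a one-line configuration change (a sketch; the initialRate property and its value are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // turn on dynamic rate control (disabled by default)
  .set("spark.streaming.backpressure.enabled", "true")
  // optional: cap the very first batch, before any feedback is available
  .set("spark.streaming.backpressure.initialRate", "1000")
```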

2.1 The Streaming architecture is as shown in the figure (see the Streaming data reception process and Streaming source parsing documents)

2.2 The backpressure execution process is as follows:

On top of the original architecture, a new component, RateController, is added. It listens for the OnBatchCompleted event and extracts the processingDelay and schedulingDelay information from it. An estimator (RateEstimator) estimates the maximum processing rate from this information, and for receiver-based input streams the rate is finally passed through ReceiverTracker and ReceiverSupervisorImpl down to BlockGenerator (which inherits from RateLimiter).
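The feedback loop above can be sketched in a few lines of self-contained Scala. This is a toy model, not Spark's actual classes: `compute` stands in for the RateEstimator, and `rateLimit` for the limit held by RateLimiter/BlockGenerator (Spark's default PIDRateEstimator is considerably more sophisticated than this naive throughput estimate):

```scala
import java.util.concurrent.atomic.AtomicLong

object BackpressureSketch {
  // Plays the role of RateLimiter/BlockGenerator: the receive-side cap.
  val rateLimit = new AtomicLong(Long.MaxValue)

  // Plays the role of the RateEstimator: if a batch of `elems` records took
  // `workDelay` ms to process, the sustainable rate is roughly
  // elems / workDelay * 1000 records/sec.
  def compute(elems: Long, workDelay: Long): Long =
    math.max(1L, elems * 1000 / math.max(1L, workDelay))

  // Plays the role of RateController.computeAndPublish: react to batch completion.
  def onBatchCompleted(elems: Long, workDelay: Long): Unit =
    rateLimit.set(compute(elems, workDelay))

  def main(args: Array[String]): Unit = {
    // 2000 records took 4000 ms to process -> sustainable rate ~500 records/sec.
    onBatchCompleted(2000L, 4000L)
    println(rateLimit.get()) // prints 500
  }
}
```

The key design point, as in Spark itself, is that the estimate is published asynchronously to a shared limit that the receive path consults, so batch completion feedback throttles ingestion without blocking job execution.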

3. Backpressure source parsing

3.1 The RateController class hierarchy

RateController inherits from StreamingListener and is used to handle the BatchCompleted event. The core code is:

```scala
/**
 * A StreamingListener that receives batch completion updates, and maintains
 * an estimate of the speed at which this stream should ingest messages,
 * given an estimate computation from a `RateEstimator`
 */
private[streaming] abstract class RateController(val streamUID: Int, rateEstimator: RateEstimator)
    extends StreamingListener with Serializable {

  ...

  /**
   * Compute the new rate limit and publish it asynchronously.
   */
  private def computeAndPublish(time: Long, elems: Long, workDelay: Long, waitDelay: Long): Unit =
    Future[Unit] {
      val newRate = rateEstimator.compute(time, elems, workDelay, waitDelay)
      newRate.foreach { s =>
        rateLimit.set(s.toLong)
        publish(getLatestRate())
      }
    }

  def getLatestRate(): Long = rateLimit.get()

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val elements = batchCompleted.batchInfo.streamIdToInputInfo

    for {
      processingEnd <- batchCompleted.batchInfo.processingEndTime
      workDelay <- batchCompleted.batchInfo.processingDelay
      waitDelay <- batchCompleted.batchInfo.schedulingDelay
      elems <- elements.get(streamUID).map(_.numRecords)
    } computeAndPublish(processingEnd, elems, workDelay, waitDelay)
  }
}
```

  

3.2 RateController registration

When the JobScheduler starts, it extracts the RateController of every InputDStream registered in the DStreamGraph and registers it with the ListenerBus for monitoring.
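The relevant excerpt from JobScheduler.start() (Spark 1.5-era sources, abbreviated) looks approximately like this:

```scala
def start(): Unit = synchronized {
  ...
  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)
  ...
}
```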
