1, Why introduce backpressure
By default, Spark Streaming receives data through receivers at the rate at which the producer generates it. During computation, the batch processing time can exceed the batch interval, where batch processing time is the time actually spent computing one batch and batch interval is the interval configured for the streaming application. In that case Spark Streaming receives data faster than Spark removes it from the queue: the processing capacity is too low to handle everything received within the configured interval. If this persists, data accumulates in memory and can cause problems such as out-of-memory errors on the executor where the receiver runs (if the configured StorageLevel includes disk, data that does not fit in memory is written to disk, which increases latency).
Before Spark 1.5, a user who wanted to limit a receiver's data reception rate could set the static configuration parameter "spark.streaming.receiver.maxRate". Limiting the reception rate this way can match the current processing capacity and prevent memory overflow, but it introduces other problems. For example, if both the producer's rate and the cluster's current processing capacity are higher than maxRate, resource utilization drops. To better coordinate the data reception rate with the processing capacity, Spark Streaming introduced the backpressure mechanism in v1.5, which adapts to the cluster's processing capacity by dynamically controlling the reception rate.
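The arithmetic behind both problems can be illustrated with a small standalone simulation. This is only a sketch and does not use Spark: the function name `simulate_backlog` and all of the rates are made-up values chosen for illustration.

```python
def simulate_backlog(receive_rate, process_rate, batch_interval_s, num_batches):
    """Return the number of unprocessed records left after each batch interval.

    receive_rate and process_rate are in records per second (hypothetical values).
    """
    backlog = 0
    history = []
    for _ in range(num_batches):
        # Records that arrive during this batch interval join the queue.
        backlog += receive_rate * batch_interval_s
        # The cluster can process at most process_rate * interval records.
        processed = min(backlog, process_rate * batch_interval_s)
        backlog -= processed
        history.append(backlog)
    return history


# Producer is faster than the cluster: the backlog grows every interval.
unbounded = simulate_backlog(receive_rate=1500, process_rate=1000,
                             batch_interval_s=2, num_batches=5)

# A static cap of 800 records/s keeps the backlog empty,
# but 200 records/s of processing capacity stay unused.
static_cap = 800
capped = simulate_backlog(receive_rate=min(1500, static_cap), process_rate=1000,
                          batch_interval_s=2, num_batches=5)

print(unbounded)  # [1000, 2000, 3000, 4000, 5000]
print(capped)     # [0, 0, 0, 0, 0]
```

In the first run the queue grows by 1000 records per interval, which corresponds to the memory accumulation described above. In the second run the static cap removes the backlog but leaves the cluster underutilized, which is the gap a dynamically adjusted rate is meant to close.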
2, Backpressure
Spark Streaming backpressure dynamically adjusts the receiver's data reception rate based on the job execution information fed back by the JobScheduler. The property "spark.streaming.backpressure.enabled" controls whether the backpressure mechanism is enabled; its default value is false, that is, disabled.
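As a minimal sketch of how these properties might be passed to an application, the snippet below renders a configuration dictionary as `--conf key=value` arguments for spark-submit. The helper `to_submit_args` and the maxRate value of 10000 are illustrative assumptions, not recommended settings.

```python
# Property names are the ones discussed in the text; the values are examples only.
backpressure_conf = {
    "spark.streaming.backpressure.enabled": "true",
    # Optional static cap; with backpressure enabled it acts as an upper bound.
    "spark.streaming.receiver.maxRate": "10000",  # hypothetical value
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_submit_args(backpressure_conf)))
```

The same key/value pairs can also be set programmatically on the application's SparkConf before the streaming context is created.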
2.1 The Streaming architecture is shown in the following figure (for details, see the documentation on the Streaming data reception process and the Streaming source code analysis)
2.2 The backpressure execution process is shown in the following figure:
On top of the original architecture, a new component, RateController, is added. It listens for the "onBatchCompleted" event and extracts the processingDelay and schedulingDelay information from it. The estimator uses this information to estimate the maximum processing rate, and the receiver-based input stream then forwards this rate through ReceiverTracker and ReceiverSupervisorImpl to BlockGenerator (which inherits from RateLimiter).
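The estimation step can be sketched as a small PID-style controller. The code below is a simplified standalone illustration loosely modeled on the idea of a PID-based rate estimator; it is not Spark's source code, and the class name, method name, gain values, and the sample batch numbers are all assumptions of this sketch.

```python
class PidRateEstimatorSketch:
    """Simplified PID-style estimator: turns batch statistics into a new rate bound."""

    def __init__(self, batch_interval_ms, proportional=1.0, integral=0.2,
                 derivative=0.0, min_rate=100.0):
        self.batch_interval_ms = batch_interval_ms
        self.proportional = proportional
        self.integral = integral
        self.derivative = derivative
        self.min_rate = min_rate
        self.first_run = True
        self.latest_time_ms = -1
        self.latest_rate = -1.0
        self.latest_error = -1.0

    def compute(self, time_ms, num_elements, processing_delay_ms, scheduling_delay_ms):
        """Return a new rate in records/s, or None if no estimate can be made."""
        if (time_ms <= self.latest_time_ms or num_elements <= 0
                or processing_delay_ms <= 0):
            return None
        # Rate at which the last batch was actually processed (records/s).
        processing_rate = num_elements / processing_delay_ms * 1000.0
        if self.first_run:
            # The first batch only initializes the controller state.
            self.first_run = False
            self.latest_time_ms = time_ms
            self.latest_rate = processing_rate
            self.latest_error = 0.0
            return None
        delay_since_update_s = (time_ms - self.latest_time_ms) / 1000.0
        # Proportional term: how far the current rate bound is above the observed rate.
        error = self.latest_rate - processing_rate
        # Integral term: backlog implied by the scheduling delay, as records/s.
        historical_error = scheduling_delay_ms * processing_rate / self.batch_interval_ms
        # Derivative term: how quickly the error is changing.
        d_error = (error - self.latest_error) / delay_since_update_s
        new_rate = max(self.latest_rate
                       - self.proportional * error
                       - self.integral * historical_error
                       - self.derivative * d_error,
                       self.min_rate)
        self.latest_time_ms = time_ms
        self.latest_rate = new_rate
        self.latest_error = error
        return new_rate


estimator = PidRateEstimatorSketch(batch_interval_ms=1000)
# First batch: 1000 records processed in 1000 ms, only initializes state.
first = estimator.compute(time_ms=1000, num_elements=1000,
                          processing_delay_ms=1000, scheduling_delay_ms=0)
# Second batch: the same 1000 records now take 2000 ms, so the rate bound is lowered.
second = estimator.compute(time_ms=2000, num_elements=1000,
                           processing_delay_ms=2000, scheduling_delay_ms=0)
print(first, second)
```

In this sketch the second call lowers the bound from 1000 to 500 records per second because the observed processing rate halved. In the flow described above, a value like this is what would be propagated down to the receiver side to throttle ingestion.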
3, Backpressure source code analysis
3.1 RateController class hierarchy
RateController inherits from StreamingListener and handles BatchCompleted events. The core code is: