Spark Streaming Back Pressure Mechanism

By default, Spark Streaming receives data through receivers (or the Direct approach) at whatever rate the producer generates it. When batch processing time > batch interval, i.e. each batch of data takes longer to process than the configured batch interval, data keeps arriving faster than it can be handled. Data then accumulates in the system, which can eventually cause OOM problems on the executors and make the application fail.

Before Spark 1.5, the way to mitigate this problem was static rate limiting. For Receiver-based ingestion, the spark.streaming.receiver.maxRate parameter limits the maximum number of records each receiver accepts per second; for the Direct approach, the spark.streaming.kafka.maxRatePerPartition parameter limits the maximum number of records read from each Kafka partition in each batch (a configuration sketch follows the list below). Although capping the receiving rate lets the application stay within its current processing capacity, this approach has the following problems:

We need to estimate the cluster's processing speed and the rate at which messages are produced in advance;

Both parameters require manual intervention: after changing them, we must restart the Spark Streaming application;

If the cluster's actual processing capacity is higher than the configured maxRate while the producer also generates data faster than maxRate, cluster resources are under-utilized and the data still cannot be processed in time.
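For reference, the following is a minimal sketch of how these static limits are typically set on a SparkConf. The application name, batch interval, and numeric limits are illustrative assumptions, not recommendations.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Static rate limiting (pre-1.5 style). The numeric limits are illustrative only.
    val conf = new SparkConf()
      .setAppName("RateLimitedStreamingApp")
      // Receiver-based sources: at most 1000 records per second per receiver.
      .set("spark.streaming.receiver.maxRate", "1000")
      // Direct Kafka sources: at most 500 records per second per Kafka partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "500")
    val ssc = new StreamingContext(conf, Seconds(5))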


Back pressure mechanism
So can Spark Streaming handle these problems automatically, without human intervention? Of course! Spark 1.5 introduced a back pressure mechanism that automatically adapts to the cluster's data processing capability by dynamically collecting statistics from the running system. For details, refer to SPARK-7398.

Spark Streaming pre-1.5 architecture
Data is continuously received by the receiver, which stores it in the Block Manager as it arrives; to avoid data loss, the blocks are also replicated to other Block Managers;

The Receiver Tracker receives the IDs of the stored blocks and maintains the mapping between these block IDs and the batch intervals in which they arrived;

The Job Generator receives an event every batchInterval and generates a JobSet for that batch;

The Job Scheduler runs the JobSet generated above.
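To make this flow concrete, here is a minimal receiver-based application sketch. The socket source on localhost:9999 and the 5-second batch interval are assumptions for illustration; the point is to show where the components above come into play, not to be a production job.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Minimal receiver-based application: the socket receiver stores incoming data
    // in the Block Manager, and every 5-second batch interval the Job Generator
    // produces a JobSet that the Job Scheduler runs.
    val conf = new SparkConf().setAppName("ReceiverArchitectureSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical host and port; MEMORY_AND_DISK_SER_2 replicates blocks to a
    // second Block Manager, matching the backup step described above.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()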

In order to adjust the ingestion rate automatically, a new component named RateController was added to the original architecture. This component extends StreamingListener and listens to the onBatchCompleted event of every batch; based on processingDelay, schedulingDelay, the number of records processed in the batch, and the batch completion time, it estimates a rate. This rate is then used to update the maximum number of records the stream can process per second. The rate estimator could be implemented in many ways, but as of Spark 2.2 only a PID-based rate estimator is provided.
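The sketch below is a simplified, self-contained rendition of a PID-style update rule in the spirit of Spark's PIDRateEstimator; it is not the exact source code, and approximating the time since the previous update by the batch interval is an assumption made here for brevity.

    // Simplified sketch of a PID-style rate update, modeled on Spark's PIDRateEstimator.
    // All inputs come from the completed batch reported via onBatchCompleted.
    def computeNewRate(
        batchIntervalMs: Long,    // streaming batch interval
        latestRate: Double,       // rate used for the batch that just completed (records/s)
        numElements: Long,        // records processed in that batch
        processingDelayMs: Long,  // how long the batch took to process
        schedulingDelayMs: Long,  // how long the batch waited before being scheduled
        latestError: Double,      // error remembered from the previous update
        proportional: Double = 1.0,
        integral: Double = 0.2,
        derivative: Double = 0.0,
        minRate: Double = 100.0): Double = {

      // Rate actually achieved while processing the last batch.
      val processingRate = numElements.toDouble / processingDelayMs * 1000
      // How far the configured rate was from what the cluster could actually do.
      val error = latestRate - processingRate
      // Backlog that built up while the batch was waiting, expressed as a rate.
      val historicalError = schedulingDelayMs.toDouble * processingRate / batchIntervalMs
      // Trend of the error since the previous batch (time delta approximated by the
      // batch interval in this sketch).
      val dError = (error - latestError) / (batchIntervalMs.toDouble / 1000)

      math.max(latestRate - proportional * error
                          - integral * historicalError
                          - derivative * dError,
               minRate)
    }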

The calculated maximum rate is stored in the RateController of each InputDStream. After the onBatchCompleted event is processed, the RateController pushes the new rate to ReceiverSupervisorImpl, so the receiver knows how much data it should accept next.

If the user also configures spark.streaming.receiver.maxRate or spark.streaming.kafka.maxRatePerPartition, the rate finally applied is the minimum of these values. In other words, each receiver (or each Kafka partition) will never process more records per second than spark.streaming.receiver.maxRate (or spark.streaming.kafka.maxRatePerPartition) allows.
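Conceptually, the combination works like the small sketch below; the function name and the convention that a non-positive value means "no static limit configured" are assumptions for illustration.

    // Sketch: the rate actually applied is capped by any static limit the user sets.
    // backpressureRate comes from the rate estimator; staticMaxRate is the value of
    // spark.streaming.receiver.maxRate (or maxRatePerPartition for Direct Kafka),
    // where a value <= 0 means no static limit is configured.
    def effectiveRate(backpressureRate: Double, staticMaxRate: Double): Double =
      if (staticMaxRate > 0) math.min(backpressureRate, staticMaxRate)
      else backpressureRate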

Use of Spark Streaming backpressure mechanism
Enabling the backpressure mechanism in Spark is simple: set spark.streaming.backpressure.enabled to true (its default value is false). The backpressure mechanism also involves the following parameters, including some not listed in the documentation (a configuration sketch follows the list):

spark.streaming.backpressure.initialRate: The initial maximum rate at which each receiver receives the first batch of data when the backpressure mechanism is enabled. There is no default value.

spark.streaming.backpressure.rateEstimator: The rate estimator class. The default value is pid, which is currently the only built-in implementation; you can implement your own if needed.

spark.streaming.backpressure.pid.proportional: The weight for the response to the "error", i.e. the change between the last batch and the current batch. The default value is 1; it can only be set to a non-negative value.

spark.streaming.backpressure.pid.integral: The weight for the response to the accumulation of error; this has a dampening effect. The default value is 0.2; it can only be set to a non-negative value.

spark.streaming.backpressure.pid.derived: The weight for the response to the trend in error. This can cause arbitrary or noise-induced fluctuations in batch size, but it can also help react quickly to increased or reduced capacity. The default value is 0; it can only be set to a non-negative value.

spark.streaming.backpressure.pid.minRate: The lowest rate the estimator can produce, in records per second. The default value is 100; it can only be set to a non-negative value.
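Putting the parameters together, a configuration sketch might look like the following. The application name, batch interval, and initialRate are illustrative assumptions; the PID weights shown are simply the defaults listed above.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Enabling backpressure and tuning the PID estimator. The PID weights are the
    // defaults described above; the initialRate of 1000 records/s and the 5-second
    // batch interval are illustrative assumptions.
    val conf = new SparkConf()
      .setAppName("BackpressureEnabledApp")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "1000")
      .set("spark.streaming.backpressure.pid.proportional", "1.0")
      .set("spark.streaming.backpressure.pid.integral", "0.2")
      .set("spark.streaming.backpressure.pid.derived", "0.0")
      .set("spark.streaming.backpressure.pid.minRate", "100")
    val ssc = new StreamingContext(conf, Seconds(5))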
