Exactly-once fault-tolerant HA mechanism of Spark Streaming

Source: Internet
Author: User

Spark Streaming 1.2 provides a WAL (write-ahead log) based fault-tolerance mechanism (see the previous blog post http://blog.csdn.net/yangbutao/article/details/44975627). It guarantees that each piece of data is processed at least once.

However, it does not guarantee that the data is processed only once. For example, suppose the Kafka receiver writes data to the WAL but then fails to write the offset to ZooKeeper. After the driver recovers from a failure, the offset in ZooKeeper is still the previously written position, so the data is pulled from Kafka again and processed a second time. Some scenarios have stricter consistency requirements than this allows. In addition, the 1.2 HA mechanism is fairly complex and has a relatively large impact on performance.
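The failure sequence described above can be traced with a small toy model. This is only a sketch in plain Python, not Spark or Kafka code: the class `ReceiverModel`, its fields, and the message names are all hypothetical, and the model assumes a single Kafka partition.

```python
# Toy model of the receiver-based path in 1.2; not Spark or Kafka code.
# All names here are hypothetical and exist only for illustration.

kafka_log = ["m0", "m1", "m2", "m3", "m4", "m5"]  # one Kafka partition


class ReceiverModel:
    def __init__(self):
        self.wal = []        # write-ahead log, survives a driver failure
        self.zk_offset = 0   # consumer offset stored in ZooKeeper

    def receive(self, count, zk_commit_ok=True):
        """Pull `count` messages starting at the ZooKeeper offset."""
        batch = kafka_log[self.zk_offset:self.zk_offset + count]
        self.wal.extend(batch)       # step 1: data becomes durable in the WAL
        if zk_commit_ok:             # step 2: the offset update may fail
            self.zk_offset += len(batch)
        return batch


receiver = ReceiverModel()
receiver.receive(3)                      # m0..m2, offset committed
receiver.receive(3, zk_commit_ok=False)  # m3..m5 in the WAL, offset not updated

# Driver restarts: the WAL contents are replayed, then the receiver resumes
# from the stale ZooKeeper offset and pulls m3..m5 from Kafka a second time.
processed = list(receiver.wal)
processed += receiver.receive(3)

duplicates = sorted({m for m in processed if processed.count(m) > 1})
print(duplicates)
```

In this model every message is processed at least once, but the three messages whose offset was never committed are processed twice.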

Since 1.3, Spark Streaming has provided a simpler way to meet the exactly-once requirement: the direct API.

Refer to the following figure:

The difference is that the 1.2 HA mechanism relies on the WAL and the receiver, whereas version 1.3 uses the direct API approach to implement exactly-once.

When the driver generates RDD tasks, each batch is defined by an offset range of Kafka consumption. When each job runs, it retrieves data from Kafka according to that offset range. The current offsets are stored reliably in the driver through the checkpoint mechanism, so they can be recovered after a failure. Because the receiver is removed, there is no need to configure how many threads consume the Kafka partitions for parallel operation. In the direct API implementation, each RDD partition corresponds to one Kafka partition, which greatly simplifies the parallel programming model and makes parallel reading automatic.
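The offset-range idea can be illustrated with another toy model. Again this is a plain Python sketch under stated assumptions rather than the Spark implementation: `plan_batch`, `run_batch`, and the `checkpoint` dictionary are hypothetical names, and Kafka is modeled as a dictionary of lists keyed by partition number.

```python
# Toy model of the direct approach; not Spark or Kafka code.
# All names here are hypothetical and exist only for illustration.

def plan_batch(committed, latest):
    """Driver side: one (partition, from, until) range per Kafka partition."""
    return [(p, committed[p], latest[p]) for p in sorted(latest)]


def run_batch(log, ranges):
    """Job side: each offset range is read as one partition of the batch."""
    return [log[p][start:until] for p, start, until in ranges]


kafka = {0: ["a0", "a1", "a2", "a3"], 1: ["b0", "b1", "b2"]}

checkpoint = {"offsets": {0: 0, 1: 0}}  # state saved by the driver
latest = {0: 3, 1: 2}                   # log ends seen when the batch is cut

ranges = plan_batch(checkpoint["offsets"], latest)
checkpoint["pending"] = ranges          # the ranges are checkpointed first

first_try = run_batch(kafka, ranges)

# The driver fails before the batch is marked done. On recovery it reads the
# same checkpointed ranges, so the retry covers exactly the same messages.
retry = run_batch(kafka, checkpoint["pending"])
checkpoint["offsets"] = {p: until for p, _, until in checkpoint["pending"]}

# The next batch starts where the previous ranges ended.
next_ranges = plan_batch(checkpoint["offsets"], {0: 4, 1: 3})
print(ranges)
print(retry == first_try)
print(next_ranges)
```

Because a batch is identified by its offset ranges rather than by whatever a receiver happened to pull, the retry in this model neither skips nor adds messages, and the number of partitions in each batch follows the number of Kafka partitions without any thread configuration.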
