Spark Streaming 1.2 provides a WAL-based fault-tolerance mechanism (see the previous post: http://blog.csdn.net/yangbutao/article/details/44975627), which guarantees that data is processed at least once.
However, it does not guarantee exactly-once processing. For example, if a Kafka receiver writes data to the WAL but then fails to write the offset to ZooKeeper, then after the driver recovers from the failure the offset still points to the previously committed position, so the same data is pulled from Kafka and processed a second time. For scenarios with strict consistency requirements this is unacceptable. In addition, the 1.2 HA mechanism is fairly complex and has a relatively large impact on performance.
Since version 1.3, Spark Streaming provides a simpler way to support exactly-once semantics: the direct API.
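The core idea can be illustrated with a small self-contained Scala sketch (the `OffsetRange` name mirrors Spark's real class, but this log and `fetch` function are simplified stand-ins invented for illustration, not Spark's actual implementation): because a batch is defined purely by explicit offset boundaries, replaying the batch after a failure fetches exactly the same records, which is what makes exactly-once output achievable.

```scala
// Simplified model of the direct API's key property (illustrative only).
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

object DirectApiSketch {
  // Stand-in for one Kafka partition's log, indexed by offset.
  val log: Vector[String] = Vector("m0", "m1", "m2", "m3", "m4")

  // Fetching by an explicit offset range is deterministic: the same range
  // always yields the same records, so a replayed batch recomputes
  // exactly the same data instead of duplicating or losing records.
  def fetch(range: OffsetRange): Vector[String] =
    log.slice(range.fromOffset.toInt, range.untilOffset.toInt)
}
```

Contrast this with a receiver: a receiver's position in the stream is implicit and advances as it runs, so a replay after partial failure can re-deliver records that were already processed.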
Refer to the following figure:
The difference is that the 1.2 HA mechanism relies on the WAL and receivers, whereas the 1.3 direct API achieves exactly-once semantics without either.
When the driver generates the RDDs for a batch, it divides the batch according to the offset ranges to be consumed from Kafka; when each job executes, the data is fetched from Kafka based on those offset ranges. The current offsets are reliably stored on the driver through the checkpoint mechanism, so they can be recovered after a failure. Because the receiver is removed, there is no need to configure how many threads consume each Kafka partition for parallelism: in the direct API, each RDD partition corresponds to one Kafka partition, which greatly simplifies the parallel programming model and gives automatic parallel reads.
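The driver's per-batch planning step described above can be sketched as follows (a simplified pure-Scala model; `planBatch` and the map-based bookkeeping are assumptions for illustration, not Spark's actual internals): for each Kafka partition, the driver pairs the offset it has consumed up to with the latest offset reported by Kafka, producing one offset range per partition. That one-range-per-partition output is exactly why each RDD partition maps 1:1 to a Kafka partition.

```scala
// Simplified sketch of driver-side batch planning (illustrative only).
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

object BatchPlanner {
  // current: offset each partition has been consumed up to (checkpointed).
  // latest:  latest available offset per partition, as reported by Kafka.
  // Emits one OffsetRange per Kafka partition, so the resulting RDD has
  // one partition per Kafka partition and parallelism needs no tuning.
  def planBatch(current: Map[(String, Int), Long],
                latest: Map[(String, Int), Long]): Seq[OffsetRange] =
    current.toSeq.sortBy(_._1).map { case ((topic, part), from) =>
      OffsetRange(topic, part, from, latest((topic, part)))
    }
}
```

After the batch completes, the `untilOffset` values become the new `current` map, and checkpointing that map is what lets the driver resume from the right place after a failure.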