Spark Streaming + Kafka: Several Ways to Achieve Zero Data Loss


Definitions

Before getting into the problem, let's define the message-processing semantics involved:

    • At most once: each record is processed at most once (zero or one time)

    • At least once: each record is processed at least once (one or more times)

    • Exactly once: each record is processed exactly once (no data is lost and none is processed more than once)

High-Level API

Without fault tolerance, the receiver-based approach loses data.
The receiver continuously receives data and acknowledges it to ZooKeeper as received. If the executor suddenly dies before that data has actually been processed (or the driver dies and shuts the executors down), the data cached in executor memory is lost, and Kafka will not redeliver it because the offset has already advanced.


To address this problem, Spark 1.2 introduced the WAL (write-ahead log).
When enabling the WAL, also change the storage level of the data the receiver collects to StorageLevel.MEMORY_AND_DISK_SER. The log itself is written under the checkpoint directory, which should live on fault-tolerant storage such as HDFS so that it survives failures.

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint("walDir")  // the WAL is written under the checkpoint directory
val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

Data can still be lost after enabling the WAL
Even with the WAL properly configured, data loss can still occur. Why? When the application is interrupted, the receiver is also forcibly terminated, and any data not yet flushed to the WAL is lost. You will see warnings like the following:

0: Stopped by driver
WARN BlockGenerator: Cannot stop BlockGenerator as its not in the Active state [state = StoppedAll]
WARN BatchedWriteAheadLog: BatchedWriteAheadLog Writer queue interrupted.

When shutting down the streaming program, terminate it only after all receivers have confirmed they are stopped, by stopping gracefully in a shutdown hook:

sys.addShutdownHook {
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}

The method to invoke is:

def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit
The problem with the WAL

The WAL provides at-least-once semantics.
If data has already been written to external storage but the corresponding offset has not yet been updated in ZooKeeper when a failure occurs, that data will be consumed again on recovery. In addition, writing every batch to the WAL reduces the program's throughput.

Kafka Direct API

The Kafka direct API does not use a receiver to read data at all, so the WAL mechanism is no longer needed either.
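
As a minimal sketch (assuming kafkaParams containing "metadata.broker.list" and a topics set are already defined, and using the Spark 1.x integration for Kafka 0.8), creating a direct stream looks like this:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// No receiver and no WAL: each batch reads its offset range straight from Kafka.
// kafkaParams must contain "metadata.broker.list"; topics is a Set[String].
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)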


This approach can guarantee exactly-once semantics and avoids the duplicate consumption the WAL can cause. However, you must now write the offsets to ZooKeeper yourself, a process described in the official documentation.
For example, inside foreachRDD you would call something like:

directStream.foreachRDD { rdd =>
  val message = rdd.map(_._2)
  // do some processing on the data
  message.map(method)
  // update the offsets in ZooKeeper (you implement this yourself)
  updateZKOffsets(rdd)
}
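
The post leaves updateZKOffsets as an exercise. Here is a minimal sketch under the same Kafka 0.8 / Spark 1.x APIs, assuming a consumer groupId and a connected zkClient are in scope (both names are assumptions for illustration, not from the original):

import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka.HasOffsetRanges

def updateZKOffsets(rdd: RDD[(String, String)]): Unit = {
  // Offset ranges are only attached to the RDD produced directly by the stream,
  // before any transformation, which is why the original rdd is passed in.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    // Standard consumer offset path: /consumers/<group>/offsets/<topic>/<partition>
    val dirs = new ZKGroupTopicDirs(groupId, o.topic) // groupId: assumed defined
    val zkPath = s"${dirs.consumerOffsetDir}/${o.partition}"
    ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString) // zkClient: assumed
  }
}

Note that the offsets are committed only after the batch's processing completes, so a crash between processing and the commit replays that batch: end-to-end this is still at-least-once unless your output operation is idempotent or transactional.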

