Notes on connecting Spark Streaming to Kafka

Source: Internet
Author: User
Tags: zookeeper

There are two ways for Spark Streaming to connect to Kafka:

Reference: http://group.jobbole.com/15559/

http://blog.csdn.net/kwu_ganymede/article/details/50314901

Approach 1: Receiver-based approach

This approach uses a receiver to fetch the data. The receiver is implemented with Kafka's high-level consumer API. The data the receiver pulls from Kafka is stored in the Spark executors' memory, and the jobs launched by Spark Streaming then process that data.

However, with the default configuration this approach can lose data when an underlying node fails. To guarantee zero data loss, you must enable Spark Streaming's write-ahead log (Write Ahead Log, WAL). This mechanism synchronously writes the received Kafka data to a write-ahead log on a distributed file system such as HDFS, so even if an underlying node fails, the data can be recovered from the log.
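For reference, a minimal sketch of enabling the WAL (the application name and checkpoint path below are placeholders, not taken from the article):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverBasedKafka")                             // placeholder app name
  // Persist all data received through receivers to write-ahead logs
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The logs are stored under the checkpoint directory, so set one on a
// fault-tolerant file system such as HDFS.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")              // placeholder path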

Connection code:

import org.apache.spark.streaming.kafka._

val kafkaStream = KafkaUtils.createStream(streamingContext,
    [ZK quorum], [consumer group ID], [per-topic number of Kafka partitions to consume])

Points to note:

1. The partitions of a topic in Kafka are not related to the partitions of the RDDs in Spark. So in KafkaUtils.createStream(), increasing the per-topic partition number only increases the number of threads reading partitions within a single receiver; it does not increase the parallelism with which Spark processes the data.
2. You can create multiple Kafka input DStreams, with different consumer groups and topics, to receive data in parallel through multiple receivers.
3. If the write-ahead log is enabled on a fault-tolerant file system such as HDFS, the received data is also copied into that log, so the persistence level to set in KafkaUtils.createStream() is StorageLevel.MEMORY_AND_DISK_SER (used in the sketch below).
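Putting the pieces together, a hedged sketch of the receiver-based connection (Spark 1.x with the spark-streaming-kafka artifact); the ZooKeeper quorum, consumer group, and topic map are placeholder values:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ReceiverBasedKafka")   // placeholder app name
val ssc = new StreamingContext(conf, Seconds(10))

// The stream yields (message key, message value) pairs
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181,zk3:2181",        // ZooKeeper quorum (placeholder hosts)
  "example-consumer-group",            // consumer group id (placeholder)
  Map("example-topic" -> 2),           // topic -> number of consumer threads in this receiver
  StorageLevel.MEMORY_AND_DISK_SER)    // persistence level recommended when the WAL is enabled

// Simple processing: count the messages in each batch
kafkaStream.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()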

Approach 2: Direct approach (no receivers)

This newer, receiver-less direct approach was introduced in Spark 1.3 to provide a more robust mechanism. Instead of using a receiver, it periodically queries Kafka for the latest offset of each topic+partition, which defines the offset range of each batch. When the job that processes the data starts, it uses Kafka's simple consumer API to read the data in the specified offset range from Kafka.

Advantages (relative to Approach 1):

1. Simplified parallel reads: when reading multiple partitions you no longer need to create multiple input DStreams and union them. Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.

2. High performance: to guarantee zero data loss with the receiver-based approach, you must enable the WAL. This is inefficient because the data is effectively stored twice: Kafka already replicates it through its own high-reliability mechanism, and a second copy goes into the WAL. The direct approach does not rely on a receiver and therefore does not need the WAL; as long as the data is replicated inside Kafka, it can be recovered from Kafka's own copies.

3. Exactly-once semantics:
The receiver-based approach uses Kafka's high-level API and stores the consumed offsets in ZooKeeper. This is the traditional way of consuming Kafka data. Combined with the WAL it can guarantee zero data loss (high reliability), but it cannot guarantee that each record is processed exactly once; some records may be processed twice, because Spark and ZooKeeper can get out of sync.
With the direct approach, which uses Kafka's simple API, Spark Streaming itself tracks the consumed offsets and saves them in its checkpoint. Since Spark alone holds this state, it stays consistent, and the data is guaranteed to be consumed exactly once (see the checkpoint sketch after this list).
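As a rough illustration of offset tracking via the checkpoint (the checkpoint path and application name below are assumptions, not from the article), the context is created through StreamingContext.getOrCreate so that a restart recovers the saved offsets:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/direct-kafka-checkpoint"      // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DirectKafkaJob")      // placeholder app name
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... create the direct stream and define the processing here ...
  ssc
}

// On a restart the context, including the Kafka offsets tracked by Spark Streaming,
// is rebuilt from the checkpoint instead of being created from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()

Note that end-to-end exactly-once results also require the output operation itself to be idempotent or transactional.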

Disadvantages:

This approach does not update the offsets stored in ZooKeeper, so ZooKeeper-based Kafka monitoring tools cannot show consumption progress. If needed, the application can write the offsets to ZooKeeper itself as part of its processing (see the sketch after the connection template below).

Connection code:

import org.apache.spark.streaming.kafka._

val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class]](
    streamingContext, [map of Kafka parameters], [set of topics to consume])
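A hedged sketch of a concrete direct connection (Spark 1.x, Scala); the broker list and topic are placeholders, and the foreachRDD part only shows where the offsets could be logged or written back to ZooKeeper for the monitoring tools mentioned above:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val conf = new SparkConf().setAppName("DirectKafkaJob")        // placeholder app name
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholder brokers
val topics = Set("example-topic")                                             // placeholder topic

val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

directKafkaStream.foreachRDD { rdd =>
  // Every RDD from the direct stream carries the Kafka offset ranges it covers.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process rdd here ...

  // After processing, these ranges could be written to ZooKeeper (or any other store)
  // so that ZooKeeper-based monitoring tools can see the progress.
  offsetRanges.foreach { range =>
    println(s"${range.topic} partition ${range.partition}: " +
      s"${range.fromOffset} -> ${range.untilOffset}")
  }
}

ssc.start()
ssc.awaitTermination()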

References for custom offset management:

http://www.voidcn.com/blog/bdchome/article/p-6188635.html

https://www.iteblog.com/archives/1381

http://ju.outofmemory.cn/entry/270603
