Lesson 91: Spark Streaming Based on Kafka Direct Mode Explained


1: Direct Mode Features

1) The direct approach manipulates Kafka's underlying metadata directly, so if a computation fails the data can be re-read and re-processed; the data is guaranteed to be processed. Data is pulled: the RDD pulls the data directly from Kafka when it executes.

2) Because the direct approach operates on Kafka directly, Kafka acts as your underlying file system. This guarantees strict transactional consistency: the data is guaranteed to be processed, and to be processed only once. The receiver-based approach offers no such guarantee, because the receiver and the offsets in ZooKeeper can get out of sync, so Spark Streaming may consume data repeatedly; this can be worked around with tuning, but it is clearly less convenient than direct mode. With the direct API, Spark Streaming itself is responsible for tracking the offsets of the consumed data and saves them to its checkpoint, so the offsets stay in sync and data is never duplicated, not even across a restart, thanks to the checkpoint. However, after a program upgrade the new program cannot read the old checkpoint. How do you deal with the checkpoint being invalidated by an upgrade? On upgrade, read from a backup you specify; in other words, specifying the checkpoint manually is also possible, which once again fully preserves the transactional guarantee, and there is only one transaction mechanism. How do you specify the checkpoint manually? When building the StreamingContext there is the getOrCreate API: it reads the checkpoint content, and you simply specify where the checkpoint lives. For example:
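A minimal sketch of getOrCreate with a manually specified checkpoint directory; the path here is a hypothetical placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint directory; point this at your own HDFS path.
val checkpointDir = "hdfs://master:9000/sparkstreaming/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DirectKafkaWordCount")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)  // offsets and DStream lineage are saved here
  // ... build the Kafka direct stream and the rest of the job here ...
  ssc
}

// getOrCreate recovers the context from the checkpoint if one exists,
// otherwise it builds a fresh context with createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)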


What if, after recovering from the checkpoint, too much data has accumulated to handle? You can: 1) limit the ingestion rate; 2) increase the machines' processing capacity; 3) put the data into a buffer pool. A sketch of option 1) follows below.
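Spark exposes rate limiting as configuration properties; a minimal sketch, assuming a Spark 1.5+ cluster (the values are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Direct mode: cap the number of records read from each Kafka partition per second.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Backpressure (Spark 1.5+) lets Spark adapt the ingestion rate automatically.
  .set("spark.streaming.backpressure.enabled", "true")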

3) Because direct mode reads the data directly, there is no receiver at all: Kafka is queried periodically (once per batch interval), and Kafka's native consumer API is used to fetch a specific offset range from Kafka for processing. One obvious performance benefit of accessing Kafka through the direct API: if you read from multiple Kafka partitions, Spark creates a matching RDD partition for each one, so the RDD's partitioning and Kafka's partitioning are consistent. In the receiver approach, the two partitionings bear no relationship to each other. The advantage is that, underneath your RDD, a Kafka partition is the equivalent of a block on HDFS, which fits data locality: the RDD and the Kafka data are on the same side, so the data is processed where it is read and the program driving it is on the same machine, which can greatly improve performance. The disadvantage is that, because RDD partitions and Kafka partitions map one-to-one, raising the degree of parallelism can be cumbersome: increasing the parallelism means repartitioning, and repartitioning is time-consuming because of the shuffle it generates (see the sketch after this paragraph). Perhaps a later version will make the ratio freely configurable rather than fixed one-to-one. Raising the degree of parallelism is worthwhile because it makes better use of the cluster's computing resources.
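A minimal sketch of widening the parallelism; directStream stands for the DStream built with createDirectStream, and 12 is an arbitrary target:

// repartition redistributes each batch across 12 partitions,
// at the cost of a shuffle per batch.
val widened = directStream.repartition(12)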

4) There is no need to enable the WAL (write-ahead log) mechanism: from the standpoint of zero data loss this greatly improves efficiency and also saves at least one extra copy's worth of disk space. Fetching data from Kafka is also faster than fetching it from HDFS, because Kafka uses zero-copy.
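For contrast, a receiver-based job needs the following setting (together with a reliable checkpoint directory) to achieve zero data loss; with direct mode you simply leave it out:

import org.apache.spark.SparkConf

// Only receiver-based streams need the write-ahead log; direct mode does not.
val conf = new SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")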

2: Hands-On Section


Kafka + Spark Streaming cluster

Prerequisites:

Spark installed successfully (Spark 1.6.0)

ZooKeeper installed successfully

Kafka installed successfully

Steps:

1: First start ZooKeeper on all three machines, then start Kafka on the same three machines; for example:
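These commands assume a standalone ZooKeeper installation and the Kafka directory shown below; paths depend on your setup:

bin/zkServer.sh start                                        # from the ZooKeeper directory, on each machine
bin/kafka-server-start.sh -daemon config/server.properties   # from the Kafka directory, on each machine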

2: Create the topic test on Kafka; for example:
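From the Kafka directory; the replication factor and partition count here are illustrative:

bin/kafka-topics.sh --create --zookeeper master:2181,worker1:2181,worker2:2181 --replication-factor 3 --partitions 3 --topic test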

3: Start the Kafka producer on worker1:

root@worker1:/usr/local/kafka_2.10-0.9.0.1# bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Start the consumer on worker2:

root@worker2:/usr/local/kafka_2.10-0.9.0.1# bin/kafka-console-consumer.sh --zookeeper master:2181 --topic test

The messages the producer produces can be consumed by the consumer, which shows the Kafka cluster is working. Move on to the next step.

Start spark-shell on master:

./spark-shell --master local[2] --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.apache.kafka:kafka_2.10:0.8.2.1

The author uses Spark 1.6.0; readers should adjust according to their own version.

Logic code in the shell (word count):

import org.apache.spark.SparkConf
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Durations, StreamingContext}

// sc is the SparkContext predefined by spark-shell.
val ssc = new StreamingContext(sc, Durations.seconds(5))

// The direct API takes the broker list (not ZooKeeper addresses) in kafkaParams.
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("metadata.broker.list" -> "master:9092,worker1:9092,worker2:9092",
        "group.id" -> "StreamingWordCountSelfKafkaDirectStreamScala"),
    Set("test"))
  .map(_._2)            // keep only the message value
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()

The producer sends messages again:

Spark Streaming's output:

Back on worker2, check the consumer:

You can see that because the two group IDs differ, the consumers do not exclude each other.

The above uses createDirectStream to connect to Kafka. In practice, it differs from the receiver-based approach only in the API used and its parameters; everything else is basically the same.
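For comparison, a minimal sketch of the receiver-based API against the same cluster; note that it takes the ZooKeeper quorum rather than the broker list:

import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream: (ssc, ZooKeeper quorum, consumer group, Map(topic -> receiver threads))
val receiverStream = KafkaUtils.createStream(ssc,
  "master:2181,worker1:2181,worker2:2181",
  "StreamingWordCountSelfKafkaReceiverScala",
  Map("test" -> 1))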

Reference:

http://spark.apache.org/docs/latest/streaming-kafka-integration.html

