Lesson 91: Spark Streaming Based on Kafka Direct Mode Explained


1: Direct Mode Features

1) The direct approach manipulates Kafka's underlying metadata directly, so if a computation fails the data can be re-read and re-processed; the data is guaranteed to be processed. Data is pulled: the RDD pulls the data directly from Kafka when it executes.

2) Because the direct approach operates on Kafka directly, Kafka acts as your underlying file system. This guarantees strict transactional consistency: the data is guaranteed to be processed, and to be processed only once. The receiver-based approach offers no such guarantee, because the receiver and the offsets in ZooKeeper can get out of sync, so Spark Streaming may consume data repeatedly; this can be worked around with tuning, but it is clearly less convenient than direct mode. With the direct API, Spark Streaming itself is responsible for tracking the offsets of the consumed data and saves them to its checkpoint, so the offsets stay in sync and data is never duplicated, not even across a restart, thanks to the checkpoint. However, after a program upgrade the new program cannot read the old checkpoint. How do you deal with the checkpoint being invalidated by an upgrade? On upgrade, read from a backup you specify; in other words, specifying the checkpoint manually is also possible, which once again fully preserves the transactional guarantee, and there is only one transaction mechanism. How do you specify the checkpoint manually? When building the StreamingContext there is the getOrCreate API: it reads the checkpoint content, and you simply specify where the checkpoint lives. For example:
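A minimal sketch of getOrCreate with a manually specified checkpoint directory; the path here is a hypothetical placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint directory; point this at your own HDFS path.
val checkpointDir = "hdfs://master:9000/sparkstreaming/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DirectKafkaWordCount")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)  // offsets and DStream lineage are saved here
  // ... build the Kafka direct stream and the rest of the job here ...
  ssc
}

// getOrCreate recovers the context from the checkpoint if one exists,
// otherwise it builds a fresh context with createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)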


What if, after recovering from the checkpoint, too much data has accumulated to handle? You can: 1) limit the ingestion rate; 2) increase the machines' processing capacity; 3) put the data into a buffer pool. A sketch of option 1) follows below.
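Spark exposes rate limiting as configuration properties; a minimal sketch, assuming a Spark 1.5+ cluster (the values are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Direct mode: cap the number of records read from each Kafka partition per second.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Backpressure (Spark 1.5+) lets Spark adapt the ingestion rate automatically.
  .set("spark.streaming.backpressure.enabled", "true")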

3) Because direct mode reads the data directly, there is no receiver at all: Kafka is queried periodically (once per batch interval), and Kafka's native consumer API is used to fetch a specific offset range from Kafka for processing. One obvious performance benefit of accessing Kafka through the direct API: if you read from multiple Kafka partitions, Spark creates a matching RDD partition for each one, so the RDD's partitioning and Kafka's partitioning are consistent. In the receiver approach, the two partitionings bear no relationship to each other. The advantage is that, underneath your RDD, a Kafka partition is the equivalent of a block on HDFS, which fits data locality: the RDD and the Kafka data are on the same side, so the data is processed where it is read and the program driving it is on the same machine, which can greatly improve performance. The disadvantage is that, because RDD partitions and Kafka partitions map one-to-one, raising the degree of parallelism can be cumbersome: increasing the parallelism means repartitioning, and repartitioning is time-consuming because of the shuffle it generates (see the sketch after this paragraph). Perhaps a later version will make the ratio freely configurable rather than fixed one-to-one. Raising the degree of parallelism is worthwhile because it makes better use of the cluster's computing resources.
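A minimal sketch of widening the parallelism; directStream stands for the DStream built with createDirectStream, and 12 is an arbitrary target:

// repartition redistributes each batch across 12 partitions,
// at the cost of a shuffle per batch.
val widened = directStream.repartition(12)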

4) There is no need to enable the WAL (write-ahead log) mechanism: from the standpoint of zero data loss this greatly improves efficiency and also saves at least one extra copy's worth of disk space. Fetching data from Kafka is also faster than fetching it from HDFS, because Kafka uses zero-copy.
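For contrast, a receiver-based job needs the following setting (together with a reliable checkpoint directory) to achieve zero data loss; with direct mode you simply leave it out:

import org.apache.spark.SparkConf

// Only receiver-based streams need the write-ahead log; direct mode does not.
val conf = new SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")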

2: Hands-On Section


Kafka + Spark Streaming cluster

Prerequisites:

Spark installed successfully (Spark 1.6.0)

ZooKeeper installed successfully

Kafka installed successfully

Steps:

1: First start ZooKeeper on all three machines, then start Kafka on the same three machines; for example:
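These commands assume a standalone ZooKeeper installation and the Kafka directory shown below; paths depend on your setup:

bin/zkServer.sh start                                        # from the ZooKeeper directory, on each machine
bin/kafka-server-start.sh -daemon config/server.properties   # from the Kafka directory, on each machine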

2: Create the topic test on Kafka; for example:
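From the Kafka directory; the replication factor and partition count here are illustrative:

bin/kafka-topics.sh --create --zookeeper master:2181,worker1:2181,worker2:2181 --replication-factor 3 --partitions 3 --topic test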

3: Start the Kafka producer on worker1:

root@worker1:/usr/local/kafka_2.10-0.9.0.1# bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Start the consumer on worker2:

root@worker2:/usr/local/kafka_2.10-0.9.0.1# bin/kafka-console-consumer.sh --zookeeper master:2181 --topic test

The messages the producer produces can be consumed by the consumer, which shows the Kafka cluster is working. Move on to the next step.

Start spark-shell on master:

./spark-shell --master local[2] --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.apache.kafka:kafka_2.10:0.8.2.1

The author uses Spark 1.6.0; readers should adjust according to their own version.

Logic code in the shell (word count):

import org.apache.spark.SparkConf
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Durations, StreamingContext}

// sc is the SparkContext predefined by spark-shell.
val ssc = new StreamingContext(sc, Durations.seconds(5))

// The direct API takes the broker list (not ZooKeeper addresses) in kafkaParams.
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("metadata.broker.list" -> "master:9092,worker1:9092,worker2:9092",
        "group.id" -> "StreamingWordCountSelfKafkaDirectStreamScala"),
    Set("test"))
  .map(_._2)            // keep only the message value
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()

The producer sends messages again:

Spark Streaming's output:

Back on worker2, check the consumer:

You can see that because the two group IDs differ, the consumers do not exclude each other.

The above uses createDirectStream to connect to Kafka. In practice, it differs from the receiver-based approach only in the API used and its parameters; everything else is basically the same.
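For comparison, a minimal sketch of the receiver-based API against the same cluster; note that it takes the ZooKeeper quorum rather than the broker list:

import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream: (ssc, ZooKeeper quorum, consumer group, Map(topic -> receiver threads))
val receiverStream = KafkaUtils.createStream(ssc,
  "master:2181,worker1:2181,worker2:2181",
  "StreamingWordCountSelfKafkaReceiverScala",
  Map("test" -> 1))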

Reference:

http://spark.apache.org/docs/latest/streaming-kafka-integration.html

