Spark Streaming and Kafka Integration Development Guide (Part 1)


Apache Kafka is a distributed publish-subscribe messaging system. It is fair to say that any real-time big data processing tool is incomplete without Kafka integration. This article shows how to use Spark Streaming to receive data from Kafka. There are two approaches: (1) a receiver-based approach that uses Kafka's high-level consumer API, and (2) a direct approach, introduced in Spark 1.3.0, that uses Kafka's low-level API and does not use receivers. The two approaches have different programming models, performance characteristics, and semantic guarantees, as described below.

The Receiver-based Approach

  

This approach uses a receiver to receive the data. The receiver is implemented using Kafka's high-level consumer API. As with all receivers, the data received from Kafka is stored in Spark executors and then processed by jobs launched by Spark Streaming.

However, under the default configuration, this approach can lose data on failure. To guarantee zero data loss, you can additionally enable the write-ahead log (WAL) in Spark Streaming, a feature introduced in Spark 1.2.0. It saves the received data to a write-ahead log (which can be stored on HDFS), so the data can be recovered from the WAL on failure without loss.
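As a minimal sketch of what enabling the WAL looks like (the application name and HDFS checkpoint path below are placeholders, not values from this article), it comes down to setting the spark.streaming.receiver.writeAheadLog.enable property and giving the StreamingContext a fault-tolerant checkpoint directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Turn on the receiver write-ahead log (available since Spark 1.2.0).
val conf = new SparkConf()
  .setAppName("KafkaReceiverWithWAL")   // placeholder application name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(2))

// The WAL is written under the checkpoint directory, so it should live on a
// fault-tolerant file system such as HDFS (the path below is hypothetical).
ssc.checkpoint("hdfs:///user/spark/checkpoint")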

Below, I'll show you how to use this method to receive data.

  1. Adding the dependency

For Scala and Java projects, you can add the following dependency to your pom.xml file:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.3.0</version>
</dependency>

If you are using SBT, you can introduce:

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.3.0"

  2. Programming

In the streaming application code, import KafkaUtils and create an input DStream:

import org.apache.spark.streaming.kafka._

val kafkaStream = KafkaUtils.createStream(streamingContext,
  [ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])

When creating the DStream, you can also specify the key and value classes of the data along with their corresponding decoder classes.
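For example, a sketch of a fully typed call might look like the following; the ZooKeeper quorum, consumer group, and topic name are made-up values, and an existing StreamingContext ssc is assumed:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Kafka/ZooKeeper connection settings (hypothetical hosts and group id).
val kafkaParams = Map(
  "zookeeper.connect" -> "zk1:2181,zk2:2181,zk3:2181",
  "group.id"          -> "my-consumer-group")

// String keys and values, decoded with Kafka's StringDecoder, consuming the
// "mytopic" topic with one receiver thread.
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)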

   It is important to note that:
1. The topic partitions in Kafka have nothing to do with the partitions of the RDDs generated by Spark Streaming. So increasing the number of topic-specific partitions in KafkaUtils.createStream() only increases the number of threads consuming the topic within a single receiver; it does not increase Spark's parallelism when processing the data;

2. For different groups and topics, we can use multiple receivers to create different DStreams that receive data in parallel (see the sketch after this list);

3. If you enable the WAL, the received data is persisted to the log, so the storage level should be set to StorageLevel.MEMORY_AND_DISK_SER, that is:

KafkaUtils.createStream(..., StorageLevel.MEMORY_AND_DISK_SER)
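Regarding point 2, a minimal sketch of receiving in parallel is to create several receiver streams and union them; numStreams is an arbitrary value here, and ssc, zkQuorum, group, and topicMap are assumed to be defined as in the earlier snippets:

// Create several receivers for the same topics and union the resulting DStreams
// so that data is received in parallel across multiple executors.
val numStreams = 3
val kafkaStreams = (1 to numStreams).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
}
val unifiedStream = ssc.union(kafkaStreams)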

  3. Deployment

As with any Spark application, you use spark-submit to launch your application. For Scala and Java users, if you are using SBT or Maven, package spark-streaming-kafka_2.10 and its dependencies into the application's JAR file, and make sure spark-core_2.10 and spark-streaming_2.10 are marked as provided, because they are already present in the Spark installation:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.3.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.3.0</version>
  <scope>provided</scope>
</dependency>
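If you build with SBT instead of Maven, the equivalent (a sketch mirroring the versions above) is to mark the two artifacts as provided in build.sbt:

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10"      % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0" % "provided"
)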

Then use spark-submit to launch your application.

Alternatively, you can choose not to package spark-streaming-kafka_2.10 and its dependencies into the application JAR, and instead pass them with the --jars option to spark-submit when running your program:

$ spark-1.3.0-bin-2.6.0/bin/spark-submit --master yarn-cluster --class iteblog.KafkaTest \
    --jars lib/spark-streaming-kafka_2.10-1.3.0.jar,lib/spark-streaming_2.10-1.3.0.jar,lib/kafka_2.10-0.8.1.1.jar,lib/zkclient-0.3.jar,lib/metrics-core-2.2.0.jar \
    ./iteblog-1.0-SNAPSHOT.jar

The following is a complete example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object KafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    // Helper from the Spark examples package that lowers the streaming log level.
    StreamingExamples.setStreamingLogLevels()

    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
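When submitting the example, the four program arguments follow the usage string above; in the sketch below the jar name, ZooKeeper hosts, consumer group, topics, and thread count are placeholder values:

$ spark-1.3.0-bin-2.6.0/bin/spark-submit --master yarn-cluster --class KafkaWordCount \
    --jars lib/spark-streaming-kafka_2.10-1.3.0.jar,lib/kafka_2.10-0.8.1.1.jar,lib/zkclient-0.3.jar,lib/metrics-core-2.2.0.jar \
    ./kafka-wordcount-1.0.jar zk1:2181,zk2:2181,zk3:2181 my-consumer-group topic1,topic2 2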
