2016 Big Data Spark "Mushroom Cloud" Action: Spark Streaming consuming Flume-collected Kafka data in Direct mode


An assignment from Liaoliang's course, the 2016 Big Data Spark "Mushroom Cloud" action: Spark Streaming consuming Flume-collected Kafka data in Direct mode.



I. Basic background

Spark Streaming can consume Kafka data in two ways: the receiver-based approach and the direct approach. This article describes the direct approach, which works as follows:

1. In direct mode, Spark Streaming connects to the Kafka brokers directly to pull data.

2. It periodically queries Kafka for the latest offset of each topic+partition, which defines the offset range of each batch.

3. When the job that processes a batch starts, it uses Kafka's simple consumer API to read the data in that batch's offset range from Kafka (a sketch for inspecting these ranges follows this list).
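To make step 2 concrete, here is a minimal sketch, not taken from the original article, of how the per-batch offset ranges can be printed from the driver. It assumes lines is the JavaPairInputDStream returned by KafkaUtils.createDirectStream in the core code of section II below; HasOffsetRanges and OffsetRange come from the org.apache.spark.streaming.kafka package of spark-streaming-kafka_2.10 1.6.1, and VoidFunction from org.apache.spark.api.java.function.

// Sketch only: print the Kafka offset range backing each partition of every batch.
// "lines" is assumed to be the direct stream created with KafkaUtils.createDirectStream.
lines.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
    @Override
    public void call(JavaPairRDD<String, String> rdd) throws Exception {
        // The RDD produced by a direct stream implements HasOffsetRanges.
        OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        for (OffsetRange range : ranges) {
            System.out.println(range.topic() + " partition " + range.partition()
                    + " offsets [" + range.fromOffset() + ", " + range.untilOffset() + ")");
        }
    }
});

Because each RDD partition is backed by exactly one Kafka partition (advantage 1 below), each OffsetRange also identifies the Kafka partition it was read from.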

This approach has the following advantages:

1. Simplified parallel reads: to read multiple partitions there is no need to create multiple input DStreams and union them. Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.

2. High performance: there is no need to enable the WAL (write-ahead log) mechanism; as long as the data is replicated inside Kafka, it can be recovered from the Kafka replicas.

3. Exactly-once semantics: Spark Streaming itself tracks the consumed offsets and saves them in the checkpoint. Because Spark keeps this offset tracking in sync with its own processing, it can guarantee that the data is consumed once and only once (a sketch of the checkpointing setup follows below).
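As a minimal sketch of how the tracked offsets are saved and recovered, the snippet below (not part of the original article) wraps the context creation in JavaStreamingContext.getOrCreate: on a driver restart, the DStream graph and the stored offsets are restored from the checkpoint directory instead of being rebuilt from scratch. The checkpoint path and the 10-second batch interval are assumptions; Function0 comes from org.apache.spark.api.java.function.

// Sketch only; the checkpoint directory and batch interval are assumptions, not from the article.
final String checkpointDir = "/tmp/flume2kafka2streaming930-checkpoint";   // hypothetical path

Function0<JavaStreamingContext> createContext = new Function0<JavaStreamingContext>() {
    @Override
    public JavaStreamingContext call() throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[5]")
                .setAppName("SparkStreamingOnKafkaDirected");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(10));
        jsc.checkpoint(checkpointDir);   // consumed offsets and the DStream graph are saved here
        // ... build the direct stream and the word-count transformations here,
        //     exactly as in the core code of section II below ...
        return jsc;
    }
};

// Rebuilds the context from the checkpoint if one exists, otherwise calls createContext.
JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkpointDir, createContext);
jsc.start();
jsc.awaitTermination();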

II. Configuration files and code

Flume version: 1.6.0. This version can write to Kafka directly, without installing a separate plugin.

Kafka version: 2.10-0.8.2.1. It must be 0.8.2.1; at first I used 0.10 and ran into error 2 listed in section IV below.

Spark version: 1.6.1.


Flume configuration file: producer.properties. A few of the settings below are pitfalls that need special attention.

# agent section
producer.sources = s
producer.channels = c
producer.sinks = r

# source section
producer.sources.s.type = exec
producer.sources.s.command = tail -f -n +1 /opt/test/test.log
producer.sources.s.channels = c

# Each sink's type must be defined
producer.sinks.r.type = org.apache.flume.plugins.KafkaSink
producer.sinks.r.metadata.broker.list = 192.168.0.10:9092
producer.sinks.r.partition.key = 0
producer.sinks.r.partitioner.class = org.apache.flume.plugins.SinglePartition
producer.sinks.r.serializer.class = kafka.serializer.StringEncoder
producer.sinks.r.request.required.acks = 0
producer.sinks.r.max.message.size = 1000000
producer.sinks.r.producer.type = sync
producer.sinks.r.custom.encoding = utf-8
producer.sinks.r.custom.topic.name = flume2kafka2streaming930

# Specify the channel the sink should use
producer.sinks.r.channel = c

# Each channel's type is defined.
producer.channels.c.type = memory
producer.channels.c.capacity = 1000
producer.channels.c.transactionCapacity = 100


The core code is as follows:

SparkConf conf = new SparkConf()
        .setMaster("local[5]")                        // matches --master in the submit script below
        .setAppName("SparkStreamingOnKafkaDirected")
        .setJars(new String[]{ /* dependency jar paths, as listed in the submit script below */ });

// The batch interval was lost in the original post; 10 seconds is assumed here.
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(10));

Map<String, String> kafkaParameters = new HashMap<String, String>();
kafkaParameters.put("metadata.broker.list", "192.168.0.10:9092");

Set<String> topics = new HashSet<String>();
topics.add("flume2kafka2streaming930");

JavaPairInputDStream<String, String> lines = KafkaUtils.createDirectStream(
        jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParameters, topics);

JavaDStream<String> words = lines.flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
    @Override
    public Iterable<String> call(Tuple2<String, String> tuple) throws Exception {
        // split the message value on whitespace (the delimiter was lost in the original post)
        return Arrays.asList(tuple._2().split(" "));
    }
});

JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String word) throws Exception {
        return new Tuple2<String, Integer>(word, 1);
    }
});

JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});

wordsCount.print();

jsc.start();
jsc.awaitTermination();
jsc.close();



III. Startup scripts

Start Zookeeper

bin/zookeeper-server-start.sh config/zookeeper.properties &

Start Kafka Broker

bin/kafka-server-start.sh config/server.properties &


Create topic

bin/kafka-topics.sh --create --zookeeper 192.168.0.10:2181 --replication-factor 1 --partitions 1 --topic flume2kafka2streaming930


Start Flume

bin/flume-ng agent --conf conf/ -f conf/producer.properties -n producer -Dflume.root.logger=INFO,console


Submit the Spark application

bin/spark-submit --class com.dt.spark.sparkstreaming.SparkStreamingOnKafkaDirected --jars /lib/kafka_2.10-0.8.2.1/kafka-clients-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/kafka_2.10-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/metrics-core-2.2.0.jar,/lib/spark-1.6.1/spark-streaming-kafka_2.10-1.6.1.jar --master local[5] SparkApps.jar


echo "Hadoop spark Hive Storm spark Hadoop HDFs" >>/opt/test/test.log

echo "Hive Storm" >>/opt/test/test.log

echo "HDFs" >>/opt/test/test.log

echo "Hadoop spark Hive Storm spark Hadoop HDFs" >>/opt/test/test.log


The output is as follows:

-------------------------------------------
Time: 1475282360000 ms
-------------------------------------------
(spark,8)
(storm,4)
(hdfs,4)
(hive,4)
(hadoop,8)



IV. Common errors

1. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils
at com.dt.spark.SparkApps.SparkStreaming.SparkStreamingOnKafkaDirected.main

This error appears whenever the required dependency jars are not submitted with the job; it cannot run until they are added in the submit script:

bin/spark-submit --class com.dt.spark.sparkstreaming.SparkStreamingOnKafkaDirected --jars /lib/kafka_2.10-0.8.2.1/kafka-clients-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/kafka_2.10-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/metrics-core-2.2.0.jar,/lib/spark-1.6.1/spark-streaming-kafka_2.10-1.6.1.jar --master local[5] SparkApps.jar

2. Exception in thread "main" java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker

According to stackoverflow.com and the Spark website, this is caused by incompatible versions. The compatibility listed on the Spark website is: Spark Streaming 1.6.1 works with Kafka 0.8.2.1.


Liaoliang, DT Big Data Dream Factory

About the author: Liaoliang is the founder and chief expert of DT Big Data Dream Factory. Official WeChat account: DT_Spark.

Contact email: [email protected]

Tel: 18610086859

WeChat: 18610086859

Weibo: http://weibo.com/ilovepains

