Teacher Liaoliang's course: the 2016 Big Data Spark "Mushroom Cloud" action, homework on Spark Streaming consuming Flume-collected Kafka data in Direct mode.
I. Basic background
Spark Streaming can consume Kafka data in two ways: the receiver-based way and the direct way. This article covers the direct way, which works as follows:
1. The direct approach connects to the Kafka brokers directly to obtain data.
2. It periodically queries Kafka for the latest offset of each topic+partition, and uses these offsets to define the offset range of each batch.
3. When the job that processes the data starts, it uses Kafka's simple consumer API to read the specified offset range from Kafka.
This approach has the following advantages:
1. Simplified parallel reads: to read multiple partitions you do not need to create multiple input DStreams and union them. Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.
2. High performance: there is no need to enable the WAL mechanism; as long as the data is replicated inside Kafka, it can be recovered from the Kafka replicas.
3. Exactly-once semantics: Spark Streaming itself tracks the consumed offsets and saves them in the checkpoint. Since Spark itself keeps the offsets in sync with the processed data, it can guarantee that the data is consumed once and only once (see the sketch right after this list).
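As an illustration of the offset tracking mentioned in advantage 3, the direct stream attaches the exact offset range of every batch to the RDDs it produces. The following is only a minimal sketch and is not part of the original homework code; the variable lines stands for the JavaPairInputDStream created in the core code of section II.
// Minimal sketch (not from the original code): inspect the Kafka offsets that the
// direct approach tracks for each batch.
// Extra imports needed: org.apache.spark.api.java.JavaPairRDD;
// org.apache.spark.api.java.function.VoidFunction;
// org.apache.spark.streaming.kafka.HasOffsetRanges;
// org.apache.spark.streaming.kafka.OffsetRange.
lines.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
    @Override
    public void call(JavaPairRDD<String, String> rdd) throws Exception {
        // Each RDD produced by the direct stream carries its exact offset ranges.
        OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        for (OffsetRange r : ranges) {
            System.out.println(r.topic() + " partition " + r.partition()
                    + " offsets [" + r.fromOffset() + ", " + r.untilOffset() + ")");
        }
    }
});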
II. Configuration files and code
Flume version: 1.6.0. This version supports writing to Kafka directly, without installing a separate plugin (a sketch of the built-in Kafka sink follows the configuration below).
Kafka version: 2.10-0.8.2.1. It must be 0.8.2.1; at first I used 0.10, which produced the 2nd error listed in section IV.
Spark version: 1.6.1.
Flume configuration file: producer.properties. The items originally highlighted in red are the configuration pitfalls that need special attention.
# agent section
producer.sources = s
producer.channels = c
producer.sinks = r

# source section
producer.sources.s.type = exec
producer.sources.s.command = tail -f -n +1 /opt/test/test.log
producer.sources.s.channels = c

# Each sink's type must be defined
producer.sinks.r.type = org.apache.flume.plugins.KafkaSink
producer.sinks.r.metadata.broker.list = 192.168.0.10:9092
producer.sinks.r.partition.key = 0
producer.sinks.r.partitioner.class = org.apache.flume.plugins.SinglePartition
producer.sinks.r.serializer.class = kafka.serializer.StringEncoder
producer.sinks.r.request.required.acks = 0
producer.sinks.r.max.message.size = 1000000
producer.sinks.r.producer.type = sync
producer.sinks.r.custom.encoding = utf-8
producer.sinks.r.custom.topic.name = flume2kafka2streaming930

# Specify the channel the sink should use
producer.sinks.r.channel = c

# Each channel's type is defined.
producer.channels.c.type = memory
producer.channels.c.capacity = 1000
producer.channels.c.transactionCapacity = 100
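As noted at the start of this section, Flume 1.6 also ships a built-in Kafka sink, so the plugin-style sink above is not the only option. The following is only a sketch based on the Flume 1.6 user guide, reusing the broker and topic values from this article; it was not part of the original configuration:
# Alternative sink definition using Flume 1.6's built-in Kafka sink (sketch, not used in this homework)
producer.sinks.r.type = org.apache.flume.sink.kafka.KafkaSink
producer.sinks.r.brokerList = 192.168.0.10:9092
producer.sinks.r.topic = flume2kafka2streaming930
producer.sinks.r.requiredAcks = 1
producer.sinks.r.channel = c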
The core code is as follows:
// Body of main() in SparkStreamingOnKafkaDirected.
// Required imports (not shown in this snippet): org.apache.spark.SparkConf;
// org.apache.spark.api.java.function.*; org.apache.spark.streaming.Durations;
// org.apache.spark.streaming.api.java.*; org.apache.spark.streaming.kafka.KafkaUtils;
// kafka.serializer.StringDecoder; scala.Tuple2; java.util.*.

SparkConf conf = new SparkConf()
        .setMaster("local[5]")                        // matches --master in the submit script
        .setAppName("SparkStreamingOnKafkaDirected")
        .setJars(new String[] { /* jar list elided in the original; passed via --jars in the submit script */ });

// The batch interval was not visible in the original snippet; 10 seconds is assumed here.
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(10));

Map<String, String> kafkaParameters = new HashMap<String, String>();
kafkaParameters.put("metadata.broker.list", "192.168.0.10:9092");

Set<String> topics = new HashSet<String>();
topics.add("flume2kafka2streaming930");

JavaPairInputDStream<String, String> lines = KafkaUtils.createDirectStream(
        jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParameters, topics);

JavaDStream<String> words = lines.flatMap(
        new FlatMapFunction<Tuple2<String, String>, String>() {
            @Override
            public Iterable<String> call(Tuple2<String, String> tuple) throws Exception {
                return Arrays.asList(tuple._2().split(" "));
            }
        });

JavaPairDStream<String, Integer> pairs = words.mapToPair(
        new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

wordsCount.print();

jsc.start();
jsc.awaitTermination();
jsc.close();
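The exactly-once guarantee described in section I depends on checkpointing the tracked offsets. The original snippet does not show where checkpointing is enabled, so the following is only a minimal sketch under assumptions: the checkpoint directory /tmp/streaming-checkpoint is a hypothetical path, and the batch interval is the same assumed 10 seconds.
// Minimal sketch (assumption, not from the original code): recover the streaming
// context, including the tracked Kafka offsets, from a checkpoint on restart.
// Extra import needed: org.apache.spark.api.java.function.Function0.
final SparkConf conf = new SparkConf().setAppName("SparkStreamingOnKafkaDirected");
JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(
        "/tmp/streaming-checkpoint",                  // hypothetical checkpoint directory
        new Function0<JavaStreamingContext>() {
            @Override
            public JavaStreamingContext call() throws Exception {
                JavaStreamingContext newJsc =
                        new JavaStreamingContext(conf, Durations.seconds(10));
                newJsc.checkpoint("/tmp/streaming-checkpoint");
                // ... build the same DStream graph as in the code above ...
                return newJsc;
            }
        });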
III. Startup scripts
Start Zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties &
Start Kafka Broker
bin/kafka-server-start.sh config/server.properties &
Create topic
bin/kafka-topics.sh --create --zookeeper 192.168.0.10:2181 --replication-factor 1 --partitions 1 --topic flume2kafka2streaming930
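Optionally (not one of the original steps), check that the topic was created correctly:
bin/kafka-topics.sh --describe --zookeeper 192.168.0.10:2181 --topic flume2kafka2streaming930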
Start Flume
bin/flume-ng agent --conf conf/ -f conf/producer.properties -n producer -Dflume.root.logger=INFO,console
Submit the Spark Streaming job
bin/spark-submit --class com.dt.spark.sparkstreaming.SparkStreamingOnKafkaDirected --jars /lib/kafka_2.10-0.8.2.1/kafka-clients-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/kafka_2.10-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/metrics-core-2.2.0.jar,/lib/spark-1.6.1/spark-streaming-kafka_2.10-1.6.1.jar --master local[5] SparkApps.jar
echo "Hadoop spark Hive Storm spark Hadoop HDFs" >>/opt/test/test.log
echo "Hive Storm" >>/opt/test/test.log
echo "HDFs" >>/opt/test/test.log
echo "Hadoop spark Hive Storm spark Hadoop HDFs" >>/opt/test/test.log
The output is as follows:
-------------------------------------------
Time: 1475282360000 ms
-------------------------------------------
(spark,8)
(storm,4)
(hdfs,4)
(hive,4)
(hadoop,8)
IV. All kinds of errors
1. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils
at com.dt.spark.SparkApps.SparkStreaming.SparkStreamingOnKafkaDirected.main
This error occurs whenever the dependency jars are not submitted with the job; the application cannot run. Fix it by adding the jars to the submit script:
bin/spark-submit --class com.dt.spark.sparkstreaming.SparkStreamingOnKafkaDirected --jars /lib/kafka_2.10-0.8.2.1/kafka-clients-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/kafka_2.10-0.8.2.1.jar,/lib/kafka_2.10-0.8.2.1/metrics-core-2.2.0.jar,/lib/spark-1.6.1/spark-streaming-kafka_2.10-1.6.1.jar --master local[5] SparkApps.jar
2. Exception in thread "main" java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
According to discussions on stackoverflow.com and the Spark website, this is caused by a version incompatibility. The compatible versions stated on the Spark website are: Spark Streaming 1.6.1 works with Kafka 0.8.2.1.
Liaoliang, DT Big Data DreamWorks
Introduction: Liaoliang is the founder and chief expert of DT Big Data DreamWorks. Public WeChat account: DT_Spark.
Contact email: [email protected]
Tel: 18610086859
WeChat: 18610086859
Weibo: http://weibo.com/ilovepains