Complete real-time streaming flow based on Flume + Kafka + Spark Streaming
1. Environment preparation: four test servers
Spark cluster (three nodes): spark1, spark2, spark3
Kafka cluster (three nodes): spark1, spark2, spark3
ZooKeeper cluster (three nodes): spark1, spark2, spark3
Log receiving server: spark1
Log collection server: redis (this machine is normally used for Redis development; it is reused here for the log collection test and its hostname has not been changed)
Log collection flow:
log collection server -> log receiving server -> Kafka cluster -> Spark cluster
Note: in real production the log collection server is usually an application server, and the log receiving server is a big data server; logs are sent over the network to the log receiving server and from there enter the cluster for processing.
This layout matters because, in a production environment, the network is often opened only one way, to specific ports on specific servers.
Flume version: apache-flume-1.5.0-cdh5.4.9. This version already ships with good Kafka integration.
2. Log collection server (collection side)
Configure Flume to tail the target log in real time. collect.conf is configured as follows:
# Name the components on this agent
a1.sources = tailsource-1
a1.sinks = remotesink
a1.channels = memoryChannel-1

# Describe/configure the source
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.command = tail -f /opt/modules/tmpdata/logs/1.log
a1.sources.tailsource-1.channels = memoryChannel-1

# Describe the sink (avro, pointing at the log receiving server)
a1.sinks.remotesink.type = avro
a1.sinks.remotesink.hostname = spark1
a1.sinks.remotesink.port = 666

# Use a channel which buffers events in memory
a1.channels.memoryChannel-1.type = memory
a1.channels.memoryChannel-1.keep-alive = 10
a1.channels.memoryChannel-1.capacity = 100000
a1.channels.memoryChannel-1.transactionCapacity = 100000

# Bind the sink to the channel
a1.sinks.remotesink.channel = memoryChannel-1
Flume tails the log in real time and forwards each line as an Avro event over the network to port 666 on the spark1 server.
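For reference, the same Avro source can also be fed programmatically with the flume-ng-sdk RPC client, which is handy for testing the receiver without touching the log file. This is only a minimal sketch: the host, port and message body follow the configuration above, everything else is illustrative.

import java.nio.charset.Charset
import org.apache.flume.api.RpcClientFactory
import org.apache.flume.event.EventBuilder

object AvroSendTest {
  def main(args: Array[String]): Unit = {
    // Connect to the Flume Avro source on the log receiving server
    val client = RpcClientFactory.getDefaultInstance("spark1", 666)
    try {
      // Build and send one event; the body is an arbitrary test log line
      val event = EventBuilder.withBody("hadoop spark storm", Charset.forName("UTF-8"))
      client.append(event)
    } finally {
      client.close()
    }
  }
}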
Start the collection-side agent:
bin/flume-ng agent --conf conf --conf-file conf/collect.conf --name a1 -Dflume.root.logger=INFO,console
3. Log receiving server
Configure Flume to receive the logs in real time. receive.conf is configured as follows:
# agent section
producer.sources = s
producer.channels = c
producer.sinks = r

# source section
producer.sources.s.type = avro
producer.sources.s.bind = spark1
producer.sources.s.port = 666
producer.sources.s.channels = c

# Each sink's type must be defined
producer.sinks.r.type = org.apache.flume.sink.kafka.KafkaSink
producer.sinks.r.topic = mytopic
producer.sinks.r.brokerList = spark1:9092,spark2:9092,spark3:9092
producer.sinks.r.requiredAcks = 1
# (the batchSize value was lost in the original text; the sink falls back to its default if the property is omitted)
# producer.sinks.r.batchSize =

# Specify the channel the sink should use
producer.sinks.r.channel = c

# Each channel's type is defined.
producer.channels.c.type = org.apache.flume.channel.kafka.KafkaChannel
producer.channels.c.capacity = 10000
producer.channels.c.transactionCapacity = 1000
producer.channels.c.brokerList = spark1:9092,spark2:9092,spark3:9092
producer.channels.c.topic = channel1
producer.channels.c.zookeeperConnect = spark1:2181,spark2:2181,spark3:2181
The key points are that the Avro source listens on port 666 for the data coming from the collection side, and that the Kafka sink and channel point at the Kafka cluster, so the topic and the ZooKeeper addresses must be configured correctly.
Start the receiving-side agent:
bin/flume-ng agent --conf conf --conf-file conf/receive.conf --name producer -Dflume.root.logger=INFO,console
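Before starting the Spark job it can be useful to confirm that events really arrive in the mytopic topic. The following is only a minimal sketch using the Kafka 0.8 high-level consumer API; the group id and the offset-reset setting are illustrative choices, not part of the original setup.

import java.util.Properties
import kafka.consumer.{Consumer, ConsumerConfig}
import kafka.serializer.StringDecoder

object TopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("zookeeper.connect", "spark1:2181,spark2:2181,spark3:2181")
    props.put("group.id", "flume-topic-check")   // illustrative consumer group
    props.put("auto.offset.reset", "smallest")   // start from the earliest available offset

    val connector = Consumer.create(new ConsumerConfig(props))
    // One consumer stream for the topic the Flume Kafka sink writes to
    val streams = connector.createMessageStreams(Map("mytopic" -> 1), new StringDecoder(), new StringDecoder())

    // Print every message as it arrives (blocks until the process is killed)
    for (msg <- streams("mytopic").head) {
      println(msg.message())
    }
  }
}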
4. Processing the received data on the Spark cluster
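The job below uses the direct Kafka stream from the spark-streaming-kafka module, so that artifact has to be on the classpath in addition to Spark itself. A minimal build.sbt sketch follows; the version numbers are assumptions (roughly the Spark 1.3 line shipped with CDH 5.4) and should be aligned with the actual cluster.

// build.sbt (versions are assumptions; align them with the cluster)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.3.0"
)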
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import kafka.serializer.StringDecoder
import org.apache.log4j.Level
import org.apache.log4j.Logger

/**
 * @author Administrator
 */
object KafkaDataTest {
  def main(args: Array[String]): Unit = {
    // Quiet down the noisy Spark and Jetty loggers
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("Stocker").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    // Kafka configuration
    val topics = Set("mytopic")
    val brokers = "spark1:9092,spark2:9092,spark3:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Split each message into words and count them over a sliding window
    val urlClickLogPairsDStream = kafkaStream.flatMap(_._2.split(" ")).map((_, 1))
    val urlClickCountDaysDStream = urlClickLogPairsDStream.reduceByKeyAndWindow(
      (v1: Int, v2: Int) => v1 + v2,
      Seconds(60),
      Seconds(5))

    urlClickCountDaysDStream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
The Spark Streaming job consumes data from the Kafka cluster and, every 5 seconds, computes a word count over the last 60 seconds.
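As a side note, for longer windows Spark Streaming also offers a reduceByKeyAndWindow variant that takes an inverse reduce function, so counts sliding out of the window are subtracted instead of re-aggregating the full 60 seconds on every batch. The following is a minimal sketch that would replace the reduceByKeyAndWindow call in the job above; the checkpoint directory is a placeholder path.

// Incremental window aggregation keeps state, so checkpointing must be enabled
ssc.checkpoint("/opt/modules/tmpdata/checkpoint")   // placeholder directory

val urlClickCountIncrementalDStream = urlClickLogPairsDStream.reduceByKeyAndWindow(
  (v1: Int, v2: Int) => v1 + v2,   // add counts entering the window
  (v1: Int, v2: Int) => v1 - v2,   // subtract counts leaving the window
  Seconds(60),                     // window length
  Seconds(5))                      // slide interval

urlClickCountIncrementalDStream.print()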
5. Test results
Append three log entries to the monitored log file.
The Spark Streaming output is as follows:
(hive,1)
(spark,2)
(hadoop,2)
(storm,1)
---------------------------------------
(hive,1)
(spark,3)
(hadoop,3)
(storm,1)
---------------------------------------
(hive,2)
(spark,5)
(hadoop,5)
(storm,2)
The output matches expectations and clearly shows the sliding-window behaviour of Spark Streaming.