A complete real-time streaming processing flow based on Flume + Kafka + Spark Streaming




1. Environment preparation: four test servers

Spark cluster, three nodes: spark1, spark2, spark3

Kafka cluster, three nodes: spark1, spark2, spark3

ZooKeeper cluster, three nodes: spark1, spark2, spark3

Log receiving server: spark1

Log collection server: redis (this machine is normally used for Redis development; it is reused here for the log collection test, so its hostname was left unchanged)


Log collection process:

Log collection server -> log receiving server -> Kafka cluster -> Spark cluster (processing)

Note: in real production the log collection server would typically be an application server, and the log receiving server a big-data server; logs travel over the network to the receiving server and from there into the cluster for processing.

This split is used because, in a production environment, the network is usually opened in only one direction, toward a specific port on a specific server.


Flume version: apache-flume-1.5.0-cdh5.4.9; this version already provides good integration with Kafka.


2. Log collection server (the collection side)

Configure Flume to collect the target log dynamically; collect.conf is configured as follows:

# Name the components on this agent
a1.sources = tailsource-1
a1.sinks = remotesink
a1.channels = memoryChnanel-1

# Describe/configure the source
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.command = tail -f /opt/modules/tmpdata/logs/1.log
a1.sources.tailsource-1.channels = memoryChnanel-1

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.memoryChnanel-1.type = memory
a1.channels.memoryChnanel-1.keep-alive = 10
a1.channels.memoryChnanel-1.capacity = 100000
a1.channels.memoryChnanel-1.transactionCapacity = 100000

# Bind the source and sink to the channel
a1.sinks.remotesink.type = avro
a1.sinks.remotesink.hostname = spark1
a1.sinks.remotesink.port = 666
a1.sinks.remotesink.channel = memoryChnanel-1

Flume tails the log file in real time and forwards the events over the network, as Avro, to port 666 on the spark1 server.

Start the collection-side agent with:

bin/flume-ng agent --conf conf --conf-file conf/collect.conf --name a1 -Dflume.root.logger=INFO,console
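As a side note (not part of the original walkthrough), the Avro hop to spark1:666 can also be exercised without the tail source by pushing a single test event with the Flume client SDK. Below is a minimal Scala sketch, assuming flume-ng-sdk is on the classpath and the receiving agent from the next section is already listening; the object name and event body are made up for illustration:

import java.nio.charset.StandardCharsets
import org.apache.flume.api.RpcClientFactory
import org.apache.flume.event.EventBuilder

// Hypothetical test sender: pushes one Avro event to the receiving agent on spark1:666.
object AvroTestSender {
  def main(args: Array[String]): Unit = {
    val client = RpcClientFactory.getDefaultInstance("spark1", 666)
    try {
      val event = EventBuilder.withBody("hadoop spark test event", StandardCharsets.UTF_8)
      client.append(event)   // blocks until the receiving agent acknowledges the event
    } finally {
      client.close()
    }
  }
}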


3. Log receiving server

Configure Flume to receive the logs in real time; receive.conf is configured as follows:

#agent section
producer.sources = s
producer.channels = c
producer.sinks = r

#source section
producer.sources.s.type = avro
producer.sources.s.bind = spark1
producer.sources.s.port = 666
producer.sources.s.channels = c

# Each sink's type must be defined
producer.sinks.r.type = org.apache.flume.sink.kafka.KafkaSink
producer.sinks.r.topic = mytopic
producer.sinks.r.brokerList = spark1:9092,spark2:9092,spark3:9092
producer.sinks.r.requiredAcks = 1
producer.sinks.r.batchSize =
producer.sinks.r.channel = c1

#Specify the channel the sink should use
producer.sinks.r.channel = c

# Each channel's type is defined.
producer.channels.c.type = org.apache.flume.channel.kafka.KafkaChannel
producer.channels.c.capacity = 10000
producer.channels.c.transactionCapacity = 1000
producer.channels.c.brokerList = spark1:9092,spark2:9092,spark3:9092
producer.channels.c.topic = channel1
producer.channels.c.zookeeperConnect = spark1:2181,spark2:2181,spark3:2181


The key points are that the Avro source must listen on port 666 to receive the incoming data, and that the Kafka sink must be configured with the topic and the ZooKeeper addresses so the data flows into the Kafka cluster.

Start the receiving-side agent with:

bin/flume-ng agent --conf conf --conf-file conf/receive.conf --name producer -Dflume.root.logger=INFO,console
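Before wiring in Spark, it can be handy to confirm that messages actually land in the mytopic topic. Here is a minimal sketch using the old Kafka 0.8 producer API, matching the broker list and StringEncoder already used in this setup; the object name and message text are illustrative assumptions, not part of the original article:

import java.util.Properties
import kafka.javaapi.producer.Producer
import kafka.producer.{KeyedMessage, ProducerConfig}

// Hypothetical check: publish one message straight to the mytopic topic,
// bypassing Flume, so the downstream consumer can be tested in isolation.
object TopicSmokeTest {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("metadata.broker.list", "spark1:9092,spark2:9092,spark3:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    producer.send(new KeyedMessage[String, String]("mytopic", "hadoop spark storm hive"))
    producer.close()
  }
}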


4. Spark cluster processes the received data

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import kafka.serializer.StringDecoder
import scala.collection.immutable.HashMap
import org.apache.log4j.Level
import org.apache.log4j.Logger

/**
 * @author Administrator
 */
object KafkaDataTest {
  def main(args: Array[String]): Unit = {

    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("stocker").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    // Kafka configuration
    val topics = Set("mytopic")
    val brokers = "spark1:9092,spark2:9092,spark3:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    // Create a direct stream from Kafka
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Split each message into words and pair each word with a count of 1
    val urlClickLogPairsDStream = kafkaStream.flatMap(_._2.split(" ")).map((_, 1))

    // Word count over a 60-second window, recomputed every 5 seconds
    val urlClickCountDaysDStream = urlClickLogPairsDStream.reduceByKeyAndWindow(
      (v1: Int, v2: Int) => v1 + v2,
      Seconds(60),
      Seconds(5))

    urlClickCountDaysDStream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Spark Streaming receives the data from the Kafka cluster and, every 5 seconds, computes the word count over the last 60 seconds.
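For reference (this is not in the original code), the same 60-second/5-second count can be computed incrementally with the inverse-function overload of reduceByKeyAndWindow, which adds only the batches entering the window and subtracts the ones leaving it; this variant requires a checkpoint directory. A sketch reusing the ssc and urlClickLogPairsDStream defined above, with a made-up checkpoint path:

// Incremental sliding-window count; these lines would go before ssc.start().
ssc.checkpoint("/tmp/spark-checkpoint")   // hypothetical local checkpoint directory

val incrementalCounts = urlClickLogPairsDStream.reduceByKeyAndWindow(
  (v1: Int, v2: Int) => v1 + v2,   // fold counts for batches entering the window
  (v1: Int, v2: Int) => v1 - v2,   // remove counts for batches leaving the window
  Seconds(60),
  Seconds(5))

incrementalCounts.print()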


5. Test results

Append three lines of log to the monitored log file.
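Appending test lines can be done with a plain `echo >> 1.log`; the hypothetical Scala sketch below does the same thing programmatically against the path configured in collect.conf (the sample words are illustrative only, not the exact lines used in the original test):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

// Hypothetical helper: append a few space-separated test lines to the file
// tailed by the collection-side agent.
object AppendTestLines {
  def main(args: Array[String]): Unit = {
    val log = Paths.get("/opt/modules/tmpdata/logs/1.log")
    val lines = Seq("hadoop spark", "hadoop spark hive", "storm spark")
    Files.write(log,
      (lines.mkString("\n") + "\n").getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }
}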

The Spark Streaming output is as follows:

(hive,1)
(spark,2)
(hadoop,2)
(storm,1)

---------------------------------------

(hive,1)
(spark,3)
(hadoop,3)
(storm,1)

---------------------------------------

(hive,2)
(spark,5)
(hadoop,5)
(storm,2)

As expected, the output fully reflects the behavior of the Spark Streaming sliding window.
