The data source used in the previous article took data from a socket, which is a bit of an unorthodox approach; in serious use the data is usually taken from Kafka or another message queue.
The main sources supported, as listed on the official website, include Kafka, Flume, and Kinesis, in addition to basic sources such as files and sockets.
Data can be obtained in two ways: push and pull.
I. Spark Streaming integration with Flume
1. The push approach
(The pull approach, covered below, is the more commonly recommended one.)
Add the Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>
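If the project is built with sbt instead of Maven, the same dependency could be declared roughly as follows (a minimal sketch; the sparkVersion value is an assumption and should match whatever ${spark.version} resolves to in the Maven build):

// build.sbt (sketch) -- sparkVersion is an assumption; use the Spark version on your cluster
val sparkVersion = "1.6.3"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % sparkVersion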
Write the code:
package com.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by ZX on 2015/6/22.
  */
object FlumePushWordCount {

  def main(args: Array[String]) {
    val host = args(0)
    val port = args(1).toInt
    val conf = new SparkConf().setAppName("FlumeWordCount") //.setMaster("local[2]")
    // With this constructor the SparkContext can be omitted; it is built internally
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Flume sends data to Spark (note that host and port here are the
    // Streaming receiver's own address and port, i.e. where others push data to)
    val flumeStream = FlumeUtils.createStream(ssc, host, port)
    // The real content of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
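A side note: each SparkFlumeEvent carries the Flume event headers as well as the body (the spooldir source configured below sets fileHeader = true, so the originating file path arrives as a header). A minimal sketch, assuming the flumeStream value from the code above is in scope, to peek at both:

// Sketch only -- assumes `flumeStream` from the example above; not part of the original program
flumeStream.foreachRDD { rdd =>
  // take(5) brings a small sample back to the driver for printing
  rdd.take(5).foreach { sparkFlumeEvent =>
    val body    = new String(sparkFlumeEvent.event.getBody.array())
    val headers = sparkFlumeEvent.event.getHeaders // java.util.Map of Flume headers
    println(s"headers=$headers  body=$body")
  }
}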
flume-push.conf -- the Flume-side configuration file:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiver (the Spark Streaming host)
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. The pull approach
In this approach Spark Streaming actively pulls the data that Flume has produced; this is the recommended way.
Write the code (the dependency is the same as above):
package com.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePollWordCount {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from Flume (the Flume agent's address). The Seq can hold several
    // InetSocketAddress entries to pull from multiple Flume agents at once.
    val address = Seq(new InetSocketAddress("172.16.0.11", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
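Because the addresses are passed as a Seq, a single streaming job can pull from several Flume agents at once, as the comment above notes. A minimal sketch of that variant (the host names mini1 and mini2 are placeholders, not part of the original setup):

package com.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: pulling from two Flume agents at once; host names are placeholders
object FlumeMultiPollWordCount {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FlumeMultiPollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // One entry per Flume agent running the SparkSink on port 8888
    val addresses = Seq(
      new InetSocketAddress("mini1", 8888),
      new InetSocketAddress("mini2", 8888)
    )
    val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)
    val counts = flumeStream
      .flatMap(e => new String(e.event.getBody.array()).split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}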
Configure Flume
For the pull approach, the relevant jars must first be placed in Flume's lib directory (Flume needs the Spark sink classes so that the Spark program can pull from it). The official website lists the exact jars; roughly, they are the spark-streaming-flume-sink jar, a matching scala-library, and commons-lang3.
Configure Flume (flume-poll.conf):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink (this is the Flume address that Spark will pull from)
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = mini1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume first (so that the SparkSink is up and listening), then start the Spark Streaming job in IDEA:
bin/flume-ng agent -c conf -f conf/flume-poll.conf -n a1 -Dflume.root.logger=INFO,console   # the -D parameter is optional