The data source used in the previous article took data from a socket, which is a bit of an unorthodox approach; in serious use the data is usually taken from Kafka or another message queue.
The main sources supported, as listed on the official website, include Kafka, Flume, and Kinesis, in addition to basic sources such as files and sockets.
Data can be obtained in two ways: push and pull.
I. Spark Streaming integration with Flume
1. The push approach
(The pull approach, covered below, is the more commonly recommended one.)
Add the Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>
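If the project is built with sbt instead of Maven, the same dependency could be declared roughly as follows (a minimal sketch; the sparkVersion value is an assumption and should match whatever ${spark.version} resolves to in the Maven build):

// build.sbt (sketch) -- sparkVersion is an assumption; use the Spark version on your cluster
val sparkVersion = "1.6.3"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % sparkVersion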
Write the code:
package com.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by ZX on 2015/6/22.
  */
object FlumePushWordCount {

  def main(args: Array[String]) {
    val host = args(0)
    val port = args(1).toInt
    val conf = new SparkConf().setAppName("FlumeWordCount") //.setMaster("local[2]")
    // With this constructor the SparkContext can be omitted; it is built internally
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Flume sends data to Spark (note that host and port here are the
    // Streaming receiver's own address and port, i.e. where others push data to)
    val flumeStream = FlumeUtils.createStream(ssc, host, port)
    // The real content of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
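A side note: each SparkFlumeEvent carries the Flume event headers as well as the body (the spooldir source configured below sets fileHeader = true, so the originating file path arrives as a header). A minimal sketch, assuming the flumeStream value from the code above is in scope, to peek at both:

// Sketch only -- assumes `flumeStream` from the example above; not part of the original program
flumeStream.foreachRDD { rdd =>
  // take(5) brings a small sample back to the driver for printing
  rdd.take(5).foreach { sparkFlumeEvent =>
    val body    = new String(sparkFlumeEvent.event.getBody.array())
    val headers = sparkFlumeEvent.event.getHeaders // java.util.Map of Flume headers
    println(s"headers=$headers  body=$body")
  }
}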
flume-push.conf -- the Flume-side configuration file:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiver (the Spark Streaming host)
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. The pull approach
In this approach Spark Streaming actively pulls the data that Flume has produced; this is the recommended way.
Write the code (the dependency is the same as above):
package com.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePollWordCount {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from Flume (the Flume agent's address). The Seq can hold several
    // InetSocketAddress entries to pull from multiple Flume agents at once.
    val address = Seq(new InetSocketAddress("172.16.0.11", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
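Because the addresses are passed as a Seq, a single streaming job can pull from several Flume agents at once, as the comment above notes. A minimal sketch of that variant (the host names mini1 and mini2 are placeholders, not part of the original setup):

package com.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: pulling from two Flume agents at once; host names are placeholders
object FlumeMultiPollWordCount {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FlumeMultiPollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // One entry per Flume agent running the SparkSink on port 8888
    val addresses = Seq(
      new InetSocketAddress("mini1", 8888),
      new InetSocketAddress("mini2", 8888)
    )
    val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)
    val counts = flumeStream
      .flatMap(e => new String(e.event.getBody.array()).split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}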
Configure Flume
For the pull approach, the relevant jars must first be placed in Flume's lib directory (Flume needs the Spark sink classes so that the Spark program can pull from it). The official website lists the exact jars; roughly, they are the spark-streaming-flume-sink jar, a matching scala-library, and commons-lang3.
Configure Flume (flume-poll.conf):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink (this is the Flume address that Spark will pull from)
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = mini1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume first (so that the SparkSink is up and listening), then start the Spark Streaming job in IDEA:
bin/flume-ng agent -c conf -f conf/flume-poll.conf -n a1 -Dflume.root.logger=INFO,console   # the -D parameter is optional