Big Data Entry, Day 24 -- Spark Streaming (2): Integration with Flume and Kafka


The data source used in the previous article took its data from a socket, which is a bit of an unorthodox approach; in practice the data is usually taken from Kafka or another message queue.

The main supported sources, as listed on the official website, include advanced sources such as Kafka, Flume and Kinesis.

  Data can be acquired in two forms: push and pull.

I. Spark Streaming integration with Flume

  1. The push approach

The pull approach (covered below) is generally the recommended one.

    Introduce dependencies:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
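    For reference, a rough sbt equivalent of the Maven dependency above (sparkVersion is an assumed value mirroring ${spark.version}; the version number shown is only illustrative):

// Hypothetical sbt equivalent of the Maven dependency above;
// with %% sbt appends the project's Scala binary version (here 2.10)
val sparkVersion = "1.6.3"

libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % sparkVersion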

   Write code:

package com.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by ZX on 2015/6/22.
 */
object FlumePushWordCount {

  def main(args: Array[String]) {
    val host = args(0)
    val port = args(1).toInt
    val conf = new SparkConf().setAppName("FlumeWordCount") //.setMaster("local[2]")
    // With this constructor the SparkContext can be omitted; it is built internally
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Flume sends data to Spark (note that host and port here are the
    // streaming receiver's own address, i.e. where others send data to)
    val flumeStream = FlumeUtils.createStream(ssc, host, port)
    // The real content of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
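    As a side note, FlumeUtils.createStream also has an overload that accepts an explicit StorageLevel; a minimal sketch of only the changed call (ssc, host and port are reused from the program above, everything else stays the same):

// Sketch: push-mode receiver with an explicit storage level instead of the
// default serialized memory-and-disk level; reuses ssc, host, port from above
import org.apache.spark.storage.StorageLevel

val flumeStream = FlumeUtils.createStream(ssc, host, port, StorageLevel.MEMORY_AND_DISK)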

    flume-push.conf, the Flume-side configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiver (the Spark Streaming program's address)
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

  2. The pull approach

This is the recommended approach: Spark Streaming actively pulls the data produced by Flume, which gives stronger reliability and fault-tolerance guarantees than having Flume push the data.

    Write the code (the dependency is the same as above):

package com.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePollWordCount {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from Flume (this is Flume's address); the Seq can hold several
    // InetSocketAddress entries in order to pull from multiple Flume agents
    val address = Seq(new InetSocketAddress("172.16.0.11", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
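    Because the addresses are passed as a Seq, a single streaming job can pull from several Flume agents at once; a short sketch with hypothetical hostnames/ports (ssc is reused from the program above):

// Sketch: pulling from two Flume agents at once (the addresses are made up for illustration)
val addresses = Seq(
  new InetSocketAddress("172.16.0.11", 8888),
  new InetSocketAddress("172.16.0.12", 8888)
)
val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)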

      Configure Flume

  For the pull approach, the relevant jars need to be placed in Flume's lib directory (they are required so that the Spark program can pull data from Flume); the official website lists the specific jars, roughly the spark-streaming-flume-sink jar together with its scala-library and commons-lang3 dependencies:

  

    flume-poll.conf, the Flume configuration:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink (this is Flume's own address, waiting to be pulled)
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = mini1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

    Start Flume, then start the Spark Streaming program in IDEA:

bin/flume-ng agent -c conf -f conf/flume-poll.conf -n a1 -Dflume.root.logger=INFO,console
(the trailing -Dflume.root.logger option is optional)
