Pulling data from Flume in Spark Streaming

The relevant JIRA issue is here:

https://issues.apache.org/jira/browse/SPARK-1729

What follows is my personal understanding; if you have questions, please leave a comment.

In fact, Flume itself does not support a Kafka-style publish/subscribe model; that is, Spark cannot pull data from Flume directly. So the Spark developers came up with a clever workaround.

In Flume, a sink actively takes data from the channel. The trick is to write a custom sink that itself listens on a port, and then connect Spark Streaming to that sink, so that the streaming side decides whether to take data and how often. Doesn't that effectively let Spark Streaming pull data from Flume?
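
To make the mechanism concrete, here is a rough Scala sketch of the RPC contract between the Spark receiver and the custom sink. The method names mirror Spark's SparkFlumeProtocol Avro interface, but the types are simplified placeholders for illustration, not the real Avro-generated classes:

// Placeholder types standing in for the Avro-generated classes.
case class SparkSinkEvent(headers: Map[String, String], body: Array[Byte])
case class EventBatch(sequenceNumber: String, events: Seq[SparkSinkEvent])

trait SparkFlumeProtocol {
  // The Spark receiver decides when to call this and how many events to take.
  def getEventBatch(maxBatchSize: Int): EventBatch
  // Commit the Flume transaction that produced the batch with this sequence number.
  def ack(sequenceNumber: String): Unit
  // Roll the transaction back so its events are redelivered later.
  def nack(sequenceNumber: String): Unit
}

Because the receiver drives getEventBatch, it controls both whether data leaves the channel and how fast.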

You see, it's a clever trick. Still, in my opinion, if you really need publish/subscribe, you should probably just use Kafka...

Finally, here is how to use it.

First, you need to compile the following code into a jar and use it in Flume. The code comes from here (if you find that it depends on helper classes, look for them in the Scala files in the same directory).

package org.apache.spark.streaming.flume.sink

import java.net.InetSocketAddress
import java.util.concurrent._

import org.apache.avro.ipc.NettyServer
import org.apache.avro.ipc.specific.SpecificResponder
import org.apache.flume.Context
import org.apache.flume.Sink.Status
import org.apache.flume.conf.{Configurable, ConfigurationException}
import org.apache.flume.sink.AbstractSink

/**
 * A sink that uses Avro RPC to run a server that can be polled by Spark's
 * FlumePollingInputDStream. This sink has the following configuration parameters:
 *
 * hostname - The hostname to bind to. Default: 0.0.0.0
 * port - The port to bind to. (No default - mandatory)
 * timeout - Time in seconds after which a transaction is rolled back,
 * if an ACK is not received from Spark within that time
 * threads - Number of threads to use to receive requests from Spark (Default: 10)
 *
 * This sink is unlike other Flume sinks in the sense that it does not push data,
 * instead the process method in this sink simply blocks the SinkRunner the first time it is
 * called. This sink starts up an Avro IPC server that uses the SparkFlumeProtocol.
 *
 * Each time a getEventBatch call comes, creates a transaction and reads events
 * from the channel. When enough events are read, the events are sent to the Spark receiver and
 * the thread itself is blocked and a reference to it saved off.
 *
 * When the ack for that batch is received,
 * the thread which created the transaction is retrieved and it commits the transaction with the
 * channel from the same thread it was originally created in (since Flume transactions are
 * thread local). If a nack is received instead, the sink rolls back the transaction. If no ack
 * is received within the specified timeout, the transaction is rolled back too. If an ack comes
 * after that, it is simply ignored and the events get re-sent.
 */
class SparkSink extends AbstractSink with Logging with Configurable {

  // Size of the pool to use for holding transaction processors.
  private var poolSize: Integer = SparkSinkConfig.DEFAULT_THREADS

  // Timeout for each transaction. If spark does not respond in this much time,
  // rollback the transaction
  private var transactionTimeout = SparkSinkConfig.DEFAULT_TRANSACTION_TIMEOUT

  // Address info to bind on
  private var hostname: String = SparkSinkConfig.DEFAULT_HOSTNAME
  private var port: Int = 0

  private var backOffInterval: Int = 200

  // Handle to the server
  private var serverOpt: Option[NettyServer] = None

  // The handler that handles the callback from Avro
  private var handler: Option[SparkAvroCallbackHandler] = None

  // Latch that blocks off the Flume framework from wasting 1 thread.
  private val blockingLatch = new CountDownLatch(1)

  override def start() {
    logInfo("Starting Spark Sink: " + getName + " on port: " + port + " and interface: " +
      hostname + " with pool size: " + poolSize + " and transaction timeout: " +
      transactionTimeout + ".")
    handler = Option(new SparkAvroCallbackHandler(poolSize, getChannel, transactionTimeout,
      backOffInterval))
    val responder = new SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
    // Using the constructor that takes specific thread-pools requires bringing in netty
    // dependencies which are being excluded in the build. In practice,
    // Netty dependencies are already available on the JVM as Flume would have pulled them in.
    serverOpt = Option(new NettyServer(responder, new InetSocketAddress(hostname, port)))
    serverOpt.foreach(server => {
      logInfo("Starting Avro server for sink: " + getName)
      server.start()
    })
    super.start()
  }

  override def stop() {
    logInfo("Stopping Spark Sink: " + getName)
    handler.foreach(callbackHandler => {
      callbackHandler.shutdown()
    })
    serverOpt.foreach(server => {
      logInfo("Stopping Avro Server for sink: " + getName)
      server.close()
      server.join()
    })
    blockingLatch.countDown()
    super.stop()
  }

  override def configure(ctx: Context) {
    import SparkSinkConfig._
    hostname = ctx.getString(CONF_HOSTNAME, DEFAULT_HOSTNAME)
    port = Option(ctx.getInteger(CONF_PORT)).
      getOrElse(throw new ConfigurationException("The port to bind to must be specified"))
    poolSize = ctx.getInteger(THREADS, DEFAULT_THREADS)
    transactionTimeout = ctx.getInteger(CONF_TRANSACTION_TIMEOUT, DEFAULT_TRANSACTION_TIMEOUT)
    backOffInterval = ctx.getInteger(CONF_BACKOFF_INTERVAL, DEFAULT_BACKOFF_INTERVAL)
    logInfo("Configured Spark Sink with hostname: " + hostname + ", port: " + port + ", " +
      "poolSize: " + poolSize + ", transactionTimeout: " + transactionTimeout + ", " +
      "backoffInterval: " + backOffInterval)
  }

  override def process(): Status = {
    // This method is called in a loop by the Flume framework - block it until the sink is
    // stopped to save CPU resources. The sink runner will interrupt this thread when the sink is
    // being shut down.
    logInfo("Blocking Sink Runner, sink will continue to run..")
    blockingLatch.await()
    Status.BACKOFF
  }

  private[flume] def getPort(): Int = {
    serverOpt
      .map(_.getPort)
      .getOrElse(throw new RuntimeException("Server was not started!"))
  }

  /**
   * Pass in a [[CountDownLatch]] for testing purposes. This latch is counted down when each
   * batch is received. The test can simply call await on this latch till the expected number of
   * batches are received.
   * @param latch
   */
  private[flume] def countdownWhenBatchReceived(latch: CountDownLatch) {
    handler.foreach(_.countDownWhenBatchAcked(latch))
  }
}

/**
 * Configuration parameters and their defaults.
 */
private[flume] object SparkSinkConfig {
  val THREADS = "threads"
  val DEFAULT_THREADS = 10

  val CONF_TRANSACTION_TIMEOUT = "timeout"
  val DEFAULT_TRANSACTION_TIMEOUT = 60

  val CONF_HOSTNAME = "hostname"
  val DEFAULT_HOSTNAME = "0.0.0.0"

  val CONF_PORT = "port"

  val CONF_BACKOFF_INTERVAL = "backoffInterval"
  val DEFAULT_BACKOFF_INTERVAL = 200
}
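
Once the jar is on the Flume agent's classpath, the sink is wired in through the normal Flume configuration file. A minimal sketch, assuming an agent named agent1 and an already-defined channel named memoryChannel (both names are placeholders):

agent1.sinks = spark
agent1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.spark.hostname = 0.0.0.0
agent1.sinks.spark.port = 9999
agent1.sinks.spark.channel = memoryChannel

The hostname and port are whatever you want the sink's Avro server to listen on; the streaming job below must poll that same address.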

Then use the following code in your Spark Streaming application:

package org.apache.spark.examples.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam

/**
 * Produces a count of events received from Flume.
 *
 * This should be used in conjunction with the Spark Sink running in a Flume agent. See
 * the Spark Streaming programming guide for more details.
 *
 * Usage: FlumePollingEventCount <host> <port>
 *   `host` is the host on which the Spark Sink is running.
 *   `port` is the port at which the Spark Sink is listening.
 */
object FlumePollingEventCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    val Array(host, IntParam(port)) = args
    val batchInterval = Milliseconds(2000)

    // Create the context and set the batch size
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Create a flume stream that polls the Spark Sink running in a Flume agent
    val stream = FlumeUtils.createPollingStream(ssc, host, port)

    // Print out the count of events received from this server in each batch
    stream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}
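
If you run this sink in more than one Flume agent, FlumeUtils also provides an overload of createPollingStream that polls several sinks from a single stream. A sketch, assuming hypothetical hostnames and reusing the ssc from the example above:

import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel

// Hypothetical addresses of two Flume agents, each running a SparkSink on port 9999.
val addresses = Seq(
  new InetSocketAddress("flume-host-1", 9999),
  new InetSocketAddress("flume-host-2", 9999))

// One input stream polls both sinks; the storage level is up to the application.
val stream = FlumeUtils.createPollingStream(
  ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)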
