Using Kafka with Flume

This document applies to the Cloudera distribution of Apache Kafka 1.3.x. Other versions of the document are available in the Cloudera documentation.

In CDH 5.2.0 and later versions, Flume contains a Kafka source and sink. Using them allows data to flow from Kafka to Hadoop, or from any Flume source into Kafka.

Important: You cannot configure a Kafka source to send data to a Kafka sink. If you do, the Kafka source sets the topic in the event header, overriding the sink configuration and creating an infinite loop that sends messages back and forth between the source and sink. If you need to use both a source and a sink, use an interceptor to modify the event header and set a different topic, as sketched below.
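For example, a Flume Static Interceptor can overwrite the topic header set by the Kafka source so that the sink publishes to a different topic. This is a minimal sketch; the agent, component, and topic names (tier1, source1, sink-topic) are placeholders, not values from the original document:

tier1.sources.source1.interceptors = i1
# Static Interceptor: sets a fixed header value on every event
tier1.sources.source1.interceptors.i1.type = static
tier1.sources.source1.interceptors.i1.key = topic
# Overwrite the topic header added by the Kafka source so the Kafka sink
# publishes to sink-topic instead of looping back to the source topic
tier1.sources.source1.interceptors.i1.value = sink-topic
tier1.sources.source1.interceptors.i1.preserveExisting = false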

Kafka Source

Use the Kafka source to stream data from Kafka topics into Hadoop. The Kafka source can be combined with any Flume sink, making it easy to write data from Kafka to HDFS, HBase, and Solr.

The following Flume configuration example uses a Kafka source to send data to an HDFS sink:

tier1.sources  = source1
tier1.channels = channel1
tier1.sinks    = sink1

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = zk01.example.com:2181
tier1.sources.source1.topic = weblogs
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 100

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1

For higher throughput, you can configure multiple Kafka sources to read from the same topic. If all sources are configured with the same groupId and the topic has multiple partitions, each source reads data from a different set of partitions, which improves the ingest rate.
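A minimal sketch of this layout, reusing the names from the example above; the second source (source2) is illustrative and not part of the original example:

tier1.sources = source1 source2

# Both sources join the same consumer group, so Kafka balances the
# partitions of the weblogs topic between them
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = zk01.example.com:2181
tier1.sources.source1.topic = weblogs
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1

tier1.sources.source2.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source2.zookeeperConnect = zk01.example.com:2181
tier1.sources.source2.topic = weblogs
tier1.sources.source2.groupId = flume
tier1.sources.source2.channels = channel1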

The following table describes the parameters supported by the Kafka source; required properties are marked (required).

Table 1. Kafka Source Properties

Property Name | Default Value | Description
type (required) | | Must be set to org.apache.flume.source.kafka.KafkaSource.
zookeeperConnect (required) | | The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).
topic (required) | | The Kafka topic from which the source reads messages. Flume supports only one topic per source.
groupId | flume | The unique identifier of the Kafka consumer group. Set the same groupId in all sources to indicate that they belong to the same consumer group.
batchSize | 1000 | The maximum number of messages that can be written to the channel in a single batch.
batchDurationMillis | 1000 | The maximum time (in milliseconds) before a batch is written to the channel. The batch is written when either the size or the duration limit is reached, whichever comes first.
Other properties supported by the Kafka consumer | | Used to configure the Kafka consumer used by the Kafka source. You can use any consumer properties supported by Kafka. Prepend the consumer property name with the prefix kafka. (for example, kafka.fetch.min.bytes). See the Kafka documentation for the full list of Kafka consumer properties; an example follows this table.
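As a hedged illustration of the kafka. prefix, the snippet below passes two raw consumer properties through the source configuration; the specific values are arbitrary examples, not recommendations from the original document:

# Any Kafka consumer property can be passed through with the kafka. prefix
tier1.sources.source1.kafka.fetch.min.bytes = 50000
tier1.sources.source1.kafka.socket.receive.buffer.bytes = 65536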

Tuning

The Kafka source overrides two Kafka consumer properties:
    1. auto.commit.enable is set to false by the source, and every batch is committed. To improve performance, set this to true instead by using kafka.auto.commit.enable. Note that this setting can lose data if the source goes down before committing.
    2. consumer.timeout.ms is set to 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to become available. Setting this to a higher value can reduce CPU utilization due to less frequent polling, but introduces latency when writing batches to the channel (see the sketch after this list).
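A minimal sketch of both overrides, assuming the tier1/source1 names from the earlier example; whether these trade-offs are acceptable depends on your durability and latency requirements:

# Let Kafka auto-commit offsets instead of committing per batch
# (faster, but data can be lost if the source fails before a commit)
tier1.sources.source1.kafka.auto.commit.enable = true
# Poll less aggressively: wait up to 100 ms for new data to arrive
tier1.sources.source1.kafka.consumer.timeout.ms = 100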
Kafka Sink

Use the Kafka sink to send data from a Flume source to Kafka. You can use the Kafka sink in addition to Flume sinks such as HBase or HDFS.

The following Flume configuration example uses a Kafka sink with an exec source:

tier1.sources  = source1
tier1.channels = channel1
tier1.sinks    = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.sink1.topic = sink1
tier1.sinks.sink1.brokerList = kafka01.example.com:9092,kafka02.example.com:9092
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.batchSize = 20

The following table describes the parameters the Kafka sink supports; required properties are marked (required).

Table 2. Kafka Sink Properties

Property Name | Default Value | Description
type (required) | | Must be set to org.apache.flume.sink.kafka.KafkaSink.
brokerList (required) | | The brokers the Kafka sink uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability.
topic | default-flume-topic | The Kafka topic to which messages are published by default. If the event header contains a topic field, the event is published to that topic, overriding the configured topic.
batchSize | 100 | The number of messages to process in a single batch. Specifying a larger batchSize can improve throughput but increases latency.
requiredAcks | 1 | The number of replicas that must acknowledge a message before it is considered successfully written. Possible values are 0 (never wait for acknowledgement), 1 (wait for the leader only), and -1 (wait for all replicas). To avoid potential loss of data in case of a leader failure, set this to -1.
Other properties supported by the Kafka producer | | Used to configure the Kafka producer used by the Kafka sink. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties; an example follows this table.
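As a hedged illustration of the producer-side kafka. prefix, the snippet below enables snappy compression on the sink's producer; the choice of codec is an example, not a recommendation from the original document:

# Any Kafka producer property can be passed through with the kafka. prefix
tier1.sinks.sink1.kafka.compression.codec = snappy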

The Kafka sink uses the topic and key properties from the FlumeEvent headers to determine where to send events in Kafka. If the header contains the topic property, the event is sent to the designated topic, overriding the configured topic. If the header contains the key property, that key is used to partition events within the topic: events with the same key are sent to the same partition. If the key parameter is not specified, events are distributed randomly among partitions. Use these properties to control the topics and partitions to which events are sent through the Flume source or interceptor.
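As a sketch of header-driven partitioning, the Regex Extractor Interceptor below copies the first word of each event body into a key header, so events that start with the same token land in the same partition. The regular expression and interceptor name are illustrative assumptions, not part of the original document:

tier1.sources.source1.interceptors = i1
# Regex Extractor Interceptor: pull the first word out of the event body
tier1.sources.source1.interceptors.i1.type = regex_extractor
tier1.sources.source1.interceptors.i1.regex = ^(\\w+)
tier1.sources.source1.interceptors.i1.serializers = s1
# Store the captured group in the "key" header used by the Kafka sink
tier1.sources.source1.interceptors.i1.serializers.s1.name = key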

Kafka Channel

CDH 5.3 and later versions contain a Kafka channel for Flume in addition to the existing memory and file channels. You can use the Kafka channel:
    • To write to Hadoop directly from Kafka without using a source (a sketch of this layout follows the configuration example below).
    • To write to Kafka directly from Flume sources without additional buffering.
    • As a reliable and highly available channel for any source/sink combination.
The Flume configuration below uses a Kafka channel with an exec source and an HDFS sink:
tier1.sources  = source1
tier1.channels = channel1
tier1.sinks    = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.channels.channel1.brokerList = kafka02.example.com:9092,kafka03.example.com:9092
tier1.channels.channel1.topic = channel2
tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181
tier1.channels.channel1.parseAsFlumeEvent = true

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
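The first use case above, writing to Hadoop directly from Kafka without a source, can be sketched by dropping the source entirely and pointing the channel at an existing topic. This is an illustrative variation on the example above, assuming non-Flume producers write plain messages to a hypothetical weblogs topic:

tier1.channels = channel1
tier1.sinks    = sink1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = kafka02.example.com:9092,kafka03.example.com:9092
tier1.channels.channel1.topic = weblogs
tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181
# Messages are written by ordinary Kafka producers, not a Flume source,
# so they are not Avro-encoded FlumeEvents
tier1.channels.channel1.parseAsFlumeEvent = false

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1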

The following table describes the parameters supported by the Kafka channel; required properties are marked (required).

Table 3. Kafka Channel Properties

Property Name | Default Value | Description
type (required) | | Must be set to org.apache.flume.channel.kafka.KafkaChannel.
brokerList (required) | | The brokers the Kafka channel uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability.
zookeeperConnect (required) | | The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).
topic | flume-channel | The Kafka topic the channel will use.
groupId | flume | The unique identifier of the Kafka consumer group the channel uses to register with Kafka.
parseAsFlumeEvent | true | Set to true if a Flume source is writing to the channel and expects Avro datums with the FlumeEvent schema (org.apache.flume.source.avro.AvroFlumeEvent) in the channel. Set to false if other producers are writing to the topic that the channel is using.
readSmallestOffset | false | If true, reads all data in the topic; if false, reads only data written after the channel has started. Only used if parseAsFlumeEvent is false.
kafka.consumer.timeout.ms | 100 | Polling interval when writing data to the sink.
Other properties supported by the Kafka producer | | Used to configure the Kafka producer. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties.
Original address: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka_flume.html
