This document covers the Cloudera Distribution of Apache Kafka 1.3.x. Other versions of this document are available in the Cloudera documentation.
Using Kafka with Flume
In CDH 5.2.0 and higher, Flume contains a Kafka source and sink. Use these to stream data from Kafka to Hadoop or from any Flume source to Kafka.
Important: Do not configure a Kafka source to send data to a Kafka sink. If you do, the Kafka source sets the topic in the event header, overriding the sink configuration and creating an infinite loop that sends messages back and forth between the source and sink. If you need to use both a source and a sink, use an interceptor to modify the event header and set a different topic.
Kafka Source
Use the Kafka source to stream data from Kafka topics to Hadoop. The Kafka source can be combined with any Flume sink, which makes it easy to write data from Kafka to HDFS, HBase, and Solr.
The following Flume configuration example uses a Kafka source to send data to an HDFS sink:
    tier1.sources = source1
    tier1.channels = channel1
    tier1.sinks = sink1

    tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
    tier1.sources.source1.zookeeperConnect = zk01.example.com:2181
    tier1.sources.source1.topic = weblogs
    tier1.sources.source1.groupId = flume
    tier1.sources.source1.channels = channel1
    tier1.sources.source1.interceptors = i1
    tier1.sources.source1.interceptors.i1.type = timestamp
    tier1.sources.source1.kafka.consumer.timeout.ms = 100

    tier1.channels.channel1.type = memory
    tier1.channels.channel1.capacity = 10000
    tier1.channels.channel1.transactionCapacity = 1000

    tier1.sinks.sink1.type = hdfs
    tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
    tier1.sinks.sink1.hdfs.rollInterval = 5
    tier1.sinks.sink1.hdfs.rollSize = 0
    tier1.sinks.sink1.hdfs.rollCount = 0
    tier1.sinks.sink1.hdfs.fileType = DataStream
    tier1.sinks.sink1.channel = channel1
For higher throughput, configure multiple Kafka sources to read from a single Kafka topic. If all sources are configured with the same groupId and the topic contains multiple partitions, each source reads data from a different set of partitions, which improves the ingest rate.
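As a minimal sketch of the multi-source setup described above, the following fragment defines two Kafka sources that share one groupId and feed the same channel. The name source2 and the reuse of channel1 are assumptions chosen for illustration; the channel and sink definitions are omitted and would follow the earlier example.

```properties
# Two Kafka sources in the same consumer group (source2 is an illustrative name).
tier1.sources = source1 source2
tier1.channels = channel1

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = zk01.example.com:2181
tier1.sources.source1.topic = weblogs
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1

# Same topic and same groupId, so Kafka splits the topic's partitions
# between source1 and source2.
tier1.sources.source2.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source2.zookeeperConnect = zk01.example.com:2181
tier1.sources.source2.topic = weblogs
tier1.sources.source2.groupId = flume
tier1.sources.source2.channels = channel1
```

Adding more sources than the topic has partitions is unlikely to help, because each partition is read by only one consumer in a group.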
The following table describes the parameters the Kafka source supports; required properties are listed in bold.
Table 1. Kafka Source Properties
| Property Name | Default Value | Description |
|---|---|---|
| **type** | | Must be set to org.apache.flume.source.kafka.KafkaSource. |
| **zookeeperConnect** | | The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181). |
| **topic** | | The Kafka topic from which the source reads messages. Flume supports only one topic per source. |
| groupId | flume | The unique identifier of the Kafka consumer group. Set the same groupId in all sources to indicate that they belong to the same consumer group. |
| batchSize | 1000 | The maximum number of messages written to the channel in a single batch. |
| batchDurationMillis | 1000 | The maximum time (in milliseconds) before a batch is written to the channel. |
| Other properties supported by the Kafka consumer | | Used to configure the Kafka consumer used by the Kafka source. You can use any consumer properties supported by Kafka. Prepend the consumer property name with the prefix kafka. (for example, kafka.fetch.min.bytes). See the Kafka documentation for the full list of Kafka consumer properties. |
Tuning
The Kafka source overrides two Kafka consumer properties:
- auto.commit.enable is set to false by the source, and every batch is committed. To improve performance, set kafka.auto.commit.enable to true instead. This can lose data if the source goes down before committing.
- consumer.timeout.ms is set to 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to be available. Setting this to a higher value can reduce CPU utilization due to less frequent polling, but introduces latency in writing batches to the channel.
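The two overrides above could be changed in the source configuration as in the following sketch. It assumes the source1 definition from the earlier example and assumes that lower CPU use and higher throughput matter more than the risk of losing an uncommitted batch; the value 100 is only an illustration.

```properties
# Let the Kafka consumer commit offsets automatically instead of per batch.
# Data can be lost if the source goes down before a commit.
tier1.sources.source1.kafka.auto.commit.enable = true

# Wait up to 100 ms per poll instead of 10 ms: less frequent polling,
# at the cost of more latency when writing batches to the channel.
tier1.sources.source1.kafka.consumer.timeout.ms = 100
```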
Kafka Sink
Use the Kafka sink to send data from a Flume source to Kafka. You can use the Kafka sink in addition to Flume sinks such as HBase or HDFS.
The following Flume configuration example uses a Kafka sink with an exec source:
    tier1.sources = source1
    tier1.channels = channel1
    tier1.sinks = sink1

    tier1.sources.source1.type = exec
    tier1.sources.source1.command = /usr/bin/vmstat 1
    tier1.sources.source1.channels = channel1

    tier1.channels.channel1.type = memory
    tier1.channels.channel1.capacity = 10000
    tier1.channels.channel1.transactionCapacity = 1000

    tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
    tier1.sinks.sink1.topic = sink1
    tier1.sinks.sink1.brokerList = kafka01.example.com:9092,kafka02.example.com:9092
    tier1.sinks.sink1.channel = channel1
    tier1.sinks.sink1.batchSize = 20
The following table describes the parameters the Kafka sink supports; required properties are listed in bold.
Table 2. Kafka Sink Properties
| Property Name | Default Value | Description |
|---|---|---|
| **type** | | Must be set to org.apache.flume.sink.kafka.KafkaSink. |
| **brokerList** | | The brokers the Kafka sink uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability. |
| topic | default-flume-topic | The Kafka topic to which messages are published by default. If the event header contains a topic field, the event is published to the designated topic, overriding the configured topic. |
| batchSize | 100 | The number of messages to process in a single batch. Specifying a larger batchSize can improve throughput but increases latency. |
| requiredAcks | 1 | The number of replicas that must acknowledge a message before it is considered successfully written. Possible values are 0 (do not wait for an acknowledgement), 1 (wait for the leader to acknowledge only), and -1 (wait for all replicas to acknowledge). To avoid potential loss of data in case of a leader failure, set this to -1. |
| Other properties supported by the Kafka producer | | Used to configure the Kafka producer used by the Kafka sink. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties. |
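As an illustration of the sink properties above, the following fragment adjusts the sink1 definition from the earlier example to favor durability. It is a sketch under assumptions: the snappy codec is only one possible value for the producer's compression.codec setting, and whether compression helps depends on the data.

```properties
# Wait for all replicas to acknowledge each message, to avoid data loss
# if the partition leader fails.
tier1.sinks.sink1.requiredAcks = -1

# Pass-through producer property: the kafka. prefix hands compression.codec
# to the Kafka producer (snappy is an illustrative choice).
tier1.sinks.sink1.kafka.compression.codec = snappy
```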
The Kafka sink uses the topic and key properties from the FlumeEvent headers to determine where to send events in Kafka. If the header contains the topic property, the event is sent to the designated topic, overriding the configured topic. If the header contains the key property, that key is used to partition events within the topic. Events with the same key are sent to the same partition. If the key parameter is not specified, events are distributed randomly to partitions. Use these properties to control the topics and partitions to which events are sent through the Flume source or interceptor.
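One way to set the topic header is Flume's static interceptor, sketched below. The interceptor name i1 and the topic name weblogs-copy are hypothetical; preserveExisting is set to false so that a topic header already placed on the event (for example, by a Kafka source) is overwritten rather than kept.

```properties
# Static interceptor that forces the topic header to a fixed value
# (weblogs-copy is a hypothetical topic name).
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = static
tier1.sources.source1.interceptors.i1.key = topic
tier1.sources.source1.interceptors.i1.value = weblogs-copy
tier1.sources.source1.interceptors.i1.preserveExisting = false
```

The same pattern with key set to key would attach a fixed partitioning key to every event, which sends them all to one partition, so it is mainly useful for testing.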
Kafka Channel
CDH 5.3 and higher adds a Kafka channel to Flume in addition to the existing memory and file channels. You can use the Kafka channel:
- To write to Hadoop directly from Kafka without using a source.
- To write to Kafka directly from Flume sources without additional buffering.
- As a reliable and highly available channel for any source/sink combination.
The following Flume configuration uses a Kafka channel with an exec source and an HDFS sink:
    tier1.sources = source1
    tier1.channels = channel1
    tier1.sinks = sink1

    tier1.sources.source1.type = exec
    tier1.sources.source1.command = /usr/bin/vmstat 1
    tier1.sources.source1.channels = channel1

    tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
    tier1.channels.channel1.capacity = 10000
    tier1.channels.channel1.transactionCapacity = 1000
    tier1.channels.channel1.brokerList = kafka02.example.com:9092,kafka03.example.com:9092
    tier1.channels.channel1.topic = channel2
    tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181
    tier1.channels.channel1.parseAsFlumeEvent = true

    tier1.sinks.sink1.type = hdfs
    tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel
    tier1.sinks.sink1.hdfs.rollInterval = 5
    tier1.sinks.sink1.hdfs.rollSize = 0
    tier1.sinks.sink1.hdfs.rollCount = 0
    tier1.sinks.sink1.hdfs.fileType = DataStream
    tier1.sinks.sink1.channel = channel1
The following table describes the parameters the Kafka channel supports; required properties are listed in bold.
Table 3. Kafka Channel Properties
| Property Name | Default Value | Description |
|---|---|---|
| **type** | | Must be set to org.apache.flume.channel.kafka.KafkaChannel. |
| **brokerList** | | The brokers the Kafka channel uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability. |
| **zookeeperConnect** | | The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181). |
| topic | flume-channel | The Kafka topic the channel uses. |
| groupId | flume | The unique identifier of the Kafka consumer group the channel uses to register with Kafka. |
| parseAsFlumeEvent | true | Set to true if a Flume source is writing to the channel and expects Avro datums with the FlumeEvent schema (org.apache.flume.source.avro.AvroFlumeEvent) in the channel. Set to false if other producers are writing to the topic that the channel is using. |
| readSmallestOffset | false | If true, reads all data in the topic. If false, reads only data written after the channel has started. Only used if parseAsFlumeEvent is false. |
| kafka.consumer.timeout.ms | 100 | The polling interval, in milliseconds, when writing data to the sink. |
| Other properties supported by the Kafka producer | | Used to configure the Kafka producer. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties. |
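The first use case listed for the Kafka channel, writing to Hadoop directly from Kafka without a source, might look like the following sketch. It assumes that some other producer (not Flume) writes plain messages to a topic named weblogs, so parseAsFlumeEvent is set to false; the topic name and the choice to read from the smallest offset are illustrative assumptions.

```properties
# Agent with a Kafka channel and an HDFS sink, but no source.
tier1.channels = channel1
tier1.sinks = sink1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = kafka02.example.com:9092,kafka03.example.com:9092
tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181
tier1.channels.channel1.topic = weblogs
# Messages come from non-Flume producers, so they are not Avro FlumeEvents.
tier1.channels.channel1.parseAsFlumeEvent = false
# Read data already in the topic, not only data written after startup.
tier1.channels.channel1.readSmallestOffset = true

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
```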
Original address: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka_flume.html