Target: using a Flume agent, pull the data out of Kafka and feed it into Elasticsearch.
Analysis: for the Flume agent to work, two pieces are needed: a Flume Kafka Source, responsible for reading data out of Kafka; and a Flume ElasticSearch Sink, responsible for writing the data into ElasticSearch.
The current Flume 1.5.2 already contains an ElasticSearchSink, so only the Flume Kafka source needs a custom implementation. From JIRA it is known that Flume 1.6.0 will include flume-ng-kafka-source; however, Flume 1.6.0 has not been released yet. What to do? Two roads: the flume-ng-kafka-source code in the Flume 1.6.0 branch, or the flume-ng-kafka-source project on GitHub.
The flume-ng-kafka-source part of the Flume 1.6.0 branch was chosen first; this code is already included in flume-ng-extends-source.

Compiling the code
Execute the command mvn clean package to obtain the jar package flume-ng-extends-source-x.x.x.jar.

Installing the plugin
Two kinds of jar packages:
Into lib: flume-ng-extends-source-x.x.x.jar
Into libext: kafka_2.9.2-0.8.2.0.jar, kafka-clients-0.8.2.0.jar, metrics-core-2.2.0.jar, scala-library-2.9.2.jar, zkclient-0.3.jar
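Flume looks for plugins under the plugins.d directory: each plugin directory holds its own jar in lib/ and its dependency jars in libext/. For the jars above, the layout could look like this (the plugin directory name is illustrative):

```
$FLUME_HOME/plugins.d/
└── flume-ng-extends-source/
    ├── lib/
    │   └── flume-ng-extends-source-x.x.x.jar
    └── libext/
        ├── kafka_2.9.2-0.8.2.0.jar
        ├── kafka-clients-0.8.2.0.jar
        ├── metrics-core-2.2.0.jar
        ├── scala-library-2.9.2.jar
        └── zkclient-0.3.jar
```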
Question: when packaging with Maven, how to export the current jar package together with the jar packages it depends on? Reference: thilinamb's Flume Kafka sink.
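One way to answer that packaging question: the maven-dependency-plugin's copy-dependencies goal copies a module's dependencies into a directory at package time, which can then be dropped into libext. A sketch for the pom.xml (the binding to the package phase and the output directory are choices, not something the original post specifies):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <executions>
        <execution>
          <id>copy-dependencies</id>
          <phase>package</phase>
          <goals>
            <goal>copy-dependencies</goal>
          </goals>
          <configuration>
            <!-- dependency jars end up in target/libext -->
            <outputDirectory>${project.build.directory}/libext</outputDirectory>
            <includeScope>runtime</includeScope>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

After mvn clean package, the module's own jar goes to lib/ and everything under target/libext goes to libext/.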
Configuration

The agent is configured in a properties file. A sample configuration file:
# Kafka Source for retrieving data from the Kafka cluster.
agent.sources.seqGenSrc.type = com.github.ningg.flume.source.KafkaSource
#agent.sources.seqGenSrc.batchSize = 2
agent.sources.seqGenSrc.batchDurationMillis =
agent.sources.seqGenSrc.topic = good
agent.sources.seqGenSrc.zookeeperConnect = 168.7.2.164:2181,168.7.2.165:2181,168.7.2.166:2181
agent.sources.seqGenSrc.groupId = elasticsearch
#agent.sources.seqGenSrc.kafka.consumer.timeout.ms = 1000
#agent.sources.seqGenSrc.kafka.auto.commit.enable = false

# ElasticSearchSink for ElasticSearch.
agent.sinks.loggerSink.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.loggerSink.indexName = flume
agent.sinks.loggerSink.indexType = log
agent.sinks.loggerSink.batchSize = 100
#agent.sinks.loggerSink.ttl = 5
agent.sinks.loggerSink.client = transport
agent.sinks.loggerSink.hostNames = 168.7.1.69:9300
#agent.sinks.loggerSink.client = rest
#agent.sinks.loggerSink.hostNames = 168.7.1.69:9200
#agent.sinks.loggerSink.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
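The sample above covers only the source and sink; a working agent also needs a channel plus the source/sink/channel bindings. A minimal sketch (the channel name and memory-channel sizing are illustrative, not from the original configuration):

```properties
agent.sources = seqGenSrc
agent.channels = memCh
agent.sinks = loggerSink

# buffer between KafkaSource and ElasticSearchSink
agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 10000
agent.channels.memCh.transactionCapacity = 1000

# wire source and sink to the channel
agent.sources.seqGenSrc.channels = memCh
agent.sinks.loggerSink.channel = memCh
```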
Custom serializer

Goal: customize the ElasticSearchSink's serializer.
Phenomenon: after setting the ElasticSearchSink parameter batchSize = 1000, 120,000+ records appeared in the current ES index, while the source platform found that only 20,000 records had been produced so far. So the guess is that the KafkaSource passed all of the data under the specified topic in the Kafka cluster into ES.
Findings:
New configuration parameters in ElasticSearchSink:
indexNameBuilder = org.apache.flume.sink.elasticsearch.TimeBasedIndexNameBuilder: produces one index per day, named in the indexPrefix-yyyy-MM-dd pattern. The other option is org.apache.flume.sink.elasticsearch.SimpleIndexNameBuilder, which uses the indexPrefix directly (that is, whatever indexName is set to).
dateFormat = yyyy-MM-dd
timeZone = Etc/UTC
serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer: adds the key-value pairs from the Flume event header into a new @fields field. The other option is org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer, which writes the body and the header directly as a JSON string into ElasticSearch.
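Put together, the daily-index options above would look like this in the sink section (the exact property names for dateFormat/timeZone may vary with the Flume version; this follows the names used above):

```properties
agent.sinks.loggerSink.indexName = flume
agent.sinks.loggerSink.indexNameBuilder = org.apache.flume.sink.elasticsearch.TimeBasedIndexNameBuilder
# consumed by TimeBasedIndexNameBuilder: index becomes flume-2015-03-01, flume-2015-03-02, ...
agent.sinks.loggerSink.dateFormat = yyyy-MM-dd
agent.sinks.loggerSink.timeZone = Etc/UTC
agent.sinks.loggerSink.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
```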
Restart

If the Flume agent is terminated and then restarted, two questions arise: will the data in Kafka be sent to Elasticsearch repeatedly? Will some data in Kafka be missed, i.e., never sent to Elasticsearch?
Think through several cases: whether the Kafka consumer group already has a committed offset; and Kafka's periodic cleanup of data, such as a 3-day retention. The restart scenarios of the Flume agent need to be considered in more detail.
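With the ZooKeeper-based high-level consumer that this KafkaSource uses, restart behavior hinges on committed offsets: if the group has a committed offset, consumption resumes from it; if not (or if the offset points at data already cleaned up), the consumer's auto.offset.reset setting decides where to start. A sketch of the relevant settings, assuming (as the kafka.* lines in the sample suggest) that the source forwards kafka.-prefixed properties to the consumer:

```properties
# same groupId across restarts -> resume from the committed offset
agent.sources.seqGenSrc.groupId = elasticsearch
# no valid committed offset: smallest = start from the oldest retained data,
# largest = start from newly arriving data only
agent.sources.seqGenSrc.kafka.auto.offset.reset = smallest
# with auto-commit disabled, offsets are committed only after events are
# handed to the channel: at-least-once delivery, duplicates possible;
# with auto-commit enabled, offsets may advance past unsent events, so
# gaps (lost events) become possible after a crash
agent.sources.seqGenSrc.kafka.auto.commit.enable = false
```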