Kafka, Flume, Elasticsearch

Source: Internet
Author: User

Target: using a Flume agent, take the data out of Kafka and feed it into Elasticsearch.

Analysis: for the Flume agent to work, two pieces are needed: a Flume Kafka source, responsible for reading data from Kafka, and a Flume Elasticsearch sink, responsible for writing the data into Elasticsearch; a minimal wiring sketch follows.
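
For orientation, here is a minimal sketch of the agent wiring in Flume's properties format, using the component names from the sample configuration later in this article; the channel name memoryChannel and its settings are assumptions, not from the original:

agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink

# a memory channel buffers events between source and sink
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

# bind the source and the sink to the channel
agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.channel = memoryChannel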

The current Flume 1.5.2 already contains ElasticSearchSink, so a custom implementation of a Flume Kafka source is required. From JIRA it is known that Flume 1.6.0 will contain flume-ng-kafka-source; however, Flume 1.6.0 has not been released yet. What to do? Two roads: take the flume-ng-kafka-source code from the Flume 1.6.0 branch, or use the flume-ng-kafka-source project on GitHub.

The flume-ng-kafka-source code from the Flume 1.6.0 branch is chosen initially; this code is already included in the flume-ng-extends-source project.

Compiling code

Execute the command mvn clean package to get the jar package flume-ng-extends-source-x.x.x.jar; a build sketch follows.
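
A sketch of the build, assuming the flume-ng-extends-source sources are already checked out locally (paths are illustrative):

cd flume-ng-extends-source
# compile and package; the jar lands under target/
mvn clean package
ls target/flume-ng-extends-source-*.jar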

Installing plugins

Two types of jar packages go into the plugin directory (the resulting layout is sketched after this list):

lib: flume-ng-extends-source-x.x.x.jar
libext: kafka_2.9.2-0.8.2.0.jar, kafka-clients-0.8.2.0.jar, metrics-core-2.2.0.jar, scala-library-2.9.2.jar, zkclient-0.3.jar
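
These two directories follow Flume's plugins.d convention; a sketch of the resulting layout on disk, where the plugin directory name flume-ng-extends-source is illustrative:

$FLUME_HOME/plugins.d/flume-ng-extends-source/
    lib/flume-ng-extends-source-x.x.x.jar
    libext/kafka_2.9.2-0.8.2.0.jar
    libext/kafka-clients-0.8.2.0.jar
    libext/metrics-core-2.2.0.jar
    libext/scala-library-2.9.2.jar
    libext/zkclient-0.3.jar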

Question: how to export the current jar package together with its dependent packages when Maven packages it. Reference: thilinamb's flume-ng-kafka-sink. One common approach is sketched below.
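
A sketch of one common answer, not taken from the original article: bind the copy-dependencies goal of maven-dependency-plugin to the package phase, so that mvn clean package also copies the runtime dependencies next to the built jar.

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <executions>
        <execution>
          <id>copy-dependencies</id>
          <phase>package</phase>
          <goals>
            <goal>copy-dependencies</goal>
          </goals>
          <configuration>
            <!-- dependencies are copied here, alongside the built jar -->
            <outputDirectory>${project.build.directory}/lib</outputDirectory>
            <includeScope>runtime</includeScope>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>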

Configuration

The source and the sink are configured in the agent's properties file. A sample file:

# Kafka source, for retrieving data from the Kafka cluster.
agent.sources.seqGenSrc.type = com.github.ningg.flume.source.KafkaSource
#agent.sources.seqGenSrc.batchSize = 2
agent.sources.seqGenSrc.batchDurationMillis =
agent.sources.seqGenSrc.topic = good
agent.sources.seqGenSrc.zookeeperConnect = 168.7.2.164:2181,168.7.2.165:2181,168.7.2.166:2181
agent.sources.seqGenSrc.groupId = elasticsearch
#agent.sources.seqGenSrc.kafka.consumer.timeout.ms = 1000
#agent.sources.seqGenSrc.kafka.auto.commit.enable = false

# ElasticSearchSink, for writing into Elasticsearch.
agent.sinks.loggerSink.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.loggerSink.indexName = flume
agent.sinks.loggerSink.indexType = log
agent.sinks.loggerSink.batchSize = 100
#agent.sinks.loggerSink.ttl = 5
agent.sinks.loggerSink.client = transport
agent.sinks.loggerSink.hostNames = 168.7.1.69:9300
#agent.sinks.loggerSink.client = rest
#agent.sinks.loggerSink.hostNames = 168.7.1.69:9200
#agent.sinks.loggerSink.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
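
To run the agent with this file, something like the following; the config file name kafka-to-es.properties is illustrative:

bin/flume-ng agent --conf conf --conf-file conf/kafka-to-es.properties --name agent
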
Customization

Goal: customize the ElasticSearchSink serializer.

Phenomenon: after setting the ElasticSearchSink parameter batchSize = 1000, the index in the current ES accumulated 120,000+ records, while the originating platform found that only about 20,000 records had been produced so far. The guess: the Kafka source passed all of the data under the specified topic in the Kafka cluster into ES, not just the newly produced data.
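
A plausible explanation, offered here as an assumption rather than a fact from the original: with the Kafka 0.8 high-level consumer, a consumer group that has no committed offset starts according to auto.offset.reset (smallest reads the topic from the beginning, largest reads only new data). If the source passes kafka.-prefixed properties through to the consumer, as the kafka.consumer.timeout.ms line in the sample suggests, the behavior could be pinned like this:

# hypothetical: start from the newest data when the group has no committed offset
agent.sources.seqGenSrc.kafka.auto.offset.reset = largest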

Some notes:

New configuration parameters in ElasticSearchSink (a config sketch follows the list):

indexNameBuilder = org.apache.flume.sink.elasticsearch.TimeBasedIndexNameBuilder
    This produces one index per day, named in the indexPrefix-yyyy-MM-dd pattern. Other option: org.apache.flume.sink.elasticsearch.SimpleIndexNameBuilder, which uses the indexPrefix directly (in effect, the configured indexName).
dateFormat = yyyy-MM-dd
timeZone = Etc/UTC
serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
    This option adds the key-values from the header of the Flume event to a new @fields field. Other option: org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer, which writes the body directly and adds the header as a JSON string to Elasticsearch.
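
A sketch of the sink section with these parameters; whether dateFormat and timeZone are written as sub-properties of indexNameBuilder can vary by Flume version, so treat the key prefixes as assumptions:

agent.sinks.loggerSink.indexNameBuilder = org.apache.flume.sink.elasticsearch.TimeBasedIndexNameBuilder
# assumed sub-property keys; with indexName = flume this yields daily indices like flume-2015-03-17
agent.sinks.loggerSink.indexNameBuilder.dateFormat = yyyy-MM-dd
agent.sinks.loggerSink.indexNameBuilder.timeZone = Etc/UTC
agent.sinks.loggerSink.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer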

Restart

If the Flume agent is terminated and then restarted, two questions arise: will the data in Kafka be sent to Elasticsearch repeatedly, and will any data in Kafka be missed and never sent to Elasticsearch?

Think through several cases: the consumer group used by the Kafka source has a committed offset stored for the topic, and the data in Kafka is cleaned up periodically, e.g. after a 3-day retention period; an offset-inspection sketch follows.
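
With the Kafka 0.8 high-level consumer, committed offsets live in ZooKeeper under /consumers/<groupId>/offsets/<topic>/<partition>. A sketch of inspecting them with ZooKeeper's zkCli.sh, using the group and topic from the sample configuration and assuming partition 0:

bin/zkCli.sh -server 168.7.2.164:2181
# inside the zkCli shell:
get /consumers/elasticsearch/offsets/good/0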

The restart scenario of the Flume agent needs to be considered in detail.
