Flume is a real-time message collection system. It provides a variety of sources, channels, and sinks, which can be chosen according to the actual situation.
Flume Download and Documentation:
http://flume.apache.org/
Kafka
Kafka is a high-throughput distributed publish-subscribe messaging system that has the following features:
Persists messages via O(1) disk data structures, so performance stays stable even with terabytes of stored messages.
High throughput: even on very ordinary hardware, Kafka can handle hundreds of thousands of messages per second.
Supports partitioning messages across Kafka brokers and distributing consumption over consumer clusters.
Supports parallel data loading into Hadoop.
The purpose of Kafka is to provide a publish-subscribe solution that can handle all the activity-stream data of a consumer-scale website. This kind of activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of the throughput involved, such data is usually handled by log processing and log aggregation. That approach works for log-like data destined for offline analysis systems such as Hadoop, but it cannot satisfy real-time processing constraints. Kafka's goal is to unify online and offline message processing: it can load data into Hadoop in parallel, and it also supports real-time consumption across a cluster of machines.
Kafka's distributed publish-subscribe architecture is shown below (diagram taken from the Kafka official website):
[Figure: Kafka distributed publish-subscribe architecture]
Configure Kafka's configuration file config/server.properties; other settings can be adjusted according to your own environment.
[Figure: server.properties settings used in this setup]
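As a rough sketch, the handful of server.properties settings that usually matter for a single-broker setup look like the following (the values are illustrative assumptions, using the broker address that appears later in this post):
broker.id=0
port=9092
host.name=10.10.10.127
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181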
Start Kafka. Before starting Kafka, start ZooKeeper first; ZooKeeper's configuration is not described again here.
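If a standalone ZooKeeper is not already running, the ZooKeeper bundled with Kafka can be started with its default configuration, for example:
# bin/zookeeper-server-start.sh config/zookeeper.properties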
# bin/kafka-server-start.sh config/server.properties
Create a topic
# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
View Topic
# bin/kafka-topics.sh --list --zookeeper localhost:2181
Test normal production and consumption to verify that the whole path works correctly:
# bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
# bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
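Anything typed into the producer terminal should show up in the consumer terminal, which confirms that the broker itself is working.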
Next comes the integration between the frameworks.
Flume and Kafka Integration
1. Download the flumeng-kafka-plugin: https://github.com/beyondj2ee/flumeng-kafka-plugin
2. Extract the flume-conf.properties file from the plugin.
Modify the #source section of that file:
producer.sources.s.type = exec
producer.sources.s.command = tail -f -n +1 /mnt/hgfs/vmshare/test.log
producer.sources.s.channels = c
Change the value of every topic property to test.
Put the modified configuration file into the flume/conf directory.
From the plugin project, copy the following jar packages into flume's lib directory:
[Figure: jar packages to copy into flume/lib]
Also put the flumeng-kafka-plugin.jar from the plugin's package directory into flume's lib directory.
The complete Flume configuration file is attached below:
############################################
# producer Config
###########################################
#agent section
producer.sources = s
producer.channels = c
producer.sinks = r
#source section
producer.sources.s.type = exec
producer.sources.s.channels = c
producer.sources.s.command = tail -f /var/log/messages
#producer.sources.s.type=spooldir
#producer.sources.s.spoolDir=/home/xiaojie.li
#producer.sources.s.fileHeader=false
#producer.sources.s.type=syslogtcp
#producer.sources.s.port=5140
#producer.sources.s.host=localhost
# each sink's type must be defined
producer.sinks.r.type = org.apache.flume.plugins.KafkaSink
producer.sinks.r.metadata.broker.list=10.10.10.127:9092
producer.sinks.r.zk.connect=10.10.10.127:2181
producer.sinks.r.partition.key=0
producer.sinks.r.partitioner.class=org.apache.flume.plugins.SinglePartition
producer.sinks.r.serializer.class=kafka.serializer.StringEncoder
producer.sinks.r.request.required.acks=0
producer.sinks.r.max.message.size=1000000
producer.sinks.r.producer.type=sync
producer.sinks.r.custom.encoding=UTF-8
producer.sinks.r.custom.topic.name=test
#Specify the channel the sink should use
producer.sinks.r.channel = c
# each channel's type is defined.
producer.channels.c.type = memory
producer.channels.c.capacity = 1000
producer.channels.c.transactionCapacity = 100
#producer.channels.c.type=file
#producer.channels.c.checkpointDir=/home/checkdir
#producer.channels.c.dataDirs=/home/datadir
Validating the Flume and Kafka combination
Kafka has already been started above, so here we just start Flume directly:
# bin/flume-ng agent -c conf -f conf/master.properties -n producer -Dflume.root.logger=INFO,console
[Figure: Flume agent startup output]
Use Kafka's kafka-console-consumer.sh script to check whether Flume has delivered data to Kafka.
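For example, re-running the same console consumer as before:
# bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning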
[Figure: kafka-console-consumer output]
You can see that the output of tail /var/log/messages has been delivered through Flume to Kafka, which shows that the Flume + Kafka combination is working.
The logs ultimately need to be stored in HDFS.
That part also requires developing a plugin of your own, essentially a Kafka consumer that writes to HDFS; it is not covered in detail here.
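As a minimal sketch of what such a consumer could look like (an assumption-laden illustration, not the actual plugin: it uses the Kafka 0.8 high-level consumer API and the Hadoop 2.x HDFS client, and the NameNode URI, output path, and consumer group id are placeholders):

import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class KafkaToHdfs {
    public static void main(String[] args) throws Exception {
        // High-level consumer configuration; the ZooKeeper address matches the setup above.
        Properties props = new Properties();
        props.put("zookeeper.connect", "10.10.10.127:2181");
        props.put("group.id", "hdfs-writer");          // placeholder consumer group
        props.put("auto.offset.reset", "smallest");    // start from the beginning of the topic
        ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One stream for the "test" topic used throughout this post.
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("test", 1);
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                consumer.createMessageStreams(topicCountMap);
        KafkaStream<byte[], byte[]> stream = streams.get("test").get(0);

        // Open an HDFS output file; the NameNode host and path are placeholders.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        FSDataOutputStream out = fs.create(new Path("/logs/kafka/test.log"));

        // Write every Kafka message as one line in the HDFS file.
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> msg = it.next();
            out.write(msg.message());
            out.write('\n');
            out.hflush(); // flush to HDFS after each message; fine for a demo, too slow for production
        }
    }
}

A real plugin would additionally handle file rolling, batching, and offset management instead of flushing every single message.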
This article is from the "Technology never ends, we keep moving forward" blog; please be sure to keep this source: http://470220878.blog.51cto.com/3101627/1566728