I recently learned how to use Flume, since our company plans to develop its own log collection system. The official user guide: http://flume.apache.org/FlumeUserGuide.html
Flume Architecture

A. Components
First, a structure diagram (from the Internet):
As the diagram shows, a Flume event is defined as a unit of data flow. The data flow passes through agents; an agent is actually a JVM instance, and each agent contains three components: Source, Channel, and Sink.

Source: receives log messages from outside the agent (e.g. from a Web Server); it is the producer of events inside the agent and the consumer of the external logs.
Channel: similar to a message queue; it connects the Source and the Sink.
Sink: the consumer of the events in the Channel; it sends the logs to storage or to the next agent, depending on your configuration (i.e. your Flume topology).
Data source: Flume supports a rich set of data sources, described in detail below.
Storage: where the Sink finally stores the data; it can be a file, HDFS, HBase, etc., or another agent.

B. Sources

Flume contains a rich set of sources, see the user manual: http://flume.apache.org/FlumeUserGuide.html#flume-sources

Avro Source
Provides an Avro interface: Avro messages sent to the configured address and port are received by this source. For example, a Log4jAppender can send messages to the agent via an Avro Source.
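As a minimal sketch of how these components are wired together (the agent name a1, the component names r1/c1/k1, and the port are illustrative), the following configuration connects an Avro source to a logger sink through a memory channel:

    # name the components of agent a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Avro source: receives Avro messages on the given address and port
    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 4141

    # channel: an in-memory queue between source and sink
    a1.channels.c1.type = memory

    # sink: writes events to the log (useful for testing)
    a1.sinks.k1.type = logger

    # wire the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

An agent with this configuration is started along the lines of: flume-ng agent --conf conf --conf-file example.conf --name a1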
Thrift Source
Provides a Thrift interface, similar to the Avro source
Exec Source
When the source starts, it runs a given Unix command (such as cat file) that continuously writes data to standard output (stdout); the output is packaged into events for processing
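A sketch of an exec source tailing a log file (the path is illustrative):

    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app.log
    a1.sources.r1.channels = c1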
JMS Source
Reads messages from a JMS destination, such as ActiveMQ
Spooling Directory Source
Watches a directory; when a new file appears there, the file's contents are packaged into events for processing
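A sketch (the directory path is illustrative):

    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/spool
    # record the source file name in an event header
    a1.sources.r1.fileHeader = true
    a1.sources.r1.channels = c1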
NetCat Source
Listens on a port and packages received messages as events
Syslog Source
Reads syslog data and converts it into events
Multiport Syslog TCP Source
Syslog UDP Source
HTTP Source
Receives HTTP POST or GET requests and converts them into events
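A sketch listening on an illustrative port; the default JSONHandler expects a JSON array of events in the request body:

    a1.sources.r1.type = http
    a1.sources.r1.port = 8080
    a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
    a1.sources.r1.channels = c1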
Custom Source
Users implement the interfaces provided by Flume to build a custom source that meets their needs
C. Channels

http://flume.apache.org/flumeuserguide.html#flume-channels
Memory Channel
Messages are kept in memory; a maximum capacity can be set, and messages beyond that capacity are lost. Suitable for scenarios that need high throughput and can tolerate losing some messages when the agent exits
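A sketch with illustrative capacities (capacity is the maximum number of events held in the channel, transactionCapacity the maximum taken per transaction):

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 100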
JDBC Channel
Messages are persisted to a database via JDBC. Currently only the embedded Derby database is supported. Messages in the channel are not lost when the agent exits and restarts
File Channel
Messages are saved in local files, so messages in the channel are not lost when the agent exits and restarts. Note that by default the file channel saves its data in a fixed directory on the filesystem and locks it; if several file channels are started on the same system at the same time, only the first one starts successfully, and the others fail because the default directory is already locked. You therefore need to specify your own directories when configuring the channel
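A sketch with per-channel directories (paths illustrative), avoiding the locking problem described above:

    a1.channels.c1.type = file
    # each file channel needs its own checkpoint and data directories
    a1.channels.c1.checkpointDir = /data/flume/c1/checkpoint
    a1.channels.c1.dataDirs = /data/flume/c1/data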
Spillable Memory Channel
Messages are stored in an in-memory queue and in local files, with the in-memory queue taking precedence. When the memory queue is full, subsequently received messages are saved to local files, which actually goes through an embedded file channel. According to the official documentation, it is not yet recommended for production use.
Pseudo Transaction Channel
Used only for unit tests
Custom Channel
D. Sinks

http://flume.apache.org/flumeuserguide.html#flume-sinks
Flume provides a variety of sink implementations that save the collected logs to storage or forward them to other systems
HDFS Sink
Logs are saved to HDFS
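A sketch with an illustrative path; the escape sequences in hdfs.path require a timestamp in the event header (here taken from the local clock):

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
    # write plain data rather than the default SequenceFile
    a1.sinks.k1.hdfs.fileType = DataStream
    # roll a new file every 300 seconds
    a1.sinks.k1.hdfs.rollInterval = 300
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.channel = c1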
Logger Sink
Logs are output through log4j, typically to the console; mainly useful for testing and debugging
Avro Sink
Logs are sent to an Avro interface, for example to another agent containing an Avro source
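This is how multi-hop topologies are built: an Avro sink in one agent points at the Avro source of the next. A sketch (hostname and port illustrative):

    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = next-agent.example.com
    a1.sinks.k1.port = 4141
    a1.sinks.k1.channel = c1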
Thrift Sink
Logs are sent to a Thrift interface, for example to another agent containing a Thrift source
IRC Sink
Logs are sent to an IRC destination
File Roll Sink
Logs are written to local files; the current file is rolled periodically, generating a new file at the configured interval
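A sketch (directory illustrative; rollInterval is in seconds):

    a1.sinks.k1.type = file_roll
    a1.sinks.k1.sink.directory = /var/log/flume
    a1.sinks.k1.sink.rollInterval = 30
    a1.sinks.k1.channel = c1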
Null Sink
All logs are discarded
HBase Sink
Logs are saved to HBase
AsyncHBase Sink
Writes logs to HBase asynchronously
MorphlineSolr Sink
Logs are saved to a Solr full-text index server
ElasticSearch Sink
Logs are saved to an Elasticsearch full-text index server
Custom Sink
E. Channel Selectors

http://flume.apache.org/flumeuserguide.html#flume-channel-selectors
Suppose we have the following topology:
In the diagram, a data source outside the agent is distributed to 3 different channels, and 3 different sinks each consume their corresponding channel: sink1 finally stores to HDFS, sink2 forwards to JMS, and sink3 passes the data stream on to another agent. So how do we implement this topology? The components introduced so far cannot solve this problem; the component described here does. Flume provides two kinds of selectors:
Replicating Channel Selector
The source sends each message to all connected channels
Multiplexing Channel Selector
The source decides which channel a message is sent to based on the value of a field (specified in the configuration) in the message's header.
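A multiplexing sketch in the style of the user guide, routing on an illustrative header field named state, with c3 as the default:

    a1.sources.r1.channels = c1 c2 c3
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = state
    a1.sources.r1.selector.mapping.CZ = c1
    a1.sources.r1.selector.mapping.US = c2
    a1.sources.r1.selector.default = c3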
F. Sink Processors

Multiple sinks can be logically grouped into one sink group, which can provide features such as failover and load balancing; users can also define their own processors to add other functionality
Default Sink Processor
The default; requires no configuration and operates in single-sink mode
Failover Sink Processor
Provides a fault-tolerance mechanism: a prioritized list of sinks is maintained, and messages are handled by the sink with the highest priority. When a high-priority sink fails, messages are processed by the highest-priority sink remaining in the list until the failed sink recovers.
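A failover sketch (sink names and priorities illustrative; the higher the number, the higher the priority):

    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    # maximum back-off (ms) for a failed sink
    a1.sinkgroups.g1.processor.maxpenalty = 10000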
Load balancing Sink Processor
A group of sinks processes messages together through load balancing, in one of two modes: round_robin or random.
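A load-balancing sketch using round_robin:

    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = load_balance
    a1.sinkgroups.g1.processor.selector = round_robin
    # temporarily back off from failed sinks
    a1.sinkgroups.g1.processor.backoff = true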
G. Event Serializers

Event serializers provide a mechanism for re-encapsulating a FlumeEvent before it is written out; they are used in sinks. Currently the File Roll sink and the HDFS sink support them.
Body Text Serializer
Outputs the body of the FlumeEvent; the header information is ignored
Avro Event Serializer
Serializes FlumeEvents into an Avro container file, which can be used with Avro RPC
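A sketch of setting a serializer on a File Roll sink (directory illustrative); text is the default, avro_event selects the Avro event serializer:

    a1.sinks.k1.type = file_roll
    a1.sinks.k1.sink.directory = /var/log/flume
    a1.sinks.k1.sink.serializer = text
    a1.sinks.k1.sink.serializer.appendNewline = true
    # or, for Avro container files:
    # a1.sinks.k1.sink.serializer = avro_event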
H. Interceptors

Interceptors are similar to the familiar Spring MVC interceptors: at the source level they can modify or discard received messages, so that only the messages we want enter the channel
Timestamp Interceptor
Inserts the current agent's system time into the event header
Host Interceptor
Inserts the hostname or IP of the current agent into the event header
Static Interceptor
Inserts a fixed key-value pair into the event header (a combined sketch for these interceptors follows the list below)
UUID Interceptor
Sets a universally unique identifier in the header of every intercepted event
Morphline Interceptor
Filters events through a morphline configuration file
Regex Filtering Interceptor
Includes or excludes events by matching the event body against a regular expression
Regex Extractor Interceptor
Extracts regex match groups from the event body into event headers
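A combined sketch chaining the timestamp, host, and static interceptors described above (the static key and value are illustrative):

    a1.sources.r1.interceptors = i1 i2 i3
    a1.sources.r1.interceptors.i1.type = timestamp
    a1.sources.r1.interceptors.i2.type = host
    # record the hostname instead of the IP
    a1.sources.r1.interceptors.i2.useIP = false
    a1.sources.r1.interceptors.i3.type = static
    a1.sources.r1.interceptors.i3.key = datacenter
    a1.sources.r1.interceptors.i3.value = NEW_YORK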
That is all for this post. If you need more detailed knowledge and configuration options, study the official User Guide; later I will post several examples of how to use Flume.
Flume Study 01: Flume Introduction