- Flume Introduction
Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, originally provided by Cloudera. Flume supports custom data senders in the log system for collecting data, and it also provides simple processing of data and the ability to write data to various (customizable) data receivers.
- System Functions
- Log collection
Flume was the first log collection system provided by Cloudera and is currently an incubator project under Apache. Flume allows you to customize various data senders in the log system to collect data.
- Data Processing
Flume provides simple processing of data and the ability to write data to various (customizable) data receivers. Flume includes data sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog log system, supporting TCP and UDP modes), and exec (command execution) for collecting data.
- Work Mode
Flume adopts a multi-master mode. To ensure the consistency of configuration data, Flume [1] introduces ZooKeeper to store that data. ZooKeeper guarantees the consistency and high availability of the configuration data, and when the configuration changes, it can notify the Flume master nodes. The Flume masters synchronize data among themselves using the gossip protocol.
- Process Structure
The structure of Flume is mainly divided into three parts: source, channel, and sink. The source is where logs are collected; the channel transmits and temporarily stores them; the sink is the destination where the collected logs are saved. In an actual log collection deployment, you select the source, channel, and sink types based on the kinds of logs to be collected and the storage requirements, and combine them to collect and save the logs.
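The three-part structure can be illustrated with a minimal single-agent configuration, in the spirit of the netcat example from the Flume user guide (the agent name a1 and port 44444 are illustrative):

```properties
# a1: single agent with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: listen for lines of text on a TCP port (netcat source)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# sink: write the collected events to Flume's own log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Events flow source → channel → sink; a source can feed several channels, but each sink drains exactly one channel.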
- Flume log collection solution
- Requirement Analysis
- Log category
Operating System: Linux
Log update type: New logs are generated and appended at the end of the original log.
- Collection time requirement
Collection Cycle: Short Cycle (within one day)
- Collection scheme
- Collection Architecture
The process of using Flume to collect log files is concise: you only need to select the appropriate source, channel, and sink and configure them. If you have special requirements, you can do your own secondary development to meet them.
The specific process is as follows: configure an agent as needed, selecting the appropriate source and sink, and then start the agent to collect logs.
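Once the agent is configured, it can be started with the standard flume-ng launcher; the configuration file name example.conf below is an assumed placeholder, and the agent name a1 must match the prefix used in that file:

```shell
# start agent a1 using the configuration in example.conf;
# the -Dflume.root.logger option echoes events to the console
flume-ng agent --conf conf --conf-file example.conf --name a1 \
    -Dflume.root.logger=INFO,console
```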
- Source
Flume provides a variety of sources to meet most log collection requirements. Common source types include avro, exec, netcat, spooling-directory, and syslog. For the specific usage scope and configuration methods, see the source documentation.
- Channel
The channel in Flume is not as prominent as the source and sink, but it is an integral part that cannot be ignored. The most commonly used channel is the memory channel; other types include the JDBC channel, the file channel, and custom channels. For details, see the channel documentation.
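For durability across agent restarts, the file channel is a common alternative to the memory channel. A minimal sketch (the checkpoint and data directories below are illustrative paths):

```properties
# file channel: persists events to disk so they survive a crash
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
```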
- Sink
Flume has many sink types, including avro, logger, HDFS, HBase, and file-roll. There are also other types of sinks, such as thrift, IRC, and custom sinks. For the specific scope and usage, see the sink documentation.
- Flume processing logs
Flume can not only collect logs but also perform simple processing on them. At the source, interceptors can be used to filter logs and extract important content from the log body; at the channel, headers can be used to route different types of logs into different channels; at the sink, regular-expression serialization can further filter and classify the body content.
- Flume source interceptors
Flume can extract important information and add it to the header through interceptors. Common interceptors include the timestamp, host, and UUID interceptors. You can also write a regex filtering interceptor according to your needs to filter logs in specific formats and meet special requirements.
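As a sketch, the timestamp and host interceptors can be attached to a source like this (the source name r1 is illustrative):

```properties
# attach two interceptors to source r1:
# i1 stamps each event header with the collection time,
# i2 adds the host name/IP of the collecting agent
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
```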
- Flume channel selectors
Flume can transmit different logs to different channels as needed. There are two specific approaches: replicating and multiplexing. Replicating means that logs are not grouped; every log is transmitted to every channel, and all channels are treated identically. Multiplexing means that logs are classified based on a specified header and, according to the classification rules, different logs are put into different channels, so that logs are sorted deliberately.
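A multiplexing channel selector can be sketched as follows, following the pattern in the Flume user guide (the header name state and the mapping values are illustrative):

```properties
# route events by the value of the "state" header:
# CZ goes to c1, US goes to c2, everything else to c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c3
```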
- Flume sink Processors
Flume can also process logs at the sink. Common sink processors include the custom, failover, load-balancing, and default processors. As with interceptors, regular-expression serialization can be used to filter the log content according to special requirements. Unlike with interceptors, however, the content filtered by regular-expression serialization at the sink is not added to the header, so the log header does not become bloated.
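A failover sink processor groups several sinks and fails over to the highest-priority live one. A minimal sketch (the sink names k1 and k2 are illustrative):

```properties
# sink group g1 fails over between k1 and k2;
# the sink with the higher priority value is tried first
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
```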
- Appendix
- Common source
- Avro Source
The Avro source can listen on a specified port and collect the logs sent to it. To use the Avro source, you need to specify the IP address and port number of the host to listen on. The following is a specific example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
- Exec Source
The exec source can read logs through a specified command. When using exec, you need to specify the shell command that reads the logs. The following is a specific example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
- Spooling-directory Source
The spooling-directory source reads the logs in a folder: you specify a folder, and it reads all the files in it. Note that the files in the folder must not be modified while they are being read, and their file names must not be changed. The following is a specific example:
agent-1.channels = ch-1
agent-1.sources = src-1

agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true
- Syslog Source
The syslog source reads system logs through the syslog protocol and comes in TCP and UDP variants. You need to specify the IP address and port when using it. The following is a UDP example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
- Common channels
There are not many channel types in Flume, and the memory channel is the most common one. The following is an example:
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
- Common sink
- Logger sink
As the name suggests, the logger sink writes the collected logs into Flume's own log; it is a simple but practical sink.
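Configuring it takes only the sink type (the channel and sink names below are illustrative):

```properties
# send all events from channel c1 to Flume's own log output
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```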
- Avro sink
The Avro sink can send received logs to a specified port, so that the next hop in a cascade of agents can collect them. The destination IP address and port must be specified. The example is as follows:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
- File roll sink
The file-roll sink writes the logs collected within a certain period to a specified file. Specifically, you specify a folder and a rolling period and then start the agent; a file is created in the folder, and all logs collected during the period are written into it, until a new file is created in the next period and writing continues there, and so on. The following is a specific example:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
- HDFS sink
Similar to file-roll, the HDFS sink writes the collected logs into newly created files for storage. The difference is that file-roll stores files on the local file system, while the HDFS sink stores them on the distributed file system HDFS. In addition, the HDFS sink can roll to a new file based on a time interval, a file size, or a number of collected events. The specific example is as follows:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
- Hbase sink
HBase is a database in which logs can be stored. When using the HBase sink, you must specify the table name and column family used to store the logs; the agent then inserts the collected logs into the database one by one. Example:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1