Collecting and processing log files with Flume

  1. Flume Introduction

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, originally provided by Cloudera. Flume supports customizing the various data senders in the log system to collect data, and it also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers.

  1. System Functions
    1. Log collection

Flume started as a log collection system developed at Cloudera and was later donated to Apache, where it has since graduated from the Incubator to a top-level project. Flume allows you to customize the various data senders in the log system to collect data.

  1. Data Processing

Flume provides simple processing of data and the ability to write data to various (customizable) data receivers. For collecting data, Flume supports sources such as console, RPC (Thrift RPC), text (file), tail (UNIX tail), syslog (the syslog log system, in both TCP and UDP modes), and exec (command execution).

  1. Work Mode

Flume adopts a multi-master mode. To keep configuration data consistent, Flume introduces ZooKeeper [1] to store the configuration data; ZooKeeper guarantees the consistency and high availability of that data and notifies the Flume master nodes when the configuration changes. The Flume masters synchronize data among themselves using a gossip protocol.

  1. Process Structure

The structure of Flume is mainly divided into three parts: source, channel, and sink. The source is where logs are collected; the channel transmits and temporarily stores them; the sink is the destination where the collected logs are saved. In an actual log collection deployment, choose the source, channel, and sink types according to the kinds of logs to be collected and the storage requirements, so that the logs can be collected and saved as intended.

  1. Flume log collection solution
    1. Requirement Analysis
      1. Log category

Operating System: Linux

Log update type: new log entries are appended to the end of the original log file.

  1. Collection time requirement

Collection Cycle: Short Cycle (within one day)

  1. Collection scheme
    1. Collection Architecture

The process of using Flume to collect log files is straightforward: you only need to select the appropriate source, channel, and sink and configure them. If you have special requirements, you can do your own secondary development to meet them.

The specific process is as follows: configure an agent as needed, selecting the appropriate source, channel, and sink, and then start the agent to collect logs, as in the sketch below.
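
To make this concrete, the following is a minimal sketch of a complete agent configuration, assuming the agent is named a1 and the file to tail is /var/log/secure (both arbitrary choices for illustration). It ties together an exec source, a memory channel, and a logger sink:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Assuming the configuration above is saved as example.conf, the agent can then be started with the standard flume-ng command:

flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console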

  1. Source

Flume provides a variety of sources for users to choose from, covering most log collection requirements. Common source types include avro, exec, netcat, spooling-directory, and syslog. For the specific usage scope and configuration of each, see the common sources in the appendix.
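
Of these, the netcat source is the only one not shown in the appendix; as a minimal sketch (the bind address and port 44444 are arbitrary choices), it can be configured as follows:

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1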

  1. Channel

The channel in Flume is less prominent than the source and sink, but it is an integral part that cannot be ignored. The most commonly used channel is the memory channel; other types include the JDBC channel, the file channel, and custom channels. For details, see the common channels in the appendix.
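
Since the appendix only shows the memory channel, here is a minimal sketch of a file channel as well (the checkpoint and data directories are arbitrary example paths):

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data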

  1. Sink

Flume offers many sink types, including avro, logger, HDFS, HBase, and file_roll. There are also other kinds of sinks, such as thrift, IRC, and custom sinks. For the specific scope and usage, see the common sinks in the appendix.

  1. Flume processing logs

Flume can not only collect logs but also perform simple processing on them. At the source, interceptors can be used to filter the log body and extract important content from it into the event header; before the channel, a channel selector can use those headers to route different types of logs into different channels; at the sink, regex serializers can further filter and classify the body content.

  1. Flume source interceptors

Through interceptors, Flume can extract important information and add it to the event header. Common interceptors include the timestamp, host, and UUID interceptors. You can also write a regex filter interceptor to keep or drop logs of a specific format and so meet special requirements.
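
As an illustration, the following sketch attaches a timestamp interceptor, a host interceptor, and a regex filter interceptor to a source (the source name r1 follows the earlier examples, and the regular expression is only a placeholder for whatever format you actually need to match):

a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
a1.sources.r1.interceptors.i3.type = regex_filter
a1.sources.r1.interceptors.i3.regex = ^ERROR.*
a1.sources.r1.interceptors.i3.excludeEvents = false

With excludeEvents = false, only events whose body matches the regular expression are kept; set it to true to drop the matching events instead.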

  1. Flume channel selectors

Flume can deliver different logs to different channels as needed. There are two selector types: replicating and multiplexing. Replicating means the logs are not grouped: every log is sent to every channel, and all channels are treated the same. Multiplexing means the logs are classified according to a specified header: following the mapping rules, different logs are put into different channels, so the logs are routed by category.
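
As a sketch, the following multiplexing configuration routes events to channel c1 or c2 according to the value of a header (the header name logtype and the values app and sys are arbitrary examples; in practice the header would be set upstream, e.g. by an interceptor):

a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = logtype
a1.sources.r1.selector.mapping.app = c1
a1.sources.r1.selector.mapping.sys = c2
a1.sources.r1.selector.default = c1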

  1. Flume sink processors

Flume can also process logs at the sink. Common sink processors include the default, failover, load-balancing, and custom processors. As with interceptors, you can also use a regex serializer at the sink to filter the log content according to special requirements. Unlike an interceptor, however, the content extracted by the regex serializer at the sink is not added to the header, so the event header does not become bloated.
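
For example, a failover sink processor can be configured on a sink group as in the following sketch (k1 and k2 are assumed to be two sinks already defined on the agent); the sink with the higher priority is tried first, and the other one takes over when it fails:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000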


  1. Appendix
    1. Common sources
      1. Avro Source

The Avro source listens on a specified port and collects the logs sent to it. To use the Avro source, you need to specify the IP address to bind to and the port number to listen on. The following is a specific example:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

  1. Exec Source

The exec source reads logs through a specified command. When using exec, you need to specify the shell command whose output should be read as log data. The following is a specific example:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

  1. Spooling-directory Source

The spooling-directory source reads the logs in a specified folder: you point it at a directory, and it reads all files placed there. Note that files must not be modified or renamed while they are being read. The following is a specific example:

agent-1.channels = ch-1
agent-1.sources = src-1
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true

  1. Syslog Source

The syslog source reads system logs via the syslog protocol and comes in TCP and UDP variants. You need to specify the host and port when using it. The following is a UDP example:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

  1. Common channels

There are not many channel types in Flume, and the memory channel is the most common one. The following is an example:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

  1. Common sinks
    1. Logger sink

As the name suggests, the logger sink writes the collected logs to Flume's own log; it is a simple but practical sink.
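
A minimal configuration sketch, assuming a channel c1 has already been defined:

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1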

  1. Avro sink

The Avro sink sends the received logs to a specified host and port, so that the next-hop agent in a cascade can receive and collect them. The destination hostname (or IP address) and port must be specified. The example is as follows:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

  1. File roll sink

The file_roll sink writes the logs collected within a given period to a specified file. Specifically, you configure a directory and a roll interval and then start the agent; a file is created in the directory, and all logs collected during the interval are written to it until a new file is created at the start of the next interval, and so on. The following is a specific example:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume

  1. HDFS sink

Similar to file_roll, the HDFS sink writes collected logs into newly created files for storage. The difference is that file_roll stores its files on the local file system, whereas the HDFS sink stores them on the distributed file system HDFS. In addition, the HDFS sink can roll over to a new file based on elapsed time, file size, or the number of events collected. The specific example is as follows:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

  1. HBase sink

The HBase sink stores logs in the HBase database. When using it, you must specify the table name and column family used to store the logs; the agent then inserts the collected logs into the database one by one. Example:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
