Flume log collection in the log system


Recently, I took over the maintenance of a log system that collects logs from application servers, analyzes and processes them in real time, and finally writes them to a target storage engine. For these three stages the industry already has a well-established stack: flume + kafka + hdfs/hbase. For real-time analysis and storage we chose the same components as the industry, but the agent had been written by our own team. Because of new requirements to support multiple data sources, and because of shortcomings in the original collection method, we evaluated the flume agent. The result: flume fits our actual needs and offers good scalability and stability, so we plan to replace our original implementation with the flume agent.

This article describes how we use the flume agent and the extensions we made to meet our needs. Note: throughout this article, flume refers to flume-ng, version 1.6.0.

Flume Introduction

Flume collects logs on each server through an Agent, which relies on three core components: source, channel, and sink. They are connected as follows:


The relationship is straightforward: the source collects logs from the various data sources; the channel buffers logs in between, decoupling log collection from log sending; the sink sends the logs to their destination. For more details, see the official website. Now let's look at how we use and extend flume.
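For reference, a minimal agent configuration that wires one source to one sink through one channel looks roughly like the sketch below; the agent and component names (a1, r1, c1, k1) and the paths are purely illustrative:

    # Minimal agent: one source -> one channel -> one sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # spooling directory source watching a log folder
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /data/logs/spool
    a1.sources.r1.channels = c1

    # in-memory channel decoupling collection from sending
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # logger sink just prints events; replace with a real destination
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1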

Source Extension

Flume ships with a source called Spooling Directory Source, which works by tracking the files in a folder. It watches the target log directory and starts collecting a new log file as soon as one appears. However, it does not support appending to log files: once it starts collecting a log file, that file must no longer be edited, and if the file changes while it is being read, an exception is thrown. In other words, when new entries keep being written to the current day's log file while it is being collected, this source does not fit the requirement as-is.

If you want collection to stay close to "quasi-real-time" and still use this source, the only workaround is to configure the application's logging framework (such as the commonly used log4j) so that the appender's rolling policy rolls by the minute, i.e. a new log file is generated every minute. This is workable, but it has drawbacks, above all the sheer number of log files it produces, which is hard to accept when the logs also need to be retained locally in addition to being collected by the log system.

What we want is this: log files roll once a day, producing a new file; logs of the current day are appended to the current day's file; and the agent collects newly appended entries at close to real-time speed. If the agent fails or the server goes down, no log data may be lost, and the agent must automatically continue collecting across date boundaries. The Spooling Directory Source already provides a template for this implementation; we only need a few changes, mainly the following:

(1) The original Spooling Directory Source does not support appending to a log file that is being collected.


If the file changes in any way while being read, an exception is thrown; our change removes this check.
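For reference, the check in question sits in ReliableSpoolingFileEventReader (around retireCurrentFile()); the sketch below paraphrases what it looks like in 1.6.0, so the exact field names and messages may differ slightly:

    // Sketch of the 1.6.0 check that rejects files modified during reading;
    // this is the block our extension removes so that appends are tolerated.
    if (fileToRoll.lastModified() != currentFile.get().getLastModified()) {
      String message = "File has been modified since being read: " + fileToRoll;
      throw new IllegalStateException(message);
    }
    if (fileToRoll.length() != currentFile.get().getLength()) {
      String message = "File has changed size since being read: " + fileToRoll;
      throw new IllegalStateException(message);
    }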

(2) Continuously monitor the current day's log file

In the original implementation, when no more events can be read, the current file is retired (deleted or renamed) and the reader automatically moves on to the next file:

    /* It's possible that the last read took us just up to a file boundary.
     * If so, try to roll to the next file, if there is one. */
    if (events.isEmpty()) {
      retireCurrentFile();
      currentFile = getNextFile();
      if (!currentFile.isPresent()) {
        return Collections.emptyList();
      }
      events = currentFile.get().getDeserializer().readEvents(numEvents);
    }

After our modification, the current file is retired and the reader rolls to the next file only when it is no longer the current day's log file; if it is still the current day's file, the reader keeps tracking it:

    // Roll only when the current file is no longer the target (current-day) file
    // and the next file already exists; then retire the old, historical file.
    if (!isTargetFile(currentFile) && isExistNextFile()) {
      logger.info("File:{} is no longer a TARGET file and will no longer be monitored.",
          currentFile.get().getFile().getName());
      retireCurrentFile();
      currentFile = getNextFile();
    }

For the source code of flume, see github.

In addition, the way we decide whether a file is the target file (the current day's log file) is to compare the date portion derived from the server's current date with the file name:

    private boolean isTargetFile(Optional<FileInfo> currentFile2) {
      String inputFilename = currentFile2.get().getFile().getName();
      SimpleDateFormat dateFormat = new SimpleDateFormat(targetFilename);
      String substringOfTargetFile = dateFormat.format(new Date());
      if (inputFilename.toLowerCase().contains(substringOfTargetFile.toLowerCase())) {
        return true;
      }
      return false;
    }

Accordingly, a date format also has to be added to the source's configuration, usually yyyy-MM-dd.
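Put together, the source section of the agent configuration might look like the sketch below; the class name of the modified source and the property name targetFilename are our own and shown here only as an illustration:

    # Extended spooling directory source (sketch; class and property names are illustrative)
    a1.sources.r1.type = com.example.flume.source.AppendableSpoolDirectorySource
    a1.sources.r1.spoolDir = /data/logs/app
    a1.sources.r1.channels = c1
    # date pattern used by isTargetFile() to recognize the current day's file
    a1.sources.r1.targetFilename = yyyy-MM-dd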

Sink Extension

The sink is the data output of the flume agent. Earlier flume versions (up to 1.5.2) ship with built-in support for several persistence systems (such as hdfs/hbase), but not for kafka, so sending log messages to kafka would have meant writing a KafkaSink ourselves. It turns out that a KafkaSink has been officially integrated into the latest stable release, 1.6.0. However, 1.6.0 was only released on May 20, 2015, and at the time the official Download page and User Guide had not yet been updated, so 1.6.0 has to be downloaded from the version list page; the KafkaSink documentation is included in the downloaded package.

The core configuration items are brokerList (at least two brokers are recommended for high availability) and topic; for the full list of properties, refer to the KafkaSink configuration table in the bundled User Guide.
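As a sketch, a minimal KafkaSink configuration under 1.6.0 could look like the following; the broker addresses and the topic name are placeholders:

    # KafkaSink sketch for flume 1.6.0 (addresses and topic are placeholders)
    a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.k1.brokerList = broker1:9092,broker2:9092
    a1.sinks.k1.topic = app-logs
    a1.sinks.k1.batchSize = 100
    a1.sinks.k1.requiredAcks = 1
    a1.sinks.k1.channel = c1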


Out of curiosity, I browsed the official KafkaSink source code on github and noticed that the Event header is not packaged into the message that gets sent:

    byte[] eventBody = event.getBody();
    Map<String, String> headers = event.getHeaders();

    if ((eventTopic = headers.get(TOPIC_HDR)) == null) {
      eventTopic = topic;
    }
    eventKey = headers.get(KEY_HDR);

    if (logger.isDebugEnabled()) {
      logger.debug("{Event} " + eventTopic + " : " + eventKey + " : "
          + new String(eventBody, "UTF-8"));
      logger.debug("event #{}", processedEvents);
    }

    // create a message and add to buffer
    KeyedMessage<String, byte[]> data = new KeyedMessage<String, byte[]>
        (eventTopic, eventKey, eventBody);
    messageList.add(data);

This does not satisfy our needs: we need the information in the message header to be part of the message, so that the header can be processed later in storm. For example:

(1) By default we add the Host of the server that produced the log to the header, which is used to distribute logs and to fill in the host for logs that do not record it themselves;

(2) By default we add a log-type identifier to the header, which is used to distinguish different logs and dispatch them to different resolvers for parsing.

Because log sources and formats are diverse, the information carried in the header is essential, yet the official KafkaSink of flume drops it. We therefore chose a simple extension: package the Event header and body together into one complete json object. The implementation:

    private byte[] generateCompleteMsg(Map<String, String> header, byte[] body) {
      LogMsg msg = new LogMsg();
      msg.setHeader(header);
      msg.setBody(new String(body, Charset.forName("UTF-8")));
      String tmp = gson.toJson(msg, LogMsg.class);
      logger.info(" complete message is : " + tmp);
      return tmp.getBytes(Charset.forName("UTF-8"));
    }

    // create a message and add to buffer
    KeyedMessage<String, byte[]> data = new KeyedMessage<String, byte[]>
        (eventTopic, eventKey, generateCompleteMsg(headers, eventBody));
    messageList.add(data);
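LogMsg here is just our own small wrapper that gson serializes; a minimal sketch of such a class, with field and accessor names assumed from the usage above, would be:

    import java.util.Map;

    // Minimal wrapper serialized by gson into the complete message;
    // names mirror the setHeader/setBody calls shown above.
    public class LogMsg {
        private Map<String, String> header;
        private String body;

        public Map<String, String> getHeader() { return header; }
        public void setHeader(Map<String, String> header) { this.header = header; }

        public String getBody() { return body; }
        public void setBody(String body) { this.body = body; }
    }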

Interceptor usage

As mentioned above, log sources and formats are diverse; we cannot force every tool and component to log in the format we would like, especially closed components or systems already running online. The source and sink are only responsible for collecting and sending logs and do not inspect the log content. Flume's Interceptor mechanism adds much more flexibility here: it lets us intercept events and attach specific headers to them using several interceptors built into flume. We use the following interceptors:

(1) host: Set the Host information of the current host in the header;

(2) static: Set a pre-configured key-value pair to the header. We use it to identify different log sources.

(3) regex: converts the body of the Event into a UTF-8 string and matches it against a regular expression; when the match succeeds, the event can either be let through or dropped.

The purpose of the first two interceptors was mentioned above; the third is used to check whether a log line carries the "DEBUG" tag and, if it does, drop the log (this is optional).
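Wiring the three interceptors onto a source could look roughly like the sketch below; the static key/value and the regular expression are placeholders for our own settings:

    # host, static and regex-filter interceptors (sketch; values are placeholders)
    a1.sources.r1.interceptors = i1 i2 i3

    # adds the host of the machine to the event header
    a1.sources.r1.interceptors.i1.type = host

    # adds a fixed key-value pair identifying the log source
    a1.sources.r1.interceptors.i2.type = static
    a1.sources.r1.interceptors.i2.key = logType
    a1.sources.r1.interceptors.i2.value = order-service

    # drops events whose body matches the regex (e.g. DEBUG lines)
    a1.sources.r1.interceptors.i3.type = regex_filter
    a1.sources.r1.interceptors.i3.regex = .*DEBUG.*
    a1.sources.r1.interceptors.i3.excludeEvents = true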

Use of Selector

We have no need for a Selector at the moment, but it is also commonly used: it selects which Channel an event goes to. If you have multiple channels and want to send to them conditionally, a Selector adds flexibility to log collection. For example, to send logs from different sources to different destinations, you can create multiple channels and route events to them according to matching rules; the Multiplexing Channel Selector is used for this.
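A sketch of such a multiplexing selector, routing on a hypothetical logType header like the one set by the static interceptor above:

    # Multiplexing channel selector (sketch; header name and values are illustrative)
    a1.sources.r1.channels = c1 c2
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = logType
    a1.sources.r1.selector.mapping.order-service = c1
    a1.sources.r1.selector.mapping.payment-service = c2
    a1.sources.r1.selector.default = c1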
