Log collection with Flume and log parsing with Morphline

Tags: logstash, hadoop ecosystem

Overview

Recently I spent part of my time connecting the message bus to our logging pipeline. Here I share some of the problems encountered in log collection and in log parsing and processing.

Log collection: Logstash vs. Flume

First, let's talk about how we selected a log collector. We had already chosen Elasticsearch as the storage and search engine for logs, and since the ELK stack (Elasticsearch, Logstash, Kibana) is such a popular direction for log systems, it was only logical to include Logstash among the candidates. Among the mainstream log collectors, Logstash is a rising star: after being acquired by Elastic it has matured considerably, and its community is active.

Logstash's design is input, filter, output; Flume's design is source, channel, sink, and of course Flume also has interceptors. Without belaboring the specifics, both follow the same ideas: splitting, decoupling, and pipelines. Both also support distributed extension: Logstash can act as either a shipper or an indexer, and Flume can chain multiple agents into a distributed event flow.

I came into contact with Flume earlier than with Logstash. When I surveyed Logstash recently, its powerful filters, especially grok, left a strong impression. The Flume camp, by contrast, has always emphasized how well its sources, sinks, and channels integrate with a wide range of open source components.

Logstash is a fine tool, but it is implemented in JRuby (a JVM language with Ruby-like syntax), which makes deep customization inconvenient; this is the main reason I gave up on Logstash. For ecosystem reasons I really needed the extensibility offered by the Java technology stack (the main goal here was to use a message bus as a cache queue for log collection), and that is exactly where Flume's strength lies. However, Flume says little about log parsing: even though its interceptors support regular expressions, they only offer very limited search-and-replace operations. After some research I found that Flume in fact provides such an interceptor, backed by Morphline, which can handle log parsing.
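For reference, a minimal sketch of wiring that interceptor into a Flume agent configuration might look like the following; the agent/source names and file paths here are illustrative assumptions, and the interceptor class comes, as far as I know, from the flume-ng-morphline-solr-sink module:

    # hypothetical agent "a1" with one source "r1"; names and paths are illustrative
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
    a1.sources.r1.interceptors.i1.morphlineFile = /etc/flume-ng/conf/morphline.conf
    a1.sources.r1.interceptors.i1.morphlineId = morphline1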

Log parsing: an introduction to Morphline

Morphline is an ETL framework open-sourced by Cloudera, Flume's parent company. It is used to build and modify Hadoop-based stream-processing programs that perform ETL (extract, transform, load). (It is worth mentioning that Flume was donated by Cloudera to Apache and later became Flume NG.) Morphline lets you build ETL jobs without writing code and without requiring much MapReduce skill.

Morphline is driven by a rich configuration file that makes it easy to define a transformation chain: consume any kind of data from any data source, process it, and load the results into a Hadoop component. It replaces Java programming with simple configuration steps.

Morphline is also a class library that can be embedded in any Java program. A morphline is an in-memory container of transformation commands. Commands are loaded into a morphline as plugins to perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs. Morphline is extensible and can integrate existing functionality and third-party systems.
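To make "embeddable in any Java program" concrete, here is a minimal sketch of driving a morphline from plain Java with the kite-morphlines API; the file name "morphline.conf", the id "morphline1", and the collecting command at the end of the chain are illustrative assumptions, not part of any real setup:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.kitesdk.morphline.api.Command;
    import org.kitesdk.morphline.api.MorphlineContext;
    import org.kitesdk.morphline.api.Record;
    import org.kitesdk.morphline.base.Compiler;
    import org.kitesdk.morphline.base.Fields;
    import org.kitesdk.morphline.base.Notifications;

    public class MorphlineEmbeddingSketch {

        public static void main(String[] args) {
            // collect whatever records fall out of the end of the command chain
            final List<Record> results = new ArrayList<>();
            Command collector = new Command() {
                @Override public Command getParent() { return null; }
                @Override public void notify(Record notification) { }
                @Override public boolean process(Record record) {
                    results.add(record);
                    return true;
                }
            };

            // compile the config file into an executable command chain
            MorphlineContext context = new MorphlineContext.Builder().build();
            Command morphline = new Compiler().compile(
                    new File("morphline.conf"), "morphline1", context, collector);

            // feed one record whose raw bytes sit in the standard attachment field
            Record record = new Record();
            record.put(Fields.ATTACHMENT_BODY,
                    "hello world".getBytes(StandardCharsets.UTF_8));
            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            System.out.println("success=" + success + ", records=" + results);
        }
    }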

This article is not meant as an advertisement for Morphline, so for more details please see the official Cloudera CDK documentation.

Here is an image showing the approximate processing model of Morphline:

There is also a diagram showing the architectural model of Morphline in the Big Data ecosystem:

Later, Morphline's development came to be led by Kite, a set of APIs forming an abstract data-model layer on top of Hadoop. See the Kite SDK's documentation on Morphline for its description.

A powerful regex extractor: grok

In fact, I found Morphline while looking for grok, or rather while looking for a way to plug grok in. Grok uses regular expressions to extract structured fields from unstructured log data. Logstash already ships a large set of battle-tested grok patterns, which is one of its advantages; if those patterns can be used directly in Flume, then Flume directly inherits that capability from Logstash. (Strictly speaking, any regular text can be extracted with regular expressions, but when mature, validated patterns already exist there is no need to spend great effort verifying your own.) See the grok documentation for details.
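For a taste of what grok buys you, here is the classic illustrative expression from the grok documentation (not taken from the setup described in this post):

    # an expression composed of stock patterns:
    %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

    # applied to a line such as:
    #   55.3.244.1 GET /index.html 15824 0.043
    # it yields the structured fields client, method, request, bytes, and duration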

Using Morphline on the server side

Flume uses Morphline inside the agent. The advantage of doing ETL on the client (log-producing) side is that it exploits the distributed compute power of the clients and spares the server the hassle of parsing. But agents are numerous, scattered across the production servers, and log formats vary widely; doing too much in the agent makes us less flexible when coping with change. Therefore we only collect on the client side and do not parse there. Instead, logs are parsed with Morphline on the server side. This amounts to running a parsing service: it pulls logs from the log-collection queue, parses and transforms them with Morphline, and then sends the parsed, more structured logs on to the index queue, where an indexing service finally stores them in Elasticsearch. The whole process is roughly as follows:

This asynchronous, queue-based pipeline is essentially equivalent to the synchronous pipeline of a stream processor such as Storm: it splits the computational load across inexpensive commodity machines.

Sample Program

To use Morphline in your program, first add a Maven dependency on Morphline:

    <dependency>
        <groupId>org.kitesdk</groupId>
        <artifactId>kite-morphlines-all</artifactId>
        <version>${kite.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
            </exclusion>
        </exclusions>
        <type>pom</type>
        <optional>true</optional>
    </dependency>

The version used here is 1.0.0. Note that some transitive dependencies need to be downloaded from Twitter's repository, which may not be reachable from everywhere, so you may need a proxy.

Sample program:

    private void process(Message message) {
        msgBuffer.add(message);
        if (msgBuffer.size() < MESSAGE_BUFFER_SIZE) {
            return;
        }
        try {
            Notifications.notifyBeginTransaction(morphline);
            for (Message msg : msgBuffer) {
                Event logEvent = gson.fromJson(new String(msg.getContent()), Event.class);
                // move the raw log line into the header field that grok will parse
                String originalLog = new String(logEvent.getBody());
                logEvent.getHeaders().put(MORPHLINE_GROK_FIELD_NAME, originalLog);
                logEvent.setBody(null);

                Record record = new Record();
                for (Map.Entry<String, String> entry : logEvent.getHeaders().entrySet()) {
                    record.put(entry.getKey(), entry.getValue());
                }
                // the body was cleared above, so this branch only fires if it was re-set
                byte[] bytes = logEvent.getBody();
                if (bytes != null && bytes.length > 0) {
                    logger.info("original: " + new String(bytes));
                    record.put(Fields.ATTACHMENT_BODY, bytes);
                }
                Notifications.notifyStartSession(morphline);
                boolean success = morphline.process(record);
                if (!success) {
                    logger.error("failed to process record! from: " + morphlineFileAndId);
                    logger.error("record body: " + originalLog);
                }
            }
            // do some ETL jobs
            List<Record> records = this.extract();
            List<Event> events = this.transfer(records);
            this.load(events);
        } catch (JsonSyntaxException e) {
            logger.error(e);
            Notifications.notifyRollbackTransaction(morphline);
        } finally {
            // clear buffer and extractor
            this.extracter.getRecords().clear();
            this.msgBuffer.clear();
            Notifications.notifyCommitTransaction(morphline);
            Notifications.notifyShutdown(morphline);
        }
    }

This is only a fragment showing the approximate usage of Morphline; the primary logic lives in the configuration file:

    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**"]
        commands : [
          {
            grok {
              dictionaryString : ""              # dictionary contents elided in the original
              expressions : {
                original : ""                    # grok pattern elided in the original
              }
              extract : true
              numRequiredMatches : atLeastOnce   # default is atLeastOnce
              findSubstrings : false
              addEmptyStrings : false
            }
          }
          {
            logInfo {
              format : "output record: {}"
              args : ["@{}"]
            }
          }
        ]
      }
    ]

As mentioned above, our main purpose is to use grok to parse the logs. Logstash already provides many grok patterns that work out of the box, but for custom log formats you usually need to write the patterns yourself; a grok online debugging tool is handy for that.
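When the built-in patterns are not enough, you can name your own patterns in a grok dictionary and compose them; the two entries below are hypothetical examples of the dictionary's "NAME pattern" format:

    # hypothetical custom dictionary entries (format: NAME pattern, one per line)
    MYAPP_TIMESTAMP %{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{TIME}
    MYAPP_LINE %{MYAPP_TIMESTAMP:ts} \[%{LOGLEVEL:level}\] %{GREEDYDATA:msg}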

Review

In fact, the heavy users of Flume in industry are large internet companies, such as Meituan. They usually build a flume + kafka + storm + hadoop ecosystem, using Storm streams for real-time parsing and MapReduce for offline analysis. Such highly customized scenarios rarely need the Flume agent to parse on the client side, which is why Flume's Morphline support is so seldom mentioned.

But Morphline remains a rare gem for text ETL. Whether you run Morphline's ETL at collection time or on the server side, flume + morphline together offer flexibility that does not lose out to Logstash.

More articles are available at: http://vinoyang.com

