Log Collection with Flume and Log Parsing with Morphline

Overview
I recently spent part of my time integrating the message bus with our logging pipeline. Here I share some of the problems encountered in log collection and in log parsing and processing.
Log collection: Logstash vs. Flume
First, let me explain our choice of log collector. Since we chose Elasticsearch as the log storage and search engine, and since log systems built on the ELK (Elasticsearch, Logstash, Kibana) stack are so popular, including Logstash among the candidates was only logical. Among the mainstream log collectors Logstash is a rising star: after being acquired by Elastic it has matured further, and its community is quite active.
Logstash's design: input, filter, output. Flume's design: source, channel, sink, and of course Flume also has interceptors. I won't dwell on the specifics of either design; both embody the same ideas of splitting, decoupling, and pipelining. Both also support distributed scaling: Logstash can serve as both shipper and indexer, and multiple Flume agents can be composed into a distributed event flow.
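To make Flume's split-and-decouple design concrete, here is a minimal sketch of an agent configuration wiring one source through a channel to one sink. The agent name, file path, and topic are placeholders, and the Kafka sink property names follow Flume 1.7+ (older versions use different keys):

# one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: tail an application log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# channel: in-memory buffer decoupling source from sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# sink: hand events to Kafka, i.e. the "message bus as cache queue" idea from this article
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = raw-logs
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.channel = c1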
I came into contact with Flume earlier than with Logstash. When I recently surveyed Logstash, its powerful filters left a deep impression, especially grok. The Flume camp, for its part, has always emphasized how strong its source, sink, and channel support for various open source components is.
Logstash is excellent, but its implementation in JRuby (a JVM language with Ruby-like syntax) makes it hard for us to customize, which is the main reason I gave up on Logstash. For ecosystem reasons I really needed the extensibility offered by the Java technology stack (the main goal being to use the message bus as a cache queue for log collection), and that is exactly where Flume's strength lies. But Flume's documentation rarely mentions log parsing; even though interceptors support regular expressions, they only offer very limited search, replace, and similar operations. After some research I found that Flume actually provides such an interceptor: morphline. It can handle log parsing.
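For reference, attaching the morphline interceptor to a source looks roughly like the following sketch. The interceptor ships in the flume-ng-morphline-solr-sink module; the agent, source, and file path here are placeholders:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
# the morphline config file, and the id of the morphline to run within that file
a1.sources.r1.interceptors.i1.morphlineFile = /etc/flume/conf/morphline.conf
a1.sources.r1.interceptors.i1.morphlineId = morphline1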
Introduction to log parsing: Morphline
Morphline is an ETL framework open-sourced by Cloudera, Flume's parent company. It is used to build and change Hadoop-based streaming ETL (extract, transform, load) handlers. (It is worth mentioning that Flume was donated by Cloudera to Apache and later evolved into Flume NG.) Morphline lets you build ETL jobs without coding and without requiring much MapReduce skill.
Morphline is driven by a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any data source, processes it, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps.
Morphline is a library that can be embedded in any Java program. It is an in-memory container for transformation commands, which are loaded as plugins to perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs. Morphline is also extensible and can integrate existing functionality and third-party systems.
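A minimal sketch of that embedding follows, assuming the kite-morphlines dependency shown later in this article; the config path and morphline id are placeholders:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Notifications;

public class MorphlineEmbedDemo {
    public static void main(String[] args) {
        final List<Record> results = new ArrayList<Record>();
        // the final child receives whatever records survive the command chain
        Command collector = new Command() {
            public void notify(Record notification) {}
            public boolean process(Record record) { results.add(record); return true; }
            public Command getParent() { return null; }
        };

        MorphlineContext context = new MorphlineContext.Builder().build();
        // compile the chain defined in the config file (placeholder path and id)
        Command morphline = new Compiler().compile(
                new File("/etc/flume/conf/morphline.conf"),
                "morphline1", context, collector);

        Record record = new Record();
        // raw log bytes go into the standard attachment field
        record.put(Fields.ATTACHMENT_BODY, "some raw log line".getBytes());

        Notifications.notifyStartSession(morphline);
        boolean success = morphline.process(record);
        System.out.println("processed=" + success + ", output records=" + results.size());
        Notifications.notifyShutdown(morphline);
    }
}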
This article is not an advertisement for Morphline, so for a fuller introduction please see the Cloudera CDK official documentation.
Here is an image showing the approximate processing model of Morphline:
There is also a diagram showing the architectural model of Morphline in the Big Data ecosystem:
Later, development of Morphline was taken over by Kite, a set of APIs providing an abstract data model layer on top of Hadoop. The kitesdk documentation includes a description of Morphline.
A Powerful Regex Extractor: Grok
In fact, I found Morphline while looking for Grok, or rather while looking for an entry point that would let me use Grok. Grok uses regular expressions to extract structured fields from unstructured log data. Logstash already provides many battle-tested grok rules, which is one of Logstash's advantages; if those rules can be used directly in Flume, then Flume directly inherits that part of Logstash's power. (In principle any regular text can be extracted with regexes, but there is no need to spend great effort validating patterns when mature ones already exist.) See the Grok documentation for details, which I won't repeat here.
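As a small illustration (the sample line and pattern below are made up for demonstration, not taken from our logs), a grok expression names each captured group so it becomes a structured field:

# raw line:
55.3.244.1 GET /index.html 15824 0.043

# grok expression:
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

# extracted fields:
client=55.3.244.1, method=GET, request=/index.html, bytes=15824, duration=0.043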
Using Morphline on the Server Side
Flume runs morphline inside the agent. Doing ETL on the client (agent) side has the advantage of exploiting the distributed compute power of the clients and sparing the server the hassle of parsing. However, agents are numerous, scattered across all the production servers, and the log formats they see are diverse. In other words, doing too much in the agent makes us less flexible when coping with change. Therefore, we only collect on the client side and do not parse there. Logs are parsed with Morphline on the server instead: effectively a parsing service that pulls logs from the log collection queue, parses and transforms them with Morphline, and then sends the more structured logs on to the index queue, where the indexing service finally stores them in Elasticsearch. The whole process looks roughly like this:
This asynchronous, queue-based pipeline is conceptually the same as the synchronous pipeline of a stream processor such as Storm; it uses inexpensive commodity machines to spread out the computation.
Sample Program
To use Morphline in your program, first add the Maven dependency for Morphline:
<dependency>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-morphlines-all</artifactId>
    <version>${kite.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
        </exclusion>
    </exclusions>
    <type>pom</type>
    <optional>true</optional>
</dependency>
The version used here is 1.0.0. Note that some dependencies need to be downloaded from the Twitter repository, which may be inaccessible in some regions, so you may need a proxy.
Sample program:
private void process(Message message) {
    msgBuffer.add(message);
    if (msgBuffer.size() < MESSAGE_BUFFER_SIZE) {
        return;
    }
    try {
        Notifications.notifyBeginTransaction(morphline);
        for (Message msg : msgBuffer) {
            Event logEvent = gson.fromJson(new String(msg.getContent()), Event.class);
            String originalLog = new String(logEvent.getBody());
            // move the raw log line into a header so grok can match it by field name
            // (MORPHLINE_GROK_FIELD_NAME corresponds to the expressions key in the config below)
            logEvent.getHeaders().put(MORPHLINE_GROK_FIELD_NAME, originalLog);
            logEvent.setBody(null);

            // copy all headers into a morphline record
            Record record = new Record();
            for (Map.Entry<String, String> entry : logEvent.getHeaders().entrySet()) {
                record.put(entry.getKey(), entry.getValue());
            }

            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            if (!success) {
                logger.error("failed to process record! from: " + morphlineFileAndId);
                logger.error("record body: " + originalLog);
            }
        }
        // do some ETL jobs
        List<Record> records = this.extract();
        List<Event> events = this.transfer(records);
        this.load(events);
        Notifications.notifyCommitTransaction(morphline);
    } catch (JsonSyntaxException e) {
        logger.error(e);
        Notifications.notifyRollbackTransaction(morphline);
    } finally {
        // clear buffer and extractor state for the next batch;
        // Notifications.notifyShutdown(morphline) belongs in the service teardown path, not here
        this.extracter.getRecords().clear();
        this.msgBuffer.clear();
    }
}
This is only part of the code, showing the rough usage of morphline. The main logic lives in the configuration file:
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        grok {
          dictionaryString : """..."""       # pattern definitions (value omitted here)
          expressions : {
            original : """..."""             # pattern for the "original" field (value omitted here)
          }
          extract : true
          numRequiredMatches : atLeastOnce   # default is atLeastOnce
          findSubstrings : false
          addEmptyStrings : false
        }
      }
      {
        logInfo {
          format : "output record: {}"
          args : ["@{}"]
        }
      }
    ]
  }
]
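To make the structure concrete, here is a hypothetical filled-in version. The dictionary path and the pattern are invented for illustration; in practice the dictionary files would be the Logstash-compatible grok pattern files, and the expressions key matches the field that the Java code above filled ("original"):

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        grok {
          # directory containing Logstash-compatible grok pattern files (placeholder path)
          dictionaryFiles : [/etc/flume/conf/grok-dictionaries]
          expressions : {
            # parse the raw line stored under the "original" field
            original : """%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}"""
          }
          extract : true
        }
      }
      { logInfo { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]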
As mentioned above, our main purpose is to use Grok to parse logs, and Logstash already provides many grok patterns out of the box. For custom log formats, however, you usually need to write the patterns yourself; a grok online debugging tool helps with that.
Review
In fact, the heavy industrial users of Flume are large Internet companies, such as Meituan. They usually build a flume + Kafka + Storm + Hadoop ecosystem, using Storm for real-time stream parsing and MapReduce for offline analysis. Such a highly customized scenario rarely needs the Flume agent to parse on the client side, which is why Flume's morphline is seldom mentioned.
But Morphline remains a rare gem for text ETL. Whether you run Morphline's ETL at collection time or on the server side, flume + morphline together offer flexibility that does not lose out to Logstash.
For more articles, please visit: http://vinoyang.com