Talk about Flume and Logstash.

Source: Internet
Author: User
Tags: kibana, logstash

Reprint: http://blog.csdn.net/jek123456/article/details/65658790

While working with Logstash, I started to wonder why Flume could not simply be used in its place, so I consulted a lot of material and summarize it here. Most of it is the experience of predecessors, with some of my own thinking mixed in. I hope it helps everyone.

This article is best suited to readers with some big data background, but even without that foundation you can keep reading (it is a bit like opening the "Sunflower Treasure Book": the first page says that to master this art you must first castrate yourself, and the second page says that if you don't, you can still practice it anyway).

Big data collection is a fundamental and very important part of big data technology. Data does not walk into your data platform on its own: you have to collect it from existing devices (servers, routers, switches, firewalls, databases, and so on) and then transfer it to your platform, where the more complex and difficult processing happens afterwards.

At present, Flume and Logstash are the more mainstream data collection tools (mainly used for log capture), but many people still do not understand the difference between the two. Choosing the right collection tool for a specific scenario can greatly improve efficiency and reliability and reduce resource costs.

Melon-seed-cracking heckler: hey, that's all useless nonsense; get on with the story!

Well, now let's get down to business. First, here is a general data acquisition model, mainly for readers who are not that familiar with computers or communications.


Data acquisition in pervasive environment

In this model, data collection and storage are the necessary links; the others are optional. Isn't that simple? Programming is really a modular affair, not so difficult. But this is only a rough general model; different open source communities and commercial vendors each have their own considerations and purposes. The Flume and Logstash we discuss in this article belong, in principle, to the data collection category, although both of them technically bring along some extra functions such as buffering and filtering.

OK, we'll look at Logstash first and then at Flume; when you're done, you'll know why I arranged them in this order.

Logstash is one of the ELK components. ELK refers to the trio of ElasticSearch, Logstash, and Kibana. Why do these three components go together? First, they are often used in conjunction (ES handles data storage and indexing, Logstash handles collection, filtering, and transformation, and Kibana handles the graphical interface). Second, all three components are now under the umbrella of the company elastic.co. Isn't that a coincidence? As an aside, the original ELK Stack was renamed the Elastic Stack after the Beats (a lightweight agent) suite was added in version 5.0; it means the same thing, apart from the addition of the Beats agent tools and the name change.

Logstash was born on August 2, 2009. Its author is Jordan Sissel, an operations engineer at the well-known hosting provider DreamHost. Logstash appeared quite early; by comparison, Scribe was born in 2008, Flume in 2010, Graylog2 in 2010, and Fluentd in 2011. Logstash was acquired by the company Elasticsearch in 2013. Incidentally, Logstash is Jordan's personal work and so carries a distinct personality, unlike Facebook's Scribe or the Apache Foundation's Flume.

You're right, the above is nonsense. (manual laugh →_→)

Logstash's design is very standard: there are three components, with the following division of labor:

1. Shipper: responsible for log collection. Its job is to monitor changes to local log files and ship them to the Redis cache.
2. Broker: can be seen as a log hub; it can connect multiple Shippers and multiple Indexers.
3. Indexer: responsible for log storage. In this architecture it receives logs from Redis and writes them to local files.

Note that because the architecture is quite flexible, if you do not want to use Logstash's own storage, you can also output to Elasticsearch, which is the classic ELK routine.


Logstash structure diagram

If we continue to subdivide, Logstash can also be dissected like this.


Logstash Three stages of operation

And... it seems we are already finished. Don't scold me, readers: Logstash really is that simple. Everything is integrated in the code, and the programmer does not need to care how it works internally.
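To make the three-stage picture concrete, here is a toy Python sketch of the input → filter → output flow. This is purely my own illustration, not Logstash internals; the field names (`message`, `is_error`) are invented for the example.

```python
# Toy sketch of Logstash's input -> filter -> output pipeline (illustration only).
def input_stage(lines):
    # "Input": turn raw log lines into events (dicts).
    return [{"message": line.strip()} for line in lines]

def filter_stage(events):
    # "Filter": enrich/transform each event; here we just tag errors.
    for ev in events:
        ev["is_error"] = "ERROR" in ev["message"]
    return events

def output_stage(events):
    # "Output": deliver events; here we simply hand them back.
    return events

raw = ["2016-10-10 INFO started\n", "2016-10-10 ERROR disk full\n"]
events = output_stage(filter_stage(input_stage(raw)))
print(events[1]["is_error"])  # True
```

The real tool works the same way conceptually: each stage is a plugin slot, and events flow through them one after another.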

The most noteworthy part of Logstash is its relatively complete Filter plugin section, for example Grok, which parses and structures arbitrary text through regular expressions; Grok is currently the best way in Logstash to parse unstructured log data into something structured and queryable. In addition, Logstash can rename, delete, replace, and modify event fields, including, of course, discarding events entirely, such as debug events. There are many sophisticated features for the programmer to choose from, and you will find that Flume has none of them (nor could it, given its lightweight threading design). The input and output plugin sections likewise offer many optional features that the programmer can choose freely; in this respect Logstash is similar to Flume.
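Since Grok patterns compile down to regular expressions, what Grok does can be roughly sketched in plain Python with named groups. The pattern and field names below are my own illustration of the idea, not an actual Grok pattern from Logstash:

```python
import re

# Grok-style parsing sketch: named groups play the role of Grok's
# %{PATTERN:field} captures, turning an unstructured line into fields.
LOG_RE = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+) - - "
    r"\[(?P<timestamp>[^\]]+)\] "
    r'"(?P<method>\w+) (?P<path>\S+) \S+" '
    r"(?P<status>\d{3})"
)

line = '127.0.0.1 - - [10/Oct/2016:13:55:36 +0000] "GET /index.html HTTP/1.1" 200'
fields = LOG_RE.match(line).groupdict()
print(fields["status"])  # 200
```

Grok's advantage over raw regexes is its library of named, reusable patterns, but the underlying mechanism is the same.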

Big dividing line. Readers can take a bathroom break and buy a pack of melon seeds.

Logstash is not difficult to understand because of its integrated design. Now let's talk about Flume; this part has a bit more content.

The earliest Flume was a log collection system developed by Cloudera; the initial release line was called Flume OG (Original Generation). As an open source tool it was, once published, actually quite a fine set of tools, but later, as its functionality expanded, various shortcomings were exposed: bloated code engineering, unreasonable design of core components, non-standard core configuration, and so on. In the last OG release, Flume OG 0.94.0, instability in log transmission was particularly serious. Let's take a look at what was wrong with Flume OG.


Flume OG architecture diagram

Even now, when you search the web for Flume material, the Flume OG structure diagram still frequently appears. This is very unfriendly to newcomers and easily misleading, so please pay attention, readers! We can see that Flume OG has three node roles: the agent node (agent), the collection node (collector), and the master node (master).

The process is not hard to understand: agents collect log data from individual data sources and funnel it into collectors, and the collectors then roll the data up into HDFS for storage. The master is responsible for managing the activities of agents and collectors. Agents and collectors are both called nodes, and a node is classified, according to its configuration, as either a logical node or a physical node. The distinction, configuration, and use of logical versus physical nodes was always the most troublesome part for users.


Node composition in Flume OG

An agent or collector is composed of a source and a sink, meaning that the current node's data is transferred from the source to the sink.

Even a layman's head spins at this point: who designed this broken thing?

The exposure of all these problems forced the developers to make up their minds, abandon the original design, and completely rewrite Flume. So on October 22, 2011, Cloudera completed FLUME-728, a landmark change to Flume: the core components, core configuration, and code architecture were refactored, and the rebuilt version is collectively known as Flume NG (Next Generation). Another reason for the change was to bring Flume into Apache: Cloudera Flume was renamed Apache Flume, and Flume is now a member of the Apache ETL toolset.

Here's a digression. As we all know, large companies, especially large IT companies, tend to reject unstable new technologies and do not like to change technology frequently, simply because change easily leads to surprises. For example, Linux has been around for more than 20 years, and most companies use Red Hat, CentOS, or Ubuntu, which provide stable, compatible versions; if you see a company running a brand-new Linux kernel, it is probably a young company that needs some new technology to prevail in the competition.

Well, now that we have some historical background, we can put up the Flume NG structure diagram.


Flume NG structure diagram

Isn't that refreshingly simple? Comparing it with the OG structure, even a layman will marvel: so easy!

This time the developers learned OG's hard lessons and changed the core pieces wholesale:

1. NG has only one node role: the agent node (agent), instead of OG's multiple roles;

2. There are no Collector or Master nodes. This is the most fundamental change to the core components;

3. The concepts of physical node and logical node, and everything related to them, were removed;

4. The agent node's composition also changed: an NG agent consists of a source, a channel, and a sink.

So what is good about doing this? To summarize briefly, there are three points:

1. NG simplifies the core components, removing the OG version's bloated code engineering, unreasonable core component design, non-standard core configuration, and other shortcomings, making the configuration of data flows simpler and more reasonable; this is the most intuitive improvement;

2. NG removes Flume's dependence on ZooKeeper for stability. In the OG versions, Flume's stable operation depended on ZooKeeper: ZooKeeper was needed to manage the work of its multiple node classes (agent, collector, master), especially when multiple masters were configured in the cluster. OG could also keep node configuration information in memory, but then users had to tolerate losing that configuration in the event of a machine failure. So OG's stable operation depended on ZooKeeper;

3. NG greatly reduces what it demands of the user: apart from Java, the installation process requires neither configuring complex Flume-related attributes nor building a ZooKeeper cluster; the installation workload is almost zero.

Some people may be puzzled: where did this ZooKeeper concept suddenly come from, and what is it? Simply put, ZooKeeper is a reliable coordination system for large-scale distributed systems, well suited to managing clusters with multiple roles. You can think of it as the head butler of the whole Hadoop family, responsible for coordinating and managing all the components of the system. Such a component is usually inconspicuous but very important. It is like a basketball team: the five players are all stars, so we usually pay attention to those five, but the team cannot win without the coach's organizational coordination and tactical arrangements; ZooKeeper is the coach of the whole Hadoop system. The analogy is somewhat clumsy, but it conveys the importance of ZooKeeper, and, from the other side, how much lighter and more flexible Flume became by shedding that dependence.

For another comparison, the OG version's user documentation ran to more than 90 pages, while NG needs only a little more than 20 pages to explain how to use the new Flume. In science, humanity is always seeking truth, and truth can always be described in the simplest language.

That is almost all there is to Flume, because in principle this threading tool really is simple, with a three-segment structure: source (input), channel (buffer pipeline), and sink (target output). But precisely because three segments are involved, the configuration is more complex. Let's look at how Flume is laid out in some typical scenarios.
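As a mental model of the three-segment structure, here is a minimal Python sketch (my own illustration, not Flume code) of events flowing source → channel → sink:

```python
from collections import deque

# Minimal sketch of Flume NG's three-segment agent: a source puts events into
# a channel (buffer), and a sink drains the channel toward the destination.
class MemoryChannel:
    def __init__(self):
        self._q = deque()

    def put(self, event):
        self._q.append(event)

    def take(self):
        return self._q.popleft() if self._q else None

def source(channel, events):
    # The source's only job is to feed events into the channel.
    for ev in events:
        channel.put(ev)

def sink(channel, destination):
    # The sink drains the channel; here "destination" is just a list.
    while (ev := channel.take()) is not None:
        destination.append(ev)

ch, out = MemoryChannel(), []
source(ch, ["log line 1", "log line 2"])
sink(ch, out)
print(out)  # ['log line 1', 'log line 2']
```

The channel sitting between source and sink is exactly what decouples collection speed from delivery speed in the real tool.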


Flume Cluster deployment

Here let me correct a few misunderstandings common among Flume beginners. First, Flume has long been able to support multiple channels and sinks of different types within one agent, and we can choose to replicate the source data and distribute it to different destinations, for example:


Flume multiplexing flow

Second, Flume also has its own partitioning and interceptor functions, so it is not, as many experimenters think, devoid of filtering capability (though I admit Flume's filtering is weaker).
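The fan-out behavior mentioned above can be sketched in a few lines of Python. This replicating selector is my own toy model of the idea, not Flume's actual API; the channel names are invented:

```python
# Toy sketch of a replicating channel selector: one source copies each event
# to every attached channel, so several sinks can receive the same data.
def replicating_source(channels, events):
    for ev in events:
        for ch in channels:
            ch.append(ev)  # channels modeled as plain lists here

hdfs_channel, alert_channel = [], []  # hypothetical destinations
replicating_source([hdfs_channel, alert_channel], ["evt-1", "evt-2"])
print(alert_channel)  # ['evt-1', 'evt-2']
```

A multiplexing selector works the same way, except each event is routed to a subset of channels based on one of its header fields instead of being copied to all of them.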

Readers may vaguely sense that what Flume does best in a cluster is routing, because each connected Flume agent forms a link; this is a highlight Flume has over many collection tools. But precisely because of this, if one Flume agent has a problem, the whole link has a problem too, so in a cluster you need to design a hierarchical architecture and the like to achieve redundant backup. That, in turn, makes the configuration cumbersome.

The last big dividing line

Logstash and Flume have both been covered, so we can finally compare and summarize.

First, comparing the structures, we find that the two are amazingly similar! Logstash's Shipper, Broker, and Indexer correspond respectively to Flume's source, channel, and sink! It is just that Logstash is integrated and the Broker can be omitted, while in Flume the pieces must be configured separately and none can be missing. This shows once again that design ideas in computing are universal; only the implementations differ.

From the programmer's point of view, as mentioned above, Flume is indeed cumbersome: you need to configure the source, channel, and sink separately by hand, and in a complex data collection environment you may have to write multiple configurations. Conversely, Logstash's configuration is very concise and clear: the properties of the three parts are all defined, the programmer just picks what is needed, and if something is missing you can even develop your own plugin, which is very convenient. Of course, Flume has many plugins too, but its channels are essentially only memory and file (in fact there are more now, but those two are the commonly used ones). As the reader can see, both configurations are very flexible; just choose according to the scenario.

In fact, judging from the authors and the historical background, the two were originally designed for rather different purposes. Flume was originally designed to move data into HDFS (not to capture logs, which is a fundamental difference from Logstash), so it focuses on data transmission, and the programmer needs to be very clear about the routing of the entire data flow. Compared with Logstash it adds a reliability strategy: the channel exists for persistence, and data is not deleted until its delivery to the next hop is acknowledged, a step controlled by transactions; this design gives it very good reliability. Logstash, by contrast, clearly focuses on preprocessing the data, because log fields require a lot of preprocessing to pave the way for parsing.
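The transaction-controlled delivery described above can be sketched as follows. This `TxChannel` is my own simplified model of the idea, not Flume's real channel implementation:

```python
from collections import deque

# Sketch of Flume's transactional channel semantics: an event is removed from
# the channel only after the sink commits; on failure it is rolled back.
class TxChannel:
    def __init__(self, events=()):
        self._q = deque(events)
        self._in_flight = None

    def take(self):
        self._in_flight = self._q.popleft()
        return self._in_flight

    def commit(self):
        self._in_flight = None  # delivery confirmed: drop the event for good

    def rollback(self):
        self._q.appendleft(self._in_flight)  # put the event back for retry
        self._in_flight = None

ch = TxChannel(["evt-1"])
ev = ch.take()
ch.rollback()          # downstream failed: the event returns to the channel
print(len(ch._q))      # 1
ev = ch.take()
ch.commit()            # downstream acknowledged: the event is gone
print(len(ch._q))      # 0
```

This is the mechanism that lets a persistent (file-backed) channel survive an agent crash without losing unacknowledged events.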

Coming back to why I covered Logstash first and Flume second: there were a few considerations. First, Logstash is more like a general-purpose model, so it is easier for newcomers to understand, while a lightweight threading tool like Flume is better understood with some background in programming. Second, most of the time Logstash sees more use; I have no statistics for this, but in my experience Logstash can be used together with the other ELK components, making development and application much simpler, with mature technology and a wide range of scenarios. Flume's components, conversely, need to be used in conjunction with many other tools, and its scenarios are more targeted, not to mention that Flume's configuration is too cumbersome and complex.

Finally, to sum up, we can understand the difference this way: Logstash is like a pre-built desktop PC, with the motherboard, power supply, and hard disk all inside the case; you can use it directly, or modify the assembly yourself. Flume is like being handed a complete motherboard, power supply, and hard disk without a case, plus instructions that guide you in assembling them until everything runs.

That's the end of the lecture.

References:

"Flume: Log Collection and MapReduce Patterns", translated by Zhang Long

"ELK Stack Authoritative Guide" by Tachenlin

www.2cto.com/kf/201607/530428.html — Flume summary and examples

www.dataguru.cn/thread-477981-1-1.html — Flume log collection

www.ibm.com/developerworks/cn/data/library/bd-1404flumerevolution/index.html — Flume NG

shiyanjun.cn/archives/1497.html — Flume log collection tiered architecture practices

www.cnblogs.com/xing901022/p/5631445.html — Flume log collection system

www.tuicool.com/articles/BJRz22V — Logstash guide

tchuairen.blog.51cto.com/3848118/1840596/ — Logstash explanation and practical application

