Log Collection System Flume Research Notes, Part 1: Flume Introduction


The collection of user behavior data is a prerequisite for building a recommender system, and the Flume project under the Apache Foundation is tailored for distributed log collection. This is the 1st of the Flume research notes, and it mainly introduces Flume's basic architecture; the next note will illustrate Flume's deployment and usage steps with an example. The Flume version used in this article is the latest release, 1.5.2, which belongs to Flume NG; its system architecture differs from that of Flume OG, and the differences are described in the Flume wiki.

1. What is Flume?
Flume is an open-source project under the Apache Foundation. It implements a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving large amounts of log data from different sources (typically, logs from multiple web servers). The collected data can then be written uniformly to a common storage system (typically, HDFS).

2. Flume's Application Scenarios
The most typical application scenario for Flume is to aggregate log data from different sources and write those datasets to a unified storage system (such as HDFS or Kafka). In applications that involve stream computing (such as real-time recommender systems), Flume appears frequently. The input Flume accepts is referred to as "event data". To Flume, an event is just a generic blob of bytes (from the official Flume User Guide), so besides plain text, event data can also be binary data, images, network streams, and so on. Note, however, that the size of an event is limited: a single event cannot be larger than the memory or disk of the machine on which Flume is deployed, though individual records produced by a normal application module should never approach this physical limit.
According to the official Flume NG performance test document, with multiple Flume agent processes on a single machine writing simultaneously to the same HDFS, at a sustained send pressure of event-size = 300 bytes and without losing data, the machine can handle at least 40,000+ events/sec, with an upper limit of about 70,000+ events/sec. Of course, the exact numbers depend on the machine's hardware configuration, but they let us evaluate whether Flume meets the performance requirements of an actual business. In addition, the results show that the maximum throughput of a single machine is related to the concurrency of the Flume agents, and that the optimal concurrency matches the number of CPU cores; details can be read in the source document. In summary, with reasonable configuration, Flume can serve the vast majority of distributed applications that need log collection and aggregation.

3. Flume's Typical System Architecture
Based on the Flume User Guide, a typical Flume data flow model is as follows:

The boxed portion of the figure is Flume's system architecture, which is abstracted as an agent. Physically, an agent is a flume-agent process, i.e. a single JVM. Each agent consists of 3 types of components (note: 3 types, not 3 instances — a single agent process can host multiple instances of each type, arranged into whatever logical topology the business requires via the Flume configuration file). They are described below in the order in which data flows through them.
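As a concrete sketch of this single-agent data flow, the following configuration wires a netcat source to a logger sink through a memory channel, in the style of the examples in the official Flume User Guide (the agent name a1 and the component names r1/c1/k1 are arbitrary placeholders):

```properties
# A single agent "a1" with one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: receive lines of text on a TCP port (netcat source)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in an in-memory queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to the agent's log, convenient for debugging
a1.sinks.k1.type = logger

# Wire the components together (a source may feed several channels,
# hence the plural "channels"; a sink consumes from exactly one channel)
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such a file would be started with something like flume-ng agent --conf-file example.conf --name a1, after which lines sent to localhost:44444 appear as events in the agent's log output.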

3.1 Source
Source is responsible for receiving events from an external source, parsing them (e.g. deserialization), and delivering the parsed events to one or more channel(s) connected to it.
Here are some notes:
1) The data format that the external source sends must match the source type specified in the Flume configuration file. For example, if the source type is configured as thrift, the data sent to it must be packaged with the Thrift protocol.
2) Source currently supports several ways of receiving external data, including RPC: for example, if you configure the source as avro, Avro clients can send data to Flume via RPC. Other built-in types include the thrift source, HTTP source, exec source, JMS source, and the seq source (similar to a counter generator, which continuously produces events and is used mainly for testing), and so on. For the full list of supported sources and usage examples, refer to the Flume Sources section of the official documentation.
3) Within the same agent process, if a source is connected to multiple channels, it can be configured with different event-routing policies depending on the business requirements. The common channel selectors are of two types, replicating and multiplexing. The former is the default policy: every event from the source is sent to all channels connected to it at the same time (obviously, this consumes more memory or disk). The latter sends each event to specific channel(s) only: the selector.header configuration item specifies the event-header key used for the routing decision, and selector.mapping.<hdr-value> specifies the channel(s) to which events whose header value equals <hdr-value> will be sent.
4) You can use interceptors on a source to modify or filter events; for details, refer to the Flume Interceptors documentation.
5) A custom source implementation can also be integrated into Flume as a plugin.
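Note 3) can be sketched as the following multiplexing-selector configuration, following the format given in the Flume User Guide (the agent/source/channel names and the header values CZ and US are placeholders):

```properties
# Agent "a1": route each event by the value of its "state" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state

# Events with state=CZ go to channel c1, state=US to channel c2
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2

# Events whose "state" header matches no mapping go to the default channel
a1.sources.r1.selector.default = c3
```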

3.2 Channel
Channel is a passive storage component: it maintains an in-memory queue or a disk file to hold events received from the source until they are consumed by a sink. In other words, it connects sources and sinks like a queue.
The most common channel types are the memory channel and the file channel. The former improves performance by keeping events in an in-memory queue, but any events not yet consumed by a sink are lost on machine failure or process exit. The latter persists events in disk files, which avoids data loss in unexpected situations, but performance obviously suffers. Besides these two, Flume also supports the JDBC channel and other channel types; see the Flume Channels documentation for details.
The notes are as follows:
1) When using a memory channel, pay attention to its maximum capacity (the capacity setting): if the source produces events faster than the sink consumes them, the channel buffer may fill up and an exception will be thrown. In that case, data is lost if the external application writing to the source has no exception-handling logic (the exec source is the most likely to hit this).
2) When using file channels and configuring more than one of them, it is best to explicitly give each channel its own separate file paths. If the default paths are used, multiple channels will compete for the same file lock, and only 1 channel will initialize successfully.
3) Flume also offers the spillable memory channel, a hybrid of the memory and file types; its pros and cons are covered in the documentation and are not repeated here.
4) A custom channel implementation can be integrated into Flume as a plugin.
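Note 2) can be addressed by setting checkpointDir and dataDirs per channel, as in this sketch (the agent/channel names and all paths below are placeholders):

```properties
# Agent "a1" with two file channels, each given its own checkpoint
# and data directories so the channels do not contend for one file lock
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/c1/checkpoint
a1.channels.c1.dataDirs = /var/flume/c1/data

a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/c2/checkpoint
a1.channels.c2.dataDirs = /var/flume/c2/data
```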

3.3 Sink
Sink is responsible for consuming events from the channel and writing the data to an external storage system according to the configured sink type. Common sink types include the HDFS sink, logger sink (writes to the terminal, convenient for debugging), Avro sink (used, for example, in Flume cascades), Thrift sink, ElasticSearch sink, HBase sink, and so on. In addition, a Kafka sink was added in Flume v1.6.
Here are some notes:
1) The same channel can be connected to multiple sinks, but a single sink can only consume data from 1 channel.
2) Within the same agent process, sinks can be grouped; depending on the processor.type configuration item, a sink group can implement failover or load balancing among its sinks.
3) A custom sink implementation can be integrated into Flume as a plugin, and the sink processor interface can also be customized.

4. Flume Cascade
Besides the one-to-one source-channel-sink correspondence, Flume also supports other forms of system architecture.
1) Multi-agent cascade
2) Multi-agent aggregation cascade

3) Multi-channel splitting
The events from a source can be routed to different channels according to the configuration, which was already described in the source notes above and is not repeated here.
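The multi-agent cascade in 1) is typically built by pointing an Avro sink on the upstream agent at an Avro source on the downstream agent. A hedged sketch, with placeholder agent names, hostnames, and ports:

```properties
# Upstream agent "a1": forward its events downstream via Avro RPC
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# Downstream agent "a2": accept Avro RPC events from upstream agents
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
```

Aggregation (2) is the same pattern with several upstream agents all pointing their Avro sinks at the one downstream Avro source.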

References
1. Apache Flume
2. Flume Wiki - Getting Started
3. Flume 1.5.2 User Guide - Is Flume a good fit for your problem?
4. Flume NG performance measurements
5. Apache Flume - Architecture of Flume NG
