Flume Principles Explained (Repost)

Source: Internet
Author: User
Tags: HTTP POST, syslog

I. Introduction to Flume

Flume is a real-time log collection system developed by Cloudera that has been widely recognized and adopted by the industry. Its initial releases are now collectively known as Flume OG (Original Generation) and belonged to Cloudera.

However, as Flume's functionality expanded, the shortcomings of Flume OG became apparent: a bloated codebase, poorly designed core components, and non-standard core configuration. These problems were especially severe in 0.9.4, the last release of Flume OG, in which log transport was notably unstable.

To solve these problems, on October 22, 2011 Cloudera completed Flume-728, a milestone change to Flume that refactored the core components, core configuration, and code architecture. The refactored versions are collectively referred to as Flume NG (Next Generation). Another reason for the change was Flume's move into Apache, with Cloudera Flume renamed Apache Flume.

Note: Flume references

Official website: http://flume.apache.org/
User documentation: http://flume.apache.org/FlumeUserGuide.html
Development documentation: http://flume.apache.org/FlumeDeveloperGuide.html

II. Characteristics of Flume

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transmitting large volumes of log data. It supports customizing all kinds of data senders in the logging system to collect data, and it also provides the ability to do simple processing on the data and write it to a variety of data receivers (such as text files, HDFS, HBase, and so on).

Flume's data flow is driven entirely by events. An event is Flume's basic unit of data: it carries log data (in the form of a byte array) together with header information. Events are generated by sources outside the agent; when a source captures an event, it gives it a specific format and then pushes the event into one or more channels. You can think of a channel as a buffer that holds the event until a sink has finished processing it. The sink is responsible for persisting the log or pushing the event on to another source.

1) Reliability of Flume
When a node fails, logs can be delivered to other nodes without loss. Flume provides three levels of reliability guarantees, from strong to weak: end-to-end (after receiving the data, the agent first writes the event to disk, deletes it once the transfer has succeeded, and resends it if the transfer fails), store-on-failure (the strategy also adopted by Scribe: when the data receiver crashes, data is written locally and sending resumes after recovery), and best-effort (data is sent to the receiver without any confirmation).

2) Recoverability of Flume
Recoverability also relies on the channel. FileChannel is recommended, in which events are persisted in the local file system (at the cost of lower performance).
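
As a rough illustration of that recommendation, a file channel that persists events to local disk might be configured as in the following sketch (the agent name a1, the channel name c1, and the directory paths are placeholders, not values from this article):

    # file channel: events survive an agent restart because they are
    # checkpointed and written to data files on the local file system
    a1.channels = c1
    a1.channels.c1.type = file
    # placeholder paths; ideally on a different disk from the agent's own logs
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data
    a1.channels.c1.capacity = 1000000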

III. Some core concepts of Flume

Client: produces the data; runs in a separate thread.

Event: a unit of data consisting of a message header and a message body. (Events can be log records, Avro objects, and so on.)
Flow: an abstraction of the migration of events from a point of origin to a destination.
Agent: an independent Flume process containing the components Source, Channel, and Sink. (An agent runs Flume in a JVM; each machine runs one agent, but a single agent can contain multiple sources and sinks.)
Source: the data collection component. (A source collects data from the client and passes it to the channel.)
Channel: a temporary store that relays events, holding the events handed over by the source component. (The channel connects sources and sinks, working somewhat like a queue.)
Sink: reads and removes events from the channel and passes them to the next agent in the flow pipeline, if there is one. (A sink collects data from the channel and runs in a separate thread.)

3.1. Agent structure

The core of Flume's operation is the agent. The agent is Flume's smallest independent unit of operation, and an agent is a JVM. It is a complete data collection tool containing three core components: source, channel, and sink. With these components, events can flow from one place to another.
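
As a minimal sketch of this structure (the names a1, r1, c1, and k1 are arbitrary placeholders), a single agent wiring one source, one channel, and one sink together in a Flume properties file might look like this:

    # one source, one channel, one sink inside a single agent (one JVM)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # netcat source: listens on a TCP port and turns each received line into an event
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # memory channel: buffers events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # logger sink: writes events to the agent's log, useful for testing
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1

Such a configuration file is started with something like flume-ng agent --conf conf --conf-file example.conf --name a1, as described in the user guide.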

  

3.2. Source

A source is the data collection side: it is responsible for capturing data, giving it a specific format, encapsulating the data into events, and then pushing the events into the channel. Flume provides many built-in source types, with support for Avro, log4j, syslog, and HTTP POST (with a JSON body). It lets applications interact directly with existing sources such as AvroSource and SyslogTcpSource. If the built-in sources do not meet your needs, Flume also supports custom sources.
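
For instance, an agent could expose a syslog TCP source and an HTTP source (JSON body) along the lines of the following sketch (the ports and names are placeholders, and channel c1 is assumed to be defined as in the earlier example):

    # syslog TCP source: accepts syslog messages over TCP
    a1.sources.r1.type = syslogtcp
    a1.sources.r1.host = 0.0.0.0
    a1.sources.r1.port = 5140
    a1.sources.r1.channels = c1

    # HTTP source: accepts events posted as a JSON array via HTTP POST
    a1.sources.r2.type = http
    a1.sources.r2.port = 8080
    a1.sources.r2.channels = c1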
  

Source types: see the table of built-in sources in the Flume User Guide.

3.3. Channel

The channel is the component that connects the source and the sink. It can be viewed as a data buffer (a data queue) that can stage events in memory or persist them to local disk until the sink has processed them. Two of the more commonly used channels, MemoryChannel and FileChannel, are introduced below.
  

Channel types: see the table of built-in channels in the Flume User Guide.

3.4. Sink

A sink takes events from the channel and sends the data elsewhere: to the file system, to a database, to Hadoop, or to another agent's source. When the volume of log data is small, the data can be stored in the file system, with a time interval configured for saving it.
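
A sketch of an HDFS sink that rolls files on a fixed time interval (the HDFS path, the 10-minute interval, and the channel name are illustrative placeholders):

    # hdfs sink: writes events to HDFS and rolls the file every 600 seconds
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.rollInterval = 600
    # disable size- and count-based rolling so only the time interval applies
    a1.sinks.k1.hdfs.rollSize = 0
    a1.sinks.k1.hdfs.rollCount = 0
    # escape sequences such as %Y-%m-%d need a timestamp header on each event
    a1.sinks.k1.hdfs.useLocalTimeStamp = true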

  

Sink types: see the table of built-in sinks in the Flume User Guide.

IV. Flume Interceptors, Data Flow, and Reliability

4.1. Flume Interceptors

When we need to filter data, besides modifying code in the source, channel, and sink, Flume provides interceptors, which are also organized as a chain.

Interceptors sit between the source and the channel: when we specify interceptors for a source, each event passes through them, and the interceptor can keep or discard the event as required. Discarded events never enter the channel.
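
For example, a timestamp interceptor and a regex_filter interceptor could be chained on a source roughly like this (a sketch; the regular expression for dropping DEBUG lines is an arbitrary example):

    # interceptor chain on source r1: i1 runs before i2
    a1.sources.r1.interceptors = i1 i2

    # timestamp interceptor: adds a timestamp header to every event
    a1.sources.r1.interceptors.i1.type = timestamp

    # regex_filter interceptor: discards events whose body matches the regex
    a1.sources.r1.interceptors.i2.type = regex_filter
    a1.sources.r1.interceptors.i2.regex = ^DEBUG.*
    a1.sources.r1.interceptors.i2.excludeEvents = true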

  

4.2. Flume data flow

1) The core of Flume is to collect data from a data source and send it to a destination. To ensure successful delivery, the data is cached before it is sent, and Flume deletes its own cached copy only after the data has actually arrived at the destination.

2) The basic unit of data transmitted by Flume is the event. For a text file, an event is usually a single line of a record, and it is also the basic unit of a transaction. An event travels from source to channel to sink; it is itself a byte array and can carry header information. An event represents the smallest complete unit of a data flow, traveling from an external data source to an external destination.

  

It is important to note that Flume provides a large number of built-in source, channel, and sink types, and different types of sources, channels, and sinks can be freely combined. The combination is driven by user-defined configuration files and is very flexible. For example, a channel can keep events in memory or persist them to the local hard disk, and a sink can write logs to HDFS, HBase, or even another source. Flume also lets users build multi-level flows; in other words, multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes. This is where Flume's power lies.
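
As one example of such a combination, a single source can fan out to two channels, one feeding an HDFS sink and one feeding an Avro sink that points at another agent. A sketch, with placeholder host names and paths:

    # replicating selector: every event is copied into both channels
    a1.sources = r1
    a1.channels = c1 c2
    a1.sinks = k1 k2
    a1.sources.r1.channels = c1 c2
    a1.sources.r1.selector.type = replicating
    a1.channels.c1.type = memory
    a1.channels.c2.type = memory

    # k1 archives events to HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/backup

    # k2 forwards events to a downstream agent over Avro
    a1.sinks.k2.type = avro
    a1.sinks.k2.channel = c2
    a1.sinks.k2.hostname = collector.example.com
    a1.sinks.k2.port = 4545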

  

4.3. Flume Reliability

Flume uses a transactional approach to guarantee reliability throughout the delivery of an event. A sink may remove an event from the channel only after the event has been stored in the channel of the next agent, or has been stored in the external data destination. In this way, as events flow within a single agent or across multiple agents, the transactions just described guarantee that every event is stored reliably. For example, Flume supports persisting a file channel locally as a backup, whereas a memory channel keeps events in an in-memory queue: it is fast, but events that are lost cannot be recovered.

V. Flume Usage Scenarios

In English, a flume is a water channel, but Flume itself is more like a fire hose that can be assembled at will. Following the official documentation, several typical flows are illustrated below.

5.1. Multiple agents connected in sequence

  

  Multiple agents can be connected in sequence to collect data from the original sources and store it in the final storage system. This is the simplest case. In general, the number of agents in such a sequential connection should be kept under control, because the path the data flows through becomes longer; if failover is not considered, a failure will affect the collection service of every agent in the flow.
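
A minimal sketch of such a hop, showing only the lines that connect the two agents: agent1's Avro sink sends events to agent2's Avro source (the host name and port are placeholders; sources, channels, and remaining sinks are configured as in the earlier examples):

    # on agent1: an avro sink forwards events from channel c1 to the next agent
    agent1.sinks.k1.type = avro
    agent1.sinks.k1.channel = c1
    agent1.sinks.k1.hostname = agent2-host.example.com
    agent1.sinks.k1.port = 4141

    # on agent2: an avro source receives the events sent by agent1
    agent2.sources.r1.type = avro
    agent2.sources.r1.bind = 0.0.0.0
    agent2.sources.r1.port = 4141
    agent2.sources.r1.channels = c1

In the aggregation scenario of 5.2 below, several upstream agents simply point their Avro sinks at the same downstream Avro source.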

5.2. Multiple agents aggregating data into one agent

  

  This situation applies to more scenarios, for example collecting user-behavior logs for a web site. For availability, the web site runs as a load-balanced cluster, and every node generates user-behavior logs; an agent can be configured on each node to collect its log data separately, and multiple agents then converge the data into a single agent that writes it to the storage system, such as HDFS.

5.3. Multi-level flows

Flume also supports multi-level flows. What is a multi-level flow? Taking an application in cloud development as an example: when logs from syslog, Java, Nginx, Tomcat, and other sources flow into one agent mixed together, that agent can separate the mixed log streams and set up a dedicated transport channel for each type of log on its way to the downstream agents.

  

5.4. Load balancing

  

  Agent1 is a routing node that balances the events staged in its channel across multiple sink components, and each sink component is in turn connected to a separate agent.
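
In configuration terms this corresponds to a sink group on Agent1 with a load_balance sink processor, for example (a sketch; k1 and k2 are assumed to be Avro sinks pointing at the downstream agents):

    # group the two sinks and balance events across them
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = load_balance
    # round_robin or random selection; failed sinks are temporarily backed off
    a1.sinkgroups.g1.processor.selector = round_robin
    a1.sinkgroups.g1.processor.backoff = true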

VI. Flume Core Components

Flume consists mainly of three important components:
1) Source: completes the collection of log data, packaging it into transactions and events and pushing them into the channel. Flume provides various source implementations, including Avro Source, Exec Source, Spooling Directory Source, NetCat Source, Syslog Source, Syslog TCP Source, Syslog UDP Source, HTTP Source, and so on.
2) Channel: the Flume channel mainly provides a queue, giving simple caching of the data supplied by the source. Flume provides Memory Channel, JDBC Channel, File Channel, and so on.

3) Sink: the Flume sink takes the data out of the channel and stores it in a file system or a database, or submits it to a remote server. Implementations include HDFS Sink, Logger Sink, Avro Sink, File Roll Sink, Null Sink, HBase Sink, and so on.

6.1. Source

How does the Spooling Directory Source work?
In practice it can be used together with log4j: set log4j's file-rolling mechanism to split files every minute, and copy the rolled files into the directory monitored by the spooling source.

log4j has a TimeRolling plugin that can place the rolled files into the spool directory, which basically achieves real-time monitoring. After Flume has finished transferring a file, it changes the file's suffix to .COMPLETED (the suffix can also be configured flexibly in the configuration file).
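
A sketch of a spooling directory source matching this setup (the directory path is a placeholder; .COMPLETED is the default suffix):

    # spooldir source: watches a directory for new, fully written files
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/flume-spool
    # suffix appended to a file once Flume has finished ingesting it
    a1.sources.r1.fileSuffix = .COMPLETED
    a1.sources.r1.channels = c1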

Comparison of Exec Source and Spooling Directory Source
1) ExecSource can collect logs in near real time, but if Flume is not running or the command fails, the log data is lost, and ExecSource cannot verify the integrity of the log data.

2) SpoolSource cannot collect data in real time, but it can be used with files split by the minute, which comes close to real time.
3) Summary: if the application cannot split its log files by the minute, the two collection methods can be combined.
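
For comparison, an exec source that tails a log file in near real time might look like the following sketch (the command and the file path are placeholders; as noted above, delivery is not guaranteed if the agent or the command fails):

    # exec source: runs a command and turns each line of its output into an event
    a1.sources.r2.type = exec
    a1.sources.r2.command = tail -F /var/log/app/app.log
    a1.sources.r2.channels = c1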

6.2. Channel

1) MemoryChannel provides high-speed throughput but cannot guarantee data integrity.
2) MemoryRecoverChannel has been superseded; the official documentation recommends replacing it with FileChannel. FileChannel guarantees the integrity and consistency of the data. When configuring FileChannel, it is recommended to place the FileChannel directory and the directory where the program's log files are saved on different disks, in order to improve efficiency.

6.3. Sink

When a Flume sink stores data, the data can be saved to the file system, a database, or Hadoop. When the volume of log data is small, the data can be stored in the file system, with a certain time interval configured for saving it; when there is more log data, the corresponding log data can be stored in Hadoop for later data analysis.

