Flume NG Study Notes (1): Introduction


1. Introduction

Flume is a distributed, reliable, highly available log-aggregation system for large data volumes. It allows you to customize data senders in the system for data collection, and it also provides simple processing of the data and the ability to write it to a variety of data receivers.

Flume underwent a major architectural change between 0.9.x and 1.x: the 1.x line was renamed Flume NG (Next Generation), while 0.9.x is referred to as Flume OG (Original Generation).

Compared with the OG version, the main changes in Flume NG (1.x) are as follows:

1. Sources and sinks are linked by channels.

2. There are two main channels: 1) the in-memory channel, which is non-persistent but fast; 2) the JDBC-based channel, which supports persistence.

3. Logical and physical nodes are no longer distinguished; all physical nodes are collectively called agents, and each agent can run zero or more sources and sinks.

4. The master node and the dependency on ZooKeeper are no longer required, and the configuration file is simplified.

5. A plug-in architecture, with parts facing users and parts facing tool or system developers.

6. Using Thrift and Avro Flume sources, events can be sent from Flume 0.9.4 to Flume 1.x.

The Flume architecture is shown below:

[Figure: Flume architecture diagram]

The relevant components are as follows:

Agent: runs Flume inside a JVM. Each machine runs one agent, but a single agent can contain multiple sources and sinks.

Client: produces data; runs in a separate thread.

Source: collects data from the client and passes it to the channel.

Sink: collects data from the channel; runs in a separate thread.

Channel: connects sources and sinks; it behaves somewhat like a queue.

Event: the basic unit of transferred data; it can be a log record, an Avro object, and so on.

The Flume architecture as a whole is a three-tier source --> channel --> sink design, similar to a producer-consumer architecture: the two sides communicate through the channel, which decouples them.

The agent is Flume's smallest independent unit of operation; an agent is a JVM. A single agent consists of three components: source, sink, and channel. Note: the machine running Flume must have JDK 1.6 or later installed.

An event is Flume's basic unit of data. It carries log data (in the form of a byte array) as well as header information, and it is generated by a data source outside the agent.

When a source captures an event, it formats the event in a specific way and then pushes it into one or more channels. You can think of the channel as a buffer that holds the event until a sink has finished processing it. The sink is responsible for persisting the log or pushing the event on to another source.
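The source --> channel --> sink wiring described above can be sketched as a minimal Flume NG properties file (the agent name a1 and the component names r1/c1/k1 are illustrative; the netcat source and logger sink are standard built-ins convenient for a first test):

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events to the console
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such a configuration is started with something like `flume-ng agent --conf conf --conf-file a1.conf --name a1`. Note that a source is wired with the plural `channels` property (it may feed several channels), while a sink takes exactly one `channel`.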

Flume lets users build multi-level flows, meaning that multiple agents can work together:


[Figure: multiple agents connected in a multi-level flow]

2. Flume Source

The Flume source completes the collection of log data, wrapping it into transactions and events before putting it into the channel.

Flume provides many source implementations, including Avro Source, Exec Source, Spooling Directory Source, NetCat Source, Syslog Source, Syslog TCP Source, Syslog UDP Source, HTTP Source, HDFS Source, etc.

The approach that requires the fewest changes to an existing program is to read the application's original log files directly; this achieves essentially seamless integration without any modification to the existing program. There are two sources that read files directly:

1. Exec Source: runs a Linux command and continuously collects its latest output, for example the "tail -F filename" command; in this mode the file name must be specified.

Exec Source supports collecting data in real time, but if Flume is not running or the command fails, data is lost, and there is no support for resuming from where it left off: since the position of the last read is not recorded, there is no way to know where to start reading the next time. This is especially problematic when the log file keeps growing: if the Flume source goes down, any log content appended while it is down cannot be read by the source when it is started again. However, Flume has an execStream extension: you can write your own monitor for log growth and use your own tool to send the appended content to the Flume node, which then forwards it to the sink node. It would be more complete if the tail-style sources themselves recorded their position when the node goes down and resumed the transfer the next time the node is started.
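A sketch of the Exec Source configuration described above (the log path is illustrative; "tail -F" is used rather than "tail -f" so that the command survives log rotation, although, as the text notes, no read offset is persisted across restarts):

```properties
# Exec source: follow a growing log file in real time
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
```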

2. Spooling Directory Source: monitors a configured directory for new files and reads the data out of those files. There are two points to note about this source: first, files copied into the spool directory must not be opened for editing; second, the spool directory must not contain subdirectories.

In practice, the Spooling Directory Source can be used together with log4j: set log4j's file-rolling interval to one minute, and copy the rolled files into the monitored spool directory. log4j has a TimeRolling plug-in that can place the rolled files directly into the spool directory, essentially achieving real-time monitoring. After Flume finishes transferring a file, it renames the file with a .COMPLETED suffix (the suffix can also be specified flexibly in the configuration file).
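A sketch of the Spooling Directory Source setup described above (the directory path is illustrative; the suffix property shows where the configurable .COMPLETED marker mentioned in the text comes from):

```properties
# Spooling directory source: pick up completed files dropped into a directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume-spool
# Suffix appended once a file has been fully transferred (default .COMPLETED)
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.channels = c1
```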

Comparison of Exec Source and Spooling Directory Source:

1) Exec Source can collect logs in real time, but when Flume is not running or the command fails, log data is lost, and the integrity of the log data cannot be verified.

2) Spooling Directory Source cannot collect data in real time, but it can be used with files split at minute granularity, which approaches real time.

3) Summary: if the application cannot roll its log files by the minute, the two collection methods can be used in combination.

3. Flume Sink

The Flume sink takes the data out of the channel and stores it in a file system or a database, or submits it to a remote server.

Flume also provides many sink implementations, including HDFS Sink, Logger Sink, Avro Sink, File Roll Sink, Null Sink, HBase Sink, etc.
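As a sketch of the most commonly used of these, an HDFS Sink might be configured as follows (the namenode host, path, and roll interval are illustrative):

```properties
# HDFS sink: write events into date-partitioned directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
# Write raw event bodies rather than SequenceFiles
a1.sinks.k1.hdfs.fileType = DataStream
# Roll a new file every 300 seconds
a1.sinks.k1.hdfs.rollInterval = 300
# Lets the %Y-%m-%d escapes resolve without a timestamp header on each event
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```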

4. Flume Channel

The Flume channel primarily provides queue-like functionality: a simple cache for the data supplied by the source.

For channels, Flume provides the Memory Channel, the JDBC Channel, the File Channel, etc.

Among them:

The Memory Channel achieves high throughput, but it cannot guarantee the integrity of the data.

The Memory Recover Channel has been deprecated; the official documentation recommends replacing it with the File Channel.

The File Channel guarantees data integrity and consistency, persisting events in the local file system (with lower performance). When configuring the File Channel, it is recommended to place the File Channel's directories and the program's own log files on different disks to improve efficiency.
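A sketch of a File Channel configuration following the advice above (the paths are illustrative; the point is that both directories should live on a disk separate from the application's own log files where possible):

```properties
# File channel: events are persisted to local disk for durability
a1.channels.c1.type = file
# Checkpoint and data directories; ideally on a dedicated disk
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data
```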


This article is from the "Pioneer Home" blog; please keep this source when reposting: http://jackwxh.blog.51cto.com/2850597/1906862
