Introduction and Application of Flume


Copyright notice: This is an original article by Yunshuxueyuan.
If you reprint it, please credit the source: http://www.cnblogs.com/sxt-zkys/
QQ Technology Group: 299142667

The Concept of Flume

1. Flume is a real-time log collection system developed by Cloudera that has been widely recognized and adopted in industry. Its initial releases are now collectively known as Flume OG (Original Generation) and belonged to Cloudera. As Flume's functionality expanded, the drawbacks of Flume OG became apparent: a bloated codebase, poorly designed core components, and non-standard core configuration. In the final OG release, 0.94.0, log transmission was particularly unstable. To solve these problems, on October 22, 2011 Cloudera completed FLUME-728, a milestone change to Flume that refactored the core components, core configuration, and code architecture. The refactored version is collectively known as Flume NG (Next Generation). Another reason for the change is that Flume was accepted into Apache, and Cloudera Flume was renamed Apache Flume.

2. Features of Flume:

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large volumes of log data. It supports customizing all kinds of data senders in the logging system to collect data, and it also provides the ability to do simple processing on the data and write it to various data receivers (such as text files, HDFS, HBase, and so on).

Flume data flow is always driven by events. An event is Flume's basic unit of data: it carries log data (in the form of a byte array) along with header information. Events are generated by sources outside the agent; when a source captures an event it formats it, and then the source pushes the event into one or more channels. You can think of a channel as a buffer that holds the event until a sink has finished processing it. The sink is responsible for persisting the log or pushing the event on to another source.

3. Reliability of Flume

When a node fails, logs can be delivered to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the agent that receives the data first writes the event to disk and deletes it only after the transfer succeeds; if sending fails, the data can be resent), store on failure (the policy also used by Scribe: when the receiver crashes, data is written locally and sending resumes after the receiver recovers), and best effort (data is sent to the receiver without any acknowledgment).

4. Recoverability of Flume

Recoverability is guaranteed by the channel as well. Using a file channel is recommended: events are persisted in the local file system (at the cost of lower performance).

5. Some core concepts of Flume

Agent: A JVM process that runs Flume. Each machine runs one agent, but a single agent can contain multiple sources and sinks.

Client: Produces the data; runs in a separate thread.

Source: Collects data from the client and passes it to the channel.

Sink: Collects data from the channel and sends it on; runs in a separate thread.

Channel: Connects sources and sinks; it works somewhat like a queue.

Events: Can be log records, Avro objects, and so on.

The Concept of the Event

To introduce the concept of the event in Flume: the core of Flume is collecting data from a data source and sending the collected data to a specified destination (the sink). To make sure delivery succeeds, the data is cached (in the channel) before it is sent to the destination (sink); only when the data has really arrived at the destination does Flume delete its own cached copy.

During transmission, what flows is the event; in other words, transactions are guaranteed at the event level. So what is an event? An event encapsulates the transmitted data and is the basic unit of data that Flume transfers; for a text file, an event is usually one line of a record. The event is also the basic unit of a transaction. From source to channel to sink, the event itself is a byte array and can carry headers (header information). An event represents the smallest complete unit of data, from an external data source to an external destination.
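As a concrete illustration (this is roughly what the logger sink used in the cases below prints for each event; the exact format can vary between Flume versions), an event carrying the text "hello" with no headers looks something like:

Event: { headers:{} body: 68 65 6C 6C 6F    hello }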

To make this easier to understand, here is a diagram of the event data flow:

Flume Architecture

Flume is so powerful because of its design, and that design is the agent. The agent itself is a Java process that runs on the log collection node; the so-called log collection node is simply a server node.

The agent contains three core components: source --> channel --> sink, a structure similar to a producer, a warehouse, and a consumer.

Source: the source component collects data and can handle log data of many types and formats, including Avro, Thrift, exec, JMS, spooling directory, netcat, sequence generator, syslog, HTTP, legacy, and custom sources.

Channel: after the source component collects the data, it is temporarily stored in the channel; that is, the channel component holds temporary data inside the agent and acts as a simple cache for the collected data. It can be backed by memory, JDBC, files, and so on.

Sink: the sink component sends data to a destination, including HDFS, logger, Avro, Thrift, IPC, file, null, HBase, Solr, Kafka, and custom sinks.

Flume Source

Source Type:

Avro Source: Supports the Avro protocol (actually Avro RPC); built-in support

Thrift Source: Supports the Thrift protocol; built-in support

Exec Source: Produces data from the standard output of a Unix command

JMS Source: Reads data from a JMS system (messages, topics)

Spooling Directory Source: Monitors data changes within a specified directory

Twitter 1% Firehose Source: Continuously downloads Twitter data via the API; experimental

Netcat Source: Monitors a port and turns each line of text that flows through the port into an event

Sequence Generator Source: A sequence-generator data source that produces sequence data

Syslog Source: Reads syslog data and generates events; supports both UDP and TCP

HTTP Source: A data source based on HTTP POST or GET; supports JSON and BLOB representations

Legacy Source: Compatible with old Flume OG sources (version 0.9.x)

Flume Channel

Channel Type:

Memory Channel: event data is stored in memory

JDBC Channel: event data is stored in persistent storage; the current Flume channel has built-in support for Derby

File Channel: event data is stored in disk files

Spillable Memory Channel: event data is stored in memory and on disk; when the in-memory queue is full, events are persisted to disk files

Pseudo Transaction Channel: for testing only

Custom Channel: a custom channel implementation

Flume Sink

Sink types:

HDFS Sink: writes data to HDFS

Logger Sink: writes data to the log

Avro Sink: data is converted into Avro events and sent to the configured RPC port

Thrift Sink: data is converted into Thrift events and sent to the configured RPC port

IRC Sink: data is replayed on IRC

File Roll Sink: stores data in the local filesystem

Null Sink: discards all data

HBase Sink: writes data to the HBase database

Morphline Solr Sink: sends data to a Solr search server (cluster)

ElasticSearch Sink: sends data to an Elasticsearch server (cluster)

Kite Dataset Sink: writes data to a Kite dataset; experimental

Custom Sink: a custom sink implementation

Flume Operating Mechanism

The core of Flume is the agent. The agent interacts with the outside in two places: one accepts data input (the source), and the other outputs data (the sink), which is responsible for sending data to the externally specified destination. After the source receives data, it sends the data to the channel; the channel acts as a data buffer that holds the data temporarily, and the sink then sends the data in the channel to the specified place, for example HDFS. Note: the channel only deletes the temporary data after the sink has successfully sent it, which guarantees the reliability and safety of the data transfer.

Generalized Usage of Flume

Another reason Flume is so powerful is that it supports multi-level Flume agents; that is, agents can be chained: a sink can write data into the source of the next agent, so agents can be connected one after another and treated as a whole. Flume also supports fan-in and fan-out. Fan-in means a source can accept multiple inputs; fan-out means a sink can output data to multiple destinations. A fan-out configuration sketch is shown below.
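As a rough sketch of fan-out (the agent name a2, the channel names, the file_roll directory, and the host/port here are illustrative, not taken from the cases below), one source can replicate every event into two channels, each drained by its own sink:

a2.sources = r1
a2.channels = c1 c2
a2.sinks = k1 k2

# the replicating selector (the default) copies every event from r1 into both channels
a2.sources.r1.type = netcat
a2.sources.r1.bind = localhost
a2.sources.r1.port = 44444
a2.sources.r1.selector.type = replicating
a2.sources.r1.channels = c1 c2

# k1 forwards events to the next agent over Avro
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = node2
a2.sinks.k1.port = 60000
a2.sinks.k1.channel = c1

# k2 keeps a local copy on disk
a2.sinks.k2.type = file_roll
a2.sinks.k2.sink.directory = /tmp/flume-copy
a2.sinks.k2.channel = c2

a2.channels.c1.type = memory
a2.channels.c2.type = memory

Fan-in needs no special settings: several upstream agents simply point their Avro sinks at the host and port of a single downstream Avro source.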

Flume Installation

1. Download the source package and upload it to a node in the cluster:

2. Unzip to the specified directory

3. Modify conf/flume-env.sh:

Note: if transferring large files causes a memory-overflow (OOM) error, modify the JAVA_OPTS configuration.
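For example (the heap sizes here are only illustrative, not a recommendation), conf/flume-env.sh could contain:

export JAVA_OPTS="-Xms512m -Xmx1024m"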

4. Configure Environment variables
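A typical addition to /etc/profile looks like the following (the installation path /opt/flume is an assumption; use the directory you actually unzipped to):

export FLUME_HOME=/opt/flume
export PATH=$PATH:$FLUME_HOME/bin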

Refresh the profile: source /etc/profile

5. Verify that the installation is successful
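If the PATH is set correctly, the following command should print the installed Flume version:

flume-ng version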

Flume Applications

Case 1

http://flume.apache.org/FlumeUserGuide.html#a-simple-example

Configuration file simple.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start flume

flume-ng agent -n a1 -c conf -f simple.conf -Dflume.root.logger=INFO,console

Install Telnet

yum install telnet
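Once Telnet is installed, connect to the netcat source from Case 1 and type a line or two; each line should appear as an event on the logger sink's console:

telnet localhost 44444
hello flume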

Memory Channel Configuration

capacity: the maximum number of events that can be stored in the channel; the default is 100

transactionCapacity: the maximum number of events the channel will take from a source or give to a sink per transaction; the default is 100

keep-alive: the time allowed for adding an event to the channel or removing one

byte** (the byteCapacity-related properties): limit on the total bytes of events in the channel, counting only the event body
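Put together, a memory channel tuned with these properties might look like the following (the values are illustrative, not recommendations):

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c1.keep-alive = 3
a1.channels.c1.byteCapacity = 800000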

Case 2: Chaining two Flume agents

NODE01 Server, configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node1
a1.sources.r1.port = 44444

# Describe the sink
# a1.sinks.k1.type = logger
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 60000

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

NODE02 server: install Flume (steps omitted)

Configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node2
a1.sources.r1.port = 60000

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start NODE02's Flume first:

flume-ng agent -n a1 -c conf -f avro.conf -Dflume.root.logger=INFO,console

Then start NODE01's Flume:

flume-ng agent -n a1 -c conf -f simple.conf2 -Dflume.root.logger=INFO,console

Open telnet against NODE01's netcat source to test, and check the console output on NODE02.

Case 3: Exec Source

http://flume.apache.org/FlumeUserGuide.html#exec-source

Configuration file

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/flume.exec.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start flume

flume-ng agent -n a1 -c conf -f exec.conf -Dflume.root.logger=INFO,console

Create an empty file for the demo: touch /home/flume.exec.log

Add data in a loop:

for i in {1..50}; do echo "$i hi flume" >> flume.exec.log; sleep 0.1; done

Case 4: Spooling Directory Source

http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source

Configuration file

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start flume

flume-ng agent -n a1 -c conf -f spool.conf -Dflume.root.logger=INFO,console

Copy a file for the demo:

mkdir logs

cp flume.exec.log logs/

Case 5: HDFS Sink

http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

Configuration file

############################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/logs
a1.sources.r1.fileHeader = true

# Describe the sink
# (only the sink block changes compared with the previous spool example, where a1.sinks.k1.type = logger)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://sxt/flume/%y-%m-%d/%H%M

## Create a new file every 60 s or whenever the file size exceeds 10 MB
# How many events to write before rolling a new file on HDFS; 0 means do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0
# How long before rolling a new file on HDFS (seconds); 0 means do not roll based on time
a1.sinks.k1.hdfs.rollInterval = 60
# How large a file may grow before rolling a new file on HDFS (bytes); 0 means do not roll based on file size
a1.sinks.k1.hdfs.rollSize = 10240
# If the currently open temporary file receives no data within this many seconds, close it and rename it to the destination file
a1.sinks.k1.hdfs.idleTimeout = 3
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

## Generate a new directory every five minutes:
# Whether to round the timestamp down ("discard" precision, similar to rounding); if enabled, it affects all time-based escape sequences except %t
a1.sinks.k1.hdfs.round = true
# The value to round down to
a1.sinks.k1.hdfs.roundValue = 5
# The unit of the rounding value: second, minute, or hour
a1.sinks.k1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############################################################
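To illustrate the rounding settings above: with round = true, roundValue = 5, and roundUnit = minute, an event timestamped at 10:58 is written under the 10:55 directory, so at most one new directory is created every five minutes. With the path pattern above, that would be a directory such as (the date here is only an example):

hdfs://sxt/flume/18-06-25/1055/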

Create an HDFS directory

hadoop fs -mkdir /flume

Start flume

flume-ng agent -n a1 -c conf -f hdfs.conf -Dflume.root.logger=INFO,console

View HDFS files

hadoop fs -ls /flume/...

hadoop fs -get /flume/...

