Flume Introduction: Monitoring a File Directory and Sinking to HDFS in Practice


1. What is Flume

1.1 Background

Flume is a real-time log collection system developed by Cloudera that has been widely recognized and adopted in industry. Its initial release line is now collectively known as Flume OG (Original Generation) and belonged to Cloudera. As Flume's functionality expanded, the shortcomings of Flume OG were exposed: a bloated codebase, unreasonable core component design, and non-standard core configuration; in the final OG release, 0.94.0, instability in log transmission was especially serious. To address these problems, on October 22, 2011, Cloudera completed FLUME-728, a milestone change to Flume: the core components, core configuration, and code architecture were refactored. The refactored version is collectively known as Flume NG (Next Generation). Another reason for the change was Flume's inclusion in Apache, with Cloudera Flume renamed Apache Flume.
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large volumes of log data. It supports customizing all kinds of data senders in a logging system to collect data, and it provides the ability to do simple processing on that data and write it to a variety of data recipients (such as text files, HDFS, HBase, etc.).
Flume data flows are driven end to end by events. An event is Flume's basic unit of data: it carries the log payload (as a byte array) together with header information. Events are generated from data arriving at the agent from outside; the source formats each event as it captures it and then pushes it into one or more channels. You can think of a channel as a buffer that holds the event until a sink has finished processing it. The sink is then responsible for persisting the log or forwarding the event on to another source.
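Headers matter in practice because sinks can key off them; for example, the HDFS sink used later in this article resolves time escapes such as %y-%m-%d from a timestamp header on each event. As a minimal sketch (the names a1 and r1 are placeholders, not from this article's configuration), Flume's built-in timestamp interceptor stamps that header at the source:

a1.sources.r1.interceptors = ts
a1.sources.r1.interceptors.ts.type = timestamp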

1.2 Features
    • Reliability

      When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the receiving agent first writes the event to disk and deletes it only after the data has been transferred successfully; if sending fails, it can be resent), store-on-failure (the policy also adopted by Scribe: when the data receiver crashes, data is written locally and sending resumes after the receiver recovers), and best-effort (data is sent to the receiver without waiting for any acknowledgment).

    • Recoverability

      Recoverability is handled by the channel. FileChannel is recommended: events are persisted in the local file system (at some cost in performance).

    • Core Concepts

Agent: a JVM process running Flume. Each machine runs one agent, but a single agent can contain multiple sources and sinks.
Client: produces the data; runs in a separate thread.
Source: collects data from the client and passes it to the channel.
Sink: collects data from the channel and delivers it onward; runs in a separate thread.
Channel: connects sources and sinks; it works somewhat like a queue.
Event: can be a log record, an Avro object, and so on.

The agent is Flume's smallest independent unit of operation; an agent is a single JVM. Each agent consists of three components: source, channel, and sink.

It is important to note that Flume ships with a large number of built-in source, channel, and sink types, and different types of sources, channels, and sinks can be freely combined. The combination is driven by a user-defined configuration file, which makes it very flexible. For example, a channel can persist events in memory or on the local hard disk; a sink can write logs to HDFS or HBase, or even forward them to another source. Flume also supports multi-level flows: multiple agents can cooperate, with fan-in, fan-out, contextual routing, and backup routes, which is where Flume really shines.
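To make this free combination concrete, here is a minimal single-agent sketch (not part of this article's experiment) that wires Flume's built-in netcat source through a memory channel to a logger sink; recombining components is just a matter of swapping the type strings:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# A source that listens on a TCP port and turns each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# An in-memory buffer between source and sink
a1.channels.c1.type = memory

# A sink that simply logs events (useful for testing)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1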
  

2. How to configure Flume

Below, Flume is configured to watch a directory and sink the contents of files appearing in that directory to a specified directory on HDFS; the details are described in the experiment section.

Experiment
    • Configuring the flume-env.sh file
      Append the following to the end of the file:
export FLUME_HOME=/home/hadoop/apache-flume-1.6.0-bin
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=.:$PATH:$FLUME_HOME/bin
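One way to check that the variables resolve as intended (assuming a Bash shell; the flume-ng launcher also sources flume-env.sh on its own):

source /home/hadoop/apache-flume-1.6.0-bin/conf/flume-env.sh
echo $FLUME_HOME    # should print /home/hadoop/apache-flume-1.6.0-bin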
    • Configuring the flume-conf.properties file
agent1.sources = spooldirSource
agent1.channels = fileChannel
agent1.sinks = hdfsSink

# Configure the source: the directory being monitored
agent1.sources.spooldirSource.type = spooldir
agent1.sources.spooldirSource.spoolDir = /home/hadoop/flume
agent1.sources.spooldirSource.channels = fileChannel

# Configure the sink: the destination directory on HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://master:9000/input/flume/%y-%m-%d
agent1.sinks.hdfsSink.hdfs.filePrefix = flume
agent1.sinks.hdfsSink.hdfs.round = true
# Number of seconds to wait before rolling the current file (0 = never roll based on time interval)
agent1.sinks.hdfsSink.hdfs.rollInterval = 3600
# File size that triggers a roll, in bytes (0 = never roll based on file size)
agent1.sinks.hdfsSink.hdfs.rollSize = 128000000
agent1.sinks.hdfsSink.hdfs.rollCount = 0
# Events written per batch (value illegible in the source; 1000 shown as a typical setting)
agent1.sinks.hdfsSink.hdfs.batchSize = 1000
# Rounded down to the highest multiple of this (in the unit configured by hdfs.roundUnit), less than current time
agent1.sinks.hdfsSink.hdfs.roundValue = 1
agent1.sinks.hdfsSink.hdfs.roundUnit = minute
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfsSink.channel = fileChannel
agent1.sinks.hdfsSink.hdfs.fileType = DataStream

# Configure the channel: persist buffered events to the local hard disk
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /home/hadoop/apache-flume-1.6.0-bin/checkpoint
agent1.channels.fileChannel.dataDirs = /home/hadoop/apache-flume-1.6.0-bin/datadir
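Before starting the agent, it is worth making sure the directories referenced above exist; in particular, the spooling directory must already exist or the spooldir source will fail to start (the FileChannel creates its own directories if missing, so the last two commands are just a precaution):

mkdir -p /home/hadoop/flume
mkdir -p /home/hadoop/apache-flume-1.6.0-bin/checkpoint
mkdir -p /home/hadoop/apache-flume-1.6.0-bin/datadir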
    • Test
      1. Flume environment test and startup
hadoop@master:~$ flume-ng version
Flume 1.6.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 2561a23240a71ba20bf288c7c2cda88f443c2080
Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015
From source with checksum b29e416802ce9ece3269d34233baf43f
hadoop@master:~$
hadoop@master:~$ ${FLUME_HOME}/bin/flume-ng agent --conf ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent1 > log.log 2>&1 &
[2] 13370
hadoop@master:~$ tailf ~/apache-flume-1.6.0-bin/log.log
... (Log-BackgroundWorker-fileChannel) [DEBUG - org.apache.flume.channel.file.FlumeEventQueue.checkpoint(FlumeEventQueue.java:139)] Checkpoint not required
... (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)] Checking file:conf/flume-conf.properties for changes
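A quick way to confirm the agent process is alive before feeding it data (the Flume agent's main class shows up in jps output as Application; the PID here matches the backgrounded job above):

hadoop@master:~$ jps | grep Application
13370 Application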

2. Add a file to the monitored directory

cp ~/wordcount.txt ~/flume/

3. Results
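One simple way to check the outcome: by default, the spooling directory source renames fully ingested files with a .COMPLETED suffix, and the HDFS sink writes under the date-escaped path configured above:

hadoop@master:~$ ls ~/flume/
wordcount.txt.COMPLETED
hadoop@master:~$ hdfs dfs -ls /input/flume/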

Summary

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large volumes of log data: sources collect events from data senders, channels buffer them reliably, and sinks deliver them to recipients such as text files, HDFS, or HBase, exactly as the spooldir-to-HDFS pipeline above demonstrates.
 

