1. What is Flume
1.1 Background
Flume is a real-time log collection system developed by Cloudera that has been widely recognized and adopted by the industry. Its initial releases are now collectively known as Flume OG (Original Generation) and belonged to Cloudera. As Flume's functionality grew, however, the shortcomings of Flume OG became apparent: the code had grown bloated, the core components were poorly designed, and the core configuration was not standardized. In the final OG release, 0.94.0, log transmission was particularly unstable. To solve these problems, on October 22, 2011 Cloudera completed Flume-728, a milestone change to Flume that refactored the core components, core configuration, and code architecture. The refactored version is collectively known as Flume NG (Next Generation). Another reason for the change is that Flume was accepted into Apache, and Cloudera Flume was renamed Apache Flume.
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transmitting large volumes of log data. It supports custom data senders in the log system for collecting data, and it can perform simple processing on the data before writing it to various data recipients (such as text files, HDFS, HBase, etc.).
Flume data flows are driven by events. An event is Flume's basic unit of data: it carries log data (as a byte array) along with header information. Events are generated by sources outside the agent; when a source captures an event, it formats it and then pushes it into one or more channels. You can think of a channel as a buffer that holds the event until a sink has finished processing it. The sink is then responsible for persisting the log or pushing the event on to another source.
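As an illustration only (not part of the original walkthrough; the class name, header keys, and body text are made up for the example), a minimal Java sketch of what an event looks like, built with Flume's SDK:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventSketch {
    public static void main(String[] args) {
        // Headers are arbitrary string key/value pairs; these are illustrative.
        Map<String, String> headers = new HashMap<>();
        headers.put("host", "master");
        headers.put("timestamp", Long.toString(System.currentTimeMillis()));

        // The body is an opaque byte array -- here, a single log line.
        Event event = EventBuilder.withBody(
                "one log line".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println(event.getHeaders());
        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
    }
}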
1.2 Features
Reliability
When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantees, from strongest to weakest:

- End-to-end: the receiving agent first writes the event to disk and deletes it only after the data has been transferred successfully; if the transfer fails, the data can be resent.
- Store on failure: this is also the policy adopted by Scribe; when the data receiver crashes, the data is written locally, and sending resumes after recovery.
- Best effort: the data is sent to the receiver without any acknowledgment.
Recoverability
Recoverability is also guaranteed by the channel. FileChannel is recommended: it persists events in the local file system (at the cost of some performance).
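As a sketch of this trade-off in configuration terms, using a hypothetical agent a1 and channel c1 (choose one of the two types, not both):

# Fast but volatile: events are lost if the agent process dies.
a1.channels.c1.type = memory

# Durable: events survive an agent restart, at some performance cost.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/flume-channel/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/flume-channel/data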
1.3 Core Concepts
- Agent: a JVM process running Flume. Each machine runs one agent, but a single agent can contain multiple sources and sinks.
- Client: produces the data; runs in a separate thread.
- Source: collects data from the client and passes it to the channel.
- Sink: collects data from the channel and delivers it; runs in a separate thread.
- Channel: connects sources and sinks; it behaves somewhat like a queue.
- Event: can be a log record, an Avro object, and so on.
The agent is the smallest independent operating unit of Flume; an agent is a single JVM. Each agent consists of three components: source, channel, and sink.
It is important to note that Flume provides a large number of built-in source, channel, and sink types, and different types of sources, channels, and sinks can be freely combined. The combination is driven by user-defined configuration files and is very flexible. For example, a channel can persist events in memory or to the local hard disk, and a sink can write logs to HDFS or HBase, or even to another source. Flume also lets users build multi-level flows, meaning that multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes; this is where Flume really shines. A minimal single-agent configuration is sketched below.
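To make the "free combination" concrete, here is a minimal sketch of a complete single-agent pipeline in the same properties format used in the experiment below; the names a1, r1, c1, and k1 are illustrative, and a netcat source plus logger sink stand in for real data senders and recipients:

# Name this agent's components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source: reads lines arriving on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory

# logger sink: writes each event to the Flume log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Such an agent would be started with the same flume-ng agent command used in the experiment below, substituting -n a1.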
2. How to configure Flume
The following example configures Flume to listen on a directory and sink the contents of files placed in that directory to a specified directory on HDFS, as described in the experiment section.
Experiment
- Configuring the flume-env.sh File
Append the following to the end of the file:
export FLUME_HOME=/home/hadoop/apache-flume-1.6.0-bin
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=.:$PATH:$FLUME_HOME/bin
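If these exports are also added to your login shell (e.g. ~/.bashrc), a quick sanity check, not part of the original steps, might be:

echo $FLUME_HOME   # should print /home/hadoop/apache-flume-1.6.0-bin
which flume-ng     # should resolve to $FLUME_HOME/bin/flume-ng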
- Configuring the flume-conf.properties File
agent1.sources = spooldirSource
agent1.channels = fileChannel
agent1.sinks = hdfsSink

# Configure the source: the local directory being monitored
agent1.sources.spooldirSource.type = spooldir
agent1.sources.spooldirSource.spoolDir = /home/hadoop/flume
agent1.sources.spooldirSource.channels = fileChannel

# Configure the sink: the destination directory on HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://master:9000/input/flume/%y-%m-%d
agent1.sinks.hdfsSink.hdfs.filePrefix = flume
agent1.sinks.hdfsSink.hdfs.round = true
# Number of seconds to wait before rolling the current file (0 = never roll based on time interval)
agent1.sinks.hdfsSink.hdfs.rollInterval = 3600
# File size to trigger a roll, in bytes (0 = never roll based on file size)
agent1.sinks.hdfsSink.hdfs.rollSize = 128000000
agent1.sinks.hdfsSink.hdfs.rollCount = 0
# Number of events written to the file before it is flushed to HDFS (100 is Flume's default)
agent1.sinks.hdfsSink.hdfs.batchSize = 100
# Rounded down to the highest multiple of this (with the unit configured via hdfs.roundUnit), less than current time
agent1.sinks.hdfsSink.hdfs.roundValue = 1
agent1.sinks.hdfsSink.hdfs.roundUnit = minute
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfsSink.channel = fileChannel
agent1.sinks.hdfsSink.hdfs.fileType = DataStream

# Configure the channel: persist events to the local hard disk
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /home/hadoop/apache-flume-1.6.0-bin/checkpoint
agent1.channels.fileChannel.dataDirs = /home/hadoop/apache-flume-1.6.0-bin/datadir
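With these settings, the sink starts a new HDFS file every hour (rollInterval = 3600 seconds) or once a file reaches roughly 128 MB (rollSize = 128000000 bytes), whichever comes first, while rollCount = 0 disables rolling by event count. Setting useLocalTimeStamp = true lets the %y-%m-%d escapes in hdfs.path be resolved from the local clock, without requiring a timestamp header on every event.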
- Test
1. Flume environment test and start
hadoop@master:~$ flume-ng version
Flume 1.6.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 2561a23240a71ba20bf288c7c2cda88f443c2080
Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015
From source with checksum b29e416802ce9ece3269d34233baf43f
hadoop@master:~$
hadoop@master:~$ ${FLUME_HOME}/bin/flume-ng agent --conf ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent1 > log.log 2>&1 &
[2] 13370
hadoop@master:~$ tailf ~/apache-flume-1.6.0-bin/log.log
... (Log-BackgroundWorker-fileChannel) [DEBUG - org.apache.flume.channel.file.FlumeEventQueue.checkpoint(FlumeEventQueue.java:139)] Checkpoint not required
... (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)] Checking file:conf/flume-conf.properties for changes
2. Add files to the listening directory
cp ~/wordcount.txt ~/flume/
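As a suggested verification (the paths follow the configuration above; the dated subdirectory comes from the %y-%m-%d escape):

ls ~/flume/                            # the spooldir source renames ingested files to wordcount.txt.COMPLETED
hdfs dfs -ls /input/flume/             # one subdirectory per day, named by the %y-%m-%d escape
hdfs dfs -cat /input/flume/*/flume.*   # file contents, with the configured "flume" prefix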
3. Execution results
Summary
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transmitting large volumes of log data. This article introduced Flume's background and core concepts (agent, source, channel, sink, event) and walked through a concrete configuration that monitors a local directory with a spooling-directory source and sinks its files to HDFS through a file channel.