Flume is a high-performance, highly available distributed log collection system originally developed by Cloudera.
The core job of Flume is to collect data from a data source and deliver it to a destination. To guarantee delivery, the data is buffered before being sent, and it is deleted only after it has actually arrived at the destination.
The basic unit of data transmitted by Flume is the event; for a text file this is usually one line of a record. The event is also the basic unit of a transaction.
The core of a Flume deployment is the agent. An agent is a complete data collection tool containing three core components: source, channel, and sink. With these components, events can flow from one place to another.
A source receives data sent from an external producer. Different source types accept different data formats. For example, the spooling directory (spooldir) source monitors a specified folder for new files; as soon as a file appears in the directory, its contents are read.
The channel is a staging area that holds the output of a source until a sink consumes it. Data is not deleted from the channel until it has entered the next agent or the final destination, so when a sink write fails the delivery can be retried without data loss; this is what makes Flume reliable.
A sink consumes the data in the channel and then sends it to an external destination or to the source of another agent. For example, the data can be written to HDFS or HBase.
Flume allows multiple agents to be chained together, the sink of one agent feeding the source of the next, to form multi-hop flows.
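A minimal sketch of such a two-hop chain (the agent names, hostname, and port below are illustrative, not from the original): the first agent forwards events through an avro sink, and the second agent receives them with an avro source.
# on the first machine: avro sink pointing at the next hop
agent_a.sinks.k1.type = avro
agent_a.sinks.k1.hostname = hop2.example.com
agent_a.sinks.k1.port = 4141
# on the second machine: avro source listening for the previous hop
agent_b.sources.r1.type = avro
agent_b.sources.r1.bind = 0.0.0.0
agent_b.sources.r1.port = 4141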
Usage
1. Download apache-flume-1.4.0-bin.tar.gz and apache-flume-1.4.0-src.tar.gz from the official website
2. Unzip each archive, then copy all the contents of the src project into the bin project
3. Delete the src project and rename the bin project to flume
4. Configure the environment variables (a shell sketch of steps 1-4 follows this list)
5. Write the agent configuration
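A minimal shell sketch of steps 1 through 4, assuming both tarballs sit in /usr/local and that the environment variables go in /etc/profile (both are illustrative choices, not prescribed by the original):
cd /usr/local
# unpack both archives
tar -zxvf apache-flume-1.4.0-bin.tar.gz
tar -zxvf apache-flume-1.4.0-src.tar.gz
# merge the src project into the bin project, then keep only the merged copy
cp -r apache-flume-1.4.0-src/* apache-flume-1.4.0-bin/
rm -rf apache-flume-1.4.0-src
mv apache-flume-1.4.0-bin flume
# put the flume-ng script on the PATH, e.g. by appending these lines to /etc/profile
export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin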
The core of using Flume is configuring the agent. The agent configuration is a plain text file that stores settings as key-value pairs, and a single file can define multiple agents. The configuration covers sources, channels, sinks, and so on; each source, channel, and sink component has a name, a type, and many component-specific properties.
The configuration file takes this general form:
# list the sources, sinks and channels for the agent
<agent>.sources = <source>
<agent>.sinks = <sink>
<agent>.channels = <channel1> <channel2>
# set the channels for a source (a source can feed several channels)
<agent>.sources.<source>.channels = <channel1> <channel2> ...
# set the channel for a sink (a sink drains exactly one channel)
<agent>.sinks.<sink>.channel = <channel1>
# properties for sources
<agent>.sources.<source>.<someProperty> = <someValue>
# properties for channels
<agent>.channels.<channel>.<someProperty> = <someValue>
# properties for sinks
<agent>.sinks.<sink>.<someProperty> = <someValue>
# here is an example
# agent1 below is the agent name; it has one source named src1, two sinks named sink1 and sink2, and two channels named ch1 and ch2
agent1.sources = src1
agent1.sinks = sink1 sink2
agent1.channels = ch1 ch2
# configure the spooling directory source: it monitors the directory (which must already exist) for changes;
# file names must be unique, otherwise Flume reports an error
agent1.sources.src1.type = spooldir
agent1.sources.src1.channels = ch2
agent1.sources.src1.spoolDir = /root/hmbbs
agent1.sources.src1.fileHeader = false
# the timestamp interceptor adds a timestamp header, which the %y-%m-%d escapes in the HDFS path below require
agent1.sources.src1.interceptors = i1
agent1.sources.src1.interceptors.i1.type = timestamp
# configure the memory channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 1000
agent1.channels.ch1.byteCapacityBufferPercentage = 20
agent1.channels.ch1.byteCapacity = 800000
# configure the file channel
agent1.channels.ch2.type = file
agent1.channels.ch2.checkpointDir = /root/flumechannel/checkpoint
agent1.channels.ch2.dataDirs = /root/flumechannel/data
# configure the HDFS sink
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch2
agent1.sinks.sink1.hdfs.path = hdfs://hadoop0:9000/flume/%y-%m-%d/
agent1.sinks.sink1.hdfs.rollInterval = 1
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# configure the HBase sink sink2
agent1.sinks.sink2.type = hbase
agent1.sinks.sink2.channel = ch1
agent1.sinks.sink2.table = hmbbs
agent1.sinks.sink2.columnFamily = cf
agent1.sinks.sink2.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
# a custom serializer can be used instead, for example:
# agent1.sinks.sink2.serializer = flume.HmbbsHbaseEventSerializer
# agent1.sinks.sink2.serializer.suffix = timestamp
6. The script that starts an agent is flume-ng; you need to specify the agent name, the configuration directory, and the configuration file:
-n specifies the agent name
-c specifies the configuration file directory
-f specifies the configuration file
-Dflume.root.logger=DEBUG,console prints debug-level logging to the console
So the full startup command looks like this:
bin/flume-ng agent -n agent1 -c conf -f conf/example -Dflume.root.logger=DEBUG,console
After a successful startup, you can put files into the directory /root/hmbbs; Flume will detect the new files and upload their contents to the /flume directory in HDFS.
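A quick way to verify the pipeline (the file name below is illustrative): drop a file into the spool directory and watch it get processed; the spooling directory source marks finished files with a .COMPLETED suffix.
echo "hello flume" > /root/hmbbs/test.log
ls /root/hmbbs          # test.log.COMPLETED once Flume has read it
hadoop fs -ls /flume    # the contents appear under a date-stamped subdirectory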