1) Introduction
Flume is a distributed, reliable, and highly available system for aggregating massive volumes of logs. It lets you customize the data senders in the system for data collection, provides simple in-flight processing of the data, and can write to a variety of (customizable) data receivers.
Design goals:
(1) Reliability
When a node fails, logs can be transferred to other nodes without being lost. Flume provides three levels of reliability guarantees, from strongest to weakest: end-to-end (after the agent receives the data, the event is first written to disk; when the data is transmitted successfully it is deleted, and if transmission fails it is resent), store on failure (when the data receiver crashes, the data is written to the local disk and sending resumes after recovery), and best effort (the data is not acknowledged after it is sent to the receiver).
(2) Scalability
Flume uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled out horizontally. All agents and collectors are centrally managed by the master, which makes the system easy to monitor and maintain, and multiple masters are allowed (managed and load-balanced via ZooKeeper), which avoids a single point of failure.
(3) Manageability
All agents and collectors are centrally managed by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent. On the master you can view the status of each data source or data flow, and configure and dynamically reload each data source. Flume provides both web and shell-script command interfaces for managing data flows.
(4) Functional scalability
You can add your own agent, collector, or storage as needed. In addition, Flume ships with many components, including various agents (such as file and syslog), collectors, and storage backends (such as file, HDFS, and HBase).
2) Configuration
Hadoop and HBase were configured beforehand, so you need to start Hadoop and HBase before writing files into HDFS and HBase. For the configuration of hadoop-2.2.0 and hbase-0.96.0, see "Distributed configuration of Hadoop-2.2.0 in Ubuntu" and "Distributed environment installation of HBase-0.96.0 on CentOS".
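As a quick reference, a minimal startup sequence looks like the following, assuming the standard hadoop-2.2.0/hbase-0.96.0 script layout and that HADOOP_HOME and HBASE_HOME are already set:

$ $HADOOP_HOME/sbin/start-dfs.sh      # start the HDFS daemons
$ $HADOOP_HOME/sbin/start-yarn.sh     # start YARN
$ $HBASE_HOME/bin/start-hbase.sh      # start HBase (requires HDFS to be up)
$ jps                                 # sanity check: NameNode, DataNode, HMaster, etc. should be listed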
The test environment is two CentOS machines: the machine with host name master is responsible for collecting logs, and the machine with host name node is responsible for writing them out. This configuration demonstrates two write modes: writing to an ordinary local directory and writing to HDFS.
First download the flume-ng binary tarball from http://flume.apache.org/download.html and decompress it. Then edit the /etc/profile file and add the following lines:
export FLUME_HOME=/home/aaron/apache-flume-1.4.0-bin
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin
Run $ source /etc/profile to make the changes take effect.
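To verify that the PATH change took effect, you can ask Flume for its version:

$ flume-ng version    # should report Flume 1.4.0 if the environment is set up correctly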
In the conf directory of the Flume installation on master, create a new flume-master.conf file with the following content:
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = remoteSink

# For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.command = tail -F /home/aaron/test

# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = memoryChannel

# Each sink's type must be defined
# (loggerSink is configured here but not activated above; only remoteSink is used)
agent.sinks.loggerSink.type = logger
# Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.keep-alive = 100

# Forward events to the node machine over Avro
agent.sinks.remoteSink.type = avro
agent.sinks.remoteSink.hostname = node
agent.sinks.remoteSink.port = 23004
agent.sinks.remoteSink.channel = memoryChannel
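Because the exec source tails /home/aaron/test, make sure that file exists on master before starting the agent; otherwise tail -F will simply wait for it to appear:

$ touch /home/aaron/test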
On the node machine, add the same environment variables to /etc/profile. Then create a new flume-node.conf file in its conf directory with the following content:
agent.sources = seqGenSrc1
agent.channels = memoryChannel
agent.sinks = fileSink

# For each one of the sources, the type is defined
# Receive events from the master's avro sink
agent.sources.seqGenSrc1.type = avro
agent.sources.seqGenSrc1.bind = node
agent.sources.seqGenSrc1.port = 23004

# The channel can be defined as follows.
agent.sources.seqGenSrc1.channels = memoryChannel

# Each sink's type must be defined
# (loggerSink is configured here but not activated above; only fileSink is used)
agent.sinks.loggerSink.type = logger
# Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.keep-alive = 100

# Roll the received events into files under /home/aaron/
agent.sinks.fileSink.type = file_roll
agent.sinks.fileSink.channel = memoryChannel
agent.sinks.fileSink.sink.directory = /home/aaron/
agent.sinks.fileSink.sink.serializer.appendNewline = true
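Taken together, the two files describe the following pipeline (the hostnames, port, and paths are the ones used above):

master: exec source (tail -F /home/aaron/test) -> memoryChannel -> avro sink (node:23004)
node:   avro source (bind node:23004) -> memoryChannel -> file_roll sink (/home/aaron/)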
Run the following command on master:
$ bin/flume-ng agent --conf ./conf/ -f conf/flume-master.conf -Dflume.root.logger=DEBUG,console -n agent
Run the following command on node:
$ bin/flume-ng agent --conf ./conf/ -f conf/flume-node.conf -Dflume.root.logger=DEBUG,console -n agent
After both agents start, you will find that the two machines can communicate and that files on master are forwarded to node. If you modify the test file on master and append content to it, node receives the new lines as well.
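For example, appending a line on master should shortly show up in a rolled file on node (file names are generated by the sink, so look for the newest file):

$ echo "hello flume" >> /home/aaron/test    # on master
$ ls -lt /home/aaron/ | head                # on node: a newly written file should appear
$ cat /home/aaron/<newest-file>             # should contain "hello flume"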
If you want to write the content to Hadoop instead, change the sink section of the flume-node.conf file on node as follows:
agent.sinks = k2
agent.sinks.k2.type = hdfs
agent.sinks.k2.channel = memoryChannel
agent.sinks.k2.hdfs.path = hdfs://master:8089/hbase
agent.sinks.k2.hdfs.fileType = DataStream
agent.sinks.k2.hdfs.writeFormat = Text
Here, hdfs://master:8089/hbase is the HDFS destination path; the host and port must match the fs.defaultFS setting of your Hadoop cluster.
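The HDFS sink also exposes roll controls that are usually worth setting explicitly; the values below are illustrative, not part of the original setup:

agent.sinks.k2.hdfs.rollInterval = 30    # roll to a new file every 30 seconds
agent.sinks.k2.hdfs.rollSize = 1048576   # ...or once the file reaches 1 MB
agent.sinks.k2.hdfs.rollCount = 0        # 0 disables rolling by event count
agent.sinks.k2.hdfs.filePrefix = flume   # prefix for the generated file names

After restarting the node agent and appending to the test file on master, you can confirm the writes with:

$ hdfs dfs -ls /hbase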