Flume Knowledge Points:
An event is one row (line) of data — the basic unit Flume transfers.
1. Flume is a distributed log collection system that transmits collected data to its destination.
2. Flume's core concept is the agent: a Java process that runs on the log collection node.
3. An agent consists of three core components: source, channel, and sink.
3.1 The source component collects logs and can handle log data of many types and formats: Avro, Thrift, exec, JMS, spooling directory, netcat, sequence generator, syslog, HTTP, legacy, and custom.
After collecting the data, the source stores it temporarily in the channel.
3.2 The channel component is where the agent temporarily stores data; the data can be kept in memory, JDBC, file, or a custom store.
Data in the channel is deleted only after the sink has sent it successfully.
3.3 The sink component sends the data to its destination: HDFS, logger, Avro, Thrift, IRC, file, null, HBase, Solr, or custom.
4. During the entire transfer, data flows through the pipeline as events; the transaction guarantee is at the event level.
5. Flume supports multi-level (chained) agents, as well as fan-in and fan-out.
Fan-in: a source can receive input from multiple senders.
Fan-out: data can be delivered to multiple destinations.
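As a sketch of fan-out (hypothetical agent and component names — not part of the example below), one source can replicate each event into two channels, each drained by its own sink:

```
# hypothetical fan-out layout: one source, two channels, two sinks
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# the replicating selector (Flume's default) copies every event to all listed channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```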
Flume Installation:
1. Unzip the two archives (the -bin and -src packages) on the node.
2. Copy the src content into the bin directory:
cp -ri apache-flume-1.4.0-src/* apache-flume-1.4.0-bin/
3. The src directory is no longer needed and can be deleted:
rm -rf apache-flume-1.4.0-src
4. Rename apache-flume-1.4.0-bin to flume:
mv apache-flume-1.4.0-bin/ flume
Note: this Flume installation assumes Hadoop is already installed, because Flume uses Hadoop's jar files.
5. Write the configuration file (example below).
agent1 is the agent name:
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1
The spooling directory source monitors a specified folder for new files; as soon as a new file appears, its contents are parsed and written to the channel. When the write completes, the file is either marked as completed (renamed) or deleted.
Configure source1 (note that Flume property keys are case-sensitive):
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/root/hmbbs
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader=false
agent1.sources.source1.interceptors=i1
agent1.sources.source1.interceptors.i1.type=timestamp
Configure sink1:
agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=hdfs://hadoop0:9000/hmbbs
agent1.sinks.sink1.hdfs.fileType=DataStream
agent1.sinks.sink1.hdfs.writeFormat=Text
agent1.sinks.sink1.hdfs.rollInterval=1 (the file is closed/rolled after this many seconds)
agent1.sinks.sink1.channel=channel1
agent1.sinks.sink1.hdfs.filePrefix=%y-%m-%d (prefix of the generated files; this escape needs the timestamp header added by the interceptor above)
Configure channel1:
agent1.channels.channel1.type=file
(checkpointDir is the checkpoint/backup directory)
agent1.channels.channel1.checkpointDir=/root/hmbbs_tmp/123
agent1.channels.channel1.dataDirs=/root/hmbbs_tmp/
Save the file into Flume's conf folder and name it example.
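Putting the pieces together, the complete conf/example file would look roughly like this (a sketch based on the fragments above; remember that Flume keys are case-sensitive, and comments must sit on their own lines starting with #):

```
# agent1: spooling-directory source -> file channel -> HDFS sink
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/hmbbs
agent1.sources.source1.channels = channel1
agent1.sources.source1.fileHeader = false
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.hdfs.path = hdfs://hadoop0:9000/hmbbs
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# close (roll) the current file after 1 second
agent1.sinks.sink1.hdfs.rollInterval = 1
# %y-%m-%d requires a timestamp header, which the timestamp interceptor provides
agent1.sinks.sink1.hdfs.filePrefix = %y-%m-%d

agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir = /root/hmbbs_tmp/123
agent1.channels.channel1.dataDirs = /root/hmbbs_tmp/
```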
6. Create the folder hmbbs under /root:
[root@hadoop0 /]# cd /root
[root@hadoop0 ~]# ls
anaconda-ks.cfg Documents install.log Music Public Videos
Desktop Downloads install.log.syslog Pictures Templates
[root@hadoop0 ~]# mkdir hmbbs
7. Create the corresponding folder in HDFS:
hadoop fs -mkdir /hmbbs
8. Run Flume. From the flume directory, execute:
bin/flume-ng agent -n agent1 -c conf -f conf/example -Dflume.root.logger=DEBUG,console
(-n names the agent, -c points at the configuration directory, -f at the configuration file)
9. Create a test file and drop it into the spool directory:
[root@hadoop0 ~]# vi hello
[root@hadoop0 ~]# cp hello hmbbs
You will see the file transferred into HDFS.
10. Check the results:
[root@hadoop0 ~]# cd hmbbs
[root@hadoop0 hmbbs]# ls
hello.COMPLETED
The .COMPLETED suffix is the result of renaming; it indicates that the file has been fully ingested and transferred into the channel.
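The rename is easy to reproduce by hand. This sketch does not run Flume at all — it just mimics, in a scratch directory under /tmp, what the spooling-directory source does to a file once it has been fully ingested (.COMPLETED is Flume's default fileSuffix):

```shell
# mimic the spooling-directory source's post-ingest rename
mkdir -p /tmp/spool_demo
echo "hello flume" > /tmp/spool_demo/hello
# after ingesting the file, Flume appends the .COMPLETED suffix
mv /tmp/spool_demo/hello /tmp/spool_demo/hello.COMPLETED
ls /tmp/spool_demo
```

Because the marker is only a rename, re-dropping a file with the same name as a completed one will make the source fail; file names in the spool directory must be unique.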
[root@hadoop0 ~]# cd hmbbs_tmp
[root@hadoop0 hmbbs_tmp]# ls
(hmbbs_tmp is the data directory used by the file channel)
[root@hadoop0 hmbbs_tmp]# cd 123
[root@hadoop0 123]# ls
checkpoint checkpoint.meta inflightputs inflighttakes
The files here are checkpoint/backup data: if the data in dataDirs is lost, it can be recovered from here.
In actual production, deployments span multiple nodes and the configuration is more complex; refer to the official documentation:
http://flume.apache.org
Distributed Log Collection System: Flume