I. Introduction to Flume:
Flume, developed by Cloudera, is a distributed, highly available, and highly reliable system for collecting, aggregating, and transmitting massive volumes of log data. Flume supports customizing all kinds of data senders in a logging system to collect data, and it also provides simple in-flight processing of the data and the ability to write it to a variety of data receivers. Summed up in one sentence, Flume is a real-time log collection engine.
II. Flume architecture:
The Flume architecture has three parts: the data source, Flume itself, and the destination.
Data sources come in many types: a directory, HTTP, Kafka, and so on. Flume provides the source component to collect data from them; a minimal sketch of a source configuration follows the list below.
1. Source function: collect logs.
Source types: 1. Spooling Directory Source: collects logs from a directory.
2. HTTP Source: collects logs over HTTP.
3. Kafka Source: collects logs from Kafka.
......
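As a sketch of how a source type is declared, the lines below configure an HTTP source; the agent name a1, the bind address, and the port are illustrative assumptions, not values from this article.

# Minimal sketch (assumed agent name a1): an HTTP source listening on port 8888
a1.sources = r1
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 8888

Events posted to that port as JSON are converted into Flume events by the source's default handler.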
The collected logs need to be buffered, and Flume provides the channel component to cache the data; a sketch of a Kafka channel follows the list below.
2. Channel function: cache logs.
Channel types: 1. Memory Channel: caches events in memory (the most commonly used).
2. JDBC Channel: caches events in a relational database via JDBC.
3. Kafka Channel: caches events in Kafka.
......
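For contrast with the memory channel used later in a4.conf, here is a minimal sketch of a Kafka channel; the agent name, broker address, and topic are illustrative assumptions.

# Minimal sketch (assumed names): buffer events in a Kafka topic
a1.channels = c1
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = 192.168.157.11:9092
a1.channels.c1.kafka.topic = flume-channel

A Kafka channel trades the speed of the memory channel for durability: buffered events survive an agent restart because they live in Kafka rather than in RAM.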
The cached data ultimately needs to be persisted, and Flume provides the sink component to save the data; a sketch of a Kafka sink follows the list below.
3. Sink function: save logs.
Sink types: 1. HDFS Sink: saves to HDFS.
2. HBase Sink: saves to HBase.
3. Hive Sink: saves to Hive.
4. Kafka Sink: saves to Kafka.
......
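The HDFS sink is configured in full in the next section; as a sketch of an alternative destination, the lines below declare a Kafka sink, with the broker address and topic as illustrative assumptions.

# Minimal sketch (assumed names): deliver events to a Kafka topic
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = 192.168.157.11:9092
a1.sinks.k1.kafka.topic = flume-logs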
The official website enumerates the different types of each Flume component.
III. Installing and configuring Flume:
1. Installation: tar -zxvf apache-flume-1.7.0-bin.tar.gz -C ~/training
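To confirm the extraction worked, you can print the Flume version from the extracted directory (the directory name below is the tarball's default):

cd ~/training/apache-flume-1.7.0-bin
bin/flume-ng version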
2. Create a configuration file a4.conf: define the agent; define the source, channel, and sink and wire them together; and define the conditions for rolling log files.
The contents of the a4.conf configuration file follow. The data source is a directory, the data is cached in memory, and the data is ultimately saved to HDFS. The rolling conditions are: a new log file is started when the current file reaches 128 MB, or after 60 seconds.
# Define the agent name and the names of its source, channel, and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1
# Configure the source
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /root/training/logs
# Configure the channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100
# Define an interceptor that adds a timestamp header to each event
# (the HDFS sink needs it to resolve the %Y%m%d escapes in the path)
a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# Configure the sink
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://192.168.157.11:9000/flume/%Y%m%d
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream
# Do not roll files based on event count
a4.sinks.k1.hdfs.rollCount = 0
# Roll a new log file when the file on HDFS reaches 128 MB
a4.sinks.k1.hdfs.rollSize = 134217728
# Roll a new log file every 60 seconds
a4.sinks.k1.hdfs.rollInterval = 60
# Assemble the source, channel, and sink
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1
IV. Using Flume to collect data:
1. Create a directory to hold the logs:
mkdir /root/training/logs
2. Start Flume, ready to collect logs in real time:
bin/flume-ng agent -n a4 -f myagent/a4.conf -c conf -Dflume.root.logger=INFO,console
3. Copy logs into the directory:
cp * ~/training/logs
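As a quick end-to-end check (assuming an HDFS client is configured for the NameNode address in a4.conf), you can drop a test file into the spooling directory and list the output path; the file name and log line below are made up for illustration:

echo "hello flume" > /root/training/logs/test.log
# the spooling directory source renames processed files with a .COMPLETED suffix
ls /root/training/logs
# after the 60-second roll interval, events appear under the dated directory
hdfs dfs -ls /flume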
V. Similarities and differences between Sqoop and Flume:
Similarity: both Sqoop and Flume have only one installation mode; there is no local mode, cluster mode, and so on.
Difference: Sqoop collects data in batches, while Flume collects data in real time.
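To illustrate the batch side of that contrast, a typical Sqoop job imports an entire table in one run; the connection string, credentials, and table name below are hypothetical:

# hypothetical one-shot import of a MySQL table into HDFS
sqoop import --connect jdbc:mysql://192.168.157.11:3306/mydb \
  --username root --password 123456 \
  --table emp --target-dir /sqoop/emp -m 1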
Li Jinze (AllenLi), master's student at Tsinghua University. Research interests: big data and artificial intelligence.