I. Introduction to Flume:
Flume, developed by Cloudera, is a distributed, highly available, and highly reliable system for collecting, aggregating, and transmitting massive volumes of log data. Flume supports customizing all kinds of data senders in a logging system to collect data, and it also provides simple in-flight processing of the data and the ability to write it to a variety of data receivers. Summed up in one sentence, Flume is a real-time log collection engine.
II. Flume architecture:
The Flume architecture has three parts: the data source, Flume itself, and the destination.
Data sources come in many types: a directory, HTTP, Kafka, and so on. Flume provides the source component to collect data from them; a minimal sketch of a source configuration follows the list below.
1. Source function: collect logs.
Source types: 1. Spooling Directory Source: collects logs from a directory.
2. HTTP Source: collects logs over HTTP.
3. Kafka Source: collects logs from Kafka.
......
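As a sketch of how a source type is declared, the lines below configure an HTTP source; the agent name a1, the bind address, and the port are illustrative assumptions, not values from this article.

# Minimal sketch (assumed agent name a1): an HTTP source listening on port 8888
a1.sources = r1
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 8888

Events posted to that port as JSON are converted into Flume events by the source's default handler.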
The collected logs need to be buffered, and Flume provides the channel component to cache the data; a sketch of a Kafka channel follows the list below.
2. Channel function: cache logs.
Channel types: 1. Memory Channel: caches events in memory (the most commonly used).
2. JDBC Channel: caches events in a relational database via JDBC.
3. Kafka Channel: caches events in Kafka.
......
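For contrast with the memory channel used later in a4.conf, here is a minimal sketch of a Kafka channel; the agent name, broker address, and topic are illustrative assumptions.

# Minimal sketch (assumed names): buffer events in a Kafka topic
a1.channels = c1
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = 192.168.157.11:9092
a1.channels.c1.kafka.topic = flume-channel

A Kafka channel trades the speed of the memory channel for durability: buffered events survive an agent restart because they live in Kafka rather than in RAM.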
The cached data ultimately needs to be persisted, and Flume provides the sink component to save the data; a sketch of a Kafka sink follows the list below.
3. Sink function: save logs.
Sink types: 1. HDFS Sink: saves to HDFS.
2. HBase Sink: saves to HBase.
3. Hive Sink: saves to Hive.
4. Kafka Sink: saves to Kafka.
......
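The HDFS sink is configured in full in the next section; as a sketch of an alternative destination, the lines below declare a Kafka sink, with the broker address and topic as illustrative assumptions.

# Minimal sketch (assumed names): deliver events to a Kafka topic
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = 192.168.157.11:9092
a1.sinks.k1.kafka.topic = flume-logs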
The official website enumerates the different types of each Flume component.
III. Installing and configuring Flume:
1. Installation: tar -zxvf apache-flume-1.7.0-bin.tar.gz -C ~/training
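To confirm the extraction worked, you can print the Flume version from the extracted directory (the directory name below is the tarball's default):

cd ~/training/apache-flume-1.7.0-bin
bin/flume-ng version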
2. Create a configuration file a4.conf: define the agent; define the source, channel, and sink and wire them together; and define the conditions for rolling log files.
The contents of the a4.conf configuration file follow. The data source is a directory, the data is cached in memory, and the data is ultimately saved to HDFS. The rolling conditions are: a new log file is started when the current file reaches 128 MB, or after 60 seconds.
# Define the agent name and the names of its source, channel, and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1
# Configure the source
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /root/training/logs
# Configure the channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100
# Define an interceptor that adds a timestamp header to each event
# (the HDFS sink needs it to resolve the %Y%m%d escapes in the path)
a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# Configure the sink
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://192.168.157.11:9000/flume/%Y%m%d
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream
# Do not roll files based on event count
a4.sinks.k1.hdfs.rollCount = 0
# Roll a new log file when the file on HDFS reaches 128 MB
a4.sinks.k1.hdfs.rollSize = 134217728
# Roll a new log file every 60 seconds
a4.sinks.k1.hdfs.rollInterval = 60
# Assemble the source, channel, and sink
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1
IV. Using Flume to collect data:
1. Create a directory to hold the logs:
mkdir /root/training/logs
2. Start Flume, ready to collect logs in real time:
bin/flume-ng agent -n a4 -f myagent/a4.conf -c conf -Dflume.root.logger=INFO,console
3. Copy logs into the directory:
cp * ~/training/logs
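As a quick end-to-end check (assuming an HDFS client is configured for the NameNode address in a4.conf), you can drop a test file into the spooling directory and list the output path; the file name and log line below are made up for illustration:

echo "hello flume" > /root/training/logs/test.log
# the spooling directory source renames processed files with a .COMPLETED suffix
ls /root/training/logs
# after the 60-second roll interval, events appear under the dated directory
hdfs dfs -ls /flume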
V. Similarities and differences between Sqoop and Flume:
Similarity: both Sqoop and Flume have only one installation mode; there is no local mode, cluster mode, and so on.
Difference: Sqoop collects data in batches, while Flume collects data in real time.
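To illustrate the batch side of that contrast, a typical Sqoop job imports an entire table in one run; the connection string, credentials, and table name below are hypothetical:

# hypothetical one-shot import of a MySQL table into HDFS
sqoop import --connect jdbc:mysql://192.168.157.11:3306/mydb \
  --username root --password 123456 \
  --table emp --target-dir /sqoop/emp -m 1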
Li Jinze (AllenLi), master's student at Tsinghua University. Research interests: big data and artificial intelligence.