Flume Cluster Log Collection

I. Introduction to Flume

Flume is a distributed, highly available system for collecting, aggregating, and transporting large volumes of log data. It allows you to customize various kinds of data senders in the logging system (such as Kafka, HDFS, etc.) to facilitate data collection. The core of Flume is the agent, a Java process that runs on the log collection node.

The agent consists of three core components: source, channel, and sink.

The source component collects logs and can handle many types and formats of log data, including Avro, Thrift, exec, JMS, spooling directory, netcat, sequence generator, syslog, HTTP, legacy, and custom sources. After the source collects data, it is temporarily stored in the channel.

The channel component temporarily stores data inside the agent; it can be backed by memory, JDBC, a file, a custom store, and so on. Data in the channel is not deleted until the sink has delivered it successfully.

The sink component sends data on to its destination, which can be HDFS, logger, Avro, Thrift, IPC, file, null, HBase, Solr, or a custom sink.

Throughout the transfer, data flows as events, and transaction guarantees are provided at the event level. Flume supports chaining multiple agents (multi-level flows) as well as fan-in and fan-out topologies.
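To make the source-channel-sink wiring concrete, here is a minimal single-agent configuration sketch (this example is not from the original article; the agent name a1 and the netcat source / logger sink are illustrative placeholders):

# one agent: a netcat source feeds a memory channel, which a logger sink drains
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# channel: buffer events in memory until the sink confirms delivery
a1.channels.c1.type = memory

# sink: write events to the agent's own log
a1.sinks.k1.type = logger

# wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

An agent started with this file simply echoes whatever text it receives on port 44444 into its own log, which is enough to see how events flow from source through channel to sink.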

II. Environment Preparation

1) Hadoop cluster (the author uses version 2.7.3 with 6 nodes in total; see http://www.cnblogs.com/qq503665965/p/6790580.html)

2) Flume cluster planning:

HOST        Role        Type        Path
Hadoop01    Agent       spooldir    /home/hadoop/logs
Hadoop05    Collector   hdfs        /logs
Hadoop06    Collector   hdfs        /logs

The official Flume documentation explains this basic structure in more detail; the author copied the architecture diagram directly from there (diagram not reproduced here).

III. Cluster Configuration

1) System environment variable configuration

export FLUME_HOME=/home/hadoop/apache-flume-1.7.0-bin
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$FLUME_HOME/bin

Remember to run source /etc/profile afterwards.
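As a quick sanity check (not part of the original article), the flume-ng script should now resolve on the PATH:

flume-ng version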

2) Flume JDK Environment

mv flume-env.sh.template flume-env.sh
vim flume-env.sh
export JAVA_HOME=/usr/jdk1.7.0_60    # add the JDK installation path

3) Flume configuration on HADOOP01

In the conf directory, add a configuration file named flume-client with the following content:

#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2

#sink group name
agent1.sinkgroups = g1

#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = spooldir
#log source directory
agent1.sources.r1.spoolDir = /home/hadoop/logs

agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp

#set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
#host that sink1 forwards to
agent1.sinks.k1.hostname = hadoop05
agent1.sinks.k1.port = 52020

#set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
#host that sink2 forwards to
agent1.sinks.k2.hostname = hadoop06
agent1.sinks.k2.port = 52020

#the sink group contains sink1 and sink2
agent1.sinkgroups.g1.sinks = k1 k2

#high availability (failover)
agent1.sinkgroups.g1.processor.type = failover
#set priorities
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000

4) HADOOP05 Configuration

#set agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#current node information
a1.sources.r1.type = avro
#bind host name
a1.sources.r1.bind = hadoop05
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = hadoop05
a1.sources.r1.channels = c1

#HDFS destination for the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /logs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
#roll a new file every second
a1.sinks.k1.hdfs.rollInterval = 1
a1.sinks.k1.channel = c1
#file name prefix
a1.sinks.k1.hdfs.filePrefix = %y-%m-%d

5) HADOOP06 Configuration

#set agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#current node information
a1.sources.r1.type = avro
#bind host name
a1.sources.r1.bind = hadoop06
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = hadoop06
a1.sources.r1.channels = c1

#set the HDFS destination directory for the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /logs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 1
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.filePrefix = %y-%m-%d

IV. Starting the Flume Cluster

1) Start the collectors, i.e. hadoop05 and hadoop06:

flume-ng agent -n a1 -c conf -f flume-server -Dflume.root.logger=DEBUG,console

2) Start the agent, i.e. HADOOP01:

flume-ng agent -n agent1 -c conf -f flume-client -Dflume.root.logger=DEBUG,console

After the agent starts, the HADOOP05 and HADOOP06 consoles print connection information (console screenshot omitted).

  

V. Log Collection Test

1) Start the ZooKeeper cluster (skip this step if you have not set up ZooKeeper).

2) Start HDFS: start-dfs.sh

3) Simulate website logs; the author simply grabbed some test data for this.

4) Upload it to /home/hadoop/logs, the spooling directory configured above; a sketch follows below.
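As a hedged sketch (not in the original post), dropping a sample log file into the spooling directory could look like this, where test.log is a placeholder file name:

cp test.log /home/hadoop/logs/

Note that the spooling directory source expects files to be complete before they are placed in the directory and never modified afterwards.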

HADOOP01 output (screenshot omitted):

HADOOP05 output (screenshot omitted):

Because HADOOP05 is configured with a higher priority than HADOOP06, HADOOP06 receives no log writes.

We then check HDFS to see whether the log file was uploaded successfully (screenshot omitted).
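For reference (not part of the original post), the same check can be done from the command line against the sink path /logs configured above:

hdfs dfs -ls /logs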

  

VI. High Availability Testing

Because the author set HADOOP05's priority higher than HADOOP06's, HADOOP05 rather than HADOOP06 produced the output during the log collection test above. Now we kill the higher-priority HADOOP05 collector and see whether HADOOP06 can take over the log collection work normally.
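A hedged sketch of one way to do this on hadoop05 (the original post only shows a screenshot): locate the Flume agent's JVM with jps and kill it. The agent's main class shows up as Application in the jps listing, and <pid> stands for the process id printed there.

jps | grep Application    # the Flume agent runs as org.apache.flume.node.Application
kill -9 <pid>             # replace <pid> with the process id shown above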

  

We add another test log file to the log source directory:

  

HADOOP06 output (screenshot omitted):

View HDFS (screenshot omitted):

  

All right! This concludes the introduction to Flume cluster configuration and log collection. In the next post, the author will cover using MapReduce to clean the logs and storing them in HBase.

  
