Implementing Log Collection with Flume-ng and Hadoop

1. Overview
Flume is a high-performance, highly available distributed log collection system originally developed by Cloudera.
The core task of Flume is to collect data from a data source and deliver it to a destination. To guarantee that delivery succeeds, Flume caches the data before sending it on, and only deletes its cached copy once the data has actually arrived at the destination.
The basic unit of data transmitted by Flume is the event; for a text file this is usually one line of the record, and it is also the basic unit of a transaction.
The core of a Flume deployment is the agent. An agent is a complete data collection tool that contains three core components: source, channel, and sink. With these components, an event can flow from one place to another, as shown in the following illustration.

A source receives data sent from an external producer. Different sources accept different data formats. For example, the spooling directory source monitors a specified folder for new files; as soon as a file appears in the directory, the source reads its contents.
A channel is a staging area: it receives the output of a source and holds the data until a sink consumes it. Data in the channel is not deleted until it has entered the next agent or reached the final destination. If a sink write fails, it can be retried without causing data loss, which is what makes the channel reliable.
A sink consumes the data in the channel and sends it to an external destination or to the source of another agent. For example, the data can be written to HDFS or HBase.
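To make the source/channel/sink wiring concrete, here is a minimal single-agent configuration sketch (not from the original article). It assumes an agent named a1 with a netcat source (r1), a memory channel (c1), and a logger sink (k1); the component names and the port 44444 are illustrative.
# conf/example.conf -- minimal illustrative Flume agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Such an agent can then be started with something like:
# bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1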
Flume allows multiple agents to be linked together to form a cascade of hops.
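As a sketch of how two agents are typically chained (an assumption about common practice, not from the original text), the first agent's Avro sink points at the host and port where the next agent's Avro source listens:
# agent1 (on the collecting host) forwards events via an Avro sink
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = agent2-host
agent1.sinks.k1.port = 4141
# agent2 (the next hop) receives events via an Avro source
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4141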

2. Introduction of Core Components
2.1 Sources
A source is what the client side uses to feed data into Flume. Flume supports Avro, log4j, syslog, and HTTP POST (with a JSON body) out of the box. You can let your application talk directly to an existing source, such as AvroSource or SyslogTcpSource, or you can write your own source and have the application push data to it over IPC or RPC. Both Avro and Thrift are supported for RPC (NettyAvroRpcClient and ThriftRpcClient respectively implement the RpcClient interface), with Avro being the default RPC protocol. For the client-side code needed to push data, refer to the official manual.
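As an illustrative sketch (not from the original article), an application that already logs with log4j can ship its events to an agent's Avro source through the Flume log4j appender; the host flume-agent-host and port 41414 below are assumptions, and the flume-ng-log4jappender jar is assumed to be on the application's classpath.
# log4j.properties (application side)
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume-agent-host
log4j.appender.flume.Port = 41414
# agent side: an Avro source listening on the same port
agent.sources.avroSrc.type = avro
agent.sources.avroSrc.bind = 0.0.0.0
agent.sources.avroSrc.port = 41414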
The least intrusive way to integrate an existing program is to have Flume read the log files the program already writes; this basically achieves seamless integration without any changes to the existing program.
There are two sources that read files directly:
ExecSource: runs a Linux command persistently (for example tail -f filename, where the filename must be specified) and collects its most recent output. ExecSource can collect logs in near real time, but if Flume is not running or the command fails, the log data for that period cannot be collected, so the completeness of the log data cannot be guaranteed.
SpoolSource: monitors the configured directory for new files and reads the data from those files. Two caveats: files copied into the spool directory must not be opened and edited afterwards, and the spool directory must not contain subdirectories. SpoolSource cannot collect data in real time, but if the files are split on a short interval (for example every minute) it comes close to real time. If the application does not already split its log files by the minute, the two sources can be combined. In practice this is often used together with log4j: set log4j's file rolling to once per minute and copy the rolled files into the directory that SpoolSource monitors; log4j also has a TimeRolling plug-in that can roll the log file directly into the spool directory, which basically achieves real-time monitoring. After Flume has finished with a file, it renames it by appending the suffix .COMPLETED (the suffix can be configured flexibly in the configuration file). A configuration sketch for both sources follows below.
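A minimal sketch of both file-reading sources, assuming an agent named a1 and a channel c1; the paths /var/log/app.log and /var/spool/flume are illustrative, not from the article.
# ExecSource: tail an existing log file
a1.sources.execSrc.type = exec
a1.sources.execSrc.command = tail -F /var/log/app.log
a1.sources.execSrc.channels = c1
# SpoolSource: watch a directory for newly rolled files
a1.sources.spoolSrc.type = spooldir
a1.sources.spoolSrc.spoolDir = /var/spool/flume
a1.sources.spoolSrc.fileSuffix = .COMPLETED
a1.sources.spoolSrc.channels = c1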
2.2 Channel
There are currently several channels to choose from: Memory Channel, JDBC Channel, File Channel, and Pseudo Transaction Channel. The first three are the most common.
MemoryChannel can achieve high throughput, but it does not guarantee the integrity of the data.
MemoryRecoverChannel should, per the advice in the official documentation, be replaced with FileChannel.
FileChannel guarantees the integrity and consistency of the data. When configuring FileChannel, it is recommended to put the directories FileChannel writes to and the program's own log files on separate disks to improve efficiency.
The File Channel is a persistent channel: it persists all events and stores them on disk. Therefore, even if the Java virtual machine dies, the operating system crashes or reboots, or an event has not yet been successfully passed to the next agent in the pipeline, no data is lost. The Memory Channel is a volatile channel because it stores all events in memory; if the Java process dies, any events held in memory are lost. In addition, its capacity is limited by the amount of RAM, whereas the File Channel has the advantage that it can hold all event data as long as there is enough disk space.
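A minimal configuration sketch contrasting the two channels, assuming an agent named a1; the directory paths are illustrative, not from the article.
# Memory channel: fast, but events are lost if the process dies
a1.channels.memCh.type = memory
a1.channels.memCh.capacity = 10000
a1.channels.memCh.transactionCapacity = 1000
# File channel: events survive a crash as long as the disk does
a1.channels.fileCh.type = file
a1.channels.fileCh.checkpointDir = /data/flume/checkpoint
a1.channels.fileCh.dataDirs = /data/flume/data
Keeping checkpointDir and dataDirs on a disk separate from the program's own log files follows the recommendation above.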
For more information, see:
http://flume.apache.org/FlumeUserGuide.html#file-channel
http://flume.apache.org/FlumeUserGuide.html#memory-channel

2.3 Sink
When setting up where the sink stores data, you can save it to a file system, a database, or Hadoop. When the volume of log data is small, the data can be stored in the file system with a fixed time interval per file; when the volume is larger, the log data can be stored in Hadoop (HDFS) for later analysis.
More sinks are described in the official manual.
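Two illustrative sink configurations for an agent a1 (the paths, intervals, and the master:8020 NameNode address are assumptions, not from the article): a file_roll sink that rolls a local file on a fixed interval for small volumes, and an HDFS sink for larger volumes.
# Small volume: roll a local file every 300 seconds
a1.sinks.fileSink.type = file_roll
a1.sinks.fileSink.sink.directory = /var/log/flume
a1.sinks.fileSink.sink.rollInterval = 300
a1.sinks.fileSink.channel = c1
# Larger volume: write to HDFS for later analysis
a1.sinks.hdfsSink.type = hdfs
a1.sinks.hdfsSink.hdfs.path = hdfs://master:8020/flume/events
a1.sinks.hdfsSink.hdfs.fileType = DataStream
a1.sinks.hdfsSink.hdfs.rollInterval = 600
a1.sinks.hdfsSink.channel = c1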

3. Prepare installation files
JDK Download Address: http://www.oracle.com/technetwork/java/javase/downloads/index.html
Flume+hadoop Download Address: http://archive-primary.cloudera.com/cdh4/cdh/4/

After comparing multiple versions, we concluded that the Flume-ng and Hadoop versions must correspond to each other; otherwise all kinds of errors occur.
Flume-ng: flume-ng-1.3.0-cdh4.2.0.tar.gz
Hadoop: hadoop-2.0.0-cdh4.2.0.tar.gz
JDK: jdk-7u45-linux-x64.rpm

4. JDK Installation and Configuration
# mkdir /usr/java
# rpm -ivh jdk-7u45-linux-x64.rpm --prefix /usr/java
When the installation completes, the directory /usr/java/jdk1.7.0_45 is created automatically. Next, configure the environment variables by modifying the /etc/profile file (system-wide) or the .bashrc file (valid for an individual user).
# vim /etc/profile
Add the following at the end:
#java
export JAVA_HOME=/usr/java/jdk1.7.0_45
export JRE_HOME=/usr/java/jdk1.7.0_45/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
To make the changes to /etc/profile take effect immediately, use the following command (or log in again):
# source /etc/profile
Next, remove the JDK environment that ships with the system (the GCJ-based Java platform). First, run rpm -qa | grep gcj in the terminal to check the GCJ version; here the results are:
java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
libgcj-4.1.2-48.el5
Second, uninstall the system's own JDK. Run yum -y remove java-1.4.2-gcj-compat-1.4.2.0-40jpp.115 in the terminal and wait for the system to uninstall its bundled JDK; when the terminal finally shows "Complete!", the uninstall is done.
Test the installation and check the version number with: java -version
5. Hadoop Configuration
Hadoop has an official Apache release and a Cloudera version, where the Cloudera version is the commercial distribution of Hadoop.
Hadoop has three modes of operation: standalone (single-node) mode, pseudo-distributed mode on a single machine, and fully distributed (cluster) mode.
Hadoop divides the hosts into two roles from each of three angles.
First, hosts are divided into master and slave;
Second, from the HDFS angle, hosts are divided into NameNode and DataNode (in a distributed file system the management of the directory tree is very important, and whoever manages the directory is effectively its owner; the NameNode is that directory manager);
Third, from the MapReduce angle, hosts are divided into JobTracker and TaskTracker (a job is usually split into multiple tasks, and from this angle the relationship between the two is easy to understand).
To make testing easier, we install in pseudo-distributed mode, with the slave being localhost (that is, the machine itself).
5.1 Add name mappings to the /etc/hosts file
# vim /etc/hosts
192.168.10.89 master
192.168.10.89 slave1
5.2 Close SELinux and iptables
# sed -i "s/SELINUX=enforcing/SELINUX=disabled/g" /etc/selinux/config
# service iptables stop
# chkconfig iptables off
5.3 Set up passwordless SSH login
First generate an SSH key pair on the master host for passwordless login, then copy the generated public key to slave1.
# ssh-keygen -t rsa
After entering the command you are prompted to create the .ssh/id_rsa and .ssh/id_rsa.pub files, where the first is the private key and the second is the public key. The process asks for a passphrase; to allow SSH access without a password, just press Enter at the prompts.
# cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
# chmod 600 ~/.ssh/authorized_keys
Copy the public key to slave1:
# scp ~/.ssh/id_rsa.pub root@slave1:~/.ssh/authorized_keys
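To verify the setup (an illustrative check, not part of the original steps), connect from master to slave1; if no password is requested, passwordless login works:
# ssh slave1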
