Background
Flume is a distributed log collection system under the Apache umbrella. Its main job is to gather the logs produced by each worker in a cluster and deliver them to a specified location.
Why write this article? Because most of the documentation you can find today covers old versions of Flume. Flume 1.x (the flume-ng line) changed a great deal compared with earlier releases, so many of the documents floating around are out of date. Keep this in mind; I list a few newer, more useful references at the end.
Flume's main advantages:
* Implemented in Java, so it runs well across platforms
* Has a degree of fault tolerance, with mechanisms to guard against data loss
* Provides many agent components out of the box
* Easy to develop against, with a developer option for writing your own components
Function
A standalone agent consists of three components: source, channel, and sink. To use it, you only need to install Flume and write the corresponding conf file.
Source: where the log data comes from (an agent can have multiple sources, and many data source types are supported)
Channel: works like a queue, buffering the received log data
Sink: where the log data goes (it can be printed to the screen, written to a database, or written to a specified file). For example, the conf below wires an avro source and a memory channel to a custom sink that writes to ODPS (credentials and table names redacted):
```
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro          # avro is one of Flume's source types; reads a local log file
a1.sources.r1.bind = localhost     # this and the port below match the avro-client host/port
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = com.waqu.sink.OdpsSink   # matches the package/class name in the code
a1.sinks.k1.sink.batchSize = ***            # must be greater than 10
a1.sinks.k1.sink.table = *******            # your own hub table and key/id information
a1.sinks.k1.sink.project = *******
a1.sinks.k1.sink.odps.access_id = **********
a1.sinks.k1.sink.odps.access_key = **********
a1.sinks.k1.sink.odps.end_point = ***********
a1.sinks.k1.sink.tunnel.end_point = *******

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.checkpointDir = ***
a1.channels.c1.dataDirs = ***

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
The following sections go over these points in more detail.
Flume Workflow
An agent supports a variety of input sources. The more commonly used types are:
* http, which listens on an HTTP port and takes in log data posted to it
* netcat, which listens on a port for Telnet-like text data
* spooling, which watches a directory for newly added files
* avro source, used with avro-client to send a specified file to the agent; this does not support real-time monitoring, i.e. if we watch a.log and a.log changes afterwards, we do not see the newly appended lines
* exec source, which can monitor a file in real time
The important one is exec source. It is handy because it lets the agent run a shell command, so we can use the tail command to pick up whatever new content is appended to a file, e.g.:
tail -F log.txt
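As a minimal sketch of that idea (the agent, source, and channel names below are illustrative, not taken from the original conf), an exec source that follows a log file could be configured like this:

```
# Hypothetical agent "a1" with an exec source that tails a log file in real time
a1.sources = r1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/garvin/log.txt   # shell command executed by the agent
a1.sources.r1.channels = c1

# Buffer the events in memory until a sink picks them up
a1.channels.c1.type = memory
```

With -F, tail keeps following the file even after it is rotated, so the agent continues to receive new lines.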
Develop
* Develop your component against the official SDK and package it as a jar file
* Put the jar into Flume's lib directory
* Write the conf file
* Start the agent: flume-ng agent --conf conf --conf-file ./conf/my.conf --name a1 -Dflume.root.logger=INFO,console
* Start the data source: flume-ng avro-client -H localhost -p 44444 -F /home/garvin/log.txt -Dflume.root.logger=INFO,console
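Before plugging in the packaged sink, it can help to check that the avro source and the avro-client command above talk to each other. A minimal sketch of a test conf, assuming the standard logger sink (which simply prints events to the log and is not part of the original example), would look like this:

```
# Hypothetical test agent: avro source on localhost:44444, events printed by the logger sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinks.k1.type = logger       # standard Flume sink that logs each event at INFO level

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

If the agent is started with -Dflume.root.logger=INFO,console as above, each event sent by avro-client should show up on the agent's console; once that works, swap the logger sink for your custom sink class.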
A few useful references:
* An example code implementation: https://github.com/waqulianjie/odps_sink
* Developer documentation: http://flume.apache.org/FlumeUserGuide.html
* A fairly complete introduction: http://www.aboutyun.com/thread-8917-1-1.html
This article comes from the blog "Bo Li Garvin".
For reprints, please indicate the source: http://blog.csdn.net/buptgshengod
Copyright notice: this is an original post by the blog author; please do not reproduce it without permission.