Flume Log Collection (Hadoop)

I. Flume Introduction

Flume is a distributed, reliable, and highly available system for aggregating massive amounts of log data. It allows data senders to be customized for data collection, and it can perform simple processing on the data and write it to a variety of (customizable) data receivers. Its design objectives are:

(1) Reliability

When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantees, from strongest to weakest:
- End-to-end: the receiving agent first writes the event to disk; when the data has been transferred successfully it deletes the local copy, and if the transfer fails it can resend.
- Store on failure (the strategy also adopted by Scribe): when the data receiver crashes, the data is written locally and sending resumes after the receiver recovers.
- Best effort: data is sent to the receiver without any acknowledgment.
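These three levels correspond to the agent sinks described in the architecture section below (agentE2ESink, agentDFOSink, and agentBESink). A minimal sketch of how an agent could opt into each level, with placeholder node names, file path, and collector host:

    agent1 : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
    agent2 : tail("/var/log/app.log") | agentDFOSink("collector1", 35853) ;
    agent3 : tail("/var/log/app.log") | agentBESink("collector1", 35853) ;

Here agent1 gets end-to-end guarantees, agent2 store-on-failure (disk failover), and agent3 best effort.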

(2) Scalability

Flume uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed centrally by the master, which makes the system easy to monitor and maintain, and multiple masters are allowed (managed and load-balanced via ZooKeeper), which avoids a single point of failure.

(3) Manageability

All agents and collectors are managed uniformly by the master, which makes the system easier to maintain. In the multi-master case, Flume uses ZooKeeper and a gossip protocol to keep the dynamic configuration data consistent. On the master, users can view the individual data sources and the execution of data flows, and can configure and dynamically load individual data sources. Flume provides both a web interface and shell script commands for managing data flows.
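For instance, besides the master's web page, the flume shell can be pointed at the master to inspect nodes and their configurations. A minimal sketch, assuming the master host is simply named master (the exact set of shell commands varies slightly across Flume OG releases):

    $ flume shell -c master
    flume> getnodestatus
    flume> getconfigs
    flume> quit

Here getnodestatus lists the nodes known to the master and their states, and getconfigs shows the current source/sink mapping of each node.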

(4) Functional Scalability

Users can add their own agents, collectors, or storage backends as needed. In addition, Flume ships with many components, including various agents (file, syslog, etc.), collectors, and storage backends (file, HDFS, etc.).

II. Flume Architecture

Flume's logical architecture:


As mentioned earlier, Flume uses a layered architecture with three tiers: agent, collector, and storage. Agents and collectors are each composed of two parts, a source and a sink: the source is where the data comes from, and the sink is where the data goes.

Flume itself uses two kinds of roles: master and node. A node decides whether it acts as an agent or a collector based on the dynamic configuration it receives from the master (set through the master's shell or web interface).

(1) Agent

The role of the agent is to send data from the data source to the collector.

Flume offers many ready-to-use data sources, for example:
- text("filename"): uses the file filename as a data source and sends it line by line
- tail("filename"): watches for new data appended to filename and sends it line by line
- fsyslogTcp(5140): listens on TCP port 5140 and forwards the data it receives
- tailDir("dirname"[, fileregex=".*"[, startFromEnd=false[, recurseDepth=0]]]): tails the files in a directory, using the regular expression to select which files (not directories) to monitor, with recurseDepth controlling how deep to recurse into subdirectories
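As an example, a node acting as an agent can be mapped to one of these sources together with an agent sink (described below). A minimal sketch in Flume OG's dataflow syntax, with placeholder node name, file path, and collector host:

    agent1 : tail("/var/log/app.log") | agentSink("collector1", 35853) ;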

For more details, see this write-up: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050465.html

Flume also provides many sinks, for example:
- console[("format")]: displays the data directly on the console
- text("txtfile"): writes the data to the file txtfile
- dfs("dfsfile"): writes the data to the file dfsfile on HDFS
- syslogTcp("host", port): passes the data over TCP to the host node
- agentSink[("machine"[, port])]: equivalent to agentE2ESink; if the parameters are omitted, the defaults flume.collector.event.host and flume.collector.event.port are used as the collector
- agentDFOSink[("machine"[, port])]: an agent with local hot standby (disk failover); when it detects a collector failure it keeps checking the collector's liveness so it can resend events, and the data produced in the meantime is cached on the local disk
- agentBESink[("machine"[, port])]: a best-effort agent; if the collector fails it does nothing special and simply discards the data it would have sent
- agentE2EChain: specifies multiple collectors to increase availability; when an event fails to reach the primary collector it is sent to the second one, and when all collectors fail it keeps retrying persistently
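As an illustration of the chained variant, an agent can list several collectors in order of preference; a sketch with placeholder host names:

    agent1 : tail("/var/log/app.log") | agentE2EChain("collectorA:35853", "collectorB:35853") ;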

For more details, see this write-up: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050472.html

(2) Collector

The role of the collector is to aggregate data from multiple agents and load it into the storage.

Its source and sink are similar to those of the agent.

Data sources, for example:
- collectorSource[(port)]: the collector source, which listens on port for aggregated data
- autoCollectorSource: aggregates data automatically, with the master coordinating the physical nodes
- logicalSource: a logical source whose port is allocated by the master and which listens for rpcSink data

Sinks, for example:
- collectorSink("fsdir", "fsfileprefix", rollmillis): the collector sink; data is aggregated by the collector and then written to HDFS, where fsdir is the HDFS directory and fsfileprefix is the file name prefix
- customdfs("hdfspath"[, "format"]): writes to DFS with a custom format
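Putting the collector's source and sink together, a node acting as a collector might be configured like this minimal sketch (the HDFS path, file prefix, and roll interval are placeholders):

    collector1 : collectorSource(35853) | collectorSink("hdfs://namenode:9000/flume/logs", "app-", 30000) ;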

(3) Storage

Storage is the storage layer; it can be an ordinary file, or a distributed store such as HDFS, Hive, or HBase.

(4) Master

The master manages and coordinates the configuration information of the agents and collectors; it is the controller of the Flume cluster.

In Flume, the most important abstraction is the data flow, which describes a path along which data is generated, transmitted, processed, and finally written to its target.

For an agent, the data flow configuration specifies where to obtain the data and which collector to send it to. For a collector, it specifies how to receive the data sent by agents and which target machine to deliver it to.
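Taken together, a complete data flow from one agent to one collector can be expressed as a pair of node mappings; a sketch that reuses the placeholder names from the earlier examples:

    agent1     : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
    collector1 : collectorSource(35853)   | collectorSink("hdfs://namenode:9000/flume/logs", "app-", 30000) ;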

Note: The Flume framework's dependency on Hadoop and ZooKeeper is only at the jar-package level; the Hadoop and ZooKeeper services do not need to be running when Flume is started.

III. Flume Distributed Environment Deployment

Experimental environment:
- Operating system: RedHat 5.6
- Hadoop version: 0.20.2
- JDK version: jdk1.6.0_26
- Flume version: flume-distribution-0.9.4-bin

To deploy Flume on the cluster, follow these steps:
1. Install Flume on every machine in the cluster.
2. Select one or more nodes to act as the master.
3. Modify the static configuration files.
4. Start a master on at least one machine, and start a Flume node process on every node.
5. Configure the nodes dynamically.

You need to deploy flume on every machine in the cluster.

Note: Make sure the Flume cluster's network environment is stable and reliable; otherwise you may run into puzzling errors (for example, the agent side being unable to send data to the collector).

1. Flume Environment Installation

$ wget http://cloud.github.com/downloads/cloudera/flume/flume-distribution-0.9.4-bin.tar.gz
$ tar -xzvf flume-distribution-0.9.4-bin.tar.gz
$ cp -rf flume-distribution-0.9.4-bin /usr/local/flume
$ vi /etc/profile              # add the environment configuration
    export FLUME_HOME=/usr/local/flume
    export PATH=.:$PATH:$FLUME_HOME/bin
$ source /etc/profile

$ flume                        # verify the installation

2. Select one or more nodes as Master

For the master, you can define a single master for the cluster, or select multiple nodes as masters to increase availability.
- Single-master mode: easy to manage, but weak in fault tolerance and scalability.
- Multi-master mode: typically runs 3 or 5 masters and tolerates failures well.

Principle for choosing the number of Flume masters:

The precondition for the distributed masters to keep working normally without failing is that the number of normally working masters exceeds half of the total number of masters.

The Flume master has two main functions: tracking the configuration of each node and notifying nodes of configuration changes; and tracking acknowledgement information from the end of flows running in reliable (E2E) mode, so that the source of the flow knows when to stop retransmitting an event.

3. Modify the static configuration file

Site-specific settings for Flume nodes and masters are configured through conf/flume-site.xml on each cluster node; if this file does not exist, the properties default to the values in conf/flume-conf.xml. In the following example, we set the master name on a Flume node so that the node looks for a Flume master called master:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>flume.master.servers</name>
        <value>master</value>
    </property>
</configuration>

In the multi-master case, the following configuration is required:

<property>
    <name>flume.master.servers</name>
    <value>hadoopmaster.com,hadoopedge.com,datanode4.com</value>
    <description>A comma-separated list of hostnames, one for each machine in the Flume master.</description>
</property>
<property>
    <name>flume.master.store</name>
    <value>zookeeper</value>
    <description>How the Flume master stores node configurations. Must be either 'zookeeper' or 'memory'.</description>
</property>
<property>
    <name>flume.master.serverid</name>
    <value>2</value>
    <description>The unique identifier for a machine in a Flume master ensemble. Must be different on every master instance.</description>
</property>

Note: The flume.master.serverid property is configured mainly for the masters; the flume.master.serverid of each master node in the cluster must be different, and the property's values start from 0.
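For example, with the three masters listed above, each master's flume-site.xml would carry its own id; a sketch (the host-to-id assignment is arbitrary, as long as every id is unique and the numbering starts from 0):

    <!-- on hadoopmaster.com -->
    <property><name>flume.master.serverid</name><value>0</value></property>

    <!-- on hadoopedge.com -->
    <property><name>flume.master.serverid</name><value>1</value></property>

    <!-- on datanode4.com -->
    <property><name>flume.master.serverid</name><value>2</value></property>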

When a node plays the agent role, you can set the default collector host by adding the following properties to flume-conf.xml:

<property>
    <name>flume.collector.event.host</name>
    <value>collector</value>
    <description>This is the host name of the default "remote" collector.</description>
</property>
<property>
    <name>flume.collector.port</name>
    <value>35853</value>
    <description>This is the default TCP port that the collector listens on in order to receive the events it is collecting.</description>
</property>

For more on configuration, see also: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050443.html.

4. Start the cluster

Start the nodes on the cluster:
- On the master node, enter at the command line: flume master
- On each of the other nodes, enter: flume node -n nodename

It is best to choose each nodename according to the logical division of the cluster, so that the master configuration stays clear.

The naming rule is up to you; pick names that are easy to remember and convenient for dynamic configuration (dynamic configuration is covered next).
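A minimal sketch of that dynamic configuration step through the flume shell, assuming the master host is named master and reusing the placeholder node names and paths from the earlier examples (the exact shell syntax depends on the Flume OG release):

    $ flume shell -c master
    flume> exec config agent1 'tail("/var/log/app.log")' 'agentE2ESink("collector1", 35853)'
    flume> exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode:9000/flume/logs", "app-", 30000)'
    flume> quit

After these commands, the master pushes the new source/sink mappings to agent1 and collector1.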



Reproduced from: http://www.cnblogs.com/Leo_wl/archive/2012/05/25/2518716.html

http://www.cnblogs.com/oubo/archive/2012/05/25/2517751.html
