I. Introduction to Flume
Flume is a distributed, reliable, and highly available system for aggregating large volumes of logs. It allows various data senders to be customized and plugged into the system for data collection, and it can perform simple processing on the data before writing it to a variety of (customizable) data receivers.
Design objectives:
(1) Reliability
When a node fails, logs can be delivered to other nodes without being lost. Flume provides three levels of reliability guarantee, from strongest to weakest: End-to-end (the agent that receives the data first writes the event to disk, deletes it once the transfer has succeeded, and resends it if the transfer fails), Store on failure (the strategy also adopted by Scribe: when the receiver crashes, the data is written locally and sending resumes after the receiver recovers), and Best effort (data is sent to the receiver without any acknowledgement).
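In the Flume shell introduced later in this article, these three levels correspond roughly to the agentE2ESink, agentDFOSink, and agentBESink sinks described in the next section. A minimal sketch, choosing one of the three for a single agent (the node name agentA, host name collector, and port 35853 follow the examples used later in this article):
>exec config agentA 'tail("/tmp/log/message")' 'agentE2ESink("collector",35853)'    # end-to-end
>exec config agentA 'tail("/tmp/log/message")' 'agentDFOSink("collector",35853)'    # store on failure
>exec config agentA 'tail("/tmp/log/message")' 'agentBESink("collector",35853)'     # best effort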
(2) Scalability
Flume uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed by a master, which makes the system easy to monitor and maintain. Multiple masters are allowed (with ZooKeeper used for management and load balancing), which avoids a single point of failure.
(3) Manageability
All agents and collectors are managed centrally by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep the dynamic configuration data consistent. On the master, users can inspect individual data sources or the execution of data flows, and each data source can be configured and reloaded dynamically. Flume offers both a web interface and shell script commands for managing data flows.
(4) Functional extensibility
Users can add their own agents, collectors, or storage as needed. In addition, Flume ships with many components, including various agents (file, syslog, etc.), collectors, and storage backends (file, HDFS, etc.).
II. Flume Architecture
Flume's logical architecture:
As mentioned earlier, Flume uses a three-tier architecture: agent, collector, and storage. Both agents and collectors consist of two parts, a source and a sink: the source is where the data comes from, and the sink is where the data goes.
Flume runs two kinds of components: master and node. A node acts as an agent or a collector depending on how it is dynamically configured through the master's shell or web interface.
(1) Agent
The role of the agent is to send data from the data source to the collector.
Flume comes with a number of directly available data sources (source), such as:
- text("filename"): sends the file filename as a data source, line by line
- tail("filename"): detects new data appended to filename and sends it line by line
- syslogTcp(5140): listens on TCP port 5140 and forwards the data that arrives
- tailDir("dirname"[, fileregex=".*"[, startFromEnd=false[, recurseDepth=0]]]): tails the files in a directory; fileregex is a regular expression that selects which files (without the directory part) to follow, and recurseDepth sets how deep to recurse into subdirectories
See also this write-up: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050465.html
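For example, a minimal tailDir sketch (the directory, file pattern, and node names are hypothetical) that follows every file whose name starts with "message" under /tmp/log/ and its immediate subdirectories, starting from the end of each file:
>exec config agentA 'tailDir("/tmp/log/", fileregex="message.*", startFromEnd=true, recurseDepth=1)' 'agentBESink("192.168.0.11")'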
Many sinks are also available, such as:
- console[("format")]: displays the data directly on the console
- text("txtfile"): writes the data to the file txtfile
- dfs("dfsfile"): writes the data to the file dfsfile on HDFS
- syslogTcp("host", port): passes the data over TCP to the host node
- agentSink[("machine"[, port])]: equivalent to agentE2ESink; if the machine parameter is omitted, the defaults flume.collector.event.host and flume.collector.event.port are used as the default collector
- agentDFOSink[("machine"[, port])]: agent with local hot standby (disk failover); when the agent finds that the collector node has failed, it keeps checking the collector's liveness in order to resend events, and the data produced in the meantime is cached on the local disk
- agentBESink[("machine"[, port])]: best-effort agent; if the collector fails it does nothing, and the data being sent is simply discarded
- agentE2EChain: specifies multiple collectors to increase availability. When sending an event to the primary collector fails, it switches to the next collector, and when all collectors have failed it keeps retrying persistently
See also this write-up: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050472.html
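For example, if flume.collector.event.host and flume.collector.port are set in the static configuration (see section III below), a sketch of an agent that relies entirely on those defaults could be as simple as:
>exec config agentA 'tail("/tmp/log/message")' 'agentSink'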
(2) Collector
The role of the collector is to aggregate the data from multiple agents and load it into the storage.
Its source and sink are similar to the agent's.
Data sources (source), such as:
- collectorSource[(port)]: collector source, listening on port for data to aggregate
- autoCollectorSource: aggregates data automatically, with the master coordinating the physical nodes
- logicalSource: logical source, which is assigned a port by the master and listens for data from rpcSink
Sinks, such as:
- collectorSink("fsdir", "fsfileprefix", rollmillis): collector sink; data is sent to HDFS through the collector, where fsdir is the HDFS directory and fsfileprefix is the prefix of the file names
- customDfs("hdfspath"[, "format"]): DFS sink with a custom format
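For example, a sketch of a collector sink that rolls its HDFS file every 30 seconds (the directory and prefix follow the example used later in this article; the roll interval is an arbitrary choice):
>exec config collector 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/","syslog",30000)'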
(3) Storage
Storage is the storage layer; it can be an ordinary file system, or HDFS, Hive, HBase, another distributed store, and so on.
(4) Master
The master manages and coordinates the configuration information of the agents and collectors; it is the controller of the Flume cluster.
In Flume, the most important abstraction is the data flow, which describes a path along which data is generated, transmitted, processed, and finally written to its destination.
- For an agent, the data-flow configuration specifies where to get the data from and which collector to send it to.
- For a collector, it specifies receiving the data sent by the agents and sending it on to the specified target machine.
Note: Flume depends on Hadoop and ZooKeeper only for their jar packages; it does not require the Hadoop or ZooKeeper services to be running when Flume starts.
III. Flume Distributed Environment Deployment
1. Experimental scenarios
- Operating system version: RedHat 5.6
- Hadoop version: 0.20.2
- JDK version: jdk1.6.0_26
- Flume version: flume-distribution-0.9.4-bin
To deploy flume on the cluster, follow these steps:
- Install flume on each machine on the cluster
- Select one or more nodes as Master
- Modify the static configuration files
- Start a master on at least one machine, and start a Flume node on every node
- Dynamic configuration
You need to deploy flume on each machine in the cluster.
Note: the network environment of the Flume cluster must be stable and reliable, otherwise puzzling errors will occur (for example, the agent cannot send data to the collector).
1. Flume environment installation
$ wget http://cloud.github.com/downloads/cloudera/flume/flume-distribution-0.9.4-bin.tar.gz
$ tar -xzvf flume-distribution-0.9.4-bin.tar.gz
$ cp -rf flume-distribution-0.9.4-bin /usr/local/flume
$ vi /etc/profile  # add the environment variables
export FLUME_HOME=/usr/local/flume
export PATH=.:$PATH:$FLUME_HOME/bin
$ source /etc/profile
$ flume  # verify the installation
2. Select one or more nodes as Master
For the master, you can define a single master for the cluster, or select multiple nodes as masters to increase availability.
- Single-master mode: easy to manage, but weak in fault tolerance and scalability
- Multi-master mode: usually 3 or 5 masters are run, which provides good fault tolerance
Principle for choosing the number of Flume masters:
A distributed master continues to work normally, without the cluster failing, as long as more than half of the total number of masters are working. For example, with 3 masters one failure can be tolerated, and with 5 masters two failures can be tolerated.
The Flume master has two main functions:
- Track the configuration of each node and notify nodes of configuration changes;
- Track the ends of flows in reliable mode (E2E), so that the source of a flow knows when it can stop transmitting an event.
3. Modify the static configuration file
Site-specific settings for Flume nodes and masters are configured through conf/flume-site.xml on each cluster node; if this file does not exist, the defaults in conf/flume-conf.xml are used. In the following example we set the master name on a Flume node, so that the node can find the Flume master called "master" on its own.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>master</value>
  </property>
</configuration>
In the case of multiple masters, the following configuration is required:
<property>
  <name>flume.master.servers</name>
  <value>hadoopmaster.com,hadoopedge.com,datanode4.com</value>
  <description>A comma-separated list of hostnames, one for each machine in the Flume Master.</description>
</property>
<property>
  <name>flume.master.store</name>
  <value>zookeeper</value>
  <description>How the Flume Master stores node configurations. Must be either 'zookeeper' or 'memory'.</description>
</property>
<property>
  <name>flume.master.serverid</name>
  <value>2</value>
  <description>The unique identifier for a machine in a Flume Master ensemble. Must be different on every master instance.</description>
</property>
Note: the flume.master.serverid property mainly concerns the masters; its value must be different on every master node in the cluster, and the values start from 0.
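For example, with the three masters listed above, one possible assignment is the following sketch (each value goes into that machine's own flume-site.xml):
<!-- on hadoopmaster.com -->
<property>
  <name>flume.master.serverid</name>
  <value>0</value>
</property>
<!-- on hadoopedge.com the value would be 1, and on datanode4.com it would be 2 -->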
When using the agent role, you can set the default collector host by adding the following configuration to flume-conf.xml:
<property>
  <name>flume.collector.event.host</name>
  <value>collector</value>
  <description>This is the host name of the default "remote" collector.</description>
</property>
<property>
  <name>flume.collector.port</name>
  <value>35853</value>
  <description>This is the default TCP port that the collector listens on in order to receive the events it is collecting.</description>
</property>
See also: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050443.html for configuration.
4. Start the cluster
Starting the nodes on the cluster:
- On the command line, run flume master to start the master node.
- On the command line, run flume node -n nodeName to start the other nodes; nodeName is best chosen according to the logical layout of the cluster, so that the master configuration later on is clearer.
The naming rules are up to you; anything that is easy to remember and convenient for dynamic configuration will do (dynamic configuration is introduced below).
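For example, a sketch of the start-up commands, using the node names adopted later in this article:
$ flume master              # on the master machine
$ flume node -n agentA      # on each agent machine (agentB, agentC, ... accordingly)
$ flume node -n collector   # on the collector machine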
5. Dynamic configuration based on the flume shell
For the commands available in the Flume shell, see: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050461.html
Suppose the Flume cluster we have deployed is structured as follows:
We want to collect the system logs of the machines hosting agents A through F into HDFS. How do we configure this in the Flume shell to achieve that goal?
1. Set logical nodes (logical node)
$ flume shell
>connect localhost
>help
>exec map 192.168.0.1 agentA
>exec map 192.168.0.2 agentB
>exec map 192.168.0.3 agentC
>exec map 192.168.0.4 agentD
>exec map 192.168.0.5 agentE
>exec map 192.168.0.6 agentF
>getnodestatus
192.168.0.1 --- IDLE
192.168.0.2 --- IDLE
192.168.0.3 --- IDLE
192.168.0.4 --- IDLE
192.168.0.5 --- IDLE
192.168.0.6 --- IDLE
agentA --- IDLE
agentB --- IDLE
agentC --- IDLE
agentD --- IDLE
agentE --- IDLE
agentF --- IDLE
>exec map 192.168.0.11 collector
Here you can also open the master web interface to check.
2. Start the collector's listening port
>exec config collector 'collectorSource(35853)' 'collectorSink("","")'  # the collector node listens for data arriving on port 35853; this is very important
Log in to the collector server and check the port:
$ netstat -nalp | grep 35853
If the configuration above has not been done on the master, this listening port will not show up on the collector.
3. Set source and sink for each node
>exec config collector 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/","syslog")'
>exec config agentA 'tail("/tmp/log/message")' 'agentBESink("192.168.0.11")'  # experiments show that a logical node can have at most one source and one sink
... (configure agentB through agentE the same way) ...
>exec config agentF 'tail("/tmp/log/message")' 'agentBESink("192.168.0.11")'
At this point the configuration can be seen at a glance on the master web interface, and we have already achieved our original goal.
The dynamic configuration done above through the Flume shell can also be done on the Flume master web interface, which needs no further explanation.
IV. Advanced Dynamic Configuration
The advanced configuration simply adds the following features to the simple configuration above, so that the system runs better:
- Multiple masters (high availability of the master node)
- Collector chains (high availability of the collectors)
Multiple masters have already been covered above, including how they are used and how many to run. Now let us take a quick look at the collector chain. It is also very simple: in the dynamic configuration, use agent*Chain to specify multiple collectors so that log transport stays available. Here is a logical diagram of Flume in a typical production environment:
Here agentA and agentB point to collectorA. If collectorA crashes, the agents act according to the configured reliability level. Suppose that, for the sake of efficient transmission, we did not choose E2E (and even with E2E, log accumulation on the agent's local disk is still a problem); we would then typically configure multiple collectors to form a collector chain.
>exec config agentC 'tail("/tmp/log/message")' 'agentE2EChain("collectorB:35853","collectorA:35853")'
>exec config agentD 'tail("/tmp/log/message")' 'agentE2EChain("collectorB:35853","collectorC:35853")'
This is what happens when collectorB runs into a problem:
V. Problems and Summary
The node types above are: master, agent, collector, and storage. For each type of node, let us look at its high availability and at whether it could become a performance bottleneck.
First, a failure of the storage layer is the same as a failure of the collector layer: as long as the data does not reach its final destination, the node counts as failed. We choose the transmission mode according to how reliably the data must be collected, and our configuration controls which collectors receive the data. The collectors' performance determines the data throughput of the whole Flume cluster, so the collectors are best deployed on their own machines; for that reason their high availability is generally not considered separately.
Next, failures in the agent layer. Flume's data-safety level is configured mainly on the agent, and the agent offers three levels for sending data to the collector: E2E, DFO, and BE, which are not repeated here. Here is a summary from an experienced user:
The agent node monitors all files under the log folder, and each agent can follow at most 1024 files. For every file the agent keeps something like a cursor that records how far the file has been read, so whenever new records are appended, the cursor reads the incremental records and sends them to the collector according to the safety level (E2E, DFO) configured on the agent.
In the E2E case, the agent node first writes the events to a folder on the agent node and then sends them to the collector. If the data ends up being stored successfully in the storage layer, the agent deletes the files it wrote earlier; if no success acknowledgement is received, the files are kept.
If something goes wrong with the agent node, all of this bookkeeping disappears. If you simply restart it, the agent treats every file under the log folder as not yet followed; since there are no file records, the files are read again from the beginning and the logs are duplicated. The specific recovery method is as follows:
Move the already-sent log files out of the monitored log folder on the agent node, deal with the failure, and then restart the agent.
Note: when an agent node fails, move out the data files created before the time of the failure, empty the folder configured as flume.agent.logdir, and then restart the agent.
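A sketch of the recovery steps, assuming the monitored file /tmp/log/message and the flume.agent.logdir value /data/tmp/flume-${user.name}/agent used elsewhere in this article (the backup path is hypothetical):
$ mv /tmp/log/message /data/backup/message.before-failure   # move out data written before the failure
$ rm -rf /data/tmp/flume-$USER/agent/*                       # empty the flume.agent.logdir folder
$ flume node -n agentA                                       # restart the agent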
Finally, master failures. If the master goes down, the whole cluster cannot work. To recover, restart the cluster: move out all the files under the log folders that the agents monitor, and then restart the master. With multiple master nodes, the cluster keeps working as long as the number of working masters is greater than half the total number of masters, so you only need to bring the failed masters back up.
Summary of issues:
1. When Flume collects data on the agent side, by default it creates a temporary directory under /tmp/flume-{user} to hold the log data the agent has intercepted. If these files grow large enough to fill the disk, the agent reports
Error closing logicalNode a2-18 sink: No space left on device, so when configuring the agent side you need to pay attention to the
<property>
  <name>flume.agent.logdir</name>
  <value>/data/tmp/flume-${user.name}/agent</value>
</property>
property, and make sure that during 7x24 operation the agent does not fill up the disk at the path given by flume.agent.logdir.
2. At startup Flume looks for hadoop-core-*.jar files, so the standard Hadoop core jar, which is named hadoop-*-core.jar, needs to be renamed to hadoop-core-*.jar.
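For example, with the Hadoop 0.20.2 used in this article, a sketch of the rename (the jar's location depends on your installation):
$ cd $HADOOP_HOME
$ cp hadoop-0.20.2-core.jar hadoop-core-0.20.2.jar   # keep the original jar and add a copy with the name Flume expects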
3. The Flume versions within a Flume cluster must be consistent, otherwise puzzling errors will occur.
4. The logs collected by the Flume cluster are sent to HDFS, where folders are created according to the event's time, which in the source code is Clock.unixTime(). So if you want the files to be organized by the time at which the log entry was generated, you need to rewrite the constructor of the com.cloudera.flume.core.EventImpl class,
public EventImpl(byte[] s, long timestamp, Priority pri, long nanoTime,
String host, Map<String, byte[]> fields),
so that it parses the contents of the array s, extracts the time, and assigns it to timestamp.
Note: the Flume framework itself constructs events whose body array s is empty, which are used to send things like simple validation messages, so when handling the timestamp you must watch out for the case where s is empty.
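A minimal, hypothetical sketch of the extraction logic: a helper that could be called from the EventImpl constructor, assuming the log line starts with a "yyyy-MM-dd HH:mm:ss" timestamp (adapt the parsing to your own log format):
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class EventTimestampHelper {
    // Returns the timestamp parsed from the event body, or the fallback value
    // (e.g. Clock.unixTime()) when the body is empty or cannot be parsed.
    public static long extractTimestamp(byte[] s, long fallback) {
        if (s == null || s.length == 0) {
            return fallback; // Flume's own empty-body validation events land here
        }
        try {
            // Assumption: the first 19 characters carry the log's own timestamp.
            String prefix = new String(s, 0, Math.min(19, s.length));
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(prefix).getTime();
        } catch (ParseException e) {
            return fallback;
        }
    }
}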
5. If the collector and the agent are not on the same network segment, intermittent disconnections will occur, preventing the agent from transmitting data to the collector. So when deploying, the agent and the collector are best placed on the same network segment.
6. If, when starting a master, you get an error like "Tried to start hostname, but hostname is not in the master list", you need to check whether the host address and hostname are configured correctly.
7. On the source side, the tail-type sources have a fairly large defect: they do not support resuming from a checkpoint. Since a restarted node does not record where it last stopped reading a file, there is no way to know where to start reading the next time.
This matters especially when the log files keep growing: if the Flume source node goes down, the log content added while it was down cannot be read by the source once it is opened again.
However, Flume has an execStream extension: you can write your own program that monitors the growth of the log and feeds the newly added content to a Flume node, which then sends it on to the sink node.
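For example, a hedged sketch (the exact execStream syntax should be checked against your Flume version; the script path is hypothetical, and the script itself is responsible for remembering its own read offset):
>exec config agentA 'execStream("/usr/local/bin/tail-with-checkpoint.sh /tmp/log/message")' 'agentBESink("192.168.0.11")'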