I. Introduction to Flume
Flume is a distributed, reliable, and highly available system for aggregating massive amounts of log data. It supports customizable data senders for collecting data, can perform simple processing on the data, and writes it to various (customizable) data receivers.
Design goals:
(1) Reliability
When a node fails, logs can be transferred to other nodes without being lost. Flume provides three levels of reliability guarantees, from strongest to weakest: end-to-end (after the agent receives the data, the event is first written to disk; it is deleted once the data has been transmitted successfully, and resent if transmission fails), Store on Failure (if the receiver crashes, the data is kept on the local disk and resent after the receiver recovers), and Best Effort (the data is sent to the receiver without any confirmation).
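These three levels correspond to the agent sink types described later in this article (agentE2ESink, agentDFOSink, agentBESink). As a minimal sketch in the flume shell, assuming a collector running at 192.168.0.11:35853 and a tail source, choosing a level might look like:
> exec config agent1 'tail("/tmp/log/message")' 'agentE2ESink("192.168.0.11",35853)'
> exec config agent1 'tail("/tmp/log/message")' 'agentDFOSink("192.168.0.11",35853)'
> exec config agent1 'tail("/tmp/log/message")' 'agentBESink("192.168.0.11",35853)'
The first configuration is end-to-end, the second store-on-failure, the third best-effort.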
(2) Scalability
Flume uses a three-tier architecture: agent, collector, and storage. Each tier can be scaled out horizontally. All agents and collectors are centrally managed by the master, which makes the system easy to monitor and maintain, and multiple masters are allowed (managed and load-balanced with ZooKeeper), which avoids a single point of failure.
(3) Manageability
All agents and collectors are centrally managed by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep the dynamic configuration data consistent. On the master you can view the status of each data source or data flow, and configure and dynamically reload each data source. Flume provides both a web interface and shell script commands for managing data flows.
(4) Functional extensibility
You can add your own agents, collectors, or storage as needed. In addition, Flume ships with many components, including various agents (such as file and syslog), collectors, and storage (such as file and HDFS).
II. Flume Architecture
Logical architecture of flume:
As mentioned above, Flume uses a layered architecture of agent, collector, and storage. Both the agent and the collector consist of two parts, a source and a sink: the source is where the data comes from, and the sink is where the data goes.
Flume uses two roles, master and node. Based on the dynamic configuration set through the master's shell or web interface, a node determines whether it acts as an agent or as a collector.
(1) Agent
The agent sends data from the data source to the collector.
Flume comes with many ready-to-use sources, such as:
- text("filename"): uses the file filename as the data source and sends it line by line.
- tail("filename"): detects new data appended to filename and sends it line by line.
- fsyslogTcp(5140): listens on TCP port 5140 and forwards the data it receives.
- tailDir("dirname"[, fileregex=".*"[, startFromEnd=false[, recurseDepth=0]]]): monitors the files in a directory (not the directories themselves), selecting the files to monitor with a regular expression on the file name; recurseDepth is the depth of subdirectories to monitor recursively.
For more sources, see this write-up: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050465.html
Flume likewise comes with many sinks (a sketch combining a source and a sink follows this list), such as:
- Console [("format")]: displays data directly on consolr.
- Text ("txtfile"): writes data to the txtfile file.
- DFS ("dfsfile"): writes data to the dfsfile file on HDFS.
- Syslogtcp ("host", Port): transmits data to the host node over TCP
- Agentsink [("machine" [, port])]: equivalent to agente2esink. If this parameter is omitted, flume is used by default. collector. event. host and flume. collector. event. port as the default collecotr
- Agentdfosink [("machine" [, port])]: The local Hot Standby agent. After the agent detects a fault on the collector node, it constantly checks the alive status of collector to send the event again, the data generated here will be cached to the local disk.
- Agentbesink [("machine" [, port])]: the agent that is not responsible. If collector fails, it will not process it, and the data it sends will be discarded directly.
- Agente2echain: specify multiple collector to improve availability. When an event fails to be sent to the master collector, it is sent to the second collector. When all the collector fail, it will be persistently sent again.
For more sinks, see this write-up: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050472.html
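As a quick illustrative sketch (node name, file path, and HDFS path are only placeholders), a node's data flow is defined in the flume shell as a source followed by a sink, so any source above can be paired with any sink:
> exec config node1 'tail("/tmp/log/message")' 'console'
> exec config node1 'text("/tmp/log/message")' 'dfs("hdfs://namenode/flume/message")'
The first flow echoes newly appended lines to the console; the second copies the whole file to HDFS.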
(2) Collector
The collector aggregates the data of multiple agents and loads it into storage.
Its source and sink are similar to those of the agent.
Sources, such as:
- collectorSource[(port)]: collector source; listens on a port and aggregates data.
- autoCollectorSource: aggregates data automatically, with the master coordinating the physical nodes.
- logicalSource: logical source; the master allocates the port and it listens for rpcSink.
Sinks, for example (see the sketch after this list):
- Collectorsink ("fsdir", "fsfileprefix", rollmillis): collectorsink. Data is aggregated by collector and sent to HDFS. fsdir is the HDFS directory, and fsfileprefix is the file prefix.
- Customdfs ("hdfspath" [, "format"]): custom format DFS
(3) Storage
Storage is the storage layer; it can be an ordinary file, HDFS, Hive, HBase, or another distributed storage system.
(4) Master
The master is the controller of the flume cluster; it manages and coordinates the configuration of the agents and collectors.
In flume, the most important abstraction is the data flow, which describes a path along which data is generated, transmitted, processed, and finally written to its destination.
- For an agent, the data flow configuration specifies where the data is obtained from and which collector it is sent to.
- The collector receives data from the agents and sends it on to the specified target machine.
Note: The flume framework depends on Hadoop and ZooKeeper only at the jar level; the Hadoop and ZooKeeper services do not need to be running when flume is started.
III. Flume distributed environment deployment
1. Experiment environment
- Operating system version: RedHat 5.6
- Hadoop version: 0.20.2
- JDK version: jdk1.6.0_26
- Flume version: flume-distribution-0.9.4-bin
To deploy flume on a cluster, follow these steps:
- Install flume on each machine in the Cluster
- Select one or more nodes as the master node
- Modify static configuration files
- Start a master on at least one machine, and start flume node on all nodes
- Dynamic Configuration
Deploy flume on each machine in the cluster.
Note: The network environment of the flume cluster must be stable and reliable; otherwise inexplicable errors may occur (for example, the agent being unable to send data to the collector).
1. Install the flume Environment
$ wget <flume-distribution-0.9.4-bin.tar.gz download URL>
$ tar -xzvf flume-distribution-0.9.4-bin.tar.gz
$ cp -rf flume-distribution-0.9.4-bin /usr/local/flume
$ vi /etc/profile        # add the environment configuration
    export FLUME_HOME=/usr/local/flume
    export PATH=.:$PATH:$FLUME_HOME/bin
$ source /etc/profile
$ flume                  # verify the installation
2. Select one or more nodes as the master node.
For the master, you can define a single master for the cluster, or select multiple nodes as masters to improve availability.
- Single-master mode: easy to manage, but weak in fault tolerance and scalability
- Multi-master mode: usually runs 3 or 5 masters, which provides good fault tolerance
How to choose the number of flume masters:
The distributed masters can keep working normally, without the cluster breaking down, as long as the number of masters that are up exceeds half of the total number of master nodes. For example, with 3 masters the cluster tolerates the loss of 1, and with 5 it tolerates the loss of 2.
Flume master has two main functions:
- Tracks the configuration of each node and notifies the node of configuration changes;
- Tracks the acknowledgements from the end of each flow in reliable (E2E) mode, so that the flow's source knows when to stop retransmitting an event.
3. Modify the static configuration file
Site-specific settings for flume nodes and masters are configured through conf/flume-site.xml on each cluster node; if this file does not exist, the default properties from conf/flume-conf.xml are used. In the following example, the master name is set on a flume node so that the node can find the flume master named "master" on its own.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>flume.master.servers</name> <value>master</value> </property> </configuration>
In the case of multiple masters, the following configurations are required:
<property>
  <name>flume.master.servers</name>
  <value>hadoopmaster.com,hadoopedge.com,datanode4.com</value>
  <description>A comma-separated list of hostnames, one for each machine in the Flume Master.</description>
</property>
<property>
  <name>flume.master.store</name>
  <value>zookeeper</value>
  <description>How the Flume Master stores node configurations. Must be either 'zookeeper' or 'memory'.</description>
</property>
<property>
  <name>flume.master.serverid</name>
  <value>2</value>
  <description>The unique identifier for a machine in a Flume Master ensemble. Must be different on every master instance.</description>
</property>
Note: The flume.master.serverid property is set on the master nodes; flume.master.serverid must be different on every master node in the cluster, and the values start from 0.
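For example, with the three masters listed above, each machine's own configuration file would carry a different flume.master.serverid (the assignment below is only illustrative):
<!-- on hadoopmaster.com -->
<property>
  <name>flume.master.serverid</name>
  <value>0</value>
</property>
<!-- on hadoopedge.com -->
<property>
  <name>flume.master.serverid</name>
  <value>1</value>
</property>
<!-- on datanode4.com -->
<property>
  <name>flume.master.serverid</name>
  <value>2</value>
</property>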
When a node plays the agent role, you can set the default collector host by adding the following configuration to the flume-conf.xml:
<property>
  <name>flume.collector.event.host</name>
  <value>collector</value>
  <description>This is the host name of the default "remote" collector.</description>
</property>
<property>
  <name>flume.collector.port</name>
  <value>35853</value>
  <description>This default tcp port that the collector listens to in order to receive events it is collecting.</description>
</property>
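With these defaults in place, an agent can use the bare agentSink described earlier and flume falls back to the configured default collector host and port; a hedged sketch (node and file names are placeholders):
> exec config agentA 'tail("/tmp/log/message")' 'agentSink'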
For more information, see http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050443.html.
4. Start the Cluster
Start the nodes on the cluster:
- On the command line, run flume master to start a master node.
- On the command line, run flume node -n nodename to start the other nodes. It is best to name the nodes according to the logic of the cluster so that the master configuration stays clear.
The naming rules are up to you; choose names that are easy to remember and convenient for dynamic configuration (dynamic configuration is introduced later, and a start-up sketch follows).
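A start-up sketch, assuming the node names used in the walkthrough below (run each command on the corresponding machine):
# on the master machine
$ flume master
# on each worker machine, with its own node name
$ flume node -n agentA
$ flume node -n collector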
5. Dynamic configuration via the flume shell
For the commands available in the flume shell, see: http://www.cnblogs.com/zhangmiao-chp/archive/2011/05/18/2050461.html
Assume that the current flume cluster is structured as follows:
We want to collect the system logs of the machines A-F into HDFS. How do we configure this in the flume shell to achieve that goal?
1. Set up the logical nodes
$ flume shell
> connect localhost
> help
> exec map 192.168.0.1 agentA
> exec map 192.168.0.2 agentB
> exec map 192.168.0.3 agentC
> exec map 192.168.0.4 agentD
> exec map 192.168.0.5 agentE
> exec map 192.168.0.6 agentF
> getnodestatus
        192.168.0.1 --> IDLE
        192.168.0.2 --> IDLE
        192.168.0.3 --> IDLE
        192.168.0.4 --> IDLE
        192.168.0.5 --> IDLE
        192.168.0.6 --> IDLE
        agentA --> IDLE
        agentB --> IDLE
        agentC --> IDLE
        agentD --> IDLE
        agentE --> IDLE
        agentF --> IDLE
> exec map 192.168.0.11 collector
You can also view this on the master's web interface.
2. Start the collector's listening port
> exec config collector 'collectorSource(35853)' 'collectorSink("","")'
# The collector node now listens for data on port 35853; this is very important.
Log on to the collector server and check the port:
$netstat -nalp|grep 35853
If the configuration above has not been performed on the master, the listening port will not show up on the collector.
3. Set the Source and Sink of each node
> exec config collector 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/","syslog")'
> exec config agentA 'tail("/tmp/log/message")' 'agentBESink("192.168.0.11")'
# From experimentation, a logical node seems to support at most one source and one sink.
>...
> exec config agentF 'tail("/tmp/log/message")' 'agentBESink("192.168.0.11")'
At this point the configuration can be viewed clearly on the master's web interface, and our initial goal has been achieved.
The dynamic configuration performed above through the flume shell can also be done on the flume master web interface; this is not described further here.
4. Advanced dynamic configuration
Advanced configuration adds the following features on top of the simple configuration above so that the system runs more robustly:
- Multi-master (High Availability of Master nodes)
- Collector chain (High Availability of collector)
Multi-master has been introduced above, including how it is used and how many masters to run. Now let's take a brief look at the collector chain, which is actually quite simple: in the dynamic configuration, you use an agent*Chain sink to specify multiple collectors to ensure the availability of log transmission. Here is a logical diagram of flume in a typical production environment:
Here agentA and agentB point to collectorA. If collectorA crashes, the agents act according to their configured reliability level. For the sake of transmission efficiency we may not choose E2E (and even then, local log accumulation on the agent would still be a problem), so the usual approach is to configure multiple collectors to form a collector chain.
> exec config agentC 'tail("/tmp/log/message")' 'agentE2EChain("collectorB:35853","collectorA:35853")'
> exec config agentD 'tail("/tmp/log/message")' 'agentE2EChain("collectorB:35853","collectorC:35853")'
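For the chains above to work, each collector named in a chain must itself be mapped to its physical host and configured with a collectorSource on the port the agents point at; a hedged sketch following the same pattern as the earlier collector configuration:
> exec config collectorA 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/","syslog")'
> exec config collectorB 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/","syslog")'
> exec config collectorC 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/","syslog")'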
With this configuration, collectorB has the following problem:
V. Problems and summary
The nodes above fall into the following categories: master, agent, collector, and storage. For each type of node, let's examine whether high availability is possible and where performance bottlenecks may arise.
First, a storage-layer failure and a collector-layer failure amount to the same thing: as soon as the data cannot be written to its final destination, the node is considered failed. We will choose an appropriate transmission mode according to how reliable the collected data needs to be, and the way the collector receives data is governed by that configuration. The collector's performance affects the data throughput of the whole flume cluster, so it is best to deploy the collector on its own machine; separate high availability for the collector is generally not considered.
Next, agent-layer failure. The data security level of flume is configured mainly on the agent, which offers three levels for sending data to the collector: E2E, DFO, and BE. Here is a summary from an experienced user:
The agent node monitors all the files in its log folder, and each agent can monitor at most 1024 files. Every monitored file has something like a cursor that records how far the file has been read; whenever new records are appended to the file, the cursor reads the incremental records, and they are sent to the collector at the security level configured on the agent, which can be E2E or DFO.
In the E2E case, the agent node first writes the events to a file in the agent node's folder and then sends them to the collector. If the data finally arrives safely at the storage layer, the agent deletes the file it wrote earlier; if no success acknowledgement is received, the file is kept. If an error occurs on the agent node, all of this bookkeeping disappears: when the node is restarted, the agent treats every file in the monitored log folder as not yet processed, and since no read-position markers remain, the files are read again from the beginning, so logs end up duplicated. The recovery procedure is: remove the log files that have already been sent from the log folder monitored by the agent node, then restart the agent after the fault has been dealt with. Note: when an agent node fails, remove the data files from before the failure time, clear the folder configured by flume.agent.logdir, and then restart the agent.
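A hedged sketch of that recovery procedure, assuming the flume.agent.logdir value discussed in the summary below; all paths and file names here are purely illustrative:
# move the log files that have already been delivered out of the monitored folder
$ mv /tmp/log/old-messages.log /backup/
# clear the folder configured by flume.agent.logdir
$ rm -rf /data/tmp/flume-${USER}/agent/*
# restart the agent
$ flume node -n agentA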
Finally, master failure. If the master goes down, the whole cluster stops working; to recover, remove all the files from the log folders monitored by the agents and restart the cluster, including the master. With multiple masters, the cluster keeps working as long as the number of masters that are up exceeds half of the total number of master nodes, so only the masters that went down need to be recovered.
Summary:
1. When flume collects data on an agent, by default it creates a temporary directory under /tmp/flume-{user} to store the log files the agent has intercepted. If these files grow too large and fill the disk, the agent reports the error "closing logicalnode a2-18 sink: No space left on device". Therefore, when configuring the agent, pay attention to the property
<property>
  <name>flume.agent.logdir</name>
  <value>/data/tmp/flume-${user.name}/agent</value>
</property>
and make sure that, while flume is running, the agent cannot fill the disk holding the flume.agent.logdir path.
2. At startup flume looks for a file named hadoop-core-*.jar, so you need to rename the standard Hadoop core jar from hadoop-*-core.jar to hadoop-core-*.jar.
3. The flume version must be consistent across the whole flume cluster; otherwise inexplicable errors may occur.
4. The time that the flume cluster uses when creating folders on HDFS for the collected logs is the event timestamp, which in the source code comes from Clock.unixTime(). So if you want the output files to be organized by the time the log entries were generated, you need to rewrite the constructor of the com.cloudera.flume.core.EventImpl class, public EventImpl(byte[] s, long timestamp, Priority pri, long nanoTime, String host, Map<String, byte[]> fields), so that it parses the time out of the contents of the array s and assigns it to timestamp.
Note: The flume framework sends events whose s content is an empty array for things like simple validation, so be careful about how the timestamp is handled when s is empty.
5. If the collector and the agent are not on the same network segment, transient disconnections may occur, and during a disconnection the agent cannot transmit data to the collector. It is therefore best to deploy the agent and the collector on the same network segment.
6. If the error "try to start hostname but hostname is not in the master list" occurs when starting the master, check whether the host address and hostname of the master node are configured correctly.
7. There is a major defect on the source side: the tail-type sources do not support resuming from where they left off. Because the position reached in the file is not recorded across a node restart, there is no way to know where to start reading next time; this matters especially when log files are continuously being appended to. While the flume source node is down, whatever is appended to the log cannot be read by the source once it comes back up. However, flume has an execStream extension: you can write a tool that monitors the log for additions and sends the added content to the flume node, which then passes it on to the sink node.
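A hedged sketch of that workaround (the execStream syntax should be verified against your flume version, and follow_log.sh is a hypothetical self-written tool that records its own read position):
> exec config agentA 'execStream("/usr/local/bin/follow_log.sh /tmp/log/message")' 'agentBESink("192.168.0.11")'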
The scribe solution was introduced in earlier articles. The most intuitive comparison, for me, is:
- Scribe: complicated installation, simple configuration
- Flume: simple installation, complicated dynamic configuration
The following figure, taken from Dong's blog post, compares the two:
Flume log collection