Flume is a log collection system originally developed by Cloudera. It is distributed, highly reliable, and highly available. Flume supports customizing all kinds of data senders in the log system to collect data, and it provides the ability to process data simply and write it to a variety of data receivers. Its design principle is based on data flow: for example, log data from many web servers is gathered together and stored in centralized storage such as HDFS or HBase.
Flume Features: Reliability, scalability, manageability
Next, let's look at the benefits of the Flume NG architecture:
1. NG makes large-scale adjustments to the core components.
2. It greatly reduces the requirements on users; for example, users no longer need to build a ZooKeeper cluster.
3. It makes it easier to integrate Flume with other technologies and with Hadoop peripheral components.
4. It is more powerful and more scalable in function.
The core of Flume is collecting data from a data source and sending it to a destination. To ensure delivery succeeds, the data is cached before being sent to the destination, and the cached data is deleted only after the data has actually arrived at the destination.
The basic unit of data transmitted by Flume is the event; for a text file this is usually one line of a record, and it is also the basic unit of a transaction. An event travels from source to channel to sink; it is itself a byte array and can carry header information. An event represents the smallest complete unit of a data stream, from an external data source to an external destination.
The core of a running Flume deployment is the agent. An agent is a complete data collection tool that contains three core components: source, channel, and sink. These components allow events to flow from one place to another.
A source receives data sent from an external source, and different source types accept different data formats. For example, the spooling directory source monitors a specified folder for new files; as soon as files appear in the directory, it reads their contents.
A channel is a storage area that receives the output of the source and holds it until a sink consumes the data. Data in the channel is not deleted until it has entered the next channel or the final destination. This makes the flow reliable: if a sink write fails, it can be retried automatically without losing data.
A sink consumes the data in the channel and then sends it to an external destination or to the source of another agent. For example, the data can be written to HDFS or HBase.
Flume allows multiple agents to be chained one after another, with one agent's sink feeding the next agent's source, to form a multi-hop flow; a minimal sketch is shown below.
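A rough sketch of such a two-hop setup using Avro for agent-to-agent transport (the agent names a1 and a2, the host name collector-host, and port 4545 are illustrative assumptions, not taken from this article's example):
#agent a1: its avro sink forwards events to the next hop
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4545
#agent a2 (running on collector-host): its avro source receives them
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545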
Flume Architecture Core Components:
Source: a source is responsible for receiving events, or generating events through a special mechanism, and placing them in batches into one or more channels; a source must be associated with at least one channel.
Different types of sources: sources integrated with the system (Syslog source, Netcat source), sources that read files directly (Exec source, Spooling Directory source), and IPC sources used for communication between agents (Avro source, Thrift source).
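As an illustration (a hedged sketch; the agent name a1, port 44444, and the tailed file path are assumptions, not part of this article's example), a netcat source and an exec source could be declared like this:
#netcat source listening on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
#exec source that tails a log file
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /var/log/messages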
Channel: a channel sits between the source and the sink and is used to cache incoming events. An event is removed from the channel only after the sink has successfully sent it to the next-hop channel or the final destination.
Several channel types: the memory channel achieves high throughput but does not guarantee data integrity, while the file channel (disk-based) guarantees data integrity and consistency. When configuring a file channel, it is recommended to place its directories and the program's log files on different disks to improve efficiency.
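A minimal sketch of the two channel types (the agent and channel names, capacity numbers, and directories are illustrative assumptions; note the file channel directories are placed on a separate data disk, as recommended above):
#memory channel: high throughput, but events are lost if the agent crashes
a1.channels.m1.type = memory
a1.channels.m1.capacity = 10000
a1.channels.m1.transactionCapacity = 1000
#file channel: events are persisted to disk, so integrity survives restarts
a1.channels.f1.type = file
a1.channels.f1.checkpointDir = /data1/flume/checkpoint
a1.channels.f1.dataDirs = /data2/flume/data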
Sink: a sink is responsible for transmitting events to the next hop or the final destination. A sink can store data to the file system, a database, or Hadoop. When the volume of log data is small, the data can be stored in the file system, saved at a certain time interval; when there is more log data, it can be stored in Hadoop for later data analysis. A sink must act on exactly one channel.
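For the low-volume case, a file_roll sink writes events to the local file system and rolls the output file at a fixed interval; a rough sketch follows (the directory, interval, and names are illustrative assumptions), while the HDFS case is shown in the full example later in this article:
a1.sinks.s1.type = file_roll
a1.sinks.s1.sink.directory = /var/log/flume-out
#start a new output file every 60 seconds
a1.sinks.s1.sink.rollInterval = 60
a1.sinks.s1.channel = c1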
Download Source Bundle: http://mirror.bit.edu.cn/apache/flume/1.6.0/
1. Install the package:
[lan@master ~]$ tar -xvf apache-flume-1.6.0-bin.tar.gz
[lan@master ~]$ tar -xvf apache-flume-1.6.0-src.tar.gz
2. Merge the source code into the installation directory apache-flume-1.6.0-bin.
To configure environment variables:
[lan@master ~]$ vim ~/.bash_profile
export FLUME_HOME=/home/lan/apache-flume-1.6.0-bin
export PATH=$PATH:$FLUME_HOME/bin
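To make the new variables take effect in the current shell, reload the profile and check the value (a quick sanity check added here, not part of the original steps):
[lan@master ~]$ source ~/.bash_profile
[lan@master ~]$ echo $FLUME_HOME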
To test whether Flume NG was installed successfully:
flume-ng version
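If the installation succeeded, the command prints the Flume version; the first line of the output should look like:
Flume 1.6.0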
3. Create a new configuration file example.conf for the Flume agent agent1:
[lan@master ~]$ cd apache-flume-1.6.0-bin/conf/
[lan@master conf]$ vim example.conf
#agent1
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = c1
#source1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /home/lan/agent1log
agent1.sources.source1.channels = c1
agent1.sources.source1.fileHeader = false
#sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://master:9000/agentlog
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollInterval = 4
agent1.sinks.sink1.channel = c1
#channel1
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /home/lan/agent1_tmp1
agent1.channels.c1.dataDirs = /home/lan/agent1_tmpdata
#agent1.channels.c1.capacity = 10000
#agent1.channels.c1.transactionCapacity = 1000
Create the monitored directory agent1log, then start the agent:
[lan@master ~]$ mkdir agent1log
[lan@master ~]$ cd apache-flume-1.6.0-bin
[lan@master apache-flume-1.6.0-bin]$ flume-ng agent -n agent1 -c conf -f /home/lan/apache-flume-1.6.0-bin/conf/example.conf -Dflume.root.logger=DEBUG,console
Open another terminal and create a new file test2.txt under the monitored directory:
cd ~/agent1log
vim test2.txt
Looking at the output of sink1, we find a file at the target HDFS path whose name starts with FlumeData and ends with a timestamp suffix, which shows that Flume monitors changes in the target directory and collects the changed content to the sink output in real time.
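To confirm this from the command line (assuming the Hadoop client is configured on this node; FlumeData is the HDFS sink's default file prefix), list the target directory and print the collected file:
[lan@master ~]$ hdfs dfs -ls /agentlog
[lan@master ~]$ hdfs dfs -cat /agentlog/FlumeData.*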
Hadoop-flume Log Collection System