Copyright notice: This article is an original article by Yunshuxueyuan.
If you want to reprint it, please indicate the source: http://www.cnblogs.com/sxt-zkys/
QQ technology group: 299142667
The concept of Flume
1. Flume is a real-time log collection system developed by Cloudera that has been widely recognized and used in industry. Its initial releases are now collectively known as Flume OG (Original Generation) and belonged to Cloudera. As Flume's feature set grew, the shortcomings of Flume OG became apparent: the code base was bloated, the core components were poorly designed, and the core configuration was not standardized. In the final OG release, 0.94.0, unstable log transmission was a particularly serious problem. To solve these problems, on October 22, 2011 Cloudera completed Flume-728 and made a milestone change to Flume: the core components, core configuration, and code architecture were refactored. The refactored version is collectively known as Flume NG (Next Generation). Another reason for the change is that Flume was brought into Apache, and Cloudera Flume was renamed Apache Flume.
2. Features of Flume:
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transmitting large volumes of log data. It supports customizing various data senders in the logging system to collect data, and it also provides the ability to perform simple processing on the data and write it to various data recipients (such as text files, HDFS, HBase, etc.).
Flume data flows are always driven by events. An event is the basic unit of data in Flume; it carries the log data (as a byte array) along with header information. Events are generated by sources outside the agent; when a source captures an event, it formats it and then pushes it into one or more channels. You can think of a channel as a buffer that holds events until a sink has finished processing them. The sink is responsible for persisting the log or forwarding the event to another source.
3. Reliability of Flume
When a node fails, logs can still be transmitted to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the agent that receives the data first writes the event to disk and deletes it only after the data has been transferred successfully; if the transfer fails, it can be resent), store on failure (the policy also adopted by Scribe: when the data receiver crashes, the data is written locally and sending resumes after recovery), and best effort (data is sent to the receiver without any acknowledgement).
4. Recoverability of Flume
Recoverability also relies on the channel. FileChannel is recommended: events are persisted in the local file system (at the cost of lower performance).
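A minimal file channel configuration might look like the following sketch (the directory paths are examples, not from the original article):

# Sketch: file channel with persistent storage (paths are illustrative)
a1.channels.c1.type = file
# directory where checkpoints are kept
a1.channels.c1.checkpointDir = /var/flume/checkpoint
# comma-separated list of directories for the event log data
a1.channels.c1.dataDirs = /var/flume/data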
5. Some core concepts of Flume
Agent: a JVM process that runs Flume. Each machine runs one agent, but a single agent can contain multiple sources and sinks.
Client: produces the data; runs in a separate thread.
Source: collects data from the client and passes it to the channel.
Sink: collects data from the channel and delivers it to the destination; runs in a separate thread.
Channel: connects sources and sinks; it works somewhat like a queue.
Event: can be a log record, an Avro object, and so on.
The concept of an event
The concept of an event in Flume: the core of Flume is to collect data from a data source and deliver it to a specified destination (sink). To make sure delivery always succeeds, the data is cached in a channel before it is sent to the destination (sink); only when the data has actually arrived at the destination does Flume delete its cached copy.
During the entire transmission, what flows is the event; in other words, transactions are guaranteed at the event level. So what is an event? An event encapsulates the transmitted data and is the basic unit of data that Flume transfers; for a text file, an event is usually one line of the record. The event is also the basic unit of a transaction. From source to channel to sink, an event is itself a byte array and can carry headers (header information). An event represents the smallest complete unit of data, travelling from an external data source to an external destination.
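As a rough illustration (the exact output format may vary by version), this is approximately how a logger sink prints an event on the console, showing its (here empty) headers and its body as a byte array with a readable preview:

Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65    hello flume }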
To make this easier to understand, here is a diagram of the event data flow:
Flume Architecture
Flume is so powerful because of its design, centered on the agent. The agent itself is a Java process that runs on the log collection node (the so-called log collection node is simply a server node).
The agent contains three core components: source --> channel --> sink, a structure similar to a producer, a warehouse, and a consumer.
Source: the source component is designed to collect data and can handle log data of various types and formats, including avro, thrift, exec, JMS, spooling directory, netcat, sequence generator, syslog, HTTP, legacy, and custom sources.
Channel: after the source component collects the data, it is temporarily stored in the channel. In other words, the channel component is used inside the agent for temporary storage of data, a simple cache of the collected data, which can be kept in memory, JDBC, a file, and so on.
Sink: the sink component sends the data on to its destination, which can include HDFS, logger, avro, thrift, IPC, file, null, HBase, Solr, Kafka, and custom sinks.
Flume Source
Source Type:
Avro Source: supports the Avro protocol (actually Avro RPC); built-in support
Thrift Source: supports the Thrift protocol; built-in support
Exec Source: produces data from the standard output of a Unix command
JMS Source: reads data from a JMS system (messages, topics)
Spooling Directory Source: monitors data changes in a specified directory
Twitter 1% firehose Source: continuously downloads Twitter data through the API; experimental
Netcat Source: listens on a port and treats each line of text flowing through the port as an event
Sequence Generator Source: a sequence generator data source that produces sequential data
Syslog Source: reads syslog data and generates events; supports both UDP and TCP
HTTP Source: a data source based on HTTP POST or GET; supports JSON and BLOB representations (a brief sketch follows this list)
Legacy Source: compatible with old Flume OG sources (0.9.x versions)
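As a small illustration of one of these sources, here is a hedged sketch of an HTTP source; the port number and header values are arbitrary examples, and it assumes the default JSON handler:

# Sketch: HTTP source on an example port
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

It could then be exercised with something like:
curl -X POST -d '[{"headers":{"host":"node1"},"body":"hello flume"}]' http://localhost:5140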
Flume Channel
Channel Type:
Memory Channel: event data is stored in memory
JDBC Channel: event data is stored in a persistent store; currently the Flume JDBC channel has built-in support for Derby
File Channel: event data is stored in files on disk
Spillable Memory Channel: event data is stored both in memory and on disk; when the in-memory queue is full, events are persisted to disk files
Pseudo Transaction Channel: for testing only
Custom Channel: a custom channel implementation
Flume Sink
Sink Type:
HDFS Sink: writes data to HDFS
Logger Sink: writes data to the log
Avro Sink: converts data into Avro events and sends them to the configured RPC port
Thrift Sink: converts data into Thrift events and sends them to the configured RPC port
IRC Sink: replays data on IRC
File Roll Sink: stores data in the local file system
Null Sink: discards all data
HBase Sink: writes data to the HBase database
Morphline Solr Sink: sends data to a Solr search server (cluster)
ElasticSearch Sink: sends data to an Elasticsearch server (cluster)
Kite Dataset Sink: writes data to a Kite dataset; experimental
Custom Sink: a custom sink implementation
Flume operating mechanism
The core of Flume is the agent. The agent has two points of interaction with the outside world: one accepts data input (the source), and the other outputs data (the sink), which is responsible for sending data to the externally specified destination. After the source receives data, it passes the data to the channel; the channel acts as a buffer that temporarily holds the data, and the sink then sends the data in the channel to the specified place, for example HDFS. Note: the channel deletes the temporary data only after the sink has successfully sent it, which guarantees the reliability and safety of the data transmission.
Generalized usage of Flume
Another reason Flume is so powerful is that it supports multi-level Flume agents; that is, agents can be chained together. For example, a sink can write data to the source of the next agent, so several agents can be connected in a string and treated as a whole. Flume also supports fan-in and fan-out. Fan-in means a source can accept multiple inputs; fan-out means a sink can output data to multiple destinations.
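As an illustrative sketch (component names are examples, not from the article), one common way to fan out is to replicate events from a single source into multiple channels, each drained by its own sink:

# Sketch only: fan-out wiring
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# the default (replicating) selector copies every event to all listed channels
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# each sink drains its own channel and can deliver to a different destination
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2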
Flume Installation
1. Download the installation package and upload it to a node in the cluster:
2. Unzip to the specified directory
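For example (the version number and target directory are only illustrative):

tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /opt/
mv /opt/apache-flume-1.6.0-bin /opt/flume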
3. Modify conf/flume-env.sh:
Note: the JAVA_OPTS configuration should be increased if transferring large files causes an out-of-memory error.
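A possible flume-env.sh sketch (the JDK path and heap sizes are examples; adjust them to your environment):

# example values only
export JAVA_HOME=/usr/java/jdk1.8.0
export JAVA_OPTS="-Xms512m -Xmx1024m -Dcom.sun.management.jmxremote"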
4. Configure Environment variables
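For instance, the following could be appended to /etc/profile (the install path is only an example):

# example only: point FLUME_HOME at the actual install directory
export FLUME_HOME=/opt/flume
export PATH=$PATH:$FLUME_HOME/bin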
Refresh the profile: source /etc/profile
5. Verify that the installation is successful
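If the PATH is configured correctly, the following command should print the Flume version information:

flume-ng version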
Flume application cases
Case 1
http://flume.apache.org/FlumeUserGuide.html#a-simple-example
Configuration file simple.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume
flume-ng agent -n a1 -c conf -f simple.conf -Dflume.root.logger=INFO,console
Install Telnet
yum install telnet
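Once telnet is available, connect to the netcat source and type a line; it should show up in the Flume console (the port matches simple.conf):

telnet localhost 44444
hello flume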
Memory Channel configuration
capacity: the maximum number of events that can be stored in the channel; the default is 100
transactionCapacity: the maximum number of events that can be received from a source or sent to a sink in one transaction; the default is 100
keep-alive: the time allowed for adding an event to, or removing an event from, the channel
byte**: limits on the byte size of events held in the channel; only the event body is counted
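A combined sketch of these settings in an agent configuration (the values are examples only):

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# seconds to wait before an add/remove operation times out
a1.channels.c1.keep-alive = 3
# optional cap on the total number of event-body bytes held in the channel
a1.channels.c1.byteCapacity = 800000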
Case 2, two Flume agents working together as a cluster
NODE01 Server, configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node1
a1.sources.r1.port = 44444

# Describe the sink
# a1.sinks.k1.type = logger
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 60000

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
NODE02 server, install Flume (steps omitted)
Configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node2
a1.sources.r1.port = 60000

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start node02's Flume first
flume-ng agent -n a1 -c conf -f avro.conf -Dflume.root.logger=INFO,console
Then start node01's Flume:
flume-ng agent -n a1 -c conf -f simple.conf2 -Dflume.root.logger=INFO,console
Open telnet, send some test data, and check the console output on node02.
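For example (assuming the test is run against node1, where the netcat source is bound):

telnet node1 44444
hello from node1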
Case 3, Exec Source
http://flume.apache.org/FlumeUserGuide.html#exec-source
Configuration file
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/flume.exec.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume
flume-ng agent -n a1 -c conf -f exec.conf -Dflume.root.logger=INFO,console
Create an empty file for the demo: touch flume.exec.log
Add data in a loop:
for i in {1..50}; do echo "$i hi flume" >> flume.exec.log; sleep 0.1; done
Case 4, Spooling Directory Source
http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
Configuration file
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume
flume-ng agent -n a1 -c conf -f spool.conf -Dflume.root.logger=INFO,console
Copy a file into the spool directory for the demo:
mkdir logs
cp flume.exec.log logs/
Case 5, HDFS Sink
http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
Configuration file
############################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/logs
a1.sources.r1.fileHeader = true

# Describe the sink
# Compared with the previous spooldir example, only the sink block changes (it replaces a1.sinks.k1.type = logger)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://sxt/flume/%Y-%m-%d/%H%M
## Create a new file every 60 s or when the file size exceeds 10 M
# Roll after this many events have been written; 0 means do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0
# Roll after this many seconds; 0 means do not roll based on time
a1.sinks.k1.hdfs.rollInterval = 60
# Roll once the file reaches this size; 0 means do not roll based on file size
a1.sinks.k1.hdfs.rollSize = 10240
# If the currently open temporary file receives no data within this many seconds, it is closed and renamed to the target file
a1.sinks.k1.hdfs.idleTimeout = 3
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

## Generate a new directory every five minutes:
# Whether to round down the timestamp ("discarding" the remainder, explained below). If enabled, it affects all time-based escape sequences except %t
a1.sinks.k1.hdfs.round = true
# The value to round down to
a1.sinks.k1.hdfs.roundValue = 5
# The unit of the rounding value: second, minute, or hour
a1.sinks.k1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############################################################
Create an HDFs directory
hadoop fs -mkdir /flume
Start flume
flume-ng agent -n a1 -c conf -f hdfs.conf -Dflume.root.logger=INFO,console
View the files in HDFS
hadoop fs -ls /flume/...
hadoop fs -get /flume/...