Introduction:
This paper covers Flume background, the data flow model, common data flow operations, Flume agent startup, and a simple Flume agent example. The reference document is the Flume 1.8.0 User Guide on the official Flume website.
I. Background
Flume is a distributed log collection system created by Cloudera. It was donated to the Apache Software Foundation in 2009 and is currently a top-level Apache project.
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. Notably, because the transported data is customizable, Flume is not limited to log records: it can also carry network traffic data, social media data, email messages, and other data. Flume currently has two lines of versions: the 0.9.x releases are collectively known as Flume OG, and the 1.x releases as Flume NG. Flume NG underwent a major refactoring and differs significantly from Flume OG, so the two should be distinguished in use.
II. Data Flow Model
Figure 2.1 shows the data flow model in Flume. The Flume terminology is defined as follows:
1. event: the unit of data flow; its payload is stored as a byte array;
2. agent: a JVM process and the smallest constituent unit of Flume; it manages the Flume components, including sources, channels, and sinks, and may contain several of each;
3. source: a data-source component responsible for receiving data sent from an external data source;
4. channel: a buffer; when a source receives data, it stores the data in one or more channels;
5. sink: the component responsible for consuming data from a channel and sending it to external storage such as HDFS, or to the source of another Flume agent.
Figure 2.1 Flume data flow model
A typical data flow through a Flume agent is as follows:
1. One or more external data sources send data in the specified format to the Flume agent; the source deserializes the data and stores it in one or more channels;
2. When the sink needs data, it takes the data from its channel, and the consumed data is deleted from the channel;
3. The sink serializes the data in the specified format and sends it to the specified destination.
Important points (a configuration sketch illustrating point 1 follows this list):
1. A source can feed one or more channels, but a sink consumes from exactly one channel;
2. When a sink consumes data from its channel, that data is deleted from the channel;
3. Both sources and sinks must specify their type; different types use different serialization mechanisms.
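As a minimal sketch of the fan-out rule above (all agent and component names here are hypothetical), a source may list several channels, while each sink is bound to exactly one:

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# the source fans out to two channels (events are replicated to both by default)
a1.sources.r1.channels = c1 c2
# each sink reads from exactly one channel
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2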
III. Common Data Flow Operations
The following common data flow operations are described in the Flume 1.8.0 User Guide:
1. Multi-agent flow: Flume agents can be cascaded together to form an agent chain (a configuration sketch follows Figure 3.1).
Figure 3.1 Multi-Agent Flow
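In a cascade, the upstream agent's Avro sink points at the downstream agent's Avro source. A minimal sketch, with hypothetical agent names, host, and port:

# upstream agent "foo": Avro sink pointing at the downstream agent
foo.sinks.k1.type = avro
foo.sinks.k1.hostname = downstream-host
foo.sinks.k1.port = 4545
# downstream agent "bar": matching Avro source listening on the same port
bar.sources.r1.type = avro
bar.sources.r1.bind = 0.0.0.0
bar.sources.r1.port = 4545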
2. Consolidation: with consolidation, Flume can collect data produced by many data sources into a single agent (a configuration sketch follows Figure 3.2).
Figure 3.2 Consolidation
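Consolidation uses the same Avro sink/source pairing: every upstream agent points its Avro sink at the one collector agent's Avro source. A sketch with hypothetical names:

# each upstream agent points its Avro sink at the same collector
web1.sinks.k1.type = avro
web1.sinks.k1.hostname = collector-host
web1.sinks.k1.port = 4545
web2.sinks.k1.type = avro
web2.sinks.k1.hostname = collector-host
web2.sinks.k1.port = 4545
# the collector exposes a single Avro source fed by all upstream agents
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545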
3. Multiplexing the flow: with multiplexing, Flume can send an event stream to one or more destinations (a configuration sketch follows Figure 3.3).
Figure 3.3 Multiplexing the flow
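Multiplexing is configured with a channel selector on the source, which routes each event to a channel based on a header value. A sketch along the lines of the user guide's example (the header name and mapping values here are illustrative):

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
# route events by the value of the "state" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# events with no matching header value go to the default channel
a1.sources.r1.selector.default = c3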
IV. Starting a Flume Agent
As mentioned above, a Flume agent can contain multiple sources, channels, and sinks. According to the Flume 1.8.0 User Guide, a Flume agent can be started with the following command:
$FLUME_HOME/bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
where the -n parameter specifies the agent name, the -c parameter specifies the conf directory, and the -f parameter specifies the configuration file. Note that the name passed with -n must match the property prefix used in the configuration file, as sketched below.
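For example (the file and agent name here are hypothetical), if the properties file defines its components under the prefix agent1, the agent must be started with -n agent1, otherwise no components are loaded:

# conf/myflume.properties defines components under the prefix "agent1"
agent1.sources = s1
agent1.channels = c1
agent1.sinks = k1
# so the matching startup command is:
# $FLUME_HOME/bin/flume-ng agent -n agent1 -c conf -f conf/myflume.properties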
The startup process for a flume agent is as follows:
1. Prepare the configuration file according to the requirements;
2. Use the above command to specify the configuration file and start the Flume agent.
V. A Simple Flume Agent Example
The Flume website provides a simple example of a Flume agent. The agent is named a1, and its configuration file is as follows:
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The above configuration file defines a Flume agent named a1 with a source named r1, a channel named c1, and a sink named k1; the data flows r1 --> c1 --> k1. The detailed properties of each component are listed in the following table:
Component | Name | Type   | Bind      | Port  | Channel | Capacity | TransactionCapacity
Source    | r1   | netcat | localhost | 44444 | c1      | -        | -
Channel   | c1   | memory | -         | -     | -       | 1000     | 100
Sink      | k1   | logger | -         | -     | c1      | -        | -
("-" marks attributes that do not apply to the component.)
In the Flume installation directory, execute the following command to start the Flume agent (note: example.conf is located in the Flume installation directory):
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
Command parameter explanation:
1. --conf conf specifies that the configuration directory is the conf directory under the current directory (note: this directory must contain the flume-env.sh file and the log4j configuration file, otherwise the run fails);
2. --conf-file example.conf specifies that the configuration file is the example.conf file in the current directory;
3. --name a1 specifies that the agent name is a1;
4. -Dflume.root.logger=INFO,console specifies that the log output level is INFO and that the output destination is the console.
After startup, the agent prints its startup information to the console. When data is sent to localhost:44444 over a TCP connection, the Flume agent receives the data and prints it to the console:
1. Establish a TCP connection via telnet and send some data;
2. The agent receives the data and prints it on the console (a sample session is sketched below).
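For illustration, a session of the kind shown in the user guide follows; exact prompts, timestamps, and log formatting will vary with your environment:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Hello world! <ENTER>
OK

On the agent's console, the logger sink then prints the received event along these lines:

12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D  Hello world!. }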