Flume 1.7.0 User Guide
- Introduction
- Overview
- System Requirements
- Architecture
- Data flow model
- Complex flows
- Reliability
- Recoverability
- Setup
- Setting up an agent
- Configuring individual components
- Wiring the pieces together
- Starting an agent
- A simple example
- Logging raw data
- Zookeeper based Configuration
- Installing third-party plugins
- The plugins.d directory
- Directory layout for plugins
- Data ingestion
- RPC
- Executing commands
- Network streams
- Setting multi-agent flow
- Consolidation
- Multiplexing the flow
- Configuration
- Defining the flow
- Configuring individual components
- Adding multiple flows in an agent
- Configuring a multi agent flow
- Fan out flow
- Flume Sources
- Avro Source
- Thrift Source
- Exec Source
- JMS Source
- Spooling Directory Source
- Event deserializers
- LINE
- AVRO
- BlobDeserializer
- Taildir Source
- Twitter 1% firehose Source (experimental)
- Kafka Source
- NetCat Source
- Sequence Generator Source
- Syslog Sources
- Syslog TCP Source
- Multiport Syslog TCP Source
- Syslog UDP Source
- HTTP Source
- Stress Source
- Legacy Sources
- Avro Legacy Source
- Thrift Legacy Source
- Custom Source
- Scribe Source
- Flume Sinks
- HDFS Sink
- Hive Sink
- Logger Sink
- Avro Sink
- Thrift Sink
- IRC Sink
- File Roll Sink
- Null Sink
- HBaseSinks
- MorphlineSolrSink
- ElasticSearchSink
- Kite Dataset Sink
- Kafka Sink
- Custom Sink
- Flume Channels
- Memory Channel
- JDBC Channel
- Kafka Channel
- File Channel
- Spillable Memory Channel
- Pseudo Transaction Channel
- Custom Channel
- Flume Channel Selectors
- Replicating Channel Selector (default)
- Multiplexing Channel Selector
- Custom Channel Selector
- Flume Sink Processors
- Default Sink Processor
- Failover Sink Processor
- Load balancing Sink Processor
- Custom Sink Processor
Introduction

Overview
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, e-mail messages and pretty much any data source possible.
Apache Flume is a top level project at the Apache Software Foundation.
There are currently two release code lines available, versions 0.9.x and 1.x.
Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.
This documentation applies to the 1.4.x track.
New and existing users are encouraged to use the 1.x releases so as to leverage the performance improvements and configuration flexibilities available in the latest architecture.
System Requirements
- Java Runtime Environment - Java 1.7 or later
- Memory - Sufficient memory for configurations used by sources, channels or sinks
- Disk Space - Sufficient disk space for configurations used by channels or sinks
- Directory Permissions - Read/Write permissions for directories used by agent
Architecture

Data flow model
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or from other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume source to receive events from a Thrift sink, a Flume Thrift RPC client, or Thrift clients written in any language generated from the Flume Thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it is consumed by a Flume sink. The file channel is one example; it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via the Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.
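To make the agent-to-agent hop concrete, here is a minimal configuration sketch of the Avro sink-to-source pattern described above. It is not from the original guide; the agent names "foo" and "bar", the component names, the host and the port are all hypothetical.

# Agent "foo" forwards events via its Avro sink to the next hop (assumed host/port)
foo.sinks.avroSink.type = avro
foo.sinks.avroSink.hostname = bar-host.example.com
foo.sinks.avroSink.port = 4545

# Agent "bar" on the next hop receives them with an Avro source
bar.sources.avroSrc.type = avro
bar.sources.avroSrc.bind = 0.0.0.0
bar.sources.avroSrc.port = 4545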
Complex flows
Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing, and backup routes (fail-over) for failed hops.
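As an illustration of fan-out, the sketch below (component names are hypothetical, not from the original text) has a single source replicate each event into two channels, each drained by its own sink. Replication is Flume's default channel-selector behavior.

# One source fans out to two channels; each channel feeds one sink
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2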
Reliability
The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.
Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate in a transaction the storage and retrieval, respectively, of the events placed in or provided by a transaction provided by the channel. This ensures that the set of events is reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.
Recoverability
The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There is also a memory channel which simply stores the events in an in-memory queue, which is faster, but any events still left in the memory channel when an agent process dies can't be recovered.
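For reference, a durable file channel might be configured as in the sketch below. The directory paths and component names are hypothetical; checkpointDir and dataDirs are where the channel persists its state on the local filesystem so it survives an agent restart.

# A file channel persists staged events to local disk (paths are examples)
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data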
Setup

Setting up an agent
Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.
Configuring individual components
Each component (source, sink or channel) in the flow has a name, a type, and a set of properties that are specific to the type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have a max queue size ("capacity"), and an HDFS sink needs to know the file system URI, path to create files, frequency of file rotation ("hdfs.rollInterval") etc. All such attributes of a component need to be set in the properties file of the hosting Flume agent, as sketched below.
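The following hedged sketch (component names and values are hypothetical) shows the kind of per-component properties just described: an Avro source's bind address and port, a memory channel's capacity, and an HDFS sink's target path and roll interval.

# Avro source: where to listen for incoming events
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Memory channel: max number of events held in the queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# HDFS sink: target path and how often to roll files (seconds)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.hdfs.rollInterval = 30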
Wiring the pieces together
The agent needs to know what individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks and channels in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to an HDFS sink called hdfs-Cluster1 via a file channel called file-channel. The configuration file would contain names of these components and file-channel as a shared channel for both the avroWeb source and the hdfs-Cluster1 sink.
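A sketch of that wiring might look as follows, using the component names from the paragraph above (the agent name "agent_foo" is hypothetical): the avroWeb source and the hdfs-Cluster1 sink share file-channel.

# List the components of this agent
agent_foo.sources = avroWeb
agent_foo.sinks = hdfs-Cluster1
agent_foo.channels = file-channel
# Connect source and sink through the shared channel
agent_foo.sources.avroWeb.channels = file-channel
agent_foo.sinks.hdfs-Cluster1.channel = file-channel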
Starting an agent
An agent is started using a shell script called flume-ng which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:
$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
Now the agent will start running the source and sinks configured in the given properties file.
A simple example
Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.
Given this configuration file, we can start Flume as follows:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.
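For orientation, a conf/flume-env.sh in such a deployment might contain something like the line below; the JVM heap sizes and JMX option are illustrative values assumed here, not recommendations from the original text.

# Hypothetical flume-env.sh entry: JVM options picked up by the flume-ng script
export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"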
From a separate terminal, we can then telnet to port 44444 and send Flume an event:
$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK
The original Flume terminal will output the event in a log message.
12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D    Hello world!. }
Congratulations, you've successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in much more detail.
Flume official documentation translation: Flume 1.7.0 User Guide (unreleased version), Part 1