1. System Requirements
1. Java runtime environment - Java 1.8 or later
2. Memory - sufficient memory for the configured sources, channels, and sinks
3. Disk space - sufficient disk space for the configured channels and sinks
4. Directory permissions - read/write permissions on the directories used by the agent
2. Architecture and Data flow model
For a detailed introduction to the model, see: http://www.cnblogs.com/swordfall/p/8093464.html
3. Setup
3.1 Creating an agent
The Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file describes each source, sink, and channel in an agent, and how they are wired together to form data flows. Each component in the flow (source, sink, or channel) has a name, a type, and a set of properties. For example, an Avro source needs a hostname and a port, a memory channel can have a maximum queue size (capacity), and an HDFS sink needs a file system URI path.
3.1.1 Starting an agent
An agent is started using the flume-ng shell script, which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:
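A sketch of that command, following the user guide (the agent name and config file name are placeholders):

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template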
3.1.2 A simple example
The following is an example configuration file for a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.
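The configuration file, example.conf, as given in the Flume user guide:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1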
This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration properties. A given configuration file may define several different agents; when a Flume process is launched, a flag is passed telling it which named agent to start. See the following command:
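For instance, the agent a1 above can be started like this (per the user guide):

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console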
Note: a full deployment would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a flume-env.sh shell script and a log4j properties file. In this example, we pass a Java option to force Flume to log to the console, and we run without a custom environment script.
From a separate terminal, we can then telnet port 44444 and send Flume an event:
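A sample session, as shown in the user guide:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK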
The original Flume terminal will output the event in a log message:
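The logged event looks roughly like the following (output as shown in the user guide; timestamps will vary):

12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }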
3.1.3 Using environment variables in a configuration file
Flume can substitute environment variables in the configuration. To enable this, set propertiesImplementation = org.apache.flume.node.EnvVarResolverProperties.
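For example, the source port below is resolved from the NC_PORT environment variable (snippet following the user guide):

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${NC_PORT}
a1.sources.r1.channels = c1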
The Flume startup command then looks like:
$ NC_PORT=44444 bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties
Note: environment variables can also be configured in the conf/flume-env.sh file.
3.1.4 Logging raw data
In many production environments, logging the raw stream of data flowing through the ingest pipeline is not desired, because it may leak sensitive data or security-related configuration, such as secret keys, into Flume log files. By default, Flume does not log such information. On the other hand, if the data pipeline is broken, Flume will attempt to provide clues for debugging the problem.
One way to debug problems in the event pipeline is to connect an additional memory channel to a logger sink, which outputs all event data to the Flume logs. In some situations, however, this approach is insufficient.
To enable logging of event- and configuration-related data, some Java system properties must be set in addition to the log4j configuration.
To enable configuration-related logging, set the Java system property -Dorg.apache.flume.log.printconfig=true. This can be set either on the command line or in the JAVA_OPTS variable in flume-env.sh.
To enable data logging, set -Dorg.apache.flume.log.rawdata=true in the same way described above. For most components, the log4j logging level must also be set to DEBUG or TRACE for event-specific logging to appear in the Flume logs. In the following example, with the log4j level set to DEBUG, the console prints both the configuration log and the raw data log:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true
3.1.5 Zookeeper-based configuration
Flume supports agent configuration via Zookeeper. This is an experimental feature. The configuration file needs to be uploaded into Zookeeper under a configurable prefix; it is stored in the data of the Zookeeper node.
Once the configuration file is uploaded, start the agent with the following options:
$ bin/flume-ng agent --conf conf -z zkhost:2181,zkhost1:2181 -p /flume --name a1 -Dflume.root.logger=INFO,console
Argument Name | Default | Description
z | – | Zookeeper connection string. Comma-separated list of hostname:port
p | /flume | Base path in Zookeeper to store agent configurations
3.2 Extracting data
3.2.1 RPC
The Flume distribution includes an Avro client that can send a given file to a Flume Avro source using the Avro RPC mechanism:
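The command referenced below would look like this, following the user guide (host and port are illustrative):

$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/log.10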
The above command sends the contents of /usr/log.10 to the Flume source listening on port 41414.
3.2.2 Network Streams
Flume supports the following mechanisms for reading data from popular log stream types:
1. Avro
2. Thrift
3. Syslog
4. Netcat
3.2.3 Setting up multiple agent flows
In order for data to flow across multiple agents or hops, the sink of the previous agent and the source of the current hop must both be of Avro type, with the sink pointing to the hostname (or IP address) and port of the source.
3.2.4 Merging
A very common scenario in log collection is that a large number of log-producing clients send data to a few consumer agents attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.
This can be achieved in Flume by configuring a number of first-tier agents with Avro sinks, all pointing to the Avro source of a single agent (or using Thrift sources/sinks/clients). The source on this second-tier agent consolidates the received events into a single channel, which is consumed by a sink into its final destination.
3.2.5 Multiplexing the flow
Flume supports multiplexing an event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.
The example above shows a source from agent "foo" fanning out the flow to three different channels. This fan-out can be replicating or multiplexing. In the replicating case, each event is sent to all three channels. For multiplexing, an event is delivered to a subset of the available channels when its attribute matches a preconfigured value. For example, if an event attribute called "txnType" is set to "customer", it should go to channel1 and channel3; if it is "vendor", it should go to channel2, otherwise channel3. The mapping can be set in the agent's configuration file.
4. Configuration
4.1 Defining the flow
To define the flow within a single agent, you need to link the sources and sinks via a channel. You list the sources, sinks, and channels for the given agent, and then point the source and sink to a channel. A source instance can specify multiple channels, but a sink instance can specify only one channel. The format is as follows:
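The generic format, as given in the user guide:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>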
For example, an agent named agent_foo reads data from an external Avro client and sends it to HDFS via a memory channel. The configuration file looks like this:
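Following the user guide's example:

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-AppSrv-source.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-Cluster1-sink.channel = mem-channel-1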
This will make events flow from avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-channel-1.
4.2 Configuring a single component
After defining the flow, you need to set the properties of each source, sink, and channel. This is done in the same hierarchical namespace, where you set the component type and the other property values specific to each component.
Each Flume component needs its "type" property set so that Flume knows what kind of object it needs to be. Each source, sink, and channel type has its own set of functional properties, which are set as needed. In the previous example, we have a flow from avro-AppSrv-source through memory channel mem-channel-1 to hdfs-Cluster1-sink. The following example shows the configuration of each of those components:
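Per the user guide, the component configuration is:

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata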
4.3 Adding multiple processes to an agent
A single Flume agent can contain several independent flows. You can list multiple sources, sinks, and channels in a config. These components can be linked to form multiple flows:
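The listing format looks like this:

<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>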
Then you can link the sources and sinks to their corresponding channels to set up two different flows. For example, if you need to set up two flows in one agent, one going from an external Avro client to HDFS and another from the output of a tail to an Avro sink, the configuration looks like this:
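The user guide's two-flow example:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2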
4.4 Configuring a multi-agent process
To set up a multi-tier flow, the Avro/Thrift sink of the first hop must point to the Avro/Thrift source of the next hop. This will result in the first Flume agent forwarding events to the next Flume agent. For example, if you periodically send files (one file per event) using an Avro client to a local Flume agent, this local agent can forward them to another agent that is responsible for storage.
Weblog Agent configuration:
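A sketch based on the user guide's example (the IP 10.1.1.100 and port 10000 are illustrative):

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000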
HDFS Agent configuration:
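The matching second-tier configuration, again following the user guide:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000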
Here we link the avro-forward-sink of the weblog agent to the avro-collection-source of the HDFS agent. This results in events coming from the external appserver source eventually being stored in HDFS.
4.5 Fan-out flow
As discussed in the previous section, Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan-out: replicating and multiplexing. In a replicating flow, the event is sent to all the configured channels. In the multiplexing case, the event is sent only to the qualifying channels. To fan out the flow, you need to specify the list of channels for the source and the fan-out policy. This is done by adding a channel "selector", either replicating or multiplexing, and then specifying the selection rules. If you don't specify a selector, it defaults to replicating:
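The fan-out format, as given in the user guide:

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating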
The multiplexing selector has a further set of properties to bifurcate the flow. This requires specifying a mapping of an event attribute to a set of channels. The selector checks each configured attribute in the event header. If it matches the specified value, the event is sent to all the channels mapped to that value. If there is no match, the event is sent to the channels configured as default.
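The mapping format, per the user guide:

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>

<Agent>.sources.<Source1>.selector.default = <Channel2>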
The following example shows a single flow multiplexed to two paths. The agent named agent_foo has a single Avro source and two channels linked to two sinks:
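The user guide's multiplexing example:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1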
The selector checks for a header called "State". If its value is "CA", the event is sent to mem-channel-1; if it is "AZ", it goes to file-channel-2; and if it is "NY", it goes to both channels. If the "State" header is not set or does not match any of the three values, the event is sent to the default mem-channel-1.
The selector also supports optional channels. To specify optional channels for a header, the config property "optional" is used in the following way:
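Per the user guide:

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1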
The selector will first attempt to write to the required channels; if even one of those channels fails to consume the events, the transaction fails and is then reattempted on all of those channels. Once all of the required channels have consumed the events, the selector attempts to write to the optional channels. A failure by any of the optional channels to consume the event is simply ignored and not retried.
Note: if a header has only optional channels and no required channels, the events are written to the default channels and an attempt is also made to write them to the optional channels. Specifying optional channels still causes the event to be written to the default channels if no required channels are specified. If no channels are designated as default and there are no required channels, the selector attempts to write the events to the optional channels only; in that case, any failures are simply ignored.
4.6 Flume Sources
4.6.1 Avro Source
Listens on an Avro port and receives events from external Avro client streams. Required properties are in bold.
Agent A1 Example:
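Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141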
ipFilterRules example:
ipFilterRules = allow:ip:127.*,allow:name:localhost,deny:ip:*
4.6.2 Thrift Source
Listens on a Thrift port and receives events from external Thrift client streams. Required properties are in bold:
Agent A1 Example:
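Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141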
4.6.3 Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out. Required properties are in bold:
Agent A1 Example:
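The user guide's example, tailing a log file:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1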
The 'shell' config is used to invoke the 'command' through a command shell.
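For example, shell features such as wildcards become available (snippet from the user guide):

a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done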
4.6.4 JMS Source
JMS source reads messages from a JMS destination such as a queue or topic. The JMS application should in theory work with any JMS provider, but it has only been tested with ActiveMQ. Required properties are in bold.
Agent A1 Example:
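A sketch along the lines of the user guide's ActiveMQ example (the broker URL and destination name are illustrative):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE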
4.6.5 spooling Directory Source
This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. The source watches the specified directory for new files and parses events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).
agent-1 Example:
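Per the user guide (the spool directory path is illustrative):

a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true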
4.6.6 Taildir Source
Note: this source cannot be used on Windows.
Agent A1 Example:
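A sketch following the user guide (file paths and header values are illustrative):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true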
4.6.7 Twitter 1% firehose Source (experimental)
Omitted.
4.6.8 Kafka Source
Kafka source is an Apache Kafka consumer that reads messages from Kafka topics. If you have multiple Kafka sources running, you can configure them with the same consumer group so that each reads a unique set of partitions for the topics.
Example of a topic subscription using a comma-separated list of topics:
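Per the user guide:

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id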
Example of a topic subscription using a regular expression:
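Per the user guide:

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# the default kafka.consumer.group.id=flume is used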
Security and Kafka Source
Since Kafka version 0.9.0, secure communication via the SASL/GSSAPI (Kerberos V5) or SSL (actually TLS) protocols is supported.
Set the value of kafka.consumer.security.protocol to one of the following:
① SASL_PLAINTEXT - Kerberos or plaintext authentication with no data encryption
② SASL_SSL - Kerberos or plaintext authentication with data encryption
③ SSL - TLS-based encryption with optional authentication
TLS and Kafka Source
Example configuration with server-side authentication and data encryption:
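A sketch following the user guide (broker addresses, topic, and truststore path are illustrative):

a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SSL
a1.sources.source1.kafka.consumer.ssl.truststore.location = /path/to/truststore.jks
a1.sources.source1.kafka.consumer.ssl.truststore.password = <password to access the truststore>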
Note: the property ssl.endpoint.identification.algorithm is not defined by default, so hostname verification is not performed. To enable hostname verification, set the property:
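Per the user guide:

a1.sources.source1.kafka.consumer.ssl.endpoint.identification.algorithm = HTTPS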
If client-side authentication is also required, add the following to the Flume agent configuration. Each Flume agent must have its own client certificate, which must be trusted by the Kafka brokers, either individually or through their signature chain.
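Per the user guide (the keystore path is illustrative):

a1.sources.source1.kafka.consumer.ssl.keystore.location = /path/to/client.keystore.jks
a1.sources.source1.kafka.consumer.ssl.keystore.password = <password to access the keystore>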
If the keystore and the key are protected by different passwords, the ssl.key.password property must also be provided:
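Per the user guide:

a1.sources.source1.kafka.consumer.ssl.key.password = <password to access the key>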
Kerberos and Kafka Source:
The Kerberos configuration files can be specified in flume-env.sh via JAVA_OPTS:
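Per the user guide (the file paths are illustrative):

JAVA_OPTS="$JAVA_OPTS -Djava.security.krb5.conf=/path/to/krb5.conf"
JAVA_OPTS="$JAVA_OPTS -Djava.security.auth.login.config=/path/to/flume_jaas.conf"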
Example of a secure configuration using SASL_PLAINTEXT:
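A sketch following the user guide (broker addresses and topic are illustrative):

a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SASL_PLAINTEXT
a1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
a1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka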
Example of a secure configuration using SASL_SSL:
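A sketch following the user guide (broker addresses, topic, and truststore path are illustrative):

a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SASL_SSL
a1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
a1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
a1.sources.source1.kafka.consumer.ssl.truststore.location = /path/to/truststore.jks
a1.sources.source1.kafka.consumer.ssl.truststore.password = <password to access the truststore>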
Sample JAAS file (not yet read in detail):
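A sketch following the user guide (keytab path and principal are illustrative):

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/path/to/keytabs/flume.keytab"
  principal="flume/flumehost1.example.com@YOURKERBEROSREALM";
};

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/path/to/keytabs/flume.keytab"
  principal="flume/flumehost1.example.com@YOURKERBEROSREALM";
};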
4.6.9 NetCat TCP Source
Netcat TCP source listens on a given port and turns each line of text into an event. Required properties are in bold.
Agent A1 Example:
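Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1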
4.6.10 NetCat UDP Source
Like the Netcat TCP source, this source listens on a given port and turns each line of text into an event. Required properties are in bold.
Example of Agent A1:
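Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcatudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1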
4.6.11 Sequence Generator Source
A simple sequence generator that continuously generates events using a counter that starts at 0, increments by 1, and stops at totalEvents. It retries when it can't send events to the channel.
Agent A1 Example:
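Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.channels = c1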
4.6.12 Syslog Sources
Reads syslog data and generates Flume events. The UDP source treats an entire message as a single event. The TCP source creates a new event for each string of characters separated by a newline ("\n").
4.6.12.1 Syslog TCP Source
The original, reliable syslog TCP source.
Example of a syslog TCP source for agent A1:
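Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1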
4.6.12.2 multiport Syslog TCP Source
This is a newer, faster, multi-port capable version of the syslog TCP source. Note that the ports configuration setting replaces port.
Example of multiport syslog TCP source for agent A1:
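Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port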
4.6.12.3 Syslog UDP Source
Example of syslog UDP source for agent A1:
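Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1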
4.6.13 HTTP Source
A source which accepts Flume events via HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into Flume events by a pluggable "handler", which must implement the HTTPSourceHandler interface. This handler takes an HttpServletRequest and returns a list of Flume events.
Example of an HTTP source for agent A1:
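Per the user guide (org.example.rest.RestHandler is a placeholder custom handler class):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props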
Two handlers are provided out of the box: JSONHandler and BLOBHandler.
BLOBHandler is used for handling requests that carry large objects (Binary Large Objects) as parameters, such as PDFs or JPGs.
4.6.14 Stress Source
StressSource is an internal load-generating source implementation that is very useful for stress tests. It allows the user to configure the size of the event payload.
Example of Agent A1:
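Per the user guide:

a1.sources = stresssource-1
a1.channels = memoryChannel-1
a1.sources.stresssource-1.type = org.apache.flume.source.StressSource
a1.sources.stresssource-1.size = 10240
a1.sources.stresssource-1.maxTotalEvents = 1000000
a1.sources.stresssource-1.channels = memoryChannel-1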
4.6.15 Legacy Sources
The legacy sources allow a Flume 1.x agent to receive events from Flume 0.9.4 agents.
Legacy sources support both the Avro and Thrift RPC connections. To use this bridge between the two Flume versions, you need to start a Flume 1.x agent with either the avroLegacy or thriftLegacy source. The 0.9.4 agent should have its agent sink pointing to the host/port of the 1.x agent.
4.6.15.1 Avro Legacy Source
Example of Agent A1:
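Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.avroLegacy.AvroLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.bind = 6666
a1.sources.r1.channels = c1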
4.6.15.2 Thrift Legacy Source
Example of Agent A1:
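Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.thriftLegacy.ThriftLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.bind = 6666
a1.sources.r1.channels = c1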
4.6.16 Custom Source
A custom source is your own implementation of the Source interface. The custom source's class and its dependencies must be included in the agent's classpath when starting the Flume agent.
Example of Agent A1:
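Per the user guide (org.example.MySource is a placeholder custom class):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.example.MySource
a1.sources.r1.channels = c1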
4.6.17 Scribe Source
Scribe is another type of ingest system. To adopt an existing Scribe ingest system, Flume should use the ScribeSource, which is based on the Thrift-compatible transport protocol.
Agent A1 Example:
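Per the user guide:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.scribe.ScribeSource
a1.sources.r1.port = 1463
a1.sources.r1.workerThreads = 5
a1.sources.r1.channels = c1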
Resources:
Https://flume.apache.org/FlumeUserGuide.html
Flume 1.8 User Guide learning notes (1)