Flume Building and learning (Basic article)


Reprint please indicate the original source: http://www.cnblogs.com/lighten/p/6830439.html

1. Introduction

This article is mainly a translation of the official Flume documentation. It introduces basic Flume concepts and how to set up a working deployment.

Apache Flume is a distributed, reliable and usable system for efficient collection, aggregation, and movement of large amounts of log data from many different sources to centralized data storage.

The use of Apache Flume is not limited to log data aggregation. Because data sources are customizable, you can use flume to transfer large amounts of event data, including but not limited to network traffic data, social media generation data, e-mail messages, and almost any data source.

There are currently two release lines available, 0.9.x and 1.x. Documentation for the 0.9.x line is provided in the "Flume 0.9.x User Guide"; this document applies to the 1.4.x line.

New and existing users are encouraged to use the 1.x releases to take advantage of the performance improvements and configuration flexibility available in the latest architecture.

2. Environmental requirements

Java Runtime Environment - Java 1.7 or later

Memory - sufficient memory for the configurations used by sources, channels, and sinks

Disk space - sufficient disk space for the configurations used by channels and sinks

Directory permissions - read/write permissions for the directories used by the agent

3. Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

A Flume source consumes events delivered to it by an external source, such as a web server. The external source sends events to Flume in a format recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients, or from other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume source to receive events from a Thrift sink, a Flume Thrift RPC client, or Thrift clients written in any language generated from the Flume Thrift protocol.

When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it is consumed by a Flume sink. The file channel is one example - it is backed by the local file system. The sink removes the event from the channel and puts it into an external repository such as HDFS (via the Flume HDFS sink), or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within a given agent run asynchronously with the events staged in the channel.

Flume allows users to build multi-hop flows, where events travel through multiple agents before reaching their final destination. It also allows fan-in and fan-out flows, contextual routing, and backup routes (failover) for failed hops.

The events are staged in a channel on each agent. They are then delivered to the next agent or to a terminal repository (such as HDFS) in the flow. Events are removed from a channel only after they have been stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of events. Sources and sinks encapsulate the storage and retrieval, respectively, of events in transactions provided by the channel. This ensures that the set of events is reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink of the previous hop and the source of the next hop both run their own transactions to ensure that the data is safely stored in the channel of the next hop.

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel backed by the local file system. There is also a memory channel, which simply stores events in an in-memory queue and is faster, but any events still left in the memory channel when the agent process dies cannot be recovered.

4. Download

Download the binary package from the official download page and unzip it. The directory structure is as follows:

The official documentation and wiki are available on the Apache Flume website.

5. Configuration

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes the properties of each source, sink, and channel in an agent, and how they are wired together to form data flows.

Each component (source, sink, or channel) in the flow has a name, a type, and a set of properties specific to its type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have a maximum queue size ("capacity"), and an HDFS sink needs to know the file system URI, the path to create files in, the frequency of file rotation ("hdfs.rollInterval"), and so on. All such properties of a component need to be set in the properties file of the hosting Flume agent.
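As an illustration, these properties would appear in the agent's configuration file roughly as follows (the component names a1, r1, c1, k1 and the HDFS path are hypothetical, chosen only for this sketch):

```properties
# Avro source: needs a bind host (or IP) and a port
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Memory channel: maximum queue size is set via "capacity"
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: file system URI/path and roll frequency (seconds)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.rollInterval = 30
```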

The agent needs to know which individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks, and channels in the agent, and then specifying the connecting channel for each source and sink. For example, an agent might flow events from an Avro source called avroWeb to an HDFS sink called hdfs-cluster1 via a file channel called file-channel. The configuration file would contain the names of these components, with file-channel as the shared channel for both the avroWeb source and the hdfs-cluster1 sink.
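Using a hypothetical agent name agent_foo, that wiring would look like:

```properties
# List the components of this agent
agent_foo.sources = avroWeb
agent_foo.sinks = hdfs-cluster1
agent_foo.channels = file-channel

# file-channel is a file type channel, backed by the local file system
agent_foo.channels.file-channel.type = file

# Connect source and sink through the shared channel
# (a source can write to several channels, so its property is plural)
agent_foo.sources.avroWeb.channels = file-channel
agent_foo.sinks.hdfs-cluster1.channel = file-channel
```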

Starting an agent: an agent is started using a shell script called flume-ng, located in the bin directory of the Flume distribution. You need to specify the agent name, the configuration directory, and the configuration file on the command line:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

Here we give an example configuration file describing a single-node Flume deployment. This configuration lets a user generate events, which are then logged to the console.
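The configuration in question (reproduced from the official user guide, since the original screenshot is not available here) is:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```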

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file can define several named agents; when a given Flume process is launched, a flag is passed to tell it which named agent to run.

Given this configuration file, we can start flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

I am working on Windows, so here is how to handle things there:

On Windows, Flume is run through PowerShell (via flume-ng.cmd). The Flume configuration file itself is unchanged, but to output log information to the console you also need to rename Flume-env.ps1.template under the conf folder to Flume-env.ps1, adding the following:
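The exact lines from the original screenshot are not recoverable. An assumed addition that routes Flume's log4j root logger to the console would look like this in Flume-env.ps1 (PowerShell syntax; treat this as a sketch, not the author's exact content):

```powershell
# Assumption: pass the root-logger setting to the JVM options
$JAVA_OPTS="-Dflume.root.logger=INFO,console"
```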

Then execute the following command:

bin> flume-ng.cmd agent -conf ../conf -conf-file ../conf/flume-conf.properties -name a1

Use flume-ng.cmd help to view the usage on Windows:

Note that in a full deployment we would typically include one additional option: --conf=<conf-dir>. The <conf-dir> directory would include the shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console, and we go without a custom environment script.

From a separate terminal, we can then telnet to port 44444 and send Flume an event:
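Such a session (taken from the official user guide; exact banner text varies by telnet client) looks like:

```
$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK
```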

Back in the original Flume terminal, you can then see the event logged:

This completes the simple configuration.

6. PostScript

This article mainly covers the basics of Flume and how to set it up; follow-up articles may explore the related topics in more depth.

