Hadoop Learning Notes - 19. Flume Framework Learning


Flume is a highly available, highly reliable, open-source, distributed log collection system for massive data volumes, provided by Cloudera. Log data can flow through Flume to its final storage destination. "Log" here is a general term that covers files, operation records, and many other kinds of data.

I. Flume Basic Theory

1.1 Common Distributed Log Collection Systems

Scribe is Facebook's open-source log collection system and has been used extensively inside Facebook.

Chukwa is an open-source data collection system for monitoring large distributed systems. It is built on Hadoop's HDFS and MapReduce frameworks and inherits Hadoop's scalability and robustness.

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It supports customizing all kinds of data senders in the logging system to collect data, and it also provides the ability to do simple processing on the data and write it to various data receivers (such as text files, HDFS, HBase, and so on).

1.2 Flume's Data Flow Model

The core task of Flume is to collect data from a data source and send it to a destination. To guarantee successful delivery, the data is cached before it is sent to the destination and is deleted only after it has actually arrived there.

The basic unit of data transmitted by Flume is the event. For a text file, an event is usually a single line of the file, and it is also the basic unit of a transaction. An event flows from source to channel to sink; it is itself a byte array and can carry header information. An event represents the smallest complete unit of a data stream, traveling from an external data source to an external destination.
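Concretely, an event is just a set of string headers plus a byte-array body. As an illustration, the JSON form accepted by Flume's HTTP source (with its default JSON handler) represents a batch of events roughly like this; the header values below are made up for the example:

[{
  "headers": { "timestamp": "1434200000000", "host": "web01.example.com" },
  "body": "one line of log text"
}]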

1.3 Flume's Three Core Components

The core of a running Flume deployment is the agent. An agent is a complete data collection tool that contains the three core components: source, channel, and sink. With these components, events can flow from one place to another, as shown in Figure 1.

Figure 1 Flume Data flow model

A Flume system can consist of one or more agents, and multiple agents can be chained together with a few simple configuration changes. For example, to make two agents (foo and bar) work together in a chain, simply connect foo's sink (outlet) to bar's source (inlet), as shown in Figure 2.

Figure 2 Multi-level agent connection model
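As a sketch of how such a chain is wired up (the agent and component names foo, bar, k1, r1, c1 and the host/port below are made up for illustration), the upstream agent typically uses an Avro sink that points at an Avro source on the downstream agent:

# on agent foo: forward events to bar over Avro
foo.sinks.k1.type=avro
foo.sinks.k1.hostname=bar-host
foo.sinks.k1.port=4545
foo.sinks.k1.channel=c1

# on agent bar: receive events from foo over Avro
bar.sources.r1.type=avro
bar.sources.r1.bind=0.0.0.0
bar.sources.r1.port=4545
bar.sources.r1.channels=c1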

Figure 3 shows four agents chained together: agent1, agent2, and agent3 each collect data from a web server and send it on to agent4, and agent4 finally stores all of the collected data in HDFS.

Figure 3 Many-to-one merge model

(1) What is an agent?

The core of Flume is the agent. An agent is a Java process that runs on the log collection side; it receives log data, buffers it temporarily, and then sends it on to the destination.

(2) Three core components

Source: dedicated to collecting log data. It can handle log data of many types and formats, including Avro, Thrift, exec, JMS, spooling directory, netcat, sequence generator, syslog, HTTP, legacy, custom, and more.

Channel: dedicated to temporarily storing data. Data can be kept in memory, JDBC, files, databases, custom stores, and so on; stored data is not deleted until the sink has sent it on successfully.

Sink: dedicated to sending data to the destination, including HDFS, logger, Avro, Thrift, IPC, file, null, HBase, Solr, custom, and more.
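To make the three components concrete, here is a minimal single-agent configuration sketch (the names a1, r1, c1, and k1 are arbitrary, and this is for illustration only; it is not the configuration used later in this note). It reads lines of text from a netcat source, buffers them in a memory channel, and prints them through a logger sink:

# declare the agent's components
a1.sources=r1
a1.channels=c1
a1.sinks=k1

# source: listen for lines of text on a local TCP port
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=44444
a1.sources.r1.channels=c1

# channel: buffer events in memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# sink: write events to the Flume log/console
a1.sinks.k1.type=logger
a1.sinks.k1.channel=c1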

Understanding source, channel, and sink:

The source is the water source: it is the entry point for data into the agent.

The channel is the pipeline: it is the passage through which the data obtained by the source flows, and its main role is to transmit and store data.

The sink is the drain: it receives the data coming from the channel and outputs it to the specified place.

You can think of the agent as a water pipe: the source is the pipe's inlet, the sink is its outlet, the data is the water, and data flow corresponds to water flow. Data enters through the source, passes through the channel, and is finally delivered by the sink. Figure 1 demonstrates a complete agent flow: data is obtained from the web server, flows through the channel to the sink, and is finally stored in HDFS by the sink.

1.4 Flume's Reliability Guarantees

Flume uses a transactional approach to guarantee the reliability of the entire process of delivering an event. A sink may remove an event from its channel only after the event has been stored in the channel of the next agent or written to the external destination. This ensures that events are reliable as they move through the data flow, whether within a single agent or across multiple agents, because the transaction guarantees that an event has been stored successfully before it is removed. The different channel implementations offer different degrees of recoverability and therefore protect events to different degrees. For example, Flume provides a file channel that persists events to the local file system as a backup, while the memory channel keeps events in an in-memory queue, which is fast but cannot recover the data if it is lost.
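For example, this durability trade-off shows up directly in the channel configuration. A rough sketch comparing the two channel types (the channel names, paths, and sizes here are purely illustrative):

# file channel: events are persisted and survive an agent restart
agent1.channels.durable.type=file
agent1.channels.durable.checkpointDir=/var/flume/checkpoint
agent1.channels.durable.dataDirs=/var/flume/data

# memory channel: faster, but events are lost if the agent process dies
agent1.channels.fast.type=memory
agent1.channels.fast.capacity=10000
agent1.channels.fast.transactionCapacity=100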

II. Flume Basic Practice

2.1 Flume Installation

(1) Download the Flume installation package. Version 1.4.0 is used here; I have uploaded it to a network disk (http://pan.baidu.com/s/1kTEFUfX).

(2) Unzip the bin and src packages and rename the result

Step 1. Unzip the two packages

tar -zvxf libs/apache-flume-1.4.0-bin.tar.gz

tar -zvxf libs/apache-flume-1.4.0-src.tar.gz

Step 2. Copy the source package into the bin directory

cp -ri apache-flume-1.4.0-src/* apache-flume-1.4.0-bin/

Step 3. (Optional) Rename the directory to flume

mv apache-flume-1.4.0-bin flume
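Before moving on, you can sanity-check the installation. A quick check, assuming the directory was renamed to flume as above and a JDK is available, might look like this:

cd flume
bin/flume-ng version                                # should report Flume 1.4.0
cp conf/flume-env.sh.template conf/flume-env.sh     # optional: set JAVA_HOME here if it is not already in your environment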

2.2 Flume Basic Configuration

In this exercise, the source is a spooling directory and the sink writes to HDFS: we monitor the /root/edisonchou directory, and as soon as a new file appears there, its contents are immediately sent through the agent to hdfs://hadoop-master:9000/testdir/edisonchou in HDFS. Before that, we need to do some basic configuration of Flume.

First, enter Flume's conf directory and create a new file named example.conf, which configures the three core components as follows:

(1) Configure the source

agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/root/edisonchou
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader=false
agent1.sources.source1.interceptors=i1
agent1.sources.source1.interceptors.i1.type=timestamp

(2) Configure the channel

agent1.channels.channel1.type=file
agent1.channels.channel1.checkpointDir=/root/edisonchou_tmp/123
agent1.channels.channel1.dataDirs=/root/edisonchou_tmp/

(3) Configure the sink

agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=hdfs://hadoop-master:9000/testdir/edisonchou
agent1.sinks.sink1.hdfs.fileType=DataStream
agent1.sinks.sink1.hdfs.writeFormat=Text
agent1.sinks.sink1.hdfs.rollInterval=1
agent1.sinks.sink1.channel=channel1
agent1.sinks.sink1.hdfs.filePrefix=%Y-%m-%d
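Note that for example.conf to load, agent1 must also declare which sources, channels, and sinks it owns. With the component names used above, those declarations are the following three lines (place them at the top of the file):

agent1.sources=source1
agent1.channels=channel1
agent1.sinks=sink1
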
2.3 Testing by Monitoring the Specified Directory

(1) Start Hadoop with the old-style command: start-all.sh

(2) Create a new local folder /root/edisonchou and a new directory /testdir/edisonchou in HDFS.
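A minimal way to do this from the shell, assuming the hadoop client is on the PATH (on newer Hadoop versions you may need hadoop fs -mkdir -p), is:

mkdir /root/edisonchou                       # local directory watched by the spooldir source
hadoop fs -mkdir /testdir/edisonchou         # HDFS directory written to by the sink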

(3) Execute the following command in the Flume directory to start the sample agent:

bin/flume-ng agent -n agent1 -c conf -f conf/example.conf -Dflume.root.logger=DEBUG,console

If the console output shows no errors, the agent has started successfully.

(4) Open another SSH connection, create a new file named test, write some arbitrary content into it, and then move it into the /root/edisonchou directory. Now look back at the console output in the previous connection, where new messages will appear.

You will find that as soon as a file is added to the monitored directory /root/edisonchou, the agent writes it to HDFS in roughly three steps: create, close, and rename. The rename step mainly removes the .tmp suffix. The file test that we added to the monitored directory has thus been uploaded to HDFS by the agent.
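To double-check the result from the command line, something like the following can be used (the exact file names will vary, since the %Y-%m-%d prefix and roll settings generate new files):

hadoop fs -ls /testdir/edisonchou        # list the files the HDFS sink has written
hadoop fs -cat /testdir/edisonchou/*     # print their contents; the text of the test file should appear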

  

Resources

(1) Hanlong, "Flume: An Open-Source Distributed Log Collection System": http://www.cnblogs.com/hanganglin/articles/4224928.html

(2) Windcarp, "Collecting and Processing Log Files with Flume": http://www.cnblogs.com/windcarp/p/3872578.html

(3) My Little Life, "Introduction to and Use of Flume 1.4": http://www.cnblogs.com/fuhaots2009/p/3473122.html

(4) Remnant Night, "Flume Log Collection": http://www.cnblogs.com/oubo/archive/2012/05/25/2517751.html

(5) Sandyfog, "Overview and Simple Examples of Flume": http://www.cnblogs.com/sandyfog/p/3795967.html

(6) Apache, "Flume Documentation": http://flume.apache.org/documentation.html

Zhou Xurong

Source: http://www.cnblogs.com/edisonchou/

The copyright of this article belongs to the author and the blog site. You are welcome to reprint it, but without the author's consent you must retain this statement and provide a clear link to the original article on the page.
