Introduction to Flume in IBM BigInsights
Flume is an open-source, large-scale log-collection system that supports real-time collection of log data. The initial version, Flume OG (Flume Original Generation), was developed by Cloudera and known as Cloudera Flume; after Cloudera contributed the project to Apache, the redesigned version, Flume NG (Flume Next Generation), became what is now known as Apache Flume. BigInsights initially shipped Flume 0.9.1 and later upgraded to Flume 0.9.4; both versions belong to Flume OG and are based on the open-source Cloudera Flume. In my opinion, upgrading to Flume NG is an inevitable trend for later BigInsights releases.
BigInsights contains two components that are closely related to Flume: Hadoop and ZooKeeper. Flume relates to Hadoop in that collected logs can be stored in HDFS, so Hadoop can efficiently process the log data and extract useful information from it. Flume relates to ZooKeeper in that the nodes of a Flume log-collection cluster can be managed by ZooKeeper (an efficient and reliable coordination service); each Flume node's configuration information is stored in ZooKeeper's data files. Viewed as a whole, Flume and ZooKeeper are peripheral components tightly coupled to Hadoop, and Hadoop itself can also use ZooKeeper to manage its internal nodes. BigInsights, a big-data processing system with Hadoop at its core, integrates ZooKeeper and Flume and provides a visual installer that makes it convenient to install multiple components across multiple nodes. That is, through the BigInsights visual interface, users can easily deploy Hadoop, Flume, and ZooKeeper on multiple nodes to lay out a complete log-collection system.
In addition, through the Flume Runtime Toolkit, BigInsights provides users with a Flume installation package that requires no manual configuration, enabling rapid expansion of an existing Flume cluster.
Flume Basics
Flume collects log data through the collaboration of three kinds of nodes: master, agent, and collector (Table 1). The agent and collector nodes are the log-collection nodes. Transferring data requires specifying a data source (source) and a data sink (sink) (Table 2). Another important concept in Flume is the data flow. A data flow is the pipeline through which data travels; it describes the path log data takes from production to its final destination. In Flume, a data flow is established by configuring the source and sink of each log-collection node.
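As an illustration of how a data flow is defined, the following is a minimal sketch in the Flume OG shell, configuring one agent to tail a log file and forward it to a collector, which in turn writes to HDFS. The node names, file path, HDFS URI, and port are assumptions for the example, not values from BigInsights:

exec config agent1 'tail("/var/log/app.log")' 'agentSink("collector1",35853)'
exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/","applog-")'

Here tail(...) is agent1's source and agentSink(...) its sink; collectorSource(...) receives the agent's data and collectorSink(...) delivers it to HDFS, completing the data flow from log production to final destination.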
Flume Configuration
The Flume configuration file is $FLUME_HOME/conf/flume-conf.xml; if the user does not provide this file, Flume falls back to the default template flume-conf.xml.template. Each property is described in a uniform format (for example, the property flume.master.zk.servers):
<property>
  <name>flume.master.zk.servers</name>
  <value>hostname:2181</value>
  <description>ZooKeeper server</description>
</property>
For each property, the configuration file gives the property name (name), the property value (value), and a property description (description); the description is optional and can be omitted. Several properties are especially relevant to users when deploying Flume (Table 4):
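As an illustration of the deployment-related properties, a minimal flume-conf.xml might set the master host list and the HDFS directory where collectors write logs. The property names below exist in Flume OG, but the host names and path are assumed example values:

<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>masterhost</value>
    <description>Comma-separated list of master hosts (example host name)</description>
  </property>
  <property>
    <name>flume.collector.dfs.dir</name>
    <value>hdfs://namenode:9000/flume/collected</value>
    <description>Directory where collectors store collected logs (example path)</description>
  </property>
</configuration>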
In addition, the directory that stores Flume's own logs can be specified in the log4j.properties file through the property flume.log.dir, for example: flume.log.dir=/tmp/flume/logs. Flume writes two log files: flumemaster.out and flumenode.out. As the names imply, flumemaster.out stores the log information of the master node, while flumenode.out stores the log information of the log-collection nodes (agents and collectors). When a Flume node starts, a corresponding process file stores its PID; the storage path can be set in flume-env.sh, for example: export FLUME_PID_DIR="/tmp/flume/pids".
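Putting the two settings above side by side (the directory paths are the same example values used in the text):

# In conf/log4j.properties -- directory for Flume's own log files
flume.log.dir=/tmp/flume/logs

# In conf/flume-env.sh -- directory for the per-node PID files
export FLUME_PID_DIR="/tmp/flume/pids"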