Building a big data real-time system with Flume + Kafka + Storm + MySQL


Architecture diagram

Data Flow graph

1. Some core concepts of Flume

2. Data flow model

The agent is the smallest independently running unit in Flume; an agent is a single JVM process. A single agent is made up of three components: source, sink, and channel.

Flume data flow is always carried by events. An event is Flume's basic unit of data: it carries the log data (as a byte array) together with header information. Events are generated by sources outside the agent, such as a web server. When the source captures an event, it formats it in a specific way and then pushes it into one or more channels. You can think of the channel as a buffer that holds the event until a sink has finished processing it. The sink is responsible for persisting the log or pushing the event on to another source.
It is a very straightforward design. Notably, Flume provides a large number of built-in source, channel, and sink types, and different types can be freely combined. The combination is driven by a user-defined configuration file, which makes it very flexible. For example, a channel can keep events in memory or persist them to the local disk, and a sink can write logs to HDFS, HBase, or even another source.
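As a minimal sketch (the agent name, port, and component names below are made up for illustration), a single-agent configuration wires one source, one channel, and one sink together like this:

    agent.sources = r1
    agent.channels = c1
    agent.sinks = k1

    # netcat source: turns each line received on the port into an event
    agent.sources.r1.type = netcat
    agent.sources.r1.bind = localhost
    agent.sources.r1.port = 44444
    agent.sources.r1.channels = c1

    # memory channel: the buffer that holds events until the sink consumes them
    agent.channels.c1.type = memory
    agent.channels.c1.capacity = 1000

    # logger sink: simply prints events; could equally be hdfs, hbase, avro, ...
    agent.sinks.k1.type = logger
    agent.sinks.k1.channel = c1

Started with something like bin/flume-ng agent --conf conf --conf-file example.conf --name agent, any line sent to the port flows source --> channel --> sink.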
If you think that is all Flume can do, you are mistaken. Flume lets users build multi-level flows, which means multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes.
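As a rough sketch of a fan-out, multi-level setup (the host name, ports, and component names are illustrative), one source can replicate events into two channels, with one sink forwarding to a downstream agent over Avro and the other keeping a local copy:

    agent.sources = r1
    agent.channels = c1 c2
    agent.sinks = k1 k2

    # the replicating selector copies every event into both channels (fan-out)
    agent.sources.r1.type = netcat
    agent.sources.r1.bind = localhost
    agent.sources.r1.port = 44444
    agent.sources.r1.selector.type = replicating
    agent.sources.r1.channels = c1 c2

    agent.channels.c1.type = memory
    agent.channels.c2.type = memory

    # k1 sends events to the avro source of the next agent (multi-level flow)
    agent.sinks.k1.type = avro
    agent.sinks.k1.hostname = next-agent.example.com
    agent.sinks.k1.port = 4545
    agent.sinks.k1.channel = c1

    # k2 keeps a local copy of the same events
    agent.sinks.k2.type = logger
    agent.sinks.k2.channel = c2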

3. High reliability

For software running in a production environment, high reliability is a must. Within a single agent, Flume uses transaction-based data transfer to guarantee reliable event delivery. The source and the sink are each wrapped in a transaction: events are held in the channel until they have been processed, and only then are they removed from the channel. This is the point-to-point reliability mechanism that Flume provides. From the perspective of a multi-stage flow, the sink of the upstream agent and the source of the downstream agent likewise run their own transactions to guarantee the reliability of the data.

4. Recoverability

Recoverability also relies on the channel. FileChannel is recommended: it persists events in the local file system (at the cost of lower performance).

5. Introduction to the overall Flume architecture

Overall, the Flume architecture is a three-tier source --> channel --> sink pipeline (see the architecture diagram above), a structure similar to the producer/consumer pattern, decoupled by a queue (the channel).

Source: collects the log data, packages it into transactions and events, and pushes it into the channel.
Channel: acts as a queue, providing simple buffering of the data supplied by the source.
Sink: takes the data out of the channel and stores it in the corresponding file system or database, or submits it to a remote server.
The way to minimize changes to existing programs is to have Flume read directly the log files the programs already write; this basically achieves seamless integration without requiring any modification to the existing programs.
There are two main types of source for reading files directly:

2.1 ExecSource

  You can collect data by having the source run a Unix command; the most common is tail -f [file].
This allows real-time transmission, but data is lost whenever Flume is not running or the command fails, and resuming from a breakpoint is not supported: since there is no record of where the file was last read, there is no way to know where to start reading the next time. This matters especially when the log file keeps growing; if the Flume source goes down, whatever is appended to the log while it is down cannot be read by the source once it is started again. However, Flume has an execStream extension: you can write your own monitor for log growth and have your own tool push the appended content to a Flume node, which then sends it on to the sink node. It would be even better if a tail-style source itself recorded its position when the node goes down and resumed transmission from there the next time the node is started.
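A minimal sketch of an ExecSource configuration (the file path and component names are placeholders):

    agent.sources = r1
    agent.channels = c1
    agent.sinks = k1

    # exec source: runs the command and turns each output line into an event;
    # as noted above, nothing written while the agent was down is re-read
    # (-F keeps following the file across log rotation)
    agent.sources.r1.type = exec
    agent.sources.r1.command = tail -F /var/log/app/app.log
    agent.sources.r1.channels = c1

    agent.channels.c1.type = memory

    agent.sinks.k1.type = logger
    agent.sinks.k1.channel = c1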

2.2 Spooling Directory Source

  SpoolSource: monitors a configured directory for new files and reads the data out of them, achieving near-real-time collection. Two points to note: 1) files copied into the spool directory must not be opened for editing; 2) the spool directory must not contain subdirectories. In practice it can be combined with log4j: set log4j's file-rolling interval to one minute and copy the rolled files into the monitored spool directory. log4j has a TimeRolling plugin that can place the rolled files directly into the spool directory, which basically achieves real-time monitoring. After Flume has finished transferring a file, it renames it with a suffix, .COMPLETED by default (the suffix can also be specified flexibly in the configuration file).
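A sketch of a Spooling Directory Source watching a hypothetical directory that the rolled log4j files are copied into:

    agent.sources = r1
    agent.channels = c1
    agent.sinks = k1

    # spooldir source: picks up complete files dropped into the directory;
    # files must not be edited afterwards and subdirectories are not allowed
    agent.sources.r1.type = spooldir
    agent.sources.r1.spoolDir = /var/log/flume-spool
    agent.sources.r1.fileSuffix = .COMPLETED
    agent.sources.r1.channels = c1

    agent.channels.c1.type = memory

    agent.sinks.k1.type = logger
    agent.sinks.k1.channel = c1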
Comparing ExecSource and SpoolSource: ExecSource can collect logs in real time, but if Flume is not running or the command fails, the log data cannot be collected and its integrity cannot be guaranteed. SpoolSource cannot collect data in real time, but it can work with files rolled by the minute, which approaches real time. If the application cannot roll its log files by the minute, the two collection methods can be used in combination.
Channels come in several flavors: MemoryChannel, JDBC Channel, MemoryRecoverChannel, and FileChannel. MemoryChannel achieves high throughput but cannot guarantee data integrity. MemoryRecoverChannel has been superseded by FileChannel, as recommended by the official documentation. FileChannel guarantees the integrity and consistency of the data. When configuring FileChannel, it is recommended to put the FileChannel directories and the program's own log files on different disks to improve efficiency.
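A sketch of a FileChannel configuration; the paths are placeholders and, following the advice above, would sit on a different disk than the program's own log files:

    agent.channels = c1
    agent.channels.c1.type = file
    # checkpoint and data directories: put these on a separate physical disk
    agent.channels.c1.checkpointDir = /data1/flume/checkpoint
    agent.channels.c1.dataDirs = /data1/flume/data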
When a sink stores the data, it can write it to the file system, to a database, or to Hadoop. When the volume of log data is small, the data can be kept in the file system with a certain roll interval; when there is more log data, the corresponding log data can be stored in Hadoop for later data analysis.
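A sketch of an HDFS sink that rolls files on a fixed time interval (the NameNode address, path, and interval are placeholders):

    agent.sinks = k1
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.channel = c1
    agent.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
    # write plain text and roll a new file every 10 minutes rather than by size
    agent.sinks.k1.hdfs.fileType = DataStream
    agent.sinks.k1.hdfs.writeFormat = Text
    agent.sinks.k1.hdfs.rollInterval = 600
    agent.sinks.k1.hdfs.rollSize = 0
    agent.sinks.k1.hdfs.rollCount = 0
    # the %Y-%m-%d escapes need a timestamp header, hence useLocalTimeStamp
    agent.sinks.k1.hdfs.useLocalTimeStamp = true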
