Architecture diagram
Data Flow graph
1. Some core concepts of Flume
2. Data flow model
The core of Flume's operation is the agent: the agent is Flume's smallest independent unit of operation, and each agent is a single JVM. A single agent is made up of three components, source, channel, and sink (see the architecture diagram above).
Data always flows through Flume in the form of events. An event is Flume's basic unit of data: it carries the log data (as a byte array) together with header information. Events are generated by a source from data outside the agent, for example a web server. When the source captures an event, it formats it in a specific way and then pushes it into one or more channels. You can think of the channel as a buffer that holds the event until a sink has finished processing it. The sink is responsible for persisting the log or forwarding the event on to another source.
The design is straightforward, and notably Flume ships with a large number of built-in source, channel, and sink types. Different types of sources, channels, and sinks can be freely combined; the combination is driven by user-defined configuration files and is very flexible. For example, a channel can keep events in memory or persist them to the local disk, and a sink can write logs to HDFS or HBase, or even forward them to another source.
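As a minimal sketch of such a combination (not taken from this article; the agent name, port, and sink choice are placeholders), a Flume 1.x properties file can wire a single agent from a netcat source through a memory channel to a logger sink:

    # one agent with one source, one channel, one sink
    agent1.sources = r1
    agent1.channels = c1
    agent1.sinks = k1

    # netcat source: turn each line received on a TCP port into an event
    agent1.sources.r1.type = netcat
    agent1.sources.r1.bind = localhost
    agent1.sources.r1.port = 44444
    agent1.sources.r1.channels = c1

    # memory channel: the buffer between source and sink
    agent1.channels.c1.type = memory

    # logger sink: write events to Flume's own log, handy for testing
    agent1.sinks.k1.type = logger
    agent1.sinks.k1.channel = c1

Swapping the sink type (for example to hdfs or hbase) changes the destination without touching the source; the agent is started with roughly bin/flume-ng agent -n agent1 -c conf -f <config file>.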
If you think that is all Flume can do, you would be badly mistaken. Flume lets users build multi-hop flows, meaning multiple agents can work together, and it supports fan-in, fan-out, contextual routing, and backup routes (failover), as shown in the data flow graph above.
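As an illustration of multi-hop collection (hostnames, ports, and agent names here are assumptions, not from the article), the sink of an upstream agent can forward events over Avro to the source of a downstream collector agent, and a source can fan out by replicating events into several channels:

    # upstream agent: avro sink pointing at the collector
    agent1.sinks.k1.type = avro
    agent1.sinks.k1.hostname = collector.example.com
    agent1.sinks.k1.port = 4545
    agent1.sinks.k1.channel = c1

    # downstream collector agent: avro source receiving from upstream agents
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4545
    collector.sources.r1.channels = c1

    # fan-out on the upstream agent: replicate each event into two channels
    agent1.sources.r1.channels = c1 c2
    agent1.sources.r1.selector.type = replicating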
3. High reliability
For software running in a production environment, high reliability is a must. Within a single agent, Flume uses transaction-based data delivery to guarantee the reliability of event delivery: the handoffs at the source and at the sink are each wrapped in a transaction, and events are held in the channel until they have been processed, at which point they are removed from the channel. This is the point-to-point reliability mechanism that Flume provides. From the perspective of a multi-hop flow, the sink of the upstream agent and the source of the downstream agent likewise each run their own transactions to guarantee the reliability of the data.
4. Recoverability
Recoverability is also guaranteed by the channel. FileChannel is recommended: events are persisted in the local file system, at the cost of some performance.
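As a sketch (the paths are placeholders), a FileChannel is configured by pointing it at local directories for its checkpoint and data files:

    # durable channel: events are persisted to the local file system
    # (ideally on a different disk from the program's own log files)
    agent1.channels.c1.type = file
    agent1.channels.c1.checkpointDir = /data/flume/checkpoint
    agent1.channels.c1.dataDirs = /data/flume/data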
5. Flume Overall Architecture Introduction
The Flume architecture as a whole is the three-tier Source --> Channel --> Sink structure (see figure one above), similar to a producer/consumer architecture, with the two sides decoupled by a queue (the channel).
Source: collects the log data, packages it into transactions and events, and pushes it into the channel.
Channel: acts as a queue, providing simple buffering of the data supplied by the source.
Sink: takes the data out of the channel and stores it in the corresponding file system or database, or submits it to a remote server.
The usage that requires the least change to existing programs is to directly read the log files that those programs already write; this achieves essentially seamless integration, with no changes to the existing programs at all.
There are two main types of source for reading files directly:
2.1 ExecSource
Data is collected by running a Unix command of your choice; the most common is tail -f [file].
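A minimal exec source definition along those lines might look as follows (agent name and file path are placeholders; tail -F is a common variant of tail -f that keeps following the file across rotation):

    # exec source: run tail and feed each new output line into the channel
    agent1.sources.r1.type = exec
    agent1.sources.r1.command = tail -F /var/log/app/app.log
    agent1.sources.r1.channels = c1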
Real-time transfer is possible, but when Flume is not running or the command errors out, the data is lost, and resuming from a breakpoint is not supported. Because there is no record of where reading of the file last stopped, there is no way to know where to start the next time; this matters especially when the log file keeps growing. If the Flume source goes down, any log content appended during that time cannot be read by the source when it is started again. However, Flume has an execStream extension: you can write your own monitor that watches the log for additions and, using your own tooling, ships the appended content to the Flume node, which then sends it on to the sink's node. It would be more complete if the tail-class sources themselves could remember where they were when a node goes down and continue the transfer from there the next time the node starts.
2.2 Spooling Directory Source
SpoolSource monitors a configured directory for new files and reads the data out of those files, achieving quasi-real-time collection. Two points to note: 1) files copied into the spool directory must not be opened for editing afterwards; 2) the spool directory must not contain subdirectories. In practice it can be used together with log4j: set log4j's file-rolling interval to 1 minute and copy the rolled files into the directory that spool monitors. log4j has a TimeRolling plugin that can place the rolled files into the spool directory, which essentially achieves real-time monitoring. After Flume has finished transferring a file, it modifies the file's suffix, renaming it to .COMPLETED (the suffix can also be specified flexibly in the configuration file).
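A sketch of a spooling directory source under these constraints (the directory path is a placeholder; .COMPLETED is the default suffix and can be overridden):

    # spooling directory source: watch a directory for newly dropped files
    agent1.sources.r1.type = spooldir
    agent1.sources.r1.spoolDir = /data/logs/spool
    agent1.sources.r1.fileSuffix = .COMPLETED
    agent1.sources.r1.channels = c1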
Comparison of ExecSource and SpoolSource: ExecSource can collect logs in real time, but when Flume is not running or the command fails, the log data cannot be collected and the integrity of the log data cannot be verified. SpoolSource cannot collect data in real time, but it can be used with files split by the minute, which approaches real time. If the application cannot be changed to cut its log files by the minute, the two collection methods can be used in combination.
Channels come in several kinds: MemoryChannel, JDBC Channel, MemoryRecoverChannel, and FileChannel. MemoryChannel achieves high throughput but cannot guarantee data integrity. MemoryRecoverChannel has been superseded; the official documentation recommends replacing it with FileChannel. FileChannel guarantees the integrity and consistency of the data. When configuring FileChannel, it is recommended to put the FileChannel directory and the program's log files on different disks to improve efficiency.
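For MemoryChannel, throughput is traded against durability and is tuned mainly through its capacity settings; the numbers below are illustrative only, not recommendations from this article:

    # in-memory channel: fast, but events are lost if the agent dies
    agent1.channels.c1.type = memory
    agent1.channels.c1.capacity = 100000
    agent1.channels.c1.transactionCapacity = 1000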
When setting up the sink for storage, data can be written to the file system, to a database, or into Hadoop. When the volume of log data is small, the data can be kept in the file system, with files produced at a fixed time interval. When there is more log data, the corresponding log data can be stored in Hadoop for later data analysis.
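An HDFS sink used this way is typically configured with roll settings that control how often files are closed; the sketch below rolls a new file every 5 minutes and disables size- and count-based rolling (the path and interval are placeholder choices):

    # HDFS sink: write events into Hadoop for later analysis
    agent1.sinks.k1.type = hdfs
    agent1.sinks.k1.channel = c1
    agent1.sinks.k1.hdfs.path = /flume/events
    agent1.sinks.k1.hdfs.fileType = DataStream
    agent1.sinks.k1.hdfs.rollInterval = 300
    agent1.sinks.k1.hdfs.rollSize = 0
    agent1.sinks.k1.hdfs.rollCount = 0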
Building a real-time big data system with Flume + Kafka + Storm + MySQL