Reprinted from: http://www.cnblogs.com/adealjason/p/6240122.html
Recently I wanted to experiment with stream processing, so I started by reading Flume's design principles and source code.
The source can be downloaded from the Apache website.
The following covers Flume's principles and how they are implemented in code:
Flume is a real-time data collection tool and part of the Hadoop ecosystem. It is mainly used to collect data from server nodes in a distributed environment and aggregate it into a unified storage platform. Flume supports several deployment architectures: a single-point agent, or layered schemes in which, for example, a load-balancing agent distributes the collected data across several sub-agents, which then aggregate it back to a single agent that writes to the unified storage platform. Diagrams of the supported deployment architectures can be found in the doc directory of the source distribution.
Flume principles:
The current version is Flume NG, and the following is based on it.
Flume consists of the following core concepts:
Flume Event: Flume's internal unit of data. It has two parts, a header and a body. The header is a Map&lt;String, String&gt;; the deployed agent can stamp data into it via built-in or custom interceptors, such as the IP or hostname identifying which server the message came from. The event flows through Flume and is the carrier of the transmitted data.
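The header-plus-body shape described above can be sketched in a few lines. This is a simplified illustration, not Flume's actual `Event` interface; the class and factory method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of Flume's event model: a Map<String, String> header
// plus a byte[] body. Names here are illustrative only.
class SimpleEventSketch {
    private final Map<String, String> headers = new HashMap<>();
    private byte[] body = new byte[0];

    Map<String, String> getHeaders() { return headers; }
    byte[] getBody() { return body; }
    void setBody(byte[] body) { this.body = body; }

    // An interceptor would typically stamp origin info (e.g. hostname)
    // into the header like this before the event enters the channel.
    static SimpleEventSketch of(String hostname, String line) {
        SimpleEventSketch e = new SimpleEventSketch();
        e.getHeaders().put("hostname", hostname);
        e.setBody(line.getBytes(StandardCharsets.UTF_8));
        return e;
    }
}
```

Downstream components can then route on the header while treating the body as an opaque payload.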
Flume Source: the data source. Flume supports many source types, e.g. Taildir monitors changes to a file, Spooldir monitors changes to a directory, JMSSource receives JMS messages, and so on. The most commonly used, AvroSource, is the basis of Flume's layered architecture. Source is an interface, and the provided implementations are listed in detail in the SourceType enum. Note in particular that, because Flume is programmed against interfaces, the enum contains an OTHER placeholder under which users can plug in a custom source; it must be loadable on the classpath when Flume starts (underneath, an instance is obtained from the Class via reflection).
Flume Channel: Flume is built on a pipeline model, and channels enrich its data paths. First, a channel buffers between source and sink and dynamically adjusts collection and transmission (an internal counter records every event received and sent), absorbing pressure between the two. Second, a channel can be associated with multiple sources: depending on configuration, a source can replicate its data into every pipeline, or dispatch it to a specific pipeline based on the message header. A channel can also be connected to multiple sinks, which gives the same data multiple send pools and so enables data reuse and load balancing. The carrier moved through a channel is the Event. Flume supports several buffering implementations, e.g. FileChannel, which caches data in files, and MemoryChannel, which caches in memory on top of a LinkedBlockingDeque (a doubly-ended blocking queue); see ChannelType.
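The buffering role of MemoryChannel can be sketched directly on a `LinkedBlockingDeque`, which the text notes is its underlying structure. This is only an illustration of the bounded-buffer idea; the names are not Flume's real Channel API, and the real MemoryChannel layers transactions on top.

```java
import java.util.concurrent.LinkedBlockingDeque;

// Sketch of the buffering idea behind MemoryChannel: a bounded deque
// between source and sink. Illustrative names, not Flume's API.
class MemoryBufferSketch {
    private final LinkedBlockingDeque<String> queue;

    MemoryBufferSketch(int capacity) {
        // Bounded: a full channel pushes back on the source.
        this.queue = new LinkedBlockingDeque<>(capacity);
    }

    // Non-blocking put; returns false when the channel is full.
    // (The real channel also offers blocking semantics.)
    boolean offer(String event) {
        return queue.offer(event);
    }

    // Non-blocking take; returns null when the channel is empty.
    String poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

Because the deque is FIFO here, events leave in arrival order, which matches the producer-consumer flow described later in the article.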
Flume Sink: the data send pool, mainly responsible for delivering data. It receives events from the channel and sends them to the designated receiver. Flume provides many sink implementations (see SinkType). Commonly used ones: LoggerSink, mainly for debugging a Flume deployment, simply writes received events out with log4j; RollingFileSink serializes received events into files under a directory, so it needs the directory path, the file-rolling frequency, and so on; AvroSink is the most common sink in a layered architecture and is generally paired with AvroSource. Avro is an Apache subproject for data serialization: the AvroSource agent listens on a port, the AvroSink agent sends its received data to that IP and port, and together they complete the tiered deployment. Avro is only a serialization tool here; underneath, an RpcClient transfers data between the sink and source (it is created automatically, as the startup log shows). And since Flume is coded against interfaces, custom sinks are supported just like custom sources.
Those are the core concepts. It is precisely this design philosophy and coding style that give Flume its strong extensibility.
Of course, these alone are not enough to make Flume run. Flume provides many auxiliary classes to drive the system and dispatch events internally, roughly as follows:
Configuration:
AgentConfiguration: as the name suggests, this is Flume's configuration domain object. The settings a user writes in flume-conf.properties are parsed into an AgentConfiguration; it is an object-oriented abstraction of the configuration file.
AbstractConfigurationProvider: as the name suggests, an abstract configuration provider. Its important method is getConfiguration(), which uses the following private methods to load Flume's channels, sources, sinks, and sink groups and wire them together:
loadChannels(agentConf, channelComponentMap);
loadSources(agentConf, channelComponentMap, sourceRunnerMap);
loadSinks(agentConf, channelComponentMap, sinkRunnerMap);
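The flat-properties-to-components wiring these methods perform can be sketched with plain `java.util.Properties`. This is a hypothetical illustration of the parsing pattern, not Flume's actual parser; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of getConfiguration()-style loading: pull the
// component names out of "agent.channels = c1 c2" style keys, then
// look up each component's own properties by prefix.
class ConfigSketch {
    // e.g. componentNames(p, "a1", "channels") -> ["c1", "c2"]
    static List<String> componentNames(Properties props, String agent, String kind) {
        String v = props.getProperty(agent + "." + kind, "").trim();
        return v.isEmpty() ? Collections.emptyList() : Arrays.asList(v.split("\\s+"));
    }

    // e.g. property(p, "a1", "channels", "c1", "type") -> "memory"
    static String property(Properties props, String agent, String kind,
                           String name, String key) {
        return props.getProperty(agent + "." + kind + "." + name + "." + key);
    }
}
```

The real provider additionally instantiates each component by its `type` value and associates sources and sinks with their channels.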
Flume also supports dynamic reloading: PollingPropertiesFileConfigurationProvider (a concrete implementation of AbstractConfigurationProvider) starts a FileWatcherRunnable thread when Flume starts, which monitors the configuration file for changes; reloading is driven through Google Guava's EventBus.
Drivers:
Flume's Source has two sub-interfaces: PollableSource and EventDrivenSource. The former must poll the data source to check whether data is currently available and, if so, convert it into Flume events; this interface adds a process() method for the polling loop, and its implementations include TaildirSource, SpoolDirectorySource, JMSSource, KafkaSource, and so on. The latter is an event-driven source that does not actively visit the data source; it only receives data-driven events and converts them into Flume events. Its implementations include ScribeSource (for data collected through Facebook's Scribe collection tool), AvroSource, etc.
SourceRunner:
Because these two kinds of source exist, Flume provides two SourceRunners to drive them: PollableSourceRunner and EventDrivenSourceRunner. The former automatically starts a PollingRunner thread that calls the process() method on a schedule.
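The polling loop just described can be sketched as follows. This is a simplified, single-threaded illustration of the PollingRunner idea; the interface and method names are stand-ins, not Flume's real API, and the real runner sleeps with increasing backoff between failed polls.

```java
// Sketch of the PollableSourceRunner idea: repeatedly invoke process()
// on a pollable source and note when it reports no data (BACKOFF).
class PollingRunnerSketch {
    enum Status { READY, BACKOFF }

    interface PollableSourceSketch {
        Status process(); // pull from the underlying data source
    }

    // Drive the source for maxPolls iterations, counting successful polls.
    // A real runner loops forever on its own thread and sleeps on BACKOFF.
    static int drive(PollableSourceSketch source, int maxPolls) {
        int delivered = 0;
        for (int i = 0; i < maxPolls; i++) {
            if (source.process() == Status.READY) {
                delivered++;
            }
        }
        return delivered;
    }
}
```

An EventDrivenSourceRunner needs no such loop, since the source's own listener (e.g. an Avro server port) pushes events in.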
ChannelProcessor:
This class moves data from a source into its channels and lets one source be associated with several channels. In short: the Source interface's setChannelProcessor(ChannelProcessor channelProcessor) assigns a ChannelProcessor; the ChannelProcessor holds a final ChannelSelector; and the selector is bound to its channels via setChannels(List&lt;Channel&gt; channels).
ChannelSelector:
The ChannelProcessor delegates to the configured ChannelSelector, which comes in two flavors: ReplicatingChannelSelector copies each event from the source into every channel, while MultiplexingChannelSelector routes an event to a specific channel based on the header information in its head node.
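The two selector policies reduce to very little code. The following is an illustrative sketch, not Flume's ChannelSelector API: replicating returns every channel, multiplexing looks the target channels up by a header value with a default fallback.

```java
import java.util.List;
import java.util.Map;

// Sketch of the two channel-selection policies. Channels are represented
// by name strings here; Flume works with Channel objects.
class SelectorSketch {
    // Replicating: every channel receives the event.
    static List<String> replicating(List<String> channels,
                                    Map<String, String> headers) {
        return channels; // headers are ignored for replication
    }

    // Multiplexing: route by a header value, falling back to defaults
    // when no mapping matches.
    static List<String> multiplexing(Map<String, List<String>> routes,
                                     List<String> defaults,
                                     Map<String, String> headers,
                                     String headerKey) {
        return routes.getOrDefault(headers.get(headerKey), defaults);
    }
}
```

This mirrors the config later in the article: `selector.type = replicating` copies taildir events into both the memory and file channels.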
Transaction and BasicTransactionSemantics:
Inside a channel, Flume guarantees that an event is delivered within a transaction: if the send or the receive fails, the transaction is rolled back, and only on success is the event actually removed from the channel.
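The take-commit-rollback cycle can be sketched like this. It is an illustration of the semantics only, not Flume's Transaction/BasicTransactionSemantics API, and it stages a single event where the real channel batches many per transaction.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of transactional take: an event leaves the channel only on
// commit; rollback returns it to the head of the queue.
class TxChannelSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private String inFlight;

    void put(String event) { queue.addLast(event); }

    // Begin: stage one event for the sink.
    String take() {
        inFlight = queue.pollFirst();
        return inFlight;
    }

    // Success: the staged event is gone for good.
    void commit() { inFlight = null; }

    // Failure: the staged event goes back so it can be retried.
    void rollback() {
        if (inFlight != null) queue.addFirst(inFlight);
        inFlight = null;
    }

    int size() { return queue.size(); }
}
```

This is what gives Flume its at-least-once delivery between hops: a failed sink leaves the event in the channel rather than losing it.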
SinkProcessor:
Decides which sink an event is sent to. The class has two implementations:
LoadBalancingSinkProcessor:
Load balancing: provides the round_robin and random algorithms, as well as a fixed-order mode, for spreading the channel's events across multiple sinks.
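Round-robin selection itself is a few lines. This sketch shows only the rotation; the real LoadBalancingSinkProcessor also handles sink failure and backoff, and the class name here is illustrative.

```java
import java.util.List;

// Sketch of round_robin sink selection: each call returns the next sink
// in a fixed rotation. Sinks are represented by name strings.
class RoundRobinSketch {
    private final List<String> sinks;
    private int next = 0;

    RoundRobinSketch(List<String> sinks) {
        this.sinks = sinks;
    }

    String select() {
        String sink = sinks.get(next);
        next = (next + 1) % sinks.size(); // wrap around to the first sink
        return sink;
    }
}
```

With the two Avro sinks configured later in the article, this rotation is what alternates events between the two downstream machines.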
FailoverSinkProcessor:
Implements failover. The flow is similar to LoadBalancingSinkProcessor; the difference is that FailoverSinkProcessor maintains a PriorityQueue and chooses the sink according to its priority.
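The priority-queue idea can be sketched as follows: the highest-priority live sink is always active, and when it fails the next one takes over. This is illustrative only; Flume's FailoverSinkProcessor additionally parks failed sinks and retries them after a penalty interval.

```java
import java.util.PriorityQueue;

// Sketch of failover: sinks ordered by priority in a PriorityQueue,
// highest priority first. Illustrative names, not Flume's API.
class FailoverSketch {
    static final class Sink {
        final String name;
        final int priority;
        Sink(String name, int priority) { this.name = name; this.priority = priority; }
    }

    // Comparator reversed so the largest priority sits at the head.
    private final PriorityQueue<Sink> live =
        new PriorityQueue<>((a, b) -> Integer.compare(b.priority, a.priority));

    void add(String name, int priority) { live.add(new Sink(name, priority)); }

    // The sink all traffic currently goes to.
    String active() { return live.isEmpty() ? null : live.peek().name; }

    // On failure, drop to the next-highest priority sink.
    // (The real processor would re-admit the sink after a cooldown.)
    void markFailed() { live.poll(); }
}
```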
SinkRunner:
This class drives a sink: it internally starts a PollingRunner thread that calls the SinkProcessor on a schedule.
Those are the core concepts and their roles in the code. The following describes Flume's running flow:
1. At startup, a Flume agent is assembled according to the user-defined configuration.
2. SourceRunner and SinkProcessor run concurrently: one produces events into the channel, the other consumes events from it; internally this is a producer-consumer model.
3. The auxiliary classes above wire channels to sources and sinks, enabling multi-channel dispatch and the layered architecture.
Below is a Flume configuration I put together myself, for reference.
What it implements:
Load balancing + dispatch + landing in log files
1. Load-balancing node:
Reads data from two file sources, adds a source identifier to each event header, and replicates events into two channels: one is printed to the log, the other is load-balanced to two downstream machines using the round_robin algorithm.
loadBalancAgent.sources = taildirSrc
loadBalancAgent.channels = memoryChannel fileChannel
loadBalancAgent.sinks = loggerSink1 loggerSink2 loggerSink3
loadBalancAgent.sinkgroups = loadBalanceGroups
## taildirSrc config
loadBalancAgent.sources.taildirSrc.type = TAILDIR
loadBalancAgent.sources.taildirSrc.positionFile = /alidata1/admin/opensystem/flumetest/log/taildir_position.json
loadBalancAgent.sources.taildirSrc.filegroups = f1 f2
loadBalancAgent.sources.taildirSrc.filegroups.f1 = /alidata1/admin/dts-server-web/dts-server.log
loadBalancAgent.sources.taildirSrc.headers.f1.headerKey1 = dts-server-log
loadBalancAgent.sources.taildirSrc.filegroups.f2 = /alidata1/admin/flume/test.log
loadBalancAgent.sources.taildirSrc.headers.f2.headerKey1 = flume-test-log
loadBalancAgent.sources.taildirSrc.fileHeader = true
## replicating channel config
loadBalancAgent.sources.taildirSrc.selector.type = replicating
loadBalancAgent.sources.taildirSrc.channels = memoryChannel fileChannel
loadBalancAgent.sources.taildirSrc.selector.optional = fileChannel
## memory channel config
loadBalancAgent.channels.memoryChannel.type = memory
loadBalancAgent.channels.memoryChannel.capacity = 10000
loadBalancAgent.channels.memoryChannel.transactionCapacity = 10000
loadBalancAgent.channels.memoryChannel.byteCapacityBufferPercentage = 20
loadBalancAgent.channels.memoryChannel.byteCapacity = 800000
## file channel config
loadBalancAgent.channels.fileChannel.type = file
loadBalancAgent.channels.fileChannel.checkpointDir = /alidata1/admin/opensystem/flumetest/log
loadBalancAgent.channels.fileChannel.dataDirs = /alidata1/admin/opensystem/flumetest/data
## load-balancing sink processor
loadBalancAgent.sinkgroups.loadBalanceGroups.sinks = loggerSink1 loggerSink2
loadBalancAgent.sinkgroups.loadBalanceGroups.processor.type = load_balance
loadBalancAgent.sinkgroups.loadBalanceGroups.processor.backoff = true
loadBalancAgent.sinkgroups.loadBalanceGroups.processor.selector = round_robin
## loggerSink1 config
loadBalancAgent.sinks.loggerSink1.type = avro
loadBalancAgent.sinks.loggerSink1.channel = memoryChannel
loadBalancAgent.sinks.loggerSink1.hostname = 10.253.42.162
loadBalancAgent.sinks.loggerSink1.port = 4141
## loggerSink2 config
loadBalancAgent.sinks.loggerSink2.type = avro
loadBalancAgent.sinks.loggerSink2.channel = memoryChannel
loadBalancAgent.sinks.loggerSink2.hostname = 10.139.53.6
loadBalancAgent.sinks.loggerSink2.port = 4141
## loggerSink3 config
loadBalancAgent.sinks.loggerSink3.type = file_roll
loadBalancAgent.sinks.loggerSink3.channel = fileChannel
loadBalancAgent.sinks.loggerSink3.sink.rollInterval = 0
loadBalancAgent.sinks.loggerSink3.sink.directory = /alidata1/admin/opensystem/flumetest/dtsserverlog
2. Dispatch node 1
Receives events from the upstream Avro sink and prints them via the logger sink.
dispatchAgent.sources = avroSrc
dispatchAgent.channels = memoryChannel
dispatchAgent.sinks = loggerSink
## avroSrc config
dispatchAgent.sources.avroSrc.type = avro
dispatchAgent.sources.avroSrc.channels = memoryChannel
dispatchAgent.sources.avroSrc.bind = 0.0.0.0
dispatchAgent.sources.avroSrc.port = 4141
## memoryChannel config
dispatchAgent.channels.memoryChannel.type = memory
dispatchAgent.channels.memoryChannel.capacity = 10000
dispatchAgent.channels.memoryChannel.transactionCapacity = 10000
dispatchAgent.channels.memoryChannel.byteCapacityBufferPercentage = 20
dispatchAgent.channels.memoryChannel.byteCapacity = 800000
## loggerSink config
dispatchAgent.sinks.loggerSink.type = logger
dispatchAgent.sinks.loggerSink.channel = memoryChannel
3. Dispatch node 2
dispatchAgent.sources = avroSrc
dispatchAgent.channels = memoryChannel
dispatchAgent.sinks = loggerSink
## avroSrc config
dispatchAgent.sources.avroSrc.type = avro
dispatchAgent.sources.avroSrc.channels = memoryChannel
dispatchAgent.sources.avroSrc.bind = 0.0.0.0
dispatchAgent.sources.avroSrc.port = 4141
## memoryChannel config
dispatchAgent.channels.memoryChannel.type = memory
dispatchAgent.channels.memoryChannel.capacity = 10000
dispatchAgent.channels.memoryChannel.transactionCapacity = 10000
dispatchAgent.channels.memoryChannel.byteCapacityBufferPercentage = 20
dispatchAgent.channels.memoryChannel.byteCapacity = 800000
## loggerSink config
dispatchAgent.sinks.loggerSink.type = logger
dispatchAgent.sinks.loggerSink.channel = memoryChannel