Reprinted from: http://www.cnblogs.com/adealjason/p/6240122.html
Recently I wanted to experiment with stream processing, so I started by reading Flume's design principles and source code.
The source can be downloaded from the Apache website.
The following covers Flume's principles and how they are implemented in code:
Flume is a real-time data collection tool and part of the Hadoop ecosystem. It is mainly used to collect data from server nodes in a distributed environment and aggregate it into a unified storage platform. Flume supports several deployment architectures: a single-point agent, or layered schemes in which, for example, a load-balancing agent distributes the collected data across several sub-agents, which then aggregate it back to a single agent that writes to the unified storage platform. Diagrams of the supported deployment architectures can be found in the doc directory of the source distribution.
Flume principles:
The current version is Flume NG, and the following is based on it.
Flume consists of the following core concepts:
Flume Event: Flume's internal unit of data. It has two parts, a header and a body. The header is a Map&lt;String, String&gt;; the deployed agent can stamp data into it via built-in or custom interceptors, such as the IP or hostname identifying which server the message came from. The event flows through Flume and is the carrier of the transmitted data.
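The header-plus-body shape described above can be sketched in a few lines. This is a simplified illustration, not Flume's actual `Event` interface; the class and factory method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of Flume's event model: a Map<String, String> header
// plus a byte[] body. Names here are illustrative only.
class SimpleEventSketch {
    private final Map<String, String> headers = new HashMap<>();
    private byte[] body = new byte[0];

    Map<String, String> getHeaders() { return headers; }
    byte[] getBody() { return body; }
    void setBody(byte[] body) { this.body = body; }

    // An interceptor would typically stamp origin info (e.g. hostname)
    // into the header like this before the event enters the channel.
    static SimpleEventSketch of(String hostname, String line) {
        SimpleEventSketch e = new SimpleEventSketch();
        e.getHeaders().put("hostname", hostname);
        e.setBody(line.getBytes(StandardCharsets.UTF_8));
        return e;
    }
}
```

Downstream components can then route on the header while treating the body as an opaque payload.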
Flume Source: the data source. Flume supports many source types, e.g. Taildir monitors changes to a file, Spooldir monitors changes to a directory, JMSSource receives JMS messages, and so on. The most commonly used, AvroSource, is the basis of Flume's layered architecture. Source is an interface, and the provided implementations are listed in detail in the SourceType enum. Note in particular that, because Flume is programmed against interfaces, the enum contains an OTHER placeholder under which users can plug in a custom source; it must be loadable on the classpath when Flume starts (underneath, an instance is obtained from the Class via reflection).
Flume Channel: Flume is built on a pipeline model, and channels enrich its data paths. First, a channel buffers between source and sink and dynamically adjusts collection and transmission (an internal counter records every event received and sent), absorbing pressure between the two. Second, a channel can be associated with multiple sources: depending on configuration, a source can replicate its data into every pipeline, or dispatch it to a specific pipeline based on the message header. A channel can also be connected to multiple sinks, which gives the same data multiple send pools and so enables data reuse and load balancing. The carrier moved through a channel is the Event. Flume supports several buffering implementations, e.g. FileChannel, which caches data in files, and MemoryChannel, which caches in memory on top of a LinkedBlockingDeque (a doubly-ended blocking queue); see ChannelType.
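The buffering role of MemoryChannel can be sketched directly on a `LinkedBlockingDeque`, which the text notes is its underlying structure. This is only an illustration of the bounded-buffer idea; the names are not Flume's real Channel API, and the real MemoryChannel layers transactions on top.

```java
import java.util.concurrent.LinkedBlockingDeque;

// Sketch of the buffering idea behind MemoryChannel: a bounded deque
// between source and sink. Illustrative names, not Flume's API.
class MemoryBufferSketch {
    private final LinkedBlockingDeque<String> queue;

    MemoryBufferSketch(int capacity) {
        // Bounded: a full channel pushes back on the source.
        this.queue = new LinkedBlockingDeque<>(capacity);
    }

    // Non-blocking put; returns false when the channel is full.
    // (The real channel also offers blocking semantics.)
    boolean offer(String event) {
        return queue.offer(event);
    }

    // Non-blocking take; returns null when the channel is empty.
    String poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

Because the deque is FIFO here, events leave in arrival order, which matches the producer-consumer flow described later in the article.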
Flume Sink: the data send pool, mainly responsible for delivering data. It receives events from the channel and sends them to the designated receiver. Flume provides many sink implementations (see SinkType). Commonly used ones: LoggerSink, mainly for debugging a Flume deployment, simply writes received events out with log4j; RollingFileSink serializes received events into files under a directory, so it needs the directory path, the file-rolling frequency, and so on; AvroSink is the most common sink in a layered architecture and is generally paired with AvroSource. Avro is an Apache subproject for data serialization: the AvroSource agent listens on a port, the AvroSink agent sends its received data to that IP and port, and together they complete the tiered deployment. Avro is only a serialization tool here; underneath, an RpcClient transfers data between the sink and source (it is created automatically, as the startup log shows). And since Flume is coded against interfaces, custom sinks are supported just like custom sources.
Those are the core concepts. It is precisely this design philosophy and coding style that give Flume its strong extensibility.
Of course, these alone are not enough to make Flume run. Flume provides many auxiliary classes to drive the system and dispatch events internally, roughly as follows:
Configuration:
AgentConfiguration: as the name suggests, this is Flume's configuration domain object. The settings a user writes in flume-conf.properties are parsed into an AgentConfiguration; it is an object-oriented abstraction of the configuration file.
AbstractConfigurationProvider: as the name suggests, an abstract configuration provider. Its important method is getConfiguration(), which uses the following private methods to load Flume's channels, sources, sinks, and sink groups and wire them together:
loadChannels(agentConf, channelComponentMap);
loadSources(agentConf, channelComponentMap, sourceRunnerMap);
loadSinks(agentConf, channelComponentMap, sinkRunnerMap);
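The flat-properties-to-components wiring these methods perform can be sketched with plain `java.util.Properties`. This is a hypothetical illustration of the parsing pattern, not Flume's actual parser; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of getConfiguration()-style loading: pull the
// component names out of "agent.channels = c1 c2" style keys, then
// look up each component's own properties by prefix.
class ConfigSketch {
    // e.g. componentNames(p, "a1", "channels") -> ["c1", "c2"]
    static List<String> componentNames(Properties props, String agent, String kind) {
        String v = props.getProperty(agent + "." + kind, "").trim();
        return v.isEmpty() ? Collections.emptyList() : Arrays.asList(v.split("\\s+"));
    }

    // e.g. property(p, "a1", "channels", "c1", "type") -> "memory"
    static String property(Properties props, String agent, String kind,
                           String name, String key) {
        return props.getProperty(agent + "." + kind + "." + name + "." + key);
    }
}
```

The real provider additionally instantiates each component by its `type` value and associates sources and sinks with their channels.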
Flume also supports dynamic reloading: PollingPropertiesFileConfigurationProvider (a concrete implementation of AbstractConfigurationProvider) starts a FileWatcherRunnable thread when Flume starts, which monitors the configuration file for changes; reloading is driven through Google Guava's EventBus.
Drivers:
Flume's Source has two sub-interfaces: PollableSource and EventDrivenSource. The former must poll the data source to check whether data is currently available and, if so, convert it into Flume events; this interface adds a process() method for the polling loop, and its implementations include TaildirSource, SpoolDirectorySource, JMSSource, KafkaSource, and so on. The latter is an event-driven source that does not actively visit the data source; it only receives data-driven events and converts them into Flume events. Its implementations include ScribeSource (for data collected through Facebook's Scribe collection tool), AvroSource, etc.
SourceRunner:
Because these two kinds of source exist, Flume provides two SourceRunners to drive them: PollableSourceRunner and EventDrivenSourceRunner. The former automatically starts a PollingRunner thread that calls the process() method on a schedule.
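The polling loop just described can be sketched as follows. This is a simplified, single-threaded illustration of the PollingRunner idea; the interface and method names are stand-ins, not Flume's real API, and the real runner sleeps with increasing backoff between failed polls.

```java
// Sketch of the PollableSourceRunner idea: repeatedly invoke process()
// on a pollable source and note when it reports no data (BACKOFF).
class PollingRunnerSketch {
    enum Status { READY, BACKOFF }

    interface PollableSourceSketch {
        Status process(); // pull from the underlying data source
    }

    // Drive the source for maxPolls iterations, counting successful polls.
    // A real runner loops forever on its own thread and sleeps on BACKOFF.
    static int drive(PollableSourceSketch source, int maxPolls) {
        int delivered = 0;
        for (int i = 0; i < maxPolls; i++) {
            if (source.process() == Status.READY) {
                delivered++;
            }
        }
        return delivered;
    }
}
```

An EventDrivenSourceRunner needs no such loop, since the source's own listener (e.g. an Avro server port) pushes events in.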
ChannelProcessor:
This class moves data from a source into its channels and lets one source be associated with several channels. In short: the Source interface's setChannelProcessor(ChannelProcessor channelProcessor) assigns a ChannelProcessor; the ChannelProcessor holds a final ChannelSelector; and the selector is bound to its channels via setChannels(List&lt;Channel&gt; channels).
ChannelSelector:
The ChannelProcessor delegates to the configured ChannelSelector, which comes in two flavors: ReplicatingChannelSelector copies each event from the source into every channel, while MultiplexingChannelSelector routes an event to a specific channel based on the header information in its head node.
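The two selector policies reduce to very little code. The following is an illustrative sketch, not Flume's ChannelSelector API: replicating returns every channel, multiplexing looks the target channels up by a header value with a default fallback.

```java
import java.util.List;
import java.util.Map;

// Sketch of the two channel-selection policies. Channels are represented
// by name strings here; Flume works with Channel objects.
class SelectorSketch {
    // Replicating: every channel receives the event.
    static List<String> replicating(List<String> channels,
                                    Map<String, String> headers) {
        return channels; // headers are ignored for replication
    }

    // Multiplexing: route by a header value, falling back to defaults
    // when no mapping matches.
    static List<String> multiplexing(Map<String, List<String>> routes,
                                     List<String> defaults,
                                     Map<String, String> headers,
                                     String headerKey) {
        return routes.getOrDefault(headers.get(headerKey), defaults);
    }
}
```

This mirrors the config later in the article: `selector.type = replicating` copies taildir events into both the memory and file channels.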
Transaction and BasicTransactionSemantics:
Inside a channel, Flume guarantees that an event is delivered within a transaction: if the send or the receive fails, the transaction is rolled back, and only on success is the event actually removed from the channel.
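The take-commit-rollback cycle can be sketched like this. It is an illustration of the semantics only, not Flume's Transaction/BasicTransactionSemantics API, and it stages a single event where the real channel batches many per transaction.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of transactional take: an event leaves the channel only on
// commit; rollback returns it to the head of the queue.
class TxChannelSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private String inFlight;

    void put(String event) { queue.addLast(event); }

    // Begin: stage one event for the sink.
    String take() {
        inFlight = queue.pollFirst();
        return inFlight;
    }

    // Success: the staged event is gone for good.
    void commit() { inFlight = null; }

    // Failure: the staged event goes back so it can be retried.
    void rollback() {
        if (inFlight != null) queue.addFirst(inFlight);
        inFlight = null;
    }

    int size() { return queue.size(); }
}
```

This is what gives Flume its at-least-once delivery between hops: a failed sink leaves the event in the channel rather than losing it.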
SinkProcessor:
Decides which sink an event is sent to. The class has two implementations:
LoadBalancingSinkProcessor:
Load balancing: provides the round_robin and random algorithms, as well as a fixed-order mode, for spreading the channel's events across multiple sinks.
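Round-robin selection itself is a few lines. This sketch shows only the rotation; the real LoadBalancingSinkProcessor also handles sink failure and backoff, and the class name here is illustrative.

```java
import java.util.List;

// Sketch of round_robin sink selection: each call returns the next sink
// in a fixed rotation. Sinks are represented by name strings.
class RoundRobinSketch {
    private final List<String> sinks;
    private int next = 0;

    RoundRobinSketch(List<String> sinks) {
        this.sinks = sinks;
    }

    String select() {
        String sink = sinks.get(next);
        next = (next + 1) % sinks.size(); // wrap around to the first sink
        return sink;
    }
}
```

With the two Avro sinks configured later in the article, this rotation is what alternates events between the two downstream machines.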
FailoverSinkProcessor:
Implements failover. The flow is similar to LoadBalancingSinkProcessor; the difference is that FailoverSinkProcessor maintains a PriorityQueue and chooses the sink according to its priority.
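The priority-queue idea can be sketched as follows: the highest-priority live sink is always active, and when it fails the next one takes over. This is illustrative only; Flume's FailoverSinkProcessor additionally parks failed sinks and retries them after a penalty interval.

```java
import java.util.PriorityQueue;

// Sketch of failover: sinks ordered by priority in a PriorityQueue,
// highest priority first. Illustrative names, not Flume's API.
class FailoverSketch {
    static final class Sink {
        final String name;
        final int priority;
        Sink(String name, int priority) { this.name = name; this.priority = priority; }
    }

    // Comparator reversed so the largest priority sits at the head.
    private final PriorityQueue<Sink> live =
        new PriorityQueue<>((a, b) -> Integer.compare(b.priority, a.priority));

    void add(String name, int priority) { live.add(new Sink(name, priority)); }

    // The sink all traffic currently goes to.
    String active() { return live.isEmpty() ? null : live.peek().name; }

    // On failure, drop to the next-highest priority sink.
    // (The real processor would re-admit the sink after a cooldown.)
    void markFailed() { live.poll(); }
}
```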
SinkRunner:
This class drives a sink: it internally starts a PollingRunner thread that calls the SinkProcessor on a schedule.
Those are the core concepts and their roles in the code. The following describes Flume's running flow:
1. At startup, a Flume agent is assembled according to the user-defined configuration.
2. SourceRunner and SinkProcessor run concurrently: one produces events into the channel, the other consumes events from it; internally this is a producer-consumer model.
3. The auxiliary classes above wire channels to sources and sinks, enabling multi-channel dispatch and the layered architecture.
Below is a Flume configuration I put together myself, for reference.
What it implements:
Load balancing + dispatch + landing in log files
1. Load-balancing node:
Reads data from two file sources, adds a source identifier to each event header, and replicates events into two channels: one is printed to the log, the other is load-balanced to two downstream machines using the round_robin algorithm.
loadBalancAgent.sources = taildirSrc
loadBalancAgent.channels = memoryChannel fileChannel
loadBalancAgent.sinks = loggerSink1 loggerSink2 loggerSink3
loadBalancAgent.sinkgroups = loadBalanceGroups
## taildirSrc config
loadBalancAgent.sources.taildirSrc.type = TAILDIR
loadBalancAgent.sources.taildirSrc.positionFile = /alidata1/admin/opensystem/flumetest/log/taildir_position.json
loadBalancAgent.sources.taildirSrc.filegroups = f1 f2
loadBalancAgent.sources.taildirSrc.filegroups.f1 = /alidata1/admin/dts-server-web/dts-server.log
loadBalancAgent.sources.taildirSrc.headers.f1.headerKey1 = dts-server-log
loadBalancAgent.sources.taildirSrc.filegroups.f2 = /alidata1/admin/flume/test.log
loadBalancAgent.sources.taildirSrc.headers.f2.headerKey1 = flume-test-log
loadBalancAgent.sources.taildirSrc.fileHeader = true
## replicating channel config
loadBalancAgent.sources.taildirSrc.selector.type = replicating
loadBalancAgent.sources.taildirSrc.channels = memoryChannel fileChannel
loadBalancAgent.sources.taildirSrc.selector.optional = fileChannel
## memory channel config
loadBalancAgent.channels.memoryChannel.type = memory
loadBalancAgent.channels.memoryChannel.capacity = 10000
loadBalancAgent.channels.memoryChannel.transactionCapacity = 10000
loadBalancAgent.channels.memoryChannel.byteCapacityBufferPercentage = 20
loadBalancAgent.channels.memoryChannel.byteCapacity = 800000
## file channel config
loadBalancAgent.channels.fileChannel.type = file
loadBalancAgent.channels.fileChannel.checkpointDir = /alidata1/admin/opensystem/flumetest/log
loadBalancAgent.channels.fileChannel.dataDirs = /alidata1/admin/opensystem/flumetest/data
## load-balancing sink processor
loadBalancAgent.sinkgroups.loadBalanceGroups.sinks = loggerSink1 loggerSink2
loadBalancAgent.sinkgroups.loadBalanceGroups.processor.type = load_balance
loadBalancAgent.sinkgroups.loadBalanceGroups.processor.backoff = true
loadBalancAgent.sinkgroups.loadBalanceGroups.processor.selector = round_robin
## loggerSink1 config
loadBalancAgent.sinks.loggerSink1.type = avro
loadBalancAgent.sinks.loggerSink1.channel = memoryChannel
loadBalancAgent.sinks.loggerSink1.hostname = 10.253.42.162
loadBalancAgent.sinks.loggerSink1.port = 4141
## loggerSink2 config
loadBalancAgent.sinks.loggerSink2.type = avro
loadBalancAgent.sinks.loggerSink2.channel = memoryChannel
loadBalancAgent.sinks.loggerSink2.hostname = 10.139.53.6
loadBalancAgent.sinks.loggerSink2.port = 4141
## loggerSink3 config
loadBalancAgent.sinks.loggerSink3.type = file_roll
loadBalancAgent.sinks.loggerSink3.channel = fileChannel
loadBalancAgent.sinks.loggerSink3.sink.rollInterval = 0
loadBalancAgent.sinks.loggerSink3.sink.directory = /alidata1/admin/opensystem/flumetest/dtsserverlog
2. Dispatch node 1
Receives events from the upstream Avro sink and prints them via the logger sink.
dispatchAgent.sources = avroSrc
dispatchAgent.channels = memoryChannel
dispatchAgent.sinks = loggerSink
## avroSrc config
dispatchAgent.sources.avroSrc.type = avro
dispatchAgent.sources.avroSrc.channels = memoryChannel
dispatchAgent.sources.avroSrc.bind = 0.0.0.0
dispatchAgent.sources.avroSrc.port = 4141
## memoryChannel config
dispatchAgent.channels.memoryChannel.type = memory
dispatchAgent.channels.memoryChannel.capacity = 10000
dispatchAgent.channels.memoryChannel.transactionCapacity = 10000
dispatchAgent.channels.memoryChannel.byteCapacityBufferPercentage = 20
dispatchAgent.channels.memoryChannel.byteCapacity = 800000
## loggerSink config
dispatchAgent.sinks.loggerSink.type = logger
dispatchAgent.sinks.loggerSink.channel = memoryChannel
3. Dispatch node 2
dispatchAgent.sources = avroSrc
dispatchAgent.channels = memoryChannel
dispatchAgent.sinks = loggerSink
## avroSrc config
dispatchAgent.sources.avroSrc.type = avro
dispatchAgent.sources.avroSrc.channels = memoryChannel
dispatchAgent.sources.avroSrc.bind = 0.0.0.0
dispatchAgent.sources.avroSrc.port = 4141
## memoryChannel config
dispatchAgent.channels.memoryChannel.type = memory
dispatchAgent.channels.memoryChannel.capacity = 10000
dispatchAgent.channels.memoryChannel.transactionCapacity = 10000
dispatchAgent.channels.memoryChannel.byteCapacityBufferPercentage = 20
dispatchAgent.channels.memoryChannel.byteCapacity = 800000
## loggerSink config
dispatchAgent.sinks.loggerSink.type = logger
dispatchAgent.sinks.loggerSink.channel = memoryChannel