Basic concepts of Flume and its data flow models
1. Basic concepts of Flume
All Flume-related terms below are given in English; their meanings are as follows.
Flume: a reliable, distributed system for collecting, aggregating, and transmitting massive amounts of log data.
Web Server: a producer of Events (log data).
Agent: a node in a Flume system; it consists of three parts: Source, Channel, and Sink.
Event: the basic unit of data inside a Flume agent. An Event is made up of Map<String, String> headers and a byte[] body; the headers hold the Event's metadata and the body holds its payload.
Source: the part of an Agent that receives the Events generated by a Web Server, as well as Events sent by the Sink of another Flume agent.
Channel: the Source puts the received Events into a Channel for safekeeping. There are two main kinds of Channel, MemoryChannel and FileChannel, which store Events in memory and in files, respectively.
Sink: consumes the Events stored in the Channel and then sends them onward.
Sinkgroups: multiple Sinks combined together form Sinkgroups.
HDFS: the Hadoop distributed file system, used here to store the log data, i.e. the Events sent by the Sinks.
2. Flume data flow models
(1) Single-Agent data flow model
As shown in Figure 1, an Agent consists of three parts: Source, Channel, and Sink.
Figure 1: Single-Agent data flow model
The entire data flow is as follows:
The Web Server generates Events and sends them to the Source.
The Source receives the Events and sends them to the Channel.
The Channel stores the Events.
The Sink consumes the Events stored in the Channel and sends them to HDFS.
HDFS stores the Events on disk.
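The single-Agent model above can be expressed directly in a Flume properties file. This is a minimal sketch, not a production config; the agent and component names (a1, r1, c1, k1), the port, and the HDFS path are illustrative placeholders.

```properties
# One agent with one Source, one Channel, one Sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: a netcat source stands in for the web-server event feed.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffers events in memory.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: writes the events to HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

A file like this would be started with something like `flume-ng agent --name a1 --conf-file example.conf`.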
(2) Multi-Agent serial data flow model
Figure 2 shows a data flow transmitted serially through two Agents.
Figure 2: Two-Agent serial data flow model
The entire data flow is as follows:
In Agent foo, the Source receives external Events and stores them in the Channel; the Sink takes the Events from the Channel and then transfers them to the Source of Agent bar.
In Agent bar, the Source receives the Events sent by Agent foo's Sink and stores them in bar's Channel, where they are consumed by bar's Sink.
The whole data flow accomplishes just one thing: transmitting the data.
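Chaining two Agents like this is typically wired together with an Avro sink on foo and an Avro source on bar. A hedged sketch, with hostnames, ports, and the terminal logger sink made up for illustration:

```properties
# Agent foo: receives external events and forwards them to agent bar.
foo.sources = r1
foo.channels = c1
foo.sinks = k1
foo.sources.r1.type = netcat
foo.sources.r1.bind = 0.0.0.0
foo.sources.r1.port = 44444
foo.sources.r1.channels = c1
foo.channels.c1.type = memory
foo.sinks.k1.type = avro          # Avro RPC sink: sends events to bar
foo.sinks.k1.channel = c1
foo.sinks.k1.hostname = bar-host  # placeholder hostname
foo.sinks.k1.port = 4141

# Agent bar: its Avro source receives the events sent by foo's Avro sink.
bar.sources = r1
bar.channels = c1
bar.sinks = k1
bar.sources.r1.type = avro
bar.sources.r1.bind = 0.0.0.0
bar.sources.r1.port = 4141
bar.sources.r1.channels = c1
bar.channels.c1.type = memory
bar.sinks.k1.type = logger        # stand-in terminal sink for this sketch
bar.sinks.k1.channel = c1
```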
(3) Data flow aggregation model
As shown in Figure 3, Agent1, Agent2, and Agent3 each receive Events from a different Web Server and send them to Agent4, and Agent4 then sends the Events on to HDFS.
Figure 3: Data flow aggregation model
The entire data flow is as follows:
Agent1 receives Events and transfers them to Agent4.
Agent2 receives Events and transfers them to Agent4.
Agent3 receives Events and transfers them to Agent4.
Agent4 receives the Events from Agent1, Agent2, and Agent3, and then stores them in HDFS.
Function of the entire data flow: different Agents collect the log data generated by different Web Servers, and all of the log data is stored in a single destination, HDFS.
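On the collector side, the aggregation model only requires one Avro source listening on a port that Agent1–Agent3 each target with an Avro sink. A sketch of Agent4 alone (names, port, and path are assumptions):

```properties
# Agent4: collects events from Agent1, Agent2, and Agent3, each of which
# runs an Avro sink pointing at this host on port 4141.
agent4.sources = r1
agent4.channels = c1
agent4.sinks = k1

agent4.sources.r1.type = avro
agent4.sources.r1.bind = 0.0.0.0
agent4.sources.r1.port = 4141
agent4.sources.r1.channels = c1

# A file channel gives durability at the aggregation point.
agent4.channels.c1.type = file

agent4.sinks.k1.type = hdfs
agent4.sinks.k1.channel = c1
agent4.sinks.k1.hdfs.path = hdfs://namenode/flume/collected
```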
(4) Multi-Channel data flow model
An Agent with one Source, multiple Channels, and multiple Sinks forms the multiplexing data flow model shown in Figure 4.
One Source receives external Events and sends them to three Channels; different Sinks then consume the Events in different Channels and give the Events different processing.
How does the Source decide which Channels the Events are sent to? Flume adopts two different strategies here: replicating and multiplexing.
With replicating, the Source sends every Event to all the Channels: each Event is copied three times and delivered to the different Channels.
With multiplexing, the Source sends each Event to a particular Channel based on a mapping, so all the Events are divided into three parts and sent to the three Channels.
Figure 4: Multi-Channel data flow model
The entire data flow is as follows:
Agent foo's Source sends the Events along three paths: Channel1 → Sink1 → HDFS, Channel2 → Sink2 → JMS, and Channel3 → Sink3 → Agent bar.
Agent bar's Source receives the Events sent by Sink3 in Agent foo and then sends them on through Channel4 to its Sink.
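The choice between replicating and multiplexing is made on the Source via a channel selector. A sketch (the header name "type" and its mapped values are invented for illustration):

```properties
# One source fanning out to three channels.
foo.sources = r1
foo.channels = c1 c2 c3
foo.sources.r1.channels = c1 c2 c3

# Replicating (the default): every event is copied to all three channels.
foo.sources.r1.selector.type = replicating

# Multiplexing alternative: route each event by the value of a header.
#foo.sources.r1.selector.type = multiplexing
#foo.sources.r1.selector.header = type
#foo.sources.r1.selector.mapping.web = c1
#foo.sources.r1.selector.mapping.app = c2
#foo.sources.r1.selector.default = c3
```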
(5) Sinkgroups data flow model
Now we have two problems. First, when a Sink consumes the Events in a Channel, if the Sink fails, that Channel becomes blocked.
Second, when a Sink consumes the Events in a Channel, if the Sink is slow or its consumption capacity cannot keep up with the Source, a large number of Events will congest the Channel.
To solve these two problems, Flume provides a data flow model in which multiple Sinks are bound together to form Sinkgroups, which are responsible for consuming the Events in the Channel.
However, only one Sink at a time consumes the Events in the Channel, so there are two policies that determine which Sink the Sinkgroup selects to consume the Channel's Events.
These two policies are failover and load_balance. In the failover mechanism, every Sink is assigned a priority; a map stores the live Sinks in priority order, and a queue stores the Sinks that have failed.
Each time, the highest-priority live Sink is selected to consume the Channel's Events. The failed Sinks are checked periodically; if one comes back to life, it is inserted back into the live-Sink map.
The other mechanism is load_balance, which offers two different policies: round_robin and random. Round_robin continuously polls the Sinks in the Sinkgroup to keep the load balanced,
while random selects one of the Sinks in the Sinkgroup at random.
The data flow model is shown in Figure 5.
Figure 5: Sinkgroups data flow model
The entire data flow is as follows:
The Source is responsible for receiving Events and sending them to the Channel.
The Channel is responsible for storing the Events.
The Sinkgroups are responsible for consuming the Events in the Channel and sending the Events to HDFS for storage.
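A Sinkgroup and its policy are declared in the agent configuration. A minimal sketch of both policies (component names and priority values are placeholders):

```properties
# Two sinks bound into one sink group that consumes the same channel.
agent.sinks = k1 k2
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = k1 k2

# Failover: the live sink with the highest priority consumes the events;
# if it dies, the group falls back to the next-highest priority.
agent.sinkgroups.g1.processor.type = failover
agent.sinkgroups.g1.processor.priority.k1 = 10
agent.sinkgroups.g1.processor.priority.k2 = 5

# Load-balancing alternative:
#agent.sinkgroups.g1.processor.type = load_balance
#agent.sinkgroups.g1.processor.selector = round_robin
# (or: random)
```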
(6) Single Agent, multiple data flows
As shown in Figure 6, a single Agent can be composed of multiple Sources, Channels, and Sinks, forming data flows that are completely independent of one another.
Figure 6: Single Agent with multiple data flows
The entire data flow is as follows:
Data flow 1: Source1 → Channel1 → Sink1 → HDFS1
Data flow 2: Source2 → Channel2 → Sink2 → HDFS2
Data flow 1 and data flow 2 are completely unrelated.
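Two independent flows coexist in one agent simply by never sharing a channel. A partial sketch (component types are omitted for brevity; in a real file every source, channel, and sink would also need a type):

```properties
# One agent, two fully independent data flows.
a1.sources = r1 r2
a1.channels = c1 c2
a1.sinks = k1 k2

# Flow 1: r1 -> c1 -> k1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Flow 2: r2 -> c2 -> k2
a1.sources.r2.channels = c2
a1.sinks.k2.channel = c2
```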
(7) Comparing the data flow models
From the six data flow models described above, we can see that models 1 and 2 are the equivalent of sequential execution in programming.
In model 3, Agent1, Agent2, and Agent3 collect Events in parallel and send the Events to Agent4 concurrently.
Model 4 is the equivalent of an if/else in programming: each Event is routed down one selected path.
The three Sinks in model 5 are likewise in a concurrent state.
Model 6 is the equivalent of the parallel model in programming.
By combining these six data flow models, we can form a wide variety of data flow topologies to meet our needs.
3. Analyzing the data flow models
Different data flow models have different functions, but they are all composed of the same components and policies. This section analyzes how the different data flow models are implemented inside an Agent.
(1) Agent internal component architecture
Figure 6 shows the complete internal architecture of an Agent. It contains not only the Source, Channel, and Sink, but also the SourceRunner, Interceptor, ChannelSelector, Transaction,
SinkRunner, SinkProcessor, and SinkSelector. Next we describe in detail each component's responsibilities inside the Agent.
Figure 6: Agent internal component architecture
From the figure we can see that the entire data flow is divided into two phases: Source → Channel, and Channel → Sink.
The following describes the responsibilities of each component in the data flow process from these two phases.
(2) Phase 1: Source → Channel
Figure 6 is both a data flow diagram and an object structure diagram. We can see that the SourceRunner object contains a Source object, the Source object contains a ChannelProcessor object,
and a ChannelProcessor object contains multiple Interceptor objects and a ChannelSelector object.
First, the SourceRunner starts the Source; the Source then monitors for incoming Events and, when Events arrive, receives them.
Second, the Events are filtered by the Interceptors in the ChannelProcessor. An Interceptor can do one of three things: discard the Event, modify the Event, or return the Event directly (perform no operation).
For example, two easy-to-understand Interceptors are the Timestamp Interceptor and the Host Interceptor: the Timestamp Interceptor adds a timestamp attribute to each Event,
and the Host Interceptor adds a host attribute to each Event.
Then comes the ChannelSelector, whose main function is to implement the multi-Channel data flow model described above. It has two types, replicating and multiplexing; that is, the ChannelSelector selects the Channels for each Event.
In replicating mode, every Event is sent to all the Channels; in multiplexing mode, different Events are sent to different Channels.
Finally, the Source establishes a connection with each Channel through a Transaction and sends the Events to the Channel.
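The Interceptors described in this phase are attached to a Source in the agent configuration. A sketch using the two interceptors named above (the agent and source names are placeholders):

```properties
# Interceptor chain: runs inside the ChannelProcessor before channel selection.
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp   # adds a "timestamp" header
a1.sources.r1.interceptors.i2.type = host        # adds a "host" header
```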
(3) Phase 2: Channel → Sink
We can see that the SinkRunner object contains a SinkProcessor object, and the SinkProcessor object contains multiple Sinks and/or a SinkSelector.
First, the SinkRunner starts the SinkProcessor object. There are three kinds of SinkProcessor: DefaultSinkProcessor, FailoverSinkProcessor, and LoadBalancingSinkProcessor.
Does this ring a bell? Yes, this is the Sinkgroups data flow model mentioned above. A single Sink uses the DefaultSinkProcessor, which is responsible for starting that Sink;
if multiple Sinks form a group, the SinkProcessor can be set to failover or load_balance.
The FailoverSinkProcessor assigns each Sink a priority. It keeps a SortedMap<Integer, Sink> liveSinks holding the living Sinks and a Queue<FailedSink> failedSinks holding the dead ones.
Each time, the highest-priority Sink in liveSinks is selected to consume the Events. If a Sink fails, it is put into failedSinks, and on every pass the first Sink in failedSinks is retried; if it has come back to life,
it is moved back into liveSinks. The LoadBalancingSinkProcessor contains a SinkSelector object, of which there are two kinds: round_robin and random (sound familiar again?). The SinkSelector selects the Sink in the Sinkgroup
that will consume the Events: round_robin polls all the Sinks in the Sinkgroup in turn, while random selects a Sink from the Sinkgroup at random.
Second, the SinkProcessor selects a Sink and starts it.
Finally, the Sink establishes a connection with the Channel through a Transaction and consumes the Events in the Channel.
A data flow diagram is used to abstract the logical model of a piece of software; it is built from a specific set of graphic symbols.
The data flow diagram is used in the requirements analysis phase. In this phase we consider only the general direction of the data flow, not the specific internal processing or how the data flow will be implemented on a computer, and we do not discuss control flow. We care only about the following: data flows, data stores, transformations/processing (each treated as a black box whose internal details we ignore), and external entities. Data flow diagrams are easy to understand because they are far removed from the computer; both users (who need not understand programming) and software engineers readily accept them.
For example, a simple software system logic model:
The input stream and the output stream are the source and the sink in the DFD.
1. What is a data flow diagram? What is its role? What are the meanings of the basic symbols?
Data flow diagram (DFD): a graphic representation of a system's logical functions, the logical flow of data within the system, and the logical transformation processes. It is the main expression tool of the structured system analysis method and a graphical way to represent a software model.
The meanings of the basic symbols of the data flow diagram:
1. A rectangle denotes an external entity that supplies or receives data;
2. A rounded rectangle denotes the processing logic that transforms the data;
3. A rectangle missing its right edge denotes a data store;
4. An arrow denotes a data flow.
A data flow diagram has the following main elements:
→: data flow. A data flow is a path along which data travels through the system, so it is composed of fixed data items. For example, a ticket consists of the passenger's name, age, work unit, ID-card number, date, destination, and other data items. Because a data flow is data in motion, it must have a direction. Except for flows between a process and a data store, a data flow should be named with a noun or a noun phrase.
□: data source or terminator (endpoint). Represents an entity outside the system, which can be a person, a thing, or another software system.
○: data processing (a process). A process is a unit that transforms data: it receives certain data as input, processes it, and produces output.
▭ (open rectangle): data store. The static storage of information; it can represent a file, a part of a file, or elements of a database.