In a distributed system, each machine keeps its own local logs for the programs it runs. When it comes time to analyze them, these scattered logs have to be gathered in one place. Many people reach for rsync or scp, but those tools are weak on real-time delivery, invite filename conflicts, and do not scale well. Not elegant at all.
In our case, we needed to aggregate the Nginx logs of several production servers in real time, and Flume proved its worth.
Flume Introduction
Flume is a distributed, reliable and efficient log collection system. It lets users customize the data transfer model, so it scales well, and it has solid fault-tolerance and recovery mechanisms. Here are a few important concepts:
- Event: the event is the basic unit of data transfer in Flume. Flume carries data from the source to the final destination in the form of events.
- Agent: an agent contains Sources, Channels, Sinks and other components, and uses them to move events from one node to the next, or to the final destination.
- Source: a source receives events and places them, in batches, into one or more channels.
- Channel: a channel sits between a Source and a Sink and buffers incoming events; an event is removed from the channel once the Sink has successfully passed it to the next-hop channel or the final destination.
- Sink: a sink delivers events to the next hop or the final destination, and removes them from the channel after a successful delivery.
- Sources include the Syslog source, Kafka source, HTTP source, Exec source, Avro source, and so on.
- Sinks include the Kafka sink, Avro sink, File Roll sink, HDFS sink, and so on.
- Channels include the Memory channel, File channel, and so on.
Flume provides the skeleton, along with a variety of Sources, Sinks and Channels, and lets you design the data flow model that fits your needs. Multiple Flume agents can also be chained together, like the cars of a subway train.
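To make the skeleton concrete, here is a minimal, hypothetical single-agent configuration, separate from the setup described below: a netcat source listening on a local port, a memory channel, and a logger sink that simply prints each event. The agent name a1 and the component names r1, c1, k1 are arbitrary.

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# a netcat source: every line received on the port becomes one event
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# an in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# a logger sink: events are written to Flume's own log for inspection
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1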
Defining a Data flow model
Back to the scenario at the beginning: we want to aggregate the Nginx logs of multiple servers for analysis. The work is split between two Flume agents:
- Flume1: Exec Source, Memory Channel, Avro Sink, deployed on each business machine
- Flume2: Avro Source, Memory Channel, File Roll Sink, deployed on the machine that aggregates the logs
Required Preparation
You need to:
- Download Flume
- Install the Java SDK, then edit conf/flume-env.sh in the extracted Flume directory and configure:
# I am using oracle-java-8
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre/
- Think through your data flow model and write the configuration. For Flume1 described above, tail2avro.conf:
agent.sources = s1
agent.channels = c1
agent.sinks = k1

agent.sources.s1.type = exec
agent.sources.s1.command = tail -F <your file path>
agent.sources.s1.channels = c1

agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 10000

agent.sinks.k1.type = avro
agent.sinks.k1.hostname = <your target address>
agent.sinks.k1.port = <your target port>
agent.sinks.k1.channel = c1
And for Flume2, avro2file.conf:
agent.sources = s1
agent.channels = c1
agent.sinks = k1

agent.sources.s1.type = avro
agent.sources.s1.bind = <your address>
agent.sources.s1.port = <your port>
agent.sources.s1.channels = c1

agent.sinks.k1.type = file_roll
agent.sinks.k1.sink.directory = /data/log/ngxlog
# roll interval in seconds
agent.sinks.k1.sink.rollInterval = 86400
agent.sinks.k1.channel = c1

agent.channels.c1.type = memory
# capacity of events in the queue
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 10000
agent.channels.c1.keep-alive =
# start flume1
bin/flume-ng agent -n agent -c conf -f conf/tail2avro.conf -Dflume.root.logger=WARN
# start flume2
bin/flume-ng agent -n agent -c conf -f conf/avro2file.conf -Dflume.root.logger=INFO
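Once both agents are up, it is worth checking the pipeline end to end. Below is a minimal sketch, assuming Flume2 listens on the address and port configured above; the file paths are placeholders, not part of the original setup.

# confirm the Flume installation can find Java
bin/flume-ng version

# on a business machine: append a test line to the tailed file;
# the Exec source should pick it up and forward it over Avro
echo "flume-test $(date +%s)" >> <your file path>

# alternatively, send a sample file straight to Flume2's Avro source
# using the avro-client bundled with Flume
bin/flume-ng avro-client -H <your address> -p <your port> -F /path/to/sample.log

# on the aggregating machine: the File Roll sink should have written
# the events into the configured directory
ls -l /data/log/ngxlog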
Flume: real-time log collection.