2017-09-06 Zhu Big Data and cloud computing technologies
Any production system generates a large volume of logs during operation, and these logs often hide a great deal of valuable information. In the past, such logs were typically kept for a while and then cleaned up without ever being analyzed. With the development of technology and the improvement of analytical capabilities, the value of logs has been re-evaluated. Before these logs can be analyzed, however, the logs scattered across the production systems must first be collected. This section introduces Flume, a widely used log collection system.
I. Overview
Flume is a high-performance, highly available distributed log collection system originally developed by Cloudera and now a top-level Apache project. Similar log collection systems include Facebook's Scribe and Apache Chukwa.
II. History of Flume Development
The initial releases of Flume, now collectively known as Flume OG (Original Generation), belonged to Cloudera. As Flume's functionality grew, however, the shortcomings of Flume OG were gradually exposed: the code base had become bloated, the core components were poorly designed, and the core configuration was not standardized. In particular, in Flume OG's last release, 0.94.0, log transmission was notably unstable. To address these issues, on October 22, 2011 Cloudera completed FLUME-728, a milestone change to Flume: the core components, core configuration, and code architecture were refactored. The refactored version is collectively known as Flume NG (Next Generation). Another backdrop to the change is that Flume was accepted into Apache, and Cloudera Flume was renamed Apache Flume.
III. Flume Structure Analysis
1. System Features
① Reliability
When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantee, from strong to weak: End-to-end (after receiving the data, the agent first writes the event to disk and deletes it only after the transfer has succeeded; if sending fails, the data is resent), Store on Failure (the strategy also adopted by Scribe: when the data receiver crashes, the data is written locally and resent after the receiver recovers), and Best Effort (the data is sent to the receiver without any confirmation).
② Scalability
Flume employs a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed by a master, which makes the system easy to monitor and maintain, and multiple masters are allowed (managed and load-balanced with ZooKeeper), thus avoiding a single point of failure.
③ Manageability
When there is more than one master, Flume uses ZooKeeper and gossip to keep the dynamic configuration data consistent. On the master, users can view individual data sources and the execution of data flows, and can configure and dynamically load individual data sources. Flume provides both a web interface and a shell script command for managing data flows.
④ Functional Scalability
Users can add their own agents, collectors, or storage as needed. In addition, Flume ships with many components, including various agents (such as file and syslog), collectors, and storage backends (such as file and HDFS).
2. System Architecture
The architecture of Flume OG is shown in the figure.
The architecture of Flume NG is shown in the next figure. Flume uses a layered architecture consisting of agent, collector, and storage. Both the agent and the collector are composed of a source and a sink: the source is where data comes from, and the sink is where data goes.
Flume uses two roles, master and node. Whether a node acts as an agent or a collector is determined by its dynamic configuration, which is set on the master through the shell or the web interface.
① Agent
The role of the agent is to send data from the data source to the collector. Flume comes with many directly usable data sources (sources), for example:
text("filename"): sends the file filename as a data source, line by line.
tail("filename"): watches for new data appended to filename and sends it line by line.
fsyslogTcp(5140): listens on TCP port 5140 and sends the data it receives.
tailDir("dirname"[, fileregex=".*"[, startFromEnd=false[, recurseDepth=0]]]): tails the files in a directory; a regular expression selects which files (not directories) to watch, and recurseDepth is the depth to which subdirectories are watched recursively.
Flume likewise provides many sinks (data destinations), for example:
console[("format")]: displays the data directly on the console.
text("txtfile"): writes the data to the file txtfile.
dfs("dfsfile"): writes the data to the file dfsfile on HDFS.
syslogTcp("host", port): passes the data over TCP to the host node.
agentSink[("machine"[, port])]: equivalent to agentE2ESink; if the machine parameter is omitted, flume.collector.event.host and flume.collector.event.port are used as the default collector.
agentDFOSink[("machine"[, port])]: agent with local hot standby (disk failover). When the agent detects that a collector node has failed, it keeps checking whether the collector is alive so that events can be resent; data produced in the meantime is cached on the local disk.
agentBESink[("machine"[, port])]: best-effort agent. If the collector fails, nothing further is done and the data being sent is simply discarded.
agentE2EChain: specifies multiple collectors to improve availability. Events are sent to the primary collector first; if that fails, they are sent to the second collector, and when all collectors have failed, sending is retried.
② Collector
The role of the collector is to aggregate the data from multiple agents and load it into storage. Its sources and sinks are similar to the agent's.
Its sources include the following.
collectorSource[(port)]: collector source; listens on the port for data to aggregate.
autoCollectorSource: aggregates data automatically, with the master coordinating the physical nodes.
logicalSource: logical source; the master assigns it a port and it listens for rpcSink data.
Its sinks include the following.
collectorSink("fsdir", "fsfileprefix", rollmillis): collector sink; after passing through the collector the data is written to HDFS, where fsdir is the HDFS directory and fsfileprefix is the file name prefix.
customDfs("hdfspath"[, "format"]): DFS with a custom format.
③ Storage
Storage is the storage layer; it can be an ordinary file, or HDFS, Hive, HBase, or another distributed storage system.
④ Master
The master is responsible for managing and coordinating the configuration information of the agents and collectors; it is the controller of the Flume cluster.
In Flume, the most important abstraction is the data flow. A data flow describes the path along which data is generated, transmitted, processed, and finally written to its target, as shown in the figure.
For an agent, the data flow configuration specifies where to get the data and which collector to send it to.
For a collector, it specifies how to receive the data sent by the agents and which target machine to forward the data to.
Note: the Flume framework's dependency on Hadoop and ZooKeeper exists only at the level of jar packages; Flume does not require the Hadoop and ZooKeeper services to be running when it starts.
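Flume NG (the 1.x line on which the rest of this article is based) expresses such a data flow declaratively in a properties file instead of pushing the configuration from a master. Below is a minimal sketch, assuming a standalone agent named a1 with a netcat source, a memory channel, and a logger sink; the names, port, and capacity are illustrative only:
# one source, one channel, one sink for agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source: read text lines from a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# sink: write events to the agent's log (replace with hdfs or avro in real use)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
The agent is then started with the flume-ng agent command, pointing it at this file and the agent name a1.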
3. Introduction of Components
The Flume described in this article is based on version 1.4.0.
① Client
Path: apache-flume-1.4.0-src\flume-ng-clients.
The client manipulates the initial data and sends it to an agent. There are two ways to establish data communication between the client and the agent.
The first way: have the client communicate directly with one of Flume's existing sources, such as AvroSource or SyslogTcpSource; the data being transmitted must be in a format the source understands.
The second way: write a custom Flume source that communicates directly with the existing application via an IPC or RPC protocol, converting the client data into events that Flume can recognize.
Client SDK: an RPC-based SDK library that lets applications connect to Flume directly over the RPC protocol. Its API functions can be called directly without worrying about how the underlying data is exchanged; it provides the two interfaces append and appendBatch. For details, see apache-flume-1.4.0-src\flume-ng-sdk\src\main\java\org\apache\flume\api\RpcClient.java.
② NettyAvroRpcClient
Avro is the default RPC protocol. NettyAvroRpcClient and ThriftRpcClient are two implementations of the RpcClient interface; see apache-flume-1.4.0-src\flume-ng-sdk\src\main\java\org\apache\flume\api\NettyAvroRpcClient.java and apache-flume-1.4.0-src\flume-ng-sdk\src\main\java\org\apache\flume\api\ThriftRpcClient.java.
Below is a sample that uses the SDK to establish a connection with Flume; actual code can be written with reference to this implementation:
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import java.nio.charset.Charset;

public class MyApp {
  public static void main(String[] args) {
    MyRpcClientFacade client = new MyRpcClientFacade();
    // Initialize client with the remote Flume agent's host and port
    client.init("host.example.org", 41414);

    // Send the events to the remote Flume agent. That agent should be
    // configured to listen with an AvroSource.
    String sampleData = "Hello Flume!";
    for (int i = 0; i < 10; i++) {
      client.sendDataToFlume(sampleData);
    }
    client.cleanUp();
  }
}

class MyRpcClientFacade {
  private RpcClient client;
  private String hostname;
  private int port;

  public void init(String hostname, int port) {
    // Setup the RPC connection
    this.hostname = hostname;
    this.port = port;
    this.client = RpcClientFactory.getDefaultInstance(hostname, port);
    // Use the following method to create a thrift client (instead of the above line):
    // this.client = RpcClientFactory.getThriftInstance(hostname, port);
  }

  public void sendDataToFlume(String data) {
    // Create a Flume Event object that encapsulates the sample data
    Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
    // Send the event
    try {
      client.append(event);
    } catch (EventDeliveryException e) {
      // Clean up and recreate the client
      client.close();
      client = null;
      client = RpcClientFactory.getDefaultInstance(hostname, port);
      // Use the following method to create a thrift client (instead of the above line):
      // this.client = RpcClientFactory.getThriftInstance(hostname, port);
    }
  }

  public void cleanUp() {
    // Close the RPC connection
    client.close();
  }
}
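Besides append, the RpcClient interface also provides appendBatch for sending several events in one call, which reduces RPC overhead. A brief sketch, assuming client is an RpcClient initialized as above (the batch size and payloads are arbitrary):
// Requires java.util.List and java.util.ArrayList in addition to the imports above
List<Event> batch = new ArrayList<Event>();
for (int i = 0; i < 100; i++) {
  batch.add(EventBuilder.withBody("event " + i, Charset.forName("UTF-8")));
}
try {
  client.appendBatch(batch);
} catch (EventDeliveryException e) {
  // Clean up and recreate the client, as in the append example above
}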
So that the client can reach the corresponding host and port, the host and port configuration must be added to the configuration file (apache-flume-1.4.0-src\conf\flume-conf.properties.template):
client.type = default (for avro) or thrift (for thrift)
hosts = h1                                  # default client accepts only 1 host
                                            # (additional hosts will be ignored)
hosts.h1 = host1.example.org:41414          # host and port must both be specified
                                            # (neither has a default)
batch-size = 100                            # must be >=1 (default: 100)
connect-timeout = 20000                     # must be >=1000 (default: 20000)
request-timeout = 20000                     # must be >=1000 (default: 20000)
In addition to the two implementations above, FailoverRpcClient.java and LoadBalancingRpcClient.java also implement the RpcClient interface.
③ FailoverRpcClient
This class mainly implements active/standby (failover) switching between multiple hosts.
④ LoadBalancingRpcClient
This class performs load balancing when there is more than one host.
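Both of these clients are obtained from RpcClientFactory by passing a Properties object instead of a single host/port pair. A rough sketch based on the RPC client properties described in the Flume documentation (the host aliases and addresses are placeholders; see the developer guide for the full property list):
// Requires java.util.Properties plus the RpcClient/RpcClientFactory imports above

// Failover client: events go to h1; if it becomes unreachable, h2 is used
Properties failoverProps = new Properties();
failoverProps.put("client.type", "default_failover");
failoverProps.put("hosts", "h1 h2");
failoverProps.put("hosts.h1", "host1.example.org:41414");
failoverProps.put("hosts.h2", "host2.example.org:41414");
RpcClient failoverClient = RpcClientFactory.getInstance(failoverProps);

// Load-balancing client: events are spread across the listed hosts
Properties lbProps = new Properties();
lbProps.put("client.type", "default_loadbalance");
lbProps.put("hosts", "h1 h2 h3");
lbProps.put("hosts.h1", "host1.example.org:41414");
lbProps.put("hosts.h2", "host2.example.org:41414");
lbProps.put("hosts.h3", "host3.example.org:41414");
lbProps.put("host-selector", "round_robin");   // or "random"
RpcClient lbClient = RpcClientFactory.getInstance(lbProps);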
⑤ Embedded Agent
Flume allows users to embed an agent inside their own application. This embedded agent is a lightweight agent and does not support all sources, sinks, and channels.
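A rough sketch of the embedded-agent API from the Flume developer guide (the property values, host names, and ports are illustrative; the embedded agent only supports a limited set of channel and sink types, with Avro sinks for forwarding):
// Requires java.util.Map/HashMap and org.apache.flume.agent.embedded.EmbeddedAgent
Map<String, String> properties = new HashMap<String, String>();
properties.put("channel.type", "memory");
properties.put("channel.capacity", "200");
properties.put("sinks", "sink1 sink2");
properties.put("sink1.type", "avro");
properties.put("sink1.hostname", "collector1.example.org");
properties.put("sink1.port", "41414");
properties.put("sink2.type", "avro");
properties.put("sink2.hostname", "collector2.example.org");
properties.put("sink2.port", "41414");
properties.put("processor.type", "load_balance");

EmbeddedAgent agent = new EmbeddedAgent("myagent");
agent.configure(properties);
agent.start();

try {
  // Forward an event through the embedded agent's Avro sinks
  agent.put(EventBuilder.withBody("Hello Flume!", Charset.forName("UTF-8")));
} catch (EventDeliveryException e) {
  // handle delivery failure (log, retry, ...)
}

agent.stop();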
⑥ Transaction
Flume's three main components, source, sink, and channel, must use a transaction to send and receive messages. The transaction interface is implemented in the channel class; whether it is a source or a sink, once connected to a channel it must first obtain a transaction object, as shown in the figure.
An example of its use follows and can serve as a reference:
Channel ch = new MemoryChannel();
Transaction txn = ch.getTransaction();
txn.begin();
try {
  // Stage an event on the channel inside the transaction
  Event eventToStage = EventBuilder.withBody("Hello Flume!", Charset.forName("UTF-8"));
  ch.put(eventToStage);
  txn.commit();
} catch (Throwable t) {
  txn.rollback();
  if (t instanceof Error) {
    throw (Error) t;
  }
} finally {
  txn.close();
}
⑦ Sink
An important function of a sink is to take events from the channel and then send them to the next agent, or store them in an external repository. A sink is associated with one channel, and this association is configured in the Flume configuration file. After SinkRunner.start() is called, a thread is created that manages the entire life cycle of the sink. The sink needs to implement the start() and stop() methods of the LifecycleAware interface.
Sink.start(): initializes the sink and sets its state so that it can send and receive events.
Sink.stop(): performs the necessary cleanup.
Sink.process(): responsible for the specific event handling.
A sink reference code example is as follows:
public class MySink extends AbstractSink implements Configurable {
  private String myProp;

  @Override
  public void configure(Context context) {
    String myProp = context.getString("myProp", "defaultValue");
    // Process the myProp value (e.g. validation)
    // Store myProp for later retrieval by the process() method
    this.myProp = myProp;
  }

  @Override
  public void start() {
    // Initialize the connection to the external repository (e.g. HDFS) that
    // this Sink will forward Events to.
  }

  @Override
  public void stop() {
    // Disconnect from the external repository and do any
    // additional cleanup (e.g. releasing resources or nulling-out
    // field values).
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;

    // Start transaction
    Channel ch = getChannel();
    Transaction txn = ch.getTransaction();
    txn.begin();
    try {
      // This try clause includes whatever Channel operations you want to do
      Event event = ch.take();

      // Send the Event to the external repository.
      // storeSomeData(event);

      txn.commit();
      status = Status.READY;
    } catch (Throwable t) {
      txn.rollback();
      // Log exception, handle individual exceptions as needed
      status = Status.BACKOFF;
      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error) t;
      }
    } finally {
      txn.close();
    }
    return status;
  }
}
⑧ Source
The role of a source is to receive events from the client side and then store them in the channel. PollableSourceRunner.start() creates a thread that manages the life cycle of a PollableSource, which likewise needs to implement the start() and stop() methods. Note that there is also another kind of source, EventDrivenSource. The difference is that an EventDrivenSource has its own callback mechanism for capturing events, and it is not necessarily driven by a dedicated thread.
Here is an example of a PollableSource:
public class MySource extends AbstractSource implements Configurable, PollableSource {
  private String myProp;

  @Override
  public void configure(Context context) {
    String myProp = context.getString("myProp", "defaultValue");
    // Process the myProp value (e.g. validation, convert to another type, ...)
    // Store myProp for later retrieval by the process() method
    this.myProp = myProp;
  }

  @Override
  public void start() {
    // Initialize the connection to the external client
  }

  @Override
  public void stop() {
    // Disconnect from the external client and do any additional cleanup
    // (e.g. releasing resources or nulling-out field values).
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;
    try {
      // This try clause includes whatever Channel operations you want to do

      // Receive new data (getSomeData() is a placeholder for reading from the external client)
      Event e = getSomeData();

      // Store the Event into this Source's associated Channel(s);
      // the ChannelProcessor handles the channel transaction internally
      getChannelProcessor().processEvent(e);

      status = Status.READY;
    } catch (Throwable t) {
      // Log exception, handle individual exceptions as needed
      status = Status.BACKOFF;
      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error) t;
      }
    }
    return status;
  }
}
4. Flume Usage Modes
Flume data flows are always driven by events. An event is Flume's basic unit of data; it carries log data (as a byte array) along with header information. Events are generated by sources outside the agent, such as a web server. When a source captures an event, it performs format-specific processing and then pushes the event into one or more channels. The channel can be thought of as a buffer that holds the event until a sink has finished processing it. The sink is responsible for persisting the log or pushing the event on to another source.
This is a straightforward design. Notably, Flume provides a large number of built-in source, channel, and sink types, and different types of sources, channels, and sinks can be freely combined. Multiple agents can be chained together, as shown in the figure.
Multiple agents can also be merged, as shown in the figure.
Flume is capable of much more than this: it lets users build multi-level flows, which means that multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes, as shown in the figure.
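As a concrete illustration of multi-agent chaining and fan-in, the sketch below (host names, ports, and paths are placeholders) has a front-tier agent forward events over Avro to a consolidation agent that writes to HDFS; fan-in is obtained simply by pointing the Avro sinks of several front-tier agents at the same Avro source:
# front-tier agent: tail an application log and forward it over Avro
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.org
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1
# consolidation agent on collector.example.org: receive Avro events, write to HDFS
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
a2.channels.c1.type = memory
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a2.sinks.k1.channel = c1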
Reference Documents
Forum post: http://www.aboutyun.com/thread-7848-1-1.html.
Flume User Guide: http://flume.apache.org/FlumeUserGuide.html.
GitHub: https://github.com/apache/flume.
Flume log collection system architecture (reposted).