Address: http://www.cnblogs.com/ibook360/p/3159544.html
1. Background
Many companies' platforms generate large volumes of logs every day (generally streaming data, such as search-engine page views and queries). Processing these logs requires a dedicated log system, and such systems generally must have the following features:
(1) build a bridge between application systems and analysis systems, decoupling the two;
(2) support both near-real-time online analysis and offline analysis systems such as Hadoop;
(3) high scalability, that is, the ability to scale horizontally by adding nodes as the data volume grows.
This article compares the current open-source log systems in terms of design architecture, load balancing, scalability, and fault tolerance, covering Facebook's Scribe, Apache's Chukwa, LinkedIn's Kafka, and Cloudera's Flume.
2. Facebook's Scribe
Scribe is Facebook's open-source log collection system and has been widely used inside Facebook. It can collect logs from various log sources and store them in a central storage system (such as NFS or a distributed file system) for centralized statistical analysis and processing. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs.
Its most important feature is fault tolerance: when the back-end storage system crashes, scribe writes the data to local disk, and when the storage system recovers, scribe reloads the buffered logs into it.
Architecture:
The architecture of scribe is relatively simple. It consists of three parts: scribe agent, scribe and storage system.
(1) scribe agent
The scribe agent is in fact a Thrift client; using the Thrift client is the only way to send data to scribe. Scribe internally defines a Thrift interface that clients use to send data to the server.
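As an illustration, here is a minimal sketch of such a Thrift client in Java. It assumes the classes scribe.Client, LogEntry, and ResultCode have been generated from scribe's scribe.thrift interface definition, that a scribe server is listening on its conventional port 1463, and that a Thrift version exposing TFramedTransport under org.apache.thrift.transport is on the classpath; the category and message are made up.

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import java.util.Collections;

public class ScribeClientSketch {
    public static void main(String[] args) throws Exception {
        // Scribe speaks Thrift over a framed transport with the binary protocol.
        TTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));
        transport.open();
        // Each LogEntry carries a category (used to route it to a store) and a message body.
        LogEntry entry = new LogEntry("web_access", "10.0.0.1 GET /index.html 200");
        ResultCode rc = client.Log(Collections.singletonList(entry));
        transport.close();
    }
}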
(2) scribe
Scribe receives the data sent by the Thrift client and, according to its configuration file, routes messages of different topics (categories) to different stores. Scribe provides a variety of stores, such as file and HDFS, into which it can write the data.
(3) Storage System
The storage system is simply the store inside scribe. Scribe currently supports many kinds of store, including file, buffer (two-layer storage with one primary and one secondary store), network (another scribe server), bucket (contains multiple stores and distributes data among them by hashing), null (discards the data), thriftfile (writes to a Thrift TFileTransport file), and multi (writes the data to several stores at the same time).
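As a sketch of how this looks in practice, the fragment below, written in scribe's own configuration format, sets up a buffer store for the default category with a network store (another scribe server) as primary and a local file store as secondary; the host name, paths, and tuning values are illustrative only.

<store>
category=default
type=buffer

# flush buffered data frequently; retry the primary roughly every 30s
max_write_interval=1
retry_interval=30
retry_interval_range=10

<primary>
type=network
remote_host=central-scribe.example.com
remote_port=1463
</primary>

<secondary>
type=file
fs_type=std
file_path=/tmp/scribe_buffer
max_size=100000000
</secondary>
</store>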
3. Apache Chukwa
Chukwa is a fairly new open-source project. As a member of the Hadoop family, it builds on many Hadoop components (data is stored in HDFS and processed with MapReduce) and provides many modules to support log analysis for Hadoop clusters.
Requirements:
(1) flexible and dynamically controllable data sources
(2) a high-performance, highly scalable storage system
(3) a suitable framework for analyzing the large-scale data it collects
Architecture:
There are three main roles in Chukwa: adaptor, agent, and collector.
(1) Adaptor (data source)
An adaptor encapsulates other data sources, such as files and the output of Unix command-line tools.
Currently available data sources include Hadoop logs, application metric data, and system parameters (such as Linux CPU usage).
(2) HDFS Storage System
Chukwa uses HDFS as its storage system. HDFS was designed for storing large files written by few concurrent writers at high rates, whereas a log system has the opposite characteristics: it must support highly concurrent, low-rate writes and the storage of a large number of small files. Note that small files written directly to HDFS are not visible to readers until the file is closed; in addition, HDFS does not support reopening a file once it is closed.
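To illustrate the visibility issue, here is a minimal sketch using the present-day Hadoop FileSystem API (the path and the log line are made up): bytes written to an open file are not guaranteed to be visible to readers until hflush() or close() is called.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/chukwa/logs/part-00000"));
        out.writeBytes("one buffered log record\n");
        // Up to this point the bytes may still sit in the client-side buffer, invisible to readers.
        out.hflush();   // pushes the data to the datanodes so readers can see it
        out.close();    // only now is the file length final and the file fully visible
        fs.close();
    }
}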
(3) Collector and Agent
To overcome the problems described in (2), the agent and collector stages are introduced.
The agent provides various services to adaptors, including starting and stopping them, passing their data to the collector over HTTP, and periodically recording adaptor state so that adaptors can be recovered after a crash.
The collector merges the data sent by multiple data sources and loads it into HDFS. It also hides the implementation details of HDFS; for example, if the HDFS version changes, only the collector needs to be modified.
(4) Demux and archiving
MapReduce can be used directly to process the data. Chukwa has two built-in MapReduce jobs, one for archiving the collected data and one for converting it into structured logs (demux), which are then stored in a data store (such as a database or HDFS).
4. LinkedIn's Kafka
Kafka is a newly open-sourced project. It is written in Scala, uses a variety of efficiency optimizations, and has a relatively novel overall architecture (push on the producer side, pull on the consumer side), which makes it better suited to heterogeneous clusters.
Design goals:
(1) O(1) cost for data access on disk
(2) high throughput: hundreds of thousands of messages per second on ordinary servers
(3) a distributed architecture with support for partitioning messages
(4) support for loading data into Hadoop in parallel
Architecture:
Kafka is essentially a message publish/subscribe system. A producer publishes messages to a topic, and consumers subscribe to that topic's messages; once a message on a topic is produced, the broker delivers it to all consumers subscribed to the topic. In Kafka, messages are organized by topic, and each topic is divided into multiple partitions, which makes data management and load balancing easier. Kafka also uses ZooKeeper for load balancing.
Kafka has three main roles: producer, broker, and consumer.
(1) Producer
The producer's task is to send data to the broker. Kafka provides two producer interfaces: a low-level interface, which sends data to a particular partition of a particular topic on a specific broker, and a high-level interface, which supports synchronous/asynchronous sending, ZooKeeper-based broker discovery, and load balancing (via a partitioner).
ZooKeeper-based broker discovery deserves special mention. The producer can obtain the list of available brokers through ZooKeeper and register a listener there; the listener is triggered in the following situations:
A. a broker is added
B. a broker is removed
C. a new topic is registered
D. a broker registers for an existing topic
When the producer is notified of one of these events, it can take whatever action is needed.
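As a concrete illustration, here is a minimal sketch of publishing a log line to a topic using the present-day Kafka Java client (not the original Scala client described above, so the API details differ; the broker address, topic name, key, and message are made up).

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // today's client discovers brokers from a bootstrap list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("hostA") is fed to the partitioner to pick a partition of the topic.
            producer.send(new ProducerRecord<>("web_logs", "hostA", "10.0.0.1 GET /index.html 200"));
        }
    }
}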
(2) Broker
The broker adopts several techniques to improve data-handling efficiency, including sendfile and zero-copy transfers.
(3) Consumer
The consumer's role is to load the log data into a central storage system. Kafka provides two consumer interfaces. One is the low-level interface: it keeps a connection to a single broker, and the connection is stateless, so on every pull the consumer must tell the broker the offset of the data it wants. The other is the high-level interface, which hides broker details and lets the consumer pull data from the brokers without caring about the network topology. More importantly, in most log systems the broker keeps track of which data each consumer has already fetched, whereas in Kafka the consumer maintains this information itself.
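The point that the consumer, not the broker, tracks what has been consumed can be sketched with the present-day Java consumer (again, not the original interface described above; the broker address, group id, and topic are made up): the consumer pulls records from its current offset and commits its own offsets.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "hdfs-loader");
        props.put("enable.auto.commit", "false");          // offsets are committed explicitly below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("web_logs"));
            while (true) {
                // Pull: the consumer asks the brokers for data starting at its current offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync();  // the consumer itself records how far it has read
            }
        }
    }
}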
5. Cloudera's Flume
Flume is a log system open-sourced by Cloudera in July 2009. Its built-in components are quite complete, and users can use it without any additional development.
Design goals:
(1) Reliability
When a node fails, logs can be forwarded to other nodes without being lost. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (on receiving the data, the agent first writes the event to disk, deletes it after it has been delivered successfully, and resends it if delivery fails), store on failure (the strategy scribe also uses: when the receiver crashes, the data is written to local disk and sent again after the receiver recovers), and best effort (the data is sent to the receiver without any acknowledgment).
(2) Scalability
Flume uses a three-tier architecture consisting of agent, collector, and storage layers, each of which can be scaled horizontally. All agents and collectors are centrally managed by the master, which makes the system easy to monitor and maintain, and multiple masters are allowed (managed and load-balanced via ZooKeeper), which avoids a single point of failure.
(3) Manageability
All agents and collectors are centrally managed by the master, which makes the system easy to maintain. On the master you can view the status of each data source or data flow, and configure and dynamically load each data source. Flume provides both a web interface and shell script commands for managing data flows.
(4) Functional scalability
Users can add their own agents, collectors, or storage back ends as needed. In addition, Flume ships with many components, including a variety of agents (such as file and syslog), collectors, and storage back ends (such as file and HDFS).
Architecture:
As mentioned above, Flume uses a layered architecture with three tiers: agent, collector, and storage. The agent and the collector are each composed of two parts, a source and a sink: the source is where the data comes from and the sink is where the data goes.
(1) Agent
The agent sends data from the data source to the collector. Flume comes with many data sources that can be used directly, such as:
text("filename"): takes the file filename as the data source and sends it line by line.
tail("filename"): watches for new data appended to filename and sends it line by line.
syslogTcp(5140): listens on TCP port 5140 and sends the data it receives.
At the same time, many sinks are provided, such as:
console[("format")]: prints the data directly to the console.
text("txtfile"): writes the data to the file txtfile.
dfs("dfsfile"): writes the data to the file dfsfile on HDFS.
syslogTcp("host", port): forwards the data to the host node over TCP.
(2) Collector
The collector aggregates the data from multiple agents and loads it into the storage tier. Its sources and sinks are similar to the agent's.
In the following example, the agent listens on TCP port 5140 and sends the data it receives to the collector, which loads it into HDFS.
host : syslogTcp(5140) | agentSink("localhost",35853);
collector : collectorSource(35853) | collectorSink("hdfs://namenode/user/flume/","syslog");
A more complex example is as follows:
There are six agents and three collectors, and all collectors write their data into HDFS. Agents A and B send data to collector A, agents C and D to collector B, and agents E and F to collector C. In addition, each agent is given an end-to-end reliability guarantee (Flume's three reliability levels are implemented by agentE2EChain, agentDFOChain, and agentBEChain respectively), so that, for example, when collector A fails, agent A and agent B send their data to collector B and collector C instead.
The following is a short configuration file segment:
agentA : src | agentE2EChain("collectorA:35853","collectorB:35853");
agentB : src | agentE2EChain("collectorA:35853","collectorC:35853");
agentC : src | agentE2EChain("collectorB:35853","collectorA:35853");
agentD : src | agentE2EChain("collectorB:35853","collectorC:35853");
agentE : src | agentE2EChain("collectorC:35853","collectorA:35853");
agentF : src | agentE2EChain("collectorC:35853","collectorB:35853");
collectorA : collectorSource(35853) | collectorSink("hdfs://...","src");
collectorB : collectorSource(35853) | collectorSink("hdfs://...","src");
collectorC : collectorSource(35853) | collectorSink("hdfs://...","src");
In addition, when autoE2EChain is used, Flume automatically detects another available collector when a collector fails and redirects the data to it.
(3) Storage
The storage tier is the storage system; it can be an ordinary file, HDFS, Hive, HBase, and so on.
6. Summary
Judging from the architecture of these four systems, a typical log system needs three basic components: an agent (which encapsulates the data source and sends its data to the collector), a collector (which receives data from multiple agents, aggregates it, and imports it into the back-end store), and a store (a central storage system, which should be scalable and reliable, and should support the currently very popular HDFS).
The following table compares the four systems:
7. References
Scribe home: https://github.com/facebook/scribe
Chukwa home: http://incubator.apache.org/chukwa/
Kafka home: http://sna-projects.com/kafka/
Flume home: https://github.com/cloudera/flume/
From: Comparison of open-source log systems, http://dongxicheng.org/search-engine/log-systems/