Scribe, Chukwa, Kafka, and Flume: a log system comparison
1. Background
Many companies' platforms generate large volumes of logs every day (typically streaming data such as search-engine page views, queries, and so on). Processing these logs requires a dedicated log system, which in general needs the following characteristics:
(1) It bridges the application systems and the analysis systems and decouples them from each other.
(2) It supports near-real-time online analysis as well as offline analysis systems such as Hadoop.
(3) It is highly scalable: when the volume of data grows, it can scale out by adding nodes.
This article compares today's open-source log systems, including Facebook's Scribe, Apache Chukwa, LinkedIn's Kafka, and Cloudera's Flume, in terms of design architecture, load balancing, scalability, and fault tolerance.
2. Facebook's Scribe
Scribe is Facebook's open-source log collection system and has been used extensively within Facebook. It collects logs from a variety of log sources and stores them on a central storage system (NFS, a distributed file system, etc.) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs.
Its most important feature is good fault tolerance: when the back-end storage system crashes, Scribe writes the data to local disk, and when the storage system recovers, Scribe replays the locally buffered logs to it.
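A minimal Python sketch of this store-on-failure behavior. It is a conceptual model, not Scribe's actual code; the primary-store object and the spool path are assumptions:

    import os

    class BufferedStore:
        """Conceptual model of Scribe's buffer store: try the primary store,
        fall back to a local spool file, and replay the spool once the
        primary store is healthy again."""

        def __init__(self, primary_store, spool_path="/tmp/scribe_spool.log"):
            self.primary = primary_store      # assumed object with a .write(line) -> bool method
            self.spool_path = spool_path      # local disk buffer used while the store is down

        def write(self, line):
            if self._try_primary(line):
                self._replay_spool()          # primary is healthy: flush any backlog
            else:
                with open(self.spool_path, "a") as spool:
                    spool.write(line + "\n")  # primary is down: buffer locally

        def _try_primary(self, line):
            try:
                return bool(self.primary.write(line))
            except Exception:
                return False

        def _replay_spool(self):
            if not os.path.exists(self.spool_path):
                return
            with open(self.spool_path) as spool:
                pending = [l.rstrip("\n") for l in spool]
            if all(self._try_primary(l) for l in pending):
                os.remove(self.spool_path)    # backlog delivered, clear the spool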
Architecture:
Scribe's architecture is relatively simple, consisting of three parts: the Scribe agent, the Scribe server, and the storage system.
(1) Scribe agent
The Scribe agent is actually a Thrift client. The only way to send data to Scribe is through this Thrift client: Scribe internally defines a Thrift interface, and users send data to the server through it.
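For illustration, here is a minimal client in the spirit of Scribe's bundled Python example. It assumes the Thrift bindings generated from scribe.thrift are importable as a module named scribe, and it uses the commonly used default port 1463:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from scribe import scribe   # bindings generated from scribe.thrift (assumed module name)

    # Connect to a Scribe server (1463 is the commonly used default port).
    socket = TSocket.TSocket(host="localhost", port=1463)
    transport = TTransport.TFramedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(transport, strictRead=False, strictWrite=False)
    client = scribe.Client(iprot=protocol, oprot=protocol)

    transport.open()
    entry = scribe.LogEntry(category="test", message="hello world\n")
    result = client.Log(messages=[entry])    # the interface is essentially a single Log() call
    transport.close()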
(2) Scribe server
The Scribe server receives the data sent by the Thrift client and routes different categories of data to different store objects according to its configuration file. Scribe provides a variety of stores, such as file and HDFS, into which it can load the data.
(3) Storage System
The storage system is simply the store inside Scribe. Scribe currently supports many kinds of store, including file, buffer (two-tier storage with a primary and a secondary store), network (another Scribe server), bucket (contains multiple stores and distributes data among them by hashing), null (discards the data), thriftfile (writes to a Thrift TFileTransport file), and multi (writes the data to multiple stores at once).
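As a rough illustration of the bucket store idea (the sub-store objects here are placeholders, not Scribe's real classes):

    import zlib

    class BucketStore:
        """Conceptual bucket store: route each entry to one of N sub-stores by
        hashing its category, so related data consistently lands together."""

        def __init__(self, substores):
            self.substores = substores        # placeholder list of store-like objects

        def write(self, category, message):
            bucket = zlib.crc32(category.encode("utf-8")) % len(self.substores)
            self.substores[bucket].write(message)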
3. Apache's Chukwa
Chukwa is a very new open-source project. Since it belongs to the Hadoop family, it builds on many Hadoop components (storing data in HDFS and processing it with MapReduce), and it provides many modules to support log analysis for Hadoop clusters.
Requirements:
(1) Flexible, dynamically controllable data sources
(2) High-performance, highly scalable storage system
(3) A suitable framework for analyzing the large-scale data collected
Architecture:
There are three main roles in Chukwa: adaptor, agent, and collector.
(1) Adaptor (data source)
An adaptor can wrap other data sources, such as files, UNIX command-line tools, and so on.
Currently available data sources include Hadoop logs, application metrics, and system parameter data (such as Linux CPU usage).
(2) HDFS Storage System
Chukwa uses HDFS as its storage system. HDFS is designed for storing large files and for a small number of concurrent, high-throughput writers, while a log system needs the opposite: many concurrent, low-rate writers and a large number of small files. Note also that small files written directly to HDFS are not visible until the file is closed, and HDFS does not support reopening a closed file.
(3) Collector and Agent
To overcome the problems described in (2), the agent and collector layers are added.
The agent's role is to provide services to adaptors: starting and stopping adaptors, forwarding their data over HTTP to collectors, and periodically checkpointing adaptor state so that it can be recovered after a crash.
The collector's role is to merge data from multiple data sources and load it into HDFS, hiding the details of the HDFS implementation; for example, when the HDFS version changes, only the collector needs to be modified.
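A simplified sketch of what the collector layer buys you: many small, frequent writes from agents are merged into a few large sequential writes. The chunk format, the threshold, and the local file standing in for an HDFS sink file are all illustrative assumptions:

    class Collector:
        """Conceptual collector: buffer small chunks arriving from many agents
        and flush them to the storage layer as a few large appends."""

        def __init__(self, sink_path, flush_threshold=64 * 1024 * 1024):
            self.sink_path = sink_path            # a local file stands in for an HDFS sink file
            self.flush_threshold = flush_threshold
            self.buffer, self.buffered_bytes = [], 0

        def receive(self, agent_id, chunk):
            # Called once per (small, frequent) post from an agent.
            record = f"{agent_id}\t{chunk}\n"
            self.buffer.append(record)
            self.buffered_bytes += len(record)
            if self.buffered_bytes >= self.flush_threshold:
                self.flush()

        def flush(self):
            with open(self.sink_path, "a") as sink:
                sink.writelines(self.buffer)      # one large sequential write instead of many tiny ones
            self.buffer, self.buffered_bytes = [], 0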
(4) Demux and archiving
Chukwa directly supports processing the data with MapReduce. It includes two built-in MapReduce jobs, which fetch the data and convert it into structured logs that are then stored in the data store (a database, HDFS, etc.).
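A sketch of the demux idea in Hadoop-streaming style: the map step parses raw log lines into (key, structured record) pairs, which a reduce step would then group before they reach the data store. The raw line format ("timestamp level component message") and the field names are invented for illustration, not Chukwa's actual record layout:

    import json
    import sys

    def demux_map(stream=sys.stdin):
        """Map step: turn raw 'timestamp level component message' lines into
        keyed, structured records (tab-separated key/value, streaming style)."""
        for raw in stream:
            parts = raw.rstrip("\n").split(" ", 3)
            if len(parts) < 4:
                continue                          # skip malformed lines
            timestamp, level, component, message = parts
            record = {"ts": timestamp, "level": level, "msg": message}
            print(f"{component}\t{json.dumps(record)}")

    if __name__ == "__main__":
        demux_map()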
4. LinkedIn's Kafka
Kafka was open-sourced in December 2010. It is written in Scala and uses a variety of efficiency optimizations; its overall architecture is relatively novel (push from producers, pull by consumers), making it well suited to heterogeneous clusters.
Design goals:
(1) The cost of data access on disk is O(1)
(2) High throughput: hundreds of thousands of messages per second on a regular server
(3) Distributed architecture, capable of partitioning messages
(4) Support for loading data into Hadoop in parallel
Architecture:
Kafka is essentially a message publish/subscribe system. A producer publishes messages to a topic and a consumer subscribes to a topic; whenever there is a new message for that topic, the broker delivers it to every consumer subscribed to it. In Kafka, messages are organized by topic, and each topic is divided into multiple partitions, which makes it easy to manage data and balance load. Kafka also uses ZooKeeper for load balancing.
There are three main roles in Kafka: producer, broker, and consumer.
(1) Producer
The producer's task is to send data to a broker. Kafka provides two producer interfaces: a low-level interface, which sends data to a particular partition of a particular topic on a specific broker, and a high-level interface, which supports synchronous/asynchronous sending, ZooKeeper-based automatic broker discovery, and load balancing (based on a partitioner, sketched below).
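To illustrate the partitioner idea, here is a hash-based partition chooser; it is a sketch of the general technique, not Kafka's own partitioner API:

    import zlib

    def choose_partition(key, num_partitions):
        """Hash-based partitioner sketch: a stable hash of the message key picks
        the partition, so the same key always maps to the same partition."""
        if key is None:
            return 0                              # keyless messages could also be spread round-robin
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    print(choose_partition("user-42", 8))         # e.g. a hypothetical topic with 8 partitions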
The ZooKeeper-based broker discovery is worth a closer look. A producer can obtain the list of available brokers through ZooKeeper, or it can register a listener in ZooKeeper that is woken up in the following situations:
(a) a broker is added;
(b) a broker is removed;
(c) a new topic is registered;
(d) a broker registers an existing topic.
When the producer learns of any of these events, it can take whatever action is needed.
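A hedged sketch of such a listener using the kazoo ZooKeeper client; the /brokers/ids path mirrors Kafka's conventional broker registry layout, but the ensemble address and the reaction logic are assumptions for illustration:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181")   # hypothetical ZooKeeper ensemble
    zk.start()

    @zk.ChildrenWatch("/brokers/ids")             # path modeled on Kafka's broker registry
    def on_broker_change(broker_ids):
        # Fired whenever a broker registers or disappears; a producer would
        # rebuild its broker list and partition map here.
        print("available brokers:", broker_ids)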
(2) Broker
The broker adopts several strategies to improve data-handling efficiency, including the sendfile system call and zero-copy transfer.
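The zero-copy idea: with sendfile, the kernel moves file bytes directly to the socket without copying them through user space. A minimal illustration using Python's standard library (host, port, and file path are placeholders):

    import socket

    def send_log_segment(path, host="localhost", port=9999):
        """Stream a log segment file to a consumer socket with sendfile, so the
        bytes are copied by the kernel rather than through user-space buffers."""
        with socket.create_connection((host, port)) as sock, open(path, "rb") as segment:
            sock.sendfile(segment)                # falls back to send() where sendfile is unavailable

    # send_log_segment("/tmp/segment.log")        # requires a listener on host:port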
(3) Consumer
The consumer's role is to load log data into a central storage system. Kafka provides two consumer interfaces. One is low-level: it maintains a connection to a particular broker, and the connection is stateless, meaning that each time the consumer pulls data it must tell the broker the offset to read from. The other is a high-level interface that hides the broker details, so the consumer can pull data without caring about the network topology. More importantly, in most log systems the broker keeps track of what data each consumer has already received, whereas in Kafka this information is maintained by the consumer itself.
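A conceptual sketch of this pull model, in which the consumer, not the broker, tracks how far it has read; broker_fetch and handle are stand-in callables, not Kafka's real consumer API:

    def consume(broker_fetch, topic, partition, handle, start_offset=0):
        """Stateless pull loop: every fetch tells the broker which offset to read
        from, and the consumer itself advances (and could persist) that offset."""
        offset = start_offset                     # the consumer owns this state, not the broker
        while True:
            messages, next_offset = broker_fetch(topic, partition, offset)
            if not messages:
                return offset                     # caught up; a real consumer would poll again
            for msg in messages:
                handle(msg)                       # e.g. append to the central storage system
            offset = next_offset                  # natural checkpoint for restart/replay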
5. Cloudera's Flume
Flume is Cloudera's log system, open-sourced in July 2009. It has a wide range of built-in components that users can use with little or no additional development.
Design goals:
(1) Reliability
When a node fails, logs can be delivered to other nodes without loss. Flume provides three levels of reliability, from strongest to weakest: end-to-end (the receiving agent first writes the event to disk and deletes it only after the transfer succeeds; if sending fails, the event can be re-sent), store-on-failure (the strategy Scribe also uses: when the receiver crashes, data is written locally and sending resumes after recovery), and best-effort (data is sent to the receiver without any acknowledgement).
(2) Scalability
Flume uses a three-tier architecture of agent, collector, and storage, each of which can be scaled horizontally. All agents and collectors are managed by a master, which makes the system easy to monitor and maintain, and multiple masters are allowed (coordinated and load-balanced with ZooKeeper), which avoids a single point of failure.
(3) Manageability
All agents and collectors are managed centrally by the master, which makes the system easy to maintain. On the master, users can view the execution of individual data sources or data flows, and can configure and dynamically reload individual data sources. Flume provides both a web interface and a shell command for managing data flows.
(4) Functional extensibility
Users can add their own agents, collectors, or storage back-ends as needed. In addition, Flume ships with many components, including various agents (file, syslog, etc.), collectors, and storage back-ends (file, HDFS, etc.).
Architecture:
As mentioned above, Flume uses a layered architecture consisting of the agent, collector, and storage layers. Both the agent and the collector are composed of two parts, a source and a sink: the source is where data comes from, and the sink is where data goes.
(1) Agent
The agent's role is to send data from the data source to a collector. Flume comes with many ready-to-use data sources (source components), such as:
text("filename"): sends the file filename as a data source, line by line
tail("filename"): watches filename for newly generated data and sends it line by line
syslogTcp(5140): listens on TCP port 5140 and forwards the data it receives
A number of sinks are also available, such as:
console[("format")]: displays the data directly in the console
text("txtfile"): writes the data to the file txtfile
dfs("dfsfile"): writes the data to the file dfsfile on HDFS
syslogTcp("host", port): sends the data over TCP to the host node
(2) Collector
The collector's role is to aggregate data from multiple agents and load it into the storage layer. Its source and sink are similar to the agent's.
In the following example, the agent listens on TCP port 5140 for incoming data and sends it to the collector, which loads it into HDFS.
    host : syslogTcp(5140) | agentSink("localhost", 35853);
    collector : collectorSource(35853) | collectorSink("hdfs://namenode/user/flume/", "syslog");
A more complex example follows: there are six agents and three collectors, and all collectors import their data into HDFS. Agents A and B send data to collector A, agents C and D to collector B, and agents E and F to collector C. Each agent is also given an end-to-end reliability guarantee (Flume's three reliability levels are implemented by agentE2EChain, agentDFOChain, and agentBEChain respectively), so that, for example, when collector A fails, agent A and agent B will send their data to collector B and collector C respectively.
The corresponding configuration fragment is shown below:
    agentA : src | agentE2EChain("collectorA:35853", "collectorB:35853");
    agentB : src | agentE2EChain("collectorA:35853", "collectorC:35853");
    agentC : src | agentE2EChain("collectorB:35853", "collectorA:35853");
    agentD : src | agentE2EChain("collectorB:35853", "collectorC:35853");
    agentE : src | agentE2EChain("collectorC:35853", "collectorA:35853");
    agentF : src | agentE2EChain("collectorC:35853", "collectorB:35853");
    collectorA : collectorSource(35853) | collectorSink("hdfs://...", "src");
    collectorB : collectorSource(35853) | collectorSink("hdfs://...", "src");
    collectorC : collectorSource(35853) | collectorSink("hdfs://...", "src");
In addition, with autoE2EChain, when a collector fails, Flume automatically detects another available collector and redirects the data to it.
(3) Storage
Storage is the storage layer; it can be an ordinary file, or HDFS, Hive, HBase, and so on.
6. Summary
Judging from the architectures of these four systems, a typical log system needs three basic components: an agent (wraps the data source and sends its data to the collector), a collector (receives data from multiple agents, aggregates it, and imports it into the back-end store), and a store (a central storage system that should be scalable and reliable, and should support the currently very popular HDFS).
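These three roles can be summarized in a few illustrative lines of Python (generic interfaces, not any one system's API):

    class Agent:
        """Wraps a data source and forwards its events to a collector."""
        def __init__(self, source, collector):
            self.source, self.collector = source, collector

        def run_once(self):
            for event in self.source():           # source() yields raw log events
                self.collector.receive(event)

    class Collector:
        """Aggregates events from many agents and writes them to the store in batches."""
        def __init__(self, store, batch_size=1000):
            self.store, self.batch_size, self.batch = store, batch_size, []

        def receive(self, event):
            self.batch.append(event)
            if len(self.batch) >= self.batch_size:
                self.store.write(self.batch)       # the store could be a file, HDFS, etc.
                self.batch = []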
The table below compares the four systems:
7. References
Scribe home page: https://github.com/facebook/scribe
Chukwa home page: http://incubator.apache.org/chukwa/
Kafka home page: http://sna-projects.com/kafka/
Flume home page: https://github.com/cloudera/flume/
Source: http://my.oschina.net/sunzy/blog/183795