Scribe, Chukwa, Kafka, and Flume: a comparison of log collection systems

1. Background

Many platforms in a company generate large volumes of logs every day (typically streaming data, such as search-engine page views and queries). Processing these logs requires a dedicated log system, which in general needs the following characteristics:

(1) It acts as a bridge between application systems and analysis systems and decouples the two.

(2) It supports both near-real-time online analysis and offline analysis systems such as Hadoop.

(3) It is highly scalable: when the data volume grows, it can be scaled out by adding nodes.

This article compares current open-source log systems in terms of design architecture, load balancing, scalability, and fault tolerance, covering Facebook's Scribe, Apache's Chukwa, LinkedIn's Kafka, and Cloudera's Flume.

2. Facebook's Scribe

Scribe is Facebook's open-source log collection system and has been used heavily inside Facebook. It collects logs from a variety of log sources and stores them on a central storage system (which can be NFS, a distributed file system, etc.) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for distributed log collection and processing.

Its most important feature is good fault tolerance: when the back-end storage system crashes, Scribe writes the data to the local disk, and when the storage system recovers, Scribe reloads the buffered logs into it.

Architecture:

Scribe's architecture is relatively simple and consists of three parts: the scribe agent, scribe (the server), and the storage system.

(1) scribe Agent

The scribe agent is actually a Thrift client. The only way to send data to scribe is through the Thrift client; scribe defines a Thrift interface that users call to send data to the server.
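For illustration, a minimal sketch of a scribe agent in Java might look like the following. The scribe, LogEntry, and ResultCode classes are assumed to be generated from scribe's Thrift interface definition; the host, port, and category are placeholders.

    import java.util.Collections;

    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    // Minimal sketch of a scribe agent: a Thrift client that sends one log entry.
    // "scribe", "LogEntry" and "ResultCode" are assumed to be the classes generated
    // from scribe's Thrift IDL; host, port and category below are placeholders.
    public class ScribeClientSketch {
        public static void main(String[] args) throws Exception {
            TSocket socket = new TSocket("localhost", 1463);           // scribe's default port
            TFramedTransport transport = new TFramedTransport(socket); // scribe expects a framed transport
            TBinaryProtocol protocol = new TBinaryProtocol(transport);
            scribe.Client client = new scribe.Client(protocol);

            transport.open();
            LogEntry entry = new LogEntry("app_log", "something happened\n");
            ResultCode rc = client.Log(Collections.singletonList(entry));
            System.out.println("scribe returned: " + rc);
            transport.close();
        }
    }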

(2) scribe

Scribe receives the data sent by Thrift clients and, according to its configuration file, routes data of different categories (topics) to different destinations. Scribe provides a variety of stores, such as file and HDFS, into which it can load the data.

(3) Storage System

The storage system is simply the store in scribe. Scribe currently supports many store types, including file (a regular file), buffer (two-tier storage with one primary and one secondary store), network (another scribe server), bucket (contains multiple stores and distributes data among them by hash), null (discards the data), thriftfile (writes to a Thrift TFileTransport file), and multi (writes the data to several stores at the same time).
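As a rough sketch of how stores are configured (the key names follow scribe's example configuration files and should be checked against the version in use), a buffer store that forwards to a downstream scribe server and falls back to a local file might look like this:

    port=1463

    <store>
    category=default
    type=buffer

    <primary>
    type=network
    remote_host=downstream-scribe-host
    remote_port=1463
    </primary>

    <secondary>
    type=file
    fs_type=std
    file_path=/tmp/scribe_buffer
    </secondary>
    </store>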

3. Apache's Chukwa

Chukwa is a relatively new open-source project. Because it belongs to the Hadoop family, it reuses many Hadoop components (HDFS for storage, MapReduce for processing data) and provides many modules to support log analysis for Hadoop clusters.

Requirements:

(1) Flexible, dynamically controllable data sources

(2) A high-performance, highly scalable storage system

(3) An appropriate framework for analyzing the large volumes of collected data

Architecture:

There are three main roles in Chukwa: adaptor, agent, and collector.

(1) Adaptor (data source)

An adaptor encapsulates other data sources, such as files or the output of UNIX command-line tools.

Currently available data sources include Hadoop logs, application metrics, and system parameter data (such as Linux CPU usage).

(2) HDFS Storage System

Chukwa uses HDFS as its storage system. HDFS is designed for storing large files with a small number of concurrent, high-throughput writers, whereas a log system has the opposite profile: it must support many concurrent, low-rate writers and a large number of small files. Note also that data written directly to HDFS is not visible until the file is closed, and HDFS does not support reopening a file for further writes.
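To make the visibility constraint concrete, here is a small sketch using the standard HDFS Java API (the cluster URI and path are placeholders): data written to a file only becomes reliably visible to readers once the file has been closed.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of writing a small log file to HDFS; the URI and path are placeholders.
    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration());
            Path path = new Path("/logs/app/part-00000");

            FSDataOutputStream out = fs.create(path);
            out.writeBytes("one log line\n");
            // Until close() returns, readers of this file will not reliably see the data.
            out.close();

            fs.close();
        }
    }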

(3) Collector and Agent

To overcome the problems described in (2), the agent and collector stages were added.

Agent: provides various services to adaptors, including starting and shutting down adaptors, forwarding their data over HTTP to collectors, and periodically recording adaptor state so that it can recover after a crash.

Collector: merges data from multiple data sources and loads it into HDFS. It also hides HDFS implementation details; for example, when the HDFS version changes, only the collector needs to be modified.

(4) Demux and archiving

Chukwa directly supports processing data with MapReduce. It has two built-in MapReduce jobs, used respectively to archive the data and to convert it into structured logs, which are stored in the data store (a database, HDFS, etc.).
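Chukwa's built-in jobs are part of Chukwa itself, but the general idea of turning raw log lines into structured records with MapReduce can be sketched with the standard Hadoop API. The log-line format below is invented for illustration and is not Chukwa's actual record format.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustration only: a mapper that splits raw log lines of the (hypothetical) form
    // "timestamp level message" into key/value records keyed by log level.
    // Chukwa's real demux job operates on Chukwa's own chunk/record types instead.
    public class LogDemuxMapperSketch extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(" ", 3);
            if (parts.length == 3) {
                String timestamp = parts[0];
                String level = parts[1];
                String message = parts[2];
                context.write(new Text(level), new Text(timestamp + "\t" + message));
            }
        }
    }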

4. LinkedIn's Kafka

Kafka was open-sourced in December 2010. It is written in Scala, uses a variety of efficiency optimizations, and has a relatively novel overall architecture (push/pull), which makes it well suited to heterogeneous clusters.

Design objectives:

(1) Data on disk is accessed in O(1) time

(2) High throughput: hundreds of thousands of messages per second on commodity servers

(3) A distributed architecture with the ability to partition messages

(4) Support for loading data into Hadoop in parallel


Architecture:

Kafka is essentially a publish/subscribe messaging system. A producer publishes messages to a topic, a consumer subscribes to a topic, and when a new message arrives on a topic, the broker delivers it to all consumers subscribed to it. In Kafka, messages are organized by topic, and each topic is divided into multiple partitions, which makes it easier to manage data and balance load. Kafka also uses ZooKeeper for load balancing.

There are three main roles in Kafka: producer, broker, and consumer.

(1) Producer

The producer's task is to send data to the broker. Kafka provides two producer interfaces: a low-level interface, which sends data to a particular partition of a particular topic on a specific broker, and a high-level interface, which supports synchronous/asynchronous sending, ZooKeeper-based automatic broker discovery, and partitioner-based load balancing.
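For a concrete picture, the sketch below uses the current Kafka Java client (org.apache.kafka.clients.producer) rather than the 2010-era interfaces described here; note that the modern client discovers brokers through a bootstrap.servers list instead of ZooKeeper. The broker address and topic name are placeholders.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Minimal sketch with the modern Kafka Java client (not the original 2010 API).
    // Broker address and topic are placeholders.
    public class KafkaProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Asynchronous send; the callback reports the partition and offset on success.
                producer.send(new ProducerRecord<>("app_log", "key-1", "one log line"),
                        (metadata, exception) -> {
                            if (exception == null) {
                                System.out.println("wrote to partition " + metadata.partition()
                                        + " at offset " + metadata.offset());
                            } else {
                                exception.printStackTrace();
                            }
                        });
            } // close() flushes any outstanding records
        }
    }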

ZooKeeper-based automatic broker discovery deserves a closer look. A producer can obtain the list of available brokers through ZooKeeper, and it can also register a listener with ZooKeeper that is woken up in the following situations:

a. A broker is added

b. A broker is removed

c. A new topic is registered

d. A broker registers an existing topic

When the producer learns of these events, it can take appropriate action as needed.

(2) Broker

The broker adopts a variety of strategies to improve data-handling efficiency, including sendfile and zero-copy techniques.
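The zero-copy idea can be illustrated with Java's FileChannel.transferTo, which on Linux is typically backed by sendfile(2), so file data flows to the socket without being copied through user-space buffers. This is a sketch of the mechanism, not Kafka's broker code.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustration of zero-copy transfer: FileChannel.transferTo typically maps to
    // sendfile(2) on Linux, so the file bytes reach the socket without an extra
    // copy through user-space buffers. Not Kafka's actual broker code.
    public class ZeroCopySketch {
        static void sendFile(Path logSegment, SocketChannel socket) throws IOException {
            try (FileChannel file = FileChannel.open(logSegment, StandardOpenOption.READ)) {
                long position = 0;
                long remaining = file.size();
                while (remaining > 0) {
                    long sent = file.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }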

(3) Consumer

The consumer's role is to load the log data into a central storage system. Kafka provides two consumer interfaces. The low-level interface maintains a connection to one broker, and the connection is stateless: each time the consumer pulls data from the broker, it must tell the broker the offset of the data it wants. The high-level interface hides the broker details and lets the consumer pull data without caring about the network topology. More importantly, in most log systems the broker keeps track of which data each consumer has already fetched, whereas in Kafka the consumer maintains this information itself.
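Again for illustration with the modern Java client (broker address, topic, and group id are placeholders): the consumer pulls records with poll() and is responsible for its own position, committing offsets when it chooses rather than having the broker track them.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    // Minimal sketch with the modern Kafka Java client (not the original 2010 API).
    // The consumer pulls data and tracks/commits its own offsets.
    public class KafkaConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "log-loader");
            props.put("enable.auto.commit", "false"); // commit offsets explicitly below
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("app_log"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // e.g. forward record.value() to the central storage system
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                    consumer.commitSync(); // the consumer, not the broker, records its position
                }
            }
        }
    }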

5. Cloudera's Flume

Flume is a log system open-sourced by Cloudera in July 2009. Its built-in components are quite complete, so users can deploy it with little additional development.

Design objectives:

(1) Reliability

When a node fails, logs can still be delivered to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the agent that receives the data first writes the event to disk and deletes it only after the transfer succeeds; if the transfer fails, the data can be resent), store-on-failure (the strategy scribe also adopts: when the data receiver crashes, the data is written locally and sending resumes after the receiver recovers), and best-effort (data is sent to the receiver without any acknowledgement).

(2) Scalability

Flume employs a three-tier architecture of agent, collector, and storage, each of which can be scaled horizontally. All agents and collectors are managed by a master, which makes the system easy to monitor and maintain, and multiple masters are allowed (managed and load-balanced via ZooKeeper), which avoids a single point of failure.

(3) Manageability

All agents and collectors are managed centrally by the master, which makes the system easier to maintain. On the master, users can view the status of individual data sources and data flows, and can configure and dynamically load individual data sources. Flume provides two ways to manage data flows: a web interface and shell script commands.

(4) Functional Scalability

Users can add their own agents, collectors, or storage back ends as needed. In addition, Flume ships with many components, including various agents (file, syslog, etc.), collectors, and storage back ends (file, HDFS, etc.).

Architecture:

As mentioned earlier, Flume uses a layered architecture consisting of three layers: agent, collector, and storage. Both the agent and the collector are composed of two parts, a source and a sink: the source is where the data comes from, and the sink is where the data goes.

(1) Agent

The role of the agent is to send data from the data source to the collector. Flume comes with a number of directly usable data sources (sources), such as:

Text ("filename"): file filename as the data source, sent by row

Tail ("filename"): Detect the new data generated by filename, sent out by line

Fsyslogtcp (5140): Listens for TCP's 5140 port, and receives the data to send out

Flume also provides many sinks, such as:

console[("format")]: Show data directly on the desktop

Text ("txtfile"): Writes data to a file txtfile

DFS ("Dfsfile"): Writes data to a dfsfile file on HDFs

SYSLOGTCP ("host", Port): Passing data over TCP to the host node

(2) Collector

The role of the collector is to aggregate data from multiple agents and load it into storage. Its sources and sinks are similar to the agent's.

As an example, an agent can listen on TCP port 5140 and send the received data to a collector, which then loads the data into HDFS.
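A sketch of such a configuration in Flume's original (pre-NG) data-flow syntax might look roughly like this; the node names, the collector host and port, and the HDFS path are placeholders, and agentSink/collectorSource/collectorSink are the usual agent-to-collector connectors:

    agent-node : syslogTcp(5140) | agentSink("collector-host", 35853) ;
    collector-node : collectorSource(35853) | collectorSink("hdfs://namenode/user/flume/logs", "syslog") ;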
