1. Background information
Many companies' platforms generate a large volume of logs every day (typically streaming data, such as search-engine page views, queries, and so on). Processing these logs requires a dedicated logging system, and in general such a system needs the following characteristics:
(1) Build a bridge between the application systems and the analysis systems, and decouple them from each other;
(2) Support both near-real-time online analysis systems and offline analysis systems such as Hadoop;
(3) Provide high scalability; that is, as the volume of data grows, the system can be scaled out by adding nodes.
This article compares today's open-source log systems, namely Facebook's Scribe, Apache's Chukwa, LinkedIn's Kafka, and Cloudera's Flume, in terms of design architecture, load balancing, scalability, and fault tolerance.
2. Facebook's Scribe
Scribe is Facebook's open-source log collection system and has been used extensively within Facebook. It can collect logs from a variety of log sources and store them on a central storage system
(such as NFS or a distributed file system) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs.
Its most important feature is its good fault tolerance: when the back-end storage system crashes, Scribe writes the data to local disk, and when the storage system returns to normal, Scribe reloads the logs into the storage system.
Architecture:
The architecture of Scribe is relatively simple and consists of three parts: the Scribe agent, Scribe itself, and the storage system.
(1) Scribe agent
The Scribe agent is actually a Thrift client. The only way to send data to Scribe is through this Thrift client: Scribe internally defines a Thrift interface, and users send data to the server through it.
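As a rough illustration, a client generated from Scribe's Thrift definition (which exposes a Log call taking a list of LogEntry records, each with a category and a message) could be used from Java roughly as follows. The generated class names (scribe.Client, LogEntry) and the transport details are assumptions based on the standard Thrift-generated code and may differ slightly between Thrift versions:

import java.util.Collections;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class ScribeAgentSketch {
    public static void main(String[] args) throws Exception {
        // Scribe listens on port 1463 by default and expects a framed binary Thrift transport.
        TSocket socket = new TSocket("localhost", 1463);
        TFramedTransport transport = new TFramedTransport(socket);
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));

        transport.open();
        // Each LogEntry carries a category (used by the server to choose a store) and the message itself.
        LogEntry entry = new LogEntry("search_pv", "2011-01-01 12:00:00 q=hadoop hits=42");
        client.Log(Collections.singletonList(entry));
        transport.close();
    }
}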
(2) Scribe
Scribe receives the data sent by the Thrift client and, according to its configuration file, sends data with different topics (categories) to different stores. Scribe provides a variety of stores, such as file, HDFS, and so on, into which it can load the data.
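For example, a server configuration along the following lines (a sketch modeled on the sample configurations shipped with Scribe; exact keys may vary by version) routes the default category to a buffer store whose primary is another Scribe server and whose secondary is a local file, which is the mechanism behind the fault tolerance described above:

port=1463

<store>
category=default
type=buffer

max_write_interval=1
retry_interval=30

<primary>
type=network
remote_host=central-scribe-host
remote_port=1463
</primary>

<secondary>
type=file
fs_type=std
file_path=/tmp/scribe
base_filename=default_log
max_size=1000000
</secondary>
</store>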
(3) Storage System
The storage system is actually the store in Scribe. Scribe currently supports a large number of stores, including file, buffer (two-tier storage with one primary store and one secondary store), network (another Scribe server), bucket (contains multiple stores and hashes data into different ones), null (discards the data), thriftfile (writes to a Thrift TFileTransport file), and multi (writes the data to multiple stores at the same time).
3. Apache's Chukwa
Chukwa is a fairly new open-source project. Because it belongs to the Hadoop product family, it reuses many Hadoop components (data is stored in HDFS and processed with MapReduce), and it provides many modules to support log analysis for Hadoop clusters.
Requirements:
(1) Flexible, dynamically controllable data sources
(2) High-performance, highly scalable storage system
(3) A suitable framework for analyzing the large-scale data that is collected
Architecture:
There are three main roles in Chukwa, namely the adaptor, the agent, and the collector.
(1) Adaptor (data source)
An adaptor can wrap other data sources, such as files, Unix command-line tools, and so on.
The data sources currently available include Hadoop logs, application metrics, and system parameter data (such as Linux CPU usage).
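Conceptually, an adaptor simply wraps a data source and pushes whatever that source produces to the agent as chunks. The Java interfaces below are a hypothetical simplification for illustration only, not Chukwa's actual adaptor API:

// Hypothetical, simplified adaptor contract (Chukwa's real interfaces differ in detail).
interface ChunkReceiver {
    // The agent implements this and forwards received chunks to a collector over HTTP.
    void add(String dataType, byte[] chunk);
}

interface Adaptor {
    // Started by the agent with source-specific arguments, e.g. a file name to tail
    // or a command line to run; emits data to the receiver under the given data type.
    void start(String dataType, String args, ChunkReceiver dest) throws Exception;

    // A resumable position (e.g. a file offset), checkpointed periodically by the agent
    // so the adaptor can continue where it left off after a crash.
    String getStatus();

    void shutdown() throws Exception;
}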
(2) HDFS Storage System
Chukwa uses HDFS as its storage system. HDFS is designed for storing large files and for a small number of concurrent, high-speed writes, whereas a log system needs exactly the opposite: it must support highly concurrent, low-rate writes and the storage of a large number of small files. Note also that data written directly to HDFS is not visible until the file is closed, and HDFS does not support reopening a file.
(3) Collector and Agent
To overcome the problems described in (2), the agent and collector layers are added.
Role of the agent: to provide various services to the adaptors, including starting and shutting down adaptors, passing their data on to a collector over HTTP, and periodically recording adaptor state so that adaptors can resume where they left off after a crash.
Role of the collector: to merge the data from multiple data sources and then load it into HDFS, hiding the details of the HDFS implementation; for example, when the HDFS version changes, only the collector needs to be modified.
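A minimal sketch of this idea using the standard Hadoop FileSystem API (the class, batching scheme, and paths are illustrative, not Chukwa's actual collector code): many small chunks arriving from agents are written into one large sink file, which becomes visible in HDFS only once it is closed.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CollectorSketch {
    // Merge a batch of small chunks from many agents into a single large HDFS file,
    // so HDFS sees a few big sequential writes instead of a flood of tiny files.
    public static void writeBatch(List<byte[]> chunks, String sinkDir) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path sinkFile = new Path(sinkDir, "sink-" + System.currentTimeMillis() + ".done");

        FSDataOutputStream out = fs.create(sinkFile);
        for (byte[] chunk : chunks) {
            out.write(chunk);
        }
        out.close();   // the data becomes visible to readers only after the file is closed
    }
}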
(4) Demux and archiving
Chukwa directly supports processing data with MapReduce. It contains two MapReduce jobs, which are used respectively to obtain the data and to convert it into structured logs, which are then stored in the data store (a database, HDFS, etc.).
4. LinkedIn's Kafka
Kafka was open-sourced in December 2010. It is written in Scala and uses a variety of efficiency optimizations; its overall architecture is relatively novel (push/pull), which makes it better suited to heterogeneous clusters.
Design goals:
(1) The cost of accessing data on disk is O(1)
(2) High throughput: hundreds of thousands of messages per second on a commodity server
(3) A distributed architecture, with support for partitioning messages
(4) Support for loading data into Hadoop in parallel
Architecture:
Kafka is actually a message publish/subscribe system. A producer publishes messages to a topic, and a consumer subscribes to a topic; once there is a new message on a topic, the broker passes it to all consumers that subscribe to that topic.
In Kafka, messages are organized by topic, and each topic is divided into multiple partitions, which makes it easy to manage data and balance load. Kafka also uses ZooKeeper for load balancing.
There are three main roles in Kafka, namely the producer, the broker, and the consumer.
(1) Producer
The task of the producer is to send data to the broker. Kafka provides two producer interfaces. One is the low-level interface, which sends data to a particular partition of a particular topic on a particular broker; the other is the high-level interface, which supports synchronous/asynchronous sending of data, ZooKeeper-based broker auto-discovery, and load balancing (based on a partitioner).
Among these, ZooKeeper-based broker auto-discovery deserves a closer look. A producer can obtain the list of available brokers through ZooKeeper, and it can also register a listener in ZooKeeper that is woken up in the following situations:
a) A broker is added
b) A broker is removed
c) A new topic is registered
d) A broker registers for an existing topic
When the producer learns of the above events, it can take whatever action is needed.
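The interfaces described above belong to Kafka's original 0.x Scala client. As a rough illustration of the publish model only, here is a minimal sketch using the current Kafka Java client; the topic name, key, and broker address are placeholders:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-host:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one log line to the "search_pv" topic; the client's partitioner maps the
        // key to a partition, and broker discovery is handled by the client itself.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("search_pv", "web01",
                    "2011-01-01 12:00:00 q=hadoop hits=42"));
        }
    }
}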
(2) Broker
The broker adopts a variety of strategies to improve data-processing efficiency, including the sendfile and zero-copy techniques.
(3) Consumer
The role of the consumer is to load the log data into a central storage system. Kafka provides two consumer interfaces. One is the low-level interface: it maintains a connection to a single broker, and the connection is stateless, meaning that every time data is pulled from the broker, the consumer must tell the broker the offset of the data. The other is the high-level interface, which hides the broker details and lets the consumer pull data from the brokers without caring about the network topology. More importantly, in most log systems the broker keeps track of which data a consumer has already fetched, whereas in Kafka this information is maintained by the consumer itself.
5. Cloudera's Flume
Flume is Cloudera's log system, open-sourced in July 2009. It has a wide range of built-in components that users can use with little or no additional development.
Design goals:
(1) Reliability
When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the agent that receives the data first writes the event to disk, deletes it after it has been sent successfully, and resends it if sending fails), store-on-failure (this is also the strategy adopted by Scribe: when the data receiver crashes, the data is written locally, and sending resumes after the receiver recovers), and best-effort (the data is sent to the receiver without any acknowledgment).
(2) Scalability
Flume uses a three-tier architecture, consisting of agent, collector, and storage tiers, each of which can be scaled horizontally. All agents and collectors are managed centrally by a master, which makes the system easy to monitor and maintain, and multiple masters are allowed (with ZooKeeper used for management and load balancing), which avoids a single point of failure.
(3) Manageability
All agents and collectors are managed centrally by the master, which makes the system easy to maintain. On the master, users can view the execution of individual data sources or data flows, and can configure and dynamically load data sources. Flume provides both a web interface and a shell-script command for managing data flows.
(4) Functional extensibility
Users can add their own agents, collectors, or storage back-ends as needed. In addition, Flume comes with many components, including various agents (file, syslog, etc.), collectors, and storage back-ends (file, HDFS, etc.).
Architecture:
As mentioned earlier, Flume uses a layered architecture consisting of three layers: agent, collector, and storage. The agent and the collector are each composed of two parts, a source and a sink: the source is where the data comes from, and the sink is where the data goes.
(1) Agent
The role of the agent is to send data from the data source to the collector. Flume comes with many directly usable data sources (sources), such as:
Text ("filename"): Send file filename as a data source, by row
Tail ("filename"): Detects the new data generated by filename and sends it by line
Fsyslogtcp (5140): Listens to TCP's 5140 port and sends out incoming data
A number of sinks are also provided, such as:
console[("format")]: displays the data directly in the console
text("txtfile"): writes the data to the file txtfile
dfs("dfsfile"): writes the data to the file dfsfile on HDFS
syslogTcp("host", port): passes the data over TCP to the host node
(2) Collector
The role of the collector is to aggregate the data from multiple agents and load it into the storage layer. Its sources and sinks are similar to those of the agent.
In the following example, the agent listens on TCP port 5140 and sends the data it receives to the collector, which loads it into HDFS.
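The original configuration snippet for this example did not survive formatting; a reconstruction in the Flume (OG) data-flow syntax might look roughly as follows. The node names, the intermediate port 35853, the HDFS path, and the source/sink names (syslogTcp, agentSink, collectorSource, collectorSink) are assumptions based on the old Flume user guide:

agent : syslogTcp(5140) | agentSink("collector-host", 35853);
collector : collectorSource(35853) | collectorSink("hdfs://namenode/user/flume/", "syslog");

Here the agent node turns the data received on port 5140 into events and forwards them to the collector node, which writes them into HDFS under the given directory with the "syslog" file prefix.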